* [RFC PATCH 00/21] KVM: TDX huge page support for private memory
@ 2025-04-24  3:00 Yan Zhao
  2025-04-24  3:04 ` [RFC PATCH 01/21] KVM: gmem: Allocate 2M huge page from guest_memfd backend Yan Zhao
                   ` (21 more replies)
  0 siblings, 22 replies; 294+ messages in thread
From: Yan Zhao @ 2025-04-24  3:00 UTC (permalink / raw)
  To: pbonzini, seanjc
  Cc: linux-kernel, kvm, x86, rick.p.edgecombe, dave.hansen,
	kirill.shutemov, tabba, ackerleytng, quic_eberman, michael.roth,
	david, vannapurve, vbabka, jroedel, thomas.lendacky, pgonda,
	zhiquan1.li, fan.du, jun.miao, ira.weiny, chao.p.peng, Yan Zhao

This is an RFC series to support huge pages in TDX. It is an evolution of
the previous patches from Isaku [0]. (Please find the main changes to [0]
in a later section below.)

As the series enabling guest_memfd to support 1GB huge pages with in-place
conversion [1] is still under development, we temporarily based the TDX
work on top of the series from Michael Roth that enables basic 2M
guest_memfd support without in-place conversion [2]. The goal is to have an
early review and discussion of the TDX huge page work (including changes to
the KVM core MMU and the TDX-specific code), which should remain stable,
with only minor adjustments, regardless of the changes coming in
guest_memfd.

The series is currently focused on supporting 2MB huge pages only.

Tip folks: there are some SEAMCALL wrapper changes in this series, but we
still need some discussion on the KVM side to figure out what it needs.
Please feel free to ignore them for now.


Design
======
guest_memfd
-----------
TDX huge page support makes a basic assumption about guest_memfd:
guest_memfd allocates private huge pages whenever the alignment of
GFN/index, the range size, and the consistency of page attributes allow.

Patch 01 (based on [2]) in this RFC acts as glue code to ensure this
assumption is met for TDX. It can be absorbed into any future
guest_memfd series (e.g., future in-place conversion series) in any form.

TDX interacts with guest_memfd through the kvm_gmem_populate() and
kvm_gmem_get_pfn() interfaces, obtaining the allocated page and its order.

The remaining TDX code should remain stable despite future changes in
guest_memfd.
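
For reference, below is a minimal sketch of how a caller consumes these
interfaces, based on the kvm_gmem_get_pfn() signature in patch 01. The
surrounding caller is illustrative only, not actual KVM code:

    kvm_pfn_t pfn;
    struct page *page;
    int max_order, r;

    r = kvm_gmem_get_pfn(kvm, slot, gfn, &pfn, &page, &max_order);
    if (r)
            return r;

    /*
     * max_order >= PMD_ORDER means guest_memfd handed back a 2MB-backed
     * private page; otherwise the fault can only be mapped at 4KB.
     */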


Basic huge page mapping/unmapping
---------------------------------
- TD build time
  This series enforces that all private mappings be 4KB during the TD build
  phase, due to the TDX module's requirement that tdh_mem_page_add(), the
  SEAMCALL for adding private pages during TD build time, only supports 4KB
  mappings. Enforcing 4KB mappings also simplifies the TD build time code,
  by eliminating the need to consider merging or splitting in the mirror
  page table during this phase.
  
  The underlying pages allocated from guest_memfd during the TD build phase
  can still be large, allowing for potential merging into 2MB mappings once
  the TD is running.

- TD runtime
  This series allows a private fault's max_level to be 2MB after the TD is
  running. The KVM core MMU will map/unmap 2MB mappings in the mirror page
  table according to a fault's goal_level, as is done for normal VMs.
  Changes in the mirror page table are then propagated to the S-EPT.

  For transitions from non-present to a huge leaf in the mirror page table,
  the set_external_spte hook is invoked, leading to the execution of
  tdh_mem_page_aug() to install a huge leaf in the S-EPT.
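
  As a condensed illustration of this non-present -> huge-leaf case, the
  set_external_spte path ends up in the tdh_mem_page_aug() call shown in
  patches 02 and 09. A sketch with the variable setup omitted (tdx_level is
  derived from the mirror SPTE's level):

    err = tdh_mem_page_aug(&kvm_tdx->td, gfn_to_gpa(gfn), tdx_level,
                           page, &entry, &level_state);
    if (unlikely(tdx_operand_busy(err)))
            return -EBUSY;
    if (KVM_BUG_ON(err, kvm)) {
            pr_tdx_error_2(TDH_MEM_PAGE_AUG, err, entry, level_state);
            return -EIO;
    }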

  Conversely, during transitions from a huge leaf to non-present, the
  remove_external_spte hook is invoked to execute SEAMCALLs that remove the
  huge leaf from the S-EPT.

  (For transitions from huge leaf to non-leaf, or from non-leaf to huge
   leaf, SPTE splitting/merging will be triggered. More details are in
   later sections.)

- Specify fault max_level
  In the TDP MMU, a fault's max_level is initially set to the 1GB level for
  x86. KVM then updates the fault's max_level by determining the lowest
  order among fault->max_level, the order of the allocated private page,
  and the TDX-specified max_level from the private_max_mapping_level hook.
  For TDX, a private fault's req_level and goal_level ultimately equal the
  fault's max_level, as TDX platforms do not have the NX huge page flaw.
  
  So, if TDX has specific requirements to influence a fault's goal_level
  for private memory (e.g., if it knows an EPT violation is caused by a
  TD's ACCEPT operation, mapping at the ACCEPT's level is preferred), this
  can be achieved either by affecting the initial value of fault->max_level
  or through the private_max_mapping_level hook.

  The former approach requires more changes in the KVM core (e.g., by using
  some bits in the error_code passed to kvm_mmu_page_fault() and having
  KVM check for them). This RFC opts for the latter, simpler method, using
  the private_max_mapping_level hook.
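
  The level selection described above can be summarized with the sketch
  below. private_fault_level() and gmem_order_to_level() are illustrative
  names only, not actual KVM symbols; the real logic is spread across the
  TDP MMU fault path and the private_max_mapping_level hook (shown here via
  its TDX implementation for clarity):

    /* Illustrative only: how a private fault's mapping level is derived. */
    static int private_fault_level(struct kvm *kvm, struct kvm_page_fault *fault,
                                   kvm_pfn_t pfn, int gmem_order)
    {
            int level = fault->max_level;   /* starts at the 1GB level on x86 */

            /* Clamp to the order of the page guest_memfd actually allocated. */
            level = min(level, gmem_order_to_level(gmem_order));

            /* Clamp to what TDX allows: 4KB before RUNNABLE, 2MB after. */
            level = min(level, tdx_gmem_private_max_mapping_level(kvm, pfn));

            /* No NX huge page flaw on TDX: req_level == goal_level == level. */
            return level;
    }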
  

Page splitting (page demotion)
------------------------------
Page splitting occurs in two paths:
(a) with exclusive kvm->mmu_lock, triggered by zapping operations,

    For normal VMs, if zapping a narrow region that would need to split a
    huge page, KVM can simply zap the surrounding GFNs rather than
    splitting a huge page. The pages can then be faulted back in, where KVM
    can handle mapping them at a 4KB level.

    The reason TDX can't use the normal VM solution is that accepted
    private memory, once zapped, cannot easily be re-faulted, since it can
    only be re-faulted as unaccepted. So KVM sometimes has to do the page
    splitting as part of the zapping operations.

    These zapping operations can occur for a few reasons:
    1. VM teardown.
    2. Memslot removal.
    3. Conversion of private pages to shared.
    4. Userspace does a hole punch to guest_memfd for some reason.

    For cases 1 and 2, splitting before zapping is unnecessary because
    either the entire range will be zapped or huge pages do not span
    memslots.

    Cases 3 and 4 require splitting, which is also followed by a backend
    page splitting in guest_memfd.

(b) with shared kvm->mmu_lock, triggered by fault.

    Splitting in this path is not accompanied by a backend page splitting
    (since backend page splitting necessitates a splitting and zapping
     operation in the former path).  It is triggered when KVM finds that a
    non-leaf entry is replacing a huge entry in the fault path, which is
    usually caused by vCPUs' concurrent ACCEPT operations at different
    levels.

    This series simply ignores the splitting request in the fault path to
    avoid unnecessary bounces between levels. The vCPU that performs ACCEPT
    at a lower level will eventually figure out that the page has been
    accepted at a higher level by another vCPU.

    A rare case that could lead to splitting in the fault path is when a TD
    is configured to receive #VE and accesses memory before the ACCEPT
    operation. By the time a vCPU accesses a private GFN, due to the lack
    of any guest preferred level, KVM could create a mapping at 2MB level.
    If the TD then only performs the ACCEPT operation at 4KB level,
    splitting in the fault path will be triggered. However, this is not
    regarded as a typical use case, as a TD usually accepts pages in the
    order 1GB->2MB->4KB. The worst outcome of ignoring the resulting
    splitting request is an endless EPT violation. This would not happen
    for a Linux guest, which does not expect any #VE.

- Splitting for private-to-shared conversion or punch hole
  Splitting a huge mapping requires the allocation of a page table page
  and the corresponding shadow structures, and this memory allocation can
  fail. So, while the zapping operations in the two scenarios have no
  notion of failure, the overall operations do. Therefore, the RFC
  introduces a separate step, kvm_split_boundary_leafs(), to split huge
  mappings ahead of the zapping operation.

  Patches 16-17 implement this change. As noted in the patch log, the
  downside of the current approach is that although
  kvm_split_boundary_leafs() is invoked before kvm_unmap_gfn_range() for
  each GFN range, the entire zapping range may consist of several GFN
  ranges. If an out-of-memory error occurs during the splitting of a GFN
  range, some previous GFN ranges may have been successfully split and
  zapped, even though their page attributes remain unchanged due to the
  splitting failure. This may not be a significant issue, as the user can
  retry the ioctl to split and zap the full range. However, if it becomes
  problematic, further modifications to invoke kvm_unmap_gfn_range() after
  executing kvm_mmu_invalidate_range_add() and kvm_split_boundary_leafs()
  for all GFN ranges could address the problem.
  
  Alternatively, a possible solution could be pre-allocating sufficiently
  large splitting caches at the start of the private-to-shared conversion
  or hole punch process. The downside is that this may allocate more memory
  than necessary and require more code changes.
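
  A simplified sketch of the resulting ordering for one GFN range, using
  the function names from the call stack below (kvm_split_boundary_leafs()
  is introduced by this series, so its exact signature here is assumed):

    ret = kvm_split_boundary_leafs(kvm, range);  /* may fail with -ENOMEM */
    if (ret)
            return ret;  /* fail the ioctl / punch hole; this range not zapped */

    kvm_mmu_unmap_gfn_range(kvm, range);  /* the zap itself cannot fail */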

- The full call stack for huge page splitting

  With exclusive kvm->mmu_lock,
  kvm_vm_set_mem_attributes/kvm_gmem_punch_hole
     |kvm_split_boundary_leafs
     |   |kvm_tdp_mmu_gfn_range_split_boundary
     |       |tdp_mmu_split_boundary_leafs
     |           |tdp_mmu_alloc_sp_for_split
     |           |tdp_mmu_split_huge_page
     |               |tdp_mmu_link_sp
     |                   |tdp_mmu_iter_set_spte
     |                       |tdp_mmu_set_spte
     |                           |split_external_spt
     |                               |kvm_x86_split_external_spt
     |                                   | BLOCK, TRACK, DEMOTION
     |kvm_mmu_unmap_gfn_range

 
  With shared kvm->mmu_lock,
  kvm_tdp_mmu_map
     |tdp_mmu_alloc_sp
     |kvm_mmu_alloc_external_spt
     |tdp_mmu_split_huge_page
         |tdp_mmu_link_sp
             |tdp_mmu_set_spte_atomic
                 |__tdp_mmu_set_spte_atomic
		    |set_external_spte_present
		        |split_external_spt
			    |kvm_x86_split_external_spt


- Handle busy & errors

  Splitting a huge mapping in the S-EPT requires executing
  tdh_mem_range_block(), tdh_mem_track(), kicking off vCPUs, and
  tdh_mem_page_demote(), in sequence.

  Possible errors during the process include TDX_OPERAND_BUSY or
  TDX_INTERRUPTED_RESTARTABLE.

  With exclusive kvm->mmu_lock, TDX_OPERAND_BUSY can be handled similarly
  to removing a private page, i.e., by kicking off all vCPUs and retrying,
  which should succeed on the second attempt.
  
  TDX_INTERRUPTED_RESTARTABLE occurs when there is a pending interrupt on
  the host side during the SEAMCALL tdh_mem_page_demote(). The approach is
  to retry indefinitely in KVM on TDX_INTERRUPTED_RESTARTABLE, because the
  interrupts are host-side only in the current exclusive kvm->mmu_lock
  path.
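
  For instance, the TDX_INTERRUPTED_RESTARTABLE handling can be as simple
  as the sketch below, built around the tdh_mem_page_demote() wrapper from
  patch 03 (the direct comparison against TDX_INTERRUPTED_RESTARTABLE is an
  assumption of this sketch; the BUSY path is only indicated):

    do {
            err = tdh_mem_page_demote(&kvm_tdx->td, gpa, tdx_level, page,
                                      &entry, &level_state);
    } while (err == TDX_INTERRUPTED_RESTARTABLE);

    if (unlikely(tdx_operand_busy(err))) {
            /* Kick all vCPUs out of guest mode and retry once, as above. */
    }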

  
Page merging (page promotion)
-----------------------------
  The RFC disallows page merging in the mirror page table.

  Unlike normal VMs, private memory in TDX requires the guest's ACCEPT
  operation. Therefore, transitioning from a non-leaf entry to a huge leaf
  entry in the S-EPT requires the non-leaf entry to be initially populated
  with small child entries, all in PENDING or ACCEPTED status.
  Subsequently, the merged huge leaf can be set to either PENDING or
  ACCEPTED status.
  
  Therefore, counter-intuitively, converting a partial range (e.g., one
  4KB page) of a 2MB range from private to shared and then converting it
  back to private does not result in a successful page promotion in the
  S-EPT.
  After converting a shared 4KB page back to private:
  a) Linux Guest: Accepts the 4K page prior to accessing memory, prompting
     KVM to map it at the 4KB level, which prevents further EPT violations
     and avoids triggering page promotion.
  b) Non-Linux Guest: May access the page before executing the ACCEPT
     operation. KVM identifies the physical page is 2MB contiguous and maps
     it at 2MB, causing a non-leaf to leaf transition in the mirror page
     table. However, after the preparation step, only 511 child entries in
     the S-EPT are in ACCEPTED status, with 1 newly mapped entry in PENDING
     status. The promotion request to the S-EPT fails due to this mixed
     status. If KVM re-enters the guest and triggers #VE for the guest to
     accept the page, the guest must accept the page at the 4KB level, as
     no 2MB mapping is available. After the ACCEPT operation, no further
     EPT violations occur to trigger page promotion.

  
  So, also to avoid the comprehensive BUSY handling and rollback code that
  would be required under the shared kvm->mmu_lock, this RFC disallows page
  merging in the mirror page table. This should have minimal performance
  impact in practice, as no page merging has been observed so far with a
  real guest, except in the selftests.
 

Patches layout
==============
Patch 01: Glue code to [2].
          It allows kvm_gmem_populate() and kvm_gmem_get_pfn() to get a
          2MB private huge page from guest_memfd whenever GFN/index
          alignment, remaining size, and page attribute layout allow.
          Though this patch may not be needed once guest_memfd supports
          in-place conversion in the future, guest_memfd will need to
          ensure something similar.
Patches 02-03: SEAMCALL changes under x86/virt.
Patches 04-09: Basic private huge page mapping/unmapping.
           04: For build time, no huge pages, forced to 4KB.
        05-07: Enhancements to tdx_clear_page(), tdx_reclaim_page() and
               tdx_wbinvd_page() to handle huge pages.
           08: Inc/dec the folio ref count for huge pages.
               The increase of the private folio ref count should be
               dropped once guest_memfd supports in-place conversion. TDX
               will then only acquire a private folio ref count upon errors
               during the page removal/reclaim stage.
           09: Turn on mapping/unmapping of huge pages for TD runtime.
Patch 10: Disallow page merging in the mirror page table.
Patches 11-12: Allow guest's ACCEPT level to determine page mapping size. 
Patches 13-19: Basic page splitting support (with exclusive kvm->mmu_lock)
           13: Enhance tdp_mmu_alloc_sp_for_split() for the external page
               table.
           14: Add code to propagate splitting requests to the external
               page table in tdp_mmu_set_spte(), which updates SPTEs under
               exclusive kvm->mmu_lock.
           15: TDX's counterpart to patch 14. Implementation of the
               split_external_spt hook.
        16-19: Split private huge pages for private-to-shared conversion
               and punch hole.
Patches 20-21: Ignore page splitting requests under shared kvm->mmu_lock.


Main changes to [0]
===================
- Disallow huge mappings in TD build time.
- Use the private_max_mapping_level hook to convey TDX's mapping level info
  instead of having the KVM MMU core check certain bits in error_code to
  determine a fault's max_level.
- Move tdh_mem_range_block() for page splitting to TDX's implementation of
  hook split_external_spt.
- Do page splitting before tdp_mmu_zap_leafs(). So, instead of a BUG_ON()
  in tdp_mmu_zap_leafs(), an out-of-memory failure during splitting can
  fail the KVM_SET_MEMORY_ATTRIBUTES ioctl or the punch hole.
- Restrict page splitting to be under exclusive kvm->mmu_lock and ignore
  the page splitting under shared kvm->mmu_lock.
- Drop page merging support.


Testing
-------
The series is based on kvm/next.

This patchset is also available at: [3]
It is able to launch TDs with page demotion working correctly. Though page
promotion cannot yet be triggered with a Linux guest, the page promotion
code has been tested and works with a selftest.

The huge mapping count in KVM can be checked at runtime via
/sys/kernel/debug/kvm/pages_2m.
(Though this node includes the huge mapping count for both shared and
private memory, there are currently not many shared huge pages. In the
future, guest_memfd in-place conversion will require all shared pages to be
4KB, so there is no need to expand this interface.)

[0] https://lore.kernel.org/all/cover.1708933624.git.isaku.yamahata@intel.com
[1] https://lore.kernel.org/lkml/cover.1726009989.git.ackerleytng@google.com
[2] https://lore.kernel.org/all/20241212063635.712877-1-michael.roth@amd.com
[3] https://github.com/intel/tdx/tree/huge_page_kvm_next_2025_04_23


Edgecombe, Rick P (1):
  KVM: x86/mmu: Disallow page merging (huge page adjustment) for mirror
    root

Isaku Yamahata (1):
  KVM: x86/tdp_mmu: Alloc external_spt page for mirror page table
    splitting

Xiaoyao Li (5):
  x86/virt/tdx: Add SEAMCALL wrapper tdh_mem_page_demote()
  KVM: TDX: Enhance tdx_clear_page() to support huge pages
  KVM: TDX: Assert the reclaimed pages were mapped as expected
  KVM: TDX: Add a helper for WBINVD on huge pages with TD's keyID
  KVM: TDX: Support huge page splitting with exclusive kvm->mmu_lock

Yan Zhao (14):
  KVM: gmem: Allocate 2M huge page from guest_memfd backend
  x86/virt/tdx: Enhance tdh_mem_page_aug() to support huge pages
  KVM: TDX: Enforce 4KB mapping level during TD build Time
  KVM: TDX: Increase/decrease folio ref for huge pages
  KVM: TDX: Enable 2MB mapping size after TD is RUNNABLE
  KVM: x86: Add "vcpu" "gfn" parameters to x86 hook
    private_max_mapping_level
  KVM: TDX: Determine max mapping level according to vCPU's ACCEPT level
  KVM: x86/tdp_mmu: Invoke split_external_spt hook with exclusive
    mmu_lock
  KVM: x86/mmu: Introduce kvm_split_boundary_leafs() to split boundary
    leafs
  KVM: Change the return type of gfn_handler_t() from bool to int
  KVM: x86: Split huge boundary leafs before private to shared
    conversion
  KVM: gmem: Split huge boundary leafs for punch hole of private memory
  KVM: x86: Force a prefetch fault's max mapping level to 4KB for TDX
  KVM: x86: Ignore splitting huge pages in fault path for TDX

 arch/arm64/kvm/mmu.c               |   4 +-
 arch/loongarch/kvm/mmu.c           |   4 +-
 arch/mips/kvm/mmu.c                |   4 +-
 arch/powerpc/kvm/book3s.c          |   4 +-
 arch/powerpc/kvm/e500_mmu_host.c   |   4 +-
 arch/riscv/kvm/mmu.c               |   4 +-
 arch/x86/include/asm/kvm-x86-ops.h |   1 +
 arch/x86/include/asm/kvm_host.h    |   7 +-
 arch/x86/include/asm/tdx.h         |   2 +
 arch/x86/kvm/mmu/mmu.c             |  67 +++++---
 arch/x86/kvm/mmu/mmu_internal.h    |   2 +-
 arch/x86/kvm/mmu/paging_tmpl.h     |   2 +-
 arch/x86/kvm/mmu/tdp_mmu.c         | 200 +++++++++++++++++++----
 arch/x86/kvm/mmu/tdp_mmu.h         |   1 +
 arch/x86/kvm/svm/sev.c             |   5 +-
 arch/x86/kvm/svm/svm.h             |   5 +-
 arch/x86/kvm/vmx/main.c            |   8 +-
 arch/x86/kvm/vmx/tdx.c             | 244 +++++++++++++++++++++++------
 arch/x86/kvm/vmx/tdx.h             |   4 +
 arch/x86/kvm/vmx/tdx_arch.h        |   3 +
 arch/x86/kvm/vmx/tdx_errno.h       |   1 +
 arch/x86/kvm/vmx/x86_ops.h         |  14 +-
 arch/x86/virt/vmx/tdx/tdx.c        |  31 +++-
 arch/x86/virt/vmx/tdx/tdx.h        |   1 +
 include/linux/kvm_host.h           |  13 +-
 virt/kvm/guest_memfd.c             | 183 ++++++++++------------
 virt/kvm/kvm_main.c                |  38 +++--
 27 files changed, 612 insertions(+), 244 deletions(-)

-- 
2.43.2



* [RFC PATCH 01/21] KVM: gmem: Allocate 2M huge page from guest_memfd backend
  2025-04-24  3:00 [RFC PATCH 00/21] KVM: TDX huge page support for private memory Yan Zhao
@ 2025-04-24  3:04 ` Yan Zhao
  2025-04-24  3:04 ` [RFC PATCH 02/21] x86/virt/tdx: Enhance tdh_mem_page_aug() to support huge pages Yan Zhao
                   ` (20 subsequent siblings)
  21 siblings, 0 replies; 294+ messages in thread
From: Yan Zhao @ 2025-04-24  3:04 UTC (permalink / raw)
  To: pbonzini, seanjc
  Cc: linux-kernel, kvm, x86, rick.p.edgecombe, dave.hansen,
	kirill.shutemov, tabba, ackerleytng, quic_eberman, michael.roth,
	david, vannapurve, vbabka, jroedel, thomas.lendacky, pgonda,
	zhiquan1.li, fan.du, jun.miao, ira.weiny, isaku.yamahata,
	xiaoyao.li, binbin.wu, chao.p.peng, Yan Zhao

Allocate 2M huge pages from the guest_memfd's filemap when the max_order is
greater than or equal to PMD_ORDER.

Introduce a helper function, kvm_gmem_get_max_order(), to assist
kvm_gmem_populate() and kvm_gmem_get_pfn() in obtaining a max_order, based
on the alignment of GFN/index, range size and the consistency of page
attributes.

Pass the max_order to __kvm_gmem_get_pfn(), which invokes
kvm_gmem_get_folio() to allocate a 2M huge page from the gmem filemap if
the max_order is >= PMD_ORDER. __kvm_gmem_get_pfn() then updates the
max_order if the order of the allocated page is smaller than the requested
order.

Note!!
This patch just serves as a glue layer on top of Michael Roth's series [1],
showing TDX's basic assumptions about guest_memfd, i.e.,
guest_memfd allocates private huge pages whenever the alignment of
GFN/index, the range size, and the consistency of page attributes allow it.

As Dave mentioned at [2],
"Probably a good idea to focus on the long-term use case where we
have in-place conversion support, and only allow truncation in hugepage
(e.g., 2 MiB) size; conversion shared<->private could still be done on 4
KiB granularity as for hugetlb.",
"In general, I think our time is better spent
working on the real deal than on interim solutions that should not be
called "THP support",

Please don't spend much time reviewing this patch, as it will probably be
gone or appear in another form once guest_memfd's hugetlb-based solution
for in-place conversion is available.

Link: https://lore.kernel.org/all/20241212063635.712877-1-michael.roth@amd.com [1]
Link: https://lore.kernel.org/all/7c86c45c-17e4-4e9b-8d80-44fdfd37f38b@redhat.com [2]

Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
---
 virt/kvm/guest_memfd.c | 153 +++++++++++++++--------------------------
 1 file changed, 56 insertions(+), 97 deletions(-)

diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index 5cd3b66063dc..4bb140e7f30d 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -265,36 +265,6 @@ static int kvm_gmem_prepare_folio(struct kvm *kvm, struct file *file,
 	return r;
 }
 
-static struct folio *kvm_gmem_get_huge_folio(struct inode *inode, pgoff_t index,
-					     unsigned int order)
-{
-	pgoff_t npages = 1UL << order;
-	pgoff_t huge_index = round_down(index, npages);
-	struct address_space *mapping  = inode->i_mapping;
-	gfp_t gfp = mapping_gfp_mask(mapping) | __GFP_NOWARN;
-	loff_t size = i_size_read(inode);
-	struct folio *folio;
-
-	/* Make sure hugepages would be fully-contained by inode */
-	if ((huge_index + npages) * PAGE_SIZE > size)
-		return NULL;
-
-	if (filemap_range_has_page(mapping, (loff_t)huge_index << PAGE_SHIFT,
-				   (loff_t)(huge_index + npages - 1) << PAGE_SHIFT))
-		return NULL;
-
-	folio = filemap_alloc_folio(gfp, order);
-	if (!folio)
-		return NULL;
-
-	if (filemap_add_folio(mapping, folio, huge_index, gfp)) {
-		folio_put(folio);
-		return NULL;
-	}
-
-	return folio;
-}
-
 /*
  * Returns a locked folio on success.  The caller is responsible for
  * setting the up-to-date flag before the memory is mapped into the guest.
@@ -304,14 +274,19 @@ static struct folio *kvm_gmem_get_huge_folio(struct inode *inode, pgoff_t index,
  * Ignore accessed, referenced, and dirty flags.  The memory is
  * unevictable and there is no storage to write back to.
  */
-static struct folio *kvm_gmem_get_folio(struct inode *inode, pgoff_t index)
+static struct folio *kvm_gmem_get_folio(struct inode *inode, pgoff_t index, int max_order)
 {
 	struct folio *folio = NULL;
 
-	if (gmem_2m_enabled)
-		folio = kvm_gmem_get_huge_folio(inode, index, PMD_ORDER);
+	if (max_order >= PMD_ORDER) {
+		fgf_t fgp_flags = FGP_LOCK | FGP_ACCESSED | FGP_CREAT;
 
-	if (!folio)
+		fgp_flags |= fgf_set_order(1U << (PAGE_SHIFT + PMD_ORDER));
+		folio = __filemap_get_folio(inode->i_mapping, index, fgp_flags,
+					    mapping_gfp_mask(inode->i_mapping));
+	}
+
+	if (!folio || IS_ERR(folio))
 		folio = filemap_grab_folio(inode->i_mapping, index);
 
 	return folio;
@@ -402,49 +377,11 @@ static long kvm_gmem_punch_hole(struct inode *inode, loff_t offset, loff_t len)
 
 static long kvm_gmem_allocate(struct inode *inode, loff_t offset, loff_t len)
 {
-	struct address_space *mapping = inode->i_mapping;
-	pgoff_t start, index, end;
-	int r;
-
-	/* Dedicated guest is immutable by default. */
-	if (offset + len > i_size_read(inode))
-		return -EINVAL;
-
-	filemap_invalidate_lock_shared(mapping);
-
-	start = offset >> PAGE_SHIFT;
-	end = (offset + len) >> PAGE_SHIFT;
-
-	r = 0;
-	for (index = start; index < end; ) {
-		struct folio *folio;
-
-		if (signal_pending(current)) {
-			r = -EINTR;
-			break;
-		}
-
-		folio = kvm_gmem_get_folio(inode, index);
-		if (IS_ERR(folio)) {
-			r = PTR_ERR(folio);
-			break;
-		}
-
-		index = folio_next_index(folio);
-
-		folio_unlock(folio);
-		folio_put(folio);
-
-		/* 64-bit only, wrapping the index should be impossible. */
-		if (WARN_ON_ONCE(!index))
-			break;
-
-		cond_resched();
-	}
-
-	filemap_invalidate_unlock_shared(mapping);
-
-	return r;
+	/*
+	 * Skip supporting allocate for now. This can be added easiler after
+	 * __kvm_gmem_get_pfn() is settled down.
+	 */
+	return -EOPNOTSUPP;
 }
 
 static long kvm_gmem_fallocate(struct file *file, int mode, loff_t offset,
@@ -853,7 +790,7 @@ static struct folio *__kvm_gmem_get_pfn(struct file *file,
 	    huge_index + (1ull << *max_order) > slot->gmem.pgoff + slot->npages)
 		*max_order = 0;
 
-	folio = kvm_gmem_get_folio(file_inode(file), index);
+	folio = kvm_gmem_get_folio(file_inode(file), index, *max_order);
 	if (IS_ERR(folio))
 		return folio;
 
@@ -869,6 +806,40 @@ static struct folio *__kvm_gmem_get_pfn(struct file *file,
 	return folio;
 }
 
+static int kvm_gmem_get_max_order(struct kvm *kvm, struct kvm_memory_slot *slot,
+				  gfn_t gfn, pgoff_t index, long npages, int *max_order)
+{
+	int ret = 0;
+	int order = 0;
+	/*
+	 * The max order shouldn't extend beyond the GFN range being
+	 * populated in this iteration, so set max_order accordingly.
+	 * __kvm_gmem_get_pfn() will then further adjust the order to
+	 * one that is contained by the backing memslot/folio.
+	 */
+	order = 0;
+
+	while (IS_ALIGNED(gfn, 1 << (order + 1)) && (npages >= (1 << (order + 1))))
+		order++;
+
+	order = min(order, PMD_ORDER);
+
+	while (!kvm_range_has_memory_attributes(kvm, gfn, gfn + (1 << order),
+						KVM_MEMORY_ATTRIBUTE_PRIVATE,
+						KVM_MEMORY_ATTRIBUTE_PRIVATE)) {
+		if (!order) {
+			ret = -ENOENT;
+			return ret;
+		}
+		order--;
+	}
+
+	WARN_ON(!IS_ALIGNED(index, 1 << order) || (npages < (1 << order)));
+
+	*max_order = order;
+	return ret;
+}
+
 int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
 		     gfn_t gfn, kvm_pfn_t *pfn, struct page **page,
 		     int *max_order)
@@ -882,15 +853,8 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
 	if (!file)
 		return -EFAULT;
 
-	/*
-	 * The caller might pass a NULL 'max_order', but internally this
-	 * function needs to be aware of any order limitations set by
-	 * __kvm_gmem_get_pfn() so the scope of preparation operations can
-	 * be limited to the corresponding range. The initial order can be
-	 * arbitrarily large, but gmem doesn't currently support anything
-	 * greater than PMD_ORDER so use that for now.
-	 */
-	max_order_local = PMD_ORDER;
+	kvm_gmem_get_max_order(kvm, slot, gfn, index, slot->npages - (gfn - slot->base_gfn),
+			       &max_order_local);
 
 	folio = __kvm_gmem_get_pfn(file, slot, index, pfn, &max_order_local);
 	if (IS_ERR(folio)) {
@@ -953,6 +917,11 @@ long kvm_gmem_populate(struct kvm *kvm, gfn_t start_gfn, void __user *src, long
 			break;
 		}
 
+		ret = kvm_gmem_get_max_order(kvm, slot, gfn, kvm_gmem_get_index(slot, gfn),
+					     npages - i, &max_order);
+		if (ret)
+			break;
+
 		folio = __kvm_gmem_get_pfn(file, slot, index, &pfn, &max_order);
 		if (IS_ERR(folio)) {
 			ret = PTR_ERR(folio);
@@ -967,17 +936,8 @@ long kvm_gmem_populate(struct kvm *kvm, gfn_t start_gfn, void __user *src, long
 		}
 
 		folio_unlock(folio);
-		WARN_ON(!IS_ALIGNED(gfn, 1 << max_order) ||
-			(npages - i) < (1 << max_order));
 
 		ret = -EINVAL;
-		while (!kvm_range_has_memory_attributes(kvm, gfn, gfn + (1 << max_order),
-							KVM_MEMORY_ATTRIBUTE_PRIVATE,
-							KVM_MEMORY_ATTRIBUTE_PRIVATE)) {
-			if (!max_order)
-				goto put_folio_and_exit;
-			max_order--;
-		}
 
 		p = src ? src + i * PAGE_SIZE : NULL;
 		ret = post_populate(kvm, gfn, pfn, p, max_order, opaque);
@@ -986,7 +946,6 @@ long kvm_gmem_populate(struct kvm *kvm, gfn_t start_gfn, void __user *src, long
 			kvm_gmem_mark_prepared(file, index, max_order);
 		}
 
-put_folio_and_exit:
 		folio_put(folio);
 		if (ret)
 			break;
-- 
2.43.2



* [RFC PATCH 02/21] x86/virt/tdx: Enhance tdh_mem_page_aug() to support huge pages
  2025-04-24  3:00 [RFC PATCH 00/21] KVM: TDX huge page support for private memory Yan Zhao
  2025-04-24  3:04 ` [RFC PATCH 01/21] KVM: gmem: Allocate 2M huge page from guest_memfd backend Yan Zhao
@ 2025-04-24  3:04 ` Yan Zhao
  2025-04-24  7:48   ` Kirill A. Shutemov
                     ` (4 more replies)
  2025-04-24  3:04 ` [RFC PATCH 03/21] x86/virt/tdx: Add SEAMCALL wrapper tdh_mem_page_demote() Yan Zhao
                   ` (19 subsequent siblings)
  21 siblings, 5 replies; 294+ messages in thread
From: Yan Zhao @ 2025-04-24  3:04 UTC (permalink / raw)
  To: pbonzini, seanjc
  Cc: linux-kernel, kvm, x86, rick.p.edgecombe, dave.hansen,
	kirill.shutemov, tabba, ackerleytng, quic_eberman, michael.roth,
	david, vannapurve, vbabka, jroedel, thomas.lendacky, pgonda,
	zhiquan1.li, fan.du, jun.miao, ira.weiny, isaku.yamahata,
	xiaoyao.li, binbin.wu, chao.p.peng, Yan Zhao

Enhance the SEAMCALL wrapper tdh_mem_page_aug() to support huge pages.

Verify the validity of the level and ensure that the mapping range is fully
contained within the page folio.

As a conservative solution, perform CLFLUSH on all pages to be mapped into
the TD before invoking the SEAMCALL TDH_MEM_PAGE_AUG. This ensures that
dirty cache lines are not written back later, clobbering TD memory.

Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
---
 arch/x86/virt/vmx/tdx/tdx.c | 11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index f5e2a937c1e7..a66d501b5677 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -1595,9 +1595,18 @@ u64 tdh_mem_page_aug(struct tdx_td *td, u64 gpa, int level, struct page *page, u
 		.rdx = tdx_tdr_pa(td),
 		.r8 = page_to_phys(page),
 	};
+	unsigned long nr_pages = 1 << (level * 9);
+	struct folio *folio = page_folio(page);
+	unsigned long idx = 0;
 	u64 ret;
 
-	tdx_clflush_page(page);
+	if (!(level >= TDX_PS_4K && level < TDX_PS_NR) ||
+	    (folio_page_idx(folio, page) + nr_pages > folio_nr_pages(folio)))
+		return -EINVAL;
+
+	while (nr_pages--)
+		tdx_clflush_page(nth_page(page, idx++));
+
 	ret = seamcall_ret(TDH_MEM_PAGE_AUG, &args);
 
 	*ext_err1 = args.rcx;
-- 
2.43.2



* [RFC PATCH 03/21] x86/virt/tdx: Add SEAMCALL wrapper tdh_mem_page_demote()
  2025-04-24  3:00 [RFC PATCH 00/21] KVM: TDX huge page support for private memory Yan Zhao
  2025-04-24  3:04 ` [RFC PATCH 01/21] KVM: gmem: Allocate 2M huge page from guest_memfd backend Yan Zhao
  2025-04-24  3:04 ` [RFC PATCH 02/21] x86/virt/tdx: Enhance tdh_mem_page_aug() to support huge pages Yan Zhao
@ 2025-04-24  3:04 ` Yan Zhao
  2025-04-25  7:12   ` Binbin Wu
  2025-05-13 18:19   ` Edgecombe, Rick P
  2025-04-24  3:05 ` [RFC PATCH 04/21] KVM: TDX: Enforce 4KB mapping level during TD build Time Yan Zhao
                   ` (18 subsequent siblings)
  21 siblings, 2 replies; 294+ messages in thread
From: Yan Zhao @ 2025-04-24  3:04 UTC (permalink / raw)
  To: pbonzini, seanjc
  Cc: linux-kernel, kvm, x86, rick.p.edgecombe, dave.hansen,
	kirill.shutemov, tabba, ackerleytng, quic_eberman, michael.roth,
	david, vannapurve, vbabka, jroedel, thomas.lendacky, pgonda,
	zhiquan1.li, fan.du, jun.miao, ira.weiny, isaku.yamahata,
	xiaoyao.li, binbin.wu, chao.p.peng, Yan Zhao

From: Xiaoyao Li <xiaoyao.li@intel.com>

Add a wrapper tdh_mem_page_demote() to invoke SEAMCALL TDH_MEM_PAGE_DEMOTE
to demote a huge leaf entry to a non-leaf entry in S-EPT. Currently, the
TDX module only supports demotion of a 2M huge leaf entry. After a
successful demotion, the old 2M huge leaf entry in S-EPT is replaced with a
non-leaf entry linking to the newly-added page table page. The newly
linked page table page contains 512 leaf entries, pointing to the 4KB
pages of the original 2M guest private page.

The "gpa" and "level" direct the TDX module to search and find the old
huge leaf entry.

As the new non-leaf entry points to a page table page, callers need to
pass in the page table page in parameter "page".

In case of S-EPT walk failure, the entry, level and state where the error
was detected are returned in ext_err1 and ext_err2.

On interrupt pending, SEAMCALL TDH_MEM_PAGE_DEMOTE returns error
TDX_INTERRUPTED_RESTARTABLE.

[Yan: Rebased and split patch, wrote changelog]

Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
---
 arch/x86/include/asm/tdx.h  |  2 ++
 arch/x86/virt/vmx/tdx/tdx.c | 20 ++++++++++++++++++++
 arch/x86/virt/vmx/tdx/tdx.h |  1 +
 3 files changed, 23 insertions(+)

diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index 26ffc792e673..08eff4b2f5e7 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -177,6 +177,8 @@ u64 tdh_mng_key_config(struct tdx_td *td);
 u64 tdh_mng_create(struct tdx_td *td, u16 hkid);
 u64 tdh_vp_create(struct tdx_td *td, struct tdx_vp *vp);
 u64 tdh_mng_rd(struct tdx_td *td, u64 field, u64 *data);
+u64 tdh_mem_page_demote(struct tdx_td *td, u64 gpa, int level, struct page *page,
+			u64 *ext_err1, u64 *ext_err2);
 u64 tdh_mr_extend(struct tdx_td *td, u64 gpa, u64 *ext_err1, u64 *ext_err2);
 u64 tdh_mr_finalize(struct tdx_td *td);
 u64 tdh_vp_flush(struct tdx_vp *vp);
diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index a66d501b5677..5699dfe500d9 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -1684,6 +1684,26 @@ u64 tdh_mng_rd(struct tdx_td *td, u64 field, u64 *data)
 }
 EXPORT_SYMBOL_GPL(tdh_mng_rd);
 
+u64 tdh_mem_page_demote(struct tdx_td *td, u64 gpa, int level, struct page *page,
+			u64 *ext_err1, u64 *ext_err2)
+{
+	struct tdx_module_args args = {
+		.rcx = gpa | level,
+		.rdx = tdx_tdr_pa(td),
+		.r8 = page_to_phys(page),
+	};
+	u64 ret;
+
+	tdx_clflush_page(page);
+	ret = seamcall_ret(TDH_MEM_PAGE_DEMOTE, &args);
+
+	*ext_err1 = args.rcx;
+	*ext_err2 = args.rdx;
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(tdh_mem_page_demote);
+
 u64 tdh_mr_extend(struct tdx_td *td, u64 gpa, u64 *ext_err1, u64 *ext_err2)
 {
 	struct tdx_module_args args = {
diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h
index 82bb82be8567..b4dc6b86d40a 100644
--- a/arch/x86/virt/vmx/tdx/tdx.h
+++ b/arch/x86/virt/vmx/tdx/tdx.h
@@ -24,6 +24,7 @@
 #define TDH_MNG_KEY_CONFIG		8
 #define TDH_MNG_CREATE			9
 #define TDH_MNG_RD			11
+#define TDH_MEM_PAGE_DEMOTE		15
 #define TDH_MR_EXTEND			16
 #define TDH_MR_FINALIZE			17
 #define TDH_VP_FLUSH			18
-- 
2.43.2



* [RFC PATCH 04/21] KVM: TDX: Enforce 4KB mapping level during TD build Time
  2025-04-24  3:00 [RFC PATCH 00/21] KVM: TDX huge page support for private memory Yan Zhao
                   ` (2 preceding siblings ...)
  2025-04-24  3:04 ` [RFC PATCH 03/21] x86/virt/tdx: Add SEAMCALL wrapper tdh_mem_page_demote() Yan Zhao
@ 2025-04-24  3:05 ` Yan Zhao
  2025-04-24  7:55   ` Kirill A. Shutemov
  2025-05-13 19:12   ` Edgecombe, Rick P
  2025-04-24  3:05 ` [RFC PATCH 05/21] KVM: TDX: Enhance tdx_clear_page() to support huge pages Yan Zhao
                   ` (17 subsequent siblings)
  21 siblings, 2 replies; 294+ messages in thread
From: Yan Zhao @ 2025-04-24  3:05 UTC (permalink / raw)
  To: pbonzini, seanjc
  Cc: linux-kernel, kvm, x86, rick.p.edgecombe, dave.hansen,
	kirill.shutemov, tabba, ackerleytng, quic_eberman, michael.roth,
	david, vannapurve, vbabka, jroedel, thomas.lendacky, pgonda,
	zhiquan1.li, fan.du, jun.miao, ira.weiny, isaku.yamahata,
	xiaoyao.li, binbin.wu, chao.p.peng, Yan Zhao

During the TD build phase (i.e., before the TD becomes RUNNABLE), enforce a
4KB mapping level both in the S-EPT managed by the TDX module and the
mirror page table managed by KVM.

During this phase, TD's memory is added via tdh_mem_page_add(), which only
accepts 4KB granularity. Therefore, return PG_LEVEL_4K in TDX's
.private_max_mapping_level hook to ensure KVM maps at the 4KB level in the
mirror page table. Meanwhile, iterate over each 4KB page of a large gmem
backend page in tdx_gmem_post_populate() and invoke tdh_mem_page_add() to
map at the 4KB level in the S-EPT.

Still allow huge pages in the gmem backend during TD build time. Based on
[1], the gmem series that allows 2MB THP and non-in-place conversion, pass
region.nr_pages to kvm_gmem_populate() in tdx_vcpu_init_mem_region(). This
enables kvm_gmem_populate() to allocate huge pages from the gmem backend
when the remaining nr_pages, GFN alignment, and page private/shared
attributes permit. KVM is then able to promote the initial 4KB mappings to
huge mappings after the TD is RUNNABLE.

Disallow any private huge pages during TD build time. Use BUG_ON() in
tdx_mem_page_record_premap_cnt() and tdx_is_sept_zap_err_due_to_premap() to
assert the mapping level is 4KB.

Opportunistically, remove unused parameters in
tdx_mem_page_record_premap_cnt().

Link: https://lore.kernel.org/all/20241212063635.712877-1-michael.roth@amd.com [1]
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
---
 arch/x86/kvm/vmx/tdx.c | 45 ++++++++++++++++++++++++++++--------------
 1 file changed, 30 insertions(+), 15 deletions(-)

diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 98cde20f14da..03885cb2869b 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1530,14 +1530,16 @@ static int tdx_mem_page_aug(struct kvm *kvm, gfn_t gfn,
  * The counter has to be zero on KVM_TDX_FINALIZE_VM, to ensure that there
  * are no half-initialized shared EPT pages.
  */
-static int tdx_mem_page_record_premap_cnt(struct kvm *kvm, gfn_t gfn,
-					  enum pg_level level, kvm_pfn_t pfn)
+static int tdx_mem_page_record_premap_cnt(struct kvm *kvm, enum pg_level level)
 {
 	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
 
 	if (KVM_BUG_ON(kvm->arch.pre_fault_allowed, kvm))
 		return -EINVAL;
 
+	if (KVM_BUG_ON(level != PG_LEVEL_4K, kvm))
+		return -EINVAL;
+
 	/* nr_premapped will be decreased when tdh_mem_page_add() is called. */
 	atomic64_inc(&kvm_tdx->nr_premapped);
 	return 0;
@@ -1571,7 +1573,7 @@ int tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn,
 	if (likely(kvm_tdx->state == TD_STATE_RUNNABLE))
 		return tdx_mem_page_aug(kvm, gfn, level, page);
 
-	return tdx_mem_page_record_premap_cnt(kvm, gfn, level, pfn);
+	return tdx_mem_page_record_premap_cnt(kvm, level);
 }
 
 static int tdx_sept_drop_private_spte(struct kvm *kvm, gfn_t gfn,
@@ -1666,7 +1668,7 @@ int tdx_sept_link_private_spt(struct kvm *kvm, gfn_t gfn,
 static int tdx_is_sept_zap_err_due_to_premap(struct kvm_tdx *kvm_tdx, u64 err,
 					     u64 entry, int level)
 {
-	if (!err || kvm_tdx->state == TD_STATE_RUNNABLE)
+	if (!err || kvm_tdx->state == TD_STATE_RUNNABLE || level > PG_LEVEL_4K)
 		return false;
 
 	if (err != (TDX_EPT_ENTRY_STATE_INCORRECT | TDX_OPERAND_ID_RCX))
@@ -3052,8 +3054,8 @@ struct tdx_gmem_post_populate_arg {
 	__u32 flags;
 };
 
-static int tdx_gmem_post_populate(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
-				  void __user *src, int order, void *_arg)
+static int tdx_gmem_post_populate_4k(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
+				     void __user *src, void *_arg)
 {
 	u64 error_code = PFERR_GUEST_FINAL_MASK | PFERR_PRIVATE_ACCESS;
 	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
@@ -3120,6 +3122,21 @@ static int tdx_gmem_post_populate(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
 	return ret;
 }
 
+static int tdx_gmem_post_populate(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
+				  void __user *src, int order, void *_arg)
+{
+	unsigned long i, npages = 1 << order;
+	int ret;
+
+	for (i = 0; i < npages; i++) {
+		ret = tdx_gmem_post_populate_4k(kvm, gfn + i, pfn + i,
+						src + i * PAGE_SIZE, _arg);
+		if (ret)
+			return ret;
+	}
+	return 0;
+}
+
 static int tdx_vcpu_init_mem_region(struct kvm_vcpu *vcpu, struct kvm_tdx_cmd *cmd)
 {
 	struct vcpu_tdx *tdx = to_tdx(vcpu);
@@ -3166,20 +3183,15 @@ static int tdx_vcpu_init_mem_region(struct kvm_vcpu *vcpu, struct kvm_tdx_cmd *c
 		};
 		gmem_ret = kvm_gmem_populate(kvm, gpa_to_gfn(region.gpa),
 					     u64_to_user_ptr(region.source_addr),
-					     1, tdx_gmem_post_populate, &arg);
+					     region.nr_pages, tdx_gmem_post_populate, &arg);
 		if (gmem_ret < 0) {
 			ret = gmem_ret;
 			break;
 		}
 
-		if (gmem_ret != 1) {
-			ret = -EIO;
-			break;
-		}
-
-		region.source_addr += PAGE_SIZE;
-		region.gpa += PAGE_SIZE;
-		region.nr_pages--;
+		region.source_addr += PAGE_SIZE * gmem_ret;
+		region.gpa += PAGE_SIZE * gmem_ret;
+		region.nr_pages -= gmem_ret;
 
 		cond_resched();
 	}
@@ -3224,6 +3236,9 @@ int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp)
 
 int tdx_gmem_private_max_mapping_level(struct kvm *kvm, kvm_pfn_t pfn)
 {
+	if (unlikely(to_kvm_tdx(kvm)->state != TD_STATE_RUNNABLE))
+		return PG_LEVEL_4K;
+
 	return PG_LEVEL_4K;
 }
 
-- 
2.43.2



* [RFC PATCH 05/21] KVM: TDX: Enhance tdx_clear_page() to support huge pages
  2025-04-24  3:00 [RFC PATCH 00/21] KVM: TDX huge page support for private memory Yan Zhao
                   ` (3 preceding siblings ...)
  2025-04-24  3:05 ` [RFC PATCH 04/21] KVM: TDX: Enforce 4KB mapping level during TD build Time Yan Zhao
@ 2025-04-24  3:05 ` Yan Zhao
  2025-05-13 19:17   ` Edgecombe, Rick P
  2025-04-24  3:05 ` [RFC PATCH 06/21] KVM: TDX: Assert the reclaimed pages were mapped as expected Yan Zhao
                   ` (16 subsequent siblings)
  21 siblings, 1 reply; 294+ messages in thread
From: Yan Zhao @ 2025-04-24  3:05 UTC (permalink / raw)
  To: pbonzini, seanjc
  Cc: linux-kernel, kvm, x86, rick.p.edgecombe, dave.hansen,
	kirill.shutemov, tabba, ackerleytng, quic_eberman, michael.roth,
	david, vannapurve, vbabka, jroedel, thomas.lendacky, pgonda,
	zhiquan1.li, fan.du, jun.miao, ira.weiny, isaku.yamahata,
	xiaoyao.li, binbin.wu, chao.p.peng, Yan Zhao

From: Xiaoyao Li <xiaoyao.li@intel.com>

KVM invokes tdx_clear_page() to zero pages using movdir64b().
Include level information to enable tdx_clear_page() to zero a huge page.

[Yan: split out, let tdx_clear_page() accept level]

Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
---
 arch/x86/kvm/vmx/tdx.c | 19 ++++++++++++++-----
 1 file changed, 14 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 03885cb2869b..1186085795ac 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -276,7 +276,7 @@ static inline void tdx_disassociate_vp(struct kvm_vcpu *vcpu)
 	vcpu->cpu = -1;
 }
 
-static void tdx_clear_page(struct page *page)
+static void __tdx_clear_page(struct page *page)
 {
 	const void *zero_page = (const void *) page_to_virt(ZERO_PAGE(0));
 	void *dest = page_to_virt(page);
@@ -295,6 +295,15 @@ static void tdx_clear_page(struct page *page)
 	__mb();
 }
 
+static void tdx_clear_page(struct page *page, int level)
+{
+	unsigned long nr = KVM_PAGES_PER_HPAGE(level);
+	unsigned long idx = 0;
+
+	while (nr--)
+		__tdx_clear_page(nth_page(page, idx++));
+}
+
 static void tdx_no_vcpus_enter_start(struct kvm *kvm)
 {
 	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
@@ -340,11 +349,10 @@ static int tdx_reclaim_page(struct page *page)
 
 	r = __tdx_reclaim_page(page);
 	if (!r)
-		tdx_clear_page(page);
+		tdx_clear_page(page, PG_LEVEL_4K);
 	return r;
 }
 
-
 /*
  * Reclaim the TD control page(s) which are crypto-protected by TDX guest's
  * private KeyID.  Assume the cache associated with the TDX private KeyID has
@@ -588,7 +596,7 @@ static void tdx_reclaim_td_control_pages(struct kvm *kvm)
 		pr_tdx_error(TDH_PHYMEM_PAGE_WBINVD, err);
 		return;
 	}
-	tdx_clear_page(kvm_tdx->td.tdr_page);
+	tdx_clear_page(kvm_tdx->td.tdr_page, PG_LEVEL_4K);
 
 	__free_page(kvm_tdx->td.tdr_page);
 	kvm_tdx->td.tdr_page = NULL;
@@ -1621,7 +1629,8 @@ static int tdx_sept_drop_private_spte(struct kvm *kvm, gfn_t gfn,
 		pr_tdx_error(TDH_PHYMEM_PAGE_WBINVD, err);
 		return -EIO;
 	}
-	tdx_clear_page(page);
+
+	tdx_clear_page(page, level);
 	tdx_unpin(kvm, page);
 	return 0;
 }
-- 
2.43.2



* [RFC PATCH 06/21] KVM: TDX: Assert the reclaimed pages were mapped as expected
  2025-04-24  3:00 [RFC PATCH 00/21] KVM: TDX huge page support for private memory Yan Zhao
                   ` (4 preceding siblings ...)
  2025-04-24  3:05 ` [RFC PATCH 05/21] KVM: TDX: Enhance tdx_clear_page() to support huge pages Yan Zhao
@ 2025-04-24  3:05 ` Yan Zhao
  2025-05-13 19:25   ` Edgecombe, Rick P
  2025-04-24  3:05 ` [RFC PATCH 07/21] KVM: TDX: Add a helper for WBINVD on huge pages with TD's keyID Yan Zhao
                   ` (15 subsequent siblings)
  21 siblings, 1 reply; 294+ messages in thread
From: Yan Zhao @ 2025-04-24  3:05 UTC (permalink / raw)
  To: pbonzini, seanjc
  Cc: linux-kernel, kvm, x86, rick.p.edgecombe, dave.hansen,
	kirill.shutemov, tabba, ackerleytng, quic_eberman, michael.roth,
	david, vannapurve, vbabka, jroedel, thomas.lendacky, pgonda,
	zhiquan1.li, fan.du, jun.miao, ira.weiny, isaku.yamahata,
	xiaoyao.li, binbin.wu, chao.p.peng, Yan Zhao

From: Xiaoyao Li <xiaoyao.li@intel.com>

Provide level information to tdx_reclaim_page() to enable it to verify that
the reclaimed pages were mapped at the expected level in the S-EPT.

[Yan: split patch, wrote patch log]

Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
---
 arch/x86/kvm/vmx/tdx.c | 16 +++++++++-------
 1 file changed, 9 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 1186085795ac..69f3140928b5 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -325,7 +325,7 @@ static void tdx_no_vcpus_enter_stop(struct kvm *kvm)
 }
 
 /* TDH.PHYMEM.PAGE.RECLAIM is allowed only when destroying the TD. */
-static int __tdx_reclaim_page(struct page *page)
+static int __tdx_reclaim_page(struct page *page, int level)
 {
 	u64 err, tdx_pt, tdx_owner, tdx_size;
 
@@ -340,16 +340,18 @@ static int __tdx_reclaim_page(struct page *page)
 		pr_tdx_error_3(TDH_PHYMEM_PAGE_RECLAIM, err, tdx_pt, tdx_owner, tdx_size);
 		return -EIO;
 	}
+
+	WARN_ON_ONCE(tdx_size != pg_level_to_tdx_sept_level(level));
 	return 0;
 }
 
-static int tdx_reclaim_page(struct page *page)
+static int tdx_reclaim_page(struct page *page, int level)
 {
 	int r;
 
-	r = __tdx_reclaim_page(page);
+	r = __tdx_reclaim_page(page, level);
 	if (!r)
-		tdx_clear_page(page, PG_LEVEL_4K);
+		tdx_clear_page(page, level);
 	return r;
 }
 
@@ -364,7 +366,7 @@ static void tdx_reclaim_control_page(struct page *ctrl_page)
 	 * Leak the page if the kernel failed to reclaim the page.
 	 * The kernel cannot use it safely anymore.
 	 */
-	if (tdx_reclaim_page(ctrl_page))
+	if (tdx_reclaim_page(ctrl_page, PG_LEVEL_4K))
 		return;
 
 	__free_page(ctrl_page);
@@ -583,7 +585,7 @@ static void tdx_reclaim_td_control_pages(struct kvm *kvm)
 	if (!kvm_tdx->td.tdr_page)
 		return;
 
-	if (__tdx_reclaim_page(kvm_tdx->td.tdr_page))
+	if (__tdx_reclaim_page(kvm_tdx->td.tdr_page, PG_LEVEL_4K))
 		return;
 
 	/*
@@ -1791,7 +1793,7 @@ int tdx_sept_free_private_spt(struct kvm *kvm, gfn_t gfn,
 	 * The HKID assigned to this TD was already freed and cache was
 	 * already flushed. We don't have to flush again.
 	 */
-	return tdx_reclaim_page(virt_to_page(private_spt));
+	return tdx_reclaim_page(virt_to_page(private_spt), PG_LEVEL_4K);
 }
 
 int tdx_sept_remove_private_spte(struct kvm *kvm, gfn_t gfn,
-- 
2.43.2



* [RFC PATCH 07/21] KVM: TDX: Add a helper for WBINVD on huge pages with TD's keyID
  2025-04-24  3:00 [RFC PATCH 00/21] KVM: TDX huge page support for private memory Yan Zhao
                   ` (5 preceding siblings ...)
  2025-04-24  3:05 ` [RFC PATCH 06/21] KVM: TDX: Assert the reclaimed pages were mapped as expected Yan Zhao
@ 2025-04-24  3:05 ` Yan Zhao
  2025-05-06  8:37   ` Binbin Wu
  2025-05-13 19:29   ` Edgecombe, Rick P
  2025-04-24  3:06 ` [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge pages Yan Zhao
                   ` (14 subsequent siblings)
  21 siblings, 2 replies; 294+ messages in thread
From: Yan Zhao @ 2025-04-24  3:05 UTC (permalink / raw)
  To: pbonzini, seanjc
  Cc: linux-kernel, kvm, x86, rick.p.edgecombe, dave.hansen,
	kirill.shutemov, tabba, ackerleytng, quic_eberman, michael.roth,
	david, vannapurve, vbabka, jroedel, thomas.lendacky, pgonda,
	zhiquan1.li, fan.du, jun.miao, ira.weiny, isaku.yamahata,
	xiaoyao.li, binbin.wu, chao.p.peng, Yan Zhao

From: Xiaoyao Li <xiaoyao.li@intel.com>

After a guest page is removed from the S-EPT, KVM calls
tdh_phymem_page_wbinvd_hkid() to execute WBINVD on the page using the TD's
keyID.

Add a helper function that takes level information to perform WBINVD on a
huge page.

[Yan: split patch, added a helper, rebased to use struct page]
Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
---
 arch/x86/kvm/vmx/tdx.c | 24 +++++++++++++++++++-----
 1 file changed, 19 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 69f3140928b5..355b21fc169f 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1586,6 +1586,23 @@ int tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn,
 	return tdx_mem_page_record_premap_cnt(kvm, level);
 }
 
+static inline u64 tdx_wbinvd_page(struct kvm *kvm, u64 hkid, struct page *page, int level)
+{
+	unsigned long nr = KVM_PAGES_PER_HPAGE(level);
+	unsigned long idx = 0;
+	u64 err;
+
+	while (nr--) {
+		err = tdh_phymem_page_wbinvd_hkid(hkid, nth_page(page, idx++));
+
+		if (KVM_BUG_ON(err, kvm)) {
+			pr_tdx_error(TDH_PHYMEM_PAGE_WBINVD, err);
+			return err;
+		}
+	}
+	return err;
+}
+
 static int tdx_sept_drop_private_spte(struct kvm *kvm, gfn_t gfn,
 				      enum pg_level level, struct page *page)
 {
@@ -1625,12 +1642,9 @@ static int tdx_sept_drop_private_spte(struct kvm *kvm, gfn_t gfn,
 		return -EIO;
 	}
 
-	err = tdh_phymem_page_wbinvd_hkid((u16)kvm_tdx->hkid, page);
-
-	if (KVM_BUG_ON(err, kvm)) {
-		pr_tdx_error(TDH_PHYMEM_PAGE_WBINVD, err);
+	err = tdx_wbinvd_page(kvm, kvm_tdx->hkid, page, level);
+	if (err)
 		return -EIO;
-	}
 
 	tdx_clear_page(page, level);
 	tdx_unpin(kvm, page);
-- 
2.43.2



* [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge pages
  2025-04-24  3:00 [RFC PATCH 00/21] KVM: TDX huge page support for private memory Yan Zhao
                   ` (6 preceding siblings ...)
  2025-04-24  3:05 ` [RFC PATCH 07/21] KVM: TDX: Add a helper for WBINVD on huge pages with TD's keyID Yan Zhao
@ 2025-04-24  3:06 ` Yan Zhao
  2025-04-29  0:17   ` Vishal Annapurve
  2025-04-24  3:06 ` [RFC PATCH 09/21] KVM: TDX: Enable 2MB mapping size after TD is RUNNABLE Yan Zhao
                   ` (13 subsequent siblings)
  21 siblings, 1 reply; 294+ messages in thread
From: Yan Zhao @ 2025-04-24  3:06 UTC (permalink / raw)
  To: pbonzini, seanjc
  Cc: linux-kernel, kvm, x86, rick.p.edgecombe, dave.hansen,
	kirill.shutemov, tabba, ackerleytng, quic_eberman, michael.roth,
	david, vannapurve, vbabka, jroedel, thomas.lendacky, pgonda,
	zhiquan1.li, fan.du, jun.miao, ira.weiny, isaku.yamahata,
	xiaoyao.li, binbin.wu, chao.p.peng, Yan Zhao

Increase the folio ref count before mapping a private page, and decrease
the folio ref count after a mapping failure or after successfully removing
a private page.

The folio ref count to inc/dec corresponds to the mapping/unmapping level,
ensuring the folio ref count remains balanced after entry splitting or
merging.

Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
---
 arch/x86/kvm/vmx/tdx.c | 19 ++++++++++---------
 1 file changed, 10 insertions(+), 9 deletions(-)

diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 355b21fc169f..e23dce59fc72 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1501,9 +1501,9 @@ void tdx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa, int pgd_level)
 	td_vmcs_write64(to_tdx(vcpu), SHARED_EPT_POINTER, root_hpa);
 }
 
-static void tdx_unpin(struct kvm *kvm, struct page *page)
+static void tdx_unpin(struct kvm *kvm, struct page *page, int level)
 {
-	put_page(page);
+	folio_put_refs(page_folio(page), KVM_PAGES_PER_HPAGE(level));
 }
 
 static int tdx_mem_page_aug(struct kvm *kvm, gfn_t gfn,
@@ -1517,13 +1517,13 @@ static int tdx_mem_page_aug(struct kvm *kvm, gfn_t gfn,
 
 	err = tdh_mem_page_aug(&kvm_tdx->td, gpa, tdx_level, page, &entry, &level_state);
 	if (unlikely(tdx_operand_busy(err))) {
-		tdx_unpin(kvm, page);
+		tdx_unpin(kvm, page, level);
 		return -EBUSY;
 	}
 
 	if (KVM_BUG_ON(err, kvm)) {
 		pr_tdx_error_2(TDH_MEM_PAGE_AUG, err, entry, level_state);
-		tdx_unpin(kvm, page);
+		tdx_unpin(kvm, page, level);
 		return -EIO;
 	}
 
@@ -1570,10 +1570,11 @@ int tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn,
 	 * a_ops->migrate_folio (yet), no callback is triggered for KVM on page
 	 * migration.  Until guest_memfd supports page migration, prevent page
 	 * migration.
-	 * TODO: Once guest_memfd introduces callback on page migration,
-	 * implement it and remove get_page/put_page().
+	 * TODO: To support in-place-conversion in gmem in futre, remove
+	 * folio_ref_add()/folio_put_refs(). Only increase the folio ref count
+	 * when there're errors during removing private pages.
 	 */
-	get_page(page);
+	folio_ref_add(page_folio(page), KVM_PAGES_PER_HPAGE(level));
 
 	/*
 	 * Read 'pre_fault_allowed' before 'kvm_tdx->state'; see matching
@@ -1647,7 +1648,7 @@ static int tdx_sept_drop_private_spte(struct kvm *kvm, gfn_t gfn,
 		return -EIO;
 
 	tdx_clear_page(page, level);
-	tdx_unpin(kvm, page);
+	tdx_unpin(kvm, page, level);
 	return 0;
 }
 
@@ -1727,7 +1728,7 @@ static int tdx_sept_zap_private_spte(struct kvm *kvm, gfn_t gfn,
 	if (tdx_is_sept_zap_err_due_to_premap(kvm_tdx, err, entry, level) &&
 	    !KVM_BUG_ON(!atomic64_read(&kvm_tdx->nr_premapped), kvm)) {
 		atomic64_dec(&kvm_tdx->nr_premapped);
-		tdx_unpin(kvm, page);
+		tdx_unpin(kvm, page, level);
 		return 0;
 	}
 
-- 
2.43.2



* [RFC PATCH 09/21] KVM: TDX: Enable 2MB mapping size after TD is RUNNABLE
  2025-04-24  3:00 [RFC PATCH 00/21] KVM: TDX huge page support for private memory Yan Zhao
                   ` (7 preceding siblings ...)
  2025-04-24  3:06 ` [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge pages Yan Zhao
@ 2025-04-24  3:06 ` Yan Zhao
  2025-05-13 20:10   ` Edgecombe, Rick P
  2025-04-24  3:06 ` [RFC PATCH 10/21] KVM: x86/mmu: Disallow page merging (huge page adjustment) for mirror root Yan Zhao
                   ` (12 subsequent siblings)
  21 siblings, 1 reply; 294+ messages in thread
From: Yan Zhao @ 2025-04-24  3:06 UTC (permalink / raw)
  To: pbonzini, seanjc
  Cc: linux-kernel, kvm, x86, rick.p.edgecombe, dave.hansen,
	kirill.shutemov, tabba, ackerleytng, quic_eberman, michael.roth,
	david, vannapurve, vbabka, jroedel, thomas.lendacky, pgonda,
	zhiquan1.li, fan.du, jun.miao, ira.weiny, isaku.yamahata,
	xiaoyao.li, binbin.wu, chao.p.peng, Yan Zhao

Allow TDX's .private_max_mapping_level hook to return 2MB after the TD is
RUNNABLE, enabling KVM to map TDX private pages at the 2MB level. Remove
TODOs and adjust KVM_BUG_ON()s accordingly.

Note: Instead of placing this patch at the tail of the series, it's
positioned here to show the code changes for basic mapping of private huge
pages (i.e., transitioning from non-present to present).

However, since this patch also allows KVM to trigger the merging of small
entries into a huge leaf entry, or the splitting of a huge leaf entry into
small entries, errors are expected if either operation is triggered, given
the current lack of splitting/merging support.
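
For reference, a minimal sketch of how the KVM MMU consumes this hook's
return value when sizing a private fault (it mirrors the
kvm_max_private_mapping_level() logic quoted later in this series; the
helper name and context here are simplified purely for illustration):

	/*
	 * The hook's return value only caps the level already derived
	 * from the gmem allocation order, so returning PG_LEVEL_2M
	 * allows, but does not force, a 2MB mapping.
	 */
	static u8 cap_private_level(struct kvm *kvm, kvm_pfn_t pfn,
				    u8 max_level, int gmem_order)
	{
		u8 req_max_level;

		max_level = min(max_level, kvm_max_level_for_order(gmem_order));
		if (max_level == PG_LEVEL_4K)
			return PG_LEVEL_4K;

		req_max_level = kvm_x86_call(private_max_mapping_level)(kvm, pfn);
		if (req_max_level)
			max_level = min(max_level, req_max_level);

		return max_level;
	}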

Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
---
 arch/x86/kvm/vmx/tdx.c | 16 +++++++---------
 1 file changed, 7 insertions(+), 9 deletions(-)

diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index e23dce59fc72..6b3a8f3e6c9c 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1561,10 +1561,6 @@ int tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn,
 	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
 	struct page *page = pfn_to_page(pfn);
 
-	/* TODO: handle large pages. */
-	if (KVM_BUG_ON(level != PG_LEVEL_4K, kvm))
-		return -EINVAL;
-
 	/*
 	 * Because guest_memfd doesn't support page migration with
 	 * a_ops->migrate_folio (yet), no callback is triggered for KVM on page
@@ -1612,8 +1608,7 @@ static int tdx_sept_drop_private_spte(struct kvm *kvm, gfn_t gfn,
 	gpa_t gpa = gfn_to_gpa(gfn);
 	u64 err, entry, level_state;
 
-	/* TODO: handle large pages. */
-	if (KVM_BUG_ON(level != PG_LEVEL_4K, kvm))
+	if (KVM_BUG_ON(kvm_tdx->state != TD_STATE_RUNNABLE && level != PG_LEVEL_4K, kvm))
 		return -EINVAL;
 
 	if (KVM_BUG_ON(!is_hkid_assigned(kvm_tdx), kvm))
@@ -1714,8 +1709,8 @@ static int tdx_sept_zap_private_spte(struct kvm *kvm, gfn_t gfn,
 	gpa_t gpa = gfn_to_gpa(gfn) & KVM_HPAGE_MASK(level);
 	u64 err, entry, level_state;
 
-	/* For now large page isn't supported yet. */
-	WARN_ON_ONCE(level != PG_LEVEL_4K);
+	/* Before the TD is runnable, large pages are not supported */
+	WARN_ON_ONCE(kvm_tdx->state != TD_STATE_RUNNABLE && level != PG_LEVEL_4K);
 
 	err = tdh_mem_range_block(&kvm_tdx->td, gpa, tdx_level, &entry, &level_state);
 
@@ -1817,6 +1812,9 @@ int tdx_sept_remove_private_spte(struct kvm *kvm, gfn_t gfn,
 	struct page *page = pfn_to_page(pfn);
 	int ret;
 
+	WARN_ON_ONCE(folio_page_idx(page_folio(page), page) + KVM_PAGES_PER_HPAGE(level) >
+		     folio_nr_pages(page_folio(page)));
+
 	/*
 	 * HKID is released after all private pages have been removed, and set
 	 * before any might be populated. Warn if zapping is attempted when
@@ -3265,7 +3263,7 @@ int tdx_gmem_private_max_mapping_level(struct kvm *kvm, kvm_pfn_t pfn)
 	if (unlikely(to_kvm_tdx(kvm)->state != TD_STATE_RUNNABLE))
 		return PG_LEVEL_4K;
 
-	return PG_LEVEL_4K;
+	return PG_LEVEL_2M;
 }
 
 static int tdx_online_cpu(unsigned int cpu)
-- 
2.43.2


^ permalink raw reply related	[flat|nested] 294+ messages in thread

* [RFC PATCH 10/21] KVM: x86/mmu: Disallow page merging (huge page adjustment) for mirror root
  2025-04-24  3:00 [RFC PATCH 00/21] KVM: TDX huge page support for private memory Yan Zhao
                   ` (8 preceding siblings ...)
  2025-04-24  3:06 ` [RFC PATCH 09/21] KVM: TDX: Enable 2MB mapping size after TD is RUNNABLE Yan Zhao
@ 2025-04-24  3:06 ` Yan Zhao
  2025-05-13 20:15   ` Edgecombe, Rick P
  2025-04-24  3:06 ` [RFC PATCH 11/21] KVM: x86: Add "vcpu" "gfn" parameters to x86 hook private_max_mapping_level Yan Zhao
                   ` (11 subsequent siblings)
  21 siblings, 1 reply; 294+ messages in thread
From: Yan Zhao @ 2025-04-24  3:06 UTC (permalink / raw)
  To: pbonzini, seanjc
  Cc: linux-kernel, kvm, x86, rick.p.edgecombe, dave.hansen,
	kirill.shutemov, tabba, ackerleytng, quic_eberman, michael.roth,
	david, vannapurve, vbabka, jroedel, thomas.lendacky, pgonda,
	zhiquan1.li, fan.du, jun.miao, ira.weiny, isaku.yamahata,
	xiaoyao.li, binbin.wu, chao.p.peng, Edgecombe, Yan Zhao

From: "Edgecombe, Rick P" <rick.p.edgecombe@intel.com>

Disallow page merging (huge page adjustment) for the mirror root by
leveraging disallowed_hugepage_adjust().

[Yan: Passing is_mirror to disallowed_hugepage_adjust()]
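
For clarity, the net effect for the mirror root is roughly the following
(a condensed sketch of disallowed_hugepage_adjust()'s demotion path; the
hunks below are authoritative):

	/*
	 * If a non-leaf SPTE (i.e. a table of smaller mappings) is
	 * already present at the level the fault wants to map, force the
	 * fault down one level instead of merging -- previously only for
	 * the NX huge page workaround, now also unconditionally for the
	 * mirror root.
	 */
	if (cur_level > PG_LEVEL_4K && cur_level == fault->goal_level &&
	    is_shadow_present_pte(spte) && !is_large_pte(spte) &&
	    (spte_to_child_sp(spte)->nx_huge_page_disallowed || is_mirror)) {
		u64 page_mask = KVM_PAGES_PER_HPAGE(cur_level) -
				KVM_PAGES_PER_HPAGE(cur_level - 1);

		fault->pfn |= fault->gfn & page_mask;
		fault->goal_level--;
	}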

Signed-off-by: Edgecombe, Rick P <rick.p.edgecombe@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
---
 arch/x86/kvm/mmu/mmu.c          | 6 +++---
 arch/x86/kvm/mmu/mmu_internal.h | 2 +-
 arch/x86/kvm/mmu/paging_tmpl.h  | 2 +-
 arch/x86/kvm/mmu/tdp_mmu.c      | 7 ++++---
 4 files changed, 9 insertions(+), 8 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index a284dce227a0..b923deeeb62e 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3326,13 +3326,13 @@ void kvm_mmu_hugepage_adjust(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
 	fault->pfn &= ~mask;
 }
 
-void disallowed_hugepage_adjust(struct kvm_page_fault *fault, u64 spte, int cur_level)
+void disallowed_hugepage_adjust(struct kvm_page_fault *fault, u64 spte, int cur_level, bool is_mirror)
 {
 	if (cur_level > PG_LEVEL_4K &&
 	    cur_level == fault->goal_level &&
 	    is_shadow_present_pte(spte) &&
 	    !is_large_pte(spte) &&
-	    spte_to_child_sp(spte)->nx_huge_page_disallowed) {
+	    (spte_to_child_sp(spte)->nx_huge_page_disallowed || is_mirror)) {
 		/*
 		 * A small SPTE exists for this pfn, but FNAME(fetch),
 		 * direct_map(), or kvm_tdp_mmu_map() would like to create a
@@ -3363,7 +3363,7 @@ static int direct_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 		 * large page, as the leaf could be executable.
 		 */
 		if (fault->nx_huge_page_workaround_enabled)
-			disallowed_hugepage_adjust(fault, *it.sptep, it.level);
+			disallowed_hugepage_adjust(fault, *it.sptep, it.level, false);
 
 		base_gfn = gfn_round_for_level(fault->gfn, it.level);
 		if (it.level == fault->goal_level)
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index db8f33e4de62..1c1764f46e66 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -411,7 +411,7 @@ static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
 int kvm_mmu_max_mapping_level(struct kvm *kvm,
 			      const struct kvm_memory_slot *slot, gfn_t gfn);
 void kvm_mmu_hugepage_adjust(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault);
-void disallowed_hugepage_adjust(struct kvm_page_fault *fault, u64 spte, int cur_level);
+void disallowed_hugepage_adjust(struct kvm_page_fault *fault, u64 spte, int cur_level, bool is_mirror);
 
 void track_possible_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp);
 void untrack_possible_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp);
diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
index 68e323568e95..1559182038e3 100644
--- a/arch/x86/kvm/mmu/paging_tmpl.h
+++ b/arch/x86/kvm/mmu/paging_tmpl.h
@@ -717,7 +717,7 @@ static int FNAME(fetch)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
 		 * large page, as the leaf could be executable.
 		 */
 		if (fault->nx_huge_page_workaround_enabled)
-			disallowed_hugepage_adjust(fault, *it.sptep, it.level);
+			disallowed_hugepage_adjust(fault, *it.sptep, it.level, false);
 
 		base_gfn = gfn_round_for_level(fault->gfn, it.level);
 		if (it.level == fault->goal_level)
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 405874f4d088..8ee01277cc07 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1244,6 +1244,7 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 	struct tdp_iter iter;
 	struct kvm_mmu_page *sp;
 	int ret = RET_PF_RETRY;
+	bool is_mirror = is_mirror_sp(root);
 
 	kvm_mmu_hugepage_adjust(vcpu, fault);
 
@@ -1254,8 +1255,8 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 	for_each_tdp_pte(iter, kvm, root, fault->gfn, fault->gfn + 1) {
 		int r;
 
-		if (fault->nx_huge_page_workaround_enabled)
-			disallowed_hugepage_adjust(fault, iter.old_spte, iter.level);
+		if (fault->nx_huge_page_workaround_enabled || is_mirror)
+			disallowed_hugepage_adjust(fault, iter.old_spte, iter.level, is_mirror);
 
 		/*
 		 * If SPTE has been frozen by another thread, just give up and
@@ -1278,7 +1279,7 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 		 */
 		sp = tdp_mmu_alloc_sp(vcpu);
 		tdp_mmu_init_child_sp(sp, &iter);
-		if (is_mirror_sp(sp))
+		if (is_mirror)
 			kvm_mmu_alloc_external_spt(vcpu, sp);
 
 		sp->nx_huge_page_disallowed = fault->huge_page_disallowed;
-- 
2.43.2


^ permalink raw reply related	[flat|nested] 294+ messages in thread

* [RFC PATCH 11/21] KVM: x86: Add "vcpu" "gfn" parameters to x86 hook private_max_mapping_level
  2025-04-24  3:00 [RFC PATCH 00/21] KVM: TDX huge page support for private memory Yan Zhao
                   ` (9 preceding siblings ...)
  2025-04-24  3:06 ` [RFC PATCH 10/21] KVM: x86/mmu: Disallow page merging (huge page adjustment) for mirror root Yan Zhao
@ 2025-04-24  3:06 ` Yan Zhao
  2025-04-24  3:07 ` [RFC PATCH 12/21] KVM: TDX: Determine max mapping level according to vCPU's ACCEPT level Yan Zhao
                   ` (10 subsequent siblings)
  21 siblings, 0 replies; 294+ messages in thread
From: Yan Zhao @ 2025-04-24  3:06 UTC (permalink / raw)
  To: pbonzini, seanjc
  Cc: linux-kernel, kvm, x86, rick.p.edgecombe, dave.hansen,
	kirill.shutemov, tabba, ackerleytng, quic_eberman, michael.roth,
	david, vannapurve, vbabka, jroedel, thomas.lendacky, pgonda,
	zhiquan1.li, fan.du, jun.miao, ira.weiny, isaku.yamahata,
	xiaoyao.li, binbin.wu, chao.p.peng, Yan Zhao

Introduce "vcpu" and "gfn" parameters to the KVM x86 hook
private_max_mapping_level.

This is in preparation for enabling TDX to return the max mapping level
for a specific GFN on a given vCPU.

No functional change expected.

Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
---
 arch/x86/include/asm/kvm_host.h | 2 +-
 arch/x86/kvm/mmu/mmu.c          | 6 +++---
 arch/x86/kvm/svm/sev.c          | 4 ++--
 arch/x86/kvm/svm/svm.h          | 4 ++--
 arch/x86/kvm/vmx/main.c         | 6 +++---
 arch/x86/kvm/vmx/tdx.c          | 4 ++--
 arch/x86/kvm/vmx/x86_ops.h      | 4 ++--
 7 files changed, 15 insertions(+), 15 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index ed9b65785a24..f96d30ad4ae8 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1896,7 +1896,7 @@ struct kvm_x86_ops {
 	void *(*alloc_apic_backing_page)(struct kvm_vcpu *vcpu);
 	int (*gmem_prepare)(struct kvm *kvm, kvm_pfn_t pfn, gfn_t gfn, int max_order);
 	void (*gmem_invalidate)(kvm_pfn_t start, kvm_pfn_t end);
-	int (*private_max_mapping_level)(struct kvm *kvm, kvm_pfn_t pfn);
+	int (*private_max_mapping_level)(struct kvm_vcpu *vcpu, kvm_pfn_t pfn, gfn_t gfn);
 };
 
 struct kvm_x86_nested_ops {
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index b923deeeb62e..0e227199d73e 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4466,7 +4466,7 @@ static inline u8 kvm_max_level_for_order(int order)
 	return PG_LEVEL_4K;
 }
 
-static u8 kvm_max_private_mapping_level(struct kvm *kvm, kvm_pfn_t pfn,
+static u8 kvm_max_private_mapping_level(struct kvm_vcpu *vcpu, kvm_pfn_t pfn, gfn_t gfn,
 					u8 max_level, int gmem_order)
 {
 	u8 req_max_level;
@@ -4478,7 +4478,7 @@ static u8 kvm_max_private_mapping_level(struct kvm *kvm, kvm_pfn_t pfn,
 	if (max_level == PG_LEVEL_4K)
 		return PG_LEVEL_4K;
 
-	req_max_level = kvm_x86_call(private_max_mapping_level)(kvm, pfn);
+	req_max_level = kvm_x86_call(private_max_mapping_level)(vcpu, pfn, gfn);
 	if (req_max_level)
 		max_level = min(max_level, req_max_level);
 
@@ -4510,7 +4510,7 @@ static int kvm_mmu_faultin_pfn_private(struct kvm_vcpu *vcpu,
 	}
 
 	fault->map_writable = !(fault->slot->flags & KVM_MEM_READONLY);
-	fault->max_level = kvm_max_private_mapping_level(vcpu->kvm, fault->pfn,
+	fault->max_level = kvm_max_private_mapping_level(vcpu, fault->pfn, fault->gfn,
 							 fault->max_level, max_order);
 
 	return RET_PF_CONTINUE;
diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
index 0bc708ee2788..dc6cdf9fa1ba 100644
--- a/arch/x86/kvm/svm/sev.c
+++ b/arch/x86/kvm/svm/sev.c
@@ -4910,12 +4910,12 @@ void sev_gmem_invalidate(kvm_pfn_t start, kvm_pfn_t end)
 	}
 }
 
-int sev_private_max_mapping_level(struct kvm *kvm, kvm_pfn_t pfn)
+int sev_private_max_mapping_level(struct kvm_vcpu *vcpu, kvm_pfn_t pfn, gfn_t gfn)
 {
 	int level, rc;
 	bool assigned;
 
-	if (!sev_snp_guest(kvm))
+	if (!sev_snp_guest(vcpu->kvm))
 		return 0;
 
 	rc = snp_lookup_rmpentry(pfn, &assigned, &level);
diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
index d4490eaed55d..1a9738b6ae37 100644
--- a/arch/x86/kvm/svm/svm.h
+++ b/arch/x86/kvm/svm/svm.h
@@ -782,7 +782,7 @@ void sev_handle_rmp_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code);
 void sev_snp_init_protected_guest_state(struct kvm_vcpu *vcpu);
 int sev_gmem_prepare(struct kvm *kvm, kvm_pfn_t pfn, gfn_t gfn, int max_order);
 void sev_gmem_invalidate(kvm_pfn_t start, kvm_pfn_t end);
-int sev_private_max_mapping_level(struct kvm *kvm, kvm_pfn_t pfn);
+int sev_private_max_mapping_level(struct kvm_vcpu *vcpu, kvm_pfn_t pfn, gfn_t gfn);
 #else
 static inline struct page *snp_safe_alloc_page_node(int node, gfp_t gfp)
 {
@@ -809,7 +809,7 @@ static inline int sev_gmem_prepare(struct kvm *kvm, kvm_pfn_t pfn, gfn_t gfn, in
 	return 0;
 }
 static inline void sev_gmem_invalidate(kvm_pfn_t start, kvm_pfn_t end) {}
-static inline int sev_private_max_mapping_level(struct kvm *kvm, kvm_pfn_t pfn)
+static inline int sev_private_max_mapping_level(struct kvm_vcpu *vcpu, kvm_pfn_t pfn, gfn_t gfn)
 {
 	return 0;
 }
diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index 94d5d907d37b..ae8540576821 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -880,10 +880,10 @@ static int vt_vcpu_mem_enc_ioctl(struct kvm_vcpu *vcpu, void __user *argp)
 	return tdx_vcpu_ioctl(vcpu, argp);
 }
 
-static int vt_gmem_private_max_mapping_level(struct kvm *kvm, kvm_pfn_t pfn)
+static int vt_gmem_private_max_mapping_level(struct kvm_vcpu *vcpu, kvm_pfn_t pfn, gfn_t gfn)
 {
-	if (is_td(kvm))
-		return tdx_gmem_private_max_mapping_level(kvm, pfn);
+	if (is_td(vcpu->kvm))
+		return tdx_gmem_private_max_mapping_level(vcpu, pfn, gfn);
 
 	return 0;
 }
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 6b3a8f3e6c9c..86775af85cd8 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -3258,9 +3258,9 @@ int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp)
 	return ret;
 }
 
-int tdx_gmem_private_max_mapping_level(struct kvm *kvm, kvm_pfn_t pfn)
+int tdx_gmem_private_max_mapping_level(struct kvm_vcpu *vcpu, kvm_pfn_t pfn, gfn_t gfn)
 {
-	if (unlikely(to_kvm_tdx(kvm)->state != TD_STATE_RUNNABLE))
+	if (unlikely(to_kvm_tdx(vcpu->kvm)->state != TD_STATE_RUNNABLE))
 		return PG_LEVEL_4K;
 
 	return PG_LEVEL_2M;
diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
index 6bf8be570b2e..7c183da7c4d4 100644
--- a/arch/x86/kvm/vmx/x86_ops.h
+++ b/arch/x86/kvm/vmx/x86_ops.h
@@ -162,7 +162,7 @@ int tdx_sept_remove_private_spte(struct kvm *kvm, gfn_t gfn,
 void tdx_flush_tlb_current(struct kvm_vcpu *vcpu);
 void tdx_flush_tlb_all(struct kvm_vcpu *vcpu);
 void tdx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa, int root_level);
-int tdx_gmem_private_max_mapping_level(struct kvm *kvm, kvm_pfn_t pfn);
+int tdx_gmem_private_max_mapping_level(struct kvm_vcpu *vcpu, kvm_pfn_t pfn, gfn_t gfn);
 #else
 static inline void tdx_disable_virtualization_cpu(void) {}
 static inline int tdx_vm_init(struct kvm *kvm) { return -EOPNOTSUPP; }
@@ -227,7 +227,7 @@ static inline int tdx_sept_remove_private_spte(struct kvm *kvm, gfn_t gfn,
 static inline void tdx_flush_tlb_current(struct kvm_vcpu *vcpu) {}
 static inline void tdx_flush_tlb_all(struct kvm_vcpu *vcpu) {}
 static inline void tdx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa, int root_level) {}
-static inline int tdx_gmem_private_max_mapping_level(struct kvm *kvm, kvm_pfn_t pfn) { return 0; }
+static inline int tdx_gmem_private_max_mapping_level(struct kvm_vcpu *vcpu, kvm_pfn_t pfn, gfn_t gfn) { return 0; }
 #endif
 
 #endif /* __KVM_X86_VMX_X86_OPS_H */
-- 
2.43.2


^ permalink raw reply related	[flat|nested] 294+ messages in thread

* [RFC PATCH 12/21] KVM: TDX: Determine max mapping level according to vCPU's ACCEPT level
  2025-04-24  3:00 [RFC PATCH 00/21] KVM: TDX huge page support for private memory Yan Zhao
                   ` (10 preceding siblings ...)
  2025-04-24  3:06 ` [RFC PATCH 11/21] KVM: x86: Add "vcpu" "gfn" parameters to x86 hook private_max_mapping_level Yan Zhao
@ 2025-04-24  3:07 ` Yan Zhao
  2025-05-13 21:20   ` Edgecombe, Rick P
  2025-04-24  3:07 ` [RFC PATCH 13/21] KVM: x86/tdp_mmu: Alloc external_spt page for mirror page table splitting Yan Zhao
                   ` (9 subsequent siblings)
  21 siblings, 1 reply; 294+ messages in thread
From: Yan Zhao @ 2025-04-24  3:07 UTC (permalink / raw)
  To: pbonzini, seanjc
  Cc: linux-kernel, kvm, x86, rick.p.edgecombe, dave.hansen,
	kirill.shutemov, tabba, ackerleytng, quic_eberman, michael.roth,
	david, vannapurve, vbabka, jroedel, thomas.lendacky, pgonda,
	zhiquan1.li, fan.du, jun.miao, ira.weiny, isaku.yamahata,
	xiaoyao.li, binbin.wu, chao.p.peng, Yan Zhao

Determine the max mapping level of a private GFN according to the vCPU's
ACCEPT level specified in the TDCALL TDG.MEM.PAGE.ACCEPT.

When an EPT violation occurs due to a vCPU invoking TDG.MEM.PAGE.ACCEPT
before any actual memory access, the vCPU's ACCEPT level is available in
the extended exit qualification. Set the vCPU's ACCEPT level as the max
mapping level for the faulting GFN. This is necessary because if KVM
specifies a mapping level greater than the vCPU's ACCEPT level, and no
other vCPUs are accepting at KVM's mapping level, TDG.MEM.PAGE.ACCEPT will
produce another EPT violation on the vCPU after re-entering the TD, with
the vCPU's ACCEPT level indicated in the extended exit qualification.

Introduce "violation_gfn_start", "violation_gfn_end", and
"violation_request_level" in "struct vcpu_tdx" to pass the vCPU's ACCEPT
level to TDX's private_max_mapping_level hook for determining the max
mapping level.

Rather than encoding the ACCEPT level in bits of the error_code passed to
kvm_mmu_page_fault() and requiring the KVM MMU core to derive a fault's
max_level from it, have TDX's private_max_mapping_level hook check the
requested level. This avoids changes to the KVM MMU core and also
accommodates future scenarios where the requested mapping level is unknown
at the start of tdx_handle_ept_violation() (i.e., before invoking
kvm_mmu_page_fault()).
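
As a worked example (values chosen purely for illustration): suppose the
guest ACCEPTs GPA 0x123456000 at 4KB granularity. Per the decoding added
below, the extended exit qualification carries the type in bits 3:0 and
the accept level (minus 1) in the low bits of the info field (bits 63:32):

	u64 eeq = TDX_EXT_EXIT_QUAL_TYPE_ACCEPT;	/* info bits == 0: 4KB accept */
	u32 eeq_info = (eeq & TDX_EXT_EXIT_QUAL_INFO_MASK) >>
		       TDX_EXT_EXIT_QUAL_INFO_SHIFT;
	int level = (eeq_info & GENMASK(2, 0)) + 1;	/* 1 == PG_LEVEL_4K */

	tdx->violation_gfn_start = gfn_round_for_level(0x123456, level);	/* 0x123456 */
	tdx->violation_gfn_end = tdx->violation_gfn_start +
				 KVM_PAGES_PER_HPAGE(level);			/* + 1 */
	tdx->violation_request_level = level;

The subsequent fault on GFN 0x123456 then has its mapping level capped at
PG_LEVEL_4K by tdx_gmem_private_max_mapping_level(), matching the guest's
ACCEPT size instead of the default PG_LEVEL_2M.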

Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
---
 arch/x86/kvm/vmx/tdx.c      | 36 +++++++++++++++++++++++++++++++++++-
 arch/x86/kvm/vmx/tdx.h      |  4 ++++
 arch/x86/kvm/vmx/tdx_arch.h |  3 +++
 3 files changed, 42 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 86775af85cd8..dd63a634e633 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1859,10 +1859,34 @@ static inline bool tdx_is_sept_violation_unexpected_pending(struct kvm_vcpu *vcp
 	return !(eq & EPT_VIOLATION_PROT_MASK) && !(eq & EPT_VIOLATION_EXEC_FOR_RING3_LIN);
 }
 
+static inline void tdx_get_accept_level(struct kvm_vcpu *vcpu, gpa_t gpa)
+{
+	struct vcpu_tdx *tdx = to_tdx(vcpu);
+	int level = -1;
+
+	u64 eeq_type = tdx->ext_exit_qualification & TDX_EXT_EXIT_QUAL_TYPE_MASK;
+
+	u32 eeq_info = (tdx->ext_exit_qualification & TDX_EXT_EXIT_QUAL_INFO_MASK) >>
+			TDX_EXT_EXIT_QUAL_INFO_SHIFT;
+
+	if (eeq_type == TDX_EXT_EXIT_QUAL_TYPE_ACCEPT) {
+		level = (eeq_info & GENMASK(2, 0)) + 1;
+
+		tdx->violation_gfn_start = gfn_round_for_level(gpa_to_gfn(gpa), level);
+		tdx->violation_gfn_end = tdx->violation_gfn_start + KVM_PAGES_PER_HPAGE(level);
+		tdx->violation_request_level = level;
+	} else {
+		tdx->violation_gfn_start = -1;
+		tdx->violation_gfn_end = -1;
+		tdx->violation_request_level = -1;
+	}
+}
+
 static int tdx_handle_ept_violation(struct kvm_vcpu *vcpu)
 {
 	unsigned long exit_qual;
-	gpa_t gpa = to_tdx(vcpu)->exit_gpa;
+	struct vcpu_tdx *tdx = to_tdx(vcpu);
+	gpa_t gpa = tdx->exit_gpa;
 	bool local_retry = false;
 	int ret;
 
@@ -1884,6 +1908,8 @@ static int tdx_handle_ept_violation(struct kvm_vcpu *vcpu)
 		 */
 		exit_qual = EPT_VIOLATION_ACC_WRITE;
 
+		tdx_get_accept_level(vcpu, gpa);
+
 		/* Only private GPA triggers zero-step mitigation */
 		local_retry = true;
 	} else {
@@ -2917,6 +2943,9 @@ static int tdx_td_vcpu_init(struct kvm_vcpu *vcpu, u64 vcpu_rcx)
 
 	vcpu->arch.mp_state = KVM_MP_STATE_RUNNABLE;
 
+	tdx->violation_gfn_start = -1;
+	tdx->violation_gfn_end = -1;
+	tdx->violation_request_level = -1;
 	return 0;
 
 free_tdcx:
@@ -3260,9 +3289,14 @@ int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp)
 
 int tdx_gmem_private_max_mapping_level(struct kvm_vcpu *vcpu, kvm_pfn_t pfn, gfn_t gfn)
 {
+	struct vcpu_tdx *tdx = to_tdx(vcpu);
+
 	if (unlikely(to_kvm_tdx(vcpu->kvm)->state != TD_STATE_RUNNABLE))
 		return PG_LEVEL_4K;
 
+	if (gfn >= tdx->violation_gfn_start && gfn < tdx->violation_gfn_end)
+		return tdx->violation_request_level;
+
 	return PG_LEVEL_2M;
 }
 
diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h
index 51f98443e8a2..6e13895813c5 100644
--- a/arch/x86/kvm/vmx/tdx.h
+++ b/arch/x86/kvm/vmx/tdx.h
@@ -70,6 +70,10 @@ struct vcpu_tdx {
 
 	u64 map_gpa_next;
 	u64 map_gpa_end;
+
+	u64 violation_gfn_start;
+	u64 violation_gfn_end;
+	int violation_request_level;
 };
 
 void tdh_vp_rd_failed(struct vcpu_tdx *tdx, char *uclass, u32 field, u64 err);
diff --git a/arch/x86/kvm/vmx/tdx_arch.h b/arch/x86/kvm/vmx/tdx_arch.h
index a30e880849e3..af006a73ee05 100644
--- a/arch/x86/kvm/vmx/tdx_arch.h
+++ b/arch/x86/kvm/vmx/tdx_arch.h
@@ -82,7 +82,10 @@ struct tdx_cpuid_value {
 #define TDX_TD_ATTR_PERFMON		BIT_ULL(63)
 
 #define TDX_EXT_EXIT_QUAL_TYPE_MASK	GENMASK(3, 0)
+#define TDX_EXT_EXIT_QUAL_TYPE_ACCEPT  1
 #define TDX_EXT_EXIT_QUAL_TYPE_PENDING_EPT_VIOLATION  6
+#define TDX_EXT_EXIT_QUAL_INFO_MASK	GENMASK(63, 32)
+#define TDX_EXT_EXIT_QUAL_INFO_SHIFT	32
 /*
  * TD_PARAMS is provided as an input to TDH_MNG_INIT, the size of which is 1024B.
  */
-- 
2.43.2


^ permalink raw reply related	[flat|nested] 294+ messages in thread

* [RFC PATCH 13/21] KVM: x86/tdp_mmu: Alloc external_spt page for mirror page table splitting
  2025-04-24  3:00 [RFC PATCH 00/21] KVM: TDX huge page support for private memory Yan Zhao
                   ` (11 preceding siblings ...)
  2025-04-24  3:07 ` [RFC PATCH 12/21] KVM: TDX: Determine max mapping level according to vCPU's ACCEPT level Yan Zhao
@ 2025-04-24  3:07 ` Yan Zhao
  2025-04-24  3:07 ` [RFC PATCH 14/21] KVM: x86/tdp_mmu: Invoke split_external_spt hook with exclusive mmu_lock Yan Zhao
                   ` (8 subsequent siblings)
  21 siblings, 0 replies; 294+ messages in thread
From: Yan Zhao @ 2025-04-24  3:07 UTC (permalink / raw)
  To: pbonzini, seanjc
  Cc: linux-kernel, kvm, x86, rick.p.edgecombe, dave.hansen,
	kirill.shutemov, tabba, ackerleytng, quic_eberman, michael.roth,
	david, vannapurve, vbabka, jroedel, thomas.lendacky, pgonda,
	zhiquan1.li, fan.du, jun.miao, ira.weiny, isaku.yamahata,
	xiaoyao.li, binbin.wu, chao.p.peng, Yan Zhao

From: Isaku Yamahata <isaku.yamahata@intel.com>

Enhance tdp_mmu_alloc_sp_split() to allocate the external page table page
for splitting the mirror page table.

[Yan: Rebased and simplified the code]

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
---
 arch/x86/kvm/mmu/tdp_mmu.c | 15 +++++++++++++--
 1 file changed, 13 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 8ee01277cc07..799a08f91bf9 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -324,6 +324,8 @@ static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
 				u64 old_spte, u64 new_spte, int level,
 				bool shared);
 
+static struct kvm_mmu_page *tdp_mmu_alloc_sp_for_split(bool mirror);
+
 static void tdp_account_mmu_page(struct kvm *kvm, struct kvm_mmu_page *sp)
 {
 	kvm_account_pgtable_pages((void *)sp->spt, +1);
@@ -1475,7 +1477,7 @@ bool kvm_tdp_mmu_wrprot_slot(struct kvm *kvm,
 	return spte_set;
 }
 
-static struct kvm_mmu_page *tdp_mmu_alloc_sp_for_split(void)
+static struct kvm_mmu_page *tdp_mmu_alloc_sp_for_split(bool mirror)
 {
 	struct kvm_mmu_page *sp;
 
@@ -1489,6 +1491,15 @@ static struct kvm_mmu_page *tdp_mmu_alloc_sp_for_split(void)
 		return NULL;
 	}
 
+	if (mirror) {
+		sp->external_spt = (void *)get_zeroed_page(GFP_KERNEL_ACCOUNT);
+		if (!sp->external_spt) {
+			free_page((unsigned long)sp->spt);
+			kmem_cache_free(mmu_page_header_cache, sp);
+			return NULL;
+		}
+	}
+
 	return sp;
 }
 
@@ -1568,7 +1579,7 @@ static int tdp_mmu_split_huge_pages_root(struct kvm *kvm,
 			else
 				write_unlock(&kvm->mmu_lock);
 
-			sp = tdp_mmu_alloc_sp_for_split();
+			sp = tdp_mmu_alloc_sp_for_split(is_mirror_sp(root));
 
 			if (shared)
 				read_lock(&kvm->mmu_lock);
-- 
2.43.2


^ permalink raw reply related	[flat|nested] 294+ messages in thread

* [RFC PATCH 14/21] KVM: x86/tdp_mmu: Invoke split_external_spt hook with exclusive mmu_lock
  2025-04-24  3:00 [RFC PATCH 00/21] KVM: TDX huge page support for private memory Yan Zhao
                   ` (12 preceding siblings ...)
  2025-04-24  3:07 ` [RFC PATCH 13/21] KVM: x86/tdp_mmu: Alloc external_spt page for mirror page table splitting Yan Zhao
@ 2025-04-24  3:07 ` Yan Zhao
  2025-05-13 23:06   ` Edgecombe, Rick P
  2025-05-20  5:40   ` Binbin Wu
  2025-04-24  3:08 ` [RFC PATCH 15/21] KVM: TDX: Support huge page splitting with exclusive kvm->mmu_lock Yan Zhao
                   ` (7 subsequent siblings)
  21 siblings, 2 replies; 294+ messages in thread
From: Yan Zhao @ 2025-04-24  3:07 UTC (permalink / raw)
  To: pbonzini, seanjc
  Cc: linux-kernel, kvm, x86, rick.p.edgecombe, dave.hansen,
	kirill.shutemov, tabba, ackerleytng, quic_eberman, michael.roth,
	david, vannapurve, vbabka, jroedel, thomas.lendacky, pgonda,
	zhiquan1.li, fan.du, jun.miao, ira.weiny, isaku.yamahata,
	xiaoyao.li, binbin.wu, chao.p.peng, Yan Zhao

Introduce the split_external_spt hook and call it within tdp_mmu_set_spte()
for the mirror page table when kvm->mmu_lock is held for writing.

When tdp_mmu_set_spte() is invoked to transition an old leaf SPTE to a new
non-leaf SPTE in the mirror page table, use the split_external_spt hook to
propagate the entry splitting request to the external page table.
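
With this, the non-atomic mirror-root updates handled by
tdp_mmu_set_spte() become the following (a compact restatement of the
hunk below, with the rationale spelled out):

	if (is_mirror_sptep(sptep)) {
		if (!is_shadow_present_pte(new_spte))
			/* zap: tear the mapping out of the S-EPT too */
			remove_external_spte(kvm, gfn, old_spte, level);
		else if (is_last_spte(old_spte, level) && !is_last_spte(new_spte, level))
			/* leaf -> non-leaf: demote the S-EPT huge leaf */
			split_external_spt(kvm, gfn, old_spte, new_spte, level);
		else
			/* no other non-atomic mirror transition is expected */
			KVM_BUG_ON(1, kvm);
	}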

Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
---
 arch/x86/include/asm/kvm-x86-ops.h |  1 +
 arch/x86/include/asm/kvm_host.h    |  4 ++++
 arch/x86/kvm/mmu/tdp_mmu.c         | 26 ++++++++++++++++++++------
 3 files changed, 25 insertions(+), 6 deletions(-)

diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
index 79406bf07a1c..f8403e0f6c1e 100644
--- a/arch/x86/include/asm/kvm-x86-ops.h
+++ b/arch/x86/include/asm/kvm-x86-ops.h
@@ -99,6 +99,7 @@ KVM_X86_OP_OPTIONAL(link_external_spt)
 KVM_X86_OP_OPTIONAL(set_external_spte)
 KVM_X86_OP_OPTIONAL(free_external_spt)
 KVM_X86_OP_OPTIONAL(remove_external_spte)
+KVM_X86_OP_OPTIONAL(split_external_spt)
 KVM_X86_OP(has_wbinvd_exit)
 KVM_X86_OP(get_l2_tsc_offset)
 KVM_X86_OP(get_l2_tsc_multiplier)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index f96d30ad4ae8..6962a8a424ef 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1812,6 +1812,10 @@ struct kvm_x86_ops {
 	int (*remove_external_spte)(struct kvm *kvm, gfn_t gfn, enum pg_level level,
 				    kvm_pfn_t pfn_for_gfn);
 
+	/* Split the external page table into smaller page tables */
+	int (*split_external_spt)(struct kvm *kvm, gfn_t gfn, enum pg_level level,
+				  void *external_spt);
+
 	bool (*has_wbinvd_exit)(void);
 
 	u64 (*get_l2_tsc_offset)(struct kvm_vcpu *vcpu);
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 799a08f91bf9..0f683753a7bb 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -325,6 +325,7 @@ static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
 				bool shared);
 
 static struct kvm_mmu_page *tdp_mmu_alloc_sp_for_split(bool mirror);
+static void *get_external_spt(gfn_t gfn, u64 new_spte, int level);
 
 static void tdp_account_mmu_page(struct kvm *kvm, struct kvm_mmu_page *sp)
 {
@@ -384,6 +385,19 @@ static void remove_external_spte(struct kvm *kvm, gfn_t gfn, u64 old_spte,
 	KVM_BUG_ON(ret, kvm);
 }
 
+static int split_external_spt(struct kvm *kvm, gfn_t gfn, u64 old_spte,
+			      u64 new_spte, int level)
+{
+	void *external_spt = get_external_spt(gfn, new_spte, level);
+	int ret;
+
+	KVM_BUG_ON(!external_spt, kvm);
+
+	ret = static_call(kvm_x86_split_external_spt)(kvm, gfn, level, external_spt);
+	KVM_BUG_ON(ret, kvm);
+
+	return ret;
+}
 /**
  * handle_removed_pt() - handle a page table removed from the TDP structure
  *
@@ -764,13 +778,13 @@ static u64 tdp_mmu_set_spte(struct kvm *kvm, int as_id, tdp_ptep_t sptep,
 
 	handle_changed_spte(kvm, as_id, gfn, old_spte, new_spte, level, false);
 
-	/*
-	 * Users that do non-atomic setting of PTEs don't operate on mirror
-	 * roots, so don't handle it and bug the VM if it's seen.
-	 */
 	if (is_mirror_sptep(sptep)) {
-		KVM_BUG_ON(is_shadow_present_pte(new_spte), kvm);
-		remove_external_spte(kvm, gfn, old_spte, level);
+		if (!is_shadow_present_pte(new_spte))
+			remove_external_spte(kvm, gfn, old_spte, level);
+		else if (is_last_spte(old_spte, level) && !is_last_spte(new_spte, level))
+			split_external_spt(kvm, gfn, old_spte, new_spte, level);
+		else
+			KVM_BUG_ON(1, kvm);
 	}
 
 	return old_spte;
-- 
2.43.2


^ permalink raw reply related	[flat|nested] 294+ messages in thread

* [RFC PATCH 15/21] KVM: TDX: Support huge page splitting with exclusive kvm->mmu_lock
  2025-04-24  3:00 [RFC PATCH 00/21] KVM: TDX huge page support for private memory Yan Zhao
                   ` (13 preceding siblings ...)
  2025-04-24  3:07 ` [RFC PATCH 14/21] KVM: x86/tdp_mmu: Invoke split_external_spt hook with exclusive mmu_lock Yan Zhao
@ 2025-04-24  3:08 ` Yan Zhao
  2025-05-20  6:18   ` Binbin Wu
  2025-07-02 15:47   ` Edgecombe, Rick P
  2025-04-24  3:08 ` [RFC PATCH 16/21] KVM: x86/mmu: Introduce kvm_split_boundary_leafs() to split boundary leafs Yan Zhao
                   ` (6 subsequent siblings)
  21 siblings, 2 replies; 294+ messages in thread
From: Yan Zhao @ 2025-04-24  3:08 UTC (permalink / raw)
  To: pbonzini, seanjc
  Cc: linux-kernel, kvm, x86, rick.p.edgecombe, dave.hansen,
	kirill.shutemov, tabba, ackerleytng, quic_eberman, michael.roth,
	david, vannapurve, vbabka, jroedel, thomas.lendacky, pgonda,
	zhiquan1.li, fan.du, jun.miao, ira.weiny, isaku.yamahata,
	xiaoyao.li, binbin.wu, chao.p.peng, Yan Zhao

From: Xiaoyao Li <xiaoyao.li@intel.com>

Implement the split_external_spt hook to support huge page splitting for
TDX when kvm->mmu_lock is held for writing.

Invoke tdh_mem_range_block(), tdh_mem_track(), a kick of all vCPUs, and
tdh_mem_page_demote() in sequence. Since kvm->mmu_lock is held for writing,
simply kick vCPUs out of the guest on tdx_operand_busy() to ensure the
second SEAMCALL invocation succeeds.

The TDX module may return TDX_INTERRUPTED_RESTARTABLE when there is a
pending interrupt on the host side during tdh_mem_page_demote(). Retry
indefinitely on this error, since with kvm->mmu_lock held exclusively the
pending interrupt is destined for the host only.
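
The resulting flow under the exclusive kvm->mmu_lock is roughly (error
handling elided; the hunks below are authoritative):

	/* 1. Block the 2MB range in the S-EPT via tdh_mem_range_block(). */
	tdx_sept_zap_private_spte(kvm, gfn, PG_LEVEL_2M, page);
	/*
	 * 2. Bump the TD's TLB tracking epoch and kick vCPUs out of the
	 *    guest so stale translations are flushed on re-entry.
	 */
	tdx_track(kvm);
	/*
	 * 3. Demote the 2MB S-EPT leaf into 512 4KB leaves, retrying on
	 *    TDX_INTERRUPTED_RESTARTABLE and retrying once more with
	 *    vCPUs kicked out if the first attempt hits TDX_OPERAND_BUSY.
	 */
	tdx_spte_demote_private_spte(kvm, gfn, PG_LEVEL_2M, page);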

[Yan: Split patch for exclusive mmu_lock only, handled busy error]

Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
---
 arch/x86/kvm/vmx/main.c      |  1 +
 arch/x86/kvm/vmx/tdx.c       | 45 ++++++++++++++++++++++++++++++++++++
 arch/x86/kvm/vmx/tdx_errno.h |  1 +
 arch/x86/kvm/vmx/x86_ops.h   |  9 ++++++++
 4 files changed, 56 insertions(+)

diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index ae8540576821..16c0c31dd066 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -62,6 +62,7 @@ static __init int vt_hardware_setup(void)
 		vt_x86_ops.set_external_spte = tdx_sept_set_private_spte;
 		vt_x86_ops.free_external_spt = tdx_sept_free_private_spt;
 		vt_x86_ops.remove_external_spte = tdx_sept_remove_private_spte;
+		vt_x86_ops.split_external_spt = tdx_sept_split_private_spt;
 		vt_x86_ops.protected_apic_has_interrupt = tdx_protected_apic_has_interrupt;
 	}
 
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index dd63a634e633..4386e1a0323e 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1806,6 +1806,51 @@ int tdx_sept_free_private_spt(struct kvm *kvm, gfn_t gfn,
 	return tdx_reclaim_page(virt_to_page(private_spt), PG_LEVEL_4K);
 }
 
+static int tdx_spte_demote_private_spte(struct kvm *kvm, gfn_t gfn,
+					enum pg_level level, struct page *page)
+{
+	int tdx_level = pg_level_to_tdx_sept_level(level);
+	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
+	gpa_t gpa = gfn_to_gpa(gfn);
+	u64 err, entry, level_state;
+
+	do {
+		err = tdh_mem_page_demote(&kvm_tdx->td, gpa, tdx_level, page,
+					  &entry, &level_state);
+	} while (err == TDX_INTERRUPTED_RESTARTABLE);
+
+	if (unlikely(tdx_operand_busy(err))) {
+		tdx_no_vcpus_enter_start(kvm);
+		err = tdh_mem_page_demote(&kvm_tdx->td, gpa, tdx_level, page,
+					  &entry, &level_state);
+		tdx_no_vcpus_enter_stop(kvm);
+	}
+
+	if (KVM_BUG_ON(err, kvm)) {
+		pr_tdx_error_2(TDH_MEM_PAGE_DEMOTE, err, entry, level_state);
+		return -EIO;
+	}
+	return 0;
+}
+
+int tdx_sept_split_private_spt(struct kvm *kvm, gfn_t gfn, enum pg_level level,
+			       void *private_spt)
+{
+	struct page *page = virt_to_page(private_spt);
+	int ret;
+
+	if (KVM_BUG_ON(to_kvm_tdx(kvm)->state != TD_STATE_RUNNABLE || level != PG_LEVEL_2M, kvm))
+		return -EINVAL;
+
+	ret = tdx_sept_zap_private_spte(kvm, gfn, level, page);
+	if (ret <= 0)
+		return ret;
+
+	tdx_track(kvm);
+
+	return tdx_spte_demote_private_spte(kvm, gfn, level, page);
+}
+
 int tdx_sept_remove_private_spte(struct kvm *kvm, gfn_t gfn,
 				 enum pg_level level, kvm_pfn_t pfn)
 {
diff --git a/arch/x86/kvm/vmx/tdx_errno.h b/arch/x86/kvm/vmx/tdx_errno.h
index 6ff4672c4181..33589e7fa1e1 100644
--- a/arch/x86/kvm/vmx/tdx_errno.h
+++ b/arch/x86/kvm/vmx/tdx_errno.h
@@ -14,6 +14,7 @@
 #define TDX_NON_RECOVERABLE_TD_NON_ACCESSIBLE	0x6000000500000000ULL
 #define TDX_NON_RECOVERABLE_TD_WRONG_APIC_MODE	0x6000000700000000ULL
 #define TDX_INTERRUPTED_RESUMABLE		0x8000000300000000ULL
+#define TDX_INTERRUPTED_RESTARTABLE		0x8000000400000000ULL
 #define TDX_OPERAND_INVALID			0xC000010000000000ULL
 #define TDX_OPERAND_BUSY			0x8000020000000000ULL
 #define TDX_PREVIOUS_TLB_EPOCH_BUSY		0x8000020100000000ULL
diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
index 7c183da7c4d4..df7d4cd1436c 100644
--- a/arch/x86/kvm/vmx/x86_ops.h
+++ b/arch/x86/kvm/vmx/x86_ops.h
@@ -158,6 +158,8 @@ int tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn,
 			      enum pg_level level, kvm_pfn_t pfn);
 int tdx_sept_remove_private_spte(struct kvm *kvm, gfn_t gfn,
 				 enum pg_level level, kvm_pfn_t pfn);
+int tdx_sept_split_private_spt(struct kvm *kvm, gfn_t gfn, enum pg_level level,
+			       void *private_spt);
 
 void tdx_flush_tlb_current(struct kvm_vcpu *vcpu);
 void tdx_flush_tlb_all(struct kvm_vcpu *vcpu);
@@ -224,6 +226,13 @@ static inline int tdx_sept_remove_private_spte(struct kvm *kvm, gfn_t gfn,
 	return -EOPNOTSUPP;
 }
 
+static inline int tdx_sept_split_private_spt(struct kvm *kvm, gfn_t gfn,
+					     enum pg_level level,
+					     void *private_spt)
+{
+	return -EOPNOTSUPP;
+}
+
 static inline void tdx_flush_tlb_current(struct kvm_vcpu *vcpu) {}
 static inline void tdx_flush_tlb_all(struct kvm_vcpu *vcpu) {}
 static inline void tdx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa, int root_level) {}
-- 
2.43.2


^ permalink raw reply related	[flat|nested] 294+ messages in thread

* [RFC PATCH 16/21] KVM: x86/mmu: Introduce kvm_split_boundary_leafs() to split boundary leafs
  2025-04-24  3:00 [RFC PATCH 00/21] KVM: TDX huge page support for private memory Yan Zhao
                   ` (14 preceding siblings ...)
  2025-04-24  3:08 ` [RFC PATCH 15/21] KVM: TDX: Support huge page splitting with exclusive kvm->mmu_lock Yan Zhao
@ 2025-04-24  3:08 ` Yan Zhao
  2025-05-13 22:56   ` Edgecombe, Rick P
  2025-04-24  3:08 ` [RFC PATCH 17/21] KVM: Change the return type of gfn_handler_t() from bool to int Yan Zhao
                   ` (5 subsequent siblings)
  21 siblings, 1 reply; 294+ messages in thread
From: Yan Zhao @ 2025-04-24  3:08 UTC (permalink / raw)
  To: pbonzini, seanjc
  Cc: linux-kernel, kvm, x86, rick.p.edgecombe, dave.hansen,
	kirill.shutemov, tabba, ackerleytng, quic_eberman, michael.roth,
	david, vannapurve, vbabka, jroedel, thomas.lendacky, pgonda,
	zhiquan1.li, fan.du, jun.miao, ira.weiny, isaku.yamahata,
	xiaoyao.li, binbin.wu, chao.p.peng, Yan Zhao

Introduce kvm_split_boundary_leafs() to manage the splitting of boundary
leafs within the mirror root.

Before zapping a specific GFN range in the mirror root, split any huge leaf
that intersects with the boundary of the GFN range to ensure that the
subsequent zap operation does not impact any GFN outside the specified
range. This is crucial for the mirror root as the private page table
requires the guest's ACCEPT operation after faulting back a GFN.
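
As a concrete example of a boundary leaf (illustrative numbers only):
with a 2MB leaf at iter->gfn == 0x400 (covering GFNs 0x400-0x5ff) and a
zap range of [0x500, 0x800), the new iter_split_required() helper reports
that a split is needed because the leaf is not fully contained:

	/* start = 0x500, end = 0x800, iter->gfn = 0x400, 2MB level */
	iter->gfn >= start					/* false */
	iter->gfn + KVM_PAGES_PER_HPAGE(iter->level) <= end	/* true  */
	/* "fully contained" requires both, so the leaf must be split */

Zapping the 2MB leaf whole instead would also drop GFNs 0x400-0x4ff,
which the guest would then have to re-ACCEPT after faulting them back.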

This function should be called while kvm->mmu_lock is held for writing.
The kvm->mmu_lock is temporarily released to allocate memory for the sp
used for splitting. The only expected error is -ENOMEM.

Opportunistically, WARN in tdp_mmu_zap_leafs() if zapping a huge leaf in
the mirror root affects a GFN outside the specified range.

Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
---
 arch/x86/kvm/mmu/mmu.c     |  21 +++++++
 arch/x86/kvm/mmu/tdp_mmu.c | 116 ++++++++++++++++++++++++++++++++++++-
 arch/x86/kvm/mmu/tdp_mmu.h |   1 +
 include/linux/kvm_host.h   |   1 +
 4 files changed, 136 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 0e227199d73e..0d49c69b6b55 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -1640,6 +1640,27 @@ static bool __kvm_rmap_zap_gfn_range(struct kvm *kvm,
 				 start, end - 1, can_yield, true, flush);
 }
 
+/*
+ * Split large leafs at the boundary of the specified range for the mirror root
+ *
+ * Return value:
+ * 0 : success, no flush is required;
+ * 1 : success, flush is required;
+ * <0: failure.
+ */
+int kvm_split_boundary_leafs(struct kvm *kvm, struct kvm_gfn_range *range)
+{
+	int ret = 0;
+
+	lockdep_assert_once(kvm->mmu_invalidate_in_progress ||
+			    lockdep_is_held(&kvm->slots_lock));
+
+	if (tdp_mmu_enabled)
+		ret = kvm_tdp_mmu_gfn_range_split_boundary(kvm, range);
+
+	return ret;
+}
+
 bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
 {
 	bool flush = false;
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 0f683753a7bb..d3fba5d11ea2 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -324,6 +324,8 @@ static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
 				u64 old_spte, u64 new_spte, int level,
 				bool shared);
 
+static int tdp_mmu_split_huge_page(struct kvm *kvm, struct tdp_iter *iter,
+				   struct kvm_mmu_page *sp, bool shared);
 static struct kvm_mmu_page *tdp_mmu_alloc_sp_for_split(bool mirror);
 static void *get_external_spt(gfn_t gfn, u64 new_spte, int level);
 
@@ -962,6 +964,19 @@ bool kvm_tdp_mmu_zap_sp(struct kvm *kvm, struct kvm_mmu_page *sp)
 	return true;
 }
 
+static inline bool iter_split_required(struct kvm *kvm, struct kvm_mmu_page *root,
+				       struct tdp_iter *iter, gfn_t start, gfn_t end)
+{
+	if (!is_mirror_sp(root) || !is_large_pte(iter->old_spte))
+		return false;
+
+	/* Fully contained, no need to split */
+	if (iter->gfn >= start && iter->gfn + KVM_PAGES_PER_HPAGE(iter->level) <= end)
+		return false;
+
+	return true;
+}
+
 /*
  * If can_yield is true, will release the MMU lock and reschedule if the
  * scheduler needs the CPU or there is contention on the MMU lock. If this
@@ -991,6 +1006,8 @@ static bool tdp_mmu_zap_leafs(struct kvm *kvm, struct kvm_mmu_page *root,
 		    !is_last_spte(iter.old_spte, iter.level))
 			continue;
 
+		WARN_ON_ONCE(iter_split_required(kvm, root, &iter, start, end));
+
 		tdp_mmu_iter_set_spte(kvm, &iter, SHADOW_NONPRESENT_VALUE);
 
 		/*
@@ -1246,9 +1263,6 @@ static int tdp_mmu_link_sp(struct kvm *kvm, struct tdp_iter *iter,
 	return 0;
 }
 
-static int tdp_mmu_split_huge_page(struct kvm *kvm, struct tdp_iter *iter,
-				   struct kvm_mmu_page *sp, bool shared);
-
 /*
  * Handle a TDP page fault (NPT/EPT violation/misconfiguration) by installing
  * page tables and SPTEs to translate the faulting guest physical address.
@@ -1341,6 +1355,102 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 	return ret;
 }
 
+/*
+ * Split large leafs at the boundary of the specified range for the mirror root
+ */
+static int tdp_mmu_split_boundary_leafs(struct kvm *kvm, struct kvm_mmu_page *root,
+					gfn_t start, gfn_t end, bool can_yield, bool *flush)
+{
+	struct kvm_mmu_page *sp = NULL;
+	struct tdp_iter iter;
+
+	WARN_ON_ONCE(!can_yield);
+
+	if (!is_mirror_sp(root))
+		return 0;
+
+	end = min(end, tdp_mmu_max_gfn_exclusive());
+
+	lockdep_assert_held_write(&kvm->mmu_lock);
+
+	rcu_read_lock();
+
+	for_each_tdp_pte_min_level(iter, kvm, root, PG_LEVEL_4K, start, end) {
+retry:
+		if (can_yield &&
+		    tdp_mmu_iter_cond_resched(kvm, &iter, *flush, false)) {
+			*flush = false;
+			continue;
+		}
+
+		if (!is_shadow_present_pte(iter.old_spte) ||
+		    !is_last_spte(iter.old_spte, iter.level) ||
+		    !iter_split_required(kvm, root, &iter, start, end))
+			continue;
+
+		if (!sp) {
+			rcu_read_unlock();
+
+			write_unlock(&kvm->mmu_lock);
+
+			sp = tdp_mmu_alloc_sp_for_split(true);
+
+			write_lock(&kvm->mmu_lock);
+
+			if (!sp) {
+				trace_kvm_mmu_split_huge_page(iter.gfn, iter.old_spte,
+							      iter.level, -ENOMEM);
+				return -ENOMEM;
+			}
+			rcu_read_lock();
+
+			iter.yielded = true;
+			continue;
+		}
+		tdp_mmu_init_child_sp(sp, &iter);
+
+		if (tdp_mmu_split_huge_page(kvm, &iter, sp, false))
+			goto retry;
+
+		sp = NULL;
+		/*
+		 * Set yielded in case after splitting to a lower level,
+		 * the new iter requires further splitting.
+		 */
+		iter.yielded = true;
+		*flush = true;
+	}
+
+	rcu_read_unlock();
+
+	/* Leave this here even though it should be impossible for the mirror root */
+	if (sp)
+		tdp_mmu_free_sp(sp);
+	return 0;
+}
+
+int kvm_tdp_mmu_gfn_range_split_boundary(struct kvm *kvm, struct kvm_gfn_range *range)
+{
+	enum kvm_tdp_mmu_root_types types;
+	struct kvm_mmu_page *root;
+	bool flush = false;
+	int ret;
+
+	types = kvm_gfn_range_filter_to_root_types(kvm, range->attr_filter) | KVM_INVALID_ROOTS;
+
+	__for_each_tdp_mmu_root_yield_safe(kvm, root, range->slot->as_id, types) {
+		ret = tdp_mmu_split_boundary_leafs(kvm, root, range->start, range->end,
+						   range->may_block, &flush);
+		if (ret < 0) {
+			if (flush)
+				kvm_flush_remote_tlbs(kvm);
+
+			return ret;
+		}
+	}
+	return flush;
+}
+
 /* Used by mmu notifier via kvm_unmap_gfn_range() */
 bool kvm_tdp_mmu_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range,
 				 bool flush)
diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
index 52acf99d40a0..806a21d4f0e3 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.h
+++ b/arch/x86/kvm/mmu/tdp_mmu.h
@@ -69,6 +69,7 @@ void kvm_tdp_mmu_zap_all(struct kvm *kvm);
 void kvm_tdp_mmu_invalidate_roots(struct kvm *kvm,
 				  enum kvm_tdp_mmu_root_types root_types);
 void kvm_tdp_mmu_zap_invalidated_roots(struct kvm *kvm, bool shared);
+int kvm_tdp_mmu_gfn_range_split_boundary(struct kvm *kvm, struct kvm_gfn_range *range);
 
 int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault);
 
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 655d36e1f4db..19d7a577e7ed 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -272,6 +272,7 @@ struct kvm_gfn_range {
 bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range);
 bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
 bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
+int kvm_split_boundary_leafs(struct kvm *kvm, struct kvm_gfn_range *range);
 #endif
 
 enum {
-- 
2.43.2


^ permalink raw reply related	[flat|nested] 294+ messages in thread

* [RFC PATCH 17/21] KVM: Change the return type of gfn_handler_t() from bool to int
  2025-04-24  3:00 [RFC PATCH 00/21] KVM: TDX huge page support for private memory Yan Zhao
                   ` (15 preceding siblings ...)
  2025-04-24  3:08 ` [RFC PATCH 16/21] KVM: x86/mmu: Introduce kvm_split_boundary_leafs() to split boundary leafs Yan Zhao
@ 2025-04-24  3:08 ` Yan Zhao
  2025-04-24  3:08 ` [RFC PATCH 18/21] KVM: x86: Split huge boundary leafs before private to shared conversion Yan Zhao
                   ` (4 subsequent siblings)
  21 siblings, 0 replies; 294+ messages in thread
From: Yan Zhao @ 2025-04-24  3:08 UTC (permalink / raw)
  To: pbonzini, seanjc
  Cc: linux-kernel, kvm, x86, rick.p.edgecombe, dave.hansen,
	kirill.shutemov, tabba, ackerleytng, quic_eberman, michael.roth,
	david, vannapurve, vbabka, jroedel, thomas.lendacky, pgonda,
	zhiquan1.li, fan.du, jun.miao, ira.weiny, isaku.yamahata,
	xiaoyao.li, binbin.wu, chao.p.peng, Yan Zhao

Modify the return type of gfn_handler_t() from bool to int. A negative
return value indicates failure, while a return value of 1 signifies success
with a flush required, and 0 denotes success without a flush required.
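
For callers, the convention reads as follows (a tiny illustrative
pattern, matching the kvm_handle_gfn_range() hunk below):

	ret = range->handler(kvm, &gfn_range);
	if (ret < 0)
		goto err;	/* failure */
	flush |= ret;		/* 1: flush required, 0: no flush needed */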

This adjustment prepares for a future change that will enable
kvm_pre_set_memory_attributes() to fail.

No functional changes expected.

Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
---
 arch/arm64/kvm/mmu.c             |  4 ++--
 arch/loongarch/kvm/mmu.c         |  4 ++--
 arch/mips/kvm/mmu.c              |  4 ++--
 arch/powerpc/kvm/book3s.c        |  4 ++--
 arch/powerpc/kvm/e500_mmu_host.c |  4 ++--
 arch/riscv/kvm/mmu.c             |  4 ++--
 arch/x86/kvm/mmu/mmu.c           | 12 ++++++------
 include/linux/kvm_host.h         | 12 ++++++------
 virt/kvm/kvm_main.c              | 25 ++++++++++++++++---------
 9 files changed, 40 insertions(+), 33 deletions(-)

diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 754f2fe0cc67..4bd8f61e9319 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -1973,7 +1973,7 @@ bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
 	return false;
 }
 
-bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
+int kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
 {
 	u64 size = (range->end - range->start) << PAGE_SHIFT;
 
@@ -1989,7 +1989,7 @@ bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
 	 */
 }
 
-bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
+int kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
 {
 	u64 size = (range->end - range->start) << PAGE_SHIFT;
 
diff --git a/arch/loongarch/kvm/mmu.c b/arch/loongarch/kvm/mmu.c
index 4d203294767c..5e97fee941b9 100644
--- a/arch/loongarch/kvm/mmu.c
+++ b/arch/loongarch/kvm/mmu.c
@@ -511,7 +511,7 @@ bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
 			range->end << PAGE_SHIFT, &ctx);
 }
 
-bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
+int kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
 {
 	kvm_ptw_ctx ctx;
 
@@ -523,7 +523,7 @@ bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
 				range->end << PAGE_SHIFT, &ctx);
 }
 
-bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
+int kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
 {
 	gpa_t gpa = range->start << PAGE_SHIFT;
 	kvm_pte_t *ptep = kvm_populate_gpa(kvm, NULL, gpa, 0);
diff --git a/arch/mips/kvm/mmu.c b/arch/mips/kvm/mmu.c
index d2c3b6b41f18..2df3a53e23e9 100644
--- a/arch/mips/kvm/mmu.c
+++ b/arch/mips/kvm/mmu.c
@@ -444,12 +444,12 @@ bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
 	return true;
 }
 
-bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
+int kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
 {
 	return kvm_mips_mkold_gpa_pt(kvm, range->start, range->end);
 }
 
-bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
+int kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
 {
 	gpa_t gpa = range->start << PAGE_SHIFT;
 	pte_t *gpa_pte = kvm_mips_pte_for_gpa(kvm, NULL, gpa);
diff --git a/arch/powerpc/kvm/book3s.c b/arch/powerpc/kvm/book3s.c
index d79c5d1098c0..9bf6e1cf64f1 100644
--- a/arch/powerpc/kvm/book3s.c
+++ b/arch/powerpc/kvm/book3s.c
@@ -886,12 +886,12 @@ bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
 	return kvm->arch.kvm_ops->unmap_gfn_range(kvm, range);
 }
 
-bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
+int kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
 {
 	return kvm->arch.kvm_ops->age_gfn(kvm, range);
 }
 
-bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
+int kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
 {
 	return kvm->arch.kvm_ops->test_age_gfn(kvm, range);
 }
diff --git a/arch/powerpc/kvm/e500_mmu_host.c b/arch/powerpc/kvm/e500_mmu_host.c
index 06caf8bbbe2b..debe1ecb4bfd 100644
--- a/arch/powerpc/kvm/e500_mmu_host.c
+++ b/arch/powerpc/kvm/e500_mmu_host.c
@@ -697,13 +697,13 @@ bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
 	return kvm_e500_mmu_unmap_gfn(kvm, range);
 }
 
-bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
+int kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
 {
 	/* XXX could be more clever ;) */
 	return false;
 }
 
-bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
+int kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
 {
 	/* XXX could be more clever ;) */
 	return false;
diff --git a/arch/riscv/kvm/mmu.c b/arch/riscv/kvm/mmu.c
index 1087ea74567b..581bd1bc6675 100644
--- a/arch/riscv/kvm/mmu.c
+++ b/arch/riscv/kvm/mmu.c
@@ -550,7 +550,7 @@ bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
 	return false;
 }
 
-bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
+int kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
 {
 	pte_t *ptep;
 	u32 ptep_level = 0;
@@ -568,7 +568,7 @@ bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
 	return ptep_test_and_clear_young(NULL, 0, ptep);
 }
 
-bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
+int kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
 {
 	pte_t *ptep;
 	u32 ptep_level = 0;
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 0d49c69b6b55..ba993445a00e 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -1778,7 +1778,7 @@ static bool kvm_may_have_shadow_mmu_sptes(struct kvm *kvm)
 	return !tdp_mmu_enabled || READ_ONCE(kvm->arch.indirect_shadow_pages);
 }
 
-bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
+int kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
 {
 	bool young = false;
 
@@ -1791,7 +1791,7 @@ bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
 	return young;
 }
 
-bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
+int kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
 {
 	bool young = false;
 
@@ -7691,8 +7691,8 @@ void kvm_mmu_pre_destroy_vm(struct kvm *kvm)
 }
 
 #ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES
-bool kvm_arch_pre_set_memory_attributes(struct kvm *kvm,
-					struct kvm_gfn_range *range)
+int kvm_arch_pre_set_memory_attributes(struct kvm *kvm,
+				       struct kvm_gfn_range *range)
 {
 	/*
 	 * Zap SPTEs even if the slot can't be mapped PRIVATE.  KVM x86 only
@@ -7752,8 +7752,8 @@ static bool hugepage_has_attrs(struct kvm *kvm, struct kvm_memory_slot *slot,
 	return true;
 }
 
-bool kvm_arch_post_set_memory_attributes(struct kvm *kvm,
-					 struct kvm_gfn_range *range)
+int kvm_arch_post_set_memory_attributes(struct kvm *kvm,
+					struct kvm_gfn_range *range)
 {
 	unsigned long attrs = range->arg.attributes;
 	struct kvm_memory_slot *slot = range->slot;
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 19d7a577e7ed..ec47f2374fdf 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -270,8 +270,8 @@ struct kvm_gfn_range {
 	bool lockless;
 };
 bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range);
-bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
-bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
+int kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
+int kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
 int kvm_split_boundary_leafs(struct kvm *kvm, struct kvm_gfn_range *range);
 #endif
 
@@ -1526,7 +1526,7 @@ void *kvm_mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc);
 void kvm_mmu_invalidate_begin(struct kvm *kvm);
 void kvm_mmu_invalidate_range_add(struct kvm *kvm, gfn_t start, gfn_t end);
 void kvm_mmu_invalidate_end(struct kvm *kvm);
-bool kvm_mmu_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range);
+int kvm_mmu_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range);
 
 long kvm_arch_dev_ioctl(struct file *filp,
 			unsigned int ioctl, unsigned long arg);
@@ -2504,10 +2504,10 @@ static inline unsigned long kvm_get_memory_attributes(struct kvm *kvm, gfn_t gfn
 
 bool kvm_range_has_memory_attributes(struct kvm *kvm, gfn_t start, gfn_t end,
 				     unsigned long mask, unsigned long attrs);
-bool kvm_arch_pre_set_memory_attributes(struct kvm *kvm,
+int kvm_arch_pre_set_memory_attributes(struct kvm *kvm,
+				       struct kvm_gfn_range *range);
+int kvm_arch_post_set_memory_attributes(struct kvm *kvm,
 					struct kvm_gfn_range *range);
-bool kvm_arch_post_set_memory_attributes(struct kvm *kvm,
-					 struct kvm_gfn_range *range);
 
 static inline bool kvm_mem_is_private(struct kvm *kvm, gfn_t gfn)
 {
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 5ea1c442e339..72bd98c100cf 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -511,7 +511,7 @@ static inline struct kvm *mmu_notifier_to_kvm(struct mmu_notifier *mn)
 	return container_of(mn, struct kvm, mmu_notifier);
 }
 
-typedef bool (*gfn_handler_t)(struct kvm *kvm, struct kvm_gfn_range *range);
+typedef int (*gfn_handler_t)(struct kvm *kvm, struct kvm_gfn_range *range);
 
 typedef void (*on_lock_fn_t)(struct kvm *kvm);
 
@@ -595,6 +595,7 @@ static __always_inline kvm_mn_ret_t kvm_handle_hva_range(struct kvm *kvm,
 		kvm_for_each_memslot_in_hva_range(node, slots,
 						  range->start, range->end - 1) {
 			unsigned long hva_start, hva_end;
+			int ret;
 
 			slot = container_of(node, struct kvm_memory_slot, hva_node[slots->node_idx]);
 			hva_start = max_t(unsigned long, range->start, slot->userspace_addr);
@@ -635,7 +636,9 @@ static __always_inline kvm_mn_ret_t kvm_handle_hva_range(struct kvm *kvm,
 						goto mmu_unlock;
 				}
 			}
-			r.ret |= range->handler(kvm, &gfn_range);
+			ret = range->handler(kvm, &gfn_range);
+			WARN_ON_ONCE(ret < 0);
+			r.ret |= ret;
 		}
 	}
 
@@ -721,7 +724,7 @@ void kvm_mmu_invalidate_range_add(struct kvm *kvm, gfn_t start, gfn_t end)
 	}
 }
 
-bool kvm_mmu_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
+int kvm_mmu_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
 {
 	kvm_mmu_invalidate_range_add(kvm, range->start, range->end);
 	return kvm_unmap_gfn_range(kvm, range);
@@ -2413,7 +2416,8 @@ static __always_inline void kvm_handle_gfn_range(struct kvm *kvm,
 	struct kvm_memslots *slots;
 	struct kvm_memslot_iter iter;
 	bool found_memslot = false;
-	bool ret = false;
+	bool flush = false;
+	int ret = 0;
 	int i;
 
 	gfn_range.arg = range->arg;
@@ -2446,19 +2450,22 @@ static __always_inline void kvm_handle_gfn_range(struct kvm *kvm,
 					range->on_lock(kvm);
 			}
 
-			ret |= range->handler(kvm, &gfn_range);
+			ret = range->handler(kvm, &gfn_range);
+			if (ret < 0)
+				goto err;
+			flush |= ret;
 		}
 	}
-
-	if (range->flush_on_ret && ret)
+err:
+	if (range->flush_on_ret && flush)
 		kvm_flush_remote_tlbs(kvm);
 
 	if (found_memslot)
 		KVM_MMU_UNLOCK(kvm);
 }
 
-static bool kvm_pre_set_memory_attributes(struct kvm *kvm,
-					  struct kvm_gfn_range *range)
+static int kvm_pre_set_memory_attributes(struct kvm *kvm,
+					 struct kvm_gfn_range *range)
 {
 	/*
 	 * Unconditionally add the range to the invalidation set, regardless of
-- 
2.43.2



* [RFC PATCH 18/21] KVM: x86: Split huge boundary leafs before private to shared conversion
  2025-04-24  3:00 [RFC PATCH 00/21] KVM: TDX huge page support for private memory Yan Zhao
                   ` (16 preceding siblings ...)
  2025-04-24  3:08 ` [RFC PATCH 17/21] KVM: Change the return type of gfn_handler_t() from bool to int Yan Zhao
@ 2025-04-24  3:08 ` Yan Zhao
  2025-05-09 23:34   ` Edgecombe, Rick P
  2025-04-24  3:08 ` [RFC PATCH 19/21] KVM: gmem: Split huge boundary leafs for punch hole of private memory Yan Zhao
                   ` (3 subsequent siblings)
  21 siblings, 1 reply; 294+ messages in thread
From: Yan Zhao @ 2025-04-24  3:08 UTC (permalink / raw)
  To: pbonzini, seanjc
  Cc: linux-kernel, kvm, x86, rick.p.edgecombe, dave.hansen,
	kirill.shutemov, tabba, ackerleytng, quic_eberman, michael.roth,
	david, vannapurve, vbabka, jroedel, thomas.lendacky, pgonda,
	zhiquan1.li, fan.du, jun.miao, ira.weiny, isaku.yamahata,
	xiaoyao.li, binbin.wu, chao.p.peng, Yan Zhao

Before converting a GFN range from private to shared, it is necessary to
zap the mirror page table. When huge pages are supported and the GFN range
intersects with a huge leaf, split the huge leaf to prevent zapping GFNs
outside the conversion range.

Invoke kvm_split_boundary_leafs() in kvm_arch_pre_set_memory_attributes()
to split the huge boundary leafs before calling kvm_unmap_gfn_range() to
zap the GFN range that will be converted to shared.

Unlike kvm_unmap_gfn_range(), which cannot fail, kvm_split_boundary_leafs()
may fail when it runs out of memory during splitting. Update
kvm_handle_gfn_range() to propagate the splitting error back to
kvm_vm_set_mem_attributes(), which will then fail the
KVM_SET_MEMORY_ATTRIBUTES ioctl.

The downside of this approach is that although kvm_split_boundary_leafs()
is invoked before kvm_unmap_gfn_range() for each GFN range, the entire
conversion range may consist of several GFN ranges. If an out-of-memory
error occurs during the splitting of a GFN range, some previous GFN ranges
may have been successfully split and zapped, even though their page
attributes remain unchanged due to the splitting failure. This may not be a
big problem as the user can retry the ioctl to split and zap the full
range.
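
For illustration only (not part of this patch): a minimal userspace sketch
of that retry, assuming vm_fd is an open VM fd and the range is being
converted from private to shared; the retry bound is arbitrary.

#include <errno.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/*
 * Convert [gpa, gpa + size) to shared, retrying if the pre-set splitting
 * fails with -ENOMEM.
 */
static int set_shared_with_retry(int vm_fd, __u64 gpa, __u64 size)
{
        struct kvm_memory_attributes attr = {
                .address = gpa,
                .size = size,
                .attributes = 0,        /* i.e. clear KVM_MEMORY_ATTRIBUTE_PRIVATE */
        };
        int retries = 3, ret;

        do {
                ret = ioctl(vm_fd, KVM_SET_MEMORY_ATTRIBUTES, &attr);
        } while (ret < 0 && errno == ENOMEM && retries--);

        return ret;
}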

Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
---
 arch/x86/kvm/mmu/mmu.c | 17 +++++++++++++----
 virt/kvm/kvm_main.c    | 13 +++++++++----
 2 files changed, 22 insertions(+), 8 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index ba993445a00e..1a34e43bd349 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -7694,6 +7694,9 @@ void kvm_mmu_pre_destroy_vm(struct kvm *kvm)
 int kvm_arch_pre_set_memory_attributes(struct kvm *kvm,
 				       struct kvm_gfn_range *range)
 {
+	bool flush = false;
+	int ret;
+
 	/*
 	 * Zap SPTEs even if the slot can't be mapped PRIVATE.  KVM x86 only
 	 * supports KVM_MEMORY_ATTRIBUTE_PRIVATE, and so it *seems* like KVM
@@ -7706,7 +7709,7 @@ int kvm_arch_pre_set_memory_attributes(struct kvm *kvm,
 	 * a hugepage can be used for affected ranges.
 	 */
 	if (WARN_ON_ONCE(!kvm_arch_has_private_mem(kvm)))
-		return false;
+		return 0;
 
 	/* Unmap the old attribute page. */
 	if (range->arg.attributes & KVM_MEMORY_ATTRIBUTE_PRIVATE)
@@ -7714,7 +7717,13 @@ int kvm_arch_pre_set_memory_attributes(struct kvm *kvm,
 	else
 		range->attr_filter = KVM_FILTER_PRIVATE;
 
-	return kvm_unmap_gfn_range(kvm, range);
+	ret = kvm_split_boundary_leafs(kvm, range);
+	if (ret < 0)
+		return ret;
+	flush |= ret;
+
+	flush |= kvm_unmap_gfn_range(kvm, range);
+	return flush;
 }
 
 static bool hugepage_test_mixed(struct kvm_memory_slot *slot, gfn_t gfn,
@@ -7769,7 +7778,7 @@ int kvm_arch_post_set_memory_attributes(struct kvm *kvm,
 	 * SHARED may now allow hugepages.
 	 */
 	if (WARN_ON_ONCE(!kvm_arch_has_private_mem(kvm)))
-		return false;
+		return 0;
 
 	/*
 	 * The sequence matters here: upper levels consume the result of lower
@@ -7816,7 +7825,7 @@ int kvm_arch_post_set_memory_attributes(struct kvm *kvm,
 				hugepage_set_mixed(slot, gfn, level);
 		}
 	}
-	return false;
+	return 0;
 }
 
 void kvm_mmu_init_memslot_memory_attributes(struct kvm *kvm,
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 72bd98c100cf..6d9b82890f15 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2408,8 +2408,8 @@ bool kvm_range_has_memory_attributes(struct kvm *kvm, gfn_t start, gfn_t end,
 	return true;
 }
 
-static __always_inline void kvm_handle_gfn_range(struct kvm *kvm,
-						 struct kvm_mmu_notifier_range *range)
+static __always_inline int kvm_handle_gfn_range(struct kvm *kvm,
+						struct kvm_mmu_notifier_range *range)
 {
 	struct kvm_gfn_range gfn_range;
 	struct kvm_memory_slot *slot;
@@ -2462,6 +2462,8 @@ static __always_inline void kvm_handle_gfn_range(struct kvm *kvm,
 
 	if (found_memslot)
 		KVM_MMU_UNLOCK(kvm);
+
+	return ret < 0 ? ret : 0;
 }
 
 static int kvm_pre_set_memory_attributes(struct kvm *kvm,
@@ -2526,7 +2528,9 @@ static int kvm_vm_set_mem_attributes(struct kvm *kvm, gfn_t start, gfn_t end,
 			goto out_unlock;
 	}
 
-	kvm_handle_gfn_range(kvm, &pre_set_range);
+	r = kvm_handle_gfn_range(kvm, &pre_set_range);
+	if (r)
+		goto out_unlock;
 
 	for (i = start; i < end; i++) {
 		r = xa_err(xa_store(&kvm->mem_attr_array, i, entry,
@@ -2534,7 +2538,8 @@ static int kvm_vm_set_mem_attributes(struct kvm *kvm, gfn_t start, gfn_t end,
 		KVM_BUG_ON(r, kvm);
 	}
 
-	kvm_handle_gfn_range(kvm, &post_set_range);
+	r = kvm_handle_gfn_range(kvm, &post_set_range);
+	KVM_BUG_ON(r, kvm);
 
 out_unlock:
 	mutex_unlock(&kvm->slots_lock);
-- 
2.43.2



* [RFC PATCH 19/21] KVM: gmem: Split huge boundary leafs for punch hole of private memory
  2025-04-24  3:00 [RFC PATCH 00/21] KVM: TDX huge page support for private memory Yan Zhao
                   ` (17 preceding siblings ...)
  2025-04-24  3:08 ` [RFC PATCH 18/21] KVM: x86: Split huge boundary leafs before private to shared conversion Yan Zhao
@ 2025-04-24  3:08 ` Yan Zhao
  2025-04-24 10:19   ` Francesco Lavra
  2025-05-13 22:59   ` Edgecombe, Rick P
  2025-04-24  3:09 ` [RFC PATCH 20/21] KVM: x86: Force a prefetch fault's max mapping level to 4KB for TDX Yan Zhao
                   ` (2 subsequent siblings)
  21 siblings, 2 replies; 294+ messages in thread
From: Yan Zhao @ 2025-04-24  3:08 UTC (permalink / raw)
  To: pbonzini, seanjc
  Cc: linux-kernel, kvm, x86, rick.p.edgecombe, dave.hansen,
	kirill.shutemov, tabba, ackerleytng, quic_eberman, michael.roth,
	david, vannapurve, vbabka, jroedel, thomas.lendacky, pgonda,
	zhiquan1.li, fan.du, jun.miao, ira.weiny, isaku.yamahata,
	xiaoyao.li, binbin.wu, chao.p.peng, Yan Zhao

Split huge leafs in the mirror page table for kvm_gmem_punch_hole().

Enhance kvm_gmem_invalidate_begin() to invoke kvm_split_boundary_leafs() to
split boundary huge leafs before calling kvm_unmap_gfn_range() to do the
real zapping. As kvm_split_boundary_leafs() may fail when it runs out of
memory, propagate the error so that kvm_gmem_punch_hole() fails as well.

Splitting huge boundary leafs in the mirror page table is not required for
kvm_gmem_release(), as the entire page table is to be zapped; it's also not
required for kvm_gmem_error_folio(), as an SPTE must not map more than one
physical folio.

Note: as the kvm_gmem_punch_hole() may request to zap several GFN ranges,
if an out-of-memory error occurs during the splitting of a GFN range, some
previous GFN ranges may have been successfully split and zapped.
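
For illustration only (not part of this patch): a minimal userspace sketch
of such a punch hole, assuming gmem_fd is a guest_memfd created via
KVM_CREATE_GUEST_MEMFD; the retry-on-ENOMEM policy is just an example.

#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>

/*
 * Punch a hole in [offset, offset + len) of a guest_memfd, retrying if
 * splitting the boundary huge leafs fails with -ENOMEM.
 */
static int gmem_punch_hole(int gmem_fd, off_t offset, off_t len)
{
        int retries = 3, ret;

        do {
                ret = fallocate(gmem_fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                                offset, len);
        } while (ret < 0 && errno == ENOMEM && retries--);

        return ret;
}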

Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
---
 virt/kvm/guest_memfd.c | 30 +++++++++++++++++++++++-------
 1 file changed, 23 insertions(+), 7 deletions(-)

diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index 4bb140e7f30d..008061734ac5 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -292,13 +292,14 @@ static struct folio *kvm_gmem_get_folio(struct inode *inode, pgoff_t index, int
 	return folio;
 }
 
-static void kvm_gmem_invalidate_begin(struct kvm_gmem *gmem, pgoff_t start,
-				      pgoff_t end)
+static int kvm_gmem_invalidate_begin(struct kvm_gmem *gmem, pgoff_t start,
+				     pgoff_t end, bool need_split)
 {
 	bool flush = false, found_memslot = false;
 	struct kvm_memory_slot *slot;
 	struct kvm *kvm = gmem->kvm;
 	unsigned long index;
+	int ret = 0;
 
 	xa_for_each_range(&gmem->bindings, index, slot, start, end - 1) {
 		pgoff_t pgoff = slot->gmem.pgoff;
@@ -319,14 +320,23 @@ static void kvm_gmem_invalidate_begin(struct kvm_gmem *gmem, pgoff_t start,
 			kvm_mmu_invalidate_begin(kvm);
 		}
 
+		if (need_split) {
+			ret = kvm_split_boundary_leafs(kvm, &gfn_range);
+			if (ret < 0)
+				goto out;
+
+			flush |= ret;
+		}
 		flush |= kvm_mmu_unmap_gfn_range(kvm, &gfn_range);
 	}
 
+out:
 	if (flush)
 		kvm_flush_remote_tlbs(kvm);
 
 	if (found_memslot)
 		KVM_MMU_UNLOCK(kvm);
+	return ret < 0 ? ret : 0;
 }
 
 static void kvm_gmem_invalidate_end(struct kvm_gmem *gmem, pgoff_t start,
@@ -347,6 +357,7 @@ static long kvm_gmem_punch_hole(struct inode *inode, loff_t offset, loff_t len)
 	loff_t size = i_size_read(inode);
 	pgoff_t start, end;
 	struct kvm_gmem *gmem;
+	int ret = 0;
 
 	if (offset > size)
 		return 0;
@@ -361,18 +372,22 @@ static long kvm_gmem_punch_hole(struct inode *inode, loff_t offset, loff_t len)
 	 */
 	filemap_invalidate_lock(inode->i_mapping);
 
-	list_for_each_entry(gmem, gmem_list, entry)
-		kvm_gmem_invalidate_begin(gmem, start, end);
+	list_for_each_entry(gmem, gmem_list, entry) {
+		ret = kvm_gmem_invalidate_begin(gmem, start, end, true);
+		if (ret < 0)
+			goto out;
+	}
 
 	truncate_inode_pages_range(inode->i_mapping, offset, offset + len - 1);
 	kvm_gmem_mark_range_unprepared(inode, start, end - start);
 
+out:
 	list_for_each_entry(gmem, gmem_list, entry)
 		kvm_gmem_invalidate_end(gmem, start, end);
 
 	filemap_invalidate_unlock(inode->i_mapping);
 
-	return 0;
+	return ret;
 }
 
 static long kvm_gmem_allocate(struct inode *inode, loff_t offset, loff_t len)
@@ -440,7 +455,7 @@ static int kvm_gmem_release(struct inode *inode, struct file *file)
 	 * Zap all SPTEs pointed at by this file.  Do not free the backing
 	 * memory, as its lifetime is associated with the inode, not the file.
 	 */
-	kvm_gmem_invalidate_begin(gmem, 0, -1ul);
+	kvm_gmem_invalidate_begin(gmem, 0, -1ul, false);
 	kvm_gmem_invalidate_end(gmem, 0, -1ul);
 
 	list_del(&gmem->entry);
@@ -524,8 +539,9 @@ static int kvm_gmem_error_folio(struct address_space *mapping, struct folio *fol
 	start = folio->index;
 	end = start + folio_nr_pages(folio);
 
+	/* The size of the SEPT will not exceed the size of the folio */
 	list_for_each_entry(gmem, gmem_list, entry)
-		kvm_gmem_invalidate_begin(gmem, start, end);
+		kvm_gmem_invalidate_begin(gmem, start, end, false);
 
 	/*
 	 * Do not truncate the range, what action is taken in response to the
-- 
2.43.2



* [RFC PATCH 20/21] KVM: x86: Force a prefetch fault's max mapping level to 4KB for TDX
  2025-04-24  3:00 [RFC PATCH 00/21] KVM: TDX huge page support for private memory Yan Zhao
                   ` (18 preceding siblings ...)
  2025-04-24  3:08 ` [RFC PATCH 19/21] KVM: gmem: Split huge boundary leafs for punch hole of private memory Yan Zhao
@ 2025-04-24  3:09 ` Yan Zhao
  2025-05-13 23:20   ` Edgecombe, Rick P
  2025-05-21  3:30   ` Binbin Wu
  2025-04-24  3:09 ` [RFC PATCH 21/21] KVM: x86: Ignore splitting huge pages in fault path " Yan Zhao
  2025-04-24  7:35 ` [RFC PATCH 00/21] KVM: TDX huge page support for private memory Kirill A. Shutemov
  21 siblings, 2 replies; 294+ messages in thread
From: Yan Zhao @ 2025-04-24  3:09 UTC (permalink / raw)
  To: pbonzini, seanjc
  Cc: linux-kernel, kvm, x86, rick.p.edgecombe, dave.hansen,
	kirill.shutemov, tabba, ackerleytng, quic_eberman, michael.roth,
	david, vannapurve, vbabka, jroedel, thomas.lendacky, pgonda,
	zhiquan1.li, fan.du, jun.miao, ira.weiny, isaku.yamahata,
	xiaoyao.li, binbin.wu, chao.p.peng, Yan Zhao

Introduce a "prefetch" parameter to the private_max_mapping_level hook and
enforce the max mapping level of a prefetch fault for private memory to be
4KB. This is a preparation for ignoring huge page splitting in the fault
path.

If a prefetch fault results in a 2MB huge leaf in the mirror page table,
there may not be a vCPU available to accept the corresponding 2MB huge leaf
in the S-EPT if the TD is not configured to receive #VE for page
acceptance. Consequently, if a vCPU accepts the page at 4KB level, it will
trigger an EPT violation to split the 2MB huge leaf generated by the
prefetch fault.

Since handling the BUSY error from SEAMCALLs for huge page splitting is
more involved in the fault path, which runs with kvm->mmu_lock held for
reading, force the max mapping level of a prefetch fault of private memory
to 4KB to prevent potential splitting.

Since prefetch faults for private memory are uncommon after the TD's build
time, enforcing a 4KB mapping level is unlikely to cause any performance
degradation. The max mapping level is already set to 4KB during the TD's
build phase.
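
For illustration only (not part of this patch): after the TD is runnable,
one way to trigger a prefetch fault from userspace is the
KVM_PRE_FAULT_MEMORY vCPU ioctl. A minimal sketch, assuming the current
linux/kvm.h uAPI; handling of partial completion (the kernel may update
gpa/size and require another call) is omitted.

#include <sys/ioctl.h>
#include <linux/kvm.h>

/*
 * Pre-fault [gpa, gpa + size). With this patch, the resulting prefetch
 * faults on private memory are mapped at 4KB in the mirror page table.
 */
static int prefault_range(int vcpu_fd, __u64 gpa, __u64 size)
{
        struct kvm_pre_fault_memory range = {
                .gpa = gpa,
                .size = size,
        };

        return ioctl(vcpu_fd, KVM_PRE_FAULT_MEMORY, &range);
}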

Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
---
 arch/x86/include/asm/kvm_host.h | 3 ++-
 arch/x86/kvm/mmu/mmu.c          | 7 ++++---
 arch/x86/kvm/svm/sev.c          | 3 ++-
 arch/x86/kvm/svm/svm.h          | 5 +++--
 arch/x86/kvm/vmx/main.c         | 5 +++--
 arch/x86/kvm/vmx/tdx.c          | 5 +++--
 arch/x86/kvm/vmx/x86_ops.h      | 4 ++--
 7 files changed, 19 insertions(+), 13 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 6962a8a424ef..5167458742bf 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1900,7 +1900,8 @@ struct kvm_x86_ops {
 	void *(*alloc_apic_backing_page)(struct kvm_vcpu *vcpu);
 	int (*gmem_prepare)(struct kvm *kvm, kvm_pfn_t pfn, gfn_t gfn, int max_order);
 	void (*gmem_invalidate)(kvm_pfn_t start, kvm_pfn_t end);
-	int (*private_max_mapping_level)(struct kvm_vcpu *vcpu, kvm_pfn_t pfn, gfn_t gfn);
+	int (*private_max_mapping_level)(struct kvm_vcpu *vcpu, kvm_pfn_t pfn, gfn_t gfn,
+					 bool prefetch);
 };
 
 struct kvm_x86_nested_ops {
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 1a34e43bd349..94a557e010d3 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4488,7 +4488,7 @@ static inline u8 kvm_max_level_for_order(int order)
 }
 
 static u8 kvm_max_private_mapping_level(struct kvm_vcpu *vcpu, kvm_pfn_t pfn, gfn_t gfn,
-					u8 max_level, int gmem_order)
+					u8 max_level, int gmem_order, bool prefetch)
 {
 	u8 req_max_level;
 
@@ -4499,7 +4499,7 @@ static u8 kvm_max_private_mapping_level(struct kvm_vcpu *vcpu, kvm_pfn_t pfn, gf
 	if (max_level == PG_LEVEL_4K)
 		return PG_LEVEL_4K;
 
-	req_max_level = kvm_x86_call(private_max_mapping_level)(vcpu, pfn, gfn);
+	req_max_level = kvm_x86_call(private_max_mapping_level)(vcpu, pfn, gfn, prefetch);
 	if (req_max_level)
 		max_level = min(max_level, req_max_level);
 
@@ -4532,7 +4532,8 @@ static int kvm_mmu_faultin_pfn_private(struct kvm_vcpu *vcpu,
 
 	fault->map_writable = !(fault->slot->flags & KVM_MEM_READONLY);
 	fault->max_level = kvm_max_private_mapping_level(vcpu, fault->pfn, fault->gfn,
-							 fault->max_level, max_order);
+							 fault->max_level, max_order,
+							 fault->prefetch);
 
 	return RET_PF_CONTINUE;
 }
diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
index dc6cdf9fa1ba..7a9c44ad5b91 100644
--- a/arch/x86/kvm/svm/sev.c
+++ b/arch/x86/kvm/svm/sev.c
@@ -4910,7 +4910,8 @@ void sev_gmem_invalidate(kvm_pfn_t start, kvm_pfn_t end)
 	}
 }
 
-int sev_private_max_mapping_level(struct kvm_vcpu *vcpu, kvm_pfn_t pfn, gfn_t gfn)
+int sev_private_max_mapping_level(struct kvm_vcpu *vcpu, kvm_pfn_t pfn, gfn_t gfn,
+				  bool prefetch)
 {
 	int level, rc;
 	bool assigned;
diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
index 1a9738b6ae37..272a8404e1c0 100644
--- a/arch/x86/kvm/svm/svm.h
+++ b/arch/x86/kvm/svm/svm.h
@@ -782,7 +782,7 @@ void sev_handle_rmp_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code);
 void sev_snp_init_protected_guest_state(struct kvm_vcpu *vcpu);
 int sev_gmem_prepare(struct kvm *kvm, kvm_pfn_t pfn, gfn_t gfn, int max_order);
 void sev_gmem_invalidate(kvm_pfn_t start, kvm_pfn_t end);
-int sev_private_max_mapping_level(struct kvm_vcpu *vcpu, kvm_pfn_t pfn, gfn_t gfn);
+int sev_private_max_mapping_level(struct kvm_vcpu *vcpu, kvm_pfn_t pfn, gfn_t gfn, bool prefetch);
 #else
 static inline struct page *snp_safe_alloc_page_node(int node, gfp_t gfp)
 {
@@ -809,7 +809,8 @@ static inline int sev_gmem_prepare(struct kvm *kvm, kvm_pfn_t pfn, gfn_t gfn, in
 	return 0;
 }
 static inline void sev_gmem_invalidate(kvm_pfn_t start, kvm_pfn_t end) {}
-static inline int sev_private_max_mapping_level(struct kvm_vcpu *vcpu, kvm_pfn_t pfn, gfn_t gfn)
+static inline int sev_private_max_mapping_level(struct kvm_vcpu *vcpu, kvm_pfn_t pfn,
+						gfn_t gfn, bool prefetch)
 {
 	return 0;
 }
diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index 16c0c31dd066..82689ad8bc18 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -881,10 +881,11 @@ static int vt_vcpu_mem_enc_ioctl(struct kvm_vcpu *vcpu, void __user *argp)
 	return tdx_vcpu_ioctl(vcpu, argp);
 }
 
-static int vt_gmem_private_max_mapping_level(struct kvm_vcpu *vcpu, kvm_pfn_t pfn, gfn_t gfn)
+static int vt_gmem_private_max_mapping_level(struct kvm_vcpu *vcpu, kvm_pfn_t pfn,
+					     gfn_t gfn, bool prefetch)
 {
 	if (is_td(vcpu->kvm))
-		return tdx_gmem_private_max_mapping_level(vcpu, pfn, gfn);
+		return tdx_gmem_private_max_mapping_level(vcpu, pfn, gfn, prefetch);
 
 	return 0;
 }
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 4386e1a0323e..e24d1cbcc762 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -3332,11 +3332,12 @@ int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp)
 	return ret;
 }
 
-int tdx_gmem_private_max_mapping_level(struct kvm_vcpu *vcpu, kvm_pfn_t pfn, gfn_t gfn)
+int tdx_gmem_private_max_mapping_level(struct kvm_vcpu *vcpu, kvm_pfn_t pfn,
+				       gfn_t gfn, bool prefetch)
 {
 	struct vcpu_tdx *tdx = to_tdx(vcpu);
 
-	if (unlikely(to_kvm_tdx(vcpu->kvm)->state != TD_STATE_RUNNABLE))
+	if (unlikely((to_kvm_tdx(vcpu->kvm)->state != TD_STATE_RUNNABLE) || prefetch))
 		return PG_LEVEL_4K;
 
 	if (gfn >= tdx->violation_gfn_start && gfn < tdx->violation_gfn_end)
diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
index df7d4cd1436c..0619e9390e5d 100644
--- a/arch/x86/kvm/vmx/x86_ops.h
+++ b/arch/x86/kvm/vmx/x86_ops.h
@@ -164,7 +164,7 @@ int tdx_sept_split_private_spt(struct kvm *kvm, gfn_t gfn, enum pg_level level,
 void tdx_flush_tlb_current(struct kvm_vcpu *vcpu);
 void tdx_flush_tlb_all(struct kvm_vcpu *vcpu);
 void tdx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa, int root_level);
-int tdx_gmem_private_max_mapping_level(struct kvm_vcpu *vcpu, kvm_pfn_t pfn, gfn_t gfn);
+int tdx_gmem_private_max_mapping_level(struct kvm_vcpu *vcpu, kvm_pfn_t pfn, gfn_t gfn, bool prefetch);
 #else
 static inline void tdx_disable_virtualization_cpu(void) {}
 static inline int tdx_vm_init(struct kvm *kvm) { return -EOPNOTSUPP; }
@@ -236,7 +236,7 @@ static inline int tdx_sept_split_private_spt(struct kvm *kvm, gfn_t gfn,
 static inline void tdx_flush_tlb_current(struct kvm_vcpu *vcpu) {}
 static inline void tdx_flush_tlb_all(struct kvm_vcpu *vcpu) {}
 static inline void tdx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa, int root_level) {}
-static inline int tdx_gmem_private_max_mapping_level(struct kvm_vcpu *vcpu, kvm_pfn_t pfn, gfn_t gfn) { return 0; }
+static inline int tdx_gmem_private_max_mapping_level(struct kvm_vcpu *vcpu, kvm_pfn_t pfn, gfn_t gfn, bool prefetch) { return 0; }
 #endif
 
 #endif /* __KVM_X86_VMX_X86_OPS_H */
-- 
2.43.2



* [RFC PATCH 21/21] KVM: x86: Ignore splitting huge pages in fault path for TDX
  2025-04-24  3:00 [RFC PATCH 00/21] KVM: TDX huge page support for private memory Yan Zhao
                   ` (19 preceding siblings ...)
  2025-04-24  3:09 ` [RFC PATCH 20/21] KVM: x86: Force a prefetch fault's max mapping level to 4KB for TDX Yan Zhao
@ 2025-04-24  3:09 ` Yan Zhao
  2025-05-13 21:58   ` Edgecombe, Rick P
  2025-04-24  7:35 ` [RFC PATCH 00/21] KVM: TDX huge page support for private memory Kirill A. Shutemov
  21 siblings, 1 reply; 294+ messages in thread
From: Yan Zhao @ 2025-04-24  3:09 UTC (permalink / raw)
  To: pbonzini, seanjc
  Cc: linux-kernel, kvm, x86, rick.p.edgecombe, dave.hansen,
	kirill.shutemov, tabba, ackerleytng, quic_eberman, michael.roth,
	david, vannapurve, vbabka, jroedel, thomas.lendacky, pgonda,
	zhiquan1.li, fan.du, jun.miao, ira.weiny, isaku.yamahata,
	xiaoyao.li, binbin.wu, chao.p.peng, Yan Zhao

As handling the BUSY error from SEAMCALLs for splitting huge pages is more
involved in the fault path, which runs with kvm->mmu_lock held for reading,
simply ignore huge page splitting requests in the fault path for TDX.

Splitting in the fault path can now be caused by vCPUs' concurrent ACCEPT
operations at different levels, e.g. one vCPU accepts at 2MB level while
another vCPU accepts at 4KB level. As the first vCPU will ultimately accept
the whole 2MB range, it is fine to ignore the mapping request (which leads
to huge page splitting) caused by the second vCPU's 4KB ACCEPT operation.

A rare case that could lead to splitting in the fault path is when a TD is
configured to receive #VE and accesses memory before the ACCEPT operation.
By the time a vCPU accesses a private GFN, since there is no guest
preferred level, KVM could create a mapping at 2MB level. If the TD then
only performs the ACCEPT operation at 4KB level, splitting in the fault
path will be triggered. However, this is not regarded as a typical use
case, as a TD usually accepts pages in the order 1GB->2MB->4KB. The worst
outcome of ignoring the resulting splitting request is an endless EPT
violation. This would not happen for a Linux guest, which does not expect
any #VE.

Ignoring huge page splitting in the fault path is achieved in 3 parts:
1. In the KVM TDP MMU,
   allow splitting huge pages in the fault path for the mirror page table
   and propagate the splitting request to TDX.
2. Enhance the hook split_external_spt by
   passing in the shared/exclusive status of kvm->mmu_lock.
3. In TDX's implementation of the hook split_external_spt,
   do nothing except set the max_level of the next fault on the splitting
   GFN range for the vCPU to 2MB and return -EBUSY.

Then, after tdx_sept_split_private_spt() returns, TDX's EPT violation
handler may (a) return to the guest directly (when a signal/interrupt is
pending) or (b) retry locally in TDX code (see the sketch after this list).
- for (a), the TD can retry the ACCEPT operation and find that the memory
  has been accepted at 2MB level by another vCPU.
- for (b), as the violation_request_level is set to 2MB, the next
  kvm_mmu_page_fault() will return RET_PF_SPURIOUS, causing the TD to be
  re-entered.
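
For reference, a condensed sketch of how the recorded hint is consumed in
case (b). It assumes the private_max_mapping_level hook returns
tdx->violation_request_level for GFNs inside the recorded range (the exact
return statement is outside the hunks quoted here) and simplifies the
default path.

/* Sketch, not the literal patch code. */
static int violation_hint_level(struct kvm_vcpu *vcpu, gfn_t gfn)
{
        struct vcpu_tdx *tdx = to_tdx(vcpu);

        /* Inside the range recorded by tdx_sept_split_private_spt()? */
        if (gfn >= tdx->violation_gfn_start && gfn < tdx->violation_gfn_end)
                return tdx->violation_request_level;  /* 2MB after a split request */

        return PG_LEVEL_2M;     /* simplified default for a runnable TD */
}

With max_level capped at 2MB, the retried fault finds the existing 2MB leaf,
kvm_mmu_page_fault() returns RET_PF_SPURIOUS, and the vCPU re-enters the TD.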

Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
---
 arch/x86/include/asm/kvm_host.h |  2 +-
 arch/x86/kvm/mmu/tdp_mmu.c      | 44 ++++++++++++++++++++-------------
 arch/x86/kvm/vmx/tdx.c          | 25 ++++++++++++++++++-
 arch/x86/kvm/vmx/x86_ops.h      |  5 ++--
 4 files changed, 55 insertions(+), 21 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 5167458742bf..faae82eefd99 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1814,7 +1814,7 @@ struct kvm_x86_ops {
 
 	/* Split the external page table into smaller page tables */
 	int (*split_external_spt)(struct kvm *kvm, gfn_t gfn, enum pg_level level,
-				  void *external_spt);
+				  void *external_spt, bool mmu_lock_shared);
 
 	bool (*has_wbinvd_exit)(void);
 
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index d3fba5d11ea2..1b2bacde009f 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -388,15 +388,15 @@ static void remove_external_spte(struct kvm *kvm, gfn_t gfn, u64 old_spte,
 }
 
 static int split_external_spt(struct kvm *kvm, gfn_t gfn, u64 old_spte,
-			      u64 new_spte, int level)
+			      u64 new_spte, int level, bool shared)
 {
 	void *external_spt = get_external_spt(gfn, new_spte, level);
 	int ret;
 
 	KVM_BUG_ON(!external_spt, kvm);
 
-	ret = static_call(kvm_x86_split_external_spt)(kvm, gfn, level, external_spt);
-	KVM_BUG_ON(ret, kvm);
+	ret = static_call(kvm_x86_split_external_spt)(kvm, gfn, level, external_spt, shared);
+	KVM_BUG_ON(ret && !shared, kvm);
 
 	return ret;
 }
@@ -536,11 +536,13 @@ static int __must_check set_external_spte_present(struct kvm *kvm, tdp_ptep_t sp
 {
 	bool was_present = is_shadow_present_pte(old_spte);
 	bool is_present = is_shadow_present_pte(new_spte);
+	bool was_leaf = was_present && is_last_spte(old_spte, level);
 	bool is_leaf = is_present && is_last_spte(new_spte, level);
 	kvm_pfn_t new_pfn = spte_to_pfn(new_spte);
 	int ret = 0;
 
-	KVM_BUG_ON(was_present, kvm);
+	/* leaf to leaf or non-leaf to non-leaf updates are not allowed */
+	KVM_BUG_ON((was_leaf && is_leaf) || (was_present && !was_leaf && !is_leaf), kvm);
 
 	lockdep_assert_held(&kvm->mmu_lock);
 	/*
@@ -551,18 +553,28 @@ static int __must_check set_external_spte_present(struct kvm *kvm, tdp_ptep_t sp
 	if (!try_cmpxchg64(rcu_dereference(sptep), &old_spte, FROZEN_SPTE))
 		return -EBUSY;
 
-	/*
-	 * Use different call to either set up middle level
-	 * external page table, or leaf.
-	 */
-	if (is_leaf) {
-		ret = static_call(kvm_x86_set_external_spte)(kvm, gfn, level, new_pfn);
-	} else {
-		void *external_spt = get_external_spt(gfn, new_spte, level);
+	if (!was_present) {
+		/*
+		 * Propagate to install a new leaf or non-leaf entry in external
+		 * page table.
+		 */
+		if (is_leaf) {
+			ret = static_call(kvm_x86_set_external_spte)(kvm, gfn, level, new_pfn);
+		} else {
+			void *external_spt = get_external_spt(gfn, new_spte, level);
 
-		KVM_BUG_ON(!external_spt, kvm);
-		ret = static_call(kvm_x86_link_external_spt)(kvm, gfn, level, external_spt);
+			KVM_BUG_ON(!external_spt, kvm);
+			ret = static_call(kvm_x86_link_external_spt)(kvm, gfn, level, external_spt);
+		}
+	} else if (was_leaf && is_present && !is_leaf) {
+		/* demote */
+		ret = split_external_spt(kvm, gfn, old_spte, new_spte, level, true);
+	} else {
+		/* Promotion is not supported by mirror root (TDX)*/
+		KVM_BUG_ON(1, kvm);
+		ret = -EOPNOTSUPP;
 	}
+
 	if (ret)
 		__kvm_tdp_mmu_write_spte(sptep, old_spte);
 	else
@@ -784,7 +796,7 @@ static u64 tdp_mmu_set_spte(struct kvm *kvm, int as_id, tdp_ptep_t sptep,
 		if (!is_shadow_present_pte(new_spte))
 			remove_external_spte(kvm, gfn, old_spte, level);
 		else if (is_last_spte(old_spte, level) && !is_last_spte(new_spte, level))
-			split_external_spt(kvm, gfn, old_spte, new_spte, level);
+			split_external_spt(kvm, gfn, old_spte, new_spte, level, false);
 		else
 			KVM_BUG_ON(1, kvm);
 	}
@@ -1315,8 +1327,6 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 		sp->nx_huge_page_disallowed = fault->huge_page_disallowed;
 
 		if (is_shadow_present_pte(iter.old_spte)) {
-			/* Don't support large page for mirrored roots (TDX) */
-			KVM_BUG_ON(is_mirror_sptep(iter.sptep), vcpu->kvm);
 			r = tdp_mmu_split_huge_page(kvm, &iter, sp, true);
 		} else {
 			r = tdp_mmu_link_sp(kvm, &iter, sp, true);
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index e24d1cbcc762..e994a6c08a75 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1834,7 +1834,7 @@ static int tdx_spte_demote_private_spte(struct kvm *kvm, gfn_t gfn,
 }
 
 int tdx_sept_split_private_spt(struct kvm *kvm, gfn_t gfn, enum pg_level level,
-			       void *private_spt)
+			       void *private_spt, bool mmu_lock_shared)
 {
 	struct page *page = virt_to_page(private_spt);
 	int ret;
@@ -1842,6 +1842,29 @@ int tdx_sept_split_private_spt(struct kvm *kvm, gfn_t gfn, enum pg_level level,
 	if (KVM_BUG_ON(to_kvm_tdx(kvm)->state != TD_STATE_RUNNABLE || level != PG_LEVEL_2M, kvm))
 		return -EINVAL;
 
+	/*
+	 * Split request with mmu_lock held for reading can only occur when one
+	 * vCPU accepts at 2MB level while another vCPU accepts at 4KB level.
+	 * Ignore this 4KB mapping request by setting violation_request_level to
+	 * 2MB and returning -EBUSY for retry. Then the next fault at 2MB level
+	 * would be a spurious fault. The vCPU accepting at 2MB will accept the
+	 * whole 2MB range.
+	 */
+	if (mmu_lock_shared) {
+		struct kvm_vcpu *vcpu = kvm_get_running_vcpu();
+		struct vcpu_tdx *tdx = to_tdx(vcpu);
+
+		if (KVM_BUG_ON(!vcpu, kvm))
+			return -EOPNOTSUPP;
+
+		/* Request to map as 2MB leaf for the whole 2MB range */
+		tdx->violation_gfn_start = gfn_round_for_level(gfn, level);
+		tdx->violation_gfn_end = tdx->violation_gfn_start + KVM_PAGES_PER_HPAGE(level);
+		tdx->violation_request_level = level;
+
+		return -EBUSY;
+	}
+
 	ret = tdx_sept_zap_private_spte(kvm, gfn, level, page);
 	if (ret <= 0)
 		return ret;
diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
index 0619e9390e5d..fcba76887508 100644
--- a/arch/x86/kvm/vmx/x86_ops.h
+++ b/arch/x86/kvm/vmx/x86_ops.h
@@ -159,7 +159,7 @@ int tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn,
 int tdx_sept_remove_private_spte(struct kvm *kvm, gfn_t gfn,
 				 enum pg_level level, kvm_pfn_t pfn);
 int tdx_sept_split_private_spt(struct kvm *kvm, gfn_t gfn, enum pg_level level,
-			       void *private_spt);
+			       void *private_spt, bool mmu_lock_shared);
 
 void tdx_flush_tlb_current(struct kvm_vcpu *vcpu);
 void tdx_flush_tlb_all(struct kvm_vcpu *vcpu);
@@ -228,7 +228,8 @@ static inline int tdx_sept_remove_private_spte(struct kvm *kvm, gfn_t gfn,
 
 static inline int tdx_sept_split_private_spt(struct kvm *kvm, gfn_t gfn,
 					     enum pg_level level,
-					     void *private_spt)
+					     void *private_spt,
+					     bool mmu_lock_shared)
 {
 	return -EOPNOTSUPP;
 }
-- 
2.43.2



* Re: [RFC PATCH 00/21] KVM: TDX huge page support for private memory
  2025-04-24  3:00 [RFC PATCH 00/21] KVM: TDX huge page support for private memory Yan Zhao
                   ` (20 preceding siblings ...)
  2025-04-24  3:09 ` [RFC PATCH 21/21] KVM: x86: Ignore splitting huge pages in fault path " Yan Zhao
@ 2025-04-24  7:35 ` Kirill A. Shutemov
  2025-04-24  8:33   ` Yan Zhao
  21 siblings, 1 reply; 294+ messages in thread
From: Kirill A. Shutemov @ 2025-04-24  7:35 UTC (permalink / raw)
  To: Yan Zhao
  Cc: pbonzini, seanjc, linux-kernel, kvm, x86, rick.p.edgecombe,
	dave.hansen, kirill.shutemov, tabba, ackerleytng, quic_eberman,
	michael.roth, david, vannapurve, vbabka, jroedel, thomas.lendacky,
	pgonda, zhiquan1.li, fan.du, jun.miao, ira.weiny, chao.p.peng

On Thu, Apr 24, 2025 at 11:00:32AM +0800, Yan Zhao wrote:
> Basic huge page mapping/unmapping
> ---------------------------------
> - TD build time
>   This series enforces that all private mappings be 4KB during the TD build
>   phase, due to the TDX module's requirement that tdh_mem_page_add(), the
>   SEAMCALL for adding private pages during TD build time, only supports 4KB
>   mappings. Enforcing 4KB mappings also simplifies the implementation of
>   code for TD build time, by eliminating the need to consider merging or
>   splitting in the mirror page table during TD build time.
>   
>   The underlying pages allocated from guest_memfd during TD build time
>   phase can still be large, allowing for potential merging into 2MB
>   mappings once the TD is running.

It can be done before TD is running. The merging is allowed on TD build
stage.

But, yes, for simplicity we can skip it for initial enabling.

> Page splitting (page demotion)
> ------------------------------
> Page splitting occurs in two paths:
> (a) with exclusive kvm->mmu_lock, triggered by zapping operations,
> 
>     For normal VMs, if zapping a narrow region that would need to split a
>     huge page, KVM can simply zap the surrounding GFNs rather than
>     splitting a huge page. The pages can then be faulted back in, where KVM
>     can handle mapping them at a 4KB level.
> 
>     The reason why TDX can't use the normal VM solution is that zapping
>     private memory that is accepted cannot easily be re-faulted, since it
>     can only be re-faulted as unaccepted. So KVM will have to sometimes do
>     the page splitting as part of the zapping operations.
> 
>     These zapping operations can occur for few reasons:
>     1. VM teardown.
>     2. Memslot removal.
>     3. Conversion of private pages to shared.
>     4. Userspace does a hole punch to guest_memfd for some reason.
> 
>     For case 1 and 2, splitting before zapping is unnecessary because
>     either the entire range will be zapped or huge pages do not span
>     memslots.
>     
>     Case 3 or case 4 requires splitting, which is also followed by a
>     backend page splitting in guest_memfd.
> 
> (b) with shared kvm->mmu_lock, triggered by fault.
> 
>     Splitting in this path is not accompanied by a backend page splitting
>     (since backend page splitting necessitates a splitting and zapping
>      operation in the former path).  It is triggered when KVM finds that a
>     non-leaf entry is replacing a huge entry in the fault path, which is
>     usually caused by vCPUs' concurrent ACCEPT operations at different
>     levels.

Hm. This sounds like funky behaviour on the guest side.

You only saw it in a synthetic test, right? No real guest OS should do
this.

It can only be possible if guest is reckless enough to be exposed to
double accept attacks.

We should consider putting a warning if we detect such case on KVM side.

>     This series simply ignores the splitting request in the fault path to
>     avoid unnecessary bounces between levels. The vCPU that performs ACCEPT
>     at a lower level would finally figures out the page has been accepted
>     at a higher level by another vCPU.
> 
>     A rare case that could lead to splitting in the fault path is when a TD
>     is configured to receive #VE and accesses memory before the ACCEPT
>     operation. By the time a vCPU accesses a private GFN, due to the lack
>     of any guest preferred level, KVM could create a mapping at 2MB level.
>     If the TD then only performs the ACCEPT operation at 4KB level,
>     splitting in the fault path will be triggered. However, this is not
>     regarded as a typical use case, as usually TD always accepts pages in
>     the order from 1GB->2MB->4KB. The worst outcome to ignore the resulting
>     splitting request is an endless EPT violation. This would not happen
>     for a Linux guest, which does not expect any #VE.

Even if guest accepts memory in response to #VE, it still has to serialize
ACCEPT requests to the same memory block. And track what has been
accepted.

Double accept is a guest bug.

-- 
  Kiryl Shutsemau / Kirill A. Shutemov


* Re: [RFC PATCH 02/21] x86/virt/tdx: Enhance tdh_mem_page_aug() to support huge pages
  2025-04-24  3:04 ` [RFC PATCH 02/21] x86/virt/tdx: Enhance tdh_mem_page_aug() to support huge pages Yan Zhao
@ 2025-04-24  7:48   ` Kirill A. Shutemov
  2025-04-24  8:41     ` Yan Zhao
  2025-04-25  6:51   ` Binbin Wu
                     ` (3 subsequent siblings)
  4 siblings, 1 reply; 294+ messages in thread
From: Kirill A. Shutemov @ 2025-04-24  7:48 UTC (permalink / raw)
  To: Yan Zhao
  Cc: pbonzini, seanjc, linux-kernel, kvm, x86, rick.p.edgecombe,
	dave.hansen, kirill.shutemov, tabba, ackerleytng, quic_eberman,
	michael.roth, david, vannapurve, vbabka, jroedel, thomas.lendacky,
	pgonda, zhiquan1.li, fan.du, jun.miao, ira.weiny, isaku.yamahata,
	xiaoyao.li, binbin.wu, chao.p.peng

On Thu, Apr 24, 2025 at 11:04:28AM +0800, Yan Zhao wrote:
> Enhance the SEAMCALL wrapper tdh_mem_page_aug() to support huge pages.
> 
> Verify the validity of the level and ensure that the mapping range is fully
> contained within the page folio.
> 
> As a conservative solution, perform CLFLUSH on all pages to be mapped into
> the TD before invoking the SEAMCALL TDH_MEM_PAGE_AUG. This ensures that any
> dirty cache lines do not write back later and clobber TD memory.
> 
> Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
> ---
>  arch/x86/virt/vmx/tdx/tdx.c | 11 ++++++++++-
>  1 file changed, 10 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> index f5e2a937c1e7..a66d501b5677 100644
> --- a/arch/x86/virt/vmx/tdx/tdx.c
> +++ b/arch/x86/virt/vmx/tdx/tdx.c
> @@ -1595,9 +1595,18 @@ u64 tdh_mem_page_aug(struct tdx_td *td, u64 gpa, int level, struct page *page, u
>  		.rdx = tdx_tdr_pa(td),
>  		.r8 = page_to_phys(page),
>  	};
> +	unsigned long nr_pages = 1 << (level * 9);

PTE_SHIFT.

> +	struct folio *folio = page_folio(page);
> +	unsigned long idx = 0;
>  	u64 ret;
>  
> -	tdx_clflush_page(page);
> +	if (!(level >= TDX_PS_4K && level < TDX_PS_NR) ||

Do we even need this check?

> +	    (folio_page_idx(folio, page) + nr_pages > folio_nr_pages(folio)))
> +		return -EINVAL;
> +
> +	while (nr_pages--)
> +		tdx_clflush_page(nth_page(page, idx++));
> +
>  	ret = seamcall_ret(TDH_MEM_PAGE_AUG, &args);
>  
>  	*ext_err1 = args.rcx;
> -- 
> 2.43.2
> 

-- 
  Kiryl Shutsemau / Kirill A. Shutemov


* Re: [RFC PATCH 04/21] KVM: TDX: Enforce 4KB mapping level during TD build Time
  2025-04-24  3:05 ` [RFC PATCH 04/21] KVM: TDX: Enforce 4KB mapping level during TD build Time Yan Zhao
@ 2025-04-24  7:55   ` Kirill A. Shutemov
  2025-04-24  8:49     ` Yan Zhao
  2025-05-13 19:12   ` Edgecombe, Rick P
  1 sibling, 1 reply; 294+ messages in thread
From: Kirill A. Shutemov @ 2025-04-24  7:55 UTC (permalink / raw)
  To: Yan Zhao
  Cc: pbonzini, seanjc, linux-kernel, kvm, x86, rick.p.edgecombe,
	dave.hansen, kirill.shutemov, tabba, ackerleytng, quic_eberman,
	michael.roth, david, vannapurve, vbabka, jroedel, thomas.lendacky,
	pgonda, zhiquan1.li, fan.du, jun.miao, ira.weiny, isaku.yamahata,
	xiaoyao.li, binbin.wu, chao.p.peng

On Thu, Apr 24, 2025 at 11:05:00AM +0800, Yan Zhao wrote:
> During the TD build phase (i.e., before the TD becomes RUNNABLE), enforce a
> 4KB mapping level both in the S-EPT managed by the TDX module and the
> mirror page table managed by KVM.
> 
> During this phase, TD's memory is added via tdh_mem_page_add(), which only
> accepts 4KB granularity. Therefore, return PG_LEVEL_4K in TDX's
> .private_max_mapping_level hook to ensure KVM maps at the 4KB level in the
> mirror page table. Meanwhile, iterate over each 4KB page of a large gmem
> backend page in tdx_gmem_post_populate() and invoke tdh_mem_page_add() to
> map at the 4KB level in the S-EPT.
> 
> Still allow huge pages in gmem backend during TD build time. Based on [1],
> which gmem series allows 2MB TPH and non-in-place conversion, pass in

s/TPH/THP/

-- 
  Kiryl Shutsemau / Kirill A. Shutemov


* Re: [RFC PATCH 00/21] KVM: TDX huge page support for private memory
  2025-04-24  7:35 ` [RFC PATCH 00/21] KVM: TDX huge page support for private memory Kirill A. Shutemov
@ 2025-04-24  8:33   ` Yan Zhao
  2025-04-24  9:05     ` Kirill A. Shutemov
  0 siblings, 1 reply; 294+ messages in thread
From: Yan Zhao @ 2025-04-24  8:33 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: pbonzini, seanjc, linux-kernel, kvm, x86, rick.p.edgecombe,
	dave.hansen, kirill.shutemov, tabba, ackerleytng, quic_eberman,
	michael.roth, david, vannapurve, vbabka, jroedel, thomas.lendacky,
	pgonda, zhiquan1.li, fan.du, jun.miao, ira.weiny, chao.p.peng

On Thu, Apr 24, 2025 at 10:35:47AM +0300, Kirill A. Shutemov wrote:
> On Thu, Apr 24, 2025 at 11:00:32AM +0800, Yan Zhao wrote:
> > Basic huge page mapping/unmapping
> > ---------------------------------
> > - TD build time
> >   This series enforces that all private mappings be 4KB during the TD build
> >   phase, due to the TDX module's requirement that tdh_mem_page_add(), the
> >   SEAMCALL for adding private pages during TD build time, only supports 4KB
> >   mappings. Enforcing 4KB mappings also simplifies the implementation of
> >   code for TD build time, by eliminating the need to consider merging or
> >   splitting in the mirror page table during TD build time.
> >   
> >   The underlying pages allocated from guest_memfd during TD build time
> >   phase can still be large, allowing for potential merging into 2MB
> >   mappings once the TD is running.
> 
> It can be done before TD is running. The merging is allowed on TD build
> stage.
> 
> But, yes, for simplicity we can skip it for initial enabling.
Yes, to avoid complicating kvm_tdx->nr_premapped calculation.
I also don't see any benefit to allowing merging during the TD build stage.

> 
> > Page splitting (page demotion)
> > ------------------------------
> > Page splitting occurs in two paths:
> > (a) with exclusive kvm->mmu_lock, triggered by zapping operations,
> > 
> >     For normal VMs, if zapping a narrow region that would need to split a
> >     huge page, KVM can simply zap the surrounding GFNs rather than
> >     splitting a huge page. The pages can then be faulted back in, where KVM
> >     can handle mapping them at a 4KB level.
> > 
> >     The reason why TDX can't use the normal VM solution is that zapping
> >     private memory that is accepted cannot easily be re-faulted, since it
> >     can only be re-faulted as unaccepted. So KVM will have to sometimes do
> >     the page splitting as part of the zapping operations.
> > 
> >     These zapping operations can occur for few reasons:
> >     1. VM teardown.
> >     2. Memslot removal.
> >     3. Conversion of private pages to shared.
> >     4. Userspace does a hole punch to guest_memfd for some reason.
> > 
> >     For case 1 and 2, splitting before zapping is unnecessary because
> >     either the entire range will be zapped or huge pages do not span
> >     memslots.
> >     
> >     Case 3 or case 4 requires splitting, which is also followed by a
> >     backend page splitting in guest_memfd.
> > 
> > (b) with shared kvm->mmu_lock, triggered by fault.
> > 
> >     Splitting in this path is not accompanied by a backend page splitting
> >     (since backend page splitting necessitates a splitting and zapping
> >      operation in the former path).  It is triggered when KVM finds that a
> >     non-leaf entry is replacing a huge entry in the fault path, which is
> >     usually caused by vCPUs' concurrent ACCEPT operations at different
> >     levels.
> 
> Hm. This sounds like funky behaviour on the guest side.
> 
> You only saw it in a synthetic test, right? No real guest OS should do
> this.
Right. In selftest only.
Also in case of any guest bugs.

> It can only be possible if guest is reckless enough to be exposed to
> double accept attacks.
> 
> We should consider putting a warning if we detect such case on KVM side.
Is it acceptable to put warnings in the host kernel in case of guest bugs
or attacks?


> >     This series simply ignores the splitting request in the fault path to
> >     avoid unnecessary bounces between levels. The vCPU that performs ACCEPT
> >     at a lower level would finally figures out the page has been accepted
> >     at a higher level by another vCPU.
> > 
> >     A rare case that could lead to splitting in the fault path is when a TD
> >     is configured to receive #VE and accesses memory before the ACCEPT
> >     operation. By the time a vCPU accesses a private GFN, due to the lack
> >     of any guest preferred level, KVM could create a mapping at 2MB level.
> >     If the TD then only performs the ACCEPT operation at 4KB level,
> >     splitting in the fault path will be triggered. However, this is not
> >     regarded as a typical use case, as usually TD always accepts pages in
> >     the order from 1GB->2MB->4KB. The worst outcome to ignore the resulting
> >     splitting request is an endless EPT violation. This would not happen
> >     for a Linux guest, which does not expect any #VE.
> 
> Even if guest accepts memory in response to #VE, it still has to serialize
> ACCEPT requests to the same memory block. And track what has been
> accepted.
> 
> Double accept is a guest bug.
In the rare case, there's no double accept.
1. Guest accesses a private GPA.
2. KVM creates a 2MB mapping in PENDING state and returns to guest.
3. Guest re-accesses, causing the TDX module to inject a #VE.
4. Guest accepts at 4KB level only.
5. EPT violation to KVM for page splitting.

Here, we expect a normal guest to accept in the order 1GB->2MB->4KB in step 4.


* Re: [RFC PATCH 02/21] x86/virt/tdx: Enhance tdh_mem_page_aug() to support huge pages
  2025-04-24  7:48   ` Kirill A. Shutemov
@ 2025-04-24  8:41     ` Yan Zhao
  0 siblings, 0 replies; 294+ messages in thread
From: Yan Zhao @ 2025-04-24  8:41 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: pbonzini, seanjc, linux-kernel, kvm, x86, rick.p.edgecombe,
	dave.hansen, kirill.shutemov, tabba, ackerleytng, quic_eberman,
	michael.roth, david, vannapurve, vbabka, jroedel, thomas.lendacky,
	pgonda, zhiquan1.li, fan.du, jun.miao, ira.weiny, isaku.yamahata,
	xiaoyao.li, binbin.wu, chao.p.peng

On Thu, Apr 24, 2025 at 10:48:53AM +0300, Kirill A. Shutemov wrote:
> On Thu, Apr 24, 2025 at 11:04:28AM +0800, Yan Zhao wrote:
> > Enhance the SEAMCALL wrapper tdh_mem_page_aug() to support huge pages.
> > 
> > Verify the validity of the level and ensure that the mapping range is fully
> > contained within the page folio.
> > 
> > As a conservative solution, perform CLFLUSH on all pages to be mapped into
> > the TD before invoking the SEAMCALL TDH_MEM_PAGE_AUG. This ensures that any
> > dirty cache lines do not write back later and clobber TD memory.
> > 
> > Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
> > Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> > Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
> > ---
> >  arch/x86/virt/vmx/tdx/tdx.c | 11 ++++++++++-
> >  1 file changed, 10 insertions(+), 1 deletion(-)
> > 
> > diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> > index f5e2a937c1e7..a66d501b5677 100644
> > --- a/arch/x86/virt/vmx/tdx/tdx.c
> > +++ b/arch/x86/virt/vmx/tdx/tdx.c
> > @@ -1595,9 +1595,18 @@ u64 tdh_mem_page_aug(struct tdx_td *td, u64 gpa, int level, struct page *page, u
> >  		.rdx = tdx_tdr_pa(td),
> >  		.r8 = page_to_phys(page),
> >  	};
> > +	unsigned long nr_pages = 1 << (level * 9);
> 
> PTE_SHIFT.
Yes. Thanks.

> > +	struct folio *folio = page_folio(page);
> > +	unsigned long idx = 0;
> >  	u64 ret;
> >  
> > -	tdx_clflush_page(page);
> > +	if (!(level >= TDX_PS_4K && level < TDX_PS_NR) ||
> 
> Do we even need this check?
Maybe not if tdh_mem_page_aug() trusts KVM :)

The consideration is to avoid nr_pages being so huge that it causes too
many tdx_clflush_page() calls on any reckless error.
 
> > +	    (folio_page_idx(folio, page) + nr_pages > folio_nr_pages(folio)))
> > +		return -EINVAL;
> > +
> > +	while (nr_pages--)
> > +		tdx_clflush_page(nth_page(page, idx++));
> > +
> >  	ret = seamcall_ret(TDH_MEM_PAGE_AUG, &args);
> >  
> >  	*ext_err1 = args.rcx;
> > -- 
> > 2.43.2
> > 
> 
> -- 
>   Kiryl Shutsemau / Kirill A. Shutemov


* Re: [RFC PATCH 04/21] KVM: TDX: Enforce 4KB mapping level during TD build Time
  2025-04-24  7:55   ` Kirill A. Shutemov
@ 2025-04-24  8:49     ` Yan Zhao
  0 siblings, 0 replies; 294+ messages in thread
From: Yan Zhao @ 2025-04-24  8:49 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: pbonzini, seanjc, linux-kernel, kvm, x86, rick.p.edgecombe,
	dave.hansen, kirill.shutemov, tabba, ackerleytng, quic_eberman,
	michael.roth, david, vannapurve, vbabka, jroedel, thomas.lendacky,
	pgonda, zhiquan1.li, fan.du, jun.miao, ira.weiny, isaku.yamahata,
	xiaoyao.li, binbin.wu, chao.p.peng

On Thu, Apr 24, 2025 at 10:55:53AM +0300, Kirill A. Shutemov wrote:
> On Thu, Apr 24, 2025 at 11:05:00AM +0800, Yan Zhao wrote:
> > During the TD build phase (i.e., before the TD becomes RUNNABLE), enforce a
> > 4KB mapping level both in the S-EPT managed by the TDX module and the
> > mirror page table managed by KVM.
> > 
> > During this phase, TD's memory is added via tdh_mem_page_add(), which only
> > accepts 4KB granularity. Therefore, return PG_LEVEL_4K in TDX's
> > .private_max_mapping_level hook to ensure KVM maps at the 4KB level in the
> > mirror page table. Meanwhile, iterate over each 4KB page of a large gmem
> > backend page in tdx_gmem_post_populate() and invoke tdh_mem_page_add() to
> > map at the 4KB level in the S-EPT.
> > 
> > Still allow huge pages in gmem backend during TD build time. Based on [1],
> > which gmem series allows 2MB TPH and non-in-place conversion, pass in
> 
> s/TPH/THP/
Right. Thanks!


* Re: [RFC PATCH 00/21] KVM: TDX huge page support for private memory
  2025-04-24  8:33   ` Yan Zhao
@ 2025-04-24  9:05     ` Kirill A. Shutemov
  2025-04-24  9:08       ` Juergen Gross
  2025-04-24  9:49       ` Yan Zhao
  0 siblings, 2 replies; 294+ messages in thread
From: Kirill A. Shutemov @ 2025-04-24  9:05 UTC (permalink / raw)
  To: Yan Zhao
  Cc: pbonzini, seanjc, linux-kernel, kvm, x86, rick.p.edgecombe,
	dave.hansen, kirill.shutemov, tabba, ackerleytng, quic_eberman,
	michael.roth, david, vannapurve, vbabka, jroedel, thomas.lendacky,
	pgonda, zhiquan1.li, fan.du, jun.miao, ira.weiny, chao.p.peng

On Thu, Apr 24, 2025 at 04:33:13PM +0800, Yan Zhao wrote:
> On Thu, Apr 24, 2025 at 10:35:47AM +0300, Kirill A. Shutemov wrote:
> > On Thu, Apr 24, 2025 at 11:00:32AM +0800, Yan Zhao wrote:
> > > Basic huge page mapping/unmapping
> > > ---------------------------------
> > > - TD build time
> > >   This series enforces that all private mappings be 4KB during the TD build
> > >   phase, due to the TDX module's requirement that tdh_mem_page_add(), the
> > >   SEAMCALL for adding private pages during TD build time, only supports 4KB
> > >   mappings. Enforcing 4KB mappings also simplifies the implementation of
> > >   code for TD build time, by eliminating the need to consider merging or
> > >   splitting in the mirror page table during TD build time.
> > >   
> > >   The underlying pages allocated from guest_memfd during TD build time
> > >   phase can still be large, allowing for potential merging into 2MB
> > >   mappings once the TD is running.
> > 
> > It can be done before TD is running. The merging is allowed on TD build
> > stage.
> > 
> > But, yes, for simplicity we can skip it for initial enabling.
> Yes, to avoid complicating kvm_tdx->nr_premapped calculation.
> I also don't see any benefit to allow merging during TD build stage.
> 
> > 
> > > Page splitting (page demotion)
> > > ------------------------------
> > > Page splitting occurs in two paths:
> > > (a) with exclusive kvm->mmu_lock, triggered by zapping operations,
> > > 
> > >     For normal VMs, if zapping a narrow region that would need to split a
> > >     huge page, KVM can simply zap the surrounding GFNs rather than
> > >     splitting a huge page. The pages can then be faulted back in, where KVM
> > >     can handle mapping them at a 4KB level.
> > > 
> > >     The reason why TDX can't use the normal VM solution is that zapping
> > >     private memory that is accepted cannot easily be re-faulted, since it
> > >     can only be re-faulted as unaccepted. So KVM will have to sometimes do
> > >     the page splitting as part of the zapping operations.
> > > 
> > >     These zapping operations can occur for few reasons:
> > >     1. VM teardown.
> > >     2. Memslot removal.
> > >     3. Conversion of private pages to shared.
> > >     4. Userspace does a hole punch to guest_memfd for some reason.
> > > 
> > >     For case 1 and 2, splitting before zapping is unnecessary because
> > >     either the entire range will be zapped or huge pages do not span
> > >     memslots.
> > >     
> > >     Case 3 or case 4 requires splitting, which is also followed by a
> > >     backend page splitting in guest_memfd.
> > > 
> > > (b) with shared kvm->mmu_lock, triggered by fault.
> > > 
> > >     Splitting in this path is not accompanied by a backend page splitting
> > >     (since backend page splitting necessitates a splitting and zapping
> > >      operation in the former path).  It is triggered when KVM finds that a
> > >     non-leaf entry is replacing a huge entry in the fault path, which is
> > >     usually caused by vCPUs' concurrent ACCEPT operations at different
> > >     levels.
> > 
> > Hm. This sounds like funky behaviour on the guest side.
> > 
> > You only saw it in a synthetic test, right? No real guest OS should do
> > this.
> Right. In selftest only.
> Also in case of any guest bugs.
> 
> > It can only be possible if guest is reckless enough to be exposed to
> > double accept attacks.
> > 
> > We should consider putting a warning if we detect such case on KVM side.
> Is it acceptable to put warnings in host kernel in case of guest bugs or
> attacks?

pr_warn_once() shouldn't be a big deal.

> > >     This series simply ignores the splitting request in the fault path to
> > >     avoid unnecessary bounces between levels. The vCPU that performs ACCEPT
> > >     at a lower level would finally figures out the page has been accepted
> > >     at a higher level by another vCPU.
> > > 
> > >     A rare case that could lead to splitting in the fault path is when a TD
> > >     is configured to receive #VE and accesses memory before the ACCEPT
> > >     operation. By the time a vCPU accesses a private GFN, due to the lack
> > >     of any guest preferred level, KVM could create a mapping at 2MB level.
> > >     If the TD then only performs the ACCEPT operation at 4KB level,
> > >     splitting in the fault path will be triggered. However, this is not
> > >     regarded as a typical use case, as usually TD always accepts pages in
> > >     the order from 1GB->2MB->4KB. The worst outcome to ignore the resulting
> > >     splitting request is an endless EPT violation. This would not happen
> > >     for a Linux guest, which does not expect any #VE.
> > 
> > Even if guest accepts memory in response to #VE, it still has to serialize
> > ACCEPT requests to the same memory block. And track what has been
> > accepted.
> > 
> > Double accept is a guest bug.
> In the rare case, there're no double accept.
> 1. Guest acceses a private GPA
> 2. KVM creates a 2MB mapping in PENDING state and returns to guest.
> 3. Guest re-accesses, causing the TDX module to inject a #VE.
> 4. Guest accepts at 4KB level only.
> 5. EPT violation to KVM for page splitting.
> 
> Here, we expect a normal guest to accept from GB->2MB->4KB in step 4.

Okay, I think I misunderstood this case. I thought there were competing 4k
vs 2M ACCEPT requests to the same memory block.

Accepting everything at 4k level is a stupid, but valid, behaviour on the
guest's behalf. This splitting case has to be supported before the patchset
hits the mainline.

BTW, there's no 1G ACCEPT. I know that guest is written as if it is a
thing, but TDX module only supports 4k and 2M. 1G is only reachable via
promotion.

-- 
  Kiryl Shutsemau / Kirill A. Shutemov


* Re: [RFC PATCH 00/21] KVM: TDX huge page support for private memory
  2025-04-24  9:05     ` Kirill A. Shutemov
@ 2025-04-24  9:08       ` Juergen Gross
  2025-04-24  9:49       ` Yan Zhao
  1 sibling, 0 replies; 294+ messages in thread
From: Juergen Gross @ 2025-04-24  9:08 UTC (permalink / raw)
  To: Kirill A. Shutemov, Yan Zhao
  Cc: pbonzini, seanjc, linux-kernel, kvm, x86, rick.p.edgecombe,
	dave.hansen, kirill.shutemov, tabba, ackerleytng, quic_eberman,
	michael.roth, david, vannapurve, vbabka, jroedel, thomas.lendacky,
	pgonda, zhiquan1.li, fan.du, jun.miao, ira.weiny, chao.p.peng



On 24.04.25 11:05, Kirill A. Shutemov wrote:
> On Thu, Apr 24, 2025 at 04:33:13PM +0800, Yan Zhao wrote:
>> On Thu, Apr 24, 2025 at 10:35:47AM +0300, Kirill A. Shutemov wrote:
>>> On Thu, Apr 24, 2025 at 11:00:32AM +0800, Yan Zhao wrote:
>>>> Basic huge page mapping/unmapping
>>>> ---------------------------------
>>>> - TD build time
>>>>    This series enforces that all private mappings be 4KB during the TD build
>>>>    phase, due to the TDX module's requirement that tdh_mem_page_add(), the
>>>>    SEAMCALL for adding private pages during TD build time, only supports 4KB
>>>>    mappings. Enforcing 4KB mappings also simplifies the implementation of
>>>>    code for TD build time, by eliminating the need to consider merging or
>>>>    splitting in the mirror page table during TD build time.
>>>>    
>>>>    The underlying pages allocated from guest_memfd during TD build time
>>>>    phase can still be large, allowing for potential merging into 2MB
>>>>    mappings once the TD is running.
>>>
>>> It can be done before TD is running. The merging is allowed on TD build
>>> stage.
>>>
>>> But, yes, for simplicity we can skip it for initial enabling.
>> Yes, to avoid complicating kvm_tdx->nr_premapped calculation.
>> I also don't see any benefit to allow merging during TD build stage.
>>
>>>
>>>> Page splitting (page demotion)
>>>> ------------------------------
>>>> Page splitting occurs in two paths:
>>>> (a) with exclusive kvm->mmu_lock, triggered by zapping operations,
>>>>
>>>>      For normal VMs, if zapping a narrow region that would need to split a
>>>>      huge page, KVM can simply zap the surrounding GFNs rather than
>>>>      splitting a huge page. The pages can then be faulted back in, where KVM
>>>>      can handle mapping them at a 4KB level.
>>>>
>>>>      The reason why TDX can't use the normal VM solution is that zapping
>>>>      private memory that is accepted cannot easily be re-faulted, since it
>>>>      can only be re-faulted as unaccepted. So KVM will have to sometimes do
>>>>      the page splitting as part of the zapping operations.
>>>>
>>>>      These zapping operations can occur for a few reasons:
>>>>      1. VM teardown.
>>>>      2. Memslot removal.
>>>>      3. Conversion of private pages to shared.
>>>>      4. Userspace does a hole punch to guest_memfd for some reason.
>>>>
>>>>      For case 1 and 2, splitting before zapping is unnecessary because
>>>>      either the entire range will be zapped or huge pages do not span
>>>>      memslots.
>>>>      
>>>>      Case 3 or case 4 requires splitting, which is also followed by a
>>>>      backend page splitting in guest_memfd.
>>>>
>>>> (b) with shared kvm->mmu_lock, triggered by fault.
>>>>
>>>>      Splitting in this path is not accompanied by a backend page splitting
>>>>      (since backend page splitting necessitates a splitting and zapping
>>>>       operation in the former path).  It is triggered when KVM finds that a
>>>>      non-leaf entry is replacing a huge entry in the fault path, which is
>>>>      usually caused by vCPUs' concurrent ACCEPT operations at different
>>>>      levels.
>>>
>>> Hm. This sounds like funky behaviour on the guest side.
>>>
>>> You only saw it in a synthetic test, right? No real guest OS should do
>>> this.
>> Right. In selftest only.
>> Also in case of any guest bugs.
>>
>>> It can only be possible if guest is reckless enough to be exposed to
>>> double accept attacks.
>>>
>>> We should consider putting a warning if we detect such case on KVM side.
>> Is it acceptable to put warnings in host kernel in case of guest bugs or
>> attacks?
> 
> pr_warn_once() shouldn't be a big deal.

Shouldn't such a warning be once per guest?

So either we need a per guest flag, or we could use pr_warn_ratelimited().
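
A per-guest variant could look roughly like the sketch below (the
'warned_fault_split' flag in struct kvm_tdx is made up, just to illustrate
the idea):

	/* Hypothetical per-VM flag; warn only once per TD. */
	if (!kvm_tdx->warned_fault_split) {
		kvm_tdx->warned_fault_split = true;
		pr_warn("KVM: TD requested S-EPT split in the fault path\n");
	}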


Juergen

^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 00/21] KVM: TDX huge page support for private memory
  2025-04-24  9:05     ` Kirill A. Shutemov
  2025-04-24  9:08       ` Juergen Gross
@ 2025-04-24  9:49       ` Yan Zhao
  2025-04-24 10:39         ` Kirill A. Shutemov
  1 sibling, 1 reply; 294+ messages in thread
From: Yan Zhao @ 2025-04-24  9:49 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: pbonzini, seanjc, linux-kernel, kvm, x86, rick.p.edgecombe,
	dave.hansen, kirill.shutemov, tabba, ackerleytng, quic_eberman,
	michael.roth, david, vannapurve, vbabka, jroedel, thomas.lendacky,
	pgonda, zhiquan1.li, fan.du, jun.miao, ira.weiny, chao.p.peng

On Thu, Apr 24, 2025 at 12:05:15PM +0300, Kirill A. Shutemov wrote:
> On Thu, Apr 24, 2025 at 04:33:13PM +0800, Yan Zhao wrote:
> > On Thu, Apr 24, 2025 at 10:35:47AM +0300, Kirill A. Shutemov wrote:
> > > On Thu, Apr 24, 2025 at 11:00:32AM +0800, Yan Zhao wrote:
> > > > Basic huge page mapping/unmapping
> > > > ---------------------------------
> > > > - TD build time
> > > >   This series enforces that all private mappings be 4KB during the TD build
> > > >   phase, due to the TDX module's requirement that tdh_mem_page_add(), the
> > > >   SEAMCALL for adding private pages during TD build time, only supports 4KB
> > > >   mappings. Enforcing 4KB mappings also simplifies the implementation of
> > > >   code for TD build time, by eliminating the need to consider merging or
> > > >   splitting in the mirror page table during TD build time.
> > > >   
> > > >   The underlying pages allocated from guest_memfd during TD build time
> > > >   phase can still be large, allowing for potential merging into 2MB
> > > >   mappings once the TD is running.
> > > 
> > > It can be done before TD is running. The merging is allowed on TD build
> > > stage.
> > > 
> > > But, yes, for simplicity we can skip it for initial enabling.
> > Yes, to avoid complicating kvm_tdx->nr_premapped calculation.
> > I also don't see any benefit to allow merging during TD build stage.
> > 
> > > 
> > > > Page splitting (page demotion)
> > > > ------------------------------
> > > > Page splitting occurs in two paths:
> > > > (a) with exclusive kvm->mmu_lock, triggered by zapping operations,
> > > > 
> > > >     For normal VMs, if zapping a narrow region that would need to split a
> > > >     huge page, KVM can simply zap the surrounding GFNs rather than
> > > >     splitting a huge page. The pages can then be faulted back in, where KVM
> > > >     can handle mapping them at a 4KB level.
> > > > 
> > > >     The reason why TDX can't use the normal VM solution is that zapping
> > > >     private memory that is accepted cannot easily be re-faulted, since it
> > > >     can only be re-faulted as unaccepted. So KVM will have to sometimes do
> > > >     the page splitting as part of the zapping operations.
> > > > 
> > > >     These zapping operations can occur for a few reasons:
> > > >     1. VM teardown.
> > > >     2. Memslot removal.
> > > >     3. Conversion of private pages to shared.
> > > >     4. Userspace does a hole punch to guest_memfd for some reason.
> > > > 
> > > >     For case 1 and 2, splitting before zapping is unnecessary because
> > > >     either the entire range will be zapped or huge pages do not span
> > > >     memslots.
> > > >     
> > > >     Case 3 or case 4 requires splitting, which is also followed by a
> > > >     backend page splitting in guest_memfd.
> > > > 
> > > > (b) with shared kvm->mmu_lock, triggered by fault.
> > > > 
> > > >     Splitting in this path is not accompanied by a backend page splitting
> > > >     (since backend page splitting necessitates a splitting and zapping
> > > >      operation in the former path).  It is triggered when KVM finds that a
> > > >     non-leaf entry is replacing a huge entry in the fault path, which is
> > > >     usually caused by vCPUs' concurrent ACCEPT operations at different
> > > >     levels.
> > > 
> > > Hm. This sounds like funky behaviour on the guest side.
> > > 
> > > You only saw it in a synthetic test, right? No real guest OS should do
> > > this.
> > Right. In selftest only.
> > Also in case of any guest bugs.
> > 
> > > It can only be possible if guest is reckless enough to be exposed to
> > > double accept attacks.
> > > 
> > > We should consider putting a warning if we detect such case on KVM side.
> > Is it acceptable to put warnings in host kernel in case of guest bugs or
> > attacks?
> 
> pr_warn_once() shouldn't be a big deal.
My previous understanding is that even a per-VM warning is not desired.
Maybe Rick or anyone else could chime in.
Compared to a warning, what about an exit to userspace for further handling?
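
For instance, something along the lines of the sketch below, reusing the
existing KVM_EXIT_MEMORY_FAULT userspace exit (whether that exit reason is
appropriate for this case is an open question):

	vcpu->run->exit_reason = KVM_EXIT_MEMORY_FAULT;
	vcpu->run->memory_fault.gpa = gfn_to_gpa(gfn);
	vcpu->run->memory_fault.size = PAGE_SIZE;
	vcpu->run->memory_fault.flags = KVM_MEMORY_EXIT_FLAG_PRIVATE;
	return -EFAULT;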

Another thing is that there may not be an easy way for KVM to differentiate whether
a split is caused by two competing 4K vs 2M ACCEPT requests or by the operation you
deemed valid below. Guests could turn on #VE dynamically.

> > > >     This series simply ignores the splitting request in the fault path to
> > > >     avoid unnecessary bounces between levels. The vCPU that performs ACCEPT
> > > >     at a lower level would finally figure out the page has been accepted
> > > >     at a higher level by another vCPU.
> > > > 
> > > >     A rare case that could lead to splitting in the fault path is when a TD
> > > >     is configured to receive #VE and accesses memory before the ACCEPT
> > > >     operation. By the time a vCPU accesses a private GFN, due to the lack
> > > >     of any guest preferred level, KVM could create a mapping at 2MB level.
> > > >     If the TD then only performs the ACCEPT operation at 4KB level,
> > > >     splitting in the fault path will be triggered. However, this is not
> > > >     regarded as a typical use case, as usually TD always accepts pages in
> > > >     the order from 1GB->2MB->4KB. The worst outcome to ignore the resulting
> > > >     splitting request is an endless EPT violation. This would not happen
> > > >     for a Linux guest, which does not expect any #VE.
> > > 
> > > Even if guest accepts memory in response to #VE, it still has to serialize
> > > ACCEPT requests to the same memory block. And track what has been
> > > accepted.
> > > 
> > > Double accept is a guest bug.
> > In the rare case, there's no double accept.
> > 1. Guest accesses a private GPA
> > 2. KVM creates a 2MB mapping in PENDING state and returns to guest.
> > 3. Guest re-accesses, causing the TDX module to inject a #VE.
> > 4. Guest accepts at 4KB level only.
> > 5. EPT violation to KVM for page splitting.
> > 
> > Here, we expect a normal guest to accept from 1GB->2MB->4KB in step 4.
> 
> Okay, I think I misunderstood this case. I thought there were competing 4k
> vs 2M ACCEPT requests to the same memory block.
> 
> Accepting everything at 4k level is a stupid, but valid, behaviour on the
> guest's behalf. This splitting case has to be supported before the patchset
> hits the mainline.
Hmm. If you think this is a valid behavior, patches to introduce more locks
in TDX are required :)

Not sure about the value of supporting it though, as it's also purely
hypothetical and couldn't exist in Linux guests.

> BTW, there's no 1G ACCEPT. I know that guest is written as if it is a
> thing, but TDX module only supports 4k and 2M. 1G is only reachable via
> promotion.
Ah, you are right. I re-checked the TDX module code; yes, it returns an error at
the 1G level.

^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 19/21] KVM: gmem: Split huge boundary leafs for punch hole of private memory
  2025-04-24  3:08 ` [RFC PATCH 19/21] KVM: gmem: Split huge boundary leafs for punch hole of private memory Yan Zhao
@ 2025-04-24 10:19   ` Francesco Lavra
  2025-04-25  1:55     ` Yan Zhao
  2025-05-13 22:59   ` Edgecombe, Rick P
  1 sibling, 1 reply; 294+ messages in thread
From: Francesco Lavra @ 2025-04-24 10:19 UTC (permalink / raw)
  To: yan.y.zhao
  Cc: ackerleytng, binbin.wu, chao.p.peng, dave.hansen, david, fan.du,
	ira.weiny, isaku.yamahata, jroedel, jun.miao, kirill.shutemov,
	kvm, linux-kernel, michael.roth, pbonzini, pgonda, quic_eberman,
	rick.p.edgecombe, seanjc, tabba, thomas.lendacky, vannapurve,
	vbabka, x86, xiaoyao.li, zhiquan1.li

On 2025-04-24 at 3:08, Yan Zhao wrote:
> diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
> index 4bb140e7f30d..008061734ac5 100644
> --- a/virt/kvm/guest_memfd.c
> +++ b/virt/kvm/guest_memfd.c
> @@ -292,13 +292,14 @@ static struct folio *kvm_gmem_get_folio(struct
> inode *inode, pgoff_t index, int
>  	return folio;
>  }
>  
> -static void kvm_gmem_invalidate_begin(struct kvm_gmem *gmem, pgoff_t
> start,
> -				      pgoff_t end)
> +static int kvm_gmem_invalidate_begin(struct kvm_gmem *gmem, pgoff_t
> start,
> +				     pgoff_t end, bool need_split)
>  {
>  	bool flush = false, found_memslot = false;
>  	struct kvm_memory_slot *slot;
>  	struct kvm *kvm = gmem->kvm;
>  	unsigned long index;
> +	int ret = 0;
>  
>  	xa_for_each_range(&gmem->bindings, index, slot, start, end -
> 1) {
>  		pgoff_t pgoff = slot->gmem.pgoff;
> @@ -319,14 +320,23 @@ static void kvm_gmem_invalidate_begin(struct
> kvm_gmem *gmem, pgoff_t start,
>  			kvm_mmu_invalidate_begin(kvm);
>  		}
>  
> +		if (need_split) {
> +			ret = kvm_split_boundary_leafs(kvm,
> &gfn_range);
> +			if (ret < 0)
> +				goto out;
> +
> +			flush |= ret;
> +		}
>  		flush |= kvm_mmu_unmap_gfn_range(kvm, &gfn_range);
>  	}
>  
> +out:
>  	if (flush)
>  		kvm_flush_remote_tlbs(kvm);
>  
>  	if (found_memslot)
>  		KVM_MMU_UNLOCK(kvm);
> +	return 0;

Should return ret, not 0
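
i.e. the tail of the function would then read (sketch):

	out:
		if (flush)
			kvm_flush_remote_tlbs(kvm);

		if (found_memslot)
			KVM_MMU_UNLOCK(kvm);

		return ret;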

^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 00/21] KVM: TDX huge page support for private memory
  2025-04-24  9:49       ` Yan Zhao
@ 2025-04-24 10:39         ` Kirill A. Shutemov
  0 siblings, 0 replies; 294+ messages in thread
From: Kirill A. Shutemov @ 2025-04-24 10:39 UTC (permalink / raw)
  To: Yan Zhao
  Cc: pbonzini, seanjc, linux-kernel, kvm, x86, rick.p.edgecombe,
	dave.hansen, kirill.shutemov, tabba, ackerleytng, quic_eberman,
	michael.roth, david, vannapurve, vbabka, jroedel, thomas.lendacky,
	pgonda, zhiquan1.li, fan.du, jun.miao, ira.weiny, chao.p.peng

On Thu, Apr 24, 2025 at 05:49:17PM +0800, Yan Zhao wrote:
> > Accepting everything at 4k level is a stupid, but valid behaviour on the
> > guest behalf. This splitting case has to be supported before the patchset
> > hits the mainline.
> Hmm. If you think this is a valid behavior, patches to introduce more locks
> in TDX are required :)
> 
> Not sure about the value of supporting it though, as it's also purely
> hypothetical and couldn't exist in Linux guests.

Linux is not the only possible guest.

-- 
  Kiryl Shutsemau / Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 19/21] KVM: gmem: Split huge boundary leafs for punch hole of private memory
  2025-04-24 10:19   ` Francesco Lavra
@ 2025-04-25  1:55     ` Yan Zhao
  0 siblings, 0 replies; 294+ messages in thread
From: Yan Zhao @ 2025-04-25  1:55 UTC (permalink / raw)
  To: Francesco Lavra
  Cc: ackerleytng, binbin.wu, chao.p.peng, dave.hansen, david, fan.du,
	ira.weiny, isaku.yamahata, jroedel, jun.miao, kirill.shutemov,
	kvm, linux-kernel, michael.roth, pbonzini, pgonda, quic_eberman,
	rick.p.edgecombe, seanjc, tabba, thomas.lendacky, vannapurve,
	vbabka, x86, xiaoyao.li, zhiquan1.li

On Thu, Apr 24, 2025 at 12:19:32PM +0200, Francesco Lavra wrote:
> On 2025-04-24 at 3:08, Yan Zhao wrote:
> > diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
> > index 4bb140e7f30d..008061734ac5 100644
> > --- a/virt/kvm/guest_memfd.c
> > +++ b/virt/kvm/guest_memfd.c
> > @@ -292,13 +292,14 @@ static struct folio *kvm_gmem_get_folio(struct
> > inode *inode, pgoff_t index, int
> >  	return folio;
> >  }
> >  
> > -static void kvm_gmem_invalidate_begin(struct kvm_gmem *gmem, pgoff_t
> > start,
> > -				      pgoff_t end)
> > +static int kvm_gmem_invalidate_begin(struct kvm_gmem *gmem, pgoff_t
> > start,
> > +				     pgoff_t end, bool need_split)
> >  {
> >  	bool flush = false, found_memslot = false;
> >  	struct kvm_memory_slot *slot;
> >  	struct kvm *kvm = gmem->kvm;
> >  	unsigned long index;
> > +	int ret = 0;
> >  
> >  	xa_for_each_range(&gmem->bindings, index, slot, start, end -
> > 1) {
> >  		pgoff_t pgoff = slot->gmem.pgoff;
> > @@ -319,14 +320,23 @@ static void kvm_gmem_invalidate_begin(struct
> > kvm_gmem *gmem, pgoff_t start,
> >  			kvm_mmu_invalidate_begin(kvm);
> >  		}
> >  
> > +		if (need_split) {
> > +			ret = kvm_split_boundary_leafs(kvm,
> > &gfn_range);
> > +			if (ret < 0)
> > +				goto out;
> > +
> > +			flush |= ret;
> > +		}
> >  		flush |= kvm_mmu_unmap_gfn_range(kvm, &gfn_range);
> >  	}
> >  
> > +out:
> >  	if (flush)
> >  		kvm_flush_remote_tlbs(kvm);
> >  
> >  	if (found_memslot)
> >  		KVM_MMU_UNLOCK(kvm);
> > +	return 0;
> 
> Should return ret, not 0
Yes, thank you for the correction!

^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 02/21] x86/virt/tdx: Enhance tdh_mem_page_aug() to support huge pages
  2025-04-24  3:04 ` [RFC PATCH 02/21] x86/virt/tdx: Enhance tdh_mem_page_aug() to support huge pages Yan Zhao
  2025-04-24  7:48   ` Kirill A. Shutemov
@ 2025-04-25  6:51   ` Binbin Wu
  2025-04-25  7:19     ` Yan Zhao
  2025-05-13 18:52   ` Edgecombe, Rick P
                     ` (2 subsequent siblings)
  4 siblings, 1 reply; 294+ messages in thread
From: Binbin Wu @ 2025-04-25  6:51 UTC (permalink / raw)
  To: Yan Zhao
  Cc: pbonzini, seanjc, linux-kernel, kvm, x86, rick.p.edgecombe,
	dave.hansen, kirill.shutemov, tabba, ackerleytng, quic_eberman,
	michael.roth, david, vannapurve, vbabka, jroedel, thomas.lendacky,
	pgonda, zhiquan1.li, fan.du, jun.miao, ira.weiny, isaku.yamahata,
	xiaoyao.li, chao.p.peng



On 4/24/2025 11:04 AM, Yan Zhao wrote:
> Enhance the SEAMCALL wrapper tdh_mem_page_aug() to support huge pages.
>
> Verify the validity of the level and ensure that the mapping range is fully
> contained within the page folio.
>
> As a conservative solution, perform CLFLUSH on all pages to be mapped into
> the TD before invoking the SEAMCALL TDH_MEM_PAGE_AUG. This ensures that any
> dirty cache lines do not write back later and clobber TD memory.
>
> Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
> ---
>   arch/x86/virt/vmx/tdx/tdx.c | 11 ++++++++++-
>   1 file changed, 10 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> index f5e2a937c1e7..a66d501b5677 100644
> --- a/arch/x86/virt/vmx/tdx/tdx.c
> +++ b/arch/x86/virt/vmx/tdx/tdx.c
> @@ -1595,9 +1595,18 @@ u64 tdh_mem_page_aug(struct tdx_td *td, u64 gpa, int level, struct page *page, u
>   		.rdx = tdx_tdr_pa(td),
>   		.r8 = page_to_phys(page),
>   	};
> +	unsigned long nr_pages = 1 << (level * 9);
> +	struct folio *folio = page_folio(page);
> +	unsigned long idx = 0;
>   	u64 ret;
>   
> -	tdx_clflush_page(page);
> +	if (!(level >= TDX_PS_4K && level < TDX_PS_NR) ||
> +	    (folio_page_idx(folio, page) + nr_pages > folio_nr_pages(folio)))
> +		return -EINVAL;
> +
> +	while (nr_pages--)
> +		tdx_clflush_page(nth_page(page, idx++));
Is the following better to save a variable?

while (nr_pages)
     tdx_clflush_page(nth_page(page, --nr_pages));


> +
>   	ret = seamcall_ret(TDH_MEM_PAGE_AUG, &args);
>   
>   	*ext_err1 = args.rcx;


^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 03/21] x86/virt/tdx: Add SEAMCALL wrapper tdh_mem_page_demote()
  2025-04-24  3:04 ` [RFC PATCH 03/21] x86/virt/tdx: Add SEAMCALL wrapper tdh_mem_page_demote() Yan Zhao
@ 2025-04-25  7:12   ` Binbin Wu
  2025-04-25  7:17     ` Yan Zhao
  2025-05-13 18:19   ` Edgecombe, Rick P
  1 sibling, 1 reply; 294+ messages in thread
From: Binbin Wu @ 2025-04-25  7:12 UTC (permalink / raw)
  To: Yan Zhao
  Cc: pbonzini, seanjc, linux-kernel, kvm, x86, rick.p.edgecombe,
	dave.hansen, kirill.shutemov, tabba, ackerleytng, quic_eberman,
	michael.roth, david, vannapurve, vbabka, jroedel, thomas.lendacky,
	pgonda, zhiquan1.li, fan.du, jun.miao, ira.weiny, isaku.yamahata,
	xiaoyao.li, chao.p.peng



On 4/24/2025 11:04 AM, Yan Zhao wrote:
> From: Xiaoyao Li <xiaoyao.li@intel.com>
>
> Add a wrapper tdh_mem_page_demote() to invoke SEAMCALL TDH_MEM_PAGE_DEMOTE
> to demote a huge leaf entry to a non-leaf entry in S-EPT. Currently, the
> TDX module only supports demotion of a 2M huge leaf entry. After a
> successful demotion, the old 2M huge leaf entry in S-EPT is replaced with a
> non-leaf entry, linking to the newly-added page table page. The newly
> linked page table page then contains 512 leaf entries, pointing to the 2M

2M or 4K?

> guest private pages.
[...]

^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 03/21] x86/virt/tdx: Add SEAMCALL wrapper tdh_mem_page_demote()
  2025-04-25  7:12   ` Binbin Wu
@ 2025-04-25  7:17     ` Yan Zhao
  2025-04-25  7:25       ` Binbin Wu
  0 siblings, 1 reply; 294+ messages in thread
From: Yan Zhao @ 2025-04-25  7:17 UTC (permalink / raw)
  To: Binbin Wu
  Cc: pbonzini, seanjc, linux-kernel, kvm, x86, rick.p.edgecombe,
	dave.hansen, kirill.shutemov, tabba, ackerleytng, quic_eberman,
	michael.roth, david, vannapurve, vbabka, jroedel, thomas.lendacky,
	pgonda, zhiquan1.li, fan.du, jun.miao, ira.weiny, isaku.yamahata,
	xiaoyao.li, chao.p.peng

On Fri, Apr 25, 2025 at 03:12:32PM +0800, Binbin Wu wrote:
> 
> 
> On 4/24/2025 11:04 AM, Yan Zhao wrote:
> > From: Xiaoyao Li <xiaoyao.li@intel.com>
> > 
> > Add a wrapper tdh_mem_page_demote() to invoke SEAMCALL TDH_MEM_PAGE_DEMOTE
> > to demote a huge leaf entry to a non-leaf entry in S-EPT. Currently, the
> > TDX module only supports demotion of a 2M huge leaf entry. After a
> > successful demotion, the old 2M huge leaf entry in S-EPT is replaced with a
> > non-leaf entry, linking to the newly-added page table page. The newly
> > linked page table page then contains 512 leaf entries, pointing to the 2M
> 
> 2M or 4K?
The 512 leaf entries point to 2M guest private pages together, each pointing to
4K.

> > guest private pages.

^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 02/21] x86/virt/tdx: Enhance tdh_mem_page_aug() to support huge pages
  2025-04-25  6:51   ` Binbin Wu
@ 2025-04-25  7:19     ` Yan Zhao
  0 siblings, 0 replies; 294+ messages in thread
From: Yan Zhao @ 2025-04-25  7:19 UTC (permalink / raw)
  To: Binbin Wu
  Cc: pbonzini, seanjc, linux-kernel, kvm, x86, rick.p.edgecombe,
	dave.hansen, kirill.shutemov, tabba, ackerleytng, quic_eberman,
	michael.roth, david, vannapurve, vbabka, jroedel, thomas.lendacky,
	pgonda, zhiquan1.li, fan.du, jun.miao, ira.weiny, isaku.yamahata,
	xiaoyao.li, chao.p.peng

On Fri, Apr 25, 2025 at 02:51:18PM +0800, Binbin Wu wrote:
> 
> 
> On 4/24/2025 11:04 AM, Yan Zhao wrote:
> > Enhance the SEAMCALL wrapper tdh_mem_page_aug() to support huge pages.
> > 
> > Verify the validity of the level and ensure that the mapping range is fully
> > contained within the page folio.
> > 
> > As a conservative solution, perform CLFLUSH on all pages to be mapped into
> > the TD before invoking the SEAMCALL TDH_MEM_PAGE_AUG. This ensures that any
> > dirty cache lines do not write back later and clobber TD memory.
> > 
> > Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
> > Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> > Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
> > ---
> >   arch/x86/virt/vmx/tdx/tdx.c | 11 ++++++++++-
> >   1 file changed, 10 insertions(+), 1 deletion(-)
> > 
> > diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> > index f5e2a937c1e7..a66d501b5677 100644
> > --- a/arch/x86/virt/vmx/tdx/tdx.c
> > +++ b/arch/x86/virt/vmx/tdx/tdx.c
> > @@ -1595,9 +1595,18 @@ u64 tdh_mem_page_aug(struct tdx_td *td, u64 gpa, int level, struct page *page, u
> >   		.rdx = tdx_tdr_pa(td),
> >   		.r8 = page_to_phys(page),
> >   	};
> > +	unsigned long nr_pages = 1 << (level * 9);
> > +	struct folio *folio = page_folio(page);
> > +	unsigned long idx = 0;
> >   	u64 ret;
> > -	tdx_clflush_page(page);
> > +	if (!(level >= TDX_PS_4K && level < TDX_PS_NR) ||
> > +	    (folio_page_idx(folio, page) + nr_pages > folio_nr_pages(folio)))
> > +		return -EINVAL;
> > +
> > +	while (nr_pages--)
> > +		tdx_clflush_page(nth_page(page, idx++));
> Is the following better to save a variable?
> 
> while (nr_pages)
>     tdx_clflush_page(nth_page(page, --nr_pages));

Looks better, except that it performs the clflush in reverse order :)
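
If forward order matters, one option (keeping both variables, as in the
original) would be roughly:

	while (idx < nr_pages)
		tdx_clflush_page(nth_page(page, idx++));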

> 
> > +
> >   	ret = seamcall_ret(TDH_MEM_PAGE_AUG, &args);
> >   	*ext_err1 = args.rcx;
> 

^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 03/21] x86/virt/tdx: Add SEAMCALL wrapper tdh_mem_page_demote()
  2025-04-25  7:17     ` Yan Zhao
@ 2025-04-25  7:25       ` Binbin Wu
  2025-04-25  9:24         ` Yan Zhao
  0 siblings, 1 reply; 294+ messages in thread
From: Binbin Wu @ 2025-04-25  7:25 UTC (permalink / raw)
  To: Yan Zhao
  Cc: pbonzini, seanjc, linux-kernel, kvm, x86, rick.p.edgecombe,
	dave.hansen, kirill.shutemov, tabba, ackerleytng, quic_eberman,
	michael.roth, david, vannapurve, vbabka, jroedel, thomas.lendacky,
	pgonda, zhiquan1.li, fan.du, jun.miao, ira.weiny, isaku.yamahata,
	xiaoyao.li, chao.p.peng



On 4/25/2025 3:17 PM, Yan Zhao wrote:
> On Fri, Apr 25, 2025 at 03:12:32PM +0800, Binbin Wu wrote:
>>
>> On 4/24/2025 11:04 AM, Yan Zhao wrote:
>>> From: Xiaoyao Li <xiaoyao.li@intel.com>
>>>
>>> Add a wrapper tdh_mem_page_demote() to invoke SEAMCALL TDH_MEM_PAGE_DEMOTE
>>> to demote a huge leaf entry to a non-leaf entry in S-EPT. Currently, the
>>> TDX module only supports demotion of a 2M huge leaf entry. After a
>>> successful demotion, the old 2M huge leaf entry in S-EPT is replaced with a
>>> non-leaf entry, linking to the newly-added page table page. The newly
>>> linked page table page then contains 512 leaf entries, pointing to the 2M
>> 2M or 4K?
> The 512 leaf entries point to 2M guest private pages together,
If this, it should be 2M range, since it's not a huge page after demotion.
Also, the plural "pages" is confusing.

>   each pointing to
> 4K.
>
>>> guest private pages.


^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 03/21] x86/virt/tdx: Add SEAMCALL wrapper tdh_mem_page_demote()
  2025-04-25  7:25       ` Binbin Wu
@ 2025-04-25  9:24         ` Yan Zhao
  0 siblings, 0 replies; 294+ messages in thread
From: Yan Zhao @ 2025-04-25  9:24 UTC (permalink / raw)
  To: Binbin Wu
  Cc: pbonzini, seanjc, linux-kernel, kvm, x86, rick.p.edgecombe,
	dave.hansen, kirill.shutemov, tabba, ackerleytng, quic_eberman,
	michael.roth, david, vannapurve, vbabka, jroedel, thomas.lendacky,
	pgonda, zhiquan1.li, fan.du, jun.miao, ira.weiny, isaku.yamahata,
	xiaoyao.li, chao.p.peng

On Fri, Apr 25, 2025 at 03:25:12PM +0800, Binbin Wu wrote:
> 
> 
> On 4/25/2025 3:17 PM, Yan Zhao wrote:
> > On Fri, Apr 25, 2025 at 03:12:32PM +0800, Binbin Wu wrote:
> > > 
> > > On 4/24/2025 11:04 AM, Yan Zhao wrote:
> > > > From: Xiaoyao Li <xiaoyao.li@intel.com>
> > > > 
> > > > Add a wrapper tdh_mem_page_demote() to invoke SEAMCALL TDH_MEM_PAGE_DEMOTE
> > > > to demote a huge leaf entry to a non-leaf entry in S-EPT. Currently, the
> > > > TDX module only supports demotion of a 2M huge leaf entry. After a
> > > > successful demotion, the old 2M huge leaf entry in S-EPT is replaced with a
> > > > non-leaf entry, linking to the newly-added page table page. The newly
> > > > linked page table page then contains 512 leaf entries, pointing to the 2M
> > > 2M or 4K?
> > The 512 leaf entries point to 2M guest private pages together,
> If this, it should be 2M range, since it's not a huge page after demotion.
> Also, the plural "pages" is confusing.
Ah, indeed, the plural "pages" is confusing :)
Maybe below is better:

The newly linked page table now contains 512 leaf entries, each pointing to a 4K
guest private page within the 2M range.


> >   each pointing to
> > 4K.
> > 
> > > > guest private pages.
> 

^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge pages
  2025-04-24  3:06 ` [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge pages Yan Zhao
@ 2025-04-29  0:17   ` Vishal Annapurve
  2025-04-29  0:49     ` Yan Zhao
  0 siblings, 1 reply; 294+ messages in thread
From: Vishal Annapurve @ 2025-04-29  0:17 UTC (permalink / raw)
  To: Yan Zhao
  Cc: pbonzini, seanjc, linux-kernel, kvm, x86, rick.p.edgecombe,
	dave.hansen, kirill.shutemov, tabba, ackerleytng, quic_eberman,
	michael.roth, david, vbabka, jroedel, thomas.lendacky, pgonda,
	zhiquan1.li, fan.du, jun.miao, ira.weiny, isaku.yamahata,
	xiaoyao.li, binbin.wu, chao.p.peng

On Wed, Apr 23, 2025 at 8:07 PM Yan Zhao <yan.y.zhao@intel.com> wrote:
>
> Increase folio ref count before mapping a private page, and decrease
> folio ref count after a mapping failure or successfully removing a private
> page.
>
> The folio ref count to inc/dec corresponds to the mapping/unmapping level,
> ensuring the folio ref count remains balanced after entry splitting or
> merging.
>
> Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
> ---
>  arch/x86/kvm/vmx/tdx.c | 19 ++++++++++---------
>  1 file changed, 10 insertions(+), 9 deletions(-)
>
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index 355b21fc169f..e23dce59fc72 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -1501,9 +1501,9 @@ void tdx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa, int pgd_level)
>         td_vmcs_write64(to_tdx(vcpu), SHARED_EPT_POINTER, root_hpa);
>  }
>
> -static void tdx_unpin(struct kvm *kvm, struct page *page)
> +static void tdx_unpin(struct kvm *kvm, struct page *page, int level)
>  {
> -       put_page(page);
> +       folio_put_refs(page_folio(page), KVM_PAGES_PER_HPAGE(level));
>  }
>
>  static int tdx_mem_page_aug(struct kvm *kvm, gfn_t gfn,
> @@ -1517,13 +1517,13 @@ static int tdx_mem_page_aug(struct kvm *kvm, gfn_t gfn,
>
>         err = tdh_mem_page_aug(&kvm_tdx->td, gpa, tdx_level, page, &entry, &level_state);
>         if (unlikely(tdx_operand_busy(err))) {
> -               tdx_unpin(kvm, page);
> +               tdx_unpin(kvm, page, level);
>                 return -EBUSY;
>         }
>
>         if (KVM_BUG_ON(err, kvm)) {
>                 pr_tdx_error_2(TDH_MEM_PAGE_AUG, err, entry, level_state);
> -               tdx_unpin(kvm, page);
> +               tdx_unpin(kvm, page, level);
>                 return -EIO;
>         }
>
> @@ -1570,10 +1570,11 @@ int tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn,
>          * a_ops->migrate_folio (yet), no callback is triggered for KVM on page
>          * migration.  Until guest_memfd supports page migration, prevent page
>          * migration.
> -        * TODO: Once guest_memfd introduces callback on page migration,
> -        * implement it and remove get_page/put_page().
> +        * TODO: To support in-place-conversion in gmem in future, remove
> +        * folio_ref_add()/folio_put_refs().

With necessary infrastructure in guest_memfd [1] to prevent page
migration, is it necessary to acquire extra folio refcounts? If not,
why not just cleanup this logic now?

[1] https://git.kernel.org/pub/scm/virt/kvm/kvm.git/tree/virt/kvm/guest_memfd.c?h=kvm-coco-queue#n441

> Only increase the folio ref count
> +        * when there're errors during removing private pages.
>          */
> -       get_page(page);
> +       folio_ref_add(page_folio(page), KVM_PAGES_PER_HPAGE(level));
>
>         /*
>          * Read 'pre_fault_allowed' before 'kvm_tdx->state'; see matching
> @@ -1647,7 +1648,7 @@ static int tdx_sept_drop_private_spte(struct kvm *kvm, gfn_t gfn,
>                 return -EIO;
>
>         tdx_clear_page(page, level);
> -       tdx_unpin(kvm, page);
> +       tdx_unpin(kvm, page, level);
>         return 0;
>  }
>
> @@ -1727,7 +1728,7 @@ static int tdx_sept_zap_private_spte(struct kvm *kvm, gfn_t gfn,
>         if (tdx_is_sept_zap_err_due_to_premap(kvm_tdx, err, entry, level) &&
>             !KVM_BUG_ON(!atomic64_read(&kvm_tdx->nr_premapped), kvm)) {
>                 atomic64_dec(&kvm_tdx->nr_premapped);
> -               tdx_unpin(kvm, page);
> +               tdx_unpin(kvm, page, level);
>                 return 0;
>         }
>
> --
> 2.43.2
>

^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge pages
  2025-04-29  0:17   ` Vishal Annapurve
@ 2025-04-29  0:49     ` Yan Zhao
  2025-04-29 13:46       ` Vishal Annapurve
  0 siblings, 1 reply; 294+ messages in thread
From: Yan Zhao @ 2025-04-29  0:49 UTC (permalink / raw)
  To: Vishal Annapurve
  Cc: pbonzini, seanjc, linux-kernel, kvm, x86, rick.p.edgecombe,
	dave.hansen, kirill.shutemov, tabba, ackerleytng, quic_eberman,
	michael.roth, david, vbabka, jroedel, thomas.lendacky, pgonda,
	zhiquan1.li, fan.du, jun.miao, ira.weiny, isaku.yamahata,
	xiaoyao.li, binbin.wu, chao.p.peng

On Mon, Apr 28, 2025 at 05:17:16PM -0700, Vishal Annapurve wrote:
> On Wed, Apr 23, 2025 at 8:07 PM Yan Zhao <yan.y.zhao@intel.com> wrote:
> >
> > Increase folio ref count before mapping a private page, and decrease
> > folio ref count after a mapping failure or successfully removing a private
> > page.
> >
> > The folio ref count to inc/dec corresponds to the mapping/unmapping level,
> > ensuring the folio ref count remains balanced after entry splitting or
> > merging.
> >
> > Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
> > Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> > Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
> > ---
> >  arch/x86/kvm/vmx/tdx.c | 19 ++++++++++---------
> >  1 file changed, 10 insertions(+), 9 deletions(-)
> >
> > diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> > index 355b21fc169f..e23dce59fc72 100644
> > --- a/arch/x86/kvm/vmx/tdx.c
> > +++ b/arch/x86/kvm/vmx/tdx.c
> > @@ -1501,9 +1501,9 @@ void tdx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa, int pgd_level)
> >         td_vmcs_write64(to_tdx(vcpu), SHARED_EPT_POINTER, root_hpa);
> >  }
> >
> > -static void tdx_unpin(struct kvm *kvm, struct page *page)
> > +static void tdx_unpin(struct kvm *kvm, struct page *page, int level)
> >  {
> > -       put_page(page);
> > +       folio_put_refs(page_folio(page), KVM_PAGES_PER_HPAGE(level));
> >  }
> >
> >  static int tdx_mem_page_aug(struct kvm *kvm, gfn_t gfn,
> > @@ -1517,13 +1517,13 @@ static int tdx_mem_page_aug(struct kvm *kvm, gfn_t gfn,
> >
> >         err = tdh_mem_page_aug(&kvm_tdx->td, gpa, tdx_level, page, &entry, &level_state);
> >         if (unlikely(tdx_operand_busy(err))) {
> > -               tdx_unpin(kvm, page);
> > +               tdx_unpin(kvm, page, level);
> >                 return -EBUSY;
> >         }
> >
> >         if (KVM_BUG_ON(err, kvm)) {
> >                 pr_tdx_error_2(TDH_MEM_PAGE_AUG, err, entry, level_state);
> > -               tdx_unpin(kvm, page);
> > +               tdx_unpin(kvm, page, level);
> >                 return -EIO;
> >         }
> >
> > @@ -1570,10 +1570,11 @@ int tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn,
> >          * a_ops->migrate_folio (yet), no callback is triggered for KVM on page
> >          * migration.  Until guest_memfd supports page migration, prevent page
> >          * migration.
> > -        * TODO: Once guest_memfd introduces callback on page migration,
> > -        * implement it and remove get_page/put_page().
> > +        * TODO: To support in-place-conversion in gmem in future, remove
> > +        * folio_ref_add()/folio_put_refs().
> 
> With necessary infrastructure in guest_memfd [1] to prevent page
> migration, is it necessary to acquire extra folio refcounts? If not,
> why not just cleanup this logic now?
Though the old comment says the ref is acquired to prevent page migration, the
other reason is to prevent the folio from being returned to the OS until it has
been successfully removed from TDX.

If there's an error during the removal or reclaiming of a folio from TDX, such
as a failure in tdh_mem_page_remove()/tdh_phymem_page_wbinvd_hkid() or
tdh_phymem_page_reclaim(), it is important to hold the page refcount within TDX.

So, we plan to remove folio_ref_add()/folio_put_refs() in future, only invoking
folio_ref_add() in the event of a removal failure.
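
Roughly something like the below in the removal error paths (just a sketch of
the idea; the exact error condition and placement are TBD):

	/* Keep the folio pinned only if removing the page from TDX failed. */
	if (unlikely(err)) {
		folio_ref_add(page_folio(page), KVM_PAGES_PER_HPAGE(level));
		return -EIO;
	}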

> [1] https://git.kernel.org/pub/scm/virt/kvm/kvm.git/tree/virt/kvm/guest_memfd.c?h=kvm-coco-queue#n441
> 
> > Only increase the folio ref count
> > +        * when there're errors during removing private pages.
> >          */
> > -       get_page(page);
> > +       folio_ref_add(page_folio(page), KVM_PAGES_PER_HPAGE(level));
> >
> >         /*
> >          * Read 'pre_fault_allowed' before 'kvm_tdx->state'; see matching
> > @@ -1647,7 +1648,7 @@ static int tdx_sept_drop_private_spte(struct kvm *kvm, gfn_t gfn,
> >                 return -EIO;
> >
> >         tdx_clear_page(page, level);
> > -       tdx_unpin(kvm, page);
> > +       tdx_unpin(kvm, page, level);
> >         return 0;
> >  }
> >
> > @@ -1727,7 +1728,7 @@ static int tdx_sept_zap_private_spte(struct kvm *kvm, gfn_t gfn,
> >         if (tdx_is_sept_zap_err_due_to_premap(kvm_tdx, err, entry, level) &&
> >             !KVM_BUG_ON(!atomic64_read(&kvm_tdx->nr_premapped), kvm)) {
> >                 atomic64_dec(&kvm_tdx->nr_premapped);
> > -               tdx_unpin(kvm, page);
> > +               tdx_unpin(kvm, page, level);
> >                 return 0;
> >         }
> >
> > --
> > 2.43.2
> >

^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge pages
  2025-04-29  0:49     ` Yan Zhao
@ 2025-04-29 13:46       ` Vishal Annapurve
  2025-05-06  0:53         ` Yan Zhao
  0 siblings, 1 reply; 294+ messages in thread
From: Vishal Annapurve @ 2025-04-29 13:46 UTC (permalink / raw)
  To: Yan Zhao
  Cc: pbonzini, seanjc, linux-kernel, kvm, x86, rick.p.edgecombe,
	dave.hansen, kirill.shutemov, tabba, ackerleytng, quic_eberman,
	michael.roth, david, vbabka, jroedel, thomas.lendacky, pgonda,
	zhiquan1.li, fan.du, jun.miao, ira.weiny, isaku.yamahata,
	xiaoyao.li, binbin.wu, chao.p.peng

On Mon, Apr 28, 2025 at 5:52 PM Yan Zhao <yan.y.zhao@intel.com> wrote:
>
> On Mon, Apr 28, 2025 at 05:17:16PM -0700, Vishal Annapurve wrote:
> > On Wed, Apr 23, 2025 at 8:07 PM Yan Zhao <yan.y.zhao@intel.com> wrote:
> > >
> > > Increase folio ref count before mapping a private page, and decrease
> > > folio ref count after a mapping failure or successfully removing a private
> > > page.
> > >
> > > The folio ref count to inc/dec corresponds to the mapping/unmapping level,
> > > ensuring the folio ref count remains balanced after entry splitting or
> > > merging.
> > >
> > > Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
> > > Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> > > Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
> > > ---
> > >  arch/x86/kvm/vmx/tdx.c | 19 ++++++++++---------
> > >  1 file changed, 10 insertions(+), 9 deletions(-)
> > >
> > > diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> > > index 355b21fc169f..e23dce59fc72 100644
> > > --- a/arch/x86/kvm/vmx/tdx.c
> > > +++ b/arch/x86/kvm/vmx/tdx.c
> > > @@ -1501,9 +1501,9 @@ void tdx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa, int pgd_level)
> > >         td_vmcs_write64(to_tdx(vcpu), SHARED_EPT_POINTER, root_hpa);
> > >  }
> > >
> > > -static void tdx_unpin(struct kvm *kvm, struct page *page)
> > > +static void tdx_unpin(struct kvm *kvm, struct page *page, int level)
> > >  {
> > > -       put_page(page);
> > > +       folio_put_refs(page_folio(page), KVM_PAGES_PER_HPAGE(level));
> > >  }
> > >
> > >  static int tdx_mem_page_aug(struct kvm *kvm, gfn_t gfn,
> > > @@ -1517,13 +1517,13 @@ static int tdx_mem_page_aug(struct kvm *kvm, gfn_t gfn,
> > >
> > >         err = tdh_mem_page_aug(&kvm_tdx->td, gpa, tdx_level, page, &entry, &level_state);
> > >         if (unlikely(tdx_operand_busy(err))) {
> > > -               tdx_unpin(kvm, page);
> > > +               tdx_unpin(kvm, page, level);
> > >                 return -EBUSY;
> > >         }
> > >
> > >         if (KVM_BUG_ON(err, kvm)) {
> > >                 pr_tdx_error_2(TDH_MEM_PAGE_AUG, err, entry, level_state);
> > > -               tdx_unpin(kvm, page);
> > > +               tdx_unpin(kvm, page, level);
> > >                 return -EIO;
> > >         }
> > >
> > > @@ -1570,10 +1570,11 @@ int tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn,
> > >          * a_ops->migrate_folio (yet), no callback is triggered for KVM on page
> > >          * migration.  Until guest_memfd supports page migration, prevent page
> > >          * migration.
> > > -        * TODO: Once guest_memfd introduces callback on page migration,
> > > -        * implement it and remove get_page/put_page().
> > > +        * TODO: To support in-place-conversion in gmem in future, remove
> > > +        * folio_ref_add()/folio_put_refs().
> >
> > With necessary infrastructure in guest_memfd [1] to prevent page
> > migration, is it necessary to acquire extra folio refcounts? If not,
> > why not just cleanup this logic now?
> Though the old comment says the ref is acquired to prevent page migration, the other
> reason is to prevent the folio from being returned to the OS until it has been
> successfully removed from TDX.
>
> If there's an error during the removal or reclaiming of a folio from TDX, such
> as a failure in tdh_mem_page_remove()/tdh_phymem_page_wbinvd_hkid() or
> tdh_phymem_page_reclaim(), it is important to hold the page refcount within TDX.
>
> So, we plan to remove folio_ref_add()/folio_put_refs() in future, only invoking
> folio_ref_add() in the event of a removal failure.

In my opinion, the above scheme can be deployed with this series
itself. guest_memfd will not take away memory from TDX VMs without an
invalidation. folio_ref_add() will not work for memory not backed by
page structs, but that problem can be solved in future possibly by
notifying guest_memfd of certain ranges being in use even after
invalidation completes.


>
> > [1] https://git.kernel.org/pub/scm/virt/kvm/kvm.git/tree/virt/kvm/guest_memfd.c?h=kvm-coco-queue#n441
> >
> > > Only increase the folio ref count
> > > +        * when there're errors during removing private pages.
> > >          */
> > > -       get_page(page);
> > > +       folio_ref_add(page_folio(page), KVM_PAGES_PER_HPAGE(level));
> > >
> > >         /*
> > >          * Read 'pre_fault_allowed' before 'kvm_tdx->state'; see matching
> > > @@ -1647,7 +1648,7 @@ static int tdx_sept_drop_private_spte(struct kvm *kvm, gfn_t gfn,
> > >                 return -EIO;
> > >
> > >         tdx_clear_page(page, level);
> > > -       tdx_unpin(kvm, page);
> > > +       tdx_unpin(kvm, page, level);
> > >         return 0;
> > >  }
> > >
> > > @@ -1727,7 +1728,7 @@ static int tdx_sept_zap_private_spte(struct kvm *kvm, gfn_t gfn,
> > >         if (tdx_is_sept_zap_err_due_to_premap(kvm_tdx, err, entry, level) &&
> > >             !KVM_BUG_ON(!atomic64_read(&kvm_tdx->nr_premapped), kvm)) {
> > >                 atomic64_dec(&kvm_tdx->nr_premapped);
> > > -               tdx_unpin(kvm, page);
> > > +               tdx_unpin(kvm, page, level);
> > >                 return 0;
> > >         }
> > >
> > > --
> > > 2.43.2
> > >

^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge pages
  2025-04-29 13:46       ` Vishal Annapurve
@ 2025-05-06  0:53         ` Yan Zhao
  2025-05-06  5:08           ` Vishal Annapurve
  0 siblings, 1 reply; 294+ messages in thread
From: Yan Zhao @ 2025-05-06  0:53 UTC (permalink / raw)
  To: Vishal Annapurve
  Cc: pbonzini, seanjc, linux-kernel, kvm, x86, rick.p.edgecombe,
	dave.hansen, kirill.shutemov, tabba, ackerleytng, quic_eberman,
	michael.roth, david, vbabka, jroedel, thomas.lendacky, pgonda,
	zhiquan1.li, fan.du, jun.miao, ira.weiny, isaku.yamahata,
	xiaoyao.li, binbin.wu, chao.p.peng

Sorry for the late reply, I was on leave last week.

On Tue, Apr 29, 2025 at 06:46:59AM -0700, Vishal Annapurve wrote:
> On Mon, Apr 28, 2025 at 5:52 PM Yan Zhao <yan.y.zhao@intel.com> wrote:
> > So, we plan to remove folio_ref_add()/folio_put_refs() in future, only invoking
> > folio_ref_add() in the event of a removal failure.
> 
> In my opinion, the above scheme can be deployed with this series
> itself. guest_memfd will not take away memory from TDX VMs without an
I initially intended to add a separate patch at the end of this series to
implement invoking folio_ref_add() only upon a removal failure. However, I
decided against it since it's not a must before guest_memfd supports in-place
conversion.

We can include it in the next version if you think it's better.

> invalidation. folio_ref_add() will not work for memory not backed by
> page structs, but that problem can be solved in future possibly by
With current TDX code, all memory must be backed by a page struct.
Both tdh_mem_page_add() and tdh_mem_page_aug() require a "struct page *" rather
than a pfn.

> notifying guest_memfd of certain ranges being in use even after
> invalidation completes.
A curious question:
To support memory not backed by page structs in future, is there any counterpart
to the page struct to hold ref count and map count?


^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge pages
  2025-05-06  0:53         ` Yan Zhao
@ 2025-05-06  5:08           ` Vishal Annapurve
  2025-05-06  6:04             ` Yan Zhao
  0 siblings, 1 reply; 294+ messages in thread
From: Vishal Annapurve @ 2025-05-06  5:08 UTC (permalink / raw)
  To: Yan Zhao
  Cc: pbonzini, seanjc, linux-kernel, kvm, x86, rick.p.edgecombe,
	dave.hansen, kirill.shutemov, tabba, ackerleytng, quic_eberman,
	michael.roth, david, vbabka, jroedel, thomas.lendacky, pgonda,
	zhiquan1.li, fan.du, jun.miao, ira.weiny, isaku.yamahata,
	xiaoyao.li, binbin.wu, chao.p.peng

On Mon, May 5, 2025 at 5:56 PM Yan Zhao <yan.y.zhao@intel.com> wrote:
>
> Sorry for the late reply, I was on leave last week.
>
> On Tue, Apr 29, 2025 at 06:46:59AM -0700, Vishal Annapurve wrote:
> > On Mon, Apr 28, 2025 at 5:52 PM Yan Zhao <yan.y.zhao@intel.com> wrote:
> > > So, we plan to remove folio_ref_add()/folio_put_refs() in future, only invoking
> > > folio_ref_add() in the event of a removal failure.
> >
> > In my opinion, the above scheme can be deployed with this series
> > itself. guest_memfd will not take away memory from TDX VMs without an
> I initially intended to add a separate patch at the end of this series to
> implement invoking folio_ref_add() only upon a removal failure. However, I
> decided against it since it's not a must before guest_memfd supports in-place
> conversion.
>
> We can include it in the next version If you think it's better.

Ackerley is planning to send out a series for 1G Hugetlb support with
guest memfd soon, hopefully this week. Plus I don't see any reason to
hold extra refcounts in TDX stack so it would be good to clean up this
logic.

>
> > invalidation. folio_ref_add() will not work for memory not backed by
> > page structs, but that problem can be solved in future possibly by
> With current TDX code, all memory must be backed by a page struct.
> Both tdh_mem_page_add() and tdh_mem_page_aug() require a "struct page *" rather
> than a pfn.
>
> > notifying guest_memfd of certain ranges being in use even after
> > invalidation completes.
> A curious question:
> To support memory not backed by page structs in future, is there any counterpart
> to the page struct to hold ref count and map count?
>

I imagine the needed support will match similar semantics as VM_PFNMAP
[1] memory. No need to maintain refcounts/map counts for such physical
memory ranges as all users will be notified when mappings are
changed/removed.

Any guest_memfd range updates will result in invalidations/updates of
userspace, guest, IOMMU or any other page tables referring to
guest_memfd backed pfns. This story will become clearer once the
support for PFN range allocator for backing guest_memfd starts getting
discussed.

[1] https://elixir.bootlin.com/linux/v6.14.5/source/mm/memory.c#L6543

^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge pages
  2025-05-06  5:08           ` Vishal Annapurve
@ 2025-05-06  6:04             ` Yan Zhao
  2025-05-06 13:18               ` Vishal Annapurve
  0 siblings, 1 reply; 294+ messages in thread
From: Yan Zhao @ 2025-05-06  6:04 UTC (permalink / raw)
  To: Vishal Annapurve
  Cc: pbonzini, seanjc, linux-kernel, kvm, x86, rick.p.edgecombe,
	dave.hansen, kirill.shutemov, tabba, ackerleytng, quic_eberman,
	michael.roth, david, vbabka, jroedel, thomas.lendacky, pgonda,
	zhiquan1.li, fan.du, jun.miao, ira.weiny, isaku.yamahata,
	xiaoyao.li, binbin.wu, chao.p.peng

On Mon, May 05, 2025 at 10:08:24PM -0700, Vishal Annapurve wrote:
> On Mon, May 5, 2025 at 5:56 PM Yan Zhao <yan.y.zhao@intel.com> wrote:
> >
> > Sorry for the late reply, I was on leave last week.
> >
> > On Tue, Apr 29, 2025 at 06:46:59AM -0700, Vishal Annapurve wrote:
> > > On Mon, Apr 28, 2025 at 5:52 PM Yan Zhao <yan.y.zhao@intel.com> wrote:
> > > > So, we plan to remove folio_ref_add()/folio_put_refs() in future, only invoking
> > > > folio_ref_add() in the event of a removal failure.
> > >
> > > In my opinion, the above scheme can be deployed with this series
> > > itself. guest_memfd will not take away memory from TDX VMs without an
> > I initially intended to add a separate patch at the end of this series to
> > implement invoking folio_ref_add() only upon a removal failure. However, I
> > decided against it since it's not a must before guest_memfd supports in-place
> > conversion.
> >
> > We can include it in the next version If you think it's better.
> 
> Ackerley is planning to send out a series for 1G Hugetlb support with
> guest memfd soon, hopefully this week. Plus I don't see any reason to
> hold extra refcounts in TDX stack so it would be good to clean up this
> logic.
> 
> >
> > > invalidation. folio_ref_add() will not work for memory not backed by
> > > page structs, but that problem can be solved in future possibly by
> > With current TDX code, all memory must be backed by a page struct.
> > Both tdh_mem_page_add() and tdh_mem_page_aug() require a "struct page *" rather
> > than a pfn.
> >
> > > notifying guest_memfd of certain ranges being in use even after
> > > invalidation completes.
> > A curious question:
> > To support memory not backed by page structs in future, is there any counterpart
> > to the page struct to hold ref count and map count?
> >
> 
> I imagine the needed support will match similar semantics as VM_PFNMAP
> [1] memory. No need to maintain refcounts/map counts for such physical
> memory ranges as all users will be notified when mappings are
> changed/removed.
So, it's possible to map such memory in both shared and private EPT
simultaneously?


> Any guest_memfd range updates will result in invalidations/updates of
> userspace, guest, IOMMU or any other page tables referring to
> guest_memfd backed pfns. This story will become clearer once the
> support for PFN range allocator for backing guest_memfd starts getting
> discussed.
OK. It is indeed unclear right now how to support that kind of memory.

Up to now, we don't anticipate TDX will allow any mapping of VM_PFNMAP memory
into private EPT until TDX connect.
And even in that scenario, the memory is only for private MMIO, so the backend
driver is the VFIO PCI driver rather than guest_memfd.


> [1] https://elixir.bootlin.com/linux/v6.14.5/source/mm/memory.c#L6543

^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 07/21] KVM: TDX: Add a helper for WBINVD on huge pages with TD's keyID
  2025-04-24  3:05 ` [RFC PATCH 07/21] KVM: TDX: Add a helper for WBINVD on huge pages with TD's keyID Yan Zhao
@ 2025-05-06  8:37   ` Binbin Wu
  2025-05-16  3:10     ` Yan Zhao
  2025-05-13 19:29   ` Edgecombe, Rick P
  1 sibling, 1 reply; 294+ messages in thread
From: Binbin Wu @ 2025-05-06  8:37 UTC (permalink / raw)
  To: Yan Zhao
  Cc: pbonzini, seanjc, linux-kernel, kvm, x86, rick.p.edgecombe,
	dave.hansen, kirill.shutemov, tabba, ackerleytng, quic_eberman,
	michael.roth, david, vannapurve, vbabka, jroedel, thomas.lendacky,
	pgonda, zhiquan1.li, fan.du, jun.miao, ira.weiny, isaku.yamahata,
	xiaoyao.li, chao.p.peng



On 4/24/2025 11:05 AM, Yan Zhao wrote:
> From: Xiaoyao Li <xiaoyao.li@intel.com>
>
> After a guest page is removed from the S-EPT, KVM calls
> tdh_phymem_page_wbinvd_hkid() to execute WBINVD on the page using the TD's
> keyID.
>
> Add a helper function that takes level information to perform WBINVD on a
> huge page.
>
> [Yan: split patch, added a helper, rebased to use struct page]
> Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
> ---
>   arch/x86/kvm/vmx/tdx.c | 24 +++++++++++++++++++-----
>   1 file changed, 19 insertions(+), 5 deletions(-)
>
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index 69f3140928b5..355b21fc169f 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -1586,6 +1586,23 @@ int tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn,
>   	return tdx_mem_page_record_premap_cnt(kvm, level);
>   }
>   
> +static inline u64 tdx_wbinvd_page(struct kvm *kvm, u64 hkid, struct page *page, int level)
> +{
> +	unsigned long nr = KVM_PAGES_PER_HPAGE(level);
> +	unsigned long idx = 0;
> +	u64 err;
> +
> +	while (nr--) {
> +		err = tdh_phymem_page_wbinvd_hkid(hkid, nth_page(page, idx++));
> +
> +		if (KVM_BUG_ON(err, kvm)) {
> +			pr_tdx_error(TDH_PHYMEM_PAGE_WBINVD, err);
> +			return err;
> +		}
> +	}
> +	return err;
> +}
> +
>   static int tdx_sept_drop_private_spte(struct kvm *kvm, gfn_t gfn,
>   				      enum pg_level level, struct page *page)
>   {
> @@ -1625,12 +1642,9 @@ static int tdx_sept_drop_private_spte(struct kvm *kvm, gfn_t gfn,
>   		return -EIO;
>   	}
>   
> -	err = tdh_phymem_page_wbinvd_hkid((u16)kvm_tdx->hkid, page);
> -
> -	if (KVM_BUG_ON(err, kvm)) {
> -		pr_tdx_error(TDH_PHYMEM_PAGE_WBINVD, err);
> +	err = tdx_wbinvd_page(kvm, kvm_tdx->hkid, page, level);
> +	if (err)

It can add unlikely() here.
Also the err is not used after check, maybe it can be combined as:

if (unlikely(tdx_wbinvd_page(kvm, kvm_tdx->hkid, page, level)))
         return -EIO;


>   		return -EIO;
> -	}
>   
>   	tdx_clear_page(page, level);
>   	tdx_unpin(kvm, page);


^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge pages
  2025-05-06  6:04             ` Yan Zhao
@ 2025-05-06 13:18               ` Vishal Annapurve
  2025-05-07  7:37                 ` Yan Zhao
  0 siblings, 1 reply; 294+ messages in thread
From: Vishal Annapurve @ 2025-05-06 13:18 UTC (permalink / raw)
  To: Yan Zhao
  Cc: pbonzini, seanjc, linux-kernel, kvm, x86, rick.p.edgecombe,
	dave.hansen, kirill.shutemov, tabba, ackerleytng, quic_eberman,
	michael.roth, david, vbabka, jroedel, thomas.lendacky, pgonda,
	zhiquan1.li, fan.du, jun.miao, ira.weiny, isaku.yamahata,
	xiaoyao.li, binbin.wu, chao.p.peng

On Mon, May 5, 2025 at 11:07 PM Yan Zhao <yan.y.zhao@intel.com> wrote:
>
> On Mon, May 05, 2025 at 10:08:24PM -0700, Vishal Annapurve wrote:
> > On Mon, May 5, 2025 at 5:56 PM Yan Zhao <yan.y.zhao@intel.com> wrote:
> > >
> > > Sorry for the late reply, I was on leave last week.
> > >
> > > On Tue, Apr 29, 2025 at 06:46:59AM -0700, Vishal Annapurve wrote:
> > > > On Mon, Apr 28, 2025 at 5:52 PM Yan Zhao <yan.y.zhao@intel.com> wrote:
> > > > > So, we plan to remove folio_ref_add()/folio_put_refs() in future, only invoking
> > > > > folio_ref_add() in the event of a removal failure.
> > > >
> > > > In my opinion, the above scheme can be deployed with this series
> > > > itself. guest_memfd will not take away memory from TDX VMs without an
> > > I initially intended to add a separate patch at the end of this series to
> > > implement invoking folio_ref_add() only upon a removal failure. However, I
> > > decided against it since it's not a must before guest_memfd supports in-place
> > > conversion.
> > >
> > > We can include it in the next version If you think it's better.
> >
> > Ackerley is planning to send out a series for 1G Hugetlb support with
> > guest memfd soon, hopefully this week. Plus I don't see any reason to
> > hold extra refcounts in TDX stack so it would be good to clean up this
> > logic.
> >
> > >
> > > > invalidation. folio_ref_add() will not work for memory not backed by
> > > > page structs, but that problem can be solved in future possibly by
> > > With current TDX code, all memory must be backed by a page struct.
> > > Both tdh_mem_page_add() and tdh_mem_page_aug() require a "struct page *" rather
> > > than a pfn.
> > >
> > > > notifying guest_memfd of certain ranges being in use even after
> > > > invalidation completes.
> > > A curious question:
> > > To support memory not backed by page structs in future, is there any counterpart
> > > to the page struct to hold ref count and map count?
> > >
> >
> > I imagine the needed support will match similar semantics as VM_PFNMAP
> > [1] memory. No need to maintain refcounts/map counts for such physical
> > memory ranges as all users will be notified when mappings are
> > changed/removed.
> So, it's possible to map such memory in both shared and private EPT
> simultaneously?

No, guest_memfd will still ensure that userspace can only fault in
shared memory regions in order to support CoCo VM usecases.

>
>
> > Any guest_memfd range updates will result in invalidations/updates of
> > userspace, guest, IOMMU or any other page tables referring to
> > guest_memfd backed pfns. This story will become clearer once the
> > support for PFN range allocator for backing guest_memfd starts getting
> > discussed.
> Ok. It is indeed unclear right now to support such kind of memory.
>
> Up to now, we don't anticipate TDX will allow any mapping of VM_PFNMAP memory
> into private EPT until TDX connect.

There is a plan to use VM_PFNMAP memory for all guest_memfd
shared/private ranges, orthogonal to the TDX Connect use case. With TDX
Connect/SEV-TIO, the major difference would be that guest_memfd private
ranges will be mapped into IOMMU page tables.

Irrespective of whether/when VM_PFNMAP memory support lands, there
have been discussions [1] about not using page structs for private
memory ranges at all, even with the hugetlb allocator, which would
simplify the seamless merge/split story private hugepages need for
memory conversion. So I think the general direction we should head
towards is not relying on refcounts for guest_memfd private ranges
and/or on page structs altogether.

I think the series [2] that makes KVM work better with PFNMAP'd
physical memory is headed in the right direction of not assuming
page-struct-backed memory ranges, and the same applies to guest_memfd.

[1] https://lore.kernel.org/all/CAGtprH8akKUF=8+RkX_QMjp35C0bU1zxGi4v1Zm5AWCw=8V8AQ@mail.gmail.com/
[2] https://lore.kernel.org/linux-arm-kernel/20241010182427.1434605-1-seanjc@google.com/

> And even in that scenario, the memory is only for private MMIO, so the backend
> driver is VFIO pci driver rather than guest_memfd.

Not necessarily. As I mentioned above, guest_memfd ranges will be
backed by VM_PFNMAP memory.

>
>
> > [1] https://elixir.bootlin.com/linux/v6.14.5/source/mm/memory.c#L6543

^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge pages
  2025-05-06 13:18               ` Vishal Annapurve
@ 2025-05-07  7:37                 ` Yan Zhao
  2025-05-07 14:56                   ` Vishal Annapurve
  0 siblings, 1 reply; 294+ messages in thread
From: Yan Zhao @ 2025-05-07  7:37 UTC (permalink / raw)
  To: Vishal Annapurve
  Cc: pbonzini, seanjc, linux-kernel, kvm, x86, rick.p.edgecombe,
	dave.hansen, kirill.shutemov, tabba, ackerleytng, quic_eberman,
	michael.roth, david, vbabka, jroedel, thomas.lendacky, pgonda,
	zhiquan1.li, fan.du, jun.miao, ira.weiny, isaku.yamahata,
	xiaoyao.li, binbin.wu, chao.p.peng

On Tue, May 06, 2025 at 06:18:55AM -0700, Vishal Annapurve wrote:
> On Mon, May 5, 2025 at 11:07 PM Yan Zhao <yan.y.zhao@intel.com> wrote:
> >
> > On Mon, May 05, 2025 at 10:08:24PM -0700, Vishal Annapurve wrote:
> > > On Mon, May 5, 2025 at 5:56 PM Yan Zhao <yan.y.zhao@intel.com> wrote:
> > > >
> > > > Sorry for the late reply, I was on leave last week.
> > > >
> > > > On Tue, Apr 29, 2025 at 06:46:59AM -0700, Vishal Annapurve wrote:
> > > > > On Mon, Apr 28, 2025 at 5:52 PM Yan Zhao <yan.y.zhao@intel.com> wrote:
> > > > > > So, we plan to remove folio_ref_add()/folio_put_refs() in future, only invoking
> > > > > > folio_ref_add() in the event of a removal failure.
> > > > >
> > > > > In my opinion, the above scheme can be deployed with this series
> > > > > itself. guest_memfd will not take away memory from TDX VMs without an
> > > > I initially intended to add a separate patch at the end of this series to
> > > > implement invoking folio_ref_add() only upon a removal failure. However, I
> > > > decided against it since it's not a must before guest_memfd supports in-place
> > > > conversion.
> > > >
> > > > We can include it in the next version If you think it's better.
> > >
> > > Ackerley is planning to send out a series for 1G Hugetlb support with
> > > guest memfd soon, hopefully this week. Plus I don't see any reason to
> > > hold extra refcounts in TDX stack so it would be good to clean up this
> > > logic.
> > >
> > > >
> > > > > invalidation. folio_ref_add() will not work for memory not backed by
> > > > > page structs, but that problem can be solved in future possibly by
> > > > With current TDX code, all memory must be backed by a page struct.
> > > > Both tdh_mem_page_add() and tdh_mem_page_aug() require a "struct page *" rather
> > > > than a pfn.
> > > >
> > > > > notifying guest_memfd of certain ranges being in use even after
> > > > > invalidation completes.
> > > > A curious question:
> > > > To support memory not backed by page structs in future, is there any counterpart
> > > > to the page struct to hold ref count and map count?
> > > >
> > >
> > > I imagine the needed support will match similar semantics as VM_PFNMAP
> > > [1] memory. No need to maintain refcounts/map counts for such physical
> > > memory ranges as all users will be notified when mappings are
> > > changed/removed.
> > So, it's possible to map such memory in both shared and private EPT
> > simultaneously?
> 
> No, guest_memfd will still ensure that userspace can only fault in
> shared memory regions in order to support CoCo VM usecases.
Before guest_memfd converts a PFN from shared to private, how does it ensure
there are no shared mappings? e.g., in [1], it uses the folio reference count
to ensure that.

Or do you believe that by eliminating the struct page, there would be no
GUP, thereby ensuring no shared mappings by requiring all mappers to unmap in
response to a guest_memfd invalidation notification?

As described in Documentation/core-api/pin_user_pages.rst, long-term
pinning users have no need to register an mmu notifier. So why must
users like VFIO register for guest_memfd invalidation notifications?

Besides, how would guest_memfd handle potential unmap failures? E.g.,
what prevents converting a private PFN to shared if TDX hits errors
while unmapping it, or if a device refuses to stop DMAing to it?

Currently, guest_memfd can rely on page ref count to avoid re-assigning a PFN
that fails to be unmapped.


[1] https://lore.kernel.org/all/20250328153133.3504118-5-tabba@google.com/


> >
> >
> > > Any guest_memfd range updates will result in invalidations/updates of
> > > userspace, guest, IOMMU or any other page tables referring to
> > > guest_memfd backed pfns. This story will become clearer once the
> > > support for PFN range allocator for backing guest_memfd starts getting
> > > discussed.
> > Ok. It is indeed unclear right now to support such kind of memory.
> >
> > Up to now, we don't anticipate TDX will allow any mapping of VM_PFNMAP memory
> > into private EPT until TDX connect.
> 
> There is a plan to use VM_PFNMAP memory for all of guest_memfd
> shared/private ranges orthogonal to TDX connect usecase. With TDX
> connect/Sev TIO, major difference would be that guest_memfd private
> ranges will be mapped into IOMMU page tables.
> 
> Irrespective of whether/when VM_PFNMAP memory support lands, there
> have been discussions on not using page structs for private memory
> ranges altogether [1] even with hugetlb allocator, which will simplify
> seamless merge/split story for private hugepages to support memory
> conversion. So I think the general direction we should head towards is
> not relying on refcounts for guest_memfd private ranges and/or page
> structs altogether.
It's fine to use PFNs, but I wonder whether there is a counterpart to
struct page to keep all the necessary info.

 
> I think the series [2] to work better with PFNMAP'd physical memory in
> KVM is in the very right direction of not assuming page struct backed
> memory ranges for guest_memfd as well.
Note: currently, VM_PFNMAP is usually used together with the VM_IO flag.
In KVM, hva_to_pfn_remapped() only applies to
"vma->vm_flags & (VM_IO | VM_PFNMAP)".


> [1] https://lore.kernel.org/all/CAGtprH8akKUF=8+RkX_QMjp35C0bU1zxGi4v1Zm5AWCw=8V8AQ@mail.gmail.com/
> [2] https://lore.kernel.org/linux-arm-kernel/20241010182427.1434605-1-seanjc@google.com/
> 
> > And even in that scenario, the memory is only for private MMIO, so the backend
> > driver is VFIO pci driver rather than guest_memfd.
> 
> Not necessary. As I mentioned above guest_memfd ranges will be backed
> by VM_PFNMAP memory.
> 
> >
> >
> > > [1] https://elixir.bootlin.com/linux/v6.14.5/source/mm/memory.c#L6543

^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge pages
  2025-05-07  7:37                 ` Yan Zhao
@ 2025-05-07 14:56                   ` Vishal Annapurve
  2025-05-08  1:30                     ` Yan Zhao
  0 siblings, 1 reply; 294+ messages in thread
From: Vishal Annapurve @ 2025-05-07 14:56 UTC (permalink / raw)
  To: Yan Zhao
  Cc: pbonzini, seanjc, linux-kernel, kvm, x86, rick.p.edgecombe,
	dave.hansen, kirill.shutemov, tabba, ackerleytng, quic_eberman,
	michael.roth, david, vbabka, jroedel, thomas.lendacky, pgonda,
	zhiquan1.li, fan.du, jun.miao, ira.weiny, isaku.yamahata,
	xiaoyao.li, binbin.wu, chao.p.peng

On Wed, May 7, 2025 at 12:39 AM Yan Zhao <yan.y.zhao@intel.com> wrote:
>
> On Tue, May 06, 2025 at 06:18:55AM -0700, Vishal Annapurve wrote:
> > On Mon, May 5, 2025 at 11:07 PM Yan Zhao <yan.y.zhao@intel.com> wrote:
> > >
> > > On Mon, May 05, 2025 at 10:08:24PM -0700, Vishal Annapurve wrote:
> > > > On Mon, May 5, 2025 at 5:56 PM Yan Zhao <yan.y.zhao@intel.com> wrote:
> > > > >
> > > > > Sorry for the late reply, I was on leave last week.
> > > > >
> > > > > On Tue, Apr 29, 2025 at 06:46:59AM -0700, Vishal Annapurve wrote:
> > > > > > On Mon, Apr 28, 2025 at 5:52 PM Yan Zhao <yan.y.zhao@intel.com> wrote:
> > > > > > > So, we plan to remove folio_ref_add()/folio_put_refs() in future, only invoking
> > > > > > > folio_ref_add() in the event of a removal failure.
> > > > > >
> > > > > > In my opinion, the above scheme can be deployed with this series
> > > > > > itself. guest_memfd will not take away memory from TDX VMs without an
> > > > > I initially intended to add a separate patch at the end of this series to
> > > > > implement invoking folio_ref_add() only upon a removal failure. However, I
> > > > > decided against it since it's not a must before guest_memfd supports in-place
> > > > > conversion.
> > > > >
> > > > > We can include it in the next version If you think it's better.
> > > >
> > > > Ackerley is planning to send out a series for 1G Hugetlb support with
> > > > guest memfd soon, hopefully this week. Plus I don't see any reason to
> > > > hold extra refcounts in TDX stack so it would be good to clean up this
> > > > logic.
> > > >
> > > > >
> > > > > > invalidation. folio_ref_add() will not work for memory not backed by
> > > > > > page structs, but that problem can be solved in future possibly by
> > > > > With current TDX code, all memory must be backed by a page struct.
> > > > > Both tdh_mem_page_add() and tdh_mem_page_aug() require a "struct page *" rather
> > > > > than a pfn.
> > > > >
> > > > > > notifying guest_memfd of certain ranges being in use even after
> > > > > > invalidation completes.
> > > > > A curious question:
> > > > > To support memory not backed by page structs in future, is there any counterpart
> > > > > to the page struct to hold ref count and map count?
> > > > >
> > > >
> > > > I imagine the needed support will match similar semantics as VM_PFNMAP
> > > > [1] memory. No need to maintain refcounts/map counts for such physical
> > > > memory ranges as all users will be notified when mappings are
> > > > changed/removed.
> > > So, it's possible to map such memory in both shared and private EPT
> > > simultaneously?
> >
> > No, guest_memfd will still ensure that userspace can only fault in
> > shared memory regions in order to support CoCo VM usecases.
> Before guest_memfd converts a PFN from shared to private, how does it ensure
> there are no shared mappings? e.g., in [1], it uses the folio reference count
> to ensure that.
>
> Or do you believe that by eliminating the struct page, there would be no
> GUP, thereby ensuring no shared mappings by requiring all mappers to unmap in
> response to a guest_memfd invalidation notification?

Yes.

>
> As in Documentation/core-api/pin_user_pages.rst, long-term pinning users have
> no need to register mmu notifier. So why users like VFIO must register
> guest_memfd invalidation notification?

VM_PFNMAP'd memory can't be long-term pinned, so users of such memory
ranges will have to adopt mechanisms to get notified. I think it would
be easy to get new users of guest_memfd to follow this scheme.
Irrespective of whether VM_PFNMAP'd support lands, guest_memfd
hugepage support already needs the stance of "guest_memfd owns all
long-term refcounts on private memory", as discussed at LPC [1].

[1] https://lpc.events/event/18/contributions/1764/attachments/1409/3182/LPC%202024_%201G%20page%20support%20for%20guest_memfd.pdf
(slide 12)

>
> Besides, how would guest_memfd handle potential unmap failures? e.g. what
> happens to prevent converting a private PFN to shared if there are errors when
> TDX unmaps a private PFN or if a device refuses to stop DMAing to a PFN.

Users will have to signal such failures via the invalidation callback
results or other appropriate mechanisms. guest_memfd can relay the
failures up the call chain to userspace.

>
> Currently, guest_memfd can rely on page ref count to avoid re-assigning a PFN
> that fails to be unmapped.
>
>
> [1] https://lore.kernel.org/all/20250328153133.3504118-5-tabba@google.com/
>
>
> > >
> > >
> > > > Any guest_memfd range updates will result in invalidations/updates of
> > > > userspace, guest, IOMMU or any other page tables referring to
> > > > guest_memfd backed pfns. This story will become clearer once the
> > > > support for PFN range allocator for backing guest_memfd starts getting
> > > > discussed.
> > > Ok. It is indeed unclear right now to support such kind of memory.
> > >
> > > Up to now, we don't anticipate TDX will allow any mapping of VM_PFNMAP memory
> > > into private EPT until TDX connect.
> >
> > There is a plan to use VM_PFNMAP memory for all of guest_memfd
> > shared/private ranges orthogonal to TDX connect usecase. With TDX
> > connect/Sev TIO, major difference would be that guest_memfd private
> > ranges will be mapped into IOMMU page tables.
> >
> > Irrespective of whether/when VM_PFNMAP memory support lands, there
> > have been discussions on not using page structs for private memory
> > ranges altogether [1] even with hugetlb allocator, which will simplify
> > seamless merge/split story for private hugepages to support memory
> > conversion. So I think the general direction we should head towards is
> > not relying on refcounts for guest_memfd private ranges and/or page
> > structs altogether.
> It's fine to use PFN, but I wonder if there're counterparts of struct page to
> keep all necessary info.
>

Story will become clearer once VM_PFNMAP'd memory support starts
getting discussed. In case of guest_memfd, there is flexibility to
store metadata for physical ranges within guest_memfd just like
shareability tracking.

>
> > I think the series [2] to work better with PFNMAP'd physical memory in
> > KVM is in the very right direction of not assuming page struct backed
> > memory ranges for guest_memfd as well.
> Note: Currently, VM_PFNMAP is usually used together with flag VM_IO. in KVM
> hva_to_pfn_remapped() only applies to "vma->vm_flags & (VM_IO | VM_PFNMAP)".
>
>
> > [1] https://lore.kernel.org/all/CAGtprH8akKUF=8+RkX_QMjp35C0bU1zxGi4v1Zm5AWCw=8V8AQ@mail.gmail.com/
> > [2] https://lore.kernel.org/linux-arm-kernel/20241010182427.1434605-1-seanjc@google.com/
> >
> > > And even in that scenario, the memory is only for private MMIO, so the backend
> > > driver is VFIO pci driver rather than guest_memfd.
> >
> > Not necessary. As I mentioned above guest_memfd ranges will be backed
> > by VM_PFNMAP memory.
> >
> > >
> > >
> > > > [1] https://elixir.bootlin.com/linux/v6.14.5/source/mm/memory.c#L6543

^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge pages
  2025-05-07 14:56                   ` Vishal Annapurve
@ 2025-05-08  1:30                     ` Yan Zhao
  2025-05-08 14:10                       ` Vishal Annapurve
  0 siblings, 1 reply; 294+ messages in thread
From: Yan Zhao @ 2025-05-08  1:30 UTC (permalink / raw)
  To: Vishal Annapurve
  Cc: pbonzini, seanjc, linux-kernel, kvm, x86, rick.p.edgecombe,
	dave.hansen, kirill.shutemov, tabba, ackerleytng, quic_eberman,
	michael.roth, david, vbabka, jroedel, thomas.lendacky, pgonda,
	zhiquan1.li, fan.du, jun.miao, ira.weiny, isaku.yamahata,
	xiaoyao.li, binbin.wu, chao.p.peng

On Wed, May 07, 2025 at 07:56:08AM -0700, Vishal Annapurve wrote:
> On Wed, May 7, 2025 at 12:39 AM Yan Zhao <yan.y.zhao@intel.com> wrote:
> >
> > On Tue, May 06, 2025 at 06:18:55AM -0700, Vishal Annapurve wrote:
> > > On Mon, May 5, 2025 at 11:07 PM Yan Zhao <yan.y.zhao@intel.com> wrote:
> > > >
> > > > On Mon, May 05, 2025 at 10:08:24PM -0700, Vishal Annapurve wrote:
> > > > > On Mon, May 5, 2025 at 5:56 PM Yan Zhao <yan.y.zhao@intel.com> wrote:
> > > > > >
> > > > > > Sorry for the late reply, I was on leave last week.
> > > > > >
> > > > > > On Tue, Apr 29, 2025 at 06:46:59AM -0700, Vishal Annapurve wrote:
> > > > > > > On Mon, Apr 28, 2025 at 5:52 PM Yan Zhao <yan.y.zhao@intel.com> wrote:
> > > > > > > > So, we plan to remove folio_ref_add()/folio_put_refs() in future, only invoking
> > > > > > > > folio_ref_add() in the event of a removal failure.
> > > > > > >
> > > > > > > In my opinion, the above scheme can be deployed with this series
> > > > > > > itself. guest_memfd will not take away memory from TDX VMs without an
> > > > > > I initially intended to add a separate patch at the end of this series to
> > > > > > implement invoking folio_ref_add() only upon a removal failure. However, I
> > > > > > decided against it since it's not a must before guest_memfd supports in-place
> > > > > > conversion.
> > > > > >
> > > > > > We can include it in the next version If you think it's better.
> > > > >
> > > > > Ackerley is planning to send out a series for 1G Hugetlb support with
> > > > > guest memfd soon, hopefully this week. Plus I don't see any reason to
> > > > > hold extra refcounts in TDX stack so it would be good to clean up this
> > > > > logic.
> > > > >
> > > > > >
> > > > > > > invalidation. folio_ref_add() will not work for memory not backed by
> > > > > > > page structs, but that problem can be solved in future possibly by
> > > > > > With current TDX code, all memory must be backed by a page struct.
> > > > > > Both tdh_mem_page_add() and tdh_mem_page_aug() require a "struct page *" rather
> > > > > > than a pfn.
> > > > > >
> > > > > > > notifying guest_memfd of certain ranges being in use even after
> > > > > > > invalidation completes.
> > > > > > A curious question:
> > > > > > To support memory not backed by page structs in future, is there any counterpart
> > > > > > to the page struct to hold ref count and map count?
> > > > > >
> > > > >
> > > > > I imagine the needed support will match similar semantics as VM_PFNMAP
> > > > > [1] memory. No need to maintain refcounts/map counts for such physical
> > > > > memory ranges as all users will be notified when mappings are
> > > > > changed/removed.
> > > > So, it's possible to map such memory in both shared and private EPT
> > > > simultaneously?
> > >
> > > No, guest_memfd will still ensure that userspace can only fault in
> > > shared memory regions in order to support CoCo VM usecases.
> > Before guest_memfd converts a PFN from shared to private, how does it ensure
> > there are no shared mappings? e.g., in [1], it uses the folio reference count
> > to ensure that.
> >
> > Or do you believe that by eliminating the struct page, there would be no
> > GUP, thereby ensuring no shared mappings by requiring all mappers to unmap in
> > response to a guest_memfd invalidation notification?
> 
> Yes.
> 
> >
> > As in Documentation/core-api/pin_user_pages.rst, long-term pinning users have
> > no need to register mmu notifier. So why users like VFIO must register
> > guest_memfd invalidation notification?
> 
> VM_PFNMAP'd memory can't be long term pinned, so users of such memory
> ranges will have to adopt mechanisms to get notified. I think it would
Hmm, current VFIO does not register any notifier for VM_PFNMAP'd memory.

> be easy to pursue new users of guest_memfd to follow this scheme.
> Irrespective of whether VM_PFNMAP'd support lands, guest_memfd
> hugepage support already needs the stance of: "Guest memfd owns all
> long-term refcounts on private memory" as discussed at LPC [1].
> 
> [1] https://lpc.events/event/18/contributions/1764/attachments/1409/3182/LPC%202024_%201G%20page%20support%20for%20guest_memfd.pdf
> (slide 12)
> 
> >
> > Besides, how would guest_memfd handle potential unmap failures? e.g. what
> > happens to prevent converting a private PFN to shared if there are errors when
> > TDX unmaps a private PFN or if a device refuses to stop DMAing to a PFN.
> 
> Users will have to signal such failures via the invalidation callback
> results or other appropriate mechanisms. guest_memfd can relay the
> failures up the call chain to the userspace.
AFAIK, operations that perform actual unmapping do not allow failure, e.g.
kvm_mmu_unmap_gfn_range(), iopt_area_unfill_domains(),
vfio_iommu_unmap_unpin_all(), vfio_iommu_unmap_unpin_reaccount().

That's why we rely on increasing the folio ref count to reflect such
failures, which are due to unexpected SEAMCALL errors.
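
For illustration, a minimal sketch of that pattern (the helper name is
made up, it is not code from this series): on an unexpected SEAMCALL
error, the folio keeps one extra reference per 4KB page of the failed
mapping, so guest_memfd will not hand the range out again.

static void tdx_hold_folio_on_unmap_failure(struct page *page, int level)
{
	/*
	 * Pin the folio for as many 4KB pages as the failed mapping
	 * covered; guest_memfd then sees a non-base refcount and must
	 * not re-assign or convert the range.
	 */
	folio_ref_add(page_folio(page), KVM_PAGES_PER_HPAGE(level));
}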

> > Currently, guest_memfd can rely on page ref count to avoid re-assigning a PFN
> > that fails to be unmapped.
> >
> >
> > [1] https://lore.kernel.org/all/20250328153133.3504118-5-tabba@google.com/
> >
> >
> > > >
> > > >
> > > > > Any guest_memfd range updates will result in invalidations/updates of
> > > > > userspace, guest, IOMMU or any other page tables referring to
> > > > > guest_memfd backed pfns. This story will become clearer once the
> > > > > support for PFN range allocator for backing guest_memfd starts getting
> > > > > discussed.
> > > > Ok. It is indeed unclear right now to support such kind of memory.
> > > >
> > > > Up to now, we don't anticipate TDX will allow any mapping of VM_PFNMAP memory
> > > > into private EPT until TDX connect.
> > >
> > > There is a plan to use VM_PFNMAP memory for all of guest_memfd
> > > shared/private ranges orthogonal to TDX connect usecase. With TDX
> > > connect/Sev TIO, major difference would be that guest_memfd private
> > > ranges will be mapped into IOMMU page tables.
> > >
> > > Irrespective of whether/when VM_PFNMAP memory support lands, there
> > > have been discussions on not using page structs for private memory
> > > ranges altogether [1] even with hugetlb allocator, which will simplify
> > > seamless merge/split story for private hugepages to support memory
> > > conversion. So I think the general direction we should head towards is
> > > not relying on refcounts for guest_memfd private ranges and/or page
> > > structs altogether.
> > It's fine to use PFN, but I wonder if there're counterparts of struct page to
> > keep all necessary info.
> >
> 
> Story will become clearer once VM_PFNMAP'd memory support starts
> getting discussed. In case of guest_memfd, there is flexibility to
> store metadata for physical ranges within guest_memfd just like
> shareability tracking.
Ok.

> >
> > > I think the series [2] to work better with PFNMAP'd physical memory in
> > > KVM is in the very right direction of not assuming page struct backed
> > > memory ranges for guest_memfd as well.
> > Note: Currently, VM_PFNMAP is usually used together with flag VM_IO. in KVM
> > hva_to_pfn_remapped() only applies to "vma->vm_flags & (VM_IO | VM_PFNMAP)".
> >
> >
> > > [1] https://lore.kernel.org/all/CAGtprH8akKUF=8+RkX_QMjp35C0bU1zxGi4v1Zm5AWCw=8V8AQ@mail.gmail.com/
> > > [2] https://lore.kernel.org/linux-arm-kernel/20241010182427.1434605-1-seanjc@google.com/
> > >
> > > > And even in that scenario, the memory is only for private MMIO, so the backend
> > > > driver is VFIO pci driver rather than guest_memfd.
> > >
> > > Not necessary. As I mentioned above guest_memfd ranges will be backed
> > > by VM_PFNMAP memory.
> > >
> > > >
> > > >
> > > > > [1] https://elixir.bootlin.com/linux/v6.14.5/source/mm/memory.c#L6543
> 

^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge pages
  2025-05-08  1:30                     ` Yan Zhao
@ 2025-05-08 14:10                       ` Vishal Annapurve
  2025-05-09  3:20                         ` Yan Zhao
  0 siblings, 1 reply; 294+ messages in thread
From: Vishal Annapurve @ 2025-05-08 14:10 UTC (permalink / raw)
  To: Yan Zhao
  Cc: pbonzini, seanjc, linux-kernel, kvm, x86, rick.p.edgecombe,
	dave.hansen, kirill.shutemov, tabba, ackerleytng, quic_eberman,
	michael.roth, david, vbabka, jroedel, thomas.lendacky, pgonda,
	zhiquan1.li, fan.du, jun.miao, ira.weiny, isaku.yamahata,
	xiaoyao.li, binbin.wu, chao.p.peng

On Wed, May 7, 2025 at 6:32 PM Yan Zhao <yan.y.zhao@intel.com> wrote:
>
> On Wed, May 07, 2025 at 07:56:08AM -0700, Vishal Annapurve wrote:
> > On Wed, May 7, 2025 at 12:39 AM Yan Zhao <yan.y.zhao@intel.com> wrote:
> > >
> > > On Tue, May 06, 2025 at 06:18:55AM -0700, Vishal Annapurve wrote:
> > > > On Mon, May 5, 2025 at 11:07 PM Yan Zhao <yan.y.zhao@intel.com> wrote:
> > > > >
> > > > > On Mon, May 05, 2025 at 10:08:24PM -0700, Vishal Annapurve wrote:
> > > > > > On Mon, May 5, 2025 at 5:56 PM Yan Zhao <yan.y.zhao@intel.com> wrote:
> > > > > > >
> > > > > > > Sorry for the late reply, I was on leave last week.
> > > > > > >
> > > > > > > On Tue, Apr 29, 2025 at 06:46:59AM -0700, Vishal Annapurve wrote:
> > > > > > > > On Mon, Apr 28, 2025 at 5:52 PM Yan Zhao <yan.y.zhao@intel.com> wrote:
> > > > > > > > > So, we plan to remove folio_ref_add()/folio_put_refs() in future, only invoking
> > > > > > > > > folio_ref_add() in the event of a removal failure.
> > > > > > > >
> > > > > > > > In my opinion, the above scheme can be deployed with this series
> > > > > > > > itself. guest_memfd will not take away memory from TDX VMs without an
> > > > > > > I initially intended to add a separate patch at the end of this series to
> > > > > > > implement invoking folio_ref_add() only upon a removal failure. However, I
> > > > > > > decided against it since it's not a must before guest_memfd supports in-place
> > > > > > > conversion.
> > > > > > >
> > > > > > > We can include it in the next version If you think it's better.
> > > > > >
> > > > > > Ackerley is planning to send out a series for 1G Hugetlb support with
> > > > > > guest memfd soon, hopefully this week. Plus I don't see any reason to
> > > > > > hold extra refcounts in TDX stack so it would be good to clean up this
> > > > > > logic.
> > > > > >
> > > > > > >
> > > > > > > > invalidation. folio_ref_add() will not work for memory not backed by
> > > > > > > > page structs, but that problem can be solved in future possibly by
> > > > > > > With current TDX code, all memory must be backed by a page struct.
> > > > > > > Both tdh_mem_page_add() and tdh_mem_page_aug() require a "struct page *" rather
> > > > > > > than a pfn.
> > > > > > >
> > > > > > > > notifying guest_memfd of certain ranges being in use even after
> > > > > > > > invalidation completes.
> > > > > > > A curious question:
> > > > > > > To support memory not backed by page structs in future, is there any counterpart
> > > > > > > to the page struct to hold ref count and map count?
> > > > > > >
> > > > > >
> > > > > > I imagine the needed support will match similar semantics as VM_PFNMAP
> > > > > > [1] memory. No need to maintain refcounts/map counts for such physical
> > > > > > memory ranges as all users will be notified when mappings are
> > > > > > changed/removed.
> > > > > So, it's possible to map such memory in both shared and private EPT
> > > > > simultaneously?
> > > >
> > > > No, guest_memfd will still ensure that userspace can only fault in
> > > > shared memory regions in order to support CoCo VM usecases.
> > > Before guest_memfd converts a PFN from shared to private, how does it ensure
> > > there are no shared mappings? e.g., in [1], it uses the folio reference count
> > > to ensure that.
> > >
> > > Or do you believe that by eliminating the struct page, there would be no
> > > GUP, thereby ensuring no shared mappings by requiring all mappers to unmap in
> > > response to a guest_memfd invalidation notification?
> >
> > Yes.
> >
> > >
> > > As in Documentation/core-api/pin_user_pages.rst, long-term pinning users have
> > > no need to register mmu notifier. So why users like VFIO must register
> > > guest_memfd invalidation notification?
> >
> > VM_PFNMAP'd memory can't be long term pinned, so users of such memory
> > ranges will have to adopt mechanisms to get notified. I think it would
> Hmm, in current VFIO, it does not register any notifier for VM_PFNMAP'd memory.

I don't completely understand how VM_PFNMAP'd memory is used with
VFIO today. Maybe only MMIO regions are backed by pfnmap today, and the
story for normal memory backed by pfnmap is yet to materialize.

>
> > be easy to pursue new users of guest_memfd to follow this scheme.
> > Irrespective of whether VM_PFNMAP'd support lands, guest_memfd
> > hugepage support already needs the stance of: "Guest memfd owns all
> > long-term refcounts on private memory" as discussed at LPC [1].
> >
> > [1] https://lpc.events/event/18/contributions/1764/attachments/1409/3182/LPC%202024_%201G%20page%20support%20for%20guest_memfd.pdf
> > (slide 12)
> >
> > >
> > > Besides, how would guest_memfd handle potential unmap failures? e.g. what
> > > happens to prevent converting a private PFN to shared if there are errors when
> > > TDX unmaps a private PFN or if a device refuses to stop DMAing to a PFN.
> >
> > Users will have to signal such failures via the invalidation callback
> > results or other appropriate mechanisms. guest_memfd can relay the
> > failures up the call chain to the userspace.
> AFAIK, operations that perform actual unmapping do not allow failure, e.g.
> kvm_mmu_unmap_gfn_range(), iopt_area_unfill_domains(),
> vfio_iommu_unmap_unpin_all(), vfio_iommu_unmap_unpin_reaccount().

Very likely because these operations simply don't fail.

>
> That's why we rely on increasing folio ref count to reflect failure, which are
> due to unexpected SEAMCALL errors.

The TDX stack is adding a scenario where invalidation can fail; a
cleaner solution would be to propagate the result as an invalidation
failure. Another option is to notify guest_memfd out of band to convey
the ranges that failed invalidation.

With in-place conversion supported, even if the refcount is raised for
such pages, they can still get used by the host if guest_memfd is
unaware that the invalidation failed.

>
> > > Currently, guest_memfd can rely on page ref count to avoid re-assigning a PFN
> > > that fails to be unmapped.
> > >
> > >
> > > [1] https://lore.kernel.org/all/20250328153133.3504118-5-tabba@google.com/
> > >
> > >
> > > > >
> > > > >
> > > > > > Any guest_memfd range updates will result in invalidations/updates of
> > > > > > userspace, guest, IOMMU or any other page tables referring to
> > > > > > guest_memfd backed pfns. This story will become clearer once the
> > > > > > support for PFN range allocator for backing guest_memfd starts getting
> > > > > > discussed.
> > > > > Ok. It is indeed unclear right now to support such kind of memory.
> > > > >
> > > > > Up to now, we don't anticipate TDX will allow any mapping of VM_PFNMAP memory
> > > > > into private EPT until TDX connect.
> > > >
> > > > There is a plan to use VM_PFNMAP memory for all of guest_memfd
> > > > shared/private ranges orthogonal to TDX connect usecase. With TDX
> > > > connect/Sev TIO, major difference would be that guest_memfd private
> > > > ranges will be mapped into IOMMU page tables.
> > > >
> > > > Irrespective of whether/when VM_PFNMAP memory support lands, there
> > > > have been discussions on not using page structs for private memory
> > > > ranges altogether [1] even with hugetlb allocator, which will simplify
> > > > seamless merge/split story for private hugepages to support memory
> > > > conversion. So I think the general direction we should head towards is
> > > > not relying on refcounts for guest_memfd private ranges and/or page
> > > > structs altogether.
> > > It's fine to use PFN, but I wonder if there're counterparts of struct page to
> > > keep all necessary info.
> > >
> >
> > Story will become clearer once VM_PFNMAP'd memory support starts
> > getting discussed. In case of guest_memfd, there is flexibility to
> > store metadata for physical ranges within guest_memfd just like
> > shareability tracking.
> Ok.
>
> > >
> > > > I think the series [2] to work better with PFNMAP'd physical memory in
> > > > KVM is in the very right direction of not assuming page struct backed
> > > > memory ranges for guest_memfd as well.
> > > Note: Currently, VM_PFNMAP is usually used together with flag VM_IO. in KVM
> > > hva_to_pfn_remapped() only applies to "vma->vm_flags & (VM_IO | VM_PFNMAP)".
> > >
> > >
> > > > [1] https://lore.kernel.org/all/CAGtprH8akKUF=8+RkX_QMjp35C0bU1zxGi4v1Zm5AWCw=8V8AQ@mail.gmail.com/
> > > > [2] https://lore.kernel.org/linux-arm-kernel/20241010182427.1434605-1-seanjc@google.com/
> > > >
> > > > > And even in that scenario, the memory is only for private MMIO, so the backend
> > > > > driver is VFIO pci driver rather than guest_memfd.
> > > >
> > > > Not necessary. As I mentioned above guest_memfd ranges will be backed
> > > > by VM_PFNMAP memory.
> > > >
> > > > >
> > > > >
> > > > > > [1] https://elixir.bootlin.com/linux/v6.14.5/source/mm/memory.c#L6543
> >

^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge pages
  2025-05-08 14:10                       ` Vishal Annapurve
@ 2025-05-09  3:20                         ` Yan Zhao
  2025-05-09 14:20                           ` Vishal Annapurve
  2025-05-12 19:00                           ` Ackerley Tng
  0 siblings, 2 replies; 294+ messages in thread
From: Yan Zhao @ 2025-05-09  3:20 UTC (permalink / raw)
  To: Vishal Annapurve
  Cc: pbonzini, seanjc, linux-kernel, kvm, x86, rick.p.edgecombe,
	dave.hansen, kirill.shutemov, tabba, ackerleytng, quic_eberman,
	michael.roth, david, vbabka, jroedel, thomas.lendacky, pgonda,
	zhiquan1.li, fan.du, jun.miao, ira.weiny, isaku.yamahata,
	xiaoyao.li, binbin.wu, chao.p.peng

On Thu, May 08, 2025 at 07:10:19AM -0700, Vishal Annapurve wrote:
> On Wed, May 7, 2025 at 6:32 PM Yan Zhao <yan.y.zhao@intel.com> wrote:
> >
> > On Wed, May 07, 2025 at 07:56:08AM -0700, Vishal Annapurve wrote:
> > > On Wed, May 7, 2025 at 12:39 AM Yan Zhao <yan.y.zhao@intel.com> wrote:
> > > >
> > > > On Tue, May 06, 2025 at 06:18:55AM -0700, Vishal Annapurve wrote:
> > > > > On Mon, May 5, 2025 at 11:07 PM Yan Zhao <yan.y.zhao@intel.com> wrote:
> > > > > >
> > > > > > On Mon, May 05, 2025 at 10:08:24PM -0700, Vishal Annapurve wrote:
> > > > > > > On Mon, May 5, 2025 at 5:56 PM Yan Zhao <yan.y.zhao@intel.com> wrote:
> > > > > > > >
> > > > > > > > Sorry for the late reply, I was on leave last week.
> > > > > > > >
> > > > > > > > On Tue, Apr 29, 2025 at 06:46:59AM -0700, Vishal Annapurve wrote:
> > > > > > > > > On Mon, Apr 28, 2025 at 5:52 PM Yan Zhao <yan.y.zhao@intel.com> wrote:
> > > > > > > > > > So, we plan to remove folio_ref_add()/folio_put_refs() in future, only invoking
> > > > > > > > > > folio_ref_add() in the event of a removal failure.
> > > > > > > > >
> > > > > > > > > In my opinion, the above scheme can be deployed with this series
> > > > > > > > > itself. guest_memfd will not take away memory from TDX VMs without an
> > > > > > > > I initially intended to add a separate patch at the end of this series to
> > > > > > > > implement invoking folio_ref_add() only upon a removal failure. However, I
> > > > > > > > decided against it since it's not a must before guest_memfd supports in-place
> > > > > > > > conversion.
> > > > > > > >
> > > > > > > > We can include it in the next version If you think it's better.
> > > > > > >
> > > > > > > Ackerley is planning to send out a series for 1G Hugetlb support with
> > > > > > > guest memfd soon, hopefully this week. Plus I don't see any reason to
> > > > > > > hold extra refcounts in TDX stack so it would be good to clean up this
> > > > > > > logic.
> > > > > > >
> > > > > > > >
> > > > > > > > > invalidation. folio_ref_add() will not work for memory not backed by
> > > > > > > > > page structs, but that problem can be solved in future possibly by
> > > > > > > > With current TDX code, all memory must be backed by a page struct.
> > > > > > > > Both tdh_mem_page_add() and tdh_mem_page_aug() require a "struct page *" rather
> > > > > > > > than a pfn.
> > > > > > > >
> > > > > > > > > notifying guest_memfd of certain ranges being in use even after
> > > > > > > > > invalidation completes.
> > > > > > > > A curious question:
> > > > > > > > To support memory not backed by page structs in future, is there any counterpart
> > > > > > > > to the page struct to hold ref count and map count?
> > > > > > > >
> > > > > > >
> > > > > > > I imagine the needed support will match similar semantics as VM_PFNMAP
> > > > > > > [1] memory. No need to maintain refcounts/map counts for such physical
> > > > > > > memory ranges as all users will be notified when mappings are
> > > > > > > changed/removed.
> > > > > > So, it's possible to map such memory in both shared and private EPT
> > > > > > simultaneously?
> > > > >
> > > > > No, guest_memfd will still ensure that userspace can only fault in
> > > > > shared memory regions in order to support CoCo VM usecases.
> > > > Before guest_memfd converts a PFN from shared to private, how does it ensure
> > > > there are no shared mappings? e.g., in [1], it uses the folio reference count
> > > > to ensure that.
> > > >
> > > > Or do you believe that by eliminating the struct page, there would be no
> > > > GUP, thereby ensuring no shared mappings by requiring all mappers to unmap in
> > > > response to a guest_memfd invalidation notification?
> > >
> > > Yes.
> > >
> > > >
> > > > As in Documentation/core-api/pin_user_pages.rst, long-term pinning users have
> > > > no need to register mmu notifier. So why users like VFIO must register
> > > > guest_memfd invalidation notification?
> > >
> > > VM_PFNMAP'd memory can't be long term pinned, so users of such memory
> > > ranges will have to adopt mechanisms to get notified. I think it would
> > Hmm, in current VFIO, it does not register any notifier for VM_PFNMAP'd memory.
> 
> I don't completely understand how VM_PFNMAP'd memory is used today for
> VFIO. Maybe only MMIO regions are backed by pfnmap today and the story
> for normal memory backed by pfnmap is yet to materialize.
VFIO can fault in VM_PFNMAP'd memory that is not from MMIO regions. It
works because it knows VM_PFNMAP'd memory is always pinned.

Another example is udmabuf (drivers/dma-buf/udmabuf.c): it mmaps normal
folios with the VM_PFNMAP flag without registering an mmu notifier
because those folios are pinned.

> >
> > > be easy to pursue new users of guest_memfd to follow this scheme.
> > > Irrespective of whether VM_PFNMAP'd support lands, guest_memfd
> > > hugepage support already needs the stance of: "Guest memfd owns all
> > > long-term refcounts on private memory" as discussed at LPC [1].
> > >
> > > [1] https://lpc.events/event/18/contributions/1764/attachments/1409/3182/LPC%202024_%201G%20page%20support%20for%20guest_memfd.pdf
> > > (slide 12)
> > >
> > > >
> > > > Besides, how would guest_memfd handle potential unmap failures? e.g. what
> > > > happens to prevent converting a private PFN to shared if there are errors when
> > > > TDX unmaps a private PFN or if a device refuses to stop DMAing to a PFN.
> > >
> > > Users will have to signal such failures via the invalidation callback
> > > results or other appropriate mechanisms. guest_memfd can relay the
> > > failures up the call chain to the userspace.
> > AFAIK, operations that perform actual unmapping do not allow failure, e.g.
> > kvm_mmu_unmap_gfn_range(), iopt_area_unfill_domains(),
> > vfio_iommu_unmap_unpin_all(), vfio_iommu_unmap_unpin_reaccount().
> 
> Very likely because these operations simply don't fail.

I think they are intentionally designed to be no-fail.

E.g., in __iopt_area_unfill_domain(), no-fail is achieved by using a
small backup buffer allocated on the stack in case of kmalloc() failure.
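
As a rough illustration of that no-fail pattern (not the actual iommufd
code; the function name and sizes are made up):

static void unfill_range_no_fail(unsigned long start_index,
				 unsigned long last_index)
{
	unsigned long backup[32];	/* small on-stack fallback */
	unsigned long *batch;
	size_t nents = 512;

	batch = kmalloc_array(nents, sizeof(*batch), GFP_KERNEL);
	if (!batch) {
		/* Degrade to the stack buffer instead of failing. */
		batch = backup;
		nents = ARRAY_SIZE(backup);
	}

	/* ... unmap [start_index, last_index] in chunks of nents pages ... */

	if (batch != backup)
		kfree(batch);
}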


> >
> > That's why we rely on increasing folio ref count to reflect failure, which are
> > due to unexpected SEAMCALL errors.
> 
> TDX stack is adding a scenario where invalidation can fail, a cleaner
> solution would be to propagate the result as an invalidation failure.
I'm not sure the Linux kernel accepts unmap failures.

> Another option is to notify guest_memfd out of band to convey the
> ranges that failed invalidation.
Yes, this might be better. Something similar to holding the folio ref
count to let guest_memfd know that a certain PFN cannot be re-assigned.

> With in-place conversion supported, even if the refcount is raised for
> such pages, they can still get used by the host if the guest_memfd is
> unaware that the invalidation failed.
I thought guest_memfd should check whether the folio ref count is 0 (or
a base count) before conversion, splitting or re-assignment. Otherwise,
why would you care whether TDX holds a ref count? :)
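
Something along these lines, as a hypothetical guest_memfd-side check
(the helper and the "base_refs" parameter are invented for
illustration, not existing code):

static bool gmem_folio_safe_to_convert(struct folio *folio,
				       unsigned long base_refs)
{
	/* Only guest_memfd's own references may remain on the folio. */
	return folio_ref_count(folio) == base_refs;
}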


> >
> > > > Currently, guest_memfd can rely on page ref count to avoid re-assigning a PFN
> > > > that fails to be unmapped.
> > > >
> > > >
> > > > [1] https://lore.kernel.org/all/20250328153133.3504118-5-tabba@google.com/
> > > >
> > > >
> > > > > >
> > > > > >
> > > > > > > Any guest_memfd range updates will result in invalidations/updates of
> > > > > > > userspace, guest, IOMMU or any other page tables referring to
> > > > > > > guest_memfd backed pfns. This story will become clearer once the
> > > > > > > support for PFN range allocator for backing guest_memfd starts getting
> > > > > > > discussed.
> > > > > > Ok. It is indeed unclear right now to support such kind of memory.
> > > > > >
> > > > > > Up to now, we don't anticipate TDX will allow any mapping of VM_PFNMAP memory
> > > > > > into private EPT until TDX connect.
> > > > >
> > > > > There is a plan to use VM_PFNMAP memory for all of guest_memfd
> > > > > shared/private ranges orthogonal to TDX connect usecase. With TDX
> > > > > connect/Sev TIO, major difference would be that guest_memfd private
> > > > > ranges will be mapped into IOMMU page tables.
> > > > >
> > > > > Irrespective of whether/when VM_PFNMAP memory support lands, there
> > > > > have been discussions on not using page structs for private memory
> > > > > ranges altogether [1] even with hugetlb allocator, which will simplify
> > > > > seamless merge/split story for private hugepages to support memory
> > > > > conversion. So I think the general direction we should head towards is
> > > > > not relying on refcounts for guest_memfd private ranges and/or page
> > > > > structs altogether.
> > > > It's fine to use PFN, but I wonder if there're counterparts of struct page to
> > > > keep all necessary info.
> > > >
> > >
> > > Story will become clearer once VM_PFNMAP'd memory support starts
> > > getting discussed. In case of guest_memfd, there is flexibility to
> > > store metadata for physical ranges within guest_memfd just like
> > > shareability tracking.
> > Ok.
> >
> > > >
> > > > > I think the series [2] to work better with PFNMAP'd physical memory in
> > > > > KVM is in the very right direction of not assuming page struct backed
> > > > > memory ranges for guest_memfd as well.
> > > > Note: Currently, VM_PFNMAP is usually used together with flag VM_IO. in KVM
> > > > hva_to_pfn_remapped() only applies to "vma->vm_flags & (VM_IO | VM_PFNMAP)".
> > > >
> > > >
> > > > > [1] https://lore.kernel.org/all/CAGtprH8akKUF=8+RkX_QMjp35C0bU1zxGi4v1Zm5AWCw=8V8AQ@mail.gmail.com/
> > > > > [2] https://lore.kernel.org/linux-arm-kernel/20241010182427.1434605-1-seanjc@google.com/
> > > > >
> > > > > > And even in that scenario, the memory is only for private MMIO, so the backend
> > > > > > driver is VFIO pci driver rather than guest_memfd.
> > > > >
> > > > > Not necessary. As I mentioned above guest_memfd ranges will be backed
> > > > > by VM_PFNMAP memory.
> > > > >
> > > > > >
> > > > > >
> > > > > > > [1] https://elixir.bootlin.com/linux/v6.14.5/source/mm/memory.c#L6543
> > >

^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge pages
  2025-05-09  3:20                         ` Yan Zhao
@ 2025-05-09 14:20                           ` Vishal Annapurve
  2025-05-09 23:45                             ` Edgecombe, Rick P
  2025-05-12  2:15                             ` Yan Zhao
  2025-05-12 19:00                           ` Ackerley Tng
  1 sibling, 2 replies; 294+ messages in thread
From: Vishal Annapurve @ 2025-05-09 14:20 UTC (permalink / raw)
  To: Yan Zhao
  Cc: pbonzini, seanjc, linux-kernel, kvm, x86, rick.p.edgecombe,
	dave.hansen, kirill.shutemov, tabba, ackerleytng, quic_eberman,
	michael.roth, david, vbabka, jroedel, thomas.lendacky, pgonda,
	zhiquan1.li, fan.du, jun.miao, ira.weiny, isaku.yamahata,
	xiaoyao.li, binbin.wu, chao.p.peng

On Thu, May 8, 2025 at 8:22 PM Yan Zhao <yan.y.zhao@intel.com> wrote:
>
> On Thu, May 08, 2025 at 07:10:19AM -0700, Vishal Annapurve wrote:
> > On Wed, May 7, 2025 at 6:32 PM Yan Zhao <yan.y.zhao@intel.com> wrote:
> > >
> > > On Wed, May 07, 2025 at 07:56:08AM -0700, Vishal Annapurve wrote:
> > > > On Wed, May 7, 2025 at 12:39 AM Yan Zhao <yan.y.zhao@intel.com> wrote:
> > > > >
> > > > > On Tue, May 06, 2025 at 06:18:55AM -0700, Vishal Annapurve wrote:
> > > > > > On Mon, May 5, 2025 at 11:07 PM Yan Zhao <yan.y.zhao@intel.com> wrote:
> > > > > > >
> > > > > > > On Mon, May 05, 2025 at 10:08:24PM -0700, Vishal Annapurve wrote:
> > > > > > > > On Mon, May 5, 2025 at 5:56 PM Yan Zhao <yan.y.zhao@intel.com> wrote:
> > > > > > > > >
> > > > > > > > > Sorry for the late reply, I was on leave last week.
> > > > > > > > >
> > > > > > > > > On Tue, Apr 29, 2025 at 06:46:59AM -0700, Vishal Annapurve wrote:
> > > > > > > > > > On Mon, Apr 28, 2025 at 5:52 PM Yan Zhao <yan.y.zhao@intel.com> wrote:
> > > > > > > > > > > So, we plan to remove folio_ref_add()/folio_put_refs() in future, only invoking
> > > > > > > > > > > folio_ref_add() in the event of a removal failure.
> > > > > > > > > >
> > > > > > > > > > In my opinion, the above scheme can be deployed with this series
> > > > > > > > > > itself. guest_memfd will not take away memory from TDX VMs without an
> > > > > > > > > I initially intended to add a separate patch at the end of this series to
> > > > > > > > > implement invoking folio_ref_add() only upon a removal failure. However, I
> > > > > > > > > decided against it since it's not a must before guest_memfd supports in-place
> > > > > > > > > conversion.
> > > > > > > > >
> > > > > > > > > We can include it in the next version If you think it's better.
> > > > > > > >
> > > > > > > > Ackerley is planning to send out a series for 1G Hugetlb support with
> > > > > > > > guest memfd soon, hopefully this week. Plus I don't see any reason to
> > > > > > > > hold extra refcounts in TDX stack so it would be good to clean up this
> > > > > > > > logic.
> > > > > > > >
> > > > > > > > >
> > > > > > > > > > invalidation. folio_ref_add() will not work for memory not backed by
> > > > > > > > > > page structs, but that problem can be solved in future possibly by
> > > > > > > > > With current TDX code, all memory must be backed by a page struct.
> > > > > > > > > Both tdh_mem_page_add() and tdh_mem_page_aug() require a "struct page *" rather
> > > > > > > > > than a pfn.
> > > > > > > > >
> > > > > > > > > > notifying guest_memfd of certain ranges being in use even after
> > > > > > > > > > invalidation completes.
> > > > > > > > > A curious question:
> > > > > > > > > To support memory not backed by page structs in future, is there any counterpart
> > > > > > > > > to the page struct to hold ref count and map count?
> > > > > > > > >
> > > > > > > >
> > > > > > > > I imagine the needed support will match similar semantics as VM_PFNMAP
> > > > > > > > [1] memory. No need to maintain refcounts/map counts for such physical
> > > > > > > > memory ranges as all users will be notified when mappings are
> > > > > > > > changed/removed.
> > > > > > > So, it's possible to map such memory in both shared and private EPT
> > > > > > > simultaneously?
> > > > > >
> > > > > > No, guest_memfd will still ensure that userspace can only fault in
> > > > > > shared memory regions in order to support CoCo VM usecases.
> > > > > Before guest_memfd converts a PFN from shared to private, how does it ensure
> > > > > there are no shared mappings? e.g., in [1], it uses the folio reference count
> > > > > to ensure that.
> > > > >
> > > > > Or do you believe that by eliminating the struct page, there would be no
> > > > > GUP, thereby ensuring no shared mappings by requiring all mappers to unmap in
> > > > > response to a guest_memfd invalidation notification?
> > > >
> > > > Yes.
> > > >
> > > > >
> > > > > As in Documentation/core-api/pin_user_pages.rst, long-term pinning users have
> > > > > no need to register mmu notifier. So why users like VFIO must register
> > > > > guest_memfd invalidation notification?
> > > >
> > > > VM_PFNMAP'd memory can't be long term pinned, so users of such memory
> > > > ranges will have to adopt mechanisms to get notified. I think it would
> > > Hmm, in current VFIO, it does not register any notifier for VM_PFNMAP'd memory.
> >
> > I don't completely understand how VM_PFNMAP'd memory is used today for
> > VFIO. Maybe only MMIO regions are backed by pfnmap today and the story
> > for normal memory backed by pfnmap is yet to materialize.
> VFIO can fault in VM_PFNMAP'd memory which is not from MMIO regions. It works
> because it knows VM_PFNMAP'd memory are always pinned.
>
> Another example is udmabuf (drivers/dma-buf/udmabuf.c), it mmaps normal folios
> with VM_PFNMAP flag without registering mmu notifier because those folios are
> pinned.
>

I might be wrongly throwing out some terminology here, then. The
VM_PFNMAP flag can be set for memory backed by folios/page structs, and
udmabuf seems to be working with pinned "folios" in the backend.

The goal is to get to a stage where guest_memfd is backed by PFN
ranges unmanaged by the kernel, which guest_memfd owns and distributes
to userspace, KVM and the IOMMU subject to shareability attributes. If
the shareability changes, the users will get notified and will have to
invalidate their mappings. guest_memfd will allow mmapping such ranges
with the VM_PFNMAP flag set by default in the VMAs, to indicate the
need for special handling and the lack of page structs.

As an intermediate stage, it makes sense to me to just not have
private memory backed by page structs and to use a special "filemap" to
map file offsets to these private memory ranges. This step will also
need a similar contract with users (a hypothetical sketch of such a
contract follows below):
   1) memory is pinned by guest_memfd
   2) users will get invalidation notifications on shareability changes

I am sure there is a lot of work here and many quirks to be addressed;
let's discuss this more once there is better context around. A few
related RFC series are planned to be posted in the near future.
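
A hypothetical sketch of that contract (none of these names exist
today; this is purely illustrative) could look like:

struct gmem_invalidate_notifier {
	struct list_head link;
	/*
	 * Called when shareability of [start, end) changes; the user must
	 * drop its mappings and may return an error (e.g. a failed TDX
	 * unmap) that guest_memfd relays up to userspace.
	 */
	int (*invalidate)(struct gmem_invalidate_notifier *n,
			  pgoff_t start, pgoff_t end);
};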

> > >
> > > > be easy to pursue new users of guest_memfd to follow this scheme.
> > > > Irrespective of whether VM_PFNMAP'd support lands, guest_memfd
> > > > hugepage support already needs the stance of: "Guest memfd owns all
> > > > long-term refcounts on private memory" as discussed at LPC [1].
> > > >
> > > > [1] https://lpc.events/event/18/contributions/1764/attachments/1409/3182/LPC%202024_%201G%20page%20support%20for%20guest_memfd.pdf
> > > > (slide 12)
> > > >
> > > > >
> > > > > Besides, how would guest_memfd handle potential unmap failures? e.g. what
> > > > > happens to prevent converting a private PFN to shared if there are errors when
> > > > > TDX unmaps a private PFN or if a device refuses to stop DMAing to a PFN.
> > > >
> > > > Users will have to signal such failures via the invalidation callback
> > > > results or other appropriate mechanisms. guest_memfd can relay the
> > > > failures up the call chain to the userspace.
> > > AFAIK, operations that perform actual unmapping do not allow failure, e.g.
> > > kvm_mmu_unmap_gfn_range(), iopt_area_unfill_domains(),
> > > vfio_iommu_unmap_unpin_all(), vfio_iommu_unmap_unpin_reaccount().
> >
> > Very likely because these operations simply don't fail.
>
> I think they are intentionally designed to be no-fail.
>
> e.g. in __iopt_area_unfill_domain(), no-fail is achieved by using a small backup
> buffer allocated on stack in case of kmalloc() failure.
>
>
> > >
> > > That's why we rely on increasing folio ref count to reflect failure, which are
> > > due to unexpected SEAMCALL errors.
> >
> > TDX stack is adding a scenario where invalidation can fail, a cleaner
> > solution would be to propagate the result as an invalidation failure.
> Not sure if linux kernel accepts unmap failure.
>
> > Another option is to notify guest_memfd out of band to convey the
> > ranges that failed invalidation.
> Yes, this might be better. Something similar like holding folio ref count to
> let guest_memfd know that a certain PFN cannot be re-assigned.
>
> > With in-place conversion supported, even if the refcount is raised for
> > such pages, they can still get used by the host if the guest_memfd is
> > unaware that the invalidation failed.
> I thought guest_memfd should check if folio ref count is 0 (or a base count)
> before conversion, splitting or re-assignment. Otherwise, why do you care if
> TDX holds the ref count? :)
>

Ackerley's soon-to-be-posted RFC series currently checks for safe
private page refcounts explicitly only when folio splitting is needed,
not for every private-to-shared conversion. A simple solution would be
for guest_memfd to check safe page refcounts on each private-to-shared
conversion even when a split is not required, but that will need to be
reworked once either of the stages discussed above lands and page
structs are no longer around.

>
> > >
> > > > > Currently, guest_memfd can rely on page ref count to avoid re-assigning a PFN
> > > > > that fails to be unmapped.
> > > > >
> > > > >
> > > > > [1] https://lore.kernel.org/all/20250328153133.3504118-5-tabba@google.com/
> > > > >
> > > > >
> > > > > > >
> > > > > > >
> > > > > > > > Any guest_memfd range updates will result in invalidations/updates of
> > > > > > > > userspace, guest, IOMMU or any other page tables referring to
> > > > > > > > guest_memfd backed pfns. This story will become clearer once the
> > > > > > > > support for PFN range allocator for backing guest_memfd starts getting
> > > > > > > > discussed.
> > > > > > > Ok. It is indeed unclear right now to support such kind of memory.
> > > > > > >
> > > > > > > Up to now, we don't anticipate TDX will allow any mapping of VM_PFNMAP memory
> > > > > > > into private EPT until TDX connect.
> > > > > >
> > > > > > There is a plan to use VM_PFNMAP memory for all of guest_memfd
> > > > > > shared/private ranges orthogonal to TDX connect usecase. With TDX
> > > > > > connect/Sev TIO, major difference would be that guest_memfd private
> > > > > > ranges will be mapped into IOMMU page tables.
> > > > > >
> > > > > > Irrespective of whether/when VM_PFNMAP memory support lands, there
> > > > > > have been discussions on not using page structs for private memory
> > > > > > ranges altogether [1] even with hugetlb allocator, which will simplify
> > > > > > seamless merge/split story for private hugepages to support memory
> > > > > > conversion. So I think the general direction we should head towards is
> > > > > > not relying on refcounts for guest_memfd private ranges and/or page
> > > > > > structs altogether.
> > > > > It's fine to use PFN, but I wonder if there're counterparts of struct page to
> > > > > keep all necessary info.
> > > > >
> > > >
> > > > Story will become clearer once VM_PFNMAP'd memory support starts
> > > > getting discussed. In case of guest_memfd, there is flexibility to
> > > > store metadata for physical ranges within guest_memfd just like
> > > > shareability tracking.
> > > Ok.
> > >
> > > > >
> > > > > > I think the series [2] to work better with PFNMAP'd physical memory in
> > > > > > KVM is in the very right direction of not assuming page struct backed
> > > > > > memory ranges for guest_memfd as well.
> > > > > Note: Currently, VM_PFNMAP is usually used together with flag VM_IO. in KVM
> > > > > hva_to_pfn_remapped() only applies to "vma->vm_flags & (VM_IO | VM_PFNMAP)".
> > > > >
> > > > >
> > > > > > [1] https://lore.kernel.org/all/CAGtprH8akKUF=8+RkX_QMjp35C0bU1zxGi4v1Zm5AWCw=8V8AQ@mail.gmail.com/
> > > > > > [2] https://lore.kernel.org/linux-arm-kernel/20241010182427.1434605-1-seanjc@google.com/
> > > > > >
> > > > > > > And even in that scenario, the memory is only for private MMIO, so the backend
> > > > > > > driver is VFIO pci driver rather than guest_memfd.
> > > > > >
> > > > > > Not necessary. As I mentioned above guest_memfd ranges will be backed
> > > > > > by VM_PFNMAP memory.
> > > > > >
> > > > > > >
> > > > > > >
> > > > > > > > [1] https://elixir.bootlin.com/linux/v6.14.5/source/mm/memory.c#L6543
> > > >

^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 18/21] KVM: x86: Split huge boundary leafs before private to shared conversion
  2025-04-24  3:08 ` [RFC PATCH 18/21] KVM: x86: Split huge boundary leafs before private to shared conversion Yan Zhao
@ 2025-05-09 23:34   ` Edgecombe, Rick P
  2025-05-12  2:25     ` Yan Zhao
  0 siblings, 1 reply; 294+ messages in thread
From: Edgecombe, Rick P @ 2025-05-09 23:34 UTC (permalink / raw)
  To: pbonzini@redhat.com, seanjc@google.com, Zhao, Yan Y
  Cc: Shutemov, Kirill, quic_eberman@quicinc.com, Li, Xiaoyao,
	kvm@vger.kernel.org, Hansen, Dave, david@redhat.com,
	thomas.lendacky@amd.com, tabba@google.com, Li, Zhiquan1, Du, Fan,
	linux-kernel@vger.kernel.org, michael.roth@amd.com, Weiny, Ira,
	vbabka@suse.cz, binbin.wu@linux.intel.com, ackerleytng@google.com,
	Yamahata, Isaku, Peng, Chao P, Annapurve, Vishal, jroedel@suse.de,
	Miao, Jun, pgonda@google.com, x86@kernel.org

On Thu, 2025-04-24 at 11:08 +0800, Yan Zhao wrote:
> Before converting a GFN range from private to shared, it is necessary to
> zap the mirror page table. When huge pages are supported and the GFN range
> intersects with a huge leaf, split the huge leaf to prevent zapping GFNs
> outside the conversion range.

FALLOC_FL_PUNCH_HOLE demotion failure doesn't look like it is addressed in this
series. I noticed that mmu notifier failures are allowed to be handled by
blocking until success is possible, in most cases. KVM just doesn't need to
because it can't fail. We could think about doing retries for
FALLOC_FL_PUNCH_HOLE, while checking for signals. Or adding a ENOMEM error code
to fallocate.

^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge pages
  2025-05-09 14:20                           ` Vishal Annapurve
@ 2025-05-09 23:45                             ` Edgecombe, Rick P
  2025-05-10  0:41                               ` Vishal Annapurve
  2025-05-12  2:15                             ` Yan Zhao
  1 sibling, 1 reply; 294+ messages in thread
From: Edgecombe, Rick P @ 2025-05-09 23:45 UTC (permalink / raw)
  To: Annapurve, Vishal, Zhao, Yan Y
  Cc: quic_eberman@quicinc.com, Shutemov, Kirill, Li, Xiaoyao,
	kvm@vger.kernel.org, Hansen, Dave, david@redhat.com,
	thomas.lendacky@amd.com, tabba@google.com, vbabka@suse.cz,
	Du, Fan, michael.roth@amd.com, seanjc@google.com, Weiny, Ira,
	pbonzini@redhat.com, binbin.wu@linux.intel.com,
	ackerleytng@google.com, linux-kernel@vger.kernel.org,
	Yamahata, Isaku, Peng, Chao P, Li, Zhiquan1, jroedel@suse.de,
	Miao, Jun, pgonda@google.com, x86@kernel.org

On Fri, 2025-05-09 at 07:20 -0700, Vishal Annapurve wrote:
> I might be wrongly throwing out some terminologies here then.
> VM_PFNMAP flag can be set for memory backed by folios/page structs.
> udmabuf seems to be working with pinned "folios" in the backend.
> 
> The goal is to get to a stage where guest_memfd is backed by pfn
> ranges unmanaged by kernel that guest_memfd owns and distributes to
> userspace, KVM, IOMMU subject to shareability attributes. if the
> shareability changes, the users will get notified and will have to
> invalidate their mappings. guest_memfd will allow mmaping such ranges
> with VM_PFNMAP flag set by default in the VMAs to indicate the need of
> special handling/lack of page structs.

I see the point about how operating on PFNs can allow smoother transition to a
solution that saves struct page memory, but I wonder about the wisdom of
building this 2MB TDX code against eventual goals.

We were thinking to enable 2MB TDX huge pages on top of:
1. Mmap shared pages
2. In-place conversion
3. 2MB huge page support

Where do you think struct page-less guestmemfd fits in that roadmap?

> 
> As an intermediate stage, it makes sense to me to just not have
> private memory backed by page structs and use a special "filemap" to
> map file offsets to these private memory ranges. This step will also
> need similar contract with users -
>    1) memory is pinned by guest_memfd
>    2) users will get invalidation notifiers on shareability changes
> 
> I am sure there is a lot of work here and many quirks to be addressed,
> let's discuss this more with better context around. A few related RFC
> series are planned to be posted in the near future.

Look forward to collecting more context, and thanks for your patience while we
catch up. But why not an iterative approach? We can't save struct page memory on
guestmemfd huge pages until we have guestmemfd huge pages.

^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge pages
  2025-05-09 23:45                             ` Edgecombe, Rick P
@ 2025-05-10  0:41                               ` Vishal Annapurve
  2025-05-12 21:59                                 ` Edgecombe, Rick P
  0 siblings, 1 reply; 294+ messages in thread
From: Vishal Annapurve @ 2025-05-10  0:41 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: Zhao, Yan Y, quic_eberman@quicinc.com, Shutemov, Kirill,
	Li, Xiaoyao, kvm@vger.kernel.org, Hansen, Dave, david@redhat.com,
	thomas.lendacky@amd.com, tabba@google.com, vbabka@suse.cz,
	Du, Fan, michael.roth@amd.com, seanjc@google.com, Weiny, Ira,
	pbonzini@redhat.com, binbin.wu@linux.intel.com,
	ackerleytng@google.com, linux-kernel@vger.kernel.org,
	Yamahata, Isaku, Peng, Chao P, Li, Zhiquan1, jroedel@suse.de,
	Miao, Jun, pgonda@google.com, x86@kernel.org

On Fri, May 9, 2025 at 4:45 PM Edgecombe, Rick P
<rick.p.edgecombe@intel.com> wrote:
>
> On Fri, 2025-05-09 at 07:20 -0700, Vishal Annapurve wrote:
> > I might be wrongly throwing out some terminologies here then.
> > VM_PFNMAP flag can be set for memory backed by folios/page structs.
> > udmabuf seems to be working with pinned "folios" in the backend.
> >
> > The goal is to get to a stage where guest_memfd is backed by pfn
> > ranges unmanaged by kernel that guest_memfd owns and distributes to
> > userspace, KVM, IOMMU subject to shareability attributes. if the
> > shareability changes, the users will get notified and will have to
> > invalidate their mappings. guest_memfd will allow mmaping such ranges
> > with VM_PFNMAP flag set by default in the VMAs to indicate the need of
> > special handling/lack of page structs.
>
> I see the point about how operating on PFNs can allow smoother transition to a
> solution that saves struct page memory, but I wonder about the wisdom of
> building this 2MB TDX code against eventual goals.

This discussion was more in response to a few questions from Yan [1].

My point of this discussion was to ensure that:
1) There is more awareness about the future roadmap.
2) There is a line of sight towards supporting guest memory (at least
guest private memory) without page structs.

No need to solve these problems right away, but it would be good to
ensure that the design choices are aligned towards the future
direction.

One thing that needs to be resolved right away is - no refcounts on
guest memory from outside guest_memfd [2]. (Discounting the error
situations)

[1] https://lore.kernel.org/lkml/aBldhnTK93+eKcMq@yzhao56-desk.sh.intel.com/
[2] https://lore.kernel.org/lkml/CAGtprH_ggm8N-R9QbV1f8mo8-cQkqyEta3W=h2jry-NRD7_6OA@mail.gmail.com/

>
> We were thinking to enable 2MB TDX huge pages on top of:
> 1. Mmap shared pages
> 2. In-place conversion
> 3. 2MB huge page support
>
> Where do you think struct page-less guestmemfd fits in that roadmap?

Ideally the roadmap should be:
1. mmap support
2. Huge page support in guest memfd with in-place conversion
3. 2MB huge page EPT mappings support
4. private memory without page structs
5. private/shared memory without page structs

There should be newer RFC series landing soon for 1 and 2. In my
opinion, as long as hugepage EPT support is reviewed, tested and is
stable enough, it can land upstream sooner than 2 as well.

>
> >
> > As an intermediate stage, it makes sense to me to just not have
> > private memory backed by page structs and use a special "filemap" to
> > map file offsets to these private memory ranges. This step will also
> > need similar contract with users -
> >    1) memory is pinned by guest_memfd
> >    2) users will get invalidation notifiers on shareability changes
> >
> > I am sure there is a lot of work here and many quirks to be addressed,
> > let's discuss this more with better context around. A few related RFC
> > series are planned to be posted in the near future.
>
> Look forward to collecting more context, and thanks for your patience while we
> catch up. But why not an iterative approach? We can't save struct page memory on
> guestmemfd huge pages until we have guestmemfd huge pages.

^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge pages
  2025-05-09 14:20                           ` Vishal Annapurve
  2025-05-09 23:45                             ` Edgecombe, Rick P
@ 2025-05-12  2:15                             ` Yan Zhao
  2025-05-12 16:53                               ` Vishal Annapurve
  1 sibling, 1 reply; 294+ messages in thread
From: Yan Zhao @ 2025-05-12  2:15 UTC (permalink / raw)
  To: Vishal Annapurve
  Cc: pbonzini, seanjc, linux-kernel, kvm, x86, rick.p.edgecombe,
	dave.hansen, kirill.shutemov, tabba, ackerleytng, quic_eberman,
	michael.roth, david, vbabka, jroedel, thomas.lendacky, pgonda,
	zhiquan1.li, fan.du, jun.miao, ira.weiny, isaku.yamahata,
	xiaoyao.li, binbin.wu, chao.p.peng

On Fri, May 09, 2025 at 07:20:30AM -0700, Vishal Annapurve wrote:
> On Thu, May 8, 2025 at 8:22 PM Yan Zhao <yan.y.zhao@intel.com> wrote:
> >
> > On Thu, May 08, 2025 at 07:10:19AM -0700, Vishal Annapurve wrote:
> > > On Wed, May 7, 2025 at 6:32 PM Yan Zhao <yan.y.zhao@intel.com> wrote:
> > > >
> > > > On Wed, May 07, 2025 at 07:56:08AM -0700, Vishal Annapurve wrote:
> > > > > On Wed, May 7, 2025 at 12:39 AM Yan Zhao <yan.y.zhao@intel.com> wrote:
> > > > > >
> > > > > > On Tue, May 06, 2025 at 06:18:55AM -0700, Vishal Annapurve wrote:
> > > > > > > On Mon, May 5, 2025 at 11:07 PM Yan Zhao <yan.y.zhao@intel.com> wrote:
> > > > > > > >
> > > > > > > > On Mon, May 05, 2025 at 10:08:24PM -0700, Vishal Annapurve wrote:
> > > > > > > > > On Mon, May 5, 2025 at 5:56 PM Yan Zhao <yan.y.zhao@intel.com> wrote:
> > > > > > > > > >
> > > > > > > > > > Sorry for the late reply, I was on leave last week.
> > > > > > > > > >
> > > > > > > > > > On Tue, Apr 29, 2025 at 06:46:59AM -0700, Vishal Annapurve wrote:
> > > > > > > > > > > On Mon, Apr 28, 2025 at 5:52 PM Yan Zhao <yan.y.zhao@intel.com> wrote:
> > > > > > > > > > > > So, we plan to remove folio_ref_add()/folio_put_refs() in future, only invoking
> > > > > > > > > > > > folio_ref_add() in the event of a removal failure.
> > > > > > > > > > >
> > > > > > > > > > > In my opinion, the above scheme can be deployed with this series
> > > > > > > > > > > itself. guest_memfd will not take away memory from TDX VMs without an
> > > > > > > > > > I initially intended to add a separate patch at the end of this series to
> > > > > > > > > > implement invoking folio_ref_add() only upon a removal failure. However, I
> > > > > > > > > > decided against it since it's not a must before guest_memfd supports in-place
> > > > > > > > > > conversion.
> > > > > > > > > >
> > > > > > > > > > We can include it in the next version If you think it's better.
> > > > > > > > >
> > > > > > > > > Ackerley is planning to send out a series for 1G Hugetlb support with
> > > > > > > > > guest memfd soon, hopefully this week. Plus I don't see any reason to
> > > > > > > > > hold extra refcounts in TDX stack so it would be good to clean up this
> > > > > > > > > logic.
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > > invalidation. folio_ref_add() will not work for memory not backed by
> > > > > > > > > > > page structs, but that problem can be solved in future possibly by
> > > > > > > > > > With current TDX code, all memory must be backed by a page struct.
> > > > > > > > > > Both tdh_mem_page_add() and tdh_mem_page_aug() require a "struct page *" rather
> > > > > > > > > > than a pfn.
> > > > > > > > > >
> > > > > > > > > > > notifying guest_memfd of certain ranges being in use even after
> > > > > > > > > > > invalidation completes.
> > > > > > > > > > A curious question:
> > > > > > > > > > To support memory not backed by page structs in future, is there any counterpart
> > > > > > > > > > to the page struct to hold ref count and map count?
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > > I imagine the needed support will match similar semantics as VM_PFNMAP
> > > > > > > > > [1] memory. No need to maintain refcounts/map counts for such physical
> > > > > > > > > memory ranges as all users will be notified when mappings are
> > > > > > > > > changed/removed.
> > > > > > > > So, it's possible to map such memory in both shared and private EPT
> > > > > > > > simultaneously?
> > > > > > >
> > > > > > > No, guest_memfd will still ensure that userspace can only fault in
> > > > > > > shared memory regions in order to support CoCo VM usecases.
> > > > > > Before guest_memfd converts a PFN from shared to private, how does it ensure
> > > > > > there are no shared mappings? e.g., in [1], it uses the folio reference count
> > > > > > to ensure that.
> > > > > >
> > > > > > Or do you believe that by eliminating the struct page, there would be no
> > > > > > GUP, thereby ensuring no shared mappings by requiring all mappers to unmap in
> > > > > > response to a guest_memfd invalidation notification?
> > > > >
> > > > > Yes.
> > > > >
> > > > > >
> > > > > > As in Documentation/core-api/pin_user_pages.rst, long-term pinning users have
> > > > > > no need to register mmu notifier. So why users like VFIO must register
> > > > > > guest_memfd invalidation notification?
> > > > >
> > > > > VM_PFNMAP'd memory can't be long term pinned, so users of such memory
> > > > > ranges will have to adopt mechanisms to get notified. I think it would
> > > > Hmm, in current VFIO, it does not register any notifier for VM_PFNMAP'd memory.
> > >
> > > I don't completely understand how VM_PFNMAP'd memory is used today for
> > > VFIO. Maybe only MMIO regions are backed by pfnmap today and the story
> > > for normal memory backed by pfnmap is yet to materialize.
> > VFIO can fault in VM_PFNMAP'd memory which is not from MMIO regions. It works
> > because it knows VM_PFNMAP'd memory are always pinned.
> >
> > Another example is udmabuf (drivers/dma-buf/udmabuf.c), it mmaps normal folios
> > with VM_PFNMAP flag without registering mmu notifier because those folios are
> > pinned.
> >
> 
> I might be wrongly throwing out some terminologies here then.
> VM_PFNMAP flag can be set for memory backed by folios/page structs.
> udmabuf seems to be working with pinned "folios" in the backend.
> 
> The goal is to get to a stage where guest_memfd is backed by pfn
> ranges unmanaged by kernel that guest_memfd owns and distributes to
> userspace, KVM, IOMMU subject to shareability attributes. if the
OK. So from the point of view of the rest of the kernel, those pfns are not
regarded as memory.

> shareability changes, the users will get notified and will have to
> invalidate their mappings. guest_memfd will allow mmaping such ranges
> with VM_PFNMAP flag set by default in the VMAs to indicate the need of
> special handling/lack of page structs.
My concern is that a failable invalidation notifier may not be ideal.
Instead of relying on ref counts (or other mechanisms) to determine whether to
start shareability changes, with a failable invalidation notifier some users
may fail the invalidation, and thus the shareability change, even after other
users have successfully unmapped the range.

Auditing whether multiple users of shared memory correctly perform unmapping is
harder than auditing reference counts.

> private memory backed by page structs and use a special "filemap" to
> map file offsets to these private memory ranges. This step will also
> need similar contract with users -
>    1) memory is pinned by guest_memfd
>    2) users will get invalidation notifiers on shareability changes
> 
> I am sure there is a lot of work here and many quirks to be addressed,
> let's discuss this more with better context around. A few related RFC
> series are planned to be posted in the near future.
Ok. Thanks for your time and discussions :)

> > > >
> > > > > be easy to pursue new users of guest_memfd to follow this scheme.
> > > > > Irrespective of whether VM_PFNMAP'd support lands, guest_memfd
> > > > > hugepage support already needs the stance of: "Guest memfd owns all
> > > > > long-term refcounts on private memory" as discussed at LPC [1].
> > > > >
> > > > > [1] https://lpc.events/event/18/contributions/1764/attachments/1409/3182/LPC%202024_%201G%20page%20support%20for%20guest_memfd.pdf
> > > > > (slide 12)
> > > > >
> > > > > >
> > > > > > Besides, how would guest_memfd handle potential unmap failures? e.g. what
> > > > > > happens to prevent converting a private PFN to shared if there are errors when
> > > > > > TDX unmaps a private PFN or if a device refuses to stop DMAing to a PFN.
> > > > >
> > > > > Users will have to signal such failures via the invalidation callback
> > > > > results or other appropriate mechanisms. guest_memfd can relay the
> > > > > failures up the call chain to the userspace.
> > > > AFAIK, operations that perform actual unmapping do not allow failure, e.g.
> > > > kvm_mmu_unmap_gfn_range(), iopt_area_unfill_domains(),
> > > > vfio_iommu_unmap_unpin_all(), vfio_iommu_unmap_unpin_reaccount().
> > >
> > > Very likely because these operations simply don't fail.
> >
> > I think they are intentionally designed to be no-fail.
> >
> > e.g. in __iopt_area_unfill_domain(), no-fail is achieved by using a small backup
> > buffer allocated on stack in case of kmalloc() failure.
> >
> >
> > > >
> > > > That's why we rely on increasing folio ref count to reflect failure, which are
> > > > due to unexpected SEAMCALL errors.
> > >
> > > TDX stack is adding a scenario where invalidation can fail, a cleaner
> > > solution would be to propagate the result as an invalidation failure.
> > Not sure if linux kernel accepts unmap failure.
> >
> > > Another option is to notify guest_memfd out of band to convey the
> > > ranges that failed invalidation.
> > Yes, this might be better. Something similar like holding folio ref count to
> > let guest_memfd know that a certain PFN cannot be re-assigned.
> >
> > > With in-place conversion supported, even if the refcount is raised for
> > > such pages, they can still get used by the host if the guest_memfd is
> > > unaware that the invalidation failed.
> > I thought guest_memfd should check if folio ref count is 0 (or a base count)
> > before conversion, splitting or re-assignment. Otherwise, why do you care if
> > TDX holds the ref count? :)
> >
> 
> The soon-to-be-posted RFC series by Ackerley currently checks for safe
> private page refcounts only when folio splitting is needed, not for
> every private to shared conversion. A simple solution would be for
> guest_memfd to check safe page refcounts for each private to shared
> conversion even if a split is not required, but that will need to be
> reworked once either of the stages discussed above lands and page
> structs are no longer around.
> 
> >
> > > >
> > > > > > Currently, guest_memfd can rely on page ref count to avoid re-assigning a PFN
> > > > > > that fails to be unmapped.
> > > > > >
> > > > > >
> > > > > > [1] https://lore.kernel.org/all/20250328153133.3504118-5-tabba@google.com/
> > > > > >
> > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > > Any guest_memfd range updates will result in invalidations/updates of
> > > > > > > > > userspace, guest, IOMMU or any other page tables referring to
> > > > > > > > > guest_memfd backed pfns. This story will become clearer once the
> > > > > > > > > support for PFN range allocator for backing guest_memfd starts getting
> > > > > > > > > discussed.
> > > > > > > > Ok. It is indeed unclear right now to support such kind of memory.
> > > > > > > >
> > > > > > > > Up to now, we don't anticipate TDX will allow any mapping of VM_PFNMAP memory
> > > > > > > > into private EPT until TDX connect.
> > > > > > >
> > > > > > > There is a plan to use VM_PFNMAP memory for all of guest_memfd
> > > > > > > shared/private ranges orthogonal to TDX connect usecase. With TDX
> > > > > > > connect/Sev TIO, major difference would be that guest_memfd private
> > > > > > > ranges will be mapped into IOMMU page tables.
> > > > > > >
> > > > > > > Irrespective of whether/when VM_PFNMAP memory support lands, there
> > > > > > > have been discussions on not using page structs for private memory
> > > > > > > ranges altogether [1] even with hugetlb allocator, which will simplify
> > > > > > > seamless merge/split story for private hugepages to support memory
> > > > > > > conversion. So I think the general direction we should head towards is
> > > > > > > not relying on refcounts for guest_memfd private ranges and/or page
> > > > > > > structs altogether.
> > > > > > It's fine to use PFN, but I wonder if there're counterparts of struct page to
> > > > > > keep all necessary info.
> > > > > >
> > > > >
> > > > > Story will become clearer once VM_PFNMAP'd memory support starts
> > > > > getting discussed. In case of guest_memfd, there is flexibility to
> > > > > store metadata for physical ranges within guest_memfd just like
> > > > > shareability tracking.
> > > > Ok.
> > > >
> > > > > >
> > > > > > > I think the series [2] to work better with PFNMAP'd physical memory in
> > > > > > > KVM is in the very right direction of not assuming page struct backed
> > > > > > > memory ranges for guest_memfd as well.
> > > > > > Note: Currently, VM_PFNMAP is usually used together with flag VM_IO. in KVM
> > > > > > hva_to_pfn_remapped() only applies to "vma->vm_flags & (VM_IO | VM_PFNMAP)".
> > > > > >
> > > > > >
> > > > > > > [1] https://lore.kernel.org/all/CAGtprH8akKUF=8+RkX_QMjp35C0bU1zxGi4v1Zm5AWCw=8V8AQ@mail.gmail.com/
> > > > > > > [2] https://lore.kernel.org/linux-arm-kernel/20241010182427.1434605-1-seanjc@google.com/
> > > > > > >
> > > > > > > > And even in that scenario, the memory is only for private MMIO, so the backend
> > > > > > > > driver is VFIO pci driver rather than guest_memfd.
> > > > > > >
> > > > > > > Not necessary. As I mentioned above guest_memfd ranges will be backed
> > > > > > > by VM_PFNMAP memory.
> > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > > [1] https://elixir.bootlin.com/linux/v6.14.5/source/mm/memory.c#L6543
> > > > >
> 

^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 18/21] KVM: x86: Split huge boundary leafs before private to shared conversion
  2025-05-09 23:34   ` Edgecombe, Rick P
@ 2025-05-12  2:25     ` Yan Zhao
  2025-05-12 21:53       ` Edgecombe, Rick P
  0 siblings, 1 reply; 294+ messages in thread
From: Yan Zhao @ 2025-05-12  2:25 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: pbonzini@redhat.com, seanjc@google.com, Shutemov, Kirill,
	quic_eberman@quicinc.com, Li, Xiaoyao, kvm@vger.kernel.org,
	Hansen, Dave, david@redhat.com, thomas.lendacky@amd.com,
	tabba@google.com, Li, Zhiquan1, Du, Fan,
	linux-kernel@vger.kernel.org, michael.roth@amd.com, Weiny, Ira,
	vbabka@suse.cz, binbin.wu@linux.intel.com, ackerleytng@google.com,
	Yamahata, Isaku, Peng, Chao P, Annapurve, Vishal, jroedel@suse.de,
	Miao, Jun, pgonda@google.com, x86@kernel.org

On Sat, May 10, 2025 at 07:34:39AM +0800, Edgecombe, Rick P wrote:
> On Thu, 2025-04-24 at 11:08 +0800, Yan Zhao wrote:
> > Before converting a GFN range from private to shared, it is necessary to
> > zap the mirror page table. When huge pages are supported and the GFN range
> > intersects with a huge leaf, split the huge leaf to prevent zapping GFNs
> > outside the conversion range.
> 
> FALLOC_FL_PUNCH_HOLE demotion failure doesn't look like it is addressed in this
Hmm, FALLOC_FL_PUNCH_HOLE demotion failure is handled in patch 19.

> series. I noticed that mmu notifier failures are allowed to be handled by
> blocking until success is possible, in most cases. KVM just doesn't need to
> because it can't fail. We could think about doing retries for
> FALLOC_FL_PUNCH_HOLE, while checking for signals. Or adding a ENOMEM error code
> to fallocate.
In patch 19, FALLOC_FL_PUNCH_HOLE could return -ENOMEM.

Returning -ENOMEM may be inevitable as we can't endlessly retry. So for
simplicity, there's no retry in this series.


Besides that, do you think we need to conduct the splitting before any
unmap is invoked?

As in the patch log:
"
The downside of this approach is that although kvm_split_boundary_leafs()        
is invoked before kvm_unmap_gfn_range() for each GFN range, the entire           
conversion range may consist of several GFN ranges. If an out-of-memory          
error occurs during the splitting of a GFN range, some previous GFN ranges       
may have been successfully split and zapped, even though their page              
attributes remain unchanged due to the splitting failure. This may not be a      
big problem as the user can retry the ioctl to split and zap the full            
range.
"

^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge pages
  2025-05-12  2:15                             ` Yan Zhao
@ 2025-05-12 16:53                               ` Vishal Annapurve
  2025-05-15  3:01                                 ` Yan Zhao
  0 siblings, 1 reply; 294+ messages in thread
From: Vishal Annapurve @ 2025-05-12 16:53 UTC (permalink / raw)
  To: Yan Zhao
  Cc: pbonzini, seanjc, linux-kernel, kvm, x86, rick.p.edgecombe,
	dave.hansen, kirill.shutemov, tabba, ackerleytng, quic_eberman,
	michael.roth, david, vbabka, jroedel, thomas.lendacky, pgonda,
	zhiquan1.li, fan.du, jun.miao, ira.weiny, isaku.yamahata,
	xiaoyao.li, binbin.wu, chao.p.peng

On Sun, May 11, 2025 at 7:18 PM Yan Zhao <yan.y.zhao@intel.com> wrote:
> ...
> >
> > I might be wrongly throwing out some terminologies here then.
> > VM_PFNMAP flag can be set for memory backed by folios/page structs.
> > udmabuf seems to be working with pinned "folios" in the backend.
> >
> > The goal is to get to a stage where guest_memfd is backed by pfn
> > ranges unmanaged by kernel that guest_memfd owns and distributes to
> > userspace, KVM, IOMMU subject to shareability attributes. if the
> OK. So from the point of view of the rest of the kernel, those pfns are not
> regarded as memory.
>
> > shareability changes, the users will get notified and will have to
> > invalidate their mappings. guest_memfd will allow mmaping such ranges
> > with VM_PFNMAP flag set by default in the VMAs to indicate the need of
> > special handling/lack of page structs.
> My concern is that a failable invalidation notifier may not be ideal.
> Instead of relying on ref counts (or other mechanisms) to determine whether to
> start shareability changes, with a failable invalidation notifier some users
> may fail the invalidation, and thus the shareability change, even after other
> users have successfully unmapped the range.

Even if one user fails to invalidate its mappings, I don't see a
reason to go ahead with shareability change. Shareability should not
change unless all existing users let go of their soon-to-be-invalid
view of memory.

>
> Auditing whether multiple users of shared memory correctly perform unmapping is
> harder than auditing reference counts.
>
> > private memory backed by page structs and use a special "filemap" to
> > map file offsets to these private memory ranges. This step will also
> > need similar contract with users -
> >    1) memory is pinned by guest_memfd
> >    2) users will get invalidation notifiers on shareability changes
> >
> > I am sure there is a lot of work here and many quirks to be addressed,
> > let's discuss this more with better context around. A few related RFC
> > series are planned to be posted in the near future.
> Ok. Thanks for your time and discussions :)
> ...

^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge pages
  2025-05-09  3:20                         ` Yan Zhao
  2025-05-09 14:20                           ` Vishal Annapurve
@ 2025-05-12 19:00                           ` Ackerley Tng
  2025-05-12 21:44                             ` Edgecombe, Rick P
  1 sibling, 1 reply; 294+ messages in thread
From: Ackerley Tng @ 2025-05-12 19:00 UTC (permalink / raw)
  To: Yan Zhao
  Cc: vannapurve, pbonzini, seanjc, linux-kernel, kvm, x86,
	rick.p.edgecombe, dave.hansen, kirill.shutemov, tabba,
	quic_eberman, michael.roth, david, vbabka, jroedel,
	thomas.lendacky, pgonda, zhiquan1.li, fan.du, jun.miao, ira.weiny,
	isaku.yamahata, xiaoyao.li, binbin.wu, chao.p.peng

Yan Zhao <yan.y.zhao@intel.com> writes:

>> <snip>
>>
>> Very likely because these operations simply don't fail.
>
> I think they are intentionally designed to be no-fail.
>
> e.g. in __iopt_area_unfill_domain(), no-fail is achieved by using a small backup
> buffer allocated on stack in case of kmalloc() failure.
>
>
>> >
>> > That's why we rely on increasing folio ref count to reflect failure, which are
>> > due to unexpected SEAMCALL errors.
>>
>> TDX stack is adding a scenario where invalidation can fail, a cleaner
>> solution would be to propagate the result as an invalidation failure.
> Not sure if linux kernel accepts unmap failure.
>
>> Another option is to notify guest_memfd out of band to convey the
>> ranges that failed invalidation.
> Yes, this might be better. Something similar like holding folio ref count to
> let guest_memfd know that a certain PFN cannot be re-assigned.
>
>> With in-place conversion supported, even if the refcount is raised for
>> such pages, they can still get used by the host if the guest_memfd is
>> unaware that the invalidation failed.
> I thought guest_memfd should check if folio ref count is 0 (or a base count)
> before conversion, splitting or re-assignment. Otherwise, why do you care if
> TDX holds the ref count? :)
>

IIUC the question here is how we should handle failures in unmapping of
private memory, which should be a rare occurrence.

I think there are two options here

1. Fail on unmapping *private* memory

2. Don't fail on unmapping *private* memory, instead tell the owner of
   the memory that this memory is never to be used again.

I think option 1 is better because it is more direct and provides timely
feedback to the caller when the issue happens. There is also room to
provide even more context about the address of the failure here.

It does seem like generally, unmapping memory does not support failing,
but I think that is for shared memory (even in KVM MMU notifiers).
Would it be possible to establish a new contract that for private pages,
unmapping can fail?

The kernel/KVM-internal functions for unmapping GFNs can be modified to
return error when unmapping private memory. Specifically, when
KVM_FILTER_PRIVATE [1] is set, then the unmapping function can return an
error and if not then the caller should not expect failures.
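
A minimal sketch of that idea (not code from any posted series; the
wrapper and the arch callback names are invented for illustration, while
struct kvm_gfn_range and KVM_FILTER_PRIVATE are existing KVM
definitions):

	static int kvm_unmap_gfn_range_may_fail(struct kvm *kvm,
						struct kvm_gfn_range *range)
	{
		int ret = kvm_arch_unmap_private_range(kvm, range); /* hypothetical */

		if (ret && !(range->attr_filter & KVM_FILTER_PRIVATE)) {
			/* Shared-memory unmaps keep today's no-fail contract. */
			WARN_ON_ONCE(1);
			ret = 0;
		}
		return ret;
	}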

IIUC the only places where private memory is unmapped now is via
guest_memfd's truncate and (future) convert operations, so guest_memfd
can handle those failures or return failure to userspace.

Option 2 is possible too - but seems a little awkward. For conversion
the general steps are to (a) unmap pages from either host, guest or both
page tables (b) change shareability status in guest_memfd. It seems
awkward to first let step (a) pass even though there was an error, and
then proceed to (b) only to check somewhere (via refcount or otherwise)
that there was an issue and the conversion needs to fail.

Currently (I will be posting this 1G page support series, with
conversions, soon), I check refcounts == safe refcount for shared to
private conversions before permitting the conversion (an error is
returned to userspace on failure).

For private to shared conversions, there is no check. At conversion
time, when splitting pages, I just spin in the kernel waiting for any
speculative refcounts to go away. The refcount check at conversion
time is currently purely to ensure a safe merge process.
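
As a rough illustration of those two behaviours (the helper names and
the notion of a caller-supplied "safe" refcount are assumptions here,
not code from the series):

	/* Shared->private (merge): refuse unless refcount is at the safe base. */
	static int gmem_check_refcount_for_merge(struct folio *folio, int safe_refcount)
	{
		return folio_ref_count(folio) == safe_refcount ? 0 : -EAGAIN;
	}

	/* Private->shared (split): wait out short-lived speculative references. */
	static void gmem_wait_for_safe_refcount(struct folio *folio, int safe_refcount)
	{
		while (folio_ref_count(folio) > safe_refcount)
			cond_resched();
	}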

It is possible to check all the refcounts of private pages (split or
huge page) in the requested conversion range to handle unmapping
failures, but that seems expensive to do for every conversion, for
possibly many 4K pages, just to find a rare error case.

Also, if we do this refcount check to find the error, there wouldn't be
any way to tell if it were an error or if it was a speculative refcount,
so guest_memfd would just have to return -EAGAIN for private to shared
conversions. This would make conversions complicated to handle in
userspace, since the userspace VMM doesn't know whether it should retry
(for speculative refcounts) or it should give up because of the
unmapping error. Returning a different error on unmapping failure would
allow userspace to handle the two cases differently.
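
For example, a userspace VMM might want to do something like the below
(the conversion ioctl name and the error contract here are purely
hypothetical, just to show the distinction):

	for (;;) {
		ret = ioctl(gmem_fd, GUEST_MEMFD_CONVERT_SHARED, &range); /* hypothetical */
		if (!ret)
			break;
		if (errno == EAGAIN)
			continue;	/* speculative refcount, safe to retry */
		break;			/* unmapping error, give up on this range */
	}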

Regarding Option 2, another way to indicate an error could be to mark
the page as poisoned, but then again that would overlap/shadow true
memory poisoning.

In summary, I think Option 1 is best, which is that we return error
within the kernel, and the caller (for now only guest_memfd unmaps
private memory) should handle the error.

[1] https://github.com/torvalds/linux/blob/627277ba7c2398dc4f95cc9be8222bb2d9477800/include/linux/kvm_host.h#L260

>
>> >
>> > > > Currently, guest_memfd can rely on page ref count to avoid re-assigning a PFN
>> > > > that fails to be unmapped.
>> > > >
>> > > >
>> > > > [1] https://lore.kernel.org/all/20250328153133.3504118-5-tabba@google.com/
>> > > >
>> > > >
>> > > > > >
>> > > > > >
>> > > > > > > Any guest_memfd range updates will result in invalidations/updates of
>> > > > > > > userspace, guest, IOMMU or any other page tables referring to
>> > > > > > > guest_memfd backed pfns. This story will become clearer once the
>> > > > > > > support for PFN range allocator for backing guest_memfd starts getting
>> > > > > > > discussed.
>> > > > > > Ok. It is indeed unclear right now to support such kind of memory.
>> > > > > >
>> > > > > > Up to now, we don't anticipate TDX will allow any mapping of VM_PFNMAP memory
>> > > > > > into private EPT until TDX connect.
>> > > > >
>> > > > > There is a plan to use VM_PFNMAP memory for all of guest_memfd
>> > > > > shared/private ranges orthogonal to TDX connect usecase. With TDX
>> > > > > connect/Sev TIO, major difference would be that guest_memfd private
>> > > > > ranges will be mapped into IOMMU page tables.
>> > > > >
>> > > > > Irrespective of whether/when VM_PFNMAP memory support lands, there
>> > > > > have been discussions on not using page structs for private memory
>> > > > > ranges altogether [1] even with hugetlb allocator, which will simplify
>> > > > > seamless merge/split story for private hugepages to support memory
>> > > > > conversion. So I think the general direction we should head towards is
>> > > > > not relying on refcounts for guest_memfd private ranges and/or page
>> > > > > structs altogether.
>> > > > It's fine to use PFN, but I wonder if there're counterparts of struct page to
>> > > > keep all necessary info.
>> > > >
>> > >
>> > > Story will become clearer once VM_PFNMAP'd memory support starts
>> > > getting discussed. In case of guest_memfd, there is flexibility to
>> > > store metadata for physical ranges within guest_memfd just like
>> > > shareability tracking.
>> > Ok.
>> >
>> > > >
>> > > > > I think the series [2] to work better with PFNMAP'd physical memory in
>> > > > > KVM is in the very right direction of not assuming page struct backed
>> > > > > memory ranges for guest_memfd as well.
>> > > > Note: Currently, VM_PFNMAP is usually used together with flag VM_IO. in KVM
>> > > > hva_to_pfn_remapped() only applies to "vma->vm_flags & (VM_IO | VM_PFNMAP)".
>> > > >
>> > > >
>> > > > > [1] https://lore.kernel.org/all/CAGtprH8akKUF=8+RkX_QMjp35C0bU1zxGi4v1Zm5AWCw=8V8AQ@mail.gmail.com/
>> > > > > [2] https://lore.kernel.org/linux-arm-kernel/20241010182427.1434605-1-seanjc@google.com/
>> > > > >
>> > > > > > And even in that scenario, the memory is only for private MMIO, so the backend
>> > > > > > driver is VFIO pci driver rather than guest_memfd.
>> > > > >
>> > > > > Not necessary. As I mentioned above guest_memfd ranges will be backed
>> > > > > by VM_PFNMAP memory.
>> > > > >
>> > > > > >
>> > > > > >
>> > > > > > > [1] https://elixir.bootlin.com/linux/v6.14.5/source/mm/memory.c#L6543
>> > >

^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge pages
  2025-05-12 19:00                           ` Ackerley Tng
@ 2025-05-12 21:44                             ` Edgecombe, Rick P
  0 siblings, 0 replies; 294+ messages in thread
From: Edgecombe, Rick P @ 2025-05-12 21:44 UTC (permalink / raw)
  To: ackerleytng@google.com, Zhao, Yan Y
  Cc: quic_eberman@quicinc.com, Shutemov, Kirill, Li, Xiaoyao,
	kvm@vger.kernel.org, Hansen, Dave, david@redhat.com,
	thomas.lendacky@amd.com, tabba@google.com, vbabka@suse.cz,
	Du, Fan, michael.roth@amd.com, seanjc@google.com, Weiny, Ira,
	pbonzini@redhat.com, binbin.wu@linux.intel.com, Li, Zhiquan1,
	linux-kernel@vger.kernel.org, Yamahata, Isaku, Peng, Chao P,
	Annapurve, Vishal, jroedel@suse.de, Miao, Jun, pgonda@google.com,
	x86@kernel.org

On Mon, 2025-05-12 at 12:00 -0700, Ackerley Tng wrote:
> IIUC the question here is how we should handle failures in unmapping of
> private memory, which should be a rare occurrence.
> 
> I think there are two options here
> 
> 1. Fail on unmapping *private* memory
> 
> 2. Don't fail on unmapping *private* memory, instead tell the owner of
>    the memory that this memory is never to be used again.
> 
> I think option 1 is better because it is more direct and provides timely
> feedback to the caller when the issue happens. There is also room to
> provide even more context about the address of the failure here.
> 
> It does seem like generally, unmapping memory does not support failing,
> but I think that is for shared memory (even in KVM MMU notifiers).
> Would it be possible to establish a new contract that for private pages,
> unmapping can fail?
> 
> The kernel/KVM-internal functions for unmapping GFNs can be modified to
> return error when unmapping private memory. Specifically, when
> KVM_FILTER_PRIVATE [1] is set, then the unmapping function can return an
> error and if not then the caller should not expect failures.
> 
> IIUC the only places where private memory is unmapped now is via
> guest_memfd's truncate and (future) convert operations, so guest_memfd
> can handle those failures or return failure to userspace.

1. Private->shared memory conversion
2. Memslot deletion
3. kvm_gmem_invalidate_begin() callers

> 
> Option 2 is possible too - but seems a little awkward. For conversion
> the general steps are to (a) unmap pages from either host, guest or both
> page tables (b) change shareability status in guest_memfd. It seems
> awkward to first let step (a) pass even though there was an error, and
> then proceed to (b) only to check somewhere (via refcount or otherwise)
> that there was an issue and the conversion needs to fail.
> 
> Currently (I will be posting this 1G page support series, with
> conversions, soon), I check refcounts == safe refcount for shared to
> private conversions before permitting the conversion (an error is
> returned to userspace on failure).
> 
> For private to shared conversions, there is no check. At conversion
> time, when splitting pages, I just spin in the kernel waiting for any
> speculative refcounts to go away. The refcount check at conversion
> time is currently purely to ensure a safe merge process.
> 
> It is possible to check all the refcounts of private pages (split or
> huge page) in the requested conversion range to handle unmapping
> failures, but that seems expensive to do for every conversion, for
> possibly many 4K pages, just to find a rare error case.
> 
> Also, if we do this refcount check to find the error, there wouldn't be
> any way to tell if it were an error or if it was a speculative refcount,
> so guest_memfd would just have to return -EAGAIN for private to shared
> conversions. This would make conversions complicated to handle in
> userspace, since the userspace VMM doesn't know whether it should retry
> (for speculative refcounts) or it should give up because of the
> unmapping error. Returning a different error on unmapping failure would
> allow userspace to handle the two cases differently.
> 
> Regarding Option 2, another way to indicate an error could be to mark
> the page as poisoned, but then again that would overlap/shadow true
> memory poisoning.
> 
> In summary, I think Option 1 is best, which is that we return error
> within the kernel, and the caller (for now only guest_memfd unmaps
> private memory) should handle the error.

When we get to huge pages we will have two error conditions on the KVM side:
1. Fail to split
2. A TDX module error

For TDX module errors, today we are essentially talking about bugs. The handling
in the case of TDX module bug should be to bug the VM and to prevent the memory
from being freed to the kernel. In which case the unmapping sort of succeeded,
albeit destructively.

So I'm not sure if userspace needs to know about the TDX module bugs (they are
going to find out anyway on the next KVM ioctl). But if we plumbed the error
code all the way through to guestmemfd, then I guess why not tell them.

On whether we should go to the trouble, could another option be to expose a
guestmemfd function that allows for "poisoning" the memory, and have the TDX bug
paths call it. This way guestmemfd could know specifically about unrecoverable
errors. For splitting failures, we can return an error to guestmemfd without
plumbing the error code all the way through.
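
As a rough sketch of the shape such an interface could take (the
function name and the "unusable" xarray are invented here, assumed to
be new guest_memfd state):

	/*
	 * Called from TDX bug paths instead of failing the unmap; guest_memfd
	 * would check the mark before splitting, converting or freeing the
	 * folio at this index.
	 */
	int kvm_gmem_mark_unusable(struct xarray *unusable, pgoff_t index)
	{
		return xa_err(xa_store(unusable, index, xa_mk_value(1), GFP_KERNEL));
	}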

Perhaps it might be worth doing a quick POC to see how bad plumbing the error
code all the way looks.

^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 18/21] KVM: x86: Split huge boundary leafs before private to shared conversion
  2025-05-12  2:25     ` Yan Zhao
@ 2025-05-12 21:53       ` Edgecombe, Rick P
  0 siblings, 0 replies; 294+ messages in thread
From: Edgecombe, Rick P @ 2025-05-12 21:53 UTC (permalink / raw)
  To: Zhao, Yan Y
  Cc: kvm@vger.kernel.org, Li, Xiaoyao, quic_eberman@quicinc.com,
	Hansen, Dave, david@redhat.com, Li, Zhiquan1, tabba@google.com,
	vbabka@suse.cz, thomas.lendacky@amd.com, michael.roth@amd.com,
	seanjc@google.com, Weiny, Ira, linux-kernel@vger.kernel.org,
	pbonzini@redhat.com, ackerleytng@google.com, Yamahata, Isaku,
	binbin.wu@linux.intel.com, Peng, Chao P, Du, Fan,
	Annapurve, Vishal, jroedel@suse.de, Miao, Jun, Shutemov, Kirill,
	pgonda@google.com, x86@kernel.org

On Mon, 2025-05-12 at 10:25 +0800, Yan Zhao wrote:
> > FALLOC_FL_PUNCH_HOLE demotion failure doesn't look like it is addressed in
> > this
> Hmm, FALLOC_FL_PUNCH_HOLE demotion failure is handled in patch 19.

Oh, right you are.

> 
> > series. I noticed that mmu notifier failures are allowed to be handled by
> > blocking until success is possible, in most cases. KVM just doesn't need to
> > because it can't fail. We could think about doing retries for
> > FALLOC_FL_PUNCH_HOLE, while checking for signals. Or adding a ENOMEM error
> > code
> > to fallocate.
> In patch 19, FALLOC_FL_PUNCH_HOLE could return -ENOMEM.

Yes. It is not in the man pages, but looking this morning I see other fallocate
handlers are already returning it.

> 
> Returning -ENOMEM may be inevitable as we can't endlessly retry. So for
> simplicity, there's no retry in this series.

Ok, seems good.

> 
> 
> Besides that, do you think we need to conduct the splitting before any
> unmap is invoked?
> 
> As in the patch log:
> "
> The downside of this approach is that although kvm_split_boundary_leafs()
> is invoked before kvm_unmap_gfn_range() for each GFN range, the entire
> conversion range may consist of several GFN ranges. If an out-of-memory
> error occurs during the splitting of a GFN range, some previous GFN ranges
> may have been successfully split and zapped, even though their page
> attributes remain unchanged due to the splitting failure. This may not be a
> big problem as the user can retry the ioctl to split and zap the full
> range.
> "

If we ended up plumbing the zapping errors all the way through the MMU, it
probably would be simpler to do it during the unmap. Of course callers would
have to be aware the range may be half unmapped on error. I think the way you
have it is nice for not churning the MMU though.

But for the case of having to retry the split and walking the mirror EPT range
again, it's a rare case and not worth optimizing for. Let's not consider it
much.

^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge pages
  2025-05-10  0:41                               ` Vishal Annapurve
@ 2025-05-12 21:59                                 ` Edgecombe, Rick P
  0 siblings, 0 replies; 294+ messages in thread
From: Edgecombe, Rick P @ 2025-05-12 21:59 UTC (permalink / raw)
  To: Annapurve, Vishal
  Cc: quic_eberman@quicinc.com, Li, Xiaoyao, Shutemov, Kirill,
	Hansen, Dave, david@redhat.com, thomas.lendacky@amd.com,
	Zhao, Yan Y, tabba@google.com, Du, Fan, michael.roth@amd.com,
	seanjc@google.com, binbin.wu@linux.intel.com, vbabka@suse.cz,
	pbonzini@redhat.com, Weiny, Ira, ackerleytng@google.com,
	Yamahata, Isaku, Peng, Chao P, kvm@vger.kernel.org,
	jroedel@suse.de, linux-kernel@vger.kernel.org, Li, Zhiquan1,
	Miao, Jun, pgonda@google.com, x86@kernel.org

On Fri, 2025-05-09 at 17:41 -0700, Vishal Annapurve wrote:
> > > > I see the point about how operating on PFNs can allow smoother
> > > > transition to a
> > > > solution that saves struct page memory, but I wonder about the wisdom of
> > > > building this 2MB TDX code against eventual goals.
> > 
> > This discussion was more in response to a few questions from Yan [1].

Right, I follow.

> > 
> > My point of this discussion was to ensure that:
> > 1) There is more awareness about the future roadmap.
> > 2) There is a line of sight towards supporting guest memory (at least
> > guest private memory) without page structs.
> > 
> > No need to solve these problems right away, but it would be good to
> > ensure that the design choices are aligned towards the future
> > direction.

I'm not sure how much we should consider it at this stage. The kernel is not set
in stone, so it's about how much you want to do at once. For us who have been
working on the giant TDX base series, doing things in smaller, more incremental
steps sounds nice :). That said, the necessary changes may have other good
reasons, as discussed.

> > 
> > One thing that needs to be resolved right away is - no refcounts on
> > guest memory from outside guest_memfd [2]. (Discounting the error
> > situations)

Sounds fine.

> > 
> > [1] https://lore.kernel.org/lkml/aBldhnTK93+eKcMq@yzhao56-desk.sh.intel.com/
> > [2] https://lore.kernel.org/lkml/CAGtprH_ggm8N-R9QbV1f8mo8-cQkqyEta3W=h2jry-NRD7_6OA@mail.gmail.com/



^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 03/21] x86/virt/tdx: Add SEAMCALL wrapper tdh_mem_page_demote()
  2025-04-24  3:04 ` [RFC PATCH 03/21] x86/virt/tdx: Add SEAMCALL wrapper tdh_mem_page_demote() Yan Zhao
  2025-04-25  7:12   ` Binbin Wu
@ 2025-05-13 18:19   ` Edgecombe, Rick P
  2025-05-15  8:26     ` Yan Zhao
  1 sibling, 1 reply; 294+ messages in thread
From: Edgecombe, Rick P @ 2025-05-13 18:19 UTC (permalink / raw)
  To: pbonzini@redhat.com, seanjc@google.com, Zhao, Yan Y
  Cc: Shutemov, Kirill, quic_eberman@quicinc.com, Li, Xiaoyao,
	kvm@vger.kernel.org, Hansen, Dave, david@redhat.com,
	thomas.lendacky@amd.com, tabba@google.com, Li, Zhiquan1, Du, Fan,
	linux-kernel@vger.kernel.org, michael.roth@amd.com, Weiny, Ira,
	vbabka@suse.cz, binbin.wu@linux.intel.com, ackerleytng@google.com,
	Yamahata, Isaku, Peng, Chao P, Annapurve, Vishal, jroedel@suse.de,
	Miao, Jun, pgonda@google.com, x86@kernel.org

On Thu, 2025-04-24 at 11:04 +0800, Yan Zhao wrote:
> From: Xiaoyao Li <xiaoyao.li@intel.com>
> 
> Add a wrapper tdh_mem_page_demote() to invoke SEAMCALL TDH_MEM_PAGE_DEMOTE
> to demote a huge leaf entry to a non-leaf entry in S-EPT. Currently, the
> TDX module only supports demotion of a 2M huge leaf entry. After a
> successful demotion, the old 2M huge leaf entry in S-EPT is replaced with a
> non-leaf entry, linking to the newly-added page table page. The newly
> linked page table page then contains 512 leaf entries, pointing to the 2M
> guest private pages.
> 
> The "gpa" and "level" direct the TDX module to search and find the old
> huge leaf entry.
> 
> As the new non-leaf entry points to a page table page, callers need to
> pass in the page table page in parameter "page".
> 
> In case of S-EPT walk failure, the entry, level and state where the error
> was detected are returned in ext_err1 and ext_err2.
> 
> On interrupt pending, SEAMCALL TDH_MEM_PAGE_DEMOTE returns error
> TDX_INTERRUPTED_RESTARTABLE.
> 
> [Yan: Rebased and split patch, wrote changelog]

We should add the level of detail here like we did for the base series ones.

> 
> Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
> ---
>  arch/x86/include/asm/tdx.h  |  2 ++
>  arch/x86/virt/vmx/tdx/tdx.c | 20 ++++++++++++++++++++
>  arch/x86/virt/vmx/tdx/tdx.h |  1 +
>  3 files changed, 23 insertions(+)
> 
> diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
> index 26ffc792e673..08eff4b2f5e7 100644
> --- a/arch/x86/include/asm/tdx.h
> +++ b/arch/x86/include/asm/tdx.h
> @@ -177,6 +177,8 @@ u64 tdh_mng_key_config(struct tdx_td *td);
>  u64 tdh_mng_create(struct tdx_td *td, u16 hkid);
>  u64 tdh_vp_create(struct tdx_td *td, struct tdx_vp *vp);
>  u64 tdh_mng_rd(struct tdx_td *td, u64 field, u64 *data);
> +u64 tdh_mem_page_demote(struct tdx_td *td, u64 gpa, int level, struct page *page,
> +			u64 *ext_err1, u64 *ext_err2);
>  u64 tdh_mr_extend(struct tdx_td *td, u64 gpa, u64 *ext_err1, u64 *ext_err2);
>  u64 tdh_mr_finalize(struct tdx_td *td);
>  u64 tdh_vp_flush(struct tdx_vp *vp);
> diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> index a66d501b5677..5699dfe500d9 100644
> --- a/arch/x86/virt/vmx/tdx/tdx.c
> +++ b/arch/x86/virt/vmx/tdx/tdx.c
> @@ -1684,6 +1684,26 @@ u64 tdh_mng_rd(struct tdx_td *td, u64 field, u64 *data)
>  }
>  EXPORT_SYMBOL_GPL(tdh_mng_rd);
>  
> +u64 tdh_mem_page_demote(struct tdx_td *td, u64 gpa, int level, struct page *page,
> +			u64 *ext_err1, u64 *ext_err2)
> +{
> +	struct tdx_module_args args = {
> +		.rcx = gpa | level,

This will only ever be level 2MB, how about dropping the arg?

> +		.rdx = tdx_tdr_pa(td),
> +		.r8 = page_to_phys(page),
> +	};
> +	u64 ret;
> +
> +	tdx_clflush_page(page);
> +	ret = seamcall_ret(TDH_MEM_PAGE_DEMOTE, &args);
> +
> +	*ext_err1 = args.rcx;
> +	*ext_err2 = args.rdx;

How about we just call these entry and level_state, like the caller.

> +
> +	return ret;
> +}
> +EXPORT_SYMBOL_GPL(tdh_mem_page_demote);

Looking in the docs, TDX module gives some somewhat constrained guidance:
1. TDH.MEM.PAGE.DEMOTE should be invoked in a loop until it terminates
successfully.
2. The host VMM should be designed to avoid cases where interrupt storms prevent
successful completion of TDH.MEM.PAGE.DEMOTE.

The caller looks like:
	do {
		err = tdh_mem_page_demote(&kvm_tdx->td, gpa, tdx_level, page,
					  &entry, &level_state);
	} while (err == TDX_INTERRUPTED_RESTARTABLE);

	if (unlikely(tdx_operand_busy(err))) {
		tdx_no_vcpus_enter_start(kvm);
		err = tdh_mem_page_demote(&kvm_tdx->td, gpa, tdx_level, page,
					  &entry, &level_state);
		tdx_no_vcpus_enter_stop(kvm);
	}

The brute force second case could also be subjected to a
TDX_INTERRUPTED_RESTARTABLE and is not handled. As for interrupt storms, I guess
we could disable interrupts while we do the second brute force case. So the
TDX_INTERRUPTED_RESTARTABLE loop could have a max retries, and the brute force
case could also disable interrupts.

Hmm, how to pick the max retries count. It's a tradeoff between interrupt
latency and DOS/code complexity. Do we have any idea how long demote might take?
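
Roughly, the combination could look like the below (the retry limit is
picked arbitrarily just to show the shape, and "i"/"flags" are assumed
to be declared in the caller; not tested code):

	#define DEMOTE_MAX_RETRIES	16	/* arbitrary */

	for (i = 0; i < DEMOTE_MAX_RETRIES; i++) {
		err = tdh_mem_page_demote(&kvm_tdx->td, gpa, tdx_level, page,
					  &entry, &level_state);
		if (err != TDX_INTERRUPTED_RESTARTABLE)
			break;
	}

	if (unlikely(tdx_operand_busy(err) || err == TDX_INTERRUPTED_RESTARTABLE)) {
		/* Brute force: no vCPUs entering, local interrupts off. */
		tdx_no_vcpus_enter_start(kvm);
		local_irq_save(flags);
		err = tdh_mem_page_demote(&kvm_tdx->td, gpa, tdx_level, page,
					  &entry, &level_state);
		local_irq_restore(flags);
		tdx_no_vcpus_enter_stop(kvm);
	}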





^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 02/21] x86/virt/tdx: Enhance tdh_mem_page_aug() to support huge pages
  2025-04-24  3:04 ` [RFC PATCH 02/21] x86/virt/tdx: Enhance tdh_mem_page_aug() to support huge pages Yan Zhao
  2025-04-24  7:48   ` Kirill A. Shutemov
  2025-04-25  6:51   ` Binbin Wu
@ 2025-05-13 18:52   ` Edgecombe, Rick P
  2025-05-16  9:05     ` Yan Zhao
  2025-05-15  2:16   ` Chao Gao
  2025-07-08  8:48   ` Yan Zhao
  4 siblings, 1 reply; 294+ messages in thread
From: Edgecombe, Rick P @ 2025-05-13 18:52 UTC (permalink / raw)
  To: pbonzini@redhat.com, seanjc@google.com, Zhao, Yan Y
  Cc: Shutemov, Kirill, quic_eberman@quicinc.com, Li, Xiaoyao,
	kvm@vger.kernel.org, Hansen, Dave, david@redhat.com,
	thomas.lendacky@amd.com, tabba@google.com, Li, Zhiquan1, Du, Fan,
	linux-kernel@vger.kernel.org, michael.roth@amd.com, Weiny, Ira,
	vbabka@suse.cz, binbin.wu@linux.intel.com, ackerleytng@google.com,
	Yamahata, Isaku, Peng, Chao P, Annapurve, Vishal, jroedel@suse.de,
	Miao, Jun, pgonda@google.com, x86@kernel.org

On Thu, 2025-04-24 at 11:04 +0800, Yan Zhao wrote:
> Enhance the SEAMCALL wrapper tdh_mem_page_aug() to support huge pages.
> 
> Verify the validity of the level and ensure that the mapping range is fully
> contained within the page folio.
> 
> As a conservative solution, perform CLFLUSH on all pages to be mapped into
> the TD before invoking the SEAMCALL TDH_MEM_PAGE_AUG. This ensures that any
> dirty cache lines do not write back later and clobber TD memory.

This should have a brief background on why it doesn't use the arg - what is
deficient today. Also, an explanation of how it will be used (i.e. what types of
pages will be passed)

> 
> Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
> ---
>  arch/x86/virt/vmx/tdx/tdx.c | 11 ++++++++++-
>  1 file changed, 10 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> index f5e2a937c1e7..a66d501b5677 100644
> --- a/arch/x86/virt/vmx/tdx/tdx.c
> +++ b/arch/x86/virt/vmx/tdx/tdx.c
> @@ -1595,9 +1595,18 @@ u64 tdh_mem_page_aug(struct tdx_td *td, u64 gpa, int level, struct page *page, u
>  		.rdx = tdx_tdr_pa(td),
>  		.r8 = page_to_phys(page),
>  	};
> +	unsigned long nr_pages = 1 << (level * 9);
> +	struct folio *folio = page_folio(page);
> +	unsigned long idx = 0;
>  	u64 ret;
>  
> -	tdx_clflush_page(page);
> +	if (!(level >= TDX_PS_4K && level < TDX_PS_NR) ||
> +	    (folio_page_idx(folio, page) + nr_pages > folio_nr_pages(folio)))
> +		return -EINVAL;

Shouldn't KVM avoid trying to map a huge page in this situation in the first
place? It doesn't seem like a job for the SEAMCALL wrapper.

> +
> +	while (nr_pages--)
> +		tdx_clflush_page(nth_page(page, idx++));

tdx_clflush_page() is just a thin wrapper around clflush_cache_range():
static void tdx_clflush_page(struct page *page)
{
	clflush_cache_range(page_to_virt(page), PAGE_SIZE);
}

So we have loops within loops...  Better to add an arg to tdx_clflush_page() or
add a variant that takes one.
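
e.g. a variant like the below could do the whole flush in one call (sketch;
tdx_clflush_pages() is a made-up name):

static void tdx_clflush_pages(struct page *page, unsigned long nr_pages)
{
	clflush_cache_range(page_to_virt(page), nr_pages * PAGE_SIZE);
}

Then the caller is just tdx_clflush_pages(page, 1 << (level * 9)) with no extra
loop.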

> +
>  	ret = seamcall_ret(TDH_MEM_PAGE_AUG, &args);
>  
>  	*ext_err1 = args.rcx;


^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 04/21] KVM: TDX: Enforce 4KB mapping level during TD build Time
  2025-04-24  3:05 ` [RFC PATCH 04/21] KVM: TDX: Enforce 4KB mapping level during TD build Time Yan Zhao
  2025-04-24  7:55   ` Kirill A. Shutemov
@ 2025-05-13 19:12   ` Edgecombe, Rick P
  2025-05-15  9:16     ` Yan Zhao
  1 sibling, 1 reply; 294+ messages in thread
From: Edgecombe, Rick P @ 2025-05-13 19:12 UTC (permalink / raw)
  To: pbonzini@redhat.com, seanjc@google.com, Zhao, Yan Y
  Cc: Shutemov, Kirill, quic_eberman@quicinc.com, Li, Xiaoyao,
	kvm@vger.kernel.org, Hansen, Dave, david@redhat.com,
	thomas.lendacky@amd.com, tabba@google.com, Li, Zhiquan1, Du, Fan,
	linux-kernel@vger.kernel.org, michael.roth@amd.com, Weiny, Ira,
	vbabka@suse.cz, binbin.wu@linux.intel.com, ackerleytng@google.com,
	Yamahata, Isaku, Peng, Chao P, Annapurve, Vishal, jroedel@suse.de,
	Miao, Jun, pgonda@google.com, x86@kernel.org

On Thu, 2025-04-24 at 11:05 +0800, Yan Zhao wrote:
> During the TD build phase (i.e., before the TD becomes RUNNABLE), enforce a
> 4KB mapping level both in the S-EPT managed by the TDX module and the
> mirror page table managed by KVM.
> 
> During this phase, TD's memory is added via tdh_mem_page_add(), which only
> accepts 4KB granularity. Therefore, return PG_LEVEL_4K in TDX's
> .private_max_mapping_level hook to ensure KVM maps at the 4KB level in the
> mirror page table. Meanwhile, iterate over each 4KB page of a large gmem
> backend page in tdx_gmem_post_populate() and invoke tdh_mem_page_add() to
> map at the 4KB level in the S-EPT.
> 
> Still allow huge pages in gmem backend during TD build time. Based on [1],
> which gmem series allows 2MB TPH and non-in-place conversion, pass in
> region.nr_pages to kvm_gmem_populate() in tdx_vcpu_init_mem_region().
> 

This commit log will need to be written with upstream in mind when it is out of
RFC.

>  This
> enables kvm_gmem_populate() to allocate huge pages from the gmem backend
> when the remaining nr_pages, GFN alignment, and page private/shared
> attribute permit.  KVM is then able to promote the initial 4K mapping to
> huge after TD is RUNNABLE.
> 
> Disallow any private huge pages during TD build time. Use BUG_ON() in
> tdx_mem_page_record_premap_cnt() and tdx_is_sept_zap_err_due_to_premap() to
> assert the mapping level is 4KB.
> 
> Opportunistically, remove unused parameters in
> tdx_mem_page_record_premap_cnt().
> 
> Link: https://lore.kernel.org/all/20241212063635.712877-1-michael.roth@amd.com [1]
> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
> ---
>  arch/x86/kvm/vmx/tdx.c | 45 ++++++++++++++++++++++++++++--------------
>  1 file changed, 30 insertions(+), 15 deletions(-)
> 
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index 98cde20f14da..03885cb2869b 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -1530,14 +1530,16 @@ static int tdx_mem_page_aug(struct kvm *kvm, gfn_t gfn,
>   * The counter has to be zero on KVM_TDX_FINALIZE_VM, to ensure that there
>   * are no half-initialized shared EPT pages.
>   */
> -static int tdx_mem_page_record_premap_cnt(struct kvm *kvm, gfn_t gfn,
> -					  enum pg_level level, kvm_pfn_t pfn)
> +static int tdx_mem_page_record_premap_cnt(struct kvm *kvm, enum pg_level level)
>  {
>  	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
>  
>  	if (KVM_BUG_ON(kvm->arch.pre_fault_allowed, kvm))
>  		return -EINVAL;
>  
> +	if (KVM_BUG_ON(level != PG_LEVEL_4K, kvm))
> +		return -EINVAL;
> +
>  	/* nr_premapped will be decreased when tdh_mem_page_add() is called. */
>  	atomic64_inc(&kvm_tdx->nr_premapped);
>  	return 0;
> @@ -1571,7 +1573,7 @@ int tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn,
>  	if (likely(kvm_tdx->state == TD_STATE_RUNNABLE))
>  		return tdx_mem_page_aug(kvm, gfn, level, page);
>  
> -	return tdx_mem_page_record_premap_cnt(kvm, gfn, level, pfn);
> +	return tdx_mem_page_record_premap_cnt(kvm, level);
>  }
>  
>  static int tdx_sept_drop_private_spte(struct kvm *kvm, gfn_t gfn,
> @@ -1666,7 +1668,7 @@ int tdx_sept_link_private_spt(struct kvm *kvm, gfn_t gfn,
>  static int tdx_is_sept_zap_err_due_to_premap(struct kvm_tdx *kvm_tdx, u64 err,
>  					     u64 entry, int level)
>  {
> -	if (!err || kvm_tdx->state == TD_STATE_RUNNABLE)
> +	if (!err || kvm_tdx->state == TD_STATE_RUNNABLE || level > PG_LEVEL_4K)
>  		return false;

This is catching zapping huge pages before the TD is runnable? Is it necessary
if we are already warning about mapping huge pages before the TD is runnable in
tdx_mem_page_record_premap_cnt()?

>  
>  	if (err != (TDX_EPT_ENTRY_STATE_INCORRECT | TDX_OPERAND_ID_RCX))
> @@ -3052,8 +3054,8 @@ struct tdx_gmem_post_populate_arg {
>  	__u32 flags;
>  };
>  
> -static int tdx_gmem_post_populate(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
> -				  void __user *src, int order, void *_arg)
> +static int tdx_gmem_post_populate_4k(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
> +				     void __user *src, void *_arg)
>  {
>  	u64 error_code = PFERR_GUEST_FINAL_MASK | PFERR_PRIVATE_ACCESS;
>  	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
> @@ -3120,6 +3122,21 @@ static int tdx_gmem_post_populate(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
>  	return ret;
>  }
>  
> +static int tdx_gmem_post_populate(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
> +				  void __user *src, int order, void *_arg)
> +{
> +	unsigned long i, npages = 1 << order;
> +	int ret;
> +
> +	for (i = 0; i < npages; i++) {
> +		ret = tdx_gmem_post_populate_4k(kvm, gfn + i, pfn + i,
> +						src + i * PAGE_SIZE, _arg);
> +		if (ret)
> +			return ret;
> +	}
> +	return 0;
> +}
> +
>  static int tdx_vcpu_init_mem_region(struct kvm_vcpu *vcpu, struct kvm_tdx_cmd *cmd)
>  {
>  	struct vcpu_tdx *tdx = to_tdx(vcpu);
> @@ -3166,20 +3183,15 @@ static int tdx_vcpu_init_mem_region(struct kvm_vcpu *vcpu, struct kvm_tdx_cmd *c
>  		};
>  		gmem_ret = kvm_gmem_populate(kvm, gpa_to_gfn(region.gpa),
>  					     u64_to_user_ptr(region.source_addr),
> -					     1, tdx_gmem_post_populate, &arg);
> +					     region.nr_pages, tdx_gmem_post_populate, &arg);
>  		if (gmem_ret < 0) {
>  			ret = gmem_ret;
>  			break;
>  		}
>  
> -		if (gmem_ret != 1) {
> -			ret = -EIO;
> -			break;
> -		}
> -
> -		region.source_addr += PAGE_SIZE;
> -		region.gpa += PAGE_SIZE;
> -		region.nr_pages--;
> +		region.source_addr += PAGE_SIZE * gmem_ret;

gmem_ret has to be 1, per the above conditional.

> +		region.gpa += PAGE_SIZE * gmem_ret;
> +		region.nr_pages -= gmem_ret;
>  
>  		cond_resched();
>  	}
> @@ -3224,6 +3236,9 @@ int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp)
>  
>  int tdx_gmem_private_max_mapping_level(struct kvm *kvm, kvm_pfn_t pfn)
>  {
> +	if (unlikely(to_kvm_tdx(kvm)->state != TD_STATE_RUNNABLE))
> +		return PG_LEVEL_4K;
> +
>  	return PG_LEVEL_4K;

^ Change does nothing...

>  }
>  


^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 05/21] KVM: TDX: Enhance tdx_clear_page() to support huge pages
  2025-04-24  3:05 ` [RFC PATCH 05/21] KVM: TDX: Enhance tdx_clear_page() to support huge pages Yan Zhao
@ 2025-05-13 19:17   ` Edgecombe, Rick P
  2025-05-16  2:02     ` Yan Zhao
  0 siblings, 1 reply; 294+ messages in thread
From: Edgecombe, Rick P @ 2025-05-13 19:17 UTC (permalink / raw)
  To: pbonzini@redhat.com, seanjc@google.com, Zhao, Yan Y
  Cc: Shutemov, Kirill, quic_eberman@quicinc.com, Li, Xiaoyao,
	kvm@vger.kernel.org, Hansen, Dave, david@redhat.com,
	thomas.lendacky@amd.com, tabba@google.com, Li, Zhiquan1, Du, Fan,
	linux-kernel@vger.kernel.org, michael.roth@amd.com, Weiny, Ira,
	vbabka@suse.cz, binbin.wu@linux.intel.com, ackerleytng@google.com,
	Yamahata, Isaku, Peng, Chao P, Annapurve, Vishal, jroedel@suse.de,
	Miao, Jun, pgonda@google.com, x86@kernel.org

On Thu, 2025-04-24 at 11:05 +0800, Yan Zhao wrote:
> From: Xiaoyao Li <xiaoyao.li@intel.com>
> 
> KVM invokes tdx_clear_page() to zero pages using movdir64b().
> Include level information to enable tdx_clear_page() to zero a huge page.
> 
> [Yan: split out, let tdx_clear_page() accept level]
> 
> Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
> ---
>  arch/x86/kvm/vmx/tdx.c | 19 ++++++++++++++-----
>  1 file changed, 14 insertions(+), 5 deletions(-)
> 
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index 03885cb2869b..1186085795ac 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -276,7 +276,7 @@ static inline void tdx_disassociate_vp(struct kvm_vcpu *vcpu)
>  	vcpu->cpu = -1;
>  }
>  
> -static void tdx_clear_page(struct page *page)
> +static void __tdx_clear_page(struct page *page)
>  {
>  	const void *zero_page = (const void *) page_to_virt(ZERO_PAGE(0));
>  	void *dest = page_to_virt(page);
> @@ -295,6 +295,15 @@ static void tdx_clear_page(struct page *page)
>  	__mb();
>  }
>  
> +static void tdx_clear_page(struct page *page, int level)
> +{
> +	unsigned long nr = KVM_PAGES_PER_HPAGE(level);
> +	unsigned long idx = 0;
> +
> +	while (nr--)
> +		__tdx_clear_page(nth_page(page, idx++));

You shouldn't need both idx and nr.
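
i.e. a single counter is enough (sketch):

static void tdx_clear_page(struct page *page, int level)
{
	unsigned long i;

	for (i = 0; i < KVM_PAGES_PER_HPAGE(level); i++)
		__tdx_clear_page(nth_page(page, i));
}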

> +}

Since tdx_clear_page() has a __mb(), it is probably worth checking that this
generates efficient code, considering the loops within loops pattern.

> +
>  static void tdx_no_vcpus_enter_start(struct kvm *kvm)
>  {
>  	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
> @@ -340,11 +349,10 @@ static int tdx_reclaim_page(struct page *page)
>  
>  	r = __tdx_reclaim_page(page);
>  	if (!r)
> -		tdx_clear_page(page);
> +		tdx_clear_page(page, PG_LEVEL_4K);
>  	return r;
>  }
>  
> -
>  /*
>   * Reclaim the TD control page(s) which are crypto-protected by TDX guest's
>   * private KeyID.  Assume the cache associated with the TDX private KeyID has
> @@ -588,7 +596,7 @@ static void tdx_reclaim_td_control_pages(struct kvm *kvm)
>  		pr_tdx_error(TDH_PHYMEM_PAGE_WBINVD, err);
>  		return;
>  	}
> -	tdx_clear_page(kvm_tdx->td.tdr_page);
> +	tdx_clear_page(kvm_tdx->td.tdr_page, PG_LEVEL_4K);

Why not the __tdx_clear_page() variant? The patch adds it, but doesn't really
use it. Just implement it all in tdx_clear_page() then.

>  
>  	__free_page(kvm_tdx->td.tdr_page);
>  	kvm_tdx->td.tdr_page = NULL;
> @@ -1621,7 +1629,8 @@ static int tdx_sept_drop_private_spte(struct kvm *kvm, gfn_t gfn,
>  		pr_tdx_error(TDH_PHYMEM_PAGE_WBINVD, err);
>  		return -EIO;
>  	}
> -	tdx_clear_page(page);
> +
> +	tdx_clear_page(page, level);
>  	tdx_unpin(kvm, page);
>  	return 0;
>  }


^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 06/21] KVM: TDX: Assert the reclaimed pages were mapped as expected
  2025-04-24  3:05 ` [RFC PATCH 06/21] KVM: TDX: Assert the reclaimed pages were mapped as expected Yan Zhao
@ 2025-05-13 19:25   ` Edgecombe, Rick P
  2025-05-16  2:11     ` Yan Zhao
  0 siblings, 1 reply; 294+ messages in thread
From: Edgecombe, Rick P @ 2025-05-13 19:25 UTC (permalink / raw)
  To: pbonzini@redhat.com, seanjc@google.com, Zhao, Yan Y
  Cc: Shutemov, Kirill, quic_eberman@quicinc.com, Li, Xiaoyao,
	kvm@vger.kernel.org, Hansen, Dave, david@redhat.com,
	thomas.lendacky@amd.com, tabba@google.com, Li, Zhiquan1, Du, Fan,
	linux-kernel@vger.kernel.org, michael.roth@amd.com, Weiny, Ira,
	vbabka@suse.cz, binbin.wu@linux.intel.com, ackerleytng@google.com,
	Yamahata, Isaku, Peng, Chao P, Annapurve, Vishal, jroedel@suse.de,
	Miao, Jun, pgonda@google.com, x86@kernel.org

On Thu, 2025-04-24 at 11:05 +0800, Yan Zhao wrote:
>  /* TDH.PHYMEM.PAGE.RECLAIM is allowed only when destroying the TD. */
> -static int __tdx_reclaim_page(struct page *page)
> +static int __tdx_reclaim_page(struct page *page, int level)
>  {
>  	u64 err, tdx_pt, tdx_owner, tdx_size;
>  
> @@ -340,16 +340,18 @@ static int __tdx_reclaim_page(struct page *page)
>  		pr_tdx_error_3(TDH_PHYMEM_PAGE_RECLAIM, err, tdx_pt, tdx_owner, tdx_size);
>  		return -EIO;
>  	}
> +
> +	WARN_ON_ONCE(tdx_size != pg_level_to_tdx_sept_level(level));

Why not return an error in this case?
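
e.g. (sketch):

	if (WARN_ON_ONCE(tdx_size != pg_level_to_tdx_sept_level(level)))
		return -EIO;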

>  	return 0;
>  }
>  

No callers in the series pass anything other than PG_LEVEL_4K, so do we need
this patch?

^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 07/21] KVM: TDX: Add a helper for WBINVD on huge pages with TD's keyID
  2025-04-24  3:05 ` [RFC PATCH 07/21] KVM: TDX: Add a helper for WBINVD on huge pages with TD's keyID Yan Zhao
  2025-05-06  8:37   ` Binbin Wu
@ 2025-05-13 19:29   ` Edgecombe, Rick P
  2025-05-16  3:03     ` Yan Zhao
  1 sibling, 1 reply; 294+ messages in thread
From: Edgecombe, Rick P @ 2025-05-13 19:29 UTC (permalink / raw)
  To: pbonzini@redhat.com, seanjc@google.com, Zhao, Yan Y
  Cc: Shutemov, Kirill, quic_eberman@quicinc.com, Li, Xiaoyao,
	kvm@vger.kernel.org, Hansen, Dave, david@redhat.com,
	thomas.lendacky@amd.com, tabba@google.com, Li, Zhiquan1, Du, Fan,
	linux-kernel@vger.kernel.org, michael.roth@amd.com, Weiny, Ira,
	vbabka@suse.cz, binbin.wu@linux.intel.com, ackerleytng@google.com,
	Yamahata, Isaku, Peng, Chao P, Annapurve, Vishal, jroedel@suse.de,
	Miao, Jun, pgonda@google.com, x86@kernel.org

On Thu, 2025-04-24 at 11:05 +0800, Yan Zhao wrote:
> From: Xiaoyao Li <xiaoyao.li@intel.com>
> 
> After a guest page is removed from the S-EPT, KVM calls
> tdh_phymem_page_wbinvd_hkid() to execute WBINVD on the page using the TD's
> keyID.
> 
> Add a helper function that takes level information to perform WBINVD on a
> huge page.
> 
> [Yan: split patch, added a helper, rebased to use struct page]
> Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
> ---
>  arch/x86/kvm/vmx/tdx.c | 24 +++++++++++++++++++-----
>  1 file changed, 19 insertions(+), 5 deletions(-)
> 
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index 69f3140928b5..355b21fc169f 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -1586,6 +1586,23 @@ int tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn,
>  	return tdx_mem_page_record_premap_cnt(kvm, level);
>  }
>  
> +static inline u64 tdx_wbinvd_page(struct kvm *kvm, u64 hkid, struct page *page, int level)
> +{
> +	unsigned long nr = KVM_PAGES_PER_HPAGE(level);
> +	unsigned long idx = 0;
> +	u64 err;
> +
> +	while (nr--) {
> +		err = tdh_phymem_page_wbinvd_hkid(hkid, nth_page(page, idx++));
> +
> +		if (KVM_BUG_ON(err, kvm)) {
> +			pr_tdx_error(TDH_PHYMEM_PAGE_WBINVD, err);
> +			return err;
> +		}
> +	}
> +	return err;
> +}

Hmm, did you consider changing tdh_phymem_page_wbinvd_hkid() itself? This is the
pattern of KVM wrapping the SEAMCALL helpers just to do some more work that could
instead be handled inside the wrapper.
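
Something like the below on the arch/x86/virt side (untested sketch; the
__tdh_phymem_page_wbinvd_hkid() per-4K helper is hypothetical, and level here is
the TDX level, as in tdh_mem_page_aug()):

u64 tdh_phymem_page_wbinvd_hkid(u64 hkid, struct page *page, int level)
{
	unsigned long nr_pages = 1UL << (level * 9);
	unsigned long i;
	u64 err = 0;

	for (i = 0; i < nr_pages; i++) {
		err = __tdh_phymem_page_wbinvd_hkid(hkid, nth_page(page, i));
		if (err)
			break;
	}
	return err;
}

KVM would then keep a single call site with the KVM_BUG_ON()/pr_tdx_error() and
no loop of its own.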

> +
>  static int tdx_sept_drop_private_spte(struct kvm *kvm, gfn_t gfn,
>  				      enum pg_level level, struct page *page)
>  {
> @@ -1625,12 +1642,9 @@ static int tdx_sept_drop_private_spte(struct kvm *kvm, gfn_t gfn,
>  		return -EIO;
>  	}
>  
> -	err = tdh_phymem_page_wbinvd_hkid((u16)kvm_tdx->hkid, page);
> -
> -	if (KVM_BUG_ON(err, kvm)) {
> -		pr_tdx_error(TDH_PHYMEM_PAGE_WBINVD, err);
> +	err = tdx_wbinvd_page(kvm, kvm_tdx->hkid, page, level);
> +	if (err)
>  		return -EIO;
> -	}
>  
>  	tdx_clear_page(page, level);
>  	tdx_unpin(kvm, page);


^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 09/21] KVM: TDX: Enable 2MB mapping size after TD is RUNNABLE
  2025-04-24  3:06 ` [RFC PATCH 09/21] KVM: TDX: Enable 2MB mapping size after TD is RUNNABLE Yan Zhao
@ 2025-05-13 20:10   ` Edgecombe, Rick P
  2025-05-16  1:35     ` Huang, Kai
  2025-05-16  9:28     ` Yan Zhao
  0 siblings, 2 replies; 294+ messages in thread
From: Edgecombe, Rick P @ 2025-05-13 20:10 UTC (permalink / raw)
  To: pbonzini@redhat.com, seanjc@google.com, Zhao, Yan Y
  Cc: Shutemov, Kirill, quic_eberman@quicinc.com, Li, Xiaoyao,
	kvm@vger.kernel.org, Hansen, Dave, david@redhat.com,
	thomas.lendacky@amd.com, tabba@google.com, Li, Zhiquan1, Du, Fan,
	linux-kernel@vger.kernel.org, michael.roth@amd.com, Weiny, Ira,
	vbabka@suse.cz, binbin.wu@linux.intel.com, ackerleytng@google.com,
	Yamahata, Isaku, Peng, Chao P, Annapurve, Vishal, jroedel@suse.de,
	Miao, Jun, pgonda@google.com, x86@kernel.org

On Thu, 2025-04-24 at 11:06 +0800, Yan Zhao wrote:
> Allow TDX's .private_max_mapping_level hook to return 2MB after the TD is
> RUNNABLE, enabling KVM to map TDX private pages at the 2MB level. Remove
> TODOs and adjust KVM_BUG_ON()s accordingly.
> 
> Note: Instead of placing this patch at the tail of the series, it's
> positioned here to show the code changes for basic mapping of private huge
> pages (i.e., transitioning from non-present to present).
> 
> However, since this patch also allows KVM to trigger the merging of small
> entries into a huge leaf entry or the splitting of a huge leaf entry into
> small entries, errors are expected if any of these operations are triggered
> due to the current lack of splitting/merging support.
> 
> Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
> ---
>  arch/x86/kvm/vmx/tdx.c | 16 +++++++---------
>  1 file changed, 7 insertions(+), 9 deletions(-)
> 
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index e23dce59fc72..6b3a8f3e6c9c 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -1561,10 +1561,6 @@ int tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn,
>  	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
>  	struct page *page = pfn_to_page(pfn);
>  
> -	/* TODO: handle large pages. */
> -	if (KVM_BUG_ON(level != PG_LEVEL_4K, kvm))
> -		return -EINVAL;
> -
>  	/*
>  	 * Because guest_memfd doesn't support page migration with
>  	 * a_ops->migrate_folio (yet), no callback is triggered for KVM on page
> @@ -1612,8 +1608,7 @@ static int tdx_sept_drop_private_spte(struct kvm *kvm, gfn_t gfn,
>  	gpa_t gpa = gfn_to_gpa(gfn);
>  	u64 err, entry, level_state;
>  
> -	/* TODO: handle large pages. */
> -	if (KVM_BUG_ON(level != PG_LEVEL_4K, kvm))
> +	if (KVM_BUG_ON(kvm_tdx->state != TD_STATE_RUNNABLE && level != PG_LEVEL_4K, kvm))

It's not clear why some of these warnings are here and some are in patch 4.

>  		return -EINVAL;
>  
>  	if (KVM_BUG_ON(!is_hkid_assigned(kvm_tdx), kvm))
> @@ -1714,8 +1709,8 @@ static int tdx_sept_zap_private_spte(struct kvm *kvm, gfn_t gfn,
>  	gpa_t gpa = gfn_to_gpa(gfn) & KVM_HPAGE_MASK(level);
>  	u64 err, entry, level_state;
>  
> -	/* For now large page isn't supported yet. */
> -	WARN_ON_ONCE(level != PG_LEVEL_4K);
> +	/* Before TD runnable, large page is not supported */
> +	WARN_ON_ONCE(kvm_tdx->state != TD_STATE_RUNNABLE && level != PG_LEVEL_4K);
>  
>  	err = tdh_mem_range_block(&kvm_tdx->td, gpa, tdx_level, &entry, &level_state);
>  
> @@ -1817,6 +1812,9 @@ int tdx_sept_remove_private_spte(struct kvm *kvm, gfn_t gfn,
>  	struct page *page = pfn_to_page(pfn);
>  	int ret;
>  
> +	WARN_ON_ONCE(folio_page_idx(page_folio(page), page) + KVM_PAGES_PER_HPAGE(level) >
> +		     folio_nr_pages(page_folio(page)));
> +
>  	/*
>  	 * HKID is released after all private pages have been removed, and set
>  	 * before any might be populated. Warn if zapping is attempted when
> @@ -3265,7 +3263,7 @@ int tdx_gmem_private_max_mapping_level(struct kvm *kvm, kvm_pfn_t pfn)
>  	if (unlikely(to_kvm_tdx(kvm)->state != TD_STATE_RUNNABLE))
>  		return PG_LEVEL_4K;
>  
> -	return PG_LEVEL_4K;
> +	return PG_LEVEL_2M;

Maybe combine this with patch 4, or split them into sensible categories.

>  }
>  
>  static int tdx_online_cpu(unsigned int cpu)


^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 10/21] KVM: x86/mmu: Disallow page merging (huge page adjustment) for mirror root
  2025-04-24  3:06 ` [RFC PATCH 10/21] KVM: x86/mmu: Disallow page merging (huge page adjustment) for mirror root Yan Zhao
@ 2025-05-13 20:15   ` Edgecombe, Rick P
  2025-05-16  4:01     ` Yan Zhao
  0 siblings, 1 reply; 294+ messages in thread
From: Edgecombe, Rick P @ 2025-05-13 20:15 UTC (permalink / raw)
  To: pbonzini@redhat.com, seanjc@google.com, Zhao, Yan Y
  Cc: Shutemov, Kirill, quic_eberman@quicinc.com, Li, Xiaoyao,
	kvm@vger.kernel.org, Hansen, Dave, david@redhat.com,
	thomas.lendacky@amd.com, tabba@google.com, Li, Zhiquan1, Du, Fan,
	linux-kernel@vger.kernel.org, michael.roth@amd.com, Weiny, Ira,
	vbabka@suse.cz, binbin.wu@linux.intel.com, ackerleytng@google.com,
	Yamahata, Isaku, Peng, Chao P, Annapurve, Vishal, jroedel@suse.de,
	Miao, Jun, pgonda@google.com, x86@kernel.org

On Thu, 2025-04-24 at 11:06 +0800, Yan Zhao wrote:
> From: "Edgecombe, Rick P" <rick.p.edgecombe@intel.com>
> 
> Disallow page merging (huge page adjustment) for mirror root by leveraging
> the disallowed_hugepage_adjust().
> 
> [Yan: Passing is_mirror to disallowed_hugepage_adjust()]
> 
> Signed-off-by: Edgecombe, Rick P <rick.p.edgecombe@intel.com>
> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
> ---
>  arch/x86/kvm/mmu/mmu.c          | 6 +++---
>  arch/x86/kvm/mmu/mmu_internal.h | 2 +-
>  arch/x86/kvm/mmu/paging_tmpl.h  | 2 +-
>  arch/x86/kvm/mmu/tdp_mmu.c      | 7 ++++---
>  4 files changed, 9 insertions(+), 8 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index a284dce227a0..b923deeeb62e 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -3326,13 +3326,13 @@ void kvm_mmu_hugepage_adjust(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
>  	fault->pfn &= ~mask;
>  }
>  
> -void disallowed_hugepage_adjust(struct kvm_page_fault *fault, u64 spte, int cur_level)
> +void disallowed_hugepage_adjust(struct kvm_page_fault *fault, u64 spte, int cur_level, bool is_mirror)
>  {
>  	if (cur_level > PG_LEVEL_4K &&
>  	    cur_level == fault->goal_level &&
>  	    is_shadow_present_pte(spte) &&
>  	    !is_large_pte(spte) &&
> -	    spte_to_child_sp(spte)->nx_huge_page_disallowed) {
> +	    (spte_to_child_sp(spte)->nx_huge_page_disallowed || is_mirror)) {
>  		/*
>  		 * A small SPTE exists for this pfn, but FNAME(fetch),
>  		 * direct_map(), or kvm_tdp_mmu_map() would like to create a
> @@ -3363,7 +3363,7 @@ static int direct_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
>  		 * large page, as the leaf could be executable.
>  		 */
>  		if (fault->nx_huge_page_workaround_enabled)
> -			disallowed_hugepage_adjust(fault, *it.sptep, it.level);
> +			disallowed_hugepage_adjust(fault, *it.sptep, it.level, false);
>  
>  		base_gfn = gfn_round_for_level(fault->gfn, it.level);
>  		if (it.level == fault->goal_level)
> diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
> index db8f33e4de62..1c1764f46e66 100644
> --- a/arch/x86/kvm/mmu/mmu_internal.h
> +++ b/arch/x86/kvm/mmu/mmu_internal.h
> @@ -411,7 +411,7 @@ static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
>  int kvm_mmu_max_mapping_level(struct kvm *kvm,
>  			      const struct kvm_memory_slot *slot, gfn_t gfn);
>  void kvm_mmu_hugepage_adjust(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault);
> -void disallowed_hugepage_adjust(struct kvm_page_fault *fault, u64 spte, int cur_level);
> +void disallowed_hugepage_adjust(struct kvm_page_fault *fault, u64 spte, int cur_level, bool is_mirror);
>  
>  void track_possible_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp);
>  void untrack_possible_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp);
> diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
> index 68e323568e95..1559182038e3 100644
> --- a/arch/x86/kvm/mmu/paging_tmpl.h
> +++ b/arch/x86/kvm/mmu/paging_tmpl.h
> @@ -717,7 +717,7 @@ static int FNAME(fetch)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
>  		 * large page, as the leaf could be executable.
>  		 */
>  		if (fault->nx_huge_page_workaround_enabled)
> -			disallowed_hugepage_adjust(fault, *it.sptep, it.level);
> +			disallowed_hugepage_adjust(fault, *it.sptep, it.level, false);
>  
>  		base_gfn = gfn_round_for_level(fault->gfn, it.level);
>  		if (it.level == fault->goal_level)
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 405874f4d088..8ee01277cc07 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -1244,6 +1244,7 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
>  	struct tdp_iter iter;
>  	struct kvm_mmu_page *sp;
>  	int ret = RET_PF_RETRY;
> +	bool is_mirror = is_mirror_sp(root);
>  
>  	kvm_mmu_hugepage_adjust(vcpu, fault);
>  
> @@ -1254,8 +1255,8 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
>  	for_each_tdp_pte(iter, kvm, root, fault->gfn, fault->gfn + 1) {
>  		int r;
>  
> -		if (fault->nx_huge_page_workaround_enabled)
> -			disallowed_hugepage_adjust(fault, iter.old_spte, iter.level);
> +		if (fault->nx_huge_page_workaround_enabled || is_mirror)

Maybe we should rename nx_huge_page_workaround_enabled to something more generic
and do the is_mirror logic in kvm_mmu_do_page_fault() when setting it. It should
shrink the diff and centralize the logic.

> +			disallowed_hugepage_adjust(fault, iter.old_spte, iter.level, is_mirror);
>  
>  		/*
>  		 * If SPTE has been frozen by another thread, just give up and
> @@ -1278,7 +1279,7 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
>  		 */
>  		sp = tdp_mmu_alloc_sp(vcpu);
>  		tdp_mmu_init_child_sp(sp, &iter);
> -		if (is_mirror_sp(sp))
> +		if (is_mirror)
>  			kvm_mmu_alloc_external_spt(vcpu, sp);
>  
>  		sp->nx_huge_page_disallowed = fault->huge_page_disallowed;


^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 12/21] KVM: TDX: Determine max mapping level according to vCPU's ACCEPT level
  2025-04-24  3:07 ` [RFC PATCH 12/21] KVM: TDX: Determine max mapping level according to vCPU's ACCEPT level Yan Zhao
@ 2025-05-13 21:20   ` Edgecombe, Rick P
  2025-05-16  6:12     ` Xiaoyao Li
  2025-05-16  6:30     ` Yan Zhao
  0 siblings, 2 replies; 294+ messages in thread
From: Edgecombe, Rick P @ 2025-05-13 21:20 UTC (permalink / raw)
  To: pbonzini@redhat.com, seanjc@google.com, Zhao, Yan Y
  Cc: Shutemov, Kirill, quic_eberman@quicinc.com, Li, Xiaoyao,
	kvm@vger.kernel.org, Hansen, Dave, david@redhat.com,
	thomas.lendacky@amd.com, tabba@google.com, Li, Zhiquan1, Du, Fan,
	linux-kernel@vger.kernel.org, michael.roth@amd.com, Weiny, Ira,
	vbabka@suse.cz, binbin.wu@linux.intel.com, ackerleytng@google.com,
	Yamahata, Isaku, Peng, Chao P, Annapurve, Vishal, jroedel@suse.de,
	Miao, Jun, pgonda@google.com, x86@kernel.org

On Thu, 2025-04-24 at 11:07 +0800, Yan Zhao wrote:
> Determine the max mapping level of a private GFN according to the vCPU's
> ACCEPT level specified in the TDCALL TDG.MEM.PAGE.ACCEPT.
> 
> When an EPT violation occurs due to a vCPU invoking TDG.MEM.PAGE.ACCEPT
> before any actual memory access, the vCPU's ACCEPT level is available in
> the extended exit qualification. Set the vCPU's ACCEPT level as the max
> mapping level for the faulting GFN. This is necessary because if KVM
> specifies a mapping level greater than the vCPU's ACCEPT level, and no
> other vCPUs are accepting at KVM's mapping level, TDG.MEM.PAGE.ACCEPT will
> produce another EPT violation on the vCPU after re-entering the TD, with
> the vCPU's ACCEPT level indicated in the extended exit qualification.

Maybe a little more info would help. It's because the TDX module wants to
"accept" the smaller size in the real S-EPT, but KVM created a huge page. It
can't demote to do this without help from KVM.

> 
> Introduce "violation_gfn_start", "violation_gfn_end", and
> "violation_request_level" in "struct vcpu_tdx" to pass the vCPU's ACCEPT
> level to TDX's private_max_mapping_level hook for determining the max
> mapping level.
> 
> Instead of taking some bits of the error_code passed to
> kvm_mmu_page_fault() and requiring KVM MMU core to check the error_code for
> a fault's max_level, having TDX's private_max_mapping_level hook check for
> request level avoids changes to the KVM MMU core. This approach also
> accommodates future scenarios where the requested mapping level is unknown
> at the start of tdx_handle_ept_violation() (i.e., before invoking
> kvm_mmu_page_fault()).
> 
> Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
> ---
>  arch/x86/kvm/vmx/tdx.c      | 36 +++++++++++++++++++++++++++++++++++-
>  arch/x86/kvm/vmx/tdx.h      |  4 ++++
>  arch/x86/kvm/vmx/tdx_arch.h |  3 +++
>  3 files changed, 42 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index 86775af85cd8..dd63a634e633 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -1859,10 +1859,34 @@ static inline bool tdx_is_sept_violation_unexpected_pending(struct kvm_vcpu *vcp
>  	return !(eq & EPT_VIOLATION_PROT_MASK) && !(eq & EPT_VIOLATION_EXEC_FOR_RING3_LIN);
>  }
>  
> +static inline void tdx_get_accept_level(struct kvm_vcpu *vcpu, gpa_t gpa)
> +{
> +	struct vcpu_tdx *tdx = to_tdx(vcpu);
> +	int level = -1;
> +
> +	u64 eeq_type = tdx->ext_exit_qualification & TDX_EXT_EXIT_QUAL_TYPE_MASK;
> +
> +	u32 eeq_info = (tdx->ext_exit_qualification & TDX_EXT_EXIT_QUAL_INFO_MASK) >>
> +			TDX_EXT_EXIT_QUAL_INFO_SHIFT;
> +
> +	if (eeq_type == TDX_EXT_EXIT_QUAL_TYPE_ACCEPT) {
> +		level = (eeq_info & GENMASK(2, 0)) + 1;
> +
> +		tdx->violation_gfn_start = gfn_round_for_level(gpa_to_gfn(gpa), level);
> +		tdx->violation_gfn_end = tdx->violation_gfn_start + KVM_PAGES_PER_HPAGE(level);
> +		tdx->violation_request_level = level;
> +	} else {
> +		tdx->violation_gfn_start = -1;
> +		tdx->violation_gfn_end = -1;
> +		tdx->violation_request_level = -1;

We had some internal conversations on how KVM used to stuff a bunch of fault
stuff in the vcpu so it didn't have to pass it around, but now uses the fault
struct for this. The point was (IIRC) to prevent stale data from getting
confused on future faults, and it being hard to track what came from where.

In the TDX case, I think the potential for confusion is still there. The MMU
code could use stale data if an accept EPT violation happens and control returns
to userspace, at which point userspace does a KVM_PRE_FAULT_MEMORY. Then it will
see the stale  tdx->violation_*. Not exactly a common case, but better to not
have loose ends if we can avoid it.

Looking more closely, I don't see why it's too hard to pass in a max_fault_level
into the fault struct. Totally untested rough idea, what do you think?

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index faae82eefd99..3dc476da6391 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -282,7 +282,11 @@ enum x86_intercept_stage;
  * when the guest was accessing private memory.
  */
 #define PFERR_PRIVATE_ACCESS   BIT_ULL(49)
-#define PFERR_SYNTHETIC_MASK   (PFERR_IMPLICIT_ACCESS | PFERR_PRIVATE_ACCESS)
+
+#define PFERR_FAULT_LEVEL_MASK (BIT_ULL(50) | BIT_ULL(51) | BIT_ULL(52))
+#define PFERR_FAULT_LEVEL_SHIFT 50
+
+#define PFERR_SYNTHETIC_MASK   (PFERR_IMPLICIT_ACCESS | PFERR_PRIVATE_ACCESS | PFERR_FAULT_LEVEL_MASK)
 
 /* apic attention bits */
 #define KVM_APIC_CHECK_VAPIC   0
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index 1c1764f46e66..bdb1b0eabd67 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -361,7 +361,7 @@ static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
                .nx_huge_page_workaround_enabled =
                        is_nx_huge_page_enabled(vcpu->kvm),
 
-               .max_level = KVM_MAX_HUGEPAGE_LEVEL,
+               .max_level = (err & PFERR_FAULT_LEVEL_MASK) >> PFERR_FAULT_LEVEL_SHIFT,
                .req_level = PG_LEVEL_4K,
                .goal_level = PG_LEVEL_4K,
                .is_private = err & PFERR_PRIVATE_ACCESS,
diff --git a/arch/x86/kvm/vmx/common.h b/arch/x86/kvm/vmx/common.h
index 8f46a06e2c44..2f22b294ef8b 100644
--- a/arch/x86/kvm/vmx/common.h
+++ b/arch/x86/kvm/vmx/common.h
@@ -83,7 +83,8 @@ static inline bool vt_is_tdx_private_gpa(struct kvm *kvm, gpa_t gpa)
 }
 
 static inline int __vmx_handle_ept_violation(struct kvm_vcpu *vcpu, gpa_t gpa,
-                                            unsigned long exit_qualification)
+                                            unsigned long exit_qualification,
+                                            u8 max_fault_level)
 {
        u64 error_code;
 
@@ -107,6 +108,10 @@ static inline int __vmx_handle_ept_violation(struct kvm_vcpu *vcpu, gpa_t gpa,
        if (vt_is_tdx_private_gpa(vcpu->kvm, gpa))
                error_code |= PFERR_PRIVATE_ACCESS;
 
+       BUILD_BUG_ON(KVM_MAX_HUGEPAGE_LEVEL >= (1 << hweight64(PFERR_FAULT_LEVEL_MASK)));
+
+       error_code |= (u64)max_fault_level << PFERR_FAULT_LEVEL_SHIFT;
+
        return kvm_mmu_page_fault(vcpu, gpa, error_code, NULL, 0);
 }
 
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index e994a6c08a75..19047de4d98d 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -2027,7 +2027,7 @@ static int tdx_handle_ept_violation(struct kvm_vcpu *vcpu)
         * handle retries locally in their EPT violation handlers.
         */
        while (1) {
-               ret = __vmx_handle_ept_violation(vcpu, gpa, exit_qual);
+               ret = __vmx_handle_ept_violation(vcpu, gpa, exit_qual, KVM_MAX_HUGEPAGE_LEVEL);
 
                if (ret != RET_PF_RETRY || !local_retry)
                        break;
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index ef2d7208dd20..b70a2ff35884 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -5782,7 +5782,7 @@ static int handle_ept_violation(struct kvm_vcpu *vcpu)
        if (unlikely(allow_smaller_maxphyaddr && !kvm_vcpu_is_legal_gpa(vcpu, gpa)))
                return kvm_emulate_instruction(vcpu, 0);
 
-       return __vmx_handle_ept_violation(vcpu, gpa, exit_qualification);
+       return __vmx_handle_ept_violation(vcpu, gpa, exit_qualification, KVM_MAX_HUGEPAGE_LEVEL);
 }
 
 static int handle_ept_misconfig(struct kvm_vcpu *vcpu)



^ permalink raw reply related	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 21/21] KVM: x86: Ignore splitting huge pages in fault path for TDX
  2025-04-24  3:09 ` [RFC PATCH 21/21] KVM: x86: Ignore splitting huge pages in fault path " Yan Zhao
@ 2025-05-13 21:58   ` Edgecombe, Rick P
  2025-05-16  6:40     ` Yan Zhao
  0 siblings, 1 reply; 294+ messages in thread
From: Edgecombe, Rick P @ 2025-05-13 21:58 UTC (permalink / raw)
  To: pbonzini@redhat.com, seanjc@google.com, Zhao, Yan Y
  Cc: Shutemov, Kirill, quic_eberman@quicinc.com, Li, Xiaoyao,
	kvm@vger.kernel.org, Hansen, Dave, david@redhat.com,
	thomas.lendacky@amd.com, tabba@google.com, Li, Zhiquan1, Du, Fan,
	linux-kernel@vger.kernel.org, michael.roth@amd.com, Weiny, Ira,
	vbabka@suse.cz, binbin.wu@linux.intel.com, ackerleytng@google.com,
	Yamahata, Isaku, Peng, Chao P, Annapurve, Vishal, jroedel@suse.de,
	Miao, Jun, pgonda@google.com, x86@kernel.org

On Thu, 2025-04-24 at 11:09 +0800, Yan Zhao wrote:

>  int tdx_sept_split_private_spt(struct kvm *kvm, gfn_t gfn, enum pg_level level,
> -			       void *private_spt)
> +			       void *private_spt, bool mmu_lock_shared)
>  {
>  	struct page *page = virt_to_page(private_spt);
>  	int ret;
> @@ -1842,6 +1842,29 @@ int tdx_sept_split_private_spt(struct kvm *kvm, gfn_t gfn, enum pg_level level,
>  	if (KVM_BUG_ON(to_kvm_tdx(kvm)->state != TD_STATE_RUNNABLE || level != PG_LEVEL_2M, kvm))
>  		return -EINVAL;
>  
> +	/*
> +	 * Split request with mmu_lock held for reading can only occur when one
> +	 * vCPU accepts at 2MB level while another vCPU accepts at 4KB level.
> +	 * Ignore this 4KB mapping request by setting violation_request_level to
> +	 * 2MB and returning -EBUSY for retry. Then the next fault at 2MB level
> +	 * would be a spurious fault. The vCPU accepting at 2MB will accept the
> +	 * whole 2MB range.
> +	 */
> +	if (mmu_lock_shared) {
> +		struct kvm_vcpu *vcpu = kvm_get_running_vcpu();
> +		struct vcpu_tdx *tdx = to_tdx(vcpu);
> +
> +		if (KVM_BUG_ON(!vcpu, kvm))
> +			return -EOPNOTSUPP;
> +
> +		/* Request to map as 2MB leaf for the whole 2MB range */
> +		tdx->violation_gfn_start = gfn_round_for_level(gfn, level);
> +		tdx->violation_gfn_end = tdx->violation_gfn_start + KVM_PAGES_PER_HPAGE(level);
> +		tdx->violation_request_level = level;
> +
> +		return -EBUSY;

This is too hacky in how much it infers from mmu_lock_shared. Since guests
shouldn't be doing this, what about just doing kvm_vm_dead(), with a little
pr_warn()? Maybe even just do it in set_external_spte_present() and declare it
the rule for external page tables. It can shrink this patch significantly, for
no expected user impact.
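
i.e. roughly (sketch; the message and return value are just illustrative):

	if (mmu_lock_shared) {
		pr_warn_ratelimited("Unexpected private huge page split under shared mmu_lock\n");
		kvm_vm_dead(kvm);
		return -EIO;
	}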

> +	}
> +
>  	ret = tdx_sept_zap_private_spte(kvm, gfn, level, page);
>  	if (ret <= 0)
>  		return ret;
> diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
> index 0619e9390e5d..fcba76887508 100644
> --- a/arch/x86/kvm/vmx/x86_ops.h
> +++ b/arch/x86/kvm/vmx/x86_ops.h
> @@ -159,7 +159,7 @@ int tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn,
>  int tdx_sept_remove_private_spte(struct kvm *kvm, gfn_t gfn,
>  				 enum pg_level level, kvm_pfn_t pfn);
>  int tdx_sept_split_private_spt(struct kvm *kvm, gfn_t gfn, enum pg_level level,
> -			       void *private_spt);
> +			       void *private_spt, bool mmu_lock_shared);
>  
>  void tdx_flush_tlb_current(struct kvm_vcpu *vcpu);
>  void tdx_flush_tlb_all(struct kvm_vcpu *vcpu);
> @@ -228,7 +228,8 @@ static inline int tdx_sept_remove_private_spte(struct kvm *kvm, gfn_t gfn,
>  
>  static inline int tdx_sept_split_private_spt(struct kvm *kvm, gfn_t gfn,
>  					     enum pg_level level,
> -					     void *private_spt)
> +					     void *private_spt,
> +					     bool mmu_lock_shared)
>  {
>  	return -EOPNOTSUPP;
>  }


^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 16/21] KVM: x86/mmu: Introduce kvm_split_boundary_leafs() to split boundary leafs
  2025-04-24  3:08 ` [RFC PATCH 16/21] KVM: x86/mmu: Introduce kvm_split_boundary_leafs() to split boundary leafs Yan Zhao
@ 2025-05-13 22:56   ` Edgecombe, Rick P
  2025-05-16  7:46     ` Yan Zhao
  0 siblings, 1 reply; 294+ messages in thread
From: Edgecombe, Rick P @ 2025-05-13 22:56 UTC (permalink / raw)
  To: pbonzini@redhat.com, seanjc@google.com, Zhao, Yan Y
  Cc: Shutemov, Kirill, quic_eberman@quicinc.com, Li, Xiaoyao,
	kvm@vger.kernel.org, Hansen, Dave, david@redhat.com,
	thomas.lendacky@amd.com, tabba@google.com, Li, Zhiquan1, Du, Fan,
	linux-kernel@vger.kernel.org, michael.roth@amd.com, Weiny, Ira,
	vbabka@suse.cz, binbin.wu@linux.intel.com, ackerleytng@google.com,
	Yamahata, Isaku, Peng, Chao P, Annapurve, Vishal, jroedel@suse.de,
	Miao, Jun, pgonda@google.com, x86@kernel.org

On Thu, 2025-04-24 at 11:08 +0800, Yan Zhao wrote:
> Introduce kvm_split_boundary_leafs() to manage the splitting of boundary
> leafs within the mirror root.
> 
> Before zapping a specific GFN range in the mirror root, split any huge leaf
> that intersects with the boundary of the GFN range to ensure that the
> subsequent zap operation does not impact any GFN outside the specified
> range. This is crucial for the mirror root as the private page table
> requires the guest's ACCEPT operation after faulting back a GFN.
> 
> This function should be called while kvm->mmu_lock is held for writing. The
> kvm->mmu_lock is temporarily released to allocate memory for sp for split.
> The only expected error is -ENOMEM.
> 
> Opportunistically, WARN in tdp_mmu_zap_leafs() if zapping a huge leaf in
> the mirror root affects a GFN outside the specified range.
> 
> Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
> ---
>  arch/x86/kvm/mmu/mmu.c     |  21 +++++++
>  arch/x86/kvm/mmu/tdp_mmu.c | 116 ++++++++++++++++++++++++++++++++++++-
>  arch/x86/kvm/mmu/tdp_mmu.h |   1 +
>  include/linux/kvm_host.h   |   1 +
>  4 files changed, 136 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 0e227199d73e..0d49c69b6b55 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -1640,6 +1640,27 @@ static bool __kvm_rmap_zap_gfn_range(struct kvm *kvm,
>  				 start, end - 1, can_yield, true, flush);
>  }
>  
> +/*
> + * Split large leafs at the boundary of the specified range for the mirror root
> + *
> + * Return value:
> + * 0 : success, no flush is required;
> + * 1 : success, flush is required;
> + * <0: failure.
> + */
> +int kvm_split_boundary_leafs(struct kvm *kvm, struct kvm_gfn_range *range)
> +{
> +	bool ret = 0;
> +
> +	lockdep_assert_once(kvm->mmu_invalidate_in_progress ||
> +			    lockdep_is_held(&kvm->slots_lock));
> +
> +	if (tdp_mmu_enabled)
> +		ret = kvm_tdp_mmu_gfn_range_split_boundary(kvm, range);
> +
> +	return ret;
> +}
> +
>  bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
>  {
>  	bool flush = false;
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 0f683753a7bb..d3fba5d11ea2 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -324,6 +324,8 @@ static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
>  				u64 old_spte, u64 new_spte, int level,
>  				bool shared);
>  
> +static int tdp_mmu_split_huge_page(struct kvm *kvm, struct tdp_iter *iter,
> +				   struct kvm_mmu_page *sp, bool shared);
>  static struct kvm_mmu_page *tdp_mmu_alloc_sp_for_split(bool mirror);
>  static void *get_external_spt(gfn_t gfn, u64 new_spte, int level);
>  
> @@ -962,6 +964,19 @@ bool kvm_tdp_mmu_zap_sp(struct kvm *kvm, struct kvm_mmu_page *sp)
>  	return true;
>  }
>  
> +static inline bool iter_split_required(struct kvm *kvm, struct kvm_mmu_page *root,
> +				       struct tdp_iter *iter, gfn_t start, gfn_t end)
> +{
> +	if (!is_mirror_sp(root) || !is_large_pte(iter->old_spte))
> +		return false;
> +
> +	/* Fully contained, no need to split */
> +	if (iter->gfn >= start && iter->gfn + KVM_PAGES_PER_HPAGE(iter->level) <= end)
> +		return false;
> +
> +	return true;
> +}
> +
>  /*
>   * If can_yield is true, will release the MMU lock and reschedule if the
>   * scheduler needs the CPU or there is contention on the MMU lock. If this
> @@ -991,6 +1006,8 @@ static bool tdp_mmu_zap_leafs(struct kvm *kvm, struct kvm_mmu_page *root,
>  		    !is_last_spte(iter.old_spte, iter.level))
>  			continue;
>  
> +		WARN_ON_ONCE(iter_split_required(kvm, root, &iter, start, end));
> +

Kind of unrelated change? But good idea. Maybe for another patch.

>  		tdp_mmu_iter_set_spte(kvm, &iter, SHADOW_NONPRESENT_VALUE);
>  
>  		/*
> @@ -1246,9 +1263,6 @@ static int tdp_mmu_link_sp(struct kvm *kvm, struct tdp_iter *iter,
>  	return 0;
>  }
>  
> -static int tdp_mmu_split_huge_page(struct kvm *kvm, struct tdp_iter *iter,
> -				   struct kvm_mmu_page *sp, bool shared);
> -
>  /*
>   * Handle a TDP page fault (NPT/EPT violation/misconfiguration) by installing
>   * page tables and SPTEs to translate the faulting guest physical address.
> @@ -1341,6 +1355,102 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
>  	return ret;
>  }
>  
> +/*
> + * Split large leafs at the boundary of the specified range for the mirror root
> + */
> +static int tdp_mmu_split_boundary_leafs(struct kvm *kvm, struct kvm_mmu_page *root,
> +					gfn_t start, gfn_t end, bool can_yield, bool *flush)
> +{
> +	struct kvm_mmu_page *sp = NULL;
> +	struct tdp_iter iter;
> +
> +	WARN_ON_ONCE(!can_yield);

Why pass this in then?

> +
> +	if (!is_mirror_sp(root))
> +		return 0;

What is special about mirror roots here?

> +
> +	end = min(end, tdp_mmu_max_gfn_exclusive());
> +
> +	lockdep_assert_held_write(&kvm->mmu_lock);
> +
> +	rcu_read_lock();
> +
> +	for_each_tdp_pte_min_level(iter, kvm, root, PG_LEVEL_4K, start, end) {
> +retry:
> +		if (can_yield &&

Do we need this part of the conditional based on the above?

> +		    tdp_mmu_iter_cond_resched(kvm, &iter, *flush, false)) {
> +			*flush = false;
> +			continue;
> +		}
> +
> +		if (!is_shadow_present_pte(iter.old_spte) ||
> +		    !is_last_spte(iter.old_spte, iter.level) ||
> +		    !iter_split_required(kvm, root, &iter, start, end))
> +			continue;
> +
> +		if (!sp) {
> +			rcu_read_unlock();
> +
> +			write_unlock(&kvm->mmu_lock);
> +
> +			sp = tdp_mmu_alloc_sp_for_split(true);
> +
> +			write_lock(&kvm->mmu_lock);
> +
> +			if (!sp) {
> +				trace_kvm_mmu_split_huge_page(iter.gfn, iter.old_spte,
> +							      iter.level, -ENOMEM);
> +				return -ENOMEM;
> +			}
> +			rcu_read_lock();
> +
> +			iter.yielded = true;
> +			continue;
> +		}
> +		tdp_mmu_init_child_sp(sp, &iter);
> +
> +		if (tdp_mmu_split_huge_page(kvm, &iter, sp, false))

I think it can't fail when you hold mmu write lock.

> +			goto retry;
> +
> +		sp = NULL;
> +		/*
> +		 * Set yielded in case after splitting to a lower level,
> +		 * the new iter requires furter splitting.
> +		 */
> +		iter.yielded = true;
> +		*flush = true;
> +	}
> +
> +	rcu_read_unlock();
> +
> +	/* Leave it here though it should be impossible for the mirror root */
> +	if (sp)
> +		tdp_mmu_free_sp(sp);

What do you think about relying on tdp_mmu_split_huge_pages_root() and moving
this to an optimization patch at the end?

Or what about just two calls to tdp_mmu_split_huge_pages_root() at the
boundaries?
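
e.g. (rough sketch, assuming the current tdp_mmu_split_huge_pages_root()
signature; only leafs that can actually straddle a boundary get split):

	int ret = 0;

	if (!IS_ALIGNED(start, KVM_PAGES_PER_HPAGE(PG_LEVEL_2M)))
		ret = tdp_mmu_split_huge_pages_root(kvm, root, start, start + 1,
						    PG_LEVEL_4K, false);
	if (!ret && !IS_ALIGNED(end, KVM_PAGES_PER_HPAGE(PG_LEVEL_2M)))
		ret = tdp_mmu_split_huge_pages_root(kvm, root, end - 1, end,
						    PG_LEVEL_4K, false);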

> +	return 0;
> +}
> +
> +int kvm_tdp_mmu_gfn_range_split_boundary(struct kvm *kvm, struct kvm_gfn_range *range)
> +{
> +	enum kvm_tdp_mmu_root_types types;
> +	struct kvm_mmu_page *root;
> +	bool flush = false;
> +	int ret;
> +
> +	types = kvm_gfn_range_filter_to_root_types(kvm, range->attr_filter) | KVM_INVALID_ROOTS;

What is the reason for KVM_INVALID_ROOTS in this case?

> +
> +	__for_each_tdp_mmu_root_yield_safe(kvm, root, range->slot->as_id, types) {

It would be better to check for mirror roots here, instead of inside
tdp_mmu_split_boundary_leafs().

> +		ret = tdp_mmu_split_boundary_leafs(kvm, root, range->start, range->end,
> +						   range->may_block, &flush);
> +		if (ret < 0) {
> +			if (flush)
> +				kvm_flush_remote_tlbs(kvm);
> +
> +			return ret;
> +		}
> +	}
> +	return flush;
> +}
> +
>  /* Used by mmu notifier via kvm_unmap_gfn_range() */
>  bool kvm_tdp_mmu_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range,
>  				 bool flush)
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
> index 52acf99d40a0..806a21d4f0e3 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.h
> +++ b/arch/x86/kvm/mmu/tdp_mmu.h
> @@ -69,6 +69,7 @@ void kvm_tdp_mmu_zap_all(struct kvm *kvm);
>  void kvm_tdp_mmu_invalidate_roots(struct kvm *kvm,
>  				  enum kvm_tdp_mmu_root_types root_types);
>  void kvm_tdp_mmu_zap_invalidated_roots(struct kvm *kvm, bool shared);
> +int kvm_tdp_mmu_gfn_range_split_boundary(struct kvm *kvm, struct kvm_gfn_range *range);
>  
>  int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault);
>  
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 655d36e1f4db..19d7a577e7ed 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -272,6 +272,7 @@ struct kvm_gfn_range {
>  bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range);
>  bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
>  bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
> +int kvm_split_boundary_leafs(struct kvm *kvm, struct kvm_gfn_range *range);
>  #endif
>  
>  enum {


^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 19/21] KVM: gmem: Split huge boundary leafs for punch hole of private memory
  2025-04-24  3:08 ` [RFC PATCH 19/21] KVM: gmem: Split huge boundary leafs for punch hole of private memory Yan Zhao
  2025-04-24 10:19   ` Francesco Lavra
@ 2025-05-13 22:59   ` Edgecombe, Rick P
  2025-05-16  8:19     ` Yan Zhao
  1 sibling, 1 reply; 294+ messages in thread
From: Edgecombe, Rick P @ 2025-05-13 22:59 UTC (permalink / raw)
  To: pbonzini@redhat.com, seanjc@google.com, Zhao, Yan Y
  Cc: Shutemov, Kirill, quic_eberman@quicinc.com, Li, Xiaoyao,
	kvm@vger.kernel.org, Hansen, Dave, david@redhat.com,
	thomas.lendacky@amd.com, tabba@google.com, Li, Zhiquan1, Du, Fan,
	linux-kernel@vger.kernel.org, michael.roth@amd.com, Weiny, Ira,
	vbabka@suse.cz, binbin.wu@linux.intel.com, ackerleytng@google.com,
	Yamahata, Isaku, Peng, Chao P, Annapurve, Vishal, jroedel@suse.de,
	Miao, Jun, pgonda@google.com, x86@kernel.org

On Thu, 2025-04-24 at 11:08 +0800, Yan Zhao wrote:
> +static int kvm_gmem_invalidate_begin(struct kvm_gmem *gmem, pgoff_t start,
> +				     pgoff_t end, bool need_split)
>  {
>  	bool flush = false, found_memslot = false;
>  	struct kvm_memory_slot *slot;
>  	struct kvm *kvm = gmem->kvm;
>  	unsigned long index;
> +	int ret = 0;
>  
>  	xa_for_each_range(&gmem->bindings, index, slot, start, end - 1) {
>  		pgoff_t pgoff = slot->gmem.pgoff;
> @@ -319,14 +320,23 @@ static void kvm_gmem_invalidate_begin(struct kvm_gmem *gmem, pgoff_t start,
>  			kvm_mmu_invalidate_begin(kvm);
>  		}
>  
> +		if (need_split) {
> +			ret = kvm_split_boundary_leafs(kvm, &gfn_range);

What is the effect on other guest_memfd users? SEV doesn't need this, right? Oh
I see, down in tdp_mmu_split_boundary_leafs() it bails on non-mirror roots. I
don't like the naming then. It sounds deterministic, but it really only does the
splits that are necessary for certain VM types.

I guess it all depends on how well teaching kvm_mmu_unmap_gfn_range() to fail
goes. But otherwise, we should call it something like kvm_prepare_zap_range().
And have it clearly do nothing for non-TDX, high up where it's easy to see.

> +			if (ret < 0)
> +				goto out;
> +
> +			flush |= ret;
> +		}
>  		flush |= kvm_mmu_unmap_gfn_range(kvm, &gfn_range);
>  	}
>  
> +out:
>  	if (flush)
>  		kvm_flush_remote_tlbs(kvm);
>  
>  	if (found_memslot)
>  		KVM_MMU_UNLOCK(kvm);
> +	


^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 14/21] KVM: x86/tdp_mmu: Invoke split_external_spt hook with exclusive mmu_lock
  2025-04-24  3:07 ` [RFC PATCH 14/21] KVM: x86/tdp_mmu: Invoke split_external_spt hook with exclusive mmu_lock Yan Zhao
@ 2025-05-13 23:06   ` Edgecombe, Rick P
  2025-05-16  9:17     ` Yan Zhao
  2025-05-20  5:40   ` Binbin Wu
  1 sibling, 1 reply; 294+ messages in thread
From: Edgecombe, Rick P @ 2025-05-13 23:06 UTC (permalink / raw)
  To: pbonzini@redhat.com, seanjc@google.com, Zhao, Yan Y
  Cc: Shutemov, Kirill, quic_eberman@quicinc.com, Li, Xiaoyao,
	kvm@vger.kernel.org, Hansen, Dave, david@redhat.com,
	thomas.lendacky@amd.com, tabba@google.com, Li, Zhiquan1, Du, Fan,
	linux-kernel@vger.kernel.org, michael.roth@amd.com, Weiny, Ira,
	vbabka@suse.cz, binbin.wu@linux.intel.com, ackerleytng@google.com,
	Yamahata, Isaku, Peng, Chao P, Annapurve, Vishal, jroedel@suse.de,
	Miao, Jun, pgonda@google.com, x86@kernel.org

On Thu, 2025-04-24 at 11:07 +0800, Yan Zhao wrote:
> +static int split_external_spt(struct kvm *kvm, gfn_t gfn, u64 old_spte,
> +			      u64 new_spte, int level)
> +{
> +	void *external_spt = get_external_spt(gfn, new_spte, level);
> +	int ret;
> +
> +	KVM_BUG_ON(!external_spt, kvm);
> +
> +	ret = static_call(kvm_x86_split_external_spt)(kvm, gfn, level, external_spt);
> +	KVM_BUG_ON(ret, kvm);

Shouldn't this BUG_ON be handled in the split_external_spt implementation? I
don't think we need another one.

> +
> +	return ret;
> +}
>  /**
>   * handle_removed_pt() - handle a page table removed from the TDP structure
>   *
> @@ -764,13 +778,13 @@ static u64 tdp_mmu_set_spte(struct kvm *kvm, int as_id, tdp_ptep_t sptep,
>  
>  	handle_changed_spte(kvm, as_id, gfn, old_spte, new_spte, level, false);
>  
> -	/*
> -	 * Users that do non-atomic setting of PTEs don't operate on mirror
> -	 * roots, so don't handle it and bug the VM if it's seen.
> -	 */
>  	if (is_mirror_sptep(sptep)) {
> -		KVM_BUG_ON(is_shadow_present_pte(new_spte), kvm);
> -		remove_external_spte(kvm, gfn, old_spte, level);
> +		if (!is_shadow_present_pte(new_spte))
> +			remove_external_spte(kvm, gfn, old_spte, level);
> +		else if (is_last_spte(old_spte, level) && !is_last_spte(new_spte, level))
> +			split_external_spt(kvm, gfn, old_spte, new_spte, level);
> +		else
> +			KVM_BUG_ON(1, kvm);

It might be worth a comment on what this is checking for at this point. I think
it's that external EPT only supports certain operations, so bug the VM if any
unsupported operation is seen.
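
e.g. something like (sketch of the suggested comment):

	if (is_mirror_sptep(sptep)) {
		/*
		 * External page tables only support a few non-atomic SPTE
		 * transitions: zapping (present -> non-present) and huge page
		 * splitting (huge leaf -> non-leaf). Bug the VM if anything
		 * else is seen.
		 */
		...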

>  	}
>  
>  	return old_spte;


^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 20/21] KVM: x86: Force a prefetch fault's max mapping level to 4KB for TDX
  2025-04-24  3:09 ` [RFC PATCH 20/21] KVM: x86: Force a prefetch fault's max mapping level to 4KB for TDX Yan Zhao
@ 2025-05-13 23:20   ` Edgecombe, Rick P
  2025-05-16  8:43     ` Yan Zhao
  2025-05-21  3:30   ` Binbin Wu
  1 sibling, 1 reply; 294+ messages in thread
From: Edgecombe, Rick P @ 2025-05-13 23:20 UTC (permalink / raw)
  To: pbonzini@redhat.com, seanjc@google.com, Zhao, Yan Y
  Cc: Shutemov, Kirill, quic_eberman@quicinc.com, Li, Xiaoyao,
	kvm@vger.kernel.org, Hansen, Dave, david@redhat.com,
	thomas.lendacky@amd.com, tabba@google.com, Li, Zhiquan1, Du, Fan,
	linux-kernel@vger.kernel.org, michael.roth@amd.com, Weiny, Ira,
	vbabka@suse.cz, binbin.wu@linux.intel.com, ackerleytng@google.com,
	Yamahata, Isaku, Peng, Chao P, Annapurve, Vishal, jroedel@suse.de,
	Miao, Jun, pgonda@google.com, x86@kernel.org

On Thu, 2025-04-24 at 11:09 +0800, Yan Zhao wrote:
> Introduce a "prefetch" parameter to the private_max_mapping_level hook and
> enforce the max mapping level of a prefetch fault for private memory to be
> 4KB. This is a preparation to enable the ignoring huge page splitting in
> the fault path.
> 
> If a prefetch fault results in a 2MB huge leaf in the mirror page table,
> there may not be a vCPU available to accept the corresponding 2MB huge leaf
> in the S-EPT if the TD is not configured to receive #VE for page
> acceptance. 
> 

Can you elaborate on this case more? A vCPU may not be available? What does that
mean?

> Consequently, if a vCPU accepts the page at 4KB level, it will
> trigger an EPT violation to split the 2MB huge leaf generated by the
> prefetch fault.

The case is KVM_PRE_FAULT_MEMORY faults in 2MB, then guest accepts at 4k (which
it is not supposed to do)?

Then maybe the kvm_vm_dead() case I suggested in the other patch could handle
this case too, and this patch could be dropped?

> 
> Since handling the BUSY error from SEAMCALLs for huge page splitting is
> more comprehensive in the fault path, which is with kvm->mmu_lock held for
> reading, force the max mapping level of a prefetch fault of private memory
> to be 4KB to prevent potential splitting.


^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 02/21] x86/virt/tdx: Enhance tdh_mem_page_aug() to support huge pages
  2025-04-24  3:04 ` [RFC PATCH 02/21] x86/virt/tdx: Enhance tdh_mem_page_aug() to support huge pages Yan Zhao
                     ` (2 preceding siblings ...)
  2025-05-13 18:52   ` Edgecombe, Rick P
@ 2025-05-15  2:16   ` Chao Gao
  2025-05-16  9:07     ` Yan Zhao
  2025-07-08  8:48   ` Yan Zhao
  4 siblings, 1 reply; 294+ messages in thread
From: Chao Gao @ 2025-05-15  2:16 UTC (permalink / raw)
  To: Yan Zhao
  Cc: pbonzini, seanjc, linux-kernel, kvm, x86, rick.p.edgecombe,
	dave.hansen, kirill.shutemov, tabba, ackerleytng, quic_eberman,
	michael.roth, david, vannapurve, vbabka, jroedel, thomas.lendacky,
	pgonda, zhiquan1.li, fan.du, jun.miao, ira.weiny, isaku.yamahata,
	xiaoyao.li, binbin.wu, chao.p.peng

On Thu, Apr 24, 2025 at 11:04:28AM +0800, Yan Zhao wrote:
>Enhance the SEAMCALL wrapper tdh_mem_page_aug() to support huge pages.
>
>Verify the validity of the level and ensure that the mapping range is fully
>contained within the page folio.
>
>As a conservative solution, perform CLFLUSH on all pages to be mapped into
>the TD before invoking the SEAMCALL TDH_MEM_PAGE_AUG. This ensures that any
>dirty cache lines do not write back later and clobber TD memory.
>
>Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
>Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
>Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
>---
> arch/x86/virt/vmx/tdx/tdx.c | 11 ++++++++++-
> 1 file changed, 10 insertions(+), 1 deletion(-)
>
>diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
>index f5e2a937c1e7..a66d501b5677 100644
>--- a/arch/x86/virt/vmx/tdx/tdx.c
>+++ b/arch/x86/virt/vmx/tdx/tdx.c
>@@ -1595,9 +1595,18 @@ u64 tdh_mem_page_aug(struct tdx_td *td, u64 gpa, int level, struct page *page, u
> 		.rdx = tdx_tdr_pa(td),
> 		.r8 = page_to_phys(page),
> 	};
>+	unsigned long nr_pages = 1 << (level * 9);
>+	struct folio *folio = page_folio(page);
>+	unsigned long idx = 0;
> 	u64 ret;
> 
>-	tdx_clflush_page(page);
>+	if (!(level >= TDX_PS_4K && level < TDX_PS_NR) ||
>+	    (folio_page_idx(folio, page) + nr_pages > folio_nr_pages(folio)))
>+		return -EINVAL;

Returning -EINVAL looks incorrect as the return type is u64.
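One option, as a sketch (assuming a TDX status code such as TDX_OPERAND_INVALID
is visible to the wrapper):

	if (!(level >= TDX_PS_4K && level < TDX_PS_NR) ||
	    folio_page_idx(folio, page) + nr_pages > folio_nr_pages(folio))
		return TDX_OPERAND_INVALID;	/* stay in the u64 error space */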

>+
>+	while (nr_pages--)
>+		tdx_clflush_page(nth_page(page, idx++));
>+
> 	ret = seamcall_ret(TDH_MEM_PAGE_AUG, &args);
> 
> 	*ext_err1 = args.rcx;
>-- 
>2.43.2
>

^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge pages
  2025-05-12 16:53                               ` Vishal Annapurve
@ 2025-05-15  3:01                                 ` Yan Zhao
  2025-06-04 20:02                                   ` Ackerley Tng
  0 siblings, 1 reply; 294+ messages in thread
From: Yan Zhao @ 2025-05-15  3:01 UTC (permalink / raw)
  To: Vishal Annapurve
  Cc: pbonzini, seanjc, linux-kernel, kvm, x86, rick.p.edgecombe,
	dave.hansen, kirill.shutemov, tabba, ackerleytng, quic_eberman,
	michael.roth, david, vbabka, jroedel, thomas.lendacky, pgonda,
	zhiquan1.li, fan.du, jun.miao, ira.weiny, isaku.yamahata,
	xiaoyao.li, binbin.wu, chao.p.peng

On Mon, May 12, 2025 at 09:53:43AM -0700, Vishal Annapurve wrote:
> On Sun, May 11, 2025 at 7:18 PM Yan Zhao <yan.y.zhao@intel.com> wrote:
> > ...
> > >
> > > I might be wrongly throwing out some terminologies here then.
> > > VM_PFNMAP flag can be set for memory backed by folios/page structs.
> > > udmabuf seems to be working with pinned "folios" in the backend.
> > >
> > > The goal is to get to a stage where guest_memfd is backed by pfn
> > > ranges unmanaged by kernel that guest_memfd owns and distributes to
> > > userspace, KVM, IOMMU subject to shareability attributes. if the
> > OK. So from the point of view of the rest of the kernel, those pfns are not
> > regarded as memory.
> >
> > > shareability changes, the users will get notified and will have to
> > > invalidate their mappings. guest_memfd will allow mmaping such ranges
> > > with VM_PFNMAP flag set by default in the VMAs to indicate the need of
> > > special handling/lack of page structs.
> > My concern is that a failable invalidation notifier may not be ideal.
> > Instead of relying on ref counts (or other mechanisms) to determine whether to
> > start shareability changes, with a failable invalidation notifier, some users
> > may fail the invalidation and the shareability change, even after other users
> > have successfully unmapped a range.
> 
> Even if one user fails to invalidate its mappings, I don't see a
> reason to go ahead with shareability change. Shareability should not
> change unless all existing users let go of their soon-to-be-invalid
> view of memory.
My thinking is that:

1. guest_memfd starts shared-to-private conversion
2. guest_memfd sends invalidation notifications
   2.1 invalidate notification --> A --> Unmap and return success
   2.2 invalidate notification --> B --> Unmap and return success
   2.3 invalidate notification --> C --> return failure
3. guest_memfd finds 2.3 fails, fails shared-to-private conversion and keeps
   shareability as shared

Though the GFN remains shared after step 3, it has already been unmapped in
users A and B in steps 2.1 and 2.2. Even if additional notifications could be
sent to A and B to ask them to map the GFN back, the map operation might fail.
Consequently, A and B might
not be able to restore the mapped status of the GFN. For IOMMU mappings, this
could result in DMAR failure following a failed attempt to do shared-to-private
conversion.

I noticed Ackerley has posted the series. Will check there later.

> >
> > Auditing whether multiple users of shared memory correctly perform unmapping is
> > harder than auditing reference counts.
> >
> > > private memory backed by page structs and use a special "filemap" to
> > > map file offsets to these private memory ranges. This step will also
> > > need similar contract with users -
> > >    1) memory is pinned by guest_memfd
> > >    2) users will get invalidation notifiers on shareability changes
> > >
> > > I am sure there is a lot of work here and many quirks to be addressed,
> > > let's discuss this more with better context around. A few related RFC
> > > series are planned to be posted in the near future.
> > Ok. Thanks for your time and discussions :)
> > ...

^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 03/21] x86/virt/tdx: Add SEAMCALL wrapper tdh_mem_page_demote()
  2025-05-13 18:19   ` Edgecombe, Rick P
@ 2025-05-15  8:26     ` Yan Zhao
  2025-05-15 17:28       ` Edgecombe, Rick P
  0 siblings, 1 reply; 294+ messages in thread
From: Yan Zhao @ 2025-05-15  8:26 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: pbonzini@redhat.com, seanjc@google.com, Shutemov, Kirill,
	quic_eberman@quicinc.com, Li, Xiaoyao, kvm@vger.kernel.org,
	Hansen, Dave, david@redhat.com, thomas.lendacky@amd.com,
	tabba@google.com, Li, Zhiquan1, Du, Fan,
	linux-kernel@vger.kernel.org, michael.roth@amd.com, Weiny, Ira,
	vbabka@suse.cz, binbin.wu@linux.intel.com, ackerleytng@google.com,
	Yamahata, Isaku, Peng, Chao P, Annapurve, Vishal, jroedel@suse.de,
	Miao, Jun, pgonda@google.com, x86@kernel.org

On Wed, May 14, 2025 at 02:19:56AM +0800, Edgecombe, Rick P wrote:
> On Thu, 2025-04-24 at 11:04 +0800, Yan Zhao wrote:
> > From: Xiaoyao Li <xiaoyao.li@intel.com>
> > 
> > Add a wrapper tdh_mem_page_demote() to invoke SEAMCALL TDH_MEM_PAGE_DEMOTE
> > to demote a huge leaf entry to a non-leaf entry in S-EPT. Currently, the
> > TDX module only supports demotion of a 2M huge leaf entry. After a
> > successful demotion, the old 2M huge leaf entry in S-EPT is replaced with a
> > non-leaf entry, linking to the newly-added page table page. The newly
> > linked page table page then contains 512 leaf entries, pointing to the 2M
> > guest private pages.
> > 
> > The "gpa" and "level" direct the TDX module to search and find the old
> > huge leaf entry.
> > 
> > As the new non-leaf entry points to a page table page, callers need to
> > pass in the page table page in parameter "page".
> > 
> > In case of S-EPT walk failure, the entry, level and state where the error
> > was detected are returned in ext_err1 and ext_err2.
> > 
> > On interrupt pending, SEAMCALL TDH_MEM_PAGE_DEMOTE returns error
> > TDX_INTERRUPTED_RESTARTABLE.
> > 
> > [Yan: Rebased and split patch, wrote changelog]
> 
> We should add the level of detail here like we did for the base series ones.
I'll provide changelog details under "---" of each patch in the next version.
 
> > 
> > Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
> > Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> > Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
> > ---
> >  arch/x86/include/asm/tdx.h  |  2 ++
> >  arch/x86/virt/vmx/tdx/tdx.c | 20 ++++++++++++++++++++
> >  arch/x86/virt/vmx/tdx/tdx.h |  1 +
> >  3 files changed, 23 insertions(+)
> > 
> > diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
> > index 26ffc792e673..08eff4b2f5e7 100644
> > --- a/arch/x86/include/asm/tdx.h
> > +++ b/arch/x86/include/asm/tdx.h
> > @@ -177,6 +177,8 @@ u64 tdh_mng_key_config(struct tdx_td *td);
> >  u64 tdh_mng_create(struct tdx_td *td, u16 hkid);
> >  u64 tdh_vp_create(struct tdx_td *td, struct tdx_vp *vp);
> >  u64 tdh_mng_rd(struct tdx_td *td, u64 field, u64 *data);
> > +u64 tdh_mem_page_demote(struct tdx_td *td, u64 gpa, int level, struct page *page,
> > +			u64 *ext_err1, u64 *ext_err2);
> >  u64 tdh_mr_extend(struct tdx_td *td, u64 gpa, u64 *ext_err1, u64 *ext_err2);
> >  u64 tdh_mr_finalize(struct tdx_td *td);
> >  u64 tdh_vp_flush(struct tdx_vp *vp);
> > diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> > index a66d501b5677..5699dfe500d9 100644
> > --- a/arch/x86/virt/vmx/tdx/tdx.c
> > +++ b/arch/x86/virt/vmx/tdx/tdx.c
> > @@ -1684,6 +1684,26 @@ u64 tdh_mng_rd(struct tdx_td *td, u64 field, u64 *data)
> >  }
> >  EXPORT_SYMBOL_GPL(tdh_mng_rd);
> >  
> > +u64 tdh_mem_page_demote(struct tdx_td *td, u64 gpa, int level, struct page *page,
> > +			u64 *ext_err1, u64 *ext_err2)
> > +{
> > +	struct tdx_module_args args = {
> > +		.rcx = gpa | level,
> 
> This will only ever be level 2MB, how about dropping the arg?
Do you mean hardcoding level to be 2MB in tdh_mem_page_demote()?

The SEAMCALL TDH_MEM_PAGE_DEMOTE supports level of 1GB in current TDX module.

> > +		.rdx = tdx_tdr_pa(td),
> > +		.r8 = page_to_phys(page),
> > +	};
> > +	u64 ret;
> > +
> > +	tdx_clflush_page(page);
> > +	ret = seamcall_ret(TDH_MEM_PAGE_DEMOTE, &args);
> > +
> > +	*ext_err1 = args.rcx;
> > +	*ext_err2 = args.rdx;
> 
> How about we just call these entry and level_state, like the caller.
Not sure, but I feel that ext_err* might be better, because
- according to the spec,
  a) the args.rcx, args.rdx is unmodified (i.e. still hold the passed-in value)
     in case of error TDX_INTERRUPTED_RESTARTABLE.
  b) args.rcx, args.rdx can only be interpreted as entry and level_state in case
     of EPT walk error.
  c) in other cases, they are 0.
- consistent with tdh_mem_page_aug(), tdh_mem_range_block()...


> > +
> > +	return ret;
> > +}
> > +EXPORT_SYMBOL_GPL(tdh_mem_page_demote);
> 
> Looking in the docs, TDX module gives some somewhat constrained guidance:
> 1. TDH.MEM.PAGE.DEMOTE should be invoked in a loop until it terminates
> successfully.
> 2. The host VMM should be designed to avoid cases where interrupt storms prevent
> successful completion of TDH.MEM.PAGE.DEMOTE.
> 
> The caller looks like:
> 	do {
> 		err = tdh_mem_page_demote(&kvm_tdx->td, gpa, tdx_level, page,
> 					  &entry, &level_state);
> 	} while (err == TDX_INTERRUPTED_RESTARTABLE);
> 
> 	if (unlikely(tdx_operand_busy(err))) {
> 		tdx_no_vcpus_enter_start(kvm);
> 		err = tdh_mem_page_demote(&kvm_tdx->td, gpa, tdx_level, page,
> 					  &entry, &level_state);
> 		tdx_no_vcpus_enter_stop(kvm);
> 	}
> 
> The brute force second case could also be subjected to a
> TDX_INTERRUPTED_RESTARTABLE and is not handled. As for interrupt storms, I guess
You are right.

> we could disable interrupts while we do the second brute force case. So the
> TDX_INTERRUPTED_RESTARTABLE loop could have a max retries, and the brute force
> case could also disable interrupts.
Good idea.

> Hmm, how to pick the max retries count. It's a tradeoff between interrupt
> latency and DOS/code complexity. Do we have any idea how long demote might take?
I did a brief test on my SPR, where the host was not busy :
tdh_mem_page_demote() was called 142 times, with each invocation taking around
10us.
2 invocations were due to TDX_INTERRUPTED_RESTARTABLE.
For each GFN, at most 1 retry was performed.

I will do more investigations.
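To make the combined suggestion concrete, a rough sketch of a bounded retry
loop plus an interrupts-off fallback (MAX_DEMOTE_RETRIES and the variable names
are only illustrative):

	int retries = 0;
	unsigned long flags;

	do {
		err = tdh_mem_page_demote(&kvm_tdx->td, gpa, tdx_level, page,
					  &entry, &level_state);
	} while (err == TDX_INTERRUPTED_RESTARTABLE &&
		 ++retries < MAX_DEMOTE_RETRIES);

	if (unlikely(err == TDX_INTERRUPTED_RESTARTABLE)) {
		/* Last resort: keep interrupts off so the demote can finish. */
		local_irq_save(flags);
		do {
			err = tdh_mem_page_demote(&kvm_tdx->td, gpa, tdx_level,
						  page, &entry, &level_state);
		} while (err == TDX_INTERRUPTED_RESTARTABLE);
		local_irq_restore(flags);
	}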

^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 04/21] KVM: TDX: Enforce 4KB mapping level during TD build Time
  2025-05-13 19:12   ` Edgecombe, Rick P
@ 2025-05-15  9:16     ` Yan Zhao
  2025-05-15 17:32       ` Edgecombe, Rick P
  0 siblings, 1 reply; 294+ messages in thread
From: Yan Zhao @ 2025-05-15  9:16 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: pbonzini@redhat.com, seanjc@google.com, Shutemov, Kirill,
	quic_eberman@quicinc.com, Li, Xiaoyao, kvm@vger.kernel.org,
	Hansen, Dave, david@redhat.com, thomas.lendacky@amd.com,
	tabba@google.com, Li, Zhiquan1, Du, Fan,
	linux-kernel@vger.kernel.org, michael.roth@amd.com, Weiny, Ira,
	vbabka@suse.cz, binbin.wu@linux.intel.com, ackerleytng@google.com,
	Yamahata, Isaku, Peng, Chao P, Annapurve, Vishal, jroedel@suse.de,
	Miao, Jun, pgonda@google.com, x86@kernel.org

On Wed, May 14, 2025 at 03:12:10AM +0800, Edgecombe, Rick P wrote:
> On Thu, 2025-04-24 at 11:05 +0800, Yan Zhao wrote:
> > During the TD build phase (i.e., before the TD becomes RUNNABLE), enforce a
> > 4KB mapping level both in the S-EPT managed by the TDX module and the
> > mirror page table managed by KVM.
> > 
> > During this phase, TD's memory is added via tdh_mem_page_add(), which only
> > accepts 4KB granularity. Therefore, return PG_LEVEL_4K in TDX's
> > .private_max_mapping_level hook to ensure KVM maps at the 4KB level in the
> > mirror page table. Meanwhile, iterate over each 4KB page of a large gmem
> > backend page in tdx_gmem_post_populate() and invoke tdh_mem_page_add() to
> > map at the 4KB level in the S-EPT.
> > 
> > Still allow huge pages in the gmem backend during TD build time. Based on [1],
> > the gmem series that allows 2MB THP and non-in-place conversion, pass in
> > region.nr_pages to kvm_gmem_populate() in tdx_vcpu_init_mem_region().
> > 
> 
> This commit log will need to be written with upstream in mind when it is out of
> RFC.
Ok.

 
> >  This
> > enables kvm_gmem_populate() to allocate huge pages from the gmem backend
> > when the remaining nr_pages, GFN alignment, and page private/shared
> > attribute permit.  KVM is then able to promote the initial 4K mapping to
> > huge after TD is RUNNABLE.
> > 
> > Disallow any private huge pages during TD build time. Use BUG_ON() in
> > tdx_mem_page_record_premap_cnt() and tdx_is_sept_zap_err_due_to_premap() to
> > assert the mapping level is 4KB.
> > 
> > Opportunistically, remove unused parameters in
> > tdx_mem_page_record_premap_cnt().
> > 
> > Link: https://lore.kernel.org/all/20241212063635.712877-1-michael.roth@amd.com [1]
> > Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
> > ---
> >  arch/x86/kvm/vmx/tdx.c | 45 ++++++++++++++++++++++++++++--------------
> >  1 file changed, 30 insertions(+), 15 deletions(-)
> > 
> > diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> > index 98cde20f14da..03885cb2869b 100644
> > --- a/arch/x86/kvm/vmx/tdx.c
> > +++ b/arch/x86/kvm/vmx/tdx.c
> > @@ -1530,14 +1530,16 @@ static int tdx_mem_page_aug(struct kvm *kvm, gfn_t gfn,
> >   * The counter has to be zero on KVM_TDX_FINALIZE_VM, to ensure that there
> >   * are no half-initialized shared EPT pages.
> >   */
> > -static int tdx_mem_page_record_premap_cnt(struct kvm *kvm, gfn_t gfn,
> > -					  enum pg_level level, kvm_pfn_t pfn)
> > +static int tdx_mem_page_record_premap_cnt(struct kvm *kvm, enum pg_level level)
> >  {
> >  	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
> >  
> >  	if (KVM_BUG_ON(kvm->arch.pre_fault_allowed, kvm))
> >  		return -EINVAL;
> >  
> > +	if (KVM_BUG_ON(level != PG_LEVEL_4K, kvm))
> > +		return -EINVAL;
> > +
> >  	/* nr_premapped will be decreased when tdh_mem_page_add() is called. */
> >  	atomic64_inc(&kvm_tdx->nr_premapped);
> >  	return 0;
> > @@ -1571,7 +1573,7 @@ int tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn,
> >  	if (likely(kvm_tdx->state == TD_STATE_RUNNABLE))
> >  		return tdx_mem_page_aug(kvm, gfn, level, page);
> >  
> > -	return tdx_mem_page_record_premap_cnt(kvm, gfn, level, pfn);
> > +	return tdx_mem_page_record_premap_cnt(kvm, level);
> >  }
> >  
> >  static int tdx_sept_drop_private_spte(struct kvm *kvm, gfn_t gfn,
> > @@ -1666,7 +1668,7 @@ int tdx_sept_link_private_spt(struct kvm *kvm, gfn_t gfn,
> >  static int tdx_is_sept_zap_err_due_to_premap(struct kvm_tdx *kvm_tdx, u64 err,
> >  					     u64 entry, int level)
> >  {
> > -	if (!err || kvm_tdx->state == TD_STATE_RUNNABLE)
> > +	if (!err || kvm_tdx->state == TD_STATE_RUNNABLE || level > PG_LEVEL_4K)
> >  		return false;
> 
> This is catching zapping huge pages before the TD is runnable? Is it necessary
> if we are already warning about mapping huge pages before the TD is runnable in
> tdx_mem_page_record_premap_cnt()?
Under normal conditions, this check isn't necessary.
I added this check to guard against bugs in the KVM core MMU where the mirror
page table might be updated to a huge mapping without notifying the TDX side.
Am I overthinking?


> >  	if (err != (TDX_EPT_ENTRY_STATE_INCORRECT | TDX_OPERAND_ID_RCX))
> > @@ -3052,8 +3054,8 @@ struct tdx_gmem_post_populate_arg {
> >  	__u32 flags;
> >  };
> >  
> > -static int tdx_gmem_post_populate(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
> > -				  void __user *src, int order, void *_arg)
> > +static int tdx_gmem_post_populate_4k(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
> > +				     void __user *src, void *_arg)
> >  {
> >  	u64 error_code = PFERR_GUEST_FINAL_MASK | PFERR_PRIVATE_ACCESS;
> >  	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
> > @@ -3120,6 +3122,21 @@ static int tdx_gmem_post_populate(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
> >  	return ret;
> >  }
> >  
> > +static int tdx_gmem_post_populate(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
> > +				  void __user *src, int order, void *_arg)
> > +{
> > +	unsigned long i, npages = 1 << order;
> > +	int ret;
> > +
> > +	for (i = 0; i < npages; i++) {
> > +		ret = tdx_gmem_post_populate_4k(kvm, gfn + i, pfn + i,
> > +						src + i * PAGE_SIZE, _arg);
> > +		if (ret)
> > +			return ret;
> > +	}
> > +	return 0;
> > +}
> > +
> >  static int tdx_vcpu_init_mem_region(struct kvm_vcpu *vcpu, struct kvm_tdx_cmd *cmd)
> >  {
> >  	struct vcpu_tdx *tdx = to_tdx(vcpu);
> > @@ -3166,20 +3183,15 @@ static int tdx_vcpu_init_mem_region(struct kvm_vcpu *vcpu, struct kvm_tdx_cmd *c
> >  		};
> >  		gmem_ret = kvm_gmem_populate(kvm, gpa_to_gfn(region.gpa),
> >  					     u64_to_user_ptr(region.source_addr),
> > -					     1, tdx_gmem_post_populate, &arg);
> > +					     region.nr_pages, tdx_gmem_post_populate, &arg);
> >  		if (gmem_ret < 0) {
> >  			ret = gmem_ret;
> >  			break;
> >  		}
> >  
> > -		if (gmem_ret != 1) {
This line is removed.

> > -			ret = -EIO;
> > -			break;
> > -		}
> > -
> > -		region.source_addr += PAGE_SIZE;
> > -		region.gpa += PAGE_SIZE;
> > -		region.nr_pages--;
> > +		region.source_addr += PAGE_SIZE * gmem_ret;
> 
> gmem_ret has to be 1, per the above conditional.
As region.nr_pages instead of 1 is passed into kvm_gmem_populate(), gmem_ret
can now be greater than 1.

kvm_gmem_populate() can allocate huge backend pages if region.nr_pages, GFN
alignment, and shareability permit.

> > +		region.gpa += PAGE_SIZE * gmem_ret;
> > +		region.nr_pages -= gmem_ret;
> >  
> >  		cond_resched();
> >  	}
> > @@ -3224,6 +3236,9 @@ int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp)
> >  
> >  int tdx_gmem_private_max_mapping_level(struct kvm *kvm, kvm_pfn_t pfn)
> >  {
> > +	if (unlikely(to_kvm_tdx(kvm)->state != TD_STATE_RUNNABLE))
> > +		return PG_LEVEL_4K;
> > +
> >  	return PG_LEVEL_4K;
> 
> ^ Change does nothing...
Right. Patch 9 will update the default level to PG_LEVEL_2M.

The change here is meant to highlight PG_LEVEL_4K is enforced in
tdx_gmem_private_max_mapping_level() when TD is not in TD_STATE_RUNNABLE state.

Will change it in the next version.

> >  }
> >  
> 

^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 03/21] x86/virt/tdx: Add SEAMCALL wrapper tdh_mem_page_demote()
  2025-05-15  8:26     ` Yan Zhao
@ 2025-05-15 17:28       ` Edgecombe, Rick P
  2025-05-16  2:23         ` Yan Zhao
  2025-07-01 21:15         ` Edgecombe, Rick P
  0 siblings, 2 replies; 294+ messages in thread
From: Edgecombe, Rick P @ 2025-05-15 17:28 UTC (permalink / raw)
  To: Zhao, Yan Y
  Cc: kvm@vger.kernel.org, Li, Xiaoyao, quic_eberman@quicinc.com,
	Hansen, Dave, david@redhat.com, Li, Zhiquan1, tabba@google.com,
	vbabka@suse.cz, thomas.lendacky@amd.com, michael.roth@amd.com,
	seanjc@google.com, Weiny, Ira, linux-kernel@vger.kernel.org,
	pbonzini@redhat.com, ackerleytng@google.com, Yamahata, Isaku,
	binbin.wu@linux.intel.com, Peng, Chao P, Du, Fan,
	Annapurve, Vishal, jroedel@suse.de, Miao, Jun, Shutemov, Kirill,
	pgonda@google.com, x86@kernel.org

On Thu, 2025-05-15 at 16:26 +0800, Yan Zhao wrote:
> On Wed, May 14, 2025 at 02:19:56AM +0800, Edgecombe, Rick P wrote:
> > On Thu, 2025-04-24 at 11:04 +0800, Yan Zhao wrote:
> > > From: Xiaoyao Li <xiaoyao.li@intel.com>
> > > 
> > > Add a wrapper tdh_mem_page_demote() to invoke SEAMCALL TDH_MEM_PAGE_DEMOTE
> > > to demote a huge leaf entry to a non-leaf entry in S-EPT. Currently, the
> > > TDX module only supports demotion of a 2M huge leaf entry. After a
> > > successful demotion, the old 2M huge leaf entry in S-EPT is replaced with a
> > > non-leaf entry, linking to the newly-added page table page. The newly
> > > linked page table page then contains 512 leaf entries, pointing to the 2M
> > > guest private pages.
> > > 
> > > The "gpa" and "level" direct the TDX module to search and find the old
> > > huge leaf entry.
> > > 
> > > As the new non-leaf entry points to a page table page, callers need to
> > > pass in the page table page in parameter "page".
> > > 
> > > In case of S-EPT walk failure, the entry, level and state where the error
> > > was detected are returned in ext_err1 and ext_err2.
> > > 
> > > On interrupt pending, SEAMCALL TDH_MEM_PAGE_DEMOTE returns error
> > > TDX_INTERRUPTED_RESTARTABLE.
> > > 
> > > [Yan: Rebased and split patch, wrote changelog]
> > 
> > We should add the level of detail here like we did for the base series ones.
> I'll provide changelog details under "---" of each patch in the next version.

I mean the commit log (above the "---") needs the same tip style treatment as
the other SEAMCALL wrapper patches.

>  
> > > 
> > > Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
> > > Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> > > Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
> > > ---
> > >  arch/x86/include/asm/tdx.h  |  2 ++
> > >  arch/x86/virt/vmx/tdx/tdx.c | 20 ++++++++++++++++++++
> > >  arch/x86/virt/vmx/tdx/tdx.h |  1 +
> > >  3 files changed, 23 insertions(+)
> > > 
> > > diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
> > > index 26ffc792e673..08eff4b2f5e7 100644
> > > --- a/arch/x86/include/asm/tdx.h
> > > +++ b/arch/x86/include/asm/tdx.h
> > > @@ -177,6 +177,8 @@ u64 tdh_mng_key_config(struct tdx_td *td);
> > >  u64 tdh_mng_create(struct tdx_td *td, u16 hkid);
> > >  u64 tdh_vp_create(struct tdx_td *td, struct tdx_vp *vp);
> > >  u64 tdh_mng_rd(struct tdx_td *td, u64 field, u64 *data);
> > > +u64 tdh_mem_page_demote(struct tdx_td *td, u64 gpa, int level, struct page *page,
> > > +			u64 *ext_err1, u64 *ext_err2);
> > >  u64 tdh_mr_extend(struct tdx_td *td, u64 gpa, u64 *ext_err1, u64 *ext_err2);
> > >  u64 tdh_mr_finalize(struct tdx_td *td);
> > >  u64 tdh_vp_flush(struct tdx_vp *vp);
> > > diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> > > index a66d501b5677..5699dfe500d9 100644
> > > --- a/arch/x86/virt/vmx/tdx/tdx.c
> > > +++ b/arch/x86/virt/vmx/tdx/tdx.c
> > > @@ -1684,6 +1684,26 @@ u64 tdh_mng_rd(struct tdx_td *td, u64 field, u64 *data)
> > >  }
> > >  EXPORT_SYMBOL_GPL(tdh_mng_rd);
> > >  
> > > +u64 tdh_mem_page_demote(struct tdx_td *td, u64 gpa, int level, struct page *page,
> > > +			u64 *ext_err1, u64 *ext_err2)
> > > +{
> > > +	struct tdx_module_args args = {
> > > +		.rcx = gpa | level,
> > 
> > This will only ever be level 2MB, how about dropping the arg?
> Do you mean hardcoding level to be 2MB in tdh_mem_page_demote()?

Yea, we don't support 1GB, so the level arg on the wrapper is superfluous.
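As an illustration (a sketch only; TDX_PS_2M is the existing 2MB level
constant, and whether to hardcode it is exactly the open question here):

	u64 tdh_mem_page_demote(struct tdx_td *td, u64 gpa, struct page *page,
				u64 *ext_err1, u64 *ext_err2)
	{
		struct tdx_module_args args = {
			/* KVM only demotes 2MB leaves today. */
			.rcx = gpa | TDX_PS_2M,
			.rdx = tdx_tdr_pa(td),
			.r8 = page_to_phys(page),
		};
		u64 ret;

		tdx_clflush_page(page);
		ret = seamcall_ret(TDH_MEM_PAGE_DEMOTE, &args);

		*ext_err1 = args.rcx;
		*ext_err2 = args.rdx;

		return ret;
	}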

> 
> The SEAMCALL TDH_MEM_PAGE_DEMOTE supports level of 1GB in current TDX module.
> 
> > > +		.rdx = tdx_tdr_pa(td),
> > > +		.r8 = page_to_phys(page),
> > > +	};
> > > +	u64 ret;
> > > +
> > > +	tdx_clflush_page(page);
> > > +	ret = seamcall_ret(TDH_MEM_PAGE_DEMOTE, &args);
> > > +
> > > +	*ext_err1 = args.rcx;
> > > +	*ext_err2 = args.rdx;
> > 
> > How about we just call these entry and level_state, like the caller.
> Not sure, but I feel that ext_err* might be better, because
> - according to the spec,
>   a) the args.rcx, args.rdx is unmodified (i.e. still hold the passed-in value)
>      in case of error TDX_INTERRUPTED_RESTARTABLE.
>   b) args.rcx, args.rdx can only be interpreted as entry and level_state in case
>      of EPT walk error.
>   c) in other cases, they are 0.
> - consistent with tdh_mem_page_aug(), tdh_mem_range_block()...

Yea, it's consistent. I'm ok leaving it as is.

> 
> 
> > > +
> > > +	return ret;
> > > +}
> > > +EXPORT_SYMBOL_GPL(tdh_mem_page_demote);
> > 
> > Looking in the docs, TDX module gives some somewhat constrained guidance:
> > 1. TDH.MEM.PAGE.DEMOTE should be invoked in a loop until it terminates
> > successfully.
> > 2. The host VMM should be designed to avoid cases where interrupt storms prevent
> > successful completion of TDH.MEM.PAGE.DEMOTE.
> > 
> > The caller looks like:
> > 	do {
> > 		err = tdh_mem_page_demote(&kvm_tdx->td, gpa, tdx_level, page,
> > 					  &entry, &level_state);
> > 	} while (err == TDX_INTERRUPTED_RESTARTABLE);
> > 
> > 	if (unlikely(tdx_operand_busy(err))) {
> > 		tdx_no_vcpus_enter_start(kvm);
> > 		err = tdh_mem_page_demote(&kvm_tdx->td, gpa, tdx_level, page,
> > 					  &entry, &level_state);
> > 		tdx_no_vcpus_enter_stop(kvm);
> > 	}
> > 
> > The brute force second case could also be subjected to a
> > TDX_INTERRUPTED_RESTARTABLE and is not handled. As for interrupt storms, I guess
> You are right.
> 
> > we could disable interrupts while we do the second brute force case. So the
> > TDX_INTERRUPTED_RESTARTABLE loop could have a max retries, and the brute force
> > case could also disable interrupts.
> Good idea.
> 
> > Hmm, how to pick the max retries count. It's a tradeoff between interrupt
> > latency and DOS/code complexity. Do we have any idea how long demote might take?
> I did a brief test on my SPR, where the host was not busy :
> tdh_mem_page_demote() was called 142 times, with each invocation taking around
> 10us.

10us doesn't seem too bad? Makes me think to not loop and instead just do a
single retry with interrupts disabled. We should definitely add the data-based
reasoning to the log.

The counterpoint is that the SEAMCALL must support
TDX_INTERRUPTED_RESTARTABLE for a reason. And the reason probably is that it
sometimes takes longer than what someone thought was reasonable. Maybe we should ask TDX
module folks if there is any history.

> 2 invocations were due to TDX_INTERRUPTED_RESTARTABLE.
> For each GFN, at most 1 retry was performed.
> 
> I will do more investigations.


^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 04/21] KVM: TDX: Enforce 4KB mapping level during TD build Time
  2025-05-15  9:16     ` Yan Zhao
@ 2025-05-15 17:32       ` Edgecombe, Rick P
  2025-05-16 10:05         ` Yan Zhao
  0 siblings, 1 reply; 294+ messages in thread
From: Edgecombe, Rick P @ 2025-05-15 17:32 UTC (permalink / raw)
  To: Zhao, Yan Y
  Cc: kvm@vger.kernel.org, Li, Xiaoyao, quic_eberman@quicinc.com,
	Hansen, Dave, david@redhat.com, Li, Zhiquan1, tabba@google.com,
	vbabka@suse.cz, thomas.lendacky@amd.com, michael.roth@amd.com,
	seanjc@google.com, Weiny, Ira, linux-kernel@vger.kernel.org,
	pbonzini@redhat.com, ackerleytng@google.com, Yamahata, Isaku,
	binbin.wu@linux.intel.com, Peng, Chao P, Du, Fan,
	Annapurve, Vishal, jroedel@suse.de, Miao, Jun, Shutemov, Kirill,
	pgonda@google.com, x86@kernel.org

On Thu, 2025-05-15 at 17:16 +0800, Yan Zhao wrote:
> On Wed, May 14, 2025 at 03:12:10AM +0800, Edgecombe, Rick P wrote:
> > On Thu, 2025-04-24 at 11:05 +0800, Yan Zhao wrote:
> > > During the TD build phase (i.e., before the TD becomes RUNNABLE), enforce a
> > > 4KB mapping level both in the S-EPT managed by the TDX module and the
> > > mirror page table managed by KVM.
> > > 
> > > During this phase, TD's memory is added via tdh_mem_page_add(), which only
> > > accepts 4KB granularity. Therefore, return PG_LEVEL_4K in TDX's
> > > .private_max_mapping_level hook to ensure KVM maps at the 4KB level in the
> > > mirror page table. Meanwhile, iterate over each 4KB page of a large gmem
> > > backend page in tdx_gmem_post_populate() and invoke tdh_mem_page_add() to
> > > map at the 4KB level in the S-EPT.
> > > 
> > > Still allow huge pages in the gmem backend during TD build time. Based on [1],
> > > the gmem series that allows 2MB THP and non-in-place conversion, pass in
> > > region.nr_pages to kvm_gmem_populate() in tdx_vcpu_init_mem_region().
> > > 
> > 
> > This commit log will need to be written with upstream in mind when it is out of
> > RFC.
> Ok.
> 
>  
> > >  This
> > > enables kvm_gmem_populate() to allocate huge pages from the gmem backend
> > > when the remaining nr_pages, GFN alignment, and page private/shared
> > > attribute permit.  KVM is then able to promote the initial 4K mapping to
> > > huge after TD is RUNNABLE.
> > > 
> > > Disallow any private huge pages during TD build time. Use BUG_ON() in
> > > tdx_mem_page_record_premap_cnt() and tdx_is_sept_zap_err_due_to_premap() to
> > > assert the mapping level is 4KB.
> > > 
> > > Opportunistically, remove unused parameters in
> > > tdx_mem_page_record_premap_cnt().
> > > 
> > > Link: https://lore.kernel.org/all/20241212063635.712877-1-michael.roth@amd.com [1]
> > > Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
> > > ---
> > >  arch/x86/kvm/vmx/tdx.c | 45 ++++++++++++++++++++++++++++--------------
> > >  1 file changed, 30 insertions(+), 15 deletions(-)
> > > 
> > > diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> > > index 98cde20f14da..03885cb2869b 100644
> > > --- a/arch/x86/kvm/vmx/tdx.c
> > > +++ b/arch/x86/kvm/vmx/tdx.c
> > > @@ -1530,14 +1530,16 @@ static int tdx_mem_page_aug(struct kvm *kvm, gfn_t gfn,
> > >   * The counter has to be zero on KVM_TDX_FINALIZE_VM, to ensure that there
> > >   * are no half-initialized shared EPT pages.
> > >   */
> > > -static int tdx_mem_page_record_premap_cnt(struct kvm *kvm, gfn_t gfn,
> > > -					  enum pg_level level, kvm_pfn_t pfn)
> > > +static int tdx_mem_page_record_premap_cnt(struct kvm *kvm, enum pg_level level)
> > >  {
> > >  	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
> > >  
> > >  	if (KVM_BUG_ON(kvm->arch.pre_fault_allowed, kvm))
> > >  		return -EINVAL;
> > >  
> > > +	if (KVM_BUG_ON(level != PG_LEVEL_4K, kvm))
> > > +		return -EINVAL;
> > > +
> > >  	/* nr_premapped will be decreased when tdh_mem_page_add() is called. */
> > >  	atomic64_inc(&kvm_tdx->nr_premapped);
> > >  	return 0;
> > > @@ -1571,7 +1573,7 @@ int tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn,
> > >  	if (likely(kvm_tdx->state == TD_STATE_RUNNABLE))
> > >  		return tdx_mem_page_aug(kvm, gfn, level, page);
> > >  
> > > -	return tdx_mem_page_record_premap_cnt(kvm, gfn, level, pfn);
> > > +	return tdx_mem_page_record_premap_cnt(kvm, level);
> > >  }
> > >  
> > >  static int tdx_sept_drop_private_spte(struct kvm *kvm, gfn_t gfn,
> > > @@ -1666,7 +1668,7 @@ int tdx_sept_link_private_spt(struct kvm *kvm, gfn_t gfn,
> > >  static int tdx_is_sept_zap_err_due_to_premap(struct kvm_tdx *kvm_tdx, u64 err,
> > >  					     u64 entry, int level)
> > >  {
> > > -	if (!err || kvm_tdx->state == TD_STATE_RUNNABLE)
> > > +	if (!err || kvm_tdx->state == TD_STATE_RUNNABLE || level > PG_LEVEL_4K)
> > >  		return false;
> > 
> > This is catching zapping huge pages before the TD is runnable? Is it necessary
> > if we are already warning about mapping huge pages before the TD is runnable in
> > tdx_mem_page_record_premap_cnt()?
> Under normal conditions, this check isn't necessary.
> I added this check to guard against bugs in the KVM core MMU where the mirror
> page table might be updated to a huge mapping without notifying the TDX side.
> Am I overthinking?

If we need so many BUG_ON()s maybe our design is too fragile. I think we could
drop this one.

> 
> 
> > >  	if (err != (TDX_EPT_ENTRY_STATE_INCORRECT | TDX_OPERAND_ID_RCX))
> > > @@ -3052,8 +3054,8 @@ struct tdx_gmem_post_populate_arg {
> > >  	__u32 flags;
> > >  };
> > >  
> > > -static int tdx_gmem_post_populate(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
> > > -				  void __user *src, int order, void *_arg)
> > > +static int tdx_gmem_post_populate_4k(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
> > > +				     void __user *src, void *_arg)
> > >  {
> > >  	u64 error_code = PFERR_GUEST_FINAL_MASK | PFERR_PRIVATE_ACCESS;
> > >  	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
> > > @@ -3120,6 +3122,21 @@ static int tdx_gmem_post_populate(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
> > >  	return ret;
> > >  }
> > >  
> > > +static int tdx_gmem_post_populate(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
> > > +				  void __user *src, int order, void *_arg)
> > > +{
> > > +	unsigned long i, npages = 1 << order;
> > > +	int ret;
> > > +
> > > +	for (i = 0; i < npages; i++) {
> > > +		ret = tdx_gmem_post_populate_4k(kvm, gfn + i, pfn + i,
> > > +						src + i * PAGE_SIZE, _arg);
> > > +		if (ret)
> > > +			return ret;
> > > +	}
> > > +	return 0;
> > > +}
> > > +
> > >  static int tdx_vcpu_init_mem_region(struct kvm_vcpu *vcpu, struct kvm_tdx_cmd *cmd)
> > >  {
> > >  	struct vcpu_tdx *tdx = to_tdx(vcpu);
> > > @@ -3166,20 +3183,15 @@ static int tdx_vcpu_init_mem_region(struct kvm_vcpu *vcpu, struct kvm_tdx_cmd *c
> > >  		};
> > >  		gmem_ret = kvm_gmem_populate(kvm, gpa_to_gfn(region.gpa),
> > >  					     u64_to_user_ptr(region.source_addr),
> > > -					     1, tdx_gmem_post_populate, &arg);
> > > +					     region.nr_pages, tdx_gmem_post_populate, &arg);
> > >  		if (gmem_ret < 0) {
> > >  			ret = gmem_ret;
> > >  			break;
> > >  		}
> > >  
> > > -		if (gmem_ret != 1) {
> This line is removed.

Doh! Right.

> 
> > > -			ret = -EIO;
> > > -			break;
> > > -		}
> > > -
> > > -		region.source_addr += PAGE_SIZE;
> > > -		region.gpa += PAGE_SIZE;
> > > -		region.nr_pages--;
> > > +		region.source_addr += PAGE_SIZE * gmem_ret;
> > 
> > gmem_ret has to be 1, per the above conditional.
> As region.nr_pages instead of 1 is passed into kvm_gmem_populate(), gmem_ret
> can now be greater than 1.
> 
> kvm_gmem_populate() can allocate huge backend pages if region.nr_pages, GFN
> alignment, and shareability permit.
> 
> > > +		region.gpa += PAGE_SIZE * gmem_ret;
> > > +		region.nr_pages -= gmem_ret;
> > >  
> > >  		cond_resched();
> > >  	}
> > > @@ -3224,6 +3236,9 @@ int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp)
> > >  
> > >  int tdx_gmem_private_max_mapping_level(struct kvm *kvm, kvm_pfn_t pfn)
> > >  {
> > > +	if (unlikely(to_kvm_tdx(kvm)->state != TD_STATE_RUNNABLE))
> > > +		return PG_LEVEL_4K;
> > > +
> > >  	return PG_LEVEL_4K;
> > 
> > ^ Change does nothing...
> Right. Patch 9 will update the default level to PG_LEVEL_2M.
> 
> The change here is meant to highlight PG_LEVEL_4K is enforced in
> tdx_gmem_private_max_mapping_level() when TD is not in TD_STATE_RUNNABLE state.
> 
> Will change it in the next version.

I can't see the pattern between what goes in this patch vs patch 9. We should
have some reasoning behind it.

> 
> > >  }
> > >  
> > 


^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 09/21] KVM: TDX: Enable 2MB mapping size after TD is RUNNABLE
  2025-05-13 20:10   ` Edgecombe, Rick P
@ 2025-05-16  1:35     ` Huang, Kai
  2025-05-16  9:43       ` Yan Zhao
  2025-05-16  9:28     ` Yan Zhao
  1 sibling, 1 reply; 294+ messages in thread
From: Huang, Kai @ 2025-05-16  1:35 UTC (permalink / raw)
  To: pbonzini@redhat.com, seanjc@google.com, Edgecombe, Rick P,
	Zhao, Yan Y
  Cc: quic_eberman@quicinc.com, Li, Xiaoyao, Shutemov, Kirill,
	Hansen, Dave, david@redhat.com, thomas.lendacky@amd.com,
	vbabka@suse.cz, tabba@google.com, Du, Fan,
	linux-kernel@vger.kernel.org, Li, Zhiquan1, Weiny, Ira,
	michael.roth@amd.com, binbin.wu@linux.intel.com,
	ackerleytng@google.com, Yamahata, Isaku, Peng, Chao P,
	kvm@vger.kernel.org, Annapurve, Vishal, jroedel@suse.de,
	Miao, Jun, pgonda@google.com, x86@kernel.org

On Tue, 2025-05-13 at 20:10 +0000, Edgecombe, Rick P wrote:
> > @@ -3265,7 +3263,7 @@ int tdx_gmem_private_max_mapping_level(struct kvm *kvm, kvm_pfn_t pfn)
> >   	if (unlikely(to_kvm_tdx(kvm)->state != TD_STATE_RUNNABLE))
> >   		return PG_LEVEL_4K;
> >   
> > -	return PG_LEVEL_4K;
> > +	return PG_LEVEL_2M;
> 
> Maybe combine this with patch 4, or split them into sensible categories.

How about merging it with patch 12

  [RFC PATCH 12/21] KVM: TDX: Determine max mapping level according to vCPU's 
  ACCEPT level

instead?

Per patch 12, the fault due to TDH.MEM.PAGE.ACCEPT contains the fault level info, so
KVM should just return that.  But it seems we are still returning PG_LEVEL_2M if no
such info is provided (IIUC):

int tdx_gmem_private_max_mapping_level(struct kvm_vcpu *vcpu, kvm_pfn_t pfn, 
				       gfn_t gfn)
 {
+	struct vcpu_tdx *tdx = to_tdx(vcpu);
+
 	if (unlikely(to_kvm_tdx(vcpu->kvm)->state != TD_STATE_RUNNABLE))
 		return PG_LEVEL_4K;
 
+	if (gfn >= tdx->violation_gfn_start && gfn < tdx->violation_gfn_end)
+		return tdx->violation_request_level;
+
 	return PG_LEVEL_2M;
 }

So why not return PG_LEVEL_4K at the end?

I am asking because of the text below from the cover letter:

    A rare case that could lead to splitting in the fault path is when a TD
    is configured to receive #VE and accesses memory before the ACCEPT
    operation. By the time a vCPU accesses a private GFN, due to the lack
    of any guest preferred level, KVM could create a mapping at 2MB level.
    If the TD then only performs the ACCEPT operation at 4KB level,
    splitting in the fault path will be triggered. However, this is not
    regarded as a typical use case, as usually TD always accepts pages in
    the order from 1GB->2MB->4KB. The worst outcome to ignore the resulting
    splitting request is an endless EPT violation. This would not happen
    for a Linux guest, which does not expect any #VE.

Changing to return PG_LEVEL_4K should avoid this problem.  It doesn't hurt
normal cases either, since the guest will always do ACCEPT (which contains the
accepting level) before accessing the memory.
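In other words, a minimal sketch of the suggestion (it only changes the
fallback of the patch-12 hunk quoted above):

	int tdx_gmem_private_max_mapping_level(struct kvm_vcpu *vcpu, kvm_pfn_t pfn,
					       gfn_t gfn)
	{
		struct vcpu_tdx *tdx = to_tdx(vcpu);

		if (unlikely(to_kvm_tdx(vcpu->kvm)->state != TD_STATE_RUNNABLE))
			return PG_LEVEL_4K;

		/* Map at the level the guest ACCEPTed, when it is known. */
		if (gfn >= tdx->violation_gfn_start && gfn < tdx->violation_gfn_end)
			return tdx->violation_request_level;

		/* No ACCEPT level recorded: stay conservative. */
		return PG_LEVEL_4K;
	}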

^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 05/21] KVM: TDX: Enhance tdx_clear_page() to support huge pages
  2025-05-13 19:17   ` Edgecombe, Rick P
@ 2025-05-16  2:02     ` Yan Zhao
  0 siblings, 0 replies; 294+ messages in thread
From: Yan Zhao @ 2025-05-16  2:02 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: pbonzini@redhat.com, seanjc@google.com, Shutemov, Kirill,
	quic_eberman@quicinc.com, Li, Xiaoyao, kvm@vger.kernel.org,
	Hansen, Dave, david@redhat.com, thomas.lendacky@amd.com,
	tabba@google.com, Li, Zhiquan1, Du, Fan,
	linux-kernel@vger.kernel.org, michael.roth@amd.com, Weiny, Ira,
	vbabka@suse.cz, binbin.wu@linux.intel.com, ackerleytng@google.com,
	Yamahata, Isaku, Peng, Chao P, Annapurve, Vishal, jroedel@suse.de,
	Miao, Jun, pgonda@google.com, x86@kernel.org

On Wed, May 14, 2025 at 03:17:40AM +0800, Edgecombe, Rick P wrote:
> On Thu, 2025-04-24 at 11:05 +0800, Yan Zhao wrote:
> > From: Xiaoyao Li <xiaoyao.li@intel.com>
> > 
> > KVM invokes tdx_clear_page() to zero pages using movdir64b().
> > Include level information to enable tdx_clear_page() to zero a huge page.
> > 
> > [Yan: split out, let tdx_clear_page() accept level]
> > 
> > Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
> > Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> > Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
> > ---
> >  arch/x86/kvm/vmx/tdx.c | 19 ++++++++++++++-----
> >  1 file changed, 14 insertions(+), 5 deletions(-)
> > 
> > diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> > index 03885cb2869b..1186085795ac 100644
> > --- a/arch/x86/kvm/vmx/tdx.c
> > +++ b/arch/x86/kvm/vmx/tdx.c
> > @@ -276,7 +276,7 @@ static inline void tdx_disassociate_vp(struct kvm_vcpu *vcpu)
> >  	vcpu->cpu = -1;
> >  }
> >  
> > -static void tdx_clear_page(struct page *page)
> > +static void __tdx_clear_page(struct page *page)
> >  {
> >  	const void *zero_page = (const void *) page_to_virt(ZERO_PAGE(0));
> >  	void *dest = page_to_virt(page);
> > @@ -295,6 +295,15 @@ static void tdx_clear_page(struct page *page)
> >  	__mb();
> >  }
> >  
> > +static void tdx_clear_page(struct page *page, int level)
> > +{
> > +	unsigned long nr = KVM_PAGES_PER_HPAGE(level);
> > +	unsigned long idx = 0;
> > +
> > +	while (nr--)
> > +		__tdx_clear_page(nth_page(page, idx++));
> 
> You shouldn't need both idx and nr.
> 
> > +}
> 
> Since tdx_clear_page() has a __mb(), it is probably worth checking that this
> generates efficient code, considering the loops within loops pattern.
The concern makes sense!

Will convert level to size and use "for (i = 0; i < size; i += 64)" for
movdir64b().
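Something like the following, as a sketch of that rework (the size derivation
via KVM_PAGES_PER_HPAGE() is an assumption):

	static void tdx_clear_page(struct page *page, int level)
	{
		const void *zero_page = (const void *)page_to_virt(ZERO_PAGE(0));
		void *dest = page_to_virt(page);
		unsigned long size = KVM_PAGES_PER_HPAGE(level) * PAGE_SIZE;
		unsigned long i;

		/* One flat loop instead of a per-4KB loop inside a page loop. */
		for (i = 0; i < size; i += 64)
			movdir64b(dest + i, zero_page);
		/*
		 * MOVDIR64B stores are weakly ordered; make sure they are
		 * globally visible before the page is freed or reused.
		 */
		__mb();
	}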

> > +
> >  static void tdx_no_vcpus_enter_start(struct kvm *kvm)
> >  {
> >  	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
> > @@ -340,11 +349,10 @@ static int tdx_reclaim_page(struct page *page)
> >  
> >  	r = __tdx_reclaim_page(page);
> >  	if (!r)
> > -		tdx_clear_page(page);
> > +		tdx_clear_page(page, PG_LEVEL_4K);
> >  	return r;
> >  }
> >  
> > -
> >  /*
> >   * Reclaim the TD control page(s) which are crypto-protected by TDX guest's
> >   * private KeyID.  Assume the cache associated with the TDX private KeyID has
> > @@ -588,7 +596,7 @@ static void tdx_reclaim_td_control_pages(struct kvm *kvm)
> >  		pr_tdx_error(TDH_PHYMEM_PAGE_WBINVD, err);
> >  		return;
> >  	}
> > -	tdx_clear_page(kvm_tdx->td.tdr_page);
> > +	tdx_clear_page(kvm_tdx->td.tdr_page, PG_LEVEL_4K);
> 
> Why not the __tdx_clear_page() variant? The patch adds it, but doesn't really
> use it. Just implement it all in tdx_clear_page() then.
Ok.

> >  
> >  	__free_page(kvm_tdx->td.tdr_page);
> >  	kvm_tdx->td.tdr_page = NULL;
> > @@ -1621,7 +1629,8 @@ static int tdx_sept_drop_private_spte(struct kvm *kvm, gfn_t gfn,
> >  		pr_tdx_error(TDH_PHYMEM_PAGE_WBINVD, err);
> >  		return -EIO;
> >  	}
> > -	tdx_clear_page(page);
> > +
> > +	tdx_clear_page(page, level);
> >  	tdx_unpin(kvm, page);
> >  	return 0;
> >  }
> 

^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 06/21] KVM: TDX: Assert the reclaimed pages were mapped as expected
  2025-05-13 19:25   ` Edgecombe, Rick P
@ 2025-05-16  2:11     ` Yan Zhao
  2025-05-16 17:34       ` Edgecombe, Rick P
  0 siblings, 1 reply; 294+ messages in thread
From: Yan Zhao @ 2025-05-16  2:11 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: pbonzini@redhat.com, seanjc@google.com, Shutemov, Kirill,
	quic_eberman@quicinc.com, Li, Xiaoyao, kvm@vger.kernel.org,
	Hansen, Dave, david@redhat.com, thomas.lendacky@amd.com,
	tabba@google.com, Li, Zhiquan1, Du, Fan,
	linux-kernel@vger.kernel.org, michael.roth@amd.com, Weiny, Ira,
	vbabka@suse.cz, binbin.wu@linux.intel.com, ackerleytng@google.com,
	Yamahata, Isaku, Peng, Chao P, Annapurve, Vishal, jroedel@suse.de,
	Miao, Jun, pgonda@google.com, x86@kernel.org

On Wed, May 14, 2025 at 03:25:29AM +0800, Edgecombe, Rick P wrote:
> On Thu, 2025-04-24 at 11:05 +0800, Yan Zhao wrote:
> >  /* TDH.PHYMEM.PAGE.RECLAIM is allowed only when destroying the TD. */
> > -static int __tdx_reclaim_page(struct page *page)
> > +static int __tdx_reclaim_page(struct page *page, int level)
> >  {
> >  	u64 err, tdx_pt, tdx_owner, tdx_size;
> >  
> > @@ -340,16 +340,18 @@ static int __tdx_reclaim_page(struct page *page)
> >  		pr_tdx_error_3(TDH_PHYMEM_PAGE_RECLAIM, err, tdx_pt, tdx_owner, tdx_size);
> >  		return -EIO;
> >  	}
> > +
> > +	WARN_ON_ONCE(tdx_size != pg_level_to_tdx_sept_level(level));
> 
> Why not return an error in this case?
Yes, returning an error seems reasonable, as it would indicate a serious bug.

> >  	return 0;
> >  }
> >  
> 
> No callers in the series pass anything other than PG_LEVEL_4K, so do we need
> this patch?
Oh, this patch is only for a future VM shutdown optimization where huge guest
pages could be reclaimed.
We can of course include it in the VM shutdown optimization series if you think
it's better.

^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 03/21] x86/virt/tdx: Add SEAMCALL wrapper tdh_mem_page_demote()
  2025-05-15 17:28       ` Edgecombe, Rick P
@ 2025-05-16  2:23         ` Yan Zhao
  2025-07-01 21:15         ` Edgecombe, Rick P
  1 sibling, 0 replies; 294+ messages in thread
From: Yan Zhao @ 2025-05-16  2:23 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: kvm@vger.kernel.org, Li, Xiaoyao, quic_eberman@quicinc.com,
	Hansen, Dave, david@redhat.com, Li, Zhiquan1, tabba@google.com,
	vbabka@suse.cz, thomas.lendacky@amd.com, michael.roth@amd.com,
	seanjc@google.com, Weiny, Ira, linux-kernel@vger.kernel.org,
	pbonzini@redhat.com, ackerleytng@google.com, Yamahata, Isaku,
	binbin.wu@linux.intel.com, Peng, Chao P, Du, Fan,
	Annapurve, Vishal, jroedel@suse.de, Miao, Jun, Shutemov, Kirill,
	pgonda@google.com, x86@kernel.org

On Fri, May 16, 2025 at 01:28:52AM +0800, Edgecombe, Rick P wrote:
> On Thu, 2025-05-15 at 16:26 +0800, Yan Zhao wrote:
> > On Wed, May 14, 2025 at 02:19:56AM +0800, Edgecombe, Rick P wrote:
> > > On Thu, 2025-04-24 at 11:04 +0800, Yan Zhao wrote:
> > > > From: Xiaoyao Li <xiaoyao.li@intel.com>
> > > > 
> > > > Add a wrapper tdh_mem_page_demote() to invoke SEAMCALL TDH_MEM_PAGE_DEMOTE
> > > > to demote a huge leaf entry to a non-leaf entry in S-EPT. Currently, the
> > > > TDX module only supports demotion of a 2M huge leaf entry. After a
> > > > successful demotion, the old 2M huge leaf entry in S-EPT is replaced with a
> > > > non-leaf entry, linking to the newly-added page table page. The newly
> > > > linked page table page then contains 512 leaf entries, pointing to the 2M
> > > > guest private pages.
> > > > 
> > > > The "gpa" and "level" direct the TDX module to search and find the old
> > > > huge leaf entry.
> > > > 
> > > > As the new non-leaf entry points to a page table page, callers need to
> > > > pass in the page table page in parameter "page".
> > > > 
> > > > In case of S-EPT walk failure, the entry, level and state where the error
> > > > was detected are returned in ext_err1 and ext_err2.
> > > > 
> > > > On interrupt pending, SEAMCALL TDH_MEM_PAGE_DEMOTE returns error
> > > > TDX_INTERRUPTED_RESTARTABLE.
> > > > 
> > > > [Yan: Rebased and split patch, wrote changelog]
> > > 
> > > We should add the level of detail here like we did for the base series ones.
> > I'll provide changelog details under "---" of each patch in the next version.
> 
> I mean the commit log (above the "---") needs the same tip style treatment as
> the other SEAMCALL wrapper patches.
I thought I had followed the style.
Sorry if you think the commit msg is too simple and doesn't show the details
of this SEAMCALL. I can provide a detailed one in the next version if that's the
concern you mentioned above.

> >  
> > > > 
> > > > Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
> > > > Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> > > > Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
> > > > ---
> > > >  arch/x86/include/asm/tdx.h  |  2 ++
> > > >  arch/x86/virt/vmx/tdx/tdx.c | 20 ++++++++++++++++++++
> > > >  arch/x86/virt/vmx/tdx/tdx.h |  1 +
> > > >  3 files changed, 23 insertions(+)
> > > > 
> > > > diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
> > > > index 26ffc792e673..08eff4b2f5e7 100644
> > > > --- a/arch/x86/include/asm/tdx.h
> > > > +++ b/arch/x86/include/asm/tdx.h
> > > > @@ -177,6 +177,8 @@ u64 tdh_mng_key_config(struct tdx_td *td);
> > > >  u64 tdh_mng_create(struct tdx_td *td, u16 hkid);
> > > >  u64 tdh_vp_create(struct tdx_td *td, struct tdx_vp *vp);
> > > >  u64 tdh_mng_rd(struct tdx_td *td, u64 field, u64 *data);
> > > > +u64 tdh_mem_page_demote(struct tdx_td *td, u64 gpa, int level, struct page *page,
> > > > +			u64 *ext_err1, u64 *ext_err2);
> > > >  u64 tdh_mr_extend(struct tdx_td *td, u64 gpa, u64 *ext_err1, u64 *ext_err2);
> > > >  u64 tdh_mr_finalize(struct tdx_td *td);
> > > >  u64 tdh_vp_flush(struct tdx_vp *vp);
> > > > diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> > > > index a66d501b5677..5699dfe500d9 100644
> > > > --- a/arch/x86/virt/vmx/tdx/tdx.c
> > > > +++ b/arch/x86/virt/vmx/tdx/tdx.c
> > > > @@ -1684,6 +1684,26 @@ u64 tdh_mng_rd(struct tdx_td *td, u64 field, u64 *data)
> > > >  }
> > > >  EXPORT_SYMBOL_GPL(tdh_mng_rd);
> > > >  
> > > > +u64 tdh_mem_page_demote(struct tdx_td *td, u64 gpa, int level, struct page *page,
> > > > +			u64 *ext_err1, u64 *ext_err2)
> > > > +{
> > > > +	struct tdx_module_args args = {
> > > > +		.rcx = gpa | level,
> > > 
> > > This will only ever be level 2MB, how about dropping the arg?
> > Do you mean hardcoding level to be 2MB in tdh_mem_page_demote()?
> 
> Yea, we don't support 1GB, so the level arg on the wrapper is superfluous.
I'm not sure. It's not like tdh_mem_page_add(), where the TDX module only
supports 4KB.

But your point of not permitting 1GB in tdh_mem_page_demote() in the x86 code
until after the KVM TDX code adds 1GB support also makes sense.

> > The SEAMCALL TDH_MEM_PAGE_DEMOTE supports level of 1GB in current TDX module.
> > 
> > > > +		.rdx = tdx_tdr_pa(td),
> > > > +		.r8 = page_to_phys(page),
> > > > +	};
> > > > +	u64 ret;
> > > > +
> > > > +	tdx_clflush_page(page);
> > > > +	ret = seamcall_ret(TDH_MEM_PAGE_DEMOTE, &args);
> > > > +
> > > > +	*ext_err1 = args.rcx;
> > > > +	*ext_err2 = args.rdx;
 

^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 07/21] KVM: TDX: Add a helper for WBINVD on huge pages with TD's keyID
  2025-05-13 19:29   ` Edgecombe, Rick P
@ 2025-05-16  3:03     ` Yan Zhao
  2025-05-16 17:35       ` Edgecombe, Rick P
  0 siblings, 1 reply; 294+ messages in thread
From: Yan Zhao @ 2025-05-16  3:03 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: pbonzini@redhat.com, seanjc@google.com, Shutemov, Kirill,
	quic_eberman@quicinc.com, Li, Xiaoyao, kvm@vger.kernel.org,
	Hansen, Dave, david@redhat.com, thomas.lendacky@amd.com,
	tabba@google.com, Li, Zhiquan1, Du, Fan,
	linux-kernel@vger.kernel.org, michael.roth@amd.com, Weiny, Ira,
	vbabka@suse.cz, binbin.wu@linux.intel.com, ackerleytng@google.com,
	Yamahata, Isaku, Peng, Chao P, Annapurve, Vishal, jroedel@suse.de,
	Miao, Jun, pgonda@google.com, x86@kernel.org

On Wed, May 14, 2025 at 03:29:00AM +0800, Edgecombe, Rick P wrote:
> On Thu, 2025-04-24 at 11:05 +0800, Yan Zhao wrote:
> > From: Xiaoyao Li <xiaoyao.li@intel.com>
> > 
> > After a guest page is removed from the S-EPT, KVM calls
> > tdh_phymem_page_wbinvd_hkid() to execute WBINVD on the page using the TD's
> > keyID.
> > 
> > Add a helper function that takes level information to perform WBINVD on a
> > huge page.
> > 
> > [Yan: split patch, added a helper, rebased to use struct page]
> > Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
> > Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> > Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
> > ---
> >  arch/x86/kvm/vmx/tdx.c | 24 +++++++++++++++++++-----
> >  1 file changed, 19 insertions(+), 5 deletions(-)
> > 
> > diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> > index 69f3140928b5..355b21fc169f 100644
> > --- a/arch/x86/kvm/vmx/tdx.c
> > +++ b/arch/x86/kvm/vmx/tdx.c
> > @@ -1586,6 +1586,23 @@ int tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn,
> >  	return tdx_mem_page_record_premap_cnt(kvm, level);
> >  }
> >  
> > +static inline u64 tdx_wbinvd_page(struct kvm *kvm, u64 hkid, struct page *page, int level)
> > +{
> > +	unsigned long nr = KVM_PAGES_PER_HPAGE(level);
> > +	unsigned long idx = 0;
> > +	u64 err;
> > +
> > +	while (nr--) {
> > +		err = tdh_phymem_page_wbinvd_hkid(hkid, nth_page(page, idx++));
> > +
> > +		if (KVM_BUG_ON(err, kvm)) {
> > +			pr_tdx_error(TDH_PHYMEM_PAGE_WBINVD, err);
> > +			return err;
> > +		}
> > +	}
> > +	return err;
> > +}
> 
> Hmm, did you consider changing tdh_phymem_page_wbinvd_hkid()? It's the pattern
> of KVM wrapping the SEAMCALL helpers to do some more work that needs to be
> wrapped.
SEAMCALL TDH_PHYMEM_PAGE_WBINVD only accepts a 4KB page.
Will move the loop from KVM to the wrapper in x86 if you think it's better.


> >  static int tdx_sept_drop_private_spte(struct kvm *kvm, gfn_t gfn,
> >  				      enum pg_level level, struct page *page)
> >  {
> > @@ -1625,12 +1642,9 @@ static int tdx_sept_drop_private_spte(struct kvm *kvm, gfn_t gfn,
> >  		return -EIO;
> >  	}
> >  
> > -	err = tdh_phymem_page_wbinvd_hkid((u16)kvm_tdx->hkid, page);
> > -
> > -	if (KVM_BUG_ON(err, kvm)) {
> > -		pr_tdx_error(TDH_PHYMEM_PAGE_WBINVD, err);
> > +	err = tdx_wbinvd_page(kvm, kvm_tdx->hkid, page, level);
> > +	if (err)
> >  		return -EIO;
> > -	}
> >  
> >  	tdx_clear_page(page, level);
> >  	tdx_unpin(kvm, page);
> 

^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 07/21] KVM: TDX: Add a helper for WBINVD on huge pages with TD's keyID
  2025-05-06  8:37   ` Binbin Wu
@ 2025-05-16  3:10     ` Yan Zhao
  0 siblings, 0 replies; 294+ messages in thread
From: Yan Zhao @ 2025-05-16  3:10 UTC (permalink / raw)
  To: Binbin Wu
  Cc: pbonzini, seanjc, linux-kernel, kvm, x86, rick.p.edgecombe,
	dave.hansen, kirill.shutemov, tabba, ackerleytng, quic_eberman,
	michael.roth, david, vannapurve, vbabka, jroedel, thomas.lendacky,
	pgonda, zhiquan1.li, fan.du, jun.miao, ira.weiny, isaku.yamahata,
	xiaoyao.li, chao.p.peng

On Tue, May 06, 2025 at 04:37:22PM +0800, Binbin Wu wrote:
> 
> 
> On 4/24/2025 11:05 AM, Yan Zhao wrote:
> > From: Xiaoyao Li <xiaoyao.li@intel.com>
> > 
> > After a guest page is removed from the S-EPT, KVM calls
> > tdh_phymem_page_wbinvd_hkid() to execute WBINVD on the page using the TD's
> > keyID.
> > 
> > Add a helper function that takes level information to perform WBINVD on a
> > huge page.
> > 
> > [Yan: split patch, added a helper, rebased to use struct page]
> > Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
> > Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> > Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
> > ---
> >   arch/x86/kvm/vmx/tdx.c | 24 +++++++++++++++++++-----
> >   1 file changed, 19 insertions(+), 5 deletions(-)
> > 
> > diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> > index 69f3140928b5..355b21fc169f 100644
> > --- a/arch/x86/kvm/vmx/tdx.c
> > +++ b/arch/x86/kvm/vmx/tdx.c
> > @@ -1586,6 +1586,23 @@ int tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn,
> >   	return tdx_mem_page_record_premap_cnt(kvm, level);
> >   }
> > +static inline u64 tdx_wbinvd_page(struct kvm *kvm, u64 hkid, struct page *page, int level)
> > +{
> > +	unsigned long nr = KVM_PAGES_PER_HPAGE(level);
> > +	unsigned long idx = 0;
> > +	u64 err;
> > +
> > +	while (nr--) {
> > +		err = tdh_phymem_page_wbinvd_hkid(hkid, nth_page(page, idx++));
> > +
> > +		if (KVM_BUG_ON(err, kvm)) {
> > +			pr_tdx_error(TDH_PHYMEM_PAGE_WBINVD, err);
> > +			return err;
> > +		}
> > +	}
> > +	return err;
> > +}
> > +
> >   static int tdx_sept_drop_private_spte(struct kvm *kvm, gfn_t gfn,
> >   				      enum pg_level level, struct page *page)
> >   {
> > @@ -1625,12 +1642,9 @@ static int tdx_sept_drop_private_spte(struct kvm *kvm, gfn_t gfn,
> >   		return -EIO;
> >   	}
> > -	err = tdh_phymem_page_wbinvd_hkid((u16)kvm_tdx->hkid, page);
> > -
> > -	if (KVM_BUG_ON(err, kvm)) {
> > -		pr_tdx_error(TDH_PHYMEM_PAGE_WBINVD, err);
> > +	err = tdx_wbinvd_page(kvm, kvm_tdx->hkid, page, level);
> > +	if (err)
> 
> It can add unlikely() here.
> Also, err is not used after the check, so maybe it can be combined as:
> 
> if (unlikely(tdx_wbinvd_page(kvm, kvm_tdx->hkid, page, level)))
>         return -EIO;
That's better. Thanks!

> >   		return -EIO;
> > -	}
> >   	tdx_clear_page(page, level);
> >   	tdx_unpin(kvm, page);
> 

^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 10/21] KVM: x86/mmu: Disallow page merging (huge page adjustment) for mirror root
  2025-05-13 20:15   ` Edgecombe, Rick P
@ 2025-05-16  4:01     ` Yan Zhao
  2025-05-16 17:50       ` Edgecombe, Rick P
  0 siblings, 1 reply; 294+ messages in thread
From: Yan Zhao @ 2025-05-16  4:01 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: pbonzini@redhat.com, seanjc@google.com, Shutemov, Kirill,
	quic_eberman@quicinc.com, Li, Xiaoyao, kvm@vger.kernel.org,
	Hansen, Dave, david@redhat.com, thomas.lendacky@amd.com,
	tabba@google.com, Li, Zhiquan1, Du, Fan,
	linux-kernel@vger.kernel.org, michael.roth@amd.com, Weiny, Ira,
	vbabka@suse.cz, binbin.wu@linux.intel.com, ackerleytng@google.com,
	Yamahata, Isaku, Peng, Chao P, Annapurve, Vishal, jroedel@suse.de,
	Miao, Jun, pgonda@google.com, x86@kernel.org

On Wed, May 14, 2025 at 04:15:14AM +0800, Edgecombe, Rick P wrote:
> On Thu, 2025-04-24 at 11:06 +0800, Yan Zhao wrote:
> > From: "Edgecombe, Rick P" <rick.p.edgecombe@intel.com>
> > 
> > Disallow page merging (huge page adjustment) for mirror root by leveraging
> > the disallowed_hugepage_adjust().
> > 
> > [Yan: Passing is_mirror to disallowed_hugepage_adjust()]
> > 
> > Signed-off-by: Edgecombe, Rick P <rick.p.edgecombe@intel.com>
> > Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
> > ---
> >  arch/x86/kvm/mmu/mmu.c          | 6 +++---
> >  arch/x86/kvm/mmu/mmu_internal.h | 2 +-
> >  arch/x86/kvm/mmu/paging_tmpl.h  | 2 +-
> >  arch/x86/kvm/mmu/tdp_mmu.c      | 7 ++++---
> >  4 files changed, 9 insertions(+), 8 deletions(-)
> > 
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index a284dce227a0..b923deeeb62e 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -3326,13 +3326,13 @@ void kvm_mmu_hugepage_adjust(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
> >  	fault->pfn &= ~mask;
> >  }
> >  
> > -void disallowed_hugepage_adjust(struct kvm_page_fault *fault, u64 spte, int cur_level)
> > +void disallowed_hugepage_adjust(struct kvm_page_fault *fault, u64 spte, int cur_level, bool is_mirror)
> >  {
> >  	if (cur_level > PG_LEVEL_4K &&
> >  	    cur_level == fault->goal_level &&
> >  	    is_shadow_present_pte(spte) &&
> >  	    !is_large_pte(spte) &&
> > -	    spte_to_child_sp(spte)->nx_huge_page_disallowed) {
> > +	    (spte_to_child_sp(spte)->nx_huge_page_disallowed || is_mirror)) {
> >  		/*
> >  		 * A small SPTE exists for this pfn, but FNAME(fetch),
> >  		 * direct_map(), or kvm_tdp_mmu_map() would like to create a
> > @@ -3363,7 +3363,7 @@ static int direct_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> >  		 * large page, as the leaf could be executable.
> >  		 */
> >  		if (fault->nx_huge_page_workaround_enabled)
> > -			disallowed_hugepage_adjust(fault, *it.sptep, it.level);
> > +			disallowed_hugepage_adjust(fault, *it.sptep, it.level, false);
> >  
> >  		base_gfn = gfn_round_for_level(fault->gfn, it.level);
> >  		if (it.level == fault->goal_level)
> > diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
> > index db8f33e4de62..1c1764f46e66 100644
> > --- a/arch/x86/kvm/mmu/mmu_internal.h
> > +++ b/arch/x86/kvm/mmu/mmu_internal.h
> > @@ -411,7 +411,7 @@ static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
> >  int kvm_mmu_max_mapping_level(struct kvm *kvm,
> >  			      const struct kvm_memory_slot *slot, gfn_t gfn);
> >  void kvm_mmu_hugepage_adjust(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault);
> > -void disallowed_hugepage_adjust(struct kvm_page_fault *fault, u64 spte, int cur_level);
> > +void disallowed_hugepage_adjust(struct kvm_page_fault *fault, u64 spte, int cur_level, bool is_mirror);
> >  
> >  void track_possible_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp);
> >  void untrack_possible_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp);
> > diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
> > index 68e323568e95..1559182038e3 100644
> > --- a/arch/x86/kvm/mmu/paging_tmpl.h
> > +++ b/arch/x86/kvm/mmu/paging_tmpl.h
> > @@ -717,7 +717,7 @@ static int FNAME(fetch)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
> >  		 * large page, as the leaf could be executable.
> >  		 */
> >  		if (fault->nx_huge_page_workaround_enabled)
> > -			disallowed_hugepage_adjust(fault, *it.sptep, it.level);
> > +			disallowed_hugepage_adjust(fault, *it.sptep, it.level, false);
> >  
> >  		base_gfn = gfn_round_for_level(fault->gfn, it.level);
> >  		if (it.level == fault->goal_level)
> > diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> > index 405874f4d088..8ee01277cc07 100644
> > --- a/arch/x86/kvm/mmu/tdp_mmu.c
> > +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> > @@ -1244,6 +1244,7 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> >  	struct tdp_iter iter;
> >  	struct kvm_mmu_page *sp;
> >  	int ret = RET_PF_RETRY;
> > +	bool is_mirror = is_mirror_sp(root);
> >  
> >  	kvm_mmu_hugepage_adjust(vcpu, fault);
> >  
> > @@ -1254,8 +1255,8 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> >  	for_each_tdp_pte(iter, kvm, root, fault->gfn, fault->gfn + 1) {
> >  		int r;
> >  
> > -		if (fault->nx_huge_page_workaround_enabled)
> > -			disallowed_hugepage_adjust(fault, iter.old_spte, iter.level);
> > +		if (fault->nx_huge_page_workaround_enabled || is_mirror)
> 
> Maybe we should rename nx_huge_page_workaround_enabled to something more generic
> and do the is_mirror logic in kvm_mmu_do_page_fault() when setting it. It should
> shrink the diff and centralize the logic.
Hmm, I'm reluctant to rename nx_huge_page_workaround_enabled, because

(1) Invoking disallowed_hugepage_adjust() for the mirror root is to disable page
    promotion for TDX private memory, so it only applies to the TDP MMU.
(2) nx_huge_page_workaround_enabled is used specifically for nx huge pages.
    fault->huge_page_disallowed = fault->exec && fault->nx_huge_page_workaround_enabled;

    if (fault->huge_page_disallowed)
        account_nx_huge_page(vcpu->kvm, sp,
                             fault->req_level >= it.level);
    
    sp->nx_huge_page_disallowed = fault->huge_page_disallowed;

    Affecting fault->huge_page_disallowed would impact
    sp->nx_huge_page_disallowed as well and would disable huge pages entirely.

    So, we still need to keep nx_huge_page_workaround_enabled.

If we introduce a new flag fault->disable_hugepage_adjust, and set it in
kvm_mmu_do_page_fault(), we would also need to invoke
tdp_mmu_get_root_for_fault() there as well.

Checking for mirror root for non-TDX VMs is not necessary, and the invocation of
tdp_mmu_get_root_for_fault() seems redundant with the one in kvm_tdp_mmu_map().
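
To illustrate, a rough, hypothetical sketch of what that alternative would need
in kvm_mmu_do_page_fault() (the flag name is made up and this is untested; it
is only meant to show the extra root lookup):

	struct kvm_mmu_page *root = tdp_mmu_get_root_for_fault(vcpu, &fault);

	fault.disable_hugepage_adjust = root && is_mirror_sp(root);

i.e. the root would be looked up once here and again in kvm_tdp_mmu_map().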


> > +			disallowed_hugepage_adjust(fault, iter.old_spte, iter.level, is_mirror);
> >  
> >  		/*
> >  		 * If SPTE has been frozen by another thread, just give up and
> > @@ -1278,7 +1279,7 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> >  		 */
> >  		sp = tdp_mmu_alloc_sp(vcpu);
> >  		tdp_mmu_init_child_sp(sp, &iter);
> > -		if (is_mirror_sp(sp))
> > +		if (is_mirror)
> >  			kvm_mmu_alloc_external_spt(vcpu, sp);
> >  
> >  		sp->nx_huge_page_disallowed = fault->huge_page_disallowed;
> 

^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 12/21] KVM: TDX: Determine max mapping level according to vCPU's ACCEPT level
  2025-05-13 21:20   ` Edgecombe, Rick P
@ 2025-05-16  6:12     ` Xiaoyao Li
  2025-05-16  6:30     ` Yan Zhao
  1 sibling, 0 replies; 294+ messages in thread
From: Xiaoyao Li @ 2025-05-16  6:12 UTC (permalink / raw)
  To: Edgecombe, Rick P, pbonzini@redhat.com, seanjc@google.com,
	Zhao, Yan Y
  Cc: Shutemov, Kirill, quic_eberman@quicinc.com, kvm@vger.kernel.org,
	Hansen, Dave, david@redhat.com, thomas.lendacky@amd.com,
	tabba@google.com, Li, Zhiquan1, Du, Fan,
	linux-kernel@vger.kernel.org, michael.roth@amd.com, Weiny, Ira,
	vbabka@suse.cz, binbin.wu@linux.intel.com, ackerleytng@google.com,
	Yamahata, Isaku, Peng, Chao P, Annapurve, Vishal, jroedel@suse.de,
	Miao, Jun, pgonda@google.com, x86@kernel.org

On 5/14/2025 5:20 AM, Edgecombe, Rick P wrote:
> On Thu, 2025-04-24 at 11:07 +0800, Yan Zhao wrote:
>> Determine the max mapping level of a private GFN according to the vCPU's
>> ACCEPT level specified in the TDCALL TDG.MEM.PAGE.ACCEPT.
>>
>> When an EPT violation occurs due to a vCPU invoking TDG.MEM.PAGE.ACCEPT
>> before any actual memory access, the vCPU's ACCEPT level is available in
>> the extended exit qualification. Set the vCPU's ACCEPT level as the max
>> mapping level for the faulting GFN. This is necessary because if KVM
>> specifies a mapping level greater than the vCPU's ACCEPT level, and no
>> other vCPUs are accepting at KVM's mapping level, TDG.MEM.PAGE.ACCEPT will
>> produce another EPT violation on the vCPU after re-entering the TD, with
>> the vCPU's ACCEPT level indicated in the extended exit qualification.
> 
> Maybe a little more info would help. It's because the TDX module wants to
> "accept" the smaller size in the real S-EPT, but KVM created a huge page. It
> can't demote to do this without help from KVM.
> 
>>
>> Introduce "violation_gfn_start", "violation_gfn_end", and
>> "violation_request_level" in "struct vcpu_tdx" to pass the vCPU's ACCEPT
>> level to TDX's private_max_mapping_level hook for determining the max
>> mapping level.
>>
>> Instead of taking some bits of the error_code passed to
>> kvm_mmu_page_fault() and requiring KVM MMU core to check the error_code for
>> a fault's max_level, having TDX's private_max_mapping_level hook check for
>> request level avoids changes to the KVM MMU core. This approach also
>> accommodates future scenarios where the requested mapping level is unknown
>> at the start of tdx_handle_ept_violation() (i.e., before invoking
>> kvm_mmu_page_fault()).
>>
>> Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
>> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
>> ---
>>   arch/x86/kvm/vmx/tdx.c      | 36 +++++++++++++++++++++++++++++++++++-
>>   arch/x86/kvm/vmx/tdx.h      |  4 ++++
>>   arch/x86/kvm/vmx/tdx_arch.h |  3 +++
>>   3 files changed, 42 insertions(+), 1 deletion(-)
>>
>> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
>> index 86775af85cd8..dd63a634e633 100644
>> --- a/arch/x86/kvm/vmx/tdx.c
>> +++ b/arch/x86/kvm/vmx/tdx.c
>> @@ -1859,10 +1859,34 @@ static inline bool tdx_is_sept_violation_unexpected_pending(struct kvm_vcpu *vcp
>>   	return !(eq & EPT_VIOLATION_PROT_MASK) && !(eq & EPT_VIOLATION_EXEC_FOR_RING3_LIN);
>>   }
>>   
>> +static inline void tdx_get_accept_level(struct kvm_vcpu *vcpu, gpa_t gpa)
>> +{
>> +	struct vcpu_tdx *tdx = to_tdx(vcpu);
>> +	int level = -1;
>> +
>> +	u64 eeq_type = tdx->ext_exit_qualification & TDX_EXT_EXIT_QUAL_TYPE_MASK;
>> +
>> +	u32 eeq_info = (tdx->ext_exit_qualification & TDX_EXT_EXIT_QUAL_INFO_MASK) >>
>> +			TDX_EXT_EXIT_QUAL_INFO_SHIFT;
>> +
>> +	if (eeq_type == TDX_EXT_EXIT_QUAL_TYPE_ACCEPT) {
>> +		level = (eeq_info & GENMASK(2, 0)) + 1;
>> +
>> +		tdx->violation_gfn_start = gfn_round_for_level(gpa_to_gfn(gpa), level);
>> +		tdx->violation_gfn_end = tdx->violation_gfn_start + KVM_PAGES_PER_HPAGE(level);
>> +		tdx->violation_request_level = level;
>> +	} else {
>> +		tdx->violation_gfn_start = -1;
>> +		tdx->violation_gfn_end = -1;
>> +		tdx->violation_request_level = -1;
> 
> We had some internal conversations on how KVM used to stuff a bunch of fault
> stuff in the vcpu so it didn't have to pass it around, but now uses the fault
> struct for this. The point was (IIRC) to prevent stale data from getting
> confused on future faults, and it being hard to track what came from where.
> 
> In the TDX case, I think the potential for confusion is still there. The MMU
> code could use stale data if an accept EPT violation happens and control returns
> to userspace, at which point userspace does a KVM_PRE_FAULT_MEMORY. Then it will
> see the stale  tdx->violation_*. Not exactly a common case, but better to not
> have loose ends if we can avoid it.
> 
> Looking more closely, I don't see why it's too hard to pass in a max_fault_level
> into the fault struct. Totally untested rough idea, what do you think?

the original huge page support patch did encode the level info in 
error_code. So it has my vote.

https://lore.kernel.org/all/4d61104bff388a081ff8f6ae4ac71e05a13e53c3.1708933624.git.isaku.yamahata@intel.com/

> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index faae82eefd99..3dc476da6391 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -282,7 +282,11 @@ enum x86_intercept_stage;
>    * when the guest was accessing private memory.
>    */
>   #define PFERR_PRIVATE_ACCESS   BIT_ULL(49)
> -#define PFERR_SYNTHETIC_MASK   (PFERR_IMPLICIT_ACCESS | PFERR_PRIVATE_ACCESS)
> +
> +#define PFERR_FAULT_LEVEL_MASK (BIT_ULL(50) | BIT_ULL(51) | BIT_ULL(52))
> +#define PFERR_FAULT_LEVEL_SHIFT 50
> +
> +#define PFERR_SYNTHETIC_MASK   (PFERR_IMPLICIT_ACCESS | PFERR_PRIVATE_ACCESS | PFERR_FAULT_LEVEL_MASK)
>   
>   /* apic attention bits */
>   #define KVM_APIC_CHECK_VAPIC   0
> diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
> index 1c1764f46e66..bdb1b0eabd67 100644
> --- a/arch/x86/kvm/mmu/mmu_internal.h
> +++ b/arch/x86/kvm/mmu/mmu_internal.h
> @@ -361,7 +361,7 @@ static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
>                  .nx_huge_page_workaround_enabled =
>                          is_nx_huge_page_enabled(vcpu->kvm),
>   
> -               .max_level = KVM_MAX_HUGEPAGE_LEVEL,
> +               .max_level = (err & PFERR_FAULT_LEVEL_MASK) >> PFERR_FAULT_LEVEL_SHIFT,
>                  .req_level = PG_LEVEL_4K,
>                  .goal_level = PG_LEVEL_4K,
>                  .is_private = err & PFERR_PRIVATE_ACCESS,
> diff --git a/arch/x86/kvm/vmx/common.h b/arch/x86/kvm/vmx/common.h
> index 8f46a06e2c44..2f22b294ef8b 100644
> --- a/arch/x86/kvm/vmx/common.h
> +++ b/arch/x86/kvm/vmx/common.h
> @@ -83,7 +83,8 @@ static inline bool vt_is_tdx_private_gpa(struct kvm *kvm, gpa_t gpa)
>   }
>   
>   static inline int __vmx_handle_ept_violation(struct kvm_vcpu *vcpu, gpa_t gpa,
> -                                            unsigned long exit_qualification)
> +                                            unsigned long exit_qualification,
> +                                            u8 max_fault_level)
>   {
>          u64 error_code;
>   
> @@ -107,6 +108,10 @@ static inline int __vmx_handle_ept_violation(struct kvm_vcpu *vcpu, gpa_t gpa,
>          if (vt_is_tdx_private_gpa(vcpu->kvm, gpa))
>                  error_code |= PFERR_PRIVATE_ACCESS;
>   
> +       BUILD_BUG_ON(KVM_MAX_HUGEPAGE_LEVEL >= (1 << hweight64(PFERR_FAULT_LEVEL_MASK)));
> +
> +       error_code |= (u64)max_fault_level << PFERR_FAULT_LEVEL_SHIFT;
> +
>          return kvm_mmu_page_fault(vcpu, gpa, error_code, NULL, 0);
>   }
>   
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index e994a6c08a75..19047de4d98d 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -2027,7 +2027,7 @@ static int tdx_handle_ept_violation(struct kvm_vcpu *vcpu)
>           * handle retries locally in their EPT violation handlers.
>           */
>          while (1) {
> -               ret = __vmx_handle_ept_violation(vcpu, gpa, exit_qual);
> +               ret = __vmx_handle_ept_violation(vcpu, gpa, exit_qual, KVM_MAX_HUGEPAGE_LEVEL);
>   
>                  if (ret != RET_PF_RETRY || !local_retry)
>                          break;
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index ef2d7208dd20..b70a2ff35884 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -5782,7 +5782,7 @@ static int handle_ept_violation(struct kvm_vcpu *vcpu)
>         if (unlikely(allow_smaller_maxphyaddr && !kvm_vcpu_is_legal_gpa(vcpu, gpa)))
>                  return kvm_emulate_instruction(vcpu, 0);
>   
> -       return __vmx_handle_ept_violation(vcpu, gpa, exit_qualification);
> +       return __vmx_handle_ept_violation(vcpu, gpa, exit_qualification, KVM_MAX_HUGEPAGE_LEVEL);
>   }
>   
>   static int handle_ept_misconfig(struct kvm_vcpu *vcpu)
> 
> 


^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 12/21] KVM: TDX: Determine max mapping level according to vCPU's ACCEPT level
  2025-05-13 21:20   ` Edgecombe, Rick P
  2025-05-16  6:12     ` Xiaoyao Li
@ 2025-05-16  6:30     ` Yan Zhao
  2025-05-16 22:02       ` Edgecombe, Rick P
  1 sibling, 1 reply; 294+ messages in thread
From: Yan Zhao @ 2025-05-16  6:30 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: pbonzini@redhat.com, seanjc@google.com, Shutemov, Kirill,
	quic_eberman@quicinc.com, Li, Xiaoyao, kvm@vger.kernel.org,
	Hansen, Dave, david@redhat.com, thomas.lendacky@amd.com,
	tabba@google.com, Li, Zhiquan1, Du, Fan,
	linux-kernel@vger.kernel.org, michael.roth@amd.com, Weiny, Ira,
	vbabka@suse.cz, binbin.wu@linux.intel.com, ackerleytng@google.com,
	Yamahata, Isaku, Peng, Chao P, Annapurve, Vishal, jroedel@suse.de,
	Miao, Jun, pgonda@google.com, x86@kernel.org

On Wed, May 14, 2025 at 05:20:01AM +0800, Edgecombe, Rick P wrote:
> On Thu, 2025-04-24 at 11:07 +0800, Yan Zhao wrote:
> > Determine the max mapping level of a private GFN according to the vCPU's
> > ACCEPT level specified in the TDCALL TDG.MEM.PAGE.ACCEPT.
> > 
> > When an EPT violation occurs due to a vCPU invoking TDG.MEM.PAGE.ACCEPT
> > before any actual memory access, the vCPU's ACCEPT level is available in
> > the extended exit qualification. Set the vCPU's ACCEPT level as the max
> > mapping level for the faulting GFN. This is necessary because if KVM
> > specifies a mapping level greater than the vCPU's ACCEPT level, and no
> > other vCPUs are accepting at KVM's mapping level, TDG.MEM.PAGE.ACCEPT will
> > produce another EPT violation on the vCPU after re-entering the TD, with
> > the vCPU's ACCEPT level indicated in the extended exit qualification.
> 
> Maybe a little more info would help. It's because the TDX module wants to
> "accept" the smaller size in the real S-EPT, but KVM created a huge page. It
> can't demote to do this without help from KVM.
Ok. Right, the TDX module cannot set the entire 2MB mapping to the accepted
state because the guest only specifies 4KB acceptance, and it cannot perform
demotion without a request from KVM. Therefore, the requested level must be
passed to KVM to ensure the mirror page table faults at the expected level.

> > Introduce "violation_gfn_start", "violation_gfn_end", and
> > "violation_request_level" in "struct vcpu_tdx" to pass the vCPU's ACCEPT
> > level to TDX's private_max_mapping_level hook for determining the max
> > mapping level.
> > 
> > Instead of taking some bits of the error_code passed to
> > kvm_mmu_page_fault() and requiring KVM MMU core to check the error_code for
> > a fault's max_level, having TDX's private_max_mapping_level hook check for
> > request level avoids changes to the KVM MMU core. This approach also
> > accommodates future scenarios where the requested mapping level is unknown
> > at the start of tdx_handle_ept_violation() (i.e., before invoking
> > kvm_mmu_page_fault()).
> > 
> > Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
> > Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
> > ---
> >  arch/x86/kvm/vmx/tdx.c      | 36 +++++++++++++++++++++++++++++++++++-
> >  arch/x86/kvm/vmx/tdx.h      |  4 ++++
> >  arch/x86/kvm/vmx/tdx_arch.h |  3 +++
> >  3 files changed, 42 insertions(+), 1 deletion(-)
> > 
> > diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> > index 86775af85cd8..dd63a634e633 100644
> > --- a/arch/x86/kvm/vmx/tdx.c
> > +++ b/arch/x86/kvm/vmx/tdx.c
> > @@ -1859,10 +1859,34 @@ static inline bool tdx_is_sept_violation_unexpected_pending(struct kvm_vcpu *vcp
> >  	return !(eq & EPT_VIOLATION_PROT_MASK) && !(eq & EPT_VIOLATION_EXEC_FOR_RING3_LIN);
> >  }
> >  
> > +static inline void tdx_get_accept_level(struct kvm_vcpu *vcpu, gpa_t gpa)
> > +{
> > +	struct vcpu_tdx *tdx = to_tdx(vcpu);
> > +	int level = -1;
> > +
> > +	u64 eeq_type = tdx->ext_exit_qualification & TDX_EXT_EXIT_QUAL_TYPE_MASK;
> > +
> > +	u32 eeq_info = (tdx->ext_exit_qualification & TDX_EXT_EXIT_QUAL_INFO_MASK) >>
> > +			TDX_EXT_EXIT_QUAL_INFO_SHIFT;
> > +
> > +	if (eeq_type == TDX_EXT_EXIT_QUAL_TYPE_ACCEPT) {
> > +		level = (eeq_info & GENMASK(2, 0)) + 1;
> > +
> > +		tdx->violation_gfn_start = gfn_round_for_level(gpa_to_gfn(gpa), level);
> > +		tdx->violation_gfn_end = tdx->violation_gfn_start + KVM_PAGES_PER_HPAGE(level);
> > +		tdx->violation_request_level = level;
> > +	} else {
> > +		tdx->violation_gfn_start = -1;
> > +		tdx->violation_gfn_end = -1;
> > +		tdx->violation_request_level = -1;
> 
> We had some internal conversations on how KVM used to stuff a bunch of fault
> stuff in the vcpu so it didn't have to pass it around, but now uses the fault
> struct for this. The point was (IIRC) to prevent stale data from getting
> confused on future faults, and it being hard to track what came from where.
> 
> In the TDX case, I think the potential for confusion is still there. The MMU
> code could use stale data if an accept EPT violation happens and control returns
> to userspace, at which point userspace does a KVM_PRE_FAULT_MEMORY. Then it will
> see the stale  tdx->violation_*. Not exactly a common case, but better to not
> have loose ends if we can avoid it.
> 
> Looking more closely, I don't see why it's too hard to pass in a max_fault_level
> into the fault struct. Totally untested rough idea, what do you think?
Thanks for bringing this up and providing the idea below. In the previous TDX
huge page v8, there's a similar implementation [1] [2].

This series did not adopt that approach because it requires
tdx_handle_ept_violation() to pass in max_fault_level, which is not always
available at that stage, e.g.:

In patch 19, when vCPU 1 faults on a GFN at 2MB level and then vCPU 2 faults on
the same GFN at 4KB level, TDX wants to ignore the demotion request caused by
vCPU 2's 4KB level fault. So, patch 19 sets tdx->violation_request_level to 2MB
in vCPU 2's split callback and fails the split. vCPU 2's
__vmx_handle_ept_violation() will see RET_PF_RETRY and either retry locally or
return to the guest.

If it retries locally, tdx_gmem_private_max_mapping_level() will return
tdx->violation_request_level, causing KVM to fault at 2MB level for vCPU 2,
resulting in a spurious fault and eventually returning to the guest.

As tdx->violation_request_level is per-vCPU and is reset by
tdx_get_accept_level() in tdx_handle_ept_violation() (i.e., it is refreshed on
each invocation of tdx_handle_ept_violation() and only affects the TDX local
retry loop), it should not hold any stale value.

Alternatively, instead of having tdx_gmem_private_max_mapping_level() return
tdx->violation_request_level, tdx_handle_ept_violation() could grab
tdx->violation_request_level as the max_fault_level to pass to
__vmx_handle_ept_violation().

This series chose to use tdx_gmem_private_max_mapping_level() to avoid
modifying the KVM MMU core.

[1] https://lore.kernel.org/all/4d61104bff388a081ff8f6ae4ac71e05a13e53c3.1708933624.git.isaku.yamahata@intel.com/
[2] https://lore.kernel.org/all/3d2a6bfb033ee1b51f7b875360bd295376c32b54.1708933624.git.isaku.yamahata@intel.com/
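
FWIW, roughly how I think of the hook consuming these fields (a simplified
sketch only; the exact signature and the fallback level in patch 12 may differ):

	static int tdx_gmem_private_max_mapping_level(struct kvm_vcpu *vcpu, gfn_t gfn)
	{
		struct vcpu_tdx *tdx = to_tdx(vcpu);

		/* Honor the vCPU's ACCEPT level recorded for the current EPT violation */
		if (gfn >= tdx->violation_gfn_start && gfn < tdx->violation_gfn_end)
			return tdx->violation_request_level;

		return KVM_MAX_HUGEPAGE_LEVEL;
	}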

> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index faae82eefd99..3dc476da6391 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -282,7 +282,11 @@ enum x86_intercept_stage;
>   * when the guest was accessing private memory.
>   */
>  #define PFERR_PRIVATE_ACCESS   BIT_ULL(49)
> -#define PFERR_SYNTHETIC_MASK   (PFERR_IMPLICIT_ACCESS | PFERR_PRIVATE_ACCESS)
> +
> +#define PFERR_FAULT_LEVEL_MASK (BIT_ULL(50) | BIT_ULL(51) | BIT_ULL(52))
> +#define PFERR_FAULT_LEVEL_SHIFT 50
> +
> +#define PFERR_SYNTHETIC_MASK   (PFERR_IMPLICIT_ACCESS | PFERR_PRIVATE_ACCESS | PFERR_FAULT_LEVEL_MASK)
>  
>  /* apic attention bits */
>  #define KVM_APIC_CHECK_VAPIC   0
> diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
> index 1c1764f46e66..bdb1b0eabd67 100644
> --- a/arch/x86/kvm/mmu/mmu_internal.h
> +++ b/arch/x86/kvm/mmu/mmu_internal.h
> @@ -361,7 +361,7 @@ static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
>                 .nx_huge_page_workaround_enabled =
>                         is_nx_huge_page_enabled(vcpu->kvm),
>  
> -               .max_level = KVM_MAX_HUGEPAGE_LEVEL,
> +               .max_level = (err & PFERR_FAULT_LEVEL_MASK) >> PFERR_FAULT_LEVEL_SHIFT,
>                 .req_level = PG_LEVEL_4K,
>                 .goal_level = PG_LEVEL_4K,
>                 .is_private = err & PFERR_PRIVATE_ACCESS,
> diff --git a/arch/x86/kvm/vmx/common.h b/arch/x86/kvm/vmx/common.h
> index 8f46a06e2c44..2f22b294ef8b 100644
> --- a/arch/x86/kvm/vmx/common.h
> +++ b/arch/x86/kvm/vmx/common.h
> @@ -83,7 +83,8 @@ static inline bool vt_is_tdx_private_gpa(struct kvm *kvm, gpa_t gpa)
>  }
>  
>  static inline int __vmx_handle_ept_violation(struct kvm_vcpu *vcpu, gpa_t gpa,
> -                                            unsigned long exit_qualification)
> +                                            unsigned long exit_qualification,
> +                                            u8 max_fault_level)
>  {
>         u64 error_code;
>  
> @@ -107,6 +108,10 @@ static inline int __vmx_handle_ept_violation(struct kvm_vcpu *vcpu, gpa_t gpa,
>         if (vt_is_tdx_private_gpa(vcpu->kvm, gpa))
>                 error_code |= PFERR_PRIVATE_ACCESS;
>  
> +       BUILD_BUG_ON(KVM_MAX_HUGEPAGE_LEVEL >= (1 << hweight64(PFERR_FAULT_LEVEL_MASK)));
> +
> +       error_code |= (u64)max_fault_level << PFERR_FAULT_LEVEL_SHIFT;
> +
>         return kvm_mmu_page_fault(vcpu, gpa, error_code, NULL, 0);
>  }
>  
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index e994a6c08a75..19047de4d98d 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -2027,7 +2027,7 @@ static int tdx_handle_ept_violation(struct kvm_vcpu *vcpu)
>          * handle retries locally in their EPT violation handlers.
>          */
>         while (1) {
> -               ret = __vmx_handle_ept_violation(vcpu, gpa, exit_qual);
> +               ret = __vmx_handle_ept_violation(vcpu, gpa, exit_qual, KVM_MAX_HUGEPAGE_LEVEL);
>  
>                 if (ret != RET_PF_RETRY || !local_retry)
>                         break;
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index ef2d7208dd20..b70a2ff35884 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -5782,7 +5782,7 @@ static int handle_ept_violation(struct kvm_vcpu *vcpu)
>         if (unlikely(allow_smaller_maxphyaddr && !kvm_vcpu_is_legal_gpa(vcpu, gpa)))
>                 return kvm_emulate_instruction(vcpu, 0);
>  
> -       return __vmx_handle_ept_violation(vcpu, gpa, exit_qualification);
> +       return __vmx_handle_ept_violation(vcpu, gpa, exit_qualification, KVM_MAX_HUGEPAGE_LEVEL);
>  }
>  
>  static int handle_ept_misconfig(struct kvm_vcpu *vcpu)
> 
> 

^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 21/21] KVM: x86: Ignore splitting huge pages in fault path for TDX
  2025-05-13 21:58   ` Edgecombe, Rick P
@ 2025-05-16  6:40     ` Yan Zhao
  0 siblings, 0 replies; 294+ messages in thread
From: Yan Zhao @ 2025-05-16  6:40 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: pbonzini@redhat.com, seanjc@google.com, Shutemov, Kirill,
	quic_eberman@quicinc.com, Li, Xiaoyao, kvm@vger.kernel.org,
	Hansen, Dave, david@redhat.com, thomas.lendacky@amd.com,
	tabba@google.com, Li, Zhiquan1, Du, Fan,
	linux-kernel@vger.kernel.org, michael.roth@amd.com, Weiny, Ira,
	vbabka@suse.cz, binbin.wu@linux.intel.com, ackerleytng@google.com,
	Yamahata, Isaku, Peng, Chao P, Annapurve, Vishal, jroedel@suse.de,
	Miao, Jun, pgonda@google.com, x86@kernel.org

On Wed, May 14, 2025 at 05:58:41AM +0800, Edgecombe, Rick P wrote:
> On Thu, 2025-04-24 at 11:09 +0800, Yan Zhao wrote:
> 
> >  int tdx_sept_split_private_spt(struct kvm *kvm, gfn_t gfn, enum pg_level level,
> > -			       void *private_spt)
> > +			       void *private_spt, bool mmu_lock_shared)
> >  {
> >  	struct page *page = virt_to_page(private_spt);
> >  	int ret;
> > @@ -1842,6 +1842,29 @@ int tdx_sept_split_private_spt(struct kvm *kvm, gfn_t gfn, enum pg_level level,
> >  	if (KVM_BUG_ON(to_kvm_tdx(kvm)->state != TD_STATE_RUNNABLE || level != PG_LEVEL_2M, kvm))
> >  		return -EINVAL;
> >  
> > +	/*
> > +	 * Split request with mmu_lock held for reading can only occur when one
> > +	 * vCPU accepts at 2MB level while another vCPU accepts at 4KB level.
> > +	 * Ignore this 4KB mapping request by setting violation_request_level to
> > +	 * 2MB and returning -EBUSY for retry. Then the next fault at 2MB level
> > +	 * would be a spurious fault. The vCPU accepting at 2MB will accept the
> > +	 * whole 2MB range.
> > +	 */
> > +	if (mmu_lock_shared) {
> > +		struct kvm_vcpu *vcpu = kvm_get_running_vcpu();
> > +		struct vcpu_tdx *tdx = to_tdx(vcpu);
> > +
> > +		if (KVM_BUG_ON(!vcpu, kvm))
> > +			return -EOPNOTSUPP;
> > +
> > +		/* Request to map as 2MB leaf for the whole 2MB range */
> > +		tdx->violation_gfn_start = gfn_round_for_level(gfn, level);
> > +		tdx->violation_gfn_end = tdx->violation_gfn_start + KVM_PAGES_PER_HPAGE(level);
> > +		tdx->violation_request_level = level;
> > +
> > +		return -EBUSY;
> 
> This is too hacky the way it infers so much from mmu_lock_shared. Since guests
> shouldn't be doing this, what about just doing kvm_vm_dead(), with a little
> pr_warn()? Maybe even just do it in set_external_spte_present() and declare it
There's a valid case [1] besides double accept that can trigger demotion in the
fault path. Kirill believed we need to support that case [2].

KVM MMU core can't tell if the demotion is caused by double accept or not.

[1] https://lore.kernel.org/all/aAn3SSocw0XvaRye@yzhao56-desk.sh.intel.com/
[2] https://lore.kernel.org/all/6vdj4mfxlyvypn743klxq5twda66tkugwzljdt275rug2gmwwl@zdziylxpre6y/

> the rule for external page tables. It can shrink this patch significantly, for
> no expected user impact.


> 
> > +	}
> > +
> >  	ret = tdx_sept_zap_private_spte(kvm, gfn, level, page);
> >  	if (ret <= 0)
> >  		return ret;
> > diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
> > index 0619e9390e5d..fcba76887508 100644
> > --- a/arch/x86/kvm/vmx/x86_ops.h
> > +++ b/arch/x86/kvm/vmx/x86_ops.h
> > @@ -159,7 +159,7 @@ int tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn,
> >  int tdx_sept_remove_private_spte(struct kvm *kvm, gfn_t gfn,
> >  				 enum pg_level level, kvm_pfn_t pfn);
> >  int tdx_sept_split_private_spt(struct kvm *kvm, gfn_t gfn, enum pg_level level,
> > -			       void *private_spt);
> > +			       void *private_spt, bool mmu_lock_shared);
> >  
> >  void tdx_flush_tlb_current(struct kvm_vcpu *vcpu);
> >  void tdx_flush_tlb_all(struct kvm_vcpu *vcpu);
> > @@ -228,7 +228,8 @@ static inline int tdx_sept_remove_private_spte(struct kvm *kvm, gfn_t gfn,
> >  
> >  static inline int tdx_sept_split_private_spt(struct kvm *kvm, gfn_t gfn,
> >  					     enum pg_level level,
> > -					     void *private_spt)
> > +					     void *private_spt,
> > +					     bool mmu_lock_shared)
> >  {
> >  	return -EOPNOTSUPP;
> >  }
> 

^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 16/21] KVM: x86/mmu: Introduce kvm_split_boundary_leafs() to split boundary leafs
  2025-05-13 22:56   ` Edgecombe, Rick P
@ 2025-05-16  7:46     ` Yan Zhao
  2025-05-16  8:03       ` Yan Zhao
  2025-05-16 11:44       ` Yan Zhao
  0 siblings, 2 replies; 294+ messages in thread
From: Yan Zhao @ 2025-05-16  7:46 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: pbonzini@redhat.com, seanjc@google.com, Shutemov, Kirill,
	quic_eberman@quicinc.com, Li, Xiaoyao, kvm@vger.kernel.org,
	Hansen, Dave, david@redhat.com, thomas.lendacky@amd.com,
	tabba@google.com, Li, Zhiquan1, Du, Fan,
	linux-kernel@vger.kernel.org, michael.roth@amd.com, Weiny, Ira,
	vbabka@suse.cz, binbin.wu@linux.intel.com, ackerleytng@google.com,
	Yamahata, Isaku, Peng, Chao P, Annapurve, Vishal, jroedel@suse.de,
	Miao, Jun, pgonda@google.com, x86@kernel.org

On Wed, May 14, 2025 at 06:56:26AM +0800, Edgecombe, Rick P wrote:
> On Thu, 2025-04-24 at 11:08 +0800, Yan Zhao wrote:
> > Introduce kvm_split_boundary_leafs() to manage the splitting of boundary
> > leafs within the mirror root.
> > 
> > Before zapping a specific GFN range in the mirror root, split any huge leaf
> > that intersects with the boundary of the GFN range to ensure that the
> > subsequent zap operation does not impact any GFN outside the specified
> > range. This is crucial for the mirror root as the private page table
> > requires the guest's ACCEPT operation after faulting back a GFN.
> > 
> > This function should be called while kvm->mmu_lock is held for writing. The
> > kvm->mmu_lock is temporarily released to allocate memory for sp for split.
> > The only expected error is -ENOMEM.
> > 
> > Opportunistically, WARN in tdp_mmu_zap_leafs() if zapping a huge leaf in
> > the mirror root affects a GFN outside the specified range.
> > 
> > Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
> > Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
> > ---
> >  arch/x86/kvm/mmu/mmu.c     |  21 +++++++
> >  arch/x86/kvm/mmu/tdp_mmu.c | 116 ++++++++++++++++++++++++++++++++++++-
> >  arch/x86/kvm/mmu/tdp_mmu.h |   1 +
> >  include/linux/kvm_host.h   |   1 +
> >  4 files changed, 136 insertions(+), 3 deletions(-)
> > 
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index 0e227199d73e..0d49c69b6b55 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -1640,6 +1640,27 @@ static bool __kvm_rmap_zap_gfn_range(struct kvm *kvm,
> >  				 start, end - 1, can_yield, true, flush);
> >  }
> >  
> > +/*
> > + * Split large leafs at the boundary of the specified range for the mirror root
> > + *
> > + * Return value:
> > + * 0 : success, no flush is required;
> > + * 1 : success, flush is required;
> > + * <0: failure.
> > + */
> > +int kvm_split_boundary_leafs(struct kvm *kvm, struct kvm_gfn_range *range)
> > +{
> > +	bool ret = 0;
> > +
> > +	lockdep_assert_once(kvm->mmu_invalidate_in_progress ||
> > +			    lockdep_is_held(&kvm->slots_lock));
> > +
> > +	if (tdp_mmu_enabled)
> > +		ret = kvm_tdp_mmu_gfn_range_split_boundary(kvm, range);
> > +
> > +	return ret;
> > +}
> > +
> >  bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
> >  {
> >  	bool flush = false;
> > diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> > index 0f683753a7bb..d3fba5d11ea2 100644
> > --- a/arch/x86/kvm/mmu/tdp_mmu.c
> > +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> > @@ -324,6 +324,8 @@ static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
> >  				u64 old_spte, u64 new_spte, int level,
> >  				bool shared);
> >  
> > +static int tdp_mmu_split_huge_page(struct kvm *kvm, struct tdp_iter *iter,
> > +				   struct kvm_mmu_page *sp, bool shared);
> >  static struct kvm_mmu_page *tdp_mmu_alloc_sp_for_split(bool mirror);
> >  static void *get_external_spt(gfn_t gfn, u64 new_spte, int level);
> >  
> > @@ -962,6 +964,19 @@ bool kvm_tdp_mmu_zap_sp(struct kvm *kvm, struct kvm_mmu_page *sp)
> >  	return true;
> >  }
> >  
> > +static inline bool iter_split_required(struct kvm *kvm, struct kvm_mmu_page *root,
> > +				       struct tdp_iter *iter, gfn_t start, gfn_t end)
> > +{
> > +	if (!is_mirror_sp(root) || !is_large_pte(iter->old_spte))
> > +		return false;
> > +
> > +	/* Fully contained, no need to split */
> > +	if (iter->gfn >= start && iter->gfn + KVM_PAGES_PER_HPAGE(iter->level) <= end)
> > +		return false;
> > +
> > +	return true;
> > +}
> > +
> >  /*
> >   * If can_yield is true, will release the MMU lock and reschedule if the
> >   * scheduler needs the CPU or there is contention on the MMU lock. If this
> > @@ -991,6 +1006,8 @@ static bool tdp_mmu_zap_leafs(struct kvm *kvm, struct kvm_mmu_page *root,
> >  		    !is_last_spte(iter.old_spte, iter.level))
> >  			continue;
> >  
> > +		WARN_ON_ONCE(iter_split_required(kvm, root, &iter, start, end));
> > +
> 
> Kind of unrelated change? But good idea. Maybe for another patch.
Yes, I will move it to a separate patch in a formal version.
As an initial RFC, I hoped to show the related changes in one patch to give a
whole picture.


> >  		tdp_mmu_iter_set_spte(kvm, &iter, SHADOW_NONPRESENT_VALUE);
> >  
> >  		/*
> > @@ -1246,9 +1263,6 @@ static int tdp_mmu_link_sp(struct kvm *kvm, struct tdp_iter *iter,
> >  	return 0;
> >  }
> >  
> > -static int tdp_mmu_split_huge_page(struct kvm *kvm, struct tdp_iter *iter,
> > -				   struct kvm_mmu_page *sp, bool shared);
> > -
> >  /*
> >   * Handle a TDP page fault (NPT/EPT violation/misconfiguration) by installing
> >   * page tables and SPTEs to translate the faulting guest physical address.
> > @@ -1341,6 +1355,102 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> >  	return ret;
> >  }
> >  
> > +/*
> > + * Split large leafs at the boundary of the specified range for the mirror root
> > + */
> > +static int tdp_mmu_split_boundary_leafs(struct kvm *kvm, struct kvm_mmu_page *root,
> > +					gfn_t start, gfn_t end, bool can_yield, bool *flush)
> > +{
> > +	struct kvm_mmu_page *sp = NULL;
> > +	struct tdp_iter iter;
> > +
> > +	WARN_ON_ONCE(!can_yield);
> 
> Why pass this in then?
Right, the warning can be moved up to the caller.
Currently the only callers of kvm_split_boundary_leafs() are
kvm_arch_pre_set_memory_attributes() and kvm_gmem_punch_hole(), so can_yield is
always true.

> > +
> > +	if (!is_mirror_sp(root))
> > +		return 0;
> 
> What is special about mirror roots here?
Hmm, I thought only the mirror root needs splitting before zapping, since
re-faulting the zapped S-EPT range requires the guest's acceptance. Other roots
can tolerate zapping a larger range than required.

Maybe AMD guys can shout out if I'm wrong.

> > +	end = min(end, tdp_mmu_max_gfn_exclusive());
> > +
> > +	lockdep_assert_held_write(&kvm->mmu_lock);
> > +
> > +	rcu_read_lock();
> > +
> > +	for_each_tdp_pte_min_level(iter, kvm, root, PG_LEVEL_4K, start, end) {
> > +retry:
> > +		if (can_yield &&
> 
> Do we need this part of the conditional based on the above?
No need if we don't pass in can_yield.

> > +		    tdp_mmu_iter_cond_resched(kvm, &iter, *flush, false)) {
> > +			*flush = false;
> > +			continue;
> > +		}
> > +
> > +		if (!is_shadow_present_pte(iter.old_spte) ||
> > +		    !is_last_spte(iter.old_spte, iter.level) ||
> > +		    !iter_split_required(kvm, root, &iter, start, end))
> > +			continue;
> > +
> > +		if (!sp) {
> > +			rcu_read_unlock();
> > +
> > +			write_unlock(&kvm->mmu_lock);
> > +
> > +			sp = tdp_mmu_alloc_sp_for_split(true);
> > +
> > +			write_lock(&kvm->mmu_lock);
> > +
> > +			if (!sp) {
> > +				trace_kvm_mmu_split_huge_page(iter.gfn, iter.old_spte,
> > +							      iter.level, -ENOMEM);
> > +				return -ENOMEM;
> > +			}
> > +			rcu_read_lock();
> > +
> > +			iter.yielded = true;
> > +			continue;
> > +		}
> > +		tdp_mmu_init_child_sp(sp, &iter);
> > +
> > +		if (tdp_mmu_split_huge_page(kvm, &iter, sp, false))
> 
> I think it can't fail when you hold mmu write lock.
You are right!
Thanks for catching it.

> > +			goto retry;
> > +
> > +		sp = NULL;
> > +		/*
> > +		 * Set yielded in case after splitting to a lower level,
> > +		 * the new iter requires furter splitting.
> > +		 */
> > +		iter.yielded = true;
> > +		*flush = true;
> > +	}
> > +
> > +	rcu_read_unlock();
> > +
> > +	/* Leave it here though it should be impossible for the mirror root */
> > +	if (sp)
> > +		tdp_mmu_free_sp(sp);
> 
> What do you think about relying on tdp_mmu_split_huge_pages_root() and moving
> this to an optimization patch at the end?
> 
> Or what about just two calls to tdp_mmu_split_huge_pages_root() at the
> boundaries?
Though the two generally look the same, relying on
tdp_mmu_split_huge_pages_root() would scatter several minor changes throughout
tdp_mmu_split_huge_pages_root(), e.g. updating flush after
tdp_mmu_iter_cond_resched(), checking iter_split_required(), and setting
"iter.yielded = true".

So, it may be hard to review as an initial RFC.
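
For reference, my understanding of the two-boundary-call variant, roughly and
untested, reusing the existing tdp_mmu_split_huge_pages_root() (return value
handling omitted):

	if (!IS_ALIGNED(start, KVM_PAGES_PER_HPAGE(PG_LEVEL_2M)))
		tdp_mmu_split_huge_pages_root(kvm, root, start, start + 1, PG_LEVEL_4K, false);
	if (!IS_ALIGNED(end, KVM_PAGES_PER_HPAGE(PG_LEVEL_2M)))
		tdp_mmu_split_huge_pages_root(kvm, root, end - 1, end, PG_LEVEL_4K, false);

That still leaves the flush and iter_split_required() details above to sort out.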

I prefer to do that after Paolo and Sean have taken a look at it :)

> > +	return 0;
> > +}
> > +
> > +int kvm_tdp_mmu_gfn_range_split_boundary(struct kvm *kvm, struct kvm_gfn_range *range)
> > +{
> > +	enum kvm_tdp_mmu_root_types types;
> > +	struct kvm_mmu_page *root;
> > +	bool flush = false;
> > +	int ret;
> > +
> > +	types = kvm_gfn_range_filter_to_root_types(kvm, range->attr_filter) | KVM_INVALID_ROOTS;
> 
> What is the reason for KVM_INVALID_ROOTS in this case?
I wanted to keep consistent with that in kvm_tdp_mmu_unmap_gfn_range().
Yes, we can remove the KVM_INVALID_ROOTS.

> > +
> > +	__for_each_tdp_mmu_root_yield_safe(kvm, root, range->slot->as_id, types) {
> 
> It would be better to check for mirror roots here, instead of inside
> tdp_mmu_split_boundary_leafs().
Ok.

> 
> > +		ret = tdp_mmu_split_boundary_leafs(kvm, root, range->start, range->end,
> > +						   range->may_block, &flush);
> > +		if (ret < 0) {
> > +			if (flush)
> > +				kvm_flush_remote_tlbs(kvm);
> > +
> > +			return ret;
> > +		}
> > +	}
> > +	return flush;
> > +}
> > +
> >  /* Used by mmu notifier via kvm_unmap_gfn_range() */
> >  bool kvm_tdp_mmu_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range,
> >  				 bool flush)
> > diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
> > index 52acf99d40a0..806a21d4f0e3 100644
> > --- a/arch/x86/kvm/mmu/tdp_mmu.h
> > +++ b/arch/x86/kvm/mmu/tdp_mmu.h
> > @@ -69,6 +69,7 @@ void kvm_tdp_mmu_zap_all(struct kvm *kvm);
> >  void kvm_tdp_mmu_invalidate_roots(struct kvm *kvm,
> >  				  enum kvm_tdp_mmu_root_types root_types);
> >  void kvm_tdp_mmu_zap_invalidated_roots(struct kvm *kvm, bool shared);
> > +int kvm_tdp_mmu_gfn_range_split_boundary(struct kvm *kvm, struct kvm_gfn_range *range);
> >  
> >  int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault);
> >  
> > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > index 655d36e1f4db..19d7a577e7ed 100644
> > --- a/include/linux/kvm_host.h
> > +++ b/include/linux/kvm_host.h
> > @@ -272,6 +272,7 @@ struct kvm_gfn_range {
> >  bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range);
> >  bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
> >  bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
> > +int kvm_split_boundary_leafs(struct kvm *kvm, struct kvm_gfn_range *range);
> >  #endif
> >  
> >  enum {
> 

^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 16/21] KVM: x86/mmu: Introduce kvm_split_boundary_leafs() to split boundary leafs
  2025-05-16  7:46     ` Yan Zhao
@ 2025-05-16  8:03       ` Yan Zhao
  2025-05-16 22:27         ` Edgecombe, Rick P
  2025-05-16 11:44       ` Yan Zhao
  1 sibling, 1 reply; 294+ messages in thread
From: Yan Zhao @ 2025-05-16  8:03 UTC (permalink / raw)
  To: Edgecombe, Rick P, pbonzini@redhat.com, seanjc@google.com,
	Shutemov, Kirill, quic_eberman@quicinc.com, Li, Xiaoyao,
	kvm@vger.kernel.org, Hansen, Dave, david@redhat.com,
	thomas.lendacky@amd.com, tabba@google.com, Li, Zhiquan1, Du, Fan,
	linux-kernel@vger.kernel.org, michael.roth@amd.com, Weiny, Ira,
	vbabka@suse.cz, binbin.wu@linux.intel.com, ackerleytng@google.com,
	Yamahata, Isaku, Peng, Chao P, Annapurve, Vishal, jroedel@suse.de,
	Miao, Jun, pgonda@google.com, x86@kernel.org

On Fri, May 16, 2025 at 03:46:53PM +0800, Yan Zhao wrote:
> On Wed, May 14, 2025 at 06:56:26AM +0800, Edgecombe, Rick P wrote:
> > On Thu, 2025-04-24 at 11:08 +0800, Yan Zhao wrote:

> > >  /*
> > >   * If can_yield is true, will release the MMU lock and reschedule if the
> > >   * scheduler needs the CPU or there is contention on the MMU lock. If this
> > > @@ -991,6 +1006,8 @@ static bool tdp_mmu_zap_leafs(struct kvm *kvm, struct kvm_mmu_page *root,
> > >  		    !is_last_spte(iter.old_spte, iter.level))
> > >  			continue;
> > >  
> > > +		WARN_ON_ONCE(iter_split_required(kvm, root, &iter, start, end));
> > > +
> > 
> > Kind of unrelated change? But good idea. Maybe for another patch.
> Yes, will move it to a separate patch in a formal version.
> As initial RFC, I hoped to show related changes in one patch to allow a whole
> picture.
> 
> > > +int kvm_tdp_mmu_gfn_range_split_boundary(struct kvm *kvm, struct kvm_gfn_range *range)
> > > +{
> > > +	enum kvm_tdp_mmu_root_types types;
> > > +	struct kvm_mmu_page *root;
> > > +	bool flush = false;
> > > +	int ret;
> > > +
> > > +	types = kvm_gfn_range_filter_to_root_types(kvm, range->attr_filter) | KVM_INVALID_ROOTS;
> > 
> > What is the reason for KVM_INVALID_ROOTS in this case?
> I wanted to keep consistent with that in kvm_tdp_mmu_unmap_gfn_range().
With this consistency, we can warn in tdp_mmu_zap_leafs() as below, even though
there should be no invalid mirror root.

WARN_ON_ONCE(iter_split_required(kvm, root, &iter, start, end));
 

^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 19/21] KVM: gmem: Split huge boundary leafs for punch hole of private memory
  2025-05-13 22:59   ` Edgecombe, Rick P
@ 2025-05-16  8:19     ` Yan Zhao
  0 siblings, 0 replies; 294+ messages in thread
From: Yan Zhao @ 2025-05-16  8:19 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: pbonzini@redhat.com, seanjc@google.com, Shutemov, Kirill,
	quic_eberman@quicinc.com, Li, Xiaoyao, kvm@vger.kernel.org,
	Hansen, Dave, david@redhat.com, thomas.lendacky@amd.com,
	tabba@google.com, Li, Zhiquan1, Du, Fan,
	linux-kernel@vger.kernel.org, michael.roth@amd.com, Weiny, Ira,
	vbabka@suse.cz, binbin.wu@linux.intel.com, ackerleytng@google.com,
	Yamahata, Isaku, Peng, Chao P, Annapurve, Vishal, jroedel@suse.de,
	Miao, Jun, pgonda@google.com, x86@kernel.org

On Wed, May 14, 2025 at 06:59:01AM +0800, Edgecombe, Rick P wrote:
> On Thu, 2025-04-24 at 11:08 +0800, Yan Zhao wrote:
> > +static int kvm_gmem_invalidate_begin(struct kvm_gmem *gmem, pgoff_t start,
> > +				     pgoff_t end, bool need_split)
> >  {
> >  	bool flush = false, found_memslot = false;
> >  	struct kvm_memory_slot *slot;
> >  	struct kvm *kvm = gmem->kvm;
> >  	unsigned long index;
> > +	int ret = 0;
> >  
> >  	xa_for_each_range(&gmem->bindings, index, slot, start, end - 1) {
> >  		pgoff_t pgoff = slot->gmem.pgoff;
> > @@ -319,14 +320,23 @@ static void kvm_gmem_invalidate_begin(struct kvm_gmem *gmem, pgoff_t start,
> >  			kvm_mmu_invalidate_begin(kvm);
> >  		}
> >  
> > +		if (need_split) {
> > +			ret = kvm_split_boundary_leafs(kvm, &gfn_range);
> 
> What is the effect for other guestmemfd users? SEV doesn't need this, right? Oh
> I see, down in tdp_mmu_split_boundary_leafs() it bails on non-mirror roots. I
> don't like the naming then. It sounds deterministic, but it's really only
> necessary splits for certain VM types.
Right, kvm_split_boundary_leafs() only takes effect on the mirror root.

> I guess it all depends on how well teaching kvm_mmu_unmap_gfn_range() to fail
> goes. But otherwise, we should call it like kvm_prepare_zap_range() or
Hmm, if we call it kvm_prepare_zap_range(), we would have to invoke it for all
zaps. However, apart from kvm_gmem_punch_hole(), the other two callers,
kvm_gmem_error_folio() and kvm_gmem_release(), have no need to perform splitting
before zapping.
Passing the invalidation reason in to kvm_gmem_invalidate_begin() would also
make things complicated.

> something. And have it make it clearly do nothing for non-TDX high up where it's
> easy to see.
Would a name like kvm_split_boundary_leafs_for_mirror() be too TDX specific?

If we name it kvm_split_boundary_leafs(), SEV can simply remove the bail-out
if they want to in the future.

> 
> > +			if (ret < 0)
> > +				goto out;
> > +
> > +			flush |= ret;
> > +		}
> >  		flush |= kvm_mmu_unmap_gfn_range(kvm, &gfn_range);
> >  	}
> >  
> > +out:
> >  	if (flush)
> >  		kvm_flush_remote_tlbs(kvm);
> >  
> >  	if (found_memslot)
> >  		KVM_MMU_UNLOCK(kvm);
> > +	
> 

^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 20/21] KVM: x86: Force a prefetch fault's max mapping level to 4KB for TDX
  2025-05-13 23:20   ` Edgecombe, Rick P
@ 2025-05-16  8:43     ` Yan Zhao
  0 siblings, 0 replies; 294+ messages in thread
From: Yan Zhao @ 2025-05-16  8:43 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: pbonzini@redhat.com, seanjc@google.com, Shutemov, Kirill,
	quic_eberman@quicinc.com, Li, Xiaoyao, kvm@vger.kernel.org,
	Hansen, Dave, david@redhat.com, thomas.lendacky@amd.com,
	tabba@google.com, Li, Zhiquan1, Du, Fan,
	linux-kernel@vger.kernel.org, michael.roth@amd.com, Weiny, Ira,
	vbabka@suse.cz, binbin.wu@linux.intel.com, ackerleytng@google.com,
	Yamahata, Isaku, Peng, Chao P, Annapurve, Vishal, jroedel@suse.de,
	Miao, Jun, pgonda@google.com, x86@kernel.org

On Wed, May 14, 2025 at 07:20:18AM +0800, Edgecombe, Rick P wrote:
> On Thu, 2025-04-24 at 11:09 +0800, Yan Zhao wrote:
> > Introduce a "prefetch" parameter to the private_max_mapping_level hook and
> > enforce the max mapping level of a prefetch fault for private memory to be
> > 4KB. This is a preparation to enable the ignoring huge page splitting in
> > the fault path.
> > 
> > If a prefetch fault results in a 2MB huge leaf in the mirror page table,
> > there may not be a vCPU available to accept the corresponding 2MB huge leaf
> > in the S-EPT if the TD is not configured to receive #VE for page
> > acceptance. 
> > 
> 
> Can you elaborate on this case more. A vCPU may not be available? What does that
> mean?
Sorry. I didn't express it clearly.

If a prefetch fault results in a 2MB mapping, the guest, being unaware of the
prefetched mapping, may later accept at 4KB, triggering a demotion.

> > Consequently, if a vCPU accepts the page at 4KB level, it will
> > trigger an EPT violation to split the 2MB huge leaf generated by the
> > prefetch fault.
> 
> The case is KVM_PRE_FAULT_MEMORY faults in 2MB, then guest accepts at 4k (which
> it is not supposed to do)?
Actually, the guest is not at fault for accepting at 4KB.

> Then maybe the kvm_vm_dead() case I suggested in the other patch could handle
> this case too, and this patch could be dropped?
 
This patch is not required if we decide to support demotion in the fault path.
 
> > Since handling the BUSY error from SEAMCALLs for huge page splitting is
> > more comprehensive in the fault path, which is with kvm->mmu_lock held for
> > reading, force the max mapping level of a prefetch fault of private memory
> > to be 4KB to prevent potential splitting.
> 

^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 02/21] x86/virt/tdx: Enhance tdh_mem_page_aug() to support huge pages
  2025-05-13 18:52   ` Edgecombe, Rick P
@ 2025-05-16  9:05     ` Yan Zhao
  2025-05-16 17:10       ` Edgecombe, Rick P
  2025-06-19  9:26       ` Nikolay Borisov
  0 siblings, 2 replies; 294+ messages in thread
From: Yan Zhao @ 2025-05-16  9:05 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: pbonzini@redhat.com, seanjc@google.com, Shutemov, Kirill,
	quic_eberman@quicinc.com, Li, Xiaoyao, kvm@vger.kernel.org,
	Hansen, Dave, david@redhat.com, thomas.lendacky@amd.com,
	tabba@google.com, Li, Zhiquan1, Du, Fan,
	linux-kernel@vger.kernel.org, michael.roth@amd.com, Weiny, Ira,
	vbabka@suse.cz, binbin.wu@linux.intel.com, ackerleytng@google.com,
	Yamahata, Isaku, Peng, Chao P, Annapurve, Vishal, jroedel@suse.de,
	Miao, Jun, pgonda@google.com, x86@kernel.org

On Wed, May 14, 2025 at 02:52:49AM +0800, Edgecombe, Rick P wrote:
> On Thu, 2025-04-24 at 11:04 +0800, Yan Zhao wrote:
> > Enhance the SEAMCALL wrapper tdh_mem_page_aug() to support huge pages.
> > 
> > Verify the validity of the level and ensure that the mapping range is fully
> > contained within the page folio.
> > 
> > As a conservative solution, perform CLFLUSH on all pages to be mapped into
> > the TD before invoking the SEAMCALL TDH_MEM_PAGE_AUG. This ensures that any
> > dirty cache lines do not write back later and clobber TD memory.
> 
> This should have a brief background on why it doesn't use the arg - what is
> deficient today. Also, an explanation of how it will be used (i.e. what types of
> pages will be passed)
Will do.

> > 
> > Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
> > Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> > Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
> > ---
> >  arch/x86/virt/vmx/tdx/tdx.c | 11 ++++++++++-
> >  1 file changed, 10 insertions(+), 1 deletion(-)
> > 
> > diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> > index f5e2a937c1e7..a66d501b5677 100644
> > --- a/arch/x86/virt/vmx/tdx/tdx.c
> > +++ b/arch/x86/virt/vmx/tdx/tdx.c
> > @@ -1595,9 +1595,18 @@ u64 tdh_mem_page_aug(struct tdx_td *td, u64 gpa, int level, struct page *page, u
> >  		.rdx = tdx_tdr_pa(td),
> >  		.r8 = page_to_phys(page),
> >  	};
> > +	unsigned long nr_pages = 1 << (level * 9);
> > +	struct folio *folio = page_folio(page);
> > +	unsigned long idx = 0;
> >  	u64 ret;
> >  
> > -	tdx_clflush_page(page);
> > +	if (!(level >= TDX_PS_4K && level < TDX_PS_NR) ||
> > +	    (folio_page_idx(folio, page) + nr_pages > folio_nr_pages(folio)))
> > +		return -EINVAL;
> 
> Shouldn't KVM not try to map a huge page in this situation? Doesn't seem like a
> job for the SEAMCALL wrapper.
Ok. If the decision is to trust KVM and all potential callers, it's reasonable
to drop those checks.

> > +
> > +	while (nr_pages--)
> > +		tdx_clflush_page(nth_page(page, idx++));
> 
> clflush_cache_range() is:
> static void tdx_clflush_page(struct page *page)
> {
> 	clflush_cache_range(page_to_virt(page), PAGE_SIZE);
> }
> 
> So we have loops within loops...  Better to add an arg to tdx_clflush_page() or
> add a variant that takes one.
Ok.

One thing to note is that even with an extra arg, tdx_clflush_page() has to call
clflush_cache_range() page by page because with
"#if defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP)",
page virtual addresses are not necessarily contiguous.

What about Binbin's proposal [1]? i.e.,

while (nr_pages)
     tdx_clflush_page(nth_page(page, --nr_pages));

[1] https://lore.kernel.org/all/a7d0988d-037c-454f-bc6b-57e71b357488@linux.intel.com/
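
For example, a minimal sketch of such a variant (tdx_clflush_pages() is just a
placeholder name, untested), still flushing page by page for the SPARSEMEM
reason above:

	static void tdx_clflush_pages(struct page *page, unsigned long nr_pages)
	{
		/* Page virtual addresses may not be contiguous, so flush each page */
		while (nr_pages)
			clflush_cache_range(page_to_virt(nth_page(page, --nr_pages)),
					    PAGE_SIZE);
	}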

> > +
> >  	ret = seamcall_ret(TDH_MEM_PAGE_AUG, &args);
> >  
> >  	*ext_err1 = args.rcx;
> 

^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 02/21] x86/virt/tdx: Enhance tdh_mem_page_aug() to support huge pages
  2025-05-15  2:16   ` Chao Gao
@ 2025-05-16  9:07     ` Yan Zhao
  0 siblings, 0 replies; 294+ messages in thread
From: Yan Zhao @ 2025-05-16  9:07 UTC (permalink / raw)
  To: Chao Gao
  Cc: pbonzini, seanjc, linux-kernel, kvm, x86, rick.p.edgecombe,
	dave.hansen, kirill.shutemov, tabba, ackerleytng, quic_eberman,
	michael.roth, david, vannapurve, vbabka, jroedel, thomas.lendacky,
	pgonda, zhiquan1.li, fan.du, jun.miao, ira.weiny, isaku.yamahata,
	xiaoyao.li, binbin.wu, chao.p.peng

On Thu, May 15, 2025 at 10:16:58AM +0800, Chao Gao wrote:
> On Thu, Apr 24, 2025 at 11:04:28AM +0800, Yan Zhao wrote:
> >Enhance the SEAMCALL wrapper tdh_mem_page_aug() to support huge pages.
> >
> >Verify the validity of the level and ensure that the mapping range is fully
> >contained within the page folio.
> >
> >As a conservative solution, perform CLFLUSH on all pages to be mapped into
> >the TD before invoking the SEAMCALL TDH_MEM_PAGE_AUG. This ensures that any
> >dirty cache lines do not write back later and clobber TD memory.
> >
> >Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
> >Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> >Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
> >---
> > arch/x86/virt/vmx/tdx/tdx.c | 11 ++++++++++-
> > 1 file changed, 10 insertions(+), 1 deletion(-)
> >
> >diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> >index f5e2a937c1e7..a66d501b5677 100644
> >--- a/arch/x86/virt/vmx/tdx/tdx.c
> >+++ b/arch/x86/virt/vmx/tdx/tdx.c
> >@@ -1595,9 +1595,18 @@ u64 tdh_mem_page_aug(struct tdx_td *td, u64 gpa, int level, struct page *page, u
> > 		.rdx = tdx_tdr_pa(td),
> > 		.r8 = page_to_phys(page),
> > 	};
> >+	unsigned long nr_pages = 1 << (level * 9);
> >+	struct folio *folio = page_folio(page);
> >+	unsigned long idx = 0;
> > 	u64 ret;
> > 
> >-	tdx_clflush_page(page);
> >+	if (!(level >= TDX_PS_4K && level < TDX_PS_NR) ||
> >+	    (folio_page_idx(folio, page) + nr_pages > folio_nr_pages(folio)))
> >+		return -EINVAL;
> 
> Returning -EINVAL looks incorrect as the return type is u64.
Good catch. Thanks!
I'll think about how to handle it. It looks like it could be dropped if we trust KVM.


> >+	while (nr_pages--)
> >+		tdx_clflush_page(nth_page(page, idx++));
> >+
> > 	ret = seamcall_ret(TDH_MEM_PAGE_AUG, &args);
> > 
> > 	*ext_err1 = args.rcx;
> >-- 
> >2.43.2
> >

^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 14/21] KVM: x86/tdp_mmu: Invoke split_external_spt hook with exclusive mmu_lock
  2025-05-13 23:06   ` Edgecombe, Rick P
@ 2025-05-16  9:17     ` Yan Zhao
  2025-05-16 22:11       ` Edgecombe, Rick P
  0 siblings, 1 reply; 294+ messages in thread
From: Yan Zhao @ 2025-05-16  9:17 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: pbonzini@redhat.com, seanjc@google.com, Shutemov, Kirill,
	quic_eberman@quicinc.com, Li, Xiaoyao, kvm@vger.kernel.org,
	Hansen, Dave, david@redhat.com, thomas.lendacky@amd.com,
	tabba@google.com, Li, Zhiquan1, Du, Fan,
	linux-kernel@vger.kernel.org, michael.roth@amd.com, Weiny, Ira,
	vbabka@suse.cz, binbin.wu@linux.intel.com, ackerleytng@google.com,
	Yamahata, Isaku, Peng, Chao P, Annapurve, Vishal, jroedel@suse.de,
	Miao, Jun, pgonda@google.com, x86@kernel.org

On Wed, May 14, 2025 at 07:06:48AM +0800, Edgecombe, Rick P wrote:
> On Thu, 2025-04-24 at 11:07 +0800, Yan Zhao wrote:
> > +static int split_external_spt(struct kvm *kvm, gfn_t gfn, u64 old_spte,
> > +			      u64 new_spte, int level)
> > +{
> > +	void *external_spt = get_external_spt(gfn, new_spte, level);
> > +	int ret;
> > +
> > +	KVM_BUG_ON(!external_spt, kvm);
> > +
> > +	ret = static_call(kvm_x86_split_external_spt)(kvm, gfn, level, external_spt);
> > +	KVM_BUG_ON(ret, kvm);
> 
> Shouldn't this BUG_ON be handled in the split_external_spt implementation? I
> don't think we need another one.
Ok. But kvm_x86_split_external_spt() is not for TDX only.
Is it good for KVM MMU core to rely on each implementation to trigger BUG_ON?

> > +	return ret;
> > +}
> >  /**
> >   * handle_removed_pt() - handle a page table removed from the TDP structure
> >   *
> > @@ -764,13 +778,13 @@ static u64 tdp_mmu_set_spte(struct kvm *kvm, int as_id, tdp_ptep_t sptep,
> >  
> >  	handle_changed_spte(kvm, as_id, gfn, old_spte, new_spte, level, false);
> >  
> > -	/*
> > -	 * Users that do non-atomic setting of PTEs don't operate on mirror
> > -	 * roots, so don't handle it and bug the VM if it's seen.
> > -	 */
> >  	if (is_mirror_sptep(sptep)) {
> > -		KVM_BUG_ON(is_shadow_present_pte(new_spte), kvm);
> > -		remove_external_spte(kvm, gfn, old_spte, level);
> > +		if (!is_shadow_present_pte(new_spte))
> > +			remove_external_spte(kvm, gfn, old_spte, level);
> > +		else if (is_last_spte(old_spte, level) && !is_last_spte(new_spte, level))
> > +			split_external_spt(kvm, gfn, old_spte, new_spte, level);
> > +		else
> > +			KVM_BUG_ON(1, kvm);
> 
> It might be worth a comment on what this is checking for at this point. I think
> it's that external EPT only supports certain operations, so bug if any
> unsupported operation is seen.
Will do.
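
For example, something along these lines (comment wording is only a sketch):

	if (is_mirror_sptep(sptep)) {
		/*
		 * Only a few non-atomic transitions are supported for
		 * external (S-EPT) entries here: tearing down a present
		 * entry, or splitting a huge leaf into a child page table.
		 * Anything else indicates a KVM MMU bug, so bug the VM.
		 */
		if (!is_shadow_present_pte(new_spte))
			remove_external_spte(kvm, gfn, old_spte, level);
		else if (is_last_spte(old_spte, level) && !is_last_spte(new_spte, level))
			split_external_spt(kvm, gfn, old_spte, new_spte, level);
		else
			KVM_BUG_ON(1, kvm);
	}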

> >  	}
> >  
> >  	return old_spte;
> 

^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 09/21] KVM: TDX: Enable 2MB mapping size after TD is RUNNABLE
  2025-05-13 20:10   ` Edgecombe, Rick P
  2025-05-16  1:35     ` Huang, Kai
@ 2025-05-16  9:28     ` Yan Zhao
  1 sibling, 0 replies; 294+ messages in thread
From: Yan Zhao @ 2025-05-16  9:28 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: pbonzini@redhat.com, seanjc@google.com, Shutemov, Kirill,
	quic_eberman@quicinc.com, Li, Xiaoyao, kvm@vger.kernel.org,
	Hansen, Dave, david@redhat.com, thomas.lendacky@amd.com,
	tabba@google.com, Li, Zhiquan1, Du, Fan,
	linux-kernel@vger.kernel.org, michael.roth@amd.com, Weiny, Ira,
	vbabka@suse.cz, binbin.wu@linux.intel.com, ackerleytng@google.com,
	Yamahata, Isaku, Peng, Chao P, Annapurve, Vishal, jroedel@suse.de,
	Miao, Jun, pgonda@google.com, x86@kernel.org

On Wed, May 14, 2025 at 04:10:10AM +0800, Edgecombe, Rick P wrote:
> On Thu, 2025-04-24 at 11:06 +0800, Yan Zhao wrote:
> > Allow TDX's .private_max_mapping_level hook to return 2MB after the TD is
> > RUNNABLE, enabling KVM to map TDX private pages at the 2MB level. Remove
> > TODOs and adjust KVM_BUG_ON()s accordingly.
> > 
> > Note: Instead of placing this patch at the tail of the series, it's
> > positioned here to show the code changes for basic mapping of private huge
> > pages (i.e., transitioning from non-present to present).
> > 
> > However, since this patch also allows KVM to trigger the merging of small
> > entries into a huge leaf entry or the splitting of a huge leaf entry into
> > small entries, errors are expected if any of these operations are triggered
> > due to the current lack of splitting/merging support.
> > 
> > Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
> > Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> > Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
> > ---
> >  arch/x86/kvm/vmx/tdx.c | 16 +++++++---------
> >  1 file changed, 7 insertions(+), 9 deletions(-)
> > 
> > diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> > index e23dce59fc72..6b3a8f3e6c9c 100644
> > --- a/arch/x86/kvm/vmx/tdx.c
> > +++ b/arch/x86/kvm/vmx/tdx.c
> > @@ -1561,10 +1561,6 @@ int tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn,
> >  	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
> >  	struct page *page = pfn_to_page(pfn);
> >  
> > -	/* TODO: handle large pages. */
> > -	if (KVM_BUG_ON(level != PG_LEVEL_4K, kvm))
> > -		return -EINVAL;
> > -
> >  	/*
> >  	 * Because guest_memfd doesn't support page migration with
> >  	 * a_ops->migrate_folio (yet), no callback is triggered for KVM on page
> > @@ -1612,8 +1608,7 @@ static int tdx_sept_drop_private_spte(struct kvm *kvm, gfn_t gfn,
> >  	gpa_t gpa = gfn_to_gpa(gfn);
> >  	u64 err, entry, level_state;
> >  
> > -	/* TODO: handle large pages. */
> > -	if (KVM_BUG_ON(level != PG_LEVEL_4K, kvm))
> > +	if (KVM_BUG_ON(kvm_tdx->state != TD_STATE_RUNNABLE && level != PG_LEVEL_4K, kvm))
> 
> It's not clear why some of these warnings are here and some are in patch 4.
Patch 4 contains only the changes for the !TD_STATE_RUNNABLE stage, while this
patch allows huge pages after TD_STATE_RUNNABLE.
So this patch relaxes the condition that triggers the BUG_ON, i.e.,
before this patch, it always bugs on level > 4K;
after this patch, it only bugs on level > 4K before the TD is runnable.

> >  		return -EINVAL;
> >  
> >  	if (KVM_BUG_ON(!is_hkid_assigned(kvm_tdx), kvm))
> > @@ -1714,8 +1709,8 @@ static int tdx_sept_zap_private_spte(struct kvm *kvm, gfn_t gfn,
> >  	gpa_t gpa = gfn_to_gpa(gfn) & KVM_HPAGE_MASK(level);
> >  	u64 err, entry, level_state;
> >  
> > -	/* For now large page isn't supported yet. */
> > -	WARN_ON_ONCE(level != PG_LEVEL_4K);
> > +	/* Before TD runnable, large page is not supported */
> > +	WARN_ON_ONCE(kvm_tdx->state != TD_STATE_RUNNABLE && level != PG_LEVEL_4K);
> >  
> >  	err = tdh_mem_range_block(&kvm_tdx->td, gpa, tdx_level, &entry, &level_state);
> >  
> > @@ -1817,6 +1812,9 @@ int tdx_sept_remove_private_spte(struct kvm *kvm, gfn_t gfn,
> >  	struct page *page = pfn_to_page(pfn);
> >  	int ret;
> >  
> > +	WARN_ON_ONCE(folio_page_idx(page_folio(page), page) + KVM_PAGES_PER_HPAGE(level) >
> > +		     folio_nr_pages(page_folio(page)));
> > +
> >  	/*
> >  	 * HKID is released after all private pages have been removed, and set
> >  	 * before any might be populated. Warn if zapping is attempted when
> > @@ -3265,7 +3263,7 @@ int tdx_gmem_private_max_mapping_level(struct kvm *kvm, kvm_pfn_t pfn)
> >  	if (unlikely(to_kvm_tdx(kvm)->state != TD_STATE_RUNNABLE))
> >  		return PG_LEVEL_4K;
> >  
> > -	return PG_LEVEL_4K;
> > +	return PG_LEVEL_2M;
> 
> Maybe combine this with patch 4, or split them into sensible categories.
Sorry for the confusion.

As explained in the patch msg, the change to return PG_LEVEL_2M actually needs to
be placed at the end of the series, after the patches for page splitting/merging.

As an initial RFC, it's placed earlier to show the changes needed to enable basic
TDX huge pages (without splitting/merging).

> >  }
> >  
> >  static int tdx_online_cpu(unsigned int cpu)
> 

^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 09/21] KVM: TDX: Enable 2MB mapping size after TD is RUNNABLE
  2025-05-16  1:35     ` Huang, Kai
@ 2025-05-16  9:43       ` Yan Zhao
  2025-05-16 22:35         ` Huang, Kai
  0 siblings, 1 reply; 294+ messages in thread
From: Yan Zhao @ 2025-05-16  9:43 UTC (permalink / raw)
  To: Huang, Kai
  Cc: pbonzini@redhat.com, seanjc@google.com, Edgecombe, Rick P,
	quic_eberman@quicinc.com, Li, Xiaoyao, Shutemov, Kirill,
	Hansen, Dave, david@redhat.com, thomas.lendacky@amd.com,
	vbabka@suse.cz, tabba@google.com, Du, Fan,
	linux-kernel@vger.kernel.org, Li, Zhiquan1, Weiny, Ira,
	michael.roth@amd.com, binbin.wu@linux.intel.com,
	ackerleytng@google.com, Yamahata, Isaku, Peng, Chao P,
	kvm@vger.kernel.org, Annapurve, Vishal, jroedel@suse.de,
	Miao, Jun, pgonda@google.com, x86@kernel.org

On Fri, May 16, 2025 at 09:35:37AM +0800, Huang, Kai wrote:
> On Tue, 2025-05-13 at 20:10 +0000, Edgecombe, Rick P wrote:
> > > @@ -3265,7 +3263,7 @@ int tdx_gmem_private_max_mapping_level(struct kvm *kvm, kvm_pfn_t pfn)
> > >   	if (unlikely(to_kvm_tdx(kvm)->state != TD_STATE_RUNNABLE))
> > >   		return PG_LEVEL_4K;
> > >   
> > > -	return PG_LEVEL_4K;
> > > +	return PG_LEVEL_2M;
> > 
> > Maybe combine this with patch 4, or split them into sensible categories.
> 
> How about merge with patch 12
> 
>   [RFC PATCH 12/21] KVM: TDX: Determine max mapping level according to vCPU's 
>   ACCEPT level
> 
> instead?
> 
> Per patch 12, the fault due to TDH.MEM.PAGE.ACCPT contains fault level info, so
> KVM should just return that.  But seems we are still returning PG_LEVEL_2M if no
> such info is provided (IIUC):
Yes, without such info (tdx->violation_request_level), we always return
PG_LEVEL_2M.


> int tdx_gmem_private_max_mapping_level(struct kvm_vcpu *vcpu, kvm_pfn_t pfn, 
> 				       gfn_t gfn)
>  {
> +	struct vcpu_tdx *tdx = to_tdx(vcpu);
> +
>  	if (unlikely(to_kvm_tdx(vcpu->kvm)->state != TD_STATE_RUNNABLE))
>  		return PG_LEVEL_4K;
>  
> +	if (gfn >= tdx->violation_gfn_start && gfn < tdx->violation_gfn_end)
> +		return tdx->violation_request_level;
> +
>  	return PG_LEVEL_2M;
>  }
> 
> So why not returning PT_LEVEL_4K at the end?
>
> I am asking because below text mentioned in the coverletter:
> 
>     A rare case that could lead to splitting in the fault path is when a TD
>     is configured to receive #VE and accesses memory before the ACCEPT
>     operation. By the time a vCPU accesses a private GFN, due to the lack
>     of any guest preferred level, KVM could create a mapping at 2MB level.
>     If the TD then only performs the ACCEPT operation at 4KB level,
>     splitting in the fault path will be triggered. However, this is not
>     regarded as a typical use case, as usually TD always accepts pages in
>     the order from 1GB->2MB->4KB. The worst outcome to ignore the resulting
>     splitting request is an endless EPT violation. This would not happen
>     for a Linux guest, which does not expect any #VE.
> 
> Changing to return PT_LEVEL_4K should avoid this problem.  It doesn't hurt
For TDs that expect #VE, guests access private memory before accepting it.
In that case, when KVM receives the EPT violation, there's no expected level from
the TDX module. Returning PT_LEVEL_4K at the end basically disables huge pages
for those TDs.

Besides, according to Kirill [1], the order from 1GB->2MB->4KB is only the case
for linux guests.

[1] https://lore.kernel.org/all/6vdj4mfxlyvypn743klxq5twda66tkugwzljdt275rug2gmwwl@zdziylxpre6y/#t

> normal cases either, since guest will always do ACCEPT (which contains the
> accepting level) before accessing the memory.

^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 04/21] KVM: TDX: Enforce 4KB mapping level during TD build Time
  2025-05-15 17:32       ` Edgecombe, Rick P
@ 2025-05-16 10:05         ` Yan Zhao
  0 siblings, 0 replies; 294+ messages in thread
From: Yan Zhao @ 2025-05-16 10:05 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: kvm@vger.kernel.org, Li, Xiaoyao, quic_eberman@quicinc.com,
	Hansen, Dave, david@redhat.com, Li, Zhiquan1, tabba@google.com,
	vbabka@suse.cz, thomas.lendacky@amd.com, michael.roth@amd.com,
	seanjc@google.com, Weiny, Ira, linux-kernel@vger.kernel.org,
	pbonzini@redhat.com, ackerleytng@google.com, Yamahata, Isaku,
	binbin.wu@linux.intel.com, Peng, Chao P, Du, Fan,
	Annapurve, Vishal, jroedel@suse.de, Miao, Jun, Shutemov, Kirill,
	pgonda@google.com, x86@kernel.org

On Fri, May 16, 2025 at 01:32:44AM +0800, Edgecombe, Rick P wrote:
> On Thu, 2025-05-15 at 17:16 +0800, Yan Zhao wrote:
> > On Wed, May 14, 2025 at 03:12:10AM +0800, Edgecombe, Rick P wrote:
> > > On Thu, 2025-04-24 at 11:05 +0800, Yan Zhao wrote:
> > > > During the TD build phase (i.e., before the TD becomes RUNNABLE), enforce a
> > > > 4KB mapping level both in the S-EPT managed by the TDX module and the
> > > > mirror page table managed by KVM.
> > > > 
> > > > During this phase, TD's memory is added via tdh_mem_page_add(), which only
> > > > accepts 4KB granularity. Therefore, return PG_LEVEL_4K in TDX's
> > > > .private_max_mapping_level hook to ensure KVM maps at the 4KB level in the
> > > > mirror page table. Meanwhile, iterate over each 4KB page of a large gmem
> > > > backend page in tdx_gmem_post_populate() and invoke tdh_mem_page_add() to
> > > > map at the 4KB level in the S-EPT.
> > > > 
> > > > Still allow huge pages in gmem backend during TD build time. Based on [1],
> > > > which gmem series allows 2MB TPH and non-in-place conversion, pass in
> > > > region.nr_pages to kvm_gmem_populate() in tdx_vcpu_init_mem_region().
> > > > 
> > > 
> > > This commit log will need to be written with upstream in mind when it is out of
> > > RFC.
> > Ok.
> > 
> >  
> > > >  This
> > > > enables kvm_gmem_populate() to allocate huge pages from the gmem backend
> > > > when the remaining nr_pages, GFN alignment, and page private/shared
> > > > attribute permit.  KVM is then able to promote the initial 4K mapping to
> > > > huge after TD is RUNNABLE.
> > > > 
> > > > Disallow any private huge pages during TD build time. Use BUG_ON() in
> > > > tdx_mem_page_record_premap_cnt() and tdx_is_sept_zap_err_due_to_premap() to
> > > > assert the mapping level is 4KB.
> > > > 
> > > > Opportunistically, remove unused parameters in
> > > > tdx_mem_page_record_premap_cnt().
> > > > 
> > > > Link: https://lore.kernel.org/all/20241212063635.712877-1-michael.roth@amd.com [1]
> > > > Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
> > > > ---
> > > >  arch/x86/kvm/vmx/tdx.c | 45 ++++++++++++++++++++++++++++--------------
> > > >  1 file changed, 30 insertions(+), 15 deletions(-)
> > > > 
> > > > diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> > > > index 98cde20f14da..03885cb2869b 100644
> > > > --- a/arch/x86/kvm/vmx/tdx.c
> > > > +++ b/arch/x86/kvm/vmx/tdx.c
> > > > @@ -1530,14 +1530,16 @@ static int tdx_mem_page_aug(struct kvm *kvm, gfn_t gfn,
> > > >   * The counter has to be zero on KVM_TDX_FINALIZE_VM, to ensure that there
> > > >   * are no half-initialized shared EPT pages.
> > > >   */
> > > > -static int tdx_mem_page_record_premap_cnt(struct kvm *kvm, gfn_t gfn,
> > > > -					  enum pg_level level, kvm_pfn_t pfn)
> > > > +static int tdx_mem_page_record_premap_cnt(struct kvm *kvm, enum pg_level level)
> > > >  {
> > > >  	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
> > > >  
> > > >  	if (KVM_BUG_ON(kvm->arch.pre_fault_allowed, kvm))
> > > >  		return -EINVAL;
> > > >  
> > > > +	if (KVM_BUG_ON(level != PG_LEVEL_4K, kvm))
> > > > +		return -EINVAL;
> > > > +
> > > >  	/* nr_premapped will be decreased when tdh_mem_page_add() is called. */
> > > >  	atomic64_inc(&kvm_tdx->nr_premapped);
> > > >  	return 0;
> > > > @@ -1571,7 +1573,7 @@ int tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn,
> > > >  	if (likely(kvm_tdx->state == TD_STATE_RUNNABLE))
> > > >  		return tdx_mem_page_aug(kvm, gfn, level, page);
> > > >  
> > > > -	return tdx_mem_page_record_premap_cnt(kvm, gfn, level, pfn);
> > > > +	return tdx_mem_page_record_premap_cnt(kvm, level);
> > > >  }
> > > >  
> > > >  static int tdx_sept_drop_private_spte(struct kvm *kvm, gfn_t gfn,
> > > > @@ -1666,7 +1668,7 @@ int tdx_sept_link_private_spt(struct kvm *kvm, gfn_t gfn,
> > > >  static int tdx_is_sept_zap_err_due_to_premap(struct kvm_tdx *kvm_tdx, u64 err,
> > > >  					     u64 entry, int level)
> > > >  {
> > > > -	if (!err || kvm_tdx->state == TD_STATE_RUNNABLE)
> > > > +	if (!err || kvm_tdx->state == TD_STATE_RUNNABLE || level > PG_LEVEL_4K)
> > > >  		return false;
> > > 
> > > This is catching zapping huge pages before the TD is runnable? Is it necessary
> > > if we are already warning about mapping huge pages before the TD is runnable in
> > > tdx_mem_page_record_premap_cnt()?
> > Under normal conditions, this check isn't necessary.
> > I added this check in case of bugs in the KVM core MMU where the mirror page table
> > might be updated to huge without notifying the TDX side.
> > Am I overthinking?
> 
> If we need so many BUG_ON()s maybe our design is too fragile. I think we could
> drop this one.
Got it.

> > 
> > 
> > > >  	if (err != (TDX_EPT_ENTRY_STATE_INCORRECT | TDX_OPERAND_ID_RCX))
> > > > @@ -3052,8 +3054,8 @@ struct tdx_gmem_post_populate_arg {
> > > >  	__u32 flags;
> > > >  };
> > > >  
> > > > -static int tdx_gmem_post_populate(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
> > > > -				  void __user *src, int order, void *_arg)
> > > > +static int tdx_gmem_post_populate_4k(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
> > > > +				     void __user *src, void *_arg)
> > > >  {
> > > >  	u64 error_code = PFERR_GUEST_FINAL_MASK | PFERR_PRIVATE_ACCESS;
> > > >  	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
> > > > @@ -3120,6 +3122,21 @@ static int tdx_gmem_post_populate(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
> > > >  	return ret;
> > > >  }
> > > >  
> > > > +static int tdx_gmem_post_populate(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
> > > > +				  void __user *src, int order, void *_arg)
> > > > +{
> > > > +	unsigned long i, npages = 1 << order;
> > > > +	int ret;
> > > > +
> > > > +	for (i = 0; i < npages; i++) {
> > > > +		ret = tdx_gmem_post_populate_4k(kvm, gfn + i, pfn + i,
> > > > +						src + i * PAGE_SIZE, _arg);
> > > > +		if (ret)
> > > > +			return ret;
> > > > +	}
> > > > +	return 0;
> > > > +}
> > > > +
> > > >  static int tdx_vcpu_init_mem_region(struct kvm_vcpu *vcpu, struct kvm_tdx_cmd *cmd)
> > > >  {
> > > >  	struct vcpu_tdx *tdx = to_tdx(vcpu);
> > > > @@ -3166,20 +3183,15 @@ static int tdx_vcpu_init_mem_region(struct kvm_vcpu *vcpu, struct kvm_tdx_cmd *c
> > > >  		};
> > > >  		gmem_ret = kvm_gmem_populate(kvm, gpa_to_gfn(region.gpa),
> > > >  					     u64_to_user_ptr(region.source_addr),
> > > > -					     1, tdx_gmem_post_populate, &arg);
> > > > +					     region.nr_pages, tdx_gmem_post_populate, &arg);
> > > >  		if (gmem_ret < 0) {
> > > >  			ret = gmem_ret;
> > > >  			break;
> > > >  		}
> > > >  
> > > > -		if (gmem_ret != 1) {
> > This line is removed.
> 
> Doh! Right.
> 
> > 
> > > > -			ret = -EIO;
> > > > -			break;
> > > > -		}
> > > > -
> > > > -		region.source_addr += PAGE_SIZE;
> > > > -		region.gpa += PAGE_SIZE;
> > > > -		region.nr_pages--;
> > > > +		region.source_addr += PAGE_SIZE * gmem_ret;
> > > 
> > > gmem_ret has to be 1, per the above conditional.
> > As region.nr_pages instead of 1 is passed into kvm_gmem_populate(), gmem_ret
> > can now be greater than 1.
> > 
> > kvm_gmem_populate() can allocate huge backend pages if region.nr_pages, GFN
> > alignment, and shareability permit.
> > 
> > > > +		region.gpa += PAGE_SIZE * gmem_ret;
> > > > +		region.nr_pages -= gmem_ret;
> > > >  
> > > >  		cond_resched();
> > > >  	}
> > > > @@ -3224,6 +3236,9 @@ int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp)
> > > >  
> > > >  int tdx_gmem_private_max_mapping_level(struct kvm *kvm, kvm_pfn_t pfn)
> > > >  {
> > > > +	if (unlikely(to_kvm_tdx(kvm)->state != TD_STATE_RUNNABLE))
> > > > +		return PG_LEVEL_4K;
> > > > +
> > > >  	return PG_LEVEL_4K;
> > > 
> > > ^ Change does nothing...
> > Right. Patch 9 will update the default level to PG_LEVEL_2M.
> > 
> > The change here is meant to highlight PG_LEVEL_4K is enforced in
> > tdx_gmem_private_max_mapping_level() when TD is not in TD_STATE_RUNNABLE state.
> > 
> > Will change it in the next version.
> 
> I can't see the pattern between what goes in this patch vs patch 9. We should
> have some reasoning behind it.
tdx_gmem_private_max_mapping_level() actually doesn't need to change in this
patch, as both cases return PG_LEVEL_4K.

My reasoning was that I hoped this patch would show as much as possible of the
whole picture of TDX huge page support before the TD is runnable.

Without this superfluous modification, the fact that PG_LEVEL_4K is returned
before the TD is runnable would not be visible in this patch.

It's definitely reasonable for later versions to drop the hunk that changes
tdx_gmem_private_max_mapping_level().


> > 
> > > >  }
> > > >  
> > > 
> 

^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 16/21] KVM: x86/mmu: Introduce kvm_split_boundary_leafs() to split boundary leafs
  2025-05-16  7:46     ` Yan Zhao
  2025-05-16  8:03       ` Yan Zhao
@ 2025-05-16 11:44       ` Yan Zhao
  2025-05-16 22:16         ` Edgecombe, Rick P
  1 sibling, 1 reply; 294+ messages in thread
From: Yan Zhao @ 2025-05-16 11:44 UTC (permalink / raw)
  To: Edgecombe, Rick P, pbonzini@redhat.com, seanjc@google.com,
	Shutemov, Kirill, quic_eberman@quicinc.com, Li, Xiaoyao,
	kvm@vger.kernel.org, Hansen, Dave, david@redhat.com,
	thomas.lendacky@amd.com, tabba@google.com, Li, Zhiquan1, Du, Fan,
	linux-kernel@vger.kernel.org, michael.roth@amd.com, Weiny, Ira,
	vbabka@suse.cz, binbin.wu@linux.intel.com, ackerleytng@google.com,
	Yamahata, Isaku, Peng, Chao P, Annapurve, Vishal, jroedel@suse.de,
	Miao, Jun, pgonda@google.com, x86@kernel.org

On Fri, May 16, 2025 at 03:46:53PM +0800, Yan Zhao wrote:
> > > +			goto retry;
> > > +
> > > +		sp = NULL;
> > > +		/*
> > > +		 * Set yielded in case after splitting to a lower level,
> > > +		 * the new iter requires furter splitting.
> > > +		 * the new iter requires further splitting.
> > > +		iter.yielded = true;
> > > +		*flush = true;
> > > +	}
> > > +
> > > +	rcu_read_unlock();
> > > +
> > > +	/* Leave it here though it should be impossible for the mirror root */
> > > +	if (sp)
> > > +		tdp_mmu_free_sp(sp);
> > 
> > What do you think about relying on tdp_mmu_split_huge_pages_root() and moving
> > this to an optimization patch at the end?
> > 
> > Or what about just two calls to tdp_mmu_split_huge_pages_root() at the
> > boundaries?
> Though the two generally look the same, relying on
> tdp_mmu_split_huge_pages_root() would scatter several minor changes throughout
> tdp_mmu_split_huge_pages_root().
> 
> e.g. update flush after tdp_mmu_iter_cond_resched(), check
> iter_split_required(), set "iter.yielded = true".
> 
> So, it may be hard to review as an initial RFC.
> 
> I prefer to do that after Paolo and Sean have taken a look of it :)

Oh, I might have misunderstood your meaning.
Yes, if necessary, we can provide a separate patch at the end to combine the code
of tdp_mmu_split_huge_pages_root() and tdp_mmu_split_boundary_leafs().

^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 02/21] x86/virt/tdx: Enhance tdh_mem_page_aug() to support huge pages
  2025-05-16  9:05     ` Yan Zhao
@ 2025-05-16 17:10       ` Edgecombe, Rick P
  2025-06-19  9:26       ` Nikolay Borisov
  1 sibling, 0 replies; 294+ messages in thread
From: Edgecombe, Rick P @ 2025-05-16 17:10 UTC (permalink / raw)
  To: Zhao, Yan Y
  Cc: kvm@vger.kernel.org, Li, Xiaoyao, quic_eberman@quicinc.com,
	Hansen, Dave, david@redhat.com, Li, Zhiquan1, tabba@google.com,
	vbabka@suse.cz, thomas.lendacky@amd.com, michael.roth@amd.com,
	seanjc@google.com, Weiny, Ira, linux-kernel@vger.kernel.org,
	pbonzini@redhat.com, ackerleytng@google.com, Yamahata, Isaku,
	binbin.wu@linux.intel.com, Peng, Chao P, Du, Fan,
	Annapurve, Vishal, jroedel@suse.de, Miao, Jun, Shutemov, Kirill,
	pgonda@google.com, x86@kernel.org

On Fri, 2025-05-16 at 17:05 +0800, Yan Zhao wrote:
> > So we have loops within loops...  Better to add an arg to tdx_clflush_page()
> > or
> > add a variant that takes one.
> Ok.
> 
> One thing to note is that even with an extra arg, tdx_clflush_page() has to
> call
> clflush_cache_range() page by page because with
> "#if defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP)",
> page virtual addresses are not necessarily contiguous.
> 
> What about Binbin's proposal [1]? i.e.,
> 
> while (nr_pages)
>      tdx_clflush_page(nth_page(page, --nr_pages));
> 
> [1]
> https://lore.kernel.org/all/a7d0988d-037c-454f-bc6b-57e71b357488@linux.intel.com/

These SEAMCALLs are handling physically contiguous pages so I don't think we
need to worry about that. But Binbin's suggestion seems fine too.

^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 06/21] KVM: TDX: Assert the reclaimed pages were mapped as expected
  2025-05-16  2:11     ` Yan Zhao
@ 2025-05-16 17:34       ` Edgecombe, Rick P
  0 siblings, 0 replies; 294+ messages in thread
From: Edgecombe, Rick P @ 2025-05-16 17:34 UTC (permalink / raw)
  To: Zhao, Yan Y
  Cc: kvm@vger.kernel.org, Li, Xiaoyao, quic_eberman@quicinc.com,
	Hansen, Dave, david@redhat.com, Li, Zhiquan1, tabba@google.com,
	vbabka@suse.cz, thomas.lendacky@amd.com, michael.roth@amd.com,
	seanjc@google.com, Weiny, Ira, linux-kernel@vger.kernel.org,
	pbonzini@redhat.com, ackerleytng@google.com, Yamahata, Isaku,
	binbin.wu@linux.intel.com, Peng, Chao P, Du, Fan,
	Annapurve, Vishal, jroedel@suse.de, Miao, Jun, Shutemov, Kirill,
	pgonda@google.com, x86@kernel.org

On Fri, 2025-05-16 at 10:11 +0800, Yan Zhao wrote:
> > No callers in the series pass anything other than PG_LEVEL_4K, so do we need
> > this patch?
> Oh, this patch is only for future VM shutdown optimization where huge guest
> pages could be reclaimed.
> We can of course include it in the VM shutdown optimization series if you think
> it's better.

I think it's better.

^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 07/21] KVM: TDX: Add a helper for WBINVD on huge pages with TD's keyID
  2025-05-16  3:03     ` Yan Zhao
@ 2025-05-16 17:35       ` Edgecombe, Rick P
  0 siblings, 0 replies; 294+ messages in thread
From: Edgecombe, Rick P @ 2025-05-16 17:35 UTC (permalink / raw)
  To: Zhao, Yan Y
  Cc: kvm@vger.kernel.org, Li, Xiaoyao, quic_eberman@quicinc.com,
	Hansen, Dave, david@redhat.com, Li, Zhiquan1, tabba@google.com,
	vbabka@suse.cz, thomas.lendacky@amd.com, michael.roth@amd.com,
	seanjc@google.com, Weiny, Ira, linux-kernel@vger.kernel.org,
	pbonzini@redhat.com, ackerleytng@google.com, Yamahata, Isaku,
	binbin.wu@linux.intel.com, Peng, Chao P, Du, Fan,
	Annapurve, Vishal, jroedel@suse.de, Miao, Jun, Shutemov, Kirill,
	pgonda@google.com, x86@kernel.org

On Fri, 2025-05-16 at 11:03 +0800, Yan Zhao wrote:
> > Hmm, did you consider changing tdh_phymem_page_wbinvd_hkid()? It's the
> > pattern
> > of KVM wrapping the SEAMCALL helpers to do some more work that needs to be
> > wrapped.
> SEAMCALL TDH_PHYMEM_PAGE_WBINVD only accepts a 4KB page.
> Will move the loop from KVM to the wrapper in x86 if you think it's better.

"Don't wrap the wrappers" was a suggestion from Dave. Let's try it.
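
For the record, a rough sketch of what that could look like on the x86 side,
assuming the existing 4KB helper keeps its current signature (the *_level() name
and the loop placement are only illustrative):

/* Sketch only: move the per-4KB-page loop from KVM into the x86 wrapper. */
u64 tdh_phymem_page_wbinvd_hkid_level(u64 hkid, struct page *page, int level)
{
	unsigned long nr_pages = 1UL << (level * 9);	/* TDX_PS_* level */
	unsigned long i;
	u64 err = 0;

	for (i = 0; i < nr_pages && !err; i++)
		err = tdh_phymem_page_wbinvd_hkid(hkid, nth_page(page, i));

	return err;
}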

^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 10/21] KVM: x86/mmu: Disallow page merging (huge page adjustment) for mirror root
  2025-05-16  4:01     ` Yan Zhao
@ 2025-05-16 17:50       ` Edgecombe, Rick P
  2025-05-19  3:57         ` Yan Zhao
  0 siblings, 1 reply; 294+ messages in thread
From: Edgecombe, Rick P @ 2025-05-16 17:50 UTC (permalink / raw)
  To: Zhao, Yan Y
  Cc: kvm@vger.kernel.org, Li, Xiaoyao, quic_eberman@quicinc.com,
	Hansen, Dave, david@redhat.com, Li, Zhiquan1, tabba@google.com,
	vbabka@suse.cz, thomas.lendacky@amd.com, michael.roth@amd.com,
	seanjc@google.com, Weiny, Ira, linux-kernel@vger.kernel.org,
	pbonzini@redhat.com, ackerleytng@google.com, Yamahata, Isaku,
	binbin.wu@linux.intel.com, Peng, Chao P, Du, Fan,
	Annapurve, Vishal, jroedel@suse.de, Miao, Jun, Shutemov, Kirill,
	pgonda@google.com, x86@kernel.org

On Fri, 2025-05-16 at 12:01 +0800, Yan Zhao wrote:
> > Maybe we should rename nx_huge_page_workaround_enabled to something more
> > generic
> > and do the is_mirror logic in kvm_mmu_do_page_fault() when setting it. It
> > should
> > shrink the diff and centralize the logic.
> Hmm, I'm reluctant to rename nx_huge_page_workaround_enabled, because
> 
> (1) Invoking disallowed_hugepage_adjust() for mirror root is to disable page
>     promotion for TDX private memory, so is only applied to TDP MMU.
> (2) nx_huge_page_workaround_enabled is used specifically for nx huge pages.
>     fault->huge_page_disallowed = fault->exec && fault-
> >nx_huge_page_workaround_enabled;

Oh, good point.

> 
>     if (fault->huge_page_disallowed)
>         account_nx_huge_page(vcpu->kvm, sp,
>                              fault->req_level >= it.level);
>     
>     sp->nx_huge_page_disallowed = fault->huge_page_disallowed.
> 
>     Affecting fault->huge_page_disallowed would impact
>     sp->nx_huge_page_disallowed as well and would disable huge pages entirely.
> 
>     So, we still need to keep nx_huge_page_workaround_enabled.
> 
> If we introduce a new flag fault->disable_hugepage_adjust, and set it in
> kvm_mmu_do_page_fault(), we would also need to invoke
> tdp_mmu_get_root_for_fault() there as well.
> 
> Checking for mirror root for non-TDX VMs is not necessary, and the invocation
> of
> tdp_mmu_get_root_for_fault() seems redundant with the one in
> kvm_tdp_mmu_map().

Also true. What about a wrapper for MMU code to check instead of fault-
>nx_huge_page_workaround_enabled then?

Also, why not check is_mirror_sp() in disallowed_hugepage_adjust() instead of
passing in an is_mirror arg?

There must be a way to have it fit in better with disallowed_hugepage_adjust()
without adding so much open coded boolean logic.


^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 12/21] KVM: TDX: Determine max mapping level according to vCPU's ACCEPT level
  2025-05-16  6:30     ` Yan Zhao
@ 2025-05-16 22:02       ` Edgecombe, Rick P
  2025-05-19  6:39         ` Yan Zhao
  0 siblings, 1 reply; 294+ messages in thread
From: Edgecombe, Rick P @ 2025-05-16 22:02 UTC (permalink / raw)
  To: Zhao, Yan Y
  Cc: kvm@vger.kernel.org, Li, Xiaoyao, quic_eberman@quicinc.com,
	Hansen, Dave, david@redhat.com, Li, Zhiquan1, tabba@google.com,
	vbabka@suse.cz, thomas.lendacky@amd.com, michael.roth@amd.com,
	seanjc@google.com, Weiny, Ira, linux-kernel@vger.kernel.org,
	pbonzini@redhat.com, ackerleytng@google.com, Yamahata, Isaku,
	binbin.wu@linux.intel.com, Peng, Chao P, Du, Fan,
	Annapurve, Vishal, jroedel@suse.de, Miao, Jun, Shutemov, Kirill,
	pgonda@google.com, x86@kernel.org

On Fri, 2025-05-16 at 14:30 +0800, Yan Zhao wrote:
> > Looking more closely, I don't see why it's too hard to pass in a
> > max_fault_level
> > into the fault struct. Totally untested rough idea, what do you think?
> Thanks for bringing this up and providing the idea below. In the previous TDX
> huge page v8, there's a similar implementation [1] [2].
> 
> This series did not adopt that approach because that approach requires
> tdx_handle_ept_violation() to pass in max_fault_level, which is not always
> available at that stage. e.g.
> 
> In patch 19, when vCPU 1 faults on a GFN at 2MB level and then vCPU 2 faults
> on
> the same GFN at 4KB level, TDX wants to ignore the demotion request caused by
> vCPU 2's 4KB level fault. So, patch 19 sets tdx->violation_request_level to
> 2MB
> in vCPU 2's split callback and fails the split. vCPU 2's
> __vmx_handle_ept_violation() will see RET_PF_RETRY and either do local retry
> (or
> return to the guest).

I think you mean patch 20 "KVM: x86: Force a prefetch fault's max mapping level
to 4KB for TDX"?

> 
> If it retries locally, tdx_gmem_private_max_mapping_level() will return
> tdx->violation_request_level, causing KVM to fault at 2MB level for vCPU 2,
> resulting in a spurious fault, eventually returning to the guest.
> 
> As tdx->violation_request_level is per-vCPU and it resets in
> tdx_get_accept_level() in tdx_handle_ept_violation() (meaning it resets after
> each invocation of tdx_handle_ept_violation() and only affects the TDX local
> retry loop), it should not hold any stale value.
> 
> Alternatively, instead of having tdx_gmem_private_max_mapping_level() to
> return
> tdx->violation_request_level, tdx_handle_ept_violation() could grab
> tdx->violation_request_level as the max_fault_level to pass to
> __vmx_handle_ept_violation().
> 
> This series chose to use tdx_gmem_private_max_mapping_level() to avoid
> modification to the KVM MMU core.

It sounds like Kirill is suggesting we do have to have demotion in the fault
path. IIRC it adds a lock, but the cost to skip fault path demotion seems to be
adding up.

> 
> [1]
> https://lore.kernel.org/all/4d61104bff388a081ff8f6ae4ac71e05a13e53c3.1708933624.git.isaku.yamahata@intel.com/
> [2
> ]https://lore.kernel.org/all/3d2a6bfb033ee1b51f7b875360bd295376c32b54.17089336
> 24.git.isaku.yamahata@intel.com/


^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 14/21] KVM: x86/tdp_mmu: Invoke split_external_spt hook with exclusive mmu_lock
  2025-05-16  9:17     ` Yan Zhao
@ 2025-05-16 22:11       ` Edgecombe, Rick P
  2025-05-19  4:01         ` Yan Zhao
  0 siblings, 1 reply; 294+ messages in thread
From: Edgecombe, Rick P @ 2025-05-16 22:11 UTC (permalink / raw)
  To: Zhao, Yan Y
  Cc: kvm@vger.kernel.org, Li, Xiaoyao, quic_eberman@quicinc.com,
	Hansen, Dave, david@redhat.com, Li, Zhiquan1, tabba@google.com,
	vbabka@suse.cz, thomas.lendacky@amd.com, michael.roth@amd.com,
	seanjc@google.com, Weiny, Ira, linux-kernel@vger.kernel.org,
	pbonzini@redhat.com, ackerleytng@google.com, Yamahata, Isaku,
	binbin.wu@linux.intel.com, Peng, Chao P, Du, Fan,
	Annapurve, Vishal, jroedel@suse.de, Miao, Jun, Shutemov, Kirill,
	pgonda@google.com, x86@kernel.org

On Fri, 2025-05-16 at 17:17 +0800, Yan Zhao wrote:
> > Shouldn't this BUG_ON be handled in the split_external_spt implementation? I
> > don't think we need another one.
> Ok. But kvm_x86_split_external_spt() is not for TDX only.
> Is it good for KVM MMU core to rely on each implementation to trigger BUG_ON?

It effectively is for TDX only. At least for the foreseeable future. The naming
basically means that people don't have to see "TDX" everywhere when they look in
the MMU code.

> 
> > 

^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 16/21] KVM: x86/mmu: Introduce kvm_split_boundary_leafs() to split boundary leafs
  2025-05-16 11:44       ` Yan Zhao
@ 2025-05-16 22:16         ` Edgecombe, Rick P
  0 siblings, 0 replies; 294+ messages in thread
From: Edgecombe, Rick P @ 2025-05-16 22:16 UTC (permalink / raw)
  To: quic_eberman@quicinc.com, Li, Xiaoyao, Shutemov, Kirill,
	Hansen, Dave, david@redhat.com, Li, Zhiquan1, Zhao, Yan Y,
	tabba@google.com, thomas.lendacky@amd.com, michael.roth@amd.com,
	seanjc@google.com, Weiny, Ira, linux-kernel@vger.kernel.org,
	pbonzini@redhat.com, ackerleytng@google.com, Yamahata, Isaku,
	vbabka@suse.cz, Peng, Chao P, Du, Fan, binbin.wu@linux.intel.com,
	Annapurve, Vishal, jroedel@suse.de, Miao, Jun,
	kvm@vger.kernel.org, pgonda@google.com, x86@kernel.org

On Fri, 2025-05-16 at 19:44 +0800, Yan Zhao wrote:
> > > What do you think about relying on tdp_mmu_split_huge_pages_root() and
> > > moving
> > > this to an optimization patch at the end?
> > > 
> > > Or what about just two calls to tdp_mmu_split_huge_pages_root() at the
> > > boundaries?
> > Though the two generally look the same, relying on
> > tdp_mmu_split_huge_pages_root() would scatter several minor changes throughout
> > tdp_mmu_split_huge_pages_root().
> > 
> > e.g. update flush after tdp_mmu_iter_cond_resched(), check
> > iter_split_required(), set "iter.yielded = true".
> > 
> > So, it may be hard to review as an initial RFC.
> > 
> > I prefer to do that after Paolo and Sean have taken a look of it :)
> 
> Oh, I might have misunderstood your meaning.
> Yes, if necessary, we can provide a separate patch at the end to combine the
> code of tdp_mmu_split_huge_pages_root() and tdp_mmu_split_boundary_leafs().

Hmm, I'm not sure if they will look at this version or wait until Intel folks
work through it a bit. As for reviewability, the log could simply explain that
tdp_mmu_split_huge_pages_root() is the simple option and an optimization patch
will follow. I think it's helpful to separate optimization from implementation.
It can be confusing which change is for which purpose.

^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 16/21] KVM: x86/mmu: Introduce kvm_split_boundary_leafs() to split boundary leafs
  2025-05-16  8:03       ` Yan Zhao
@ 2025-05-16 22:27         ` Edgecombe, Rick P
  2025-05-19  8:12           ` Yan Zhao
  0 siblings, 1 reply; 294+ messages in thread
From: Edgecombe, Rick P @ 2025-05-16 22:27 UTC (permalink / raw)
  To: quic_eberman@quicinc.com, Li, Xiaoyao, Shutemov, Kirill,
	Hansen, Dave, david@redhat.com, Li, Zhiquan1, Zhao, Yan Y,
	tabba@google.com, thomas.lendacky@amd.com, michael.roth@amd.com,
	seanjc@google.com, Weiny, Ira, linux-kernel@vger.kernel.org,
	pbonzini@redhat.com, ackerleytng@google.com, Yamahata, Isaku,
	vbabka@suse.cz, Peng, Chao P, Du, Fan, binbin.wu@linux.intel.com,
	Annapurve, Vishal, jroedel@suse.de, Miao, Jun,
	kvm@vger.kernel.org, pgonda@google.com, x86@kernel.org

On Fri, 2025-05-16 at 16:03 +0800, Yan Zhao wrote:
> > 
> > > > +int kvm_tdp_mmu_gfn_range_split_boundary(struct kvm *kvm, struct
> > > > kvm_gfn_range *range)
> > > > +{
> > > > +	enum kvm_tdp_mmu_root_types types;
> > > > +	struct kvm_mmu_page *root;
> > > > +	bool flush = false;
> > > > +	int ret;
> > > > +
> > > > +	types = kvm_gfn_range_filter_to_root_types(kvm, range-
> > > > >attr_filter) | KVM_INVALID_ROOTS;
> > > 
> > > What is the reason for KVM_INVALID_ROOTS in this case?
> > I wanted to keep consistent with that in kvm_tdp_mmu_unmap_gfn_range().

Yea, lack of consistency would raise other questions.

> With this consistency, we can warn in tdp_mmu_zap_leafs() as below though
> there should be no invalid mirror root.
> 
> WARN_ON_ONCE(iter_split_required(kvm, root, &iter, start, end));
>  

Hmm, let's be clear about the logic. This is essentially a mirror TDP only
function, and there we don't have the same invalid root scenarios as the more
complicated cases. I'm not exactly sure how we could hit the warning if they
didn't match. I guess a hole punch on the fd while the TD is getting torn down?

Let's comment the reasoning at least.

^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 09/21] KVM: TDX: Enable 2MB mapping size after TD is RUNNABLE
  2025-05-16  9:43       ` Yan Zhao
@ 2025-05-16 22:35         ` Huang, Kai
  2025-05-16 23:47           ` Edgecombe, Rick P
  2025-05-19  8:32           ` Yan Zhao
  0 siblings, 2 replies; 294+ messages in thread
From: Huang, Kai @ 2025-05-16 22:35 UTC (permalink / raw)
  To: Zhao, Yan Y
  Cc: quic_eberman@quicinc.com, Li, Xiaoyao, Shutemov, Kirill,
	Hansen, Dave, david@redhat.com, thomas.lendacky@amd.com,
	vbabka@suse.cz, Li, Zhiquan1, Du, Fan, michael.roth@amd.com,
	linux-kernel@vger.kernel.org, seanjc@google.com, Weiny, Ira,
	pbonzini@redhat.com, ackerleytng@google.com, Yamahata, Isaku,
	tabba@google.com, Peng, Chao P, kvm@vger.kernel.org,
	binbin.wu@linux.intel.com, Annapurve, Vishal, Edgecombe, Rick P,
	jroedel@suse.de, Miao, Jun, pgonda@google.com, x86@kernel.org

On Fri, 2025-05-16 at 17:43 +0800, Zhao, Yan Y wrote:
> On Fri, May 16, 2025 at 09:35:37AM +0800, Huang, Kai wrote:
> > On Tue, 2025-05-13 at 20:10 +0000, Edgecombe, Rick P wrote:
> > > > @@ -3265,7 +3263,7 @@ int tdx_gmem_private_max_mapping_level(struct kvm *kvm, kvm_pfn_t pfn)
> > > >   	if (unlikely(to_kvm_tdx(kvm)->state != TD_STATE_RUNNABLE))
> > > >   		return PG_LEVEL_4K;
> > > >   
> > > > -	return PG_LEVEL_4K;
> > > > +	return PG_LEVEL_2M;
> > > 
> > > Maybe combine this with patch 4, or split them into sensible categories.
> > 
> > How about merge with patch 12
> > 
> >   [RFC PATCH 12/21] KVM: TDX: Determine max mapping level according to vCPU's 
> >   ACCEPT level
> > 
> > instead?
> > 
> > Per patch 12, the fault due to TDH.MEM.PAGE.ACCPT contains fault level info, so
> > KVM should just return that.  But seems we are still returning PG_LEVEL_2M if no
> > such info is provided (IIUC):
> Yes, without such info (tdx->violation_request_level), we always return
> PG_LEVEL_2M.
> 
> 
> > int tdx_gmem_private_max_mapping_level(struct kvm_vcpu *vcpu, kvm_pfn_t pfn, 
> > 				       gfn_t gfn)
> >  {
> > +	struct vcpu_tdx *tdx = to_tdx(vcpu);
> > +
> >  	if (unlikely(to_kvm_tdx(vcpu->kvm)->state != TD_STATE_RUNNABLE))
> >  		return PG_LEVEL_4K;
> >  
> > +	if (gfn >= tdx->violation_gfn_start && gfn < tdx->violation_gfn_end)
> > +		return tdx->violation_request_level;
> > +
> >  	return PG_LEVEL_2M;
> >  }
> > 
> > So why not returning PT_LEVEL_4K at the end?
> > 
> > I am asking because below text mentioned in the coverletter:
> > 
> >     A rare case that could lead to splitting in the fault path is when a TD
> >     is configured to receive #VE and accesses memory before the ACCEPT
> >     operation. By the time a vCPU accesses a private GFN, due to the lack
> >     of any guest preferred level, KVM could create a mapping at 2MB level.
> >     If the TD then only performs the ACCEPT operation at 4KB level,
> >     splitting in the fault path will be triggered. However, this is not
> >     regarded as a typical use case, as usually TD always accepts pages in
> >     the order from 1GB->2MB->4KB. The worst outcome to ignore the resulting
> >     splitting request is an endless EPT violation. This would not happen
> >     for a Linux guest, which does not expect any #VE.
> > 
> > Changing to return PT_LEVEL_4K should avoid this problem.  It doesn't hurt
> For TDs that expect #VE, guests access private memory before accepting it.
> In that case, when KVM receives the EPT violation, there's no expected level from
> the TDX module. Returning PT_LEVEL_4K at the end basically disables huge pages
> for those TDs.

Just want to make sure I understand correctly:

Linux TDs always ACCEPT memory first before touching that memory, therefore KVM
should always be able to get the accept level for Linux TDs.

In other words, returning PG_LEVEL_4K doesn't impact establishing large page
mapping for Linux TDs.

However, other TDs may choose to touch memory first to receive #VE and then
accept that memory.  Returning PG_LEVEL_2M allows those TDs to use large page
mappings in SEPT.  Otherwise, returning PG_LEVEL_4K essentially disables large
page for them (since we don't support PROMOTE for now?).

But in the above text you mentioned that, because we choose to ignore the
splitting request from such an access, returning 2M could result in an *endless*
EPT violation.

So to me it seems you chose a design that could bring a performance gain for
certain non-Linux TDs when they follow a certain behaviour, but otherwise could
result in an endless EPT violation in KVM.

I am not sure how this is OK.  Or perhaps I have a misunderstanding?

> 
> Besides, according to Kirill [1], the order from 1GB->2MB->4KB is only the case
> for linux guests.
> 
> [1] https://lore.kernel.org/all/6vdj4mfxlyvypn743klxq5twda66tkugwzljdt275rug2gmwwl@zdziylxpre6y/#t

I am not sure how this is related.

On the contrary, if other non-Linux TDs don't follow the 1G->2M->4K accept order,
e.g., they always accept at 4K, there could be an *endless EPT violation* if I
understand your words correctly.

Isn't this yet another reason we should choose to return PG_LEVEL_4K instead of
2M if no accept level is provided in the fault?

> 
> > normal cases either, since guest will always do ACCEPT (which contains the
> > accepting level) before accessing the memory.

^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 09/21] KVM: TDX: Enable 2MB mapping size after TD is RUNNABLE
  2025-05-16 22:35         ` Huang, Kai
@ 2025-05-16 23:47           ` Edgecombe, Rick P
  2025-05-19  8:32           ` Yan Zhao
  1 sibling, 0 replies; 294+ messages in thread
From: Edgecombe, Rick P @ 2025-05-16 23:47 UTC (permalink / raw)
  To: Huang, Kai, Zhao, Yan Y
  Cc: Shutemov, Kirill, Li, Xiaoyao, Du, Fan, Hansen, Dave,
	david@redhat.com, Li, Zhiquan1, vbabka@suse.cz, tabba@google.com,
	thomas.lendacky@amd.com, linux-kernel@vger.kernel.org,
	seanjc@google.com, Weiny, Ira, michael.roth@amd.com,
	pbonzini@redhat.com, Yamahata, Isaku, ackerleytng@google.com,
	binbin.wu@linux.intel.com, Peng, Chao P, quic_eberman@quicinc.com,
	Annapurve, Vishal, jroedel@suse.de, Miao, Jun,
	kvm@vger.kernel.org, pgonda@google.com, x86@kernel.org

On Fri, 2025-05-16 at 22:35 +0000, Huang, Kai wrote:
> > For TDs that expect #VE, guests access private memory before accepting it.
> > In that case, when KVM receives the EPT violation, there's no expected level
> > from the TDX module. Returning PT_LEVEL_4K at the end basically disables
> > huge pages for those TDs.
> 
> Just want to make sure I understand correctly:
> 
> Linux TDs always ACCEPT memory first before touching that memory, therefore
> KVM
> should always be able to get the accept level for Linux TDs.
> 
> In other words, returning PG_LEVEL_4K doesn't impact establishing large page
> mapping for Linux TDs.
> 
> However, other TDs may choose to touch memory first to receive #VE and then
> accept that memory.  Returning PG_LEVEL_2M allows those TDs to use large page
> mappings in SEPT.  Otherwise, returning PG_LEVEL_4K essentially disables large
> page for them (since we don't support PROMOTE for now?).
> 
> But in the above text you mentioned that, if doing so, because we choose to
> ignore splitting request on read, returning 2M could result in *endless* EPT
> violation.
> 
> So to me it seems you choose a design that could bring performance gain for
> certain non-Linux TDs when they follow a certain behaviour but otherwise could
> result in endless EPT violation in KVM.
> 
> I am not sure how is this OK?  Or probably I have misunderstanding?

Good point. And if we just pass 4K level when the EPT violation doesn't have the
accept size, and force prefetch to 4K too, like this does, then what needs fault
path demotion? Guest double-accept bugs?

^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 10/21] KVM: x86/mmu: Disallow page merging (huge page adjustment) for mirror root
  2025-05-16 17:50       ` Edgecombe, Rick P
@ 2025-05-19  3:57         ` Yan Zhao
  2025-05-19 17:42           ` Edgecombe, Rick P
  0 siblings, 1 reply; 294+ messages in thread
From: Yan Zhao @ 2025-05-19  3:57 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: kvm@vger.kernel.org, Li, Xiaoyao, quic_eberman@quicinc.com,
	Hansen, Dave, david@redhat.com, Li, Zhiquan1, tabba@google.com,
	vbabka@suse.cz, thomas.lendacky@amd.com, michael.roth@amd.com,
	seanjc@google.com, Weiny, Ira, linux-kernel@vger.kernel.org,
	pbonzini@redhat.com, ackerleytng@google.com, Yamahata, Isaku,
	binbin.wu@linux.intel.com, Peng, Chao P, Du, Fan,
	Annapurve, Vishal, jroedel@suse.de, Miao, Jun, Shutemov, Kirill,
	pgonda@google.com, x86@kernel.org

On Sat, May 17, 2025 at 01:50:53AM +0800, Edgecombe, Rick P wrote:
> On Fri, 2025-05-16 at 12:01 +0800, Yan Zhao wrote:
> > > Maybe we should rename nx_huge_page_workaround_enabled to something more
> > > generic
> > > and do the is_mirror logic in kvm_mmu_do_page_fault() when setting it. It
> > > should
> > > shrink the diff and centralize the logic.
> > Hmm, I'm reluctant to rename nx_huge_page_workaround_enabled, because
> > 
> > (1) Invoking disallowed_hugepage_adjust() for mirror root is to disable page
> >     promotion for TDX private memory, so is only applied to TDP MMU.
> > (2) nx_huge_page_workaround_enabled is used specifically for nx huge pages.
> >     fault->huge_page_disallowed = fault->exec && fault-
> > >nx_huge_page_workaround_enabled;
> 
> Oh, good point.
> 
> > 
> >     if (fault->huge_page_disallowed)
> >         account_nx_huge_page(vcpu->kvm, sp,
> >                              fault->req_level >= it.level);
> >     
> >     sp->nx_huge_page_disallowed = fault->huge_page_disallowed.
> > 
> >     Affecting fault->huge_page_disallowed would impact
> >     sp->nx_huge_page_disallowed as well and would disable huge pages entirely.
> > 
> >     So, we still need to keep nx_huge_page_workaround_enabled.
> > 
> > If we introduce a new flag fault->disable_hugepage_adjust, and set it in
> > kvm_mmu_do_page_fault(), we would also need to invoke
> > tdp_mmu_get_root_for_fault() there as well.
> > 
> > Checking for mirror root for non-TDX VMs is not necessary, and the invocation
> > of
> > tdp_mmu_get_root_for_fault() seems redundant with the one in
> > kvm_tdp_mmu_map().
> 
> Also true. What about a wrapper for MMU code to check instead of fault-
> >nx_huge_page_workaround_enabled then?
Like below?

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 1b2bacde009f..0e4a03f44036 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1275,6 +1275,11 @@ static int tdp_mmu_link_sp(struct kvm *kvm, struct tdp_iter *iter,
        return 0;
 }

+static inline bool is_fault_disallow_huge_page_adjust(struct kvm_page_fault *fault, bool is_mirror)
+{
+       return fault->nx_huge_page_workaround_enabled || is_mirror;
+}
+
 /*
  * Handle a TDP page fault (NPT/EPT violation/misconfiguration) by installing
  * page tables and SPTEs to translate the faulting guest physical address.
@@ -1297,7 +1302,7 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
        for_each_tdp_pte(iter, kvm, root, fault->gfn, fault->gfn + 1) {
                int r;

-               if (fault->nx_huge_page_workaround_enabled || is_mirror)
+               if (is_fault_disallow_huge_page_adjust(fault, is_mirror))
                        disallowed_hugepage_adjust(fault, iter.old_spte, iter.level, is_mirror);

                /*



> Also, why not check is_mirror_sp() in disallowed_hugepage_adjust() instead of
> passing in an is_mirror arg?
It's an optimization.

As is_mirror_sptep(iter->sptep) == is_mirror_sp(root), passing in the is_mirror
arg avoids checking for a mirror SP on each iteration, since that property does
not change within a root.


> There must be a way to have it fit in better with disallowed_hugepage_adjust()
> without adding so much open coded boolean logic.

^ permalink raw reply related	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 14/21] KVM: x86/tdp_mmu: Invoke split_external_spt hook with exclusive mmu_lock
  2025-05-16 22:11       ` Edgecombe, Rick P
@ 2025-05-19  4:01         ` Yan Zhao
  2025-05-19 20:21           ` Edgecombe, Rick P
  0 siblings, 1 reply; 294+ messages in thread
From: Yan Zhao @ 2025-05-19  4:01 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: kvm@vger.kernel.org, Li, Xiaoyao, quic_eberman@quicinc.com,
	Hansen, Dave, david@redhat.com, Li, Zhiquan1, tabba@google.com,
	vbabka@suse.cz, thomas.lendacky@amd.com, michael.roth@amd.com,
	seanjc@google.com, Weiny, Ira, linux-kernel@vger.kernel.org,
	pbonzini@redhat.com, ackerleytng@google.com, Yamahata, Isaku,
	binbin.wu@linux.intel.com, Peng, Chao P, Du, Fan,
	Annapurve, Vishal, jroedel@suse.de, Miao, Jun, Shutemov, Kirill,
	pgonda@google.com, x86@kernel.org

On Sat, May 17, 2025 at 06:11:59AM +0800, Edgecombe, Rick P wrote:
> On Fri, 2025-05-16 at 17:17 +0800, Yan Zhao wrote:
> > > Shouldn't this BUG_ON be handled in the split_external_spt implementation? I
> > > don't think we need another one.
> > Ok. But kvm_x86_split_external_spt() is not for TDX only.
> > Is it good for KVM MMU core to rely on each implementation to trigger BUG_ON?
> 
> It effectively is for TDX only. At least for the foreseeable future. The naming
> basically means that people don't have to see "TDX" everywhere when they look in
> the MMU code.
Hmm, another reason to add the BUG_ON is to align it with remove_external_spte().
There's also a KVM_BUG_ON() following the remove_external_spte hook.

I interpret this as error handling in the KVM MMU core: the caller returns
"void", so it issues a KVM_BUG_ON() if ret != 0.
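
For reference, a simplified sketch of that existing pattern (details trimmed):

/* Simplified sketch of the existing remove path in the TDP MMU. */
static void remove_external_spte(struct kvm *kvm, gfn_t gfn, u64 old_spte,
				 int level)
{
	kvm_pfn_t old_pfn = spte_to_pfn(old_spte);
	int ret;

	/*
	 * The hook can fail, but this path returns void and has no way to
	 * unwind, so any failure is treated as a KVM bug.
	 */
	ret = static_call(kvm_x86_remove_external_spte)(kvm, gfn, level, old_pfn);
	KVM_BUG_ON(ret, kvm);
}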

^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 12/21] KVM: TDX: Determine max mapping level according to vCPU's ACCEPT level
  2025-05-16 22:02       ` Edgecombe, Rick P
@ 2025-05-19  6:39         ` Yan Zhao
  2025-05-19 20:17           ` Edgecombe, Rick P
  0 siblings, 1 reply; 294+ messages in thread
From: Yan Zhao @ 2025-05-19  6:39 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: kvm@vger.kernel.org, Li, Xiaoyao, quic_eberman@quicinc.com,
	Hansen, Dave, david@redhat.com, Li, Zhiquan1, tabba@google.com,
	vbabka@suse.cz, thomas.lendacky@amd.com, michael.roth@amd.com,
	seanjc@google.com, Weiny, Ira, linux-kernel@vger.kernel.org,
	pbonzini@redhat.com, ackerleytng@google.com, Yamahata, Isaku,
	binbin.wu@linux.intel.com, Peng, Chao P, Du, Fan,
	Annapurve, Vishal, jroedel@suse.de, Miao, Jun, Shutemov, Kirill,
	pgonda@google.com, x86@kernel.org

On Sat, May 17, 2025 at 06:02:14AM +0800, Edgecombe, Rick P wrote:
> On Fri, 2025-05-16 at 14:30 +0800, Yan Zhao wrote:
> > > Looking more closely, I don't see why it's too hard to pass in a
> > > max_fault_level
> > > into the fault struct. Totally untested rough idea, what do you think?
> > Thanks for bringing this up and providing the idea below. In the previous TDX
> > huge page v8, there's a similar implementation [1] [2].
> > 
> > This series did not adopt that approach because that approach requires
> > tdx_handle_ept_violation() to pass in max_fault_level, which is not always
> > available at that stage. e.g.
> > 
> > In patch 19, when vCPU 1 faults on a GFN at 2MB level and then vCPU 2 faults
> > on
> > the same GFN at 4KB level, TDX wants to ignore the demotion request caused by
> > vCPU 2's 4KB level fault. So, patch 19 sets tdx->violation_request_level to
> > 2MB
> > in vCPU 2's split callback and fails the split. vCPU 2's
> > __vmx_handle_ept_violation() will see RET_PF_RETRY and either do local retry
> > (or
> > return to the guest).
> 
> I think you mean patch 20 "KVM: x86: Force a prefetch fault's max mapping level
> to 4KB for TDX"?
Sorry, it's patch 21, "KVM: x86: Ignore splitting huge pages in fault path for
TDX".

> > 
> > If it retries locally, tdx_gmem_private_max_mapping_level() will return
> > tdx->violation_request_level, causing KVM to fault at 2MB level for vCPU 2,
> > resulting in a spurious fault, eventually returning to the guest.
> > 
> > As tdx->violation_request_level is per-vCPU and it resets in
> > tdx_get_accept_level() in tdx_handle_ept_violation() (meaning it resets after
> > each invocation of tdx_handle_ept_violation() and only affects the TDX local
> > retry loop), it should not hold any stale value.
> > 
> > Alternatively, instead of having tdx_gmem_private_max_mapping_level() to
> > return
> > tdx->violation_request_level, tdx_handle_ept_violation() could grab
> > tdx->violation_request_level as the max_fault_level to pass to
> > __vmx_handle_ept_violation().
> > 
> > This series chose to use tdx_gmem_private_max_mapping_level() to avoid
> > modification to the KVM MMU core.
> 
> It sounds like Kirill is suggesting we do have to have demotion in the fault
> path. IIRC it adds a lock, but the cost to skip fault path demotion seems to be
> adding up.
Yes, though Kirill is suggesting to support demotion in the fault path, I still
think that using tdx_gmem_private_max_mapping_level() might be more friendly to
other potential scenarios, such as when the KVM core MMU requests TDX to perform
page promotion, and TDX finds that promotion would consistently fail on a GFN.

Another important reason for not passing a max_fault_level into the fault struct
is that the KVM MMU now has the hook private_max_mapping_level to determine a
private fault's maximum level, which was introduced by commit f32fb32820b1
("KVM: x86: Add hook for determining max NPT mapping level"). We'd better not to
introduce another mechanism if the same job can be accomplished via the
private_max_mapping_level hook.
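
Roughly, the common MMU clamps a private fault's max level with whatever the hook
returns; a simplified sketch of that clamping (not the exact upstream code):

/* Simplified sketch: how a private fault's max mapping level gets clamped. */
static u8 max_private_mapping_level(struct kvm *kvm, kvm_pfn_t pfn,
				    u8 max_level, int gmem_order)
{
	u8 req_max_level;

	/* The guest_memfd allocation order already bounds the mapping size. */
	max_level = min(kvm_max_level_for_order(gmem_order), max_level);
	if (max_level == PG_LEVEL_4K)
		return PG_LEVEL_4K;

	/* Let the vendor hook (TDX here) lower the level further. */
	req_max_level = kvm_x86_call(private_max_mapping_level)(kvm, pfn);
	if (req_max_level)
		max_level = min(max_level, req_max_level);

	return max_level;
}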

The code in TDX huge page v8 [1][2] simply inherited the old implementation from
its v1 [3], where the private_max_mapping_level hook had not yet been introduced
for private faults.

[1] https://lore.kernel.org/all/4d61104bff388a081ff8f6ae4ac71e05a13e53c3.1708933624.git.isaku.yamahata@intel.com/
[2] https://lore.kernel.org/all/3d2a6bfb033ee1b51f7b875360bd295376c32b54.1708933624.git.isaku.yamahata@intel.com/
[3] https://lore.kernel.org/all/cover.1659854957.git.isaku.yamahata@intel.com/

^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 16/21] KVM: x86/mmu: Introduce kvm_split_boundary_leafs() to split boundary leafs
  2025-05-16 22:27         ` Edgecombe, Rick P
@ 2025-05-19  8:12           ` Yan Zhao
  0 siblings, 0 replies; 294+ messages in thread
From: Yan Zhao @ 2025-05-19  8:12 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: quic_eberman@quicinc.com, Li, Xiaoyao, Shutemov, Kirill,
	Hansen, Dave, david@redhat.com, Li, Zhiquan1, tabba@google.com,
	thomas.lendacky@amd.com, michael.roth@amd.com, seanjc@google.com,
	Weiny, Ira, linux-kernel@vger.kernel.org, pbonzini@redhat.com,
	ackerleytng@google.com, Yamahata, Isaku, vbabka@suse.cz,
	Peng, Chao P, Du, Fan, binbin.wu@linux.intel.com,
	Annapurve, Vishal, jroedel@suse.de, Miao, Jun,
	kvm@vger.kernel.org, pgonda@google.com, x86@kernel.org

On Sat, May 17, 2025 at 06:27:10AM +0800, Edgecombe, Rick P wrote:
> On Fri, 2025-05-16 at 16:03 +0800, Yan Zhao wrote:
> > > 
> > > > > +int kvm_tdp_mmu_gfn_range_split_boundary(struct kvm *kvm, struct
> > > > > kvm_gfn_range *range)
> > > > > +{
> > > > > +	enum kvm_tdp_mmu_root_types types;
> > > > > +	struct kvm_mmu_page *root;
> > > > > +	bool flush = false;
> > > > > +	int ret;
> > > > > +
> > > > > +	types = kvm_gfn_range_filter_to_root_types(kvm, range-
> > > > > >attr_filter) | KVM_INVALID_ROOTS;
> > > > 
> > > > What is the reason for KVM_INVALID_ROOTS in this case?
> > > I wanted to keep consistent with that in kvm_tdp_mmu_unmap_gfn_range().
> 
> Yea, lack of consistency would raise other questions.
> 
> > With this consistency, we can warn in tdp_mmu_zap_leafs() as below though
> > there should be no invalid mirror root.
> > 
> > WARN_ON_ONCE(iter_split_required(kvm, root, &iter, start, end));
> >  
> 
> Hmm, let's be clear about the logic. This is essentially a mirror TDP only
> function, and there we don't have the same invalid root scenarios as the more
> complicated cases. I'm not exactly sure how we could hit the warning if they
> didn't match. I guess a hole punch on the fd while the TD is getting torn down?
In practice, the warning shouldn't be hit because the mirror root should only be
invalidated after the gmem_fd is destroyed.

> Let's comment the reasoning at least.
Will do.
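
Something along these lines, perhaps:

	/*
	 * Include KVM_INVALID_ROOTS to stay consistent with
	 * kvm_tdp_mmu_unmap_gfn_range().  An invalid mirror root should never
	 * actually require splitting, since the mirror root is only
	 * invalidated after the gmem fd is gone; tdp_mmu_zap_leafs() warns if
	 * that assumption is ever broken.
	 */
	types = kvm_gfn_range_filter_to_root_types(kvm, range->attr_filter) |
		KVM_INVALID_ROOTS;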

^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 09/21] KVM: TDX: Enable 2MB mapping size after TD is RUNNABLE
  2025-05-16 22:35         ` Huang, Kai
  2025-05-16 23:47           ` Edgecombe, Rick P
@ 2025-05-19  8:32           ` Yan Zhao
  2025-05-19 16:53             ` Edgecombe, Rick P
  2025-05-20 23:34             ` Huang, Kai
  1 sibling, 2 replies; 294+ messages in thread
From: Yan Zhao @ 2025-05-19  8:32 UTC (permalink / raw)
  To: Huang, Kai
  Cc: quic_eberman@quicinc.com, Li, Xiaoyao, Shutemov, Kirill,
	Hansen, Dave, david@redhat.com, thomas.lendacky@amd.com,
	vbabka@suse.cz, Li, Zhiquan1, Du, Fan, michael.roth@amd.com,
	linux-kernel@vger.kernel.org, seanjc@google.com, Weiny, Ira,
	pbonzini@redhat.com, ackerleytng@google.com, Yamahata, Isaku,
	tabba@google.com, Peng, Chao P, kvm@vger.kernel.org,
	binbin.wu@linux.intel.com, Annapurve, Vishal, Edgecombe, Rick P,
	jroedel@suse.de, Miao, Jun, pgonda@google.com, x86@kernel.org

On Sat, May 17, 2025 at 06:35:57AM +0800, Huang, Kai wrote:
> On Fri, 2025-05-16 at 17:43 +0800, Zhao, Yan Y wrote:
> > On Fri, May 16, 2025 at 09:35:37AM +0800, Huang, Kai wrote:
> > > On Tue, 2025-05-13 at 20:10 +0000, Edgecombe, Rick P wrote:
> > > > > @@ -3265,7 +3263,7 @@ int tdx_gmem_private_max_mapping_level(struct kvm *kvm, kvm_pfn_t pfn)
> > > > >   	if (unlikely(to_kvm_tdx(kvm)->state != TD_STATE_RUNNABLE))
> > > > >   		return PG_LEVEL_4K;
> > > > >   
> > > > > -	return PG_LEVEL_4K;
> > > > > +	return PG_LEVEL_2M;
> > > > 
> > > > Maybe combine this with patch 4, or split them into sensible categories.
> > > 
> > > How about merge with patch 12
> > > 
> > >   [RFC PATCH 12/21] KVM: TDX: Determine max mapping level according to vCPU's 
> > >   ACCEPT level
> > > 
> > > instead?
> > > 
> > > Per patch 12, the fault due to TDH.MEM.PAGE.ACCPT contains fault level info, so
> > > KVM should just return that.  But seems we are still returning PG_LEVEL_2M if no
> > > such info is provided (IIUC):
> > Yes, if without such info (tdx->violation_request_level), we always return
> > PG_LEVEL_2M.
> > 
> > 
> > > int tdx_gmem_private_max_mapping_level(struct kvm_vcpu *vcpu, kvm_pfn_t pfn, 
> > > 				       gfn_t gfn)
> > >  {
> > > +	struct vcpu_tdx *tdx = to_tdx(vcpu);
> > > +
> > >  	if (unlikely(to_kvm_tdx(vcpu->kvm)->state != TD_STATE_RUNNABLE))
> > >  		return PG_LEVEL_4K;
> > >  
> > > +	if (gfn >= tdx->violation_gfn_start && gfn < tdx->violation_gfn_end)
> > > +		return tdx->violation_request_level;
> > > +
> > >  	return PG_LEVEL_2M;
> > >  }
> > > 
> > > So why not returning PT_LEVEL_4K at the end?
> > > 
> > > I am asking because below text mentioned in the coverletter:
> > > 
> > >     A rare case that could lead to splitting in the fault path is when a TD
> > >     is configured to receive #VE and accesses memory before the ACCEPT
> > >     operation. By the time a vCPU accesses a private GFN, due to the lack
> > >     of any guest preferred level, KVM could create a mapping at 2MB level.
> > >     If the TD then only performs the ACCEPT operation at 4KB level,
> > >     splitting in the fault path will be triggered. However, this is not
> > >     regarded as a typical use case, as usually TD always accepts pages in
> > >     the order from 1GB->2MB->4KB. The worst outcome to ignore the resulting
> > >     splitting request is an endless EPT violation. This would not happen
> > >     for a Linux guest, which does not expect any #VE.
> > > 
> > > Changing to return PT_LEVEL_4K should avoid this problem.  It doesn't hurt
> > For TDs expect #VE, guests access private memory before accept it.
> > In that case, upon KVM receives EPT violation, there's no expected level from
> > the TDX module. Returning PT_LEVEL_4K at the end basically disables huge pages
> > for those TDs.
> 
> Just want to make sure I understand correctly:
> 
> Linux TDs always ACCEPT memory first before touching that memory, therefore KVM
> should always be able to get the accept level for Linux TDs.
> 
> In other words, returning PG_LEVEL_4K doesn't impact establishing large page
> mapping for Linux TDs.
>
> However, other TDs may choose to touch memory first to receive #VE and then
> accept that memory.  Returning PG_LEVEL_2M allows those TDs to use large page
> mappings in SEPT.  Otherwise, returning PG_LEVEL_4K essentially disables large
> page for them (since we don't support PROMOTE for now?).
Not only because we don't support PROMOTE.

After KVM maps at 4KB level, if the guest accepts at 2MB level, it would get
a TDACCEPT_SIZE_MISMATCH error.

The case of "KVM maps at 4KB, guest accepts at 2MB" is different from
"KVM maps at 2MB, guest accepts at 4KB".

For the former, the guest can't trigger endless EPT violations. Just consider
the case where the guest wants to accept at 2MB while KVM can't meet its
request: if that could trigger endless EPT violations, a Linux guest would
already trigger them with today's basic TDX support.

> But in the above text you mentioned that, if doing so, because we choose to
> ignore splitting request on read, returning 2M could result in *endless* EPT
> violation.
I don't get what you mean.
What's the relationship between splitting and "returning 2M could result in
*endless* EPT" ?

> So to me it seems you choose a design that could bring performance gain for
> certain non-Linux TDs when they follow a certain behaviour but otherwise could
> result in endless EPT violation in KVM.
Also don't understand here.
Which design could result in endless EPT violation?

> I am not sure how is this OK?  Or probably I have misunderstanding?

> > 
> > Besides, according to Kirill [1], the order from 1GB->2MB->4KB is only the case
> > for linux guests.
> > 
> > [1] https://lore.kernel.org/all/6vdj4mfxlyvypn743klxq5twda66tkugwzljdt275rug2gmwwl@zdziylxpre6y/#t
> 
> I am not sure how is this related?
> 
> On the opposite, if other non-Linux TDs don't follow 1G->2M->4K accept order,
> e.g., they always accept 4K, there could be *endless EPT violation* if I
> understand your words correctly.
> 
> Isn't this yet-another reason we should choose to return PG_LEVEL_4K instead of
> 2M if no accept level is provided in the fault?
As I said, returning PG_LEVEL_4K would disallow huge pages for non-Linux TDs.
TD's accept operations at size > 4KB will get TDACCEPT_SIZE_MISMATCH. 

> 
> > 
> > > normal cases either, since guest will always do ACCEPT (which contains the
> > > accepting level) before accessing the memory.

^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 09/21] KVM: TDX: Enable 2MB mapping size after TD is RUNNABLE
  2025-05-19  8:32           ` Yan Zhao
@ 2025-05-19 16:53             ` Edgecombe, Rick P
  2025-05-20  9:34               ` Yan Zhao
  2025-05-20 23:34             ` Huang, Kai
  1 sibling, 1 reply; 294+ messages in thread
From: Edgecombe, Rick P @ 2025-05-19 16:53 UTC (permalink / raw)
  To: Zhao, Yan Y, Huang, Kai
  Cc: Shutemov, Kirill, Li, Xiaoyao, Du, Fan, Hansen, Dave,
	david@redhat.com, Li, Zhiquan1, vbabka@suse.cz, tabba@google.com,
	thomas.lendacky@amd.com, linux-kernel@vger.kernel.org,
	seanjc@google.com, Weiny, Ira, michael.roth@amd.com,
	pbonzini@redhat.com, Yamahata, Isaku, ackerleytng@google.com,
	binbin.wu@linux.intel.com, Peng, Chao P, quic_eberman@quicinc.com,
	Annapurve, Vishal, jroedel@suse.de, Miao, Jun,
	kvm@vger.kernel.org, pgonda@google.com, x86@kernel.org

On Mon, 2025-05-19 at 16:32 +0800, Yan Zhao wrote:
> > On the opposite, if other non-Linux TDs don't follow 1G->2M->4K accept
> > order,
> > e.g., they always accept 4K, there could be *endless EPT violation* if I
> > understand your words correctly.
> > 
> > Isn't this yet-another reason we should choose to return PG_LEVEL_4K instead
> > of
> > 2M if no accept level is provided in the fault?
> As I said, returning PG_LEVEL_4K would disallow huge pages for non-Linux TDs.
> TD's accept operations at size > 4KB will get TDACCEPT_SIZE_MISMATCH.

TDX_PAGE_SIZE_MISMATCH is a valid error code that the guest should handle. The
docs say the VMM needs to demote *if* the mapping is large and the accept size
is small. But if we map at 4k size for non-accept EPT violations, we won't hit
this case. I also wonder what is preventing the TDX module from handling a 2MB
accept size at 4k mappings. It could be changed maybe.

But I think Kai's question was: why are we complicating the code for the case of
non-Linux TDs that also use #VE for accept? It's not necessary to be functional,
and there aren't any known TDs like that which are expected to use KVM today.
(err, except the MMU stress test). So in another form the question is: should we
optimize KVM for a case we don't even know if anyone will use? The answer seems
obviously no to me.

I think this connects the question of whether we can pass the necessary info
into fault via synthetic error code. Consider this new design:

 - tdx_gmem_private_max_mapping_level() simply returns 4k for prefetch and pre-
runnable, otherwise returns 2MB
 - if fault has accept info 2MB size, pass 2MB size into fault. Otherwise pass
4k (i.e. VMs that are relying on #VE to do the accept won't get huge pages
*yet*).

What goes wrong? Seems simpler and no more stuffing fault info on the vcpu.
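
Roughly (hand-waving the plumbing; PFERR_ACCEPT_2M_MASK and
tdx_violation_has_accept_info() below are made up for illustration, not
existing definitions):

	/* In tdx_handle_ept_violation(): only advertise a 2MB goal when the
	 * guest explicitly accepted at 2MB. */
	if (tdx_violation_has_accept_info(vcpu) &&
	    tdx_get_accept_level(vcpu) == PG_LEVEL_2M)
		err |= PFERR_ACCEPT_2M_MASK;

	/* In the private fault setup: clamp max_level accordingly. */
	max_level = (err & PFERR_ACCEPT_2M_MASK) ? PG_LEVEL_2M : PG_LEVEL_4K;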


^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 10/21] KVM: x86/mmu: Disallow page merging (huge page adjustment) for mirror root
  2025-05-19  3:57         ` Yan Zhao
@ 2025-05-19 17:42           ` Edgecombe, Rick P
  2025-05-20 10:11             ` Yan Zhao
  0 siblings, 1 reply; 294+ messages in thread
From: Edgecombe, Rick P @ 2025-05-19 17:42 UTC (permalink / raw)
  To: Zhao, Yan Y
  Cc: quic_eberman@quicinc.com, Li, Xiaoyao, kvm@vger.kernel.org,
	Hansen, Dave, david@redhat.com, Li, Zhiquan1,
	thomas.lendacky@amd.com, tabba@google.com, Du, Fan,
	linux-kernel@vger.kernel.org, seanjc@google.com, Weiny, Ira,
	pbonzini@redhat.com, binbin.wu@linux.intel.com, Yamahata, Isaku,
	michael.roth@amd.com, vbabka@suse.cz, ackerleytng@google.com,
	Peng, Chao P, Shutemov, Kirill, Annapurve, Vishal,
	jroedel@suse.de, Miao, Jun, pgonda@google.com, x86@kernel.org

On Mon, 2025-05-19 at 11:57 +0800, Yan Zhao wrote:
> Like below?
> 
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 1b2bacde009f..0e4a03f44036 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -1275,6 +1275,11 @@ static int tdp_mmu_link_sp(struct kvm *kvm, struct tdp_iter *iter,
>         return 0;
>  }
> 
> +static inline bool is_fault_disallow_huge_page_adust(struct kvm_page_fault *fault, bool is_mirror)
> +{
> +       return fault->nx_huge_page_workaround_enabled || is_mirror;
> +}

Err, no. It doesn't seem worth it.

> +
>  /*
>   * Handle a TDP page fault (NPT/EPT violation/misconfiguration) by installing
>   * page tables and SPTEs to translate the faulting guest physical address.
> @@ -1297,7 +1302,7 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
>         for_each_tdp_pte(iter, kvm, root, fault->gfn, fault->gfn + 1) {
>                 int r;
> 
> -               if (fault->nx_huge_page_workaround_enabled || is_mirror)
> +               if (is_fault_disallow_huge_page_adust(fault, is_mirror))
>                         disallowed_hugepage_adjust(fault, iter.old_spte, iter.level, is_mirror);
> 
>                 /*
> 
> 
> 
> > Also, why not check is_mirror_sp() in disallowed_hugepage_adjust() instead of
> > passing in an is_mirror arg?
> It's an optimization.

But disallowed_hugepage_adjust() is already checking the sp.

I think part of the thing that is bugging me is that
nx_huge_page_workaround_enabled is not conceptually about whether the specific
fault/level needs to disallow huge page adjustments, it's whether it needs to
check if it does. Then disallowed_hugepage_adjust() does the actual specific
checking. But for the mirror logic the check is the same for both. It's
asymmetric with NX huge pages, and just sort of jammed in. It would be easier to
follow if the kvm_tdp_mmu_map() conditional checked whether mirror TDP was
"active", rather than the mirror role.

> 
> As is_mirror_sptep(iter->sptep) == is_mirror_sp(root), passing in is_mirror arg
> can avoid checking mirror for each sp, which remains unchanged in a root.

Why not just this. It seems easier to comprehend to me. It does add a little bit
of extra checking in the shared fault for TDX only. I think it's ok and better
not to litter the generic MMU code.

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index a284dce227a0..37ca77f2ee15 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3328,11 +3328,13 @@ void kvm_mmu_hugepage_adjust(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
 
 void disallowed_hugepage_adjust(struct kvm_page_fault *fault, u64 spte, int cur_level)
 {
+       struct kvm_mmu_page *sp = spte_to_child_sp(spte);
+
        if (cur_level > PG_LEVEL_4K &&
            cur_level == fault->goal_level &&
            is_shadow_present_pte(spte) &&
            !is_large_pte(spte) &&
-           spte_to_child_sp(spte)->nx_huge_page_disallowed) {
+           (sp->nx_huge_page_disallowed || sp->role.is_mirror)) {
                /*
                 * A small SPTE exists for this pfn, but FNAME(fetch),
                 * direct_map(), or kvm_tdp_mmu_map() would like to create a
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 405874f4d088..1d22994576b5 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1244,6 +1244,8 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
        struct tdp_iter iter;
        struct kvm_mmu_page *sp;
        int ret = RET_PF_RETRY;
+       bool hugepage_adjust_disallowed = fault->nx_huge_page_workaround_enabled ||
+                                         kvm_has_mirrored_tdp(kvm);
 
        kvm_mmu_hugepage_adjust(vcpu, fault);
 
@@ -1254,7 +1256,7 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
        for_each_tdp_pte(iter, kvm, root, fault->gfn, fault->gfn + 1) {
                int r;
 
-               if (fault->nx_huge_page_workaround_enabled)
+               if (hugepage_adjust_disallowed)
                        disallowed_hugepage_adjust(fault, iter.old_spte, iter.level);
 
                /*

^ permalink raw reply related	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 12/21] KVM: TDX: Determine max mapping level according to vCPU's ACCEPT level
  2025-05-19  6:39         ` Yan Zhao
@ 2025-05-19 20:17           ` Edgecombe, Rick P
  0 siblings, 0 replies; 294+ messages in thread
From: Edgecombe, Rick P @ 2025-05-19 20:17 UTC (permalink / raw)
  To: Zhao, Yan Y
  Cc: quic_eberman@quicinc.com, Li, Xiaoyao, kvm@vger.kernel.org,
	Hansen, Dave, david@redhat.com, Li, Zhiquan1,
	thomas.lendacky@amd.com, tabba@google.com, Du, Fan,
	linux-kernel@vger.kernel.org, seanjc@google.com, Weiny, Ira,
	pbonzini@redhat.com, binbin.wu@linux.intel.com, Yamahata, Isaku,
	michael.roth@amd.com, vbabka@suse.cz, ackerleytng@google.com,
	Peng, Chao P, Shutemov, Kirill, Annapurve, Vishal,
	jroedel@suse.de, Miao, Jun, pgonda@google.com, x86@kernel.org

On Mon, 2025-05-19 at 14:39 +0800, Yan Zhao wrote:
> > It sounds like Kirill is suggesting we do have to have demotion in the fault
> > path. IIRC it adds a lock, but the cost to skip fault path demotion seems to
> > be
> > adding up.
> Yes, though Kirill is suggesting to support demotion in the fault path, I
> still
> think that using tdx_gmem_private_max_mapping_level() might be more friendly
> to
> other potential scenarios, such as when the KVM core MMU requests TDX to
> perform
> page promotion, and TDX finds that promotion would consistently fail on a GFN.
> 
> Another important reason for not passing a max_fault_level into the fault
> struct
> is that the KVM MMU now has the hook private_max_mapping_level to determine a
> private fault's maximum level, which was introduced by commit f32fb32820b1
> ("KVM: x86: Add hook for determining max NPT mapping level"). We'd better not
> to
> introduce another mechanism if the same job can be accomplished via the
> private_max_mapping_level hook.

How about the alternative discussed on the thread with Kai? I don't think Kirill
was suggesting #VE based TDs need huge pages, just that they need to work with
4k accepts. Let's continue the discussion on that thread, because I think they
are all related. Once we conclude there we can iron out any remaining issues on
this specific patch.

> 
> The code in TDX huge page v8 [1][2] simply inherited the old implementation
> from
> its v1 [3], where the private_max_mapping_level hook had not yet been
> introduced
> for private faults.


^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 14/21] KVM: x86/tdp_mmu: Invoke split_external_spt hook with exclusive mmu_lock
  2025-05-19  4:01         ` Yan Zhao
@ 2025-05-19 20:21           ` Edgecombe, Rick P
  0 siblings, 0 replies; 294+ messages in thread
From: Edgecombe, Rick P @ 2025-05-19 20:21 UTC (permalink / raw)
  To: Zhao, Yan Y
  Cc: quic_eberman@quicinc.com, Li, Xiaoyao, kvm@vger.kernel.org,
	Hansen, Dave, david@redhat.com, Li, Zhiquan1,
	thomas.lendacky@amd.com, tabba@google.com, Du, Fan,
	linux-kernel@vger.kernel.org, seanjc@google.com, Weiny, Ira,
	pbonzini@redhat.com, binbin.wu@linux.intel.com, Yamahata, Isaku,
	michael.roth@amd.com, vbabka@suse.cz, ackerleytng@google.com,
	Peng, Chao P, Shutemov, Kirill, Annapurve, Vishal,
	jroedel@suse.de, Miao, Jun, pgonda@google.com, x86@kernel.org

On Mon, 2025-05-19 at 12:01 +0800, Yan Zhao wrote:
> On Sat, May 17, 2025 at 06:11:59AM +0800, Edgecombe, Rick P wrote:
> > On Fri, 2025-05-16 at 17:17 +0800, Yan Zhao wrote:
> > > > Shouldn't this BUG_ON be handled in the split_external_spt implementation? I
> > > > don't think we need another one.
> > > Ok. But kvm_x86_split_external_spt() is not for TDX only.
> > > Is it good for KVM MMU core to rely on each implementation to trigger BUG_ON?
> > 
> > It effectively is for TDX only. At least for the foreseeable future. The naming
> > basically means that people don't have to see "TDX" everywhere when they look in
> > the MMU code.
> Hmm, another reason to add the BUG_ON is to align it with remove_external_spte().
> There's also a KVM_BUG_ON() following the remove_external_spte hook.
> 
I interpret this as error handling in the KVM MMU core, which returns "void"
and therefore issues a KVM_BUG_ON() if ret != 0.

This is related to the other thread about how to handle demote failure. Let's
continue there.

But in general, the amount of KVM_BUG_ON()s we have for mirror EPT is a bit of a
code smell. It's not exclusive to this series, but I'd love it if we could keep
it from getting worse.

^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 14/21] KVM: x86/tdp_mmu: Invoke split_external_spt hook with exclusive mmu_lock
  2025-04-24  3:07 ` [RFC PATCH 14/21] KVM: x86/tdp_mmu: Invoke split_external_spt hook with exclusive mmu_lock Yan Zhao
  2025-05-13 23:06   ` Edgecombe, Rick P
@ 2025-05-20  5:40   ` Binbin Wu
  2025-05-20  9:40     ` Yan Zhao
  1 sibling, 1 reply; 294+ messages in thread
From: Binbin Wu @ 2025-05-20  5:40 UTC (permalink / raw)
  To: Yan Zhao
  Cc: pbonzini, seanjc, linux-kernel, kvm, x86, rick.p.edgecombe,
	dave.hansen, kirill.shutemov, tabba, ackerleytng, quic_eberman,
	michael.roth, david, vannapurve, vbabka, jroedel, thomas.lendacky,
	pgonda, zhiquan1.li, fan.du, jun.miao, ira.weiny, isaku.yamahata,
	xiaoyao.li, chao.p.peng



On 4/24/2025 11:07 AM, Yan Zhao wrote:
[...]
>   
> +static int split_external_spt(struct kvm *kvm, gfn_t gfn, u64 old_spte,
> +			      u64 new_spte, int level)
> +{
> +	void *external_spt = get_external_spt(gfn, new_spte, level);
> +	int ret;
> +
> +	KVM_BUG_ON(!external_spt, kvm);
> +
> +	ret = static_call(kvm_x86_split_external_spt)(kvm, gfn, level, external_spt);
Better to use kvm_x86_call() instead of static_call().

> +	KVM_BUG_ON(ret, kvm);
> +
> +	return ret;
> +}
>   /**
>    * handle_removed_pt() - handle a page table removed from the TDP structure
>    *
> @@ -764,13 +778,13 @@ static u64 tdp_mmu_set_spte(struct kvm *kvm, int as_id, tdp_ptep_t sptep,
>   
>   	handle_changed_spte(kvm, as_id, gfn, old_spte, new_spte, level, false);
>   
> -	/*
> -	 * Users that do non-atomic setting of PTEs don't operate on mirror
> -	 * roots, so don't handle it and bug the VM if it's seen.
> -	 */
>   	if (is_mirror_sptep(sptep)) {
> -		KVM_BUG_ON(is_shadow_present_pte(new_spte), kvm);
> -		remove_external_spte(kvm, gfn, old_spte, level);
> +		if (!is_shadow_present_pte(new_spte))
> +			remove_external_spte(kvm, gfn, old_spte, level);
> +		else if (is_last_spte(old_spte, level) && !is_last_spte(new_spte, level))
> +			split_external_spt(kvm, gfn, old_spte, new_spte, level);
> +		else
> +			KVM_BUG_ON(1, kvm);
>   	}
>   
>   	return old_spte;


^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 15/21] KVM: TDX: Support huge page splitting with exclusive kvm->mmu_lock
  2025-04-24  3:08 ` [RFC PATCH 15/21] KVM: TDX: Support huge page splitting with exclusive kvm->mmu_lock Yan Zhao
@ 2025-05-20  6:18   ` Binbin Wu
  2025-05-20  9:40     ` Yan Zhao
  2025-07-02 15:47   ` Edgecombe, Rick P
  1 sibling, 1 reply; 294+ messages in thread
From: Binbin Wu @ 2025-05-20  6:18 UTC (permalink / raw)
  To: Yan Zhao
  Cc: pbonzini, seanjc, linux-kernel, kvm, x86, rick.p.edgecombe,
	dave.hansen, kirill.shutemov, tabba, ackerleytng, quic_eberman,
	michael.roth, david, vannapurve, vbabka, jroedel, thomas.lendacky,
	pgonda, zhiquan1.li, fan.du, jun.miao, ira.weiny, isaku.yamahata,
	xiaoyao.li, chao.p.peng



On 4/24/2025 11:08 AM, Yan Zhao wrote:
[...]
> +
> +int tdx_sept_split_private_spt(struct kvm *kvm, gfn_t gfn, enum pg_level level,
> +			       void *private_spt)
> +{
> +	struct page *page = virt_to_page(private_spt);
> +	int ret;
> +
> +	if (KVM_BUG_ON(to_kvm_tdx(kvm)->state != TD_STATE_RUNNABLE || level != PG_LEVEL_2M, kvm))
> +		return -EINVAL;
> +
> +	ret = tdx_sept_zap_private_spte(kvm, gfn, level, page);
> +	if (ret <= 0)
> +		return ret;
> +
> +	tdx_track(kvm);

It may be worth adding a helper for the zap-and-track code.
It's the same code as in tdx_sept_remove_private_spte(), so the two could share
it, including the bug check for the HKID and the comments.
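
e.g., roughly (untested; the helper name below is just for illustration):

	static int tdx_sept_zap_and_track(struct kvm *kvm, gfn_t gfn,
					  enum pg_level level, struct page *page)
	{
		int ret;

		ret = tdx_sept_zap_private_spte(kvm, gfn, level, page);
		if (ret <= 0)
			return ret;

		/* TDH.MEM.TRACK for TLB tracking, as in
		 * tdx_sept_remove_private_spte(). */
		tdx_track(kvm);
		return 1;
	}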


> +
> +	return tdx_spte_demote_private_spte(kvm, gfn, level, page);
> +}
> +
>
[...]

^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 09/21] KVM: TDX: Enable 2MB mapping size after TD is RUNNABLE
  2025-05-19 16:53             ` Edgecombe, Rick P
@ 2025-05-20  9:34               ` Yan Zhao
  2025-05-20 23:47                 ` Huang, Kai
  2025-05-21 15:40                 ` Edgecombe, Rick P
  0 siblings, 2 replies; 294+ messages in thread
From: Yan Zhao @ 2025-05-20  9:34 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: Huang, Kai, Shutemov, Kirill, Li, Xiaoyao, Du, Fan, Hansen, Dave,
	david@redhat.com, Li, Zhiquan1, vbabka@suse.cz, tabba@google.com,
	thomas.lendacky@amd.com, linux-kernel@vger.kernel.org,
	seanjc@google.com, Weiny, Ira, michael.roth@amd.com,
	pbonzini@redhat.com, Yamahata, Isaku, ackerleytng@google.com,
	binbin.wu@linux.intel.com, Peng, Chao P, quic_eberman@quicinc.com,
	Annapurve, Vishal, jroedel@suse.de, Miao, Jun,
	kvm@vger.kernel.org, pgonda@google.com, x86@kernel.org

On Tue, May 20, 2025 at 12:53:33AM +0800, Edgecombe, Rick P wrote:
> On Mon, 2025-05-19 at 16:32 +0800, Yan Zhao wrote:
> > > On the opposite, if other non-Linux TDs don't follow 1G->2M->4K accept
> > > order,
> > > e.g., they always accept 4K, there could be *endless EPT violation* if I
> > > understand your words correctly.
> > > 
> > > Isn't this yet-another reason we should choose to return PG_LEVEL_4K instead
> > > of
> > > 2M if no accept level is provided in the fault?
> > As I said, returning PG_LEVEL_4K would disallow huge pages for non-Linux TDs.
> > TD's accept operations at size > 4KB will get TDACCEPT_SIZE_MISMATCH.
> 
> TDX_PAGE_SIZE_MISMATCH is a valid error code that the guest should handle. The
> docs say the VMM needs to demote *if* the mapping is large and the accept size
> is small. But if we map at 4k size for non-accept EPT violations, we won't hit
> this case. I also wonder what is preventing the TDX module from handling a 2MB
> accept size at 4k mappings. It could be changed maybe.
> 
> But I think Kai's question was: why are we complicating the code for the case of
> non-Linux TDs that also use #VE for accept? It's not necessary to be functional,
> and there aren't any known TDs like that which are expected to use KVM today.
> (err, except the MMU stress test). So in another form the question is: should we
> optimize KVM for a case we don't even know if anyone will use? The answer seems
> obviously no to me.
So, you want to disallow huge pages for non-Linux TDs, then we have no need
to support splitting in the fault path, right?

I'm OK if we don't care about non-Linux TDs for now.
This can simplify the splitting code and we can add the support when there's a
need.

> I think this connects the question of whether we can pass the necessary info
> into fault via synthetic error code. Consider this new design:
> 
>  - tdx_gmem_private_max_mapping_level() simply returns 4k for prefetch and pre-
> runnable, otherwise returns 2MB
Why do prefetch and pre-runnable faults go through the first path, while

>  - if fault has accept info 2MB size, pass 2MB size into fault. Otherwise pass
> 4k (i.e. VMs that are relying on #VE to do the accept won't get huge pages
> *yet*).
other faults go through the second path?
 
> What goes wrong? Seems simpler and no more stuffing fault info on the vcpu.
I tried to avoid the double paths.
IMHO, it's confusing to specify max_level from two paths.

The fault info in vcpu_tdx isn't a real problem as it's per-vCPU.
An existing example in KVM is vcpu->arch.mmio_gfn.

We don't need something like the vcpu->arch.mmio_gen because
tdx->violation_gfn_* and tdx->violation_request_level are reset in each
tdx_handle_ept_violation().


BTW, dug into some history:

In v18 of TDX basic series,
enforcing 4KB for pre-runnable faults was done by passing PG_LEVEL_4K to
kvm_mmu_map_tdp_page().
https://lore.kernel.org/all/1a64f798b550dad9e096603e8dae3b6e8fb2fbd5.1705965635.git.isaku.yamahata@intel.com/
https://lore.kernel.org/all/97bb1f2996d8a7b828cd9e3309380d1a86ca681b.1705965635.git.isaku.yamahata@intel.com/

For the other faults, it was done by altering max_level in kvm_mmu_do_page_fault(),
and Paolo asked to use the tdx_gmem_private_max_mapping_level() path.
https://lore.kernel.org/all/CABgObfbu1-Ok607uYdo4DzwZf8ZGVQnvHU+y9_M1Zae55K5xwQ@mail.gmail.com/

For the patch "KVM: x86/mmu: Allow per-VM override of the TDP max page level",
it was initially acked by Paolo in v2, and Sean's reply is at
https://lore.kernel.org/all/YO3%2FgvK9A3tgYfT6@google.com .

^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 15/21] KVM: TDX: Support huge page splitting with exclusive kvm->mmu_lock
  2025-05-20  6:18   ` Binbin Wu
@ 2025-05-20  9:40     ` Yan Zhao
  0 siblings, 0 replies; 294+ messages in thread
From: Yan Zhao @ 2025-05-20  9:40 UTC (permalink / raw)
  To: Binbin Wu
  Cc: pbonzini, seanjc, linux-kernel, kvm, x86, rick.p.edgecombe,
	dave.hansen, kirill.shutemov, tabba, ackerleytng, quic_eberman,
	michael.roth, david, vannapurve, vbabka, jroedel, thomas.lendacky,
	pgonda, zhiquan1.li, fan.du, jun.miao, ira.weiny, isaku.yamahata,
	xiaoyao.li, chao.p.peng

On Tue, May 20, 2025 at 02:18:12PM +0800, Binbin Wu wrote:
> 
> 
> On 4/24/2025 11:08 AM, Yan Zhao wrote:
> [...]
> > +
> > +int tdx_sept_split_private_spt(struct kvm *kvm, gfn_t gfn, enum pg_level level,
> > +			       void *private_spt)
> > +{
> > +	struct page *page = virt_to_page(private_spt);
> > +	int ret;
> > +
> > +	if (KVM_BUG_ON(to_kvm_tdx(kvm)->state != TD_STATE_RUNNABLE || level != PG_LEVEL_2M, kvm))
> > +		return -EINVAL;
> > +
> > +	ret = tdx_sept_zap_private_spte(kvm, gfn, level, page);
> > +	if (ret <= 0)
> > +		return ret;
> > +
> > +	tdx_track(kvm);
> 
> It may worth a helper for the zap and track code.
> It's the some code as what in tdx_sept_remove_private_spte().
> So that they can share the code, including the bug check for HKID and the
> comments.
Not sure if it's worthwhile.
But I'm open to it if others also agree.

> 
> > +
> > +	return tdx_spte_demote_private_spte(kvm, gfn, level, page);
> > +}
> > +
> > 
> [...]

^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 14/21] KVM: x86/tdp_mmu: Invoke split_external_spt hook with exclusive mmu_lock
  2025-05-20  5:40   ` Binbin Wu
@ 2025-05-20  9:40     ` Yan Zhao
  0 siblings, 0 replies; 294+ messages in thread
From: Yan Zhao @ 2025-05-20  9:40 UTC (permalink / raw)
  To: Binbin Wu
  Cc: pbonzini, seanjc, linux-kernel, kvm, x86, rick.p.edgecombe,
	dave.hansen, kirill.shutemov, tabba, ackerleytng, quic_eberman,
	michael.roth, david, vannapurve, vbabka, jroedel, thomas.lendacky,
	pgonda, zhiquan1.li, fan.du, jun.miao, ira.weiny, isaku.yamahata,
	xiaoyao.li, chao.p.peng

On Tue, May 20, 2025 at 01:40:46PM +0800, Binbin Wu wrote:
> 
> 
> On 4/24/2025 11:07 AM, Yan Zhao wrote:
> [...]
> > +static int split_external_spt(struct kvm *kvm, gfn_t gfn, u64 old_spte,
> > +			      u64 new_spte, int level)
> > +{
> > +	void *external_spt = get_external_spt(gfn, new_spte, level);
> > +	int ret;
> > +
> > +	KVM_BUG_ON(!external_spt, kvm);
> > +
> > +	ret = static_call(kvm_x86_split_external_spt)(kvm, gfn, level, external_spt);
> Better to use kvm_x86_call() instead of static_call().
Will do. Thanks!
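
i.e.,

	ret = kvm_x86_call(split_external_spt)(kvm, gfn, level, external_spt);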

> > +	KVM_BUG_ON(ret, kvm);
> > +
> > +	return ret;
> > +}
> >   /**
> >    * handle_removed_pt() - handle a page table removed from the TDP structure
> >    *
> > @@ -764,13 +778,13 @@ static u64 tdp_mmu_set_spte(struct kvm *kvm, int as_id, tdp_ptep_t sptep,
> >   	handle_changed_spte(kvm, as_id, gfn, old_spte, new_spte, level, false);
> > -	/*
> > -	 * Users that do non-atomic setting of PTEs don't operate on mirror
> > -	 * roots, so don't handle it and bug the VM if it's seen.
> > -	 */
> >   	if (is_mirror_sptep(sptep)) {
> > -		KVM_BUG_ON(is_shadow_present_pte(new_spte), kvm);
> > -		remove_external_spte(kvm, gfn, old_spte, level);
> > +		if (!is_shadow_present_pte(new_spte))
> > +			remove_external_spte(kvm, gfn, old_spte, level);
> > +		else if (is_last_spte(old_spte, level) && !is_last_spte(new_spte, level))
> > +			split_external_spt(kvm, gfn, old_spte, new_spte, level);
> > +		else
> > +			KVM_BUG_ON(1, kvm);
> >   	}
> >   	return old_spte;
> 

^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 10/21] KVM: x86/mmu: Disallow page merging (huge page adjustment) for mirror root
  2025-05-19 17:42           ` Edgecombe, Rick P
@ 2025-05-20 10:11             ` Yan Zhao
  0 siblings, 0 replies; 294+ messages in thread
From: Yan Zhao @ 2025-05-20 10:11 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: quic_eberman@quicinc.com, Li, Xiaoyao, kvm@vger.kernel.org,
	Hansen, Dave, david@redhat.com, Li, Zhiquan1,
	thomas.lendacky@amd.com, tabba@google.com, Du, Fan,
	linux-kernel@vger.kernel.org, seanjc@google.com, Weiny, Ira,
	pbonzini@redhat.com, binbin.wu@linux.intel.com, Yamahata, Isaku,
	michael.roth@amd.com, vbabka@suse.cz, ackerleytng@google.com,
	Peng, Chao P, Shutemov, Kirill, Annapurve, Vishal,
	jroedel@suse.de, Miao, Jun, pgonda@google.com, x86@kernel.org

On Tue, May 20, 2025 at 01:42:20AM +0800, Edgecombe, Rick P wrote:
> On Mon, 2025-05-19 at 11:57 +0800, Yan Zhao wrote:
> > Like below?
> > 
> > diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> > index 1b2bacde009f..0e4a03f44036 100644
> > --- a/arch/x86/kvm/mmu/tdp_mmu.c
> > +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> > @@ -1275,6 +1275,11 @@ static int tdp_mmu_link_sp(struct kvm *kvm, struct tdp_iter *iter,
> >         return 0;
> >  }
> > 
> > +static inline bool is_fault_disallow_huge_page_adust(struct kvm_page_fault *fault, bool is_mirror)
> > +{
> > +       return fault->nx_huge_page_workaround_enabled || is_mirror;
> > +}
> 
> Err, no. It doesn't seem worth it.
> 
> > +
> >  /*
> >   * Handle a TDP page fault (NPT/EPT violation/misconfiguration) by installing
> >   * page tables and SPTEs to translate the faulting guest physical address.
> > @@ -1297,7 +1302,7 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> >         for_each_tdp_pte(iter, kvm, root, fault->gfn, fault->gfn + 1) {
> >                 int r;
> > 
> > -               if (fault->nx_huge_page_workaround_enabled || is_mirror)
> > +               if (is_fault_disallow_huge_page_adust(fault, is_mirror))
> >                         disallowed_hugepage_adjust(fault, iter.old_spte, iter.level, is_mirror);
> > 
> >                 /*
> > 
> > 
> > 
> > > Also, why not check is_mirror_sp() in disallowed_hugepage_adjust() instead of
> > > passing in an is_mirror arg?
> > It's an optimization.
> 
> But disallowed_hugepage_adjust() is already checking the sp.
> 
> I think part of the thing that is bugging me is that
> nx_huge_page_workaround_enabled is not conceptually about whether the specific
> fault/level needs to disallow huge page adjustments, it's whether it needs to
> check if it does. Then disallowed_hugepage_adjust() does the actual specific
> checking. But for the mirror logic the check is the same for both. It's
> asymmetric with NX huge pages, and just sort of jammed in. It would be easier to
> follow if the kvm_tdp_mmu_map() conditional checked wither mirror TDP was
> "active", rather than the mirror role.
You are right. It looks clearer.

> > 
> > As is_mirror_sptep(iter->sptep) == is_mirror_sp(root), passing in is_mirror arg
> > can avoid checking mirror for each sp, which remains unchanged in a root.
> 
> Why not just this. It seems easier to comprehend to me. It does add a little bit
> of extra checking in the shared fault for TDX only. I think it's ok and better
> not to litter the generic MMU code.
> 
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index a284dce227a0..37ca77f2ee15 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -3328,11 +3328,13 @@ void kvm_mmu_hugepage_adjust(struct kvm_vcpu *vcpu,
> struct kvm_page_fault *fault
>  
>  void disallowed_hugepage_adjust(struct kvm_page_fault *fault, u64 spte, int
> cur_level)
>  {
> +       struct kvm_mmu_page * sp = spte_to_child_sp(spte);
If !is_shadow_present_pte(spte) or spte is a leaf entry, it's incorrect to
retrieve the child sp. So, maybe

-           spte_to_child_sp(spte)->nx_huge_page_disallowed) {
+           (spte_to_child_sp(spte)->nx_huge_page_disallowed ||
+            is_mirror_sp(spte_to_child_sp(spte)))) {
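
i.e., the resulting check would roughly be

	if (cur_level > PG_LEVEL_4K &&
	    cur_level == fault->goal_level &&
	    is_shadow_present_pte(spte) &&
	    !is_large_pte(spte) &&
	    (spte_to_child_sp(spte)->nx_huge_page_disallowed ||
	     is_mirror_sp(spte_to_child_sp(spte)))) {

keeping the child-sp dereference after the present/!large checks.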

Other changes look good to me.

> +
>         if (cur_level > PG_LEVEL_4K &&
>             cur_level == fault->goal_level &&
>             is_shadow_present_pte(spte) &&
>             !is_large_pte(spte) &&
> -           spte_to_child_sp(spte)->nx_huge_page_disallowed) {
> +           (sp->nx_huge_page_disallowed || sp->role.is_mirror)) {
>                 /*
>                  * A small SPTE exists for this pfn, but FNAME(fetch),
>                  * direct_map(), or kvm_tdp_mmu_map() would like to create a
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 405874f4d088..1d22994576b5 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -1244,6 +1244,8 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct
> kvm_page_fault *fault)
>         struct tdp_iter iter;
>         struct kvm_mmu_page *sp;
>         int ret = RET_PF_RETRY;
> +       bool hugepage_adjust_disallowed = fault->nx_huge_page_workaround_enabled
> ||
> +                                         kvm_has_mirrored_tdp(kvm);
>  
>         kvm_mmu_hugepage_adjust(vcpu, fault);
>  
> @@ -1254,7 +1256,7 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct
> kvm_page_fault *fault)
>         for_each_tdp_pte(iter, kvm, root, fault->gfn, fault->gfn + 1) {
>                 int r;
>  
> -               if (fault->nx_huge_page_workaround_enabled)
> +               if (hugepage_adjust_disallowed)
>                         disallowed_hugepage_adjust(fault, iter.old_spte,
> iter.level);
>  
>                 /*
> 
>  
>                 /*
> 

^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 09/21] KVM: TDX: Enable 2MB mapping size after TD is RUNNABLE
  2025-05-19  8:32           ` Yan Zhao
  2025-05-19 16:53             ` Edgecombe, Rick P
@ 2025-05-20 23:34             ` Huang, Kai
  2025-05-21  2:35               ` Yan Zhao
  1 sibling, 1 reply; 294+ messages in thread
From: Huang, Kai @ 2025-05-20 23:34 UTC (permalink / raw)
  To: Zhao, Yan Y
  Cc: Shutemov, Kirill, Li, Xiaoyao, Du, Fan, Hansen, Dave,
	david@redhat.com, Li, Zhiquan1, vbabka@suse.cz, tabba@google.com,
	thomas.lendacky@amd.com, linux-kernel@vger.kernel.org,
	seanjc@google.com, Weiny, Ira, michael.roth@amd.com,
	pbonzini@redhat.com, Yamahata, Isaku, ackerleytng@google.com,
	binbin.wu@linux.intel.com, Peng, Chao P, quic_eberman@quicinc.com,
	Annapurve, Vishal, Edgecombe, Rick P, jroedel@suse.de, Miao, Jun,
	kvm@vger.kernel.org, pgonda@google.com, x86@kernel.org

On Mon, 2025-05-19 at 16:32 +0800, Zhao, Yan Y wrote:
> > But in the above text you mentioned that, if doing so, because we choose to
> > ignore splitting request on read, returning 2M could result in *endless* EPT
> > violation.
> I don't get what you mean.
> What's the relationship between splitting and "returning 2M could result in
> *endless* EPT" ?
> 
> > So to me it seems you choose a design that could bring performance gain for
> > certain non-Linux TDs when they follow a certain behaviour but otherwise could
> > result in endless EPT violation in KVM.
> Also don't understand here.
> Which design could result in endless EPT violation?

[Sorry somehow I didn't see your replies yesterday in my mailbox.]

You mentioned the below in your cover letter:

    (b) with shared kvm->mmu_lock, triggered by fault.

    ....

    This series simply ignores the splitting request in the fault path to
    avoid unnecessary bounces between levels. The vCPU that performs ACCEPT
    at a lower level would finally figures out the page has been accepted
    at a higher level by another vCPU.

    ... The worst outcome to ignore the resulting
    splitting request is an endless EPT violation. This would not happen
    for a Linux guest, which does not expect any #VE.

So to me, IIUC, this means:

 - this series chooses to ignore the splitting request on read ...
 - the worst outcome of ignoring the resulting splitting request is an endless
   EPT violation...

And this happens exactly in the case below:

 1) Guest touches a 4K page
 2) KVM AUGs 2M page
 3) Guest re-accesses that 4K page, and receives #VE
 4) Guest ACCEPTs that 4K page, this triggers EPT violation

IIUC, you choose to ignore splitting the large page in step 4) (am I right?).
Then if the guest always ACCEPTs pages at 4K level, KVM will hit *endless EPT
violations*.

So, is this the "worst outcome to ignore the resulting splitting request" that
you mentioned in your changelog?

If it is, then why is it OK?

It is OK *ONLY* when "guest always ACCEPTs 4K page" is a buggy behaviour of the
guest itself (which KVM is not responsible for).  I.e., the guest is always
supposed to find the page size that KVM has AUGed upon receiving the #VE (does
the #VE contain such information?) and then do ACCEPT at that page level.

Otherwise, if it's a legal behaviour for the guest to always ACCEPT at 4K level,
then I don't think it's OK to have endless EPT violation in KVM.

^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 09/21] KVM: TDX: Enable 2MB mapping size after TD is RUNNABLE
  2025-05-20  9:34               ` Yan Zhao
@ 2025-05-20 23:47                 ` Huang, Kai
  2025-06-11 14:42                   ` Sean Christopherson
  2025-05-21 15:40                 ` Edgecombe, Rick P
  1 sibling, 1 reply; 294+ messages in thread
From: Huang, Kai @ 2025-05-20 23:47 UTC (permalink / raw)
  To: Zhao, Yan Y, Edgecombe, Rick P
  Cc: Shutemov, Kirill, Li, Xiaoyao, Du, Fan, Hansen, Dave,
	david@redhat.com, Li, Zhiquan1, thomas.lendacky@amd.com,
	tabba@google.com, quic_eberman@quicinc.com,
	linux-kernel@vger.kernel.org, seanjc@google.com, Weiny, Ira,
	vbabka@suse.cz, pbonzini@redhat.com, Yamahata, Isaku,
	michael.roth@amd.com, binbin.wu@linux.intel.com,
	ackerleytng@google.com, Peng, Chao P, kvm@vger.kernel.org,
	Annapurve, Vishal, jroedel@suse.de, Miao, Jun, pgonda@google.com,
	x86@kernel.org

On Tue, 2025-05-20 at 17:34 +0800, Zhao, Yan Y wrote:
> On Tue, May 20, 2025 at 12:53:33AM +0800, Edgecombe, Rick P wrote:
> > On Mon, 2025-05-19 at 16:32 +0800, Yan Zhao wrote:
> > > > On the opposite, if other non-Linux TDs don't follow 1G->2M->4K accept
> > > > order,
> > > > e.g., they always accept 4K, there could be *endless EPT violation* if I
> > > > understand your words correctly.
> > > > 
> > > > Isn't this yet-another reason we should choose to return PG_LEVEL_4K instead
> > > > of
> > > > 2M if no accept level is provided in the fault?
> > > As I said, returning PG_LEVEL_4K would disallow huge pages for non-Linux TDs.
> > > TD's accept operations at size > 4KB will get TDACCEPT_SIZE_MISMATCH.
> > 
> > TDX_PAGE_SIZE_MISMATCH is a valid error code that the guest should handle. The
> > docs say the VMM needs to demote *if* the mapping is large and the accept size
> > is small. But if we map at 4k size for non-accept EPT violations, we won't hit
> > this case. I also wonder what is preventing the TDX module from handling a 2MB
> > accept size at 4k mappings. It could be changed maybe.
> > 
> > But I think Kai's question was: why are we complicating the code for the case of
> > non-Linux TDs that also use #VE for accept? It's not necessary to be functional,
> > and there aren't any known TDs like that which are expected to use KVM today.
> > (err, except the MMU stress test). So in another form the question is: should we
> > optimize KVM for a case we don't even know if anyone will use? The answer seems
> > obviously no to me.
> So, you want to disallow huge pages for non-Linux TDs, then we have no need
> to support splitting in the fault path, right?
> 
> I'm OK if we don't care non-Linux TDs for now.
> This can simplify the splitting code and we can add the support when there's a
> need.

For the record, I am not saying we don't care about non-Linux TDs.  I am worried
about the *endless* EPT violation in your words below:

    ... The worst outcome to ignore the resulting
    splitting request is an endless EPT violation.  This would not happen
    for a Linux guest, which does not expect any #VE.

And the point is, it's not OK if a *legal* guest behaviour can trigger this.

 

^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 09/21] KVM: TDX: Enable 2MB mapping size after TD is RUNNABLE
  2025-05-20 23:34             ` Huang, Kai
@ 2025-05-21  2:35               ` Yan Zhao
  0 siblings, 0 replies; 294+ messages in thread
From: Yan Zhao @ 2025-05-21  2:35 UTC (permalink / raw)
  To: Huang, Kai
  Cc: Shutemov, Kirill, Li, Xiaoyao, Du, Fan, Hansen, Dave,
	david@redhat.com, Li, Zhiquan1, vbabka@suse.cz, tabba@google.com,
	thomas.lendacky@amd.com, linux-kernel@vger.kernel.org,
	seanjc@google.com, Weiny, Ira, michael.roth@amd.com,
	pbonzini@redhat.com, Yamahata, Isaku, ackerleytng@google.com,
	binbin.wu@linux.intel.com, Peng, Chao P, quic_eberman@quicinc.com,
	Annapurve, Vishal, Edgecombe, Rick P, jroedel@suse.de, Miao, Jun,
	kvm@vger.kernel.org, pgonda@google.com, x86@kernel.org

On Wed, May 21, 2025 at 07:34:52AM +0800, Huang, Kai wrote:
> On Mon, 2025-05-19 at 16:32 +0800, Zhao, Yan Y wrote:
> > > But in the above text you mentioned that, if doing so, because we choose to
> > > ignore splitting request on read, returning 2M could result in *endless* EPT
> > > violation.
> > I don't get what you mean.
> > What's the relationship between splitting and "returning 2M could result in
> > *endless* EPT" ?
> > 
> > > So to me it seems you choose a design that could bring performance gain for
> > > certain non-Linux TDs when they follow a certain behaviour but otherwise could
> > > result in endless EPT violation in KVM.
> > Also don't understand here.
> > Which design could result in endless EPT violation?
> 
> [Sorry somehow I didn't see your replies yesterday in my mailbox.]
> 
> You mentioned below in your coverletter:
> 
>     (b) with shared kvm->mmu_lock, triggered by fault.
> 
>     ....
> 
>     This series simply ignores the splitting request in the fault path to
>     avoid unnecessary bounces between levels. The vCPU that performs ACCEPT
>     at a lower level would finally figures out the page has been accepted
>     at a higher level by another vCPU.
> 
>     ... The worst outcome to ignore the resulting
>     splitting request is an endless EPT violation. This would not happen
>     for a Linux guest, which does not expect any #VE.
> 
> So to me, IIUC, this means:
> 
>  - this series choose to ignore splitting request when read ..
>  - the worse outcome to ignore the resulting splitting request is an endless
>    EPT violation..
> 
> And this happens exactly in below case:
> 
>  1) Guest touches a 4K page
>  2) KVM AUGs 2M page
>  3) Guest re-accesses that 4K page, and receives #VE
>  4) Guest ACCEPTs that 4K page, this triggers EPT violation
> 
> IIUC, you choose to ignore splitting large page in step 4) (am I right???). 
> Then if guest always ACCEPTs page at 4K level, then KVM will have *endless EPT
> violation*.
> 
> So, is this the "worst outcome to ignore the resulting splitting request" that
> you mentioned in your changelog?
> 
> If it is, then why is it OK?
Initially I assumed the guest should always accept in the sequence
"1G->2M->4K", as the Linux guest does.

If that were true, we could simply ignore the splitting request in the fault
(shared) path, because it would be the guest that doesn't follow the
convention.

However, Kirill and you are right, the guest can accept at 4K.

Given that, the "worst outcome to ignore the resulting splitting request" is not
OK. 

> It is OK *ONLY* when "guest always ACCEPTs 4K page" is a buggy behaviour of the
> guest itself (which KVM is not responsible for).  I.e., the guest is always
> supposed to find the page size that KVM has AUGed upon receiving the #VE (does
> the #VE contain such information?) and then do ACCEPT at that page level.
> 
> Otherwise, if it's a legal behaviour for the guest to always ACCEPT at 4K level,
> then I don't think it's OK to have endless EPT violation in KVM.
We can avoid the endless EPT violation by allowing splitting in the fault
path, though that involves introducing several locks in the TDX code. I had
a POC for that, but we felt it's better to keep the initial support
simple.

So, if we all agree not to support huge pages for non-Linux TDs as an initial
step, your proposal is a good idea to keep the splitting code simple.

^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 20/21] KVM: x86: Force a prefetch fault's max mapping level to 4KB for TDX
  2025-04-24  3:09 ` [RFC PATCH 20/21] KVM: x86: Force a prefetch fault's max mapping level to 4KB for TDX Yan Zhao
  2025-05-13 23:20   ` Edgecombe, Rick P
@ 2025-05-21  3:30   ` Binbin Wu
  2025-05-21  5:03     ` Yan Zhao
  1 sibling, 1 reply; 294+ messages in thread
From: Binbin Wu @ 2025-05-21  3:30 UTC (permalink / raw)
  To: Yan Zhao
  Cc: pbonzini, seanjc, linux-kernel, kvm, x86, rick.p.edgecombe,
	dave.hansen, kirill.shutemov, tabba, ackerleytng, quic_eberman,
	michael.roth, david, vannapurve, vbabka, jroedel, thomas.lendacky,
	pgonda, zhiquan1.li, fan.du, jun.miao, ira.weiny, isaku.yamahata,
	xiaoyao.li, chao.p.peng



On 4/24/2025 11:09 AM, Yan Zhao wrote:
> Introduce a "prefetch" parameter to the private_max_mapping_level hook and
> enforce the max mapping level of a prefetch fault for private memory to be
> 4KB. This is a preparation to enable the ignoring huge page splitting in
> the fault path.
>
> If a prefetch fault results in a 2MB huge leaf in the mirror page table,
> there may not be a vCPU available to accept the corresponding 2MB huge leaf
> in the S-EPT if the TD is not configured to receive #VE for page
> acceptance. Consequently, if a vCPU accepts the page at 4KB level, it will
> trigger an EPT violation to split the 2MB huge leaf generated by the
> prefetch fault.
>
> Since handling the BUSY error from SEAMCALLs for huge page splitting is
> more comprehensive in the fault path, which is with kvm->mmu_lock held for
> reading, force the max mapping level of a prefetch fault of private memory
> to be 4KB to prevent potential splitting.
>
> Since prefetch faults for private memory are uncommon after the TD's build
> time, enforcing a 4KB mapping level is unlikely to cause any performance
> degradation.
I am wondering what the use cases for KVM_PRE_FAULT_MEMORY are.
Is there any API usage guidance saying that userspace shouldn't use it to
pre-fault a large amount of memory? If not, and userspace does use it to
pre-fault a lot of memory, the "unlikely to cause any performance degradation"
claim might not hold.



^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 20/21] KVM: x86: Force a prefetch fault's max mapping level to 4KB for TDX
  2025-05-21  3:30   ` Binbin Wu
@ 2025-05-21  5:03     ` Yan Zhao
  0 siblings, 0 replies; 294+ messages in thread
From: Yan Zhao @ 2025-05-21  5:03 UTC (permalink / raw)
  To: Binbin Wu
  Cc: pbonzini, seanjc, linux-kernel, kvm, x86, rick.p.edgecombe,
	dave.hansen, kirill.shutemov, tabba, ackerleytng, quic_eberman,
	michael.roth, david, vannapurve, vbabka, jroedel, thomas.lendacky,
	pgonda, zhiquan1.li, fan.du, jun.miao, ira.weiny, isaku.yamahata,
	xiaoyao.li, chao.p.peng

On Wed, May 21, 2025 at 11:30:42AM +0800, Binbin Wu wrote:
> 
> 
> On 4/24/2025 11:09 AM, Yan Zhao wrote:
> > Introduce a "prefetch" parameter to the private_max_mapping_level hook and
> > enforce the max mapping level of a prefetch fault for private memory to be
> > 4KB. This is a preparation to enable the ignoring huge page splitting in
> > the fault path.
> > 
> > If a prefetch fault results in a 2MB huge leaf in the mirror page table,
> > there may not be a vCPU available to accept the corresponding 2MB huge leaf
> > in the S-EPT if the TD is not configured to receive #VE for page
> > acceptance. Consequently, if a vCPU accepts the page at 4KB level, it will
> > trigger an EPT violation to split the 2MB huge leaf generated by the
> > prefetch fault.
> > 
> > Since handling the BUSY error from SEAMCALLs for huge page splitting is
> > more comprehensive in the fault path, which is with kvm->mmu_lock held for
> > reading, force the max mapping level of a prefetch fault of private memory
> > to be 4KB to prevent potential splitting.
> > 
> > Since prefetch faults for private memory are uncommon after the TD's build
> > time, enforcing a 4KB mapping level is unlikely to cause any performance
> > degradation.
> I am wondering what are the use cases for KVM_PRE_FAULT_MEMORY.
> Is there an API usage guide to limit that userspace shouldn't use it for a large
> amount of memory pre-fault? If no, and userspace uses it to pre-fault a lot of
> memory, this "unlikely to cause any performance degradation" might be not true.
Currently, there are no known users of KVM_PRE_FAULT_MEMORY.
We can enable huge page support for prefetch faults (along with allowing
splitting in the fault path) in the future if performance considerations arise
for future users.

^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 09/21] KVM: TDX: Enable 2MB mapping size after TD is RUNNABLE
  2025-05-20  9:34               ` Yan Zhao
  2025-05-20 23:47                 ` Huang, Kai
@ 2025-05-21 15:40                 ` Edgecombe, Rick P
  2025-05-22  3:52                   ` Yan Zhao
  1 sibling, 1 reply; 294+ messages in thread
From: Edgecombe, Rick P @ 2025-05-21 15:40 UTC (permalink / raw)
  To: Zhao, Yan Y
  Cc: Du, Fan, Li, Xiaoyao, Huang, Kai, Shutemov, Kirill, Hansen, Dave,
	david@redhat.com, thomas.lendacky@amd.com, tabba@google.com,
	Li, Zhiquan1, quic_eberman@quicinc.com, michael.roth@amd.com,
	linux-kernel@vger.kernel.org, seanjc@google.com,
	pbonzini@redhat.com, binbin.wu@linux.intel.com, Weiny, Ira,
	Yamahata, Isaku, vbabka@suse.cz, ackerleytng@google.com,
	Peng, Chao P, kvm@vger.kernel.org, Annapurve, Vishal,
	jroedel@suse.de, Miao, Jun, pgonda@google.com, x86@kernel.org

On Tue, 2025-05-20 at 17:34 +0800, Yan Zhao wrote:
> So, you want to disallow huge pages for non-Linux TDs, then we have no need
> to support splitting in the fault path, right?
> 
> I'm OK if we don't care non-Linux TDs for now.
> This can simplify the splitting code and we can add the support when there's a
> need.

We do need to care about non-Linux TDs functioning, but we don't need to
optimize for them at this point. We need to optimize for things that happen
often. Pending-#VE using TDs are rare, and don't need to have huge pages in
order to work.

Yesterday Kirill and I were chatting offline about the newly defined
TDG.MEM.PAGE.RELEASE. It is kind of like an unaccept, so another possibility is:
1. Guest accepts at 2MB
2. Guest releases at 2MB (no notice to VMM)
3. Guest accepts at 4k, EPT violation with expectation to demote

In that case, KVM won't know to expect it, and that it needs to preemptively map
things at 4k.

For full coverage of the issue, can we discuss a little bit about what demote in
the fault path would look like? The current zapping operation that is involved
depends on the mmu write lock. And I remember you had a POC that added essentially a
hidden exclusive lock in TDX code as a substitute. But unlike the other callers,
the fault path demote case could actually handle failure. So if we just returned
busy and didn't try to force the retry, we would just run the risk of
interfering with the TDX module's SEPT lock? Is that the only issue with a
design that would allow failure of demote in the fault path?

Let's keep in mind that we could ask for TDX module changes to enable this path.
I think we could probably get away with ignoring TDG.MEM.PAGE.RELEASE if we had
a plan to fix it up with TDX module changes. And if the ultimate root cause of
the complication is avoiding zero-step (sept lock), we should fix that instead
of designing around it further.

> 
> > I think this connects the question of whether we can pass the necessary info
> > into fault via synthetic error code. Consider this new design:
> > 
> >   - tdx_gmem_private_max_mapping_level() simply returns 4k for prefetch and pre-
> > runnable, otherwise returns 2MB
> Why prefetch and pre-runnable faults go the first path, while

Because these are either passed into private_max_mapping_level(), or not
associated with the fault (runnable state).

> 
> >   - if fault has accept info 2MB size, pass 2MB size into fault. Otherwise pass
> > 4k (i.e. VMs that are relying on #VE to do the accept won't get huge pages
> > *yet*).
> other faults go the second path?

This info is related to the specific fault.

>  
> > What goes wrong? Seems simpler and no more stuffing fault info on the vcpu.
> I tried to avoid the double paths.
> IMHO, it's confusing to specify max_level from two paths.
> 
> The fault info in vcpu_tdx isn't a real problem as it's per-vCPU.
> An existing example in KVM is vcpu->arch.mmio_gfn.

mmio_gfn isn't info about the fault though, it's info about the gfn being mmio.
So not fault scoped.

> 
> We don't need something like the vcpu->arch.mmio_gen because
> tdx->violation_gfn_* and tdx->violation_request_level are reset in each
> tdx_handle_ept_violation().
> 
> 
> BTW, dug into some history:
> 
> In v18 of TDX basic series,
> enforcing 4KB for pre-runnable faults were done by passing PG_LEVEL_4K to
> kvm_mmu_map_tdp_page().
> https://lore.kernel.org/all/1a64f798b550dad9e096603e8dae3b6e8fb2fbd5.1705965635.git.isaku.yamahata@intel.com/
> https://lore.kernel.org/all/97bb1f2996d8a7b828cd9e3309380d1a86ca681b.1705965635.git.isaku.yamahata@intel.com/
> 
> For the other faults, it's done by altering max_level in kvm_mmu_do_page_fault(),
> and Paolo asked to use the tdx_gmem_private_max_mapping_level() path.
> https://lore.kernel.org/all/CABgObfbu1-Ok607uYdo4DzwZf8ZGVQnvHU+y9_M1Zae55K5xwQ@mail.gmail.com/
> 
> For the patch "KVM: x86/mmu: Allow per-VM override of the TDP max page level",
> it's initially acked by Paolo in v2, and Sean's reply is at
> https://lore.kernel.org/all/YO3%2FgvK9A3tgYfT6@google.com .

The SNP case is not checking fault info, it's closer to the other cases. I don't
see that any of that conversation applies to this case. Can you clarify?

On the subject of whether to pass the accept level into the fault, or stuff it
on the vcpu, I'm still in the camp that it is better to pass it in the error
code. If you disagree, let's see if we can flag down Sean and Paolo to weigh in.





* Re: [RFC PATCH 09/21] KVM: TDX: Enable 2MB mapping size after TD is RUNNABLE
  2025-05-21 15:40                 ` Edgecombe, Rick P
@ 2025-05-22  3:52                   ` Yan Zhao
  2025-05-23 23:40                     ` Edgecombe, Rick P
  0 siblings, 1 reply; 294+ messages in thread
From: Yan Zhao @ 2025-05-22  3:52 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: Du, Fan, Li, Xiaoyao, Huang, Kai, Shutemov, Kirill, Hansen, Dave,
	david@redhat.com, thomas.lendacky@amd.com, tabba@google.com,
	Li, Zhiquan1, quic_eberman@quicinc.com, michael.roth@amd.com,
	linux-kernel@vger.kernel.org, seanjc@google.com,
	pbonzini@redhat.com, binbin.wu@linux.intel.com, Weiny, Ira,
	Yamahata, Isaku, vbabka@suse.cz, ackerleytng@google.com,
	Peng, Chao P, kvm@vger.kernel.org, Annapurve, Vishal,
	jroedel@suse.de, Miao, Jun, pgonda@google.com, x86@kernel.org

On Wed, May 21, 2025 at 11:40:15PM +0800, Edgecombe, Rick P wrote:
> On Tue, 2025-05-20 at 17:34 +0800, Yan Zhao wrote:
> > So, you want to disallow huge pages for non-Linux TDs, then we have no need
> > to support splitting in the fault path, right?
> > 
> > I'm OK if we don't care non-Linux TDs for now.
> > This can simplify the splitting code and we can add the support when there's a
> > need.
> 
> We do need to care about non-Linux TDs functioning, but we don't need to
> optimize for them at this point. We need to optimize for things that happen
> often. Pending-#VE using TDs are rare, and don't need to have huge pages in
> order to work.
> 
> Yesterday Kirill and I were chatting offline about the newly defined
> TDG.MEM.PAGE.RELEASE. It is kind of like an unaccept, so another possibility is:
> 1. Guest accepts at 2MB
> 2. Guest releases at 2MB (no notice to VMM)
> 3. Guest accepts at 4k, EPT violation with expectation to demote
> 
> In that case, KVM won't know to expect it, and that it needs to preemptively map
> things at 4k.
> 
> For full coverage of the issue, can we discuss a little bit about what demote in
> the fault path would look like?
For demote in the fault path, it will take the mmu read lock.

So, the flow in the fault path is
1. zap with mmu read lock.
   ret = tdx_sept_zap_private_spte(kvm, gfn, level, page, true);
   if (ret <= 0)
       return ret;
2. track with mmu read lock
   ret = tdx_track(kvm, true);
   if (ret)
       return ret;
3. demote with mmu read lock
   ret = tdx_spte_demote_private_spte(kvm, gfn, level, page, true);
   if (ret)
       goto err;
4. return success or unzap as error fallback.
   tdx_sept_unzap_private_spte(kvm, gfn, level);

Steps 1-3 will return -EBUSY on a busy error (which will not happen very often,
as we will introduce kvm_tdx->sept_lock; I can post the full lock analysis if
necessary).
Step 4 is guaranteed to succeed.

Here's the detailed code for step 1, 3 and 4.

static int tdx_sept_zap_private_spte(struct kvm *kvm, gfn_t gfn,
                                     enum pg_level level, struct page *page,
                                     bool mmu_lock_shared)
{
        int tdx_level = pg_level_to_tdx_sept_level(level);
        struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
        gpa_t gpa = gfn_to_gpa(gfn) & KVM_HPAGE_MASK(level);
        u64 err, entry, level_state;

        /* Before TD runnable, large page is not supported */
        WARN_ON_ONCE(kvm_tdx->state != TD_STATE_RUNNABLE && level != PG_LEVEL_4K);

        if (mmu_lock_shared)
                lockdep_assert_held_read(&kvm->mmu_lock);
        else
                lockdep_assert_held_write(&kvm->mmu_lock);

        write_lock(&kvm_tdx->sept_lock);
        err = tdh_mem_range_block(&kvm_tdx->td, gpa, tdx_level, &entry, &level_state);
        write_unlock(&kvm_tdx->sept_lock);

        if (unlikely(tdx_operand_busy(err))) {
                if (mmu_lock_shared)
                        return -EBUSY;

                /* After no vCPUs enter, the second retry is expected to succeed */
                write_lock(&kvm_tdx->sept_lock);
                tdx_no_vcpus_enter_start(kvm);
                err = tdh_mem_range_block(&kvm_tdx->td, gpa, tdx_level, &entry, &level_state);
                tdx_no_vcpus_enter_stop(kvm);
                write_unlock(&kvm_tdx->sept_lock);
        }

        if (tdx_is_sept_zap_err_due_to_premap(kvm_tdx, err, entry, level) &&
            !KVM_BUG_ON(!atomic64_read(&kvm_tdx->nr_premapped), kvm)) {
                atomic64_dec(&kvm_tdx->nr_premapped);
                return 0;
        }

        if (KVM_BUG_ON(err, kvm)) {
                pr_tdx_error_2(TDH_MEM_RANGE_BLOCK, err, entry, level_state);
                return -EIO;
        }
        return 1;
}

static int tdx_spte_demote_private_spte(struct kvm *kvm, gfn_t gfn,
                                        enum pg_level level, struct page *page,
                                        bool mmu_lock_shared)
{
        int tdx_level = pg_level_to_tdx_sept_level(level);
        struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
        gpa_t gpa = gfn_to_gpa(gfn);
        u64 err, entry, level_state;

        do {
                read_lock(&kvm_tdx->sept_lock);
                err = tdh_mem_page_demote(&kvm_tdx->td, gpa, tdx_level, page,
                                          &entry, &level_state);
                read_unlock(&kvm_tdx->sept_lock);
        } while (err == TDX_INTERRUPTED_RESTARTABLE);

        if (unlikely(tdx_operand_busy(err))) {
                unsigned long flags;

                if (mmu_lock_shared)
                        return -EBUSY;

                tdx_no_vcpus_enter_start(kvm);
                read_lock(&kvm_tdx->sept_lock);

                local_irq_save(flags);
                err = tdh_mem_page_demote(&kvm_tdx->td, gpa, tdx_level, page,
                                          &entry, &level_state);
                local_irq_restore(flags);
                read_unlock(&kvm_tdx->sept_lock);
                tdx_no_vcpus_enter_stop(kvm);
        }

        if (KVM_BUG_ON(err, kvm)) {
                pr_tdx_error_2(TDH_MEM_PAGE_DEMOTE, err, entry, level_state);
                return -EIO;
        }
        return 0;
}

static void tdx_sept_unzap_private_spte(struct kvm *kvm, gfn_t gfn,
                                     enum pg_level level)
{
        int tdx_level = pg_level_to_tdx_sept_level(level);
        struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
        gpa_t gpa = gfn_to_gpa(gfn) & KVM_HPAGE_MASK(level);
        u64 err, entry, level_state;

        write_lock(&kvm_tdx->sept_lock);
        err = tdh_mem_range_unblock(&kvm_tdx->td, gpa, tdx_level, &entry, &level_state);
        write_unlock(&kvm_tdx->sept_lock);

        if (unlikely(tdx_operand_busy(err))) {
                write_lock(&kvm_tdx->sept_lock);
                tdx_no_vcpus_enter_start(kvm);
                err = tdh_mem_range_unblock(&kvm_tdx->td, gpa, tdx_level, &entry, &level_state);
                tdx_no_vcpus_enter_stop(kvm);
                write_unlock(&kvm_tdx->sept_lock);
        }

        if (KVM_BUG_ON(err, kvm)) {
                pr_tdx_error_2(TDH_MEM_RANGE_UNBLOCK, err, entry, level_state);
        }
}
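
For clarity, here is how the four steps above could be glued together in the
fault path. This is only a minimal sketch (the wrapper name is made up, and
tdx_track(kvm, true) is assumed to exist with the signature used in the flow
above), not the actual patch:

static int tdx_demote_private_spte_in_fault(struct kvm *kvm, gfn_t gfn,
                                            enum pg_level level,
                                            struct page *page)
{
        int ret;

        /* Step 1: block the range; <= 0 means error or nothing left to demote. */
        ret = tdx_sept_zap_private_spte(kvm, gfn, level, page, true);
        if (ret <= 0)
                return ret;

        /* Step 2: increase the TLB tracking epoch. */
        ret = tdx_track(kvm, true);
        if (ret)
                goto err_unzap;

        /* Step 3: the actual demote. */
        ret = tdx_spte_demote_private_spte(kvm, gfn, level, page, true);
        if (ret)
                goto err_unzap;

        return 0;

err_unzap:
        /* Step 4: error fallback; unblocking is guaranteed to succeed. */
        tdx_sept_unzap_private_spte(kvm, gfn, level);
        return ret;
}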


> The current zapping operation that is involved
> depends on mmu write lock. And I remember you had a POC that added essentially a
> hidden exclusive lock in TDX code as a substitute. But unlike the other callers,
Right, the kvm_tdx->sept_lock is introduced as a rwlock. The write lock is held
for a very short period, around tdh_mem_sept_remove(), tdh_mem_range_block() and
tdh_mem_range_unblock().

The read/write status of the kvm_tdx->sept_lock corresponds to that in the TDX
module.

  Resources          SHARED  users              EXCLUSIVE users 
-----------------------------------------------------------------------
 secure_ept_lock   tdh_mem_sept_add            tdh_vp_enter
                   tdh_mem_page_aug            tdh_mem_sept_remove
                   tdh_mem_page_remove         tdh_mem_range_block
                   tdh_mem_page_promote        tdh_mem_range_unblock
                   tdh_mem_page_demote

> the fault path demote case could actually handle failure. So if we just returned
> busy and didn't try to force the retry, we would just run the risk of
> interfering with TDX module sept lock? Is that the only issue with a design that
> would allows failure of demote in the fault path?
The concern with supporting split in the fault path is mainly to avoid
unnecessary splits, e.g., when two vCPUs try to accept at different levels.

Besides that, we need to introduce 3 locks inside TDX:
rwlock_t sept_lock, spinlock_t no_vcpu_enter_lock, spinlock_t track_lock.

To ensure the success of unzap (to restore the state), kicking off vCPUs in the
fault path is required, which is not ideal. But with the introduced lock and the
proposed TDX module change to tdg_mem_page_accept() (as in the next comment),
the chance of invoking unzap is very low.

> Let's keep in mind that we could ask for TDX module changes to enable this path.
We may need a TDX module change to let tdg_mem_page_accept() not take a lock on
a non-ACCEPTable entry, to avoid contention with the guest and the potential
error TDX_HOST_PRIORITY_BUSY_TIMEOUT.

> I think we could probably get away with ignoring TDG.MEM.PAGE.RELEASE if we had
> a plan to fix it up with TDX module changes. And if the ultimate root cause of
> the complication is avoiding zero-step (sept lock), we should fix that instead
> of design around it further.
Ok.

> > > I think this connects the question of whether we can pass the necessary info
> > > into fault via synthetic error code. Consider this new design:
> > > 
> > >   - tdx_gmem_private_max_mapping_level() simply returns 4k for prefetch and pre-
> > > runnable, otherwise returns 2MB
> > Why prefetch and pre-runnable faults go the first path, while
> 
> Because these are either passed into private_max_mapping_level(), or not
> associated with the fault (runnable state).
> 
> > 
> > >   - if fault has accept info 2MB size, pass 2MB size into fault. Otherwise pass
> > > 4k (i.e. VMs that are relying on #VE to do the accept won't get huge pages
> > > *yet*).
> > other faults go the second path?
> 
> This info is related to the specific fault.
> 
> >  
> > > What goes wrong? Seems simpler and no more stuffing fault info on the vcpu.
> > I tried to avoid the double paths.
> > IMHO, it's confusing to specify max_level from two paths.
> > 
> > The fault info in vcpu_tdx isn't a real problem as it's per-vCPU.
> > An existing example in KVM is vcpu->arch.mmio_gfn.
> 
> mmio_gfn isn't info about the fault though, it's info about the gfn being mmio.
> So not fault scoped.
> 
> > 
> > We don't need something like the vcpu->arch.mmio_gen because
> > tdx->violation_gfn_* and tdx->violation_request_level are reset in each
> > tdx_handle_ept_violation().
> > 
> > 
> > BTW, dug into some history:
> > 
> > In v18 of TDX basic series,
> > enforcing 4KB for pre-runnable faults were done by passing PG_LEVEL_4K to
> > kvm_mmu_map_tdp_page().
> > https://lore.kernel.org/all/1a64f798b550dad9e096603e8dae3b6e8fb2fbd5.1705965635.git.isaku.yamahata@intel.com/
> > https://lore.kernel.org/all/97bb1f2996d8a7b828cd9e3309380d1a86ca681b.1705965635.git.isaku.yamahata@intel.com/
> > 
> > For the other faults, it's done by altering max_level in kvm_mmu_do_page_fault(),
> > and Paolo asked to use the tdx_gmem_private_max_mapping_level() path.
> > https://lore.kernel.org/all/CABgObfbu1-Ok607uYdo4DzwZf8ZGVQnvHU+y9_M1Zae55K5xwQ@mail.gmail.com/
> > 
> > For the patch "KVM: x86/mmu: Allow per-VM override of the TDP max page level",
> > it's initially acked by Paolo in v2, and Sean's reply is at
> > https://lore.kernel.org/all/YO3%2FgvK9A3tgYfT6@google.com .
> 
> The SNP case is not checking fault info, it's closer to the other cases. I don't
> see that any of that conversation applies to this case. Can you clarify?
My concern about stuffing the error_code to pass in the fault max_level is that,
if it were a good path, the TDX basic enabling code would have been implemented
that way, always passing in 4KB.

Why did Sean say
"
Looks like SNP needs a dynamic check, i.e. a kvm_x86_ops hook, to handle an edge
case in the RMP.  That's probably the better route given that this is a short-term
hack (hopefully :-D).
"
instead of suggesting that TDX enable the error code path earlier and hardcode
the level to 4KB?

> On the subject of the whether to pass accept level into the fault, or stuff it
> on the vcpu, I'm still in the camp that it is better to pass it in the error
> code. If you disagree, let's see if we can flag down Sean and Paolo to weigh in.
Ok.

To document for further discussions with Sean and Paolo:

- Passing in max_level in tdx_gmem_private_max_mapping_level()
  Cons:
  a) needs to stuff info in the vcpu to get accept level info.

  Pros:
  a) a uniform approach with SEV.
  b) dynamic. Can get more fault info, e.g. is_prefetch, gfn, pfn.
  c) can return an increased/decreased level for a given gfn in the same way as
     the accept level is obtained
  d) flexibility for TDX to implement advanced features. e.g.
     1. determine an accept level after certain negotiation with guest
     2. pre-fetch memory


- To pass in max_level in error_code
  Cons:
  a) still need tdx_gmem_private_max_mapping_level() to get dynamic info.
  b) still need info stuffed on the vcpu under certain conditions. e.g.
     when promotion fails with TDX_EPT_INVALID_PROMOTE_CONDITIONS, we can skip
     the local retry by reducing the max_level.
  c) only effective in the EPT violation path.
  Pros:
  a) currently easy to pass in the accept level info.
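
For reference, a rough sketch of what the first option could look like, reusing
the tdx->violation_request_level idea mentioned earlier (the hook signature and
the field/helper names here are assumptions for illustration, not the actual
patch):

static int tdx_gmem_private_max_mapping_level(struct kvm *kvm, kvm_pfn_t pfn)
{
        struct kvm_vcpu *vcpu = kvm_get_running_vcpu();

        /* No vCPU context or pre-runnable TD: stay conservative. */
        if (!vcpu || to_kvm_tdx(kvm)->state != TD_STATE_RUNNABLE)
                return PG_LEVEL_4K;

        /*
         * violation_request_level is recorded (and reset) by
         * tdx_handle_ept_violation() from the guest's ACCEPT level.
         */
        return to_tdx(vcpu)->violation_request_level ? : PG_LEVEL_4K;
}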



* Re: [RFC PATCH 09/21] KVM: TDX: Enable 2MB mapping size after TD is RUNNABLE
  2025-05-22  3:52                   ` Yan Zhao
@ 2025-05-23 23:40                     ` Edgecombe, Rick P
  2025-05-27  1:31                       ` Yan Zhao
  0 siblings, 1 reply; 294+ messages in thread
From: Edgecombe, Rick P @ 2025-05-23 23:40 UTC (permalink / raw)
  To: Zhao, Yan Y
  Cc: Du, Fan, Li, Xiaoyao, Huang, Kai, quic_eberman@quicinc.com,
	Hansen, Dave, david@redhat.com, thomas.lendacky@amd.com,
	tabba@google.com, Li, Zhiquan1, Shutemov, Kirill,
	michael.roth@amd.com, linux-kernel@vger.kernel.org,
	seanjc@google.com, pbonzini@redhat.com, binbin.wu@linux.intel.com,
	Weiny, Ira, Yamahata, Isaku, vbabka@suse.cz,
	ackerleytng@google.com, Peng, Chao P, kvm@vger.kernel.org,
	Annapurve, Vishal, jroedel@suse.de, Miao, Jun, pgonda@google.com,
	x86@kernel.org

On Thu, 2025-05-22 at 11:52 +0800, Yan Zhao wrote:
> On Wed, May 21, 2025 at 11:40:15PM +0800, Edgecombe, Rick P wrote:
> > On Tue, 2025-05-20 at 17:34 +0800, Yan Zhao wrote:
> > > So, you want to disallow huge pages for non-Linux TDs, then we have no need
> > > to support splitting in the fault path, right?
> > > 
> > > I'm OK if we don't care non-Linux TDs for now.
> > > This can simplify the splitting code and we can add the support when there's a
> > > need.
> > 
> > We do need to care about non-Linux TDs functioning, but we don't need to
> > optimize for them at this point. We need to optimize for things that happen
> > often. Pending-#VE using TDs are rare, and don't need to have huge pages in
> > order to work.
> > 
> > Yesterday Kirill and I were chatting offline about the newly defined
> > TDG.MEM.PAGE.RELEASE. It is kind of like an unaccept, so another possibility is:
> > 1. Guest accepts at 2MB
> > 2. Guest releases at 2MB (no notice to VMM)
> > 3. Guest accepts at 4k, EPT violation with expectation to demote
> > 
> > In that case, KVM won't know to expect it, and that it needs to preemptively map
> > things at 4k.
> > 
> > For full coverage of the issue, can we discuss a little bit about what demote in
> > the fault path would look like?
> For demote in the fault path, it will take mmu read lock.
> 
> So, the flow in the fault path is
> 1. zap with mmu read lock.
>    ret = tdx_sept_zap_private_spte(kvm, gfn, level, page, true);
>    if (ret <= 0)
>        return ret;
> 2. track with mmu read lock
>    ret = tdx_track(kvm, true);
>    if (ret)
>        return ret;
> 3. demote with mmu read lock
>    ret = tdx_spte_demote_private_spte(kvm, gfn, level, page, true);
>    if (ret)
>        goto err;
> 4. return success or unzap as error fallback.
>    tdx_sept_unzap_private_spte(kvm, gfn, level);
> 
> Steps 1-3 will return -EBUSY on busy error (which will not be very often as we
> will introduce kvm_tdx->sept_lock. I can post the full lock analysis if
> necessary).

It is true that it would not be taken very often. It's not a performance
issue, but I think we should not add a lock if we can at all avoid it. It
creates a special case for TDX in the TDP MMU. People would then have to keep
in mind that two mmu read lock threads could still contend.

[snip]
> 
> 
> > The current zapping operation that is involved
> > depends on mmu write lock. And I remember you had a POC that added essentially a
> > hidden exclusive lock in TDX code as a substitute. But unlike the other callers,
> Right, The kvm_tdx->sept_lock is introduced as a rw lock. The write lock is held
> in a very short period, around tdh_mem_sept_remove(), tdh_mem_range_block(),
> tdh_mem_range_unblock().
> 
> The read/write status of the kvm_tdx->sept_lock corresponds to that in the TDX
> module.
> 
>   Resources          SHARED  users              EXCLUSIVE users 
> -----------------------------------------------------------------------
>  secure_ept_lock   tdh_mem_sept_add            tdh_vp_enter
>                    tdh_mem_page_aug            tdh_mem_sept_remove
>                    tdh_mem_page_remove         tdh_mem_range_block
>                    tdh_mem_page_promote        tdh_mem_range_unblock
>                    tdh_mem_page_demote
> 
> > the fault path demote case could actually handle failure. So if we just returned
> > busy and didn't try to force the retry, we would just run the risk of
> > interfering with TDX module sept lock? Is that the only issue with a design that
> > would allows failure of demote in the fault path?
> The concern to support split in the fault path is mainly to avoid unnecesssary
> split, e.g., when two vCPUs try to accept at different levels.

We are just talking about keeping rare TDs functional here, right? Two cases
are:
 - TDs using PAGE.RELEASE
 - TDs using pending #VEs and accepting memory in strange patterns

Not maintaining huge pages there seems totally acceptable. How I look at this
whole thing is that it is just an optimization, not a feature. Every aspect has a
complexity/performance tradeoff that we need to make a sensible decision on.
Maintaining huge page mappings in every possible case is not the goal.

> 
> Besides that we need to introduce 3 locks inside TDX:
> rwlock_t sept_lock, spinlock_t no_vcpu_enter_lock, spinlock_t track_lock.

Huh?

> 
> To ensure the success of unzap (to restore the state), kicking of vCPUs in the
> fault path is required, which is not ideal. But with the introduced lock and the
> proposed TDX modules's change to tdg_mem_page_accept() (as in the next comment),
> the chance to invoke unzap is very low.

Yes, it's probably not safe to expect the exact same demote call chain again.
The fault path could maybe learn to recover from the blocked state?

> 
> > Let's keep in mind that we could ask for TDX module changes to enable this path.
> We may need TDX module's change to let tdg_mem_page_accept() not to take lock on
> an non-ACCEPTable entry to avoid contention with guest and the potential error
> TDX_HOST_PRIORITY_BUSY_TIMEOUT.

Part of that is already in the works (accepting not-present entries). It seems
reasonable. But also, what about looking at having the TDX module do the full
demote operation internally. The track part obviously happens outside of the TDX
module, but maybe the whole thing could be simplified.

> 
> > I think we could probably get away with ignoring TDG.MEM.PAGE.RELEASE if we had
> > a plan to fix it up with TDX module changes. And if the ultimate root cause of
> > the complication is avoiding zero-step (sept lock), we should fix that instead
> > of design around it further.
> Ok.
> 
> > > 

I'll respond to the error code half of this mail separately.


* Re: [RFC PATCH 09/21] KVM: TDX: Enable 2MB mapping size after TD is RUNNABLE
  2025-05-23 23:40                     ` Edgecombe, Rick P
@ 2025-05-27  1:31                       ` Yan Zhao
  0 siblings, 0 replies; 294+ messages in thread
From: Yan Zhao @ 2025-05-27  1:31 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: Du, Fan, Li, Xiaoyao, Huang, Kai, quic_eberman@quicinc.com,
	Hansen, Dave, david@redhat.com, thomas.lendacky@amd.com,
	tabba@google.com, Li, Zhiquan1, Shutemov, Kirill,
	michael.roth@amd.com, linux-kernel@vger.kernel.org,
	seanjc@google.com, pbonzini@redhat.com, binbin.wu@linux.intel.com,
	Weiny, Ira, Yamahata, Isaku, vbabka@suse.cz,
	ackerleytng@google.com, Peng, Chao P, kvm@vger.kernel.org,
	Annapurve, Vishal, jroedel@suse.de, Miao, Jun, pgonda@google.com,
	x86@kernel.org

On Sat, May 24, 2025 at 07:40:25AM +0800, Edgecombe, Rick P wrote:
> On Thu, 2025-05-22 at 11:52 +0800, Yan Zhao wrote:
> > On Wed, May 21, 2025 at 11:40:15PM +0800, Edgecombe, Rick P wrote:
> > > On Tue, 2025-05-20 at 17:34 +0800, Yan Zhao wrote:
> > > > So, you want to disallow huge pages for non-Linux TDs, then we have no need
> > > > to support splitting in the fault path, right?
> > > > 
> > > > I'm OK if we don't care non-Linux TDs for now.
> > > > This can simplify the splitting code and we can add the support when there's a
> > > > need.
> > > 
> > > We do need to care about non-Linux TDs functioning, but we don't need to
> > > optimize for them at this point. We need to optimize for things that happen
> > > often. Pending-#VE using TDs are rare, and don't need to have huge pages in
> > > order to work.
> > > 
> > > Yesterday Kirill and I were chatting offline about the newly defined
> > > TDG.MEM.PAGE.RELEASE. It is kind of like an unaccept, so another possibility is:
> > > 1. Guest accepts at 2MB
> > > 2. Guest releases at 2MB (no notice to VMM)
> > > 3. Guest accepts at 4k, EPT violation with expectation to demote
> > > 
> > > In that case, KVM won't know to expect it, and that it needs to preemptively map
> > > things at 4k.
> > > 
> > > For full coverage of the issue, can we discuss a little bit about what demote in
> > > the fault path would look like?
> > For demote in the fault path, it will take mmu read lock.
> > 
> > So, the flow in the fault path is
> > 1. zap with mmu read lock.
> >    ret = tdx_sept_zap_private_spte(kvm, gfn, level, page, true);
> >    if (ret <= 0)
> >        return ret;
> > 2. track with mmu read lock
> >    ret = tdx_track(kvm, true);
> >    if (ret)
> >        return ret;
> > 3. demote with mmu read lock
> >    ret = tdx_spte_demote_private_spte(kvm, gfn, level, page, true);
> >    if (ret)
> >        goto err;
> > 4. return success or unzap as error fallback.
> >    tdx_sept_unzap_private_spte(kvm, gfn, level);
> > 
> > Steps 1-3 will return -EBUSY on busy error (which will not be very often as we
> > will introduce kvm_tdx->sept_lock. I can post the full lock analysis if
> > necessary).
> 
> That is true that it would not be taken very often. It's not a performance
> issue, but I think we should not add a lock if we can at all avoid it. It
> creates a special case for TDX for the TDP MMU. People would have to then keep
> in mind that two mmu read lock threads could still still contend.
Hmm, without the kvm_tdx->sept_lock, we can return and retry if a busy error is
returned from tdh_mem_range_block(). However, we need to ensure the success of
tdh_mem_range_unblock() before completing the split.

Besides, we need the kvm_tdx->track_lock to serialize tdh_mem_track() and the
kicking off of vCPUs. In the base series, we hold kvm->mmu_lock for write to
achieve this purpose.
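
For illustration, a minimal sketch of what the dedicated track_lock could look
like (field name per the discussion above; tdh_mem_track() and the KVM request
used to kick vCPUs are assumed to be the same as in the base series; not the
posted code):

static int tdx_track(struct kvm *kvm, bool mmu_lock_shared)
{
        struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
        u64 err;

        if (mmu_lock_shared)
                lockdep_assert_held_read(&kvm->mmu_lock);
        else
                lockdep_assert_held_write(&kvm->mmu_lock);

        /* Serialize the tracking epoch update and the vCPU kick. */
        spin_lock(&kvm_tdx->track_lock);
        err = tdh_mem_track(&kvm_tdx->td);
        if (!err)
                kvm_make_all_cpus_request(kvm, KVM_REQ_OUTSIDE_GUEST_MODE);
        spin_unlock(&kvm_tdx->track_lock);

        return err ? -EIO : 0;
}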

BTW: it looks like Kirill's DPAMT series will introduce a pamt_lock [1].
[1] https://lore.kernel.org/all/20250502130828.4071412-6-kirill.shutemov@linux.intel.com/

> > > The current zapping operation that is involved
> > > depends on mmu write lock. And I remember you had a POC that added essentially a
> > > hidden exclusive lock in TDX code as a substitute. But unlike the other callers,
> > Right, The kvm_tdx->sept_lock is introduced as a rw lock. The write lock is held
> > in a very short period, around tdh_mem_sept_remove(), tdh_mem_range_block(),
> > tdh_mem_range_unblock().
> > 
> > The read/write status of the kvm_tdx->sept_lock corresponds to that in the TDX
> > module.
> > 
> >   Resources          SHARED  users              EXCLUSIVE users 
> > -----------------------------------------------------------------------
> >  secure_ept_lock   tdh_mem_sept_add            tdh_vp_enter
> >                    tdh_mem_page_aug            tdh_mem_sept_remove
> >                    tdh_mem_page_remove         tdh_mem_range_block
> >                    tdh_mem_page_promote        tdh_mem_range_unblock
> >                    tdh_mem_page_demote
> > 
> > > the fault path demote case could actually handle failure. So if we just returned
> > > busy and didn't try to force the retry, we would just run the risk of
> > > interfering with TDX module sept lock? Is that the only issue with a design that
> > > would allows failure of demote in the fault path?
> > The concern to support split in the fault path is mainly to avoid unnecesssary
> > split, e.g., when two vCPUs try to accept at different levels.
> 
> We are just talking about keeping rare TDs functional here, right? Two cases
> are:
>  - TDs using PAGE.RELEASE
This is for future Linux TDs, right?

>  - TDs using pending #VEs and accepting memory in strange patterns
> 
> Not maintaining huge pages there seems totally acceptable. How I look at this
> whole thing is that it just an optimization, not a feature. Every aspect has a
> complexity/performance tradeoff that we need to make a sensible decision on.
> Maintaining huge page mappings in every possible case is not the goal.
So, can I interpret your preference as follows?
For now,
- Do not support huge pages on non-Linux TDs.
- Do not support page splitting in the fault path.

> > 
> > Besides that we need to introduce 3 locks inside TDX:
> > rwlock_t sept_lock, spinlock_t no_vcpu_enter_lock, spinlock_t track_lock.
> 
> Huh?
In the base series, the no_vcpu_enter_lock and track_lock are avoided by holding
kvm->mmu_lock for write.

> 
> > 
> > To ensure the success of unzap (to restore the state), kicking of vCPUs in the
> > fault path is required, which is not ideal. But with the introduced lock and the
> > proposed TDX modules's change to tdg_mem_page_accept() (as in the next comment),
> > the chance to invoke unzap is very low.
> 
> Yes, it's probably not safe to expect the exact same demote call chain again.
> The fault path could maybe learn to recover from the blocked state?
Do you mean you want to introduce a blocked state in the mirror page table?
I don't like it because of its complexity.

Do you think we can try asking for tdh_mem_page_demote() not to use
tdh_mem_range_block() and tdh_mem_range_unblock()? It looks like that is
required for TDX Connect anyway.

If that's true, tdh_mem_range_{un}block()/tdh_mem_track() can be avoided in
the fault path.

> > 
> > > Let's keep in mind that we could ask for TDX module changes to enable this path.
> > We may need TDX module's change to let tdg_mem_page_accept() not to take lock on
> > an non-ACCEPTable entry to avoid contention with guest and the potential error
> > TDX_HOST_PRIORITY_BUSY_TIMEOUT.
> 
> Part of that is already in the works (accepting not-present entries). It seems
> reasonable. But also, what about looking at having the TDX module do the full
> demote operation internally. The track part obviously happens outside of the TDX
> module, but maybe the whole thing could be simplified.
> 
> > 
> > > I think we could probably get away with ignoring TDG.MEM.PAGE.RELEASE if we had
> > > a plan to fix it up with TDX module changes. And if the ultimate root cause of
> > > the complication is avoiding zero-step (sept lock), we should fix that instead
> > > of design around it further.
> > Ok.
> > 
> > > > 
> 
> I'll respond to the error code half of this mail separately.


* Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge pages
  2025-05-15  3:01                                 ` Yan Zhao
@ 2025-06-04 20:02                                   ` Ackerley Tng
  2025-06-05  2:42                                     ` Yan Zhao
  2025-06-05  2:47                                     ` Yan Zhao
  0 siblings, 2 replies; 294+ messages in thread
From: Ackerley Tng @ 2025-06-04 20:02 UTC (permalink / raw)
  To: Yan Zhao
  Cc: vannapurve, pbonzini, seanjc, linux-kernel, kvm, x86,
	rick.p.edgecombe, dave.hansen, kirill.shutemov, tabba,
	quic_eberman, michael.roth, david, vbabka, jroedel,
	thomas.lendacky, pgonda, zhiquan1.li, fan.du, jun.miao, ira.weiny,
	isaku.yamahata, xiaoyao.li, binbin.wu, chao.p.peng

Yan Zhao <yan.y.zhao@intel.com> writes:

> On Mon, May 12, 2025 at 09:53:43AM -0700, Vishal Annapurve wrote:
>> On Sun, May 11, 2025 at 7:18 PM Yan Zhao <yan.y.zhao@intel.com> wrote:
>> > ...
>> > >
>> > > I might be wrongly throwing out some terminologies here then.
>> > > VM_PFNMAP flag can be set for memory backed by folios/page structs.
>> > > udmabuf seems to be working with pinned "folios" in the backend.
>> > >
>> > > The goal is to get to a stage where guest_memfd is backed by pfn
>> > > ranges unmanaged by kernel that guest_memfd owns and distributes to
>> > > userspace, KVM, IOMMU subject to shareability attributes. if the
>> > OK. So from point of the reset part of kernel, those pfns are not regarded as
>> > memory.
>> >
>> > > shareability changes, the users will get notified and will have to
>> > > invalidate their mappings. guest_memfd will allow mmaping such ranges
>> > > with VM_PFNMAP flag set by default in the VMAs to indicate the need of
>> > > special handling/lack of page structs.
>> > My concern is a failable invalidation notifer may not be ideal.
>> > Instead of relying on ref counts (or other mechanisms) to determine whether to
>> > start shareabilitiy changes, with a failable invalidation notifier, some users
>> > may fail the invalidation and the shareability change, even after other users
>> > have successfully unmapped a range.
>>
>> Even if one user fails to invalidate its mappings, I don't see a
>> reason to go ahead with shareability change. Shareability should not
>> change unless all existing users let go of their soon-to-be-invalid
>> view of memory.

Hi Yan,

While working on the 1G (aka HugeTLB) page support for guest_memfd
series [1], we took into account conversion failures too. The steps are
in kvm_gmem_convert_range(). (It might be easier to pull the entire
series from GitHub [2] because the steps for conversion changed in two
separate patches.)

We do need to handle errors across ranges to be converted, possibly from
different memslots. The goal is to either have the entire conversion
happen (including page split/merge) or nothing at all when the ioctl
returns.

We try to undo the restructuring (whether split or merge) and undo any
shareability changes on error (barring ENOMEM, in which case we leave a
WARNing).

The part we don't restore is the presence of the pages in the host or
guest page tables. For that, our idea is that if unmapped, the next
access will just map it in, so there's no issue there.

> My thinking is that:
>
> 1. guest_memfd starts shared-to-private conversion
> 2. guest_memfd sends invalidation notifications
>    2.1 invalidate notification --> A --> Unmap and return success
>    2.2 invalidate notification --> B --> Unmap and return success
>    2.3 invalidate notification --> C --> return failure
> 3. guest_memfd finds 2.3 fails, fails shared-to-private conversion and keeps
>    shareability as shared
>
> Though the GFN remains shared after 3, it's unmapped in user A and B in 2.1 and
> 2.2. Even if additional notifications could be sent to A and B to ask for
> mapping the GFN back, the map operation might fail. Consequently, A and B might
> not be able to restore the mapped status of the GFN.

For conversion we don't attempt to restore mappings anywhere (whether in
guest or host page tables). What do you think of not restoring the
mappings?

> For IOMMU mappings, this
> could result in DMAR failure following a failed attempt to do shared-to-private
> conversion.

I believe the current conversion setup guards against this because after
unmapping from the host, we check for any unexpected refcounts.

(This unmapping is not the unmapping we're concerned about, since this is
shared memory, and unmapping doesn't go through TDX.)

Coming back to the refcounts, if the IOMMU had mappings, these refcounts
are "unexpected". The conversion ioctl will return to userspace with an
error.

IO can continue to happen, since the memory is still mapped in the
IOMMU. The memory state is still shared. No issue there.
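
As a rough illustration of that check, it conceptually boils down to something
like the below (the helper name and the "expected" count are made up here; the
actual logic lives in the conversion series):

static bool gmem_has_unexpected_refcount(struct folio *folio, long expected)
{
        /*
         * Anything above the references guest_memfd itself holds (filemap plus
         * the reference taken for the conversion) means some other user, e.g.
         * an IOMMU mapping, still holds the page, so conversion must not
         * proceed.
         */
        return folio_ref_count(folio) > expected;
}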

In RFCv2 [1], we expect userspace to see the error, then try and remove
the memory from the IOMMU, and then try conversion again.

The part in concern here is unmapping failures of private pages, for
private-to-shared conversions, since that part goes through TDX and
might fail.

One other thing about taking refcounts is that in RFCv2,
private-to-shared conversions assume that there are no refcounts on the
private pages at all. (See filemap_remove_folio_for_restructuring() in
[3])

Haven't had a chance to think about all the edge cases, but for now I
think on unmapping failure, in addition to taking a refcount, we should
return an error at least up to guest_memfd, so that guest_memfd could
perhaps keep the refcount on that page, but drop the page from the
filemap. Another option could be to track messed up addresses and always
check that on conversion or something - not sure yet.

Either way, guest_memfd must know. If guest_memfd is not informed, on a
next conversion request, the conversion will just spin in
filemap_remove_folio_for_restructuring().

What do you think of this part about informing guest_memfd of the
failure to unmap?

>
> I noticed Ackerley has posted the series. Will check there later.
>

[1] https://lore.kernel.org/all/cover.1747264138.git.ackerleytng@google.com/T/
[2] https://github.com/googleprodkernel/linux-cc/tree/gmem-1g-page-support-rfc-v2
[3] https://lore.kernel.org/all/7753dc66229663fecea2498cf442a768cb7191ba.1747264138.git.ackerleytng@google.com/

>> >
>> > Auditing whether multiple users of shared memory correctly perform unmapping is
>> > harder than auditing reference counts.
>> >
>> > > private memory backed by page structs and use a special "filemap" to
>> > > map file offsets to these private memory ranges. This step will also
>> > > need similar contract with users -
>> > >    1) memory is pinned by guest_memfd
>> > >    2) users will get invalidation notifiers on shareability changes
>> > >
>> > > I am sure there is a lot of work here and many quirks to be addressed,
>> > > let's discuss this more with better context around. A few related RFC
>> > > series are planned to be posted in the near future.
>> > Ok. Thanks for your time and discussions :)
>> > ...


* Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge pages
  2025-06-04 20:02                                   ` Ackerley Tng
@ 2025-06-05  2:42                                     ` Yan Zhao
  2025-06-05 21:12                                       ` Ackerley Tng
  2025-06-11 14:30                                       ` Vishal Annapurve
  2025-06-05  2:47                                     ` Yan Zhao
  1 sibling, 2 replies; 294+ messages in thread
From: Yan Zhao @ 2025-06-05  2:42 UTC (permalink / raw)
  To: Ackerley Tng
  Cc: vannapurve, pbonzini, seanjc, linux-kernel, kvm, x86,
	rick.p.edgecombe, dave.hansen, kirill.shutemov, tabba,
	quic_eberman, michael.roth, david, vbabka, jroedel,
	thomas.lendacky, pgonda, zhiquan1.li, fan.du, jun.miao, ira.weiny,
	isaku.yamahata, xiaoyao.li, binbin.wu, chao.p.peng

On Wed, Jun 04, 2025 at 01:02:54PM -0700, Ackerley Tng wrote:
> Yan Zhao <yan.y.zhao@intel.com> writes:
> 
> > On Mon, May 12, 2025 at 09:53:43AM -0700, Vishal Annapurve wrote:
> >> On Sun, May 11, 2025 at 7:18 PM Yan Zhao <yan.y.zhao@intel.com> wrote:
> >> > ...
> >> > >
> >> > > I might be wrongly throwing out some terminologies here then.
> >> > > VM_PFNMAP flag can be set for memory backed by folios/page structs.
> >> > > udmabuf seems to be working with pinned "folios" in the backend.
> >> > >
> >> > > The goal is to get to a stage where guest_memfd is backed by pfn
> >> > > ranges unmanaged by kernel that guest_memfd owns and distributes to
> >> > > userspace, KVM, IOMMU subject to shareability attributes. if the
> >> > OK. So from point of the reset part of kernel, those pfns are not regarded as
> >> > memory.
> >> >
> >> > > shareability changes, the users will get notified and will have to
> >> > > invalidate their mappings. guest_memfd will allow mmaping such ranges
> >> > > with VM_PFNMAP flag set by default in the VMAs to indicate the need of
> >> > > special handling/lack of page structs.
> >> > My concern is a failable invalidation notifer may not be ideal.
> >> > Instead of relying on ref counts (or other mechanisms) to determine whether to
> >> > start shareabilitiy changes, with a failable invalidation notifier, some users
> >> > may fail the invalidation and the shareability change, even after other users
> >> > have successfully unmapped a range.
> >>
> >> Even if one user fails to invalidate its mappings, I don't see a
> >> reason to go ahead with shareability change. Shareability should not
> >> change unless all existing users let go of their soon-to-be-invalid
> >> view of memory.
> 
> Hi Yan,
> 
> While working on the 1G (aka HugeTLB) page support for guest_memfd
> series [1], we took into account conversion failures too. The steps are
> in kvm_gmem_convert_range(). (It might be easier to pull the entire
> series from GitHub [2] because the steps for conversion changed in two
> separate patches.)
> 
> We do need to handle errors across ranges to be converted, possibly from
> different memslots. The goal is to either have the entire conversion
> happen (including page split/merge) or nothing at all when the ioctl
> returns.
> 
> We try to undo the restructuring (whether split or merge) and undo any
> shareability changes on error (barring ENOMEM, in which case we leave a
> WARNing).
As the undo can fail (as in the case where you leave a WARNing, in patch 38 in
[1]), it can lead to WARNings in the kernel with folios not being properly added
back to the filemap.

> The part we don't restore is the presence of the pages in the host or
> guest page tables. For that, our idea is that if unmapped, the next
> access will just map it in, so there's no issue there.

I don't think so.

As in patch 38 in [1], on failure, it may fail to
- restore the shareability
- restore the folio's filemap status
- restore the folio's hugetlb stash metadata
- restore the folio's merged/split status

Also, the host page table is not restored.


> > My thinking is that:
> >
> > 1. guest_memfd starts shared-to-private conversion
> > 2. guest_memfd sends invalidation notifications
> >    2.1 invalidate notification --> A --> Unmap and return success
> >    2.2 invalidate notification --> B --> Unmap and return success
> >    2.3 invalidate notification --> C --> return failure
> > 3. guest_memfd finds 2.3 fails, fails shared-to-private conversion and keeps
> >    shareability as shared
> >
> > Though the GFN remains shared after 3, it's unmapped in user A and B in 2.1 and
> > 2.2. Even if additional notifications could be sent to A and B to ask for
> > mapping the GFN back, the map operation might fail. Consequently, A and B might
> > not be able to restore the mapped status of the GFN.
> 
> For conversion we don't attempt to restore mappings anywhere (whether in
> guest or host page tables). What do you think of not restoring the
> mappings?
It could cause problems if the mappings in the S-EPT can't be restored.

For TDX private-to-shared conversion, if kvm_gmem_convert_should_proceed() -->
kvm_gmem_unmap_private() --> kvm_mmu_unmap_gfn_range() fails in the end, then
the GFN shareability is restored to private. The next guest access to
the partially unmapped private memory can hit a fatal error: "access before
acceptance".

It could occur in such a scenario:
1. TD issues a TDVMCALL_MAP_GPA to convert a private GFN to shared
2. Conversion fails in KVM.
3. set_memory_decrypted() fails in TD.
4. TD thinks the GFN is still accepted as private and accesses it.


> > For IOMMU mappings, this
> > could result in DMAR failure following a failed attempt to do shared-to-private
> > conversion.
> 
> I believe the current conversion setup guards against this because after
> unmapping from the host, we check for any unexpected refcounts.
Right, it's fine if we check for any unexpected refcounts.


> (This unmapping is not the unmapping we're concerned about, since this is
> shared memory, and unmapping doesn't go through TDX.)
> 
> Coming back to the refcounts, if the IOMMU had mappings, these refcounts
> are "unexpected". The conversion ioctl will return to userspace with an
> error.
> 
> IO can continue to happen, since the memory is still mapped in the
> IOMMU. The memory state is still shared. No issue there.
> 
> In RFCv2 [1], we expect userspace to see the error, then try and remove
> the memory from the IOMMU, and then try conversion again.
I don't think it's right to depend on userspace always behaving the way the
kernel expects, i.e. retrying the conversion until it succeeds.

We need to restore the previous state (which includes the host page tables)
if the conversion can't be done.
That said, in my view, a better flow would be:

1. guest_memfd sends a pre-invalidation request to users (users here means the
   consumers in kernel of memory allocated from guest_memfd).

2. Users (A, B, ..., X) perform pre-checks to determine if invalidation can
   proceed. For example, in the case of TDX, this might involve memory
   allocation and page splitting.

3. Based on the pre-check results, guest_memfd either aborts the invalidation or
   proceeds by sending the actual invalidation request.

4. Users (A-X) perform the actual unmap operation, ensuring it cannot fail. For
   TDX, the unmap must succeed unless there are bugs in the KVM or TDX module.
   In such cases, TDX can call back into guest_memfd to report the poison status
   of the page, or elevate the page reference count.

5. guest_memfd completes the invalidation process. If the memory is marked as
   "poison," guest_memfd can handle it accordingly. If the page has an elevated
   reference count, guest_memfd may not need to take special action, as the
   elevated count prevents the OS from reallocating the page.
   (but from your reply below, it seems a callback into guest_memfd is a better
    approach).
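
To make the flow above concrete, here is a rough sketch of how guest_memfd could
drive it (all function names below are hypothetical, purely for illustration):

static int gmem_convert_range(struct inode *inode, pgoff_t start, pgoff_t end,
                              bool to_private)
{
        int ret;

        /*
         * Steps 1-2: pre-invalidation. Users pre-allocate/split whatever they
         * need so that the real unmap below cannot fail.
         */
        ret = gmem_notify_pre_invalidate(inode, start, end, to_private);
        if (ret)
                return ret;     /* Step 3: abort; nothing has been unmapped. */

        /*
         * Step 4: the actual unmap. It must not fail barring KVM/TDX module
         * bugs; on such bugs, users report a poisoned page or hold a ref.
         */
        gmem_notify_invalidate(inode, start, end);

        /* Step 5: complete the invalidation and update the shareability. */
        return gmem_finish_conversion(inode, start, end, to_private);
}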


> The part in concern here is unmapping failures of private pages, for
> private-to-shared conversions, since that part goes through TDX and
> might fail.
IMO, even for TDX, the real unmap must not fail unless there are bugs in the KVM
or TDX module.
So, for page splitting in S-EPT, I prefer to try splitting in the
pre-invalidation phase before conducting any real unmap.


> One other thing about taking refcounts is that in RFCv2,
> private-to-shared conversions assume that there are no refcounts on the
> private pages at all. (See filemap_remove_folio_for_restructuring() in
> [3])
>
> Haven't had a chance to think about all the edge cases, but for now I
> think on unmapping failure, in addition to taking a refcount, we should
> return an error at least up to guest_memfd, so that guest_memfd could
> perhaps keep the refcount on that page, but drop the page from the
> filemap. Another option could be to track messed up addresses and always
> check that on conversion or something - not sure yet.

It looks good to me. See bullet 4 in my proposed flow above.

> Either way, guest_memfd must know. If guest_memfd is not informed, on a
> next conversion request, the conversion will just spin in
> filemap_remove_folio_for_restructuring().
It makes sense.


> What do you think of this part about informing guest_memfd of the
> failure to unmap?
So, do you want to add a guest_memfd callback to achieve this purpose?


BTW, here's an analysis of why we can't let kvm_mmu_unmap_gfn_range()
and mmu_notifier_invalidate_range_start() fail, based on the repo
https://github.com/torvalds/linux.git, commit cd2e103d57e5 ("Merge tag
'hardening-v6.16-rc1-fix1-take2' of
git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux")

1. Status of mmu notifier
-------------------------------
(1) There're 34 direct callers of mmu_notifier_invalidate_range_start().
    1. clear_refs_write
    2. do_pagemap_scan
    3. uprobe_write_opcode
    4. do_huge_zero_wp_pmd
    5. __split_huge_pmd (N)
    6. __split_huge_pud (N)
    7. move_pages_huge_pmd
    8. copy_hugetlb_page_range
    9. hugetlb_unshare_pmds  (N)
    10. hugetlb_change_protection
    11. hugetlb_wp
    12. unmap_hugepage_range (N)
    13. move_hugetlb_page_tables
    14. collapse_huge_page
    15. retract_page_tables
    16. collapse_pte_mapped_thp
    17. write_protect_page
    18. replace_page
    19. madvise_free_single_vma
    20. wp_clean_pre_vma
    21. wp_page_copy 
    22. zap_page_range_single_batched (N)
    23. unmap_vmas (N)
    24. copy_page_range 
    25. remove_device_exclusive_entry
    26. migrate_vma_collect
    27. __migrate_device_pages
    28. change_pud_range 
    29. move_page_tables
    30. page_vma_mkclean_one
    31. try_to_unmap_one
    32. try_to_migrate_one
    33. make_device_exclusive
    34. move_pages_pte

Of these 34 direct callers, those marked with (N) cannot tolerate
mmu_notifier_invalidate_range_start() failing. I have not yet investigated all
34 direct callers one by one, so the list of (N) is incomplete.

For 5. __split_huge_pmd(), Documentation/mm/transhuge.rst says:
"Note that split_huge_pmd() doesn't have any limitations on refcounting:
pmd can be split at any point and never fails." This is because split_huge_pmd()
serves as a graceful fallback design for code walking pagetables but unaware
about huge pmds.


(2) There's 1 direct caller of mmu_notifier_invalidate_range_start_nonblock(),
__oom_reap_task_mm(), which only expects the error -EAGAIN.

In mn_hlist_invalidate_range_start():
"WARN_ON(mmu_notifier_range_blockable(range) || _ret != -EAGAIN);"


(3) For DMA, drivers need to invoke pin_user_pages() to pin memory. In that
case, they don't need to register an mmu notifier.

Or, device drivers can pin pages via get_user_pages*() and register for mmu
notifier callbacks for the memory range. Then, upon receiving an
"invalidate range" notifier callback, they stop the device from using the range
and unpin the pages.

See Documentation/core-api/pin_user_pages.rst.
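
A minimal sketch of that second pattern on the driver side (my_dev_* names and
the stop-DMA helper are made up; this assumes the pages were pinned with
pin_user_pages(), so unpin_user_pages() is the matching release):

#include <linux/mm.h>
#include <linux/mmu_notifier.h>

struct my_dev_range {
        struct mmu_notifier mn;
        struct page **pages;
        unsigned long npages;
};

static int my_dev_invalidate_range_start(struct mmu_notifier *mn,
                                         const struct mmu_notifier_range *range)
{
        struct my_dev_range *r = container_of(mn, struct my_dev_range, mn);

        my_dev_stop_dma(r);                     /* hypothetical driver helper */
        unpin_user_pages(r->pages, r->npages);
        return 0;
}

static const struct mmu_notifier_ops my_dev_mn_ops = {
        .invalidate_range_start = my_dev_invalidate_range_start,
};
/* r->mn.ops = &my_dev_mn_ops; mmu_notifier_register(&r->mn, current->mm); */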


2. Cases that cannot tolerate failure of mmu_notifier_invalidate_range_start()
-------------------------------
(1) Error fallback cases.

    1. split_huge_pmd() as mentioned in Documentation/mm/transhuge.rst.
       split_huge_pmd() is designed as a graceful fallback without failure.

       split_huge_pmd
        |->__split_huge_pmd
           |->mmu_notifier_range_init
           |  mmu_notifier_invalidate_range_start
           |  split_huge_pmd_locked
           |  mmu_notifier_invalidate_range_end


    2. in fs/iomap/buffered-io.c, iomap_write_failed() itself is error handling.
       iomap_write_failed
         |->truncate_pagecache_range
            |->unmap_mapping_range
            |  |->unmap_mapping_pages
            |     |->unmap_mapping_range_tree
            |        |->unmap_mapping_range_vma
            |           |->zap_page_range_single
            |              |->zap_page_range_single_batched
            |                       |->mmu_notifier_range_init
            |                       |  mmu_notifier_invalidate_range_start
            |                       |  unmap_single_vma
            |                       |  mmu_notifier_invalidate_range_end
            |->truncate_inode_pages_range
               |->truncate_cleanup_folio
                  |->if (folio_mapped(folio))
                  |     unmap_mapping_folio(folio);
                         |->unmap_mapping_range_tree
                            |->unmap_mapping_range_vma
                               |->zap_page_range_single
                                  |->zap_page_range_single_batched
                                     |->mmu_notifier_range_init
                                     |  mmu_notifier_invalidate_range_start
                                     |  unmap_single_vma
                                     |  mmu_notifier_invalidate_range_end

   3. in mm/memory.c, zap_page_range_single() is invoked to handle error.
      remap_pfn_range_notrack
        |->int error = remap_pfn_range_internal(vma, addr, pfn, size, prot);
        |  if (!error)
        |      return 0;
	|  zap_page_range_single
           |->zap_page_range_single_batched
              |->mmu_notifier_range_init
              |  mmu_notifier_invalidate_range_start
              |  unmap_single_vma
              |  mmu_notifier_invalidate_range_end

   4. in kernel/events/core.c, zap_page_range_single() is invoked to clear any
      partial mappings on error.

      perf_mmap
        |->ret = map_range(rb, vma);
                 |  err = remap_pfn_range
                 |->if (err) 
                 |     zap_page_range_single
                        |->zap_page_range_single_batched
                           |->mmu_notifier_range_init
                           |  mmu_notifier_invalidate_range_start
                           |  unmap_single_vma
                           |  mmu_notifier_invalidate_range_end


   5. in mm/memory.c, unmap_mapping_folio() is invoked to unmap posion page.

      __do_fault
	|->if (unlikely(PageHWPoison(vmf->page))) { 
	|	vm_fault_t poisonret = VM_FAULT_HWPOISON;
	|	if (ret & VM_FAULT_LOCKED) {
	|		if (page_mapped(vmf->page))
	|			unmap_mapping_folio(folio);
        |                       |->unmap_mapping_range_tree
        |                          |->unmap_mapping_range_vma
        |                             |->zap_page_range_single
        |                                |->zap_page_range_single_batched
        |                                   |->mmu_notifier_range_init
        |                                   |  mmu_notifier_invalidate_range_start
        |                                   |  unmap_single_vma
        |                                   |  mmu_notifier_invalidate_range_end
	|		if (mapping_evict_folio(folio->mapping, folio))
	|			poisonret = VM_FAULT_NOPAGE; 
	|		folio_unlock(folio);
	|	}
	|	folio_put(folio);
	|	vmf->page = NULL;
	|	return poisonret;
	|  }


  6. in mm/vma.c, in __mmap_region(), unmap_region() is invoked to undo any
     partial mapping done by a device driver.

     __mmap_new_vma
       |->__mmap_new_file_vma(map, vma);
          |->error = mmap_file(vma->vm_file, vma);
          |  if (error)
          |     unmap_region
                 |->unmap_vmas
                    |->mmu_notifier_range_init
                    |  mmu_notifier_invalidate_range_start
                    |  unmap_single_vma
                    |  mmu_notifier_invalidate_range_end


(2) No-fail cases
-------------------------------
1. iput() cannot fail. 

iput
 |->iput_final
    |->WRITE_ONCE(inode->i_state, state | I_FREEING);
    |  inode_lru_list_del(inode);
    |  evict(inode);
       |->op->evict_inode(inode);
          |->shmem_evict_inode
             |->shmem_truncate_range
                |->truncate_inode_pages_range
                   |->truncate_cleanup_folio
                      |->if (folio_mapped(folio))
                      |     unmap_mapping_folio(folio);
                            |->unmap_mapping_range_tree
                               |->unmap_mapping_range_vma
                                  |->zap_page_range_single
                                     |->zap_page_range_single_batched
                                        |->mmu_notifier_range_init
                                        |  mmu_notifier_invalidate_range_start
                                        |  unmap_single_vma
                                        |  mmu_notifier_invalidate_range_end


2. exit_mmap() cannot fail

exit_mmap
  |->mmu_notifier_release(mm);
     |->unmap_vmas(&tlb, &vmi.mas, vma, 0, ULONG_MAX, ULONG_MAX, false);
        |->mmu_notifier_range_init
        |  mmu_notifier_invalidate_range_start
        |  unmap_single_vma
        |  mmu_notifier_invalidate_range_end


3. KVM Cases That Cannot Tolerate Unmap Failure
-------------------------------
Allowing unmap operations to fail in the following scenarios would make it very
difficult or even impossible to handle the failure:

(1) __kvm_mmu_get_shadow_page() is designed to reliably obtain a shadow page
without expecting any failure.

mmu_alloc_direct_roots
  |->mmu_alloc_root
     |->kvm_mmu_get_shadow_page
        |->__kvm_mmu_get_shadow_page
           |->kvm_mmu_alloc_shadow_page
              |->account_shadowed
                 |->kvm_mmu_slot_gfn_write_protect
                    |->kvm_tdp_mmu_write_protect_gfn
                       |->write_protect_gfn
                          |->tdp_mmu_iter_set_spte


(2) kvm_vfio_release() and kvm_vfio_file_del() cannot fail

kvm_vfio_release/kvm_vfio_file_del
 |->kvm_vfio_update_coherency
    |->kvm_arch_unregister_noncoherent_dma
       |->kvm_noncoherent_dma_assignment_start_or_stop
          |->kvm_zap_gfn_range
             |->kvm_tdp_mmu_zap_leafs
                |->tdp_mmu_zap_leafs
                   |->tdp_mmu_iter_set_spte


(3) There're lots of callers of __kvm_set_or_clear_apicv_inhibit() currently
never expect failure of unmap.

__kvm_set_or_clear_apicv_inhibit
  |->kvm_zap_gfn_range
     |->kvm_tdp_mmu_zap_leafs
        |->tdp_mmu_zap_leafs
           |->tdp_mmu_iter_set_spte



4. Cases in KVM where it's hard to make tdp_mmu_set_spte() (update SPTE with
write mmu_lock) failable.

(1) kvm_vcpu_flush_tlb_guest()

kvm_vcpu_flush_tlb_guest
  |->kvm_mmu_sync_roots
     |->mmu_sync_children
        |->kvm_vcpu_write_protect_gfn
           |->kvm_mmu_slot_gfn_write_protect
              |->kvm_tdp_mmu_write_protect_gfn
                 |->write_protect_gfn
                    |->tdp_mmu_iter_set_spte
                       |->tdp_mmu_set_spte


(2) handle_removed_pt() and handle_changed_spte().


Thanks
Yan

^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge pages
  2025-06-04 20:02                                   ` Ackerley Tng
  2025-06-05  2:42                                     ` Yan Zhao
@ 2025-06-05  2:47                                     ` Yan Zhao
  2025-06-05 22:35                                       ` Ackerley Tng
  1 sibling, 1 reply; 294+ messages in thread
From: Yan Zhao @ 2025-06-05  2:47 UTC (permalink / raw)
  To: Ackerley Tng
  Cc: vannapurve, pbonzini, seanjc, linux-kernel, kvm, x86,
	rick.p.edgecombe, dave.hansen, kirill.shutemov, tabba,
	quic_eberman, michael.roth, david, vbabka, jroedel,
	thomas.lendacky, pgonda, zhiquan1.li, fan.du, jun.miao, ira.weiny,
	isaku.yamahata, xiaoyao.li, binbin.wu, chao.p.peng

On Wed, Jun 04, 2025 at 01:02:54PM -0700, Ackerley Tng wrote:
> Hi Yan,
> 
> While working on the 1G (aka HugeTLB) page support for guest_memfd
> series [1], we took into account conversion failures too. The steps are
> in kvm_gmem_convert_range(). (It might be easier to pull the entire
> series from GitHub [2] because the steps for conversion changed in two
> separate patches.)
...
> [2] https://github.com/googleprodkernel/linux-cc/tree/gmem-1g-page-support-rfc-v2

Hi Ackerley,
Thanks for providing this branch.

I'm now trying to get TD huge pages working on this branch and would like to
report early the errors I encountered during this process.

1. The symbol arch_get_align_mask() is not available when KVM is compiled as a module.
   I currently work around it as follows:

--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -102,8 +102,13 @@ static unsigned long kvm_gmem_get_align_mask(struct file *file,
        void *priv;

        inode = file_inode(file);
-       if (!kvm_gmem_has_custom_allocator(inode))
-             return arch_get_align_mask(file, flags);
+       if (!kvm_gmem_has_custom_allocator(inode)) {
+               page_size = 1 << PAGE_SHIFT;
+               return PAGE_MASK & (page_size - 1);
+       }
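
   (The non-workaround fix is presumably just to export the symbol next to its
   definition so that kvm.ko can reference it; the sketch below is illustrative
   and the location of the definition is an assumption on my side.)

   /* in the arch file that defines arch_get_align_mask() (location assumed) */
   EXPORT_SYMBOL_GPL(arch_get_align_mask);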


2. Bug: sleeping function called from invalid context

[  193.523469] BUG: sleeping function called from invalid context at ./include/linux/sched/mm.h:325
[  193.539885] in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 3332, name: guest_memfd_con
[  193.556235] preempt_count: 1, expected: 0
[  193.564518] RCU nest depth: 0, expected: 0
[  193.572866] 3 locks held by guest_memfd_con/3332:
[  193.581800]  #0: ff16f8ec217e4438 (sb_writers#14){.+.+}-{0:0}, at: __x64_sys_fallocate+0x46/0x80
[  193.598252]  #1: ff16f8fbd85c8310 (mapping.invalidate_lock#4){++++}-{4:4}, at: kvm_gmem_fallocate+0x9e/0x310 [kvm]
[  193.616706]  #2: ff3189b5e4f65018 (&(kvm)->mmu_lock){++++}-{3:3}, at: kvm_gmem_invalidate_begin_and_zap+0x17f/0x260 [kvm]
[  193.635790] Preemption disabled at:
[  193.635793] [<ffffffffc0850c6f>] kvm_gmem_invalidate_begin_and_zap+0x17f/0x260 [kvm]

This is because add_to_invalidated_kvms() invokes kzalloc() while holding
kvm->mmu_lock, which is a spinlock.

I worked around it as follows.

 static int kvm_gmem_invalidate_begin_and_zap(struct kvm_gmem *gmem,
                                             pgoff_t start, pgoff_t end,
@@ -1261,13 +1268,13 @@ static int kvm_gmem_invalidate_begin_and_zap(struct kvm_gmem *gmem,
                        KVM_MMU_LOCK(kvm);
                        kvm_mmu_invalidate_begin(kvm);

-                       if (invalidated_kvms) {
-                               ret = add_to_invalidated_kvms(invalidated_kvms, kvm);
-                               if (ret) {
-                                       kvm_mmu_invalidate_end(kvm);
-                                       goto out;
-                               }
-                       }
                }


@@ -1523,12 +1530,14 @@ static long kvm_gmem_punch_hole(struct inode *inode, loff_t offset, loff_t len)
        }

 out:
-       list_for_each_entry_safe(entry, tmp, &invalidated_kvms, list) {
-               kvm_gmem_do_invalidate_end(entry->kvm);
-               list_del(&entry->list);
-               kfree(entry);
-       }
+       list_for_each_entry(gmem, gmem_list, entry)
+               kvm_gmem_do_invalidate_end(gmem->kvm);

        filemap_invalidate_unlock(inode->i_mapping);
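
(If the per-KVM tracking needs to stay, the usual pattern would be to do the
sleeping allocation before taking the spinlock. Below is only an illustrative
sketch with hypothetical names, not the actual code from the branch.)

        struct invalidated_kvm_entry *entry;    /* hypothetical type */

        /* kzalloc(GFP_KERNEL) may sleep, so allocate before the spinlock. */
        entry = kzalloc(sizeof(*entry), GFP_KERNEL);
        if (!entry)
                return -ENOMEM;

        KVM_MMU_LOCK(kvm);                      /* spinlock held from here on */
        kvm_mmu_invalidate_begin(kvm);
        entry->kvm = kvm;
        list_add(&entry->list, invalidated_kvms);
        KVM_MMU_UNLOCK(kvm);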


Will let you know more findings later.

Thanks
Yan

^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge pages
  2025-06-05  2:42                                     ` Yan Zhao
@ 2025-06-05 21:12                                       ` Ackerley Tng
  2025-06-16 10:43                                         ` Yan Zhao
  2025-06-11 14:30                                       ` Vishal Annapurve
  1 sibling, 1 reply; 294+ messages in thread
From: Ackerley Tng @ 2025-06-05 21:12 UTC (permalink / raw)
  To: Yan Zhao
  Cc: vannapurve, pbonzini, seanjc, linux-kernel, kvm, x86,
	rick.p.edgecombe, dave.hansen, kirill.shutemov, tabba,
	quic_eberman, michael.roth, david, vbabka, jroedel,
	thomas.lendacky, pgonda, zhiquan1.li, fan.du, jun.miao, ira.weiny,
	isaku.yamahata, xiaoyao.li, binbin.wu, chao.p.peng

Yan Zhao <yan.y.zhao@intel.com> writes:

> On Wed, Jun 04, 2025 at 01:02:54PM -0700, Ackerley Tng wrote:
>> Yan Zhao <yan.y.zhao@intel.com> writes:
>> 
>> > On Mon, May 12, 2025 at 09:53:43AM -0700, Vishal Annapurve wrote:
>> >> On Sun, May 11, 2025 at 7:18 PM Yan Zhao <yan.y.zhao@intel.com> wrote:
>> >> > ...
>> >> > >
>> >> > > I might be wrongly throwing out some terminologies here then.
>> >> > > VM_PFNMAP flag can be set for memory backed by folios/page structs.
>> >> > > udmabuf seems to be working with pinned "folios" in the backend.
>> >> > >
>> >> > > The goal is to get to a stage where guest_memfd is backed by pfn
>> >> > > ranges unmanaged by kernel that guest_memfd owns and distributes to
>> >> > > userspace, KVM, IOMMU subject to shareability attributes. if the
>> >> > OK. So from point of the reset part of kernel, those pfns are not regarded as
>> >> > memory.
>> >> >
>> >> > > shareability changes, the users will get notified and will have to
>> >> > > invalidate their mappings. guest_memfd will allow mmaping such ranges
>> >> > > with VM_PFNMAP flag set by default in the VMAs to indicate the need of
>> >> > > special handling/lack of page structs.
>> >> > My concern is a failable invalidation notifer may not be ideal.
>> >> > Instead of relying on ref counts (or other mechanisms) to determine whether to
>> >> > start shareabilitiy changes, with a failable invalidation notifier, some users
>> >> > may fail the invalidation and the shareability change, even after other users
>> >> > have successfully unmapped a range.
>> >>
>> >> Even if one user fails to invalidate its mappings, I don't see a
>> >> reason to go ahead with shareability change. Shareability should not
>> >> change unless all existing users let go of their soon-to-be-invalid
>> >> view of memory.
>> 
>> Hi Yan,
>> 
>> While working on the 1G (aka HugeTLB) page support for guest_memfd
>> series [1], we took into account conversion failures too. The steps are
>> in kvm_gmem_convert_range(). (It might be easier to pull the entire
>> series from GitHub [2] because the steps for conversion changed in two
>> separate patches.)
>> 
>> We do need to handle errors across ranges to be converted, possibly from
>> different memslots. The goal is to either have the entire conversion
>> happen (including page split/merge) or nothing at all when the ioctl
>> returns.
>> 
>> We try to undo the restructuring (whether split or merge) and undo any
>> shareability changes on error (barring ENOMEM, in which case we leave a
>> WARNing).
> As the undo can fail (as in the case where you leave a WARNing, in patch 38 in
> [1]), it can lead to WARNings in the kernel with folios not being properly
> added to the filemap.
>

I'm not sure how else to handle errors on the rollback path. I've hopefully
addressed this on the other thread at [1].

>> The part we don't restore is the presence of the pages in the host or
>> guest page tables. For that, our idea is that if unmapped, the next
>> access will just map it in, so there's no issue there.
>
> I don't think so.
>
> As in patch 38 in [1], on failure, it may fail to
> - restore the shareability
> - restore the folio's filemap status
> - restore the folio's hugetlb stash metadata
> - restore the folio's merged/split status
>

The plan is that we try our best to restore shareability, filemap status,
and restructuring (aka split/merge, including stash metadata), except when
the rollback itself fails.

> Also, the host page table is not restored.
>
>

This is by design: the host page tables can be re-populated on the next
fault. I've hopefully addressed this on the other thread at [1].

>> > My thinking is that:
>> >
>> > 1. guest_memfd starts shared-to-private conversion
>> > 2. guest_memfd sends invalidation notifications
>> >    2.1 invalidate notification --> A --> Unmap and return success
>> >    2.2 invalidate notification --> B --> Unmap and return success
>> >    2.3 invalidate notification --> C --> return failure
>> > 3. guest_memfd finds 2.3 fails, fails shared-to-private conversion and keeps
>> >    shareability as shared
>> >
>> > Though the GFN remains shared after 3, it's unmapped in user A and B in 2.1 and
>> > 2.2. Even if additional notifications could be sent to A and B to ask for
>> > mapping the GFN back, the map operation might fail. Consequently, A and B might
>> > not be able to restore the mapped status of the GFN.
>> 
>> For conversion we don't attempt to restore mappings anywhere (whether in
>> guest or host page tables). What do you think of not restoring the
>> mappings?
> It could cause problem if the mappings in S-EPT can't be restored.
>
> For TDX private-to-shared conversion, if kvm_gmem_convert_should_proceed() -->
> kvm_gmem_unmap_private() --> kvm_mmu_unmap_gfn_range() fails in the end, then
> the GFN shareability is restored to private. The next guest access to
> the partially unmapped private memory can meet a fatal error: "access before
> acceptance".
>
> It could occur in such a scenario:
> 1. TD issues a TDVMCALL_MAP_GPA to convert a private GFN to shared
> 2. Conversion fails in KVM.
> 3. set_memory_decrypted() fails in TD.
> 4. TD thinks the GFN is still accepted as private and accesses it.
>
>

This is true. I was thinking that this isn't handled solely in
conversion but as part of the contract between the userspace VMM and
the guest: the guest must handle conversion failures. I've hopefully
addressed this on the other thread at [1].

>> > For IOMMU mappings, this
>> > could result in DMAR failure following a failed attempt to do shared-to-private
>> > conversion.
>> 
>> I believe the current conversion setup guards against this because after
>> unmapping from the host, we check for any unexpected refcounts.
> Right, it's fine if we check for any unexpected refcounts.
>
>
>> (This unmapping is not the unmapping we're concerned about, since this is
>> shared memory, and unmapping doesn't go through TDX.)
>> 
>> Coming back to the refcounts, if the IOMMU had mappings, these refcounts
>> are "unexpected". The conversion ioctl will return to userspace with an
>> error.
>> 
>> IO can continue to happen, since the memory is still mapped in the
>> IOMMU. The memory state is still shared. No issue there.
>> 
>> In RFCv2 [1], we expect userspace to see the error, then try and remove
>> the memory from the IOMMU, and then try conversion again.
> I don't think it's right to depend on userspace always behaving in the way the
> kernel expects, i.e. retrying conversion until it succeeds.
>

Let me think more deeply about this. Please let me know if there's
anything I missed.

It is true that a buggy or malicious userspace VMM can ignore conversion
failures and report success to the guest, but if both the userspace VMM
and guest are malicious, it's quite hard for the kernel to defend
against that.

As long as there's no point where the guest can crash the host in a
fixed way, I think it is okay to rely on a userspace VMM and guest
protocol.

IIUC the guest can crash the host (original point of having guest_memfd)
if the guest can convince the host to write to private memory. For that
to happen, the memory must be faulted into the Secure EPTs, and the
shareability state must be ALL for the host to fault it in.

So to have this issue, the conversion failure must be such that the
memory remains faulted into the Secure EPTs while shareability is
shared. Since unmapping from secure EPTs happens pretty early before any
shareability is changed or any rollback (and rollback failures) can
happen, I think we should be quite safe?

If unmapping of private memory fails, this is where I think guest_memfd
should get an error from the unmap and it should not proceed to change
shareability.


> We need to restore to the previous status (which includes the host page table)
> if conversion can't be done.

Most of the previous status (shareability, filemap, restructuring (aka
split/merge, including stash metadata)) is restored, except during
rollback failures.

As for presence in host page tables, is it okay to defer that till the
next fault, and if not okay, why not?

For presence in guest page tables, is it okay to fall back on the
protocol where the guest must handle conversion failures, and if not
okay, why not?

> That said, in my view, a better flow would be:
>
> 1. guest_memfd sends a pre-invalidation request to users (users here means the
>    consumers in kernel of memory allocated from guest_memfd).
>
> 2. Users (A, B, ..., X) perform pre-checks to determine if invalidation can
>    proceed. For example, in the case of TDX, this might involve memory
>    allocation and page splitting.
>
> 3. Based on the pre-check results, guest_memfd either aborts the invalidation or
>    proceeds by sending the actual invalidation request.
>
> 4. Users (A-X) perform the actual unmap operation, ensuring it cannot fail. For
>    TDX, the unmap must succeed unless there are bugs in the KVM or TDX module.
>    In such cases, TDX can callback guest_memfd to inform the poison-status of
>    the page or elevate the page reference count.
>
> 5. guest_memfd completes the invalidation process. If the memory is marked as
>    "poison," guest_memfd can handle it accordingly. If the page has an elevated
>    reference count, guest_memfd may not need to take special action, as the
>    elevated count prevents the OS from reallocating the page.
>    (but from your reply below, seems a callback to guest_memfd is a better
>    approach).
>
>

Thanks for this, I've tried to combine this into my response at
[1]. I think this works, but it's hard because

a. Pre-checks are hard to check (explained at [1])
b. Even after all the checks, unmapping can still fail, and those still
   have to be handled, and to handle those, we have to buy into the
   userspace VMM/guest protocol, so why not just buy into the protocol
   to start with?

[1] https://lore.kernel.org/all/diqztt4uhunj.fsf@ackerleytng-ctop.c.googlers.com/

>> The part in concern here is unmapping failures of private pages, for
>> private-to-shared conversions, since that part goes through TDX and
>> might fail.
> IMO, even for TDX, the real unmap must not fail unless there are bugs in the KVM
> or TDX module.
> So, for page splitting in S-EPT, I prefer to try splitting in the
> pre-invalidation phase before conducting any real unmap.
>
>

Thanks for your detailed suggestion.

>> One other thing about taking refcounts is that in RFCv2,
>> private-to-shared conversions assume that there are no refcounts on the
>> private pages at all. (See filemap_remove_folio_for_restructuring() in
>> [3])
>>
>> Haven't had a chance to think about all the edge cases, but for now I
>> think on unmapping failure, in addition to taking a refcount, we should
>> return an error at least up to guest_memfd, so that guest_memfd could
>> perhaps keep the refcount on that page, but drop the page from the
>> filemap. Another option could be to track messed up addresses and always
>> check that on conversion or something - not sure yet.
>
> It looks good to me. See the bullet 4 in my proposed flow above.
>

Thanks again for your detailed suggestion.

>> Either way, guest_memfd must know. If guest_memfd is not informed, on a
>> next conversion request, the conversion will just spin in
>> filemap_remove_folio_for_restructuring().
> It makes sense.
>
>
>> What do you think of this part about informing guest_memfd of the
>> failure to unmap?
> So, do you want to add a guest_memfd callback to achieve this purpose?
>

I will need to think the entire thing through, but I meant informing as
in returning an error to guest_memfd so that guest_memfd knows. I think
returning an error should be the first course of action.

As for whether guest_memfd should know how to handle the error or
whether the userspace VMM should participate in deciding what to do with
the error, I'm not sure. If you have suggestions on this, I hope we can
combine the suggestions about the conversion protocol on the other thread.

Regarding a callback, are you thinking something like not having the
unmap return an error, but instead TDX will call a function like
kvm_gmem_error_at_offset(loff_t offset), guest_memfd will record that
somewhere, and then immediately after calling unmap guest_memfd will
check kvm_gmem_was_there_an_error_in_range() to determine whether
there's an error? Something like that?

I guess it could work but feels a little odd.
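
For concreteness, the shape I'm imagining is something like the below (purely
illustrative; the names and the idea of keeping the error marks in a per-inode
xarray are all assumptions, not actual guest_memfd code):

        /* TDX marks the index it failed to unmap. */
        static void kvm_gmem_error_at_offset(struct xarray *error_indices,
                                             pgoff_t index)
        {
                xa_store(error_indices, index, xa_mk_value(1), GFP_ATOMIC);
        }

        /* guest_memfd checks the range it just asked to be unmapped. */
        static bool kvm_gmem_was_there_an_error_in_range(struct xarray *error_indices,
                                                         pgoff_t start, pgoff_t end)
        {
                unsigned long index = start;

                return xa_find(error_indices, &index, end - 1, XA_PRESENT) != NULL;
        }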

>
> BTW, here's an analysis of why we can't let kvm_mmu_unmap_gfn_range()
> and mmu_notifier_invalidate_range_start() fail, based on the repo
> https://github.com/torvalds/linux.git, commit cd2e103d57e5 ("Merge tag
> 'hardening-v6.16-rc1-fix1-take2' of
> git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux")

Thank you, I appreciate the effort you took to enumerate these. The
following suggestions are based on my current understanding. I don't
have time in the near future to do the plumbing to test out the
suggestion, but for now I want to see if it makes sense; maybe you can
correct any misunderstandings first.

>
> 1. Status of mmu notifier
> -------------------------------
> (1) There're 34 direct callers of mmu_notifier_invalidate_range_start().
>     1. clear_refs_write
>     2. do_pagemap_scan
>     3. uprobe_write_opcode
>     4. do_huge_zero_wp_pmd
>     5. __split_huge_pmd (N)
>     6. __split_huge_pud (N)
>     7. move_pages_huge_pmd
>     8. copy_hugetlb_page_range
>     9. hugetlb_unshare_pmds  (N)
>     10. hugetlb_change_protection
>     11. hugetlb_wp
>     12. unmap_hugepage_range (N)
>     13. move_hugetlb_page_tables
>     14. collapse_huge_page
>     15. retract_page_tables
>     16. collapse_pte_mapped_thp
>     17. write_protect_page
>     18. replace_page
>     19. madvise_free_single_vma
>     20. wp_clean_pre_vma
>     21. wp_page_copy 
>     22. zap_page_range_single_batched (N)
>     23. unmap_vmas (N)
>     24. copy_page_range 
>     25. remove_device_exclusive_entry
>     26. migrate_vma_collect
>     27. __migrate_device_pages
>     28. change_pud_range 
>     29. move_page_tables
>     30. page_vma_mkclean_one
>     31. try_to_unmap_one
>     32. try_to_migrate_one
>     33. make_device_exclusive
>     34. move_pages_pte
>
> Of these 34 direct callers, those marked with (N) cannot tolerate
> mmu_notifier_invalidate_range_start() failing. I have not yet investigated all
> 34 direct callers one by one, so the list of (N) is incomplete.
>
> For 5. __split_huge_pmd(), Documentation/mm/transhuge.rst says:
> "Note that split_huge_pmd() doesn't have any limitations on refcounting:
> pmd can be split at any point and never fails." This is because split_huge_pmd()
> serves as a graceful fallback design for code walking pagetables but unaware
> about huge pmds.
>
>

Do these callers, especially those with (N), ever try to unmap any TDX
private pages? guest_memfd only gives shared pages to core-mm, so for
shared pages, there will continue to be no chance of errors.

If we change mmu_notifier_invalidate_range_start() to return an int, all
of the callers that never invalidate shared pages can continue to safely
rely on the fact that mmu_notifier_invalidate_range_start() will return
0.
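
A rough sketch of what I mean (purely illustrative, not a real patch; it leans
on the fact that __mmu_notifier_invalidate_range_start() already returns an int
for the nonblock case):

        static inline int
        mmu_notifier_invalidate_range_start(struct mmu_notifier_range *range)
        {
                int ret = 0;

                might_sleep();
                if (mm_has_notifiers(range->mm)) {
                        range->flags |= MMU_NOTIFIER_RANGE_BLOCKABLE;
                        ret = __mmu_notifier_invalidate_range_start(range);
                }
                /* Callers that never invalidate private pages keep seeing 0. */
                return ret;
        }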

For the callers of mmu_notifier_invalidate_range_start() that may touch
private pages, I believe that's only guest_memfd and KVM. That's where
we want the error, and will handle the error.

Another point here is that I was thinking to put EPT splitting together
with actual unmapping instead of with invalidation because we will
probably invalidate more than we unmap (see explanation at [1] about the
race). Maybe moving EPT splitting to unmap could help?

> (2) There's 1 direct caller of mmu_notifier_invalidate_range_start_nonblock(),
> __oom_reap_task_mm(), which only expects the error -EAGAIN.
>
> In mn_hlist_invalidate_range_start():
> "WARN_ON(mmu_notifier_range_blockable(range) || _ret != -EAGAIN);"
>
>
> (3) For DMAs, drivers need to invoke pin_user_pages() to pin memory. In that
> case, they don't need to register mmu notifier.
>
> Or, device drivers can pin pages via get_user_pages*(), and register for mmu         
> notifier callbacks for the memory range. Then, upon receiving a notifier         
> "invalidate range" callback , stop the device from using the range, and unpin    
> the pages.
>
> See Documentation/core-api/pin_user_pages.rst.
>
>

Do you mean that we should teach device drivers to get callbacks for
private pages? Are you looking ahead to handle TDX IO on private pages?
So far we haven't handled that yet.

> 2. Cases that cannot tolerate failure of mmu_notifier_invalidate_range_start()
> -------------------------------
> (1) Error fallback cases.
>
>     1. split_huge_pmd() as mentioned in Documentation/mm/transhuge.rst.
>        split_huge_pmd() is designed as a graceful fallback without failure.
>
>        split_huge_pmd
>         |->__split_huge_pmd
>            |->mmu_notifier_range_init
>            |  mmu_notifier_invalidate_range_start
>            |  split_huge_pmd_locked
>            |  mmu_notifier_invalidate_range_end
>
>
>     2. in fs/iomap/buffered-io.c, iomap_write_failed() itself is error handling.
>        iomap_write_failed
>          |->truncate_pagecache_range
>             |->unmap_mapping_range
>             |  |->unmap_mapping_pages
>             |     |->unmap_mapping_range_tree
>             |        |->unmap_mapping_range_vma
>             |           |->zap_page_range_single
>             |              |->zap_page_range_single_batched
>             |                       |->mmu_notifier_range_init
>             |                       |  mmu_notifier_invalidate_range_start
>             |                       |  unmap_single_vma
>             |                       |  mmu_notifier_invalidate_range_end
>             |->truncate_inode_pages_range
>                |->truncate_cleanup_folio
>                   |->if (folio_mapped(folio))
>                   |     unmap_mapping_folio(folio);
>                          |->unmap_mapping_range_tree
>                             |->unmap_mapping_range_vma
>                                |->zap_page_range_single
>                                   |->zap_page_range_single_batched
>                                      |->mmu_notifier_range_init
>                                      |  mmu_notifier_invalidate_range_start
>                                      |  unmap_single_vma
>                                      |  mmu_notifier_invalidate_range_end
>
>    3. in mm/memory.c, zap_page_range_single() is invoked to handle error.
>       remap_pfn_range_notrack
>         |->int error = remap_pfn_range_internal(vma, addr, pfn, size, prot);
>         |  if (!error)
>         |      return 0;
> 	|  zap_page_range_single
>            |->zap_page_range_single_batched
>               |->mmu_notifier_range_init
>               |  mmu_notifier_invalidate_range_start
>               |  unmap_single_vma
>               |  mmu_notifier_invalidate_range_end
>
>    4. in kernel/events/core.c, zap_page_range_single() is invoked to clear any
>       partial mappings on error.
>
>       perf_mmap
>         |->ret = map_range(rb, vma);
>                  |  err = remap_pfn_range
>                  |->if (err) 
>                  |     zap_page_range_single
>                         |->zap_page_range_single_batched
>                            |->mmu_notifier_range_init
>                            |  mmu_notifier_invalidate_range_start
>                            |  unmap_single_vma
>                            |  mmu_notifier_invalidate_range_end
>
>
>    5. in mm/memory.c, unmap_mapping_folio() is invoked to unmap posion page.
>
>       __do_fault
> 	|->if (unlikely(PageHWPoison(vmf->page))) { 
> 	|	vm_fault_t poisonret = VM_FAULT_HWPOISON;
> 	|	if (ret & VM_FAULT_LOCKED) {
> 	|		if (page_mapped(vmf->page))
> 	|			unmap_mapping_folio(folio);
>         |                       |->unmap_mapping_range_tree
>         |                          |->unmap_mapping_range_vma
>         |                             |->zap_page_range_single
>         |                                |->zap_page_range_single_batched
>         |                                   |->mmu_notifier_range_init
>         |                                   |  mmu_notifier_invalidate_range_start
>         |                                   |  unmap_single_vma
>         |                                   |  mmu_notifier_invalidate_range_end
> 	|		if (mapping_evict_folio(folio->mapping, folio))
> 	|			poisonret = VM_FAULT_NOPAGE; 
> 	|		folio_unlock(folio);
> 	|	}
> 	|	folio_put(folio);
> 	|	vmf->page = NULL;
> 	|	return poisonret;
> 	|  }
>
>
>   6. in mm/vma.c, in __mmap_region(), unmap_region() is invoked to undo any
>      partial mapping done by a device driver.
>
>      __mmap_new_vma
>        |->__mmap_new_file_vma(map, vma);
>           |->error = mmap_file(vma->vm_file, vma);
>           |  if (error)
>           |     unmap_region
>                  |->unmap_vmas
>                     |->mmu_notifier_range_init
>                     |  mmu_notifier_invalidate_range_start
>                     |  unmap_single_vma
>                     |  mmu_notifier_invalidate_range_end
>
>

These should probably not ever be invalidating or unmapping private pages.

> (2) No-fail cases
> -------------------------------
> 1. iput() cannot fail. 
>
> iput
>  |->iput_final
>     |->WRITE_ONCE(inode->i_state, state | I_FREEING);
>     |  inode_lru_list_del(inode);
>     |  evict(inode);
>        |->op->evict_inode(inode);
>           |->shmem_evict_inode
>              |->shmem_truncate_range
>                 |->truncate_inode_pages_range
>                    |->truncate_cleanup_folio
>                       |->if (folio_mapped(folio))
>                       |     unmap_mapping_folio(folio);
>                             |->unmap_mapping_range_tree
>                                |->unmap_mapping_range_vma
>                                   |->zap_page_range_single
>                                      |->zap_page_range_single_batched
>                                         |->mmu_notifier_range_init
>                                         |  mmu_notifier_invalidate_range_start
>                                         |  unmap_single_vma
>                                         |  mmu_notifier_invalidate_range_end
>
>
> 2. exit_mmap() cannot fail
>
> exit_mmap
>   |->mmu_notifier_release(mm);
>      |->unmap_vmas(&tlb, &vmi.mas, vma, 0, ULONG_MAX, ULONG_MAX, false);
>         |->mmu_notifier_range_init
>         |  mmu_notifier_invalidate_range_start
>         |  unmap_single_vma
>         |  mmu_notifier_invalidate_range_end
>
>

These should probably not ever be invalidating or unmapping private pages.

> 3. KVM Cases That Cannot Tolerate Unmap Failure
> -------------------------------
> Allowing unmap operations to fail in the following scenarios would make it very
> difficult or even impossible to handle the failure:
>
> (1) __kvm_mmu_get_shadow_page() is designed to reliably obtain a shadow page
> without expecting any failure.
>
> mmu_alloc_direct_roots
>   |->mmu_alloc_root
>      |->kvm_mmu_get_shadow_page
>         |->__kvm_mmu_get_shadow_page
>            |->kvm_mmu_alloc_shadow_page
>               |->account_shadowed
>                  |->kvm_mmu_slot_gfn_write_protect
>                     |->kvm_tdp_mmu_write_protect_gfn
>                        |->write_protect_gfn
>                           |->tdp_mmu_iter_set_spte
>
>

I need to learn more about shadow pages but IIUC TDX doesn't use shadow
pages so this path won't interact with unmapping private pages.

> (2) kvm_vfio_release() and kvm_vfio_file_del() cannot fail
>
> kvm_vfio_release/kvm_vfio_file_del
>  |->kvm_vfio_update_coherency
>     |->kvm_arch_unregister_noncoherent_dma
>        |->kvm_noncoherent_dma_assignment_start_or_stop
>           |->kvm_zap_gfn_range
>              |->kvm_tdp_mmu_zap_leafs
>                 |->tdp_mmu_zap_leafs
>                    |->tdp_mmu_iter_set_spte
>
>

I need to learn more about VFIO but for now IIUC IO uses shared pages,
so this path won't interact with unmapping private pages.

> (3) There are many callers of __kvm_set_or_clear_apicv_inhibit(), none of
> which currently expect unmap to fail.
>
> __kvm_set_or_clear_apicv_inhibit
>   |->kvm_zap_gfn_range
>      |->kvm_tdp_mmu_zap_leafs
>         |->tdp_mmu_zap_leafs
>            |->tdp_mmu_iter_set_spte
>
>
>

There could be some TDX specific things such that TDX doesn't use this
path.

> 4. Cases in KVM where it's hard to allow tdp_mmu_set_spte() (which updates an
> SPTE under write mmu_lock) to fail.
>
> (1) kvm_vcpu_flush_tlb_guest()
>
> kvm_vcpu_flush_tlb_guest
>   |->kvm_mmu_sync_roots
>      |->mmu_sync_children
>         |->kvm_vcpu_write_protect_gfn
>            |->kvm_mmu_slot_gfn_write_protect
>               |->kvm_tdp_mmu_write_protect_gfn
>                  |->write_protect_gfn
>                     |->tdp_mmu_iter_set_spte
>                        |->tdp_mmu_set_spte
>
>
> (2) handle_removed_pt() and handle_changed_spte().
>

Thank you so much for looking into these. I'm hoping that the number of
cases where TDX private pages are unmapped is really limited to a few
paths that we have to rework.

If we agree that the error has to be handled, then regardless of how we
let the caller know that an error happened, all paths touching TDX
private pages have to be reworked.

Between (1) returning an error and (2) marking the error and having the
caller check for it, it's probably better to use the standard approach
of returning an error, since it is better understood and there's no need
for extra data structures?

>
> Thanks
> Yan

^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge pages
  2025-06-05  2:47                                     ` Yan Zhao
@ 2025-06-05 22:35                                       ` Ackerley Tng
  2025-06-19  8:11                                         ` Yan Zhao
  2025-07-16  1:23                                         ` Yan Zhao
  0 siblings, 2 replies; 294+ messages in thread
From: Ackerley Tng @ 2025-06-05 22:35 UTC (permalink / raw)
  To: Yan Zhao
  Cc: vannapurve, pbonzini, seanjc, linux-kernel, kvm, x86,
	rick.p.edgecombe, dave.hansen, kirill.shutemov, tabba,
	quic_eberman, michael.roth, david, vbabka, jroedel,
	thomas.lendacky, pgonda, zhiquan1.li, fan.du, jun.miao, ira.weiny,
	isaku.yamahata, xiaoyao.li, binbin.wu, chao.p.peng

Yan Zhao <yan.y.zhao@intel.com> writes:

> On Wed, Jun 04, 2025 at 01:02:54PM -0700, Ackerley Tng wrote:
>> Hi Yan,
>> 
>> While working on the 1G (aka HugeTLB) page support for guest_memfd
>> series [1], we took into account conversion failures too. The steps are
>> in kvm_gmem_convert_range(). (It might be easier to pull the entire
>> series from GitHub [2] because the steps for conversion changed in two
>> separate patches.)
> ...
>> [2] https://github.com/googleprodkernel/linux-cc/tree/gmem-1g-page-support-rfc-v2
>
> Hi Ackerley,
> Thanks for providing this branch.

Here's the WIP branch [1], which I initially wasn't intending to make
super public since it's not even RFC standard yet and I didn't want to
add to the many guest_memfd in-flight series, but since you referred to
it, [2] is a v2 of the WIP branch :)

[1] https://github.com/googleprodkernel/linux-cc/commits/wip-tdx-gmem-conversions-hugetlb-2mept
[2] https://github.com/googleprodkernel/linux-cc/commits/wip-tdx-gmem-conversions-hugetlb-2mept-v2

This WIP branch has selftests that test 1G aka HugeTLB page support with
TDX huge page EPT mappings [7]:

1. "KVM: selftests: TDX: Test conversion to private at different
   sizes". This uses the fact that TDX module will return error if the
   page is faulted into the guest at a different level from the accept
   level to check the level that the page was faulted in.
2. "KVM: selftests: Test TDs in private_mem_conversions_test". Updates
   private_mem_conversions_test for use with TDs. This test does
   multi-vCPU conversions and we use this to check for issues to do with
   conversion races.
3. "KVM: selftests: TDX: Test conversions when guest_memfd used for
   private and shared memory". Adds a selftest similar to/on top of
   guest_memfd_conversions_test that does conversions via MapGPA.

Full list of selftests I usually run from tools/testing/selftests/kvm:

+ ./guest_memfd_test
+ ./guest_memfd_conversions_test
+ ./guest_memfd_provide_hugetlb_cgroup_mount.sh ./guest_memfd_wrap_test_check_hugetlb_reporting.sh ./guest_memfd_test
+ ./guest_memfd_provide_hugetlb_cgroup_mount.sh ./guest_memfd_wrap_test_check_hugetlb_reporting.sh ./guest_memfd_conversions_test
+ ./guest_memfd_provide_hugetlb_cgroup_mount.sh ./guest_memfd_wrap_test_check_hugetlb_reporting.sh ./guest_memfd_hugetlb_reporting_test
+ ./x86/private_mem_conversions_test.sh
+ ./set_memory_region_test
+ ./x86/private_mem_kvm_exits_test
+ ./x86/tdx_vm_test
+ ./x86/tdx_upm_test
+ ./x86/tdx_shared_mem_test
+ ./x86/tdx_gmem_private_and_shared_test

As an overview for anyone who might be interested in this WIP branch:

1.  I started with upstream's kvm/next
2.  Applied TDX selftests series [3]
3.  Applied guest_memfd mmap series [4]
4.  Applied conversions (sub)series and HugeTLB (sub)series [5]
5.  Added some fixes for 2 of the earlier series (as labeled in commit
    message)
6.  Updated guest_memfd conversions selftests to work with TDX
7.  Applied 2M EPT series [6] with some hacks
8.  Some patches to make guest_memfd mmap return huge-page-aligned
    userspace address
9.  Selftests for guest_memfd conversion with TDX 2M EPT

[3] https://lore.kernel.org/all/20250414214801.2693294-1-sagis@google.com/
[4] https://lore.kernel.org/all/20250513163438.3942405-11-tabba@google.com/T/
[5] https://lore.kernel.org/all/cover.1747264138.git.ackerleytng@google.com/T/
[6] https://lore.kernel.org/all/Z%2FOMB7HNO%2FRQyljz@yzhao56-desk.sh.intel.com/
[7] https://lore.kernel.org/all/20250424030033.32635-1-yan.y.zhao@intel.com/

>
> I'm now trying to get TD huge pages working on this branch and would like to
> report early the errors I encountered during this process.
>
> 1. The symbol arch_get_align_mask() is not available when KVM is compiled as a module.
>    I currently work around it as follows:
>
> --- a/virt/kvm/guest_memfd.c
> +++ b/virt/kvm/guest_memfd.c
> @@ -102,8 +102,13 @@ static unsigned long kvm_gmem_get_align_mask(struct file *file,
>         void *priv;
>
>         inode = file_inode(file);
> -       if (!kvm_gmem_has_custom_allocator(inode))
> -             return arch_get_align_mask(file, flags);
> +       if (!kvm_gmem_has_custom_allocator(inode)) {
> +               page_size = 1 << PAGE_SHIFT;
> +               return PAGE_MASK & (page_size - 1);
> +       }
>
>

Thanks, will fix in the next revision.

> 2. Bug: sleeping function called from invalid context
>
> [  193.523469] BUG: sleeping function called from invalid context at ./include/linux/sched/mm.h:325
> [  193.539885] in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 3332, name: guest_memfd_con
> [  193.556235] preempt_count: 1, expected: 0
> [  193.564518] RCU nest depth: 0, expected: 0
> [  193.572866] 3 locks held by guest_memfd_con/3332:
> [  193.581800]  #0: ff16f8ec217e4438 (sb_writers#14){.+.+}-{0:0}, at: __x64_sys_fallocate+0x46/0x80
> [  193.598252]  #1: ff16f8fbd85c8310 (mapping.invalidate_lock#4){++++}-{4:4}, at: kvm_gmem_fallocate+0x9e/0x310 [kvm]
> [  193.616706]  #2: ff3189b5e4f65018 (&(kvm)->mmu_lock){++++}-{3:3}, at: kvm_gmem_invalidate_begin_and_zap+0x17f/0x260 [kvm]
> [  193.635790] Preemption disabled at:
> [  193.635793] [<ffffffffc0850c6f>] kvm_gmem_invalidate_begin_and_zap+0x17f/0x260 [kvm]
>
> This is because add_to_invalidated_kvms() invokes kzalloc() while holding
> kvm->mmu_lock, which is a spinlock.
>
> I worked around it as follows.
>
>  static int kvm_gmem_invalidate_begin_and_zap(struct kvm_gmem *gmem,
>                                              pgoff_t start, pgoff_t end,
> @@ -1261,13 +1268,13 @@ static int kvm_gmem_invalidate_begin_and_zap(struct kvm_gmem *gmem,
>                         KVM_MMU_LOCK(kvm);
>                         kvm_mmu_invalidate_begin(kvm);
>
> -                       if (invalidated_kvms) {
> -                               ret = add_to_invalidated_kvms(invalidated_kvms, kvm);
> -                               if (ret) {
> -                                       kvm_mmu_invalidate_end(kvm);
> -                                       goto out;
> -                               }
> -                       }
>                 }
>
>
> @@ -1523,12 +1530,14 @@ static long kvm_gmem_punch_hole(struct inode *inode, loff_t offset, loff_t len)
>         }
>
>  out:
> -       list_for_each_entry_safe(entry, tmp, &invalidated_kvms, list) {
> -               kvm_gmem_do_invalidate_end(entry->kvm);
> -               list_del(&entry->list);
> -               kfree(entry);
> -       }
> +       list_for_each_entry(gmem, gmem_list, entry)
> +               kvm_gmem_do_invalidate_end(gmem->kvm);
>
>         filemap_invalidate_unlock(inode->i_mapping);
>
>

I fixed this in WIP series v2 by grouping splitting with
unmapping. Please see this commit [8]; the commit message includes an
explanation of what's done.

[8] https://github.com/googleprodkernel/linux-cc/commit/fd27635e5209b5e45a628d7fcf42a17a2b3c7e78

> Will let you know more findings later.
>
> Thanks
> Yan

^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge pages
  2025-06-05  2:42                                     ` Yan Zhao
  2025-06-05 21:12                                       ` Ackerley Tng
@ 2025-06-11 14:30                                       ` Vishal Annapurve
  2025-06-16  9:59                                         ` Yan Zhao
  1 sibling, 1 reply; 294+ messages in thread
From: Vishal Annapurve @ 2025-06-11 14:30 UTC (permalink / raw)
  To: Yan Zhao
  Cc: Ackerley Tng, pbonzini, seanjc, linux-kernel, kvm, x86,
	rick.p.edgecombe, dave.hansen, kirill.shutemov, tabba,
	quic_eberman, michael.roth, david, vbabka, jroedel,
	thomas.lendacky, pgonda, zhiquan1.li, fan.du, jun.miao, ira.weiny,
	isaku.yamahata, xiaoyao.li, binbin.wu, chao.p.peng

On Wed, Jun 4, 2025 at 7:45 PM Yan Zhao <yan.y.zhao@intel.com> wrote:
>
> We need to restore to the previous status (which includes the host page table)
> if conversion can't be done.
> That said, in my view, a better flow would be:
>
> 1. guest_memfd sends a pre-invalidation request to users (users here means the
>    consumers in kernel of memory allocated from guest_memfd).
>
> 2. Users (A, B, ..., X) perform pre-checks to determine if invalidation can
>    proceed. For example, in the case of TDX, this might involve memory
>    allocation and page splitting.
>
> 3. Based on the pre-check results, guest_memfd either aborts the invalidation or
>    proceeds by sending the actual invalidation request.
>
> 4. Users (A-X) perform the actual unmap operation, ensuring it cannot fail. For
>    TDX, the unmap must succeed unless there are bugs in the KVM or TDX module.
>    In such cases, TDX can callback guest_memfd to inform the poison-status of
>    the page or elevate the page reference count.

A few questions here:
1) It sounds like a failure to remove entries from the S-EPT could only
be due to bugs in the KVM/TDX module; how reliable would it be to
continue executing TDX VMs on the host once such bugs are hit?
2) Is it reliable to continue executing the host kernel and other
normal VMs once such bugs are hit?
3) Can the memory be reclaimed reliably if the VM is marked as dead
and cleaned up right away?

>
> 5. guest_memfd completes the invalidation process. If the memory is marked as
>    "poison," guest_memfd can handle it accordingly. If the page has an elevated
>    reference count, guest_memfd may not need to take special action, as the
>    elevated count prevents the OS from reallocating the page.
>    (but from your reply below, seems a callback to guest_memfd is a better
>    approach).
>

^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 09/21] KVM: TDX: Enable 2MB mapping size after TD is RUNNABLE
  2025-05-20 23:47                 ` Huang, Kai
@ 2025-06-11 14:42                   ` Sean Christopherson
  2025-06-12 23:39                     ` Edgecombe, Rick P
  2025-06-13  2:41                     ` Xiaoyao Li
  0 siblings, 2 replies; 294+ messages in thread
From: Sean Christopherson @ 2025-06-11 14:42 UTC (permalink / raw)
  To: Kai Huang
  Cc: Yan Y Zhao, Rick P Edgecombe, Kirill Shutemov, Xiaoyao Li, Fan Du,
	Dave Hansen, david@redhat.com, Zhiquan Li,
	thomas.lendacky@amd.com, tabba@google.com,
	quic_eberman@quicinc.com, linux-kernel@vger.kernel.org, Ira Weiny,
	vbabka@suse.cz, pbonzini@redhat.com, Isaku Yamahata,
	michael.roth@amd.com, binbin.wu@linux.intel.com,
	ackerleytng@google.com, Chao P Peng, kvm@vger.kernel.org,
	Vishal Annapurve, jroedel@suse.de, Jun Miao, pgonda@google.com,
	x86@kernel.org

On Tue, May 20, 2025, Kai Huang wrote:
> On Tue, 2025-05-20 at 17:34 +0800, Zhao, Yan Y wrote:
> > On Tue, May 20, 2025 at 12:53:33AM +0800, Edgecombe, Rick P wrote:
> > > On Mon, 2025-05-19 at 16:32 +0800, Yan Zhao wrote:
> > > > > On the opposite, if other non-Linux TDs don't follow 1G->2M->4K
> > > > > accept order, e.g., they always accept 4K, there could be *endless
> > > > > EPT violation* if I understand your words correctly.
> > > > > 
> > > > > Isn't this yet-another reason we should choose to return PG_LEVEL_4K
> > > > > instead of 2M if no accept level is provided in the fault?
> > > > As I said, returning PG_LEVEL_4K would disallow huge pages for non-Linux TDs.
> > > > TD's accept operations at size > 4KB will get TDACCEPT_SIZE_MISMATCH.
> > > 
> > > TDX_PAGE_SIZE_MISMATCH is a valid error code that the guest should handle. The
> > > docs say the VMM needs to demote *if* the mapping is large and the accept size
> > > is small.

No thanks, fix the spec and the TDX Module.  Punting an error to the VMM is
inconsistent, convoluted, and inefficient.

Per "Table 8.2: TDG.MEM.PAGE.ACCEPT SEPT Walk Cases":

  S-EPT state         ACCEPT vs. Mapping Size         Behavior
  Leaf SEPT_PRESENT   Smaller                         TDACCEPT_SIZE_MISMATCH
  Leaf !SEPT_PRESENT  Smaller                         EPT Violation <=========================|
  Leaf DONT_CARE      Same                            Success                                 | => THESE TWO SHOULD MATCH!!!
  !Leaf SEPT_FREE     Larger                          EPT Violation, BECAUSE THERE'S NO PAGE  |
  !Leaf SEPT_FREE     Larger                          TDACCEPT_SIZE_MISMATCH <================|


If ACCEPT is "too small", an EPT violation occurs.  But if ACCEPT is "too big",
a TDACCEPT_SIZE_MISMATCH error occurs.  That's asinine.

The only reason that comes to mind for punting the "too small" case to the VMM
is to try and keep the guest alive if the VMM is mapping more memory than has
been enumerated to the guest.  E.g. if the guest suspects the VMM is malicious
or buggy.  IMO, that's a terrible reason to push this much complexity into the
host.  It also risks godawful boot times, e.g. if the guest kernel is buggy and
accepts everything at 4KiB granularity.

The TDX Module should return TDACCEPT_SIZE_MISMATCH and force the guest to take
action, not force the hypervisor to limp along in a degraded state.  If the guest
doesn't want to ACCEPT at a larger granularity, e.g. because it doesn't think the
entire 2MiB/1GiB region is available, then the guest can either log a warning and
"poison" the page(s), or terminate and refuse to boot.

If for some reason the guest _can't_ ACCEPT at larger granularity, i.e. if the
guest _knows_ that 2MiB or 1GiB is available/usable but refuses to ACCEPT at the
appropriate granularity, then IMO that's firmly a guest bug.

If there's a *legitimate* use case where the guest wants to ACCEPT a subset of
memory, then there should be an explicit TDCALL to request that the unwanted
regions of memory be unmapped.  Smushing everything into implicit behavior has
obviously created a giant mess.

^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 09/21] KVM: TDX: Enable 2MB mapping size after TD is RUNNABLE
  2025-06-11 14:42                   ` Sean Christopherson
@ 2025-06-12 23:39                     ` Edgecombe, Rick P
  2025-06-13  0:19                       ` Sean Christopherson
  2025-06-13  2:41                     ` Xiaoyao Li
  1 sibling, 1 reply; 294+ messages in thread
From: Edgecombe, Rick P @ 2025-06-12 23:39 UTC (permalink / raw)
  To: seanjc@google.com, Huang, Kai
  Cc: quic_eberman@quicinc.com, Li, Xiaoyao, Du, Fan, Hansen, Dave,
	david@redhat.com, thomas.lendacky@amd.com, vbabka@suse.cz,
	Li, Zhiquan1, Shutemov, Kirill, Zhao, Yan Y,
	linux-kernel@vger.kernel.org, Weiny, Ira, michael.roth@amd.com,
	pbonzini@redhat.com, Yamahata, Isaku, ackerleytng@google.com,
	tabba@google.com, Peng, Chao P, kvm@vger.kernel.org,
	binbin.wu@linux.intel.com, Annapurve, Vishal, jroedel@suse.de,
	Miao, Jun, pgonda@google.com, x86@kernel.org

On Wed, 2025-06-11 at 07:42 -0700, Sean Christopherson wrote:
> If there's a *legitimate* use case where the guest wants to ACCEPT a subset of
> memory, then there should be an explicit TDCALL to request that the unwanted
> regions of memory be unmapped.  Smushing everything into implicit behavior has
> obviously created a giant mess.

Hi, still digging into whether there is any possible use.

I think this may need a guest opt-in, so the guest can say it can handle errors
for both smaller and larger page size matches. So it may not matter if there is
a rare usage or not. If KVM finds the guest opts-in (how to do that TBD), it can
start mapping at the host level. If KVM doesn't see the opt-in, the guest gets
4k pages.

^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 09/21] KVM: TDX: Enable 2MB mapping size after TD is RUNNABLE
  2025-06-12 23:39                     ` Edgecombe, Rick P
@ 2025-06-13  0:19                       ` Sean Christopherson
  2025-06-13  0:25                         ` Edgecombe, Rick P
  0 siblings, 1 reply; 294+ messages in thread
From: Sean Christopherson @ 2025-06-13  0:19 UTC (permalink / raw)
  To: Rick P Edgecombe
  Cc: Kai Huang, quic_eberman@quicinc.com, Xiaoyao Li, Fan Du,
	Dave Hansen, david@redhat.com, thomas.lendacky@amd.com,
	vbabka@suse.cz, Zhiquan1 Li, Kirill Shutemov, Yan Y Zhao,
	linux-kernel@vger.kernel.org, Ira Weiny, michael.roth@amd.com,
	pbonzini@redhat.com, Isaku Yamahata, ackerleytng@google.com,
	tabba@google.com, Chao P Peng, kvm@vger.kernel.org,
	binbin.wu@linux.intel.com, Vishal Annapurve, jroedel@suse.de,
	Jun Miao, pgonda@google.com, x86@kernel.org

On Thu, Jun 12, 2025, Rick P Edgecombe wrote:
> On Wed, 2025-06-11 at 07:42 -0700, Sean Christopherson wrote:
> > If there's a *legitimate* use case where the guest wants to ACCEPT a subset of
> > memory, then there should be an explicit TDCALL to request that the unwanted
> > regions of memory be unmapped.  Smushing everything into implicit behavior has
> > obviously created a giant mess.
> 
> Hi, still digging on if there is any possible use.
> 
> I think this may need a guest opt-in, so the guest can say it can handle errors
> for both smaller and larger page size matches. So it may not matter if there is
> a rare usage or not. If KVM finds the guest opts-in (how to do that TBD), it can
> start mapping at the host level. 

Hmm, clever.  That should work; requiring an updated guest kernel to get optimal
performance doesn't seem too onerous.

> If KVM doesn't see the opt-in, the guest gets 4k pages.

As in, KVM doesn't even try to use hugepage mappings?  If so, this idea probably
gets my vote.

^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 09/21] KVM: TDX: Enable 2MB mapping size after TD is RUNNABLE
  2025-06-13  0:19                       ` Sean Christopherson
@ 2025-06-13  0:25                         ` Edgecombe, Rick P
  2025-06-13  0:44                           ` Sean Christopherson
  0 siblings, 1 reply; 294+ messages in thread
From: Edgecombe, Rick P @ 2025-06-13  0:25 UTC (permalink / raw)
  To: seanjc@google.com
  Cc: quic_eberman@quicinc.com, Li, Xiaoyao, Huang, Kai, Du, Fan,
	Hansen, Dave, david@redhat.com, thomas.lendacky@amd.com,
	Zhao, Yan Y, Li, Zhiquan1, Shutemov, Kirill, michael.roth@amd.com,
	linux-kernel@vger.kernel.org, Weiny, Ira, pbonzini@redhat.com,
	Peng, Chao P, Yamahata, Isaku, ackerleytng@google.com,
	vbabka@suse.cz, kvm@vger.kernel.org, binbin.wu@linux.intel.com,
	tabba@google.com, Annapurve, Vishal, jroedel@suse.de, Miao, Jun,
	pgonda@google.com, x86@kernel.org

On Thu, 2025-06-12 at 17:19 -0700, Sean Christopherson wrote:
> > I think this may need a guest opt-in, so the guest can say it can handle
> > errors for both smaller and larger page size matches. So it may not matter
> > if there is a rare usage or not. If KVM finds the guest opts-in (how to do
> > that TBD), it can start mapping at the host level.
> 
> Hmm, clever.  That should work; requiring an updated guest kernel to get
> optimal performance doesn't seem too onerous.
> 
> > If KVM doesn't see the opt-in, the guest gets 4k pages.
> 
> As in, KVM doesn't even try to use hugepage mappings?  If so, this idea
> probably gets my vote.

Maybe an "I can handle it" accept size bit that comes in the exit qualification?

Yan, do you see any problems with that? Like if a guest passed it in some accept
and not others? Thinking about the new "unaccept" SEAMCALL...

^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 09/21] KVM: TDX: Enable 2MB mapping size after TD is RUNNABLE
  2025-06-13  0:25                         ` Edgecombe, Rick P
@ 2025-06-13  0:44                           ` Sean Christopherson
  2025-06-13  0:47                             ` Edgecombe, Rick P
  0 siblings, 1 reply; 294+ messages in thread
From: Sean Christopherson @ 2025-06-13  0:44 UTC (permalink / raw)
  To: Rick P Edgecombe
  Cc: quic_eberman@quicinc.com, Xiaoyao Li, Kai Huang, Fan Du,
	Dave Hansen, david@redhat.com, thomas.lendacky@amd.com,
	Yan Y Zhao, Zhiquan1 Li, Kirill Shutemov, michael.roth@amd.com,
	linux-kernel@vger.kernel.org, Ira Weiny, pbonzini@redhat.com,
	Chao P Peng, Isaku Yamahata, ackerleytng@google.com,
	vbabka@suse.cz, kvm@vger.kernel.org, binbin.wu@linux.intel.com,
	tabba@google.com, Vishal Annapurve, jroedel@suse.de, Jun Miao,
	pgonda@google.com, x86@kernel.org

On Fri, Jun 13, 2025, Rick P Edgecombe wrote:
> On Thu, 2025-06-12 at 17:19 -0700, Sean Christopherson wrote:
> > > I think this may need a guest opt-in, so the guest can say it can handle
> > > errors for both smaller and larger page size matches. So it may not
> > > matter if there is a rare usage or not. If KVM finds the guest opts-in
> > > (how to do that TBD), it can start mapping at the host level. 
> > 
> > Hmm, clever.  That should work; requiring an updated guest kernel to get
> > optimal performance doesn't seem too onerous.
> > 
> > > If KVM doesn't see the opt-in, the guest gets 4k pages.
> > 
> > As in, KVM doesn't even try to use hugepage mappings?  If so, this idea
> > probably gets my vote.
> 
> Maybe an "I can handle it" accept size bit that comes in the exit qualification?

Eww, no.  Having to react on _every_ EPT violation would be annoying, and trying
to debug issues where the guest is mixing options would probably be a nightmare.

I was thinking of something along the lines of an init-time or boot-time opt-in.

> Yan, do you see any problems with that? Like if a guest passed it in some accept
> and not others? Thinking about the new "unaccept" SEAMCALL...

^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 09/21] KVM: TDX: Enable 2MB mapping size after TD is RUNNABLE
  2025-06-13  0:44                           ` Sean Christopherson
@ 2025-06-13  0:47                             ` Edgecombe, Rick P
  2025-06-13  1:32                               ` Yan Zhao
  0 siblings, 1 reply; 294+ messages in thread
From: Edgecombe, Rick P @ 2025-06-13  0:47 UTC (permalink / raw)
  To: seanjc@google.com
  Cc: quic_eberman@quicinc.com, Li, Xiaoyao, Huang, Kai, Du, Fan,
	Hansen, Dave, david@redhat.com, thomas.lendacky@amd.com,
	Zhao, Yan Y, Li, Zhiquan1, Shutemov, Kirill, michael.roth@amd.com,
	linux-kernel@vger.kernel.org, Weiny, Ira, Peng, Chao P,
	pbonzini@redhat.com, Yamahata, Isaku, ackerleytng@google.com,
	vbabka@suse.cz, kvm@vger.kernel.org, binbin.wu@linux.intel.com,
	tabba@google.com, Annapurve, Vishal, jroedel@suse.de, Miao, Jun,
	pgonda@google.com, x86@kernel.org

On Thu, 2025-06-12 at 17:44 -0700, Sean Christopherson wrote:
> > Maybe an "I can handle it" accept size bit that comes in the exit
> > qualification?
> 
> Eww, no.  Having to react on _every_ EPT violation would be annoying, and
> trying to debug issues where the guest is mixing options would probably be
> a nightmare.
> 
> I was thinking of something along the lines of an init-time or boot-time
> opt-in.

Fair.

^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 09/21] KVM: TDX: Enable 2MB mapping size after TD is RUNNABLE
  2025-06-13  0:47                             ` Edgecombe, Rick P
@ 2025-06-13  1:32                               ` Yan Zhao
  2025-06-13 21:53                                 ` Edgecombe, Rick P
  0 siblings, 1 reply; 294+ messages in thread
From: Yan Zhao @ 2025-06-13  1:32 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: seanjc@google.com, quic_eberman@quicinc.com, Li, Xiaoyao,
	Huang, Kai, Du, Fan, Hansen, Dave, david@redhat.com,
	thomas.lendacky@amd.com, Li, Zhiquan1, Shutemov, Kirill,
	michael.roth@amd.com, linux-kernel@vger.kernel.org, Weiny, Ira,
	Peng, Chao P, pbonzini@redhat.com, Yamahata, Isaku,
	ackerleytng@google.com, vbabka@suse.cz, kvm@vger.kernel.org,
	binbin.wu@linux.intel.com, tabba@google.com, Annapurve, Vishal,
	jroedel@suse.de, Miao, Jun, pgonda@google.com, x86@kernel.org

On Fri, Jun 13, 2025 at 08:47:28AM +0800, Edgecombe, Rick P wrote:
> On Thu, 2025-06-12 at 17:44 -0700, Sean Christopherson wrote:
> > > Maybe an "I can handle it" accept size bit that comes in the exit
> > > qualification?

Dynamically turning on "I can handle it" would still require supporting
demotion in the fault path, because there may be existing huge-page
mappings when an EPT violation without "I can handle it" arrives.

> > Eww, no.  Having to react on _every_ EPT violation would be annoying, and
> > trying to debug issues where the guest is mixing options would probably be
> > a nightmare.
> > 
> > I was thinking of something along the lines of an init-time or boot-time
> > opt-in.
> 
> Fair.

Agreed.

^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 09/21] KVM: TDX: Enable 2MB mapping size after TD is RUNNABLE
  2025-06-11 14:42                   ` Sean Christopherson
  2025-06-12 23:39                     ` Edgecombe, Rick P
@ 2025-06-13  2:41                     ` Xiaoyao Li
  2025-06-13  3:29                       ` Yan Zhao
  1 sibling, 1 reply; 294+ messages in thread
From: Xiaoyao Li @ 2025-06-13  2:41 UTC (permalink / raw)
  To: Sean Christopherson, Kai Huang
  Cc: Yan Y Zhao, Rick P Edgecombe, Kirill Shutemov, Fan Du,
	Dave Hansen, david@redhat.com, Zhiquan Li,
	thomas.lendacky@amd.com, tabba@google.com,
	quic_eberman@quicinc.com, linux-kernel@vger.kernel.org, Ira Weiny,
	vbabka@suse.cz, pbonzini@redhat.com, Isaku Yamahata,
	michael.roth@amd.com, binbin.wu@linux.intel.com,
	ackerleytng@google.com, Chao P Peng, kvm@vger.kernel.org,
	Vishal Annapurve, jroedel@suse.de, Jun Miao, pgonda@google.com,
	x86@kernel.org

On 6/11/2025 10:42 PM, Sean Christopherson wrote:
> On Tue, May 20, 2025, Kai Huang wrote:
>> On Tue, 2025-05-20 at 17:34 +0800, Zhao, Yan Y wrote:
>>> On Tue, May 20, 2025 at 12:53:33AM +0800, Edgecombe, Rick P wrote:
>>>> On Mon, 2025-05-19 at 16:32 +0800, Yan Zhao wrote:
>>>>>> On the opposite, if other non-Linux TDs don't follow 1G->2M->4K
>>>>>> accept order, e.g., they always accept 4K, there could be *endless
>>>>>> EPT violation* if I understand your words correctly.
>>>>>>
>>>>>> Isn't this yet-another reason we should choose to return PG_LEVEL_4K
>>>>>> instead of 2M if no accept level is provided in the fault?
>>>>> As I said, returning PG_LEVEL_4K would disallow huge pages for non-Linux TDs.
>>>>> TD's accept operations at size > 4KB will get TDACCEPT_SIZE_MISMATCH.
>>>>
>>>> TDX_PAGE_SIZE_MISMATCH is a valid error code that the guest should handle. The
>>>> docs say the VMM needs to demote *if* the mapping is large and the accept size
>>>> is small.
> 
> No thanks, fix the spec and the TDX Module.  Punting an error to the VMM is
> inconsistent, convoluted, and inefficient.
> 
> Per "Table 8.2: TDG.MEM.PAGE.ACCEPT SEPT Walk Cases":
> 
>    S-EPT state         ACCEPT vs. Mapping Size         Behavior
>    Leaf SEPT_PRESENT   Smaller                         TDACCEPT_SIZE_MISMATCH
>    Leaf !SEPT_PRESENT  Smaller                         EPT Violation <=========================|
>    Leaf DONT_CARE      Same                            Success                                 | => THESE TWO SHOULD MATCH!!!
>    !Leaf SEPT_FREE     Larger                          EPT Violation, BECAUSE THERE'S NO PAGE  |
>    !Leaf SEPT_FREE     Larger                          TDACCEPT_SIZE_MISMATCH <================|
> 
> 
> If ACCEPT is "too small", an EPT violation occurs.  But if ACCEPT is "too big",
> a TDACCEPT_SIZE_MISMATCH error occurs.  That's asinine.
> 
> The only reason that comes to mind for punting the "too small" case to the VMM
> is to try and keep the guest alive if the VMM is mapping more memory than has
> been enumerated to the guest.  E.g. if the guest suspects the VMM is malicious
> or buggy.  IMO, that's a terrible reason to push this much complexity into the
> host.  It also risks godawful boot times, e.g. if the guest kernel is buggy and
> accepts everything at 4KiB granularity.
> 
> The TDX Module should return TDACCEPT_SIZE_MISMATCH and force the guest to take
> action, not force the hypervisor to limp along in a degraded state.  If the guest
> doesn't want to ACCEPT at a larger granularity, e.g. because it doesn't think the
> entire 2MiB/1GiB region is available, then the guest can either log a warning and
> "poison" the page(s), or terminate and refuse to boot.
> 
> If for some reason the guest _can't_ ACCEPT at larger granularity, i.e. if the
> guest _knows_ that 2MiB or 1GiB is available/usable but refuses to ACCEPT at the
> appropriate granularity, then IMO that's firmly a guest bug.

It might just be that the guest doesn't want to accept at a larger level, 
rather than that it can't. See the use case below.

> If there's a *legitimate* use case where the guest wants to ACCEPT a subset of
> memory, then there should be an explicit TDCALL to request that the unwanted
> regions of memory be unmapped.  Smushing everything into implicit behavior has
> obvioulsy created a giant mess.

Isn't an ACCEPT with a specific level already explicit? Note that ACCEPT is 
not only for the case where the VMM has already mapped the page and the 
guest only needs to accept it to make it available; it also works for the 
case where the guest requests the VMM to map the page for a GPA (at a 
specific level) and then accepts it.

Even for the former case, it is understandable to behave differently for 
the "too small" and "too big" cases. If the requested accept level is 
"too small", the VMM can handle it by demoting the page to satisfy the 
guest. But when the level is "too big", the VMM usually cannot map the 
page at a higher level, so an EPT violation cannot help. I admit this 
leads to the requirement that the VMM should always try to map the page 
at the highest available level when the EPT violation is not caused by 
an ACCEPT that carries a desired mapping level.

As for the scenario, the one I can think of is a guest that constantly 
converts a 4K page between private and shared, for testing purposes. The 
guest knows that accepting the GPA at a higher level takes more time, and 
that converting it to shared then triggers a DEMOTE, costing even more 
time. So for better performance, the guest just calls ACCEPT with a 4KB 
page. However, the VMM returns PAGE_SIZE_MISMATCH and forces the guest to 
accept a bigger size. What a stupid VMM.

Anyway, I'm just expressing how I understand the current design, and I 
think it's reasonable. I don't object to the idea of returning 
ACCEPT_SIZE_MISMATCH for the "too small" case, but it needs to be a guest 
opt-in, i.e., let the guest itself choose the behavior.

^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 09/21] KVM: TDX: Enable 2MB mapping size after TD is RUNNABLE
  2025-06-13  2:41                     ` Xiaoyao Li
@ 2025-06-13  3:29                       ` Yan Zhao
  2025-06-13  5:35                         ` Yan Zhao
  0 siblings, 1 reply; 294+ messages in thread
From: Yan Zhao @ 2025-06-13  3:29 UTC (permalink / raw)
  To: Xiaoyao Li
  Cc: Sean Christopherson, Kai Huang, Rick P Edgecombe, Kirill Shutemov,
	Fan Du, Dave Hansen, david@redhat.com, Zhiquan Li,
	thomas.lendacky@amd.com, tabba@google.com,
	quic_eberman@quicinc.com, linux-kernel@vger.kernel.org, Ira Weiny,
	vbabka@suse.cz, pbonzini@redhat.com, Isaku Yamahata,
	michael.roth@amd.com, binbin.wu@linux.intel.com,
	ackerleytng@google.com, Chao P Peng, kvm@vger.kernel.org,
	Vishal Annapurve, jroedel@suse.de, Jun Miao, pgonda@google.com,
	x86@kernel.org

On Fri, Jun 13, 2025 at 10:41:21AM +0800, Xiaoyao Li wrote:
> On 6/11/2025 10:42 PM, Sean Christopherson wrote:
> > On Tue, May 20, 2025, Kai Huang wrote:
> > > On Tue, 2025-05-20 at 17:34 +0800, Zhao, Yan Y wrote:
> > > > On Tue, May 20, 2025 at 12:53:33AM +0800, Edgecombe, Rick P wrote:
> > > > > On Mon, 2025-05-19 at 16:32 +0800, Yan Zhao wrote:
> > > > > > > On the opposite, if other non-Linux TDs don't follow 1G->2M->4K
> > > > > > > accept order, e.g., they always accept 4K, there could be *endless
> > > > > > > EPT violation* if I understand your words correctly.
> > > > > > > 
> > > > > > > Isn't this yet-another reason we should choose to return PG_LEVEL_4K
> > > > > > > instead of 2M if no accept level is provided in the fault?
> > > > > > As I said, returning PG_LEVEL_4K would disallow huge pages for non-Linux TDs.
> > > > > > TD's accept operations at size > 4KB will get TDACCEPT_SIZE_MISMATCH.
> > > > > 
> > > > > TDX_PAGE_SIZE_MISMATCH is a valid error code that the guest should handle. The
> > > > > docs say the VMM needs to demote *if* the mapping is large and the accept size
> > > > > is small.
> > 
> > No thanks, fix the spec and the TDX Module.  Punting an error to the VMM is
> > inconsistent, convoluted, and inefficient.
> > 
> > Per "Table 8.2: TDG.MEM.PAGE.ACCEPT SEPT Walk Cases":
> > 
> >    S-EPT state         ACCEPT vs. Mapping Size         Behavior
> >    Leaf SEPT_PRESENT   Smaller                         TDACCEPT_SIZE_MISMATCH
> >    Leaf !SEPT_PRESENT  Smaller                         EPT Violation <=========================|
> >    Leaf DONT_CARE      Same                            Success                                 | => THESE TWO SHOULD MATCH!!!
> >    !Leaf SEPT_FREE     Larger                          EPT Violation, BECAUSE THERE'S NO PAGE  |
> >    !Leaf SEPT_FREE     Larger                          TDACCEPT_SIZE_MISMATCH <================|
> > 
> > 
> > If ACCEPT is "too small", an EPT violation occurs.  But if ACCEPT is "too big",
> > a TDACCEPT_SIZE_MISMATCH error occurs.  That's asinine.
> > 
> > The only reason that comes to mind for punting the "too small" case to the VMM
> > is to try and keep the guest alive if the VMM is mapping more memory than has
> > been enumerated to the guest.  E.g. if the guest suspects the VMM is malicious
> > or buggy.  IMO, that's a terrible reason to push this much complexity into the
> > host.  It also risks godawful boot times, e.g. if the guest kernel is buggy and
> > accepts everything at 4KiB granularity.
> > 
> > The TDX Module should return TDACCEPT_SIZE_MISMATCH and force the guest to take
> > action, not force the hypervisor to limp along in a degraded state.  If the guest
> > doesn't want to ACCEPT at a larger granularity, e.g. because it doesn't think the
> > entire 2MiB/1GiB region is available, then the guest can either log a warning and
> > "poison" the page(s), or terminate and refuse to boot.
> > 
> > If for some reason the guest _can't_ ACCEPT at larger granularity, i.e. if the
> > guest _knows_ that 2MiB or 1GiB is available/usable but refuses to ACCEPT at the
> > appropriate granularity, then IMO that's firmly a guest bug.
> 
> It might just be guest doesn't want to accept a larger level instead of
> can't. Use case see below.
> 
> > If there's a *legitimate* use case where the guest wants to ACCEPT a subset of
> > memory, then there should be an explicit TDCALL to request that the unwanted
> > regions of memory be unmapped.  Smushing everything into implicit behavior has
> > obvioulsy created a giant mess.
> 
> Isn't the ACCEPT with a specific level explicit? Note that ACCEPT is not
> only for the case that VMM has already mapped page and guest only needs to
> accept it to make it available, it also works for the case that guest
> requests VMM to map the page for a gpa (at specific level) then guest
> accepts it.
> 
> Even for the former case, it is understandable for behaving differently for
> the "too small" and "too big" case. If the requested accept level is "too
> small", VMM can handle it by demoting the page to satisfy guest. But when
> the level is "too big", usually the VMM cannot map the page at a higher
> level so that ept violation cannot help. I admit that it leads to the
> requirement that VMM should always try to map the page at the highest
> available level, if the EPT violation is not caused by ACCEPT which contains
> a desired mapping level.
> 
> As for the scenario, the one I can think of is, guest is trying to convert a
> 4K sized page between private and shared constantly, for testing purpose.
> Guest knows that if accepting the gpa at higher level, it takes more time.
> And when convert it to shared, it triggers DEMOTE and more time. So for
> better performance, guest just calls ACCEPT with 4KB page. However, VMM
Hmm, the first ACCEPT at 4KB level already triggers a DEMOTE.
So, I don't see how ACCEPT at 4KB helps performance.

Support VMM has mapped a page at 4MB,

         Scenario 1                           Effort
  (1) Guest ACCEPT at 2MB                   ACCEPT 2MB         
  (2) converts a 4KB page to shared         DEMOTE
  (3) convert it back to private            ACCEPT 4KB


         Scenario 2                           Effort
  (1) Guest ACCEPT at 4MB                   DEMOTE, ACCEPT 4MB         
  (2) converts a 4KB page to shared
  (3) convert it back to private            ACCEPT 4KB


In step (3) of "Scenario 1", the VMM will not map the page at 2MB in the
current implementation, because PROMOTION requires a uniform ACCEPT status
across all 512 4KB pages to succeed.

> returns PAGE_SIZE_MATCH and enforces guest to accept a bigger size. what a
> stupid VMM.
I agree with Sean that if the guest doesn't want to accept at a bigger size
for certain reasons (e.g. it thinks it's unsafe or considers it an attack),
invoking an explicit TDVMCALL may be a better approach.

> Anyway, I'm just expressing how I understand the current design and I think
> it's reasonable. And I don't object the idea to return ACCEPT_SIZE_MISMATCH
> for "too small" case, but it's needs to be guest opt-in, i.e., let guest
> itself chooses the behavior.

^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 09/21] KVM: TDX: Enable 2MB mapping size after TD is RUNNABLE
  2025-06-13  3:29                       ` Yan Zhao
@ 2025-06-13  5:35                         ` Yan Zhao
  2025-06-13  6:08                           ` Xiaoyao Li
  0 siblings, 1 reply; 294+ messages in thread
From: Yan Zhao @ 2025-06-13  5:35 UTC (permalink / raw)
  To: Xiaoyao Li, Sean Christopherson, Kai Huang, Rick P Edgecombe,
	Kirill Shutemov, Fan Du, Dave Hansen, david@redhat.com,
	Zhiquan Li, thomas.lendacky@amd.com, tabba@google.com,
	quic_eberman@quicinc.com, linux-kernel@vger.kernel.org, Ira Weiny,
	vbabka@suse.cz, pbonzini@redhat.com, Isaku Yamahata,
	michael.roth@amd.com, binbin.wu@linux.intel.com,
	ackerleytng@google.com, Chao P Peng, kvm@vger.kernel.org,
	Vishal Annapurve, jroedel@suse.de, Jun Miao, pgonda@google.com,
	x86@kernel.org

On Fri, Jun 13, 2025 at 11:29:39AM +0800, Yan Zhao wrote:
> On Fri, Jun 13, 2025 at 10:41:21AM +0800, Xiaoyao Li wrote:
> > On 6/11/2025 10:42 PM, Sean Christopherson wrote:
> > > On Tue, May 20, 2025, Kai Huang wrote:
> > > > On Tue, 2025-05-20 at 17:34 +0800, Zhao, Yan Y wrote:
> > > > > On Tue, May 20, 2025 at 12:53:33AM +0800, Edgecombe, Rick P wrote:
> > > > > > On Mon, 2025-05-19 at 16:32 +0800, Yan Zhao wrote:
> > > > > > > > On the opposite, if other non-Linux TDs don't follow 1G->2M->4K
> > > > > > > > accept order, e.g., they always accept 4K, there could be *endless
> > > > > > > > EPT violation* if I understand your words correctly.
> > > > > > > > 
> > > > > > > > Isn't this yet-another reason we should choose to return PG_LEVEL_4K
> > > > > > > > instead of 2M if no accept level is provided in the fault?
> > > > > > > As I said, returning PG_LEVEL_4K would disallow huge pages for non-Linux TDs.
> > > > > > > TD's accept operations at size > 4KB will get TDACCEPT_SIZE_MISMATCH.
> > > > > > 
> > > > > > TDX_PAGE_SIZE_MISMATCH is a valid error code that the guest should handle. The
> > > > > > docs say the VMM needs to demote *if* the mapping is large and the accept size
> > > > > > is small.
> > > 
> > > No thanks, fix the spec and the TDX Module.  Punting an error to the VMM is
> > > inconsistent, convoluted, and inefficient.
> > > 
> > > Per "Table 8.2: TDG.MEM.PAGE.ACCEPT SEPT Walk Cases":
> > > 
> > >    S-EPT state         ACCEPT vs. Mapping Size         Behavior
> > >    Leaf SEPT_PRESENT   Smaller                         TDACCEPT_SIZE_MISMATCH
> > >    Leaf !SEPT_PRESENT  Smaller                         EPT Violation <=========================|
> > >    Leaf DONT_CARE      Same                            Success                                 | => THESE TWO SHOULD MATCH!!!
> > >    !Leaf SEPT_FREE     Larger                          EPT Violation, BECAUSE THERE'S NO PAGE  |
> > >    !Leaf SEPT_FREE     Larger                          TDACCEPT_SIZE_MISMATCH <================|
> > > 
> > > 
> > > If ACCEPT is "too small", an EPT violation occurs.  But if ACCEPT is "too big",
> > > a TDACCEPT_SIZE_MISMATCH error occurs.  That's asinine.
> > > 
> > > The only reason that comes to mind for punting the "too small" case to the VMM
> > > is to try and keep the guest alive if the VMM is mapping more memory than has
> > > been enumerated to the guest.  E.g. if the guest suspects the VMM is malicious
> > > or buggy.  IMO, that's a terrible reason to push this much complexity into the
> > > host.  It also risks godawful boot times, e.g. if the guest kernel is buggy and
> > > accepts everything at 4KiB granularity.
> > > 
> > > The TDX Module should return TDACCEPT_SIZE_MISMATCH and force the guest to take
> > > action, not force the hypervisor to limp along in a degraded state.  If the guest
> > > doesn't want to ACCEPT at a larger granularity, e.g. because it doesn't think the
> > > entire 2MiB/1GiB region is available, then the guest can either log a warning and
> > > "poison" the page(s), or terminate and refuse to boot.
> > > 
> > > If for some reason the guest _can't_ ACCEPT at larger granularity, i.e. if the
> > > guest _knows_ that 2MiB or 1GiB is available/usable but refuses to ACCEPT at the
> > > appropriate granularity, then IMO that's firmly a guest bug.
> > 
> > It might just be guest doesn't want to accept a larger level instead of
> > can't. Use case see below.
> > 
> > > If there's a *legitimate* use case where the guest wants to ACCEPT a subset of
> > > memory, then there should be an explicit TDCALL to request that the unwanted
> > > regions of memory be unmapped.  Smushing everything into implicit behavior has
> > > obvioulsy created a giant mess.
> > 
> > Isn't the ACCEPT with a specific level explicit? Note that ACCEPT is not
> > only for the case that VMM has already mapped page and guest only needs to
> > accept it to make it available, it also works for the case that guest
> > requests VMM to map the page for a gpa (at specific level) then guest
> > accepts it.
To avoid confusion, here's the full new design:

1. When an EPT violation carries ACCEPT level info
   (this occurs when the TD performs ACCEPT before it accesses memory),
   KVM maps the page at a map level <= the specified level.
   The guest's ACCEPT will succeed, or return PAGE_SIZE_MISMATCH if the map
   level < the specified level.

2. When an EPT violation does not carry ACCEPT level info
   (this occurs when the TD accesses memory before invoking ACCEPT),

   1) if the TD is configured to always accept the VMM's map level,
      KVM allows mapping at 2MB.
      The TD's later 4KB ACCEPT will return PAGE_SIZE_MISMATCH.
      The TD can either retry with a 2MB ACCEPT or explicitly invoke a
      TDVMCALL for demotion.
   2) if the TD is not configured to always accept the VMM's map level,
      KVM always maps at 4KB.
      The TD's 2MB ACCEPT will return PAGE_SIZE_MISMATCH.

Please let me know if anything does not look right.
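
For illustration only, a minimal sketch of the level selection above. This is
not actual KVM/TDX code; the parameters are stand-ins for however KVM would
learn the exit-qualification ACCEPT level and the TD's opt-in state:

/*
 * Illustrative sketch only, not real KVM/TDX code.
 * accept_level is PG_LEVEL_NONE when the EPT violation carries no ACCEPT
 * level; td_accepts_vmm_level reflects the proposed TD opt-in.
 */
static int tdx_private_fault_max_level(int accept_level,
				       bool td_accepts_vmm_level)
{
	/* Case 1: honor the guest-specified ACCEPT level as an upper bound. */
	if (accept_level != PG_LEVEL_NONE)
		return min(accept_level, PG_LEVEL_2M);

	/*
	 * Case 2: no ACCEPT level (memory touched before ACCEPT).  Map huge
	 * only if the TD opted in; otherwise stay at 4KB so that a later
	 * 4KB ACCEPT cannot hit PAGE_SIZE_MISMATCH.
	 */
	return td_accepts_vmm_level ? PG_LEVEL_2M : PG_LEVEL_4K;
}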

> > Even for the former case, it is understandable for behaving differently for
> > the "too small" and "too big" case. If the requested accept level is "too
> > small", VMM can handle it by demoting the page to satisfy guest. But when
> > the level is "too big", usually the VMM cannot map the page at a higher
> > level so that ept violation cannot help. I admit that it leads to the
> > requirement that VMM should always try to map the page at the highest
> > available level, if the EPT violation is not caused by ACCEPT which contains
> > a desired mapping level.
> > 
> > As for the scenario, the one I can think of is, guest is trying to convert a
> > 4K sized page between private and shared constantly, for testing purpose.
> > Guest knows that if accepting the gpa at higher level, it takes more time.
> > And when convert it to shared, it triggers DEMOTE and more time. So for
> > better performance, guest just calls ACCEPT with 4KB page. However, VMM
> Hmm, ACCEPT at 4KB level at the first time triggers DEMOTE already.
> So, I don't see how ACCEPT at 4KB helps performance.
Hmm, I sent that too fast previously. Some corrections below:

> Support VMM has mapped a page at 4MB,
Suppose VMM has mapped a page at 2MB when an EPT violation (triggered by TD
memory access instead of by TD ACCEPT) does not carry ACCEPT level info,

> 
>          Scenario 1                           Effort
>   (1) Guest ACCEPT at 2MB                   ACCEPT 2MB         
>   (2) converts a 4KB page to shared         DEMOTE
>   (3) convert it back to private            ACCEPT 4KB
> 
> 
>          Scenario 2                           Effort
>   (1) Guest ACCEPT at 4MB                   DEMOTE, ACCEPT 4MB         
    (1) Guest ACCEPT at 4KB                   DEMOTE, ACCEPT 4KB
>   (2) converts a 4KB page to shared
>   (3) convert it back to private            ACCEPT 4KB
> 
> 
> In step (3) of "Scenario 1", VMM will not map the page at 2MB according to the
> current implementation because PROMOTION requires uniform ACCEPT status across
> all 512 4KB pages to be succeed.
> 
> > returns PAGE_SIZE_MATCH and enforces guest to accept a bigger size. what a
> > stupid VMM.
> I agree with Sean that if guest doesn't want to accept at a bigger size for
> certain reasons (e.g. it thinks it's unsafe or consider it as an attack),
> invoking an explicit TDVMCALL may be a better approach.
> 
> > Anyway, I'm just expressing how I understand the current design and I think
> > it's reasonable. And I don't object the idea to return ACCEPT_SIZE_MISMATCH
> > for "too small" case, but it's needs to be guest opt-in, i.e., let guest
> > itself chooses the behavior.

^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 09/21] KVM: TDX: Enable 2MB mapping size after TD is RUNNABLE
  2025-06-13  5:35                         ` Yan Zhao
@ 2025-06-13  6:08                           ` Xiaoyao Li
  0 siblings, 0 replies; 294+ messages in thread
From: Xiaoyao Li @ 2025-06-13  6:08 UTC (permalink / raw)
  To: Yan Zhao, Sean Christopherson, Kai Huang, Rick P Edgecombe,
	Kirill Shutemov, Fan Du, Dave Hansen, david@redhat.com,
	Zhiquan Li, thomas.lendacky@amd.com, tabba@google.com,
	quic_eberman@quicinc.com, linux-kernel@vger.kernel.org, Ira Weiny,
	vbabka@suse.cz, pbonzini@redhat.com, Isaku Yamahata,
	michael.roth@amd.com, binbin.wu@linux.intel.com,
	ackerleytng@google.com, Chao P Peng, kvm@vger.kernel.org,
	Vishal Annapurve, jroedel@suse.de, Jun Miao, pgonda@google.com,
	x86@kernel.org

On 6/13/2025 1:35 PM, Yan Zhao wrote:
> To avoid confusion, here's the full new design:
> 
> 1.when an EPT violation carries an ACCEPT level info
>    (This occurs when TD performs ACCEPT before it accesses memory),
>    KVM maps the page at map level <= the specified level.
>    Guest's ACCEPT will succeed or return PAGE_SIZE_MATCH if map level < the
>    specified level.
> 
> 2.when an EPT violation does not carry ACCEPT level info
>    (This occurs when TD accesses memory before invoking ACCEPT),
> 
>    1) if the TD is configured to always accept VMM's map level,
>       KVM allows to map at 2MB.
>       TD's later 4KB ACCEPT will return PAGE_SIZE_MATCH.
>       TD can either retry with 2MB ACCEPT or explictly invoke a TDVMCALL for
>       demotion.
>    2) if the TD is not configured to always accept VMM's map level,
>       KVM always maps at 4KB.

Is this the decision derived from the discussion of this series, to keep 
the design simple and avoid demotion on ACCEPT?

It looks like KVM's own design preference: if the TD doesn't opt in to 
the proposed new feature "always accept VMM's map level", the only way 
it can get a page mapped as a hugepage in the EPT is to always accept 
the page before first access, and to try accepting starting from the 
biggest page size.

I'm OK with it.

>       TD's 2MB ACCEPT will return PAGE_SIZE_MATCH.
> 
> Please let me know if anything does not look right.


^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 09/21] KVM: TDX: Enable 2MB mapping size after TD is RUNNABLE
  2025-06-13  1:32                               ` Yan Zhao
@ 2025-06-13 21:53                                 ` Edgecombe, Rick P
  2025-06-13 22:19                                   ` Sean Christopherson
  0 siblings, 1 reply; 294+ messages in thread
From: Edgecombe, Rick P @ 2025-06-13 21:53 UTC (permalink / raw)
  To: Zhao, Yan Y
  Cc: Du, Fan, Li, Xiaoyao, Huang, Kai, quic_eberman@quicinc.com,
	Hansen, Dave, david@redhat.com, thomas.lendacky@amd.com,
	vbabka@suse.cz, Li, Zhiquan1, Shutemov, Kirill,
	michael.roth@amd.com, linux-kernel@vger.kernel.org,
	seanjc@google.com, Peng, Chao P, pbonzini@redhat.com, Weiny, Ira,
	Yamahata, Isaku, binbin.wu@linux.intel.com,
	ackerleytng@google.com, kvm@vger.kernel.org, Annapurve, Vishal,
	tabba@google.com, jroedel@suse.de, Miao, Jun, pgonda@google.com,
	x86@kernel.org

On Fri, 2025-06-13 at 09:32 +0800, Yan Zhao wrote:
> > > Eww, no.  Having to react on _every_ EPT violation would be annoying, and
> > > trying
> > > to debug issues where the guest is mixing options would probably be a
> > > nightmare.
> > > 
> > > I was thinking of something along the lines of an init-time or boot-time
> > > opt-
> > > in.
> > 
> > Fair.
> 
> Agreed.

Arg, I just realized a one-way opt-in will have a theoretical gap: if the guest
kexec's, the new kernel will need to match the opt-in.

A full solution could allow a later opt-out, handled by the VMM by shattering
all the page tables. But it starts to get too complex, I think, especially
since Linux guests already try to accept in the order 1GB->2MB->4K. So in
practice we are already worrying about correctness rather than functional
issues. Maybe we should just ignore it.

Otherwise, we currently have the following requirements, I think:
1. A one-way guest opt-in to the new TDG.MEM.PAGE.ACCEPT behavior.
2. Some notification to KVM that the guest has opted in.
3. After opt-in, TDG.MEM.PAGE.ACCEPT returns TDX_PAGE_SIZE_MISMATCH if the
mapping is too small or too big.

Thinking about how we would like the notification... Maybe we could have the
actual behavior controlled by the host, and have some GHCI-like communication
such as a TDVMCALL. The TDVMCALL (or similar) could be handled within KVM,
basically just calling the host-side opt-in.

The reason to have it host controllable is that, as above, the new behavior
should be fine for a normal Linux guest. A host-user controlled opt-in could be
useful for anyone who wants to run huge pages with old guest kernels. A KVM
module param, maybe.

If this sounds good, I'll get the TDX module side's input and come back with a
more specific spec.
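
Not a proposal for the actual interface, just a rough sketch under the
assumption that the host-side knob ends up as a KVM module parameter and the
guest opt-in arrives via some TDVMCALL handled in KVM. The parameter name, the
helper, and the strict_accept field below are made up for illustration:

/* Illustration only: names and the TDVMCALL plumbing are hypothetical. */
static bool strict_accept_optin = true;
module_param(strict_accept_optin, bool, 0444);
MODULE_PARM_DESC(strict_accept_optin,
		 "Allow TDs to opt in to strict TDG.MEM.PAGE.ACCEPT size checks");

static int tdx_handle_strict_accept_optin(struct kvm_vcpu *vcpu)
{
	/* Host policy: old guest kernels can be kept on the legacy behavior. */
	if (!strict_accept_optin)
		return -EOPNOTSUPP;

	/* Hypothetical per-VM flag consumed by the fault path / TDX module. */
	to_kvm_tdx(vcpu->kvm)->strict_accept = true;
	return 0;
}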

^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 09/21] KVM: TDX: Enable 2MB mapping size after TD is RUNNABLE
  2025-06-13 21:53                                 ` Edgecombe, Rick P
@ 2025-06-13 22:19                                   ` Sean Christopherson
  2025-06-13 23:33                                     ` Edgecombe, Rick P
  0 siblings, 1 reply; 294+ messages in thread
From: Sean Christopherson @ 2025-06-13 22:19 UTC (permalink / raw)
  To: Rick P Edgecombe
  Cc: Yan Y Zhao, Fan Du, Xiaoyao Li, Kai Huang,
	quic_eberman@quicinc.com, Dave Hansen, david@redhat.com,
	thomas.lendacky@amd.com, vbabka@suse.cz, Zhiquan1 Li,
	Kirill Shutemov, michael.roth@amd.com,
	linux-kernel@vger.kernel.org, Chao P Peng, pbonzini@redhat.com,
	Ira Weiny, Isaku Yamahata, binbin.wu@linux.intel.com,
	ackerleytng@google.com, kvm@vger.kernel.org, Vishal Annapurve,
	tabba@google.com, jroedel@suse.de, Jun Miao, pgonda@google.com,
	x86@kernel.org

On Fri, Jun 13, 2025, Rick P Edgecombe wrote:
> On Fri, 2025-06-13 at 09:32 +0800, Yan Zhao wrote:
> > > > Eww, no.  Having to react on _every_ EPT violation would be annoying,
> > > > and trying to debug issues where the guest is mixing options would
> > > > probably be a nightmare.
> > > > 
> > > > I was thinking of something along the lines of an init-time or
> > > > boot-time opt- in.
> > > 
> > > Fair.
> > 
> > Agreed.
> 
> Arg, I just realized a one-way opt-in will have a theoretical gap. If the guest
> kexec's, the new kernel will need to match the opt-in.

All the more reason to make this a property of the VM that is passed via
"struct td_params".  I.e. put the onus on the owner of the VM to ensure their
kernel(s) have been updated accordingly.

I understand that this could be painful, but honestly _all_ of TDX and SNP is
painful for the guest.  E.g. I don't think it's any worse than the security
issues with TDX (and SNP) guests using kvmclock (which I'd love some reviews on,
btw).

https://lore.kernel.org/all/20250227021855.3257188-35-seanjc@google.com

^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 09/21] KVM: TDX: Enable 2MB mapping size after TD is RUNNABLE
  2025-06-13 22:19                                   ` Sean Christopherson
@ 2025-06-13 23:33                                     ` Edgecombe, Rick P
  2025-06-16  3:14                                       ` Yan Zhao
  0 siblings, 1 reply; 294+ messages in thread
From: Edgecombe, Rick P @ 2025-06-13 23:33 UTC (permalink / raw)
  To: seanjc@google.com
  Cc: Du, Fan, Li, Xiaoyao, Huang, Kai, quic_eberman@quicinc.com,
	Hansen, Dave, david@redhat.com, thomas.lendacky@amd.com,
	Zhao, Yan Y, Li, Zhiquan1, Shutemov, Kirill, michael.roth@amd.com,
	linux-kernel@vger.kernel.org, Weiny, Ira, pbonzini@redhat.com,
	Peng, Chao P, Yamahata, Isaku, ackerleytng@google.com,
	vbabka@suse.cz, kvm@vger.kernel.org, binbin.wu@linux.intel.com,
	Annapurve, Vishal, jroedel@suse.de, Miao, Jun, tabba@google.com,
	pgonda@google.com, x86@kernel.org

On Fri, 2025-06-13 at 15:19 -0700, Sean Christopherson wrote:
> > Arg, I just realized a one-way opt-in will have a theoretical gap. If the
> > guest
> > kexec's, the new kernel will need to match the opt-in.
> 
> All the more reason to make this a property of the VM that is passed via
> "struct td_params".  I.e. put the onus on the owner of the VM to ensure their
> kernel(s) have been updated accordingly.

Hmm, it gives me pause. At minimum it should have an enumeration to the guest.

> 
> I understand that this could be painful, but honestly _all_ of TDX and SNP is
> painful for the guest.  E.g. I don't think it's any worse than the security
> issues with TDX (and SNP) guests using kvmclock (which I'd love some reviews
> on,
> btw).
> 
> https://lore.kernel.org/all/20250227021855.3257188-35-seanjc@google.com

Oh, nice. I hadn't seen this. Agreed that a comprehensive guest setup is quite
manual. But here we are playing with the guest ABI. In practice, yes, it's
similar to passing yet another arg to get a good TD.

We can start with a prototype of the host-side arg and see how it turns out. I
realized we need to verify edk2 as well.

Thanks Sean.

^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 09/21] KVM: TDX: Enable 2MB mapping size after TD is RUNNABLE
  2025-06-13 23:33                                     ` Edgecombe, Rick P
@ 2025-06-16  3:14                                       ` Yan Zhao
  2025-06-16 22:49                                         ` Edgecombe, Rick P
  0 siblings, 1 reply; 294+ messages in thread
From: Yan Zhao @ 2025-06-16  3:14 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: seanjc@google.com, Du, Fan, Li, Xiaoyao, Huang, Kai,
	quic_eberman@quicinc.com, Hansen, Dave, david@redhat.com,
	thomas.lendacky@amd.com, Li, Zhiquan1, Shutemov, Kirill,
	michael.roth@amd.com, linux-kernel@vger.kernel.org, Weiny, Ira,
	pbonzini@redhat.com, Peng, Chao P, Yamahata, Isaku,
	ackerleytng@google.com, vbabka@suse.cz, kvm@vger.kernel.org,
	binbin.wu@linux.intel.com, Annapurve, Vishal, jroedel@suse.de,
	Miao, Jun, tabba@google.com, pgonda@google.com, x86@kernel.org

On Sat, Jun 14, 2025 at 07:33:48AM +0800, Edgecombe, Rick P wrote:
> On Fri, 2025-06-13 at 15:19 -0700, Sean Christopherson wrote:
> > > Arg, I just realized a one-way opt-in will have a theoretical gap. If the
> > > guest
> > > kexec's, the new kernel will need to match the opt-in.
> > 
> > All the more reason to make this a property of the VM that is passed via
> > "struct td_params".  I.e. put the onus on the owner of the VM to ensure their
> > kernel(s) have been updated accordingly.
> 
> Hmm, it gives me pause. At minimum it should have an enumeration to the guest.
> 
> > 
> > I understand that this could be painful, but honestly _all_ of TDX and SNP is
> > painful for the guest.  E.g. I don't think it's any worse than the security
> > issues with TDX (and SNP) guests using kvmclock (which I'd love some reviews
> > on,
> > btw).
> > 
> > https://lore.kernel.org/all/20250227021855.3257188-35-seanjc@google.com
> 
> Oh, nice. I hadn't seen this. Agree that a comprehensive guest setup is quite
> manual. But here we are playing with guest ABI. In practice, yes it's similar to
> passing yet another arg to get a good TD.
Could we introduce a TD attribute TDX_ATTR_SEPT_EXPLICIT_DEMOTION?

It could be something similar to TDX_ATTR_SEPT_VE_DISABLE, except that we
wouldn't provide a dynamic mechanism like TDCS_CONFIG_FLEXIBLE_PENDING_VE that
allows the guest to turn SEPT_VE_DISABLE on/off.
(See disable_sept_ve() in arch/x86/coco/tdx/tdx.c).

So, if userspace configures a TD with TDX_ATTR_SEPT_EXPLICIT_DEMOTION, KVM first
checks whether SEPT_EXPLICIT_DEMOTION is supported.
The guest can also check for SEPT_EXPLICIT_DEMOTION to determine whether to
continue or shut down. (If it does not check SEPT_EXPLICIT_DEMOTION, e.g., if we
don't want to update EDK2, the guest must accept memory before accessing it.)

- If the TD is configured with SEPT_EXPLICIT_DEMOTION, KVM allows mapping at 2MB
  when there's no level info in an EPT violation. The guest must accept memory
  before accessing it, or, if it wants to accept only part of the host's
  mapping, it needs to explicitly invoke a TDVMCALL to request that KVM perform
  page demotion.

- If the TD is configured without SEPT_EXPLICIT_DEMOTION, KVM always maps at 4KB
  when there's no level info in an EPT violation.

- Whether or not SEPT_EXPLICIT_DEMOTION is configured, if there is level info in
  an EPT violation, KVM honors it as the max_level but ignores any demotion
  request in the fault path.
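
A rough sketch of how such an attribute could be consumed, assuming a
placeholder bit value, a made-up tdx_supported_td_attrs enumeration, and that
the TD attributes are cached in struct kvm_tdx (field name illustrative); none
of the bit positions below come from the TDX spec:

/* Placeholder bit; the real encoding would come from the TDX module spec. */
#define TDX_ATTR_SEPT_EXPLICIT_DEMOTION		BIT_ULL(42)

static bool tdx_sept_explicit_demotion(struct kvm *kvm)
{
	/* Assumes the TD attributes are cached in kvm_tdx (illustrative). */
	return to_kvm_tdx(kvm)->attributes & TDX_ATTR_SEPT_EXPLICIT_DEMOTION;
}

/*
 * At TD init: reject the attribute if the TDX module doesn't support it
 * (tdx_supported_td_attrs is a stand-in for the module's enumeration).
 */
static int tdx_check_td_attrs(u64 attrs)
{
	if ((attrs & TDX_ATTR_SEPT_EXPLICIT_DEMOTION) &&
	    !(tdx_supported_td_attrs & TDX_ATTR_SEPT_EXPLICIT_DEMOTION))
		return -EINVAL;
	return 0;
}

/* In the fault path, when the EPT violation carries no ACCEPT level info. */
static int tdx_no_accept_info_max_level(struct kvm *kvm)
{
	return tdx_sept_explicit_demotion(kvm) ? PG_LEVEL_2M : PG_LEVEL_4K;
}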

> We can start with a prototype the host side arg and see how it turns out. I
> realized we need to verify edk2 as well.
Current EDK2 should always accept pages before actual memory access.
So, I think it should be fine.

^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge pages
  2025-06-11 14:30                                       ` Vishal Annapurve
@ 2025-06-16  9:59                                         ` Yan Zhao
  2025-06-17  0:12                                           ` Edgecombe, Rick P
                                                             ` (2 more replies)
  0 siblings, 3 replies; 294+ messages in thread
From: Yan Zhao @ 2025-06-16  9:59 UTC (permalink / raw)
  To: Vishal Annapurve
  Cc: Ackerley Tng, pbonzini, seanjc, linux-kernel, kvm, x86,
	rick.p.edgecombe, dave.hansen, kirill.shutemov, tabba,
	quic_eberman, michael.roth, david, vbabka, jroedel,
	thomas.lendacky, pgonda, zhiquan1.li, fan.du, jun.miao, ira.weiny,
	isaku.yamahata, xiaoyao.li, binbin.wu, chao.p.peng

On Wed, Jun 11, 2025 at 07:30:10AM -0700, Vishal Annapurve wrote:
> On Wed, Jun 4, 2025 at 7:45 PM Yan Zhao <yan.y.zhao@intel.com> wrote:
> >
> > We need to restore to the previous status (which includes the host page table)
> > if conversion can't be done.
> > That said, in my view, a better flow would be:
> >
> > 1. guest_memfd sends a pre-invalidation request to users (users here means the
> >    consumers in kernel of memory allocated from guest_memfd).
> >
> > 2. Users (A, B, ..., X) perform pre-checks to determine if invalidation can
> >    proceed. For example, in the case of TDX, this might involve memory
> >    allocation and page splitting.
> >
> > 3. Based on the pre-check results, guest_memfd either aborts the invalidation or
> >    proceeds by sending the actual invalidation request.
> >
> > 4. Users (A-X) perform the actual unmap operation, ensuring it cannot fail. For
> >    TDX, the unmap must succeed unless there are bugs in the KVM or TDX module.
> >    In such cases, TDX can callback guest_memfd to inform the poison-status of
> >    the page or elevate the page reference count.
> 
> Few questions here:
> 1) It sounds like the failure to remove entries from SEPT could only
> be due to bugs in the KVM/TDX module,
Yes.

> how reliable would it be to
> continue executing TDX VMs on the host once such bugs are hit?
The TDX VMs will be killed. However, the private pages are still mapped in the
SEPT (after the unmapping failure).
The teardown flow for a TDX VM is:

do_exit
  |->exit_files
     |->kvm_gmem_release ==> (1) Unmap guest pages 
     |->release kvmfd
        |->kvm_destroy_vm  (2) Reclaiming resources
           |->kvm_arch_pre_destroy_vm  ==> Release hkid
           |->kvm_arch_destroy_vm  ==> Reclaim SEPT page table pages

Without holding a page reference after (1) fails, the guest pages may be
re-assigned by the host OS while they are still tracked by the TDX module.


> 2) Is it reliable to continue executing the host kernel and other
> normal VMs once such bugs are hit?
With TDX holding the page ref count, the impact of a guest-page unmapping
failure is just that those pages are leaked.

> 3) Can the memory be reclaimed reliably if the VM is marked as dead
> and cleaned up right away?
As in the above flow, TDX needs to hold the page reference on unmapping failure
until reclaiming succeeds. That said, reclaiming itself can fail as well.

So, below is my proposal, shown as simple POC code based on
https://github.com/googleprodkernel/linux-cc/commits/wip-tdx-gmem-conversions-hugetlb-2mept-v2.

Patch 1: TDX increases the page ref count on unmap failure.
Patch 2: Bail out of private-to-shared conversion if splitting fails.
Patch 3: Make kvm_gmem_zap() return void.

After the change,
- the actual private-to-shared conversion will not be executed on splitting
  failure (which could be due to running out of memory or bugs in the KVM/TDX
  module) or unmapping failure (which is due to bugs in the KVM/TDX module).
- other callers of kvm_gmem_zap(), such as kvm_gmem_release(),
  kvm_gmem_error_folio(), and kvm_gmem_punch_hole(), are still allowed to
  proceed. After truncating the pages out of the filemap, the pages could be
  leaked on purpose, with the reference held by TDX.


commit 50432c0bb1e10591714b6b880f43fc30797ca047
Author: Yan Zhao <yan.y.zhao@intel.com>
Date:   Tue Jun 10 00:02:30 2025 -0700

    KVM: TDX: Hold folio ref count on fatal error

    Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>

diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index c60c1fa7b4ee..93c31eecfc60 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1502,6 +1502,15 @@ void tdx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa, int pgd_level)
        td_vmcs_write64(to_tdx(vcpu), SHARED_EPT_POINTER, root_hpa);
 }

+/*
+ * Called when fatal error occurs during removing a TD's page.
+ * Increase the folio ref count in case it's reused by other VMs or host.
+ */
+static void tdx_hold_page_on_error(struct kvm *kvm, struct page *page, int level)
+{
+       folio_ref_add(page_folio(page), KVM_PAGES_PER_HPAGE(level));
+}
+
 static int tdx_mem_page_aug(struct kvm *kvm, gfn_t gfn,
                            enum pg_level level, struct page *page)
 {
@@ -1868,12 +1877,14 @@ int tdx_sept_remove_private_spte(struct kvm *kvm, gfn_t gfn,
         * before any might be populated. Warn if zapping is attempted when
         * there can't be anything populated in the private EPT.
         */
-       if (KVM_BUG_ON(!is_hkid_assigned(to_kvm_tdx(kvm)), kvm))
-               return -EINVAL;
+       if (KVM_BUG_ON(!is_hkid_assigned(to_kvm_tdx(kvm)), kvm)) {
+               ret = -EINVAL;
+               goto fatal_error;
+       }

        ret = tdx_sept_zap_private_spte(kvm, gfn, level, page);
        if (ret <= 0)
-               return ret;
+               goto fatal_error;

        /*
         * TDX requires TLB tracking before dropping private page.  Do
@@ -1881,7 +1892,14 @@ int tdx_sept_remove_private_spte(struct kvm *kvm, gfn_t gfn,
         */
        tdx_track(kvm);

-       return tdx_sept_drop_private_spte(kvm, gfn, level, page);
+       ret = tdx_sept_drop_private_spte(kvm, gfn, level, page);
+       if (ret)
+               goto fatal_error;
+       return ret;
+fatal_error:
+       if (ret < 0)
+               tdx_hold_page_on_error(kvm, page, level);
+       return ret;
 }

 void tdx_deliver_interrupt(struct kvm_lapic *apic, int delivery_mode,


commit 240acb13d4bd724b4c153b73cfba3cd14d3cc296
Author: Yan Zhao <yan.y.zhao@intel.com>
Date:   Tue Jun 10 19:26:56 2025 -0700

    KVM: guest_memfd: Add check kvm_gmem_private_has_safe_refcount()

    Check extra ref count on private pages in case of TDX unmap failure before
    private to shared conversion in the backend.

    In other zap cases, it's ok to do without this ref count check so that
    the error folio will be held by TDX after guest_memfd releases the folio.

    Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>

diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index af7943c0a8ba..1e1312bfa157 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -521,6 +521,41 @@ static void kvm_gmem_convert_invalidate_end(struct inode *inode,
                kvm_gmem_invalidate_end(gmem, invalidate_start, invalidate_end);
 }

+static bool kvm_gmem_private_has_safe_refcount(struct inode *inode,
+                                              pgoff_t start, pgoff_t end)
+{
+       pgoff_t index = start;
+       size_t inode_nr_pages;
+       bool ret = true;
+       void *priv;
+
+       /*
+        * Conversion in !kvm_gmem_has_custom_allocator() case does not reach here.
+        */
+       if (!kvm_gmem_has_custom_allocator(inode))
+               return ret;
+
+       priv = kvm_gmem_allocator_private(inode);
+       inode_nr_pages = kvm_gmem_allocator_ops(inode)->nr_pages_in_folio(priv);
+
+       while (index < end) {
+               struct folio *f;
+               f = filemap_get_folio(inode->i_mapping, index);
+               if (IS_ERR(f)) {
+                       index += inode_nr_pages;
+                       continue;
+               }
+
+               folio_put(f);
+               if (folio_ref_count(f) > folio_nr_pages(f)) {
+                       ret = false;
+                       break;
+               }
+               index += folio_nr_pages(f);
+       }
+       return ret;
+}
+
 static int kvm_gmem_convert_should_proceed(struct inode *inode,
                                           struct conversion_work *work,
                                           bool to_shared, pgoff_t *error_index)
@@ -538,6 +573,10 @@ static int kvm_gmem_convert_should_proceed(struct inode *inode,
                list_for_each_entry(gmem, gmem_list, entry) {
                        ret = kvm_gmem_zap(gmem, work->start, work_end,
                                           KVM_FILTER_PRIVATE, true);
+                       if (ret)
+                               return ret;
+                       if (!kvm_gmem_private_has_safe_refcount(inode, work->start, work_end))
+                               return -EFAULT;
                }
        } else {
                unmap_mapping_pages(inode->i_mapping, work->start,


commit 26743993663313fa6f8741a43f22ed5ac21399c7
Author: Yan Zhao <yan.y.zhao@intel.com>
Date:   Tue Jun 10 20:01:23 2025 -0700

    KVM: guest_memfd: Move splitting KVM mappings out of kvm_gmem_zap()

    Modify kvm_gmem_zap() to return void and introduce a separate function,
    kvm_gmem_split_private(), to handle the splitting of private EPT.

    With these changes, kvm_gmem_zap() will only be executed after successful
    splitting across the entire conversion/punch range.

    Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>

diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index 1e1312bfa157..e81efcef0837 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -318,8 +318,7 @@ static bool kvm_gmem_has_safe_refcount(struct address_space *mapping, pgoff_t st
        return refcount_safe;
 }

-static int kvm_gmem_zap(struct kvm_gmem *gmem, pgoff_t start, pgoff_t end,
-                       enum kvm_gfn_range_filter filter, bool do_split)
+static int kvm_gmem_split_private(struct kvm_gmem *gmem, pgoff_t start, pgoff_t end)
 {
        struct kvm_memory_slot *slot;
        struct kvm *kvm = gmem->kvm;
@@ -336,7 +335,7 @@ static int kvm_gmem_zap(struct kvm_gmem *gmem, pgoff_t start, pgoff_t end,
                        .end = slot->base_gfn + min(pgoff + slot->npages, end) - pgoff,
                        .slot = slot,
                        .may_block = true,
-                       .attr_filter = filter,
+                       .attr_filter = KVM_FILTER_PRIVATE,
                };

                if (!locked) {
@@ -344,16 +343,13 @@ static int kvm_gmem_zap(struct kvm_gmem *gmem, pgoff_t start, pgoff_t end,
                        locked = true;
                }

-               if (do_split) {
-                       ret = kvm_split_boundary_leafs(kvm, &gfn_range);
-                       if (ret < 0)
-                               goto out;
+               ret = kvm_split_boundary_leafs(kvm, &gfn_range);
+               if (ret < 0)
+                       goto out;

-                       flush |= ret;
-                       ret = 0;
-               }
+               flush |= ret;
+               ret = 0;

-               flush |= kvm_mmu_unmap_gfn_range(kvm, &gfn_range);
        }
 out:
        if (flush)
@@ -365,6 +361,42 @@ static int kvm_gmem_zap(struct kvm_gmem *gmem, pgoff_t start, pgoff_t end,
        return ret;
 }

+static void kvm_gmem_zap(struct kvm_gmem *gmem, pgoff_t start, pgoff_t end,
+                       enum kvm_gfn_range_filter filter)
+{
+       struct kvm_memory_slot *slot;
+       struct kvm *kvm = gmem->kvm;
+       unsigned long index;
+       bool locked = false;
+       bool flush = false;
+
+       xa_for_each_range(&gmem->bindings, index, slot, start, end - 1) {
+               pgoff_t pgoff = slot->gmem.pgoff;
+               struct kvm_gfn_range gfn_range = {
+                       .start = slot->base_gfn + max(pgoff, start) - pgoff,
+                       .end = slot->base_gfn + min(pgoff + slot->npages, end) - pgoff,
+                       .slot = slot,
+                       .may_block = true,
+                       .attr_filter = filter,
+               };
+
+               if (!locked) {
+                       KVM_MMU_LOCK(kvm);
+                       locked = true;
+               }
+
+               flush |= kvm_mmu_unmap_gfn_range(kvm, &gfn_range);
+       }
+
+       if (flush)
+               kvm_flush_remote_tlbs(kvm);
+
+       if (locked)
+               KVM_MMU_UNLOCK(kvm);
+
+       return;
+}
+
 static void kvm_gmem_invalidate_begin(struct kvm_gmem *gmem, pgoff_t start,
                                      pgoff_t end)
 {
@@ -571,10 +603,10 @@ static int kvm_gmem_convert_should_proceed(struct inode *inode,

                gmem_list = &inode->i_mapping->i_private_list;
                list_for_each_entry(gmem, gmem_list, entry) {
-                       ret = kvm_gmem_zap(gmem, work->start, work_end,
-                                          KVM_FILTER_PRIVATE, true);
+                       ret = kvm_gmem_split_private(gmem, work->start, work_end);
                        if (ret)
                                return ret;
+                       kvm_gmem_zap(gmem, work->start, work_end, KVM_FILTER_PRIVATE);
                        if (!kvm_gmem_private_has_safe_refcount(inode, work->start, work_end))
                                return -EFAULT;
                }
@@ -1471,9 +1503,10 @@ static long kvm_gmem_punch_hole(struct inode *inode, loff_t offset, loff_t len)
                 * expensive.
                 */
                filter = KVM_FILTER_PRIVATE | KVM_FILTER_SHARED;
-               ret = kvm_gmem_zap(gmem, start, end, filter, true);
+               ret = kvm_gmem_split_private(gmem, start, end);
                if (ret)
                        goto out;
+               kvm_gmem_zap(gmem, start, end, filter);
        }

        if (kvm_gmem_has_custom_allocator(inode)) {
@@ -1606,7 +1639,7 @@ static int kvm_gmem_release(struct inode *inode, struct file *file)
         * memory, as its lifetime is associated with the inode, not the file.
         */
        kvm_gmem_invalidate_begin(gmem, 0, -1ul);
-       kvm_gmem_zap(gmem, 0, -1ul, KVM_FILTER_PRIVATE | KVM_FILTER_SHARED, false);
+       kvm_gmem_zap(gmem, 0, -1ul, KVM_FILTER_PRIVATE | KVM_FILTER_SHARED);
        kvm_gmem_invalidate_end(gmem, 0, -1ul);

        list_del(&gmem->entry);
@@ -1942,7 +1975,7 @@ static int kvm_gmem_error_folio(struct address_space *mapping, struct folio *fol

                kvm_gmem_invalidate_begin(gmem, start, end);
                filter = KVM_FILTER_PRIVATE | KVM_FILTER_SHARED;
-               kvm_gmem_zap(gmem, start, end, filter, false);
+               kvm_gmem_zap(gmem, start, end, filter);
        }

        /*


If the above changes are agreeable, we could consider a more ambitious approach:
introducing an interface like:

int guest_memfd_add_page_ref_count(gfn_t gfn, int nr);
int guest_memfd_dec_page_ref_count(gfn_t gfn, int nr);

This would allow guest_memfd to maintain an internal reference count for each
private GFN. TDX would call guest_memfd_add_page_ref_count() when mapping a page
and guest_memfd_dec_page_ref_count() after a successful unmapping. Before
truncating a private page from the filemap, guest_memfd could increase the real
folio reference count based on its internal reference count for the private GFN.
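
A sketch only, of one possible shape for such bookkeeping: an xarray of value
entries mapping the guest_memfd index to the number of outstanding TDX
references. gfn_to_gmem_index() is a hypothetical lookup, the xarray is shown
as a global for brevity, and locking is elided:

/* Sketch only -- not a real guest_memfd API. */
static DEFINE_XARRAY(gmem_priv_refs);		/* index -> TDX reference count */

int guest_memfd_add_page_ref_count(gfn_t gfn, int nr)
{
	pgoff_t index = gfn_to_gmem_index(gfn);		/* hypothetical lookup */
	void *old = xa_load(&gmem_priv_refs, index);
	unsigned long cnt = (old ? xa_to_value(old) : 0) + nr;

	return xa_err(xa_store(&gmem_priv_refs, index, xa_mk_value(cnt),
			       GFP_KERNEL));
}

int guest_memfd_dec_page_ref_count(gfn_t gfn, int nr)
{
	pgoff_t index = gfn_to_gmem_index(gfn);		/* hypothetical lookup */
	void *old = xa_load(&gmem_priv_refs, index);
	unsigned long cnt = old ? xa_to_value(old) : 0;

	if (WARN_ON_ONCE(cnt < nr))
		return -EINVAL;

	cnt -= nr;
	if (!cnt)
		xa_erase(&gmem_priv_refs, index);
	else
		xa_store(&gmem_priv_refs, index, xa_mk_value(cnt), GFP_KERNEL);
	return 0;
}

Before truncation, guest_memfd would then transfer whatever count remains for
the range onto the real folio via folio_ref_add(), matching the last paragraph
above.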

^ permalink raw reply related	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge pages
  2025-06-05 21:12                                       ` Ackerley Tng
@ 2025-06-16 10:43                                         ` Yan Zhao
  2025-06-16 23:27                                           ` Edgecombe, Rick P
  0 siblings, 1 reply; 294+ messages in thread
From: Yan Zhao @ 2025-06-16 10:43 UTC (permalink / raw)
  To: Ackerley Tng
  Cc: vannapurve, pbonzini, seanjc, linux-kernel, kvm, x86,
	rick.p.edgecombe, dave.hansen, kirill.shutemov, tabba,
	quic_eberman, michael.roth, david, vbabka, jroedel,
	thomas.lendacky, pgonda, zhiquan1.li, fan.du, jun.miao, ira.weiny,
	isaku.yamahata, xiaoyao.li, binbin.wu, chao.p.peng

On Thu, Jun 05, 2025 at 02:12:58PM -0700, Ackerley Tng wrote:
> Yan Zhao <yan.y.zhao@intel.com> writes:
> 
> > On Wed, Jun 04, 2025 at 01:02:54PM -0700, Ackerley Tng wrote:
> >> Yan Zhao <yan.y.zhao@intel.com> writes:
> >> 
> >> > On Mon, May 12, 2025 at 09:53:43AM -0700, Vishal Annapurve wrote:
> >> >> On Sun, May 11, 2025 at 7:18 PM Yan Zhao <yan.y.zhao@intel.com> wrote:
> >> >> > ...
> >> >> > >
> >> >> > > I might be wrongly throwing out some terminologies here then.
> >> >> > > VM_PFNMAP flag can be set for memory backed by folios/page structs.
> >> >> > > udmabuf seems to be working with pinned "folios" in the backend.
> >> >> > >
> >> >> > > The goal is to get to a stage where guest_memfd is backed by pfn
> >> >> > > ranges unmanaged by kernel that guest_memfd owns and distributes to
> >> >> > > userspace, KVM, IOMMU subject to shareability attributes. if the
> >> >> > OK. So from point of the reset part of kernel, those pfns are not regarded as
> >> >> > memory.
> >> >> >
> >> >> > > shareability changes, the users will get notified and will have to
> >> >> > > invalidate their mappings. guest_memfd will allow mmaping such ranges
> >> >> > > with VM_PFNMAP flag set by default in the VMAs to indicate the need of
> >> >> > > special handling/lack of page structs.
> >> >> > My concern is a failable invalidation notifer may not be ideal.
> >> >> > Instead of relying on ref counts (or other mechanisms) to determine whether to
> >> >> > start shareabilitiy changes, with a failable invalidation notifier, some users
> >> >> > may fail the invalidation and the shareability change, even after other users
> >> >> > have successfully unmapped a range.
> >> >>
> >> >> Even if one user fails to invalidate its mappings, I don't see a
> >> >> reason to go ahead with shareability change. Shareability should not
> >> >> change unless all existing users let go of their soon-to-be-invalid
> >> >> view of memory.
> >> 
> >> Hi Yan,
> >> 
> >> While working on the 1G (aka HugeTLB) page support for guest_memfd
> >> series [1], we took into account conversion failures too. The steps are
> >> in kvm_gmem_convert_range(). (It might be easier to pull the entire
> >> series from GitHub [2] because the steps for conversion changed in two
> >> separate patches.)
> >> 
> >> We do need to handle errors across ranges to be converted, possibly from
> >> different memslots. The goal is to either have the entire conversion
> >> happen (including page split/merge) or nothing at all when the ioctl
> >> returns.
> >> 
> >> We try to undo the restructuring (whether split or merge) and undo any
> >> shareability changes on error (barring ENOMEM, in which case we leave a
> >> WARNing).
> > As the undo can fail (as the case you leave a WARNing, in patch 38 in [1]), it
> > can lead to WARNings in kernel with folios not being properly added to the
> > filemap.
> >
> 
> I'm not sure how else to handle errors on rollback path. I've hopefully
> addressed this on the other thread at [1].
I'll reply to [1].

Please also check my reply and proposal at [2].

> >> The part we don't restore is the presence of the pages in the host or
> >> guest page tables. For that, our idea is that if unmapped, the next
> >> access will just map it in, so there's no issue there.
> >
> > I don't think so.
> >
> > As in patch 38 in [1], on failure, it may fail to
> > - restore the shareability
> > - restore the folio's filemap status
> > - restore the folio's hugetlb stash metadata
> > - restore the folio's merged/split status
> >
> 
> The plan is that we try our best to restore shareability, filemap,
> restructuring (aka split/merge, including stash metadata) other than
> failures on rollback.
> 
> > Also, the host page table is not restored.
> >
> >
> 
> This is by design, the host page tables can be re-populated on the next
> fault. I've hopefully addressed this on the other thread at [1].
This is not. Please check my reply to [1].


> >> > My thinking is that:
> >> >
> >> > 1. guest_memfd starts shared-to-private conversion
> >> > 2. guest_memfd sends invalidation notifications
> >> >    2.1 invalidate notification --> A --> Unmap and return success
> >> >    2.2 invalidate notification --> B --> Unmap and return success
> >> >    2.3 invalidate notification --> C --> return failure
> >> > 3. guest_memfd finds 2.3 fails, fails shared-to-private conversion and keeps
> >> >    shareability as shared
> >> >
> >> > Though the GFN remains shared after 3, it's unmapped in user A and B in 2.1 and
> >> > 2.2. Even if additional notifications could be sent to A and B to ask for
> >> > mapping the GFN back, the map operation might fail. Consequently, A and B might
> >> > not be able to restore the mapped status of the GFN.
> >> 
> >> For conversion we don't attempt to restore mappings anywhere (whether in
> >> guest or host page tables). What do you think of not restoring the
> >> mappings?
> > It could cause problem if the mappings in S-EPT can't be restored.
> >
> > For TDX private-to-shared conversion, if kvm_gmem_convert_should_proceed() -->
> > kvm_gmem_unmap_private() --> kvm_mmu_unmap_gfn_range() fails in the end, then
> > the GFN shareability is restored to private. The next guest access to
> > the partially unmapped private memory can meet a fatal error: "access before
> > acceptance".
> >
> > It could occur in such a scenario:
> > 1. TD issues a TDVMCALL_MAP_GPA to convert a private GFN to shared
> > 2. Conversion fails in KVM.
> > 3. set_memory_decrypted() fails in TD.
> > 4. TD thinks the GFN is still accepted as private and accesses it.
> >
> >
> 
> This is true, I was thinking that this isn't handled solely in
> conversion but by being part of the contract between userspace VMM and
> the guest, that guest must handle conversion failures. I've hopefully
> addressed this on the other thread at [1].
> 
> >> > For IOMMU mappings, this
> >> > could result in DMAR failure following a failed attempt to do shared-to-private
> >> > conversion.
> >> 
> >> I believe the current conversion setup guards against this because after
> >> unmapping from the host, we check for any unexpected refcounts.
> > Right, it's fine if we check for any unexpected refcounts.
> >
> >
> >> (This unmapping is not the unmapping we're concerned about, since this is
> >> shared memory, and unmapping doesn't go through TDX.)
> >> 
> >> Coming back to the refcounts, if the IOMMU had mappings, these refcounts
> >> are "unexpected". The conversion ioctl will return to userspace with an
> >> error.
> >> 
> >> IO can continue to happen, since the memory is still mapped in the
> >> IOMMU. The memory state is still shared. No issue there.
> >> 
> >> In RFCv2 [1], we expect userspace to see the error, then try and remove
> >> the memory from the IOMMU, and then try conversion again.
> > I don't think it's right to depend on that userspace could always perform in 
> > kernel's expected way, i.e. trying conversion until it succeeds.
> >
> 
> Let me think more deeply about this. Please let me know if there's
> anything I missed.
> 
> It is true that a buggy or malicious userspace VMM can ignore conversion
> failures and report success to the guest, but if both the userspace VMM
> and guest are malicious, it's quite hard for the kernel to defend
> against that.
Hmm, doesn't expecting userspace to retry the conversion endlessly exceed what
is reasonable for a cooperative userspace?

> I think as long as there's no point where the guest can crash the host
> in a fixed way, I think it is okay to rely on a userspace VMM and guest
> protocol.
> 
> IIUC the guest can crash the host (original point of having guest_memfd)
> if the guest can convince the host to write to private memory. For that
How could it do that?
Unless the host kernel wants to crash itself, I don't think allowing a guest to
crash the host is acceptable.
If you happen to know of such a case, please let us know. We'll fix it.

> to happen, the memory must be faulted into the Secure EPTs, and the
> shareability state must be ALL for the host to fault it in.
> 
> So to have this issue, the conversion failure must be such that the
> memory remains faulted into the Secure EPTs while shareability is
> shared. Since unmapping from secure EPTs happens pretty early before any
> shareability is changed or any rollback (and rollback failures) can
> happen, I think we should be quite safe?
It's not safe if unmapping from the secure EPT fails but the shareability is
still changed to shared.


> If unmapping of private memory fails, this is where I think guest_memfd
> should get an error from the unmap and it should not proceed to change
> shareability.
Please check if my proposal at [2] is agreeable.

> 
> > We need to restore to the previous status (which includes the host page table)
> > if conversion can't be done.
> 
> Most of the previous status (shareability, filemap,
> restructuring (aka split/merge, including stash metadata)) are restored
> other than during rollback failures.
However, errors during the rollback are unacceptable.


> As for presence in host page tables, is it okay to defer that till the
> next fault, and if not okay, why not?
If the host page tables involve only shared mappings in the primary MMU
and shared EPT, it's ok.


> For presence in guest page tables, is it okay to fall back on the
> protocol where the guest must handle conversion failures, and if not
> okay, why not?
Hmm, whether to roll back the guest page table after a conversion failure is
the business of the guest OS.

However, KVM can't rely on the guest assuming that the page state is shared
even after a private-to-shared conversion failure.


> > That said, in my view, a better flow would be:
> >
> > 1. guest_memfd sends a pre-invalidation request to users (users here means the
> >    consumers in kernel of memory allocated from guest_memfd).
> >
> > 2. Users (A, B, ..., X) perform pre-checks to determine if invalidation can
> >    proceed. For example, in the case of TDX, this might involve memory
> >    allocation and page splitting.
> >
> > 3. Based on the pre-check results, guest_memfd either aborts the invalidation or
> >    proceeds by sending the actual invalidation request.
> >
> > 4. Users (A-X) perform the actual unmap operation, ensuring it cannot fail. For
> >    TDX, the unmap must succeed unless there are bugs in the KVM or TDX module.
> >    In such cases, TDX can callback guest_memfd to inform the poison-status of
> >    the page or elevate the page reference count.
> >
> > 5. guest_memfd completes the invalidation process. If the memory is marked as
> >    "poison," guest_memfd can handle it accordingly. If the page has an elevated
> >    reference count, guest_memfd may not need to take special action, as the
> >    elevated count prevents the OS from reallocating the page.
> >    (but from your reply below, seems a callback to guest_memfd is a better
> >    approach).
> >
> >
> 
> Thanks for this, I've tried to combine this into my response at
> [1]. I think this works, but it's hard because
> 
> a. Pre-checks are hard to check (explained at [1])
Please check whether the pre-checks in my POC [2] look good.
I tested it for the case of a TDX unmapping failure. It does not change the
shareability if splitting or zapping fails.
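
For reference, the ordering in the POC is roughly as below. The names are
illustrative only (not the actual POC code); the point is that shareability is
only flipped after splitting and unmapping have succeeded:

/*
 * Simplified sketch of the conversion ordering; all *_sketch() helpers are
 * placeholders for the corresponding steps in the POC.
 */
static int gmem_convert_to_shared_sketch(struct kvm *kvm, pgoff_t start,
					 pgoff_t end)
{
	int ret;

	/* Pre-check: split any huge S-EPT leafs crossing the boundaries. */
	ret = kvm_split_boundary_leafs_sketch(kvm, start, end);
	if (ret)
		return ret;		/* shareability left untouched */

	/* Actual unmap; not expected to fail once splitting succeeded. */
	ret = kvm_gmem_zap_sketch(kvm, start, end);
	if (ret)
		return ret;		/* shareability left untouched */

	/* Only now record the range as shared. */
	return gmem_set_shareability_sketch(kvm, start, end);
}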


> b. Even after all the checks, unmapping can still fail, and those still
>    have to be handled, and to handle those, we have to buy into the
>    userspace VMM/guest protocol, so why not just buy into the protocol
>    to start with?
In my POC [2], the outcome of unmapping failure is to leak the pages.
Please check if it looks good to you.

> [1] https://lore.kernel.org/all/diqztt4uhunj.fsf@ackerleytng-ctop.c.googlers.com/

[2] https://lore.kernel.org/all/aE%2Fq9VKkmaCcuwpU@yzhao56-desk.sh.intel.com/

> >> The part in concern here is unmapping failures of private pages, for
> >> private-to-shared conversions, since that part goes through TDX and
> >> might fail.
> > IMO, even for TDX, the real unmap must not fail unless there are bugs in the KVM
> > or TDX module.
> > So, for page splitting in S-EPT, I prefer to try splitting in the
> > pre-invalidation phase before conducting any real unmap.
> >
> >
> 
> Thanks for your detailed suggestion.
> 
> >> One other thing about taking refcounts is that in RFCv2,
> >> private-to-shared conversions assume that there are no refcounts on the
> >> private pages at all. (See filemap_remove_folio_for_restructuring() in
> >> [3])
> >>
> >> Haven't had a chance to think about all the edge cases, but for now I
> >> think on unmapping failure, in addition to taking a refcount, we should
> >> return an error at least up to guest_memfd, so that guest_memfd could
> >> perhaps keep the refcount on that page, but drop the page from the
> >> filemap. Another option could be to track messed up addresses and always
> >> check that on conversion or something - not sure yet.
> >
> > It looks good to me. See the bullet 4 in my proposed flow above.
> >
> 
> Thanks again for your detailed suggestion.
> 
> >> Either way, guest_memfd must know. If guest_memfd is not informed, on a
> >> next conversion request, the conversion will just spin in
> >> filemap_remove_folio_for_restructuring().
> > It makes sense.
> >
> >
> >> What do you think of this part about informing guest_memfd of the
> >> failure to unmap?
> > So, do you want to add a guest_memfd callback to achieve this purpose?
> >
> 
> I will need to think the entire thing through, but I meant informing as
> in returning an error to guest_memfd so that guest_memfd knows. I think
> returning an error should be the first cause of action.
> 
> As for whether guest_memfd should know how to handle the error or
> whether the userspace VMM should participate in deciding what to do with
> the error, I'm not sure. If you have suggestions on this, I hope we can
> combine the suggestions about the conversion protocol on the other thread.
> 
> Regarding a callback, are you thinking something like not having the
> unmap return an error, but instead TDX will call a function like
> kvm_gmem_error_at_offset(loff_t offset), and guest_memfd will then
> record that somewhere, and then immediately after calling unmap
> guest_memfd will check kvm_gmem_was_there_an_error_in_range() and then
> determining whether there's an error? Something like that?
> 
> I guess it could work but feels a little odd.
> 
> >
> > BTW, here's an analysis of why we can't let kvm_mmu_unmap_gfn_range()
> > and mmu_notifier_invalidate_range_start() fail, based on the repo
> > https://github.com/torvalds/linux.git, commit cd2e103d57e5 ("Merge tag
> > 'hardening-v6.16-rc1-fix1-take2' of
> > git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux")
> 
> Thank you, I appreciate the effort you took to enumerate these. The
> following suggestions are based on my current understanding. I don't
> have time in the near future to do the plumbing to test out the
> suggestion, but for now I want to see if this suggestion makes sense,
> maybe you can correct any misunderstandings first. 

Sorry, I realize that the enumeration below caused confusion.

Listing them was to show that unmapping failure is not expected by the kernel.

Please kindly let me know if any existing kernel code allows unmap to fail.

> > 1. Status of mmu notifier
> > -------------------------------
> > (1) There're 34 direct callers of mmu_notifier_invalidate_range_start().
> >     1. clear_refs_write
> >     2. do_pagemap_scan
> >     3. uprobe_write_opcode
> >     4. do_huge_zero_wp_pmd
> >     5. __split_huge_pmd (N)
> >     6. __split_huge_pud (N)
> >     7. move_pages_huge_pmd
> >     8. copy_hugetlb_page_range
> >     9. hugetlb_unshare_pmds  (N)
> >     10. hugetlb_change_protection
> >     11. hugetlb_wp
> >     12. unmap_hugepage_range (N)
> >     13. move_hugetlb_page_tables
> >     14. collapse_huge_page
> >     15. retract_page_tables
> >     16. collapse_pte_mapped_thp
> >     17. write_protect_page
> >     18. replace_page
> >     19. madvise_free_single_vma
> >     20. wp_clean_pre_vma
> >     21. wp_page_copy 
> >     22. zap_page_range_single_batched (N)
> >     23. unmap_vmas (N)
> >     24. copy_page_range 
> >     25. remove_device_exclusive_entry
> >     26. migrate_vma_collect
> >     27. __migrate_device_pages
> >     28. change_pud_range 
> >     29. move_page_tables
> >     30. page_vma_mkclean_one
> >     31. try_to_unmap_one
> >     32. try_to_migrate_one
> >     33. make_device_exclusive
> >     34. move_pages_pte
> >
> > Of these 34 direct callers, those marked with (N) cannot tolerate
> > mmu_notifier_invalidate_range_start() failing. I have not yet investigated all
> > 34 direct callers one by one, so the list of (N) is incomplete.
> >
> > For 5. __split_huge_pmd(), Documentation/mm/transhuge.rst says:
> > "Note that split_huge_pmd() doesn't have any limitations on refcounting:
> > pmd can be split at any point and never fails." This is because split_huge_pmd()
> > serves as a graceful fallback design for code walking pagetables but unaware
> > about huge pmds.
> >
> >

> Do these callers, especially those with (N), ever try to unmap any TDX
> private pages? guest_memfd only gives shared pages to core-mm, so for
> shared pages, there will continue to be no chance of errors.
> 
> If we change mmu_notifier_invalidate_range_start() to return an int, all
> of the callers that never invalidate shared pages can continue to safely
> rely on the fact that mmu_notifier_invalidate_range_start() will return
> 0.
mmu_notifier_invalidate_range_start() is only used to zap shared pages.

 
> For the callers of mmu_notifier_invalidate_range_start() that may touch
> private pages, I believe that's only guest_memfd and KVM. That's where
> we want the error, and will handle the error.
> 
> Another point here is that I was thinking to put EPT splitting together
> with actual unmapping instead of with invalidation because we will
> probably invalidate more than we unmap (see explanation at [1] about the
> race). Maybe moving EPT splitting to unmap could help?
> 
> > (2) There's 1 direct caller of mmu_notifier_invalidate_range_start_nonblock(),
> > __oom_reap_task_mm(), which only expects the error -EAGAIN.
> >
> > In mn_hlist_invalidate_range_start():
> > "WARN_ON(mmu_notifier_range_blockable(range) || _ret != -EAGAIN);"
> >
> >
> > (3) For DMAs, drivers need to invoke pin_user_pages() to pin memory. In that
> > case, they don't need to register mmu notifier.
> >
> > Or, device drivers can pin pages via get_user_pages*(), and register for mmu         
> > notifier callbacks for the memory range. Then, upon receiving a notifier         
> > "invalidate range" callback , stop the device from using the range, and unpin    
> > the pages.
> >
> > See Documentation/core-api/pin_user_pages.rst.
> >
> >
> 
> Do you mean that we should teach device drivers to get callbacks for
> private pages? Are you looking ahead to handle TDX IO on private pages?
> So far we haven't handled that yet.
I was trying to show that a device driver increases the page refcount (by
pinning) when it maps a page into the IOMMU page table, and does not decrease
the refcount (by unpinning) until after the page is unmapped.

If the page held by the device driver is allocated from hugetlb, and the page
has been truncated from the hugetlb file, the page is still held by the device
driver until it is unmapped from the IOMMU page table.

This is similar to TDX: as long as a page is still mapped in the S-EPT or
tracked by the TDX module, it's better to hold a page refcount even after the
page has been truncated from the file mapping.
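
For completeness, the driver-side pattern described in pin_user_pages.rst
boils down to something like this (a rough sketch; flags, locking and error
handling are simplified):

static int sketch_map_range_for_dma(unsigned long uaddr, int nr_pages,
				    struct page **pages)
{
	/* The pin (refcount) is taken before the IOMMU mapping is set up. */
	int pinned = pin_user_pages_fast(uaddr, nr_pages,
					 FOLL_WRITE | FOLL_LONGTERM, pages);
	if (pinned != nr_pages) {
		if (pinned > 0)
			unpin_user_pages(pages, pinned);
		return -EFAULT;
	}

	/* ... program the IOMMU with page_to_phys(pages[i]) here ... */
	return 0;
}

static void sketch_unmap_range_for_dma(struct page **pages, int nr_pages)
{
	/* ... tear down the IOMMU mapping first ... */

	/* ... and only then drop the refcounts. */
	unpin_user_pages(pages, nr_pages);
}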


> > 2. Cases that cannot tolerate failure of mmu_notifier_invalidate_range_start()
> > -------------------------------
> > (1) Error fallback cases.
> >
> >     1. split_huge_pmd() as mentioned in Documentation/mm/transhuge.rst.
> >        split_huge_pmd() is designed as a graceful fallback without failure.
> >
> >        split_huge_pmd
> >         |->__split_huge_pmd
> >            |->mmu_notifier_range_init
> >            |  mmu_notifier_invalidate_range_start
> >            |  split_huge_pmd_locked
> >            |  mmu_notifier_invalidate_range_end
> >
> >
> >     2. in fs/iomap/buffered-io.c, iomap_write_failed() itself is error handling.
> >        iomap_write_failed
> >          |->truncate_pagecache_range
> >             |->unmap_mapping_range
> >             |  |->unmap_mapping_pages
> >             |     |->unmap_mapping_range_tree
> >             |        |->unmap_mapping_range_vma
> >             |           |->zap_page_range_single
> >             |              |->zap_page_range_single_batched
> >             |                       |->mmu_notifier_range_init
> >             |                       |  mmu_notifier_invalidate_range_start
> >             |                       |  unmap_single_vma
> >             |                       |  mmu_notifier_invalidate_range_end
> >             |->truncate_inode_pages_range
> >                |->truncate_cleanup_folio
> >                   |->if (folio_mapped(folio))
> >                   |     unmap_mapping_folio(folio);
> >                          |->unmap_mapping_range_tree
> >                             |->unmap_mapping_range_vma
> >                                |->zap_page_range_single
> >                                   |->zap_page_range_single_batched
> >                                      |->mmu_notifier_range_init
> >                                      |  mmu_notifier_invalidate_range_start
> >                                      |  unmap_single_vma
> >                                      |  mmu_notifier_invalidate_range_end
> >
> >    3. in mm/memory.c, zap_page_range_single() is invoked to handle error.
> >       remap_pfn_range_notrack
> >         |->int error = remap_pfn_range_internal(vma, addr, pfn, size, prot);
> >         |  if (!error)
> >         |      return 0;
> > 	|  zap_page_range_single
> >            |->zap_page_range_single_batched
> >               |->mmu_notifier_range_init
> >               |  mmu_notifier_invalidate_range_start
> >               |  unmap_single_vma
> >               |  mmu_notifier_invalidate_range_end
> >
> >    4. in kernel/events/core.c, zap_page_range_single() is invoked to clear any
> >       partial mappings on error.
> >
> >       perf_mmap
> >         |->ret = map_range(rb, vma);
> >                  |  err = remap_pfn_range
> >                  |->if (err) 
> >                  |     zap_page_range_single
> >                         |->zap_page_range_single_batched
> >                            |->mmu_notifier_range_init
> >                            |  mmu_notifier_invalidate_range_start
> >                            |  unmap_single_vma
> >                            |  mmu_notifier_invalidate_range_end
> >
> >
> >    5. in mm/memory.c, unmap_mapping_folio() is invoked to unmap posion page.
> >
> >       __do_fault
> > 	|->if (unlikely(PageHWPoison(vmf->page))) { 
> > 	|	vm_fault_t poisonret = VM_FAULT_HWPOISON;
> > 	|	if (ret & VM_FAULT_LOCKED) {
> > 	|		if (page_mapped(vmf->page))
> > 	|			unmap_mapping_folio(folio);
> >         |                       |->unmap_mapping_range_tree
> >         |                          |->unmap_mapping_range_vma
> >         |                             |->zap_page_range_single
> >         |                                |->zap_page_range_single_batched
> >         |                                   |->mmu_notifier_range_init
> >         |                                   |  mmu_notifier_invalidate_range_start
> >         |                                   |  unmap_single_vma
> >         |                                   |  mmu_notifier_invalidate_range_end
> > 	|		if (mapping_evict_folio(folio->mapping, folio))
> > 	|			poisonret = VM_FAULT_NOPAGE; 
> > 	|		folio_unlock(folio);
> > 	|	}
> > 	|	folio_put(folio);
> > 	|	vmf->page = NULL;
> > 	|	return poisonret;
> > 	|  }
> >
> >
> >   6. in mm/vma.c, in __mmap_region(), unmap_region() is invoked to undo any
> >      partial mapping done by a device driver.
> >
> >      __mmap_new_vma
> >        |->__mmap_new_file_vma(map, vma);
> >           |->error = mmap_file(vma->vm_file, vma);
> >           |  if (error)
> >           |     unmap_region
> >                  |->unmap_vmas
> >                     |->mmu_notifier_range_init
> >                     |  mmu_notifier_invalidate_range_start
> >                     |  unmap_single_vma
> >                     |  mmu_notifier_invalidate_range_end
> >
> >
> 
> These should probably not ever be invalidating or unmapping private pages.
> 
> > (2) No-fail cases
> > -------------------------------
> > 1. iput() cannot fail. 
> >
> > iput
> >  |->iput_final
> >     |->WRITE_ONCE(inode->i_state, state | I_FREEING);
> >     |  inode_lru_list_del(inode);
> >     |  evict(inode);
> >        |->op->evict_inode(inode);
> >           |->shmem_evict_inode
> >              |->shmem_truncate_range
> >                 |->truncate_inode_pages_range
> >                    |->truncate_cleanup_folio
> >                       |->if (folio_mapped(folio))
> >                       |     unmap_mapping_folio(folio);
> >                             |->unmap_mapping_range_tree
> >                                |->unmap_mapping_range_vma
> >                                   |->zap_page_range_single
> >                                      |->zap_page_range_single_batched
> >                                         |->mmu_notifier_range_init
> >                                         |  mmu_notifier_invalidate_range_start
> >                                         |  unmap_single_vma
> >                                         |  mmu_notifier_invalidate_range_end
> >
> >
> > 2. exit_mmap() cannot fail
> >
> > exit_mmap
> >   |->mmu_notifier_release(mm);
> >      |->unmap_vmas(&tlb, &vmi.mas, vma, 0, ULONG_MAX, ULONG_MAX, false);
> >         |->mmu_notifier_range_init
> >         |  mmu_notifier_invalidate_range_start
> >         |  unmap_single_vma
> >         |  mmu_notifier_invalidate_range_end
> >
> >
> 
> These should probably not ever be invalidating or unmapping private pages.
> 
> > 3. KVM Cases That Cannot Tolerate Unmap Failure
> > -------------------------------
> > Allowing unmap operations to fail in the following scenarios would make it very
> > difficult or even impossible to handle the failure:
> >
> > (1) __kvm_mmu_get_shadow_page() is designed to reliably obtain a shadow page
> > without expecting any failure.
> >
> > mmu_alloc_direct_roots
> >   |->mmu_alloc_root
> >      |->kvm_mmu_get_shadow_page
> >         |->__kvm_mmu_get_shadow_page
> >            |->kvm_mmu_alloc_shadow_page
> >               |->account_shadowed
> >                  |->kvm_mmu_slot_gfn_write_protect
> >                     |->kvm_tdp_mmu_write_protect_gfn
> >                        |->write_protect_gfn
> >                           |->tdp_mmu_iter_set_spte
> >
> >
> 
> I need to learn more about shadow pages but IIUC TDX doesn't use shadow
> pages so this path won't interact with unmapping private pages.
> 
> > (2) kvm_vfio_release() and kvm_vfio_file_del() cannot fail
> >
> > kvm_vfio_release/kvm_vfio_file_del
> >  |->kvm_vfio_update_coherency
> >     |->kvm_arch_unregister_noncoherent_dma
> >        |->kvm_noncoherent_dma_assignment_start_or_stop
> >           |->kvm_zap_gfn_range
> >              |->kvm_tdp_mmu_zap_leafs
> >                 |->tdp_mmu_zap_leafs
> >                    |->tdp_mmu_iter_set_spte
> >
> >
> 
> I need to learn more about VFIO but for now IIUC IO uses shared pages,
> so this path won't interact with unmapping private pages.
> 
> > (3) There're lots of callers of __kvm_set_or_clear_apicv_inhibit() currently
> > never expect failure of unmap.
> >
> > __kvm_set_or_clear_apicv_inhibit
> >   |->kvm_zap_gfn_range
> >      |->kvm_tdp_mmu_zap_leafs
> >         |->tdp_mmu_zap_leafs
> >            |->tdp_mmu_iter_set_spte
> >
> >
> >
> 
> There could be some TDX specific things such that TDX doesn't use this
> path.
tdp_mmu_iter_set_spte() is used by KVM generally to update an SPTE while
kvm->mmu_lock is held for write.

TDX uses tdp_mmu_iter_set_spte() to propagate the unmapping to the S-EPT.

Converting tdp_mmu_iter_set_spte() to return an error would be a huge amount
of work, and I don't think it's right or worthwhile.

> 
> > 4. Cases in KVM where it's hard to make tdp_mmu_set_spte() (update SPTE with
> > write mmu_lock) failable.
> >
> > (1) kvm_vcpu_flush_tlb_guest()
> >
> > kvm_vcpu_flush_tlb_guest
> >   |->kvm_mmu_sync_roots
> >      |->mmu_sync_children
> >         |->kvm_vcpu_write_protect_gfn
> >            |->kvm_mmu_slot_gfn_write_protect
> >               |->kvm_tdp_mmu_write_protect_gfn
> >                  |->write_protect_gfn
> >                     |->tdp_mmu_iter_set_spte
> >                        |->tdp_mmu_set_spte
> >
> >
> > (2) handle_removed_pt() and handle_changed_spte().
> >
> 
> Thank you so much for looking into these, I'm hoping that the number of
> cases where TDX and private pages are unmapped are really limited to a
> few paths that we have to rework.
> 
> If we agree that the error has to be handled, then regardless of how we
> let the caller know that an error happened, all paths touching TDX
> private pages have to be reworked.
> 
> Between (1) returning an error vs (2) marking error and having the
> caller check for errors, then it's probably better to use the standard
> approach of returning an error since it is better understood, and
> there's no need to have extra data structures?
However, I don't think returning an error from the unmap path is a standard
approach...


^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 09/21] KVM: TDX: Enable 2MB mapping size after TD is RUNNABLE
  2025-06-16  3:14                                       ` Yan Zhao
@ 2025-06-16 22:49                                         ` Edgecombe, Rick P
  2025-06-17  0:52                                           ` Yan Zhao
  0 siblings, 1 reply; 294+ messages in thread
From: Edgecombe, Rick P @ 2025-06-16 22:49 UTC (permalink / raw)
  To: Zhao, Yan Y
  Cc: Du, Fan, Li, Xiaoyao, Huang, Kai, quic_eberman@quicinc.com,
	Hansen, Dave, david@redhat.com, thomas.lendacky@amd.com,
	vbabka@suse.cz, Li, Zhiquan1, Shutemov, Kirill,
	michael.roth@amd.com, linux-kernel@vger.kernel.org,
	seanjc@google.com, Peng, Chao P, pbonzini@redhat.com, Weiny, Ira,
	Yamahata, Isaku, binbin.wu@linux.intel.com,
	ackerleytng@google.com, kvm@vger.kernel.org, Annapurve, Vishal,
	tabba@google.com, jroedel@suse.de, Miao, Jun, pgonda@google.com,
	x86@kernel.org

On Mon, 2025-06-16 at 11:14 +0800, Yan Zhao wrote:
> > Oh, nice. I hadn't seen this. Agree that a comprehensive guest setup is
> > quite
> > manual. But here we are playing with guest ABI. In practice, yes it's
> > similar to
> > passing yet another arg to get a good TD.
> Could we introduce a TD attr TDX_ATTR_SEPT_EXPLICIT_DEMOTION?
> 
> It can be something similar to TDX_ATTR_SEPT_VE_DISABLE except that we don't
> provide a dynamical way as the TDCS_CONFIG_FLEXIBLE_PENDING_VE to allow guest
> to
> turn on/off SEPT_VE_DISABLE.
> (See the disable_sept_ve() in ./arch/x86/coco/tdx/tdx.c).
> 
> So, if userspace configures a TD with TDX_ATTR_SEPT_EXPLICIT_DEMOTION, KVM
> first
> checks if SEPT_EXPLICIT_DEMOTION is supported.
> The guest can also check if it would like to support SEPT_EXPLICIT_DEMOTION to
> determine to continue or shut down. (If it does not check
> SEPT_EXPLICIT_DEMOTION,
> e.g., if we don't want to update EDK2, the guest must accept memory before
> memory accessing).
> 
> - if TD is configured with SEPT_EXPLICIT_DEMOTION, KVM allows to map at 2MB
> when
>   there's no level info in an EPT violation. The guest must accept memory
> before
>   accessing memory or if it wants to accept only a partial of host's mapping,
> it
>   needs to explicitly invoke a TDVMCALL to request KVM to perform page
> demotion.
> 
> - if TD is configured without SEPT_EXPLICIT_DEMOTION, KVM always maps at 4KB
>   when there's no level info in an EPT violation.
> 
> - No matter SEPT_EXPLICIT_DEMOTION is configured or not, if there's a level
> info
>   in an EPT violation, while KVM honors the level info as the max_level info,
>   KVM ignores the demotion request in the fault path.

I think this is what Sean was suggesting. We are going to need a qemu command
line opt-in too.
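
Something like the below is what I'd picture for the fault path, just to make
the policy concrete (the attribute bit and the helper are of course
hypothetical at this point):

/* Hypothetical attribute bit, purely for illustration: */
#define TDX_ATTR_SEPT_EXPLICIT_DEMOTION		BIT_ULL(62)	/* made-up bit */

static int tdx_private_fault_max_level_sketch(u64 td_attributes,
					      bool has_accept_level,
					      int accept_level)
{
	/* The EPT violation carried an accept level: always honor it. */
	if (has_accept_level)
		return accept_level;

	/* No level info: only go 2MB if the TD opted in to explicit demotion. */
	if (td_attributes & TDX_ATTR_SEPT_EXPLICIT_DEMOTION)
		return PG_LEVEL_2M;

	return PG_LEVEL_4K;
}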

> 
> > We can start with a prototype the host side arg and see how it turns out. I
> > realized we need to verify edk2 as well.
> Current EDK2 should always accept pages before actual memory access.
> So, I think it should be fine.

It's not just that, it needs to handle the accept page size being lower than
the mapping size. I went and looked, and it is accepting at 4k size in places.
Hopefully it is just handling accepting a whole range that is not 2MB aligned.
But I think we need to verify this more.

^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge pages
  2025-06-16 10:43                                         ` Yan Zhao
@ 2025-06-16 23:27                                           ` Edgecombe, Rick P
  0 siblings, 0 replies; 294+ messages in thread
From: Edgecombe, Rick P @ 2025-06-16 23:27 UTC (permalink / raw)
  To: ackerleytng@google.com, Zhao, Yan Y
  Cc: quic_eberman@quicinc.com, Shutemov, Kirill, Li, Xiaoyao,
	kvm@vger.kernel.org, Hansen, Dave, david@redhat.com,
	thomas.lendacky@amd.com, tabba@google.com, vbabka@suse.cz,
	Du, Fan, michael.roth@amd.com, seanjc@google.com, Weiny, Ira,
	pbonzini@redhat.com, binbin.wu@linux.intel.com, Li, Zhiquan1,
	linux-kernel@vger.kernel.org, Yamahata, Isaku, Peng, Chao P,
	Annapurve, Vishal, jroedel@suse.de, Miao, Jun, pgonda@google.com,
	x86@kernel.org

On Mon, 2025-06-16 at 18:43 +0800, Yan Zhao wrote:
> > It is true that a buggy or malicious userspace VMM can ignore conversion
> > failures and report success to the guest, but if both the userspace VMM
> > and guest are malicious, it's quite hard for the kernel to defend
> > against that.

For upstream, it's going to be required that userspace can't mess up the host
kernel. Userspace is free to mess up the guest though.





^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge pages
  2025-06-16  9:59                                         ` Yan Zhao
@ 2025-06-17  0:12                                           ` Edgecombe, Rick P
  2025-06-17  1:38                                             ` Yan Zhao
  2025-06-17  0:25                                           ` Edgecombe, Rick P
  2025-06-17  3:51                                           ` Vishal Annapurve
  2 siblings, 1 reply; 294+ messages in thread
From: Edgecombe, Rick P @ 2025-06-17  0:12 UTC (permalink / raw)
  To: Annapurve, Vishal, Zhao, Yan Y
  Cc: kvm@vger.kernel.org, quic_eberman@quicinc.com, Li, Xiaoyao,
	Shutemov, Kirill, Hansen, Dave, david@redhat.com,
	thomas.lendacky@amd.com, tabba@google.com, vbabka@suse.cz,
	Du, Fan, michael.roth@amd.com, seanjc@google.com, Weiny, Ira,
	linux-kernel@vger.kernel.org, pbonzini@redhat.com,
	ackerleytng@google.com, Yamahata, Isaku,
	binbin.wu@linux.intel.com, Peng, Chao P, Li, Zhiquan1,
	jroedel@suse.de, Miao, Jun, pgonda@google.com, x86@kernel.org

On Mon, 2025-06-16 at 17:59 +0800, Yan Zhao wrote:
> If the above changes are agreeable, we could consider a more ambitious approach:
> introducing an interface like:
> 
> int guest_memfd_add_page_ref_count(gfn_t gfn, int nr);
> int guest_memfd_dec_page_ref_count(gfn_t gfn, int nr);

We talked about doing something like having tdx_hold_page_on_error() in
guestmemfd, with a proper name. The separation of concerns will be better if we
can just tell guestmemfd that the page has an issue. Then guestmemfd can decide
how to handle it (refcount or whatever).
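
i.e. something along these lines, where the name, the signature and the
"pin it" policy are all made up here just to show the split of
responsibilities:

/*
 * Made-up sketch: TDX only reports the problem; guestmemfd owns the policy
 * (extra refcount, a poison marker, refusing future conversions, ...).
 */
int kvm_gmem_report_error_sketch(struct file *gmem_file, pgoff_t index,
				 int nr_pages)
{
	struct folio *folio = filemap_lock_folio(gmem_file->f_mapping, index);

	if (IS_ERR(folio))
		return PTR_ERR(folio);

	/* Simplest possible policy: pin the folio so it is never reused. */
	folio_ref_add(folio, nr_pages);
	folio_unlock(folio);
	folio_put(folio);
	return 0;
}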

> 
> This would allow guest_memfd to maintain an internal reference count for each
> private GFN. TDX would call guest_memfd_add_page_ref_count() for mapping and
> guest_memfd_dec_page_ref_count() after a successful unmapping. Before truncating
> a private page from the filemap, guest_memfd could increase the real folio
> reference count based on its internal reference count for the private GFN.

What does this get us exactly? This is the argument to have less error prone
code that can survive forgetting to refcount on error? I don't see that it is an
especially special case.


^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge pages
  2025-06-16  9:59                                         ` Yan Zhao
  2025-06-17  0:12                                           ` Edgecombe, Rick P
@ 2025-06-17  0:25                                           ` Edgecombe, Rick P
  2025-06-17  2:00                                             ` Yan Zhao
  2025-06-17  3:51                                           ` Vishal Annapurve
  2 siblings, 1 reply; 294+ messages in thread
From: Edgecombe, Rick P @ 2025-06-17  0:25 UTC (permalink / raw)
  To: Annapurve, Vishal, Zhao, Yan Y
  Cc: kvm@vger.kernel.org, quic_eberman@quicinc.com, Li, Xiaoyao,
	Shutemov, Kirill, Hansen, Dave, david@redhat.com,
	thomas.lendacky@amd.com, tabba@google.com, vbabka@suse.cz,
	Du, Fan, michael.roth@amd.com, seanjc@google.com, Weiny, Ira,
	linux-kernel@vger.kernel.org, pbonzini@redhat.com,
	ackerleytng@google.com, Yamahata, Isaku,
	binbin.wu@linux.intel.com, Peng, Chao P, Li, Zhiquan1,
	jroedel@suse.de, Miao, Jun, pgonda@google.com, x86@kernel.org

On Mon, 2025-06-16 at 17:59 +0800, Yan Zhao wrote:
> > Few questions here:
> > 1) It sounds like the failure to remove entries from SEPT could only
> > be due to bugs in the KVM/TDX module,
> Yes.

A TDX module bug could hypothetically cause many types of host instability. We
should think a little more about the context of the risk before we make TDX a
special case or add much error handling code around it. If we end up with a
bunch of paranoid error handling code around TDX module behavior, that is
going to be a pain to maintain. And error handling code for rare cases will be
hard to remove.

We've had a history of unreliable page removal during the base series
development. When we solved the problem, it was not completely clean (though
more on the guest affecting side). So I think there is reason to be concerned.
But this should work reliably in theory. So I'm not sure we should use the error
case as a hard reason. Instead maybe we should focus on how to make it less
likely to have an error. Unless there is a specific case you are considering,
Yan?

That said, I think the refcounting on error (or rather, notifying guestmemfd on
error do let it handle the error how it wants) is a fine solution. As long as it
doesn't take much code (as is the case for Yan's POC).

> 
> > how reliable would it be to
> > continue executing TDX VMs on the host once such bugs are hit?
> The TDX VMs will be killed. However, the private pages are still mapped in the
> SEPT (after the unmapping failure).
> The teardown flow for TDX VM is:
> 
> do_exit
>   |->exit_files
>      |->kvm_gmem_release ==> (1) Unmap guest pages 
>      |->release kvmfd
>         |->kvm_destroy_vm  (2) Reclaiming resources
>            |->kvm_arch_pre_destroy_vm  ==> Release hkid
>            |->kvm_arch_destroy_vm  ==> Reclaim SEPT page table pages
> 
> Without holding page reference after (1) fails, the guest pages may have been
> re-assigned by the host OS while they are still still tracked in the TDX
> module.
> 
> 
> > 2) Is it reliable to continue executing the host kernel and other
> > normal VMs once such bugs are hit?
> If with TDX holding the page ref count, the impact of unmapping failure of
> guest
> pages is just to leak those pages.

If the kernel might be able to continue working, it should try. It should warn
if there is a risk, so people can use panic_on_warn if they want to stop the
kernel.

> 
> > 3) Can the memory be reclaimed reliably if the VM is marked as dead
> > and cleaned up right away?
> As in the above flow, TDX needs to hold the page reference on unmapping
> failure
> until after reclaiming is successful. Well, reclaiming itself is possible to
> fail either.

We could ask TDX module folks if there is anything they could guarantee.


^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 09/21] KVM: TDX: Enable 2MB mapping size after TD is RUNNABLE
  2025-06-16 22:49                                         ` Edgecombe, Rick P
@ 2025-06-17  0:52                                           ` Yan Zhao
  2025-06-18  0:30                                             ` Yan Zhao
  2025-06-18  1:22                                             ` Edgecombe, Rick P
  0 siblings, 2 replies; 294+ messages in thread
From: Yan Zhao @ 2025-06-17  0:52 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: Du, Fan, Li, Xiaoyao, Huang, Kai, quic_eberman@quicinc.com,
	Hansen, Dave, david@redhat.com, thomas.lendacky@amd.com,
	vbabka@suse.cz, Li, Zhiquan1, Shutemov, Kirill,
	michael.roth@amd.com, linux-kernel@vger.kernel.org,
	seanjc@google.com, Peng, Chao P, pbonzini@redhat.com, Weiny, Ira,
	Yamahata, Isaku, binbin.wu@linux.intel.com,
	ackerleytng@google.com, kvm@vger.kernel.org, Annapurve, Vishal,
	tabba@google.com, jroedel@suse.de, Miao, Jun, pgonda@google.com,
	x86@kernel.org

On Tue, Jun 17, 2025 at 06:49:00AM +0800, Edgecombe, Rick P wrote:
> On Mon, 2025-06-16 at 11:14 +0800, Yan Zhao wrote:
> > > Oh, nice. I hadn't seen this. Agree that a comprehensive guest setup is
> > > quite
> > > manual. But here we are playing with guest ABI. In practice, yes it's
> > > similar to
> > > passing yet another arg to get a good TD.
> > Could we introduce a TD attr TDX_ATTR_SEPT_EXPLICIT_DEMOTION?
> > 
> > It can be something similar to TDX_ATTR_SEPT_VE_DISABLE except that we don't
> > provide a dynamical way as the TDCS_CONFIG_FLEXIBLE_PENDING_VE to allow guest
> > to
> > turn on/off SEPT_VE_DISABLE.
> > (See the disable_sept_ve() in ./arch/x86/coco/tdx/tdx.c).
> > 
> > So, if userspace configures a TD with TDX_ATTR_SEPT_EXPLICIT_DEMOTION, KVM
> > first
> > checks if SEPT_EXPLICIT_DEMOTION is supported.
> > The guest can also check if it would like to support SEPT_EXPLICIT_DEMOTION to
> > determine to continue or shut down. (If it does not check
> > SEPT_EXPLICIT_DEMOTION,
> > e.g., if we don't want to update EDK2, the guest must accept memory before
> > memory accessing).
> > 
> > - if TD is configured with SEPT_EXPLICIT_DEMOTION, KVM allows to map at 2MB
> > when
> >   there's no level info in an EPT violation. The guest must accept memory
> > before
> >   accessing memory or if it wants to accept only a partial of host's mapping,
> > it
> >   needs to explicitly invoke a TDVMCALL to request KVM to perform page
> > demotion.
> > 
> > - if TD is configured without SEPT_EXPLICIT_DEMOTION, KVM always maps at 4KB
> >   when there's no level info in an EPT violation.
> > 
> > - No matter SEPT_EXPLICIT_DEMOTION is configured or not, if there's a level
> > info
> >   in an EPT violation, while KVM honors the level info as the max_level info,
> >   KVM ignores the demotion request in the fault path.
> 
> I think this is what Sean was suggesting. We are going to need a qemu command
> line opt-in too.
> 
> > 
> > > We can start with a prototype the host side arg and see how it turns out. I
> > > realized we need to verify edk2 as well.
> > Current EDK2 should always accept pages before actual memory access.
> > So, I think it should be fine.
> 
> It's not just that, it needs to handle the the accept page size being lower than
> the mapping size. I went and looked and it is accepting at 4k size in places. It
As it accepts pages before memory access, the case of "accept page size being
lower than the mapping size" can't happen.

> hopefully is just handling accepting a whole range that is not 2MB aligned. But
> I think we need to verify this more.
Ok.

^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge pages
  2025-06-17  0:12                                           ` Edgecombe, Rick P
@ 2025-06-17  1:38                                             ` Yan Zhao
  2025-06-17 15:52                                               ` Edgecombe, Rick P
  0 siblings, 1 reply; 294+ messages in thread
From: Yan Zhao @ 2025-06-17  1:38 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: Annapurve, Vishal, kvm@vger.kernel.org, quic_eberman@quicinc.com,
	Li, Xiaoyao, Shutemov, Kirill, Hansen, Dave, david@redhat.com,
	thomas.lendacky@amd.com, tabba@google.com, vbabka@suse.cz,
	Du, Fan, michael.roth@amd.com, seanjc@google.com, Weiny, Ira,
	linux-kernel@vger.kernel.org, pbonzini@redhat.com,
	ackerleytng@google.com, Yamahata, Isaku,
	binbin.wu@linux.intel.com, Peng, Chao P, Li, Zhiquan1,
	jroedel@suse.de, Miao, Jun, pgonda@google.com, x86@kernel.org

On Tue, Jun 17, 2025 at 08:12:50AM +0800, Edgecombe, Rick P wrote:
> On Mon, 2025-06-16 at 17:59 +0800, Yan Zhao wrote:
> > If the above changes are agreeable, we could consider a more ambitious approach:
> > introducing an interface like:
> > 
> > int guest_memfd_add_page_ref_count(gfn_t gfn, int nr);
> > int guest_memfd_dec_page_ref_count(gfn_t gfn, int nr);
> 
> We talked about doing something like having tdx_hold_page_on_error() in
> guestmemfd with a proper name. The separation of concerns will be better if we
> can just tell guestmemfd, the page has an issue. Then guestmemfd can decide how
> to handle it (refcount or whatever).
Instead of using tdx_hold_page_on_error(), the advantage of informing
guest_memfd that TDX is holding a page at 4KB granularity is that, even if there
is a bug in KVM (such as forgetting to notify TDX to remove a mapping in
handle_removed_pt()), guest_memfd would be aware that the page remains mapped in
the TDX module. This allows guest_memfd to determine how to handle the
problematic page (whether through refcount adjustments or other methods) before
truncating it.
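
At the code level, the bookkeeping in guest_memfd could be as simple as a
per-4KB-index counter. The below is only a sketch (an xarray is just one
possible data structure); before truncating a private page, guest_memfd would
bump the real folio refcount by whatever count is still recorded:

static int gmem_adjust_map_count_sketch(struct xarray *map_count,
					pgoff_t index, int delta)
{
	void *entry = xa_load(map_count, index);
	long cnt = entry ? xa_to_value(entry) : 0;

	cnt += delta;
	if (WARN_ON_ONCE(cnt < 0))
		return -EINVAL;

	/* A non-zero count means the TDX module still references this page. */
	return xa_err(xa_store(map_count, index, xa_mk_value(cnt),
			       GFP_KERNEL));
}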

> > 
> > This would allow guest_memfd to maintain an internal reference count for each
> > private GFN. TDX would call guest_memfd_add_page_ref_count() for mapping and
> > guest_memfd_dec_page_ref_count() after a successful unmapping. Before truncating
> > a private page from the filemap, guest_memfd could increase the real folio
> > reference count based on its internal reference count for the private GFN.
> 
> What does this get us exactly? This is the argument to have less error prone
> code that can survive forgetting to refcount on error? I don't see that it is an
> especially special case.
Yes, for less error-prone code.

If this approach is considered too complex for an initial implementation, using
tdx_hold_page_on_error() is also a viable option.

^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge pages
  2025-06-17  0:25                                           ` Edgecombe, Rick P
@ 2025-06-17  2:00                                             ` Yan Zhao
  0 siblings, 0 replies; 294+ messages in thread
From: Yan Zhao @ 2025-06-17  2:00 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: Annapurve, Vishal, kvm@vger.kernel.org, quic_eberman@quicinc.com,
	Li, Xiaoyao, Shutemov, Kirill, Hansen, Dave, david@redhat.com,
	thomas.lendacky@amd.com, tabba@google.com, vbabka@suse.cz,
	Du, Fan, michael.roth@amd.com, seanjc@google.com, Weiny, Ira,
	linux-kernel@vger.kernel.org, pbonzini@redhat.com,
	ackerleytng@google.com, Yamahata, Isaku,
	binbin.wu@linux.intel.com, Peng, Chao P, Li, Zhiquan1,
	jroedel@suse.de, Miao, Jun, pgonda@google.com, x86@kernel.org

On Tue, Jun 17, 2025 at 08:25:20AM +0800, Edgecombe, Rick P wrote:
> On Mon, 2025-06-16 at 17:59 +0800, Yan Zhao wrote:
> > > Few questions here:
> > > 1) It sounds like the failure to remove entries from SEPT could only
> > > be due to bugs in the KVM/TDX module,
> > Yes.
> 
> A TDX module bug could hypothetically cause many types of host instability. We
> should consider a little more on the context for the risk before we make TDX a
> special case or add much error handling code around it. If we end up with a
> bunch of paranoid error handling code around TDX module behavior, that is going
> to be a pain to maintain. And error handling code for rare cases will be hard to
> remove.
> 
> We've had a history of unreliable page removal during the base series
> development. When we solved the problem, it was not completely clean (though
> more on the guest affecting side). So I think there is reason to be concerned.
> But this should work reliably in theory. So I'm not sure we should use the error
> case as a hard reason. Instead maybe we should focus on how to make it less
> likely to have an error. Unless there is a specific case you are considering,
> Yan?
Yes, KVM/TDX does its utmost to ensure that page removal cannot fail. However,
if bugs occur, KVM/TDX will trigger a BUG_ON and leak the problematic page.
This is a simple way to constrain the error to the affected pages. It also
helps with debugging when unexpected errors arise.

Returning the error code up the stack is not worthwhile, and I don't even
think it's feasible.
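
So the failure is handled locally in the removal path, roughly like the below
(a simplified sketch, not the exact code):

static void tdx_handle_removal_failure_sketch(struct kvm *kvm,
					      struct page *page, int level,
					      u64 tdx_err)
{
	/* Bug the TD; all vCPUs will bail out of the guest. */
	KVM_BUG_ON(tdx_err, kvm);

	/*
	 * Leak the page: an extra reference per 4KB page keeps it away from
	 * the page allocator even after guest_memfd truncates it.
	 */
	folio_ref_add(page_folio(page), KVM_PAGES_PER_HPAGE(level));
	pr_err_once("tdx: failed to remove page, leaking pfn 0x%lx\n",
		    page_to_pfn(page));
}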


> That said, I think the refcounting on error (or rather, notifying guestmemfd on
> error do let it handle the error how it wants) is a fine solution. As long as it
> doesn't take much code (as is the case for Yan's POC).
> 
> > 
> > > how reliable would it be to
> > > continue executing TDX VMs on the host once such bugs are hit?
> > The TDX VMs will be killed. However, the private pages are still mapped in the
> > SEPT (after the unmapping failure).
> > The teardown flow for TDX VM is:
> > 
> > do_exit
> >   |->exit_files
> >      |->kvm_gmem_release ==> (1) Unmap guest pages 
> >      |->release kvmfd
> >         |->kvm_destroy_vm  (2) Reclaiming resources
> >            |->kvm_arch_pre_destroy_vm  ==> Release hkid
> >            |->kvm_arch_destroy_vm  ==> Reclaim SEPT page table pages
> > 
> > Without holding page reference after (1) fails, the guest pages may have been
> > re-assigned by the host OS while they are still still tracked in the TDX
> > module.
> > 
> > 
> > > 2) Is it reliable to continue executing the host kernel and other
> > > normal VMs once such bugs are hit?
> > If with TDX holding the page ref count, the impact of unmapping failure of
> > guest
> > pages is just to leak those pages.
> 
> If the kernel might be able to continue working, it should try. It should warn
> if there is a risk, so people can use panic_on_warn if they want to stop the
> kernel.
> 
> > 
> > > 3) Can the memory be reclaimed reliably if the VM is marked as dead
> > > and cleaned up right away?
> > As in the above flow, TDX needs to hold the page reference on unmapping
> > failure
> > until after reclaiming is successful. Well, reclaiming itself is possible to
> > fail either.
> 
> We could ask TDX module folks if there is anything they could guarantee.
> 

^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge pages
  2025-06-16  9:59                                         ` Yan Zhao
  2025-06-17  0:12                                           ` Edgecombe, Rick P
  2025-06-17  0:25                                           ` Edgecombe, Rick P
@ 2025-06-17  3:51                                           ` Vishal Annapurve
  2025-06-17  6:52                                             ` Yan Zhao
  2 siblings, 1 reply; 294+ messages in thread
From: Vishal Annapurve @ 2025-06-17  3:51 UTC (permalink / raw)
  To: Yan Zhao
  Cc: Ackerley Tng, pbonzini, seanjc, linux-kernel, kvm, x86,
	rick.p.edgecombe, dave.hansen, kirill.shutemov, tabba,
	quic_eberman, michael.roth, david, vbabka, jroedel,
	thomas.lendacky, pgonda, zhiquan1.li, fan.du, jun.miao, ira.weiny,
	isaku.yamahata, xiaoyao.li, binbin.wu, chao.p.peng

On Mon, Jun 16, 2025 at 3:02 AM Yan Zhao <yan.y.zhao@intel.com> wrote:
>
> On Wed, Jun 11, 2025 at 07:30:10AM -0700, Vishal Annapurve wrote:
> > On Wed, Jun 4, 2025 at 7:45 PM Yan Zhao <yan.y.zhao@intel.com> wrote:
> > >
> > > We need to restore to the previous status (which includes the host page table)
> > > if conversion can't be done.
> > > That said, in my view, a better flow would be:
> > >
> > > 1. guest_memfd sends a pre-invalidation request to users (users here means the
> > >    consumers in kernel of memory allocated from guest_memfd).
> > >
> > > 2. Users (A, B, ..., X) perform pre-checks to determine if invalidation can
> > >    proceed. For example, in the case of TDX, this might involve memory
> > >    allocation and page splitting.
> > >
> > > 3. Based on the pre-check results, guest_memfd either aborts the invalidation or
> > >    proceeds by sending the actual invalidation request.
> > >
> > > 4. Users (A-X) perform the actual unmap operation, ensuring it cannot fail. For
> > >    TDX, the unmap must succeed unless there are bugs in the KVM or TDX module.
> > >    In such cases, TDX can callback guest_memfd to inform the poison-status of
> > >    the page or elevate the page reference count.
> >
> > Few questions here:
> > 1) It sounds like the failure to remove entries from SEPT could only
> > be due to bugs in the KVM/TDX module,
> Yes.
>
> > how reliable would it be to
> > continue executing TDX VMs on the host once such bugs are hit?
> The TDX VMs will be killed. However, the private pages are still mapped in the
> SEPT (after the unmapping failure).
> The teardown flow for TDX VM is:
>
> do_exit
>   |->exit_files
>      |->kvm_gmem_release ==> (1) Unmap guest pages
>      |->release kvmfd
>         |->kvm_destroy_vm  (2) Reclaiming resources
>            |->kvm_arch_pre_destroy_vm  ==> Release hkid
>            |->kvm_arch_destroy_vm  ==> Reclaim SEPT page table pages
>
> Without holding page reference after (1) fails, the guest pages may have been
> re-assigned by the host OS while they are still still tracked in the TDX module.

What happens to the pagetable memory holding the SEPT entry? Is that
also supposed to be leaked?

>
>
> > 2) Is it reliable to continue executing the host kernel and other
> > normal VMs once such bugs are hit?
> If with TDX holding the page ref count, the impact of unmapping failure of guest
> pages is just to leak those pages.
>
> > 3) Can the memory be reclaimed reliably if the VM is marked as dead
> > and cleaned up right away?
> As in the above flow, TDX needs to hold the page reference on unmapping failure
> until after reclaiming is successful. Well, reclaiming itself is possible to
> fail either.
>
> So, below is my proposal. Showed in the simple POC code based on
> https://github.com/googleprodkernel/linux-cc/commits/wip-tdx-gmem-conversions-hugetlb-2mept-v2.
>
> Patch 1: TDX increases page ref count on unmap failure.

This will not work. As Ackerley pointed out earlier [1], it would be
impossible to differentiate between transient refcounts on private
pages and extra refcounts held on private memory due to a TDX unmap failure.

[1] https://lore.kernel.org/lkml/diqzfrgfp95d.fsf@ackerleytng-ctop.c.googlers.com/

> Patch 2: Bail out private-to-shared conversion if splitting fails.
> Patch 3: Make kvm_gmem_zap() return void.
>
> ...
>         /*
>
>
> If the above changes are agreeable, we could consider a more ambitious approach:
> introducing an interface like:
>
> int guest_memfd_add_page_ref_count(gfn_t gfn, int nr);
> int guest_memfd_dec_page_ref_count(gfn_t gfn, int nr);

I don't see any reason to introduce full tracking of gfn mapping
status in SEPTs just to handle very rare scenarios that KVM/TDX are
taking the utmost care to avoid.

That being said, I see value in letting guest_memfd know the exact ranges
still in use by the TDX module due to unmapping failures.
guest_memfd can take the right action instead of relying on refcounts.

Does KVM continue unmapping the full range even after TDX SEPT
management fails to unmap a subrange?

^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge pages
  2025-06-17  3:51                                           ` Vishal Annapurve
@ 2025-06-17  6:52                                             ` Yan Zhao
  2025-06-17  8:09                                               ` Vishal Annapurve
  0 siblings, 1 reply; 294+ messages in thread
From: Yan Zhao @ 2025-06-17  6:52 UTC (permalink / raw)
  To: Vishal Annapurve
  Cc: Ackerley Tng, pbonzini, seanjc, linux-kernel, kvm, x86,
	rick.p.edgecombe, dave.hansen, kirill.shutemov, tabba,
	quic_eberman, michael.roth, david, vbabka, jroedel,
	thomas.lendacky, pgonda, zhiquan1.li, fan.du, jun.miao, ira.weiny,
	isaku.yamahata, xiaoyao.li, binbin.wu, chao.p.peng

On Mon, Jun 16, 2025 at 08:51:41PM -0700, Vishal Annapurve wrote:
> On Mon, Jun 16, 2025 at 3:02 AM Yan Zhao <yan.y.zhao@intel.com> wrote:
> >
> > On Wed, Jun 11, 2025 at 07:30:10AM -0700, Vishal Annapurve wrote:
> > > On Wed, Jun 4, 2025 at 7:45 PM Yan Zhao <yan.y.zhao@intel.com> wrote:
> > > >
> > > > We need to restore to the previous status (which includes the host page table)
> > > > if conversion can't be done.
> > > > That said, in my view, a better flow would be:
> > > >
> > > > 1. guest_memfd sends a pre-invalidation request to users (users here means the
> > > >    consumers in kernel of memory allocated from guest_memfd).
> > > >
> > > > 2. Users (A, B, ..., X) perform pre-checks to determine if invalidation can
> > > >    proceed. For example, in the case of TDX, this might involve memory
> > > >    allocation and page splitting.
> > > >
> > > > 3. Based on the pre-check results, guest_memfd either aborts the invalidation or
> > > >    proceeds by sending the actual invalidation request.
> > > >
> > > > 4. Users (A-X) perform the actual unmap operation, ensuring it cannot fail. For
> > > >    TDX, the unmap must succeed unless there are bugs in the KVM or TDX module.
> > > >    In such cases, TDX can callback guest_memfd to inform the poison-status of
> > > >    the page or elevate the page reference count.
> > >
> > > Few questions here:
> > > 1) It sounds like the failure to remove entries from SEPT could only
> > > be due to bugs in the KVM/TDX module,
> > Yes.
> >
> > > how reliable would it be to
> > > continue executing TDX VMs on the host once such bugs are hit?
> > The TDX VMs will be killed. However, the private pages are still mapped in the
> > SEPT (after the unmapping failure).
> > The teardown flow for TDX VM is:
> >
> > do_exit
> >   |->exit_files
> >      |->kvm_gmem_release ==> (1) Unmap guest pages
> >      |->release kvmfd
> >         |->kvm_destroy_vm  (2) Reclaiming resources
> >            |->kvm_arch_pre_destroy_vm  ==> Release hkid
> >            |->kvm_arch_destroy_vm  ==> Reclaim SEPT page table pages
> >
> > Without holding page reference after (1) fails, the guest pages may have been
> > re-assigned by the host OS while they are still still tracked in the TDX module.
> 
> What happens to the pagetable memory holding the SEPT entry? Is that
> also supposed to be leaked?
It depends on whether reclaiming the page table pages holding the SEPT entry
fails. If it does, those pages will also be leaked.
But the page holding the TDR is guaranteed to be leaked, as reclaiming the TDR
page will fail once (1) has failed.



> > > 2) Is it reliable to continue executing the host kernel and other
> > > normal VMs once such bugs are hit?
> > If with TDX holding the page ref count, the impact of unmapping failure of guest
> > pages is just to leak those pages.
> >
> > > 3) Can the memory be reclaimed reliably if the VM is marked as dead
> > > and cleaned up right away?
> > As in the above flow, TDX needs to hold the page reference on unmapping failure
> > until after reclaiming is successful. Well, reclaiming itself is possible to
> > fail either.
> >
> > So, below is my proposal. Showed in the simple POC code based on
> > https://github.com/googleprodkernel/linux-cc/commits/wip-tdx-gmem-conversions-hugetlb-2mept-v2.
> >
> > Patch 1: TDX increases page ref count on unmap failure.
> 
> This will not work as Ackerley pointed out earlier [1], it will be
> impossible to differentiate between transient refcounts on private
> pages and extra refcounts of private memory due to TDX unmap failure.
Hmm, why are there transient refcounts on private pages?
And why should we differentiate between the two?


> [1] https://lore.kernel.org/lkml/diqzfrgfp95d.fsf@ackerleytng-ctop.c.googlers.com/
> 
> > Patch 2: Bail out private-to-shared conversion if splitting fails.
> > Patch 3: Make kvm_gmem_zap() return void.
> >
> > ...
> >         /*
> >
> >
> > If the above changes are agreeable, we could consider a more ambitious approach:
> > introducing an interface like:
> >
> > int guest_memfd_add_page_ref_count(gfn_t gfn, int nr);
> > int guest_memfd_dec_page_ref_count(gfn_t gfn, int nr);
> 
> I don't see any reason to introduce full tracking of gfn mapping
> status in SEPTs just to handle very rare scenarios which KVM/TDX are
> taking utmost care to avoid.
> 
> That being said, I see value in letting guest_memfd know exact ranges
> still being under use by the TDX module due to unmapping failures.
> guest_memfd can take the right action instead of relying on refcounts.
> 
> Does KVM continue unmapping the full range even after TDX SEPT
> management fails to unmap a subrange?
Yes, if there's no bug in KVM, it will continue unmapping the full range.

^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge pages
  2025-06-17  6:52                                             ` Yan Zhao
@ 2025-06-17  8:09                                               ` Vishal Annapurve
  2025-06-17  9:57                                                 ` Yan Zhao
  2025-06-18  0:34                                                 ` Edgecombe, Rick P
  0 siblings, 2 replies; 294+ messages in thread
From: Vishal Annapurve @ 2025-06-17  8:09 UTC (permalink / raw)
  To: Yan Zhao
  Cc: Ackerley Tng, pbonzini, seanjc, linux-kernel, kvm, x86,
	rick.p.edgecombe, dave.hansen, kirill.shutemov, tabba,
	quic_eberman, michael.roth, david, vbabka, jroedel,
	thomas.lendacky, pgonda, zhiquan1.li, fan.du, jun.miao, ira.weiny,
	isaku.yamahata, xiaoyao.li, binbin.wu, chao.p.peng

On Mon, Jun 16, 2025 at 11:55 PM Yan Zhao <yan.y.zhao@intel.com> wrote:
>
> On Mon, Jun 16, 2025 at 08:51:41PM -0700, Vishal Annapurve wrote:
> > On Mon, Jun 16, 2025 at 3:02 AM Yan Zhao <yan.y.zhao@intel.com> wrote:
> > >
> > > On Wed, Jun 11, 2025 at 07:30:10AM -0700, Vishal Annapurve wrote:
> > > > On Wed, Jun 4, 2025 at 7:45 PM Yan Zhao <yan.y.zhao@intel.com> wrote:
> > > > >
> > > > > We need to restore to the previous status (which includes the host page table)
> > > > > if conversion can't be done.
> > > > > That said, in my view, a better flow would be:
> > > > >
> > > > > 1. guest_memfd sends a pre-invalidation request to users (users here means the
> > > > >    consumers in kernel of memory allocated from guest_memfd).
> > > > >
> > > > > 2. Users (A, B, ..., X) perform pre-checks to determine if invalidation can
> > > > >    proceed. For example, in the case of TDX, this might involve memory
> > > > >    allocation and page splitting.
> > > > >
> > > > > 3. Based on the pre-check results, guest_memfd either aborts the invalidation or
> > > > >    proceeds by sending the actual invalidation request.
> > > > >
> > > > > 4. Users (A-X) perform the actual unmap operation, ensuring it cannot fail. For
> > > > >    TDX, the unmap must succeed unless there are bugs in the KVM or TDX module.
> > > > >    In such cases, TDX can callback guest_memfd to inform the poison-status of
> > > > >    the page or elevate the page reference count.
> > > >
> > > > Few questions here:
> > > > 1) It sounds like the failure to remove entries from SEPT could only
> > > > be due to bugs in the KVM/TDX module,
> > > Yes.
> > >
> > > > how reliable would it be to
> > > > continue executing TDX VMs on the host once such bugs are hit?
> > > The TDX VMs will be killed. However, the private pages are still mapped in the
> > > SEPT (after the unmapping failure).
> > > The teardown flow for TDX VM is:
> > >
> > > do_exit
> > >   |->exit_files
> > >      |->kvm_gmem_release ==> (1) Unmap guest pages
> > >      |->release kvmfd
> > >         |->kvm_destroy_vm  (2) Reclaiming resources
> > >            |->kvm_arch_pre_destroy_vm  ==> Release hkid
> > >            |->kvm_arch_destroy_vm  ==> Reclaim SEPT page table pages
> > >
> > > Without holding page reference after (1) fails, the guest pages may have been
> > > re-assigned by the host OS while they are still still tracked in the TDX module.
> >
> > What happens to the pagetable memory holding the SEPT entry? Is that
> > also supposed to be leaked?
> It depends on if the reclaiming of the page table pages holding the SEPT entry
> fails. If it is, it will be also leaked.
> But the page to hold TDR is for sure to be leaked as the reclaiming of TDR page
> will fail after (1) fails.
>

Ok. A few questions that I would like to briefly touch base on:
i) If (1) fails and the VM is then marked as bugged, will the TDX module
actually access that page in the context of the same VM again?
ii) Which resources should remain unreclaimed if (1) fails?
     * page backing SEPT entry
     * page backing PAMT entry
     * TDMR
    If the TDMR is the only one that fails to reclaim, will the TDX module
ever actually access the physical memory after the VM is cleaned up?
Otherwise, should all of these be made unreclaimable?
iii) Will it be safe for the host to use that memory with a proper
WBINVD/memory clearing sequence if the TDX module/TD is not going to use
that memory?

>
>
> > > > 2) Is it reliable to continue executing the host kernel and other
> > > > normal VMs once such bugs are hit?
> > > If with TDX holding the page ref count, the impact of unmapping failure of guest
> > > pages is just to leak those pages.
> > >
> > > > 3) Can the memory be reclaimed reliably if the VM is marked as dead
> > > > and cleaned up right away?
> > > As in the above flow, TDX needs to hold the page reference on unmapping failure
> > > until after reclaiming is successful. Well, reclaiming itself is possible to
> > > fail either.
> > >
> > > So, below is my proposal. Showed in the simple POC code based on
> > > https://github.com/googleprodkernel/linux-cc/commits/wip-tdx-gmem-conversions-hugetlb-2mept-v2.
> > >
> > > Patch 1: TDX increases page ref count on unmap failure.
> >
> > This will not work as Ackerley pointed out earlier [1], it will be
> > impossible to differentiate between transient refcounts on private
> > pages and extra refcounts of private memory due to TDX unmap failure.
> Hmm. why are there transient refcounts on private pages?
> And why should we differentiate the two?

Sorry I quoted Ackerley's response wrongly. Here is the correct reference [1].

Speculative/transient refcounts came up a few times in the context of
guest_memfd discussions; some examples include: pagetable walkers,
page migration, speculative pagecache lookups, GUP-fast, etc. David H
can provide more context here as needed.

Effectively some core-mm features that are present today or might land
in the future can cause folio refcounts to be grabbed for short
durations without actual access to underlying physical memory. These
scenarios are unlikely to happen for private memory but can't be
discounted completely.
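
For reference, this is roughly what such a transient reference looks like; a
minimal illustrative sketch only, not code from any of the series being
discussed:

static bool peek_folio(struct folio *folio)
{
	/* Takes a reference only if the refcount is not already zero. */
	if (!folio_try_get(folio))
		return false;

	/*
	 * ... revalidate whatever was looked up speculatively; the memory
	 * itself is never accessed ...
	 */

	folio_put(folio);	/* the transient reference goes away again */
	return true;
}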

Another reason to avoid relying on refcounts is to not block usage of
raw physical memory unmanaged by kernel (without page structs) to back
guest private memory as we had discussed previously. This will help
simplify merge/split operations during conversions and help usecases
like guest memory persistence [2] and non-confidential VMs.

[1] https://lore.kernel.org/lkml/diqz7c2lr6wg.fsf@ackerleytng-ctop.c.googlers.com/
[2] https://lore.kernel.org/lkml/20240805093245.889357-1-jgowans@amazon.com/

>
>
> > [1] https://lore.kernel.org/lkml/diqzfrgfp95d.fsf@ackerleytng-ctop.c.googlers.com/
> >
> > > Patch 2: Bail out private-to-shared conversion if splitting fails.
> > > Patch 3: Make kvm_gmem_zap() return void.
> > >
> > > ...
> > >         /*
> > >
> > >
> > > If the above changes are agreeable, we could consider a more ambitious approach:
> > > introducing an interface like:
> > >
> > > int guest_memfd_add_page_ref_count(gfn_t gfn, int nr);
> > > int guest_memfd_dec_page_ref_count(gfn_t gfn, int nr);
> >
> > I don't see any reason to introduce full tracking of gfn mapping
> > status in SEPTs just to handle very rare scenarios which KVM/TDX are
> > taking utmost care to avoid.
> >
> > That being said, I see value in letting guest_memfd know exact ranges
> > still being under use by the TDX module due to unmapping failures.
> > guest_memfd can take the right action instead of relying on refcounts.
> >
> > Does KVM continue unmapping the full range even after TDX SEPT
> > management fails to unmap a subrange?
> Yes, if there's no bug in KVM, it will continue unmapping the full ranges.

^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge pages
  2025-06-17  8:09                                               ` Vishal Annapurve
@ 2025-06-17  9:57                                                 ` Yan Zhao
  2025-06-18  4:25                                                   ` Vishal Annapurve
  2025-06-18  0:34                                                 ` Edgecombe, Rick P
  1 sibling, 1 reply; 294+ messages in thread
From: Yan Zhao @ 2025-06-17  9:57 UTC (permalink / raw)
  To: Vishal Annapurve
  Cc: Ackerley Tng, pbonzini, seanjc, linux-kernel, kvm, x86,
	rick.p.edgecombe, dave.hansen, kirill.shutemov, tabba,
	quic_eberman, michael.roth, david, vbabka, jroedel,
	thomas.lendacky, pgonda, zhiquan1.li, fan.du, jun.miao, ira.weiny,
	isaku.yamahata, xiaoyao.li, binbin.wu, chao.p.peng

On Tue, Jun 17, 2025 at 01:09:05AM -0700, Vishal Annapurve wrote:
> On Mon, Jun 16, 2025 at 11:55 PM Yan Zhao <yan.y.zhao@intel.com> wrote:
> >
> > On Mon, Jun 16, 2025 at 08:51:41PM -0700, Vishal Annapurve wrote:
> > > On Mon, Jun 16, 2025 at 3:02 AM Yan Zhao <yan.y.zhao@intel.com> wrote:
> > > >
> > > > On Wed, Jun 11, 2025 at 07:30:10AM -0700, Vishal Annapurve wrote:
> > > > > On Wed, Jun 4, 2025 at 7:45 PM Yan Zhao <yan.y.zhao@intel.com> wrote:
> > > > > >
> > > > > > We need to restore to the previous status (which includes the host page table)
> > > > > > if conversion can't be done.
> > > > > > That said, in my view, a better flow would be:
> > > > > >
> > > > > > 1. guest_memfd sends a pre-invalidation request to users (users here means the
> > > > > >    consumers in kernel of memory allocated from guest_memfd).
> > > > > >
> > > > > > 2. Users (A, B, ..., X) perform pre-checks to determine if invalidation can
> > > > > >    proceed. For example, in the case of TDX, this might involve memory
> > > > > >    allocation and page splitting.
> > > > > >
> > > > > > 3. Based on the pre-check results, guest_memfd either aborts the invalidation or
> > > > > >    proceeds by sending the actual invalidation request.
> > > > > >
> > > > > > 4. Users (A-X) perform the actual unmap operation, ensuring it cannot fail. For
> > > > > >    TDX, the unmap must succeed unless there are bugs in the KVM or TDX module.
> > > > > >    In such cases, TDX can callback guest_memfd to inform the poison-status of
> > > > > >    the page or elevate the page reference count.
> > > > >
> > > > > Few questions here:
> > > > > 1) It sounds like the failure to remove entries from SEPT could only
> > > > > be due to bugs in the KVM/TDX module,
> > > > Yes.
> > > >
> > > > > how reliable would it be to
> > > > > continue executing TDX VMs on the host once such bugs are hit?
> > > > The TDX VMs will be killed. However, the private pages are still mapped in the
> > > > SEPT (after the unmapping failure).
> > > > The teardown flow for TDX VM is:
> > > >
> > > > do_exit
> > > >   |->exit_files
> > > >      |->kvm_gmem_release ==> (1) Unmap guest pages
> > > >      |->release kvmfd
> > > >         |->kvm_destroy_vm  (2) Reclaiming resources
> > > >            |->kvm_arch_pre_destroy_vm  ==> Release hkid
> > > >            |->kvm_arch_destroy_vm  ==> Reclaim SEPT page table pages
> > > >
> > > > Without holding page reference after (1) fails, the guest pages may have been
> > > > re-assigned by the host OS while they are still still tracked in the TDX module.
> > >
> > > What happens to the pagetable memory holding the SEPT entry? Is that
> > > also supposed to be leaked?
> > It depends on if the reclaiming of the page table pages holding the SEPT entry
> > fails. If it is, it will be also leaked.
> > But the page to hold TDR is for sure to be leaked as the reclaiming of TDR page
> > will fail after (1) fails.
> >
> 
> Ok. Few questions that I would like to touch base briefly on:
> i) If (1) fails and then VM is marked as bugged, will the TDX module
> actually access that page in context of the same VM again?
In the TDX module, the TD is marked as TD_TEARDOWN after step (2), when the hkid
is released successfully.
Before that, the TD is able to access the pages even if it is marked as buggy by
KVM.

After the TD is marked as TD_TEARDOWN, since (1) failed, the problematic guest
private pages are still tracked in the PAMT entries.
So, re-assigning the same PFN to other TDs will fail.

> ii) What all resources should remain unreclaimed if (1) fails?
>      * page backing SEPT entry
>      * page backing PAMT entry
>      * TDMR
>     If TDMR is the only one that fails to reclaim, will the TDX module
> actually access the physical memory ever after the VM is cleaned up?
> Otherwise, should all of these be made unreclaimable?
From my understanding, they are
- guest private pages
- TDR page
- PAMT entries for guest private pages and TDR page


> iii) Will it be safe for the host to use that memory by proper
> WBINVD/memory clearing sequence if TDX module/TD is not going to use
> that memory?
I'm not sure. But it should be impossible for the host to re-assign the pages to
other TDs as long as the PAMT entries are not updated.


> > > > > 2) Is it reliable to continue executing the host kernel and other
> > > > > normal VMs once such bugs are hit?
> > > > If with TDX holding the page ref count, the impact of unmapping failure of guest
> > > > pages is just to leak those pages.
> > > >
> > > > > 3) Can the memory be reclaimed reliably if the VM is marked as dead
> > > > > and cleaned up right away?
> > > > As in the above flow, TDX needs to hold the page reference on unmapping failure
> > > > until after reclaiming is successful. Well, reclaiming itself is possible to
> > > > fail either.
> > > >
> > > > So, below is my proposal. Showed in the simple POC code based on
> > > > https://github.com/googleprodkernel/linux-cc/commits/wip-tdx-gmem-conversions-hugetlb-2mept-v2.
> > > >
> > > > Patch 1: TDX increases page ref count on unmap failure.
> > >
> > > This will not work as Ackerley pointed out earlier [1], it will be
> > > impossible to differentiate between transient refcounts on private
> > > pages and extra refcounts of private memory due to TDX unmap failure.
> > Hmm. why are there transient refcounts on private pages?
> > And why should we differentiate the two?
> 
> Sorry I quoted Ackerley's response wrongly. Here is the correct reference [1].
> 
> Speculative/transient refcounts came up a few times In the context of
> guest_memfd discussions, some examples include: pagetable walkers,
> page migration, speculative pagecache lookups, GUP-fast etc. David H
> can provide more context here as needed.
GUP-fast only walks page tables for shared memory?
Can other walkers get a private folio by walking shared mappings?

Regarding those speculative/transient refcounts that may come up, can't
kvm_gmem_convert_should_proceed() wait in an interruptible way before returning
failure?

The wait will anyway happen after the conversion is started, i.e.,
in filemap_remove_folio_for_restructuring().
       while (!folio_ref_freeze(folio, filemap_refcount)) {
                /*
                 * At this point only filemap refcounts are expected, hence okay
                 * to spin until speculative refcounts go away.
                 */
                WARN_ONCE(1, "Spinning on folio=%p refcount=%d", folio, folio_ref_count(folio));
        }
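
Concretely, something like the following minimal sketch is what I mean; the
expected refcount and the timeout value are placeholders, not values from any
posted series:

static int gmem_wait_for_safe_refcount(struct folio *folio, int expected)
{
	unsigned long timeout = jiffies + msecs_to_jiffies(10);

	while (folio_ref_count(folio) > expected) {
		if (signal_pending(current))
			return -EINTR;
		if (time_after(jiffies, timeout))
			return -EAGAIN;
		cond_resched();
	}
	return 0;
}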


BTW, I noticed that there's no filemap_invalidate_lock_shared() in
kvm_gmem_fault_shared() in 
https://lore.kernel.org/all/20250611133330.1514028-9-tabba@google.com/.

Do you know why?

> Effectively some core-mm features that are present today or might land
> in the future can cause folio refcounts to be grabbed for short
> durations without actual access to underlying physical memory. These
> scenarios are unlikely to happen for private memory but can't be
> discounted completely.
> 
> Another reason to avoid relying on refcounts is to not block usage of
> raw physical memory unmanaged by kernel (without page structs) to back
> guest private memory as we had discussed previously. This will help
> simplify merge/split operations during conversions and help usecases
> like guest memory persistence [2] and non-confidential VMs.
Ok.
Currently, "letting guest_memfd know exact ranges still being under use by the
TDX module due to unmapping failures" is good enough for TDX, though full
tracking of each GFN is even better.


> [1] https://lore.kernel.org/lkml/diqz7c2lr6wg.fsf@ackerleytng-ctop.c.googlers.com/
> [2] https://lore.kernel.org/lkml/20240805093245.889357-1-jgowans@amazon.com/
> 
> >
> >
> > > [1] https://lore.kernel.org/lkml/diqzfrgfp95d.fsf@ackerleytng-ctop.c.googlers.com/
> > >
> > > > Patch 2: Bail out private-to-shared conversion if splitting fails.
> > > > Patch 3: Make kvm_gmem_zap() return void.
> > > >
> > > > ...
> > > >         /*
> > > >
> > > >
> > > > If the above changes are agreeable, we could consider a more ambitious approach:
> > > > introducing an interface like:
> > > >
> > > > int guest_memfd_add_page_ref_count(gfn_t gfn, int nr);
> > > > int guest_memfd_dec_page_ref_count(gfn_t gfn, int nr);
> > >
> > > I don't see any reason to introduce full tracking of gfn mapping
> > > status in SEPTs just to handle very rare scenarios which KVM/TDX are
> > > taking utmost care to avoid.
> > >
> > > That being said, I see value in letting guest_memfd know exact ranges
> > > still being under use by the TDX module due to unmapping failures.
> > > guest_memfd can take the right action instead of relying on refcounts.
> > >
> > > Does KVM continue unmapping the full range even after TDX SEPT
> > > management fails to unmap a subrange?
> > Yes, if there's no bug in KVM, it will continue unmapping the full ranges.

^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge pages
  2025-06-17  1:38                                             ` Yan Zhao
@ 2025-06-17 15:52                                               ` Edgecombe, Rick P
  2025-06-18  0:19                                                 ` Yan Zhao
  0 siblings, 1 reply; 294+ messages in thread
From: Edgecombe, Rick P @ 2025-06-17 15:52 UTC (permalink / raw)
  To: Zhao, Yan Y
  Cc: quic_eberman@quicinc.com, Li, Xiaoyao, Shutemov, Kirill,
	Hansen, Dave, david@redhat.com, thomas.lendacky@amd.com,
	vbabka@suse.cz, tabba@google.com, Du, Fan,
	linux-kernel@vger.kernel.org, seanjc@google.com, Weiny, Ira,
	michael.roth@amd.com, pbonzini@redhat.com, ackerleytng@google.com,
	Yamahata, Isaku, binbin.wu@linux.intel.com, Peng, Chao P,
	kvm@vger.kernel.org, Annapurve, Vishal, jroedel@suse.de,
	Miao, Jun, Li, Zhiquan1, pgonda@google.com, x86@kernel.org

On Tue, 2025-06-17 at 09:38 +0800, Yan Zhao wrote:
> > We talked about doing something like having tdx_hold_page_on_error() in
> > guestmemfd with a proper name. The separation of concerns will be better if
> > we
> > can just tell guestmemfd, the page has an issue. Then guestmemfd can decide
> > how
> > to handle it (refcount or whatever).
> Instead of using tdx_hold_page_on_error(), the advantage of informing
> guest_memfd that TDX is holding a page at 4KB granularity is that, even if
> there
> is a bug in KVM (such as forgetting to notify TDX to remove a mapping in
> handle_removed_pt()), guest_memfd would be aware that the page remains mapped
> in
> the TDX module. This allows guest_memfd to determine how to handle the
> problematic page (whether through refcount adjustments or other methods)
> before
> truncating it.

I don't think a potential bug in KVM is a good enough reason. If we are
concerned, can we think about a warning instead?

We had talked about enhancing KASAN to know when a page is mapped into the S-EPT
in the past. So rather than design around potential bugs, we could focus on
having a simpler implementation with the infrastructure to catch and fix the
bugs.

> 
> > > 
> > > This would allow guest_memfd to maintain an internal reference count for
> > > each
> > > private GFN. TDX would call guest_memfd_add_page_ref_count() for mapping
> > > and
> > > guest_memfd_dec_page_ref_count() after a successful unmapping. Before
> > > truncating
> > > a private page from the filemap, guest_memfd could increase the real folio
> > > reference count based on its internal reference count for the private GFN.
> > 
> > What does this get us exactly? This is the argument to have less error prone
> > code that can survive forgetting to refcount on error? I don't see that it
> > is an
> > especially special case.
> Yes, for a less error prone code.
> 
> If this approach is considered too complex for an initial implementation,
> using
> tdx_hold_page_on_error() is also a viable option.

I'm saying I don't think it's a good enough reason. Why is it different than
other use-after-free bugs? I feel like I'm missing something.

^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge pages
  2025-06-17 15:52                                               ` Edgecombe, Rick P
@ 2025-06-18  0:19                                                 ` Yan Zhao
  2025-06-18  0:41                                                   ` Edgecombe, Rick P
  0 siblings, 1 reply; 294+ messages in thread
From: Yan Zhao @ 2025-06-18  0:19 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: quic_eberman@quicinc.com, Li, Xiaoyao, Shutemov, Kirill,
	Hansen, Dave, david@redhat.com, thomas.lendacky@amd.com,
	vbabka@suse.cz, tabba@google.com, Du, Fan,
	linux-kernel@vger.kernel.org, seanjc@google.com, Weiny, Ira,
	michael.roth@amd.com, pbonzini@redhat.com, ackerleytng@google.com,
	Yamahata, Isaku, binbin.wu@linux.intel.com, Peng, Chao P,
	kvm@vger.kernel.org, Annapurve, Vishal, jroedel@suse.de,
	Miao, Jun, Li, Zhiquan1, pgonda@google.com, x86@kernel.org

On Tue, Jun 17, 2025 at 11:52:48PM +0800, Edgecombe, Rick P wrote:
> On Tue, 2025-06-17 at 09:38 +0800, Yan Zhao wrote:
> > > We talked about doing something like having tdx_hold_page_on_error() in
> > > guestmemfd with a proper name. The separation of concerns will be better if
> > > we
> > > can just tell guestmemfd, the page has an issue. Then guestmemfd can decide
> > > how
> > > to handle it (refcount or whatever).
> > Instead of using tdx_hold_page_on_error(), the advantage of informing
> > guest_memfd that TDX is holding a page at 4KB granularity is that, even if
> > there
> > is a bug in KVM (such as forgetting to notify TDX to remove a mapping in
> > handle_removed_pt()), guest_memfd would be aware that the page remains mapped
> > in
> > the TDX module. This allows guest_memfd to determine how to handle the
> > problematic page (whether through refcount adjustments or other methods)
> > before
> > truncating it.
> 
> I don't think a potential bug in KVM is a good enough reason. If we are
> concerned can we think about a warning instead?
> 
> We had talked enhancing kasan to know when a page is mapped into S-EPT in the
> past. So rather than design around potential bugs we could focus on having a
> simpler implementation with the infrastructure to catch and fix the bugs.
However, if failing to remove a guest private page would only cause a memory
leak, it would be fine.
If TDX does not hold any refcount, guest_memfd has to know which private
pages are still mapped. Otherwise, a page may be re-assigned to other kernel
components while it is still mapped in the S-EPT.


> > 
> > > > 
> > > > This would allow guest_memfd to maintain an internal reference count for
> > > > each
> > > > private GFN. TDX would call guest_memfd_add_page_ref_count() for mapping
> > > > and
> > > > guest_memfd_dec_page_ref_count() after a successful unmapping. Before
> > > > truncating
> > > > a private page from the filemap, guest_memfd could increase the real folio
> > > > reference count based on its internal reference count for the private GFN.
> > > 
> > > What does this get us exactly? This is the argument to have less error prone
> > > code that can survive forgetting to refcount on error? I don't see that it
> > > is an
> > > especially special case.
> > Yes, for a less error prone code.
> > 
> > If this approach is considered too complex for an initial implementation,
> > using
> > tdx_hold_page_on_error() is also a viable option.
> 
> I'm saying I don't think it's not a good enough reason. Why is it different then
> other use-after free bugs? I feel like I'm missing something.
As for tdx_hold_page_on_error(), it could be implemented as follows: on removal
failure, invoke a guest_memfd interface to let guest_memfd know the exact ranges
still in use by the TDX module due to unmapping failures.
Do you think that's ok?
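
For illustration only, roughly the shape I have in mind; every name below is a
placeholder and the xarray is just one possible backing structure:

/* Record [index, index + nr) as still owned by the TDX module. */
static int kvm_gmem_note_unmap_failure(struct xarray *leaked, pgoff_t index,
				       unsigned long nr)
{
	pgoff_t i;

	for (i = index; i < index + nr; i++) {
		int err = xa_err(xa_store(leaked, i, xa_mk_value(1), GFP_KERNEL));

		if (err)
			return err;
	}
	return 0;
}

/* guest_memfd checks this before truncating/freeing the backing page. */
static bool kvm_gmem_index_is_leaked(struct xarray *leaked, pgoff_t index)
{
	return xa_load(leaked, index) != NULL;
}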

^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 09/21] KVM: TDX: Enable 2MB mapping size after TD is RUNNABLE
  2025-06-17  0:52                                           ` Yan Zhao
@ 2025-06-18  0:30                                             ` Yan Zhao
  2025-06-20 16:31                                               ` Sean Christopherson
  2025-06-18  1:22                                             ` Edgecombe, Rick P
  1 sibling, 1 reply; 294+ messages in thread
From: Yan Zhao @ 2025-06-18  0:30 UTC (permalink / raw)
  To: Edgecombe, Rick P, Du, Fan, Li, Xiaoyao, Huang, Kai,
	quic_eberman@quicinc.com, Hansen, Dave, david@redhat.com,
	thomas.lendacky@amd.com, vbabka@suse.cz, Li, Zhiquan1,
	Shutemov, Kirill, michael.roth@amd.com,
	linux-kernel@vger.kernel.org, seanjc@google.com, Peng, Chao P,
	pbonzini@redhat.com, Weiny, Ira, Yamahata, Isaku,
	binbin.wu@linux.intel.com, ackerleytng@google.com,
	kvm@vger.kernel.org, Annapurve, Vishal, tabba@google.com,
	jroedel@suse.de, Miao, Jun, pgonda@google.com, x86@kernel.org

On Tue, Jun 17, 2025 at 08:52:49AM +0800, Yan Zhao wrote:
> On Tue, Jun 17, 2025 at 06:49:00AM +0800, Edgecombe, Rick P wrote:
> > On Mon, 2025-06-16 at 11:14 +0800, Yan Zhao wrote:
> > > > Oh, nice. I hadn't seen this. Agree that a comprehensive guest setup is
> > > > quite
> > > > manual. But here we are playing with guest ABI. In practice, yes it's
> > > > similar to
> > > > passing yet another arg to get a good TD.
> > > Could we introduce a TD attr TDX_ATTR_SEPT_EXPLICIT_DEMOTION?
> > > 
> > > It can be something similar to TDX_ATTR_SEPT_VE_DISABLE except that we don't
> > > provide a dynamical way as the TDCS_CONFIG_FLEXIBLE_PENDING_VE to allow guest
> > > to
> > > turn on/off SEPT_VE_DISABLE.
> > > (See the disable_sept_ve() in ./arch/x86/coco/tdx/tdx.c).
> > > 
> > > So, if userspace configures a TD with TDX_ATTR_SEPT_EXPLICIT_DEMOTION, KVM
> > > first
> > > checks if SEPT_EXPLICIT_DEMOTION is supported.
> > > The guest can also check if it would like to support SEPT_EXPLICIT_DEMOTION to
> > > determine to continue or shut down. (If it does not check
> > > SEPT_EXPLICIT_DEMOTION,
> > > e.g., if we don't want to update EDK2, the guest must accept memory before
> > > memory accessing).
> > > 
> > > - if TD is configured with SEPT_EXPLICIT_DEMOTION, KVM allows to map at 2MB
> > > when
> > >   there's no level info in an EPT violation. The guest must accept memory
> > > before
> > >   accessing memory or if it wants to accept only a partial of host's mapping,
> > > it
> > >   needs to explicitly invoke a TDVMCALL to request KVM to perform page
> > > demotion.
> > > 
> > > - if TD is configured without SEPT_EXPLICIT_DEMOTION, KVM always maps at 4KB
> > >   when there's no level info in an EPT violation.
> > > 
> > > - No matter SEPT_EXPLICIT_DEMOTION is configured or not, if there's a level
> > > info
> > >   in an EPT violation, while KVM honors the level info as the max_level info,
> > >   KVM ignores the demotion request in the fault path.
Hi Sean,
Could you please confirm if this matches what you think?
i.e.,

  When an EPT violation carries an ACCEPT level info,
  KVM maps the page at a level <= the specified level.
  (If KVM finds a shadow-present leaf SPTE, it will not try to merge/split it.)
  The guest's ACCEPT will succeed, or return PAGE_SIZE_MATCH if the mapped
  level < the specified level.
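
To make the rule concrete, a minimal sketch; the helper name and the way the
accept level is plumbed into the fault path are my assumptions for illustration,
not code from this series:

static int tdx_fault_max_level(int accept_level)
{
	/* No level info in the EPT violation: fall back to the default policy. */
	if (accept_level == PG_LEVEL_NONE)
		return PG_LEVEL_2M;

	/* Never map above the level the guest asked to accept. */
	return min(accept_level, PG_LEVEL_2M);
}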

This can keep Linux guests (with SEPT_VE_DISABLE being true) more efficient.
So, for a Linux guest that only wants to accept at 4KB, the flow is:
1. guest ACCEPT 4KB
2. KVM maps it at 4KB
3. ACCEPT 4KB returns success

As the ACCEPT comes before KVM actually maps anything, we can avoid the complex
flow:
1. guest ACCEPT 4KB
2. KVM maps it at 2MB
3. ACCEPT 4KB returns PAGE_SIZE_MATCH.
4.(a) guest ACCEPT 2MB or
4.(b) guest triggers TDVMCALL to demote
5. KVM demotes the 2MB mapping
6. guest ACCEPT at 4KB
7. ACCEPT 4KB returns success 

For non-Linux guests (with SEPT_VE_DISABLE being false), I totally agree with
your suggestions!

Thanks
Yan

> > I think this is what Sean was suggesting. We are going to need a qemu command
> > line opt-in too.
> > 
> > > 
> > > > We can start with a prototype the host side arg and see how it turns out. I
> > > > realized we need to verify edk2 as well.
> > > Current EDK2 should always accept pages before actual memory access.
> > > So, I think it should be fine.
> > 
> > It's not just that, it needs to handle the the accept page size being lower than
> > the mapping size. I went and looked and it is accepting at 4k size in places. It
> As it accepts pages before memory access, the "accept page size being lower than
> the the mapping size" can't happen. 
> 
> > hopefully is just handling accepting a whole range that is not 2MB aligned. But
> > I think we need to verify this more.
> Ok.

^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge pages
  2025-06-17  8:09                                               ` Vishal Annapurve
  2025-06-17  9:57                                                 ` Yan Zhao
@ 2025-06-18  0:34                                                 ` Edgecombe, Rick P
  2025-06-18  0:46                                                   ` Yan Zhao
  2025-06-18  4:29                                                   ` Vishal Annapurve
  1 sibling, 2 replies; 294+ messages in thread
From: Edgecombe, Rick P @ 2025-06-18  0:34 UTC (permalink / raw)
  To: Annapurve, Vishal, Zhao, Yan Y
  Cc: kvm@vger.kernel.org, quic_eberman@quicinc.com, Li, Xiaoyao,
	Shutemov, Kirill, Hansen, Dave, david@redhat.com,
	thomas.lendacky@amd.com, tabba@google.com, vbabka@suse.cz,
	Du, Fan, michael.roth@amd.com, seanjc@google.com, Weiny, Ira,
	linux-kernel@vger.kernel.org, pbonzini@redhat.com,
	ackerleytng@google.com, Yamahata, Isaku,
	binbin.wu@linux.intel.com, Peng, Chao P, Li, Zhiquan1,
	jroedel@suse.de, Miao, Jun, pgonda@google.com, x86@kernel.org

On Tue, 2025-06-17 at 01:09 -0700, Vishal Annapurve wrote:
> Sorry I quoted Ackerley's response wrongly. Here is the correct reference [1].

I'm confused...

> 
> Speculative/transient refcounts came up a few times In the context of
> guest_memfd discussions, some examples include: pagetable walkers,
> page migration, speculative pagecache lookups, GUP-fast etc. David H
> can provide more context here as needed.
> 
> Effectively some core-mm features that are present today or might land
> in the future can cause folio refcounts to be grabbed for short
> durations without actual access to underlying physical memory. These
> scenarios are unlikely to happen for private memory but can't be
> discounted completely.

This means the refcount could be increased for other reasons, and so guestmemfd
shouldn't rely on refcounts for its purposes? So, it is not a problem for other
components handling the page to elevate the refcount?

> 
> Another reason to avoid relying on refcounts is to not block usage of
> raw physical memory unmanaged by kernel (without page structs) to back
> guest private memory as we had discussed previously. This will help
> simplify merge/split operations during conversions and help usecases
> like guest memory persistence [2] and non-confidential VMs.

If this becomes a thing for private memory (which it isn't yet), then couldn't
we just change things at that point?

Is the only issue with TDX taking refcounts that it won't work with future code
changes?

> 
> [1] https://lore.kernel.org/lkml/diqz7c2lr6wg.fsf@ackerleytng-ctop.c.googlers.com/
> [2] https://lore.kernel.org/lkml/20240805093245.889357-1-jgowans@amazon.com/


^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge pages
  2025-06-18  0:19                                                 ` Yan Zhao
@ 2025-06-18  0:41                                                   ` Edgecombe, Rick P
  2025-06-23  9:27                                                     ` Yan Zhao
  0 siblings, 1 reply; 294+ messages in thread
From: Edgecombe, Rick P @ 2025-06-18  0:41 UTC (permalink / raw)
  To: Zhao, Yan Y
  Cc: quic_eberman@quicinc.com, Li, Xiaoyao, Shutemov, Kirill,
	Hansen, Dave, david@redhat.com, thomas.lendacky@amd.com,
	vbabka@suse.cz, tabba@google.com, Du, Fan,
	linux-kernel@vger.kernel.org, seanjc@google.com, Weiny, Ira,
	pbonzini@redhat.com, binbin.wu@linux.intel.com, Yamahata, Isaku,
	michael.roth@amd.com, ackerleytng@google.com, Peng, Chao P,
	kvm@vger.kernel.org, Annapurve, Vishal, jroedel@suse.de,
	Miao, Jun, Li, Zhiquan1, pgonda@google.com, x86@kernel.org

On Wed, 2025-06-18 at 08:19 +0800, Yan Zhao wrote:
> > I don't think a potential bug in KVM is a good enough reason. If we are
> > concerned can we think about a warning instead?
> > 
> > We had talked enhancing kasan to know when a page is mapped into S-EPT in
> > the
> > past. So rather than design around potential bugs we could focus on having a
> > simpler implementation with the infrastructure to catch and fix the bugs.
> However, if failing to remove a guest private page would only cause memory
> leak,
> it's fine. 
> If TDX does not hold any refcount, guest_memfd has to know that which private
> page is still mapped. Otherwise, the page may be re-assigned to other kernel
> components while it may still be mapped in the S-EPT.

KASAN detects use-after-frees like that. However, the TDX module code is not
instrumented; it won't check against the KASAN state for its accesses.

I had a brief chat about this with Dave and Kirill. A couple of ideas were
discussed. One was to use page_ext to keep a flag that says the page is in use
by the TDX module. There was also some discussion of using a normal page flag,
and that the reserved page flag might prevent some of the MM operations that
would be needed on guestmemfd pages. I didn't see the problem when I looked.

For the solution, basically the SEAMCALL wrappers set a flag when they hand a
page to the TDX module, and clear it when they successfully reclaim it via
tdh_mem_page_remove() or tdh_phymem_page_reclaim(). Then, if a page makes it
back to the page allocator with the flag still set, a warning is generated.
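
A rough sketch of the page_ext variant, purely illustrative: the
PAGE_EXT_TDX_OWNED bit is a hypothetical flag (it would need to be added to the
page_ext flags), and the hook into the allocator's free path is assumed:

static void tdx_mark_page_owned(struct page *page, bool owned)
{
	struct page_ext *ext = page_ext_get(page);

	if (!ext)
		return;
	/*
	 * Set when a SEAMCALL wrapper hands the page to the TDX module,
	 * cleared after tdh_mem_page_remove()/tdh_phymem_page_reclaim()
	 * succeeds.
	 */
	if (owned)
		__set_bit(PAGE_EXT_TDX_OWNED, &ext->flags);
	else
		__clear_bit(PAGE_EXT_TDX_OWNED, &ext->flags);
	page_ext_put(ext);
}

/* Called from the free path so a leaked page shows up as a warning. */
static void tdx_warn_if_page_owned(struct page *page)
{
	struct page_ext *ext = page_ext_get(page);

	if (!ext)
		return;
	WARN_ONCE(test_bit(PAGE_EXT_TDX_OWNED, &ext->flags),
		  "page freed while still mapped by the TDX module");
	page_ext_put(ext);
}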

Also it was mentioned that SGX did have a similar issue to what is being worried
about here:
https://lore.kernel.org/linux-sgx/aCYey1W6i7i3yPLL@gmail.com/T/#m86c8c4cf0e6b9a653bf0709a22bb360034a24d95

> 
> 
> > > 
> > > > > 
> > > > > This would allow guest_memfd to maintain an internal reference count
> > > > > for
> > > > > each
> > > > > private GFN. TDX would call guest_memfd_add_page_ref_count() for
> > > > > mapping
> > > > > and
> > > > > guest_memfd_dec_page_ref_count() after a successful unmapping. Before
> > > > > truncating
> > > > > a private page from the filemap, guest_memfd could increase the real
> > > > > folio
> > > > > reference count based on its internal reference count for the private
> > > > > GFN.
> > > > 
> > > > What does this get us exactly? This is the argument to have less error
> > > > prone
> > > > code that can survive forgetting to refcount on error? I don't see that
> > > > it
> > > > is an
> > > > especially special case.
> > > Yes, for a less error prone code.
> > > 
> > > If this approach is considered too complex for an initial implementation,
> > > using
> > > tdx_hold_page_on_error() is also a viable option.
> > 
> > I'm saying I don't think it's not a good enough reason. Why is it different
> > then
> > other use-after free bugs? I feel like I'm missing something.
> By tdx_hold_page_on_error(), it could be implememented as on removal failure,
> invoke a guest_memfd interface to let guest_memfd know exact ranges still
> being
> under use by the TDX module due to unmapping failures.
> Do you think it's ok?

Either way is ok to me. It seems like we have three ok solutions. But the tone
of the thread is that we are solving some deep problem. Maybe I'm missing
something.

^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge pages
  2025-06-18  0:34                                                 ` Edgecombe, Rick P
@ 2025-06-18  0:46                                                   ` Yan Zhao
  2025-06-18  4:33                                                     ` Vishal Annapurve
  2025-06-18  4:29                                                   ` Vishal Annapurve
  1 sibling, 1 reply; 294+ messages in thread
From: Yan Zhao @ 2025-06-18  0:46 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: Annapurve, Vishal, kvm@vger.kernel.org, quic_eberman@quicinc.com,
	Li, Xiaoyao, Shutemov, Kirill, Hansen, Dave, david@redhat.com,
	thomas.lendacky@amd.com, tabba@google.com, vbabka@suse.cz,
	Du, Fan, michael.roth@amd.com, seanjc@google.com, Weiny, Ira,
	linux-kernel@vger.kernel.org, pbonzini@redhat.com,
	ackerleytng@google.com, Yamahata, Isaku,
	binbin.wu@linux.intel.com, Peng, Chao P, Li, Zhiquan1,
	jroedel@suse.de, Miao, Jun, pgonda@google.com, x86@kernel.org

On Wed, Jun 18, 2025 at 08:34:24AM +0800, Edgecombe, Rick P wrote:
> On Tue, 2025-06-17 at 01:09 -0700, Vishal Annapurve wrote:
> > Sorry I quoted Ackerley's response wrongly. Here is the correct reference [1].
> 
> I'm confused...
> 
> > 
> > Speculative/transient refcounts came up a few times In the context of
> > guest_memfd discussions, some examples include: pagetable walkers,
> > page migration, speculative pagecache lookups, GUP-fast etc. David H
> > can provide more context here as needed.
> > 
> > Effectively some core-mm features that are present today or might land
> > in the future can cause folio refcounts to be grabbed for short
> > durations without actual access to underlying physical memory. These
> > scenarios are unlikely to happen for private memory but can't be
> > discounted completely.
> 
> This means the refcount could be increased for other reasons, and so guestmemfd
> shouldn't rely on refcounts for it's purposes? So, it is not a problem for other
> components handling the page elevate the refcount?
Besides that, in [3], when kvm_gmem_convert_should_proceed() determines whether
to convert to private, why is it allowed to just invoke
kvm_gmem_has_safe_refcount() without taking speculative/transient refcounts into
account? Isn't it easier for shared pages to have speculative/transient
refcounts?

[3] https://lore.kernel.org/lkml/d3832fd95a03aad562705872cbda5b3d248ca321.1747264138.git.ackerleytng@google.com/

> > 
> > Another reason to avoid relying on refcounts is to not block usage of
> > raw physical memory unmanaged by kernel (without page structs) to back
> > guest private memory as we had discussed previously. This will help
> > simplify merge/split operations during conversions and help usecases
> > like guest memory persistence [2] and non-confidential VMs.
> 
> If this becomes a thing for private memory (which it isn't yet), then couldn't
> we just change things at that point?
> 
> Is the only issue with TDX taking refcounts that it won't work with future code
> changes?
> 
> > 
> > [1] https://lore.kernel.org/lkml/diqz7c2lr6wg.fsf@ackerleytng-ctop.c.googlers.com/
> > [2] https://lore.kernel.org/lkml/20240805093245.889357-1-jgowans@amazon.com/
> 

^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 09/21] KVM: TDX: Enable 2MB mapping size after TD is RUNNABLE
  2025-06-17  0:52                                           ` Yan Zhao
  2025-06-18  0:30                                             ` Yan Zhao
@ 2025-06-18  1:22                                             ` Edgecombe, Rick P
  2025-06-18 11:32                                               ` Shutemov, Kirill
  1 sibling, 1 reply; 294+ messages in thread
From: Edgecombe, Rick P @ 2025-06-18  1:22 UTC (permalink / raw)
  To: Shutemov, Kirill, Zhao, Yan Y
  Cc: Du, Fan, Li, Xiaoyao, Huang, Kai, quic_eberman@quicinc.com,
	Hansen, Dave, david@redhat.com, thomas.lendacky@amd.com,
	vbabka@suse.cz, Li, Zhiquan1, kvm@vger.kernel.org,
	michael.roth@amd.com, linux-kernel@vger.kernel.org,
	seanjc@google.com, Peng, Chao P, pbonzini@redhat.com, Weiny, Ira,
	Yamahata, Isaku, binbin.wu@linux.intel.com,
	ackerleytng@google.com, Annapurve, Vishal, tabba@google.com,
	jroedel@suse.de, Miao, Jun, pgonda@google.com, x86@kernel.org

On Tue, 2025-06-17 at 08:52 +0800, Yan Zhao wrote:
> > hopefully is just handling accepting a whole range that is not 2MB aligned.
> > But
> > I think we need to verify this more.
> Ok.

In a Linux guest, if a memory region is not 2MB aligned, the guest will accept
the ends at 4K size. If a memory region is identical to a memslot range, this
will be fine: KVM will map the ends at 4K because it won't let huge pages span a
memslot. But if several memory regions are not 2MB aligned and are covered by
one large memslot, the accept will fail on the 4K ends under this proposal. I
don't know if this is a common configuration, but covering it in the TDX guest
may not be trivial.

So I think this will only work if guests can reasonably "merge" all of the
adjacent accepts, or if we declare a bunch of memory/memslot layouts illegal.

Kirill, how difficult would it be for the TDX Linux guest to merge all 2MB
adjacent accepts?

^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge pages
  2025-06-17  9:57                                                 ` Yan Zhao
@ 2025-06-18  4:25                                                   ` Vishal Annapurve
  0 siblings, 0 replies; 294+ messages in thread
From: Vishal Annapurve @ 2025-06-18  4:25 UTC (permalink / raw)
  To: Yan Zhao
  Cc: Ackerley Tng, pbonzini, seanjc, linux-kernel, kvm, x86,
	rick.p.edgecombe, dave.hansen, kirill.shutemov, tabba,
	quic_eberman, michael.roth, david, vbabka, jroedel,
	thomas.lendacky, pgonda, zhiquan1.li, fan.du, jun.miao, ira.weiny,
	isaku.yamahata, xiaoyao.li, binbin.wu, chao.p.peng

On Tue, Jun 17, 2025 at 3:00 AM Yan Zhao <yan.y.zhao@intel.com> wrote:
>
> On Tue, Jun 17, 2025 at 01:09:05AM -0700, Vishal Annapurve wrote:
> > On Mon, Jun 16, 2025 at 11:55 PM Yan Zhao <yan.y.zhao@intel.com> wrote:
> > >
> > > On Mon, Jun 16, 2025 at 08:51:41PM -0700, Vishal Annapurve wrote:
> > > > On Mon, Jun 16, 2025 at 3:02 AM Yan Zhao <yan.y.zhao@intel.com> wrote:
> > > > >
> > > > > On Wed, Jun 11, 2025 at 07:30:10AM -0700, Vishal Annapurve wrote:
> > > > > > On Wed, Jun 4, 2025 at 7:45 PM Yan Zhao <yan.y.zhao@intel.com> wrote:
> > > > > > >
> > > > > > > We need to restore to the previous status (which includes the host page table)
> > > > > > > if conversion can't be done.
> > > > > > > That said, in my view, a better flow would be:
> > > > > > >
> > > > > > > 1. guest_memfd sends a pre-invalidation request to users (users here means the
> > > > > > >    consumers in kernel of memory allocated from guest_memfd).
> > > > > > >
> > > > > > > 2. Users (A, B, ..., X) perform pre-checks to determine if invalidation can
> > > > > > >    proceed. For example, in the case of TDX, this might involve memory
> > > > > > >    allocation and page splitting.
> > > > > > >
> > > > > > > 3. Based on the pre-check results, guest_memfd either aborts the invalidation or
> > > > > > >    proceeds by sending the actual invalidation request.
> > > > > > >
> > > > > > > 4. Users (A-X) perform the actual unmap operation, ensuring it cannot fail. For
> > > > > > >    TDX, the unmap must succeed unless there are bugs in the KVM or TDX module.
> > > > > > >    In such cases, TDX can callback guest_memfd to inform the poison-status of
> > > > > > >    the page or elevate the page reference count.
> > > > > >
> > > > > > Few questions here:
> > > > > > 1) It sounds like the failure to remove entries from SEPT could only
> > > > > > be due to bugs in the KVM/TDX module,
> > > > > Yes.
> > > > >
> > > > > > how reliable would it be to
> > > > > > continue executing TDX VMs on the host once such bugs are hit?
> > > > > The TDX VMs will be killed. However, the private pages are still mapped in the
> > > > > SEPT (after the unmapping failure).
> > > > > The teardown flow for TDX VM is:
> > > > >
> > > > > do_exit
> > > > >   |->exit_files
> > > > >      |->kvm_gmem_release ==> (1) Unmap guest pages
> > > > >      |->release kvmfd
> > > > >         |->kvm_destroy_vm  (2) Reclaiming resources
> > > > >            |->kvm_arch_pre_destroy_vm  ==> Release hkid
> > > > >            |->kvm_arch_destroy_vm  ==> Reclaim SEPT page table pages
> > > > >
> > > > > Without holding page reference after (1) fails, the guest pages may have been
> > > > > re-assigned by the host OS while they are still still tracked in the TDX module.
> > > >
> > > > What happens to the pagetable memory holding the SEPT entry? Is that
> > > > also supposed to be leaked?
> > > It depends on if the reclaiming of the page table pages holding the SEPT entry
> > > fails. If it is, it will be also leaked.
> > > But the page to hold TDR is for sure to be leaked as the reclaiming of TDR page
> > > will fail after (1) fails.
> > >
> >
> > Ok. Few questions that I would like to touch base briefly on:
> > i) If (1) fails and then VM is marked as bugged, will the TDX module
> > actually access that page in context of the same VM again?
> In TDX module, the TD is marked as TD_TEARDOWN after step (2) when hkid is
> released successfully.
> Before that, TD is able to access the pages even if it is marked as buggy by KVM.
>
> After TD is marked as TD_TEARDOWN, since (1) fails, the problematic guest
> private pages are still tracked in the PAMT entries.
> So, re-assignment the same PFN to other TDs will fail.
>
> > ii) What all resources should remain unreclaimed if (1) fails?
> >      * page backing SEPT entry
> >      * page backing PAMT entry
> >      * TDMR
> >     If TDMR is the only one that fails to reclaim, will the TDX module
> > actually access the physical memory ever after the VM is cleaned up?
> > Otherwise, should all of these be made unreclaimable?
> From my understanding, they are
> - guest private pages
> - TDR page
> - PAMT entries for guest private pages and TDR page
>
>
> > iii) Will it be safe for the host to use that memory by proper
> > WBINVD/memory clearing sequence if TDX module/TD is not going to use
> > that memory?
> I'm not sure. But it should be impossible for host to re-assign the pages to
> other TDs as long as PAMT entries are not updated.
>
>
> > > > > > 2) Is it reliable to continue executing the host kernel and other
> > > > > > normal VMs once such bugs are hit?
> > > > > If with TDX holding the page ref count, the impact of unmapping failure of guest
> > > > > pages is just to leak those pages.
> > > > >
> > > > > > 3) Can the memory be reclaimed reliably if the VM is marked as dead
> > > > > > and cleaned up right away?
> > > > > As in the above flow, TDX needs to hold the page reference on unmapping failure
> > > > > until after reclaiming is successful. Well, reclaiming itself is possible to
> > > > > fail either.
> > > > >
> > > > > So, below is my proposal. Showed in the simple POC code based on
> > > > > https://github.com/googleprodkernel/linux-cc/commits/wip-tdx-gmem-conversions-hugetlb-2mept-v2.
> > > > >
> > > > > Patch 1: TDX increases page ref count on unmap failure.
> > > >
> > > > This will not work as Ackerley pointed out earlier [1], it will be
> > > > impossible to differentiate between transient refcounts on private
> > > > pages and extra refcounts of private memory due to TDX unmap failure.
> > > Hmm. why are there transient refcounts on private pages?
> > > And why should we differentiate the two?
> >
> > Sorry I quoted Ackerley's response wrongly. Here is the correct reference [1].
> >
> > Speculative/transient refcounts came up a few times In the context of
> > guest_memfd discussions, some examples include: pagetable walkers,
> > page migration, speculative pagecache lookups, GUP-fast etc. David H
> > can provide more context here as needed.
> GUP-fast only walks page tables for shared memory?
> Can other walkers get a private folio by walking shared mappings?

No, they can't. There can be walkers that parse direct map entries.

>
> On those speculative/transient refcounts came up, can't the
> kvm_gmem_convert_should_proceed() wait in an interruptible way before returning
> failure?

These refcounts can land at any time on any of the ranges, so a guest_memfd
implementation that bails out on errors would need a time-bounded wait for each
folio and would need to traverse each folio even before the actual
restructuring. That increases the complexity and latency of the conversion
operation.

>
> The wait will anyway happen after the conversion is started, i.e.,
> in filemap_remove_folio_for_restructuring().
>        while (!folio_ref_freeze(folio, filemap_refcount)) {
>                 /*
>                  * At this point only filemap refcounts are expected, hence okay
>                  * to spin until speculative refcounts go away.
>                  */
>                 WARN_ONCE(1, "Spinning on folio=%p refcount=%d", folio, folio_ref_count(folio));
>         }
>
>
> BTW, I noticed that there's no filemap_invalidate_lock_shared() in
> kvm_gmem_fault_shared() in
> https://lore.kernel.org/all/20250611133330.1514028-9-tabba@google.com/.
>
> Do you know why?

It will land when the guest_memfd in-place conversion support is posted.

>
> > Effectively some core-mm features that are present today or might land
> > in the future can cause folio refcounts to be grabbed for short
> > durations without actual access to underlying physical memory. These
> > scenarios are unlikely to happen for private memory but can't be
> > discounted completely.
> >
> > Another reason to avoid relying on refcounts is to not block usage of
> > raw physical memory unmanaged by kernel (without page structs) to back
> > guest private memory as we had discussed previously. This will help
> > simplify merge/split operations during conversions and help usecases
> > like guest memory persistence [2] and non-confidential VMs.
> Ok.
> Currently, "letting guest_memfd know exact ranges still being under use by the
> TDX module due to unmapping failures" is good enough for TDX, though full
> tracking of each GFN is even better.
>
>
> > [1] https://lore.kernel.org/lkml/diqz7c2lr6wg.fsf@ackerleytng-ctop.c.googlers.com/
> > [2] https://lore.kernel.org/lkml/20240805093245.889357-1-jgowans@amazon.com/
> >
> > >
> > >
> > > > [1] https://lore.kernel.org/lkml/diqzfrgfp95d.fsf@ackerleytng-ctop.c.googlers.com/
> > > >
> > > > > Patch 2: Bail out private-to-shared conversion if splitting fails.
> > > > > Patch 3: Make kvm_gmem_zap() return void.
> > > > >
> > > > > ...
> > > > >         /*
> > > > >
> > > > >
> > > > > If the above changes are agreeable, we could consider a more ambitious approach:
> > > > > introducing an interface like:
> > > > >
> > > > > int guest_memfd_add_page_ref_count(gfn_t gfn, int nr);
> > > > > int guest_memfd_dec_page_ref_count(gfn_t gfn, int nr);
> > > >
> > > > I don't see any reason to introduce full tracking of gfn mapping
> > > > status in SEPTs just to handle very rare scenarios which KVM/TDX are
> > > > taking utmost care to avoid.
> > > >
> > > > That being said, I see value in letting guest_memfd know exact ranges
> > > > still being under use by the TDX module due to unmapping failures.
> > > > guest_memfd can take the right action instead of relying on refcounts.
> > > >
> > > > Does KVM continue unmapping the full range even after TDX SEPT
> > > > management fails to unmap a subrange?
> > > Yes, if there's no bug in KVM, it will continue unmapping the full ranges.

^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge pages
  2025-06-18  0:34                                                 ` Edgecombe, Rick P
  2025-06-18  0:46                                                   ` Yan Zhao
@ 2025-06-18  4:29                                                   ` Vishal Annapurve
  2025-06-19  0:22                                                     ` Edgecombe, Rick P
  1 sibling, 1 reply; 294+ messages in thread
From: Vishal Annapurve @ 2025-06-18  4:29 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: Zhao, Yan Y, kvm@vger.kernel.org, quic_eberman@quicinc.com,
	Li, Xiaoyao, Shutemov, Kirill, Hansen, Dave, david@redhat.com,
	thomas.lendacky@amd.com, tabba@google.com, vbabka@suse.cz,
	Du, Fan, michael.roth@amd.com, seanjc@google.com, Weiny, Ira,
	linux-kernel@vger.kernel.org, pbonzini@redhat.com,
	ackerleytng@google.com, Yamahata, Isaku,
	binbin.wu@linux.intel.com, Peng, Chao P, Li, Zhiquan1,
	jroedel@suse.de, Miao, Jun, pgonda@google.com, x86@kernel.org

On Tue, Jun 17, 2025 at 5:34 PM Edgecombe, Rick P
<rick.p.edgecombe@intel.com> wrote:
>
> On Tue, 2025-06-17 at 01:09 -0700, Vishal Annapurve wrote:
> > Sorry I quoted Ackerley's response wrongly. Here is the correct reference [1].
>
> I'm confused...
>
> >
> > Speculative/transient refcounts came up a few times In the context of
> > guest_memfd discussions, some examples include: pagetable walkers,
> > page migration, speculative pagecache lookups, GUP-fast etc. David H
> > can provide more context here as needed.
> >
> > Effectively some core-mm features that are present today or might land
> > in the future can cause folio refcounts to be grabbed for short
> > durations without actual access to underlying physical memory. These
> > scenarios are unlikely to happen for private memory but can't be
> > discounted completely.
>
> This means the refcount could be increased for other reasons, and so guestmemfd
> shouldn't rely on refcounts for it's purposes? So, it is not a problem for other
> components handling the page elevate the refcount?

It's simpler to handle the transient refcounts as there are the following options:
1) Wait for a small amount of time.
2) Keep the folio refcounts frozen to zero at all times, which will
effectively eliminate the scenario of transient refcounts (see the sketch
below).
3) Use raw memory without page structs - unmanaged by the kernel.
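
A minimal sketch of option 2, assuming guest_memfd holds the only reference (an
expected count of 1) while the page backs private memory; illustrative only, not
code from any posted series:

static bool gmem_freeze_private_folio(struct folio *folio)
{
	/* Succeeds only when ours is the single remaining reference. */
	return folio_ref_freeze(folio, 1);
}

static void gmem_unfreeze_private_folio(struct folio *folio)
{
	/* Restore the single reference, e.g. on conversion back to shared. */
	folio_ref_unfreeze(folio, 1);
}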

>
> >
> > Another reason to avoid relying on refcounts is to not block usage of
> > raw physical memory unmanaged by kernel (without page structs) to back
> > guest private memory as we had discussed previously. This will help
> > simplify merge/split operations during conversions and help usecases
> > like guest memory persistence [2] and non-confidential VMs.
>
> If this becomes a thing for private memory (which it isn't yet), then couldn't
> we just change things at that point?

It would be great to avoid having to go through the discussion again if we
have good reasons to handle it now.

>
> Is the only issue with TDX taking refcounts that it won't work with future code
> changes?
>
> >
> > [1] https://lore.kernel.org/lkml/diqz7c2lr6wg.fsf@ackerleytng-ctop.c.googlers.com/
> > [2] https://lore.kernel.org/lkml/20240805093245.889357-1-jgowans@amazon.com/
>

^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge pages
  2025-06-18  0:46                                                   ` Yan Zhao
@ 2025-06-18  4:33                                                     ` Vishal Annapurve
  2025-06-18  6:13                                                       ` Yan Zhao
  0 siblings, 1 reply; 294+ messages in thread
From: Vishal Annapurve @ 2025-06-18  4:33 UTC (permalink / raw)
  To: Yan Zhao
  Cc: Edgecombe, Rick P, kvm@vger.kernel.org, quic_eberman@quicinc.com,
	Li, Xiaoyao, Shutemov, Kirill, Hansen, Dave, david@redhat.com,
	thomas.lendacky@amd.com, tabba@google.com, vbabka@suse.cz,
	Du, Fan, michael.roth@amd.com, seanjc@google.com, Weiny, Ira,
	linux-kernel@vger.kernel.org, pbonzini@redhat.com,
	ackerleytng@google.com, Yamahata, Isaku,
	binbin.wu@linux.intel.com, Peng, Chao P, Li, Zhiquan1,
	jroedel@suse.de, Miao, Jun, pgonda@google.com, x86@kernel.org

On Tue, Jun 17, 2025 at 5:49 PM Yan Zhao <yan.y.zhao@intel.com> wrote:
>
> On Wed, Jun 18, 2025 at 08:34:24AM +0800, Edgecombe, Rick P wrote:
> > On Tue, 2025-06-17 at 01:09 -0700, Vishal Annapurve wrote:
> > > Sorry I quoted Ackerley's response wrongly. Here is the correct reference [1].
> >
> > I'm confused...
> >
> > >
> > > Speculative/transient refcounts came up a few times In the context of
> > > guest_memfd discussions, some examples include: pagetable walkers,
> > > page migration, speculative pagecache lookups, GUP-fast etc. David H
> > > can provide more context here as needed.
> > >
> > > Effectively some core-mm features that are present today or might land
> > > in the future can cause folio refcounts to be grabbed for short
> > > durations without actual access to underlying physical memory. These
> > > scenarios are unlikely to happen for private memory but can't be
> > > discounted completely.
> >
> > This means the refcount could be increased for other reasons, and so guestmemfd
> > shouldn't rely on refcounts for it's purposes? So, it is not a problem for other
> > components handling the page elevate the refcount?
> Besides that, in [3], when kvm_gmem_convert_should_proceed() determines whether
> to convert to private, why is it allowed to just invoke
> kvm_gmem_has_safe_refcount() without taking speculative/transient refcounts into
> account? Isn't it more easier for shared pages to have speculative/transient
> refcounts?

These speculative refcounts are taken into account: in case of unsafe
refcounts, the conversion operation immediately exits to userspace with
EAGAIN and userspace is supposed to retry the conversion.

Yes, it's easier for shared pages to have speculative/transient refcounts.

>
> [3] https://lore.kernel.org/lkml/d3832fd95a03aad562705872cbda5b3d248ca321.1747264138.git.ackerleytng@google.com/
>
> > >
> > > Another reason to avoid relying on refcounts is to not block usage of
> > > raw physical memory unmanaged by kernel (without page structs) to back
> > > guest private memory as we had discussed previously. This will help
> > > simplify merge/split operations during conversions and help usecases
> > > like guest memory persistence [2] and non-confidential VMs.
> >
> > If this becomes a thing for private memory (which it isn't yet), then couldn't
> > we just change things at that point?
> >
> > Is the only issue with TDX taking refcounts that it won't work with future code
> > changes?
> >
> > >
> > > [1] https://lore.kernel.org/lkml/diqz7c2lr6wg.fsf@ackerleytng-ctop.c.googlers.com/
> > > [2] https://lore.kernel.org/lkml/20240805093245.889357-1-jgowans@amazon.com/
> >

^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge pages
  2025-06-18  4:33                                                     ` Vishal Annapurve
@ 2025-06-18  6:13                                                       ` Yan Zhao
  2025-06-18  6:21                                                         ` Vishal Annapurve
  0 siblings, 1 reply; 294+ messages in thread
From: Yan Zhao @ 2025-06-18  6:13 UTC (permalink / raw)
  To: Vishal Annapurve
  Cc: Edgecombe, Rick P, kvm@vger.kernel.org, quic_eberman@quicinc.com,
	Li, Xiaoyao, Shutemov, Kirill, Hansen, Dave, david@redhat.com,
	thomas.lendacky@amd.com, tabba@google.com, vbabka@suse.cz,
	Du, Fan, michael.roth@amd.com, seanjc@google.com, Weiny, Ira,
	linux-kernel@vger.kernel.org, pbonzini@redhat.com,
	ackerleytng@google.com, Yamahata, Isaku,
	binbin.wu@linux.intel.com, Peng, Chao P, Li, Zhiquan1,
	jroedel@suse.de, Miao, Jun, pgonda@google.com, x86@kernel.org

On Tue, Jun 17, 2025 at 09:33:02PM -0700, Vishal Annapurve wrote:
> On Tue, Jun 17, 2025 at 5:49 PM Yan Zhao <yan.y.zhao@intel.com> wrote:
> >
> > On Wed, Jun 18, 2025 at 08:34:24AM +0800, Edgecombe, Rick P wrote:
> > > On Tue, 2025-06-17 at 01:09 -0700, Vishal Annapurve wrote:
> > > > Sorry I quoted Ackerley's response wrongly. Here is the correct reference [1].
> > >
> > > I'm confused...
> > >
> > > >
> > > > Speculative/transient refcounts came up a few times In the context of
> > > > guest_memfd discussions, some examples include: pagetable walkers,
> > > > page migration, speculative pagecache lookups, GUP-fast etc. David H
> > > > can provide more context here as needed.
> > > >
> > > > Effectively some core-mm features that are present today or might land
> > > > in the future can cause folio refcounts to be grabbed for short
> > > > durations without actual access to underlying physical memory. These
> > > > scenarios are unlikely to happen for private memory but can't be
> > > > discounted completely.
> > >
> > > This means the refcount could be increased for other reasons, and so guestmemfd
> > > shouldn't rely on refcounts for it's purposes? So, it is not a problem for other
> > > components handling the page elevate the refcount?
> > Besides that, in [3], when kvm_gmem_convert_should_proceed() determines whether
> > to convert to private, why is it allowed to just invoke
> > kvm_gmem_has_safe_refcount() without taking speculative/transient refcounts into
> > account? Isn't it more easier for shared pages to have speculative/transient
> > refcounts?
> 
> These speculative refcounts are taken into account, in case of unsafe
> refcounts, conversion operation immediately exits to userspace with
> EAGAIN and userspace is supposed to retry conversion.
Hmm, so why can't private-to-shared conversion also exit to userspace with
EAGAIN?

In the POC
https://lore.kernel.org/lkml/aE%2Fq9VKkmaCcuwpU@yzhao56-desk.sh.intel.com,
kvm_gmem_convert_should_proceed() just returns EFAULT (can be modified to
EAGAIN) to userspace instead.

> 
> Yes, it's more easier for shared pages to have speculative/transient refcounts.
> 
> >
> > [3] https://lore.kernel.org/lkml/d3832fd95a03aad562705872cbda5b3d248ca321.1747264138.git.ackerleytng@google.com/
> >
> > > >
> > > > Another reason to avoid relying on refcounts is to not block usage of
> > > > raw physical memory unmanaged by kernel (without page structs) to back
> > > > guest private memory as we had discussed previously. This will help
> > > > simplify merge/split operations during conversions and help usecases
> > > > like guest memory persistence [2] and non-confidential VMs.
> > >
> > > If this becomes a thing for private memory (which it isn't yet), then couldn't
> > > we just change things at that point?
> > >
> > > Is the only issue with TDX taking refcounts that it won't work with future code
> > > changes?
> > >
> > > >
> > > > [1] https://lore.kernel.org/lkml/diqz7c2lr6wg.fsf@ackerleytng-ctop.c.googlers.com/
> > > > [2] https://lore.kernel.org/lkml/20240805093245.889357-1-jgowans@amazon.com/
> > >

^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge pages
  2025-06-18  6:13                                                       ` Yan Zhao
@ 2025-06-18  6:21                                                         ` Vishal Annapurve
  2025-06-18  6:32                                                           ` Yan Zhao
  0 siblings, 1 reply; 294+ messages in thread
From: Vishal Annapurve @ 2025-06-18  6:21 UTC (permalink / raw)
  To: Yan Zhao
  Cc: Edgecombe, Rick P, kvm@vger.kernel.org, quic_eberman@quicinc.com,
	Li, Xiaoyao, Shutemov, Kirill, Hansen, Dave, david@redhat.com,
	thomas.lendacky@amd.com, tabba@google.com, vbabka@suse.cz,
	Du, Fan, michael.roth@amd.com, seanjc@google.com, Weiny, Ira,
	linux-kernel@vger.kernel.org, pbonzini@redhat.com,
	ackerleytng@google.com, Yamahata, Isaku,
	binbin.wu@linux.intel.com, Peng, Chao P, Li, Zhiquan1,
	jroedel@suse.de, Miao, Jun, pgonda@google.com, x86@kernel.org

On Tue, Jun 17, 2025 at 11:15 PM Yan Zhao <yan.y.zhao@intel.com> wrote:
>
> On Tue, Jun 17, 2025 at 09:33:02PM -0700, Vishal Annapurve wrote:
> > On Tue, Jun 17, 2025 at 5:49 PM Yan Zhao <yan.y.zhao@intel.com> wrote:
> > >
> > > On Wed, Jun 18, 2025 at 08:34:24AM +0800, Edgecombe, Rick P wrote:
> > > > On Tue, 2025-06-17 at 01:09 -0700, Vishal Annapurve wrote:
> > > > > Sorry I quoted Ackerley's response wrongly. Here is the correct reference [1].
> > > >
> > > > I'm confused...
> > > >
> > > > >
> > > > > Speculative/transient refcounts came up a few times In the context of
> > > > > guest_memfd discussions, some examples include: pagetable walkers,
> > > > > page migration, speculative pagecache lookups, GUP-fast etc. David H
> > > > > can provide more context here as needed.
> > > > >
> > > > > Effectively some core-mm features that are present today or might land
> > > > > in the future can cause folio refcounts to be grabbed for short
> > > > > durations without actual access to underlying physical memory. These
> > > > > scenarios are unlikely to happen for private memory but can't be
> > > > > discounted completely.
> > > >
> > > > This means the refcount could be increased for other reasons, and so guestmemfd
> > > > shouldn't rely on refcounts for it's purposes? So, it is not a problem for other
> > > > components handling the page elevate the refcount?
> > > Besides that, in [3], when kvm_gmem_convert_should_proceed() determines whether
> > > to convert to private, why is it allowed to just invoke
> > > kvm_gmem_has_safe_refcount() without taking speculative/transient refcounts into
> > > account? Isn't it more easier for shared pages to have speculative/transient
> > > refcounts?
> >
> > These speculative refcounts are taken into account, in case of unsafe
> > refcounts, conversion operation immediately exits to userspace with
> > EAGAIN and userspace is supposed to retry conversion.
> Hmm, so why can't private-to-shared conversion also exit to userspace with
> EAGAIN?

How would userspace/guest_memfd differentiate between
speculative/transient refcounts and extra refcounts due to TDX unmap
failures?

>
> In the POC
> https://lore.kernel.org/lkml/aE%2Fq9VKkmaCcuwpU@yzhao56-desk.sh.intel.com,
> kvm_gmem_convert_should_proceed() just returns EFAULT (can be modified to
> EAGAIN) to userspace instead.
>
> >
> > Yes, it's more easier for shared pages to have speculative/transient refcounts.
> >
> > >
> > > [3] https://lore.kernel.org/lkml/d3832fd95a03aad562705872cbda5b3d248ca321.1747264138.git.ackerleytng@google.com/
> > >
> > > > >
> > > > > Another reason to avoid relying on refcounts is to not block usage of
> > > > > raw physical memory unmanaged by kernel (without page structs) to back
> > > > > guest private memory as we had discussed previously. This will help
> > > > > simplify merge/split operations during conversions and help usecases
> > > > > like guest memory persistence [2] and non-confidential VMs.
> > > >
> > > > If this becomes a thing for private memory (which it isn't yet), then couldn't
> > > > we just change things at that point?
> > > >
> > > > Is the only issue with TDX taking refcounts that it won't work with future code
> > > > changes?
> > > >
> > > > >
> > > > > [1] https://lore.kernel.org/lkml/diqz7c2lr6wg.fsf@ackerleytng-ctop.c.googlers.com/
> > > > > [2] https://lore.kernel.org/lkml/20240805093245.889357-1-jgowans@amazon.com/
> > > >

^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge pages
  2025-06-18  6:21                                                         ` Vishal Annapurve
@ 2025-06-18  6:32                                                           ` Yan Zhao
  2025-06-18  6:44                                                             ` Vishal Annapurve
  0 siblings, 1 reply; 294+ messages in thread
From: Yan Zhao @ 2025-06-18  6:32 UTC (permalink / raw)
  To: Vishal Annapurve
  Cc: Edgecombe, Rick P, kvm@vger.kernel.org, quic_eberman@quicinc.com,
	Li, Xiaoyao, Shutemov, Kirill, Hansen, Dave, david@redhat.com,
	thomas.lendacky@amd.com, tabba@google.com, vbabka@suse.cz,
	Du, Fan, michael.roth@amd.com, seanjc@google.com, Weiny, Ira,
	linux-kernel@vger.kernel.org, pbonzini@redhat.com,
	ackerleytng@google.com, Yamahata, Isaku,
	binbin.wu@linux.intel.com, Peng, Chao P, Li, Zhiquan1,
	jroedel@suse.de, Miao, Jun, pgonda@google.com, x86@kernel.org

On Tue, Jun 17, 2025 at 11:21:41PM -0700, Vishal Annapurve wrote:
> On Tue, Jun 17, 2025 at 11:15 PM Yan Zhao <yan.y.zhao@intel.com> wrote:
> >
> > On Tue, Jun 17, 2025 at 09:33:02PM -0700, Vishal Annapurve wrote:
> > > On Tue, Jun 17, 2025 at 5:49 PM Yan Zhao <yan.y.zhao@intel.com> wrote:
> > > >
> > > > On Wed, Jun 18, 2025 at 08:34:24AM +0800, Edgecombe, Rick P wrote:
> > > > > On Tue, 2025-06-17 at 01:09 -0700, Vishal Annapurve wrote:
> > > > > > Sorry I quoted Ackerley's response wrongly. Here is the correct reference [1].
> > > > >
> > > > > I'm confused...
> > > > >
> > > > > >
> > > > > > Speculative/transient refcounts came up a few times In the context of
> > > > > > guest_memfd discussions, some examples include: pagetable walkers,
> > > > > > page migration, speculative pagecache lookups, GUP-fast etc. David H
> > > > > > can provide more context here as needed.
> > > > > >
> > > > > > Effectively some core-mm features that are present today or might land
> > > > > > in the future can cause folio refcounts to be grabbed for short
> > > > > > durations without actual access to underlying physical memory. These
> > > > > > scenarios are unlikely to happen for private memory but can't be
> > > > > > discounted completely.
> > > > >
> > > > > This means the refcount could be increased for other reasons, and so guestmemfd
> > > > > shouldn't rely on refcounts for it's purposes? So, it is not a problem for other
> > > > > components handling the page elevate the refcount?
> > > > Besides that, in [3], when kvm_gmem_convert_should_proceed() determines whether
> > > > to convert to private, why is it allowed to just invoke
> > > > kvm_gmem_has_safe_refcount() without taking speculative/transient refcounts into
> > > > account? Isn't it more easier for shared pages to have speculative/transient
> > > > refcounts?
> > >
> > > These speculative refcounts are taken into account, in case of unsafe
> > > refcounts, conversion operation immediately exits to userspace with
> > > EAGAIN and userspace is supposed to retry conversion.
> > Hmm, so why can't private-to-shared conversion also exit to userspace with
> > EAGAIN?
> 
> How would userspace/guest_memfd differentiate between
> speculative/transient refcounts and extra refcounts due to TDX unmap
> failures?
Hmm, it also can't differentiate between speculative/transient refcounts and
extra refcounts on shared folios due to other reasons.

> 
> >
> > In the POC
> > https://lore.kernel.org/lkml/aE%2Fq9VKkmaCcuwpU@yzhao56-desk.sh.intel.com,
> > kvm_gmem_convert_should_proceed() just returns EFAULT (can be modified to
> > EAGAIN) to userspace instead.
> >
> > >
> > > Yes, it's more easier for shared pages to have speculative/transient refcounts.
> > >
> > > >
> > > > [3] https://lore.kernel.org/lkml/d3832fd95a03aad562705872cbda5b3d248ca321.1747264138.git.ackerleytng@google.com/
> > > >
> > > > > >
> > > > > > Another reason to avoid relying on refcounts is to not block usage of
> > > > > > raw physical memory unmanaged by kernel (without page structs) to back
> > > > > > guest private memory as we had discussed previously. This will help
> > > > > > simplify merge/split operations during conversions and help usecases
> > > > > > like guest memory persistence [2] and non-confidential VMs.
> > > > >
> > > > > If this becomes a thing for private memory (which it isn't yet), then couldn't
> > > > > we just change things at that point?
> > > > >
> > > > > Is the only issue with TDX taking refcounts that it won't work with future code
> > > > > changes?
> > > > >
> > > > > >
> > > > > > [1] https://lore.kernel.org/lkml/diqz7c2lr6wg.fsf@ackerleytng-ctop.c.googlers.com/
> > > > > > [2] https://lore.kernel.org/lkml/20240805093245.889357-1-jgowans@amazon.com/
> > > > >

^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge pages
  2025-06-18  6:32                                                           ` Yan Zhao
@ 2025-06-18  6:44                                                             ` Vishal Annapurve
  2025-06-18  6:57                                                               ` Yan Zhao
  0 siblings, 1 reply; 294+ messages in thread
From: Vishal Annapurve @ 2025-06-18  6:44 UTC (permalink / raw)
  To: Yan Zhao
  Cc: Edgecombe, Rick P, kvm@vger.kernel.org, quic_eberman@quicinc.com,
	Li, Xiaoyao, Shutemov, Kirill, Hansen, Dave, david@redhat.com,
	thomas.lendacky@amd.com, tabba@google.com, vbabka@suse.cz,
	Du, Fan, michael.roth@amd.com, seanjc@google.com, Weiny, Ira,
	linux-kernel@vger.kernel.org, pbonzini@redhat.com,
	ackerleytng@google.com, Yamahata, Isaku,
	binbin.wu@linux.intel.com, Peng, Chao P, Li, Zhiquan1,
	jroedel@suse.de, Miao, Jun, pgonda@google.com, x86@kernel.org

On Tue, Jun 17, 2025 at 11:34 PM Yan Zhao <yan.y.zhao@intel.com> wrote:
>
> On Tue, Jun 17, 2025 at 11:21:41PM -0700, Vishal Annapurve wrote:
> > On Tue, Jun 17, 2025 at 11:15 PM Yan Zhao <yan.y.zhao@intel.com> wrote:
> > >
> > > On Tue, Jun 17, 2025 at 09:33:02PM -0700, Vishal Annapurve wrote:
> > > > On Tue, Jun 17, 2025 at 5:49 PM Yan Zhao <yan.y.zhao@intel.com> wrote:
> > > > >
> > > > > On Wed, Jun 18, 2025 at 08:34:24AM +0800, Edgecombe, Rick P wrote:
> > > > > > On Tue, 2025-06-17 at 01:09 -0700, Vishal Annapurve wrote:
> > > > > > > Sorry I quoted Ackerley's response wrongly. Here is the correct reference [1].
> > > > > >
> > > > > > I'm confused...
> > > > > >
> > > > > > >
> > > > > > > Speculative/transient refcounts came up a few times In the context of
> > > > > > > guest_memfd discussions, some examples include: pagetable walkers,
> > > > > > > page migration, speculative pagecache lookups, GUP-fast etc. David H
> > > > > > > can provide more context here as needed.
> > > > > > >
> > > > > > > Effectively some core-mm features that are present today or might land
> > > > > > > in the future can cause folio refcounts to be grabbed for short
> > > > > > > durations without actual access to underlying physical memory. These
> > > > > > > scenarios are unlikely to happen for private memory but can't be
> > > > > > > discounted completely.
> > > > > >
> > > > > > This means the refcount could be increased for other reasons, and so guestmemfd
> > > > > > shouldn't rely on refcounts for it's purposes? So, it is not a problem for other
> > > > > > components handling the page elevate the refcount?
> > > > > Besides that, in [3], when kvm_gmem_convert_should_proceed() determines whether
> > > > > to convert to private, why is it allowed to just invoke
> > > > > kvm_gmem_has_safe_refcount() without taking speculative/transient refcounts into
> > > > > account? Isn't it more easier for shared pages to have speculative/transient
> > > > > refcounts?
> > > >
> > > > These speculative refcounts are taken into account, in case of unsafe
> > > > refcounts, conversion operation immediately exits to userspace with
> > > > EAGAIN and userspace is supposed to retry conversion.
> > > Hmm, so why can't private-to-shared conversion also exit to userspace with
> > > EAGAIN?
> >
> > How would userspace/guest_memfd differentiate between
> > speculative/transient refcounts and extra refcounts due to TDX unmap
> > failures?
> Hmm, it also can't differentiate between speculative/transient refcounts and
> extra refcounts on shared folios due to other reasons.
>

In case of shared memory ranges, userspace is effectively responsible
for extra refcounts and can act towards removing them if not done
already. If "extra" refcounts are taken care of then the only
remaining scenario is speculative/transient refcounts.

But for private memory ranges, userspace is not responsible for any
refcounts landing on them.

^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge pages
  2025-06-18  6:44                                                             ` Vishal Annapurve
@ 2025-06-18  6:57                                                               ` Yan Zhao
  0 siblings, 0 replies; 294+ messages in thread
From: Yan Zhao @ 2025-06-18  6:57 UTC (permalink / raw)
  To: Vishal Annapurve
  Cc: Edgecombe, Rick P, kvm@vger.kernel.org, quic_eberman@quicinc.com,
	Li, Xiaoyao, Shutemov, Kirill, Hansen, Dave, david@redhat.com,
	thomas.lendacky@amd.com, tabba@google.com, vbabka@suse.cz,
	Du, Fan, michael.roth@amd.com, seanjc@google.com, Weiny, Ira,
	linux-kernel@vger.kernel.org, pbonzini@redhat.com,
	ackerleytng@google.com, Yamahata, Isaku,
	binbin.wu@linux.intel.com, Peng, Chao P, Li, Zhiquan1,
	jroedel@suse.de, Miao, Jun, pgonda@google.com, x86@kernel.org

On Tue, Jun 17, 2025 at 11:44:34PM -0700, Vishal Annapurve wrote:
> On Tue, Jun 17, 2025 at 11:34 PM Yan Zhao <yan.y.zhao@intel.com> wrote:
> >
> > On Tue, Jun 17, 2025 at 11:21:41PM -0700, Vishal Annapurve wrote:
> > > On Tue, Jun 17, 2025 at 11:15 PM Yan Zhao <yan.y.zhao@intel.com> wrote:
> > > >
> > > > On Tue, Jun 17, 2025 at 09:33:02PM -0700, Vishal Annapurve wrote:
> > > > > On Tue, Jun 17, 2025 at 5:49 PM Yan Zhao <yan.y.zhao@intel.com> wrote:
> > > > > >
> > > > > > On Wed, Jun 18, 2025 at 08:34:24AM +0800, Edgecombe, Rick P wrote:
> > > > > > > On Tue, 2025-06-17 at 01:09 -0700, Vishal Annapurve wrote:
> > > > > > > > Sorry I quoted Ackerley's response wrongly. Here is the correct reference [1].
> > > > > > >
> > > > > > > I'm confused...
> > > > > > >
> > > > > > > >
> > > > > > > > Speculative/transient refcounts came up a few times In the context of
> > > > > > > > guest_memfd discussions, some examples include: pagetable walkers,
> > > > > > > > page migration, speculative pagecache lookups, GUP-fast etc. David H
> > > > > > > > can provide more context here as needed.
> > > > > > > >
> > > > > > > > Effectively some core-mm features that are present today or might land
> > > > > > > > in the future can cause folio refcounts to be grabbed for short
> > > > > > > > durations without actual access to underlying physical memory. These
> > > > > > > > scenarios are unlikely to happen for private memory but can't be
> > > > > > > > discounted completely.
> > > > > > >
> > > > > > > This means the refcount could be increased for other reasons, and so guestmemfd
> > > > > > > shouldn't rely on refcounts for it's purposes? So, it is not a problem for other
> > > > > > > components handling the page elevate the refcount?
> > > > > > Besides that, in [3], when kvm_gmem_convert_should_proceed() determines whether
> > > > > > to convert to private, why is it allowed to just invoke
> > > > > > kvm_gmem_has_safe_refcount() without taking speculative/transient refcounts into
> > > > > > account? Isn't it more easier for shared pages to have speculative/transient
> > > > > > refcounts?
> > > > >
> > > > > These speculative refcounts are taken into account, in case of unsafe
> > > > > refcounts, conversion operation immediately exits to userspace with
> > > > > EAGAIN and userspace is supposed to retry conversion.
> > > > Hmm, so why can't private-to-shared conversion also exit to userspace with
> > > > EAGAIN?
> > >
> > > How would userspace/guest_memfd differentiate between
> > > speculative/transient refcounts and extra refcounts due to TDX unmap
> > > failures?
> > Hmm, it also can't differentiate between speculative/transient refcounts and
> > extra refcounts on shared folios due to other reasons.
> >
> 
> In case of shared memory ranges, userspace is effectively responsible
> for extra refcounts and can act towards removing them if not done
> already. If "extra" refcounts are taken care of then the only
> remaining scenario is speculative/transient refcounts.
> 
> But for private memory ranges, userspace is not responsible for any
> refcounts landing on them.
Ok. The similarities between the two cases are:
- userspace can't help with speculative/transient refcounts.
- userspace can't make the conversion succeed while "extra" refcounts are held,
  whether by userspace or by TDX.

But I think I get your point that EAGAIN is not the right code when the "extra"
refcounts are held by TDX.

^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 09/21] KVM: TDX: Enable 2MB mapping size after TD is RUNNABLE
  2025-06-18  1:22                                             ` Edgecombe, Rick P
@ 2025-06-18 11:32                                               ` Shutemov, Kirill
  2025-06-20 16:32                                                 ` Sean Christopherson
  0 siblings, 1 reply; 294+ messages in thread
From: Shutemov, Kirill @ 2025-06-18 11:32 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: Zhao, Yan Y, Du, Fan, Li, Xiaoyao, Huang, Kai,
	quic_eberman@quicinc.com, Hansen, Dave, david@redhat.com,
	thomas.lendacky@amd.com, vbabka@suse.cz, Li, Zhiquan1,
	kvm@vger.kernel.org, michael.roth@amd.com,
	linux-kernel@vger.kernel.org, seanjc@google.com, Peng, Chao P,
	pbonzini@redhat.com, Weiny, Ira, Yamahata, Isaku,
	binbin.wu@linux.intel.com, ackerleytng@google.com,
	Annapurve, Vishal, tabba@google.com, jroedel@suse.de, Miao, Jun,
	pgonda@google.com, x86@kernel.org

On Wed, Jun 18, 2025 at 04:22:59AM +0300, Edgecombe, Rick P wrote:
> On Tue, 2025-06-17 at 08:52 +0800, Yan Zhao wrote:
> > > hopefully is just handling accepting a whole range that is not 2MB aligned.
> > > But
> > > I think we need to verify this more.
> > Ok.
> 
> In Linux guest if a memory region is not 2MB aligned the guest will accept the
> ends at 4k size. If a memory region is identical to a memslot range this will be
> fine. KVM will map the ends at 4k because it won't let huge pages span a
> memslot. But if several memory regions are not 2MB aligned and are covered by
> one large memslot, the accept will fail on the 4k ends under this proposal. I
> don't know if this is a common configuration, but to cover it in the TDX guest
> may not be trivial.
> 
> So I think this will only work if guests can reasonably "merge" all of the
> adjacent accepts. Or of we declare a bunch of memory/memslot layouts illegal.
> 
> Kirill, how difficult would it be for TDX Linux guest to merge all 2MB adjacent
> accepts?

Hm. What do you mean by merging?

The kernel only accepts at 4k granularity during early boot -- in the EFI
stub. The bitmap we use to track unaccepted memory tracks the status at 2M
granularity, and all later accept requests are issued on 2M pages with a
fallback to 4k.
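
Roughly (an illustrative sketch only; tdx_accept_page() below is a stand-in
for the real accept helpers, and the range is assumed to be 2M-aligned):

static void accept_range(phys_addr_t start, phys_addr_t end)
{
        for (phys_addr_t pa = start; pa < end; pa += PMD_SIZE) {
                /* Try to accept the whole 2M page first... */
                if (tdx_accept_page(pa, PMD_SIZE))
                        continue;

                /* ...and fall back to 4k accepts if that fails. */
                for (phys_addr_t p = pa; p < pa + PMD_SIZE; p += PAGE_SIZE)
                        tdx_accept_page(p, PAGE_SIZE);
        }
}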

-- 
  Kiryl Shutsemau / Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge pages
  2025-06-18  4:29                                                   ` Vishal Annapurve
@ 2025-06-19  0:22                                                     ` Edgecombe, Rick P
  0 siblings, 0 replies; 294+ messages in thread
From: Edgecombe, Rick P @ 2025-06-19  0:22 UTC (permalink / raw)
  To: Annapurve, Vishal
  Cc: quic_eberman@quicinc.com, Li, Xiaoyao, Shutemov, Kirill,
	Hansen, Dave, david@redhat.com, thomas.lendacky@amd.com,
	Zhao, Yan Y, tabba@google.com, Du, Fan, michael.roth@amd.com,
	seanjc@google.com, Weiny, Ira, vbabka@suse.cz,
	pbonzini@redhat.com, ackerleytng@google.com,
	linux-kernel@vger.kernel.org, binbin.wu@linux.intel.com,
	Yamahata, Isaku, Peng, Chao P, kvm@vger.kernel.org,
	jroedel@suse.de, Miao, Jun, Li, Zhiquan1, pgonda@google.com,
	x86@kernel.org

On Tue, 2025-06-17 at 21:29 -0700, Vishal Annapurve wrote:
> > This means the refcount could be increased for other reasons, and so
> > guestmemfd
> > shouldn't rely on refcounts for it's purposes? So, it is not a problem for
> > other
> > components handling the page elevate the refcount?
> 
> It's simpler to handle the transient refcounts as there are following options:
> 1) Wait for a small amount of time
> 2) Keep the folio refcounts frozen to zero at all times, which will
> effectively eliminate the scenario of transient refcounts.
> 3) Use raw memory without page structs - unmanaged by kernel.
> 
> > 
> > > 
> > > Another reason to avoid relying on refcounts is to not block usage of
> > > raw physical memory unmanaged by kernel (without page structs) to back
> > > guest private memory as we had discussed previously. This will help
> > > simplify merge/split operations during conversions and help usecases
> > > like guest memory persistence [2] and non-confidential VMs.
> > 
> > If this becomes a thing for private memory (which it isn't yet), then
> > couldn't
> > we just change things at that point?
> 
> It would be great to avoid having to go through discussion again, if
> we have good reasons to handle it now.

I thought we had already come to an agreement on whether to spend time
pre-designing for future things. This thread has gotten pretty long; can we
stick to the current problems in an effort to close it?


^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge pages
  2025-06-05 22:35                                       ` Ackerley Tng
@ 2025-06-19  8:11                                         ` Yan Zhao
  2025-06-20 18:06                                           ` Vishal Annapurve
  2025-07-16  1:23                                         ` Yan Zhao
  1 sibling, 1 reply; 294+ messages in thread
From: Yan Zhao @ 2025-06-19  8:11 UTC (permalink / raw)
  To: Ackerley Tng
  Cc: vannapurve, pbonzini, seanjc, linux-kernel, kvm, x86,
	rick.p.edgecombe, dave.hansen, kirill.shutemov, tabba,
	quic_eberman, michael.roth, david, vbabka, jroedel,
	thomas.lendacky, pgonda, zhiquan1.li, fan.du, jun.miao, ira.weiny,
	isaku.yamahata, xiaoyao.li, binbin.wu, chao.p.peng

On Thu, Jun 05, 2025 at 03:35:50PM -0700, Ackerley Tng wrote:
> Yan Zhao <yan.y.zhao@intel.com> writes:
> 
> > On Wed, Jun 04, 2025 at 01:02:54PM -0700, Ackerley Tng wrote:
> >> Hi Yan,
> >> 
> >> While working on the 1G (aka HugeTLB) page support for guest_memfd
> >> series [1], we took into account conversion failures too. The steps are
> >> in kvm_gmem_convert_range(). (It might be easier to pull the entire
> >> series from GitHub [2] because the steps for conversion changed in two
> >> separate patches.)
> > ...
> >> [2] https://github.com/googleprodkernel/linux-cc/tree/gmem-1g-page-support-rfc-v2
> >
> > Hi Ackerley,
> > Thanks for providing this branch.
> 
> Here's the WIP branch [1], which I initially wasn't intending to make
> super public since it's not even RFC standard yet and I didn't want to
> add to the many guest_memfd in-flight series, but since you referred to
> it, [2] is a v2 of the WIP branch :)
> 
> [1] https://github.com/googleprodkernel/linux-cc/commits/wip-tdx-gmem-conversions-hugetlb-2mept
> [2] https://github.com/googleprodkernel/linux-cc/commits/wip-tdx-gmem-conversions-hugetlb-2mept-v2
Thanks. [2] works. TDX huge pages now has successfully been rebased on top of [2].


> This WIP branch has selftests that test 1G aka HugeTLB page support with
> TDX huge page EPT mappings [7]:
> 
> 1. "KVM: selftests: TDX: Test conversion to private at different
>    sizes". This uses the fact that TDX module will return error if the
>    page is faulted into the guest at a different level from the accept
>    level to check the level that the page was faulted in.
> 2. "KVM: selftests: Test TDs in private_mem_conversions_test". Updates
>    private_mem_conversions_test for use with TDs. This test does
>    multi-vCPU conversions and we use this to check for issues to do with
>    conversion races.
> 3. "KVM: selftests: TDX: Test conversions when guest_memfd used for
>    private and shared memory". Adds a selftest similar to/on top of
>    guest_memfd_conversions_test that does conversions via MapGPA.
> 
> Full list of selftests I usually run from tools/testing/selftests/kvm:
> + ./guest_memfd_test
> + ./guest_memfd_conversions_test
> + ./guest_memfd_provide_hugetlb_cgroup_mount.sh ./guest_memfd_wrap_test_check_hugetlb_reporting.sh ./guest_memfd_test
> + ./guest_memfd_provide_hugetlb_cgroup_mount.sh ./guest_memfd_wrap_test_check_hugetlb_reporting.sh ./guest_memfd_conversions_test
> + ./guest_memfd_provide_hugetlb_cgroup_mount.sh ./guest_memfd_wrap_test_check_hugetlb_reporting.sh ./guest_memfd_hugetlb_reporting_test
> + ./x86/private_mem_conversions_test.sh
> + ./set_memory_region_test
> + ./x86/private_mem_kvm_exits_test
> + ./x86/tdx_vm_test
> + ./x86/tdx_upm_test
> + ./x86/tdx_shared_mem_test
> + ./x86/tdx_gmem_private_and_shared_test
> 
> As an overview for anyone who might be interested in this WIP branch:
> 
> 1.  I started with upstream's kvm/next
> 2.  Applied TDX selftests series [3]
> 3.  Applied guest_memfd mmap series [4]
> 4.  Applied conversions (sub)series and HugeTLB (sub)series [5]
> 5.  Added some fixes for 2 of the earlier series (as labeled in commit
>     message)
> 6.  Updated guest_memfd conversions selftests to work with TDX
> 7.  Applied 2M EPT series [6] with some hacks
> 8.  Some patches to make guest_memfd mmap return huge-page-aligned
>     userspace address
> 9.  Selftests for guest_memfd conversion with TDX 2M EPT
> 
> [3] https://lore.kernel.org/all/20250414214801.2693294-1-sagis@google.com/
> [4] https://lore.kernel.org/all/20250513163438.3942405-11-tabba@google.com/T/
> [5] https://lore.kernel.org/all/cover.1747264138.git.ackerleytng@google.com/T/
> [6] https://lore.kernel.org/all/Z%2FOMB7HNO%2FRQyljz@yzhao56-desk.sh.intel.com/
> [7] https://lore.kernel.org/all/20250424030033.32635-1-yan.y.zhao@intel.com/
Thanks.
We noticed that it's not easy for TDX initial memory regions to use the
in-place-conversion version of guest_memfd, because
- tdh_mem_page_add() requires simultaneous access to the shared source memory
  and the private target memory.
- shared-to-private in-place conversion first unmaps the shared memory and
  checks whether any extra folio refcount is held before the conversion is
  allowed.

Therefore, even though tdh_mem_page_add() actually supports in-place add (see
[8]), we can't store the initial content in the mmap-ed VA of the
in-place-conversion version of guest_memfd.

So, I modified QEMU to work around this issue by adding an extra anonymous
backend to hold the source pages in shared memory, with the target private PFN
allocated from guest_memfd with GUEST_MEMFD_FLAG_SUPPORT_SHARED set.

The goal is to test whether kvm_gmem_populate() works for TDX huge pages.
This testing exposed a bug in kvm_gmem_populate(), which has been fixed in the
following patch.

commit 5f33ed7ca26f00a61c611d2d1fbc001a7ecd8dca
Author: Yan Zhao <yan.y.zhao@intel.com>
Date:   Mon Jun 9 03:01:21 2025 -0700

    Bug fix: Reduce max_order when GFN is not aligned

    Fix the warning hit in kvm_gmem_populate().

    "WARNING: CPU: 7 PID: 4421 at arch/x86/kvm/../../../virt/kvm/guest_memfd.c:
    2496 kvm_gmem_populate+0x4a4/0x5b0"

    The GFN passed to kvm_gmem_populate() may have an offset so it may not be
    aligned to folio order. In this case, reduce the max_order to decrease the
    mapping level.

    Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>

diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index 4b8047020f17..af7943c0a8ba 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -2493,7 +2493,8 @@ long kvm_gmem_populate(struct kvm *kvm, gfn_t start_gfn, void __user *src, long
                }

                folio_unlock(folio);
-               WARN_ON(!IS_ALIGNED(gfn, 1 << max_order));
+               while (!IS_ALIGNED(gfn, 1 << max_order))
+                       max_order--;

                npages_to_populate = min(npages - i, 1 << max_order);
                npages_to_populate = private_npages_to_populate(
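
As an illustration of the fixup above (not part of the patch): with a 2MB folio
max_order starts at 9, but a populate request that begins one 4KB page into the
folio walks it all the way down to 0.

/* Illustration only. */
static int clamp_order_to_gfn_alignment(gfn_t gfn, int max_order)
{
        /* e.g. gfn = 0x1001, max_order = 9  ->  returns 0 */
        while (!IS_ALIGNED(gfn, 1ULL << max_order))
                max_order--;
        return max_order;
}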



[8] https://cdrdv2-public.intel.com/839195/intel-tdx-module-1.5-abi-spec-348551002.pdf
"In-Place Add: It is allowed to set the TD page HPA in R8 to the same address as
the source page HPA in R9. In this case the source page is converted to be a TD
private page".

^ permalink raw reply related	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 02/21] x86/virt/tdx: Enhance tdh_mem_page_aug() to support huge pages
  2025-05-16  9:05     ` Yan Zhao
  2025-05-16 17:10       ` Edgecombe, Rick P
@ 2025-06-19  9:26       ` Nikolay Borisov
  2025-06-23  9:32         ` Yan Zhao
  1 sibling, 1 reply; 294+ messages in thread
From: Nikolay Borisov @ 2025-06-19  9:26 UTC (permalink / raw)
  To: Yan Zhao, Edgecombe, Rick P
  Cc: pbonzini@redhat.com, seanjc@google.com, Shutemov, Kirill,
	quic_eberman@quicinc.com, Li, Xiaoyao, kvm@vger.kernel.org,
	Hansen, Dave, david@redhat.com, thomas.lendacky@amd.com,
	tabba@google.com, Li, Zhiquan1, Du, Fan,
	linux-kernel@vger.kernel.org, michael.roth@amd.com, Weiny, Ira,
	vbabka@suse.cz, binbin.wu@linux.intel.com, ackerleytng@google.com,
	Yamahata, Isaku, Peng, Chao P, Annapurve, Vishal, jroedel@suse.de,
	Miao, Jun, pgonda@google.com, x86@kernel.org



On 5/16/25 12:05, Yan Zhao wrote:
> On Wed, May 14, 2025 at 02:52:49AM +0800, Edgecombe, Rick P wrote:
>> On Thu, 2025-04-24 at 11:04 +0800, Yan Zhao wrote:
>>> Enhance the SEAMCALL wrapper tdh_mem_page_aug() to support huge pages.
>>>
>>> Verify the validity of the level and ensure that the mapping range is fully
>>> contained within the page folio.
>>>
>>> As a conservative solution, perform CLFLUSH on all pages to be mapped into
>>> the TD before invoking the SEAMCALL TDH_MEM_PAGE_AUG. This ensures that any
>>> dirty cache lines do not write back later and clobber TD memory.
>>
>> This should have a brief background on why it doesn't use the arg - what is
>> deficient today. Also, an explanation of how it will be used (i.e. what types of
>> pages will be passed)
> Will do.
> 
>>>
>>> Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
>>> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
>>> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
>>> ---
>>>   arch/x86/virt/vmx/tdx/tdx.c | 11 ++++++++++-
>>>   1 file changed, 10 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
>>> index f5e2a937c1e7..a66d501b5677 100644
>>> --- a/arch/x86/virt/vmx/tdx/tdx.c
>>> +++ b/arch/x86/virt/vmx/tdx/tdx.c
>>> @@ -1595,9 +1595,18 @@ u64 tdh_mem_page_aug(struct tdx_td *td, u64 gpa, int level, struct page *page, u
>>>   		.rdx = tdx_tdr_pa(td),
>>>   		.r8 = page_to_phys(page),
>>>   	};
>>> +	unsigned long nr_pages = 1 << (level * 9);
>>> +	struct folio *folio = page_folio(page);
>>> +	unsigned long idx = 0;
>>>   	u64 ret;
>>>   
>>> -	tdx_clflush_page(page);
>>> +	if (!(level >= TDX_PS_4K && level < TDX_PS_NR) ||
>>> +	    (folio_page_idx(folio, page) + nr_pages > folio_nr_pages(folio)))
>>> +		return -EINVAL;
>>
>> Shouldn't KVM not try to map a huge page in this situation? Doesn't seem like a
>> job for the SEAMCALL wrapper.
> Ok. If the decision is to trust KVM and all potential callers, it's reasonable
> to drop those checks.
> 
>>> +
>>> +	while (nr_pages--)
>>> +		tdx_clflush_page(nth_page(page, idx++));
>>
>> clflush_cache_range() is:
>> static void tdx_clflush_page(struct page *page)
>> {
>> 	clflush_cache_range(page_to_virt(page), PAGE_SIZE);
>> }
>>
>> So we have loops within loops...  Better to add an arg to tdx_clflush_page() or
>> add a variant that takes one.
> Ok.
> 
> One thing to note is that even with an extra arg, tdx_clflush_page() has to call
> clflush_cache_range() page by page because with
> "#if defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP)",
> page virtual addresses are not necessarily contiguous.
> 
> What about Binbin's proposal [1]? i.e.,
> 
> while (nr_pages)
>       tdx_clflush_page(nth_page(page, --nr_pages));

What's the problem with using:

+       for (int i = 0; nr_pages; nr_pages--)
+               tdx_clflush_page(nth_page(page, i++));


The kernel now allows C99-style declaration of variables inside a loop, and
it's clear how many times the loop has to be executed.
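
For reference, a minimal sketch of the count-taking variant mentioned above
(an assumption about where it would live, next to tdx_clflush_page()); it still
flushes page by page because with CONFIG_SPARSEMEM && !CONFIG_SPARSEMEM_VMEMMAP
the pages' virtual addresses are not guaranteed to be contiguous:

static void tdx_clflush_pages(struct page *page, unsigned long nr_pages)
{
        for (unsigned long i = 0; i < nr_pages; i++)
                clflush_cache_range(page_to_virt(nth_page(page, i)), PAGE_SIZE);
}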
> 
> [1] https://lore.kernel.org/all/a7d0988d-037c-454f-bc6b-57e71b357488@linux.intel.com/
> 
>>> +
>>>   	ret = seamcall_ret(TDH_MEM_PAGE_AUG, &args);
>>>   
>>>   	*ext_err1 = args.rcx;
>>
> 


^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 09/21] KVM: TDX: Enable 2MB mapping size after TD is RUNNABLE
  2025-06-18  0:30                                             ` Yan Zhao
@ 2025-06-20 16:31                                               ` Sean Christopherson
  2025-06-23 21:44                                                 ` Edgecombe, Rick P
  0 siblings, 1 reply; 294+ messages in thread
From: Sean Christopherson @ 2025-06-20 16:31 UTC (permalink / raw)
  To: Yan Zhao
  Cc: Rick P Edgecombe, Fan Du, Xiaoyao Li, Kai Huang,
	quic_eberman@quicinc.com, Dave Hansen, david@redhat.com,
	thomas.lendacky@amd.com, vbabka@suse.cz, Zhiquan1 Li,
	Kirill Shutemov, michael.roth@amd.com,
	linux-kernel@vger.kernel.org, Chao P Peng, pbonzini@redhat.com,
	Ira Weiny, Isaku Yamahata, binbin.wu@linux.intel.com,
	ackerleytng@google.com, kvm@vger.kernel.org, Vishal Annapurve,
	tabba@google.com, jroedel@suse.de, Jun Miao, pgonda@google.com,
	x86@kernel.org

On Wed, Jun 18, 2025, Yan Zhao wrote:
>   when an EPT violation carries an ACCEPT level info
>   KVM maps the page at map level <= the specified level.

No.  I want KVM to map at the maximal level KVM supports, irrespective of what
the guest's ACCEPT level says.  I.e. I want KVM to be able to completely ignore
the ACCEPT level.

^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 09/21] KVM: TDX: Enable 2MB mapping size after TD is RUNNABLE
  2025-06-18 11:32                                               ` Shutemov, Kirill
@ 2025-06-20 16:32                                                 ` Sean Christopherson
  2025-06-20 17:44                                                   ` Kirill Shutemov
  0 siblings, 1 reply; 294+ messages in thread
From: Sean Christopherson @ 2025-06-20 16:32 UTC (permalink / raw)
  To: Kirill Shutemov
  Cc: Rick P Edgecombe, Yan Y Zhao, Fan Du, Xiaoyao Li, Kai Huang,
	quic_eberman@quicinc.com, Dave Hansen, david@redhat.com,
	thomas.lendacky@amd.com, vbabka@suse.cz, Zhiquan1 Li,
	kvm@vger.kernel.org, michael.roth@amd.com,
	linux-kernel@vger.kernel.org, Chao P Peng, pbonzini@redhat.com,
	Ira Weiny, Isaku Yamahata, binbin.wu@linux.intel.com,
	ackerleytng@google.com, Vishal Annapurve, tabba@google.com,
	jroedel@suse.de, Jun Miao, pgonda@google.com, x86@kernel.org

On Wed, Jun 18, 2025, Kirill Shutemov wrote:
> On Wed, Jun 18, 2025 at 04:22:59AM +0300, Edgecombe, Rick P wrote:
> > On Tue, 2025-06-17 at 08:52 +0800, Yan Zhao wrote:
> > > > hopefully is just handling accepting a whole range that is not 2MB aligned.
> > > > But
> > > > I think we need to verify this more.
> > > Ok.
> > 
> > In Linux guest if a memory region is not 2MB aligned the guest will accept the

What is a "memory region" in this context?  An e820 region?  Something else?

> > ends at 4k size. If a memory region is identical to a memslot range this will be
> > fine. KVM will map the ends at 4k because it won't let huge pages span a
> > memslot. But if several memory regions are not 2MB aligned and are covered by
> > one large memslot, the accept will fail on the 4k ends under this proposal. I
> > don't know if this is a common configuration, but to cover it in the TDX guest
> > may not be trivial.
> > 
> > So I think this will only work if guests can reasonably "merge" all of the
> > adjacent accepts. Or of we declare a bunch of memory/memslot layouts illegal.
> > 
> > Kirill, how difficult would it be for TDX Linux guest to merge all 2MB adjacent
> > accepts?
> 
> Hm. What do you mean by merging?
> 
> Kernel only accepts <4k during early boot -- in EFI stub. The bitmap we
> use to track unaccepted memory tracks the status in 2M granularity and
> all later accept requests will be issues on 2M pages with fallback to 4k.
> 
> -- 
>   Kiryl Shutsemau / Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 09/21] KVM: TDX: Enable 2MB mapping size after TD is RUNNABLE
  2025-06-20 16:32                                                 ` Sean Christopherson
@ 2025-06-20 17:44                                                   ` Kirill Shutemov
  2025-06-20 18:40                                                     ` Sean Christopherson
  0 siblings, 1 reply; 294+ messages in thread
From: Kirill Shutemov @ 2025-06-20 17:44 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Rick P Edgecombe, Yan Y Zhao, Fan Du, Xiaoyao Li, Kai Huang,
	quic_eberman@quicinc.com, Dave Hansen, david@redhat.com,
	thomas.lendacky@amd.com, vbabka@suse.cz, Zhiquan1 Li,
	kvm@vger.kernel.org, michael.roth@amd.com,
	linux-kernel@vger.kernel.org, Chao P Peng, pbonzini@redhat.com,
	Ira Weiny, Isaku Yamahata, binbin.wu@linux.intel.com,
	ackerleytng@google.com, Vishal Annapurve, tabba@google.com,
	jroedel@suse.de, Jun Miao, pgonda@google.com, x86@kernel.org

On Fri, Jun 20, 2025 at 09:32:45AM -0700, Sean Christopherson wrote:
> On Wed, Jun 18, 2025, Kirill Shutemov wrote:
> > On Wed, Jun 18, 2025 at 04:22:59AM +0300, Edgecombe, Rick P wrote:
> > > On Tue, 2025-06-17 at 08:52 +0800, Yan Zhao wrote:
> > > > > hopefully is just handling accepting a whole range that is not 2MB aligned.
> > > > > But
> > > > > I think we need to verify this more.
> > > > Ok.
> > > 
> > > In Linux guest if a memory region is not 2MB aligned the guest will accept the
> 
> What is a "memory region" in this context?  An e820 region?  Something else?

EFI memory map entry.

-- 
  Kiryl Shutsemau / Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge pages
  2025-06-19  8:11                                         ` Yan Zhao
@ 2025-06-20 18:06                                           ` Vishal Annapurve
  0 siblings, 0 replies; 294+ messages in thread
From: Vishal Annapurve @ 2025-06-20 18:06 UTC (permalink / raw)
  To: Yan Zhao
  Cc: Ackerley Tng, pbonzini, seanjc, linux-kernel, kvm, x86,
	rick.p.edgecombe, dave.hansen, kirill.shutemov, tabba,
	quic_eberman, michael.roth, david, vbabka, jroedel,
	thomas.lendacky, pgonda, zhiquan1.li, fan.du, jun.miao, ira.weiny,
	isaku.yamahata, xiaoyao.li, binbin.wu, chao.p.peng

On Thu, Jun 19, 2025 at 1:15 AM Yan Zhao <yan.y.zhao@intel.com> wrote:
>
> On Thu, Jun 05, 2025 at 03:35:50PM -0700, Ackerley Tng wrote:
> > Yan Zhao <yan.y.zhao@intel.com> writes:
> >
> > > On Wed, Jun 04, 2025 at 01:02:54PM -0700, Ackerley Tng wrote:
> > >> Hi Yan,
> > >>
> > >> While working on the 1G (aka HugeTLB) page support for guest_memfd
> > >> series [1], we took into account conversion failures too. The steps are
> > >> in kvm_gmem_convert_range(). (It might be easier to pull the entire
> > >> series from GitHub [2] because the steps for conversion changed in two
> > >> separate patches.)
> > > ...
> > >> [2] https://github.com/googleprodkernel/linux-cc/tree/gmem-1g-page-support-rfc-v2
> > >
> > > Hi Ackerley,
> > > Thanks for providing this branch.
> >
> > Here's the WIP branch [1], which I initially wasn't intending to make
> > super public since it's not even RFC standard yet and I didn't want to
> > add to the many guest_memfd in-flight series, but since you referred to
> > it, [2] is a v2 of the WIP branch :)
> >
> > [1] https://github.com/googleprodkernel/linux-cc/commits/wip-tdx-gmem-conversions-hugetlb-2mept
> > [2] https://github.com/googleprodkernel/linux-cc/commits/wip-tdx-gmem-conversions-hugetlb-2mept-v2
> Thanks. [2] works. TDX huge pages now has successfully been rebased on top of [2].
>
>
> > This WIP branch has selftests that test 1G aka HugeTLB page support with
> > TDX huge page EPT mappings [7]:
> >
> > 1. "KVM: selftests: TDX: Test conversion to private at different
> >    sizes". This uses the fact that TDX module will return error if the
> >    page is faulted into the guest at a different level from the accept
> >    level to check the level that the page was faulted in.
> > 2. "KVM: selftests: Test TDs in private_mem_conversions_test". Updates
> >    private_mem_conversions_test for use with TDs. This test does
> >    multi-vCPU conversions and we use this to check for issues to do with
> >    conversion races.
> > 3. "KVM: selftests: TDX: Test conversions when guest_memfd used for
> >    private and shared memory". Adds a selftest similar to/on top of
> >    guest_memfd_conversions_test that does conversions via MapGPA.
> >
> > Full list of selftests I usually run from tools/testing/selftests/kvm:
> > + ./guest_memfd_test
> > + ./guest_memfd_conversions_test
> > + ./guest_memfd_provide_hugetlb_cgroup_mount.sh ./guest_memfd_wrap_test_check_hugetlb_reporting.sh ./guest_memfd_test
> > + ./guest_memfd_provide_hugetlb_cgroup_mount.sh ./guest_memfd_wrap_test_check_hugetlb_reporting.sh ./guest_memfd_conversions_test
> > + ./guest_memfd_provide_hugetlb_cgroup_mount.sh ./guest_memfd_wrap_test_check_hugetlb_reporting.sh ./guest_memfd_hugetlb_reporting_test
> > + ./x86/private_mem_conversions_test.sh
> > + ./set_memory_region_test
> > + ./x86/private_mem_kvm_exits_test
> > + ./x86/tdx_vm_test
> > + ./x86/tdx_upm_test
> > + ./x86/tdx_shared_mem_test
> > + ./x86/tdx_gmem_private_and_shared_test
> >
> > As an overview for anyone who might be interested in this WIP branch:
> >
> > 1.  I started with upstream's kvm/next
> > 2.  Applied TDX selftests series [3]
> > 3.  Applied guest_memfd mmap series [4]
> > 4.  Applied conversions (sub)series and HugeTLB (sub)series [5]
> > 5.  Added some fixes for 2 of the earlier series (as labeled in commit
> >     message)
> > 6.  Updated guest_memfd conversions selftests to work with TDX
> > 7.  Applied 2M EPT series [6] with some hacks
> > 8.  Some patches to make guest_memfd mmap return huge-page-aligned
> >     userspace address
> > 9.  Selftests for guest_memfd conversion with TDX 2M EPT
> >
> > [3] https://lore.kernel.org/all/20250414214801.2693294-1-sagis@google.com/
> > [4] https://lore.kernel.org/all/20250513163438.3942405-11-tabba@google.com/T/
> > [5] https://lore.kernel.org/all/cover.1747264138.git.ackerleytng@google.com/T/
> > [6] https://lore.kernel.org/all/Z%2FOMB7HNO%2FRQyljz@yzhao56-desk.sh.intel.com/
> > [7] https://lore.kernel.org/all/20250424030033.32635-1-yan.y.zhao@intel.com/
> Thanks.
> We noticed that it's not easy for TDX initial memory regions to use in-place
> conversion version of guest_memfd, because
> - tdh_mem_page_add() requires simultaneous access to shared source memory and
>   private target memory.
> - shared-to-private in-place conversion first unmaps the shared memory and tests
>   if any extra folio refcount is held before the conversion is allowed.
>
> Therefore, though tdh_mem_page_add() actually supports in-place add, see [8],
> we can't store the initial content in the mmap-ed VA of the in-place conversion
> version of guest_memfd.
>
> So, I modified QEMU to workaround this issue by adding an extra anonymous
> backend to hold source pages in shared memory, with the target private PFN
> allocated from guest_memfd with GUEST_MEMFD_FLAG_SUPPORT_SHARED set.

Yeah, this scheme of using different memory backing for initial
payload makes sense to me.

^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 09/21] KVM: TDX: Enable 2MB mapping size after TD is RUNNABLE
  2025-06-20 17:44                                                   ` Kirill Shutemov
@ 2025-06-20 18:40                                                     ` Sean Christopherson
  2025-06-20 19:26                                                       ` Kirill Shutemov
  0 siblings, 1 reply; 294+ messages in thread
From: Sean Christopherson @ 2025-06-20 18:40 UTC (permalink / raw)
  To: Kirill Shutemov
  Cc: Rick P Edgecombe, Yan Y Zhao, Fan Du, Xiaoyao Li, Kai Huang,
	quic_eberman@quicinc.com, Dave Hansen, david@redhat.com,
	thomas.lendacky@amd.com, vbabka@suse.cz, Zhiquan1 Li,
	kvm@vger.kernel.org, michael.roth@amd.com,
	linux-kernel@vger.kernel.org, Chao P Peng, pbonzini@redhat.com,
	Ira Weiny, Isaku Yamahata, binbin.wu@linux.intel.com,
	ackerleytng@google.com, Vishal Annapurve, tabba@google.com,
	jroedel@suse.de, Jun Miao, pgonda@google.com, x86@kernel.org

On Fri, Jun 20, 2025, Kirill Shutemov wrote:
> On Fri, Jun 20, 2025 at 09:32:45AM -0700, Sean Christopherson wrote:
> > On Wed, Jun 18, 2025, Kirill Shutemov wrote:
> > > On Wed, Jun 18, 2025 at 04:22:59AM +0300, Edgecombe, Rick P wrote:
> > > > On Tue, 2025-06-17 at 08:52 +0800, Yan Zhao wrote:
> > > > > > hopefully is just handling accepting a whole range that is not 2MB aligned.
> > > > > > But
> > > > > > I think we need to verify this more.
> > > > > Ok.
> > > > 
> > > > In Linux guest if a memory region is not 2MB aligned the guest will accept the
> > 
> > What is a "memory region" in this context?  An e820 region?  Something else?
> 
> EFI memory map entry.

I forget, for TDX, is the EFI map built by guest firmware or by the VMM?

^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 09/21] KVM: TDX: Enable 2MB mapping size after TD is RUNNABLE
  2025-06-20 18:40                                                     ` Sean Christopherson
@ 2025-06-20 19:26                                                       ` Kirill Shutemov
  0 siblings, 0 replies; 294+ messages in thread
From: Kirill Shutemov @ 2025-06-20 19:26 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Rick P Edgecombe, Yan Y Zhao, Fan Du, Xiaoyao Li, Kai Huang,
	quic_eberman@quicinc.com, Dave Hansen, david@redhat.com,
	thomas.lendacky@amd.com, vbabka@suse.cz, Zhiquan1 Li,
	kvm@vger.kernel.org, michael.roth@amd.com,
	linux-kernel@vger.kernel.org, Chao P Peng, pbonzini@redhat.com,
	Ira Weiny, Isaku Yamahata, binbin.wu@linux.intel.com,
	ackerleytng@google.com, Vishal Annapurve, tabba@google.com,
	jroedel@suse.de, Jun Miao, pgonda@google.com, x86@kernel.org

On Fri, Jun 20, 2025 at 11:40:31AM -0700, Sean Christopherson wrote:
> On Fri, Jun 20, 2025, Kirill Shutemov wrote:
> > On Fri, Jun 20, 2025 at 09:32:45AM -0700, Sean Christopherson wrote:
> > > On Wed, Jun 18, 2025, Kirill Shutemov wrote:
> > > > On Wed, Jun 18, 2025 at 04:22:59AM +0300, Edgecombe, Rick P wrote:
> > > > > On Tue, 2025-06-17 at 08:52 +0800, Yan Zhao wrote:
> > > > > > > hopefully is just handling accepting a whole range that is not 2MB aligned.
> > > > > > > But
> > > > > > > I think we need to verify this more.
> > > > > > Ok.
> > > > > 
> > > > > In Linux guest if a memory region is not 2MB aligned the guest will accept the
> > > 
> > > What is a "memory region" in this context?  An e820 region?  Something else?
> > 
> > EFI memory map entry.
> 
> I forget, for TDX, is the EFI map built by guest firmware or by the VMM?

Guest BIOS.

The BIOS would accept some memory on its own (typically the first 4G) and
leave the rest to be accepted by the OS. EFI boot services can also accept
memory at the OS's request (e.g. on memory allocation), updating the map.

-- 
  Kiryl Shutsemau / Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge pages
  2025-06-18  0:41                                                   ` Edgecombe, Rick P
@ 2025-06-23  9:27                                                     ` Yan Zhao
  2025-06-23 18:20                                                       ` Edgecombe, Rick P
       [not found]                                                       ` <draft-diqzh606mcz0.fsf@ackerleytng-ctop.c.googlers.com>
  0 siblings, 2 replies; 294+ messages in thread
From: Yan Zhao @ 2025-06-23  9:27 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: quic_eberman@quicinc.com, Li, Xiaoyao, Shutemov, Kirill,
	Hansen, Dave, david@redhat.com, thomas.lendacky@amd.com,
	vbabka@suse.cz, tabba@google.com, Du, Fan,
	linux-kernel@vger.kernel.org, seanjc@google.com, Weiny, Ira,
	pbonzini@redhat.com, binbin.wu@linux.intel.com, Yamahata, Isaku,
	michael.roth@amd.com, ackerleytng@google.com, Peng, Chao P,
	kvm@vger.kernel.org, Annapurve, Vishal, jroedel@suse.de,
	Miao, Jun, Li, Zhiquan1, pgonda@google.com, x86@kernel.org

On Wed, Jun 18, 2025 at 08:41:38AM +0800, Edgecombe, Rick P wrote:
> On Wed, 2025-06-18 at 08:19 +0800, Yan Zhao wrote:
> > > I don't think a potential bug in KVM is a good enough reason. If we are
> > > concerned can we think about a warning instead?
> > > 
> > > We had talked enhancing kasan to know when a page is mapped into S-EPT in
> > > the
> > > past. So rather than design around potential bugs we could focus on having a
> > > simpler implementation with the infrastructure to catch and fix the bugs.
> > However, if failing to remove a guest private page would only cause memory
> > leak,
> > it's fine. 
> > If TDX does not hold any refcount, guest_memfd has to know that which private
> > page is still mapped. Otherwise, the page may be re-assigned to other kernel
> > components while it may still be mapped in the S-EPT.
> 
> KASAN detects use-after-free's like that. However, the TDX module code is not
> instrumented. It won't check against the KASAN state for it's accesses.
> 
> I had a brief chat about this with Dave and Kirill. A couple ideas were
> discussed. One was to use page_ext to keep a flag that says the page is in-use
Thanks!

To use page_ext, should we introduce a new flag PAGE_EXT_FIRMWARE_IN_USE,
similar to PAGE_EXT_YOUNG?

Due to similar issues as those with normal page/folio flags (see the next
comment for details), TDX needs to set PAGE_EXT_FIRMWARE_IN_USE on a
page-by-page basis rather than folio-by-folio.
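
(A rough sketch of the per-page helper this would imply, assuming the new
PAGE_EXT_FIRMWARE_IN_USE bit is added to enum page_ext_flags:)

static void tdx_mark_firmware_in_use(struct page *page, bool in_use)
{
        struct page_ext *page_ext = page_ext_get(page);

        if (!page_ext)
                return;

        if (in_use)
                set_bit(PAGE_EXT_FIRMWARE_IN_USE, &page_ext->flags);
        else
                clear_bit(PAGE_EXT_FIRMWARE_IN_USE, &page_ext->flags);

        page_ext_put(page_ext);
}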

Additionally, it seems reasonable for guest_memfd not to copy the
PAGE_EXT_FIRMWARE_IN_USE flag when splitting a huge folio?
(in __folio_split() --> split_folio_to_order(), PAGE_EXT_YOUNG and
PAGE_EXT_IDLE are copied to the new folios though).

Furthermore, page_ext uses extra memory. With CONFIG_64BIT, should we instead
introduce a PG_firmware_in_use in page flags, similar to PG_young and PG_idle?

> by the TDX module. There was also some discussion of using a normal page flag,
> and that the reserved page flag might prevent some of the MM operations that
> would be needed on guestmemfd pages. I didn't see the problem when I looked.
> 
> For the solution, basically the SEAMCALL wrappers set a flag when they hand a
> page to the TDX module, and clear it when they successfully reclaim it via
> tdh_mem_page_remove() or tdh_phymem_page_reclaim(). Then if the page makes it
> back to the page allocator, a warning is generated.
After some testing, to use a normal page flag, we may need to set it on a
page-by-page basis rather than folio-by-folio. See "Scheme 1".
And guest_memfd may need to selectively copy page flags when splitting huge
folios. See "Scheme 2".

Scheme 1: Set/unset page flag on folio-by-folio basis, i.e.
        - set folio reserved at tdh_mem_page_aug(), tdh_mem_page_add(),
        - unset folio reserved after a successful tdh_mem_page_remove() or
          tdh_phymem_page_reclaim().

        It has a problem in the following scenario:
        1. tdh_mem_page_aug() adds a 2MB folio. It marks the folio as reserved
	   via "folio_set_reserved(page_folio(page))"

        2. convert a 4KB page of the 2MB folio to shared.
        2.1 tdh_mem_page_demote() is executed first.
       
        2.2 tdh_mem_page_remove() then removes the 4KB mapping.
            "folio_clear_reserved(page_folio(page))" clears reserved flag for
            the 2MB folio while the rest 511 pages are still mapped in the
            S-EPT.

        2.3. guest_memfd splits the 2MB folio into 512 4KB folios.


Scheme 2: Set/unset page flag on page-by-page basis, i.e.
        - set page flag reserved at tdh_mem_page_aug(), tdh_mem_page_add(),
        - unset page flag reserved after a successful tdh_mem_page_remove() or
          tdh_phymem_page_reclaim().

        It has a problem in the following scenario:
        1. tdh_mem_page_aug() adds a 2MB folio. It marks pages as reserved by
           invoking "SetPageReserved()" on each page.
           As folio->flags shares storage with page[0]->flags, the folio also
           ends up with the reserved flag set.

        2. convert a 4KB page of the 2MB folio to shared. say, it's page[4].
        2.1 tdh_mem_page_demote() is executed first.
       
        2.2 tdh_mem_page_remove() then removes the 4KB mapping.
            "ClearPageReserved()" clears reserved flag of page[4] of the 2MB
            folio.

        2.3. guest_memfd splits the 2MB folio into 512 4KB folios.
             In guestmem_hugetlb_split_folio(), "p->flags = folio->flags" marks
             page[4]->flags as reserved again as page[0] is still reserved.

            (see the code in https://lore.kernel.org/all/2ae41e0d80339da2b57011622ac2288fed65cd01.1747264138.git.ackerleytng@google.com/
            for (i = 1; i < orig_nr_pages; ++i) {
                struct page *p = folio_page(folio, i);

                /* Copy flags from the first page to split pages. */
                p->flags = folio->flags;

                p->mapping = NULL;
                clear_compound_head(p);
            }
            )

I'm not sure if "p->flags = folio->flags" can be removed. Currently, flags like
PG_unevictable are preserved via this step.
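If the copy is kept but made selective, a minimal form could be to mask out the
bit used to track TDX/firmware use. This is a sketch only; whether PG_reserved
is the right bit to exclude is exactly the open question here:

	/* Copy flags from the first page, but never propagate the reserved bit. */
	p->flags = folio->flags & ~(1UL << PG_reserved);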

If we selectively copy flags, we may also need the following changes to prevent
a still-reserved page from being freed back to the page allocator. Otherwise, the
"HugePages_Free" count will not decrease, and the same huge folio will continue
to be recycled (i.e., allocated and consumed by other VMs).

diff --git a/mm/swap.c b/mm/swap.c
index 2747230ced89..72d8c53e2321 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -98,8 +98,36 @@ static void page_cache_release(struct folio *folio)
                unlock_page_lruvec_irqrestore(lruvec, flags);
 }

+static inline bool folio_is_reserved(struct folio *folio)
+{
+       long nr_pages = folio_nr_pages(folio);
+       long i;
+
+       for (i = 0; i < nr_pages; i++) {
+               if (!PageReserved(folio_page(folio, i)))
+                       continue;
+
+               return true;
+       }
+
+       return false;
+}
+
 static void free_typed_folio(struct folio *folio)
 {
@@ -118,6 +146,13 @@ static void free_typed_folio(struct folio *folio)

 void __folio_put(struct folio *folio)
 {
+       if (folio_is_reserved(folio)) {
+               VM_WARN_ON_FOLIO(folio_is_reserved(folio), folio);
+               return;
+       }
+
        if (unlikely(folio_is_zone_device(folio))) {
                free_zone_device_folio(folio);
                return;
@@ -986,6 +1021,12 @@ void folios_put_refs(struct folio_batch *folios, unsigned int *refs)
                if (!folio_ref_sub_and_test(folio, nr_refs))
                        continue;

+               if (folio_is_reserved(folio)) {
+                       VM_WARN_ON_FOLIO(folio_is_reserved(folio), folio);
+                       continue;
+               }
+
                if (unlikely(folio_has_type(folio))) {
                        /* typed folios have their own memcg, if any */
                        if (lruvec) {


Besides, guest_memfd needs to reject conversion to shared when a page is still
mapped in the S-EPT.

diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index d71653e7e51e..6449151a3a69 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -553,6 +553,41 @@ static void kvm_gmem_convert_invalidate_end(struct inode *inode,
                kvm_gmem_invalidate_end(gmem, invalidate_start, invalidate_end);
 }

+static bool kvm_gmem_has_invalid_folio(struct address_space *mapping, pgoff_t start,
+                                       size_t nr_pages)
+{
+       pgoff_t index = start, end = start + nr_pages;
+       bool ret = false;
+
+       while (index < end) {
+               struct folio *f;
+               long i = 0, nr;
+
+               f = filemap_get_folio(mapping, index);
+               if (IS_ERR(f)) {
+                       /* No folio at this index; advance to avoid looping forever. */
+                       index++;
+                       continue;
+               }
+
+               if (f->index < start)
+                       i = start - f->index;
+
+               nr = folio_nr_pages(f);
+               if (f->index + folio_nr_pages(f) > end)
+                       nr -= f->index + folio_nr_pages(f) - end;
+
+               for (; i < nr; i++) {
+                       if (PageReserved(folio_page(f, i))) {
+                               ret = true;
+                               folio_put(f);
+                               goto out;
+                       }
+               }
+               index += folio_nr_pages(f);
+               folio_put(f);
+       }
+out:
+       return ret;
+}
 static int kvm_gmem_convert_should_proceed(struct inode *inode,
                                           struct conversion_work *work,
                                           bool to_shared, pgoff_t *error_index)
@@ -572,6 +607,12 @@ static int kvm_gmem_convert_should_proceed(struct inode *inode,
                        if (ret)
                                return ret;
                        kvm_gmem_zap(gmem, work->start, work_end, KVM_FILTER_PRIVATE);
+
+                       if (kvm_gmem_has_invalid_folio(inode->i_mapping, work->start,
+                                                      work->nr_pages)) {
+                               ret = -EFAULT;
+                       }
+
                }
        } else {
                unmap_mapping_pages(inode->i_mapping, work->start,



> Also it was mentioned that SGX did have a similar issue to what is being worried
> about here:
> https://lore.kernel.org/linux-sgx/aCYey1W6i7i3yPLL@gmail.com/T/#m86c8c4cf0e6b9a653bf0709a22bb360034a24d95
> 
> > 
> > 
> > > > 
> > > > > > 
> > > > > > This would allow guest_memfd to maintain an internal reference count
> > > > > > for
> > > > > > each
> > > > > > private GFN. TDX would call guest_memfd_add_page_ref_count() for
> > > > > > mapping
> > > > > > and
> > > > > > guest_memfd_dec_page_ref_count() after a successful unmapping. Before
> > > > > > truncating
> > > > > > a private page from the filemap, guest_memfd could increase the real
> > > > > > folio
> > > > > > reference count based on its internal reference count for the private
> > > > > > GFN.
> > > > > 
> > > > > What does this get us exactly? This is the argument to have less error
> > > > > prone
> > > > > code that can survive forgetting to refcount on error? I don't see that
> > > > > it
> > > > > is an
> > > > > especially special case.
> > > > Yes, for a less error prone code.
> > > > 
> > > > If this approach is considered too complex for an initial implementation,
> > > > using
> > > > tdx_hold_page_on_error() is also a viable option.
> > > 
> > > I'm saying I don't think it's not a good enough reason. Why is it different
> > > then
> > > other use-after free bugs? I feel like I'm missing something.
> > By tdx_hold_page_on_error(), it could be implemented as: on removal failure,
> > invoke a guest_memfd interface to let guest_memfd know exact ranges still
> > being
> > under use by the TDX module due to unmapping failures.
> > Do you think it's ok?
> 
> Either way is ok to me. It seems like we have three ok solutions. But the tone
> of the thread is that we are solving some deep problem. Maybe I'm missing
> something.

^ permalink raw reply related	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 02/21] x86/virt/tdx: Enhance tdh_mem_page_aug() to support huge pages
  2025-06-19  9:26       ` Nikolay Borisov
@ 2025-06-23  9:32         ` Yan Zhao
  0 siblings, 0 replies; 294+ messages in thread
From: Yan Zhao @ 2025-06-23  9:32 UTC (permalink / raw)
  To: Nikolay Borisov
  Cc: Edgecombe, Rick P, pbonzini@redhat.com, seanjc@google.com,
	Shutemov, Kirill, quic_eberman@quicinc.com, Li, Xiaoyao,
	kvm@vger.kernel.org, Hansen, Dave, david@redhat.com,
	thomas.lendacky@amd.com, tabba@google.com, Li, Zhiquan1, Du, Fan,
	linux-kernel@vger.kernel.org, michael.roth@amd.com, Weiny, Ira,
	vbabka@suse.cz, binbin.wu@linux.intel.com, ackerleytng@google.com,
	Yamahata, Isaku, Peng, Chao P, Annapurve, Vishal, jroedel@suse.de,
	Miao, Jun, pgonda@google.com, x86@kernel.org

On Thu, Jun 19, 2025 at 12:26:07PM +0300, Nikolay Borisov wrote:
> > What about Binbin's proposal [1]? i.e.,
> > 
> > while (nr_pages)
> >       tdx_clflush_page(nth_page(page, --nr_pages));
> 
> What's the problem with using:
> 
> +       for (int i = 0; nr_pages; nr_pages--)
> +               tdx_clflush_page(nth_page(page, i++))
Thanks! It looks good to me.

> The kernel now allows C99-style definition of variables inside a loop + it's
> clear how many times the loop has to be executed.
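i.e., the huge-page-aware wrapper would flush each 4KB chunk before the
SEAMCALL, roughly as below (deriving nr_pages from the mapping level is an
assumption about the final wrapper):

	/* Flush every 4KB page of the (possibly 2MB) range before AUG/ADD. */
	for (int i = 0; nr_pages; nr_pages--)
		tdx_clflush_page(nth_page(page, i++));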
 

^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge pages
  2025-06-23  9:27                                                     ` Yan Zhao
@ 2025-06-23 18:20                                                       ` Edgecombe, Rick P
       [not found]                                                       ` <draft-diqzh606mcz0.fsf@ackerleytng-ctop.c.googlers.com>
  1 sibling, 0 replies; 294+ messages in thread
From: Edgecombe, Rick P @ 2025-06-23 18:20 UTC (permalink / raw)
  To: Zhao, Yan Y
  Cc: quic_eberman@quicinc.com, Li, Xiaoyao, Du, Fan, Hansen, Dave,
	david@redhat.com, thomas.lendacky@amd.com, tabba@google.com,
	vbabka@suse.cz, Shutemov, Kirill, michael.roth@amd.com,
	linux-kernel@vger.kernel.org, seanjc@google.com,
	pbonzini@redhat.com, binbin.wu@linux.intel.com, Yamahata, Isaku,
	ackerleytng@google.com, Weiny, Ira, Peng, Chao P,
	kvm@vger.kernel.org, Annapurve, Vishal, jroedel@suse.de,
	Miao, Jun, Li, Zhiquan1, pgonda@google.com, x86@kernel.org

On Mon, 2025-06-23 at 17:27 +0800, Yan Zhao wrote:
> To use page_ext, should we introduce a new flag PAGE_EXT_FIRMWARE_IN_USE,
> similar to PAGE_EXT_YOUNG?
> 
> Due to similar issues as those with normal page/folio flags (see the next
> comment for details), TDX needs to set PAGE_EXT_FIRMWARE_IN_USE on a
> page-by-page basis rather than folio-by-folio.
> 
> Additionally, it seems reasonable for guest_memfd not to copy the
> PAGE_EXT_FIRMWARE_IN_USE flag when splitting a huge folio?
> (in __folio_split() --> split_folio_to_order(), PAGE_EXT_YOUNG and
> PAGE_EXT_IDLE are copied to the new folios though).
> 
> Furthermore, page_ext uses extra memory. With CONFIG_64BIT, should we instead
> introduce a PG_firmware_in_use in page flags, similar to PG_young and PG_idle?

Page flags are a scarce resource. If we could have used an existing one, it
would have been nice. But otherwise, I would guess the use case is not strong
enough to justify adding one.

So PAGE_EXT_FIRMWARE_IN_USE is probably a better way to go. Due to the memory
use, it would have to be a debug config like the others. If we have line of
sight to a solution, how do you feel about the following direction to move past
this issue:
1. Go with refcount on error approach for now (i.e. tdx_hold_page_on_error())
2. In a pfn-only future, plan to switch to guestmemfd callback instead of
tdx_hold_page_on_error(). We don't understand the pfn-only feature enough to
properly design for it anyway.
3. Plan for a PAGE_EXT_FIRMWARE_IN_USE as follow-on work to huge pages. The
reason why it should not be required before huge pages is because it is not
necessary for correct code, only to catch incorrect code slipping in.

That is based on the assessment that the effort to change the zap path to
communicate failure is too much churn. Do you happen to have a diffstat for a
POC on this BTW?

> 
> > by the TDX module. There was also some discussion of using a normal page flag,
> > and that the reserved page flag might prevent some of the MM operations that
> > would be needed on guestmemfd pages. I didn't see the problem when I looked.
> > 
> > For the solution, basically the SEAMCALL wrappers set a flag when they hand a
> > page to the TDX module, and clear it when they successfully reclaim it via
> > tdh_mem_page_remove() or tdh_phymem_page_reclaim(). Then if the page makes it
> > back to the page allocator, a warning is generated.
> After some testing, to use a normal page flag, we may need to set it on a
> page-by-page basis rather than folio-by-folio. See "Scheme 1".
> And guest_memfd may need to selectively copy page flags when splitting huge
> folios. See "Scheme 2".

With page_ext, it seems we could have it be per page from the beginning?

^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 09/21] KVM: TDX: Enable 2MB mapping size after TD is RUNNABLE
  2025-06-20 16:31                                               ` Sean Christopherson
@ 2025-06-23 21:44                                                 ` Edgecombe, Rick P
  2025-06-24  9:57                                                   ` Yan Zhao
  0 siblings, 1 reply; 294+ messages in thread
From: Edgecombe, Rick P @ 2025-06-23 21:44 UTC (permalink / raw)
  To: seanjc@google.com, Zhao, Yan Y
  Cc: Du, Fan, Li, Xiaoyao, Huang, Kai, Shutemov, Kirill, Hansen, Dave,
	david@redhat.com, thomas.lendacky@amd.com, vbabka@suse.cz,
	Li, Zhiquan1, quic_eberman@quicinc.com, michael.roth@amd.com,
	linux-kernel@vger.kernel.org, Weiny, Ira, Peng, Chao P,
	pbonzini@redhat.com, Yamahata, Isaku, ackerleytng@google.com,
	binbin.wu@linux.intel.com, kvm@vger.kernel.org, Annapurve, Vishal,
	tabba@google.com, jroedel@suse.de, Miao, Jun, pgonda@google.com,
	x86@kernel.org

On Fri, 2025-06-20 at 09:31 -0700, Sean Christopherson wrote:
> > On Wed, Jun 18, 2025, Yan Zhao wrote:
> > > >    when an EPT violation carries an ACCEPT level info
> > > >    KVM maps the page at map level <= the specified level.
> > 
> > No.  I want KVM to map at the maximal level KVM supports, irrespective of what
> > the guest's ACCEPT level says.  I.e. I want KVM to be able to completely ignore
> > the ACCEPT level.

This is what I was thinking, but I'm starting to think it might not be a good
idea.

The PAGE_SIZE_MISMATCH error code asymmetry is indeed weird. But "accepted" is
in some important ways a type of permission that is controllable by both the
guest and host. Changing the ABI and guests such that the permission is still
controlled by both the host and guest, but the allowed granularity is only
controllable by the host, feels wrong in a couple of ways.

First, it turns host mapping details into guest ABI that could break guests that
rely on it. Second, it bets that there will never be a need for guests to set
the accept state on a specific smaller granularity. Otherwise, this path would 
just be a temporary shortcut and not about components imposing things that are
none of their business.

Instead I think the two impositions that matter here are:
1. TDX requires size to be passed through the generic fault handler somehow.
2. TDX demote is hard to make work under mmu read lock (already working on this
one)

Sean, were the two options for (1) really that bad? Or how do you think about
changing directions in general and we can try to find some other options?

On the subject of alternatives to (1): I wonder if the ugly part is that both of
the options sort of break the KVM model where the TDP is not the real backing
state. TDG.MEM.PAGE.ACCEPT is kind of two things, changing the "permission" of
the memory *and* the mapping of it. The TDX module asks: map this at this page
size so that I can map it at the right permission. KVM would rather learn the
permission from the backing GPA info (memslots, etc) and then map it at its
correct page size. Like what happens with kvm_lpage_info->disallow_lpage.

Maybe we could have EPT violations that contain 4k accept sizes first update the
attribute for the GFN to be accepted or not, like having tdx.c call out to set
kvm_lpage_info->disallow_lpage in the rarer case of a 4k accept size? Or something
like that. Maybe set an "accepted" attribute, or something. Not sure if it could be
done without the mmu write lock... But it might fit KVM better?

^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge pages
       [not found]                                                       ` <draft-diqzh606mcz0.fsf@ackerleytng-ctop.c.googlers.com>
@ 2025-06-23 22:48                                                         ` Ackerley Tng
  2025-06-24 10:18                                                           ` Yan Zhao
                                                                             ` (2 more replies)
  0 siblings, 3 replies; 294+ messages in thread
From: Ackerley Tng @ 2025-06-23 22:48 UTC (permalink / raw)
  To: Yan Zhao, Edgecombe, Rick P
  Cc: quic_eberman@quicinc.com, Li, Xiaoyao, Shutemov, Kirill,
	Hansen, Dave, david@redhat.com, thomas.lendacky@amd.com,
	vbabka@suse.cz, tabba@google.com, Du, Fan,
	linux-kernel@vger.kernel.org, seanjc@google.com, Weiny, Ira,
	pbonzini@redhat.com, binbin.wu@linux.intel.com, Yamahata, Isaku,
	michael.roth@amd.com, Peng, Chao P, kvm@vger.kernel.org,
	Annapurve, Vishal, jroedel@suse.de, Miao, Jun, Li, Zhiquan1,
	pgonda@google.com, x86@kernel.org

Ackerley Tng <ackerleytng@google.com> writes:

> Yan Zhao <yan.y.zhao@intel.com> writes:
>
>> On Wed, Jun 18, 2025 at 08:41:38AM +0800, Edgecombe, Rick P wrote:
>>> On Wed, 2025-06-18 at 08:19 +0800, Yan Zhao wrote:
>>> > > I don't think a potential bug in KVM is a good enough reason. If we are
>>> > > concerned can we think about a warning instead?
>>> > > 
>>> > > We had talked enhancing kasan to know when a page is mapped into S-EPT in
>>> > > the
>>> > > past. So rather than design around potential bugs we could focus on having a
>>> > > simpler implementation with the infrastructure to catch and fix the bugs.
>>> > However, if failing to remove a guest private page would only cause memory
>>> > leak,
>>> > it's fine. 
>>> > If TDX does not hold any refcount, guest_memfd has to know that which private
>>> > page is still mapped. Otherwise, the page may be re-assigned to other kernel
>>> > components while it may still be mapped in the S-EPT.
>>> 
>>> KASAN detects use-after-free's like that. However, the TDX module code is not
>>> instrumented. It won't check against the KASAN state for it's accesses.
>>> 
>>> I had a brief chat about this with Dave and Kirill. A couple ideas were
>>> discussed. One was to use page_ext to keep a flag that says the page is in-use
>> Thanks!
>>
>> To use page_ext, should we introduce a new flag PAGE_EXT_FIRMWARE_IN_USE,
>> similar to PAGE_EXT_YOUNG?
>>
>> Due to similar issues as those with normal page/folio flags (see the next
>> comment for details), TDX needs to set PAGE_EXT_FIRMWARE_IN_USE on a
>> page-by-page basis rather than folio-by-folio.
>>
>> Additionally, it seems reasonable for guest_memfd not to copy the
>> PAGE_EXT_FIRMWARE_IN_USE flag when splitting a huge folio?
>> (in __folio_split() --> split_folio_to_order(), PAGE_EXT_YOUNG and
>> PAGE_EXT_IDLE are copied to the new folios though).
>>
>> Furthermore, page_ext uses extra memory. With CONFIG_64BIT, should we instead
>> introduce a PG_firmware_in_use in page flags, similar to PG_young and PG_idle?
>>

I think neither page flags nor page_ext will work for us, but see below.

>>> by the TDX module. There was also some discussion of using a normal page flag,
>>> and that the reserved page flag might prevent some of the MM operations that
>>> would be needed on guestmemfd pages. I didn't see the problem when I looked.
>>> 
>>> For the solution, basically the SEAMCALL wrappers set a flag when they hand a
>>> page to the TDX module, and clear it when they successfully reclaim it via
>>> tdh_mem_page_remove() or tdh_phymem_page_reclaim(). Then if the page makes it
>>> back to the page allocator, a warning is generated.
>> After some testing, to use a normal page flag, we may need to set it on a
>> page-by-page basis rather than folio-by-folio. See "Scheme 1".
>> And guest_memfd may need to selectively copy page flags when splitting huge
>> folios. See "Scheme 2".
>>
>> Scheme 1: Set/unset page flag on folio-by-folio basis, i.e.
>>         - set folio reserved at tdh_mem_page_aug(), tdh_mem_page_add(),
>>         - unset folio reserved after a successful tdh_mem_page_remove() or
>>           tdh_phymem_page_reclaim().
>>
>>         It has problem in following scenario:
>>         1. tdh_mem_page_aug() adds a 2MB folio. It marks the folio as reserved
>> 	   via "folio_set_reserved(page_folio(page))"
>>
>>         2. convert a 4KB page of the 2MB folio to shared.
>>         2.1 tdh_mem_page_demote() is executed first.
>>        
>>         2.2 tdh_mem_page_remove() then removes the 4KB mapping.
>>             "folio_clear_reserved(page_folio(page))" clears reserved flag for
>>             the 2MB folio while the rest 511 pages are still mapped in the
>>             S-EPT.
>>
>>         2.3. guest_memfd splits the 2MB folio into 512 4KB folios.
>>
>>

Folio flags on their own won't work because they're not precise
enough. A folio can be multiple 4K pages, and if a 4K page had failed to
unmap, we want to be able to indicate which 4K page had the failure,
instead of the entire folio. (But see below)

>> Scheme 2: Set/unset page flag on page-by-page basis, i.e.
>>         - set page flag reserved at tdh_mem_page_aug(), tdh_mem_page_add(),
>>         - unset page flag reserved after a successful tdh_mem_page_remove() or
>>           tdh_phymem_page_reclaim().
>>
>>         It has problem in following scenario:
>>         1. tdh_mem_page_aug() adds a 2MB folio. It marks pages as reserved by
>>            invoking "SetPageReserved()" on each page.
>>            As the folio->flags equals to page[0]->flags, folio->flags is also
>> 	   with reserved set.
>>
>>         2. convert a 4KB page of the 2MB folio to shared. say, it's page[4].
>>         2.1 tdh_mem_page_demote() is executed first.
>>        
>>         2.2 tdh_mem_page_remove() then removes the 4KB mapping.
>>             "ClearPageReserved()" clears reserved flag of page[4] of the 2MB
>>             folio.
>>
>>         2.3. guest_memfd splits the 2MB folio into 512 4KB folios.
>>              In guestmem_hugetlb_split_folio(), "p->flags = folio->flags" marks
>>              page[4]->flags as reserved again as page[0] is still reserved.
>>
>>             (see the code in https://lore.kernel.org/all/2ae41e0d80339da2b57011622ac2288fed65cd01.1747264138.git.ackerleytng@google.com/
>>             for (i = 1; i < orig_nr_pages; ++i) {
>>                 struct page *p = folio_page(folio, i);
>>
>>                 /* Copy flags from the first page to split pages. */
>>                 p->flags = folio->flags;
>>
>>                 p->mapping = NULL;
>>                 clear_compound_head(p);
>>             }
>>             )
>>

Per-page flags won't work because we want to retain HugeTLB Vmemmap
Optimization (HVO), which allows subsequent (identical) struct pages to
alias to each other. If we use a per-page flag, then HVO would break
since struct pages would no longer be identical to each other.

>> [...]

Let me try and summarize the current state of this discussion:

Topic 1: Does TDX need to somehow indicate that it is using a page?

This patch series uses refcounts to indicate that TDX is using a page,
but that complicates private-to-shared conversions.

During a private-to-shared conversion, guest_memfd assumes that
guest_memfd is trusted to manage private memory. TDX and other users
should trust guest_memfd to keep the memory around.

Yan's position is that holding a refcount is in line with how IOMMU
takes a refcount when a page is mapped into the IOMMU [1].

Yan had another suggestion, which is to indicate using a page flag [2].

I think we're in agreement that we don't want to have TDX hold a
refcount while the page is mapped into the Secure EPTs, but taking a
step back, do we really need to indicate (at all) that TDX is using a
page?

In [3] Yan said

> If TDX does not hold any refcount, guest_memfd has to know that which
> private
> page is still mapped. Otherwise, the page may be re-assigned to other
> kernel
> components while it may still be mapped in the S-EPT.

If the private page is mapped for regular VM use as private memory,
guest_memfd is managing that, and the same page will not be re-assigned
to any other kernel component. guest_memfd does hold refcounts in
guest_memfd's filemap.

If the private page is still mapped because there was an unmapping
failure, we can discuss that separately under error handling in Topic 2.

With this, can I confirm that we are in agreement that TDX does not need
to indicate that it is using a page, and can trust guest_memfd to keep
the page around for the VM?

Topic 2: How to handle unmapping/splitting errors arising from TDX?

Previously I was in favor of having unmap() return an error (Rick
suggested doing a POC, and in a more recent email Rick asked for a
diffstat), but Vishal and I talked about this and now I agree having
unmapping return an error is not a good approach for these reasons.

1. Unmapping takes a range, and within the range there could be more
   than one unmapping error. I was previously thinking that unmap()
   could return 0 for success and the failed PFN on error. Returning a
   single PFN on error is okay-ish but if there are more errors it could
   get complicated.

   Another error return option could be to return the folio where the
   unmapping/splitting issue happened, but that would not be
   sufficiently precise, since a folio could be larger than 4K and we
   want to track errors as precisely as we can to reduce memory loss due
   to errors.

2. What I think Yan has been trying to say: unmap() returning an error
   is non-standard in the kernel.

I think (1) is the dealbreaker here and there's no need to do the
plumbing POC and diffstat.

So I think we're all in support of indicating unmapping/splitting issues
without returning anything from unmap(), and the discussed options are

a. Refcounts: won't work - mostly discussed in this (sub-)thread
   [3]. Using refcounts makes it impossible to distinguish between
   transient refcounts and refcounts due to errors.

b. Page flags: won't work with/can't benefit from HVO.

Suggestions still in the running:

c. Folio flags are not precise enough to indicate which page actually
   had an error, but this could be sufficient if we're willing to just
   waste the rest of the huge page on unmapping error.

d. Folio flags with folio splitting on error. This means that on
   unmapping/Secure EPT PTE splitting error, we have to split the
   (larger than 4K) folio to 4K, and then set a flag on the split folio.

   The issue I see with this is that splitting pages with HVO applied
   means doing allocations, and in an error scenario there may not be
   memory left to split the pages.

e. Some other data structure in guest_memfd, say, a linked list, and a
   function like kvm_gmem_add_error_pfn(struct page *page) that would
   look up the guest_memfd inode from the page and add the page's pfn to
   the linked list.

   Everywhere in guest_memfd that does unmapping/splitting would then
   check this linked list to see if the unmapping/splitting
   succeeded.

   Everywhere in guest_memfd that allocates pages will also check this
   linked list to make sure the pages are functional.

   When guest_memfd truncates, if the page being truncated is on the
   list, retain the refcount on the page and leak that page. (See the rough
   sketch after this list.)

f. Combination of c and e, something similar to HugeTLB's
   folio_set_hugetlb_hwpoison(), which sets a flag AND adds the pages in
   trouble to a linked list on the folio.

g. Like f, but basically treat an unmapping error as hardware poisoning.

I'm kind of inclined towards g, to just treat unmapping errors as
HWPOISON and buy into all the HWPOISON handling requirements. What do
yall think? Can a TDX unmapping error be considered as memory poisoning?
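To make option e a bit more concrete, here is a rough sketch. Every name below
is made up, and how the list hangs off the guest_memfd inode is hand-waved:

struct kvm_gmem_error_list {
	spinlock_t lock;
	struct list_head head;
};

struct kvm_gmem_error_entry {
	struct list_head list;
	unsigned long pfn;
};

/*
 * Called by TDX (or another user) when unmapping/splitting a pfn fails,
 * so that guest_memfd can refuse to reuse, split, or free that page later.
 */
void kvm_gmem_add_error_pfn(struct page *page)
{
	struct folio *folio = page_folio(page);
	struct inode *inode = folio->mapping->host;
	struct kvm_gmem_error_list *errs = kvm_gmem_error_list(inode); /* made up */
	struct kvm_gmem_error_entry *e = kmalloc(sizeof(*e), GFP_ATOMIC);

	if (!e)
		return; /* would need a fallback, e.g. a whole-folio flag as in f */

	e->pfn = page_to_pfn(page);
	spin_lock(&errs->lock);
	list_add(&e->list, &errs->head);
	spin_unlock(&errs->lock);
}

Truncation and allocation paths would then check the list before freeing or
reusing a page.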


[1] https://lore.kernel.org/all/aE%2F1TgUvr0dcaJUg@yzhao56-desk.sh.intel.com/
[2] https://lore.kernel.org/all/aFkeBtuNBN1RrDAJ@yzhao56-desk.sh.intel.com/
[3] https://lore.kernel.org/all/aFIGFesluhuh2xAS@yzhao56-desk.sh.intel.com/
[3] https://lore.kernel.org/all/aFJjZFFhrMWEPjQG@yzhao56-desk.sh.intel.com/

^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 09/21] KVM: TDX: Enable 2MB mapping size after TD is RUNNABLE
  2025-06-23 21:44                                                 ` Edgecombe, Rick P
@ 2025-06-24  9:57                                                   ` Yan Zhao
  2025-06-24 18:35                                                     ` Edgecombe, Rick P
  0 siblings, 1 reply; 294+ messages in thread
From: Yan Zhao @ 2025-06-24  9:57 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: seanjc@google.com, Du, Fan, Li, Xiaoyao, Huang, Kai,
	Shutemov, Kirill, Hansen, Dave, david@redhat.com,
	thomas.lendacky@amd.com, vbabka@suse.cz, Li, Zhiquan1,
	quic_eberman@quicinc.com, michael.roth@amd.com,
	linux-kernel@vger.kernel.org, Weiny, Ira, Peng, Chao P,
	pbonzini@redhat.com, Yamahata, Isaku, ackerleytng@google.com,
	binbin.wu@linux.intel.com, kvm@vger.kernel.org, Annapurve, Vishal,
	tabba@google.com, jroedel@suse.de, Miao, Jun, pgonda@google.com,
	x86@kernel.org

On Tue, Jun 24, 2025 at 05:44:17AM +0800, Edgecombe, Rick P wrote:
> On Fri, 2025-06-20 at 09:31 -0700, Sean Christopherson wrote:
> > > On Wed, Jun 18, 2025, Yan Zhao wrote:
> > > > >    when an EPT violation carries an ACCEPT level info
> > > > >    KVM maps the page at map level <= the specified level.
> > > 
> > > No.  I want KVM to map at the maximal level KVM supports, irrespective of what
> > > the guest's ACCEPT level says.  I.e. I want KVM to be able to completely ignore
> > > the ACCEPT level.
> 
> This is what I was thinking, but I'm starting to think it might not be a good
> idea.
> 
> The PAGE_SIZE_MISMATCH error code asymmetry is indeed weird. But "accepted" is
> in some important ways a type of permission that is controllable by both the
> guest and host. To change the ABI and guests such that the permission is still
> controlled by the host and guest, but the allowed granularity is only
> controllable by the host, feels wrong in a couple ways.
> 
> First, it turns host mapping details into guest ABI that could break guests that
> rely on it. Second, it bets that there will never be a need for guests to set
> the accept state on a specific smaller granularity. Otherwise, this path would 
> just be a temporary shortcut and not about components imposing things that are
> none of their business.
> 
> Instead I think the two impositions that matter here are:
> 1. TDX requires size to be passed through the generic fault handler somehow.
> 2. TDX demote is hard to make work under mmu read lock (already working on this
> one)
> 
> Sean, were the two options for (1) really that bad? Or how do you think about
> changing directions in general and we can try to find some other options?
> 
> On the subject of alternates to (1). I wonder if the ugly part is that both of
> the options sort of break the KVM model where the TDP is not the real backing
> state. TDG.MEM.PAGE.ACCEPT is kind of two things, changing the "permission" of
> the memory *and* the mapping of it. TDX module asks, map this at this page size
> so that I can map it at the right permission. KVM would rather learn that the
> permission from the backing GPA info (memslots, etc) and then map it at it's
> correct page size. Like what happens with kvm_lpage_info->disallow_lpage.
Could we provide the info via the private_max_mapping_level hook (i.e. via
tdx_gmem_private_max_mapping_level())?

Or what about introducing a vendor hook in __kvm_mmu_max_mapping_level() for a
private fault?

> Maybe we could have EPT violations that contain 4k accept sizes first update the
> attribute for the GFN to be accepted or not, like have tdx.c call out to set
> kvm_lpage_info->disallow_lpage in the rarer case of 4k accept size? Or something
Something like kvm_lpage_info->disallow_lpage would disallow later page
promotion, though we don't support it right now.

> like that. Maybe set a "accepted" attribute, or something. Not sure if could be
Setting "accepted" attribute in the EPT violation handler?
It's a little odd, as the accept operation is not yet completed.

> done without the mmu write lock... But it might fit KVM better?

^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge pages
  2025-06-23 22:48                                                         ` Ackerley Tng
@ 2025-06-24 10:18                                                           ` Yan Zhao
  2025-06-24 21:29                                                             ` Ackerley Tng
  2025-06-24 22:00                                                           ` Edgecombe, Rick P
  2025-06-24 22:03                                                           ` Edgecombe, Rick P
  2 siblings, 1 reply; 294+ messages in thread
From: Yan Zhao @ 2025-06-24 10:18 UTC (permalink / raw)
  To: Ackerley Tng
  Cc: Edgecombe, Rick P, quic_eberman@quicinc.com, Li, Xiaoyao,
	Shutemov, Kirill, Hansen, Dave, david@redhat.com,
	thomas.lendacky@amd.com, vbabka@suse.cz, tabba@google.com,
	Du, Fan, linux-kernel@vger.kernel.org, seanjc@google.com,
	Weiny, Ira, pbonzini@redhat.com, binbin.wu@linux.intel.com,
	Yamahata, Isaku, michael.roth@amd.com, Peng, Chao P,
	kvm@vger.kernel.org, Annapurve, Vishal, jroedel@suse.de,
	Miao, Jun, Li, Zhiquan1, pgonda@google.com, x86@kernel.org

On Mon, Jun 23, 2025 at 03:48:48PM -0700, Ackerley Tng wrote:
> Ackerley Tng <ackerleytng@google.com> writes:
> 
> > Yan Zhao <yan.y.zhao@intel.com> writes:
> >
> >> On Wed, Jun 18, 2025 at 08:41:38AM +0800, Edgecombe, Rick P wrote:
> >>> On Wed, 2025-06-18 at 08:19 +0800, Yan Zhao wrote:
> >>> > > I don't think a potential bug in KVM is a good enough reason. If we are
> >>> > > concerned can we think about a warning instead?
> >>> > > 
> >>> > > We had talked enhancing kasan to know when a page is mapped into S-EPT in
> >>> > > the
> >>> > > past. So rather than design around potential bugs we could focus on having a
> >>> > > simpler implementation with the infrastructure to catch and fix the bugs.
> >>> > However, if failing to remove a guest private page would only cause memory
> >>> > leak,
> >>> > it's fine. 
> >>> > If TDX does not hold any refcount, guest_memfd has to know that which private
> >>> > page is still mapped. Otherwise, the page may be re-assigned to other kernel
> >>> > components while it may still be mapped in the S-EPT.
> >>> 
> >>> KASAN detects use-after-free's like that. However, the TDX module code is not
> >>> instrumented. It won't check against the KASAN state for it's accesses.
> >>> 
> >>> I had a brief chat about this with Dave and Kirill. A couple ideas were
> >>> discussed. One was to use page_ext to keep a flag that says the page is in-use
> >> Thanks!
> >>
> >> To use page_ext, should we introduce a new flag PAGE_EXT_FIRMWARE_IN_USE,
> >> similar to PAGE_EXT_YOUNG?
> >>
> >> Due to similar issues as those with normal page/folio flags (see the next
> >> comment for details), TDX needs to set PAGE_EXT_FIRMWARE_IN_USE on a
> >> page-by-page basis rather than folio-by-folio.
> >>
> >> Additionally, it seems reasonable for guest_memfd not to copy the
> >> PAGE_EXT_FIRMWARE_IN_USE flag when splitting a huge folio?
> >> (in __folio_split() --> split_folio_to_order(), PAGE_EXT_YOUNG and
> >> PAGE_EXT_IDLE are copied to the new folios though).
> >>
> >> Furthermore, page_ext uses extra memory. With CONFIG_64BIT, should we instead
> >> introduce a PG_firmware_in_use in page flags, similar to PG_young and PG_idle?
> >>
> 
> I think neither page flags nor page_ext will work for us, but see below.
> 
> >>> by the TDX module. There was also some discussion of using a normal page flag,
> >>> and that the reserved page flag might prevent some of the MM operations that
> >>> would be needed on guestmemfd pages. I didn't see the problem when I looked.
> >>> 
> >>> For the solution, basically the SEAMCALL wrappers set a flag when they hand a
> >>> page to the TDX module, and clear it when they successfully reclaim it via
> >>> tdh_mem_page_remove() or tdh_phymem_page_reclaim(). Then if the page makes it
> >>> back to the page allocator, a warning is generated.
> >> After some testing, to use a normal page flag, we may need to set it on a
> >> page-by-page basis rather than folio-by-folio. See "Scheme 1".
> >> And guest_memfd may need to selectively copy page flags when splitting huge
> >> folios. See "Scheme 2".
> >>
> >> Scheme 1: Set/unset page flag on folio-by-folio basis, i.e.
> >>         - set folio reserved at tdh_mem_page_aug(), tdh_mem_page_add(),
> >>         - unset folio reserved after a successful tdh_mem_page_remove() or
> >>           tdh_phymem_page_reclaim().
> >>
> >>         It has problem in following scenario:
> >>         1. tdh_mem_page_aug() adds a 2MB folio. It marks the folio as reserved
> >> 	   via "folio_set_reserved(page_folio(page))"
> >>
> >>         2. convert a 4KB page of the 2MB folio to shared.
> >>         2.1 tdh_mem_page_demote() is executed first.
> >>        
> >>         2.2 tdh_mem_page_remove() then removes the 4KB mapping.
> >>             "folio_clear_reserved(page_folio(page))" clears reserved flag for
> >>             the 2MB folio while the rest 511 pages are still mapped in the
> >>             S-EPT.
> >>
> >>         2.3. guest_memfd splits the 2MB folio into 512 4KB folios.
> >>
> >>
> 
> Folio flags on their own won't work because they're not precise
> enough. A folio can be multiple 4K pages, and if a 4K page had failed to
> unmap, we want to be able to indicate which 4K page had the failure,
> instead of the entire folio. (But see below)
> 
> >> Scheme 2: Set/unset page flag on page-by-page basis, i.e.
> >>         - set page flag reserved at tdh_mem_page_aug(), tdh_mem_page_add(),
> >>         - unset page flag reserved after a successful tdh_mem_page_remove() or
> >>           tdh_phymem_page_reclaim().
> >>
> >>         It has problem in following scenario:
> >>         1. tdh_mem_page_aug() adds a 2MB folio. It marks pages as reserved by
> >>            invoking "SetPageReserved()" on each page.
> >>            As the folio->flags equals to page[0]->flags, folio->flags is also
> >> 	   with reserved set.
> >>
> >>         2. convert a 4KB page of the 2MB folio to shared. say, it's page[4].
> >>         2.1 tdh_mem_page_demote() is executed first.
> >>        
> >>         2.2 tdh_mem_page_remove() then removes the 4KB mapping.
> >>             "ClearPageReserved()" clears reserved flag of page[4] of the 2MB
> >>             folio.
> >>
> >>         2.3. guest_memfd splits the 2MB folio into 512 4KB folios.
> >>              In guestmem_hugetlb_split_folio(), "p->flags = folio->flags" marks
> >>              page[4]->flags as reserved again as page[0] is still reserved.
> >>
> >>             (see the code in https://lore.kernel.org/all/2ae41e0d80339da2b57011622ac2288fed65cd01.1747264138.git.ackerleytng@google.com/
> >>             for (i = 1; i < orig_nr_pages; ++i) {
> >>                 struct page *p = folio_page(folio, i);
> >>
> >>                 /* Copy flags from the first page to split pages. */
> >>                 p->flags = folio->flags;
> >>
> >>                 p->mapping = NULL;
> >>                 clear_compound_head(p);
> >>             }
> >>             )
> >>
> 
> Per-page flags won't work because we want to retain HugeTLB Vmemmap
> Optimization (HVO), which allows subsequent (identical) struct pages to
> alias to each other. If we use a per-page flag, then HVO would break
> since struct pages would no longer be identical to each other.
Ah, I overlooked HVO.
In my testing, neither CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP_DEFAULT_ON nor
hugetlb_free_vmemmap was set.

With HVO enabled, setting page flags on a per-page basis indeed does not work.
> 
> >> [...]
> 
> Let me try and summarize the current state of this discussion:
Thanks for this summary.


> Topic 1: Does TDX need to somehow indicate that it is using a page?
> 
> This patch series uses refcounts to indicate that TDX is using a page,
> but that complicates private-to-shared conversions.
> 
> During a private-to-shared conversion, guest_memfd assumes that
> guest_memfd is trusted to manage private memory. TDX and other users
> should trust guest_memfd to keep the memory around.
> 
> Yan's position is that holding a refcount is in line with how IOMMU
> takes a refcount when a page is mapped into the IOMMU [1].
> 
> Yan had another suggestion, which is to indicate using a page flag [2].
> 
> I think we're in agreement that we don't want to have TDX hold a
> refcount while the page is mapped into the Secure EPTs, but taking a
> step back, do we really need to indicate (at all) that TDX is using a
> page?
> 
> In [3] Yan said
> 
> > If TDX does not hold any refcount, guest_memfd has to know that which
> > private
> > page is still mapped. Otherwise, the page may be re-assigned to other
> > kernel
> > components while it may still be mapped in the S-EPT.
> 
> If the private page is mapped for regular VM use as private memory,
> guest_memfd is managing that, and the same page will not be re-assigned
> to any other kernel component. guest_memfd does hold refcounts in
> guest_memfd's filemap.
After kvm_gmem_release(), guest_memfd will return folios to hugetlb, so the same
page could be re-assigned to other kernel components that allocate pages from
hugetlb.

> 
> If the private page is still mapped because there was an unmapping
> failure, we can discuss that separately under error handling in Topic 2.
> 
> With this, can I confirm that we are in agreement that TDX does not need
> to indicate that it is using a page, and can trust guest_memfd to keep
> the page around for the VM?
I thought it's not a must until I came across a comment from Sean:
"Should these bail early if the KVM_BUG_ON() is hit?  Calling into the TDX module
after bugging the VM is a bit odd."
https://lore.kernel.org/kvm/Z4r_XNcxPWpgjZio@google.com/#t.

This comment refers to the following scenario:
when a 2MB non-leaf entry in the mirror root is zapped with shared mmu_lock,
BUG_ON() will be triggered for TDX. But by the time handle_removed_pt() is
reached, the 2MB non-leaf entry would have been successfully removed in the
mirror root.

Bailing out early in remove_external_spte() would prevent the removal of the 4KB
private guest pages from the S-EPT later, due to the lack of a corresponding
entry in the mirror root.

Since the KVM MMU does not hold the guest page's refcount, failing to notify TDX
about the removal of a guest page could result in a situation where a page still
mapped in the S-EPT is freed and re-allocated by the OS.

Therefore, indicating that TDX is using a page can be less error-prone, though
it does consume more memory.

> Topic 2: How to handle unmapping/splitting errors arising from TDX?
> 
> Previously I was in favor of having unmap() return an error (Rick
> suggested doing a POC, and in a more recent email Rick asked for a
> diffstat), but Vishal and I talked about this and now I agree having
> unmapping return an error is not a good approach for these reasons.
> 
> 1. Unmapping takes a range, and within the range there could be more
>    than one unmapping error. I was previously thinking that unmap()
>    could return 0 for success and the failed PFN on error. Returning a
>    single PFN on error is okay-ish but if there are more errors it could
>    get complicated.
> 
>    Another error return option could be to return the folio where the
>    unmapping/splitting issue happened, but that would not be
>    sufficiently precise, since a folio could be larger than 4K and we
>    want to track errors as precisely as we can to reduce memory loss due
>    to errors.
> 
> 2. What I think Yan has been trying to say: unmap() returning an error
>    is non-standard in the kernel.
> 
> I think (1) is the dealbreaker here and there's no need to do the
> plumbing POC and diffstat.
> 
> So I think we're all in support of indicating unmapping/splitting issues
> without returning anything from unmap(), and the discussed options are
> 
> a. Refcounts: won't work - mostly discussed in this (sub-)thread
>    [3]. Using refcounts makes it impossible to distinguish between
>    transient refcounts and refcounts due to errors.
> 
> b. Page flags: won't work with/can't benefit from HVO.
> 
> Suggestions still in the running:
> 
> c. Folio flags are not precise enough to indicate which page actually
>    had an error, but this could be sufficient if we're willing to just
>    waste the rest of the huge page on unmapping error.
For 1GB folios, more precise info will be better.


> d. Folio flags with folio splitting on error. This means that on
>    unmapping/Secure EPT PTE splitting error, we have to split the
>    (larger than 4K) folio to 4K, and then set a flag on the split folio.
> 
>    The issue I see with this is that splitting pages with HVO applied
>    means doing allocations, and in an error scenario there may not be
>    memory left to split the pages.
Could we restore the page structures (i.e., undo HVO for the folio) before
triggering the unmap?

> 
> e. Some other data structure in guest_memfd, say, a linked list, and a
>    function like kvm_gmem_add_error_pfn(struct page *page) that would
>    look up the guest_memfd inode from the page and add the page's pfn to
>    the linked list.
>
>    Everywhere in guest_memfd that does unmapping/splitting would then
>    check this linked list to see if the unmapping/splitting
>    succeeded.
> 
>    Everywhere in guest_memfd that allocates pages will also check this
>    linked list to make sure the pages are functional.
> 
>    When guest_memfd truncates, if the page being truncated is on the
>    list, retain the refcount on the page and leak that page.
>
> f. Combination of c and e, something similar to HugeTLB's
>    folio_set_hugetlb_hwpoison(), which sets a flag AND adds the pages in
>    trouble to a linked list on the folio.
That seems like a good idea. If memory allocation for the linked list succeeds,
mark the pages within a folio as troublesome; otherwise, mark the entire folio
as troublesome.

But maybe c is good enough for 2MB folios.

> g. Like f, but basically treat an unmapping error as hardware poisoning.
Not sure if the hwpoison bit can be used directly; further investigation is
needed.

> I'm kind of inclined towards g, to just treat unmapping errors as
> HWPOISON and buying into all the HWPOISON handling requirements. What do
> yall think? Can a TDX unmapping error be considered as memory poisoning?
> 
> 
> [1] https://lore.kernel.org/all/aE%2F1TgUvr0dcaJUg@yzhao56-desk.sh.intel.com/
> [2] https://lore.kernel.org/all/aFkeBtuNBN1RrDAJ@yzhao56-desk.sh.intel.com/
> [3] https://lore.kernel.org/all/aFIGFesluhuh2xAS@yzhao56-desk.sh.intel.com/
> [3] https://lore.kernel.org/all/aFJjZFFhrMWEPjQG@yzhao56-desk.sh.intel.com/

^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 09/21] KVM: TDX: Enable 2MB mapping size after TD is RUNNABLE
  2025-06-24  9:57                                                   ` Yan Zhao
@ 2025-06-24 18:35                                                     ` Edgecombe, Rick P
  2025-06-25  9:28                                                       ` Yan Zhao
  2025-06-25 13:47                                                       ` Vishal Annapurve
  0 siblings, 2 replies; 294+ messages in thread
From: Edgecombe, Rick P @ 2025-06-24 18:35 UTC (permalink / raw)
  To: Zhao, Yan Y
  Cc: quic_eberman@quicinc.com, Li, Xiaoyao, Huang, Kai, Du, Fan,
	Hansen, Dave, david@redhat.com, thomas.lendacky@amd.com,
	vbabka@suse.cz, Li, Zhiquan1, Shutemov, Kirill,
	michael.roth@amd.com, seanjc@google.com, Weiny, Ira, Peng, Chao P,
	pbonzini@redhat.com, Yamahata, Isaku,
	linux-kernel@vger.kernel.org, binbin.wu@linux.intel.com,
	ackerleytng@google.com, kvm@vger.kernel.org, Annapurve, Vishal,
	tabba@google.com, jroedel@suse.de, Miao, Jun, pgonda@google.com,
	x86@kernel.org

On Tue, 2025-06-24 at 17:57 +0800, Yan Zhao wrote:
> Could we provide the info via the private_max_mapping_level hook (i.e. via
> tdx_gmem_private_max_mapping_level())?

This is one of the previous two methods discussed. Can you elaborate on what you
are trying to say?

> 
> Or what about introducing a vendor hook in __kvm_mmu_max_mapping_level() for a
> private fault?
> 
> > Maybe we could have EPT violations that contain 4k accept sizes first update the
> > attribute for the GFN to be accepted or not, like have tdx.c call out to set
> > kvm_lpage_info->disallow_lpage in the rarer case of 4k accept size? Or something
> Something like kvm_lpage_info->disallow_lpage would disallow later page
> promotion, though we don't support it right now.

Well I was originally thinking it would not set kvm_lpage_info->disallow_lpage
directly, but rely on the logic that checks for mixed attributes. But more
below...

> 
> > like that. Maybe set a "accepted" attribute, or something. Not sure if could be
> Setting "accepted" attribute in the EPT violation handler?
> It's a little odd, as the accept operation is not yet completed.

I guess the question in both of these comments is: what is the life cycle? The guest
could call TDG.MEM.PAGE.RELEASE to unaccept it as well. Oh, geez. It looks like
TDG.MEM.PAGE.RELEASE will give the same size hints in the EPT violation. So an
accept attribute is not going to work, at least without TDX module changes.


Actually, the problem we have doesn't fit the mixed-attributes behavior. If many
vCPUs accept a 2MB region at 4k page size, the entire 2MB range could be non-
mixed and then individual accepts would fail.


So instead there could be a KVM_LPAGE_GUEST_INHIBIT flag that doesn't get cleared
based on mixed attributes. It would be one way. It would need to get set by
something like kvm_write_track_add_gfn() that lives in tdx.c and is called
before going into the fault handler on a 4k accept size. It would have to take the
mmu write lock I think, which would kill scalability in the 4k accept case (but not
the normal 2MB one). But as long as the mmu write lock is held, demote will be no
problem, which the operation would also need to do.

I think it actually makes KVM's behavior easier to understand. We don't need to
worry about races between multiple accept sizes and things like that. It also
leaves the core MMU code mostly untouched. Performance/scalability wise it only
punishes the rare case.

To leave the option open to promote the GFNs in the future, a GHCI interface
or similar could be defined for the guest to say "I don't care about page size
anymore for this gfn". So it won't close it off forever.
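Roughly the shape I'm imagining on the tdx.c side (KVM_LPAGE_GUEST_INHIBIT and
the helper that sets it don't exist today, so this is only a sketch):

/* Called from tdx.c before entering the fault handler on a 4k accept size. */
static void tdx_guest_inhibit_huge_page(struct kvm *kvm, gfn_t gfn)
{
	struct kvm_memory_slot *slot = gfn_to_memslot(kvm, gfn);

	write_lock(&kvm->mmu_lock);
	/*
	 * Sticky per-2MB-region inhibit: unlike the mixed-attributes
	 * tracking it is never recomputed, only cleared by an explicit
	 * guest opt-out (e.g. a future GHCI call).
	 */
	kvm_lpage_info_set_guest_inhibit(slot, gfn, PG_LEVEL_2M); /* hypothetical */
	/* An existing 2MB mapping would also be demoted here, under the write lock. */
	write_unlock(&kvm->mmu_lock);
}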

> 
> > done without the mmu write lock... But it might fit KVM better?


^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge pages
  2025-06-24 10:18                                                           ` Yan Zhao
@ 2025-06-24 21:29                                                             ` Ackerley Tng
  2025-06-24 22:22                                                               ` Edgecombe, Rick P
  0 siblings, 1 reply; 294+ messages in thread
From: Ackerley Tng @ 2025-06-24 21:29 UTC (permalink / raw)
  To: Yan Zhao
  Cc: Edgecombe, Rick P, quic_eberman@quicinc.com, Li, Xiaoyao,
	Shutemov, Kirill, Hansen, Dave, david@redhat.com,
	thomas.lendacky@amd.com, vbabka@suse.cz, tabba@google.com,
	Du, Fan, linux-kernel@vger.kernel.org, seanjc@google.com,
	Weiny, Ira, pbonzini@redhat.com, binbin.wu@linux.intel.com,
	Yamahata, Isaku, michael.roth@amd.com, Peng, Chao P,
	kvm@vger.kernel.org, Annapurve, Vishal, jroedel@suse.de,
	Miao, Jun, Li, Zhiquan1, pgonda@google.com, x86@kernel.org

Yan Zhao <yan.y.zhao@intel.com> writes:

> On Mon, Jun 23, 2025 at 03:48:48PM -0700, Ackerley Tng wrote:
>> Ackerley Tng <ackerleytng@google.com> writes:
>> 
>> > Yan Zhao <yan.y.zhao@intel.com> writes:
>> >
>> >> On Wed, Jun 18, 2025 at 08:41:38AM +0800, Edgecombe, Rick P wrote:
>> >>> On Wed, 2025-06-18 at 08:19 +0800, Yan Zhao wrote:
>> >>> > > I don't think a potential bug in KVM is a good enough reason. If we are
>> >>> > > concerned can we think about a warning instead?
>> >>> > > 
>> >>> > > We had talked enhancing kasan to know when a page is mapped into S-EPT in
>> >>> > > the
>> >>> > > past. So rather than design around potential bugs we could focus on having a
>> >>> > > simpler implementation with the infrastructure to catch and fix the bugs.
>> >>> > However, if failing to remove a guest private page would only cause memory
>> >>> > leak,
>> >>> > it's fine. 
>> >>> > If TDX does not hold any refcount, guest_memfd has to know that which private
>> >>> > page is still mapped. Otherwise, the page may be re-assigned to other kernel
>> >>> > components while it may still be mapped in the S-EPT.
>> >>> 
>> >>> KASAN detects use-after-free's like that. However, the TDX module code is not
>> >>> instrumented. It won't check against the KASAN state for it's accesses.
>> >>> 
>> >>> I had a brief chat about this with Dave and Kirill. A couple ideas were
>> >>> discussed. One was to use page_ext to keep a flag that says the page is in-use
>> >> Thanks!
>> >>
>> >> To use page_ext, should we introduce a new flag PAGE_EXT_FIRMWARE_IN_USE,
>> >> similar to PAGE_EXT_YOUNG?
>> >>
>> >> Due to similar issues as those with normal page/folio flags (see the next
>> >> comment for details), TDX needs to set PAGE_EXT_FIRMWARE_IN_USE on a
>> >> page-by-page basis rather than folio-by-folio.
>> >>
>> >> Additionally, it seems reasonable for guest_memfd not to copy the
>> >> PAGE_EXT_FIRMWARE_IN_USE flag when splitting a huge folio?
>> >> (in __folio_split() --> split_folio_to_order(), PAGE_EXT_YOUNG and
>> >> PAGE_EXT_IDLE are copied to the new folios though).
>> >>
>> >> Furthermore, page_ext uses extra memory. With CONFIG_64BIT, should we instead
>> >> introduce a PG_firmware_in_use in page flags, similar to PG_young and PG_idle?
>> >>
>> 
>> I think neither page flags nor page_ext will work for us, but see below.
>> 
>> >>> by the TDX module. There was also some discussion of using a normal page flag,
>> >>> and that the reserved page flag might prevent some of the MM operations that
>> >>> would be needed on guestmemfd pages. I didn't see the problem when I looked.
>> >>> 
>> >>> For the solution, basically the SEAMCALL wrappers set a flag when they hand a
>> >>> page to the TDX module, and clear it when they successfully reclaim it via
>> >>> tdh_mem_page_remove() or tdh_phymem_page_reclaim(). Then if the page makes it
>> >>> back to the page allocator, a warning is generated.
>> >> After some testing, to use a normal page flag, we may need to set it on a
>> >> page-by-page basis rather than folio-by-folio. See "Scheme 1".
>> >> And guest_memfd may need to selectively copy page flags when splitting huge
>> >> folios. See "Scheme 2".
>> >>
>> >> Scheme 1: Set/unset page flag on folio-by-folio basis, i.e.
>> >>         - set folio reserved at tdh_mem_page_aug(), tdh_mem_page_add(),
>> >>         - unset folio reserved after a successful tdh_mem_page_remove() or
>> >>           tdh_phymem_page_reclaim().
>> >>
>> >>         It has problem in following scenario:
>> >>         1. tdh_mem_page_aug() adds a 2MB folio. It marks the folio as reserved
>> >> 	   via "folio_set_reserved(page_folio(page))"
>> >>
>> >>         2. convert a 4KB page of the 2MB folio to shared.
>> >>         2.1 tdh_mem_page_demote() is executed first.
>> >>        
>> >>         2.2 tdh_mem_page_remove() then removes the 4KB mapping.
>> >>             "folio_clear_reserved(page_folio(page))" clears reserved flag for
>> >>             the 2MB folio while the rest 511 pages are still mapped in the
>> >>             S-EPT.
>> >>
>> >>         2.3. guest_memfd splits the 2MB folio into 512 4KB folios.
>> >>
>> >>
>> 
>> Folio flags on their own won't work because they're not precise
>> enough. A folio can be multiple 4K pages, and if a 4K page had failed to
>> unmap, we want to be able to indicate which 4K page had the failure,
>> instead of the entire folio. (But see below)
>> 
>> >> Scheme 2: Set/unset page flag on page-by-page basis, i.e.
>> >>         - set page flag reserved at tdh_mem_page_aug(), tdh_mem_page_add(),
>> >>         - unset page flag reserved after a successful tdh_mem_page_remove() or
>> >>           tdh_phymem_page_reclaim().
>> >>
>> >>         It has a problem in the following scenario:
>> >>         1. tdh_mem_page_aug() adds a 2MB folio. It marks pages as reserved by
>> >>            invoking "SetPageReserved()" on each page.
>> >>            As folio->flags is the same as page[0]->flags, folio->flags also
>> >> 	   has the reserved flag set.
>> >>
>> >>         2. Convert a 4KB page of the 2MB folio to shared, say page[4].
>> >>         2.1 tdh_mem_page_demote() is executed first.
>> >>        
>> >>         2.2 tdh_mem_page_remove() then removes the 4KB mapping.
>> >>             "ClearPageReserved()" clears reserved flag of page[4] of the 2MB
>> >>             folio.
>> >>
>> >>         2.3. guest_memfd splits the 2MB folio into 512 4KB folios.
>> >>              In guestmem_hugetlb_split_folio(), "p->flags = folio->flags" marks
>> >>              page[4]->flags as reserved again as page[0] is still reserved.
>> >>
>> >>             (see the code in https://lore.kernel.org/all/2ae41e0d80339da2b57011622ac2288fed65cd01.1747264138.git.ackerleytng@google.com/
>> >>             for (i = 1; i < orig_nr_pages; ++i) {
>> >>                 struct page *p = folio_page(folio, i);
>> >>
>> >>                 /* Copy flags from the first page to split pages. */
>> >>                 p->flags = folio->flags;
>> >>
>> >>                 p->mapping = NULL;
>> >>                 clear_compound_head(p);
>> >>             }
>> >>             )
>> >>
>> 
>> Per-page flags won't work because we want to retain HugeTLB Vmemmap
>> Optimization (HVO), which allows subsequent (identical) struct pages to
>> alias to each other. If we use a per-page flag, then HVO would break
>> since struct pages would no longer be identical to each other.
> Ah, I overlooked HVO.
> In my testing, neither CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP_DEFAULT_ON nor
> hugetlb_free_vmemmap was set.
>
> With HVO enabled, setting page flags on a per-page basis indeed does not work.
>> 
>> >> [...]
>> 
>> Let me try and summarize the current state of this discussion:
> Thanks for this summary.
>
>
>> Topic 1: Does TDX need to somehow indicate that it is using a page?
>> 
>> This patch series uses refcounts to indicate that TDX is using a page,
>> but that complicates private-to-shared conversions.
>> 
>> During a private-to-shared conversion, guest_memfd assumes that
>> guest_memfd is trusted to manage private memory. TDX and other users
>> should trust guest_memfd to keep the memory around.
>> 
>> Yan's position is that holding a refcount is in line with how IOMMU
>> takes a refcount when a page is mapped into the IOMMU [1].
>> 
>> Yan had another suggestion, which is to indicate using a page flag [2].
>> 
>> I think we're in agreement that we don't want to have TDX hold a
>> refcount while the page is mapped into the Secure EPTs, but taking a
>> step back, do we really need to indicate (at all) that TDX is using a
>> page?
>> 
>> In [3] Yan said
>> 
>> > If TDX does not hold any refcount, guest_memfd has to know which private
>> > page is still mapped. Otherwise, the page may be re-assigned to other
>> > kernel components while it may still be mapped in the S-EPT.
>> 
>> If the private page is mapped for regular VM use as private memory,
>> guest_memfd is managing that, and the same page will not be re-assigned
>> to any other kernel component. guest_memfd does hold refcounts in
>> guest_memfd's filemap.
> After kvm_gmem_release(), guest_memfd will return folios to hugetlb, so the same
> page could be re-assigned to other kernel components that allocate pages from
> hugetlb.
>

Yes, at kvm_gmem_release(), guest_memfd also unmaps all the private and
shared pages from Secure EPTs, so on successful unmap, all is good, the
page can and should be reused elsewhere.

On unsuccessful unmap, we can then handle that according to the option
we pick from below. e.g. if we pick options f or g, the page should
never be reused elsewhere.

If we pick other options, then we would have to ensure the page doesn't
get used elsewhere anyway, so TDX still won't need to hold a refcount
while using the page?

>> 
>> If the private page is still mapped because there was an unmapping
>> failure, we can discuss that separately under error handling in Topic 2.
>> 
>> With this, can I confirm that we are in agreement that TDX does not need
>> to indicate that it is using a page, and can trust guest_memfd to keep
>> the page around for the VM?
> I thought it was not a must until I came across a comment from Sean:
> "Should these bail early if the KVM_BUG_ON() is hit?  Calling into the TDX module
> after bugging the VM is a bit odd."
> https://lore.kernel.org/kvm/Z4r_XNcxPWpgjZio@google.com/#t.
>
> This comment refers to the following scenario:
> when a 2MB non-leaf entry in the mirror root is zapped with shared mmu_lock,
> BUG_ON() will be triggered for TDX. But by the time handle_removed_pt() is
> reached, the 2MB non-leaf entry would have been successfully removed in the
> mirror root.
>
> Bailing out early in remove_external_spte() would prevent the removal of 4KB
> private guest pages in the S-EPT later due to the lack of a corresponding entry
> in the mirror root.
>
> Since the KVM MMU does not hold the guest page's refcount, failing to notify TDX about
> the removal of a guest page could result in a situation where a page still
> mapped in the S-EPT is freed and re-allocated by the OS. 
>
> Therefore, indicating that TDX is using a page can be less error-prone, though
> it does consume more memory.
>

This is true, but instead of indicating upfront that the page is in use by
TDX, how about treating this as an error: we should not bail out before
first marking an error on the page, for example using option f or g below.
Marking an error on the page prevents the case where a page still mapped
in the S-EPT is freed and reused elsewhere by the OS.
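
To make that concrete, a minimal sketch (not actual guest_memfd code;
kvm_gmem_folio_has_unmap_error() is a made-up placeholder for whichever
marker we end up picking below) of what the free path could do:

static void kvm_gmem_free_folio(struct folio *folio)
{
	/* Made-up predicate standing in for option f/g style marking. */
	if (!kvm_gmem_folio_has_unmap_error(folio))
		return;

	/*
	 * Take an extra reference and deliberately leak the folio so that
	 * memory still mapped in the S-EPT is never handed out again.
	 */
	folio_get(folio);
	pr_err("guest_memfd: leaking pfn 0x%lx still mapped in S-EPT\n",
	       folio_pfn(folio));
}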

>> Topic 2: How to handle unmapping/splitting errors arising from TDX?
>> 
>> Previously I was in favor of having unmap() return an error (Rick
>> suggested doing a POC, and in a more recent email Rick asked for a
>> diffstat), but Vishal and I talked about this and now I agree having
>> unmapping return an error is not a good approach for these reasons.
>> 
>> 1. Unmapping takes a range, and within the range there could be more
>>    than one unmapping error. I was previously thinking that unmap()
>>    could return 0 for success and the failed PFN on error. Returning a
>>    single PFN on error is okay-ish but if there are more errors it could
>>    get complicated.
>> 
>>    Another error return option could be to return the folio where the
>>    unmapping/splitting issue happened, but that would not be
>>    sufficiently precise, since a folio could be larger than 4K and we
>>    want to track errors as precisely as we can to reduce memory loss due
>>    to errors.
>> 
>> 2. What I think Yan has been trying to say: unmap() returning an error
>>    is non-standard in the kernel.
>> 
>> I think (1) is the dealbreaker here and there's no need to do the
>> plumbing POC and diffstat.
>> 
>> So I think we're all in support of indicating unmapping/splitting issues
>> without returning anything from unmap(), and the discussed options are
>> 
>> a. Refcounts: won't work - mostly discussed in this (sub-)thread
>>    [3]. Using refcounts makes it impossible to distinguish between
>>    transient refcounts and refcounts due to errors.
>> 
>> b. Page flags: won't work with/can't benefit from HVO.
>> 
>> Suggestions still in the running:
>> 
>> c. Folio flags are not precise enough to indicate which page actually
>>    had an error, but this could be sufficient if we're willing to just
>>    waste the rest of the huge page on unmapping error.
> For 1GB folios, more precise info will be better.
>
>
>> d. Folio flags with folio splitting on error. This means that on
>>    unmapping/Secure EPT PTE splitting error, we have to split the
>>    (larger than 4K) folio to 4K, and then set a flag on the split folio.
>> 
>>    The issue I see with this is that splitting pages with HVO applied
>>    means doing allocations, and in an error scenario there may not be
>>    memory left to split the pages.
> Could we restore the page structures before triggering unmap?
>

Do you mean every time, before unmapping (every conversion, truncation,
etc), first restore the page structs in preparation for unmap failure,
then re-optimize HVO after successful unmap?

Restoring the page structures is quite an expensive operation IIUC,
involving time and allocations, and it kind of defeats the purpose of
HVO if we need to keep extra memory around to keep restoring and undoing
HVO.

Or do you mean only on unmap failure, restore the page structs (undo HVO)
to mark the error?

Because it is expensive, it seems hard for this to work when the unmap
fails (the machine might be in some dire state already).

>> 
>> e. Some other data structure in guest_memfd, say, a linked list, and a
>>    function like kvm_gmem_add_error_pfn(struct page *page) that would
>>    look up the guest_memfd inode from the page and add the page's pfn to
>>    the linked list.
>>
>>    Everywhere in guest_memfd that does unmapping/splitting would then
>>    check this linked list to see if the unmapping/splitting
>>    succeeded.
>> 
>>    Everywhere in guest_memfd that allocates pages will also check this
>>    linked list to make sure the pages are functional.
>> 
>>    When guest_memfd truncates, if the page being truncated is on the
>>    list, retain the refcount on the page and leak that page.
>>
>> f. Combination of c and e, something similar to HugeTLB's
>>    folio_set_hugetlb_hwpoison(), which sets a flag AND adds the pages in
>>    trouble to a linked list on the folio.
> That seems like a good idea. If memory allocation for the linked list succeeds,
> mark the pages within a folio as troublesome; otherwise, mark the entire folio
> as troublesome.
>

This is already what HugeTLB does for poisoning in
folio_set_hugetlb_hwpoison(), so if an unmapping error can be treated as
hardware poisoning, we could re-use all that infrastructure.
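
If we went that route, the shape would be roughly like this (all names are
made up; this just mirrors the "flag on the folio plus list of exact bad
subpages" pattern that folio_set_hugetlb_hwpoison() uses, it is not the
HugeTLB code itself):

struct gmem_raw_error_page {
	struct llist_node node;
	struct page *page;	/* the exact 4K subpage that failed to unmap */
};

/* 'errors' would live wherever we can hang per-folio state. */
static void gmem_folio_set_unmap_error(struct llist_head *errors,
				       struct folio *folio, struct page *page)
{
	struct gmem_raw_error_page *rep = kmalloc(sizeof(*rep), GFP_ATOMIC);

	if (rep) {
		/* Precise, per-subpage record, like HugeTLB's raw_hwp list. */
		rep->page = page;
		llist_add(&rep->node, errors);
	} else {
		/* Out of memory: fall back to marking the whole folio bad. */
		gmem_folio_mark_all_bad(folio);	/* hypothetical helper */
	}
}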

> But maybe c is good enough for 2MB folios.
>

I think it'd be awkward/troublesome to treat different sizes of folios
differently and I would prefer not to do this.

>> g. Like f, but basically treat an unmapping error as hardware poisoning.
> Not sure if hwpoison bit can be used directly.
> Further investigation is needed.
>

Would appreciate your assessment on whether an unmapping error can be
treated as hardware poisoning!

>> I'm kind of inclined towards g, to just treat unmapping errors as
>> HWPOISON and buying into all the HWPOISON handling requirements. What do
>> yall think? Can a TDX unmapping error be considered as memory poisoning?
>> 
>>

I have another option h to add: if there is an unmapping error from TDX,
can it be an indication of compromise, in terms of security? Should TDX
continue to be trusted to run the TD or other TDs securely? If there is
some unmapping error, could correctness in the entire host be in
question?

If either correctness or security is broken, would it be acceptable to
do a full BUG_ON and crash the system, since neither TDX nor regular VMs
on the host should be trusted to run correctly after this kind of error?

>> [1] https://lore.kernel.org/all/aE%2F1TgUvr0dcaJUg@yzhao56-desk.sh.intel.com/
>> [2] https://lore.kernel.org/all/aFkeBtuNBN1RrDAJ@yzhao56-desk.sh.intel.com/
>> [3] https://lore.kernel.org/all/aFIGFesluhuh2xAS@yzhao56-desk.sh.intel.com/
>> [3] https://lore.kernel.org/all/aFJjZFFhrMWEPjQG@yzhao56-desk.sh.intel.com/

^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge pages
  2025-06-23 22:48                                                         ` Ackerley Tng
  2025-06-24 10:18                                                           ` Yan Zhao
@ 2025-06-24 22:00                                                           ` Edgecombe, Rick P
  2025-06-24 22:14                                                             ` Edgecombe, Rick P
  2025-06-24 23:30                                                             ` Ackerley Tng
  2025-06-24 22:03                                                           ` Edgecombe, Rick P
  2 siblings, 2 replies; 294+ messages in thread
From: Edgecombe, Rick P @ 2025-06-24 22:00 UTC (permalink / raw)
  To: ackerleytng@google.com, Zhao, Yan Y
  Cc: quic_eberman@quicinc.com, Li, Xiaoyao, Du, Fan, Hansen, Dave,
	david@redhat.com, thomas.lendacky@amd.com, tabba@google.com,
	vbabka@suse.cz, Shutemov, Kirill, michael.roth@amd.com,
	linux-kernel@vger.kernel.org, seanjc@google.com,
	pbonzini@redhat.com, binbin.wu@linux.intel.com, Yamahata, Isaku,
	Peng, Chao P, Weiny, Ira, kvm@vger.kernel.org, Annapurve, Vishal,
	jroedel@suse.de, Miao, Jun, Li, Zhiquan1, pgonda@google.com,
	x86@kernel.org

On Mon, 2025-06-23 at 15:48 -0700, Ackerley Tng wrote:
> Let me try and summarize the current state of this discussion:
> 
> Topic 1: Does TDX need to somehow indicate that it is using a page?
> 
> This patch series uses refcounts to indicate that TDX is using a page,
> but that complicates private-to-shared conversions.
> 
> During a private-to-shared conversion, guest_memfd assumes that
> guest_memfd is trusted to manage private memory. TDX and other users
> should trust guest_memfd to keep the memory around.
> 
> Yan's position is that holding a refcount is in line with how IOMMU
> takes a refcount when a page is mapped into the IOMMU [1].
> 
> Yan had another suggestion, which is to indicate using a page flag [2].
> 
> I think we're in agreement that we don't want to have TDX hold a
> refcount while the page is mapped into the Secure EPTs, but taking a
> step back, do we really need to indicate (at all) that TDX is using a
> page?
> 
> In [3] Yan said
> 
> > If TDX does not hold any refcount, guest_memfd has to know which private
> > page is still mapped. Otherwise, the page may be re-assigned to other
> > kernel components while it may still be mapped in the S-EPT.
> 
> If the private page is mapped for regular VM use as private memory,
> guest_memfd is managing that, and the same page will not be re-assigned
> to any other kernel component. guest_memfd does hold refcounts in
> guest_memfd's filemap.
> 
> If the private page is still mapped because there was an unmapping
> failure, we can discuss that separately under error handling in Topic 2.
> 
> With this, can I confirm that we are in agreement that TDX does not need
> to indicate that it is using a page, and can trust guest_memfd to keep
> the page around for the VM?

Minor correction here. Yan was concerned about *bugs* happening when freeing
pages that are accidentally still mapped in the S-EPT. My opinion is that this
is not especially risky to happen here vs other similar places, but it could be
helpful if there was a way to catch such bugs. The page flag, or page_ext
direction came out of a discussion with Dave and Kirill. If it could run all the
time that would be great, but if not a debug config could be sufficient. For
example like CONFIG_PAGE_TABLE_CHECK. It doesn't need to support vmemmap
optimizations because the debug checking doesn't need to run all the time.
Overhead for debug settings is very normal.
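
To sketch what such a debug-only check could look like (purely
illustrative: CONFIG_TDX_PAGE_CHECK, PAGE_EXT_TDX_FIRMWARE and the helper
names are made up here; only page_ext_get()/page_ext_put() are existing
interfaces):

#ifdef CONFIG_TDX_PAGE_CHECK	/* hypothetical debug Kconfig switch */
/* Called by the SEAMCALL wrappers when a page is handed to, or reclaimed
 * from, the TDX module. */
static void tdx_track_page(struct page *page, bool in_use)
{
	struct page_ext *page_ext = page_ext_get(page);

	if (!page_ext)
		return;
	if (in_use)
		set_bit(PAGE_EXT_TDX_FIRMWARE, &page_ext->flags);
	else
		clear_bit(PAGE_EXT_TDX_FIRMWARE, &page_ext->flags);
	page_ext_put(page_ext);
}

/* Hooked into the free path for the debug build: warn if a page that is
 * still owned by the TDX module makes it back to the page allocator. */
static void tdx_check_page_free(struct page *page)
{
	struct page_ext *page_ext = page_ext_get(page);

	if (!page_ext)
		return;
	WARN_ON_ONCE(test_bit(PAGE_EXT_TDX_FIRMWARE, &page_ext->flags));
	page_ext_put(page_ext);
}
#endif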

> 
> Topic 2: How to handle unmapping/splitting errors arising from TDX?
> 
> Previously I was in favor of having unmap() return an error (Rick
> suggested doing a POC, and in a more recent email Rick asked for a
> diffstat), but Vishal and I talked about this and now I agree having
> unmapping return an error is not a good approach for these reasons.

Ok, let's close this option then.

> 
> 1. Unmapping takes a range, and within the range there could be more
>    than one unmapping error. I was previously thinking that unmap()
>    could return 0 for success and the failed PFN on error. Returning a
>    single PFN on error is okay-ish but if there are more errors it could
>    get complicated.
> 
>    Another error return option could be to return the folio where the
>    unmapping/splitting issue happened, but that would not be
>    sufficiently precise, since a folio could be larger than 4K and we
>    want to track errors as precisely as we can to reduce memory loss due
>    to errors.
> 
> 2. What I think Yan has been trying to say: unmap() returning an error
>    is non-standard in the kernel.
> 
> I think (1) is the dealbreaker here and there's no need to do the
> plumbing POC and diffstat.
> 
> So I think we're all in support of indicating unmapping/splitting issues
> without returning anything from unmap(), and the discussed options are
> 
> a. Refcounts: won't work - mostly discussed in this (sub-)thread
>    [3]. Using refcounts makes it impossible to distinguish between
>    transient refcounts and refcounts due to errors.
> 
> b. Page flags: won't work with/can't benefit from HVO.

As above, this was for the purpose of catching bugs, not for guestmemfd to
logically depend on it.

> 
> Suggestions still in the running:
> 
> c. Folio flags are not precise enough to indicate which page actually
>    had an error, but this could be sufficient if we're willing to just
>    waste the rest of the huge page on unmapping error.

For a scenario of TDX module bug, it seems ok to me.

> 
> d. Folio flags with folio splitting on error. This means that on
>    unmapping/Secure EPT PTE splitting error, we have to split the
>    (larger than 4K) folio to 4K, and then set a flag on the split folio.
> 
>    The issue I see with this is that splitting pages with HVO applied
>    means doing allocations, and in an error scenario there may not be
>    memory left to split the pages.
> 
> e. Some other data structure in guest_memfd, say, a linked list, and a
>    function like kvm_gmem_add_error_pfn(struct page *page) that would
>    look up the guest_memfd inode from the page and add the page's pfn to
>    the linked list.
> 
>    Everywhere in guest_memfd that does unmapping/splitting would then
>    check this linked list to see if the unmapping/splitting
>    succeeded.
> 
>    Everywhere in guest_memfd that allocates pages will also check this
>    linked list to make sure the pages are functional.
> 
>    When guest_memfd truncates, if the page being truncated is on the
>    list, retain the refcount on the page and leak that page.

I think this is a fine option.

> 
> f. Combination of c and e, something similar to HugeTLB's
>    folio_set_hugetlb_hwpoison(), which sets a flag AND adds the pages in
>    trouble to a linked list on the folio.
> 
> g. Like f, but basically treat an unmapping error as hardware poisoning.
> 
> I'm kind of inclined towards g, to just treat unmapping errors as
> HWPOISON and buying into all the HWPOISON handling requirements. What do
> yall think? Can a TDX unmapping error be considered as memory poisoning?

What does HWPOISON bring over refcounting the page/folio so that it never
returns to the page allocator? We are bugging the TD in these cases. Ohhh... Is
this about the code to allow gmem fds to be handed to new VMs?

^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge pages
  2025-06-23 22:48                                                         ` Ackerley Tng
  2025-06-24 10:18                                                           ` Yan Zhao
  2025-06-24 22:00                                                           ` Edgecombe, Rick P
@ 2025-06-24 22:03                                                           ` Edgecombe, Rick P
  2 siblings, 0 replies; 294+ messages in thread
From: Edgecombe, Rick P @ 2025-06-24 22:03 UTC (permalink / raw)
  To: ackerleytng@google.com, Zhao, Yan Y
  Cc: quic_eberman@quicinc.com, Li, Xiaoyao, Du, Fan, Hansen, Dave,
	david@redhat.com, thomas.lendacky@amd.com, tabba@google.com,
	vbabka@suse.cz, Shutemov, Kirill, michael.roth@amd.com,
	linux-kernel@vger.kernel.org, seanjc@google.com,
	pbonzini@redhat.com, binbin.wu@linux.intel.com, Yamahata, Isaku,
	Peng, Chao P, Weiny, Ira, kvm@vger.kernel.org, Annapurve, Vishal,
	jroedel@suse.de, Miao, Jun, Li, Zhiquan1, pgonda@google.com,
	x86@kernel.org

On Mon, 2025-06-23 at 15:48 -0700, Ackerley Tng wrote:
> a. Refcounts: won't work - mostly discussed in this (sub-)thread
>    [3]. Using refcounts makes it impossible to distinguish between
>    transient refcounts and refcounts due to errors.

[3] you pointed to this thread, which is the one we are discussing in. Is that
the link you meant?
https://lore.kernel.org/all/aFJjZFFhrMWEPjQG@yzhao56-desk.sh.intel.com/



^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge pages
  2025-06-24 22:00                                                           ` Edgecombe, Rick P
@ 2025-06-24 22:14                                                             ` Edgecombe, Rick P
  2025-06-24 23:30                                                             ` Ackerley Tng
  1 sibling, 0 replies; 294+ messages in thread
From: Edgecombe, Rick P @ 2025-06-24 22:14 UTC (permalink / raw)
  To: ackerleytng@google.com, Zhao, Yan Y
  Cc: quic_eberman@quicinc.com, Li, Xiaoyao, Du, Fan, Hansen, Dave,
	david@redhat.com, thomas.lendacky@amd.com, tabba@google.com,
	vbabka@suse.cz, Shutemov, Kirill, michael.roth@amd.com,
	linux-kernel@vger.kernel.org, seanjc@google.com,
	pbonzini@redhat.com, binbin.wu@linux.intel.com, Yamahata, Isaku,
	Peng, Chao P, Weiny, Ira, kvm@vger.kernel.org, Annapurve, Vishal,
	jroedel@suse.de, Miao, Jun, Li, Zhiquan1, pgonda@google.com,
	x86@kernel.org

On Tue, 2025-06-24 at 15:00 -0700, Rick Edgecombe wrote:
> Minor correction here. Yan was concerned about *bugs* happening when freeing
> pages that are accidentally still mapped in the S-EPT. My opinion is that this
> is not especially risky to happen here vs other similar places, but it could be
> helpful if there was a way to catch such bugs. The page flag, or page_ext
> direction came out of a discussion with Dave and Kirill. If it could run all the
> time that would be great, but if not a debug config could be sufficient. For
> example like CONFIG_PAGE_TABLE_CHECK. It doesn't need to support vmemmap
> optimizations because the debug checking doesn't need to run all the time.
> Overhead for debug settings is very normal.

Note, this is separate from the problem of how to handle or notify TDX unmap
errors. That is still an open question regardless. But Yan was concerned that if
we didn't take a reference when we first mapped it, it could be more error prone.
So this flag was an alternative to *holding* a reference during the lifetime of
the S-EPT mapping.

^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge pages
  2025-06-24 21:29                                                             ` Ackerley Tng
@ 2025-06-24 22:22                                                               ` Edgecombe, Rick P
  0 siblings, 0 replies; 294+ messages in thread
From: Edgecombe, Rick P @ 2025-06-24 22:22 UTC (permalink / raw)
  To: ackerleytng@google.com, Zhao, Yan Y
  Cc: quic_eberman@quicinc.com, Li, Xiaoyao, Du, Fan, Hansen, Dave,
	david@redhat.com, thomas.lendacky@amd.com, tabba@google.com,
	vbabka@suse.cz, Shutemov, Kirill, michael.roth@amd.com,
	linux-kernel@vger.kernel.org, seanjc@google.com,
	pbonzini@redhat.com, binbin.wu@linux.intel.com, Yamahata, Isaku,
	Peng, Chao P, Weiny, Ira, kvm@vger.kernel.org, Annapurve, Vishal,
	jroedel@suse.de, Miao, Jun, Li, Zhiquan1, pgonda@google.com,
	x86@kernel.org

On Tue, 2025-06-24 at 14:29 -0700, Ackerley Tng wrote:
> I have another option h to add: if there is an unmapping error from TDX,
> can it be an indication of compromise, in terms of security? Should TDX
> continue to be trusted to run the TD or other TDs securely? If there is
> some unmapping error, could correctness in the entire host be in
> question?

Maybe, but it's the TDX module's job to do something about this. The threat
model of TDX doesn't involve the host VMM ensuring integrity of the TD.

> 
> If either correctness or security is broken, would it be acceptable to
> do a full BUG_ON and crash the system, since neither TDX nor regular VMs
> on the host should be trusted to run correctly after this kind of error?

BUG_ON() won't be acceptable. See Linus' opinion on the subject. The standard
practice is to warn and let people run panic_on_warn if they want to be
paranoid. And we already will generate a warning so it's possible to configure
for this behavior today.

^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge pages
  2025-06-24 22:00                                                           ` Edgecombe, Rick P
  2025-06-24 22:14                                                             ` Edgecombe, Rick P
@ 2025-06-24 23:30                                                             ` Ackerley Tng
  2025-06-25  0:01                                                               ` Edgecombe, Rick P
  2025-06-25  7:08                                                               ` Yan Zhao
  1 sibling, 2 replies; 294+ messages in thread
From: Ackerley Tng @ 2025-06-24 23:30 UTC (permalink / raw)
  To: Edgecombe, Rick P, Zhao, Yan Y
  Cc: quic_eberman@quicinc.com, Li, Xiaoyao, Du, Fan, Hansen, Dave,
	david@redhat.com, thomas.lendacky@amd.com, tabba@google.com,
	vbabka@suse.cz, Shutemov, Kirill, michael.roth@amd.com,
	linux-kernel@vger.kernel.org, seanjc@google.com,
	pbonzini@redhat.com, binbin.wu@linux.intel.com, Yamahata, Isaku,
	Peng, Chao P, Weiny, Ira, kvm@vger.kernel.org, Annapurve, Vishal,
	jroedel@suse.de, Miao, Jun, Li, Zhiquan1, pgonda@google.com,
	x86@kernel.org

"Edgecombe, Rick P" <rick.p.edgecombe@intel.com> writes:

> On Mon, 2025-06-23 at 15:48 -0700, Ackerley Tng wrote:
>> Let me try and summarize the current state of this discussion:
>> 
>> Topic 1: Does TDX need to somehow indicate that it is using a page?
>> 
>> This patch series uses refcounts to indicate that TDX is using a page,
>> but that complicates private-to-shared conversions.
>> 
>> During a private-to-shared conversion, guest_memfd assumes that
>> guest_memfd is trusted to manage private memory. TDX and other users
>> should trust guest_memfd to keep the memory around.
>> 
>> Yan's position is that holding a refcount is in line with how IOMMU
>> takes a refcount when a page is mapped into the IOMMU [1].
>> 
>> Yan had another suggestion, which is to indicate using a page flag [2].
>> 
>> I think we're in agreement that we don't want to have TDX hold a
>> refcount while the page is mapped into the Secure EPTs, but taking a
>> step back, do we really need to indicate (at all) that TDX is using a
>> page?
>> 
>> In [3] Yan said
>> 
>> > If TDX does not hold any refcount, guest_memfd has to know which private
>> > page is still mapped. Otherwise, the page may be re-assigned to other
>> > kernel components while it may still be mapped in the S-EPT.
>> 
>> If the private page is mapped for regular VM use as private memory,
>> guest_memfd is managing that, and the same page will not be re-assigned
>> to any other kernel component. guest_memfd does hold refcounts in
>> guest_memfd's filemap.
>> 
>> If the private page is still mapped because there was an unmapping
>> failure, we can discuss that separately under error handling in Topic 2.
>> 
>> With this, can I confirm that we are in agreement that TDX does not need
>> to indicate that it is using a page, and can trust guest_memfd to keep
>> the page around for the VM?
>
> Minor correction here. Yan was concerned about *bugs* happening when freeing
> pages that are accidentally still mapped in the S-EPT. My opinion is that this
> is not especially risky to happen here vs other similar places, but it could be
> helpful if there was a way to catch such bugs. The page flag, or page_ext
> direction came out of a discussion with Dave and Kirill. If it could run all the
> time that would be great, but if not a debug config could be sufficient. For
> example like CONFIG_PAGE_TABLE_CHECK. It doesn't need to support vmemmap
> optimizations because the debug checking doesn't need to run all the time.
> Overhead for debug settings is very normal.
>

I see, let's call debug checking Topic 3 then, to separate it from Topic
1, which is TDX indicating that it is using a page for production
kernels.

Topic 3: How should TDX indicate use of a page for debugging?

I'm okay if for debugging, TDX uses anything other than refcounts for
checking, because refcounts will interfere with conversions.

Rick's other email is correct. The correct link should be
https://lore.kernel.org/all/aFJjZFFhrMWEPjQG@yzhao56-desk.sh.intel.com/.

[INTERFERE WITH CONVERSIONS]

To summarize, if TDX uses refcounts to indicate that it is using a page,
or to indicate anything else, then we cannot easily split a page on
private to shared conversions.

Specifically, consider the case where only the x-th subpage of a huge
folio is mapped into Secure-EPTs. When the guest requests to convert
some subpage to shared, the huge folio has to be split for
core-mm. Core-mm, which will use the shared page, must have split folios
to be able to accurately and separately track refcounts for subpages.

During splitting, guest_memfd would see refcount of 512 (for 2M page
being in the filemap) + 1 (if TDX indicates that the x-th subpage is
mapped using a refcount), but would not be able to tell that the 513th
refcount belongs to the x-th subpage. guest_memfd can't split the huge
folio unless it knows how to distribute the 513th refcount.

One might say guest_memfd could clear all the refcounts that TDX is
holding on the huge folio by unmapping the entire huge folio from the
Secure-EPTs, but unmapping the entire huge folio for TDX means zeroing
the contents and requiring guest re-acceptance. Both of these would mess
up guest operation.

Hence, guest_memfd's solution is to require that users of guest_memfd
for private memory trust guest_memfd to maintain the pages around and
not take any refcounts.
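
Just to restate the arithmetic above in code form (illustration only, not
something from this series): all guest_memfd can do is compare totals, it
cannot attribute a surplus reference to a particular subpage:

static bool gmem_folio_has_only_filemap_refs(struct folio *folio)
{
	/* Per the example above: one reference per 4K page in the filemap,
	 * i.e. 512 for a 2M folio. */
	long expected = folio_nr_pages(folio);

	/*
	 * If this is false there is at least one extra reference, but
	 * nothing tells us *which* 4K subpage it is for, so the folio
	 * cannot be split safely.
	 */
	return folio_ref_count(folio) == expected;
}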

So back to Topic 1, for production kernels, is it okay that TDX does not
need to indicate that it is using a page, and can trust guest_memfd to
keep the page around for the VM?

>> 
>> Topic 2: How to handle unmapping/splitting errors arising from TDX?
>> 
>> Previously I was in favor of having unmap() return an error (Rick
>> suggested doing a POC, and in a more recent email Rick asked for a
>> diffstat), but Vishal and I talked about this and now I agree having
>> unmapping return an error is not a good approach for these reasons.
>
> Ok, let's close this option then.
>
>> 
>> 1. Unmapping takes a range, and within the range there could be more
>>    than one unmapping error. I was previously thinking that unmap()
>>    could return 0 for success and the failed PFN on error. Returning a
>>    single PFN on error is okay-ish but if there are more errors it could
>>    get complicated.
>> 
>>    Another error return option could be to return the folio where the
>>    unmapping/splitting issue happened, but that would not be
>>    sufficiently precise, since a folio could be larger than 4K and we
>>    want to track errors as precisely as we can to reduce memory loss due
>>    to errors.
>> 
>> 2. What I think Yan has been trying to say: unmap() returning an error
>>    is non-standard in the kernel.
>> 
>> I think (1) is the dealbreaker here and there's no need to do the
>> plumbing POC and diffstat.
>> 
>> So I think we're all in support of indicating unmapping/splitting issues
>> without returning anything from unmap(), and the discussed options are
>> 
>> a. Refcounts: won't work - mostly discussed in this (sub-)thread
>>    [3]. Using refcounts makes it impossible to distinguish between
>>    transient refcounts and refcounts due to errors.
>> 
>> b. Page flags: won't work with/can't benefit from HVO.
>
> As above, this was for the purpose of catching bugs, not for guestmemfd to
> logically depend on it.
>
>> 
>> Suggestions still in the running:
>> 
>> c. Folio flags are not precise enough to indicate which page actually
>>    had an error, but this could be sufficient if we're willing to just
>>    waste the rest of the huge page on unmapping error.
>
> For a scenario of TDX module bug, it seems ok to me.
>
>> 
>> d. Folio flags with folio splitting on error. This means that on
>>    unmapping/Secure EPT PTE splitting error, we have to split the
>>    (larger than 4K) folio to 4K, and then set a flag on the split folio.
>> 
>>    The issue I see with this is that splitting pages with HVO applied
>>    means doing allocations, and in an error scenario there may not be
>>    memory left to split the pages.
>> 
>> e. Some other data structure in guest_memfd, say, a linked list, and a
>>    function like kvm_gmem_add_error_pfn(struct page *page) that would
>>    look up the guest_memfd inode from the page and add the page's pfn to
>>    the linked list.
>> 
>>    Everywhere in guest_memfd that does unmapping/splitting would then
>>    check this linked list to see if the unmapping/splitting
>>    succeeded.
>> 
>>    Everywhere in guest_memfd that allocates pages will also check this
>>    linked list to make sure the pages are functional.
>> 
>>    When guest_memfd truncates, if the page being truncated is on the
>>    list, retain the refcount on the page and leak that page.
>
> I think this is a fine option.
>
>> 
>> f. Combination of c and e, something similar to HugeTLB's
>>    folio_set_hugetlb_hwpoison(), which sets a flag AND adds the pages in
>>    trouble to a linked list on the folio.
>> 
>> g. Like f, but basically treat an unmapping error as hardware poisoning.
>> 
>> I'm kind of inclined towards g, to just treat unmapping errors as
>> HWPOISON and buying into all the HWPOISON handling requirements. What do
>> yall think? Can a TDX unmapping error be considered as memory poisoning?
>
> What does HWPOISON bring over refcounting the page/folio so that it never
> returns to the page allocator?

For Topic 2 (handling TDX unmapping errors), HWPOISON is better than
refcounting because refcounting interferes with conversions (see
[INTERFERE WITH CONVERSIONS] above).

> We are bugging the TD in these cases.

Bugging the TD does not help to prevent future conversions from being
interfered with.

1. Conversion involves unmapping, so we could actually be in a
   conversion where the unmapping is performed and fails, and then we try
   to split and enter an infinite loop, since private-to-shared conversion
   assumes guest_memfd holds the only refcounts on guest_memfd memory.

2. The conversion ioctl is a guest_memfd ioctl, not a VM ioctl, and so
   there is no check that the VM is not dead. There shouldn't be any
   check on the VM, because shareability is a property of the memory and
   should be changeable independent of the associated VM.

> Ohhh... Is
> this about the code to allow gmem fds to be handed to new VMs?

Nope, it's not related to linking. The proposed KVM_LINK_GUEST_MEMFD
ioctl [4] also doesn't check if the source VM is dead. There shouldn't
be any check on the source VM, since the memory is from guest_memfd and
should be independently transferable to a new VM.

[4] https://lore.kernel.org/lkml/cover.1747368092.git.afranji@google.com/T/

^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge pages
  2025-06-24 23:30                                                             ` Ackerley Tng
@ 2025-06-25  0:01                                                               ` Edgecombe, Rick P
  2025-06-25  7:29                                                                 ` Yan Zhao
  2025-06-25 23:09                                                                 ` Ackerley Tng
  2025-06-25  7:08                                                               ` Yan Zhao
  1 sibling, 2 replies; 294+ messages in thread
From: Edgecombe, Rick P @ 2025-06-25  0:01 UTC (permalink / raw)
  To: ackerleytng@google.com, Zhao, Yan Y
  Cc: quic_eberman@quicinc.com, Li, Xiaoyao, Shutemov, Kirill,
	Hansen, Dave, david@redhat.com, thomas.lendacky@amd.com,
	tabba@google.com, vbabka@suse.cz, Du, Fan, michael.roth@amd.com,
	linux-kernel@vger.kernel.org, seanjc@google.com, Peng, Chao P,
	pbonzini@redhat.com, Yamahata, Isaku, binbin.wu@linux.intel.com,
	Weiny, Ira, kvm@vger.kernel.org, Annapurve, Vishal,
	jroedel@suse.de, Miao, Jun, Li, Zhiquan1, pgonda@google.com,
	x86@kernel.org

On Tue, 2025-06-24 at 16:30 -0700, Ackerley Tng wrote:
> I see, let's call debug checking Topic 3 then, to separate it from Topic
> 1, which is TDX indicating that it is using a page for production
> kernels.
> 
> Topic 3: How should TDX indicate use of a page for debugging?
> 
> I'm okay if for debugging, TDX uses anything other than refcounts for
> checking, because refcounts will interfere with conversions.

Ok. It can be follow on work I think.

> 
> Rick's other email is correct. The correct link should be
> https://lore.kernel.org/all/aFJjZFFhrMWEPjQG@yzhao56-desk.sh.intel.com/.
> 
> [INTERFERE WITH CONVERSIONS]
> 
> To summarize, if TDX uses refcounts to indicate that it is using a page,
> or to indicate anything else, then we cannot easily split a page on
> private to shared conversions.
> 
> Specifically, consider the case where only the x-th subpage of a huge
> folio is mapped into Secure-EPTs. When the guest requests to convert
> some subpage to shared, the huge folio has to be split for
> core-mm. Core-mm, which will use the shared page, must have split folios
> to be able to accurately and separately track refcounts for subpages.
> 
> During splitting, guest_memfd would see refcount of 512 (for 2M page
> being in the filemap) + 1 (if TDX indicates that the x-th subpage is
> mapped using a refcount), but would not be able to tell that the 513th
> refcount belongs to the x-th subpage. guest_memfd can't split the huge
> folio unless it knows how to distribute the 513th refcount.
> 
> One might say guest_memfd could clear all the refcounts that TDX is
> holding on the huge folio by unmapping the entire huge folio from the
> Secure-EPTs, but unmapping the entire huge folio for TDX means zeroing
> the contents and requiring guest re-acceptance. Both of these would mess
> up guest operation.
> 
> Hence, guest_memfd's solution is to require that users of guest_memfd
> for private memory trust guest_memfd to maintain the pages around and
> not take any refcounts.
> 
> So back to Topic 1, for production kernels, is it okay that TDX does not
> need to indicate that it is using a page, and can trust guest_memfd to
> keep the page around for the VM?

I think Yan's concern is not totally invalid. But I don't see a problem if we
have a line of sight to adding debug checking as follow on work. That is kind of
the path I was trying to nudge.

> 
> > > 
> > > Topic 2: How to handle unmapping/splitting errors arising from TDX?
> > > 
> > > Previously I was in favor of having unmap() return an error (Rick
> > > suggested doing a POC, and in a more recent email Rick asked for a
> > > diffstat), but Vishal and I talked about this and now I agree having
> > > unmapping return an error is not a good approach for these reasons.
> > 
> > Ok, let's close this option then.
> > 
> > > 
> > > 1. Unmapping takes a range, and within the range there could be more
> > >    than one unmapping error. I was previously thinking that unmap()
> > >    could return 0 for success and the failed PFN on error. Returning a
> > >    single PFN on error is okay-ish but if there are more errors it could
> > >    get complicated.
> > > 
> > >    Another error return option could be to return the folio where the
> > >    unmapping/splitting issue happened, but that would not be
> > >    sufficiently precise, since a folio could be larger than 4K and we
> > >    want to track errors as precisely as we can to reduce memory loss due
> > >    to errors.
> > > 
> > > 2. What I think Yan has been trying to say: unmap() returning an error
> > >    is non-standard in the kernel.
> > > 
> > > I think (1) is the dealbreaker here and there's no need to do the
> > > plumbing POC and diffstat.
> > > 
> > > So I think we're all in support of indicating unmapping/splitting issues
> > > without returning anything from unmap(), and the discussed options are
> > > 
> > > a. Refcounts: won't work - mostly discussed in this (sub-)thread
> > >    [3]. Using refcounts makes it impossible to distinguish between
> > >    transient refcounts and refcounts due to errors.
> > > 
> > > b. Page flags: won't work with/can't benefit from HVO.
> > 
> > As above, this was for the purpose of catching bugs, not for guestmemfd to
> > logically depend on it.
> > 
> > > 
> > > Suggestions still in the running:
> > > 
> > > c. Folio flags are not precise enough to indicate which page actually
> > >    had an error, but this could be sufficient if we're willing to just
> > >    waste the rest of the huge page on unmapping error.
> > 
> > For a scenario of TDX module bug, it seems ok to me.
> > 
> > > 
> > > d. Folio flags with folio splitting on error. This means that on
> > >    unmapping/Secure EPT PTE splitting error, we have to split the
> > >    (larger than 4K) folio to 4K, and then set a flag on the split folio.
> > > 
> > >    The issue I see with this is that splitting pages with HVO applied
> > >    means doing allocations, and in an error scenario there may not be
> > >    memory left to split the pages.
> > > 
> > > e. Some other data structure in guest_memfd, say, a linked list, and a
> > >    function like kvm_gmem_add_error_pfn(struct page *page) that would
> > >    look up the guest_memfd inode from the page and add the page's pfn to
> > >    the linked list.
> > > 
> > >    Everywhere in guest_memfd that does unmapping/splitting would then
> > >    check this linked list to see if the unmapping/splitting
> > >    succeeded.
> > > 
> > >    Everywhere in guest_memfd that allocates pages will also check this
> > >    linked list to make sure the pages are functional.
> > > 
> > >    When guest_memfd truncates, if the page being truncated is on the
> > >    list, retain the refcount on the page and leak that page.
> > 
> > I think this is a fine option.
> > 
> > > 
> > > f. Combination of c and e, something similar to HugeTLB's
> > >    folio_set_hugetlb_hwpoison(), which sets a flag AND adds the pages in
> > >    trouble to a linked list on the folio.
> > > 
> > > g. Like f, but basically treat an unmapping error as hardware poisoning.
> > > 
> > > I'm kind of inclined towards g, to just treat unmapping errors as
> > > HWPOISON and buying into all the HWPOISON handling requirements. What do
> > > yall think? Can a TDX unmapping error be considered as memory poisoning?
> > 
> > What does HWPOISON bring over refcounting the page/folio so that it never
> > returns to the page allocator?
> 
> For Topic 2 (handling TDX unmapping errors), HWPOISON is better than
> refcounting because refcounting interferes with conversions (see
> [INTERFERE WITH CONVERSIONS] above).

I don't know if it quite fits. I think it would be better to not pollute the
concept if possible.

> 
> > We are bugging the TD in these cases.
> 
> Bugging the TD does not help to prevent future conversions from being
> interfered with.
> 
> 1. Conversion involves unmapping, so we could actually be in a
>    conversion where the unmapping is performed and fails, and then we try
>    to split and enter an infinite loop, since private-to-shared conversion
>    assumes guest_memfd holds the only refcounts on guest_memfd memory.
> 
> 2. The conversion ioctl is a guest_memfd ioctl, not a VM ioctl, and so
>    there is no check that the VM is not dead. There shouldn't be any
>    check on the VM, because shareability is a property of the memory and
>    should be changeable independent of the associated VM.

Hmm, they are both about unlinking guestmemfd from a VM lifecycle then. Is that
a better way to put it?

> 
> > Ohhh... Is
> > this about the code to allow gmem fds to be handed to new VMs?
> 
> Nope, it's not related to linking. The proposed KVM_LINK_GUEST_MEMFD
> ioctl [4] also doesn't check if the source VM is dead. There shouldn't
> be any check on the source VM, since the memory is from guest_memfd and
> should be independently transferable to a new VM.

If a page is mapped in the old TD, and a new TD is started, re-mapping the same
page should be prevented somehow, right?

It really does seem like guestmemfd is the right place to keep the "stuck
page" state. If guestmemfd is not tied to a VM and can be re-used, it should be
the one to decide whether such pages can be mapped again. Refcounting on error is
about preventing return to the page allocator, but that is not the only problem.

I do think that these threads have gone on far too long. It's probably about
time to move forward with something even if it's just to have something to
discuss that doesn't require footnoting so many lore links. So how about we move
forward with option e as a next step. Does that sound good Yan?
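
For reference, a rough sketch of what option e could look like (everything
except kvm_gmem_add_error_pfn(), which was named earlier in this thread, is
invented; an xarray is used instead of a literal linked list just to get a
cheap pfn lookup, and init/allocation-failure handling is elided):

/* Hypothetical per-inode registry of pfns whose S-EPT unmap/split failed. */
struct kvm_gmem_error_pfns {
	struct xarray pfns;	/* indexed by pfn, value unused */
};

/* Called by arch code when e.g. tdh_mem_page_remove() fails. */
void kvm_gmem_add_error_pfn(struct page *page)
{
	/* Both lookups below are made-up helpers. */
	struct inode *inode = kvm_gmem_inode_from_page(page);
	struct kvm_gmem_error_pfns *err = kvm_gmem_error_pfns(inode);

	xa_store(&err->pfns, page_to_pfn(page), xa_mk_value(1), GFP_ATOMIC);
}

/* Checked by the conversion, truncation and allocation paths. */
bool kvm_gmem_pfn_has_error(struct inode *inode, unsigned long pfn)
{
	struct kvm_gmem_error_pfns *err = kvm_gmem_error_pfns(inode);

	return xa_load(&err->pfns, pfn) != NULL;
}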

Ackerley, thank you very much for pulling together this summary.

^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge pages
  2025-06-24 23:30                                                             ` Ackerley Tng
  2025-06-25  0:01                                                               ` Edgecombe, Rick P
@ 2025-06-25  7:08                                                               ` Yan Zhao
  2025-06-25 22:54                                                                 ` Ackerley Tng
  1 sibling, 1 reply; 294+ messages in thread
From: Yan Zhao @ 2025-06-25  7:08 UTC (permalink / raw)
  To: Ackerley Tng
  Cc: Edgecombe, Rick P, quic_eberman@quicinc.com, Li, Xiaoyao, Du, Fan,
	Hansen, Dave, david@redhat.com, thomas.lendacky@amd.com,
	tabba@google.com, vbabka@suse.cz, Shutemov, Kirill,
	michael.roth@amd.com, linux-kernel@vger.kernel.org,
	seanjc@google.com, pbonzini@redhat.com, binbin.wu@linux.intel.com,
	Yamahata, Isaku, Peng, Chao P, Weiny, Ira, kvm@vger.kernel.org,
	Annapurve, Vishal, jroedel@suse.de, Miao, Jun, Li, Zhiquan1,
	pgonda@google.com, x86@kernel.org

On Tue, Jun 24, 2025 at 04:30:57PM -0700, Ackerley Tng wrote:
> "Edgecombe, Rick P" <rick.p.edgecombe@intel.com> writes:
> 
> > On Mon, 2025-06-23 at 15:48 -0700, Ackerley Tng wrote:
> >> Let me try and summarize the current state of this discussion:
> >> 
> >> Topic 1: Does TDX need to somehow indicate that it is using a page?
> >> 
> >> This patch series uses refcounts to indicate that TDX is using a page,
> >> but that complicates private-to-shared conversions.
> >> 
> >> During a private-to-shared conversion, guest_memfd assumes that
> >> guest_memfd is trusted to manage private memory. TDX and other users
> >> should trust guest_memfd to keep the memory around.
> >> 
> >> Yan's position is that holding a refcount is in line with how IOMMU
> >> takes a refcount when a page is mapped into the IOMMU [1].
> >> 
> >> Yan had another suggestion, which is to indicate using a page flag [2].
> >> 
> >> I think we're in agreement that we don't want to have TDX hold a
> >> refcount while the page is mapped into the Secure EPTs, but taking a
> >> step back, do we really need to indicate (at all) that TDX is using a
> >> page?
> >> 
> >> In [3] Yan said
> >> 
> >> > If TDX does not hold any refcount, guest_memfd has to know which private
> >> > page is still mapped. Otherwise, the page may be re-assigned to other
> >> > kernel components while it may still be mapped in the S-EPT.
> >> 
> >> If the private page is mapped for regular VM use as private memory,
> >> guest_memfd is managing that, and the same page will not be re-assigned
> >> to any other kernel component. guest_memfd does hold refcounts in
> >> guest_memfd's filemap.
> >> 
> >> If the private page is still mapped because there was an unmapping
> >> failure, we can discuss that separately under error handling in Topic 2.
> >> 
> >> With this, can I confirm that we are in agreement that TDX does not need
> >> to indicate that it is using a page, and can trust guest_memfd to keep
> >> the page around for the VM?
> >
> > Minor correction here. Yan was concerned about *bugs* happening when freeing
> > pages that are accidentally still mapped in the S-EPT. My opinion is that this
> > is not especially risky to happen here vs other similar places, but it could be
> > helpful if there was a way to catch such bugs. The page flag, or page_ext
> > direction came out of a discussion with Dave and Kirill. If it could run all the
> > time that would be great, but if not a debug config could be sufficient. For
> > example like CONFIG_PAGE_TABLE_CHECK. It doesn't need to support vmemmap
> > optimizations because the debug checking doesn't need to run all the time.
> > Overhead for debug settings is very normal.
> >
> 
> I see, let's call debug checking Topic 3 then, to separate it from Topic
> 1, which is TDX indicating that it is using a page for production
> kernels.
> 
> Topic 3: How should TDX indicate use of a page for debugging?
> 
> I'm okay if for debugging, TDX uses anything other than refcounts for
> checking, because refcounts will interfere with conversions.
> 
> Rick's other email is correct. The correct link should be
> https://lore.kernel.org/all/aFJjZFFhrMWEPjQG@yzhao56-desk.sh.intel.com/.
> 
> [INTERFERE WITH CONVERSIONS]
> 
> To summarize, if TDX uses refcounts to indicate that it is using a page,
> or to indicate anything else, then we cannot easily split a page on
> private to shared conversions.
> 
> Specifically, consider the case where only the x-th subpage of a huge
> folio is mapped into Secure-EPTs. When the guest requests to convert
> some subpage to shared, the huge folio has to be split for
> core-mm. Core-mm, which will use the shared page, must have split folios
> to be able to accurately and separately track refcounts for subpages.
> 
> During splitting, guest_memfd would see refcount of 512 (for 2M page
> being in the filemap) + 1 (if TDX indicates that the x-th subpage is
> mapped using a refcount), but would not be able to tell that the 513th
> refcount belongs to the x-th subpage. guest_memfd can't split the huge
> folio unless it knows how to distribute the 513th refcount.
In my POC, https://lore.kernel.org/all/aE%2Fq9VKkmaCcuwpU@yzhao56-desk.sh.intel.com/
kvm_gmem_private_has_safe_refcount() was introduced to check the folio refcount.
It rejects private-to-shared conversion after splitting and unmapping KVM's
secondary page tables if the refcount exceeds a valid threshold.

Though in
https://lore.kernel.org/all/aFJjZFFhrMWEPjQG@yzhao56-desk.sh.intel.com/, we
agreed that "EAGAIN is not the right code in case of "extra" refcounts held by
TDX", this does not imply that rejecting the conversion itself is incorrect.

This is why we are exploring alternative solutions instead of having TDX hold
the page refcount.

So, either a per-page flag, a per-folio flag, or solutions e, f, or g should be good.

IMO, regardless of the final choice, guest_memfd needs to identify problematic
folios to:
- reject the private-to-shared conversion
- prevent further recycling after kvm_gmem_free_folio()

> One might say guest_memfd could clear all the refcounts that TDX is
> holding on the huge folio by unmapping the entire huge folio from the
> Secure-EPTs, but unmapping the entire huge folio for TDX means zeroing
> the contents and requiring guest re-acceptance. Both of these would mess
> up guest operation.
> 
> Hence, guest_memfd's solution is to require that users of guest_memfd
> for private memory trust guest_memfd to maintain the pages around and
> not take any refcounts.
> 
> So back to Topic 1, for production kernels, is it okay that TDX does not
> need to indicate that it is using a page, and can trust guest_memfd to
> keep the page around for the VM?
If the "TDX does not need to indicate that it is using a page" means "do not
take page refcount", I'm ok.

> >> 
> >> Topic 2: How to handle unmapping/splitting errors arising from TDX?
> >> 
> >> Previously I was in favor of having unmap() return an error (Rick
> >> suggested doing a POC, and in a more recent email Rick asked for a
> >> diffstat), but Vishal and I talked about this and now I agree having
> >> unmapping return an error is not a good approach for these reasons.
> >
> > Ok, let's close this option then.
> >
> >> 
> >> 1. Unmapping takes a range, and within the range there could be more
> >>    than one unmapping error. I was previously thinking that unmap()
> >>    could return 0 for success and the failed PFN on error. Returning a
> >>    single PFN on error is okay-ish but if there are more errors it could
> >>    get complicated.
> >> 
> >>    Another error return option could be to return the folio where the
> >>    unmapping/splitting issue happened, but that would not be
> >>    sufficiently precise, since a folio could be larger than 4K and we
> >>    want to track errors as precisely as we can to reduce memory loss due
> >>    to errors.
> >> 
> >> 2. What I think Yan has been trying to say: unmap() returning an error
> >>    is non-standard in the kernel.
> >> 
> >> I think (1) is the dealbreaker here and there's no need to do the
> >> plumbing POC and diffstat.
> >> 
> >> So I think we're all in support of indicating unmapping/splitting issues
> >> without returning anything from unmap(), and the discussed options are
> >> 
> >> a. Refcounts: won't work - mostly discussed in this (sub-)thread
> >>    [3]. Using refcounts makes it impossible to distinguish between
> >>    transient refcounts and refcounts due to errors.
> >> 
> >> b. Page flags: won't work with/can't benefit from HVO.
> >
> > As above, this was for the purpose of catching bugs, not for guestmemfd to
> > logically depend on it.
> >
> >> 
> >> Suggestions still in the running:
> >> 
> >> c. Folio flags are not precise enough to indicate which page actually
> >>    had an error, but this could be sufficient if we're willing to just
> >>    waste the rest of the huge page on unmapping error.
> >
> > For a scenario of TDX module bug, it seems ok to me.
> >
> >> 
> >> d. Folio flags with folio splitting on error. This means that on
> >>    unmapping/Secure EPT PTE splitting error, we have to split the
> >>    (larger than 4K) folio to 4K, and then set a flag on the split folio.
> >> 
> >>    The issue I see with this is that splitting pages with HVO applied
> >>    means doing allocations, and in an error scenario there may not be
> >>    memory left to split the pages.
> >> 
> >> e. Some other data structure in guest_memfd, say, a linked list, and a
> >>    function like kvm_gmem_add_error_pfn(struct page *page) that would
> >>    look up the guest_memfd inode from the page and add the page's pfn to
> >>    the linked list.
> >> 
> >>    Everywhere in guest_memfd that does unmapping/splitting would then
> >>    check this linked list to see if the unmapping/splitting
> >>    succeeded.
> >> 
> >>    Everywhere in guest_memfd that allocates pages will also check this
> >>    linked list to make sure the pages are functional.
> >> 
> >>    When guest_memfd truncates, if the page being truncated is on the
> >>    list, retain the refcount on the page and leak that page.
> >
> > I think this is a fine option.
> >
> >> 
> >> f. Combination of c and e, something similar to HugeTLB's
> >>    folio_set_hugetlb_hwpoison(), which sets a flag AND adds the pages in
> >>    trouble to a linked list on the folio.
> >> 
> >> g. Like f, but basically treat an unmapping error as hardware poisoning.
> >> 
> >> I'm kind of inclined towards g, to just treat unmapping errors as
> >> HWPOISON and buying into all the HWPOISON handling requirements. What do
> >> yall think? Can a TDX unmapping error be considered as memory poisoning?
> >
> > What does HWPOISON bring over refcounting the page/folio so that it never
> > returns to the page allocator?
> 
> For Topic 2 (handling TDX unmapping errors), HWPOISON is better than
> refcounting because refcounting interferes with conversions (see
> [INTERFERE WITH CONVERSIONS] above).
> 
> > We are bugging the TD in these cases.
> 
> Bugging the TD does not help to prevent future conversions from being
> interfered with.
> 
> 1. Conversion involves unmapping, so we could actually be in a
>    conversion where the unmapping is performed and fails, and then we try
>    to split and enter an infinite loop, since private-to-shared conversion
>    assumes guest_memfd holds the only refcounts on guest_memfd memory.
We should bail out conversion even with the HWPOISON.
e.g.,
1. user triggers private-to-shared ioctl to convert 4K page A within a 2MB folio
   B to shared.
2. kvm_gmem_convert_should_proceed() executes kvm_gmem_split_private() and
   kvm_gmem_zap().
3. kvm_gmem_convert_should_proceed() checks kvm_gmem_has_invalid_folio()
   (Suppose TDX sets HWPOISON to page A or folio B after kvm_gmem_zap(), then
     kvm_gmem_has_invalid_folio() should return true).
4. Return -EFAULT.

If we allow the actual conversion to proceed after step 3, folio B will be split
into 4KB folios, with page A being converted to a shared 4KB folio, which
becomes accessible by userspace.

This could cause a machine check (#MC) on certain platforms. We should avoid
this scenario when possible.
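
To make the flow above concrete, here is a minimal sketch of what the bail-out
in steps 3/4 could look like. The kvm_gmem_* helpers below follow the POC
naming used in this thread and are assumptions, not an existing upstream API:

/*
 * Sketch only: kvm_gmem_split_private(), kvm_gmem_zap() and
 * kvm_gmem_has_invalid_folio() are the POC names referenced above,
 * not existing upstream APIs.
 */
static int kvm_gmem_convert_should_proceed(struct inode *inode,
					   pgoff_t start, pgoff_t end)
{
	int ret;

	ret = kvm_gmem_split_private(inode, start, end);
	if (ret)
		return ret;

	kvm_gmem_zap(inode, start, end);

	/*
	 * If unmapping left a page in a bad state (e.g. TDX marked it
	 * HWPOISON), refuse to expose the range to userspace as shared.
	 */
	if (kvm_gmem_has_invalid_folio(inode, start, end))
		return -EFAULT;

	return 0;
}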


> 2. The conversion ioctl is a guest_memfd ioctl, not a VM ioctl, and so
>    there is no check that the VM is not dead. There shouldn't be any
>    check on the VM, because shareability is a property of the memory and
>    should be changeable independent of the associated VM.
> 
> > Ohhh... Is
> > this about the code to allow gmem fds to be handed to new VMs?
> 
> Nope, it's not related to linking. The proposed KVM_LINK_GUEST_MEMFD
> ioctl [4] also doesn't check if the source VM is dead. There shouldn't
> be any check on the source VM, since the memory is from guest_memfd and
> should be independently transferable to a new VM.
> 
> [4] https://lore.kernel.org/lkml/cover.1747368092.git.afranji@google.com/T/

^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge pages
  2025-06-25  0:01                                                               ` Edgecombe, Rick P
@ 2025-06-25  7:29                                                                 ` Yan Zhao
  2025-06-25 23:09                                                                 ` Ackerley Tng
  1 sibling, 0 replies; 294+ messages in thread
From: Yan Zhao @ 2025-06-25  7:29 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: ackerleytng@google.com, quic_eberman@quicinc.com, Li, Xiaoyao,
	Shutemov, Kirill, Hansen, Dave, david@redhat.com,
	thomas.lendacky@amd.com, tabba@google.com, vbabka@suse.cz,
	Du, Fan, michael.roth@amd.com, linux-kernel@vger.kernel.org,
	seanjc@google.com, Peng, Chao P, pbonzini@redhat.com,
	Yamahata, Isaku, binbin.wu@linux.intel.com, Weiny, Ira,
	kvm@vger.kernel.org, Annapurve, Vishal, jroedel@suse.de,
	Miao, Jun, Li, Zhiquan1, pgonda@google.com, x86@kernel.org

On Wed, Jun 25, 2025 at 08:01:41AM +0800, Edgecombe, Rick P wrote:
> > > > So I think we're all in support of indicating unmapping/splitting issues
> > > > without returning anything from unmap(), and the discussed options are
> > > > 
> > > > a. Refcounts: won't work - mostly discussed in this (sub-)thread
> > > >    [3]. Using refcounts makes it impossible to distinguish between
> > > >    transient refcounts and refcounts due to errors.
> > > > 
> > > > b. Page flags: won't work with/can't benefit from HVO.
> > > 
> > > As above, this was for the purpose of catching bugs, not for guestmemfd to
> > > logically depend on it.
> > > 
> > > > 
> > > > Suggestions still in the running:
> > > > 
> > > > c. Folio flags are not precise enough to indicate which page actually
> > > >    had an error, but this could be sufficient if we're willing to just
> > > >    waste the rest of the huge page on unmapping error.
> > > 
> > > For a scenario of TDX module bug, it seems ok to me.
> > > 
> > > > 
> > > > d. Folio flags with folio splitting on error. This means that on
> > > >    unmapping/Secure EPT PTE splitting error, we have to split the
> > > >    (larger than 4K) folio to 4K, and then set a flag on the split folio.
> > > > 
> > > >    The issue I see with this is that splitting pages with HVO applied
> > > >    means doing allocations, and in an error scenario there may not be
> > > >    memory left to split the pages.
> > > > 
> > > > e. Some other data structure in guest_memfd, say, a linked list, and a
> > > >    function like kvm_gmem_add_error_pfn(struct page *page) that would
> > > >    look up the guest_memfd inode from the page and add the page's pfn to
> > > >    the linked list.
> > > > 
> > > >    Everywhere in guest_memfd that does unmapping/splitting would then
> > > >    check this linked list to see if the unmapping/splitting
> > > >    succeeded.
> > > > 
> > > >    Everywhere in guest_memfd that allocates pages will also check this
> > > >    linked list to make sure the pages are functional.
> > > > 
> > > >    When guest_memfd truncates, if the page being truncated is on the
> > > >    list, retain the refcount on the page and leak that page.
> > > 
> > > I think this is a fine option.
> > > 
> > > > 
> > > > f. Combination of c and e, something similar to HugeTLB's
> > > >    folio_set_hugetlb_hwpoison(), which sets a flag AND adds the pages in
> > > >    trouble to a linked list on the folio.
> > > > 
> > > > g. Like f, but basically treat an unmapping error as hardware poisoning.
> > > > 
> > > > I'm kind of inclined towards g, to just treat unmapping errors as
> > > > HWPOISON and buying into all the HWPOISON handling requirements. What do
> > > > yall think? Can a TDX unmapping error be considered as memory poisoning?
> > > 
> > > What does HWPOISON bring over refcounting the page/folio so that it never
> > > returns to the page allocator?
... 
> I do think that these threads have gone on far too long. It's probably about
> time to move forward with something even if it's just to have something to
> discuss that doesn't require footnoting so many lore links. So how about we move
> forward with option e as a next step. Does that sound good Yan?
I'm ok with e if allocating memory for the linked list is not a problem.
Otherwise, I feel that a simpler solution would be to set a folio flag when an
unmapping error occurs.

guest_memfd needs to check this folio flag before the actual conversion and
in kvm_gmem_free_folio().
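
For illustration, a minimal sketch of this folio-flag variant in the free
path, assuming HWPOISON (or a similar folio flag) is what TDX sets on an
unmapping error:

/*
 * Sketch only: leak a folio that failed to unmap from the S-EPT rather
 * than returning it to the page allocator. Using HWPOISON as the error
 * flag is an assumption for illustration.
 */
static void kvm_gmem_free_folio(struct folio *folio)
{
	if (folio_test_hwpoison(folio)) {
		/* Pin the folio forever so the pages are never reused. */
		folio_get(folio);
		return;
	}

	/* Normal arch-specific cleanup would continue here. */
}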


^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 09/21] KVM: TDX: Enable 2MB mapping size after TD is RUNNABLE
  2025-06-24 18:35                                                     ` Edgecombe, Rick P
@ 2025-06-25  9:28                                                       ` Yan Zhao
  2025-06-25  9:36                                                         ` Yan Zhao
  2025-06-25 14:47                                                         ` Edgecombe, Rick P
  2025-06-25 13:47                                                       ` Vishal Annapurve
  1 sibling, 2 replies; 294+ messages in thread
From: Yan Zhao @ 2025-06-25  9:28 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: quic_eberman@quicinc.com, Li, Xiaoyao, Huang, Kai, Du, Fan,
	Hansen, Dave, david@redhat.com, thomas.lendacky@amd.com,
	vbabka@suse.cz, Li, Zhiquan1, Shutemov, Kirill,
	michael.roth@amd.com, seanjc@google.com, Weiny, Ira, Peng, Chao P,
	pbonzini@redhat.com, Yamahata, Isaku,
	linux-kernel@vger.kernel.org, binbin.wu@linux.intel.com,
	ackerleytng@google.com, kvm@vger.kernel.org, Annapurve, Vishal,
	tabba@google.com, jroedel@suse.de, Miao, Jun, pgonda@google.com,
	x86@kernel.org

On Wed, Jun 25, 2025 at 02:35:59AM +0800, Edgecombe, Rick P wrote:
> On Tue, 2025-06-24 at 17:57 +0800, Yan Zhao wrote:
> > Could we provide the info via the private_max_mapping_level hook (i.e. via
> > tdx_gmem_private_max_mapping_level())?
> 
> This is one of the previous two methods discussed. Can you elaborate on what you
> are trying to say?
I don't get why we can't use the existing tdx_gmem_private_max_mapping_level()
to convey the max_level info at which a vendor hopes a GFN to be mapped.

Before TDX huge pages, tdx_gmem_private_max_mapping_level() always returns 4KB;
after TDX huge pages, it returns
- 4KB during the TD build stage
- at TD runtime: 4KB or 2MB

Why does KVM need to care how the vendor determines this max_level?
I think a vendor should have its freedom to decide based on software limitation,
guest's wishes, hardware bugs or whatever.
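
As a rough sketch of the behavior described above (the exact state check used
to tell TD build time from TD runtime is an assumption based on this series):

/*
 * Sketch only: the field/enum used to detect "TD is runnable" is an
 * assumption. The point is that the vendor hook alone can express
 * "4KB while building, up to 2MB at TD runtime".
 */
static int tdx_gmem_private_max_mapping_level(struct kvm *kvm, kvm_pfn_t pfn)
{
	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);

	if (kvm_tdx->state != TD_STATE_RUNNABLE)
		return PG_LEVEL_4K;

	return PG_LEVEL_2M;
}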

> > Or what about introducing a vendor hook in __kvm_mmu_max_mapping_level() for a
> > private fault?
> > 
> > > Maybe we could have EPT violations that contain 4k accept sizes first update the
> > > attribute for the GFN to be accepted or not, like have tdx.c call out to set
> > > kvm_lpage_info->disallow_lpage in the rarer case of 4k accept size? Or something
> > Something like kvm_lpage_info->disallow_lpage would disallow later page
> > promotion, though we don't support it right now.
> 
> Well I was originally thinking it would not set kvm_lpage_info->disallow_lpage
> directly, but rely on the logic that checks for mixed attributes. But more
> below...
> 
> > 
> > > like that. Maybe set a "accepted" attribute, or something. Not sure if could be
> > Setting "accepted" attribute in the EPT violation handler?
> > It's a little odd, as the accept operation is not yet completed.
> 
> I guess the question in both of these comments is: what is the life cycle. Guest
> could call TDG.MEM.PAGE.RELEASE to unaccept it as well. Oh, geez. It looks like
> TDG.MEM.PAGE.RELEASE will give the same size hints in the EPT violation. So an
> accept attribute is not going work, at least without TDX module changes.
> 
> 
> Actually, the problem we have doesn't fit the mixed attributes behavior. If many
> vCPU's accept at 2MB region at 4k page size, the entire 2MB range could be non-
> mixed and then individual accepts would fail.
> 
> 
> So instead there could be a KVM_LPAGE_GUEST_INHIBIT that doesn't get cleared
Set KVM_LPAGE_GUEST_INHIBIT via a TDVMCALL ?

Or just set the KVM_LPAGE_GUEST_INHIBIT when an EPT violation contains 4KB
level info?

I guess it's the latter one as it can avoid modification to both EDK2 and Linux
guest.  I observed ~2710 instances of "guest accepts at 4KB when KVM can map at
2MB" during the boot-up of a TD with 4GB memory.

But does it mean TDX needs to hold write mmu_lock in the EPT violation handler
and set KVM_LPAGE_GUEST_INHIBIT on finding a violation carries 4KB level info?

> based on mixed attributes. It would be one way. It would need to get set by
> something like kvm_write_track_add_gfn() that lives in tdx.c and is called
> before going into the fault handler on 4k accept size. It would have to take mmu
> write lock I think, which would kill scalability in the 4k accept case (but not
> the normal 2MB one). But as long as mmu_write lock is held, demote will be no
> problem, which the operation would also need to do.
> 
> I think it actually makes KVM's behavior easier to understand. We don't need to
> worry about races between multiple accept sizes and things like that. It also
> leaves the core MMU code mostly untouched. Performance/scalability wise it only
> punishes the rare case.
Write down my understanding to check if it's correct:

- when a TD is NOT configured to support KVM_LPAGE_GUEST_INHIBIT TDVMCALL, KVM
  always maps at 4KB

- When a TD is configured to support KVM_LPAGE_GUEST_INHIBIT TDVMCALL,

(a)
1. guest accepts at 4KB
2. TDX sets KVM_LPAGE_GUEST_INHIBIT and tries splitting (with write mmu_lock).
3. KVM maps at 4KB (with read mmu_lock)
4. guest's 4KB accept succeeds.

(b)
1. guest accepts at 2MB.
2. KVM maps at 4KB due to a certain reason.
3. guest's accept 2MB fails with TDACCEPT_SIZE_MISMATCH.
4. guest accepts at 4KB
5. guest's 4KB accept succeeds.

> For leaving the option open to promote the GFNs in the future, a GHCI interface
> or similar could be defined for the guest to say "I don't care about page size
> anymore for this gfn". So it won't close it off forever.
ok.

^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 09/21] KVM: TDX: Enable 2MB mapping size after TD is RUNNABLE
  2025-06-25  9:28                                                       ` Yan Zhao
@ 2025-06-25  9:36                                                         ` Yan Zhao
  2025-06-25 14:48                                                           ` Edgecombe, Rick P
  2025-06-25 14:47                                                         ` Edgecombe, Rick P
  1 sibling, 1 reply; 294+ messages in thread
From: Yan Zhao @ 2025-06-25  9:36 UTC (permalink / raw)
  To: Edgecombe, Rick P, quic_eberman@quicinc.com, Li, Xiaoyao,
	Huang, Kai, Du, Fan, Hansen, Dave, david@redhat.com,
	thomas.lendacky@amd.com, vbabka@suse.cz, Li, Zhiquan1,
	Shutemov, Kirill, michael.roth@amd.com, seanjc@google.com,
	Weiny, Ira, Peng, Chao P, pbonzini@redhat.com, Yamahata, Isaku,
	linux-kernel@vger.kernel.org, binbin.wu@linux.intel.com,
	ackerleytng@google.com, kvm@vger.kernel.org, Annapurve, Vishal,
	tabba@google.com, jroedel@suse.de, Miao, Jun, pgonda@google.com,
	x86@kernel.org

On Wed, Jun 25, 2025 at 05:28:22PM +0800, Yan Zhao wrote:
> Write down my understanding to check if it's correct:
> 
> - when a TD is NOT configured to support KVM_LPAGE_GUEST_INHIBIT TDVMCALL, KVM
>   always maps at 4KB
> 
> - When a TD is configured to support KVM_LPAGE_GUEST_INHIBIT TDVMCALL,
Sorry, the two conditions are stale ones. No need any more.
So it's always
 
 (a)
 1. guest accepts at 4KB
 2. TDX sets KVM_LPAGE_GUEST_INHIBIT and tries splitting (with write mmu_lock).
 3. KVM maps at 4KB (with read mmu_lock)
 4. guest's 4KB accept succeeds.
 
 (b)
 1. guest accepts at 2MB.
 2. KVM maps at 4KB due to a certain reason.
 3. guest's accept 2MB fails with TDACCEPT_SIZE_MISMATCH.
 4. guest accepts at 4KB
 5. guest's 4KB accept succeeds.
 

^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 09/21] KVM: TDX: Enable 2MB mapping size after TD is RUNNABLE
  2025-06-24 18:35                                                     ` Edgecombe, Rick P
  2025-06-25  9:28                                                       ` Yan Zhao
@ 2025-06-25 13:47                                                       ` Vishal Annapurve
  2025-06-25 15:51                                                         ` Edgecombe, Rick P
  1 sibling, 1 reply; 294+ messages in thread
From: Vishal Annapurve @ 2025-06-25 13:47 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: Zhao, Yan Y, quic_eberman@quicinc.com, Li, Xiaoyao, Huang, Kai,
	Du, Fan, Hansen, Dave, david@redhat.com, thomas.lendacky@amd.com,
	vbabka@suse.cz, Li, Zhiquan1, Shutemov, Kirill,
	michael.roth@amd.com, seanjc@google.com, Weiny, Ira, Peng, Chao P,
	pbonzini@redhat.com, Yamahata, Isaku,
	linux-kernel@vger.kernel.org, binbin.wu@linux.intel.com,
	ackerleytng@google.com, kvm@vger.kernel.org, tabba@google.com,
	jroedel@suse.de, Miao, Jun, pgonda@google.com, x86@kernel.org

On Tue, Jun 24, 2025 at 11:36 AM Edgecombe, Rick P
<rick.p.edgecombe@intel.com> wrote:
> ...
> For leaving the option open to promote the GFNs in the future, a GHCI interface
> or similar could be defined for the guest to say "I don't care about page size
> anymore for this gfn". So it won't close it off forever.
>

I think it's in the host's interest to get the pages mapped at large
page granularity whenever possible. Even if the guest doesn't buy into
the "future" GHCI interface, there should be some ABI between TDX
module and host VMM to allow promotion probably as soon as all the
ranges within a hugepage get accepted but are still mapped at 4K
granularity.

^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 09/21] KVM: TDX: Enable 2MB mapping size after TD is RUNNABLE
  2025-06-25  9:28                                                       ` Yan Zhao
  2025-06-25  9:36                                                         ` Yan Zhao
@ 2025-06-25 14:47                                                         ` Edgecombe, Rick P
  2025-06-26  8:53                                                           ` Yan Zhao
  1 sibling, 1 reply; 294+ messages in thread
From: Edgecombe, Rick P @ 2025-06-25 14:47 UTC (permalink / raw)
  To: Zhao, Yan Y
  Cc: Du, Fan, Li, Xiaoyao, Huang, Kai, Shutemov, Kirill, Hansen, Dave,
	david@redhat.com, thomas.lendacky@amd.com, vbabka@suse.cz,
	Li, Zhiquan1, quic_eberman@quicinc.com, michael.roth@amd.com,
	seanjc@google.com, Weiny, Ira, pbonzini@redhat.com, Peng, Chao P,
	Yamahata, Isaku, linux-kernel@vger.kernel.org,
	binbin.wu@linux.intel.com, ackerleytng@google.com,
	kvm@vger.kernel.org, Annapurve, Vishal, tabba@google.com,
	jroedel@suse.de, Miao, Jun, pgonda@google.com, x86@kernel.org

On Wed, 2025-06-25 at 17:28 +0800, Yan Zhao wrote:
> On Wed, Jun 25, 2025 at 02:35:59AM +0800, Edgecombe, Rick P wrote:
> > On Tue, 2025-06-24 at 17:57 +0800, Yan Zhao wrote:
> > > Could we provide the info via the private_max_mapping_level hook (i.e. via
> > > tdx_gmem_private_max_mapping_level())?
> > 
> > This is one of the previous two methods discussed. Can you elaborate on what you
> > are trying to say?
> I don't get why we can't use the existing tdx_gmem_private_max_mapping_level()
> to convey the max_level info at which a vendor hopes a GFN to be mapped.
> 
> Before TDX huge pages, tdx_gmem_private_max_mapping_level() always returns 4KB;
> after TDX huge pages, it returns
> - 4KB during the TD build stage
> - at TD runtime: 4KB or 2MB
> 
> Why does KVM need to care how the vendor determines this max_level?
> I think a vendor should have its freedom to decide based on software limitation,
> guest's wishes, hardware bugs or whatever.

I don't see that big of a difference between "KVM" and "vendor". TDX code is KVM
code. Just because it's in tdx.c doesn't mean it's ok for it to be hard to trace
the logic.

I'm not sure what Sean's objection was to that approach, or if he objected to
just the weird SIZE_MISMATCH behavior of the TDX module. I think you already know
why I don't prefer it:
 - Requiring demote in the fault handler. This requires an additional write lock
inside the mmu read lock, or TDX module changes. Although now I wonder if the
interrupt error code related problems will get worse with this solution. The
solution is currently not settled.
 - Requiring passing args on the vCPU struct, which as you point out will work
functionally today only because the prefault stuff will avoid seeing it. But
it's fragile
 - The page size behavior is a bit implicit

I'm coming back to this draft after PUCK. Sean shared his thoughts there. I'll
try to summarize. He didn't like how the page size requirements were passed
through the fault handler in a "transient" way. That "transient" property covers
both of the two options for passing the size info through the fault handler that
we were debating. He also didn't like how TDX arch requires KVM to map at a
specific host size around accept. Michael Roth brought up that SNP has the same
requirement, but it can do the zap and refault approach.

We then discussed this lpage_info idea. He was in favor of it, but not, I'd say,
overly enthusiastic. In a "least worst option" kind of way.

I think the biggest downside is the MMU write lock. Our goal for this series is
to help performance, not to get huge page sizes. So if we do this idea, we can't
fully wave our hands that any optimization is premature. It *is* an
optimization. We need to either convince ourselves that the overall benefit is
still there, or have a plan to adapt the guest to avoid 4k accepts. Which we
were previously discussing requiring anyway.

But I much prefer the deterministic behavior of this approach from a
maintainability perspective.

> 
> > > Or what about introducing a vendor hook in __kvm_mmu_max_mapping_level() for a
> > > private fault?
> > > 
> > > > Maybe we could have EPT violations that contain 4k accept sizes first update the
> > > > attribute for the GFN to be accepted or not, like have tdx.c call out to set
> > > > kvm_lpage_info->disallow_lpage in the rarer case of 4k accept size? Or something
> > > Something like kvm_lpage_info->disallow_lpage would disallow later page
> > > promotion, though we don't support it right now.
> > 
> > Well I was originally thinking it would not set kvm_lpage_info->disallow_lpage
> > directly, but rely on the logic that checks for mixed attributes. But more
> > below...
> > 
> > > 
> > > > like that. Maybe set a "accepted" attribute, or something. Not sure if could be
> > > Setting "accepted" attribute in the EPT violation handler?
> > > It's a little odd, as the accept operation is not yet completed.
> > 
> > I guess the question in both of these comments is: what is the life cycle. Guest
> > could call TDG.MEM.PAGE.RELEASE to unaccept it as well. Oh, geez. It looks like
> > TDG.MEM.PAGE.RELEASE will give the same size hints in the EPT violation. So an
> > accept attribute is not going work, at least without TDX module changes.
> > 
> > 
> > Actually, the problem we have doesn't fit the mixed attributes behavior. If many
> > vCPU's accept at 2MB region at 4k page size, the entire 2MB range could be non-
> > mixed and then individual accepts would fail.
> > 
> > 
> > So instead there could be a KVM_LPAGE_GUEST_INHIBIT that doesn't get cleared
> Set KVM_LPAGE_GUEST_INHIBIT via a TDVMCALL ?
> 
> Or just set the KVM_LPAGE_GUEST_INHIBIT when an EPT violation contains 4KB
> level info?

Yes, that's the idea. 2MB accepts can behave like normal.

> 
> I guess it's the latter one as it can avoid modification to both EDK2 and Linux
> guest.  I observed ~2710 instances of "guest accepts at 4KB when KVM can map at
> 2MB" during the boot-up of a TD with 4GB memory.

Oh, wow that is more than I expected. Did you notice how many vCPUs they were
spread across? What memory size did you use? What was your guest memory
configuration?

> 
> But does it mean TDX needs to hold write mmu_lock in the EPT violation handler
> and set KVM_LPAGE_GUEST_INHIBIT on finding a violation carries 4KB level info?

I think so. I didn't check the reason, but the other similar code took it. Maybe
not? If we don't need to take mmu write lock, then this idea seems like a clear
winner to me.
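
For reference, a sketch of the shape of the tdx.c side being discussed.
KVM_LPAGE_GUEST_INHIBIT and the helper below do not exist; they are purely
illustrative:

/*
 * Sketch only: kvm_lpage_info_inhibit_guest() is hypothetical. The idea
 * is to record a 4KB accept under the write mmu_lock before the fault
 * is handled at 4KB, so later faults cannot map the region at 2MB.
 */
static void tdx_handle_4k_accept(struct kvm *kvm, gfn_t gfn)
{
	write_lock(&kvm->mmu_lock);
	/* Disallow 2MB mappings for the 2MB region containing this gfn. */
	kvm_lpage_info_inhibit_guest(kvm, gfn, PG_LEVEL_2M);
	write_unlock(&kvm->mmu_lock);
}
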

> 
> > based on mixed attributes. It would be one way. It would need to get set by
> > something like kvm_write_track_add_gfn() that lives in tdx.c and is called
> > before going into the fault handler on 4k accept size. It would have to take mmu
> > write lock I think, which would kill scalability in the 4k accept case (but not
> > the normal 2MB one). But as long as mmu_write lock is held, demote will be no
> > problem, which the operation would also need to do.
> > 
> > I think it actually makes KVM's behavior easier to understand. We don't need to
> > worry about races between multiple accept sizes and things like that. It also
> > leaves the core MMU code mostly untouched. Performance/scalability wise it only
> > punishes the rare case.
> Write down my understanding to check if it's correct:

Will respond to this part on your later mail with corrections.




^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 09/21] KVM: TDX: Enable 2MB mapping size after TD is RUNNABLE
  2025-06-25  9:36                                                         ` Yan Zhao
@ 2025-06-25 14:48                                                           ` Edgecombe, Rick P
  2025-06-26  0:50                                                             ` Yan Zhao
  0 siblings, 1 reply; 294+ messages in thread
From: Edgecombe, Rick P @ 2025-06-25 14:48 UTC (permalink / raw)
  To: quic_eberman@quicinc.com, Li, Xiaoyao, Huang, Kai, Du, Fan,
	Hansen, Dave, david@redhat.com, thomas.lendacky@amd.com,
	Zhao, Yan Y, Li, Zhiquan1, Shutemov, Kirill, michael.roth@amd.com,
	seanjc@google.com, Weiny, Ira, pbonzini@redhat.com, Peng, Chao P,
	Yamahata, Isaku, linux-kernel@vger.kernel.org, vbabka@suse.cz,
	ackerleytng@google.com, kvm@vger.kernel.org,
	binbin.wu@linux.intel.com, Annapurve, Vishal, jroedel@suse.de,
	Miao, Jun, tabba@google.com, pgonda@google.com, x86@kernel.org

On Wed, 2025-06-25 at 17:36 +0800, Yan Zhao wrote:
> On Wed, Jun 25, 2025 at 05:28:22PM +0800, Yan Zhao wrote:
> > Write down my understanding to check if it's correct:
> > 
> > - when a TD is NOT configured to support KVM_LPAGE_GUEST_INHIBIT TDVMCALL, KVM
> >    always maps at 4KB
> > 
> > - When a TD is configured to support KVM_LPAGE_GUEST_INHIBIT TDVMCALL,
> Sorry, the two conditions are stale ones. No need any more.
> So it's always
>  
>  (a)
>  1. guest accepts at 4KB
>  2. TDX sets KVM_LPAGE_GUEST_INHIBIT and tries splitting (with write mmu_lock).
>  3. KVM maps at 4KB (with read mmu_lock)
>  4. guest's 4KB accept succeeds.

Yea.

>  
>  (b)
>  1. guest accepts at 2MB.
>  2. KVM maps at 4KB due to a certain reason.

I don't follow this part. You mean because it spans a memslot or other?
Basically that KVM won't guarantee the page size at exactly the accept size? I
think this is ok and good. The ABI can be that KVM will guarantee the S-EPT
mapping size <= the accept size.

>  3. guest's accept 2MB fails with TDACCEPT_SIZE_MISMATCH.
>  4. guest accepts at 4KB
>  5. guest's 4KB accept succeeds.
>  
In this option accept behavior doesn't need to change, but the
TDACCEPT_SIZE_MISMATCH in step 3 still is a little weird. TDX module could
accept at 4k mapping size. But this is an issue for the guest to deal with, not
KVM.

^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 09/21] KVM: TDX: Enable 2MB mapping size after TD is RUNNABLE
  2025-06-25 13:47                                                       ` Vishal Annapurve
@ 2025-06-25 15:51                                                         ` Edgecombe, Rick P
  0 siblings, 0 replies; 294+ messages in thread
From: Edgecombe, Rick P @ 2025-06-25 15:51 UTC (permalink / raw)
  To: Annapurve, Vishal
  Cc: quic_eberman@quicinc.com, Li, Xiaoyao, Huang, Kai, Du, Fan,
	Hansen, Dave, david@redhat.com, thomas.lendacky@amd.com,
	Zhao, Yan Y, Li, Zhiquan1, Shutemov, Kirill, michael.roth@amd.com,
	seanjc@google.com, Weiny, Ira, pbonzini@redhat.com, Peng, Chao P,
	Yamahata, Isaku, linux-kernel@vger.kernel.org, vbabka@suse.cz,
	ackerleytng@google.com, kvm@vger.kernel.org,
	binbin.wu@linux.intel.com, tabba@google.com, jroedel@suse.de,
	Miao, Jun, pgonda@google.com, x86@kernel.org

On Wed, 2025-06-25 at 06:47 -0700, Vishal Annapurve wrote:
> On Tue, Jun 24, 2025 at 11:36 AM Edgecombe, Rick P
> <rick.p.edgecombe@intel.com> wrote:
> > ...
> > For leaving the option open to promote the GFNs in the future, a GHCI interface
> > or similar could be defined for the guest to say "I don't care about page size
> > anymore for this gfn". So it won't close it off forever.
> > 
> 
> I think it's in the host's interest to get the pages mapped at large
> page granularity whenever possible. Even if the guest doesn't buy into
> the "future" GHCI interface, there should be some ABI between TDX
> module and host VMM to allow promotion probably as soon as all the
> ranges within a hugepage get accepted but are still mapped at 4K
> granularity.

In the 4k accept case, the guest is kind of requesting a specific host page
size. I agree it's not good to let the guest influence the host's resource
usage. But this already happens with private/shared conversions.

As for future promotion opportunities, I think that part needs a re-think. I
don't think cost/benefit is really there today. If we had a simpler solution (we
discussed some TDX module changes offline), then it changes the calculus. But we
shouldn't focus too much on the ideal TDX implementation. Getting the ideal case
upstream is far, far away. In the meantime we should focus on the simplest
things with the most benefit. In the end I'd expect an iterative, evolving
implementation to be faster to upstream than thinking through how it works with
every idea. The exception is thinking through a sane ABI ahead of time.

I don't think we necessarily need a GHCI interface to expose control of host
page sizes to the guest, but I think it might help with determinism. I meant it
sort of as an escape hatch. Like if we find some nasty races that prevent
optimizations for promotion, we could have an option to have the guest help by
making the ABI around page sizes more formal.

Side topic on page promotion, I'm wondering if the biggest bang-for-the-buck
promotion opportunity will be the memory that gets added via PAGE.ADD at TD
startup time. Which is a narrow specific case that may be easier to attack.

^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge pages
  2025-06-25  7:08                                                               ` Yan Zhao
@ 2025-06-25 22:54                                                                 ` Ackerley Tng
  0 siblings, 0 replies; 294+ messages in thread
From: Ackerley Tng @ 2025-06-25 22:54 UTC (permalink / raw)
  To: Yan Zhao
  Cc: Edgecombe, Rick P, quic_eberman@quicinc.com, Li, Xiaoyao, Du, Fan,
	Hansen, Dave, david@redhat.com, thomas.lendacky@amd.com,
	tabba@google.com, vbabka@suse.cz, Shutemov, Kirill,
	michael.roth@amd.com, linux-kernel@vger.kernel.org,
	seanjc@google.com, pbonzini@redhat.com, binbin.wu@linux.intel.com,
	Yamahata, Isaku, Peng, Chao P, Weiny, Ira, kvm@vger.kernel.org,
	Annapurve, Vishal, jroedel@suse.de, Miao, Jun, Li, Zhiquan1,
	pgonda@google.com, x86@kernel.org

Yan Zhao <yan.y.zhao@intel.com> writes:

> On Tue, Jun 24, 2025 at 04:30:57PM -0700, Ackerley Tng wrote:
>> "Edgecombe, Rick P" <rick.p.edgecombe@intel.com> writes:
>> 
>> > On Mon, 2025-06-23 at 15:48 -0700, Ackerley Tng wrote:
>> >> Let me try and summarize the current state of this discussion:
>> >> 
>> >> Topic 1: Does TDX need to somehow indicate that it is using a page?
>> >> 
>> >> This patch series uses refcounts to indicate that TDX is using a page,
>> >> but that complicates private-to-shared conversions.
>> >> 
>> >> During a private-to-shared conversion, guest_memfd assumes that
>> >> guest_memfd is trusted to manage private memory. TDX and other users
>> >> should trust guest_memfd to keep the memory around.
>> >> 
>> >> Yan's position is that holding a refcount is in line with how IOMMU
>> >> takes a refcount when a page is mapped into the IOMMU [1].
>> >> 
>> >> Yan had another suggestion, which is to indicate using a page flag [2].
>> >> 
>> >> I think we're in agreement that we don't want to have TDX hold a
>> >> refcount while the page is mapped into the Secure EPTs, but taking a
>> >> step back, do we really need to indicate (at all) that TDX is using a
>> >> page?
>> >> 
>> >> In [3] Yan said
>> >> 
>> >> > If TDX does not hold any refcount, guest_memfd has to know that which
>> >> > private
>> >> > page is still mapped. Otherwise, the page may be re-assigned to other
>> >> > kernel
>> >> > components while it may still be mapped in the S-EPT.
>> >> 
>> >> If the private page is mapped for regular VM use as private memory,
>> >> guest_memfd is managing that, and the same page will not be re-assigned
>> >> to any other kernel component. guest_memfd does hold refcounts in
>> >> guest_memfd's filemap.
>> >> 
>> >> If the private page is still mapped because there was an unmapping
>> >> failure, we can discuss that separately under error handling in Topic 2.
>> >> 
>> >> With this, can I confirm that we are in agreement that TDX does not need
>> >> to indicate that it is using a page, and can trust guest_memfd to keep
>> >> the page around for the VM?
>> >
>> > Minor correction here. Yan was concerned about *bugs* happening when freeing
>> > pages that are accidentally still mapped in the S-EPT. My opinion is that this
>> > is not especially risky to happen here vs other similar places, but it could be
>> > helpful if there was a way to catch such bugs. The page flag, or page_ext
>> > direction came out of a discussion with Dave and Kirill. If it could run all the
>> > time that would be great, but if not a debug config could be sufficient. For
>> > example like CONFIG_PAGE_TABLE_CHECK. It doesn't need to support vmemmap
>> > optimizations because the debug checking doesn't need to run all the time.
>> > Overhead for debug settings is very normal.
>> >
>> 
>> I see, let's call debug checking Topic 3 then, to separate it from Topic
>> 1, which is TDX indicating that it is using a page for production
>> kernels.
>> 
>> Topic 3: How should TDX indicate use of a page for debugging?
>> 
>> I'm okay if for debugging, TDX uses anything other than refcounts for
>> checking, because refcounts will interfere with conversions.
>> 
>> Rick's other email is correct. The correct link should be
>> https://lore.kernel.org/all/aFJjZFFhrMWEPjQG@yzhao56-desk.sh.intel.com/.
>> 
>> [INTERFERE WITH CONVERSIONS]
>> 
>> To summarize, if TDX uses refcounts to indicate that it is using a page,
>> or to indicate anything else, then we cannot easily split a page on
>> private to shared conversions.
>> 
>> Specifically, consider the case where only the x-th subpage of a huge
>> folio is mapped into Secure-EPTs. When the guest requests to convert
>> some subpage to shared, the huge folio has to be split for
>> core-mm. Core-mm, which will use the shared page, must have split folios
>> to be able to accurately and separately track refcounts for subpages.
>> 
>> During splitting, guest_memfd would see refcount of 512 (for 2M page
>> being in the filemap) + 1 (if TDX indicates that the x-th subpage is
>> mapped using a refcount), but would not be able to tell that the 513th
>> refcount belongs to the x-th subpage. guest_memfd can't split the huge
>> folio unless it knows how to distribute the 513th refcount.
> In my POC, https://lore.kernel.org/all/aE%2Fq9VKkmaCcuwpU@yzhao56-desk.sh.intel.com/
> kvm_gmem_private_has_safe_refcount() was introduce to check the folio ref count.
> It rejects private-to-shared conversion after splitting and unmapping KVM's
> secondary page tables if the refcount exceeds a valid threshold.
>
> Though in
> https://lore.kernel.org/all/aFJjZFFhrMWEPjQG@yzhao56-desk.sh.intel.com/, we
> agreed that "EAGAIN is not the right code in case of "extra" refcounts held by
> TDX", this does not imply that rejecting the conversion itself is incorrect.
>
> This is why we are exploring alternative solutions instead of having TDX hold
> the page refcount.
>
> So, either a per-page flag, per-folio flag or solutions e,f,g should be good.
>
> IMO, regardless of the final choice, guest_memfd needs to identify problematic
> folios to:
> - reject the private-to-shared conversion
> - prevent further recycling after kvm_gmem_free_folio()
>

Agreed!

>> One might say guest_memfd could clear all the refcounts that TDX is
>> holding on the huge folio by unmapping the entire huge folio from the
>> Secure-EPTs, but unmapping the entire huge folio for TDX means zeroing
>> the contents and requiring guest re-acceptance. Both of these would mess
>> up guest operation.
>> 
>> Hence, guest_memfd's solution is to require that users of guest_memfd
>> for private memory trust guest_memfd to maintain the pages around and
>> not take any refcounts.
>> 
>> So back to Topic 1, for production kernels, is it okay that TDX does not
>> need to indicate that it is using a page, and can trust guest_memfd to
>> keep the page around for the VM?
> If the "TDX does not need to indicate that it is using a page" means "do not
> take page refcount", I'm ok.
>

Yes, I was trying to generalize "do not take page refcount" to "TDX does
not need to indicate that it is using a page", but I guess TDX can
indicate that it is using a page for debugging as long as it doesn't
use refcounts or otherwise interfere with conversions. So I believe we
are in agreement on Topic 1 :)

>> >> 
>> >> Topic 2: How to handle unmapping/splitting errors arising from TDX?
>> >> 
>> >> Previously I was in favor of having unmap() return an error (Rick
>> >> suggested doing a POC, and in a more recent email Rick asked for a
>> >> diffstat), but Vishal and I talked about this and now I agree having
>> >> unmapping return an error is not a good approach for these reasons.
>> >
>> > Ok, let's close this option then.
>> >
>> >> 
>> >> 1. Unmapping takes a range, and within the range there could be more
>> >>    than one unmapping error. I was previously thinking that unmap()
>> >>    could return 0 for success and the failed PFN on error. Returning a
>> >>    single PFN on error is okay-ish but if there are more errors it could
>> >>    get complicated.
>> >> 
>> >>    Another error return option could be to return the folio where the
>> >>    unmapping/splitting issue happened, but that would not be
>> >>    sufficiently precise, since a folio could be larger than 4K and we
>> >>    want to track errors as precisely as we can to reduce memory loss due
>> >>    to errors.
>> >> 
>> >> 2. What I think Yan has been trying to say: unmap() returning an error
>> >>    is non-standard in the kernel.
>> >> 
>> >> I think (1) is the dealbreaker here and there's no need to do the
>> >> plumbing POC and diffstat.
>> >> 
>> >> So I think we're all in support of indicating unmapping/splitting issues
>> >> without returning anything from unmap(), and the discussed options are
>> >> 
>> >> a. Refcounts: won't work - mostly discussed in this (sub-)thread
>> >>    [3]. Using refcounts makes it impossible to distinguish between
>> >>    transient refcounts and refcounts due to errors.
>> >> 
>> >> b. Page flags: won't work with/can't benefit from HVO.
>> >
>> > As above, this was for the purpose of catching bugs, not for guestmemfd to
>> > logically depend on it.
>> >
>> >> 
>> >> Suggestions still in the running:
>> >> 
>> >> c. Folio flags are not precise enough to indicate which page actually
>> >>    had an error, but this could be sufficient if we're willing to just
>> >>    waste the rest of the huge page on unmapping error.
>> >
>> > For a scenario of TDX module bug, it seems ok to me.
>> >
>> >> 
>> >> d. Folio flags with folio splitting on error. This means that on
>> >>    unmapping/Secure EPT PTE splitting error, we have to split the
>> >>    (larger than 4K) folio to 4K, and then set a flag on the split folio.
>> >> 
>> >>    The issue I see with this is that splitting pages with HVO applied
>> >>    means doing allocations, and in an error scenario there may not be
>> >>    memory left to split the pages.
>> >> 
>> >> e. Some other data structure in guest_memfd, say, a linked list, and a
>> >>    function like kvm_gmem_add_error_pfn(struct page *page) that would
>> >>    look up the guest_memfd inode from the page and add the page's pfn to
>> >>    the linked list.
>> >> 
>> >>    Everywhere in guest_memfd that does unmapping/splitting would then
>> >>    check this linked list to see if the unmapping/splitting
>> >>    succeeded.
>> >> 
>> >>    Everywhere in guest_memfd that allocates pages will also check this
>> >>    linked list to make sure the pages are functional.
>> >> 
>> >>    When guest_memfd truncates, if the page being truncated is on the
>> >>    list, retain the refcount on the page and leak that page.
>> >
>> > I think this is a fine option.
>> >
>> >> 
>> >> f. Combination of c and e, something similar to HugeTLB's
>> >>    folio_set_hugetlb_hwpoison(), which sets a flag AND adds the pages in
>> >>    trouble to a linked list on the folio.
>> >> 
>> >> g. Like f, but basically treat an unmapping error as hardware poisoning.
>> >> 
>> >> I'm kind of inclined towards g, to just treat unmapping errors as
>> >> HWPOISON and buying into all the HWPOISON handling requirements. What do
>> >> yall think? Can a TDX unmapping error be considered as memory poisoning?
>> >
>> > What does HWPOISON bring over refcounting the page/folio so that it never
>> > returns to the page allocator?
>> 
>> For Topic 2 (handling TDX unmapping errors), HWPOISON is better than
>> refcounting because refcounting interferes with conversions (see
>> [INTERFERE WITH CONVERSIONS] above).
>> 
>> > We are bugging the TD in these cases.
>> 
>> Bugging the TD does not help to prevent future conversions from being
>> interfered with.
>> 
>> 1. Conversions involves unmapping, so we could actually be in a
>>    conversion, the unmapping is performed and fails, and then we try to
>>    split and enter an infinite loop since private to shared conversions
>>    assumes guest_memfd holds the only refcounts on guest_memfd memory.
> We should bail out conversion even with the HWPOISON.
> e.g.,
> 1. user triggers private-to-shared ioctl to convert 4K page A within a 2MB folio
>    B to shared.
> 2. kvm_gmem_convert_should_proceed() executes kvm_gmem_split_private() and
>    kvm_gmem_zap().
> 3. kvm_gmem_convert_should_proceed() checks kvm_gmem_has_invalid_folio()
>    (Suppose TDX sets HWPOISON to page A or folio B after kvm_gmem_zap(), then
>      kvm_gmem_has_invalid_folio() should return true).
> 4. Return -EFAULT.
>
> If we allow the actual conversion to proceed after step 3, folio B will be split
> into 4KB folios, with page A being converted to a shared 4KB folio, which
> becomes accessible by userspace.
>
> This could cause a machine check (#MC) on certain platforms. We should avoid
> this scenario when possible.
>
>

Thanks for pointing this out. This is a good point and will definitely
have to be handled separately under "what to do when there was some
issue on a page (possibly caused by unmapping)", or as Yan pointed out
above, what to do when "handling problematic folios".

Regarding handling, recording, or communicating errors to guest_memfd,
Yan seems to be in favor of some kind of page flag. I know Rick is
suggesting option e. Can we do something between f and g? I'm thinking
that the easiest page flag to use is the HWpoison flag, because

1. HWpoison checking is already part of guest_memfd, or should
   be.

   guest_memfd already checks HWpoison for kvm_gmem_get_pfn(), and
   __do_fault() checks HWpoison for guest_memfd [5]. As Yan pointed out
   above, it should definitely check and deal with HWpoison on
   conversion. Perhaps on truncation it should look at HWpoison, or very
   likely memory_failure() will need special handling for guest_memfd
   folios. I'll look into this separately as part of the HugeTLB support
   patch series.

2. HWpoison support (especially tracking of sub-folio HWpoison) in
   folio_set_hugetlb_hwpoison() can hopefully be reused for guest_memfd.

3. For now, no need to invent a new tracking mechanism or data structure
   to support option e.

4. HWpoison is kind of "part of guest_memfd" if you consider that
   guest_memfd folios are pretty much always owned by some guest_memfd
   inode, and if the HWpoison flag is checked at the appropriate places.

Could we (at least for the next RFC of this TDX huge page support for
private memory series, or as a first cut) use HWpoison, and then, if we
identify that the concept of HWpoison is being overloaded/polluted,
try another flag/mechanism for tracking?

I plan to work on HWpoison/kvm_gmem_error_folio() handling for HugeTLB
soon and then I can keep the community updated if I find anything new or
incompatible.
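
As a minimal sketch of the check this boils down to (the helper name is made
up; guest_memfd's existing HWpoison check in kvm_gmem_get_pfn() is the model):

/*
 * Sketch only: the helper name is hypothetical. guest_memfd already
 * refuses to hand out HWpoison-ed pages from kvm_gmem_get_pfn(); the
 * proposal above applies the same test on conversion and truncation.
 */
static bool kvm_gmem_folio_usable(struct folio *folio)
{
	return !folio_test_hwpoison(folio);
}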

>> 2. The conversion ioctl is a guest_memfd ioctl, not a VM ioctl, and so
>>    there is no check that the VM is not dead. There shouldn't be any
>>    check on the VM, because shareability is a property of the memory and
>>    should be changeable independent of the associated VM.
>> 
>> > Ohhh... Is
>> > this about the code to allow gmem fds to be handed to new VMs?
>> 
>> Nope, it's not related to linking. The proposed KVM_LINK_GUEST_MEMFD
>> ioctl [4] also doesn't check if the source VM is dead. There shouldn't
>> be any check on the source VM, since the memory is from guest_memfd and
>> should be independently transferable to a new VM.
>> 
>> [4] https://lore.kernel.org/lkml/cover.1747368092.git.afranji@google.com/T/

[5] https://lore.kernel.org/all/diqzv7ojjxyd.fsf@ackerleytng-ctop.c.googlers.com/

^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge pages
  2025-06-25  0:01                                                               ` Edgecombe, Rick P
  2025-06-25  7:29                                                                 ` Yan Zhao
@ 2025-06-25 23:09                                                                 ` Ackerley Tng
  2025-06-25 23:19                                                                   ` Edgecombe, Rick P
  1 sibling, 1 reply; 294+ messages in thread
From: Ackerley Tng @ 2025-06-25 23:09 UTC (permalink / raw)
  To: Edgecombe, Rick P, Zhao, Yan Y
  Cc: quic_eberman@quicinc.com, Li, Xiaoyao, Shutemov, Kirill,
	Hansen, Dave, david@redhat.com, thomas.lendacky@amd.com,
	tabba@google.com, vbabka@suse.cz, Du, Fan, michael.roth@amd.com,
	linux-kernel@vger.kernel.org, seanjc@google.com, Peng, Chao P,
	pbonzini@redhat.com, Yamahata, Isaku, binbin.wu@linux.intel.com,
	Weiny, Ira, kvm@vger.kernel.org, Annapurve, Vishal,
	jroedel@suse.de, Miao, Jun, Li, Zhiquan1, pgonda@google.com,
	x86@kernel.org

"Edgecombe, Rick P" <rick.p.edgecombe@intel.com> writes:

> On Tue, 2025-06-24 at 16:30 -0700, Ackerley Tng wrote:
>> I see, let's call debug checking Topic 3 then, to separate it from Topic
>> 1, which is TDX indicating that it is using a page for production
>> kernels.
>> 
>> Topic 3: How should TDX indicate use of a page for debugging?
>> 
>> I'm okay if for debugging, TDX uses anything other than refcounts for
>> checking, because refcounts will interfere with conversions.
>
> Ok. It can be follow on work I think.
>

Yup I agree.

>> 
>> Rick's other email is correct. The correct link should be
>> https://lore.kernel.org/all/aFJjZFFhrMWEPjQG@yzhao56-desk.sh.intel.com/.
>> 
>> [INTERFERE WITH CONVERSIONS]
>> 
>> To summarize, if TDX uses refcounts to indicate that it is using a page,
>> or to indicate anything else, then we cannot easily split a page on
>> private to shared conversions.
>> 
>> Specifically, consider the case where only the x-th subpage of a huge
>> folio is mapped into Secure-EPTs. When the guest requests to convert
>> some subpage to shared, the huge folio has to be split for
>> core-mm. Core-mm, which will use the shared page, must have split folios
>> to be able to accurately and separately track refcounts for subpages.
>> 
>> During splitting, guest_memfd would see refcount of 512 (for 2M page
>> being in the filemap) + 1 (if TDX indicates that the x-th subpage is
>> mapped using a refcount), but would not be able to tell that the 513th
>> refcount belongs to the x-th subpage. guest_memfd can't split the huge
>> folio unless it knows how to distribute the 513th refcount.
>> 
>> One might say guest_memfd could clear all the refcounts that TDX is
>> holding on the huge folio by unmapping the entire huge folio from the
>> Secure-EPTs, but unmapping the entire huge folio for TDX means zeroing
>> the contents and requiring guest re-acceptance. Both of these would mess
>> up guest operation.
>> 
>> Hence, guest_memfd's solution is to require that users of guest_memfd
>> for private memory trust guest_memfd to maintain the pages around and
>> not take any refcounts.
>> 
>> So back to Topic 1, for production kernels, is it okay that TDX does not
>> need to indicate that it is using a page, and can trust guest_memfd to
>> keep the page around for the VM?
>
> I think Yan's concern is not totally invalid. But I don't see a problem if we
> have a line of sight to adding debug checking as follow on work. That is kind of
> the path I was trying to nudge.
>
>> 
>> > > 
>> > > Topic 2: How to handle unmapping/splitting errors arising from TDX?
>> > > 
>> > > Previously I was in favor of having unmap() return an error (Rick
>> > > suggested doing a POC, and in a more recent email Rick asked for a
>> > > diffstat), but Vishal and I talked about this and now I agree having
>> > > unmapping return an error is not a good approach for these reasons.
>> > 
>> > Ok, let's close this option then.
>> > 
>> > > 
>> > > 1. Unmapping takes a range, and within the range there could be more
>> > >    than one unmapping error. I was previously thinking that unmap()
>> > >    could return 0 for success and the failed PFN on error. Returning a
>> > >    single PFN on error is okay-ish but if there are more errors it could
>> > >    get complicated.
>> > > 
>> > >    Another error return option could be to return the folio where the
>> > >    unmapping/splitting issue happened, but that would not be
>> > >    sufficiently precise, since a folio could be larger than 4K and we
>> > >    want to track errors as precisely as we can to reduce memory loss due
>> > >    to errors.
>> > > 
>> > > 2. What I think Yan has been trying to say: unmap() returning an error
>> > >    is non-standard in the kernel.
>> > > 
>> > > I think (1) is the dealbreaker here and there's no need to do the
>> > > plumbing POC and diffstat.
>> > > 
>> > > So I think we're all in support of indicating unmapping/splitting issues
>> > > without returning anything from unmap(), and the discussed options are
>> > > 
>> > > a. Refcounts: won't work - mostly discussed in this (sub-)thread
>> > >    [3]. Using refcounts makes it impossible to distinguish between
>> > >    transient refcounts and refcounts due to errors.
>> > > 
>> > > b. Page flags: won't work with/can't benefit from HVO.
>> > 
>> > As above, this was for the purpose of catching bugs, not for guestmemfd to
>> > logically depend on it.
>> > 
>> > > 
>> > > Suggestions still in the running:
>> > > 
>> > > c. Folio flags are not precise enough to indicate which page actually
>> > >    had an error, but this could be sufficient if we're willing to just
>> > >    waste the rest of the huge page on unmapping error.
>> > 
>> > For a scenario of TDX module bug, it seems ok to me.
>> > 
>> > > 
>> > > d. Folio flags with folio splitting on error. This means that on
>> > >    unmapping/Secure EPT PTE splitting error, we have to split the
>> > >    (larger than 4K) folio to 4K, and then set a flag on the split folio.
>> > > 
>> > >    The issue I see with this is that splitting pages with HVO applied
>> > >    means doing allocations, and in an error scenario there may not be
>> > >    memory left to split the pages.
>> > > 
>> > > e. Some other data structure in guest_memfd, say, a linked list, and a
>> > >    function like kvm_gmem_add_error_pfn(struct page *page) that would
>> > >    look up the guest_memfd inode from the page and add the page's pfn to
>> > >    the linked list.
>> > > 
>> > >    Everywhere in guest_memfd that does unmapping/splitting would then
>> > >    check this linked list to see if the unmapping/splitting
>> > >    succeeded.
>> > > 
>> > >    Everywhere in guest_memfd that allocates pages will also check this
>> > >    linked list to make sure the pages are functional.
>> > > 
>> > >    When guest_memfd truncates, if the page being truncated is on the
>> > >    list, retain the refcount on the page and leak that page.
>> > 
>> > I think this is a fine option.
>> > 
>> > > 
>> > > f. Combination of c and e, something similar to HugeTLB's
>> > >    folio_set_hugetlb_hwpoison(), which sets a flag AND adds the pages in
>> > >    trouble to a linked list on the folio.
>> > > 
>> > > g. Like f, but basically treat an unmapping error as hardware poisoning.
>> > > 
>> > > I'm kind of inclined towards g, to just treat unmapping errors as
>> > > HWPOISON and buying into all the HWPOISON handling requirements. What do
>> > > yall think? Can a TDX unmapping error be considered as memory poisoning?
>> > 
>> > What does HWPOISON bring over refcounting the page/folio so that it never
>> > returns to the page allocator?
>> 
>> For Topic 2 (handling TDX unmapping errors), HWPOISON is better than
>> refcounting because refcounting interferes with conversions (see
>> [INTERFERE WITH CONVERSIONS] above).
>
> I don't know if it quite fits. I think it would be better to not pollute the
> concept if possible.
>

If there's something we know for sure doesn't fit, and that we're
overloading/polluting the concept of HWpoison, then we shouldn't
proceed, but otherwise, is it okay to go with HWpoison as a first cut? I
replied to Yan's email with reasons why I think we should give HWpoison a
try, at least for the next RFC.

>> 
>> > We are bugging the TD in these cases.
>> 
>> Bugging the TD does not help to prevent future conversions from being
>> interfered with.
>> 
>> 1. Conversions involves unmapping, so we could actually be in a
>>    conversion, the unmapping is performed and fails, and then we try to
>>    split and enter an infinite loop since private to shared conversions
>>    assumes guest_memfd holds the only refcounts on guest_memfd memory.
>> 
>> 2. The conversion ioctl is a guest_memfd ioctl, not a VM ioctl, and so
>>    there is no check that the VM is not dead. There shouldn't be any
>>    check on the VM, because shareability is a property of the memory and
>>    should be changeable independent of the associated VM.
>
> Hmm, they are both about unlinking guestmemfd from a VM lifecycle then. Is that
> a better way to put it?
>

Unmapping during conversions doesn't take memory away from a VM, it just
forces the memory to be re-faulted as shared, so unlinking memory from a
VM lifecycle isn't quite accurate, if I understand you correctly.

>> 
>> > Ohhh... Is
>> > this about the code to allow gmem fds to be handed to new VMs?
>> 
>> Nope, it's not related to linking. The proposed KVM_LINK_GUEST_MEMFD
>> ioctl [4] also doesn't check if the source VM is dead. There shouldn't
>> be any check on the source VM, since the memory is from guest_memfd and
>> should be independently transferable to a new VM.
>
> If a page is mapped in the old TD, and a new TD is started, re-mapping the same
> page should be prevented somehow, right?
>

Currently I'm thinking that if we go with HWpoison, the new TD will
still get the HWpoison-ed page. The new TD will get the SIGBUS when it
next faults the HWpoison-ed page.

Are you thinking that the HWpoison-ed page should be replaced with a
non-poisoned page for the new TD to run?

Or are you thinking that

* the handing over should be blocked, or
* mapping itself should be blocked, or
* faulting should be blocked?

If handing over should be blocked, could we perhaps scan for HWpoison
when doing the handover and block it there?

I guess I'm trying to do as little as possible during error discovery
(hoping to just mark HWpoison), error handling (just unmap from guest
page tables, like guest_memfd does now), and defer handling to
fault/conversion/perhaps truncation time.

> It really does seem like guestmemfd is the right place to keep the "stuck
> page" state. If guestmemfd is not tied to a VM and can be re-used, it should be
> the one to decide whether they can be mapped again.

Yup, guest_memfd should get to decide.

> Refcounting on error is
> about preventing return to the page allocator but that is not the only problem.
>

guest_memfd, or perhaps the memory_failure() handler for guest_memfd,
should prevent this return.

> I do think that these threads have gone on far too long. It's probably about
> time to move forward with something even if it's just to have something to
> discuss that doesn't require footnoting so many lore links. So how about we move
> forward with option e as a next step. Does that sound good Yan?
>

Please see my reply to Yan, I'm hoping y'all will agree to something
between option f/g instead.

> Ackerley, thank you very much for pulling together this summary.

Thank you for your reviews and suggestions!

^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge pages
  2025-06-25 23:09                                                                 ` Ackerley Tng
@ 2025-06-25 23:19                                                                   ` Edgecombe, Rick P
  2025-06-26 15:16                                                                     ` Shutemov, Kirill
  0 siblings, 1 reply; 294+ messages in thread
From: Edgecombe, Rick P @ 2025-06-25 23:19 UTC (permalink / raw)
  To: ackerleytng@google.com, Zhao, Yan Y
  Cc: quic_eberman@quicinc.com, Li, Xiaoyao, Du, Fan, Hansen, Dave,
	david@redhat.com, thomas.lendacky@amd.com, tabba@google.com,
	vbabka@suse.cz, Shutemov, Kirill, michael.roth@amd.com,
	linux-kernel@vger.kernel.org, seanjc@google.com, Peng, Chao P,
	pbonzini@redhat.com, Yamahata, Isaku, binbin.wu@linux.intel.com,
	Weiny, Ira, kvm@vger.kernel.org, Annapurve, Vishal,
	jroedel@suse.de, Miao, Jun, Li, Zhiquan1, pgonda@google.com,
	x86@kernel.org

On Wed, 2025-06-25 at 16:09 -0700, Ackerley Tng wrote:
> > I do think that these threads have gone on far too long. It's probably about
> > time to move forward with something even if it's just to have something to
> > discuss that doesn't require footnoting so many lore links. So how about we
> > move
> > forward with option e as a next step. Does that sound good Yan?
> > 
> 
> Please see my reply to Yan, I'm hoping y'all will agree to something
> between option f/g instead.

I'm not sure about the HWPoison approach, but I'm not totally against it. My
bias is that all the MM concepts are tightly interlinked. It may fit perfectly,
but every new use needs to be checked for how it fits in with the other MM users of
it. Every time I've decided a page flag was the perfect solution to my problem,
I got informed otherwise. Let me try to flag Kirill to this discussion. He might
have some insights.



^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 09/21] KVM: TDX: Enable 2MB mapping size after TD is RUNNABLE
  2025-06-25 14:48                                                           ` Edgecombe, Rick P
@ 2025-06-26  0:50                                                             ` Yan Zhao
  0 siblings, 0 replies; 294+ messages in thread
From: Yan Zhao @ 2025-06-26  0:50 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: quic_eberman@quicinc.com, Li, Xiaoyao, Huang, Kai, Du, Fan,
	Hansen, Dave, david@redhat.com, thomas.lendacky@amd.com,
	Li, Zhiquan1, Shutemov, Kirill, michael.roth@amd.com,
	seanjc@google.com, Weiny, Ira, pbonzini@redhat.com, Peng, Chao P,
	Yamahata, Isaku, linux-kernel@vger.kernel.org, vbabka@suse.cz,
	ackerleytng@google.com, kvm@vger.kernel.org,
	binbin.wu@linux.intel.com, Annapurve, Vishal, jroedel@suse.de,
	Miao, Jun, tabba@google.com, pgonda@google.com, x86@kernel.org

On Wed, Jun 25, 2025 at 10:48:00PM +0800, Edgecombe, Rick P wrote:
> On Wed, 2025-06-25 at 17:36 +0800, Yan Zhao wrote:
> > On Wed, Jun 25, 2025 at 05:28:22PM +0800, Yan Zhao wrote:
> > > Write down my understanding to check if it's correct:
> > > 
> > > - when a TD is NOT configured to support KVM_LPAGE_GUEST_INHIBIT TDVMCALL, KVM
> > >    always maps at 4KB
> > > 
> > > - When a TD is configured to support KVM_LPAGE_GUEST_INHIBIT TDVMCALL,
> > Sorry, the two conditions are stale ones. No need any more.
> > So it's always
> >  
> >  (a)
> >  1. guest accepts at 4KB
> >  2. TDX sets KVM_LPAGE_GUEST_INHIBIT and try splitting.(with write mmu_lock)
> >  3. KVM maps at 4KB (with read mmu_lock)
> >  4. guest's 4KB accept succeeds.
> 
> Yea.
> 
> >  
> >  (b)
> >  1. guest accepts at 2MB.
> >  2. KVM maps at 4KB due to a certain reason.
> 
> I don't follow this part. You mean because it spans a memslot or other?
Sorry for the confusion. (b) is the same as the current behavior.
I listed (b) just to contrast with (a)...

KVM may map at 4KB due to adjacent shared GFNs, spanning a memslot, or because
the TDX code doesn't support huge pages at all...

> Basically that KVM won't guarantee the page size at exactly the accept size? I
> think this is ok and good. The ABI can be that KVM will guarantee the S-EPT
> mapping size <= the accept size.
Right.

> >  3. guest's accept 2MB fails with TDACCEPT_SIZE_MISMATCH.
> >  4. guest accepts at 4KB
> >  5. guest's 4KB accept succeeds.
> >  
> In this option accept behavior doesn't need to change, but the
> TDACCEPT_SIZE_MISMATCH in step 3 still is a little weird. TDX module could
> accept at 4k mapping size. But this is an issue for the guest to deal with, not
> KVM.
With current TDX module, TDACCEPT_SIZE_MISMATCH is returned in step 3.


^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 09/21] KVM: TDX: Enable 2MB mapping size after TD is RUNNABLE
  2025-06-25 14:47                                                         ` Edgecombe, Rick P
@ 2025-06-26  8:53                                                           ` Yan Zhao
  2025-07-01  0:42                                                             ` Edgecombe, Rick P
  0 siblings, 1 reply; 294+ messages in thread
From: Yan Zhao @ 2025-06-26  8:53 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: Du, Fan, Li, Xiaoyao, Huang, Kai, Shutemov, Kirill, Hansen, Dave,
	david@redhat.com, thomas.lendacky@amd.com, vbabka@suse.cz,
	Li, Zhiquan1, quic_eberman@quicinc.com, michael.roth@amd.com,
	seanjc@google.com, Weiny, Ira, pbonzini@redhat.com, Peng, Chao P,
	Yamahata, Isaku, linux-kernel@vger.kernel.org,
	binbin.wu@linux.intel.com, ackerleytng@google.com,
	kvm@vger.kernel.org, Annapurve, Vishal, tabba@google.com,
	jroedel@suse.de, Miao, Jun, pgonda@google.com, x86@kernel.org

On Wed, Jun 25, 2025 at 10:47:47PM +0800, Edgecombe, Rick P wrote:
> On Wed, 2025-06-25 at 17:28 +0800, Yan Zhao wrote:
> > On Wed, Jun 25, 2025 at 02:35:59AM +0800, Edgecombe, Rick P wrote:
> > > On Tue, 2025-06-24 at 17:57 +0800, Yan Zhao wrote:
> > > > Could we provide the info via the private_max_mapping_level hook (i.e. via
> > > > tdx_gmem_private_max_mapping_level())?
> > > 
> > > This is one of the previous two methods discussed. Can you elaborate on what you
> > > are trying to say?
> > I don't get why we can't use the existing tdx_gmem_private_max_mapping_level()
> > to convey the max_level info at which a vendor hopes a GFN to be mapped.
> > 
> > Before TDX huge pages, tdx_gmem_private_max_mapping_level() always returns 4KB;
> > after TDX huge pages, it returns
> > - 4KB during the TD build stage
> > - at TD runtime: 4KB or 2MB
> > 
> > Why does KVM need to care how the vendor determines this max_level?
> > I think a vendor should have its freedom to decide based on software limitation,
> > guest's wishes, hardware bugs or whatever.
> 
> I don't see that big of a difference between "KVM" and "vendor". TDX code is KVM
> code. Just because it's in tdx.c doesn't mean it's ok for it to be hard to trace
> the logic.
> 
> I'm not sure what Sean's objection was to that approach, or if he objected to
> just the weird SIZE_MISMATCH behavior of TDX module. I think you already know
> why I don't prefer it:
>  - Requiring demote in the fault handler. This requires an additional write lock
> inside the mmu read lock, or TDX module changes. Although now I wonder if the
> interrupt error code related problems will get worse with this solution. The
> solution is currently not settled.
>  - Requiring passing args on the vCPU struct, which as you point out will work
> functionally today only because the prefault stuff will avoid seeing it. But
> it's fragile
>  - The page size behavior is a bit implicit
Hmm, strictly speaking, none of the 3 is the fault of
tdx_gmem_private_max_mapping_level().

With tdx_gmem_private_max_mapping_level() passing in the level, we can track
KVM_LPAGE_GUEST_INHIBIT within tdx.c without changing lpage_info.
tdx.c then has the freedom to switch KVM_LPAGE_GUEST_INHIBIT to some more
flexible scheme in the future while keeping the KVM MMU core intact.

But with lpage_info, it looks like we can save some memory.
The downside is that we may need to update the TDX MMU code in case of future
changes.
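
(As an illustration only of tracking it within tdx.c, a sketch where the
lpage_inhibit xarray and both helpers are hypothetical, not part of the posted
patches:)

/* Hypothetical per-VM state, e.g. in struct kvm_tdx: struct xarray lpage_inhibit; */

static bool tdx_gfn_guest_inhibit(struct kvm_tdx *kvm_tdx, gfn_t gfn)
{
	return xa_load(&kvm_tdx->lpage_inhibit,
		       gfn_round_for_level(gfn, PG_LEVEL_2M)) != NULL;
}

static int tdx_gfn_set_guest_inhibit(struct kvm_tdx *kvm_tdx, gfn_t gfn)
{
	return xa_err(xa_store(&kvm_tdx->lpage_inhibit,
			       gfn_round_for_level(gfn, PG_LEVEL_2M),
			       xa_mk_value(1), GFP_KERNEL));
}

tdx_gmem_private_max_mapping_level() could then return PG_LEVEL_4K whenever the
inhibit is recorded for the GFN's 2MB region, and PG_LEVEL_2M otherwise.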

> I'm coming back to this draft after PUCK. Sean shared his thoughts there. I'll
> try to summarize. He didn't like how the page size requirements were passed
> through the fault handler in a "transient" way. That "transient" property covers
> both of the two options for passing the size info through the fault handler that
> we were debating. He also didn't like how TDX arch requires KVM to map at a
> specific host size around accept. Michael Roth brought up that SNP has the same
> requirement, but it can do the zap and refault approach.
> 
> We then discussed this lpage_info idea. He was in favor of it, but not, I'd say,
> overly enthusiastic. In a "least worst option" kind of way.
> 
> I think the biggest downside is the MMU write lock. Our goal for this series is
> to help performance, not to get huge page sizes. So if we do this idea, we can't
> fully waive our hands that any optimization is pre-mature. It *is* an
> optimization. We need to either convince ourselves that the overall benefit is
> still there, or have a plan to adopt the guest to avoid 4k accepts. Which we
> were previously discussing of requiring anyway.
> 
> But I much prefer the deterministic behavior of this approach from a
> maintainability perspective.
> 
> > 
> > > > Or what about introducing a vendor hook in __kvm_mmu_max_mapping_level() for a
> > > > private fault?
> > > > 
> > > > > Maybe we could have EPT violations that contain 4k accept sizes first update the
> > > > > attribute for the GFN to be accepted or not, like have tdx.c call out to set
> > > > > kvm_lpage_info->disallow_lpage in the rarer case of 4k accept size? Or something
> > > > Something like kvm_lpage_info->disallow_lpage would disallow later page
> > > > promotion, though we don't support it right now.
> > > 
> > > Well I was originally thinking it would not set kvm_lpage_info->disallow_lpage
> > > directly, but rely on the logic that checks for mixed attributes. But more
> > > below...
> > > 
> > > > 
> > > > > like that. Maybe set a "accepted" attribute, or something. Not sure if could be
> > > > Setting "accepted" attribute in the EPT violation handler?
> > > > It's a little odd, as the accept operation is not yet completed.
> > > 
> > > I guess the question in both of these comments is: what is the life cycle. Guest
> > > could call TDG.MEM.PAGE.RELEASE to unaccept it as well. Oh, geez. It looks like
> > > TDG.MEM.PAGE.RELEASE will give the same size hints in the EPT violation. So an
> > > accept attribute is not going to work, at least without TDX module changes.
> > > 
> > > 
> > > Actually, the problem we have doesn't fit the mixed attributes behavior. If many
> > > vCPUs accept a 2MB region at 4k page size, the entire 2MB range could be non-
> > > mixed and then individual accepts would fail.
> > > 
> > > 
> > > So instead there could be a KVM_LPAGE_GUEST_INHIBIT that doesn't get cleared
> > Set KVM_LPAGE_GUEST_INHIBIT via a TDVMCALL ?
> > 
> > Or just set the KVM_LPAGE_GUEST_INHIBIT when an EPT violation contains 4KB
> > level info?
> 
> Yes, that's the idea. 2MB accepts can behave like normal.
> 
> > 
> > I guess it's the latter one as it can avoid modification to both EDK2 and Linux
> > guest.  I observed ~2710 instances of "guest accepts at 4KB when KVM can map at
> > 2MB" during the boot-up of a TD with 4GB memory.
> 
> Oh, wow that is more than I expected. Did you notice how many vCPUs they were
> spread across? What memory size did you use? What was your guest memory
> configuration?
The guest memory is 4GB, 8 vCPUs.
The memory slots layout is:
slot 1: base gfn=0, npages=0x80000
slot 2: base gfn=0x100000, npages=0x80000
slot 3: base gfn=0xffc00, npages=0x400

The GFN spread for the ~2710 instances is:
GFNs 0x806-0x9ff (1 time for each of 506 pages)
GFNs 0x7e800-0x7e9ff (1 time for each of 512 pages)
GFN: 0x7d3ff~0x7e7fe (repeated private-to-shared, and shared-to-private are
                      conducted on this range), with the top 3 among them being:
     0x7d9da (476 times)
     0x7d9d9 (156 times)
     0x7d9d7 (974 times)

All those instances are from vCPU 0, when the guest is in EDK2 and during early
kernel boot.

Based on my observation, the count of these instances does not scale with guest
memory. In other words, the count remains roughly the same even when the guest
memory is increased to 8GB.

> > But does it mean TDX needs to hold write mmu_lock in the EPT violation handler
> > and set KVM_LPAGE_GUEST_INHIBIT on finding a violation carries 4KB level info?
> 
> I think so. I didn't check the reason, but the other similar code took it. Maybe
> not? If we don't need to take mmu write lock, then this idea seems like a clear
> winner to me.
Hmm, setting KVM_LPAGE_GUEST_INHIBIT needs to be followed by an attempt to split.
So, if we don't want to support splitting under read mmu_lock, we need to take
write mmu_lock.

I drafted a change as below (will refine some parts of it later).
The average count of write mmu_lock acquisitions is 11 during VM boot.

There's no significant difference in the count of 2M mappings
during guest kernel boot up to login; on average:
before this patch: 1144 2M mappings
after this patch:  1143 2M mappings.

diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index f999c15d8d3e..d4e98728f600 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -322,4 +322,8 @@ static inline bool kvm_is_gfn_alias(struct kvm *kvm, gfn_t gfn)
 {
        return gfn & kvm_gfn_direct_bits(kvm);
 }
+
+void hugepage_set_guest_inhibit(struct kvm_memory_slot *slot, gfn_t gfn, int level);
+bool hugepage_test_guest_inhibit(struct kvm_memory_slot *slot, gfn_t gfn, int level);
+
 #endif
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index f0afee2e283a..28c511d8b372 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -721,6 +721,8 @@ static struct kvm_lpage_info *lpage_info_slot(gfn_t gfn,
  */
 #define KVM_LPAGE_MIXED_FLAG   BIT(31)

+#define KVM_LPAGE_GUEST_INHIBIT_FLAG   BIT(30)
+
 static void update_gfn_disallow_lpage_count(const struct kvm_memory_slot *slot,
                                            gfn_t gfn, int count)
 {
@@ -732,7 +734,8 @@ static void update_gfn_disallow_lpage_count(const struct kvm_memory_slot *slot,

                old = linfo->disallow_lpage;
                linfo->disallow_lpage += count;
-               WARN_ON_ONCE((old ^ linfo->disallow_lpage) & KVM_LPAGE_MIXED_FLAG);
+               WARN_ON_ONCE((old ^ linfo->disallow_lpage) &
+                             (KVM_LPAGE_MIXED_FLAG | KVM_LPAGE_GUEST_INHIBIT_FLAG));
        }
 }

@@ -1653,13 +1656,15 @@ int kvm_split_boundary_leafs(struct kvm *kvm, struct kvm_gfn_range *range)
        bool ret = 0;

        lockdep_assert_once(kvm->mmu_invalidate_in_progress ||
-                           lockdep_is_held(&kvm->slots_lock));
+                           lockdep_is_held(&kvm->slots_lock) ||
+                           srcu_read_lock_held(&kvm->srcu));

        if (tdp_mmu_enabled)
                ret = kvm_tdp_mmu_gfn_range_split_boundary(kvm, range);

        return ret;
 }
+EXPORT_SYMBOL_GPL(kvm_split_boundary_leafs);

 bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
 {
@@ -7734,6 +7739,18 @@ void kvm_mmu_pre_destroy_vm(struct kvm *kvm)
                vhost_task_stop(kvm->arch.nx_huge_page_recovery_thread);
 }

+bool hugepage_test_guest_inhibit(struct kvm_memory_slot *slot, gfn_t gfn, int level)
+{
+       return lpage_info_slot(gfn, slot, level)->disallow_lpage & KVM_LPAGE_GUEST_INHIBIT_FLAG;
+}
+EXPORT_SYMBOL_GPL(hugepage_test_guest_inhibit);
+
+void hugepage_set_guest_inhibit(struct kvm_memory_slot *slot, gfn_t gfn, int level)
+{
+       lpage_info_slot(gfn, slot, level)->disallow_lpage |= KVM_LPAGE_GUEST_INHIBIT_FLAG;
+}
+EXPORT_SYMBOL_GPL(hugepage_set_guest_inhibit);
+
 #ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES
 static bool hugepage_test_mixed(struct kvm_memory_slot *slot, gfn_t gfn,
                                int level)
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 244fd22683db..4028423cf595 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1852,28 +1852,8 @@ int tdx_sept_split_private_spt(struct kvm *kvm, gfn_t gfn, enum pg_level level,
        if (KVM_BUG_ON(to_kvm_tdx(kvm)->state != TD_STATE_RUNNABLE || level != PG_LEVEL_2M, kvm))
                return -EINVAL;

-       /*
-        * Split request with mmu_lock held for reading can only occur when one
-        * vCPU accepts at 2MB level while another vCPU accepts at 4KB level.
-        * Ignore this 4KB mapping request by setting violation_request_level to
-        * 2MB and returning -EBUSY for retry. Then the next fault at 2MB level
-        * would be a spurious fault. The vCPU accepting at 2MB will accept the
-        * whole 2MB range.
-        */
-       if (mmu_lock_shared) {
-               struct kvm_vcpu *vcpu = kvm_get_running_vcpu();
-               struct vcpu_tdx *tdx = to_tdx(vcpu);
-
-               if (KVM_BUG_ON(!vcpu, kvm))
-                       return -EOPNOTSUPP;
-
-               /* Request to map as 2MB leaf for the whole 2MB range */
-               tdx->violation_gfn_start = gfn_round_for_level(gfn, level);
-               tdx->violation_gfn_end = tdx->violation_gfn_start + KVM_PAGES_PER_HPAGE(level);
-               tdx->violation_request_level = level;
-
-               return -EBUSY;
-       }
+       if (mmu_lock_shared)
+               return -EOPNOTSUPP;

        ret = tdx_sept_zap_private_spte(kvm, gfn, level, page);
        if (ret <= 0)
@@ -1937,28 +1917,51 @@ static inline bool tdx_is_sept_violation_unexpected_pending(struct kvm_vcpu *vcp
        return !(eq & EPT_VIOLATION_PROT_MASK) && !(eq & EPT_VIOLATION_EXEC_FOR_RING3_LIN);
 }

-static inline void tdx_get_accept_level(struct kvm_vcpu *vcpu, gpa_t gpa)
+static inline int tdx_check_accept_level(struct kvm_vcpu *vcpu, gpa_t gpa)
 {
        struct vcpu_tdx *tdx = to_tdx(vcpu);
+       struct kvm *kvm = vcpu->kvm;
+       gfn_t gfn = gpa_to_gfn(gpa);
+       struct kvm_memory_slot *slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn);
        int level = -1;
+       u64 eeq_info;

-       u64 eeq_type = tdx->ext_exit_qualification & TDX_EXT_EXIT_QUAL_TYPE_MASK;
+       if (!slot)
+               return 0;

-       u32 eeq_info = (tdx->ext_exit_qualification & TDX_EXT_EXIT_QUAL_INFO_MASK) >>
-                       TDX_EXT_EXIT_QUAL_INFO_SHIFT;
+       if ((tdx->ext_exit_qualification & TDX_EXT_EXIT_QUAL_TYPE_MASK) !=
+           TDX_EXT_EXIT_QUAL_TYPE_ACCEPT)
+               return 0;

-       if (eeq_type == TDX_EXT_EXIT_QUAL_TYPE_ACCEPT) {
-               level = (eeq_info & GENMASK(2, 0)) + 1;
+       eeq_info = (tdx->ext_exit_qualification & TDX_EXT_EXIT_QUAL_INFO_MASK) >>
+                   TDX_EXT_EXIT_QUAL_INFO_SHIFT;

-               tdx->violation_gfn_start = gfn_round_for_level(gpa_to_gfn(gpa), level);
-               tdx->violation_gfn_end = tdx->violation_gfn_start + KVM_PAGES_PER_HPAGE(level);
-               tdx->violation_request_level = level;
-       } else {
-               tdx->violation_gfn_start = -1;
-               tdx->violation_gfn_end = -1;
-               tdx->violation_request_level = -1;
+       level = (eeq_info & GENMASK(2, 0)) + 1;
+
+       if (level == PG_LEVEL_4K) {
+              if (!hugepage_test_guest_inhibit(slot, gfn, PG_LEVEL_2M)) {
+                       struct kvm_gfn_range gfn_range = {
+                               .start = gfn,
+                               .end = gfn + 1,
+                               .slot = slot,
+                               .may_block = true,
+                               .attr_filter = KVM_FILTER_PRIVATE,
+                       };
+
+                       scoped_guard(write_lock, &kvm->mmu_lock) {
+                               int ret;
+
+                               ret = kvm_split_boundary_leafs(kvm, &gfn_range);
+
+                               if (ret)
+                                       return ret;
+
+                               hugepage_set_guest_inhibit(slot, gfn, PG_LEVEL_2M);
+                       }
+              }
        }
+
+       return 0;
 }

 static int tdx_handle_ept_violation(struct kvm_vcpu *vcpu)
@@ -1987,7 +1990,8 @@ static int tdx_handle_ept_violation(struct kvm_vcpu *vcpu)
                 */
                exit_qual = EPT_VIOLATION_ACC_WRITE;

-               tdx_get_accept_level(vcpu, gpa);
+               if (tdx_check_accept_level(vcpu, gpa))
+                       return RET_PF_RETRY;

                /* Only private GPA triggers zero-step mitigation */
                local_retry = true;
@@ -3022,9 +3026,6 @@ static int tdx_td_vcpu_init(struct kvm_vcpu *vcpu, u64 vcpu_rcx)

        vcpu->arch.mp_state = KVM_MP_STATE_RUNNABLE;

-       tdx->violation_gfn_start = -1;
-       tdx->violation_gfn_end = -1;
-       tdx->violation_request_level = -1;
        return 0;

 free_tdcx:
@@ -3373,14 +3374,9 @@ int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp)
 int tdx_gmem_private_max_mapping_level(struct kvm_vcpu *vcpu, kvm_pfn_t pfn,
                                       gfn_t gfn, bool prefetch)
 {
-       struct vcpu_tdx *tdx = to_tdx(vcpu);
-
-       if (unlikely((to_kvm_tdx(vcpu->kvm)->state != TD_STATE_RUNNABLE) || prefetch))
+       if (unlikely((to_kvm_tdx(vcpu->kvm)->state != TD_STATE_RUNNABLE)))
                return PG_LEVEL_4K;

-       if (gfn >= tdx->violation_gfn_start && gfn < tdx->violation_gfn_end)
-               return tdx->violation_request_level;
-
        return PG_LEVEL_2M;
 }

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index acd18a01f63d..3a3077666ee6 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2610,6 +2610,7 @@ struct kvm_memory_slot *kvm_vcpu_gfn_to_memslot(struct kvm_vcpu *vcpu, gfn_t gfn

        return NULL;
 }
+EXPORT_SYMBOL_GPL(kvm_vcpu_gfn_to_memslot);

 bool kvm_is_visible_gfn(struct kvm *kvm, gfn_t gfn)
 {

^ permalink raw reply related	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge pages
  2025-06-25 23:19                                                                   ` Edgecombe, Rick P
@ 2025-06-26 15:16                                                                     ` Shutemov, Kirill
  2025-06-26 22:19                                                                       ` Edgecombe, Rick P
  0 siblings, 1 reply; 294+ messages in thread
From: Shutemov, Kirill @ 2025-06-26 15:16 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: ackerleytng@google.com, Zhao, Yan Y, quic_eberman@quicinc.com,
	Li, Xiaoyao, Du, Fan, Hansen, Dave, david@redhat.com,
	thomas.lendacky@amd.com, tabba@google.com, vbabka@suse.cz,
	michael.roth@amd.com, linux-kernel@vger.kernel.org,
	seanjc@google.com, Peng, Chao P, pbonzini@redhat.com,
	Yamahata, Isaku, binbin.wu@linux.intel.com, Weiny, Ira,
	kvm@vger.kernel.org, Annapurve, Vishal, jroedel@suse.de,
	Miao, Jun, Li, Zhiquan1, pgonda@google.com, x86@kernel.org

On Thu, Jun 26, 2025 at 02:19:36AM +0300, Edgecombe, Rick P wrote:
> On Wed, 2025-06-25 at 16:09 -0700, Ackerley Tng wrote:
> > > I do think that these threads have gone on far too long. It's probably about
> > > time to move forward with something even if it's just to have something to
> > > discuss that doesn't require footnoting so many lore links. So how about we
> > > move
> > > forward with option e as a next step. Does that sound good Yan?
> > > 
> > 
> > Please see my reply to Yan, I'm hoping y'all will agree to something
> > between option f/g instead.
> 
> I'm not sure about the HWPoison approach, but I'm not totally against it. My
> bias is that all the MM concepts are tightly interlinked. It may fit perfectly,
> but every new use needs to be checked for how it fits in with the other MM users of
> it. Every time I've decided a page flag was the perfect solution to my problem,
> I got informed otherwise. Let me try to flag Kirill to this discussion. He might
> have some insights.

We chatted with Rick about this.

If I understand correctly, we are discussing the situation where the TDX
module failed to return a page to the kernel.

I think it is reasonable to use HWPoison for this case. We cannot
guarantee that we will read back whatever we write to the page. TDX module
has creative ways to corrupt it. 

The memory is no longer functioning as memory. It matches the definition
of HWPoison quite closely.

-- 
  Kiryl Shutsemau / Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge pages
  2025-06-26 15:16                                                                     ` Shutemov, Kirill
@ 2025-06-26 22:19                                                                       ` Edgecombe, Rick P
  2025-06-27 17:59                                                                         ` Ackerley Tng
  0 siblings, 1 reply; 294+ messages in thread
From: Edgecombe, Rick P @ 2025-06-26 22:19 UTC (permalink / raw)
  To: Shutemov, Kirill
  Cc: quic_eberman@quicinc.com, Li, Xiaoyao, Du, Fan, Hansen, Dave,
	david@redhat.com, thomas.lendacky@amd.com, Zhao, Yan Y,
	tabba@google.com, kvm@vger.kernel.org,
	linux-kernel@vger.kernel.org, seanjc@google.com,
	binbin.wu@linux.intel.com, Peng, Chao P, pbonzini@redhat.com,
	ackerleytng@google.com, michael.roth@amd.com, vbabka@suse.cz,
	Yamahata, Isaku, Li, Zhiquan1, Annapurve, Vishal, Weiny, Ira,
	jroedel@suse.de, Miao, Jun, pgonda@google.com, x86@kernel.org

On Thu, 2025-06-26 at 18:16 +0300, Shutemov, Kirill wrote:
> > > Please see my reply to Yan, I'm hoping y'all will agree to something
> > > between option f/g instead.
> > 
> > I'm not sure about the HWPoison approach, but I'm not totally against it. My
> > bias is that all the MM concepts are tightly interlinked. It may fit
> > perfectly,
> > but every new use needs to be checked for how it fits in with the other MM
> > users of
> > it. Every time I've decided a page flag was the perfect solution to my
> > problem,
> > I got informed otherwise. Let me try to flag Kirill to this discussion. He
> > might
> > have some insights.
> 
> We chatted with Rick about this.
> 
> If I understand correctly, we are discussing the situation where the TDX
> module failed to return a page to the kernel.
> 
> I think it is reasonable to use HWPoison for this case. We cannot
> guarantee that we will read back whatever we write to the page. TDX module
> has creative ways to corrupt it. 
> 
> The memory is no longer functioning as memory. It matches the definition
> of HWPoison quite closely.

ok! Lets go f/g. Unless Yan objects.

^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge pages
  2025-06-26 22:19                                                                       ` Edgecombe, Rick P
@ 2025-06-27 17:59                                                                         ` Ackerley Tng
  2025-06-30 11:13                                                                           ` Yan Zhao
  0 siblings, 1 reply; 294+ messages in thread
From: Ackerley Tng @ 2025-06-27 17:59 UTC (permalink / raw)
  To: Edgecombe, Rick P, Shutemov, Kirill
  Cc: quic_eberman@quicinc.com, Li, Xiaoyao, Du, Fan, Hansen, Dave,
	david@redhat.com, thomas.lendacky@amd.com, Zhao, Yan Y,
	tabba@google.com, kvm@vger.kernel.org,
	linux-kernel@vger.kernel.org, seanjc@google.com,
	binbin.wu@linux.intel.com, Peng, Chao P, pbonzini@redhat.com,
	michael.roth@amd.com, vbabka@suse.cz, Yamahata, Isaku,
	Li, Zhiquan1, Annapurve, Vishal, Weiny, Ira, jroedel@suse.de,
	Miao, Jun, pgonda@google.com, x86@kernel.org

"Edgecombe, Rick P" <rick.p.edgecombe@intel.com> writes:

> On Thu, 2025-06-26 at 18:16 +0300, Shutemov, Kirill wrote:
>> > > Please see my reply to Yan, I'm hoping y'all will agree to something
>> > > between option f/g instead.
>> > 
>> > I'm not sure about the HWPoison approach, but I'm not totally against it. My
>> > bias is that all the MM concepts are tightly interlinked. It may fit
>> > perfectly,
>> > but every new use needs to be checked for how it fits in with the other MM
>> > users of
>> > it. Every time I've decided a page flag was the perfect solution to my
>> > problem,
>> > I got informed otherwise. Let me try to flag Kirill to this discussion. He
>> > might
>> > have some insights.
>> 
>> We chatted with Rick about this.
>> 
>> If I understand correctly, we are discussing the situation where the TDX
>> module failed to return a page to the kernel.
>> 
>> I think it is reasonable to use HWPoison for this case. We cannot
>> guarantee that we will read back whatever we write to the page. TDX module
>> has creative ways to corrupt it. 
>> 
>> The memory is no longer functioning as memory. It matches the definition
>> of HWPoison quite closely.
>
> ok! Lets go f/g. Unless Yan objects.

Follow up as I think about this more: Perhaps we don't need to check for
HWpoison (or TDX unmap errors) on conversion.

On a high level, we don't need to check for HWpoison because conversion
is about changing memory metadata, as in memory privacy status and
struct folio sizes, and not touching memory contents at all. HWpoison
means the memory and its contents shouldn't be used.

Specifically for private-to-shared conversions where the TDX unmap error
can happen, we will

1. HWpoison the page
2. Bug the TD

This falsely successful conversion means the host (guest_memfd) will
think the memory is shared while it may still be mapped in Secure-EPTs.

I think that is okay because the only existing user (TD) stops using
that memory, and no future users can use the memory:

1. The TD will be bugged by then. A non-running TD cannot touch memory
   that had the error on unmapping.

2. The page was not mapped into host page tables (since it was
   private). Even if it were mapped, it will be unmapped from host page
   tables (host page table unmaps don't fail). If the host tries to
   touch the memory, on the next fault, core-mm would notice that the
   page is poisoned and not fault it in.

By the way, when we "bug the TD", can we assume that ALL vCPUs, not just
the one that did the failed unmap, will stop running?

I guess even if the other vCPUs don't stop running, the TD's vCPUs will
access the page as shared thinking the conversion succeeded and keep
hitting #VEs. If the TD accesses the page as private, it's fine: the
page was not unmapped from the Secure-EPTs due to the unmap failure, and
the host cannot write to it (the host will see HWpoison on the next
fault), so there's no host crash and the purpose of guest_memfd isn't defeated.

If the guest_memfd with a HWpoisoned page is linked to a new, runnable
TD, the new TD would need to fault in the page as private. When the page
is faulted into the new TD, the fault will hit the HWpoison and
userspace will get to know about it.
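
(For concreteness, a minimal sketch of steps 1 and 2 above; folio_set_hwpoison()
and kvm_vm_bugged() are existing helpers, while the wrapper name and call site
are assumptions:)

static void tdx_poison_and_bug(struct kvm *kvm, struct page *page)
{
	/*
	 * Keep the page away from all future users; guest_memfd and core-mm
	 * fault paths already check HWpoison. A hugetlb folio would need the
	 * hugetlb-specific marking instead.
	 */
	folio_set_hwpoison(page_folio(page));

	/* Kick every vCPU out of non-root mode; the TD stops running. */
	kvm_vm_bugged(kvm);
}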

Yan, Rick, let me know what you think of this!

^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge pages
  2025-06-27 17:59                                                                         ` Ackerley Tng
@ 2025-06-30 11:13                                                                           ` Yan Zhao
  2025-06-30 17:55                                                                             ` Edgecombe, Rick P
  0 siblings, 1 reply; 294+ messages in thread
From: Yan Zhao @ 2025-06-30 11:13 UTC (permalink / raw)
  To: Ackerley Tng
  Cc: Edgecombe, Rick P, Shutemov, Kirill, quic_eberman@quicinc.com,
	Li, Xiaoyao, Du, Fan, Hansen, Dave, david@redhat.com,
	thomas.lendacky@amd.com, tabba@google.com, kvm@vger.kernel.org,
	linux-kernel@vger.kernel.org, seanjc@google.com,
	binbin.wu@linux.intel.com, Peng, Chao P, pbonzini@redhat.com,
	michael.roth@amd.com, vbabka@suse.cz, Yamahata, Isaku,
	Li, Zhiquan1, Annapurve, Vishal, Weiny, Ira, jroedel@suse.de,
	Miao, Jun, pgonda@google.com, x86@kernel.org

On Fri, Jun 27, 2025 at 10:59:47AM -0700, Ackerley Tng wrote:
> "Edgecombe, Rick P" <rick.p.edgecombe@intel.com> writes:
> 
> > On Thu, 2025-06-26 at 18:16 +0300, Shutemov, Kirill wrote:
> >> > > Please see my reply to Yan, I'm hoping y'all will agree to something
> >> > > between option f/g instead.
> >> > 
> >> > I'm not sure about the HWPoison approach, but I'm not totally against it. My
> >> > bias is that all the MM concepts are tightly interlinked. It may fit
> >> > perfectly,
> >> > but every new use needs to be checked for how it fits in with the other MM
> >> > users of
> >> > it. Every time I've decided a page flag was the perfect solution to my
> >> > problem,
> >> > I got informed otherwise. Let me try to flag Kirill to this discussion. He
> >> > might
> >> > have some insights.
> >> 
> >> We chatted with Rick about this.
> >> 
> >> If I understand correctly, we are discussing the situation where the TDX
> >> module failed to return a page to the kernel.
> >> 
> >> I think it is reasonable to use HWPoison for this case. We cannot
> >> guarantee that we will read back whatever we write to the page. TDX module
> >> has creative ways to corrupt it. 
> >> 
> >> The memory is no longer functioning as memory. It matches the definition
> >> of HWPoison quite closely.
> >
> > ok! Lets go f/g. Unless Yan objects.
I'm ok with f/g. But I have two implementation specific questions:

1. How to set the HWPoison bit in TDX?
2. Should we set this bit for non-guest-memfd pages (e.g. for S-EPT pages) ?

TDX can't invoke memory_failure() on error of removing guest private pages or
S-EPT pages, because holding write mmu_lock is regarded as in atomic context.
As there's a mutex in memory_failure(),
"BUG: sleeping function called from invalid context at kernel/locking/mutex.c"
will be printed.

If TDX invokes memory_failure_queue() instead, looks guest_memfd can invoke
memory_failure_queue_kick() to ensure HWPoison bit is set timely.
But which component could invoke memory_failure_queue_kick() for S-EPT pages?
KVM?
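
(To make the question concrete, a rough sketch of the deferral, assuming
memory_failure_queue()/memory_failure_queue_kick() can be used this way; the
wrapper names are made up:)

/* Under write mmu_lock (atomic context): only queue, never sleep. */
static void tdx_queue_hwpoison(struct page *page, int *queued_cpu)
{
	*queued_cpu = raw_smp_processor_id();	/* the failure queue is per-CPU */
	memory_failure_queue(page_to_pfn(page), 0);
}

/* Later, from a sleepable context (guest_memfd? KVM?): drain that CPU's queue. */
static void tdx_kick_hwpoison(int queued_cpu)
{
	memory_failure_queue_kick(queued_cpu);
}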


> Follow up as I think about this more: Perhaps we don't need to check for
> HWpoison (or TDX unmap errors) on conversion.
Hmm, yes. The HWPoison bit is checked in __kvm_gmem_get_pfn() and __do_fault().
It looks like we don't need to check it on conversion for the purpose of
disallowing shared memory access.

My previous mail was based on another bit and I was not aware of the check of
HWPoison in __do_fault().

The conversion will be successful without checking HWPoison during conversion,
though with the error log "MCE: Killing ... due to hardware memory corruption
fault at ..."

> On a high level, we don't need to check for HWpoison because conversion
> is about changing memory metadata, as in memory privacy status and
> struct folio sizes, and not touching memory contents at all. HWpoison
> means the memory and its contents shouldn't be used.
> 
> Specifically for private-to-shared conversions where the TDX unmap error
> can happen, we will
> 
> 1. HWpoison the page
> 2. Bug the TD
> 
> This falsely successful conversion means the host (guest_memfd) will
> think the memory is shared while it may still be mapped in Secure-EPTs.
> 
> I think that is okay because the only existing user (TD) stops using
> that memory, and no future users can use the memory:
> 
> 1. The TD will be bugged by then. A non-running TD cannot touch memory
>    that had the error on unmapping.
> 
> 2. The page was not mapped into host page tables (since it was
>    private). Even if it were mapped, it will be unmapped from host page
>    tables (host page table unmaps don't fail). If the host tries to
>    touch the memory, on the next fault, core-mm would notice that the
>    page is poisoned and not fault it in.
> 
> By the way, when we "bug the TD", can we assume that ALL vCPUs, not just
> the one that is did the failed unmap will stop running?
Right. All the vCPUs will be kicked out of non-root mode after "bug the VM".

> I guess even if the other vCPUs don't stop running, the TDs vCPUs will
> access the page as shared thinking the conversion succeeded and keep
> hitting #VEs. If the TD accesses the page as private, it's fine since
> the page was not unmapped from Secure-EPTs due to the unmap failure and
> the host cannot write to it (host will see HWpoison on next fault) and
> so there's no host crash and doesn't defeat the purpose of guest_memfd.
> 
> If the guest_memfd with a HWpoisoned page is linked to a new, runnable
> TD, the new TD would need to fault in the page as private. When it tries
> to fault in the page to the new TD, it will hit the HWpoison and
> userspace will get to know about the HWpoison.
I'm OK with just checking HWPoison on the next fault or on dequeue from hugetlb.

> Yan, Rick, let me know what you think of this!

^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge pages
  2025-06-30 11:13                                                                           ` Yan Zhao
@ 2025-06-30 17:55                                                                             ` Edgecombe, Rick P
  2025-06-30 19:25                                                                               ` Ackerley Tng
                                                                                                 ` (2 more replies)
  0 siblings, 3 replies; 294+ messages in thread
From: Edgecombe, Rick P @ 2025-06-30 17:55 UTC (permalink / raw)
  To: ackerleytng@google.com, Zhao, Yan Y
  Cc: Shutemov, Kirill, Li, Xiaoyao, kvm@vger.kernel.org, Hansen, Dave,
	david@redhat.com, thomas.lendacky@amd.com, tabba@google.com,
	vbabka@suse.cz, quic_eberman@quicinc.com, michael.roth@amd.com,
	linux-kernel@vger.kernel.org, seanjc@google.com, Peng, Chao P,
	Du, Fan, Yamahata, Isaku, pbonzini@redhat.com,
	binbin.wu@linux.intel.com, Weiny, Ira, Li, Zhiquan1,
	Annapurve, Vishal, jroedel@suse.de, Miao, Jun, pgonda@google.com,
	x86@kernel.org

On Mon, 2025-06-30 at 19:13 +0800, Yan Zhao wrote:
> > > ok! Lets go f/g. Unless Yan objects.
> I'm ok with f/g. But I have two implementation specific questions:
> 
> 1. How to set the HWPoison bit in TDX?
> 2. Should we set this bit for non-guest-memfd pages (e.g. for S-EPT pages) ?

Argh, I guess we can keep the existing ref count based approach for the other
types of TDX owned pages?

> 
> TDX can't invoke memory_failure() on error of removing guest private pages or
> S-EPT pages, because holding write mmu_lock is regarded as in atomic context.
> As there's a mutex in memory_failure(),
> "BUG: sleeping function called from invalid context at kernel/locking/mutex.c"
> will be printed.
> 
> If TDX invokes memory_failure_queue() instead, looks guest_memfd can invoke
> memory_failure_queue_kick() to ensure HWPoison bit is set timely.
> But which component could invoke memory_failure_queue_kick() for S-EPT pages?
> KVM?

Hmm, it only has queue of 10 pages per-cpu. If something goes wrong in the TDX
module, I could see exceeding this during a zap operation. At which point, how
much have we really handled it?


But, at the risk of derailing the solution when we are close, some reflection
has made me question whether this is all misprioritized. We are trying to handle
a case where a TDX module bug may return an error when we try to release gmem
pages. For that, this solution is feeling way too complex.

If there is a TDX module bug, a simpler way to handle it would be to fix the
bug. In the meantime the kernel can take simpler, more drastic efforts to
reclaim the memory and ensure system stability.

In the host kexec patches we need to handle a kexec while the TDX module is
running. The solution is to simply wbinvd on each pCPU that might have entered
the TDX module. After that, barring any new SEAMCALLs that could dirty
memory, the pages are free to use by the next kernel. (at least on systems
without the partial write errata)

So for this we can do something similar. Have the arch/x86 side of TDX grow a
new tdx_buggy_shutdown(). Have it do an all-cpu IPI to kick CPUs out of
SEAM mode, wbinvd, and set a "no more seamcalls" bool. Then any SEAMCALLs after
that will return a TDX_BUGGY_SHUTDOWN error, or similar. All TDs in the system
die. Zap/cleanup paths return success in the buggy shutdown case.
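
(Roughly something like the below, as a sketch only; apart from
tdx_buggy_shutdown() itself, all names are placeholders:)

static bool tdx_no_more_seamcalls;

static void tdx_shutdown_cpu(void *unused)
{
	/* The IPI itself kicks the CPU out of SEAM mode; then flush caches. */
	wbinvd();
}

void tdx_buggy_shutdown(void)
{
	WRITE_ONCE(tdx_no_more_seamcalls, true);
	on_each_cpu(tdx_shutdown_cpu, NULL, 1);
}

The common SEAMCALL wrapper would then check tdx_no_more_seamcalls first and
return a TDX_BUGGY_SHUTDOWN-style error without entering the TDX module, so
that zap/cleanup paths can treat it as success.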

Does it fit? Or, can you guys argue that the failures here are actually non-
special cases that are worth more complex recovery? I remember we talked about
IOMMU patterns that are similar, but it seems like the remaining cases under
discussion are about TDX bugs.


^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge pages
  2025-06-30 17:55                                                                             ` Edgecombe, Rick P
@ 2025-06-30 19:25                                                                               ` Ackerley Tng
  2025-06-30 21:45                                                                                 ` Edgecombe, Rick P
  2025-07-01  5:07                                                                                 ` Yan Zhao
  2025-06-30 21:47                                                                               ` Vishal Annapurve
  2025-07-01  9:35                                                                               ` Yan Zhao
  2 siblings, 2 replies; 294+ messages in thread
From: Ackerley Tng @ 2025-06-30 19:25 UTC (permalink / raw)
  To: Edgecombe, Rick P, Zhao, Yan Y
  Cc: Shutemov, Kirill, Li, Xiaoyao, kvm@vger.kernel.org, Hansen, Dave,
	david@redhat.com, thomas.lendacky@amd.com, tabba@google.com,
	vbabka@suse.cz, quic_eberman@quicinc.com, michael.roth@amd.com,
	linux-kernel@vger.kernel.org, seanjc@google.com, Peng, Chao P,
	Du, Fan, Yamahata, Isaku, pbonzini@redhat.com,
	binbin.wu@linux.intel.com, Weiny, Ira, Li, Zhiquan1,
	Annapurve, Vishal, jroedel@suse.de, Miao, Jun, pgonda@google.com,
	x86@kernel.org

"Edgecombe, Rick P" <rick.p.edgecombe@intel.com> writes:

> On Mon, 2025-06-30 at 19:13 +0800, Yan Zhao wrote:
>> > > ok! Lets go f/g. Unless Yan objects.
>> I'm ok with f/g. But I have two implementation specific questions:
>> 
>> 1. How to set the HWPoison bit in TDX?

I was thinking of setting the HWpoison flag based on page type: if a regular
4K page, set the flag; if a THP page (not (yet) supported by guest_memfd),
set the has_hwpoison flag; and if a HugeTLB page, call
folio_set_hugetlb_hwpoison().

But if we go with Rick's suggestion below, then we don't have to figure
this out.

>> 2. Should we set this bit for non-guest-memfd pages (e.g. for S-EPT pages) ?
>
> Argh, I guess we can keep the existing ref count based approach for the other
> types of TDX owned pages?
>

Wait TDX can only use guest_memfd pages, right? Even if TDX can use
non-guest_memfd pages, why not also set HWpoison for non-guest_memfd
pages?

Either way I guess if we go with Rick's suggestion below, then we don't
have to figure the above out.

>> 
>> TDX can't invoke memory_failure() on error of removing guest private pages or
>> S-EPT pages, because holding write mmu_lock is regarded as in atomic context.
>> As there's a mutex in memory_failure(),
>> "BUG: sleeping function called from invalid context at kernel/locking/mutex.c"
>> will be printed.
>> 
>> If TDX invokes memory_failure_queue() instead, looks guest_memfd can invoke
>> memory_failure_queue_kick() to ensure HWPoison bit is set timely.
>> But which component could invoke memory_failure_queue_kick() for S-EPT pages?
>> KVM?
>
> Hmm, it only has queue of 10 pages per-cpu. If something goes wrong in the TDX
> module, I could see exceeding this during a zap operation. At which point, how
> much have we really handled it?
>
>
> But, at the risk of derailing the solution when we are close, some reflection
> has made me question whether this is all misprioritized. We are trying to handle
> a case where a TDX module bug may return an error when we try to release gmem
> pages. For that, this solution is feeling way too complex.
>
> If there is a TDX module bug, a simpler way to handle it would be to fix the
> bug. In the meantime the kernel can take simpler, more drastic efforts to
> reclaim the memory and ensure system stability.
>
> In the host kexec patches we need to handle a kexec while the TDX module is
> running. The solution is to simply wbinvd on each pCPU that might have entered
> the TDX module. After that, barring any new SEAMCALLs that could dirty
> memory, the pages are free to use by the next kernel. (at least on systems
> without the partial write errata)
>
> So for this we can do something similar. Have the arch/x86 side of TDX grow a
> new tdx_buggy_shutdown(). Have it do an all-cpu IPI to kick CPUs out of
> SEAM mode, wbinvd, and set a "no more seamcalls" bool. Then any SEAMCALLs after
> that will return a TDX_BUGGY_SHUTDOWN error, or similar. All TDs in the system
> die. Zap/cleanup paths return success in the buggy shutdown case.
>

Do you mean that on unmap/split failure: there is a way to make 100%
sure all memory becomes re-usable by the rest of the host, using
tdx_buggy_shutdown(), wbinvd, etc?

If yes, then I'm onboard with this, and if we are 100% sure all memory
becomes re-usable by the host after all the extensive cleanup, then we
don't need to HWpoison anything.

> Does it fit? Or, can you guys argue that the failures here are actually non-
> special cases that are worth more complex recovery? I remember we talked about
> IOMMU patterns that are similar, but it seems like the remaining cases under
> discussion are about TDX bugs.

^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge pages
  2025-06-30 19:25                                                                               ` Ackerley Tng
@ 2025-06-30 21:45                                                                                 ` Edgecombe, Rick P
  2025-07-01  5:01                                                                                   ` Yan Zhao
  2025-07-01  5:07                                                                                 ` Yan Zhao
  1 sibling, 1 reply; 294+ messages in thread
From: Edgecombe, Rick P @ 2025-06-30 21:45 UTC (permalink / raw)
  To: ackerleytng@google.com, Zhao, Yan Y
  Cc: quic_eberman@quicinc.com, Li, Xiaoyao, Shutemov, Kirill,
	Hansen, Dave, david@redhat.com, thomas.lendacky@amd.com,
	vbabka@suse.cz, tabba@google.com, Du, Fan, michael.roth@amd.com,
	seanjc@google.com, binbin.wu@linux.intel.com, Peng, Chao P,
	kvm@vger.kernel.org, Yamahata, Isaku,
	linux-kernel@vger.kernel.org, Weiny, Ira, pbonzini@redhat.com,
	Li, Zhiquan1, Annapurve, Vishal, jroedel@suse.de, Miao, Jun,
	pgonda@google.com, x86@kernel.org

On Mon, 2025-06-30 at 12:25 -0700, Ackerley Tng wrote:
> > So for this we can do something similar. Have the arch/x86 side of TDX grow
> > a
> > new tdx_buggy_shutdown(). Have it do an all-cpu IPI to kick CPUs out of
> > SEAM mode, wbinvd, and set a "no more seamcalls" bool. Then any SEAMCALLs
> > after
> > that will return a TDX_BUGGY_SHUTDOWN error, or similar. All TDs in the
> > system
> > die. Zap/cleanup paths return success in the buggy shutdown case.
> > 
> 
> Do you mean that on unmap/split failure:

Maybe Yan can clarify here. I thought the HWpoison scenario was about TDX module
bugs. Not TDX busy errors, demote failures, etc. If there are "normal" failures,
like the ones that can be fixed with retries, then I think HWPoison is not a
good option though.

>  there is a way to make 100%
> sure all memory becomes re-usable by the rest of the host, using
> tdx_buggy_shutdown(), wbinvd, etc?

I think so. If we think the error conditions are rare enough that the cost of
killing all TDs is acceptable, then we should do a proper POC and give it some
scrutiny.

> 
> If yes, then I'm onboard with this, and if we are 100% sure all memory
> becomes re-usable by the host after all the extensive cleanup, then we
> don't need to HWpoison anything.

For eventual upstream acceptance, we need to stop and think every time TDX
requires special handling in generic code. This is why I wanted to clarify if
you guys think the scenario could be in any way considered a generic one.
(IOMMU, etc).

^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge pages
  2025-06-30 17:55                                                                             ` Edgecombe, Rick P
  2025-06-30 19:25                                                                               ` Ackerley Tng
@ 2025-06-30 21:47                                                                               ` Vishal Annapurve
  2025-07-01  9:35                                                                               ` Yan Zhao
  2 siblings, 0 replies; 294+ messages in thread
From: Vishal Annapurve @ 2025-06-30 21:47 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: ackerleytng@google.com, Zhao, Yan Y, Shutemov, Kirill,
	Li, Xiaoyao, kvm@vger.kernel.org, Hansen, Dave, david@redhat.com,
	thomas.lendacky@amd.com, tabba@google.com, vbabka@suse.cz,
	quic_eberman@quicinc.com, michael.roth@amd.com,
	linux-kernel@vger.kernel.org, seanjc@google.com, Peng, Chao P,
	Du, Fan, Yamahata, Isaku, pbonzini@redhat.com,
	binbin.wu@linux.intel.com, Weiny, Ira, Li, Zhiquan1,
	jroedel@suse.de, Miao, Jun, pgonda@google.com, x86@kernel.org

On Mon, Jun 30, 2025 at 10:55 AM Edgecombe, Rick P
<rick.p.edgecombe@intel.com> wrote:
>
> On Mon, 2025-06-30 at 19:13 +0800, Yan Zhao wrote:
> > > > ok! Lets go f/g. Unless Yan objects.
> > I'm ok with f/g. But I have two implementation specific questions:
> >
> > 1. How to set the HWPoison bit in TDX?
> > 2. Should we set this bit for non-guest-memfd pages (e.g. for S-EPT pages) ?
>
> Argh, I guess we can keep the existing ref count based approach for the other
> types of TDX owned pages?
>
> >
> > TDX can't invoke memory_failure() on error of removing guest private pages or
> > S-EPT pages, because holding write mmu_lock is regarded as in atomic context.
> > As there's a mutex in memory_failure(),
> > "BUG: sleeping function called from invalid context at kernel/locking/mutex.c"
> > will be printed.
> >
> > If TDX invokes memory_failure_queue() instead, looks guest_memfd can invoke
> > memory_failure_queue_kick() to ensure HWPoison bit is set timely.
> > But which component could invoke memory_failure_queue_kick() for S-EPT pages?
> > KVM?
>
> Hmm, it only has queue of 10 pages per-cpu. If something goes wrong in the TDX
> module, I could see exceeding this during a zap operation. At which point, how
> much have we really handled it?
>
>
> But, at the risk of derailing the solution when we are close, some reflection
> has made me question whether this is all misprioritized. We are trying to handle
> a case where a TDX module bug may return an error when we try to release gmem
> pages. For that, this solution is feeling way too complex.
>
> If there is a TDX module bug, a simpler way to handle it would be to fix the
> bug. In the meantime the kernel can take simpler, more drastic efforts to
> reclaim the memory and ensure system stability.
>
> In the host kexec patches we need to handle a kexec while the TDX module is
> running. The solution is to simply wbinvd on each pCPU that might have entered
> the TDX module. After that, barring any new SEAMCALLs that could dirty
> memory, the pages are free to use by the next kernel. (at least on systems
> without the partial write errata)
>
> So for this we can do something similar. Have the arch/x86 side of TDX grow a
> new tdx_buggy_shutdown(). Have it do an all-cpu IPI to kick CPUs out of
> SEAM mode, wbinvd, and set a "no more seamcalls" bool. Then any SEAMCALLs after
> that will return a TDX_BUGGY_SHUTDOWN error, or similar. All TDs in the system
> die. Zap/cleanup paths return success in the buggy shutdown case.

This approach makes sense to me.

>
> Does it fit? Or, can you guys argue that the failures here are actually non-
> special cases that are worth more complex recovery? I remember we talked about
> IOMMU patterns that are similar, but it seems like the remaining cases under
> discussion are about TDX bugs.
>

^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 09/21] KVM: TDX: Enable 2MB mapping size after TD is RUNNABLE
  2025-06-26  8:53                                                           ` Yan Zhao
@ 2025-07-01  0:42                                                             ` Edgecombe, Rick P
  2025-07-01  2:41                                                               ` Yan Zhao
  0 siblings, 1 reply; 294+ messages in thread
From: Edgecombe, Rick P @ 2025-07-01  0:42 UTC (permalink / raw)
  To: Zhao, Yan Y
  Cc: quic_eberman@quicinc.com, Li, Xiaoyao, Huang, Kai, Du, Fan,
	Hansen, Dave, david@redhat.com, thomas.lendacky@amd.com,
	vbabka@suse.cz, Li, Zhiquan1, Shutemov, Kirill,
	michael.roth@amd.com, seanjc@google.com, Weiny, Ira, Peng, Chao P,
	pbonzini@redhat.com, Yamahata, Isaku,
	linux-kernel@vger.kernel.org, binbin.wu@linux.intel.com,
	ackerleytng@google.com, kvm@vger.kernel.org, Annapurve, Vishal,
	tabba@google.com, jroedel@suse.de, Miao, Jun, pgonda@google.com,
	x86@kernel.org

On Thu, 2025-06-26 at 16:53 +0800, Yan Zhao wrote:
> On Wed, Jun 25, 2025 at 10:47:47PM +0800, Edgecombe, Rick P wrote:
> > On Wed, 2025-06-25 at 17:28 +0800, Yan Zhao wrote:
> > > On Wed, Jun 25, 2025 at 02:35:59AM +0800, Edgecombe, Rick P wrote:
> > > > On Tue, 2025-06-24 at 17:57 +0800, Yan Zhao wrote:
> > > > 
> > > I guess it's the latter one as it can avoid modification to both EDK2 and Linux
> > > guest.  I observed ~2710 instances of "guest accepts at 4KB when KVM can map at
> > > 2MB" during the boot-up of a TD with 4GB memory.
> > 
> > Oh, wow that is more than I expected. Did you notice how many vCPUs they were
> > spread across? What memory size did you use? What was your guest memory
> > configuration?
> The guest memory is 4GB, 8 vCPUs.
> The memory slots layout is:
> slot 1: base gfn=0, npages=0x80000
> slot 2: base gfn=0x100000, npages=0x80000
> slot 3: base gfn=0xffc00, npages=0x400
> 
> The GFN spread for the ~2710 instances is:
> GFNs 0x806-0x9ff (1 time for each of 506 pages)
> GFNs 0x7e800-0x7e9ff (1 time for each of 512 pages)
> GFN: 0x7d3ff~0x7e7fe (repeated private-to-shared, and shared-to-private are
>                       conducted on this range), with the top 3 among them being:
>      0x7d9da (476 times)
>      0x7d9d9 (156 times)
>      0x7d9d7 (974 times)
> 
> All those instances are from vCPU 0, when the guest is in EDK2 and during early
> kernel boot.
> 
> Based on my observation, the count of these instances does not scale with guest
> memory. In other words, the count remains roughly the same even when the guest
> memory is increased to 8GB.

So the impact would be negligible. The mmu write lock would not see much, if
any, contention.

> 
> > > But does it mean TDX needs to hold write mmu_lock in the EPT violation handler
> > > and set KVM_LPAGE_GUEST_INHIBIT on finding a violation carries 4KB level info?
> > 
> > I think so. I didn't check the reason, but the other similar code took it. Maybe
> > not? If we don't need to take mmu write lock, then this idea seems like a clear
> > winner to me.
> Hmm, setting KVM_LPAGE_GUEST_INHIBIT needs to be followed by an attempt to split.
> So, if we don't want to support splitting under read mmu_lock, we need to take
> write mmu_lock.
> 
> I drafted a change as below (will refine some parts of it later).
> The average count of write mmu_lock acquisitions is 11 during VM boot.
> 
> There's no significant difference in the count of 2M mappings
> during guest kernel boot up to login; on average:
> before this patch: 1144 2M mappings
> after this patch:  1143 2M mappings.

Oh, hmm. Well, it's not a strong argument against.

> 
> diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
> index f999c15d8d3e..d4e98728f600 100644
> --- a/arch/x86/kvm/mmu.h
> +++ b/arch/x86/kvm/mmu.h
> @@ -322,4 +322,8 @@ static inline bool kvm_is_gfn_alias(struct kvm *kvm, gfn_t gfn)
>  {
>         return gfn & kvm_gfn_direct_bits(kvm);
>  }
> +
> +void hugepage_set_guest_inhibit(struct kvm_memory_slot *slot, gfn_t gfn, int level);
> +bool hugepage_test_guest_inhibit(struct kvm_memory_slot *slot, gfn_t gfn, int level);
> +
>  #endif
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index f0afee2e283a..28c511d8b372 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -721,6 +721,8 @@ static struct kvm_lpage_info *lpage_info_slot(gfn_t gfn,
>   */
>  #define KVM_LPAGE_MIXED_FLAG   BIT(31)
> 
> +#define KVM_LPAGE_GUEST_INHIBIT_FLAG   BIT(30)
> +
>  static void update_gfn_disallow_lpage_count(const struct kvm_memory_slot *slot,
>                                             gfn_t gfn, int count)
>  {
> @@ -732,7 +734,8 @@ static void update_gfn_disallow_lpage_count(const struct kvm_memory_slot *slot,
> 
>                 old = linfo->disallow_lpage;
>                 linfo->disallow_lpage += count;
> -               WARN_ON_ONCE((old ^ linfo->disallow_lpage) & KVM_LPAGE_MIXED_FLAG);
> +               WARN_ON_ONCE((old ^ linfo->disallow_lpage) &
> +                             (KVM_LPAGE_MIXED_FLAG | KVM_LPAGE_GUEST_INHIBIT_FLAG));
>         }
>  }
> 
> @@ -1653,13 +1656,15 @@ int kvm_split_boundary_leafs(struct kvm *kvm, struct kvm_gfn_range *range)
>         bool ret = 0;
> 
>         lockdep_assert_once(kvm->mmu_invalidate_in_progress ||
> -                           lockdep_is_held(&kvm->slots_lock));
> +                           lockdep_is_held(&kvm->slots_lock) ||
> +                           srcu_read_lock_held(&kvm->srcu));
> 
>         if (tdp_mmu_enabled)
>                 ret = kvm_tdp_mmu_gfn_range_split_boundary(kvm, range);
> 
>         return ret;
>  }
> +EXPORT_SYMBOL_GPL(kvm_split_boundary_leafs);
> 
>  bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
>  {
> @@ -7734,6 +7739,18 @@ void kvm_mmu_pre_destroy_vm(struct kvm *kvm)
>                 vhost_task_stop(kvm->arch.nx_huge_page_recovery_thread);
>  }
> 
> +bool hugepage_test_guest_inhibit(struct kvm_memory_slot *slot, gfn_t gfn, int level)
> +{
> +       return lpage_info_slot(gfn, slot, level)->disallow_lpage & KVM_LPAGE_GUEST_INHIBIT_FLAG;
> +}
> +EXPORT_SYMBOL_GPL(hugepage_test_guest_inhibit);
> +
> +void hugepage_set_guest_inhibit(struct kvm_memory_slot *slot, gfn_t gfn, int level)
> +{
> +       lpage_info_slot(gfn, slot, level)->disallow_lpage |= KVM_LPAGE_GUEST_INHIBIT_FLAG;
> +}
> +EXPORT_SYMBOL_GPL(hugepage_set_guest_inhibit);
> +
>  #ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES
>  static bool hugepage_test_mixed(struct kvm_memory_slot *slot, gfn_t gfn,
>                                 int level)
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index 244fd22683db..4028423cf595 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -1852,28 +1852,8 @@ int tdx_sept_split_private_spt(struct kvm *kvm, gfn_t gfn, enum pg_level level,
>         if (KVM_BUG_ON(to_kvm_tdx(kvm)->state != TD_STATE_RUNNABLE || level != PG_LEVEL_2M, kvm))
>                 return -EINVAL;
> 
> -       /*
> -        * Split request with mmu_lock held for reading can only occur when one
> -        * vCPU accepts at 2MB level while another vCPU accepts at 4KB level.
> -        * Ignore this 4KB mapping request by setting violation_request_level to
> -        * 2MB and returning -EBUSY for retry. Then the next fault at 2MB level
> -        * would be a spurious fault. The vCPU accepting at 2MB will accept the
> -        * whole 2MB range.
> -        */
> -       if (mmu_lock_shared) {
> -               struct kvm_vcpu *vcpu = kvm_get_running_vcpu();
> -               struct vcpu_tdx *tdx = to_tdx(vcpu);
> -
> -               if (KVM_BUG_ON(!vcpu, kvm))
> -                       return -EOPNOTSUPP;
> -
> -               /* Request to map as 2MB leaf for the whole 2MB range */
> -               tdx->violation_gfn_start = gfn_round_for_level(gfn, level);
> -               tdx->violation_gfn_end = tdx->violation_gfn_start + KVM_PAGES_PER_HPAGE(level);
> -               tdx->violation_request_level = level;
> -
> -               return -EBUSY;
> -       }
> +       if (mmu_lock_shared)
> +               return -EOPNOTSUPP;
> 
>         ret = tdx_sept_zap_private_spte(kvm, gfn, level, page);
>         if (ret <= 0)
> @@ -1937,28 +1917,51 @@ static inline bool tdx_is_sept_violation_unexpected_pending(struct kvm_vcpu *vcp
>         return !(eq & EPT_VIOLATION_PROT_MASK) && !(eq & EPT_VIOLATION_EXEC_FOR_RING3_LIN);
>  }
> 
> -static inline void tdx_get_accept_level(struct kvm_vcpu *vcpu, gpa_t gpa)
> +static inline int tdx_check_accept_level(struct kvm_vcpu *vcpu, gpa_t gpa)
>  {
>         struct vcpu_tdx *tdx = to_tdx(vcpu);
> +       struct kvm *kvm = vcpu->kvm;
> +       gfn_t gfn = gpa_to_gfn(gpa);
> +       struct kvm_memory_slot *slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn);
>         int level = -1;
> +       u64 eeq_info;
> 
> -       u64 eeq_type = tdx->ext_exit_qualification & TDX_EXT_EXIT_QUAL_TYPE_MASK;
> +       if (!slot)
> +               return 0;
> 
> -       u32 eeq_info = (tdx->ext_exit_qualification & TDX_EXT_EXIT_QUAL_INFO_MASK) >>
> -                       TDX_EXT_EXIT_QUAL_INFO_SHIFT;
> +       if ((tdx->ext_exit_qualification & TDX_EXT_EXIT_QUAL_TYPE_MASK) !=
> +           TDX_EXT_EXIT_QUAL_TYPE_ACCEPT)
> +               return 0;
> 
> -       if (eeq_type == TDX_EXT_EXIT_QUAL_TYPE_ACCEPT) {
> -               level = (eeq_info & GENMASK(2, 0)) + 1;
> +       eeq_info = (tdx->ext_exit_qualification & TDX_EXT_EXIT_QUAL_INFO_MASK) >>
> +                   TDX_EXT_EXIT_QUAL_INFO_SHIFT;
> 
> -               tdx->violation_gfn_start = gfn_round_for_level(gpa_to_gfn(gpa), level);
> -               tdx->violation_gfn_end = tdx->violation_gfn_start + KVM_PAGES_PER_HPAGE(level);
> -               tdx->violation_request_level = level;
> -       } else {
> -               tdx->violation_gfn_start = -1;
> -               tdx->violation_gfn_end = -1;
> -               tdx->violation_request_level = -1;
> +       level = (eeq_info & GENMASK(2, 0)) + 1;
> +
> +       if (level == PG_LEVEL_4K) {
> +              if (!hugepage_test_guest_inhibit(slot, gfn, PG_LEVEL_2M)) {
> +                       struct kvm_gfn_range gfn_range = {
> +                               .start = gfn,
> +                               .end = gfn + 1,
> +                               .slot = slot,
> +                               .may_block = true,
> +                               .attr_filter = KVM_FILTER_PRIVATE,
> +                       };
> +
> +                       scoped_guard(write_lock, &kvm->mmu_lock) {
> +                               int ret;
> +
> +                               ret = kvm_split_boundary_leafs(kvm, &gfn_range);
> +
> +                               if (ret)
> +                                       return ret;
> +
> +                               hugepage_set_guest_inhibit(slot, gfn, PG_LEVEL_2M);


Can you explain what you found regarding the write lock need? For most accept
cases, we could fault in the PTE's on the read lock. And in the future we could
have a demote that could work under read lock, as we talked. So
kvm_split_boundary_leafs() would often be unneeded, or could work under the read
lock when it is needed.

What is the problem in hugepage_set_guest_inhibit() that requires the write
lock?

But in any case, it seems like we have *a* solution here. It doesn't seem like
there are any big downsides. Should we close it?

> +                       }
> +              }
>         }
> +
> +       return 0;
>  }
> 
>  static int tdx_handle_ept_violation(struct kvm_vcpu *vcpu)
> @@ -1987,7 +1990,8 @@ static int tdx_handle_ept_violation(struct kvm_vcpu *vcpu)
>                  */
>                 exit_qual = EPT_VIOLATION_ACC_WRITE;
> 
> -               tdx_get_accept_level(vcpu, gpa);
> +               if (tdx_check_accept_level(vcpu, gpa))
> +                       return RET_PF_RETRY;
> 
>                 /* Only private GPA triggers zero-step mitigation */
>                 local_retry = true;
> @@ -3022,9 +3026,6 @@ static int tdx_td_vcpu_init(struct kvm_vcpu *vcpu, u64 vcpu_rcx)
> 
>         vcpu->arch.mp_state = KVM_MP_STATE_RUNNABLE;
> 
> -       tdx->violation_gfn_start = -1;
> -       tdx->violation_gfn_end = -1;
> -       tdx->violation_request_level = -1;
>         return 0;
> 
>  free_tdcx:
> @@ -3373,14 +3374,9 @@ int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp)
>  int tdx_gmem_private_max_mapping_level(struct kvm_vcpu *vcpu, kvm_pfn_t pfn,
>                                        gfn_t gfn, bool prefetch)
>  {
> -       struct vcpu_tdx *tdx = to_tdx(vcpu);
> -
> -       if (unlikely((to_kvm_tdx(vcpu->kvm)->state != TD_STATE_RUNNABLE) || prefetch))
> +       if (unlikely((to_kvm_tdx(vcpu->kvm)->state != TD_STATE_RUNNABLE)))
>                 return PG_LEVEL_4K;
> 
> -       if (gfn >= tdx->violation_gfn_start && gfn < tdx->violation_gfn_end)
> -               return tdx->violation_request_level;
> -
>         return PG_LEVEL_2M;
>  }
> 
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index acd18a01f63d..3a3077666ee6 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -2610,6 +2610,7 @@ struct kvm_memory_slot *kvm_vcpu_gfn_to_memslot(struct kvm_vcpu *vcpu, gfn_t gfn
> 
>         return NULL;
>  }
> +EXPORT_SYMBOL_GPL(kvm_vcpu_gfn_to_memslot);
> 
>  bool kvm_is_visible_gfn(struct kvm *kvm, gfn_t gfn)
>  {


^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 09/21] KVM: TDX: Enable 2MB mapping size after TD is RUNNABLE
  2025-07-01  0:42                                                             ` Edgecombe, Rick P
@ 2025-07-01  2:41                                                               ` Yan Zhao
  2025-07-01 15:36                                                                 ` Edgecombe, Rick P
  0 siblings, 1 reply; 294+ messages in thread
From: Yan Zhao @ 2025-07-01  2:41 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: quic_eberman@quicinc.com, Li, Xiaoyao, Huang, Kai, Du, Fan,
	Hansen, Dave, david@redhat.com, thomas.lendacky@amd.com,
	vbabka@suse.cz, Li, Zhiquan1, Shutemov, Kirill,
	michael.roth@amd.com, seanjc@google.com, Weiny, Ira, Peng, Chao P,
	pbonzini@redhat.com, Yamahata, Isaku,
	linux-kernel@vger.kernel.org, binbin.wu@linux.intel.com,
	ackerleytng@google.com, kvm@vger.kernel.org, Annapurve, Vishal,
	tabba@google.com, jroedel@suse.de, Miao, Jun, pgonda@google.com,
	x86@kernel.org

On Tue, Jul 01, 2025 at 08:42:33AM +0800, Edgecombe, Rick P wrote:
> On Thu, 2025-06-26 at 16:53 +0800, Yan Zhao wrote:
> > On Wed, Jun 25, 2025 at 10:47:47PM +0800, Edgecombe, Rick P wrote:
> > > On Wed, 2025-06-25 at 17:28 +0800, Yan Zhao wrote:
> > > > On Wed, Jun 25, 2025 at 02:35:59AM +0800, Edgecombe, Rick P wrote:
> > > > > On Tue, 2025-06-24 at 17:57 +0800, Yan Zhao wrote:
> > > > > 
> > > > I guess it's the latter one as it can avoid modification to both EDK2 and Linux
> > > > guest.  I observed ~2710 instances of "guest accepts at 4KB when KVM can map at
> > > > 2MB" during the boot-up of a TD with 4GB memory.
> > > 
> > > Oh, wow that is more than I expected. Did you notice how many vCPUs they were
> > > spread across? What memory size did you use? What was your guest memory
> > > configuration?
> > The guest memory is 4GB, 8 vCPUs.
> > The memory slots layout is:
> > slot 1: base gfn=0, npages=0x80000
> > slot 2: base gfn=0x100000, npages=0x80000
> > slot 3: base gfn=0xffc00, npages=0x400
> > 
> > The GFN spread for the ~2710 instances is:
> > GFNs 0x806-0x9ff (1 time for each of 506 pages)
> > GFNs 0x7e800-0x7e9ff (1 time for each of 512 pages)
> > GFN: 0x7d3ff~0x7e7fe (repeated private-to-shared, and shared-to-private are
> >                       conducted on this range), with the top 3 among them being:
> >      0x7d9da (476 times)
> >      0x7d9d9 (156 times)
> >      0x7d9d7 (974 times)
> > 
> > All those instances are from vCPU 0, when the guest is in EDK2 and during early
> > kernel boot.
> > 
> > Based on my observation, the count of these instances does not scale with guest
> > memory. In other words, the count remains roughly the same even when the guest
> > memory is increased to 8GB.
> 
> So the impact would be negligible. The mmu write lock would not meet much, if
> any, contention.
> 
> > 
> > > > But does it mean TDX needs to hold write mmu_lock in the EPT violation handler
> > > > and set KVM_LPAGE_GUEST_INHIBIT on finding a violation carries 4KB level info?
> > > 
> > > I think so. I didn't check the reason, but the other similar code took it. Maybe
> > > not? If we don't need to take mmu write lock, then this idea seems like a clear
> > > winner to me.
> > Hmm, setting KVM_LPAGE_GUEST_INHIBIT needs to be followed by an attempt to split.
> > So, if we don't want to support splitting under read mmu_lock, we need to take
> > write mmu_lock.
> > 
> > I drafted a change as below (will refine some parts of it later).
> > The write mmu_lock is taken 11 times on average during VM boot.
> > 
> > There's no significant difference in the count of 2M mappings
> > during guest kernel booting to login, on average:
> > before this patch: 1144 2M mappings
> > after this patch:  1143 2M mappings.
> 
> Oh, hmm. Well, it's not a strong argument against it.
> 
> > 
> > diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
> > index f999c15d8d3e..d4e98728f600 100644
> > --- a/arch/x86/kvm/mmu.h
> > +++ b/arch/x86/kvm/mmu.h
> > @@ -322,4 +322,8 @@ static inline bool kvm_is_gfn_alias(struct kvm *kvm, gfn_t gfn)
> >  {
> >         return gfn & kvm_gfn_direct_bits(kvm);
> >  }
> > +
> > +void hugepage_set_guest_inhibit(struct kvm_memory_slot *slot, gfn_t gfn, int level);
> > +bool hugepage_test_guest_inhibit(struct kvm_memory_slot *slot, gfn_t gfn, int level);
> > +
> >  #endif
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index f0afee2e283a..28c511d8b372 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -721,6 +721,8 @@ static struct kvm_lpage_info *lpage_info_slot(gfn_t gfn,
> >   */
> >  #define KVM_LPAGE_MIXED_FLAG   BIT(31)
> > 
> > +#define KVM_LPAGE_GUEST_INHIBIT_FLAG   BIT(30)
> > +
> >  static void update_gfn_disallow_lpage_count(const struct kvm_memory_slot *slot,
> >                                             gfn_t gfn, int count)
> >  {
> > @@ -732,7 +734,8 @@ static void update_gfn_disallow_lpage_count(const struct kvm_memory_slot *slot,
> > 
> >                 old = linfo->disallow_lpage;
> >                 linfo->disallow_lpage += count;
> > -               WARN_ON_ONCE((old ^ linfo->disallow_lpage) & KVM_LPAGE_MIXED_FLAG);
> > +               WARN_ON_ONCE((old ^ linfo->disallow_lpage) &
> > +                             (KVM_LPAGE_MIXED_FLAG | KVM_LPAGE_GUEST_INHIBIT_FLAG));
> >         }
> >  }
> > 
> > @@ -1653,13 +1656,15 @@ int kvm_split_boundary_leafs(struct kvm *kvm, struct kvm_gfn_range *range)
> >         bool ret = 0;
> > 
> >         lockdep_assert_once(kvm->mmu_invalidate_in_progress ||
> > -                           lockdep_is_held(&kvm->slots_lock));
> > +                           lockdep_is_held(&kvm->slots_lock) ||
> > +                           srcu_read_lock_held(&kvm->srcu));
> > 
> >         if (tdp_mmu_enabled)
> >                 ret = kvm_tdp_mmu_gfn_range_split_boundary(kvm, range);
> > 
> >         return ret;
> >  }
> > +EXPORT_SYMBOL_GPL(kvm_split_boundary_leafs);
> > 
> >  bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
> >  {
> > @@ -7734,6 +7739,18 @@ void kvm_mmu_pre_destroy_vm(struct kvm *kvm)
> >                 vhost_task_stop(kvm->arch.nx_huge_page_recovery_thread);
> >  }
> > 
> > +bool hugepage_test_guest_inhibit(struct kvm_memory_slot *slot, gfn_t gfn, int level)
> > +{
> > +       return lpage_info_slot(gfn, slot, level)->disallow_lpage & KVM_LPAGE_GUEST_INHIBIT_FLAG;
> > +}
> > +EXPORT_SYMBOL_GPL(hugepage_test_guest_inhibit);
> > +
> > +void hugepage_set_guest_inhibit(struct kvm_memory_slot *slot, gfn_t gfn, int level)
> > +{
> > +       lpage_info_slot(gfn, slot, level)->disallow_lpage |= KVM_LPAGE_GUEST_INHIBIT_FLAG;
> > +}
> > +EXPORT_SYMBOL_GPL(hugepage_set_guest_inhibit);
> > +
> >  #ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES
> >  static bool hugepage_test_mixed(struct kvm_memory_slot *slot, gfn_t gfn,
> >                                 int level)
> > diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> > index 244fd22683db..4028423cf595 100644
> > --- a/arch/x86/kvm/vmx/tdx.c
> > +++ b/arch/x86/kvm/vmx/tdx.c
> > @@ -1852,28 +1852,8 @@ int tdx_sept_split_private_spt(struct kvm *kvm, gfn_t gfn, enum pg_level level,
> >         if (KVM_BUG_ON(to_kvm_tdx(kvm)->state != TD_STATE_RUNNABLE || level != PG_LEVEL_2M, kvm))
> >                 return -EINVAL;
> > 
> > -       /*
> > -        * Split request with mmu_lock held for reading can only occur when one
> > -        * vCPU accepts at 2MB level while another vCPU accepts at 4KB level.
> > -        * Ignore this 4KB mapping request by setting violation_request_level to
> > -        * 2MB and returning -EBUSY for retry. Then the next fault at 2MB level
> > -        * would be a spurious fault. The vCPU accepting at 2MB will accept the
> > -        * whole 2MB range.
> > -        */
> > -       if (mmu_lock_shared) {
> > -               struct kvm_vcpu *vcpu = kvm_get_running_vcpu();
> > -               struct vcpu_tdx *tdx = to_tdx(vcpu);
> > -
> > -               if (KVM_BUG_ON(!vcpu, kvm))
> > -                       return -EOPNOTSUPP;
> > -
> > -               /* Request to map as 2MB leaf for the whole 2MB range */
> > -               tdx->violation_gfn_start = gfn_round_for_level(gfn, level);
> > -               tdx->violation_gfn_end = tdx->violation_gfn_start + KVM_PAGES_PER_HPAGE(level);
> > -               tdx->violation_request_level = level;
> > -
> > -               return -EBUSY;
> > -       }
> > +       if (mmu_lock_shared)
> > +               return -EOPNOTSUPP;
> > 
> >         ret = tdx_sept_zap_private_spte(kvm, gfn, level, page);
> >         if (ret <= 0)
> > @@ -1937,28 +1917,51 @@ static inline bool tdx_is_sept_violation_unexpected_pending(struct kvm_vcpu *vcp
> >         return !(eq & EPT_VIOLATION_PROT_MASK) && !(eq & EPT_VIOLATION_EXEC_FOR_RING3_LIN);
> >  }
> > 
> > -static inline void tdx_get_accept_level(struct kvm_vcpu *vcpu, gpa_t gpa)
> > +static inline int tdx_check_accept_level(struct kvm_vcpu *vcpu, gpa_t gpa)
> >  {
> >         struct vcpu_tdx *tdx = to_tdx(vcpu);
> > +       struct kvm *kvm = vcpu->kvm;
> > +       gfn_t gfn = gpa_to_gfn(gpa);
> > +       struct kvm_memory_slot *slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn);
> >         int level = -1;
> > +       u64 eeq_info;
> > 
> > -       u64 eeq_type = tdx->ext_exit_qualification & TDX_EXT_EXIT_QUAL_TYPE_MASK;
> > +       if (!slot)
> > +               return 0;
> > 
> > -       u32 eeq_info = (tdx->ext_exit_qualification & TDX_EXT_EXIT_QUAL_INFO_MASK) >>
> > -                       TDX_EXT_EXIT_QUAL_INFO_SHIFT;
> > +       if ((tdx->ext_exit_qualification & TDX_EXT_EXIT_QUAL_TYPE_MASK) !=
> > +           TDX_EXT_EXIT_QUAL_TYPE_ACCEPT)
> > +               return 0;
> > 
> > -       if (eeq_type == TDX_EXT_EXIT_QUAL_TYPE_ACCEPT) {
> > -               level = (eeq_info & GENMASK(2, 0)) + 1;
> > +       eeq_info = (tdx->ext_exit_qualification & TDX_EXT_EXIT_QUAL_INFO_MASK) >>
> > +                   TDX_EXT_EXIT_QUAL_INFO_SHIFT;
> > 
> > -               tdx->violation_gfn_start = gfn_round_for_level(gpa_to_gfn(gpa), level);
> > -               tdx->violation_gfn_end = tdx->violation_gfn_start + KVM_PAGES_PER_HPAGE(level);
> > -               tdx->violation_request_level = level;
> > -       } else {
> > -               tdx->violation_gfn_start = -1;
> > -               tdx->violation_gfn_end = -1;
> > -               tdx->violation_request_level = -1;
> > +       level = (eeq_info & GENMASK(2, 0)) + 1;
> > +
> > +       if (level == PG_LEVEL_4K) {
> > +              if (!hugepage_test_guest_inhibit(slot, gfn, PG_LEVEL_2M)) {
> > +                       struct kvm_gfn_range gfn_range = {
> > +                               .start = gfn,
> > +                               .end = gfn + 1,
> > +                               .slot = slot,
> > +                               .may_block = true,
> > +                               .attr_filter = KVM_FILTER_PRIVATE,
> > +                       };
> > +
> > +                       scoped_guard(write_lock, &kvm->mmu_lock) {
> > +                               int ret;
> > +
> > +                               ret = kvm_split_boundary_leafs(kvm, &gfn_range);
> > +
> > +                               if (ret)
> > +                                       return ret;
> > +
> > +                               hugepage_set_guest_inhibit(slot, gfn, PG_LEVEL_2M);
> 
> 
> Can you explain what you found regarding the write lock need?
Here, the write lock protects 2 steps:
(1) update lpage_info.
(2) try splitting if there's any existing 2MB mapping.

The write mmu_lock is needed because lpage_info is read under read mmu_lock in
kvm_tdp_mmu_map().

kvm_tdp_mmu_map
  kvm_mmu_hugepage_adjust
    kvm_lpage_info_max_mapping_level

If we update the lpage_info with read mmu_lock, the other vCPUs may map at a
stale 2MB level even after lpage_info is updated by hugepage_set_guest_inhibit().

Therefore, we must perform splitting under the write mmu_lock to ensure there
are no 2MB mappings after hugepage_set_guest_inhibit().

Otherwise, during the later mapping in __vmx_handle_ept_violation(), splitting in
the fault path could be triggered, as the KVM MMU would find the goal level is 4KB
while an existing 2MB mapping is present.
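
To spell out the ordering, here is a minimal sketch of the same sequence as in the
draft above (the wrapper name is made up; error handling trimmed):

/* Sketch only: both steps sit in one write mmu_lock critical section. */
static int tdx_inhibit_and_split(struct kvm *kvm, struct kvm_memory_slot *slot,
				 gfn_t gfn)
{
	struct kvm_gfn_range gfn_range = {
		.start = gfn,
		.end = gfn + 1,
		.slot = slot,
		.may_block = true,
		.attr_filter = KVM_FILTER_PRIVATE,
	};
	int ret;

	scoped_guard(write_lock, &kvm->mmu_lock) {
		/* Demote any existing 2MB leaf covering this GFN first. */
		ret = kvm_split_boundary_leafs(kvm, &gfn_range);
		if (ret)
			return ret;
		/* Then make future faults see 4KB as the max level. */
		hugepage_set_guest_inhibit(slot, gfn, PG_LEVEL_2M);
	}
	return 0;
}

No vCPU can fault in a stale 2MB mapping between the two steps, because both
happen before the write mmu_lock is dropped.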


> For most accept
> cases, we could fault in the PTE's on the read lock. And in the future we could

The actual mapping at the 4KB level is still done under the read mmu_lock in
__vmx_handle_ept_violation().

> have a demote that could work under read lock, as we talked. So
> kvm_split_boundary_leafs() would often be unneeded, or could work under the read
> lock when it is needed.
Could we leave the "demote under read lock" as a future optimization? 


> What is the problem in hugepage_set_guest_inhibit() that requires the write
> lock?
As above, to prevent other vCPUs from reading a stale mapping level, and to avoid
splitting under the read mmu_lock.

As guest_inhibit is set one-way, we can test it using
hugepage_test_guest_inhibit() without holding the lock. The number of times the
write mmu_lock must be taken for hugepage_set_guest_inhibit() is then greatly
reduced (11 during VM boot in my testing).
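
Roughly, the fast path would look like below (a sketch reusing the helpers and
gfn_range from the draft; the re-check under the lock is only there to avoid a
redundant split if two vCPUs race):

	/* Lockless test is fine: the inhibit flag is set one-way. */
	if (hugepage_test_guest_inhibit(slot, gfn, PG_LEVEL_2M))
		return 0;

	scoped_guard(write_lock, &kvm->mmu_lock) {
		/* Another vCPU may have inhibited and split meanwhile. */
		if (!hugepage_test_guest_inhibit(slot, gfn, PG_LEVEL_2M)) {
			ret = kvm_split_boundary_leafs(kvm, &gfn_range);
			if (ret)
				return ret;
			hugepage_set_guest_inhibit(slot, gfn, PG_LEVEL_2M);
		}
	}
	return 0;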
 
> But in any case, it seems like we have *a* solution here. It doesn't seem like
> there are any big downsides. Should we close it?
I think it's good, as long as Sean doesn't disagree :)


> > +                       }
> > +              }
> >         }
> > +
> > +       return 0;
> >  }
> > 
> >  static int tdx_handle_ept_violation(struct kvm_vcpu *vcpu)
> > @@ -1987,7 +1990,8 @@ static int tdx_handle_ept_violation(struct kvm_vcpu *vcpu)
> >                  */
> >                 exit_qual = EPT_VIOLATION_ACC_WRITE;
> > 
> > -               tdx_get_accept_level(vcpu, gpa);
> > +               if (tdx_check_accept_level(vcpu, gpa))
> > +                       return RET_PF_RETRY;
> > 
> >                 /* Only private GPA triggers zero-step mitigation */
> >                 local_retry = true;
> > @@ -3022,9 +3026,6 @@ static int tdx_td_vcpu_init(struct kvm_vcpu *vcpu, u64 vcpu_rcx)
> > 
> >         vcpu->arch.mp_state = KVM_MP_STATE_RUNNABLE;
> > 
> > -       tdx->violation_gfn_start = -1;
> > -       tdx->violation_gfn_end = -1;
> > -       tdx->violation_request_level = -1;
> >         return 0;
> > 
> >  free_tdcx:
> > @@ -3373,14 +3374,9 @@ int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp)
> >  int tdx_gmem_private_max_mapping_level(struct kvm_vcpu *vcpu, kvm_pfn_t pfn,
> >                                        gfn_t gfn, bool prefetch)
> >  {
> > -       struct vcpu_tdx *tdx = to_tdx(vcpu);
> > -
> > -       if (unlikely((to_kvm_tdx(vcpu->kvm)->state != TD_STATE_RUNNABLE) || prefetch))
> > +       if (unlikely((to_kvm_tdx(vcpu->kvm)->state != TD_STATE_RUNNABLE)))
> >                 return PG_LEVEL_4K;
> > 
> > -       if (gfn >= tdx->violation_gfn_start && gfn < tdx->violation_gfn_end)
> > -               return tdx->violation_request_level;
> > -
> >         return PG_LEVEL_2M;
> >  }
> > 
> > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > index acd18a01f63d..3a3077666ee6 100644
> > --- a/virt/kvm/kvm_main.c
> > +++ b/virt/kvm/kvm_main.c
> > @@ -2610,6 +2610,7 @@ struct kvm_memory_slot *kvm_vcpu_gfn_to_memslot(struct kvm_vcpu *vcpu, gfn_t gfn
> > 
> >         return NULL;
> >  }
> > +EXPORT_SYMBOL_GPL(kvm_vcpu_gfn_to_memslot);
> > 
> >  bool kvm_is_visible_gfn(struct kvm *kvm, gfn_t gfn)
> >  {
> 

^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge pages
  2025-06-30 21:45                                                                                 ` Edgecombe, Rick P
@ 2025-07-01  5:01                                                                                   ` Yan Zhao
  2025-07-01  5:22                                                                                     ` Vishal Annapurve
  2025-07-01 16:13                                                                                     ` Edgecombe, Rick P
  0 siblings, 2 replies; 294+ messages in thread
From: Yan Zhao @ 2025-07-01  5:01 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: ackerleytng@google.com, quic_eberman@quicinc.com, Li, Xiaoyao,
	Shutemov, Kirill, Hansen, Dave, david@redhat.com,
	thomas.lendacky@amd.com, vbabka@suse.cz, tabba@google.com,
	Du, Fan, michael.roth@amd.com, seanjc@google.com,
	binbin.wu@linux.intel.com, Peng, Chao P, kvm@vger.kernel.org,
	Yamahata, Isaku, linux-kernel@vger.kernel.org, Weiny, Ira,
	pbonzini@redhat.com, Li, Zhiquan1, Annapurve, Vishal,
	jroedel@suse.de, Miao, Jun, pgonda@google.com, x86@kernel.org

On Tue, Jul 01, 2025 at 05:45:54AM +0800, Edgecombe, Rick P wrote:
> On Mon, 2025-06-30 at 12:25 -0700, Ackerley Tng wrote:
> > > So for this we can do something similar. Have the arch/x86 side of TDX grow
> > > a
> > > new tdx_buggy_shutdown(). Have it do an all-cpu IPI to kick CPUs out of
> > > SEAMMODE, wbinvd, and set a "no more seamcalls" bool. Then any SEAMCALLs
> > > after
> > > that will return a TDX_BUGGY_SHUTDOWN error, or similar. All TDs in the
> > > system
> > > die. Zap/cleanup paths return success in the buggy shutdown case.
> > > 
> > 
> > Do you mean that on unmap/split failure:
> 
> Maybe Yan can clarify here. I thought the HWpoison scenario was about TDX module
My thinking is to set HWPoison on private pages whenever a KVM_BUG_ON() is hit in
TDX, i.e., when the page is still mapped in the S-EPT but the TD is bugged and
about to be torn down.

So, it could be due to KVM or TDX module bugs, which retries can't help.

> bugs. Not TDX busy errors, demote failures, etc. If there are "normal" failures,
> like the ones that can be fixed with retries, then I think HWPoison is not a
> good option though.
> 
> >  there is a way to make 100%
> > sure all memory becomes re-usable by the rest of the host, using
> > tdx_buggy_shutdown(), wbinvd, etc?

Not sure about this approach. When the TDX module is buggy and the page is still
accessible to the guest as a private page, even with the no-more-SEAMCALLs flag,
is it safe enough for guest_memfd/hugetlb to re-assign the page, allowing it to be
accessed as shared memory while the TD or the TDX module may still access it as
private?

> I think so. If we think the error conditions are rare enough that the cost of
> killing all TDs is acceptable, then we should do a proper POC and give it some
> scrutiny.
> 
> > 
> > If yes, then I'm onboard with this, and if we are 100% sure all memory
> > becomes re-usable by the host after all the extensive cleanup, then we
> > don't need to HWpoison anything.
> 
> For eventual upstream acceptance, we need to stop and think every time TDX
> requires special handling in generic code. This is why I wanted to clarify if
> you guys think the scenario could be in any way considered a generic one.
> (IOMMU, etc).

^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge pages
  2025-06-30 19:25                                                                               ` Ackerley Tng
  2025-06-30 21:45                                                                                 ` Edgecombe, Rick P
@ 2025-07-01  5:07                                                                                 ` Yan Zhao
  2025-07-01 22:01                                                                                   ` Ackerley Tng
  1 sibling, 1 reply; 294+ messages in thread
From: Yan Zhao @ 2025-07-01  5:07 UTC (permalink / raw)
  To: Ackerley Tng
  Cc: Edgecombe, Rick P, Shutemov, Kirill, Li, Xiaoyao,
	kvm@vger.kernel.org, Hansen, Dave, david@redhat.com,
	thomas.lendacky@amd.com, tabba@google.com, vbabka@suse.cz,
	quic_eberman@quicinc.com, michael.roth@amd.com,
	linux-kernel@vger.kernel.org, seanjc@google.com, Peng, Chao P,
	Du, Fan, Yamahata, Isaku, pbonzini@redhat.com,
	binbin.wu@linux.intel.com, Weiny, Ira, Li, Zhiquan1,
	Annapurve, Vishal, jroedel@suse.de, Miao, Jun, pgonda@google.com,
	x86@kernel.org

On Mon, Jun 30, 2025 at 12:25:49PM -0700, Ackerley Tng wrote:
> "Edgecombe, Rick P" <rick.p.edgecombe@intel.com> writes:
> 
> > On Mon, 2025-06-30 at 19:13 +0800, Yan Zhao wrote:
> >> > > ok! Lets go f/g. Unless Yan objects.
> >> I'm ok with f/g. But I have two implementation specific questions:
> >> 
> >> 1. How to set the HWPoison bit in TDX?
> 
> I was thinking to set the HWpoison flag based on page type. If regular
> 4K page, set the flag. If THP page (not (yet) supported by guest_memfd),
> set the has_hwpoison flag, and if HugeTLB page, call
> folio_set_hugetlb_hwpoison().
Could you elaborate on how to call folio_set_hugetlb_hwpoison()?

> But if we go with Rick's suggestion below, then we don't have to figure
> this out.
> 
> >> 2. Should we set this bit for non-guest-memfd pages (e.g. for S-EPT pages) ?
> >
> > Argh, I guess we can keep the existing ref count based approach for the other
> > types of TDX owned pages?
> >
> 
> Wait TDX can only use guest_memfd pages, right? Even if TDX can use
> non-guest_memfd pages, why not also set HWpoison for non-guest_memfd
> pages?
As in https://lore.kernel.org/all/aGJxU95VvQvQ3bj6@yzhao56-desk.sh.intel.com/,
I didn't find a proper interface for TDX to set the HWpoison bit on non-guest_memfd
pages.

Neither memory_failure() nor memory_failure_queue() seems to fit.

^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge pages
  2025-07-01  5:01                                                                                   ` Yan Zhao
@ 2025-07-01  5:22                                                                                     ` Vishal Annapurve
  2025-07-01  6:03                                                                                       ` Yan Zhao
  2025-07-01 16:13                                                                                     ` Edgecombe, Rick P
  1 sibling, 1 reply; 294+ messages in thread
From: Vishal Annapurve @ 2025-07-01  5:22 UTC (permalink / raw)
  To: Yan Zhao
  Cc: Edgecombe, Rick P, ackerleytng@google.com,
	quic_eberman@quicinc.com, Li, Xiaoyao, Shutemov, Kirill,
	Hansen, Dave, david@redhat.com, thomas.lendacky@amd.com,
	vbabka@suse.cz, tabba@google.com, Du, Fan, michael.roth@amd.com,
	seanjc@google.com, binbin.wu@linux.intel.com, Peng, Chao P,
	kvm@vger.kernel.org, Yamahata, Isaku,
	linux-kernel@vger.kernel.org, Weiny, Ira, pbonzini@redhat.com,
	Li, Zhiquan1, jroedel@suse.de, Miao, Jun, pgonda@google.com,
	x86@kernel.org

On Mon, Jun 30, 2025 at 10:04 PM Yan Zhao <yan.y.zhao@intel.com> wrote:
>
> On Tue, Jul 01, 2025 at 05:45:54AM +0800, Edgecombe, Rick P wrote:
> > On Mon, 2025-06-30 at 12:25 -0700, Ackerley Tng wrote:
> > > > So for this we can do something similar. Have the arch/x86 side of TDX grow
> > > > a
> > > > new tdx_buggy_shutdown(). Have it do an all-cpu IPI to kick CPUs out of
> > > > SEAMMODE, wbinvd, and set a "no more seamcalls" bool. Then any SEAMCALLs
> > > > after
> > > > that will return a TDX_BUGGY_SHUTDOWN error, or similar. All TDs in the
> > > > system
> > > > die. Zap/cleanup paths return success in the buggy shutdown case.
> > > >
> > >
> > > Do you mean that on unmap/split failure:
> >
> > Maybe Yan can clarify here. I thought the HWpoison scenario was about TDX module
> My thinking is to set HWPoison to private pages whenever KVM_BUG_ON() was hit in
> TDX. i.e., when the page is still mapped in S-EPT but the TD is bugged on and
> about to tear down.
>
> So, it could be due to KVM or TDX module bugs, which retries can't help.
>
> > bugs. Not TDX busy errors, demote failures, etc. If there are "normal" failures,
> > like the ones that can be fixed with retries, then I think HWPoison is not a
> > good option though.
> >
> > >  there is a way to make 100%
> > > sure all memory becomes re-usable by the rest of the host, using
> > > tdx_buggy_shutdown(), wbinvd, etc?
>
> Not sure about this approach. When TDX module is buggy and the page is still
> accessible to guest as private pages, even with no-more SEAMCALLs flag, is it
> safe enough for guest_memfd/hugetlb to re-assign the page to allow simultaneous
> access in shared memory with potential private access from TD or TDX module?

If no more seamcalls are allowed and all cpus are made to exit SEAM
mode then how can there be potential private access from TD or TDX
module?

>
> > I think so. If we think the error conditions are rare enough that the cost of
> > killing all TDs is acceptable, then we should do a proper POC and give it some
> > scrutiny.
> >
> > >
> > > If yes, then I'm onboard with this, and if we are 100% sure all memory
> > > becomes re-usable by the host after all the extensive cleanup, then we
> > > don't need to HWpoison anything.
> >
> > For eventual upstream acceptance, we need to stop and think every time TDX
> > requires special handling in generic code. This is why I wanted to clarify if
> > you guys think the scenario could be in any way considered a generic one.
> > (IOMMU, etc).

^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge pages
  2025-07-01  5:22                                                                                     ` Vishal Annapurve
@ 2025-07-01  6:03                                                                                       ` Yan Zhao
  2025-07-01  7:13                                                                                         ` Vishal Annapurve
  2025-07-01 22:09                                                                                         ` Ackerley Tng
  0 siblings, 2 replies; 294+ messages in thread
From: Yan Zhao @ 2025-07-01  6:03 UTC (permalink / raw)
  To: Vishal Annapurve
  Cc: Edgecombe, Rick P, ackerleytng@google.com,
	quic_eberman@quicinc.com, Li, Xiaoyao, Shutemov, Kirill,
	Hansen, Dave, david@redhat.com, thomas.lendacky@amd.com,
	vbabka@suse.cz, tabba@google.com, Du, Fan, michael.roth@amd.com,
	seanjc@google.com, binbin.wu@linux.intel.com, Peng, Chao P,
	kvm@vger.kernel.org, Yamahata, Isaku,
	linux-kernel@vger.kernel.org, Weiny, Ira, pbonzini@redhat.com,
	Li, Zhiquan1, jroedel@suse.de, Miao, Jun, pgonda@google.com,
	x86@kernel.org

On Mon, Jun 30, 2025 at 10:22:26PM -0700, Vishal Annapurve wrote:
> On Mon, Jun 30, 2025 at 10:04 PM Yan Zhao <yan.y.zhao@intel.com> wrote:
> >
> > On Tue, Jul 01, 2025 at 05:45:54AM +0800, Edgecombe, Rick P wrote:
> > > On Mon, 2025-06-30 at 12:25 -0700, Ackerley Tng wrote:
> > > > > So for this we can do something similar. Have the arch/x86 side of TDX grow
> > > > > a
> > > > > new tdx_buggy_shutdown(). Have it do an all-cpu IPI to kick CPUs out of
> > > > > SEAMMODE, wbinvd, and set a "no more seamcalls" bool. Then any SEAMCALLs
> > > > > after
> > > > > that will return a TDX_BUGGY_SHUTDOWN error, or similar. All TDs in the
> > > > > system
> > > > > die. Zap/cleanup paths return success in the buggy shutdown case.
> > > > >
> > > >
> > > > Do you mean that on unmap/split failure:
> > >
> > > Maybe Yan can clarify here. I thought the HWpoison scenario was about TDX module
> > My thinking is to set HWPoison to private pages whenever KVM_BUG_ON() was hit in
> > TDX. i.e., when the page is still mapped in S-EPT but the TD is bugged on and
> > about to tear down.
> >
> > So, it could be due to KVM or TDX module bugs, which retries can't help.
> >
> > > bugs. Not TDX busy errors, demote failures, etc. If there are "normal" failures,
> > > like the ones that can be fixed with retries, then I think HWPoison is not a
> > > good option though.
> > >
> > > >  there is a way to make 100%
> > > > sure all memory becomes re-usable by the rest of the host, using
> > > > tdx_buggy_shutdown(), wbinvd, etc?
> >
> > Not sure about this approach. When TDX module is buggy and the page is still
> > accessible to guest as private pages, even with no-more SEAMCALLs flag, is it
> > safe enough for guest_memfd/hugetlb to re-assign the page to allow simultaneous
> > access in shared memory with potential private access from TD or TDX module?
> 
> If no more seamcalls are allowed and all cpus are made to exit SEAM
> mode then how can there be potential private access from TD or TDX
> module?
Not sure. As Kirill said "TDX module has creative ways to corrupt it"
https://lore.kernel.org/all/zlxgzuoqwrbuf54wfqycnuxzxz2yduqtsjinr5uq4ss7iuk2rt@qaaolzwsy6ki/.

Or, could TDX just set a page flag, like what is done for XEN

        /* XEN */
        /* Pinned in Xen as a read-only pagetable page. */
        PG_pinned = PG_owner_priv_1,

e.g.
	PG_tdx_firmware_access = PG_owner_priv_1,

Then, guest_memfd checks this flag on every zap and replaces it with PG_hwpoison
on behalf of TDX?
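
Something like the below is what I have in mind (purely illustrative; none of
these names exist today, and whether setting PG_hwpoison directly like this is
acceptable is exactly the open question):

	/* include/linux/page-flags.h, aliased like the Xen flag above */
	PG_tdx_firmware_access = PG_owner_priv_1,

	PAGEFLAG(TdxFirmwareAccess, tdx_firmware_access, PF_ANY)

	/* TDX side, when a page is left mapped after a KVM_BUG_ON() */
	SetPageTdxFirmwareAccess(page);

	/* guest_memfd side, on every zap of a private page */
	if (PageTdxFirmwareAccess(page)) {
		ClearPageTdxFirmwareAccess(page);
		SetPageHWPoison(page);	/* keep the page out of circulation */
	}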

^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge pages
  2025-07-01  6:03                                                                                       ` Yan Zhao
@ 2025-07-01  7:13                                                                                         ` Vishal Annapurve
  2025-07-01 14:15                                                                                           ` Edgecombe, Rick P
  2025-07-01 22:09                                                                                         ` Ackerley Tng
  1 sibling, 1 reply; 294+ messages in thread
From: Vishal Annapurve @ 2025-07-01  7:13 UTC (permalink / raw)
  To: Yan Zhao
  Cc: Edgecombe, Rick P, ackerleytng@google.com,
	quic_eberman@quicinc.com, Li, Xiaoyao, Shutemov, Kirill,
	Hansen, Dave, david@redhat.com, thomas.lendacky@amd.com,
	vbabka@suse.cz, tabba@google.com, Du, Fan, michael.roth@amd.com,
	seanjc@google.com, binbin.wu@linux.intel.com, Peng, Chao P,
	kvm@vger.kernel.org, Yamahata, Isaku,
	linux-kernel@vger.kernel.org, Weiny, Ira, pbonzini@redhat.com,
	Li, Zhiquan1, jroedel@suse.de, Miao, Jun, pgonda@google.com,
	x86@kernel.org

On Mon, Jun 30, 2025 at 11:06 PM Yan Zhao <yan.y.zhao@intel.com> wrote:
>
> On Mon, Jun 30, 2025 at 10:22:26PM -0700, Vishal Annapurve wrote:
> > On Mon, Jun 30, 2025 at 10:04 PM Yan Zhao <yan.y.zhao@intel.com> wrote:
> > >
> > > On Tue, Jul 01, 2025 at 05:45:54AM +0800, Edgecombe, Rick P wrote:
> > > > On Mon, 2025-06-30 at 12:25 -0700, Ackerley Tng wrote:
> > > > > > So for this we can do something similar. Have the arch/x86 side of TDX grow
> > > > > > a
> > > > > > new tdx_buggy_shutdown(). Have it do an all-cpu IPI to kick CPUs out of
> > > > > > SEAMMODE, wbinvd, and set a "no more seamcalls" bool. Then any SEAMCALLs
> > > > > > after
> > > > > > that will return a TDX_BUGGY_SHUTDOWN error, or similar. All TDs in the
> > > > > > system
> > > > > > die. Zap/cleanup paths return success in the buggy shutdown case.
> > > > > >
> > > > >
> > > > > Do you mean that on unmap/split failure:
> > > >
> > > > Maybe Yan can clarify here. I thought the HWpoison scenario was about TDX module
> > > My thinking is to set HWPoison to private pages whenever KVM_BUG_ON() was hit in
> > > TDX. i.e., when the page is still mapped in S-EPT but the TD is bugged on and
> > > about to tear down.
> > >
> > > So, it could be due to KVM or TDX module bugs, which retries can't help.
> > >
> > > > bugs. Not TDX busy errors, demote failures, etc. If there are "normal" failures,
> > > > like the ones that can be fixed with retries, then I think HWPoison is not a
> > > > good option though.
> > > >
> > > > >  there is a way to make 100%
> > > > > sure all memory becomes re-usable by the rest of the host, using
> > > > > tdx_buggy_shutdown(), wbinvd, etc?
> > >
> > > Not sure about this approach. When TDX module is buggy and the page is still
> > > accessible to guest as private pages, even with no-more SEAMCALLs flag, is it
> > > safe enough for guest_memfd/hugetlb to re-assign the page to allow simultaneous
> > > access in shared memory with potential private access from TD or TDX module?
> >
> > If no more seamcalls are allowed and all cpus are made to exit SEAM
> > mode then how can there be potential private access from TD or TDX
> > module?
> Not sure. As Kirill said "TDX module has creative ways to corrupt it"
> https://lore.kernel.org/all/zlxgzuoqwrbuf54wfqycnuxzxz2yduqtsjinr5uq4ss7iuk2rt@qaaolzwsy6ki/.

I would assume that would be true only if TDX module logic is allowed
to execute. Otherwise it would be useful to understand these
"creative" ways better.

^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge pages
  2025-06-30 17:55                                                                             ` Edgecombe, Rick P
  2025-06-30 19:25                                                                               ` Ackerley Tng
  2025-06-30 21:47                                                                               ` Vishal Annapurve
@ 2025-07-01  9:35                                                                               ` Yan Zhao
  2025-07-01 13:32                                                                                 ` Vishal Annapurve
  2 siblings, 1 reply; 294+ messages in thread
From: Yan Zhao @ 2025-07-01  9:35 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: ackerleytng@google.com, Shutemov, Kirill, Li, Xiaoyao,
	kvm@vger.kernel.org, Hansen, Dave, david@redhat.com,
	thomas.lendacky@amd.com, tabba@google.com, vbabka@suse.cz,
	quic_eberman@quicinc.com, michael.roth@amd.com,
	linux-kernel@vger.kernel.org, seanjc@google.com, Peng, Chao P,
	Du, Fan, Yamahata, Isaku, pbonzini@redhat.com,
	binbin.wu@linux.intel.com, Weiny, Ira, Li, Zhiquan1,
	Annapurve, Vishal, jroedel@suse.de, Miao, Jun, pgonda@google.com,
	x86@kernel.org

On Tue, Jul 01, 2025 at 01:55:43AM +0800, Edgecombe, Rick P wrote:
> So for this we can do something similar. Have the arch/x86 side of TDX grow a
> new tdx_buggy_shutdown(). Have it do an all-cpu IPI to kick CPUs out of
> SEAMMODE, wbinvd, and set a "no more seamcalls" bool. Then any SEAMCALLs after
> that will return a TDX_BUGGY_SHUTDOWN error, or similar. All TDs in the system
> die. Zap/cleanup paths return success in the buggy shutdown case.
Having all TDs in the system die could be too severe for unmap errors due to KVM bugs.

> Does it fit? Or, can you guys argue that the failures here are actually non-
> special cases that are worth more complex recovery? I remember we talked about
> IOMMU patterns that are similar, but it seems like the remaining cases under
> discussion are about TDX bugs.
I didn't mention TDX connect previously to avoid introducing unnecessary
complexity.

For TDX connect, S-EPT is used for private mappings in IOMMU. Unmap could
therefore fail due to pages being pinned for DMA.

So, my thinking was that if that happens, KVM could set a special flag on folios
pinned for private DMA.

Then guest_memfd could check this special flag before allowing private-to-shared
conversion or a punch hole, and choose to poison or leak the folio.

Otherwise, if we choose tdx_buggy_shutdown() to "do an all-cpu IPI to kick CPUs
out of SEAMMODE, wbinvd, and set a "no more seamcalls" bool", DMAs may still
have access to the private pages mapped in the S-EPT.
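
For the special-flag idea, the check I have in mind would sit in guest_memfd's
conversion/truncate path, roughly like this (purely illustrative; the flag test is
hypothetical, and how the folio then gets poisoned or leaked is the open question
above):

	/* guest_memfd, before private-to-shared conversion or punch hole */
	if (folio_test_tdx_private_dma(folio)) {	/* hypothetical flag */
		/* Still reachable via private DMA: refuse to free/convert. */
		return -EHWPOISON;
	}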





^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge pages
  2025-07-01  9:35                                                                               ` Yan Zhao
@ 2025-07-01 13:32                                                                                 ` Vishal Annapurve
  2025-07-01 14:02                                                                                   ` Vishal Annapurve
                                                                                                     ` (2 more replies)
  0 siblings, 3 replies; 294+ messages in thread
From: Vishal Annapurve @ 2025-07-01 13:32 UTC (permalink / raw)
  To: Yan Zhao
  Cc: Edgecombe, Rick P, ackerleytng@google.com, Shutemov, Kirill,
	Li, Xiaoyao, kvm@vger.kernel.org, Hansen, Dave, david@redhat.com,
	thomas.lendacky@amd.com, tabba@google.com, vbabka@suse.cz,
	quic_eberman@quicinc.com, michael.roth@amd.com,
	linux-kernel@vger.kernel.org, seanjc@google.com, Peng, Chao P,
	Du, Fan, Yamahata, Isaku, pbonzini@redhat.com,
	binbin.wu@linux.intel.com, Weiny, Ira, Li, Zhiquan1,
	jroedel@suse.de, Miao, Jun, pgonda@google.com, x86@kernel.org

On Tue, Jul 1, 2025 at 2:38 AM Yan Zhao <yan.y.zhao@intel.com> wrote:
>
> On Tue, Jul 01, 2025 at 01:55:43AM +0800, Edgecombe, Rick P wrote:
> > So for this we can do something similar. Have the arch/x86 side of TDX grow a
> > new tdx_buggy_shutdown(). Have it do an all-cpu IPI to kick CPUs out of
> > SEAMMODE, wbinvd, and set a "no more seamcalls" bool. Then any SEAMCALLs after
> > that will return a TDX_BUGGY_SHUTDOWN error, or similar. All TDs in the system
> > die. Zap/cleanup paths return success in the buggy shutdown case.
> All TDs in the system die could be too severe for unmap errors due to KVM bugs.

At this point, I don't see a way to quantify how bad a KVM bug can get
unless you have explicit ideas about the severity. We should work on
minimizing KVM-side bugs too, and assuming it would be a rare
occurrence, I think it's OK to take this intrusive measure.

>
> > Does it fit? Or, can you guys argue that the failures here are actually non-
> > special cases that are worth more complex recovery? I remember we talked about
> > IOMMU patterns that are similar, but it seems like the remaining cases under
> > discussion are about TDX bugs.
> I didn't mention TDX connect previously to avoid introducing unnecessary
> complexity.
>
> For TDX connect, S-EPT is used for private mappings in IOMMU. Unmap could
> therefore fail due to pages being pinned for DMA.

We are discussing this scenario already[1], where the host will not
pin the pages used by secure DMA for the same reasons why we can't
have KVM pin the guest_memfd pages mapped in SEPT. Is there some other
kind of pinning you are referring to?

If there is an ordering in which pages should be unmapped e.g. first
in secure IOMMU and then KVM SEPT, then we can ensure the right
ordering between invalidation callbacks from guest_memfd.

[1] https://lore.kernel.org/lkml/CAGtprH_qh8sEY3s-JucW3n1Wvoq7jdVZDDokvG5HzPf0HV2=pg@mail.gmail.com/#t

>
> So, my thinking was that if that happens, KVM could set a special flag to folios
> pinned for private DMA.
>
> Then guest_memfd could check the special flag before allowing private-to-shared
> conversion, or punch hole.
> guest_memfd could check this special flag and choose to poison or leak the
> folio.
>
> Otherwise, if we choose tdx_buggy_shutdown() to "do an all-cpu IPI to kick CPUs
> out of SEAMMODE, wbivnd, and set a "no more seamcalls" bool", DMAs may still
> have access to the private pages mapped in S-EPT.

guest_memfd will have to ensure that pages are unmapped from secure
IOMMU pagetables before allowing them to be used by the host.

If secure IOMMU pagetables unmapping fails, I would assume it fails in
the similar category of rare "KVM/TDX module/IOMMUFD" bug and I think
it makes sense to do the same tdx_buggy_shutdown() with such failures
as well.

^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge pages
  2025-07-01 13:32                                                                                 ` Vishal Annapurve
@ 2025-07-01 14:02                                                                                   ` Vishal Annapurve
  2025-07-01 15:42                                                                                     ` Edgecombe, Rick P
  2025-07-01 16:14                                                                                   ` Edgecombe, Rick P
  2025-07-02  8:54                                                                                   ` Yan Zhao
  2 siblings, 1 reply; 294+ messages in thread
From: Vishal Annapurve @ 2025-07-01 14:02 UTC (permalink / raw)
  To: Yan Zhao
  Cc: Edgecombe, Rick P, ackerleytng@google.com, Shutemov, Kirill,
	Li, Xiaoyao, kvm@vger.kernel.org, Hansen, Dave, david@redhat.com,
	thomas.lendacky@amd.com, tabba@google.com, vbabka@suse.cz,
	quic_eberman@quicinc.com, michael.roth@amd.com,
	linux-kernel@vger.kernel.org, seanjc@google.com, Peng, Chao P,
	Du, Fan, Yamahata, Isaku, pbonzini@redhat.com,
	binbin.wu@linux.intel.com, Weiny, Ira, Li, Zhiquan1,
	jroedel@suse.de, Miao, Jun, pgonda@google.com, x86@kernel.org

On Tue, Jul 1, 2025 at 6:32 AM Vishal Annapurve <vannapurve@google.com> wrote:
>
> On Tue, Jul 1, 2025 at 2:38 AM Yan Zhao <yan.y.zhao@intel.com> wrote:
> >
> > On Tue, Jul 01, 2025 at 01:55:43AM +0800, Edgecombe, Rick P wrote:
> > > So for this we can do something similar. Have the arch/x86 side of TDX grow a
> > > new tdx_buggy_shutdown(). Have it do an all-cpu IPI to kick CPUs out of
> > > SEAMMODE, wbinvd, and set a "no more seamcalls" bool. Then any SEAMCALLs after
> > > that will return a TDX_BUGGY_SHUTDOWN error, or similar. All TDs in the system
> > > die. Zap/cleanup paths return success in the buggy shutdown case.
> > All TDs in the system die could be too severe for unmap errors due to KVM bugs.
>
> At this point, I don't see a way to quantify how bad a KVM bug can get
> unless you have explicit ideas about the severity. We should work on
> minimizing KVM side bugs too and assuming it would be a rare
> occurrence I think it's ok to take this intrusive measure.
>
> >
> > > Does it fit? Or, can you guys argue that the failures here are actually non-
> > > special cases that are worth more complex recovery? I remember we talked about
> > > IOMMU patterns that are similar, but it seems like the remaining cases under
> > > discussion are about TDX bugs.
> > I didn't mention TDX connect previously to avoid introducing unnecessary
> > complexity.
> >
> > For TDX connect, S-EPT is used for private mappings in IOMMU. Unmap could
> > therefore fail due to pages being pinned for DMA.
>
> We are discussing this scenario already[1], where the host will not
> pin the pages used by secure DMA for the same reasons why we can't
> have KVM pin the guest_memfd pages mapped in SEPT. Is there some other
> kind of pinning you are referring to?
>
> If there is an ordering in which pages should be unmapped e.g. first
> in secure IOMMU and then KVM SEPT, then we can ensure the right
> ordering between invalidation callbacks from guest_memfd.
>
> [1] https://lore.kernel.org/lkml/CAGtprH_qh8sEY3s-JucW3n1Wvoq7jdVZDDokvG5HzPf0HV2=pg@mail.gmail.com/#t
>
> >
> > So, my thinking was that if that happens, KVM could set a special flag to folios
> > pinned for private DMA.
> >
> > Then guest_memfd could check the special flag before allowing private-to-shared
> > conversion, or punch hole.
> > guest_memfd could check this special flag and choose to poison or leak the
> > folio.
> >
> > Otherwise, if we choose tdx_buggy_shutdown() to "do an all-cpu IPI to kick CPUs
> > out of SEAMMODE, wbivnd, and set a "no more seamcalls" bool", DMAs may still
> > have access to the private pages mapped in S-EPT.
>
> guest_memfd will have to ensure that pages are unmapped from secure
> IOMMU pagetables before allowing them to be used by the host.
>
> If secure IOMMU pagetables unmapping fails, I would assume it fails in
> the similar category of rare "KVM/TDX module/IOMMUFD" bug and I think
> it makes sense to do the same tdx_buggy_shutdown() with such failures
> as well.

In addition, we will need a way to fail all further Secure IOMMU table
walks, or some way to stop the active secure DMA by unbinding all the
TDIs. Maybe such scenarios warrant a BUG_ON() if recovery is not
possible, as any or all of KVM, IOMMUFD, and the TDX module can no
longer be trusted to function reliably.

^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge pages
  2025-07-01  7:13                                                                                         ` Vishal Annapurve
@ 2025-07-01 14:15                                                                                           ` Edgecombe, Rick P
  0 siblings, 0 replies; 294+ messages in thread
From: Edgecombe, Rick P @ 2025-07-01 14:15 UTC (permalink / raw)
  To: Annapurve, Vishal, Zhao, Yan Y
  Cc: quic_eberman@quicinc.com, Li, Xiaoyao, kvm@vger.kernel.org,
	Hansen, Dave, david@redhat.com, thomas.lendacky@amd.com,
	vbabka@suse.cz, tabba@google.com, Shutemov, Kirill,
	michael.roth@amd.com, binbin.wu@linux.intel.com,
	seanjc@google.com, Peng, Chao P, Du, Fan, Yamahata, Isaku,
	ackerleytng@google.com, Weiny, Ira, pbonzini@redhat.com,
	linux-kernel@vger.kernel.org, jroedel@suse.de, Miao, Jun,
	Li, Zhiquan1, pgonda@google.com, x86@kernel.org

On Tue, 2025-07-01 at 00:13 -0700, Vishal Annapurve wrote:
> > Not sure. As Kirill said "TDX module has creative ways to corrupt it"
> > https://lore.kernel.org/all/zlxgzuoqwrbuf54wfqycnuxzxz2yduqtsjinr5uq4ss7iuk2rt@qaaolzwsy6ki/
> > .
> 
> I would assume that would be true only if TDX module logic is allowed
> to execute. Otherwise it would be useful to understand these
> "creative" ways better.

The "no more SEAMCALLs = no more corruption" assumption is what the host kexec
patches are based on.
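
Roughly what I have in mind, as a sketch (all names are placeholders; real code
would also need to serialize against in-flight SEAMCALLs):

static atomic_t tdx_no_more_seamcalls;

static void tdx_shutdown_ipi(void *unused)
{
	/* By the time this runs, the CPU is back in VMX root, out of SEAM. */
	wbinvd();
}

void tdx_buggy_shutdown(void)
{
	atomic_set(&tdx_no_more_seamcalls, 1);
	on_each_cpu(tdx_shutdown_ipi, NULL, 1);	/* IPI + wbinvd everywhere */
}

/* ...and in the common SEAMCALL path: */
	if (atomic_read(&tdx_no_more_seamcalls))
		return TDX_BUGGY_SHUTDOWN;	/* placeholder error code */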

^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 09/21] KVM: TDX: Enable 2MB mapping size after TD is RUNNABLE
  2025-07-01  2:41                                                               ` Yan Zhao
@ 2025-07-01 15:36                                                                 ` Edgecombe, Rick P
  2025-07-02  0:12                                                                   ` Yan Zhao
  0 siblings, 1 reply; 294+ messages in thread
From: Edgecombe, Rick P @ 2025-07-01 15:36 UTC (permalink / raw)
  To: Zhao, Yan Y
  Cc: Du, Fan, Li, Xiaoyao, Huang, Kai, Shutemov, Kirill, Hansen, Dave,
	david@redhat.com, thomas.lendacky@amd.com, vbabka@suse.cz,
	Li, Zhiquan1, quic_eberman@quicinc.com, michael.roth@amd.com,
	seanjc@google.com, Weiny, Ira, pbonzini@redhat.com, Peng, Chao P,
	Yamahata, Isaku, linux-kernel@vger.kernel.org,
	binbin.wu@linux.intel.com, ackerleytng@google.com,
	kvm@vger.kernel.org, Annapurve, Vishal, tabba@google.com,
	jroedel@suse.de, Miao, Jun, pgonda@google.com, x86@kernel.org

On Tue, 2025-07-01 at 10:41 +0800, Yan Zhao wrote:
> > Can you explain what you found regarding the write lock need?
> Here, the write lock protects 2 steps:
> (1) update lpage_info.
> (2) try splitting if there's any existing 2MB mapping.
> 
> The write mmu_lock is needed because lpage_info is read under read mmu_lock in
> kvm_tdp_mmu_map().
> 
> kvm_tdp_mmu_map
>   kvm_mmu_hugepage_adjust
>     kvm_lpage_info_max_mapping_level
> 
> If we update the lpage_info with read mmu_lock, the other vCPUs may map at a
> stale 2MB level even after lpage_info is updated by
> hugepage_set_guest_inhibit().
> 
> Therefore, we must perform splitting under the write mmu_lock to ensure there
> are no 2MB mappings after hugepage_set_guest_inhibit().
> 
> Otherwise, during later mapping in __vmx_handle_ept_violation(), splitting at
> fault path could be triggered as KVM MMU finds the goal level is 4KB while an
> existing 2MB mapping is present.

It could be?
1. mmu read lock
2. update lpage_info
3. mmu write lock upgrade
4. demote
5. mmu unlock

Then (3) could be skipped in the case of ability to demote under read lock?
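
I.e., something like (a sketch; mapping_is_2m() is made up here, the rest reuses
the helpers and gfn_range from your draft, and since rwlock_t has no real upgrade,
step 3 below is drop-read + take-write):

	read_lock(&kvm->mmu_lock);				/* 1 */
	hugepage_set_guest_inhibit(slot, gfn, PG_LEVEL_2M);	/* 2 */
	if (mapping_is_2m(kvm, gfn)) {
		read_unlock(&kvm->mmu_lock);
		write_lock(&kvm->mmu_lock);			/* 3 */
		ret = kvm_split_boundary_leafs(kvm, &gfn_range);	/* 4 */
		write_unlock(&kvm->mmu_lock);			/* 5 */
	} else {
		read_unlock(&kvm->mmu_lock);			/* 5 */
	}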

I noticed that the other lpage_info updaters took mmu write lock, and I wasn't
sure why. We shouldn't take a lock that we don't actually need just for safety
margin or to copy other code.

> 
> 
> > For most accept
> > cases, we could fault in the PTE's on the read lock. And in the future we
> > could
> 
> The actual mapping at 4KB level is still with read mmu_lock in
> __vmx_handle_ept_violation().
> 
> > have a demote that could work under read lock, as we talked. So
> > kvm_split_boundary_leafs() often or could be unneeded or work under read
> > lock
> > when needed.
> Could we leave the "demote under read lock" as a future optimization? 

We could add it to the list. If we have a TDX module that supports demote with a
single SEAMCALL then we don't have the rollback problem. The optimization could
utilize that. That said, we should focus on the optimizations that make the
biggest difference to real TDs. Your data suggests this might not be the case
today. 

> 
> 
> > What is the problem in hugepage_set_guest_inhibit() that requires the write
> > lock?
> As above, to avoid the other vCPUs reading stale mapping level and splitting
> under read mmu_lock.

We need mmu write lock for demote, but as long as the order is:
1. set lpage_info
2. demote if needed
3. go to fault handler

Then (3) should have what it needs even if another fault races (1).

> 
> As guest_inhibit is set one-way, we could test it using
> hugepage_test_guest_inhibit() without holding the lock. The chance to hold
> write
> mmu_lock for hugepage_set_guest_inhibit() is then greatly reduced.
> (in my testing, 11 during VM boot).
>  
> > But in any case, it seems like we have *a* solution here. It doesn't seem
> > like
> > there are any big downsides. Should we close it?
> I think it's good, as long as Sean doesn't disagree :)

He seemed onboard. Let's close it. We can even discuss lpage_info update locking
on v2.

^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge pages
  2025-07-01 14:02                                                                                   ` Vishal Annapurve
@ 2025-07-01 15:42                                                                                     ` Edgecombe, Rick P
  0 siblings, 0 replies; 294+ messages in thread
From: Edgecombe, Rick P @ 2025-07-01 15:42 UTC (permalink / raw)
  To: Annapurve, Vishal, Zhao, Yan Y
  Cc: Shutemov, Kirill, Li, Xiaoyao, Du, Fan, Hansen, Dave,
	david@redhat.com, thomas.lendacky@amd.com, tabba@google.com,
	vbabka@suse.cz, kvm@vger.kernel.org, michael.roth@amd.com,
	linux-kernel@vger.kernel.org, seanjc@google.com, Peng, Chao P,
	quic_eberman@quicinc.com, Yamahata, Isaku, ackerleytng@google.com,
	binbin.wu@linux.intel.com, Weiny, Ira, pbonzini@redhat.com,
	Li, Zhiquan1, jroedel@suse.de, Miao, Jun, pgonda@google.com,
	x86@kernel.org

On Tue, 2025-07-01 at 07:02 -0700, Vishal Annapurve wrote:
> > guest_memfd will have to ensure that pages are unmapped from secure
> > IOMMU pagetables before allowing them to be used by the host.
> > 
> > If secure IOMMU pagetables unmapping fails, I would assume it fails in
> > the similar category of rare "KVM/TDX module/IOMMUFD" bug and I think
> > it makes sense to do the same tdx_buggy_shutdown() with such failures
> > as well.
> 
> In addition we will need a way to fail all further Secure IOMMU table
> walks or some way to stop the active secure DMA by unbinding all the
> TDIs. Maybe such scenarios warrant a BUG_ON() if recovery is not
> possible as possibly any or all of the KVM/IOMMUFD/TDX module can't be
> trusted for reliable functionality anymore.

I mentioned this on another thread. Normal kernel BUG_ON()'s need extreme
justification. As long as the system might survive, they shouldn't be used.


^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge pages
  2025-07-01  5:01                                                                                   ` Yan Zhao
  2025-07-01  5:22                                                                                     ` Vishal Annapurve
@ 2025-07-01 16:13                                                                                     ` Edgecombe, Rick P
  2025-07-01 21:48                                                                                       ` Ackerley Tng
  2025-07-02  9:08                                                                                       ` Yan Zhao
  1 sibling, 2 replies; 294+ messages in thread
From: Edgecombe, Rick P @ 2025-07-01 16:13 UTC (permalink / raw)
  To: Zhao, Yan Y
  Cc: quic_eberman@quicinc.com, Li, Xiaoyao, kvm@vger.kernel.org,
	Hansen, Dave, david@redhat.com, thomas.lendacky@amd.com,
	vbabka@suse.cz, tabba@google.com, Shutemov, Kirill,
	michael.roth@amd.com, binbin.wu@linux.intel.com,
	seanjc@google.com, Peng, Chao P, Du, Fan, Yamahata, Isaku,
	ackerleytng@google.com, Weiny, Ira, pbonzini@redhat.com,
	linux-kernel@vger.kernel.org, Annapurve, Vishal, jroedel@suse.de,
	Miao, Jun, Li, Zhiquan1, pgonda@google.com, x86@kernel.org

On Tue, 2025-07-01 at 13:01 +0800, Yan Zhao wrote:
> > Maybe Yan can clarify here. I thought the HWpoison scenario was about TDX
> > module
> My thinking is to set HWPoison to private pages whenever KVM_BUG_ON() was hit
> in
> TDX. i.e., when the page is still mapped in S-EPT but the TD is bugged on and
> about to tear down.
> 
> So, it could be due to KVM or TDX module bugs, which retries can't help.

We were going to call back into guestmemfd for this, right? Not set it inside
KVM code.

What about a kvm_gmem_buggy_cleanup() instead of the system wide one. KVM calls
it and then proceeds to bug the TD only from the KVM side. It's not as safe for
the system, because who knows what a buggy TDX module could do. But TDX module
could also be buggy without the kernel catching wind of it.

Having a single callback to basically bug the fd would solve the atomic context
issue. Then guestmemfd could dump the entire fd into memory_failure() instead of
returning the pages. And developers could respond by fixing the bug.

IMO maintainability needs to be balanced with efforts to minimize the fallout
from bugs. In the end a system that is too complex is going to have more bugs
anyway.

> 
> > bugs. Not TDX busy errors, demote failures, etc. If there are "normal"
> > failures,
> > like the ones that can be fixed with retries, then I think HWPoison is not a
> > good option though.
> > 
> > >   there is a way to make 100%
> > > sure all memory becomes re-usable by the rest of the host, using
> > > tdx_buggy_shutdown(), wbinvd, etc?
> 
> Not sure about this approach. When TDX module is buggy and the page is still
> accessible to guest as private pages, even with no-more SEAMCALLs flag, is it
> safe enough for guest_memfd/hugetlb to re-assign the page to allow
> simultaneous
> access in shared memory with potential private access from TD or TDX module?

With the no-more-seamcalls approach it should be safe (for the system). This is
essentially what we are doing for kexec.
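
To be concrete about how little machinery I have in mind, a rough sketch (every
name below is made up for illustration, none of it is existing API):

/* Sketch only: a one-way, global "no more SEAMCALLs" switch. */
static atomic_t tdx_shutdown = ATOMIC_INIT(0);

static void tdx_shutdown_cpu(void *unused)
{
	/* Placeholder for kicking this CPU out of SEAM mode, then flush caches. */
	wbinvd();
}

void tdx_buggy_shutdown(void)
{
	/* Only the first caller does the work. */
	if (atomic_xchg(&tdx_shutdown, 1))
		return;
	on_each_cpu(tdx_shutdown_cpu, NULL, 1);
}

/* ...and the SEAMCALL wrappers bail out early: */
if (atomic_read(&tdx_shutdown))
	return TDX_BUGGY_SHUTDOWN;	/* hypothetical error code */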

^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge pages
  2025-07-01 13:32                                                                                 ` Vishal Annapurve
  2025-07-01 14:02                                                                                   ` Vishal Annapurve
@ 2025-07-01 16:14                                                                                   ` Edgecombe, Rick P
  2025-07-02  8:54                                                                                   ` Yan Zhao
  2 siblings, 0 replies; 294+ messages in thread
From: Edgecombe, Rick P @ 2025-07-01 16:14 UTC (permalink / raw)
  To: Annapurve, Vishal, Zhao, Yan Y
  Cc: Shutemov, Kirill, Li, Xiaoyao, Du, Fan, Hansen, Dave,
	david@redhat.com, thomas.lendacky@amd.com, tabba@google.com,
	vbabka@suse.cz, kvm@vger.kernel.org, michael.roth@amd.com,
	linux-kernel@vger.kernel.org, seanjc@google.com, Peng, Chao P,
	quic_eberman@quicinc.com, Yamahata, Isaku, ackerleytng@google.com,
	binbin.wu@linux.intel.com, Weiny, Ira, pbonzini@redhat.com,
	Li, Zhiquan1, jroedel@suse.de, Miao, Jun, pgonda@google.com,
	x86@kernel.org

On Tue, 2025-07-01 at 06:32 -0700, Vishal Annapurve wrote:
> On Tue, Jul 1, 2025 at 2:38 AM Yan Zhao <yan.y.zhao@intel.com> wrote:
> > 
> > On Tue, Jul 01, 2025 at 01:55:43AM +0800, Edgecombe, Rick P wrote:
> > > So for this we can do something similar. Have the arch/x86 side of TDX grow a
> > > new tdx_buggy_shutdown(). Have it do an all-cpu IPI to kick CPUs out of
> > > SEAMMODE, wbinvd, and set a "no more seamcalls" bool. Then any SEAMCALLs after
> > > that will return a TDX_BUGGY_SHUTDOWN error, or similar. All TDs in the system
> > > die. Zap/cleanup paths return success in the buggy shutdown case.
> > All TDs in the system die could be too severe for unmap errors due to KVM bugs.
> 
> At this point, I don't see a way to quantify how bad a KVM bug can get
> unless you have explicit ideas about the severity. We should work on
> minimizing KVM side bugs too and assuming it would be a rare
> occurrence I think it's ok to take this intrusive measure.

Yes, it does seem on the line of "too severe". But keeping a list of pages to
release in a non-atomic context seems too complex for an error case that (still
not 100% clear) is theoretical.

In the argument of it's too severe, it's close to a BUG_ON() for the TDX side of
the kernel. But on the argument of it's not too severe, the system remains
stable.

> 
> > 
> > > Does it fit? Or, can you guys argue that the failures here are actually non-
> > > special cases that are worth more complex recovery? I remember we talked about
> > > IOMMU patterns that are similar, but it seems like the remaining cases under
> > > discussion are about TDX bugs.
> > I didn't mention TDX connect previously to avoid introducing unnecessary
> > complexity.
> > 
> > For TDX connect, S-EPT is used for private mappings in IOMMU. Unmap could
> > therefore fail due to pages being pinned for DMA.
> 
> We are discussing this scenario already[1], where the host will not
> pin the pages used by secure DMA for the same reasons why we can't
> have KVM pin the guest_memfd pages mapped in SEPT. Is there some other
> kind of pinning you are referring to?

I'm wondering about the "something went wrong and we can't invalidate" pattern.
Like the device refuses to cooperate.

> 
> If there is an ordering in which pages should be unmapped e.g. first
> in secure IOMMU and then KVM SEPT, then we can ensure the right
> ordering between invalidation callbacks from guest_memfd.
> 
> [1] https://lore.kernel.org/lkml/CAGtprH_qh8sEY3s-JucW3n1Wvoq7jdVZDDokvG5HzPf0HV2=pg@mail.gmail.com/#t

The general gist seems to be that guestmemfd should be the nerve center of these
decisions, and it should be given enough information to make the decisions to
invalidate only when success is guaranteed. Makes sense.

In this case we can't know the condition ahead of time. Is it a TDX-only
problem? If it is, then we need to make TDX behave more like the others. Or have
simple-to-maintain cop-outs like this.

> 
> > 
> > So, my thinking was that if that happens, KVM could set a special flag to folios
> > pinned for private DMA.
> > 
> > Then guest_memfd could check the special flag before allowing private-to-shared
> > conversion, or punch hole.
> > guest_memfd could check this special flag and choose to poison or leak the
> > folio.
> > 
> > Otherwise, if we choose tdx_buggy_shutdown() to "do an all-cpu IPI to kick CPUs
> > out of SEAMMODE, wbinvd, and set a "no more seamcalls" bool", DMAs may still
> > have access to the private pages mapped in S-EPT.
> 
> guest_memfd will have to ensure that pages are unmapped from secure
> IOMMU pagetables before allowing them to be used by the host.
> 
> If secure IOMMU pagetables unmapping fails, I would assume it fails in
> the similar category of rare "KVM/TDX module/IOMMUFD" bug and I think
> it makes sense to do the same tdx_buggy_shutdown() with such failures
> as well.

It's too hypothetical to reason about. IMO, we need to know about specific
similar patterns to justify a more complex fine grained poisoning approach.


^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 03/21] x86/virt/tdx: Add SEAMCALL wrapper tdh_mem_page_demote()
  2025-05-15 17:28       ` Edgecombe, Rick P
  2025-05-16  2:23         ` Yan Zhao
@ 2025-07-01 21:15         ` Edgecombe, Rick P
  1 sibling, 0 replies; 294+ messages in thread
From: Edgecombe, Rick P @ 2025-07-01 21:15 UTC (permalink / raw)
  To: Zhao, Yan Y
  Cc: kvm@vger.kernel.org, Li, Xiaoyao, quic_eberman@quicinc.com,
	Hansen, Dave, david@redhat.com, Li, Zhiquan1, tabba@google.com,
	vbabka@suse.cz, thomas.lendacky@amd.com, michael.roth@amd.com,
	seanjc@google.com, Weiny, Ira, linux-kernel@vger.kernel.org,
	pbonzini@redhat.com, ackerleytng@google.com, Yamahata, Isaku,
	binbin.wu@linux.intel.com, Peng, Chao P, Du, Fan,
	Annapurve, Vishal, jroedel@suse.de, Miao, Jun, Shutemov, Kirill,
	pgonda@google.com, x86@kernel.org

On Thu, 2025-05-15 at 10:28 -0700, Rick Edgecombe wrote:
> > I did a brief test on my SPR, where the host was not busy :
> > tdh_mem_page_demote() was called 142 times, with each invocation taking
> > around
> > 10us.
> 
> 10us doesn't seem too bad? Makes me think to not loop and instead just do a
> single retry with interrupts disabled. We should definitely add the data based
> reasoning to the log.
> 
> The counter point is that the SEAMCALL must be supporting
> TDX_INTERRUPTED_RESTARTABLE for a reason. And the reason probably is that it
> sometimes takes longer than what someone thought was reasonable. Maybe we should ask
> TDX
> module folks if there is any history.

Circling back here. After some research/discussion, it seems demote should not
take long enough to need the option of returning TDX_INTERRUPTED_RESTARTABLE,
even in the dynamic PAMT case. The details of how to get this changed and
documented are still ongoing, but for v2 I say we close this by expecting it to
never return TDX_INTERRUPTED_RESTARTABLE. For now it can be a KVM_BUG_ON() case,
with the expectation that the TDX module will update to make the logic valid.
Sound good?
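
For the record, the check I have in mind is just this (a fragment only; err is
whatever tdh_mem_page_demote() returned, and the error-code name is the one
we've been discussing):

/* Demote is expected to complete without being interrupted. */
if (KVM_BUG_ON(err == TDX_INTERRUPTED_RESTARTABLE, kvm))
	return -EIO;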

^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge pages
  2025-07-01 16:13                                                                                     ` Edgecombe, Rick P
@ 2025-07-01 21:48                                                                                       ` Ackerley Tng
  2025-07-01 21:57                                                                                         ` Ackerley Tng
  2025-07-01 22:37                                                                                         ` Edgecombe, Rick P
  2025-07-02  9:08                                                                                       ` Yan Zhao
  1 sibling, 2 replies; 294+ messages in thread
From: Ackerley Tng @ 2025-07-01 21:48 UTC (permalink / raw)
  To: Edgecombe, Rick P, Zhao, Yan Y
  Cc: quic_eberman@quicinc.com, Li, Xiaoyao, kvm@vger.kernel.org,
	Hansen, Dave, david@redhat.com, thomas.lendacky@amd.com,
	vbabka@suse.cz, tabba@google.com, Shutemov, Kirill,
	michael.roth@amd.com, binbin.wu@linux.intel.com,
	seanjc@google.com, Peng, Chao P, Du, Fan, Yamahata, Isaku,
	Weiny, Ira, pbonzini@redhat.com, linux-kernel@vger.kernel.org,
	Annapurve, Vishal, jroedel@suse.de, Miao, Jun, Li, Zhiquan1,
	pgonda@google.com, x86@kernel.org

"Edgecombe, Rick P" <rick.p.edgecombe@intel.com> writes:

> On Tue, 2025-07-01 at 13:01 +0800, Yan Zhao wrote:
>> > Maybe Yan can clarify here. I thought the HWpoison scenario was about TDX
>> > module
>> My thinking is to set HWPoison to private pages whenever KVM_BUG_ON() was hit
>> in
>> TDX. i.e., when the page is still mapped in S-EPT but the TD is bugged on and
>> about to tear down.
>> 
>> So, it could be due to KVM or TDX module bugs, which retries can't help.
>
> We were going to call back into guestmemfd for this, right? Not set it inside
> KVM code.
>

Perhaps we had different understandings of f/g :P

I meant that TDX module should directly set the HWpoison flag on the
folio (HugeTLB or 4K, guest_memfd or not), not call into guest_memfd.

guest_memfd will then check this flag when necessary, specifically:

* On faults, either into guest or host page tables (see the sketch after this list)
* When freeing the page
    * guest_memfd will not return HugeTLB pages that are poisoned to
      HugeTLB and just leak it
    * 4K pages will be freed normally, because free_pages_prepare() will
      check for HWpoison and skip freeing, from __folio_put() ->
      free_frozen_pages() -> __free_frozen_pages() ->
      free_pages_prepare()
* I believe guest_memfd doesn't need to check HWpoison on conversions [1]

[1] https://lore.kernel.org/all/diqz5xghjca4.fsf@ackerleytng-ctop.c.googlers.com/
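
On the fault path the check itself can be pretty small, something like this
(a sketch only, with the guest_memfd plumbing elided; folio/page here are
whatever the fault helper resolved):

/* Refuse to hand out a page that is marked poisoned. */
if (PageHWPoison(page)) {
	folio_unlock(folio);
	folio_put(folio);
	return -EHWPOISON;
}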

> What about a kvm_gmem_buggy_cleanup() instead of the system wide one. KVM calls
> it and then proceeds to bug the TD only from the KVM side. It's not as safe for
> the system, because who knows what a buggy TDX module could do. But TDX module
> could also be buggy without the kernel catching wind of it.
>
> Having a single callback to basically bug the fd would solve the atomic context
> issue. Then guestmemfd could dump the entire fd into memory_failure() instead of
> returning the pages. And developers could respond by fixing the bug.
>

This could work too.

I'm in favor of buying into the HWpoison system though, since we're
quite sure this is fair use of HWpoison.

Are you saying kvm_gmem_buggy_cleanup() will just set the HWpoison flag
on the parts of the folios in trouble?

> IMO maintainability needs to be balanced with efforts to minimize the fallout
> from bugs. In the end a system that is too complex is going to have more bugs
> anyway.
>
>> 
>> > bugs. Not TDX busy errors, demote failures, etc. If there are "normal"
>> > failures,
>> > like the ones that can be fixed with retries, then I think HWPoison is not a
>> > good option though.
>> > 
>> > >   there is a way to make 100%
>> > > sure all memory becomes re-usable by the rest of the host, using
>> > > tdx_buggy_shutdown(), wbinvd, etc?
>> 
>> Not sure about this approach. When TDX module is buggy and the page is still
>> accessible to guest as private pages, even with no-more SEAMCALLs flag, is it
>> safe enough for guest_memfd/hugetlb to re-assign the page to allow
>> simultaneous
>> access in shared memory with potential private access from TD or TDX module?
>
> With the no more seamcall's approach it should be safe (for the system). This is
> essentially what we are doing for kexec.

^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge pages
  2025-07-01 21:48                                                                                       ` Ackerley Tng
@ 2025-07-01 21:57                                                                                         ` Ackerley Tng
  2025-07-01 22:37                                                                                         ` Edgecombe, Rick P
  1 sibling, 0 replies; 294+ messages in thread
From: Ackerley Tng @ 2025-07-01 21:57 UTC (permalink / raw)
  To: Edgecombe, Rick P, Zhao, Yan Y
  Cc: quic_eberman@quicinc.com, Li, Xiaoyao, kvm@vger.kernel.org,
	Hansen, Dave, david@redhat.com, thomas.lendacky@amd.com,
	vbabka@suse.cz, tabba@google.com, Shutemov, Kirill,
	michael.roth@amd.com, binbin.wu@linux.intel.com,
	seanjc@google.com, Peng, Chao P, Du, Fan, Yamahata, Isaku,
	Weiny, Ira, pbonzini@redhat.com, linux-kernel@vger.kernel.org,
	Annapurve, Vishal, jroedel@suse.de, Miao, Jun, Li, Zhiquan1,
	pgonda@google.com, x86@kernel.org

Ackerley Tng <ackerleytng@google.com> writes:

> "Edgecombe, Rick P" <rick.p.edgecombe@intel.com> writes:
>
>> On Tue, 2025-07-01 at 13:01 +0800, Yan Zhao wrote:
>>> > Maybe Yan can clarify here. I thought the HWpoison scenario was about TDX
>>> > module
>>> My thinking is to set HWPoison to private pages whenever KVM_BUG_ON() was hit
>>> in
>>> TDX. i.e., when the page is still mapped in S-EPT but the TD is bugged on and
>>> about to tear down.
>>> 
>>> So, it could be due to KVM or TDX module bugs, which retries can't help.
>>
>> We were going to call back into guestmemfd for this, right? Not set it inside
>> KVM code.
>>
>
> Perhaps we had different understandings of f/g :P
>
> I meant that TDX module should directly set the HWpoison flag on the
> folio (HugeTLB or 4K, guest_memfd or not), not call into guest_memfd.
>

Sorry, correction here, not "TDX module" but the TDX part of KVM within
the kernel. Not the TDX module code itself. Sorry for the confusion.

> guest_memfd will then check this flag when necessary, specifically:
>
> * On faults, either into guest or host page tables 
> * When freeing the page
>     * guest_memfd will not return HugeTLB pages that are poisoned to
>       HugeTLB and just leak it
>     * 4K pages will be freed normally, because free_pages_prepare() will
>       check for HWpoison and skip freeing, from __folio_put() ->
>       free_frozen_pages() -> __free_frozen_pages() ->
>       free_pages_prepare()
> * I believe guest_memfd doesn't need to check HWpoison on conversions [1]
>
> [1] https://lore.kernel.org/all/diqz5xghjca4.fsf@ackerleytng-ctop.c.googlers.com/
>
>> What about a kvm_gmem_buggy_cleanup() instead of the system wide one. KVM calls
>> it and then proceeds to bug the TD only from the KVM side. It's not as safe for
>> the system, because who knows what a buggy TDX module could do. But TDX module
>> could also be buggy without the kernel catching wind of it.
>>
>> Having a single callback to basically bug the fd would solve the atomic context
>> issue. Then guestmemfd could dump the entire fd into memory_failure() instead of
>> returning the pages. And developers could respond by fixing the bug.
>>
>
> This could work too.
>
> I'm in favor of buying into the HWpoison system though, since we're
> quite sure this is fair use of HWpoison.
>
> Are you saying kvm_gmem_buggy_cleanup() will just set the HWpoison flag
> on the parts of the folios in trouble?
>
>> IMO maintainability needs to be balanced with efforts to minimize the fallout
>> from bugs. In the end a system that is too complex is going to have more bugs
>> anyway.
>>
>>> 
>>> > bugs. Not TDX busy errors, demote failures, etc. If there are "normal"
>>> > failures,
>>> > like the ones that can be fixed with retries, then I think HWPoison is not a
>>> > good option though.
>>> > 
>>> > >   there is a way to make 100%
>>> > > sure all memory becomes re-usable by the rest of the host, using
>>> > > tdx_buggy_shutdown(), wbinvd, etc?
>>> 
>>> Not sure about this approach. When TDX module is buggy and the page is still
>>> accessible to guest as private pages, even with no-more SEAMCALLs flag, is it
>>> safe enough for guest_memfd/hugetlb to re-assign the page to allow
>>> simultaneous
>>> access in shared memory with potential private access from TD or TDX module?
>>
>> With the no more seamcall's approach it should be safe (for the system). This is
>> essentially what we are doing for kexec.

^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge pages
  2025-07-01  5:07                                                                                 ` Yan Zhao
@ 2025-07-01 22:01                                                                                   ` Ackerley Tng
  2025-07-01 22:26                                                                                     ` Ackerley Tng
  0 siblings, 1 reply; 294+ messages in thread
From: Ackerley Tng @ 2025-07-01 22:01 UTC (permalink / raw)
  To: Yan Zhao
  Cc: Edgecombe, Rick P, Shutemov, Kirill, Li, Xiaoyao,
	kvm@vger.kernel.org, Hansen, Dave, david@redhat.com,
	thomas.lendacky@amd.com, tabba@google.com, vbabka@suse.cz,
	quic_eberman@quicinc.com, michael.roth@amd.com,
	linux-kernel@vger.kernel.org, seanjc@google.com, Peng, Chao P,
	Du, Fan, Yamahata, Isaku, pbonzini@redhat.com,
	binbin.wu@linux.intel.com, Weiny, Ira, Li, Zhiquan1,
	Annapurve, Vishal, jroedel@suse.de, Miao, Jun, pgonda@google.com,
	x86@kernel.org

Yan Zhao <yan.y.zhao@intel.com> writes:

> On Mon, Jun 30, 2025 at 12:25:49PM -0700, Ackerley Tng wrote:
>> "Edgecombe, Rick P" <rick.p.edgecombe@intel.com> writes:
>> 
>> > On Mon, 2025-06-30 at 19:13 +0800, Yan Zhao wrote:
>> >> > > ok! Lets go f/g. Unless Yan objects.
>> >> I'm ok with f/g. But I have two implementation specific questions:
>> >> 
>> >> 1. How to set the HWPoison bit in TDX?
>> 
>> I was thinking to set the HWpoison flag based on page type. If regular
>> 4K page, set the flag. If THP page (not (yet) supported by guest_memfd),
>> set the has_hwpoison flag, and if HugeTLB page, call
>> folio_set_hugetlb_hwpoison().
> Could you elaborate on how to call folio_set_hugetlb_hwpoison()?
>

Sorry I meant "in TDX" as in the part of the kernel that performs the
unmap. I'm assuming something like

int ret = tdx_do_unmap(page);
if (ret)
	set_hwpoison_based_on_folio_type(page_folio(page));

And set_hwpoison_based_on_folio_type() would have to be written to know
how to set the HWpoison flag based on type of the folio.
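
Something like the below is what I have in mind (a sketch only; whether
folio_set_hugetlb_hwpoison() can be called from outside mm/memory-failure.c,
and the exact helper for large folios, would still need to be sorted out):

/* Sketch: mark the folio poisoned according to its type. */
static void set_hwpoison_based_on_folio_type(struct folio *folio)
{
	if (folio_test_hugetlb(folio))
		/* assumes an exported variant of folio_set_hugetlb_hwpoison() */
		folio_set_hugetlb_hwpoison(folio, &folio->page);
	else
		SetPageHWPoison(&folio->page);
}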

I think I might have used the wrong terminology elsewhere. Sorry about
that. I don't mean to call folio_set_hugetlb_hwpoison() from within the
TDX module. I meant to set HWpoison in the kernel, based on return value
to the kernel from the TDX module.

>> But if we go with Rick's suggestion below, then we don't have to figure
>> this out.
>> 
>> >> 2. Should we set this bit for non-guest-memfd pages (e.g. for S-EPT pages) ?
>> >
>> > Argh, I guess we can keep the existing ref count based approach for the other
>> > types of TDX owned pages?
>> >
>> 
>> Wait TDX can only use guest_memfd pages, right? Even if TDX can use
>> non-guest_memfd pages, why not also set HWpoison for non-guest_memfd
>> pages?
> As in https://lore.kernel.org/all/aGJxU95VvQvQ3bj6@yzhao56-desk.sh.intel.com/,
> I don't find a proper interface for TDX to set the HWpoison bit on non-guest_memfd
> pages.
>
> Neither memory_failure() nor memory_failure_queue() seem fit.

^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge pages
  2025-07-01  6:03                                                                                       ` Yan Zhao
  2025-07-01  7:13                                                                                         ` Vishal Annapurve
@ 2025-07-01 22:09                                                                                         ` Ackerley Tng
  2025-07-02 11:24                                                                                           ` Yan Zhao
  1 sibling, 1 reply; 294+ messages in thread
From: Ackerley Tng @ 2025-07-01 22:09 UTC (permalink / raw)
  To: Yan Zhao, Vishal Annapurve
  Cc: Edgecombe, Rick P, quic_eberman@quicinc.com, Li, Xiaoyao,
	Shutemov, Kirill, Hansen, Dave, david@redhat.com,
	thomas.lendacky@amd.com, vbabka@suse.cz, tabba@google.com,
	Du, Fan, michael.roth@amd.com, seanjc@google.com,
	binbin.wu@linux.intel.com, Peng, Chao P, kvm@vger.kernel.org,
	Yamahata, Isaku, linux-kernel@vger.kernel.org, Weiny, Ira,
	pbonzini@redhat.com, Li, Zhiquan1, jroedel@suse.de, Miao, Jun,
	pgonda@google.com, x86@kernel.org

Yan Zhao <yan.y.zhao@intel.com> writes:

> On Mon, Jun 30, 2025 at 10:22:26PM -0700, Vishal Annapurve wrote:
>> On Mon, Jun 30, 2025 at 10:04 PM Yan Zhao <yan.y.zhao@intel.com> wrote:
>> >
>> > On Tue, Jul 01, 2025 at 05:45:54AM +0800, Edgecombe, Rick P wrote:
>> > > On Mon, 2025-06-30 at 12:25 -0700, Ackerley Tng wrote:
>> > > > > So for this we can do something similar. Have the arch/x86 side of TDX grow
>> > > > > a
>> > > > > new tdx_buggy_shutdown(). Have it do an all-cpu IPI to kick CPUs out of
>> > > > > SEAMMODE, wbinvd, and set a "no more seamcalls" bool. Then any SEAMCALLs
>> > > > > after
>> > > > > that will return a TDX_BUGGY_SHUTDOWN error, or similar. All TDs in the
>> > > > > system
>> > > > > die. Zap/cleanup paths return success in the buggy shutdown case.
>> > > > >
>> > > >
>> > > > Do you mean that on unmap/split failure:
>> > >
>> > > Maybe Yan can clarify here. I thought the HWpoison scenario was about TDX module
>> > My thinking is to set HWPoison to private pages whenever KVM_BUG_ON() was hit in
>> > TDX. i.e., when the page is still mapped in S-EPT but the TD is bugged on and
>> > about to tear down.
>> >
>> > So, it could be due to KVM or TDX module bugs, which retries can't help.
>> >
>> > > bugs. Not TDX busy errors, demote failures, etc. If there are "normal" failures,
>> > > like the ones that can be fixed with retries, then I think HWPoison is not a
>> > > good option though.
>> > >
>> > > >  there is a way to make 100%
>> > > > sure all memory becomes re-usable by the rest of the host, using
>> > > > tdx_buggy_shutdown(), wbinvd, etc?
>> >
>> > Not sure about this approach. When TDX module is buggy and the page is still
>> > accessible to guest as private pages, even with no-more SEAMCALLs flag, is it
>> > safe enough for guest_memfd/hugetlb to re-assign the page to allow simultaneous
>> > access in shared memory with potential private access from TD or TDX module?
>> 
>> If no more seamcalls are allowed and all cpus are made to exit SEAM
>> mode then how can there be potential private access from TD or TDX
>> module?
> Not sure. As Kirill said "TDX module has creative ways to corrupt it"
> https://lore.kernel.org/all/zlxgzuoqwrbuf54wfqycnuxzxz2yduqtsjinr5uq4ss7iuk2rt@qaaolzwsy6ki/.
>
> Or, could TDX just set a page flag, like what for XEN
>
>         /* XEN */
>         /* Pinned in Xen as a read-only pagetable page. */
>         PG_pinned = PG_owner_priv_1,
>
> e.g.
> 	PG_tdx_firmware_access = PG_owner_priv_1,
>
> Then, guest_memfd checks this flag on every zap and replace it with PG_hwpoison
> on behalf of TDX?

I think this question probably arose because of a misunderstanding I
might have caused. I meant to set the HWpoison flag from the kernel, not
from within the TDX module. Please see [1].

In addition, if the TDX module (now referring specifically to the TDX
module and not the kernel) sets page flags, that won't work with
vmemmap-optimized folios. Setting a page flag on a vmemmap-optimized
folio will be setting the flag on a few pages.

[1] https://lore.kernel.org/all/diqzplej4llh.fsf@ackerleytng-ctop.c.googlers.com/

^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge pages
  2025-07-01 22:01                                                                                   ` Ackerley Tng
@ 2025-07-01 22:26                                                                                     ` Ackerley Tng
  0 siblings, 0 replies; 294+ messages in thread
From: Ackerley Tng @ 2025-07-01 22:26 UTC (permalink / raw)
  To: Yan Zhao
  Cc: Edgecombe, Rick P, Shutemov, Kirill, Li, Xiaoyao,
	kvm@vger.kernel.org, Hansen, Dave, david@redhat.com,
	thomas.lendacky@amd.com, tabba@google.com, vbabka@suse.cz,
	quic_eberman@quicinc.com, michael.roth@amd.com,
	linux-kernel@vger.kernel.org, seanjc@google.com, Peng, Chao P,
	Du, Fan, Yamahata, Isaku, pbonzini@redhat.com,
	binbin.wu@linux.intel.com, Weiny, Ira, Li, Zhiquan1,
	Annapurve, Vishal, jroedel@suse.de, Miao, Jun, pgonda@google.com,
	x86@kernel.org

Ackerley Tng <ackerleytng@google.com> writes:

> Yan Zhao <yan.y.zhao@intel.com> writes:
>
>> On Mon, Jun 30, 2025 at 12:25:49PM -0700, Ackerley Tng wrote:
>>> "Edgecombe, Rick P" <rick.p.edgecombe@intel.com> writes:
>>> 
>>> > On Mon, 2025-06-30 at 19:13 +0800, Yan Zhao wrote:
>>> >> > > ok! Lets go f/g. Unless Yan objects.
>>> >> I'm ok with f/g. But I have two implementation specific questions:
>>> >> 
>>> >> 1. How to set the HWPoison bit in TDX?
>>> 
>>> I was thinking to set the HWpoison flag based on page type. If regular
>>> 4K page, set the flag. If THP page (not (yet) supported by guest_memfd),
>>> set the has_hwpoison flag, and if HugeTLB page, call
>>> folio_set_hugetlb_hwpoison().
>> Could you elaborate on how to call folio_set_hugetlb_hwpoison()?
>>
>
> Sorry I meant "in TDX" as in the part of the kernel that performs the
> unmap. I'm assuming something like
>
> int ret = tdx_do_unmap(page);
> if (ret)
> 	set_hwpoison_based_on_folio_type(page_folio(page));
>
> And set_hwpoison_based_on_folio_type() would have to be written to know
> how to set the HWpoison flag based on type of the folio.
>
> I think I might have used the wrong terminology elsewhere. Sorry about
> that. I don't mean to call folio_set_hugetlb_hwpoison() from within the
> TDX module. I meant to set HWpoison in the kernel, based on return value
> to the kernel from the TDX module.
>
>>> But if we go with Rick's suggestion below, then we don't have to figure
>>> this out.
>>> 
>>> >> 2. Should we set this bit for non-guest-memfd pages (e.g. for S-EPT pages) ?
>>> >
>>> > Argh, I guess we can keep the existing ref count based approach for the other
>>> > types of TDX owned pages?
>>> >
>>> 
>>> Wait TDX can only use guest_memfd pages, right? Even if TDX can use
>>> non-guest_memfd pages, why not also set HWpoison for non-guest_memfd
>>> pages?
>> As in https://lore.kernel.org/all/aGJxU95VvQvQ3bj6@yzhao56-desk.sh.intel.com/,
>> I don't find a proper interface for TDX to set the HWpoison bit on non-guest_memfd
>> pages.
>>
>> Neither memory_failure() nor memory_failure_queue() seem fit.

Missed out a response on this.

Vishal explained to me that non-guest_memfd pages can be used by TDX for
the TDX module itself.

For those, I think it's still okay to set HWpoison, because the kernel
page freeing process will leak HWpoison-ed pages. free_pages_prepare()
will check for HWpoison and skip freeing:

__folio_put() ->
  free_frozen_pages() ->
    __free_frozen_pages() ->
      free_pages_prepare()

^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge pages
  2025-07-01 21:48                                                                                       ` Ackerley Tng
  2025-07-01 21:57                                                                                         ` Ackerley Tng
@ 2025-07-01 22:37                                                                                         ` Edgecombe, Rick P
  2025-07-02 20:57                                                                                           ` Ackerley Tng
  1 sibling, 1 reply; 294+ messages in thread
From: Edgecombe, Rick P @ 2025-07-01 22:37 UTC (permalink / raw)
  To: ackerleytng@google.com, Zhao, Yan Y
  Cc: quic_eberman@quicinc.com, Li, Xiaoyao, Li, Zhiquan1, Du, Fan,
	Hansen, Dave, david@redhat.com, thomas.lendacky@amd.com,
	tabba@google.com, vbabka@suse.cz, Shutemov, Kirill,
	michael.roth@amd.com, seanjc@google.com, Weiny, Ira, Peng, Chao P,
	binbin.wu@linux.intel.com, Yamahata, Isaku, pbonzini@redhat.com,
	kvm@vger.kernel.org, Annapurve, Vishal, jroedel@suse.de,
	linux-kernel@vger.kernel.org, Miao, Jun, pgonda@google.com,
	x86@kernel.org

On Tue, 2025-07-01 at 14:48 -0700, Ackerley Tng wrote:
> Perhaps we had different understandings of f/g :P

Ah yes, I thought you were saying that guestmemfd would use poison internally
via some gmem_buggy_page() or similar. I guess I thought it is more of
guestmemfd's job. But as Yan pointed out, we need to handle non gmem page errors
too. Currently we leak, but it would be nice to keep the handling symmetrical.
Which would be easier if we did it all in TDX code.

> 
> I meant that TDX module should directly set the HWpoison flag on the
> folio (HugeTLB or 4K, guest_memfd or not), not call into guest_memfd.
> 
> guest_memfd will then check this flag when necessary, specifically:
> 
> * On faults, either into guest or host page tables 
> * When freeing the page
>     * guest_memfd will not return HugeTLB pages that are poisoned to
>       HugeTLB and just leak it
>     * 4K pages will be freed normally, because free_pages_prepare() will
>       check for HWpoison and skip freeing, from __folio_put() ->
>       free_frozen_pages() -> __free_frozen_pages() ->
>       free_pages_prepare()
> * I believe guest_memfd doesn't need to check HWpoison on conversions [1]
> 
> [1] https://lore.kernel.org/all/diqz5xghjca4.fsf@ackerleytng-ctop.c.googlers.com/

If a poisoned page continued to be used, it's a bit weird, no? It could take an
#MC for another reason from userspace and the handling code would see the page
flag is already set. If it doesn't already trip up some MM code somewhere, it
might put undue burden on the memory failure code to have to expect repeated
poisoning of the same memory.

> 
> > What about a kvm_gmem_buggy_cleanup() instead of the system wide one. KVM calls
> > it and then proceeds to bug the TD only from the KVM side. It's not as safe for
> > the system, because who knows what a buggy TDX module could do. But TDX module
> > could also be buggy without the kernel catching wind of it.
> > 
> > Having a single callback to basically bug the fd would solve the atomic context
> > issue. Then guestmemfd could dump the entire fd into memory_failure() instead of
> > returning the pages. And developers could respond by fixing the bug.
> > 
> 
> This could work too.
> 
> I'm in favor of buying into the HWpoison system though, since we're
> quite sure this is fair use of HWpoison.

Do you mean manually setting the poison flag, or calling into memory_failure(),
and friends? If we set them manually, we need to make sure that it does not have
side effects on the machine check handler. It seems risky/messy to me. But
Kirill didn't seem worried.

Maybe we could bring the poison page flag up to DavidH and see if there is any
concern before going down this path too far?

> 
> Are you saying kvm_gmem_buggy_cleanup() will just set the HWpoison flag
> on the parts of the folios in trouble?

I was saying kvm_gmem_buggy_cleanup() can set a bool on the fd, similar to
KVM_BUG_ON() setting vm_dead. After an invalidate, if gmem sees this, it needs to
assume everything failed, and invalidate everything and poison all guest memory.
The point was to have the simplest possible handling for a rare error. Although
it's only a proposal. The TDX emergency shutdown option may be simpler still.
But killing all TDs is not ideal. So thought we could at least consider other
options.

If we have a solution where TDX needs to do something complicated because of
its specialness, it may get NAKed. This is my main concern with the
direction of this problem/solution. AFAICT, we are not even sure of a concrete
problem, and it appears to be special to TDX. So the complexity budget should be
small. It's in sharp contrast to the length of the discussion.

^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 09/21] KVM: TDX: Enable 2MB mapping size after TD is RUNNABLE
  2025-07-01 15:36                                                                 ` Edgecombe, Rick P
@ 2025-07-02  0:12                                                                   ` Yan Zhao
  2025-07-02  0:18                                                                     ` Edgecombe, Rick P
  0 siblings, 1 reply; 294+ messages in thread
From: Yan Zhao @ 2025-07-02  0:12 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: Du, Fan, Li, Xiaoyao, Huang, Kai, Shutemov, Kirill, Hansen, Dave,
	david@redhat.com, thomas.lendacky@amd.com, vbabka@suse.cz,
	Li, Zhiquan1, quic_eberman@quicinc.com, michael.roth@amd.com,
	seanjc@google.com, Weiny, Ira, pbonzini@redhat.com, Peng, Chao P,
	Yamahata, Isaku, linux-kernel@vger.kernel.org,
	binbin.wu@linux.intel.com, ackerleytng@google.com,
	kvm@vger.kernel.org, Annapurve, Vishal, tabba@google.com,
	jroedel@suse.de, Miao, Jun, pgonda@google.com, x86@kernel.org

On Tue, Jul 01, 2025 at 11:36:22PM +0800, Edgecombe, Rick P wrote:
> On Tue, 2025-07-01 at 10:41 +0800, Yan Zhao wrote:
> > > Can you explain what you found regarding the write lock need?
> > Here, the write lock protects 2 steps:
> > (1) update lpage_info.
> > (2) try splitting if there's any existing 2MB mapping.
> > 
> > The write mmu_lock is needed because lpage_info is read under read mmu_lock in
> > kvm_tdp_mmu_map().
> > 
> > kvm_tdp_mmu_map
> >   kvm_mmu_hugepage_adjust
> >     kvm_lpage_info_max_mapping_level
> > 
> > If we update the lpage_info with read mmu_lock, the other vCPUs may map at a
> > stale 2MB level even after lpage_info is updated by
> > hugepage_set_guest_inhibit().
> > 
> > Therefore, we must perform splitting under the write mmu_lock to ensure there
> > are no 2MB mappings after hugepage_set_guest_inhibit().
> > 
> > Otherwise, during later mapping in __vmx_handle_ept_violation(), splitting at
> > fault path could be triggered as KVM MMU finds the goal level is 4KB while an
> > existing 2MB mapping is present.
> 
> It could be?
> 1. mmu read lock
> 2. update lpage_info
> 3. mmu write lock upgrade
> 4. demote
> 5. mmu unlock
> 
> Then (3) could be skipped in the case of ability to demote under read lock?
> 
> I noticed that the other lpage_info updaters took mmu write lock, and I wasn't
> sure why. We shouldn't take a lock that we don't actually need just for safety
> margin or to copy other code.
Using the write mmu_lock is for a reason.

In the 3 steps, 
1. set lpage_info
2. demote if needed
3. go to fault handler

Step 2 requires holding the write mmu_lock before invoking kvm_split_boundary_leafs().
The write mmu_lock may also be dropped and re-acquired inside
kvm_split_boundary_leafs(), e.g. for memory allocation.

If step 1 is done under the read mmu_lock, other vCPUs could still fault in at
2MB level after the demote in step 2.
Luckily, TDX doesn't support promotion right now.
But we can avoid wading into this complex situation by holding the write
mmu_lock in step 1.

> > > For most accept
> > > cases, we could fault in the PTE's on the read lock. And in the future we
> > > could
> > 
> > The actual mapping at 4KB level is still with read mmu_lock in
> > __vmx_handle_ept_violation().
> > 
> > > have a demote that could work under read lock, as we talked. So
> > > kvm_split_boundary_leafs() often or could be unneeded or work under read
> > > lock
> > > when needed.
> > Could we leave the "demote under read lock" as a future optimization? 
> 
> We could add it to the list. If we have a TDX module that supports demote with a
> single SEAMCALL then we don't have the rollback problem. The optimization could
> utilize that. That said, we should focus on the optimizations that make the
> biggest difference to real TDs. Your data suggests this might not be the case
> today. 
Ok. 
 
> > > What is the problem in hugepage_set_guest_inhibit() that requires the write
> > > lock?
> > As above, to avoid the other vCPUs reading stale mapping level and splitting
> > under read mmu_lock.
> 
> We need mmu write lock for demote, but as long as the order is:
> 1. set lpage_info
> 2. demote if needed
> 3. go to fault handler
> 
> Then (3) should have what it needs even if another fault races (1).
See the above comment for why we need to hold write mmu_lock for 1.

Besides, as we need write mmu_lock anyway for 2 (i.e. hold write mmu_lock before
walking the SPTEs to check if there's any existing mapping), I don't see any
performance impact by holding write mmu_lock for 1.


> > As guest_inhibit is set one-way, we could test it using
> > hugepage_test_guest_inhibit() without holding the lock. The chance to hold
> > write
> > mmu_lock for hugepage_set_guest_inhibit() is then greatly reduced.
> > (in my testing, 11 during VM boot).
> >  
> > > But in any case, it seems like we have *a* solution here. It doesn't seem
> > > like
> > > there are any big downsides. Should we close it?
> > I think it's good, as long as Sean doesn't disagree :)
> 
> He seemed onboard. Let's close it. We can even discuss lpage_info update locking
> on v2.
Ok. I'll use write mmu_lock for updating lpage_info in v2 first.

^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 09/21] KVM: TDX: Enable 2MB mapping size after TD is RUNNABLE
  2025-07-02  0:12                                                                   ` Yan Zhao
@ 2025-07-02  0:18                                                                     ` Edgecombe, Rick P
  2025-07-02  1:07                                                                       ` Yan Zhao
  2025-07-02  3:31                                                                       ` Yan Zhao
  0 siblings, 2 replies; 294+ messages in thread
From: Edgecombe, Rick P @ 2025-07-02  0:18 UTC (permalink / raw)
  To: Zhao, Yan Y
  Cc: quic_eberman@quicinc.com, Li, Xiaoyao, Huang, Kai, Du, Fan,
	Hansen, Dave, david@redhat.com, thomas.lendacky@amd.com,
	vbabka@suse.cz, Li, Zhiquan1, Shutemov, Kirill,
	michael.roth@amd.com, seanjc@google.com, Weiny, Ira, Peng, Chao P,
	pbonzini@redhat.com, Yamahata, Isaku,
	linux-kernel@vger.kernel.org, binbin.wu@linux.intel.com,
	ackerleytng@google.com, kvm@vger.kernel.org, Annapurve, Vishal,
	tabba@google.com, jroedel@suse.de, Miao, Jun, pgonda@google.com,
	x86@kernel.org

On Wed, 2025-07-02 at 08:12 +0800, Yan Zhao wrote:
> > Then (3) could be skipped in the case of ability to demote under read lock?
> > 
> > I noticed that the other lpage_info updaters took mmu write lock, and I
> > wasn't
> > sure why. We shouldn't take a lock that we don't actually need just for
> > safety
> > margin or to copy other code.
> Use write mmu_lock is of reason.
> 
> In the 3 steps, 
> 1. set lpage_info
> 2. demote if needed
> 3. go to fault handler
> 
> Step 2 requires holding write mmu_lock before invoking
> kvm_split_boundary_leafs().
> The write mmu_lock is also possible to get dropped and re-acquired in
> kvm_split_boundary_leafs() for purpose like memory allocation.
> 
> If 1 is with read mmu_lock, the other vCPUs is still possible to fault in at
> 2MB
> level after the demote in step 2.
> Luckily, current TDX doesn't support promotion now.
> But we can avoid wading into this complex situation by holding write mmu_lock
> in 1.

I don't think the possibility that some code might race in the future is a good
reason to take the write lock.

> 
> > > > For most accept
> > > > cases, we could fault in the PTE's on the read lock. And in the future
> > > > we
> > > > could
> > > 
> > > The actual mapping at 4KB level is still with read mmu_lock in
> > > __vmx_handle_ept_violation().
> > > 
> > > > have a demote that could work under read lock, as we talked. So
> > > > kvm_split_boundary_leafs() often or could be unneeded or work under read
> > > > lock
> > > > when needed.
> > > Could we leave the "demote under read lock" as a future optimization? 
> > 
> > We could add it to the list. If we have a TDX module that supports demote
> > with a
> > single SEAMCALL then we don't have the rollback problem. The optimization
> > could
> > utilize that. That said, we should focus on the optimizations that make the
> > biggest difference to real TDs. Your data suggests this might not be the
> > case
> > today. 
> Ok. 
>  
> > > > What is the problem in hugepage_set_guest_inhibit() that requires the
> > > > write
> > > > lock?
> > > As above, to avoid the other vCPUs reading stale mapping level and
> > > splitting
> > > under read mmu_lock.
> > 
> > We need mmu write lock for demote, but as long as the order is:
> > 1. set lpage_info
> > 2. demote if needed
> > 3. go to fault handler
> > 
> > Then (3) should have what it needs even if another fault races (1).
> See the above comment for why we need to hold write mmu_lock for 1.
> 
> Besides, as we need write mmu_lock anyway for 2 (i.e. hold write mmu_lock
> before
> walking the SPTEs to check if there's any existing mapping), I don't see any
> performance impact by holding write mmu_lock for 1.

It's a maintainability problem too. Someday someone may want to remove it and
scratch their head over what race they are missing.

> 
> 
> > > As guest_inhibit is set one-way, we could test it using
> > > hugepage_test_guest_inhibit() without holding the lock. The chance to hold
> > > write
> > > mmu_lock for hugepage_set_guest_inhibit() is then greatly reduced.
> > > (in my testing, 11 during VM boot).
> > >  
> > > > But in any case, it seems like we have *a* solution here. It doesn't
> > > > seem
> > > > like
> > > > there are any big downsides. Should we close it?
> > > I think it's good, as long as Sean doesn't disagree :)
> > 
> > He seemed onboard. Let's close it. We can even discuss lpage_info update
> > locking
> > on v2.
> Ok. I'll use write mmu_lock for updating lpage_info in v2 first.

Specifically, why do the other lpage_info updating functions take the mmu write
lock? Are you sure there is no other reason?


^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 09/21] KVM: TDX: Enable 2MB mapping size after TD is RUNNABLE
  2025-07-02  0:18                                                                     ` Edgecombe, Rick P
@ 2025-07-02  1:07                                                                       ` Yan Zhao
  2025-07-02 15:26                                                                         ` Edgecombe, Rick P
  2025-07-02  3:31                                                                       ` Yan Zhao
  1 sibling, 1 reply; 294+ messages in thread
From: Yan Zhao @ 2025-07-02  1:07 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: quic_eberman@quicinc.com, Li, Xiaoyao, Huang, Kai, Du, Fan,
	Hansen, Dave, david@redhat.com, thomas.lendacky@amd.com,
	vbabka@suse.cz, Li, Zhiquan1, Shutemov, Kirill,
	michael.roth@amd.com, seanjc@google.com, Weiny, Ira, Peng, Chao P,
	pbonzini@redhat.com, Yamahata, Isaku,
	linux-kernel@vger.kernel.org, binbin.wu@linux.intel.com,
	ackerleytng@google.com, kvm@vger.kernel.org, Annapurve, Vishal,
	tabba@google.com, jroedel@suse.de, Miao, Jun, pgonda@google.com,
	x86@kernel.org

On Wed, Jul 02, 2025 at 08:18:48AM +0800, Edgecombe, Rick P wrote:
> On Wed, 2025-07-02 at 08:12 +0800, Yan Zhao wrote:
> > > Then (3) could be skipped in the case of ability to demote under read lock?
> > > 
> > > I noticed that the other lpage_info updaters took mmu write lock, and I
> > > wasn't
> > > sure why. We shouldn't take a lock that we don't actually need just for
> > > safety
> > > margin or to copy other code.
> > Use write mmu_lock is of reason.
> > 
> > In the 3 steps, 
> > 1. set lpage_info
> > 2. demote if needed
> > 3. go to fault handler
> > 
> > Step 2 requires holding write mmu_lock before invoking
> > kvm_split_boundary_leafs().
> > The write mmu_lock is also possible to get dropped and re-acquired in
> > kvm_split_boundary_leafs() for purpose like memory allocation.
> > 
> > If 1 is with read mmu_lock, the other vCPUs is still possible to fault in at
> > 2MB
> > level after the demote in step 2.
> > Luckily, current TDX doesn't support promotion now.
> > But we can avoid wading into this complex situation by holding write mmu_lock
> > in 1.
> 
> I don't think because some code might race in the future is a good reason to
> take the write lock.

I still prefer to hold write mmu_lock right now.

Otherwise, we at least need to convert disallow_lpage to an atomic variable and
update it atomically, e.g. via cmpxchg.

struct kvm_lpage_info {
        int disallow_lpage;
};
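
Just to illustrate what that would mean, roughly (the flag name below is made
up; the point is only that the update would have to become a cmpxchg loop on
the lpage_info entry, linfo, for the gfn):

/* One-way set of a "guest inhibit" bit without the write mmu_lock. */
int old, new;

do {
	old = READ_ONCE(linfo->disallow_lpage);
	new = old | KVM_LPAGE_GUEST_INHIBIT_FLAG;	/* hypothetical flag */
} while (cmpxchg(&linfo->disallow_lpage, old, new) != old);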


> > > > > For most accept
> > > > > cases, we could fault in the PTE's on the read lock. And in the future
> > > > > we
> > > > > could
> > > > 
> > > > The actual mapping at 4KB level is still with read mmu_lock in
> > > > __vmx_handle_ept_violation().
> > > > 
> > > > > have a demote that could work under read lock, as we talked. So
> > > > > kvm_split_boundary_leafs() often or could be unneeded or work under read
> > > > > lock
> > > > > when needed.
> > > > Could we leave the "demote under read lock" as a future optimization? 
> > > 
> > > We could add it to the list. If we have a TDX module that supports demote
> > > with a
> > > single SEAMCALL then we don't have the rollback problem. The optimization
> > > could
> > > utilize that. That said, we should focus on the optimizations that make the
> > > biggest difference to real TDs. Your data suggests this might not be the
> > > case
> > > today. 
> > Ok. 
> >  
> > > > > What is the problem in hugepage_set_guest_inhibit() that requires the
> > > > > write
> > > > > lock?
> > > > As above, to avoid the other vCPUs reading stale mapping level and
> > > > splitting
> > > > under read mmu_lock.
> > > 
> > > We need mmu write lock for demote, but as long as the order is:
> > > 1. set lpage_info
> > > 2. demote if needed
> > > 3. go to fault handler
> > > 
> > > Then (3) should have what it needs even if another fault races (1).
> > See the above comment for why we need to hold write mmu_lock for 1.
> > 
> > Besides, as we need write mmu_lock anyway for 2 (i.e. hold write mmu_lock
> > before
> > walking the SPTEs to check if there's any existing mapping), I don't see any
> > performance impact by holding write mmu_lock for 1.
> 
> It's maintainability problem too. Someday someone may want to remove it and
> scratch their head for what race they are missing.
I don't get why holding the write mmu_lock would cause a maintainability problem.
In contrast, if we want to use the read mmu_lock in the future, we need to
carefully check for any potential risk.

> > > > As guest_inhibit is set one-way, we could test it using
> > > > hugepage_test_guest_inhibit() without holding the lock. The chance to hold
> > > > write
> > > > mmu_lock for hugepage_set_guest_inhibit() is then greatly reduced.
> > > > (in my testing, 11 during VM boot).
> > > >  
> > > > > But in any case, it seems like we have *a* solution here. It doesn't
> > > > > seem
> > > > > like
> > > > > there are any big downsides. Should we close it?
> > > > I think it's good, as long as Sean doesn't disagree :)
> > > 
> > > He seemed onboard. Let's close it. We can even discuss lpage_info update
> > > locking
> > > on v2.
> > Ok. I'll use write mmu_lock for updating lpage_info in v2 first.
> 
> Specifically, why do the other lpage_info updating functions take mmu write
> lock. Are you sure there is no other reason?
1. The read mmu_lock can't prevent the other vCPUs from reading stale lpage_info.
2. Shadow code in KVM MMU only holds write mmu_lock, so it updates lpage_info
   with write mmu_lock.
3. lpage_info is not updated atomically. If there are two vCPUs updating
   lpage_info concurrently, lpage_info may end up holding an invalid value.
4. lpage_info is not updated in performance-critical paths, so there's no need
   to use the read mmu_lock.

^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 09/21] KVM: TDX: Enable 2MB mapping size after TD is RUNNABLE
  2025-07-02  0:18                                                                     ` Edgecombe, Rick P
  2025-07-02  1:07                                                                       ` Yan Zhao
@ 2025-07-02  3:31                                                                       ` Yan Zhao
  1 sibling, 0 replies; 294+ messages in thread
From: Yan Zhao @ 2025-07-02  3:31 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: quic_eberman@quicinc.com, Li, Xiaoyao, Huang, Kai, Du, Fan,
	Hansen, Dave, david@redhat.com, thomas.lendacky@amd.com,
	vbabka@suse.cz, Li, Zhiquan1, Shutemov, Kirill,
	michael.roth@amd.com, seanjc@google.com, Weiny, Ira, Peng, Chao P,
	pbonzini@redhat.com, Yamahata, Isaku,
	linux-kernel@vger.kernel.org, binbin.wu@linux.intel.com,
	ackerleytng@google.com, kvm@vger.kernel.org, Annapurve, Vishal,
	tabba@google.com, jroedel@suse.de, Miao, Jun, pgonda@google.com,
	x86@kernel.org

On Wed, Jul 02, 2025 at 08:18:48AM +0800, Edgecombe, Rick P wrote:
> > > We need mmu write lock for demote, but as long as the order is:
> > > 1. set lpage_info
> > > 2. demote if needed
> > > 3. go to fault handler
> > > 
> > > Then (3) should have what it needs even if another fault races (1).
For now I implemented the sequence as

1. check lpage_info, if 2MB is already disabled for a GFN, goto 3.
2. if 2MB is not disabled,
   2.1 acquire write mmu_lock
   2.2 split the GFN mapping and kvm_flush_remote_tlbs() if split is performed
   2.3 update lpage_info to disable 2MB for the GFN
   2.4 release write mmu_lock
3. fault handler for the GFN

Note: the write mmu_lock is held across 2.2 (successfully splitting a huge GFN
entry) and 2.3, so it guarantees that there's no 2MB mapping for the GFN after
2.3.

Step 1 helps reduce the number of write mmu_lock acquisitions from 17626 to 11.
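
In pseudo code (helper names are the ones used in this series; the signatures
are just for illustration):

if (!hugepage_test_guest_inhibit(slot, gfn, PG_LEVEL_2M)) {
	write_lock(&kvm->mmu_lock);
	/* 2.2: split any existing 2MB leaf covering gfn, flush TLBs if split */
	r = kvm_split_boundary_leafs(kvm, gfn, PG_LEVEL_2M);
	if (!r)
		/* 2.3: disable 2MB mappings for this gfn from now on */
		hugepage_set_guest_inhibit(slot, gfn, PG_LEVEL_2M);
	write_unlock(&kvm->mmu_lock);
}
/* 3: fault in the gfn at 4KB level */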

^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge pages
  2025-07-01 13:32                                                                                 ` Vishal Annapurve
  2025-07-01 14:02                                                                                   ` Vishal Annapurve
  2025-07-01 16:14                                                                                   ` Edgecombe, Rick P
@ 2025-07-02  8:54                                                                                   ` Yan Zhao
  2025-07-02 13:12                                                                                     ` Vishal Annapurve
  2 siblings, 1 reply; 294+ messages in thread
From: Yan Zhao @ 2025-07-02  8:54 UTC (permalink / raw)
  To: Vishal Annapurve
  Cc: Edgecombe, Rick P, ackerleytng@google.com, Shutemov, Kirill,
	Li, Xiaoyao, kvm@vger.kernel.org, Hansen, Dave, david@redhat.com,
	thomas.lendacky@amd.com, tabba@google.com, vbabka@suse.cz,
	quic_eberman@quicinc.com, michael.roth@amd.com,
	linux-kernel@vger.kernel.org, seanjc@google.com, Peng, Chao P,
	Du, Fan, Yamahata, Isaku, pbonzini@redhat.com,
	binbin.wu@linux.intel.com, Weiny, Ira, Li, Zhiquan1,
	jroedel@suse.de, Miao, Jun, pgonda@google.com, x86@kernel.org

On Tue, Jul 01, 2025 at 06:32:38AM -0700, Vishal Annapurve wrote:
> On Tue, Jul 1, 2025 at 2:38 AM Yan Zhao <yan.y.zhao@intel.com> wrote:
> >
> > On Tue, Jul 01, 2025 at 01:55:43AM +0800, Edgecombe, Rick P wrote:
> > > So for this we can do something similar. Have the arch/x86 side of TDX grow a
> > > new tdx_buggy_shutdown(). Have it do an all-cpu IPI to kick CPUs out of
> > > SEAMMODE, wbivnd, and set a "no more seamcalls" bool. Then any SEAMCALLs after
> > > that will return a TDX_BUGGY_SHUTDOWN error, or similar. All TDs in the system
> > > die. Zap/cleanup paths return success in the buggy shutdown case.
> > All TDs in the system die could be too severe for unmap errors due to KVM bugs.
> 
> At this point, I don't see a way to quantify how bad a KVM bug can get
> unless you have explicit ideas about the severity. We should work on
> minimizing KVM side bugs too and assuming it would be a rare
> occurrence I think it's ok to take this intrusive measure.
> 
> >
> > > Does it fit? Or, can you guys argue that the failures here are actually non-
> > > special cases that are worth more complex recovery? I remember we talked about
> > > IOMMU patterns that are similar, but it seems like the remaining cases under
> > > discussion are about TDX bugs.
> > I didn't mention TDX connect previously to avoid introducing unnecessary
> > complexity.
> >
> > For TDX connect, S-EPT is used for private mappings in IOMMU. Unmap could
> > therefore fail due to pages being pinned for DMA.
> 
> We are discussing this scenario already[1], where the host will not
> pin the pages used by secure DMA for the same reasons why we can't
> have KVM pin the guest_memfd pages mapped in SEPT. Is there some other
> kind of pinning you are referring to?
>
> If there is an ordering in which pages should be unmapped e.g. first
> in secure IOMMU and then KVM SEPT, then we can ensure the right
> ordering between invalidation callbacks from guest_memfd.
It's pinning from a different perspective.
Please check
https://lore.kernel.org/all/aGTvTbPHuXbvj59t@yzhao56-desk.sh.intel.com.

> [1] https://lore.kernel.org/lkml/CAGtprH_qh8sEY3s-JucW3n1Wvoq7jdVZDDokvG5HzPf0HV2=pg@mail.gmail.com/#t
> 
> >
> > So, my thinking was that if that happens, KVM could set a special flag to folios
> > pinned for private DMA.
> >
> > Then guest_memfd could check the special flag before allowing private-to-shared
> > conversion, or punch hole.
> > guest_memfd could check this special flag and choose to poison or leak the
> > folio.
> >
> > Otherwise, if we choose tdx_buggy_shutdown() to "do an all-cpu IPI to kick CPUs
> > out of SEAMMODE, wbivnd, and set a "no more seamcalls" bool", DMAs may still
> > have access to the private pages mapped in S-EPT.
> 
> guest_memfd will have to ensure that pages are unmapped from secure
> IOMMU pagetables before allowing them to be used by the host.
> 
> If secure IOMMU pagetables unmapping fails, I would assume it fails in
> the similar category of rare "KVM/TDX module/IOMMUFD" bug and I think
> it makes sense to do the same tdx_buggy_shutdown() with such failures
> as well.
tdx_buggy_shutdown() should then do an all-cpu IPI to kick CPUs out of SEAM
mode, do wbinvd, set a "no more seamcalls" bool, and also inform IOMMUFD/VFIO
to stop devices.

BTW, is the "no more seamcall" set by KVM at the per-VM level?
If it's per-VM, other TDs could still entering SEAMMODE. So, potential
corruption is still possible.
Besides, with "no more seamcalls" upon unmapping failure of a GFN, how to
reclaim other pages which might succeed otherwise?

This approach seems very complex.

^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge pages
  2025-07-01 16:13                                                                                     ` Edgecombe, Rick P
  2025-07-01 21:48                                                                                       ` Ackerley Tng
@ 2025-07-02  9:08                                                                                       ` Yan Zhao
  2025-07-02 15:28                                                                                         ` Edgecombe, Rick P
  1 sibling, 1 reply; 294+ messages in thread
From: Yan Zhao @ 2025-07-02  9:08 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: quic_eberman@quicinc.com, Li, Xiaoyao, kvm@vger.kernel.org,
	Hansen, Dave, david@redhat.com, thomas.lendacky@amd.com,
	vbabka@suse.cz, tabba@google.com, Shutemov, Kirill,
	michael.roth@amd.com, binbin.wu@linux.intel.com,
	seanjc@google.com, Peng, Chao P, Du, Fan, Yamahata, Isaku,
	ackerleytng@google.com, Weiny, Ira, pbonzini@redhat.com,
	linux-kernel@vger.kernel.org, Annapurve, Vishal, jroedel@suse.de,
	Miao, Jun, Li, Zhiquan1, pgonda@google.com, x86@kernel.org

On Wed, Jul 02, 2025 at 12:13:42AM +0800, Edgecombe, Rick P wrote:
> On Tue, 2025-07-01 at 13:01 +0800, Yan Zhao wrote:
> > > Maybe Yan can clarify here. I thought the HWpoison scenario was about TDX
> > > module
> > My thinking is to set HWPoison to private pages whenever KVM_BUG_ON() was hit
> > in
> > TDX. i.e., when the page is still mapped in S-EPT but the TD is bugged on and
> > about to tear down.
> > 
> > So, it could be due to KVM or TDX module bugs, which retries can't help.
> 
> We were going to call back into guestmemfd for this, right? Not set it inside
> KVM code.
Right. I think KVM calling back into guest_memfd (via a special folio flag or
API) is better than KVM setting the HWPoison flag or invoking memory_failure()
or its friends.

> What about a kvm_gmem_buggy_cleanup() instead of the system wide one. KVM calls
> it and then proceeds to bug the TD only from the KVM side. It's not as safe for
> the system, because who knows what a buggy TDX module could do. But TDX module
> could also be buggy without the kernel catching wind of it.
> 
> Having a single callback to basically bug the fd would solve the atomic context
> issue. Then guestmemfd could dump the entire fd into memory_failure() instead of
> returning the pages. And developers could respond by fixing the bug.
Do you mean dumping the entire memory inside the fd into memory_failure()?
Or just the memory in the fd with certain folio flags set?

> IMO maintainability needs to be balanced with efforts to minimize the fallout
> from bugs. In the end a system that is too complex is going to have more bugs
> anyway.
Agreed.
To me, having KVM indicate memory corruption at folio level (i.e., 2MB or 1GB
granularity) is acceptable.

KVM can set a flag (e.g. the flag proposed in
https://lore.kernel.org/all/aGN6GIFxh57ElHPA@yzhao56-desk.sh.intel.com).

guest_memfd can check this flag after every zap or after seeing
kvm_gmem_buggy_cleanup(). guest_memfd can choose to report memory_failure() or
leak the memory.

But I'm OK if you think dumping the entire memory inside the fd into
memory_failure() is simpler.

> > > bugs. Not TDX busy errors, demote failures, etc. If there are "normal"
> > > failures,
> > > like the ones that can be fixed with retries, then I think HWPoison is not a
> > > good option though.
> > > 
> > > >   there is a way to make 100%
> > > > sure all memory becomes re-usable by the rest of the host, using
> > > > tdx_buggy_shutdown(), wbinvd, etc?
> > 
> > Not sure about this approach. When TDX module is buggy and the page is still
> > accessible to guest as private pages, even with no-more SEAMCALLs flag, is it
> > safe enough for guest_memfd/hugetlb to re-assign the page to allow
> > simultaneous
> > access in shared memory with potential private access from TD or TDX module?
> 
> With the no more seamcall's approach it should be safe (for the system). This is
> essentially what we are doing for kexec.
AFAIK, kexec stops devices first by invoking each device's shutdown hook.
Similarly, the "no more seamcalls" approach should interact with devices to
avoid DMAs via private keys.

^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge pages
  2025-07-01 22:09                                                                                         ` Ackerley Tng
@ 2025-07-02 11:24                                                                                           ` Yan Zhao
  2025-07-02 18:43                                                                                             ` Ackerley Tng
  0 siblings, 1 reply; 294+ messages in thread
From: Yan Zhao @ 2025-07-02 11:24 UTC (permalink / raw)
  To: Ackerley Tng
  Cc: Vishal Annapurve, Edgecombe, Rick P, quic_eberman@quicinc.com,
	Li, Xiaoyao, Shutemov, Kirill, Hansen, Dave, david@redhat.com,
	thomas.lendacky@amd.com, vbabka@suse.cz, tabba@google.com,
	Du, Fan, michael.roth@amd.com, seanjc@google.com,
	binbin.wu@linux.intel.com, Peng, Chao P, kvm@vger.kernel.org,
	Yamahata, Isaku, linux-kernel@vger.kernel.org, Weiny, Ira,
	pbonzini@redhat.com, Li, Zhiquan1, jroedel@suse.de, Miao, Jun,
	pgonda@google.com, x86@kernel.org

On Tue, Jul 01, 2025 at 03:09:01PM -0700, Ackerley Tng wrote:
> Yan Zhao <yan.y.zhao@intel.com> writes:
> 
> > On Mon, Jun 30, 2025 at 10:22:26PM -0700, Vishal Annapurve wrote:
> >> On Mon, Jun 30, 2025 at 10:04 PM Yan Zhao <yan.y.zhao@intel.com> wrote:
> >> >
> >> > On Tue, Jul 01, 2025 at 05:45:54AM +0800, Edgecombe, Rick P wrote:
> >> > > On Mon, 2025-06-30 at 12:25 -0700, Ackerley Tng wrote:
> >> > > > > So for this we can do something similar. Have the arch/x86 side of TDX grow
> >> > > > > a
> >> > > > > new tdx_buggy_shutdown(). Have it do an all-cpu IPI to kick CPUs out of
> >> > > > > SEAMMODE, wbivnd, and set a "no more seamcalls" bool. Then any SEAMCALLs
> >> > > > > after
> >> > > > > that will return a TDX_BUGGY_SHUTDOWN error, or similar. All TDs in the
> >> > > > > system
> >> > > > > die. Zap/cleanup paths return success in the buggy shutdown case.
> >> > > > >
> >> > > >
> >> > > > Do you mean that on unmap/split failure:
> >> > >
> >> > > Maybe Yan can clarify here. I thought the HWpoison scenario was about TDX module
> >> > My thinking is to set HWPoison to private pages whenever KVM_BUG_ON() was hit in
> >> > TDX. i.e., when the page is still mapped in S-EPT but the TD is bugged on and
> >> > about to tear down.
> >> >
> >> > So, it could be due to KVM or TDX module bugs, which retries can't help.
> >> >
> >> > > bugs. Not TDX busy errors, demote failures, etc. If there are "normal" failures,
> >> > > like the ones that can be fixed with retries, then I think HWPoison is not a
> >> > > good option though.
> >> > >
> >> > > >  there is a way to make 100%
> >> > > > sure all memory becomes re-usable by the rest of the host, using
> >> > > > tdx_buggy_shutdown(), wbinvd, etc?
> >> >
> >> > Not sure about this approach. When TDX module is buggy and the page is still
> >> > accessible to guest as private pages, even with no-more SEAMCALLs flag, is it
> >> > safe enough for guest_memfd/hugetlb to re-assign the page to allow simultaneous
> >> > access in shared memory with potential private access from TD or TDX module?
> >> 
> >> If no more seamcalls are allowed and all cpus are made to exit SEAM
> >> mode then how can there be potential private access from TD or TDX
> >> module?
> > Not sure. As Kirill said "TDX module has creative ways to corrupt it"
> > https://lore.kernel.org/all/zlxgzuoqwrbuf54wfqycnuxzxz2yduqtsjinr5uq4ss7iuk2rt@qaaolzwsy6ki/.
> >
> > Or, could TDX just set a page flag, like what for XEN
> >
> >         /* XEN */
> >         /* Pinned in Xen as a read-only pagetable page. */
> >         PG_pinned = PG_owner_priv_1,
> >
> > e.g.
> > 	PG_tdx_firmware_access = PG_owner_priv_1,
> >
> > Then, guest_memfd checks this flag on every zap and replace it with PG_hwpoison
> > on behalf of TDX?
> 
> I think this question probably arose because of a misunderstanding I
> might have caused. I meant to set the HWpoison flag from the kernel, not
> from within the TDX module. Please see [1].
I understood.
But as Rick pointed out in
https://lore.kernel.org/all/04d3e455d07042a0ab8e244e6462d9011c914581.camel@intel.com/,
manually setting the poison flag in KVM's TDX code (in the host kernel) seems risky.

> In addition, if the TDX module (now referring specifically to the TDX
> module and not the kernel) sets page flags, that won't work with
Marking at per-folio level seems acceptable to me.

> vmemmap-optimized folios. Setting a page flag on a vmemmap-optimized
> folio will be setting the flag on a few pages.
BTW, I have a concern regarding the overhead of vmemmap optimization.

In my system,
with hugetlb_free_vmemmap=false, the TD boot time is around 30s;
with hugetlb_free_vmemmap=true, the TD boot time is around 1m20s;


> [1] https://lore.kernel.org/all/diqzplej4llh.fsf@ackerleytng-ctop.c.googlers.com/

^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge pages
  2025-07-02  8:54                                                                                   ` Yan Zhao
@ 2025-07-02 13:12                                                                                     ` Vishal Annapurve
  0 siblings, 0 replies; 294+ messages in thread
From: Vishal Annapurve @ 2025-07-02 13:12 UTC (permalink / raw)
  To: Yan Zhao
  Cc: Edgecombe, Rick P, ackerleytng@google.com, Shutemov, Kirill,
	Li, Xiaoyao, kvm@vger.kernel.org, Hansen, Dave, david@redhat.com,
	thomas.lendacky@amd.com, tabba@google.com, vbabka@suse.cz,
	quic_eberman@quicinc.com, michael.roth@amd.com,
	linux-kernel@vger.kernel.org, seanjc@google.com, Peng, Chao P,
	Du, Fan, Yamahata, Isaku, pbonzini@redhat.com,
	binbin.wu@linux.intel.com, Weiny, Ira, Li, Zhiquan1,
	jroedel@suse.de, Miao, Jun, pgonda@google.com, x86@kernel.org

On Wed, Jul 2, 2025 at 1:56 AM Yan Zhao <yan.y.zhao@intel.com> wrote:
>
> On Tue, Jul 01, 2025 at 06:32:38AM -0700, Vishal Annapurve wrote:
> > On Tue, Jul 1, 2025 at 2:38 AM Yan Zhao <yan.y.zhao@intel.com> wrote:
> > >
> > > On Tue, Jul 01, 2025 at 01:55:43AM +0800, Edgecombe, Rick P wrote:
> > > > So for this we can do something similar. Have the arch/x86 side of TDX grow a
> > > > new tdx_buggy_shutdown(). Have it do an all-cpu IPI to kick CPUs out of
> > > > SEAMMODE, wbivnd, and set a "no more seamcalls" bool. Then any SEAMCALLs after
> > > > that will return a TDX_BUGGY_SHUTDOWN error, or similar. All TDs in the system
> > > > die. Zap/cleanup paths return success in the buggy shutdown case.
> > > All TDs in the system die could be too severe for unmap errors due to KVM bugs.
> >
> > At this point, I don't see a way to quantify how bad a KVM bug can get
> > unless you have explicit ideas about the severity. We should work on
> > minimizing KVM side bugs too and assuming it would be a rare
> > occurrence I think it's ok to take this intrusive measure.
> >
> > >
> > > > Does it fit? Or, can you guys argue that the failures here are actually non-
> > > > special cases that are worth more complex recovery? I remember we talked about
> > > > IOMMU patterns that are similar, but it seems like the remaining cases under
> > > > discussion are about TDX bugs.
> > > I didn't mention TDX connect previously to avoid introducing unnecessary
> > > complexity.
> > >
> > > For TDX connect, S-EPT is used for private mappings in IOMMU. Unmap could
> > > therefore fail due to pages being pinned for DMA.
> >
> > We are discussing this scenario already[1], where the host will not
> > pin the pages used by secure DMA for the same reasons why we can't
> > have KVM pin the guest_memfd pages mapped in SEPT. Is there some other
> > kind of pinning you are referring to?
> >
> > If there is an ordering in which pages should be unmapped e.g. first
> > in secure IOMMU and then KVM SEPT, then we can ensure the right
> > ordering between invalidation callbacks from guest_memfd.
> It's pinning from a different perspective.
> Please check
> https://lore.kernel.org/all/aGTvTbPHuXbvj59t@yzhao56-desk.sh.intel.com.
>
> > [1] https://lore.kernel.org/lkml/CAGtprH_qh8sEY3s-JucW3n1Wvoq7jdVZDDokvG5HzPf0HV2=pg@mail.gmail.com/#t
> >
> > >
> > > So, my thinking was that if that happens, KVM could set a special flag to folios
> > > pinned for private DMA.
> > >
> > > Then guest_memfd could check the special flag before allowing private-to-shared
> > > conversion, or punch hole.
> > > guest_memfd could check this special flag and choose to poison or leak the
> > > folio.
> > >
> > > Otherwise, if we choose tdx_buggy_shutdown() to "do an all-cpu IPI to kick CPUs
> > > out of SEAMMODE, wbivnd, and set a "no more seamcalls" bool", DMAs may still
> > > have access to the private pages mapped in S-EPT.
> >
> > guest_memfd will have to ensure that pages are unmapped from secure
> > IOMMU pagetables before allowing them to be used by the host.
> >
> > If secure IOMMU pagetables unmapping fails, I would assume it fails in
> > the similar category of rare "KVM/TDX module/IOMMUFD" bug and I think
> > it makes sense to do the same tdx_buggy_shutdown() with such failures
> > as well.
> tdx_buggy_shutdown() should then
> do an all-cpu IPI to kick CPU out of SEAMMODE, wbivnd, and set a "no more
> seamcalls" bool" and informing IOMMUF/VFIO to stop devices.
>
> BTW, is the "no more seamcall" set by KVM at the per-VM level?

The "no more seamcalls" here would be set at the host level.

> If it's per-VM, other TDs could still entering SEAMMODE. So, potential
> corruption is still possible.
> Besides, with "no more seamcalls" upon unmapping failure of a GFN, how to
> reclaim other pages which might succeed otherwise?

I would think that with no more seamcalls on the host, the KVM TDX logic
could safely reclaim all the pages using the WBINVD method.

>
> This approach seems very complex.

With "informing IOMMUF/VFIO to stop devices" specially only for secure
devices, I agree this could become hard to orchestrate unless Dan
Williams or somebody else has better insights here.

^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 09/21] KVM: TDX: Enable 2MB mapping size after TD is RUNNABLE
  2025-07-02  1:07                                                                       ` Yan Zhao
@ 2025-07-02 15:26                                                                         ` Edgecombe, Rick P
  0 siblings, 0 replies; 294+ messages in thread
From: Edgecombe, Rick P @ 2025-07-02 15:26 UTC (permalink / raw)
  To: Zhao, Yan Y
  Cc: Du, Fan, Li, Xiaoyao, Huang, Kai, Shutemov, Kirill, Hansen, Dave,
	david@redhat.com, thomas.lendacky@amd.com, vbabka@suse.cz,
	Li, Zhiquan1, quic_eberman@quicinc.com, michael.roth@amd.com,
	seanjc@google.com, Weiny, Ira, pbonzini@redhat.com, Peng, Chao P,
	Yamahata, Isaku, linux-kernel@vger.kernel.org,
	binbin.wu@linux.intel.com, ackerleytng@google.com,
	kvm@vger.kernel.org, Annapurve, Vishal, tabba@google.com,
	jroedel@suse.de, Miao, Jun, pgonda@google.com, x86@kernel.org

On Wed, 2025-07-02 at 09:07 +0800, Yan Zhao wrote:
> > I don't think because some code might race in the future is a good reason to
> > take the write lock.
> 
> I still prefer to hold write mmu_lock right now.
> 
> Otherwise, we at least need to convert disallow_lpage to atomic variable and
> updating it via an atomic way, e.g. cmpxchg. 
> 
> struct kvm_lpage_info {
>         int disallow_lpage;
> };

This seems like a valid reason. I wanted to make sure there was some reason
besides it just feeling safer.
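
(For completeness, the lockless alternative being referred to would look
roughly like below. KVM_LPAGE_GUEST_INHIBIT is a made-up flag name; the
snippet is illustrative only.)

struct kvm_lpage_info {
	atomic_t disallow_lpage;
};

static void disallow_hugepage_atomic(struct kvm_lpage_info *linfo)
{
	int old = atomic_read(&linfo->disallow_lpage);

	/* Set a sticky "guest inhibit" bit without taking the write mmu_lock. */
	do {
		if (old & KVM_LPAGE_GUEST_INHIBIT)
			return;
	} while (!atomic_try_cmpxchg(&linfo->disallow_lpage, &old,
				     old | KVM_LPAGE_GUEST_INHIBIT));
}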

^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge pages
  2025-07-02  9:08                                                                                       ` Yan Zhao
@ 2025-07-02 15:28                                                                                         ` Edgecombe, Rick P
  0 siblings, 0 replies; 294+ messages in thread
From: Edgecombe, Rick P @ 2025-07-02 15:28 UTC (permalink / raw)
  To: Zhao, Yan Y
  Cc: quic_eberman@quicinc.com, Li, Xiaoyao, Du, Fan, Hansen, Dave,
	david@redhat.com, thomas.lendacky@amd.com, tabba@google.com,
	vbabka@suse.cz, kvm@vger.kernel.org, michael.roth@amd.com,
	seanjc@google.com, binbin.wu@linux.intel.com, Peng, Chao P,
	Shutemov, Kirill, ackerleytng@google.com, Yamahata, Isaku,
	Weiny, Ira, pbonzini@redhat.com, linux-kernel@vger.kernel.org,
	Annapurve, Vishal, jroedel@suse.de, Miao, Jun, Li, Zhiquan1,
	pgonda@google.com, x86@kernel.org

On Wed, 2025-07-02 at 17:08 +0800, Yan Zhao wrote:
> > With the no more seamcall's approach it should be safe (for the system).
> > This is
> > essentially what we are doing for kexec.
> AFAIK, kexec stops devices first by invoking device's shutdown hook.
> Similarly, "the no more seamcall's approach" should interact with devices to
> avoid DMAs via private keys

This can't happen today.

^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 15/21] KVM: TDX: Support huge page splitting with exclusive kvm->mmu_lock
  2025-04-24  3:08 ` [RFC PATCH 15/21] KVM: TDX: Support huge page splitting with exclusive kvm->mmu_lock Yan Zhao
  2025-05-20  6:18   ` Binbin Wu
@ 2025-07-02 15:47   ` Edgecombe, Rick P
  1 sibling, 0 replies; 294+ messages in thread
From: Edgecombe, Rick P @ 2025-07-02 15:47 UTC (permalink / raw)
  To: pbonzini@redhat.com, seanjc@google.com, Zhao, Yan Y
  Cc: Shutemov, Kirill, quic_eberman@quicinc.com, Li, Xiaoyao,
	kvm@vger.kernel.org, Hansen, Dave, david@redhat.com,
	thomas.lendacky@amd.com, tabba@google.com, Li, Zhiquan1, Du, Fan,
	linux-kernel@vger.kernel.org, michael.roth@amd.com, Weiny, Ira,
	vbabka@suse.cz, binbin.wu@linux.intel.com, ackerleytng@google.com,
	Yamahata, Isaku, Peng, Chao P, Annapurve, Vishal, jroedel@suse.de,
	Miao, Jun, pgonda@google.com, x86@kernel.org

On Thu, 2025-04-24 at 11:08 +0800, Yan Zhao wrote:
> +static int tdx_spte_demote_private_spte(struct kvm *kvm, gfn_t gfn,
> +					enum pg_level level, struct page *page)
> +{
> +	int tdx_level = pg_level_to_tdx_sept_level(level);
> +	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
> +	gpa_t gpa = gfn_to_gpa(gfn);
> +	u64 err, entry, level_state;
> +
> +	do {
> +		err = tdh_mem_page_demote(&kvm_tdx->td, gpa, tdx_level, page,
> +					  &entry, &level_state);
> +	} while (err == TDX_INTERRUPTED_RESTARTABLE);
> +
> +	if (unlikely(tdx_operand_busy(err))) {
> +		tdx_no_vcpus_enter_start(kvm);
> +		err = tdh_mem_page_demote(&kvm_tdx->td, gpa, tdx_level, page,
> +					  &entry, &level_state);
> +		tdx_no_vcpus_enter_stop(kvm);
> +	}
> +
> +	if (KVM_BUG_ON(err, kvm)) {
> +		pr_tdx_error_2(TDH_MEM_PAGE_DEMOTE, err, entry, level_state);
> +		return -EIO;
> +	}
> +	return 0;
> +}
> +
> +int tdx_sept_split_private_spt(struct kvm *kvm, gfn_t gfn, enum pg_level level,
> +			       void *private_spt)
> +{
> +	struct page *page = virt_to_page(private_spt);
> +	int ret;
> +
> +	if (KVM_BUG_ON(to_kvm_tdx(kvm)->state != TD_STATE_RUNNABLE || level != PG_LEVEL_2M, kvm))
> +		return -EINVAL;
> +
> +	ret = tdx_sept_zap_private_spte(kvm, gfn, level, page);
> +	if (ret <= 0)
> +		return ret;
> +
> +	tdx_track(kvm);
> +
> +	return tdx_spte_demote_private_spte(kvm, gfn, level, page);
> +}

The latest TDX docs talk about a feature called NON_BLOCKING_RESIZE. It allows
for demote without blocking. If we rely on this feature we could simplify this
code. Not having transitory blocked state would reduce the scenarios that have
to be accounted for. We could also make demote operation accommodate failures
(rollback on SEAMCALL BUSY issue), which means mmu write lock is no longer
needed. It would have helped the fault path demote issue, which we have now
worked around. But still, it seems more flexible as well as simpler.

What about relying on this feature for KVM TDX huge mappings?

^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge pages
  2025-07-02 11:24                                                                                           ` Yan Zhao
@ 2025-07-02 18:43                                                                                             ` Ackerley Tng
  2025-07-03  4:54                                                                                               ` Yan Zhao
  0 siblings, 1 reply; 294+ messages in thread
From: Ackerley Tng @ 2025-07-02 18:43 UTC (permalink / raw)
  To: Yan Zhao
  Cc: Vishal Annapurve, Edgecombe, Rick P, quic_eberman@quicinc.com,
	Li, Xiaoyao, Shutemov, Kirill, Hansen, Dave, david@redhat.com,
	thomas.lendacky@amd.com, vbabka@suse.cz, tabba@google.com,
	Du, Fan, michael.roth@amd.com, seanjc@google.com,
	binbin.wu@linux.intel.com, Peng, Chao P, kvm@vger.kernel.org,
	Yamahata, Isaku, linux-kernel@vger.kernel.org, Weiny, Ira,
	pbonzini@redhat.com, Li, Zhiquan1, jroedel@suse.de, Miao, Jun,
	pgonda@google.com, x86@kernel.org

Yan Zhao <yan.y.zhao@intel.com> writes:

> On Tue, Jul 01, 2025 at 03:09:01PM -0700, Ackerley Tng wrote:
>> Yan Zhao <yan.y.zhao@intel.com> writes:
>> 
>> > On Mon, Jun 30, 2025 at 10:22:26PM -0700, Vishal Annapurve wrote:
>> >> On Mon, Jun 30, 2025 at 10:04 PM Yan Zhao <yan.y.zhao@intel.com> wrote:
>> >> >
>> >> > On Tue, Jul 01, 2025 at 05:45:54AM +0800, Edgecombe, Rick P wrote:
>> >> > > On Mon, 2025-06-30 at 12:25 -0700, Ackerley Tng wrote:
>> >> > > > > So for this we can do something similar. Have the arch/x86 side of TDX grow
>> >> > > > > a
>> >> > > > > new tdx_buggy_shutdown(). Have it do an all-cpu IPI to kick CPUs out of
>> >> > > > > SEAMMODE, wbivnd, and set a "no more seamcalls" bool. Then any SEAMCALLs
>> >> > > > > after
>> >> > > > > that will return a TDX_BUGGY_SHUTDOWN error, or similar. All TDs in the
>> >> > > > > system
>> >> > > > > die. Zap/cleanup paths return success in the buggy shutdown case.
>> >> > > > >
>> >> > > >
>> >> > > > Do you mean that on unmap/split failure:
>> >> > >
>> >> > > Maybe Yan can clarify here. I thought the HWpoison scenario was about TDX module
>> >> > My thinking is to set HWPoison to private pages whenever KVM_BUG_ON() was hit in
>> >> > TDX. i.e., when the page is still mapped in S-EPT but the TD is bugged on and
>> >> > about to tear down.
>> >> >
>> >> > So, it could be due to KVM or TDX module bugs, which retries can't help.
>> >> >
>> >> > > bugs. Not TDX busy errors, demote failures, etc. If there are "normal" failures,
>> >> > > like the ones that can be fixed with retries, then I think HWPoison is not a
>> >> > > good option though.
>> >> > >
>> >> > > >  there is a way to make 100%
>> >> > > > sure all memory becomes re-usable by the rest of the host, using
>> >> > > > tdx_buggy_shutdown(), wbinvd, etc?
>> >> >
>> >> > Not sure about this approach. When TDX module is buggy and the page is still
>> >> > accessible to guest as private pages, even with no-more SEAMCALLs flag, is it
>> >> > safe enough for guest_memfd/hugetlb to re-assign the page to allow simultaneous
>> >> > access in shared memory with potential private access from TD or TDX module?
>> >> 
>> >> If no more seamcalls are allowed and all cpus are made to exit SEAM
>> >> mode then how can there be potential private access from TD or TDX
>> >> module?
>> > Not sure. As Kirill said "TDX module has creative ways to corrupt it"
>> > https://lore.kernel.org/all/zlxgzuoqwrbuf54wfqycnuxzxz2yduqtsjinr5uq4ss7iuk2rt@qaaolzwsy6ki/.
>> >
>> > Or, could TDX just set a page flag, like what for XEN
>> >
>> >         /* XEN */
>> >         /* Pinned in Xen as a read-only pagetable page. */
>> >         PG_pinned = PG_owner_priv_1,
>> >
>> > e.g.
>> > 	PG_tdx_firmware_access = PG_owner_priv_1,
>> >
>> > Then, guest_memfd checks this flag on every zap and replace it with PG_hwpoison
>> > on behalf of TDX?
>> 
>> I think this question probably arose because of a misunderstanding I
>> might have caused. I meant to set the HWpoison flag from the kernel, not
>> from within the TDX module. Please see [1].
> I understood.
> But as Rick pointed out
> https://lore.kernel.org/all/04d3e455d07042a0ab8e244e6462d9011c914581.camel@intel.com/,
> Manually setting the poison flag in KVM's TDX code (in host kernel) seems risky.
>

Will address this in a reply to Rick's email, there's more context
there that I'd like to clarify.

>> In addition, if the TDX module (now referring specifically to the TDX
>> module and not the kernel) sets page flags, that won't work with
> Marking at per-folio level seems acceptable to me.
>

Will address this in a reply to Rick's email, there's more context there
that I'd like to clarify.

>> vmemmap-optimized folios. Setting a page flag on a vmemmap-optimized
>> folio will be setting the flag on a few pages.
> BTW, I have a concern regarding to the overhead vmemmap-optimization.
>
> In my system,
> with hugetlb_free_vmemmap=false, the TD boot time is around 30s;
> with hugetlb_free_vmemmap=true, the TD boot time is around 1m20s;
>
>

I'm aware of this; I was investigating something similar internally. In your
system and test, were you working with 1G pages or 2M pages?

>> [1] https://lore.kernel.org/all/diqzplej4llh.fsf@ackerleytng-ctop.c.googlers.com/

^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge pages
  2025-07-01 22:37                                                                                         ` Edgecombe, Rick P
@ 2025-07-02 20:57                                                                                           ` Ackerley Tng
  2025-07-02 23:51                                                                                             ` Edgecombe, Rick P
  0 siblings, 1 reply; 294+ messages in thread
From: Ackerley Tng @ 2025-07-02 20:57 UTC (permalink / raw)
  To: Edgecombe, Rick P, Zhao, Yan Y
  Cc: quic_eberman@quicinc.com, Li, Xiaoyao, Li, Zhiquan1, Du, Fan,
	Hansen, Dave, david@redhat.com, thomas.lendacky@amd.com,
	tabba@google.com, vbabka@suse.cz, Shutemov, Kirill,
	michael.roth@amd.com, seanjc@google.com, Weiny, Ira, Peng, Chao P,
	binbin.wu@linux.intel.com, Yamahata, Isaku, pbonzini@redhat.com,
	kvm@vger.kernel.org, Annapurve, Vishal, jroedel@suse.de,
	linux-kernel@vger.kernel.org, Miao, Jun, pgonda@google.com,
	x86@kernel.org

"Edgecombe, Rick P" <rick.p.edgecombe@intel.com> writes:

> On Tue, 2025-07-01 at 14:48 -0700, Ackerley Tng wrote:
>> Perhaps we had different understandings of f/g :P
>
> Ah yes, I thought you were saying that guestmemfd would use poison internally
> via some gmem_buggy_page() or similar. I guess I thought it is more of
> guestmemfd's job. But as Yan pointed out, we need to handle non gmem page errors
> too. Currently we leak, but it would be nice to keep the handling symmetrical.
> Which would be easier if we did it all in TDX code.
>

I meant to set HWpoison externally from guest_memfd because I feel that
it is a separate thing. Unmap failures are similar to discovering a
memory error. If setting HWpoison on memory error is external to
guest_memfd, setting HWpoison on unmap failure should also be
conceptually external to guest_memfd.

After Yan pointed out that non-guest_memfd page errors need to be
handled, it aligns with the idea that setting HWpoison is external to
guest_memfd.

I agree keeping the handling symmetrical would be best, so in both cases
the part of KVM TDX code that sees the unmap failure should directly set
HWpoison and not go through guest_memfd.

>> 
>> I meant that TDX module should directly set the HWpoison flag on the
>> folio (HugeTLB or 4K, guest_memfd or not), not call into guest_memfd.
>> 
>> guest_memfd will then check this flag when necessary, specifically:
>> 
>> * On faults, either into guest or host page tables 
>> * When freeing the page
>>     * guest_memfd will not return HugeTLB pages that are poisoned to
>>       HugeTLB and just leak it
>>     * 4K pages will be freed normally, because free_pages_prepare() will
>>       check for HWpoison and skip freeing, from __folio_put() ->
>>       free_frozen_pages() -> __free_frozen_pages() ->
>>       free_pages_prepare()
>> * I believe guest_memfd doesn't need to check HWpoison on conversions [1]
>> 
>> [1] https://lore.kernel.org/all/diqz5xghjca4.fsf@ackerleytng-ctop.c.googlers.com/
>
> If a poisoned page continued to be used, it's a bit weird, no? 

Do you mean "continued to be used" in the sense that it is present in a
filemap and belongs to a (guest_memfd) inode?

A poisoned page is not faulted in anywhere, and in that sense the page
is not "used". In the case of regular poisoning as in a call to
memory_failure(), the page is unmapped from the page tables. If that
page belongs to guest_memfd, in today's code [2], guest_memfd
intentionally does not truncate it from the filemap. For guest_memfd,
handling the HWpoison at fault time is by design; keeping it present in
the filemap is by design.

In the case of TDX unmap failures leading to HWpoison, the only place it
may remain mapped is in the Secure-EPTs. I use "may" because I'm not
sure about how badly the unmap failed. But either way, the TD gets
bugged, all vCPUs of the TD are stopped, so the HWpoison-ed page is no
longer "used".

[2] https://github.com/torvalds/linux/blob/b4911fb0b060899e4eebca0151eb56deb86921ec/virt/kvm/guest_memfd.c#L334

> It could take an
> #MC for another reason from userspace and the handling code would see the page
> flag is already set. If it doesn't already trip up some MM code somewhere, it
> might put undue burden on the memory failure code to have to expect repeated
> poisoning of the same memory.
>

If it does take another #MC and go to memory_failure(), memory_failure()
already checks for the HWpoison flag being set [3]. This is handled by
killing the process. There is similar handling for a HugeTLB
folio. We're not introducing anything new by using HWpoison; we're
buying into the HWpoison framework, which already handles seeing a
HWpoison when handling a poison.

[3] https://github.com/torvalds/linux/blob/b4911fb0b060899e4eebca0151eb56deb86921ec/mm/memory-failure.c#L2270

>> 
>> > What about a kvm_gmem_buggy_cleanup() instead of the system wide one. KVM calls
>> > it and then proceeds to bug the TD only from the KVM side. It's not as safe for
>> > the system, because who knows what a buggy TDX module could do. But TDX module
>> > could also be buggy without the kernel catching wind of it.
>> > 
>> > Having a single callback to basically bug the fd would solve the atomic context
>> > issue. Then guestmemfd could dump the entire fd into memory_failure() instead of
>> > returning the pages. And developers could respond by fixing the bug.
>> > 
>> 
>> This could work too.
>> 
>> I'm in favor of buying into the HWpoison system though, since we're
>> quite sure this is fair use of HWpoison.
>
> Do you mean manually setting the poison flag, or calling into memory_failure(),
> and friends?

I mean manually setting the poison flag.

* If regular 4K page, set the flag.
* If THP page (not (yet) supported by guest_memfd), set the poison flag
  on the specific subpage causing the error, and in addition set THP's
  has_hwpoisoned flag
* If HugeTLB page, call folio_set_hugetlb_hwpoison() on the subpage.

This is already the process in memory_failure() and perhaps some
refactoring could be done.
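
To make it concrete, an untested sketch of the above (note that
folio_set_hugetlb_hwpoison() is currently internal to mm/memory-failure.c,
so the refactoring mentioned would be needed; the other helpers follow the
flags named above and should be double-checked):

static void mark_unmap_failure_poisoned(struct folio *folio, struct page *page)
{
	if (folio_test_hugetlb(folio)) {
		/* HugeTLB: record the bad subpage in the folio's raw hwp list. */
		folio_set_hugetlb_hwpoison(folio, page);
	} else if (folio_test_large(folio)) {
		/* THP: poison the exact subpage and set the summary flag. */
		SetPageHWPoison(page);
		folio_set_has_hwpoisoned(folio);
	} else {
		/* Regular 4K page. */
		SetPageHWPoison(page);
	}
}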

I think calling memory_failure() would do too much, since in addition to
setting the flag, memory_failure() also sometimes does freeing and may
kill processes, and triggers the users of the page to further handle the
HWpoison.

> If we set them manually, we need to make sure that it does not have
> side effects on the machine check handler. It seems risky/messy to me. But
> Kirill didn't seem worried.
>

I believe the memory_failure() is called from the machine check handler:

DEFINE_IDTENTRY_MCE(exc_machine_check)
  -> exc_machine_check_kernel()
     -> do_machine_check()
        -> kill_me_now() or kill_me_maybe()
           -> memory_failure()

(I might have quoted just one of the paths and I'll have to look into it
more.)

For now, IIUC setting the poison flag is a subset of memory_failure(), which is a
subset of what the machine check handler does.

memory_failure() handles an already poisoned page, so I don't see any
side effects.

I'm happy that Kirill didn't seem worried :) Rick, let me know if you
see any specific risks.

> Maybe we could bring the poison page flag up to DavidH and see if there is any
> concern before going down this path too far?
>

I can do that. David's cc-ed on this email, and I hope to get a chance
to talk about handling HWpoison (generally, not TDX specifically) at the
guest_memfd bi-weekly upstream call on 2025-07-10 so I can bring this up
too.

>> 
>> Are you saying kvm_gmem_buggy_cleanup() will just set the HWpoison flag
>> on the parts of the folios in trouble?
>
> I was saying kvm_gmem_buggy_cleanup() can set a bool on the fd, similar to
> VM_BUG_ON() setting vm_dead.

Setting a bool on the fd is a possible option too. Comparing an
inode-level boolean and HWpoison, I still prefer HWpoison because

1. HWpoison gives us more information about which (sub)folio was
   poisoned. We can think of the bool on the fd as an fd-wide
   poisoning. If we don't know which subpage has an error, we're forced
   to leak the entire fd when the inode is released, which could be a
   huge amount of memory leaked.
2. HWpoison is already checked on faults, so there is no need to add an
   extra check on a bool
3. For HugeTLB, HWpoison will have to be summarized/itemized on merge/split to handle
   regular non-TDX related HWpoisons, so no additional code there.

> After an invalidate, if gmem see this, it needs to
> assume everything failed, and invalidate everything and poison all guest memory.
> The point was to have the simplest possible handling for a rare error.

I agree a bool will probably result in fewer lines of code being changed
and could be a fair first cut, but I feel like we would very quickly
need another patch series to get more granular information and not have
to leak an entire fd worth of memory.

Along these lines, Yan seems to prefer setting HWpoison on the entire
folio without going into the details of the exact subfolios being
poisoned. I think this is a possible in-between solution that doesn't
require leaking the entire fd worth of memory, but it still leaks more
than just where the actual error happened.

I'm willing to go with just setting HWpoison on the entire large folio
as a first cut and leak more memory than necessary (because if we don't
know which subpage it is, we are forced to leak everything to be safe).

However, this patch series needs a large page provider in guest_memfd, and
will only land either after THP or HugeTLB support lands in
guest_memfd.

For now, if you're testing on guest_memfd+HugeTLB,
folio_set_hugetlb_hwpoison() already exists, so why not use it?

> Although
> it's only a proposal. The TDX emergency shutdown option may be simpler still.
> But killing all TDs is not ideal. So thought we could at least consider other
> options.
>
> If we have a solution where TDX needs to do something complicated because
> something of its specialness, it may get NAKed.

Using HWpoison is generic, since guest_memfd needs to handle HWpoison
for regular memory errors anyway. Even if it is not a final solution, it
should be good enough, if not for this patch series to merge, at least
for the next RFC of this patch series. :)

> This is my main concern with the
> direction of this problem/solution. AFAICT, we are not even sure of a concrete
> problem, and it appears to be special to TDX. So the complexity budget should be
> small. It's in sharp contrast to the length of the discussion.


^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge pages
  2025-07-02 20:57                                                                                           ` Ackerley Tng
@ 2025-07-02 23:51                                                                                             ` Edgecombe, Rick P
  2025-07-08 21:19                                                                                               ` Ackerley Tng
  0 siblings, 1 reply; 294+ messages in thread
From: Edgecombe, Rick P @ 2025-07-02 23:51 UTC (permalink / raw)
  To: ackerleytng@google.com, Zhao, Yan Y
  Cc: Du, Fan, Li, Xiaoyao, Shutemov, Kirill, Hansen, Dave,
	david@redhat.com, Li, Zhiquan1, vbabka@suse.cz, tabba@google.com,
	thomas.lendacky@amd.com, michael.roth@amd.com, seanjc@google.com,
	Weiny, Ira, Peng, Chao P, binbin.wu@linux.intel.com,
	Yamahata, Isaku, pbonzini@redhat.com, quic_eberman@quicinc.com,
	Annapurve, Vishal, jroedel@suse.de, linux-kernel@vger.kernel.org,
	Miao, Jun, kvm@vger.kernel.org, pgonda@google.com, x86@kernel.org

On Wed, 2025-07-02 at 13:57 -0700, Ackerley Tng wrote:
> > 
> > If a poisoned page continued to be used, it's a bit weird, no? 
> 
> Do you mean "continued to be used" in the sense that it is present in a
> filemap and belongs to a (guest_memfd) inode?

I mean anywhere it might get read or written to again.

> 
> A poisoned page is not faulted in anywhere, and in that sense the page
> is not "used". In the case of regular poisoning as in a call to
> memory_failure(), the page is unmapped from the page tables. If that
> page belongs to guest_memfd, in today's code [2], guest_memfd
> intentionally does not truncate it from the filemap. For guest_memfd,
> handling the HWpoison at fault time is by design; keeping it present in
> the filemap is by design.

I thought I read that you would allow it to be re-used. I see that the code
already checks for poison in the kvm_gmem_get_pfn() path and the mmap() path. So
it will just sit in the fd and not be handed out again. I think it's ok. Well,
as long as conversion to shared doesn't involve zeroing...?

> 
> In the case of TDX unmap failures leading to HWpoison, the only place it
> may remain mapped is in the Secure-EPTs. I use "may" because I'm not
> sure about how badly the unmap failed. But either way, the TD gets
> bugged, all vCPUs of the TD are stopped, so the HWpoison-ed page is no
> longer "used".
> 
> [2]
> https://github.com/torvalds/linux/blob/b4911fb0b060899e4eebca0151eb56deb86921ec/virt/kvm/guest_memfd.c#L334

Yes, I saw that. It looks like special error case treatment for the state we are
setting up.

> 
> > It could take an
> > #MC for another reason from userspace and the handling code would see the
> > page
> > flag is already set. If it doesn't already trip up some MM code somewhere,
> > it
> > might put undue burden on the memory failure code to have to expect repeated
> > poisoning of the same memory.
> > 
> 
> If it does take another #MC and go to memory_failure(), memory_failure()
> already checks for the HWpoison flag being set [3]. This is handled by
> killing the process. There is similar handling for a HugeTLB
> folio. We're not introducing anything new by using HWpoison; we're
> buying into the HWpoison framework, which already handles seeing a
> HWpoison when handling a poison.

Do you see another user that is setting the poison flag manually like proposed?
(i.e. not through memory failure handlers)

> 
> [3]
> https://github.com/torvalds/linux/blob/b4911fb0b060899e4eebca0151eb56deb86921ec/mm/memory-failure.c#L2270
> 
> > > 
> > > > What about a kvm_gmem_buggy_cleanup() instead of the system wide one.
> > > > KVM calls
> > > > it and then proceeds to bug the TD only from the KVM side. It's not as
> > > > safe for
> > > > the system, because who knows what a buggy TDX module could do. But TDX
> > > > module
> > > > could also be buggy without the kernel catching wind of it.
> > > > 
> > > > Having a single callback to basically bug the fd would solve the atomic
> > > > context
> > > > issue. Then guestmemfd could dump the entire fd into memory_failure()
> > > > instead of
> > > > returning the pages. And developers could respond by fixing the bug.
> > > > 
> > > 
> > > This could work too.
> > > 
> > > I'm in favor of buying into the HWpoison system though, since we're
> > > quite sure this is fair use of HWpoison.
> > 
> > Do you mean manually setting the poison flag, or calling into
> > memory_failure(),
> > and friends?
> 
> I mean manually setting the poison flag.
> 
> * If regular 4K page, set the flag.
> * If THP page (not (yet) supported by guest_memfd), set the poison flag
>   on the specific subpage causing the error, and in addition set THP'S
> has_hwpoison
>   flag
> * If HugeTLB page, call folio_set_hugetlb_hwpoison() on the subpage.
> 
> This is already the process in memory_failure() and perhaps some
> refactoring could be done.
> 
> I think calling memory_failure() would do too much, since in addition to
> setting the flag, memory_failure() also sometimes does freeing and may
> kill processes, and triggers the users of the page to further handle the
> HWpoison.

It definitely seems like there is more involved than setting the flag, which
means for our case we should try to understand what we are skipping and how it
fits with the rest of the kernel. Is any code that checks for poison assuming
that the memory_failure() stuff has been done? Stuff like that.

> 
> > If we set them manually, we need to make sure that it does not have
> > side effects on the machine check handler. It seems risky/messy to me. But
> > Kirill didn't seem worried.
> > 
> 
> I believe the memory_failure() is called from the machine check handler:
> 
> DEFINE_IDTENTRY_MCE(exc_machine_check)
>   -> exc_machine_check_kernel()
>      -> do_machine_check()
>         -> kill_me_now() or kill_me_maybe()
>            -> memory_failure()
> 
> (I might have quoted just one of the paths and I'll have to look into it
> more.)

It looked that way to me too. But it works from other contexts. See
MADV_HWPOISON (which is for testing).

> 
> For now, IIUC setting the poison flag is a subset of memory_failure(), which
> is a
> subset of what the machine check handler does.
> 
> memory_failure() handles an already poisoned page, so I don't see any
> side effects.
> 
> I'm happy that Kirill didn't seem worried :) Rick, let me know if you
> see any specific risks.
> 
> > Maybe we could bring the poison page flag up to DavidH and see if there is
> > any
> > concern before going down this path too far?
> > 
> 
> I can do that. David's cc-ed on this email, and I hope to get a chance
> to talk about handling HWpoison (generally, not TDX specifically) at the
> guest_memfd bi-weekly upstream call on 2025-07-10 so I can bring this up
> too.

Ok sounds good. Should we just continue the discussion there? I can try to
attend.

> 
> > > 
> > > Are you saying kvm_gmem_buggy_cleanup() will just set the HWpoison flag
> > > on the parts of the folios in trouble?
> > 
> > I was saying kvm_gmem_buggy_cleanup() can set a bool on the fd, similar to
> > VM_BUG_ON() setting vm_dead.
> 
> Setting a bool on the fd is a possible option too. Comparing an
> inode-level boolean and HWpoison, I still prefer HWpoison because
> 
> 1. HWpoison gives us more information about which (sub)folio was
>    poisoned. We can think of the bool on the fd as an fd-wide
>    poisoning. If we don't know which subpage has an error, we're forced
>    to leak the entire fd when the inode is released, which could be a
>    huge amount of memory leaked.
> 2. HWpoison is already checked on faults, so there is no need to add an
>    extra check on a bool
> 3. For HugeTLB, HWpoison will have to be summarized/itemized on merge/split to
> handle
>    regular non-TDX related HWpoisons, so no additional code there.
> 
> > After an invalidate, if gmem see this, it needs to
> > assume everything failed, and invalidate everything and poison all guest
> > memory.
> > The point was to have the simplest possible handling for a rare error.
> 
> I agree a bool will probably result in fewer lines of code being changed
> and could be a fair first cut, but I feel like we would very quickly
> need another patch series to get more granular information and not have
> to leak an entire fd worth of memory.

We will only leak an entire VM's worth of memory if there is a bug, though I'm
not sure what form it would take. The kernel doesn't usually have a lot of
defensive code to handle bugs elsewhere, unless it's to help debugging. And
especially for other platform software (BIOS, etc), it should try to stay out
of the job of maintaining code to work around unfixed bugs. And here we are
working around *potential bugs*.

So another *possible* solution is to expect the TDX module/KVM to work. Kill
the TD, return success to the invalidation, and hope that it doesn't do
anything to those zombie mappings. It will likely work. Probably much more
likely to work than some other warning cases in the kernel. As far as
debugging, if strange crashes are observed after a big splat, it can be a
good hint.

Unless Yan has some specific case to worry about that she has been holding on to
that makes this error condition a more expected state. That could change things.

> 
> Along these lines, Yan seems to prefer setting HWpoison on the entire
> folio without going into the details of the exact subfolios being
> poisoned. I think this is a possible in-between solution that doesn't
> require leaking the entire fd worth of memory, but it still leaks more
> than just where the actual error happened.
> 
> I'm willing to go with just setting HWpoison on the entire large folio
> as a first cut and leak more memory than necessary (because if we don't
> know which subpage it is, we are forced to leak everything to be safe).

Leaking more memory than necessary in a bug case seems totally ok to me.

> 
> However, this patch series needs a large page provider in guest_memfd, and
> will only land either after THP or HugeTLB support lands in
> guest_memfd.
> 
> For now if you're testing on guest_memfd+HugeTLB,
> folio_set_hugetlb_hwpoison() already exists, why not use it?
> 
> > Although
> > it's only a proposal. The TDX emergency shutdown option may be simpler
> > still.
> > But killing all TDs is not ideal. So thought we could at least consider
> > other
> > options.
> > 
> > If we have a solution where TDX needs to do something complicated because
> > something of its specialness, it may get NAKed.
> 
> Using HWpoison is generic, since guest_memfd needs to handle HWpoison
> for regular memory errors anyway. Even if it is not a final solution, it
> should be good enough, if not for this patch series to merge, at least
> for the next RFC of this patch series. :)

Yes, maybe. If we have a normal, easy, non-imposing solution for handling the
error then I won't object.

^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge pages
  2025-07-02 18:43                                                                                             ` Ackerley Tng
@ 2025-07-03  4:54                                                                                               ` Yan Zhao
  2025-07-14 19:32                                                                                                 ` Ackerley Tng
  0 siblings, 1 reply; 294+ messages in thread
From: Yan Zhao @ 2025-07-03  4:54 UTC (permalink / raw)
  To: Ackerley Tng
  Cc: Vishal Annapurve, Edgecombe, Rick P, quic_eberman@quicinc.com,
	Li, Xiaoyao, Shutemov, Kirill, Hansen, Dave, david@redhat.com,
	thomas.lendacky@amd.com, vbabka@suse.cz, tabba@google.com,
	Du, Fan, michael.roth@amd.com, seanjc@google.com,
	binbin.wu@linux.intel.com, Peng, Chao P, kvm@vger.kernel.org,
	Yamahata, Isaku, linux-kernel@vger.kernel.org, Weiny, Ira,
	pbonzini@redhat.com, Li, Zhiquan1, jroedel@suse.de, Miao, Jun,
	pgonda@google.com, x86@kernel.org

On Wed, Jul 02, 2025 at 11:43:23AM -0700, Ackerley Tng wrote:
> >> vmemmap-optimized folios. Setting a page flag on a vmemmap-optimized
> >> folio will be setting the flag on a few pages.
> > BTW, I have a concern regarding to the overhead vmemmap-optimization.
> >
> > In my system,
> > with hugetlb_free_vmemmap=false, the TD boot time is around 30s;
> > with hugetlb_free_vmemmap=true, the TD boot time is around 1m20s;
> 
> I'm aware of this, was investigating this for something similar
> internally. In your system and test, were you working with 1G pages, or
> 2M pages?
2M pages. 

^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 02/21] x86/virt/tdx: Enhance tdh_mem_page_aug() to support huge pages
  2025-04-24  3:04 ` [RFC PATCH 02/21] x86/virt/tdx: Enhance tdh_mem_page_aug() to support huge pages Yan Zhao
                     ` (3 preceding siblings ...)
  2025-05-15  2:16   ` Chao Gao
@ 2025-07-08  8:48   ` Yan Zhao
  2025-07-08 13:55     ` Edgecombe, Rick P
  4 siblings, 1 reply; 294+ messages in thread
From: Yan Zhao @ 2025-07-08  8:48 UTC (permalink / raw)
  To: pbonzini, seanjc
  Cc: linux-kernel, kvm, x86, rick.p.edgecombe, dave.hansen,
	kirill.shutemov, tabba, ackerleytng, quic_eberman, michael.roth,
	david, vannapurve, vbabka, jroedel, thomas.lendacky, pgonda,
	zhiquan1.li, fan.du, jun.miao, ira.weiny, isaku.yamahata,
	xiaoyao.li, binbin.wu, chao.p.peng

On Thu, Apr 24, 2025 at 11:04:28AM +0800, Yan Zhao wrote:
> diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> index f5e2a937c1e7..a66d501b5677 100644
> --- a/arch/x86/virt/vmx/tdx/tdx.c
> +++ b/arch/x86/virt/vmx/tdx/tdx.c
> @@ -1595,9 +1595,18 @@ u64 tdh_mem_page_aug(struct tdx_td *td, u64 gpa, int level, struct page *page, u
According to the discussion in DPAMT [*],
"hpa here points to a 2M region that pamt_pages covers. We don't have
struct page that represents it. Passing 4k struct page would be
misleading IMO."

Should we update tdh_mem_page_aug() accordingly to use hpa?
Or use struct folio instead?

[*] https://lore.kernel.org/all/3coaqkcfp7xtpvh2x4kph55qlopupknm7dmzqox6fakzaedhem@a2oysbvbshpm/


>  		.rdx = tdx_tdr_pa(td),
>  		.r8 = page_to_phys(page),
>  	};
> +	unsigned long nr_pages = 1 << (level * 9);
> +	struct folio *folio = page_folio(page);
> +	unsigned long idx = 0;
>  	u64 ret;
>  
> -	tdx_clflush_page(page);
> +	if (!(level >= TDX_PS_4K && level < TDX_PS_NR) ||
> +	    (folio_page_idx(folio, page) + nr_pages > folio_nr_pages(folio)))
> +		return -EINVAL;
> +
> +	while (nr_pages--)
> +		tdx_clflush_page(nth_page(page, idx++));
> +
>  	ret = seamcall_ret(TDH_MEM_PAGE_AUG, &args);
>  
>  	*ext_err1 = args.rcx;
> -- 
> 2.43.2
> 

^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 02/21] x86/virt/tdx: Enhance tdh_mem_page_aug() to support huge pages
  2025-07-08  8:48   ` Yan Zhao
@ 2025-07-08 13:55     ` Edgecombe, Rick P
  2025-07-08 15:29       ` Vishal Annapurve
  2025-07-09  2:23       ` Yan Zhao
  0 siblings, 2 replies; 294+ messages in thread
From: Edgecombe, Rick P @ 2025-07-08 13:55 UTC (permalink / raw)
  To: pbonzini@redhat.com, seanjc@google.com, Zhao, Yan Y
  Cc: Shutemov, Kirill, quic_eberman@quicinc.com, Li, Xiaoyao,
	kvm@vger.kernel.org, Hansen, Dave, david@redhat.com,
	thomas.lendacky@amd.com, tabba@google.com, Li, Zhiquan1, Du, Fan,
	linux-kernel@vger.kernel.org, michael.roth@amd.com, Weiny, Ira,
	vbabka@suse.cz, binbin.wu@linux.intel.com, ackerleytng@google.com,
	Yamahata, Isaku, Peng, Chao P, Annapurve, Vishal, jroedel@suse.de,
	Miao, Jun, pgonda@google.com, x86@kernel.org

On Tue, 2025-07-08 at 16:48 +0800, Yan Zhao wrote:
> On Thu, Apr 24, 2025 at 11:04:28AM +0800, Yan Zhao wrote:
> > diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> > index f5e2a937c1e7..a66d501b5677 100644
> > --- a/arch/x86/virt/vmx/tdx/tdx.c
> > +++ b/arch/x86/virt/vmx/tdx/tdx.c
> > @@ -1595,9 +1595,18 @@ u64 tdh_mem_page_aug(struct tdx_td *td, u64 gpa, int level, struct page *page, u
> According to the discussion in DPAMT [*],
> "hpa here points to a 2M region that pamt_pages covers. We don't have
> struct page that represents it. Passing 4k struct page would be
> misleading IMO."
> 
> Should we update tdh_mem_page_aug() accordingly to use hpa?
> Or use struct folio instead?
> 
> [*] https://lore.kernel.org/all/3coaqkcfp7xtpvh2x4kph55qlopupknm7dmzqox6fakzaedhem@a2oysbvbshpm/

The original seamcall wrapper patches used "u64 hpa", etc everywhere. The
feedback was that it was too error prone to not have types. We looked at using
kvm types (hpa_t, etc), but the type checking was still just surface level [0].

So the goal is to reduce errors and improve code readability. We can consider
breaking symmetry if it is better that way. In this case though, why not use
struct folio?

[0] https://lore.kernel.org/kvm/30d0cef5-82d5-4325-b149-0e99833b8785@intel.com/
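
Something along these lines, maybe (sketch only; I'm assuming the same
GPA | level encoding in .rcx that the rest of the RFC uses):

u64 tdh_mem_page_aug(struct tdx_td *td, u64 gpa, int level,
		     struct folio *folio, unsigned long start_idx,
		     u64 *ext_err1, u64 *ext_err2)
{
	unsigned long nr_pages = 1UL << (level * 9);
	struct tdx_module_args args = {
		.rcx = gpa | level,	/* assumed encoding, as elsewhere in the RFC */
		.rdx = tdx_tdr_pa(td),
		.r8 = page_to_phys(folio_page(folio, start_idx)),
	};
	unsigned long i;
	u64 ret;

	if (level < TDX_PS_4K || level >= TDX_PS_NR ||
	    start_idx + nr_pages > folio_nr_pages(folio))
		return -EINVAL;	/* mirrors the RFC's error handling */

	for (i = 0; i < nr_pages; i++)
		tdx_clflush_page(folio_page(folio, start_idx + i));

	ret = seamcall_ret(TDH_MEM_PAGE_AUG, &args);

	*ext_err1 = args.rcx;
	*ext_err2 = args.rdx;

	return ret;
}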

^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 02/21] x86/virt/tdx: Enhance tdh_mem_page_aug() to support huge pages
  2025-07-08 13:55     ` Edgecombe, Rick P
@ 2025-07-08 15:29       ` Vishal Annapurve
  2025-07-08 15:32         ` Edgecombe, Rick P
  2025-07-09  2:23       ` Yan Zhao
  1 sibling, 1 reply; 294+ messages in thread
From: Vishal Annapurve @ 2025-07-08 15:29 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: pbonzini@redhat.com, seanjc@google.com, Zhao, Yan Y,
	Shutemov, Kirill, quic_eberman@quicinc.com, Li, Xiaoyao,
	kvm@vger.kernel.org, Hansen, Dave, david@redhat.com,
	thomas.lendacky@amd.com, tabba@google.com, Li, Zhiquan1, Du, Fan,
	linux-kernel@vger.kernel.org, michael.roth@amd.com, Weiny, Ira,
	vbabka@suse.cz, binbin.wu@linux.intel.com, ackerleytng@google.com,
	Yamahata, Isaku, Peng, Chao P, jroedel@suse.de, Miao, Jun,
	pgonda@google.com, x86@kernel.org

On Tue, Jul 8, 2025 at 6:56 AM Edgecombe, Rick P
<rick.p.edgecombe@intel.com> wrote:
>
> On Tue, 2025-07-08 at 16:48 +0800, Yan Zhao wrote:
> > On Thu, Apr 24, 2025 at 11:04:28AM +0800, Yan Zhao wrote:
> > > diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> > > index f5e2a937c1e7..a66d501b5677 100644
> > > --- a/arch/x86/virt/vmx/tdx/tdx.c
> > > +++ b/arch/x86/virt/vmx/tdx/tdx.c
> > > @@ -1595,9 +1595,18 @@ u64 tdh_mem_page_aug(struct tdx_td *td, u64 gpa, int level, struct page *page, u
> > According to the discussion in DPAMT [*],
> > "hpa here points to a 2M region that pamt_pages covers. We don't have
> > struct page that represents it. Passing 4k struct page would be
> > misleading IMO."
> >
> > Should we update tdh_mem_page_aug() accordingly to use hpa?
> > Or use struct folio instead?
> >
> > [*] https://lore.kernel.org/all/3coaqkcfp7xtpvh2x4kph55qlopupknm7dmzqox6fakzaedhem@a2oysbvbshpm/
>
> The original seamcall wrapper patches used "u64 hpa", etc everywhere. The
> feedback was that it was too error prone to not have types. We looked at using
> kvm types (hpa_t, etc), but the type checking was still just surface level [0].
>
> So the goal is to reduce errors and improve code readability. We can consider
> breaking symmetry if it is better that way. In this case though, why not use
> struct folio?

My vote would be to prefer using "hpa" and not rely on folio/page
structs for guest_memfd allocated memory wherever possible.

>
> [0] https://lore.kernel.org/kvm/30d0cef5-82d5-4325-b149-0e99833b8785@intel.com/

^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 02/21] x86/virt/tdx: Enhance tdh_mem_page_aug() to support huge pages
  2025-07-08 15:29       ` Vishal Annapurve
@ 2025-07-08 15:32         ` Edgecombe, Rick P
  2025-07-08 22:06           ` Vishal Annapurve
  0 siblings, 1 reply; 294+ messages in thread
From: Edgecombe, Rick P @ 2025-07-08 15:32 UTC (permalink / raw)
  To: Annapurve, Vishal
  Cc: quic_eberman@quicinc.com, Li, Xiaoyao, Shutemov, Kirill,
	Hansen, Dave, david@redhat.com, Li, Zhiquan1, Zhao, Yan Y,
	tabba@google.com, thomas.lendacky@amd.com, michael.roth@amd.com,
	seanjc@google.com, Weiny, Ira, linux-kernel@vger.kernel.org,
	pbonzini@redhat.com, ackerleytng@google.com, Yamahata, Isaku,
	vbabka@suse.cz, Peng, Chao P, Du, Fan, binbin.wu@linux.intel.com,
	jroedel@suse.de, Miao, Jun, kvm@vger.kernel.org,
	pgonda@google.com, x86@kernel.org

On Tue, 2025-07-08 at 08:29 -0700, Vishal Annapurve wrote:
> > The original seamcall wrapper patches used "u64 hpa", etc everywhere. The
> > feedback was that it was too error prone to not have types. We looked at
> > using
> > kvm types (hpa_t, etc), but the type checking was still just surface level
> > [0].
> > 
> > So the goal is to reduce errors and improve code readability. We can
> > consider
> > breaking symmetry if it is better that way. In this case though, why not use
> > struct folio?
> 
> My vote would be to prefer using "hpa" and not rely on folio/page
> structs for guest_memfd allocated memory wherever possible.

Is this because you want to enable struct page-less gmemfd in the future? Or
other reason?

^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge pages
  2025-07-02 23:51                                                                                             ` Edgecombe, Rick P
@ 2025-07-08 21:19                                                                                               ` Ackerley Tng
  2025-07-11  1:46                                                                                                 ` Edgecombe, Rick P
  0 siblings, 1 reply; 294+ messages in thread
From: Ackerley Tng @ 2025-07-08 21:19 UTC (permalink / raw)
  To: Edgecombe, Rick P, Zhao, Yan Y
  Cc: Du, Fan, Li, Xiaoyao, Shutemov, Kirill, Hansen, Dave,
	david@redhat.com, Li, Zhiquan1, vbabka@suse.cz, tabba@google.com,
	thomas.lendacky@amd.com, michael.roth@amd.com, seanjc@google.com,
	Weiny, Ira, Peng, Chao P, binbin.wu@linux.intel.com,
	Yamahata, Isaku, pbonzini@redhat.com, quic_eberman@quicinc.com,
	Annapurve, Vishal, jroedel@suse.de, linux-kernel@vger.kernel.org,
	Miao, Jun, kvm@vger.kernel.org, pgonda@google.com, x86@kernel.org

"Edgecombe, Rick P" <rick.p.edgecombe@intel.com> writes:

> On Wed, 2025-07-02 at 13:57 -0700, Ackerley Tng wrote:
>> > 
>> > If a poisoned page continued to be used, it's a bit weird, no? 
>> 
>> Do you mean "continued to be used" in the sense that it is present in a
>> filemap and belongs to a (guest_memfd) inode?
>
> I mean anyway where it might get read or written to again.
>

Today, when handling memory failures, guest_memfd will unmap the page
from the guest and on the next fault, guest_memfd will discover the
HWpoison flag and return -EHWPOISON for KVM to handle.

If we go with my proposal, on TDX unmap failure, KVM kills the TD and
sets -EHWPOISON on the page.

Hence, the TD will not read/write the poisoned page.

TDX unmap failure implies that the page is private, so it will not be
mapped into the host page tables. If it is not in the host page tables,
the next access from the host will cause a page fault, and that's when
the HWpoison will be discovered.

Hence, the host will also not read/write the poisoned page. Let me know
if you see any other way the poisoned page could be read or written again.
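
For reference, a minimal sketch of that fault-time check (the shape follows
guest_memfd's existing HWpoison handling; the helper name and the surrounding
error handling here are placeholders, not actual guest_memfd functions):

        /* Placeholder helper illustrating the check described above. */
        static int gmem_folio_mappable(struct folio *folio)
        {
                /*
                 * A poisoned folio is never handed back to KVM (private fault)
                 * or mapped into host page tables (shared fault); callers turn
                 * this into -EHWPOISON for KVM or SIGBUS for userspace.
                 */
                if (folio_test_hwpoison(folio))
                        return -EHWPOISON;

                return 0;
        }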

>> 
>> A poisoned page is not faulted in anywhere, and in that sense the page
>> is not "used". In the case of regular poisoning as in a call to
>> memory_failure(), the page is unmapped from the page tables. If that
>> page belongs to guest_memfd, in today's code [2], guest_memfd
>> intentionally does not truncate it from the filemap. For guest_memfd,
>> handling the HWpoison at fault time is by design; keeping it present in
>> the filemap is by design.
>
> I thought I read that you would allow it to be re-used. I see that the code
> already checks for poison in the kvm_gmem_get_pfn() path and the mmap() path. So
> it will just sit in the fd and not be handed out again. I think it's ok. Well,
> as long as conversion to shared doesn't involve zeroing...?
>

IIUC it is zeroed on unmapping from the guest page tables? Is that done
by the TDX module, or by TDX code in KVM? Either way I think both of
those should be stopped once the unmap failure is discovered, as part of
"killing the TD".

>> 
>> In the case of TDX unmap failures leading to HWpoison, the only place it
>> may remain mapped is in the Secure-EPTs. I use "may" because I'm not
>> sure about how badly the unmap failed. But either way, the TD gets
>> bugged, all vCPUs of the TD are stopped, so the HWpoison-ed page is no
>> longer "used".
>> 
>> [2]
>> https://github.com/torvalds/linux/blob/b4911fb0b060899e4eebca0151eb56deb86921ec/virt/kvm/guest_memfd.c#L334
>
> Yes, I saw that. It looks like special error case treatment for the state we are
> setting up.
>
>> 
>> > It could take an #MC for another reason from userspace and the handling
>> > code would see the page flag is already set. If it doesn't already trip up
>> > some MM code somewhere, it might put undue burden on the memory failure
>> > code to have to expect repeated poisoning of the same memory.
>> > 
>> 
>> If it does take another #MC and go to memory_failure(), memory_failure()
>> already checks for the HWpoison flag being set [3]. This is handled by
>> killing the process. There is similar handling for a HugeTLB
>> folio. We're not introducing anything new by using HWpoison; we're
>> buying into the HWpoison framework, which already handles seeing a
>> HWpoison when handling a poison.
>
> Do you see another user that is setting the poison flag manually like proposed?
> (i.e. not through memory failure handlers)
>

As far as I know, this might be the first case of setting the poison
flag not through memory failure handlers.
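
Just to illustrate what "manually setting the flag" could look like, a
hypothetical sketch (not part of the posted series): it assumes
folio_set_hugetlb_hwpoison(), which today is internal to mm/memory-failure.c,
were made available, and it deliberately skips everything else memory_failure()
does, such as unmapping and killing processes:

        /* Hypothetical helper; names and placement are illustrative only. */
        static void gmem_hwpoison_page(struct page *page)
        {
                struct folio *folio = page_folio(page);

                if (folio_test_hugetlb(folio))
                        folio_set_hugetlb_hwpoison(folio, page);
                else
                        SetPageHWPoison(page);
        }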

>> 
>> [3]
>> https://github.com/torvalds/linux/blob/b4911fb0b060899e4eebca0151eb56deb86921ec/mm/memory-failure.c#L2270
>> 
>> > > 
>> > > > What about a kvm_gmem_buggy_cleanup() instead of the system wide one.
>> > > > KVM calls it and then proceeds to bug the TD only from the KVM side.
>> > > > It's not as safe for the system, because who knows what a buggy TDX
>> > > > module could do. But TDX module could also be buggy without the kernel
>> > > > catching wind of it.
>> > > > 
>> > > > Having a single callback to basically bug the fd would solve the atomic
>> > > > context issue. Then guestmemfd could dump the entire fd into
>> > > > memory_failure() instead of returning the pages. And developers could
>> > > > respond by fixing the bug.
>> > > > 
>> > > 
>> > > This could work too.
>> > > 
>> > > I'm in favor of buying into the HWpoison system though, since we're
>> > > quite sure this is fair use of HWpoison.
>> > 
>> > Do you mean manually setting the poison flag, or calling into
>> > memory_failure(),
>> > and friends?
>> 
>> I mean manually setting the poison flag.
>> 
>> * If regular 4K page, set the flag.
>> * If THP page (not (yet) supported by guest_memfd), set the poison flag
>>   on the specific subpage causing the error, and in addition set THP's
>>   has_hwpoison flag
>> * If HugeTLB page, call folio_set_hugetlb_hwpoison() on the subpage.
>> 
>> This is already the process in memory_failure() and perhaps some
>> refactoring could be done.
>> 
>> I think calling memory_failure() would do too much, since in addition to
>> setting the flag, memory_failure() also sometimes does freeing and may
>> kill processes, and triggers the users of the page to further handle the
>> HWpoison.
>
> It definitely seems like there is more involved than setting the flag. Which
> means for our case we should try to understand what we are skipping and how it
> fits with the rest of the kernel. Is any code that checks for poison assuming
> that memory_failure() stuff has been done? Stuff like that.
>

Yup! But I do still think setting HWpoison is good enough to pursue at
least for a next RFC patch series, and in the process of testing that
series we could learn more. Do you mean that we shouldn't proceed until
all of this is verified?

>> 
>> > If we set them manually, we need to make sure that it does not have
>> > side effects on the machine check handler. It seems risky/messy to me. But
>> > Kirill didn't seem worried.
>> > 
>> 
>> I believe the memory_failure() is called from the machine check handler:
>> 
>> DEFINE_IDTENTRY_MCE(exc_machine_check)
>>   -> exc_machine_check_kernel()
>>      -> do_machine_check()
>>         -> kill_me_now() or kill_me_maybe()
>>            -> memory_failure()
>> 
>> (I might have quoted just one of the paths and I'll have to look into it
>> more.)
>
> It looked that way to me too. But it works from other contexts. See
> MADV_HWPOISON (which is for testing).
>
>> 
>> For now, IIUC setting the poison flag is a subset of memory_failure(), which
>> is a
>> subset of what the machine check handler does.
>> 
>> memory_failure() handles an already poisoned page, so I don't see any
>> side effects.
>> 
>> I'm happy that Kirill didn't seem worried :) Rick, let me know if you
>> see any specific risks.
>> 
>> > Maybe we could bring the poison page flag up to DavidH and see if there is
>> > any
>> > concern before going down this path too far?
>> > 
>> 
>> I can do that. David's cc-ed on this email, and I hope to get a chance
>> to talk about handling HWpoison (generally, not TDX specifically) at the
>> guest_memfd bi-weekly upstream call on 2025-07-10 so I can bring this up
>> too.
>
> Ok sounds good. Should we just continue the discussion there?

I think we're at a point where further discussion isn't really
useful. Kirill didn't seem worried about using HWpoison, so that's a
good sign. I think we can go ahead to use HWpoison for the next RFC of
this series and we might learn more through the process of testing it.

Do you prefer to just wait till the next guest_memfd call (now
rescheduled to 2025-07-17) before proceeding?

> I can try to
> attend.
>

Sure, thanks! It'll be focused on memory failure handling in general, so
TDX will just be another participant.

>> 
>> > > 
>> > > Are you saying kvm_gmem_buggy_cleanup() will just set the HWpoison flag
>> > > on the parts of the folios in trouble?
>> > 
>> > I was saying kvm_gmem_buggy_cleanup() can set a bool on the fd, similar to
>> > VM_BUG_ON() setting vm_dead.
>> 
>> Setting a bool on the fd is a possible option too. Comparing an
>> inode-level boolean and HWpoison, I still prefer HWpoison because
>> 
>> 1. HWpoison gives us more information about which (sub)folio was
>>    poisoned. We can think of the bool on the fd as an fd-wide
>>    poisoning. If we don't know which subpage has an error, we're forced
>>    to leak the entire fd when the inode is released, which could be a
>>    huge amount of memory leaked.
>> 2. HWpoison is already checked on faults, so there is no need to add an
>>    extra check on a bool
>> 3. For HugeTLB, HWpoison will have to be summarized/itemized on merge/split
>>    to handle regular non-TDX related HWpoisons, so no additional code there.
>> 
>> > After an invalidate, if gmem see this, it needs to
>> > assume everything failed, and invalidate everything and poison all guest
>> > memory.
>> > The point was to have the simplest possible handling for a rare error.
>> 
>> I agree a bool will probably result in fewer lines of code being changed
>> and could be a fair first cut, but I feel like we would very quickly
>> need another patch series to get more granular information and not have
>> to leak an entire fd worth of memory.
>
> We will only leak an entire VM's worth of memory if there is a bug, the form of
> which I'm not sure. The kernel doesn't usually have a lot of defensive code to
> handle bugs elsewhere, unless it's to help debugging. But especially for
> other platform software (BIOS, etc.), it should try to stay out of the job of
> maintaining code to work around unfixed bugs. And here we are working around
> *potential bugs*.
>
> So another *possible* solution is to expect TDX module/KVM to work. Kill the TD,
> return success to the invalidation, and hope that it doesn't do anything to
> those zombie mappings. It will likely work. Probably much more likely to work
> then some other warning cases in the kernel. As far as debugging, if strange
> crashes are observed after a bit splat, it can be a good hint.
>
> Unless Yan has some specific case to worry about that she has been holding on to
> that makes this error condition a more expected state. That could change things.
>
>> 
>> Along these lines, Yan seems to prefer setting HWpoison on the entire
>> folio without going into the details of the exact subfolios being
>> poisoned. I think this is a possible in-between solution that doesn't
>> require leaking the entire fd worth of memory, but it still leaks more
>> than just where the actual error happened.
>> 
>> I'm willing to go with just setting HWpoison on the entire large folio
>> as a first cut and leak more memory than necessary (because if we don't
>> know which subpage it is, we are forced to leak everything to be safe).
>
> Leaking more memory than necessary in a bug case seems totally ok to me.
>
>> 
>> However, this patch series needs a large page provider in guest_memfd, and
>> will only land either after THP or HugeTLB support lands in
>> guest_memfd.
>> 
>> For now if you're testing on guest_memfd+HugeTLB,
>> folio_set_hugetlb_hwpoison() already exists, why not use it?
>> 
>> > Although it's only a proposal. The TDX emergency shutdown option may be
>> > simpler still. But killing all TDs is not ideal. So thought we could at
>> > least consider other options.
>> > 
>> > If we have a solution where TDX needs to do something complicated because
>> > something of its specialness, it may get NAKed.
>> 
>> Using HWpoison is generic, since guest_memfd needs to handle HWpoison
>> for regular memory errors anyway. Even if it is not a final solution, it
>> should be good enough, if not for this patch series to merge, at least
>> for the next RFC of this patch series. :)
>
> Yes, maybe. If we have a normal, easy, non-imposing solution for handling the
> error then I won't object.

I believe we have 1 solution now, with 4 options to prevent the memory
from being re-used by the host.

1. Kill the TD *and* one of the following to prevent the memory from
   being re-used by the host:
    a. Kill the host
    b. HWpoison the memory
    c. fd bool, aka inode-wide HWpoison on error
    d. Leak the memory directly (not great, will mess up conversions and
                                 guest_memfd inode release) 

To push along on this topic, is it okay for us to proceed with HWpoison
and find out along the way if it is not easy or imposing?


^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 02/21] x86/virt/tdx: Enhance tdh_mem_page_aug() to support huge pages
  2025-07-08 15:32         ` Edgecombe, Rick P
@ 2025-07-08 22:06           ` Vishal Annapurve
  2025-07-08 23:16             ` Edgecombe, Rick P
  0 siblings, 1 reply; 294+ messages in thread
From: Vishal Annapurve @ 2025-07-08 22:06 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: quic_eberman@quicinc.com, Li, Xiaoyao, Shutemov, Kirill,
	Hansen, Dave, david@redhat.com, Li, Zhiquan1, Zhao, Yan Y,
	tabba@google.com, thomas.lendacky@amd.com, michael.roth@amd.com,
	seanjc@google.com, Weiny, Ira, linux-kernel@vger.kernel.org,
	pbonzini@redhat.com, ackerleytng@google.com, Yamahata, Isaku,
	vbabka@suse.cz, Peng, Chao P, Du, Fan, binbin.wu@linux.intel.com,
	jroedel@suse.de, Miao, Jun, kvm@vger.kernel.org,
	pgonda@google.com, x86@kernel.org

On Tue, Jul 8, 2025 at 8:32 AM Edgecombe, Rick P
<rick.p.edgecombe@intel.com> wrote:
>
> On Tue, 2025-07-08 at 08:29 -0700, Vishal Annapurve wrote:
> > > The original seamcall wrapper patches used "u64 hpa", etc everywhere. The
> > > feedback was that it was too error prone to not have types. We looked at
> > > using
> > > kvm types (hpa_t, etc), but the type checking was still just surface level
> > > [0].
> > >
> > > So the goal is to reduce errors and improve code readability. We can
> > > consider
> > > breaking symmetry if it is better that way. In this case though, why not use
> > > struct folio?
> >
> > My vote would be to prefer using "hpa" and not rely on folio/page
> > structs for guest_memfd allocated memory wherever possible.
>
> Is this because you want to enable struct page-less gmemfd in the future?

Yes. That's the only reason.

^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 02/21] x86/virt/tdx: Enhance tdh_mem_page_aug() to support huge pages
  2025-07-08 22:06           ` Vishal Annapurve
@ 2025-07-08 23:16             ` Edgecombe, Rick P
  2025-07-08 23:31               ` Vishal Annapurve
  0 siblings, 1 reply; 294+ messages in thread
From: Edgecombe, Rick P @ 2025-07-08 23:16 UTC (permalink / raw)
  To: Annapurve, Vishal
  Cc: quic_eberman@quicinc.com, Li, Xiaoyao, Shutemov, Kirill,
	Hansen, Dave, david@redhat.com, thomas.lendacky@amd.com,
	tabba@google.com, Li, Zhiquan1, Du, Fan, michael.roth@amd.com,
	linux-kernel@vger.kernel.org, Weiny, Ira, pbonzini@redhat.com,
	Zhao, Yan Y, Yamahata, Isaku, ackerleytng@google.com,
	seanjc@google.com, Peng, Chao P, kvm@vger.kernel.org,
	binbin.wu@linux.intel.com, vbabka@suse.cz, jroedel@suse.de,
	Miao, Jun, pgonda@google.com, x86@kernel.org

On Tue, 2025-07-08 at 15:06 -0700, Vishal Annapurve wrote:
> > > My vote would be to prefer using "hpa" and not rely on folio/page
> > > structs for guest_memfd allocated memory wherever possible.
> > 
> > Is this because you want to enable struct page-less gmemfd in the future?
> 
> Yes. That's the only reason.

I don't think we should change just this field of this seamcall wrapper from the
current pattern for that reason. When this stuff comes along it will be just
about as easy to change it with the rest. Then in the meantime it doesn't look
asymmetric.

In general, I (again) think that we should not focus on accommodating future
stuff unless there is an ABI touch point. This is to ultimately speed enabling
of the entire stack.

It is definitely not to make it harder to implement TDX support for pfn based
gmem in the future. Rather to make it possible. As in, if nothing is upstream
because we are endlessly debating how it all fits together at once, then it
won't be possible to enhance it further.

^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 02/21] x86/virt/tdx: Enhance tdh_mem_page_aug() to support huge pages
  2025-07-08 23:16             ` Edgecombe, Rick P
@ 2025-07-08 23:31               ` Vishal Annapurve
  0 siblings, 0 replies; 294+ messages in thread
From: Vishal Annapurve @ 2025-07-08 23:31 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: quic_eberman@quicinc.com, Li, Xiaoyao, Shutemov, Kirill,
	Hansen, Dave, david@redhat.com, thomas.lendacky@amd.com,
	tabba@google.com, Li, Zhiquan1, Du, Fan, michael.roth@amd.com,
	linux-kernel@vger.kernel.org, Weiny, Ira, pbonzini@redhat.com,
	Zhao, Yan Y, Yamahata, Isaku, ackerleytng@google.com,
	seanjc@google.com, Peng, Chao P, kvm@vger.kernel.org,
	binbin.wu@linux.intel.com, vbabka@suse.cz, jroedel@suse.de,
	Miao, Jun, pgonda@google.com, x86@kernel.org

On Tue, Jul 8, 2025 at 4:16 PM Edgecombe, Rick P
<rick.p.edgecombe@intel.com> wrote:
>
> On Tue, 2025-07-08 at 15:06 -0700, Vishal Annapurve wrote:
> > > > My vote would be to prefer using "hpa" and not rely on folio/page
> > > > structs for guest_memfd allocated memory wherever possible.
> > >
> > > Is this because you want to enable struct page-less gmemfd in the future?
> >
> > Yes. That's the only reason.
>
> I don't think we should change just this field of this seamcall wrapper from the
> current pattern for that reason. When this stuff comes along it will be just
> about as easy to change it with the rest. Then in the meantime it doesn't look
> asymmetric.
>
> In general, I (again) think that we should not focus on accommodating future
> stuff unless there is an ABI touch point. This is to ultimately speed enabling
> of the entire stack.
>
> It is definitely not to make it harder to implement TDX support for pfn based
> gmem in the future. Rather to make it possible. As in, if nothing is upstream
> because we are endlessly debating how it all fits together at once, then it
> won't be possible to enhance it further.

I agree and if we can't do without page struct for now that's fine. My
response was just to favor pfn/hpa over "page struct" if possible,
given that we have a choice here. Feel free to ignore if symmetry
seems more important.

^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 02/21] x86/virt/tdx: Enhance tdh_mem_page_aug() to support huge pages
  2025-07-08 13:55     ` Edgecombe, Rick P
  2025-07-08 15:29       ` Vishal Annapurve
@ 2025-07-09  2:23       ` Yan Zhao
  2025-07-09 14:08         ` Edgecombe, Rick P
  1 sibling, 1 reply; 294+ messages in thread
From: Yan Zhao @ 2025-07-09  2:23 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: pbonzini@redhat.com, seanjc@google.com, Shutemov, Kirill,
	quic_eberman@quicinc.com, Li, Xiaoyao, kvm@vger.kernel.org,
	Hansen, Dave, david@redhat.com, thomas.lendacky@amd.com,
	tabba@google.com, Li, Zhiquan1, Du, Fan,
	linux-kernel@vger.kernel.org, michael.roth@amd.com, Weiny, Ira,
	vbabka@suse.cz, binbin.wu@linux.intel.com, ackerleytng@google.com,
	Yamahata, Isaku, Peng, Chao P, Annapurve, Vishal, jroedel@suse.de,
	Miao, Jun, pgonda@google.com, x86@kernel.org

On Tue, Jul 08, 2025 at 09:55:39PM +0800, Edgecombe, Rick P wrote:
> On Tue, 2025-07-08 at 16:48 +0800, Yan Zhao wrote:
> > On Thu, Apr 24, 2025 at 11:04:28AM +0800, Yan Zhao wrote:
> > > diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> > > index f5e2a937c1e7..a66d501b5677 100644
> > > --- a/arch/x86/virt/vmx/tdx/tdx.c
> > > +++ b/arch/x86/virt/vmx/tdx/tdx.c
> > > @@ -1595,9 +1595,18 @@ u64 tdh_mem_page_aug(struct tdx_td *td, u64 gpa, int level, struct page *page, u
> > According to the discussion in DPAMT [*],
> > "hpa here points to a 2M region that pamt_pages covers. We don't have
> > struct page that represents it. Passing 4k struct page would be
> > misleading IMO."
> > 
> > Should we update tdh_mem_page_aug() accordingly to use hpa?
> > Or use struct folio instead?
> > 
> > [*] https://lore.kernel.org/all/3coaqkcfp7xtpvh2x4kph55qlopupknm7dmzqox6fakzaedhem@a2oysbvbshpm/
> 
> The original seamcall wrapper patches used "u64 hpa", etc everywhere. The
> feedback was that it was too error prone to not have types. We looked at using
> kvm types (hpa_t, etc), but the type checking was still just surface level [0].
> 
> So the goal is to reduce errors and improve code readability. We can consider
> breaking symmetry if it is better that way. In this case though, why not use
> struct folio?
I'm OK with using struct folio.
My previous ask was based on two considerations:

1. hpa is simpler and I didn't find Dave's NAK to Kirill's patch (v1 or v2).
2. using struct folio, I need to introduce "start_idx" as well (as below),
   because it's likely that guest_memfd provides a huge folio while KVM wants to
   map it at 4KB.

u64 tdh_mem_page_aug(struct tdx_td *td, u64 gpa, int level, struct folio *folio, 
                     unsigned long start_idx, u64 *ext_err1, u64 *ext_err2)      
{                                                                                
        struct page *start = folio_page(folio, start_idx);                       
        unsigned long npages = 1 << (level * PTE_SHIFT);                         
        struct tdx_module_args args = {                                          
                .rcx = gpa | level,                                              
                .rdx = tdx_tdr_pa(td),                                           
                .r8 = page_to_phys(start),                                       
        };                                                                       
        u64 ret;                                                                 
                                                                                 
        if (start_idx + npages > folio_nr_pages(folio))                          
                return TDX_SW_ERROR;                                             
                                                                                 
        for (int i = 0; i < npages; i++)                                         
                tdx_clflush_page(nth_page(start, i));                            
                                                                                 
        ret = seamcall_ret(TDH_MEM_PAGE_AUG, &args);                             
                                                                                 
        *ext_err1 = args.rcx;                                                    
        *ext_err2 = args.rdx;                                                    
                                                                                 
        return ret;                                                              
}                                   
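
For context, a hypothetical caller on the KVM side (names below are placeholders,
not the series' actual functions) would derive the folio and the starting index
from the pfn being mapped, e.g.:

        static u64 tdx_aug_at_pfn(struct tdx_td *td, gfn_t gfn, int level,
                                  kvm_pfn_t pfn, u64 *err1, u64 *err2)
        {
                struct page *page = pfn_to_page(pfn);
                struct folio *folio = page_folio(page);

                /* start_idx = offset of this 4KB page within its folio */
                return tdh_mem_page_aug(td, gfn_to_gpa(gfn), level, folio,
                                        folio_page_idx(folio, page), err1, err2);
        }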



> [0] https://lore.kernel.org/kvm/30d0cef5-82d5-4325-b149-0e99833b8785@intel.com/

^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 02/21] x86/virt/tdx: Enhance tdh_mem_page_aug() to support huge pages
  2025-07-09  2:23       ` Yan Zhao
@ 2025-07-09 14:08         ` Edgecombe, Rick P
  0 siblings, 0 replies; 294+ messages in thread
From: Edgecombe, Rick P @ 2025-07-09 14:08 UTC (permalink / raw)
  To: Zhao, Yan Y
  Cc: kvm@vger.kernel.org, Li, Xiaoyao, quic_eberman@quicinc.com,
	Hansen, Dave, david@redhat.com, Li, Zhiquan1, tabba@google.com,
	vbabka@suse.cz, thomas.lendacky@amd.com, michael.roth@amd.com,
	seanjc@google.com, Weiny, Ira, linux-kernel@vger.kernel.org,
	pbonzini@redhat.com, ackerleytng@google.com, Yamahata, Isaku,
	binbin.wu@linux.intel.com, Peng, Chao P, Du, Fan,
	Annapurve, Vishal, jroedel@suse.de, Miao, Jun, Shutemov, Kirill,
	pgonda@google.com, x86@kernel.org

On Wed, 2025-07-09 at 10:23 +0800, Yan Zhao wrote:

> 2. using struct folio, I need to introduce "start_idx" as well (as below),
>    because it's likely that guest_memfd provides a huge folio while KVM wants to
>    map it at 4KB.

Seems ok to me.

> 
> u64 tdh_mem_page_aug(struct tdx_td *td, u64 gpa, int level, struct folio *folio, 
>                      unsigned long start_idx, u64 *ext_err1, u64 *ext_err2)      
> {                                                                                
>         struct page *start = folio_page(folio, start_idx);                       
>         unsigned long npages = 1 << (level * PTE_SHIFT);                         
>         struct tdx_module_args args = {                                          
>                 .rcx = gpa | level,                                              
>                 .rdx = tdx_tdr_pa(td),                                           
>                 .r8 = page_to_phys(start),                                       
>         };                                                                       
>         u64 ret;                                                                 
>                                                                                  
>         if (start_idx + npages > folio_nr_pages(folio))                          
>                 return TDX_SW_ERROR;                                             
>                                                                                  
>         for (int i = 0; i < npages; i++)                                         
>                 tdx_clflush_page(nth_page(start, i));                            
>                                                                                  
>         ret = seamcall_ret(TDH_MEM_PAGE_AUG, &args);                             
>                                                                                  
>         *ext_err1 = args.rcx;                                                    
>         *ext_err2 = args.rdx;                                                    
>                                                                                  
>         return ret;                                                              
> }                      


^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge pages
  2025-07-08 21:19                                                                                               ` Ackerley Tng
@ 2025-07-11  1:46                                                                                                 ` Edgecombe, Rick P
  2025-07-11  5:12                                                                                                   ` Yan Zhao
  0 siblings, 1 reply; 294+ messages in thread
From: Edgecombe, Rick P @ 2025-07-11  1:46 UTC (permalink / raw)
  To: ackerleytng@google.com, Zhao, Yan Y
  Cc: Shutemov, Kirill, Li, Xiaoyao, Du, Fan, Hansen, Dave,
	david@redhat.com, thomas.lendacky@amd.com, vbabka@suse.cz,
	Li, Zhiquan1, quic_eberman@quicinc.com, michael.roth@amd.com,
	seanjc@google.com, Weiny, Ira, Peng, Chao P, pbonzini@redhat.com,
	Yamahata, Isaku, linux-kernel@vger.kernel.org, tabba@google.com,
	kvm@vger.kernel.org, binbin.wu@linux.intel.com, Annapurve, Vishal,
	jroedel@suse.de, Miao, Jun, pgonda@google.com, x86@kernel.org

On Tue, 2025-07-08 at 14:19 -0700, Ackerley Tng wrote:
> > Ok sounds good. Should we just continue the discussion there?
> 
> I think we're at a point where further discussion isn't really
> useful. Kirill didn't seem worried about using HWpoison, so that's a
> good sign. I think we can go ahead to use HWpoison for the next RFC of
> this series and we might learn more through the process of testing it.
> 
> Do you prefer to just wait till the next guest_memfd call (now
> rescheduled to 2025-07-17) before proceeding?

Ah, I missed this and joined the call. :)

At this point, I think I'm strongly in favor of not doing anything here.

Yan and I had a discussion on our internal team chat about this. I'll summarize:

Yan confirmed to me again, that there isn't a specific expected failure here. We
are talking about bugs generating the invalidation failure, and leaving the page
mapped. But in the case of a bug in a normal VM, a page can also be left mapped
too.

What is different here is that we have something (a return code) to check that could
catch some of the bugs. But this isn't the only case where a SEAMCALL has a
spec-defined error that we can't handle in a no-fail code path. In those other cases,
we handle them by making sure the error won't happen and triggering a VM_BUG_ON()
if it does anyway. We can be consistent by just doing the same thing in this
case. Implementing it looks like just removing the refcounting in the current
code.

And this VM_BUG_ON() will lead to a situation almost like unmapping anyway since
the TD can no longer be entered. With future VM shutdown work the pages will not
be zeroed at shutdown usually either. So we should not always expect crashes if
those pages are returned to the page allocator, even if a bug turns up.
Additionally KVM_BUG_ON() will leave a loud warning, allowing us to fix the bug.

But Yan raised a point that might make it worth doing something for this case. On the
partial write errata platforms (a TDX-specific thing), pages that are reclaimed
need to be zeroed. So to more cleanly handle this subset of catchable bugs we
are focused on, we could zero the page after the KVM_BUG_ON(). But this still
needs to be weighed against how much we want to add code to address potential bugs.


So on the benefit side, it is very low to me. The other side is the cost side,
which I think is maybe actually a stronger case. We can only make TDX a special
case so many times before we run into upstream problems. Not to lean on
Sean here, but he bangs this drum. If we find that we have a case where we have to
add any specialness for TDX (i.e. making it the only thing that sets the poison
bit manually), we should look at changing the TDX arch to address it. I'm not
sure what that looks like, but we haven't really tried too hard in that
direction yet.

So if TDX has a limited number of "gets to be special" cards, I don't think it
is prudent to spend it on something this much of an edge case. So our plan is to
rely on the KVM_BUG_ON() for now. And consider TDX arch changes (currently
unknown), for how to make the situation cleaner somehow.

Yan, is that your recollection? I guess the other points were that although TDX
doesn't need it today, for long term, userspace ABI around invalidations should
support failure. But the actual gmem/kvm interface for this can be figured out
later. And that external EPT specific TDP MMU code could be tweaked to make
things work a little safer around this.
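
As a concrete (and purely illustrative) picture of that plan: on an unexpected
SEAMCALL failure in the unmap path, the handling reduces to roughly the shape
below, with no refcount or poison bookkeeping afterwards. The wrapper name and
arguments follow the style of the wrappers discussed in this thread, not
necessarily the final code:

        /* Illustrative only: the error-handling shape, not the series' code. */
        static void tdx_remove_private_page(struct kvm *kvm, struct tdx_td *td,
                                            u64 gpa, int level)
        {
                u64 ext_err1, ext_err2;
                u64 err;

                err = tdh_mem_page_remove(td, gpa, level, &ext_err1, &ext_err2);
                if (KVM_BUG_ON(err, kvm))
                        return; /* TD is bugged and can no longer be entered */

                /* Normal page reclaim continues here; nothing extra for the bug case. */
        }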

^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge pages
  2025-07-11  1:46                                                                                                 ` Edgecombe, Rick P
@ 2025-07-11  5:12                                                                                                   ` Yan Zhao
  2025-07-11 16:14                                                                                                     ` Edgecombe, Rick P
  0 siblings, 1 reply; 294+ messages in thread
From: Yan Zhao @ 2025-07-11  5:12 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: ackerleytng@google.com, Shutemov, Kirill, Li, Xiaoyao, Du, Fan,
	Hansen, Dave, david@redhat.com, thomas.lendacky@amd.com,
	vbabka@suse.cz, Li, Zhiquan1, quic_eberman@quicinc.com,
	michael.roth@amd.com, seanjc@google.com, Weiny, Ira, Peng, Chao P,
	pbonzini@redhat.com, Yamahata, Isaku,
	linux-kernel@vger.kernel.org, tabba@google.com,
	kvm@vger.kernel.org, binbin.wu@linux.intel.com, Annapurve, Vishal,
	jroedel@suse.de, Miao, Jun, pgonda@google.com, x86@kernel.org

On Fri, Jul 11, 2025 at 09:46:45AM +0800, Edgecombe, Rick P wrote:
> On Tue, 2025-07-08 at 14:19 -0700, Ackerley Tng wrote:
> > > Ok sounds good. Should we just continue the discussion there?
> > 
> > I think we're at a point where further discussion isn't really
> > useful. Kirill didn't seem worried about using HWpoison, so that's a
> > good sign. I think we can go ahead to use HWpoison for the next RFC of
> > this series and we might learn more through the process of testing it.
> > 
> > Do you prefer to just wait till the next guest_memfd call (now
> > rescheduled to 2025-07-17) before proceeding?
> 
> Ah, I missed this and joined the call. :)
> 
> At this point, I think I'm strongly in favor of not doing anything here.
> 
> Yan and I had a discussion on our internal team chat about this. I'll summarize:
> 
> Yan confirmed to me again, that there isn't a specific expected failure here. We
> are talking about bugs generating the invalidation failure, and leaving the page
> mapped. But in the case of a bug in a normal VM, a page can also be left mapped
> too.
> 
> What is different here is that we have something (a return code) to check that could
> catch some of the bugs. But this isn't the only case where a SEAMCALL has a
> spec-defined error that we can't handle in a no-fail code path. In those other cases,
> we handle them by making sure the error won't happen and triggering a VM_BUG_ON()
> if it does anyway. We can be consistent by just doing the same thing in this
> case. Implementing it looks like just removing the refcounting in the current
> code.
> 
> And this VM_BUG_ON() will lead to a situation almost like unmapping anyway since
> the TD can no longer be entered. With future VM shutdown work the pages will not
> be zeroed at shutdown usually either. So we should not always expect crashes if
> those pages are returned to the page allocator, even if a bug turns up.
> Additionally KVM_BUG_ON() will leave a loud warning, allowing us to fix the bug.
> 
> But Yan raised a point that might make it worth doing something for this case. On the
> partial write errata platforms (a TDX-specific thing), pages that are reclaimed
> need to be zeroed. So to more cleanly handle this subset of catchable bugs we
> are focused on, we could zero the page after the KVM_BUG_ON(). But this still
> needs to be weighed against how much we want to add code to address potential bugs.
> 
> 
> So on the benefit side, it is very low to me. The other side is the cost side,
> which I think is maybe actually a stronger case. We can only make TDX a special
> case so many times before we run into upstream problems. Not to lean on
> Sean here, but he bangs this drum. If we find that we have a case where we have to
> add any specialness for TDX (i.e. making it the only thing that sets the poison
> bit manually), we should look at changing the TDX arch to address it. I'm not
> sure what that looks like, but we haven't really tried too hard in that
> direction yet.
> 
> So if TDX has a limited number of "gets to be special" cards, I don't think it
> is prudent to spend it on something this much of an edge case. So our plan is to
> rely on the KVM_BUG_ON() for now. And consider TDX arch changes (currently
> unknown), for how to make the situation cleaner somehow.
> 
> Yan, is that your recollection? I guess the other points were that although TDX
I'm ok if KVM_BUG_ON() is considered loud enough to warn about the rare
potential corruption, thereby making TDX less special.

> doesn't need it today, for long term, userspace ABI around invalidations should
> support failure. But the actual gmem/kvm interface for this can be figured out
Could you elaborate on what is included in the userspace ABI around invalidations?

I'm a bit confused as I think the userspace ABI today supports failure already.

Currently, the unmap API between gmem and KVM does not support failure.

In the future, we hope gmem can check if KVM allows a page to be unmapped before
triggering the actual unmap. 

> later. And that external EPT specific TDP MMU code could be tweaked to make
> things work a little safer around this.

^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge pages
  2025-07-11  5:12                                                                                                   ` Yan Zhao
@ 2025-07-11 16:14                                                                                                     ` Edgecombe, Rick P
  2025-07-14 19:49                                                                                                       ` Ackerley Tng
  0 siblings, 1 reply; 294+ messages in thread
From: Edgecombe, Rick P @ 2025-07-11 16:14 UTC (permalink / raw)
  To: Zhao, Yan Y
  Cc: quic_eberman@quicinc.com, Li, Xiaoyao, Du, Fan, Hansen, Dave,
	david@redhat.com, thomas.lendacky@amd.com, vbabka@suse.cz,
	Li, Zhiquan1, Shutemov, Kirill, michael.roth@amd.com,
	seanjc@google.com, Weiny, Ira, Peng, Chao P, pbonzini@redhat.com,
	Yamahata, Isaku, ackerleytng@google.com, tabba@google.com,
	linux-kernel@vger.kernel.org, binbin.wu@linux.intel.com,
	Annapurve, Vishal, jroedel@suse.de, Miao, Jun,
	kvm@vger.kernel.org, pgonda@google.com, x86@kernel.org

On Fri, 2025-07-11 at 13:12 +0800, Yan Zhao wrote:
> > Yan, is that your recollection? I guess the other points were that although
> > TDX
> I'm ok if KVM_BUG_ON() is considered loud enough to warn about the rare
> potential corruption, thereby making TDX less special.
> 
> > doesn't need it today, for long term, userspace ABI around invalidations
> > should
> > support failure. But the actual gmem/kvm interface for this can be figured
> > out
> Could you elaborate on what is included in the userspace ABI around invalidations?

Let's see what Ackerley says.

> 
> I'm a bit confused as I think the userspace ABI today supports failure
> already.
> 
> Currently, the unmap API between gmem and KVM does not support failure.

Great. I'm just trying to summarize the internal conversations. I think the
point was that for a future-looking user ABI, supporting failure is important. But we
don't need the KVM/gmem interface figured out yet.

> 
> In the future, we hope gmem can check if KVM allows a page to be unmapped
> before
> triggering the actual unmap.



^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge pages
  2025-07-03  4:54                                                                                               ` Yan Zhao
@ 2025-07-14 19:32                                                                                                 ` Ackerley Tng
  0 siblings, 0 replies; 294+ messages in thread
From: Ackerley Tng @ 2025-07-14 19:32 UTC (permalink / raw)
  To: Yan Zhao
  Cc: Vishal Annapurve, Edgecombe, Rick P, quic_eberman@quicinc.com,
	Li, Xiaoyao, Shutemov, Kirill, Hansen, Dave, david@redhat.com,
	thomas.lendacky@amd.com, vbabka@suse.cz, tabba@google.com,
	Du, Fan, michael.roth@amd.com, seanjc@google.com,
	binbin.wu@linux.intel.com, Peng, Chao P, kvm@vger.kernel.org,
	Yamahata, Isaku, linux-kernel@vger.kernel.org, Weiny, Ira,
	pbonzini@redhat.com, Li, Zhiquan1, jroedel@suse.de, Miao, Jun,
	pgonda@google.com, x86@kernel.org

Yan Zhao <yan.y.zhao@intel.com> writes:

> On Wed, Jul 02, 2025 at 11:43:23AM -0700, Ackerley Tng wrote:
>> >> vmemmap-optimized folios. Setting a page flag on a vmemmap-optimized
>> >> folio will be setting the flag on a few pages.
>> > BTW, I have a concern regarding to the overhead vmemmap-optimization.
>> >
>> > In my system,
>> > with hugetlb_free_vmemmap=false, the TD boot time is around 30s;
>> > with hugetlb_free_vmemmap=true, the TD boot time is around 1m20s;
>> 
>> I'm aware of this, was investigating this for something similar
>> internally. In your system and test, were you working with 1G pages, or
>> 2M pages?
> 2M pages. 

Thanks for letting me know. I'll look more into this. I'm currently also
working on some optimizations, hope this will get addressed.

^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge pages
  2025-07-11 16:14                                                                                                     ` Edgecombe, Rick P
@ 2025-07-14 19:49                                                                                                       ` Ackerley Tng
  2025-07-15 15:08                                                                                                         ` Edgecombe, Rick P
  0 siblings, 1 reply; 294+ messages in thread
From: Ackerley Tng @ 2025-07-14 19:49 UTC (permalink / raw)
  To: Edgecombe, Rick P, Zhao, Yan Y
  Cc: quic_eberman@quicinc.com, Li, Xiaoyao, Du, Fan, Hansen, Dave,
	david@redhat.com, thomas.lendacky@amd.com, vbabka@suse.cz,
	Li, Zhiquan1, Shutemov, Kirill, michael.roth@amd.com,
	seanjc@google.com, Weiny, Ira, Peng, Chao P, pbonzini@redhat.com,
	Yamahata, Isaku, tabba@google.com, linux-kernel@vger.kernel.org,
	binbin.wu@linux.intel.com, Annapurve, Vishal, jroedel@suse.de,
	Miao, Jun, kvm@vger.kernel.org, pgonda@google.com, x86@kernel.org

"Edgecombe, Rick P" <rick.p.edgecombe@intel.com> writes:

> On Fri, 2025-07-11 at 13:12 +0800, Yan Zhao wrote:
>> > Yan, is that your recollection? I guess the other points were that although
>> > TDX
>> I'm ok if KVM_BUG_ON() is considered loud enough to warn about the rare
>> potential corruption, thereby making TDX less special.
>> 
>> > doesn't need it today, for long term, userspace ABI around invalidations
>> > should
>> > support failure. But the actual gmem/kvm interface for this can be figured
>> > out
>> Could you elaborate on what is included in the userspace ABI around invalidations?
>
> Let's see what Ackerley says.
>

There's no specific invalidation ioctl, but I assume you're
referring to the conversion ioctl?

There is a conversion ioctl planned for guest_memfd and the conversion
ioctl can return an error. The process of conversion involves
invalidating the memory that is to be converted, and for now,
guest_memfd assumes unmapping is successful (like Yan says), but that
can be changed.

>> 
>> I'm a bit confused as I think the userspace ABI today supports failure
>> already.
>> 
>> Currently, the unmap API between gmem and KVM does not support failure.
>
> Great. I'm just trying to summarize the internal conversations. I think the
> point was that for a future-looking user ABI, supporting failure is important. But we
> don't need the KVM/gmem interface figured out yet.
>

I'm onboard here. So "do nothing" means if there is a TDX unmap failure,

+ KVM_BUG_ON() and hence the TD in question stops running,
    + No more conversions will be possible for this TD since the TD
      stops running.
    + Other TDs can continue running?
+ No refcounts will be taken for the folio/page where the memory failure
  happened.
+ No other indication (including HWpoison) anywhere in folio/page to
  indicate this happened.
+ To round this topic up, do we do anything else as part of "do nothing"
  that I missed? Is there any record in the TDX module (TDX module
  itself, not within the kernel)?

I'll probably be okay with an answer like "won't know what will happen",
but just checking - what might happen if this page that had an unmap
failure gets reused? Suppose the KVM_BUG_ON() is noted but somehow we
couldn't get to the machine in time and the machine continues to serve,
and the memory is used by 

1. Some other non-VM user, something else entirely, say a database?
2. Some new non-TDX VM?
3. Some new TD?


>> 
>> In the future, we hope gmem can check if KVM allows a page to be unmapped
>> before
>> triggering the actual unmap.

^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge pages
  2025-07-14 19:49                                                                                                       ` Ackerley Tng
@ 2025-07-15 15:08                                                                                                         ` Edgecombe, Rick P
  2025-07-15 22:31                                                                                                           ` Ackerley Tng
  0 siblings, 1 reply; 294+ messages in thread
From: Edgecombe, Rick P @ 2025-07-15 15:08 UTC (permalink / raw)
  To: ackerleytng@google.com, Zhao, Yan Y
  Cc: quic_eberman@quicinc.com, Li, Xiaoyao, kirill.shutemov@intel.com,
	Hansen, Dave, david@redhat.com, thomas.lendacky@amd.com,
	vbabka@suse.cz, Li, Zhiquan1, Du, Fan, tabba@google.com,
	seanjc@google.com, Weiny, Ira, Peng, Chao P, pbonzini@redhat.com,
	Yamahata, Isaku, michael.roth@amd.com, binbin.wu@linux.intel.com,
	linux-kernel@vger.kernel.org, Annapurve, Vishal, jroedel@suse.de,
	Miao, Jun, kvm@vger.kernel.org, pgonda@google.com, x86@kernel.org

On Mon, 2025-07-14 at 12:49 -0700, Ackerley Tng wrote:
> I'm onboard here. So "do nothing" means if there is a TDX unmap failure,
> 
> + KVM_BUG_ON() and hence the TD in question stops running,
>     + No more conversions will be possible for this TD since the TD
>       stops running.
>     + Other TDs can continue running?
> + No refcounts will be taken for the folio/page where the memory failure
>   happened.
> + No other indication (including HWpoison) anywhere in folio/page to
>   indicate this happened.

Yea.

> + To round this topic up, do we do anything else as part of "do nothing"
>   that I missed? Is there any record in the TDX module (TDX module
>   itself, not within the kernel)?

We should keep this as an option for how to change the TDX module to make this
solution safer. For future arch things, we should maybe pursue something that
works for TDX connect too, which could be more complicated.

> 
> I'll probably be okay with an answer like "won't know what will happen",

I have not exhaustively verified that there won't be cascading failures. I
think that's reasonable given this is a bug case which we already have a way to
catch with a warning.

> but just checking - what might happen if this page that had an unmap
> failure gets reused? 
> 

The TDX module has this thing called the PAMT which records how each physical
page is in use. If KVM tries to re-add the page, the SEAMCALL will check PAMT,
see it is not in the NDA (Not directly assigned) state, and give an error
(TDX_OPERAND_PAGE_METADATA_INCORRECT). This is part of the security enforcement.
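
Purely as an illustration of how that enforcement surfaces to the kernel
(placeholder variables, not the series' code): a later attempt to hand the same
page back to the TDX module simply fails at the SEAMCALL wrapper rather than
silently succeeding, e.g.:

        /*
         * Illustrative fragment: the page was left assigned in the PAMT by the
         * failed unmap, so re-adding it is rejected; the status cited above is
         * TDX_OPERAND_PAGE_METADATA_INCORRECT.
         */
        err = tdh_mem_page_aug(td, gpa, level, page, &ext_err1, &ext_err2);
        if (err)
                return -EIO; /* page not in the NDA state; nothing was mapped */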

> Suppose the KVM_BUG_ON() is noted but somehow we
> couldn't get to the machine in time and the machine continues to serve,
> and the memory is used by 
> 
> 1. Some other non-VM user, something else entirely, say a database?

We are in a "there is a bug" state at this point, which means stability should
not be expected to be as good. But it should be optimistically ok to re-use the
page as long as the TD is not re-entered, or otherwise actuated via SEAMCALL.

> 2. Some new non-TDX VM?

Same as (1)

> 3. Some new TD?

As above, the TDX module should prevent this.

^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge pages
  2025-07-15 15:08                                                                                                         ` Edgecombe, Rick P
@ 2025-07-15 22:31                                                                                                           ` Ackerley Tng
  0 siblings, 0 replies; 294+ messages in thread
From: Ackerley Tng @ 2025-07-15 22:31 UTC (permalink / raw)
  To: Edgecombe, Rick P, Zhao, Yan Y
  Cc: quic_eberman@quicinc.com, Li, Xiaoyao, kirill.shutemov@intel.com,
	Hansen, Dave, david@redhat.com, thomas.lendacky@amd.com,
	vbabka@suse.cz, Li, Zhiquan1, Du, Fan, tabba@google.com,
	seanjc@google.com, Weiny, Ira, Peng, Chao P, pbonzini@redhat.com,
	Yamahata, Isaku, michael.roth@amd.com, binbin.wu@linux.intel.com,
	linux-kernel@vger.kernel.org, Annapurve, Vishal, jroedel@suse.de,
	Miao, Jun, kvm@vger.kernel.org, pgonda@google.com, x86@kernel.org

"Edgecombe, Rick P" <rick.p.edgecombe@intel.com> writes:

> On Mon, 2025-07-14 at 12:49 -0700, Ackerley Tng wrote:
>> I'm onboard here. So "do nothing" means if there is a TDX unmap failure,
>> 
>> + KVM_BUG_ON() and hence the TD in question stops running,
>>     + No more conversions will be possible for this TD since the TD
>>       stops running.
>>     + Other TDs can continue running?
>> + No refcounts will be taken for the folio/page where the memory failure
>>   happened.
>> + No other indication (including HWpoison) anywhere in folio/page to
>>   indicate this happened.
>
> Yea.
>
>> + To round this topic up, do we do anything else as part of "do nothing"
>>   that I missed? Is there any record in the TDX module (TDX module
>>   itself, not within the kernel)?
>
> We should keep this as an option for how to change the TDX module to make this
> solution safer. For future arch things, we should maybe pursue something that
> works for TDX connect too, which could be more complicated.
>
>> 
>> I'll probably be okay with an answer like "won't know what will happen",
>
> I have not exhaustively looked at that there won't be cascading failures. I
> think it's reasonable given this is a bug case which we already have a way to
> catch with a warning.
>
>> but just checking - what might happen if this page that had an unmap
>> failure gets reused? 
>> 
>
> The TDX module has this thing called the PAMT which records how each physical
> page is in use. If KVM tries to re-add the page, the SEAMCALL will check PAMT,
> see it is not in the NDA (Not directly assigned) state, and give an error
> (TDX_OPERAND_PAGE_METADATA_INCORRECT). This is part of the security enforcement.
>
>> Suppose the KVM_BUG_ON() is noted but somehow we
>> couldn't get to the machine in time and the machine continues to serve,
>> and the memory is used by 
>> 
>> 1. Some other non-VM user, something else entirely, say a database?
>
> We are in a "there is a bug" state at this point, which means stability should
> not be expected to be as good. But it should be optimistically ok to re-use the
> page as long as the TD is not re-entered, or otherwise actuated via SEAMCALL.
>
>> 2. Some new non-TDX VM?
>
> Same as (1)
>
>> 3. Some new TD?
>
> As above, the TDX module should prevent this.

Thanks for clarifying! SGTM!

Btw, after some more work on handling memory failures for guest_memfd,
it now seems like it's better for guest_memfd to not use the HWpoison
flag internally either.

So it turns out well that for TDX unmap failures we're aligned on not
using HWpoison :)

^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge pages
  2025-06-05 22:35                                       ` Ackerley Tng
  2025-06-19  8:11                                         ` Yan Zhao
@ 2025-07-16  1:23                                         ` Yan Zhao
  2025-07-16 20:57                                           ` Ackerley Tng
  1 sibling, 1 reply; 294+ messages in thread
From: Yan Zhao @ 2025-07-16  1:23 UTC (permalink / raw)
  To: Ackerley Tng
  Cc: vannapurve, pbonzini, seanjc, linux-kernel, kvm, x86,
	rick.p.edgecombe, dave.hansen, kirill.shutemov, tabba,
	quic_eberman, michael.roth, david, vbabka, jroedel,
	thomas.lendacky, pgonda, zhiquan1.li, fan.du, jun.miao, ira.weiny,
	isaku.yamahata, xiaoyao.li, binbin.wu, chao.p.peng

On Thu, Jun 05, 2025 at 03:35:50PM -0700, Ackerley Tng wrote:
> Yan Zhao <yan.y.zhao@intel.com> writes:
> 
> > On Wed, Jun 04, 2025 at 01:02:54PM -0700, Ackerley Tng wrote:
> >> Hi Yan,
> >> 
> >> While working on the 1G (aka HugeTLB) page support for guest_memfd
> >> series [1], we took into account conversion failures too. The steps are
> >> in kvm_gmem_convert_range(). (It might be easier to pull the entire
> >> series from GitHub [2] because the steps for conversion changed in two
> >> separate patches.)
> > ...
> >> [2] https://github.com/googleprodkernel/linux-cc/tree/gmem-1g-page-support-rfc-v2
> >
> > Hi Ackerley,
> > Thanks for providing this branch.
> 
> Here's the WIP branch [1], which I initially wasn't intending to make
> super public since it's not even RFC standard yet and I didn't want to
> add to the many guest_memfd in-flight series, but since you referred to
> it, [2] is a v2 of the WIP branch :)
> 
> [1] https://github.com/googleprodkernel/linux-cc/commits/wip-tdx-gmem-conversions-hugetlb-2mept
> [2] https://github.com/googleprodkernel/linux-cc/commits/wip-tdx-gmem-conversions-hugetlb-2mept-v2
Hi Ackerley,

I'm preparing TDX huge page v2 based on your [2]. The current decision
is that the code base for TDX huge page v2 needs to include DPAMT and
the VM shutdown optimization as well.

So, we think kvm-x86/next is a good candidate for us.
(It is in repo https://github.com/kvm-x86/linux.git
 commit 87198fb0208a (tag: kvm-x86-next-2025.07.15, kvm-x86/next) Merge branch 'vmx',
 which already includes code for VM shutdown optimization).
I still need to port DPAMT + gmem 1G + TDX huge page v2 on top of it.

Therefore, I'm wondering if the rebase of [2] onto kvm-x86/next could be
done on your side. A straightforward rebase is sufficient, with no need
for any code modifications. Ideally it would be completed by the end of
next week.

We thought it might be easier for you to do that (depending on your
bandwidth, of course), which would allow me to work on the DPAMT part of
TDX huge page v2 in parallel.

However, if it's difficult for you, please feel free to let us know.

Thanks
Yan

^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge pages
  2025-07-16  1:23                                         ` Yan Zhao
@ 2025-07-16 20:57                                           ` Ackerley Tng
  2025-07-18  5:49                                             ` Yan Zhao
  0 siblings, 1 reply; 294+ messages in thread
From: Ackerley Tng @ 2025-07-16 20:57 UTC (permalink / raw)
  To: Yan Zhao
  Cc: vannapurve, pbonzini, seanjc, linux-kernel, kvm, x86,
	rick.p.edgecombe, dave.hansen, kirill.shutemov, tabba,
	quic_eberman, michael.roth, david, vbabka, jroedel,
	thomas.lendacky, pgonda, zhiquan1.li, fan.du, jun.miao, ira.weiny,
	isaku.yamahata, xiaoyao.li, binbin.wu, chao.p.peng

Yan Zhao <yan.y.zhao@intel.com> writes:

> On Thu, Jun 05, 2025 at 03:35:50PM -0700, Ackerley Tng wrote:
>> Yan Zhao <yan.y.zhao@intel.com> writes:
>> 
>> > On Wed, Jun 04, 2025 at 01:02:54PM -0700, Ackerley Tng wrote:
>> >> Hi Yan,
>> >> 
>> >> While working on the 1G (aka HugeTLB) page support for guest_memfd
>> >> series [1], we took into account conversion failures too. The steps are
>> >> in kvm_gmem_convert_range(). (It might be easier to pull the entire
>> >> series from GitHub [2] because the steps for conversion changed in two
>> >> separate patches.)
>> > ...
>> >> [2] https://github.com/googleprodkernel/linux-cc/tree/gmem-1g-page-support-rfc-v2
>> >
>> > Hi Ackerley,
>> > Thanks for providing this branch.
>> 
>> Here's the WIP branch [1], which I initially wasn't intending to make
>> super public since it's not even RFC standard yet and I didn't want to
>> add to the many guest_memfd in-flight series, but since you referred to
>> it, [2] is a v2 of the WIP branch :)
>> 
>> [1] https://github.com/googleprodkernel/linux-cc/commits/wip-tdx-gmem-conversions-hugetlb-2mept
>> [2] https://github.com/googleprodkernel/linux-cc/commits/wip-tdx-gmem-conversions-hugetlb-2mept-v2
> Hi Ackerley,
>
> I'm working on preparing TDX huge page v2 based on [2] from you. The current
> decision is that the code base of TDX huge page v2 needs to include DPAMT
> and VM shutdown optimization as well.
>
> So, we think kvm-x86/next is a good candidate for us.
> (It is in repo https://github.com/kvm-x86/linux.git
>  commit 87198fb0208a (tag: kvm-x86-next-2025.07.15, kvm-x86/next) Merge branch 'vmx',
>  which already includes code for VM shutdown optimization).
> I still need to port DPAMT + gmem 1G + TDX huge page v2 on top it.
>
> Therefore, I'm wondering if the rebase of [2] onto kvm-x86/next can be done
> from your side. A straightforward rebase is sufficient, with no need for
> any code modification. And it's better to be completed by the end of next
> week.
>
> We thought it might be easier for you to do that (but depending on your
> bandwidth), allowing me to work on the DPAMT part for TDX huge page v2 in
> parallel.
>

I'm a little tied up with some internal work; is it okay if, for the
next RFC, you base the changes you need to make for TDX huge page v2
and DPAMT on top of [2]?

That will save both of us the rebasing. [2] was also based on (some
other version of) kvm/next.

I think it's okay since the main goal is to show that it works. I'll
let you know when I can get to a guest_memfd_HugeTLB v3 (and all the
other patches that go into [2]).

[2] https://github.com/googleprodkernel/linux-cc/commits/wip-tdx-gmem-conversions-hugetlb-2mept-v2

> However, if it's difficult for you, please feel free to let us know.
>
> Thanks
> Yan

^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge pages
  2025-07-16 20:57                                           ` Ackerley Tng
@ 2025-07-18  5:49                                             ` Yan Zhao
  2025-07-22  5:33                                               ` Ackerley Tng
  0 siblings, 1 reply; 294+ messages in thread
From: Yan Zhao @ 2025-07-18  5:49 UTC (permalink / raw)
  To: Ackerley Tng
  Cc: vannapurve, pbonzini, seanjc, linux-kernel, kvm, x86,
	rick.p.edgecombe, dave.hansen, kirill.shutemov, tabba,
	quic_eberman, michael.roth, david, vbabka, jroedel,
	thomas.lendacky, pgonda, zhiquan1.li, fan.du, jun.miao, ira.weiny,
	isaku.yamahata, xiaoyao.li, binbin.wu, chao.p.peng

On Wed, Jul 16, 2025 at 01:57:55PM -0700, Ackerley Tng wrote:
> Yan Zhao <yan.y.zhao@intel.com> writes:
> 
> > On Thu, Jun 05, 2025 at 03:35:50PM -0700, Ackerley Tng wrote:
> >> Yan Zhao <yan.y.zhao@intel.com> writes:
> >> 
> >> > On Wed, Jun 04, 2025 at 01:02:54PM -0700, Ackerley Tng wrote:
> >> >> Hi Yan,
> >> >> 
> >> >> While working on the 1G (aka HugeTLB) page support for guest_memfd
> >> >> series [1], we took into account conversion failures too. The steps are
> >> >> in kvm_gmem_convert_range(). (It might be easier to pull the entire
> >> >> series from GitHub [2] because the steps for conversion changed in two
> >> >> separate patches.)
> >> > ...
> >> >> [2] https://github.com/googleprodkernel/linux-cc/tree/gmem-1g-page-support-rfc-v2
> >> >
> >> > Hi Ackerley,
> >> > Thanks for providing this branch.
> >> 
> >> Here's the WIP branch [1], which I initially wasn't intending to make
> >> super public since it's not even RFC standard yet and I didn't want to
> >> add to the many guest_memfd in-flight series, but since you referred to
> >> it, [2] is a v2 of the WIP branch :)
> >> 
> >> [1] https://github.com/googleprodkernel/linux-cc/commits/wip-tdx-gmem-conversions-hugetlb-2mept
> >> [2] https://github.com/googleprodkernel/linux-cc/commits/wip-tdx-gmem-conversions-hugetlb-2mept-v2
> > Hi Ackerley,
> >
> > I'm working on preparing TDX huge page v2 based on [2] from you. The current
> > decision is that the code base of TDX huge page v2 needs to include DPAMT
> > and VM shutdown optimization as well.
> >
> > So, we think kvm-x86/next is a good candidate for us.
> > (It is in repo https://github.com/kvm-x86/linux.git
> >  commit 87198fb0208a (tag: kvm-x86-next-2025.07.15, kvm-x86/next) Merge branch 'vmx',
> >  which already includes code for VM shutdown optimization).
> > I still need to port DPAMT + gmem 1G + TDX huge page v2 on top it.
> >
> > Therefore, I'm wondering if the rebase of [2] onto kvm-x86/next can be done
> > from your side. A straightforward rebase is sufficient, with no need for
> > any code modification. And it's better to be completed by the end of next
> > week.
> >
> > We thought it might be easier for you to do that (but depending on your
> > bandwidth), allowing me to work on the DPAMT part for TDX huge page v2 in
> > parallel.
> >
> 
> I'm a little tied up with some internal work, is it okay if, for the
No problem.

> next RFC, you base the changes that you need to make for TDX huge page
> v2 and DPAMT on the base of [2]?

> That will save both of us the rebasing. [2] was also based on (some
> other version of) kvm/next.
> 
> I think it's okay since the main goal is to show that it works. I'll
> let you know when I can get to a guest_memfd_HugeTLB v3 (and all the
> other patches that go into [2]).
Hmm, the upstream practice is to post code based on the latest version,
and there are lots of TDX-related fixes in the latest kvm-x86/next.


> [2] https://github.com/googleprodkernel/linux-cc/commits/wip-tdx-gmem-conversions-hugetlb-2mept-v2
> 
> > However, if it's difficult for you, please feel free to let us know.
> >
> > Thanks
> > Yan

^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge pages
  2025-07-18  5:49                                             ` Yan Zhao
@ 2025-07-22  5:33                                               ` Ackerley Tng
  2025-07-22  6:37                                                 ` Yan Zhao
  0 siblings, 1 reply; 294+ messages in thread
From: Ackerley Tng @ 2025-07-22  5:33 UTC (permalink / raw)
  To: Yan Zhao
  Cc: vannapurve, pbonzini, seanjc, linux-kernel, kvm, x86,
	rick.p.edgecombe, dave.hansen, kirill.shutemov, tabba,
	quic_eberman, michael.roth, david, vbabka, jroedel,
	thomas.lendacky, pgonda, zhiquan1.li, fan.du, jun.miao, ira.weiny,
	isaku.yamahata, xiaoyao.li, binbin.wu, chao.p.peng

Yan Zhao <yan.y.zhao@intel.com> writes:

> On Wed, Jul 16, 2025 at 01:57:55PM -0700, Ackerley Tng wrote:
>> Yan Zhao <yan.y.zhao@intel.com> writes:
>> 
>> > On Thu, Jun 05, 2025 at 03:35:50PM -0700, Ackerley Tng wrote:
>> >> Yan Zhao <yan.y.zhao@intel.com> writes:
>> >> 
>> >> > On Wed, Jun 04, 2025 at 01:02:54PM -0700, Ackerley Tng wrote:
>> >> >> Hi Yan,
>> >> >> 
>> >> >> While working on the 1G (aka HugeTLB) page support for guest_memfd
>> >> >> series [1], we took into account conversion failures too. The steps are
>> >> >> in kvm_gmem_convert_range(). (It might be easier to pull the entire
>> >> >> series from GitHub [2] because the steps for conversion changed in two
>> >> >> separate patches.)
>> >> > ...
>> >> >> [2] https://github.com/googleprodkernel/linux-cc/tree/gmem-1g-page-support-rfc-v2
>> >> >
>> >> > Hi Ackerley,
>> >> > Thanks for providing this branch.
>> >> 
>> >> Here's the WIP branch [1], which I initially wasn't intending to make
>> >> super public since it's not even RFC standard yet and I didn't want to
>> >> add to the many guest_memfd in-flight series, but since you referred to
>> >> it, [2] is a v2 of the WIP branch :)
>> >> 
>> >> [1] https://github.com/googleprodkernel/linux-cc/commits/wip-tdx-gmem-conversions-hugetlb-2mept
>> >> [2] https://github.com/googleprodkernel/linux-cc/commits/wip-tdx-gmem-conversions-hugetlb-2mept-v2
>> > Hi Ackerley,
>> >
>> > I'm working on preparing TDX huge page v2 based on [2] from you. The current
>> > decision is that the code base of TDX huge page v2 needs to include DPAMT
>> > and VM shutdown optimization as well.
>> >
>> > So, we think kvm-x86/next is a good candidate for us.
>> > (It is in repo https://github.com/kvm-x86/linux.git
>> >  commit 87198fb0208a (tag: kvm-x86-next-2025.07.15, kvm-x86/next) Merge branch 'vmx',
>> >  which already includes code for VM shutdown optimization).
>> > I still need to port DPAMT + gmem 1G + TDX huge page v2 on top it.
>> >
>> > Therefore, I'm wondering if the rebase of [2] onto kvm-x86/next can be done
>> > from your side. A straightforward rebase is sufficient, with no need for
>> > any code modification. And it's better to be completed by the end of next
>> > week.
>> >
>> > We thought it might be easier for you to do that (but depending on your
>> > bandwidth), allowing me to work on the DPAMT part for TDX huge page v2 in
>> > parallel.
>> >
>> 
>> I'm a little tied up with some internal work, is it okay if, for the
> No problem.
>
>> next RFC, you base the changes that you need to make for TDX huge page
>> v2 and DPAMT on the base of [2]?
>
>> That will save both of us the rebasing. [2] was also based on (some
>> other version of) kvm/next.
>> 
>> I think it's okay since the main goal is to show that it works. I'll
>> let you know when I can get to a guest_memfd_HugeTLB v3 (and all the
>> other patches that go into [2]).
> Hmm, the upstream practice is to post code based on latest version, and
> there're lots TDX relates fixes in latest kvm-x86/next.
>

Yup I understand.

For guest_memfd//HugeTLB I'm still waiting for guest_memfd//mmap
(managed by Fuad) to settle, and there are still plenty of comments on
the guest_memfd//conversion component to iron out, so the full update
to v3 will take longer than I think you want to wait.

I'd say for RFCs it's okay to post a patch series based on some
snapshot, since there are so many series in flight?

To unblock you, if posting based on a snapshot is really not okay, here
are some other options I can think of:

a. Use [2] and post a link to a WIP tree, similar to how [2] was
   done
b. Use some placeholder patches, assuming some interfaces of
   guest_memfd//HugeTLB, like how the first few patches in this series
   assume some interfaces of guest_memfd with THP support, and post a
   series based on the assumed interfaces

Please let me know if one of those options allows you to proceed, thanks!

>> [2] https://github.com/googleprodkernel/linux-cc/commits/wip-tdx-gmem-conversions-hugetlb-2mept-v2
>> 
>> > However, if it's difficult for you, please feel free to let us know.
>> >
>> > Thanks
>> > Yan

^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge pages
  2025-07-22  5:33                                               ` Ackerley Tng
@ 2025-07-22  6:37                                                 ` Yan Zhao
  2025-07-22 17:55                                                   ` Ackerley Tng
  0 siblings, 1 reply; 294+ messages in thread
From: Yan Zhao @ 2025-07-22  6:37 UTC (permalink / raw)
  To: Ackerley Tng
  Cc: vannapurve, pbonzini, seanjc, linux-kernel, kvm, x86,
	rick.p.edgecombe, dave.hansen, kirill.shutemov, tabba,
	quic_eberman, michael.roth, david, vbabka, jroedel,
	thomas.lendacky, pgonda, zhiquan1.li, fan.du, jun.miao, ira.weiny,
	isaku.yamahata, xiaoyao.li, binbin.wu, chao.p.peng

On Mon, Jul 21, 2025 at 10:33:14PM -0700, Ackerley Tng wrote:
> Yan Zhao <yan.y.zhao@intel.com> writes:
> 
> > On Wed, Jul 16, 2025 at 01:57:55PM -0700, Ackerley Tng wrote:
> >> Yan Zhao <yan.y.zhao@intel.com> writes:
> >> 
> >> > On Thu, Jun 05, 2025 at 03:35:50PM -0700, Ackerley Tng wrote:
> >> >> Yan Zhao <yan.y.zhao@intel.com> writes:
> >> >> 
> >> >> > On Wed, Jun 04, 2025 at 01:02:54PM -0700, Ackerley Tng wrote:
> >> >> >> Hi Yan,
> >> >> >> 
> >> >> >> While working on the 1G (aka HugeTLB) page support for guest_memfd
> >> >> >> series [1], we took into account conversion failures too. The steps are
> >> >> >> in kvm_gmem_convert_range(). (It might be easier to pull the entire
> >> >> >> series from GitHub [2] because the steps for conversion changed in two
> >> >> >> separate patches.)
> >> >> > ...
> >> >> >> [2] https://github.com/googleprodkernel/linux-cc/tree/gmem-1g-page-support-rfc-v2
> >> >> >
> >> >> > Hi Ackerley,
> >> >> > Thanks for providing this branch.
> >> >> 
> >> >> Here's the WIP branch [1], which I initially wasn't intending to make
> >> >> super public since it's not even RFC standard yet and I didn't want to
> >> >> add to the many guest_memfd in-flight series, but since you referred to
> >> >> it, [2] is a v2 of the WIP branch :)
> >> >> 
> >> >> [1] https://github.com/googleprodkernel/linux-cc/commits/wip-tdx-gmem-conversions-hugetlb-2mept
> >> >> [2] https://github.com/googleprodkernel/linux-cc/commits/wip-tdx-gmem-conversions-hugetlb-2mept-v2
> >> > Hi Ackerley,
> >> >
> >> > I'm working on preparing TDX huge page v2 based on [2] from you. The current
> >> > decision is that the code base of TDX huge page v2 needs to include DPAMT
> >> > and VM shutdown optimization as well.
> >> >
> >> > So, we think kvm-x86/next is a good candidate for us.
> >> > (It is in repo https://github.com/kvm-x86/linux.git
> >> >  commit 87198fb0208a (tag: kvm-x86-next-2025.07.15, kvm-x86/next) Merge branch 'vmx',
> >> >  which already includes code for VM shutdown optimization).
> >> > I still need to port DPAMT + gmem 1G + TDX huge page v2 on top it.
> >> >
> >> > Therefore, I'm wondering if the rebase of [2] onto kvm-x86/next can be done
> >> > from your side. A straightforward rebase is sufficient, with no need for
> >> > any code modification. And it's better to be completed by the end of next
> >> > week.
> >> >
> >> > We thought it might be easier for you to do that (but depending on your
> >> > bandwidth), allowing me to work on the DPAMT part for TDX huge page v2 in
> >> > parallel.
> >> >
> >> 
> >> I'm a little tied up with some internal work, is it okay if, for the
> > No problem.
> >
> >> next RFC, you base the changes that you need to make for TDX huge page
> >> v2 and DPAMT on the base of [2]?
> >
> >> That will save both of us the rebasing. [2] was also based on (some
> >> other version of) kvm/next.
> >> 
> >> I think it's okay since the main goal is to show that it works. I'll
> >> let you know when I can get to a guest_memfd_HugeTLB v3 (and all the
> >> other patches that go into [2]).
> > Hmm, the upstream practice is to post code based on latest version, and
> > there're lots TDX relates fixes in latest kvm-x86/next.
> >
> 
> Yup I understand.
> 
> For guest_memfd//HugeTLB I'm still waiting for guest_memfd//mmap
> (managed by Fuad) to settle, and there are plenty of comments for the
> guest_memfd//conversion component to iron out still, so the full update
> to v3 will take longer than I think you want to wait.
> 
> I'd say for RFCs it's okay to post patch series based on some snapshot,
> since there are so many series in flight?
> 
> To unblock you, if posting based on a snapshot is really not okay, here
> are some other options I can think of:
> 
> a. Use [2] and posting a link to a WIP tree, similar to how [2] was
>    done
> b. Use some placeholder patches, assuming some interfaces to
>    guest_memfd//HugeTLB, like how the first few patches in this series
>    assumes some interfaces of guest_memfd with THP support, and post a
>    series based on assumed interfaces
> 
> Please let me know if one of those options allow you to proceed, thanks!
Do you see any issues with directly rebasing [2] onto 6.16.0-rc6?

We currently prefer this approach. We have tested [2] for some time, and
the TDX huge page series doesn't rely on the implementation details of
guest_memfd.

It's ok if you are currently occupied by Google's internal tasks. No worries.

> >> [2] https://github.com/googleprodkernel/linux-cc/commits/wip-tdx-gmem-conversions-hugetlb-2mept-v2
> >> 
> >> > However, if it's difficult for you, please feel free to let us know.
> >> >
> >> > Thanks
> >> > Yan

^ permalink raw reply	[flat|nested] 294+ messages in thread

* Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge pages
  2025-07-22  6:37                                                 ` Yan Zhao
@ 2025-07-22 17:55                                                   ` Ackerley Tng
  0 siblings, 0 replies; 294+ messages in thread
From: Ackerley Tng @ 2025-07-22 17:55 UTC (permalink / raw)
  To: Yan Zhao
  Cc: vannapurve, pbonzini, seanjc, linux-kernel, kvm, x86,
	rick.p.edgecombe, dave.hansen, kirill.shutemov, tabba,
	quic_eberman, michael.roth, david, vbabka, jroedel,
	thomas.lendacky, pgonda, zhiquan1.li, fan.du, jun.miao, ira.weiny,
	isaku.yamahata, xiaoyao.li, binbin.wu, chao.p.peng

Yan Zhao <yan.y.zhao@intel.com> writes:

> On Mon, Jul 21, 2025 at 10:33:14PM -0700, Ackerley Tng wrote:
>> Yan Zhao <yan.y.zhao@intel.com> writes:
>> 
>> > On Wed, Jul 16, 2025 at 01:57:55PM -0700, Ackerley Tng wrote:
>> >> Yan Zhao <yan.y.zhao@intel.com> writes:
>> >> 
>> >> > On Thu, Jun 05, 2025 at 03:35:50PM -0700, Ackerley Tng wrote:
>> >> >> Yan Zhao <yan.y.zhao@intel.com> writes:
>> >> >> 
>> >> >> > On Wed, Jun 04, 2025 at 01:02:54PM -0700, Ackerley Tng wrote:
>> >> >> >> Hi Yan,
>> >> >> >> 
>> >> >> >> While working on the 1G (aka HugeTLB) page support for guest_memfd
>> >> >> >> series [1], we took into account conversion failures too. The steps are
>> >> >> >> in kvm_gmem_convert_range(). (It might be easier to pull the entire
>> >> >> >> series from GitHub [2] because the steps for conversion changed in two
>> >> >> >> separate patches.)
>> >> >> > ...
>> >> >> >> [2] https://github.com/googleprodkernel/linux-cc/tree/gmem-1g-page-support-rfc-v2
>> >> >> >
>> >> >> > Hi Ackerley,
>> >> >> > Thanks for providing this branch.
>> >> >> 
>> >> >> Here's the WIP branch [1], which I initially wasn't intending to make
>> >> >> super public since it's not even RFC standard yet and I didn't want to
>> >> >> add to the many guest_memfd in-flight series, but since you referred to
>> >> >> it, [2] is a v2 of the WIP branch :)
>> >> >> 
>> >> >> [1] https://github.com/googleprodkernel/linux-cc/commits/wip-tdx-gmem-conversions-hugetlb-2mept
>> >> >> [2] https://github.com/googleprodkernel/linux-cc/commits/wip-tdx-gmem-conversions-hugetlb-2mept-v2
>> >> > Hi Ackerley,
>> >> >
>> >> > I'm working on preparing TDX huge page v2 based on [2] from you. The current
>> >> > decision is that the code base of TDX huge page v2 needs to include DPAMT
>> >> > and VM shutdown optimization as well.
>> >> >
>> >> > So, we think kvm-x86/next is a good candidate for us.
>> >> > (It is in repo https://github.com/kvm-x86/linux.git
>> >> >  commit 87198fb0208a (tag: kvm-x86-next-2025.07.15, kvm-x86/next) Merge branch 'vmx',
>> >> >  which already includes code for VM shutdown optimization).
>> >> > I still need to port DPAMT + gmem 1G + TDX huge page v2 on top it.
>> >> >
>> >> > Therefore, I'm wondering if the rebase of [2] onto kvm-x86/next can be done
>> >> > from your side. A straightforward rebase is sufficient, with no need for
>> >> > any code modification. And it's better to be completed by the end of next
>> >> > week.
>> >> >
>> >> > We thought it might be easier for you to do that (but depending on your
>> >> > bandwidth), allowing me to work on the DPAMT part for TDX huge page v2 in
>> >> > parallel.
>> >> >
>> >> 
>> >> I'm a little tied up with some internal work, is it okay if, for the
>> > No problem.
>> >
>> >> next RFC, you base the changes that you need to make for TDX huge page
>> >> v2 and DPAMT on the base of [2]?
>> >
>> >> That will save both of us the rebasing. [2] was also based on (some
>> >> other version of) kvm/next.
>> >> 
>> >> I think it's okay since the main goal is to show that it works. I'll
>> >> let you know when I can get to a guest_memfd_HugeTLB v3 (and all the
>> >> other patches that go into [2]).
>> > Hmm, the upstream practice is to post code based on latest version, and
>> > there're lots TDX relates fixes in latest kvm-x86/next.
>> >
>> 
>> Yup I understand.
>> 
>> For guest_memfd//HugeTLB I'm still waiting for guest_memfd//mmap
>> (managed by Fuad) to settle, and there are plenty of comments for the
>> guest_memfd//conversion component to iron out still, so the full update
>> to v3 will take longer than I think you want to wait.
>> 
>> I'd say for RFCs it's okay to post patch series based on some snapshot,
>> since there are so many series in flight?
>> 
>> To unblock you, if posting based on a snapshot is really not okay, here
>> are some other options I can think of:
>> 
>> a. Use [2] and posting a link to a WIP tree, similar to how [2] was
>>    done
>> b. Use some placeholder patches, assuming some interfaces to
>>    guest_memfd//HugeTLB, like how the first few patches in this series
>>    assumes some interfaces of guest_memfd with THP support, and post a
>>    series based on assumed interfaces
>> 
>> Please let me know if one of those options allow you to proceed, thanks!
> Do you see any issues with directly rebasing [2] onto 6.16.0-rc6?
>

Nope I think that should be fine. Thanks for checking!

> We currently prefer this approach. We have tested [2] for some time, and TDX
> huge page series doesn't rely on the implementation details of guest_memfd.
>
> It's ok if you are currently occupied by Google's internal tasks. No worries.
>
>> >> [2] https://github.com/googleprodkernel/linux-cc/commits/wip-tdx-gmem-conversions-hugetlb-2mept-v2
>> >> 
>> >> > However, if it's difficult for you, please feel free to let us know.
>> >> >
>> >> > Thanks
>> >> > Yan

^ permalink raw reply	[flat|nested] 294+ messages in thread

end of thread, other threads:[~2025-07-22 17:55 UTC | newest]

Thread overview: 294+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-04-24  3:00 [RFC PATCH 00/21] KVM: TDX huge page support for private memory Yan Zhao
2025-04-24  3:04 ` [RFC PATCH 01/21] KVM: gmem: Allocate 2M huge page from guest_memfd backend Yan Zhao
2025-04-24  3:04 ` [RFC PATCH 02/21] x86/virt/tdx: Enhance tdh_mem_page_aug() to support huge pages Yan Zhao
2025-04-24  7:48   ` Kirill A. Shutemov
2025-04-24  8:41     ` Yan Zhao
2025-04-25  6:51   ` Binbin Wu
2025-04-25  7:19     ` Yan Zhao
2025-05-13 18:52   ` Edgecombe, Rick P
2025-05-16  9:05     ` Yan Zhao
2025-05-16 17:10       ` Edgecombe, Rick P
2025-06-19  9:26       ` Nikolay Borisov
2025-06-23  9:32         ` Yan Zhao
2025-05-15  2:16   ` Chao Gao
2025-05-16  9:07     ` Yan Zhao
2025-07-08  8:48   ` Yan Zhao
2025-07-08 13:55     ` Edgecombe, Rick P
2025-07-08 15:29       ` Vishal Annapurve
2025-07-08 15:32         ` Edgecombe, Rick P
2025-07-08 22:06           ` Vishal Annapurve
2025-07-08 23:16             ` Edgecombe, Rick P
2025-07-08 23:31               ` Vishal Annapurve
2025-07-09  2:23       ` Yan Zhao
2025-07-09 14:08         ` Edgecombe, Rick P
2025-04-24  3:04 ` [RFC PATCH 03/21] x86/virt/tdx: Add SEAMCALL wrapper tdh_mem_page_demote() Yan Zhao
2025-04-25  7:12   ` Binbin Wu
2025-04-25  7:17     ` Yan Zhao
2025-04-25  7:25       ` Binbin Wu
2025-04-25  9:24         ` Yan Zhao
2025-05-13 18:19   ` Edgecombe, Rick P
2025-05-15  8:26     ` Yan Zhao
2025-05-15 17:28       ` Edgecombe, Rick P
2025-05-16  2:23         ` Yan Zhao
2025-07-01 21:15         ` Edgecombe, Rick P
2025-04-24  3:05 ` [RFC PATCH 04/21] KVM: TDX: Enforce 4KB mapping level during TD build Time Yan Zhao
2025-04-24  7:55   ` Kirill A. Shutemov
2025-04-24  8:49     ` Yan Zhao
2025-05-13 19:12   ` Edgecombe, Rick P
2025-05-15  9:16     ` Yan Zhao
2025-05-15 17:32       ` Edgecombe, Rick P
2025-05-16 10:05         ` Yan Zhao
2025-04-24  3:05 ` [RFC PATCH 05/21] KVM: TDX: Enhance tdx_clear_page() to support huge pages Yan Zhao
2025-05-13 19:17   ` Edgecombe, Rick P
2025-05-16  2:02     ` Yan Zhao
2025-04-24  3:05 ` [RFC PATCH 06/21] KVM: TDX: Assert the reclaimed pages were mapped as expected Yan Zhao
2025-05-13 19:25   ` Edgecombe, Rick P
2025-05-16  2:11     ` Yan Zhao
2025-05-16 17:34       ` Edgecombe, Rick P
2025-04-24  3:05 ` [RFC PATCH 07/21] KVM: TDX: Add a helper for WBINVD on huge pages with TD's keyID Yan Zhao
2025-05-06  8:37   ` Binbin Wu
2025-05-16  3:10     ` Yan Zhao
2025-05-13 19:29   ` Edgecombe, Rick P
2025-05-16  3:03     ` Yan Zhao
2025-05-16 17:35       ` Edgecombe, Rick P
2025-04-24  3:06 ` [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge pages Yan Zhao
2025-04-29  0:17   ` Vishal Annapurve
2025-04-29  0:49     ` Yan Zhao
2025-04-29 13:46       ` Vishal Annapurve
2025-05-06  0:53         ` Yan Zhao
2025-05-06  5:08           ` Vishal Annapurve
2025-05-06  6:04             ` Yan Zhao
2025-05-06 13:18               ` Vishal Annapurve
2025-05-07  7:37                 ` Yan Zhao
2025-05-07 14:56                   ` Vishal Annapurve
2025-05-08  1:30                     ` Yan Zhao
2025-05-08 14:10                       ` Vishal Annapurve
2025-05-09  3:20                         ` Yan Zhao
2025-05-09 14:20                           ` Vishal Annapurve
2025-05-09 23:45                             ` Edgecombe, Rick P
2025-05-10  0:41                               ` Vishal Annapurve
2025-05-12 21:59                                 ` Edgecombe, Rick P
2025-05-12  2:15                             ` Yan Zhao
2025-05-12 16:53                               ` Vishal Annapurve
2025-05-15  3:01                                 ` Yan Zhao
2025-06-04 20:02                                   ` Ackerley Tng
2025-06-05  2:42                                     ` Yan Zhao
2025-06-05 21:12                                       ` Ackerley Tng
2025-06-16 10:43                                         ` Yan Zhao
2025-06-16 23:27                                           ` Edgecombe, Rick P
2025-06-11 14:30                                       ` Vishal Annapurve
2025-06-16  9:59                                         ` Yan Zhao
2025-06-17  0:12                                           ` Edgecombe, Rick P
2025-06-17  1:38                                             ` Yan Zhao
2025-06-17 15:52                                               ` Edgecombe, Rick P
2025-06-18  0:19                                                 ` Yan Zhao
2025-06-18  0:41                                                   ` Edgecombe, Rick P
2025-06-23  9:27                                                     ` Yan Zhao
2025-06-23 18:20                                                       ` Edgecombe, Rick P
     [not found]                                                       ` <draft-diqzh606mcz0.fsf@ackerleytng-ctop.c.googlers.com>
2025-06-23 22:48                                                         ` Ackerley Tng
2025-06-24 10:18                                                           ` Yan Zhao
2025-06-24 21:29                                                             ` Ackerley Tng
2025-06-24 22:22                                                               ` Edgecombe, Rick P
2025-06-24 22:00                                                           ` Edgecombe, Rick P
2025-06-24 22:14                                                             ` Edgecombe, Rick P
2025-06-24 23:30                                                             ` Ackerley Tng
2025-06-25  0:01                                                               ` Edgecombe, Rick P
2025-06-25  7:29                                                                 ` Yan Zhao
2025-06-25 23:09                                                                 ` Ackerley Tng
2025-06-25 23:19                                                                   ` Edgecombe, Rick P
2025-06-26 15:16                                                                     ` Shutemov, Kirill
2025-06-26 22:19                                                                       ` Edgecombe, Rick P
2025-06-27 17:59                                                                         ` Ackerley Tng
2025-06-30 11:13                                                                           ` Yan Zhao
2025-06-30 17:55                                                                             ` Edgecombe, Rick P
2025-06-30 19:25                                                                               ` Ackerley Tng
2025-06-30 21:45                                                                                 ` Edgecombe, Rick P
2025-07-01  5:01                                                                                   ` Yan Zhao
2025-07-01  5:22                                                                                     ` Vishal Annapurve
2025-07-01  6:03                                                                                       ` Yan Zhao
2025-07-01  7:13                                                                                         ` Vishal Annapurve
2025-07-01 14:15                                                                                           ` Edgecombe, Rick P
2025-07-01 22:09                                                                                         ` Ackerley Tng
2025-07-02 11:24                                                                                           ` Yan Zhao
2025-07-02 18:43                                                                                             ` Ackerley Tng
2025-07-03  4:54                                                                                               ` Yan Zhao
2025-07-14 19:32                                                                                                 ` Ackerley Tng
2025-07-01 16:13                                                                                     ` Edgecombe, Rick P
2025-07-01 21:48                                                                                       ` Ackerley Tng
2025-07-01 21:57                                                                                         ` Ackerley Tng
2025-07-01 22:37                                                                                         ` Edgecombe, Rick P
2025-07-02 20:57                                                                                           ` Ackerley Tng
2025-07-02 23:51                                                                                             ` Edgecombe, Rick P
2025-07-08 21:19                                                                                               ` Ackerley Tng
2025-07-11  1:46                                                                                                 ` Edgecombe, Rick P
2025-07-11  5:12                                                                                                   ` Yan Zhao
2025-07-11 16:14                                                                                                     ` Edgecombe, Rick P
2025-07-14 19:49                                                                                                       ` Ackerley Tng
2025-07-15 15:08                                                                                                         ` Edgecombe, Rick P
2025-07-15 22:31                                                                                                           ` Ackerley Tng
2025-07-02  9:08                                                                                       ` Yan Zhao
2025-07-02 15:28                                                                                         ` Edgecombe, Rick P
2025-07-01  5:07                                                                                 ` Yan Zhao
2025-07-01 22:01                                                                                   ` Ackerley Tng
2025-07-01 22:26                                                                                     ` Ackerley Tng
2025-06-30 21:47                                                                               ` Vishal Annapurve
2025-07-01  9:35                                                                               ` Yan Zhao
2025-07-01 13:32                                                                                 ` Vishal Annapurve
2025-07-01 14:02                                                                                   ` Vishal Annapurve
2025-07-01 15:42                                                                                     ` Edgecombe, Rick P
2025-07-01 16:14                                                                                   ` Edgecombe, Rick P
2025-07-02  8:54                                                                                   ` Yan Zhao
2025-07-02 13:12                                                                                     ` Vishal Annapurve
2025-06-25  7:08                                                               ` Yan Zhao
2025-06-25 22:54                                                                 ` Ackerley Tng
2025-06-24 22:03                                                           ` Edgecombe, Rick P
2025-06-17  0:25                                           ` Edgecombe, Rick P
2025-06-17  2:00                                             ` Yan Zhao
2025-06-17  3:51                                           ` Vishal Annapurve
2025-06-17  6:52                                             ` Yan Zhao
2025-06-17  8:09                                               ` Vishal Annapurve
2025-06-17  9:57                                                 ` Yan Zhao
2025-06-18  4:25                                                   ` Vishal Annapurve
2025-06-18  0:34                                                 ` Edgecombe, Rick P
2025-06-18  0:46                                                   ` Yan Zhao
2025-06-18  4:33                                                     ` Vishal Annapurve
2025-06-18  6:13                                                       ` Yan Zhao
2025-06-18  6:21                                                         ` Vishal Annapurve
2025-06-18  6:32                                                           ` Yan Zhao
2025-06-18  6:44                                                             ` Vishal Annapurve
2025-06-18  6:57                                                               ` Yan Zhao
2025-06-18  4:29                                                   ` Vishal Annapurve
2025-06-19  0:22                                                     ` Edgecombe, Rick P
2025-06-05  2:47                                     ` Yan Zhao
2025-06-05 22:35                                       ` Ackerley Tng
2025-06-19  8:11                                         ` Yan Zhao
2025-06-20 18:06                                           ` Vishal Annapurve
2025-07-16  1:23                                         ` Yan Zhao
2025-07-16 20:57                                           ` Ackerley Tng
2025-07-18  5:49                                             ` Yan Zhao
2025-07-22  5:33                                               ` Ackerley Tng
2025-07-22  6:37                                                 ` Yan Zhao
2025-07-22 17:55                                                   ` Ackerley Tng
2025-05-12 19:00                           ` Ackerley Tng
2025-05-12 21:44                             ` Edgecombe, Rick P
2025-04-24  3:06 ` [RFC PATCH 09/21] KVM: TDX: Enable 2MB mapping size after TD is RUNNABLE Yan Zhao
2025-05-13 20:10   ` Edgecombe, Rick P
2025-05-16  1:35     ` Huang, Kai
2025-05-16  9:43       ` Yan Zhao
2025-05-16 22:35         ` Huang, Kai
2025-05-16 23:47           ` Edgecombe, Rick P
2025-05-19  8:32           ` Yan Zhao
2025-05-19 16:53             ` Edgecombe, Rick P
2025-05-20  9:34               ` Yan Zhao
2025-05-20 23:47                 ` Huang, Kai
2025-06-11 14:42                   ` Sean Christopherson
2025-06-12 23:39                     ` Edgecombe, Rick P
2025-06-13  0:19                       ` Sean Christopherson
2025-06-13  0:25                         ` Edgecombe, Rick P
2025-06-13  0:44                           ` Sean Christopherson
2025-06-13  0:47                             ` Edgecombe, Rick P
2025-06-13  1:32                               ` Yan Zhao
2025-06-13 21:53                                 ` Edgecombe, Rick P
2025-06-13 22:19                                   ` Sean Christopherson
2025-06-13 23:33                                     ` Edgecombe, Rick P
2025-06-16  3:14                                       ` Yan Zhao
2025-06-16 22:49                                         ` Edgecombe, Rick P
2025-06-17  0:52                                           ` Yan Zhao
2025-06-18  0:30                                             ` Yan Zhao
2025-06-20 16:31                                               ` Sean Christopherson
2025-06-23 21:44                                                 ` Edgecombe, Rick P
2025-06-24  9:57                                                   ` Yan Zhao
2025-06-24 18:35                                                     ` Edgecombe, Rick P
2025-06-25  9:28                                                       ` Yan Zhao
2025-06-25  9:36                                                         ` Yan Zhao
2025-06-25 14:48                                                           ` Edgecombe, Rick P
2025-06-26  0:50                                                             ` Yan Zhao
2025-06-25 14:47                                                         ` Edgecombe, Rick P
2025-06-26  8:53                                                           ` Yan Zhao
2025-07-01  0:42                                                             ` Edgecombe, Rick P
2025-07-01  2:41                                                               ` Yan Zhao
2025-07-01 15:36                                                                 ` Edgecombe, Rick P
2025-07-02  0:12                                                                   ` Yan Zhao
2025-07-02  0:18                                                                     ` Edgecombe, Rick P
2025-07-02  1:07                                                                       ` Yan Zhao
2025-07-02 15:26                                                                         ` Edgecombe, Rick P
2025-07-02  3:31                                                                       ` Yan Zhao
2025-06-25 13:47                                                       ` Vishal Annapurve
2025-06-25 15:51                                                         ` Edgecombe, Rick P
2025-06-18  1:22                                             ` Edgecombe, Rick P
2025-06-18 11:32                                               ` Shutemov, Kirill
2025-06-20 16:32                                                 ` Sean Christopherson
2025-06-20 17:44                                                   ` Kirill Shutemov
2025-06-20 18:40                                                     ` Sean Christopherson
2025-06-20 19:26                                                       ` Kirill Shutemov
2025-06-13  2:41                     ` Xiaoyao Li
2025-06-13  3:29                       ` Yan Zhao
2025-06-13  5:35                         ` Yan Zhao
2025-06-13  6:08                           ` Xiaoyao Li
2025-05-21 15:40                 ` Edgecombe, Rick P
2025-05-22  3:52                   ` Yan Zhao
2025-05-23 23:40                     ` Edgecombe, Rick P
2025-05-27  1:31                       ` Yan Zhao
2025-05-20 23:34             ` Huang, Kai
2025-05-21  2:35               ` Yan Zhao
2025-05-16  9:28     ` Yan Zhao
2025-04-24  3:06 ` [RFC PATCH 10/21] KVM: x86/mmu: Disallow page merging (huge page adjustment) for mirror root Yan Zhao
2025-05-13 20:15   ` Edgecombe, Rick P
2025-05-16  4:01     ` Yan Zhao
2025-05-16 17:50       ` Edgecombe, Rick P
2025-05-19  3:57         ` Yan Zhao
2025-05-19 17:42           ` Edgecombe, Rick P
2025-05-20 10:11             ` Yan Zhao
2025-04-24  3:06 ` [RFC PATCH 11/21] KVM: x86: Add "vcpu" "gfn" parameters to x86 hook private_max_mapping_level Yan Zhao
2025-04-24  3:07 ` [RFC PATCH 12/21] KVM: TDX: Determine max mapping level according to vCPU's ACCEPT level Yan Zhao
2025-05-13 21:20   ` Edgecombe, Rick P
2025-05-16  6:12     ` Xiaoyao Li
2025-05-16  6:30     ` Yan Zhao
2025-05-16 22:02       ` Edgecombe, Rick P
2025-05-19  6:39         ` Yan Zhao
2025-05-19 20:17           ` Edgecombe, Rick P
2025-04-24  3:07 ` [RFC PATCH 13/21] KVM: x86/tdp_mmu: Alloc external_spt page for mirror page table splitting Yan Zhao
2025-04-24  3:07 ` [RFC PATCH 14/21] KVM: x86/tdp_mmu: Invoke split_external_spt hook with exclusive mmu_lock Yan Zhao
2025-05-13 23:06   ` Edgecombe, Rick P
2025-05-16  9:17     ` Yan Zhao
2025-05-16 22:11       ` Edgecombe, Rick P
2025-05-19  4:01         ` Yan Zhao
2025-05-19 20:21           ` Edgecombe, Rick P
2025-05-20  5:40   ` Binbin Wu
2025-05-20  9:40     ` Yan Zhao
2025-04-24  3:08 ` [RFC PATCH 15/21] KVM: TDX: Support huge page splitting with exclusive kvm->mmu_lock Yan Zhao
2025-05-20  6:18   ` Binbin Wu
2025-05-20  9:40     ` Yan Zhao
2025-07-02 15:47   ` Edgecombe, Rick P
2025-04-24  3:08 ` [RFC PATCH 16/21] KVM: x86/mmu: Introduce kvm_split_boundary_leafs() to split boundary leafs Yan Zhao
2025-05-13 22:56   ` Edgecombe, Rick P
2025-05-16  7:46     ` Yan Zhao
2025-05-16  8:03       ` Yan Zhao
2025-05-16 22:27         ` Edgecombe, Rick P
2025-05-19  8:12           ` Yan Zhao
2025-05-16 11:44       ` Yan Zhao
2025-05-16 22:16         ` Edgecombe, Rick P
2025-04-24  3:08 ` [RFC PATCH 17/21] KVM: Change the return type of gfn_handler_t() from bool to int Yan Zhao
2025-04-24  3:08 ` [RFC PATCH 18/21] KVM: x86: Split huge boundary leafs before private to shared conversion Yan Zhao
2025-05-09 23:34   ` Edgecombe, Rick P
2025-05-12  2:25     ` Yan Zhao
2025-05-12 21:53       ` Edgecombe, Rick P
2025-04-24  3:08 ` [RFC PATCH 19/21] KVM: gmem: Split huge boundary leafs for punch hole of private memory Yan Zhao
2025-04-24 10:19   ` Francesco Lavra
2025-04-25  1:55     ` Yan Zhao
2025-05-13 22:59   ` Edgecombe, Rick P
2025-05-16  8:19     ` Yan Zhao
2025-04-24  3:09 ` [RFC PATCH 20/21] KVM: x86: Force a prefetch fault's max mapping level to 4KB for TDX Yan Zhao
2025-05-13 23:20   ` Edgecombe, Rick P
2025-05-16  8:43     ` Yan Zhao
2025-05-21  3:30   ` Binbin Wu
2025-05-21  5:03     ` Yan Zhao
2025-04-24  3:09 ` [RFC PATCH 21/21] KVM: x86: Ignore splitting huge pages in fault path " Yan Zhao
2025-05-13 21:58   ` Edgecombe, Rick P
2025-05-16  6:40     ` Yan Zhao
2025-04-24  7:35 ` [RFC PATCH 00/21] KVM: TDX huge page support for private memory Kirill A. Shutemov
2025-04-24  8:33   ` Yan Zhao
2025-04-24  9:05     ` Kirill A. Shutemov
2025-04-24  9:08       ` Juergen Gross
2025-04-24  9:49       ` Yan Zhao
2025-04-24 10:39         ` Kirill A. Shutemov

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).