public inbox for linux-kernel@vger.kernel.org
* [PATCH v3 00/24] KVM: TDX huge page support for private memory
@ 2026-01-06 10:16 Yan Zhao
  2026-01-06 10:18 ` [PATCH v3 01/24] x86/tdx: Enhance tdh_mem_page_aug() to support huge pages Yan Zhao
                   ` (25 more replies)
  0 siblings, 26 replies; 127+ messages in thread
From: Yan Zhao @ 2026-01-06 10:16 UTC (permalink / raw)
  To: pbonzini, seanjc
  Cc: linux-kernel, kvm, x86, rick.p.edgecombe, dave.hansen, kas, tabba,
	ackerleytng, michael.roth, david, vannapurve, sagis, vbabka,
	thomas.lendacky, nik.borisov, pgonda, fan.du, jun.miao,
	francescolavra.fl, jgross, ira.weiny, isaku.yamahata, xiaoyao.li,
	kai.huang, binbin.wu, chao.p.peng, chao.gao, yan.y.zhao

This is v3 of the TDX huge page series. There are no major changes to the
design; this revision incorporates various comments on RFC v2 [0] and
switches to using an external cache for splitting, to align with changes
in DPAMT v4 [3]. The full stack is available at [4].

Dave/Kirill/Rick/x86 folks, the patches that will eventually need an ack
are patches 1-5. I would appreciate some review on them, with the
understanding that they may need further refinement before they're ready
for ack.
 
Sean, I'm feeling pretty good about the design at this point; however,
there are a few remaining design opens (see the next section, with
references to specific patches). I'm wondering if we can close these as
part of the review of this revision. Thanks a lot!


Highlight
-------------
- Request review of the tip part (patches 1-5)

  Patches 1-5 contain SEAMCALL wrapper updates, which fall under tip.
  Besides introducing a SEAMCALL wrapper for demote, the other changes
  mainly convert mapping/unmapping related SEAMCALL wrappers from taking
  "struct page *page" to taking "struct folio *folio" plus "unsigned long
  start_idx" to support huge pages.

- EPT mapping size and folio size

  This series is built upon the rule in KVM that the mapping size in the
  KVM-managed secondary MMU is no larger than the backend folio size.
  
  Therefore, sanity checks are added to the SEAMCALL wrappers in patches
  1-5 to enforce this rule: in tdh_mem_page_aug() for mapping (patch 1),
  and in tdh_phymem_page_wbinvd_hkid() (patch 3), tdx_quirk_reset_folio()
  (patch 4), and tdh_phymem_page_reclaim() (patch 5) for unmapping.

  However, as Vishal pointed out in [7], the new hugetlb-based guest_memfd
  [1] splits backend folios ahead of notifying KVM for unmapping. So, this
  series also relies on the fixup patch [8] to notify KVM of unmapping
  before splitting the backend folio during the memory conversion ioctl.
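
  The folio-containment rule checked by these wrappers can be sketched as
  a standalone predicate (illustrative, not kernel code; a mapping at
  level N covers 512^N 4KB pages, so level 0 = 4KB and level 1 = 2MB):

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Illustrative sketch only: the sanity check the SEAMCALL wrappers
 * perform, expressed as a standalone predicate.  A mapping at "level"
 * starting at page index "start_idx" must lie entirely within a folio
 * of "folio_nr_pages" 4KB pages.
 */
#define PTE_SHIFT 9	/* 512 entries per page-table level */

static bool mapping_within_folio(unsigned long start_idx, int level,
				 unsigned long folio_nr_pages)
{
	unsigned long npages = 1UL << (level * PTE_SHIFT);

	return start_idx + npages <= folio_nr_pages;
}
```

  In patch 1, tdh_mem_page_aug() performs the equivalent check and
  returns TDX_OPERAND_INVALID when it fails.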


- Enable TDX huge pages only on new TDX modules (patch 2, patch 24)

  This v3 detects whether the TDX module supports feature
  TDX_FEATURES0.ENHANCED_DEMOTE_INTERRUPTIBILITY and disables TDX huge
  page support for private memory on TDX modules without this feature.

  Additionally, v3 provides a new module parameter "tdx_huge_page" to turn
  off TDX huge pages for private memory by executing
  "modprobe kvm_intel tdx_huge_page=N".

- Dynamic PAMT v4 integration (patches 17-23)

  Currently, DPAMT's involvement with TDX huge pages is limited to page
  splitting. v3 introduces KVM x86 ops for the per-VM external cache for
  splitting (patches 19, 20).

  The per-VM external cache holds pre-allocated pages (patch 19), which are
  dequeued during TDX's splitting operations for installing PAMT pages
  (patches 21-22).

  The general logic of managing the per-VM external cache remains similar
  to the per-VM cache in RFC v2; the difference is that the KVM MMU core
  now notifies TDX to implement page pre-allocation/enqueuing, page count
  checking, and page freeing. This external approach makes it easier to
  add a lock protecting page enqueuing (in the
  topup_external_per_vm_split_cache() op) and page dequeuing (in the
  split_external_spte() op) for the per-VM cache. (See more details in
  the DPAMT bullet in the "Full Design" section.)

  Since the basic DPAMT design (without huge pages) is not yet settled,
  feel free to ignore the review request for this part. However, we would
  appreciate some level of review, given the differences between the
  per-VM cache and the per-vCPU cache.


Changes from RFC v2
-------------------
- Dropped 2 patches:
  "KVM: TDX: Do not hold page refcount on private guest pages"
  (pulled by Sean separately),

  "x86/virt/tdx: Do not perform cache flushes unless CLFLUSH_BEFORE_ALLOC
   is set"
   (dropped to stay consistent with the always-clflush policy; the
    DPAMT-related bug in TDH.PHYMEM.PAGE.WBINVD only existed in old,
    unreleased TDX modules for DPAMT).

- TDX_INTERRUPTED_RESTARTABLE error handling:
  Disable TDX huge pages if the TDX module does not support
  TDX_FEATURES0.ENHANCED_DEMOTE_INTERRUPTIBILITY.

- Added a module parameter to turn on/off TDX huge pages.

- Renamed tdx_sept_split_private_spt() to tdx_sept_split_private_spte(),
  passing parameter "old_mirror_spte" instead of "pfn_for_gfn". Updated the
  function implementation to align with Sean's TDX cleanup series.

- Updated API kvm_split_cross_boundary_leafs() to not flush TLBs internally
  or return split/flush status, and added a default implementation for
  non-x86 platforms.

- Renamed tdx_check_accept_level() to tdx_honor_guest_accept_level().
  Errors from tdx_honor_guest_accept_level() are now returned to
  userspace instead of being retried in KVM.

- Introduced KVM x86 ops for the per-VM external cache for splitting to
  align with DPAMT v4, re-organized the DPAMT related patches, refined
  parameter names for better readability, and added missing KVM_BUG_ON()
  and warnings.

- Changed nth_page() to folio_page() in SEAMCALL wrappers, corrected a
  minor bug in handling reclaimed size mismatch error.

- Removed unnecessary declarations, fixed typos and improved patch logs and
  comments.


Patches Layout
--------------
- Patches 01-05: Update SEAMCALL wrappers/helpers for huge pages.

- Patch 06:      Disallow page merging for TDX.

- Patches 07-09: Enable KVM MMU core to propagate splitting requests to TDX
                 and provide the corresponding implementation in TDX.

- Patches 10-11: Introduce a higher level API
                 kvm_split_cross_boundary_leafs() for splitting
                 cross-boundary mappings.

- Patches 12-13: Honor guest accept level in EPT violation handler.

- Patches 14-16: Split huge pages on page conversion/punch hole.

- Patches 17-23: Dynamic PAMT related changes for TDX huge pages.

- Patch 24:      Turn on 2MB huge pages if tdx_huge_page is true.
                 (Adds the module parameter tdx_huge_page and turns it
                 off if the TDX module does not support
                 TDX_FEATURES0.ENHANCED_DEMOTE_INTERRUPTIBILITY.)


Full Design (optional reading)
--------------------------------------
1. guest_memfd dependency

   According to the rule in KVM that the mapping size in the KVM-managed
   secondary MMU should not exceed the backend folio size, the TDX huge
   page series relies on guest_memfd's ability to allocate huge folios.

   However, this series does not depend on guest_memfd's implementation
   details, except for patch 16, which invokes API
   kvm_split_cross_boundary_leafs() to request KVM splitting of private
   huge mappings that cross the specified range boundary in guest_memfd
   punch hole and private-to-shared conversion ioctls.

   Therefore, the TDX huge page series can also function with 2MB THP
   guest_memfd by updating patch 16 to match the underlying guest_memfd.

2. Restrictions:
   1) Splitting under read mmu_lock is not supported.

      With the current TDX module (i.e., without the NON-BLOCKING-RESIZE
      feature), handling BUSY errors returned from splitting operations
      under read mmu_lock requires invoking tdh_mem_range_unblock(), which
      may also encounter BUSY errors. To avoid complicating the lock
      design, splitting under read mmu_lock for TDX is rejected until the
      TDX module supports the NON-BLOCKING-RESIZE feature.

      However, when TDs perform ACCEPT operations at 4KB level after KVM
      creates huge mappings (e.g., non-Linux TDs' ACCEPT operations that
      occur after memory access, or Linux TDs' ACCEPT operations on a
      pre-faulted range), splitting operations are required in the fault
      path, which usually holds read mmu_lock.

      Ignoring splitting operations in the fault path would cause repeated
      faults. Forcing KVM's mapping to 4KB before the guest's ACCEPT level
      is available would disable huge pages for non-Linux TDs, since the
      TD's later ACCEPT operations at higher levels would return
      TDX_PAGE_SIZE_MISMATCH error to the guest. (See "ACCEPT level and
      lpage_info" bullet).

      This series holds write mmu_lock for splitting operations in the
      fault path. To maintain performance, the implementation only
      acquires write mmu_lock when necessary, keeping the write mmu_lock
      acquisition count at an acceptable level [6].

   2) No huge page if TDX_INTERRUPTED_RESTARTABLE error is possible

      SEAMCALL TDH_MEM_PAGE_DEMOTE can return the error
      TDX_INTERRUPTED_RESTARTABLE (due to interrupts arriving during
      SEAMCALL execution), and the TDX module provides no guaranteed
      maximum retry count to ensure forward progress of page demotion.
      Interrupt storms could then result in a DoS if the host simply
      retries endlessly on TDX_INTERRUPTED_RESTARTABLE. Disabling
      interrupts in the host before invoking the SEAMCALL cannot prevent
      TDX_INTERRUPTED_RESTARTABLE either, because NMIs can also trigger
      the error. Therefore, given that SEAMCALL execution time for
      demotion in basic TDX remains at a reasonable level, the tradeoff
      is to have the TDX module not check for interrupts during SEAMCALL
      TDH_MEM_PAGE_DEMOTE, eliminating the TDX_INTERRUPTED_RESTARTABLE
      error.
 
      This series detects whether the TDX module supports feature
      TDX_FEATURES0.ENHANCED_DEMOTE_INTERRUPTIBILITY and disables TDX huge
      pages on private memory for TDX modules without this feature, thus
      avoiding TDX_INTERRUPTED_RESTARTABLE error for basic TDX (i.e.,
      without TD partition).
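
      Illustratively, the enablement check reduces to a feature-bit test
      (the bit position below is hypothetical and chosen only for the
      sketch; the real value comes from the TDX module's TDX_FEATURES0
      metadata field):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Illustrative sketch only: gate TDX huge pages on the
 * ENHANCED_DEMOTE_INTERRUPTIBILITY feature bit.  The bit position is
 * hypothetical, not the TDX module's actual assignment.
 */
#define TDX_FEATURES0_ENHANCED_DEMOTE_INTERRUPTIBILITY	(1ULL << 20)

static bool tdx_huge_page_supported(uint64_t tdx_features0)
{
	return tdx_features0 & TDX_FEATURES0_ENHANCED_DEMOTE_INTERRUPTIBILITY;
}
```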

3. Page size is forced to 4KB during TD build time.

   Always return 4KB in the KVM x86 hook gmem_max_mapping_level() to force
   4KB mapping size during TD build time, because:

   - tdh_mem_page_add() only adds private pages at 4KB.
   - The amount of initial private memory is limited (typically ~4MB).

4. Page size during TD runtime can be up to 2MB.

   Return up to 2MB in the KVM x86 hook gmem_max_mapping_level():
   Use module parameter "tdx_huge_page" to control whether to return 2MB,
   and turn off tdx_huge_page if the TDX module does not support
   TDX_FEATURES0.ENHANCED_DEMOTE_INTERRUPTIBILITY.

   When returning 2MB in KVM x86 hook gmem_max_mapping_level(), KVM may
   still map a page at 4KB due to:
   (1) the backend folio is 4KB,
   (2) disallow_lpage restrictions:
       - mixed private/shared pages in the 2MB range,
       - level misalignment due to slot base_gfn, slot size, and ugfn,
       - GUEST_INHIBIT bit set due to guest ACCEPT operation.
   (3) page merging is disallowed (e.g., when part of a 2MB range has been
       mapped at 4KB level during TD build time).
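
   The factors above can be folded into a single illustrative computation
   (a made-up helper for this sketch; KVM's real logic is spread across
   the gmem_max_mapping_level() hook, lpage_info, and the fault path):

```c
#include <assert.h>

/*
 * Illustrative sketch only.  Level numbering follows KVM's PG_LEVEL_*
 * convention (1 = 4KB, 2 = 2MB).
 */
enum { PG_LEVEL_4K = 1, PG_LEVEL_2M = 2 };

static int private_mapping_level(int tdx_huge_page_enabled,
				 int folio_is_2m,
				 int disallow_lpage,   /* mixed attrs, misalignment,
							* or GUEST_INHIBIT */
				 int merge_disallowed) /* 4KB leafs left from
							* TD build time */
{
	if (!tdx_huge_page_enabled || !folio_is_2m ||
	    disallow_lpage || merge_disallowed)
		return PG_LEVEL_4K;

	return PG_LEVEL_2M;
}
```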

5. ACCEPT level and lpage_info

   KVM needs to honor TDX guest's ACCEPT level by ensuring a fault's
   mapping level is no higher than the guest's ACCEPT level, which is due
   to the current TDX module implementation:
   - If host mapping level > guest's ACCEPT level, repeated faults carrying
     the guest's ACCEPT level info are generated. No error is returned to
     the guest's ACCEPT operation.
   - If host mapping level < guest's ACCEPT level, the guest's ACCEPT
     operation returns TDX_PAGE_SIZE_MISMATCH error.

   It's efficient to pass the guest's ACCEPT level info to KVM MMU core
   before KVM actually creates the mapping.
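
   A minimal model of the two mismatch cases listed above (illustrative
   only; the outcome names mirror the text, not actual KVM symbols):

```c
#include <assert.h>

/*
 * Illustrative sketch only: outcome of a guest ACCEPT at "accept_level"
 * against a host S-EPT mapping at "host_level" (1 = 4KB, 2 = 2MB).
 */
enum accept_outcome {
	ACCEPT_OK,		/* levels match */
	ACCEPT_REFAULT,		/* host > ACCEPT: repeated EPT violations
				 * carrying the ACCEPT level */
	ACCEPT_SIZE_MISMATCH,	/* host < ACCEPT: guest gets
				 * TDX_PAGE_SIZE_MISMATCH */
};

static enum accept_outcome accept_vs_mapping(int host_level, int accept_level)
{
	if (host_level > accept_level)
		return ACCEPT_REFAULT;
	if (host_level < accept_level)
		return ACCEPT_SIZE_MISMATCH;
	return ACCEPT_OK;
}
```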

   Following the suggestions from Sean/Rick, this series introduces a
   global approach (vs. fault-by-fault per-vCPU) to pass the ACCEPT level
   info: setting the GUEST_INHIBIT bit in a slot's lpage_info(s) above the
   specified guest ACCEPT level. The global approach helps simplify vendor
   code while maintaining a global view across vCPUs. 
   
   Since page merging is currently not supported, and given the limitation
   that page splitting under read mmu_lock is not supported, the
   GUEST_INHIBIT bit is set in a one-way manner under write mmu_lock. Once
   set, it will not be unset, and the GFN will not be allowed to be mapped
   at higher levels, even if the mappings are zapped and re-accepted by the
   guest at higher levels later. This approach simplifies the code and
   prevents potential subtle bugs from different ACCEPT levels specified by
   different vCPUs. Tests showed the one-way manner has minimal impact on
   the huge page map count with a typical Linux TD [5].

   Typical scenarios of honoring 4KB ACCEPT level:
   a. If the guest accesses private memory without first accepting it,
      1) Guest accesses private memory.
      2) KVM maps the private memory at 2MB if no huge page restrictions
         exist.
      3) Guest accepts the private memory at 4KB.
      4) KVM receives an EPT violation VMExit, whose type indicates it's
         caused by the guest's ACCEPT operation at 4KB level.
      5) KVM honors the guest's ACCEPT level by setting the GUEST_INHIBIT
         bit in lpage_info(s) above 4KB level and splitting the created 2MB
         mapping.
      6) Guest accepts the private memory at 4KB level successfully and
         accesses the private memory.

   b. If the guest first accepts private memory before accessing,
      1) Guest accepts private memory at 4KB level.
      2) KVM receives an EPT violation VMExit, whose type indicates it's
         caused by guest's ACCEPT operation at 4KB level.
      3) KVM honors the guest's ACCEPT level by setting the GUEST_INHIBIT
         bit in lpage_info(s) above 4KB level, thus creating the mapping at
         4KB level.
      4) Guest accepts the private memory at 4KB level successfully and
         accesses the private memory.
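
   The one-way behavior can be modeled as a sticky flag (illustrative
   sketch; the flag value is hypothetical, not KVM's actual lpage_info
   encoding):

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Illustrative sketch only: GUEST_INHIBIT as a sticky bit in a
 * per-2MB-region flag word.  There is deliberately no clear helper:
 * once set, the region is limited to 4KB mappings even after a zap
 * and re-accept at a higher level.
 */
#define KVM_LPAGE_GUEST_INHIBIT_FLAG	0x1u	/* hypothetical bit */

static unsigned int hugepage_set_guest_inhibit(unsigned int lpage_flags)
{
	/* set under write mmu_lock in KVM */
	return lpage_flags | KVM_LPAGE_GUEST_INHIBIT_FLAG;
}

static bool hugepage_allowed(unsigned int lpage_flags)
{
	return !(lpage_flags & KVM_LPAGE_GUEST_INHIBIT_FLAG);
}
```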

6. Page splitting (page demotion)

   With the current TDX module, splitting huge mappings in S-EPT requires
   executing tdh_mem_range_block(), tdh_mem_track(), kicking off vCPUs,
   and tdh_mem_page_demote() in sequence. If DPAMT is involved,
   tdh_phymem_pamt_add() is also required.

   Page splitting can occur due to:
   (1) private-to-shared conversion,
   (2) hole punching in guest_memfd,
   (3) guest's ACCEPT operation at lower level than host mapping level.

   All paths trigger splitting by invoking kvm_split_cross_boundary_leafs()
   under write mmu_lock.

   TDX_OPERAND_BUSY is thus handled similarly to removing a private page,
   i.e., by kicking off all vCPUs and retrying, which should succeed on the
   second attempt. Other errors from tdh_mem_page_demote() are not
   expected, triggering TDX_BUG_ON().
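
   The retry policy above, expressed as an illustrative pure function over
   the first and retried SEAMCALL results (the error values are stand-ins,
   not the real TDX status codes):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Illustrative sketch only: handle TDX_OPERAND_BUSY from demote by
 * kicking all vCPUs out of the TD and retrying once; any other nonzero
 * error is unexpected and treated as a bug.
 */
#define TDX_SUCCESS		0	/* stand-in value */
#define TDX_OPERAND_BUSY	1	/* stand-in value */

enum demote_result { DEMOTE_OK, DEMOTE_BUG };

static enum demote_result demote_policy(uint64_t first_err, uint64_t retry_err)
{
	if (first_err == TDX_SUCCESS)
		return DEMOTE_OK;
	if (first_err != TDX_OPERAND_BUSY)
		return DEMOTE_BUG;	/* unexpected error */
	/* BUSY: kick all vCPUs, then the single retry should succeed */
	return retry_err == TDX_SUCCESS ? DEMOTE_OK : DEMOTE_BUG;
}
```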

7. Page merging (page promotion)

   Promotion is disallowed, because:

   - The current TDX module requires all 4KB leafs to be either all PENDING
     or all ACCEPTED before a successful promotion to 2MB. This requirement
     prevents successful page merging after partially converting a 2MB
     range from private to shared and then back to private, which is the
     primary scenario necessitating page promotion.

   - tdh_mem_page_promote() depends on tdh_mem_range_block() in the current
     TDX module. Consequently, handling BUSY errors is complex, as page
     merging typically occurs in the fault path under shared mmu_lock.

   - Limited amount of initial private memory (typically ~4MB) means the
     need for page merging during TD build time is minimal.

8. Precise zapping of private memory

   Since TDX requires guest's ACCEPT operation on host's mapping of private
   memory, zapping private memory for guest_memfd punch hole and
   private-to-shared conversion must be precise and preceded by splitting
   private memory.

   Patch 16 serves this purpose and is the only patch in the TDX huge page
   series sensitive to guest_memfd's implementation changes.

9. DPAMT
   Currently, DPAMT's involvement with TDX huge page is limited to page
   splitting.

   As shown in the following call stack, DPAMT pages used by splitting are
   pre-allocated and queued in the per-VM external split cache. They are
   dequeued and consumed in tdx_sept_split_private_spte().

   kvm_split_cross_boundary_leafs
     kvm_tdp_mmu_gfn_range_split_cross_boundary_leafs
       tdp_mmu_split_huge_pages_root
 (*)      1) tdp_mmu_alloc_sp_for_split()
  +-----2.1) need_topup_external_split_cache(): check if enough pages in
  |          the external split cache. Go to 3 if pages are enough.
  |  +--2.2) topup_external_split_cache(): preallocate/enqueue pages in
  |  |       the external split cache.
  |  |    3) tdp_mmu_split_huge_page
  |  |         tdp_mmu_link_sp
  |  |           tdp_mmu_iter_set_spte
  |  |(**)         tdp_mmu_set_spte
  |  |               split_external_spte 
  |  |                 kvm_x86_call(split_external_spte)
  |  |                   tdx_sept_split_private_spte
  |  |                   3.1) BLOCK, TRACK
  +--+-------------------3.2) Dequeue PAMT pages from the external split
  |  |                        cache for the new sept page
  |  |                   3.3) PAMT_ADD for the new sept page
  +--+-------------------3.4) Dequeue PAMT pages from the external split
                              cache for the 2MB guest private memory.
                         3.5) DEMOTE.
                         3.6) Update PAMT refcount of the 2MB guest private
                              memory.

   (*) The write mmu_lock is held across the check for sufficient cache
       pages in step 2.1 and the page dequeuing in steps 3.2 and 3.4,
       ensuring that dequeuing always finds enough pages in the cache.

  (**) A spinlock prealloc_split_cache_lock is used inside the TDX's cache
       implementation to protect page enqueuing in step 2.2 and page
       dequeuing in steps 3.2 and 3.4.
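
   The cache accounting can be reduced to an illustrative model (the real
   implementation enqueues/dequeues actual pages under
   prealloc_split_cache_lock; here only the counts are modeled):

```c
#include <assert.h>

/*
 * Illustrative sketch only: the per-VM external split cache as a page
 * count.  Topup brings the count up to "need"; dequeue consumes and
 * can never underflow because write mmu_lock is held from the
 * sufficiency check through the dequeue.
 */
static unsigned long cache_topup(unsigned long cached, unsigned long need)
{
	/* allocate need - cached pages if short, else no-op */
	return cached >= need ? cached : need;
}

static unsigned long cache_dequeue(unsigned long cached, unsigned long take)
{
	/* caller guarantees cached >= take (see (*) above) */
	return cached - take;
}
```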


10. TDX does not hold page ref count

   TDX previously held a folio refcount and didn't release it if it
   failed to zap the S-EPT. However, guest_memfd's in-place conversion
   feature requires TDX not to hold folio refcounts. Several approaches
   were explored. RFC v2 ultimately just avoided holding the page
   refcount in TDX and generated KVM_BUG_ON() on S-EPT zapping failure,
   without propagating the failure back to guest_memfd or notifying
   guest_memfd through out-of-band methods, considering the complexity
   involved and that S-EPT zapping failure is currently only possible
   due to KVM or TDX module bugs.

   This approach was acked by Sean and the patch to drop holding TDX page
   refcount was pulled separately.


Base
----
This is based on the latest WIP hugetlb-based guest_memfd code [1], with
"Rework preparation/population" series v2 [2] and DPAMT v4 [3] rebased on
it. For the full stack see [4].

Four issues are identified in the WIP hugetlb-based guest_memfd [1]:

(1) Compilation error due to missing symbol export of
    hugetlb_restructuring_free_folio().

(2) guest_memfd splits backend folios when the folio is still mapped as
    huge in KVM (which breaks KVM's basic assumption that EPT mapping size
    should not exceed the backend folio size).

(3) guest_memfd is incapable of merging folios to huge for
    shared-to-private conversions.

(4) Huge private mappings are unnecessarily disabled when the HVA is not
    2MB-aligned, even though shared pages can only be mapped at 4KB.

So, this series also depends on the four fixup patches included in [4]:

[FIXUP] KVM: guest_memfd: Allow gmem slot lpage even with non-aligned uaddr
[FIXUP] KVM: guest_memfd: Allow merging folios after to-private conversion
[FIXUP] KVM: guest_memfd: Zap mappings before splitting backend folios
[FIXUP] mm: hugetlb_restructuring: Export hugetlb_restructuring_free_folio()

(lkp sent me some more gmem compilation errors. I ignored them as I didn't
 encounter them with my config and env).


Testing
-------
We currently don't have a QEMU that works with the latest in-place
conversion uABI. We plan to increase testing before asking for merging.
This revision is tested via TDX selftests (enhanced to support the in-place
conversion uABI).

Please check the TDX selftests included in tree [4] (under section "Not for
upstream"[9]) that work with the in-place conversion uABI, specifically the
selftest tdx_vm_huge_page for huge page testing.

These selftests require running under "modprobe kvm vm_memory_attributes=N".

The 2MB mapping count can be checked via "/sys/kernel/debug/kvm/pages_2m".

Note #1: Since TDX does not yet enable in-place copy of init memory
         regions, userspace needs to follow the sequence below for init
         memory regions to work (also shown in commit "KVM: selftests: TDX:
         Switch init mem to gmem with MMAP flag" [10] in tree [4]):

         1) Create guest_memfd with the MMAP flag for the init memory
            region.
         2) Set the GFNs for the init memory region to shared and copy
            initial data to the mmap'ed HVA of the guest_memfd.
         3) Create a separate temporary shared memory backend for the init
            memory region, and copy initial data from the guest_memfd HVA
            to the temporary shared memory backend.
         4) Convert the GFNs for the init memory region to private.
         5) Invoke ioctl KVM_TDX_INIT_MEM_REGION, passing the HVA of the
            temporary shared memory backend as the source addr and the
            GPAs of the init memory region.
         6) Free the temporary shared memory backend.

Note #2: This series disables TDX huge pages on TDX modules without the
         feature TDX_FEATURES0.ENHANCED_DEMOTE_INTERRUPTIBILITY. For
         testing TDX huge pages on those unsupported TDX modules (i.e.,
         before TDX_1.5.28.00.972), please cherry-pick the workaround
         patch "x86/virt/tdx: Loop for TDX_INTERRUPTED_RESTARTABLE in
         tdh_mem_page_demote()" [11] contained in [4].

Thanks
Yan

[0] RFC v2: https://lore.kernel.org/all/20250807093950.4395-1-yan.y.zhao@intel.com
[1] hugetlb-based gmem: https://github.com/googleprodkernel/linux-cc/tree/wip-gmem-conversions-hugetlb-restructuring-12-08-25
[2] gmem-population rework v2: https://lore.kernel.org/all/20251215153411.3613928-1-michael.roth@amd.com
[3] DPAMT v4: https://lore.kernel.org/kvm/20251121005125.417831-1-rick.p.edgecombe@intel.com
[4] kernel full stack: https://github.com/intel-staging/tdx/tree/huge_page_v3
[5] https://lore.kernel.org/all/aF0Kg8FcHVMvsqSo@yzhao56-desk.sh.intel.com
[6] https://lore.kernel.org/all/aGSoDnODoG2%2FpbYn@yzhao56-desk.sh.intel.com
[7] https://lore.kernel.org/all/CAGtprH9vdpAGDNtzje=7faHBQc9qTSF2fUEGcbCkfJehFuP-rw@mail.gmail.com
[8] https://github.com/intel-staging/tdx/commit/a8aedac2df44e29247773db3444bc65f7100daa1
[9] https://github.com/intel-staging/tdx/commit/8747667feb0b37daabcaee7132c398f9e62a6edd
[10] https://github.com/intel-staging/tdx/commit/ab29a85ec2072393ab268e231c97f07833853d0d
[11] https://github.com/intel-staging/tdx/commit/4feb6bf371f3a747b71fc9f4ded25261e66b8895

Edgecombe, Rick P (1):
  KVM: x86/mmu: Disallow page merging (huge page adjustment) for mirror
    root

Isaku Yamahata (1):
  KVM: x86/tdp_mmu: Alloc external_spt page for mirror page table
    splitting

Kiryl Shutsemau (4):
  KVM: TDX: Get/Put DPAMT page pair only when mapping size is 4KB
  KVM: TDX: Add/Remove DPAMT pages for the new S-EPT page for splitting
  x86/tdx: Add/Remove DPAMT pages for guest private memory to demote
  x86/tdx: Pass guest memory's PFN info to demote for updating
    pamt_refcount

Xiaoyao Li (1):
  x86/virt/tdx: Add SEAMCALL wrapper tdh_mem_page_demote()

Yan Zhao (17):
  x86/tdx: Enhance tdh_mem_page_aug() to support huge pages
  x86/tdx: Enhance tdh_phymem_page_wbinvd_hkid() to invalidate huge
    pages
  x86/tdx: Introduce tdx_quirk_reset_folio() to reset private huge pages
  x86/virt/tdx: Enhance tdh_phymem_page_reclaim() to support huge pages
  KVM: x86/tdp_mmu: Introduce split_external_spte() under write mmu_lock
  KVM: TDX: Enable huge page splitting under write mmu_lock
  KVM: x86: Reject splitting huge pages under shared mmu_lock in TDX
  KVM: x86/mmu: Introduce kvm_split_cross_boundary_leafs()
  KVM: x86: Introduce hugepage_set_guest_inhibit()
  KVM: TDX: Honor the guest's accept level contained in an EPT violation
  KVM: Change the return type of gfn_handler_t() from bool to int
  KVM: x86: Split cross-boundary mirror leafs for
    KVM_SET_MEMORY_ATTRIBUTES
  KVM: guest_memfd: Split for punch hole and private-to-shared
    conversion
  x86/virt/tdx: Add loud warning when tdx_pamt_put() fails.
  KVM: x86: Introduce per-VM external cache for splitting
  KVM: TDX: Implement per-VM external cache for splitting in TDX
  KVM: TDX: Turn on PG_LEVEL_2M

 arch/arm64/kvm/mmu.c               |   8 +-
 arch/loongarch/kvm/mmu.c           |   8 +-
 arch/mips/kvm/mmu.c                |   6 +-
 arch/powerpc/kvm/book3s.c          |   4 +-
 arch/powerpc/kvm/e500_mmu_host.c   |   8 +-
 arch/riscv/kvm/mmu.c               |  12 +-
 arch/x86/include/asm/kvm-x86-ops.h |   4 +
 arch/x86/include/asm/kvm_host.h    |  22 +++
 arch/x86/include/asm/tdx.h         |  21 +-
 arch/x86/kvm/mmu.h                 |   3 +
 arch/x86/kvm/mmu/mmu.c             |  90 +++++++--
 arch/x86/kvm/mmu/tdp_mmu.c         | 204 +++++++++++++++++---
 arch/x86/kvm/mmu/tdp_mmu.h         |   3 +
 arch/x86/kvm/vmx/tdx.c             | 299 ++++++++++++++++++++++++++---
 arch/x86/kvm/vmx/tdx.h             |   5 +
 arch/x86/kvm/vmx/tdx_arch.h        |   3 +
 arch/x86/virt/vmx/tdx/tdx.c        | 164 +++++++++++++---
 arch/x86/virt/vmx/tdx/tdx.h        |   1 +
 include/linux/kvm_host.h           |  14 +-
 virt/kvm/guest_memfd.c             |  67 +++++++
 virt/kvm/kvm_main.c                |  44 +++--
 21 files changed, 851 insertions(+), 139 deletions(-)

-- 
2.43.2


^ permalink raw reply	[flat|nested] 127+ messages in thread

* [PATCH v3 01/24] x86/tdx: Enhance tdh_mem_page_aug() to support huge pages
  2026-01-06 10:16 [PATCH v3 00/24] KVM: TDX huge page support for private memory Yan Zhao
@ 2026-01-06 10:18 ` Yan Zhao
  2026-01-06 21:08   ` Dave Hansen
  2026-01-06 10:18 ` [PATCH v3 02/24] x86/virt/tdx: Add SEAMCALL wrapper tdh_mem_page_demote() Yan Zhao
                   ` (24 subsequent siblings)
  25 siblings, 1 reply; 127+ messages in thread
From: Yan Zhao @ 2026-01-06 10:18 UTC (permalink / raw)
  To: pbonzini, seanjc
  Cc: linux-kernel, kvm, x86, rick.p.edgecombe, dave.hansen, kas, tabba,
	ackerleytng, michael.roth, david, vannapurve, sagis, vbabka,
	thomas.lendacky, nik.borisov, pgonda, fan.du, jun.miao,
	francescolavra.fl, jgross, ira.weiny, isaku.yamahata, xiaoyao.li,
	kai.huang, binbin.wu, chao.p.peng, chao.gao, yan.y.zhao

Enhance the SEAMCALL wrapper tdh_mem_page_aug() to support huge pages.

The SEAMCALL TDH_MEM_PAGE_AUG currently supports adding physical memory to
the S-EPT up to 2MB in size.

While keeping the "level" parameter in the tdh_mem_page_aug() wrapper to
allow callers to specify the physical memory size, introduce the parameters
"folio" and "start_idx" to specify the physical memory starting from the
page at "start_idx" within the "folio". The specified physical memory must
be fully contained within a single folio.

Invoke tdx_clflush_page() for each 4KB segment of the physical memory
being added. tdx_clflush_page() performs CLFLUSH operations conservatively
to prevent dirty cache lines from being written back later and corrupting
TD memory.

Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
---
v3:
- nth_page() --> folio_page(). (Kai, Dave)
- Rebased on top of DPAMT v4.

RFC v2:
- Refine patch log. (Rick)
- Removed the level checking. (Kirill, Chao Gao)
- Use "folio", and "start_idx" rather than "page".
- Return TDX_OPERAND_INVALID if the specified physical memory is not
  contained within a single folio.
- Use PTE_SHIFT to replace the 9 in "1 << (level * 9)" (Kirill)
- Use C99-style definition of variables inside a loop. (Nikolay Borisov)

RFC v1:
- Rebased to new tdh_mem_page_aug() with "struct page *" as param.
- Check folio, folio_page_idx.
---
 arch/x86/include/asm/tdx.h  |  3 ++-
 arch/x86/kvm/vmx/tdx.c      |  5 +++--
 arch/x86/virt/vmx/tdx/tdx.c | 13 ++++++++++---
 3 files changed, 15 insertions(+), 6 deletions(-)

diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index 8c0c548f9735..f92850789193 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -235,7 +235,8 @@ u64 tdh_mng_addcx(struct tdx_td *td, struct page *tdcs_page);
 u64 tdh_mem_page_add(struct tdx_td *td, u64 gpa, struct page *page, struct page *source, u64 *ext_err1, u64 *ext_err2);
 u64 tdh_mem_sept_add(struct tdx_td *td, u64 gpa, int level, struct page *page, u64 *ext_err1, u64 *ext_err2);
 u64 tdh_vp_addcx(struct tdx_vp *vp, struct page *tdcx_page);
-u64 tdh_mem_page_aug(struct tdx_td *td, u64 gpa, int level, struct page *page, u64 *ext_err1, u64 *ext_err2);
+u64 tdh_mem_page_aug(struct tdx_td *td, u64 gpa, int level, struct folio *folio,
+		     unsigned long start_idx, u64 *ext_err1, u64 *ext_err2);
 u64 tdh_mem_range_block(struct tdx_td *td, u64 gpa, int level, u64 *ext_err1, u64 *ext_err2);
 u64 tdh_mng_key_config(struct tdx_td *td);
 u64 tdh_mng_create(struct tdx_td *td, u16 hkid);
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 98ff84bc83f2..2f03c51515b9 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1679,12 +1679,13 @@ static int tdx_mem_page_aug(struct kvm *kvm, gfn_t gfn,
 	int tdx_level = pg_level_to_tdx_sept_level(level);
 	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
 	struct page *page = pfn_to_page(pfn);
+	struct folio *folio = page_folio(page);
 	gpa_t gpa = gfn_to_gpa(gfn);
 	u64 entry, level_state;
 	u64 err;
 
-	err = tdh_mem_page_aug(&kvm_tdx->td, gpa, tdx_level, page, &entry, &level_state);
-
+	err = tdh_mem_page_aug(&kvm_tdx->td, gpa, tdx_level, folio,
+			       folio_page_idx(folio, page), &entry, &level_state);
 	if (unlikely(IS_TDX_OPERAND_BUSY(err)))
 		return -EBUSY;
 
diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index b0b33f606c11..41ce18619ffc 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -1743,16 +1743,23 @@ u64 tdh_vp_addcx(struct tdx_vp *vp, struct page *tdcx_page)
 }
 EXPORT_SYMBOL_GPL(tdh_vp_addcx);
 
-u64 tdh_mem_page_aug(struct tdx_td *td, u64 gpa, int level, struct page *page, u64 *ext_err1, u64 *ext_err2)
+u64 tdh_mem_page_aug(struct tdx_td *td, u64 gpa, int level, struct folio *folio,
+		     unsigned long start_idx, u64 *ext_err1, u64 *ext_err2)
 {
 	struct tdx_module_args args = {
 		.rcx = gpa | level,
 		.rdx = tdx_tdr_pa(td),
-		.r8 = page_to_phys(page),
+		.r8 = page_to_phys(folio_page(folio, start_idx)),
 	};
+	unsigned long npages = 1 << (level * PTE_SHIFT);
 	u64 ret;
 
-	tdx_clflush_page(page);
+	if (start_idx + npages > folio_nr_pages(folio))
+		return TDX_OPERAND_INVALID;
+
+	for (int i = 0; i < npages; i++)
+		tdx_clflush_page(folio_page(folio, start_idx + i));
+
 	ret = seamcall_ret(TDH_MEM_PAGE_AUG, &args);
 
 	*ext_err1 = args.rcx;
-- 
2.43.2



* [PATCH v3 02/24] x86/virt/tdx: Add SEAMCALL wrapper tdh_mem_page_demote()
  2026-01-06 10:16 [PATCH v3 00/24] KVM: TDX huge page support for private memory Yan Zhao
  2026-01-06 10:18 ` [PATCH v3 01/24] x86/tdx: Enhance tdh_mem_page_aug() to support huge pages Yan Zhao
@ 2026-01-06 10:18 ` Yan Zhao
  2026-01-16  1:00   ` Huang, Kai
                     ` (2 more replies)
  2026-01-06 10:19 ` [PATCH v3 03/24] x86/tdx: Enhance tdh_phymem_page_wbinvd_hkid() to invalidate huge pages Yan Zhao
                   ` (23 subsequent siblings)
  25 siblings, 3 replies; 127+ messages in thread
From: Yan Zhao @ 2026-01-06 10:18 UTC (permalink / raw)
  To: pbonzini, seanjc
  Cc: linux-kernel, kvm, x86, rick.p.edgecombe, dave.hansen, kas, tabba,
	ackerleytng, michael.roth, david, vannapurve, sagis, vbabka,
	thomas.lendacky, nik.borisov, pgonda, fan.du, jun.miao,
	francescolavra.fl, jgross, ira.weiny, isaku.yamahata, xiaoyao.li,
	kai.huang, binbin.wu, chao.p.peng, chao.gao, yan.y.zhao

From: Xiaoyao Li <xiaoyao.li@intel.com>

Introduce SEAMCALL wrapper tdh_mem_page_demote() to invoke
TDH_MEM_PAGE_DEMOTE, which splits a 2MB or a 1GB mapping in S-EPT into
512 4KB or 2MB mappings respectively.

SEAMCALL TDH_MEM_PAGE_DEMOTE walks the S-EPT to locate the huge mapping to
split and adds a new S-EPT page to hold the 512 smaller mappings.

Parameters "gpa" and "level" specify the huge mapping to split, and
parameter "new_sept_page" specifies the 4KB page to be added as the S-EPT
page. Conservatively invoke tdx_clflush_page() on the new S-EPT page before
adding it, to prevent dirty cache lines from being written back later and
corrupting TD memory.

tdh_mem_page_demote() may fail, e.g., due to an S-EPT walk error. Callers
must check the function's return value and can retrieve the extended error
info from the output parameters "ext_err1" and "ext_err2".

The TDX module has many internal locks. To avoid staying in SEAM mode for
too long, SEAMCALLs return a BUSY error code to the kernel instead of
spinning on the locks. Depending on the specific SEAMCALL, the caller may
need to handle this error in specific ways (e.g., retry). Therefore, return
the SEAMCALL error code directly to the caller without attempting to handle
it in the core kernel.

Enable tdh_mem_page_demote() only on TDX modules that support the feature
TDX_FEATURES0.ENHANCE_DEMOTE_INTERRUPTIBILITY, which guarantees that the
error TDX_INTERRUPTED_RESTARTABLE is not returned on basic TDX (i.e.,
without TD partitioning) [2].

This is because the error TDX_INTERRUPTED_RESTARTABLE is difficult to
handle. The TDX module provides no guaranteed maximum retry count to ensure
forward progress of the demotion. Interrupt storms could then result in a
DoS if the host simply retries endlessly on TDX_INTERRUPTED_RESTARTABLE.
Disabling interrupts before invoking the SEAMCALL doesn't work either,
because NMIs can also trigger TDX_INTERRUPTED_RESTARTABLE. Therefore, the
tradeoff for basic TDX is to disable the TDX_INTERRUPTED_RESTARTABLE error,
given demotion's reasonable execution time. [1]

Link: https://lore.kernel.org/kvm/99f5585d759328db973403be0713f68e492b492a.camel@intel.com [1]
Link: https://lore.kernel.org/all/fbf04b09f13bc2ce004ac97ee9c1f2c965f44fdf.camel@intel.com [2]
Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
---
v3:
- Use a var name that clearly tells that the page is used as a page table
  page. (Binbin).
- Check if TDX module supports feature ENHANCE_DEMOTE_INTERRUPTIBILITY.
  (Kai).

RFC v2:
- Refine the patch log (Rick).
- Do not handle TDX_INTERRUPTED_RESTARTABLE as the new TDX modules in
  planning do not check interrupts for basic TDX.

RFC v1:
- Rebased and split patch. Updated patch log.
---
 arch/x86/include/asm/tdx.h  |  8 ++++++++
 arch/x86/virt/vmx/tdx/tdx.c | 24 ++++++++++++++++++++++++
 arch/x86/virt/vmx/tdx/tdx.h |  1 +
 3 files changed, 33 insertions(+)

diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index f92850789193..d1891e099d42 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -15,6 +15,7 @@
 /* Bit definitions of TDX_FEATURES0 metadata field */
 #define TDX_FEATURES0_NO_RBP_MOD		BIT_ULL(18)
 #define TDX_FEATURES0_DYNAMIC_PAMT		BIT_ULL(36)
+#define TDX_FEATURES0_ENHANCE_DEMOTE_INTERRUPTIBILITY	BIT_ULL(51)
 
 #ifndef __ASSEMBLER__
 
@@ -140,6 +141,11 @@ static inline bool tdx_supports_dynamic_pamt(const struct tdx_sys_info *sysinfo)
 	return sysinfo->features.tdx_features0 & TDX_FEATURES0_DYNAMIC_PAMT;
 }
 
+static inline bool tdx_supports_demote_nointerrupt(const struct tdx_sys_info *sysinfo)
+{
+	return sysinfo->features.tdx_features0 & TDX_FEATURES0_ENHANCE_DEMOTE_INTERRUPTIBILITY;
+}
+
 void tdx_quirk_reset_page(struct page *page);
 
 int tdx_guest_keyid_alloc(void);
@@ -242,6 +248,8 @@ u64 tdh_mng_key_config(struct tdx_td *td);
 u64 tdh_mng_create(struct tdx_td *td, u16 hkid);
 u64 tdh_vp_create(struct tdx_td *td, struct tdx_vp *vp);
 u64 tdh_mng_rd(struct tdx_td *td, u64 field, u64 *data);
+u64 tdh_mem_page_demote(struct tdx_td *td, u64 gpa, int level, struct page *new_sept_page,
+			u64 *ext_err1, u64 *ext_err2);
 u64 tdh_mr_extend(struct tdx_td *td, u64 gpa, u64 *ext_err1, u64 *ext_err2);
 u64 tdh_mr_finalize(struct tdx_td *td);
 u64 tdh_vp_flush(struct tdx_vp *vp);
diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index 41ce18619ffc..c3f4457816c8 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -1837,6 +1837,30 @@ u64 tdh_mng_rd(struct tdx_td *td, u64 field, u64 *data)
 }
 EXPORT_SYMBOL_GPL(tdh_mng_rd);
 
+u64 tdh_mem_page_demote(struct tdx_td *td, u64 gpa, int level, struct page *new_sept_page,
+			u64 *ext_err1, u64 *ext_err2)
+{
+	struct tdx_module_args args = {
+		.rcx = gpa | level,
+		.rdx = tdx_tdr_pa(td),
+		.r8 = page_to_phys(new_sept_page),
+	};
+	u64 ret;
+
+	if (!tdx_supports_demote_nointerrupt(&tdx_sysinfo))
+		return TDX_SW_ERROR;
+
+	/* Flush the new S-EPT page to be added */
+	tdx_clflush_page(new_sept_page);
+	ret = seamcall_ret(TDH_MEM_PAGE_DEMOTE, &args);
+
+	*ext_err1 = args.rcx;
+	*ext_err2 = args.rdx;
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(tdh_mem_page_demote);
+
 u64 tdh_mr_extend(struct tdx_td *td, u64 gpa, u64 *ext_err1, u64 *ext_err2)
 {
 	struct tdx_module_args args = {
diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h
index 096c78a1d438..a6c0fa53ece9 100644
--- a/arch/x86/virt/vmx/tdx/tdx.h
+++ b/arch/x86/virt/vmx/tdx/tdx.h
@@ -24,6 +24,7 @@
 #define TDH_MNG_KEY_CONFIG		8
 #define TDH_MNG_CREATE			9
 #define TDH_MNG_RD			11
+#define TDH_MEM_PAGE_DEMOTE		15
 #define TDH_MR_EXTEND			16
 #define TDH_MR_FINALIZE			17
 #define TDH_VP_FLUSH			18
-- 
2.43.2



* [PATCH v3 03/24] x86/tdx: Enhance tdh_phymem_page_wbinvd_hkid() to invalidate huge pages
  2026-01-06 10:16 [PATCH v3 00/24] KVM: TDX huge page support for private memory Yan Zhao
  2026-01-06 10:18 ` [PATCH v3 01/24] x86/tdx: Enhance tdh_mem_page_aug() to support huge pages Yan Zhao
  2026-01-06 10:18 ` [PATCH v3 02/24] x86/virt/tdx: Add SEAMCALL wrapper tdh_mem_page_demote() Yan Zhao
@ 2026-01-06 10:19 ` Yan Zhao
  2026-01-06 10:19 ` [PATCH v3 04/24] x86/tdx: Introduce tdx_quirk_reset_folio() to reset private " Yan Zhao
                   ` (22 subsequent siblings)
  25 siblings, 0 replies; 127+ messages in thread
From: Yan Zhao @ 2026-01-06 10:19 UTC (permalink / raw)
  To: pbonzini, seanjc
  Cc: linux-kernel, kvm, x86, rick.p.edgecombe, dave.hansen, kas, tabba,
	ackerleytng, michael.roth, david, vannapurve, sagis, vbabka,
	thomas.lendacky, nik.borisov, pgonda, fan.du, jun.miao,
	francescolavra.fl, jgross, ira.weiny, isaku.yamahata, xiaoyao.li,
	kai.huang, binbin.wu, chao.p.peng, chao.gao, yan.y.zhao

After removing a TD's private page, the TDX module does not write back and
invalidate cache lines associated with the page and its keyID (i.e., the
TD's guest keyID). The SEAMCALL wrapper tdh_phymem_page_wbinvd_hkid()
enables the caller to provide the TD's guest keyID and physical memory
address to invoke the SEAMCALL TDH_PHYMEM_PAGE_WBINVD to perform cache line
invalidation.

Enhance the SEAMCALL wrapper tdh_phymem_page_wbinvd_hkid() to support cache
line invalidation for huge pages by introducing the parameters "folio",
"start_idx", and "npages". These parameters specify the physical memory
starting from the page at "start_idx" within a "folio" and spanning
"npages" contiguous PFNs. Return TDX_OPERAND_INVALID if the specified
memory is not entirely contained within a single folio.

Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Suggested-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
---
v3:
- nth_page() --> folio_page(). (Kai, Dave)
- Rebased on top of Sean's cleanup series.

RFC v2:
- Enhance tdh_phymem_page_wbinvd_hkid() to invalidate multiple pages
  directly, rather than looping within KVM, following Dave's suggestion:
  "Don't wrap the wrappers." (Rick).

RFC v1:
- Split patch
- Added a helper tdx_wbinvd_page() in TDX, which accepts param
  "struct page *".
---
 arch/x86/include/asm/tdx.h  |  4 ++--
 arch/x86/kvm/vmx/tdx.c      |  5 ++++-
 arch/x86/virt/vmx/tdx/tdx.c | 19 +++++++++++++++----
 3 files changed, 21 insertions(+), 7 deletions(-)

diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index d1891e099d42..7f72fd07f4e5 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -264,8 +264,8 @@ u64 tdh_mem_track(struct tdx_td *tdr);
 u64 tdh_mem_page_remove(struct tdx_td *td, u64 gpa, u64 level, u64 *ext_err1, u64 *ext_err2);
 u64 tdh_phymem_cache_wb(bool resume);
 u64 tdh_phymem_page_wbinvd_tdr(struct tdx_td *td);
-u64 tdh_phymem_page_wbinvd_hkid(u64 hkid, struct page *page);
-
+u64 tdh_phymem_page_wbinvd_hkid(u64 hkid, struct folio *folio,
+				unsigned long start_idx, unsigned long npages);
 void tdx_meminfo(struct seq_file *m);
 #else
 static inline void tdx_init(void) { }
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 2f03c51515b9..b369f90dbafa 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1857,6 +1857,7 @@ static void tdx_sept_remove_private_spte(struct kvm *kvm, gfn_t gfn,
 	struct page *page = pfn_to_page(spte_to_pfn(mirror_spte));
 	int tdx_level = pg_level_to_tdx_sept_level(level);
 	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
+	struct folio *folio = page_folio(page);
 	gpa_t gpa = gfn_to_gpa(gfn);
 	u64 err, entry, level_state;
 
@@ -1895,7 +1896,9 @@ static void tdx_sept_remove_private_spte(struct kvm *kvm, gfn_t gfn,
 	if (TDX_BUG_ON_2(err, TDH_MEM_PAGE_REMOVE, entry, level_state, kvm))
 		return;
 
-	err = tdh_phymem_page_wbinvd_hkid((u16)kvm_tdx->hkid, page);
+	err = tdh_phymem_page_wbinvd_hkid((u16)kvm_tdx->hkid, folio,
+					  folio_page_idx(folio, page),
+					  KVM_PAGES_PER_HPAGE(level));
 	if (TDX_BUG_ON(err, TDH_PHYMEM_PAGE_WBINVD, kvm))
 		return;
 
diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index c3f4457816c8..b57e00c71384 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -2046,13 +2046,24 @@ u64 tdh_phymem_page_wbinvd_tdr(struct tdx_td *td)
 }
 EXPORT_SYMBOL_GPL(tdh_phymem_page_wbinvd_tdr);
 
-u64 tdh_phymem_page_wbinvd_hkid(u64 hkid, struct page *page)
+u64 tdh_phymem_page_wbinvd_hkid(u64 hkid, struct folio *folio,
+				unsigned long start_idx, unsigned long npages)
 {
-	struct tdx_module_args args = {};
+	u64 err = 0;
 
-	args.rcx = mk_keyed_paddr(hkid, page);
+	if (start_idx + npages > folio_nr_pages(folio))
+		return TDX_OPERAND_INVALID;
 
-	return seamcall(TDH_PHYMEM_PAGE_WBINVD, &args);
+	for (unsigned long i = 0; i < npages; i++) {
+		struct page *p = folio_page(folio, start_idx + i);
+		struct tdx_module_args args = {};
+
+		args.rcx = mk_keyed_paddr(hkid, p);
+		err = seamcall(TDH_PHYMEM_PAGE_WBINVD, &args);
+		if (err)
+			break;
+	}
+	return err;
 }
 EXPORT_SYMBOL_GPL(tdh_phymem_page_wbinvd_hkid);
 
-- 
2.43.2



* [PATCH v3 04/24] x86/tdx: Introduce tdx_quirk_reset_folio() to reset private huge pages
  2026-01-06 10:16 [PATCH v3 00/24] KVM: TDX huge page support for private memory Yan Zhao
                   ` (2 preceding siblings ...)
  2026-01-06 10:19 ` [PATCH v3 03/24] x86/tdx: Enhance tdh_phymem_page_wbinvd_hkid() to invalidate huge pages Yan Zhao
@ 2026-01-06 10:19 ` Yan Zhao
  2026-01-06 10:20 ` [PATCH v3 05/24] x86/virt/tdx: Enhance tdh_phymem_page_reclaim() to support " Yan Zhao
                   ` (21 subsequent siblings)
  25 siblings, 0 replies; 127+ messages in thread
From: Yan Zhao @ 2026-01-06 10:19 UTC (permalink / raw)
  To: pbonzini, seanjc
  Cc: linux-kernel, kvm, x86, rick.p.edgecombe, dave.hansen, kas, tabba,
	ackerleytng, michael.roth, david, vannapurve, sagis, vbabka,
	thomas.lendacky, nik.borisov, pgonda, fan.du, jun.miao,
	francescolavra.fl, jgross, ira.weiny, isaku.yamahata, xiaoyao.li,
	kai.huang, binbin.wu, chao.p.peng, chao.gao, yan.y.zhao

After removing or reclaiming a guest private page or a control page from a
TD, zero the physical page using movdir64b() to enable the kernel to reuse
it. This is needed on systems with the X86_BUG_TDX_PW_MCE erratum.

Introduce the function tdx_quirk_reset_folio() to invoke
tdx_quirk_reset_paddr() to convert pages in a huge folio from private back
to normal. The pages start from the page at "start_idx" within a "folio",
spanning "npages" contiguous PFNs.

Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
---
v3:
- Rebased to Sean's cleanup series.
- tdx_clear_folio() --> tdx_quirk_reset_folio().

RFC v2:
- Add tdx_clear_folio().
- Drop inner loop _tdx_clear_page() and move __mb() outside of the loop.
  (Rick)
- Use C99-style definition of variables inside a for loop.
- Note: [1] also changes tdx_clear_page(). RFC v2 is not based on [1] now.

[1] https://lore.kernel.org/all/20250724130354.79392-2-adrian.hunter@intel.com

RFC v1:
- split out, let tdx_clear_page() accept level.
---
 arch/x86/include/asm/tdx.h  |  2 ++
 arch/x86/kvm/vmx/tdx.c      |  3 ++-
 arch/x86/virt/vmx/tdx/tdx.c | 11 +++++++++++
 3 files changed, 15 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index 7f72fd07f4e5..669dd6d99821 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -147,6 +147,8 @@ static inline bool tdx_supports_demote_nointerrupt(const struct tdx_sys_info *sy
 }
 
 void tdx_quirk_reset_page(struct page *page);
+void tdx_quirk_reset_folio(struct folio *folio, unsigned long start_idx,
+			   unsigned long npages);
 
 int tdx_guest_keyid_alloc(void);
 u32 tdx_get_nr_guest_keyids(void);
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index b369f90dbafa..5b499593edff 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1902,7 +1902,8 @@ static void tdx_sept_remove_private_spte(struct kvm *kvm, gfn_t gfn,
 	if (TDX_BUG_ON(err, TDH_PHYMEM_PAGE_WBINVD, kvm))
 		return;
 
-	tdx_quirk_reset_page(page);
+	tdx_quirk_reset_folio(folio, folio_page_idx(folio, page),
+			      KVM_PAGES_PER_HPAGE(level));
 	tdx_pamt_put(page);
 }
 
diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index b57e00c71384..20708f56b1a0 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -800,6 +800,17 @@ static void tdx_quirk_reset_paddr(unsigned long base, unsigned long size)
 	mb();
 }
 
+void tdx_quirk_reset_folio(struct folio *folio, unsigned long start_idx,
+			   unsigned long npages)
+{
+	if (WARN_ON_ONCE(start_idx + npages > folio_nr_pages(folio)))
+		return;
+
+	tdx_quirk_reset_paddr(page_to_phys(folio_page(folio, start_idx)),
+			      npages << PAGE_SHIFT);
+}
+EXPORT_SYMBOL_GPL(tdx_quirk_reset_folio);
+
 void tdx_quirk_reset_page(struct page *page)
 {
 	tdx_quirk_reset_paddr(page_to_phys(page), PAGE_SIZE);
-- 
2.43.2



* [PATCH v3 05/24] x86/virt/tdx: Enhance tdh_phymem_page_reclaim() to support huge pages
  2026-01-06 10:16 [PATCH v3 00/24] KVM: TDX huge page support for private memory Yan Zhao
                   ` (3 preceding siblings ...)
  2026-01-06 10:19 ` [PATCH v3 04/24] x86/tdx: Introduce tdx_quirk_reset_folio() to reset private " Yan Zhao
@ 2026-01-06 10:20 ` Yan Zhao
  2026-01-06 10:20 ` [PATCH v3 06/24] KVM: x86/mmu: Disallow page merging (huge page adjustment) for mirror root Yan Zhao
                   ` (20 subsequent siblings)
  25 siblings, 0 replies; 127+ messages in thread
From: Yan Zhao @ 2026-01-06 10:20 UTC (permalink / raw)
  To: pbonzini, seanjc
  Cc: linux-kernel, kvm, x86, rick.p.edgecombe, dave.hansen, kas, tabba,
	ackerleytng, michael.roth, david, vannapurve, sagis, vbabka,
	thomas.lendacky, nik.borisov, pgonda, fan.du, jun.miao,
	francescolavra.fl, jgross, ira.weiny, isaku.yamahata, xiaoyao.li,
	kai.huang, binbin.wu, chao.p.peng, chao.gao, yan.y.zhao

Enhance the SEAMCALL wrapper tdh_phymem_page_reclaim() to support huge
pages by introducing new parameters: "folio", "start_idx", and "npages".
These parameters specify the physical memory to be reclaimed, starting from
the page at "start_idx" within a folio and spanning "npages" contiguous
PFNs. The specified memory must be entirely contained within a
single folio. Return TDX_SW_ERROR if the size of the reclaimed memory does
not match the specified size.

On the KVM side, introduce tdx_reclaim_folio() to invoke
tdh_phymem_page_reclaim() for reclaiming huge guest private pages. The
"reset" parameter in tdx_reclaim_folio() specifies whether
tdx_quirk_reset_folio() should be subsequently invoked within
tdx_reclaim_folio(). To facilitate reclaiming of 4KB pages, keep function
tdx_reclaim_page() and make it a helper for reclaiming normal TDX control
pages, and introduce a new helper tdx_reclaim_page_noreset() for reclaiming
the TDR page.

Opportunistically, rename rcx, rdx, r8 to tdx_pt, tdx_owner, tdx_size in
tdx_reclaim_folio() to improve readability.

Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
---
v3:
- Rebased to Sean's cleanup series. Dropped invoking tdx_reclaim_folio()
  in tdx_sept_remove_private_spte() because no reclaiming is required in
  that path.
  However, keep introducing tdx_reclaim_folio() as it will be needed when
  the patches of removing guest private memory after releasing HKID are
  merged.
- tdx_reclaim_page_noclear() --> tdx_reclaim_page_noreset() and invoke
  tdx_quirk_reset_folio() instead in tdx_reclaim_folio() due to rebase.
- Check mismatch between the request size and the reclaimed size, and
  return TDX_SW_ERROR only after a successful TDH_PHYMEM_PAGE_RECLAIM.
  (Binbin)

RFC v2:
- Introduce new params "folio", "start_idx" and "npages" to wrapper
  tdh_phymem_page_reclaim().
- Move the checking of return size from KVM to x86/virt and return error.
- Rename tdx_reclaim_page() to tdx_reclaim_folio().
- Add two helper functions tdx_reclaim_page() and tdx_reclaim_page_noclear()
  to facilitate the reclaiming of 4KB pages.

RFC v1:
- Rebased and split patch.
---
 arch/x86/include/asm/tdx.h  |  3 ++-
 arch/x86/kvm/vmx/tdx.c      | 27 +++++++++++++++++----------
 arch/x86/virt/vmx/tdx/tdx.c | 12 ++++++++++--
 3 files changed, 29 insertions(+), 13 deletions(-)

diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index 669dd6d99821..abe484045132 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -261,7 +261,8 @@ u64 tdh_mng_init(struct tdx_td *td, u64 td_params, u64 *extended_err);
 u64 tdh_vp_init(struct tdx_vp *vp, u64 initial_rcx, u32 x2apicid);
 u64 tdh_vp_rd(struct tdx_vp *vp, u64 field, u64 *data);
 u64 tdh_vp_wr(struct tdx_vp *vp, u64 field, u64 data, u64 mask);
-u64 tdh_phymem_page_reclaim(struct page *page, u64 *tdx_pt, u64 *tdx_owner, u64 *tdx_size);
+u64 tdh_phymem_page_reclaim(struct folio *folio, unsigned long start_idx, unsigned long npages,
+			    u64 *tdx_pt, u64 *tdx_owner, u64 *tdx_size);
 u64 tdh_mem_track(struct tdx_td *tdr);
 u64 tdh_mem_page_remove(struct tdx_td *td, u64 gpa, u64 level, u64 *ext_err1, u64 *ext_err2);
 u64 tdh_phymem_cache_wb(bool resume);
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 5b499593edff..405afd2a56b7 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -318,33 +318,40 @@ static inline void tdx_disassociate_vp(struct kvm_vcpu *vcpu)
 })
 
 /* TDH.PHYMEM.PAGE.RECLAIM is allowed only when destroying the TD. */
-static int __tdx_reclaim_page(struct page *page)
+static int tdx_reclaim_folio(struct folio *folio, unsigned long start_idx,
+			     unsigned long npages, bool reset)
 {
-	u64 err, rcx, rdx, r8;
+	u64 err, tdx_pt, tdx_owner, tdx_size;
 
-	err = tdh_phymem_page_reclaim(page, &rcx, &rdx, &r8);
+	err = tdh_phymem_page_reclaim(folio, start_idx, npages, &tdx_pt,
+				      &tdx_owner, &tdx_size);
 
 	/*
 	 * No need to check for TDX_OPERAND_BUSY; all TD pages are freed
 	 * before the HKID is released and control pages have also been
 	 * released at this point, so there is no possibility of contention.
 	 */
-	if (TDX_BUG_ON_3(err, TDH_PHYMEM_PAGE_RECLAIM, rcx, rdx, r8, NULL))
+	if (TDX_BUG_ON_3(err, TDH_PHYMEM_PAGE_RECLAIM, tdx_pt, tdx_owner, tdx_size, NULL))
 		return -EIO;
 
+	if (reset)
+		tdx_quirk_reset_folio(folio, start_idx, npages);
 	return 0;
 }
 
 static int tdx_reclaim_page(struct page *page)
 {
-	int r;
+	struct folio *folio = page_folio(page);
 
-	r = __tdx_reclaim_page(page);
-	if (!r)
-		tdx_quirk_reset_page(page);
-	return r;
+	return tdx_reclaim_folio(folio, folio_page_idx(folio, page), 1, true);
 }
 
+static int tdx_reclaim_page_noreset(struct page *page)
+{
+	struct folio *folio = page_folio(page);
+
+	return tdx_reclaim_folio(folio, folio_page_idx(folio, page), 1, false);
+}
 
 /*
  * Reclaim the TD control page(s) which are crypto-protected by TDX guest's
@@ -583,7 +590,7 @@ static void tdx_reclaim_td_control_pages(struct kvm *kvm)
 	if (!kvm_tdx->td.tdr_page)
 		return;
 
-	if (__tdx_reclaim_page(kvm_tdx->td.tdr_page))
+	if (tdx_reclaim_page_noreset(kvm_tdx->td.tdr_page))
 		return;
 
 	/*
diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index 20708f56b1a0..c12665389b67 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -1993,19 +1993,27 @@ EXPORT_SYMBOL_GPL(tdh_vp_init);
  * So despite the names, they must be interpted specially as described by the spec. Return
  * them only for error reporting purposes.
  */
-u64 tdh_phymem_page_reclaim(struct page *page, u64 *tdx_pt, u64 *tdx_owner, u64 *tdx_size)
+u64 tdh_phymem_page_reclaim(struct folio *folio, unsigned long start_idx,
+			    unsigned long npages, u64 *tdx_pt, u64 *tdx_owner,
+			    u64 *tdx_size)
 {
 	struct tdx_module_args args = {
-		.rcx = page_to_phys(page),
+		.rcx = page_to_phys(folio_page(folio, start_idx)),
 	};
 	u64 ret;
 
+	if (start_idx + npages > folio_nr_pages(folio))
+		return TDX_OPERAND_INVALID;
+
 	ret = seamcall_ret(TDH_PHYMEM_PAGE_RECLAIM, &args);
 
 	*tdx_pt = args.rcx;
 	*tdx_owner = args.rdx;
 	*tdx_size = args.r8;
 
+	if (!ret && npages != (1 << (*tdx_size) * PTE_SHIFT))
+		return TDX_SW_ERROR;
+
 	return ret;
 }
 EXPORT_SYMBOL_GPL(tdh_phymem_page_reclaim);
-- 
2.43.2



* [PATCH v3 06/24] KVM: x86/mmu: Disallow page merging (huge page adjustment) for mirror root
  2026-01-06 10:16 [PATCH v3 00/24] KVM: TDX huge page support for private memory Yan Zhao
                   ` (4 preceding siblings ...)
  2026-01-06 10:20 ` [PATCH v3 05/24] x86/virt/tdx: Enhance tdh_phymem_page_reclaim() to support " Yan Zhao
@ 2026-01-06 10:20 ` Yan Zhao
  2026-01-15 22:49   ` Sean Christopherson
  2026-01-06 10:20 ` [PATCH v3 07/24] KVM: x86/tdp_mmu: Introduce split_external_spte() under write mmu_lock Yan Zhao
                   ` (19 subsequent siblings)
  25 siblings, 1 reply; 127+ messages in thread
From: Yan Zhao @ 2026-01-06 10:20 UTC (permalink / raw)
  To: pbonzini, seanjc
  Cc: linux-kernel, kvm, x86, rick.p.edgecombe, dave.hansen, kas, tabba,
	ackerleytng, michael.roth, david, vannapurve, sagis, vbabka,
	thomas.lendacky, nik.borisov, pgonda, fan.du, jun.miao,
	francescolavra.fl, jgross, ira.weiny, isaku.yamahata, xiaoyao.li,
	kai.huang, binbin.wu, chao.p.peng, chao.gao, yan.y.zhao

From: "Edgecombe, Rick P" <rick.p.edgecombe@intel.com>

Disallow page merging (huge page adjustment) for the mirror root by
utilizing disallowed_hugepage_adjust().

Make the mirror root check asymmetric with the NX huge page check so as not
to litter the generic MMU code:

Invoke disallowed_hugepage_adjust() in kvm_tdp_mmu_map() when necessary,
specifically when KVM has mirrored TDP or the NX huge page workaround is
enabled.

Check and reduce the goal_level of a fault internally in
disallowed_hugepage_adjust() when the fault is for a mirror root and
there's a shadow present non-leaf entry at the original goal_level.

Signed-off-by: Edgecombe, Rick P <rick.p.edgecombe@intel.com>
Co-developed-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
---
RFC v2:
- Check is_mirror_sp() in disallowed_hugepage_adjust() instead of passing
  in an is_mirror arg. (Rick)
- Check kvm_has_mirrored_tdp() in kvm_tdp_mmu_map() to determine whether
  to invoke disallowed_hugepage_adjust(). (Rick)

RFC v1:
- new patch
---
 arch/x86/kvm/mmu/mmu.c     | 3 ++-
 arch/x86/kvm/mmu/tdp_mmu.c | 4 +++-
 2 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index d2c49d92d25d..b4f2e3ced716 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3418,7 +3418,8 @@ void disallowed_hugepage_adjust(struct kvm_page_fault *fault, u64 spte, int cur_
 	    cur_level == fault->goal_level &&
 	    is_shadow_present_pte(spte) &&
 	    !is_large_pte(spte) &&
-	    spte_to_child_sp(spte)->nx_huge_page_disallowed) {
+	    ((spte_to_child_sp(spte)->nx_huge_page_disallowed) ||
+	     is_mirror_sp(spte_to_child_sp(spte)))) {
 		/*
 		 * A small SPTE exists for this pfn, but FNAME(fetch),
 		 * direct_map(), or kvm_tdp_mmu_map() would like to create a
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 9c26038f6b77..dfa56554f9e0 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1267,6 +1267,8 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 	struct tdp_iter iter;
 	struct kvm_mmu_page *sp;
 	int ret = RET_PF_RETRY;
+	bool hugepage_adjust_disallowed = fault->nx_huge_page_workaround_enabled ||
+					  kvm_has_mirrored_tdp(kvm);
 
 	KVM_MMU_WARN_ON(!root || root->role.invalid);
 
@@ -1279,7 +1281,7 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 	for_each_tdp_pte(iter, kvm, root, fault->gfn, fault->gfn + 1) {
 		int r;
 
-		if (fault->nx_huge_page_workaround_enabled)
+		if (hugepage_adjust_disallowed)
 			disallowed_hugepage_adjust(fault, iter.old_spte, iter.level);
 
 		/*
-- 
2.43.2



* [PATCH v3 07/24] KVM: x86/tdp_mmu: Introduce split_external_spte() under write mmu_lock
  2026-01-06 10:16 [PATCH v3 00/24] KVM: TDX huge page support for private memory Yan Zhao
                   ` (5 preceding siblings ...)
  2026-01-06 10:20 ` [PATCH v3 06/24] KVM: x86/mmu: Disallow page merging (huge page adjustment) for mirror root Yan Zhao
@ 2026-01-06 10:20 ` Yan Zhao
  2026-01-28 22:38   ` Sean Christopherson
  2026-01-06 10:20 ` [PATCH v3 08/24] KVM: TDX: Enable huge page splitting " Yan Zhao
                   ` (18 subsequent siblings)
  25 siblings, 1 reply; 127+ messages in thread
From: Yan Zhao @ 2026-01-06 10:20 UTC (permalink / raw)
  To: pbonzini, seanjc
  Cc: linux-kernel, kvm, x86, rick.p.edgecombe, dave.hansen, kas, tabba,
	ackerleytng, michael.roth, david, vannapurve, sagis, vbabka,
	thomas.lendacky, nik.borisov, pgonda, fan.du, jun.miao,
	francescolavra.fl, jgross, ira.weiny, isaku.yamahata, xiaoyao.li,
	kai.huang, binbin.wu, chao.p.peng, chao.gao, yan.y.zhao

Introduce kvm_x86_ops.split_external_spte() and wrap it in a helper
function split_external_spte(). Invoke the helper function
split_external_spte() in tdp_mmu_set_spte() to propagate splitting
transitions from the mirror page table to the external page table under
write mmu_lock.

Introduce a new valid transition case for splitting and document all valid
transitions of the mirror page table under write mmu_lock in
tdp_mmu_set_spte().

Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
---
v3:
- Rename split_external_spt() to split_external_spte().

- Pass in param "old_mirror_spte" to hook kvm_x86_ops.split_external_spte().
  This aligns with the parameter change to hook
  kvm_x86_ops.set_external_spte() in Sean's cleanup series, and also allows
  future DPAMT patches to acquire the guest private PFN from the old mirror
  spte.

- Rename param "external_spt" to "new_external_spt" in hook
  kvm_x86_ops.split_external_spte() to indicate this is a new page table
  page for the external page table.

- Drop declaration of get_external_spt() by moving split_external_spte()
  after get_external_spt() but before set_external_spte_present() and
  tdp_mmu_set_spte(). (Kai)

- split_external_spte --> split_external_spte() (Kai)

RFC v2:
- Removed the KVM_BUG_ON() in split_external_spt(). (Rick)
- Add a comment for the KVM_BUG_ON() in tdp_mmu_set_spte(). (Rick)
- Use kvm_x86_call() instead of static_call(). (Binbin)

RFC v1:
- Split patch.
- Dropped invoking hook zap_private_spte and kvm_flush_remote_tlbs() in KVM
  MMU core.
---
 arch/x86/include/asm/kvm-x86-ops.h |  1 +
 arch/x86/include/asm/kvm_host.h    |  4 ++++
 arch/x86/kvm/mmu/tdp_mmu.c         | 29 +++++++++++++++++++++++++----
 3 files changed, 30 insertions(+), 4 deletions(-)

diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
index 58c5c9b082ca..84fa8689b45c 100644
--- a/arch/x86/include/asm/kvm-x86-ops.h
+++ b/arch/x86/include/asm/kvm-x86-ops.h
@@ -98,6 +98,7 @@ KVM_X86_OP_OPTIONAL(link_external_spt)
 KVM_X86_OP_OPTIONAL(set_external_spte)
 KVM_X86_OP_OPTIONAL(free_external_spt)
 KVM_X86_OP_OPTIONAL(remove_external_spte)
+KVM_X86_OP_OPTIONAL(split_external_spte)
 KVM_X86_OP_OPTIONAL(alloc_external_fault_cache)
 KVM_X86_OP_OPTIONAL(topup_external_fault_cache)
 KVM_X86_OP_OPTIONAL(free_external_fault_cache)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 7818da148a8c..56089d6b9b51 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1848,6 +1848,10 @@ struct kvm_x86_ops {
 	void (*remove_external_spte)(struct kvm *kvm, gfn_t gfn, enum pg_level level,
 				     u64 mirror_spte);
 
+	/* Split a huge mapping into smaller mappings in external page table */
+	int (*split_external_spte)(struct kvm *kvm, gfn_t gfn, enum pg_level level,
+				   u64 old_mirror_spte, void *new_external_spt);
+
 	/* Allocation a pages from the external page cache. */
 	void *(*alloc_external_fault_cache)(struct kvm_vcpu *vcpu);
 
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index dfa56554f9e0..977914b2627f 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -508,6 +508,19 @@ static void *get_external_spt(gfn_t gfn, u64 new_spte, int level)
 	return NULL;
 }
 
+static int split_external_spte(struct kvm *kvm, gfn_t gfn, u64 old_spte,
+			       u64 new_spte, int level)
+{
+	void *new_external_spt = get_external_spt(gfn, new_spte, level);
+	int ret;
+
+	KVM_BUG_ON(!new_external_spt, kvm);
+
+	ret = kvm_x86_call(split_external_spte)(kvm, gfn, level, old_spte,
+						new_external_spt);
+	return ret;
+}
+
 static int __must_check set_external_spte_present(struct kvm *kvm, tdp_ptep_t sptep,
 						 gfn_t gfn, u64 old_spte,
 						 u64 new_spte, int level)
@@ -758,12 +771,20 @@ static u64 tdp_mmu_set_spte(struct kvm *kvm, int as_id, tdp_ptep_t sptep,
 	handle_changed_spte(kvm, as_id, gfn, old_spte, new_spte, level, false);
 
 	/*
-	 * Users that do non-atomic setting of PTEs don't operate on mirror
-	 * roots, so don't handle it and bug the VM if it's seen.
+	 * Propagate SPTE changes to the external page table under write
+	 * mmu_lock.
+	 * Currently valid transitions:
+	 * - present leaf to !present
+	 * - present non-leaf to !present
+	 * - present leaf to present non-leaf (splitting)
 	 */
 	if (is_mirror_sptep(sptep)) {
-		KVM_BUG_ON(is_shadow_present_pte(new_spte), kvm);
-		remove_external_spte(kvm, gfn, old_spte, level);
+		if (!is_shadow_present_pte(new_spte))
+			remove_external_spte(kvm, gfn, old_spte, level);
+		else if (is_last_spte(old_spte, level) && !is_last_spte(new_spte, level))
+			split_external_spte(kvm, gfn, old_spte, new_spte, level);
+		else
+			KVM_BUG_ON(1, kvm);
 	}
 
 	return old_spte;
-- 
2.43.2


^ permalink raw reply related	[flat|nested] 127+ messages in thread

* [PATCH v3 08/24] KVM: TDX: Enable huge page splitting under write mmu_lock
  2026-01-06 10:16 [PATCH v3 00/24] KVM: TDX huge page support for private memory Yan Zhao
                   ` (6 preceding siblings ...)
  2026-01-06 10:20 ` [PATCH v3 07/24] KVM: x86/tdp_mmu: Introduce split_external_spte() under write mmu_lock Yan Zhao
@ 2026-01-06 10:20 ` Yan Zhao
  2026-01-06 10:21 ` [PATCH v3 09/24] KVM: x86: Reject splitting huge pages under shared mmu_lock in TDX Yan Zhao
                   ` (17 subsequent siblings)
  25 siblings, 0 replies; 127+ messages in thread
From: Yan Zhao @ 2026-01-06 10:20 UTC (permalink / raw)
  To: pbonzini, seanjc
  Cc: linux-kernel, kvm, x86, rick.p.edgecombe, dave.hansen, kas, tabba,
	ackerleytng, michael.roth, david, vannapurve, sagis, vbabka,
	thomas.lendacky, nik.borisov, pgonda, fan.du, jun.miao,
	francescolavra.fl, jgross, ira.weiny, isaku.yamahata, xiaoyao.li,
	kai.huang, binbin.wu, chao.p.peng, chao.gao, yan.y.zhao

Implement kvm_x86_ops.split_external_spte() under TDX to enable huge page
splitting under write mmu_lock.

Invoke tdh_mem_range_block(), tdh_mem_track() (kicking off vCPUs), and
tdh_mem_page_demote() in sequence. All operations are performed with
kvm->mmu_lock held for writing, as in page removal.

Even with kvm->mmu_lock held for writing, tdh_mem_page_demote() may still
contend with tdh_vp_enter() and potentially with the guest's S-EPT entry
operations. Therefore, kick off the other vCPUs and prevent tdh_vp_enter()
from being called on them, which guarantees that the second attempt
succeeds. Use KVM_BUG_ON() for any other unexpected errors.

Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
---
v3:
- Rebased on top of Sean's cleanup series.
- Call out UNBLOCK is not required after DEMOTE. (Kai)
- tdx_sept_split_private_spt() --> tdx_sept_split_private_spte().

RFC v2:
- Split out the code to handle the error TDX_INTERRUPTED_RESTARTABLE.
- Rebased to 6.16.0-rc6 (the way of defining TDX hook changes).

RFC v1:
- Split patch for exclusive mmu_lock only,
- Invoke tdx_sept_zap_private_spte() and tdx_track() for splitting.
- Handled busy error of tdh_mem_page_demote() by kicking off vCPUs.
---
 arch/x86/kvm/vmx/tdx.c | 40 ++++++++++++++++++++++++++++++++++++++++
 1 file changed, 40 insertions(+)

diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 405afd2a56b7..b41793402769 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1914,6 +1914,45 @@ static void tdx_sept_remove_private_spte(struct kvm *kvm, gfn_t gfn,
 	tdx_pamt_put(page);
 }
 
+/*
+ * Split a 2MB huge mapping.
+ *
+ * Invoke "BLOCK + TRACK + kick off vCPUs (inside tdx_track())" since DEMOTE
+ * does not yet support the NON-BLOCKING-RESIZE feature. No UNBLOCK is
+ * needed after a successful DEMOTE.
+ *
+ * Under write mmu_lock, kick off all vCPUs (inside tdh_do_no_vcpus()) to ensure
+ * DEMOTE will succeed on the second invocation if the first invocation returns
+ * BUSY.
+ */
+static int tdx_sept_split_private_spte(struct kvm *kvm, gfn_t gfn, enum pg_level level,
+				       u64 old_mirror_spte, void *new_private_spt)
+{
+	struct page *new_sept_page = virt_to_page(new_private_spt);
+	int tdx_level = pg_level_to_tdx_sept_level(level);
+	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
+	gpa_t gpa = gfn_to_gpa(gfn);
+	u64 err, entry, level_state;
+
+	if (KVM_BUG_ON(kvm_tdx->state != TD_STATE_RUNNABLE ||
+		       level != PG_LEVEL_2M, kvm))
+		return -EIO;
+
+	err = tdh_do_no_vcpus(tdh_mem_range_block, kvm, &kvm_tdx->td, gpa,
+			      tdx_level, &entry, &level_state);
+	if (TDX_BUG_ON_2(err, TDH_MEM_RANGE_BLOCK, entry, level_state, kvm))
+		return -EIO;
+
+	tdx_track(kvm);
+
+	err = tdh_do_no_vcpus(tdh_mem_page_demote, kvm, &kvm_tdx->td, gpa,
+			      tdx_level, new_sept_page, &entry, &level_state);
+	if (TDX_BUG_ON_2(err, TDH_MEM_PAGE_DEMOTE, entry, level_state, kvm))
+		return -EIO;
+
+	return 0;
+}
+
 void tdx_deliver_interrupt(struct kvm_lapic *apic, int delivery_mode,
 			   int trig_mode, int vector)
 {
@@ -3672,6 +3711,7 @@ void __init tdx_hardware_setup(void)
 	vt_x86_ops.set_external_spte = tdx_sept_set_private_spte;
 	vt_x86_ops.free_external_spt = tdx_sept_free_private_spt;
 	vt_x86_ops.remove_external_spte = tdx_sept_remove_private_spte;
+	vt_x86_ops.split_external_spte = tdx_sept_split_private_spte;
 	vt_x86_ops.protected_apic_has_interrupt = tdx_protected_apic_has_interrupt;
 	vt_x86_ops.alloc_external_fault_cache = tdx_alloc_external_fault_cache;
 	vt_x86_ops.topup_external_fault_cache = tdx_topup_external_fault_cache;
-- 
2.43.2


^ permalink raw reply related	[flat|nested] 127+ messages in thread

* [PATCH v3 09/24] KVM: x86: Reject splitting huge pages under shared mmu_lock in TDX
  2026-01-06 10:16 [PATCH v3 00/24] KVM: TDX huge page support for private memory Yan Zhao
                   ` (7 preceding siblings ...)
  2026-01-06 10:20 ` [PATCH v3 08/24] KVM: TDX: Enable huge page splitting " Yan Zhao
@ 2026-01-06 10:21 ` Yan Zhao
  2026-01-06 10:21 ` [PATCH v3 10/24] KVM: x86/tdp_mmu: Alloc external_spt page for mirror page table splitting Yan Zhao
                   ` (16 subsequent siblings)
  25 siblings, 0 replies; 127+ messages in thread
From: Yan Zhao @ 2026-01-06 10:21 UTC (permalink / raw)
  To: pbonzini, seanjc
  Cc: linux-kernel, kvm, x86, rick.p.edgecombe, dave.hansen, kas, tabba,
	ackerleytng, michael.roth, david, vannapurve, sagis, vbabka,
	thomas.lendacky, nik.borisov, pgonda, fan.du, jun.miao,
	francescolavra.fl, jgross, ira.weiny, isaku.yamahata, xiaoyao.li,
	kai.huang, binbin.wu, chao.p.peng, chao.gao, yan.y.zhao

Allow propagating SPTE splitting changes from the mirror page table to the
external page table in the fault path under shared mmu_lock, while
rejecting this splitting request in TDX's implementation of
kvm_x86_ops.split_external_spte().

Allow tdp_mmu_split_huge_page() to be invoked for the mirror page table in
the fault path by removing the KVM_BUG_ON() immediately before it.

set_external_spte_present() is invoked in the fault path under shared
mmu_lock to propagate transitions from the mirror page table to the
external page table when the target SPTE is present. Add "splitting" as a
valid transition case in set_external_spte_present() and invoke the helper
split_external_spte() to perform the propagation.

Pass shared mmu_lock information to kvm_x86_ops.split_external_spte() and
reject the splitting request in TDX's implementation of
kvm_x86_ops.split_external_spte() when under shared mmu_lock.

This is because TDX requires different handling for splitting under shared
versus exclusive mmu_lock: under shared mmu_lock, TDX cannot kick off all
vCPUs to avoid BUSY errors from DEMOTE. Since the current TDX module
(i.e., without feature NON-BLOCKING-RESIZE) requires BLOCK/TRACK/kicking
off vCPUs to be invoked before each DEMOTE, if a BUSY error occurs from
DEMOTE, TDX must call UNBLOCK before returning the error to the KVM MMU
core to roll back the old SPTE and retry. However, UNBLOCK itself may also
fail due to contention.

Rejecting splitting of private huge pages under shared mmu_lock in TDX
rather than using KVM_BUG_ON() in the KVM MMU core allows for splitting
under shared mmu_lock once the TDX module supports the NON-BLOCKING-RESIZE
feature, keeping the KVM MMU core framework stable across TDX module
implementation changes.

Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
---
v3:
- Rebased on top of Sean's cleanup series.
- split_external_spte --> kvm_x86_ops.split_external_spte(). (Kai)

RFC v2:
- WARN_ON_ONCE() and return error in tdx_sept_split_private_spt() if it's
  invoked under shared mmu_lock. (rather than increase the next fault's
  max_level in current vCPU via tdx->violation_gfn_start/end and
  tdx->violation_request_level).
- TODO: Perform the real implementation of demote under shared mmu_lock
        when new version of TDX module supporting non-blocking demote is
        available.

RFC v1:
- New patch.
---
 arch/x86/include/asm/kvm_host.h |  3 +-
 arch/x86/kvm/mmu/tdp_mmu.c      | 51 +++++++++++++++++++++------------
 arch/x86/kvm/vmx/tdx.c          |  9 +++++-
 3 files changed, 42 insertions(+), 21 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 56089d6b9b51..315ffb23e9d8 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1850,7 +1850,8 @@ struct kvm_x86_ops {
 
 	/* Split a huge mapping into smaller mappings in the external page table */
 	int (*split_external_spte)(struct kvm *kvm, gfn_t gfn, enum pg_level level,
-				   u64 old_mirror_spte, void *new_external_spt);
+				   u64 old_mirror_spte, void *new_external_spt,
+				   bool mmu_lock_shared);
 
 	/* Allocate a page from the external page cache. */
 	void *(*alloc_external_fault_cache)(struct kvm_vcpu *vcpu);
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 977914b2627f..9b45ffb8585f 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -509,7 +509,7 @@ static void *get_external_spt(gfn_t gfn, u64 new_spte, int level)
 }
 
 static int split_external_spte(struct kvm *kvm, gfn_t gfn, u64 old_spte,
-			       u64 new_spte, int level)
+			       u64 new_spte, int level, bool shared)
 {
 	void *new_external_spt = get_external_spt(gfn, new_spte, level);
 	int ret;
@@ -517,7 +517,7 @@ static int split_external_spte(struct kvm *kvm, gfn_t gfn, u64 old_spte,
 	KVM_BUG_ON(!new_external_spt, kvm);
 
 	ret = kvm_x86_call(split_external_spte)(kvm, gfn, level, old_spte,
-						new_external_spt);
+						new_external_spt, shared);
 	return ret;
 }
 
@@ -527,10 +527,20 @@ static int __must_check set_external_spte_present(struct kvm *kvm, tdp_ptep_t sp
 {
 	bool was_present = is_shadow_present_pte(old_spte);
 	bool is_present = is_shadow_present_pte(new_spte);
+	bool was_leaf = was_present && is_last_spte(old_spte, level);
 	bool is_leaf = is_present && is_last_spte(new_spte, level);
 	int ret = 0;
 
-	KVM_BUG_ON(was_present, kvm);
+	/*
+	 * The caller __tdp_mmu_set_spte_atomic() has ensured new_spte must be
+	 * present.
+	 *
+	 * Current valid transitions:
+	 * - leaf to non-leaf (demote)
+	 * - !present to present leaf
+	 * - !present to present non-leaf
+	 */
+	KVM_BUG_ON(!(!was_present || (was_leaf && !is_leaf)), kvm);
 
 	lockdep_assert_held(&kvm->mmu_lock);
 	/*
@@ -541,18 +551,24 @@ static int __must_check set_external_spte_present(struct kvm *kvm, tdp_ptep_t sp
 	if (!try_cmpxchg64(rcu_dereference(sptep), &old_spte, FROZEN_SPTE))
 		return -EBUSY;
 
-	/*
-	 * Use different call to either set up middle level
-	 * external page table, or leaf.
-	 */
-	if (is_leaf) {
-		ret = kvm_x86_call(set_external_spte)(kvm, gfn, level, new_spte);
-	} else {
-		void *external_spt = get_external_spt(gfn, new_spte, level);
+	if (!was_present) {
+		/*
+		 * Use different call to either set up middle level external
+		 * page table, or leaf.
+		 */
+		if (is_leaf) {
+			ret = kvm_x86_call(set_external_spte)(kvm, gfn, level, new_spte);
+		} else {
+			void *external_spt = get_external_spt(gfn, new_spte, level);
 
-		KVM_BUG_ON(!external_spt, kvm);
-		ret = kvm_x86_call(link_external_spt)(kvm, gfn, level, external_spt);
+			KVM_BUG_ON(!external_spt, kvm);
+			ret = kvm_x86_call(link_external_spt)(kvm, gfn, level, external_spt);
+		}
+	} else if (was_leaf && !is_leaf) {
+		/* splitting */
+		ret = split_external_spte(kvm, gfn, old_spte, new_spte, level, true);
 	}
+
 	if (ret)
 		__kvm_tdp_mmu_write_spte(sptep, old_spte);
 	else
@@ -782,7 +798,7 @@ static u64 tdp_mmu_set_spte(struct kvm *kvm, int as_id, tdp_ptep_t sptep,
 		if (!is_shadow_present_pte(new_spte))
 			remove_external_spte(kvm, gfn, old_spte, level);
 		else if (is_last_spte(old_spte, level) && !is_last_spte(new_spte, level))
-			split_external_spte(kvm, gfn, old_spte, new_spte, level);
+			split_external_spte(kvm, gfn, old_spte, new_spte, level, false);
 		else
 			KVM_BUG_ON(1, kvm);
 	}
@@ -1331,13 +1347,10 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 
 		sp->nx_huge_page_disallowed = fault->huge_page_disallowed;
 
-		if (is_shadow_present_pte(iter.old_spte)) {
-			/* Don't support large page for mirrored roots (TDX) */
-			KVM_BUG_ON(is_mirror_sptep(iter.sptep), vcpu->kvm);
+		if (is_shadow_present_pte(iter.old_spte))
 			r = tdp_mmu_split_huge_page(kvm, &iter, sp, true);
-		} else {
+		else
 			r = tdp_mmu_link_sp(kvm, &iter, sp, true);
-		}
 
 		/*
 		 * Force the guest to retry if installing an upper level SPTE
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index b41793402769..1e29722abb36 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1926,7 +1926,8 @@ static void tdx_sept_remove_private_spte(struct kvm *kvm, gfn_t gfn,
  * BUSY.
  */
 static int tdx_sept_split_private_spte(struct kvm *kvm, gfn_t gfn, enum pg_level level,
-				       u64 old_mirror_spte, void *new_private_spt)
+				       u64 old_mirror_spte, void *new_private_spt,
+				       bool mmu_lock_shared)
 {
 	struct page *new_sept_page = virt_to_page(new_private_spt);
 	int tdx_level = pg_level_to_tdx_sept_level(level);
@@ -1938,6 +1939,12 @@ static int tdx_sept_split_private_spte(struct kvm *kvm, gfn_t gfn, enum pg_level
 		       level != PG_LEVEL_2M, kvm))
 		return -EIO;
 
+	if (WARN_ON_ONCE(mmu_lock_shared)) {
+		pr_warn_once("Splitting GFN %llx at level %d under shared mmu_lock is not yet supported\n",
+			     gfn, level);
+		return -EOPNOTSUPP;
+	}
+
 	err = tdh_do_no_vcpus(tdh_mem_range_block, kvm, &kvm_tdx->td, gpa,
 			      tdx_level, &entry, &level_state);
 	if (TDX_BUG_ON_2(err, TDH_MEM_RANGE_BLOCK, entry, level_state, kvm))
-- 
2.43.2


^ permalink raw reply related	[flat|nested] 127+ messages in thread

* [PATCH v3 10/24] KVM: x86/tdp_mmu: Alloc external_spt page for mirror page table splitting
  2026-01-06 10:16 [PATCH v3 00/24] KVM: TDX huge page support for private memory Yan Zhao
                   ` (8 preceding siblings ...)
  2026-01-06 10:21 ` [PATCH v3 09/24] KVM: x86: Reject splitting huge pages under shared mmu_lock in TDX Yan Zhao
@ 2026-01-06 10:21 ` Yan Zhao
  2026-01-06 10:21 ` [PATCH v3 11/24] KVM: x86/mmu: Introduce kvm_split_cross_boundary_leafs() Yan Zhao
                   ` (15 subsequent siblings)
  25 siblings, 0 replies; 127+ messages in thread
From: Yan Zhao @ 2026-01-06 10:21 UTC (permalink / raw)
  To: pbonzini, seanjc
  Cc: linux-kernel, kvm, x86, rick.p.edgecombe, dave.hansen, kas, tabba,
	ackerleytng, michael.roth, david, vannapurve, sagis, vbabka,
	thomas.lendacky, nik.borisov, pgonda, fan.du, jun.miao,
	francescolavra.fl, jgross, ira.weiny, isaku.yamahata, xiaoyao.li,
	kai.huang, binbin.wu, chao.p.peng, chao.gao, yan.y.zhao

From: Isaku Yamahata <isaku.yamahata@intel.com>

Enhance tdp_mmu_alloc_sp_for_split() to allocate a page table page for the
external page table for splitting the mirror page table.

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
---
v3:
- Removed unnecessary declaration of tdp_mmu_alloc_sp_for_split(). (Kai)
- Fixed a typo in the patch log. (Kai)

RFC v2:
- NO change.

RFC v1:
- Rebased and simplified the code.
---
 arch/x86/kvm/mmu/tdp_mmu.c | 13 +++++++++++--
 1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 9b45ffb8585f..074209d91ec3 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1535,7 +1535,7 @@ bool kvm_tdp_mmu_wrprot_slot(struct kvm *kvm,
 	return spte_set;
 }
 
-static struct kvm_mmu_page *tdp_mmu_alloc_sp_for_split(void)
+static struct kvm_mmu_page *tdp_mmu_alloc_sp_for_split(bool mirror)
 {
 	struct kvm_mmu_page *sp;
 
@@ -1549,6 +1549,15 @@ static struct kvm_mmu_page *tdp_mmu_alloc_sp_for_split(void)
 		return NULL;
 	}
 
+	if (mirror) {
+		sp->external_spt = (void *)get_zeroed_page(GFP_KERNEL_ACCOUNT);
+		if (!sp->external_spt) {
+			free_page((unsigned long)sp->spt);
+			kmem_cache_free(mmu_page_header_cache, sp);
+			return NULL;
+		}
+	}
+
 	return sp;
 }
 
@@ -1628,7 +1637,7 @@ static int tdp_mmu_split_huge_pages_root(struct kvm *kvm,
 			else
 				write_unlock(&kvm->mmu_lock);
 
-			sp = tdp_mmu_alloc_sp_for_split();
+			sp = tdp_mmu_alloc_sp_for_split(is_mirror_sp(root));
 
 			if (shared)
 				read_lock(&kvm->mmu_lock);
-- 
2.43.2


^ permalink raw reply related	[flat|nested] 127+ messages in thread

* [PATCH v3 11/24] KVM: x86/mmu: Introduce kvm_split_cross_boundary_leafs()
  2026-01-06 10:16 [PATCH v3 00/24] KVM: TDX huge page support for private memory Yan Zhao
                   ` (9 preceding siblings ...)
  2026-01-06 10:21 ` [PATCH v3 10/24] KVM: x86/tdp_mmu: Alloc external_spt page for mirror page table splitting Yan Zhao
@ 2026-01-06 10:21 ` Yan Zhao
  2026-01-15 12:25   ` Huang, Kai
  2026-01-06 10:21 ` [PATCH v3 12/24] KVM: x86: Introduce hugepage_set_guest_inhibit() Yan Zhao
                   ` (14 subsequent siblings)
  25 siblings, 1 reply; 127+ messages in thread
From: Yan Zhao @ 2026-01-06 10:21 UTC (permalink / raw)
  To: pbonzini, seanjc
  Cc: linux-kernel, kvm, x86, rick.p.edgecombe, dave.hansen, kas, tabba,
	ackerleytng, michael.roth, david, vannapurve, sagis, vbabka,
	thomas.lendacky, nik.borisov, pgonda, fan.du, jun.miao,
	francescolavra.fl, jgross, ira.weiny, isaku.yamahata, xiaoyao.li,
	kai.huang, binbin.wu, chao.p.peng, chao.gao, yan.y.zhao

Introduce kvm_split_cross_boundary_leafs() to split huge leaf entries that
cross the boundary of a specified range.

Splitting huge leaf entries that cross the boundary is essential before
zapping a specified range in the mirror root. This ensures that the
subsequent zap operation does not affect any GFNs outside the specified
range, which is crucial for the mirror root, as the private page table
requires the guest's ACCEPT operation after faulting back.

While the core of kvm_split_cross_boundary_leafs() leverages the main logic
of tdp_mmu_split_huge_pages_root(), the former only splits huge leaf
entries when their mapping ranges cross the specified range boundary. When
splitting is necessary, kvm->mmu_lock may be temporarily released for
memory allocation, meaning returning -ENOMEM is possible.

Since tdp_mmu_split_huge_pages_root() is originally invoked by dirty page
tracking related functions that flush TLB unconditionally at the end,
tdp_mmu_split_huge_pages_root() doesn't flush TLB before it temporarily
releases mmu_lock.

Do not enhance tdp_mmu_split_huge_pages_root() to return split or flush
status for kvm_split_cross_boundary_leafs(). This is because the status
could be inaccurate when multiple threads are trying to split the same
memory range concurrently: if kvm_split_cross_boundary_leafs() reported
split/flush as false, that would not mean there were no splits in the
specified range, since splits could have occurred in other threads while
mmu_lock was temporarily released.

Therefore, callers of kvm_split_cross_boundary_leafs() need to determine
how/when to flush TLB according to the use cases:

- If the split is triggered in a fault path for TDX, the hardware shouldn't
  have cached the old huge translation. Therefore, no need to flush TLB.

- If the split is triggered by zaps in guest_memfd punch hole or page
  conversion, it can delay the TLB flush until after zaps.

- If the use case relies on pure split status (e.g., splitting for PML),
  flush TLB unconditionally. (Just hypothetical. No such use case currently
  exists for kvm_split_cross_boundary_leafs()).

Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
---
v3:
- s/only_cross_bounday/only_cross_boundary. (Kai)
- Do not return flush status and have the callers to determine how/when to
  flush TLB.
- Always pass "flush" as false to tdp_mmu_iter_cond_resched(). (Kai)
- Added a default implementation for kvm_split_cross_boundary_leafs() for
  non-x86 platforms.
- Removed middle level function tdp_mmu_split_cross_boundary_leafs().
- Use EXPORT_SYMBOL_FOR_KVM_INTERNAL().

RFC v2:
- Rename the API to kvm_split_cross_boundary_leafs().
- Make the API to be usable for direct roots or under shared mmu_lock.
- Leverage the main logic from tdp_mmu_split_huge_pages_root(). (Rick)

RFC v1:
- Split patch.
- introduced API kvm_split_boundary_leafs(), refined the logic and
  simplified the code.
---
 arch/x86/kvm/mmu/mmu.c     | 34 ++++++++++++++++++++++++++++++
 arch/x86/kvm/mmu/tdp_mmu.c | 42 ++++++++++++++++++++++++++++++++++++--
 arch/x86/kvm/mmu/tdp_mmu.h |  3 +++
 include/linux/kvm_host.h   |  2 ++
 virt/kvm/kvm_main.c        |  7 +++++++
 5 files changed, 86 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index b4f2e3ced716..f40af7ac75b3 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -1644,6 +1644,40 @@ static bool __kvm_rmap_zap_gfn_range(struct kvm *kvm,
 				 start, end - 1, can_yield, true, flush);
 }
 
+/*
+ * Split large leafs crossing the boundary of the specified range.
+ * Only the TDP MMU is supported. Do nothing if !tdp_mmu_enabled.
+ *
+ * This API does not flush TLB. Callers need to determine how/when to flush TLB
+ * according to their use cases, e.g.,
+ * - No need to flush TLB, e.g., if it's in a fault path or a TLB flush has been
+ *   ensured.
+ * - Delay the TLB flush until after zaps if the split is invoked for precise
+ *   zapping.
+ * - Unconditionally flush TLB if a use case relies on pure split status (e.g.,
+ *   splitting for PML).
+ *
+ * Return: 0 on success, <0 on failure.
+ */
+int kvm_split_cross_boundary_leafs(struct kvm *kvm, struct kvm_gfn_range *range,
+				   bool shared)
+{
+	int ret = 0;
+
+	lockdep_assert_once(kvm->mmu_invalidate_in_progress ||
+			    lockdep_is_held(&kvm->slots_lock) ||
+			    srcu_read_lock_held(&kvm->srcu));
+
+	if (!range->may_block)
+		return -EOPNOTSUPP;
+
+	if (tdp_mmu_enabled)
+		ret = kvm_tdp_mmu_gfn_range_split_cross_boundary_leafs(kvm, range,
+								       shared);
+	return ret;
+}
+EXPORT_SYMBOL_FOR_KVM_INTERNAL(kvm_split_cross_boundary_leafs);
+
 bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
 {
 	bool flush = false;
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 074209d91ec3..b984027343b7 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1600,10 +1600,17 @@ static int tdp_mmu_split_huge_page(struct kvm *kvm, struct tdp_iter *iter,
 	return ret;
 }
 
+static bool iter_cross_boundary(struct tdp_iter *iter, gfn_t start, gfn_t end)
+{
+	return !(iter->gfn >= start &&
+		 (iter->gfn + KVM_PAGES_PER_HPAGE(iter->level)) <= end);
+}
+
 static int tdp_mmu_split_huge_pages_root(struct kvm *kvm,
 					 struct kvm_mmu_page *root,
 					 gfn_t start, gfn_t end,
-					 int target_level, bool shared)
+					 int target_level, bool shared,
+					 bool only_cross_boundary)
 {
 	struct kvm_mmu_page *sp = NULL;
 	struct tdp_iter iter;
@@ -1615,6 +1622,10 @@ static int tdp_mmu_split_huge_pages_root(struct kvm *kvm,
 	 * level into one lower level. For example, if we encounter a 1GB page
 	 * we split it into 512 2MB pages.
 	 *
+	 * When only_cross_boundary is true, only split huge pages above the
+	 * target level into one lower level when the huge pages cross the
+	 * start or end boundary.
+	 *
 	 * Since the TDP iterator uses a pre-order traversal, we are guaranteed
 	 * to visit an SPTE before ever visiting its children, which means we
 	 * will correctly recursively split huge pages that are more than one
@@ -1629,6 +1640,10 @@ static int tdp_mmu_split_huge_pages_root(struct kvm *kvm,
 		if (!is_shadow_present_pte(iter.old_spte) || !is_large_pte(iter.old_spte))
 			continue;
 
+		if (only_cross_boundary &&
+		    !iter_cross_boundary(&iter, start, end))
+			continue;
+
 		if (!sp) {
 			rcu_read_unlock();
 
@@ -1692,12 +1707,35 @@ void kvm_tdp_mmu_try_split_huge_pages(struct kvm *kvm,
 
 	kvm_lockdep_assert_mmu_lock_held(kvm, shared);
 	for_each_valid_tdp_mmu_root_yield_safe(kvm, root, slot->as_id) {
-		r = tdp_mmu_split_huge_pages_root(kvm, root, start, end, target_level, shared);
+		r = tdp_mmu_split_huge_pages_root(kvm, root, start, end, target_level,
+						  shared, false);
+		if (r) {
+			kvm_tdp_mmu_put_root(kvm, root);
+			break;
+		}
+	}
+}
+
+int kvm_tdp_mmu_gfn_range_split_cross_boundary_leafs(struct kvm *kvm,
+						     struct kvm_gfn_range *range,
+						     bool shared)
+{
+	enum kvm_tdp_mmu_root_types types;
+	struct kvm_mmu_page *root;
+	int r = 0;
+
+	kvm_lockdep_assert_mmu_lock_held(kvm, shared);
+	types = kvm_gfn_range_filter_to_root_types(kvm, range->attr_filter);
+
+	__for_each_tdp_mmu_root_yield_safe(kvm, root, range->slot->as_id, types) {
+		r = tdp_mmu_split_huge_pages_root(kvm, root, range->start, range->end,
+						  PG_LEVEL_4K, shared, true);
 		if (r) {
 			kvm_tdp_mmu_put_root(kvm, root);
 			break;
 		}
 	}
+	return r;
 }
 
 static bool tdp_mmu_need_write_protect(struct kvm *kvm, struct kvm_mmu_page *sp)
diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
index bd62977c9199..c20b1416e4b2 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.h
+++ b/arch/x86/kvm/mmu/tdp_mmu.h
@@ -70,6 +70,9 @@ void kvm_tdp_mmu_zap_all(struct kvm *kvm);
 void kvm_tdp_mmu_invalidate_roots(struct kvm *kvm,
 				  enum kvm_tdp_mmu_root_types root_types);
 void kvm_tdp_mmu_zap_invalidated_roots(struct kvm *kvm, bool shared);
+int kvm_tdp_mmu_gfn_range_split_cross_boundary_leafs(struct kvm *kvm,
+						     struct kvm_gfn_range *range,
+						     bool shared);
 
 int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault);
 
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 8144d27e6c12..e563bb22c481 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -275,6 +275,8 @@ struct kvm_gfn_range {
 bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range);
 bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
 bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
+int kvm_split_cross_boundary_leafs(struct kvm *kvm, struct kvm_gfn_range *range,
+				   bool shared);
 #endif
 
 enum {
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 1d7ab2324d10..feeef7747099 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -910,6 +910,13 @@ static int kvm_init_mmu_notifier(struct kvm *kvm)
 	return mmu_notifier_register(&kvm->mmu_notifier, current->mm);
 }
 
+int __weak kvm_split_cross_boundary_leafs(struct kvm *kvm,
+					  struct kvm_gfn_range *range,
+					  bool shared)
+{
+	return 0;
+}
+
 #else  /* !CONFIG_KVM_GENERIC_MMU_NOTIFIER */
 
 static int kvm_init_mmu_notifier(struct kvm *kvm)
-- 
2.43.2


^ permalink raw reply related	[flat|nested] 127+ messages in thread

* [PATCH v3 12/24] KVM: x86: Introduce hugepage_set_guest_inhibit()
  2026-01-06 10:16 [PATCH v3 00/24] KVM: TDX huge page support for private memory Yan Zhao
                   ` (10 preceding siblings ...)
  2026-01-06 10:21 ` [PATCH v3 11/24] KVM: x86/mmu: Introduce kvm_split_cross_boundary_leafs() Yan Zhao
@ 2026-01-06 10:21 ` Yan Zhao
  2026-01-06 10:22 ` [PATCH v3 13/24] KVM: TDX: Honor the guest's accept level contained in an EPT violation Yan Zhao
                   ` (13 subsequent siblings)
  25 siblings, 0 replies; 127+ messages in thread
From: Yan Zhao @ 2026-01-06 10:21 UTC (permalink / raw)
  To: pbonzini, seanjc
  Cc: linux-kernel, kvm, x86, rick.p.edgecombe, dave.hansen, kas, tabba,
	ackerleytng, michael.roth, david, vannapurve, sagis, vbabka,
	thomas.lendacky, nik.borisov, pgonda, fan.du, jun.miao,
	francescolavra.fl, jgross, ira.weiny, isaku.yamahata, xiaoyao.li,
	kai.huang, binbin.wu, chao.p.peng, chao.gao, yan.y.zhao

TDX requires guests to accept S-EPT mappings created by the host KVM. Due
to the current implementation of the TDX module, if a guest accepts a GFN
at a lower level after KVM maps it at a higher level, the TDX module will
emulate an EPT violation VMExit to KVM instead of returning a size mismatch
error to the guest. If KVM fails to perform page splitting in the VMExit
handler, the guest's accept operation will be triggered again upon
re-entering the guest, causing a repeated EPT violation VMExit.

To facilitate passing the guest's accept level information to the KVM MMU
core and to prevent the repeated mapping of a GFN at different levels due
to different accept levels specified by different vCPUs, introduce the
interface hugepage_set_guest_inhibit(). This interface records, across
vCPUs, that mapping at a certain level is inhibited by the guest.

The KVM_LPAGE_GUEST_INHIBIT_FLAG bit is currently modified in one
direction (set), so no unset interface is provided.

Link: https://lore.kernel.org/all/a6ffe23fb97e64109f512fa43e9f6405236ed40a.camel@intel.com/ [1]
Suggested-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
---
v3:
- Use EXPORT_SYMBOL_FOR_KVM_INTERNAL().

RFC v2:
- new in RFC v2
---
 arch/x86/kvm/mmu.h     |  3 +++
 arch/x86/kvm/mmu/mmu.c | 21 ++++++++++++++++++---
 2 files changed, 21 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index 830f46145692..f97bedff5c4c 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -322,4 +322,7 @@ static inline bool kvm_is_gfn_alias(struct kvm *kvm, gfn_t gfn)
 {
 	return gfn & kvm_gfn_direct_bits(kvm);
 }
+
+void hugepage_set_guest_inhibit(struct kvm_memory_slot *slot, gfn_t gfn, int level);
+bool hugepage_test_guest_inhibit(struct kvm_memory_slot *slot, gfn_t gfn, int level);
 #endif
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index f40af7ac75b3..029f2f272ffc 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -714,12 +714,14 @@ static struct kvm_lpage_info *lpage_info_slot(gfn_t gfn,
 }
 
 /*
- * The most significant bit in disallow_lpage tracks whether or not memory
- * attributes are mixed, i.e. not identical for all gfns at the current level.
+ * The 2 most significant bits in disallow_lpage track whether or not memory
+ * attributes are mixed, i.e. not identical for all gfns at the current level,
+ * and whether or not the guest inhibits the current level of hugepage at the gfn.
  * The lower order bits are used to refcount other cases where a hugepage is
  * disallowed, e.g. if KVM has shadow a page table at the gfn.
  */
 #define KVM_LPAGE_MIXED_FLAG	BIT(31)
+#define KVM_LPAGE_GUEST_INHIBIT_FLAG   BIT(30)
 
 static void update_gfn_disallow_lpage_count(const struct kvm_memory_slot *slot,
 					    gfn_t gfn, int count)
@@ -732,7 +734,8 @@ static void update_gfn_disallow_lpage_count(const struct kvm_memory_slot *slot,
 
 		old = linfo->disallow_lpage;
 		linfo->disallow_lpage += count;
-		WARN_ON_ONCE((old ^ linfo->disallow_lpage) & KVM_LPAGE_MIXED_FLAG);
+		WARN_ON_ONCE((old ^ linfo->disallow_lpage) &
+			     (KVM_LPAGE_MIXED_FLAG | KVM_LPAGE_GUEST_INHIBIT_FLAG));
 	}
 }
 
@@ -1644,6 +1647,18 @@ static bool __kvm_rmap_zap_gfn_range(struct kvm *kvm,
 				 start, end - 1, can_yield, true, flush);
 }
 
+bool hugepage_test_guest_inhibit(struct kvm_memory_slot *slot, gfn_t gfn, int level)
+{
+	return lpage_info_slot(gfn, slot, level)->disallow_lpage & KVM_LPAGE_GUEST_INHIBIT_FLAG;
+}
+EXPORT_SYMBOL_FOR_KVM_INTERNAL(hugepage_test_guest_inhibit);
+
+void hugepage_set_guest_inhibit(struct kvm_memory_slot *slot, gfn_t gfn, int level)
+{
+	lpage_info_slot(gfn, slot, level)->disallow_lpage |= KVM_LPAGE_GUEST_INHIBIT_FLAG;
+}
+EXPORT_SYMBOL_FOR_KVM_INTERNAL(hugepage_set_guest_inhibit);
+
 /*
  * Split large leafs crossing the boundary of the specified range.
  * Only support TDP MMU. Do nothing if !tdp_mmu_enabled.
-- 
2.43.2
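
The packing scheme this patch extends — flag bits in the most significant bits of disallow_lpage, with a refcount in the lower bits — and the WARN_ON_ONCE invariant that a refcount update must never touch the flag bits can be sketched in standalone form. The macro names below are illustrative stand-ins for the KVM_LPAGE_* flags, not kernel symbols:

```c
#include <assert.h>
#include <stdint.h>

#define LPAGE_MIXED_FLAG          (1u << 31)  /* stand-in for KVM_LPAGE_MIXED_FLAG */
#define LPAGE_GUEST_INHIBIT_FLAG  (1u << 30)  /* stand-in for KVM_LPAGE_GUEST_INHIBIT_FLAG */
#define LPAGE_COUNT_MASK          (LPAGE_GUEST_INHIBIT_FLAG - 1)

/* Mirror of update_gfn_disallow_lpage_count()'s core: bump the refcount in
 * the low bits and assert that the flag bits are unaffected, as the patch's
 * WARN_ON_ONCE does. */
static uint32_t update_count(uint32_t disallow_lpage, int count)
{
	uint32_t old = disallow_lpage;

	disallow_lpage += (uint32_t)count;
	assert(((old ^ disallow_lpage) &
		(LPAGE_MIXED_FLAG | LPAGE_GUEST_INHIBIT_FLAG)) == 0);
	return disallow_lpage;
}
```

A refcount of 1 with the guest-inhibit flag set thus yields 0x40000001; clearing the count leaves the flag intact, which is why testing the flag without the lock (as a later patch does) is safe.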


^ permalink raw reply related	[flat|nested] 127+ messages in thread

* [PATCH v3 13/24] KVM: TDX: Honor the guest's accept level contained in an EPT violation
  2026-01-06 10:16 [PATCH v3 00/24] KVM: TDX huge page support for private memory Yan Zhao
                   ` (11 preceding siblings ...)
  2026-01-06 10:21 ` [PATCH v3 12/24] KVM: x86: Introduce hugepage_set_guest_inhibit() Yan Zhao
@ 2026-01-06 10:22 ` Yan Zhao
  2026-01-06 10:22 ` [PATCH v3 14/24] KVM: Change the return type of gfn_handler_t() from bool to int Yan Zhao
                   ` (12 subsequent siblings)
  25 siblings, 0 replies; 127+ messages in thread
From: Yan Zhao @ 2026-01-06 10:22 UTC (permalink / raw)
  To: pbonzini, seanjc
  Cc: linux-kernel, kvm, x86, rick.p.edgecombe, dave.hansen, kas, tabba,
	ackerleytng, michael.roth, david, vannapurve, sagis, vbabka,
	thomas.lendacky, nik.borisov, pgonda, fan.du, jun.miao,
	francescolavra.fl, jgross, ira.weiny, isaku.yamahata, xiaoyao.li,
	kai.huang, binbin.wu, chao.p.peng, chao.gao, yan.y.zhao

TDX requires guests to accept S-EPT mappings created by the host KVM. Due
to the current implementation of the TDX module, if a guest accepts a GFN
at a lower level after KVM maps it at a higher level, the TDX module will
emulate an EPT violation VMExit to KVM instead of returning a size mismatch
error to the guest. If KVM fails to perform page splitting in the EPT
violation handler, the guest's ACCEPT operation will be triggered again
upon re-entering the guest, causing a repeated EPT violation VMExit.

The TDX module therefore has the EPT violation VMExit carry the guest's
accept level when the exit is caused by the guest's ACCEPT operation.

Honor the guest's accept level if an EPT violation VMExit contains guest
accept level:

(1) Set the guest inhibit bit in the lpage info to prevent KVM MMU core
    from mapping at a higher level than the guest's accept level.

(2) Split any existing mapping higher than the guest's accept level.

Use the write mmu_lock to protect (1) and (2) for now. Once a TDX module
with the NON-BLOCKING-RESIZE feature is available, splitting can be
performed under the shared mmu_lock, since there is no need to worry about
UNBLOCK failing after a failed DEMOTE. Both (1) and (2) could then be done
under the shared mmu_lock.

As an optimization, this patch calls hugepage_test_guest_inhibit() without
holding the mmu_lock, to reduce the frequency of acquiring the write
mmu_lock; the write mmu_lock is only acquired if the guest inhibit bit is
not already set. This is safe because the guest inhibit bit is only ever
set, never cleared, and the splitting under the write mmu_lock is performed
before the bit is set.

Note: EPT violation VMExits that lack the guest's accept level are not
caused by the guest's ACCEPT operation; they are instead caused by the
guest accessing memory before accepting it. Since KVM can't obtain the
guest accept level from such EPT violation VMExits (the ACCEPT operation
hasn't occurred yet), KVM may still map at a higher level than the level
the guest later accepts at.

So, the typical guest/KVM interaction flow is:
- If guest accesses private memory without first accepting it,
  (like non-Linux guests):
  1. Guest accesses private memory.
  2. KVM finds it can map the GFN at 2MB. So, AUG at 2MB.
  3. Guest accepts the GFN at 4KB.
  4. KVM receives an EPT violation with eeq_type of ACCEPT + 4KB level.
  5. KVM splits the 2MB mapping.
  6. Guest accepts successfully and accesses the page.

- If guest first accepts private memory before accessing it,
  (like Linux guests):
  1. Guest accepts private memory at 4KB.
  2. KVM receives an EPT violation with eeq_type of ACCEPT + 4KB level.
  3. KVM AUG at 4KB.
  4. Guest accepts successfully and accesses the page.
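
Assuming the encoding used in this patch (type in bits 3:0 of the extended
exit qualification, info in bits 63:32, and the accept level encoded as
level - 1 in the info field's low 3 bits, with PG_LEVEL_4K == 1), the
decode step can be sketched as follows; the constant names are illustrative
stand-ins for the TDX_EXT_EXIT_QUAL_* macros:

```c
#include <assert.h>
#include <stdint.h>

#define EXT_EXIT_QUAL_TYPE_MASK    0xfULL  /* stand-in for TDX_EXT_EXIT_QUAL_TYPE_MASK */
#define EXT_EXIT_QUAL_TYPE_ACCEPT  1ULL    /* stand-in for TDX_EXT_EXIT_QUAL_TYPE_ACCEPT */
#define EXT_EXIT_QUAL_INFO_SHIFT   32      /* stand-in for TDX_EXT_EXIT_QUAL_INFO_SHIFT */

/* Return the guest accept level (1 = 4KB, 2 = 2MB, 3 = 1GB), or -1 if the
 * EPT violation was not caused by an ACCEPT operation. */
static int accept_level(uint64_t ext_exit_qual)
{
	if ((ext_exit_qual & EXT_EXIT_QUAL_TYPE_MASK) != EXT_EXIT_QUAL_TYPE_ACCEPT)
		return -1;
	/* Low 3 bits of the info field encode (level - 1), as in the patch. */
	return (int)(((ext_exit_qual >> EXT_EXIT_QUAL_INFO_SHIFT) & 0x7) + 1);
}
```

With this decoding, a 4KB ACCEPT carries info 0 and decodes to level 1, matching the `(eeq_info & GENMASK(2, 0)) + 1` computation in tdx_honor_guest_accept_level().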

Link: https://lore.kernel.org/all/a6ffe23fb97e64109f512fa43e9f6405236ed40a.camel@intel.com
Suggested-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
---
v3:
- tdx_check_accept_level() --> tdx_honor_guest_accept_level(). (Binbin)
- Add patch log and code comment to describe the flows for EPT violations
  w/ and w/o accept level better. (Kai)
- Add a comment to describe why kvm_flush_remote_tlbs() is not needed after
  kvm_split_cross_boundary_leafs(). (Kai).
- Return ret to userspace on error of tdx_honor_guest_accept_level(). (Kai)

RFC v2
- Change tdx_get_accept_level() to tdx_check_accept_level().
- Invoke kvm_split_cross_boundary_leafs() and hugepage_set_guest_inhibit()
  to change KVM mapping level in a global way according to guest accept
  level. (Rick, Sean).

RFC v1:
- Introduce tdx_get_accept_level() to get guest accept level.
- Use tdx->violation_request_level and tdx->violation_gfn* to pass guest
  accept level to tdx_gmem_private_max_mapping_level() to determine KVM
  mapping level.
---
 arch/x86/kvm/vmx/tdx.c      | 77 +++++++++++++++++++++++++++++++++++++
 arch/x86/kvm/vmx/tdx_arch.h |  3 ++
 2 files changed, 80 insertions(+)

diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 1e29722abb36..712aaa3d45b7 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1983,6 +1983,79 @@ static inline bool tdx_is_sept_violation_unexpected_pending(struct kvm_vcpu *vcp
 	return !(eq & EPT_VIOLATION_PROT_MASK) && !(eq & EPT_VIOLATION_EXEC_FOR_RING3_LIN);
 }
 
+/*
+ * An EPT violation can be either due to the guest's ACCEPT operation or
+ * due to the guest's access of memory before the guest accepts the
+ * memory.
+ *
+ * Type TDX_EXT_EXIT_QUAL_TYPE_ACCEPT in the extended exit qualification
+ * identifies the former case, which must also contain a valid guest
+ * accept level.
+ *
+ * For the former case, honor the guest's accept level by setting the guest
+ * inhibit bit on levels above the accept level and splitting any existing
+ * mapping of the faulting GFN that is at a higher level than the accept level.
+ *
+ * Do nothing if the EPT violation is due to the latter case. KVM will map the
+ * GFN without considering the guest's accept level (unless the guest inhibit
+ * bit is already set).
+ */
+static inline int tdx_honor_guest_accept_level(struct kvm_vcpu *vcpu, gfn_t gfn)
+{
+	struct kvm_memory_slot *slot = gfn_to_memslot(vcpu->kvm, gfn);
+	struct vcpu_tdx *tdx = to_tdx(vcpu);
+	struct kvm *kvm = vcpu->kvm;
+	u64 eeq_type, eeq_info;
+	int level = -1;
+
+	if (!slot)
+		return 0;
+
+	eeq_type = tdx->ext_exit_qualification & TDX_EXT_EXIT_QUAL_TYPE_MASK;
+	if (eeq_type != TDX_EXT_EXIT_QUAL_TYPE_ACCEPT)
+		return 0;
+
+	eeq_info = (tdx->ext_exit_qualification & TDX_EXT_EXIT_QUAL_INFO_MASK) >>
+		   TDX_EXT_EXIT_QUAL_INFO_SHIFT;
+
+	level = (eeq_info & GENMASK(2, 0)) + 1;
+
+	if (level == PG_LEVEL_4K || level == PG_LEVEL_2M) {
+		if (!hugepage_test_guest_inhibit(slot, gfn, level + 1)) {
+			gfn_t base_gfn = gfn_round_for_level(gfn, level);
+			struct kvm_gfn_range gfn_range = {
+				.start = base_gfn,
+				.end = base_gfn + KVM_PAGES_PER_HPAGE(level),
+				.slot = slot,
+				.may_block = true,
+				.attr_filter = KVM_FILTER_PRIVATE,
+			};
+
+			scoped_guard(write_lock, &kvm->mmu_lock) {
+				int ret;
+
+				/*
+				 * No kvm_flush_remote_tlbs() is required after
+				 * the split for S-EPT, because the
+				 * "BLOCK + TRACK + kick off vCPUs" sequence in
+				 * tdx_sept_split_private_spte() has guaranteed
+				 * the TLB flush. The hardware also doesn't
+				 * cache stale huge mappings in the fault path.
+				 */
+				ret = kvm_split_cross_boundary_leafs(kvm, &gfn_range,
+								     false);
+				if (ret)
+					return ret;
+
+				hugepage_set_guest_inhibit(slot, gfn, level + 1);
+				if (level == PG_LEVEL_4K)
+					hugepage_set_guest_inhibit(slot, gfn, level + 2);
+			}
+		}
+	}
+	return 0;
+}
+
 static int tdx_handle_ept_violation(struct kvm_vcpu *vcpu)
 {
 	unsigned long exit_qual;
@@ -2008,6 +2081,10 @@ static int tdx_handle_ept_violation(struct kvm_vcpu *vcpu)
 		 */
 		exit_qual = EPT_VIOLATION_ACC_WRITE;
 
+		ret = tdx_honor_guest_accept_level(vcpu, gpa_to_gfn(gpa));
+		if (ret)
+			return ret;
+
 		/* Only private GPA triggers zero-step mitigation */
 		local_retry = true;
 	} else {
diff --git a/arch/x86/kvm/vmx/tdx_arch.h b/arch/x86/kvm/vmx/tdx_arch.h
index a30e880849e3..af006a73ee05 100644
--- a/arch/x86/kvm/vmx/tdx_arch.h
+++ b/arch/x86/kvm/vmx/tdx_arch.h
@@ -82,7 +82,10 @@ struct tdx_cpuid_value {
 #define TDX_TD_ATTR_PERFMON		BIT_ULL(63)
 
 #define TDX_EXT_EXIT_QUAL_TYPE_MASK	GENMASK(3, 0)
+#define TDX_EXT_EXIT_QUAL_TYPE_ACCEPT  1
 #define TDX_EXT_EXIT_QUAL_TYPE_PENDING_EPT_VIOLATION  6
+#define TDX_EXT_EXIT_QUAL_INFO_MASK	GENMASK(63, 32)
+#define TDX_EXT_EXIT_QUAL_INFO_SHIFT	32
 /*
  * TD_PARAMS is provided as an input to TDH_MNG_INIT, the size of which is 1024B.
  */
-- 
2.43.2


^ permalink raw reply related	[flat|nested] 127+ messages in thread

* [PATCH v3 14/24] KVM: Change the return type of gfn_handler_t() from bool to int
  2026-01-06 10:16 [PATCH v3 00/24] KVM: TDX huge page support for private memory Yan Zhao
                   ` (12 preceding siblings ...)
  2026-01-06 10:22 ` [PATCH v3 13/24] KVM: TDX: Honor the guest's accept level contained in an EPT violation Yan Zhao
@ 2026-01-06 10:22 ` Yan Zhao
  2026-01-16  0:21   ` Sean Christopherson
  2026-01-06 10:22 ` [PATCH v3 15/24] KVM: x86: Split cross-boundary mirror leafs for KVM_SET_MEMORY_ATTRIBUTES Yan Zhao
                   ` (11 subsequent siblings)
  25 siblings, 1 reply; 127+ messages in thread
From: Yan Zhao @ 2026-01-06 10:22 UTC (permalink / raw)
  To: pbonzini, seanjc
  Cc: linux-kernel, kvm, x86, rick.p.edgecombe, dave.hansen, kas, tabba,
	ackerleytng, michael.roth, david, vannapurve, sagis, vbabka,
	thomas.lendacky, nik.borisov, pgonda, fan.du, jun.miao,
	francescolavra.fl, jgross, ira.weiny, isaku.yamahata, xiaoyao.li,
	kai.huang, binbin.wu, chao.p.peng, chao.gao, yan.y.zhao

Modify the return type of gfn_handler_t() from bool to int. A negative
return value indicates failure, a return value of 1 signifies success with
a TLB flush required, and 0 denotes success with no flush required.
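
The new convention and the way kvm_handle_gfn_range() folds per-slot
handler results together (propagate the first error, otherwise OR the
flush bits) can be sketched as below; aggregate() is a hypothetical helper,
not a kernel function:

```c
#include <assert.h>

/* Fold a sequence of gfn_handler_t-style results: return the first
 * negative error, otherwise 0, with *flush set if any handler returned 1. */
static int aggregate(const int *rets, int n, int *flush)
{
	*flush = 0;
	for (int i = 0; i < n; i++) {
		if (rets[i] < 0)
			return rets[i];	/* bail out on error, as the patch does */
		*flush |= rets[i];	/* 1 = success, TLB flush required */
	}
	return 0;
}
```

This mirrors the `flush |= ret` / `if (ret < 0) goto err` logic the patch adds to kvm_handle_gfn_range(): a flush is issued only when some handler actually zapped something, and errors short-circuit the walk.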

This adjustment prepares for a later change that will enable
kvm_pre_set_memory_attributes() to fail.

No functional changes expected.

Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
---
v3:
- Rebased.

RFC v2:
- No change

RFC v1:
- New patch.
---
 arch/arm64/kvm/mmu.c             |  8 ++++----
 arch/loongarch/kvm/mmu.c         |  8 ++++----
 arch/mips/kvm/mmu.c              |  6 +++---
 arch/powerpc/kvm/book3s.c        |  4 ++--
 arch/powerpc/kvm/e500_mmu_host.c |  8 ++++----
 arch/riscv/kvm/mmu.c             | 12 ++++++------
 arch/x86/kvm/mmu/mmu.c           | 20 ++++++++++----------
 include/linux/kvm_host.h         | 12 ++++++------
 virt/kvm/kvm_main.c              | 24 ++++++++++++++++--------
 9 files changed, 55 insertions(+), 47 deletions(-)

diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 5ab0cfa08343..c39d3ef577f8 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -2221,12 +2221,12 @@ bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
 	return false;
 }
 
-bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
+int kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
 {
 	u64 size = (range->end - range->start) << PAGE_SHIFT;
 
 	if (!kvm->arch.mmu.pgt)
-		return false;
+		return 0;
 
 	return KVM_PGT_FN(kvm_pgtable_stage2_test_clear_young)(kvm->arch.mmu.pgt,
 						   range->start << PAGE_SHIFT,
@@ -2237,12 +2237,12 @@ bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
 	 */
 }
 
-bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
+int kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
 {
 	u64 size = (range->end - range->start) << PAGE_SHIFT;
 
 	if (!kvm->arch.mmu.pgt)
-		return false;
+		return 0;
 
 	return KVM_PGT_FN(kvm_pgtable_stage2_test_clear_young)(kvm->arch.mmu.pgt,
 						   range->start << PAGE_SHIFT,
diff --git a/arch/loongarch/kvm/mmu.c b/arch/loongarch/kvm/mmu.c
index a7fa458e3360..06fa060878c9 100644
--- a/arch/loongarch/kvm/mmu.c
+++ b/arch/loongarch/kvm/mmu.c
@@ -511,7 +511,7 @@ bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
 			range->end << PAGE_SHIFT, &ctx);
 }
 
-bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
+int kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
 {
 	kvm_ptw_ctx ctx;
 
@@ -523,15 +523,15 @@ bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
 				range->end << PAGE_SHIFT, &ctx);
 }
 
-bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
+int kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
 {
 	gpa_t gpa = range->start << PAGE_SHIFT;
 	kvm_pte_t *ptep = kvm_populate_gpa(kvm, NULL, gpa, 0);
 
 	if (ptep && kvm_pte_present(NULL, ptep) && kvm_pte_young(*ptep))
-		return true;
+		return 1;
 
-	return false;
+	return 0;
 }
 
 /*
diff --git a/arch/mips/kvm/mmu.c b/arch/mips/kvm/mmu.c
index d2c3b6b41f18..c26cc89c8e98 100644
--- a/arch/mips/kvm/mmu.c
+++ b/arch/mips/kvm/mmu.c
@@ -444,18 +444,18 @@ bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
 	return true;
 }
 
-bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
+int kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
 {
 	return kvm_mips_mkold_gpa_pt(kvm, range->start, range->end);
 }
 
-bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
+int kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
 {
 	gpa_t gpa = range->start << PAGE_SHIFT;
 	pte_t *gpa_pte = kvm_mips_pte_for_gpa(kvm, NULL, gpa);
 
 	if (!gpa_pte)
-		return false;
+		return 0;
 	return pte_young(*gpa_pte);
 }
 
diff --git a/arch/powerpc/kvm/book3s.c b/arch/powerpc/kvm/book3s.c
index d79c5d1098c0..9bf6e1cf64f1 100644
--- a/arch/powerpc/kvm/book3s.c
+++ b/arch/powerpc/kvm/book3s.c
@@ -886,12 +886,12 @@ bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
 	return kvm->arch.kvm_ops->unmap_gfn_range(kvm, range);
 }
 
-bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
+int kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
 {
 	return kvm->arch.kvm_ops->age_gfn(kvm, range);
 }
 
-bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
+int kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
 {
 	return kvm->arch.kvm_ops->test_age_gfn(kvm, range);
 }
diff --git a/arch/powerpc/kvm/e500_mmu_host.c b/arch/powerpc/kvm/e500_mmu_host.c
index 06caf8bbbe2b..dd5411ee242e 100644
--- a/arch/powerpc/kvm/e500_mmu_host.c
+++ b/arch/powerpc/kvm/e500_mmu_host.c
@@ -697,16 +697,16 @@ bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
 	return kvm_e500_mmu_unmap_gfn(kvm, range);
 }
 
-bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
+int kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
 {
 	/* XXX could be more clever ;) */
-	return false;
+	return 0;
 }
 
-bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
+int kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
 {
 	/* XXX could be more clever ;) */
-	return false;
+	return 0;
 }
 
 /*****************************************/
diff --git a/arch/riscv/kvm/mmu.c b/arch/riscv/kvm/mmu.c
index 4ab06697bfc0..aa163d2ef7d5 100644
--- a/arch/riscv/kvm/mmu.c
+++ b/arch/riscv/kvm/mmu.c
@@ -259,7 +259,7 @@ bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
 	return false;
 }
 
-bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
+int kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
 {
 	pte_t *ptep;
 	u32 ptep_level = 0;
@@ -267,7 +267,7 @@ bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
 	struct kvm_gstage gstage;
 
 	if (!kvm->arch.pgd)
-		return false;
+		return 0;
 
 	WARN_ON(size != PAGE_SIZE && size != PMD_SIZE && size != PUD_SIZE);
 
@@ -277,12 +277,12 @@ bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
 	gstage.pgd = kvm->arch.pgd;
 	if (!kvm_riscv_gstage_get_leaf(&gstage, range->start << PAGE_SHIFT,
 				       &ptep, &ptep_level))
-		return false;
+		return 0;
 
 	return ptep_test_and_clear_young(NULL, 0, ptep);
 }
 
-bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
+int kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
 {
 	pte_t *ptep;
 	u32 ptep_level = 0;
@@ -290,7 +290,7 @@ bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
 	struct kvm_gstage gstage;
 
 	if (!kvm->arch.pgd)
-		return false;
+		return 0;
 
 	WARN_ON(size != PAGE_SIZE && size != PMD_SIZE && size != PUD_SIZE);
 
@@ -300,7 +300,7 @@ bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
 	gstage.pgd = kvm->arch.pgd;
 	if (!kvm_riscv_gstage_get_leaf(&gstage, range->start << PAGE_SHIFT,
 				       &ptep, &ptep_level))
-		return false;
+		return 0;
 
 	return pte_young(ptep_get(ptep));
 }
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 029f2f272ffc..1b180279aacd 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -1810,7 +1810,7 @@ static bool kvm_may_have_shadow_mmu_sptes(struct kvm *kvm)
 	return !tdp_mmu_enabled || READ_ONCE(kvm->arch.indirect_shadow_pages);
 }
 
-bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
+int kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
 {
 	bool young = false;
 
@@ -1823,7 +1823,7 @@ bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
 	return young;
 }
 
-bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
+int kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
 {
 	bool young = false;
 
@@ -7962,8 +7962,8 @@ static void hugepage_set_mixed(struct kvm_memory_slot *slot, gfn_t gfn,
 	lpage_info_slot(gfn, slot, level)->disallow_lpage |= KVM_LPAGE_MIXED_FLAG;
 }
 
-bool kvm_arch_pre_set_memory_attributes(struct kvm *kvm,
-					struct kvm_gfn_range *range)
+int kvm_arch_pre_set_memory_attributes(struct kvm *kvm,
+				       struct kvm_gfn_range *range)
 {
 	struct kvm_memory_slot *slot = range->slot;
 	int level;
@@ -7980,10 +7980,10 @@ bool kvm_arch_pre_set_memory_attributes(struct kvm *kvm,
 	 * a hugepage can be used for affected ranges.
 	 */
 	if (WARN_ON_ONCE(!kvm_arch_has_private_mem(kvm)))
-		return false;
+		return 0;
 
 	if (WARN_ON_ONCE(range->end <= range->start))
-		return false;
+		return 0;
 
 	/*
 	 * If the head and tail pages of the range currently allow a hugepage,
@@ -8042,8 +8042,8 @@ static bool hugepage_has_attrs(struct kvm *kvm, struct kvm_memory_slot *slot,
 	return true;
 }
 
-bool kvm_arch_post_set_memory_attributes(struct kvm *kvm,
-					 struct kvm_gfn_range *range)
+int kvm_arch_post_set_memory_attributes(struct kvm *kvm,
+					struct kvm_gfn_range *range)
 {
 	unsigned long attrs = range->arg.attributes;
 	struct kvm_memory_slot *slot = range->slot;
@@ -8059,7 +8059,7 @@ bool kvm_arch_post_set_memory_attributes(struct kvm *kvm,
 	 * SHARED may now allow hugepages.
 	 */
 	if (WARN_ON_ONCE(!kvm_arch_has_private_mem(kvm)))
-		return false;
+		return 0;
 
 	/*
 	 * The sequence matters here: upper levels consume the result of lower
@@ -8106,7 +8106,7 @@ bool kvm_arch_post_set_memory_attributes(struct kvm *kvm,
 				hugepage_set_mixed(slot, gfn, level);
 		}
 	}
-	return false;
+	return 0;
 }
 
 void kvm_mmu_init_memslot_memory_attributes(struct kvm *kvm,
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index e563bb22c481..6f3d29db0505 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -273,8 +273,8 @@ struct kvm_gfn_range {
 	bool lockless;
 };
 bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range);
-bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
-bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
+int kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
+int kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
 int kvm_split_cross_boundary_leafs(struct kvm *kvm, struct kvm_gfn_range *range,
 				   bool shared);
 #endif
@@ -734,10 +734,10 @@ static inline bool kvm_arch_has_private_mem(struct kvm *kvm)
 extern bool vm_memory_attributes;
 bool kvm_range_has_vm_memory_attributes(struct kvm *kvm, gfn_t start, gfn_t end,
 				     unsigned long mask, unsigned long attrs);
-bool kvm_arch_pre_set_memory_attributes(struct kvm *kvm,
+int kvm_arch_pre_set_memory_attributes(struct kvm *kvm,
+				       struct kvm_gfn_range *range);
+int kvm_arch_post_set_memory_attributes(struct kvm *kvm,
 					struct kvm_gfn_range *range);
-bool kvm_arch_post_set_memory_attributes(struct kvm *kvm,
-					 struct kvm_gfn_range *range);
 #else
 #define vm_memory_attributes false
 #endif /* CONFIG_KVM_VM_MEMORY_ATTRIBUTES */
@@ -1568,7 +1568,7 @@ void *kvm_mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc);
 void kvm_mmu_invalidate_begin(struct kvm *kvm);
 void kvm_mmu_invalidate_range_add(struct kvm *kvm, gfn_t start, gfn_t end);
 void kvm_mmu_invalidate_end(struct kvm *kvm);
-bool kvm_mmu_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range);
+int kvm_mmu_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range);
 
 long kvm_arch_dev_ioctl(struct file *filp,
 			unsigned int ioctl, unsigned long arg);
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index feeef7747099..471f798dba2d 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -517,7 +517,7 @@ static inline struct kvm *mmu_notifier_to_kvm(struct mmu_notifier *mn)
 	return container_of(mn, struct kvm, mmu_notifier);
 }
 
-typedef bool (*gfn_handler_t)(struct kvm *kvm, struct kvm_gfn_range *range);
+typedef int (*gfn_handler_t)(struct kvm *kvm, struct kvm_gfn_range *range);
 
 typedef void (*on_lock_fn_t)(struct kvm *kvm);
 
@@ -601,6 +601,7 @@ static __always_inline kvm_mn_ret_t kvm_handle_hva_range(struct kvm *kvm,
 		kvm_for_each_memslot_in_hva_range(node, slots,
 						  range->start, range->end - 1) {
 			unsigned long hva_start, hva_end;
+			int ret;
 
 			slot = container_of(node, struct kvm_memory_slot, hva_node[slots->node_idx]);
 			hva_start = max_t(unsigned long, range->start, slot->userspace_addr);
@@ -641,7 +642,9 @@ static __always_inline kvm_mn_ret_t kvm_handle_hva_range(struct kvm *kvm,
 						goto mmu_unlock;
 				}
 			}
-			r.ret |= range->handler(kvm, &gfn_range);
+			ret = range->handler(kvm, &gfn_range);
+			WARN_ON_ONCE(ret < 0);
+			r.ret |= ret;
 		}
 	}
 
@@ -727,7 +730,7 @@ void kvm_mmu_invalidate_range_add(struct kvm *kvm, gfn_t start, gfn_t end)
 	}
 }
 
-bool kvm_mmu_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
+int kvm_mmu_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
 {
 	kvm_mmu_invalidate_range_add(kvm, range->start, range->end);
 	return kvm_unmap_gfn_range(kvm, range);
@@ -2507,7 +2510,8 @@ static __always_inline void kvm_handle_gfn_range(struct kvm *kvm,
 	struct kvm_memslots *slots;
 	struct kvm_memslot_iter iter;
 	bool found_memslot = false;
-	bool ret = false;
+	bool flush = false;
+	int ret = 0;
 	int i;
 
 	gfn_range.arg = range->arg;
@@ -2540,19 +2544,23 @@ static __always_inline void kvm_handle_gfn_range(struct kvm *kvm,
 					range->on_lock(kvm);
 			}
 
-			ret |= range->handler(kvm, &gfn_range);
+			ret = range->handler(kvm, &gfn_range);
+			if (ret < 0)
+				goto err;
+			flush |= ret;
 		}
 	}
 
-	if (range->flush_on_ret && ret)
+err:
+	if (range->flush_on_ret && flush)
 		kvm_flush_remote_tlbs(kvm);
 
 	if (found_memslot)
 		KVM_MMU_UNLOCK(kvm);
 }
 
-static bool kvm_pre_set_memory_attributes(struct kvm *kvm,
-					  struct kvm_gfn_range *range)
+static int kvm_pre_set_memory_attributes(struct kvm *kvm,
+					 struct kvm_gfn_range *range)
 {
 	/*
 	 * Unconditionally add the range to the invalidation set, regardless of
-- 
2.43.2


^ permalink raw reply related	[flat|nested] 127+ messages in thread

* [PATCH v3 15/24] KVM: x86: Split cross-boundary mirror leafs for KVM_SET_MEMORY_ATTRIBUTES
  2026-01-06 10:16 [PATCH v3 00/24] KVM: TDX huge page support for private memory Yan Zhao
                   ` (13 preceding siblings ...)
  2026-01-06 10:22 ` [PATCH v3 14/24] KVM: Change the return type of gfn_handler_t() from bool to int Yan Zhao
@ 2026-01-06 10:22 ` Yan Zhao
  2026-01-06 10:22 ` [PATCH v3 16/24] KVM: guest_memfd: Split for punch hole and private-to-shared conversion Yan Zhao
                   ` (10 subsequent siblings)
  25 siblings, 0 replies; 127+ messages in thread
From: Yan Zhao @ 2026-01-06 10:22 UTC (permalink / raw)
  To: pbonzini, seanjc
  Cc: linux-kernel, kvm, x86, rick.p.edgecombe, dave.hansen, kas, tabba,
	ackerleytng, michael.roth, david, vannapurve, sagis, vbabka,
	thomas.lendacky, nik.borisov, pgonda, fan.du, jun.miao,
	francescolavra.fl, jgross, ira.weiny, isaku.yamahata, xiaoyao.li,
	kai.huang, binbin.wu, chao.p.peng, chao.gao, yan.y.zhao

In TDX, private page tables require precise zapping because faulting back
the zapped mappings necessitates the guest's re-acceptance. Therefore,
before performing a zap for a private-to-shared conversion, split any huge
leaf entry that crosses the boundary of the GFN range to be zapped, rather
than zapping the whole leaf, so that GFNs outside the conversion range are
not affected.

Invoke kvm_split_cross_boundary_leafs() in
kvm_arch_pre_set_memory_attributes() to split the huge leafs that cross
GFN range boundary before calling kvm_unmap_gfn_range() to zap the GFN
range that will be converted to shared. Only update flush status if zaps
are performed.

Unlike kvm_unmap_gfn_range(), which cannot fail,
kvm_split_cross_boundary_leafs() may fail due to memory allocation for
splitting. Update kvm_handle_gfn_range() to propagate the error back to
kvm_vm_set_mem_attributes(), which can then fail the ioctl
KVM_SET_MEMORY_ATTRIBUTES.
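
The "cross-boundary" condition — a huge leaf that the conversion range
covers only partially — can be sketched with a small predicate.
leaf_needs_split() is a hypothetical helper for illustration; the real
decision lives inside kvm_split_cross_boundary_leafs():

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t gfn_t;

/* A leaf mapping [base, base + npages) must be split if the zap range
 * [start, end) overlaps it but does not cover it entirely. */
static int leaf_needs_split(gfn_t base, gfn_t npages, gfn_t start, gfn_t end)
{
	gfn_t leaf_end = base + npages;

	if (end <= base || start >= leaf_end)
		return 0;			/* no overlap: leaf untouched */
	return start > base || end < leaf_end;	/* partial overlap: split */
}
```

For a 2MB leaf (512 pages) at gfn 0, zapping [0, 512) covers the whole leaf and needs no split, while zapping [0, 256) or [256, 1024) clips the leaf and requires splitting first so the untouched half keeps its precise private mapping.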

The downside of the current implementation is that although
kvm_split_cross_boundary_leafs() is invoked before kvm_unmap_gfn_range()
for each GFN range, the entire conversion range may consist of several GFN
ranges. If an out-of-memory error occurs while splitting one GFN range,
earlier GFN ranges may already have been successfully split and zapped,
even though their page attributes remain unchanged due to the splitting
failure.

If necessary, a follow-up patch can divide the single invocation of
"kvm_handle_gfn_range(kvm, &pre_set_range)" into two, e.g.:

kvm_handle_gfn_range(kvm, &pre_set_range_prepare_and_split)
kvm_handle_gfn_range(kvm, &pre_set_range_unmap)

Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
---
v3:
- Do not return flush status from kvm_split_cross_boundary_leafs(), so
  TLB is flushed only if zaps are performed.

RFC v2:
- update kvm_split_boundary_leafs() to kvm_split_cross_boundary_leafs() and
  invoke it only for private-to-shared conversion.

RFC v1:
- new patch.
---
 arch/x86/kvm/mmu/mmu.c | 10 ++++++++--
 virt/kvm/kvm_main.c    | 13 +++++++++----
 2 files changed, 17 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 1b180279aacd..35a6e37bfc68 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -8015,10 +8015,16 @@ int kvm_arch_pre_set_memory_attributes(struct kvm *kvm,
 	}
 
 	/* Unmap the old attribute page. */
-	if (range->arg.attributes & KVM_MEMORY_ATTRIBUTE_PRIVATE)
+	if (range->arg.attributes & KVM_MEMORY_ATTRIBUTE_PRIVATE) {
 		range->attr_filter = KVM_FILTER_SHARED;
-	else
+	} else {
+		int ret;
+
 		range->attr_filter = KVM_FILTER_PRIVATE;
+		ret = kvm_split_cross_boundary_leafs(kvm, range, false);
+		if (ret)
+			return ret;
+	}
 
 	return kvm_unmap_gfn_range(kvm, range);
 }
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 471f798dba2d..f3b0d7f8dcfd 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2502,8 +2502,8 @@ bool kvm_range_has_vm_memory_attributes(struct kvm *kvm, gfn_t start, gfn_t end,
 	return true;
 }
 
-static __always_inline void kvm_handle_gfn_range(struct kvm *kvm,
-						 struct kvm_mmu_notifier_range *range)
+static __always_inline int kvm_handle_gfn_range(struct kvm *kvm,
+						struct kvm_mmu_notifier_range *range)
 {
 	struct kvm_gfn_range gfn_range;
 	struct kvm_memory_slot *slot;
@@ -2557,6 +2557,8 @@ static __always_inline void kvm_handle_gfn_range(struct kvm *kvm,
 
 	if (found_memslot)
 		KVM_MMU_UNLOCK(kvm);
+
+	return ret < 0 ? ret : 0;
 }
 
 static int kvm_pre_set_memory_attributes(struct kvm *kvm,
@@ -2625,7 +2627,9 @@ static int kvm_vm_set_mem_attributes(struct kvm *kvm, gfn_t start, gfn_t end,
 		cond_resched();
 	}
 
-	kvm_handle_gfn_range(kvm, &pre_set_range);
+	r = kvm_handle_gfn_range(kvm, &pre_set_range);
+	if (r)
+		goto out_unlock;
 
 	for (i = start; i < end; i++) {
 		r = xa_err(xa_store(&kvm->mem_attr_array, i, entry,
@@ -2634,7 +2638,8 @@ static int kvm_vm_set_mem_attributes(struct kvm *kvm, gfn_t start, gfn_t end,
 		cond_resched();
 	}
 
-	kvm_handle_gfn_range(kvm, &post_set_range);
+	r = kvm_handle_gfn_range(kvm, &post_set_range);
+	KVM_BUG_ON(r, kvm);
 
 out_unlock:
 	mutex_unlock(&kvm->slots_lock);
-- 
2.43.2


^ permalink raw reply related	[flat|nested] 127+ messages in thread

* [PATCH v3 16/24] KVM: guest_memfd: Split for punch hole and private-to-shared conversion
  2026-01-06 10:16 [PATCH v3 00/24] KVM: TDX huge page support for private memory Yan Zhao
                   ` (14 preceding siblings ...)
  2026-01-06 10:22 ` [PATCH v3 15/24] KVM: x86: Split cross-boundary mirror leafs for KVM_SET_MEMORY_ATTRIBUTES Yan Zhao
@ 2026-01-06 10:22 ` Yan Zhao
  2026-01-28 22:39   ` Sean Christopherson
  2026-01-06 10:23 ` [PATCH v3 17/24] KVM: TDX: Get/Put DPAMT page pair only when mapping size is 4KB Yan Zhao
                   ` (9 subsequent siblings)
  25 siblings, 1 reply; 127+ messages in thread
From: Yan Zhao @ 2026-01-06 10:22 UTC (permalink / raw)
  To: pbonzini, seanjc
  Cc: linux-kernel, kvm, x86, rick.p.edgecombe, dave.hansen, kas, tabba,
	ackerleytng, michael.roth, david, vannapurve, sagis, vbabka,
	thomas.lendacky, nik.borisov, pgonda, fan.du, jun.miao,
	francescolavra.fl, jgross, ira.weiny, isaku.yamahata, xiaoyao.li,
	kai.huang, binbin.wu, chao.p.peng, chao.gao, yan.y.zhao

In TDX, private page tables require precise zapping because faulting back
the zapped mappings necessitates guest re-acceptance. Therefore, before
performing a zap for hole punching and private-to-shared conversions, huge
leaves that cross the boundary of the zapping GFN range in the mirror page
table must be split.

Splitting may fail, usually due to memory allocation failure. If this
happens, hole punching and private-to-shared conversion should bail out
early and return an error to userspace.

Splitting is not necessary for zapping shared mappings or zapping in
kvm_gmem_release()/kvm_gmem_error_folio(). The penalty of zapping more
shared mappings than necessary is minimal. All mappings are zapped in
kvm_gmem_release(). kvm_gmem_error_folio() zaps the entire folio range, and
KVM's basic assumption is that a huge mapping must have a single backend
folio.
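
The clamping of the punched file range to each binding's GFN range, as done
by __kvm_gmem_split_private() in this patch, can be sketched as follows;
clamp_to_slot() is a hypothetical helper that mirrors the patch's
arithmetic on slot->base_gfn and slot->gmem.pgoff:

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t gfn_t;
typedef uint64_t pgoff_t;

static uint64_t max_u64(uint64_t a, uint64_t b) { return a > b ? a : b; }
static uint64_t min_u64(uint64_t a, uint64_t b) { return a < b ? a : b; }

/* Clamp a file page range [start, end) to one binding: the slot maps file
 * pages [pgoff, pgoff + npages) starting at base_gfn, so the affected GFNs
 * are base_gfn + (clamped file offset - pgoff). */
static void clamp_to_slot(gfn_t base_gfn, pgoff_t pgoff, uint64_t npages,
			  pgoff_t start, pgoff_t end,
			  gfn_t *gfn_start, gfn_t *gfn_end)
{
	*gfn_start = base_gfn + max_u64(pgoff, start) - pgoff;
	*gfn_end   = base_gfn + min_u64(pgoff + npages, end) - pgoff;
}
```

For a slot bound at file offset 0x100 with base_gfn 0x1000 and 0x200 pages, punching file pages [0x80, 0x180) touches only GFNs [0x1000, 0x1080), the portion of the hole that actually lands inside the binding.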

Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
---
v3:
- Rebased to [2].
- Do not flush TLB for kvm_split_cross_boundary_leafs(), i.e., only flush
  TLB if zaps are performed.

[2] https://github.com/googleprodkernel/linux-cc/tree/wip-gmem-conversions-hugetlb-restructuring-12-08-25

RFC v2:
- Rebased to [1]. As changes in this patch are gmem specific, they may need
  to be updated if the implementation in [1] changes.
- Update kvm_split_boundary_leafs() to kvm_split_cross_boundary_leafs() and
  invoke it before kvm_gmem_punch_hole() and private-to-shared conversion.

[1] https://lore.kernel.org/all/cover.1747264138.git.ackerleytng@google.com/

RFC v1:
- new patch.
---
 virt/kvm/guest_memfd.c | 67 ++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 67 insertions(+)

diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index 03613b791728..8e7fbed57a20 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -486,6 +486,55 @@ static int merge_truncate_range(struct inode *inode, pgoff_t start,
 	return ret;
 }
 
+static int __kvm_gmem_split_private(struct gmem_file *f, pgoff_t start, pgoff_t end)
+{
+	enum kvm_gfn_range_filter attr_filter = KVM_FILTER_PRIVATE;
+
+	bool locked = false;
+	struct kvm_memory_slot *slot;
+	struct kvm *kvm = f->kvm;
+	unsigned long index;
+	int ret = 0;
+
+	xa_for_each_range(&f->bindings, index, slot, start, end - 1) {
+		pgoff_t pgoff = slot->gmem.pgoff;
+		struct kvm_gfn_range gfn_range = {
+			.start = slot->base_gfn + max(pgoff, start) - pgoff,
+			.end = slot->base_gfn + min(pgoff + slot->npages, end) - pgoff,
+			.slot = slot,
+			.may_block = true,
+			.attr_filter = attr_filter,
+		};
+
+		if (!locked) {
+			KVM_MMU_LOCK(kvm);
+			locked = true;
+		}
+
+		ret = kvm_split_cross_boundary_leafs(kvm, &gfn_range, false);
+		if (ret)
+			break;
+	}
+
+	if (locked)
+		KVM_MMU_UNLOCK(kvm);
+
+	return ret;
+}
+
+static int kvm_gmem_split_private(struct inode *inode, pgoff_t start, pgoff_t end)
+{
+	struct gmem_file *f;
+	int r = 0;
+
+	kvm_gmem_for_each_file(f, inode->i_mapping) {
+		r = __kvm_gmem_split_private(f, start, end);
+		if (r)
+			break;
+	}
+	return r;
+}
+
 static long kvm_gmem_punch_hole(struct inode *inode, loff_t offset, loff_t len)
 {
 	pgoff_t start = offset >> PAGE_SHIFT;
@@ -499,6 +548,13 @@ static long kvm_gmem_punch_hole(struct inode *inode, loff_t offset, loff_t len)
 	filemap_invalidate_lock(inode->i_mapping);
 
 	kvm_gmem_invalidate_begin(inode, start, end);
+
+	ret = kvm_gmem_split_private(inode, start, end);
+	if (ret) {
+		kvm_gmem_invalidate_end(inode, start, end);
+		filemap_invalidate_unlock(inode->i_mapping);
+		return ret;
+	}
 	kvm_gmem_zap(inode, start, end);
 
 	ret = merge_truncate_range(inode, start, len >> PAGE_SHIFT, true);
@@ -907,6 +963,17 @@ static int kvm_gmem_convert(struct inode *inode, pgoff_t start,
 	invalidate_start = kvm_gmem_compute_invalidate_start(inode, start);
 	invalidate_end = kvm_gmem_compute_invalidate_end(inode, end);
 	kvm_gmem_invalidate_begin(inode, invalidate_start, invalidate_end);
+
+	if (!to_private) {
+		r = kvm_gmem_split_private(inode, start, end);
+		if (r) {
+			*err_index = start;
+			mas_destroy(&mas);
+			kvm_gmem_invalidate_end(inode, invalidate_start, invalidate_end);
+			return r;
+		}
+	}
+
 	kvm_gmem_zap(inode, start, end);
 	kvm_gmem_invalidate_end(inode, invalidate_start, invalidate_end);
 
-- 
2.43.2


^ permalink raw reply related	[flat|nested] 127+ messages in thread

* [PATCH v3 17/24] KVM: TDX: Get/Put DPAMT page pair only when mapping size is 4KB
  2026-01-06 10:16 [PATCH v3 00/24] KVM: TDX huge page support for private memory Yan Zhao
                   ` (15 preceding siblings ...)
  2026-01-06 10:22 ` [PATCH v3 16/24] KVM: guest_memfd: Split for punch hole and private-to-shared conversion Yan Zhao
@ 2026-01-06 10:23 ` Yan Zhao
  2026-01-06 10:23 ` [PATCH v3 18/24] x86/virt/tdx: Add loud warning when tdx_pamt_put() fails Yan Zhao
                   ` (8 subsequent siblings)
  25 siblings, 0 replies; 127+ messages in thread
From: Yan Zhao @ 2026-01-06 10:23 UTC (permalink / raw)
  To: pbonzini, seanjc
  Cc: linux-kernel, kvm, x86, rick.p.edgecombe, dave.hansen, kas, tabba,
	ackerleytng, michael.roth, david, vannapurve, sagis, vbabka,
	thomas.lendacky, nik.borisov, pgonda, fan.du, jun.miao,
	francescolavra.fl, jgross, ira.weiny, isaku.yamahata, xiaoyao.li,
	kai.huang, binbin.wu, chao.p.peng, chao.gao, yan.y.zhao

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

Invoke tdx_pamt_{get/put}() to add/remove Dynamic PAMT page pair for guest
private memory only when the S-EPT mapping size is 4KB.

When the mapping size is greater than 4KB, static PAMT pages are used. No
need to install/uninstall extra PAMT pages dynamically.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
[Yan: Move level checking to callers of tdx_pamt_{get/put}()]
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
---
v3:
- new patch

  Checking for 4KB level was previously done inside tdx_pamt_{get/put}() in
  DPAMT v2 [1].

  Move the check to the callers of tdx_pamt_{get/put}() in KVM to avoid
  introducing an extra "level" parameter to tdx_pamt_{get/put}(). Also, the
  callers that could have level > 4KB are limited in KVM, i.e., only
  tdx_sept_{set/remove}_private_spte().

[1] https://lore.kernel.org/all/20250609191340.2051741-5-kirill.shutemov@linux.intel.com
---
 arch/x86/kvm/vmx/tdx.c | 14 +++++++++-----
 1 file changed, 9 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 712aaa3d45b7..c1dc1aaae49d 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1722,9 +1722,11 @@ static int tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn,
 	WARN_ON_ONCE(!is_shadow_present_pte(mirror_spte) ||
 		     (mirror_spte & VMX_EPT_RWX_MASK) != VMX_EPT_RWX_MASK);
 
-	ret = tdx_pamt_get(page, &tdx->prealloc);
-	if (ret)
-		return ret;
+	if (level == PG_LEVEL_4K) {
+		ret = tdx_pamt_get(page, &tdx->prealloc);
+		if (ret)
+			return ret;
+	}
 
 	/*
 	 * Ensure pre_fault_allowed is read by kvm_arch_vcpu_pre_fault_memory()
@@ -1743,7 +1745,7 @@ static int tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn,
 	else
 		ret = tdx_mem_page_add(kvm, gfn, level, pfn);
 
-	if (ret)
+	if (ret && level == PG_LEVEL_4K)
 		tdx_pamt_put(page);
 
 	return ret;
@@ -1911,7 +1913,9 @@ static void tdx_sept_remove_private_spte(struct kvm *kvm, gfn_t gfn,
 
 	tdx_quirk_reset_folio(folio, folio_page_idx(folio, page),
 			      KVM_PAGES_PER_HPAGE(level));
-	tdx_pamt_put(page);
+
+	if (level == PG_LEVEL_4K)
+		tdx_pamt_put(page);
 }
 
 /*
-- 
2.43.2


^ permalink raw reply related	[flat|nested] 127+ messages in thread

* [PATCH v3 18/24] x86/virt/tdx: Add loud warning when tdx_pamt_put() fails.
  2026-01-06 10:16 [PATCH v3 00/24] KVM: TDX huge page support for private memory Yan Zhao
                   ` (16 preceding siblings ...)
  2026-01-06 10:23 ` [PATCH v3 17/24] KVM: TDX: Get/Put DPAMT page pair only when mapping size is 4KB Yan Zhao
@ 2026-01-06 10:23 ` Yan Zhao
  2026-01-06 10:23 ` [PATCH v3 19/24] KVM: x86: Introduce per-VM external cache for splitting Yan Zhao
                   ` (7 subsequent siblings)
  25 siblings, 0 replies; 127+ messages in thread
From: Yan Zhao @ 2026-01-06 10:23 UTC (permalink / raw)
  To: pbonzini, seanjc
  Cc: linux-kernel, kvm, x86, rick.p.edgecombe, dave.hansen, kas, tabba,
	ackerleytng, michael.roth, david, vannapurve, sagis, vbabka,
	thomas.lendacky, nik.borisov, pgonda, fan.du, jun.miao,
	francescolavra.fl, jgross, ira.weiny, isaku.yamahata, xiaoyao.li,
	kai.huang, binbin.wu, chao.p.peng, chao.gao, yan.y.zhao

tdx_pamt_put() does not return any error to its caller when SEAMCALL
TDH_PHYMEM_PAMT_REMOVE fails. Though pamt_refcount for the failed 2MB
physical range is increased (so the DPAMT pages stay added), the 2MB
physical range can then only be mapped at 4KB level, i.e., any later
SEAMCALL TDH_MEM_PAGE_AUG at 2MB level on that range will fail.

Since tdx_pamt_put() only fails when there is a bug in the host kernel or
in the TDX module, simply add a loud warning to aid debugging after such an
error occurs.

Link: https://lore.kernel.org/all/67d55b24ef1a80af615c3672e8436e0ac32e8efa.camel@intel.com
Suggested-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
---
v3:
- new patch
---
 arch/x86/virt/vmx/tdx/tdx.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index c12665389b67..76963c563906 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -2348,8 +2348,7 @@ void tdx_pamt_put(struct page *page)
 			 */
 			atomic_inc(pamt_refcount);
 
-			pr_err("TDH_PHYMEM_PAMT_REMOVE failed: %#llx\n", tdx_status);
-
+			WARN_ONCE(1, "TDH_PHYMEM_PAMT_REMOVE failed: %#llx\n", tdx_status);
 			/*
 			 * Don't free pamt_pa_array as it could hold garbage
 			 * when tdh_phymem_pamt_remove() fails.
-- 
2.43.2


^ permalink raw reply related	[flat|nested] 127+ messages in thread

* [PATCH v3 19/24] KVM: x86: Introduce per-VM external cache for splitting
  2026-01-06 10:16 [PATCH v3 00/24] KVM: TDX huge page support for private memory Yan Zhao
                   ` (17 preceding siblings ...)
  2026-01-06 10:23 ` [PATCH v3 18/24] x86/virt/tdx: Add loud warning when tdx_pamt_put() fails Yan Zhao
@ 2026-01-06 10:23 ` Yan Zhao
  2026-01-21  1:54   ` Huang, Kai
  2026-01-06 10:23 ` [PATCH v3 20/24] KVM: TDX: Implement per-VM external cache for splitting in TDX Yan Zhao
                   ` (6 subsequent siblings)
  25 siblings, 1 reply; 127+ messages in thread
From: Yan Zhao @ 2026-01-06 10:23 UTC (permalink / raw)
  To: pbonzini, seanjc
  Cc: linux-kernel, kvm, x86, rick.p.edgecombe, dave.hansen, kas, tabba,
	ackerleytng, michael.roth, david, vannapurve, sagis, vbabka,
	thomas.lendacky, nik.borisov, pgonda, fan.du, jun.miao,
	francescolavra.fl, jgross, ira.weiny, isaku.yamahata, xiaoyao.li,
	kai.huang, binbin.wu, chao.p.peng, chao.gao, yan.y.zhao

Introduce a per-VM external cache for splitting the external page table by
adding KVM x86 ops for the cache "topup", "free", and "need topup"
operations.

Invoke the "topup" and "need topup" KVM x86 ops for the per-VM external
split cache when splitting the mirror root in
tdp_mmu_split_huge_pages_root(), where there is no per-vCPU context.

Invoke the "free" KVM x86 op to destroy the per-VM external split cache
when KVM frees memory caches.

This per-VM external split cache is only used when per-vCPU context is not
available; when it is, the fault path uses the per-vCPU external fault
cache instead.

The per-VM external split cache is protected by both kvm->mmu_lock and a
cache lock inside the vendor implementation to ensure that there are enough
pages in the cache for one split:

- Dequeuing of the per-VM external split cache is in
  kvm_x86_ops.split_external_spte() under mmu_lock.

- Yield the traversal in tdp_mmu_split_huge_pages_root() after topup of
  the per-VM cache, so that need_topup() is checked again after
  re-acquiring the mmu_lock.

- Vendor implementations of the per-VM external split cache provide a
  cache lock to protect the enqueue/dequeue of pages into/from the cache.

Here is the sequence showing how a sufficient number of pages in the cache
is guaranteed.

a. with write mmu_lock:

   1. write_lock(&kvm->mmu_lock)
      kvm_x86_ops.need_topup()

   2. write_unlock(&kvm->mmu_lock)
      kvm_x86_ops.topup() --> in vendor:
      {
        allocate pages
        get cache lock
        enqueue pages in cache
        put cache lock
      }

   3. write_lock(&kvm->mmu_lock)
      kvm_x86_ops.need_topup() (goto 2 if topup is necessary)  (*)

      kvm_x86_ops.split_external_spte() --> in vendor:
      {
         get cache lock
         dequeue pages in cache
         put cache lock
      }
      write_unlock(&kvm->mmu_lock)

b. with read mmu_lock,

   1. read_lock(&kvm->mmu_lock)
      kvm_x86_ops.need_topup()

   2. read_unlock(&kvm->mmu_lock)
      kvm_x86_ops.topup() --> in vendor:
      {
        allocate pages
        get cache lock
        enqueue pages in cache
        put cache lock
      }

   3. read_lock(&kvm->mmu_lock)
      kvm_x86_ops.need_topup() (goto 2 if topup is necessary)

      kvm_x86_ops.split_external_spte() --> in vendor:
      {
         get cache lock
         kvm_x86_ops.need_topup() (return retry if topup is necessary) (**)
         dequeue pages in cache
         put cache lock
      }

      read_unlock(&kvm->mmu_lock)

Due to (*) and (**) in step 3, enough pages for a split are guaranteed.
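
A minimal single-threaded sketch of the check/topup/re-check/dequeue
protocol above (the cache struct and helper names are illustrative; the
real code guards the cache with a vendor-side spinlock and runs the (**)
re-check under that lock):

```c
#include <assert.h>
#include <stdbool.h>

struct split_cache { int cnt; };	/* stands in for the vendor cache */

static bool need_topup(struct split_cache *c, int min)
{
	return c->cnt < min;
}

/* Model of topup: allocate pages outside mmu_lock, enqueue under the cache lock. */
static void topup(struct split_cache *c, int min)
{
	while (c->cnt < min)
		c->cnt++;
}

/*
 * Model of dequeue for one split. The (**) re-check: under read mmu_lock
 * another user may have drained the cache between need_topup() and the
 * dequeue, so return -1 to signal a retry instead of underflowing.
 */
static int dequeue_for_split(struct split_cache *c, int min)
{
	if (need_topup(c, min))
		return -1;
	c->cnt -= min;
	return 0;
}
```

The point of the model: a successful dequeue only happens when the re-check
under the cache lock still sees enough pages, mirroring step 3 above.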

Co-developed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
---
v3:
- Introduce x86 ops to manage the cache.
---
 arch/x86/include/asm/kvm-x86-ops.h |  3 ++
 arch/x86/include/asm/kvm_host.h    | 17 +++++++
 arch/x86/kvm/mmu/mmu.c             |  2 +
 arch/x86/kvm/mmu/tdp_mmu.c         | 71 +++++++++++++++++++++++++++++-
 4 files changed, 91 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
index 84fa8689b45c..307edc51ad8d 100644
--- a/arch/x86/include/asm/kvm-x86-ops.h
+++ b/arch/x86/include/asm/kvm-x86-ops.h
@@ -102,6 +102,9 @@ KVM_X86_OP_OPTIONAL(split_external_spte)
 KVM_X86_OP_OPTIONAL(alloc_external_fault_cache)
 KVM_X86_OP_OPTIONAL(topup_external_fault_cache)
 KVM_X86_OP_OPTIONAL(free_external_fault_cache)
+KVM_X86_OP_OPTIONAL(topup_external_per_vm_split_cache)
+KVM_X86_OP_OPTIONAL(free_external_per_vm_split_cache)
+KVM_X86_OP_OPTIONAL(need_topup_external_per_vm_split_cache)
 KVM_X86_OP(has_wbinvd_exit)
 KVM_X86_OP(get_l2_tsc_offset)
 KVM_X86_OP(get_l2_tsc_multiplier)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 315ffb23e9d8..6122801f334b 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1862,6 +1862,23 @@ struct kvm_x86_ops {
 	/* Free in external page fault cache. */
 	void (*free_external_fault_cache)(struct kvm_vcpu *vcpu);
 
+	/*
+	 * Top up extra pages needed in the per-VM cache for splitting external
+	 * page table.
+	 */
+	int (*topup_external_per_vm_split_cache)(struct kvm *kvm,
+						 enum pg_level level);
+
+	/* Free the per-VM cache for splitting external page table. */
+	void (*free_external_per_vm_split_cache)(struct kvm *kvm);
+
+	/*
+	 * Check if it's necessary to top up the per-VM cache for splitting
+	 * external page table.
+	 */
+	bool (*need_topup_external_per_vm_split_cache)(struct kvm *kvm,
+						       enum pg_level level);
+
 	bool (*has_wbinvd_exit)(void);
 
 	u64 (*get_l2_tsc_offset)(struct kvm_vcpu *vcpu);
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 35a6e37bfc68..3d568512201d 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -6924,6 +6924,8 @@ static void mmu_free_vm_memory_caches(struct kvm *kvm)
 	kvm_mmu_free_memory_cache(&kvm->arch.split_desc_cache);
 	kvm_mmu_free_memory_cache(&kvm->arch.split_page_header_cache);
 	kvm_mmu_free_memory_cache(&kvm->arch.split_shadow_page_cache);
+	if (kvm_has_mirrored_tdp(kvm))
+		kvm_x86_call(free_external_per_vm_split_cache)(kvm);
 }
 
 void kvm_mmu_uninit_vm(struct kvm *kvm)
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index b984027343b7..b45d3da683f2 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1606,6 +1606,55 @@ static bool iter_cross_boundary(struct tdp_iter *iter, gfn_t start, gfn_t end)
 		 (iter->gfn + KVM_PAGES_PER_HPAGE(iter->level)) <= end);
 }
 
+/*
+ * Check the per-VM external split cache under write mmu_lock or read mmu_lock
+ * in tdp_mmu_split_huge_pages_root().
+ *
+ * When need_topup_external_split_cache() returns false, the mmu_lock is held
+ * throughout the execution from
+ * (a) need_topup_external_split_cache() to
+ * (b) the cache dequeuing (in tdx_sept_split_private_spte() called by
+ *     tdp_mmu_split_huge_page()).
+ *
+ * - When mmu_lock is held for write, the per-VM external split cache is
+ *   exclusively accessed by a single user. Therefore, the result returned from
+ *   need_topup_external_split_cache() is accurate.
+ *
+ * - When mmu_lock is held for read, the per-VM external split cache can be
+ *   shared among multiple users. Cache dequeuing in
+ *   tdx_sept_split_private_spte() thus needs to re-check the cache page
+ *   count after acquiring its internal split cache lock and return an error
+ *   for retry if the cache page count is not sufficient.
+ */
+static bool need_topup_external_split_cache(struct kvm *kvm, int level)
+{
+	return kvm_x86_call(need_topup_external_per_vm_split_cache)(kvm, level);
+}
+
+static int topup_external_split_cache(struct kvm *kvm, int level, bool shared)
+{
+	int r;
+
+	rcu_read_unlock();
+
+	if (shared)
+		read_unlock(&kvm->mmu_lock);
+	else
+		write_unlock(&kvm->mmu_lock);
+
+	r = kvm_x86_call(topup_external_per_vm_split_cache)(kvm, level);
+
+	if (shared)
+		read_lock(&kvm->mmu_lock);
+	else
+		write_lock(&kvm->mmu_lock);
+
+	if (!r)
+		rcu_read_lock();
+
+	return r;
+}
+
 static int tdp_mmu_split_huge_pages_root(struct kvm *kvm,
 					 struct kvm_mmu_page *root,
 					 gfn_t start, gfn_t end,
@@ -1614,6 +1663,7 @@ static int tdp_mmu_split_huge_pages_root(struct kvm *kvm,
 {
 	struct kvm_mmu_page *sp = NULL;
 	struct tdp_iter iter;
+	int r = 0;
 
 	rcu_read_lock();
 
@@ -1672,6 +1722,21 @@ static int tdp_mmu_split_huge_pages_root(struct kvm *kvm,
 			continue;
 		}
 
+		if (is_mirror_sp(root) &&
+		    need_topup_external_split_cache(kvm, iter.level)) {
+			r = topup_external_split_cache(kvm, iter.level, shared);
+
+			if (r) {
+				trace_kvm_mmu_split_huge_page(iter.gfn,
+							      iter.old_spte,
+							      iter.level, r);
+				goto out;
+			}
+
+			iter.yielded = true;
+			continue;
+		}
+
 		tdp_mmu_init_child_sp(sp, &iter);
 
 		if (tdp_mmu_split_huge_page(kvm, &iter, sp, shared))
@@ -1682,15 +1747,17 @@ static int tdp_mmu_split_huge_pages_root(struct kvm *kvm,
 
 	rcu_read_unlock();
 
+out:
 	/*
 	 * It's possible to exit the loop having never used the last sp if, for
 	 * example, a vCPU doing HugePage NX splitting wins the race and
-	 * installs its own sp in place of the last sp we tried to split.
+	 * installs its own sp in place of the last sp we tried to split, or
+	 * topup_external_split_cache() fails.
 	 */
 	if (sp)
 		tdp_mmu_free_sp(sp);
 
-	return 0;
+	return r;
 }
 
 
-- 
2.43.2


^ permalink raw reply related	[flat|nested] 127+ messages in thread

* [PATCH v3 20/24] KVM: TDX: Implement per-VM external cache for splitting in TDX
  2026-01-06 10:16 [PATCH v3 00/24] KVM: TDX huge page support for private memory Yan Zhao
                   ` (18 preceding siblings ...)
  2026-01-06 10:23 ` [PATCH v3 19/24] KVM: x86: Introduce per-VM external cache for splitting Yan Zhao
@ 2026-01-06 10:23 ` Yan Zhao
  2026-01-06 10:23 ` [PATCH v3 21/24] KVM: TDX: Add/Remove DPAMT pages for the new S-EPT page for splitting Yan Zhao
                   ` (5 subsequent siblings)
  25 siblings, 0 replies; 127+ messages in thread
From: Yan Zhao @ 2026-01-06 10:23 UTC (permalink / raw)
  To: pbonzini, seanjc
  Cc: linux-kernel, kvm, x86, rick.p.edgecombe, dave.hansen, kas, tabba,
	ackerleytng, michael.roth, david, vannapurve, sagis, vbabka,
	thomas.lendacky, nik.borisov, pgonda, fan.du, jun.miao,
	francescolavra.fl, jgross, ira.weiny, isaku.yamahata, xiaoyao.li,
	kai.huang, binbin.wu, chao.p.peng, chao.gao, yan.y.zhao

Implement the KVM x86 ops for per-VM external cache for splitting the
external page table in TDX.

Since the per-VM external cache for splitting the external page table is
intended to be used outside of vCPU threads, i.e., when the per-vCPU
external_fault_cache is not available, introduce a spinlock
prealloc_split_cache_lock in TDX to protect page enqueuing/dequeuing
operations for the per-VM external split cache.

Cache topup in tdx_topup_vm_split_cache() manages page enqueuing with the
help of prealloc_split_cache_lock.

Cache dequeuing will be implemented in tdx_sept_split_private_spte() in
later patches, which will also hold prealloc_split_cache_lock.

Checking the need for topup in tdx_need_topup_vm_split_cache() does not
hold prealloc_split_cache_lock internally. When
tdx_need_topup_vm_split_cache() is invoked under write mmu_lock, there is
no need to additionally acquire prealloc_split_cache_lock; when it is
invoked under read mmu_lock, the check needs to be repeated after acquiring
prealloc_split_cache_lock for cache dequeuing.

Cache free does not hold prealloc_split_cache_lock because it is intended
to be called when there is no contention.
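
The minimum cache-size arithmetic from the patch (two DPAMT page pairs per
2MB split) can be sketched as a standalone function; tdx_dpamt_entry_pages()
is modeled here as a plain parameter, and the names are illustrative:

```c
#include <assert.h>

/*
 * Two DPAMT page pairs are needed per 2MB split: one pair for the new
 * 4KB S-EPT page, and one pair for the demoted guest private memory.
 * Without Dynamic PAMT, static PAMT is used and no cache pages are
 * needed at all.
 */
static int min_split_cache_sz(int dynamic_pamt, int dpamt_entry_pages)
{
	if (!dynamic_pamt)
		return 0;
	return dpamt_entry_pages * 2;
}
```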

Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
---
v3:
- new patch corresponds to DPAMT v4.
---
 arch/x86/kvm/vmx/tdx.c | 61 ++++++++++++++++++++++++++++++++++++++++++
 arch/x86/kvm/vmx/tdx.h |  5 ++++
 2 files changed, 66 insertions(+)

diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index c1dc1aaae49d..40cca273d480 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -671,6 +671,9 @@ int tdx_vm_init(struct kvm *kvm)
 
 	kvm_tdx->state = TD_STATE_UNINITIALIZED;
 
+	INIT_LIST_HEAD(&kvm_tdx->prealloc_split_cache.page_list);
+	spin_lock_init(&kvm_tdx->prealloc_split_cache_lock);
+
 	return 0;
 }
 
@@ -1680,6 +1683,61 @@ static void tdx_free_external_fault_cache(struct kvm_vcpu *vcpu)
 		__free_page(page);
 }
 
+/*
+ * Need to prepare at least 2 pairs of PAMT pages (i.e., 4 PAMT pages) for
+ * splitting an S-EPT PG_LEVEL_2M mapping when Dynamic PAMT is enabled:
+ * - 1 pair for the new 4KB S-EPT page for splitting, which may be dequeued in
+ *   tdx_sept_split_private_spte() when there are no installed PAMT pages for
+ *   the 2MB physical range of the S-EPT page.
+ * - 1 pair for demoting guest private memory from 2MB to 4KB, which will be
+ *   dequeued in tdh_mem_page_demote().
+ */
+static int tdx_min_split_cache_sz(struct kvm *kvm, int level)
+{
+	KVM_BUG_ON(level != PG_LEVEL_2M, kvm);
+
+	if (!tdx_supports_dynamic_pamt(tdx_sysinfo))
+		return 0;
+
+	return tdx_dpamt_entry_pages() * 2;
+}
+
+static int tdx_topup_vm_split_cache(struct kvm *kvm, enum pg_level level)
+{
+	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
+	struct tdx_prealloc *prealloc = &kvm_tdx->prealloc_split_cache;
+	int cnt = tdx_min_split_cache_sz(kvm, level);
+
+	while (READ_ONCE(prealloc->cnt) < cnt) {
+		struct page *page = alloc_page(GFP_KERNEL);
+
+		if (!page)
+			return -ENOMEM;
+
+		spin_lock(&kvm_tdx->prealloc_split_cache_lock);
+		list_add(&page->lru, &prealloc->page_list);
+		prealloc->cnt++;
+		spin_unlock(&kvm_tdx->prealloc_split_cache_lock);
+	}
+
+	return 0;
+}
+
+static bool tdx_need_topup_vm_split_cache(struct kvm *kvm, enum pg_level level)
+{
+	struct tdx_prealloc *prealloc = &to_kvm_tdx(kvm)->prealloc_split_cache;
+
+	return prealloc->cnt < tdx_min_split_cache_sz(kvm, level);
+}
+
+static void tdx_free_vm_split_cache(struct kvm *kvm)
+{
+	struct page *page;
+
+	while ((page = get_tdx_prealloc_page(&to_kvm_tdx(kvm)->prealloc_split_cache)))
+		__free_page(page);
+}
+
 static int tdx_mem_page_aug(struct kvm *kvm, gfn_t gfn,
 			    enum pg_level level, kvm_pfn_t pfn)
 {
@@ -3804,4 +3862,7 @@ void __init tdx_hardware_setup(void)
 	vt_x86_ops.alloc_external_fault_cache = tdx_alloc_external_fault_cache;
 	vt_x86_ops.topup_external_fault_cache = tdx_topup_external_fault_cache;
 	vt_x86_ops.free_external_fault_cache = tdx_free_external_fault_cache;
+	vt_x86_ops.topup_external_per_vm_split_cache = tdx_topup_vm_split_cache;
+	vt_x86_ops.need_topup_external_per_vm_split_cache = tdx_need_topup_vm_split_cache;
+	vt_x86_ops.free_external_per_vm_split_cache = tdx_free_vm_split_cache;
 }
diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h
index 43dd295b7fd6..034e3ddfb679 100644
--- a/arch/x86/kvm/vmx/tdx.h
+++ b/arch/x86/kvm/vmx/tdx.h
@@ -48,6 +48,11 @@ struct kvm_tdx {
 	 * Set/unset is protected with kvm->mmu_lock.
 	 */
 	bool wait_for_sept_zap;
+
+	/* The per-VM cache for splitting S-EPT */
+	struct tdx_prealloc prealloc_split_cache;
+	/* Protect page enqueuing/dequeuing in prealloc_split_cache */
+	spinlock_t prealloc_split_cache_lock;
 };
 
 /* TDX module vCPU states */
-- 
2.43.2


^ permalink raw reply related	[flat|nested] 127+ messages in thread

* [PATCH v3 21/24] KVM: TDX: Add/Remove DPAMT pages for the new S-EPT page for splitting
  2026-01-06 10:16 [PATCH v3 00/24] KVM: TDX huge page support for private memory Yan Zhao
                   ` (19 preceding siblings ...)
  2026-01-06 10:23 ` [PATCH v3 20/24] KVM: TDX: Implement per-VM external cache for splitting in TDX Yan Zhao
@ 2026-01-06 10:23 ` Yan Zhao
  2026-01-06 10:24 ` [PATCH v3 22/24] x86/tdx: Add/Remove DPAMT pages for guest private memory to demote Yan Zhao
                   ` (4 subsequent siblings)
  25 siblings, 0 replies; 127+ messages in thread
From: Yan Zhao @ 2026-01-06 10:23 UTC (permalink / raw)
  To: pbonzini, seanjc
  Cc: linux-kernel, kvm, x86, rick.p.edgecombe, dave.hansen, kas, tabba,
	ackerleytng, michael.roth, david, vannapurve, sagis, vbabka,
	thomas.lendacky, nik.borisov, pgonda, fan.du, jun.miao,
	francescolavra.fl, jgross, ira.weiny, isaku.yamahata, xiaoyao.li,
	kai.huang, binbin.wu, chao.p.peng, chao.gao, yan.y.zhao

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

Splitting a huge S-EPT mapping requires the host to provide a new 4KB page,
which will be added as the S-EPT page to hold the smaller mappings after
the split. Install a Dynamic PAMT page pair for the new S-EPT page before
passing the S-EPT page to tdh_mem_page_demote(); uninstall and free the
Dynamic PAMT page pair when tdh_mem_page_demote() fails.

When Dynamic PAMT is enabled and when there's no installed pair for the 2MB
physical range containing the new S-EPT page, tdx_pamt_get() dequeues a
pair of preallocated pages from the per-VM prealloc_split_cache and
installs them as the Dynamic PAMT page pair. Hold prealloc_split_cache_lock
when dequeuing from the per-VM prealloc_split_cache.

When tdh_mem_page_demote() fails, tdx_pamt_put() uninstalls and frees the
Dynamic PAMT page pair for the new S-EPT page if Dynamic PAMT is enabled
and the new S-EPT page is the last page in the 2MB physical range requiring
the Dynamic PAMT page pair.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Co-developed-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
---
v3:
- Split out as a new patch.
- Add KVM_BUG_ON() after tdx_pamt_get() fails. (Vishal)
---
 arch/x86/kvm/vmx/tdx.c | 11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 40cca273d480..ec47bd799274 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1996,6 +1996,7 @@ static int tdx_sept_split_private_spte(struct kvm *kvm, gfn_t gfn, enum pg_level
 	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
 	gpa_t gpa = gfn_to_gpa(gfn);
 	u64 err, entry, level_state;
+	int ret;
 
 	if (KVM_BUG_ON(kvm_tdx->state != TD_STATE_RUNNABLE ||
 		       level != PG_LEVEL_2M, kvm))
@@ -2014,10 +2015,18 @@ static int tdx_sept_split_private_spte(struct kvm *kvm, gfn_t gfn, enum pg_level
 
 	tdx_track(kvm);
 
+	spin_lock(&kvm_tdx->prealloc_split_cache_lock);
+	ret = tdx_pamt_get(new_sept_page, &kvm_tdx->prealloc_split_cache);
+	spin_unlock(&kvm_tdx->prealloc_split_cache_lock);
+	if (KVM_BUG_ON(ret, kvm))
+		return -EIO;
+
 	err = tdh_do_no_vcpus(tdh_mem_page_demote, kvm, &kvm_tdx->td, gpa,
 			      tdx_level, new_sept_page, &entry, &level_state);
-	if (TDX_BUG_ON_2(err, TDH_MEM_PAGE_DEMOTE, entry, level_state, kvm))
+	if (TDX_BUG_ON_2(err, TDH_MEM_PAGE_DEMOTE, entry, level_state, kvm)) {
+		tdx_pamt_put(new_sept_page);
 		return -EIO;
+	}
 
 	return 0;
 }
-- 
2.43.2


^ permalink raw reply related	[flat|nested] 127+ messages in thread

* [PATCH v3 22/24] x86/tdx: Add/Remove DPAMT pages for guest private memory to demote
  2026-01-06 10:16 [PATCH v3 00/24] KVM: TDX huge page support for private memory Yan Zhao
                   ` (20 preceding siblings ...)
  2026-01-06 10:23 ` [PATCH v3 21/24] KVM: TDX: Add/Remove DPAMT pages for the new S-EPT page for splitting Yan Zhao
@ 2026-01-06 10:24 ` Yan Zhao
  2026-01-19 10:52   ` Huang, Kai
  2026-01-06 10:24 ` [PATCH v3 23/24] x86/tdx: Pass guest memory's PFN info to demote for updating pamt_refcount Yan Zhao
                   ` (3 subsequent siblings)
  25 siblings, 1 reply; 127+ messages in thread
From: Yan Zhao @ 2026-01-06 10:24 UTC (permalink / raw)
  To: pbonzini, seanjc
  Cc: linux-kernel, kvm, x86, rick.p.edgecombe, dave.hansen, kas, tabba,
	ackerleytng, michael.roth, david, vannapurve, sagis, vbabka,
	thomas.lendacky, nik.borisov, pgonda, fan.du, jun.miao,
	francescolavra.fl, jgross, ira.weiny, isaku.yamahata, xiaoyao.li,
	kai.huang, binbin.wu, chao.p.peng, chao.gao, yan.y.zhao

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

When Dynamic PAMT is enabled and a 2MB mapping is split into 512 4KB
mappings, SEAMCALL TDH.MEM.PAGE.DEMOTE takes a Dynamic PAMT page pair in
registers R12 and R13. The Dynamic PAMT page pair is used to store physical
memory metadata for the 2MB of guest private memory after its S-EPT mapping
is successfully split to 4KB.

Pass prealloc_split_cache (the per-VM split cache) to SEAMCALL wrapper
tdh_mem_page_demote() for dequeuing Dynamic PAMT pages from the cache.
Protect the cache dequeuing in KVM with prealloc_split_cache_lock.

Inside wrapper tdh_mem_page_demote(), dequeue the Dynamic PAMT pages into
the guest_memory_pamt_page array and copy the page address to R12 and R13.

Invoke SEAMCALL TDH_MEM_PAGE_DEMOTE using seamcall_saved_ret() to handle
registers above R11.

If SEAMCALL TDH_MEM_PAGE_DEMOTE fails, free the Dynamic PAMT pages, since
the guest private memory is still mapped at 2MB level.

Opportunistically, rename dpamt_args_array_ptr() to
dpamt_args_array_ptr_rdx() for tdh_phymem_pamt_{add/remove} and invoke
dpamt_args_array_ptr_r12() in tdh_mem_page_demote() for populating
registers starting from R12.
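
The offsetof()-based trick of viewing a register struct as an indexable u64
array, as the wrapper does starting at R12, can be illustrated with a
standalone struct; this struct is a stand-in, not the kernel's
tdx_module_args:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in register struct; fields laid out in register order. */
struct regs {
	uint64_t rcx, rdx, r8, r9, r10, r11, r12, r13;
};

/* Index of a register when the struct is viewed as a u64 array. */
#define ARG_INDEX(reg)		(offsetof(struct regs, reg) / sizeof(uint64_t))

/* Number of u64 slots from a register to the end of the struct. */
#define MAX_ARG_SIZE(reg)	((sizeof(struct regs) - \
				  offsetof(struct regs, reg)) / sizeof(uint64_t))
```

With this stand-in layout, R12 is slot 6, and two slots (R12, R13) remain
after it, matching the two DPAMT page addresses the demote wrapper copies
in.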

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Co-developed-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
---
v3:
- Split out as a new patch.
- Get pages from preallocate cache corresponding to DPAMT v4.
---
 arch/x86/include/asm/tdx.h  |  1 +
 arch/x86/kvm/vmx/tdx.c      |  5 ++-
 arch/x86/virt/vmx/tdx/tdx.c | 76 ++++++++++++++++++++++++++-----------
 3 files changed, 59 insertions(+), 23 deletions(-)

diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index abe484045132..5fc7498392fd 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -251,6 +251,7 @@ u64 tdh_mng_create(struct tdx_td *td, u16 hkid);
 u64 tdh_vp_create(struct tdx_td *td, struct tdx_vp *vp);
 u64 tdh_mng_rd(struct tdx_td *td, u64 field, u64 *data);
 u64 tdh_mem_page_demote(struct tdx_td *td, u64 gpa, int level, struct page *new_sept_page,
+			struct tdx_prealloc *prealloc,
 			u64 *ext_err1, u64 *ext_err2);
 u64 tdh_mr_extend(struct tdx_td *td, u64 gpa, u64 *ext_err1, u64 *ext_err2);
 u64 tdh_mr_finalize(struct tdx_td *td);
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index ec47bd799274..a11ff02a4f30 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -2021,8 +2021,11 @@ static int tdx_sept_split_private_spte(struct kvm *kvm, gfn_t gfn, enum pg_level
 	if (KVM_BUG_ON(ret, kvm))
 		return -EIO;
 
+	spin_lock(&kvm_tdx->prealloc_split_cache_lock);
 	err = tdh_do_no_vcpus(tdh_mem_page_demote, kvm, &kvm_tdx->td, gpa,
-			      tdx_level, new_sept_page, &entry, &level_state);
+			      tdx_level, new_sept_page,
+			      &kvm_tdx->prealloc_split_cache, &entry, &level_state);
+	spin_unlock(&kvm_tdx->prealloc_split_cache_lock);
 	if (TDX_BUG_ON_2(err, TDH_MEM_PAGE_DEMOTE, entry, level_state, kvm)) {
 		tdx_pamt_put(new_sept_page);
 		return -EIO;
diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index 76963c563906..9917e4e7705f 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -1848,25 +1848,69 @@ u64 tdh_mng_rd(struct tdx_td *td, u64 field, u64 *data)
 }
 EXPORT_SYMBOL_GPL(tdh_mng_rd);
 
+static int alloc_pamt_array(u64 *pa_array, struct tdx_prealloc *prealloc);
+static void free_pamt_array(u64 *pa_array);
+/*
+ * The TDX spec treats the registers like an array, as they are ordered
+ * in the struct. The array size is limited by the number of registers,
+ * so define the max size it could be for worst case allocations and sanity
+ * checking.
+ */
+#define MAX_TDX_ARG_SIZE(reg) ((sizeof(struct tdx_module_args) - \
+			       offsetof(struct tdx_module_args, reg)) / sizeof(u64))
+#define TDX_ARG_INDEX(reg) (offsetof(struct tdx_module_args, reg) / \
+			    sizeof(u64))
+/*
+ * Treat the struct registers like an array that starts at R12, per
+ * TDX spec. Do some sanity checks, and return an indexable type.
+ */
+static u64 *dpamt_args_array_ptr_r12(struct tdx_module_array_args *args)
+{
+	WARN_ON_ONCE(tdx_dpamt_entry_pages() > MAX_TDX_ARG_SIZE(r12));
+
+	return &args->args_array[TDX_ARG_INDEX(r12)];
+}
+
 u64 tdh_mem_page_demote(struct tdx_td *td, u64 gpa, int level, struct page *new_sept_page,
+			struct tdx_prealloc *prealloc,
 			u64 *ext_err1, u64 *ext_err2)
 {
-	struct tdx_module_args args = {
-		.rcx = gpa | level,
-		.rdx = tdx_tdr_pa(td),
-		.r8 = page_to_phys(new_sept_page),
+	bool dpamt = tdx_supports_dynamic_pamt(&tdx_sysinfo) && level == TDX_PS_2M;
+	u64 guest_memory_pamt_page[MAX_TDX_ARG_SIZE(r12)];
+	struct tdx_module_array_args args = {
+		.args.rcx = gpa | level,
+		.args.rdx = tdx_tdr_pa(td),
+		.args.r8 = page_to_phys(new_sept_page),
 	};
 	u64 ret;
 
 	if (!tdx_supports_demote_nointerrupt(&tdx_sysinfo))
 		return TDX_SW_ERROR;
 
+	if (dpamt) {
+		u64 *args_array = dpamt_args_array_ptr_r12(&args);
+
+		if (alloc_pamt_array(guest_memory_pamt_page, prealloc))
+			return TDX_SW_ERROR;
+
+		/*
+		 * Copy PAMT page PAs of the guest memory into the struct per the
+		 * TDX ABI
+		 */
+		memcpy(args_array, guest_memory_pamt_page,
+		       tdx_dpamt_entry_pages() * sizeof(*args_array));
+	}
+
 	/* Flush the new S-EPT page to be added */
 	tdx_clflush_page(new_sept_page);
-	ret = seamcall_ret(TDH_MEM_PAGE_DEMOTE, &args);
 
-	*ext_err1 = args.rcx;
-	*ext_err2 = args.rdx;
+	ret = seamcall_saved_ret(TDH_MEM_PAGE_DEMOTE, &args.args);
+
+	*ext_err1 = args.args.rcx;
+	*ext_err2 = args.args.rdx;
+
+	if (dpamt && ret)
+		free_pamt_array(guest_memory_pamt_page);
 
 	return ret;
 }
@@ -2104,23 +2148,11 @@ static struct page *alloc_dpamt_page(struct tdx_prealloc *prealloc)
 	return alloc_page(GFP_KERNEL_ACCOUNT);
 }
 
-
-/*
- * The TDX spec treats the registers like an array, as they are ordered
- * in the struct. The array size is limited by the number or registers,
- * so define the max size it could be for worst case allocations and sanity
- * checking.
- */
-#define MAX_TDX_ARG_SIZE(reg) (sizeof(struct tdx_module_args) - \
-			       offsetof(struct tdx_module_args, reg))
-#define TDX_ARG_INDEX(reg) (offsetof(struct tdx_module_args, reg) / \
-			    sizeof(u64))
-
 /*
  * Treat struct the registers like an array that starts at RDX, per
  * TDX spec. Do some sanitychecks, and return an indexable type.
  */
-static u64 *dpamt_args_array_ptr(struct tdx_module_array_args *args)
+static u64 *dpamt_args_array_ptr_rdx(struct tdx_module_array_args *args)
 {
 	WARN_ON_ONCE(tdx_dpamt_entry_pages() > MAX_TDX_ARG_SIZE(rdx));
 
@@ -2188,7 +2220,7 @@ static u64 tdh_phymem_pamt_add(struct page *page, u64 *pamt_pa_array)
 	struct tdx_module_array_args args = {
 		.args.rcx = pamt_2mb_arg(page)
 	};
-	u64 *dpamt_arg_array = dpamt_args_array_ptr(&args);
+	u64 *dpamt_arg_array = dpamt_args_array_ptr_rdx(&args);
 
 	/* Copy PAMT page PA's into the struct per the TDX ABI */
 	memcpy(dpamt_arg_array, pamt_pa_array,
@@ -2216,7 +2248,7 @@ static u64 tdh_phymem_pamt_remove(struct page *page, u64 *pamt_pa_array)
 	struct tdx_module_array_args args = {
 		.args.rcx = pamt_2mb_arg(page),
 	};
-	u64 *args_array = dpamt_args_array_ptr(&args);
+	u64 *args_array = dpamt_args_array_ptr_rdx(&args);
 	u64 ret;
 
 	ret = seamcall_ret(TDH_PHYMEM_PAMT_REMOVE, &args.args);
-- 
2.43.2


^ permalink raw reply related	[flat|nested] 127+ messages in thread

* [PATCH v3 23/24] x86/tdx: Pass guest memory's PFN info to demote for updating pamt_refcount
  2026-01-06 10:16 [PATCH v3 00/24] KVM: TDX huge page support for private memory Yan Zhao
                   ` (21 preceding siblings ...)
  2026-01-06 10:24 ` [PATCH v3 22/24] x86/tdx: Add/Remove DPAMT pages for guest private memory to demote Yan Zhao
@ 2026-01-06 10:24 ` Yan Zhao
  2026-01-06 10:24 ` [PATCH v3 24/24] KVM: TDX: Turn on PG_LEVEL_2M Yan Zhao
                   ` (2 subsequent siblings)
  25 siblings, 0 replies; 127+ messages in thread
From: Yan Zhao @ 2026-01-06 10:24 UTC (permalink / raw)
  To: pbonzini, seanjc
  Cc: linux-kernel, kvm, x86, rick.p.edgecombe, dave.hansen, kas, tabba,
	ackerleytng, michael.roth, david, vannapurve, sagis, vbabka,
	thomas.lendacky, nik.borisov, pgonda, fan.du, jun.miao,
	francescolavra.fl, jgross, ira.weiny, isaku.yamahata, xiaoyao.li,
	kai.huang, binbin.wu, chao.p.peng, chao.gao, yan.y.zhao

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

Pass guest memory's PFN info to tdh_mem_page_demote() by adding parameters
"guest_folio" and "guest_start_idx" to tdh_mem_page_demote().

The guest memory's PFN info is not required by the SEAMCALL
TDH_MEM_PAGE_DEMOTE itself. Instead, it's used by the host kernel to track
the pamt_refcount for the 2MB range containing the guest private memory.

After the S-EPT mapping is successfully split, set the pamt_refcount for
the 2MB range containing the guest private memory to 512, after ensuring
its original value is 0. Warn loudly if setting the refcount fails, since
that indicates a kernel bug.

In tdh_mem_page_demote(), check that the guest memory's base PFN is
2MB-aligned and that all the guest memory is contained in a single folio,
to guard against kernel bugs.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Co-developed-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
---
v3:
- Split out as a new patch.
- Added parameters "guest_folio" and "guest_start_idx" to pass the guest
  memory pfn info.
- Use atomic_cmpxchg_release() to set guest_pamt_refcount.
- No need to add a param "pfn_for_gfn" to kvm_x86_ops.split_external_spte()
  as the PFN info is already contained in the param "old_mirror_spte" passed
  to kvm_x86_ops.split_external_spte().
---
 arch/x86/include/asm/tdx.h  |  6 +++---
 arch/x86/kvm/vmx/tdx.c      |  9 ++++++---
 arch/x86/virt/vmx/tdx/tdx.c | 30 +++++++++++++++++++++++++-----
 3 files changed, 34 insertions(+), 11 deletions(-)

diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index 5fc7498392fd..f536782da157 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -250,9 +250,9 @@ u64 tdh_mng_key_config(struct tdx_td *td);
 u64 tdh_mng_create(struct tdx_td *td, u16 hkid);
 u64 tdh_vp_create(struct tdx_td *td, struct tdx_vp *vp);
 u64 tdh_mng_rd(struct tdx_td *td, u64 field, u64 *data);
-u64 tdh_mem_page_demote(struct tdx_td *td, u64 gpa, int level, struct page *new_sept_page,
-			struct tdx_prealloc *prealloc,
-			u64 *ext_err1, u64 *ext_err2);
+u64 tdh_mem_page_demote(struct tdx_td *td, u64 gpa, int level, struct folio *guest_folio,
+			unsigned long guest_start_idx, struct page *new_sept_page,
+			struct tdx_prealloc *prealloc, u64 *ext_err1, u64 *ext_err2);
 u64 tdh_mr_extend(struct tdx_td *td, u64 gpa, u64 *ext_err1, u64 *ext_err2);
 u64 tdh_mr_finalize(struct tdx_td *td);
 u64 tdh_vp_flush(struct tdx_vp *vp);
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index a11ff02a4f30..0054a9de867c 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1991,7 +1991,9 @@ static int tdx_sept_split_private_spte(struct kvm *kvm, gfn_t gfn, enum pg_level
 				       u64 old_mirror_spte, void *new_private_spt,
 				       bool mmu_lock_shared)
 {
+	struct page *guest_page = pfn_to_page(spte_to_pfn(old_mirror_spte));
 	struct page *new_sept_page = virt_to_page(new_private_spt);
+	struct folio *guest_folio = page_folio(guest_page);
 	int tdx_level = pg_level_to_tdx_sept_level(level);
 	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
 	gpa_t gpa = gfn_to_gpa(gfn);
@@ -2022,9 +2024,10 @@ static int tdx_sept_split_private_spte(struct kvm *kvm, gfn_t gfn, enum pg_level
 		return -EIO;
 
 	spin_lock(&kvm_tdx->prealloc_split_cache_lock);
-	err = tdh_do_no_vcpus(tdh_mem_page_demote, kvm, &kvm_tdx->td, gpa,
-			      tdx_level, new_sept_page,
-			      &kvm_tdx->prealloc_split_cache, &entry, &level_state);
+	err = tdh_do_no_vcpus(tdh_mem_page_demote, kvm, &kvm_tdx->td, gpa, tdx_level,
+			      guest_folio, folio_page_idx(guest_folio, guest_page),
+			      new_sept_page, &kvm_tdx->prealloc_split_cache,
+			      &entry, &level_state);
 	spin_unlock(&kvm_tdx->prealloc_split_cache_lock);
 	if (TDX_BUG_ON_2(err, TDH_MEM_PAGE_DEMOTE, entry, level_state, kvm)) {
 		tdx_pamt_put(new_sept_page);
diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index 9917e4e7705f..d036d9b5c87a 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -1871,9 +1871,9 @@ static u64 *dpamt_args_array_ptr_r12(struct tdx_module_array_args *args)
 	return &args->args_array[TDX_ARG_INDEX(r12)];
 }
 
-u64 tdh_mem_page_demote(struct tdx_td *td, u64 gpa, int level, struct page *new_sept_page,
-			struct tdx_prealloc *prealloc,
-			u64 *ext_err1, u64 *ext_err2)
+u64 tdh_mem_page_demote(struct tdx_td *td, u64 gpa, int level, struct folio *guest_folio,
+			unsigned long guest_start_idx, struct page *new_sept_page,
+			struct tdx_prealloc *prealloc, u64 *ext_err1, u64 *ext_err2)
 {
 	bool dpamt = tdx_supports_dynamic_pamt(&tdx_sysinfo) && level == TDX_PS_2M;
 	u64 guest_memory_pamt_page[MAX_TDX_ARG_SIZE(r12)];
@@ -1882,6 +1882,8 @@ u64 tdh_mem_page_demote(struct tdx_td *td, u64 gpa, int level, struct page *new_
 		.args.rdx = tdx_tdr_pa(td),
 		.args.r8 = page_to_phys(new_sept_page),
 	};
+	/* base pfn for guest private memory */
+	unsigned long guest_base_pfn;
 	u64 ret;
 
 	if (!tdx_supports_demote_nointerrupt(&tdx_sysinfo))
@@ -1889,6 +1891,15 @@ u64 tdh_mem_page_demote(struct tdx_td *td, u64 gpa, int level, struct page *new_
 
 	if (dpamt) {
 		u64 *args_array = dpamt_args_array_ptr_r12(&args);
+		unsigned long npages = 1 << (level * PTE_SHIFT);
+		struct page *guest_page;
+
+		guest_page = folio_page(guest_folio, guest_start_idx);
+		guest_base_pfn = page_to_pfn(guest_page);
+
+		if (guest_start_idx + npages > folio_nr_pages(guest_folio) ||
+		    !IS_ALIGNED(guest_base_pfn, npages))
+			return TDX_OPERAND_INVALID;
 
 		if (alloc_pamt_array(guest_memory_pamt_page, prealloc))
 			return TDX_SW_ERROR;
@@ -1909,9 +1920,18 @@ u64 tdh_mem_page_demote(struct tdx_td *td, u64 gpa, int level, struct page *new_
 	*ext_err1 = args.args.rcx;
 	*ext_err2 = args.args.rdx;
 
-	if (dpamt && ret)
-		free_pamt_array(guest_memory_pamt_page);
+	if (dpamt) {
+		if (ret) {
+			free_pamt_array(guest_memory_pamt_page);
+		} else {
+			/* PAMT refcount for guest private memory */
+			atomic_t *pamt_refcount;
 
+			pamt_refcount = tdx_find_pamt_refcount(guest_base_pfn);
+			WARN_ON_ONCE(atomic_cmpxchg_release(pamt_refcount, 0,
+							    PTRS_PER_PMD));
+		}
+	}
 	return ret;
 }
 EXPORT_SYMBOL_GPL(tdh_mem_page_demote);
-- 
2.43.2


^ permalink raw reply related	[flat|nested] 127+ messages in thread

* [PATCH v3 24/24] KVM: TDX: Turn on PG_LEVEL_2M
  2026-01-06 10:16 [PATCH v3 00/24] KVM: TDX huge page support for private memory Yan Zhao
                   ` (22 preceding siblings ...)
  2026-01-06 10:24 ` [PATCH v3 23/24] x86/tdx: Pass guest memory's PFN info to demote for updating pamt_refcount Yan Zhao
@ 2026-01-06 10:24 ` Yan Zhao
  2026-01-06 17:47 ` [PATCH v3 00/24] KVM: TDX huge page support for private memory Vishal Annapurve
  2026-01-16  0:28 ` Sean Christopherson
  25 siblings, 0 replies; 127+ messages in thread
From: Yan Zhao @ 2026-01-06 10:24 UTC (permalink / raw)
  To: pbonzini, seanjc
  Cc: linux-kernel, kvm, x86, rick.p.edgecombe, dave.hansen, kas, tabba,
	ackerleytng, michael.roth, david, vannapurve, sagis, vbabka,
	thomas.lendacky, nik.borisov, pgonda, fan.du, jun.miao,
	francescolavra.fl, jgross, ira.weiny, isaku.yamahata, xiaoyao.li,
	kai.huang, binbin.wu, chao.p.peng, chao.gao, yan.y.zhao

Turn on PG_LEVEL_2M in tdx_gmem_private_max_mapping_level() when TDX huge
page is enabled and TD is RUNNABLE.

Introduce a module parameter named "tdx_huge_page" for kvm-intel.ko to
enable/disable TDX huge page support. Turn TDX huge pages off if the TDX
module does not support TDX_FEATURES0.ENHANCED_DEMOTE_INTERRUPTIBILITY.

Force page size to 4KB during TD build time to simplify code design, since
- tdh_mem_page_add() only adds private pages at 4KB.
- The number of initial memory pages is usually limited (e.g. ~4MB in a
  typical Linux TD).

Update the warnings and KVM_BUG_ON() info to match the conditions when 2MB
mappings are permitted.

Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
---
v3:
- Introduce the module param enable_tdx_huge_page to allow toggling TDX
  huge page support.
- Disable TDX huge pages if the TDX module does not support
  TDX_FEATURES0_ENHANCED_DEMOTE_INTERRUPTIBILITY. (Kai)
- Explain in the patch log why 2M is not allowed before the TD is
  RUNNABLE. (Kai)
- Add comment to explain the relationship between returning PG_LEVEL_2M
  and guest accept level. (Kai)
- Dropped some KVM_BUG_ON()s due to rebasing. Updated KVM_BUG_ON()s on
  mapping levels to take enable_tdx_huge_page into account.

RFC v2:
- Merged RFC v1's patch 4 (forcing PG_LEVEL_4K before TD runnable) with
  patch 9 (allowing PG_LEVEL_2M after TD runnable).
---
 arch/x86/kvm/vmx/tdx.c | 45 ++++++++++++++++++++++++++++++++++++------
 1 file changed, 39 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 0054a9de867c..8149e89b5549 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -54,6 +54,8 @@
 
 bool enable_tdx __ro_after_init;
 module_param_named(tdx, enable_tdx, bool, 0444);
+static bool __read_mostly enable_tdx_huge_page = true;
+module_param_named(tdx_huge_page, enable_tdx_huge_page, bool, 0444);
 
 #define TDX_SHARED_BIT_PWL_5 gpa_to_gfn(BIT_ULL(51))
 #define TDX_SHARED_BIT_PWL_4 gpa_to_gfn(BIT_ULL(47))
@@ -1773,8 +1775,12 @@ static int tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn,
 	if (KVM_BUG_ON(!vcpu, kvm))
 		return -EINVAL;
 
-	/* TODO: handle large pages. */
-	if (KVM_BUG_ON(level != PG_LEVEL_4K, kvm))
+	/*
+	 * Large pages are not supported before the TD is runnable or when
+	 * TDX huge pages are not enabled.
+	 */
+	if (KVM_BUG_ON(((!enable_tdx_huge_page || kvm_tdx->state != TD_STATE_RUNNABLE) &&
+			level != PG_LEVEL_4K), kvm))
 		return -EIO;
 
 	WARN_ON_ONCE(!is_shadow_present_pte(mirror_spte) ||
@@ -1937,9 +1943,12 @@ static void tdx_sept_remove_private_spte(struct kvm *kvm, gfn_t gfn,
 	 */
 	if (KVM_BUG_ON(!is_hkid_assigned(to_kvm_tdx(kvm)), kvm))
 		return;
-
-	/* TODO: handle large pages. */
-	if (KVM_BUG_ON(level != PG_LEVEL_4K, kvm))
+	/*
+	 * Large pages are not supported before the TD is runnable or when
+	 * TDX huge pages are not enabled.
+	 */
+	if (KVM_BUG_ON(((!enable_tdx_huge_page || kvm_tdx->state != TD_STATE_RUNNABLE) &&
+			level != PG_LEVEL_4K), kvm))
 		return;
 
 	err = tdh_do_no_vcpus(tdh_mem_range_block, kvm, &kvm_tdx->td, gpa,
@@ -3556,12 +3565,34 @@ int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp)
 	return ret;
 }
 
+/*
+ * For private pages:
+ *
+ * Force KVM to map at 4KB level when !enable_tdx_huge_page (e.g., due to
+ * incompatible TDX module) or before TD state is RUNNABLE.
+ *
+ * Always allow KVM to map at 2MB level in other cases, though KVM may still map
+ * the page at 4KB (i.e., passing in PG_LEVEL_4K to AUG) due to
+ * (1) the backend folio is 4KB,
+ * (2) disallow_lpage restrictions:
+ *     - mixed private/shared pages in the 2MB range
+ *     - level misalignment due to slot base_gfn, slot size, and ugfn
+ *     - guest_inhibit bit set due to guest's 4KB accept level
+ * (3) page merging is disallowed (e.g., when part of a 2MB range has been
+ *     mapped at 4KB level during TD build time).
+ */
 int tdx_gmem_max_mapping_level(struct kvm *kvm, kvm_pfn_t pfn, bool is_private)
 {
 	if (!is_private)
 		return 0;
 
-	return PG_LEVEL_4K;
+	if (!enable_tdx_huge_page)
+		return PG_LEVEL_4K;
+
+	if (unlikely(to_kvm_tdx(kvm)->state != TD_STATE_RUNNABLE))
+		return PG_LEVEL_4K;
+
+	return PG_LEVEL_2M;
 }
 
 static int tdx_online_cpu(unsigned int cpu)
@@ -3747,6 +3778,8 @@ static int __init __tdx_bringup(void)
 	if (misc_cg_set_capacity(MISC_CG_RES_TDX, tdx_get_nr_guest_keyids()))
 		goto get_sysinfo_err;
 
+	if (enable_tdx_huge_page && !tdx_supports_demote_nointerrupt(tdx_sysinfo))
+		enable_tdx_huge_page = false;
 	/*
 	 * Leave hardware virtualization enabled after TDX is enabled
 	 * successfully.  TDX CPU hotplug depends on this.
-- 
2.43.2


^ permalink raw reply related	[flat|nested] 127+ messages in thread

* Re: [PATCH v3 00/24] KVM: TDX huge page support for private memory
  2026-01-06 10:16 [PATCH v3 00/24] KVM: TDX huge page support for private memory Yan Zhao
                   ` (23 preceding siblings ...)
  2026-01-06 10:24 ` [PATCH v3 24/24] KVM: TDX: Turn on PG_LEVEL_2M Yan Zhao
@ 2026-01-06 17:47 ` Vishal Annapurve
  2026-01-06 21:26   ` Ackerley Tng
  2026-01-16  0:28 ` Sean Christopherson
  25 siblings, 1 reply; 127+ messages in thread
From: Vishal Annapurve @ 2026-01-06 17:47 UTC (permalink / raw)
  To: Yan Zhao
  Cc: pbonzini, seanjc, linux-kernel, kvm, x86, rick.p.edgecombe,
	dave.hansen, kas, tabba, ackerleytng, michael.roth, david, sagis,
	vbabka, thomas.lendacky, nik.borisov, pgonda, fan.du, jun.miao,
	francescolavra.fl, jgross, ira.weiny, isaku.yamahata, xiaoyao.li,
	kai.huang, binbin.wu, chao.p.peng, chao.gao

On Tue, Jan 6, 2026 at 2:19 AM Yan Zhao <yan.y.zhao@intel.com> wrote:
>
> - EPT mapping size and folio size
>
>   This series is built upon the rule in KVM that the mapping size in the
>   KVM-managed secondary MMU is no larger than the backend folio size.
>
>   Therefore, there are sanity checks in the SEAMCALL wrappers in patches
>   1-5 to follow this rule, either in tdh_mem_page_aug() for mapping
>   (patch 1) or in tdh_phymem_page_wbinvd_hkid() (patch 3),
>   tdx_quirk_reset_folio() (patch 4), tdh_phymem_page_reclaim() (patch 5)
>   for unmapping.
>
>   However, as Vishal pointed out in [7], the new hugetlb-based guest_memfd
>   [1] splits backend folios ahead of notifying KVM for unmapping. So, this
>   series also relies on the fixup patch [8] to notify KVM of unmapping
>   before splitting the backend folio during the memory conversion ioctl.

I think the major issue here is that if splitting fails there is no
way to undo the unmapping [1]. How should KVM/VMM/guest handle the
case where a guest requested conversion to shared, the conversion
failed and the memory is no longer mapped as private?

[1] https://lore.kernel.org/kvm/aN8P87AXlxlEDdpP@google.com/

> Four issues are identified in the WIP hugetlb-based guest_memfd [1]:
>
> (1) Compilation error due to missing symbol export of
>     hugetlb_restructuring_free_folio().
>
> (2) guest_memfd splits backend folios when the folio is still mapped as
>     huge in KVM (which breaks KVM's basic assumption that EPT mapping size
>     should not exceed the backend folio size).
>
> (3) guest_memfd is incapable of merging folios to huge for
>     shared-to-private conversions.
>
> (4) Unnecessary disabling huge private mappings when HVA is not 2M-aligned,
>     given that shared pages can only be mapped at 4KB.
>
> So, this series also depends on the four fixup patches included in [4]:
>
> [FIXUP] KVM: guest_memfd: Allow gmem slot lpage even with non-aligned uaddr
> [FIXUP] KVM: guest_memfd: Allow merging folios after to-private conversion
> [FIXUP] KVM: guest_memfd: Zap mappings before splitting backend folios
> [FIXUP] mm: hugetlb_restructuring: Export hugetlb_restructuring_free_folio()
>
> (lkp sent me some more gmem compilation errors. I ignored them as I didn't
>  encounter them with my config and env).
>
> ...
>
> [0] RFC v2: https://lore.kernel.org/all/20250807093950.4395-1-yan.y.zhao@intel.com
> [1] hugetlb-based gmem: https://github.com/googleprodkernel/linux-cc/tree/wip-gmem-conversions-hugetlb-restructuring-12-08-25
> [2] gmem-population rework v2: https://lore.kernel.org/all/20251215153411.3613928-1-michael.roth@amd.com
> [3] DPAMT v4: https://lore.kernel.org/kvm/20251121005125.417831-1-rick.p.edgecombe@intel.com
> [4] kernel full stack: https://github.com/intel-staging/tdx/tree/huge_page_v3
> [5] https://lore.kernel.org/all/aF0Kg8FcHVMvsqSo@yzhao56-desk.sh.intel.com
> [6] https://lore.kernel.org/all/aGSoDnODoG2%2FpbYn@yzhao56-desk.sh.intel.com
> [7] https://lore.kernel.org/all/CAGtprH9vdpAGDNtzje=7faHBQc9qTSF2fUEGcbCkfJehFuP-rw@mail.gmail.com
> [8] https://github.com/intel-staging/tdx/commit/a8aedac2df44e29247773db3444bc65f7100daa1
> [9] https://github.com/intel-staging/tdx/commit/8747667feb0b37daabcaee7132c398f9e62a6edd
> [10] https://github.com/intel-staging/tdx/commit/ab29a85ec2072393ab268e231c97f07833853d0d
> [11] https://github.com/intel-staging/tdx/commit/4feb6bf371f3a747b71fc9f4ded25261e66b8895
>

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [PATCH v3 01/24] x86/tdx: Enhance tdh_mem_page_aug() to support huge pages
  2026-01-06 10:18 ` [PATCH v3 01/24] x86/tdx: Enhance tdh_mem_page_aug() to support huge pages Yan Zhao
@ 2026-01-06 21:08   ` Dave Hansen
  2026-01-07  9:12     ` Yan Zhao
  0 siblings, 1 reply; 127+ messages in thread
From: Dave Hansen @ 2026-01-06 21:08 UTC (permalink / raw)
  To: Yan Zhao, pbonzini, seanjc
  Cc: linux-kernel, kvm, x86, rick.p.edgecombe, kas, tabba, ackerleytng,
	michael.roth, david, vannapurve, sagis, vbabka, thomas.lendacky,
	nik.borisov, pgonda, fan.du, jun.miao, francescolavra.fl, jgross,
	ira.weiny, isaku.yamahata, xiaoyao.li, kai.huang, binbin.wu,
	chao.p.peng, chao.gao

On 1/6/26 02:18, Yan Zhao wrote:
> Enhance the SEAMCALL wrapper tdh_mem_page_aug() to support huge pages.
> 
> The SEAMCALL TDH_MEM_PAGE_AUG currently supports adding physical memory to
> the S-EPT up to 2MB in size.
> 
> While keeping the "level" parameter in the tdh_mem_page_aug() wrapper to
> allow callers to specify the physical memory size, introduce the parameters
> "folio" and "start_idx" to specify the physical memory starting from the
> page at "start_idx" within the "folio". The specified physical memory must
> be fully contained within a single folio.
> 
> Invoke tdx_clflush_page() for each 4KB segment of the physical memory being
> added. tdx_clflush_page() performs CLFLUSH operations conservatively to
> prevent dirty cache lines from writing back later and corrupting TD memory.

This changelog is heavy on the "what" and weak on the "why". It's not
telling me what I need to know.

...
> +	struct folio *folio = page_folio(page);
>  	gpa_t gpa = gfn_to_gpa(gfn);
>  	u64 entry, level_state;
>  	u64 err;
>  
> -	err = tdh_mem_page_aug(&kvm_tdx->td, gpa, tdx_level, page, &entry, &level_state);
> -
> +	err = tdh_mem_page_aug(&kvm_tdx->td, gpa, tdx_level, folio,
> +			       folio_page_idx(folio, page), &entry, &level_state);
>  	if (unlikely(IS_TDX_OPERAND_BUSY(err)))
>  		return -EBUSY;

For example, 'folio' can be trivially derived from 'page'. Yet,
this removes the 'page' argument and replaces it with 'folio' _and_
another value which can be derived from 'page'.

This looks superficially like an illogical change. *Why* was this done?

> diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> index b0b33f606c11..41ce18619ffc 100644
> --- a/arch/x86/virt/vmx/tdx/tdx.c
> +++ b/arch/x86/virt/vmx/tdx/tdx.c
> @@ -1743,16 +1743,23 @@ u64 tdh_vp_addcx(struct tdx_vp *vp, struct page *tdcx_page)
>  }
>  EXPORT_SYMBOL_GPL(tdh_vp_addcx);
>  
> -u64 tdh_mem_page_aug(struct tdx_td *td, u64 gpa, int level, struct page *page, u64 *ext_err1, u64 *ext_err2)
> +u64 tdh_mem_page_aug(struct tdx_td *td, u64 gpa, int level, struct folio *folio,
> +		     unsigned long start_idx, u64 *ext_err1, u64 *ext_err2)
>  {
>  	struct tdx_module_args args = {
>  		.rcx = gpa | level,
>  		.rdx = tdx_tdr_pa(td),
> -		.r8 = page_to_phys(page),
> +		.r8 = page_to_phys(folio_page(folio, start_idx)),
>  	};
> +	unsigned long npages = 1 << (level * PTE_SHIFT);
>  	u64 ret;

This 'npages' calculation is not obviously correct. It's not clear what
"level" is or what values it should have.

This is precisely the kind of place to deploy a helper that explains
what is going on.

> -	tdx_clflush_page(page);
> +	if (start_idx + npages > folio_nr_pages(folio))
> +		return TDX_OPERAND_INVALID;

Why is this necessary? Would it be a bug if this happens?

> +	for (int i = 0; i < npages; i++)
> +		tdx_clflush_page(folio_page(folio, start_idx + i));

All of the page<->folio conversions are kinda hurting my brain. I think
we need to decide what the canonical type for these things is in TDX, do
the conversion once, and stick with it.


^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [PATCH v3 00/24] KVM: TDX huge page support for private memory
  2026-01-06 17:47 ` [PATCH v3 00/24] KVM: TDX huge page support for private memory Vishal Annapurve
@ 2026-01-06 21:26   ` Ackerley Tng
  2026-01-06 21:38     ` Sean Christopherson
  0 siblings, 1 reply; 127+ messages in thread
From: Ackerley Tng @ 2026-01-06 21:26 UTC (permalink / raw)
  To: Vishal Annapurve, Yan Zhao
  Cc: pbonzini, seanjc, linux-kernel, kvm, x86, rick.p.edgecombe,
	dave.hansen, kas, tabba, michael.roth, david, sagis, vbabka,
	thomas.lendacky, nik.borisov, pgonda, fan.du, jun.miao,
	francescolavra.fl, jgross, ira.weiny, isaku.yamahata, xiaoyao.li,
	kai.huang, binbin.wu, chao.p.peng, chao.gao

Vishal Annapurve <vannapurve@google.com> writes:

> On Tue, Jan 6, 2026 at 2:19 AM Yan Zhao <yan.y.zhao@intel.com> wrote:
>>
>> - EPT mapping size and folio size
>>
>>   This series is built upon the rule in KVM that the mapping size in the
>>   KVM-managed secondary MMU is no larger than the backend folio size.
>>

I'm not familiar with this rule and would like to find out more. Why is
this rule imposed? Is this rule there just because traditionally folio
sizes also define the limit of contiguity, and so the mapping size must
not be greater than folio size in case the block of memory represented
by the folio is not contiguous?

In guest_memfd's case, even if the folio is split (just for refcount
tracking purposes on private-to-shared conversion), the memory is still
contiguous up to the original folio's size. Will the contiguity address
the concerns?

Specifically for TDX, does the folio size actually matter relative to
the mapping, or is it more about contiguity than the folio size?

>>   Therefore, there are sanity checks in the SEAMCALL wrappers in patches
>>   1-5 to follow this rule, either in tdh_mem_page_aug() for mapping
>>   (patch 1) or in tdh_phymem_page_wbinvd_hkid() (patch 3),
>>   tdx_quirk_reset_folio() (patch 4), tdh_phymem_page_reclaim() (patch 5)
>>   for unmapping.
>>
>>   However, as Vishal pointed out in [7], the new hugetlb-based guest_memfd
>>   [1] splits backend folios ahead of notifying KVM for unmapping. So, this
>>   series also relies on the fixup patch [8] to notify KVM of unmapping
>>   before splitting the backend folio during the memory conversion ioctl.
>
> I think the major issue here is that if splitting fails there is no
> way to undo the unmapping [1]. How should KVM/VMM/guest handle the
> case where a guest requested conversion to shared, the conversion
> failed and the memory is no longer mapped as private?
>
> [1] https://lore.kernel.org/kvm/aN8P87AXlxlEDdpP@google.com/
>

Unmapping was supposed to be the point of no return in the conversion
process. (This might have changed since we last discussed this. The link
[1] from Vishal is where it was discussed.)

The new/current plan is that in the conversion process we'll do anything
that might fail first, and then commit the conversion, beginning with
zapping, and so zapping is the point of no return.

(I think you also suggested this before, but back then I couldn't see a
way to separate out the steps cleanly)

Here are the conversion steps in what we're trying now (leaving out the
TDX EPT splitting at first):

1. Allocate enough memory for updating attributes maple tree
2a. Only for shared->private conversions: unmap from host page table,
check for safe refcounts
2b. Only for private->shared conversions: split folios (note: split
only, no merges) split can fail since HVO needs to be undone, and that
requires allocations.
3. Invalidate begin
4. Zap from stage 2 page tables: this is the point of no return; before
this, we must be sure nothing after it will fail.
5. Update attributes maple tree using allocated memory from step 1.
6. Invalidate end
7. Only for shared->private conversions: merge folios, making sure that
merging does not fail (should not, since there are no allocations, only
folio aka metadata updates)

Updating the maple tree before calling the folio merge function allows
the merge function to look up the *updated* maple tree.

I'm thinking to insert the call to EPT splitting after invalidate begin
(3) since EPT splitting is not undoable. However, that will be after
folio splitting, hence my earlier question on whether it's a hard rule
based on folio size, or based on memory contiguity. Would that work?

>> Four issues are identified in the WIP hugetlb-based guest_memfd [1]:
>>
>> (1) Compilation error due to missing symbol export of
>>     hugetlb_restructuring_free_folio().
>>
>> (2) guest_memfd splits backend folios when the folio is still mapped as
>>     huge in KVM (which breaks KVM's basic assumption that EPT mapping size
>>     should not exceed the backend folio size).
>>
>> (3) guest_memfd is incapable of merging folios to huge for
>>     shared-to-private conversions.
>>
>> (4) Unnecessary disabling huge private mappings when HVA is not 2M-aligned,
>>     given that shared pages can only be mapped at 4KB.
>>
>> So, this series also depends on the four fixup patches included in [4]:
>>

Thank you for these fixes!

>> [FIXUP] KVM: guest_memfd: Allow gmem slot lpage even with non-aligned uaddr
>> [FIXUP] KVM: guest_memfd: Allow merging folios after to-private conversion

Thanks for catching this, Vishal also found this in a very recent
internal review. Our fix for this is to first apply the new state before
doing the folio merge. See the flow described above.

>> [FIXUP] KVM: guest_memfd: Zap mappings before splitting backend folios
>> [FIXUP] mm: hugetlb_restructuring: Export hugetlb_restructuring_free_folio()
>>
>> (lkp sent me some more gmem compilation errors. I ignored them as I didn't
>>  encounter them with my config and env).
>>
>> ...
>>
>> [0] RFC v2: https://lore.kernel.org/all/20250807093950.4395-1-yan.y.zhao@intel.com
>> [1] hugetlb-based gmem: https://github.com/googleprodkernel/linux-cc/tree/wip-gmem-conversions-hugetlb-restructuring-12-08-25
>> [2] gmem-population rework v2: https://lore.kernel.org/all/20251215153411.3613928-1-michael.roth@amd.com
>> [3] DPAMT v4: https://lore.kernel.org/kvm/20251121005125.417831-1-rick.p.edgecombe@intel.com
>> [4] kernel full stack: https://github.com/intel-staging/tdx/tree/huge_page_v3
>> [5] https://lore.kernel.org/all/aF0Kg8FcHVMvsqSo@yzhao56-desk.sh.intel.com
>> [6] https://lore.kernel.org/all/aGSoDnODoG2%2FpbYn@yzhao56-desk.sh.intel.com
>> [7] https://lore.kernel.org/all/CAGtprH9vdpAGDNtzje=7faHBQc9qTSF2fUEGcbCkfJehFuP-rw@mail.gmail.com
>> [8] https://github.com/intel-staging/tdx/commit/a8aedac2df44e29247773db3444bc65f7100daa1
>> [9] https://github.com/intel-staging/tdx/commit/8747667feb0b37daabcaee7132c398f9e62a6edd
>> [10] https://github.com/intel-staging/tdx/commit/ab29a85ec2072393ab268e231c97f07833853d0d
>> [11] https://github.com/intel-staging/tdx/commit/4feb6bf371f3a747b71fc9f4ded25261e66b8895
>>

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [PATCH v3 00/24] KVM: TDX huge page support for private memory
  2026-01-06 21:26   ` Ackerley Tng
@ 2026-01-06 21:38     ` Sean Christopherson
  2026-01-06 22:04       ` Ackerley Tng
  0 siblings, 1 reply; 127+ messages in thread
From: Sean Christopherson @ 2026-01-06 21:38 UTC (permalink / raw)
  To: Ackerley Tng
  Cc: Vishal Annapurve, Yan Zhao, pbonzini, linux-kernel, kvm, x86,
	rick.p.edgecombe, dave.hansen, kas, tabba, michael.roth, david,
	sagis, vbabka, thomas.lendacky, nik.borisov, pgonda, fan.du,
	jun.miao, francescolavra.fl, jgross, ira.weiny, isaku.yamahata,
	xiaoyao.li, kai.huang, binbin.wu, chao.p.peng, chao.gao

On Tue, Jan 06, 2026, Ackerley Tng wrote:
> Vishal Annapurve <vannapurve@google.com> writes:
> 
> > On Tue, Jan 6, 2026 at 2:19 AM Yan Zhao <yan.y.zhao@intel.com> wrote:
> >>
> >> - EPT mapping size and folio size
> >>
> >>   This series is built upon the rule in KVM that the mapping size in the
> >>   KVM-managed secondary MMU is no larger than the backend folio size.
> >>
> 
> I'm not familiar with this rule and would like to find out more. Why is
> this rule imposed? 

Because it's the only sane way to safely map memory into the guest? :-D

> Is this rule there just because traditionally folio sizes also define the
> limit of contiguity, and so the mapping size must not be greater than folio
> size in case the block of memory represented by the folio is not contiguous?

Pre-guest_memfd, KVM didn't care about folios.  KVM's mapping size was (and still
is) strictly bound by the host mapping size.  That handles contiguous addresses,
but it _also_ handles contiguous protections (e.g. RWX) and other attributes.

> In guest_memfd's case, even if the folio is split (just for refcount
>> tracking purposes on private to shared conversion), the memory is still
> contiguous up to the original folio's size. Will the contiguity address
> the concerns?

Not really?  Why would the folio be split if the memory _and its attributes_ are
fully contiguous?  If the attributes are mixed, KVM must not create a mapping
spanning mixed ranges, i.e. with multiple folios.

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [PATCH v3 00/24] KVM: TDX huge page support for private memory
  2026-01-06 21:38     ` Sean Christopherson
@ 2026-01-06 22:04       ` Ackerley Tng
  2026-01-06 23:43         ` Sean Christopherson
  0 siblings, 1 reply; 127+ messages in thread
From: Ackerley Tng @ 2026-01-06 22:04 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Vishal Annapurve, Yan Zhao, pbonzini, linux-kernel, kvm, x86,
	rick.p.edgecombe, dave.hansen, kas, tabba, michael.roth, david,
	sagis, vbabka, thomas.lendacky, nik.borisov, pgonda, fan.du,
	jun.miao, francescolavra.fl, jgross, ira.weiny, isaku.yamahata,
	xiaoyao.li, kai.huang, binbin.wu, chao.p.peng, chao.gao

Sean Christopherson <seanjc@google.com> writes:

> On Tue, Jan 06, 2026, Ackerley Tng wrote:
>> Vishal Annapurve <vannapurve@google.com> writes:
>>
>> > On Tue, Jan 6, 2026 at 2:19 AM Yan Zhao <yan.y.zhao@intel.com> wrote:
>> >>
>> >> - EPT mapping size and folio size
>> >>
>> >>   This series is built upon the rule in KVM that the mapping size in the
>> >>   KVM-managed secondary MMU is no larger than the backend folio size.
>> >>
>>
>> I'm not familiar with this rule and would like to find out more. Why is
>> this rule imposed?
>
> Because it's the only sane way to safely map memory into the guest? :-D
>
>> Is this rule there just because traditionally folio sizes also define the
>> limit of contiguity, and so the mapping size must not be greater than folio
>> size in case the block of memory represented by the folio is not contiguous?
>
> Pre-guest_memfd, KVM didn't care about folios.  KVM's mapping size was (and still
> is) strictly bound by the host mapping size.  That handles contiguous addresses,
> but it _also_ handles contiguous protections (e.g. RWX) and other attributes.
>
>> In guest_memfd's case, even if the folio is split (just for refcount
>> tracking purposes on private to shared conversion), the memory is still
>> contiguous up to the original folio's size. Will the contiguity address
>> the concerns?
>
> Not really?  Why would the folio be split if the memory _and its attributes_ are
> fully contiguous?  If the attributes are mixed, KVM must not create a mapping
> spanning mixed ranges, i.e. with multiple folios.

The folio can be split if any (or all) of the pages in a huge page range
are shared (in the CoCo sense). So in a 1G block of memory, even if the
attributes all read 0 (!KVM_MEMORY_ATTRIBUTE_PRIVATE), the folio
would be split, and the split folios are necessary for tracking users of
shared pages using struct page refcounts.

However the split folios in that 1G range are still fully contiguous.

The process of conversion will split the EPT entries soon after the
folios are split so the rule remains upheld.

I guess perhaps the question is, is it okay if the folios are smaller
than the mapping while conversion is in progress? Does the order matter
(split page table entries first vs split folios first)?

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [PATCH v3 00/24] KVM: TDX huge page support for private memory
  2026-01-06 22:04       ` Ackerley Tng
@ 2026-01-06 23:43         ` Sean Christopherson
  2026-01-07  9:03           ` Yan Zhao
                             ` (2 more replies)
  0 siblings, 3 replies; 127+ messages in thread
From: Sean Christopherson @ 2026-01-06 23:43 UTC (permalink / raw)
  To: Ackerley Tng
  Cc: Vishal Annapurve, Yan Zhao, pbonzini, linux-kernel, kvm, x86,
	rick.p.edgecombe, dave.hansen, kas, tabba, michael.roth, david,
	sagis, vbabka, thomas.lendacky, nik.borisov, pgonda, fan.du,
	jun.miao, francescolavra.fl, jgross, ira.weiny, isaku.yamahata,
	xiaoyao.li, kai.huang, binbin.wu, chao.p.peng, chao.gao

On Tue, Jan 06, 2026, Ackerley Tng wrote:
> Sean Christopherson <seanjc@google.com> writes:
> 
> > On Tue, Jan 06, 2026, Ackerley Tng wrote:
> >> Vishal Annapurve <vannapurve@google.com> writes:
> >>
> >> > On Tue, Jan 6, 2026 at 2:19 AM Yan Zhao <yan.y.zhao@intel.com> wrote:
> >> >>
> >> >> - EPT mapping size and folio size
> >> >>
> >> >>   This series is built upon the rule in KVM that the mapping size in the
> >> >>   KVM-managed secondary MMU is no larger than the backend folio size.
> >> >>
> >>
> >> I'm not familiar with this rule and would like to find out more. Why is
> >> this rule imposed?
> >
> > Because it's the only sane way to safely map memory into the guest? :-D
> >
> >> Is this rule there just because traditionally folio sizes also define the
> >> limit of contiguity, and so the mapping size must not be greater than folio
> >> size in case the block of memory represented by the folio is not contiguous?
> >
> > Pre-guest_memfd, KVM didn't care about folios.  KVM's mapping size was (and still
> > is) strictly bound by the host mapping size.  That handles contiguous addresses,
> > but it _also_ handles contiguous protections (e.g. RWX) and other attributes.
> >
> >> In guest_memfd's case, even if the folio is split (just for refcount
> >> tracking purposes on private to shared conversion), the memory is still
> >> contiguous up to the original folio's size. Will the contiguity address
> >> the concerns?
> >
> > Not really?  Why would the folio be split if the memory _and its attributes_ are
> > fully contiguous?  If the attributes are mixed, KVM must not create a mapping
> > spanning mixed ranges, i.e. with multiple folios.
> 
> The folio can be split if any (or all) of the pages in a huge page range
> are shared (in the CoCo sense). So in a 1G block of memory, even if the
> attributes all read 0 (!KVM_MEMORY_ATTRIBUTE_PRIVATE), the folio
> would be split, and the split folios are necessary for tracking users of
> shared pages using struct page refcounts.

Ahh, that's what the refcounting was referring to.  Gotcha.

> However the split folios in that 1G range are still fully contiguous.
> 
> The process of conversion will split the EPT entries soon after the
> folios are split so the rule remains upheld.
> 
> I guess perhaps the question is, is it okay if the folios are smaller
> than the mapping while conversion is in progress? Does the order matter
> (split page table entries first vs split folios first)?

Mapping a hugepage for memory that KVM _knows_ is contiguous and homogenous is
conceptually totally fine, i.e. I'm not totally opposed to adding support for
mapping multiple guest_memfd folios with a single hugepage.   As to whether we
do (a) nothing, (b) change the refcounting, or (c) add support for mapping
multiple folios in one page, probably comes down to which option provides "good
enough" performance without incurring too much complexity.

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [PATCH v3 00/24] KVM: TDX huge page support for private memory
  2026-01-06 23:43         ` Sean Christopherson
@ 2026-01-07  9:03           ` Yan Zhao
  2026-01-08 20:11             ` Ackerley Tng
  2026-01-07 19:22           ` Edgecombe, Rick P
  2026-01-12 20:15           ` Ackerley Tng
  2 siblings, 1 reply; 127+ messages in thread
From: Yan Zhao @ 2026-01-07  9:03 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Ackerley Tng, Vishal Annapurve, pbonzini, linux-kernel, kvm, x86,
	rick.p.edgecombe, dave.hansen, kas, tabba, michael.roth, david,
	sagis, vbabka, thomas.lendacky, nik.borisov, pgonda, fan.du,
	jun.miao, francescolavra.fl, jgross, ira.weiny, isaku.yamahata,
	xiaoyao.li, kai.huang, binbin.wu, chao.p.peng, chao.gao

On Tue, Jan 06, 2026 at 03:43:29PM -0800, Sean Christopherson wrote:
> On Tue, Jan 06, 2026, Ackerley Tng wrote:
> > Sean Christopherson <seanjc@google.com> writes:
> > 
> > > On Tue, Jan 06, 2026, Ackerley Tng wrote:
> > >> Vishal Annapurve <vannapurve@google.com> writes:
> > >>
> > >> > On Tue, Jan 6, 2026 at 2:19 AM Yan Zhao <yan.y.zhao@intel.com> wrote:
> > >> >>
> > >> >> - EPT mapping size and folio size
> > >> >>
> > >> >>   This series is built upon the rule in KVM that the mapping size in the
> > >> >>   KVM-managed secondary MMU is no larger than the backend folio size.
> > >> >>
> > >>
> > >> I'm not familiar with this rule and would like to find out more. Why is
> > >> this rule imposed?
> > >
> > > Because it's the only sane way to safely map memory into the guest? :-D
> > >
> > >> Is this rule there just because traditionally folio sizes also define the
> > >> limit of contiguity, and so the mapping size must not be greater than folio
> > >> size in case the block of memory represented by the folio is not contiguous?
> > >
> > > Pre-guest_memfd, KVM didn't care about folios.  KVM's mapping size was (and still
> > > is) strictly bound by the host mapping size.  That handles contiguous addresses,
> > > but it _also_ handles contiguous protections (e.g. RWX) and other attributes.
> > >
> > >> In guest_memfd's case, even if the folio is split (just for refcount
> > >> tracking purposes on private to shared conversion), the memory is still
> > >> contiguous up to the original folio's size. Will the contiguity address
> > >> the concerns?
> > >
> > > Not really?  Why would the folio be split if the memory _and its attributes_ are
> > > fully contiguous?  If the attributes are mixed, KVM must not create a mapping
> > > spanning mixed ranges, i.e. with multiple folios.
> > 
> > The folio can be split if any (or all) of the pages in a huge page range
> > are shared (in the CoCo sense). So in a 1G block of memory, even if the
> > attributes all read 0 (!KVM_MEMORY_ATTRIBUTE_PRIVATE), the folio
> > would be split, and the split folios are necessary for tracking users of
> > shared pages using struct page refcounts.
> 
> Ahh, that's what the refcounting was referring to.  Gotcha.
> 
> > However the split folios in that 1G range are still fully contiguous.
> > 
> > The process of conversion will split the EPT entries soon after the
> > folios are split so the rule remains upheld.

Overall, I don't think allowing folios smaller than the mappings while
conversion is in progress brings enough benefit.

Cons:
(1) TDX's zapping callback has no idea whether the zapping is caused by an
    in-progress private-to-shared conversion or other reasons. It also has no
    idea if the attributes of the underlying folios remain unchanged during an
    in-progress private-to-shared conversion. Even if the assertion Ackerley
    mentioned is true, it's not easy to drop the sanity checks in TDX's zapping
    callback for in-progress private-to-shared conversion alone (which would
    increase TDX's dependency on guest_memfd's specific implementation even if
    it's feasible).

    Removing the sanity checks entirely in TDX's zapping callback is confusing
    and would reflect a false expectation in KVM -- what if a huge folio is
    incorrectly split while it's still mapped in KVM (by a buggy guest_memfd or
    others) in other conditions? And then do we still need the check in TDX's
    mapping callback? If not, does it mean TDX huge pages can stop relying on
    guest_memfd's ability to allocate huge folios, as KVM could still create
    huge mappings as long as small folios are physically contiguous with
    homogeneous memory attributes?

(2) Allowing folios smaller than the mapping would require splitting S-EPT in
    kvm_gmem_error_folio() before kvm_gmem_zap(). Though one may argue that the
    invalidate lock held in __kvm_gmem_set_attributes() could guard against
    concurrent kvm_gmem_error_folio(), it still doesn't seem clean and looks
    error-prone. (This may also apply to kvm_gmem_migrate_folio().)

Pro: Preventing zapping private memory until conversion is successful is good.

However, could we achieve this benefit in other ways? For example, is it
possible to ensure hugetlb_restructuring_split_folio() can't fail by ensuring
split_entries() can't fail (via pre-allocation?) and disabling hugetlb_vmemmap
optimization? (hugetlb_vmemmap conversion is super slow according to my
observation and I always disable it). Or pre-allocation for
vmemmap_remap_alloc()?

Dropping TDX's sanity check may only serve as our last resort. IMHO, zapping
private memory before conversion succeeds is still better than introducing the
mess between folio size and mapping size.

> > I guess perhaps the question is, is it okay if the folios are smaller
> > than the mapping while conversion is in progress? Does the order matter
> > (split page table entries first vs split folios first)?
> 
> Mapping a hugepage for memory that KVM _knows_ is contiguous and homogenous is
> conceptually totally fine, i.e. I'm not totally opposed to adding support for
> mapping multiple guest_memfd folios with a single hugepage.   As to whether we
> do (a) nothing, (b) change the refcounting, or (c) add support for mapping
> multiple folios in one page, probably comes down to which option provides "good
> enough" performance without incurring too much complexity.

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [PATCH v3 01/24] x86/tdx: Enhance tdh_mem_page_aug() to support huge pages
  2026-01-06 21:08   ` Dave Hansen
@ 2026-01-07  9:12     ` Yan Zhao
  2026-01-07 16:39       ` Dave Hansen
  0 siblings, 1 reply; 127+ messages in thread
From: Yan Zhao @ 2026-01-07  9:12 UTC (permalink / raw)
  To: Dave Hansen
  Cc: pbonzini, seanjc, linux-kernel, kvm, x86, rick.p.edgecombe, kas,
	tabba, ackerleytng, michael.roth, david, vannapurve, sagis,
	vbabka, thomas.lendacky, nik.borisov, pgonda, fan.du, jun.miao,
	francescolavra.fl, jgross, ira.weiny, isaku.yamahata, xiaoyao.li,
	kai.huang, binbin.wu, chao.p.peng, chao.gao

Hi Dave

Thanks for the review!

On Tue, Jan 06, 2026 at 01:08:00PM -0800, Dave Hansen wrote:
> On 1/6/26 02:18, Yan Zhao wrote:
> > Enhance the SEAMCALL wrapper tdh_mem_page_aug() to support huge pages.
> > 
> > The SEAMCALL TDH_MEM_PAGE_AUG currently supports adding physical memory to
> > the S-EPT up to 2MB in size.
> > 
> > While keeping the "level" parameter in the tdh_mem_page_aug() wrapper to
> > allow callers to specify the physical memory size, introduce the parameters
> > "folio" and "start_idx" to specify the physical memory starting from the
> > page at "start_idx" within the "folio". The specified physical memory must
> > be fully contained within a single folio.
> > 
> > Invoke tdx_clflush_page() for each 4KB segment of the physical memory being
> > added. tdx_clflush_page() performs CLFLUSH operations conservatively to
> > prevent dirty cache lines from writing back later and corrupting TD memory.
> 
> This changelog is heavy on the "what" and weak on the "why". It's not
> telling me what I need to know.
Indeed. I missed that. I'll keep it in mind. Thanks!

> > +	struct folio *folio = page_folio(page);
> >  	gpa_t gpa = gfn_to_gpa(gfn);
> >  	u64 entry, level_state;
> >  	u64 err;
> >  
> > -	err = tdh_mem_page_aug(&kvm_tdx->td, gpa, tdx_level, page, &entry, &level_state);
> > -
> > +	err = tdh_mem_page_aug(&kvm_tdx->td, gpa, tdx_level, folio,
> > +			       folio_page_idx(folio, page), &entry, &level_state);
> >  	if (unlikely(IS_TDX_OPERAND_BUSY(err)))
> >  		return -EBUSY;
> 
> For example, 'folio' is able to be trivially derived from page. Yet,
> this removes the 'page' argument and replaces it with 'folio' _and_
> another value which can be derived from 'page'.
> 
> This looks superficially like an illogical change. *Why* was this done?
Sorry for missing the "why".

I think we can alternatively derive "folio" and "start_idx" from "page" inside
the wrapper tdh_mem_page_aug() for huge pages.

However, my understanding is that it's better for functions expecting huge pages
to explicitly receive "folio" instead of "page". This way, people can tell from
a function's declaration what the function expects. Is this understanding
correct?

Passing "start_idx" along with "folio" is due to the requirement of mapping only
a sub-range of a huge folio. e.g., we allow creating a 2MB mapping starting from
the nth idx of a 1GB folio.

On the other hand, if we instead pass "page" to tdh_mem_page_aug() for huge
pages and have tdh_mem_page_aug() internally convert it to "folio" and
"start_idx", it makes me wonder if we could have previously just passed "pfn" to
tdh_mem_page_aug() and had tdh_mem_page_aug() convert it to "page".

> > diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> > index b0b33f606c11..41ce18619ffc 100644
> > --- a/arch/x86/virt/vmx/tdx/tdx.c
> > +++ b/arch/x86/virt/vmx/tdx/tdx.c
> > @@ -1743,16 +1743,23 @@ u64 tdh_vp_addcx(struct tdx_vp *vp, struct page *tdcx_page)
> >  }
> >  EXPORT_SYMBOL_GPL(tdh_vp_addcx);
> >  
> > -u64 tdh_mem_page_aug(struct tdx_td *td, u64 gpa, int level, struct page *page, u64 *ext_err1, u64 *ext_err2)
> > +u64 tdh_mem_page_aug(struct tdx_td *td, u64 gpa, int level, struct folio *folio,
> > +		     unsigned long start_idx, u64 *ext_err1, u64 *ext_err2)
> >  {
> >  	struct tdx_module_args args = {
> >  		.rcx = gpa | level,
> >  		.rdx = tdx_tdr_pa(td),
> > -		.r8 = page_to_phys(page),
> > +		.r8 = page_to_phys(folio_page(folio, start_idx)),
> >  	};
> > +	unsigned long npages = 1 << (level * PTE_SHIFT);
> >  	u64 ret;
> 
> This 'npages' calculation is not obviously correct. It's not clear what
> "level" is or what values it should have.
> 
> This is precisely the kind of place to deploy a helper that explains
> what is going on.
Will do. Thanks for pointing it out!

> > -	tdx_clflush_page(page);
> > +	if (start_idx + npages > folio_nr_pages(folio))
> > +		return TDX_OPERAND_INVALID;
> 
> Why is this necessary? Would it be a bug if this happens?
This sanity check is due to the requirement in KVM that mapping size should be
no larger than the backend folio size, which ensures the mapping pages are
physically contiguous with homogeneous page attributes. (See the discussion
about "EPT mapping size and folio size" in thread [1]).

Failure of the sanity check could only be due to bugs in the caller (KVM). I
didn't convert the sanity check to an assertion because there's already a
TDX_BUG_ON_2() on error following the invocation of tdh_mem_page_aug() in KVM.

Also, there's no alignment checking because SEAMCALL TDH_MEM_PAGE_AUG() would
fail with a misaligned base PFN.

[1] https://lore.kernel.org/all/aV2A39fXgzuM4Toa@google.com/

> > +	for (int i = 0; i < npages; i++)
> > +		tdx_clflush_page(folio_page(folio, start_idx + i));
> 
> All of the page<->folio conversions are kinda hurting my brain. I think
> we need to decide what the canonical type for these things is in TDX, do
> the conversion once, and stick with it.
Got it!

Since passing in base "page" or base "pfn" may still require the
wrappers/helpers to internally convert them to "folio" for sanity checks, could
we decide that "folio" and "start_idx" are the canonical params for functions
expecting huge pages? Or do you prefer KVM to do the sanity check by itself?

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [PATCH v3 01/24] x86/tdx: Enhance tdh_mem_page_aug() to support huge pages
  2026-01-07  9:12     ` Yan Zhao
@ 2026-01-07 16:39       ` Dave Hansen
  2026-01-08 19:05         ` Ackerley Tng
  2026-01-09  3:08         ` Yan Zhao
  0 siblings, 2 replies; 127+ messages in thread
From: Dave Hansen @ 2026-01-07 16:39 UTC (permalink / raw)
  To: Yan Zhao
  Cc: pbonzini, seanjc, linux-kernel, kvm, x86, rick.p.edgecombe, kas,
	tabba, ackerleytng, michael.roth, david, vannapurve, sagis,
	vbabka, thomas.lendacky, nik.borisov, pgonda, fan.du, jun.miao,
	francescolavra.fl, jgross, ira.weiny, isaku.yamahata, xiaoyao.li,
	kai.huang, binbin.wu, chao.p.peng, chao.gao

On 1/7/26 01:12, Yan Zhao wrote:
...
> However, my understanding is that it's better for functions expecting huge pages
> to explicitly receive "folio" instead of "page". This way, people can tell from
> a function's declaration what the function expects. Is this understanding
> correct?

In a perfect world, maybe.

But, in practice, a 'struct page' can still represent huge pages and
*does* represent huge pages all over the kernel. There's no need to cram
a folio in here just because a huge page is involved.

> Passing "start_idx" along with "folio" is due to the requirement of mapping only
> a sub-range of a huge folio. e.g., we allow creating a 2MB mapping starting from
> the nth idx of a 1GB folio.
> 
> On the other hand, if we instead pass "page" to tdh_mem_page_aug() for huge
> pages and have tdh_mem_page_aug() internally convert it to "folio" and
> "start_idx", it makes me wonder if we could have previously just passed "pfn" to
> tdh_mem_page_aug() and had tdh_mem_page_aug() convert it to "page".

As a general pattern, I discourage folks from using pfns and physical
addresses when passing around references to physical memory. They have
zero type safety.

It's also not just about type safety. A 'struct page' also *means*
something. It means that the kernel is, on some level, aware of and
managing that memory. It's not MMIO. It doesn't represent the physical
address of the APIC page. It's not SGX memory. It doesn't have a
Shared/Private bit.

All of those properties are important and they're *GONE* if you use a
pfn. It's even worse if you use a raw physical address.

Please don't go back to raw integers (pfns or paddrs).

>>> -	tdx_clflush_page(page);
>>> +	if (start_idx + npages > folio_nr_pages(folio))
>>> +		return TDX_OPERAND_INVALID;
>>
>> Why is this necessary? Would it be a bug if this happens?
> This sanity check is due to the requirement in KVM that mapping size should be
> no larger than the backend folio size, which ensures the mapping pages are
> physically contiguous with homogeneous page attributes. (See the discussion
> about "EPT mapping size and folio size" in thread [1]).
> 
> Failure of the sanity check could only be due to bugs in the caller (KVM). I
> didn't convert the sanity check to an assertion because there's already a
> TDX_BUG_ON_2() on error following the invocation of tdh_mem_page_aug() in KVM.

We generally don't protect against bugs in callers. Otherwise, we'd have
a trillion NULL checks in every function in the kernel.

The only reason to add caller sanity checks is to make things easier to
debug, and those almost always include some kind of spew:
WARN_ON_ONCE(), pr_warn(), etc...

>>> +	for (int i = 0; i < npages; i++)
>>> +		tdx_clflush_page(folio_page(folio, start_idx + i));
>>
>> All of the page<->folio conversions are kinda hurting my brain. I think
>> we need to decide what the canonical type for these things is in TDX, do
>> the conversion once, and stick with it.
> Got it!
> 
> Since passing in base "page" or base "pfn" may still require the
> wrappers/helpers to internally convert them to "folio" for sanity checks, could
> we decide that "folio" and "start_idx" are the canonical params for functions
> expecting huge pages? Or do you prefer KVM to do the sanity check by itself?

I'm not convinced the sanity check is a good idea in the first place. It
just adds complexity.

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [PATCH v3 00/24] KVM: TDX huge page support for private memory
  2026-01-06 23:43         ` Sean Christopherson
  2026-01-07  9:03           ` Yan Zhao
@ 2026-01-07 19:22           ` Edgecombe, Rick P
  2026-01-07 20:27             ` Sean Christopherson
  2026-01-12 20:15           ` Ackerley Tng
  2 siblings, 1 reply; 127+ messages in thread
From: Edgecombe, Rick P @ 2026-01-07 19:22 UTC (permalink / raw)
  To: ackerleytng@google.com, seanjc@google.com
  Cc: kvm@vger.kernel.org, Du, Fan, Li, Xiaoyao, Huang, Kai,
	Hansen, Dave, Zhao, Yan Y, thomas.lendacky@amd.com,
	vbabka@suse.cz, tabba@google.com, david@kernel.org,
	linux-kernel@vger.kernel.org, kas@kernel.org, Weiny, Ira,
	pbonzini@redhat.com, binbin.wu@linux.intel.com, Yamahata, Isaku,
	michael.roth@amd.com, nik.borisov@suse.com, Peng, Chao P,
	francescolavra.fl@gmail.com, Annapurve, Vishal, sagis@google.com,
	Gao, Chao, Miao, Jun, jgross@suse.com, pgonda@google.com,
	x86@kernel.org

On Tue, 2026-01-06 at 15:43 -0800, Sean Christopherson wrote:
> Mapping a hugepage for memory that KVM _knows_ is contiguous and homogenous is
> conceptually totally fine, i.e. I'm not totally opposed to adding support for
> mapping multiple guest_memfd folios with a single hugepage.   As to whether we
> do (a) nothing, (b) change the refcounting, or (c) add support for mapping
> multiple folios in one page, probably comes down to which option provides "good
> enough" performance without incurring too much complexity.

Can we add "whether we can push it off to the future" to the considerations
list? The in-flight gmem stuff is pretty complex and this doesn't seem to have
an ABI intersection.

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [PATCH v3 00/24] KVM: TDX huge page support for private memory
  2026-01-07 19:22           ` Edgecombe, Rick P
@ 2026-01-07 20:27             ` Sean Christopherson
  0 siblings, 0 replies; 127+ messages in thread
From: Sean Christopherson @ 2026-01-07 20:27 UTC (permalink / raw)
  To: Rick P Edgecombe
  Cc: ackerleytng@google.com, kvm@vger.kernel.org, Fan Du, Xiaoyao Li,
	Kai Huang, Dave Hansen, Yan Y Zhao, thomas.lendacky@amd.com,
	vbabka@suse.cz, tabba@google.com, david@kernel.org,
	linux-kernel@vger.kernel.org, kas@kernel.org, Ira Weiny,
	pbonzini@redhat.com, binbin.wu@linux.intel.com, Isaku Yamahata,
	michael.roth@amd.com, nik.borisov@suse.com, Chao P Peng,
	francescolavra.fl@gmail.com, Vishal Annapurve, sagis@google.com,
	Chao Gao, Jun Miao, jgross@suse.com, pgonda@google.com,
	x86@kernel.org

On Wed, Jan 07, 2026, Rick P Edgecombe wrote:
> On Tue, 2026-01-06 at 15:43 -0800, Sean Christopherson wrote:
> > Mapping a hugepage for memory that KVM _knows_ is contiguous and homogenous is
> > conceptually totally fine, i.e. I'm not totally opposed to adding support for
> > mapping multiple guest_memfd folios with a single hugepage.   As to whether we
> > do (a) nothing, (b) change the refcounting, or (c) add support for mapping
> > multiple folios in one page, probably comes down to which option provides "good
> > enough" performance without incurring too much complexity.
> 
> Can we add "whether we can push it off to the future" to the considerations
> list? The in-flight gmem stuff is pretty complex and this doesn't seem to have
> an ABI intersection.

Ya, for sure.  The only wrinkle I can think of is if the refcounting somehow
bleeds into userspace, but that seems like it'd be a flaw on its own.

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [PATCH v3 01/24] x86/tdx: Enhance tdh_mem_page_aug() to support huge pages
  2026-01-07 16:39       ` Dave Hansen
@ 2026-01-08 19:05         ` Ackerley Tng
  2026-01-08 19:24           ` Dave Hansen
  2026-01-09  3:08         ` Yan Zhao
  1 sibling, 1 reply; 127+ messages in thread
From: Ackerley Tng @ 2026-01-08 19:05 UTC (permalink / raw)
  To: Dave Hansen, Yan Zhao
  Cc: pbonzini, seanjc, linux-kernel, kvm, x86, rick.p.edgecombe, kas,
	tabba, michael.roth, david, vannapurve, sagis, vbabka,
	thomas.lendacky, nik.borisov, pgonda, fan.du, jun.miao,
	francescolavra.fl, jgross, ira.weiny, isaku.yamahata, xiaoyao.li,
	kai.huang, binbin.wu, chao.p.peng, chao.gao

Dave Hansen <dave.hansen@intel.com> writes:

> On 1/7/26 01:12, Yan Zhao wrote:
> ...
>> However, my understanding is that it's better for functions expecting huge pages
>> to explicitly receive "folio" instead of "page". This way, people can tell from
>> a function's declaration what the function expects. Is this understanding
>> correct?
>
> In a perfect world, maybe.
>
> But, in practice, a 'struct page' can still represent huge pages and
> *does* represent huge pages all over the kernel. There's no need to cram
> a folio in here just because a huge page is involved.
>
>> Passing "start_idx" along with "folio" is due to the requirement of mapping only
>> a sub-range of a huge folio. e.g., we allow creating a 2MB mapping starting from
>> the nth idx of a 1GB folio.
>>
>> On the other hand, if we instead pass "page" to tdh_mem_page_aug() for huge
>> pages and have tdh_mem_page_aug() internally convert it to "folio" and
>> "start_idx", it makes me wonder if we could have previously just passed "pfn" to
>> tdh_mem_page_aug() and had tdh_mem_page_aug() convert it to "page".
>
> As a general pattern, I discourage folks from using pfns and physical
> addresses when passing around references to physical memory. They have
> zero type safety.
>
> It's also not just about type safety. A 'struct page' also *means*
> something. It means that the kernel is, on some level, aware of and
> managing that memory. It's not MMIO. It doesn't represent the physical
> address of the APIC page. It's not SGX memory. It doesn't have a
> Shared/Private bit.
>

I agree that the use of struct pages is better than the use of struct
folios. I think the use of folios unnecessarily couples low level TDX
code to memory metadata (pages and folios) in the kernel.

> All of those properties are important and they're *GONE* if you use a
> pfn. It's even worse if you use a raw physical address.
>

We were thinking through what it would take to have TDs use VM_PFNMAP
memory, where the memory may not actually have associated struct
pages. Without further work, having struct pages in the TDX interface
would kind of lock out those sources of memory. Is TDX open to using
non-kernel managed memory?

> Please don't go back to raw integers (pfns or paddrs).
>

I guess what we're all looking for is a type representing regular memory
(to exclude MMIO/APIC pages/SGX/etc) that isn't limited to memory the
kernel manages.

Perhaps the best we have now is still pfn/paddrs + nr_pages, and having
the callers of TDX functions handle/ensure the checking required to
exclude unsupported types of memory.

For type safety, would phyrs help? [1] Perhaps starting with pfn/paddrs
+ nr_pages would allow transitioning to phyrs later. Using pages would
be okay for now, but I would rather not use folios.

[1] https://lore.kernel.org/all/YdyKWeU0HTv8m7wD@casper.infradead.org/

>>>> -	tdx_clflush_page(page);
>>>> +	if (start_idx + npages > folio_nr_pages(folio))
>>>> +		return TDX_OPERAND_INVALID;
>>>
>>> Why is this necessary? Would it be a bug if this happens?
>> This sanity check is due to the requirement in KVM that mapping size should be
>> no larger than the backend folio size, which ensures the mapping pages are
>> physically contiguous with homogeneous page attributes. (See the discussion
>> about "EPT mapping size and folio size" in thread [1]).
>>
>> Failure of the sanity check could only be due to bugs in the caller (KVM). I
>> didn't convert the sanity check to an assertion because there's already a
>> TDX_BUG_ON_2() on error following the invocation of tdh_mem_page_aug() in KVM.
>
> We generally don't protect against bugs in callers. Otherwise, we'd have
> a trillion NULL checks in every function in the kernel.
>
> The only reason to add caller sanity checks is to make things easier to
> debug, and those almost always include some kind of spew:
> WARN_ON_ONCE(), pr_warn(), etc...
>
>>>> +	for (int i = 0; i < npages; i++)
>>>> +		tdx_clflush_page(folio_page(folio, start_idx + i));
>>>
>>> All of the page<->folio conversions are kinda hurting my brain. I think
>>> we need to decide what the canonical type for these things is in TDX, do
>>> the conversion once, and stick with it.
>> Got it!
>>
>> Since passing in base "page" or base "pfn" may still require the
>> wrappers/helpers to internally convert them to "folio" for sanity checks, could
>> we decide that "folio" and "start_idx" are the canonical params for functions
>> expecting huge pages? Or do you prefer KVM to do the sanity check by itself?
>
> I'm not convinced the sanity check is a good idea in the first place. It
> just adds complexity.

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [PATCH v3 01/24] x86/tdx: Enhance tdh_mem_page_aug() to support huge pages
  2026-01-08 19:05         ` Ackerley Tng
@ 2026-01-08 19:24           ` Dave Hansen
  2026-01-09 16:21             ` Vishal Annapurve
  0 siblings, 1 reply; 127+ messages in thread
From: Dave Hansen @ 2026-01-08 19:24 UTC (permalink / raw)
  To: Ackerley Tng, Yan Zhao
  Cc: pbonzini, seanjc, linux-kernel, kvm, x86, rick.p.edgecombe, kas,
	tabba, michael.roth, david, vannapurve, sagis, vbabka,
	thomas.lendacky, nik.borisov, pgonda, fan.du, jun.miao,
	francescolavra.fl, jgross, ira.weiny, isaku.yamahata, xiaoyao.li,
	kai.huang, binbin.wu, chao.p.peng, chao.gao

On 1/8/26 11:05, Ackerley Tng wrote:
...
>> All of those properties are important and they're *GONE* if you use a
>> pfn. It's even worse if you use a raw physical address.
> 
> We were thinking through what it would take to have TDs use VM_PFNMAP
> memory, where the memory may not actually have associated struct
> pages. Without further work, having struct pages in the TDX interface
> would kind of lock out those sources of memory. Is TDX open to using
> non-kernel managed memory?

I was afraid someone was going to bring that up. I'm not open to such a
beast today. I'd certainly look at the patches, but it would be a hard
sell and it would need an awfully strong justification.

> For type safety, would phyrs help? [1] Perhaps starting with pfn/paddrs
> + nr_pages would allow transitioning to phyrs later. Using pages would
> be okay for now, but I would rather not use folios.

I don't have any first-hand experience with phyrs. It seems interesting,
but might be unwieldy to use in practice, kinda like how the proposed code
got messy when folios got thrown in.

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [PATCH v3 00/24] KVM: TDX huge page support for private memory
  2026-01-07  9:03           ` Yan Zhao
@ 2026-01-08 20:11             ` Ackerley Tng
  2026-01-09  9:18               ` Yan Zhao
  0 siblings, 1 reply; 127+ messages in thread
From: Ackerley Tng @ 2026-01-08 20:11 UTC (permalink / raw)
  To: Yan Zhao, Sean Christopherson
  Cc: Vishal Annapurve, pbonzini, linux-kernel, kvm, x86,
	rick.p.edgecombe, dave.hansen, kas, tabba, michael.roth, david,
	sagis, vbabka, thomas.lendacky, nik.borisov, pgonda, fan.du,
	jun.miao, francescolavra.fl, jgross, ira.weiny, isaku.yamahata,
	xiaoyao.li, kai.huang, binbin.wu, chao.p.peng, chao.gao

Yan Zhao <yan.y.zhao@intel.com> writes:

> On Tue, Jan 06, 2026 at 03:43:29PM -0800, Sean Christopherson wrote:
>> On Tue, Jan 06, 2026, Ackerley Tng wrote:
>> > Sean Christopherson <seanjc@google.com> writes:
>> >
>> > > On Tue, Jan 06, 2026, Ackerley Tng wrote:
>> > >> Vishal Annapurve <vannapurve@google.com> writes:
>> > >>
>> > >> > On Tue, Jan 6, 2026 at 2:19 AM Yan Zhao <yan.y.zhao@intel.com> wrote:
>> > >> >>
>> > >> >> - EPT mapping size and folio size
>> > >> >>
>> > >> >>   This series is built upon the rule in KVM that the mapping size in the
>> > >> >>   KVM-managed secondary MMU is no larger than the backend folio size.
>> > >> >>
>> > >>
>> > >> I'm not familiar with this rule and would like to find out more. Why is
>> > >> this rule imposed?
>> > >
>> > > Because it's the only sane way to safely map memory into the guest? :-D
>> > >
>> > >> Is this rule there just because traditionally folio sizes also define the
>> > >> limit of contiguity, and so the mapping size must not be greater than folio
>> > >> size in case the block of memory represented by the folio is not contiguous?
>> > >
>> > > Pre-guest_memfd, KVM didn't care about folios.  KVM's mapping size was (and still
>> > > is) strictly bound by the host mapping size.  That's handles contiguous addresses,
>> > > but it _also_ handles contiguous protections (e.g. RWX) and other attributes.
>> > >
>> > >> In guest_memfd's case, even if the folio is split (just for refcount
>> > >> tracking purposes on private to shared conversion), the memory is still
>> > >> contiguous up to the original folio's size. Will the contiguity address
>> > >> the concerns?
>> > >
>> > > Not really?  Why would the folio be split if the memory _and its attributes_ are
>> > > fully contiguous?  If the attributes are mixed, KVM must not create a mapping
>> > > spanning mixed ranges, i.e. with multiple folios.
>> >
>> > The folio can be split if any (or all) of the pages in a huge page range
>> > are shared (in the CoCo sense). So in a 1G block of memory, even if the
>> > attributes all read 0 (!KVM_MEMORY_ATTRIBUTE_PRIVATE), the folio
>> > would be split, and the split folios are necessary for tracking users of
>> > shared pages using struct page refcounts.
>>
>> Ahh, that's what the refcounting was referring to.  Gotcha.
>>
>> > However the split folios in that 1G range are still fully contiguous.
>> >
>> > The process of conversion will split the EPT entries soon after the
>> > folios are split so the rule remains upheld.

Correction here: If we go with splitting from 1G to 4K uniformly on
sharing, only the EPT entries around the shared 4K folio will have their
page table entries split, so many of the EPT entries will be at 2M level
though the folios are 4K sized. This would last beyond the conversion
process.

> Overall, I don't think allowing folios smaller than the mappings while
> conversion is in progress brings enough benefit.
>

I'll look into making the restructuring process always succeed, but off
the top of my head that's hard because

1. HugeTLB Vmemmap Optimization code would have to be refactored to
   use pre-allocated pages, which is refactoring deep in HugeTLB code

2. If we want to split non-uniformly such that only the folios that are
   shared are 4K, and the remaining folios are as large as possible (PMD
   sized as much as possible), it gets complex to figure out how many
   pages to allocate ahead of time.

So it's complex and will probably delay HugeTLB+conversion support even
more!

> Cons:
> (1) TDX's zapping callback has no idea whether the zapping is caused by an
>     in-progress private-to-shared conversion or other reasons. It also has no
>     idea if the attributes of the underlying folios remain unchanged during an
>     in-progress private-to-shared conversion. Even if the assertion Ackerley
>     mentioned is true, it's not easy to drop the sanity checks in TDX's zapping
>     callback for in-progress private-to-shared conversion alone (which would
>     increase TDX's dependency on guest_memfd's specific implementation even if
>     it's feasible).
>
>     Removing the sanity checks entirely in TDX's zapping callback is confusing
>     and would show a bad/false expectation from KVM -- what if a huge folio is
>     incorrectly split while it's still mapped in KVM (by a buggy guest_memfd or
>     others) in other conditions? And then do we still need the check in TDX's
>     mapping callback? If not, does it mean TDX huge pages can stop relying on
>     guest_memfd's ability to allocate huge folios, as KVM could still create
>     huge mappings as long as small folios are physically contiguous with
>     homogeneous memory attributes?
>
> (2) Allowing folios smaller than the mapping would require splitting S-EPT in
>     kvm_gmem_error_folio() before kvm_gmem_zap(). Though one may argue that the
>     invalidate lock held in __kvm_gmem_set_attributes() could guard against
>     concurrent kvm_gmem_error_folio(), it still doesn't seem clean and looks
>     error-prone. (This may also apply to kvm_gmem_migrate_folio() potentially).
>

I think the central question I have among all the above is what TDX
needs to actually care about (putting aside KVM's folio size/memory
contiguity vs mapping level rule for a while).

I think TDX code can check what it cares about (if required to aid
debugging, as Dave suggested). Does TDX actually care about folio sizes,
or does it actually care about memory contiguity and alignment?

Separately, KVM could also enforce the folio size/memory contiguity vs
mapping level rule, but TDX code shouldn't enforce KVM's rules. So if
the check is deemed necessary, it still shouldn't be in TDX code, I
think.

> Pro: Preventing zapping private memory until conversion is successful is good.
>
> However, could we achieve this benefit in other ways? For example, is it
> possible to ensure hugetlb_restructuring_split_folio() can't fail by ensuring
> split_entries() can't fail (via pre-allocation?) and disabling hugetlb_vmemmap
> optimization? (hugetlb_vmemmap conversion is super slow according to my
> observation and I always disable it).

HugeTLB vmemmap optimization gives us 1.6% of memory in savings. For a
huge VM, multiplied by a large number of hosts, this is not a trivial
amount of memory. It's one of the key reasons why we are using HugeTLB
in guest_memfd in the first place, other than to be able to get high
level page table mappings. We want this in production.

> Or pre-allocation for
> vmemmap_remap_alloc()?
>

Will investigate if this is possible as mentioned above. Thanks for the
suggestion again!

> Dropping TDX's sanity check may only serve as our last resort. IMHO, zapping
> private memory before conversion succeeds is still better than introducing the
> mess between folio size and mapping size.
>
>> > I guess perhaps the question is, is it okay if the folios are smaller
>> > than the mapping while conversion is in progress? Does the order matter
>> > (split page table entries first vs split folios first)?
>>
>> Mapping a hugepage for memory that KVM _knows_ is contiguous and homogenous is
>> conceptually totally fine, i.e. I'm not totally opposed to adding support for
>> mapping multiple guest_memfd folios with a single hugepage.   As to whether we
>> do (a) nothing, (b) change the refcounting, or (c) add support for mapping
>> multiple folios in one page, probably comes down to which option provides "good
>> enough" performance without incurring too much complexity.

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [PATCH v3 01/24] x86/tdx: Enhance tdh_mem_page_aug() to support huge pages
  2026-01-07 16:39       ` Dave Hansen
  2026-01-08 19:05         ` Ackerley Tng
@ 2026-01-09  3:08         ` Yan Zhao
  2026-01-09 18:29           ` Ackerley Tng
  1 sibling, 1 reply; 127+ messages in thread
From: Yan Zhao @ 2026-01-09  3:08 UTC (permalink / raw)
  To: Dave Hansen
  Cc: pbonzini, seanjc, linux-kernel, kvm, x86, rick.p.edgecombe, kas,
	tabba, ackerleytng, michael.roth, david, vannapurve, sagis,
	vbabka, thomas.lendacky, nik.borisov, pgonda, fan.du, jun.miao,
	francescolavra.fl, jgross, ira.weiny, isaku.yamahata, xiaoyao.li,
	kai.huang, binbin.wu, chao.p.peng, chao.gao

On Wed, Jan 07, 2026 at 08:39:55AM -0800, Dave Hansen wrote:
> On 1/7/26 01:12, Yan Zhao wrote:
> ...
> > However, my understanding is that it's better for functions expecting huge pages
> > to explicitly receive "folio" instead of "page". This way, people can tell from
> > a function's declaration what the function expects. Is this understanding
> > correct?
> 
> In a perfect world, maybe.
> 
> But, in practice, a 'struct page' can still represent huge pages and
> *does* represent huge pages all over the kernel. There's no need to cram
> a folio in here just because a huge page is involved.
Ok. I can modify the param "struct page *page" to "struct page *base_page", 
explaining that it may belong to a huge folio but is not necessarily the
head page of the folio.

> > Passing "start_idx" along with "folio" is due to the requirement of mapping only
> > a sub-range of a huge folio. e.g., we allow creating a 2MB mapping starting from
> > the nth idx of a 1GB folio.
> > 
> > On the other hand, if we instead pass "page" to tdh_mem_page_aug() for huge
> > pages and have tdh_mem_page_aug() internally convert it to "folio" and
> > "start_idx", it makes me wonder if we could have previously just passed "pfn" to
> > tdh_mem_page_aug() and had tdh_mem_page_aug() convert it to "page".
> 
> As a general pattern, I discourage folks from using pfns and physical
> addresses when passing around references to physical memory. They have
> zero type safety.
> 
> It's also not just about type safety. A 'struct page' also *means*
> something. It means that the kernel is, on some level, aware of and
> managing that memory. It's not MMIO. It doesn't represent the physical
> address of the APIC page. It's not SGX memory. It doesn't have a
> Shared/Private bit.
> 
> All of those properties are important and they're *GONE* if you use a
> pfn. It's even worse if you use a raw physical address.
> 
> Please don't go back to raw integers (pfns or paddrs).
I understood and fully accept it.

I previously wondered if we could allow KVM to pass in pfn and let the SEAMCALL
wrapper do the pfn_to_page() conversion.
But it was just out of curiosity. I actually prefer "struct page" too.


> >>> -	tdx_clflush_page(page);
> >>> +	if (start_idx + npages > folio_nr_pages(folio))
> >>> +		return TDX_OPERAND_INVALID;
> >>
> >> Why is this necessary? Would it be a bug if this happens?
> > This sanity check is due to the requirement in KVM that mapping size should be
> > no larger than the backend folio size, which ensures the mapping pages are
> > physically contiguous with homogeneous page attributes. (See the discussion
> > about "EPT mapping size and folio size" in thread [1]).
> > 
> > Failure of the sanity check could only be due to bugs in the caller (KVM). I
> > didn't convert the sanity check to an assertion because there's already a
> > TDX_BUG_ON_2() on error following the invocation of tdh_mem_page_aug() in KVM.
> 
> We generally don't protect against bugs in callers. Otherwise, we'd have
> a trillion NULL checks in every function in the kernel.
> 
> The only reason to add caller sanity checks is to make things easier to
> debug, and those almost always include some kind of spew:
> WARN_ON_ONCE(), pr_warn(), etc...

Would it be better if I used WARN_ON_ONCE(), like this:

u64 tdh_mem_page_aug(struct tdx_td *td, u64 gpa, int level, struct page *base_page,
                     u64 *ext_err1, u64 *ext_err2)
{
        unsigned long npages = tdx_sept_level_to_npages(level);
        struct tdx_module_args args = {
                .rcx = gpa | level,
                .rdx = tdx_tdr_pa(td),
                .r8 = page_to_phys(base_page),
        };
        u64 ret;

        WARN_ON_ONCE(page_folio(base_page) != page_folio(base_page + npages - 1));

        for (int i = 0; i < npages; i++)
                tdx_clflush_page(base_page + i);

        ret = seamcall_ret(TDH_MEM_PAGE_AUG, &args);

        *ext_err1 = args.rcx;
        *ext_err2 = args.rdx;

        return ret;
}

The WARN_ON_ONCE() serves 2 purposes:
1. Loudly warn of subtle KVM bugs.
2. Ensure "page_to_pfn(base_page + i) == (page_to_pfn(base_page) + i)".

If you don't like using "base_page + i" (as the discussion in v2 [1]), we can
invoke folio_page() for the ith page instead.

[1] https://lore.kernel.org/all/01731a9a0346b08577fad75ae560c650145c7f39.camel@intel.com/

> >>> +	for (int i = 0; i < npages; i++)
> >>> +		tdx_clflush_page(folio_page(folio, start_idx + i));
> >>
> >> All of the page<->folio conversions are kinda hurting my brain. I think
> >> we need to decide what the canonical type for these things is in TDX, do
> >> the conversion once, and stick with it.
> > Got it!
> > 
> > Since passing in base "page" or base "pfn" may still require the
> > wrappers/helpers to internally convert them to "folio" for sanity checks, could
> > we decide that "folio" and "start_idx" are the canonical params for functions
> > expecting huge pages? Or do you prefer KVM to do the sanity check by itself?
> 
> I'm not convinced the sanity check is a good idea in the first place. It
> just adds complexity.
I'm worried about subtle bugs introduced by careless coding that might be
silently ignored otherwise, like the one in thread [2].

[2] https://lore.kernel.org/kvm/aV2A39fXgzuM4Toa@google.com/

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [PATCH v3 00/24] KVM: TDX huge page support for private memory
  2026-01-08 20:11             ` Ackerley Tng
@ 2026-01-09  9:18               ` Yan Zhao
  2026-01-09 16:12                 ` Vishal Annapurve
  0 siblings, 1 reply; 127+ messages in thread
From: Yan Zhao @ 2026-01-09  9:18 UTC (permalink / raw)
  To: Ackerley Tng
  Cc: Sean Christopherson, Vishal Annapurve, pbonzini, linux-kernel,
	kvm, x86, rick.p.edgecombe, dave.hansen, kas, tabba, michael.roth,
	david, sagis, vbabka, thomas.lendacky, nik.borisov, pgonda,
	fan.du, jun.miao, francescolavra.fl, jgross, ira.weiny,
	isaku.yamahata, xiaoyao.li, kai.huang, binbin.wu, chao.p.peng,
	chao.gao

On Thu, Jan 08, 2026 at 12:11:14PM -0800, Ackerley Tng wrote:
> Yan Zhao <yan.y.zhao@intel.com> writes:
> 
> > On Tue, Jan 06, 2026 at 03:43:29PM -0800, Sean Christopherson wrote:
> >> On Tue, Jan 06, 2026, Ackerley Tng wrote:
> >> > Sean Christopherson <seanjc@google.com> writes:
> >> >
> >> > > On Tue, Jan 06, 2026, Ackerley Tng wrote:
> >> > >> Vishal Annapurve <vannapurve@google.com> writes:
> >> > >>
> >> > >> > On Tue, Jan 6, 2026 at 2:19 AM Yan Zhao <yan.y.zhao@intel.com> wrote:
> >> > >> >>
> >> > >> >> - EPT mapping size and folio size
> >> > >> >>
> >> > >> >>   This series is built upon the rule in KVM that the mapping size in the
> >> > >> >>   KVM-managed secondary MMU is no larger than the backend folio size.
> >> > >> >>
> >> > >>
> >> > >> I'm not familiar with this rule and would like to find out more. Why is
> >> > >> this rule imposed?
> >> > >
> >> > > Because it's the only sane way to safely map memory into the guest? :-D
> >> > >
> >> > >> Is this rule there just because traditionally folio sizes also define the
> >> > >> limit of contiguity, and so the mapping size must not be greater than folio
> >> > >> size in case the block of memory represented by the folio is not contiguous?
> >> > >
> >> > > Pre-guest_memfd, KVM didn't care about folios.  KVM's mapping size was (and still
> >> > > is) strictly bound by the host mapping size.  That handles contiguous addresses,
> >> > > but it _also_ handles contiguous protections (e.g. RWX) and other attributes.
> >> > >
> >> > >> In guest_memfd's case, even if the folio is split (just for refcount
> >> > >> tracking purposes on private to shared conversion), the memory is still
> >> > >> contiguous up to the original folio's size. Will the contiguity address
> >> > >> the concerns?
> >> > >
> >> > > Not really?  Why would the folio be split if the memory _and its attributes_ are
> >> > > fully contiguous?  If the attributes are mixed, KVM must not create a mapping
> >> > > spanning mixed ranges, i.e. with multiple folios.
> >> >
> >> > The folio can be split if any (or all) of the pages in a huge page range
> >> > are shared (in the CoCo sense). So in a 1G block of memory, even if the
> >> > attributes all read 0 (!KVM_MEMORY_ATTRIBUTE_PRIVATE), the folio
> >> > would be split, and the split folios are necessary for tracking users of
> >> > shared pages using struct page refcounts.
> >>
> >> Ahh, that's what the refcounting was referring to.  Gotcha.
> >>
> >> > However the split folios in that 1G range are still fully contiguous.
> >> >
> >> > The process of conversion will split the EPT entries soon after the
> >> > folios are split so the rule remains upheld.
> 
> Correction here: If we go with splitting from 1G to 4K uniformly on
> sharing, only the EPT entries around the shared 4K folio will have their
> page table entries split, so many of the EPT entries will be at 2M level
> though the folios are 4K sized. This would last beyond the conversion
> process.
> 
> > Overall, I don't think allowing folios smaller than the mappings while
> > conversion is in progress brings enough benefit.
> >
> 
> I'll look into making the restructuring process always succeed, but off
> the top of my head that's hard because
> 
> 1. HugeTLB Vmemmap Optimization code would have to be refactored to
>    use pre-allocated pages, which is refactoring deep in HugeTLB code
> 
> 2. If we want to split non-uniformly such that only the folios that are
>    shared are 4K, and the remaining folios are as large as possible (PMD
>    sized as much as possible), it gets complex to figure out how many
>    pages to allocate ahead of time.
> 
> So it's complex and will probably delay HugeTLB+conversion support even
> more!
> 
> > Cons:
> > (1) TDX's zapping callback has no idea whether the zapping is caused by an
> >     in-progress private-to-shared conversion or other reasons. It also has no
> >     idea if the attributes of the underlying folios remain unchanged during an
> >     in-progress private-to-shared conversion. Even if the assertion Ackerley
> >     mentioned is true, it's not easy to drop the sanity checks in TDX's zapping
> >     callback for in-progress private-to-shared conversion alone (which would
> >     increase TDX's dependency on guest_memfd's specific implementation even if
> >     it's feasible).
> >
> >     Removing the sanity checks entirely in TDX's zapping callback is confusing
> >     and would show a bad/false expectation from KVM -- what if a huge folio is
> >     incorrectly split while it's still mapped in KVM (by a buggy guest_memfd or
> >     others) in other conditions? And then do we still need the check in TDX's
> >     mapping callback? If not, does it mean TDX huge pages can stop relying on
> >     guest_memfd's ability to allocate huge folios, as KVM could still create
> >     huge mappings as long as small folios are physically contiguous with
> >     homogeneous memory attributes?
> >
> > (2) Allowing folios smaller than the mapping would require splitting S-EPT in
> >     kvm_gmem_error_folio() before kvm_gmem_zap(). Though one may argue that the
> >     invalidate lock held in __kvm_gmem_set_attributes() could guard against
> >     concurrent kvm_gmem_error_folio(), it still doesn't seem clean and looks
> >     error-prone. (This may also apply to kvm_gmem_migrate_folio() potentially).
> >
> 
> I think the central question I have among all the above is what TDX
> needs to actually care about (putting aside KVM's folio size/memory
> contiguity vs mapping level rule for a while).
> 
> I think TDX code can check what it cares about (if required to aid
> debugging, as Dave suggested). Does TDX actually care about folio sizes,
> or does it actually care about memory contiguity and alignment?
TDX cares about memory contiguity. A single folio ensures memory contiguity.

Allowing one S-EPT mapping to cover multiple folios may also mean it's no longer
reasonable to pass "struct page" to tdh_phymem_page_wbinvd_hkid() for a
contiguous range larger than the page's folio range.

Additionally, we don't split private mappings in kvm_gmem_error_folio().
If smaller folios are allowed, splitting private mappings would be required
there (e.g., after splitting a 1GB folio to 4KB folios while 2MB mappings
remain). Also, is it possible for splitting a huge folio to fail partially,
without merging the huge folio back or further zapping?
Not sure if there are other edge cases we're still missing.

> Separately, KVM could also enforce the folio size/memory contiguity vs
> mapping level rule, but TDX code shouldn't enforce KVM's rules. So if
> the check is deemed necessary, it still shouldn't be in TDX code, I
> think.
> 
> > Pro: Preventing zapping private memory until conversion is successful is good.
> >
> > However, could we achieve this benefit in other ways? For example, is it
> > possible to ensure hugetlb_restructuring_split_folio() can't fail by ensuring
> > split_entries() can't fail (via pre-allocation?) and disabling hugetlb_vmemmap
> > optimization? (hugetlb_vmemmap conversion is super slow according to my
> > observation and I always disable it).
> 
> HugeTLB vmemmap optimization gives us 1.6% of memory in savings. For a
> huge VM, multiplied by a large number of hosts, this is not a trivial
> amount of memory. It's one of the key reasons why we are using HugeTLB
> in guest_memfd in the first place, other than to be able to get high
> level page table mappings. We want this in production.
> 
> > Or pre-allocation for
> > vmemmap_remap_alloc()?
> >
> 
> Will investigate if this is possible as mentioned above. Thanks for the
> suggestion again!
> 
> > Dropping TDX's sanity check may only serve as our last resort. IMHO, zapping
> > private memory before conversion succeeds is still better than introducing the
> > mess between folio size and mapping size.
> >
> >> > I guess perhaps the question is, is it okay if the folios are smaller
> >> > than the mapping while conversion is in progress? Does the order matter
> >> > (split page table entries first vs split folios first)?
> >>
> >> Mapping a hugepage for memory that KVM _knows_ is contiguous and homogenous is
> >> conceptually totally fine, i.e. I'm not totally opposed to adding support for
> >> mapping multiple guest_memfd folios with a single hugepage.   As to whether we
> >> do (a) nothing, (b) change the refcounting, or (c) add support for mapping
> >> multiple folios in one page, probably comes down to which option provides "good
> >> enough" performance without incurring too much complexity.
> 

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [PATCH v3 00/24] KVM: TDX huge page support for private memory
  2026-01-09  9:18               ` Yan Zhao
@ 2026-01-09 16:12                 ` Vishal Annapurve
  2026-01-09 17:16                   ` Vishal Annapurve
  2026-01-09 18:07                   ` Ackerley Tng
  0 siblings, 2 replies; 127+ messages in thread
From: Vishal Annapurve @ 2026-01-09 16:12 UTC (permalink / raw)
  To: Yan Zhao
  Cc: Ackerley Tng, Sean Christopherson, pbonzini, linux-kernel, kvm,
	x86, rick.p.edgecombe, dave.hansen, kas, tabba, michael.roth,
	david, sagis, vbabka, thomas.lendacky, nik.borisov, pgonda,
	fan.du, jun.miao, francescolavra.fl, jgross, ira.weiny,
	isaku.yamahata, xiaoyao.li, kai.huang, binbin.wu, chao.p.peng,
	chao.gao

On Fri, Jan 9, 2026 at 1:21 AM Yan Zhao <yan.y.zhao@intel.com> wrote:
>
> On Thu, Jan 08, 2026 at 12:11:14PM -0800, Ackerley Tng wrote:
> > Yan Zhao <yan.y.zhao@intel.com> writes:
> >
> > > On Tue, Jan 06, 2026 at 03:43:29PM -0800, Sean Christopherson wrote:
> > >> On Tue, Jan 06, 2026, Ackerley Tng wrote:
> > >> > Sean Christopherson <seanjc@google.com> writes:
> > >> >
> > >> > > On Tue, Jan 06, 2026, Ackerley Tng wrote:
> > >> > >> Vishal Annapurve <vannapurve@google.com> writes:
> > >> > >>
> > >> > >> > On Tue, Jan 6, 2026 at 2:19 AM Yan Zhao <yan.y.zhao@intel.com> wrote:
> > >> > >> >>
> > >> > >> >> - EPT mapping size and folio size
> > >> > >> >>
> > >> > >> >>   This series is built upon the rule in KVM that the mapping size in the
> > >> > >> >>   KVM-managed secondary MMU is no larger than the backend folio size.
> > >> > >> >>
> > >> > >>
> > >> > >> I'm not familiar with this rule and would like to find out more. Why is
> > >> > >> this rule imposed?
> > >> > >
> > >> > > Because it's the only sane way to safely map memory into the guest? :-D
> > >> > >
> > >> > >> Is this rule there just because traditionally folio sizes also define the
> > >> > >> limit of contiguity, and so the mapping size must not be greater than folio
> > >> > >> size in case the block of memory represented by the folio is not contiguous?
> > >> > >
> > >> > > Pre-guest_memfd, KVM didn't care about folios.  KVM's mapping size was (and still
> > >> > > is) strictly bound by the host mapping size.  That handles contiguous addresses,
> > >> > > but it _also_ handles contiguous protections (e.g. RWX) and other attributes.
> > >> > >
> > >> > >> In guest_memfd's case, even if the folio is split (just for refcount
> > >> > >> tracking purposes on private to shared conversion), the memory is still
> > >> > >> contiguous up to the original folio's size. Will the contiguity address
> > >> > >> the concerns?
> > >> > >
> > >> > > Not really?  Why would the folio be split if the memory _and its attributes_ are
> > >> > > fully contiguous?  If the attributes are mixed, KVM must not create a mapping
> > >> > > spanning mixed ranges, i.e. with multiple folios.
> > >> >
> > >> > The folio can be split if any (or all) of the pages in a huge page range
> > >> > are shared (in the CoCo sense). So in a 1G block of memory, even if the
> > >> > attributes all read 0 (!KVM_MEMORY_ATTRIBUTE_PRIVATE), the folio
> > >> > would be split, and the split folios are necessary for tracking users of
> > >> > shared pages using struct page refcounts.
> > >>
> > >> Ahh, that's what the refcounting was referring to.  Gotcha.
> > >>
> > >> > However the split folios in that 1G range are still fully contiguous.
> > >> >
> > >> > The process of conversion will split the EPT entries soon after the
> > >> > folios are split so the rule remains upheld.
> >
> > Correction here: If we go with splitting from 1G to 4K uniformly on
> > sharing, only the EPT entries around the shared 4K folio will have their
> > page table entries split, so many of the EPT entries will be at 2M level
> > though the folios are 4K sized. This would last beyond the conversion
> > process.
> >
> > > Overall, I don't think allowing folios smaller than the mappings while
> > > conversion is in progress brings enough benefit.
> > >
> >
> > I'll look into making the restructuring process always succeed, but off
> > the top of my head that's hard because
> >
> > 1. HugeTLB Vmemmap Optimization code would have to be refactored to
> >    use pre-allocated pages, which is refactoring deep in HugeTLB code
> >
> > 2. If we want to split non-uniformly such that only the folios that are
> >    shared are 4K, and the remaining folios are as large as possible (PMD
> >    sized as much as possible), it gets complex to figure out how many
> >    pages to allocate ahead of time.
> >
> > So it's complex and will probably delay HugeTLB+conversion support even
> > more!
> >
> > > Cons:
> > > (1) TDX's zapping callback has no idea whether the zapping is caused by an
> > >     in-progress private-to-shared conversion or other reasons. It also has no
> > >     idea if the attributes of the underlying folios remain unchanged during an
> > >     in-progress private-to-shared conversion. Even if the assertion Ackerley
> > >     mentioned is true, it's not easy to drop the sanity checks in TDX's zapping
> > >     callback for in-progress private-to-shared conversion alone (which would
> > >     increase TDX's dependency on guest_memfd's specific implementation even if
> > >     it's feasible).
> > >
> > >     Removing the sanity checks entirely in TDX's zapping callback is confusing
> > >     and would show a bad/false expectation from KVM -- what if a huge folio is
> > >     incorrectly split while it's still mapped in KVM (by a buggy guest_memfd or
> > >     others) in other conditions? And then do we still need the check in TDX's
> > >     mapping callback? If not, does it mean TDX huge pages can stop relying on
> > >     guest_memfd's ability to allocate huge folios, as KVM could still create
> > >     huge mappings as long as small folios are physically contiguous with
> > >     homogeneous memory attributes?
> > >
> > > (2) Allowing folios smaller than the mapping would require splitting S-EPT in
> > >     kvm_gmem_error_folio() before kvm_gmem_zap(). Though one may argue that the
> > >     invalidate lock held in __kvm_gmem_set_attributes() could guard against
> > >     concurrent kvm_gmem_error_folio(), it still doesn't seem clean and looks
> > >     error-prone. (This may also apply to kvm_gmem_migrate_folio() potentially).
> > >
> >
> > I think the central question I have among all the above is what TDX
> > needs to actually care about (putting aside KVM's folio size/memory
> > contiguity vs mapping level rule for a while).
> >
> > I think TDX code can check what it cares about (if required to aid
> > debugging, as Dave suggested). Does TDX actually care about folio sizes,
> > or does it actually care about memory contiguity and alignment?
> TDX cares about memory contiguity. A single folio ensures memory contiguity.

In this slightly unusual case, I think the guarantee needed here is
that as long as a range is mapped into SEPT entries, guest_memfd
ensures that the complete range stays private.

i.e. I think it should be safe to rely on guest_memfd here,
irrespective of the folio sizes:
1) KVM TDX stack should be able to reclaim the complete range when unmapping.
2) KVM TDX stack can assume that as long as memory is mapped in SEPT
entries, guest_memfd will not let host userspace mappings access
guest private memory.

>
> Allowing one S-EPT mapping to cover multiple folios may also mean it's no longer
> reasonable to pass "struct page" to tdh_phymem_page_wbinvd_hkid() for a
> contiguous range larger than the page's folio range.

What's the issue with passing the (struct page*, unsigned long nr_pages) pair?

>
> Additionally, we don't split private mappings in kvm_gmem_error_folio().
> If smaller folios are allowed, splitting private mapping is required there.

Yes, I believe splitting private mappings will be invoked to ensure
that the whole huge folio is not unmapped from KVM due to an error on
just a 4K page. Is that a problem?

If splitting fails, the implementation can fall back to completely
zapping the folio range.

> (e.g., after splitting a 1GB folio to 4KB folios with 2MB mappings. Also, is it
> possible for splitting a huge folio to fail partially, without merging the huge
> folio back or further zapping?).

Yes, splitting can fail partially, but guest_memfd will not make the
ranges available to host userspace and derivatives until:
1) The complete range to be converted is split to 4K granularity.
2) The complete range to be converted is zapped from KVM EPT mappings.
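
To make the two conditions above concrete, here is a toy userspace model
(illustrative only; none of these names exist in the kernel or in the
patches) of the gate being described: a range under conversion becomes
available to the host only once every folio in it has been split to 4K
granularity and the range has been zapped from the EPT, so a partial
split failure keeps it private:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model of a range undergoing private-to-shared conversion. */
struct toy_range {
	size_t nr_folios;
	bool *folio_is_4k;	/* per folio: already split down to 4K? */
	bool zapped;		/* zapped from KVM EPT mappings? */
};

/* Host access is gated on both conditions; a partial split fails the gate. */
static bool toy_host_access_allowed(const struct toy_range *r)
{
	for (size_t i = 0; i < r->nr_folios; i++)
		if (!r->folio_is_4k[i])
			return false;
	return r->zapped;
}
```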

> Not sure if there're other edge cases we're still missing.
>
> > Separately, KVM could also enforce the folio size/memory contiguity vs
> > mapping level rule, but TDX code shouldn't enforce KVM's rules. So if
> > the check is deemed necessary, it still shouldn't be in TDX code, I
> > think.
> >
> > > Pro: Preventing zapping private memory until conversion is successful is good.
> > >
> > > However, could we achieve this benefit in other ways? For example, is it
> > > possible to ensure hugetlb_restructuring_split_folio() can't fail by ensuring
> > > split_entries() can't fail (via pre-allocation?) and disabling hugetlb_vmemmap
> > > optimization? (hugetlb_vmemmap conversion is super slow according to my
> > > observation and I always disable it).
> >
> > HugeTLB vmemmap optimization gives us 1.6% of memory in savings. For a
> > huge VM, multiplied by a large number of hosts, this is not a trivial
> > amount of memory. It's one of the key reasons why we are using HugeTLB
> > in guest_memfd in the first place, other than to be able to get high
> > level page table mappings. We want this in production.
> >
> > > Or pre-allocation for
> > > vmemmap_remap_alloc()?
> > >
> >
> > Will investigate if this is possible as mentioned above. Thanks for the
> > suggestion again!
> >
> > > Dropping TDX's sanity check may only serve as our last resort. IMHO, zapping
> > > private memory before conversion succeeds is still better than introducing the
> > > mess between folio size and mapping size.
> > >
> > >> > I guess perhaps the question is, is it okay if the folios are smaller
> > >> > than the mapping while conversion is in progress? Does the order matter
> > >> > (split page table entries first vs split folios first)?
> > >>
> > >> Mapping a hugepage for memory that KVM _knows_ is contiguous and homogenous is
> > >> conceptually totally fine, i.e. I'm not totally opposed to adding support for
> > >> mapping multiple guest_memfd folios with a single hugepage.   As to whether we
> > >> do (a) nothing, (b) change the refcounting, or (c) add support for mapping
> > >> multiple folios in one page, probably comes down to which option provides "good
> > >> enough" performance without incurring too much complexity.
> >

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [PATCH v3 01/24] x86/tdx: Enhance tdh_mem_page_aug() to support huge pages
  2026-01-08 19:24           ` Dave Hansen
@ 2026-01-09 16:21             ` Vishal Annapurve
  0 siblings, 0 replies; 127+ messages in thread
From: Vishal Annapurve @ 2026-01-09 16:21 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Ackerley Tng, Yan Zhao, pbonzini, seanjc, linux-kernel, kvm, x86,
	rick.p.edgecombe, kas, tabba, michael.roth, david, sagis, vbabka,
	thomas.lendacky, nik.borisov, pgonda, fan.du, jun.miao,
	francescolavra.fl, jgross, ira.weiny, isaku.yamahata, xiaoyao.li,
	kai.huang, binbin.wu, chao.p.peng, chao.gao

On Thu, Jan 8, 2026 at 11:24 AM Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 1/8/26 11:05, Ackerley Tng wrote:
> ...
> >> All of those properties are important and they're *GONE* if you use a
> >> pfn. It's even worse if you use a raw physical address.
> >
> > We were thinking through what it would take to have TDs use VM_PFNMAP
> > memory, where the memory may not actually have associated struct
> > pages. Without further work, having struct pages in the TDX interface
> > would kind of lock out those sources of memory. Is TDX open to using
> > non-kernel managed memory?
>
> I was afraid someone was going to bring that up. I'm not open to such a
> beast today. I'd certainly look at the patches, but it would be a hard
> sell and it would need an awfully strong justification.

Yeah, I will punt this discussion to later when we have something
working on the guest_memfd side. I expect that discussion will carry a
strong justification, backed by all the complexity in guest_memfd.

>
> > For type safety, would phyrs help? [1] Perhaps starting with pfn/paddrs
> > + nr_pages would allow transitioning to phyrs later. Using pages would
> > be okay for now, but I would rather not use folios.
>
> I don't have any first-hand experience with phyrs. It seems interesting,
> but might be unwieldy to use in practice, kinda how the proposed code
> got messy when folios got thrown in.

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [PATCH v3 00/24] KVM: TDX huge page support for private memory
  2026-01-09 16:12                 ` Vishal Annapurve
@ 2026-01-09 17:16                   ` Vishal Annapurve
  2026-01-09 18:07                   ` Ackerley Tng
  1 sibling, 0 replies; 127+ messages in thread
From: Vishal Annapurve @ 2026-01-09 17:16 UTC (permalink / raw)
  To: Yan Zhao
  Cc: Ackerley Tng, Sean Christopherson, pbonzini, linux-kernel, kvm,
	x86, rick.p.edgecombe, dave.hansen, kas, tabba, michael.roth,
	david, sagis, vbabka, thomas.lendacky, nik.borisov, pgonda,
	fan.du, jun.miao, francescolavra.fl, jgross, ira.weiny,
	isaku.yamahata, xiaoyao.li, kai.huang, binbin.wu, chao.p.peng,
	chao.gao

On Fri, Jan 9, 2026 at 8:12 AM Vishal Annapurve <vannapurve@google.com> wrote:
>
> > > >
> > >
> > > I think the central question I have among all the above is what TDX
> > > needs to actually care about (putting aside KVM's folio size/memory
> > > contiguity vs mapping level rule for a while).
> > >
> > > I think TDX code can check what it cares about (if required to aid
> > > debugging, as Dave suggested). Does TDX actually care about folio sizes,
> > > or does it actually care about memory contiguity and alignment?
> > TDX cares about memory contiguity. A single folio ensures memory contiguity.
>
> In this slightly unusual case, I think the guarantee needed here is
> that as long as a range is mapped into SEPT entries, guest_memfd
> ensures that the complete range stays private.
>
> i.e. I think it should be safe to rely on guest_memfd here,
> irrespective of the folio sizes:
> 1) KVM TDX stack should be able to reclaim the complete range when unmapping.
> 2) KVM TDX stack can assume that as long as memory is mapped in SEPT
> entries, guest_memfd will not let host userspace mappings access
> guest private memory.
>
> >
> > Allowing one S-EPT mapping to cover multiple folios may also mean it's no longer
> > reasonable to pass "struct page" to tdh_phymem_page_wbinvd_hkid() for a
> > contiguous range larger than the page's folio range.
>
> What's the issue with passing the (struct page*, unsigned long nr_pages) pair?
>
> >
> > Additionally, we don't split private mappings in kvm_gmem_error_folio().
> > If smaller folios are allowed, splitting private mapping is required there.
>
> Yes, I believe splitting private mappings will be invoked to ensure
> that the whole huge folio is not unmapped from KVM due to an error on
> just a 4K page. Is that a problem?
>
> If splitting fails, the implementation can fall back to completely
> zapping the folio range.

I forgot to mention that this is a future improvement that will
introduce hugetlb memory failure handling and is not covered by
Ackerley's current set of patches.

>
> > (e.g., after splitting a 1GB folio to 4KB folios with 2MB mappings. Also, is it
> > possible for splitting a huge folio to fail partially, without merging the huge
> > folio back or further zapping?).
>
> Yes, splitting can fail partially, but guest_memfd will not make the
> ranges available to host userspace and derivatives until:
> 1) The complete range to be converted is split to 4K granularity.
> 2) The complete range to be converted is zapped from KVM EPT mappings.
>
> > Not sure if there're other edge cases we're still missing.

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [PATCH v3 00/24] KVM: TDX huge page support for private memory
  2026-01-09 16:12                 ` Vishal Annapurve
  2026-01-09 17:16                   ` Vishal Annapurve
@ 2026-01-09 18:07                   ` Ackerley Tng
  2026-01-12  1:39                     ` Yan Zhao
  1 sibling, 1 reply; 127+ messages in thread
From: Ackerley Tng @ 2026-01-09 18:07 UTC (permalink / raw)
  To: Vishal Annapurve, Yan Zhao
  Cc: Sean Christopherson, pbonzini, linux-kernel, kvm, x86,
	rick.p.edgecombe, dave.hansen, kas, tabba, michael.roth, david,
	sagis, vbabka, thomas.lendacky, nik.borisov, pgonda, fan.du,
	jun.miao, francescolavra.fl, jgross, ira.weiny, isaku.yamahata,
	xiaoyao.li, kai.huang, binbin.wu, chao.p.peng, chao.gao

Vishal Annapurve <vannapurve@google.com> writes:

> On Fri, Jan 9, 2026 at 1:21 AM Yan Zhao <yan.y.zhao@intel.com> wrote:
>>
>> On Thu, Jan 08, 2026 at 12:11:14PM -0800, Ackerley Tng wrote:
>> > Yan Zhao <yan.y.zhao@intel.com> writes:
>> >
>> > > On Tue, Jan 06, 2026 at 03:43:29PM -0800, Sean Christopherson wrote:
>> > >> On Tue, Jan 06, 2026, Ackerley Tng wrote:
>> > >> > Sean Christopherson <seanjc@google.com> writes:
>> > >> >
>> > >> > > On Tue, Jan 06, 2026, Ackerley Tng wrote:
>> > >> > >> Vishal Annapurve <vannapurve@google.com> writes:
>> > >> > >>
>> > >> > >> > On Tue, Jan 6, 2026 at 2:19 AM Yan Zhao <yan.y.zhao@intel.com> wrote:
>> > >> > >> >>
>> > >> > >> >> - EPT mapping size and folio size
>> > >> > >> >>
>> > >> > >> >>   This series is built upon the rule in KVM that the mapping size in the
>> > >> > >> >>   KVM-managed secondary MMU is no larger than the backend folio size.
>> > >> > >> >>
>> > >> > >>
>> > >> > >> I'm not familiar with this rule and would like to find out more. Why is
>> > >> > >> this rule imposed?
>> > >> > >
>> > >> > > Because it's the only sane way to safely map memory into the guest? :-D
>> > >> > >
>> > >> > >> Is this rule there just because traditionally folio sizes also define the
>> > >> > >> limit of contiguity, and so the mapping size must not be greater than folio
>> > >> > >> size in case the block of memory represented by the folio is not contiguous?
>> > >> > >
>> > >> > > Pre-guest_memfd, KVM didn't care about folios.  KVM's mapping size was (and still
>> > >> > > is) strictly bound by the host mapping size.  That's handles contiguous addresses,
>> > >> > > but it _also_ handles contiguous protections (e.g. RWX) and other attributes.
>> > >> > >
>> > >> > >> In guest_memfd's case, even if the folio is split (just for refcount
>> > >> > >> tracking purposes on private to shared conversion), the memory is still
>> > >> > >> contiguous up to the original folio's size. Will the contiguity address
>> > >> > >> the concerns?
>> > >> > >
>> > >> > > Not really?  Why would the folio be split if the memory _and its attributes_ are
>> > >> > > fully contiguous?  If the attributes are mixed, KVM must not create a mapping
>> > >> > > spanning mixed ranges, i.e. with multiple folios.
>> > >> >
>> > >> > The folio can be split if any (or all) of the pages in a huge page range
>> > >> > are shared (in the CoCo sense). So in a 1G block of memory, even if the
>> > >> > attributes all read 0 (!KVM_MEMORY_ATTRIBUTE_PRIVATE), the folio
>> > >> > would be split, and the split folios are necessary for tracking users of
>> > >> > shared pages using struct page refcounts.
>> > >>
>> > >> Ahh, that's what the refcounting was referring to.  Gotcha.
>> > >>
>> > >> > However the split folios in that 1G range are still fully contiguous.
>> > >> >
>> > >> > The process of conversion will split the EPT entries soon after the
>> > >> > folios are split so the rule remains upheld.
>> >
>> > Correction here: If we go with splitting from 1G to 4K uniformly on
>> > sharing, only the EPT entries around the shared 4K folio will have their
>> > page table entries split, so many of the EPT entries will be at 2M level
>> > though the folios are 4K sized. This would last beyond the conversion
>> > process.
>> >
>> > > Overall, I don't think allowing folios smaller than the mappings while
>> > > conversion is in progress brings enough benefit.
>> > >
>> >
>> > I'll look into making the restructuring process always succeed, but off
>> > the top of my head that's hard because
>> >
>> > 1. HugeTLB Vmemmap Optimization code would have to be refactored to
>> >    use pre-allocated pages, which is refactoring deep in HugeTLB code
>> >
>> > 2. If we want to split non-uniformly such that only the folios that are
>> >    shared are 4K, and the remaining folios are as large as possible (PMD
>> >    sized as much as possible), it gets complex to figure out how many
>> >    pages to allocate ahead of time.
>> >
>> > So it's complex and will probably delay HugeTLB+conversion support even
>> > more!
>> >
>> > > Cons:
>> > > (1) TDX's zapping callback has no idea whether the zapping is caused by an
>> > >     in-progress private-to-shared conversion or other reasons. It also has no
>> > >     idea if the attributes of the underlying folios remain unchanged during an
>> > >     in-progress private-to-shared conversion. Even if the assertion Ackerley
>> > >     mentioned is true, it's not easy to drop the sanity checks in TDX's zapping
>> > >     callback for in-progress private-to-shared conversion alone (which would
>> > >     increase TDX's dependency on guest_memfd's specific implementation even if
>> > >     it's feasible).
>> > >
>> > >     Removing the sanity checks entirely in TDX's zapping callback is confusing
>> > >     and would show a bad/false expectation from KVM -- what if a huge folio is
>> > >     incorrectly split while it's still mapped in KVM (by a buggy guest_memfd or
>> > >     others) in other conditions? And then do we still need the check in TDX's
>> > >     mapping callback? If not, does it mean TDX huge pages can stop relying on
>> > >     guest_memfd's ability to allocate huge folios, as KVM could still create
>> > >     huge mappings as long as small folios are physically contiguous with
>> > >     homogeneous memory attributes?
>> > >
>> > > (2) Allowing folios smaller than the mapping would require splitting S-EPT in
>> > >     kvm_gmem_error_folio() before kvm_gmem_zap(). Though one may argue that the
>> > >     invalidate lock held in __kvm_gmem_set_attributes() could guard against
>> > >     concurrent kvm_gmem_error_folio(), it still doesn't seem clean and looks
>> > >     error-prone. (This may also apply to kvm_gmem_migrate_folio() potentially).
>> > >
>> >
>> > I think the central question I have among all the above is what TDX
>> > needs to actually care about (putting aside KVM's folio size/memory
>> > contiguity vs mapping level rule for a while).
>> >
>> > I think TDX code can check what it cares about (if required to aid
>> > debugging, as Dave suggested). Does TDX actually care about folio sizes,
>> > or does it actually care about memory contiguity and alignment?
>> TDX cares about memory contiguity. A single folio ensures memory contiguity.
>
> In this slightly unusual case, I think the guarantee needed here is
> that as long as a range is mapped into SEPT entries, guest_memfd
> ensures that the complete range stays private.
>
> i.e. I think it should be safe to rely on guest_memfd here,
> irrespective of the folio sizes:
> 1) KVM TDX stack should be able to reclaim the complete range when unmapping.
> 2) KVM TDX stack can assume that as long as memory is mapped in SEPT
> entries, guest_memfd will not let host userspace mappings access
> guest private memory.
>
>>
>> Allowing one S-EPT mapping to cover multiple folios may also mean it's no longer
>> reasonable to pass "struct page" to tdh_phymem_page_wbinvd_hkid() for a
>> contiguous range larger than the page's folio range.
>
> What's the issue with passing the (struct page*, unsigned long nr_pages) pair?
>
>>
>> Additionally, we don't split private mappings in kvm_gmem_error_folio().
>> If smaller folios are allowed, splitting private mapping is required there.

It was discussed before that for memory failure handling we will want to
split huge pages; we will get to it! The trouble is that guest_memfd
took the page from HugeTLB (unlike buddy or HugeTLB themselves, which
manage memory from the ground up), so we still need to figure out
whether it's okay to let HugeTLB deal with the page when freeing. When I
last looked, HugeTLB doesn't actually handle poisoned folios on freeing,
so there's more work to do on the HugeTLB side.

This is a good point, although IIUC it is a separate issue. The need to
split private mappings on memory failure is not for confidentiality in
the TDX sense but to ensure that the guest doesn't use the failed
memory. In that case, contiguity is broken by the failed memory. The
folio is split, the private EPTs are split. The folio size should still
not be checked in TDX code. guest_memfd knows contiguity got broken, so
guest_memfd calls TDX code to split the EPTs.

>
> Yes, I believe splitting private mappings will be invoked to ensure
> that the whole huge folio is not unmapped from KVM due to an error on
> just a 4K page. Is that a problem?
>
> If splitting fails, the implementation can fall back to completely
> zapping the folio range.
>
>> (e.g., after splitting a 1GB folio to 4KB folios with 2MB mappings. Also, is it
>> possible for splitting a huge folio to fail partially, without merging the huge
>> folio back or further zapping?).

The current stance is to allow splitting failures and not undo them, so
there's no merging back to fix a splitting failure. (Not set in stone
yet; I think merging back could turn out to be a requirement from the mm
side, which comes with more complexity in the restructuring logic.)

If it is not merged back on a split failure, the pages are still
contiguous; they are guaranteed contiguous while owned by guest_memfd
(even in the case of memory failure, if I get my way :P), so TDX can
still trust that.

I think you're worried that on split failure some folios are split while
the private EPTs for those are not. However, the memory for those
unsplit private EPTs is still contiguous, and on split failure we quit
early, so guest_memfd still tracks the ranges as private.

Privateness and contiguity are preserved so I think TDX should be good
with that? The TD can still run. IIUC it is part of the plan that on
splitting failure, conversion ioctl returns failure, guest is informed
of conversion failure so that it can do whatever it should do to clean
up.

>
> Yes, splitting can fail partially, but guest_memfd will not make the
> ranges available to host userspace and derivatives until:
> 1) The complete range to be converted is split to 4K granularity.
> 2) The complete range to be converted is zapped from KVM EPT mappings.
>
>> Not sure if there're other edge cases we're still missing.
>>

As you said, at the core TDX is concerned about contiguity of the memory
ranges (start_addr, length) that it was given. Contiguity is guaranteed
by guest_memfd while the folio is in guest_memfd ownership up to the
boundaries of the original folio, before any restructuring. So if we're
looking for edge cases, I think they would be around
truncation. Can't think of anything now.

(guest_memfd will also ensure that truncating anything smaller than the
folio's original, pre-restructuring size is blocked, regardless of the
current size of the folio.)

>> > Separately, KVM could also enforce the folio size/memory contiguity vs
>> > mapping level rule, but TDX code shouldn't enforce KVM's rules. So if
>> > the check is deemed necessary, it still shouldn't be in TDX code, I
>> > think.
>> >
>> > > Pro: Preventing zapping private memory until conversion is successful is good.
>> > >
>> > > However, could we achieve this benefit in other ways? For example, is it
>> > > possible to ensure hugetlb_restructuring_split_folio() can't fail by ensuring
>> > > split_entries() can't fail (via pre-allocation?) and disabling hugetlb_vmemmap
>> > > optimization? (hugetlb_vmemmap conversion is super slow according to my
>> > > observation and I always disable it).
>> >
>> > HugeTLB vmemmap optimization gives us 1.6% of memory in savings. For a
>> > huge VM, multiplied by a large number of hosts, this is not a trivial
>> > amount of memory. It's one of the key reasons why we are using HugeTLB
>> > in guest_memfd in the first place, other than to be able to get high
>> > level page table mappings. We want this in production.
>> >
>> > > Or pre-allocation for
>> > > vmemmap_remap_alloc()?
>> > >
>> >
>> > Will investigate if this is possible as mentioned above. Thanks for the
>> > suggestion again!
>> >
>> > > Dropping TDX's sanity check may only serve as our last resort. IMHO, zapping
>> > > private memory before conversion succeeds is still better than introducing the
>> > > mess between folio size and mapping size.
>> > >
>> > >> > I guess perhaps the question is, is it okay if the folios are smaller
>> > >> > than the mapping while conversion is in progress? Does the order matter
>> > >> > (split page table entries first vs split folios first)?
>> > >>
>> > >> Mapping a hugepage for memory that KVM _knows_ is contiguous and homogenous is
>> > >> conceptually totally fine, i.e. I'm not totally opposed to adding support for
>> > >> mapping multiple guest_memfd folios with a single hugepage.   As to whether we
>> > >> do (a) nothing, (b) change the refcounting, or (c) add support for mapping
>> > >> multiple folios in one page, probably comes down to which option provides "good
>> > >> enough" performance without incurring too much complexity.
>> >

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [PATCH v3 01/24] x86/tdx: Enhance tdh_mem_page_aug() to support huge pages
  2026-01-09  3:08         ` Yan Zhao
@ 2026-01-09 18:29           ` Ackerley Tng
  2026-01-12  2:41             ` Yan Zhao
  0 siblings, 1 reply; 127+ messages in thread
From: Ackerley Tng @ 2026-01-09 18:29 UTC (permalink / raw)
  To: Yan Zhao, Dave Hansen
  Cc: pbonzini, seanjc, linux-kernel, kvm, x86, rick.p.edgecombe, kas,
	tabba, michael.roth, david, vannapurve, sagis, vbabka,
	thomas.lendacky, nik.borisov, pgonda, fan.du, jun.miao,
	francescolavra.fl, jgross, ira.weiny, isaku.yamahata, xiaoyao.li,
	kai.huang, binbin.wu, chao.p.peng, chao.gao

Yan Zhao <yan.y.zhao@intel.com> writes:

> On Wed, Jan 07, 2026 at 08:39:55AM -0800, Dave Hansen wrote:
>> On 1/7/26 01:12, Yan Zhao wrote:
>> ...
>> > However, my understanding is that it's better for functions expecting huge pages
>> > to explicitly receive "folio" instead of "page". This way, people can tell from
>> > a function's declaration what the function expects. Is this understanding
>> > correct?
>>
>> In a perfect world, maybe.
>>
>> But, in practice, a 'struct page' can still represent huge pages and
>> *does* represent huge pages all over the kernel. There's no need to cram
>> a folio in here just because a huge page is involved.
> Ok. I can modify the param "struct page *page" to "struct page *base_page",
> explaining that it may belong to a huge folio but is not necessarily the
> head page of the folio.
>
>> > Passing "start_idx" along with "folio" is due to the requirement of mapping only
>> > a sub-range of a huge folio. e.g., we allow creating a 2MB mapping starting from
>> > the nth idx of a 1GB folio.
>> >
>> > On the other hand, if we instead pass "page" to tdh_mem_page_aug() for huge
>> > pages and have tdh_mem_page_aug() internally convert it to "folio" and
>> > "start_idx", it makes me wonder if we could have previously just passed "pfn" to
>> > tdh_mem_page_aug() and had tdh_mem_page_aug() convert it to "page".
>>
>> As a general pattern, I discourage folks from using pfns and physical
>> addresses when passing around references to physical memory. They have
>> zero type safety.
>>
>> It's also not just about type safety. A 'struct page' also *means*
>> something. It means that the kernel is, on some level, aware of and
>> managing that memory. It's not MMIO. It doesn't represent the physical
>> address of the APIC page. It's not SGX memory. It doesn't have a
>> Shared/Private bit.
>>
>> All of those properties are important and they're *GONE* if you use a
>> pfn. It's even worse if you use a raw physical address.
>>
>> Please don't go back to raw integers (pfns or paddrs).
> I understood and fully accept it.
>
> I previously wondered if we could allow KVM to pass in pfn and let the SEAMCALL
> wrapper do the pfn_to_page() conversion.
> But it was just out of curiosity. I actually prefer "struct page" too.
>
>
>> >>> -	tdx_clflush_page(page);
>> >>> +	if (start_idx + npages > folio_nr_pages(folio))
>> >>> +		return TDX_OPERAND_INVALID;
>> >>
>> >> Why is this necessary? Would it be a bug if this happens?
>> > This sanity check is due to the requirement in KVM that mapping size should be
>> > no larger than the backend folio size, which ensures the mapping pages are
>> > physically contiguous with homogeneous page attributes. (See the discussion
>> > about "EPT mapping size and folio size" in thread [1]).
>> >
>> > Failure of the sanity check could only be due to bugs in the caller (KVM). I
>> > didn't convert the sanity check to an assertion because there's already a
>> > TDX_BUG_ON_2() on error following the invocation of tdh_mem_page_aug() in KVM.
>>
>> We generally don't protect against bugs in callers. Otherwise, we'd have
>> a trillion NULL checks in every function in the kernel.
>>
>> The only reason to add caller sanity checks is to make things easier to
>> debug, and those almost always include some kind of spew:
>> WARN_ON_ONCE(), pr_warn(), etc...
>
> Would it be better if I use WARN_ON_ONCE()? like this:
>
> u64 tdh_mem_page_aug(struct tdx_td *td, u64 gpa, int level, struct page *base_page,
>                      u64 *ext_err1, u64 *ext_err2)
> {
>         unsigned long npages = tdx_sept_level_to_npages(level);
>         struct tdx_module_args args = {
>                 .rcx = gpa | level,
>                 .rdx = tdx_tdr_pa(td),
>                 .r8 = page_to_phys(base_page),
>         };
>         u64 ret;
>
>         WARN_ON_ONCE(page_folio(base_page) != page_folio(base_page + npages - 1));

This WARNs if the first and last pages are not in the same folio, which
still assumes something about how pages are grouped into folios. I feel
that this is still stretching TDX code to make assumptions about how the
kernel manages memory metadata, which is more than TDX actually cares
about.

>
>         for (int i = 0; i < npages; i++)
>                 tdx_clflush_page(base_page + i);
>
>         ret = seamcall_ret(TDH_MEM_PAGE_AUG, &args);
>
>         *ext_err1 = args.rcx;
>         *ext_err2 = args.rdx;
>
>         return ret;
> }
>
> The WARN_ON_ONCE() serves 2 purposes:
> 1. Loudly warn of subtle KVM bugs.
> 2. Ensure "page_to_pfn(base_page + i) == (page_to_pfn(base_page) + i)".
>

I disagree with checking within TDX code, but if you would still like to
check, purpose 2 that you suggested is less dependent on how the kernel
groups pages into folios. How about:

  WARN_ON_ONCE(page_to_pfn(base_page + npages - 1) !=
               page_to_pfn(base_page) + npages - 1);

The full contiguity check will scan every page, but I think this doesn't
take too many CPU cycles, and would probably catch what you're looking
to catch in most cases.

I still don't think TDX code should check. The caller should check or
know the right thing to do.

> If you don't like using "base_page + i" (as the discussion in v2 [1]), we can
> invoke folio_page() for the ith page instead.
>
> [1] https://lore.kernel.org/all/01731a9a0346b08577fad75ae560c650145c7f39.camel@intel.com/
>
>> >>> +	for (int i = 0; i < npages; i++)
>> >>> +		tdx_clflush_page(folio_page(folio, start_idx + i));
>> >>
>> >> All of the page<->folio conversions are kinda hurting my brain. I think
>> >> we need to decide what the canonical type for these things is in TDX, do
>> >> the conversion once, and stick with it.
>> > Got it!
>> >
>> > Since passing in base "page" or base "pfn" may still require the
>> > wrappers/helpers to internally convert them to "folio" for sanity checks, could
>> > we decide that "folio" and "start_idx" are the canonical params for functions
>> > expecting huge pages? Or do you prefer KVM to do the sanity check by itself?
>>
>> I'm not convinced the sanity check is a good idea in the first place. It
>> just adds complexity.
> I'm worried about subtle bugs introduced by careless coding that might be
> silently ignored otherwise, like the one in thread [2].
>
> [2] https://lore.kernel.org/kvm/aV2A39fXgzuM4Toa@google.com/

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [PATCH v3 00/24] KVM: TDX huge page support for private memory
  2026-01-09 18:07                   ` Ackerley Tng
@ 2026-01-12  1:39                     ` Yan Zhao
  2026-01-12  2:12                       ` Yan Zhao
  0 siblings, 1 reply; 127+ messages in thread
From: Yan Zhao @ 2026-01-12  1:39 UTC (permalink / raw)
  To: Ackerley Tng
  Cc: Vishal Annapurve, Sean Christopherson, pbonzini, linux-kernel,
	kvm, x86, rick.p.edgecombe, dave.hansen, kas, tabba, michael.roth,
	david, sagis, vbabka, thomas.lendacky, nik.borisov, pgonda,
	fan.du, jun.miao, francescolavra.fl, jgross, ira.weiny,
	isaku.yamahata, xiaoyao.li, kai.huang, binbin.wu, chao.p.peng,
	chao.gao

On Fri, Jan 09, 2026 at 10:07:00AM -0800, Ackerley Tng wrote:
> Vishal Annapurve <vannapurve@google.com> writes:
> 
> > On Fri, Jan 9, 2026 at 1:21 AM Yan Zhao <yan.y.zhao@intel.com> wrote:
> >>
> >> On Thu, Jan 08, 2026 at 12:11:14PM -0800, Ackerley Tng wrote:
> >> > Yan Zhao <yan.y.zhao@intel.com> writes:
> >> >
> >> > > On Tue, Jan 06, 2026 at 03:43:29PM -0800, Sean Christopherson wrote:
> >> > >> On Tue, Jan 06, 2026, Ackerley Tng wrote:
> >> > >> > Sean Christopherson <seanjc@google.com> writes:
> >> > >> >
> >> > >> > > On Tue, Jan 06, 2026, Ackerley Tng wrote:
> >> > >> > >> Vishal Annapurve <vannapurve@google.com> writes:
> >> > >> > >>
> >> > >> > >> > On Tue, Jan 6, 2026 at 2:19 AM Yan Zhao <yan.y.zhao@intel.com> wrote:
> >> > >> > >> >>
> >> > >> > >> >> - EPT mapping size and folio size
> >> > >> > >> >>
> >> > >> > >> >>   This series is built upon the rule in KVM that the mapping size in the
> >> > >> > >> >>   KVM-managed secondary MMU is no larger than the backend folio size.
> >> > >> > >> >>
> >> > >> > >>
> >> > >> > >> I'm not familiar with this rule and would like to find out more. Why is
> >> > >> > >> this rule imposed?
> >> > >> > >
> >> > >> > > Because it's the only sane way to safely map memory into the guest? :-D
> >> > >> > >
> >> > >> > >> Is this rule there just because traditionally folio sizes also define the
> >> > >> > >> limit of contiguity, and so the mapping size must not be greater than folio
> >> > >> > >> size in case the block of memory represented by the folio is not contiguous?
> >> > >> > >
> >> > >> > > Pre-guest_memfd, KVM didn't care about folios.  KVM's mapping size was (and still
> >> > >> > > is) strictly bound by the host mapping size.  That's handles contiguous addresses,
> >> > >> > > but it _also_ handles contiguous protections (e.g. RWX) and other attributes.
> >> > >> > >
> >> > >> > >> In guest_memfd's case, even if the folio is split (just for refcount
> >> > >> > >> tracking purposese on private to shared conversion), the memory is still
> >> > >> > >> contiguous up to the original folio's size. Will the contiguity address
> >> > >> > >> the concerns?
> >> > >> > >
> >> > >> > > Not really?  Why would the folio be split if the memory _and its attributes_ are
> >> > >> > > fully contiguous?  If the attributes are mixed, KVM must not create a mapping
> >> > >> > > spanning mixed ranges, i.e. with multiple folios.
> >> > >> >
> >> > >> > The folio can be split if any (or all) of the pages in a huge page range
> >> > >> > are shared (in the CoCo sense). So in a 1G block of memory, even if the
> >> > >> > attributes all read 0 (!KVM_MEMORY_ATTRIBUTE_PRIVATE), the folio
> >> > >> > would be split, and the split folios are necessary for tracking users of
> >> > >> > shared pages using struct page refcounts.
> >> > >>
> >> > >> Ahh, that's what the refcounting was referring to.  Gotcha.
> >> > >>
> >> > >> > However the split folios in that 1G range are still fully contiguous.
> >> > >> >
> >> > >> > The process of conversion will split the EPT entries soon after the
> >> > >> > folios are split so the rule remains upheld.
> >> >
> >> > Correction here: If we go with splitting from 1G to 4K uniformly on
> >> > sharing, only the EPT entries around the shared 4K folio will have their
> >> > page table entries split, so many of the EPT entries will be at 2M level
> >> > though the folios are 4K sized. This would be last beyond the conversion
> >> > process.
> >> >
> >> > > Overall, I don't think allowing folios smaller than the mappings while
> >> > > conversion is in progress brings enough benefit.
> >> > >
> >> >
> >> > I'll look into making the restructuring process always succeed, but off
> >> > the top of my head that's hard because
> >> >
> >> > 1. HugeTLB Vmemmap Optimization code would have to be refactored to
> >> >    use pre-allocated pages, which is refactoring deep in HugeTLB code
> >> >
> >> > 2. If we want to split non-uniformly such that only the folios that are
> >> >    shared are 4K, and the remaining folios are as large as possible (PMD
> >> >    sized as much as possible), it gets complex to figure out how many
> >> >    pages to allocate ahead of time.
> >> >
> >> > So it's complex and will probably delay HugeTLB+conversion support even
> >> > more!
> >> >
> >> > > Cons:
> >> > > (1) TDX's zapping callback has no idea whether the zapping is caused by an
> >> > >     in-progress private-to-shared conversion or other reasons. It also has no
> >> > >     idea if the attributes of the underlying folios remain unchanged during an
> >> > >     in-progress private-to-shared conversion. Even if the assertion Ackerley
> >> > >     mentioned is true, it's not easy to drop the sanity checks in TDX's zapping
> >> > >     callback for in-progress private-to-shared conversion alone (which would
> >> > >     increase TDX's dependency on guest_memfd's specific implementation even if
> >> > >     it's feasible).
> >> > >
> >> > >     Removing the sanity checks entirely in TDX's zapping callback is confusing
> >> > >     and would show a bad/false expectation from KVM -- what if a huge folio is
> >> > >     incorrectly split while it's still mapped in KVM (by a buggy guest_memfd or
> >> > >     others) in other conditions? And then do we still need the check in TDX's
> >> > >     mapping callback? If not, does it mean TDX huge pages can stop relying on
> >> > >     guest_memfd's ability to allocate huge folios, as KVM could still create
> >> > >     huge mappings as long as small folios are physically contiguous with
> >> > >     homogeneous memory attributes?
> >> > >
> >> > > (2) Allowing folios smaller than the mapping would require splitting S-EPT in
> >> > >     kvm_gmem_error_folio() before kvm_gmem_zap(). Though one may argue that the
> >> > >     invalidate lock held in __kvm_gmem_set_attributes() could guard against
> >> > >     concurrent kvm_gmem_error_folio(), it still doesn't seem clean and looks
> >> > >     error-prone. (This may also apply to kvm_gmem_migrate_folio() potentially).
> >> > >
> >> >
> >> > I think the central question I have among all the above is what TDX
> >> > needs to actually care about (putting aside what KVM's folio size/memory
> >> > contiguity vs mapping level rule for a while).
> >> >
> >> > I think TDX code can check what it cares about (if required to aid
> >> > debugging, as Dave suggested). Does TDX actually care about folio sizes,
> >> > or does it actually care about memory contiguity and alignment?
> >> TDX cares about memory contiguity. A single folio ensures memory contiguity.
> >
> > In this slightly unusual case, I think the guarantee needed here is
> > that as long as a range is mapped into SEPT entries, guest_memfd
> > ensures that the complete range stays private.
> >
> > i.e. I think it should be safe to rely on guest_memfd here,
> > irrespective of the folio sizes:
> > 1) KVM TDX stack should be able to reclaim the complete range when unmapping.
> > 2) KVM TDX stack can assume that as long as memory is mapped in SEPT
> > entries, guest_memfd will not let host userspace mappings to access
> > guest private memory.
> >
> >>
> >> Allowing one S-EPT mapping to cover multiple folios may also mean it's no longer
> >> reasonable to pass "struct page" to tdh_phymem_page_wbinvd_hkid() for a
> >> contiguous range larger than the page's folio range.
> >
> > What's the issue with passing the (struct page*, unsigned long nr_pages) pair?
> >
> >>
> >> Additionally, we don't split private mappings in kvm_gmem_error_folio().
> >> If smaller folios are allowed, splitting private mapping is required there.
> 
> It was discussed before that for memory failure handling, we will want
> to split huge pages, we will get to it! The trouble is that guest_memfd
> took the page from HugeTLB (unlike buddy or HugeTLB which manages memory
> from the ground up), so we'll still need to figure out it's okay to let
> HugeTLB deal with it when freeing, and when I last looked, HugeTLB
> doesn't actually deal with poisoned folios on freeing, so there's more
> work to do on the HugeTLB side.
> 
> This is a good point, although IIUC it is a separate issue. The need to
> split private mappings on memory failure is not for confidentiality in
> the TDX sense but to ensure that the guest doesn't use the failed
> memory. In that case, contiguity is broken by the failed memory. The
> folio is split, the private EPTs are split. The folio size should still
> not be checked in TDX code. guest_memfd knows contiguity got broken, so
> guest_memfd calls TDX code to split the EPTs.

Hmm, maybe the key is that we need to split the S-EPT before allowing
guest_memfd to split the backend folio. If splitting the S-EPT fails, don't
split the folio.

This is better than performing folio splitting while it's mapped as huge in
S-EPT, since in the latter case, kvm_gmem_error_folio() needs to try to split
S-EPT. If the S-EPT splitting fails, falling back to zapping the huge mapping in
kvm_gmem_error_folio() would still trigger the over-zapping issue.

The primary MMU follows the rule of unmapping a folio before splitting,
truncating, or migrating it. For the S-EPT, considering the cost of zapping
more ranges than necessary, a reasonable trade-off may be to always split the
S-EPT before allowing backend folio splitting.

Does this look good to you?

So, to convert a 2MB range from private to shared, even though guest_memfd will
eventually zap the entire 2MB range, do the S-EPT splitting first! If it fails,
don't split the backend folio.

Even if folio splitting fails later, that just leaves split S-EPT mappings,
which matters little, especially once we support S-EPT promotion.

The benefit is that we don't need to worry even when guest_memfd splits a 1GB
folio directly to 4KB granularity, which could otherwise introduce the
over-zapping issue later.

> > Yes, I believe splitting private mappings will be invoked to ensure
> > that the whole huge folio is not unmapped from KVM due to an error on
> > just a 4K page. Is that a problem?
> >
> > If splitting fails, the implementation can fall back to completely
> > zapping the folio range.
> >
> >> (e.g., after splitting a 1GB folio to 4KB folios with 2MB mappings. Also, is it
> >> possible for splitting a huge folio to fail partially, without merging the huge
> >> folio back or further zapping?).
> 
> The current stance is to allow splitting failures and not undo that
> splitting failure, so there's no merge back to fix the splitting
> failure. (Not set in stone yet, I think merging back could turn out to
> be a requirement from the mm side, which comes with more complexity in
> restructuring logic.)
> 
> If it is not merged back on a split failure, the pages are still
> contiguous, the pages are guaranteed contiguous while they are owned by
> guest_memfd (even in the case of memory failure, if I get my way :P) so
> TDX can still trust that.
> 
> I think you're worried that on split failure some folios are split, but
> the private EPTs for those are not split, but the memory for those
> unsplit private EPTs are still contiguous, and on split failure we quit
> early so guest_memfd still tracks the ranges as private.
> 
> Privateness and contiguity are preserved so I think TDX should be good
> with that? The TD can still run. IIUC it is part of the plan that on
> splitting failure, conversion ioctl returns failure, guest is informed
> of conversion failure so that it can do whatever it should do to clean
> up.
As above, what about the idea of always requesting KVM to split S-EPT before
guest_memfd splits a folio?

I think splitting S-EPT first is already required for all cases anyway, except
for the private-to-shared conversion of a full 2MB or 1GB range.

Requesting S-EPT splitting when it's about to do folio splitting is better than
leaving huge mappings with split folios and having to patch things up here and
there, just to make the single case of private-to-shared conversion easier.

> > Yes, splitting can fail partially, but guest_memfd will not make the
> > ranges available to host userspace and derivatives until:
> > 1) The complete range to be converted is split to 4K granularity.
> > 2) The complete range to be converted is zapped from KVM EPT mappings.
> >
> >> Not sure if there're other edge cases we're still missing.
> >>
> 
> As you said, at the core TDX is concerned about contiguity of the memory
> ranges (start_addr, length) that it was given. Contiguity is guaranteed
> by guest_memfd while the folio is in guest_memfd ownership up to the
> boundaries of the original folio, before any restructuring. So if we're
> looking for edge cases, I think they would be around
> truncation. Can't think of anything now.
Potentially, folio migration, if we support it in the future.

> (guest_memfd will also ensure truncation of anything less than the
> original size of the folio before restructuring is blocked, regardless
> of the current size of the folio)
> >> > Separately, KVM could also enforce the folio size/memory contiguity vs
> >> > mapping level rule, but TDX code shouldn't enforce KVM's rules. So if
> >> > the check is deemed necessary, it still shouldn't be in TDX code, I
> >> > think.
> >> >
> >> > > Pro: Preventing zapping private memory until conversion is successful is good.
> >> > >
> >> > > However, could we achieve this benefit in other ways? For example, is it
> >> > > possible to ensure hugetlb_restructuring_split_folio() can't fail by ensuring
> >> > > split_entries() can't fail (via pre-allocation?) and disabling hugetlb_vmemmap
> >> > > optimization? (hugetlb_vmemmap conversion is super slow according to my
> >> > > observation and I always disable it).
> >> >
> >> > HugeTLB vmemmap optimization gives us 1.6% of memory in savings. For a
> >> > huge VM, multiplied by a large number of hosts, this is not a trivial
> >> > amount of memory. It's one of the key reasons why we are using HugeTLB
> >> > in guest_memfd in the first place, other than to be able to get high
> >> > level page table mappings. We want this in production.
> >> >
> >> > > Or pre-allocation for
> >> > > vmemmap_remap_alloc()?
> >> > >
> >> >
> >> > Will investigate if this is possible as mentioned above. Thanks for the
> >> > suggestion again!
> >> >
> >> > > Dropping TDX's sanity check may only serve as our last resort. IMHO, zapping
> >> > > private memory before conversion succeeds is still better than introducing the
> >> > > mess between folio size and mapping size.
> >> > >
> >> > >> > I guess perhaps the question is, is it okay if the folios are smaller
> >> > >> > than the mapping while conversion is in progress? Does the order matter
> >> > >> > (split page table entries first vs split folios first)?
> >> > >>
> >> > >> Mapping a hugepage for memory that KVM _knows_ is contiguous and homogenous is
> >> > >> conceptually totally fine, i.e. I'm not totally opposed to adding support for
> >> > >> mapping multiple guest_memfd folios with a single hugepage.   As to whether we
> >> > >> do (a) nothing, (b) change the refcounting, or (c) add support for mapping
> >> > >> multiple folios in one page, probably comes down to which option provides "good
> >> > >> enough" performance without incurring too much complexity.
> >> >
> 

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [PATCH v3 00/24] KVM: TDX huge page support for private memory
  2026-01-12  1:39                     ` Yan Zhao
@ 2026-01-12  2:12                       ` Yan Zhao
  2026-01-12 19:56                         ` Ackerley Tng
  0 siblings, 1 reply; 127+ messages in thread
From: Yan Zhao @ 2026-01-12  2:12 UTC (permalink / raw)
  To: Ackerley Tng, Vishal Annapurve, Sean Christopherson, pbonzini,
	linux-kernel, kvm, x86, rick.p.edgecombe, dave.hansen, kas, tabba,
	michael.roth, david, sagis, vbabka, thomas.lendacky, nik.borisov,
	pgonda, fan.du, jun.miao, francescolavra.fl, jgross, ira.weiny,
	isaku.yamahata, xiaoyao.li, kai.huang, binbin.wu, chao.p.peng,
	chao.gao

On Mon, Jan 12, 2026 at 09:39:39AM +0800, Yan Zhao wrote:
> On Fri, Jan 09, 2026 at 10:07:00AM -0800, Ackerley Tng wrote:
> > Vishal Annapurve <vannapurve@google.com> writes:
> > 
> > > On Fri, Jan 9, 2026 at 1:21 AM Yan Zhao <yan.y.zhao@intel.com> wrote:
> > >>
> > >> On Thu, Jan 08, 2026 at 12:11:14PM -0800, Ackerley Tng wrote:
> > >> > Yan Zhao <yan.y.zhao@intel.com> writes:
> > >> >
> > >> > > On Tue, Jan 06, 2026 at 03:43:29PM -0800, Sean Christopherson wrote:
> > >> > >> On Tue, Jan 06, 2026, Ackerley Tng wrote:
> > >> > >> > Sean Christopherson <seanjc@google.com> writes:
> > >> > >> >
> > >> > >> > > On Tue, Jan 06, 2026, Ackerley Tng wrote:
> > >> > >> > >> Vishal Annapurve <vannapurve@google.com> writes:
> > >> > >> > >>
> > >> > >> > >> > On Tue, Jan 6, 2026 at 2:19 AM Yan Zhao <yan.y.zhao@intel.com> wrote:
> > >> > >> > >> >>
> > >> > >> > >> >> - EPT mapping size and folio size
> > >> > >> > >> >>
> > >> > >> > >> >>   This series is built upon the rule in KVM that the mapping size in the
> > >> > >> > >> >>   KVM-managed secondary MMU is no larger than the backend folio size.
> > >> > >> > >> >>
> > >> > >> > >>
> > >> > >> > >> I'm not familiar with this rule and would like to find out more. Why is
> > >> > >> > >> this rule imposed?
> > >> > >> > >
> > >> > >> > > Because it's the only sane way to safely map memory into the guest? :-D
> > >> > >> > >
> > >> > >> > >> Is this rule there just because traditionally folio sizes also define the
> > >> > >> > >> limit of contiguity, and so the mapping size must not be greater than folio
> > >> > >> > >> size in case the block of memory represented by the folio is not contiguous?
> > >> > >> > >
> > >> > >> > > Pre-guest_memfd, KVM didn't care about folios.  KVM's mapping size was (and still
> > >> > >> > > is) strictly bound by the host mapping size.  That handles contiguous addresses,
> > >> > >> > > but it _also_ handles contiguous protections (e.g. RWX) and other attributes.
> > >> > >> > >
> > >> > >> > >> In guest_memfd's case, even if the folio is split (just for refcount
> > >> > >> > >> tracking purposes on private to shared conversion), the memory is still
> > >> > >> > >> contiguous up to the original folio's size. Will the contiguity address
> > >> > >> > >> the concerns?
> > >> > >> > >
> > >> > >> > > Not really?  Why would the folio be split if the memory _and its attributes_ are
> > >> > >> > > fully contiguous?  If the attributes are mixed, KVM must not create a mapping
> > >> > >> > > spanning mixed ranges, i.e. with multiple folios.
> > >> > >> >
> > >> > >> > The folio can be split if any (or all) of the pages in a huge page range
> > >> > >> > are shared (in the CoCo sense). So in a 1G block of memory, even if the
> > >> > >> > attributes all read 0 (!KVM_MEMORY_ATTRIBUTE_PRIVATE), the folio
> > >> > >> > would be split, and the split folios are necessary for tracking users of
> > >> > >> > shared pages using struct page refcounts.
> > >> > >>
> > >> > >> Ahh, that's what the refcounting was referring to.  Gotcha.
> > >> > >>
> > >> > >> > However the split folios in that 1G range are still fully contiguous.
> > >> > >> >
> > >> > >> > The process of conversion will split the EPT entries soon after the
> > >> > >> > folios are split so the rule remains upheld.
> > >> >
> > >> > Correction here: If we go with splitting from 1G to 4K uniformly on
> > >> > sharing, only the EPT entries around the shared 4K folio will have their
> > >> > page table entries split, so many of the EPT entries will be at 2M level
> > >> > though the folios are 4K sized. This would last beyond the conversion
> > >> > process.
> > >> >
> > >> > > Overall, I don't think allowing folios smaller than the mappings while
> > >> > > conversion is in progress brings enough benefit.
> > >> > >
> > >> >
> > >> > I'll look into making the restructuring process always succeed, but off
> > >> > the top of my head that's hard because
> > >> >
> > >> > 1. HugeTLB Vmemmap Optimization code would have to be refactored to
> > >> >    use pre-allocated pages, which is refactoring deep in HugeTLB code
> > >> >
> > >> > 2. If we want to split non-uniformly such that only the folios that are
> > >> >    shared are 4K, and the remaining folios are as large as possible (PMD
> > >> >    sized as much as possible), it gets complex to figure out how many
> > >> >    pages to allocate ahead of time.
> > >> >
> > >> > So it's complex and will probably delay HugeTLB+conversion support even
> > >> > more!
> > >> >
> > >> > > Cons:
> > >> > > (1) TDX's zapping callback has no idea whether the zapping is caused by an
> > >> > >     in-progress private-to-shared conversion or other reasons. It also has no
> > >> > >     idea if the attributes of the underlying folios remain unchanged during an
> > >> > >     in-progress private-to-shared conversion. Even if the assertion Ackerley
> > >> > >     mentioned is true, it's not easy to drop the sanity checks in TDX's zapping
> > >> > >     callback for in-progress private-to-shared conversion alone (which would
> > >> > >     increase TDX's dependency on guest_memfd's specific implementation even if
> > >> > >     it's feasible).
> > >> > >
> > >> > >     Removing the sanity checks entirely in TDX's zapping callback is confusing
> > >> > >     and would show a bad/false expectation from KVM -- what if a huge folio is
> > >> > >     incorrectly split while it's still mapped in KVM (by a buggy guest_memfd or
> > >> > >     others) in other conditions? And then do we still need the check in TDX's
> > >> > >     mapping callback? If not, does it mean TDX huge pages can stop relying on
> > >> > >     guest_memfd's ability to allocate huge folios, as KVM could still create
> > >> > >     huge mappings as long as small folios are physically contiguous with
> > >> > >     homogeneous memory attributes?
> > >> > >
> > >> > > (2) Allowing folios smaller than the mapping would require splitting S-EPT in
> > >> > >     kvm_gmem_error_folio() before kvm_gmem_zap(). Though one may argue that the
> > >> > >     invalidate lock held in __kvm_gmem_set_attributes() could guard against
> > >> > >     concurrent kvm_gmem_error_folio(), it still doesn't seem clean and looks
> > >> > >     error-prone. (This may also apply to kvm_gmem_migrate_folio() potentially).
> > >> > >
> > >> >
> > >> > I think the central question I have among all the above is what TDX
> > >> > needs to actually care about (putting aside what KVM's folio size/memory
> > >> > contiguity vs mapping level rule for a while).
> > >> >
> > >> > I think TDX code can check what it cares about (if required to aid
> > >> > debugging, as Dave suggested). Does TDX actually care about folio sizes,
> > >> > or does it actually care about memory contiguity and alignment?
> > >> TDX cares about memory contiguity. A single folio ensures memory contiguity.
> > >
> > > In this slightly unusual case, I think the guarantee needed here is
> > > that as long as a range is mapped into SEPT entries, guest_memfd
> > > ensures that the complete range stays private.
> > >
> > > i.e. I think it should be safe to rely on guest_memfd here,
> > > irrespective of the folio sizes:
> > > 1) KVM TDX stack should be able to reclaim the complete range when unmapping.
> > > 2) KVM TDX stack can assume that as long as memory is mapped in SEPT
> > > entries, guest_memfd will not let host userspace mappings to access
> > > guest private memory.
> > >
> > >>
> > >> Allowing one S-EPT mapping to cover multiple folios may also mean it's no longer
> > >> reasonable to pass "struct page" to tdh_phymem_page_wbinvd_hkid() for a
> > >> contiguous range larger than the page's folio range.
> > >
> > > What's the issue with passing the (struct page*, unsigned long nr_pages) pair?
> > >
> > >>
> > >> Additionally, we don't split private mappings in kvm_gmem_error_folio().
> > >> If smaller folios are allowed, splitting private mapping is required there.
> > 
> > It was discussed before that for memory failure handling, we will want
> > to split huge pages, we will get to it! The trouble is that guest_memfd
> > took the page from HugeTLB (unlike buddy or HugeTLB which manages memory
> > from the ground up), so we'll still need to figure out it's okay to let
> > HugeTLB deal with it when freeing, and when I last looked, HugeTLB
> > doesn't actually deal with poisoned folios on freeing, so there's more
> > work to do on the HugeTLB side.
> > 
> > This is a good point, although IIUC it is a separate issue. The need to
> > split private mappings on memory failure is not for confidentiality in
> > the TDX sense but to ensure that the guest doesn't use the failed
> > memory. In that case, contiguity is broken by the failed memory. The
> > folio is split, the private EPTs are split. The folio size should still
> > not be checked in TDX code. guest_memfd knows contiguity got broken, so
> > guest_memfd calls TDX code to split the EPTs.
> 
> Hmm, maybe the key is that we need to split the S-EPT before allowing
> guest_memfd to split the backend folio. If splitting the S-EPT fails, don't
> split the folio.
> 
> This is better than performing folio splitting while it's mapped as huge in
> S-EPT, since in the latter case, kvm_gmem_error_folio() needs to try to split
> S-EPT. If the S-EPT splitting fails, falling back to zapping the huge mapping in
> kvm_gmem_error_folio() would still trigger the over-zapping issue.
> 
> The primary MMU follows the rule of unmapping a folio before splitting,
> truncating, or migrating it. For the S-EPT, considering the cost of zapping
> more ranges than necessary, a reasonable trade-off may be to always split the
> S-EPT before allowing backend folio splitting.
> 
> Does this look good to you?
So, the flow of converting 0-4KB from private to shared in a 1GB folio in
guest_memfd is:

a. If guest_memfd splits 1GB to 2MB first:
   1. split S-EPT to 4KB for the 0-2MB range, split S-EPT to 2MB for the rest of the range.
   2. split folio
   3. zap the 0-4KB mapping.

b. If guest_memfd splits 1GB to 4KB directly:
   1. split S-EPT to 4KB for the 0-2MB range, split S-EPT to 4KB for the rest of the range.
   2. split folio
   3. zap the 0-4KB mapping.

The flow of converting 0-2MB from private to shared in a 1GB folio in
guest_memfd is:

a. If guest_memfd splits 1GB to 2MB first:
   1. split S-EPT to 4KB for the 0-2MB range, split S-EPT to 2MB for the rest of the range.
   2. split folio
   3. zap the 0-2MB mapping.

b. If guest_memfd splits 1GB to 4KB directly:
   1. split S-EPT to 4KB for the 0-2MB range, split S-EPT to 4KB for the rest of the range.
   2. split folio
   3. zap the 0-2MB mapping.

> So, to convert a 2MB range from private to shared, even though guest_memfd will
> eventually zap the entire 2MB range, do the S-EPT splitting first! If it fails,
> don't split the backend folio.
> 
> Even if folio splitting may fail later, it just leaves split S-EPT mappings,
> which matters little, especially after we support S-EPT promotion later.
> 
> The benefit is that we don't need to worry even in the case when guest_memfd
> splits a 1GB folio directly to 4KB granularity, potentially introducing the
> over-zapping issue later.
> 
> > > Yes, I believe splitting private mappings will be invoked to ensure
> > > that the whole huge folio is not unmapped from KVM due to an error on
> > > just a 4K page. Is that a problem?
> > >
> > > If splitting fails, the implementation can fall back to completely
> > > zapping the folio range.
> > >
> > >> (e.g., after splitting a 1GB folio to 4KB folios with 2MB mappings. Also, is it
> > >> possible for splitting a huge folio to fail partially, without merging the huge
> > >> folio back or further zapping?).
> > 
> > The current stance is to allow splitting failures and not undo that
> > splitting failure, so there's no merge back to fix the splitting
> > failure. (Not set in stone yet, I think merging back could turn out to
> > be a requirement from the mm side, which comes with more complexity in
> > restructuring logic.)
> > 
> > If it is not merged back on a split failure, the pages are still
> > contiguous, the pages are guaranteed contiguous while they are owned by
> > guest_memfd (even in the case of memory failure, if I get my way :P) so
> > TDX can still trust that.
> > 
> > I think you're worried that on split failure some folios are split, but
> > the private EPTs for those are not split, but the memory for those
> > unsplit private EPTs are still contiguous, and on split failure we quit
> > early so guest_memfd still tracks the ranges as private.
> > 
> > Privateness and contiguity are preserved so I think TDX should be good
> > with that? The TD can still run. IIUC it is part of the plan that on
> > splitting failure, conversion ioctl returns failure, guest is informed
> > of conversion failure so that it can do whatever it should do to clean
> > up.
> As above, what about the idea of always requesting KVM to split S-EPT before
> guest_memfd splits a folio?
> 
> I think splitting S-EPT first is already required for all cases anyway, except
> for the private-to-shared conversion of a full 2MB or 1GB range.
> 
> Requesting S-EPT splitting when it's about to do folio splitting is better than
> leaving huge mappings with split folios and having to patch things up here and
> there, just to make the single case of private-to-shared conversion easier.
> 
> > > Yes, splitting can fail partially, but guest_memfd will not make the
> > > ranges available to host userspace and derivatives until:
> > > 1) The complete range to be converted is split to 4K granularity.
> > > 2) The complete range to be converted is zapped from KVM EPT mappings.
> > >
> > >> Not sure if there're other edge cases we're still missing.
> > >>
> > 
> > As you said, at the core TDX is concerned about contiguity of the memory
> > ranges (start_addr, length) that it was given. Contiguity is guaranteed
> > by guest_memfd while the folio is in guest_memfd ownership up to the
> > boundaries of the original folio, before any restructuring. So if we're
> > looking for edge cases, I think they would be around
> > truncation. Can't think of anything now.
> Potentially, folio migration, if we support it in the future.
> 
> > (guest_memfd will also ensure truncation of anything less than the
> > original size of the folio before restructuring is blocked, regardless
> > of the current size of the folio)
> > >> > Separately, KVM could also enforce the folio size/memory contiguity vs
> > >> > mapping level rule, but TDX code shouldn't enforce KVM's rules. So if
> > >> > the check is deemed necessary, it still shouldn't be in TDX code, I
> > >> > think.
> > >> >
> > >> > > Pro: Preventing zapping private memory until conversion is successful is good.
> > >> > >
> > >> > > However, could we achieve this benefit in other ways? For example, is it
> > >> > > possible to ensure hugetlb_restructuring_split_folio() can't fail by ensuring
> > >> > > split_entries() can't fail (via pre-allocation?) and disabling hugetlb_vmemmap
> > >> > > optimization? (hugetlb_vmemmap conversion is super slow according to my
> > >> > > observation and I always disable it).
> > >> >
> > >> > HugeTLB vmemmap optimization gives us 1.6% of memory in savings. For a
> > >> > huge VM, multiplied by a large number of hosts, this is not a trivial
> > >> > amount of memory. It's one of the key reasons why we are using HugeTLB
> > >> > in guest_memfd in the first place, other than to be able to get high
> > >> > level page table mappings. We want this in production.
> > >> >
> > >> > > Or pre-allocation for
> > >> > > vmemmap_remap_alloc()?
> > >> > >
> > >> >
> > >> > Will investigate if this is possible as mentioned above. Thanks for the
> > >> > suggestion again!
> > >> >
> > >> > > Dropping TDX's sanity check may only serve as our last resort. IMHO, zapping
> > >> > > private memory before conversion succeeds is still better than introducing the
> > >> > > mess between folio size and mapping size.
> > >> > >
> > >> > >> > I guess perhaps the question is, is it okay if the folios are smaller
> > >> > >> > than the mapping while conversion is in progress? Does the order matter
> > >> > >> > (split page table entries first vs split folios first)?
> > >> > >>
> > >> > >> Mapping a hugepage for memory that KVM _knows_ is contiguous and homogenous is
> > >> > >> conceptually totally fine, i.e. I'm not totally opposed to adding support for
> > >> > >> mapping multiple guest_memfd folios with a single hugepage.   As to whether we
> > >> > >> do (a) nothing, (b) change the refcounting, or (c) add support for mapping
> > >> > >> multiple folios in one page, probably comes down to which option provides "good
> > >> > >> enough" performance without incurring too much complexity.
> > >> >
> > 

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [PATCH v3 01/24] x86/tdx: Enhance tdh_mem_page_aug() to support huge pages
  2026-01-09 18:29           ` Ackerley Tng
@ 2026-01-12  2:41             ` Yan Zhao
  2026-01-13 16:50               ` Vishal Annapurve
  0 siblings, 1 reply; 127+ messages in thread
From: Yan Zhao @ 2026-01-12  2:41 UTC (permalink / raw)
  To: Ackerley Tng
  Cc: Dave Hansen, pbonzini, seanjc, linux-kernel, kvm, x86,
	rick.p.edgecombe, kas, tabba, michael.roth, david, vannapurve,
	sagis, vbabka, thomas.lendacky, nik.borisov, pgonda, fan.du,
	jun.miao, francescolavra.fl, jgross, ira.weiny, isaku.yamahata,
	xiaoyao.li, kai.huang, binbin.wu, chao.p.peng, chao.gao

On Fri, Jan 09, 2026 at 10:29:47AM -0800, Ackerley Tng wrote:
> Yan Zhao <yan.y.zhao@intel.com> writes:
> 
> > On Wed, Jan 07, 2026 at 08:39:55AM -0800, Dave Hansen wrote:
> >> On 1/7/26 01:12, Yan Zhao wrote:
> >> ...
> >> > However, my understanding is that it's better for functions expecting huge pages
> >> > to explicitly receive "folio" instead of "page". This way, people can tell from
> >> > a function's declaration what the function expects. Is this understanding
> >> > correct?
> >>
> >> In a perfect world, maybe.
> >>
> >> But, in practice, a 'struct page' can still represent huge pages and
> >> *does* represent huge pages all over the kernel. There's no need to cram
> >> a folio in here just because a huge page is involved.
> > Ok. I can modify the param "struct page *page" to "struct page *base_page",
> > explaining that it may belong to a huge folio but is not necessarily the
> > head page of the folio.
> >
> >> > Passing "start_idx" along with "folio" is due to the requirement of mapping only
> >> > a sub-range of a huge folio. e.g., we allow creating a 2MB mapping starting from
> >> > the nth idx of a 1GB folio.
> >> >
> >> > On the other hand, if we instead pass "page" to tdh_mem_page_aug() for huge
> >> > pages and have tdh_mem_page_aug() internally convert it to "folio" and
> >> > "start_idx", it makes me wonder if we could have previously just passed "pfn" to
> >> > tdh_mem_page_aug() and had tdh_mem_page_aug() convert it to "page".
> >>
> >> As a general pattern, I discourage folks from using pfns and physical
> >> addresses when passing around references to physical memory. They have
> >> zero type safety.
> >>
> >> It's also not just about type safety. A 'struct page' also *means*
> >> something. It means that the kernel is, on some level, aware of and
> >> managing that memory. It's not MMIO. It doesn't represent the physical
> >> address of the APIC page. It's not SGX memory. It doesn't have a
> >> Shared/Private bit.
> >>
> >> All of those properties are important and they're *GONE* if you use a
> >> pfn. It's even worse if you use a raw physical address.
> >>
> >> Please don't go back to raw integers (pfns or paddrs).
> > I understood and fully accept it.
> >
> > I previously wondered if we could allow KVM to pass in pfn and let the SEAMCALL
> > wrapper do the pfn_to_page() conversion.
> > But it was just out of curiosity. I actually prefer "struct page" too.
> >
> >
> >> >>> -	tdx_clflush_page(page);
> >> >>> +	if (start_idx + npages > folio_nr_pages(folio))
> >> >>> +		return TDX_OPERAND_INVALID;
> >> >>
> >> >> Why is this necessary? Would it be a bug if this happens?
> >> > This sanity check is due to the requirement in KVM that mapping size should be
> >> > no larger than the backend folio size, which ensures the mapping pages are
> >> > physically contiguous with homogeneous page attributes. (See the discussion
> >> > about "EPT mapping size and folio size" in thread [1]).
> >> >
> >> > Failure of the sanity check could only be due to bugs in the caller (KVM). I
> >> > didn't convert the sanity check to an assertion because there's already a
> >> > TDX_BUG_ON_2() on error following the invocation of tdh_mem_page_aug() in KVM.
> >>
> >> We generally don't protect against bugs in callers. Otherwise, we'd have
> >> a trillion NULL checks in every function in the kernel.
> >>
> >> The only reason to add caller sanity checks is to make things easier to
> >> debug, and those almost always include some kind of spew:
> >> WARN_ON_ONCE(), pr_warn(), etc...
> >
> > Would it be better if I use WARN_ON_ONCE()? like this:
> >
> > u64 tdh_mem_page_aug(struct tdx_td *td, u64 gpa, int level, struct page *base_page,
> >                      u64 *ext_err1, u64 *ext_err2)
> > {
> >         unsigned long npages = tdx_sept_level_to_npages(level);
> >         struct tdx_module_args args = {
> >                 .rcx = gpa | level,
> >                 .rdx = tdx_tdr_pa(td),
> >                 .r8 = page_to_phys(base_page),
> >         };
> >         u64 ret;
> >
> >         WARN_ON_ONCE(page_folio(base_page) != page_folio(base_page + npages - 1));
> 
> This WARNs if the first and last folios are not the same folio, which
If the first and last pages belong to the same folio, the entire range
should be fully covered by a single folio, no?

Maybe the original check, as below, is better :)

struct folio *folio = page_folio(base_page);
WARN_ON_ONCE(folio_page_idx(folio, base_page) + npages > folio_nr_pages(folio));

See more in the next comment below.

> still assumes something about how pages are grouped into folios. I feel
> that this is still stretching TDX code over to make assumptions about
> how the kernel manages memory metadata, which is more than TDX actually
> cares about.
> 
> >
> >         for (int i = 0; i < npages; i++)
> >                 tdx_clflush_page(base_page + i);
> >
> >         ret = seamcall_ret(TDH_MEM_PAGE_AUG, &args);
> >
> >         *ext_err1 = args.rcx;
> >         *ext_err2 = args.rdx;
> >
> >         return ret;
> > }
> >
> > The WARN_ON_ONCE() serves 2 purposes:
> > 1. Loudly warn of subtle KVM bugs.
> > 2. Ensure "page_to_pfn(base_page + i) == (page_to_pfn(base_page) + i)".
> >
> 
> I disagree with checking within TDX code, but if you would still like to
> check, 2. that you suggested is less dependent on the concept of how the
> kernel groups pages in folios, how about:
> 
>   WARN_ON_ONCE(page_to_pfn(base_page + npages - 1) !=
>                page_to_pfn(base_page) + npages - 1);
> 
> The full contiguity check will scan every page, but I think this doesn't
> take too many CPU cycles, and would probably catch what you're looking
> to catch in most cases.
As Dave said, "struct page" serves to guard against MMIO.

e.g., with the below memory layout, checking contiguity of every PFN is
still not enough.

PFN 0x1000: Normal RAM
PFN 0x1001: MMIO
PFN 0x1002: Normal RAM

Also, is it even safe to reference struct page for PFN 0x1001 (e.g. with
SPARSEMEM without SPARSEMEM_VMEMMAP)?

Leveraging the folio makes this safe and simpler.
Since KVM also relies on the folio size to determine the mapping size, TDX
doesn't introduce extra limitations.

> I still don't think TDX code should check. The caller should check or
> know the right thing to do.
Hmm. I don't think the backend folio should be split before it's unmapped
(refer to __folio_split()). Or at least we need to split the S-EPT before
performing the backend folio split (see *).

However, the new gmem does allow this to happen.
So, I think a warning is necessary to aid in debugging subtle bugs.

[*] https://lore.kernel.org/kvm/aWRQ2xyc9coA6aCg@yzhao56-desk.sh.intel.com/
> > If you don't like using "base_page + i" (as the discussion in v2 [1]), we can
> > invoke folio_page() for the ith page instead.
> >
> > [1] https://lore.kernel.org/all/01731a9a0346b08577fad75ae560c650145c7f39.camel@intel.com/
> >
> >> >>> +	for (int i = 0; i < npages; i++)
> >> >>> +		tdx_clflush_page(folio_page(folio, start_idx + i));
> >> >>
> >> >> All of the page<->folio conversions are kinda hurting my brain. I think
> >> >> we need to decide what the canonical type for these things is in TDX, do
> >> >> the conversion once, and stick with it.
> >> > Got it!
> >> >
> >> > Since passing in base "page" or base "pfn" may still require the
> >> > wrappers/helpers to internally convert them to "folio" for sanity checks, could
> >> > we decide that "folio" and "start_idx" are the canonical params for functions
> >> > expecting huge pages? Or do you prefer KVM to do the sanity check by itself?
> >>
> >> I'm not convinced the sanity check is a good idea in the first place. It
> >> just adds complexity.
> > I'm worried about subtle bugs introduced by careless coding that might be
> > silently ignored otherwise, like the one in thread [2].
> >
> > [2] https://lore.kernel.org/kvm/aV2A39fXgzuM4Toa@google.com/


* Re: [PATCH v3 00/24] KVM: TDX huge page support for private memory
  2026-01-12  2:12                       ` Yan Zhao
@ 2026-01-12 19:56                         ` Ackerley Tng
  2026-01-13  6:10                           ` Yan Zhao
  0 siblings, 1 reply; 127+ messages in thread
From: Ackerley Tng @ 2026-01-12 19:56 UTC (permalink / raw)
  To: Yan Zhao, Vishal Annapurve, Sean Christopherson, pbonzini,
	linux-kernel, kvm, x86, rick.p.edgecombe, dave.hansen, kas, tabba,
	michael.roth, david, sagis, vbabka, thomas.lendacky, nik.borisov,
	pgonda, fan.du, jun.miao, francescolavra.fl, jgross, ira.weiny,
	isaku.yamahata, xiaoyao.li, kai.huang, binbin.wu, chao.p.peng,
	chao.gao

Yan Zhao <yan.y.zhao@intel.com> writes:

>> > >> > I think the central question I have among all the above is what TDX
>> > >> > needs to actually care about (putting aside what KVM's folio size/memory
>> > >> > contiguity vs mapping level rule for a while).
>> > >> >
>> > >> > I think TDX code can check what it cares about (if required to aid
>> > >> > debugging, as Dave suggested). Does TDX actually care about folio sizes,
>> > >> > or does it actually care about memory contiguity and alignment?
>> > >> TDX cares about memory contiguity. A single folio ensures memory contiguity.
>> > >
>> > > In this slightly unusual case, I think the guarantee needed here is
>> > > that as long as a range is mapped into SEPT entries, guest_memfd
>> > > ensures that the complete range stays private.
>> > >
>> > > i.e. I think it should be safe to rely on guest_memfd here,
>> > > irrespective of the folio sizes:
>> > > 1) KVM TDX stack should be able to reclaim the complete range when unmapping.
>> > > 2) KVM TDX stack can assume that as long as memory is mapped in SEPT
>> > > entries, guest_memfd will not let host userspace mappings to access
>> > > guest private memory.
>> > >
>> > >>
>> > >> Allowing one S-EPT mapping to cover multiple folios may also mean it's no longer
>> > >> reasonable to pass "struct page" to tdh_phymem_page_wbinvd_hkid() for a
>> > >> contiguous range larger than the page's folio range.
>> > >
>> > > What's the issue with passing the (struct page*, unsigned long nr_pages) pair?
>> > >

Please let us know what you think of this too: why not parametrize using
page and nr_pages?

>> > >>
>> > >> Additionally, we don't split private mappings in kvm_gmem_error_folio().
>> > >> If smaller folios are allowed, splitting private mapping is required there.
>> >
>> > It was discussed before that for memory failure handling, we will want
>> > to split huge pages, we will get to it! The trouble is that guest_memfd
>> > took the page from HugeTLB (unlike buddy or HugeTLB which manages memory
>> > from the ground up), so we'll still need to figure out it's okay to let
>> > HugeTLB deal with it when freeing, and when I last looked, HugeTLB
>> > doesn't actually deal with poisoned folios on freeing, so there's more
>> > work to do on the HugeTLB side.
>> >
>> > This is a good point, although IIUC it is a separate issue. The need to
>> > split private mappings on memory failure is not for confidentiality in
>> > the TDX sense but to ensure that the guest doesn't use the failed
>> > memory. In that case, contiguity is broken by the failed memory. The
>> > folio is split, the private EPTs are split. The folio size should still
>> > not be checked in TDX code. guest_memfd knows contiguity got broken, so
>> > guest_memfd calls TDX code to split the EPTs.
>>
>> Hmm, maybe the key is that we need to split S-EPT first before allowing
>> guest_memfd to split the backend folio. If splitting S-EPT fails, don't do the
>> folio splitting.
>>
>> This is better than performing folio splitting while it's mapped as huge in
>> S-EPT, since in the latter case, kvm_gmem_error_folio() needs to try to split
>> S-EPT. If the S-EPT splitting fails, falling back to zapping the huge mapping in
>> kvm_gmem_error_folio() would still trigger the over-zapping issue.
>>

Let's put memory failure handling aside for now since for now it zaps
the entire huge page, so there's no impact on ordering between S-EPT and
folio split.

>> In the primary MMU, it follows the rule of unmapping a folio before splitting,
>> truncating, or migrating a folio. For S-EPT, considering the cost of zapping
>> more ranges than necessary, maybe a trade-off is to always split S-EPT before
>> allowing backend folio splitting.
>>

The mapping size <= folio size rule (for KVM and the primary MMU) is
there because it is the safe way to map memory into the guest: a folio
implies contiguity. Folios are basically a core MM concept, so it makes
sense that the primary MMU relies on that.

IIUC the core of the rule isn't folio sizes, it's memory
contiguity. guest_memfd guarantees memory contiguity, and KVM should be
able to rely on guest_memfd's guarantee, especially since guest_memfd is
virtualization-first and KVM-first.

I think rules from the primary MMU are a good reference, but we
shouldn't copy rules from the primary MMU, and KVM can rely on
guest_memfd's guarantee of memory contiguity.

>> Does this look good to you?
> So, the flow of converting 0-4KB from private to shared in a 1GB folio in
> guest_memfd is:
>
> a. If guest_memfd splits 1GB to 2MB first:
>    1. split S-EPT to 4KB for 0-2MB range, split S-EPT to 2MB for the rest range.
>    2. split folio
>    3. zap the 0-4KB mapping.
>
> b. If guest_memfd splits 1GB to 4KB directly:
>    1. split S-EPT to 4KB for 0-2MB range, split S-EPT to 4KB for the rest range.
>    2. split folio
>    3. zap the 0-4KB mapping.
>
> The flow of converting 0-2MB from private to shared in a 1GB folio in
> guest_memfd is:
>
> a. If guest_memfd splits 1GB to 2MB first:
>    1. split S-EPT to 4KB for 0-2MB range, split S-EPT to 2MB for the rest range.
>    2. split folio
>    3. zap the 0-2MB mapping.
>
> b. If guest_memfd splits 1GB to 4KB directly:
>    1. split S-EPT to 4KB for 0-2MB range, split S-EPT to 4KB for the rest range.
>    2. split folio
>    3. zap the 0-2MB mapping.
>
>> So, to convert a 2MB range from private to shared, even though guest_memfd will
>> eventually zap the entire 2MB range, do the S-EPT splitting first! If it fails,
>> don't split the backend folio.
>>
>> Even if folio splitting may fail later, it just leaves split S-EPT mappings,
>> which matters little, especially after we support S-EPT promotion later.
>>

I didn't consider leaving split S-EPT mappings since there is a
performance impact. Let me think about this a little.

Meanwhile, if the folios are split before the S-EPTs are split, as long
as huge folios' worth of memory is guaranteed contiguous by guest_memfd
for KVM, what are the problems you see?


* Re: [PATCH v3 00/24] KVM: TDX huge page support for private memory
  2026-01-06 23:43         ` Sean Christopherson
  2026-01-07  9:03           ` Yan Zhao
  2026-01-07 19:22           ` Edgecombe, Rick P
@ 2026-01-12 20:15           ` Ackerley Tng
  2026-01-14  0:33             ` Yan Zhao
  2 siblings, 1 reply; 127+ messages in thread
From: Ackerley Tng @ 2026-01-12 20:15 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Vishal Annapurve, Yan Zhao, pbonzini, linux-kernel, kvm, x86,
	rick.p.edgecombe, dave.hansen, kas, tabba, michael.roth, david,
	sagis, vbabka, thomas.lendacky, nik.borisov, pgonda, fan.du,
	jun.miao, francescolavra.fl, jgross, ira.weiny, isaku.yamahata,
	xiaoyao.li, kai.huang, binbin.wu, chao.p.peng, chao.gao

Sean Christopherson <seanjc@google.com> writes:

> Mapping a hugepage for memory that KVM _knows_ is contiguous and homogenous is
> conceptually totally fine, i.e. I'm not totally opposed to adding support for
> mapping multiple guest_memfd folios with a single hugepage. As to whether we

Sean, I'd like to clarify this.

> do (a) nothing,

What does do nothing mean here?

In this patch series the TDX functions do sanity checks ensuring that
mapping size <= folio size. IIUC the checks at mapping time, like in
tdh_mem_page_aug(), would be fine since at the time of mapping,
mapping size <= folio size, but we'd be in trouble at the time of
zapping, since that's when mapping sizes > folio sizes get discovered.

The sanity checks are in principle in direct conflict with allowing
mapping of multiple guest_memfd folios at hugepage level.

> (b) change the refcounting, or

I think this is pretty hard unless something changes in core MM that
allows refcounting to be customizable by the FS. guest_memfd would love
to have that, but customizable refcounting is going to hurt refcounting
performance throughout the kernel.

> (c) add support for mapping multiple folios in one page,

Where would the changes need to be made, IIUC there aren't any checks
currently elsewhere in KVM to ensure that mapping size <= folio size,
other than the sanity checks in the TDX code proposed in this series.

Does any support need to be added, or is it about amending the
unenforced/unwritten rule from "mapping size <= folio size" to "mapping
size <= contiguous memory size"?

> probably comes down to which option provides "good
> enough" performance without incurring too much complexity.


* Re: [PATCH v3 00/24] KVM: TDX huge page support for private memory
  2026-01-12 19:56                         ` Ackerley Tng
@ 2026-01-13  6:10                           ` Yan Zhao
  2026-01-13 16:40                             ` Vishal Annapurve
  0 siblings, 1 reply; 127+ messages in thread
From: Yan Zhao @ 2026-01-13  6:10 UTC (permalink / raw)
  To: Ackerley Tng
  Cc: Vishal Annapurve, Sean Christopherson, pbonzini, linux-kernel,
	kvm, x86, rick.p.edgecombe, dave.hansen, kas, tabba, michael.roth,
	david, sagis, vbabka, thomas.lendacky, nik.borisov, pgonda,
	fan.du, jun.miao, francescolavra.fl, jgross, ira.weiny,
	isaku.yamahata, xiaoyao.li, kai.huang, binbin.wu, chao.p.peng,
	chao.gao

On Mon, Jan 12, 2026 at 11:56:01AM -0800, Ackerley Tng wrote:
> Yan Zhao <yan.y.zhao@intel.com> writes:
> 
> >> > >> > I think the central question I have among all the above is what TDX
> >> > >> > needs to actually care about (putting aside what KVM's folio size/memory
> >> > >> > contiguity vs mapping level rule for a while).
> >> > >> >
> >> > >> > I think TDX code can check what it cares about (if required to aid
> >> > >> > debugging, as Dave suggested). Does TDX actually care about folio sizes,
> >> > >> > or does it actually care about memory contiguity and alignment?
> >> > >> TDX cares about memory contiguity. A single folio ensures memory contiguity.
> >> > >
> >> > > In this slightly unusual case, I think the guarantee needed here is
> >> > > that as long as a range is mapped into SEPT entries, guest_memfd
> >> > > ensures that the complete range stays private.
> >> > >
> >> > > i.e. I think it should be safe to rely on guest_memfd here,
> >> > > irrespective of the folio sizes:
> >> > > 1) KVM TDX stack should be able to reclaim the complete range when unmapping.
> >> > > 2) KVM TDX stack can assume that as long as memory is mapped in SEPT
> >> > > entries, guest_memfd will not let host userspace mappings to access
> >> > > guest private memory.
> >> > >
> >> > >>
> >> > >> Allowing one S-EPT mapping to cover multiple folios may also mean it's no longer
> >> > >> reasonable to pass "struct page" to tdh_phymem_page_wbinvd_hkid() for a
> >> > >> contiguous range larger than the page's folio range.
> >> > >
> >> > > What's the issue with passing the (struct page*, unsigned long nr_pages) pair?
> >> > >
> 
> Please let us know what you think of this too, why not parametrize using
> page and nr_pages?
With the (struct page *, unsigned long nr_pages) pair, IMHO, a warning when
the entire range is not fully contained in a single folio is still necessary.

I expressed the concern here:
https://lore.kernel.org/kvm/aWRfVOZpTUdYJ+7C@yzhao56-desk.sh.intel.com/

> >> > >>
> >> > >> Additionally, we don't split private mappings in kvm_gmem_error_folio().
> >> > >> If smaller folios are allowed, splitting private mapping is required there.
> >> >
> >> > It was discussed before that for memory failure handling, we will want
> >> > to split huge pages, we will get to it! The trouble is that guest_memfd
> >> > took the page from HugeTLB (unlike buddy or HugeTLB which manages memory
> >> > from the ground up), so we'll still need to figure out it's okay to let
> >> > HugeTLB deal with it when freeing, and when I last looked, HugeTLB
> >> > doesn't actually deal with poisoned folios on freeing, so there's more
> >> > work to do on the HugeTLB side.
> >> >
> >> > This is a good point, although IIUC it is a separate issue. The need to
> >> > split private mappings on memory failure is not for confidentiality in
> >> > the TDX sense but to ensure that the guest doesn't use the failed
> >> > memory. In that case, contiguity is broken by the failed memory. The
> >> > folio is split, the private EPTs are split. The folio size should still
> >> > not be checked in TDX code. guest_memfd knows contiguity got broken, so
> >> > guest_memfd calls TDX code to split the EPTs.
> >>
> >> Hmm, maybe the key is that we need to split S-EPT first before allowing
> >> guest_memfd to split the backend folio. If splitting S-EPT fails, don't do the
> >> folio splitting.
> >>
> >> This is better than performing folio splitting while it's mapped as huge in
> >> S-EPT, since in the latter case, kvm_gmem_error_folio() needs to try to split
> >> S-EPT. If the S-EPT splitting fails, falling back to zapping the huge mapping in
> >> kvm_gmem_error_folio() would still trigger the over-zapping issue.
> >>
> 
> Let's put memory failure handling aside for now since for now it zaps
> the entire huge page, so there's no impact on ordering between S-EPT and
> folio split.
Relying on guest_memfd's specific implementation is not a good thing. e.g.,

Given there's a version of guest_memfd allocating folios from the buddy
allocator:
1. KVM maps a 2MB folio in a 2MB mapping.
2. guest_memfd splits the 2MB folio into 4KB folios, but fails and leaves the
   2MB folio partially split.
3. Memory failure occurs on one of the split folios.
4. When splitting S-EPT fails, the over-zapping issue is still there.

> >> In the primary MMU, it follows the rule of unmapping a folio before splitting,
> >> truncating, or migrating a folio. For S-EPT, considering the cost of zapping
> >> more ranges than necessary, maybe a trade-off is to always split S-EPT before
> >> allowing backend folio splitting.
> >>
> 
> The mapping size <= folio size rule (for KVM and the primary MMU) is
> there because it is the safe way to map memory into the guest because a
> folio implies contiguity. Folios are basically a core MM concept so it
> makes sense that the primary MMU relies on that.
So, why does the primary MMU need to unmap and check the refcount before
folio splitting?

> IIUC the core of the rule isn't folio sizes, it's memory
> contiguity. guest_memfd guarantees memory contiguity, and KVM should be
> able to rely on guest_memfd's guarantee, especially since guest_memfd is
> virtualization-first and KVM-first.
>
> I think rules from the primary MMU are a good reference, but we
> shouldn't copy rules from the primary MMU, and KVM can rely on
> guest_memfd's guarantee of memory contiguity.
>
> >> Does this look good to you?
> > So, the flow of converting 0-4KB from private to shared in a 1GB folio in
> > guest_memfd is:
> >
> > a. If guest_memfd splits 1GB to 2MB first:
> >    1. split S-EPT to 4KB for 0-2MB range, split S-EPT to 2MB for the rest range.
> >    2. split folio
> >    3. zap the 0-4KB mapping.
> >
> > b. If guest_memfd splits 1GB to 4KB directly:
> >    1. split S-EPT to 4KB for 0-2MB range, split S-EPT to 4KB for the rest range.
> >    2. split folio
> >    3. zap the 0-4KB mapping.
> >
> > The flow of converting 0-2MB from private to shared in a 1GB folio in
> > guest_memfd is:
> >
> > a. If guest_memfd splits 1GB to 2MB first:
> >    1. split S-EPT to 4KB for 0-2MB range, split S-EPT to 2MB for the rest range.
> >    2. split folio
> >    3. zap the 0-2MB mapping.
> >
> > b. If guest_memfd splits 1GB to 4KB directly:
> >    1. split S-EPT to 4KB for 0-2MB range, split S-EPT to 4KB for the rest range.
> >    2. split folio
> >    3. zap the 0-2MB mapping.
> >
> >> So, to convert a 2MB range from private to shared, even though guest_memfd will
> >> eventually zap the entire 2MB range, do the S-EPT splitting first! If it fails,
> >> don't split the backend folio.
> >>
> >> Even if folio splitting may fail later, it just leaves split S-EPT mappings,
> >> which matters little, especially after we support S-EPT promotion later.
> >>
> 
> I didn't consider leaving split S-EPT mappings since there is a
> performance impact. Let me think about this a little.
> 
> Meanwhile, if the folios are split before the S-EPTs are split, as long
> as huge folios worth of memory are guaranteed contiguous by guest_memfd
> for KVM, what are the problems you see?
Hmm. As the reply in
https://lore.kernel.org/kvm/aV4hAfPZXfKKB+7i@yzhao56-desk.sh.intel.com/,
there're pros and cons. I'll defer to maintainers' decision.


* Re: [PATCH v3 00/24] KVM: TDX huge page support for private memory
  2026-01-13  6:10                           ` Yan Zhao
@ 2026-01-13 16:40                             ` Vishal Annapurve
  2026-01-14  9:32                               ` Yan Zhao
  0 siblings, 1 reply; 127+ messages in thread
From: Vishal Annapurve @ 2026-01-13 16:40 UTC (permalink / raw)
  To: Yan Zhao
  Cc: Ackerley Tng, Sean Christopherson, pbonzini, linux-kernel, kvm,
	x86, rick.p.edgecombe, dave.hansen, kas, tabba, michael.roth,
	david, sagis, vbabka, thomas.lendacky, nik.borisov, pgonda,
	fan.du, jun.miao, francescolavra.fl, jgross, ira.weiny,
	isaku.yamahata, xiaoyao.li, kai.huang, binbin.wu, chao.p.peng,
	chao.gao

On Mon, Jan 12, 2026 at 10:13 PM Yan Zhao <yan.y.zhao@intel.com> wrote:
>
> > >> > >>
> > >> > >> Additionally, we don't split private mappings in kvm_gmem_error_folio().
> > >> > >> If smaller folios are allowed, splitting private mapping is required there.
> > >> >
> > >> > It was discussed before that for memory failure handling, we will want
> > >> > to split huge pages, we will get to it! The trouble is that guest_memfd
> > >> > took the page from HugeTLB (unlike buddy or HugeTLB which manages memory
> > >> > from the ground up), so we'll still need to figure out it's okay to let
> > >> > HugeTLB deal with it when freeing, and when I last looked, HugeTLB
> > >> > doesn't actually deal with poisoned folios on freeing, so there's more
> > >> > work to do on the HugeTLB side.
> > >> >
> > >> > This is a good point, although IIUC it is a separate issue. The need to
> > >> > split private mappings on memory failure is not for confidentiality in
> > >> > the TDX sense but to ensure that the guest doesn't use the failed
> > >> > memory. In that case, contiguity is broken by the failed memory. The
> > >> > folio is split, the private EPTs are split. The folio size should still
> > >> > not be checked in TDX code. guest_memfd knows contiguity got broken, so
> > >> > guest_memfd calls TDX code to split the EPTs.
> > >>
> > >> Hmm, maybe the key is that we need to split S-EPT first before allowing
> > >> guest_memfd to split the backend folio. If splitting S-EPT fails, don't do the
> > >> folio splitting.
> > >>
> > >> This is better than performing folio splitting while it's mapped as huge in
> > >> S-EPT, since in the latter case, kvm_gmem_error_folio() needs to try to split
> > >> S-EPT. If the S-EPT splitting fails, falling back to zapping the huge mapping in
> > >> kvm_gmem_error_folio() would still trigger the over-zapping issue.
> > >>
> >
> > Let's put memory failure handling aside for now since for now it zaps
> > the entire huge page, so there's no impact on ordering between S-EPT and
> > folio split.
> Relying on guest_memfd's specific implementation is not a good thing. e.g.,
>
> Given there's a version of guest_memfd allocating folios from the buddy
> allocator:
> 1. KVM maps a 2MB folio in a 2MB mapping.
> 2. guest_memfd splits the 2MB folio into 4KB folios, but fails and leaves the
>    2MB folio partially split.
> 3. Memory failure occurs on one of the split folios.
> 4. When splitting S-EPT fails, the over-zapping issue is still there.
>

Why is overzapping an issue?

Memory failure is supposed to be a rare occurrence, and if there is no
memory to handle the splitting, I don't see any other choice than
over-zapping. IIUC splitting the huge page range (in the 1G -> 4K scenario)
requires even more memory than just splitting cross-boundary leaves
and has a higher chance of failing.

i.e. Whether the folio is split first or the SEPTs, there is always a
chance of failure leading to over-zapping. I don't see value in
optimizing rare failures within rarer memory failure handling
codepaths which are supposed to make best-effort decisions anyway.


* Re: [PATCH v3 01/24] x86/tdx: Enhance tdh_mem_page_aug() to support huge pages
  2026-01-12  2:41             ` Yan Zhao
@ 2026-01-13 16:50               ` Vishal Annapurve
  2026-01-14  1:48                 ` Yan Zhao
  0 siblings, 1 reply; 127+ messages in thread
From: Vishal Annapurve @ 2026-01-13 16:50 UTC (permalink / raw)
  To: Yan Zhao
  Cc: Ackerley Tng, Dave Hansen, pbonzini, seanjc, linux-kernel, kvm,
	x86, rick.p.edgecombe, kas, tabba, michael.roth, david, sagis,
	vbabka, thomas.lendacky, nik.borisov, pgonda, fan.du, jun.miao,
	francescolavra.fl, jgross, ira.weiny, isaku.yamahata, xiaoyao.li,
	kai.huang, binbin.wu, chao.p.peng, chao.gao

On Sun, Jan 11, 2026 at 6:44 PM Yan Zhao <yan.y.zhao@intel.com> wrote:
>
> > > The WARN_ON_ONCE() serves 2 purposes:
> > > 1. Loudly warn of subtle KVM bugs.
> > > 2. Ensure "page_to_pfn(base_page + i) == (page_to_pfn(base_page) + i)".
> > >
> >
> > I disagree with checking within TDX code, but if you would still like to
> > check, 2. that you suggested is less dependent on the concept of how the
> > kernel groups pages in folios, how about:
> >
> >   WARN_ON_ONCE(page_to_pfn(base_page + npages - 1) !=
> >                page_to_pfn(base_page) + npages - 1);
> >
> > The full contiguity check will scan every page, but I think this doesn't
> > take too many CPU cycles, and would probably catch what you're looking
> > to catch in most cases.
> As Dave said,  "struct page" serves to guard against MMIO.
>
> e.g., with below memory layout, checking continuity of every PFN is still not
> enough.
>
> PFN 0x1000: Normal RAM
> PFN 0x1001: MMIO
> PFN 0x1002: Normal RAM
>

I don't see how guest_memfd memory can be interspersed with MMIO regions.

Is this in reference to the future extension to add private MMIO
ranges? I think this discussion belongs in the context of TDX connect
feature patches. I assume shared/private MMIO assignment to the guests
will happen via completely different paths. And I would assume EPT
entries will have information about whether the mapped ranges are MMIO
or normal memory.

i.e. Anything mapped as normal memory in SEPT entries as a huge range
should be safe to operate on without needing to cross-check sanity in
the KVM TDX stack. If a huge range has MMIO/normal RAM ranges mixed up
then that is a much bigger problem.

> Also, is it even safe to reference struct page for PFN 0x1001 (e.g. with
> SPARSEMEM without SPARSEMEM_VMEMMAP)?
>
> Leveraging folio makes it safe and simpler.
> Since KVM also relies on folio size to determine mapping size, TDX doesn't
> introduce extra limitations.
>

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [PATCH v3 00/24] KVM: TDX huge page support for private memory
  2026-01-12 20:15           ` Ackerley Tng
@ 2026-01-14  0:33             ` Yan Zhao
  2026-01-14  1:24               ` Sean Christopherson
  0 siblings, 1 reply; 127+ messages in thread
From: Yan Zhao @ 2026-01-14  0:33 UTC (permalink / raw)
  To: Ackerley Tng
  Cc: Sean Christopherson, Vishal Annapurve, pbonzini, linux-kernel,
	kvm, x86, rick.p.edgecombe, dave.hansen, kas, tabba, michael.roth,
	david, sagis, vbabka, thomas.lendacky, nik.borisov, pgonda,
	fan.du, jun.miao, francescolavra.fl, jgross, ira.weiny,
	isaku.yamahata, xiaoyao.li, kai.huang, binbin.wu, chao.p.peng,
	chao.gao

On Mon, Jan 12, 2026 at 12:15:17PM -0800, Ackerley Tng wrote:
> Sean Christopherson <seanjc@google.com> writes:
> 
> > Mapping a hugepage for memory that KVM _knows_ is contiguous and homogenous is
> > conceptually totally fine, i.e. I'm not totally opposed to adding support for
> > mapping multiple guest_memfd folios with a single hugepage. As to whether we
> 
> Sean, I'd like to clarify this.
> 
> > do (a) nothing,
> 
> What does do nothing mean here?
> 
> In this patch series the TDX functions do sanity checks ensuring that
> mapping size <= folio size. IIUC the checks at mapping time, like in
> tdh_mem_page_aug() would be fine since at the time of mapping, the
> mapping size <= folio size, but we'd be in trouble at the time of
> zapping, since that's when mapping sizes > folio sizes get discovered.
> 
> The sanity checks are in principle in direct conflict with allowing
> mapping of multiple guest_memfd folios at hugepage level.
> 
> > (b) change the refcounting, or
> 
> I think this is pretty hard unless something changes in core MM that
> allows refcounting to be customizable by the FS. guest_memfd would love
> to have that, but customizable refcounting is going to hurt refcounting
> performance throughout the kernel.
> 
> > (c) add support for mapping multiple folios in one page,
> 
> Where would the changes need to be made, IIUC there aren't any checks
> currently elsewhere in KVM to ensure that mapping size <= folio size,
> other than the sanity checks in the TDX code proposed in this series.
> 
> Does any support need to be added, or is it about amending the
> unenforced/unwritten rule from "mapping size <= folio size" to "mapping
> size <= contiguous memory size"?
The rule is not "unenforced/unwritten". In fact, it's the de facto standard in
KVM.

For non-gmem cases, KVM uses the mapping size in the primary MMU as the max
mapping size in the secondary MMU, while the primary MMU does not create a
mapping larger than the backend folio size. When splitting the backend folio,
the Linux kernel unmaps the folio from both the primary MMU and the KVM-managed
secondary MMU (through the MMU notifier).

On the non-KVM side, though IOMMU stage-2 mappings are allowed to be larger
than folio sizes, splitting folios while they are still mapped in the IOMMU
stage-2 page table is not permitted due to the extra folio refcount held by the
IOMMU.

For gmem cases, KVM also does not create mappings larger than the folio size
allocated from gmem. This is why the TDX huge page series relies on gmem's
ability to allocate huge folios.

We really need to be careful if we hope to break this long-established rule.

> > probably comes down to which option provides "good
> > enough" performance without incurring too much complexity.

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [PATCH v3 00/24] KVM: TDX huge page support for private memory
  2026-01-14  0:33             ` Yan Zhao
@ 2026-01-14  1:24               ` Sean Christopherson
  2026-01-14  9:23                 ` Yan Zhao
  0 siblings, 1 reply; 127+ messages in thread
From: Sean Christopherson @ 2026-01-14  1:24 UTC (permalink / raw)
  To: Yan Zhao
  Cc: Ackerley Tng, Vishal Annapurve, pbonzini, linux-kernel, kvm, x86,
	rick.p.edgecombe, dave.hansen, kas, tabba, michael.roth, david,
	sagis, vbabka, thomas.lendacky, nik.borisov, pgonda, fan.du,
	jun.miao, francescolavra.fl, jgross, ira.weiny, isaku.yamahata,
	xiaoyao.li, kai.huang, binbin.wu, chao.p.peng, chao.gao

On Wed, Jan 14, 2026, Yan Zhao wrote:
> On Mon, Jan 12, 2026 at 12:15:17PM -0800, Ackerley Tng wrote:
> > Sean Christopherson <seanjc@google.com> writes:
> > 
> > > Mapping a hugepage for memory that KVM _knows_ is contiguous and homogenous is
> > > conceptually totally fine, i.e. I'm not totally opposed to adding support for
> > > mapping multiple guest_memfd folios with a single hugepage. As to whether we
> > 
> > Sean, I'd like to clarify this.
> > 
> > > do (a) nothing,
> > 
> > What does do nothing mean here?

Don't support hugepages for shared mappings, at least for now (as Rick pointed
out, doing nothing now doesn't mean we can't do something in the future).

> > In this patch series the TDX functions do sanity checks ensuring that
> > mapping size <= folio size. IIUC the checks at mapping time, like in
> > tdh_mem_page_aug() would be fine since at the time of mapping, the
> > mapping size <= folio size, but we'd be in trouble at the time of
> > zapping, since that's when mapping sizes > folio sizes get discovered.
> > 
> > The sanity checks are in principle in direct conflict with allowing
> > mapping of multiple guest_memfd folios at hugepage level.
> > 
> > > (b) change the refcounting, or
> > 
> > I think this is pretty hard unless something changes in core MM that
> > allows refcounting to be customizable by the FS. guest_memfd would love
> > to have that, but customizable refcounting is going to hurt refcounting
> > performance throughout the kernel.
> > 
> > > (c) add support for mapping multiple folios in one page,
> > 
> > Where would the changes need to be made, IIUC there aren't any checks
> > currently elsewhere in KVM to ensure that mapping size <= folio size,
> > other than the sanity checks in the TDX code proposed in this series.
> > 
> > Does any support need to be added, or is it about amending the
> > unenforced/unwritten rule from "mapping size <= folio size" to "mapping
> > size <= contiguous memory size"?
>
> The rule is not "unenforced/unwritten". In fact, it's the de facto standard in
> KVM.

Ya, more or less.

The rules aren't formally documented because the overarching rule is very
simple: KVM must not map memory into the guest that the guest shouldn't have
access to.  That falls firmly into the "well, duh" category, and so it's not
written down anywhere :-)

How exactly KVM has honored that rule has varied over the years, and still varies
between architectures.  In the past KVM x86 special cased HugeTLB and THP, but
that proved to be a pain to maintain and wasn't extensible, e.g. didn't play nice
with DAX, and so KVM x86 pivoted to pulling the mapping size from the primary MMU
page tables.

But arm64 still special cases THP and HugeTLB, *and* VM_PFNMAP memory (eww).

> For non-gmem cases, KVM uses the mapping size in the primary MMU as the max
> mapping size in the secondary MMU, while the primary MMU does not create a
> mapping larger than the backend folio size.

Super strictly speaking, this might not hold true for VM_PFNMAP memory.  E.g. a
driver _could_ split a folio (no idea why it would) but map the entire thing into
userspace, and then userspace could hand off that memory to KVM.

So I'd say _KVM's_ rule isn't so much "mapping size <= folio size" as
"KVM mapping size <= primary MMU mapping size", at least for x86.  Arm's
VM_PFNMAP code sketches me out a bit, but on the other hand, a driver mapping
discontiguous pages into a single VM_PFNMAP VMA would be even more sketch.

But yes, ignoring VM_PFNMAP, AFAIK the primary MMU and thus KVM doesn't map larger
than the folio size.

> When splitting the backend folio, the Linux kernel unmaps the folio from both
> the primary MMU and the KVM-managed secondary MMU (through the MMU notifier).
> 
> On the non-KVM side, though IOMMU stage-2 mappings are allowed to be larger
> than folio sizes, splitting folios while they are still mapped in the IOMMU
> stage-2 page table is not permitted due to the extra folio refcount held by the
> IOMMU.
> 
> For gmem cases, KVM also does not create mappings larger than the folio size
> allocated from gmem. This is why the TDX huge page series relies on gmem's
> ability to allocate huge folios.
> 
> We really need to be careful if we hope to break this long-established rule.

+100 to being careful, but at the same time I don't think we should get _too_
fixated on the guest_memfd folio size.  E.g. similar to VM_PFNMAP, where there
might not be a folio, if guest_memfd stopped using folios, then the entire
discussion becomes moot.

And as above, the long-standing rule isn't about the implementation details so
much as it is about KVM's behavior.  If the simplest solution to support huge
guest_memfd pages is to decouple the max order from the folio, then so be it.

That said, I'd very much like to get a sense of the alternatives, because at the
end of the day, guest_memfd needs to track the max mapping sizes _somewhere_,
and naively, tying that to the folio seems like an easy solution.

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [PATCH v3 01/24] x86/tdx: Enhance tdh_mem_page_aug() to support huge pages
  2026-01-13 16:50               ` Vishal Annapurve
@ 2026-01-14  1:48                 ` Yan Zhao
  0 siblings, 0 replies; 127+ messages in thread
From: Yan Zhao @ 2026-01-14  1:48 UTC (permalink / raw)
  To: Vishal Annapurve
  Cc: Ackerley Tng, Dave Hansen, pbonzini, seanjc, linux-kernel, kvm,
	x86, rick.p.edgecombe, kas, tabba, michael.roth, david, sagis,
	vbabka, thomas.lendacky, nik.borisov, pgonda, fan.du, jun.miao,
	francescolavra.fl, jgross, ira.weiny, isaku.yamahata, xiaoyao.li,
	kai.huang, binbin.wu, chao.p.peng, chao.gao

On Tue, Jan 13, 2026 at 08:50:30AM -0800, Vishal Annapurve wrote:
> On Sun, Jan 11, 2026 at 6:44 PM Yan Zhao <yan.y.zhao@intel.com> wrote:
> >
> > > > The WARN_ON_ONCE() serves 2 purposes:
> > > > 1. Loudly warn of subtle KVM bugs.
> > > > 2. Ensure "page_to_pfn(base_page + i) == (page_to_pfn(base_page) + i)".
> > > >
> > >
> > > I disagree with checking within TDX code, but if you would still like to
> > > check, 2. that you suggested is less dependent on the concept of how the
> > > kernel groups pages in folios, how about:
> > >
> > >   WARN_ON_ONCE(page_to_pfn(base_page + npages - 1) !=
> > >                page_to_pfn(base_page) + npages - 1);
> > >
> > > The full contiguity check will scan every page, but I think this doesn't
> > > take too many CPU cycles, and would probably catch what you're looking
> > > to catch in most cases.
> > As Dave said,  "struct page" serves to guard against MMIO.
> >
> > e.g., with below memory layout, checking continuity of every PFN is still not
> > enough.
> >
> > PFN 0x1000: Normal RAM
> > PFN 0x1001: MMIO
> > PFN 0x1002: Normal RAM
> >
> 
> I don't see how guest_memfd memory can be interspersed with MMIO regions.
It's about API design.

When KVM invokes tdh_phymem_page_wbinvd_hkid(), passing "struct page *base_page"
and "unsigned long npages", a WARN_ON_ONCE() in tdh_phymem_page_wbinvd_hkid()
that checks those pages belong to a single folio effectively ensures they are
physically contiguous and do not contain MMIO.

Similar to "VM_WARN_ON_ONCE_FOLIO(!folio_test_large(folio), folio)" in
__folio_split().

Otherwise, why not just pass "pfn + npages" to tdh_phymem_page_wbinvd_hkid()?

> Is this in reference to the future extension to add private MMIO
> ranges? I think this discussion belongs in the context of TDX connect
> feature patches. I assume shared/private MMIO assignment to the guests
> will happen via completely different paths. And I would assume EPT
> entries will have information about whether the mapped ranges are MMIO
> or normal memory.
> 
> i.e. Anything mapped as normal memory in SEPT entries as a huge range
> should be safe to operate on without needing to cross-check sanity in
> the KVM TDX stack. If a huge range has MMIO/normal RAM ranges mixed up
> then that is a much bigger problem.
> 
> > Also, is it even safe to reference struct page for PFN 0x1001 (e.g. with
> > SPARSEMEM without SPARSEMEM_VMEMMAP)?
> >
> > Leveraging folio makes it safe and simpler.
> > Since KVM also relies on folio size to determine mapping size, TDX doesn't
> > introduce extra limitations.
> >

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [PATCH v3 00/24] KVM: TDX huge page support for private memory
  2026-01-14  1:24               ` Sean Christopherson
@ 2026-01-14  9:23                 ` Yan Zhao
  2026-01-14 15:26                   ` Sean Christopherson
  0 siblings, 1 reply; 127+ messages in thread
From: Yan Zhao @ 2026-01-14  9:23 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Ackerley Tng, Vishal Annapurve, pbonzini, linux-kernel, kvm, x86,
	rick.p.edgecombe, dave.hansen, kas, tabba, michael.roth, david,
	sagis, vbabka, thomas.lendacky, nik.borisov, pgonda, fan.du,
	jun.miao, francescolavra.fl, jgross, ira.weiny, isaku.yamahata,
	xiaoyao.li, kai.huang, binbin.wu, chao.p.peng, chao.gao

On Tue, Jan 13, 2026 at 05:24:36PM -0800, Sean Christopherson wrote:
> On Wed, Jan 14, 2026, Yan Zhao wrote:
> > On Mon, Jan 12, 2026 at 12:15:17PM -0800, Ackerley Tng wrote:
> > > Sean Christopherson <seanjc@google.com> writes:
> > > 
> > > > Mapping a hugepage for memory that KVM _knows_ is contiguous and homogenous is
> > > > conceptually totally fine, i.e. I'm not totally opposed to adding support for
> > > > mapping multiple guest_memfd folios with a single hugepage. As to whether we
> > > 
> > > Sean, I'd like to clarify this.
> > > 
> > > > do (a) nothing,
> > > 
> > > What does do nothing mean here?
> 
> Don't support hugepages for shared mappings, at least for now (as Rick pointed
> out, doing nothing now doesn't mean we can't do something in the future).
> 
> > > In this patch series the TDX functions do sanity checks ensuring that
> > > mapping size <= folio size. IIUC the checks at mapping time, like in
> > > tdh_mem_page_aug() would be fine since at the time of mapping, the
> > > mapping size <= folio size, but we'd be in trouble at the time of
> > > zapping, since that's when mapping sizes > folio sizes get discovered.
> > > 
> > > The sanity checks are in principle in direct conflict with allowing
> > > mapping of multiple guest_memfd folios at hugepage level.
> > > 
> > > > (b) change the refcounting, or
> > > 
> > > I think this is pretty hard unless something changes in core MM that
> > > allows refcounting to be customizable by the FS. guest_memfd would love
> > > to have that, but customizable refcounting is going to hurt refcounting
> > > performance throughout the kernel.
> > > 
> > > > (c) add support for mapping multiple folios in one page,
> > > 
> > > Where would the changes need to be made, IIUC there aren't any checks
> > > currently elsewhere in KVM to ensure that mapping size <= folio size,
> > > other than the sanity checks in the TDX code proposed in this series.
> > > 
> > > Does any support need to be added, or is it about amending the
> > > unenforced/unwritten rule from "mapping size <= folio size" to "mapping
> > > size <= contiguous memory size"?
> >
> > The rule is not "unenforced/unwritten". In fact, it's the de facto standard in
> > KVM.
> 
> Ya, more or less.
> 
> The rules aren't formally documented because the overarching rule is very
> simple: KVM must not map memory into the guest that the guest shouldn't have
> access to.  That falls firmly into the "well, duh" category, and so it's not
> written down anywhere :-)
> 
> How exactly KVM has honored that rule has varied over the years, and still varies
> between architectures.  In the past KVM x86 special cased HugeTLB and THP, but
> that proved to be a pain to maintain and wasn't extensible, e.g. didn't play nice
> with DAX, and so KVM x86 pivoted to pulling the mapping size from the primary MMU
> page tables.
> 
> But arm64 still special cases THP and HugeTLB, *and* VM_PFNMAP memory (eww).
> 
> > For non-gmem cases, KVM uses the mapping size in the primary MMU as the max
> > mapping size in the secondary MMU, while the primary MMU does not create a
> > mapping larger than the backend folio size.
> 
> Super strictly speaking, this might not hold true for VM_PFNMAP memory.  E.g. a
> driver _could_ split a folio (no idea why it would) but map the entire thing into
> > userspace, and then userspace could hand off that memory to KVM.
> 
> > So I'd say _KVM's_ rule isn't so much "mapping size <= folio size" as
> > "KVM mapping size <= primary MMU mapping size", at least for x86.  Arm's
> VM_PFNMAP code sketches me out a bit, but on the other hand, a driver mapping
> discontiguous pages into a single VM_PFNMAP VMA would be even more sketch.
> 
> But yes, ignoring VM_PFNMAP, AFAIK the primary MMU and thus KVM doesn't map larger
> than the folio size.

Oh. I forgot about the VM_PFNMAP case, which allows folios to be provided as the
backend. Indeed, a driver can create a huge mapping in the primary MMU for the
VM_PFNMAP range with multiple discontiguous pages, if it wants.

But this occurs before KVM creates the mapping. Per my understanding, pages
under VM_PFNMAP are pinned, so it looks like there are no splits after they are
mapped into the primary MMU.

So, out of curiosity, do you know why the Linux kernel needs to unmap mappings
from both the primary and secondary MMUs, and check the folio refcount before
performing folio splitting?

> > When splitting the backend folio, the Linux kernel unmaps the folio from both
> > the primary MMU and the KVM-managed secondary MMU (through the MMU notifier).
> > 
> > On the non-KVM side, though IOMMU stage-2 mappings are allowed to be larger
> > than folio sizes, splitting folios while they are still mapped in the IOMMU
> > stage-2 page table is not permitted due to the extra folio refcount held by the
> > IOMMU.
> > 
> > For gmem cases, KVM also does not create mappings larger than the folio size
> > allocated from gmem. This is why the TDX huge page series relies on gmem's
> > ability to allocate huge folios.
> > 
> > We really need to be careful if we hope to break this long-established rule.
> 
> +100 to being careful, but at the same time I don't think we should get _too_
> fixated on the guest_memfd folio size.  E.g. similar to VM_PFNMAP, where there
> might not be a folio, if guest_memfd stopped using folios, then the entire
> discussion becomes moot.
> 
> And as above, the long-standing rule isn't about the implementation details so
> much as it is about KVM's behavior.  If the simplest solution to support huge
> guest_memfd pages is to decouple the max order from the folio, then so be it.
> 
> That said, I'd very much like to get a sense of the alternatives, because at the
> end of the day, guest_memfd needs to track the max mapping sizes _somewhere_,
> and naively, tying that to the folio seems like an easy solution.
Thanks for the explanation.

Alternatively, how do you feel about the approach of splitting S-EPT first
before splitting folios?
If guest_memfd always splits 1GB folios to 2MB first and only splits the
converted range to 4KB, splitting S-EPT before splitting folios should not
introduce too much overhead. Then, we can defer the folio size problem until
guest_memfd stops using folios.

If the decision is to stop relying on folios for unmapping now, do you think
the following changes are reasonable for the TDX huge page series?

- Add WARN_ON_ONCE() to assert that pages are in a single folio in
  tdh_mem_page_aug().
- Do not assert that pages are in a single folio in
  tdh_phymem_page_wbinvd_hkid(). (or just assert pfn_valid() for each page?)
  Could you please give me guidance on
  https://lore.kernel.org/kvm/aWb16XJuSVuyRu7l@yzhao56-desk.sh.intel.com.
- Add S-EPT splitting in kvm_gmem_error_folio() and fail on splitting error.

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [PATCH v3 00/24] KVM: TDX huge page support for private memory
  2026-01-13 16:40                             ` Vishal Annapurve
@ 2026-01-14  9:32                               ` Yan Zhao
  0 siblings, 0 replies; 127+ messages in thread
From: Yan Zhao @ 2026-01-14  9:32 UTC (permalink / raw)
  To: Vishal Annapurve
  Cc: Ackerley Tng, Sean Christopherson, pbonzini, linux-kernel, kvm,
	x86, rick.p.edgecombe, dave.hansen, kas, tabba, michael.roth,
	david, sagis, vbabka, thomas.lendacky, nik.borisov, pgonda,
	fan.du, jun.miao, francescolavra.fl, jgross, ira.weiny,
	isaku.yamahata, xiaoyao.li, kai.huang, binbin.wu, chao.p.peng,
	chao.gao

On Tue, Jan 13, 2026 at 08:40:11AM -0800, Vishal Annapurve wrote:
> On Mon, Jan 12, 2026 at 10:13 PM Yan Zhao <yan.y.zhao@intel.com> wrote:
> >
> > > >> > >>
> > > >> > >> Additionally, we don't split private mappings in kvm_gmem_error_folio().
> > > >> > >> If smaller folios are allowed, splitting private mapping is required there.
> > > >> >
> > > >> > It was discussed before that for memory failure handling, we will want
> > > >> > to split huge pages, we will get to it! The trouble is that guest_memfd
> > > >> > took the page from HugeTLB (unlike buddy or HugeTLB which manages memory
> > > >> > from the ground up), so we'll still need to figure out it's okay to let
> > > >> > HugeTLB deal with it when freeing, and when I last looked, HugeTLB
> > > >> > doesn't actually deal with poisoned folios on freeing, so there's more
> > > >> > work to do on the HugeTLB side.
> > > >> >
> > > >> > This is a good point, although IIUC it is a separate issue. The need to
> > > >> > split private mappings on memory failure is not for confidentiality in
> > > >> > the TDX sense but to ensure that the guest doesn't use the failed
> > > >> > memory. In that case, contiguity is broken by the failed memory. The
> > > >> > folio is split, the private EPTs are split. The folio size should still
> > > >> > not be checked in TDX code. guest_memfd knows contiguity got broken, so
> > > >> > guest_memfd calls TDX code to split the EPTs.
> > > >>
> > > >> Hmm, maybe the key is that we need to split S-EPT first before allowing
> > > >> guest_memfd to split the backend folio. If splitting S-EPT fails, don't do the
> > > >> folio splitting.
> > > >>
> > > >> This is better than performing folio splitting while it's mapped as huge in
> > > >> S-EPT, since in the latter case, kvm_gmem_error_folio() needs to try to split
> > > >> S-EPT. If the S-EPT splitting fails, falling back to zapping the huge mapping in
> > > >> kvm_gmem_error_folio() would still trigger the over-zapping issue.
> > > >>
> > >
> > > Let's put memory failure handling aside for now since for now it zaps
> > > the entire huge page, so there's no impact on ordering between S-EPT and
> > > folio split.
> > Relying on guest_memfd's specific implementation is not a good thing. e.g.,
> >
> > Given there's a version of guest_memfd allocating folios from buddy.
> > 1. KVM maps a 2MB folio in a 2MB mappings.
> > 2. guest_memfd splits the 2MB folio into 4KB folios, but fails and leaves the
> >    2MB folio partially split.
> > 3. Memory failure occurs on one of the split folio.
> > 4. When splitting S-EPT fails, the over-zapping issue is still there.
> >
> 
> Why is overzapping an issue?
> Memory failure is supposed to be a rare occurrence and if there is no
> memory to handle the splitting, I don't see any other choice than
> overzapping. IIUC splitting the huge page range (in 1G -> 4K scenario)
> requires even more memory than just splitting cross-boundary leaves
> and has a higher chance of failing.
> 
> i.e. Whether the folio is split first or the SEPTs, there is always a
> chance of failure leading to over-zapping. I don't see value in
Hmm. If the split occurs after memory failure, yes, splitting S-EPT first also
has a chance of over-zapping. But if the split occurs during private-to-shared
conversion for the non-conversion range, when memory failure later occurs on the
split folio, over-zapping can be avoided.

> optimizing rare failures within rarer memory failure handling
> codepaths which are supposed to make best-effort decisions anyway.
I agree it's of low priority.
Just not sure if there're edge cases besides this one.

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [PATCH v3 00/24] KVM: TDX huge page support for private memory
  2026-01-14  9:23                 ` Yan Zhao
@ 2026-01-14 15:26                   ` Sean Christopherson
  2026-01-14 18:45                     ` Ackerley Tng
                                       ` (2 more replies)
  0 siblings, 3 replies; 127+ messages in thread
From: Sean Christopherson @ 2026-01-14 15:26 UTC (permalink / raw)
  To: Yan Zhao
  Cc: Ackerley Tng, Vishal Annapurve, pbonzini, linux-kernel, kvm, x86,
	rick.p.edgecombe, dave.hansen, kas, tabba, michael.roth, david,
	sagis, vbabka, thomas.lendacky, nik.borisov, pgonda, fan.du,
	jun.miao, francescolavra.fl, jgross, ira.weiny, isaku.yamahata,
	xiaoyao.li, kai.huang, binbin.wu, chao.p.peng, chao.gao

On Wed, Jan 14, 2026, Yan Zhao wrote:
> On Tue, Jan 13, 2026 at 05:24:36PM -0800, Sean Christopherson wrote:
> > On Wed, Jan 14, 2026, Yan Zhao wrote:
> > > For non-gmem cases, KVM uses the mapping size in the primary MMU as the max
> > > mapping size in the secondary MMU, while the primary MMU does not create a
> > > mapping larger than the backend folio size.
> > 
> > Super strictly speaking, this might not hold true for VM_PFNMAP memory.  E.g. a
> > driver _could_ split a folio (no idea why it would) but map the entire thing into
> > userspace, and then userspace could hand off that memory to KVM.
> > 
> > So I'd say _KVM's_ rule isn't so much "mapping size <= folio size" as
> > "KVM mapping size <= primary MMU mapping size", at least for x86.  Arm's
> > VM_PFNMAP code sketches me out a bit, but on the other hand, a driver mapping
> > discontiguous pages into a single VM_PFNMAP VMA would be even more sketch.
> > 
> > But yes, ignoring VM_PFNMAP, AFAIK the primary MMU and thus KVM doesn't map larger
> > than the folio size.
> 
> Oh. I forgot about the VM_PFNMAP case, which allows folios to be provided as the
> backend. Indeed, a driver can create a huge mapping in the primary MMU for the
> VM_PFNMAP range with multiple discontiguous pages, if it wants.
> 
> But this occurs before KVM creates the mapping. Per my understanding, pages
> under VM_PFNMAP are pinned,

Nope.  Only the driver that owns the VMAs knows what sits behind the PFN and the
lifecycle rules for that memory.

That last point is *very* important.  Even if the PFNs shoved into VM_PFNMAP VMAs
have an associated "struct page", that doesn't mean the "struct page" is refcounted,
i.e. can be pinned.  That detail was the heart of "KVM: Stop grabbing references to
PFNMAP'd pages" overhaul[*].

To _safely_ map VM_PFNMAP into a secondary MMU, i.e. without relying on (privileged)
userspace to "do the right thing", the secondary MMU needs to be tied into
mmu_notifiers, so that modifications to the mappings in the primary MMU are
reflected into the secondary MMU.

[*] https://lore.kernel.org/all/20240726235234.228822-1-seanjc@google.com

> so it looks like there are no splits after they are mapped into the primary MMU.
> 
> So, out of curiosity, do you know why the Linux kernel needs to unmap mappings
> from both the primary and secondary MMUs, and check the folio refcount before
> performing folio splitting?

Because it's a straightforward rule for the primary MMU.  Similar to guest_memfd,
if something is going through the effort of splitting a folio, then odds are very,
very good that the new folios can't be safely mapped as a contiguous hugepage.
Limiting mapping sizes to folios makes the rules/behavior straightforward for core
MM to implement, and for drivers/users to understand.

Again like guest_memfd, there needs to be _some_ way for a driver/filesystem to
communicate the maximum mapping size; folios are the "currency" for doing so.

And then for edge cases that want to map a split folio as a hugepage (if any such
edge cases exist), thus take on the responsibility of managing the lifecycle of
the mappings, VM_PFNMAP and vmf_insert_pfn() provide the necessary functionality.
 
> > > When splitting the backend folio, the Linux kernel unmaps the folio from both
> > > the primary MMU and the KVM-managed secondary MMU (through the MMU notifier).
> > > 
> > > On the non-KVM side, though IOMMU stage-2 mappings are allowed to be larger
> > > than folio sizes, splitting folios while they are still mapped in the IOMMU
> > > stage-2 page table is not permitted due to the extra folio refcount held by the
> > > IOMMU.
> > > 
> > > For gmem cases, KVM also does not create mappings larger than the folio size
> > > allocated from gmem. This is why the TDX huge page series relies on gmem's
> > > ability to allocate huge folios.
> > > 
> > > We really need to be careful if we hope to break this long-established rule.
> > 
> > +100 to being careful, but at the same time I don't think we should get _too_
> > fixated on the guest_memfd folio size.  E.g. similar to VM_PFNMAP, where there
> > might not be a folio, if guest_memfd stopped using folios, then the entire
> > discussion becomes moot.
> > 
> > And as above, the long-standing rule isn't about the implementation details so
> > much as it is about KVM's behavior.  If the simplest solution to support huge
> > guest_memfd pages is to decouple the max order from the folio, then so be it.
> > 
> > That said, I'd very much like to get a sense of the alternatives, because at the
> > end of the day, guest_memfd needs to track the max mapping sizes _somewhere_,
> > and naively, tying that to the folio seems like an easy solution.
> Thanks for the explanation.
> 
> Alternatively, how do you feel about the approach of splitting S-EPT first
> before splitting folios?
> If guest_memfd always splits 1GB folios to 2MB first and only splits the
> converted range to 4KB, splitting S-EPT before splitting folios should not
> introduce too much overhead. Then, we can defer the folio size problem until
> guest_memfd stops using folios.
> 
> If the decision is to stop relying on folios for unmapping now, do you think
> the following changes are reasonable for the TDX huge page series?
> 
> - Add WARN_ON_ONCE() to assert that pages are in a single folio in
>   tdh_mem_page_aug().
> - Do not assert that pages are in a single folio in
>   tdh_phymem_page_wbinvd_hkid(). (or just assert of pfn_valid() for each page?)
>   Could you please give me guidance on
>   https://lore.kernel.org/kvm/aWb16XJuSVuyRu7l@yzhao56-desk.sh.intel.com.
> - Add S-EPT splitting in kvm_gmem_error_folio() and fail on splitting error.

Ok, with the disclaimer that I hadn't actually looked at the patches in this
series before now...

TDX absolutely should not be doing _anything_ with folios.  I am *very* strongly
opposed to TDX assuming that memory is backed by refcounted "struct page", and
thus can use folios to glean the maximum mapping size.

guest_memfd is _the_ owner of that information.  guest_memfd needs to explicitly
_tell_ the rest of KVM what the maximum mapping size is; arch code should not
infer that size from a folio.

And that code+behavior already exists in the form of kvm_gmem_mapping_order() and
its users, _and_ is plumbed all the way into tdx_mem_page_aug() as @level.  IIUC,
the _only_ reason tdx_mem_page_aug() retrieves the page+folio is because
tdx_clflush_page() ultimately requires a "struct page".  That is absolutely
ridiculous and not acceptable.  CLFLUSH takes a virtual address, there is *zero*
reason tdh_mem_page_aug() needs to require/assume a struct page.

Dave may feel differently, but I am not going to budge on this.  I am not going
to bake in assumptions throughout KVM about memory being backed by page+folio.
We _just_ cleaned up that mess in the aforementioned "Stop grabbing references to
PFNMAP'd pages" series, I am NOT reintroducing such assumptions.

NAK to any KVM TDX code that pulls a page or folio out of a guest_memfd pfn.

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [PATCH v3 00/24] KVM: TDX huge page support for private memory
  2026-01-14 15:26                   ` Sean Christopherson
@ 2026-01-14 18:45                     ` Ackerley Tng
  2026-01-15  3:08                       ` Yan Zhao
  2026-01-14 18:56                     ` Dave Hansen
  2026-01-15  1:41                     ` Yan Zhao
  2 siblings, 1 reply; 127+ messages in thread
From: Ackerley Tng @ 2026-01-14 18:45 UTC (permalink / raw)
  To: Sean Christopherson, Yan Zhao
  Cc: Vishal Annapurve, pbonzini, linux-kernel, kvm, x86,
	rick.p.edgecombe, dave.hansen, kas, tabba, michael.roth, david,
	sagis, vbabka, thomas.lendacky, nik.borisov, pgonda, fan.du,
	jun.miao, francescolavra.fl, jgross, ira.weiny, isaku.yamahata,
	xiaoyao.li, kai.huang, binbin.wu, chao.p.peng, chao.gao

Sean Christopherson <seanjc@google.com> writes:

>>
>> [...snip...]
>>
>> > +100 to being careful, but at the same time I don't think we should get _too_
>> > fixated on the guest_memfd folio size.  E.g. similar to VM_PFNMAP, where there
>> > might not be a folio, if guest_memfd stopped using folios, then the entire
>> > discussion becomes moot.

+1, IMO the usage of folios on the guest_memfd <-> KVM boundary
(kvm_gmem_get_pfn()) is transitional, hopefully we get to a point where
guest_memfd will pass KVM pfn, order and no folios.

>> > And as above, the long-standing rule isn't about the implementation details so
>> > much as it is about KVM's behavior.  If the simplest solution to support huge
>> > guest_memfd pages is to decouple the max order from the folio, then so be it.
>> >
>> > That said, I'd very much like to get a sense of the alternatives, because at the
>> > end of the day, guest_memfd needs to track the max mapping sizes _somewhere_,
>> > and naively, tying that to the folio seems like an easy solution.

The upcoming attributes maple tree allows a lookup from guest_memfd
index to contiguous range, so the max mapping size (at least
guest_memfd's contribution to max mapping level, to be augmented by
contribution from lpage_info etc) would be the contiguous range in the
maple tree containing the index, clamped to guest_memfd page size bounds
(both for huge pages and regular PAGE_SIZE pages).

The lookup complexity is mainly the maple tree lookup complexity. This
lookup happens on mapping and on trying to recover to the largest
mapping level, both of which shouldn't happen super often, so I think
this should be pretty good for now.

This max mapping size is currently memoized as folio size with all the
folio splitting work, but memoizing into a folio is expensive (struct
pages/folios are big). Hopefully guest_memfd gets to a point where it
also supports non-struct page backed memory, and that would save us a
bunch more memory.

>>
>> [...snip...]
>>
>> So, out of curiosity, do you know why linux kernel needs to unmap mappings from
>> both primary and secondary MMUs, and check folio refcount before performing
>> folio splitting?
>
> Because it's a straightforward rule for the primary MMU.  Similar to guest_memfd,
> if something is going through the effort of splitting a folio, then odds are very,
> very good that the new folios can't be safely mapped as a contiguous hugepage.
> Limiting mapping sizes to folios makes the rules/behavior straightforward for core
> MM to implement, and for drivers/users to understand.
>
> Again like guest_memfd, there needs to be _some_ way for a driver/filesystem to
> communicate the maximum mapping size; folios are the "currency" for doing so.
>
> And then for edge cases that want to map a split folio as a hugepage (if any such
> edge cases exist), and thus take on the responsibility of managing the lifecycle of
> the mappings, VM_PFNMAP and vmf_insert_pfn() provide the necessary functionality.
>

Here's my understanding, hope it helps: there might also be a
practical/simpler reason to first unmap, then check refcounts, and
then split folios, and guest_memfd kind of does the same thing.

Folio splitting races with lots of other things in the kernel, and the
folio lock isn't super useful because the lock itself is going to be
split up.

Folio splitting wants all users to stop using this folio, so one big
source of users is mappings. Hence, get those mappers (both primary and
secondary MMUs) to unmap.

Core-mm-managed mappings take a refcount, so those refcounts go away. Of
the secondary mmu notifiers, KVM doesn't take a refcount, but KVM does
unmap as requested, so that still falls in line with "stop using this
folio".

I think the refcounting check isn't actually necessary if all users of
folios STOP using the folio on request (via mmu notifiers or
otherwise). Unfortunately, there are users other than mappers. The
best way to find these users is to check the refcount. The refcount
check is asking "how many other users are left?" and if the number of
users is as expected (just the filemap, or whatever else is expected),
then splitting can go ahead, since the splitting code is now confident
the remaining users won't try and use the folio metadata while splitting
is happening.


guest_memfd does a modified version of that on shared to private
conversions. guest_memfd will unmap from host userspace page tables for
the same reason, mainly to tell all the host users to unmap. The
unmapping also triggers mmu notifiers so the stage 2 mappings also go
away (TBD if this should be skipped) and this is okay because they're
shared pages. guest usage will just map them back in on any failure and
it doesn't break guests.

At this point all the mappers are gone, then guest_memfd checks
refcounts to make sure that guest_memfd itself is the only user of the
folio. If the refcount is as expected, guest_memfd is confident to
continue with splitting folios, since other folio accesses will be
locked out by the filemap invalidate lock.

The one main guest_memfd folio user that won't go away on an unmap call
is if the folios get pinned for IOMMU access. In this case, guest_memfd
fails the conversion and returns an error to userspace so userspace can
sort out the IOMMU unpinning.


As for private to shared conversions, folio merging would require the
same thing: that nobody else is using the folios (the folio
metadata). guest_memfd skips that check because for private memory, KVM
is the only other user, and guest_memfd knows KVM doesn't use folio
metadata once the memory is mapped for the guest.

>>
>> [...snip...]
>>

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [PATCH v3 00/24] KVM: TDX huge page support for private memory
  2026-01-14 15:26                   ` Sean Christopherson
  2026-01-14 18:45                     ` Ackerley Tng
@ 2026-01-14 18:56                     ` Dave Hansen
  2026-01-15  0:19                       ` Sean Christopherson
  2026-01-15  1:41                     ` Yan Zhao
  2 siblings, 1 reply; 127+ messages in thread
From: Dave Hansen @ 2026-01-14 18:56 UTC (permalink / raw)
  To: Sean Christopherson, Yan Zhao
  Cc: Ackerley Tng, Vishal Annapurve, pbonzini, linux-kernel, kvm, x86,
	rick.p.edgecombe, kas, tabba, michael.roth, david, sagis, vbabka,
	thomas.lendacky, nik.borisov, pgonda, fan.du, jun.miao,
	francescolavra.fl, jgross, ira.weiny, isaku.yamahata, xiaoyao.li,
	kai.huang, binbin.wu, chao.p.peng, chao.gao

On 1/14/26 07:26, Sean Christopherson wrote:
...
> Dave may feel differently, but I am not going to budge on this.  I am not going
> to bake in assumptions throughout KVM about memory being backed by page+folio.
> We _just_ cleaned up that mess in the aforementioned "Stop grabbing references to
> PFNMAP'd pages" series, I am NOT reintroducing such assumptions.
> 
> NAK to any KVM TDX code that pulls a page or folio out of a guest_memfd pfn.

'struct page' gives us two things: One is the type safety, but I'm
pretty flexible on how that's implemented as long as it's not a raw u64
getting passed around everywhere.

The second thing is a (near) guarantee that the backing memory is RAM.
Not only RAM, but RAM that the TDX module knows about and has a PAMT and
TDMR and all that TDX jazz.

We've also done things like stopping memory hotplug because you can't
amend TDX page metadata at runtime. So we prevent new 'struct pages'
from coming into existence. So 'struct page' is a quite useful choke
point for TDX.

I'd love to hear more about how guest_memfd is going to tie all the
pieces together and give the same straightforward guarantees without
leaning on the core mm the same way we do now.

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [PATCH v3 00/24] KVM: TDX huge page support for private memory
  2026-01-14 18:56                     ` Dave Hansen
@ 2026-01-15  0:19                       ` Sean Christopherson
  2026-01-16 15:45                         ` Edgecombe, Rick P
  2026-01-16 16:57                         ` Dave Hansen
  0 siblings, 2 replies; 127+ messages in thread
From: Sean Christopherson @ 2026-01-15  0:19 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Yan Zhao, Ackerley Tng, Vishal Annapurve, pbonzini, linux-kernel,
	kvm, x86, rick.p.edgecombe, kas, tabba, michael.roth, david,
	sagis, vbabka, thomas.lendacky, nik.borisov, pgonda, fan.du,
	jun.miao, francescolavra.fl, jgross, ira.weiny, isaku.yamahata,
	xiaoyao.li, kai.huang, binbin.wu, chao.p.peng, chao.gao

On Wed, Jan 14, 2026, Dave Hansen wrote:
> On 1/14/26 07:26, Sean Christopherson wrote:
> ...
> > Dave may feel differently, but I am not going to budge on this.  I am not going
> > to bake in assumptions throughout KVM about memory being backed by page+folio.
> > We _just_ cleaned up that mess in the aformentioned "Stop grabbing references to
> > PFNMAP'd pages" series, I am NOT reintroducing such assumptions.
> > 
> > NAK to any KVM TDX code that pulls a page or folio out of a guest_memfd pfn.
> 
> 'struct page' gives us two things: One is the type safety, but I'm
> pretty flexible on how that's implemented as long as it's not a raw u64
> getting passed around everywhere.

I don't necessarily disagree on the type safety front, but for the specific code
in question, any type safety is a facade.  Everything leading up to the TDX code
is dealing with raw PFNs and/or PTEs.  Then the TDX code assumes that the PFN
being mapped into the guest is backed by a struct page, and that the folio size
is consistent with @level, without _any_ checks whatsoever.  This is providing
the exact opposite of safety.

  static int tdx_mem_page_aug(struct kvm *kvm, gfn_t gfn,
			    enum pg_level level, kvm_pfn_t pfn)
  {
	int tdx_level = pg_level_to_tdx_sept_level(level);
	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
	struct page *page = pfn_to_page(pfn);    <==================
	struct folio *folio = page_folio(page);
	gpa_t gpa = gfn_to_gpa(gfn);
	u64 entry, level_state;
	u64 err;

	err = tdh_mem_page_aug(&kvm_tdx->td, gpa, tdx_level, folio,
			       folio_page_idx(folio, page), &entry, &level_state);

	...
  }

I've no objection if e.g. tdh_mem_page_aug() wants to sanity check that a PFN
is backed by a struct page with a valid refcount, it's code like that above that
I don't want.

> The second thing is a (near) guarantee that the backing memory is RAM.
> Not only RAM, but RAM that the TDX module knows about and has a PAMT and
> TDMR and all that TDX jazz.

I'm not at all opposed to backing guest_memfd with "struct page", quite the
opposite.  What I don't want is to bake assumptions into KVM code that doesn't
_require_ struct page, because that has caused KVM immense pain in the past.

And I'm strongly opposed to KVM special-casing TDX or anything else, precisely
because we struggled through all that pain so that KVM would work better with
memory that isn't backed by "struct page", or more specifically, memory that has
an associated "struct page", but isn't managed by core MM, e.g. isn't refcounted.

> We've also done things like stopping memory hotplug because you can't
> amend TDX page metadata at runtime. So we prevent new 'struct pages'
> from coming into existence. So 'struct page' is a quite useful choke
> point for TDX.
> 
> I'd love to hear more about how guest_memfd is going to tie all the
> pieces together and give the same straightforward guarantees without
> leaning on the core mm the same way we do now.

I don't think guest_memfd needs to be different, and that's not what I'm advocating.
What I don't want is to make KVM TDX's handling of memory different from the rest
of KVM and KVM's MMU(s).

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [PATCH v3 00/24] KVM: TDX huge page support for private memory
  2026-01-14 15:26                   ` Sean Christopherson
  2026-01-14 18:45                     ` Ackerley Tng
  2026-01-14 18:56                     ` Dave Hansen
@ 2026-01-15  1:41                     ` Yan Zhao
  2026-01-15 16:26                       ` Sean Christopherson
  2 siblings, 1 reply; 127+ messages in thread
From: Yan Zhao @ 2026-01-15  1:41 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Ackerley Tng, Vishal Annapurve, pbonzini, linux-kernel, kvm, x86,
	rick.p.edgecombe, dave.hansen, kas, tabba, michael.roth, david,
	sagis, vbabka, thomas.lendacky, nik.borisov, pgonda, fan.du,
	jun.miao, francescolavra.fl, jgross, ira.weiny, isaku.yamahata,
	xiaoyao.li, kai.huang, binbin.wu, chao.p.peng, chao.gao

On Wed, Jan 14, 2026 at 07:26:44AM -0800, Sean Christopherson wrote:
> On Wed, Jan 14, 2026, Yan Zhao wrote:
> > On Tue, Jan 13, 2026 at 05:24:36PM -0800, Sean Christopherson wrote:
> > > On Wed, Jan 14, 2026, Yan Zhao wrote:
> > > > For non-gmem cases, KVM uses the mapping size in the primary MMU as the max
> > > > mapping size in the secondary MMU, while the primary MMU does not create a
> > > > mapping larger than the backend folio size.
> > > 
> > > Super strictly speaking, this might not hold true for VM_PFNMAP memory.  E.g. a
> > > driver _could_ split a folio (no idea why it would) but map the entire thing into
> > > userspace, and then userspace could hand off that memory to KVM.
> > > 
> > > So _KVM's_ rule isn't so much "mapping size <= folio size", it's
> > > that "KVM mapping size <= primary MMU mapping size", at least for x86.  Arm's
> > > VM_PFNMAP code sketches me out a bit, but on the other hand, a driver mapping
> > > discontiguous pages into a single VM_PFNMAP VMA would be even more sketch.
> > > 
> > > But yes, ignoring VM_PFNMAP, AFAIK the primary MMU and thus KVM doesn't map larger
> > > than the folio size.
> > 
> > Oh. I forgot about the VM_PFNMAP case, which allows to provide folios as the
> > backend. Indeed, a driver can create a huge mapping in primary MMU for the
> > VM_PFNMAP range with multiple discontiguous pages, if it wants.
> > 
> > But this occurs before KVM creates the mapping. Per my understanding, pages
> > under VM_PFNMAP are pinned,
> 
> Nope.  Only the driver that owns the VMAs knows what sits behind the PFN and the
> lifecycle rules for that memory.
> 
> That last point is *very* important.  Even if the PFNs shoved into VM_PFNMAP VMAs
> have an associated "struct page", that doesn't mean the "struct page" is refcounted,
> i.e. can be pinned.  That detail was the heart of "KVM: Stop grabbing references to
> PFNMAP'd pages" overhaul[*].
> 
> To _safely_ map VM_PFNMAP into a secondary MMU, i.e. without relying on (privileged)
> userspace to "do the right thing", the secondary MMU needs to be tied into
> mmu_notifiers, so that modifications to the mappings in the primary MMU are
> reflected into the secondary MMU.

You are right! It maps a tail page of a !compound huge page, which is not
refcounted.

> [*] https://lore.kernel.org/all/20240726235234.228822-1-seanjc@google.com
> 
> > so it looks like there're no splits after they are mapped into the primary MMU.
> > 
> > So, out of curiosity, do you know why linux kernel needs to unmap mappings from
> > both primary and secondary MMUs, and check folio refcount before performing
> > folio splitting?
> 
> Because it's a straightforward rule for the primary MMU.  Similar to guest_memfd,
> if something is going through the effort of splitting a folio, then odds are very,
> very good that the new folios can't be safely mapped as a contiguous hugepage.
> Limiting mapping sizes to folios makes the rules/behavior straightforward for core
> MM to implement, and for drivers/users to understand.
> 
> Again like guest_memfd, there needs to be _some_ way for a driver/filesystem to
> communicate the maximum mapping size; folios are the "currency" for doing so.
> 
> And then for edge cases that want to map a split folio as a hugepage (if any such
> edge cases exist), and thus take on the responsibility of managing the lifecycle of
> the mappings, VM_PFNMAP and vmf_insert_pfn() provide the necessary functionality.

Thanks for the explanation.

> > > > When splitting the backend folio, the Linux kernel unmaps the folio from both
> > > > the primary MMU and the KVM-managed secondary MMU (through the MMU notifier).
> > > > 
> > > > On the non-KVM side, though IOMMU stage-2 mappings are allowed to be larger
> > > > than folio sizes, splitting folios while they are still mapped in the IOMMU
> > > > stage-2 page table is not permitted due to the extra folio refcount held by the
> > > > IOMMU.
> > > > 
> > > > For gmem cases, KVM also does not create mappings larger than the folio size
> > > > allocated from gmem. This is why the TDX huge page series relies on gmem's
> > > > ability to allocate huge folios.
> > > > 
> > > > We really need to be careful if we hope to break this long-established rule.
> > > 
> > > +100 to being careful, but at the same time I don't think we should get _too_
> > > fixated on the guest_memfd folio size.  E.g. similar to VM_PFNMAP, where there
> > > might not be a folio, if guest_memfd stopped using folios, then the entire
> > > discussion becomes moot.
> > > 
> > > And as above, the long-standing rule isn't about the implementation details so
> > > much as it is about KVM's behavior.  If the simplest solution to support huge
> > > guest_memfd pages is to decouple the max order from the folio, then so be it.
> > > 
> > > That said, I'd very much like to get a sense of the alternatives, because at the
> > > end of the day, guest_memfd needs to track the max mapping sizes _somewhere_,
> > > and naively, tying that to the folio seems like an easy solution.
> > Thanks for the explanation.
> > 
> > Alternatively, how do you feel about the approach of splitting S-EPT first
> > before splitting folios?
> > If guest_memfd always splits 1GB folios to 2MB first and only splits the
> > converted range to 4KB, splitting S-EPT before splitting folios should not
> > introduce too much overhead. Then, we can defer the folio size problem until
> > guest_memfd stops using folios.
> > 
> > If the decision is to stop relying on folios for unmapping now, do you think
> > the following changes are reasonable for the TDX huge page series?
> > 
> > - Add WARN_ON_ONCE() to assert that pages are in a single folio in
> >   tdh_mem_page_aug().
> > - Do not assert that pages are in a single folio in
> >   tdh_phymem_page_wbinvd_hkid(). (or just assert of pfn_valid() for each page?)
> >   Could you please give me guidance on
> >   https://lore.kernel.org/kvm/aWb16XJuSVuyRu7l@yzhao56-desk.sh.intel.com.
> > - Add S-EPT splitting in kvm_gmem_error_folio() and fail on splitting error.
> 
> Ok, with the disclaimer that I hadn't actually looked at the patches in this
> series before now...
> 
> TDX absolutely should not be doing _anything_ with folios.  I am *very* strongly
> opposed to TDX assuming that memory is backed by refcounted "struct page", and
> thus can use folios to glean the maximum mapping size.
> 
> guest_memfd is _the_ owner of that information.  guest_memfd needs to explicitly
> _tell_ the rest of KVM what the maximum mapping size is; arch code should not
> infer that size from a folio.
> 
> And that code+behavior already exists in the form of kvm_gmem_mapping_order() and
> its users, _and_ is plumbed all the way into tdx_mem_page_aug() as @level.  IIUC,
> the _only_ reason tdx_mem_page_aug() retrieves the page+folio is because
> tdx_clflush_page() ultimately requires a "struct page".  That is absolutely
> ridiculous and not acceptable.  CLFLUSH takes a virtual address, there is *zero*
> reason tdh_mem_page_aug() needs to require/assume a struct page.
Not really.

Per my understanding, tdx_mem_page_aug() requires "struct page" (and checks
folios for huge pages) because the SEAMCALL wrapper APIs are not currently built
into KVM. Since they may have callers other than KVM, some sanity checking
seems necessary in case the caller does something incorrect (e.g., provides an
out-of-range struct page or a page whose PFN is !pfn_valid()).
This is similar to "VM_WARN_ON_ONCE_FOLIO(!folio_test_large(folio), folio)" in
__folio_split().

With tdx_mem_page_aug() ensuring page validity and contiguity, invoking the
local static function tdx_clflush_page() page-by-page looks good to me.
Alternatively, we could convert tdx_clflush_page() to tdx_clflush_cache_range(),
which receives a VA.

However, I'm not sure if my understanding is correct now, especially since it
seems like everyone thinks the SEAMCALL wrapper APIs should trust the caller,
assuming they are KVM-specific.

> Dave may feel differently, but I am not going to budge on this.  I am not going
> to bake in assumptions throughout KVM about memory being backed by page+folio.
> We _just_ cleaned up that mess in the aforementioned "Stop grabbing references to
> PFNMAP'd pages" series, I am NOT reintroducing such assumptions.
> 
> NAK to any KVM TDX code that pulls a page or folio out of a guest_memfd pfn.

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [PATCH v3 00/24] KVM: TDX huge page support for private memory
  2026-01-14 18:45                     ` Ackerley Tng
@ 2026-01-15  3:08                       ` Yan Zhao
  2026-01-15 18:13                         ` Ackerley Tng
  0 siblings, 1 reply; 127+ messages in thread
From: Yan Zhao @ 2026-01-15  3:08 UTC (permalink / raw)
  To: Ackerley Tng
  Cc: Sean Christopherson, Vishal Annapurve, pbonzini, linux-kernel,
	kvm, x86, rick.p.edgecombe, dave.hansen, kas, tabba, michael.roth,
	david, sagis, vbabka, thomas.lendacky, nik.borisov, pgonda,
	fan.du, jun.miao, francescolavra.fl, jgross, ira.weiny,
	isaku.yamahata, xiaoyao.li, kai.huang, binbin.wu, chao.p.peng,
	chao.gao

On Wed, Jan 14, 2026 at 10:45:32AM -0800, Ackerley Tng wrote:
> Sean Christopherson <seanjc@google.com> writes:
> >> So, out of curiosity, do you know why linux kernel needs to unmap mappings from
> >> both primary and secondary MMUs, and check folio refcount before performing
> >> folio splitting?
> >
> > Because it's a straightforward rule for the primary MMU.  Similar to guest_memfd,
> > if something is going through the effort of splitting a folio, then odds are very,
> > very good that the new folios can't be safely mapped as a contiguous hugepage.
> > Limiting mapping sizes to folios makes the rules/behavior straightforward for core
> > MM to implement, and for drivers/users to understand.
> >
> > Again like guest_memfd, there needs to be _some_ way for a driver/filesystem to
> > communicate the maximum mapping size; folios are the "currency" for doing so.
> >
> > And then for edge cases that want to map a split folio as a hugepage (if any such
> > edge cases exist), and thus take on the responsibility of managing the lifecycle of
> > the mappings, VM_PFNMAP and vmf_insert_pfn() provide the necessary functionality.
> >
> 
> Here's my understanding, hope it helps: there might also be a
> practical/simpler reason to first unmap, then check refcounts, and
> then split folios, and guest_memfd kind of does the same thing.
> 
> Folio splitting races with lots of other things in the kernel, and the
> folio lock isn't super useful because the lock itself is going to be
> split up.
> 
> Folio splitting wants all users to stop using this folio, so one big
> source of users is mappings. Hence, get those mappers (both primary and
> secondary MMUs) to unmap.
> 
> Core-mm-managed mappings take a refcount, so those refcounts go away. Of
> the secondary mmu notifiers, KVM doesn't take a refcount, but KVM does
> unmap as requested, so that still falls in line with "stop using this
> folio".
> 
> I think the refcounting check isn't actually necessary if all users of
> folios STOP using the folio on request (via mmu notifiers or
> otherwise). Unfortunately, there are users other than mappers. The
> best way to find these users is to check the refcount. The refcount
> check is asking "how many other users are left?" and if the number of
> users is as expected (just the filemap, or whatever else is expected),
> then splitting can go ahead, since the splitting code is now confident
> the remaining users won't try and use the folio metadata while splitting
> is happening.
> 
> 
> guest_memfd does a modified version of that on shared to private
> conversions. guest_memfd will unmap from host userspace page tables for
> the same reason, mainly to tell all the host users to unmap. The
> unmapping also triggers mmu notifiers so the stage 2 mappings also go
> away (TBD if this should be skipped) and this is okay because they're
> shared pages. guest usage will just map them back in on any failure and
> it doesn't break guests.
> 
> At this point all the mappers are gone, then guest_memfd checks
> refcounts to make sure that guest_memfd itself is the only user of the
> folio. If the refcount is as expected, guest_memfd is confident to
> continue with splitting folios, since other folio accesses will be
> locked out by the filemap invalidate lock.
> 
> The one main guest_memfd folio user that won't go away on an unmap call
> is if the folios get pinned for IOMMU access. In this case, guest_memfd
> fails the conversion and returns an error to userspace so userspace can
> sort out the IOMMU unpinning.
> 
> 
> As for private to shared conversions, folio merging would require the
> same thing: that nobody else is using the folios (the folio
> metadata). guest_memfd skips that check because for private memory, KVM
> is the only other user, and guest_memfd knows KVM doesn't use folio
> metadata once the memory is mapped for the guest.
Ok. That makes sense. Thanks for the explanation.
It looks like guest_memfd also rules out concurrent folio metadata access by
holding the filemap_invalidate_lock.

BTW: Could that potentially cause guest soft lockup due to holding the
filemap_invalidate_lock for too long?

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [PATCH v3 11/24] KVM: x86/mmu: Introduce kvm_split_cross_boundary_leafs()
  2026-01-06 10:21 ` [PATCH v3 11/24] KVM: x86/mmu: Introduce kvm_split_cross_boundary_leafs() Yan Zhao
@ 2026-01-15 12:25   ` Huang, Kai
  2026-01-16 23:39     ` Sean Christopherson
  0 siblings, 1 reply; 127+ messages in thread
From: Huang, Kai @ 2026-01-15 12:25 UTC (permalink / raw)
  To: pbonzini@redhat.com, seanjc@google.com, Zhao, Yan Y
  Cc: kvm@vger.kernel.org, Du, Fan, Li, Xiaoyao, Gao, Chao,
	Hansen, Dave, thomas.lendacky@amd.com, vbabka@suse.cz,
	tabba@google.com, david@kernel.org, kas@kernel.org,
	michael.roth@amd.com, Weiny, Ira, linux-kernel@vger.kernel.org,
	binbin.wu@linux.intel.com, ackerleytng@google.com,
	nik.borisov@suse.com, Yamahata, Isaku, Peng, Chao P,
	francescolavra.fl@gmail.com, sagis@google.com, Annapurve, Vishal,
	Edgecombe, Rick P, Miao, Jun, jgross@suse.com, pgonda@google.com,
	x86@kernel.org

On Tue, 2026-01-06 at 18:21 +0800, Yan Zhao wrote:
> @@ -1692,12 +1707,35 @@ void kvm_tdp_mmu_try_split_huge_pages(struct kvm *kvm,
>  
>  	kvm_lockdep_assert_mmu_lock_held(kvm, shared);
>  	for_each_valid_tdp_mmu_root_yield_safe(kvm, root, slot->as_id) {
> -		r = tdp_mmu_split_huge_pages_root(kvm, root, start, end, target_level, shared);
> +		r = tdp_mmu_split_huge_pages_root(kvm, root, start, end, target_level,
> +						  shared, false);
> +		if (r) {
> +			kvm_tdp_mmu_put_root(kvm, root);
> +			break;
> +		}
> +	}
> +}
> +
> +int kvm_tdp_mmu_gfn_range_split_cross_boundary_leafs(struct kvm *kvm,
> +						     struct kvm_gfn_range *range,
> +						     bool shared)
> +{
> +	enum kvm_tdp_mmu_root_types types;
> +	struct kvm_mmu_page *root;
> +	int r = 0;
> +
> +	kvm_lockdep_assert_mmu_lock_held(kvm, shared);
> +	types = kvm_gfn_range_filter_to_root_types(kvm, range->attr_filter);
> +
> +	__for_each_tdp_mmu_root_yield_safe(kvm, root, range->slot->as_id, types) {
> +		r = tdp_mmu_split_huge_pages_root(kvm, root, range->start, range->end,
> +						  PG_LEVEL_4K, shared, true);
>  		if (r) {
>  			kvm_tdp_mmu_put_root(kvm, root);
>  			break;
>  		}
>  	}
> +	return r;
>  }
>  

Seems the two functions -- kvm_tdp_mmu_try_split_huge_pages() and
kvm_tdp_mmu_gfn_range_split_cross_boundary_leafs() -- are almost
identical.  Is it better to introduce a helper and make the two just
wrappers?

E.g.,

static int __kvm_tdp_mmu_split_huge_pages(struct kvm *kvm, 
					  struct kvm_gfn_range *range,
					  int target_level,
					  bool shared,
					  bool cross_boundary_only)
{
	...
}

And by using this helper, I found the names of the two wrapper functions
are not ideal:

kvm_tdp_mmu_try_split_huge_pages() is only for dirty logging, and it should
not be reachable for a TD (VM with mirrored PT).  But currently it uses
KVM_VALID_ROOTS as the root filter, thus the mirrored PT is also included.  I
think it's better to rename it, e.g., at least with "log_dirty" in the
name, so it's clearer that this function only deals with dirty logging (at
least currently).  We can also add a WARN() if it's called for a VM with
mirrored PT, but that's a different topic.

kvm_tdp_mmu_gfn_range_split_cross_boundary_leafs() doesn't have
"huge_pages", which isn't consistent with the other.  And it is a bit
long.  If we don't have "gfn_range" in __kvm_tdp_mmu_split_huge_pages(),
then I think we can remove "gfn_range" from
kvm_tdp_mmu_gfn_range_split_cross_boundary_leafs() too to make it shorter.

So how about:

Rename kvm_tdp_mmu_try_split_huge_pages() to
kvm_tdp_mmu_split_huge_pages_log_dirty(), and rename
kvm_tdp_mmu_gfn_range_split_cross_boundary_leafs() to
kvm_tdp_mmu_split_huge_pages_cross_boundary()

?

E.g.,:

int kvm_tdp_mmu_split_huge_pages_log_dirty(struct kvm *kvm, 
					   const struct kvm_memory_slot *slot,
				    	   gfn_t start, gfn_t end,
					   int target_level, bool shared)
{
	struct kvm_gfn_range range = {
		.slot		= slot,
		.start		= start,
		.end		= end,
		.attr_filter	= 0, /* doesn't matter */
		.may_block	= true,
	};

	if (WARN_ON_ONCE(kvm_has_mirrored_tdp(kvm)))
		return -EINVAL;

	return __kvm_tdp_mmu_split_huge_pages(kvm, &range, target_level,
					      shared, false);
}

int kvm_tdp_mmu_split_huge_pages_cross_boundary(struct kvm *kvm,
					struct kvm_gfn_range *range,
					int target_level,
					bool shared)
{
	return __kvm_tdp_mmu_split_huge_pages(kvm, range, target_level,
					      shared, true);
}

Anything I missed?

And one more minor thing:

With that, I think you can move the range->may_block check from
kvm_split_cross_boundary_leafs() to the __kvm_tdp_mmu_split_huge_pages()
common helper:

	if (!range->may_block)
		return -EOPNOTSUPP;
					

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [PATCH v3 00/24] KVM: TDX huge page support for private memory
  2026-01-15  1:41                     ` Yan Zhao
@ 2026-01-15 16:26                       ` Sean Christopherson
  0 siblings, 0 replies; 127+ messages in thread
From: Sean Christopherson @ 2026-01-15 16:26 UTC (permalink / raw)
  To: Yan Zhao
  Cc: Ackerley Tng, Vishal Annapurve, pbonzini, linux-kernel, kvm, x86,
	rick.p.edgecombe, dave.hansen, kas, tabba, michael.roth, david,
	sagis, vbabka, thomas.lendacky, nik.borisov, pgonda, fan.du,
	jun.miao, francescolavra.fl, jgross, ira.weiny, isaku.yamahata,
	xiaoyao.li, kai.huang, binbin.wu, chao.p.peng, chao.gao

On Thu, Jan 15, 2026, Yan Zhao wrote:
> On Wed, Jan 14, 2026 at 07:26:44AM -0800, Sean Christopherson wrote:
> > Ok, with the disclaimer that I hadn't actually looked at the patches in this
> > series before now...
> > 
> > TDX absolutely should not be doing _anything_ with folios.  I am *very* strongly
> > opposed to TDX assuming that memory is backed by refcounted "struct page", and
> > thus can use folios to glean the maximum mapping size.
> > 
> > guest_memfd is _the_ owner of that information.  guest_memfd needs to explicitly
> > _tell_ the rest of KVM what the maximum mapping size is; arch code should not
> > infer that size from a folio.
> > 
> > And that code+behavior already exists in the form of kvm_gmem_mapping_order() and
> > its users, _and_ is plumbed all the way into tdx_mem_page_aug() as @level.  IIUC,
> > the _only_ reason tdx_mem_page_aug() retrieves the page+folio is because
> > tdx_clflush_page() ultimately requires a "struct page".  That is absolutely
> > ridiculous and not acceptable.  CLFLUSH takes a virtual address, there is *zero*
> > reason tdh_mem_page_aug() needs to require/assume a struct page.
> Not really.
> 
> Per my understanding, tdx_mem_page_aug() requires "struct page" (and checks
> folios for huge pages) because the SEAMCALL wrapper APIs are not currently built
> into KVM. Since they may have callers other than KVM, some sanity checking in
> case the caller does something incorrect seems necessary (e.g., in case the
> caller provides an out-of-range struct page or a page with !pfn_valid() PFN).

As I mentioned in my reply to Dave, I don't object to reasonable sanity checks.

> This is similar to "VM_WARN_ON_ONCE_FOLIO(!folio_test_large(folio), folio)" in
> __folio_split().

No, it's not.  __folio_split() is verifying that the input matches the one
exact thing it's being asked to do: splitting a huge folio.

TDX requiring guest_memfd to back everything with struct page, and to only use
single, huge folios to map hugepages into the guest, is making completely
unnecessary assumptions about guest_memfd and KVM MMU implementation details.

> With tdx_mem_page_aug() ensuring pages validity and contiguity,

It absolutely does not.

 - If guest_memfd unmaps the direct map[*], CLFLUSH will fault and panic the
   kernel.
 - If the PFN isn't backed by struct page, tdx_mem_page_aug() will hit a NULL
   pointer deref.
 - If the PFN is backed by a struct page, but the page is managed by something
   other than guest_memfd or core MM, all bets are off.

[*] https://lore.kernel.org/all/20260114134510.1835-1-kalyazin@amazon.com

> invoking local static function tdx_clflush_page() page-per-page looks good to
> me.  Alternatively, we could convert tdx_clflush_page() to
> tdx_clflush_cache_range(), which receives VA.
> 
> However, I'm not sure if my understanding is correct now, especially since it
> seems like everyone thinks the SEAMCALL wrapper APIs should trust the caller,
> assuming they are KVM-specific.

It's all kernel code.  Implying that KVM is somehow untrusted is absurd.

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [PATCH v3 00/24] KVM: TDX huge page support for private memory
  2026-01-15  3:08                       ` Yan Zhao
@ 2026-01-15 18:13                         ` Ackerley Tng
  0 siblings, 0 replies; 127+ messages in thread
From: Ackerley Tng @ 2026-01-15 18:13 UTC (permalink / raw)
  To: Yan Zhao
  Cc: Sean Christopherson, Vishal Annapurve, pbonzini, linux-kernel,
	kvm, x86, rick.p.edgecombe, dave.hansen, kas, tabba, michael.roth,
	david, sagis, vbabka, thomas.lendacky, nik.borisov, pgonda,
	fan.du, jun.miao, francescolavra.fl, jgross, ira.weiny,
	isaku.yamahata, xiaoyao.li, kai.huang, binbin.wu, chao.p.peng,
	chao.gao

Yan Zhao <yan.y.zhao@intel.com> writes:

> On Wed, Jan 14, 2026 at 10:45:32AM -0800, Ackerley Tng wrote:
>> Sean Christopherson <seanjc@google.com> writes:
>> >> So, out of curiosity, do you know why linux kernel needs to unmap mappings from
>> >> both primary and secondary MMUs, and check folio refcount before performing
>> >> folio splitting?
>> >
>> > Because it's a straightforward rule for the primary MMU.  Similar to guest_memfd,
>> > if something is going through the effort of splitting a folio, then odds are very,
>> > very good that the new folios can't be safely mapped as a contiguous hugepage.
>> > Limiting mapping sizes to folios makes the rules/behavior straightforward for core
>> > MM to implement, and for drivers/users to understand.
>> >
>> > Again like guest_memfd, there needs to be _some_ way for a driver/filesystem to
>> > communicate the maximum mapping size; folios are the "currency" for doing so.
>> >
>> > And then for edge cases that want to map a split folio as a hugepage (if any such
>> > edge cases exist), thus take on the responsibility of managing the lifecycle of
>> > the mappings, VM_PFNMAP and vmf_insert_pfn() provide the necessary functionality.
>> >
>>
>> Here's my understanding, hope it helps: there might also be a
>> practical/simpler reason for first unmapping then check refcounts, and
>> then splitting folios, and guest_memfd kind of does the same thing.
>>
>> Folio splitting races with lots of other things in the kernel, and the
>> folio lock isn't super useful because the lock itself is going to be
>> split up.
>>
>> Folio splitting wants all users to stop using this folio, so one big
>> source of users is mappings. Hence, get those mappers (both primary and
>> secondary MMUs) to unmap.
>>
>> Core-mm-managed mappings take a refcount, so those refcounts go away. Of
>> the secondary mmu notifiers, KVM doesn't take a refcount, but KVM does
>> unmap as requested, so that still falls in line with "stop using this
>> folio".
>>
>> I think the refcounting check isn't actually necessary if all users of
>> folios STOP using the folio on request (via mmu notifiers or
>> otherwise). Unfortunately, there are users other than mappers. The
>> best way to find these users is to check the refcount. The refcount
>> check is asking "how many other users are left?" and if the number of
>> users is as expected (just the filemap, or whatever else is expected),
>> then splitting can go ahead, since the splitting code is now confident
>> the remaining users won't try and use the folio metadata while splitting
>> is happening.
>>
>>
>> guest_memfd does a modified version of that on shared to private
>> conversions. guest_memfd will unmap from host userspace page tables for
>> the same reason, mainly to tell all the host users to unmap. The
>> unmapping also triggers mmu notifiers so the stage 2 mappings also go
>> away (TBD if this should be skipped) and this is okay because they're
>> shared pages. guest usage will just map them back in on any failure and
>> it doesn't break guests.
>>
>> At this point all the mappers are gone, then guest_memfd checks
>> refcounts to make sure that guest_memfd itself is the only user of the
>> folio. If the refcount is as expected, guest_memfd is confident to
>> continue with splitting folios, since other folio accesses will be
>> locked out by the filemap invalidate lock.
>>
>> The one main guest_memfd folio user that won't go away on an unmap call
>> is if the folios get pinned for IOMMU access. In this case, guest_memfd
>> fails the conversion and returns an error to userspace so userspace can
>> sort out the IOMMU unpinning.
>>
>>
>> As for private to shared conversions, folio merging would require the
>> same thing that nobody else is using the folios (the folio
>> metadata). guest_memfd skips that check because for private memory, KVM
>> is the only other user, and guest_memfd knows KVM doesn't use folio
>> metadata once the memory is mapped for the guest.
> Ok. That makes sense. Thanks for the explanation.
> It looks like guest_memfd also rules out concurrent folio metadata access by
> holding the filemap_invalidate_lock.
>
> BTW: Could that potentially cause guest soft lockup due to holding the
> filemap_invalidate_lock for too long?

Yes, potentially. You mean because the vCPUs are all blocked on page
faults, right? We can definitely optimize later, perhaps lock by
guest_memfd index ranges.

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [PATCH v3 06/24] KVM: x86/mmu: Disallow page merging (huge page adjustment) for mirror root
  2026-01-06 10:20 ` [PATCH v3 06/24] KVM: x86/mmu: Disallow page merging (huge page adjustment) for mirror root Yan Zhao
@ 2026-01-15 22:49   ` Sean Christopherson
  2026-01-16  7:54     ` Yan Zhao
  0 siblings, 1 reply; 127+ messages in thread
From: Sean Christopherson @ 2026-01-15 22:49 UTC (permalink / raw)
  To: Yan Zhao
  Cc: pbonzini, linux-kernel, kvm, x86, rick.p.edgecombe, dave.hansen,
	kas, tabba, ackerleytng, michael.roth, david, vannapurve, sagis,
	vbabka, thomas.lendacky, nik.borisov, pgonda, fan.du, jun.miao,
	francescolavra.fl, jgross, ira.weiny, isaku.yamahata, xiaoyao.li,
	kai.huang, binbin.wu, chao.p.peng, chao.gao

On Tue, Jan 06, 2026, Yan Zhao wrote:
> From: Rick P Edgecombe <rick.p.edgecombe@intel.com>
> 
> Disallow page merging (huge page adjustment) for the mirror root by
> utilizing disallowed_hugepage_adjust().

Why?  What is this actually doing?  The below explains "how" but I'm baffled as
to the purpose.  I'm guessing there are hints in the surrounding patches, but I
haven't read them in depth, and shouldn't need to in order to understand the
primary reason behind a change.

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [PATCH v3 14/24] KVM: Change the return type of gfn_handler_t() from bool to int
  2026-01-06 10:22 ` [PATCH v3 14/24] KVM: Change the return type of gfn_handler_t() from bool to int Yan Zhao
@ 2026-01-16  0:21   ` Sean Christopherson
  2026-01-16  6:42     ` Yan Zhao
  0 siblings, 1 reply; 127+ messages in thread
From: Sean Christopherson @ 2026-01-16  0:21 UTC (permalink / raw)
  To: Yan Zhao
  Cc: pbonzini, linux-kernel, kvm, x86, rick.p.edgecombe, dave.hansen,
	kas, tabba, ackerleytng, michael.roth, david, vannapurve, sagis,
	vbabka, thomas.lendacky, nik.borisov, pgonda, fan.du, jun.miao,
	francescolavra.fl, jgross, ira.weiny, isaku.yamahata, xiaoyao.li,
	kai.huang, binbin.wu, chao.p.peng, chao.gao

On Tue, Jan 06, 2026, Yan Zhao wrote:
> Modify the return type of gfn_handler_t() from bool to int. A negative
> return value indicates failure, while a return value of 1 signifies success
> with a flush required, and 0 denotes success without a flush required.
> 
> This adjustment prepares for a later change that will enable
> kvm_pre_set_memory_attributes() to fail.

No, just don't support S-EPT hugepages with per-VM memory attributes.  This type
of complexity isn't worth carrying for a feature we want to deprecate.

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [PATCH v3 00/24] KVM: TDX huge page support for private memory
  2026-01-06 10:16 [PATCH v3 00/24] KVM: TDX huge page support for private memory Yan Zhao
                   ` (24 preceding siblings ...)
  2026-01-06 17:47 ` [PATCH v3 00/24] KVM: TDX huge page support for private memory Vishal Annapurve
@ 2026-01-16  0:28 ` Sean Christopherson
  2026-01-16 11:25   ` Yan Zhao
  25 siblings, 1 reply; 127+ messages in thread
From: Sean Christopherson @ 2026-01-16  0:28 UTC (permalink / raw)
  To: Yan Zhao
  Cc: pbonzini, linux-kernel, kvm, x86, rick.p.edgecombe, dave.hansen,
	kas, tabba, ackerleytng, michael.roth, david, vannapurve, sagis,
	vbabka, thomas.lendacky, nik.borisov, pgonda, fan.du, jun.miao,
	francescolavra.fl, jgross, ira.weiny, isaku.yamahata, xiaoyao.li,
	kai.huang, binbin.wu, chao.p.peng, chao.gao

On Tue, Jan 06, 2026, Yan Zhao wrote:
> This is v3 of the TDX huge page series. The full stack is available at [4].

Nope, that's different code.

> [4] kernel full stack: https://github.com/intel-staging/tdx/tree/huge_page_v3

E.g. this branch doesn't have

  [PATCH v3 16/24] KVM: guest_memfd: Split for punch hole and private-to-shared conversion

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [PATCH v3 02/24] x86/virt/tdx: Add SEAMCALL wrapper tdh_mem_page_demote()
  2026-01-06 10:18 ` [PATCH v3 02/24] x86/virt/tdx: Add SEAMCALL wrapper tdh_mem_page_demote() Yan Zhao
@ 2026-01-16  1:00   ` Huang, Kai
  2026-01-16  8:35     ` Yan Zhao
  2026-01-16 11:22   ` Huang, Kai
  2026-01-28 22:49   ` Sean Christopherson
  2 siblings, 1 reply; 127+ messages in thread
From: Huang, Kai @ 2026-01-16  1:00 UTC (permalink / raw)
  To: pbonzini@redhat.com, seanjc@google.com, Zhao, Yan Y
  Cc: kvm@vger.kernel.org, Du, Fan, Li, Xiaoyao, Gao, Chao,
	Hansen, Dave, thomas.lendacky@amd.com, vbabka@suse.cz,
	tabba@google.com, david@kernel.org, kas@kernel.org,
	michael.roth@amd.com, Weiny, Ira, linux-kernel@vger.kernel.org,
	binbin.wu@linux.intel.com, ackerleytng@google.com,
	nik.borisov@suse.com, Yamahata, Isaku, Peng, Chao P,
	francescolavra.fl@gmail.com, sagis@google.com, Annapurve, Vishal,
	Edgecombe, Rick P, Miao, Jun, jgross@suse.com, pgonda@google.com,
	x86@kernel.org


> 
> Enable tdh_mem_page_demote() only on TDX modules that support feature
> TDX_FEATURES0.ENHANCE_DEMOTE_INTERRUPTIBILITY, which does not return error
> TDX_INTERRUPTED_RESTARTABLE on basic TDX (i.e., without TD partition) [2].
> 
> This is because error TDX_INTERRUPTED_RESTARTABLE is difficult to handle.
> The TDX module provides no guaranteed maximum retry count to ensure forward
> progress of the demotion. Interrupt storms could then result in a DoS if
> host simply retries endlessly for TDX_INTERRUPTED_RESTARTABLE. Disabling
> interrupts before invoking the SEAMCALL also doesn't work because NMIs can
> also trigger TDX_INTERRUPTED_RESTARTABLE. Therefore, the tradeoff for basic
> TDX is to disable the TDX_INTERRUPTED_RESTARTABLE error given the
> reasonable execution time for demotion. [1]
> 

[...]

> v3:
> - Use a var name that clearly tells that the page is used as a page table
>   page. (Binbin).
> - Check if TDX module supports feature ENHANCE_DEMOTE_INTERRUPTIBILITY.
>   (Kai).
> 
[...]

> +u64 tdh_mem_page_demote(struct tdx_td *td, u64 gpa, int level, struct page *new_sept_page,
> +			u64 *ext_err1, u64 *ext_err2)
> +{
> +	struct tdx_module_args args = {
> +		.rcx = gpa | level,
> +		.rdx = tdx_tdr_pa(td),
> +		.r8 = page_to_phys(new_sept_page),
> +	};
> +	u64 ret;
> +
> +	if (!tdx_supports_demote_nointerrupt(&tdx_sysinfo))
> +		return TDX_SW_ERROR;
> 

For the record, while I replied my suggestion [*] to this patch in v2, it
was basically because the discussion was already in that patch -- I didn't
mean to do this check inside tdh_mem_page_demote(), but do this check in
KVM page fault patch and return 4K as maximum mapping level.

The precise words were:

  So if the decision is to not use 2M page when TDH_MEM_PAGE_DEMOTE can 
  return TDX_INTERRUPTED_RESTARTABLE, maybe we can just check this 
  enumeration in fault handler and always make mapping level as 4K?

Looking at this series, this is eventually done in your last patch.  But I
don't quite understand what's the additional value of doing such check and
return TDX_SW_ERROR in this SEAMCALL wrapper.

Currently in this series, it doesn't matter whether this wrapper returns
TDX_SW_ERROR or the real TDX_INTERRUPTED_RESTARTABLE -- KVM terminates the
TD anyway (see your patch 8) because this is unexpected as checked in your
last patch.

IMHO we should get rid of this check in this low level wrapper.

[*]:
https://lore.kernel.org/all/fbf04b09f13bc2ce004ac97ee9c1f2c965f44fdf.camel@intel.com/#t

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [PATCH v3 14/24] KVM: Change the return type of gfn_handler_t() from bool to int
  2026-01-16  0:21   ` Sean Christopherson
@ 2026-01-16  6:42     ` Yan Zhao
  0 siblings, 0 replies; 127+ messages in thread
From: Yan Zhao @ 2026-01-16  6:42 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: pbonzini, linux-kernel, kvm, x86, rick.p.edgecombe, dave.hansen,
	kas, tabba, ackerleytng, michael.roth, david, vannapurve, sagis,
	vbabka, thomas.lendacky, nik.borisov, pgonda, fan.du, jun.miao,
	francescolavra.fl, jgross, ira.weiny, isaku.yamahata, xiaoyao.li,
	kai.huang, binbin.wu, chao.p.peng, chao.gao

On Thu, Jan 15, 2026 at 04:21:14PM -0800, Sean Christopherson wrote:
> On Tue, Jan 06, 2026, Yan Zhao wrote:
> > Modify the return type of gfn_handler_t() from bool to int. A negative
> > return value indicates failure, while a return value of 1 signifies success
> > with a flush required, and 0 denotes success without a flush required.
> > 
> > This adjustment prepares for a later change that will enable
> > kvm_pre_set_memory_attributes() to fail.
> 
> No, just don't support S-EPT hugepages with per-VM memory attributes.  This type
> of complexity isn't worth carrying for a feature we want to deprecate.
Got it! Will disable TDX huge page if vm_memory_attributes is true.

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [PATCH v3 06/24] KVM: x86/mmu: Disallow page merging (huge page adjustment) for mirror root
  2026-01-15 22:49   ` Sean Christopherson
@ 2026-01-16  7:54     ` Yan Zhao
  2026-01-26 16:08       ` Sean Christopherson
  0 siblings, 1 reply; 127+ messages in thread
From: Yan Zhao @ 2026-01-16  7:54 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: pbonzini, linux-kernel, kvm, x86, rick.p.edgecombe, dave.hansen,
	kas, tabba, ackerleytng, michael.roth, david, vannapurve, sagis,
	vbabka, thomas.lendacky, nik.borisov, pgonda, fan.du, jun.miao,
	francescolavra.fl, jgross, ira.weiny, isaku.yamahata, xiaoyao.li,
	kai.huang, binbin.wu, chao.p.peng, chao.gao

Hi Sean,
Thanks for the review!

On Thu, Jan 15, 2026 at 02:49:59PM -0800, Sean Christopherson wrote:
> On Tue, Jan 06, 2026, Yan Zhao wrote:
> > From: Rick P Edgecombe <rick.p.edgecombe@intel.com>
> > 
> > Disallow page merging (huge page adjustment) for the mirror root by
> > utilizing disallowed_hugepage_adjust().
> 
> Why?  What is this actually doing?  The below explains "how" but I'm baffled as
> to the purpose.  I'm guessing there are hints in the surrounding patches, but I
> haven't read them in depth, and shouldn't need to in order to understand the
> primary reason behind a change.
Sorry for missing the background. I will explain the "why" in the patch log in
the next version.

The reason for introducing this patch is to disallow page merging for TDX. I
explained the reasons to disallow page merging in the cover letter:

"
7. Page merging (page promotion)

   Promotion is disallowed, because:

   - The current TDX module requires all 4KB leafs to be either all PENDING
     or all ACCEPTED before a successful promotion to 2MB. This requirement
     prevents successful page merging after partially converting a 2MB
     range from private to shared and then back to private, which is the
     primary scenario necessitating page promotion.

   - tdh_mem_page_promote() depends on tdh_mem_range_block() in the current
     TDX module. Consequently, handling BUSY errors is complex, as page
     merging typically occurs in the fault path under shared mmu_lock.

   - Limited amount of initial private memory (typically ~4MB) means the
     need for page merging during TD build time is minimal.
"

Without this patch, page promotion may be triggered in the following scenario:

1. guest_memfd allocates a 2MB folio for GPA X, so the max mapping level is 2MB.
2. KVM maps GPA X at 4KB level during TD build time.
3. Guest converts GPA X to shared, zapping the 4KB leaf private mapping while
   keeping the 2MB non-leaf private mapping.
4. Guest converts GPA X to private and accepts it at 2MB level.
5. KVM maps GPA X at 2MB level, triggering page merging.

However, we don't support page merging yet. Specifically for the above
scenario, the purpose is to avoid having to handle the error from
tdh_mem_page_promote(), a SEAMCALL that currently needs to be preceded by
tdh_mem_range_block(). To handle a promotion error (e.g., due to busy) under
read mmu_lock, we may need to introduce several spinlocks and guarantees from
the guest to ensure the success of tdh_mem_range_unblock() to restore the S-EPT
state.

Therefore, we introduced this patch for simplicity, and because the promotion
scenario is not common.

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [PATCH v3 02/24] x86/virt/tdx: Add SEAMCALL wrapper tdh_mem_page_demote()
  2026-01-16  1:00   ` Huang, Kai
@ 2026-01-16  8:35     ` Yan Zhao
  2026-01-16 11:10       ` Huang, Kai
  0 siblings, 1 reply; 127+ messages in thread
From: Yan Zhao @ 2026-01-16  8:35 UTC (permalink / raw)
  To: Huang, Kai
  Cc: pbonzini@redhat.com, seanjc@google.com, kvm@vger.kernel.org,
	Du, Fan, Li, Xiaoyao, Gao, Chao, Hansen, Dave,
	thomas.lendacky@amd.com, vbabka@suse.cz, tabba@google.com,
	david@kernel.org, kas@kernel.org, michael.roth@amd.com,
	Weiny, Ira, linux-kernel@vger.kernel.org,
	binbin.wu@linux.intel.com, ackerleytng@google.com,
	nik.borisov@suse.com, Yamahata, Isaku, Peng, Chao P,
	francescolavra.fl@gmail.com, sagis@google.com, Annapurve, Vishal,
	Edgecombe, Rick P, Miao, Jun, jgross@suse.com, pgonda@google.com,
	x86@kernel.org

Hi Kai,
Thanks for reviewing!

On Fri, Jan 16, 2026 at 09:00:29AM +0800, Huang, Kai wrote:
> 
> > 
> > Enable tdh_mem_page_demote() only on TDX modules that support feature
> > TDX_FEATURES0.ENHANCE_DEMOTE_INTERRUPTIBILITY, which does not return error
> > TDX_INTERRUPTED_RESTARTABLE on basic TDX (i.e., without TD partition) [2].
> > 
> > This is because error TDX_INTERRUPTED_RESTARTABLE is difficult to handle.
> > The TDX module provides no guaranteed maximum retry count to ensure forward
> > progress of the demotion. Interrupt storms could then result in a DoS if
> > host simply retries endlessly for TDX_INTERRUPTED_RESTARTABLE. Disabling
> > interrupts before invoking the SEAMCALL also doesn't work because NMIs can
> > also trigger TDX_INTERRUPTED_RESTARTABLE. Therefore, the tradeoff for basic
> > TDX is to disable the TDX_INTERRUPTED_RESTARTABLE error given the
> > reasonable execution time for demotion. [1]
> > 
> 
> [...]
> 
> > v3:
> > - Use a var name that clearly tells that the page is used as a page table
> >   page. (Binbin).
> > - Check if TDX module supports feature ENHANCE_DEMOTE_INTERRUPTIBILITY.
> >   (Kai).
> > 
> [...]
> 
> > +u64 tdh_mem_page_demote(struct tdx_td *td, u64 gpa, int level, struct page *new_sept_page,
> > +			u64 *ext_err1, u64 *ext_err2)
> > +{
> > +	struct tdx_module_args args = {
> > +		.rcx = gpa | level,
> > +		.rdx = tdx_tdr_pa(td),
> > +		.r8 = page_to_phys(new_sept_page),
> > +	};
> > +	u64 ret;
> > +
> > +	if (!tdx_supports_demote_nointerrupt(&tdx_sysinfo))
> > +		return TDX_SW_ERROR;
> > 
> 
> For the record, while I replied my suggestion [*] to this patch in v2, it
> was basically because the discussion was already in that patch -- I didn't
> mean to do this check inside tdh_mem_page_demote(), but do this check in
> KVM page fault patch and return 4K as maximum mapping level.
> 
> The precise words were:
> 
>   So if the decision is to not use 2M page when TDH_MEM_PAGE_DEMOTE can 
>   return TDX_INTERRUPTED_RESTARTABLE, maybe we can just check this 
>   enumeration in fault handler and always make mapping level as 4K?
Right. I followed it in the last patch (patch 24).

> Looking at this series, this is eventually done in your last patch.  But I
> don't quite understand what's the additional value of doing such check and
> return TDX_SW_ERROR in this SEAMCALL wrapper.
> 
> Currently in this series, it doesn't matter whether this wrapper returns
> TDX_SW_ERROR or the real TDX_INTERRUPTED_RESTARTABLE -- KVM terminates the
> TD anyway (see your patch 8) because this is unexpected as checked in your
> last patch.
> 
> IMHO we should get rid of this check in this low level wrapper.
You are right, the wrapper shouldn't hit this error after the last patch.

However, I found it's better to introduce the feature bit
TDX_FEATURES0_ENHANCE_DEMOTE_INTERRUPTIBILITY and the helper
tdx_supports_demote_nointerrupt() together with the demote SEAMCALL wrapper.
This way, people can understand how the TDX_INTERRUPTED_RESTARTABLE error is
handled for this SEAMCALL. Invoking the helper in this patch also gives the
helper a user :)

What do you think about changing it to a WARN_ON_ONCE()? i.e.,
WARN_ON_ONCE(!tdx_supports_demote_nointerrupt(&tdx_sysinfo));


> [*]:
> https://lore.kernel.org/all/fbf04b09f13bc2ce004ac97ee9c1f2c965f44fdf.camel@intel.com/#t

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [PATCH v3 02/24] x86/virt/tdx: Add SEAMCALL wrapper tdh_mem_page_demote()
  2026-01-16  8:35     ` Yan Zhao
@ 2026-01-16 11:10       ` Huang, Kai
  2026-01-16 11:22         ` Huang, Kai
  2026-01-19  6:15         ` Yan Zhao
  0 siblings, 2 replies; 127+ messages in thread
From: Huang, Kai @ 2026-01-16 11:10 UTC (permalink / raw)
  To: Zhao, Yan Y
  Cc: Du, Fan, kvm@vger.kernel.org, Li, Xiaoyao, Hansen, Dave,
	thomas.lendacky@amd.com, tabba@google.com, vbabka@suse.cz,
	david@kernel.org, michael.roth@amd.com,
	linux-kernel@vger.kernel.org, seanjc@google.com,
	pbonzini@redhat.com, binbin.wu@linux.intel.com, Weiny, Ira,
	kas@kernel.org, nik.borisov@suse.com, ackerleytng@google.com,
	Peng, Chao P, francescolavra.fl@gmail.com, Yamahata, Isaku,
	sagis@google.com, Gao, Chao, Edgecombe, Rick P, Miao, Jun,
	Annapurve, Vishal, jgross@suse.com, pgonda@google.com,
	x86@kernel.org

On Fri, 2026-01-16 at 16:35 +0800, Yan Zhao wrote:
> Hi Kai,
> Thanks for reviewing!
> 
> On Fri, Jan 16, 2026 at 09:00:29AM +0800, Huang, Kai wrote:
> > 
> > > 
> > > Enable tdh_mem_page_demote() only on TDX modules that support feature
> > > TDX_FEATURES0.ENHANCE_DEMOTE_INTERRUPTIBILITY, which does not return error
> > > TDX_INTERRUPTED_RESTARTABLE on basic TDX (i.e., without TD partition) [2].
> > > 
> > > This is because error TDX_INTERRUPTED_RESTARTABLE is difficult to handle.
> > > The TDX module provides no guaranteed maximum retry count to ensure forward
> > > progress of the demotion. Interrupt storms could then result in a DoS if
> > > host simply retries endlessly for TDX_INTERRUPTED_RESTARTABLE. Disabling
> > > interrupts before invoking the SEAMCALL also doesn't work because NMIs can
> > > also trigger TDX_INTERRUPTED_RESTARTABLE. Therefore, the tradeoff for basic
> > > TDX is to disable the TDX_INTERRUPTED_RESTARTABLE error given the
> > > reasonable execution time for demotion. [1]
> > > 
> > 
> > [...]
> > 
> > > v3:
> > > - Use a var name that clearly tells that the page is used as a page table
> > >   page. (Binbin).
> > > - Check if TDX module supports feature ENHANCE_DEMOTE_INTERRUPTIBILITY.
> > >   (Kai).
> > > 
> > [...]
> > 
> > > +u64 tdh_mem_page_demote(struct tdx_td *td, u64 gpa, int level, struct page *new_sept_page,
> > > +			u64 *ext_err1, u64 *ext_err2)
> > > +{
> > > +	struct tdx_module_args args = {
> > > +		.rcx = gpa | level,
> > > +		.rdx = tdx_tdr_pa(td),
> > > +		.r8 = page_to_phys(new_sept_page),
> > > +	};
> > > +	u64 ret;
> > > +
> > > +	if (!tdx_supports_demote_nointerrupt(&tdx_sysinfo))
> > > +		return TDX_SW_ERROR;
> > > 
> > 
> > For the record, while I replied my suggestion [*] to this patch in v2, it
> > was basically because the discussion was already in that patch -- I didn't
> > mean to do this check inside tdh_mem_page_demote(), but do this check in
> > KVM page fault patch and return 4K as maximum mapping level.
> > 
> > The precise words were:
> > 
> >   So if the decision is to not use 2M page when TDH_MEM_PAGE_DEMOTE can 
> >   return TDX_INTERRUPTED_RESTARTABLE, maybe we can just check this 
> >   enumeration in fault handler and always make mapping level as 4K?
> Right. I followed it in the last patch (patch 24).
> 
> > Looking at this series, this is eventually done in your last patch.  But I
> > don't quite understand what's the additional value of doing such check and
> > return TDX_SW_ERROR in this SEAMCALL wrapper.
> > 
> > Currently in this series, it doesn't matter whether this wrapper returns
> > TDX_SW_ERROR or the real TDX_INTERRUPTED_RESTARTABLE -- KVM terminates the
> > TD anyway (see your patch 8) because this is unexpected as checked in your
> > last patch.
> > 
> > IMHO we should get rid of this check in this low level wrapper.
> You are right, the wrapper shouldn't hit this error after the last patch.
> 
> However, I found it's better to introduce the feature bit
> TDX_FEATURES0_ENHANCE_DEMOTE_INTERRUPTIBILITY and the helper
> tdx_supports_demote_nointerrupt() together with the demote SEAMCALL wrapper.
> This way, people can understand how the TDX_INTERRUPTED_RESTARTABLE error is
> handled for this SEAMCALL. 
> 

So the "handling" here is basically making the DEMOTE SEAMCALL unavailable,
at the low-level SEAMCALL wrapper, when DEMOTE is interruptible.

I guess you can argue this has some value since it tells users "don't even
try to call me when I am interruptible because I am not available".  

However, IMHO this also implies the benefit is mostly for the case where
the user wants to use this wrapper to tell whether DEMOTE is available. 
E.g.,

	err = tdh_mem_page_demote(...);
	if (err == TDX_SW_ERROR)
		enable_tdx_hugepage = false;

But in this series you are using tdx_supports_demote_nointerrupt() for
this purpose, which is better IMHO.

So maybe there's a *theoretical* value to have the check here, but I don't
see any *real* value.

But I don't have a strong opinion either -- I guess I just don't like making
these low level SEAMCALL wrappers more complicated than what the SEAMCALL
does -- and it's up to you to decide. :-)

> 
> What do you think about changing it to a WARN_ON_ONCE()? i.e.,
> WARN_ON_ONCE(!tdx_supports_demote_nointerrupt(&tdx_sysinfo));

What's your intention?

W/o the WARN(), the caller _can_ call this wrapper (i.e., not a kernel
bug) but it always gets a SW-defined error.  Again, maybe it has value for
the case where the caller wants to use this to tell whether DEMOTE is
available.

With the WARN(), it's a kernel bug to call the wrapper, and the caller
needs to use another way (i.e., tdx_supports_demote_nointerrupt()) to tell
whether DEMOTE is available.

So if you want the check, probably WARN() is a better idea since I suppose
we always want users to use tdx_supports_demote_nointerrupt() to know
whether DEMOTE can be done, and the WARN() is just to catch bugs.
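
To illustrate the difference, here is a rough user-space sketch of the two
variants; the TDX_SW_ERROR value, the flag, and the function names are all
placeholders, not the real kernel definitions:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TDX_SW_ERROR 0x8000000000000000ULL	/* placeholder, not the real value */

static bool demote_uninterruptible;	/* stand-in for the TDX_FEATURES0 bit */

/* Variant 1: the wrapper itself refuses when DEMOTE is interruptible,
 * so a caller can (ab)use the error code to probe availability. */
static uint64_t demote_with_check(void)
{
	if (!demote_uninterruptible)
		return TDX_SW_ERROR;
	return 0;	/* pretend the SEAMCALL succeeded */
}

/* Variant 2: calling the wrapper at all is a kernel bug when DEMOTE is
 * interruptible; availability must be queried separately up front. */
static uint64_t demote_with_warn(void)
{
	assert(demote_uninterruptible);	/* WARN_ON_ONCE(...) in the kernel */
	return 0;
}
```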

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [PATCH v3 02/24] x86/virt/tdx: Add SEAMCALL wrapper tdh_mem_page_demote()
  2026-01-06 10:18 ` [PATCH v3 02/24] x86/virt/tdx: Add SEAMCALL wrapper tdh_mem_page_demote() Yan Zhao
  2026-01-16  1:00   ` Huang, Kai
@ 2026-01-16 11:22   ` Huang, Kai
  2026-01-19  5:55     ` Yan Zhao
  2026-01-28 22:49   ` Sean Christopherson
  2 siblings, 1 reply; 127+ messages in thread
From: Huang, Kai @ 2026-01-16 11:22 UTC (permalink / raw)
  To: pbonzini@redhat.com, seanjc@google.com, Zhao, Yan Y
  Cc: kvm@vger.kernel.org, Du, Fan, Li, Xiaoyao, Gao, Chao,
	Hansen, Dave, thomas.lendacky@amd.com, vbabka@suse.cz,
	tabba@google.com, david@kernel.org, kas@kernel.org,
	michael.roth@amd.com, Weiny, Ira, linux-kernel@vger.kernel.org,
	binbin.wu@linux.intel.com, ackerleytng@google.com,
	nik.borisov@suse.com, Yamahata, Isaku, Peng, Chao P,
	francescolavra.fl@gmail.com, sagis@google.com, Annapurve, Vishal,
	Edgecombe, Rick P, Miao, Jun, jgross@suse.com, pgonda@google.com,
	x86@kernel.org

On Tue, 2026-01-06 at 18:18 +0800, Yan Zhao wrote:
>  /* Bit definitions of TDX_FEATURES0 metadata field */
>  #define TDX_FEATURES0_NO_RBP_MOD		BIT_ULL(18)
>  #define TDX_FEATURES0_DYNAMIC_PAMT		BIT_ULL(36)
> +#define TDX_FEATURES0_ENHANCE_DEMOTE_INTERRUPTIBILITY	BIT_ULL(51)

Nit: the spec uses "ENHANCED" but not "ENHANCE", so perhaps change to
TDX_FEATURES0_ENHANCED_DEMOTE_INTERRUPTIBILITY ?

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [PATCH v3 02/24] x86/virt/tdx: Add SEAMCALL wrapper tdh_mem_page_demote()
  2026-01-16 11:10       ` Huang, Kai
@ 2026-01-16 11:22         ` Huang, Kai
  2026-01-19  6:18           ` Yan Zhao
  2026-01-19  6:15         ` Yan Zhao
  1 sibling, 1 reply; 127+ messages in thread
From: Huang, Kai @ 2026-01-16 11:22 UTC (permalink / raw)
  To: Zhao, Yan Y
  Cc: Du, Fan, kvm@vger.kernel.org, Li, Xiaoyao, Hansen, Dave,
	thomas.lendacky@amd.com, tabba@google.com, vbabka@suse.cz,
	david@kernel.org, kas@kernel.org, linux-kernel@vger.kernel.org,
	seanjc@google.com, pbonzini@redhat.com, binbin.wu@linux.intel.com,
	Weiny, Ira, nik.borisov@suse.com, Annapurve, Vishal,
	ackerleytng@google.com, Peng, Chao P, michael.roth@amd.com,
	Yamahata, Isaku, sagis@google.com, Gao, Chao,
	francescolavra.fl@gmail.com, Miao, Jun, Edgecombe, Rick P,
	jgross@suse.com, pgonda@google.com, x86@kernel.org

On Fri, 2026-01-16 at 11:10 +0000, Huang, Kai wrote:
> W/o the WARN(), the caller _can_ call this wrapper (i.e., not a kernel
> bug) but it always gets a SW-defined error.  Again, maybe it has value for
> the case where the caller wants to use this to tell whether DEMOTE is
> available.
> 
> With the WARN(), it's a kernel bug to call the wrapper, and the caller
> needs to use another way (i.e., tdx_supports_demote_nointerrupt()) to tell
> whether DEMOTE is available.
> 
> So if you want the check, probably WARN() is a better idea since I suppose
> we always want users to use tdx_supports_demote_nointerrupt() to know
> whether DEMOTE can be done, and the WARN() is just to catch bugs.

Forgot to say, the name tdx_supports_demote_nointerrupt() only tells that
the TDX module *supports* non-interruptible DEMOTE; it doesn't tell whether
the TDX module has *enabled* it.

So while we know that for this DEMOTE case there's no need to *enable* the
feature (i.e., DEMOTE is non-interruptible whenever the feature is reported
as *supported*), from the kernel's point of view, is it better to just use a
clearer name?

E.g., tdx_huge_page_demote_uninterruptible()?

A bonus is the name contains "huge_page", so it's super clear what the
demote is about.

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [PATCH v3 00/24] KVM: TDX huge page support for private memory
  2026-01-16  0:28 ` Sean Christopherson
@ 2026-01-16 11:25   ` Yan Zhao
  2026-01-16 14:46     ` Sean Christopherson
  0 siblings, 1 reply; 127+ messages in thread
From: Yan Zhao @ 2026-01-16 11:25 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: pbonzini, linux-kernel, kvm, x86, rick.p.edgecombe, dave.hansen,
	kas, tabba, ackerleytng, michael.roth, david, vannapurve, sagis,
	vbabka, thomas.lendacky, nik.borisov, pgonda, fan.du, jun.miao,
	francescolavra.fl, jgross, ira.weiny, isaku.yamahata, xiaoyao.li,
	kai.huang, binbin.wu, chao.p.peng, chao.gao

On Thu, Jan 15, 2026 at 04:28:12PM -0800, Sean Christopherson wrote:
> On Tue, Jan 06, 2026, Yan Zhao wrote:
> > This is v3 of the TDX huge page series. The full stack is available at [4].
> 
> Nope, that's different code.
I double-checked. It's the correct code.
See https://github.com/intel-staging/tdx/commits/huge_page_v3.

However, I did make a mistake with the section name, which says
"=== Beginning of section "TDX huge page v2" ===",
but it's actually v3! [facepalm]

> > [4] kernel full stack: https://github.com/intel-staging/tdx/tree/huge_page_v3
> 
> E.g. this branch doesn't have
> 
>   [PATCH v3 16/24] KVM: guest_memfd: Split for punch hole and private-to-shared conversion
It's here https://github.com/intel-staging/tdx/commit/d0f7465a7f56f10b5bcc9593351c32b37be073a4

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [PATCH v3 00/24] KVM: TDX huge page support for private memory
  2026-01-16 11:25   ` Yan Zhao
@ 2026-01-16 14:46     ` Sean Christopherson
  2026-01-19  1:25       ` Yan Zhao
  0 siblings, 1 reply; 127+ messages in thread
From: Sean Christopherson @ 2026-01-16 14:46 UTC (permalink / raw)
  To: Yan Zhao
  Cc: pbonzini, linux-kernel, kvm, x86, rick.p.edgecombe, dave.hansen,
	kas, tabba, ackerleytng, michael.roth, david, vannapurve, sagis,
	vbabka, thomas.lendacky, nik.borisov, pgonda, fan.du, jun.miao,
	francescolavra.fl, jgross, ira.weiny, isaku.yamahata, xiaoyao.li,
	kai.huang, binbin.wu, chao.p.peng, chao.gao

On Fri, Jan 16, 2026, Yan Zhao wrote:
> On Thu, Jan 15, 2026 at 04:28:12PM -0800, Sean Christopherson wrote:
> > On Tue, Jan 06, 2026, Yan Zhao wrote:
> > > This is v3 of the TDX huge page series. The full stack is available at [4].
> > 
> > Nope, that's different code.
> I double-checked. It's the correct code.
> See https://github.com/intel-staging/tdx/commits/huge_page_v3.

Argh, and I even double-checked before complaining, but apparently I screwed up
twice.  On triple-checking, I do see the same code as the patches.  *sigh*

Sorry.

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [PATCH v3 00/24] KVM: TDX huge page support for private memory
  2026-01-15  0:19                       ` Sean Christopherson
@ 2026-01-16 15:45                         ` Edgecombe, Rick P
  2026-01-16 16:31                           ` Sean Christopherson
  2026-01-16 16:57                         ` Dave Hansen
  1 sibling, 1 reply; 127+ messages in thread
From: Edgecombe, Rick P @ 2026-01-16 15:45 UTC (permalink / raw)
  To: Hansen, Dave, seanjc@google.com
  Cc: kvm@vger.kernel.org, Du, Fan, Li, Xiaoyao, Huang, Kai,
	Zhao, Yan Y, thomas.lendacky@amd.com, vbabka@suse.cz,
	tabba@google.com, david@kernel.org, linux-kernel@vger.kernel.org,
	kas@kernel.org, Weiny, Ira, pbonzini@redhat.com,
	francescolavra.fl@gmail.com, ackerleytng@google.com,
	nik.borisov@suse.com, binbin.wu@linux.intel.com, Yamahata, Isaku,
	Peng, Chao P, michael.roth@amd.com, Annapurve, Vishal,
	sagis@google.com, Gao, Chao, Miao, Jun, jgross@suse.com,
	pgonda@google.com, x86@kernel.org

On Wed, 2026-01-14 at 16:19 -0800, Sean Christopherson wrote:
> I've no objection if e.g. tdh_mem_page_aug() wants to sanity check
> that a PFN is backed by a struct page with a valid refcount, it's
> code like that above that I don't want.

Dave wants safety for the TDX pages getting handed to the module. Sean
doesn't want the TDX code to special-case a requirement for struct page.
I think they are not too much at odds actually. Here are two ideas to
get both:

1. Have the TDX module do the checking

Yan points out that we could possibly rely on
TDX_OPERAND_ADDR_RANGE_ERROR to detect operating on the wrong type of
memory. We would have to make sure this will check everything we need
going forward.

2. Invent a new tdx_page_t type.

We could have a mk_foo()-type helper that converts kvm_pfn_t to TDX's
verified type, checking that the page is valid, TDX-capable memory. Then
there is one place that does the conversion, and it will be easy to change
the verification method if we ever need to.

One benefit is that struct page has already been a problem for other
reasons [0]. To work around that issue we had to keep duplicate formats
of the TDVPR page and lose the standardization of how we handle pages
in the TDX code. This is perfectly functional, but a bit annoying.

But (2) is inventing a new type, which is somewhat disagreeable too.
 
I'm thinking maybe we explore (1) first, with the eventual goal of moving
everything to some kind of pfn type to unify with the rest of KVM, either
KVM's or the normal one.  But before we do that, can we settle on what
sanity checks we want:
1. Page is TDX capable memory
2. ... I think that is it? There was some discussion of refcount
checking. I think we don’t need it?

[0]
https://lore.kernel.org/kvm/20250910144453.1389652-1-dave.hansen@linux.intel.com/#r


^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [PATCH v3 00/24] KVM: TDX huge page support for private memory
  2026-01-16 15:45                         ` Edgecombe, Rick P
@ 2026-01-16 16:31                           ` Sean Christopherson
  2026-01-16 16:58                             ` Edgecombe, Rick P
  0 siblings, 1 reply; 127+ messages in thread
From: Sean Christopherson @ 2026-01-16 16:31 UTC (permalink / raw)
  To: Rick P Edgecombe
  Cc: Dave Hansen, kvm@vger.kernel.org, Fan Du, Xiaoyao Li, Kai Huang,
	Yan Y Zhao, thomas.lendacky@amd.com, vbabka@suse.cz,
	tabba@google.com, david@kernel.org, linux-kernel@vger.kernel.org,
	kas@kernel.org, Ira Weiny, pbonzini@redhat.com,
	francescolavra.fl@gmail.com, ackerleytng@google.com,
	nik.borisov@suse.com, binbin.wu@linux.intel.com, Isaku Yamahata,
	Chao P Peng, michael.roth@amd.com, Vishal Annapurve,
	sagis@google.com, Chao Gao, Jun Miao, jgross@suse.com,
	pgonda@google.com, x86@kernel.org

On Fri, Jan 16, 2026, Rick P Edgecombe wrote:
> On Wed, 2026-01-14 at 16:19 -0800, Sean Christopherson wrote:
> > I've no objection if e.g. tdh_mem_page_aug() wants to sanity check
> > that a PFN is backed by a struct page with a valid refcount, it's
> > code like that above that I don't want.
> 
> Dave wants safety for the TDX pages getting handed to the module.

Define "safety".  As I stressed earlier, blinding retrieving a "struct page" and
dereferencing that pointer is the exact opposite of safe.

> 2. Invent a new tdx_page_t type.

Still doesn't provide meaningful safety.  Regardless of what type gets passed
into the low level tdh_*() helpers, it's going to require KVM to effectively cast
a bare pfn, because I am completely against passing anything other than a SPTE
to tdx_sept_set_private_spte().

> 1. Page is TDX capable memory

That's fine by me, but that's _very_ different than what was proposed here.

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [PATCH v3 00/24] KVM: TDX huge page support for private memory
  2026-01-15  0:19                       ` Sean Christopherson
  2026-01-16 15:45                         ` Edgecombe, Rick P
@ 2026-01-16 16:57                         ` Dave Hansen
  2026-01-16 17:14                           ` Sean Christopherson
  1 sibling, 1 reply; 127+ messages in thread
From: Dave Hansen @ 2026-01-16 16:57 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Yan Zhao, Ackerley Tng, Vishal Annapurve, pbonzini, linux-kernel,
	kvm, x86, rick.p.edgecombe, kas, tabba, michael.roth, david,
	sagis, vbabka, thomas.lendacky, nik.borisov, pgonda, fan.du,
	jun.miao, francescolavra.fl, jgross, ira.weiny, isaku.yamahata,
	xiaoyao.li, kai.huang, binbin.wu, chao.p.peng, chao.gao

On 1/14/26 16:19, Sean Christopherson wrote:
>> 'struct page' gives us two things: One is the type safety, but I'm
>> pretty flexible on how that's implemented as long as it's not a raw u64
>> getting passed around everywhere.
> I don't necessarily disagree on the type safety front, but for the specific code
> in question, any type safety is a facade.  Everything leading up to the TDX code
> is dealing with raw PFNs and/or PTEs.  Then the TDX code assumes that the PFN
> being mapped into the guest is backed by a struct page, and that the folio size
> is consistent with @level, without _any_ checks whatsoever.  This is providing
> the exact opposite of safety.
> 
>   static int tdx_mem_page_aug(struct kvm *kvm, gfn_t gfn,
> 			    enum pg_level level, kvm_pfn_t pfn)
>   {
> 	int tdx_level = pg_level_to_tdx_sept_level(level);
> 	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
> 	struct page *page = pfn_to_page(pfn);    <==================

I of course agree that this is fundamentally unsafe, it's just not
necessarily bad code.

I hope we both agree that this could be made _more_ safe by, for
instance, making sure the page is in a zone, pfn_valid(), and a few more
things.

In a perfect world, these conversions would happen at a well-defined
layer (KVM=>TDX) and in relatively few places. That layer transition is
where the sanity checks happen. It's super useful to have:

struct page *kvm_pfn_to_tdx_private_page(kvm_pfn_t pfn)
{
	struct page *page = pfn_to_page(pfn);
#ifdef DEBUG
	WARN_ON_ONCE(!pfn_valid(pfn));
	// page must be from a "file"???
	WARN_ON_ONCE(!page_mapping(page));
	WARN_ON_ONCE(...);
#endif
	return page;
}

*EVEN* if the pfn_to_page() itself is unsafe, and even if the WARN()s
are compiled out, this explicitly lays out the assumptions, and it means
someone reading the TDX code has an easier time comprehending it.

It's also not a crime to do the *same* checking on kvm_pfn_t and not
have a type transition. I just like the idea of changing the type so
that the transition line is clear and the concept is carried (forced,
even) through the layers of helpers.

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [PATCH v3 00/24] KVM: TDX huge page support for private memory
  2026-01-16 16:31                           ` Sean Christopherson
@ 2026-01-16 16:58                             ` Edgecombe, Rick P
  2026-01-19  5:53                               ` Yan Zhao
  0 siblings, 1 reply; 127+ messages in thread
From: Edgecombe, Rick P @ 2026-01-16 16:58 UTC (permalink / raw)
  To: seanjc@google.com
  Cc: Du, Fan, Li, Xiaoyao, Huang, Kai, kvm@vger.kernel.org,
	Hansen, Dave, Zhao, Yan Y, thomas.lendacky@amd.com,
	vbabka@suse.cz, tabba@google.com, david@kernel.org,
	kas@kernel.org, linux-kernel@vger.kernel.org, Weiny, Ira,
	francescolavra.fl@gmail.com, pbonzini@redhat.com,
	ackerleytng@google.com, nik.borisov@suse.com,
	binbin.wu@linux.intel.com, Yamahata, Isaku, Peng, Chao P,
	michael.roth@amd.com, Annapurve, Vishal, sagis@google.com,
	Gao, Chao, Miao, Jun, jgross@suse.com, pgonda@google.com,
	x86@kernel.org

On Fri, 2026-01-16 at 08:31 -0800, Sean Christopherson wrote:
> > Dave wants safety for the TDX pages getting handed to the module.
> 
> Define "safety".  As I stressed earlier, blinding retrieving a
> "struct page" and dereferencing that pointer is the exact opposite of
> safe.

I think we had two problems.

1. Passing in raw PA's via u64 led to buggy code. IIRC we had a bug
with this that was caught before it went upstream. So a page needs a
real type of some sort.

2. Work was done on the tip side to prevent non-TDX capable memory from
entering the page allocator. With that in place, by requiring struct
page, TDX code can know that it is getting the type of memory it worked
hard to guarantee was good.

You are saying that shifting a PFN to a struct page blindly doesn't
actually guarantee that it meets those requirements. Makes sense.

For (1) we can just use any old type I think - pfn_t, etc. As we
discussed in the base series.

For (2) we need to check that the memory came from the page allocator,
or otherwise is valid TDX memory somehow. That is at least the only
check that makes sense to me.

There was some discussion about refcounts somewhere in this thread. I
don't think it's arch/x86's worry. Then Yan was saying something last
night that I didn't quite follow. We said, let's just resume the
discussion on the list. So she might suggest another check.

> 
> > 2. Invent a new tdx_page_t type.
> 
> Still doesn't provide meaningful safety.  Regardless of what type
> gets passed into the low level tdh_*() helpers, it's going to require
> KVM to effectively cast a bare pfn, because I am completely against
> passing anything other than a SPTE to tdx_sept_set_private_spte().

I'm not sure I was clear.  Something like:
1. A raw PFN gets passed in to the conversion helper in arch/x86.
2. The helper does the check that it is TDX capable memory, or anything
it cares to check about memory safety, then returns the new type to
KVM.
3. KVM uses the type as an argument to any seamcall that requires TDX
capable memory.
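
As a minimal user-space sketch of that flow -- the type name, the capability
check, and tdx_page_from_pfn() are all hypothetical stand-ins, not proposed
kernel API:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t kvm_pfn_t;

/* Opaque wrapper type so raw PFNs can't accidentally be handed to
 * SEAMCALL helpers; only the conversion helper below creates one. */
typedef struct { kvm_pfn_t pfn; } tdx_page_t;

/* Stand-in for "is this TDX-capable memory" (e.g. came from the page
 * allocator / lies in a CMR); the real check is the open question. */
static bool tdx_capable_pfn(kvm_pfn_t pfn)
{
	return pfn < (1ULL << 40);	/* placeholder predicate */
}

/* Single conversion point: check, then hand back the verified type. */
static int tdx_page_from_pfn(kvm_pfn_t pfn, tdx_page_t *out)
{
	if (!tdx_capable_pfn(pfn))
		return -1;	/* would be -EINVAL in the kernel */
	out->pfn = pfn;
	return 0;
}
```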

> 
> > 1. Page is TDX capable memory
> 
> That's fine by me, but that's _very_ different than what was proposed
> here.

Proposed by me just now or the series? We are trying to find a new
solution now.

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [PATCH v3 00/24] KVM: TDX huge page support for private memory
  2026-01-16 16:57                         ` Dave Hansen
@ 2026-01-16 17:14                           ` Sean Christopherson
  2026-01-16 17:45                             ` Dave Hansen
  0 siblings, 1 reply; 127+ messages in thread
From: Sean Christopherson @ 2026-01-16 17:14 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Yan Zhao, Ackerley Tng, Vishal Annapurve, pbonzini, linux-kernel,
	kvm, x86, rick.p.edgecombe, kas, tabba, michael.roth, david,
	sagis, vbabka, thomas.lendacky, nik.borisov, pgonda, fan.du,
	jun.miao, francescolavra.fl, jgross, ira.weiny, isaku.yamahata,
	xiaoyao.li, kai.huang, binbin.wu, chao.p.peng, chao.gao

On Fri, Jan 16, 2026, Dave Hansen wrote:
> On 1/14/26 16:19, Sean Christopherson wrote:
> >> 'struct page' gives us two things: One is the type safety, but I'm
> >> pretty flexible on how that's implemented as long as it's not a raw u64
> >> getting passed around everywhere.
> > I don't necessarily disagree on the type safety front, but for the specific code
> > in question, any type safety is a facade.  Everything leading up to the TDX code
> > is dealing with raw PFNs and/or PTEs.  Then the TDX code assumes that the PFN
> > being mapped into the guest is backed by a struct page, and that the folio size
> > is consistent with @level, without _any_ checks whatsoever.  This is providing
> > the exact opposite of safety.
> > 
> >   static int tdx_mem_page_aug(struct kvm *kvm, gfn_t gfn,
> > 			    enum pg_level level, kvm_pfn_t pfn)
> >   {
> > 	int tdx_level = pg_level_to_tdx_sept_level(level);
> > 	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
> > 	struct page *page = pfn_to_page(pfn);    <==================
> 
> I of course agree that this is fundamentally unsafe, it's just not
> necessarily bad code.
> 
> I hope we both agree that this could be made _more_ safe by, for
> instance, making sure the page is in a zone, pfn_valid(), and a few more
> things.
>
> In a perfect world, these conversions would happen at a well-defined
> layer (KVM=>TDX) and in relatively few places. That layer transition is
> where the sanity checks happen. It's super useful to have:
> 
> struct page *kvm_pfn_to_tdx_private_page(kvm_pfn_t pfn)
> {
> 	struct page *page = pfn_to_page(pfn);
> #ifdef DEBUG
> 	WARN_ON_ONCE(!pfn_valid(pfn));
> 	// page must be from a "file"???
> 	WARN_ON_ONCE(!page_mapping(page));
> 	WARN_ON_ONCE(...);
> #endif
> 	return page;
> }
> 
> *EVEN* if the pfn_to_page() itself is unsafe, and even if the WARN()s
> are compiled out, this explicitly lays out the assumptions, and it means
> someone reading the TDX code has an easier time comprehending it.

I object to the existence of those assumptions.  Why the blazes does TDX care
how KVM and guest_memfd manages memory?  If you want to assert that the pfn is
compatible with TDX, then by all means.  But I am NOT accepting any more KVM code
that assumes TDX memory is backed by refcounted struct page.  If I had been paying
more attention when the initial TDX series landed, I would have NAK'd that too.

tdh_mem_page_aug() is just an absurdly slow way of writing a PTE.  It doesn't
_need_ the pfn to be backed by a struct page, at all.  IMO, what you're asking for
is akin to adding a pile of unnecessary assumptions to e.g. __set_spte() and
__kvm_tdp_mmu_write_spte().  No thanks.

> It's also not a crime to do the *same* checking on kvm_pfn_t and not
> have a type transition. I just like the idea of changing the type so
> that the transition line is clear and the concept is carried (forced,
> even) through the layers of helpers.

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [PATCH v3 00/24] KVM: TDX huge page support for private memory
  2026-01-16 17:14                           ` Sean Christopherson
@ 2026-01-16 17:45                             ` Dave Hansen
  2026-01-16 19:59                               ` Sean Christopherson
  0 siblings, 1 reply; 127+ messages in thread
From: Dave Hansen @ 2026-01-16 17:45 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Yan Zhao, Ackerley Tng, Vishal Annapurve, pbonzini, linux-kernel,
	kvm, x86, rick.p.edgecombe, kas, tabba, michael.roth, david,
	sagis, vbabka, thomas.lendacky, nik.borisov, pgonda, fan.du,
	jun.miao, francescolavra.fl, jgross, ira.weiny, isaku.yamahata,
	xiaoyao.li, kai.huang, binbin.wu, chao.p.peng, chao.gao

On 1/16/26 09:14, Sean Christopherson wrote:
...
>> *EVEN* if the pfn_to_page() itself is unsafe, and even if the WARN()s
>> are compiled out, this explicitly lays out the assumptions and it means
>> someone reading TDX code has an easier idea comprehending it.
> I object to the existence of those assumptions.  Why the blazes does TDX care
> how KVM and guest_memfd manages memory? 

For me, it's because TDX can't take arbitrary kvm_pfn_t's for private
memory. It's got to be able to be converted in the hardware, and also
have allocated TDX metadata (PAMT) that the TDX module was handed at
module init time. I thought a kvm_pfn_t might, for instance, be pointing
over to a shared page or MMIO. Those can't be used for TD private memory.

I think it's a pretty useful convention to know that the generic,
flexible kvm_pfn_t has been winnowed down to a more restrictive type
that is what TDX needs.

But, honestly, my big aversion was to u64's everywhere. I can certainly
live with a few kvm_pfn_t's in the TDX code. It doesn't have to be
'struct page'.

> If you want to assert that the pfn is compatible with TDX, then by
> all means.  But I am NOT accepting any more KVM code that assumes
> TDX memory is backed by refcounted struct page.  If I had been
> paying more attention when the initial TDX series landed, I would
> have NAK'd that too.
I'm kinda surprised by that. The only memory we support handing into TDs
for private memory is refcounted struct page. I can imagine us being
able to do this with DAX pages in the near future, but those have
'struct page' too, and I think they're refcounted pretty normally now as
well.

The TDX module initialization is pretty tied to NUMA nodes, too. If it's
in a NUMA node, the TDX module is told about it and it also universally
gets a 'struct page'.

Is there some kind of memory that I'm missing? What else *is* there? :)

> tdh_mem_page_aug() is just an absurdly slow way of writing a PTE.  It doesn't
> _need_ the pfn to be backed a struct page, at all.  IMO, what you're asking for
> is akin to adding a pile of unnecessary assumptions to e.g. __set_spte() and
> __kvm_tdp_mmu_write_spte().  No thanks.

Which part is absurdly slow? page_to_phys()? Isn't that just a shift by
an immediate and a subtraction of an immediate? Yeah, the subtraction
immediate is chonky so the instruction is big.

But at a quick glance I'm not seeing anything absurd.

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [PATCH v3 00/24] KVM: TDX huge page support for private memory
  2026-01-16 17:45                             ` Dave Hansen
@ 2026-01-16 19:59                               ` Sean Christopherson
  2026-01-16 22:25                                 ` Dave Hansen
  0 siblings, 1 reply; 127+ messages in thread
From: Sean Christopherson @ 2026-01-16 19:59 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Yan Zhao, Ackerley Tng, Vishal Annapurve, pbonzini, linux-kernel,
	kvm, x86, rick.p.edgecombe, kas, tabba, michael.roth, david,
	sagis, vbabka, thomas.lendacky, nik.borisov, pgonda, fan.du,
	jun.miao, francescolavra.fl, jgross, ira.weiny, isaku.yamahata,
	xiaoyao.li, kai.huang, binbin.wu, chao.p.peng, chao.gao

On Fri, Jan 16, 2026, Dave Hansen wrote:
> On 1/16/26 09:14, Sean Christopherson wrote:
> > If you want to assert that the pfn is compatible with TDX, then by
> > all means.  But I am NOT accepting any more KVM code that assumes
> > TDX memory is backed by refcounted struct page.  If I had been
> > paying more attention when the initial TDX series landed, I would
> > have NAK'd that too.
> I'm kinda surprised by that. The only memory we support handing into TDs
> for private memory is refcounted struct page. I can imagine us being
> able to do this with DAX pages in the near future, but those have
> 'struct page' too, and I think they're refcounted pretty normally now as
> well.
> 
> The TDX module initialization is pretty tied to NUMA nodes, too. If it's
> in a NUMA node, the TDX module is told about it and it also universally
> gets a 'struct page'.
> 
> Is there some kind of memory that I'm missing? What else *is* there? :)

I don't want to special case TDX on the backend of KVM's MMU.  There's already
waaaay too much code and complexity in KVM that exists purely for S-EPT.  Baking
in assumptions on how _exactly_ KVM is managing guest memory goes too far.

The reason I'm so hostile towards struct page is that, as evidenced by this series
and a ton of historical KVM code, assuming that memory is backed by struct page is
a _very_ slippery slope towards code that is extremely nasty to unwind later on.

E.g. see all of the effort that ended up going into commit ce7b5695397b ("KVM: TDX:
Drop superfluous page pinning in S-EPT management").  And in this series, the
constraints that will be placed on guest_memfd if TDX assumes hugepages will always
be covered in a single folio.  Untangling KVM's historical (non-TDX) messes around
struct page took us something like two years.

And so to avoid introducing similar messes in the future, I don't want KVM's MMU
to make _any_ references to struct page when it comes to mapping memory into the
guest unless it's absolutely necessary, e.g. to put a reference when KVM _knows_
it acquired a refcounted page via gup() (and ideally we'd kill even that, e.g.
by telling gup() not to bump the refcount in the first place).

> > tdh_mem_page_aug() is just an absurdly slow way of writing a PTE.  It doesn't
> > _need_ the pfn to be backed by a struct page, at all.  IMO, what you're asking for
> > is akin to adding a pile of unnecessary assumptions to e.g. __set_spte() and
> > __kvm_tdp_mmu_write_spte().  No thanks.
> 
> Which part is absurdly slow?

The SEAMCALL itself.  I'm saying that TDH_MEM_PAGE_AUG is really just the S-EPT
version of "make this PTE PRESENT", and that piling on sanity checks that aren't
fundamental to TDX shouldn't be done when KVM is writing PTEs.

In other words, something like this is totally fine:

	KVM_MMU_WARN_ON(!tdx_is_convertible_pfn(pfn));

but this is not:

	WARN_ON_ONCE(!page_mapping(pfn_to_page(pfn)));

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [PATCH v3 00/24] KVM: TDX huge page support for private memory
  2026-01-16 19:59                               ` Sean Christopherson
@ 2026-01-16 22:25                                 ` Dave Hansen
  0 siblings, 0 replies; 127+ messages in thread
From: Dave Hansen @ 2026-01-16 22:25 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Yan Zhao, Ackerley Tng, Vishal Annapurve, pbonzini, linux-kernel,
	kvm, x86, rick.p.edgecombe, kas, tabba, michael.roth, david,
	sagis, vbabka, thomas.lendacky, nik.borisov, pgonda, fan.du,
	jun.miao, francescolavra.fl, jgross, ira.weiny, isaku.yamahata,
	xiaoyao.li, kai.huang, binbin.wu, chao.p.peng, chao.gao

On 1/16/26 11:59, Sean Christopherson wrote:
> The SEAMCALL itself.  I'm saying that TDH_MEM_PAGE_AUG is really just the S-EPT
> version of "make this PTE PRESENT", and that piling on sanity checks that aren't
> fundamental to TDX shouldn't be done when KVM is writing PTEs.
> 
> In other words, something like this is totally fine:
> 
> 	KVM_MMU_WARN_ON(!tdx_is_convertible_pfn(pfn));
> 
> but this is not:
> 
> 	WARN_ON_ONCE(!page_mapping(pfn_to_page(pfn)));

OK, I think I've got a better idea what you're aiming for. I think
that's totally doable going forward.

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [PATCH v3 11/24] KVM: x86/mmu: Introduce kvm_split_cross_boundary_leafs()
  2026-01-15 12:25   ` Huang, Kai
@ 2026-01-16 23:39     ` Sean Christopherson
  2026-01-19  1:28       ` Yan Zhao
  2026-01-20 17:57       ` Vishal Annapurve
  0 siblings, 2 replies; 127+ messages in thread
From: Sean Christopherson @ 2026-01-16 23:39 UTC (permalink / raw)
  To: Kai Huang
  Cc: pbonzini@redhat.com, Yan Y Zhao, kvm@vger.kernel.org, Fan Du,
	Xiaoyao Li, Chao Gao, Dave Hansen, thomas.lendacky@amd.com,
	vbabka@suse.cz, tabba@google.com, david@kernel.org,
	kas@kernel.org, michael.roth@amd.com, Ira Weiny,
	linux-kernel@vger.kernel.org, binbin.wu@linux.intel.com,
	ackerleytng@google.com, nik.borisov@suse.com, Isaku Yamahata,
	Chao P Peng, francescolavra.fl@gmail.com, sagis@google.com,
	Vishal Annapurve, Rick P Edgecombe, Jun Miao, jgross@suse.com,
	pgonda@google.com, x86@kernel.org

On Thu, Jan 15, 2026, Kai Huang wrote:
> static int __kvm_tdp_mmu_split_huge_pages(struct kvm *kvm, 
> 					  struct kvm_gfn_range *range,
> 					  int target_level,
> 					  bool shared,
> 					  bool cross_boundary_only)
> {
> 	...
> }
> 
> And by using this helper, I found the name of the two wrapper functions
> are not ideal:
> 
> kvm_tdp_mmu_try_split_huge_pages() is only for log dirty, and it should
> not be reachable for TD (VM with mirrored PT).  But currently it uses
> KVM_VALID_ROOTS for root filter thus mirrored PT is also included.  I
> think it's better to rename it, e.g., at least with "log_dirty" in the
> name so it's more clear this function is only for dealing log dirty (at
> least currently).  We can also add a WARN() if it's called for VM with
> mirrored PT but it's a different topic.
> 
> kvm_tdp_mmu_gfn_range_split_cross_boundary_leafs() doesn't have
> "huge_pages", which isn't consistent with the other.  And it is a bit
> long.  If we don't have "gfn_range" in __kvm_tdp_mmu_split_huge_pages(),
> then I think we can remove "gfn_range" from
> kvm_tdp_mmu_gfn_range_split_cross_boundary_leafs() too to make it shorter.
> 
> So how about:
> 
> Rename kvm_tdp_mmu_try_split_huge_pages() to
> kvm_tdp_mmu_split_huge_pages_log_dirty(), and rename
> kvm_tdp_mmu_gfn_range_split_cross_boundary_leafs() to
> kvm_tdp_mmu_split_huge_pages_cross_boundary()
> 
> ?

I find the "cross_boundary" terminology extremely confusing.  I also dislike
the concept itself, in the sense that it shoves a weird, specific concept into
the guts of the TDP MMU.

The other wart is that it's inefficient when punching a large hole.  E.g. say
there's a 16TiB guest_memfd instance (no idea if that's even possible), and then
userspace punches a 12TiB hole.  Walking all ~12TiB just to _maybe_ split the head
and tail pages is asinine.

And once kvm_arch_pre_set_memory_attributes() is dropped, I'm pretty sure the
_only_ usage is for guest_memfd PUNCH_HOLE, because unless I'm misreading the
code, the usage in tdx_honor_guest_accept_level() is superfluous and confusing.

For the EPT violation case, the guest is accepting a page.  Just split to the
guest's accepted level, I don't see any reason to make things more complicated
than that.

And then for the PUNCH_HOLE case, do the math to determine which, if any, head
and tail pages need to be split, and use the existing APIs to make that happen.
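The head/tail math suggested above could look roughly like the following. This
is a standalone sketch with made-up names (not kernel code), assuming GFNs
counted in 4KiB pages and a maximum huge-leaf size of 1GiB:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: given a PUNCH_HOLE range [start, end) in 4KiB-page
 * GFNs, return the at-most-two sub-ranges that could contain a huge leaf
 * crossing the hole's boundary.  Only these need to be walked for splitting.
 */
#define PAGES_PER_1G	(1ULL << 18)	/* 4KiB pages per 1GiB */

struct gfn_range {
	uint64_t start;
	uint64_t end;
};

static int punch_hole_split_ranges(uint64_t start, uint64_t end,
				   struct gfn_range out[2])
{
	uint64_t head_end = (start + PAGES_PER_1G - 1) & ~(PAGES_PER_1G - 1);
	uint64_t tail_start = end & ~(PAGES_PER_1G - 1);
	int n = 0;

	if (head_end >= tail_start) {
		/* Small hole: head and tail units overlap, walk it whole. */
		out[0].start = start;
		out[0].end = end;
		return 1;
	}
	if (start & (PAGES_PER_1G - 1)) {	/* unaligned head */
		out[n].start = start;
		out[n].end = head_end;
		n++;
	}
	if (end & (PAGES_PER_1G - 1)) {		/* unaligned tail */
		out[n].start = tail_start;
		out[n].end = end;
		n++;
	}
	return n;
}
```

For the 12TiB-hole example this yields at most two 1GiB-bounded ranges to walk
instead of the full ~12TiB, and zero ranges when both edges are 1GiB-aligned.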

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [PATCH v3 00/24] KVM: TDX huge page support for private memory
  2026-01-16 14:46     ` Sean Christopherson
@ 2026-01-19  1:25       ` Yan Zhao
  0 siblings, 0 replies; 127+ messages in thread
From: Yan Zhao @ 2026-01-19  1:25 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: pbonzini, linux-kernel, kvm, x86, rick.p.edgecombe, dave.hansen,
	kas, tabba, ackerleytng, michael.roth, david, vannapurve, sagis,
	vbabka, thomas.lendacky, nik.borisov, pgonda, fan.du, jun.miao,
	francescolavra.fl, jgross, ira.weiny, isaku.yamahata, xiaoyao.li,
	kai.huang, binbin.wu, chao.p.peng, chao.gao

On Fri, Jan 16, 2026 at 06:46:09AM -0800, Sean Christopherson wrote:
> On Fri, Jan 16, 2026, Yan Zhao wrote:
> > On Thu, Jan 15, 2026 at 04:28:12PM -0800, Sean Christopherson wrote:
> > > On Tue, Jan 06, 2026, Yan Zhao wrote:
> > > > This is v3 of the TDX huge page series. The full stack is available at [4].
> > > 
> > > Nope, that's different code.
> > I double-checked. It's the correct code.
> > See https://github.com/intel-staging/tdx/commits/huge_page_v3.
> 
> Argh, and I even double-checked before complaining, but apparently I screwed up
> twice.  On triple-checking, I do see the same code as the patches.  *sigh*
> 
> Sorry.
No problem. Thanks for downloading the code and reviewing it!

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [PATCH v3 11/24] KVM: x86/mmu: Introduce kvm_split_cross_boundary_leafs()
  2026-01-16 23:39     ` Sean Christopherson
@ 2026-01-19  1:28       ` Yan Zhao
  2026-01-19  8:35         ` Huang, Kai
  2026-01-20 17:51         ` Sean Christopherson
  2026-01-20 17:57       ` Vishal Annapurve
  1 sibling, 2 replies; 127+ messages in thread
From: Yan Zhao @ 2026-01-19  1:28 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Kai Huang, pbonzini@redhat.com, kvm@vger.kernel.org, Fan Du,
	Xiaoyao Li, Chao Gao, Dave Hansen, thomas.lendacky@amd.com,
	vbabka@suse.cz, tabba@google.com, david@kernel.org,
	kas@kernel.org, michael.roth@amd.com, Ira Weiny,
	linux-kernel@vger.kernel.org, binbin.wu@linux.intel.com,
	ackerleytng@google.com, nik.borisov@suse.com, Isaku Yamahata,
	Chao P Peng, francescolavra.fl@gmail.com, sagis@google.com,
	Vishal Annapurve, Rick P Edgecombe, Jun Miao, jgross@suse.com,
	pgonda@google.com, x86@kernel.org

On Fri, Jan 16, 2026 at 03:39:13PM -0800, Sean Christopherson wrote:
> On Thu, Jan 15, 2026, Kai Huang wrote:
> > static int __kvm_tdp_mmu_split_huge_pages(struct kvm *kvm, 
> > 					  struct kvm_gfn_range *range,
> > 					  int target_level,
> > 					  bool shared,
> > 					  bool cross_boundary_only)
> > {
> > 	...
> > }
> > 
> > And by using this helper, I found the name of the two wrapper functions
> > are not ideal:
> > 
> > kvm_tdp_mmu_try_split_huge_pages() is only for log dirty, and it should
> > not be reachable for TD (VM with mirrored PT).  But currently it uses
> > KVM_VALID_ROOTS for root filter thus mirrored PT is also included.  I
> > think it's better to rename it, e.g., at least with "log_dirty" in the
> > name so it's more clear this function is only for dealing log dirty (at
> > least currently).  We can also add a WARN() if it's called for VM with
> > mirrored PT but it's a different topic.
> > 
> > kvm_tdp_mmu_gfn_range_split_cross_boundary_leafs() doesn't have
> > "huge_pages", which isn't consistent with the other.  And it is a bit
> > long.  If we don't have "gfn_range" in __kvm_tdp_mmu_split_huge_pages(),
> > then I think we can remove "gfn_range" from
> > kvm_tdp_mmu_gfn_range_split_cross_boundary_leafs() too to make it shorter.
> > 
> > So how about:
> > 
> > Rename kvm_tdp_mmu_try_split_huge_pages() to
> > kvm_tdp_mmu_split_huge_pages_log_dirty(), and rename
> > kvm_tdp_mmu_gfn_range_split_cross_boundary_leafs() to
> > kvm_tdp_mmu_split_huge_pages_cross_boundary()
> > 
> > ?
> 
> I find the "cross_boundary" terminology extremely confusing.  I also dislike
> the concept itself, in the sense that it shoves a weird, specific concept into
> the guts of the TDP MMU.
> The other wart is that it's inefficient when punching a large hole.  E.g. say
> there's a 16TiB guest_memfd instance (no idea if that's even possible), and then
> userspace punches a 12TiB hole.  Walking all ~12TiB just to _maybe_ split the head
> and tail pages is asinine.
That's a reasonable concern. I actually thought about it.
My consideration was as follows:
Currently, we don't have such large areas. Usually, the conversion ranges are
less than 1GB. Though the initial conversion, which converts all memory from
private to shared, may be wide, there are usually no mappings at that stage. So,
the traversal should be very fast (it doesn't even need to go down to the
2MB/1GB level).

If the caller of kvm_split_cross_boundary_leafs() finds it needs to convert a
very large range at runtime, it can optimize by invoking the API twice:
once for range [start, ALIGN(start, 1GB)), and
once for range [ALIGN_DOWN(end, 1GB), end).

I can also implement this optimization within kvm_split_cross_boundary_leafs()
by checking the range size if you think that would be better.

> And once kvm_arch_pre_set_memory_attributes() is dropped, I'm pretty sure the
> _only_ usage is for guest_memfd PUNCH_HOLE, because unless I'm misreading the
> code, the usage in tdx_honor_guest_accept_level() is superfluous and confusing.
Sorry for the confusion about the usage of tdx_honor_guest_accept_level(). I
should add a better comment.

There are 4 use cases for the API kvm_split_cross_boundary_leafs():
1. PUNCH_HOLE
2. KVM_SET_MEMORY_ATTRIBUTES2, which invokes kvm_gmem_set_attributes() for
   private-to-shared conversions
3. tdx_honor_guest_accept_level()
4. kvm_gmem_error_folio()

Use cases 1-3 are already in the current code. Use case 4 is per our discussion,
and will be implemented in the next version (because guest_memfd may split
folios without first splitting S-EPT).

The 4 use cases can be divided into two categories:

1. Category 1: use cases 1, 2, 4
   We must ensure GFN start - 1 and GFN start are not mapped in a single
   mapping. However, for GFN start or GFN start - 1 specifically, we don't care
   about their actual mapping levels, which means they are free to be mapped at
   2MB or 1GB. The same applies to GFN end - 1 and GFN end.

   --|------------------|-----------
     ^                  ^
    start              end - 1 

2. Category 2: use case 3
   It cares about the mapping level of the GFN, i.e., it must not be mapped
   above a certain level.

   -----|-------
        ^
       GFN

   So, to unify the two categories, I have tdx_honor_guest_accept_level() check
   the range [level-aligned GFN, level-aligned GFN + level size). E.g., if the
   accept level is 2MB, only a 1GB mapping can fall outside the range and need
   splitting.

   -----|-------------|---
        ^             ^
        |             |
   level-aligned     level-aligned
      GFN            GFN + level size - 1
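   The level-aligned range above is simple arithmetic; a toy sketch (names
   are hypothetical, levels expressed in 4KiB pages):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Toy sketch: compute the level-aligned range [aligned_gfn,
 * aligned_gfn + level_pages) covering @gfn, where the accept level is
 * expressed in 4KiB pages (512 for 2MB, 512 * 512 for 1GB).
 */
static void accept_level_range(uint64_t gfn, uint64_t level_pages,
			       uint64_t *range_start, uint64_t *range_end)
{
	*range_start = gfn & ~(level_pages - 1);
	*range_end = *range_start + level_pages;
}
```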


> For the EPT violation case, the guest is accepting a page.  Just split to the
> guest's accepted level, I don't see any reason to make things more complicated
> than that.
This use case could reuse the kvm_mmu_try_split_huge_pages() API, except that we
need a return value.

> And then for the PUNCH_HOLE case, do the math to determine which, if any, head
> and tail pages need to be split, and use the existing APIs to make that happen.
This use case cannot reuse kvm_mmu_try_split_huge_pages() without modification.
Or which existing APIs are you referring to?
Is the cross_boundary information still useful?

BTW: Currently, kvm_split_cross_boundary_leafs() internally reuses
tdp_mmu_split_huge_pages_root() (as shown below).

kvm_split_cross_boundary_leafs
  kvm_tdp_mmu_gfn_range_split_cross_boundary_leafs
    tdp_mmu_split_huge_pages_root

However, tdp_mmu_split_huge_pages_root() was originally intended for splitting
huge mappings across a wide range, so it temporarily releases mmu_lock to
allocate memory for the sp, since it can't predict how many pages to
pre-allocate in the KVM MMU cache.

For kvm_split_cross_boundary_leafs(), we can actually predict the max number of
pages to pre-allocate. If we don't reuse tdp_mmu_split_huge_pages_root(), we can
allocate sp, sp->spt, sp->external_spt and DPAMT pages from the KVM mmu cache
without releasing mmu_lock and invoking tdp_mmu_alloc_sp_for_split(). Do you
think this approach is better?

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [PATCH v3 00/24] KVM: TDX huge page support for private memory
  2026-01-16 16:58                             ` Edgecombe, Rick P
@ 2026-01-19  5:53                               ` Yan Zhao
  2026-01-30 15:32                                 ` Sean Christopherson
  0 siblings, 1 reply; 127+ messages in thread
From: Yan Zhao @ 2026-01-19  5:53 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: seanjc@google.com, Du, Fan, Li, Xiaoyao, Huang, Kai,
	kvm@vger.kernel.org, Hansen, Dave, thomas.lendacky@amd.com,
	vbabka@suse.cz, tabba@google.com, david@kernel.org,
	kas@kernel.org, linux-kernel@vger.kernel.org, Weiny, Ira,
	francescolavra.fl@gmail.com, pbonzini@redhat.com,
	ackerleytng@google.com, nik.borisov@suse.com,
	binbin.wu@linux.intel.com, Yamahata, Isaku, Peng, Chao P,
	michael.roth@amd.com, Annapurve, Vishal, sagis@google.com,
	Gao, Chao, Miao, Jun, jgross@suse.com, pgonda@google.com,
	x86@kernel.org

On Sat, Jan 17, 2026 at 12:58:02AM +0800, Edgecombe, Rick P wrote:
> On Fri, 2026-01-16 at 08:31 -0800, Sean Christopherson wrote:
> > > Dave wants safety for the TDX pages getting handed to the module.
> > 
> > Define "safety".  As I stressed earlier, blindly retrieving a
> > "struct page" and dereferencing that pointer is the exact opposite of
> > safe.
> 
> I think we had two problems.
> 
> 1. Passing in raw PA's via u64 led to buggy code. IIRC we had a bug
> with this that was caught before it went upstream. So a page needs a
> real type of some sort.
> 
> 2. Work was done on the tip side to prevent non-TDX capable memory from
> entering the page allocator. With that in place, by requiring struct
> page, TDX code can know that it is getting the type of memory it worked
> hard to guarantee was good.
> 
> You are saying that shifting a PFN to a struct page blindly doesn't
> actually guarantee that it meets those requirements. Makes sense.
> 
> For (1) we can just use any old type I think - pfn_t, etc. As we
> discussed in the base series.
> 
> For (2) we need to check that the memory came from the page allocator,
> or otherwise is valid TDX memory somehow. That is at least the only
> check that makes sense to me.
> 
> There was some discussion about refcounts somewhere in this thread. I
> don't think it's arch/x86's worry. Then Yan was saying something last
> night that I didn't quite follow. We said, let's just resume the
> discussion on the list. So she might suggest another check.
Hmm, I previously had a concern about passing "struct page *" as the SEAMCALL
wrapper parameter. For example, when we do sanity checks for valid TDX memory in
tdh_mem_page_aug(), we need to do the sanity check on every page, right?
However, with base_page + npages, it's not easy to get the ith page's pointer
without first ensuring the pages are contained in a single folio. It would also
be superfluous to first get base_pfn from base_page and then derive the ith
page from base_pfn + i.

IIUC, this concern should be gone now that Dave has agreed to use "pfn" as the
SEAMCALL parameter [1]?
Then should we invoke "KVM_MMU_WARN_ON(!tdx_is_convertible_pfn(pfn));" in KVM
for every pfn of a huge mapping? Or should we keep the sanity check inside the
SEAMCALL wrappers?
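Wherever the check ends up, iterating with pfn arithmetic is straightforward; a
minimal sketch (the loop shape and injected predicate are assumptions of mine,
tdx_is_convertible_pfn() being the name from the discussion):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Sketch of a per-pfn sanity check over a huge mapping using plain pfn
 * arithmetic, avoiding struct page pointer math across folio boundaries.
 * The predicate is injected so this toy compiles standalone; in the
 * discussion it would be tdx_is_convertible_pfn().
 */
static bool range_is_convertible(uint64_t base_pfn, unsigned long npages,
				 bool (*is_convertible)(uint64_t))
{
	for (unsigned long i = 0; i < npages; i++) {
		if (!is_convertible(base_pfn + i))
			return false;
	}
	return true;
}

/* Stand-in predicate for testing: pretend pfns below 1000 are convertible. */
static bool pfn_below_1000(uint64_t pfn)
{
	return pfn < 1000;
}
```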

BTW, I have another question about the SEAMCALL wrapper implementation, as Kai
also pointed out in [2]: since the SEAMCALL wrappers now serve as APIs available
to callers besides KVM, should the SEAMCALL wrappers return TDX_OPERAND_INVALID
or WARN_ON() (or WARN_ON_ONCE()) on sanity check failure?

By returning TDX_OPERAND_INVALID, the caller can check the return code, adjust
the input or trigger WARN_ON() by itself;
By triggering WARN_ON() directly in the SEAMCALL wrapper, we need to document
this requirement for the SEAMCALL wrappers and have the caller invoke the API
correctly.

So, it looks like "WARN_ON() directly in the SEAMCALL wrapper" is the preferred
approach, right?
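For illustration, the two styles differ roughly as below. These are toy
stand-ins of mine, not the real wrappers; the constants and names are
illustrative only:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy stand-ins to contrast the two error-handling styles; the constants
 * and helpers here are illustrative, not the real TDX definitions. */
#define TOY_TDX_SUCCESS		0ULL
#define TOY_TDX_OPERAND_INVALID	0x8000000000000000ULL

static bool demote_supported;	/* stand-in for the feature check */
static int warn_hits;		/* stand-in for a WARN_ON_ONCE() firing */

/* Style 1: the wrapper reports misuse via an error code the caller checks. */
static uint64_t demote_style_error(void)
{
	if (!demote_supported)
		return TOY_TDX_OPERAND_INVALID;
	return TOY_TDX_SUCCESS;
}

/* Style 2: calling the wrapper without support is treated as a kernel bug;
 * warn and still fail, so callers must gate on the feature check themselves. */
static uint64_t demote_style_warn(void)
{
	if (!demote_supported) {
		warn_hits++;
		return TOY_TDX_OPERAND_INVALID;
	}
	return TOY_TDX_SUCCESS;
}
```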

[1] https://lore.kernel.org/all/d119c824-4770-41d2-a926-4ab5268ea3a6@intel.com/
[2] https://lore.kernel.org/all/baf6df2cc63d8e897455168c1bf07180fc9c1db8.camel@intel.com


^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [PATCH v3 02/24] x86/virt/tdx: Add SEAMCALL wrapper tdh_mem_page_demote()
  2026-01-16 11:22   ` Huang, Kai
@ 2026-01-19  5:55     ` Yan Zhao
  0 siblings, 0 replies; 127+ messages in thread
From: Yan Zhao @ 2026-01-19  5:55 UTC (permalink / raw)
  To: Huang, Kai
  Cc: pbonzini@redhat.com, seanjc@google.com, kvm@vger.kernel.org,
	Du, Fan, Li, Xiaoyao, Gao, Chao, Hansen, Dave,
	thomas.lendacky@amd.com, vbabka@suse.cz, tabba@google.com,
	david@kernel.org, kas@kernel.org, michael.roth@amd.com,
	Weiny, Ira, linux-kernel@vger.kernel.org,
	binbin.wu@linux.intel.com, ackerleytng@google.com,
	nik.borisov@suse.com, Yamahata, Isaku, Peng, Chao P,
	francescolavra.fl@gmail.com, sagis@google.com, Annapurve, Vishal,
	Edgecombe, Rick P, Miao, Jun, jgross@suse.com, pgonda@google.com,
	x86@kernel.org

On Fri, Jan 16, 2026 at 07:22:20PM +0800, Huang, Kai wrote:
> On Tue, 2026-01-06 at 18:18 +0800, Yan Zhao wrote:
> >  /* Bit definitions of TDX_FEATURES0 metadata field */
> >  #define TDX_FEATURES0_NO_RBP_MOD		BIT_ULL(18)
> >  #define TDX_FEATURES0_DYNAMIC_PAMT		BIT_ULL(36)
> > +#define TDX_FEATURES0_ENHANCE_DEMOTE_INTERRUPTIBILITY	BIT_ULL(51)
> 
> Nit: the spec uses "ENHANCED" but not "ENHANCE", so perhaps change to
> TDX_FEATURES0_ENHANCED_DEMOTE_INTERRUPTIBILITY ?
Good catch. Will update the name. Thanks!

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [PATCH v3 02/24] x86/virt/tdx: Add SEAMCALL wrapper tdh_mem_page_demote()
  2026-01-16 11:10       ` Huang, Kai
  2026-01-16 11:22         ` Huang, Kai
@ 2026-01-19  6:15         ` Yan Zhao
  1 sibling, 0 replies; 127+ messages in thread
From: Yan Zhao @ 2026-01-19  6:15 UTC (permalink / raw)
  To: Huang, Kai
  Cc: Du, Fan, kvm@vger.kernel.org, Li, Xiaoyao, Hansen, Dave,
	thomas.lendacky@amd.com, tabba@google.com, vbabka@suse.cz,
	david@kernel.org, michael.roth@amd.com,
	linux-kernel@vger.kernel.org, seanjc@google.com,
	pbonzini@redhat.com, binbin.wu@linux.intel.com, Weiny, Ira,
	kas@kernel.org, nik.borisov@suse.com, ackerleytng@google.com,
	Peng, Chao P, francescolavra.fl@gmail.com, Yamahata, Isaku,
	sagis@google.com, Gao, Chao, Edgecombe, Rick P, Miao, Jun,
	Annapurve, Vishal, jgross@suse.com, pgonda@google.com,
	x86@kernel.org

On Fri, Jan 16, 2026 at 07:10:05PM +0800, Huang, Kai wrote:
> On Fri, 2026-01-16 at 16:35 +0800, Yan Zhao wrote:
> > Hi Kai,
> > Thanks for reviewing!
> > 
> > On Fri, Jan 16, 2026 at 09:00:29AM +0800, Huang, Kai wrote:
> > > 
> > > > 
> > > > Enable tdh_mem_page_demote() only on TDX modules that support feature
> > > > TDX_FEATURES0.ENHANCE_DEMOTE_INTERRUPTIBILITY, which does not return error
> > > > TDX_INTERRUPTED_RESTARTABLE on basic TDX (i.e., without TD partition) [2].
> > > > 
> > > > This is because error TDX_INTERRUPTED_RESTARTABLE is difficult to handle.
> > > > The TDX module provides no guaranteed maximum retry count to ensure forward
> > > > progress of the demotion. Interrupt storms could then result in a DoS if
> > > > host simply retries endlessly for TDX_INTERRUPTED_RESTARTABLE. Disabling
> > > > interrupts before invoking the SEAMCALL also doesn't work because NMIs can
> > > > also trigger TDX_INTERRUPTED_RESTARTABLE. Therefore, the tradeoff for basic
> > > > TDX is to disable the TDX_INTERRUPTED_RESTARTABLE error given the
> > > > reasonable execution time for demotion. [1]
> > > > 
> > > 
> > > [...]
> > > 
> > > > v3:
> > > > - Use a var name that clearly tell that the page is used as a page table
> > > >   page. (Binbin).
> > > > - Check if TDX module supports feature ENHANCE_DEMOTE_INTERRUPTIBILITY.
> > > >   (Kai).
> > > > 
> > > [...]
> > > 
> > > > +u64 tdh_mem_page_demote(struct tdx_td *td, u64 gpa, int level, struct page *new_sept_page,
> > > > +			u64 *ext_err1, u64 *ext_err2)
> > > > +{
> > > > +	struct tdx_module_args args = {
> > > > +		.rcx = gpa | level,
> > > > +		.rdx = tdx_tdr_pa(td),
> > > > +		.r8 = page_to_phys(new_sept_page),
> > > > +	};
> > > > +	u64 ret;
> > > > +
> > > > +	if (!tdx_supports_demote_nointerrupt(&tdx_sysinfo))
> > > > +		return TDX_SW_ERROR;
> > > > 
> > > 
> > > For the record, while I replied my suggestion [*] to this patch in v2, it
> > > was basically because the discussion was already in that patch -- I didn't
> > > mean to do this check inside tdh_mem_page_demote(), but do this check in
> > > KVM page fault patch and return 4K as maximum mapping level.
> > > 
> > > The precise words were:
> > > 
> > >   So if the decision is to not use 2M page when TDH_MEM_PAGE_DEMOTE can 
> > >   return TDX_INTERRUPTED_RESTARTABLE, maybe we can just check this 
> > >   enumeration in fault handler and always make mapping level as 4K?
> > Right. I followed it in the last patch (patch 24).
> > 
> > > Looking at this series, this is eventually done in your last patch.  But I
> > > don't quite understand what's the additional value of doing such check and
> > > return TDX_SW_ERROR in this SEAMCALL wrapper.
> > > 
> > > Currently in this series, it doesn't matter whether this wrapper returns
> > > TDX_SW_ERROR or the real TDX_INTERRUPTED_RESTARTABLE -- KVM terminates the
> > > TD anyway (see your patch 8) because this is unexpected as checked in your
> > > last patch.
> > > 
> > > IMHO we should get rid of this check in this low level wrapper.
> > You are right, the wrapper shouldn't hit this error after the last patch.
> > 
> > However, I found it's better to introduce the feature bit
> > TDX_FEATURES0_ENHANCE_DEMOTE_INTERRUPTIBILITY and the helper
> > tdx_supports_demote_nointerrupt() together with the demote SEAMCALL wrapper.
> > This way, people can understand how the TDX_INTERRUPTED_RESTARTABLE error is
> > handled for this SEAMCALL. 
> > 
> 
> So the "handling" here is basically making DEMOTE SEAMCALL unavailable
> when DEMOTE is interruptible at low SEAMCALL wrapper level.
> 
> I guess you can argue this has some value since it tells users "don't even
> try to call me when I am interruptible because I am not available".  
Right. The caller can understand the API usage by examining the code
implementation.

> However, IMHO this also implies the benefit is mostly for the case where
> the user wants to use this wrapper to tell whether DEMOTE is available. 
> E.g.,
> 
> 	err = tdh_mem_page_demote(...);
> 	if (err == TDX_SW_ERROR)
> 		enable_tdx_hugepage = false;
This use case is not valid.
When the caller invokes tdh_mem_page_demote(), it means huge pages have already
been enabled, so turning huge pages off on error from splitting huge pages is
self-contradictory.

> But in this series you are using tdx_supports_demote_nointerrupt() for
> this purpose, which is better IMHO.
> 
> So maybe there's a *theoretical* value to have the check here, but I don't
> see any *real* value.
> 
> But I don't have strong opinion either -- I guess I just don't like making
> these low level SEAMCALL wrappers more complicated than what the SEAMCALL
> does -- and it's up to you to decide. :-)
Thanks. I added the check in the SEAMCALL wrapper for two reasons:
- Let callers know the conditions under which the wrapper is expected to work.
  That way, the caller (e.g., KVM) can turn off huge pages upon detecting an
  incompatible TDX module, and forgetting to turn off huge pages would yield at
  least a WARNING.

- Give tdx_supports_demote_nointerrupt() a user in this patch which introduces
  the helper.

So, I'll keep the check unless someone has a strong opinion :)

> > What do you think about changing it to a WARN_ON_ONCE()? i.e.,
> > WARN_ON_ONCE(!tdx_supports_demote_nointerrupt(&tdx_sysinfo));
> 
> What's your intention?
Hmm, either TDX_SW_ERROR or WARN_ON_ONCE() is fine with me.

I've asked about it in [1]. Let's wait for the maintainers' reply.

[1] https://lore.kernel.org/all/aW3G6yZuvclYABzP@yzhao56-desk.sh.intel.com

> W/o the WARN(), the caller _can_ call this wrapper (i.e., not a kernel
> bug) but it always get a SW-defined error.  Again, maybe it has value for
> the case where the caller wants to use this to tell whether DEMOTE is
> available.
> 
> With the WARN(), it's a kernel bug to call the wrapper, and the caller
> needs to use other way (i.e., tdx_supports_demote_nointerrupt()) to tell
> whether DEMOTE is available.
> 
> So if you want the check, probably WARN() is a better idea since I suppose
> we always want users to use tdx_supports_demote_nointerrupt() to know
> whether DEMOTE can be done, and the WARN() is just to catch bug.
Agreed.

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [PATCH v3 02/24] x86/virt/tdx: Add SEAMCALL wrapper tdh_mem_page_demote()
  2026-01-16 11:22         ` Huang, Kai
@ 2026-01-19  6:18           ` Yan Zhao
  0 siblings, 0 replies; 127+ messages in thread
From: Yan Zhao @ 2026-01-19  6:18 UTC (permalink / raw)
  To: Huang, Kai
  Cc: Du, Fan, kvm@vger.kernel.org, Li, Xiaoyao, Hansen, Dave,
	thomas.lendacky@amd.com, tabba@google.com, vbabka@suse.cz,
	david@kernel.org, kas@kernel.org, linux-kernel@vger.kernel.org,
	seanjc@google.com, pbonzini@redhat.com, binbin.wu@linux.intel.com,
	Weiny, Ira, nik.borisov@suse.com, Annapurve, Vishal,
	ackerleytng@google.com, Peng, Chao P, michael.roth@amd.com,
	Yamahata, Isaku, sagis@google.com, Gao, Chao,
	francescolavra.fl@gmail.com, Miao, Jun, Edgecombe, Rick P,
	jgross@suse.com, pgonda@google.com, x86@kernel.org

On Fri, Jan 16, 2026 at 07:22:33PM +0800, Huang, Kai wrote:
> On Fri, 2026-01-16 at 11:10 +0000, Huang, Kai wrote:
> > W/o the WARN(), the caller _can_ call this wrapper (i.e., not a kernel
> > bug) but it always get a SW-defined error.  Again, maybe it has value for
> > the case where the caller wants to use this to tell whether DEMOTE is
> > available.
> > 
> > With the WARN(), it's a kernel bug to call the wrapper, and the caller
> > needs to use other way (i.e., tdx_supports_demote_nointerrupt()) to tell
> > whether DEMOTE is available.
> > 
> > So if you want the check, probably WARN() is a better idea since I suppose
> > we always want users to use tdx_supports_demote_nointerrupt() to know
> > whether DEMOTE can be done, and the WARN() is just to catch bug.
> 
> Forgot to say, the name tdx_supports_demote_nointerrupt() somehow only
> tells the TDX module *supports* non-interruptible DEMOTE, it doesn't tell
> whether TDX module has *enabled* that.
> 
> So while we know for this DEMOTE case, there's no need to *enable* this
> feature (i.e., DEMOTE is non-interruptible when this feature is reported
> as *supported*), from kernel's point of view, is it better to just use a
> clearer name?
> 
> E.g., tdx_huge_page_demote_uninterruptible()?
> 
> A bonus is the name contains "huge_page" so it's super clear what's the
> demote about.
LGTM. Thanks!

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [PATCH v3 11/24] KVM: x86/mmu: Introduce kvm_split_cross_boundary_leafs()
  2026-01-19  1:28       ` Yan Zhao
@ 2026-01-19  8:35         ` Huang, Kai
  2026-01-19  8:49           ` Huang, Kai
  2026-01-20 17:51         ` Sean Christopherson
  1 sibling, 1 reply; 127+ messages in thread
From: Huang, Kai @ 2026-01-19  8:35 UTC (permalink / raw)
  To: seanjc@google.com, Zhao, Yan Y
  Cc: Du, Fan, kvm@vger.kernel.org, Li, Xiaoyao, Hansen, Dave,
	thomas.lendacky@amd.com, tabba@google.com, vbabka@suse.cz,
	david@kernel.org, michael.roth@amd.com,
	linux-kernel@vger.kernel.org, binbin.wu@linux.intel.com,
	pbonzini@redhat.com, Peng, Chao P, ackerleytng@google.com,
	kas@kernel.org, nik.borisov@suse.com, Weiny, Ira,
	francescolavra.fl@gmail.com, Yamahata, Isaku, sagis@google.com,
	Gao, Chao, Edgecombe, Rick P, Miao, Jun, Annapurve, Vishal,
	jgross@suse.com, pgonda@google.com, x86@kernel.org

On Mon, 2026-01-19 at 09:28 +0800, Zhao, Yan Y wrote:
> > I find the "cross_boundary" terminology extremely confusing.  I also dislike
> > the concept itself, in the sense that it shoves a weird, specific concept into
> > the guts of the TDP MMU.
> > The other wart is that it's inefficient when punching a large hole.  E.g. say
> > there's a 16TiB guest_memfd instance (no idea if that's even possible), and then
> > userspace punches a 12TiB hole.  Walking all ~12TiB just to _maybe_ split the head
> > and tail pages is asinine.
> That's a reasonable concern. I actually thought about it.
> My consideration was as follows:
> Currently, we don't have such large areas. Usually, the conversion ranges are
> less than 1GB. Though the initial conversion which converts all memory from
> private to shared may be wide, there are usually no mappings at that stage. So,
> the traversal should be very fast (since the traversal doesn't even need to go
> down to the 2MB/1GB level).
> 
> If the caller of kvm_split_cross_boundary_leafs() finds it needs to convert a
> very large range at runtime, it can optimize by invoking the API twice:
> once for range [start, ALIGN(start, 1GB)), and
> once for range [ALIGN_DOWN(end, 1GB), end).
> 
> I can also implement this optimization within kvm_split_cross_boundary_leafs()
> by checking the range size if you think that would be better.

I am not sure why we even need kvm_split_cross_boundary_leafs() if you
want to do that optimization.

I think I raised this in v2 and asked why not just let the caller figure
out the ranges to split for a given range (see the end of [*]), because
the "cross boundary" case can only happen at the beginning and end of the
given range, if at all.

[*]:
https://lore.kernel.org/all/35fd7d70475d5743a3c45bc5b8118403036e439b.camel@intel.com/

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [PATCH v3 11/24] KVM: x86/mmu: Introduce kvm_split_cross_boundary_leafs()
  2026-01-19  8:35         ` Huang, Kai
@ 2026-01-19  8:49           ` Huang, Kai
  2026-01-19 10:11             ` Yan Zhao
  0 siblings, 1 reply; 127+ messages in thread
From: Huang, Kai @ 2026-01-19  8:49 UTC (permalink / raw)
  To: seanjc@google.com, Zhao, Yan Y
  Cc: Du, Fan, kvm@vger.kernel.org, Li, Xiaoyao, Hansen, Dave,
	thomas.lendacky@amd.com, tabba@google.com, vbabka@suse.cz,
	david@kernel.org, michael.roth@amd.com,
	linux-kernel@vger.kernel.org, binbin.wu@linux.intel.com,
	Peng, Chao P, pbonzini@redhat.com, ackerleytng@google.com,
	kas@kernel.org, nik.borisov@suse.com, Weiny, Ira,
	francescolavra.fl@gmail.com, Yamahata, Isaku, sagis@google.com,
	Gao, Chao, Edgecombe, Rick P, Miao, Jun, Annapurve, Vishal,
	jgross@suse.com, pgonda@google.com, x86@kernel.org

On Mon, 2026-01-19 at 08:35 +0000, Huang, Kai wrote:
> On Mon, 2026-01-19 at 09:28 +0800, Zhao, Yan Y wrote:
> > > I find the "cross_boundary" terminology extremely confusing.  I also dislike
> > > the concept itself, in the sense that it shoves a weird, specific concept into
> > > the guts of the TDP MMU.
> > > The other wart is that it's inefficient when punching a large hole.  E.g. say
> > > there's a 16TiB guest_memfd instance (no idea if that's even possible), and then
> > > userspace punches a 12TiB hole.  Walking all ~12TiB just to _maybe_ split the head
> > > and tail pages is asinine.
> > That's a reasonable concern. I actually thought about it.
> > My consideration was as follows:
> > Currently, we don't have such large areas. Usually, the conversion ranges are
> > less than 1GB. Though the initial conversion which converts all memory from
> > private to shared may be wide, there are usually no mappings at that stage. So,
> > the traversal should be very fast (since the traversal doesn't even need to go
> > down to the 2MB/1GB level).
> > 
> > If the caller of kvm_split_cross_boundary_leafs() finds it needs to convert a
> > very large range at runtime, it can optimize by invoking the API twice:
> > once for range [start, ALIGN(start, 1GB)), and
> > once for range [ALIGN_DOWN(end, 1GB), end).
> > 
> > I can also implement this optimization within kvm_split_cross_boundary_leafs()
> > by checking the range size if you think that would be better.
> 
> I am not sure why do we even need kvm_split_cross_boundary_leafs(), if you
> want to do optimization.
> 
> I think I've raised this in v2, and asked why not just letting the caller
> to figure out the ranges to split for a given range (see at the end of
> [*]), because the "cross boundary" can only happen at the beginning and
> end of the given range, if possible.
> 
> [*]:
> https://lore.kernel.org/all/35fd7d70475d5743a3c45bc5b8118403036e439b.camel@intel.com/

Hmm.. thinking again, if you have multiple places needing to do this, then
kvm_split_cross_boundary_leafs() may serve as a helper to calculate the
ranges to split.

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [PATCH v3 11/24] KVM: x86/mmu: Introduce kvm_split_cross_boundary_leafs()
  2026-01-19  8:49           ` Huang, Kai
@ 2026-01-19 10:11             ` Yan Zhao
  2026-01-19 10:40               ` Huang, Kai
  0 siblings, 1 reply; 127+ messages in thread
From: Yan Zhao @ 2026-01-19 10:11 UTC (permalink / raw)
  To: Huang, Kai
  Cc: seanjc@google.com, Du, Fan, kvm@vger.kernel.org, Li, Xiaoyao,
	Hansen, Dave, thomas.lendacky@amd.com, tabba@google.com,
	vbabka@suse.cz, david@kernel.org, michael.roth@amd.com,
	linux-kernel@vger.kernel.org, binbin.wu@linux.intel.com,
	Peng, Chao P, pbonzini@redhat.com, ackerleytng@google.com,
	kas@kernel.org, nik.borisov@suse.com, Weiny, Ira,
	francescolavra.fl@gmail.com, Yamahata, Isaku, sagis@google.com,
	Gao, Chao, Edgecombe, Rick P, Miao, Jun, Annapurve, Vishal,
	jgross@suse.com, pgonda@google.com, x86@kernel.org

On Mon, Jan 19, 2026 at 04:49:58PM +0800, Huang, Kai wrote:
> On Mon, 2026-01-19 at 08:35 +0000, Huang, Kai wrote:
> > On Mon, 2026-01-19 at 09:28 +0800, Zhao, Yan Y wrote:
> > > > I find the "cross_boundary" terminology extremely confusing.  I also dislike
> > > > the concept itself, in the sense that it shoves a weird, specific concept into
> > > > the guts of the TDP MMU.
> > > > The other wart is that it's inefficient when punching a large hole.  E.g. say
> > > > there's a 16TiB guest_memfd instance (no idea if that's even possible), and then
> > > > userspace punches a 12TiB hole.  Walking all ~12TiB just to _maybe_ split the head
> > > > and tail pages is asinine.
> > > That's a reasonable concern. I actually thought about it.
> > > My consideration was as follows:
> > > Currently, we don't have such large areas. Usually, the conversion ranges are
> > > less than 1GB. Though the initial conversion which converts all memory from
> > > private to shared may be wide, there are usually no mappings at that stage. So,
> > > the traversal should be very fast (since the traversal doesn't even need to go
> > > down to the 2MB/1GB level).
> > > 
> > > If the caller of kvm_split_cross_boundary_leafs() finds it needs to convert a
> > > very large range at runtime, it can optimize by invoking the API twice:
> > > once for range [start, ALIGN(start, 1GB)), and
> > > once for range [ALIGN_DOWN(end, 1GB), end).
> > > 
> > > I can also implement this optimization within kvm_split_cross_boundary_leafs()
> > > by checking the range size if you think that would be better.
> > 
> > I am not sure why we even need kvm_split_cross_boundary_leafs(), if you
> > want to do the optimization.
> > 
> > I think I raised this in v2, and asked why not just let the caller
> > figure out the ranges to split for a given range (see the end of
> > [*]), because the "cross boundary" can only happen at the beginning and
> > end of the given range, if at all.
Hmm, the caller can only figure out when splitting is NOT necessary, e.g., if
start is 1GB-aligned, then there's no need to split for start. However, if start
is not 1GB/2MB-aligned, the caller has no idea if there's a 2MB mapping covering
start - 1 and start.
(for non-TDX cases, if start is not 1GB-aligned and is just 2MB-aligned,
invoking tdp_mmu_split_huge_pages_root() is still necessary because there may
exist a 1GB mapping covering start - 1 and start).

In my reply to [*], I didn't want to do the calculation because I didn't see
much overhead from always invoking tdp_mmu_split_huge_pages_root().
But the scenario Sean pointed out is different. When both start and end are not
2MB-aligned, if [start, end) covers a huge range, we can still pre-calculate to
reduce the iterations in tdp_mmu_split_huge_pages_root().

Opportunistically, optimization to skip splits for 1GB-aligned start or end is
possible :)

> > [*]:
> > https://lore.kernel.org/all/35fd7d70475d5743a3c45bc5b8118403036e439b.camel@intel.com/
> 
> Hmm.. thinking again, if you have multiple places needing to do this, then
> kvm_split_cross_boundary_leafs() may serve as a helper to calculate the
> ranges to split.
Yes.

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [PATCH v3 11/24] KVM: x86/mmu: Introduce kvm_split_cross_boundary_leafs()
  2026-01-19 10:11             ` Yan Zhao
@ 2026-01-19 10:40               ` Huang, Kai
  2026-01-19 11:06                 ` Yan Zhao
  0 siblings, 1 reply; 127+ messages in thread
From: Huang, Kai @ 2026-01-19 10:40 UTC (permalink / raw)
  To: Zhao, Yan Y
  Cc: Du, Fan, kvm@vger.kernel.org, Li, Xiaoyao, Hansen, Dave,
	thomas.lendacky@amd.com, tabba@google.com, vbabka@suse.cz,
	david@kernel.org, michael.roth@amd.com,
	linux-kernel@vger.kernel.org, seanjc@google.com, Peng, Chao P,
	pbonzini@redhat.com, ackerleytng@google.com, kas@kernel.org,
	binbin.wu@linux.intel.com, Weiny, Ira, nik.borisov@suse.com,
	francescolavra.fl@gmail.com, Yamahata, Isaku, sagis@google.com,
	Gao, Chao, Edgecombe, Rick P, Miao, Jun, Annapurve, Vishal,
	jgross@suse.com, pgonda@google.com, x86@kernel.org

On Mon, 2026-01-19 at 18:11 +0800, Yan Zhao wrote:
> On Mon, Jan 19, 2026 at 04:49:58PM +0800, Huang, Kai wrote:
> > On Mon, 2026-01-19 at 08:35 +0000, Huang, Kai wrote:
> > > On Mon, 2026-01-19 at 09:28 +0800, Zhao, Yan Y wrote:
> > > > > I find the "cross_boundary" terminology extremely confusing.  I also dislike
> > > > > the concept itself, in the sense that it shoves a weird, specific concept into
> > > > > the guts of the TDP MMU.
> > > > > The other wart is that it's inefficient when punching a large hole.  E.g. say
> > > > > there's a 16TiB guest_memfd instance (no idea if that's even possible), and then
> > > > > userpace punches a 12TiB hole.  Walking all ~12TiB just to _maybe_ split the head
> > > > > and tail pages is asinine.
> > > > That's a reasonable concern. I actually thought about it.
> > > > My consideration was as follows:
> > > > Currently, we don't have such large areas. Usually, the conversion ranges are
> > > > less than 1GB. Though the initial conversion which converts all memory from
> > > > private to shared may be wide, there are usually no mappings at that stage. So,
> > > > the traversal should be very fast (since the traversal doesn't even need to go
> > > > down to the 2MB/1GB level).
> > > > 
> > > > If the caller of kvm_split_cross_boundary_leafs() finds it needs to convert a
> > > > very large range at runtime, it can optimize by invoking the API twice:
> > > > once for range [start, ALIGN(start, 1GB)), and
> > > > once for range [ALIGN_DOWN(end, 1GB), end).
> > > > 
> > > > I can also implement this optimization within kvm_split_cross_boundary_leafs()
> > > > by checking the range size if you think that would be better.
> > > 
> > > I am not sure why we even need kvm_split_cross_boundary_leafs(), if you
> > > want to do the optimization.
> > > 
> > > I think I raised this in v2, and asked why not just let the caller
> > > figure out the ranges to split for a given range (see the end of
> > > [*]), because the "cross boundary" can only happen at the beginning and
> > > end of the given range, if at all.
> Hmm, the caller can only figure out when splitting is NOT necessary, e.g., if
> start is 1GB-aligned, then there's no need to split for start. However, if start
> is not 1GB/2MB-aligned, the caller has no idea if there's a 2MB mapping covering
> start - 1 and start.

Why does the caller need to know?

Let's only talk about 'start' for simplicity:

- If start is 1G aligned, then no split is needed.

- If start is not 1G-aligned but 2M-aligned, you split the range:

   [ALIGN_DOWN(start, 1G), ALIGN(start, 1G)) to 2M level.

- If start is 4K-aligned only, you firstly split

   [ALIGN_DOWN(start, 1G), ALIGN(start, 1G))

  to 2M level, then you split

   [ALIGN_DOWN(start, 2M), ALIGN(start, 2M))

  to 4K level.

Similar handling applies to 'end'.  One additional thing: if a to-be-split
range calculated from 'start' overlaps one calculated from 'end', the
split is only needed once.

Wouldn't this work?
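The three cases above can be sketched as a standalone calculation (an illustration only, not kernel code; `split_range()` stands in for a call like tdp_mmu_split_huge_pages_root(), and the recording array exists only so the logic can be exercised):

```c
#include <stdint.h>

#define SZ_4K (1ULL << 12)
#define SZ_2M (1ULL << 21)
#define SZ_1G (1ULL << 30)
#define ALIGN_DOWN(x, a) ((x) & ~((a) - 1))
#define ALIGN_UP(x, a)   ALIGN_DOWN((x) + (a) - 1, (a))
#define IS_ALIGNED(x, a) (((x) & ((a) - 1)) == 0)

struct split { uint64_t start, end, target; };

/* Record of the split calls a real implementation would issue. */
static struct split splits[4];
static int nr_splits;

static void split_range(uint64_t start, uint64_t end, uint64_t target)
{
	splits[nr_splits++] = (struct split){ start, end, target };
}

/*
 * Emit the (at most two) split ranges so that no leaf mapping spans the
 * boundary at 'addr'.  Leaves are naturally aligned, so a boundary that
 * is aligned at a given level can never be crossed by a leaf of that
 * level or below.
 */
static void split_boundary(uint64_t addr)
{
	if (IS_ALIGNED(addr, SZ_1G))
		return;

	/* A 1GB leaf may cover addr: split its 1GB range down to 2MB. */
	split_range(ALIGN_DOWN(addr, SZ_1G), ALIGN_UP(addr, SZ_1G), SZ_2M);

	if (IS_ALIGNED(addr, SZ_2M))
		return;

	/* A 2MB leaf may still cover addr: split its 2MB range down to 4KB. */
	split_range(ALIGN_DOWN(addr, SZ_2M), ALIGN_UP(addr, SZ_2M), SZ_4K);
}
```

Running split_boundary() on 'start' and 'end' (deduplicating overlapping ranges) yields exactly the head/tail splits described above.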

> (for non-TDX cases, if start is not 1GB-aligned and is just 2MB-aligned,
> invoking tdp_mmu_split_huge_pages_root() is still necessary because there may
> > exist a 1GB mapping covering start - 1 and start).
> 
> In my reply to [*], I didn't want to do the calculation because I didn't see
> much overhead from always invoking tdp_mmu_split_huge_pages_root().
> But the scenario Sean pointed out is different. When both start and end are not
> 2MB-aligned, if [start, end) covers a huge range, we can still pre-calculate to
> reduce the iterations in tdp_mmu_split_huge_pages_root().

I don't see much difference.  Maybe I am missing something.

> 
> Opportunistically, optimization to skip splits for 1GB-aligned start or end is
> possible :)

If this makes the code easier to review/maintain, then sure.

As long as the solution is easy to review (i.e., not too complicated to
understand/maintain), I am fine with whatever Sean/you prefer.

However, the 'cross_boundary_only' thing was indeed a bit odd to me when I
first saw it :-)

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [PATCH v3 22/24] x86/tdx: Add/Remove DPAMT pages for guest private memory to demote
  2026-01-06 10:24 ` [PATCH v3 22/24] x86/tdx: Add/Remove DPAMT pages for guest private memory to demote Yan Zhao
@ 2026-01-19 10:52   ` Huang, Kai
  2026-01-19 11:11     ` Yan Zhao
  0 siblings, 1 reply; 127+ messages in thread
From: Huang, Kai @ 2026-01-19 10:52 UTC (permalink / raw)
  To: pbonzini@redhat.com, seanjc@google.com, Zhao, Yan Y
  Cc: kvm@vger.kernel.org, Du, Fan, Li, Xiaoyao, Gao, Chao,
	Hansen, Dave, thomas.lendacky@amd.com, vbabka@suse.cz,
	tabba@google.com, david@kernel.org, kas@kernel.org,
	michael.roth@amd.com, Weiny, Ira, linux-kernel@vger.kernel.org,
	binbin.wu@linux.intel.com, ackerleytng@google.com,
	nik.borisov@suse.com, Yamahata, Isaku, Peng, Chao P,
	francescolavra.fl@gmail.com, sagis@google.com, Annapurve, Vishal,
	Edgecombe, Rick P, Miao, Jun, jgross@suse.com, pgonda@google.com,
	x86@kernel.org

On Tue, 2026-01-06 at 18:24 +0800, Yan Zhao wrote:
>  u64 tdh_mem_page_demote(struct tdx_td *td, u64 gpa, int level, struct page *new_sept_page,
> +			struct tdx_prealloc *prealloc,
>  			u64 *ext_err1, u64 *ext_err2)
>  {
> -	struct tdx_module_args args = {
> -		.rcx = gpa | level,
> -		.rdx = tdx_tdr_pa(td),
> -		.r8 = page_to_phys(new_sept_page),
> +	bool dpamt = tdx_supports_dynamic_pamt(&tdx_sysinfo) && level == TDX_PS_2M;

The spec of TDH.MEM.PAGE.DEMOTE says:

  If the TDX Module is configured to use Dynamic PAMT and the large page
  level is 1 (2MB), R12 contains the host physical address of a new 
  PAMT page (HKID bits must be 0).

It says "... is configured to use Dynamic PAMT ...", but not ".. Dynamic
PAMT is supported ..".

tdx_supports_dynamic_pamt() only reports whether the module "supports"
DPAMT.  Although in the DPAMT series the kernel always enables DPAMT when
it is supported, I think it's better to have a comment pointing out that
fact so we don't need to go to that series to figure it out.

> +	u64 guest_memory_pamt_page[MAX_TDX_ARG_SIZE(r12)];
> +	struct tdx_module_array_args args = {
> +		.args.rcx = gpa | level,
> +		.args.rdx = tdx_tdr_pa(td),
> +		.args.r8 = page_to_phys(new_sept_page),
>  	};
>  	u64 ret;
>  
>  	if (!tdx_supports_demote_nointerrupt(&tdx_sysinfo))
>  		return TDX_SW_ERROR;
>  
> +	if (dpamt) {
> +		u64 *args_array = dpamt_args_array_ptr_r12(&args);
> +
> +		if (alloc_pamt_array(guest_memory_pamt_page, prealloc))
> +			return TDX_SW_ERROR;
> +
> +		/*
> +		 * Copy PAMT page PAs of the guest memory into the struct per the
> +		 * TDX ABI
> +		 */
> +		memcpy(args_array, guest_memory_pamt_page,
> +		       tdx_dpamt_entry_pages() * sizeof(*args_array));
> +	}

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [PATCH v3 11/24] KVM: x86/mmu: Introduce kvm_split_cross_boundary_leafs()
  2026-01-19 10:40               ` Huang, Kai
@ 2026-01-19 11:06                 ` Yan Zhao
  2026-01-19 12:32                   ` Yan Zhao
  0 siblings, 1 reply; 127+ messages in thread
From: Yan Zhao @ 2026-01-19 11:06 UTC (permalink / raw)
  To: Huang, Kai
  Cc: Du, Fan, kvm@vger.kernel.org, Li, Xiaoyao, Hansen, Dave,
	thomas.lendacky@amd.com, tabba@google.com, vbabka@suse.cz,
	david@kernel.org, michael.roth@amd.com,
	linux-kernel@vger.kernel.org, seanjc@google.com, Peng, Chao P,
	pbonzini@redhat.com, ackerleytng@google.com, kas@kernel.org,
	binbin.wu@linux.intel.com, Weiny, Ira, nik.borisov@suse.com,
	francescolavra.fl@gmail.com, Yamahata, Isaku, sagis@google.com,
	Gao, Chao, Edgecombe, Rick P, Miao, Jun, Annapurve, Vishal,
	jgross@suse.com, pgonda@google.com, x86@kernel.org

On Mon, Jan 19, 2026 at 06:40:50PM +0800, Huang, Kai wrote:
> On Mon, 2026-01-19 at 18:11 +0800, Yan Zhao wrote:
> > On Mon, Jan 19, 2026 at 04:49:58PM +0800, Huang, Kai wrote:
> > > On Mon, 2026-01-19 at 08:35 +0000, Huang, Kai wrote:
> > > > On Mon, 2026-01-19 at 09:28 +0800, Zhao, Yan Y wrote:
> > > > > > I find the "cross_boundary" terminology extremely confusing.  I also dislike
> > > > > > the concept itself, in the sense that it shoves a weird, specific concept into
> > > > > > the guts of the TDP MMU.
> > > > > > The other wart is that it's inefficient when punching a large hole.  E.g. say
> > > > > > there's a 16TiB guest_memfd instance (no idea if that's even possible), and then
> > > > > > userpace punches a 12TiB hole.  Walking all ~12TiB just to _maybe_ split the head
> > > > > > and tail pages is asinine.
> > > > > That's a reasonable concern. I actually thought about it.
> > > > > My consideration was as follows:
> > > > > Currently, we don't have such large areas. Usually, the conversion ranges are
> > > > > less than 1GB. Though the initial conversion which converts all memory from
> > > > > private to shared may be wide, there are usually no mappings at that stage. So,
> > > > > the traversal should be very fast (since the traversal doesn't even need to go
> > > > > down to the 2MB/1GB level).
> > > > > 
> > > > > If the caller of kvm_split_cross_boundary_leafs() finds it needs to convert a
> > > > > very large range at runtime, it can optimize by invoking the API twice:
> > > > > once for range [start, ALIGN(start, 1GB)), and
> > > > > once for range [ALIGN_DOWN(end, 1GB), end).
> > > > > 
> > > > > I can also implement this optimization within kvm_split_cross_boundary_leafs()
> > > > > by checking the range size if you think that would be better.
> > > > 
> > > > I am not sure why we even need kvm_split_cross_boundary_leafs(), if you
> > > > want to do the optimization.
> > > > 
> > > > I think I raised this in v2, and asked why not just let the caller
> > > > figure out the ranges to split for a given range (see the end of
> > > > [*]), because the "cross boundary" can only happen at the beginning and
> > > > end of the given range, if at all.
> > Hmm, the caller can only figure out when splitting is NOT necessary, e.g., if
> > start is 1GB-aligned, then there's no need to split for start. However, if start
> > is not 1GB/2MB-aligned, the caller has no idea if there's a 2MB mapping covering
> > start - 1 and start.
> 
> Why does the caller need to know?
> 
> Let's only talk about 'start' for simplicity:
> 
> - If start is 1G aligned, then no split is needed.
> 
> - If start is not 1G-aligned but 2M-aligned, you split the range:
> 
>    [ALIGN_DOWN(start, 1G), ALIGN(start, 1G)) to 2M level.
> 
> - If start is 4K-aligned only, you firstly split
> 
>    [ALIGN_DOWN(start, 1G), ALIGN(start, 1G))
> 
>   to 2M level, then you split
> 
>    [ALIGN_DOWN(start, 2M), ALIGN(start, 2M))
> 
>   to 4K level.
> 
> Similar handling applies to 'end'.  One additional thing: if a to-be-split
> range calculated from 'start' overlaps one calculated from 'end', the
> split is only needed once.
> 
> Wouldn't this work?
It can work. But I don't think the calculations are necessary if the length
of [start, end) is less than 1G or 2MB.

e.g., if both start and end are just 4KB-aligned, with a length of 8KB, the
current implementation can invoke a single tdp_mmu_split_huge_pages_root() to
split a 1GB mapping to 4KB directly. Why bother splitting twice for start or end?
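The trade-off amounts to a strategy choice along these lines (a sketch with a hypothetical helper name and an assumed 1GB cutoff, not code from the series):

```c
#include <stdbool.h>
#include <stdint.h>

#define SZ_1G (1ULL << 30)

/*
 * For a short range, a single walk over [start, end) is cheap and
 * handles both boundary leaves in one pass (e.g. an 8KB range inside a
 * single 1GB leaf needs only one split pass).  For a huge range,
 * walking it all just to maybe split the head and tail leaves is
 * wasteful, so per-boundary pre-calculation pays off.  The 1GB cutoff
 * is an assumption for illustration.
 */
static bool walk_whole_range(uint64_t start, uint64_t end)
{
	return end - start < SZ_1G;
}
```

A caller would walk [start, end) directly when this returns true, and otherwise restrict the walks to the head and tail 1GB/2MB regions.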

> > (for non-TDX cases, if start is not 1GB-aligned and is just 2MB-aligned,
> > invoking tdp_mmu_split_huge_pages_root() is still necessary because there may
> > > exist a 1GB mapping covering start - 1 and start).
> > 
> > In my reply to [*], I didn't want to do the calculation because I didn't see
> > much overhead from always invoking tdp_mmu_split_huge_pages_root().
> > But the scenario Sean pointed out is different. When both start and end are not
> > 2MB-aligned, if [start, end) covers a huge range, we can still pre-calculate to
> > reduce the iterations in tdp_mmu_split_huge_pages_root().
> 
> I don't see much difference.  Maybe I am missing something.
The difference is the length of the range.
For lengths < 1GB, always invoking tdp_mmu_split_huge_pages_root() without any
calculation is simpler and more efficient.

> > 
> > Opportunistically, optimization to skip splits for 1GB-aligned start or end is
> > possible :)
> 
> > If this makes the code easier to review/maintain, then sure.
> > 
> > As long as the solution is easy to review (i.e., not too complicated to
> > understand/maintain), I am fine with whatever Sean/you prefer.
> > 
> > However, the 'cross_boundary_only' thing was indeed a bit odd to me when I
> > first saw it :-)

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [PATCH v3 22/24] x86/tdx: Add/Remove DPAMT pages for guest private memory to demote
  2026-01-19 10:52   ` Huang, Kai
@ 2026-01-19 11:11     ` Yan Zhao
  0 siblings, 0 replies; 127+ messages in thread
From: Yan Zhao @ 2026-01-19 11:11 UTC (permalink / raw)
  To: Huang, Kai
  Cc: pbonzini@redhat.com, seanjc@google.com, kvm@vger.kernel.org,
	Du, Fan, Li, Xiaoyao, Gao, Chao, Hansen, Dave,
	thomas.lendacky@amd.com, vbabka@suse.cz, tabba@google.com,
	david@kernel.org, kas@kernel.org, michael.roth@amd.com,
	Weiny, Ira, linux-kernel@vger.kernel.org,
	binbin.wu@linux.intel.com, ackerleytng@google.com,
	nik.borisov@suse.com, Yamahata, Isaku, Peng, Chao P,
	francescolavra.fl@gmail.com, sagis@google.com, Annapurve, Vishal,
	Edgecombe, Rick P, Miao, Jun, jgross@suse.com, pgonda@google.com,
	x86@kernel.org

On Mon, Jan 19, 2026 at 06:52:46PM +0800, Huang, Kai wrote:
> On Tue, 2026-01-06 at 18:24 +0800, Yan Zhao wrote:
> >  u64 tdh_mem_page_demote(struct tdx_td *td, u64 gpa, int level, struct page *new_sept_page,
> > +			struct tdx_prealloc *prealloc,
> >  			u64 *ext_err1, u64 *ext_err2)
> >  {
> > -	struct tdx_module_args args = {
> > -		.rcx = gpa | level,
> > -		.rdx = tdx_tdr_pa(td),
> > -		.r8 = page_to_phys(new_sept_page),
> > +	bool dpamt = tdx_supports_dynamic_pamt(&tdx_sysinfo) && level == TDX_PS_2M;
> 
> The spec of TDH.MEM.PAGE.DEMOTE says:
> 
>   If the TDX Module is configured to use Dynamic PAMT and the large page
>   level is 1 (2MB), R12 contains the host physical address of a new 
>   PAMT page (HKID bits must be 0).
> 
> It says "... is configured to use Dynamic PAMT ...", but not ".. Dynamic
> PAMT is supported ..".
Good catch.

> tdx_supports_dynamic_pamt() only reports whether the module "supports"
> DPAMT.  Although in the DPAMT series the kernel always enables DPAMT when
> it is supported, I think it's better to have a comment pointing out that
> fact so we don't need to go to that series to figure it out.
Will add the comment. Thanks!

> > +	u64 guest_memory_pamt_page[MAX_TDX_ARG_SIZE(r12)];
> > +	struct tdx_module_array_args args = {
> > +		.args.rcx = gpa | level,
> > +		.args.rdx = tdx_tdr_pa(td),
> > +		.args.r8 = page_to_phys(new_sept_page),
> >  	};
> >  	u64 ret;
> >  
> >  	if (!tdx_supports_demote_nointerrupt(&tdx_sysinfo))
> >  		return TDX_SW_ERROR;
> >  
> > +	if (dpamt) {
> > +		u64 *args_array = dpamt_args_array_ptr_r12(&args);
> > +
> > +		if (alloc_pamt_array(guest_memory_pamt_page, prealloc))
> > +			return TDX_SW_ERROR;
> > +
> > +		/*
> > +		 * Copy PAMT page PAs of the guest memory into the struct per the
> > +		 * TDX ABI
> > +		 */
> > +		memcpy(args_array, guest_memory_pamt_page,
> > +		       tdx_dpamt_entry_pages() * sizeof(*args_array));
> > +	}

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [PATCH v3 11/24] KVM: x86/mmu: Introduce kvm_split_cross_boundary_leafs()
  2026-01-19 11:06                 ` Yan Zhao
@ 2026-01-19 12:32                   ` Yan Zhao
  2026-01-29 14:36                     ` Sean Christopherson
  0 siblings, 1 reply; 127+ messages in thread
From: Yan Zhao @ 2026-01-19 12:32 UTC (permalink / raw)
  To: Huang, Kai, Du, Fan, kvm@vger.kernel.org, Li, Xiaoyao,
	Hansen, Dave, thomas.lendacky@amd.com, tabba@google.com,
	vbabka@suse.cz, david@kernel.org, michael.roth@amd.com,
	linux-kernel@vger.kernel.org, seanjc@google.com, Peng, Chao P,
	pbonzini@redhat.com, ackerleytng@google.com, kas@kernel.org,
	binbin.wu@linux.intel.com, Weiny, Ira, nik.borisov@suse.com,
	francescolavra.fl@gmail.com, Yamahata, Isaku, sagis@google.com,
	Gao, Chao, Edgecombe, Rick P, Miao, Jun, Annapurve, Vishal,
	jgross@suse.com, pgonda@google.com, x86@kernel.org

On Mon, Jan 19, 2026 at 07:06:01PM +0800, Yan Zhao wrote:
> On Mon, Jan 19, 2026 at 06:40:50PM +0800, Huang, Kai wrote:
> > On Mon, 2026-01-19 at 18:11 +0800, Yan Zhao wrote:
> > > On Mon, Jan 19, 2026 at 04:49:58PM +0800, Huang, Kai wrote:
> > > > On Mon, 2026-01-19 at 08:35 +0000, Huang, Kai wrote:
> > > > > On Mon, 2026-01-19 at 09:28 +0800, Zhao, Yan Y wrote:
> > > > > > > I find the "cross_boundary" terminology extremely confusing.  I also dislike
> > > > > > > the concept itself, in the sense that it shoves a weird, specific concept into
> > > > > > > the guts of the TDP MMU.
> > > > > > > The other wart is that it's inefficient when punching a large hole.  E.g. say
> > > > > > > there's a 16TiB guest_memfd instance (no idea if that's even possible), and then
> > > > > > > userpace punches a 12TiB hole.  Walking all ~12TiB just to _maybe_ split the head
> > > > > > > and tail pages is asinine.
> > > > > > That's a reasonable concern. I actually thought about it.
> > > > > > My consideration was as follows:
> > > > > > Currently, we don't have such large areas. Usually, the conversion ranges are
> > > > > > less than 1GB. Though the initial conversion which converts all memory from
> > > > > > private to shared may be wide, there are usually no mappings at that stage. So,
> > > > > > the traversal should be very fast (since the traversal doesn't even need to go
> > > > > > down to the 2MB/1GB level).
> > > > > > 
> > > > > > If the caller of kvm_split_cross_boundary_leafs() finds it needs to convert a
> > > > > > very large range at runtime, it can optimize by invoking the API twice:
> > > > > > once for range [start, ALIGN(start, 1GB)), and
> > > > > > once for range [ALIGN_DOWN(end, 1GB), end).
> > > > > > 
> > > > > > I can also implement this optimization within kvm_split_cross_boundary_leafs()
> > > > > > by checking the range size if you think that would be better.
> > > > > 
> > > > > I am not sure why we even need kvm_split_cross_boundary_leafs(), if you
> > > > > want to do the optimization.
> > > > > 
> > > > > I think I raised this in v2, and asked why not just let the caller
> > > > > figure out the ranges to split for a given range (see the end of
> > > > > [*]), because the "cross boundary" can only happen at the beginning and
> > > > > end of the given range, if at all.
> > > Hmm, the caller can only figure out when splitting is NOT necessary, e.g., if
> > > start is 1GB-aligned, then there's no need to split for start. However, if start
> > > is not 1GB/2MB-aligned, the caller has no idea if there's a 2MB mapping covering
> > > start - 1 and start.
> > 
> > Why does the caller need to know?
> > 
> > Let's only talk about 'start' for simplicity:
> > 
> > - If start is 1G aligned, then no split is needed.
> > 
> > - If start is not 1G-aligned but 2M-aligned, you split the range:
> > 
> >    [ALIGN_DOWN(start, 1G), ALIGN(start, 1G)) to 2M level.
> > 
> > - If start is 4K-aligned only, you firstly split
> > 
> >    [ALIGN_DOWN(start, 1G), ALIGN(start, 1G))
> > 
> >   to 2M level, then you split
> > 
> >    [ALIGN_DOWN(start, 2M), ALIGN(start, 2M))
> > 
> >   to 4K level.
> > 
> > Similar handling applies to 'end'.  One additional thing: if a to-be-split
> > range calculated from 'start' overlaps one calculated from 'end', the
> > split is only needed once.
> > 
> > Wouldn't this work?
> It can work. But I don't think the calculations are necessary if the length
> of [start, end) is less than 1G or 2MB.
> 
> e.g., if both start and end are just 4KB-aligned, with a length of 8KB, the
> current implementation can invoke a single tdp_mmu_split_huge_pages_root() to
> split a 1GB mapping to 4KB directly. Why bother splitting twice for start or end?
I think I get your point now.
It's a good idea if introducing only_cross_boundary is undesirable.

So, the remaining question (as I asked at the bottom of [1]) is whether we could
create a specific function for this split use case, rather than reusing
tdp_mmu_split_huge_pages_root(), which allocates pages outside of mmu_lock. That
way, we wouldn't need to introduce a spinlock to protect the page enqueuing/
dequeuing of the per-VM external cache (see prealloc_split_cache_lock in patch
20 [2]).

Then we would disallow mirror_root for tdp_mmu_split_huge_pages_root(), which is
currently called for dirty page tracking in upstream code. Would this be
acceptable for TDX migration?


[1] https://lore.kernel.org/all/aW2Iwpuwoyod8eQc@yzhao56-desk.sh.intel.com/
[2] https://lore.kernel.org/all/20260106102345.25261-1-yan.y.zhao@intel.com/
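To illustrate why the maximum is predictable here: splitting one boundary needs at most one page table for a 1GB->2MB split plus one for a 2MB->4KB split (the real cache would also need sp->external_spt and DPAMT pages; the bare counts below are an illustration under that simplification, with hypothetical helper names):

```c
#include <stdint.h>

#define SZ_2M (1ULL << 21)
#define SZ_1G (1ULL << 30)
#define IS_ALIGNED(x, a) (((x) & ((a) - 1)) == 0)

/*
 * Upper bound on new page-table pages needed so that no leaf spans the
 * boundary at 'addr': at most one for a 1GB->2MB split, plus one for a
 * 2MB->4KB split.
 */
static int max_sp_for_boundary(uint64_t addr)
{
	int n = 0;

	if (!IS_ALIGNED(addr, SZ_1G))
		n++;
	if (!IS_ALIGNED(addr, SZ_2M))
		n++;
	return n;
}

/* Worst case for a range: both ends misaligned down to 4KB -> 4 pages. */
static int max_sp_for_range(uint64_t start, uint64_t end)
{
	return max_sp_for_boundary(start) + max_sp_for_boundary(end);
}
```

With a bound like this, the pages could be charged to a kvm_mmu_memory_cache up front instead of dropping mmu_lock mid-walk.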
> > > (for non-TDX cases, if start is not 1GB-aligned and is just 2MB-aligned,
> > > invoking tdp_mmu_split_huge_pages_root() is still necessary because there may
> > > exist a 1GB mapping covering start - 1 and start).
> > > 
> > > In my reply to [*], I didn't want to do the calculation because I didn't see
> > > much overhead from always invoking tdp_mmu_split_huge_pages_root().
> > > But the scenario Sean pointed out is different. When both start and end are not
> > > 2MB-aligned, if [start, end) covers a huge range, we can still pre-calculate to
> > > reduce the iterations in tdp_mmu_split_huge_pages_root().
> > 
> > I don't see much difference.  Maybe I am missing something.
> The difference is the length of the range.
> For lengths < 1GB, always invoking tdp_mmu_split_huge_pages_root() without any
> calculation is simpler and more efficient.
> 
> > > 
> > > Opportunistically, optimization to skip splits for 1GB-aligned start or end is
> > > possible :)
> > 
> > > If this makes the code easier to review/maintain, then sure.
> > > 
> > > As long as the solution is easy to review (i.e., not too complicated to
> > > understand/maintain), I am fine with whatever Sean/you prefer.
> > > 
> > > However, the 'cross_boundary_only' thing was indeed a bit odd to me when I
> > > first saw it :-)

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [PATCH v3 11/24] KVM: x86/mmu: Introduce kvm_split_cross_boundary_leafs()
  2026-01-19  1:28       ` Yan Zhao
  2026-01-19  8:35         ` Huang, Kai
@ 2026-01-20 17:51         ` Sean Christopherson
  2026-01-22  6:27           ` Yan Zhao
  1 sibling, 1 reply; 127+ messages in thread
From: Sean Christopherson @ 2026-01-20 17:51 UTC (permalink / raw)
  To: Yan Zhao
  Cc: Kai Huang, pbonzini@redhat.com, kvm@vger.kernel.org, Fan Du,
	Xiaoyao Li, Chao Gao, Dave Hansen, thomas.lendacky@amd.com,
	vbabka@suse.cz, tabba@google.com, david@kernel.org,
	kas@kernel.org, michael.roth@amd.com, Ira Weiny,
	linux-kernel@vger.kernel.org, binbin.wu@linux.intel.com,
	ackerleytng@google.com, nik.borisov@suse.com, Isaku Yamahata,
	Chao P Peng, francescolavra.fl@gmail.com, sagis@google.com,
	Vishal Annapurve, Rick P Edgecombe, Jun Miao, jgross@suse.com,
	pgonda@google.com, x86@kernel.org

On Mon, Jan 19, 2026, Yan Zhao wrote:
> On Fri, Jan 16, 2026 at 03:39:13PM -0800, Sean Christopherson wrote:
> > On Thu, Jan 15, 2026, Kai Huang wrote:
> > > So how about:
> > > 
> > > Rename kvm_tdp_mmu_try_split_huge_pages() to
> > > kvm_tdp_mmu_split_huge_pages_log_dirty(), and rename
> > > kvm_tdp_mmu_gfn_range_split_cross_boundary_leafs() to
> > > kvm_tdp_mmu_split_huge_pages_cross_boundary()
> > > 
> > > ?
> > 
> > I find the "cross_boundary" terminology extremely confusing.  I also dislike
> > the concept itself, in the sense that it shoves a weird, specific concept into
> > the guts of the TDP MMU.
> > The other wart is that it's inefficient when punching a large hole.  E.g. say
> > there's a 16TiB guest_memfd instance (no idea if that's even possible), and then
> > userpace punches a 12TiB hole.  Walking all ~12TiB just to _maybe_ split the head
> > and tail pages is asinine.
> That's a reasonable concern. I actually thought about it.
> My consideration was as follows:
> Currently, we don't have such large areas. Usually, the conversion ranges are
> less than 1GB. 

Nothing guarantees that behavior.

> Though the initial conversion which converts all memory from private to
> shared may be wide, there are usually no mappings at that stage. So, the
> traversal should be very fast (since the traversal doesn't even need to go
> down to the 2MB/1GB level).
> 
> If the caller of kvm_split_cross_boundary_leafs() finds it needs to convert a
> very large range at runtime, it can optimize by invoking the API twice:
> once for range [start, ALIGN(start, 1GB)), and
> once for range [ALIGN_DOWN(end, 1GB), end).
> 
> I can also implement this optimization within kvm_split_cross_boundary_leafs()
> by checking the range size if you think that would be better.
> 
> > And once kvm_arch_pre_set_memory_attributes() is dropped, I'm pretty sure the
> > _only_ usage is for guest_memfd PUNCH_HOLE, because unless I'm misreading the
> > code, the usage in tdx_honor_guest_accept_level() is superfluous and confusing.
> Sorry for the confusion about the usage of tdx_honor_guest_accept_level(). I
> should add a better comment.
> 
> There are 4 use cases for the API kvm_split_cross_boundary_leafs():
> 1. PUNCH_HOLE
> 2. KVM_SET_MEMORY_ATTRIBUTES2, which invokes kvm_gmem_set_attributes() for
>    private-to-shared conversions
> 3. tdx_honor_guest_accept_level()
> 4. kvm_gmem_error_folio()
> 
> Use cases 1-3 are already in the current code. Use case 4 is per our discussion,
> and will be implemented in the next version (because guest_memfd may split
> folios without first splitting S-EPT).
> 
> The 4 use cases can be divided into two categories:
> 
> 1. Category 1: use cases 1, 2, 4
>    We must ensure GFN start - 1 and GFN start are not mapped in a single
>    mapping. However, for GFN start or GFN start - 1 specifically, we don't care
>    about their actual mapping levels, which means they are free to be mapped at
>    2MB or 1GB. The same applies to GFN end - 1 and GFN end.
> 
>    --|------------------|-----------
>      ^                  ^
>     start              end - 1 
> 
> 2. Category 2: use case 3
>    It cares about the mapping level of the GFN, i.e., it must not be mapped
>    above a certain level.
> 
>    -----|-------
>         ^
>        GFN
> 
>    So, to unify the two categories, I have tdx_honor_guest_accept_level() check
>    the range of [level-aligned GFN, level-aligned GFN + level size). e.g.,
>    If the accept level is 2MB, only 1GB mapping is possible to be outside the
>    range and needs splitting.

But that overlooks the fact that Category 2 already fits the existing "category"
that is supported by the TDP MMU.  I.e. Category 1 is (somewhat) new and novel,
Category 2 is not.

>    -----|-------------|---
>         ^             ^
>         |             |
>    level-aligned     level-aligned
>       GFN            GFN + level size - 1
> 
> 
> > For the EPT violation case, the guest is accepting a page.  Just split to the
> > guest's accepted level, I don't see any reason to make things more complicated
> > than that.
> This use case could reuse the kvm_mmu_try_split_huge_pages() API, except that we
> need a return value.

Just expose tdp_mmu_split_huge_pages_root(); the fault path only _needs_ to split
the current root, and in fact shouldn't even try to split other roots (ignoring
that no other relevant roots exist).

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 9c26038f6b77..7d924da75106 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1555,10 +1555,9 @@ static int tdp_mmu_split_huge_page(struct kvm *kvm, struct tdp_iter *iter,
        return ret;
 }
 
-static int tdp_mmu_split_huge_pages_root(struct kvm *kvm,
-                                        struct kvm_mmu_page *root,
-                                        gfn_t start, gfn_t end,
-                                        int target_level, bool shared)
+int tdp_mmu_split_huge_pages_root(struct kvm *kvm, struct kvm_mmu_page *root,
+                                 gfn_t start, gfn_t end, int target_level,
+                                 bool shared)
 {
        struct kvm_mmu_page *sp = NULL;
        struct tdp_iter iter;
diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
index bd62977c9199..ea9a509608fb 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.h
+++ b/arch/x86/kvm/mmu/tdp_mmu.h
@@ -93,6 +93,9 @@ bool kvm_tdp_mmu_write_protect_gfn(struct kvm *kvm,
                                   struct kvm_memory_slot *slot, gfn_t gfn,
                                   int min_level);
 
+int tdp_mmu_split_huge_pages_root(struct kvm *kvm, struct kvm_mmu_page *root,
+                                 gfn_t start, gfn_t end, int target_level,
+                                 bool shared);
 void kvm_tdp_mmu_try_split_huge_pages(struct kvm *kvm,
                                      const struct kvm_memory_slot *slot,
                                      gfn_t start, gfn_t end,

> > And then for the PUNCH_HOLE case, do the math to determine which, if any, head
> > and tail pages need to be split, and use the existing APIs to make that happen.
> This use case cannot reuse kvm_mmu_try_split_huge_pages() without modification.

Modifying existing code is a non-issue, and you're already modifying TDP MMU
functions, so I don't see that as a reason for choosing X instead of Y.

> Or which existing APIs are you referring to?

See above.

> The cross_boundary information is still useful?
> 
> BTW: Currently, kvm_split_cross_boundary_leafs() internally reuses
> tdp_mmu_split_huge_pages_root() (as shown below).
> 
> kvm_split_cross_boundary_leafs
>   kvm_tdp_mmu_gfn_range_split_cross_boundary_leafs
>     tdp_mmu_split_huge_pages_root
> 
> However, tdp_mmu_split_huge_pages_root() is originally used to split huge
> mappings in a wide range, so it temporarily releases mmu_lock for memory
> allocation for sp, since it can't predict how many pages to pre-allocate in the
> KVM mmu cache.
> 
> For kvm_split_cross_boundary_leafs(), we can actually predict the max number of
> pages to pre-allocate. If we don't reuse tdp_mmu_split_huge_pages_root(), we can
> allocate sp, sp->spt, sp->external_spt and DPAMT pages from the KVM mmu cache
> without releasing mmu_lock and invoking tdp_mmu_alloc_sp_for_split().

That's completely orthogonal to the "only need to maybe split head and tail pages".
E.g. kvm_tdp_mmu_try_split_huge_pages() can also predict the _max_ number of pages
to pre-allocate, it's just not worth adding a kvm_mmu_memory_cache for that use
case because that path can drop mmu_lock at will, unlike the full page fault path.
I.e. the complexity doesn't justify the benefits, especially since the max number
of pages is so large.

AFAICT, the only pre-allocation that is _necessary_ is for the dynamic PAMT,
because the allocation is done outside of KVM's control.  But that's a solvable
problem, the tricky part is protecting the PAMT cache for PUNCH_HOLE, but that
too is solvable, e.g. by adding a per-VM mutex that's taken by kvm_gmem_punch_hole()
to handle the PUNCH_HOLE case, and then using the per-vCPU cache when splitting
for a mismatched accept.
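
The cache-protection scheme above can be modeled in a few lines of
user-space C (every name here, e.g. vm_pamt_cache, is an illustrative
assumption, not KVM code): the PUNCH_HOLE path serializes on a per-VM
mutex, consumes from the cache, and asks the caller to top up and retry
when the cache runs dry.

```c
#include <assert.h>
#include <pthread.h>

/* Toy model of the proposed per-VM PAMT cache (illustrative only). */
struct vm_pamt_cache {
	pthread_mutex_t lock;	/* the per-VM mutex taken for PUNCH_HOLE */
	int nr_free;		/* pre-allocated PAMT pages in the cache */
};

/* Refill the cache; in KVM this would happen outside of mmu_lock. */
static void topup(struct vm_pamt_cache *c, int n)
{
	pthread_mutex_lock(&c->lock);
	c->nr_free += n;
	pthread_mutex_unlock(&c->lock);
}

/*
 * PUNCH_HOLE path: consume from the cache under the per-VM mutex.
 * Returns 0 on success, -1 if the caller must top up and retry.
 */
static int punch_hole_consume(struct vm_pamt_cache *c, int need)
{
	int ok;

	pthread_mutex_lock(&c->lock);
	ok = c->nr_free >= need;
	if (ok)
		c->nr_free -= need;
	pthread_mutex_unlock(&c->lock);
	return ok ? 0 : -1;
}
```

The fault path for a mismatched accept would bypass this lock entirely
and use the per-vCPU cache.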

^ permalink raw reply related	[flat|nested] 127+ messages in thread

* Re: [PATCH v3 11/24] KVM: x86/mmu: Introduce kvm_split_cross_boundary_leafs()
  2026-01-16 23:39     ` Sean Christopherson
  2026-01-19  1:28       ` Yan Zhao
@ 2026-01-20 17:57       ` Vishal Annapurve
  2026-01-20 18:02         ` Sean Christopherson
  1 sibling, 1 reply; 127+ messages in thread
From: Vishal Annapurve @ 2026-01-20 17:57 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Kai Huang, pbonzini@redhat.com, Yan Y Zhao, kvm@vger.kernel.org,
	Fan Du, Xiaoyao Li, Chao Gao, Dave Hansen,
	thomas.lendacky@amd.com, vbabka@suse.cz, tabba@google.com,
	david@kernel.org, kas@kernel.org, michael.roth@amd.com, Ira Weiny,
	linux-kernel@vger.kernel.org, binbin.wu@linux.intel.com,
	ackerleytng@google.com, nik.borisov@suse.com, Isaku Yamahata,
	Chao P Peng, francescolavra.fl@gmail.com, sagis@google.com,
	Rick P Edgecombe, Jun Miao, jgross@suse.com, pgonda@google.com,
	x86@kernel.org

On Fri, Jan 16, 2026 at 3:39 PM Sean Christopherson <seanjc@google.com> wrote:
>
> On Thu, Jan 15, 2026, Kai Huang wrote:
> > static int __kvm_tdp_mmu_split_huge_pages(struct kvm *kvm,
> >                                         struct kvm_gfn_range *range,
> >                                         int target_level,
> >                                         bool shared,
> >                                         bool cross_boundary_only)
> > {
> >       ...
> > }
> >
> > And by using this helper, I found the name of the two wrapper functions
> > are not ideal:
> >
> > kvm_tdp_mmu_try_split_huge_pages() is only for log dirty, and it should
> > not be reachable for TD (VM with mirrored PT).  But currently it uses
> > KVM_VALID_ROOTS for root filter thus mirrored PT is also included.  I
> > think it's better to rename it, e.g., at least with "log_dirty" in the
> > name so it's more clear this function is only for dealing log dirty (at
> > least currently).  We can also add a WARN() if it's called for VM with
> > mirrored PT but it's a different topic.
> >
> > kvm_tdp_mmu_gfn_range_split_cross_boundary_leafs() doesn't have
> > "huge_pages", which isn't consistent with the other.  And it is a bit
> > long.  If we don't have "gfn_range" in __kvm_tdp_mmu_split_huge_pages(),
> > then I think we can remove "gfn_range" from
> > kvm_tdp_mmu_gfn_range_split_cross_boundary_leafs() too to make it shorter.
> >
> > So how about:
> >
> > Rename kvm_tdp_mmu_try_split_huge_pages() to
> > kvm_tdp_mmu_split_huge_pages_log_dirty(), and rename
> > kvm_tdp_mmu_gfn_range_split_cross_boundary_leafs() to
> > kvm_tdp_mmu_split_huge_pages_cross_boundary()
> >
> > ?
>
> I find the "cross_boundary" terminology extremely confusing.  I also dislike
> the concept itself, in the sense that it shoves a weird, specific concept into
> the guts of the TDP MMU.
>
> The other wart is that it's inefficient when punching a large hole.  E.g. say
> there's a 16TiB guest_memfd instance (no idea if that's even possible), and then
> userspace punches a 12TiB hole.  Walking all ~12TiB just to _maybe_ split the head
> and tail pages is asinine.
>
> And once kvm_arch_pre_set_memory_attributes() is dropped, I'm pretty sure the
> _only_ usage is for guest_memfd PUNCH_HOLE, because unless I'm misreading the
> code, the usage in tdx_honor_guest_accept_level() is superfluous and confusing.
>
> For the EPT violation case, the guest is accepting a page.  Just split to the
> guest's accepted level, I don't see any reason to make things more complicated
> than that.
>
> And then for the PUNCH_HOLE case, do the math to determine which, if any, head
> and tail pages need to be split, and use the existing APIs to make that happen.

Just a note: Through guest_memfd upstream syncs, we agreed that
guest_memfd will only allow the punch_hole operation for huge page
size-aligned ranges for hugetlb and thp backing. I.e. the PUNCH_HOLE
operation doesn't need to split any EPT mappings for the foreseeable
future.

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [PATCH v3 11/24] KVM: x86/mmu: Introduce kvm_split_cross_boundary_leafs()
  2026-01-20 17:57       ` Vishal Annapurve
@ 2026-01-20 18:02         ` Sean Christopherson
  2026-01-22  6:33           ` Yan Zhao
  0 siblings, 1 reply; 127+ messages in thread
From: Sean Christopherson @ 2026-01-20 18:02 UTC (permalink / raw)
  To: Vishal Annapurve
  Cc: Kai Huang, pbonzini@redhat.com, Yan Y Zhao, kvm@vger.kernel.org,
	Fan Du, Xiaoyao Li, Chao Gao, Dave Hansen,
	thomas.lendacky@amd.com, vbabka@suse.cz, tabba@google.com,
	david@kernel.org, kas@kernel.org, michael.roth@amd.com, Ira Weiny,
	linux-kernel@vger.kernel.org, binbin.wu@linux.intel.com,
	ackerleytng@google.com, nik.borisov@suse.com, Isaku Yamahata,
	Chao P Peng, francescolavra.fl@gmail.com, sagis@google.com,
	Rick P Edgecombe, Jun Miao, jgross@suse.com, pgonda@google.com,
	x86@kernel.org

On Tue, Jan 20, 2026, Vishal Annapurve wrote:
> On Fri, Jan 16, 2026 at 3:39 PM Sean Christopherson <seanjc@google.com> wrote:
> >
> > On Thu, Jan 15, 2026, Kai Huang wrote:
> > > static int __kvm_tdp_mmu_split_huge_pages(struct kvm *kvm,
> > >                                         struct kvm_gfn_range *range,
> > >                                         int target_level,
> > >                                         bool shared,
> > >                                         bool cross_boundary_only)
> > > {
> > >       ...
> > > }
> > >
> > > And by using this helper, I found the name of the two wrapper functions
> > > are not ideal:
> > >
> > > kvm_tdp_mmu_try_split_huge_pages() is only for log dirty, and it should
> > > not be reachable for TD (VM with mirrored PT).  But currently it uses
> > > KVM_VALID_ROOTS for root filter thus mirrored PT is also included.  I
> > > think it's better to rename it, e.g., at least with "log_dirty" in the
> > > name so it's more clear this function is only for dealing log dirty (at
> > > least currently).  We can also add a WARN() if it's called for VM with
> > > mirrored PT but it's a different topic.
> > >
> > > kvm_tdp_mmu_gfn_range_split_cross_boundary_leafs() doesn't have
> > > "huge_pages", which isn't consistent with the other.  And it is a bit
> > > long.  If we don't have "gfn_range" in __kvm_tdp_mmu_split_huge_pages(),
> > > then I think we can remove "gfn_range" from
> > > kvm_tdp_mmu_gfn_range_split_cross_boundary_leafs() too to make it shorter.
> > >
> > > So how about:
> > >
> > > Rename kvm_tdp_mmu_try_split_huge_pages() to
> > > kvm_tdp_mmu_split_huge_pages_log_dirty(), and rename
> > > kvm_tdp_mmu_gfn_range_split_cross_boundary_leafs() to
> > > kvm_tdp_mmu_split_huge_pages_cross_boundary()
> > >
> > > ?
> >
> > I find the "cross_boundary" terminology extremely confusing.  I also dislike
> > the concept itself, in the sense that it shoves a weird, specific concept into
> > the guts of the TDP MMU.
> >
> > The other wart is that it's inefficient when punching a large hole.  E.g. say
> > there's a 16TiB guest_memfd instance (no idea if that's even possible), and then
> > userspace punches a 12TiB hole.  Walking all ~12TiB just to _maybe_ split the head
> > and tail pages is asinine.
> >
> > And once kvm_arch_pre_set_memory_attributes() is dropped, I'm pretty sure the
> > _only_ usage is for guest_memfd PUNCH_HOLE, because unless I'm misreading the
> > code, the usage in tdx_honor_guest_accept_level() is superfluous and confusing.
> >
> > For the EPT violation case, the guest is accepting a page.  Just split to the
> > guest's accepted level, I don't see any reason to make things more complicated
> > than that.
> >
> > And then for the PUNCH_HOLE case, do the math to determine which, if any, head
> > and tail pages need to be split, and use the existing APIs to make that happen.
> 
> Just a note: Through guest_memfd upstream syncs, we agreed that
> guest_memfd will only allow the punch_hole operation for huge page
> size-aligned ranges for hugetlb and thp backing. I.e. the PUNCH_HOLE
> operation doesn't need to split any EPT mappings for the foreseeable
> future.

Oh!  Right, forgot about that.  It's the conversion path that we need to sort out,
not PUNCH_HOLE.  Thanks for the reminder!

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [PATCH v3 19/24] KVM: x86: Introduce per-VM external cache for splitting
  2026-01-06 10:23 ` [PATCH v3 19/24] KVM: x86: Introduce per-VM external cache for splitting Yan Zhao
@ 2026-01-21  1:54   ` Huang, Kai
  2026-01-21 17:30     ` Sean Christopherson
  0 siblings, 1 reply; 127+ messages in thread
From: Huang, Kai @ 2026-01-21  1:54 UTC (permalink / raw)
  To: pbonzini@redhat.com, seanjc@google.com, Zhao, Yan Y
  Cc: kvm@vger.kernel.org, Du, Fan, Li, Xiaoyao, Gao, Chao,
	Hansen, Dave, thomas.lendacky@amd.com, vbabka@suse.cz,
	tabba@google.com, david@kernel.org, kas@kernel.org,
	michael.roth@amd.com, Weiny, Ira, linux-kernel@vger.kernel.org,
	binbin.wu@linux.intel.com, ackerleytng@google.com,
	nik.borisov@suse.com, Yamahata, Isaku, Peng, Chao P,
	francescolavra.fl@gmail.com, sagis@google.com, Annapurve, Vishal,
	Edgecombe, Rick P, Miao, Jun, jgross@suse.com, pgonda@google.com,
	x86@kernel.org

On Tue, 2026-01-06 at 18:23 +0800, Yan Zhao wrote:
> Introduce per-VM external cache for splitting the external page table by
> adding KVM x86 ops for cache "topup", "free", "need topup" operations.
> 
> Invoke the KVM x86 ops for "topup", "need topup" for the per-VM external
> split cache when splitting the mirror root in
> tdp_mmu_split_huge_pages_root() where there's no per-vCPU context.
> 
> Invoke the KVM x86 op for "free" to destroy the per-VM external split cache
> when KVM frees memory caches.
> 
> This per-VM external split cache is only used when per-vCPU context is not
> available. Use the per-vCPU external fault cache in the fault path
> when per-vCPU context is available.
> 
> The per-VM external split cache is protected under both kvm->mmu_lock and a
> cache lock inside vendor implementations to ensure that there are enough
> pages in the cache for one split:
> 
> - Dequeuing of the per-VM external split cache is in
>   kvm_x86_ops.split_external_spte() under mmu_lock.
> 
> - Yield the traversal in tdp_mmu_split_huge_pages_root() after topup of
>   the per-VM cache, so that need_topup() is checked again after
>   re-acquiring the mmu_lock.
> 
> - Vendor implementations of the per-VM external split cache provide a
>   cache lock to protect the enqueue/dequeue of pages into/from the cache.
> 
> Here's the sequence to show how enough pages in the cache are guaranteed.
> 
> a. with write mmu_lock:
> 
>    1. write_lock(&kvm->mmu_lock)
>       kvm_x86_ops.need_topup()
> 
>    2. write_unlock(&kvm->mmu_lock)
>       kvm_x86_ops.topup() --> in vendor:
>       {
>         allocate pages
>         get cache lock
>         enqueue pages in cache
>         put cache lock
>       }
> 
>    3. write_lock(&kvm->mmu_lock)
>       kvm_x86_ops.need_topup() (goto 2 if topup is necessary)  (*)
> 
>       kvm_x86_ops.split_external_spte() --> in vendor:
>       {
>          get cache lock
>          dequeue pages in cache
>          put cache lock
>       }
>       write_unlock(&kvm->mmu_lock)
> 
> b. with read mmu_lock,
> 
>    1. read_lock(&kvm->mmu_lock)
>       kvm_x86_ops.need_topup()
> 
>    2. read_unlock(&kvm->mmu_lock)
>       kvm_x86_ops.topup() --> in vendor:
>       {
>         allocate pages
>         get cache lock
>         enqueue pages in cache
>         put cache lock
>       }
> 
>    3. read_lock(&kvm->mmu_lock)
>       kvm_x86_ops.need_topup() (goto 2 if topup is necessary)
> 
>       kvm_x86_ops.split_external_spte() --> in vendor:
>       {
>          get cache lock
>          kvm_x86_ops.need_topup() (return retry if topup is necessary) (**)
>          dequeue pages in cache
>          put cache lock
>       }
> 
>       read_unlock(&kvm->mmu_lock)
> 
> Due to (*) and (**) in step 3, enough pages for the split are guaranteed.

It feels like an enormous pain to make sure there are enough objects in the
cache, _especially_ under MMU read lock -- you need an additional cache
lock and need to call need_topup() twice for that, and the caller needs to
handle -EAGAIN.

That being said, I _think_ this is also the reason that
tdp_mmu_alloc_sp_for_split() chose to just use normal memory allocation
for allocating sp and sp->spt but not use a per-VM cache of KVM's
kvm_mmu_memory_cache.

I have been thinking whether we can simplify the solution, not only just
for avoiding this complicated memory cache topup-then-consume mechanism
under MMU read lock, but also for avoiding kinda duplicated code about how
to calculate how many DPAMT pages are needed for topup etc. between your next
patch and similar code in DPAMT series for the per-vCPU cache.

IIRC, the per-VM DPAMT cache (in your next patch) covers both S-EPT pages
and the mapped 2M range when splitting.

- For S-EPT pages, they are _ALWAYS_ 4K, so we can actually use
tdx_alloc_page() directly which also handles DPAMT pages internally.

Here in tdp_mmu_alloc_sp_for_split():

	sp->external_spt = tdx_alloc_page();

For the fault path we need to use the normal 'kvm_mmu_memory_cache' but
that's a per-vCPU cache which doesn't have the pain of the per-VM cache.  As I
mentioned in v3, I believe we can also hook to use tdx_alloc_page() if we
add two new obj_alloc()/free() callback to 'kvm_mmu_memory_cache':

https://lore.kernel.org/kvm/9e72261602bdab914cf7ff6f7cb921e35385136e.camel@intel.com/

So we can get rid of the per-VM DPAMT cache for S-EPT pages.

- For DPAMT pages for the TDX guest private memory, I think we can also
get rid of the per-VM DPAMT cache if we use 'kvm_mmu_page' to carry the
needed DPAMT pages:

--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -111,6 +111,7 @@ struct kvm_mmu_page {
                 * Passed to TDX module, not accessed by KVM.
                 */
                void *external_spt;
+               void *leaf_level_private;
        };

Then we can define a structure which contains DPAMT pages for a given 2M
range:

	struct tdx_dpamt_metadata {
		struct page *page1;
		struct page *page2;
	};

Then when we allocate sp->external_spt, we can also allocate it for
leaf_level_private via a kvm_x86_ops call when the 'sp' is actually the
last level page table.

In this case, I think we can get rid of the per-VM DPAMT cache?

For the fault path, similarly, I believe we can use a per-vCPU cache for
'struct tdx_dpamt_metadata' if we utilize the two new obj_alloc()/free()
hooks.

The cost is the new 'leaf_level_private' takes an additional 8 bytes for non-
TDX guests even though it is never used, but if what I said above is feasible,
maybe it's worth the cost.
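
The idea of carrying the DPAMT pages in 'kvm_mmu_page' can be sketched as
a toy user-space model ('sp_sketch' stands in for kvm_mmu_page; every
name here is an assumption, not existing code): a last-level sp owns its
S-EPT page plus the DPAMT metadata for the 2M range it maps, so no
per-VM cache is needed.

```c
#include <assert.h>
#include <stdlib.h>

/* Stand-in for the proposed structure; names are illustrative. */
struct tdx_dpamt_metadata {
	void *page1;	/* first DPAMT page for the 2M range */
	void *page2;	/* second DPAMT page for the 2M range */
};

/* Minimal stand-in for kvm_mmu_page. */
struct sp_sketch {
	void *external_spt;		  /* S-EPT page table page, always 4K */
	struct tdx_dpamt_metadata *dpamt; /* stands in for leaf_level_private */
};

/* Allocate everything a last-level sp needs up front; NULL on failure. */
static struct sp_sketch *alloc_leaf_sp(void)
{
	struct sp_sketch *sp = calloc(1, sizeof(*sp));

	if (!sp)
		return NULL;
	sp->external_spt = calloc(1, 4096);
	sp->dpamt = calloc(1, sizeof(*sp->dpamt));
	if (sp->dpamt) {
		sp->dpamt->page1 = calloc(1, 4096);
		sp->dpamt->page2 = calloc(1, 4096);
	}
	if (!sp->external_spt || !sp->dpamt ||
	    !sp->dpamt->page1 || !sp->dpamt->page2)
		return NULL;	/* error unwinding elided in this sketch */
	return sp;
}
```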

But it's completely possible that I missed something.  Any thoughts?



^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [PATCH v3 19/24] KVM: x86: Introduce per-VM external cache for splitting
  2026-01-21  1:54   ` Huang, Kai
@ 2026-01-21 17:30     ` Sean Christopherson
  2026-01-21 19:39       ` Edgecombe, Rick P
                         ` (2 more replies)
  0 siblings, 3 replies; 127+ messages in thread
From: Sean Christopherson @ 2026-01-21 17:30 UTC (permalink / raw)
  To: Kai Huang
  Cc: pbonzini@redhat.com, Yan Y Zhao, kvm@vger.kernel.org, Fan Du,
	Xiaoyao Li, Chao Gao, Dave Hansen, thomas.lendacky@amd.com,
	vbabka@suse.cz, tabba@google.com, david@kernel.org,
	kas@kernel.org, michael.roth@amd.com, Ira Weiny,
	linux-kernel@vger.kernel.org, binbin.wu@linux.intel.com,
	ackerleytng@google.com, nik.borisov@suse.com, Isaku Yamahata,
	Chao P Peng, francescolavra.fl@gmail.com, sagis@google.com,
	Vishal Annapurve, Rick P Edgecombe, Jun Miao, jgross@suse.com,
	pgonda@google.com, x86@kernel.org

On Wed, Jan 21, 2026, Kai Huang wrote:
> On Tue, 2026-01-06 at 18:23 +0800, Yan Zhao wrote:
> I have been thinking whether we can simplify the solution, not only just
> for avoiding this complicated memory cache topup-then-consume mechanism
> under MMU read lock, but also for avoiding kinda duplicated code about how
> to calculate how many DPAMT pages are needed for topup etc. between your next
> patch and similar code in DPAMT series for the per-vCPU cache.
> 
> IIRC, the per-VM DPAMT cache (in your next patch) covers both S-EPT pages
> and the mapped 2M range when splitting.
> 
> - For S-EPT pages, they are _ALWAYS_ 4K, so we can actually use
> tdx_alloc_page() directly which also handles DPAMT pages internally.
> 
> Here in tdp_mmu_alloc_sp_for_split():
> 
> 	sp->external_spt = tdx_alloc_page();
> 
> For the fault path we need to use the normal 'kvm_mmu_memory_cache' but
> that's a per-vCPU cache which doesn't have the pain of the per-VM cache.  As I
> mentioned in v3, I believe we can also hook to use tdx_alloc_page() if we
> add two new obj_alloc()/free() callback to 'kvm_mmu_memory_cache':
> 
> https://lore.kernel.org/kvm/9e72261602bdab914cf7ff6f7cb921e35385136e.camel@intel.com/
> 
> So we can get rid of the per-VM DPAMT cache for S-EPT pages.
> 
> - For DPAMT pages for the TDX guest private memory, I think we can also
> get rid of the per-VM DPAMT cache if we use 'kvm_mmu_page' to carry the
> needed DPAMT pages:
> 
> --- a/arch/x86/kvm/mmu/mmu_internal.h
> +++ b/arch/x86/kvm/mmu/mmu_internal.h
> @@ -111,6 +111,7 @@ struct kvm_mmu_page {
>                  * Passed to TDX module, not accessed by KVM.
>                  */
>                 void *external_spt;
> +               void *leaf_level_private;
>         };

There's no need to put this in with external_spt, we could throw it in a new union
with unsync_child_bitmap (TDP MMU can't have unsync children).  IIRC, the main
reason I've never suggested unionizing unsync_child_bitmap is that overloading
the bitmap would risk corruption if KVM ever marked a TDP MMU page as unsync, but
that's easy enough to guard against:

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 3d568512201d..d6c6768c1f50 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -1917,9 +1917,10 @@ static void kvm_mmu_mark_parents_unsync(struct kvm_mmu_page *sp)
 
 static void mark_unsync(u64 *spte)
 {
-       struct kvm_mmu_page *sp;
+       struct kvm_mmu_page *sp = sptep_to_sp(spte);
 
-       sp = sptep_to_sp(spte);
+       if (WARN_ON_ONCE(is_tdp_mmu_page(sp)))
+               return;
        if (__test_and_set_bit(spte_index(spte), sp->unsync_child_bitmap))
                return;
        if (sp->unsync_children++)


I might send a patch to do that even if we don't overload the bitmap, as a
hardening measure.

> Then we can define a structure which contains DPAMT pages for a given 2M
> range:
> 
> 	struct tdx_dpamt_metadata {
> 		struct page *page1;
> 		struct page *page2;
> 	};
> 
> Then when we allocate sp->external_spt, we can also allocate it for
> leaf_level_private via a kvm_x86_ops call when the 'sp' is actually the
> last level page table.
> 
> In this case, I think we can get rid of the per-VM DPAMT cache?
> 
> For the fault path, similarly, I believe we can use a per-vCPU cache for
> 'struct tdx_dpamt_metadata' if we utilize the two new obj_alloc()/free()
> hooks.
> 
> The cost is the new 'leaf_level_private' takes an additional 8 bytes for non-
> TDX guests even though it is never used, but if what I said above is feasible,
> maybe it's worth the cost.
> 
> But it's completely possible that I missed something.  Any thoughts?

I *LOVE* the core idea (seriously, this made my week), though I think we should
take it a step further and _immediately_ do DPAMT maintenance on allocation.
I.e. do tdx_pamt_get() via tdx_alloc_control_page() when KVM tops up the S-EPT
SP cache instead of waiting until KVM links the SP.  Then KVM doesn't need to
track PAMT pages except for memory that is mapped into a guest, and we end up
with better symmetry and more consistency throughout TDX.  E.g. all pages that
KVM allocates and gifts to the TDX-Module will be allocated and freed via the same
TDX APIs.

Absolute worst case scenario, KVM allocates 40 (KVM's SP cache capacity) PAMT
entries per-vCPU that end up being freed without ever being gifted to the TDX-Module.
But I doubt that will be a problem in practice, because odds are good the adjacent
pages/pfns will already have been consumed, i.e. the "speculative" allocation is
really just bumping the refcount.  And _if_ it's a problem, e.g. results in too
many wasted DPAMT entries, then it's one we can solve in KVM by tuning the cache
capacity to less aggressively allocate DPAMT entries.
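
The refcount observation above can be illustrated with a toy model of
dynamic PAMT refcounting at 2M granularity (all names are made up for
illustration and do not exist in the kernel): the first "get" for a 2M
region installs the backing pages, while later gets only bump a
refcount, which is why speculative gets at cache-topup time are usually
cheap when adjacent pfns were already consumed.

```c
#include <assert.h>

#define NR_2M_REGIONS	8
#define PAGES_PER_2M	512	/* 4K pages per 2M region */

static int pamt_refcount[NR_2M_REGIONS];
static int pamt_installs;	/* real installations performed */

static void toy_pamt_get(unsigned long pfn_4k)
{
	unsigned long idx = pfn_4k / PAGES_PER_2M;

	if (pamt_refcount[idx]++ == 0)
		pamt_installs++;	/* would allocate pages + PAMT add here */
}

static void toy_pamt_put(unsigned long pfn_4k)
{
	unsigned long idx = pfn_4k / PAGES_PER_2M;

	if (--pamt_refcount[idx] == 0)
		pamt_installs--;	/* would PAMT remove + free here */
}
```

In this model, a "wasted" speculative get for an already-covered region
costs nothing but the increment.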

I'll send compile-tested v4 for the DPAMT series later today (I think I can get
it out today), as I have other non-trivial feedback that I've accumulated when
going through the patches.

^ permalink raw reply related	[flat|nested] 127+ messages in thread

* Re: [PATCH v3 19/24] KVM: x86: Introduce per-VM external cache for splitting
  2026-01-21 17:30     ` Sean Christopherson
@ 2026-01-21 19:39       ` Edgecombe, Rick P
  2026-01-21 23:01       ` Huang, Kai
  2026-01-22  7:03       ` Yan Zhao
  2 siblings, 0 replies; 127+ messages in thread
From: Edgecombe, Rick P @ 2026-01-21 19:39 UTC (permalink / raw)
  To: seanjc@google.com, Huang, Kai
  Cc: Du, Fan, Li, Xiaoyao, kvm@vger.kernel.org, Hansen, Dave,
	Zhao, Yan Y, thomas.lendacky@amd.com, vbabka@suse.cz,
	tabba@google.com, david@kernel.org, michael.roth@amd.com,
	linux-kernel@vger.kernel.org, binbin.wu@linux.intel.com,
	pbonzini@redhat.com, Peng, Chao P, Weiny, Ira, kas@kernel.org,
	nik.borisov@suse.com, ackerleytng@google.com,
	francescolavra.fl@gmail.com, Yamahata, Isaku, sagis@google.com,
	Gao, Chao, Miao, Jun, Annapurve, Vishal, jgross@suse.com,
	pgonda@google.com, x86@kernel.org

On Wed, 2026-01-21 at 09:30 -0800, Sean Christopherson wrote:
> I *LOVE* the core idea (seriously, this made my week), though I think we should
> take it a step further and _immediately_ do DPAMT maintenance on allocation.
> I.e. do tdx_pamt_get() via tdx_alloc_control_page() when KVM tops up the S-EPT
> SP cache instead of waiting until KVM links the SP.  Then KVM doesn't need to
> track PAMT pages except for memory that is mapped into a guest, and we end up
> with better symmetry and more consistency throughout TDX.  E.g. all pages that
> KVM allocates and gifts to the TDX-Module will be allocated and freed via the same
> TDX APIs.
> 
> Absolute worst case scenario, KVM allocates 40 (KVM's SP cache capacity) PAMT
> entries per-vCPU that end up being freed without ever being gifted to the TDX-Module.
> But I doubt that will be a problem in practice, because odds are good the adjacent
> pages/pfns will already have been consumed, i.e. the "speculative" allocation is
> really just bumping the refcount.  And _if_ it's a problem, e.g. results in too
> many wasted DPAMT entries, then it's one we can solve in KVM by tuning the cache
> capacity to less aggressively allocate DPAMT entries.

It doesn't sound like much impact. Especially given we earlier considered
installing DPAMT for all TDX capable memory to try to simplify things.

> 
> I'll send compile-tested v4 for the DPAMT series later today (I think I can get
> it out today), as I have other non-trivial feedback that I've accumulated when
> going through the patches.

Interesting idea. I have a local branch with the rest of the feedback and a few
other tweaks. Anything I can do to help?

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [PATCH v3 19/24] KVM: x86: Introduce per-VM external cache for splitting
  2026-01-21 17:30     ` Sean Christopherson
  2026-01-21 19:39       ` Edgecombe, Rick P
@ 2026-01-21 23:01       ` Huang, Kai
  2026-01-22  7:03       ` Yan Zhao
  2 siblings, 0 replies; 127+ messages in thread
From: Huang, Kai @ 2026-01-21 23:01 UTC (permalink / raw)
  To: seanjc@google.com
  Cc: Du, Fan, Li, Xiaoyao, kvm@vger.kernel.org, Hansen, Dave,
	Zhao, Yan Y, thomas.lendacky@amd.com, vbabka@suse.cz,
	tabba@google.com, david@kernel.org, michael.roth@amd.com,
	linux-kernel@vger.kernel.org, binbin.wu@linux.intel.com,
	pbonzini@redhat.com, Peng, Chao P, Weiny, Ira, kas@kernel.org,
	nik.borisov@suse.com, ackerleytng@google.com,
	francescolavra.fl@gmail.com, Yamahata, Isaku, sagis@google.com,
	Gao, Chao, Edgecombe, Rick P, Miao, Jun, Annapurve, Vishal,
	jgross@suse.com, pgonda@google.com, x86@kernel.org

On Wed, 2026-01-21 at 09:30 -0800, Sean Christopherson wrote:
> On Wed, Jan 21, 2026, Kai Huang wrote:
> > On Tue, 2026-01-06 at 18:23 +0800, Yan Zhao wrote:
> > I have been thinking whether we can simplify the solution, not only just
> > for avoiding this complicated memory cache topup-then-consume mechanism
> > under MMU read lock, but also for avoiding kinda duplicated code about how
> > to calculate how many DPAMT pages are needed for topup etc. between your next
> > patch and similar code in DPAMT series for the per-vCPU cache.
> > 
> > IIRC, the per-VM DPAMT cache (in your next patch) covers both S-EPT pages
> > and the mapped 2M range when splitting.
> > 
> > - For S-EPT pages, they are _ALWAYS_ 4K, so we can actually use
> > tdx_alloc_page() directly which also handles DPAMT pages internally.
> > 
> > Here in tdp_mmu_alloc_sp_for_split():
> > 
> > 	sp->external_spt = tdx_alloc_page();
> > 
> > For the fault path we need to use the normal 'kvm_mmu_memory_cache' but
> > that's a per-vCPU cache which doesn't have the pain of the per-VM cache.  As I
> > mentioned in v3, I believe we can also hook to use tdx_alloc_page() if we
> > add two new obj_alloc()/free() callback to 'kvm_mmu_memory_cache':
> > 
> > https://lore.kernel.org/kvm/9e72261602bdab914cf7ff6f7cb921e35385136e.camel@intel.com/
> > 
> > So we can get rid of the per-VM DPAMT cache for S-EPT pages.
> > 
> > - For DPAMT pages for the TDX guest private memory, I think we can also
> > get rid of the per-VM DPAMT cache if we use 'kvm_mmu_page' to carry the
> > needed DPAMT pages:
> > 
> > --- a/arch/x86/kvm/mmu/mmu_internal.h
> > +++ b/arch/x86/kvm/mmu/mmu_internal.h
> > @@ -111,6 +111,7 @@ struct kvm_mmu_page {
> >                  * Passed to TDX module, not accessed by KVM.
> >                  */
> >                 void *external_spt;
> > +               void *leaf_level_private;
> >         };
> 
> There's no need to put this in with external_spt, we could throw it in a new union
> with unsync_child_bitmap (TDP MMU can't have unsync children).
> 

Agreed.

> IIRC, the main
> reason I've never suggested unionizing unsync_child_bitmap is that overloading
> the bitmap would risk corruption if KVM ever marked a TDP MMU page as unsync, but
> that's easy enough to guard against:
> 
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 3d568512201d..d6c6768c1f50 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -1917,9 +1917,10 @@ static void kvm_mmu_mark_parents_unsync(struct kvm_mmu_page *sp)
>  
>  static void mark_unsync(u64 *spte)
>  {
> -       struct kvm_mmu_page *sp;
> +       struct kvm_mmu_page *sp = sptep_to_sp(spte);
>  
> -       sp = sptep_to_sp(spte);
> +       if (WARN_ON_ONCE(is_tdp_mmu_page(sp)))
> +               return;
>         if (__test_and_set_bit(spte_index(spte), sp->unsync_child_bitmap))
>                 return;
>         if (sp->unsync_children++)
> 

LGTM.

> 
> I might send a patch to do that even if we don't overload the bitmap, as a
> hardening measure.
> 
> > Then we can define a structure which contains DPAMT pages for a given 2M
> > range:
> > 
> > 	struct tdx_dpamt_metadata {
> > 		struct page *page1;
> > 		struct page *page2;
> > 	};
> > 
> > Then when we allocate sp->external_spt, we can also allocate it for
> > leaf_level_private via a kvm_x86_ops call when the 'sp' is actually the
> > last level page table.
> > 
> > In this case, I think we can get rid of the per-VM DPAMT cache?
> > 
> > For the fault path, similarly, I believe we can use a per-vCPU cache for
> > 'struct tdx_dpamt_metadata' if we utilize the two new obj_alloc()/free()
> > hooks.
> > 
> > The cost is the new 'leaf_level_private' takes an additional 8 bytes for non-
> > TDX guests even though it is never used, but if what I said above is feasible,
> > maybe it's worth the cost.
> > 
> > But it's completely possible that I missed something.  Any thoughts?
> 
> I *LOVE* the core idea (seriously, this made my week), though I think we should
> take it a step further and _immediately_ do DPAMT maintenance on allocation.
> I.e. do tdx_pamt_get() via tdx_alloc_control_page() when KVM tops up the S-EPT
> SP cache instead of waiting until KVM links the SP.  Then KVM doesn't need to
> track PAMT pages except for memory that is mapped into a guest, and we end up
> with better symmetry and more consistency throughout TDX.  E.g. all pages that
> KVM allocates and gifts to the TDX-Module will be allocated and freed via the same
> TDX APIs.

Agreed.

Nit: do you still want to use the name tdx_alloc_control_page() instead of
tdx_alloc_page() if it is also used for S-EPT pages?  Kinda not sure but
both work for me.

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [PATCH v3 11/24] KVM: x86/mmu: Introduce kvm_split_cross_boundary_leafs()
  2026-01-20 17:51         ` Sean Christopherson
@ 2026-01-22  6:27           ` Yan Zhao
  0 siblings, 0 replies; 127+ messages in thread
From: Yan Zhao @ 2026-01-22  6:27 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Kai Huang, pbonzini@redhat.com, kvm@vger.kernel.org, Fan Du,
	Xiaoyao Li, Chao Gao, Dave Hansen, thomas.lendacky@amd.com,
	vbabka@suse.cz, tabba@google.com, david@kernel.org,
	kas@kernel.org, michael.roth@amd.com, Ira Weiny,
	linux-kernel@vger.kernel.org, binbin.wu@linux.intel.com,
	ackerleytng@google.com, nik.borisov@suse.com, Isaku Yamahata,
	Chao P Peng, francescolavra.fl@gmail.com, sagis@google.com,
	Vishal Annapurve, Rick P Edgecombe, Jun Miao, jgross@suse.com,
	pgonda@google.com, x86@kernel.org

On Tue, Jan 20, 2026 at 09:51:06AM -0800, Sean Christopherson wrote:
> On Mon, Jan 19, 2026, Yan Zhao wrote:
> > On Fri, Jan 16, 2026 at 03:39:13PM -0800, Sean Christopherson wrote:
> > > On Thu, Jan 15, 2026, Kai Huang wrote:
> > > > So how about:
> > > > 
> > > > Rename kvm_tdp_mmu_try_split_huge_pages() to
> > > > kvm_tdp_mmu_split_huge_pages_log_dirty(), and rename
> > > > kvm_tdp_mmu_gfn_range_split_cross_boundary_leafs() to
> > > > kvm_tdp_mmu_split_huge_pages_cross_boundary()
> > > > 
> > > > ?
> > > 
> > > I find the "cross_boundary" terminology extremely confusing.  I also dislike
> > > the concept itself, in the sense that it shoves a weird, specific concept into
> > > the guts of the TDP MMU.
> > > The other wart is that it's inefficient when punching a large hole.  E.g. say
> > > there's a 16TiB guest_memfd instance (no idea if that's even possible), and then
> > > userspace punches a 12TiB hole.  Walking all ~12TiB just to _maybe_ split the head
> > > and tail pages is asinine.
> > That's a reasonable concern. I actually thought about it.
> > My consideration was as follows:
> > Currently, we don't have such large areas. Usually, the conversion ranges are
> > less than 1GB. 
> 
> Nothing guarantees that behavior.
> 
> > Though the initial conversion which converts all memory from private to
> > shared may be wide, there are usually no mappings at that stage. So, the
> > traversal should be very fast (since the traversal doesn't even need to go
> > down to the 2MB/1GB level).
> > 
> > If the caller of kvm_split_cross_boundary_leafs() finds it needs to convert a
> > very large range at runtime, it can optimize by invoking the API twice:
> > once for range [start, ALIGN(start, 1GB)), and
> > once for range [ALIGN_DOWN(end, 1GB), end).
> > 
> > I can also implement this optimization within kvm_split_cross_boundary_leafs()
> > by checking the range size if you think that would be better.
> > 
> > > And once kvm_arch_pre_set_memory_attributes() is dropped, I'm pretty sure the
> > > _only_ usage is for guest_memfd PUNCH_HOLE, because unless I'm misreading the
> > > code, the usage in tdx_honor_guest_accept_level() is superfluous and confusing.
> > Sorry for the confusion about the usage of tdx_honor_guest_accept_level(). I
> > should add a better comment.
> > 
> > There are 4 use cases for the API kvm_split_cross_boundary_leafs():
> > 1. PUNCH_HOLE
> > 2. KVM_SET_MEMORY_ATTRIBUTES2, which invokes kvm_gmem_set_attributes() for
> >    private-to-shared conversions
> > 3. tdx_honor_guest_accept_level()
> > 4. kvm_gmem_error_folio()
> > 
> > Use cases 1-3 are already in the current code. Use case 4 is per our discussion,
> > and will be implemented in the next version (because guest_memfd may split
> > folios without first splitting S-EPT).
> > 
> > The 4 use cases can be divided into two categories:
> > 
> > 1. Category 1: use cases 1, 2, 4
> >    We must ensure GFN start - 1 and GFN start are not mapped in a single
> >    mapping. However, for GFN start or GFN start - 1 specifically, we don't care
> >    about their actual mapping levels, which means they are free to be mapped at
> >    2MB or 1GB. The same applies to GFN end - 1 and GFN end.
> > 
> >    --|------------------|-----------
> >      ^                  ^
> >     start              end - 1 
> > 
> > 2. Category 2: use case 3
> >    It cares about the mapping level of the GFN, i.e., it must not be mapped
> >    above a certain level.
> > 
> >    -----|-------
> >         ^
> >        GFN
> > 
> >    So, to unify the two categories, I have tdx_honor_guest_accept_level() check
> >    the range of [level-aligned GFN, level-aligned GFN + level size). e.g.,
> >    If the accept level is 2MB, only a 1GB mapping can fall outside the
> >    range and need splitting.
> 
> But that overlooks the fact that Category 2 already fits the existing "category"
> that is supported by the TDP MMU.  I.e. Category 1 is (somewhat) new and novel,
> Category 2 is not.
> 
> >    -----|-------------|---
> >         ^             ^
> >         |             |
> >    level-aligned     level-aligned
> >       GFN            GFN + level size - 1
> > 
> > 
> > > For the EPT violation case, the guest is accepting a page.  Just split to the
> > > guest's accepted level, I don't see any reason to make things more complicated
> > > than that.
> > This use case could reuse the kvm_mmu_try_split_huge_pages() API, except that we
> > need a return value.
> 
> Just expose tdp_mmu_split_huge_pages_root(), the fault path only _needs_ to split
> the current root, and in fact shouldn't even try to split other roots (ignoring
> that no other relevant roots exist).
Ok.

> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 9c26038f6b77..7d924da75106 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -1555,10 +1555,9 @@ static int tdp_mmu_split_huge_page(struct kvm *kvm, struct tdp_iter *iter,
>         return ret;
>  }
>  
> -static int tdp_mmu_split_huge_pages_root(struct kvm *kvm,
> -                                        struct kvm_mmu_page *root,
> -                                        gfn_t start, gfn_t end,
> -                                        int target_level, bool shared)
> +int tdp_mmu_split_huge_pages_root(struct kvm *kvm, struct kvm_mmu_page *root,
> +                                 gfn_t start, gfn_t end, int target_level,
> +                                 bool shared)
>  {
>         struct kvm_mmu_page *sp = NULL;
>         struct tdp_iter iter;
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
> index bd62977c9199..ea9a509608fb 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.h
> +++ b/arch/x86/kvm/mmu/tdp_mmu.h
> @@ -93,6 +93,9 @@ bool kvm_tdp_mmu_write_protect_gfn(struct kvm *kvm,
>                                    struct kvm_memory_slot *slot, gfn_t gfn,
>                                    int min_level);
>  
> +int tdp_mmu_split_huge_pages_root(struct kvm *kvm, struct kvm_mmu_page *root,
> +                                 gfn_t start, gfn_t end, int target_level,
> +                                 bool shared);
>  void kvm_tdp_mmu_try_split_huge_pages(struct kvm *kvm,
>                                       const struct kvm_memory_slot *slot,
>                                       gfn_t start, gfn_t end,
> 
> > > And then for the PUNCH_HOLE case, do the math to determine which, if any, head
> > > and tail pages need to be split, and use the existing APIs to make that happen.
> > This use case cannot reuse kvm_mmu_try_split_huge_pages() without modification.
> 
> Modifying existing code is a non-issue, and you're already modifying TDP MMU
> functions, so I don't see that as a reason for choosing X instead of Y.
> 
> > Or which existing APIs are you referring to?
> 
> See above.
Ok. Do you like the idea of introducing an only_cross_boundary flag (or
something with a different name) to tdp_mmu_split_huge_pages_root()?
If not, could I expose a helper to help calculate the range?


> > The cross_boundary information is still useful?
> > 
> > BTW: Currently, kvm_split_cross_boundary_leafs() internally reuses
> > tdp_mmu_split_huge_pages_root() (as shown below).
> > 
> > kvm_split_cross_boundary_leafs
> >   kvm_tdp_mmu_gfn_range_split_cross_boundary_leafs
> >     tdp_mmu_split_huge_pages_root
> > 
> > However, tdp_mmu_split_huge_pages_root() is originally used to split huge
> > mappings in a wide range, so it temporarily releases mmu_lock for memory
> > allocation for sp, since it can't predict how many pages to pre-allocate in the
> > KVM mmu cache.
> > 
> > For kvm_split_cross_boundary_leafs(), we can actually predict the max number of
> > pages to pre-allocate. If we don't reuse tdp_mmu_split_huge_pages_root(), we can
> > allocate sp, sp->spt, sp->external_spt and DPAMT pages from the KVM mmu cache
> > without releasing mmu_lock and invoking tdp_mmu_alloc_sp_for_split().
> 
> That's completely orthogonal to the "only need to maybe split head and tail pages".
> E.g. kvm_tdp_mmu_try_split_huge_pages() can also predict the _max_ number of pages
> to pre-allocate, it's just not worth adding a kvm_mmu_memory_cache for that use
> case because that path can drop mmu_lock at will, unlike the full page fault path.
> I.e. the complexity doesn't justify the benefits, especially since the max number
> of pages is so large.
Right, it's technically feasible, but not practical.
To split a huge range, e.g. 16GB, down to 4KB, the _max_ number of pages to
pre-allocate is too large: it's 16*512=8192 pages even without TDX.

> AFAICT, the only pre-allocation that is _necessary_ is for the dynamic PAMT,
Yes, patch 20 in this series just pre-allocates DPAMT pages for splitting.

See:
static int tdx_min_split_cache_sz(struct kvm *kvm, int level)
{
	KVM_BUG_ON(level != PG_LEVEL_2M, kvm);

	if (!tdx_supports_dynamic_pamt(tdx_sysinfo))
		return 0;

	return tdx_dpamt_entry_pages() * 2;
}

> because the allocation is done outside of KVM's control.  But that's a solvable
> problem, the tricky part is protecting the PAMT cache for PUNCH_HOLE, but that
> too is solvable, e.g. by adding a per-VM mutex that's taken by kvm_gmem_punch_hole()
I don't get why only the PUNCH_HOLE case needs to be protected.
It's not guaranteed that KVM_SET_MEMORY_ATTRIBUTES2 ioctls are issued from
vCPU contexts.

BTW: the split_external_spte() hook is invoked under mmu_lock, so I used a
spinlock kvm_tdx->prealloc_split_cache_lock to protect cache enqueuing
and dequeuing.

> to handle the PUNCH_HOLE case, and then using the per-vCPU cache when splitting
> for a mismatched accept.
Yes, I plan to use a per-vCPU cache in the future, e.g., when splitting under
shared mmu_lock to honor the guest accept level.



* Re: [PATCH v3 11/24] KVM: x86/mmu: Introduce kvm_split_cross_boundary_leafs()
  2026-01-20 18:02         ` Sean Christopherson
@ 2026-01-22  6:33           ` Yan Zhao
  2026-01-29 14:51             ` Sean Christopherson
  0 siblings, 1 reply; 127+ messages in thread
From: Yan Zhao @ 2026-01-22  6:33 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Vishal Annapurve, Kai Huang, pbonzini@redhat.com,
	kvm@vger.kernel.org, Fan Du, Xiaoyao Li, Chao Gao, Dave Hansen,
	thomas.lendacky@amd.com, vbabka@suse.cz, tabba@google.com,
	david@kernel.org, kas@kernel.org, michael.roth@amd.com, Ira Weiny,
	linux-kernel@vger.kernel.org, binbin.wu@linux.intel.com,
	ackerleytng@google.com, nik.borisov@suse.com, Isaku Yamahata,
	Chao P Peng, francescolavra.fl@gmail.com, sagis@google.com,
	Rick P Edgecombe, Jun Miao, jgross@suse.com, pgonda@google.com,
	x86@kernel.org

On Tue, Jan 20, 2026 at 10:02:41AM -0800, Sean Christopherson wrote:
> On Tue, Jan 20, 2026, Vishal Annapurve wrote:
> > On Fri, Jan 16, 2026 at 3:39 PM Sean Christopherson <seanjc@google.com> wrote:
> > >
> > > On Thu, Jan 15, 2026, Kai Huang wrote:
> > > > static int __kvm_tdp_mmu_split_huge_pages(struct kvm *kvm,
> > > >                                         struct kvm_gfn_range *range,
> > > >                                         int target_level,
> > > >                                         bool shared,
> > > >                                         bool cross_boundary_only)
> > > > {
> > > >       ...
> > > > }
> > > >
> > > > And by using this helper, I found the name of the two wrapper functions
> > > > are not ideal:
> > > >
> > > > kvm_tdp_mmu_try_split_huge_pages() is only for log dirty, and it should
> > > > not be reachable for TD (VM with mirrored PT).  But currently it uses
> > > > KVM_VALID_ROOTS for root filter thus mirrored PT is also included.  I
> > > > think it's better to rename it, e.g., at least with "log_dirty" in the
> > > > name so it's more clear this function is only for dealing log dirty (at
> > > > least currently).  We can also add a WARN() if it's called for VM with
> > > > mirrored PT but it's a different topic.
> > > >
> > > > kvm_tdp_mmu_gfn_range_split_cross_boundary_leafs() doesn't have
> > > > "huge_pages", which isn't consistent with the other.  And it is a bit
> > > > long.  If we don't have "gfn_range" in __kvm_tdp_mmu_split_huge_pages(),
> > > > then I think we can remove "gfn_range" from
> > > > kvm_tdp_mmu_gfn_range_split_cross_boundary_leafs() too to make it shorter.
> > > >
> > > > So how about:
> > > >
> > > > Rename kvm_tdp_mmu_try_split_huge_pages() to
> > > > kvm_tdp_mmu_split_huge_pages_log_dirty(), and rename
> > > > kvm_tdp_mmu_gfn_range_split_cross_boundary_leafs() to
> > > > kvm_tdp_mmu_split_huge_pages_cross_boundary()
> > > >
> > > > ?
> > >
> > > I find the "cross_boundary" terminology extremely confusing.  I also dislike
> > > the concept itself, in the sense that it shoves a weird, specific concept into
> > > the guts of the TDP MMU.
> > >
> > > The other wart is that it's inefficient when punching a large hole.  E.g. say
> > > there's a 16TiB guest_memfd instance (no idea if that's even possible), and then
> > > userspace punches a 12TiB hole.  Walking all ~12TiB just to _maybe_ split the head
> > > and tail pages is asinine.
> > >
> > > And once kvm_arch_pre_set_memory_attributes() is dropped, I'm pretty sure the
> > > _only_ usage is for guest_memfd PUNCH_HOLE, because unless I'm misreading the
> > > code, the usage in tdx_honor_guest_accept_level() is superfluous and confusing.
> > >
> > > For the EPT violation case, the guest is accepting a page.  Just split to the
> > > guest's accepted level, I don't see any reason to make things more complicated
> > > than that.
> > >
> > > And then for the PUNCH_HOLE case, do the math to determine which, if any, head
> > > and tail pages need to be split, and use the existing APIs to make that happen.
> > 
> > Just a note: Through guest_memfd upstream syncs, we agreed that
> > guest_memfd will only allow the punch_hole operation for huge page
> > size-aligned ranges for hugetlb and thp backing. i.e. the PUNCH_HOLE
> > operation doesn't need to split any EPT mappings for the foreseeable
> > future.
> 
> Oh!  Right, forgot about that.  It's the conversion path that we need to sort out,
> not PUNCH_HOLE.  Thanks for the reminder!
Hmm, I see.
However, do you think it's better to leave the splitting logic in PUNCH_HOLE as
well? E.g., guest_memfd may want to map several folios in a single mapping in
the future, i.e., once *max_order > folio_order(folio) is allowed.


* Re: [PATCH v3 19/24] KVM: x86: Introduce per-VM external cache for splitting
  2026-01-21 17:30     ` Sean Christopherson
  2026-01-21 19:39       ` Edgecombe, Rick P
  2026-01-21 23:01       ` Huang, Kai
@ 2026-01-22  7:03       ` Yan Zhao
  2026-01-22  7:30         ` Huang, Kai
  2 siblings, 1 reply; 127+ messages in thread
From: Yan Zhao @ 2026-01-22  7:03 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Kai Huang, pbonzini@redhat.com, kvm@vger.kernel.org, Fan Du,
	Xiaoyao Li, Chao Gao, Dave Hansen, thomas.lendacky@amd.com,
	vbabka@suse.cz, tabba@google.com, david@kernel.org,
	kas@kernel.org, michael.roth@amd.com, Ira Weiny,
	linux-kernel@vger.kernel.org, binbin.wu@linux.intel.com,
	ackerleytng@google.com, nik.borisov@suse.com, Isaku Yamahata,
	Chao P Peng, francescolavra.fl@gmail.com, sagis@google.com,
	Vishal Annapurve, Rick P Edgecombe, Jun Miao, jgross@suse.com,
	pgonda@google.com, x86@kernel.org

On Wed, Jan 21, 2026 at 09:30:28AM -0800, Sean Christopherson wrote:
> On Wed, Jan 21, 2026, Kai Huang wrote:
> > On Tue, 2026-01-06 at 18:23 +0800, Yan Zhao wrote:
> > I have been thinking whether we can simplify the solution, not only just
> > for avoiding this complicated memory cache topup-then-consume mechanism
> > under MMU read lock, but also for avoiding kinda duplicated code about how
> > to calculate how many DPAMT pages needed to topup etc between your next
> > patch and similar code in DPAMT series for the per-vCPU cache.
> > 
> > IIRC, the per-VM DPAMT cache (in your next patch) covers both S-EPT pages
> > and the mapped 2M range when splitting.
> > 
> > - For S-EPT pages, they are _ALWAYS_ 4K, so we can actually use
> > tdx_alloc_page() directly which also handles DPAMT pages internally.
> > 
> > Here in tdp_mmmu_alloc_sp_for_split():
> > 
> > 	sp->external_spt = tdx_alloc_page();
> > 
> > For the fault path we need to use the normal 'kvm_mmu_memory_cache' but
> > that's per-vCPU cache which doesn't have the pain of per-VM cache.  As I
> > mentioned in v3, I believe we can also hook to use tdx_alloc_page() if we
> > add two new obj_alloc()/free() callback to 'kvm_mmu_memory_cache':
> > 
> > https://lore.kernel.org/kvm/9e72261602bdab914cf7ff6f7cb921e35385136e.camel@intel.com/
> > 
> > So we can get rid of the per-VM DPAMT cache for S-EPT pages.
> > 
> > - For DPAMT pages for the TDX guest private memory, I think we can also
> > get rid of the per-VM DPAMT cache if we use 'kvm_mmu_page' to carry the
> > needed DPAMT pages:
> > 
> > --- a/arch/x86/kvm/mmu/mmu_internal.h
> > +++ b/arch/x86/kvm/mmu/mmu_internal.h
> > @@ -111,6 +111,7 @@ struct kvm_mmu_page {
> >                  * Passed to TDX module, not accessed by KVM.
> >                  */
> >                 void *external_spt;
> > +               void *leaf_level_private;
> >         };
> 
> There's no need to put this in with external_spt, we could throw it in a new union
> with unsync_child_bitmap (TDP MMU can't have unsync children).  IIRC, the main
> reason I've never suggested unionizing unsync_child_bitmap is that overloading
> the bitmap would risk corruption if KVM ever marked a TDP MMU page as unsync, but
> that's easy enough to guard against:
> 
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 3d568512201d..d6c6768c1f50 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -1917,9 +1917,10 @@ static void kvm_mmu_mark_parents_unsync(struct kvm_mmu_page *sp)
>  
>  static void mark_unsync(u64 *spte)
>  {
> -       struct kvm_mmu_page *sp;
> +       struct kvm_mmu_page *sp = sptep_to_sp(spte);
>  
> -       sp = sptep_to_sp(spte);
> +       if (WARN_ON_ONCE(is_tdp_mmu_page(sp)))
> +               return;
>         if (__test_and_set_bit(spte_index(spte), sp->unsync_child_bitmap))
>                 return;
>         if (sp->unsync_children++)
> 
> 
> I might send a patch to do that even if we don't overload the bitmap, as a
> hardening measure.
> 
> > Then we can define a structure which contains DPAMT pages for a given 2M
> > range:
> > 
> > 	struct tdx_dpamt_metadata {
> > 		struct page *page1;
> > 		struct page *page2;
> > 	};

Note: we need 4 pages to split a 2MB range, 2 for the new S-EPT page, 2 for the
2MB guest memory range.


> > Then when we allocate sp->external_spt, we can also allocate it for
> > leaf_level_private via kvm_x86_ops call when the 'sp' is actually the
> > last level page table.
> > 
> > In this case, I think we can get rid of the per-VM DPAMT cache?
> > 
> > For the fault path, similarly, I believe we can use a per-vCPU cache for
> > 'struct tdx_dpamt_metadata' if we utilize the two new obj_alloc()/free()
> > hooks.
> > 
> > The cost is that the new 'leaf_level_private' takes an additional 8 bytes for
> > non-TDX guests even though it is never used, but if what I said above is feasible,
> > maybe it's worth the cost.
> > 
> > But it's completely possible that I missed something.  Any thoughts?
> 
> I *LOVE* the core idea (seriously, this made my week), though I think we should
Me too!

> take it a step further and _immediately_ do DPAMT maintenance on allocation.
> I.e. do tdx_pamt_get() via tdx_alloc_control_page() when KVM tops up the S-EPT
> SP cache instead of waiting until KVM links the SP.  Then KVM doesn't need to
> track PAMT pages except for memory that is mapped into a guest, and we end up
> with better symmetry and more consistency throughout TDX.  E.g. all pages that
> KVM allocates and gifts to the TDX-Module will be allocated and freed via the
> same TDX APIs.
Not sure if I understand this paragraph correctly.

I'm wondering if it can help us get rid of the asymmetry, e.g.:
When KVM wants to split a 2MB page, it allocates an sp for level 4K, which
carries 2 PAMT pages for the new S-EPT page.
During the split, the 2 PAMT pages are installed successfully, but the split
then fails due to a DEMOTE failure. In that case, it looks like KVM needs to
uninstall and free the 2 PAMT pages for the new S-EPT page, right?

However, some other thread may have taken a reference on the 2 PAMT pages for
an adjacent 4KB page within the same 2MB range as the new S-EPT page.
So, KVM still can't free the 2 PAMT pages it allocated.

Will check your patches for better understanding.

> Absolute worst case scenario, KVM allocates 40 (KVM's SP cache capacity) PAMT
> entries per-vCPU that end up being freed without ever being gifted to the TDX-Module.
> But I doubt that will be a problem in practice, because odds are good the adjacent
> pages/pfns will already have been consumed, i.e. the "speculative" allocation is
> really just bumping the refcount.  And _if_ it's a problem, e.g. results in too
> many wasted DPAMT entries, then it's one we can solve in KVM by tuning the cache
> capacity to less aggressively allocate DPAMT entries.
> 
> I'll send a compile-tested v4 for the DPAMT series later today (I think I can get
> it out today), as I have other non-trivial feedback that I've accumulated when
> going through the patches.


* Re: [PATCH v3 19/24] KVM: x86: Introduce per-VM external cache for splitting
  2026-01-22  7:03       ` Yan Zhao
@ 2026-01-22  7:30         ` Huang, Kai
  2026-01-22  7:49           ` Yan Zhao
  0 siblings, 1 reply; 127+ messages in thread
From: Huang, Kai @ 2026-01-22  7:30 UTC (permalink / raw)
  To: seanjc@google.com, Zhao, Yan Y
  Cc: Du, Fan, kvm@vger.kernel.org, Li, Xiaoyao, Hansen, Dave,
	thomas.lendacky@amd.com, tabba@google.com, vbabka@suse.cz,
	david@kernel.org, michael.roth@amd.com,
	linux-kernel@vger.kernel.org, binbin.wu@linux.intel.com,
	pbonzini@redhat.com, Peng, Chao P, ackerleytng@google.com,
	kas@kernel.org, nik.borisov@suse.com, Weiny, Ira,
	francescolavra.fl@gmail.com, Yamahata, Isaku, sagis@google.com,
	Gao, Chao, Edgecombe, Rick P, Miao, Jun, Annapurve, Vishal,
	jgross@suse.com, pgonda@google.com, x86@kernel.org

> > 
> > > Then we can define a structure which contains DPAMT pages for a given 2M
> > > range:
> > > 
> > > 	struct tdx_dpamt_metadata {
> > > 		struct page *page1;
> > > 		struct page *page2;
> > > 	};
> 
> Note: we need 4 pages to split a 2MB range, 2 for the new S-EPT page, 2 for the
> 2MB guest memory range.

In this proposal the pair for S-EPT is already handled by tdx_alloc_page()
(or tdx_alloc_control_page()):

  sp->external_spt = tdx_alloc_page();


* Re: [PATCH v3 19/24] KVM: x86: Introduce per-VM external cache for splitting
  2026-01-22  7:30         ` Huang, Kai
@ 2026-01-22  7:49           ` Yan Zhao
  2026-01-22 10:33             ` Huang, Kai
  0 siblings, 1 reply; 127+ messages in thread
From: Yan Zhao @ 2026-01-22  7:49 UTC (permalink / raw)
  To: Huang, Kai
  Cc: seanjc@google.com, Du, Fan, kvm@vger.kernel.org, Li, Xiaoyao,
	Hansen, Dave, thomas.lendacky@amd.com, tabba@google.com,
	vbabka@suse.cz, david@kernel.org, michael.roth@amd.com,
	linux-kernel@vger.kernel.org, binbin.wu@linux.intel.com,
	pbonzini@redhat.com, Peng, Chao P, ackerleytng@google.com,
	kas@kernel.org, nik.borisov@suse.com, Weiny, Ira,
	francescolavra.fl@gmail.com, Yamahata, Isaku, sagis@google.com,
	Gao, Chao, Edgecombe, Rick P, Miao, Jun, Annapurve, Vishal,
	jgross@suse.com, pgonda@google.com, x86@kernel.org

On Thu, Jan 22, 2026 at 03:30:16PM +0800, Huang, Kai wrote:
> > > 
> > > > Then we can define a structure which contains DPAMT pages for a given 2M
> > > > range:
> > > > 
> > > > 	struct tdx_dpamt_metadata {
> > > > 		struct page *page1;
> > > > 		struct page *page2;
> > > > 	};
> > 
> > Note: we need 4 pages to split a 2MB range, 2 for the new S-EPT page, 2 for the
> > 2MB guest memory range.
> 
> In this proposal the pair for S-EPT is already handled by tdx_alloc_page()
> (or tdx_alloc_control_page()):
> 
>   sp->external_spt = tdx_alloc_page();
Oh, ok.

So, in the fault path, sp->external_spt and sp->leaf_level_private come from
the fault cache.

In the non-vCPU split path, sp->external_spt comes from tdx_alloc_page(), and
sp->leaf_level_private from 2 get_zeroed_page() calls (or is the count from a
kvm_x86 hook?)



* Re: [PATCH v3 19/24] KVM: x86: Introduce per-VM external cache for splitting
  2026-01-22  7:49           ` Yan Zhao
@ 2026-01-22 10:33             ` Huang, Kai
  0 siblings, 0 replies; 127+ messages in thread
From: Huang, Kai @ 2026-01-22 10:33 UTC (permalink / raw)
  To: Zhao, Yan Y
  Cc: Du, Fan, kvm@vger.kernel.org, Li, Xiaoyao, Hansen, Dave,
	thomas.lendacky@amd.com, tabba@google.com, vbabka@suse.cz,
	david@kernel.org, michael.roth@amd.com,
	linux-kernel@vger.kernel.org, seanjc@google.com, Peng, Chao P,
	pbonzini@redhat.com, ackerleytng@google.com, kas@kernel.org,
	binbin.wu@linux.intel.com, Weiny, Ira, nik.borisov@suse.com,
	francescolavra.fl@gmail.com, Yamahata, Isaku, sagis@google.com,
	Gao, Chao, Edgecombe, Rick P, Miao, Jun, Annapurve, Vishal,
	jgross@suse.com, pgonda@google.com, x86@kernel.org

On Thu, 2026-01-22 at 15:49 +0800, Zhao, Yan Y wrote:
> On Thu, Jan 22, 2026 at 03:30:16PM +0800, Huang, Kai wrote:
> > > > 
> > > > > Then we can define a structure which contains DPAMT pages for a given 2M
> > > > > range:
> > > > > 
> > > > > 	struct tdx_dpamt_metadata {
> > > > > 		struct page *page1;
> > > > > 		struct page *page2;
> > > > > 	};
> > > 
> > > Note: we need 4 pages to split a 2MB range, 2 for the new S-EPT page, 2 for the
> > > 2MB guest memory range.
> > 
> > In this proposal the pair for S-EPT is already handled by tdx_alloc_page()
> > (or tdx_alloc_control_page()):
> > 
> >   sp->external_spt = tdx_alloc_page();
> Oh, ok.
> 
> So, in the fault path, sp->external_spt and sp->leaf_level_private are from
> fault cache.
> 
> In the non-vCPU split path, sp->external_spt is from tdx_alloc_page(),
> sp->leaf_level_private is from 2 get_zeroed_page() (or the count is from an
> x86_kvm hook ?)

The idea is that we can add two new hooks (e.g., obj_alloc()/obj_free()) to
'kvm_mmu_memory_cache' so that we can just hook tdx_alloc_page() into
vcpu->arch.mmu_external_spte_cache; that way KVM will also call
tdx_alloc_page() when topping up the memory cache for the S-EPT.

Ditto for DPAMT pair for the actual TDX guest private memory:

We can provide a helper to allocate a structure which contains a pair of
DPAMT pages.  For the split path we call that helper directly for
sp->leaf_level_private; for the fault path we can have another per-vCPU
'kvm_mmu_memory_cache' for the DPAMT pair, with obj_alloc() hooked to
that helper so that KVM will also call it when topping up the DPAMT
pair cache.

At least this is my idea.


* Re: [PATCH v3 06/24] KVM: x86/mmu: Disallow page merging (huge page adjustment) for mirror root
  2026-01-16  7:54     ` Yan Zhao
@ 2026-01-26 16:08       ` Sean Christopherson
  2026-01-27  3:40         ` Yan Zhao
  0 siblings, 1 reply; 127+ messages in thread
From: Sean Christopherson @ 2026-01-26 16:08 UTC (permalink / raw)
  To: Yan Zhao
  Cc: pbonzini, linux-kernel, kvm, x86, rick.p.edgecombe, dave.hansen,
	kas, tabba, ackerleytng, michael.roth, david, vannapurve, sagis,
	vbabka, thomas.lendacky, nik.borisov, pgonda, fan.du, jun.miao,
	francescolavra.fl, jgross, ira.weiny, isaku.yamahata, xiaoyao.li,
	kai.huang, binbin.wu, chao.p.peng, chao.gao

On Fri, Jan 16, 2026, Yan Zhao wrote:
> Hi Sean,
> Thanks for the review!
> 
> On Thu, Jan 15, 2026 at 02:49:59PM -0800, Sean Christopherson wrote:
> > On Tue, Jan 06, 2026, Yan Zhao wrote:
> > > From: Rick P Edgecombe <rick.p.edgecombe@intel.com>
> > > 
> > > Disallow page merging (huge page adjustment) for the mirror root by
> > > utilizing disallowed_hugepage_adjust().
> > 
> > Why?  What is this actually doing?  The below explains "how" but I'm baffled as
> > to the purpose.  I'm guessing there are hints in the surrounding patches, but I
> > haven't read them in depth, and shouldn't need to in order to understand the
> > primary reason behind a change.
> Sorry for missing the background. I will explain the "why" in the patch log in
> the next version.
> 
> The reason for introducing this patch is to disallow page merging for TDX. I
> explained the reasons to disallow page merging in the cover letter:
> 
> "
> 7. Page merging (page promotion)
> 
>    Promotion is disallowed, because:
> 
>    - The current TDX module requires all 4KB leafs to be either all PENDING
>      or all ACCEPTED before a successful promotion to 2MB. This requirement
>      prevents successful page merging after partially converting a 2MB
>      range from private to shared and then back to private, which is the
>      primary scenario necessitating page promotion.
> 
>    - tdh_mem_page_promote() depends on tdh_mem_range_block() in the current
>      TDX module. Consequently, handling BUSY errors is complex, as page
>      merging typically occurs in the fault path under shared mmu_lock.
> 
>    - Limited amount of initial private memory (typically ~4MB) means the
>      need for page merging during TD build time is minimal.
> "

> However, we currently don't support page merging yet. Specifically for the above
> scenario, the purpose is to avoid handling the error from
> tdh_mem_page_promote(), a SEAMCALL that currently needs to be preceded by
> tdh_mem_range_block(). To handle a promotion error (e.g., due to busy) under
> read mmu_lock, we may need to introduce several spinlocks and guarantees from
> the guest to ensure the success of tdh_mem_range_unblock() to restore the S-EPT
> status.
> 
> Therefore, we introduced this patch for simplicity, and because the promotion
> scenario is not common.

Say that in the changelog!  Describing the "how" in detail is completely unnecessary,
or at least it should be.  Because I strongly disagree with Rick's opinion from
the RFC that kvm_tdp_mmu_map() should check kvm_has_mirrored_tdp()[*].

 : I think part of the thing that is bugging me is that
 : nx_huge_page_workaround_enabled is not conceptually about whether the specific
 : fault/level needs to disallow huge page adjustments, it's whether it needs to
 : check if it does. Then disallowed_hugepage_adjust() does the actual specific
 : checking. But for the mirror logic the check is the same for both. It's
 : asymmetric with NX huge pages, and just sort of jammed in. It would be easier to
 : follow if the kvm_tdp_mmu_map() conditional checked whether mirror TDP was
 : "active", rather than the mirror role.

[*] http://lore.kernel.org/all/eea0bf7925c3b9c16573be8e144ddcc77b54cc92.camel@intel.com

If the changelog explains _why_, and the code is actually commented, then calling
into disallowed_hugepage_adjust() for all faults in a VM with mirrored roots is
nonsensical, because the code won't match the comment.

From: "Edgecombe, Rick P" <rick.p.edgecombe@intel.com>
Date: Tue, 22 Apr 2025 10:21:12 +0800
Subject: [PATCH] KVM: x86/mmu: Prevent hugepage promotion for mirror roots in
 fault path

Disallow hugepage promotion in the TDP MMU for mirror roots as KVM doesn't
currently support promoting S-EPT entries due to the complexity incurred
by the TDX-Module's rules for hugepage promotion.

 - The current TDX-Module requires all 4KB leafs to be either all PENDING
   or all ACCEPTED before a successful promotion to 2MB. This requirement
   prevents successful page merging after partially converting a 2MB
   range from private to shared and then back to private, which is the
   primary scenario necessitating page promotion.

 - The TDX-Module effectively requires a break-before-make sequence (to
   satisfy its TLB flushing rules), i.e. creates a window of time where a
   different vCPU can encounter faults on a SPTE that KVM is trying to
   promote to a hugepage.  To avoid unexpected BUSY errors, KVM would need
   to FREEZE the non-leaf SPTE before replacing it with a huge SPTE.

Disable hugepage promotion for all map() operations, as supporting page
promotion when building the initial image is still non-trivial, and the
vast majority of images are ~4MB or less, i.e. the benefit of creating
hugepages during TD build time is minimal.

Signed-off-by: Edgecombe, Rick P <rick.p.edgecombe@intel.com>
Co-developed-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
[sean: check root, add comment, rewrite changelog]
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/mmu/mmu.c     |  3 ++-
 arch/x86/kvm/mmu/tdp_mmu.c | 12 +++++++++++-
 2 files changed, 13 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 4ecbf216d96f..45650f70eeab 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3419,7 +3419,8 @@ void disallowed_hugepage_adjust(struct kvm_page_fault *fault, u64 spte, int cur_
 	    cur_level == fault->goal_level &&
 	    is_shadow_present_pte(spte) &&
 	    !is_large_pte(spte) &&
-	    spte_to_child_sp(spte)->nx_huge_page_disallowed) {
+	    ((spte_to_child_sp(spte)->nx_huge_page_disallowed) ||
+	     is_mirror_sp(spte_to_child_sp(spte)))) {
 		/*
 		 * A small SPTE exists for this pfn, but FNAME(fetch),
 		 * direct_map(), or kvm_tdp_mmu_map() would like to create a
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 321dbde77d3f..0fe3be41594f 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1232,7 +1232,17 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 	for_each_tdp_pte(iter, kvm, root, fault->gfn, fault->gfn + 1) {
 		int r;
 
-		if (fault->nx_huge_page_workaround_enabled)
+		/*
+		 * Don't replace a page table (non-leaf) SPTE with a huge SPTE
+		 * (a.k.a. hugepage promotion) if the NX hugepage workaround is
+		 * enabled, as doing so will cause significant thrashing if one
+		 * or more leaf SPTEs needs to be executable.
+		 *
+		 * Disallow hugepage promotion for mirror roots as KVM doesn't
+		 * (yet) support promoting S-EPT entries while holding mmu_lock
+		 * for read (due to complexity induced by the TDX-Module APIs).
+		 */
+		if (fault->nx_huge_page_workaround_enabled || is_mirror_sp(root))
 			disallowed_hugepage_adjust(fault, iter.old_spte, iter.level);
 
 		/*

base-commit: 914ea33c797e95e5fa7a0803e44b621a9e70a90f
-- 

^ permalink raw reply related	[flat|nested] 127+ messages in thread

* Re: [PATCH v3 06/24] KVM: x86/mmu: Disallow page merging (huge page adjustment) for mirror root
  2026-01-26 16:08       ` Sean Christopherson
@ 2026-01-27  3:40         ` Yan Zhao
  2026-01-28 19:51           ` Sean Christopherson
  0 siblings, 1 reply; 127+ messages in thread
From: Yan Zhao @ 2026-01-27  3:40 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: pbonzini, linux-kernel, kvm, x86, rick.p.edgecombe, dave.hansen,
	kas, tabba, ackerleytng, michael.roth, david, vannapurve, sagis,
	vbabka, thomas.lendacky, nik.borisov, pgonda, fan.du, jun.miao,
	francescolavra.fl, jgross, ira.weiny, isaku.yamahata, xiaoyao.li,
	kai.huang, binbin.wu, chao.p.peng, chao.gao

On Mon, Jan 26, 2026 at 08:08:31AM -0800, Sean Christopherson wrote:
> On Fri, Jan 16, 2026, Yan Zhao wrote:
> > Hi Sean,
> > Thanks for the review!
> > 
> > On Thu, Jan 15, 2026 at 02:49:59PM -0800, Sean Christopherson wrote:
> > > On Tue, Jan 06, 2026, Yan Zhao wrote:
> > > > From: Rick P Edgecombe <rick.p.edgecombe@intel.com>
> > > > 
> > > > Disallow page merging (huge page adjustment) for the mirror root by
> > > > utilizing disallowed_hugepage_adjust().
> > > 
> > > Why?  What is this actually doing?  The below explains "how" but I'm baffled as
> > > to the purpose.  I'm guessing there are hints in the surrounding patches, but I
> > > haven't read them in depth, and shouldn't need to in order to understand the
> > > primary reason behind a change.
> > Sorry for missing the background. I will explain the "why" in the patch log in
> > the next version.
> > 
> > The reason for introducing this patch is to disallow page merging for TDX. I
> > explained the reasons to disallow page merging in the cover letter:
> > 
> > "
> > 7. Page merging (page promotion)
> > 
> >    Promotion is disallowed, because:
> > 
> >    - The current TDX module requires all 4KB leafs to be either all PENDING
> >      or all ACCEPTED before a successful promotion to 2MB. This requirement
> >      prevents successful page merging after partially converting a 2MB
> >      range from private to shared and then back to private, which is the
> >      primary scenario necessitating page promotion.
> > 
> >    - tdh_mem_page_promote() depends on tdh_mem_range_block() in the current
> >      TDX module. Consequently, handling BUSY errors is complex, as page
> >      merging typically occurs in the fault path under shared mmu_lock.
> > 
> >    - Limited amount of initial private memory (typically ~4MB) means the
> >      need for page merging during TD build time is minimal.
> > "
> 
> > However, we currently don't support page merging yet. Specifically for the above
> > scenario, the purpose is to avoid handling the error from
> > tdh_mem_page_promote(), which SEAMCALL currently needs to be preceded by
> > tdh_mem_range_block(). To handle the promotion error (e.g., due to busy) under
> > read mmu_lock, we may need to introduce several spinlocks and guarantees from
> > the guest to ensure the success of tdh_mem_range_unblock() to restore the S-EPT
> > status. 
> > 
> > Therefore, we introduced this patch for simplicity, and because the promotion
> > scenario is not common.
> 
> Say that in the changelog!  Describing the "how" in detail is completely unnecessary,
I'll keep it in mind in the future!

> or at least it should be.  Because I strongly disagree with Rick's opinion from
> the RFC that kvm_tdp_mmu_map() should check kvm_has_mirrored_tdp()[*].
> 
>  : I think part of the thing that is bugging me is that
>  : nx_huge_page_workaround_enabled is not conceptually about whether the specific
>  : fault/level needs to disallow huge page adjustments, it's whether it needs to
>  : check if it does. Then disallowed_hugepage_adjust() does the actual specific
>  : checking. But for the mirror logic the check is the same for both. It's
>  : asymmetric with NX huge pages, and just sort of jammed in. It would be easier to
>  : follow if the kvm_tdp_mmu_map() conditional checked whether mirror TDP was
>  : "active", rather than the mirror role.
> 
> [*] http://lore.kernel.org/all/eea0bf7925c3b9c16573be8e144ddcc77b54cc92.camel@intel.com
> 
> If the changelog explains _why_, and the code is actually commented, then calling
> into disallowed_hugepage_adjust() for all faults in a VM with mirrored roots is
> nonsensical, because the code won't match the comment.

Thanks a lot! It looks good to me. 

> From: "Edgecombe, Rick P" <rick.p.edgecombe@intel.com>
> Date: Tue, 22 Apr 2025 10:21:12 +0800
> Subject: [PATCH] KVM: x86/mmu: Prevent hugepage promotion for mirror roots in
>  fault path
> 
> Disallow hugepage promotion in the TDP MMU for mirror roots as KVM doesn't
> currently support promoting S-EPT entries due to the complexity incurred
> by the TDX-Module's rules for hugepage promotion.
> 
>  - The current TDX-Module requires all 4KB leafs to be either all PENDING
>    or all ACCEPTED before a successful promotion to 2MB. This requirement
>    prevents successful page merging after partially converting a 2MB
>    range from private to shared and then back to private, which is the
>    primary scenario necessitating page promotion.
> 
>  - The TDX-Module effectively requires a break-before-make sequence (to
>    satisfy its TLB flushing rules), i.e. creates a window of time where a
>    different vCPU can encounter faults on a SPTE that KVM is trying to
>    promote to a hugepage.  To avoid unexpected BUSY errors, KVM would need
>    to FREEZE the non-leaf SPTE before replacing it with a huge SPTE.
> 
> Disable hugepage promotion for all map() operations, as supporting page
> promotion when building the initial image is still non-trivial, and the
> vast majority of images are ~4MB or less, i.e. the benefit of creating
> hugepages during TD build time is minimal.
> 
> Signed-off-by: Edgecombe, Rick P <rick.p.edgecombe@intel.com>
> Co-developed-by: Yan Zhao <yan.y.zhao@intel.com>
> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
> [sean: check root, add comment, rewrite changelog]
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
>  arch/x86/kvm/mmu/mmu.c     |  3 ++-
>  arch/x86/kvm/mmu/tdp_mmu.c | 12 +++++++++++-
>  2 files changed, 13 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 4ecbf216d96f..45650f70eeab 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -3419,7 +3419,8 @@ void disallowed_hugepage_adjust(struct kvm_page_fault *fault, u64 spte, int cur_
>  	    cur_level == fault->goal_level &&
>  	    is_shadow_present_pte(spte) &&
>  	    !is_large_pte(spte) &&
> -	    spte_to_child_sp(spte)->nx_huge_page_disallowed) {
> +	    ((spte_to_child_sp(spte)->nx_huge_page_disallowed) ||
> +	     is_mirror_sp(spte_to_child_sp(spte)))) {
>  		/*
>  		 * A small SPTE exists for this pfn, but FNAME(fetch),
>  		 * direct_map(), or kvm_tdp_mmu_map() would like to create a
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 321dbde77d3f..0fe3be41594f 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -1232,7 +1232,17 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
>  	for_each_tdp_pte(iter, kvm, root, fault->gfn, fault->gfn + 1) {
>  		int r;
>  
> -		if (fault->nx_huge_page_workaround_enabled)
> +		/*
> +		 * Don't replace a page table (non-leaf) SPTE with a huge SPTE
> +		 * (a.k.a. hugepage promotion) if the NX hugepage workaround is
> +		 * enabled, as doing so will cause significant thrashing if one
> +		 * or more leaf SPTEs needs to be executable.
> +		 *
> +		 * Disallow hugepage promotion for mirror roots as KVM doesn't
> +		 * (yet) support promoting S-EPT entries while holding mmu_lock
> +		 * for read (due to complexity induced by the TDX-Module APIs).
> +		 */
> +		if (fault->nx_huge_page_workaround_enabled || is_mirror_sp(root))
A small nit:
Here, we check is_mirror_sp(root).
However, not far from here, in kvm_tdp_mmu_map(), we have another check of
is_mirror_sp(), which should get the same result since sp->role.is_mirror is
inherited from its parent.

               if (is_mirror_sp(sp))
                       kvm_mmu_alloc_external_spt(vcpu, sp);

So, do you think we can save the is_mirror status in a local variable?
Like this:

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index b524b44733b8..c54befec3042 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1300,6 +1300,7 @@ static int tdp_mmu_split_huge_page(struct kvm *kvm, struct tdp_iter *iter,
 int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 {
        struct kvm_mmu_page *root = tdp_mmu_get_root_for_fault(vcpu, fault);
+       bool is_mirror = root && is_mirror_sp(root);
        struct kvm *kvm = vcpu->kvm;
        struct tdp_iter iter;
        struct kvm_mmu_page *sp;
@@ -1316,7 +1317,17 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
        for_each_tdp_pte(iter, kvm, root, fault->gfn, fault->gfn + 1) {
                int r;

-               if (fault->nx_huge_page_workaround_enabled)
+               /*
+                * Don't replace a page table (non-leaf) SPTE with a huge SPTE
+                * (a.k.a. hugepage promotion) if the NX hugepage workaround is
+                * enabled, as doing so will cause significant thrashing if one
+                * or more leaf SPTEs needs to be executable.
+                *
+                * Disallow hugepage promotion for mirror roots as KVM doesn't
+                * (yet) support promoting S-EPT entries while holding mmu_lock
+                * for read (due to complexity induced by the TDX-Module APIs).
+                */
+               if (fault->nx_huge_page_workaround_enabled || is_mirror)
                        disallowed_hugepage_adjust(fault, iter.old_spte, iter.level);

                /*
@@ -1340,7 +1351,7 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
                 */
                sp = tdp_mmu_alloc_sp(vcpu);
                tdp_mmu_init_child_sp(sp, &iter);
-               if (is_mirror_sp(sp))
+               if (is_mirror)
                        kvm_mmu_alloc_external_spt(vcpu, sp);


^ permalink raw reply related	[flat|nested] 127+ messages in thread

* Re: [PATCH v3 06/24] KVM: x86/mmu: Disallow page merging (huge page adjustment) for mirror root
  2026-01-27  3:40         ` Yan Zhao
@ 2026-01-28 19:51           ` Sean Christopherson
  0 siblings, 0 replies; 127+ messages in thread
From: Sean Christopherson @ 2026-01-28 19:51 UTC (permalink / raw)
  To: Yan Zhao
  Cc: pbonzini, linux-kernel, kvm, x86, rick.p.edgecombe, dave.hansen,
	kas, tabba, ackerleytng, michael.roth, david, vannapurve, sagis,
	vbabka, thomas.lendacky, nik.borisov, pgonda, fan.du, jun.miao,
	francescolavra.fl, jgross, ira.weiny, isaku.yamahata, xiaoyao.li,
	kai.huang, binbin.wu, chao.p.peng, chao.gao

On Tue, Jan 27, 2026, Yan Zhao wrote:
> On Mon, Jan 26, 2026 at 08:08:31AM -0800, Sean Christopherson wrote:
> > diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> > index 321dbde77d3f..0fe3be41594f 100644
> > --- a/arch/x86/kvm/mmu/tdp_mmu.c
> > +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> > @@ -1232,7 +1232,17 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> >  	for_each_tdp_pte(iter, kvm, root, fault->gfn, fault->gfn + 1) {
> >  		int r;
> >  
> > -		if (fault->nx_huge_page_workaround_enabled)
> > +		/*
> > +		 * Don't replace a page table (non-leaf) SPTE with a huge SPTE
> > +		 * (a.k.a. hugepage promotion) if the NX hugepage workaround is
> > +		 * enabled, as doing so will cause significant thrashing if one
> > +		 * or more leaf SPTEs needs to be executable.
> > +		 *
> > +		 * Disallow hugepage promotion for mirror roots as KVM doesn't
> > +		 * (yet) support promoting S-EPT entries while holding mmu_lock
> > +		 * for read (due to complexity induced by the TDX-Module APIs).
> > +		 */
> > +		if (fault->nx_huge_page_workaround_enabled || is_mirror_sp(root))
> A small nit:
> Here, we check is_mirror_sp(root).
> However, not far from here,  in kvm_tdp_mmu_map(), we have another check of
> is_mirror_sp(), which should get the same result since sp->role.is_mirror is
> inherited from its parent.
> 
>                if (is_mirror_sp(sp))
>                        kvm_mmu_alloc_external_spt(vcpu, sp);
> 
> So, do you think we can save the is_mirror status in a local variable?

Eh, I vote "no".  From a performance perspective, it's basically meaningless.
The check is a single uop to test a flag that is all but guaranteed to be
cache-hot, and any halfway decent CPU will be able to predict the branch.

From a code perspective, I'd rather have the explicit is_mirror_sp(root) check,
as opposed to having to go look at the origins of is_mirror.

> Like this:
> 
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index b524b44733b8..c54befec3042 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -1300,6 +1300,7 @@ static int tdp_mmu_split_huge_page(struct kvm *kvm, struct tdp_iter *iter,
>  int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
>  {
>         struct kvm_mmu_page *root = tdp_mmu_get_root_for_fault(vcpu, fault);
> +       bool is_mirror = root && is_mirror_sp(root);
>         struct kvm *kvm = vcpu->kvm;
>         struct tdp_iter iter;
>         struct kvm_mmu_page *sp;
> @@ -1316,7 +1317,17 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
>         for_each_tdp_pte(iter, kvm, root, fault->gfn, fault->gfn + 1) {
>                 int r;
> 
> -               if (fault->nx_huge_page_workaround_enabled)
> +               /*
> +                * Don't replace a page table (non-leaf) SPTE with a huge SPTE
> +                * (a.k.a. hugepage promotion) if the NX hugepage workaround is
> +                * enabled, as doing so will cause significant thrashing if one
> +                * or more leaf SPTEs needs to be executable.
> +                *
> +                * Disallow hugepage promotion for mirror roots as KVM doesn't
> +                * (yet) support promoting S-EPT entries while holding mmu_lock
> +                * for read (due to complexity induced by the TDX-Module APIs).
> +                */
> +               if (fault->nx_huge_page_workaround_enabled || is_mirror)
>                         disallowed_hugepage_adjust(fault, iter.old_spte, iter.level);
> 
>                 /*
> @@ -1340,7 +1351,7 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
>                  */
>                 sp = tdp_mmu_alloc_sp(vcpu);
>                 tdp_mmu_init_child_sp(sp, &iter);
> -               if (is_mirror_sp(sp))
> +               if (is_mirror)
>                         kvm_mmu_alloc_external_spt(vcpu, sp);
> 

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [PATCH v3 07/24] KVM: x86/tdp_mmu: Introduce split_external_spte() under write mmu_lock
  2026-01-06 10:20 ` [PATCH v3 07/24] KVM: x86/tdp_mmu: Introduce split_external_spte() under write mmu_lock Yan Zhao
@ 2026-01-28 22:38   ` Sean Christopherson
  0 siblings, 0 replies; 127+ messages in thread
From: Sean Christopherson @ 2026-01-28 22:38 UTC (permalink / raw)
  To: Yan Zhao
  Cc: pbonzini, linux-kernel, kvm, x86, rick.p.edgecombe, dave.hansen,
	kas, tabba, ackerleytng, michael.roth, david, vannapurve, sagis,
	vbabka, thomas.lendacky, nik.borisov, pgonda, fan.du, jun.miao,
	francescolavra.fl, jgross, ira.weiny, isaku.yamahata, xiaoyao.li,
	kai.huang, binbin.wu, chao.p.peng, chao.gao

On Tue, Jan 06, 2026, Yan Zhao wrote:
> Introduce a new valid transition case for splitting and document all valid
> transitions of the mirror page table under write mmu_lock in
> tdp_mmu_set_spte().

...

> ---
>  arch/x86/include/asm/kvm-x86-ops.h |  1 +
>  arch/x86/include/asm/kvm_host.h    |  4 ++++
>  arch/x86/kvm/mmu/tdp_mmu.c         | 29 +++++++++++++++++++++++++----
>  3 files changed, 30 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
> index 58c5c9b082ca..84fa8689b45c 100644
> --- a/arch/x86/include/asm/kvm-x86-ops.h
> +++ b/arch/x86/include/asm/kvm-x86-ops.h
> @@ -98,6 +98,7 @@ KVM_X86_OP_OPTIONAL(link_external_spt)
>  KVM_X86_OP_OPTIONAL(set_external_spte)
>  KVM_X86_OP_OPTIONAL(free_external_spt)
>  KVM_X86_OP_OPTIONAL(remove_external_spte)
> +KVM_X86_OP_OPTIONAL(split_external_spte)

This is all going in the wrong direction.  Sprinking S-EPT callbacks all over the
TDP MMU leaks *more* TDX details into the MMU, and for all intents and purposes
does nothing in terms of encapsulating MMU details in the MMU.  E.g. the TDX code
has sanity checks all over the place to ensure the "right" API is called.

The bajillion callbacks also make this code extremely difficult to follow and
review.  It requires knowing exactly which TDP MMU paths are used for what
operations, and what paths are (allegedly) unreachable for mirror roots.  Adding
hooks at specific points is also brittle, because an unexpected update/change is
more likely to go unnoticed, at least until the system explodes.

There are really only two novel paths: atomic versus non-atomic writes.  An atomic
set_spte() can fail, and also needs special handling so that the entire operation
is atomic from KVM's point of view.

There's another outlier, removal of a non-leaf S-EPT page, that I think is worth
keeping separate, because I don't see a sane way of containing the TDX-Module's
ordering requirements to the TDX code.  Specifically, the TDX-Module requires that
leaf S-EPT entries be removed before the parent page table can be removed, whereas
KVM prefers to prune the page table and _then_ reap its children.

We _could_ funnel that case into the non-atomic update, but it would either require
propagating the non-leaf removal to TDX after the call_rcu():

	call_rcu(&sp->rcu_head, tdp_mmu_free_sp_rcu_callback);

which is all kinds of gross, or it would require moving the call_rcu() invocation,
which obviously bleeds TDX details into the TDP MMU.  So I think it's worth keeping
a dedicated hook for that case, but literally everything else can funnel into a
single callback, invoked from two locations: handle_changed_spte() and
__tdp_mmu_set_spte_atomic().

Then the TDX code is (quite simply, IMO):

static int tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn, u64 old_spte,
				     u64 new_spte, enum pg_level level)
{
	if (is_shadow_present_pte(old_spte) && is_shadow_present_pte(new_spte))
		return tdx_sept_split_private_spte(kvm, gfn, old_spte, new_spte, level);
	else if (is_shadow_present_pte(old_spte))
		return tdx_sept_remove_private_spte(kvm, gfn, old_spte, level);

	if (KVM_BUG_ON(!is_shadow_present_pte(new_spte), kvm))
		return -EIO;

	if (!is_last_spte(new_spte, level))
		return tdx_sept_link_private_spt(kvm, gfn, new_spte, level);

	return tdx_sept_map_leaf_spte(kvm, gfn, new_spte, level);
}

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [PATCH v3 16/24] KVM: guest_memfd: Split for punch hole and private-to-shared conversion
  2026-01-06 10:22 ` [PATCH v3 16/24] KVM: guest_memfd: Split for punch hole and private-to-shared conversion Yan Zhao
@ 2026-01-28 22:39   ` Sean Christopherson
  0 siblings, 0 replies; 127+ messages in thread
From: Sean Christopherson @ 2026-01-28 22:39 UTC (permalink / raw)
  To: Yan Zhao
  Cc: pbonzini, linux-kernel, kvm, x86, rick.p.edgecombe, dave.hansen,
	kas, tabba, ackerleytng, michael.roth, david, vannapurve, sagis,
	vbabka, thomas.lendacky, nik.borisov, pgonda, fan.du, jun.miao,
	francescolavra.fl, jgross, ira.weiny, isaku.yamahata, xiaoyao.li,
	kai.huang, binbin.wu, chao.p.peng, chao.gao

On Tue, Jan 06, 2026, Yan Zhao wrote:
>  virt/kvm/guest_memfd.c | 67 ++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 67 insertions(+)
> 
> diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
> index 03613b791728..8e7fbed57a20 100644
> --- a/virt/kvm/guest_memfd.c
> +++ b/virt/kvm/guest_memfd.c
> @@ -486,6 +486,55 @@ static int merge_truncate_range(struct inode *inode, pgoff_t start,
>  	return ret;
>  }
>  
> +static int __kvm_gmem_split_private(struct gmem_file *f, pgoff_t start, pgoff_t end)
> +{
> +	enum kvm_gfn_range_filter attr_filter = KVM_FILTER_PRIVATE;
> +
> +	bool locked = false;
> +	struct kvm_memory_slot *slot;
> +	struct kvm *kvm = f->kvm;
> +	unsigned long index;
> +	int ret = 0;
> +
> +	xa_for_each_range(&f->bindings, index, slot, start, end - 1) {
> +		pgoff_t pgoff = slot->gmem.pgoff;
> +		struct kvm_gfn_range gfn_range = {
> +			.start = slot->base_gfn + max(pgoff, start) - pgoff,
> +			.end = slot->base_gfn + min(pgoff + slot->npages, end) - pgoff,
> +			.slot = slot,
> +			.may_block = true,
> +			.attr_filter = attr_filter,
> +		};
> +
> +		if (!locked) {
> +			KVM_MMU_LOCK(kvm);
> +			locked = true;
> +		}
> +
> +		ret = kvm_split_cross_boundary_leafs(kvm, &gfn_range, false);

This bleeds TDX details all over guest_memfd.  Presumably SNP needs a similar
callback to update the RMP, but SNP most definitely doesn't _need_ to split
hugepages that now have mixed attributes.  In fact, SNP can probably do literally
nothing here and let kvm_gmem_zap() do the heavy lifting.

Sadly, an arch hook is "necessary", because otherwise we'll end up in dependency
hell.  E.g. I _want_ to just let the TDP MMU do the splits during kvm_gmem_zap(),
but then an -ENOMEM when splitting would result in a partial conversion if more
than one KVM instance was bound to the gmem instance (ignoring that it's actually
"fine" for the TDX case, because only one S-EPT tree can have a valid mapping).

Even if we're willing to live with that assumption baked into the TDP MMU, we'd
still need to allow kvm_gmem_zap() to fail, e.g. because -ENOMEM isn't strictly
fatal.  And I really, really don't want to set the precedent that "zap" operations
are allowed to fail.

But those details absolutely do not belong in guest_memfd.c.  Provide an arch
hook to give x86 the opportunity to pre-split hugepages, but keep the details
in arch code.

static int __kvm_gmem_convert(struct gmem_file *f, pgoff_t start, pgoff_t end,
			      bool to_private)
{
	struct kvm_memory_slot *slot;
	unsigned long index;
	int r;

	xa_for_each_range(&f->bindings, index, slot, start, end - 1) {
		r = kvm_arch_gmem_convert(f->kvm,
					  kvm_gmem_get_start_gfn(slot, start),
					  kvm_gmem_get_end_gfn(slot, end),
					  to_private);
		if (r)
			return r;
	}
	return 0;
}

static int kvm_gmem_convert(struct inode *inode, pgoff_t start, pgoff_t end,
			    bool to_private)
{
	struct gmem_file *f;
	int r;

	kvm_gmem_for_each_file(f, inode->i_mapping) {
		r = __kvm_gmem_convert(f, start, end, to_private);
		if (r)
			return r;
	}
	return 0;
}


^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [PATCH v3 02/24] x86/virt/tdx: Add SEAMCALL wrapper tdh_mem_page_demote()
  2026-01-06 10:18 ` [PATCH v3 02/24] x86/virt/tdx: Add SEAMCALL wrapper tdh_mem_page_demote() Yan Zhao
  2026-01-16  1:00   ` Huang, Kai
  2026-01-16 11:22   ` Huang, Kai
@ 2026-01-28 22:49   ` Sean Christopherson
  2 siblings, 0 replies; 127+ messages in thread
From: Sean Christopherson @ 2026-01-28 22:49 UTC (permalink / raw)
  To: Yan Zhao
  Cc: pbonzini, linux-kernel, kvm, x86, rick.p.edgecombe, dave.hansen,
	kas, tabba, ackerleytng, michael.roth, david, vannapurve, sagis,
	vbabka, thomas.lendacky, nik.borisov, pgonda, fan.du, jun.miao,
	francescolavra.fl, jgross, ira.weiny, isaku.yamahata, xiaoyao.li,
	kai.huang, binbin.wu, chao.p.peng, chao.gao

On Tue, Jan 06, 2026, Yan Zhao wrote:
> From: Xiaoyao Li <xiaoyao.li@intel.com>
> 
> Introduce SEAMCALL wrapper tdh_mem_page_demote() to invoke
> TDH_MEM_PAGE_DEMOTE, which splits a 2MB or a 1GB mapping in S-EPT into
> 512 4KB or 2MB mappings respectively.
> 
> SEAMCALL TDH_MEM_PAGE_DEMOTE walks the S-EPT to locate the huge mapping to
> split and add a new S-EPT page to hold the 512 smaller mappings.
> 
> Parameters "gpa" and "level" specify the huge mapping to split, and
> parameter "new_sept_page" specifies the 4KB page to be added as the S-EPT
> page. Invoke tdx_clflush_page() before adding the new S-EPT page
> conservatively to prevent dirty cache lines from writing back later and
> corrupting TD memory.
> 
> tdh_mem_page_demote() may fail, e.g., due to S-EPT walk error. Callers must
> check function return value and can retrieve the extended error info from
> the output parameters "ext_err1", and "ext_err2".
> 
> The TDX module has many internal locks. To avoid staying in SEAM mode for
> too long, SEAMCALLs return a BUSY error code to the kernel instead of
> spinning on the locks. Depending on the specific SEAMCALL, the caller may
> need to handle this error in specific ways (e.g., retry). Therefore, return
> the SEAMCALL error code directly to the caller without attempting to handle
> it in the core kernel.
> 
> Enable tdh_mem_page_demote() only on TDX modules that support feature
> TDX_FEATURES0.ENHANCE_DEMOTE_INTERRUPTIBILITY, which does not return error
> TDX_INTERRUPTED_RESTARTABLE on basic TDX (i.e., without TD partition) [2].
> 
> This is because error TDX_INTERRUPTED_RESTARTABLE is difficult to handle.
> The TDX module provides no guaranteed maximum retry count to ensure forward
> progress of the demotion. Interrupt storms could then result in a DoS if
> the host simply retries endlessly for TDX_INTERRUPTED_RESTARTABLE. Disabling
> interrupts before invoking the SEAMCALL also doesn't work because NMIs can
> also trigger TDX_INTERRUPTED_RESTARTABLE. Therefore, the tradeoff for basic
> TDX is to disable the TDX_INTERRUPTED_RESTARTABLE error given the
> reasonable execution time for demotion. [1]
> 
> Link: https://lore.kernel.org/kvm/99f5585d759328db973403be0713f68e492b492a.camel@intel.com [1]
> Link: https://lore.kernel.org/all/fbf04b09f13bc2ce004ac97ee9c1f2c965f44fdf.camel@intel.com [2]
> Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> Co-developed-by: Yan Zhao <yan.y.zhao@intel.com>
> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
> ---

This is ridiculous.  The DEMOTE API is spread across three patches:

  Add SEAMCALL wrapper tdh_mem_page_demote()
  Add/Remove DPAMT pages for guest private memory to demote
  Pass guest memory's PFN info to demote for updating pamt_refcount

with significant changes between the "add wrapper" and when the API is actually
usable.  Even worse, it's wired up in KVM before it's finalized, and so those
changes mean touching KVM code.

And to top things off, "Add/Remove DPAMT pages for guest private memory to demote"
includes a non-trivial refactoring and code movement of MAX_TDX_ARG_SIZE() and
dpamt_args_array_ptr().

This is borderline unreviewable.  It took me literally more than a day to wrap my
head around what actually needs to happen for DEMOTE, what patch was doing what,
at what point in the series DEMOTE actually became usable, etc.

I get that y'all are juggling multiple intertwined series, but spraying them all
at upstream without any apparent rhyme or reason does not work.  Figure out
priorities, pick an ordering, and make it happen.

In the end, this fits nicely into *one* patch, with significantly fewer lines
changed overall.

E.g.

---
 arch/x86/include/asm/tdx.h  |  9 ++++++
 arch/x86/virt/vmx/tdx/tdx.c | 58 +++++++++++++++++++++++++++++++++++++
 arch/x86/virt/vmx/tdx/tdx.h |  1 +
 3 files changed, 68 insertions(+)

diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index 50feea01b066..483441de7fe0 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -15,6 +15,7 @@
 /* Bit definitions of TDX_FEATURES0 metadata field */
 #define TDX_FEATURES0_NO_RBP_MOD		BIT_ULL(18)
 #define TDX_FEATURES0_DYNAMIC_PAMT		BIT_ULL(36)
+#define TDX_FEATURES0_ENHANCE_DEMOTE_INTERRUPTIBILITY	BIT_ULL(51)
 
 #ifndef __ASSEMBLER__
 
@@ -140,6 +141,11 @@ static inline bool tdx_supports_dynamic_pamt(const struct tdx_sys_info *sysinfo)
 	return sysinfo->features.tdx_features0 & TDX_FEATURES0_DYNAMIC_PAMT;
 }
 
+static inline bool tdx_supports_demote_nointerrupt(const struct tdx_sys_info *sysinfo)
+{
+	return sysinfo->features.tdx_features0 & TDX_FEATURES0_ENHANCE_DEMOTE_INTERRUPTIBILITY;
+}
+
 /* Simple structure for pre-allocating Dynamic PAMT pages outside of locks. */
 struct tdx_pamt_cache {
 	struct list_head page_list;
@@ -240,6 +246,9 @@ u64 tdh_mng_key_config(struct tdx_td *td);
 u64 tdh_mng_create(struct tdx_td *td, u16 hkid);
 u64 tdh_vp_create(struct tdx_td *td, struct tdx_vp *vp);
 u64 tdh_mng_rd(struct tdx_td *td, u64 field, u64 *data);
+u64 tdh_mem_page_demote(struct tdx_td *td, u64 gpa, enum pg_level level, u64 pfn,
+			struct page *new_sp, struct tdx_pamt_cache *pamt_cache,
+			u64 *ext_err1, u64 *ext_err2);
 u64 tdh_mr_extend(struct tdx_td *td, u64 gpa, u64 *ext_err1, u64 *ext_err2);
 u64 tdh_mr_finalize(struct tdx_td *td);
 u64 tdh_vp_flush(struct tdx_vp *vp);
diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index 6a2871e83761..97016b3e26b8 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -1874,6 +1874,64 @@ static u64 *dpamt_args_array_ptr_r12(struct tdx_module_array_args *args)
 	return &args->args_array[TDX_ARG_INDEX(r12)];
 }
 
+u64 tdh_mem_page_demote(struct tdx_td *td, u64 gpa, enum pg_level level, u64 pfn,
+			struct page *new_sp, struct tdx_pamt_cache *pamt_cache,
+			u64 *ext_err1, u64 *ext_err2)
+{
+	bool dpamt = tdx_supports_dynamic_pamt(&tdx_sysinfo) && level == PG_LEVEL_2M;
+	u64 guest_memory_pamt_page[MAX_TDX_ARGS(r12)];
+	struct tdx_module_array_args args = {
+		.args.rcx = gpa | pg_level_to_tdx_sept_level(level),
+		.args.rdx = tdx_tdr_pa(td),
+		.args.r8 = page_to_phys(new_sp),
+	};
+	u64 ret;
+
+	if (!tdx_supports_demote_nointerrupt(&tdx_sysinfo))
+		return TDX_SW_ERROR;
+
+	if (dpamt) {
+		u64 *args_array = dpamt_args_array_ptr_r12(&args);
+
+		if (alloc_pamt_array(guest_memory_pamt_page, pamt_cache))
+			return TDX_SW_ERROR;
+
+		/*
+		 * Copy PAMT page PAs of the guest memory into the struct per the
+		 * TDX ABI
+		 */
+		memcpy(args_array, guest_memory_pamt_page,
+		       tdx_dpamt_entry_pages() * sizeof(*args_array));
+	}
+
+	/* Flush the new S-EPT page to be added */
+	tdx_clflush_page(new_sp);
+
+	ret = seamcall_saved_ret(TDH_MEM_PAGE_DEMOTE, &args.args);
+
+	*ext_err1 = args.args.rcx;
+	*ext_err2 = args.args.rdx;
+
+	if (dpamt) {
+		if (ret) {
+			free_pamt_array(guest_memory_pamt_page);
+		} else {
+			/*
+			 * Set the PAMT refcount for the guest private memory,
+			 * i.e. for the hugepage that was just demoted to 512
+			 * smaller pages.
+			 */
+			atomic_t *pamt_refcount;
+
+			pamt_refcount = tdx_find_pamt_refcount(pfn);
+			WARN_ON_ONCE(atomic_cmpxchg_release(pamt_refcount, 0,
+							    PTRS_PER_PMD));
+		}
+	}
+	return ret;
+}
+EXPORT_SYMBOL_FOR_KVM(tdh_mem_page_demote);
+
 u64 tdh_mr_extend(struct tdx_td *td, u64 gpa, u64 *ext_err1, u64 *ext_err2)
 {
 	struct tdx_module_args args = {
diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h
index 096c78a1d438..a6c0fa53ece9 100644
--- a/arch/x86/virt/vmx/tdx/tdx.h
+++ b/arch/x86/virt/vmx/tdx/tdx.h
@@ -24,6 +24,7 @@
 #define TDH_MNG_KEY_CONFIG		8
 #define TDH_MNG_CREATE			9
 #define TDH_MNG_RD			11
+#define TDH_MEM_PAGE_DEMOTE		15
 #define TDH_MR_EXTEND			16
 #define TDH_MR_FINALIZE			17
 #define TDH_VP_FLUSH			18

base-commit: 0f969bc3e7a9aa441122ad51bc2ff220a200b88e
-- 


^ permalink raw reply related	[flat|nested] 127+ messages in thread

* Re: [PATCH v3 11/24] KVM: x86/mmu: Introduce kvm_split_cross_boundary_leafs()
  2026-01-19 12:32                   ` Yan Zhao
@ 2026-01-29 14:36                     ` Sean Christopherson
  0 siblings, 0 replies; 127+ messages in thread
From: Sean Christopherson @ 2026-01-29 14:36 UTC (permalink / raw)
  To: Yan Zhao
  Cc: Kai Huang, Fan Du, kvm@vger.kernel.org, Xiaoyao Li, Dave Hansen,
	thomas.lendacky@amd.com, tabba@google.com, vbabka@suse.cz,
	david@kernel.org, michael.roth@amd.com,
	linux-kernel@vger.kernel.org, Chao P Peng, pbonzini@redhat.com,
	ackerleytng@google.com, kas@kernel.org, binbin.wu@linux.intel.com,
	Ira Weiny, nik.borisov@suse.com, francescolavra.fl@gmail.com,
	Isaku Yamahata, sagis@google.com, Chao Gao, Rick P Edgecombe,
	Jun Miao, Vishal Annapurve, jgross@suse.com, pgonda@google.com,
	x86@kernel.org

On Mon, Jan 19, 2026, Yan Zhao wrote:
> On Mon, Jan 19, 2026 at 07:06:01PM +0800, Yan Zhao wrote:
> > On Mon, Jan 19, 2026 at 06:40:50PM +0800, Huang, Kai wrote:
> > > Similar handling to 'end'.  An additional thing is if one to-be-split-
> > > range calculated from 'start' overlaps one calculated from 'end', the
> > > split is only needed once. 
> > > 
> > > Wouldn't this work?
> > It can work. But I don't think the calculations are necessary if the length
> > of [start, end) is less than 1G or 2MB.
> > 
> > e.g., if both start and end are just 4KB-aligned, of a length 8KB, the current
> > implementation can invoke a single tdp_mmu_split_huge_pages_root() to split
> > a 1GB mapping to 4KB directly. Why bother splitting twice for start or end?
> I think I get your point now.
> It's a good idea if introducing only_cross_boundary is undesirable.
> 
> So, the remaining question (as I asked at the bottom of [1]) is whether we could
> create a specific function for this split use case, rather than reusing
> tdp_mmu_split_huge_pages_root() which allocates pages outside of mmu_lock. 

Belatedly, yes.  What I want to avoid is modifying core MMU functionality to add
edge-case handling for TDX.  Inevitably, TDX will require invasive changes, but
in this case they're completely unjustified.

FWIW, if __for_each_tdp_mmu_root_yield_safe() were visible outside of tdp_mmu.c,
all of the x86 code guarded by CONFIG_HAVE_KVM_ARCH_GMEM_CONVERT[*] could live in
tdx.c.

Hmm, actually, looking at that again, it's totally doable to bury the majority of
the logic in tdx.c, the TDP MMU just needs to expose an API to split hugepages in
mirror roots.  Which is effectively what tdx_handle_mismatched_accept() needs as
well, since there can only be one mirror root in practice.

Oof, and kvm_tdp_mmu_split_huge_pages() used by tdx_handle_mismatched_accept()
is wrong; it operates on the "normal" root, not the mirror root.

Let me respond to those patches.

[*] https://lore.kernel.org/all/20260129011517.3545883-45-seanjc@google.com

> This
> way, we don't need to introduce a spinlock to protect the page enqueuing/
> dequeueing of the per-VM external cache (see prealloc_split_cache_lock in patch
> 20 [2]).
> 
> Then we would disallow mirror_root for tdp_mmu_split_huge_pages_root(), which is
> currently called for dirty page tracking in upstream code. Would this be
> acceptable for TDX migration?

Honestly, I have no idea.  That's so far in the future.

> [1] https://lore.kernel.org/all/aW2Iwpuwoyod8eQc@yzhao56-desk.sh.intel.com/
> [2] https://lore.kernel.org/all/20260106102345.25261-1-yan.y.zhao@intel.com/


* Re: [PATCH v3 11/24] KVM: x86/mmu: Introduce kvm_split_cross_boundary_leafs()
  2026-01-22  6:33           ` Yan Zhao
@ 2026-01-29 14:51             ` Sean Christopherson
  0 siblings, 0 replies; 127+ messages in thread
From: Sean Christopherson @ 2026-01-29 14:51 UTC (permalink / raw)
  To: Yan Zhao
  Cc: Vishal Annapurve, Kai Huang, pbonzini@redhat.com,
	kvm@vger.kernel.org, Fan Du, Xiaoyao Li, Chao Gao, Dave Hansen,
	thomas.lendacky@amd.com, vbabka@suse.cz, tabba@google.com,
	david@kernel.org, kas@kernel.org, michael.roth@amd.com, Ira Weiny,
	linux-kernel@vger.kernel.org, binbin.wu@linux.intel.com,
	ackerleytng@google.com, nik.borisov@suse.com, Isaku Yamahata,
	Chao P Peng, francescolavra.fl@gmail.com, sagis@google.com,
	Rick P Edgecombe, Jun Miao, jgross@suse.com, pgonda@google.com,
	x86@kernel.org

On Thu, Jan 22, 2026, Yan Zhao wrote:
> On Tue, Jan 20, 2026 at 10:02:41AM -0800, Sean Christopherson wrote:
> > On Tue, Jan 20, 2026, Vishal Annapurve wrote:
> > > On Fri, Jan 16, 2026 at 3:39 PM Sean Christopherson <seanjc@google.com> wrote:
> > > > And then for the PUNCH_HOLE case, do the math to determine which, if any, head
> > > > and tail pages need to be split, and use the existing APIs to make that happen.
> > > 
> > > Just a note: Through guest_memfd upstream syncs, we agreed that
> > > guest_memfd will only allow the punch_hole operation for huge page
> > > size-aligned ranges for hugetlb and thp backing. i.e. the PUNCH_HOLE
> > > operation doesn't need to split any EPT mappings for foreseeable
> > > future.
> > 
> > Oh!  Right, forgot about that.  It's the conversion path that we need to sort out,
> > not PUNCH_HOLE.  Thanks for the reminder!
> Hmm, I see.
> However, do you think it's better to leave the splitting logic in PUNCH_HOLE as
> well? e.g., guest_memfd may want to map several folios in a mapping in the
> future, i.e., after *max_order > folio_order(folio);

No, not at this time.  That is a _very_ big "if".  Coordinating and tracking
contiguous chunks of memory at a larger granularity than the underlying HugeTLB
page size would require significant complexity, I don't see us ever doing that.


* Re: [PATCH v3 00/24] KVM: TDX huge page support for private memory
  2026-01-19  5:53                               ` Yan Zhao
@ 2026-01-30 15:32                                 ` Sean Christopherson
  2026-02-03  9:18                                   ` Yan Zhao
  0 siblings, 1 reply; 127+ messages in thread
From: Sean Christopherson @ 2026-01-30 15:32 UTC (permalink / raw)
  To: Yan Zhao
  Cc: Rick P Edgecombe, Fan Du, Xiaoyao Li, Kai Huang,
	kvm@vger.kernel.org, Dave Hansen, thomas.lendacky@amd.com,
	vbabka@suse.cz, tabba@google.com, david@kernel.org,
	kas@kernel.org, linux-kernel@vger.kernel.org, Ira Weiny,
	francescolavra.fl@gmail.com, pbonzini@redhat.com,
	ackerleytng@google.com, nik.borisov@suse.com,
	binbin.wu@linux.intel.com, Isaku Yamahata, Chao P Peng,
	michael.roth@amd.com, Vishal Annapurve, sagis@google.com,
	Chao Gao, Jun Miao, jgross@suse.com, pgonda@google.com,
	x86@kernel.org

On Mon, Jan 19, 2026, Yan Zhao wrote:
> On Sat, Jan 17, 2026 at 12:58:02AM +0800, Edgecombe, Rick P wrote:
> > On Fri, 2026-01-16 at 08:31 -0800, Sean Christopherson wrote:
> IIUC, this concern should be gone as Dave has agreed to use "pfn" as the
> SEAMCALL parameter [1]?
> Then should we invoke "KVM_MMU_WARN_ON(!tdx_is_convertible_pfn(pfn));" in KVM
> for every pfn of a huge mapping? Or should we keep the sanity check inside the
> SEAMCALL wrappers?

I don't have a strong preference.  But if it goes in KVM, definitely guard it with
KVM_MMU_WARN_ON().

> BTW, I have another question about the SEAMCALL wrapper implementation, as Kai
> also pointed out in [2]: since the SEAMCALL wrappers now serve as APIs available
> to callers besides KVM, should the SEAMCALL wrappers return TDX_OPERAND_INVALID
> or WARN_ON() (or WARN_ON_ONCE()) on sanity check failure?

Why not both?  But maybe TDX_SW_ERROR instead of TDX_OPERAND_INVALID?

If an API has a defined contract and/or set of expectations, and those expectations
aren't met by the caller, then a WARN is justified.  But the failure still needs
to be communicated to the caller.

> By returning TDX_OPERAND_INVALID, the caller can check the return code, adjust
> the input or trigger WARN_ON() by itself;
> By triggering WARN_ON() directly in the SEAMCALL wrapper, we need to document
> this requirement for the SEAMCALL wrappers and have the caller invoke the API
> correctly.

Document what exactly?  Most of this should be common sense.  E.g. we don't generally
document that pointers must be non-NULL, because that goes without saying 99.9%
of the time.

IMO, that holds true here as well.  E.g. trying to map memory into a TDX guest
that isn't convertible is obviously a bug, I don't see any value in formally
documenting that requirement.

> So, it looks that "WARN_ON() directly in the SEAMCALL wrapper" is the preferred
> approach, right?

> 
> [1] https://lore.kernel.org/all/d119c824-4770-41d2-a926-4ab5268ea3a6@intel.com/
> [2] https://lore.kernel.org/all/baf6df2cc63d8e897455168c1bf07180fc9c1db8.camel@intel.com


* Re: [PATCH v3 00/24] KVM: TDX huge page support for private memory
  2026-01-30 15:32                                 ` Sean Christopherson
@ 2026-02-03  9:18                                   ` Yan Zhao
  2026-02-09 17:01                                     ` Sean Christopherson
  0 siblings, 1 reply; 127+ messages in thread
From: Yan Zhao @ 2026-02-03  9:18 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Rick P Edgecombe, Fan Du, Xiaoyao Li, Kai Huang,
	kvm@vger.kernel.org, Dave Hansen, thomas.lendacky@amd.com,
	vbabka@suse.cz, tabba@google.com, david@kernel.org,
	kas@kernel.org, linux-kernel@vger.kernel.org, Ira Weiny,
	francescolavra.fl@gmail.com, pbonzini@redhat.com,
	ackerleytng@google.com, nik.borisov@suse.com,
	binbin.wu@linux.intel.com, Isaku Yamahata, Chao P Peng,
	michael.roth@amd.com, Vishal Annapurve, sagis@google.com,
	Chao Gao, Jun Miao, jgross@suse.com, pgonda@google.com,
	x86@kernel.org

On Fri, Jan 30, 2026 at 07:32:48AM -0800, Sean Christopherson wrote:
> On Mon, Jan 19, 2026, Yan Zhao wrote:
> > On Sat, Jan 17, 2026 at 12:58:02AM +0800, Edgecombe, Rick P wrote:
> > > On Fri, 2026-01-16 at 08:31 -0800, Sean Christopherson wrote:
> > IIUC, this concern should be gone as Dave has agreed to use "pfn" as the
> > SEAMCALL parameter [1]?
> > Then should we invoke "KVM_MMU_WARN_ON(!tdx_is_convertible_pfn(pfn));" in KVM
> > for every pfn of a huge mapping? Or should we keep the sanity check inside the
> > SEAMCALL wrappers?
> 
> I don't have a strong preference.  But if it goes in KVM, definitely guard it with
> KVM_MMU_WARN_ON().
Thank you for your insights, Sean!

> > BTW, I have another question about the SEAMCALL wrapper implementation, as Kai
> > also pointed out in [2]: since the SEAMCALL wrappers now serve as APIs available
> > to callers besides KVM, should the SEAMCALL wrappers return TDX_OPERAND_INVALID
> > or WARN_ON() (or WARN_ON_ONCE()) on sanity check failure?
> 
> Why not both?  But maybe TDX_SW_ERROR instead of TDX_OPERAND_INVALID?
Hmm, I previously returned TDX_OPERAND_INVALID for non-aligned base PFN.
TDX_SW_ERROR is also ok if we want to indicate that passing an invalid PFN is a
software error.
(I had tdh_mem_page_demote() return TDX_SW_ERROR when an incompatible TDX module
is used, i.e., when !tdx_supports_demote_nointerrupt()).

> If an API has a defined contract and/or set of expectations, and those expectations
> aren't met by the caller, then a WARN is justified.  But the failure still needs
> to be communicated to the caller.
Ok.

The reason for 'not both' is that there's already TDX_BUG_ON_2() in KVM after
the SEAMCALL wrapper returns a non-BUSY error. I'm not sure if having double
WARN_ON_ONCE() calls is good, so I intended to let the caller decide whether to
warn.

> > By returning TDX_OPERAND_INVALID, the caller can check the return code, adjust
> > the input or trigger WARN_ON() by itself;
> > By triggering WARN_ON() directly in the SEAMCALL wrapper, we need to document
> > this requirement for the SEAMCALL wrappers and have the caller invoke the API
> > correctly.
> 
> Document what exactly?  Most of this should be common sense.  E.g. we don't generally
> document that pointers must be non-NULL, because that goes without saying 99.9%
> of the time.
Document the SEAMCALL wrapper's expectations. e.g., for demote, a PFN must be
2MB-aligned, or the caller must not invoke tdh_mem_page_demote() if a TDX module
does not support feature ENHANCED_DEMOTE_INTERRUPTIBILITY...

> IMO, that holds true here as well.  E.g. trying to map memory into a TDX guest
> that isn't convertible is obviously a bug, I don't see any value in formally
> documenting that requirement.
Do we need a comment for documentation above the tdh_mem_page_demote() API?

> > So, it looks that "WARN_ON() directly in the SEAMCALL wrapper" is the preferred
> > approach, right?
> 
> > 
> > [1] https://lore.kernel.org/all/d119c824-4770-41d2-a926-4ab5268ea3a6@intel.com/
> > [2] https://lore.kernel.org/all/baf6df2cc63d8e897455168c1bf07180fc9c1db8.camel@intel.com
> 


* Re: [PATCH v3 00/24] KVM: TDX huge page support for private memory
  2026-02-03  9:18                                   ` Yan Zhao
@ 2026-02-09 17:01                                     ` Sean Christopherson
  0 siblings, 0 replies; 127+ messages in thread
From: Sean Christopherson @ 2026-02-09 17:01 UTC (permalink / raw)
  To: Yan Zhao
  Cc: Rick P Edgecombe, Fan Du, Xiaoyao Li, Kai Huang,
	kvm@vger.kernel.org, Dave Hansen, thomas.lendacky@amd.com,
	vbabka@suse.cz, tabba@google.com, david@kernel.org,
	kas@kernel.org, linux-kernel@vger.kernel.org, Ira Weiny,
	francescolavra.fl@gmail.com, pbonzini@redhat.com,
	ackerleytng@google.com, nik.borisov@suse.com,
	binbin.wu@linux.intel.com, Isaku Yamahata, Chao P Peng,
	michael.roth@amd.com, Vishal Annapurve, sagis@google.com,
	Chao Gao, Jun Miao, jgross@suse.com, pgonda@google.com,
	x86@kernel.org

On Tue, Feb 03, 2026, Yan Zhao wrote:
> On Fri, Jan 30, 2026 at 07:32:48AM -0800, Sean Christopherson wrote:
> > On Mon, Jan 19, 2026, Yan Zhao wrote:
> > > On Sat, Jan 17, 2026 at 12:58:02AM +0800, Edgecombe, Rick P wrote:
> > > > On Fri, 2026-01-16 at 08:31 -0800, Sean Christopherson wrote:
> > > IIUC, this concern should be gone as Dave has agreed to use "pfn" as the
> > > SEAMCALL parameter [1]?
> > > Then should we invoke "KVM_MMU_WARN_ON(!tdx_is_convertible_pfn(pfn));" in KVM
> > > for every pfn of a huge mapping? Or should we keep the sanity check inside the
> > > SEAMCALL wrappers?
> > 
> > I don't have a strong preference.  But if it goes in KVM, definitely guard it with
> > KVM_MMU_WARN_ON().
> Thank you for your insights, Sean!
> 
> > > BTW, I have another question about the SEAMCALL wrapper implementation, as Kai
> > > also pointed out in [2]: since the SEAMCALL wrappers now serve as APIs available
> > > to callers besides KVM, should the SEAMCALL wrappers return TDX_OPERAND_INVALID
> > > or WARN_ON() (or WARN_ON_ONCE()) on sanity check failure?
> > 
> > Why not both?  But maybe TDX_SW_ERROR instead of TDX_OPERAND_INVALID?
> Hmm, I previously returned TDX_OPERAND_INVALID for non-aligned base PFN.
> TDX_SW_ERROR is also ok if we want to indicate that passing an invalid PFN is a
> software error.
> (I had tdh_mem_page_demote() return TDX_SW_ERROR when an incompatible TDX module
> is used, i.e., when !tdx_supports_demote_nointerrupt()).
> 
> > If an API has a defined contract and/or set of expectations, and those expectations
> > aren't met by the caller, then a WARN is justified.  But the failure still needs
> > to be communicated to the caller.
> Ok.
> 
> The reason for 'not both' is that there's already TDX_BUG_ON_2() in KVM after
> the SEAMCALL wrapper returns a non-BUSY error. I'm not sure if having double
> WARN_ON_ONCE() calls is good, so I intended to let the caller decide whether to
> warn.

Two WARNs isn't the end of the world.  It might even be helpful in some cases,
e.g. to more precisely document what went wrong.

> > > By returning TDX_OPERAND_INVALID, the caller can check the return code, adjust
> > > the input or trigger WARN_ON() by itself;
> > > By triggering WARN_ON() directly in the SEAMCALL wrapper, we need to document
> > > this requirement for the SEAMCALL wrappers and have the caller invoke the API
> > > correctly.
> > 
> > Document what exactly?  Most of this should be common sense.  E.g. we don't generally
> > document that pointers must be non-NULL, because that goes without saying 99.9%
> > of the time.
> Document the SEAMCALL wrapper's expectations. e.g., for demote, a PFN must be
> 2MB-aligned, or the caller must not invoke tdh_mem_page_demote() if a TDX module
> does not support feature ENHANCED_DEMOTE_INTERRUPTIBILITY...

FWIW, for me, all of those are self-explanatory and/or effectively covered by the
TDX specs.

> > IMO, that holds true here as well.  E.g. trying to map memory into a TDX guest
> > that isn't convertible is obviously a bug, I don't see any value in formally
> > documenting that requirement.
> Do we need a comment for documentation above the tdh_mem_page_demote() API?

I wouldn't bother, but I truly don't care if the TDX subsystem wants to document
everything in gory detail.


end of thread, other threads:[~2026-02-09 17:01 UTC | newest]

Thread overview: 127+ messages
2026-01-06 10:16 [PATCH v3 00/24] KVM: TDX huge page support for private memory Yan Zhao
2026-01-06 10:18 ` [PATCH v3 01/24] x86/tdx: Enhance tdh_mem_page_aug() to support huge pages Yan Zhao
2026-01-06 21:08   ` Dave Hansen
2026-01-07  9:12     ` Yan Zhao
2026-01-07 16:39       ` Dave Hansen
2026-01-08 19:05         ` Ackerley Tng
2026-01-08 19:24           ` Dave Hansen
2026-01-09 16:21             ` Vishal Annapurve
2026-01-09  3:08         ` Yan Zhao
2026-01-09 18:29           ` Ackerley Tng
2026-01-12  2:41             ` Yan Zhao
2026-01-13 16:50               ` Vishal Annapurve
2026-01-14  1:48                 ` Yan Zhao
2026-01-06 10:18 ` [PATCH v3 02/24] x86/virt/tdx: Add SEAMCALL wrapper tdh_mem_page_demote() Yan Zhao
2026-01-16  1:00   ` Huang, Kai
2026-01-16  8:35     ` Yan Zhao
2026-01-16 11:10       ` Huang, Kai
2026-01-16 11:22         ` Huang, Kai
2026-01-19  6:18           ` Yan Zhao
2026-01-19  6:15         ` Yan Zhao
2026-01-16 11:22   ` Huang, Kai
2026-01-19  5:55     ` Yan Zhao
2026-01-28 22:49   ` Sean Christopherson
2026-01-06 10:19 ` [PATCH v3 03/24] x86/tdx: Enhance tdh_phymem_page_wbinvd_hkid() to invalidate huge pages Yan Zhao
2026-01-06 10:19 ` [PATCH v3 04/24] x86/tdx: Introduce tdx_quirk_reset_folio() to reset private " Yan Zhao
2026-01-06 10:20 ` [PATCH v3 05/24] x86/virt/tdx: Enhance tdh_phymem_page_reclaim() to support " Yan Zhao
2026-01-06 10:20 ` [PATCH v3 06/24] KVM: x86/mmu: Disallow page merging (huge page adjustment) for mirror root Yan Zhao
2026-01-15 22:49   ` Sean Christopherson
2026-01-16  7:54     ` Yan Zhao
2026-01-26 16:08       ` Sean Christopherson
2026-01-27  3:40         ` Yan Zhao
2026-01-28 19:51           ` Sean Christopherson
2026-01-06 10:20 ` [PATCH v3 07/24] KVM: x86/tdp_mmu: Introduce split_external_spte() under write mmu_lock Yan Zhao
2026-01-28 22:38   ` Sean Christopherson
2026-01-06 10:20 ` [PATCH v3 08/24] KVM: TDX: Enable huge page splitting " Yan Zhao
2026-01-06 10:21 ` [PATCH v3 09/24] KVM: x86: Reject splitting huge pages under shared mmu_lock in TDX Yan Zhao
2026-01-06 10:21 ` [PATCH v3 10/24] KVM: x86/tdp_mmu: Alloc external_spt page for mirror page table splitting Yan Zhao
2026-01-06 10:21 ` [PATCH v3 11/24] KVM: x86/mmu: Introduce kvm_split_cross_boundary_leafs() Yan Zhao
2026-01-15 12:25   ` Huang, Kai
2026-01-16 23:39     ` Sean Christopherson
2026-01-19  1:28       ` Yan Zhao
2026-01-19  8:35         ` Huang, Kai
2026-01-19  8:49           ` Huang, Kai
2026-01-19 10:11             ` Yan Zhao
2026-01-19 10:40               ` Huang, Kai
2026-01-19 11:06                 ` Yan Zhao
2026-01-19 12:32                   ` Yan Zhao
2026-01-29 14:36                     ` Sean Christopherson
2026-01-20 17:51         ` Sean Christopherson
2026-01-22  6:27           ` Yan Zhao
2026-01-20 17:57       ` Vishal Annapurve
2026-01-20 18:02         ` Sean Christopherson
2026-01-22  6:33           ` Yan Zhao
2026-01-29 14:51             ` Sean Christopherson
2026-01-06 10:21 ` [PATCH v3 12/24] KVM: x86: Introduce hugepage_set_guest_inhibit() Yan Zhao
2026-01-06 10:22 ` [PATCH v3 13/24] KVM: TDX: Honor the guest's accept level contained in an EPT violation Yan Zhao
2026-01-06 10:22 ` [PATCH v3 14/24] KVM: Change the return type of gfn_handler_t() from bool to int Yan Zhao
2026-01-16  0:21   ` Sean Christopherson
2026-01-16  6:42     ` Yan Zhao
2026-01-06 10:22 ` [PATCH v3 15/24] KVM: x86: Split cross-boundary mirror leafs for KVM_SET_MEMORY_ATTRIBUTES Yan Zhao
2026-01-06 10:22 ` [PATCH v3 16/24] KVM: guest_memfd: Split for punch hole and private-to-shared conversion Yan Zhao
2026-01-28 22:39   ` Sean Christopherson
2026-01-06 10:23 ` [PATCH v3 17/24] KVM: TDX: Get/Put DPAMT page pair only when mapping size is 4KB Yan Zhao
2026-01-06 10:23 ` [PATCH v3 18/24] x86/virt/tdx: Add loud warning when tdx_pamt_put() fails Yan Zhao
2026-01-06 10:23 ` [PATCH v3 19/24] KVM: x86: Introduce per-VM external cache for splitting Yan Zhao
2026-01-21  1:54   ` Huang, Kai
2026-01-21 17:30     ` Sean Christopherson
2026-01-21 19:39       ` Edgecombe, Rick P
2026-01-21 23:01       ` Huang, Kai
2026-01-22  7:03       ` Yan Zhao
2026-01-22  7:30         ` Huang, Kai
2026-01-22  7:49           ` Yan Zhao
2026-01-22 10:33             ` Huang, Kai
2026-01-06 10:23 ` [PATCH v3 20/24] KVM: TDX: Implement per-VM external cache for splitting in TDX Yan Zhao
2026-01-06 10:23 ` [PATCH v3 21/24] KVM: TDX: Add/Remove DPAMT pages for the new S-EPT page for splitting Yan Zhao
2026-01-06 10:24 ` [PATCH v3 22/24] x86/tdx: Add/Remove DPAMT pages for guest private memory to demote Yan Zhao
2026-01-19 10:52   ` Huang, Kai
2026-01-19 11:11     ` Yan Zhao
2026-01-06 10:24 ` [PATCH v3 23/24] x86/tdx: Pass guest memory's PFN info to demote for updating pamt_refcount Yan Zhao
2026-01-06 10:24 ` [PATCH v3 24/24] KVM: TDX: Turn on PG_LEVEL_2M Yan Zhao
2026-01-06 17:47 ` [PATCH v3 00/24] KVM: TDX huge page support for private memory Vishal Annapurve
2026-01-06 21:26   ` Ackerley Tng
2026-01-06 21:38     ` Sean Christopherson
2026-01-06 22:04       ` Ackerley Tng
2026-01-06 23:43         ` Sean Christopherson
2026-01-07  9:03           ` Yan Zhao
2026-01-08 20:11             ` Ackerley Tng
2026-01-09  9:18               ` Yan Zhao
2026-01-09 16:12                 ` Vishal Annapurve
2026-01-09 17:16                   ` Vishal Annapurve
2026-01-09 18:07                   ` Ackerley Tng
2026-01-12  1:39                     ` Yan Zhao
2026-01-12  2:12                       ` Yan Zhao
2026-01-12 19:56                         ` Ackerley Tng
2026-01-13  6:10                           ` Yan Zhao
2026-01-13 16:40                             ` Vishal Annapurve
2026-01-14  9:32                               ` Yan Zhao
2026-01-07 19:22           ` Edgecombe, Rick P
2026-01-07 20:27             ` Sean Christopherson
2026-01-12 20:15           ` Ackerley Tng
2026-01-14  0:33             ` Yan Zhao
2026-01-14  1:24               ` Sean Christopherson
2026-01-14  9:23                 ` Yan Zhao
2026-01-14 15:26                   ` Sean Christopherson
2026-01-14 18:45                     ` Ackerley Tng
2026-01-15  3:08                       ` Yan Zhao
2026-01-15 18:13                         ` Ackerley Tng
2026-01-14 18:56                     ` Dave Hansen
2026-01-15  0:19                       ` Sean Christopherson
2026-01-16 15:45                         ` Edgecombe, Rick P
2026-01-16 16:31                           ` Sean Christopherson
2026-01-16 16:58                             ` Edgecombe, Rick P
2026-01-19  5:53                               ` Yan Zhao
2026-01-30 15:32                                 ` Sean Christopherson
2026-02-03  9:18                                   ` Yan Zhao
2026-02-09 17:01                                     ` Sean Christopherson
2026-01-16 16:57                         ` Dave Hansen
2026-01-16 17:14                           ` Sean Christopherson
2026-01-16 17:45                             ` Dave Hansen
2026-01-16 19:59                               ` Sean Christopherson
2026-01-16 22:25                                 ` Dave Hansen
2026-01-15  1:41                     ` Yan Zhao
2026-01-15 16:26                       ` Sean Christopherson
2026-01-16  0:28 ` Sean Christopherson
2026-01-16 11:25   ` Yan Zhao
2026-01-16 14:46     ` Sean Christopherson
2026-01-19  1:25       ` Yan Zhao
