* [RFC PATCH v2 01/23] x86/tdx: Enhance tdh_mem_page_aug() to support huge pages
2025-08-07 9:39 [RFC PATCH v2 00/23] KVM: TDX huge page support for private memory Yan Zhao
@ 2025-08-07 9:41 ` Yan Zhao
2025-08-07 9:41 ` [RFC PATCH v2 02/23] x86/virt/tdx: Add SEAMCALL wrapper tdh_mem_page_demote() Yan Zhao
` (21 subsequent siblings)
22 siblings, 0 replies; 43+ messages in thread
From: Yan Zhao @ 2025-08-07 9:41 UTC (permalink / raw)
To: pbonzini, seanjc
Cc: linux-kernel, kvm, x86, rick.p.edgecombe, dave.hansen, kas, tabba,
ackerleytng, quic_eberman, michael.roth, david, vannapurve,
vbabka, thomas.lendacky, pgonda, zhiquan1.li, fan.du, jun.miao,
ira.weiny, isaku.yamahata, xiaoyao.li, binbin.wu, chao.p.peng,
yan.y.zhao
Enhance the SEAMCALL wrapper tdh_mem_page_aug() to support huge pages.
The SEAMCALL TDH_MEM_PAGE_AUG currently supports adding physical memory of up
to 2MB in size to the S-EPT.
While keeping the "level" parameter in the tdh_mem_page_aug() wrapper to
allow callers to specify the physical memory size, introduce the parameters
"folio" and "start_idx" to specify the physical memory starting from the
page at "start_idx" within the "folio". The specified physical memory must
be fully contained within a single folio.
Invoke tdx_clflush_page() for each 4KB segment of the physical memory being
added. tdx_clflush_page() performs CLFLUSH operations, which are required on
certain TDX-capable platforms and are done conservatively on all TDX-capable
platforms, to prevent dirty cache lines from being written back later and
corrupting TD memory.
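For reference, a minimal sketch of the size arithmetic the wrapper now relies
on (not part of the patch; it assumes PTE_SHIFT is 9, i.e. 512 entries per
S-EPT level, and the helper name below is made up for illustration):
/*
 * Hypothetical helper mirroring the containment check added to
 * tdh_mem_page_aug(): level 0 covers one 4KB page, level 1 covers
 * 512 pages (2MB).
 */
static bool tdx_aug_range_in_folio(struct folio *folio,
                                   unsigned long start_idx, int level)
{
        unsigned long npages = 1UL << (level * PTE_SHIFT);

        return start_idx + npages <= folio_nr_pages(folio);
}
A caller adding a 2MB region must therefore supply a folio of at least 512
pages, with "start_idx" positioned so that the whole 512-page range fits.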
Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
---
RFC v2:
- Refine patch log. (Rick)
- Removed the level checking. (Kirill, Chao Gao)
- Use "folio", and "start_idx" rather than "page".
- Return TDX_OPERAND_INVALID if the specified physical memory is not
contained within a single folio.
- Use PTE_SHIFT to replace the 9 in "1 << (level * 9)" (Kirill)
- Use C99-style definition of variables inside a loop. (Nikolay Borisov)
RFC v1:
- Rebased to new tdh_mem_page_aug() with "struct page *" as param.
- Check folio, folio_page_idx.
---
arch/x86/include/asm/tdx.h | 3 ++-
arch/x86/kvm/vmx/tdx.c | 4 +++-
arch/x86/virt/vmx/tdx/tdx.c | 14 +++++++++++---
3 files changed, 16 insertions(+), 5 deletions(-)
diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index 48d579092590..f968b736871a 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -171,7 +171,8 @@ u64 tdh_mng_addcx(struct tdx_td *td, struct page *tdcs_page);
u64 tdh_mem_page_add(struct tdx_td *td, u64 gpa, struct page *page, struct page *source, u64 *ext_err1, u64 *ext_err2);
u64 tdh_mem_sept_add(struct tdx_td *td, u64 gpa, int level, struct page *page, u64 *ext_err1, u64 *ext_err2);
u64 tdh_vp_addcx(struct tdx_vp *vp, struct page *tdcx_page);
-u64 tdh_mem_page_aug(struct tdx_td *td, u64 gpa, int level, struct page *page, u64 *ext_err1, u64 *ext_err2);
+u64 tdh_mem_page_aug(struct tdx_td *td, u64 gpa, int level, struct folio *folio,
+ unsigned long start_idx, u64 *ext_err1, u64 *ext_err2);
u64 tdh_mem_range_block(struct tdx_td *td, u64 gpa, int level, u64 *ext_err1, u64 *ext_err2);
u64 tdh_mng_key_config(struct tdx_td *td);
u64 tdh_mng_create(struct tdx_td *td, u16 hkid);
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index ed67f842b6ec..0a2b183899d8 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1593,11 +1593,13 @@ static int tdx_mem_page_aug(struct kvm *kvm, gfn_t gfn,
{
int tdx_level = pg_level_to_tdx_sept_level(level);
struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
+ struct folio *folio = page_folio(page);
gpa_t gpa = gfn_to_gpa(gfn);
u64 entry, level_state;
u64 err;
- err = tdh_mem_page_aug(&kvm_tdx->td, gpa, tdx_level, page, &entry, &level_state);
+ err = tdh_mem_page_aug(&kvm_tdx->td, gpa, tdx_level, folio,
+ folio_page_idx(folio, page), &entry, &level_state);
if (unlikely(tdx_operand_busy(err))) {
tdx_unpin(kvm, page);
return -EBUSY;
diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index e411cf878547..580f14f64822 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -1730,16 +1730,24 @@ u64 tdh_vp_addcx(struct tdx_vp *vp, struct page *tdcx_page)
}
EXPORT_SYMBOL_GPL(tdh_vp_addcx);
-u64 tdh_mem_page_aug(struct tdx_td *td, u64 gpa, int level, struct page *page, u64 *ext_err1, u64 *ext_err2)
+u64 tdh_mem_page_aug(struct tdx_td *td, u64 gpa, int level, struct folio *folio,
+ unsigned long start_idx, u64 *ext_err1, u64 *ext_err2)
{
+ struct page *start = folio_page(folio, start_idx);
+ unsigned long npages = 1 << (level * PTE_SHIFT);
struct tdx_module_args args = {
.rcx = gpa | level,
.rdx = tdx_tdr_pa(td),
- .r8 = page_to_phys(page),
+ .r8 = page_to_phys(start),
};
u64 ret;
- tdx_clflush_page(page);
+ if (start_idx + npages > folio_nr_pages(folio))
+ return TDX_OPERAND_INVALID;
+
+ for (int i = 0; i < npages; i++)
+ tdx_clflush_page(nth_page(start, i));
+
ret = seamcall_ret(TDH_MEM_PAGE_AUG, &args);
*ext_err1 = args.rcx;
--
2.43.2
* [RFC PATCH v2 02/23] x86/virt/tdx: Add SEAMCALL wrapper tdh_mem_page_demote()
2025-08-07 9:39 [RFC PATCH v2 00/23] KVM: TDX huge page support for private memory Yan Zhao
2025-08-07 9:41 ` [RFC PATCH v2 01/23] x86/tdx: Enhance tdh_mem_page_aug() to support huge pages Yan Zhao
@ 2025-08-07 9:41 ` Yan Zhao
2025-09-01 8:55 ` Binbin Wu
2025-08-07 9:42 ` [RFC PATCH v2 03/23] x86/tdx: Enhance tdh_phymem_page_wbinvd_hkid() to invalidate huge pages Yan Zhao
` (20 subsequent siblings)
22 siblings, 1 reply; 43+ messages in thread
From: Yan Zhao @ 2025-08-07 9:41 UTC (permalink / raw)
To: pbonzini, seanjc
Cc: linux-kernel, kvm, x86, rick.p.edgecombe, dave.hansen, kas, tabba,
ackerleytng, quic_eberman, michael.roth, david, vannapurve,
vbabka, thomas.lendacky, pgonda, zhiquan1.li, fan.du, jun.miao,
ira.weiny, isaku.yamahata, xiaoyao.li, binbin.wu, chao.p.peng,
yan.y.zhao
From: Xiaoyao Li <xiaoyao.li@intel.com>
Introduce SEAMCALL wrapper tdh_mem_page_demote() to invoke the SEAMCALL
TDH_MEM_PAGE_DEMOTE, which demotes a huge leaf entry to a non-leaf entry
in the S-EPT.
SEAMCALL TDH_MEM_PAGE_DEMOTE supports the demotion of 2MB or 1GB huge leaf
entries.
The "gpa" and "level" parameters enable the SEAMCALL TDH_MEM_PAGE_DEMOTE to
walk the S-EPT for the huge leaf entry that needs to be demoted.
The "page" parameter specifies a 4KB page that will be used in the demotion
operation to be added as a page table page in the S-EPT.
Invoke tdx_clflush_page() on the 4KB page being added as a page table page.
This function performs CLFLUSH operations on certain TDX-capable platforms,
or conservatively on all TDX-capable platforms, to prevent dirty cache
lines from writing back later and corrupting TD memory.
tdh_mem_page_demote() may fail, e.g., due to an S-EPT walk error or arriving
interrupts. Callers can check the function return value and retrieve extended
error info from the function output parameters "ext_err1" and "ext_err2".
The TDX module has many internal locks. To avoid staying in SEAM mode for
too long, SEAMCALLs return a BUSY error code to the kernel instead of
spinning on the locks. Depending on the specific SEAMCALL, the caller may
need to handle this error in specific ways (e.g., retry). Therefore, return
the SEAMCALL error code directly to the caller without attempting to handle
it in the core kernel.
Do not handle TDX_INTERRUPTED_RESTARTABLE because SEAMCALL
TDH_MEM_PAGE_DEMOTE does not check interrupts (including NMIs) for basic
TDX (with or without Dynamic PAMT).
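For illustration only, a hypothetical KVM-side caller of the new wrapper could
look like the sketch below. The function name and the retry policy are
assumptions, not necessarily what later patches in this series do;
to_kvm_tdx(), tdx_operand_busy(), KVM_BUG_ON() and the pr_tdx_error_*() macros
are the existing helpers in arch/x86/kvm/vmx/tdx.c:
/* Hypothetical caller; "sept_page" is a pre-allocated 4KB S-EPT table page. */
static int tdx_demote_huge_entry(struct kvm *kvm, gpa_t gpa, int tdx_level,
                                 struct page *sept_page)
{
        struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
        u64 entry, level_state;
        u64 err;

        err = tdh_mem_page_demote(&kvm_tdx->td, gpa, tdx_level, sept_page,
                                  &entry, &level_state);
        if (unlikely(tdx_operand_busy(err)))
                return -EBUSY;  /* Let the caller decide whether to retry. */

        if (KVM_BUG_ON(err, kvm)) {
                pr_tdx_error_2(TDH_MEM_PAGE_DEMOTE, err, entry, level_state);
                return -EIO;
        }

        return 0;
}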
Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
---
RFC v2:
- Refine the patch log (Rick).
- Do not handle TDX_INTERRUPTED_RESTARTABLE as the new TDX modules in
planning do not check interrupts for basic TDX.
RFC v1:
- Rebased and split patch. Updated patch log.
---
arch/x86/include/asm/tdx.h | 2 ++
arch/x86/virt/vmx/tdx/tdx.c | 20 ++++++++++++++++++++
arch/x86/virt/vmx/tdx/tdx.h | 1 +
3 files changed, 23 insertions(+)
diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index f968b736871a..d2cf48e273d5 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -178,6 +178,8 @@ u64 tdh_mng_key_config(struct tdx_td *td);
u64 tdh_mng_create(struct tdx_td *td, u16 hkid);
u64 tdh_vp_create(struct tdx_td *td, struct tdx_vp *vp);
u64 tdh_mng_rd(struct tdx_td *td, u64 field, u64 *data);
+u64 tdh_mem_page_demote(struct tdx_td *td, u64 gpa, int level, struct page *page,
+ u64 *ext_err1, u64 *ext_err2);
u64 tdh_mr_extend(struct tdx_td *td, u64 gpa, u64 *ext_err1, u64 *ext_err2);
u64 tdh_mr_finalize(struct tdx_td *td);
u64 tdh_vp_flush(struct tdx_vp *vp);
diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index 580f14f64822..d941f083f741 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -1825,6 +1825,26 @@ u64 tdh_mng_rd(struct tdx_td *td, u64 field, u64 *data)
}
EXPORT_SYMBOL_GPL(tdh_mng_rd);
+u64 tdh_mem_page_demote(struct tdx_td *td, u64 gpa, int level, struct page *page,
+ u64 *ext_err1, u64 *ext_err2)
+{
+ struct tdx_module_args args = {
+ .rcx = gpa | level,
+ .rdx = tdx_tdr_pa(td),
+ .r8 = page_to_phys(page),
+ };
+ u64 ret;
+
+ tdx_clflush_page(page);
+ ret = seamcall_ret(TDH_MEM_PAGE_DEMOTE, &args);
+
+ *ext_err1 = args.rcx;
+ *ext_err2 = args.rdx;
+
+ return ret;
+}
+EXPORT_SYMBOL_GPL(tdh_mem_page_demote);
+
u64 tdh_mr_extend(struct tdx_td *td, u64 gpa, u64 *ext_err1, u64 *ext_err2)
{
struct tdx_module_args args = {
diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h
index 096c78a1d438..a6c0fa53ece9 100644
--- a/arch/x86/virt/vmx/tdx/tdx.h
+++ b/arch/x86/virt/vmx/tdx/tdx.h
@@ -24,6 +24,7 @@
#define TDH_MNG_KEY_CONFIG 8
#define TDH_MNG_CREATE 9
#define TDH_MNG_RD 11
+#define TDH_MEM_PAGE_DEMOTE 15
#define TDH_MR_EXTEND 16
#define TDH_MR_FINALIZE 17
#define TDH_VP_FLUSH 18
--
2.43.2
* Re: [RFC PATCH v2 02/23] x86/virt/tdx: Add SEAMCALL wrapper tdh_mem_page_demote()
2025-08-07 9:41 ` [RFC PATCH v2 02/23] x86/virt/tdx: Add SEAMCALL wrapper tdh_mem_page_demote() Yan Zhao
@ 2025-09-01 8:55 ` Binbin Wu
2025-09-01 9:08 ` Yan Zhao
0 siblings, 1 reply; 43+ messages in thread
From: Binbin Wu @ 2025-09-01 8:55 UTC (permalink / raw)
To: Yan Zhao
Cc: pbonzini, seanjc, linux-kernel, kvm, x86, rick.p.edgecombe,
dave.hansen, kas, tabba, ackerleytng, quic_eberman, michael.roth,
david, vannapurve, vbabka, thomas.lendacky, pgonda, zhiquan1.li,
fan.du, jun.miao, ira.weiny, isaku.yamahata, xiaoyao.li,
chao.p.peng
On 8/7/2025 5:41 PM, Yan Zhao wrote:
> From: Xiaoyao Li <xiaoyao.li@intel.com>
>
> Introduce SEAMCALL wrapper tdh_mem_page_demote() to invoke the SEAMCALL
> TDH_MEM_PAGE_DEMOTE, which demotes a huge leaf entry to a non-leaf entry
> in the S-EPT.
>
> SEAMCALL TDH_MEM_PAGE_DEMOTE supports the demotion of 2MB or 1GB huge leaf
> entries.
>
> The "gpa" and "level" parameters enable the SEAMCALL TDH_MEM_PAGE_DEMOTE to
> walk the S-EPT for the huge leaf entry that needs to be demoted.
>
> The "page" parameter specifies a 4KB page that will be used in the demotion
> operation to be added as a page table page in the S-EPT.
>
> Invoke tdx_clflush_page() on the 4KB page being added as a page table page.
> This function performs CLFLUSH operations on certain TDX-capable platforms,
> or conservatively on all TDX-capable platforms, to prevent dirty cache
> lines from writing back later and corrupting TD memory.
>
> tdh_mem_page_demote() may fail. Callers can check function return value and
> retrieve extended error info from the function output parameters "ext_err1"
> and "ext_err2". e.g., due to S-EPT walk error or arriving interrupts.
>
> The TDX module has many internal locks. To avoid staying in SEAM mode for
> too long, SEAMCALLs return a BUSY error code to the kernel instead of
> spinning on the locks. Depending on the specific SEAMCALL, the caller may
> need to handle this error in specific ways (e.g., retry). Therefore, return
> the SEAMCALL error code directly to the caller without attempting to handle
> it in the core kernel.
>
> Do not handle TDX_INTERRUPTED_RESTARTABLE because SEAMCALL
> TDH_MEM_PAGE_DEMOTE does not check interrupts (including NMIs) for basic
> TDX (with or without Dynamic PAMT).
The cover letter mentions that there is a new TDX module in planning, which
disables the interrupt checking. I guess the TDX module would need to have an
interface to report the change, and KVM then decides whether to enable huge
page support for TDs?
>
> Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> Co-developed-by: Yan Zhao <yan.y.zhao@intel.com>
> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
> ---
> RFC v2:
> - Refine the patch log (Rick).
> - Do not handle TDX_INTERRUPTED_RESTARTABLE as the new TDX modules in
> planning do not check interrupts for basic TDX.
>
> RFC v1:
> - Rebased and split patch. Updated patch log.
> ---
> arch/x86/include/asm/tdx.h | 2 ++
> arch/x86/virt/vmx/tdx/tdx.c | 20 ++++++++++++++++++++
> arch/x86/virt/vmx/tdx/tdx.h | 1 +
> 3 files changed, 23 insertions(+)
>
> diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
> index f968b736871a..d2cf48e273d5 100644
> --- a/arch/x86/include/asm/tdx.h
> +++ b/arch/x86/include/asm/tdx.h
> @@ -178,6 +178,8 @@ u64 tdh_mng_key_config(struct tdx_td *td);
> u64 tdh_mng_create(struct tdx_td *td, u16 hkid);
> u64 tdh_vp_create(struct tdx_td *td, struct tdx_vp *vp);
> u64 tdh_mng_rd(struct tdx_td *td, u64 field, u64 *data);
> +u64 tdh_mem_page_demote(struct tdx_td *td, u64 gpa, int level, struct page *page,
> + u64 *ext_err1, u64 *ext_err2);
> u64 tdh_mr_extend(struct tdx_td *td, u64 gpa, u64 *ext_err1, u64 *ext_err2);
> u64 tdh_mr_finalize(struct tdx_td *td);
> u64 tdh_vp_flush(struct tdx_vp *vp);
> diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> index 580f14f64822..d941f083f741 100644
> --- a/arch/x86/virt/vmx/tdx/tdx.c
> +++ b/arch/x86/virt/vmx/tdx/tdx.c
> @@ -1825,6 +1825,26 @@ u64 tdh_mng_rd(struct tdx_td *td, u64 field, u64 *data)
> }
> EXPORT_SYMBOL_GPL(tdh_mng_rd);
>
> +u64 tdh_mem_page_demote(struct tdx_td *td, u64 gpa, int level, struct page *page,
Nit: Is it better to use a var name that clearly tells that the page is used as
a table page?
> + u64 *ext_err1, u64 *ext_err2)
> +{
> + struct tdx_module_args args = {
> + .rcx = gpa | level,
> + .rdx = tdx_tdr_pa(td),
> + .r8 = page_to_phys(page),
> + };
> + u64 ret;
> +
> + tdx_clflush_page(page);
> + ret = seamcall_ret(TDH_MEM_PAGE_DEMOTE, &args);
> +
> + *ext_err1 = args.rcx;
> + *ext_err2 = args.rdx;
> +
> + return ret;
> +}
> +EXPORT_SYMBOL_GPL(tdh_mem_page_demote);
> +
> u64 tdh_mr_extend(struct tdx_td *td, u64 gpa, u64 *ext_err1, u64 *ext_err2)
> {
> struct tdx_module_args args = {
> diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h
> index 096c78a1d438..a6c0fa53ece9 100644
> --- a/arch/x86/virt/vmx/tdx/tdx.h
> +++ b/arch/x86/virt/vmx/tdx/tdx.h
> @@ -24,6 +24,7 @@
> #define TDH_MNG_KEY_CONFIG 8
> #define TDH_MNG_CREATE 9
> #define TDH_MNG_RD 11
> +#define TDH_MEM_PAGE_DEMOTE 15
> #define TDH_MR_EXTEND 16
> #define TDH_MR_FINALIZE 17
> #define TDH_VP_FLUSH 18
* Re: [RFC PATCH v2 02/23] x86/virt/tdx: Add SEAMCALL wrapper tdh_mem_page_demote()
2025-09-01 8:55 ` Binbin Wu
@ 2025-09-01 9:08 ` Yan Zhao
2025-09-02 16:56 ` Edgecombe, Rick P
0 siblings, 1 reply; 43+ messages in thread
From: Yan Zhao @ 2025-09-01 9:08 UTC (permalink / raw)
To: Binbin Wu
Cc: pbonzini, seanjc, linux-kernel, kvm, x86, rick.p.edgecombe,
dave.hansen, kas, tabba, ackerleytng, quic_eberman, michael.roth,
david, vannapurve, vbabka, thomas.lendacky, pgonda, zhiquan1.li,
fan.du, jun.miao, ira.weiny, isaku.yamahata, xiaoyao.li,
chao.p.peng
On Mon, Sep 01, 2025 at 04:55:30PM +0800, Binbin Wu wrote:
>
>
> On 8/7/2025 5:41 PM, Yan Zhao wrote:
> > From: Xiaoyao Li <xiaoyao.li@intel.com>
> >
> > Introduce SEAMCALL wrapper tdh_mem_page_demote() to invoke the SEAMCALL
> > TDH_MEM_PAGE_DEMOTE, which demotes a huge leaf entry to a non-leaf entry
> > in the S-EPT.
> >
> > SEAMCALL TDH_MEM_PAGE_DEMOTE supports the demotion of 2MB or 1GB huge leaf
> > entries.
> >
> > The "gpa" and "level" parameters enable the SEAMCALL TDH_MEM_PAGE_DEMOTE to
> > walk the S-EPT for the huge leaf entry that needs to be demoted.
> >
> > The "page" parameter specifies a 4KB page that will be used in the demotion
> > operation to be added as a page table page in the S-EPT.
> >
> > Invoke tdx_clflush_page() on the 4KB page being added as a page table page.
> > This function performs CLFLUSH operations on certain TDX-capable platforms,
> > or conservatively on all TDX-capable platforms, to prevent dirty cache
> > lines from writing back later and corrupting TD memory.
> >
> > tdh_mem_page_demote() may fail. Callers can check function return value and
> > retrieve extended error info from the function output parameters "ext_err1"
> > and "ext_err2". e.g., due to S-EPT walk error or arriving interrupts.
> >
> > The TDX module has many internal locks. To avoid staying in SEAM mode for
> > too long, SEAMCALLs return a BUSY error code to the kernel instead of
> > spinning on the locks. Depending on the specific SEAMCALL, the caller may
> > need to handle this error in specific ways (e.g., retry). Therefore, return
> > the SEAMCALL error code directly to the caller without attempting to handle
> > it in the core kernel.
> >
> > Do not handle TDX_INTERRUPTED_RESTARTABLE because SEAMCALL
> > TDH_MEM_PAGE_DEMOTE does not check interrupts (including NMIs) for basic
> > TDX (with or without Dynamic PAMT).
>
> The cover letter mentions that there is a new TDX module in planning, which
> disables the interrupt checking. I guess TDX module would need to have a
> interface to report the change, KVM then decides to enable huge page support or
> not for TDs?
Yes. But I guess detecting the TDX module version, or whether it supports a
certain feature, is a generic problem. E.g., certain versions of the TDX module
have bugs in zero-step mitigation and may block vCPUs from entering.
So, maybe it deserves a separate series?
> >
> > Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
> > Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> > Co-developed-by: Yan Zhao <yan.y.zhao@intel.com>
> > Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
> > ---
> > RFC v2:
> > - Refine the patch log (Rick).
> > - Do not handle TDX_INTERRUPTED_RESTARTABLE as the new TDX modules in
> > planning do not check interrupts for basic TDX.
> >
> > RFC v1:
> > - Rebased and split patch. Updated patch log.
> > ---
> > arch/x86/include/asm/tdx.h | 2 ++
> > arch/x86/virt/vmx/tdx/tdx.c | 20 ++++++++++++++++++++
> > arch/x86/virt/vmx/tdx/tdx.h | 1 +
> > 3 files changed, 23 insertions(+)
> >
> > diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
> > index f968b736871a..d2cf48e273d5 100644
> > --- a/arch/x86/include/asm/tdx.h
> > +++ b/arch/x86/include/asm/tdx.h
> > @@ -178,6 +178,8 @@ u64 tdh_mng_key_config(struct tdx_td *td);
> > u64 tdh_mng_create(struct tdx_td *td, u16 hkid);
> > u64 tdh_vp_create(struct tdx_td *td, struct tdx_vp *vp);
> > u64 tdh_mng_rd(struct tdx_td *td, u64 field, u64 *data);
> > +u64 tdh_mem_page_demote(struct tdx_td *td, u64 gpa, int level, struct page *page,
> > + u64 *ext_err1, u64 *ext_err2);
> > u64 tdh_mr_extend(struct tdx_td *td, u64 gpa, u64 *ext_err1, u64 *ext_err2);
> > u64 tdh_mr_finalize(struct tdx_td *td);
> > u64 tdh_vp_flush(struct tdx_vp *vp);
> > diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> > index 580f14f64822..d941f083f741 100644
> > --- a/arch/x86/virt/vmx/tdx/tdx.c
> > +++ b/arch/x86/virt/vmx/tdx/tdx.c
> > @@ -1825,6 +1825,26 @@ u64 tdh_mng_rd(struct tdx_td *td, u64 field, u64 *data)
> > }
> > EXPORT_SYMBOL_GPL(tdh_mng_rd);
> > +u64 tdh_mem_page_demote(struct tdx_td *td, u64 gpa, int level, struct page *page,
>
> Nit: Is it better to use a var name that clearly tell that the page is used as a
> table page?
Yes, Thanks!
I also plan to do it (as well as for that tdx_spte_demote_private_spte() as
mentioned in
https://lore.kernel.org/all/aKKp3fyoYgaaqidm@yzhao56-desk.sh.intel.com).
> > + u64 *ext_err1, u64 *ext_err2)
> > +{
> > + struct tdx_module_args args = {
> > + .rcx = gpa | level,
> > + .rdx = tdx_tdr_pa(td),
> > + .r8 = page_to_phys(page),
> > + };
> > + u64 ret;
> > +
> > + tdx_clflush_page(page);
> > + ret = seamcall_ret(TDH_MEM_PAGE_DEMOTE, &args);
> > +
> > + *ext_err1 = args.rcx;
> > + *ext_err2 = args.rdx;
> > +
> > + return ret;
> > +}
> > +EXPORT_SYMBOL_GPL(tdh_mem_page_demote);
> > +
> > u64 tdh_mr_extend(struct tdx_td *td, u64 gpa, u64 *ext_err1, u64 *ext_err2)
> > {
> > struct tdx_module_args args = {
> > diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h
> > index 096c78a1d438..a6c0fa53ece9 100644
> > --- a/arch/x86/virt/vmx/tdx/tdx.h
> > +++ b/arch/x86/virt/vmx/tdx/tdx.h
> > @@ -24,6 +24,7 @@
> > #define TDH_MNG_KEY_CONFIG 8
> > #define TDH_MNG_CREATE 9
> > #define TDH_MNG_RD 11
> > +#define TDH_MEM_PAGE_DEMOTE 15
> > #define TDH_MR_EXTEND 16
> > #define TDH_MR_FINALIZE 17
> > #define TDH_VP_FLUSH 18
>
>
* Re: [RFC PATCH v2 02/23] x86/virt/tdx: Add SEAMCALL wrapper tdh_mem_page_demote()
2025-09-01 9:08 ` Yan Zhao
@ 2025-09-02 16:56 ` Edgecombe, Rick P
2025-09-02 17:37 ` Sean Christopherson
0 siblings, 1 reply; 43+ messages in thread
From: Edgecombe, Rick P @ 2025-09-02 16:56 UTC (permalink / raw)
To: Zhao, Yan Y, binbin.wu@linux.intel.com
Cc: kvm@vger.kernel.org, quic_eberman@quicinc.com, Li, Xiaoyao,
Du, Fan, Hansen, Dave, david@redhat.com, thomas.lendacky@amd.com,
tabba@google.com, vbabka@suse.cz, michael.roth@amd.com,
seanjc@google.com, Weiny, Ira, kas@kernel.org,
pbonzini@redhat.com, ackerleytng@google.com,
linux-kernel@vger.kernel.org, Yamahata, Isaku, Peng, Chao P,
zhiquan1.li@intel.com, Annapurve, Vishal, Miao, Jun,
x86@kernel.org, pgonda@google.com
On Mon, 2025-09-01 at 17:08 +0800, Yan Zhao wrote:
> > The cover letter mentions that there is a new TDX module in planning, which
> > disables the interrupt checking. I guess TDX module would need to have a
> > interface to report the change, KVM then decides to enable huge page support
> > or not for TDs?
> Yes. But I guess detecting TDX module version or if it supports certain
> feature is a generic problem. e.g., certain versions of TDX module have bugs
> in zero-step mitigation and may block vCPU entering.
>
We had talked in the past of not checking versions because it would require KVM
to keep logic of which features in which TDX module.
If there is a flag we could check it, but we did not ask for one here. We
already have a situation where there are bug fixes that KVM depends on, with no
way to check.
I guess the difference here is that if the behavior is missing, KVM has an
option to continue with just small pages. But at the same time, huge pages is
very likely to succeed in either case. The "feature" is closer to closing a
theoretical race. So very much like the many bugs we don't check for. I'm
leaning towards lumping it into that category. And we can add "how do we want to
check for TDX module bugs" to the arch todo list. But it's probably down the
list, if we even want to do anything.
What do you think?
* Re: [RFC PATCH v2 02/23] x86/virt/tdx: Add SEAMCALL wrapper tdh_mem_page_demote()
2025-09-02 16:56 ` Edgecombe, Rick P
@ 2025-09-02 17:37 ` Sean Christopherson
2025-09-02 17:45 ` Edgecombe, Rick P
0 siblings, 1 reply; 43+ messages in thread
From: Sean Christopherson @ 2025-09-02 17:37 UTC (permalink / raw)
To: Rick P Edgecombe
Cc: Yan Y Zhao, binbin.wu@linux.intel.com, kvm@vger.kernel.org,
quic_eberman@quicinc.com, Xiaoyao Li, Fan Du, Dave Hansen,
david@redhat.com, thomas.lendacky@amd.com, tabba@google.com,
vbabka@suse.cz, michael.roth@amd.com, Ira Weiny, kas@kernel.org,
pbonzini@redhat.com, ackerleytng@google.com,
linux-kernel@vger.kernel.org, Isaku Yamahata, Chao P Peng,
zhiquan1.li@intel.com, Vishal Annapurve, Jun Miao, x86@kernel.org,
pgonda@google.com
On Tue, Sep 02, 2025, Rick P Edgecombe wrote:
> On Mon, 2025-09-01 at 17:08 +0800, Yan Zhao wrote:
> > > The cover letter mentions that there is a new TDX module in planning, which
> > > disables the interrupt checking. I guess TDX module would need to have a
> > > interface to report the change, KVM then decides to enable huge page support
> > > or not for TDs?
> > Yes. But I guess detecting TDX module version or if it supports certain
> > feature is a generic problem. e.g., certain versions of TDX module have bugs
> > in zero-step mitigation and may block vCPU entering.
> >
>
> We had talked in the past of not checking versions because it would require KVM
> to keep logic of which features in which TDX module.
Checking for features is different from refusing to load broken modules. I don't
want KVM to rely on version numbers to query features, because that relies on
"newer" module versions always being a superset relative to "older" versions.
> If there is a flag we could check it, but we did not ask for one here. We
> already have a situation where there are bug fixes that KVM depends on, with no
> way to check.
>
> I guess the difference here is that if the behavior is missing, KVM has an
> option to continue with just small pages. But at the same time, huge pages is
> very likely to succeed in either case. The "feature" is closer to closing a
> theoretical race. So very much like the many bugs we don't check for. I'm
> leaning towards lumping it into that category. And we can add "how do we want to
> check for TDX module bugs" to the arch todo list. But it's probably down the
> list, if we even want to do anything.
>
> What do you think?
Could we taint the kernel and print a scary message if a known-buggy TDX module
is loaded?
* Re: [RFC PATCH v2 02/23] x86/virt/tdx: Add SEAMCALL wrapper tdh_mem_page_demote()
2025-09-02 17:37 ` Sean Christopherson
@ 2025-09-02 17:45 ` Edgecombe, Rick P
0 siblings, 0 replies; 43+ messages in thread
From: Edgecombe, Rick P @ 2025-09-02 17:45 UTC (permalink / raw)
To: seanjc@google.com
Cc: quic_eberman@quicinc.com, Li, Xiaoyao, Du, Fan, Hansen, Dave,
david@redhat.com, thomas.lendacky@amd.com, Zhao, Yan Y,
tabba@google.com, kvm@vger.kernel.org, michael.roth@amd.com,
binbin.wu@linux.intel.com, Weiny, Ira, vbabka@suse.cz,
pbonzini@redhat.com, ackerleytng@google.com, kas@kernel.org,
Yamahata, Isaku, Peng, Chao P, linux-kernel@vger.kernel.org,
Annapurve, Vishal, Miao, Jun, zhiquan1.li@intel.com,
x86@kernel.org, pgonda@google.com
On Tue, 2025-09-02 at 10:37 -0700, Sean Christopherson wrote:
> > If there is a flag we could check it, but we did not ask for one here. We
> > already have a situation where there are bug fixes that KVM depends on, with
> > no way to check.
> >
> > I guess the difference here is that if the behavior is missing, KVM has an
> > option to continue with just small pages. But at the same time, huge pages
> > is very likely to succeed in either case. The "feature" is closer to closing
> > a theoretical race. So very much like the many bugs we don't check for. I'm
> > leaning towards lumping it into that category. And we can add "how do we
> > want to check for TDX module bugs" to the arch todo list. But it's probably
> > down the list, if we even want to do anything.
> >
> > What do you think?
>
> Could we taint the kernel and print a scary message if a known-buggy TDX
> module is loaded?
If we know which TDX modules have bugs, I guess. There may be some bugs that
only affect the guest, where tainting would not be appropriate. We would
probably want to do it at TDX module load time, so that people who don't use
TDX don't get their kernel tainted by an old TDX module in the BIOS.
What would you want a TDX module interface for this to look like? Like a bitmap
of fixed bugs? KVM keeps a list of bugs it cares about and compares it to the
list provided by TDX module? I think it could work if KVM is ok selecting and
keeping a bitmap of TDX module bugs.
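Purely to make the idea concrete, a rough sketch of what such a check could
look like is below, folding in the tainting idea from earlier in the thread.
The bitmap, the bit definitions and how the TDX module would report them are
all hypothetical; no such interface exists today:
/* Hypothetical: bit N set means the loaded TDX module contains fix N. */
#define TDX_MODULE_FIX_DEMOTE_NO_INTR_CHECK     BIT_ULL(0)
#define TDX_MODULE_FIX_ZERO_STEP_VCPU_BLOCK     BIT_ULL(1)

/* Fixes this kernel knows it cares about. */
#define TDX_MODULE_FIXES_REQUIRED       (TDX_MODULE_FIX_DEMOTE_NO_INTR_CHECK | \
                                         TDX_MODULE_FIX_ZERO_STEP_VCPU_BLOCK)

static void tdx_check_module_fixes(u64 reported_fixes)
{
        u64 missing = TDX_MODULE_FIXES_REQUIRED & ~reported_fixes;

        if (missing) {
                pr_warn("TDX module lacks known fixes (mask 0x%llx), tainting kernel\n",
                        missing);
                add_taint(TAINT_FIRMWARE_WORKAROUND, LOCKDEP_STILL_OK);
        }
}
Something like this would run once at TDX module initialization time, so
systems that never use TDX would not get tainted.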
* [RFC PATCH v2 03/23] x86/tdx: Enhance tdh_phymem_page_wbinvd_hkid() to invalidate huge pages
2025-08-07 9:39 [RFC PATCH v2 00/23] KVM: TDX huge page support for private memory Yan Zhao
2025-08-07 9:41 ` [RFC PATCH v2 01/23] x86/tdx: Enhance tdh_mem_page_aug() to support huge pages Yan Zhao
2025-08-07 9:41 ` [RFC PATCH v2 02/23] x86/virt/tdx: Add SEAMCALL wrapper tdh_mem_page_demote() Yan Zhao
@ 2025-08-07 9:42 ` Yan Zhao
2025-08-07 9:42 ` [RFC PATCH v2 04/23] KVM: TDX: Introduce tdx_clear_folio() to clear " Yan Zhao
` (19 subsequent siblings)
22 siblings, 0 replies; 43+ messages in thread
From: Yan Zhao @ 2025-08-07 9:42 UTC (permalink / raw)
To: pbonzini, seanjc
Cc: linux-kernel, kvm, x86, rick.p.edgecombe, dave.hansen, kas, tabba,
ackerleytng, quic_eberman, michael.roth, david, vannapurve,
vbabka, thomas.lendacky, pgonda, zhiquan1.li, fan.du, jun.miao,
ira.weiny, isaku.yamahata, xiaoyao.li, binbin.wu, chao.p.peng,
yan.y.zhao
After removing a TD's private page, the TDX module does not write back and
invalidate cache lines associated with the page and its keyID (i.e., the
TD's guest keyID). The SEAMCALL wrapper tdh_phymem_page_wbinvd_hkid() lets the
caller pass the TD's guest keyID and a physical memory address to the SEAMCALL
TDH_PHYMEM_PAGE_WBINVD, which performs the cache line invalidation.
Enhance the SEAMCALL wrapper tdh_phymem_page_wbinvd_hkid() to support cache
line invalidation for huge pages by introducing the parameters "folio",
"start_idx", and "npages". These parameters specify the physical memory
starting from the page at "start_idx" within a "folio" and spanning
"npages" contiguous PFNs. Return TDX_OPERAND_INVALID if the specified
memory is not entirely contained within a single folio.
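As a usage illustration (a sketch that mirrors what the KVM hunk below does;
the surrounding context is simplified and the local variables are assumed):
struct folio *folio = page_folio(page);
u64 err;

/* A 2MB leaf covers KVM_PAGES_PER_HPAGE(PG_LEVEL_2M) == 512 4KB pages. */
err = tdh_phymem_page_wbinvd_hkid((u16)kvm_tdx->hkid, folio,
                                  folio_page_idx(folio, page),
                                  KVM_PAGES_PER_HPAGE(level));
if (err)
        pr_tdx_error(TDH_PHYMEM_PAGE_WBINVD, err);
Internally the wrapper issues one TDH_PHYMEM_PAGE_WBINVD per 4KB page in the
range and stops at the first error.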
Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Suggested-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
---
RFC v2:
- Enhance tdh_phymem_page_wbinvd_hkid() to invalidate multiple pages
directly, rather than looping within KVM, following Dave's suggestion:
"Don't wrap the wrappers." (Rick).
RFC v1:
- Split patch
- Added a helper tdx_wbinvd_page() in TDX, which accepts param
  "struct page *".
---
arch/x86/include/asm/tdx.h | 4 ++--
arch/x86/kvm/vmx/tdx.c | 6 ++++--
arch/x86/virt/vmx/tdx/tdx.c | 17 ++++++++++++++---
3 files changed, 20 insertions(+), 7 deletions(-)
diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index d2cf48e273d5..a125bb20a28a 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -194,8 +194,8 @@ u64 tdh_mem_track(struct tdx_td *tdr);
u64 tdh_mem_page_remove(struct tdx_td *td, u64 gpa, u64 level, u64 *ext_err1, u64 *ext_err2);
u64 tdh_phymem_cache_wb(bool resume);
u64 tdh_phymem_page_wbinvd_tdr(struct tdx_td *td);
-u64 tdh_phymem_page_wbinvd_hkid(u64 hkid, struct page *page);
-
+u64 tdh_phymem_page_wbinvd_hkid(u64 hkid, struct folio *folio,
+ unsigned long start_idx, unsigned long npages);
void tdx_meminfo(struct seq_file *m);
#else
static inline void tdx_init(void) { }
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 0a2b183899d8..8eaf8431c5f1 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1694,6 +1694,7 @@ static int tdx_sept_drop_private_spte(struct kvm *kvm, gfn_t gfn,
{
int tdx_level = pg_level_to_tdx_sept_level(level);
struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
+ struct folio *folio = page_folio(page);
gpa_t gpa = gfn_to_gpa(gfn);
u64 err, entry, level_state;
@@ -1728,8 +1729,9 @@ static int tdx_sept_drop_private_spte(struct kvm *kvm, gfn_t gfn,
return -EIO;
}
- err = tdh_phymem_page_wbinvd_hkid((u16)kvm_tdx->hkid, page);
-
+ err = tdh_phymem_page_wbinvd_hkid((u16)kvm_tdx->hkid, folio,
+ folio_page_idx(folio, page),
+ KVM_PAGES_PER_HPAGE(level));
if (KVM_BUG_ON(err, kvm)) {
pr_tdx_error(TDH_PHYMEM_PAGE_WBINVD, err);
return -EIO;
diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index d941f083f741..64219c659844 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -2030,13 +2030,24 @@ u64 tdh_phymem_page_wbinvd_tdr(struct tdx_td *td)
}
EXPORT_SYMBOL_GPL(tdh_phymem_page_wbinvd_tdr);
-u64 tdh_phymem_page_wbinvd_hkid(u64 hkid, struct page *page)
+u64 tdh_phymem_page_wbinvd_hkid(u64 hkid, struct folio *folio,
+ unsigned long start_idx, unsigned long npages)
{
+ struct page *start = folio_page(folio, start_idx);
struct tdx_module_args args = {};
+ u64 err;
+
+ if (start_idx + npages > folio_nr_pages(folio))
+ return TDX_OPERAND_INVALID;
- args.rcx = mk_keyed_paddr(hkid, page);
+ for (unsigned long i = 0; i < npages; i++) {
+ args.rcx = mk_keyed_paddr(hkid, nth_page(start, i));
- return seamcall(TDH_PHYMEM_PAGE_WBINVD, &args);
+ err = seamcall(TDH_PHYMEM_PAGE_WBINVD, &args);
+ if (err)
+ break;
+ }
+ return err;
}
EXPORT_SYMBOL_GPL(tdh_phymem_page_wbinvd_hkid);
--
2.43.2
* [RFC PATCH v2 04/23] KVM: TDX: Introduce tdx_clear_folio() to clear huge pages
2025-08-07 9:39 [RFC PATCH v2 00/23] KVM: TDX huge page support for private memory Yan Zhao
` (2 preceding siblings ...)
2025-08-07 9:42 ` [RFC PATCH v2 03/23] x86/tdx: Enhance tdh_phymem_page_wbinvd_hkid() to invalidate huge pages Yan Zhao
@ 2025-08-07 9:42 ` Yan Zhao
2025-09-02 2:56 ` Binbin Wu
2025-08-07 9:42 ` [RFC PATCH v2 05/23] x86/tdx: Enhance tdh_phymem_page_reclaim() to support " Yan Zhao
` (18 subsequent siblings)
22 siblings, 1 reply; 43+ messages in thread
From: Yan Zhao @ 2025-08-07 9:42 UTC (permalink / raw)
To: pbonzini, seanjc
Cc: linux-kernel, kvm, x86, rick.p.edgecombe, dave.hansen, kas, tabba,
ackerleytng, quic_eberman, michael.roth, david, vannapurve,
vbabka, thomas.lendacky, pgonda, zhiquan1.li, fan.du, jun.miao,
ira.weiny, isaku.yamahata, xiaoyao.li, binbin.wu, chao.p.peng,
yan.y.zhao
After removing or reclaiming a guest private page or a control page from a
TD, zero the physical page using movdir64b(), enabling the kernel to reuse
the pages.
Introduce the function tdx_clear_folio() to zero out physical memory using
movdir64b(), starting from the page at "start_idx" within a "folio" and
spanning "npages" contiguous PFNs.
Convert tdx_clear_page() to be a helper function to facilitate the
zeroing of 4KB pages.
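For illustration, the intended call patterns after this change (a usage
sketch mirroring the hunks below; the 2MB case assumes
KVM_PAGES_PER_HPAGE(PG_LEVEL_2M) == 512):
/* 4KB case, e.g. a TD control page: */
tdx_clear_page(page);

/* Huge-mapping case: clear the whole range backing the removed leaf. */
struct folio *folio = page_folio(page);

tdx_clear_folio(folio, folio_page_idx(folio, page), KVM_PAGES_PER_HPAGE(level));
Keeping __mb() outside the per-page loop means the whole range is zeroed with
MOVDIR64B before a single memory barrier is issued, rather than one barrier per
4KB page.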
Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
---
RFC v2:
- Add tdx_clear_folio().
- Drop inner loop _tdx_clear_page() and move __mb() outside of the loop.
(Rick)
- Use C99-style definition of variables inside a for loop.
- Note: [1] also changes tdx_clear_page(). RFC v2 is not based on [1] now.
[1] https://lore.kernel.org/all/20250724130354.79392-2-adrian.hunter@intel.com
RFC v1:
- split out, let tdx_clear_page() accept level.
---
arch/x86/kvm/vmx/tdx.c | 22 ++++++++++++++++------
1 file changed, 16 insertions(+), 6 deletions(-)
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 8eaf8431c5f1..4fabefb27135 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -277,18 +277,21 @@ static inline void tdx_disassociate_vp(struct kvm_vcpu *vcpu)
vcpu->cpu = -1;
}
-static void tdx_clear_page(struct page *page)
+static void tdx_clear_folio(struct folio *folio, unsigned long start_idx,
+ unsigned long npages)
{
const void *zero_page = (const void *) page_to_virt(ZERO_PAGE(0));
- void *dest = page_to_virt(page);
- unsigned long i;
/*
* The page could have been poisoned. MOVDIR64B also clears
* the poison bit so the kernel can safely use the page again.
*/
- for (i = 0; i < PAGE_SIZE; i += 64)
- movdir64b(dest + i, zero_page);
+ for (unsigned long j = 0; j < npages; j++) {
+ void *dest = page_to_virt(folio_page(folio, start_idx + j));
+
+ for (unsigned long i = 0; i < PAGE_SIZE; i += 64)
+ movdir64b(dest + i, zero_page);
+ }
/*
* MOVDIR64B store uses WC buffer. Prevent following memory reads
* from seeing potentially poisoned cache.
@@ -296,6 +299,13 @@ static void tdx_clear_page(struct page *page)
__mb();
}
+static inline void tdx_clear_page(struct page *page)
+{
+ struct folio *folio = page_folio(page);
+
+ tdx_clear_folio(folio, folio_page_idx(folio, page), 1);
+}
+
static void tdx_no_vcpus_enter_start(struct kvm *kvm)
{
struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
@@ -1736,7 +1746,7 @@ static int tdx_sept_drop_private_spte(struct kvm *kvm, gfn_t gfn,
pr_tdx_error(TDH_PHYMEM_PAGE_WBINVD, err);
return -EIO;
}
- tdx_clear_page(page);
+ tdx_clear_folio(folio, folio_page_idx(folio, page), KVM_PAGES_PER_HPAGE(level));
tdx_pamt_put(page, level);
tdx_unpin(kvm, page);
return 0;
--
2.43.2
* Re: [RFC PATCH v2 04/23] KVM: TDX: Introduce tdx_clear_folio() to clear huge pages
2025-08-07 9:42 ` [RFC PATCH v2 04/23] KVM: TDX: Introduce tdx_clear_folio() to clear " Yan Zhao
@ 2025-09-02 2:56 ` Binbin Wu
2025-09-03 9:51 ` Yan Zhao
0 siblings, 1 reply; 43+ messages in thread
From: Binbin Wu @ 2025-09-02 2:56 UTC (permalink / raw)
To: Yan Zhao
Cc: pbonzini, seanjc, linux-kernel, kvm, x86, rick.p.edgecombe,
dave.hansen, kas, tabba, ackerleytng, quic_eberman, michael.roth,
david, vannapurve, vbabka, thomas.lendacky, pgonda, zhiquan1.li,
fan.du, jun.miao, ira.weiny, isaku.yamahata, xiaoyao.li,
chao.p.peng
On 8/7/2025 5:42 PM, Yan Zhao wrote:
> After removing or reclaiming a guest private page or a control page from a
> TD, zero the physical page using movdir64b(), enabling the kernel to reuse
> the pages.
>
> Introduce the function tdx_clear_folio() to zero out physical memory using
> movdir64b(), starting from the page at "start_idx" within a "folio" and
> spanning "npages" contiguous PFNs.
>
> Convert tdx_clear_page() to be a helper function to facilitate the
> zeroing of 4KB pages.
I think this sentence is outdated?
>
> Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
> ---
> RFC v2:
> - Add tdx_clear_folio().
> - Drop inner loop _tdx_clear_page() and move __mb() outside of the loop.
> (Rick)
> - Use C99-style definition of variables inside a for loop.
> - Note: [1] also changes tdx_clear_page(). RFC v2 is not based on [1] now.
>
> [1] https://lore.kernel.org/all/20250724130354.79392-2-adrian.hunter@intel.com
>
> RFC v1:
> - split out, let tdx_clear_page() accept level.
> ---
> arch/x86/kvm/vmx/tdx.c | 22 ++++++++++++++++------
> 1 file changed, 16 insertions(+), 6 deletions(-)
>
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index 8eaf8431c5f1..4fabefb27135 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -277,18 +277,21 @@ static inline void tdx_disassociate_vp(struct kvm_vcpu *vcpu)
> vcpu->cpu = -1;
> }
>
> -static void tdx_clear_page(struct page *page)
> +static void tdx_clear_folio(struct folio *folio, unsigned long start_idx,
> + unsigned long npages)
> {
> const void *zero_page = (const void *) page_to_virt(ZERO_PAGE(0));
> - void *dest = page_to_virt(page);
> - unsigned long i;
>
> /*
> * The page could have been poisoned. MOVDIR64B also clears
> * the poison bit so the kernel can safely use the page again.
> */
> - for (i = 0; i < PAGE_SIZE; i += 64)
> - movdir64b(dest + i, zero_page);
> + for (unsigned long j = 0; j < npages; j++) {
> + void *dest = page_to_virt(folio_page(folio, start_idx + j));
> +
> + for (unsigned long i = 0; i < PAGE_SIZE; i += 64)
> + movdir64b(dest + i, zero_page);
> + }
> /*
> * MOVDIR64B store uses WC buffer. Prevent following memory reads
> * from seeing potentially poisoned cache.
> @@ -296,6 +299,13 @@ static void tdx_clear_page(struct page *page)
> __mb();
> }
>
> +static inline void tdx_clear_page(struct page *page)
No need to tag a local static function with "inline".
> +{
> + struct folio *folio = page_folio(page);
> +
> + tdx_clear_folio(folio, folio_page_idx(folio, page), 1);
This is strange at my first thought.
And then I realized that it is to avoid unnecessary memory barrier.
No better idea so far.
> +}
> +
> static void tdx_no_vcpus_enter_start(struct kvm *kvm)
> {
> struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
> @@ -1736,7 +1746,7 @@ static int tdx_sept_drop_private_spte(struct kvm *kvm, gfn_t gfn,
> pr_tdx_error(TDH_PHYMEM_PAGE_WBINVD, err);
> return -EIO;
> }
> - tdx_clear_page(page);
> + tdx_clear_folio(folio, folio_page_idx(folio, page), KVM_PAGES_PER_HPAGE(level));
> tdx_pamt_put(page, level);
> tdx_unpin(kvm, page);
> return 0;
* Re: [RFC PATCH v2 04/23] KVM: TDX: Introduce tdx_clear_folio() to clear huge pages
2025-09-02 2:56 ` Binbin Wu
@ 2025-09-03 9:51 ` Yan Zhao
2025-09-03 11:19 ` Binbin Wu
0 siblings, 1 reply; 43+ messages in thread
From: Yan Zhao @ 2025-09-03 9:51 UTC (permalink / raw)
To: Binbin Wu
Cc: pbonzini, seanjc, linux-kernel, kvm, x86, rick.p.edgecombe,
dave.hansen, kas, tabba, ackerleytng, quic_eberman, michael.roth,
david, vannapurve, vbabka, thomas.lendacky, pgonda, zhiquan1.li,
fan.du, jun.miao, ira.weiny, isaku.yamahata, xiaoyao.li,
chao.p.peng
On Tue, Sep 02, 2025 at 10:56:25AM +0800, Binbin Wu wrote:
>
>
> On 8/7/2025 5:42 PM, Yan Zhao wrote:
> > After removing or reclaiming a guest private page or a control page from a
> > TD, zero the physical page using movdir64b(), enabling the kernel to reuse
> > the pages.
> >
> > Introduce the function tdx_clear_folio() to zero out physical memory using
> > movdir64b(), starting from the page at "start_idx" within a "folio" and
> > spanning "npages" contiguous PFNs.
> >
> > Convert tdx_clear_page() to be a helper function to facilitate the
> > zeroing of 4KB pages.
>
> I think this sentence is outdated?
No? tdx_clear_page() is still invoked to clear tdr_page.
> >
> > Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
> > Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> > Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
> > ---
> > RFC v2:
> > - Add tdx_clear_folio().
> > - Drop inner loop _tdx_clear_page() and move __mb() outside of the loop.
> > (Rick)
> > - Use C99-style definition of variables inside a for loop.
> > - Note: [1] also changes tdx_clear_page(). RFC v2 is not based on [1] now.
> >
> > [1] https://lore.kernel.org/all/20250724130354.79392-2-adrian.hunter@intel.com
> >
> > RFC v1:
> > - split out, let tdx_clear_page() accept level.
> > ---
> > arch/x86/kvm/vmx/tdx.c | 22 ++++++++++++++++------
> > 1 file changed, 16 insertions(+), 6 deletions(-)
> >
> > diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> > index 8eaf8431c5f1..4fabefb27135 100644
> > --- a/arch/x86/kvm/vmx/tdx.c
> > +++ b/arch/x86/kvm/vmx/tdx.c
> > @@ -277,18 +277,21 @@ static inline void tdx_disassociate_vp(struct kvm_vcpu *vcpu)
> > vcpu->cpu = -1;
> > }
> > -static void tdx_clear_page(struct page *page)
> > +static void tdx_clear_folio(struct folio *folio, unsigned long start_idx,
> > + unsigned long npages)
> > {
> > const void *zero_page = (const void *) page_to_virt(ZERO_PAGE(0));
> > - void *dest = page_to_virt(page);
> > - unsigned long i;
> > /*
> > * The page could have been poisoned. MOVDIR64B also clears
> > * the poison bit so the kernel can safely use the page again.
> > */
> > - for (i = 0; i < PAGE_SIZE; i += 64)
> > - movdir64b(dest + i, zero_page);
> > + for (unsigned long j = 0; j < npages; j++) {
> > + void *dest = page_to_virt(folio_page(folio, start_idx + j));
> > +
> > + for (unsigned long i = 0; i < PAGE_SIZE; i += 64)
> > + movdir64b(dest + i, zero_page);
> > + }
> > /*
> > * MOVDIR64B store uses WC buffer. Prevent following memory reads
> > * from seeing potentially poisoned cache.
> > @@ -296,6 +299,13 @@ static void tdx_clear_page(struct page *page)
> > __mb();
> > }
> > +static inline void tdx_clear_page(struct page *page)
> No need to tag a local static function with "inline".
Ok.
* Re: [RFC PATCH v2 04/23] KVM: TDX: Introduce tdx_clear_folio() to clear huge pages
2025-09-03 9:51 ` Yan Zhao
@ 2025-09-03 11:19 ` Binbin Wu
2025-09-04 2:53 ` Yan Zhao
0 siblings, 1 reply; 43+ messages in thread
From: Binbin Wu @ 2025-09-03 11:19 UTC (permalink / raw)
To: Yan Zhao
Cc: pbonzini, seanjc, linux-kernel, kvm, x86, rick.p.edgecombe,
dave.hansen, kas, tabba, ackerleytng, quic_eberman, michael.roth,
david, vannapurve, vbabka, thomas.lendacky, pgonda, zhiquan1.li,
fan.du, jun.miao, ira.weiny, isaku.yamahata, xiaoyao.li,
chao.p.peng
On 9/3/2025 5:51 PM, Yan Zhao wrote:
> On Tue, Sep 02, 2025 at 10:56:25AM +0800, Binbin Wu wrote:
>>
>> On 8/7/2025 5:42 PM, Yan Zhao wrote:
>>> After removing or reclaiming a guest private page or a control page from a
>>> TD, zero the physical page using movdir64b(), enabling the kernel to reuse
>>> the pages.
>>>
>>> Introduce the function tdx_clear_folio() to zero out physical memory using
>>> movdir64b(), starting from the page at "start_idx" within a "folio" and
>>> spanning "npages" contiguous PFNs.
>>>
>>> Convert tdx_clear_page() to be a helper function to facilitate the
>>> zeroing of 4KB pages.
>> I think this sentence is outdated?
> No? tdx_clear_page() is still invoked to clear tdr_page.
I didn't get the word "Convert".
>
>>> Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
>>> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
>>> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
>>> ---
>>> RFC v2:
>>> - Add tdx_clear_folio().
>>> - Drop inner loop _tdx_clear_page() and move __mb() outside of the loop.
>>> (Rick)
>>> - Use C99-style definition of variables inside a for loop.
>>> - Note: [1] also changes tdx_clear_page(). RFC v2 is not based on [1] now.
>>>
>>> [1] https://lore.kernel.org/all/20250724130354.79392-2-adrian.hunter@intel.com
>>>
>>> RFC v1:
>>> - split out, let tdx_clear_page() accept level.
>>> ---
>>> arch/x86/kvm/vmx/tdx.c | 22 ++++++++++++++++------
>>> 1 file changed, 16 insertions(+), 6 deletions(-)
>>>
>>> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
>>> index 8eaf8431c5f1..4fabefb27135 100644
>>> --- a/arch/x86/kvm/vmx/tdx.c
>>> +++ b/arch/x86/kvm/vmx/tdx.c
>>> @@ -277,18 +277,21 @@ static inline void tdx_disassociate_vp(struct kvm_vcpu *vcpu)
>>> vcpu->cpu = -1;
>>> }
>>> -static void tdx_clear_page(struct page *page)
>>> +static void tdx_clear_folio(struct folio *folio, unsigned long start_idx,
>>> + unsigned long npages)
>>> {
>>> const void *zero_page = (const void *) page_to_virt(ZERO_PAGE(0));
>>> - void *dest = page_to_virt(page);
>>> - unsigned long i;
>>> /*
>>> * The page could have been poisoned. MOVDIR64B also clears
>>> * the poison bit so the kernel can safely use the page again.
>>> */
>>> - for (i = 0; i < PAGE_SIZE; i += 64)
>>> - movdir64b(dest + i, zero_page);
>>> + for (unsigned long j = 0; j < npages; j++) {
>>> + void *dest = page_to_virt(folio_page(folio, start_idx + j));
>>> +
>>> + for (unsigned long i = 0; i < PAGE_SIZE; i += 64)
>>> + movdir64b(dest + i, zero_page);
>>> + }
>>> /*
>>> * MOVDIR64B store uses WC buffer. Prevent following memory reads
>>> * from seeing potentially poisoned cache.
>>> @@ -296,6 +299,13 @@ static void tdx_clear_page(struct page *page)
>>> __mb();
>>> }
>>> +static inline void tdx_clear_page(struct page *page)
>> No need to tag a local static function with "inline".
> Ok.
>
* Re: [RFC PATCH v2 04/23] KVM: TDX: Introduce tdx_clear_folio() to clear huge pages
2025-09-03 11:19 ` Binbin Wu
@ 2025-09-04 2:53 ` Yan Zhao
0 siblings, 0 replies; 43+ messages in thread
From: Yan Zhao @ 2025-09-04 2:53 UTC (permalink / raw)
To: Binbin Wu
Cc: pbonzini, seanjc, linux-kernel, kvm, x86, rick.p.edgecombe,
dave.hansen, kas, tabba, ackerleytng, quic_eberman, michael.roth,
david, vannapurve, vbabka, thomas.lendacky, pgonda, fan.du,
jun.miao, ira.weiny, isaku.yamahata, xiaoyao.li, chao.p.peng
On Wed, Sep 03, 2025 at 07:19:32PM +0800, Binbin Wu wrote:
>
>
> On 9/3/2025 5:51 PM, Yan Zhao wrote:
> > On Tue, Sep 02, 2025 at 10:56:25AM +0800, Binbin Wu wrote:
> > >
> > > On 8/7/2025 5:42 PM, Yan Zhao wrote:
> > > > After removing or reclaiming a guest private page or a control page from a
> > > > TD, zero the physical page using movdir64b(), enabling the kernel to reuse
> > > > the pages.
> > > >
> > > > Introduce the function tdx_clear_folio() to zero out physical memory using
> > > > movdir64b(), starting from the page at "start_idx" within a "folio" and
> > > > spanning "npages" contiguous PFNs.
> > > >
> > > > Convert tdx_clear_page() to be a helper function to facilitate the
> > > > zeroing of 4KB pages.
> > > I think this sentence is outdated?
> > No? tdx_clear_page() is still invoked to clear tdr_page.
>
> I didn't get the word "Convert".
Ok. I wanted to express that tdx_clear_page() now is just a helper.
Will rephrase it to
"Make tdx_clear_page() to be a helper function to facilitate the zeroing
of 4KB pages".
* [RFC PATCH v2 05/23] x86/tdx: Enhance tdh_phymem_page_reclaim() to support huge pages
2025-08-07 9:39 [RFC PATCH v2 00/23] KVM: TDX huge page support for private memory Yan Zhao
` (3 preceding siblings ...)
2025-08-07 9:42 ` [RFC PATCH v2 04/23] KVM: TDX: Introduce tdx_clear_folio() to clear " Yan Zhao
@ 2025-08-07 9:42 ` Yan Zhao
2025-08-07 9:42 ` [RFC PATCH v2 06/23] KVM: TDX: Do not hold page refcount on private guest pages Yan Zhao
` (17 subsequent siblings)
22 siblings, 0 replies; 43+ messages in thread
From: Yan Zhao @ 2025-08-07 9:42 UTC (permalink / raw)
To: pbonzini, seanjc
Cc: linux-kernel, kvm, x86, rick.p.edgecombe, dave.hansen, kas, tabba,
ackerleytng, quic_eberman, michael.roth, david, vannapurve,
vbabka, thomas.lendacky, pgonda, zhiquan1.li, fan.du, jun.miao,
ira.weiny, isaku.yamahata, xiaoyao.li, binbin.wu, chao.p.peng,
yan.y.zhao
Enhance the SEAMCALL wrapper tdh_phymem_page_reclaim() to support huge
pages by introducing new parameters: "folio", "start_idx", and "npages".
These parameters specify the physical memory to be reclaimed, i.e.,
starting from the page at "start_idx" within a folio and spanning "npages"
contiguous PFNs. The specified memory must be entirely contained within a
single folio. Return TDX_SW_ERROR if the size of the reclaimed memory does
not match the specified size.
On the KVM side, introduce tdx_reclaim_folio() to align with and invoke the
SEAMCALL wrapper tdh_phymem_page_reclaim(). The "noclear" parameter specifies
whether to skip the subsequent tdx_clear_folio() call within
tdx_reclaim_folio(). Additionally, provide two helper functions,
tdx_reclaim_page() and tdx_reclaim_page_noclear(), to facilitate the
reclaiming of 4KB pages.
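For reference, the size check the wrapper now performs, restated with the
expected values spelled out (a sketch equivalent to the hunk below, assuming
PTE_SHIFT is 9; "tdx_size" is the page size reported by the SEAMCALL):
/*
 * tdx_size 0 -> 4KB  -> 1 page
 * tdx_size 1 -> 2MB  -> 512 pages
 * tdx_size 2 -> 1GB  -> 262144 pages
 */
if (npages != (1UL << (*tdx_size * PTE_SHIFT)))
        return TDX_SW_ERROR;    /* Reclaimed size != what the caller asked for. */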
Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
---
RFC v2:
- Introduce new params "folio", "start_idx" and "npages" to wrapper
tdh_phymem_page_reclaim().
- Move the checking of return size from KVM to x86/virt and return error.
- Rename tdx_reclaim_page() to tdx_reclaim_folio().
- Add two helper functions, tdx_reclaim_page() and tdx_reclaim_page_noclear(),
  to facilitate the reclaiming of 4KB pages.
RFC v1:
- Rebased and split patch.
---
arch/x86/include/asm/tdx.h | 3 ++-
arch/x86/kvm/vmx/tdx.c | 27 ++++++++++++++++++---------
arch/x86/virt/vmx/tdx/tdx.c | 12 ++++++++++--
3 files changed, 30 insertions(+), 12 deletions(-)
diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index a125bb20a28a..f1bd74348b34 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -189,7 +189,8 @@ u64 tdh_mng_init(struct tdx_td *td, u64 td_params, u64 *extended_err);
u64 tdh_vp_init(struct tdx_vp *vp, u64 initial_rcx, u32 x2apicid);
u64 tdh_vp_rd(struct tdx_vp *vp, u64 field, u64 *data);
u64 tdh_vp_wr(struct tdx_vp *vp, u64 field, u64 data, u64 mask);
-u64 tdh_phymem_page_reclaim(struct page *page, u64 *tdx_pt, u64 *tdx_owner, u64 *tdx_size);
+u64 tdh_phymem_page_reclaim(struct folio *folio, unsigned long start_idx, unsigned long npages,
+ u64 *tdx_pt, u64 *tdx_owner, u64 *tdx_size);
u64 tdh_mem_track(struct tdx_td *tdr);
u64 tdh_mem_page_remove(struct tdx_td *td, u64 gpa, u64 level, u64 *ext_err1, u64 *ext_err2);
u64 tdh_phymem_cache_wb(bool resume);
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 4fabefb27135..facfe589e006 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -327,11 +327,12 @@ static void tdx_no_vcpus_enter_stop(struct kvm *kvm)
}
/* TDH.PHYMEM.PAGE.RECLAIM is allowed only when destroying the TD. */
-static int __tdx_reclaim_page(struct page *page)
+static int tdx_reclaim_folio(struct folio *folio, unsigned long start_idx,
+ unsigned long npages, bool noclear)
{
u64 err, tdx_pt, tdx_owner, tdx_size;
- err = tdh_phymem_page_reclaim(page, &tdx_pt, &tdx_owner, &tdx_size);
+ err = tdh_phymem_page_reclaim(folio, start_idx, npages, &tdx_pt, &tdx_owner, &tdx_size);
/*
* No need to check for TDX_OPERAND_BUSY; all TD pages are freed
@@ -342,19 +343,25 @@ static int __tdx_reclaim_page(struct page *page)
pr_tdx_error_3(TDH_PHYMEM_PAGE_RECLAIM, err, tdx_pt, tdx_owner, tdx_size);
return -EIO;
}
+
+ if (!noclear)
+ tdx_clear_folio(folio, start_idx, npages);
return 0;
}
static int tdx_reclaim_page(struct page *page)
{
- int r;
+ struct folio *folio = page_folio(page);
- r = __tdx_reclaim_page(page);
- if (!r)
- tdx_clear_page(page);
- return r;
+ return tdx_reclaim_folio(folio, folio_page_idx(folio, page), 1, false);
}
+static int tdx_reclaim_page_noclear(struct page *page)
+{
+ struct folio *folio = page_folio(page);
+
+ return tdx_reclaim_folio(folio, folio_page_idx(folio, page), 1, true);
+}
/*
* Reclaim the TD control page(s) which are crypto-protected by TDX guest's
@@ -587,7 +594,7 @@ static void tdx_reclaim_td_control_pages(struct kvm *kvm)
if (!kvm_tdx->td.tdr_page)
return;
- if (__tdx_reclaim_page(kvm_tdx->td.tdr_page))
+ if (tdx_reclaim_page_noclear(kvm_tdx->td.tdr_page))
return;
/*
@@ -1932,11 +1939,13 @@ static int tdx_sept_remove_private_spte(struct kvm *kvm, gfn_t gfn,
enum pg_level level, kvm_pfn_t pfn)
{
struct page *page = pfn_to_page(pfn);
+ struct folio *folio = page_folio(page);
int ret;
if (!is_hkid_assigned(to_kvm_tdx(kvm))) {
KVM_BUG_ON(!kvm->vm_dead, kvm);
- ret = tdx_reclaim_page(page);
+ ret = tdx_reclaim_folio(folio, folio_page_idx(folio, page),
+ KVM_PAGES_PER_HPAGE(level), false);
if (!ret) {
tdx_pamt_put(page, level);
tdx_unpin(kvm, page);
diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index 64219c659844..9ed585bde062 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -1966,19 +1966,27 @@ EXPORT_SYMBOL_GPL(tdh_vp_init);
* So despite the names, they must be interpted specially as described by the spec. Return
* them only for error reporting purposes.
*/
-u64 tdh_phymem_page_reclaim(struct page *page, u64 *tdx_pt, u64 *tdx_owner, u64 *tdx_size)
+u64 tdh_phymem_page_reclaim(struct folio *folio, unsigned long start_idx, unsigned long npages,
+ u64 *tdx_pt, u64 *tdx_owner, u64 *tdx_size)
{
+ struct page *start = folio_page(folio, start_idx);
struct tdx_module_args args = {
- .rcx = page_to_phys(page),
+ .rcx = page_to_phys(start),
};
u64 ret;
+ if (start_idx + npages > folio_nr_pages(folio))
+ return TDX_OPERAND_INVALID;
+
ret = seamcall_ret(TDH_PHYMEM_PAGE_RECLAIM, &args);
*tdx_pt = args.rcx;
*tdx_owner = args.rdx;
*tdx_size = args.r8;
+ if (npages != (1 << (*tdx_size) * PTE_SHIFT))
+ return TDX_SW_ERROR;
+
return ret;
}
EXPORT_SYMBOL_GPL(tdh_phymem_page_reclaim);
--
2.43.2
* [RFC PATCH v2 06/23] KVM: TDX: Do not hold page refcount on private guest pages
2025-08-07 9:39 [RFC PATCH v2 00/23] KVM: TDX huge page support for private memory Yan Zhao
` (4 preceding siblings ...)
2025-08-07 9:42 ` [RFC PATCH v2 05/23] x86/tdx: Enhance tdh_phymem_page_reclaim() to support " Yan Zhao
@ 2025-08-07 9:42 ` Yan Zhao
2025-08-07 9:42 ` [RFC PATCH v2 07/23] KVM: x86/mmu: Disallow page merging (huge page adjustment) for mirror root Yan Zhao
` (16 subsequent siblings)
22 siblings, 0 replies; 43+ messages in thread
From: Yan Zhao @ 2025-08-07 9:42 UTC (permalink / raw)
To: pbonzini, seanjc
Cc: linux-kernel, kvm, x86, rick.p.edgecombe, dave.hansen, kas, tabba,
ackerleytng, quic_eberman, michael.roth, david, vannapurve,
vbabka, thomas.lendacky, pgonda, zhiquan1.li, fan.du, jun.miao,
ira.weiny, isaku.yamahata, xiaoyao.li, binbin.wu, chao.p.peng,
yan.y.zhao
To enable guest_memfd to support in-place conversion between shared and
private memory [1], TDX is required not to hold a refcount on the private
pages allocated from guest_memfd.
Because a folio has only a single refcount and guest_memfd must reliably
detect unexpected references when converting any shared part to private,
guest_memfd [1] does not permit shared memory to be huge [2]. Consequently,
it must split private huge pages into 4KB shared pages. However, since
guest_memfd cannot distinguish speculative/transient refcounts from an
intentional refcount held by TDX on private pages [3], failing to release
the private page refcount in TDX could cause guest_memfd to wait
indefinitely for the refcount to drop before splitting.
Under normal conditions, not holding an extra page refcount in TDX is safe
because guest_memfd ensures pages are retained until its invalidation
notification to the KVM MMU is completed. However, if there are bugs in KVM
or the TDX module, not holding an extra refcount when a page is mapped in
the S-EPT could result in a page being released from guest_memfd while
still mapped in the S-EPT.
Several approaches were considered to address this issue, including
- Attempting to modify the KVM unmap operation to return a failure, which
was deemed too complex and potentially incorrect [4].
- Increasing the folio reference count only upon S-EPT zapping failure [5].
- Using page flags or page_ext to indicate a page is still used by TDX [6],
which does not work for HVO (HugeTLB Vmemmap Optimization).
- Setting the HWPOISON bit or leveraging folio_set_hugetlb_hwpoison() [7].
Given the complexity or unsuitability of these approaches, and the fact
that S-EPT zapping failures are currently only possible when there are bugs
in the KVM or TDX module, which is very rare in a production kernel, the
straightforward approach of simply not holding the page reference count in
TDX was chosen [8].
When S-EPT zapping errors occur, KVM_BUG_ON() is invoked to kick off all
vCPUs and mark the VM as dead. Although there is a potential window in
which a private page still mapped in the S-EPT could be reallocated and
used outside the VM, the loud warning from KVM_BUG_ON() should provide
sufficient debug information. To be robust against such bugs, the user can
enable panic_on_warn as usual (e.g., via the panic_on_warn=1 boot
parameter).
Link: https://lore.kernel.org/all/cover.1747264138.git.ackerleytng@google.com [1]
Link: https://youtu.be/UnBKahkAon4 [2]
Link: https://lore.kernel.org/all/CAGtprH_ypohFy9TOJ8Emm_roT4XbQUtLKZNFcM6Fr+fhTFkE0Q@mail.gmail.com [3]
Link: https://lore.kernel.org/all/aEEEJbTzlncbRaRA@yzhao56-desk.sh.intel.com [4]
Link: https://lore.kernel.org/all/aE%2Fq9VKkmaCcuwpU@yzhao56-desk.sh.intel.com [5]
Link: https://lore.kernel.org/all/aFkeBtuNBN1RrDAJ@yzhao56-desk.sh.intel.com [6]
Link: https://lore.kernel.org/all/diqzy0tikran.fsf@ackerleytng-ctop.c.googlers.com [7]
Link: https://lore.kernel.org/all/53ea5239f8ef9d8df9af593647243c10435fd219.camel@intel.com [8]
Suggested-by: Vishal Annapurve <vannapurve@google.com>
Suggested-by: Ackerley Tng <ackerleytng@google.com>
Suggested-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
---
RFC v2:
- new in RFC v2.
- Rebased on DPAMT and shutdown optimization.
---
arch/x86/kvm/vmx/tdx.c | 28 ++++------------------------
1 file changed, 4 insertions(+), 24 deletions(-)
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index facfe589e006..376287a2ddf4 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1600,11 +1600,6 @@ void tdx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa, int pgd_level)
td_vmcs_write64(to_tdx(vcpu), SHARED_EPT_POINTER, root_hpa);
}
-static void tdx_unpin(struct kvm *kvm, struct page *page)
-{
- put_page(page);
-}
-
static int tdx_mem_page_aug(struct kvm *kvm, gfn_t gfn,
enum pg_level level, struct page *page)
{
@@ -1617,14 +1612,11 @@ static int tdx_mem_page_aug(struct kvm *kvm, gfn_t gfn,
err = tdh_mem_page_aug(&kvm_tdx->td, gpa, tdx_level, folio,
folio_page_idx(folio, page), &entry, &level_state);
- if (unlikely(tdx_operand_busy(err))) {
- tdx_unpin(kvm, page);
+ if (unlikely(tdx_operand_busy(err)))
return -EBUSY;
- }
if (KVM_BUG_ON(err, kvm)) {
pr_tdx_error_2(TDH_MEM_PAGE_AUG, err, entry, level_state);
- tdx_unpin(kvm, page);
return -EIO;
}
@@ -1679,16 +1671,6 @@ static int tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn,
if (KVM_BUG_ON(level != PG_LEVEL_4K, kvm))
return -EINVAL;
- /*
- * Because guest_memfd doesn't support page migration with
- * a_ops->migrate_folio (yet), no callback is triggered for KVM on page
- * migration. Until guest_memfd supports page migration, prevent page
- * migration.
- * TODO: Once guest_memfd introduces callback on page migration,
- * implement it and remove get_page/put_page().
- */
- get_page(page);
-
/*
* Read 'pre_fault_allowed' before 'kvm_tdx->state'; see matching
* barrier in tdx_td_finalize().
@@ -1755,7 +1737,6 @@ static int tdx_sept_drop_private_spte(struct kvm *kvm, gfn_t gfn,
}
tdx_clear_folio(folio, folio_page_idx(folio, page), KVM_PAGES_PER_HPAGE(level));
tdx_pamt_put(page, level);
- tdx_unpin(kvm, page);
return 0;
}
@@ -1845,7 +1826,6 @@ static int tdx_sept_zap_private_spte(struct kvm *kvm, gfn_t gfn,
!KVM_BUG_ON(!atomic64_read(&kvm_tdx->nr_premapped), kvm)) {
atomic64_dec(&kvm_tdx->nr_premapped);
tdx_pamt_put(page, level);
- tdx_unpin(kvm, page);
return 0;
}
@@ -1944,12 +1924,12 @@ static int tdx_sept_remove_private_spte(struct kvm *kvm, gfn_t gfn,
if (!is_hkid_assigned(to_kvm_tdx(kvm))) {
KVM_BUG_ON(!kvm->vm_dead, kvm);
+
ret = tdx_reclaim_folio(folio, folio_page_idx(folio, page),
KVM_PAGES_PER_HPAGE(level), false);
- if (!ret) {
+ if (!ret)
tdx_pamt_put(page, level);
- tdx_unpin(kvm, page);
- }
+
return ret;
}
--
2.43.2
* [RFC PATCH v2 07/23] KVM: x86/mmu: Disallow page merging (huge page adjustment) for mirror root
2025-08-07 9:39 [RFC PATCH v2 00/23] KVM: TDX huge page support for private memory Yan Zhao
` (5 preceding siblings ...)
2025-08-07 9:42 ` [RFC PATCH v2 06/23] KVM: TDX: Do not hold page refcount on private guest pages Yan Zhao
@ 2025-08-07 9:42 ` Yan Zhao
2025-08-07 9:43 ` [RFC PATCH v2 08/23] KVM: x86/tdp_mmu: Alloc external_spt page for mirror page table splitting Yan Zhao
` (15 subsequent siblings)
22 siblings, 0 replies; 43+ messages in thread
From: Yan Zhao @ 2025-08-07 9:42 UTC (permalink / raw)
To: pbonzini, seanjc
Cc: linux-kernel, kvm, x86, rick.p.edgecombe, dave.hansen, kas, tabba,
ackerleytng, quic_eberman, michael.roth, david, vannapurve,
vbabka, thomas.lendacky, pgonda, zhiquan1.li, fan.du, jun.miao,
ira.weiny, isaku.yamahata, xiaoyao.li, binbin.wu, chao.p.peng,
yan.y.zhao
From: "Edgecombe, Rick P" <rick.p.edgecombe@intel.com>
Disallow page merging (huge page adjustment) for the mirror root by
utilizing disallowed_hugepage_adjust().
To avoid littering the generic MMU code, make the mirror root check
asymmetric with the NX huge page check:
Invoke disallowed_hugepage_adjust() in kvm_tdp_mmu_map() when necessary,
specifically when KVM has mirrored TDP or the NX huge page workaround is
enabled.
Check and reduce the goal_level of a fault internally in
disallowed_hugepage_adjust() when the fault is for a mirror root and
there's a shadow-present non-leaf entry at the original goal_level.
Signed-off-by: Edgecombe, Rick P <rick.p.edgecombe@intel.com>
Co-developed-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
---
RFC v2:
- Check is_mirror_sp() in disallowed_hugepage_adjust() instead of passing
in an is_mirror arg. (Rick)
- Check kvm_has_mirrored_tdp() in kvm_tdp_mmu_map() to determine whether
to invoke disallowed_hugepage_adjust(). (Rick)
RFC v1:
- new patch
---
arch/x86/kvm/mmu/mmu.c | 3 ++-
arch/x86/kvm/mmu/tdp_mmu.c | 4 +++-
2 files changed, 5 insertions(+), 2 deletions(-)
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 3f76415cec71..9182192daa3a 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3412,7 +3412,8 @@ void disallowed_hugepage_adjust(struct kvm_page_fault *fault, u64 spte, int cur_
cur_level == fault->goal_level &&
is_shadow_present_pte(spte) &&
!is_large_pte(spte) &&
- spte_to_child_sp(spte)->nx_huge_page_disallowed) {
+ ((spte_to_child_sp(spte)->nx_huge_page_disallowed) ||
+ is_mirror_sp(spte_to_child_sp(spte)))) {
/*
* A small SPTE exists for this pfn, but FNAME(fetch),
* direct_map(), or kvm_tdp_mmu_map() would like to create a
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index bb95c95f6531..f9a054754544 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1243,6 +1243,8 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
struct tdp_iter iter;
struct kvm_mmu_page *sp;
int ret = RET_PF_RETRY;
+ bool hugepage_adjust_disallowed = fault->nx_huge_page_workaround_enabled ||
+ kvm_has_mirrored_tdp(kvm);
kvm_mmu_hugepage_adjust(vcpu, fault);
@@ -1253,7 +1255,7 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
for_each_tdp_pte(iter, kvm, root, fault->gfn, fault->gfn + 1) {
int r;
- if (fault->nx_huge_page_workaround_enabled)
+ if (hugepage_adjust_disallowed)
disallowed_hugepage_adjust(fault, iter.old_spte, iter.level);
/*
--
2.43.2
* [RFC PATCH v2 08/23] KVM: x86/tdp_mmu: Alloc external_spt page for mirror page table splitting
2025-08-07 9:39 [RFC PATCH v2 00/23] KVM: TDX huge page support for private memory Yan Zhao
` (6 preceding siblings ...)
2025-08-07 9:42 ` [RFC PATCH v2 07/23] KVM: x86/mmu: Disallow page merging (huge page adjustment) for mirror root Yan Zhao
@ 2025-08-07 9:43 ` Yan Zhao
2025-08-07 9:43 ` [RFC PATCH v2 09/23] KVM: x86/tdp_mmu: Add split_external_spt hook called during write mmu_lock Yan Zhao
` (14 subsequent siblings)
22 siblings, 0 replies; 43+ messages in thread
From: Yan Zhao @ 2025-08-07 9:43 UTC (permalink / raw)
To: pbonzini, seanjc
Cc: linux-kernel, kvm, x86, rick.p.edgecombe, dave.hansen, kas, tabba,
ackerleytng, quic_eberman, michael.roth, david, vannapurve,
vbabka, thomas.lendacky, pgonda, zhiquan1.li, fan.du, jun.miao,
ira.weiny, isaku.yamahata, xiaoyao.li, binbin.wu, chao.p.peng,
yan.y.zhao
From: Isaku Yamahata <isaku.yamahata@intel.com>
Enhance tdp_mmu_alloc_sp_split() to allocate a page for sp->external_spt,
i.e., the external page table page, for splitting the mirror page table.
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
---
RFC v2:
- NO change.
RFC v1:
- Rebased and simplified the code.
---
arch/x86/kvm/mmu/tdp_mmu.c | 15 +++++++++++++--
1 file changed, 13 insertions(+), 2 deletions(-)
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index f9a054754544..46b9f276bb6d 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -324,6 +324,8 @@ static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
u64 old_spte, u64 new_spte, int level,
bool shared);
+static struct kvm_mmu_page *tdp_mmu_alloc_sp_for_split(bool mirror);
+
static void tdp_account_mmu_page(struct kvm *kvm, struct kvm_mmu_page *sp)
{
kvm_account_pgtable_pages((void *)sp->spt, +1);
@@ -1475,7 +1477,7 @@ bool kvm_tdp_mmu_wrprot_slot(struct kvm *kvm,
return spte_set;
}
-static struct kvm_mmu_page *tdp_mmu_alloc_sp_for_split(void)
+static struct kvm_mmu_page *tdp_mmu_alloc_sp_for_split(bool mirror)
{
struct kvm_mmu_page *sp;
@@ -1489,6 +1491,15 @@ static struct kvm_mmu_page *tdp_mmu_alloc_sp_for_split(void)
return NULL;
}
+ if (mirror) {
+ sp->external_spt = (void *)get_zeroed_page(GFP_KERNEL_ACCOUNT);
+ if (!sp->external_spt) {
+ free_page((unsigned long)sp->spt);
+ kmem_cache_free(mmu_page_header_cache, sp);
+ return NULL;
+ }
+ }
+
return sp;
}
@@ -1568,7 +1579,7 @@ static int tdp_mmu_split_huge_pages_root(struct kvm *kvm,
else
write_unlock(&kvm->mmu_lock);
- sp = tdp_mmu_alloc_sp_for_split();
+ sp = tdp_mmu_alloc_sp_for_split(is_mirror_sp(root));
if (shared)
read_lock(&kvm->mmu_lock);
--
2.43.2
* [RFC PATCH v2 09/23] KVM: x86/tdp_mmu: Add split_external_spt hook called during write mmu_lock
2025-08-07 9:39 [RFC PATCH v2 00/23] KVM: TDX huge page support for private memory Yan Zhao
` (7 preceding siblings ...)
2025-08-07 9:43 ` [RFC PATCH v2 08/23] KVM: x86/tdp_mmu: Alloc external_spt page for mirror page table splitting Yan Zhao
@ 2025-08-07 9:43 ` Yan Zhao
2025-08-07 9:43 ` [RFC PATCH v2 10/23] KVM: TDX: Enable huge page splitting under write kvm->mmu_lock Yan Zhao
` (13 subsequent siblings)
22 siblings, 0 replies; 43+ messages in thread
From: Yan Zhao @ 2025-08-07 9:43 UTC (permalink / raw)
To: pbonzini, seanjc
Cc: linux-kernel, kvm, x86, rick.p.edgecombe, dave.hansen, kas, tabba,
ackerleytng, quic_eberman, michael.roth, david, vannapurve,
vbabka, thomas.lendacky, pgonda, zhiquan1.li, fan.du, jun.miao,
ira.weiny, isaku.yamahata, xiaoyao.li, binbin.wu, chao.p.peng,
yan.y.zhao
Introduce the split_external_spt hook and call it within tdp_mmu_set_spte()
for the mirror page table.
tdp_mmu_set_spte() is invoked for SPTE transitions under write mmu_lock.
For the mirror page table, in addition to the valid transitions from a
shadow-present entry to a !shadow-present entry, introduce a new valid
transition case for splitting and propagate the transition to the external
page table via the hook split_external_spt.
Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
---
RFC v2:
- Removed the KVM_BUG_ON() in split_external_spt(). (Rick)
- Add a comment for the KVM_BUG_ON() in tdp_mmu_set_spte(). (Rick)
- Use kvm_x86_call() instead of static_call(). (Binbin)
RFC v1:
- Split patch.
- Dropped invoking hook zap_private_spte and kvm_flush_remote_tlbs() in KVM
MMU core.
---
arch/x86/include/asm/kvm-x86-ops.h | 1 +
arch/x86/include/asm/kvm_host.h | 4 ++++
arch/x86/kvm/mmu/tdp_mmu.c | 29 +++++++++++++++++++++++++----
3 files changed, 30 insertions(+), 4 deletions(-)
diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
index 18a5c3119e1a..7653a45ad5b2 100644
--- a/arch/x86/include/asm/kvm-x86-ops.h
+++ b/arch/x86/include/asm/kvm-x86-ops.h
@@ -98,6 +98,7 @@ KVM_X86_OP_OPTIONAL(link_external_spt)
KVM_X86_OP_OPTIONAL(set_external_spte)
KVM_X86_OP_OPTIONAL(free_external_spt)
KVM_X86_OP_OPTIONAL(remove_external_spte)
+KVM_X86_OP_OPTIONAL(split_external_spt)
KVM_X86_OP(has_wbinvd_exit)
KVM_X86_OP(get_l2_tsc_offset)
KVM_X86_OP(get_l2_tsc_multiplier)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 823d1aeef2a8..e431ce0e3180 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1839,6 +1839,10 @@ struct kvm_x86_ops {
int (*remove_external_spte)(struct kvm *kvm, gfn_t gfn, enum pg_level level,
kvm_pfn_t pfn_for_gfn);
+ /* Split the external page table into smaller page tables */
+ int (*split_external_spt)(struct kvm *kvm, gfn_t gfn, enum pg_level level,
+ void *external_spt);
+
bool (*has_wbinvd_exit)(void);
u64 (*get_l2_tsc_offset)(struct kvm_vcpu *vcpu);
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 46b9f276bb6d..a2c6e6e4773f 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -325,6 +325,7 @@ static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
bool shared);
static struct kvm_mmu_page *tdp_mmu_alloc_sp_for_split(bool mirror);
+static void *get_external_spt(gfn_t gfn, u64 new_spte, int level);
static void tdp_account_mmu_page(struct kvm *kvm, struct kvm_mmu_page *sp)
{
@@ -384,6 +385,18 @@ static void remove_external_spte(struct kvm *kvm, gfn_t gfn, u64 old_spte,
KVM_BUG_ON(ret, kvm);
}
+static int split_external_spt(struct kvm *kvm, gfn_t gfn, u64 old_spte,
+ u64 new_spte, int level)
+{
+ void *external_spt = get_external_spt(gfn, new_spte, level);
+ int ret;
+
+ KVM_BUG_ON(!external_spt, kvm);
+
+ ret = kvm_x86_call(split_external_spt)(kvm, gfn, level, external_spt);
+
+ return ret;
+}
/**
* handle_removed_pt() - handle a page table removed from the TDP structure
*
@@ -765,12 +778,20 @@ static u64 tdp_mmu_set_spte(struct kvm *kvm, int as_id, tdp_ptep_t sptep,
handle_changed_spte(kvm, as_id, gfn, old_spte, new_spte, level, false);
/*
- * Users that do non-atomic setting of PTEs don't operate on mirror
- * roots, so don't handle it and bug the VM if it's seen.
+ * Propagate changes of SPTE to the external page table under write
+ * mmu_lock.
+ * Current valid transitions:
+ * - present leaf to !present.
+ * - present non-leaf to !present.
+ * - present leaf to present non-leaf (splitting)
*/
if (is_mirror_sptep(sptep)) {
- KVM_BUG_ON(is_shadow_present_pte(new_spte), kvm);
- remove_external_spte(kvm, gfn, old_spte, level);
+ if (!is_shadow_present_pte(new_spte))
+ remove_external_spte(kvm, gfn, old_spte, level);
+ else if (is_last_spte(old_spte, level) && !is_last_spte(new_spte, level))
+ split_external_spt(kvm, gfn, old_spte, new_spte, level);
+ else
+ KVM_BUG_ON(1, kvm);
}
return old_spte;
--
2.43.2
* [RFC PATCH v2 10/23] KVM: TDX: Enable huge page splitting under write kvm->mmu_lock
2025-08-07 9:39 [RFC PATCH v2 00/23] KVM: TDX huge page support for private memory Yan Zhao
` (8 preceding siblings ...)
2025-08-07 9:43 ` [RFC PATCH v2 09/23] KVM: x86/tdp_mmu: Add split_external_spt hook called during write mmu_lock Yan Zhao
@ 2025-08-07 9:43 ` Yan Zhao
2025-08-07 9:43 ` [RFC PATCH v2 11/23] KVM: x86: Reject splitting huge pages under shared mmu_lock for mirror root Yan Zhao
` (12 subsequent siblings)
22 siblings, 0 replies; 43+ messages in thread
From: Yan Zhao @ 2025-08-07 9:43 UTC (permalink / raw)
To: pbonzini, seanjc
Cc: linux-kernel, kvm, x86, rick.p.edgecombe, dave.hansen, kas, tabba,
ackerleytng, quic_eberman, michael.roth, david, vannapurve,
vbabka, thomas.lendacky, pgonda, zhiquan1.li, fan.du, jun.miao,
ira.weiny, isaku.yamahata, xiaoyao.li, binbin.wu, chao.p.peng,
yan.y.zhao
Implement the split_external_spt hook to enable huge page splitting for
TDX when kvm->mmu_lock is held for writing.
Invoke tdh_mem_range_block(), tdh_mem_track(), kicking off vCPUs, and
tdh_mem_page_demote() in sequence. All operations are performed with
kvm->mmu_lock held for writing, similar to the page removal path.
Even with kvm->mmu_lock held for writing, tdh_mem_page_demote() may still
contend with tdh_vp_enter() and potentially with the guest's S-EPT entry
operations. Therefore, kick off other vCPUs and prevent tdh_vp_enter()
from being called on them to ensure success on the second attempt. Use
KVM_BUG_ON() for any other unexpected errors.
Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
---
RFC v2:
- Split out the code to handle the error TDX_INTERRUPTED_RESTARTABLE.
- Rebased to 6.16.0-rc6 (the way of defining TDX hook changes).
RFC v1:
- Split patch for exclusive mmu_lock only,
- Invoke tdx_sept_zap_private_spte() and tdx_track() for splitting.
- Handled busy error of tdh_mem_page_demote() by kicking off vCPUs.
---
arch/x86/kvm/vmx/tdx.c | 45 ++++++++++++++++++++++++++++++++++++++++++
1 file changed, 45 insertions(+)
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 376287a2ddf4..8a60ba5b6595 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1915,6 +1915,50 @@ static int tdx_sept_free_private_spt(struct kvm *kvm, gfn_t gfn,
return 0;
}
+static int tdx_spte_demote_private_spte(struct kvm *kvm, gfn_t gfn,
+ enum pg_level level, struct page *page)
+{
+ int tdx_level = pg_level_to_tdx_sept_level(level);
+ struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
+ gpa_t gpa = gfn_to_gpa(gfn);
+ u64 err, entry, level_state;
+
+ err = tdh_mem_page_demote(&kvm_tdx->td, gpa, tdx_level, page,
+ &entry, &level_state);
+
+ if (unlikely(tdx_operand_busy(err))) {
+ tdx_no_vcpus_enter_start(kvm);
+ err = tdh_mem_page_demote(&kvm_tdx->td, gpa, tdx_level, page,
+ &entry, &level_state);
+ tdx_no_vcpus_enter_stop(kvm);
+ }
+
+ if (KVM_BUG_ON(err, kvm)) {
+ pr_tdx_error_2(TDH_MEM_PAGE_DEMOTE, err, entry, level_state);
+ return -EIO;
+ }
+ return 0;
+}
+
+static int tdx_sept_split_private_spt(struct kvm *kvm, gfn_t gfn, enum pg_level level,
+ void *private_spt)
+{
+ struct page *page = virt_to_page(private_spt);
+ int ret;
+
+ if (KVM_BUG_ON(to_kvm_tdx(kvm)->state != TD_STATE_RUNNABLE ||
+ level != PG_LEVEL_2M, kvm))
+ return -EINVAL;
+
+ ret = tdx_sept_zap_private_spte(kvm, gfn, level, page);
+ if (ret <= 0)
+ return ret;
+
+ tdx_track(kvm);
+
+ return tdx_spte_demote_private_spte(kvm, gfn, level, page);
+}
+
static int tdx_sept_remove_private_spte(struct kvm *kvm, gfn_t gfn,
enum pg_level level, kvm_pfn_t pfn)
{
@@ -3668,5 +3712,6 @@ void __init tdx_hardware_setup(void)
vt_x86_ops.set_external_spte = tdx_sept_set_private_spte;
vt_x86_ops.free_external_spt = tdx_sept_free_private_spt;
vt_x86_ops.remove_external_spte = tdx_sept_remove_private_spte;
+ vt_x86_ops.split_external_spt = tdx_sept_split_private_spt;
vt_x86_ops.protected_apic_has_interrupt = tdx_protected_apic_has_interrupt;
}
--
2.43.2
* [RFC PATCH v2 11/23] KVM: x86: Reject splitting huge pages under shared mmu_lock for mirror root
2025-08-07 9:39 [RFC PATCH v2 00/23] KVM: TDX huge page support for private memory Yan Zhao
` (9 preceding siblings ...)
2025-08-07 9:43 ` [RFC PATCH v2 10/23] KVM: TDX: Enable huge page splitting under write kvm->mmu_lock Yan Zhao
@ 2025-08-07 9:43 ` Yan Zhao
2025-09-03 3:30 ` Binbin Wu
2025-08-07 9:43 ` [RFC PATCH v2 12/23] KVM: x86/mmu: Introduce kvm_split_cross_boundary_leafs() Yan Zhao
` (11 subsequent siblings)
22 siblings, 1 reply; 43+ messages in thread
From: Yan Zhao @ 2025-08-07 9:43 UTC (permalink / raw)
To: pbonzini, seanjc
Cc: linux-kernel, kvm, x86, rick.p.edgecombe, dave.hansen, kas, tabba,
ackerleytng, quic_eberman, michael.roth, david, vannapurve,
vbabka, thomas.lendacky, pgonda, zhiquan1.li, fan.du, jun.miao,
ira.weiny, isaku.yamahata, xiaoyao.li, binbin.wu, chao.p.peng,
yan.y.zhao
While removing the KVM_BUG_ON() for the mirror root before invoking
tdp_mmu_split_huge_page() in the fault path, update the hook
split_external_spt to pass in shared mmu_lock info and invoke the hook in
set_external_spte_present() when splitting is detected. Reject the
splitting in TDX if it is requested under the shared mmu_lock.
TDX requires different handling for splitting under shared or exclusive
mmu_lock.
Under a shared mmu_lock, TDX cannot kick off all vCPUs to avoid BUSY error
from tdh_mem_page_demote(). As the current TDX module requires
tdh_mem_range_block() to be invoked before each tdh_mem_page_demote(), if a
BUSY error occurs, TDX must call tdh_mem_range_unblock() before returning
the error to the KVM MMU core to roll back the old SPTE and retry. However,
tdh_mem_range_unblock() may also fail due to contention.
Reject splitting huge pages under shared mmu_lock for mirror root in TDX
rather than KVM_BUG_ON() in KVM MMU core to allow for future real
implementation of demote under shared mmu_lock once non-blocking demote is
available.
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
---
RFC v2:
- WARN_ON_ONCE() and return error in tdx_sept_split_private_spt() if it's
invoked under shared mmu_lock. (rather than increase the next fault's
max_level in current vCPU via tdx->violation_gfn_start/end and
tdx->violation_request_level).
- TODO: Perform the real implementation of demote under shared mmu_lock
when new version of TDX module supporting non-blocking demote is
available.
RFC v1:
- New patch.
---
arch/x86/include/asm/kvm_host.h | 2 +-
arch/x86/kvm/mmu/tdp_mmu.c | 45 ++++++++++++++++++++-------------
arch/x86/kvm/vmx/tdx.c | 8 +++++-
3 files changed, 36 insertions(+), 19 deletions(-)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index e431ce0e3180..6cb5b422dd1d 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1841,7 +1841,7 @@ struct kvm_x86_ops {
/* Split the external page table into smaller page tables */
int (*split_external_spt)(struct kvm *kvm, gfn_t gfn, enum pg_level level,
- void *external_spt);
+ void *external_spt, bool mmu_lock_shared);
bool (*has_wbinvd_exit)(void);
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index a2c6e6e4773f..ce49cc850ed5 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -386,15 +386,14 @@ static void remove_external_spte(struct kvm *kvm, gfn_t gfn, u64 old_spte,
}
static int split_external_spt(struct kvm *kvm, gfn_t gfn, u64 old_spte,
- u64 new_spte, int level)
+ u64 new_spte, int level, bool shared)
{
void *external_spt = get_external_spt(gfn, new_spte, level);
int ret;
KVM_BUG_ON(!external_spt, kvm);
- ret = kvm_x86_call(split_external_spt)(kvm, gfn, level, external_spt);
-
+ ret = kvm_x86_call(split_external_spt)(kvm, gfn, level, external_spt, shared);
return ret;
}
/**
@@ -533,11 +532,19 @@ static int __must_check set_external_spte_present(struct kvm *kvm, tdp_ptep_t sp
{
bool was_present = is_shadow_present_pte(old_spte);
bool is_present = is_shadow_present_pte(new_spte);
+ bool was_leaf = was_present && is_last_spte(old_spte, level);
bool is_leaf = is_present && is_last_spte(new_spte, level);
kvm_pfn_t new_pfn = spte_to_pfn(new_spte);
int ret = 0;
- KVM_BUG_ON(was_present, kvm);
+ /*
+ * Caller ensures new_spte must be present.
+ * Current valid transitions:
+ * - leaf to non-leaf (demote)
+ * - !present to present leaf
+ * - !present to present non-leaf
+ */
+ KVM_BUG_ON(!(!was_present || (was_leaf && !is_leaf)), kvm);
lockdep_assert_held(&kvm->mmu_lock);
/*
@@ -548,18 +555,24 @@ static int __must_check set_external_spte_present(struct kvm *kvm, tdp_ptep_t sp
if (!try_cmpxchg64(rcu_dereference(sptep), &old_spte, FROZEN_SPTE))
return -EBUSY;
- /*
- * Use different call to either set up middle level
- * external page table, or leaf.
- */
- if (is_leaf) {
- ret = kvm_x86_call(set_external_spte)(kvm, gfn, level, new_pfn);
- } else {
- void *external_spt = get_external_spt(gfn, new_spte, level);
+ if (!was_present) {
+ /*
+ * Use different call to either set up middle level
+ * external page table, or leaf.
+ */
+ if (is_leaf) {
+ ret = kvm_x86_call(set_external_spte)(kvm, gfn, level, new_pfn);
+ } else {
+ void *external_spt = get_external_spt(gfn, new_spte, level);
- KVM_BUG_ON(!external_spt, kvm);
- ret = kvm_x86_call(link_external_spt)(kvm, gfn, level, external_spt);
+ KVM_BUG_ON(!external_spt, kvm);
+ ret = kvm_x86_call(link_external_spt)(kvm, gfn, level, external_spt);
+ }
+ } else if (was_leaf && !is_leaf) {
+ /* demote */
+ ret = split_external_spt(kvm, gfn, old_spte, new_spte, level, true);
}
+
if (ret)
__kvm_tdp_mmu_write_spte(sptep, old_spte);
else
@@ -789,7 +802,7 @@ static u64 tdp_mmu_set_spte(struct kvm *kvm, int as_id, tdp_ptep_t sptep,
if (!is_shadow_present_pte(new_spte))
remove_external_spte(kvm, gfn, old_spte, level);
else if (is_last_spte(old_spte, level) && !is_last_spte(new_spte, level))
- split_external_spt(kvm, gfn, old_spte, new_spte, level);
+ split_external_spt(kvm, gfn, old_spte, new_spte, level, false);
else
KVM_BUG_ON(1, kvm);
}
@@ -1308,8 +1321,6 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
sp->nx_huge_page_disallowed = fault->huge_page_disallowed;
if (is_shadow_present_pte(iter.old_spte)) {
- /* Don't support large page for mirrored roots (TDX) */
- KVM_BUG_ON(is_mirror_sptep(iter.sptep), vcpu->kvm);
r = tdp_mmu_split_huge_page(kvm, &iter, sp, true);
} else {
r = tdp_mmu_link_sp(kvm, &iter, sp, true);
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 8a60ba5b6595..035d81275be4 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1941,7 +1941,7 @@ static int tdx_spte_demote_private_spte(struct kvm *kvm, gfn_t gfn,
}
static int tdx_sept_split_private_spt(struct kvm *kvm, gfn_t gfn, enum pg_level level,
- void *private_spt)
+ void *private_spt, bool mmu_lock_shared)
{
struct page *page = virt_to_page(private_spt);
int ret;
@@ -1950,6 +1950,12 @@ static int tdx_sept_split_private_spt(struct kvm *kvm, gfn_t gfn, enum pg_level
level != PG_LEVEL_2M, kvm))
return -EINVAL;
+ if (WARN_ON_ONCE(mmu_lock_shared)) {
+ pr_warn_once("Splitting of GFN %llx level %d under shared lock occurs when KVM does not support it yet\n",
+ gfn, level);
+ return -EOPNOTSUPP;
+ }
+
ret = tdx_sept_zap_private_spte(kvm, gfn, level, page);
if (ret <= 0)
return ret;
--
2.43.2
* Re: [RFC PATCH v2 11/23] KVM: x86: Reject splitting huge pages under shared mmu_lock for mirror root
2025-08-07 9:43 ` [RFC PATCH v2 11/23] KVM: x86: Reject splitting huge pages under shared mmu_lock for mirror root Yan Zhao
@ 2025-09-03 3:30 ` Binbin Wu
0 siblings, 0 replies; 43+ messages in thread
From: Binbin Wu @ 2025-09-03 3:30 UTC (permalink / raw)
To: Yan Zhao
Cc: pbonzini, seanjc, linux-kernel, kvm, x86, rick.p.edgecombe,
dave.hansen, kas, tabba, ackerleytng, quic_eberman, michael.roth,
david, vannapurve, vbabka, thomas.lendacky, pgonda, zhiquan1.li,
fan.du, jun.miao, ira.weiny, isaku.yamahata, xiaoyao.li,
chao.p.peng
On 8/7/2025 5:43 PM, Yan Zhao wrote:
> While removing the KVM_BUG_ON() for the mirror root before invoking
> tdp_mmu_split_huge_page() in the fault path, update the hook
> split_external_spt to pass in shared mmu_lock info and invoke the hook in
> set_external_spte_present() when splitting is detected. Reject the
> splitting in TDX if it is requested under the shared mmu_lock.
>
> TDX requires different handling for splitting under shared or exclusive
> mmu_lock.
>
> Under a shared mmu_lock, TDX cannot kick off all vCPUs to avoid BUSY error
> from tdh_mem_page_demote(). As the current TDX module requires
> tdh_mem_range_block() to be invoked before each tdh_mem_page_demote(), if a
> BUSY error occurs, TDX must call tdh_mem_range_unblock() before returning
> the error to the KVM MMU core to roll back the old SPTE and retry. However,
> tdh_mem_range_unblock() may also fail due to contention.
>
> Reject splitting huge pages under shared mmu_lock for mirror root in TDX
> rather than KVM_BUG_ON() in KVM MMU core to allow for future real
> implementation of demote under shared mmu_lock once non-blocking demote is
> available.
Prefer "blockless" used in the cover letter to non-blocking.
[...]
* [RFC PATCH v2 12/23] KVM: x86/mmu: Introduce kvm_split_cross_boundary_leafs()
2025-08-07 9:39 [RFC PATCH v2 00/23] KVM: TDX huge page support for private memory Yan Zhao
` (10 preceding siblings ...)
2025-08-07 9:43 ` [RFC PATCH v2 11/23] KVM: x86: Reject splitting huge pages under shared mmu_lock for mirror root Yan Zhao
@ 2025-08-07 9:43 ` Yan Zhao
2025-09-03 6:57 ` Binbin Wu
2025-08-07 9:44 ` [RFC PATCH v2 13/23] KVM: x86: Introduce hugepage_set_guest_inhibit() Yan Zhao
` (10 subsequent siblings)
22 siblings, 1 reply; 43+ messages in thread
From: Yan Zhao @ 2025-08-07 9:43 UTC (permalink / raw)
To: pbonzini, seanjc
Cc: linux-kernel, kvm, x86, rick.p.edgecombe, dave.hansen, kas, tabba,
ackerleytng, quic_eberman, michael.roth, david, vannapurve,
vbabka, thomas.lendacky, pgonda, zhiquan1.li, fan.du, jun.miao,
ira.weiny, isaku.yamahata, xiaoyao.li, binbin.wu, chao.p.peng,
yan.y.zhao
Introduce kvm_split_cross_boundary_leafs() to split huge leaf entries that
cross the boundary of a specified range.
Splitting huge leaf entries that cross the boundary is essential before
zapping the range in the mirror root. This ensures that the subsequent zap
operation does not affect any GFNs outside the specified range. This is
crucial for the mirror root, as the private page table requires the guest's
ACCEPT operation after a GFN faults back.
The core of kvm_split_cross_boundary_leafs() leverages the main logic from
tdp_mmu_split_huge_pages_root(). It traverses the specified root and splits
huge leaf entries if they cross the range boundary. When splitting is
necessary, kvm->mmu_lock is temporarily released for memory allocation,
which means returning -ENOMEM is possible.
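
For illustration, a caller that needs to zap a private GFN range could use
the new helper roughly as in the sketch below. This is only a sketch and
not part of the diff; the name example_zap_private_range() is made up, and
the actual users of the API come later in the series.

/*
 * Illustrative sketch: split huge leafs straddling the range boundary
 * before zapping, so that GFNs outside the range keep their mappings and
 * the guest does not need to re-ACCEPT them.
 */
static int example_zap_private_range(struct kvm *kvm, struct kvm_gfn_range *range)
{
	int r;

	/* The write mmu_lock is assumed to be held here, hence shared == false. */
	r = kvm_split_cross_boundary_leafs(kvm, range, false);
	if (r < 0)
		return r;	/* e.g. -ENOMEM or -EOPNOTSUPP */

	/* r == 1 means the splits above left a TLB flush pending. */
	if (kvm_unmap_gfn_range(kvm, range) || r)
		kvm_flush_remote_tlbs(kvm);

	return 0;
}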
Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
---
RFC v2:
- Rename the API to kvm_split_cross_boundary_leafs().
- Make the API to be usable for direct roots or under shared mmu_lock.
- Leverage the main logic from tdp_mmu_split_huge_pages_root(). (Rick)
RFC v1:
- Split patch.
- introduced API kvm_split_boundary_leafs(), refined the logic and
simplified the code.
---
arch/x86/kvm/mmu/mmu.c | 27 +++++++++++++++
arch/x86/kvm/mmu/tdp_mmu.c | 68 ++++++++++++++++++++++++++++++++++++--
arch/x86/kvm/mmu/tdp_mmu.h | 3 ++
include/linux/kvm_host.h | 2 ++
4 files changed, 97 insertions(+), 3 deletions(-)
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 9182192daa3a..13910ae05f76 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -1647,6 +1647,33 @@ static bool __kvm_rmap_zap_gfn_range(struct kvm *kvm,
start, end - 1, can_yield, true, flush);
}
+/*
+ * Split large leafs crossing the boundary of the specified range
+ *
+ * Return value:
+ * 0 : success, no flush is required;
+ * 1 : success, flush is required;
+ * <0: failure.
+ */
+int kvm_split_cross_boundary_leafs(struct kvm *kvm, struct kvm_gfn_range *range,
+ bool shared)
+{
+ int ret = 0;
+
+ lockdep_assert_once(kvm->mmu_invalidate_in_progress ||
+ lockdep_is_held(&kvm->slots_lock) ||
+ srcu_read_lock_held(&kvm->srcu));
+
+ if (!range->may_block)
+ return -EOPNOTSUPP;
+
+ if (tdp_mmu_enabled)
+ ret = kvm_tdp_mmu_gfn_range_split_cross_boundary_leafs(kvm, range, shared);
+
+ return ret;
+}
+EXPORT_SYMBOL_GPL(kvm_split_cross_boundary_leafs);
+
bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
{
bool flush = false;
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index ce49cc850ed5..62a09a9655c3 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1574,10 +1574,17 @@ static int tdp_mmu_split_huge_page(struct kvm *kvm, struct tdp_iter *iter,
return ret;
}
+static bool iter_cross_boundary(struct tdp_iter *iter, gfn_t start, gfn_t end)
+{
+ return !(iter->gfn >= start &&
+ (iter->gfn + KVM_PAGES_PER_HPAGE(iter->level)) <= end);
+}
+
static int tdp_mmu_split_huge_pages_root(struct kvm *kvm,
struct kvm_mmu_page *root,
gfn_t start, gfn_t end,
- int target_level, bool shared)
+ int target_level, bool shared,
+ bool only_cross_bounday, bool *flush)
{
struct kvm_mmu_page *sp = NULL;
struct tdp_iter iter;
@@ -1589,6 +1596,13 @@ static int tdp_mmu_split_huge_pages_root(struct kvm *kvm,
* level into one lower level. For example, if we encounter a 1GB page
* we split it into 512 2MB pages.
*
+ * When only_cross_bounday is true, just split huge pages above the
+ * target level into one lower level if the huge pages cross the start
+ * or end boundary.
+ *
+ * No need to update @flush for !only_cross_bounday cases, which rely
+ * on the callers to do the TLB flush in the end.
+ *
* Since the TDP iterator uses a pre-order traversal, we are guaranteed
* to visit an SPTE before ever visiting its children, which means we
* will correctly recursively split huge pages that are more than one
@@ -1597,12 +1611,19 @@ static int tdp_mmu_split_huge_pages_root(struct kvm *kvm,
*/
for_each_tdp_pte_min_level(iter, kvm, root, target_level + 1, start, end) {
retry:
- if (tdp_mmu_iter_cond_resched(kvm, &iter, false, shared))
+ if (tdp_mmu_iter_cond_resched(kvm, &iter, *flush, shared)) {
+ if (only_cross_bounday)
+ *flush = false;
continue;
+ }
if (!is_shadow_present_pte(iter.old_spte) || !is_large_pte(iter.old_spte))
continue;
+ if (only_cross_bounday &&
+ !iter_cross_boundary(&iter, start, end))
+ continue;
+
if (!sp) {
rcu_read_unlock();
@@ -1637,6 +1658,8 @@ static int tdp_mmu_split_huge_pages_root(struct kvm *kvm,
goto retry;
sp = NULL;
+ if (only_cross_bounday)
+ *flush = true;
}
rcu_read_unlock();
@@ -1663,10 +1686,12 @@ void kvm_tdp_mmu_try_split_huge_pages(struct kvm *kvm,
{
struct kvm_mmu_page *root;
int r = 0;
+ bool flush = false;
kvm_lockdep_assert_mmu_lock_held(kvm, shared);
for_each_valid_tdp_mmu_root_yield_safe(kvm, root, slot->as_id) {
- r = tdp_mmu_split_huge_pages_root(kvm, root, start, end, target_level, shared);
+ r = tdp_mmu_split_huge_pages_root(kvm, root, start, end, target_level,
+ shared, false, &flush);
if (r) {
kvm_tdp_mmu_put_root(kvm, root);
break;
@@ -1674,6 +1699,43 @@ void kvm_tdp_mmu_try_split_huge_pages(struct kvm *kvm,
}
}
+/*
+ * Split large leafs which cross the specified boundary
+ */
+static int tdp_mmu_split_cross_boundary_leafs(struct kvm *kvm, struct kvm_mmu_page *root,
+ gfn_t start, gfn_t end, bool shared,
+ bool *flush)
+{
+ return tdp_mmu_split_huge_pages_root(kvm, root, start, end, PG_LEVEL_4K,
+ shared, true, flush);
+}
+
+int kvm_tdp_mmu_gfn_range_split_cross_boundary_leafs(struct kvm *kvm,
+ struct kvm_gfn_range *range,
+ bool shared)
+{
+ enum kvm_tdp_mmu_root_types types;
+ struct kvm_mmu_page *root;
+ bool flush = false;
+ int ret;
+
+ kvm_lockdep_assert_mmu_lock_held(kvm, shared);
+ types = kvm_gfn_range_filter_to_root_types(kvm, range->attr_filter);
+
+ __for_each_tdp_mmu_root_yield_safe(kvm, root, range->slot->as_id, types) {
+ ret = tdp_mmu_split_cross_boundary_leafs(kvm, root, range->start,
+ range->end, shared, &flush);
+ if (ret < 0) {
+ if (flush)
+ kvm_flush_remote_tlbs(kvm);
+
+ kvm_tdp_mmu_put_root(kvm, root);
+ return ret;
+ }
+ }
+ return flush;
+}
+
static bool tdp_mmu_need_write_protect(struct kvm *kvm, struct kvm_mmu_page *sp)
{
/*
diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
index 52acf99d40a0..332d47cce714 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.h
+++ b/arch/x86/kvm/mmu/tdp_mmu.h
@@ -69,6 +69,9 @@ void kvm_tdp_mmu_zap_all(struct kvm *kvm);
void kvm_tdp_mmu_invalidate_roots(struct kvm *kvm,
enum kvm_tdp_mmu_root_types root_types);
void kvm_tdp_mmu_zap_invalidated_roots(struct kvm *kvm, bool shared);
+int kvm_tdp_mmu_gfn_range_split_cross_boundary_leafs(struct kvm *kvm,
+ struct kvm_gfn_range *range,
+ bool shared);
int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault);
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index fb79d2b7decd..6137b76341e1 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -273,6 +273,8 @@ struct kvm_gfn_range {
bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range);
bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
+int kvm_split_cross_boundary_leafs(struct kvm *kvm, struct kvm_gfn_range *range,
+ bool shared);
#endif
enum {
--
2.43.2
* Re: [RFC PATCH v2 12/23] KVM: x86/mmu: Introduce kvm_split_cross_boundary_leafs()
2025-08-07 9:43 ` [RFC PATCH v2 12/23] KVM: x86/mmu: Introduce kvm_split_cross_boundary_leafs() Yan Zhao
@ 2025-09-03 6:57 ` Binbin Wu
2025-09-03 9:44 ` Yan Zhao
0 siblings, 1 reply; 43+ messages in thread
From: Binbin Wu @ 2025-09-03 6:57 UTC (permalink / raw)
To: Yan Zhao
Cc: pbonzini, seanjc, linux-kernel, kvm, x86, rick.p.edgecombe,
dave.hansen, kas, tabba, ackerleytng, quic_eberman, michael.roth,
david, vannapurve, vbabka, thomas.lendacky, pgonda, zhiquan1.li,
fan.du, jun.miao, ira.weiny, isaku.yamahata, xiaoyao.li,
chao.p.peng
On 8/7/2025 5:43 PM, Yan Zhao wrote:
> Introduce kvm_split_cross_boundary_leafs() to split huge leaf entries that
> cross the boundary of a specified range.
>
> Splitting huge leaf entries that cross the boundary is essential before
> zapping the range in the mirror root. This ensures that the subsequent zap
> operation does not affect any GFNs outside the specified range. This is
> crucial for the mirror root, as the private page table requires the guest's
> ACCEPT operation after a GFN faults back.
>
> The core of kvm_split_cross_boundary_leafs() leverages the main logic from
> tdp_mmu_split_huge_pages_root(). It traverses the specified root and splits
> huge leaf entries if they cross the range boundary. When splitting is
> necessary, kvm->mmu_lock is temporarily released for memory allocation,
> which means returning -ENOMEM is possible.
>
> Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
> ---
> RFC v2:
> - Rename the API to kvm_split_cross_boundary_leafs().
> - Make the API to be usable for direct roots or under shared mmu_lock.
> - Leverage the main logic from tdp_mmu_split_huge_pages_root(). (Rick)
>
> RFC v1:
> - Split patch.
> - introduced API kvm_split_boundary_leafs(), refined the logic and
> simplified the code.
> ---
> arch/x86/kvm/mmu/mmu.c | 27 +++++++++++++++
> arch/x86/kvm/mmu/tdp_mmu.c | 68 ++++++++++++++++++++++++++++++++++++--
> arch/x86/kvm/mmu/tdp_mmu.h | 3 ++
> include/linux/kvm_host.h | 2 ++
> 4 files changed, 97 insertions(+), 3 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 9182192daa3a..13910ae05f76 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -1647,6 +1647,33 @@ static bool __kvm_rmap_zap_gfn_range(struct kvm *kvm,
> start, end - 1, can_yield, true, flush);
> }
>
> +/*
> + * Split large leafs crossing the boundary of the specified range
> + *
> + * Return value:
> + * 0 : success, no flush is required;
> + * 1 : success, flush is required;
> + * <0: failure.
> + */
> +int kvm_split_cross_boundary_leafs(struct kvm *kvm, struct kvm_gfn_range *range,
> + bool shared)
> +{
> + int ret = 0;
> +
> + lockdep_assert_once(kvm->mmu_invalidate_in_progress ||
> + lockdep_is_held(&kvm->slots_lock) ||
> + srcu_read_lock_held(&kvm->srcu));
> +
> + if (!range->may_block)
> + return -EOPNOTSUPP;
> +
> + if (tdp_mmu_enabled)
> + ret = kvm_tdp_mmu_gfn_range_split_cross_boundary_leafs(kvm, range, shared);
> +
> + return ret;
> +}
> +EXPORT_SYMBOL_GPL(kvm_split_cross_boundary_leafs);
> +
> bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
> {
> bool flush = false;
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index ce49cc850ed5..62a09a9655c3 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -1574,10 +1574,17 @@ static int tdp_mmu_split_huge_page(struct kvm *kvm, struct tdp_iter *iter,
> return ret;
> }
>
> +static bool iter_cross_boundary(struct tdp_iter *iter, gfn_t start, gfn_t end)
> +{
> + return !(iter->gfn >= start &&
> + (iter->gfn + KVM_PAGES_PER_HPAGE(iter->level)) <= end);
> +}
> +
> static int tdp_mmu_split_huge_pages_root(struct kvm *kvm,
> struct kvm_mmu_page *root,
> gfn_t start, gfn_t end,
> - int target_level, bool shared)
> + int target_level, bool shared,
> + bool only_cross_bounday, bool *flush)
s/only_cross_bounday/only_cross_boundary
> {
> struct kvm_mmu_page *sp = NULL;
> struct tdp_iter iter;
> @@ -1589,6 +1596,13 @@ static int tdp_mmu_split_huge_pages_root(struct kvm *kvm,
> * level into one lower level. For example, if we encounter a 1GB page
> * we split it into 512 2MB pages.
> *
> + * When only_cross_bounday is true, just split huge pages above the
> + * target level into one lower level if the huge pages cross the start
> + * or end boundary.
> + *
> + * No need to update @flush for !only_cross_bounday cases, which rely
> + * on the callers to do the TLB flush in the end.
I think API wise, it's a bit confusing, although it's a local API.
If just look at the API without digging into the function implementation, my
initial thought is *flush will tell whether TLB flush is needed or not.
Just update *flush unconditionally? Or move the comment as the description for
the function to call it out?
I have thought another option to combine the two inputs, i.e., if *flush is a
valid pointer, it means it's for only_cross_boundary. Otherwise, just passing
NULL. But then I felt it was a bit risky to rely on the pointer to indicate the
scenario.
> + *
> * Since the TDP iterator uses a pre-order traversal, we are guaranteed
> * to visit an SPTE before ever visiting its children, which means we
> * will correctly recursively split huge pages that are more than one
> @@ -1597,12 +1611,19 @@ static int tdp_mmu_split_huge_pages_root(struct kvm *kvm,
> */
> for_each_tdp_pte_min_level(iter, kvm, root, target_level + 1, start, end) {
> retry:
> - if (tdp_mmu_iter_cond_resched(kvm, &iter, false, shared))
> + if (tdp_mmu_iter_cond_resched(kvm, &iter, *flush, shared)) {
> + if (only_cross_bounday)
> + *flush = false;
> continue;
> + }
>
> if (!is_shadow_present_pte(iter.old_spte) || !is_large_pte(iter.old_spte))
> continue;
>
> + if (only_cross_bounday &&
> + !iter_cross_boundary(&iter, start, end))
> + continue;
> +
> if (!sp) {
> rcu_read_unlock();
>
> @@ -1637,6 +1658,8 @@ static int tdp_mmu_split_huge_pages_root(struct kvm *kvm,
> goto retry;
>
> sp = NULL;
> + if (only_cross_bounday)
> + *flush = true;
> }
>
> rcu_read_unlock();
[...]
* Re: [RFC PATCH v2 12/23] KVM: x86/mmu: Introduce kvm_split_cross_boundary_leafs()
2025-09-03 6:57 ` Binbin Wu
@ 2025-09-03 9:44 ` Yan Zhao
0 siblings, 0 replies; 43+ messages in thread
From: Yan Zhao @ 2025-09-03 9:44 UTC (permalink / raw)
To: Binbin Wu
Cc: pbonzini, seanjc, linux-kernel, kvm, x86, rick.p.edgecombe,
dave.hansen, kas, tabba, ackerleytng, quic_eberman, michael.roth,
david, vannapurve, vbabka, thomas.lendacky, pgonda, zhiquan1.li,
fan.du, jun.miao, ira.weiny, isaku.yamahata, xiaoyao.li,
chao.p.peng
On Wed, Sep 03, 2025 at 02:57:07PM +0800, Binbin Wu wrote:
>
>
> On 8/7/2025 5:43 PM, Yan Zhao wrote:
> > Introduce kvm_split_cross_boundary_leafs() to split huge leaf entries that
> > cross the boundary of a specified range.
> >
> > Splitting huge leaf entries that cross the boundary is essential before
> > zapping the range in the mirror root. This ensures that the subsequent zap
> > operation does not affect any GFNs outside the specified range. This is
> > crucial for the mirror root, as the private page table requires the guest's
> > ACCEPT operation after a GFN faults back.
> >
> > The core of kvm_split_cross_boundary_leafs() leverages the main logic from
> > tdp_mmu_split_huge_pages_root(). It traverses the specified root and splits
> > huge leaf entries if they cross the range boundary. When splitting is
> > necessary, kvm->mmu_lock is temporarily released for memory allocation,
> > which means returning -ENOMEM is possible.
> >
> > Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
> > Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> > Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
> > ---
> > RFC v2:
> > - Rename the API to kvm_split_cross_boundary_leafs().
> > - Make the API to be usable for direct roots or under shared mmu_lock.
> > - Leverage the main logic from tdp_mmu_split_huge_pages_root(). (Rick)
> >
> > RFC v1:
> > - Split patch.
> > - introduced API kvm_split_boundary_leafs(), refined the logic and
> > simplified the code.
> > ---
> > arch/x86/kvm/mmu/mmu.c | 27 +++++++++++++++
> > arch/x86/kvm/mmu/tdp_mmu.c | 68 ++++++++++++++++++++++++++++++++++++--
> > arch/x86/kvm/mmu/tdp_mmu.h | 3 ++
> > include/linux/kvm_host.h | 2 ++
> > 4 files changed, 97 insertions(+), 3 deletions(-)
> >
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index 9182192daa3a..13910ae05f76 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -1647,6 +1647,33 @@ static bool __kvm_rmap_zap_gfn_range(struct kvm *kvm,
> > start, end - 1, can_yield, true, flush);
> > }
> > +/*
> > + * Split large leafs crossing the boundary of the specified range
> > + *
> > + * Return value:
> > + * 0 : success, no flush is required;
> > + * 1 : success, flush is required;
> > + * <0: failure.
> > + */
> > +int kvm_split_cross_boundary_leafs(struct kvm *kvm, struct kvm_gfn_range *range,
> > + bool shared)
> > +{
> > + int ret = 0;
> > +
> > + lockdep_assert_once(kvm->mmu_invalidate_in_progress ||
> > + lockdep_is_held(&kvm->slots_lock) ||
> > + srcu_read_lock_held(&kvm->srcu));
> > +
> > + if (!range->may_block)
> > + return -EOPNOTSUPP;
> > +
> > + if (tdp_mmu_enabled)
> > + ret = kvm_tdp_mmu_gfn_range_split_cross_boundary_leafs(kvm, range, shared);
> > +
> > + return ret;
> > +}
> > +EXPORT_SYMBOL_GPL(kvm_split_cross_boundary_leafs);
> > +
> > bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
> > {
> > bool flush = false;
> > diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> > index ce49cc850ed5..62a09a9655c3 100644
> > --- a/arch/x86/kvm/mmu/tdp_mmu.c
> > +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> > @@ -1574,10 +1574,17 @@ static int tdp_mmu_split_huge_page(struct kvm *kvm, struct tdp_iter *iter,
> > return ret;
> > }
> > +static bool iter_cross_boundary(struct tdp_iter *iter, gfn_t start, gfn_t end)
> > +{
> > + return !(iter->gfn >= start &&
> > + (iter->gfn + KVM_PAGES_PER_HPAGE(iter->level)) <= end);
> > +}
> > +
> > static int tdp_mmu_split_huge_pages_root(struct kvm *kvm,
> > struct kvm_mmu_page *root,
> > gfn_t start, gfn_t end,
> > - int target_level, bool shared)
> > + int target_level, bool shared,
> > + bool only_cross_bounday, bool *flush)
> s/only_cross_bounday/only_cross_boundary
Will fix.
> > {
> > struct kvm_mmu_page *sp = NULL;
> > struct tdp_iter iter;
> > @@ -1589,6 +1596,13 @@ static int tdp_mmu_split_huge_pages_root(struct kvm *kvm,
> > * level into one lower level. For example, if we encounter a 1GB page
> > * we split it into 512 2MB pages.
> > *
> > + * When only_cross_bounday is true, just split huge pages above the
> > + * target level into one lower level if the huge pages cross the start
> > + * or end boundary.
> > + *
> > + * No need to update @flush for !only_cross_bounday cases, which rely
> > + * on the callers to do the TLB flush in the end.
>
> I think API wise, it's a bit confusing, although it's a local API.
> If just look at the API without digging into the function implementation, my
> initial thought is *flush will tell whether TLB flush is needed or not.
>
> Just update *flush unconditionally? Or move the comment as the description for
> the function to call it out?
>
> I have thought another option to combine the two inputs, i.e., if *flush is a
> valid pointer, it means it's for only_cross_boundary. Otherwise, just passing
> NULL. But then I felt it was a bit risky to rely on the pointer to indicate the
> scenario.
I feel it's better not to combine flush and only_cross_boundary.
Will add a function description to tdp_mmu_split_huge_pages_root().
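For example, something roughly like the below (wording is only a draft):

/*
 * tdp_mmu_split_huge_pages_root() - split huge pages of @root within [start, end)
 *
 * @only_cross_boundary: when true, only split huge leafs that cross the
 *                       start/end boundary.
 * @flush: only updated when @only_cross_boundary is true; set when a split
 *         leaves a TLB flush pending, cleared once
 *         tdp_mmu_iter_cond_resched() has performed the flush. For the
 *         !@only_cross_boundary case, callers are expected to do the TLB
 *         flush themselves at the end.
 */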
> > + *
> > * Since the TDP iterator uses a pre-order traversal, we are guaranteed
> > * to visit an SPTE before ever visiting its children, which means we
> > * will correctly recursively split huge pages that are more than one
> > @@ -1597,12 +1611,19 @@ static int tdp_mmu_split_huge_pages_root(struct kvm *kvm,
> > */
> > for_each_tdp_pte_min_level(iter, kvm, root, target_level + 1, start, end) {
> > retry:
> > - if (tdp_mmu_iter_cond_resched(kvm, &iter, false, shared))
> > + if (tdp_mmu_iter_cond_resched(kvm, &iter, *flush, shared)) {
> > + if (only_cross_bounday)
> > + *flush = false;
> > continue;
> > + }
> > if (!is_shadow_present_pte(iter.old_spte) || !is_large_pte(iter.old_spte))
> > continue;
> > + if (only_cross_bounday &&
> > + !iter_cross_boundary(&iter, start, end))
> > + continue;
> > +
> > if (!sp) {
> > rcu_read_unlock();
> > @@ -1637,6 +1658,8 @@ static int tdp_mmu_split_huge_pages_root(struct kvm *kvm,
> > goto retry;
> > sp = NULL;
> > + if (only_cross_bounday)
> > + *flush = true;
> > }
> > rcu_read_unlock();
> [...]
* [RFC PATCH v2 13/23] KVM: x86: Introduce hugepage_set_guest_inhibit()
2025-08-07 9:39 [RFC PATCH v2 00/23] KVM: TDX huge page support for private memory Yan Zhao
` (11 preceding siblings ...)
2025-08-07 9:43 ` [RFC PATCH v2 12/23] KVM: x86/mmu: Introduce kvm_split_cross_boundary_leafs() Yan Zhao
@ 2025-08-07 9:44 ` Yan Zhao
2025-08-07 9:44 ` [RFC PATCH v2 14/23] KVM: TDX: Split and inhibit huge mappings if a VMExit carries level info Yan Zhao
` (9 subsequent siblings)
22 siblings, 0 replies; 43+ messages in thread
From: Yan Zhao @ 2025-08-07 9:44 UTC (permalink / raw)
To: pbonzini, seanjc
Cc: linux-kernel, kvm, x86, rick.p.edgecombe, dave.hansen, kas, tabba,
ackerleytng, quic_eberman, michael.roth, david, vannapurve,
vbabka, thomas.lendacky, pgonda, zhiquan1.li, fan.du, jun.miao,
ira.weiny, isaku.yamahata, xiaoyao.li, binbin.wu, chao.p.peng,
yan.y.zhao
TDX requires guests to accept S-EPT mappings created by the host KVM. Due
to the current implementation of the TDX module, if a guest accepts a GFN
at a lower level after KVM maps it at a higher level, the TDX module will
emulate an EPT violation VMExit to KVM instead of returning a size mismatch
error to the guest. If KVM fails to perform page splitting in the VMExit
handler, the guest's accept operation will be triggered again upon
re-entering the guest, causing a repeated EPT violation VMExit.
To facilitate passing the guest's accept level information to the KVM MMU
core and to prevent repeated mapping of a GFN at different levels due to
different accept levels specified by different vCPUs, introduce the
interface hugepage_set_guest_inhibit(). This interface records, across
vCPUs, that mapping at a certain level is inhibited by the guest.
The KVM_LPAGE_GUEST_INHIBIT_FLAG bit is currently modified in one
direction (set), so no clear interface is provided.
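
Setting a bit in disallow_lpage is sufficient to inhibit the level because
the existing mapping-level logic already skips any level whose lpage_info
has a non-zero disallow_lpage, roughly as in the simplified sketch below
(illustrative only; existing KVM code paraphrased rather than quoted):

	/*
	 * Simplified sketch: any non-zero disallow_lpage, including one with
	 * only KVM_LPAGE_GUEST_INHIBIT_FLAG set, caps the mapping level.
	 */
	for ( ; max_level > PG_LEVEL_4K; max_level--) {
		struct kvm_lpage_info *linfo = lpage_info_slot(gfn, slot, max_level);

		if (!linfo->disallow_lpage)
			break;
	}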
Link: https://lore.kernel.org/all/a6ffe23fb97e64109f512fa43e9f6405236ed40a.camel@intel.com/ [1]
Suggested-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
---
RFC v2:
- new in RFC v2
---
arch/x86/kvm/mmu.h | 3 +++
arch/x86/kvm/mmu/mmu.c | 21 ++++++++++++++++++---
2 files changed, 21 insertions(+), 3 deletions(-)
diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index b122255c7d4e..c2d8819f3438 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -326,4 +326,7 @@ static inline bool kvm_is_gfn_alias(struct kvm *kvm, gfn_t gfn)
{
return gfn & kvm_gfn_direct_bits(kvm);
}
+
+void hugepage_set_guest_inhibit(struct kvm_memory_slot *slot, gfn_t gfn, int level);
+bool hugepage_test_guest_inhibit(struct kvm_memory_slot *slot, gfn_t gfn, int level);
#endif
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 13910ae05f76..1c639286aac2 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -721,12 +721,14 @@ static struct kvm_lpage_info *lpage_info_slot(gfn_t gfn,
}
/*
- * The most significant bit in disallow_lpage tracks whether or not memory
- * attributes are mixed, i.e. not identical for all gfns at the current level.
+ * The 2 most significant bits in disallow_lpage track whether or not memory
+ * attributes are mixed, i.e. not identical for all gfns at the current level,
+ * or whether or not the guest inhibits the current level of hugepage at the gfn.
* The lower order bits are used to refcount other cases where a hugepage is
* disallowed, e.g. if KVM has shadow a page table at the gfn.
*/
#define KVM_LPAGE_MIXED_FLAG BIT(31)
+#define KVM_LPAGE_GUEST_INHIBIT_FLAG BIT(30)
static void update_gfn_disallow_lpage_count(const struct kvm_memory_slot *slot,
gfn_t gfn, int count)
@@ -739,7 +741,8 @@ static void update_gfn_disallow_lpage_count(const struct kvm_memory_slot *slot,
old = linfo->disallow_lpage;
linfo->disallow_lpage += count;
- WARN_ON_ONCE((old ^ linfo->disallow_lpage) & KVM_LPAGE_MIXED_FLAG);
+ WARN_ON_ONCE((old ^ linfo->disallow_lpage) &
+ (KVM_LPAGE_MIXED_FLAG | KVM_LPAGE_GUEST_INHIBIT_FLAG));
}
}
@@ -1647,6 +1650,18 @@ static bool __kvm_rmap_zap_gfn_range(struct kvm *kvm,
start, end - 1, can_yield, true, flush);
}
+bool hugepage_test_guest_inhibit(struct kvm_memory_slot *slot, gfn_t gfn, int level)
+{
+ return lpage_info_slot(gfn, slot, level)->disallow_lpage & KVM_LPAGE_GUEST_INHIBIT_FLAG;
+}
+EXPORT_SYMBOL_GPL(hugepage_test_guest_inhibit);
+
+void hugepage_set_guest_inhibit(struct kvm_memory_slot *slot, gfn_t gfn, int level)
+{
+ lpage_info_slot(gfn, slot, level)->disallow_lpage |= KVM_LPAGE_GUEST_INHIBIT_FLAG;
+}
+EXPORT_SYMBOL_GPL(hugepage_set_guest_inhibit);
+
/*
* Split large leafs crossing the boundary of the specified range
*
--
2.43.2
^ permalink raw reply related [flat|nested] 43+ messages in thread
* [RFC PATCH v2 14/23] KVM: TDX: Split and inhibit huge mappings if a VMExit carries level info
2025-08-07 9:39 [RFC PATCH v2 00/23] KVM: TDX huge page support for private memory Yan Zhao
` (12 preceding siblings ...)
2025-08-07 9:44 ` [RFC PATCH v2 13/23] KVM: x86: Introduce hugepage_set_guest_inhibit() Yan Zhao
@ 2025-08-07 9:44 ` Yan Zhao
2025-09-03 7:36 ` Binbin Wu
2025-08-07 9:44 ` [RFC PATCH v2 15/23] KVM: Change the return type of gfn_handler_t() from bool to int Yan Zhao
` (8 subsequent siblings)
22 siblings, 1 reply; 43+ messages in thread
From: Yan Zhao @ 2025-08-07 9:44 UTC (permalink / raw)
To: pbonzini, seanjc
Cc: linux-kernel, kvm, x86, rick.p.edgecombe, dave.hansen, kas, tabba,
ackerleytng, quic_eberman, michael.roth, david, vannapurve,
vbabka, thomas.lendacky, pgonda, zhiquan1.li, fan.du, jun.miao,
ira.weiny, isaku.yamahata, xiaoyao.li, binbin.wu, chao.p.peng,
yan.y.zhao
TDX requires guests to accept S-EPT mappings created by the host KVM. Due
to the current implementation of the TDX module, if a guest accepts a GFN
at a lower level after KVM maps it at a higher level, the TDX module will
emulate an EPT violation VMExit to KVM instead of returning a size mismatch
error to the guest. If KVM fails to perform page splitting in the VMExit
handler, the guest's accept operation will be triggered again upon
re-entering the guest, causing a repeated EPT violation VMExit.
The TDX module thus enables the EPT violation VMExit to carry the guest's
accept level when the VMExit is caused by the guest's accept operation.
Therefore, in TDX's EPT violation handler
(1) Set the guest inhibit bit in the lpage info to prevent KVM MMU core
from mapping at a higher level than the guest's accept level.
(2) Split any existing huge mapping at the fault GFN to avoid unsupported
splitting under the shared mmu_lock by TDX.
Use write mmu_lock to protect (1) and (2) for now. If future KVM TDX can
perform the actual splitting under shared mmu_lock with enhanced TDX
modules, (1) could be performed under shared mmu_lock, and (2) would
become unnecessary.
As an optimization, this patch calls hugepage_test_guest_inhibit() without
holding the mmu_lock to reduce the frequency of acquiring the write
mmu_lock. The write mmu_lock is thus only acquired if the guest inhibit bit
is not already set. This is safe because the guest inhibit bit is set in a
one-way manner while the splitting under the write mmu_lock is performed
before setting the guest inhibit bit.
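For reference, the effective mapping from the accept level carried in the
extended exit qualification to the inhibited hugepage levels, as decoded by
tdx_check_accept_level() below (worked out from that code, not from the TDX
spec):

    /*
     * level = (eeq_info & GENMASK(2, 0)) + 1
     *
     *   eeq_info[2:0] == 0  ->  accept at PG_LEVEL_4K:
     *                           inhibit PG_LEVEL_2M and PG_LEVEL_1G at the GFN
     *   eeq_info[2:0] == 1  ->  accept at PG_LEVEL_2M:
     *                           inhibit PG_LEVEL_1G at the GFN
     *   eeq_info[2:0] == 2  ->  accept at PG_LEVEL_1G: nothing to inhibit
     */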
Link: https://lore.kernel.org/all/a6ffe23fb97e64109f512fa43e9f6405236ed40a.camel@intel.com
Suggested-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
---
RFC v2
- Change tdx_get_accept_level() to tdx_check_accept_level().
- Invoke kvm_split_cross_boundary_leafs() and hugepage_set_guest_inhibit()
to change KVM mapping level in a global way according to guest accept
level. (Rick, Sean).
RFC v1:
- Introduce tdx_get_accept_level() to get guest accept level.
- Use tdx->violation_request_level and tdx->violation_gfn* to pass guest
accept level to tdx_gmem_private_max_mapping_level() to determine KVM
mapping level.
---
arch/x86/kvm/vmx/tdx.c | 50 +++++++++++++++++++++++++++++++++++++
arch/x86/kvm/vmx/tdx_arch.h | 3 +++
2 files changed, 53 insertions(+)
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 035d81275be4..71115058e5e6 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -2019,6 +2019,53 @@ static inline bool tdx_is_sept_violation_unexpected_pending(struct kvm_vcpu *vcp
return !(eq & EPT_VIOLATION_PROT_MASK) && !(eq & EPT_VIOLATION_EXEC_FOR_RING3_LIN);
}
+static inline int tdx_check_accept_level(struct kvm_vcpu *vcpu, gfn_t gfn)
+{
+ struct kvm_memory_slot *slot = gfn_to_memslot(vcpu->kvm, gfn);
+ struct vcpu_tdx *tdx = to_tdx(vcpu);
+ struct kvm *kvm = vcpu->kvm;
+ u64 eeq_type, eeq_info;
+ int level = -1;
+
+ if (!slot)
+ return 0;
+
+ eeq_type = tdx->ext_exit_qualification & TDX_EXT_EXIT_QUAL_TYPE_MASK;
+ if (eeq_type != TDX_EXT_EXIT_QUAL_TYPE_ACCEPT)
+ return 0;
+
+ eeq_info = (tdx->ext_exit_qualification & TDX_EXT_EXIT_QUAL_INFO_MASK) >>
+ TDX_EXT_EXIT_QUAL_INFO_SHIFT;
+
+ level = (eeq_info & GENMASK(2, 0)) + 1;
+
+ if (level == PG_LEVEL_4K || level == PG_LEVEL_2M) {
+ if (!hugepage_test_guest_inhibit(slot, gfn, level + 1)) {
+ gfn_t base_gfn = gfn_round_for_level(gfn, level);
+ struct kvm_gfn_range gfn_range = {
+ .start = base_gfn,
+ .end = base_gfn + KVM_PAGES_PER_HPAGE(level),
+ .slot = slot,
+ .may_block = true,
+ .attr_filter = KVM_FILTER_PRIVATE,
+ };
+
+ scoped_guard(write_lock, &kvm->mmu_lock) {
+ int ret;
+
+ ret = kvm_split_cross_boundary_leafs(kvm, &gfn_range, false);
+ if (ret)
+ return ret;
+
+ hugepage_set_guest_inhibit(slot, gfn, level + 1);
+ if (level == PG_LEVEL_4K)
+ hugepage_set_guest_inhibit(slot, gfn, level + 2);
+ }
+ }
+ }
+ return 0;
+}
+
static int tdx_handle_ept_violation(struct kvm_vcpu *vcpu)
{
unsigned long exit_qual;
@@ -2044,6 +2091,9 @@ static int tdx_handle_ept_violation(struct kvm_vcpu *vcpu)
*/
exit_qual = EPT_VIOLATION_ACC_WRITE;
+ if (tdx_check_accept_level(vcpu, gpa_to_gfn(gpa)))
+ return RET_PF_RETRY;
+
/* Only private GPA triggers zero-step mitigation */
local_retry = true;
} else {
diff --git a/arch/x86/kvm/vmx/tdx_arch.h b/arch/x86/kvm/vmx/tdx_arch.h
index a30e880849e3..af006a73ee05 100644
--- a/arch/x86/kvm/vmx/tdx_arch.h
+++ b/arch/x86/kvm/vmx/tdx_arch.h
@@ -82,7 +82,10 @@ struct tdx_cpuid_value {
#define TDX_TD_ATTR_PERFMON BIT_ULL(63)
#define TDX_EXT_EXIT_QUAL_TYPE_MASK GENMASK(3, 0)
+#define TDX_EXT_EXIT_QUAL_TYPE_ACCEPT 1
#define TDX_EXT_EXIT_QUAL_TYPE_PENDING_EPT_VIOLATION 6
+#define TDX_EXT_EXIT_QUAL_INFO_MASK GENMASK(63, 32)
+#define TDX_EXT_EXIT_QUAL_INFO_SHIFT 32
/*
* TD_PARAMS is provided as an input to TDH_MNG_INIT, the size of which is 1024B.
*/
--
2.43.2
^ permalink raw reply related [flat|nested] 43+ messages in thread
* Re: [RFC PATCH v2 14/23] KVM: TDX: Split and inhibit huge mappings if a VMExit carries level info
2025-08-07 9:44 ` [RFC PATCH v2 14/23] KVM: TDX: Split and inhibit huge mappings if a VMExit carries level info Yan Zhao
@ 2025-09-03 7:36 ` Binbin Wu
2025-09-03 9:37 ` Yan Zhao
0 siblings, 1 reply; 43+ messages in thread
From: Binbin Wu @ 2025-09-03 7:36 UTC (permalink / raw)
To: Yan Zhao
Cc: pbonzini, seanjc, linux-kernel, kvm, x86, rick.p.edgecombe,
dave.hansen, kas, tabba, ackerleytng, quic_eberman, michael.roth,
david, vannapurve, vbabka, thomas.lendacky, pgonda, zhiquan1.li,
fan.du, jun.miao, ira.weiny, isaku.yamahata, xiaoyao.li,
chao.p.peng
On 8/7/2025 5:44 PM, Yan Zhao wrote:
> TDX requires guests to accept S-EPT mappings created by the host KVM. Due
> to the current implementation of the TDX module, if a guest accepts a GFN
> at a lower level after KVM maps it at a higher level, the TDX module will
> emulate an EPT violation VMExit to KVM instead of returning a size mismatch
> error to the guest. If KVM fails to perform page splitting in the VMExit
> handler, the guest's accept operation will be triggered again upon
> re-entering the guest, causing a repeated EPT violation VMExit.
>
> The TDX module thus enables the EPT violation VMExit to carry the guest's
> accept level when the VMExit is caused by the guest's accept operation.
>
> Therefore, in TDX's EPT violation handler
> (1) Set the guest inhibit bit in the lpage info to prevent KVM MMU core
> from mapping at a higher level than the guest's accept level.
>
> (2) Split any existing huge mapping at the fault GFN to avoid unsupported
> splitting under the shared mmu_lock by TDX.
>
> Use write mmu_lock to protect (1) and (2) for now. If future KVM TDX can
> perform the actual splitting under shared mmu_lock with enhanced TDX
> modules, (1) could be performed under shared mmu_lock, and (2) would
> become unnecessary.
The description for (1) and (2) reversed?
>
> As an optimization, this patch calls hugepage_test_guest_inhibit() without
> holding the mmu_lock to reduce the frequency of acquiring the write
> mmu_lock. The write mmu_lock is thus only acquired if the guest inhibit bit
> is not already set. This is safe because the guest inhibit bit is set in a
> one-way manner while the splitting under the write mmu_lock is performed
> before setting the guest inhibit bit.
>
> Link: https://lore.kernel.org/all/a6ffe23fb97e64109f512fa43e9f6405236ed40a.camel@intel.com
> Suggested-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> Suggested-by: Sean Christopherson <seanjc@google.com>
> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
> ---
> RFC v2
> - Change tdx_get_accept_level() to tdx_check_accept_level().
> - Invoke kvm_split_cross_boundary_leafs() and hugepage_set_guest_inhibit()
> to change KVM mapping level in a global way according to guest accept
> level. (Rick, Sean).
>
> RFC v1:
> - Introduce tdx_get_accept_level() to get guest accept level.
> - Use tdx->violation_request_level and tdx->violation_gfn* to pass guest
> accept level to tdx_gmem_private_max_mapping_level() to determine KVM
> mapping level.
> ---
> arch/x86/kvm/vmx/tdx.c | 50 +++++++++++++++++++++++++++++++++++++
> arch/x86/kvm/vmx/tdx_arch.h | 3 +++
> 2 files changed, 53 insertions(+)
>
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index 035d81275be4..71115058e5e6 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -2019,6 +2019,53 @@ static inline bool tdx_is_sept_violation_unexpected_pending(struct kvm_vcpu *vcp
> return !(eq & EPT_VIOLATION_PROT_MASK) && !(eq & EPT_VIOLATION_EXEC_FOR_RING3_LIN);
> }
>
> +static inline int tdx_check_accept_level(struct kvm_vcpu *vcpu, gfn_t gfn)
> +{
> + struct kvm_memory_slot *slot = gfn_to_memslot(vcpu->kvm, gfn);
> + struct vcpu_tdx *tdx = to_tdx(vcpu);
> + struct kvm *kvm = vcpu->kvm;
> + u64 eeq_type, eeq_info;
> + int level = -1;
> +
> + if (!slot)
> + return 0;
> +
> + eeq_type = tdx->ext_exit_qualification & TDX_EXT_EXIT_QUAL_TYPE_MASK;
> + if (eeq_type != TDX_EXT_EXIT_QUAL_TYPE_ACCEPT)
> + return 0;
> +
> + eeq_info = (tdx->ext_exit_qualification & TDX_EXT_EXIT_QUAL_INFO_MASK) >>
> + TDX_EXT_EXIT_QUAL_INFO_SHIFT;
> +
> + level = (eeq_info & GENMASK(2, 0)) + 1;
> +
> + if (level == PG_LEVEL_4K || level == PG_LEVEL_2M) {
> + if (!hugepage_test_guest_inhibit(slot, gfn, level + 1)) {
> + gfn_t base_gfn = gfn_round_for_level(gfn, level);
> + struct kvm_gfn_range gfn_range = {
> + .start = base_gfn,
> + .end = base_gfn + KVM_PAGES_PER_HPAGE(level),
> + .slot = slot,
> + .may_block = true,
> + .attr_filter = KVM_FILTER_PRIVATE,
> + };
> +
> + scoped_guard(write_lock, &kvm->mmu_lock) {
> + int ret;
> +
> + ret = kvm_split_cross_boundary_leafs(kvm, &gfn_range, false);
> + if (ret)
> + return ret;
kvm_split_cross_boundary_leafs() calls
kvm_tdp_mmu_gfn_range_split_cross_boundary_leafs(), which can return flush as 1
if any huge page crossing the boundary is split, so returning directly when ret
is non-zero doesn't seem right. Also, the TLB flush needs to be taken care of,
because in kvm_tdp_mmu_gfn_range_split_cross_boundary_leafs() the TLB flush is
only done for a negative return value.
> +
> + hugepage_set_guest_inhibit(slot, gfn, level + 1);
> + if (level == PG_LEVEL_4K)
> + hugepage_set_guest_inhibit(slot, gfn, level + 2);
> + }
> + }
> + }
> + return 0;
> +}
> +
> static int tdx_handle_ept_violation(struct kvm_vcpu *vcpu)
> {
> unsigned long exit_qual;
> @@ -2044,6 +2091,9 @@ static int tdx_handle_ept_violation(struct kvm_vcpu *vcpu)
> */
> exit_qual = EPT_VIOLATION_ACC_WRITE;
>
> + if (tdx_check_accept_level(vcpu, gpa_to_gfn(gpa)))
> + return RET_PF_RETRY;
> +
> /* Only private GPA triggers zero-step mitigation */
> local_retry = true;
> } else {
> diff --git a/arch/x86/kvm/vmx/tdx_arch.h b/arch/x86/kvm/vmx/tdx_arch.h
> index a30e880849e3..af006a73ee05 100644
> --- a/arch/x86/kvm/vmx/tdx_arch.h
> +++ b/arch/x86/kvm/vmx/tdx_arch.h
> @@ -82,7 +82,10 @@ struct tdx_cpuid_value {
> #define TDX_TD_ATTR_PERFMON BIT_ULL(63)
>
> #define TDX_EXT_EXIT_QUAL_TYPE_MASK GENMASK(3, 0)
> +#define TDX_EXT_EXIT_QUAL_TYPE_ACCEPT 1
> #define TDX_EXT_EXIT_QUAL_TYPE_PENDING_EPT_VIOLATION 6
> +#define TDX_EXT_EXIT_QUAL_INFO_MASK GENMASK(63, 32)
> +#define TDX_EXT_EXIT_QUAL_INFO_SHIFT 32
> /*
> * TD_PARAMS is provided as an input to TDH_MNG_INIT, the size of which is 1024B.
> */
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [RFC PATCH v2 14/23] KVM: TDX: Split and inhibit huge mappings if a VMExit carries level info
2025-09-03 7:36 ` Binbin Wu
@ 2025-09-03 9:37 ` Yan Zhao
0 siblings, 0 replies; 43+ messages in thread
From: Yan Zhao @ 2025-09-03 9:37 UTC (permalink / raw)
To: Binbin Wu
Cc: pbonzini, seanjc, linux-kernel, kvm, x86, rick.p.edgecombe,
dave.hansen, kas, tabba, ackerleytng, quic_eberman, michael.roth,
david, vannapurve, vbabka, thomas.lendacky, pgonda, zhiquan1.li,
fan.du, jun.miao, ira.weiny, isaku.yamahata, xiaoyao.li,
chao.p.peng
On Wed, Sep 03, 2025 at 03:36:49PM +0800, Binbin Wu wrote:
>
>
> On 8/7/2025 5:44 PM, Yan Zhao wrote:
> > TDX requires guests to accept S-EPT mappings created by the host KVM. Due
> > to the current implementation of the TDX module, if a guest accepts a GFN
> > at a lower level after KVM maps it at a higher level, the TDX module will
> > emulate an EPT violation VMExit to KVM instead of returning a size mismatch
> > error to the guest. If KVM fails to perform page splitting in the VMExit
> > handler, the guest's accept operation will be triggered again upon
> > re-entering the guest, causing a repeated EPT violation VMExit.
> >
> > The TDX module thus enables the EPT violation VMExit to carry the guest's
> > accept level when the VMExit is caused by the guest's accept operation.
> >
> > Therefore, in TDX's EPT violation handler
> > (1) Set the guest inhibit bit in the lpage info to prevent KVM MMU core
> > from mapping at a higher level than the guest's accept level.
> >
> > (2) Split any existing huge mapping at the fault GFN to avoid unsupported
> > splitting under the shared mmu_lock by TDX.
> >
> > Use write mmu_lock to protect (1) and (2) for now. If future KVM TDX can
> > perform the actual splitting under shared mmu_lock with enhanced TDX
> > modules, (1) could be performed under shared mmu_lock, and (2) would
> > become unnecessary.
>
> The description for (1) and (2) reversed?
No.
After supporting splitting under shared mmu_lock,
- setting the guest inhibit bit can be performed under shared mmu_lock. (*)
- splitting the existing huge mapping under write mmu_lock here would be unnecessary.
(*) is still required to convey the info of which max level the guest requires.
(as explained in "Open 1: How to pass guest's ACCEPT level info" in the
cover letter).
> > As an optimization, this patch calls hugepage_test_guest_inhibit() without
> > holding the mmu_lock to reduce the frequency of acquiring the write
> > mmu_lock. The write mmu_lock is thus only acquired if the guest inhibit bit
> > is not already set. This is safe because the guest inhibit bit is set in a
> > one-way manner while the splitting under the write mmu_lock is performed
> > before setting the guest inhibit bit.
> >
> > Link: https://lore.kernel.org/all/a6ffe23fb97e64109f512fa43e9f6405236ed40a.camel@intel.com
> > Suggested-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> > Suggested-by: Sean Christopherson <seanjc@google.com>
> > Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
> > ---
> > RFC v2
> > - Change tdx_get_accept_level() to tdx_check_accept_level().
> > - Invoke kvm_split_cross_boundary_leafs() and hugepage_set_guest_inhibit()
> > to change KVM mapping level in a global way according to guest accept
> > level. (Rick, Sean).
> >
> > RFC v1:
> > - Introduce tdx_get_accept_level() to get guest accept level.
> > - Use tdx->violation_request_level and tdx->violation_gfn* to pass guest
> > accept level to tdx_gmem_private_max_mapping_level() to determine KVM
> > mapping level.
> > ---
> > arch/x86/kvm/vmx/tdx.c | 50 +++++++++++++++++++++++++++++++++++++
> > arch/x86/kvm/vmx/tdx_arch.h | 3 +++
> > 2 files changed, 53 insertions(+)
> >
> > diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> > index 035d81275be4..71115058e5e6 100644
> > --- a/arch/x86/kvm/vmx/tdx.c
> > +++ b/arch/x86/kvm/vmx/tdx.c
> > @@ -2019,6 +2019,53 @@ static inline bool tdx_is_sept_violation_unexpected_pending(struct kvm_vcpu *vcp
> > return !(eq & EPT_VIOLATION_PROT_MASK) && !(eq & EPT_VIOLATION_EXEC_FOR_RING3_LIN);
> > }
> > +static inline int tdx_check_accept_level(struct kvm_vcpu *vcpu, gfn_t gfn)
> > +{
> > + struct kvm_memory_slot *slot = gfn_to_memslot(vcpu->kvm, gfn);
> > + struct vcpu_tdx *tdx = to_tdx(vcpu);
> > + struct kvm *kvm = vcpu->kvm;
> > + u64 eeq_type, eeq_info;
> > + int level = -1;
> > +
> > + if (!slot)
> > + return 0;
> > +
> > + eeq_type = tdx->ext_exit_qualification & TDX_EXT_EXIT_QUAL_TYPE_MASK;
> > + if (eeq_type != TDX_EXT_EXIT_QUAL_TYPE_ACCEPT)
> > + return 0;
> > +
> > + eeq_info = (tdx->ext_exit_qualification & TDX_EXT_EXIT_QUAL_INFO_MASK) >>
> > + TDX_EXT_EXIT_QUAL_INFO_SHIFT;
> > +
> > + level = (eeq_info & GENMASK(2, 0)) + 1;
> > +
> > + if (level == PG_LEVEL_4K || level == PG_LEVEL_2M) {
> > + if (!hugepage_test_guest_inhibit(slot, gfn, level + 1)) {
> > + gfn_t base_gfn = gfn_round_for_level(gfn, level);
> > + struct kvm_gfn_range gfn_range = {
> > + .start = base_gfn,
> > + .end = base_gfn + KVM_PAGES_PER_HPAGE(level),
> > + .slot = slot,
> > + .may_block = true,
> > + .attr_filter = KVM_FILTER_PRIVATE,
> > + };
> > +
> > + scoped_guard(write_lock, &kvm->mmu_lock) {
> > + int ret;
> > +
> > + ret = kvm_split_cross_boundary_leafs(kvm, &gfn_range, false);
> > + if (ret)
> > + return ret;
>
> kvm_split_cross_boundary_leafs() calls
> kvm_tdp_mmu_gfn_range_split_cross_boundary_leafs(), which can return flush as 1
> if any huge page crossing the boundary is split, so returning directly when ret
> is non-zero doesn't seem right. Also, the TLB flush needs to be taken care of,
> because in kvm_tdp_mmu_gfn_range_split_cross_boundary_leafs() the TLB flush is
> only done for a negative return value.
Oh, good catch!
I forgot about the 2 facts. Will fix them.
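E.g. roughly (untested sketch; whether a ranged flush can be used instead is
still TBD):

    scoped_guard(write_lock, &kvm->mmu_lock) {
            int ret;

            ret = kvm_split_cross_boundary_leafs(kvm, &gfn_range, false);
            if (ret < 0)
                    return ret;

            /* ret == 1 means a cross-boundary leaf was split */
            if (ret)
                    kvm_flush_remote_tlbs(kvm);

            hugepage_set_guest_inhibit(slot, gfn, level + 1);
            if (level == PG_LEVEL_4K)
                    hugepage_set_guest_inhibit(slot, gfn, level + 2);
    }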
> > +
> > + hugepage_set_guest_inhibit(slot, gfn, level + 1);
> > + if (level == PG_LEVEL_4K)
> > + hugepage_set_guest_inhibit(slot, gfn, level + 2);
> > + }
> > + }
> > + }
> > + return 0;
> > +}
> > +
> > static int tdx_handle_ept_violation(struct kvm_vcpu *vcpu)
> > {
> > unsigned long exit_qual;
> > @@ -2044,6 +2091,9 @@ static int tdx_handle_ept_violation(struct kvm_vcpu *vcpu)
> > */
> > exit_qual = EPT_VIOLATION_ACC_WRITE;
> > + if (tdx_check_accept_level(vcpu, gpa_to_gfn(gpa)))
> > + return RET_PF_RETRY;
> > +
> > /* Only private GPA triggers zero-step mitigation */
> > local_retry = true;
> > } else {
> > diff --git a/arch/x86/kvm/vmx/tdx_arch.h b/arch/x86/kvm/vmx/tdx_arch.h
> > index a30e880849e3..af006a73ee05 100644
> > --- a/arch/x86/kvm/vmx/tdx_arch.h
> > +++ b/arch/x86/kvm/vmx/tdx_arch.h
> > @@ -82,7 +82,10 @@ struct tdx_cpuid_value {
> > #define TDX_TD_ATTR_PERFMON BIT_ULL(63)
> > #define TDX_EXT_EXIT_QUAL_TYPE_MASK GENMASK(3, 0)
> > +#define TDX_EXT_EXIT_QUAL_TYPE_ACCEPT 1
> > #define TDX_EXT_EXIT_QUAL_TYPE_PENDING_EPT_VIOLATION 6
> > +#define TDX_EXT_EXIT_QUAL_INFO_MASK GENMASK(63, 32)
> > +#define TDX_EXT_EXIT_QUAL_INFO_SHIFT 32
> > /*
> > * TD_PARAMS is provided as an input to TDH_MNG_INIT, the size of which is 1024B.
> > */
>
^ permalink raw reply [flat|nested] 43+ messages in thread
* [RFC PATCH v2 15/23] KVM: Change the return type of gfn_handler_t() from bool to int
2025-08-07 9:39 [RFC PATCH v2 00/23] KVM: TDX huge page support for private memory Yan Zhao
` (13 preceding siblings ...)
2025-08-07 9:44 ` [RFC PATCH v2 14/23] KVM: TDX: Split and inhibit huge mappings if a VMExit carries level info Yan Zhao
@ 2025-08-07 9:44 ` Yan Zhao
2025-08-07 9:44 ` [RFC PATCH v2 16/23] KVM: x86: Split cross-boundary mirror leafs for KVM_SET_MEMORY_ATTRIBUTES Yan Zhao
` (7 subsequent siblings)
22 siblings, 0 replies; 43+ messages in thread
From: Yan Zhao @ 2025-08-07 9:44 UTC (permalink / raw)
To: pbonzini, seanjc
Cc: linux-kernel, kvm, x86, rick.p.edgecombe, dave.hansen, kas, tabba,
ackerleytng, quic_eberman, michael.roth, david, vannapurve,
vbabka, thomas.lendacky, pgonda, zhiquan1.li, fan.du, jun.miao,
ira.weiny, isaku.yamahata, xiaoyao.li, binbin.wu, chao.p.peng,
yan.y.zhao
Modify the return type of gfn_handler_t() from bool to int. A negative
return value indicates failure, while a return value of 1 signifies success
with a flush required, and 0 denotes success without a flush required.
This adjustment prepares for a later change that will enable
kvm_pre_set_memory_attributes() to fail.
No functional changes expected.
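Callers are then expected to follow this pattern (illustrative sketch,
mirroring the kvm_handle_gfn_range() change below):

    int ret = range->handler(kvm, &gfn_range);

    if (ret < 0)
            return ret;     /* failure, propagate to the caller */

    flush |= ret;           /* 1 == success, TLB flush required */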
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
---
RFC v2:
- No change
RFC v1:
- New patch.
---
arch/arm64/kvm/mmu.c | 8 ++++----
arch/loongarch/kvm/mmu.c | 8 ++++----
arch/mips/kvm/mmu.c | 6 +++---
arch/powerpc/kvm/book3s.c | 4 ++--
arch/powerpc/kvm/e500_mmu_host.c | 8 ++++----
arch/riscv/kvm/mmu.c | 12 ++++++------
arch/x86/kvm/mmu/mmu.c | 20 ++++++++++----------
include/linux/kvm_host.h | 12 ++++++------
virt/kvm/kvm_main.c | 24 ++++++++++++++++--------
9 files changed, 55 insertions(+), 47 deletions(-)
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 8b225450a4eb..991a6df0ca21 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -1999,12 +1999,12 @@ bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
return false;
}
-bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
+int kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
{
u64 size = (range->end - range->start) << PAGE_SHIFT;
if (!kvm->arch.mmu.pgt)
- return false;
+ return 0;
return KVM_PGT_FN(kvm_pgtable_stage2_test_clear_young)(kvm->arch.mmu.pgt,
range->start << PAGE_SHIFT,
@@ -2015,12 +2015,12 @@ bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
*/
}
-bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
+int kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
{
u64 size = (range->end - range->start) << PAGE_SHIFT;
if (!kvm->arch.mmu.pgt)
- return false;
+ return 0;
return KVM_PGT_FN(kvm_pgtable_stage2_test_clear_young)(kvm->arch.mmu.pgt,
range->start << PAGE_SHIFT,
diff --git a/arch/loongarch/kvm/mmu.c b/arch/loongarch/kvm/mmu.c
index ed956c5cf2cc..0542516c98eb 100644
--- a/arch/loongarch/kvm/mmu.c
+++ b/arch/loongarch/kvm/mmu.c
@@ -511,7 +511,7 @@ bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
range->end << PAGE_SHIFT, &ctx);
}
-bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
+int kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
{
kvm_ptw_ctx ctx;
@@ -523,15 +523,15 @@ bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
range->end << PAGE_SHIFT, &ctx);
}
-bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
+int kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
{
gpa_t gpa = range->start << PAGE_SHIFT;
kvm_pte_t *ptep = kvm_populate_gpa(kvm, NULL, gpa, 0);
if (ptep && kvm_pte_present(NULL, ptep) && kvm_pte_young(*ptep))
- return true;
+ return 1;
- return false;
+ return 0;
}
/*
diff --git a/arch/mips/kvm/mmu.c b/arch/mips/kvm/mmu.c
index d2c3b6b41f18..c26cc89c8e98 100644
--- a/arch/mips/kvm/mmu.c
+++ b/arch/mips/kvm/mmu.c
@@ -444,18 +444,18 @@ bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
return true;
}
-bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
+int kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
{
return kvm_mips_mkold_gpa_pt(kvm, range->start, range->end);
}
-bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
+int kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
{
gpa_t gpa = range->start << PAGE_SHIFT;
pte_t *gpa_pte = kvm_mips_pte_for_gpa(kvm, NULL, gpa);
if (!gpa_pte)
- return false;
+ return 0;
return pte_young(*gpa_pte);
}
diff --git a/arch/powerpc/kvm/book3s.c b/arch/powerpc/kvm/book3s.c
index d79c5d1098c0..9bf6e1cf64f1 100644
--- a/arch/powerpc/kvm/book3s.c
+++ b/arch/powerpc/kvm/book3s.c
@@ -886,12 +886,12 @@ bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
return kvm->arch.kvm_ops->unmap_gfn_range(kvm, range);
}
-bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
+int kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
{
return kvm->arch.kvm_ops->age_gfn(kvm, range);
}
-bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
+int kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
{
return kvm->arch.kvm_ops->test_age_gfn(kvm, range);
}
diff --git a/arch/powerpc/kvm/e500_mmu_host.c b/arch/powerpc/kvm/e500_mmu_host.c
index 06caf8bbbe2b..dd5411ee242e 100644
--- a/arch/powerpc/kvm/e500_mmu_host.c
+++ b/arch/powerpc/kvm/e500_mmu_host.c
@@ -697,16 +697,16 @@ bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
return kvm_e500_mmu_unmap_gfn(kvm, range);
}
-bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
+int kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
{
/* XXX could be more clever ;) */
- return false;
+ return 0;
}
-bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
+int kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
{
/* XXX could be more clever ;) */
- return false;
+ return 0;
}
/*****************************************/
diff --git a/arch/riscv/kvm/mmu.c b/arch/riscv/kvm/mmu.c
index 1087ea74567b..98c2fcd9229f 100644
--- a/arch/riscv/kvm/mmu.c
+++ b/arch/riscv/kvm/mmu.c
@@ -550,38 +550,38 @@ bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
return false;
}
-bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
+int kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
{
pte_t *ptep;
u32 ptep_level = 0;
u64 size = (range->end - range->start) << PAGE_SHIFT;
if (!kvm->arch.pgd)
- return false;
+ return 0;
WARN_ON(size != PAGE_SIZE && size != PMD_SIZE && size != PUD_SIZE);
if (!gstage_get_leaf_entry(kvm, range->start << PAGE_SHIFT,
&ptep, &ptep_level))
- return false;
+ return 0;
return ptep_test_and_clear_young(NULL, 0, ptep);
}
-bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
+int kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
{
pte_t *ptep;
u32 ptep_level = 0;
u64 size = (range->end - range->start) << PAGE_SHIFT;
if (!kvm->arch.pgd)
- return false;
+ return 0;
WARN_ON(size != PAGE_SIZE && size != PMD_SIZE && size != PUD_SIZE);
if (!gstage_get_leaf_entry(kvm, range->start << PAGE_SHIFT,
&ptep, &ptep_level))
- return false;
+ return 0;
return pte_young(ptep_get(ptep));
}
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 1c639286aac2..c71f8bb0b903 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -1806,7 +1806,7 @@ static bool kvm_may_have_shadow_mmu_sptes(struct kvm *kvm)
return !tdp_mmu_enabled || READ_ONCE(kvm->arch.indirect_shadow_pages);
}
-bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
+int kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
{
bool young = false;
@@ -1819,7 +1819,7 @@ bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
return young;
}
-bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
+int kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
{
bool young = false;
@@ -7841,8 +7841,8 @@ static void hugepage_set_mixed(struct kvm_memory_slot *slot, gfn_t gfn,
lpage_info_slot(gfn, slot, level)->disallow_lpage |= KVM_LPAGE_MIXED_FLAG;
}
-bool kvm_arch_pre_set_memory_attributes(struct kvm *kvm,
- struct kvm_gfn_range *range)
+int kvm_arch_pre_set_memory_attributes(struct kvm *kvm,
+ struct kvm_gfn_range *range)
{
struct kvm_memory_slot *slot = range->slot;
int level;
@@ -7859,10 +7859,10 @@ bool kvm_arch_pre_set_memory_attributes(struct kvm *kvm,
* a hugepage can be used for affected ranges.
*/
if (WARN_ON_ONCE(!kvm_arch_supports_gmem(kvm)))
- return false;
+ return 0;
if (WARN_ON_ONCE(range->end <= range->start))
- return false;
+ return 0;
/*
* If the head and tail pages of the range currently allow a hugepage,
@@ -7921,8 +7921,8 @@ static bool hugepage_has_attrs(struct kvm *kvm, struct kvm_memory_slot *slot,
return true;
}
-bool kvm_arch_post_set_memory_attributes(struct kvm *kvm,
- struct kvm_gfn_range *range)
+int kvm_arch_post_set_memory_attributes(struct kvm *kvm,
+ struct kvm_gfn_range *range)
{
unsigned long attrs = range->arg.attributes;
struct kvm_memory_slot *slot = range->slot;
@@ -7938,7 +7938,7 @@ bool kvm_arch_post_set_memory_attributes(struct kvm *kvm,
* SHARED may now allow hugepages.
*/
if (WARN_ON_ONCE(!kvm_arch_supports_gmem(kvm)))
- return false;
+ return 0;
/*
* The sequence matters here: upper levels consume the result of lower
@@ -7985,7 +7985,7 @@ bool kvm_arch_post_set_memory_attributes(struct kvm *kvm,
hugepage_set_mixed(slot, gfn, level);
}
}
- return false;
+ return 0;
}
void kvm_mmu_init_memslot_memory_attributes(struct kvm *kvm,
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 6137b76341e1..d03e4a70a6db 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -271,8 +271,8 @@ struct kvm_gfn_range {
bool lockless;
};
bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range);
-bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
-bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
+int kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
+int kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
int kvm_split_cross_boundary_leafs(struct kvm *kvm, struct kvm_gfn_range *range,
bool shared);
#endif
@@ -1537,7 +1537,7 @@ void *kvm_mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc);
void kvm_mmu_invalidate_begin(struct kvm *kvm);
void kvm_mmu_invalidate_range_add(struct kvm *kvm, gfn_t start, gfn_t end);
void kvm_mmu_invalidate_end(struct kvm *kvm);
-bool kvm_mmu_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range);
+int kvm_mmu_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range);
long kvm_arch_dev_ioctl(struct file *filp,
unsigned int ioctl, unsigned long arg);
@@ -2524,10 +2524,10 @@ static inline unsigned long kvm_get_memory_attributes(struct kvm *kvm, gfn_t gfn
bool kvm_range_has_memory_attributes(struct kvm *kvm, gfn_t start, gfn_t end,
unsigned long mask, unsigned long attrs);
-bool kvm_arch_pre_set_memory_attributes(struct kvm *kvm,
+int kvm_arch_pre_set_memory_attributes(struct kvm *kvm,
+ struct kvm_gfn_range *range);
+int kvm_arch_post_set_memory_attributes(struct kvm *kvm,
struct kvm_gfn_range *range);
-bool kvm_arch_post_set_memory_attributes(struct kvm *kvm,
- struct kvm_gfn_range *range);
/*
* Returns true if the given gfn's private/shared status (in the CoCo sense) is
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index fe86f3f627ba..8f87d6c6be3f 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -508,7 +508,7 @@ static inline struct kvm *mmu_notifier_to_kvm(struct mmu_notifier *mn)
return container_of(mn, struct kvm, mmu_notifier);
}
-typedef bool (*gfn_handler_t)(struct kvm *kvm, struct kvm_gfn_range *range);
+typedef int (*gfn_handler_t)(struct kvm *kvm, struct kvm_gfn_range *range);
typedef void (*on_lock_fn_t)(struct kvm *kvm);
@@ -592,6 +592,7 @@ static __always_inline kvm_mn_ret_t kvm_handle_hva_range(struct kvm *kvm,
kvm_for_each_memslot_in_hva_range(node, slots,
range->start, range->end - 1) {
unsigned long hva_start, hva_end;
+ int ret;
slot = container_of(node, struct kvm_memory_slot, hva_node[slots->node_idx]);
hva_start = max_t(unsigned long, range->start, slot->userspace_addr);
@@ -632,7 +633,9 @@ static __always_inline kvm_mn_ret_t kvm_handle_hva_range(struct kvm *kvm,
goto mmu_unlock;
}
}
- r.ret |= range->handler(kvm, &gfn_range);
+ ret = range->handler(kvm, &gfn_range);
+ WARN_ON_ONCE(ret < 0);
+ r.ret |= ret;
}
}
@@ -718,7 +721,7 @@ void kvm_mmu_invalidate_range_add(struct kvm *kvm, gfn_t start, gfn_t end)
}
}
-bool kvm_mmu_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
+int kvm_mmu_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
{
kvm_mmu_invalidate_range_add(kvm, range->start, range->end);
return kvm_unmap_gfn_range(kvm, range);
@@ -2469,7 +2472,8 @@ static __always_inline void kvm_handle_gfn_range(struct kvm *kvm,
struct kvm_memslots *slots;
struct kvm_memslot_iter iter;
bool found_memslot = false;
- bool ret = false;
+ bool flush = false;
+ int ret = 0;
int i;
gfn_range.arg = range->arg;
@@ -2502,19 +2506,23 @@ static __always_inline void kvm_handle_gfn_range(struct kvm *kvm,
range->on_lock(kvm);
}
- ret |= range->handler(kvm, &gfn_range);
+ ret = range->handler(kvm, &gfn_range);
+ if (ret < 0)
+ goto err;
+ flush |= ret;
}
}
- if (range->flush_on_ret && ret)
+err:
+ if (range->flush_on_ret && flush)
kvm_flush_remote_tlbs(kvm);
if (found_memslot)
KVM_MMU_UNLOCK(kvm);
}
-static bool kvm_pre_set_memory_attributes(struct kvm *kvm,
- struct kvm_gfn_range *range)
+static int kvm_pre_set_memory_attributes(struct kvm *kvm,
+ struct kvm_gfn_range *range)
{
/*
* Unconditionally add the range to the invalidation set, regardless of
--
2.43.2
^ permalink raw reply related [flat|nested] 43+ messages in thread
* [RFC PATCH v2 16/23] KVM: x86: Split cross-boundary mirror leafs for KVM_SET_MEMORY_ATTRIBUTES
2025-08-07 9:39 [RFC PATCH v2 00/23] KVM: TDX huge page support for private memory Yan Zhao
` (14 preceding siblings ...)
2025-08-07 9:44 ` [RFC PATCH v2 15/23] KVM: Change the return type of gfn_handler_t() from bool to int Yan Zhao
@ 2025-08-07 9:44 ` Yan Zhao
2025-08-07 9:45 ` [RFC PATCH v2 17/23] KVM: guest_memfd: Split for punch hole and private-to-shared conversion Yan Zhao
` (6 subsequent siblings)
22 siblings, 0 replies; 43+ messages in thread
From: Yan Zhao @ 2025-08-07 9:44 UTC (permalink / raw)
To: pbonzini, seanjc
Cc: linux-kernel, kvm, x86, rick.p.edgecombe, dave.hansen, kas, tabba,
ackerleytng, quic_eberman, michael.roth, david, vannapurve,
vbabka, thomas.lendacky, pgonda, zhiquan1.li, fan.du, jun.miao,
ira.weiny, isaku.yamahata, xiaoyao.li, binbin.wu, chao.p.peng,
yan.y.zhao
In TDX, private page tables require precise zapping because faulting back
the zapped mappings necessitates the guest's re-acceptance. Therefore,
before performing a zap for the private-to-shared conversion, rather than
zapping a huge leaf entry that crosses the boundary of the GFN range to be
zapped, split the leaf entry to ensure GFNs outside the conversion range
are not affected.
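For example (GFN values purely illustrative):

    /*
     * 2MB private leaf covering GFNs   [0x1000, 0x1200)
     * private-to-shared conversion of  [0x1080, 0x1100)
     *
     * Without splitting, the whole 2MB leaf is zapped and the guest must
     * re-accept [0x1000, 0x1080) and [0x1100, 0x1200).  With splitting,
     * the leaf is split first and only [0x1080, 0x1100) is zapped.
     */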
Invoke kvm_split_cross_boundary_leafs() in
kvm_arch_pre_set_memory_attributes() to split the huge leafs that cross
the GFN range boundary before calling kvm_unmap_gfn_range() to zap the GFN
range that will be converted to shared.
When kvm_split_cross_boundary_leafs() fails, it is expected to internally
invoke kvm_flush_remote_tlbs() to flush any changes that have been
successfully completed.
Unlike kvm_unmap_gfn_range(), which cannot fail,
kvm_split_cross_boundary_leafs() may fail due to memory allocation for
splitting. Update kvm_handle_gfn_range() to propagate the error back to
kvm_vm_set_mem_attributes(), which can then fail the ioctl
KVM_SET_MEMORY_ATTRIBUTES.
The downside of the current implementation is that although
kvm_split_cross_boundary_leafs() is invoked before kvm_unmap_gfn_range()
for each GFN range, the entire conversion range may consist of several GFN
ranges. If an out-of-memory error occurs during the splitting of a GFN
range, some previous GFN ranges may have been successfully split and
zapped, even though their page attributes remain unchanged due to the
splitting failure.
If necessary, a follow-up patch can divide the single invocation of
"kvm_handle_gfn_range(kvm, &pre_set_range)" into two, e.g.,
kvm_handle_gfn_range(kvm, &pre_set_range_prepare_and_split) and
kvm_handle_gfn_range(kvm, &pre_set_range_unmap).
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
---
RFC v2:
- update kvm_split_boundary_leafs() to kvm_split_cross_boundary_leafs() and
invoke it only for private-to-shared conversion.
RFC v1:
- new patch.
---
arch/x86/kvm/mmu/mmu.c | 13 ++++++++++---
virt/kvm/kvm_main.c | 13 +++++++++----
2 files changed, 19 insertions(+), 7 deletions(-)
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index c71f8bb0b903..f23d8fc59323 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -7845,7 +7845,9 @@ int kvm_arch_pre_set_memory_attributes(struct kvm *kvm,
struct kvm_gfn_range *range)
{
struct kvm_memory_slot *slot = range->slot;
+ bool flush = false;
int level;
+ int ret;
/*
* Zap SPTEs even if the slot can't be mapped PRIVATE. KVM x86 only
@@ -7894,12 +7896,17 @@ int kvm_arch_pre_set_memory_attributes(struct kvm *kvm,
}
/* Unmap the old attribute page. */
- if (range->arg.attributes & KVM_MEMORY_ATTRIBUTE_PRIVATE)
+ if (range->arg.attributes & KVM_MEMORY_ATTRIBUTE_PRIVATE) {
range->attr_filter = KVM_FILTER_SHARED;
- else
+ } else {
range->attr_filter = KVM_FILTER_PRIVATE;
+ ret = kvm_split_cross_boundary_leafs(kvm, range, false);
+ if (ret < 0)
+ return ret;
+ flush |= ret;
+ }
- return kvm_unmap_gfn_range(kvm, range);
+ return kvm_unmap_gfn_range(kvm, range) | flush;
}
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 8f87d6c6be3f..9dceecf34822 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2464,8 +2464,8 @@ bool kvm_range_has_memory_attributes(struct kvm *kvm, gfn_t start, gfn_t end,
return true;
}
-static __always_inline void kvm_handle_gfn_range(struct kvm *kvm,
- struct kvm_mmu_notifier_range *range)
+static __always_inline int kvm_handle_gfn_range(struct kvm *kvm,
+ struct kvm_mmu_notifier_range *range)
{
struct kvm_gfn_range gfn_range;
struct kvm_memory_slot *slot;
@@ -2519,6 +2519,8 @@ static __always_inline void kvm_handle_gfn_range(struct kvm *kvm,
if (found_memslot)
KVM_MMU_UNLOCK(kvm);
+
+ return ret < 0 ? ret : 0;
}
static int kvm_pre_set_memory_attributes(struct kvm *kvm,
@@ -2587,7 +2589,9 @@ static int kvm_vm_set_mem_attributes(struct kvm *kvm, gfn_t start, gfn_t end,
cond_resched();
}
- kvm_handle_gfn_range(kvm, &pre_set_range);
+ r = kvm_handle_gfn_range(kvm, &pre_set_range);
+ if (r)
+ goto out_unlock;
for (i = start; i < end; i++) {
r = xa_err(xa_store(&kvm->mem_attr_array, i, entry,
@@ -2596,7 +2600,8 @@ static int kvm_vm_set_mem_attributes(struct kvm *kvm, gfn_t start, gfn_t end,
cond_resched();
}
- kvm_handle_gfn_range(kvm, &post_set_range);
+ r = kvm_handle_gfn_range(kvm, &post_set_range);
+ KVM_BUG_ON(r, kvm);
out_unlock:
mutex_unlock(&kvm->slots_lock);
--
2.43.2
^ permalink raw reply related [flat|nested] 43+ messages in thread
* [RFC PATCH v2 17/23] KVM: guest_memfd: Split for punch hole and private-to-shared conversion
2025-08-07 9:39 [RFC PATCH v2 00/23] KVM: TDX huge page support for private memory Yan Zhao
` (15 preceding siblings ...)
2025-08-07 9:44 ` [RFC PATCH v2 16/23] KVM: x86: Split cross-boundary mirror leafs for KVM_SET_MEMORY_ATTRIBUTES Yan Zhao
@ 2025-08-07 9:45 ` Yan Zhao
2025-08-07 9:45 ` [RFC PATCH v2 18/23] x86/virt/tdx: Do not perform cache flushes unless CLFLUSH_BEFORE_ALLOC is set Yan Zhao
` (5 subsequent siblings)
22 siblings, 0 replies; 43+ messages in thread
From: Yan Zhao @ 2025-08-07 9:45 UTC (permalink / raw)
To: pbonzini, seanjc
Cc: linux-kernel, kvm, x86, rick.p.edgecombe, dave.hansen, kas, tabba,
ackerleytng, quic_eberman, michael.roth, david, vannapurve,
vbabka, thomas.lendacky, pgonda, zhiquan1.li, fan.du, jun.miao,
ira.weiny, isaku.yamahata, xiaoyao.li, binbin.wu, chao.p.peng,
yan.y.zhao
In TDX, private page tables require precise zapping because faulting back
the zapped mappings necessitates the guest's re-acceptance. Therefore,
before performing a zap for hole punching and private-to-shared
conversions, huge leafs that cross the boundary of the zapping GFN range in
the mirror page table must be split.
Splitting may result in an error. If this happens, hole punching and
private-to-shared conversion should bail out early and return an error to
userspace.
Splitting is not necessary for kvm_gmem_release() since the entire page
table is being zapped, nor for kvm_gmem_error_folio() as an SPTE must not
map more than one physical folio.
Therefore, in this patch,
- break kvm_gmem_invalidate_begin_and_zap() into
kvm_gmem_invalidate_begin() and kvm_gmem_zap(), and have
kvm_gmem_release() and kvm_gmem_error_folio() invoke them.
- have kvm_gmem_punch_hole() invoke kvm_gmem_invalidate_begin(),
kvm_gmem_split_private(), and kvm_gmem_zap().
Bail out if kvm_gmem_split_private() returns an error (see the flow
sketch after this list).
- drop the old kvm_gmem_unmap_private() and have the private-to-shared
conversion invoke kvm_gmem_split_private() and kvm_gmem_zap() instead.
Bail out if kvm_gmem_split_private() returns an error.
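The resulting punch-hole flow is then roughly as follows (sketch of the
ordering implemented below, with error handling and the custom-allocator
truncation path trimmed):

    list_for_each_entry(gmem, gmem_list, entry)
            kvm_gmem_invalidate_begin(gmem, start, end);

    list_for_each_entry(gmem, gmem_list, entry) {
            ret = kvm_gmem_split_private(gmem, start, end);
            if (ret)
                    goto out;       /* bail out before any zap/truncate */
    }

    list_for_each_entry(gmem, gmem_list, entry)
            kvm_gmem_zap(gmem, start, end,
                         KVM_FILTER_PRIVATE | KVM_FILTER_SHARED);

    truncate_inode_pages_range(inode->i_mapping, offset, offset + len - 1);
out:
    list_for_each_entry(gmem, gmem_list, entry)
            kvm_gmem_invalidate_end(gmem, start, end);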
Co-developed-by: Ackerley Tng <ackerleytng@google.com>
Signed-off-by: Ackerley Tng <ackerleytng@google.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
---
RFC v2:
- Rebased to [1]. As changes in this patch are gmem specific, they may need
to be updated if the implementation in [1] changes.
- Update kvm_split_boundary_leafs() to kvm_split_cross_boundary_leafs() and
invoke it before kvm_gmem_punch_hole() and private-to-shared conversion.
[1] https://lore.kernel.org/all/cover.1747264138.git.ackerleytng@google.com/
RFC v1:
- new patch.
---
virt/kvm/guest_memfd.c | 142 ++++++++++++++++++++++++-----------------
1 file changed, 84 insertions(+), 58 deletions(-)
diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index 67aa2285aa49..9edf33c482d7 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -318,14 +318,14 @@ static bool kvm_gmem_has_safe_refcount(struct address_space *mapping, pgoff_t st
return refcount_safe;
}
-static void kvm_gmem_unmap_private(struct kvm_gmem *gmem, pgoff_t start,
- pgoff_t end)
+static int kvm_gmem_split_private(struct kvm_gmem *gmem, pgoff_t start, pgoff_t end)
{
struct kvm_memory_slot *slot;
struct kvm *kvm = gmem->kvm;
unsigned long index;
bool locked = false;
bool flush = false;
+ int ret = 0;
xa_for_each_range(&gmem->bindings, index, slot, start, end - 1) {
pgoff_t pgoff = slot->gmem.pgoff;
@@ -335,7 +335,6 @@ static void kvm_gmem_unmap_private(struct kvm_gmem *gmem, pgoff_t start,
.end = slot->base_gfn + min(pgoff + slot->npages, end) - pgoff,
.slot = slot,
.may_block = true,
- /* This function is only concerned with private mappings. */
.attr_filter = KVM_FILTER_PRIVATE,
};
@@ -344,6 +343,47 @@ static void kvm_gmem_unmap_private(struct kvm_gmem *gmem, pgoff_t start,
locked = true;
}
+ ret = kvm_split_cross_boundary_leafs(kvm, &gfn_range, false);
+ if (ret < 0)
+ goto out;
+
+ flush |= ret;
+ ret = 0;
+ }
+out:
+ if (flush)
+ kvm_flush_remote_tlbs(kvm);
+
+ if (locked)
+ KVM_MMU_UNLOCK(kvm);
+
+ return ret;
+}
+
+static void kvm_gmem_zap(struct kvm_gmem *gmem, pgoff_t start, pgoff_t end,
+ enum kvm_gfn_range_filter filter)
+{
+ struct kvm_memory_slot *slot;
+ struct kvm *kvm = gmem->kvm;
+ unsigned long index;
+ bool locked = false;
+ bool flush = false;
+
+ xa_for_each_range(&gmem->bindings, index, slot, start, end - 1) {
+ pgoff_t pgoff = slot->gmem.pgoff;
+ struct kvm_gfn_range gfn_range = {
+ .start = slot->base_gfn + max(pgoff, start) - pgoff,
+ .end = slot->base_gfn + min(pgoff + slot->npages, end) - pgoff,
+ .slot = slot,
+ .may_block = true,
+ .attr_filter = filter,
+ };
+
+ if (!locked) {
+ KVM_MMU_LOCK(kvm);
+ locked = true;
+ }
+
flush |= kvm_mmu_unmap_gfn_range(kvm, &gfn_range);
}
@@ -514,6 +554,8 @@ static int kvm_gmem_convert_should_proceed(struct inode *inode,
struct conversion_work *work,
bool to_shared, pgoff_t *error_index)
{
+ int ret = 0;
+
if (to_shared) {
struct list_head *gmem_list;
struct kvm_gmem *gmem;
@@ -522,19 +564,24 @@ static int kvm_gmem_convert_should_proceed(struct inode *inode,
work_end = work->start + work->nr_pages;
gmem_list = &inode->i_mapping->i_private_list;
+ list_for_each_entry(gmem, gmem_list, entry) {
+ ret = kvm_gmem_split_private(gmem, work->start, work_end);
+ if (ret)
+ return ret;
+ }
list_for_each_entry(gmem, gmem_list, entry)
- kvm_gmem_unmap_private(gmem, work->start, work_end);
+ kvm_gmem_zap(gmem, work->start, work_end, KVM_FILTER_PRIVATE);
} else {
unmap_mapping_pages(inode->i_mapping, work->start,
work->nr_pages, false);
if (!kvm_gmem_has_safe_refcount(inode->i_mapping, work->start,
work->nr_pages, error_index)) {
- return -EAGAIN;
+ ret = -EAGAIN;
}
}
- return 0;
+ return ret;
}
static int kvm_gmem_restructure_folios_in_range(struct inode *inode,
@@ -1187,54 +1234,6 @@ static struct folio *kvm_gmem_get_folio(struct inode *inode, pgoff_t index)
return ERR_PTR(ret);
}
-static void kvm_gmem_invalidate_begin_and_zap(struct kvm_gmem *gmem,
- pgoff_t start, pgoff_t end)
-{
- bool flush = false, found_memslot = false;
- struct kvm_memory_slot *slot;
- struct kvm *kvm = gmem->kvm;
- unsigned long index;
-
- xa_for_each_range(&gmem->bindings, index, slot, start, end - 1) {
- enum kvm_gfn_range_filter filter;
- pgoff_t pgoff = slot->gmem.pgoff;
-
- filter = KVM_FILTER_PRIVATE;
- if (kvm_gmem_memslot_supports_shared(slot)) {
- /*
- * Unmapping would also cause invalidation, but cannot
- * rely on mmu_notifiers to do invalidation via
- * unmapping, since memory may not be mapped to
- * userspace.
- */
- filter |= KVM_FILTER_SHARED;
- }
-
- struct kvm_gfn_range gfn_range = {
- .start = slot->base_gfn + max(pgoff, start) - pgoff,
- .end = slot->base_gfn + min(pgoff + slot->npages, end) - pgoff,
- .slot = slot,
- .may_block = true,
- .attr_filter = filter,
- };
-
- if (!found_memslot) {
- found_memslot = true;
-
- KVM_MMU_LOCK(kvm);
- kvm_mmu_invalidate_begin(kvm);
- }
-
- flush |= kvm_mmu_unmap_gfn_range(kvm, &gfn_range);
- }
-
- if (flush)
- kvm_flush_remote_tlbs(kvm);
-
- if (found_memslot)
- KVM_MMU_UNLOCK(kvm);
-}
-
static void kvm_gmem_invalidate_end(struct kvm_gmem *gmem, pgoff_t start,
pgoff_t end)
{
@@ -1445,9 +1444,28 @@ static long kvm_gmem_punch_hole(struct inode *inode, loff_t offset, loff_t len)
filemap_invalidate_lock(inode->i_mapping);
list_for_each_entry(gmem, gmem_list, entry)
- kvm_gmem_invalidate_begin_and_zap(gmem, start, end);
+ kvm_gmem_invalidate_begin(gmem, start, end);
ret = 0;
+ list_for_each_entry(gmem, gmem_list, entry) {
+ ret = kvm_gmem_split_private(gmem, start, end);
+ if (ret)
+ goto out;
+ }
+ list_for_each_entry(gmem, gmem_list, entry) {
+ enum kvm_gfn_range_filter filter;
+
+ /*
+ * kvm_gmem_invalidate_begin() would have unmapped shared
+ * mappings via mmu notifiers, but only if those mappings were
+ * actually set up. Since guest_memfd cannot assume that shared
+ * mappings were set up, zap both private and shared mappings
+ * here. If shared mappings were zapped, this should not be
+ * expensive.
+ */
+ filter = KVM_FILTER_PRIVATE | KVM_FILTER_SHARED;
+ kvm_gmem_zap(gmem, start, end, filter);
+ }
if (kvm_gmem_has_custom_allocator(inode)) {
ret = kvm_gmem_truncate_inode_range(inode, offset, offset + len);
} else {
@@ -1455,6 +1473,7 @@ static long kvm_gmem_punch_hole(struct inode *inode, loff_t offset, loff_t len)
truncate_inode_pages_range(inode->i_mapping, offset, offset + len - 1);
}
+out:
list_for_each_entry(gmem, gmem_list, entry)
kvm_gmem_invalidate_end(gmem, start, end);
@@ -1576,7 +1595,8 @@ static int kvm_gmem_release(struct inode *inode, struct file *file)
* Zap all SPTEs pointed at by this file. Do not free the backing
* memory, as its lifetime is associated with the inode, not the file.
*/
- kvm_gmem_invalidate_begin_and_zap(gmem, 0, -1ul);
+ kvm_gmem_invalidate_begin(gmem, 0, -1ul);
+ kvm_gmem_zap(gmem, 0, -1ul, KVM_FILTER_PRIVATE | KVM_FILTER_SHARED);
kvm_gmem_invalidate_end(gmem, 0, -1ul);
list_del(&gmem->entry);
@@ -1906,8 +1926,14 @@ static int kvm_gmem_error_folio(struct address_space *mapping, struct folio *fol
start = folio->index;
end = start + folio_nr_pages(folio);
- list_for_each_entry(gmem, gmem_list, entry)
- kvm_gmem_invalidate_begin_and_zap(gmem, start, end);
+ /* The size of the SEPT will not exceed the size of the folio */
+ list_for_each_entry(gmem, gmem_list, entry) {
+ enum kvm_gfn_range_filter filter;
+
+ kvm_gmem_invalidate_begin(gmem, start, end);
+ filter = KVM_FILTER_PRIVATE | KVM_FILTER_SHARED;
+ kvm_gmem_zap(gmem, start, end, filter);
+ }
/*
* Do not truncate the range, what action is taken in response to the
--
2.43.2
^ permalink raw reply related [flat|nested] 43+ messages in thread
* [RFC PATCH v2 18/23] x86/virt/tdx: Do not perform cache flushes unless CLFLUSH_BEFORE_ALLOC is set
2025-08-07 9:39 [RFC PATCH v2 00/23] KVM: TDX huge page support for private memory Yan Zhao
` (16 preceding siblings ...)
2025-08-07 9:45 ` [RFC PATCH v2 17/23] KVM: guest_memfd: Split for punch hole and private-to-shared conversion Yan Zhao
@ 2025-08-07 9:45 ` Yan Zhao
2025-08-11 21:10 ` Sagi Shahar
2025-08-07 9:45 ` [RFC PATCH v2 19/23] KVM: TDX: Pass down pfn to split_external_spt() Yan Zhao
` (4 subsequent siblings)
22 siblings, 1 reply; 43+ messages in thread
From: Yan Zhao @ 2025-08-07 9:45 UTC (permalink / raw)
To: pbonzini, seanjc
Cc: linux-kernel, kvm, x86, rick.p.edgecombe, dave.hansen, kas, tabba,
ackerleytng, quic_eberman, michael.roth, david, vannapurve,
vbabka, thomas.lendacky, pgonda, zhiquan1.li, fan.du, jun.miao,
ira.weiny, isaku.yamahata, xiaoyao.li, binbin.wu, chao.p.peng,
yan.y.zhao
From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
The TDX module enumerates with a TDX_FEATURES0 bit if an explicit cache
flush is necessary when switching KeyID for a page, like before
handing the page over to a TD.
Currently, none of the TDX-capable platforms have this bit enabled.
Moreover, cache flushing with TDH.PHYMEM.PAGE.WBINVD fails if
Dynamic PAMT is active and the target page is not 4k. The SEAMCALL only
supports 4k pages and will fail if there is no PAMT_4K for the HPA.
Avoid performing these cache flushes unless the CLFLUSH_BEFORE_ALLOC bit
of TDX_FEATURES0 is set.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
---
RFC v2:
- Pulled from
git://git.kernel.org/pub/scm/linux/kernel/git/kas/linux.git tdx/dpamt-huge.
- Rebased on top of TDX huge page RFC v2 (Yan)
---
arch/x86/include/asm/tdx.h | 1 +
arch/x86/virt/vmx/tdx/tdx.c | 19 +++++++++++++------
2 files changed, 14 insertions(+), 6 deletions(-)
diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index f1bd74348b34..c058a82d4a97 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -15,6 +15,7 @@
/* Bit definitions of TDX_FEATURES0 metadata field */
#define TDX_FEATURES0_NO_RBP_MOD BIT_ULL(18)
+#define TDX_FEATURES0_CLFLUSH_BEFORE_ALLOC BIT_ULL(23)
#define TDX_FEATURES0_DYNAMIC_PAMT BIT_ULL(36)
#ifndef __ASSEMBLER__
diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index 9ed585bde062..b7a0ee0f4a50 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -1648,14 +1648,13 @@ static inline u64 tdx_tdvpr_pa(struct tdx_vp *td)
return page_to_phys(td->tdvpr_page);
}
-/*
- * The TDX module exposes a CLFLUSH_BEFORE_ALLOC bit to specify whether
- * a CLFLUSH of pages is required before handing them to the TDX module.
- * Be conservative and make the code simpler by doing the CLFLUSH
- * unconditionally.
- */
static void tdx_clflush_page(struct page *page)
{
+ u64 tdx_features0 = tdx_sysinfo.features.tdx_features0;
+
+ if (tdx_features0 & TDX_FEATURES0_CLFLUSH_BEFORE_ALLOC)
+ return;
+
clflush_cache_range(page_to_virt(page), PAGE_SIZE);
}
@@ -2030,8 +2029,12 @@ EXPORT_SYMBOL_GPL(tdh_phymem_cache_wb);
u64 tdh_phymem_page_wbinvd_tdr(struct tdx_td *td)
{
+ u64 tdx_features0 = tdx_sysinfo.features.tdx_features0;
struct tdx_module_args args = {};
+ if (tdx_features0 & TDX_FEATURES0_CLFLUSH_BEFORE_ALLOC)
+ return 0;
+
args.rcx = mk_keyed_paddr(tdx_global_keyid, td->tdr_page);
return seamcall(TDH_PHYMEM_PAGE_WBINVD, &args);
@@ -2041,10 +2044,14 @@ EXPORT_SYMBOL_GPL(tdh_phymem_page_wbinvd_tdr);
u64 tdh_phymem_page_wbinvd_hkid(u64 hkid, struct folio *folio,
unsigned long start_idx, unsigned long npages)
{
+ u64 tdx_features0 = tdx_sysinfo.features.tdx_features0;
struct page *start = folio_page(folio, start_idx);
struct tdx_module_args args = {};
u64 err;
+ if (tdx_features0 & TDX_FEATURES0_CLFLUSH_BEFORE_ALLOC)
+ return 0;
+
if (start_idx + npages > folio_nr_pages(folio))
return TDX_OPERAND_INVALID;
--
2.43.2
^ permalink raw reply related [flat|nested] 43+ messages in thread
* Re: [RFC PATCH v2 18/23] x86/virt/tdx: Do not perform cache flushes unless CLFLUSH_BEFORE_ALLOC is set
2025-08-07 9:45 ` [RFC PATCH v2 18/23] x86/virt/tdx: Do not perform cache flushes unless CLFLUSH_BEFORE_ALLOC is set Yan Zhao
@ 2025-08-11 21:10 ` Sagi Shahar
2025-08-12 6:37 ` Yan Zhao
0 siblings, 1 reply; 43+ messages in thread
From: Sagi Shahar @ 2025-08-11 21:10 UTC (permalink / raw)
To: Yan Zhao
Cc: pbonzini, seanjc, linux-kernel, kvm, x86, rick.p.edgecombe,
dave.hansen, kas, tabba, ackerleytng, quic_eberman, michael.roth,
david, vannapurve, vbabka, thomas.lendacky, pgonda, zhiquan1.li,
fan.du, jun.miao, ira.weiny, isaku.yamahata, xiaoyao.li,
binbin.wu, chao.p.peng
On Thu, Aug 7, 2025 at 4:47 AM Yan Zhao <yan.y.zhao@intel.com> wrote:
>
> From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
>
> The TDX module enumerates with a TDX_FEATURES0 bit if an explicit cache
> flush is necessary when switching KeyID for a page, like before
> handing the page over to a TD.
>
> Currently, none of the TDX-capable platforms have this bit enabled.
>
> Moreover, cache flushing with TDH.PHYMEM.PAGE.WBINVD fails if
> Dynamic PAMT is active and the target page is not 4k. The SEAMCALL only
> supports 4k pages and will fail if there is no PAMT_4K for the HPA.
>
> Avoid performing these cache flushes unless the CLFLUSH_BEFORE_ALLOC bit
> of TDX_FEATURES0 is set.
>
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
> ---
> RFC v2:
> - Pulled from
> git://git.kernel.org/pub/scm/linux/kernel/git/kas/linux.git tdx/dpamt-huge.
> - Rebased on top of TDX huge page RFC v2 (Yan)
> ---
> arch/x86/include/asm/tdx.h | 1 +
> arch/x86/virt/vmx/tdx/tdx.c | 19 +++++++++++++------
> 2 files changed, 14 insertions(+), 6 deletions(-)
>
> diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
> index f1bd74348b34..c058a82d4a97 100644
> --- a/arch/x86/include/asm/tdx.h
> +++ b/arch/x86/include/asm/tdx.h
> @@ -15,6 +15,7 @@
>
> /* Bit definitions of TDX_FEATURES0 metadata field */
> #define TDX_FEATURES0_NO_RBP_MOD BIT_ULL(18)
> +#define TDX_FEATURES0_CLFLUSH_BEFORE_ALLOC BIT_ULL(23)
> #define TDX_FEATURES0_DYNAMIC_PAMT BIT_ULL(36)
>
> #ifndef __ASSEMBLER__
> diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> index 9ed585bde062..b7a0ee0f4a50 100644
> --- a/arch/x86/virt/vmx/tdx/tdx.c
> +++ b/arch/x86/virt/vmx/tdx/tdx.c
> @@ -1648,14 +1648,13 @@ static inline u64 tdx_tdvpr_pa(struct tdx_vp *td)
> return page_to_phys(td->tdvpr_page);
> }
>
> -/*
> - * The TDX module exposes a CLFLUSH_BEFORE_ALLOC bit to specify whether
> - * a CLFLUSH of pages is required before handing them to the TDX module.
> - * Be conservative and make the code simpler by doing the CLFLUSH
> - * unconditionally.
> - */
> static void tdx_clflush_page(struct page *page)
> {
> + u64 tdx_features0 = tdx_sysinfo.features.tdx_features0;
> +
> + if (tdx_features0 & TDX_FEATURES0_CLFLUSH_BEFORE_ALLOC)
> + return;
Isn't the logic here and below reversed? If
TDX_FEATURES0_CLFLUSH_BEFORE_ALLOC bit is set, we want to perform the
clflush()
> +
> clflush_cache_range(page_to_virt(page), PAGE_SIZE);
> }
>
> @@ -2030,8 +2029,12 @@ EXPORT_SYMBOL_GPL(tdh_phymem_cache_wb);
>
> u64 tdh_phymem_page_wbinvd_tdr(struct tdx_td *td)
> {
> + u64 tdx_features0 = tdx_sysinfo.features.tdx_features0;
> struct tdx_module_args args = {};
>
> + if (tdx_features0 & TDX_FEATURES0_CLFLUSH_BEFORE_ALLOC)
> + return 0;
> +
> args.rcx = mk_keyed_paddr(tdx_global_keyid, td->tdr_page);
>
> return seamcall(TDH_PHYMEM_PAGE_WBINVD, &args);
> @@ -2041,10 +2044,14 @@ EXPORT_SYMBOL_GPL(tdh_phymem_page_wbinvd_tdr);
> u64 tdh_phymem_page_wbinvd_hkid(u64 hkid, struct folio *folio,
> unsigned long start_idx, unsigned long npages)
> {
> + u64 tdx_features0 = tdx_sysinfo.features.tdx_features0;
> struct page *start = folio_page(folio, start_idx);
> struct tdx_module_args args = {};
> u64 err;
>
> + if (tdx_features0 & TDX_FEATURES0_CLFLUSH_BEFORE_ALLOC)
> + return 0;
> +
> if (start_idx + npages > folio_nr_pages(folio))
> return TDX_OPERAND_INVALID;
>
> --
> 2.43.2
>
>
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [RFC PATCH v2 18/23] x86/virt/tdx: Do not perform cache flushes unless CLFLUSH_BEFORE_ALLOC is set
2025-08-11 21:10 ` Sagi Shahar
@ 2025-08-12 6:37 ` Yan Zhao
0 siblings, 0 replies; 43+ messages in thread
From: Yan Zhao @ 2025-08-12 6:37 UTC (permalink / raw)
To: Sagi Shahar
Cc: pbonzini, seanjc, linux-kernel, kvm, x86, rick.p.edgecombe,
dave.hansen, kas, tabba, ackerleytng, quic_eberman, michael.roth,
david, vannapurve, vbabka, thomas.lendacky, pgonda, zhiquan1.li,
fan.du, jun.miao, ira.weiny, isaku.yamahata, xiaoyao.li,
binbin.wu, chao.p.peng
On Mon, Aug 11, 2025 at 04:10:41PM -0500, Sagi Shahar wrote:
> On Thu, Aug 7, 2025 at 4:47 AM Yan Zhao <yan.y.zhao@intel.com> wrote:
> >
> > From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> >
> > The TDX module enumerates with a TDX_FEATURES0 bit if an explicit cache
> > flush is necessary when switching KeyID for a page, like before
> > handing the page over to a TD.
> >
> > Currently, none of the TDX-capable platforms have this bit enabled.
> >
> > Moreover, cache flushing with TDH.PHYMEM.PAGE.WBINVD fails if
> > Dynamic PAMT is active and the target page is not 4k. The SEAMCALL only
> > supports 4k pages and will fail if there is no PAMT_4K for the HPA.
I actually couldn't observe this failure on my side with DPAMT + hugepage
(without the shutdown optimization).
> > Avoid performing these cache flushes unless the CLFLUSH_BEFORE_ALLOC bit
> > of TDX_FEATURES0 is set.
> >
> > Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> > Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
> > ---
> > RFC v2:
> > - Pulled from
> > git://git.kernel.org/pub/scm/linux/kernel/git/kas/linux.git tdx/dpamt-huge.
> > - Rebased on top of TDX huge page RFC v2 (Yan)
> > ---
> > arch/x86/include/asm/tdx.h | 1 +
> > arch/x86/virt/vmx/tdx/tdx.c | 19 +++++++++++++------
> > 2 files changed, 14 insertions(+), 6 deletions(-)
> >
> > diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
> > index f1bd74348b34..c058a82d4a97 100644
> > --- a/arch/x86/include/asm/tdx.h
> > +++ b/arch/x86/include/asm/tdx.h
> > @@ -15,6 +15,7 @@
> >
> > /* Bit definitions of TDX_FEATURES0 metadata field */
> > #define TDX_FEATURES0_NO_RBP_MOD BIT_ULL(18)
> > +#define TDX_FEATURES0_CLFLUSH_BEFORE_ALLOC BIT_ULL(23)
> > #define TDX_FEATURES0_DYNAMIC_PAMT BIT_ULL(36)
> >
> > #ifndef __ASSEMBLER__
> > diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> > index 9ed585bde062..b7a0ee0f4a50 100644
> > --- a/arch/x86/virt/vmx/tdx/tdx.c
> > +++ b/arch/x86/virt/vmx/tdx/tdx.c
> > @@ -1648,14 +1648,13 @@ static inline u64 tdx_tdvpr_pa(struct tdx_vp *td)
> > return page_to_phys(td->tdvpr_page);
> > }
> >
> > -/*
> > - * The TDX module exposes a CLFLUSH_BEFORE_ALLOC bit to specify whether
> > - * a CLFLUSH of pages is required before handing them to the TDX module.
> > - * Be conservative and make the code simpler by doing the CLFLUSH
> > - * unconditionally.
> > - */
> > static void tdx_clflush_page(struct page *page)
> > {
> > + u64 tdx_features0 = tdx_sysinfo.features.tdx_features0;
> > +
> > + if (tdx_features0 & TDX_FEATURES0_CLFLUSH_BEFORE_ALLOC)
> > + return;
>
> Isn't the logic here and below reversed? If
> TDX_FEATURES0_CLFLUSH_BEFORE_ALLOC bit is set, we want to perform the
> clflush()
Yes, I think so.
As my test machine has boot_cpu_has_bug(X86_BUG_TDX_PW_MCE) returning true, I
thought it was right to perform clflush() and overlooked this logical error.
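For reference, a minimal sketch of what the corrected helper could look like
(the two wbinvd helpers below would need the same inverted check):

static void tdx_clflush_page(struct page *page)
{
	u64 tdx_features0 = tdx_sysinfo.features.tdx_features0;

	/* Flush only when the TDX module requires a CLFLUSH before page alloc. */
	if (!(tdx_features0 & TDX_FEATURES0_CLFLUSH_BEFORE_ALLOC))
		return;

	clflush_cache_range(page_to_virt(page), PAGE_SIZE);
}
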
> > clflush_cache_range(page_to_virt(page), PAGE_SIZE);
> > }
> >
> > @@ -2030,8 +2029,12 @@ EXPORT_SYMBOL_GPL(tdh_phymem_cache_wb);
> >
> > u64 tdh_phymem_page_wbinvd_tdr(struct tdx_td *td)
> > {
> > + u64 tdx_features0 = tdx_sysinfo.features.tdx_features0;
> > struct tdx_module_args args = {};
> >
> > + if (tdx_features0 & TDX_FEATURES0_CLFLUSH_BEFORE_ALLOC)
> > + return 0;
> > +
> > args.rcx = mk_keyed_paddr(tdx_global_keyid, td->tdr_page);
> >
> > return seamcall(TDH_PHYMEM_PAGE_WBINVD, &args);
> > @@ -2041,10 +2044,14 @@ EXPORT_SYMBOL_GPL(tdh_phymem_page_wbinvd_tdr);
> > u64 tdh_phymem_page_wbinvd_hkid(u64 hkid, struct folio *folio,
> > unsigned long start_idx, unsigned long npages)
> > {
> > + u64 tdx_features0 = tdx_sysinfo.features.tdx_features0;
> > struct page *start = folio_page(folio, start_idx);
> > struct tdx_module_args args = {};
> > u64 err;
> >
> > + if (tdx_features0 & TDX_FEATURES0_CLFLUSH_BEFORE_ALLOC)
> > + return 0;
> > +
> > if (start_idx + npages > folio_nr_pages(folio))
> > return TDX_OPERAND_INVALID;
> >
> > --
> > 2.43.2
> >
> >
>
^ permalink raw reply [flat|nested] 43+ messages in thread
* [RFC PATCH v2 19/23] KVM: TDX: Pass down pfn to split_external_spt()
2025-08-07 9:39 [RFC PATCH v2 00/23] KVM: TDX huge page support for private memory Yan Zhao
` (17 preceding siblings ...)
2025-08-07 9:45 ` [RFC PATCH v2 18/23] x86/virt/tdx: Do not perform cache flushes unless CLFLUSH_BEFORE_ALLOC is set Yan Zhao
@ 2025-08-07 9:45 ` Yan Zhao
2025-08-07 9:45 ` [RFC PATCH v2 20/23] KVM: TDX: Handle Dynamic PAMT in tdh_mem_page_demote() Yan Zhao
` (3 subsequent siblings)
22 siblings, 0 replies; 43+ messages in thread
From: Yan Zhao @ 2025-08-07 9:45 UTC (permalink / raw)
To: pbonzini, seanjc
Cc: linux-kernel, kvm, x86, rick.p.edgecombe, dave.hansen, kas, tabba,
ackerleytng, quic_eberman, michael.roth, david, vannapurve,
vbabka, thomas.lendacky, pgonda, zhiquan1.li, fan.du, jun.miao,
ira.weiny, isaku.yamahata, xiaoyao.li, binbin.wu, chao.p.peng,
yan.y.zhao
From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Pass down pfn to kvm_x86_ops::split_external_spt(). It is required for
handling Dynamic PAMT in tdx_sept_split_private_spt().
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
---
RFC v2:
- Pulled from
git://git.kernel.org/pub/scm/linux/kernel/git/kas/linux.git tdx/dpamt-huge.
- Rebased on top of TDX huge page RFC v2 (Yan)
---
arch/x86/include/asm/kvm_host.h | 3 ++-
arch/x86/kvm/mmu/tdp_mmu.c | 6 +++++-
arch/x86/kvm/vmx/tdx.c | 3 ++-
3 files changed, 9 insertions(+), 3 deletions(-)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 6cb5b422dd1d..6b6c46c27390 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1841,7 +1841,8 @@ struct kvm_x86_ops {
/* Split the external page table into smaller page tables */
int (*split_external_spt)(struct kvm *kvm, gfn_t gfn, enum pg_level level,
- void *external_spt, bool mmu_lock_shared);
+ kvm_pfn_t pfn_for_gfn, void *external_spt,
+ bool mmu_lock_shared);
bool (*has_wbinvd_exit)(void);
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 62a09a9655c3..eb758aaa4374 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -389,11 +389,15 @@ static int split_external_spt(struct kvm *kvm, gfn_t gfn, u64 old_spte,
u64 new_spte, int level, bool shared)
{
void *external_spt = get_external_spt(gfn, new_spte, level);
+ kvm_pfn_t pfn_for_gfn = spte_to_pfn(old_spte);
int ret;
KVM_BUG_ON(!external_spt, kvm);
- ret = kvm_x86_call(split_external_spt)(kvm, gfn, level, external_spt, shared);
+ ret = kvm_x86_call(split_external_spt)(kvm, gfn, level,
+ pfn_for_gfn, external_spt,
+ shared);
+
return ret;
}
/**
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 71115058e5e6..24aa9aaad6d8 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1941,7 +1941,8 @@ static int tdx_spte_demote_private_spte(struct kvm *kvm, gfn_t gfn,
}
static int tdx_sept_split_private_spt(struct kvm *kvm, gfn_t gfn, enum pg_level level,
- void *private_spt, bool mmu_lock_shared)
+ kvm_pfn_t pfn_for_gfn, void *private_spt,
+ bool mmu_lock_shared)
{
struct page *page = virt_to_page(private_spt);
int ret;
--
2.43.2
^ permalink raw reply related [flat|nested] 43+ messages in thread
* [RFC PATCH v2 20/23] KVM: TDX: Handle Dynamic PAMT in tdh_mem_page_demote()
2025-08-07 9:39 [RFC PATCH v2 00/23] KVM: TDX huge page support for private memory Yan Zhao
` (18 preceding siblings ...)
2025-08-07 9:45 ` [RFC PATCH v2 19/23] KVM: TDX: Pass down pfn to split_external_spt() Yan Zhao
@ 2025-08-07 9:45 ` Yan Zhao
2025-08-07 9:46 ` [RFC PATCH v2 21/23] KVM: TDX: Preallocate PAMT pages to be used in split path Yan Zhao
` (2 subsequent siblings)
22 siblings, 0 replies; 43+ messages in thread
From: Yan Zhao @ 2025-08-07 9:45 UTC (permalink / raw)
To: pbonzini, seanjc
Cc: linux-kernel, kvm, x86, rick.p.edgecombe, dave.hansen, kas, tabba,
ackerleytng, quic_eberman, michael.roth, david, vannapurve,
vbabka, thomas.lendacky, pgonda, zhiquan1.li, fan.du, jun.miao,
ira.weiny, isaku.yamahata, xiaoyao.li, binbin.wu, chao.p.peng,
yan.y.zhao
From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
If Dynamic PAMT is enabled, TDH.MEM.PAGE.DEMOTE will take the PAMT page
pair in registers R12 and R13.
Pass the pamt_pages list down to tdh_mem_page_demote() and populate
registers R12 and R13 from it.
Instead of using seamcall_ret(), use seamcall_saved_ret() as it can
handle registers above R11.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
---
RFC v2:
- Pulled from
git://git.kernel.org/pub/scm/linux/kernel/git/kas/linux.git tdx/dpamt-huge.
- Rebased on top of TDX huge page RFC v2 (Yan).
---
arch/x86/include/asm/tdx.h | 1 +
arch/x86/kvm/vmx/tdx.c | 4 ++--
arch/x86/virt/vmx/tdx/tdx.c | 13 +++++++++++--
3 files changed, 14 insertions(+), 4 deletions(-)
diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index c058a82d4a97..2e529f0c578a 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -180,6 +180,7 @@ u64 tdh_mng_create(struct tdx_td *td, u16 hkid);
u64 tdh_vp_create(struct tdx_td *td, struct tdx_vp *vp);
u64 tdh_mng_rd(struct tdx_td *td, u64 field, u64 *data);
u64 tdh_mem_page_demote(struct tdx_td *td, u64 gpa, int level, struct page *page,
+ struct list_head *pamt_pages,
u64 *ext_err1, u64 *ext_err2);
u64 tdh_mr_extend(struct tdx_td *td, u64 gpa, u64 *ext_err1, u64 *ext_err2);
u64 tdh_mr_finalize(struct tdx_td *td);
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 24aa9aaad6d8..9d24a1a86a23 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1924,12 +1924,12 @@ static int tdx_spte_demote_private_spte(struct kvm *kvm, gfn_t gfn,
u64 err, entry, level_state;
err = tdh_mem_page_demote(&kvm_tdx->td, gpa, tdx_level, page,
- &entry, &level_state);
+ NULL, &entry, &level_state);
if (unlikely(tdx_operand_busy(err))) {
tdx_no_vcpus_enter_start(kvm);
err = tdh_mem_page_demote(&kvm_tdx->td, gpa, tdx_level, page,
- &entry, &level_state);
+ NULL, &entry, &level_state);
tdx_no_vcpus_enter_stop(kvm);
}
diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index b7a0ee0f4a50..50f9d49f1c91 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -1825,6 +1825,7 @@ u64 tdh_mng_rd(struct tdx_td *td, u64 field, u64 *data)
EXPORT_SYMBOL_GPL(tdh_mng_rd);
u64 tdh_mem_page_demote(struct tdx_td *td, u64 gpa, int level, struct page *page,
+ struct list_head *pamt_pages,
u64 *ext_err1, u64 *ext_err2)
{
struct tdx_module_args args = {
@@ -1832,10 +1833,18 @@ u64 tdh_mem_page_demote(struct tdx_td *td, u64 gpa, int level, struct page *page
.rdx = tdx_tdr_pa(td),
.r8 = page_to_phys(page),
};
- u64 ret;
+ struct page *pamt_page;
+ u64 *p, ret;
+ if (level == TDX_PS_2M) {
+ p = &args.r12;
+ list_for_each_entry(pamt_page, pamt_pages, lru) {
+ *p = page_to_phys(pamt_page);
+ p++;
+ }
+ }
tdx_clflush_page(page);
- ret = seamcall_ret(TDH_MEM_PAGE_DEMOTE, &args);
+ ret = seamcall_saved_ret(TDH_MEM_PAGE_DEMOTE, &args);
*ext_err1 = args.rcx;
*ext_err2 = args.rdx;
--
2.43.2
^ permalink raw reply related [flat|nested] 43+ messages in thread
* [RFC PATCH v2 21/23] KVM: TDX: Preallocate PAMT pages to be used in split path
2025-08-07 9:39 [RFC PATCH v2 00/23] KVM: TDX huge page support for private memory Yan Zhao
` (19 preceding siblings ...)
2025-08-07 9:45 ` [RFC PATCH v2 20/23] KVM: TDX: Handle Dynamic PAMT in tdh_mem_page_demote() Yan Zhao
@ 2025-08-07 9:46 ` Yan Zhao
2025-08-07 9:46 ` [RFC PATCH v2 22/23] KVM: TDX: Handle Dynamic PAMT on page split Yan Zhao
2025-08-07 9:46 ` [RFC PATCH v2 23/23] KVM: TDX: Turn on PG_LEVEL_2M after TD is RUNNABLE Yan Zhao
22 siblings, 0 replies; 43+ messages in thread
From: Yan Zhao @ 2025-08-07 9:46 UTC (permalink / raw)
To: pbonzini, seanjc
Cc: linux-kernel, kvm, x86, rick.p.edgecombe, dave.hansen, kas, tabba,
ackerleytng, quic_eberman, michael.roth, david, vannapurve,
vbabka, thomas.lendacky, pgonda, zhiquan1.li, fan.du, jun.miao,
ira.weiny, isaku.yamahata, xiaoyao.li, binbin.wu, chao.p.peng,
yan.y.zhao
From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Preallocate PAMT pages to be used in the split_external_spt() path.
The kernel needs one PAMT page pair for the external_spt page and one that
is provided directly to the TDH.MEM.PAGE.DEMOTE SEAMCALL.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Co-developed-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
---
RFC v2:
- Pulled from
git://git.kernel.org/pub/scm/linux/kernel/git/kas/linux.git tdx/dpamt-huge.
- Implemented the flow of topup pamt_page_cache in
tdp_mmu_split_huge_pages_root() (Yan)
---
arch/x86/include/asm/kvm_host.h | 2 ++
arch/x86/kvm/mmu/mmu.c | 1 +
arch/x86/kvm/mmu/tdp_mmu.c | 51 +++++++++++++++++++++++++++++++++
3 files changed, 54 insertions(+)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 6b6c46c27390..508b133df903 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1591,6 +1591,8 @@ struct kvm_arch {
#define SPLIT_DESC_CACHE_MIN_NR_OBJECTS (SPTE_ENT_PER_PAGE + 1)
struct kvm_mmu_memory_cache split_desc_cache;
+ struct kvm_mmu_memory_cache pamt_page_cache;
+
gfn_t gfn_direct_bits;
/*
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index f23d8fc59323..e581cee37f64 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -6848,6 +6848,7 @@ static void mmu_free_vm_memory_caches(struct kvm *kvm)
kvm_mmu_free_memory_cache(&kvm->arch.split_desc_cache);
kvm_mmu_free_memory_cache(&kvm->arch.split_page_header_cache);
kvm_mmu_free_memory_cache(&kvm->arch.split_shadow_page_cache);
+ kvm_mmu_free_memory_cache(&kvm->arch.pamt_page_cache);
}
void kvm_mmu_uninit_vm(struct kvm *kvm)
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index eb758aaa4374..064c4e823658 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1584,6 +1584,27 @@ static bool iter_cross_boundary(struct tdp_iter *iter, gfn_t start, gfn_t end)
(iter->gfn + KVM_PAGES_PER_HPAGE(iter->level)) <= end);
}
+static bool need_topup_mirror_caches(struct kvm *kvm)
+{
+ int nr = tdx_nr_pamt_pages() * 2;
+
+ return kvm_mmu_memory_cache_nr_free_objects(&kvm->arch.pamt_page_cache) < nr;
+}
+
+static int topup_mirror_caches(struct kvm *kvm)
+{
+ int r, nr;
+
+ /* One for external_spt, one for TDH.MEM.PAGE.DEMOTE */
+ nr = tdx_nr_pamt_pages() * 2;
+
+ r = kvm_mmu_topup_memory_cache(&kvm->arch.pamt_page_cache, nr);
+ if (r)
+ return r;
+
+ return 0;
+}
+
static int tdp_mmu_split_huge_pages_root(struct kvm *kvm,
struct kvm_mmu_page *root,
gfn_t start, gfn_t end,
@@ -1656,6 +1677,36 @@ static int tdp_mmu_split_huge_pages_root(struct kvm *kvm,
continue;
}
+ if (is_mirror_sp(root) && need_topup_mirror_caches(kvm)) {
+ int r;
+
+ rcu_read_unlock();
+
+ if (shared)
+ read_unlock(&kvm->mmu_lock);
+ else
+ write_unlock(&kvm->mmu_lock);
+
+ r = topup_mirror_caches(kvm);
+
+ if (shared)
+ read_lock(&kvm->mmu_lock);
+ else
+ write_lock(&kvm->mmu_lock);
+
+ if (r) {
+ trace_kvm_mmu_split_huge_page(iter.gfn,
+ iter.old_spte,
+ iter.level, r);
+ return r;
+ }
+
+ rcu_read_lock();
+
+ iter.yielded = true;
+ continue;
+ }
+
tdp_mmu_init_child_sp(sp, &iter);
if (tdp_mmu_split_huge_page(kvm, &iter, sp, shared))
--
2.43.2
^ permalink raw reply related [flat|nested] 43+ messages in thread
* [RFC PATCH v2 22/23] KVM: TDX: Handle Dynamic PAMT on page split
2025-08-07 9:39 [RFC PATCH v2 00/23] KVM: TDX huge page support for private memory Yan Zhao
` (20 preceding siblings ...)
2025-08-07 9:46 ` [RFC PATCH v2 21/23] KVM: TDX: Preallocate PAMT pages to be used in split path Yan Zhao
@ 2025-08-07 9:46 ` Yan Zhao
2025-08-14 5:31 ` Vishal Annapurve
2025-08-07 9:46 ` [RFC PATCH v2 23/23] KVM: TDX: Turn on PG_LEVEL_2M after TD is RUNNABLE Yan Zhao
22 siblings, 1 reply; 43+ messages in thread
From: Yan Zhao @ 2025-08-07 9:46 UTC (permalink / raw)
To: pbonzini, seanjc
Cc: linux-kernel, kvm, x86, rick.p.edgecombe, dave.hansen, kas, tabba,
ackerleytng, quic_eberman, michael.roth, david, vannapurve,
vbabka, thomas.lendacky, pgonda, zhiquan1.li, fan.du, jun.miao,
ira.weiny, isaku.yamahata, xiaoyao.li, binbin.wu, chao.p.peng,
yan.y.zhao
From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Demoting a page from 2M to 4k requires an additional PAMT page pair to
cover the 2M range that is now mapped with 4k pages.
The EPT page also has to be covered in PAMT_4K.
Allocate both from the pre-allocated split PAMT pool.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
---
RFC v2:
- Pulled from
git://git.kernel.org/pub/scm/linux/kernel/git/kas/linux.git tdx/dpamt-huge.
- Rebased on top of TDX huge page RFC v2 (Yan).
---
arch/x86/include/asm/tdx.h | 4 ++++
arch/x86/kvm/vmx/tdx.c | 28 ++++++++++++++++++++++++----
arch/x86/virt/vmx/tdx/tdx.c | 11 +++++++----
3 files changed, 35 insertions(+), 8 deletions(-)
diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index 2e529f0c578a..da317981e95a 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -123,6 +123,10 @@ u32 tdx_get_nr_guest_keyids(void);
void tdx_guest_keyid_free(unsigned int keyid);
int tdx_nr_pamt_pages(void);
+atomic_t *tdx_get_pamt_refcount(unsigned long hpa);
+int tdx_alloc_pamt_pages(struct list_head *pamt_pages,
+ struct page *(alloc)(void *data), void *data);
+void tdx_free_pamt_pages(struct list_head *pamt_pages);
int tdx_pamt_get(struct page *page, enum pg_level level,
struct page *(alloc)(void *data), void *data);
void tdx_pamt_put(struct page *page, enum pg_level level);
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 9d24a1a86a23..6e061d659639 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1915,28 +1915,48 @@ static int tdx_sept_free_private_spt(struct kvm *kvm, gfn_t gfn,
return 0;
}
+static struct page *tdx_alloc_pamt_page_split(void *data)
+{
+ struct kvm *kvm = data;
+ void *p;
+
+ p = kvm_mmu_memory_cache_alloc(&kvm->arch.pamt_page_cache);
+ return virt_to_page(p);
+}
+
static int tdx_spte_demote_private_spte(struct kvm *kvm, gfn_t gfn,
- enum pg_level level, struct page *page)
+ enum pg_level level, struct page *page,
+ kvm_pfn_t pfn_for_gfn)
{
int tdx_level = pg_level_to_tdx_sept_level(level);
+ hpa_t hpa = pfn_to_hpa(pfn_for_gfn);
struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
gpa_t gpa = gfn_to_gpa(gfn);
u64 err, entry, level_state;
+ LIST_HEAD(pamt_pages);
+
+ tdx_pamt_get(page, PG_LEVEL_4K, tdx_alloc_pamt_page_split, kvm);
+ tdx_alloc_pamt_pages(&pamt_pages, tdx_alloc_pamt_page_split, kvm);
err = tdh_mem_page_demote(&kvm_tdx->td, gpa, tdx_level, page,
- NULL, &entry, &level_state);
+ &pamt_pages, &entry, &level_state);
if (unlikely(tdx_operand_busy(err))) {
tdx_no_vcpus_enter_start(kvm);
err = tdh_mem_page_demote(&kvm_tdx->td, gpa, tdx_level, page,
- NULL, &entry, &level_state);
+ &pamt_pages, &entry, &level_state);
tdx_no_vcpus_enter_stop(kvm);
}
if (KVM_BUG_ON(err, kvm)) {
+ tdx_free_pamt_pages(&pamt_pages);
+ tdx_pamt_put(page, PG_LEVEL_4K);
pr_tdx_error_2(TDH_MEM_PAGE_DEMOTE, err, entry, level_state);
return -EIO;
}
+
+ if (tdx_supports_dynamic_pamt(tdx_sysinfo))
+ atomic_set(tdx_get_pamt_refcount(hpa), PTRS_PER_PMD);
return 0;
}
@@ -1963,7 +1983,7 @@ static int tdx_sept_split_private_spt(struct kvm *kvm, gfn_t gfn, enum pg_level
tdx_track(kvm);
- return tdx_spte_demote_private_spte(kvm, gfn, level, page);
+ return tdx_spte_demote_private_spte(kvm, gfn, level, page, pfn_for_gfn);
}
static int tdx_sept_remove_private_spte(struct kvm *kvm, gfn_t gfn,
diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index 50f9d49f1c91..dbbddd00ec60 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -188,10 +188,11 @@ int tdx_cpu_enable(void)
}
EXPORT_SYMBOL_GPL(tdx_cpu_enable);
-static atomic_t *tdx_get_pamt_refcount(unsigned long hpa)
+atomic_t *tdx_get_pamt_refcount(unsigned long hpa)
{
return &pamt_refcounts[hpa / PMD_SIZE];
}
+EXPORT_SYMBOL_GPL(tdx_get_pamt_refcount);
static int pamt_refcount_populate(pte_t *pte, unsigned long addr, void *data)
{
@@ -2151,7 +2152,7 @@ static u64 tdh_phymem_pamt_remove(unsigned long hpa,
static DEFINE_SPINLOCK(pamt_lock);
-static void tdx_free_pamt_pages(struct list_head *pamt_pages)
+void tdx_free_pamt_pages(struct list_head *pamt_pages)
{
struct page *page;
@@ -2160,9 +2161,10 @@ static void tdx_free_pamt_pages(struct list_head *pamt_pages)
__free_page(page);
}
}
+EXPORT_SYMBOL_GPL(tdx_free_pamt_pages);
-static int tdx_alloc_pamt_pages(struct list_head *pamt_pages,
- struct page *(alloc)(void *data), void *data)
+int tdx_alloc_pamt_pages(struct list_head *pamt_pages,
+ struct page *(alloc)(void *data), void *data)
{
for (int i = 0; i < tdx_nr_pamt_pages(); i++) {
struct page *page;
@@ -2180,6 +2182,7 @@ static int tdx_alloc_pamt_pages(struct list_head *pamt_pages,
tdx_free_pamt_pages(pamt_pages);
return -ENOMEM;
}
+EXPORT_SYMBOL_GPL(tdx_alloc_pamt_pages);
static int tdx_pamt_add(atomic_t *pamt_refcount, unsigned long hpa,
struct list_head *pamt_pages)
--
2.43.2
^ permalink raw reply related [flat|nested] 43+ messages in thread
* Re: [RFC PATCH v2 22/23] KVM: TDX: Handle Dynamic PAMT on page split
2025-08-07 9:46 ` [RFC PATCH v2 22/23] KVM: TDX: Handle Dynamic PAMT on page split Yan Zhao
@ 2025-08-14 5:31 ` Vishal Annapurve
2025-08-14 18:29 ` Vishal Annapurve
2025-08-18 4:19 ` Yan Zhao
0 siblings, 2 replies; 43+ messages in thread
From: Vishal Annapurve @ 2025-08-14 5:31 UTC (permalink / raw)
To: Yan Zhao
Cc: pbonzini, seanjc, linux-kernel, kvm, x86, rick.p.edgecombe,
dave.hansen, kas, tabba, ackerleytng, quic_eberman, michael.roth,
david, vbabka, thomas.lendacky, pgonda, zhiquan1.li, fan.du,
jun.miao, ira.weiny, isaku.yamahata, xiaoyao.li, binbin.wu,
chao.p.peng
On Thu, Aug 7, 2025 at 2:46 AM Yan Zhao <yan.y.zhao@intel.com> wrote:
>
> From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> +static struct page *tdx_alloc_pamt_page_split(void *data)
> +{
> + struct kvm *kvm = data;
> + void *p;
> +
> + p = kvm_mmu_memory_cache_alloc(&kvm->arch.pamt_page_cache);
> + return virt_to_page(p);
> +}
> +
> static int tdx_spte_demote_private_spte(struct kvm *kvm, gfn_t gfn,
> - enum pg_level level, struct page *page)
> + enum pg_level level, struct page *page,
> + kvm_pfn_t pfn_for_gfn)
> {
> int tdx_level = pg_level_to_tdx_sept_level(level);
> + hpa_t hpa = pfn_to_hpa(pfn_for_gfn);
> struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
> gpa_t gpa = gfn_to_gpa(gfn);
> u64 err, entry, level_state;
> + LIST_HEAD(pamt_pages);
> +
> + tdx_pamt_get(page, PG_LEVEL_4K, tdx_alloc_pamt_page_split, kvm);
This invocation needs a return value check.
> + tdx_alloc_pamt_pages(&pamt_pages, tdx_alloc_pamt_page_split, kvm);
IIUC tdx_pamt_get() will result in pamt_pages allocation above, so
this step is not needed.
>
> err = tdh_mem_page_demote(&kvm_tdx->td, gpa, tdx_level, page,
> - NULL, &entry, &level_state);
> + &pamt_pages, &entry, &level_state);
>
> if (unlikely(tdx_operand_busy(err))) {
> tdx_no_vcpus_enter_start(kvm);
> err = tdh_mem_page_demote(&kvm_tdx->td, gpa, tdx_level, page,
> - NULL, &entry, &level_state);
> + &pamt_pages, &entry, &level_state);
> tdx_no_vcpus_enter_stop(kvm);
> }
>
> if (KVM_BUG_ON(err, kvm)) {
> + tdx_free_pamt_pages(&pamt_pages);
If tdx_alloc_pamt_pages() is not needed then this can be dropped as well.
> + tdx_pamt_put(page, PG_LEVEL_4K);
> pr_tdx_error_2(TDH_MEM_PAGE_DEMOTE, err, entry, level_state);
> return -EIO;
> }
> +
> + if (tdx_supports_dynamic_pamt(tdx_sysinfo))
> + atomic_set(tdx_get_pamt_refcount(hpa), PTRS_PER_PMD);
Should this be
atomic_set(tdx_get_pamt_refcount(hpa), PTRS_PER_PMD -1 );
as tdx_pamt_get would have increased the refcount by 1 already above?
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [RFC PATCH v2 22/23] KVM: TDX: Handle Dynamic PAMT on page split
2025-08-14 5:31 ` Vishal Annapurve
@ 2025-08-14 18:29 ` Vishal Annapurve
2025-08-18 4:19 ` Yan Zhao
1 sibling, 0 replies; 43+ messages in thread
From: Vishal Annapurve @ 2025-08-14 18:29 UTC (permalink / raw)
To: Yan Zhao
Cc: pbonzini, seanjc, linux-kernel, kvm, x86, rick.p.edgecombe,
dave.hansen, kas, tabba, ackerleytng, quic_eberman, michael.roth,
david, vbabka, thomas.lendacky, pgonda, zhiquan1.li, fan.du,
jun.miao, ira.weiny, isaku.yamahata, xiaoyao.li, binbin.wu,
chao.p.peng
On Wed, Aug 13, 2025 at 10:31 PM Vishal Annapurve <vannapurve@google.com> wrote:
>
> On Thu, Aug 7, 2025 at 2:46 AM Yan Zhao <yan.y.zhao@intel.com> wrote:
> >
> > From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> > +static struct page *tdx_alloc_pamt_page_split(void *data)
> > +{
> > + struct kvm *kvm = data;
> > + void *p;
> > +
> > + p = kvm_mmu_memory_cache_alloc(&kvm->arch.pamt_page_cache);
> > + return virt_to_page(p);
> > +}
> > +
> > static int tdx_spte_demote_private_spte(struct kvm *kvm, gfn_t gfn,
> > - enum pg_level level, struct page *page)
> > + enum pg_level level, struct page *page,
> > + kvm_pfn_t pfn_for_gfn)
> > {
> > int tdx_level = pg_level_to_tdx_sept_level(level);
> > + hpa_t hpa = pfn_to_hpa(pfn_for_gfn);
> > struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
> > gpa_t gpa = gfn_to_gpa(gfn);
> > u64 err, entry, level_state;
> > + LIST_HEAD(pamt_pages);
> > +
> > + tdx_pamt_get(page, PG_LEVEL_4K, tdx_alloc_pamt_page_split, kvm);
>
> This invocation needs a return value check.
>
> > + tdx_alloc_pamt_pages(&pamt_pages, tdx_alloc_pamt_page_split, kvm);
>
> IIUC tdx_pamt_get() will result in pamt_pages allocation above, so
> this step is not needed.
I missed that one allocation is to cover the EPT page and another is
for the HPA ranges backing the GPA mappings. So ignore the rest of my
comments except the one about the missing error handling for
tdx_pamt_get() and tdx_alloc_pamt_pages() in this patch.
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [RFC PATCH v2 22/23] KVM: TDX: Handle Dynamic PAMT on page split
2025-08-14 5:31 ` Vishal Annapurve
2025-08-14 18:29 ` Vishal Annapurve
@ 2025-08-18 4:19 ` Yan Zhao
1 sibling, 0 replies; 43+ messages in thread
From: Yan Zhao @ 2025-08-18 4:19 UTC (permalink / raw)
To: Vishal Annapurve
Cc: pbonzini, seanjc, linux-kernel, kvm, x86, rick.p.edgecombe,
dave.hansen, kas, tabba, ackerleytng, quic_eberman, michael.roth,
david, vbabka, thomas.lendacky, pgonda, fan.du, jun.miao,
ira.weiny, isaku.yamahata, xiaoyao.li, binbin.wu, chao.p.peng
On Wed, Aug 13, 2025 at 10:31:27PM -0700, Vishal Annapurve wrote:
> On Thu, Aug 7, 2025 at 2:46 AM Yan Zhao <yan.y.zhao@intel.com> wrote:
> >
> > From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> > +static struct page *tdx_alloc_pamt_page_split(void *data)
> > +{
> > + struct kvm *kvm = data;
> > + void *p;
> > +
> > + p = kvm_mmu_memory_cache_alloc(&kvm->arch.pamt_page_cache);
> > + return virt_to_page(p);
> > +}
> > +
> > static int tdx_spte_demote_private_spte(struct kvm *kvm, gfn_t gfn,
> > - enum pg_level level, struct page *page)
> > + enum pg_level level, struct page *page,
> > + kvm_pfn_t pfn_for_gfn)
> > {
> > int tdx_level = pg_level_to_tdx_sept_level(level);
> > + hpa_t hpa = pfn_to_hpa(pfn_for_gfn);
> > struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
> > gpa_t gpa = gfn_to_gpa(gfn);
> > u64 err, entry, level_state;
> > + LIST_HEAD(pamt_pages);
> > +
> > + tdx_pamt_get(page, PG_LEVEL_4K, tdx_alloc_pamt_page_split, kvm);
>
> This invocation needs a return value check.
Ack.
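E.g., a minimal sketch of the check (assuming the error is simply propagated;
exact handling may change in the next version):

	int ret;

	ret = tdx_pamt_get(page, PG_LEVEL_4K, tdx_alloc_pamt_page_split, kvm);
	if (ret)
		return ret;
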
> > + tdx_alloc_pamt_pages(&pamt_pages, tdx_alloc_pamt_page_split, kvm);
>
> IIUC tdx_pamt_get() will result in pamt_pages allocation above, so
> this step is not needed.
This step is to allocate pamt_pages for the guest 2MB page that needs splitting.
The above tdx_pamt_get() is for the EPT page to be added.
I'll add comments or update the param names for better clarity.
Regarding the absence of a return value check for tdx_alloc_pamt_pages(), I
think it's because tdx_alloc_pamt_page_split() retrieves pages from the
pamt_page_cache via kvm_mmu_memory_cache_alloc(), which is guaranteed to
succeed (otherwise, there's a BUG_ON() in kvm_mmu_memory_cache_alloc()).
> >
> > err = tdh_mem_page_demote(&kvm_tdx->td, gpa, tdx_level, page,
> > - NULL, &entry, &level_state);
> > + &pamt_pages, &entry, &level_state);
> >
> > if (unlikely(tdx_operand_busy(err))) {
> > tdx_no_vcpus_enter_start(kvm);
> > err = tdh_mem_page_demote(&kvm_tdx->td, gpa, tdx_level, page,
> > - NULL, &entry, &level_state);
> > + &pamt_pages, &entry, &level_state);
> > tdx_no_vcpus_enter_stop(kvm);
> > }
> >
> > if (KVM_BUG_ON(err, kvm)) {
> > + tdx_free_pamt_pages(&pamt_pages);
>
> If tdx_alloc_pamt_pages() is not needed then this can be dropped as well.
>
> > + tdx_pamt_put(page, PG_LEVEL_4K);
> > pr_tdx_error_2(TDH_MEM_PAGE_DEMOTE, err, entry, level_state);
> > return -EIO;
> > }
> > +
> > + if (tdx_supports_dynamic_pamt(tdx_sysinfo))
> > + atomic_set(tdx_get_pamt_refcount(hpa), PTRS_PER_PMD);
>
> Should this be
> atomic_set(tdx_get_pamt_refcount(hpa), PTRS_PER_PMD -1 );
>
> as tdx_pamt_get would have increased the refcount by 1 already above?
This hpa is for the guest 2MB memory range. There shouldn't be any increased
pamt_refcount for this range before a successful demote.
So, atomic_set() to PTRS_PER_PMD looks correct, though atomic_add() seems even
safer.
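E.g., a sketch of the atomic_add() variant, under the same assumption that
the range has no prior refcount:

	if (tdx_supports_dynamic_pamt(tdx_sysinfo))
		atomic_add(PTRS_PER_PMD, tdx_get_pamt_refcount(hpa));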
^ permalink raw reply [flat|nested] 43+ messages in thread
* [RFC PATCH v2 23/23] KVM: TDX: Turn on PG_LEVEL_2M after TD is RUNNABLE
2025-08-07 9:39 [RFC PATCH v2 00/23] KVM: TDX huge page support for private memory Yan Zhao
` (21 preceding siblings ...)
2025-08-07 9:46 ` [RFC PATCH v2 22/23] KVM: TDX: Handle Dynamic PAMT on page split Yan Zhao
@ 2025-08-07 9:46 ` Yan Zhao
22 siblings, 0 replies; 43+ messages in thread
From: Yan Zhao @ 2025-08-07 9:46 UTC (permalink / raw)
To: pbonzini, seanjc
Cc: linux-kernel, kvm, x86, rick.p.edgecombe, dave.hansen, kas, tabba,
ackerleytng, quic_eberman, michael.roth, david, vannapurve,
vbabka, thomas.lendacky, pgonda, zhiquan1.li, fan.du, jun.miao,
ira.weiny, isaku.yamahata, xiaoyao.li, binbin.wu, chao.p.peng,
yan.y.zhao
Turn on PG_LEVEL_2M in tdx_gmem_private_max_mapping_level() when TD is
RUNNABLE.
Update the warnings and KVM_BUG_ON() info elsewhere to match that 2MB
mappings are permitted after TD is RUNNABLE.
Opportunistically, remove the unused params "gfn" and "pfn" in
tdx_mem_page_record_premap_cnt().
Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
---
RFC v2:
- Merged RFC v1's patch 4 (forcing PG_LEVEL_4K before TD runnable) with
patch 9 (allowing PG_LEVEL_2M after TD runnable).
---
arch/x86/kvm/vmx/tdx.c | 29 +++++++++++++++--------------
1 file changed, 15 insertions(+), 14 deletions(-)
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 6e061d659639..a3e1ac044ee9 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1633,12 +1633,11 @@ static int tdx_mem_page_aug(struct kvm *kvm, gfn_t gfn,
* The counter has to be zero on KVM_TDX_FINALIZE_VM, to ensure that there
* are no half-initialized shared EPT pages.
*/
-static int tdx_mem_page_record_premap_cnt(struct kvm *kvm, gfn_t gfn,
- enum pg_level level, kvm_pfn_t pfn)
+static int tdx_mem_page_record_premap_cnt(struct kvm *kvm, enum pg_level level)
{
struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
- if (KVM_BUG_ON(kvm->arch.pre_fault_allowed, kvm))
+ if (KVM_BUG_ON(kvm->arch.pre_fault_allowed || level != PG_LEVEL_4K, kvm))
return -EINVAL;
/* nr_premapped will be decreased when tdh_mem_page_add() is called. */
@@ -1667,10 +1666,6 @@ static int tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn,
if (ret)
return ret;
- /* TODO: handle large pages. */
- if (KVM_BUG_ON(level != PG_LEVEL_4K, kvm))
- return -EINVAL;
-
/*
* Read 'pre_fault_allowed' before 'kvm_tdx->state'; see matching
* barrier in tdx_td_finalize().
@@ -1680,7 +1675,7 @@ static int tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn,
if (likely(kvm_tdx->state == TD_STATE_RUNNABLE))
ret = tdx_mem_page_aug(kvm, gfn, level, page);
else
- ret = tdx_mem_page_record_premap_cnt(kvm, gfn, level, pfn);
+ ret = tdx_mem_page_record_premap_cnt(kvm, level);
if (ret)
tdx_pamt_put(page, level);
@@ -1697,8 +1692,8 @@ static int tdx_sept_drop_private_spte(struct kvm *kvm, gfn_t gfn,
gpa_t gpa = gfn_to_gpa(gfn);
u64 err, entry, level_state;
- /* TODO: handle large pages. */
- if (KVM_BUG_ON(level != PG_LEVEL_4K, kvm))
+ /* Large pages are not supported before the TD is runnable. */
+ if (KVM_BUG_ON(kvm_tdx->state != TD_STATE_RUNNABLE && level != PG_LEVEL_4K, kvm))
return -EINVAL;
if (KVM_BUG_ON(!is_hkid_assigned(kvm_tdx), kvm))
@@ -1791,7 +1786,7 @@ static int tdx_sept_link_private_spt(struct kvm *kvm, gfn_t gfn,
static int tdx_is_sept_zap_err_due_to_premap(struct kvm_tdx *kvm_tdx, u64 err,
u64 entry, int level)
{
- if (!err || kvm_tdx->state == TD_STATE_RUNNABLE)
+ if (!err || kvm_tdx->state == TD_STATE_RUNNABLE || level > PG_LEVEL_4K)
return false;
if (err != (TDX_EPT_ENTRY_STATE_INCORRECT | TDX_OPERAND_ID_RCX))
@@ -1811,8 +1806,8 @@ static int tdx_sept_zap_private_spte(struct kvm *kvm, gfn_t gfn,
gpa_t gpa = gfn_to_gpa(gfn) & KVM_HPAGE_MASK(level);
u64 err, entry, level_state;
- /* For now large page isn't supported yet. */
- WARN_ON_ONCE(level != PG_LEVEL_4K);
+ /* Large pages are not supported before the TD is runnable. */
+ WARN_ON_ONCE(kvm_tdx->state != TD_STATE_RUNNABLE && level != PG_LEVEL_4K);
err = tdh_mem_range_block(&kvm_tdx->td, gpa, tdx_level, &entry, &level_state);
@@ -1993,6 +1988,9 @@ static int tdx_sept_remove_private_spte(struct kvm *kvm, gfn_t gfn,
struct folio *folio = page_folio(page);
int ret;
+ WARN_ON_ONCE(folio_page_idx(folio, page) + KVM_PAGES_PER_HPAGE(level) >
+ folio_nr_pages(folio));
+
if (!is_hkid_assigned(to_kvm_tdx(kvm))) {
KVM_BUG_ON(!kvm->vm_dead, kvm);
@@ -3470,7 +3468,10 @@ int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp)
int tdx_gmem_private_max_mapping_level(struct kvm *kvm, kvm_pfn_t pfn)
{
- return PG_LEVEL_4K;
+ if (unlikely(to_kvm_tdx(kvm)->state != TD_STATE_RUNNABLE))
+ return PG_LEVEL_4K;
+
+ return PG_LEVEL_2M;
}
static int tdx_online_cpu(unsigned int cpu)
--
2.43.2
^ permalink raw reply related [flat|nested] 43+ messages in thread