From: Sean Christopherson <seanjc@google.com>
To: Yan Zhao <yan.y.zhao@intel.com>
Cc: pbonzini@redhat.com, linux-kernel@vger.kernel.org,
kvm@vger.kernel.org, x86@kernel.org, rick.p.edgecombe@intel.com,
dave.hansen@intel.com, kas@kernel.org, tabba@google.com,
ackerleytng@google.com, michael.roth@amd.com, david@kernel.org,
vannapurve@google.com, sagis@google.com, vbabka@suse.cz,
thomas.lendacky@amd.com, nik.borisov@suse.com,
pgonda@google.com, fan.du@intel.com, jun.miao@intel.com,
francescolavra.fl@gmail.com, jgross@suse.com,
ira.weiny@intel.com, isaku.yamahata@intel.com,
xiaoyao.li@intel.com, kai.huang@intel.com,
binbin.wu@linux.intel.com, chao.p.peng@intel.com,
chao.gao@intel.com
Subject: Re: [PATCH v3 02/24] x86/virt/tdx: Add SEAMCALL wrapper tdh_mem_page_demote()
Date: Wed, 28 Jan 2026 14:49:39 -0800 [thread overview]
Message-ID: <aXqSg2nUSYJMbcXT@google.com> (raw)
In-Reply-To: <20260106101849.24889-1-yan.y.zhao@intel.com>
On Tue, Jan 06, 2026, Yan Zhao wrote:
> From: Xiaoyao Li <xiaoyao.li@intel.com>
>
> Introduce SEAMCALL wrapper tdh_mem_page_demote() to invoke
> TDH_MEM_PAGE_DEMOTE, which splits a 2MB or a 1GB mapping in S-EPT into
> 512 4KB or 2MB mappings respectively.
>
> SEAMCALL TDH_MEM_PAGE_DEMOTE walks the S-EPT to locate the huge mapping to
> split and add a new S-EPT page to hold the 512 smaller mappings.
>
> Parameters "gpa" and "level" specify the huge mapping to split, and
> parameter "new_sept_page" specifies the 4KB page to be added as the S-EPT
> page. Invoke tdx_clflush_page() before adding the new S-EPT page
> conservatively to prevent dirty cache lines from writing back later and
> corrupting TD memory.
>
> tdh_mem_page_demote() may fail, e.g., due to S-EPT walk error. Callers must
> check function return value and can retrieve the extended error info from
> the output parameters "ext_err1", and "ext_err2".
>
> The TDX module has many internal locks. To avoid staying in SEAM mode for
> too long, SEAMCALLs return a BUSY error code to the kernel instead of
> spinning on the locks. Depending on the specific SEAMCALL, the caller may
> need to handle this error in specific ways (e.g., retry). Therefore, return
> the SEAMCALL error code directly to the caller without attempting to handle
> it in the core kernel.
>
> Enable tdh_mem_page_demote() only on TDX modules that support feature
> TDX_FEATURES0.ENHANCE_DEMOTE_INTERRUPTIBILITY, which does not return error
> TDX_INTERRUPTED_RESTARTABLE on basic TDX (i.e., without TD partition) [2].
>
> This is because error TDX_INTERRUPTED_RESTARTABLE is difficult to handle.
> The TDX module provides no guaranteed maximum retry count to ensure forward
> progress of the demotion. Interrupt storms could then result in a DoS if
> host simply retries endlessly for TDX_INTERRUPTED_RESTARTABLE. Disabling
> interrupts before invoking the SEAMCALL also doesn't work because NMIs can
> also trigger TDX_INTERRUPTED_RESTARTABLE. Therefore, the tradeoff for basic
> TDX is to disable the TDX_INTERRUPTED_RESTARTABLE error given the
> reasonable execution time for demotion. [1]
>
> Link: https://lore.kernel.org/kvm/99f5585d759328db973403be0713f68e492b492a.camel@intel.com [1]
> Link: https://lore.kernel.org/all/fbf04b09f13bc2ce004ac97ee9c1f2c965f44fdf.camel@intel.com [2]
> Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> Co-developed-by: Yan Zhao <yan.y.zhao@intel.com>
> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
> ---
This is ridiculous. The DEMOTE API is spread across three patches:
Add SEAMCALL wrapper tdh_mem_page_demote()
Add/Remove DPAMT pages for guest private memory to demote
Pass guest memory's PFN info to demote for updating pamt_refcount
with significant changes between the "add wrapper" and when the API is actually
usable. Even worse, it's wired up in KVM before it's finalized, and so those
changes mean touching KVM code.
And to top things off, "Add/Remove DPAMT pages for guest private memory to demote"
includes a non-trivial refactoring and code movement of MAX_TDX_ARG_SIZE() and
dpamt_args_array_ptr().
This borderline unreviewable. It took me literally more than a day to wrap my
head around what actually needs to happen for DEMOTE, what patch was doing what,
at what point in the series DEMOTE actually became usable, etc.
I get that y'all are juggling multiple intertwined series, but spraying them all
at upstream without any apparent rhyme or reason does not work. Figure out
priorities, pick an ordering, and make it happen.
In the end, this fits nicely into *one* patch, with significantly fewer lines
changed overall.
E.g.
---
arch/x86/include/asm/tdx.h | 9 ++++++
arch/x86/virt/vmx/tdx/tdx.c | 58 +++++++++++++++++++++++++++++++++++++
arch/x86/virt/vmx/tdx/tdx.h | 1 +
3 files changed, 68 insertions(+)
diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index 50feea01b066..483441de7fe0 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -15,6 +15,7 @@
/* Bit definitions of TDX_FEATURES0 metadata field */
#define TDX_FEATURES0_NO_RBP_MOD BIT_ULL(18)
#define TDX_FEATURES0_DYNAMIC_PAMT BIT_ULL(36)
+#define TDX_FEATURES0_ENHANCE_DEMOTE_INTERRUPTIBILITY BIT_ULL(51)
#ifndef __ASSEMBLER__
@@ -140,6 +141,11 @@ static inline bool tdx_supports_dynamic_pamt(const struct tdx_sys_info *sysinfo)
return sysinfo->features.tdx_features0 & TDX_FEATURES0_DYNAMIC_PAMT;
}
+static inline bool tdx_supports_demote_nointerrupt(const struct tdx_sys_info *sysinfo)
+{
+ return sysinfo->features.tdx_features0 & TDX_FEATURES0_ENHANCE_DEMOTE_INTERRUPTIBILITY;
+}
+
/* Simple structure for pre-allocating Dynamic PAMT pages outside of locks. */
struct tdx_pamt_cache {
struct list_head page_list;
@@ -240,6 +246,9 @@ u64 tdh_mng_key_config(struct tdx_td *td);
u64 tdh_mng_create(struct tdx_td *td, u16 hkid);
u64 tdh_vp_create(struct tdx_td *td, struct tdx_vp *vp);
u64 tdh_mng_rd(struct tdx_td *td, u64 field, u64 *data);
+u64 tdh_mem_page_demote(struct tdx_td *td, u64 gpa, enum pg_level level, u64 pfn,
+ struct page *new_sp, struct tdx_pamt_cache *pamt_cache,
+ u64 *ext_err1, u64 *ext_err2);
u64 tdh_mr_extend(struct tdx_td *td, u64 gpa, u64 *ext_err1, u64 *ext_err2);
u64 tdh_mr_finalize(struct tdx_td *td);
u64 tdh_vp_flush(struct tdx_vp *vp);
diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index 6a2871e83761..97016b3e26b8 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -1874,6 +1874,64 @@ static u64 *dpamt_args_array_ptr_r12(struct tdx_module_array_args *args)
return &args->args_array[TDX_ARG_INDEX(r12)];
}
+u64 tdh_mem_page_demote(struct tdx_td *td, u64 gpa, enum pg_level level, u64 pfn,
+ struct page *new_sp, struct tdx_pamt_cache *pamt_cache,
+ u64 *ext_err1, u64 *ext_err2)
+{
+ bool dpamt = tdx_supports_dynamic_pamt(&tdx_sysinfo) && level == PG_LEVEL_2M;
+ u64 guest_memory_pamt_page[MAX_TDX_ARGS(r12)];
+ struct tdx_module_array_args args = {
+ .args.rcx = gpa | pg_level_to_tdx_sept_level(level),
+ .args.rdx = tdx_tdr_pa(td),
+ .args.r8 = page_to_phys(new_sp),
+ };
+ u64 ret;
+
+ if (!tdx_supports_demote_nointerrupt(&tdx_sysinfo))
+ return TDX_SW_ERROR;
+
+ if (dpamt) {
+ u64 *args_array = dpamt_args_array_ptr_r12(&args);
+
+ if (alloc_pamt_array(guest_memory_pamt_page, pamt_cache))
+ return TDX_SW_ERROR;
+
+ /*
+ * Copy PAMT page PAs of the guest memory into the struct per the
+ * TDX ABI
+ */
+ memcpy(args_array, guest_memory_pamt_page,
+ tdx_dpamt_entry_pages() * sizeof(*args_array));
+ }
+
+ /* Flush the new S-EPT page to be added */
+ tdx_clflush_page(new_sp);
+
+ ret = seamcall_saved_ret(TDH_MEM_PAGE_DEMOTE, &args.args);
+
+ *ext_err1 = args.args.rcx;
+ *ext_err2 = args.args.rdx;
+
+ if (dpamt) {
+ if (ret) {
+ free_pamt_array(guest_memory_pamt_page);
+ } else {
+ /*
+ * Set the PAMT refcount for the guest private memory,
+ * i.e. for the hugepage that was just demoted to 512
+ * smaller pages.
+ */
+ atomic_t *pamt_refcount;
+
+ pamt_refcount = tdx_find_pamt_refcount(pfn);
+ WARN_ON_ONCE(atomic_cmpxchg_release(pamt_refcount, 0,
+ PTRS_PER_PMD));
+ }
+ }
+ return ret;
+}
+EXPORT_SYMBOL_FOR_KVM(tdh_mem_page_demote);
+
u64 tdh_mr_extend(struct tdx_td *td, u64 gpa, u64 *ext_err1, u64 *ext_err2)
{
struct tdx_module_args args = {
diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h
index 096c78a1d438..a6c0fa53ece9 100644
--- a/arch/x86/virt/vmx/tdx/tdx.h
+++ b/arch/x86/virt/vmx/tdx/tdx.h
@@ -24,6 +24,7 @@
#define TDH_MNG_KEY_CONFIG 8
#define TDH_MNG_CREATE 9
#define TDH_MNG_RD 11
+#define TDH_MEM_PAGE_DEMOTE 15
#define TDH_MR_EXTEND 16
#define TDH_MR_FINALIZE 17
#define TDH_VP_FLUSH 18
base-commit: 0f969bc3e7a9aa441122ad51bc2ff220a200b88e
--
next prev parent reply other threads:[~2026-01-28 22:49 UTC|newest]
Thread overview: 127+ messages / expand[flat|nested] mbox.gz Atom feed top
2026-01-06 10:16 [PATCH v3 00/24] KVM: TDX huge page support for private memory Yan Zhao
2026-01-06 10:18 ` [PATCH v3 01/24] x86/tdx: Enhance tdh_mem_page_aug() to support huge pages Yan Zhao
2026-01-06 21:08 ` Dave Hansen
2026-01-07 9:12 ` Yan Zhao
2026-01-07 16:39 ` Dave Hansen
2026-01-08 19:05 ` Ackerley Tng
2026-01-08 19:24 ` Dave Hansen
2026-01-09 16:21 ` Vishal Annapurve
2026-01-09 3:08 ` Yan Zhao
2026-01-09 18:29 ` Ackerley Tng
2026-01-12 2:41 ` Yan Zhao
2026-01-13 16:50 ` Vishal Annapurve
2026-01-14 1:48 ` Yan Zhao
2026-01-06 10:18 ` [PATCH v3 02/24] x86/virt/tdx: Add SEAMCALL wrapper tdh_mem_page_demote() Yan Zhao
2026-01-16 1:00 ` Huang, Kai
2026-01-16 8:35 ` Yan Zhao
2026-01-16 11:10 ` Huang, Kai
2026-01-16 11:22 ` Huang, Kai
2026-01-19 6:18 ` Yan Zhao
2026-01-19 6:15 ` Yan Zhao
2026-01-16 11:22 ` Huang, Kai
2026-01-19 5:55 ` Yan Zhao
2026-01-28 22:49 ` Sean Christopherson [this message]
2026-01-06 10:19 ` [PATCH v3 03/24] x86/tdx: Enhance tdh_phymem_page_wbinvd_hkid() to invalidate huge pages Yan Zhao
2026-01-06 10:19 ` [PATCH v3 04/24] x86/tdx: Introduce tdx_quirk_reset_folio() to reset private " Yan Zhao
2026-01-06 10:20 ` [PATCH v3 05/24] x86/virt/tdx: Enhance tdh_phymem_page_reclaim() to support " Yan Zhao
2026-01-06 10:20 ` [PATCH v3 06/24] KVM: x86/mmu: Disallow page merging (huge page adjustment) for mirror root Yan Zhao
2026-01-15 22:49 ` Sean Christopherson
2026-01-16 7:54 ` Yan Zhao
2026-01-26 16:08 ` Sean Christopherson
2026-01-27 3:40 ` Yan Zhao
2026-01-28 19:51 ` Sean Christopherson
2026-01-06 10:20 ` [PATCH v3 07/24] KVM: x86/tdp_mmu: Introduce split_external_spte() under write mmu_lock Yan Zhao
2026-01-28 22:38 ` Sean Christopherson
2026-01-06 10:20 ` [PATCH v3 08/24] KVM: TDX: Enable huge page splitting " Yan Zhao
2026-01-06 10:21 ` [PATCH v3 09/24] KVM: x86: Reject splitting huge pages under shared mmu_lock in TDX Yan Zhao
2026-01-06 10:21 ` [PATCH v3 10/24] KVM: x86/tdp_mmu: Alloc external_spt page for mirror page table splitting Yan Zhao
2026-01-06 10:21 ` [PATCH v3 11/24] KVM: x86/mmu: Introduce kvm_split_cross_boundary_leafs() Yan Zhao
2026-01-15 12:25 ` Huang, Kai
2026-01-16 23:39 ` Sean Christopherson
2026-01-19 1:28 ` Yan Zhao
2026-01-19 8:35 ` Huang, Kai
2026-01-19 8:49 ` Huang, Kai
2026-01-19 10:11 ` Yan Zhao
2026-01-19 10:40 ` Huang, Kai
2026-01-19 11:06 ` Yan Zhao
2026-01-19 12:32 ` Yan Zhao
2026-01-29 14:36 ` Sean Christopherson
2026-01-20 17:51 ` Sean Christopherson
2026-01-22 6:27 ` Yan Zhao
2026-01-20 17:57 ` Vishal Annapurve
2026-01-20 18:02 ` Sean Christopherson
2026-01-22 6:33 ` Yan Zhao
2026-01-29 14:51 ` Sean Christopherson
2026-01-06 10:21 ` [PATCH v3 12/24] KVM: x86: Introduce hugepage_set_guest_inhibit() Yan Zhao
2026-01-06 10:22 ` [PATCH v3 13/24] KVM: TDX: Honor the guest's accept level contained in an EPT violation Yan Zhao
2026-01-06 10:22 ` [PATCH v3 14/24] KVM: Change the return type of gfn_handler_t() from bool to int Yan Zhao
2026-01-16 0:21 ` Sean Christopherson
2026-01-16 6:42 ` Yan Zhao
2026-01-06 10:22 ` [PATCH v3 15/24] KVM: x86: Split cross-boundary mirror leafs for KVM_SET_MEMORY_ATTRIBUTES Yan Zhao
2026-01-06 10:22 ` [PATCH v3 16/24] KVM: guest_memfd: Split for punch hole and private-to-shared conversion Yan Zhao
2026-01-28 22:39 ` Sean Christopherson
2026-01-06 10:23 ` [PATCH v3 17/24] KVM: TDX: Get/Put DPAMT page pair only when mapping size is 4KB Yan Zhao
2026-01-06 10:23 ` [PATCH v3 18/24] x86/virt/tdx: Add loud warning when tdx_pamt_put() fails Yan Zhao
2026-01-06 10:23 ` [PATCH v3 19/24] KVM: x86: Introduce per-VM external cache for splitting Yan Zhao
2026-01-21 1:54 ` Huang, Kai
2026-01-21 17:30 ` Sean Christopherson
2026-01-21 19:39 ` Edgecombe, Rick P
2026-01-21 23:01 ` Huang, Kai
2026-01-22 7:03 ` Yan Zhao
2026-01-22 7:30 ` Huang, Kai
2026-01-22 7:49 ` Yan Zhao
2026-01-22 10:33 ` Huang, Kai
2026-01-06 10:23 ` [PATCH v3 20/24] KVM: TDX: Implement per-VM external cache for splitting in TDX Yan Zhao
2026-01-06 10:23 ` [PATCH v3 21/24] KVM: TDX: Add/Remove DPAMT pages for the new S-EPT page for splitting Yan Zhao
2026-01-06 10:24 ` [PATCH v3 22/24] x86/tdx: Add/Remove DPAMT pages for guest private memory to demote Yan Zhao
2026-01-19 10:52 ` Huang, Kai
2026-01-19 11:11 ` Yan Zhao
2026-01-06 10:24 ` [PATCH v3 23/24] x86/tdx: Pass guest memory's PFN info to demote for updating pamt_refcount Yan Zhao
2026-01-06 10:24 ` [PATCH v3 24/24] KVM: TDX: Turn on PG_LEVEL_2M Yan Zhao
2026-01-06 17:47 ` [PATCH v3 00/24] KVM: TDX huge page support for private memory Vishal Annapurve
2026-01-06 21:26 ` Ackerley Tng
2026-01-06 21:38 ` Sean Christopherson
2026-01-06 22:04 ` Ackerley Tng
2026-01-06 23:43 ` Sean Christopherson
2026-01-07 9:03 ` Yan Zhao
2026-01-08 20:11 ` Ackerley Tng
2026-01-09 9:18 ` Yan Zhao
2026-01-09 16:12 ` Vishal Annapurve
2026-01-09 17:16 ` Vishal Annapurve
2026-01-09 18:07 ` Ackerley Tng
2026-01-12 1:39 ` Yan Zhao
2026-01-12 2:12 ` Yan Zhao
2026-01-12 19:56 ` Ackerley Tng
2026-01-13 6:10 ` Yan Zhao
2026-01-13 16:40 ` Vishal Annapurve
2026-01-14 9:32 ` Yan Zhao
2026-01-07 19:22 ` Edgecombe, Rick P
2026-01-07 20:27 ` Sean Christopherson
2026-01-12 20:15 ` Ackerley Tng
2026-01-14 0:33 ` Yan Zhao
2026-01-14 1:24 ` Sean Christopherson
2026-01-14 9:23 ` Yan Zhao
2026-01-14 15:26 ` Sean Christopherson
2026-01-14 18:45 ` Ackerley Tng
2026-01-15 3:08 ` Yan Zhao
2026-01-15 18:13 ` Ackerley Tng
2026-01-14 18:56 ` Dave Hansen
2026-01-15 0:19 ` Sean Christopherson
2026-01-16 15:45 ` Edgecombe, Rick P
2026-01-16 16:31 ` Sean Christopherson
2026-01-16 16:58 ` Edgecombe, Rick P
2026-01-19 5:53 ` Yan Zhao
2026-01-30 15:32 ` Sean Christopherson
2026-02-03 9:18 ` Yan Zhao
2026-02-09 17:01 ` Sean Christopherson
2026-01-16 16:57 ` Dave Hansen
2026-01-16 17:14 ` Sean Christopherson
2026-01-16 17:45 ` Dave Hansen
2026-01-16 19:59 ` Sean Christopherson
2026-01-16 22:25 ` Dave Hansen
2026-01-15 1:41 ` Yan Zhao
2026-01-15 16:26 ` Sean Christopherson
2026-01-16 0:28 ` Sean Christopherson
2026-01-16 11:25 ` Yan Zhao
2026-01-16 14:46 ` Sean Christopherson
2026-01-19 1:25 ` Yan Zhao
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=aXqSg2nUSYJMbcXT@google.com \
--to=seanjc@google.com \
--cc=ackerleytng@google.com \
--cc=binbin.wu@linux.intel.com \
--cc=chao.gao@intel.com \
--cc=chao.p.peng@intel.com \
--cc=dave.hansen@intel.com \
--cc=david@kernel.org \
--cc=fan.du@intel.com \
--cc=francescolavra.fl@gmail.com \
--cc=ira.weiny@intel.com \
--cc=isaku.yamahata@intel.com \
--cc=jgross@suse.com \
--cc=jun.miao@intel.com \
--cc=kai.huang@intel.com \
--cc=kas@kernel.org \
--cc=kvm@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=michael.roth@amd.com \
--cc=nik.borisov@suse.com \
--cc=pbonzini@redhat.com \
--cc=pgonda@google.com \
--cc=rick.p.edgecombe@intel.com \
--cc=sagis@google.com \
--cc=tabba@google.com \
--cc=thomas.lendacky@amd.com \
--cc=vannapurve@google.com \
--cc=vbabka@suse.cz \
--cc=x86@kernel.org \
--cc=xiaoyao.li@intel.com \
--cc=yan.y.zhao@intel.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox