* [RFC PATCH v2 01/23] x86/tdx: Enhance tdh_mem_page_aug() to support huge pages
2025-08-07 9:39 [RFC PATCH v2 00/23] KVM: TDX huge page support for private memory Yan Zhao
@ 2025-08-07 9:41 ` Yan Zhao
2025-08-07 9:41 ` [RFC PATCH v2 02/23] x86/virt/tdx: Add SEAMCALL wrapper tdh_mem_page_demote() Yan Zhao
` (21 subsequent siblings)
22 siblings, 0 replies; 129+ messages in thread
From: Yan Zhao @ 2025-08-07 9:41 UTC (permalink / raw)
To: pbonzini, seanjc
Cc: linux-kernel, kvm, x86, rick.p.edgecombe, dave.hansen, kas, tabba,
ackerleytng, quic_eberman, michael.roth, david, vannapurve,
vbabka, thomas.lendacky, pgonda, zhiquan1.li, fan.du, jun.miao,
ira.weiny, isaku.yamahata, xiaoyao.li, binbin.wu, chao.p.peng,
yan.y.zhao
Enhance the SEAMCALL wrapper tdh_mem_page_aug() to support huge pages.
The SEAMCALL TDH_MEM_PAGE_AUG currently supports adding physical memory of
up to 2MB to the S-EPT.
Keep the "level" parameter in the tdh_mem_page_aug() wrapper so that callers
can specify the physical memory size, and introduce the parameters "folio"
and "start_idx" to specify the physical memory starting from the page at
"start_idx" within the "folio". The specified physical memory must be fully
contained within a single folio.
Invoke tdx_clflush_page() for each 4KB page of the physical memory being
added (e.g., 512 times for a 2MB mapping). tdx_clflush_page() performs
CLFLUSH operations on certain TDX-capable platforms, or conservatively on
all TDX-capable platforms, to prevent dirty cache lines from being written
back later and corrupting TD memory.
Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
---
RFC v2:
- Refine patch log. (Rick)
- Removed the level checking. (Kirill, Chao Gao)
- Use "folio", and "start_idx" rather than "page".
- Return TDX_OPERAND_INVALID if the specified physical memory is not
contained within a single folio.
- Use PTE_SHIFT to replace the 9 in "1 << (level * 9)" (Kirill)
- Use C99-style definition of variables inside a loop. (Nikolay Borisov)
RFC v1:
- Rebased to new tdh_mem_page_aug() with "struct page *" as param.
- Check folio, folio_page_idx.
---
arch/x86/include/asm/tdx.h | 3 ++-
arch/x86/kvm/vmx/tdx.c | 4 +++-
arch/x86/virt/vmx/tdx/tdx.c | 14 +++++++++++---
3 files changed, 16 insertions(+), 5 deletions(-)
diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index 48d579092590..f968b736871a 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -171,7 +171,8 @@ u64 tdh_mng_addcx(struct tdx_td *td, struct page *tdcs_page);
u64 tdh_mem_page_add(struct tdx_td *td, u64 gpa, struct page *page, struct page *source, u64 *ext_err1, u64 *ext_err2);
u64 tdh_mem_sept_add(struct tdx_td *td, u64 gpa, int level, struct page *page, u64 *ext_err1, u64 *ext_err2);
u64 tdh_vp_addcx(struct tdx_vp *vp, struct page *tdcx_page);
-u64 tdh_mem_page_aug(struct tdx_td *td, u64 gpa, int level, struct page *page, u64 *ext_err1, u64 *ext_err2);
+u64 tdh_mem_page_aug(struct tdx_td *td, u64 gpa, int level, struct folio *folio,
+ unsigned long start_idx, u64 *ext_err1, u64 *ext_err2);
u64 tdh_mem_range_block(struct tdx_td *td, u64 gpa, int level, u64 *ext_err1, u64 *ext_err2);
u64 tdh_mng_key_config(struct tdx_td *td);
u64 tdh_mng_create(struct tdx_td *td, u16 hkid);
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index ed67f842b6ec..0a2b183899d8 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1593,11 +1593,13 @@ static int tdx_mem_page_aug(struct kvm *kvm, gfn_t gfn,
{
int tdx_level = pg_level_to_tdx_sept_level(level);
struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
+ struct folio *folio = page_folio(page);
gpa_t gpa = gfn_to_gpa(gfn);
u64 entry, level_state;
u64 err;
- err = tdh_mem_page_aug(&kvm_tdx->td, gpa, tdx_level, page, &entry, &level_state);
+ err = tdh_mem_page_aug(&kvm_tdx->td, gpa, tdx_level, folio,
+ folio_page_idx(folio, page), &entry, &level_state);
if (unlikely(tdx_operand_busy(err))) {
tdx_unpin(kvm, page);
return -EBUSY;
diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index e411cf878547..580f14f64822 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -1730,16 +1730,24 @@ u64 tdh_vp_addcx(struct tdx_vp *vp, struct page *tdcx_page)
}
EXPORT_SYMBOL_GPL(tdh_vp_addcx);
-u64 tdh_mem_page_aug(struct tdx_td *td, u64 gpa, int level, struct page *page, u64 *ext_err1, u64 *ext_err2)
+u64 tdh_mem_page_aug(struct tdx_td *td, u64 gpa, int level, struct folio *folio,
+ unsigned long start_idx, u64 *ext_err1, u64 *ext_err2)
{
+ struct page *start = folio_page(folio, start_idx);
+ unsigned long npages = 1 << (level * PTE_SHIFT);
struct tdx_module_args args = {
.rcx = gpa | level,
.rdx = tdx_tdr_pa(td),
- .r8 = page_to_phys(page),
+ .r8 = page_to_phys(start),
};
u64 ret;
- tdx_clflush_page(page);
+ if (start_idx + npages > folio_nr_pages(folio))
+ return TDX_OPERAND_INVALID;
+
+ for (int i = 0; i < npages; i++)
+ tdx_clflush_page(nth_page(start, i));
+
ret = seamcall_ret(TDH_MEM_PAGE_AUG, &args);
*ext_err1 = args.rcx;
--
2.43.2
^ permalink raw reply related [flat|nested] 129+ messages in thread
* [RFC PATCH v2 02/23] x86/virt/tdx: Add SEAMCALL wrapper tdh_mem_page_demote()
2025-08-07 9:39 [RFC PATCH v2 00/23] KVM: TDX huge page support for private memory Yan Zhao
2025-08-07 9:41 ` [RFC PATCH v2 01/23] x86/tdx: Enhance tdh_mem_page_aug() to support huge pages Yan Zhao
@ 2025-08-07 9:41 ` Yan Zhao
2025-09-01 8:55 ` Binbin Wu
2025-08-07 9:42 ` [RFC PATCH v2 03/23] x86/tdx: Enhance tdh_phymem_page_wbinvd_hkid() to invalidate huge pages Yan Zhao
` (20 subsequent siblings)
22 siblings, 1 reply; 129+ messages in thread
From: Yan Zhao @ 2025-08-07 9:41 UTC (permalink / raw)
To: pbonzini, seanjc
Cc: linux-kernel, kvm, x86, rick.p.edgecombe, dave.hansen, kas, tabba,
ackerleytng, quic_eberman, michael.roth, david, vannapurve,
vbabka, thomas.lendacky, pgonda, zhiquan1.li, fan.du, jun.miao,
ira.weiny, isaku.yamahata, xiaoyao.li, binbin.wu, chao.p.peng,
yan.y.zhao
From: Xiaoyao Li <xiaoyao.li@intel.com>
Introduce SEAMCALL wrapper tdh_mem_page_demote() to invoke the SEAMCALL
TDH_MEM_PAGE_DEMOTE, which demotes a huge leaf entry to a non-leaf entry
in the S-EPT.
SEAMCALL TDH_MEM_PAGE_DEMOTE supports the demotion of 2MB or 1GB huge leaf
entries.
The "gpa" and "level" parameters enable the SEAMCALL TDH_MEM_PAGE_DEMOTE to
walk the S-EPT for the huge leaf entry that needs to be demoted.
The "page" parameter specifies a 4KB page that will be used in the demotion
operation to be added as a page table page in the S-EPT.
Invoke tdx_clflush_page() on the 4KB page being added as a page table page.
This function performs CLFLUSH operations on certain TDX-capable platforms,
or conservatively on all TDX-capable platforms, to prevent dirty cache
lines from writing back later and corrupting TD memory.
tdh_mem_page_demote() may fail. Callers can check function return value and
retrieve extended error info from the function output parameters "ext_err1"
and "ext_err2". e.g., due to S-EPT walk error or arriving interrupts.
The TDX module has many internal locks. To avoid staying in SEAM mode for
too long, SEAMCALLs return a BUSY error code to the kernel instead of
spinning on the locks. Depending on the specific SEAMCALL, the caller may
need to handle this error in specific ways (e.g., retry). Therefore, return
the SEAMCALL error code directly to the caller without attempting to handle
it in the core kernel.
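As a rough illustration only (not prescribed by this patch), a caller that
treats the BUSY error as retryable could look like the sketch below, reusing
the tdx_operand_busy() helper that existing S-EPT callers already use; the
bare retry loop and the names kvm_tdx, gpa, tdx_level and pt_page are just
placeholders for the caller's context and policy:
	u64 entry, level_state, err;
	/* Retry while the TDX module reports contention on its internal locks. */
	do {
		err = tdh_mem_page_demote(&kvm_tdx->td, gpa, tdx_level,
					  pt_page, &entry, &level_state);
	} while (tdx_operand_busy(err));
	if (err)
		return -EIO;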
Do not handle TDX_INTERRUPTED_RESTARTABLE because SEAMCALL
TDH_MEM_PAGE_DEMOTE does not check interrupts (including NMIs) for basic
TDX (with or without Dynamic PAMT).
Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
---
RFC v2:
- Refine the patch log (Rick).
- Do not handle TDX_INTERRUPTED_RESTARTABLE as the new TDX modules in
planning do not check interrupts for basic TDX.
RFC v1:
- Rebased and split patch. Updated patch log.
---
arch/x86/include/asm/tdx.h | 2 ++
arch/x86/virt/vmx/tdx/tdx.c | 20 ++++++++++++++++++++
arch/x86/virt/vmx/tdx/tdx.h | 1 +
3 files changed, 23 insertions(+)
diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index f968b736871a..d2cf48e273d5 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -178,6 +178,8 @@ u64 tdh_mng_key_config(struct tdx_td *td);
u64 tdh_mng_create(struct tdx_td *td, u16 hkid);
u64 tdh_vp_create(struct tdx_td *td, struct tdx_vp *vp);
u64 tdh_mng_rd(struct tdx_td *td, u64 field, u64 *data);
+u64 tdh_mem_page_demote(struct tdx_td *td, u64 gpa, int level, struct page *page,
+ u64 *ext_err1, u64 *ext_err2);
u64 tdh_mr_extend(struct tdx_td *td, u64 gpa, u64 *ext_err1, u64 *ext_err2);
u64 tdh_mr_finalize(struct tdx_td *td);
u64 tdh_vp_flush(struct tdx_vp *vp);
diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index 580f14f64822..d941f083f741 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -1825,6 +1825,26 @@ u64 tdh_mng_rd(struct tdx_td *td, u64 field, u64 *data)
}
EXPORT_SYMBOL_GPL(tdh_mng_rd);
+u64 tdh_mem_page_demote(struct tdx_td *td, u64 gpa, int level, struct page *page,
+ u64 *ext_err1, u64 *ext_err2)
+{
+ struct tdx_module_args args = {
+ .rcx = gpa | level,
+ .rdx = tdx_tdr_pa(td),
+ .r8 = page_to_phys(page),
+ };
+ u64 ret;
+
+ tdx_clflush_page(page);
+ ret = seamcall_ret(TDH_MEM_PAGE_DEMOTE, &args);
+
+ *ext_err1 = args.rcx;
+ *ext_err2 = args.rdx;
+
+ return ret;
+}
+EXPORT_SYMBOL_GPL(tdh_mem_page_demote);
+
u64 tdh_mr_extend(struct tdx_td *td, u64 gpa, u64 *ext_err1, u64 *ext_err2)
{
struct tdx_module_args args = {
diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h
index 096c78a1d438..a6c0fa53ece9 100644
--- a/arch/x86/virt/vmx/tdx/tdx.h
+++ b/arch/x86/virt/vmx/tdx/tdx.h
@@ -24,6 +24,7 @@
#define TDH_MNG_KEY_CONFIG 8
#define TDH_MNG_CREATE 9
#define TDH_MNG_RD 11
+#define TDH_MEM_PAGE_DEMOTE 15
#define TDH_MR_EXTEND 16
#define TDH_MR_FINALIZE 17
#define TDH_VP_FLUSH 18
--
2.43.2
^ permalink raw reply related [flat|nested] 129+ messages in thread
* Re: [RFC PATCH v2 02/23] x86/virt/tdx: Add SEAMCALL wrapper tdh_mem_page_demote()
2025-08-07 9:41 ` [RFC PATCH v2 02/23] x86/virt/tdx: Add SEAMCALL wrapper tdh_mem_page_demote() Yan Zhao
@ 2025-09-01 8:55 ` Binbin Wu
2025-09-01 9:08 ` Yan Zhao
0 siblings, 1 reply; 129+ messages in thread
From: Binbin Wu @ 2025-09-01 8:55 UTC (permalink / raw)
To: Yan Zhao
Cc: pbonzini, seanjc, linux-kernel, kvm, x86, rick.p.edgecombe,
dave.hansen, kas, tabba, ackerleytng, quic_eberman, michael.roth,
david, vannapurve, vbabka, thomas.lendacky, pgonda, zhiquan1.li,
fan.du, jun.miao, ira.weiny, isaku.yamahata, xiaoyao.li,
chao.p.peng
On 8/7/2025 5:41 PM, Yan Zhao wrote:
> From: Xiaoyao Li <xiaoyao.li@intel.com>
>
> Introduce SEAMCALL wrapper tdh_mem_page_demote() to invoke the SEAMCALL
> TDH_MEM_PAGE_DEMOTE, which demotes a huge leaf entry to a non-leaf entry
> in the S-EPT.
>
> SEAMCALL TDH_MEM_PAGE_DEMOTE supports the demotion of 2MB or 1GB huge leaf
> entries.
>
> The "gpa" and "level" parameters enable the SEAMCALL TDH_MEM_PAGE_DEMOTE to
> walk the S-EPT for the huge leaf entry that needs to be demoted.
>
> The "page" parameter specifies a 4KB page that will be used in the demotion
> operation to be added as a page table page in the S-EPT.
>
> Invoke tdx_clflush_page() on the 4KB page being added as a page table page.
> This function performs CLFLUSH operations on certain TDX-capable platforms,
> or conservatively on all TDX-capable platforms, to prevent dirty cache
> lines from writing back later and corrupting TD memory.
>
> tdh_mem_page_demote() may fail. Callers can check function return value and
> retrieve extended error info from the function output parameters "ext_err1"
> and "ext_err2". e.g., due to S-EPT walk error or arriving interrupts.
>
> The TDX module has many internal locks. To avoid staying in SEAM mode for
> too long, SEAMCALLs return a BUSY error code to the kernel instead of
> spinning on the locks. Depending on the specific SEAMCALL, the caller may
> need to handle this error in specific ways (e.g., retry). Therefore, return
> the SEAMCALL error code directly to the caller without attempting to handle
> it in the core kernel.
>
> Do not handle TDX_INTERRUPTED_RESTARTABLE because SEAMCALL
> TDH_MEM_PAGE_DEMOTE does not check interrupts (including NMIs) for basic
> TDX (with or without Dynamic PAMT).
The cover letter mentions that there is a new TDX module in planning, which
disables the interrupt checking. I guess the TDX module would need to have an
interface to report the change, and KVM would then decide whether to enable
huge page support for TDs?
>
> Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> Co-developed-by: Yan Zhao <yan.y.zhao@intel.com>
> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
> ---
> RFC v2:
> - Refine the patch log (Rick).
> - Do not handle TDX_INTERRUPTED_RESTARTABLE as the new TDX modules in
> planning do not check interrupts for basic TDX.
>
> RFC v1:
> - Rebased and split patch. Updated patch log.
> ---
> arch/x86/include/asm/tdx.h | 2 ++
> arch/x86/virt/vmx/tdx/tdx.c | 20 ++++++++++++++++++++
> arch/x86/virt/vmx/tdx/tdx.h | 1 +
> 3 files changed, 23 insertions(+)
>
> diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
> index f968b736871a..d2cf48e273d5 100644
> --- a/arch/x86/include/asm/tdx.h
> +++ b/arch/x86/include/asm/tdx.h
> @@ -178,6 +178,8 @@ u64 tdh_mng_key_config(struct tdx_td *td);
> u64 tdh_mng_create(struct tdx_td *td, u16 hkid);
> u64 tdh_vp_create(struct tdx_td *td, struct tdx_vp *vp);
> u64 tdh_mng_rd(struct tdx_td *td, u64 field, u64 *data);
> +u64 tdh_mem_page_demote(struct tdx_td *td, u64 gpa, int level, struct page *page,
> + u64 *ext_err1, u64 *ext_err2);
> u64 tdh_mr_extend(struct tdx_td *td, u64 gpa, u64 *ext_err1, u64 *ext_err2);
> u64 tdh_mr_finalize(struct tdx_td *td);
> u64 tdh_vp_flush(struct tdx_vp *vp);
> diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> index 580f14f64822..d941f083f741 100644
> --- a/arch/x86/virt/vmx/tdx/tdx.c
> +++ b/arch/x86/virt/vmx/tdx/tdx.c
> @@ -1825,6 +1825,26 @@ u64 tdh_mng_rd(struct tdx_td *td, u64 field, u64 *data)
> }
> EXPORT_SYMBOL_GPL(tdh_mng_rd);
>
> +u64 tdh_mem_page_demote(struct tdx_td *td, u64 gpa, int level, struct page *page,
Nit: Is it better to use a var name that clearly tells that the page is used
as a table page?
> + u64 *ext_err1, u64 *ext_err2)
> +{
> + struct tdx_module_args args = {
> + .rcx = gpa | level,
> + .rdx = tdx_tdr_pa(td),
> + .r8 = page_to_phys(page),
> + };
> + u64 ret;
> +
> + tdx_clflush_page(page);
> + ret = seamcall_ret(TDH_MEM_PAGE_DEMOTE, &args);
> +
> + *ext_err1 = args.rcx;
> + *ext_err2 = args.rdx;
> +
> + return ret;
> +}
> +EXPORT_SYMBOL_GPL(tdh_mem_page_demote);
> +
> u64 tdh_mr_extend(struct tdx_td *td, u64 gpa, u64 *ext_err1, u64 *ext_err2)
> {
> struct tdx_module_args args = {
> diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h
> index 096c78a1d438..a6c0fa53ece9 100644
> --- a/arch/x86/virt/vmx/tdx/tdx.h
> +++ b/arch/x86/virt/vmx/tdx/tdx.h
> @@ -24,6 +24,7 @@
> #define TDH_MNG_KEY_CONFIG 8
> #define TDH_MNG_CREATE 9
> #define TDH_MNG_RD 11
> +#define TDH_MEM_PAGE_DEMOTE 15
> #define TDH_MR_EXTEND 16
> #define TDH_MR_FINALIZE 17
> #define TDH_VP_FLUSH 18
^ permalink raw reply [flat|nested] 129+ messages in thread
* Re: [RFC PATCH v2 02/23] x86/virt/tdx: Add SEAMCALL wrapper tdh_mem_page_demote()
2025-09-01 8:55 ` Binbin Wu
@ 2025-09-01 9:08 ` Yan Zhao
2025-09-02 16:56 ` Edgecombe, Rick P
2025-11-11 9:15 ` Huang, Kai
0 siblings, 2 replies; 129+ messages in thread
From: Yan Zhao @ 2025-09-01 9:08 UTC (permalink / raw)
To: Binbin Wu
Cc: pbonzini, seanjc, linux-kernel, kvm, x86, rick.p.edgecombe,
dave.hansen, kas, tabba, ackerleytng, quic_eberman, michael.roth,
david, vannapurve, vbabka, thomas.lendacky, pgonda, zhiquan1.li,
fan.du, jun.miao, ira.weiny, isaku.yamahata, xiaoyao.li,
chao.p.peng
On Mon, Sep 01, 2025 at 04:55:30PM +0800, Binbin Wu wrote:
>
>
> On 8/7/2025 5:41 PM, Yan Zhao wrote:
> > From: Xiaoyao Li <xiaoyao.li@intel.com>
> >
> > Introduce SEAMCALL wrapper tdh_mem_page_demote() to invoke the SEAMCALL
> > TDH_MEM_PAGE_DEMOTE, which demotes a huge leaf entry to a non-leaf entry
> > in the S-EPT.
> >
> > SEAMCALL TDH_MEM_PAGE_DEMOTE supports the demotion of 2MB or 1GB huge leaf
> > entries.
> >
> > The "gpa" and "level" parameters enable the SEAMCALL TDH_MEM_PAGE_DEMOTE to
> > walk the S-EPT for the huge leaf entry that needs to be demoted.
> >
> > The "page" parameter specifies a 4KB page that will be used in the demotion
> > operation to be added as a page table page in the S-EPT.
> >
> > Invoke tdx_clflush_page() on the 4KB page being added as a page table page.
> > This function performs CLFLUSH operations on certain TDX-capable platforms,
> > or conservatively on all TDX-capable platforms, to prevent dirty cache
> > lines from writing back later and corrupting TD memory.
> >
> > tdh_mem_page_demote() may fail. Callers can check function return value and
> > retrieve extended error info from the function output parameters "ext_err1"
> > and "ext_err2". e.g., due to S-EPT walk error or arriving interrupts.
> >
> > The TDX module has many internal locks. To avoid staying in SEAM mode for
> > too long, SEAMCALLs return a BUSY error code to the kernel instead of
> > spinning on the locks. Depending on the specific SEAMCALL, the caller may
> > need to handle this error in specific ways (e.g., retry). Therefore, return
> > the SEAMCALL error code directly to the caller without attempting to handle
> > it in the core kernel.
> >
> > Do not handle TDX_INTERRUPTED_RESTARTABLE because SEAMCALL
> > TDH_MEM_PAGE_DEMOTE does not check interrupts (including NMIs) for basic
> > TDX (with or without Dynamic PAMT).
>
> The cover letter mentions that there is a new TDX module in planning, which
> disables the interrupt checking. I guess TDX module would need to have a
> interface to report the change, KVM then decides to enable huge page support or
> not for TDs?
Yes. But I guess detecting the TDX module version, or whether it supports a
certain feature, is a generic problem. e.g., certain versions of the TDX
module have bugs in zero-step mitigation and may block vCPUs from entering.
So, maybe it deserves a separate series?
> >
> > Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
> > Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> > Co-developed-by: Yan Zhao <yan.y.zhao@intel.com>
> > Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
> > ---
> > RFC v2:
> > - Refine the patch log (Rick).
> > - Do not handle TDX_INTERRUPTED_RESTARTABLE as the new TDX modules in
> > planning do not check interrupts for basic TDX.
> >
> > RFC v1:
> > - Rebased and split patch. Updated patch log.
> > ---
> > arch/x86/include/asm/tdx.h | 2 ++
> > arch/x86/virt/vmx/tdx/tdx.c | 20 ++++++++++++++++++++
> > arch/x86/virt/vmx/tdx/tdx.h | 1 +
> > 3 files changed, 23 insertions(+)
> >
> > diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
> > index f968b736871a..d2cf48e273d5 100644
> > --- a/arch/x86/include/asm/tdx.h
> > +++ b/arch/x86/include/asm/tdx.h
> > @@ -178,6 +178,8 @@ u64 tdh_mng_key_config(struct tdx_td *td);
> > u64 tdh_mng_create(struct tdx_td *td, u16 hkid);
> > u64 tdh_vp_create(struct tdx_td *td, struct tdx_vp *vp);
> > u64 tdh_mng_rd(struct tdx_td *td, u64 field, u64 *data);
> > +u64 tdh_mem_page_demote(struct tdx_td *td, u64 gpa, int level, struct page *page,
> > + u64 *ext_err1, u64 *ext_err2);
> > u64 tdh_mr_extend(struct tdx_td *td, u64 gpa, u64 *ext_err1, u64 *ext_err2);
> > u64 tdh_mr_finalize(struct tdx_td *td);
> > u64 tdh_vp_flush(struct tdx_vp *vp);
> > diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> > index 580f14f64822..d941f083f741 100644
> > --- a/arch/x86/virt/vmx/tdx/tdx.c
> > +++ b/arch/x86/virt/vmx/tdx/tdx.c
> > @@ -1825,6 +1825,26 @@ u64 tdh_mng_rd(struct tdx_td *td, u64 field, u64 *data)
> > }
> > EXPORT_SYMBOL_GPL(tdh_mng_rd);
> > +u64 tdh_mem_page_demote(struct tdx_td *td, u64 gpa, int level, struct page *page,
>
> Nit: Is it better to use a var name that clearly tell that the page is used as a
> table page?
Yes, Thanks!
I also plan to do it (as well as for tdx_spte_demote_private_spte(), as
mentioned in
https://lore.kernel.org/all/aKKp3fyoYgaaqidm@yzhao56-desk.sh.intel.com).
> > + u64 *ext_err1, u64 *ext_err2)
> > +{
> > + struct tdx_module_args args = {
> > + .rcx = gpa | level,
> > + .rdx = tdx_tdr_pa(td),
> > + .r8 = page_to_phys(page),
> > + };
> > + u64 ret;
> > +
> > + tdx_clflush_page(page);
> > + ret = seamcall_ret(TDH_MEM_PAGE_DEMOTE, &args);
> > +
> > + *ext_err1 = args.rcx;
> > + *ext_err2 = args.rdx;
> > +
> > + return ret;
> > +}
> > +EXPORT_SYMBOL_GPL(tdh_mem_page_demote);
> > +
> > u64 tdh_mr_extend(struct tdx_td *td, u64 gpa, u64 *ext_err1, u64 *ext_err2)
> > {
> > struct tdx_module_args args = {
> > diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h
> > index 096c78a1d438..a6c0fa53ece9 100644
> > --- a/arch/x86/virt/vmx/tdx/tdx.h
> > +++ b/arch/x86/virt/vmx/tdx/tdx.h
> > @@ -24,6 +24,7 @@
> > #define TDH_MNG_KEY_CONFIG 8
> > #define TDH_MNG_CREATE 9
> > #define TDH_MNG_RD 11
> > +#define TDH_MEM_PAGE_DEMOTE 15
> > #define TDH_MR_EXTEND 16
> > #define TDH_MR_FINALIZE 17
> > #define TDH_VP_FLUSH 18
>
>
^ permalink raw reply [flat|nested] 129+ messages in thread
* Re: [RFC PATCH v2 02/23] x86/virt/tdx: Add SEAMCALL wrapper tdh_mem_page_demote()
2025-09-01 9:08 ` Yan Zhao
@ 2025-09-02 16:56 ` Edgecombe, Rick P
2025-09-02 17:37 ` Sean Christopherson
2025-11-11 9:15 ` Huang, Kai
1 sibling, 1 reply; 129+ messages in thread
From: Edgecombe, Rick P @ 2025-09-02 16:56 UTC (permalink / raw)
To: Zhao, Yan Y, binbin.wu@linux.intel.com
Cc: kvm@vger.kernel.org, quic_eberman@quicinc.com, Li, Xiaoyao,
Du, Fan, Hansen, Dave, david@redhat.com, thomas.lendacky@amd.com,
tabba@google.com, vbabka@suse.cz, michael.roth@amd.com,
seanjc@google.com, Weiny, Ira, kas@kernel.org,
pbonzini@redhat.com, ackerleytng@google.com,
linux-kernel@vger.kernel.org, Yamahata, Isaku, Peng, Chao P,
zhiquan1.li@intel.com, Annapurve, Vishal, Miao, Jun,
x86@kernel.org, pgonda@google.com
On Mon, 2025-09-01 at 17:08 +0800, Yan Zhao wrote:
> > The cover letter mentions that there is a new TDX module in planning, which
> > disables the interrupt checking. I guess TDX module would need to have a
> > interface to report the change, KVM then decides to enable huge page support
> > or not for TDs?
> Yes. But I guess detecting TDX module version or if it supports certain
> feature is a generic problem. e.g., certain versions of TDX module have bugs
> in zero-step mitigation and may block vCPU entering.
>
We had talked in the past about not checking versions because it would require
KVM to keep track of which features are in which TDX module.
If there is a flag we could check it, but we did not ask for one here. We
already have a situation where there are bug fixes that KVM depends on, with no
way to check.
I guess the difference here is that if the behavior is missing, KVM has the
option to continue with just small pages. But at the same time, huge pages are
very likely to succeed in either case. The "feature" is closer to closing a
theoretical race, so it is very much like the many bugs we don't check for. I'm
leaning towards lumping it into that category. And we can add "how do we want
to check for TDX module bugs" to the arch todo list. But it's probably down the
list, if we even want to do anything.
What do you think?
^ permalink raw reply [flat|nested] 129+ messages in thread
* Re: [RFC PATCH v2 02/23] x86/virt/tdx: Add SEAMCALL wrapper tdh_mem_page_demote()
2025-09-02 16:56 ` Edgecombe, Rick P
@ 2025-09-02 17:37 ` Sean Christopherson
2025-09-02 17:45 ` Edgecombe, Rick P
0 siblings, 1 reply; 129+ messages in thread
From: Sean Christopherson @ 2025-09-02 17:37 UTC (permalink / raw)
To: Rick P Edgecombe
Cc: Yan Y Zhao, binbin.wu@linux.intel.com, kvm@vger.kernel.org,
quic_eberman@quicinc.com, Xiaoyao Li, Fan Du, Dave Hansen,
david@redhat.com, thomas.lendacky@amd.com, tabba@google.com,
vbabka@suse.cz, michael.roth@amd.com, Ira Weiny, kas@kernel.org,
pbonzini@redhat.com, ackerleytng@google.com,
linux-kernel@vger.kernel.org, Isaku Yamahata, Chao P Peng,
zhiquan1.li@intel.com, Vishal Annapurve, Jun Miao, x86@kernel.org,
pgonda@google.com
On Tue, Sep 02, 2025, Rick P Edgecombe wrote:
> On Mon, 2025-09-01 at 17:08 +0800, Yan Zhao wrote:
> > > The cover letter mentions that there is a new TDX module in planning, which
> > > disables the interrupt checking. I guess TDX module would need to have a
> > > interface to report the change, KVM then decides to enable huge page support
> > > or not for TDs?
> > Yes. But I guess detecting TDX module version or if it supports certain
> > feature is a generic problem. e.g., certain versions of TDX module have bugs
> > in zero-step mitigation and may block vCPU entering.
> >
>
> We had talked in the past of not checking versions because it would require KVM
> to keep logic of which features in which TDX module.
Checking for features is different from refusing to load broken modules. I don't
want KVM to rely on version numbers to query features, because that relies on
"newer" module versions always being a superset relative to "older" versions.
> If there is a flag we could check it, but we did not ask for one here. We
> already have a situation where there are bug fixes that KVM depends on, with no
> way to check.
>
> I guess the difference here is that if the behavior is missing, KVM has an
> option to continue with just small pages. But at the same time, huge pages is
> very likely to succeed in either case. The "feature" is closer to closing a
> theoretical race. So very much like the many bugs we don't check for. I'm
> leaning towards lumping it into that category. And we can add "how do we want to
> check for TDX module bugs" to the arch todo list. But it's probably down the
> list, if we even want to do anything.
>
> What do you think?
Could we taint the kernel and print a scary message if a known-buggy TDX module
is loaded?
^ permalink raw reply [flat|nested] 129+ messages in thread
* Re: [RFC PATCH v2 02/23] x86/virt/tdx: Add SEAMCALL wrapper tdh_mem_page_demote()
2025-09-02 17:37 ` Sean Christopherson
@ 2025-09-02 17:45 ` Edgecombe, Rick P
2025-09-04 9:31 ` Yan Zhao
0 siblings, 1 reply; 129+ messages in thread
From: Edgecombe, Rick P @ 2025-09-02 17:45 UTC (permalink / raw)
To: seanjc@google.com
Cc: quic_eberman@quicinc.com, Li, Xiaoyao, Du, Fan, Hansen, Dave,
david@redhat.com, thomas.lendacky@amd.com, Zhao, Yan Y,
tabba@google.com, kvm@vger.kernel.org, michael.roth@amd.com,
binbin.wu@linux.intel.com, Weiny, Ira, vbabka@suse.cz,
pbonzini@redhat.com, ackerleytng@google.com, kas@kernel.org,
Yamahata, Isaku, Peng, Chao P, linux-kernel@vger.kernel.org,
Annapurve, Vishal, Miao, Jun, zhiquan1.li@intel.com,
x86@kernel.org, pgonda@google.com
On Tue, 2025-09-02 at 10:37 -0700, Sean Christopherson wrote:
> > If there is a flag we could check it, but we did not ask for one here. We
> > already have a situation where there are bug fixes that KVM depends on, with
> > no way to check.
> >
> > I guess the difference here is that if the behavior is missing, KVM has an
> > option to continue with just small pages. But at the same time, huge pages
> > is very likely to succeed in either case. The "feature" is closer to closing
> > a theoretical race. So very much like the many bugs we don't check for. I'm
> > leaning towards lumping it into that category. And we can add "how do we
> > want to check for TDX module bugs" to the arch todo list. But it's probably
> > down the list, if we even want to do anything.
> >
> > What do you think?
>
> Could we taint the kernel and print a scary message if a known-buggy TDX
> module is loaded?
If we know which TDX modules have bugs, I guess. There may be some bugs that
only affect the guest, where tainting would not be appropriate. We would
probably want to do it at TDX module load time, so that people who don't use
TDX don't get their kernel tainted by an old TDX module in the BIOS.
What would you want a TDX module interface for this to look like? Like a bitmap
of fixed bugs? KVM keeps a list of bugs it cares about and compares it to the
list provided by the TDX module? I think it could work if KVM is OK with
selecting and keeping a bitmap of TDX module bugs.
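Purely as a sketch of that idea, with made-up names since no such interface
exists today (both the bug bits and the helper below are hypothetical, only
meant to show the shape of the check):
	/* Hypothetical bits for bug fixes KVM cares about; not a real TDX ABI. */
	#define TDX_FIXED_BUG_ZERO_STEP_STALL		BIT_ULL(0)
	#define TDX_FIXED_BUG_DEMOTE_INTERRUPT		BIT_ULL(1)
	/* Fixes KVM would require before enabling the related functionality. */
	#define TDX_FIXED_BUGS_FOR_HUGEPAGES		TDX_FIXED_BUG_DEMOTE_INTERRUPT
	static bool tdx_module_has_fixes(u64 reported_fixed_bugs, u64 required)
	{
		/* Every required fix must be reported by the TDX module. */
		return (reported_fixed_bugs & required) == required;
	}
Then KVM would do something like:
	if (!tdx_module_has_fixes(fixed_bugs, TDX_FIXED_BUGS_FOR_HUGEPAGES))
		/* fall back to 4K mappings only */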
^ permalink raw reply [flat|nested] 129+ messages in thread
* Re: [RFC PATCH v2 02/23] x86/virt/tdx: Add SEAMCALL wrapper tdh_mem_page_demote()
2025-09-02 17:45 ` Edgecombe, Rick P
@ 2025-09-04 9:31 ` Yan Zhao
0 siblings, 0 replies; 129+ messages in thread
From: Yan Zhao @ 2025-09-04 9:31 UTC (permalink / raw)
To: Edgecombe, Rick P
Cc: seanjc@google.com, quic_eberman@quicinc.com, Li, Xiaoyao, Du, Fan,
Hansen, Dave, david@redhat.com, thomas.lendacky@amd.com,
tabba@google.com, kvm@vger.kernel.org, michael.roth@amd.com,
binbin.wu@linux.intel.com, Weiny, Ira, vbabka@suse.cz,
pbonzini@redhat.com, ackerleytng@google.com, kas@kernel.org,
Yamahata, Isaku, Peng, Chao P, linux-kernel@vger.kernel.org,
Annapurve, Vishal, Miao, Jun, zhiquan1.li@intel.com,
x86@kernel.org, pgonda@google.com
On Wed, Sep 03, 2025 at 01:45:27AM +0800, Edgecombe, Rick P wrote:
> On Tue, 2025-09-02 at 10:37 -0700, Sean Christopherson wrote:
> > > If there is a flag we could check it, but we did not ask for one here. We
> > > already have a situation where there are bug fixes that KVM depends on, with
> > > no way to check.
> > >
> > > I guess the difference here is that if the behavior is missing, KVM has an
> > > option to continue with just small pages. But at the same time, huge pages
> > > is very likely to succeed in either case. The "feature" is closer to closing
> > > a theoretical race. So very much like the many bugs we don't check for. I'm
> > > leaning towards lumping it into that category. And we can add "how do we
> > > want to check for TDX module bugs" to the arch todo list. But it's probably
> > > down the list, if we even want to do anything.
> > >
> > > What do you think?
> >
> > Could we taint the kernel and print a scary message if a known-buggy TDX
> > module is loaded?
>
> If we know which TDX modules have bugs, I guess. There may be some bugs that
> only affect the guest, where tainting would not be appropriate. Probably would
> want to do it at TDX module load time, so that people that don't use TDX don't
> get their kernel tainted from an old TDX module in the BIOS.
>
> What would you want a TDX module interface for this to look like? Like a bitmap
> of fixed bugs? KVM keeps a list of bugs it cares about and compares it to the
> list provided by TDX module? I think it could work if KVM is ok selecting and
> keeping a bitmap of TDX module bugs.
Specific to the problem of TDX_INTERRUPTED_RESTARTABLE, could we choose to port
this feature to all old TDX modules?
^ permalink raw reply [flat|nested] 129+ messages in thread
* Re: [RFC PATCH v2 02/23] x86/virt/tdx: Add SEAMCALL wrapper tdh_mem_page_demote()
2025-09-01 9:08 ` Yan Zhao
2025-09-02 16:56 ` Edgecombe, Rick P
@ 2025-11-11 9:15 ` Huang, Kai
2025-11-12 8:06 ` Yan Zhao
1 sibling, 1 reply; 129+ messages in thread
From: Huang, Kai @ 2025-11-11 9:15 UTC (permalink / raw)
To: Zhao, Yan Y, binbin.wu@linux.intel.com
Cc: quic_eberman@quicinc.com, kvm@vger.kernel.org, Li, Xiaoyao,
Du, Fan, Hansen, Dave, david@redhat.com, thomas.lendacky@amd.com,
vbabka@suse.cz, tabba@google.com, michael.roth@amd.com,
seanjc@google.com, Weiny, Ira, pbonzini@redhat.com, Peng, Chao P,
ackerleytng@google.com, kas@kernel.org, Yamahata, Isaku,
linux-kernel@vger.kernel.org, Annapurve, Vishal,
Edgecombe, Rick P, Miao, Jun, zhiquan1.li@intel.com,
x86@kernel.org, pgonda@google.com
On Mon, 2025-09-01 at 17:08 +0800, Yan Zhao wrote:
> > > Do not handle TDX_INTERRUPTED_RESTARTABLE because SEAMCALL
> > > TDH_MEM_PAGE_DEMOTE does not check interrupts (including NMIs) for basic
> > > TDX (with or without Dynamic PAMT).
> >
> > The cover letter mentions that there is a new TDX module in planning, which
> > disables the interrupt checking. I guess TDX module would need to have a
> > interface to report the change, KVM then decides to enable huge page support or
> > not for TDs?
> Yes. But I guess detecting TDX module version or if it supports certain feature
> is a generic problem. e.g., certain versions of TDX module have bugs in
> zero-step mitigation and may block vCPU entering.
>
> So, maybe it deserves a separate series?
Looking at the spec (TDX module ABI spec 348551-007US), is it enumerated via
TDX_FEATURES0.ENHANCED_DEMOTE_INTERRUPTIBILITY?
5.4.25.3.9.
Interruptibility
If the TD is not partitioned (i.e., it has been configured with no L2
VMs), and the TDX Module enumerates
TDX_FEATURES0.ENHANCED_DEMOTE_INTERRUPTIBILITY as 1, TDH.MEM.PAGE.DEMOTE
is not interruptible.
So if the decision is to not use 2M pages when TDH_MEM_PAGE_DEMOTE can return
TDX_INTERRUPTED_RESTARTABLE, maybe we can just check this enumeration in the
fault handler and always make the mapping level 4K?
^ permalink raw reply [flat|nested] 129+ messages in thread
* Re: [RFC PATCH v2 02/23] x86/virt/tdx: Add SEAMCALL wrapper tdh_mem_page_demote()
2025-11-11 9:15 ` Huang, Kai
@ 2025-11-12 8:06 ` Yan Zhao
2025-11-14 9:14 ` Binbin Wu
0 siblings, 1 reply; 129+ messages in thread
From: Yan Zhao @ 2025-11-12 8:06 UTC (permalink / raw)
To: Huang, Kai
Cc: binbin.wu@linux.intel.com, quic_eberman@quicinc.com,
kvm@vger.kernel.org, Li, Xiaoyao, Du, Fan, Hansen, Dave,
david@redhat.com, thomas.lendacky@amd.com, vbabka@suse.cz,
tabba@google.com, michael.roth@amd.com, seanjc@google.com,
Weiny, Ira, pbonzini@redhat.com, Peng, Chao P,
ackerleytng@google.com, kas@kernel.org, Yamahata, Isaku,
linux-kernel@vger.kernel.org, Annapurve, Vishal,
Edgecombe, Rick P, Miao, Jun, zhiquan1.li@intel.com,
x86@kernel.org, pgonda@google.com
On Tue, Nov 11, 2025 at 05:15:22PM +0800, Huang, Kai wrote:
> On Mon, 2025-09-01 at 17:08 +0800, Yan Zhao wrote:
> > > > Do not handle TDX_INTERRUPTED_RESTARTABLE because SEAMCALL
> > > > TDH_MEM_PAGE_DEMOTE does not check interrupts (including NMIs) for basic
> > > > TDX (with or without Dynamic PAMT).
> > >
> > > The cover letter mentions that there is a new TDX module in planning, which
> > > disables the interrupt checking. I guess TDX module would need to have a
> > > interface to report the change, KVM then decides to enable huge page support or
> > > not for TDs?
> > Yes. But I guess detecting TDX module version or if it supports certain feature
> > is a generic problem. e.g., certain versions of TDX module have bugs in
> > zero-step mitigation and may block vCPU entering.
> >
> > So, maybe it deserves a separate series?
>
> Looking at the spec (TDX module ABI spec 348551-007US), is it enumerated via
> TDX_FEATURES0.ENHANCED_DEMOTE_INTERRUPTIBILITY?
Yes. I checked the unreleased TDX module code that enumerates this bit (starting
from version TDX_1.5.28.00.972). TDH.MEM.PAGE.DEMOTE will not return
TDX_INTERRUPTED_RESTARTABLE for L1 VMs.
> 5.4.25.3.9.
>
> Interruptibility
>
> If the TD is not partitioned (i.e., it has been configured with no L2
> VMs), and the TDX Module enumerates
> TDX_FEATURES0.ENHANCED_DEMOTE_INTERRUPTIBILITY as 1, TDH.MEM.PAGE.DEMOTE
> is not interruptible.
>
> So if the decision is to not use 2M page when TDH_MEM_PAGE_DEMOTE can return
> TDX_INTERRUPTED_RESTARTABLE, maybe we can just check this enumeration in
> fault handler and always make mapping level as 4K?
Thanks for this info! I think this is a very good idea and the right direction.
If no objection, I'll update the code in this way.
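Roughly something like the sketch below (illustrative only; the
tdx_supports_enhanced_demote() helper stands in for however the
TDX_FEATURES0.ENHANCED_DEMOTE_INTERRUPTIBILITY bit ends up being exposed, and
the hook name is borrowed from the private max-mapping-level callback, so
treat both names as placeholders):
	static int tdx_gmem_private_max_mapping_level(struct kvm *kvm, kvm_pfn_t pfn)
	{
		/*
		 * If TDH.MEM.PAGE.DEMOTE can still return
		 * TDX_INTERRUPTED_RESTARTABLE (the feature bit is not
		 * enumerated), avoid huge mappings that would later need
		 * to be demoted.
		 */
		if (!tdx_supports_enhanced_demote())
			return PG_LEVEL_4K;
		return PG_LEVEL_2M;
	}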
^ permalink raw reply [flat|nested] 129+ messages in thread
* Re: [RFC PATCH v2 02/23] x86/virt/tdx: Add SEAMCALL wrapper tdh_mem_page_demote()
2025-11-12 8:06 ` Yan Zhao
@ 2025-11-14 9:14 ` Binbin Wu
2025-11-14 9:21 ` Yan Zhao
0 siblings, 1 reply; 129+ messages in thread
From: Binbin Wu @ 2025-11-14 9:14 UTC (permalink / raw)
To: Yan Zhao, Huang, Kai
Cc: quic_eberman@quicinc.com, kvm@vger.kernel.org, Li, Xiaoyao,
Du, Fan, Hansen, Dave, david@redhat.com, thomas.lendacky@amd.com,
vbabka@suse.cz, tabba@google.com, michael.roth@amd.com,
seanjc@google.com, Weiny, Ira, pbonzini@redhat.com, Peng, Chao P,
ackerleytng@google.com, kas@kernel.org, Yamahata, Isaku,
linux-kernel@vger.kernel.org, Annapurve, Vishal,
Edgecombe, Rick P, Miao, Jun, zhiquan1.li@intel.com,
x86@kernel.org, pgonda@google.com
On 11/12/2025 4:06 PM, Yan Zhao wrote:
> On Tue, Nov 11, 2025 at 05:15:22PM +0800, Huang, Kai wrote:
>> On Mon, 2025-09-01 at 17:08 +0800, Yan Zhao wrote:
>>>>> Do not handle TDX_INTERRUPTED_RESTARTABLE because SEAMCALL
>>>>> TDH_MEM_PAGE_DEMOTE does not check interrupts (including NMIs) for basic
>>>>> TDX (with or without Dynamic PAMT).
>>>> The cover letter mentions that there is a new TDX module in planning, which
>>>> disables the interrupt checking. I guess TDX module would need to have a
>>>> interface to report the change, KVM then decides to enable huge page support or
>>>> not for TDs?
>>> Yes. But I guess detecting TDX module version or if it supports certain feature
>>> is a generic problem. e.g., certain versions of TDX module have bugs in
>>> zero-step mitigation and may block vCPU entering.
>>>
>>> So, maybe it deserves a separate series?
>> Looking at the spec (TDX module ABI spec 348551-007US), is it enumerated via
>> TDX_FEATURES0.ENHANCED_DEMOTE_INTERRUPTIBILITY?
> Yes. I checked the unreleased TDX module code that enumerates this bit (starting
> from version TDX_1.5.28.00.972). TDH.MEM.PAGE.DEMOTE will not return
> TDX_INTERRUPTED_RESTARTABLE for L1 VMs.
According to the content pasted by Kai below, it just says there will be no
TDX_INTERRUPTED_RESTARTABLE for TDH.MEM.PAGE.DEMOTE if there are no L2 VMs.
KVM doesn't support TD partitioning yet; just for clarification, what if the
demotion is for the L1 VM, but there are L2 VMs configured?
>
>> 5.4.25.3.9.
>>
>> Interruptibility
>>
>> If the TD is not partitioned (i.e., it has been configured with no L2
>> VMs), and the TDX Module enumerates
>> TDX_FEATURES0.ENHANCED_DEMOTE_INTERRUPTIBILITY as 1, TDH.MEM.PAGE.DEMOTE
>> is not interruptible.
>>
>> So if the decision is to not use 2M page when TDH_MEM_PAGE_DEMOTE can return
>> TDX_INTERRUPTED_RESTARTABLE, maybe we can just check this enumeration in
>> fault handler and always make mapping level as 4K?
> Thanks for this info! I think this is a very good idea and the right direction.
> If no objection, I'll update the code in this way.
>
>
>
^ permalink raw reply [flat|nested] 129+ messages in thread
* Re: [RFC PATCH v2 02/23] x86/virt/tdx: Add SEAMCALL wrapper tdh_mem_page_demote()
2025-11-14 9:14 ` Binbin Wu
@ 2025-11-14 9:21 ` Yan Zhao
0 siblings, 0 replies; 129+ messages in thread
From: Yan Zhao @ 2025-11-14 9:21 UTC (permalink / raw)
To: Binbin Wu
Cc: Huang, Kai, kvm@vger.kernel.org, Li, Xiaoyao, Du, Fan,
Hansen, Dave, david@redhat.com, thomas.lendacky@amd.com,
vbabka@suse.cz, tabba@google.com, michael.roth@amd.com,
seanjc@google.com, Weiny, Ira, pbonzini@redhat.com, Peng, Chao P,
ackerleytng@google.com, kas@kernel.org, Yamahata, Isaku,
linux-kernel@vger.kernel.org, Annapurve, Vishal,
Edgecombe, Rick P, Miao, Jun, x86@kernel.org, pgonda@google.com
On Fri, Nov 14, 2025 at 05:14:03PM +0800, Binbin Wu wrote:
>
>
> On 11/12/2025 4:06 PM, Yan Zhao wrote:
> > On Tue, Nov 11, 2025 at 05:15:22PM +0800, Huang, Kai wrote:
> > > On Mon, 2025-09-01 at 17:08 +0800, Yan Zhao wrote:
> > > > > > Do not handle TDX_INTERRUPTED_RESTARTABLE because SEAMCALL
> > > > > > TDH_MEM_PAGE_DEMOTE does not check interrupts (including NMIs) for basic
> > > > > > TDX (with or without Dynamic PAMT).
> > > > > The cover letter mentions that there is a new TDX module in planning, which
> > > > > disables the interrupt checking. I guess TDX module would need to have a
> > > > > interface to report the change, KVM then decides to enable huge page support or
> > > > > not for TDs?
> > > > Yes. But I guess detecting TDX module version or if it supports certain feature
> > > > is a generic problem. e.g., certain versions of TDX module have bugs in
> > > > zero-step mitigation and may block vCPU entering.
> > > >
> > > > So, maybe it deserves a separate series?
> > > Looking at the spec (TDX module ABI spec 348551-007US), is it enumerated via
> > > TDX_FEATURES0.ENHANCED_DEMOTE_INTERRUPTIBILITY?
> > Yes. I checked the unreleased TDX module code that enumerates this bit (starting
> > from version TDX_1.5.28.00.972). TDH.MEM.PAGE.DEMOTE will not return
> > TDX_INTERRUPTED_RESTARTABLE for L1 VMs.
>
> According to the content pasted by Kai below, it just says there will be no
> TDX_INTERRUPTED_RESTARTABLE for TDH.MEM.PAGE.DEMOTE if no L2 VMs.
>
> KVM doesn't support TD partition yet, just for clarification, what if the
> demotion is for L1 VM, but there are L2 VMs configured?
Right. The description pasted by Kai is more accurate:
"There will be no TDX_INTERRUPTED_RESTARTABLE for TDH.MEM.PAGE.DEMOTE if no L2
VMs".
From the code, DEMOTE may return TDX_INTERRUPTED_RESTARTABLE if
tdcs_ptr->management_fields.num_l2_vms is non-zero.
Thanks for flagging this.
> > > 5.4.25.3.9.
> > >
> > > Interruptibility
> > >
> > > If the TD is not partitioned (i.e., it has been configured with no L2
> > > VMs), and the TDX Module enumerates
> > > TDX_FEATURES0.ENHANCED_DEMOTE_INTERRUPTIBILITY as 1, TDH.MEM.PAGE.DEMOTE
> > > is not interruptible.
> > >
> > > So if the decision is to not use 2M page when TDH_MEM_PAGE_DEMOTE can return
> > > TDX_INTERRUPTED_RESTARTABLE, maybe we can just check this enumeration in
> > > fault handler and always make mapping level as 4K?
> > Thanks for this info! I think this is a very good idea and the right direction.
> > If no objection, I'll update the code in this way.
> >
> >
> >
>
^ permalink raw reply [flat|nested] 129+ messages in thread
* [RFC PATCH v2 03/23] x86/tdx: Enhance tdh_phymem_page_wbinvd_hkid() to invalidate huge pages
2025-08-07 9:39 [RFC PATCH v2 00/23] KVM: TDX huge page support for private memory Yan Zhao
2025-08-07 9:41 ` [RFC PATCH v2 01/23] x86/tdx: Enhance tdh_mem_page_aug() to support huge pages Yan Zhao
2025-08-07 9:41 ` [RFC PATCH v2 02/23] x86/virt/tdx: Add SEAMCALL wrapper tdh_mem_page_demote() Yan Zhao
@ 2025-08-07 9:42 ` Yan Zhao
2025-11-11 9:23 ` Huang, Kai
2025-12-10 1:14 ` Vishal Annapurve
2025-08-07 9:42 ` [RFC PATCH v2 04/23] KVM: TDX: Introduce tdx_clear_folio() to clear " Yan Zhao
` (19 subsequent siblings)
22 siblings, 2 replies; 129+ messages in thread
From: Yan Zhao @ 2025-08-07 9:42 UTC (permalink / raw)
To: pbonzini, seanjc
Cc: linux-kernel, kvm, x86, rick.p.edgecombe, dave.hansen, kas, tabba,
ackerleytng, quic_eberman, michael.roth, david, vannapurve,
vbabka, thomas.lendacky, pgonda, zhiquan1.li, fan.du, jun.miao,
ira.weiny, isaku.yamahata, xiaoyao.li, binbin.wu, chao.p.peng,
yan.y.zhao
After removing a TD's private page, the TDX module does not write back and
invalidate cache lines associated with the page and its keyID (i.e., the
TD's guest keyID). The SEAMCALL wrapper tdh_phymem_page_wbinvd_hkid()
enables the caller to provide the TD's guest keyID and physical memory
address to invoke the SEAMCALL TDH_PHYMEM_PAGE_WBINVD to perform cache line
invalidation.
Enhance the SEAMCALL wrapper tdh_phymem_page_wbinvd_hkid() to support cache
line invalidation for huge pages by introducing the parameters "folio",
"start_idx", and "npages". These parameters specify the physical memory
starting from the page at "start_idx" within a "folio" and spanning
"npages" contiguous PFNs. Return TDX_OPERAND_INVALID if the specified
memory is not entirely contained within a single folio.
Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Suggested-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
---
RFC v2:
- Enhance tdh_phymem_page_wbinvd_hkid() to invalidate multiple pages
directly, rather than looping within KVM, following Dave's suggestion:
"Don't wrap the wrappers." (Rick).
RFC v1:
- Split patch
- Added a helper tdx_wbinvd_page() in TDX, which accepts param
"struct page *".
---
arch/x86/include/asm/tdx.h | 4 ++--
arch/x86/kvm/vmx/tdx.c | 6 ++++--
arch/x86/virt/vmx/tdx/tdx.c | 17 ++++++++++++++---
3 files changed, 20 insertions(+), 7 deletions(-)
diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index d2cf48e273d5..a125bb20a28a 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -194,8 +194,8 @@ u64 tdh_mem_track(struct tdx_td *tdr);
u64 tdh_mem_page_remove(struct tdx_td *td, u64 gpa, u64 level, u64 *ext_err1, u64 *ext_err2);
u64 tdh_phymem_cache_wb(bool resume);
u64 tdh_phymem_page_wbinvd_tdr(struct tdx_td *td);
-u64 tdh_phymem_page_wbinvd_hkid(u64 hkid, struct page *page);
-
+u64 tdh_phymem_page_wbinvd_hkid(u64 hkid, struct folio *folio,
+ unsigned long start_idx, unsigned long npages);
void tdx_meminfo(struct seq_file *m);
#else
static inline void tdx_init(void) { }
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 0a2b183899d8..8eaf8431c5f1 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1694,6 +1694,7 @@ static int tdx_sept_drop_private_spte(struct kvm *kvm, gfn_t gfn,
{
int tdx_level = pg_level_to_tdx_sept_level(level);
struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
+ struct folio *folio = page_folio(page);
gpa_t gpa = gfn_to_gpa(gfn);
u64 err, entry, level_state;
@@ -1728,8 +1729,9 @@ static int tdx_sept_drop_private_spte(struct kvm *kvm, gfn_t gfn,
return -EIO;
}
- err = tdh_phymem_page_wbinvd_hkid((u16)kvm_tdx->hkid, page);
-
+ err = tdh_phymem_page_wbinvd_hkid((u16)kvm_tdx->hkid, folio,
+ folio_page_idx(folio, page),
+ KVM_PAGES_PER_HPAGE(level));
if (KVM_BUG_ON(err, kvm)) {
pr_tdx_error(TDH_PHYMEM_PAGE_WBINVD, err);
return -EIO;
diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index d941f083f741..64219c659844 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -2030,13 +2030,24 @@ u64 tdh_phymem_page_wbinvd_tdr(struct tdx_td *td)
}
EXPORT_SYMBOL_GPL(tdh_phymem_page_wbinvd_tdr);
-u64 tdh_phymem_page_wbinvd_hkid(u64 hkid, struct page *page)
+u64 tdh_phymem_page_wbinvd_hkid(u64 hkid, struct folio *folio,
+ unsigned long start_idx, unsigned long npages)
{
+ struct page *start = folio_page(folio, start_idx);
struct tdx_module_args args = {};
+ u64 err;
+
+ if (start_idx + npages > folio_nr_pages(folio))
+ return TDX_OPERAND_INVALID;
- args.rcx = mk_keyed_paddr(hkid, page);
+ for (unsigned long i = 0; i < npages; i++) {
+ args.rcx = mk_keyed_paddr(hkid, nth_page(start, i));
- return seamcall(TDH_PHYMEM_PAGE_WBINVD, &args);
+ err = seamcall(TDH_PHYMEM_PAGE_WBINVD, &args);
+ if (err)
+ break;
+ }
+ return err;
}
EXPORT_SYMBOL_GPL(tdh_phymem_page_wbinvd_hkid);
--
2.43.2
^ permalink raw reply related [flat|nested] 129+ messages in thread
* Re: [RFC PATCH v2 03/23] x86/tdx: Enhance tdh_phymem_page_wbinvd_hkid() to invalidate huge pages
2025-08-07 9:42 ` [RFC PATCH v2 03/23] x86/tdx: Enhance tdh_phymem_page_wbinvd_hkid() to invalidate huge pages Yan Zhao
@ 2025-11-11 9:23 ` Huang, Kai
2025-11-12 8:43 ` Yan Zhao
2025-12-10 1:14 ` Vishal Annapurve
1 sibling, 1 reply; 129+ messages in thread
From: Huang, Kai @ 2025-11-11 9:23 UTC (permalink / raw)
To: pbonzini@redhat.com, seanjc@google.com, Zhao, Yan Y
Cc: quic_eberman@quicinc.com, kvm@vger.kernel.org, Li, Xiaoyao,
Du, Fan, Hansen, Dave, david@redhat.com, thomas.lendacky@amd.com,
vbabka@suse.cz, tabba@google.com, kas@kernel.org,
michael.roth@amd.com, Weiny, Ira, linux-kernel@vger.kernel.org,
binbin.wu@linux.intel.com, ackerleytng@google.com,
Yamahata, Isaku, Peng, Chao P, zhiquan1.li@intel.com,
Annapurve, Vishal, Edgecombe, Rick P, Miao, Jun, x86@kernel.org,
pgonda@google.com
On Thu, 2025-08-07 at 17:42 +0800, Yan Zhao wrote:
> -u64 tdh_phymem_page_wbinvd_hkid(u64 hkid, struct page *page)
> +u64 tdh_phymem_page_wbinvd_hkid(u64 hkid, struct folio *folio,
> + unsigned long start_idx, unsigned long npages)
> {
> + struct page *start = folio_page(folio, start_idx);
> struct tdx_module_args args = {};
> + u64 err;
> +
> + if (start_idx + npages > folio_nr_pages(folio))
> + return TDX_OPERAND_INVALID;
>
> - args.rcx = mk_keyed_paddr(hkid, page);
> + for (unsigned long i = 0; i < npages; i++) {
> + args.rcx = mk_keyed_paddr(hkid, nth_page(start, i));
>
Just FYI: seems there's a series to remove nth_page() completely:
https://lore.kernel.org/kvm/20250901150359.867252-1-david@redhat.com/
^ permalink raw reply [flat|nested] 129+ messages in thread
* Re: [RFC PATCH v2 03/23] x86/tdx: Enhance tdh_phymem_page_wbinvd_hkid() to invalidate huge pages
2025-11-11 9:23 ` Huang, Kai
@ 2025-11-12 8:43 ` Yan Zhao
2025-11-12 10:29 ` Huang, Kai
0 siblings, 1 reply; 129+ messages in thread
From: Yan Zhao @ 2025-11-12 8:43 UTC (permalink / raw)
To: Huang, Kai
Cc: pbonzini@redhat.com, seanjc@google.com, quic_eberman@quicinc.com,
kvm@vger.kernel.org, Li, Xiaoyao, Du, Fan, Hansen, Dave,
david@redhat.com, thomas.lendacky@amd.com, vbabka@suse.cz,
tabba@google.com, kas@kernel.org, michael.roth@amd.com,
Weiny, Ira, linux-kernel@vger.kernel.org,
binbin.wu@linux.intel.com, ackerleytng@google.com,
Yamahata, Isaku, Peng, Chao P, zhiquan1.li@intel.com,
Annapurve, Vishal, Edgecombe, Rick P, Miao, Jun, x86@kernel.org,
pgonda@google.com
On Tue, Nov 11, 2025 at 05:23:30PM +0800, Huang, Kai wrote:
> On Thu, 2025-08-07 at 17:42 +0800, Yan Zhao wrote:
> > -u64 tdh_phymem_page_wbinvd_hkid(u64 hkid, struct page *page)
> > +u64 tdh_phymem_page_wbinvd_hkid(u64 hkid, struct folio *folio,
> > + unsigned long start_idx, unsigned long npages)
> > {
> > + struct page *start = folio_page(folio, start_idx);
> > struct tdx_module_args args = {};
> > + u64 err;
> > +
> > + if (start_idx + npages > folio_nr_pages(folio))
> > + return TDX_OPERAND_INVALID;
> >
> > - args.rcx = mk_keyed_paddr(hkid, page);
> > + for (unsigned long i = 0; i < npages; i++) {
> > + args.rcx = mk_keyed_paddr(hkid, nth_page(start, i));
> >
>
> Just FYI: seems there's a series to remove nth_page() completely:
>
> https://lore.kernel.org/kvm/20250901150359.867252-1-david@redhat.com/
Ah, thanks!
Then we can get rid of the "unsigned long i".
- for (unsigned long i = 0; i < npages; i++) {
- args.rcx = mk_keyed_paddr(hkid, nth_page(start, i));
+ while (npages--) {
+ args.rcx = mk_keyed_paddr(hkid, start++);
^ permalink raw reply [flat|nested] 129+ messages in thread
* Re: [RFC PATCH v2 03/23] x86/tdx: Enhance tdh_phymem_page_wbinvd_hkid() to invalidate huge pages
2025-11-12 8:43 ` Yan Zhao
@ 2025-11-12 10:29 ` Huang, Kai
2025-11-13 2:35 ` Yan Zhao
0 siblings, 1 reply; 129+ messages in thread
From: Huang, Kai @ 2025-11-12 10:29 UTC (permalink / raw)
To: Zhao, Yan Y
Cc: kvm@vger.kernel.org, Li, Xiaoyao, quic_eberman@quicinc.com,
Hansen, Dave, david@redhat.com, thomas.lendacky@amd.com,
vbabka@suse.cz, tabba@google.com, Du, Fan, michael.roth@amd.com,
seanjc@google.com, binbin.wu@linux.intel.com,
linux-kernel@vger.kernel.org, pbonzini@redhat.com, Weiny, Ira,
kas@kernel.org, ackerleytng@google.com, Peng, Chao P,
zhiquan1.li@intel.com, Yamahata, Isaku, Annapurve, Vishal,
Edgecombe, Rick P, Miao, Jun, x86@kernel.org, pgonda@google.com
On Wed, 2025-11-12 at 16:43 +0800, Yan Zhao wrote:
> On Tue, Nov 11, 2025 at 05:23:30PM +0800, Huang, Kai wrote:
> > On Thu, 2025-08-07 at 17:42 +0800, Yan Zhao wrote:
> > > -u64 tdh_phymem_page_wbinvd_hkid(u64 hkid, struct page *page)
> > > +u64 tdh_phymem_page_wbinvd_hkid(u64 hkid, struct folio *folio,
> > > + unsigned long start_idx, unsigned long npages)
> > > {
> > > + struct page *start = folio_page(folio, start_idx);
> > > struct tdx_module_args args = {};
> > > + u64 err;
> > > +
> > > + if (start_idx + npages > folio_nr_pages(folio))
> > > + return TDX_OPERAND_INVALID;
> > >
> > > - args.rcx = mk_keyed_paddr(hkid, page);
> > > + for (unsigned long i = 0; i < npages; i++) {
> > > + args.rcx = mk_keyed_paddr(hkid, nth_page(start, i));
> > >
> >
> > Just FYI: seems there's a series to remove nth_page() completely:
> >
> > https://lore.kernel.org/kvm/20250901150359.867252-1-david@redhat.com/
> Ah, thanks!
> Then we can get rid of the "unsigned long i".
>
> - for (unsigned long i = 0; i < npages; i++) {
> - args.rcx = mk_keyed_paddr(hkid, nth_page(start, i));
> + while (npages--) {
> + args.rcx = mk_keyed_paddr(hkid, start++);
>
You may want to be careful about doing '++' on a 'struct page *'. I am not an
expert, but I saw the discussion below on the thread [*] which led to the
series to get rid of nth_page():
>
> I wish we didn't have nth_page() at all. I really don't think it's a
> valid operation. It's been around forever, but I think it was broken
> as introduced, exactly because I don't think you can validly even have
> allocations that cross section boundaries.
Ordinary buddy allocations cannot exceed a memory section, but hugetlb and
dax can with gigantic folios ... :(
We had some weird bugs with that, because people keep forgetting that you
cannot just use page++ unconditionally with such folios.
So, why not just get the actual page for each index within the loop?
[*]:
https://lore.kernel.org/all/CAHk-=wiCYfNp4AJLBORU-c7ZyRBUp66W2-Et6cdQ4REx-GyQ_A@mail.gmail.com/T/#m49ba78f5f630b27fa6d3d0737271f047af599c60
^ permalink raw reply [flat|nested] 129+ messages in thread
* Re: [RFC PATCH v2 03/23] x86/tdx: Enhance tdh_phymem_page_wbinvd_hkid() to invalidate huge pages
2025-11-12 10:29 ` Huang, Kai
@ 2025-11-13 2:35 ` Yan Zhao
2025-11-13 7:37 ` Huang, Kai
0 siblings, 1 reply; 129+ messages in thread
From: Yan Zhao @ 2025-11-13 2:35 UTC (permalink / raw)
To: Huang, Kai
Cc: kvm@vger.kernel.org, Li, Xiaoyao, Hansen, Dave, david@redhat.com,
thomas.lendacky@amd.com, vbabka@suse.cz, tabba@google.com,
Du, Fan, michael.roth@amd.com, seanjc@google.com,
binbin.wu@linux.intel.com, linux-kernel@vger.kernel.org,
pbonzini@redhat.com, Weiny, Ira, kas@kernel.org,
ackerleytng@google.com, Peng, Chao P, Yamahata, Isaku,
Annapurve, Vishal, Edgecombe, Rick P, Miao, Jun, x86@kernel.org,
pgonda@google.com
On Wed, Nov 12, 2025 at 06:29:11PM +0800, Huang, Kai wrote:
> On Wed, 2025-11-12 at 16:43 +0800, Yan Zhao wrote:
> > On Tue, Nov 11, 2025 at 05:23:30PM +0800, Huang, Kai wrote:
> > > On Thu, 2025-08-07 at 17:42 +0800, Yan Zhao wrote:
> > > > -u64 tdh_phymem_page_wbinvd_hkid(u64 hkid, struct page *page)
> > > > +u64 tdh_phymem_page_wbinvd_hkid(u64 hkid, struct folio *folio,
> > > > + unsigned long start_idx, unsigned long npages)
> > > > {
> > > > + struct page *start = folio_page(folio, start_idx);
> > > > struct tdx_module_args args = {};
> > > > + u64 err;
> > > > +
> > > > + if (start_idx + npages > folio_nr_pages(folio))
> > > > + return TDX_OPERAND_INVALID;
> > > >
> > > > - args.rcx = mk_keyed_paddr(hkid, page);
> > > > + for (unsigned long i = 0; i < npages; i++) {
> > > > + args.rcx = mk_keyed_paddr(hkid, nth_page(start, i));
> > > >
> > >
> > > Just FYI: seems there's a series to remove nth_page() completely:
> > >
> > > https://lore.kernel.org/kvm/20250901150359.867252-1-david@redhat.com/
> > Ah, thanks!
> > Then we can get rid of the "unsigned long i".
> >
> > - for (unsigned long i = 0; i < npages; i++) {
> > - args.rcx = mk_keyed_paddr(hkid, nth_page(start, i));
> > + while (npages--) {
> > + args.rcx = mk_keyed_paddr(hkid, start++);
> >
>
> You may want to be careful about doing '++' on a 'struct page *'. I am not
Before the series removing nth_page(), the Linux kernel defines nth_page()
like this:
#if defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP)
#define nth_page(page,n) pfn_to_page(page_to_pfn((page)) + (n))
#define folio_page_idx(folio, p) (page_to_pfn(p) - folio_pfn(folio))
#else
#define nth_page(page,n) ((page) + (n))
#define folio_page_idx(folio, p) ((p) - &(folio)->page)
#endif
i.e., unless SPARSEMEM without SPARSEMEM_VMEMMAP is used, a folio's pages are contiguous.
In David's series removing nth_page(), CONFIG_SPARSEMEM_VMEMMAP is auto-selected
along with CONFIG_SPARSEMEM on all architectures but sh.
David further ensures folio pages are contiguous even on sh with the problematic
kernel configs (i.e., SPARSEMEM without SPARSEMEM_VMEMMAP) [1]:
: Currently, only a single architectures supports ARCH_HAS_GIGANTIC_PAGE
: but not SPARSEMEM_VMEMMAP: sh.
:
: Fortunately, the biggest hugetlb size sh supports is 64 MiB
: (HUGETLB_PAGE_SIZE_64MB) and the section size is at least 64 MiB
: (SECTION_SIZE_BITS == 26), so their use case is not degraded.
:
: As folios and memory sections are naturally aligned to their order-2 size
: in memory, consequently a single folio can no longer span multiple memory
: sections on these problematic kernel configs.
So it's safe to assume folio pages are contiguous.
[1] https://lore.kernel.org/kvm/20250901150359.867252-12-david@redhat.com/
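For illustration, a minimal sketch of the while (npages--) variant under that
assumption; the SEAMCALL invocation and error handling below are only my guess
at the surrounding code, not copied from the actual patch:

	u64 tdh_phymem_page_wbinvd_hkid(u64 hkid, struct folio *folio,
					unsigned long start_idx, unsigned long npages)
	{
		struct page *start = folio_page(folio, start_idx);
		struct tdx_module_args args = {};
		u64 err;

		/* The requested range must stay within the folio. */
		if (start_idx + npages > folio_nr_pages(folio))
			return TDX_OPERAND_INVALID;

		/*
		 * 'start++' is valid only once a folio's struct pages are
		 * guaranteed contiguous, i.e. after the nth_page() removal
		 * series.
		 */
		while (npages--) {
			args.rcx = mk_keyed_paddr(hkid, start++);
			err = seamcall(TDH_PHYMEM_PAGE_WBINVD, &args);
			if (err)
				return err;
		}

		return 0;
	}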
> expert, but I saw below discussion on the thread [*] which led to the series
> to get rid of nth_page():
> > I wish we didn't have nth_page() at all. I really don't think it's a
> > valid operation. It's been around forever, but I think it was broken
> > as introduced, exactly because I don't think you can validly even have
> > allocations that cross section boundaries.
>
> Ordinary buddy allocations cannot exceed a memory section, but hugetlb and
> dax can with gigantic folios ... :(
>
> We had some weird bugs with that, because people keep forgetting that you
> cannot just use page++ unconditionally with such folios.
I found Linus's reply to David [2] :
: On Tue, 5 Aug 2025 at 16:37, David Hildenbrand <david@redhat.com> wrote:
: >
: > Ordinary buddy allocations cannot exceed a memory section, but hugetlb and
: > dax can with gigantic folios ... :(
:
: Just turn that code off. Nobody sane cares.
:
: It sounds like people have bent over backwards to fix the insane case
: instead of saying "that's insane, let's not support it".
:
: And yes, "that's insane" is actually fairly recent. It's not that long
: ago that we made SPARSEMEM_VMEMMAP the mandatory option on x86-64. So
: it was all sane in a historical context, but it's not sane any more.
:
: But now it *is* the mandatory option both on x86 and arm64, so I
: really think it's time to get rid of pointless pain points.
:
: (I think powerpc still makes it an option to do sparsemem without
: vmemmap, but it *is* an option there too)
The nth_page() removal series then ensures hugetlb and dax are OK via changes
like [3]. The series then iterates over all pages in a hugetlb folio by invoking
page++, e.g., [4][5].
[2] https://lore.kernel.org/all/CAHk-=wiYLcax-5THGofwk-SAWYZ1RsP08b+rozXOm0wZRCE9UQ@mail.gmail.com
[3] https://lore.kernel.org/kvm/20250901150359.867252-7-david@redhat.com
[4] https://lore.kernel.org/kvm/20250901150359.867252-14-david@redhat.com
[5] https://lore.kernel.org/kvm/20250901150359.867252-16-david@redhat.com
> So, why not just get the actual page for each index within the loop?
We need to invoke folio_page() to get the actual page.
In [6], the new folio_page() implementation is
static inline struct page *folio_page(struct folio *folio, unsigned long n)
{
return &folio->page + n;
}
So, invoking folio_page() should be equal to page++ in our case.
[6] https://lore.kernel.org/kvm/20250901150359.867252-13-david@redhat.com
> [*]:
> https://lore.kernel.org/all/CAHk-=wiCYfNp4AJLBORU-c7ZyRBUp66W2-Et6cdQ4REx-GyQ_A@mail.gmail.com/T/#m49ba78f5f630b27fa6d3d0737271f047af599c60
^ permalink raw reply [flat|nested] 129+ messages in thread
* Re: [RFC PATCH v2 03/23] x86/tdx: Enhance tdh_phymem_page_wbinvd_hkid() to invalidate huge pages
2025-11-13 2:35 ` Yan Zhao
@ 2025-11-13 7:37 ` Huang, Kai
2025-11-13 9:03 ` Yan Zhao
2025-11-13 15:26 ` Dave Hansen
0 siblings, 2 replies; 129+ messages in thread
From: Huang, Kai @ 2025-11-13 7:37 UTC (permalink / raw)
To: Zhao, Yan Y
Cc: Du, Fan, Li, Xiaoyao, kvm@vger.kernel.org, Hansen, Dave,
david@redhat.com, thomas.lendacky@amd.com, tabba@google.com,
vbabka@suse.cz, kas@kernel.org, linux-kernel@vger.kernel.org,
seanjc@google.com, pbonzini@redhat.com, binbin.wu@linux.intel.com,
ackerleytng@google.com, michael.roth@amd.com, Weiny, Ira,
Peng, Chao P, Yamahata, Isaku, Annapurve, Vishal,
Edgecombe, Rick P, Miao, Jun, x86@kernel.org, pgonda@google.com
Thanks for all the explanation.
[...]
> In [6], the new folio_page() implementation is
>
> static inline struct page *folio_page(struct folio *folio, unsigned long n)
> {
> return &folio->page + n;
> }
>
> So, invoking folio_page() should be equal to page++ in our case.
>
> [6] https://lore.kernel.org/kvm/20250901150359.867252-13-david@redhat.com
Sure. But it seems you will need to wait all patches that you mentioned to
be merged to safely use 'page++' for pages in a folio?
And if you do:
for (i = 0; i < npages; i++)
{
struct page *p = folio_page(folio, start_idx + i);
struct tdx_module_args args = {};
args.rcx = mk_keyed_paddr(hkid, p);
...
}
It should work w/o any dependency?
Anyway, I don't have any strong opinion, as long as it works. You may
choose what you want. :-)
^ permalink raw reply [flat|nested] 129+ messages in thread
* Re: [RFC PATCH v2 03/23] x86/tdx: Enhance tdh_phymem_page_wbinvd_hkid() to invalidate huge pages
2025-11-13 7:37 ` Huang, Kai
@ 2025-11-13 9:03 ` Yan Zhao
2025-11-13 15:26 ` Dave Hansen
1 sibling, 0 replies; 129+ messages in thread
From: Yan Zhao @ 2025-11-13 9:03 UTC (permalink / raw)
To: Huang, Kai
Cc: Du, Fan, Li, Xiaoyao, kvm@vger.kernel.org, Hansen, Dave,
david@redhat.com, thomas.lendacky@amd.com, tabba@google.com,
vbabka@suse.cz, kas@kernel.org, linux-kernel@vger.kernel.org,
seanjc@google.com, pbonzini@redhat.com, binbin.wu@linux.intel.com,
ackerleytng@google.com, michael.roth@amd.com, Weiny, Ira,
Peng, Chao P, Yamahata, Isaku, Annapurve, Vishal,
Edgecombe, Rick P, Miao, Jun, x86@kernel.org, pgonda@google.com
On Thu, Nov 13, 2025 at 03:37:29PM +0800, Huang, Kai wrote:
>
> Thanks for all the explanation.
>
> [...]
>
>
> > In [6], the new folio_page() implementation is
> >
> > static inline struct page *folio_page(struct folio *folio, unsigned long n)
> > {
> > return &folio->page + n;
> > }
> >
> > So, invoking folio_page() should be equal to page++ in our case.
> >
> > [6] https://lore.kernel.org/kvm/20250901150359.867252-13-david@redhat.com
>
> Sure. But it seems you will need to wait all patches that you mentioned to
> be merged to safely use 'page++' for pages in a folio?
Correct.
> And if you do:
>
> for (i = 0; i < npages; i++)
> {
> struct page *p = folio_page(folio, start_idx + i);
> struct tdx_module_args args = {};
>
> args.rcx = mk_keyed_paddr(hkid, p);
> ...
> }
>
> It should work w/o any dependency?
>
> Anyway, I don't have any strong opinion, as long as it works. You may
> choose what you want. :-)
I don't have a strong opinion either. However, based on previous reviews, it
seems people prefer while (npages--) over introducing an additional variable.
I guess I'll choose a version depending on whether these patches are merged when
I post the next version. :)
^ permalink raw reply [flat|nested] 129+ messages in thread
* Re: [RFC PATCH v2 03/23] x86/tdx: Enhance tdh_phymem_page_wbinvd_hkid() to invalidate huge pages
2025-11-13 7:37 ` Huang, Kai
2025-11-13 9:03 ` Yan Zhao
@ 2025-11-13 15:26 ` Dave Hansen
2025-11-14 1:21 ` Yan Zhao
1 sibling, 1 reply; 129+ messages in thread
From: Dave Hansen @ 2025-11-13 15:26 UTC (permalink / raw)
To: Huang, Kai, Zhao, Yan Y
Cc: Du, Fan, Li, Xiaoyao, kvm@vger.kernel.org, david@redhat.com,
thomas.lendacky@amd.com, tabba@google.com, vbabka@suse.cz,
kas@kernel.org, linux-kernel@vger.kernel.org, seanjc@google.com,
pbonzini@redhat.com, binbin.wu@linux.intel.com,
ackerleytng@google.com, michael.roth@amd.com, Weiny, Ira,
Peng, Chao P, Yamahata, Isaku, Annapurve, Vishal,
Edgecombe, Rick P, Miao, Jun, x86@kernel.org, pgonda@google.com
On 11/12/25 23:37, Huang, Kai wrote:
> Sure. But it seems you will need to wait all patches that you mentioned to
> be merged to safely use 'page++' for pages in a folio?
>
> And if you do:
>
> for (i = 0; i < npages; i++)
> {
> struct page *p = folio_page(folio, start_idx + i);
> struct tdx_module_args args = {};
>
> args.rcx = mk_keyed_paddr(hkid, p);
> ...
> }
>
> It should work w/o any dependency?
>
> Anyway, I don't have any strong opinion, as long as it works. You may
> choose what you want. 🙂
Folks, I'll make it easy: Do what Kai suggested above. It works
universally and it's obvious. Saving an "i" variable only makes the code
harder to read.
If anyone thinks that:
while (npages--)
Is easier to understand than the most common C idiom on the planet:
for (i = 0; i < npages; i++)
... then I don't know what to tell them.
^ permalink raw reply [flat|nested] 129+ messages in thread
* Re: [RFC PATCH v2 03/23] x86/tdx: Enhance tdh_phymem_page_wbinvd_hkid() to invalidate huge pages
2025-11-13 15:26 ` Dave Hansen
@ 2025-11-14 1:21 ` Yan Zhao
0 siblings, 0 replies; 129+ messages in thread
From: Yan Zhao @ 2025-11-14 1:21 UTC (permalink / raw)
To: Dave Hansen
Cc: Huang, Kai, Du, Fan, Li, Xiaoyao, kvm@vger.kernel.org,
david@redhat.com, thomas.lendacky@amd.com, tabba@google.com,
vbabka@suse.cz, kas@kernel.org, linux-kernel@vger.kernel.org,
seanjc@google.com, pbonzini@redhat.com, binbin.wu@linux.intel.com,
ackerleytng@google.com, michael.roth@amd.com, Weiny, Ira,
Peng, Chao P, Yamahata, Isaku, Annapurve, Vishal,
Edgecombe, Rick P, Miao, Jun, x86@kernel.org, pgonda@google.com
On Thu, Nov 13, 2025 at 07:26:43AM -0800, Dave Hansen wrote:
> On 11/12/25 23:37, Huang, Kai wrote:
> > Sure. But it seems you will need to wait all patches that you mentioned to
> > be merged to safely use 'page++' for pages in a folio?
> >
> > And if you do:
> >
> > for (i = 0; i < npages; i++)
> > {
> > struct page *p = folio_page(folio, start_idx + i);
> > struct tdx_module_args args = {};
> >
> > args.rcx = mk_keyed_paddr(hkid, p);
> > ...
> > }
> >
> > It should work w/o any dependency?
> >
> > Anyway, I don't have any strong opinion, as long as it works. You may
> > choose what you want. 🙂
>
> Folks, I'll make it easy: Do what Kai suggested above. It works
> universally and it's obvious. Saving an "i" variable only makes the code
> harder to read.
>
> If anyone thinks that:
>
> while (npages--)
>
> Is easier to understand than the most common C idiom on the planet:
>
> for (i = 0; i < npages; i++)
>
> ... then I don't know what to tell them.
Got it. That's very helpful. Thanks for the instruction!
^ permalink raw reply [flat|nested] 129+ messages in thread
* Re: [RFC PATCH v2 03/23] x86/tdx: Enhance tdh_phymem_page_wbinvd_hkid() to invalidate huge pages
2025-08-07 9:42 ` [RFC PATCH v2 03/23] x86/tdx: Enhance tdh_phymem_page_wbinvd_hkid() to invalidate huge pages Yan Zhao
2025-11-11 9:23 ` Huang, Kai
@ 2025-12-10 1:14 ` Vishal Annapurve
2025-12-10 1:18 ` Yan Zhao
1 sibling, 1 reply; 129+ messages in thread
From: Vishal Annapurve @ 2025-12-10 1:14 UTC (permalink / raw)
To: Yan Zhao
Cc: pbonzini, seanjc, linux-kernel, kvm, x86, rick.p.edgecombe,
dave.hansen, kas, tabba, ackerleytng, quic_eberman, michael.roth,
david, vbabka, thomas.lendacky, pgonda, zhiquan1.li, fan.du,
jun.miao, ira.weiny, isaku.yamahata, xiaoyao.li, binbin.wu,
chao.p.peng
On Thu, Aug 7, 2025 at 2:42 AM Yan Zhao <yan.y.zhao@intel.com> wrote:
>
> index 0a2b183899d8..8eaf8431c5f1 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -1694,6 +1694,7 @@ static int tdx_sept_drop_private_spte(struct kvm *kvm, gfn_t gfn,
> {
> int tdx_level = pg_level_to_tdx_sept_level(level);
> struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
> + struct folio *folio = page_folio(page);
> gpa_t gpa = gfn_to_gpa(gfn);
> u64 err, entry, level_state;
>
> @@ -1728,8 +1729,9 @@ static int tdx_sept_drop_private_spte(struct kvm *kvm, gfn_t gfn,
> return -EIO;
> }
>
> - err = tdh_phymem_page_wbinvd_hkid((u16)kvm_tdx->hkid, page);
> -
> + err = tdh_phymem_page_wbinvd_hkid((u16)kvm_tdx->hkid, folio,
> + folio_page_idx(folio, page),
> + KVM_PAGES_PER_HPAGE(level));
This code seems to assume that folio_order() always matches the level
at which it is mapped in the EPT entries. IIUC guest_memfd can decide
to split folios to 4K for the complete huge folio before zapping the
hugepage EPT mappings. I think it's better to just round the pfn to
the hugepage address based on the level they were mapped at instead of
relying on the folio order.
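A rough sketch of what I mean (hypothetical only -- this assumes a page +
npages based wrapper rather than the folio based one in this series, and is
just to illustrate deriving the range from the mapping level instead of the
folio):

	unsigned long npages = KVM_PAGES_PER_HPAGE(level);
	/* Round down to the start of the 2M/1G mapping; ignore the folio. */
	kvm_pfn_t base_pfn = ALIGN_DOWN(page_to_pfn(page), npages);

	err = tdh_phymem_page_wbinvd_hkid((u16)kvm_tdx->hkid,
					  pfn_to_page(base_pfn), npages);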
^ permalink raw reply [flat|nested] 129+ messages in thread
* Re: [RFC PATCH v2 03/23] x86/tdx: Enhance tdh_phymem_page_wbinvd_hkid() to invalidate huge pages
2025-12-10 1:14 ` Vishal Annapurve
@ 2025-12-10 1:18 ` Yan Zhao
2025-12-10 1:30 ` Vishal Annapurve
0 siblings, 1 reply; 129+ messages in thread
From: Yan Zhao @ 2025-12-10 1:18 UTC (permalink / raw)
To: Vishal Annapurve
Cc: pbonzini, seanjc, linux-kernel, kvm, x86, rick.p.edgecombe,
dave.hansen, kas, tabba, ackerleytng, quic_eberman, michael.roth,
david, vbabka, thomas.lendacky, pgonda, zhiquan1.li, fan.du,
jun.miao, ira.weiny, isaku.yamahata, xiaoyao.li, binbin.wu,
chao.p.peng
On Tue, Dec 09, 2025 at 05:14:22PM -0800, Vishal Annapurve wrote:
> On Thu, Aug 7, 2025 at 2:42 AM Yan Zhao <yan.y.zhao@intel.com> wrote:
> >
> > index 0a2b183899d8..8eaf8431c5f1 100644
> > --- a/arch/x86/kvm/vmx/tdx.c
> > +++ b/arch/x86/kvm/vmx/tdx.c
> > @@ -1694,6 +1694,7 @@ static int tdx_sept_drop_private_spte(struct kvm *kvm, gfn_t gfn,
> > {
> > int tdx_level = pg_level_to_tdx_sept_level(level);
> > struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
> > + struct folio *folio = page_folio(page);
> > gpa_t gpa = gfn_to_gpa(gfn);
> > u64 err, entry, level_state;
> >
> > @@ -1728,8 +1729,9 @@ static int tdx_sept_drop_private_spte(struct kvm *kvm, gfn_t gfn,
> > return -EIO;
> > }
> >
> > - err = tdh_phymem_page_wbinvd_hkid((u16)kvm_tdx->hkid, page);
> > -
> > + err = tdh_phymem_page_wbinvd_hkid((u16)kvm_tdx->hkid, folio,
> > + folio_page_idx(folio, page),
> > + KVM_PAGES_PER_HPAGE(level));
>
> This code seems to assume that folio_order() always matches the level
> at which it is mapped in the EPT entries.
I don't think so.
Please check the implementation of tdh_phymem_page_wbinvd_hkid() [1].
Only npages=KVM_PAGES_PER_HPAGE(level) will be invalidated, while npages
<= folio_nr_pages(folio).
[1] https://lore.kernel.org/all/20250807094202.4481-1-yan.y.zhao@intel.com/
> IIUC guest_memfd can decide
> to split folios to 4K for the complete huge folio before zapping the
> hugepage EPT mappings. I think it's better to just round the pfn to
> the hugepage address based on the level they were mapped at instead of
> relying on the folio order.
^ permalink raw reply [flat|nested] 129+ messages in thread
* Re: [RFC PATCH v2 03/23] x86/tdx: Enhance tdh_phymem_page_wbinvd_hkid() to invalidate huge pages
2025-12-10 1:18 ` Yan Zhao
@ 2025-12-10 1:30 ` Vishal Annapurve
2025-12-10 1:55 ` Yan Zhao
0 siblings, 1 reply; 129+ messages in thread
From: Vishal Annapurve @ 2025-12-10 1:30 UTC (permalink / raw)
To: Yan Zhao
Cc: pbonzini, seanjc, linux-kernel, kvm, x86, rick.p.edgecombe,
dave.hansen, kas, tabba, ackerleytng, quic_eberman, michael.roth,
david, vbabka, thomas.lendacky, pgonda, zhiquan1.li, fan.du,
jun.miao, ira.weiny, isaku.yamahata, xiaoyao.li, binbin.wu,
chao.p.peng
On Tue, Dec 9, 2025 at 5:20 PM Yan Zhao <yan.y.zhao@intel.com> wrote:
>
> On Tue, Dec 09, 2025 at 05:14:22PM -0800, Vishal Annapurve wrote:
> > On Thu, Aug 7, 2025 at 2:42 AM Yan Zhao <yan.y.zhao@intel.com> wrote:
> > >
> > > index 0a2b183899d8..8eaf8431c5f1 100644
> > > --- a/arch/x86/kvm/vmx/tdx.c
> > > +++ b/arch/x86/kvm/vmx/tdx.c
> > > @@ -1694,6 +1694,7 @@ static int tdx_sept_drop_private_spte(struct kvm *kvm, gfn_t gfn,
> > > {
> > > int tdx_level = pg_level_to_tdx_sept_level(level);
> > > struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
> > > + struct folio *folio = page_folio(page);
> > > gpa_t gpa = gfn_to_gpa(gfn);
> > > u64 err, entry, level_state;
> > >
> > > @@ -1728,8 +1729,9 @@ static int tdx_sept_drop_private_spte(struct kvm *kvm, gfn_t gfn,
> > > return -EIO;
> > > }
> > >
> > > - err = tdh_phymem_page_wbinvd_hkid((u16)kvm_tdx->hkid, page);
> > > -
> > > + err = tdh_phymem_page_wbinvd_hkid((u16)kvm_tdx->hkid, folio,
> > > + folio_page_idx(folio, page),
> > > + KVM_PAGES_PER_HPAGE(level));
> >
> > This code seems to assume that folio_order() always matches the level
> > at which it is mapped in the EPT entries.
> I don't think so.
> > Please check the implementation of tdh_phymem_page_wbinvd_hkid() [1].
> Only npages=KVM_PAGES_PER_HPAGE(level) will be invalidated, while npages
> <= folio_nr_pages(folio).
Is the gfn passed to tdx_sept_drop_private_spte() always huge page
aligned if mapping is at huge page granularity?
If gfn/pfn is not aligned then when folio is split to 4K, page_folio()
will return the same page and folio_order and folio_page_idx() will be
zero. This can cause tdh_phymem_page_wbinvd_hkid() to return failure.
If the expectation is that page_folio() will always point to a head
page for given hugepage granularity mapping then that logic will not
work correctly IMO.
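As a concrete illustration of the failure mode (values are only for the 2MB
case, assuming the range check in the v2 wrapper):

	/* Backing folio already split to 4K, but the S-EPT entry is still 2M. */
	struct folio *folio = page_folio(page);			  /* order-0 folio */
	unsigned long start_idx = folio_page_idx(folio, page);	  /* == 0   */
	unsigned long npages = KVM_PAGES_PER_HPAGE(PG_LEVEL_2M); /* == 512 */

	/*
	 * folio_nr_pages(folio) == 1, so start_idx + npages (512) overruns
	 * the folio and the v2 wrapper returns TDX_OPERAND_INVALID before
	 * issuing the SEAMCALL.
	 */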
>
> [1] https://lore.kernel.org/all/20250807094202.4481-1-yan.y.zhao@intel.com/
>
> > IIUC guest_memfd can decide
> > to split folios to 4K for the complete huge folio before zapping the
> > hugepage EPT mappings. I think it's better to just round the pfn to
> > the hugepage address based on the level they were mapped at instead of
> > relying on the folio order.
^ permalink raw reply [flat|nested] 129+ messages in thread
* Re: [RFC PATCH v2 03/23] x86/tdx: Enhance tdh_phymem_page_wbinvd_hkid() to invalidate huge pages
2025-12-10 1:30 ` Vishal Annapurve
@ 2025-12-10 1:55 ` Yan Zhao
2025-12-31 19:37 ` Vishal Annapurve
0 siblings, 1 reply; 129+ messages in thread
From: Yan Zhao @ 2025-12-10 1:55 UTC (permalink / raw)
To: Vishal Annapurve
Cc: pbonzini, seanjc, linux-kernel, kvm, x86, rick.p.edgecombe,
dave.hansen, kas, tabba, ackerleytng, quic_eberman, michael.roth,
david, vbabka, thomas.lendacky, pgonda, zhiquan1.li, fan.du,
jun.miao, ira.weiny, isaku.yamahata, xiaoyao.li, binbin.wu,
chao.p.peng
On Tue, Dec 09, 2025 at 05:30:54PM -0800, Vishal Annapurve wrote:
> On Tue, Dec 9, 2025 at 5:20 PM Yan Zhao <yan.y.zhao@intel.com> wrote:
> >
> > On Tue, Dec 09, 2025 at 05:14:22PM -0800, Vishal Annapurve wrote:
> > > On Thu, Aug 7, 2025 at 2:42 AM Yan Zhao <yan.y.zhao@intel.com> wrote:
> > > >
> > > > index 0a2b183899d8..8eaf8431c5f1 100644
> > > > --- a/arch/x86/kvm/vmx/tdx.c
> > > > +++ b/arch/x86/kvm/vmx/tdx.c
> > > > @@ -1694,6 +1694,7 @@ static int tdx_sept_drop_private_spte(struct kvm *kvm, gfn_t gfn,
> > > > {
> > > > int tdx_level = pg_level_to_tdx_sept_level(level);
> > > > struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
> > > > + struct folio *folio = page_folio(page);
> > > > gpa_t gpa = gfn_to_gpa(gfn);
> > > > u64 err, entry, level_state;
> > > >
> > > > @@ -1728,8 +1729,9 @@ static int tdx_sept_drop_private_spte(struct kvm *kvm, gfn_t gfn,
> > > > return -EIO;
> > > > }
> > > >
> > > > - err = tdh_phymem_page_wbinvd_hkid((u16)kvm_tdx->hkid, page);
> > > > -
> > > > + err = tdh_phymem_page_wbinvd_hkid((u16)kvm_tdx->hkid, folio,
> > > > + folio_page_idx(folio, page),
> > > > + KVM_PAGES_PER_HPAGE(level));
> > >
> > > This code seems to assume that folio_order() always matches the level
> > > at which it is mapped in the EPT entries.
> > I don't think so.
> > Please check the implementation of tdh_phymem_page_wbinvd_hkid() [1].
> > Only npages=KVM_PAGES_PER_HPAGE(level) will be invalidated, while npages
> > <= folio_nr_pages(folio).
>
> Is the gfn passed to tdx_sept_drop_private_spte() always huge page
> aligned if mapping is at huge page granularity?
Yes.
The GFN passed to tdx_sept_set_private_spte() is huge page aligned in
kvm_tdp_mmu_map(). SEAMCALL TDH_MEM_PAGE_AUG will also fail otherwise.
The GFN passed to tdx_sept_remove_private_spte() comes from the same mapping
entry in the mirror EPT.
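The invariant could be spelled out like below (hypothetical checks, not part
of the series; they also assume the huge folio itself is naturally aligned,
as guest_memfd huge folios are):

	/* A huge S-EPT mapping implies a level-aligned GFN/PFN ... */
	WARN_ON_ONCE(gfn & (KVM_PAGES_PER_HPAGE(level) - 1));
	/* ... and therefore a level-aligned index within the folio. */
	WARN_ON_ONCE(folio_page_idx(folio, page) &
		     (KVM_PAGES_PER_HPAGE(level) - 1));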
> If gfn/pfn is not aligned then when folio is split to 4K, page_folio()
> will return the same page and folio_order and folio_page_idx() will be
> zero. This can cause tdh_phymem_page_wbinvd_hkid() to return failure.
>
> If the expectation is that page_folio() will always point to a head
> page for given hugepage granularity mapping then that logic will not
> work correctly IMO.
The current logic is that:
1. tdh_mem_page_aug() maps physical memory starting from the page at "start_idx"
within a "folio" and spanning "npages" contiguous PFNs.
(npages corresponds to the mapping level KVM_PAGES_PER_HPAGE(level)).
e.g. it can map at level 2MB, starting from the 4MB offset in a folio of
order 1GB.
2. if split occurs, the huge 2MB mapping will be split into 4KB ones, while the
underlying folio remains 1GB.
e.g. now the 0th 4KB mapping after split points to the 4MB offset in the
1GB folio, and the 1st 4KB mapping points to the 4MB+4KB offset...
The mapping level after split is 4KB.
3. tdx_sept_remove_private_spte() invokes tdh_mem_page_remove() and
tdh_phymem_page_wbinvd_hkid().
-The GFN is 2MB aligned and level is 2MB if split does not occur or
-The GFN is 4KB aligned and level is 4KB if split has occurred.
While the underlying folio remains 1GB, the folio_page_idx(folio, page)
specifies the offset in the folio, and the npages corresponding to
the mapping level is <= folio_nr_pages(folio).
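Putting numbers on the example above (illustrative values only):

	/*
	 * 1GB folio backing the range, 2MB mapping at the 4MB offset:
	 *   folio_nr_pages(folio)        = 262144 (1GB / 4KB)
	 *   folio_page_idx(folio, page)  =   1024 (4MB / 4KB)
	 *   KVM_PAGES_PER_HPAGE(level)   =    512 (2MB level), or 1 after
	 *                                         a split to 4KB mappings
	 *
	 * 1024 + 512 <= 262144, so the wbinvd/clear covers exactly the
	 * zapped 2MB range and stays within the folio.
	 */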
> > [1] https://lore.kernel.org/all/20250807094202.4481-1-yan.y.zhao@intel.com/
> >
> > > IIUC guest_memfd can decide
> > > to split folios to 4K for the complete huge folio before zapping the
> > > hugepage EPT mappings. I think it's better to just round the pfn to
> > > the hugepage address based on the level they were mapped at instead of
> > > relying on the folio order.
^ permalink raw reply [flat|nested] 129+ messages in thread
* Re: [RFC PATCH v2 03/23] x86/tdx: Enhance tdh_phymem_page_wbinvd_hkid() to invalidate huge pages
2025-12-10 1:55 ` Yan Zhao
@ 2025-12-31 19:37 ` Vishal Annapurve
2026-01-06 10:37 ` Yan Zhao
0 siblings, 1 reply; 129+ messages in thread
From: Vishal Annapurve @ 2025-12-31 19:37 UTC (permalink / raw)
To: Yan Zhao
Cc: pbonzini, seanjc, linux-kernel, kvm, x86, rick.p.edgecombe,
dave.hansen, kas, tabba, ackerleytng, quic_eberman, michael.roth,
david, vbabka, thomas.lendacky, pgonda, zhiquan1.li, fan.du,
jun.miao, ira.weiny, isaku.yamahata, xiaoyao.li, binbin.wu,
chao.p.peng
On Tue, Dec 9, 2025 at 5:57 PM Yan Zhao <yan.y.zhao@intel.com> wrote:
>
> On Tue, Dec 09, 2025 at 05:30:54PM -0800, Vishal Annapurve wrote:
> > On Tue, Dec 9, 2025 at 5:20 PM Yan Zhao <yan.y.zhao@intel.com> wrote:
> > >
> > > On Tue, Dec 09, 2025 at 05:14:22PM -0800, Vishal Annapurve wrote:
> > > > On Thu, Aug 7, 2025 at 2:42 AM Yan Zhao <yan.y.zhao@intel.com> wrote:
> > > > >
> > > > > index 0a2b183899d8..8eaf8431c5f1 100644
> > > > > --- a/arch/x86/kvm/vmx/tdx.c
> > > > > +++ b/arch/x86/kvm/vmx/tdx.c
> > > > > @@ -1694,6 +1694,7 @@ static int tdx_sept_drop_private_spte(struct kvm *kvm, gfn_t gfn,
> > > > > {
> > > > > int tdx_level = pg_level_to_tdx_sept_level(level);
> > > > > struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
> > > > > + struct folio *folio = page_folio(page);
> > > > > gpa_t gpa = gfn_to_gpa(gfn);
> > > > > u64 err, entry, level_state;
> > > > >
> > > > > @@ -1728,8 +1729,9 @@ static int tdx_sept_drop_private_spte(struct kvm *kvm, gfn_t gfn,
> > > > > return -EIO;
> > > > > }
> > > > >
> > > > > - err = tdh_phymem_page_wbinvd_hkid((u16)kvm_tdx->hkid, page);
> > > > > -
> > > > > + err = tdh_phymem_page_wbinvd_hkid((u16)kvm_tdx->hkid, folio,
> > > > > + folio_page_idx(folio, page),
> > > > > + KVM_PAGES_PER_HPAGE(level));
> > > >
> > > > This code seems to assume that folio_order() always matches the level
> > > > at which it is mapped in the EPT entries.
> > > I don't think so.
> > > Please check the implementation of tdh_phymem_page_wbinvd_hkid() [1].
> > > Only npages=KVM_PAGES_PER_HPAGE(level) will be invalidated, while npages
> > > <= folio_nr_pages(folio).
> >
> > Is the gfn passed to tdx_sept_drop_private_spte() always huge page
> > aligned if mapping is at huge page granularity?
> Yes.
> The GFN passed to tdx_sept_set_private_spte() is huge page aligned in
> kvm_tdp_mmu_map(). SEAMCALL TDH_MEM_PAGE_AUG will also fail otherwise.
> The GFN passed to tdx_sept_remove_private_spte() comes from the same mapping
> entry in the mirror EPT.
>
> > If gfn/pfn is not aligned then when folio is split to 4K, page_folio()
> > will return the same page and folio_order and folio_page_idx() will be
> > zero. This can cause tdh_phymem_page_wbinvd_hkid() to return failure.
> >
> > If the expectation is that page_folio() will always point to a head
> > page for given hugepage granularity mapping then that logic will not
> > work correctly IMO.
> The current logic is that:
> 1. tdh_mem_page_aug() maps physical memory starting from the page at "start_idx"
> within a "folio" and spanning "npages" contiguous PFNs.
> (npages corresponds to the mapping level KVM_PAGES_PER_HPAGE(level)).
> e.g. it can map at level 2MB, starting from the 4MB offset in a folio of
> order 1GB.
>
> 2. if split occurs, the huge 2MB mapping will be split into 4KB ones, while the
> underlying folio remains 1GB.
Private to shared conversion flow discussed so far [1][2][3]:
1) Preallocate maple tree entries needed for conversion
2) Split filemap range being converted to 4K pages
3) Mark KVM MMU invalidation begin for the huge page aligned range
4) Zap KVM MMU entries for the converted range
5) Update maple tree entries to carry final attributes
6) Mark KVM MMU invalidation end for huge page aligned range
Possible addition of splitting cross boundary leafs with the above flow:
1) Preallocate maple tree entries needed for conversion
2) Split filemap range being converted to 4K pages
3) Mark KVM MMU invalidation begin for the huge page aligned range
4) Split KVM MMU private boundary leafs for converted range
5) Zap KVM MMU entries for the converted range
6) Update maple tree entries to carry final attributes
7) Mark KVM MMU invalidation end for huge page aligned range
Note that in both of the above flows KVM MMU entries will get zapped
after the folio is split to 4K, i.e. when tdx_sept_remove_private_spte()
happens the folio will already have been split, but the EPT entry level
will still be 2M, so the assumption that EPT entries are always a subset
of folios will not hold true.
I think things might be simplified if the KVM TDX stack always operates
on pages without assuming ranges are covered by "folios".
[1] https://lore.kernel.org/kvm/aN8P87AXlxlEDdpP@google.com/
[2] https://lore.kernel.org/kvm/diqzzf8oazh4.fsf@google.com/
[3] https://github.com/googleprodkernel/linux-cc/blob/9ee2bd65cc9b63c871f8f49d217a7a70576a942d/virt/kvm/guest_memfd.c#L894
> e.g. now the 0th 4KB mapping after split points to the 4MB offset in the
> 1GB folio, and the 1st 4KB mapping points to the 4MB+4KB offset...
> The mapping level after split is 4KB.
>
> 3. tdx_sept_remove_private_spte() invokes tdh_mem_page_remove() and
> tdh_phymem_page_wbinvd_hkid().
> -The GFN is 2MB aligned and level is 2MB if split does not occur or
> -The GFN is 4KB aligned and level is 4KB if split has occurred.
> While the underlying folio remains 1GB, the folio_page_idx(folio, page)
> specifies the offset in the folio, and the npages corresponding to
> the mapping level is <= folio_nr_pages(folio).
>
>
> > > [1] https://lore.kernel.org/all/20250807094202.4481-1-yan.y.zhao@intel.com/
> > >
> > > > IIUC guest_memfd can decide
> > > > to split folios to 4K for the complete huge folio before zapping the
> > > > hugepage EPT mappings. I think it's better to just round the pfn to
> > > > the hugepage address based on the level they were mapped at instead of
> > > > relying on the folio order.
^ permalink raw reply [flat|nested] 129+ messages in thread
* Re: [RFC PATCH v2 03/23] x86/tdx: Enhance tdh_phymem_page_wbinvd_hkid() to invalidate huge pages
2025-12-31 19:37 ` Vishal Annapurve
@ 2026-01-06 10:37 ` Yan Zhao
0 siblings, 0 replies; 129+ messages in thread
From: Yan Zhao @ 2026-01-06 10:37 UTC (permalink / raw)
To: Vishal Annapurve
Cc: pbonzini, seanjc, linux-kernel, kvm, x86, rick.p.edgecombe,
dave.hansen, kas, tabba, ackerleytng, quic_eberman, michael.roth,
david, vbabka, thomas.lendacky, pgonda, zhiquan1.li, fan.du,
jun.miao, ira.weiny, isaku.yamahata, xiaoyao.li, binbin.wu,
chao.p.peng
On Wed, Dec 31, 2025 at 11:37:26AM -0800, Vishal Annapurve wrote:
> On Tue, Dec 9, 2025 at 5:57 PM Yan Zhao <yan.y.zhao@intel.com> wrote:
> >
> > On Tue, Dec 09, 2025 at 05:30:54PM -0800, Vishal Annapurve wrote:
> > > On Tue, Dec 9, 2025 at 5:20 PM Yan Zhao <yan.y.zhao@intel.com> wrote:
> > > >
> > > > On Tue, Dec 09, 2025 at 05:14:22PM -0800, Vishal Annapurve wrote:
> > > > > On Thu, Aug 7, 2025 at 2:42 AM Yan Zhao <yan.y.zhao@intel.com> wrote:
> > > > > >
> > > > > > index 0a2b183899d8..8eaf8431c5f1 100644
> > > > > > --- a/arch/x86/kvm/vmx/tdx.c
> > > > > > +++ b/arch/x86/kvm/vmx/tdx.c
> > > > > > @@ -1694,6 +1694,7 @@ static int tdx_sept_drop_private_spte(struct kvm *kvm, gfn_t gfn,
> > > > > > {
> > > > > > int tdx_level = pg_level_to_tdx_sept_level(level);
> > > > > > struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
> > > > > > + struct folio *folio = page_folio(page);
> > > > > > gpa_t gpa = gfn_to_gpa(gfn);
> > > > > > u64 err, entry, level_state;
> > > > > >
> > > > > > @@ -1728,8 +1729,9 @@ static int tdx_sept_drop_private_spte(struct kvm *kvm, gfn_t gfn,
> > > > > > return -EIO;
> > > > > > }
> > > > > >
> > > > > > - err = tdh_phymem_page_wbinvd_hkid((u16)kvm_tdx->hkid, page);
> > > > > > -
> > > > > > + err = tdh_phymem_page_wbinvd_hkid((u16)kvm_tdx->hkid, folio,
> > > > > > + folio_page_idx(folio, page),
> > > > > > + KVM_PAGES_PER_HPAGE(level));
> > > > >
> > > > > This code seems to assume that folio_order() always matches the level
> > > > > at which it is mapped in the EPT entries.
> > > > I don't think so.
> > > > Please check the implementation of tdh_phymem_page_wbinvd_hkid() [1].
> > > > Only npages=KVM_PAGES_PER_HPAGE(level) will be invalidated, while npages
> > > > <= folio_nr_pages(folio).
> > >
> > > Is the gfn passed to tdx_sept_drop_private_spte() always huge page
> > > aligned if mapping is at huge page granularity?
> > Yes.
> > The GFN passed to tdx_sept_set_private_spte() is huge page aligned in
> > kvm_tdp_mmu_map(). SEAMCALL TDH_MEM_PAGE_AUG will also fail otherwise.
> > The GFN passed to tdx_sept_remove_private_spte() comes from the same mapping
> > entry in the mirror EPT.
> >
> > > If gfn/pfn is not aligned then when folio is split to 4K, page_folio()
> > > will return the same page and folio_order and folio_page_idx() will be
> > > zero. This can cause tdh_phymem_page_wbinvd_hkid() to return failure.
> > >
> > > If the expectation is that page_folio() will always point to a head
> > > page for given hugepage granularity mapping then that logic will not
> > > work correctly IMO.
> > The current logic is that:
> > 1. tdh_mem_page_aug() maps physical memory starting from the page at "start_idx"
> > within a "folio" and spanning "npages" contiguous PFNs.
> > (npages corresponds to the mapping level KVM_PAGES_PER_HPAGE(level)).
> > e.g. it can map at level 2MB, starting from the 4MB offset in a folio of
> > order 1GB.
> >
> > 2. if split occurs, the huge 2MB mapping will be split into 4KB ones, while the
> > underlying folio remains 1GB.
>
> Private to shared conversion flow discussed so far [1][2][3]:
> 1) Preallocate maple tree entries needed for conversion
> 2) Split filemap range being converted to 4K pages
> 3) Mark KVM MMU invalidation begin for the huge page aligned range
> 4) Zap KVM MMU entries for the converted range
> 5) Update maple tree entries to carry final attributes
> 6) Mark KVM MMU invalidation end for huge page aligned range
>
> Possible addition of splitting cross boundary leafs with the above flow:
> 1) Preallocate maple tree entries needed for conversion
> 2) Split filemap range being converted to 4K pages
> 3) Mark KVM MMU invalidation begin for the huge page aligned range
> 4) Split KVM MMU private boundary leafs for converted range
> 5) Zap KVM MMU entries for the converted range
> 6) Update maple tree entries to carry final attributes
> 7) Mark KVM MMU invalidation end for huge page aligned range
>
> Note that in both of the above flows KVM MMU entries will get zapped
> after the folio is split to 4K, i.e. when tdx_sept_remove_private_spte()
> happens the folio will already have been split, but the EPT entry level
> will still be 2M, so the assumption that EPT entries are always a subset
> of folios will not hold true.
>
> I think things might be simplified if the KVM TDX stack always operates
> on pages without assuming ranges are covered by "folios".
Let's discuss that in v3 series
https://lore.kernel.org/all/20260106101646.24809-1-yan.y.zhao@intel.com/
> [1] https://lore.kernel.org/kvm/aN8P87AXlxlEDdpP@google.com/
> [2] https://lore.kernel.org/kvm/diqzzf8oazh4.fsf@google.com/
> [3] https://github.com/googleprodkernel/linux-cc/blob/9ee2bd65cc9b63c871f8f49d217a7a70576a942d/virt/kvm/guest_memfd.c#L894
>
> > e.g. now the 0th 4KB mapping after split points to the 4MB offset in the
> > 1GB folio, and the 1st 4KB mapping points to the 4MB+4KB offset...
> > The mapping level after split is 4KB.
> >
> > 3. tdx_sept_remove_private_spte() invokes tdh_mem_page_remove() and
> > tdh_phymem_page_wbinvd_hkid().
> > -The GFN is 2MB aligned and level is 2MB if split does not occur or
> > -The GFN is 4KB aligned and level is 4KB if split has occurred.
> > While the underlying folio remains 1GB, the folio_page_idx(folio, page)
> > specifies the offset in the folio, and the npages corresponding to
> > the mapping level is <= folio_nr_pages(folio).
> >
> >
> > > > [1] https://lore.kernel.org/all/20250807094202.4481-1-yan.y.zhao@intel.com/
> > > >
> > > > > IIUC guest_memfd can decide
> > > > > to split folios to 4K for the complete huge folio before zapping the
> > > > > hugepage EPT mappings. I think it's better to just round the pfn to
> > > > > the hugepage address based on the level they were mapped at instead of
> > > > > relying on the folio order.
^ permalink raw reply [flat|nested] 129+ messages in thread
* [RFC PATCH v2 04/23] KVM: TDX: Introduce tdx_clear_folio() to clear huge pages
2025-08-07 9:39 [RFC PATCH v2 00/23] KVM: TDX huge page support for private memory Yan Zhao
` (2 preceding siblings ...)
2025-08-07 9:42 ` [RFC PATCH v2 03/23] x86/tdx: Enhance tdh_phymem_page_wbinvd_hkid() to invalidate huge pages Yan Zhao
@ 2025-08-07 9:42 ` Yan Zhao
2025-09-02 2:56 ` Binbin Wu
2025-08-07 9:42 ` [RFC PATCH v2 05/23] x86/tdx: Enhance tdh_phymem_page_reclaim() to support " Yan Zhao
` (18 subsequent siblings)
22 siblings, 1 reply; 129+ messages in thread
From: Yan Zhao @ 2025-08-07 9:42 UTC (permalink / raw)
To: pbonzini, seanjc
Cc: linux-kernel, kvm, x86, rick.p.edgecombe, dave.hansen, kas, tabba,
ackerleytng, quic_eberman, michael.roth, david, vannapurve,
vbabka, thomas.lendacky, pgonda, zhiquan1.li, fan.du, jun.miao,
ira.weiny, isaku.yamahata, xiaoyao.li, binbin.wu, chao.p.peng,
yan.y.zhao
After removing or reclaiming a guest private page or a control page from a
TD, zero the physical page using movdir64b(), enabling the kernel to reuse
the pages.
Introduce the function tdx_clear_folio() to zero out physical memory using
movdir64b(), starting from the page at "start_idx" within a "folio" and
spanning "npages" contiguous PFNs.
Convert tdx_clear_page() to be a helper function to facilitate the
zeroing of 4KB pages.
Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
---
RFC v2:
- Add tdx_clear_folio().
- Drop inner loop _tdx_clear_page() and move __mb() outside of the loop.
(Rick)
- Use C99-style definition of variables inside a for loop.
- Note: [1] also changes tdx_clear_page(). RFC v2 is not based on [1] now.
[1] https://lore.kernel.org/all/20250724130354.79392-2-adrian.hunter@intel.com
RFC v1:
- split out, let tdx_clear_page() accept level.
---
arch/x86/kvm/vmx/tdx.c | 22 ++++++++++++++++------
1 file changed, 16 insertions(+), 6 deletions(-)
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 8eaf8431c5f1..4fabefb27135 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -277,18 +277,21 @@ static inline void tdx_disassociate_vp(struct kvm_vcpu *vcpu)
vcpu->cpu = -1;
}
-static void tdx_clear_page(struct page *page)
+static void tdx_clear_folio(struct folio *folio, unsigned long start_idx,
+ unsigned long npages)
{
const void *zero_page = (const void *) page_to_virt(ZERO_PAGE(0));
- void *dest = page_to_virt(page);
- unsigned long i;
/*
* The page could have been poisoned. MOVDIR64B also clears
* the poison bit so the kernel can safely use the page again.
*/
- for (i = 0; i < PAGE_SIZE; i += 64)
- movdir64b(dest + i, zero_page);
+ for (unsigned long j = 0; j < npages; j++) {
+ void *dest = page_to_virt(folio_page(folio, start_idx + j));
+
+ for (unsigned long i = 0; i < PAGE_SIZE; i += 64)
+ movdir64b(dest + i, zero_page);
+ }
/*
* MOVDIR64B store uses WC buffer. Prevent following memory reads
* from seeing potentially poisoned cache.
@@ -296,6 +299,13 @@ static void tdx_clear_page(struct page *page)
__mb();
}
+static inline void tdx_clear_page(struct page *page)
+{
+ struct folio *folio = page_folio(page);
+
+ tdx_clear_folio(folio, folio_page_idx(folio, page), 1);
+}
+
static void tdx_no_vcpus_enter_start(struct kvm *kvm)
{
struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
@@ -1736,7 +1746,7 @@ static int tdx_sept_drop_private_spte(struct kvm *kvm, gfn_t gfn,
pr_tdx_error(TDH_PHYMEM_PAGE_WBINVD, err);
return -EIO;
}
- tdx_clear_page(page);
+ tdx_clear_folio(folio, folio_page_idx(folio, page), KVM_PAGES_PER_HPAGE(level));
tdx_pamt_put(page, level);
tdx_unpin(kvm, page);
return 0;
--
2.43.2
^ permalink raw reply related [flat|nested] 129+ messages in thread
* Re: [RFC PATCH v2 04/23] KVM: TDX: Introduce tdx_clear_folio() to clear huge pages
2025-08-07 9:42 ` [RFC PATCH v2 04/23] KVM: TDX: Introduce tdx_clear_folio() to clear " Yan Zhao
@ 2025-09-02 2:56 ` Binbin Wu
2025-09-03 9:51 ` Yan Zhao
0 siblings, 1 reply; 129+ messages in thread
From: Binbin Wu @ 2025-09-02 2:56 UTC (permalink / raw)
To: Yan Zhao
Cc: pbonzini, seanjc, linux-kernel, kvm, x86, rick.p.edgecombe,
dave.hansen, kas, tabba, ackerleytng, quic_eberman, michael.roth,
david, vannapurve, vbabka, thomas.lendacky, pgonda, zhiquan1.li,
fan.du, jun.miao, ira.weiny, isaku.yamahata, xiaoyao.li,
chao.p.peng
On 8/7/2025 5:42 PM, Yan Zhao wrote:
> After removing or reclaiming a guest private page or a control page from a
> TD, zero the physical page using movdir64b(), enabling the kernel to reuse
> the pages.
>
> Introduce the function tdx_clear_folio() to zero out physical memory using
> movdir64b(), starting from the page at "start_idx" within a "folio" and
> spanning "npages" contiguous PFNs.
>
> Convert tdx_clear_page() to be a helper function to facilitate the
> zeroing of 4KB pages.
I think this sentence is outdated?
>
> Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
> ---
> RFC v2:
> - Add tdx_clear_folio().
> - Drop inner loop _tdx_clear_page() and move __mb() outside of the loop.
> (Rick)
> - Use C99-style definition of variables inside a for loop.
> - Note: [1] also changes tdx_clear_page(). RFC v2 is not based on [1] now.
>
> [1] https://lore.kernel.org/all/20250724130354.79392-2-adrian.hunter@intel.com
>
> RFC v1:
> - split out, let tdx_clear_page() accept level.
> ---
> arch/x86/kvm/vmx/tdx.c | 22 ++++++++++++++++------
> 1 file changed, 16 insertions(+), 6 deletions(-)
>
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index 8eaf8431c5f1..4fabefb27135 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -277,18 +277,21 @@ static inline void tdx_disassociate_vp(struct kvm_vcpu *vcpu)
> vcpu->cpu = -1;
> }
>
> -static void tdx_clear_page(struct page *page)
> +static void tdx_clear_folio(struct folio *folio, unsigned long start_idx,
> + unsigned long npages)
> {
> const void *zero_page = (const void *) page_to_virt(ZERO_PAGE(0));
> - void *dest = page_to_virt(page);
> - unsigned long i;
>
> /*
> * The page could have been poisoned. MOVDIR64B also clears
> * the poison bit so the kernel can safely use the page again.
> */
> - for (i = 0; i < PAGE_SIZE; i += 64)
> - movdir64b(dest + i, zero_page);
> + for (unsigned long j = 0; j < npages; j++) {
> + void *dest = page_to_virt(folio_page(folio, start_idx + j));
> +
> + for (unsigned long i = 0; i < PAGE_SIZE; i += 64)
> + movdir64b(dest + i, zero_page);
> + }
> /*
> * MOVDIR64B store uses WC buffer. Prevent following memory reads
> * from seeing potentially poisoned cache.
> @@ -296,6 +299,13 @@ static void tdx_clear_page(struct page *page)
> __mb();
> }
>
> +static inline void tdx_clear_page(struct page *page)
No need to tag a local static function with "inline".
> +{
> + struct folio *folio = page_folio(page);
> +
> + tdx_clear_folio(folio, folio_page_idx(folio, page), 1);
This seemed strange to me at first.
Then I realized that it is to avoid an unnecessary memory barrier.
No better idea so far.
> +}
> +
> static void tdx_no_vcpus_enter_start(struct kvm *kvm)
> {
> struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
> @@ -1736,7 +1746,7 @@ static int tdx_sept_drop_private_spte(struct kvm *kvm, gfn_t gfn,
> pr_tdx_error(TDH_PHYMEM_PAGE_WBINVD, err);
> return -EIO;
> }
> - tdx_clear_page(page);
> + tdx_clear_folio(folio, folio_page_idx(folio, page), KVM_PAGES_PER_HPAGE(level));
> tdx_pamt_put(page, level);
> tdx_unpin(kvm, page);
> return 0;
^ permalink raw reply [flat|nested] 129+ messages in thread
* Re: [RFC PATCH v2 04/23] KVM: TDX: Introduce tdx_clear_folio() to clear huge pages
2025-09-02 2:56 ` Binbin Wu
@ 2025-09-03 9:51 ` Yan Zhao
2025-09-03 11:19 ` Binbin Wu
0 siblings, 1 reply; 129+ messages in thread
From: Yan Zhao @ 2025-09-03 9:51 UTC (permalink / raw)
To: Binbin Wu
Cc: pbonzini, seanjc, linux-kernel, kvm, x86, rick.p.edgecombe,
dave.hansen, kas, tabba, ackerleytng, quic_eberman, michael.roth,
david, vannapurve, vbabka, thomas.lendacky, pgonda, zhiquan1.li,
fan.du, jun.miao, ira.weiny, isaku.yamahata, xiaoyao.li,
chao.p.peng
On Tue, Sep 02, 2025 at 10:56:25AM +0800, Binbin Wu wrote:
>
>
> On 8/7/2025 5:42 PM, Yan Zhao wrote:
> > After removing or reclaiming a guest private page or a control page from a
> > TD, zero the physical page using movdir64b(), enabling the kernel to reuse
> > the pages.
> >
> > Introduce the function tdx_clear_folio() to zero out physical memory using
> > movdir64b(), starting from the page at "start_idx" within a "folio" and
> > spanning "npages" contiguous PFNs.
> >
> > Convert tdx_clear_page() to be a helper function to facilitate the
> > zeroing of 4KB pages.
>
> I think this sentence is outdated?
No? tdx_clear_page() is still invoked to clear tdr_page.
> >
> > Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
> > Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> > Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
> > ---
> > RFC v2:
> > - Add tdx_clear_folio().
> > - Drop inner loop _tdx_clear_page() and move __mb() outside of the loop.
> > (Rick)
> > - Use C99-style definition of variables inside a for loop.
> > - Note: [1] also changes tdx_clear_page(). RFC v2 is not based on [1] now.
> >
> > [1] https://lore.kernel.org/all/20250724130354.79392-2-adrian.hunter@intel.com
> >
> > RFC v1:
> > - split out, let tdx_clear_page() accept level.
> > ---
> > arch/x86/kvm/vmx/tdx.c | 22 ++++++++++++++++------
> > 1 file changed, 16 insertions(+), 6 deletions(-)
> >
> > diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> > index 8eaf8431c5f1..4fabefb27135 100644
> > --- a/arch/x86/kvm/vmx/tdx.c
> > +++ b/arch/x86/kvm/vmx/tdx.c
> > @@ -277,18 +277,21 @@ static inline void tdx_disassociate_vp(struct kvm_vcpu *vcpu)
> > vcpu->cpu = -1;
> > }
> > -static void tdx_clear_page(struct page *page)
> > +static void tdx_clear_folio(struct folio *folio, unsigned long start_idx,
> > + unsigned long npages)
> > {
> > const void *zero_page = (const void *) page_to_virt(ZERO_PAGE(0));
> > - void *dest = page_to_virt(page);
> > - unsigned long i;
> > /*
> > * The page could have been poisoned. MOVDIR64B also clears
> > * the poison bit so the kernel can safely use the page again.
> > */
> > - for (i = 0; i < PAGE_SIZE; i += 64)
> > - movdir64b(dest + i, zero_page);
> > + for (unsigned long j = 0; j < npages; j++) {
> > + void *dest = page_to_virt(folio_page(folio, start_idx + j));
> > +
> > + for (unsigned long i = 0; i < PAGE_SIZE; i += 64)
> > + movdir64b(dest + i, zero_page);
> > + }
> > /*
> > * MOVDIR64B store uses WC buffer. Prevent following memory reads
> > * from seeing potentially poisoned cache.
> > @@ -296,6 +299,13 @@ static void tdx_clear_page(struct page *page)
> > __mb();
> > }
> > +static inline void tdx_clear_page(struct page *page)
> No need to tag a local static function with "inline".
Ok.
^ permalink raw reply [flat|nested] 129+ messages in thread
* Re: [RFC PATCH v2 04/23] KVM: TDX: Introduce tdx_clear_folio() to clear huge pages
2025-09-03 9:51 ` Yan Zhao
@ 2025-09-03 11:19 ` Binbin Wu
2025-09-04 2:53 ` Yan Zhao
0 siblings, 1 reply; 129+ messages in thread
From: Binbin Wu @ 2025-09-03 11:19 UTC (permalink / raw)
To: Yan Zhao
Cc: pbonzini, seanjc, linux-kernel, kvm, x86, rick.p.edgecombe,
dave.hansen, kas, tabba, ackerleytng, quic_eberman, michael.roth,
david, vannapurve, vbabka, thomas.lendacky, pgonda, zhiquan1.li,
fan.du, jun.miao, ira.weiny, isaku.yamahata, xiaoyao.li,
chao.p.peng
On 9/3/2025 5:51 PM, Yan Zhao wrote:
> On Tue, Sep 02, 2025 at 10:56:25AM +0800, Binbin Wu wrote:
>>
>> On 8/7/2025 5:42 PM, Yan Zhao wrote:
>>> After removing or reclaiming a guest private page or a control page from a
>>> TD, zero the physical page using movdir64b(), enabling the kernel to reuse
>>> the pages.
>>>
>>> Introduce the function tdx_clear_folio() to zero out physical memory using
>>> movdir64b(), starting from the page at "start_idx" within a "folio" and
>>> spanning "npages" contiguous PFNs.
>>>
>>> Convert tdx_clear_page() to be a helper function to facilitate the
>>> zeroing of 4KB pages.
>> I think this sentence is outdated?
> No? tdx_clear_page() is still invoked to clear tdr_page.
I didn't get the word "Convert".
>
>>> Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
>>> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
>>> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
>>> ---
>>> RFC v2:
>>> - Add tdx_clear_folio().
>>> - Drop inner loop _tdx_clear_page() and move __mb() outside of the loop.
>>> (Rick)
>>> - Use C99-style definition of variables inside a for loop.
>>> - Note: [1] also changes tdx_clear_page(). RFC v2 is not based on [1] now.
>>>
>>> [1] https://lore.kernel.org/all/20250724130354.79392-2-adrian.hunter@intel.com
>>>
>>> RFC v1:
>>> - split out, let tdx_clear_page() accept level.
>>> ---
>>> arch/x86/kvm/vmx/tdx.c | 22 ++++++++++++++++------
>>> 1 file changed, 16 insertions(+), 6 deletions(-)
>>>
>>> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
>>> index 8eaf8431c5f1..4fabefb27135 100644
>>> --- a/arch/x86/kvm/vmx/tdx.c
>>> +++ b/arch/x86/kvm/vmx/tdx.c
>>> @@ -277,18 +277,21 @@ static inline void tdx_disassociate_vp(struct kvm_vcpu *vcpu)
>>> vcpu->cpu = -1;
>>> }
>>> -static void tdx_clear_page(struct page *page)
>>> +static void tdx_clear_folio(struct folio *folio, unsigned long start_idx,
>>> + unsigned long npages)
>>> {
>>> const void *zero_page = (const void *) page_to_virt(ZERO_PAGE(0));
>>> - void *dest = page_to_virt(page);
>>> - unsigned long i;
>>> /*
>>> * The page could have been poisoned. MOVDIR64B also clears
>>> * the poison bit so the kernel can safely use the page again.
>>> */
>>> - for (i = 0; i < PAGE_SIZE; i += 64)
>>> - movdir64b(dest + i, zero_page);
>>> + for (unsigned long j = 0; j < npages; j++) {
>>> + void *dest = page_to_virt(folio_page(folio, start_idx + j));
>>> +
>>> + for (unsigned long i = 0; i < PAGE_SIZE; i += 64)
>>> + movdir64b(dest + i, zero_page);
>>> + }
>>> /*
>>> * MOVDIR64B store uses WC buffer. Prevent following memory reads
>>> * from seeing potentially poisoned cache.
>>> @@ -296,6 +299,13 @@ static void tdx_clear_page(struct page *page)
>>> __mb();
>>> }
>>> +static inline void tdx_clear_page(struct page *page)
>> No need to tag a local static function with "inline".
> Ok.
>
^ permalink raw reply [flat|nested] 129+ messages in thread
* Re: [RFC PATCH v2 04/23] KVM: TDX: Introduce tdx_clear_folio() to clear huge pages
2025-09-03 11:19 ` Binbin Wu
@ 2025-09-04 2:53 ` Yan Zhao
0 siblings, 0 replies; 129+ messages in thread
From: Yan Zhao @ 2025-09-04 2:53 UTC (permalink / raw)
To: Binbin Wu
Cc: pbonzini, seanjc, linux-kernel, kvm, x86, rick.p.edgecombe,
dave.hansen, kas, tabba, ackerleytng, quic_eberman, michael.roth,
david, vannapurve, vbabka, thomas.lendacky, pgonda, fan.du,
jun.miao, ira.weiny, isaku.yamahata, xiaoyao.li, chao.p.peng
On Wed, Sep 03, 2025 at 07:19:32PM +0800, Binbin Wu wrote:
>
>
> On 9/3/2025 5:51 PM, Yan Zhao wrote:
> > On Tue, Sep 02, 2025 at 10:56:25AM +0800, Binbin Wu wrote:
> > >
> > > On 8/7/2025 5:42 PM, Yan Zhao wrote:
> > > > After removing or reclaiming a guest private page or a control page from a
> > > > TD, zero the physical page using movdir64b(), enabling the kernel to reuse
> > > > the pages.
> > > >
> > > > Introduce the function tdx_clear_folio() to zero out physical memory using
> > > > movdir64b(), starting from the page at "start_idx" within a "folio" and
> > > > spanning "npages" contiguous PFNs.
> > > >
> > > > Convert tdx_clear_page() to be a helper function to facilitate the
> > > > zeroing of 4KB pages.
> > > I think this sentence is outdated?
> > No? tdx_clear_page() is still invoked to clear tdr_page.
>
> I didn't get the word "Convert".
Ok. I wanted to express that tdx_clear_page() is now just a helper.
Will rephrase it to
"Make tdx_clear_page() a helper function to facilitate the zeroing
of 4KB pages".
^ permalink raw reply [flat|nested] 129+ messages in thread
* [RFC PATCH v2 05/23] x86/tdx: Enhance tdh_phymem_page_reclaim() to support huge pages
2025-08-07 9:39 [RFC PATCH v2 00/23] KVM: TDX huge page support for private memory Yan Zhao
` (3 preceding siblings ...)
2025-08-07 9:42 ` [RFC PATCH v2 04/23] KVM: TDX: Introduce tdx_clear_folio() to clear " Yan Zhao
@ 2025-08-07 9:42 ` Yan Zhao
2025-11-17 2:09 ` Binbin Wu
2025-08-07 9:42 ` [RFC PATCH v2 06/23] KVM: TDX: Do not hold page refcount on private guest pages Yan Zhao
` (17 subsequent siblings)
22 siblings, 1 reply; 129+ messages in thread
From: Yan Zhao @ 2025-08-07 9:42 UTC (permalink / raw)
To: pbonzini, seanjc
Cc: linux-kernel, kvm, x86, rick.p.edgecombe, dave.hansen, kas, tabba,
ackerleytng, quic_eberman, michael.roth, david, vannapurve,
vbabka, thomas.lendacky, pgonda, zhiquan1.li, fan.du, jun.miao,
ira.weiny, isaku.yamahata, xiaoyao.li, binbin.wu, chao.p.peng,
yan.y.zhao
Enhance the SEAMCALL wrapper tdh_phymem_page_reclaim() to support huge
pages by introducing new parameters: "folio", "start_idx", and "npages".
These parameters specify the physical memory to be reclaimed, i.e.,
starting from the page at "start_idx" within a folio and spanning "npages"
contiguous PFNs. The specified memory must be entirely contained within a
single folio. Return TDX_SW_ERROR if the size of the reclaimed memory does
not match the specified size.
On the KVM side, introduce tdx_reclaim_folio() to align with and invoke the
SEAMCALL wrapper tdh_phymem_page_reclaim(). The "noclear" parameter
specifies whether tdx_clear_folio() should be subsequently invoked within
tdx_reclaim_folio(). Additionally, provide two helper functions,
tdx_reclaim_page() and tdx_reclaim_page_noclear(), to facilitate the
reclaiming of 4KB pages.
Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
---
RFC v2:
- Introduce new params "folio", "start_idx" and "npages" to wrapper
tdh_phymem_page_reclaim().
- Move the checking of return size from KVM to x86/virt and return error.
- Rename tdx_reclaim_page() to tdx_reclaim_folio().
- Add two helper functions tdx_reclaim_page() and tdx_reclaim_page_noclear()
to facilitate the reclaiming of 4KB pages.
RFC v1:
- Rebased and split patch.
---
arch/x86/include/asm/tdx.h | 3 ++-
arch/x86/kvm/vmx/tdx.c | 27 ++++++++++++++++++---------
arch/x86/virt/vmx/tdx/tdx.c | 12 ++++++++++--
3 files changed, 30 insertions(+), 12 deletions(-)
diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index a125bb20a28a..f1bd74348b34 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -189,7 +189,8 @@ u64 tdh_mng_init(struct tdx_td *td, u64 td_params, u64 *extended_err);
u64 tdh_vp_init(struct tdx_vp *vp, u64 initial_rcx, u32 x2apicid);
u64 tdh_vp_rd(struct tdx_vp *vp, u64 field, u64 *data);
u64 tdh_vp_wr(struct tdx_vp *vp, u64 field, u64 data, u64 mask);
-u64 tdh_phymem_page_reclaim(struct page *page, u64 *tdx_pt, u64 *tdx_owner, u64 *tdx_size);
+u64 tdh_phymem_page_reclaim(struct folio *folio, unsigned long start_idx, unsigned long npages,
+ u64 *tdx_pt, u64 *tdx_owner, u64 *tdx_size);
u64 tdh_mem_track(struct tdx_td *tdr);
u64 tdh_mem_page_remove(struct tdx_td *td, u64 gpa, u64 level, u64 *ext_err1, u64 *ext_err2);
u64 tdh_phymem_cache_wb(bool resume);
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 4fabefb27135..facfe589e006 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -327,11 +327,12 @@ static void tdx_no_vcpus_enter_stop(struct kvm *kvm)
}
/* TDH.PHYMEM.PAGE.RECLAIM is allowed only when destroying the TD. */
-static int __tdx_reclaim_page(struct page *page)
+static int tdx_reclaim_folio(struct folio *folio, unsigned long start_idx,
+ unsigned long npages, bool noclear)
{
u64 err, tdx_pt, tdx_owner, tdx_size;
- err = tdh_phymem_page_reclaim(page, &tdx_pt, &tdx_owner, &tdx_size);
+ err = tdh_phymem_page_reclaim(folio, start_idx, npages, &tdx_pt, &tdx_owner, &tdx_size);
/*
* No need to check for TDX_OPERAND_BUSY; all TD pages are freed
@@ -342,19 +343,25 @@ static int __tdx_reclaim_page(struct page *page)
pr_tdx_error_3(TDH_PHYMEM_PAGE_RECLAIM, err, tdx_pt, tdx_owner, tdx_size);
return -EIO;
}
+
+ if (!noclear)
+ tdx_clear_folio(folio, start_idx, npages);
return 0;
}
static int tdx_reclaim_page(struct page *page)
{
- int r;
+ struct folio *folio = page_folio(page);
- r = __tdx_reclaim_page(page);
- if (!r)
- tdx_clear_page(page);
- return r;
+ return tdx_reclaim_folio(folio, folio_page_idx(folio, page), 1, false);
}
+static int tdx_reclaim_page_noclear(struct page *page)
+{
+ struct folio *folio = page_folio(page);
+
+ return tdx_reclaim_folio(folio, folio_page_idx(folio, page), 1, true);
+}
/*
* Reclaim the TD control page(s) which are crypto-protected by TDX guest's
@@ -587,7 +594,7 @@ static void tdx_reclaim_td_control_pages(struct kvm *kvm)
if (!kvm_tdx->td.tdr_page)
return;
- if (__tdx_reclaim_page(kvm_tdx->td.tdr_page))
+ if (tdx_reclaim_page_noclear(kvm_tdx->td.tdr_page))
return;
/*
@@ -1932,11 +1939,13 @@ static int tdx_sept_remove_private_spte(struct kvm *kvm, gfn_t gfn,
enum pg_level level, kvm_pfn_t pfn)
{
struct page *page = pfn_to_page(pfn);
+ struct folio *folio = page_folio(page);
int ret;
if (!is_hkid_assigned(to_kvm_tdx(kvm))) {
KVM_BUG_ON(!kvm->vm_dead, kvm);
- ret = tdx_reclaim_page(page);
+ ret = tdx_reclaim_folio(folio, folio_page_idx(folio, page),
+ KVM_PAGES_PER_HPAGE(level), false);
if (!ret) {
tdx_pamt_put(page, level);
tdx_unpin(kvm, page);
diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index 64219c659844..9ed585bde062 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -1966,19 +1966,27 @@ EXPORT_SYMBOL_GPL(tdh_vp_init);
* So despite the names, they must be interpted specially as described by the spec. Return
* them only for error reporting purposes.
*/
-u64 tdh_phymem_page_reclaim(struct page *page, u64 *tdx_pt, u64 *tdx_owner, u64 *tdx_size)
+u64 tdh_phymem_page_reclaim(struct folio *folio, unsigned long start_idx, unsigned long npages,
+ u64 *tdx_pt, u64 *tdx_owner, u64 *tdx_size)
{
+ struct page *start = folio_page(folio, start_idx);
struct tdx_module_args args = {
- .rcx = page_to_phys(page),
+ .rcx = page_to_phys(start),
};
u64 ret;
+ if (start_idx + npages > folio_nr_pages(folio))
+ return TDX_OPERAND_INVALID;
+
ret = seamcall_ret(TDH_PHYMEM_PAGE_RECLAIM, &args);
*tdx_pt = args.rcx;
*tdx_owner = args.rdx;
*tdx_size = args.r8;
+ if (npages != (1 << (*tdx_size) * PTE_SHIFT))
+ return TDX_SW_ERROR;
+
return ret;
}
EXPORT_SYMBOL_GPL(tdh_phymem_page_reclaim);
--
2.43.2
^ permalink raw reply related [flat|nested] 129+ messages in thread
* Re: [RFC PATCH v2 05/23] x86/tdx: Enhance tdh_phymem_page_reclaim() to support huge pages
2025-08-07 9:42 ` [RFC PATCH v2 05/23] x86/tdx: Enhance tdh_phymem_page_reclaim() to support " Yan Zhao
@ 2025-11-17 2:09 ` Binbin Wu
2025-11-17 4:05 ` Yan Zhao
0 siblings, 1 reply; 129+ messages in thread
From: Binbin Wu @ 2025-11-17 2:09 UTC (permalink / raw)
To: Yan Zhao
Cc: seanjc, pbonzini, linux-kernel, kvm, x86, rick.p.edgecombe,
dave.hansen, kas, tabba, ackerleytng, quic_eberman, michael.roth,
david, vannapurve, vbabka, thomas.lendacky, pgonda, zhiquan1.li,
fan.du, jun.miao, ira.weiny, isaku.yamahata, xiaoyao.li,
chao.p.peng
On 8/7/2025 5:42 PM, Yan Zhao wrote:
[...]
> diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> index 64219c659844..9ed585bde062 100644
> --- a/arch/x86/virt/vmx/tdx/tdx.c
> +++ b/arch/x86/virt/vmx/tdx/tdx.c
> @@ -1966,19 +1966,27 @@ EXPORT_SYMBOL_GPL(tdh_vp_init);
> * So despite the names, they must be interpted specially as described by the spec. Return
> * them only for error reporting purposes.
> */
> -u64 tdh_phymem_page_reclaim(struct page *page, u64 *tdx_pt, u64 *tdx_owner, u64 *tdx_size)
> +u64 tdh_phymem_page_reclaim(struct folio *folio, unsigned long start_idx, unsigned long npages,
> + u64 *tdx_pt, u64 *tdx_owner, u64 *tdx_size)
> {
> + struct page *start = folio_page(folio, start_idx);
> struct tdx_module_args args = {
> - .rcx = page_to_phys(page),
> + .rcx = page_to_phys(start),
> };
> u64 ret;
>
> + if (start_idx + npages > folio_nr_pages(folio))
> + return TDX_OPERAND_INVALID;
> +
> ret = seamcall_ret(TDH_PHYMEM_PAGE_RECLAIM, &args);
>
> *tdx_pt = args.rcx;
> *tdx_owner = args.rdx;
> *tdx_size = args.r8;
>
> + if (npages != (1 << (*tdx_size) * PTE_SHIFT))
> + return TDX_SW_ERROR;
Nit:
The size check here is to make sure the reclamation is at the correct level.
However, tdx_size may not be updated if some other error occurs first.
Do you think it's better to check 'ret' first before returning TDX_SW_ERROR?
Otherwise, the error code provided by the TDX module, which may be helpful for
debugging, will be buried under TDX_SW_ERROR.
> +
> return ret;
> }
> EXPORT_SYMBOL_GPL(tdh_phymem_page_reclaim);
^ permalink raw reply [flat|nested] 129+ messages in thread
* Re: [RFC PATCH v2 05/23] x86/tdx: Enhance tdh_phymem_page_reclaim() to support huge pages
2025-11-17 2:09 ` Binbin Wu
@ 2025-11-17 4:05 ` Yan Zhao
0 siblings, 0 replies; 129+ messages in thread
From: Yan Zhao @ 2025-11-17 4:05 UTC (permalink / raw)
To: Binbin Wu
Cc: seanjc, pbonzini, linux-kernel, kvm, x86, rick.p.edgecombe,
dave.hansen, kas, tabba, ackerleytng, michael.roth, david,
vannapurve, vbabka, thomas.lendacky, pgonda, fan.du, jun.miao,
ira.weiny, isaku.yamahata, xiaoyao.li, chao.p.peng
On Mon, Nov 17, 2025 at 10:09:42AM +0800, Binbin Wu wrote:
>
>
> On 8/7/2025 5:42 PM, Yan Zhao wrote:
> [...]
> > diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> > index 64219c659844..9ed585bde062 100644
> > --- a/arch/x86/virt/vmx/tdx/tdx.c
> > +++ b/arch/x86/virt/vmx/tdx/tdx.c
> > @@ -1966,19 +1966,27 @@ EXPORT_SYMBOL_GPL(tdh_vp_init);
> > * So despite the names, they must be interpted specially as described by the spec. Return
> > * them only for error reporting purposes.
> > */
> > -u64 tdh_phymem_page_reclaim(struct page *page, u64 *tdx_pt, u64 *tdx_owner, u64 *tdx_size)
> > +u64 tdh_phymem_page_reclaim(struct folio *folio, unsigned long start_idx, unsigned long npages,
> > + u64 *tdx_pt, u64 *tdx_owner, u64 *tdx_size)
> > {
> > + struct page *start = folio_page(folio, start_idx);
> > struct tdx_module_args args = {
> > - .rcx = page_to_phys(page),
> > + .rcx = page_to_phys(start),
> > };
> > u64 ret;
> > + if (start_idx + npages > folio_nr_pages(folio))
> > + return TDX_OPERAND_INVALID;
> > +
> > ret = seamcall_ret(TDH_PHYMEM_PAGE_RECLAIM, &args);
> > *tdx_pt = args.rcx;
> > *tdx_owner = args.rdx;
> > *tdx_size = args.r8;
> > + if (npages != (1 << (*tdx_size) * PTE_SHIFT))
> > + return TDX_SW_ERROR;
>
> Nit:
>
> The size check here is to make sure the reclamation is at the correct level.
> However, tdx_size may not be updated if some other error occurs first.
> Do you think it's better to check 'ret' first before returning TDX_SW_ERROR?
> Otherwise, the error code provided by the TDX module, which may be helpful for
> debugging, will be buried under TDX_SW_ERROR.
Makes sense. Thanks!
I'll change it to:
if (!ret && npages != (1 << (*tdx_size) * PTE_SHIFT))
return TDX_SW_ERROR;
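(For context, the size check relies on the level-to-page-count mapping below.
This is only an illustrative sketch, assuming PTE_SHIFT is 9 and the usual
TDX page-size encoding of 0 = 4KB, 1 = 2MB, 2 = 1GB:)
/*
 * npages expected for each tdx_size reported by the TDX module:
 *   tdx_size == 0:  1 << (0 * PTE_SHIFT) ==      1 page  (4KB)
 *   tdx_size == 1:  1 << (1 * PTE_SHIFT) ==    512 pages (2MB)
 *   tdx_size == 2:  1 << (2 * PTE_SHIFT) == 262144 pages (1GB)
 */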
> > return ret;
> > }
> > EXPORT_SYMBOL_GPL(tdh_phymem_page_reclaim);
>
^ permalink raw reply [flat|nested] 129+ messages in thread
* [RFC PATCH v2 06/23] KVM: TDX: Do not hold page refcount on private guest pages
2025-08-07 9:39 [RFC PATCH v2 00/23] KVM: TDX huge page support for private memory Yan Zhao
` (4 preceding siblings ...)
2025-08-07 9:42 ` [RFC PATCH v2 05/23] x86/tdx: Enhance tdh_phymem_page_reclaim() to support " Yan Zhao
@ 2025-08-07 9:42 ` Yan Zhao
2025-08-07 9:42 ` [RFC PATCH v2 07/23] KVM: x86/mmu: Disallow page merging (huge page adjustment) for mirror root Yan Zhao
` (16 subsequent siblings)
22 siblings, 0 replies; 129+ messages in thread
From: Yan Zhao @ 2025-08-07 9:42 UTC (permalink / raw)
To: pbonzini, seanjc
Cc: linux-kernel, kvm, x86, rick.p.edgecombe, dave.hansen, kas, tabba,
ackerleytng, quic_eberman, michael.roth, david, vannapurve,
vbabka, thomas.lendacky, pgonda, zhiquan1.li, fan.du, jun.miao,
ira.weiny, isaku.yamahata, xiaoyao.li, binbin.wu, chao.p.peng,
yan.y.zhao
To enable guest_memfd to support in-place conversion between shared and
private memory [1], TDX must not hold a refcount on the private pages
allocated from guest_memfd.
Because a folio has only a single refcount and guest_memfd needs to
reliably detect unexpected references when converting any shared part to
private, guest_memfd [1] does not permit shared memory to be huge [2].
Consequently, it must split private huge pages into 4KB pages before
converting them to shared. However, since guest_memfd cannot distinguish
between speculative/transient refcounts and an intentional refcount held
by TDX on private pages [3], failing to release the private page refcount
in TDX could cause guest_memfd to wait indefinitely for the refcount to
drop before performing the split.
Under normal conditions, not holding an extra page refcount in TDX is safe
because guest_memfd ensures pages are retained until its invalidation
notification to the KVM MMU is completed. However, if there are bugs in KVM
or the TDX module, not holding an extra refcount while a page is mapped in
the S-EPT could result in the page being released from guest_memfd while
still mapped in the S-EPT.
Several approaches were considered to address this issue, including
- Attempting to modify the KVM unmap operation to return a failure, which
was deemed too complex and potentially incorrect [4].
- Increasing the folio reference count only upon S-EPT zapping failure [5].
- Using page flags or page_ext to indicate that a page is still in use by
TDX [6], which does not work with HVO (HugeTLB Vmemmap Optimization).
- Setting the HWPOISON bit or leveraging folio_set_hugetlb_hwpoison() [7].
Given the complexity or unsuitability of these approaches, and the fact
that S-EPT zapping can currently fail only when there are bugs in KVM or
the TDX module, which is very rare in a production kernel, the
straightforward approach of simply not holding the page reference count in
TDX was chosen [8].
When S-EPT zapping errors occur, KVM_BUG_ON() is invoked to kick off all
vCPUs and mark the VM as dead. Although there is a potential window in
which a private page still mapped in the S-EPT could be reallocated and
used outside the VM, the loud warning from KVM_BUG_ON() should provide
sufficient debug information. To be robust against such bugs, the user can
enable panic_on_warn as usual.
Link: https://lore.kernel.org/all/cover.1747264138.git.ackerleytng@google.com [1]
Link: https://youtu.be/UnBKahkAon4 [2]
Link: https://lore.kernel.org/all/CAGtprH_ypohFy9TOJ8Emm_roT4XbQUtLKZNFcM6Fr+fhTFkE0Q@mail.gmail.com [3]
Link: https://lore.kernel.org/all/aEEEJbTzlncbRaRA@yzhao56-desk.sh.intel.com [4]
Link: https://lore.kernel.org/all/aE%2Fq9VKkmaCcuwpU@yzhao56-desk.sh.intel.com [5]
Link: https://lore.kernel.org/all/aFkeBtuNBN1RrDAJ@yzhao56-desk.sh.intel.com [6]
Link: https://lore.kernel.org/all/diqzy0tikran.fsf@ackerleytng-ctop.c.googlers.com [7]
Link: https://lore.kernel.org/all/53ea5239f8ef9d8df9af593647243c10435fd219.camel@intel.com [8]
Suggested-by: Vishal Annapurve <vannapurve@google.com>
Suggested-by: Ackerley Tng <ackerleytng@google.com>
Suggested-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
---
RFC v2:
- new in RFC v2.
- Rebased on DPAMT and shutdown optimization.
---
arch/x86/kvm/vmx/tdx.c | 28 ++++------------------------
1 file changed, 4 insertions(+), 24 deletions(-)
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index facfe589e006..376287a2ddf4 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1600,11 +1600,6 @@ void tdx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa, int pgd_level)
td_vmcs_write64(to_tdx(vcpu), SHARED_EPT_POINTER, root_hpa);
}
-static void tdx_unpin(struct kvm *kvm, struct page *page)
-{
- put_page(page);
-}
-
static int tdx_mem_page_aug(struct kvm *kvm, gfn_t gfn,
enum pg_level level, struct page *page)
{
@@ -1617,14 +1612,11 @@ static int tdx_mem_page_aug(struct kvm *kvm, gfn_t gfn,
err = tdh_mem_page_aug(&kvm_tdx->td, gpa, tdx_level, folio,
folio_page_idx(folio, page), &entry, &level_state);
- if (unlikely(tdx_operand_busy(err))) {
- tdx_unpin(kvm, page);
+ if (unlikely(tdx_operand_busy(err)))
return -EBUSY;
- }
if (KVM_BUG_ON(err, kvm)) {
pr_tdx_error_2(TDH_MEM_PAGE_AUG, err, entry, level_state);
- tdx_unpin(kvm, page);
return -EIO;
}
@@ -1679,16 +1671,6 @@ static int tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn,
if (KVM_BUG_ON(level != PG_LEVEL_4K, kvm))
return -EINVAL;
- /*
- * Because guest_memfd doesn't support page migration with
- * a_ops->migrate_folio (yet), no callback is triggered for KVM on page
- * migration. Until guest_memfd supports page migration, prevent page
- * migration.
- * TODO: Once guest_memfd introduces callback on page migration,
- * implement it and remove get_page/put_page().
- */
- get_page(page);
-
/*
* Read 'pre_fault_allowed' before 'kvm_tdx->state'; see matching
* barrier in tdx_td_finalize().
@@ -1755,7 +1737,6 @@ static int tdx_sept_drop_private_spte(struct kvm *kvm, gfn_t gfn,
}
tdx_clear_folio(folio, folio_page_idx(folio, page), KVM_PAGES_PER_HPAGE(level));
tdx_pamt_put(page, level);
- tdx_unpin(kvm, page);
return 0;
}
@@ -1845,7 +1826,6 @@ static int tdx_sept_zap_private_spte(struct kvm *kvm, gfn_t gfn,
!KVM_BUG_ON(!atomic64_read(&kvm_tdx->nr_premapped), kvm)) {
atomic64_dec(&kvm_tdx->nr_premapped);
tdx_pamt_put(page, level);
- tdx_unpin(kvm, page);
return 0;
}
@@ -1944,12 +1924,12 @@ static int tdx_sept_remove_private_spte(struct kvm *kvm, gfn_t gfn,
if (!is_hkid_assigned(to_kvm_tdx(kvm))) {
KVM_BUG_ON(!kvm->vm_dead, kvm);
+
ret = tdx_reclaim_folio(folio, folio_page_idx(folio, page),
KVM_PAGES_PER_HPAGE(level), false);
- if (!ret) {
+ if (!ret)
tdx_pamt_put(page, level);
- tdx_unpin(kvm, page);
- }
+
return ret;
}
--
2.43.2
^ permalink raw reply related [flat|nested] 129+ messages in thread
* [RFC PATCH v2 07/23] KVM: x86/mmu: Disallow page merging (huge page adjustment) for mirror root
2025-08-07 9:39 [RFC PATCH v2 00/23] KVM: TDX huge page support for private memory Yan Zhao
` (5 preceding siblings ...)
2025-08-07 9:42 ` [RFC PATCH v2 06/23] KVM: TDX: Do not hold page refcount on private guest pages Yan Zhao
@ 2025-08-07 9:42 ` Yan Zhao
2025-08-07 9:43 ` [RFC PATCH v2 08/23] KVM: x86/tdp_mmu: Alloc external_spt page for mirror page table splitting Yan Zhao
` (15 subsequent siblings)
22 siblings, 0 replies; 129+ messages in thread
From: Yan Zhao @ 2025-08-07 9:42 UTC (permalink / raw)
To: pbonzini, seanjc
Cc: linux-kernel, kvm, x86, rick.p.edgecombe, dave.hansen, kas, tabba,
ackerleytng, quic_eberman, michael.roth, david, vannapurve,
vbabka, thomas.lendacky, pgonda, zhiquan1.li, fan.du, jun.miao,
ira.weiny, isaku.yamahata, xiaoyao.li, binbin.wu, chao.p.peng,
yan.y.zhao
From: "Edgecombe, Rick P" <rick.p.edgecombe@intel.com>
Disallow page merging (huge page adjustment) for the mirror root by
utilizing disallowed_hugepage_adjust().
Make the mirror root check asymmetric with NX huge pages and not to litter
the generic MMU code:
Invoke disallowed_hugepage_adjust() in kvm_tdp_mmu_map() when necessary,
specifically when KVM has mirrored TDP or the NX huge page workaround is
enabled.
Check and reduce the goal_level of a fault internally in
disallowed_hugepage_adjust() when the fault is for a mirror root and
there's a shadow present non-leaf entry at the original goal_level.
Signed-off-by: Edgecombe, Rick P <rick.p.edgecombe@intel.com>
Co-developed-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
---
RFC v2:
- Check is_mirror_sp() in disallowed_hugepage_adjust() instead of passing
in an is_mirror arg. (Rick)
- Check kvm_has_mirrored_tdp() in kvm_tdp_mmu_map() to determine whether
to invoke disallowed_hugepage_adjust(). (Rick)
RFC v1:
- new patch
---
arch/x86/kvm/mmu/mmu.c | 3 ++-
arch/x86/kvm/mmu/tdp_mmu.c | 4 +++-
2 files changed, 5 insertions(+), 2 deletions(-)
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 3f76415cec71..9182192daa3a 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3412,7 +3412,8 @@ void disallowed_hugepage_adjust(struct kvm_page_fault *fault, u64 spte, int cur_
cur_level == fault->goal_level &&
is_shadow_present_pte(spte) &&
!is_large_pte(spte) &&
- spte_to_child_sp(spte)->nx_huge_page_disallowed) {
+ ((spte_to_child_sp(spte)->nx_huge_page_disallowed) ||
+ is_mirror_sp(spte_to_child_sp(spte)))) {
/*
* A small SPTE exists for this pfn, but FNAME(fetch),
* direct_map(), or kvm_tdp_mmu_map() would like to create a
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index bb95c95f6531..f9a054754544 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1243,6 +1243,8 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
struct tdp_iter iter;
struct kvm_mmu_page *sp;
int ret = RET_PF_RETRY;
+ bool hugepage_adjust_disallowed = fault->nx_huge_page_workaround_enabled ||
+ kvm_has_mirrored_tdp(kvm);
kvm_mmu_hugepage_adjust(vcpu, fault);
@@ -1253,7 +1255,7 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
for_each_tdp_pte(iter, kvm, root, fault->gfn, fault->gfn + 1) {
int r;
- if (fault->nx_huge_page_workaround_enabled)
+ if (hugepage_adjust_disallowed)
disallowed_hugepage_adjust(fault, iter.old_spte, iter.level);
/*
--
2.43.2
^ permalink raw reply related [flat|nested] 129+ messages in thread
* [RFC PATCH v2 08/23] KVM: x86/tdp_mmu: Alloc external_spt page for mirror page table splitting
2025-08-07 9:39 [RFC PATCH v2 00/23] KVM: TDX huge page support for private memory Yan Zhao
` (6 preceding siblings ...)
2025-08-07 9:42 ` [RFC PATCH v2 07/23] KVM: x86/mmu: Disallow page merging (huge page adjustment) for mirror root Yan Zhao
@ 2025-08-07 9:43 ` Yan Zhao
2025-11-11 9:52 ` Huang, Kai
2025-08-07 9:43 ` [RFC PATCH v2 09/23] KVM: x86/tdp_mmu: Add split_external_spt hook called during write mmu_lock Yan Zhao
` (14 subsequent siblings)
22 siblings, 1 reply; 129+ messages in thread
From: Yan Zhao @ 2025-08-07 9:43 UTC (permalink / raw)
To: pbonzini, seanjc
Cc: linux-kernel, kvm, x86, rick.p.edgecombe, dave.hansen, kas, tabba,
ackerleytng, quic_eberman, michael.roth, david, vannapurve,
vbabka, thomas.lendacky, pgonda, zhiquan1.li, fan.du, jun.miao,
ira.weiny, isaku.yamahata, xiaoyao.li, binbin.wu, chao.p.peng,
yan.y.zhao
From: Isaku Yamahata <isaku.yamahata@intel.com>
Enhance tdp_mmu_alloc_sp_split() to allocate a page for sp->external_spt,
i.e., the external page table page, for splitting the mirror page table.
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
---
RFC v2:
- NO change.
RFC v1:
- Rebased and simplified the code.
---
arch/x86/kvm/mmu/tdp_mmu.c | 15 +++++++++++++--
1 file changed, 13 insertions(+), 2 deletions(-)
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index f9a054754544..46b9f276bb6d 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -324,6 +324,8 @@ static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
u64 old_spte, u64 new_spte, int level,
bool shared);
+static struct kvm_mmu_page *tdp_mmu_alloc_sp_for_split(bool mirror);
+
static void tdp_account_mmu_page(struct kvm *kvm, struct kvm_mmu_page *sp)
{
kvm_account_pgtable_pages((void *)sp->spt, +1);
@@ -1475,7 +1477,7 @@ bool kvm_tdp_mmu_wrprot_slot(struct kvm *kvm,
return spte_set;
}
-static struct kvm_mmu_page *tdp_mmu_alloc_sp_for_split(void)
+static struct kvm_mmu_page *tdp_mmu_alloc_sp_for_split(bool mirror)
{
struct kvm_mmu_page *sp;
@@ -1489,6 +1491,15 @@ static struct kvm_mmu_page *tdp_mmu_alloc_sp_for_split(void)
return NULL;
}
+ if (mirror) {
+ sp->external_spt = (void *)get_zeroed_page(GFP_KERNEL_ACCOUNT);
+ if (!sp->external_spt) {
+ free_page((unsigned long)sp->spt);
+ kmem_cache_free(mmu_page_header_cache, sp);
+ return NULL;
+ }
+ }
+
return sp;
}
@@ -1568,7 +1579,7 @@ static int tdp_mmu_split_huge_pages_root(struct kvm *kvm,
else
write_unlock(&kvm->mmu_lock);
- sp = tdp_mmu_alloc_sp_for_split();
+ sp = tdp_mmu_alloc_sp_for_split(is_mirror_sp(root));
if (shared)
read_lock(&kvm->mmu_lock);
--
2.43.2
^ permalink raw reply related [flat|nested] 129+ messages in thread
* Re: [RFC PATCH v2 08/23] KVM: x86/tdp_mmu: Alloc external_spt page for mirror page table splitting
2025-08-07 9:43 ` [RFC PATCH v2 08/23] KVM: x86/tdp_mmu: Alloc external_spt page for mirror page table splitting Yan Zhao
@ 2025-11-11 9:52 ` Huang, Kai
2025-11-12 9:29 ` Yan Zhao
0 siblings, 1 reply; 129+ messages in thread
From: Huang, Kai @ 2025-11-11 9:52 UTC (permalink / raw)
To: pbonzini@redhat.com, seanjc@google.com, Zhao, Yan Y
Cc: quic_eberman@quicinc.com, kvm@vger.kernel.org, Li, Xiaoyao,
Du, Fan, Hansen, Dave, david@redhat.com, thomas.lendacky@amd.com,
vbabka@suse.cz, tabba@google.com, kas@kernel.org,
michael.roth@amd.com, Weiny, Ira, linux-kernel@vger.kernel.org,
binbin.wu@linux.intel.com, ackerleytng@google.com,
Yamahata, Isaku, Peng, Chao P, zhiquan1.li@intel.com,
Annapurve, Vishal, Edgecombe, Rick P, Miao, Jun, x86@kernel.org,
pgonda@google.com
On Thu, 2025-08-07 at 17:43 +0800, Yan Zhao wrote:
> From: Isaku Yamahata <isaku.yamahata@intel.com>
>
> Enhance tdp_mmu_alloc_sp_split() to allocate a page for sp->external_spt,
^
tdp_mmu_alloc_sp_for_split()
[...]
> +static struct kvm_mmu_page *tdp_mmu_alloc_sp_for_split(bool mirror);
> +
It doesn't seem you need such a declaration in _this_ patch. If any later
patch needs it, then perhaps it's better to add it in that patch.
> static void tdp_account_mmu_page(struct kvm *kvm, struct kvm_mmu_page *sp)
> {
> kvm_account_pgtable_pages((void *)sp->spt, +1);
> @@ -1475,7 +1477,7 @@ bool kvm_tdp_mmu_wrprot_slot(struct kvm *kvm,
> return spte_set;
> }
>
> -static struct kvm_mmu_page *tdp_mmu_alloc_sp_for_split(void)
> +static struct kvm_mmu_page *tdp_mmu_alloc_sp_for_split(bool mirror)
> {
> struct kvm_mmu_page *sp;
>
> @@ -1489,6 +1491,15 @@ static struct kvm_mmu_page *tdp_mmu_alloc_sp_for_split(void)
> return NULL;
> }
>
> + if (mirror) {
> + sp->external_spt = (void *)get_zeroed_page(GFP_KERNEL_ACCOUNT);
> + if (!sp->external_spt) {
> + free_page((unsigned long)sp->spt);
> + kmem_cache_free(mmu_page_header_cache, sp);
> + return NULL;
> + }
> + }
> +
> return sp;
> }
>
> @@ -1568,7 +1579,7 @@ static int tdp_mmu_split_huge_pages_root(struct kvm *kvm,
> else
> write_unlock(&kvm->mmu_lock);
>
> - sp = tdp_mmu_alloc_sp_for_split();
> + sp = tdp_mmu_alloc_sp_for_split(is_mirror_sp(root));
>
> if (shared)
> read_lock(&kvm->mmu_lock);
^ permalink raw reply [flat|nested] 129+ messages in thread
* Re: [RFC PATCH v2 08/23] KVM: x86/tdp_mmu: Alloc external_spt page for mirror page table splitting
2025-11-11 9:52 ` Huang, Kai
@ 2025-11-12 9:29 ` Yan Zhao
0 siblings, 0 replies; 129+ messages in thread
From: Yan Zhao @ 2025-11-12 9:29 UTC (permalink / raw)
To: Huang, Kai
Cc: pbonzini@redhat.com, seanjc@google.com, quic_eberman@quicinc.com,
kvm@vger.kernel.org, Li, Xiaoyao, Du, Fan, Hansen, Dave,
david@redhat.com, thomas.lendacky@amd.com, vbabka@suse.cz,
tabba@google.com, kas@kernel.org, michael.roth@amd.com,
Weiny, Ira, linux-kernel@vger.kernel.org,
binbin.wu@linux.intel.com, ackerleytng@google.com,
Yamahata, Isaku, Peng, Chao P, zhiquan1.li@intel.com,
Annapurve, Vishal, Edgecombe, Rick P, Miao, Jun, x86@kernel.org,
pgonda@google.com
On Tue, Nov 11, 2025 at 05:52:54PM +0800, Huang, Kai wrote:
> On Thu, 2025-08-07 at 17:43 +0800, Yan Zhao wrote:
> > From: Isaku Yamahata <isaku.yamahata@intel.com>
> >
> > Enhance tdp_mmu_alloc_sp_split() to allocate a page for sp->external_spt,
> ^
> tdp_mmu_alloc_sp_for_split()
Right.
> [...]
>
> > +static struct kvm_mmu_page *tdp_mmu_alloc_sp_for_split(bool mirror);
> > +
>
> It doesn't seem you need such a declaration in _this_ patch. If any later
> patch needs it, then perhaps it's better to add it in that patch.
Thanks!
I'll drop this declaration. It's no longer needed in v2, because its caller is
now tdp_mmu_split_huge_pages_root(), which is defined later in the file than
tdp_mmu_alloc_sp_for_split().
^ permalink raw reply [flat|nested] 129+ messages in thread
* [RFC PATCH v2 09/23] KVM: x86/tdp_mmu: Add split_external_spt hook called during write mmu_lock
2025-08-07 9:39 [RFC PATCH v2 00/23] KVM: TDX huge page support for private memory Yan Zhao
` (7 preceding siblings ...)
2025-08-07 9:43 ` [RFC PATCH v2 08/23] KVM: x86/tdp_mmu: Alloc external_spt page for mirror page table splitting Yan Zhao
@ 2025-08-07 9:43 ` Yan Zhao
2025-11-11 10:06 ` Huang, Kai
2025-11-17 8:53 ` Binbin Wu
2025-08-07 9:43 ` [RFC PATCH v2 10/23] KVM: TDX: Enable huge page splitting under write kvm->mmu_lock Yan Zhao
` (13 subsequent siblings)
22 siblings, 2 replies; 129+ messages in thread
From: Yan Zhao @ 2025-08-07 9:43 UTC (permalink / raw)
To: pbonzini, seanjc
Cc: linux-kernel, kvm, x86, rick.p.edgecombe, dave.hansen, kas, tabba,
ackerleytng, quic_eberman, michael.roth, david, vannapurve,
vbabka, thomas.lendacky, pgonda, zhiquan1.li, fan.du, jun.miao,
ira.weiny, isaku.yamahata, xiaoyao.li, binbin.wu, chao.p.peng,
yan.y.zhao
Introduce the split_external_spt hook and call it within tdp_mmu_set_spte()
for the mirror page table.
tdp_mmu_set_spte() is invoked for SPTE transitions under write mmu_lock.
For the mirror page table, in addition to the valid transitions from a
shadow-present entry to !shadow-present entry, introduce a new valid
transition case for splitting and propagate the transition to the external
page table via the hook split_external_spt.
Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
---
RFC v2:
- Removed the KVM_BUG_ON() in split_external_spt(). (Rick)
- Add a comment for the KVM_BUG_ON() in tdp_mmu_set_spte(). (Rick)
- Use kvm_x86_call() instead of static_call(). (Binbin)
RFC v1:
- Split patch.
- Dropped invoking hook zap_private_spte and kvm_flush_remote_tlbs() in KVM
MMU core.
---
arch/x86/include/asm/kvm-x86-ops.h | 1 +
arch/x86/include/asm/kvm_host.h | 4 ++++
arch/x86/kvm/mmu/tdp_mmu.c | 29 +++++++++++++++++++++++++----
3 files changed, 30 insertions(+), 4 deletions(-)
diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
index 18a5c3119e1a..7653a45ad5b2 100644
--- a/arch/x86/include/asm/kvm-x86-ops.h
+++ b/arch/x86/include/asm/kvm-x86-ops.h
@@ -98,6 +98,7 @@ KVM_X86_OP_OPTIONAL(link_external_spt)
KVM_X86_OP_OPTIONAL(set_external_spte)
KVM_X86_OP_OPTIONAL(free_external_spt)
KVM_X86_OP_OPTIONAL(remove_external_spte)
+KVM_X86_OP_OPTIONAL(split_external_spt)
KVM_X86_OP(has_wbinvd_exit)
KVM_X86_OP(get_l2_tsc_offset)
KVM_X86_OP(get_l2_tsc_multiplier)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 823d1aeef2a8..e431ce0e3180 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1839,6 +1839,10 @@ struct kvm_x86_ops {
int (*remove_external_spte)(struct kvm *kvm, gfn_t gfn, enum pg_level level,
kvm_pfn_t pfn_for_gfn);
+ /* Split the external page table into smaller page tables */
+ int (*split_external_spt)(struct kvm *kvm, gfn_t gfn, enum pg_level level,
+ void *external_spt);
+
bool (*has_wbinvd_exit)(void);
u64 (*get_l2_tsc_offset)(struct kvm_vcpu *vcpu);
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 46b9f276bb6d..a2c6e6e4773f 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -325,6 +325,7 @@ static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
bool shared);
static struct kvm_mmu_page *tdp_mmu_alloc_sp_for_split(bool mirror);
+static void *get_external_spt(gfn_t gfn, u64 new_spte, int level);
static void tdp_account_mmu_page(struct kvm *kvm, struct kvm_mmu_page *sp)
{
@@ -384,6 +385,18 @@ static void remove_external_spte(struct kvm *kvm, gfn_t gfn, u64 old_spte,
KVM_BUG_ON(ret, kvm);
}
+static int split_external_spt(struct kvm *kvm, gfn_t gfn, u64 old_spte,
+ u64 new_spte, int level)
+{
+ void *external_spt = get_external_spt(gfn, new_spte, level);
+ int ret;
+
+ KVM_BUG_ON(!external_spt, kvm);
+
+ ret = kvm_x86_call(split_external_spt)(kvm, gfn, level, external_spt);
+
+ return ret;
+}
/**
* handle_removed_pt() - handle a page table removed from the TDP structure
*
@@ -765,12 +778,20 @@ static u64 tdp_mmu_set_spte(struct kvm *kvm, int as_id, tdp_ptep_t sptep,
handle_changed_spte(kvm, as_id, gfn, old_spte, new_spte, level, false);
/*
- * Users that do non-atomic setting of PTEs don't operate on mirror
- * roots, so don't handle it and bug the VM if it's seen.
+ * Propagate changes of SPTE to the external page table under write
+ * mmu_lock.
+ * Current valid transitions:
+ * - present leaf to !present.
+ * - present non-leaf to !present.
+ * - present leaf to present non-leaf (splitting)
*/
if (is_mirror_sptep(sptep)) {
- KVM_BUG_ON(is_shadow_present_pte(new_spte), kvm);
- remove_external_spte(kvm, gfn, old_spte, level);
+ if (!is_shadow_present_pte(new_spte))
+ remove_external_spte(kvm, gfn, old_spte, level);
+ else if (is_last_spte(old_spte, level) && !is_last_spte(new_spte, level))
+ split_external_spt(kvm, gfn, old_spte, new_spte, level);
+ else
+ KVM_BUG_ON(1, kvm);
}
return old_spte;
--
2.43.2
^ permalink raw reply related [flat|nested] 129+ messages in thread
* Re: [RFC PATCH v2 09/23] KVM: x86/tdp_mmu: Add split_external_spt hook called during write mmu_lock
2025-08-07 9:43 ` [RFC PATCH v2 09/23] KVM: x86/tdp_mmu: Add split_external_spt hook called during write mmu_lock Yan Zhao
@ 2025-11-11 10:06 ` Huang, Kai
2025-11-13 3:16 ` Yan Zhao
2025-11-17 8:53 ` Binbin Wu
1 sibling, 1 reply; 129+ messages in thread
From: Huang, Kai @ 2025-11-11 10:06 UTC (permalink / raw)
To: pbonzini@redhat.com, seanjc@google.com, Zhao, Yan Y
Cc: quic_eberman@quicinc.com, kvm@vger.kernel.org, Li, Xiaoyao,
Du, Fan, Hansen, Dave, david@redhat.com, thomas.lendacky@amd.com,
vbabka@suse.cz, tabba@google.com, kas@kernel.org,
michael.roth@amd.com, Weiny, Ira, linux-kernel@vger.kernel.org,
binbin.wu@linux.intel.com, ackerleytng@google.com,
Yamahata, Isaku, Peng, Chao P, zhiquan1.li@intel.com,
Annapurve, Vishal, Edgecombe, Rick P, Miao, Jun, x86@kernel.org,
pgonda@google.com
On Thu, 2025-08-07 at 17:43 +0800, Yan Zhao wrote:
> Introduce the split_external_spt hook and call it within tdp_mmu_set_spte()
> for the mirror page table.
Nit: I think you need to use split_external_spt() since it's a function,
even though you already mentioned it is a hook.
>
> tdp_mmu_set_spte() is invoked for SPTE transitions under write mmu_lock.
> For the mirror page table, in addition to the valid transitions from a
> shadow-present entry to !shadow-present entry, introduce a new valid
> transition case for splitting and propagate the transition to the external
> page table via the hook split_external_spt.
Ditto: split_external_spt()
[...]
> static struct kvm_mmu_page *tdp_mmu_alloc_sp_for_split(bool mirror);
> +static void *get_external_spt(gfn_t gfn, u64 new_spte, int level);
Is it possible to get rid of such declarations, e.g., by ...
>
> static void tdp_account_mmu_page(struct kvm *kvm, struct kvm_mmu_page *sp)
> {
> @@ -384,6 +385,18 @@ static void remove_external_spte(struct kvm *kvm, gfn_t gfn, u64 old_spte,
> KVM_BUG_ON(ret, kvm);
> }
>
> +static int split_external_spt(struct kvm *kvm, gfn_t gfn, u64 old_spte,
> + u64 new_spte, int level)
> +{
> + void *external_spt = get_external_spt(gfn, new_spte, level);
> + int ret;
> +
> + KVM_BUG_ON(!external_spt, kvm);
> +
> + ret = kvm_x86_call(split_external_spt)(kvm, gfn, level, external_spt);
> +
> + return ret;
> +}
... moving split_external_spt() somewhere else, e.g., after
set_external_spte_present() (which calls get_external_spt())?
Since ...
> /**
> * handle_removed_pt() - handle a page table removed from the TDP structure
> *
> @@ -765,12 +778,20 @@ static u64 tdp_mmu_set_spte(struct kvm *kvm, int as_id, tdp_ptep_t sptep,
> handle_changed_spte(kvm, as_id, gfn, old_spte, new_spte, level, false);
>
> /*
> - * Users that do non-atomic setting of PTEs don't operate on mirror
> - * roots, so don't handle it and bug the VM if it's seen.
> + * Propagate changes of SPTE to the external page table under write
> + * mmu_lock.
> + * Current valid transitions:
> + * - present leaf to !present.
> + * - present non-leaf to !present.
> + * - present leaf to present non-leaf (splitting)
> */
> if (is_mirror_sptep(sptep)) {
> - KVM_BUG_ON(is_shadow_present_pte(new_spte), kvm);
> - remove_external_spte(kvm, gfn, old_spte, level);
> + if (!is_shadow_present_pte(new_spte))
> + remove_external_spte(kvm, gfn, old_spte, level);
> + else if (is_last_spte(old_spte, level) && !is_last_spte(new_spte, level))
> + split_external_spt(kvm, gfn, old_spte, new_spte, level);
> + else
> + KVM_BUG_ON(1, kvm);
> }
>
... split_external_spt() is only called here in tdp_mmu_set_spte() which is
way after set_external_spte_present() AFAICT.
^ permalink raw reply [flat|nested] 129+ messages in thread
* Re: [RFC PATCH v2 09/23] KVM: x86/tdp_mmu: Add split_external_spt hook called during write mmu_lock
2025-11-11 10:06 ` Huang, Kai
@ 2025-11-13 3:16 ` Yan Zhao
0 siblings, 0 replies; 129+ messages in thread
From: Yan Zhao @ 2025-11-13 3:16 UTC (permalink / raw)
To: Huang, Kai
Cc: pbonzini@redhat.com, seanjc@google.com, kvm@vger.kernel.org,
Li, Xiaoyao, Du, Fan, Hansen, Dave, david@redhat.com,
thomas.lendacky@amd.com, vbabka@suse.cz, tabba@google.com,
kas@kernel.org, michael.roth@amd.com, Weiny, Ira,
linux-kernel@vger.kernel.org, binbin.wu@linux.intel.com,
ackerleytng@google.com, Yamahata, Isaku, Peng, Chao P,
Annapurve, Vishal, Edgecombe, Rick P, Miao, Jun, x86@kernel.org,
pgonda@google.com
On Tue, Nov 11, 2025 at 06:06:47PM +0800, Huang, Kai wrote:
> On Thu, 2025-08-07 at 17:43 +0800, Yan Zhao wrote:
> > Introduce the split_external_spt hook and call it within tdp_mmu_set_spte()
> > for the mirror page table.
>
> Nit: I think you need to use split_external_spt() since it's a function,
> even though you already mentioned it is a hook.
Makes sense.
> > tdp_mmu_set_spte() is invoked for SPTE transitions under write mmu_lock.
> > For the mirror page table, in addition to the valid transitions from a
> > shadow-present entry to !shadow-present entry, introduce a new valid
> > transition case for splitting and propagate the transition to the external
> > page table via the hook split_external_spt.
>
> Ditto: split_external_spt()
Will update.
> [...]
>
> > static struct kvm_mmu_page *tdp_mmu_alloc_sp_for_split(bool mirror);
> > +static void *get_external_spt(gfn_t gfn, u64 new_spte, int level);
>
> Is it possible to get rid of such declarations, e.g., by ...
I think so.
Will drop this declaration by moving split_external_spt() after
get_external_spt() but before set_external_spte_present() and
tdp_mmu_set_spte().
Thanks for this suggestion.
> > static void tdp_account_mmu_page(struct kvm *kvm, struct kvm_mmu_page *sp)
> > {
> > @@ -384,6 +385,18 @@ static void remove_external_spte(struct kvm *kvm, gfn_t gfn, u64 old_spte,
> > KVM_BUG_ON(ret, kvm);
> > }
> >
> > +static int split_external_spt(struct kvm *kvm, gfn_t gfn, u64 old_spte,
> > + u64 new_spte, int level)
> > +{
> > + void *external_spt = get_external_spt(gfn, new_spte, level);
> > + int ret;
> > +
> > + KVM_BUG_ON(!external_spt, kvm);
> > +
> > + ret = kvm_x86_call(split_external_spt)(kvm, gfn, level, external_spt);
> > +
> > + return ret;
> > +}
>
> ... moving split_external_spt() somewhere else, e.g., after
> set_external_spte_present() (which calls get_external_spt())?
>
> Since ...
>
> > /**
> > * handle_removed_pt() - handle a page table removed from the TDP structure
> > *
> > @@ -765,12 +778,20 @@ static u64 tdp_mmu_set_spte(struct kvm *kvm, int as_id, tdp_ptep_t sptep,
> > handle_changed_spte(kvm, as_id, gfn, old_spte, new_spte, level, false);
> >
> > /*
> > - * Users that do non-atomic setting of PTEs don't operate on mirror
> > - * roots, so don't handle it and bug the VM if it's seen.
> > + * Propagate changes of SPTE to the external page table under write
> > + * mmu_lock.
> > + * Current valid transitions:
> > + * - present leaf to !present.
> > + * - present non-leaf to !present.
> > + * - present leaf to present non-leaf (splitting)
> > */
> > if (is_mirror_sptep(sptep)) {
> > - KVM_BUG_ON(is_shadow_present_pte(new_spte), kvm);
> > - remove_external_spte(kvm, gfn, old_spte, level);
> > + if (!is_shadow_present_pte(new_spte))
> > + remove_external_spte(kvm, gfn, old_spte, level);
> > + else if (is_last_spte(old_spte, level) && !is_last_spte(new_spte, level))
> > + split_external_spt(kvm, gfn, old_spte, new_spte, level);
> > + else
> > + KVM_BUG_ON(1, kvm);
> > }
> >
>
> ... split_external_spt() is only called here in tdp_mmu_set_spte() which is
> way after set_external_spte_present() AFAICT.
^ permalink raw reply [flat|nested] 129+ messages in thread
* Re: [RFC PATCH v2 09/23] KVM: x86/tdp_mmu: Add split_external_spt hook called during write mmu_lock
2025-08-07 9:43 ` [RFC PATCH v2 09/23] KVM: x86/tdp_mmu: Add split_external_spt hook called during write mmu_lock Yan Zhao
2025-11-11 10:06 ` Huang, Kai
@ 2025-11-17 8:53 ` Binbin Wu
2025-11-17 9:09 ` Yan Zhao
1 sibling, 1 reply; 129+ messages in thread
From: Binbin Wu @ 2025-11-17 8:53 UTC (permalink / raw)
To: Yan Zhao
Cc: pbonzini, seanjc, linux-kernel, kvm, x86, rick.p.edgecombe,
dave.hansen, kas, tabba, ackerleytng, quic_eberman, michael.roth,
david, vannapurve, vbabka, thomas.lendacky, pgonda, zhiquan1.li,
fan.du, jun.miao, ira.weiny, isaku.yamahata, xiaoyao.li,
chao.p.peng
On 8/7/2025 5:43 PM, Yan Zhao wrote:
[...]
> /**
> * handle_removed_pt() - handle a page table removed from the TDP structure
> *
> @@ -765,12 +778,20 @@ static u64 tdp_mmu_set_spte(struct kvm *kvm, int as_id, tdp_ptep_t sptep,
> handle_changed_spte(kvm, as_id, gfn, old_spte, new_spte, level, false);
>
> /*
> - * Users that do non-atomic setting of PTEs don't operate on mirror
> - * roots, so don't handle it and bug the VM if it's seen.
> + * Propagate changes of SPTE to the external page table under write
> + * mmu_lock.
> + * Current valid transitions:
> + * - present leaf to !present.
> + * - present non-leaf to !present.
Nit:
Maybe add a small note to limit the scenario, such as "after releasing the HKID"
or "during the TD teardown"?
> + * - present leaf to present non-leaf (splitting)
> */
> if (is_mirror_sptep(sptep)) {
> - KVM_BUG_ON(is_shadow_present_pte(new_spte), kvm);
> - remove_external_spte(kvm, gfn, old_spte, level);
> + if (!is_shadow_present_pte(new_spte))
> + remove_external_spte(kvm, gfn, old_spte, level);
> + else if (is_last_spte(old_spte, level) && !is_last_spte(new_spte, level))
> + split_external_spt(kvm, gfn, old_spte, new_spte, level);
> + else
> + KVM_BUG_ON(1, kvm);
> }
>
> return old_spte;
^ permalink raw reply [flat|nested] 129+ messages in thread
* Re: [RFC PATCH v2 09/23] KVM: x86/tdp_mmu: Add split_external_spt hook called during write mmu_lock
2025-11-17 8:53 ` Binbin Wu
@ 2025-11-17 9:09 ` Yan Zhao
0 siblings, 0 replies; 129+ messages in thread
From: Yan Zhao @ 2025-11-17 9:09 UTC (permalink / raw)
To: Binbin Wu
Cc: pbonzini, seanjc, linux-kernel, kvm, x86, rick.p.edgecombe,
dave.hansen, kas, tabba, ackerleytng, michael.roth, david,
vannapurve, vbabka, thomas.lendacky, pgonda, fan.du, jun.miao,
ira.weiny, isaku.yamahata, xiaoyao.li, chao.p.peng
On Mon, Nov 17, 2025 at 04:53:17PM +0800, Binbin Wu wrote:
>
>
> On 8/7/2025 5:43 PM, Yan Zhao wrote:
> [...]
> > /**
> > * handle_removed_pt() - handle a page table removed from the TDP structure
> > *
> > @@ -765,12 +778,20 @@ static u64 tdp_mmu_set_spte(struct kvm *kvm, int as_id, tdp_ptep_t sptep,
> > handle_changed_spte(kvm, as_id, gfn, old_spte, new_spte, level, false);
> > /*
> > - * Users that do non-atomic setting of PTEs don't operate on mirror
> > - * roots, so don't handle it and bug the VM if it's seen.
> > + * Propagate changes of SPTE to the external page table under write
> > + * mmu_lock.
> > + * Current valid transitions:
> > + * - present leaf to !present.
> > + * - present non-leaf to !present.
>
> Nit:
> Maybe add a small note to limit the scenario, such as "after releasing the HKID"
> or "during the TD teardown"?
I'm not sure if we need that level of detail.
E.g., for the "present leaf to !present" case, the transition happens before
releasing the HKID w/o patch [1], but can happen after releasing the HKID w/
patch [1].
[1] https://lore.kernel.org/all/20250729193341.621487-6-seanjc@google.com/
> > + * - present leaf to present non-leaf (splitting)
> > */
> > if (is_mirror_sptep(sptep)) {
> > - KVM_BUG_ON(is_shadow_present_pte(new_spte), kvm);
> > - remove_external_spte(kvm, gfn, old_spte, level);
> > + if (!is_shadow_present_pte(new_spte))
> > + remove_external_spte(kvm, gfn, old_spte, level);
> > + else if (is_last_spte(old_spte, level) && !is_last_spte(new_spte, level))
> > + split_external_spt(kvm, gfn, old_spte, new_spte, level);
> > + else
> > + KVM_BUG_ON(1, kvm);
> > }
> > return old_spte;
>
^ permalink raw reply [flat|nested] 129+ messages in thread
* [RFC PATCH v2 10/23] KVM: TDX: Enable huge page splitting under write kvm->mmu_lock
2025-08-07 9:39 [RFC PATCH v2 00/23] KVM: TDX huge page support for private memory Yan Zhao
` (8 preceding siblings ...)
2025-08-07 9:43 ` [RFC PATCH v2 09/23] KVM: x86/tdp_mmu: Add split_external_spt hook called during write mmu_lock Yan Zhao
@ 2025-08-07 9:43 ` Yan Zhao
2025-11-11 10:20 ` Huang, Kai
2025-12-09 23:49 ` Sagi Shahar
2025-08-07 9:43 ` [RFC PATCH v2 11/23] KVM: x86: Reject splitting huge pages under shared mmu_lock for mirror root Yan Zhao
` (12 subsequent siblings)
22 siblings, 2 replies; 129+ messages in thread
From: Yan Zhao @ 2025-08-07 9:43 UTC (permalink / raw)
To: pbonzini, seanjc
Cc: linux-kernel, kvm, x86, rick.p.edgecombe, dave.hansen, kas, tabba,
ackerleytng, quic_eberman, michael.roth, david, vannapurve,
vbabka, thomas.lendacky, pgonda, zhiquan1.li, fan.du, jun.miao,
ira.weiny, isaku.yamahata, xiaoyao.li, binbin.wu, chao.p.peng,
yan.y.zhao
Implement the split_external_spt hook to enable huge page splitting for
TDX when kvm->mmu_lock is held for writing.
Invoke tdh_mem_range_block(), tdh_mem_track(), kicking off vCPUs,
tdh_mem_page_demote() in sequence. All operations are performed under
kvm->mmu_lock held for writing, similar to those in page removal.
Even with kvm->mmu_lock held for writing, tdh_mem_page_demote() may still
contend with tdh_vp_enter() and potentially with the guest's S-EPT entry
operations. Therefore, kick off other vCPUs and prevent tdh_vp_enter()
from being called on them to ensure success on the second attempt. Use
KVM_BUG_ON() for any other unexpected errors.
Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
---
RFC v2:
- Split out the code to handle the error TDX_INTERRUPTED_RESTARTABLE.
- Rebased to 6.16.0-rc6 (the way of defining TDX hook changes).
RFC v1:
- Split patch for exclusive mmu_lock only,
- Invoke tdx_sept_zap_private_spte() and tdx_track() for splitting.
- Handled busy error of tdh_mem_page_demote() by kicking off vCPUs.
---
arch/x86/kvm/vmx/tdx.c | 45 ++++++++++++++++++++++++++++++++++++++++++
1 file changed, 45 insertions(+)
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 376287a2ddf4..8a60ba5b6595 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1915,6 +1915,50 @@ static int tdx_sept_free_private_spt(struct kvm *kvm, gfn_t gfn,
return 0;
}
+static int tdx_spte_demote_private_spte(struct kvm *kvm, gfn_t gfn,
+ enum pg_level level, struct page *page)
+{
+ int tdx_level = pg_level_to_tdx_sept_level(level);
+ struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
+ gpa_t gpa = gfn_to_gpa(gfn);
+ u64 err, entry, level_state;
+
+ err = tdh_mem_page_demote(&kvm_tdx->td, gpa, tdx_level, page,
+ &entry, &level_state);
+
+ if (unlikely(tdx_operand_busy(err))) {
+ tdx_no_vcpus_enter_start(kvm);
+ err = tdh_mem_page_demote(&kvm_tdx->td, gpa, tdx_level, page,
+ &entry, &level_state);
+ tdx_no_vcpus_enter_stop(kvm);
+ }
+
+ if (KVM_BUG_ON(err, kvm)) {
+ pr_tdx_error_2(TDH_MEM_PAGE_DEMOTE, err, entry, level_state);
+ return -EIO;
+ }
+ return 0;
+}
+
+static int tdx_sept_split_private_spt(struct kvm *kvm, gfn_t gfn, enum pg_level level,
+ void *private_spt)
+{
+ struct page *page = virt_to_page(private_spt);
+ int ret;
+
+ if (KVM_BUG_ON(to_kvm_tdx(kvm)->state != TD_STATE_RUNNABLE ||
+ level != PG_LEVEL_2M, kvm))
+ return -EINVAL;
+
+ ret = tdx_sept_zap_private_spte(kvm, gfn, level, page);
+ if (ret <= 0)
+ return ret;
+
+ tdx_track(kvm);
+
+ return tdx_spte_demote_private_spte(kvm, gfn, level, page);
+}
+
static int tdx_sept_remove_private_spte(struct kvm *kvm, gfn_t gfn,
enum pg_level level, kvm_pfn_t pfn)
{
@@ -3668,5 +3712,6 @@ void __init tdx_hardware_setup(void)
vt_x86_ops.set_external_spte = tdx_sept_set_private_spte;
vt_x86_ops.free_external_spt = tdx_sept_free_private_spt;
vt_x86_ops.remove_external_spte = tdx_sept_remove_private_spte;
+ vt_x86_ops.split_external_spt = tdx_sept_split_private_spt;
vt_x86_ops.protected_apic_has_interrupt = tdx_protected_apic_has_interrupt;
}
--
2.43.2
^ permalink raw reply related [flat|nested] 129+ messages in thread
* Re: [RFC PATCH v2 10/23] KVM: TDX: Enable huge page splitting under write kvm->mmu_lock
2025-08-07 9:43 ` [RFC PATCH v2 10/23] KVM: TDX: Enable huge page splitting under write kvm->mmu_lock Yan Zhao
@ 2025-11-11 10:20 ` Huang, Kai
2025-11-13 5:53 ` Yan Zhao
2025-12-09 23:49 ` Sagi Shahar
1 sibling, 1 reply; 129+ messages in thread
From: Huang, Kai @ 2025-11-11 10:20 UTC (permalink / raw)
To: pbonzini@redhat.com, seanjc@google.com, Zhao, Yan Y
Cc: quic_eberman@quicinc.com, kvm@vger.kernel.org, Li, Xiaoyao,
Du, Fan, Hansen, Dave, david@redhat.com, thomas.lendacky@amd.com,
vbabka@suse.cz, tabba@google.com, kas@kernel.org,
michael.roth@amd.com, Weiny, Ira, linux-kernel@vger.kernel.org,
binbin.wu@linux.intel.com, ackerleytng@google.com,
Yamahata, Isaku, Peng, Chao P, zhiquan1.li@intel.com,
Annapurve, Vishal, Edgecombe, Rick P, Miao, Jun, x86@kernel.org,
pgonda@google.com
On Thu, 2025-08-07 at 17:43 +0800, Yan Zhao wrote:
> Implement the split_external_spt hook to enable huge page splitting for
> TDX when kvm->mmu_lock is held for writing.
>
> Invoke tdh_mem_range_block(), tdh_mem_track(), kicking off vCPUs,
> tdh_mem_page_demote() in sequence. All operations are performed under
> kvm->mmu_lock held for writing, similar to those in page removal.
>
> Even with kvm->mmu_lock held for writing, tdh_mem_page_demote() may still
> contend with tdh_vp_enter() and potentially with the guest's S-EPT entry
> operations. Therefore, kick off other vCPUs and prevent tdh_vp_enter()
> from being called on them to ensure success on the second attempt. Use
> KVM_BUG_ON() for any other unexpected errors.
I thought we also need to do UNBLOCK after DEMOTE, but it turns out we don't
need to. Maybe we can call this out.
[...]
>
> +static int tdx_spte_demote_private_spte(struct kvm *kvm, gfn_t gfn,
> + enum pg_level level, struct page *page)
> +{
> + int tdx_level = pg_level_to_tdx_sept_level(level);
> + struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
> + gpa_t gpa = gfn_to_gpa(gfn);
> + u64 err, entry, level_state;
> +
> + err = tdh_mem_page_demote(&kvm_tdx->td, gpa, tdx_level, page,
> + &entry, &level_state);
> +
> + if (unlikely(tdx_operand_busy(err))) {
> + tdx_no_vcpus_enter_start(kvm);
> + err = tdh_mem_page_demote(&kvm_tdx->td, gpa, tdx_level, page,
> + &entry, &level_state);
> + tdx_no_vcpus_enter_stop(kvm);
> + }
> +
> + if (KVM_BUG_ON(err, kvm)) {
> + pr_tdx_error_2(TDH_MEM_PAGE_DEMOTE, err, entry, level_state);
> + return -EIO;
> + }
> + return 0;
> +}
> +
> +static int tdx_sept_split_private_spt(struct kvm *kvm, gfn_t gfn, enum pg_level level,
> + void *private_spt)
> +{
> + struct page *page = virt_to_page(private_spt);
> + int ret;
> +
> + if (KVM_BUG_ON(to_kvm_tdx(kvm)->state != TD_STATE_RUNNABLE ||
> + level != PG_LEVEL_2M, kvm))
> + return -EINVAL;
> +
> + ret = tdx_sept_zap_private_spte(kvm, gfn, level, page);
I don't quite follow why you pass 'private_spt' to
tdx_sept_zap_private_spte(), but it doesn't matter anymore since it's gone
in Sean's latest tree.
^ permalink raw reply [flat|nested] 129+ messages in thread
* Re: [RFC PATCH v2 10/23] KVM: TDX: Enable huge page splitting under write kvm->mmu_lock
2025-11-11 10:20 ` Huang, Kai
@ 2025-11-13 5:53 ` Yan Zhao
2025-11-17 9:17 ` Binbin Wu
0 siblings, 1 reply; 129+ messages in thread
From: Yan Zhao @ 2025-11-13 5:53 UTC (permalink / raw)
To: Huang, Kai
Cc: pbonzini@redhat.com, seanjc@google.com, kvm@vger.kernel.org,
Li, Xiaoyao, Du, Fan, Hansen, Dave, david@redhat.com,
thomas.lendacky@amd.com, vbabka@suse.cz, tabba@google.com,
kas@kernel.org, michael.roth@amd.com, Weiny, Ira,
linux-kernel@vger.kernel.org, binbin.wu@linux.intel.com,
ackerleytng@google.com, Yamahata, Isaku, Peng, Chao P,
Annapurve, Vishal, Edgecombe, Rick P, Miao, Jun, x86@kernel.org,
pgonda@google.com
On Tue, Nov 11, 2025 at 06:20:40PM +0800, Huang, Kai wrote:
> On Thu, 2025-08-07 at 17:43 +0800, Yan Zhao wrote:
> > Implement the split_external_spt hook to enable huge page splitting for
> > TDX when kvm->mmu_lock is held for writing.
> >
> > Invoke tdh_mem_range_block(), tdh_mem_track(), kicking off vCPUs,
> > tdh_mem_page_demote() in sequence. All operations are performed under
> > kvm->mmu_lock held for writing, similar to those in page removal.
> >
> > Even with kvm->mmu_lock held for writing, tdh_mem_page_demote() may still
> > contend with tdh_vp_enter() and potentially with the guest's S-EPT entry
> > operations. Therefore, kick off other vCPUs and prevent tdh_vp_enter()
> > from being called on them to ensure success on the second attempt. Use
> > KVM_BUG_ON() for any other unexpected errors.
>
> I thought we also need to do UNBLOCK after DEMOTE, but it turns out we don't
> need to.
Yes, the BLOCK operates on PG_LEVEL_2M, and a successful DEMOTE updates the SEPT
non-leaf 2MB entry to point to the newly added page table page with RWX
permission, so there's no need to do UNBLOCK on success.
The purpose of BLOCK + TRACK + kick off vCPUs is to ensure all vCPUs must find
the old huge guest page is no longer mapped in the SEPT.
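Roughly, the intended S-EPT transition for the 2MB range is as below (an
illustrative sketch only, not TDX module pseudocode):
/*
 * 2MB leaf entry (present)
 *   --TDH.MEM.RANGE.BLOCK-->        2MB leaf entry (blocked)
 *   --TDH.MEM.TRACK + kick vCPUs--> no vCPU can still use a stale mapping
 *   --TDH.MEM.PAGE.DEMOTE-->        2MB non-leaf entry pointing to the new
 *                                   4KB-level page table (present, RWX)
 *
 * Since a successful DEMOTE rewrites the blocked entry as a present
 * non-leaf entry, no explicit UNBLOCK is needed.
 */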
> Maybe we can call this out.
Will do.
> > +static int tdx_spte_demote_private_spte(struct kvm *kvm, gfn_t gfn,
> > + enum pg_level level, struct page *page)
> > +{
> > + int tdx_level = pg_level_to_tdx_sept_level(level);
> > + struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
> > + gpa_t gpa = gfn_to_gpa(gfn);
> > + u64 err, entry, level_state;
> > +
> > + err = tdh_mem_page_demote(&kvm_tdx->td, gpa, tdx_level, page,
> > + &entry, &level_state);
> > +
> > + if (unlikely(tdx_operand_busy(err))) {
> > + tdx_no_vcpus_enter_start(kvm);
> > + err = tdh_mem_page_demote(&kvm_tdx->td, gpa, tdx_level, page,
> > + &entry, &level_state);
> > + tdx_no_vcpus_enter_stop(kvm);
> > + }
> > +
> > + if (KVM_BUG_ON(err, kvm)) {
> > + pr_tdx_error_2(TDH_MEM_PAGE_DEMOTE, err, entry, level_state);
> > + return -EIO;
> > + }
> > + return 0;
> > +}
> > +
> > +static int tdx_sept_split_private_spt(struct kvm *kvm, gfn_t gfn, enum pg_level level,
> > + void *private_spt)
> > +{
> > + struct page *page = virt_to_page(private_spt);
> > + int ret;
> > +
> > + if (KVM_BUG_ON(to_kvm_tdx(kvm)->state != TD_STATE_RUNNABLE ||
> > + level != PG_LEVEL_2M, kvm))
> > + return -EINVAL;
> > +
> > + ret = tdx_sept_zap_private_spte(kvm, gfn, level, page);
>
> I don't quite follow why you pass 'private_spt' to
> tdx_sept_zap_private_spte(),
Simply because tdx_sept_zap_private_spte() requires a "page", which is actually
not used by tdx_sept_zap_private_spte() in the split path.
> but it doesn't matter anymore since it's gone
> in Sean's latest tree.
Right.
^ permalink raw reply [flat|nested] 129+ messages in thread
* Re: [RFC PATCH v2 10/23] KVM: TDX: Enable huge page splitting under write kvm->mmu_lock
2025-11-13 5:53 ` Yan Zhao
@ 2025-11-17 9:17 ` Binbin Wu
2025-11-17 9:26 ` Yan Zhao
0 siblings, 1 reply; 129+ messages in thread
From: Binbin Wu @ 2025-11-17 9:17 UTC (permalink / raw)
To: Yan Zhao, Huang, Kai
Cc: pbonzini@redhat.com, seanjc@google.com, kvm@vger.kernel.org,
Li, Xiaoyao, Du, Fan, Hansen, Dave, david@redhat.com,
thomas.lendacky@amd.com, vbabka@suse.cz, tabba@google.com,
kas@kernel.org, michael.roth@amd.com, Weiny, Ira,
linux-kernel@vger.kernel.org, ackerleytng@google.com,
Yamahata, Isaku, Peng, Chao P, Annapurve, Vishal,
Edgecombe, Rick P, Miao, Jun, x86@kernel.org, pgonda@google.com
On 11/13/2025 1:53 PM, Yan Zhao wrote:
> On Tue, Nov 11, 2025 at 06:20:40PM +0800, Huang, Kai wrote:
>> On Thu, 2025-08-07 at 17:43 +0800, Yan Zhao wrote:
>>> Implement the split_external_spt hook to enable huge page splitting for
Nit:
split_external_spt(), similar to what Kai mentioned in patch 9.
>>> TDX when kvm->mmu_lock is held for writing.
>>>
>>> Invoke tdh_mem_range_block(), tdh_mem_track(), kicking off vCPUs,
>>> tdh_mem_page_demote() in sequence. All operations are performed under
>>> kvm->mmu_lock held for writing, similar to those in page removal.
>>>
>>> Even with kvm->mmu_lock held for writing, tdh_mem_page_demote() may still
>>> contend with tdh_vp_enter() and potentially with the guest's S-EPT entry
>>> operations. Therefore, kick off other vCPUs and prevent tdh_vp_enter()
>>> from being called on them to ensure success on the second attempt. Use
>>> KVM_BUG_ON() for any other unexpected errors.
>> I thought we also need to do UNBLOCK after DEMOTE, but it turns out we don't
>> need to.
> Yes, the BLOCK operates on PG_LEVEL_2M, and a successful DEMOTE updates the SEPT
> non-leaf 2MB entry to point to the newly added page table page with RWX
> permission, so there's no need to do UNBLOCK on success.
>
> The purpose of BLOCK + TRACK + kick off vCPUs is to ensure all vCPUs must find
> the old huge guest page is no longer mapped in the SEPT.
>
>> Maybe we can call this out.
> Will do.
>
>>> +static int tdx_spte_demote_private_spte(struct kvm *kvm, gfn_t gfn,
>>> + enum pg_level level, struct page *page)
>>> +{
>>> + int tdx_level = pg_level_to_tdx_sept_level(level);
>>> + struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
>>> + gpa_t gpa = gfn_to_gpa(gfn);
>>> + u64 err, entry, level_state;
>>> +
>>> + err = tdh_mem_page_demote(&kvm_tdx->td, gpa, tdx_level, page,
>>> + &entry, &level_state);
>>> +
>>> + if (unlikely(tdx_operand_busy(err))) {
>>> + tdx_no_vcpus_enter_start(kvm);
>>> + err = tdh_mem_page_demote(&kvm_tdx->td, gpa, tdx_level, page,
>>> + &entry, &level_state);
>>> + tdx_no_vcpus_enter_stop(kvm);
>>> + }
>>> +
>>> + if (KVM_BUG_ON(err, kvm)) {
>>> + pr_tdx_error_2(TDH_MEM_PAGE_DEMOTE, err, entry, level_state);
>>> + return -EIO;
>>> + }
>>> + return 0;
>>> +}
>>> +
>>> +static int tdx_sept_split_private_spt(struct kvm *kvm, gfn_t gfn, enum pg_level level,
>>> + void *private_spt)
>>> +{
>>> + struct page *page = virt_to_page(private_spt);
>>> + int ret;
>>> +
>>> + if (KVM_BUG_ON(to_kvm_tdx(kvm)->state != TD_STATE_RUNNABLE ||
>>> + level != PG_LEVEL_2M, kvm))
>>> + return -EINVAL;
>>> +
>>> + ret = tdx_sept_zap_private_spte(kvm, gfn, level, page);
>> I don't quite follow why you pass 'private_spt' to
>> tdx_sept_zap_private_spte(),
> Simply because tdx_sept_zap_private_spte() requires a "page", which is actually
> not used by tdx_sept_zap_private_spte() in the split path.
>
>> but it doesn't matter anymore since it's gone
>> in Sean's latest tree.
> Right.
>
^ permalink raw reply [flat|nested] 129+ messages in thread
* Re: [RFC PATCH v2 10/23] KVM: TDX: Enable huge page splitting under write kvm->mmu_lock
2025-11-17 9:17 ` Binbin Wu
@ 2025-11-17 9:26 ` Yan Zhao
0 siblings, 0 replies; 129+ messages in thread
From: Yan Zhao @ 2025-11-17 9:26 UTC (permalink / raw)
To: Binbin Wu
Cc: Huang, Kai, pbonzini@redhat.com, seanjc@google.com,
kvm@vger.kernel.org, Li, Xiaoyao, Du, Fan, Hansen, Dave,
david@redhat.com, thomas.lendacky@amd.com, vbabka@suse.cz,
tabba@google.com, kas@kernel.org, michael.roth@amd.com,
Weiny, Ira, linux-kernel@vger.kernel.org, ackerleytng@google.com,
Yamahata, Isaku, Peng, Chao P, Annapurve, Vishal,
Edgecombe, Rick P, Miao, Jun, x86@kernel.org, pgonda@google.com
On Mon, Nov 17, 2025 at 05:17:27PM +0800, Binbin Wu wrote:
>
>
> On 11/13/2025 1:53 PM, Yan Zhao wrote:
> > On Tue, Nov 11, 2025 at 06:20:40PM +0800, Huang, Kai wrote:
> > > On Thu, 2025-08-07 at 17:43 +0800, Yan Zhao wrote:
> > > > Implement the split_external_spt hook to enable huge page splitting for
>
> Nit:
> split_external_spt(), similar to what Kai mentioned in patch 9.
Ok. I'll refer it to kvm_x86_ops.split_external_spt().
^ permalink raw reply [flat|nested] 129+ messages in thread
* Re: [RFC PATCH v2 10/23] KVM: TDX: Enable huge page splitting under write kvm->mmu_lock
2025-08-07 9:43 ` [RFC PATCH v2 10/23] KVM: TDX: Enable huge page splitting under write kvm->mmu_lock Yan Zhao
2025-11-11 10:20 ` Huang, Kai
@ 2025-12-09 23:49 ` Sagi Shahar
2025-12-09 23:54 ` Edgecombe, Rick P
1 sibling, 1 reply; 129+ messages in thread
From: Sagi Shahar @ 2025-12-09 23:49 UTC (permalink / raw)
To: Yan Zhao
Cc: pbonzini, seanjc, linux-kernel, kvm, x86, rick.p.edgecombe,
dave.hansen, kas, tabba, ackerleytng, quic_eberman, michael.roth,
david, vannapurve, vbabka, thomas.lendacky, pgonda, zhiquan1.li,
fan.du, jun.miao, ira.weiny, isaku.yamahata, xiaoyao.li,
binbin.wu, chao.p.peng
On Thu, Aug 7, 2025 at 4:44 AM Yan Zhao <yan.y.zhao@intel.com> wrote:
>
> Implement the split_external_spt hook to enable huge page splitting for
> TDX when kvm->mmu_lock is held for writing.
>
> Invoke tdh_mem_range_block(), tdh_mem_track(), kicking off vCPUs,
> tdh_mem_page_demote() in sequence. All operations are performed under
> kvm->mmu_lock held for writing, similar to those in page removal.
>
> Even with kvm->mmu_lock held for writing, tdh_mem_page_demote() may still
> contend with tdh_vp_enter() and potentially with the guest's S-EPT entry
> operations. Therefore, kick off other vCPUs and prevent tdh_vp_enter()
> from being called on them to ensure success on the second attempt. Use
> KVM_BUG_ON() for any other unexpected errors.
>
> Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
> ---
> RFC v2:
> - Split out the code to handle the error TDX_INTERRUPTED_RESTARTABLE.
> - Rebased to 6.16.0-rc6 (the way of defining TDX hook changes).
>
> RFC v1:
> - Split patch for exclusive mmu_lock only,
> - Invoke tdx_sept_zap_private_spte() and tdx_track() for splitting.
> - Handled busy error of tdh_mem_page_demote() by kicking off vCPUs.
> ---
> arch/x86/kvm/vmx/tdx.c | 45 ++++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 45 insertions(+)
>
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index 376287a2ddf4..8a60ba5b6595 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -1915,6 +1915,50 @@ static int tdx_sept_free_private_spt(struct kvm *kvm, gfn_t gfn,
> return 0;
> }
>
> +static int tdx_spte_demote_private_spte(struct kvm *kvm, gfn_t gfn,
> + enum pg_level level, struct page *page)
> +{
> + int tdx_level = pg_level_to_tdx_sept_level(level);
> + struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
> + gpa_t gpa = gfn_to_gpa(gfn);
> + u64 err, entry, level_state;
> +
> + err = tdh_mem_page_demote(&kvm_tdx->td, gpa, tdx_level, page,
> + &entry, &level_state);
> +
> + if (unlikely(tdx_operand_busy(err))) {
I was trying to test this code locally (without the DPAMT patches and
with DPAMT disabled) and saw that sometimes tdh_mem_page_demote
returns TDX_INTERRUPTED_RESTARTABLE. Looking at the TDX module code
(version 1.5.16 from [1]) I see that demote and promote are the only
SEAMCALLs that return TDX_INTERRUPTED_RESTARTABLE, so it wasn't handled
by KVM until now.
I added manual handling for it and it's working correctly. Note that
my change is on top of a rebase to the latest version:
@@ -1989,9 +1989,16 @@ static int tdx_spte_demote_private_spte(struct kvm *kvm, gfn_t gfn,
struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
gpa_t gpa = gfn_to_gpa(gfn);
u64 err, entry, level_state;
+ int i = 0;
- err = tdh_do_no_vcpus(tdh_mem_page_demote, kvm, &kvm_tdx->td, gpa,
+ while (i < TDX_SEAMCALL_RETRIES) {
+ err = tdh_do_no_vcpus(tdh_mem_page_demote, kvm, &kvm_tdx->td, gpa,
tdx_level, page, &entry, &level_state);
+ if (err != TDX_INTERRUPTED_RESTARTABLE)
+ break;
+ i++;
+ }
+
if (TDX_BUG_ON_2(err, TDH_MEM_PAGE_DEMOTE, entry, level_state, kvm))
return -EIO;
[1] https://github.com/intel/confidential-computing.tdx.tdx-module
> + tdx_no_vcpus_enter_start(kvm);
> + err = tdh_mem_page_demote(&kvm_tdx->td, gpa, tdx_level, page,
> + &entry, &level_state);
> + tdx_no_vcpus_enter_stop(kvm);
> + }
> +
> + if (KVM_BUG_ON(err, kvm)) {
> + pr_tdx_error_2(TDH_MEM_PAGE_DEMOTE, err, entry, level_state);
> + return -EIO;
> + }
> + return 0;
> +}
> +
> +static int tdx_sept_split_private_spt(struct kvm *kvm, gfn_t gfn, enum pg_level level,
> + void *private_spt)
> +{
> + struct page *page = virt_to_page(private_spt);
> + int ret;
> +
> + if (KVM_BUG_ON(to_kvm_tdx(kvm)->state != TD_STATE_RUNNABLE ||
> + level != PG_LEVEL_2M, kvm))
> + return -EINVAL;
> +
> + ret = tdx_sept_zap_private_spte(kvm, gfn, level, page);
> + if (ret <= 0)
> + return ret;
> +
> + tdx_track(kvm);
> +
> + return tdx_spte_demote_private_spte(kvm, gfn, level, page);
> +}
> +
> static int tdx_sept_remove_private_spte(struct kvm *kvm, gfn_t gfn,
> enum pg_level level, kvm_pfn_t pfn)
> {
> @@ -3668,5 +3712,6 @@ void __init tdx_hardware_setup(void)
> vt_x86_ops.set_external_spte = tdx_sept_set_private_spte;
> vt_x86_ops.free_external_spt = tdx_sept_free_private_spt;
> vt_x86_ops.remove_external_spte = tdx_sept_remove_private_spte;
> + vt_x86_ops.split_external_spt = tdx_sept_split_private_spt;
> vt_x86_ops.protected_apic_has_interrupt = tdx_protected_apic_has_interrupt;
> }
> --
> 2.43.2
>
>
^ permalink raw reply [flat|nested] 129+ messages in thread
* Re: [RFC PATCH v2 10/23] KVM: TDX: Enable huge page splitting under write kvm->mmu_lock
2025-12-09 23:49 ` Sagi Shahar
@ 2025-12-09 23:54 ` Edgecombe, Rick P
2025-12-10 0:28 ` Sagi Shahar
0 siblings, 1 reply; 129+ messages in thread
From: Edgecombe, Rick P @ 2025-12-09 23:54 UTC (permalink / raw)
To: sagis@google.com, Zhao, Yan Y
Cc: kvm@vger.kernel.org, quic_eberman@quicinc.com, Li, Xiaoyao,
Du, Fan, Hansen, Dave, david@redhat.com, thomas.lendacky@amd.com,
tabba@google.com, vbabka@suse.cz, michael.roth@amd.com,
seanjc@google.com, Weiny, Ira, kas@kernel.org,
pbonzini@redhat.com, ackerleytng@google.com,
linux-kernel@vger.kernel.org, binbin.wu@linux.intel.com,
Yamahata, Isaku, Peng, Chao P, zhiquan1.li@intel.com,
Annapurve, Vishal, Miao, Jun, x86@kernel.org, pgonda@google.com
On Tue, 2025-12-09 at 17:49 -0600, Sagi Shahar wrote:
> I was trying to test this code locally (without the DPAMT patches and
> with DPAMT disabled) and saw that sometimes tdh_mem_page_demote
> returns TDX_INTERRUPTED_RESTARTABLE. Looking at the TDX module code
> (version 1.5.16 from [1]) I see that demote and promote are the only
> seamcalls that return TDX_INTERRUPTED_RESTARTABLE so it wasn't handled
> by KVM until now.
Did you see "Open 3" in the coverletter?
^ permalink raw reply [flat|nested] 129+ messages in thread
* Re: [RFC PATCH v2 10/23] KVM: TDX: Enable huge page splitting under write kvm->mmu_lock
2025-12-09 23:54 ` Edgecombe, Rick P
@ 2025-12-10 0:28 ` Sagi Shahar
2025-12-10 0:50 ` Yan Zhao
0 siblings, 1 reply; 129+ messages in thread
From: Sagi Shahar @ 2025-12-10 0:28 UTC (permalink / raw)
To: Edgecombe, Rick P
Cc: Zhao, Yan Y, kvm@vger.kernel.org, quic_eberman@quicinc.com,
Li, Xiaoyao, Du, Fan, Hansen, Dave, david@redhat.com,
thomas.lendacky@amd.com, tabba@google.com, vbabka@suse.cz,
michael.roth@amd.com, seanjc@google.com, Weiny, Ira,
kas@kernel.org, pbonzini@redhat.com, ackerleytng@google.com,
linux-kernel@vger.kernel.org, binbin.wu@linux.intel.com,
Yamahata, Isaku, Peng, Chao P, zhiquan1.li@intel.com,
Annapurve, Vishal, Miao, Jun, x86@kernel.org, pgonda@google.com
On Tue, Dec 9, 2025 at 5:54 PM Edgecombe, Rick P
<rick.p.edgecombe@intel.com> wrote:
>
> On Tue, 2025-12-09 at 17:49 -0600, Sagi Shahar wrote:
> > I was trying to test this code locally (without the DPAMT patches and
> > with DPAMT disabled) and saw that sometimes tdh_mem_page_demote
> > returns TDX_INTERRUPTED_RESTARTABLE. Looking at the TDX module code
> > (version 1.5.16 from [1]) I see that demote and promote are the only
> > seamcalls that return TDX_INTERRUPTED_RESTARTABLE so it wasn't handled
> > by KVM until now.
>
> Did you see "Open 3" in the coverletter?
I tested the code using TDX module 1.5.24 which is the latest one we
got. Is there a newer TDX module that supports this new functionality?
^ permalink raw reply [flat|nested] 129+ messages in thread
* Re: [RFC PATCH v2 10/23] KVM: TDX: Enable huge page splitting under write kvm->mmu_lock
2025-12-10 0:28 ` Sagi Shahar
@ 2025-12-10 0:50 ` Yan Zhao
2025-12-10 17:16 ` Sagi Shahar
0 siblings, 1 reply; 129+ messages in thread
From: Yan Zhao @ 2025-12-10 0:50 UTC (permalink / raw)
To: Sagi Shahar
Cc: Edgecombe, Rick P, kvm@vger.kernel.org, quic_eberman@quicinc.com,
Li, Xiaoyao, Du, Fan, Hansen, Dave, david@redhat.com,
thomas.lendacky@amd.com, tabba@google.com, vbabka@suse.cz,
michael.roth@amd.com, seanjc@google.com, Weiny, Ira,
kas@kernel.org, pbonzini@redhat.com, ackerleytng@google.com,
linux-kernel@vger.kernel.org, binbin.wu@linux.intel.com,
Yamahata, Isaku, Peng, Chao P, zhiquan1.li@intel.com,
Annapurve, Vishal, Miao, Jun, x86@kernel.org, pgonda@google.com
On Tue, Dec 09, 2025 at 06:28:56PM -0600, Sagi Shahar wrote:
> On Tue, Dec 9, 2025 at 5:54 PM Edgecombe, Rick P
> <rick.p.edgecombe@intel.com> wrote:
> >
> > On Tue, 2025-12-09 at 17:49 -0600, Sagi Shahar wrote:
> > > I was trying to test this code locally (without the DPAMT patches and
> > > with DPAMT disabled) and saw that sometimes tdh_mem_page_demote
> > > returns TDX_INTERRUPTED_RESTARTABLE. Looking at the TDX module code
> > > (version 1.5.16 from [1]) I see that demote and promote are the only
> > > seamcalls that return TDX_INTERRUPTED_RESTARTABLE so it wasn't handled
> > > by KVM until now.
> >
> > Did you see "Open 3" in the coverletter?
>
> I tested the code using TDX module 1.5.24 which is the latest one we
> got. Is there a newer TDX module that supports this new functionality?
AFAIK, TDX module 1.5.28 is the earliest version that enumerates
TDX_FEATURES0.ENHANCED_DEMOTE_INTERRUPTIBILITY (bit 51) and disables
TDX_INTERRUPTED_RESTARTABLE when there're no L2 TDs. (Please check the
discussions at [1]).
Looks 1.5.28.04 was just released (internally?), with release note saying
"Ensure TDH.MEM.PAGE.DEMOTE forward progress for non partitioned TDs".
Not sure if you can check it.
[1] https://lore.kernel.org/all/aRRAFhw11Dwcw7RG@yzhao56-desk.sh.intel.com/
^ permalink raw reply [flat|nested] 129+ messages in thread
* Re: [RFC PATCH v2 10/23] KVM: TDX: Enable huge page splitting under write kvm->mmu_lock
2025-12-10 0:50 ` Yan Zhao
@ 2025-12-10 17:16 ` Sagi Shahar
2025-12-10 19:49 ` Edgecombe, Rick P
0 siblings, 1 reply; 129+ messages in thread
From: Sagi Shahar @ 2025-12-10 17:16 UTC (permalink / raw)
To: Yan Zhao
Cc: Edgecombe, Rick P, kvm@vger.kernel.org, quic_eberman@quicinc.com,
Li, Xiaoyao, Du, Fan, Hansen, Dave, david@redhat.com,
thomas.lendacky@amd.com, tabba@google.com, vbabka@suse.cz,
michael.roth@amd.com, seanjc@google.com, Weiny, Ira,
kas@kernel.org, pbonzini@redhat.com, ackerleytng@google.com,
linux-kernel@vger.kernel.org, binbin.wu@linux.intel.com,
Yamahata, Isaku, Peng, Chao P, zhiquan1.li@intel.com,
Annapurve, Vishal, Miao, Jun, x86@kernel.org, pgonda@google.com
On Tue, Dec 9, 2025 at 6:53 PM Yan Zhao <yan.y.zhao@intel.com> wrote:
>
> On Tue, Dec 09, 2025 at 06:28:56PM -0600, Sagi Shahar wrote:
> > On Tue, Dec 9, 2025 at 5:54 PM Edgecombe, Rick P
> > <rick.p.edgecombe@intel.com> wrote:
> > >
> > > On Tue, 2025-12-09 at 17:49 -0600, Sagi Shahar wrote:
> > > > I was trying to test this code locally (without the DPAMT patches and
> > > > with DPAMT disabled) and saw that sometimes tdh_mem_page_demote
> > > > returns TDX_INTERRUPTED_RESTARTABLE. Looking at the TDX module code
> > > > (version 1.5.16 from [1]) I see that demote and promote are the only
> > > > seamcalls that return TDX_INTERRUPTED_RESTARTABLE so it wasn't handled
> > > > by KVM until now.
> > >
> > > Did you see "Open 3" in the coverletter?
> >
> > I tested the code using TDX module 1.5.24 which is the latest one we
> > got. Is there a newer TDX module that supports this new functionality?
> AFAIK, TDX module 1.5.28 is the earliest version that enumerates
> TDX_FEATURES0.ENHANCED_DEMOTE_INTERRUPTIBILITY (bit 51) and disables
> TDX_INTERRUPTED_RESTARTABLE when there're no L2 TDs. (Please check the
> discussions at [1]).
>
> Looks 1.5.28.04 was just released (internally?), with release note saying
> "Ensure TDH.MEM.PAGE.DEMOTE forward progress for non partitioned TDs".
>
Thanks. I don't have access to the 1.5.28.04 module and we need the
code to work with the 1.5.24 module as well based on our timeline so I
guess we can just add the retries locally for now.
Do you see any issue with retrying the operation in case of
TDX_INTERRUPTED_RESTARTABLE? From what I saw this is not just a
theoretical race but happens every time I try to boot a VM, even for a
small VM with 4 VCPUs and 8GB of memory.
> Not sure if you can check it.
>
> [1] https://lore.kernel.org/all/aRRAFhw11Dwcw7RG@yzhao56-desk.sh.intel.com/
>
^ permalink raw reply [flat|nested] 129+ messages in thread
* Re: [RFC PATCH v2 10/23] KVM: TDX: Enable huge page splitting under write kvm->mmu_lock
2025-12-10 17:16 ` Sagi Shahar
@ 2025-12-10 19:49 ` Edgecombe, Rick P
2025-12-11 2:10 ` Yan Zhao
0 siblings, 1 reply; 129+ messages in thread
From: Edgecombe, Rick P @ 2025-12-10 19:49 UTC (permalink / raw)
To: sagis@google.com, Zhao, Yan Y
Cc: Du, Fan, Li, Xiaoyao, quic_eberman@quicinc.com, Hansen, Dave,
david@redhat.com, thomas.lendacky@amd.com, tabba@google.com,
vbabka@suse.cz, kvm@vger.kernel.org, michael.roth@amd.com,
seanjc@google.com, Weiny, Ira, pbonzini@redhat.com,
binbin.wu@linux.intel.com, ackerleytng@google.com,
linux-kernel@vger.kernel.org, Yamahata, Isaku, Peng, Chao P,
kas@kernel.org, Annapurve, Vishal, Miao, Jun,
zhiquan1.li@intel.com, x86@kernel.org, pgonda@google.com
On Wed, 2025-12-10 at 11:16 -0600, Sagi Shahar wrote:
> Thanks. I don't have access to the 1.5.28.04 module and we need the
> code to work with the 1.5.24 module as well based on our timeline so I
> guess we can just add the retries locally for now.
>
> Do you see any issue with retrying the operation in case of
> TDX_INTERRUPTED_RESTARTABLE?
>
Yan has been testing with a similar workaround. See "[DROP ME] x86/virt/tdx:
Loop for TDX_INTERRUPTED_RESTARTABLE in tdh_mem_page_demote()".
With TDX_INTERRUPTED_RESTARTABLE compared to RESUMABLE, the problem is that
there is no guarantee it will make forward progress. So looping during an
interrupt storm would halt the process context in an unusual way.
So the two kernel side options we discussed were loop forever, or loop for a
certain amount of times and KVM_BUG_ON()/warn (like you had). They have
different problems - unbounded loop vs potentially killing the TD for unrelated
host behavior. So that is how we came to the decision to rely on TDX module
changes for the long term upstream solution.
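Roughly, the bounded option would be something like the sketch below, reusing the
variables from tdx_spte_demote_private_spte() in this patch; TDX_DEMOTE_RETRY_CAP
is a made-up limit, not an existing constant:

	int i;

	for (i = 0; i < TDX_DEMOTE_RETRY_CAP; i++) {
		err = tdh_mem_page_demote(&kvm_tdx->td, gpa, tdx_level, page,
					  &entry, &level_state);
		if (err != TDX_INTERRUPTED_RESTARTABLE)
			break;
	}
	/* An interrupt storm can exhaust the cap and then kill the TD here. */
	if (KVM_BUG_ON(err, kvm))
		return -EIO;

Dropping the cap turns it into the unbounded variant, which instead risks stalling
the process context indefinitely.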
You could also see this thread that touches on disabling interrupts around the
seamcall:
https://lore.kernel.org/kvm/99f5585d759328db973403be0713f68e492b492a.camel@intel.com/
However it does not help the NMI case. Do you know which you are hitting?
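For reference, disabling interrupts around the SEAMCALL would roughly look like
the sketch below (illustrative only, reusing the variables from the patch; as
noted, it does not cover NMIs):

	unsigned long flags;

	/*
	 * Mask maskable interrupts across the SEAMCALL so the TDX module is
	 * not interrupted mid-demote; NMIs can still interrupt it.
	 */
	local_irq_save(flags);
	err = tdh_mem_page_demote(&kvm_tdx->td, gpa, tdx_level, page,
				  &entry, &level_state);
	local_irq_restore(flags);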
> From what I saw this is not just a
> theoretical race but happens every time I try to boot a VM
>
Oh, interesting.
> , even for a
> small VM with 4 VCPUs and 8GB of memory.
It probably matters more what else is happening on the system to cause a host
interrupt.
^ permalink raw reply [flat|nested] 129+ messages in thread
* Re: [RFC PATCH v2 10/23] KVM: TDX: Enable huge page splitting under write kvm->mmu_lock
2025-12-10 19:49 ` Edgecombe, Rick P
@ 2025-12-11 2:10 ` Yan Zhao
0 siblings, 0 replies; 129+ messages in thread
From: Yan Zhao @ 2025-12-11 2:10 UTC (permalink / raw)
To: Edgecombe, Rick P
Cc: sagis@google.com, Du, Fan, Li, Xiaoyao, quic_eberman@quicinc.com,
Hansen, Dave, david@redhat.com, thomas.lendacky@amd.com,
tabba@google.com, vbabka@suse.cz, kvm@vger.kernel.org,
michael.roth@amd.com, seanjc@google.com, Weiny, Ira,
pbonzini@redhat.com, binbin.wu@linux.intel.com,
ackerleytng@google.com, linux-kernel@vger.kernel.org,
Yamahata, Isaku, Peng, Chao P, kas@kernel.org, Annapurve, Vishal,
Miao, Jun, zhiquan1.li@intel.com, x86@kernel.org,
pgonda@google.com
On Thu, Dec 11, 2025 at 03:49:26AM +0800, Edgecombe, Rick P wrote:
> On Wed, 2025-12-10 at 11:16 -0600, Sagi Shahar wrote:
> > From what I saw this is not just a
> > theoretical race but happens every time I try to boot a VM
I don't think we mentioned that the retry for TDX_INTERRUPTED_RESTARTABLE is
theoretical, did we? :)
On my SPR, to boot a VM with 8 vCPUs and 8 GB of memory, on average there are
271.4 demotes, with 1.1 retries for TDX_INTERRUPTED_RESTARTABLE.
# | demote cnt | retry cnt (for TDX_INTERRUPTED_RESTARTABLE)
--------|------------|----------
1 | 271 | 2
2 | 273 | 0
3 | 271 | 1
4 | 270 | 2
5 | 271 | 3
6 | 271 | 0
7 | 274 | 0
8 | 270 | 0
9 | 271 | 2
10 | 272 | 0
> Oh, interesting.
>
> > , even for a
> > small VM with 4 VCPUs and 8GB of memory.
>
> It probably matters more what else is happening on the system to cause a host
> interrupt.
Agreed.
^ permalink raw reply [flat|nested] 129+ messages in thread
* [RFC PATCH v2 11/23] KVM: x86: Reject splitting huge pages under shared mmu_lock for mirror root
2025-08-07 9:39 [RFC PATCH v2 00/23] KVM: TDX huge page support for private memory Yan Zhao
` (9 preceding siblings ...)
2025-08-07 9:43 ` [RFC PATCH v2 10/23] KVM: TDX: Enable huge page splitting under write kvm->mmu_lock Yan Zhao
@ 2025-08-07 9:43 ` Yan Zhao
2025-09-03 3:30 ` Binbin Wu
2025-08-07 9:43 ` [RFC PATCH v2 12/23] KVM: x86/mmu: Introduce kvm_split_cross_boundary_leafs() Yan Zhao
` (11 subsequent siblings)
22 siblings, 1 reply; 129+ messages in thread
From: Yan Zhao @ 2025-08-07 9:43 UTC (permalink / raw)
To: pbonzini, seanjc
Cc: linux-kernel, kvm, x86, rick.p.edgecombe, dave.hansen, kas, tabba,
ackerleytng, quic_eberman, michael.roth, david, vannapurve,
vbabka, thomas.lendacky, pgonda, zhiquan1.li, fan.du, jun.miao,
ira.weiny, isaku.yamahata, xiaoyao.li, binbin.wu, chao.p.peng,
yan.y.zhao
While removing the KVM_BUG_ON() for the mirror root before invoking
tdp_mmu_split_huge_page() in the fault path, update the hook
split_external_spt to pass in shared mmu_lock info and invoke the hook in
set_external_spte_present() when splitting is detected. Reject the splitting
in TDX if the splitting is under shared mmu_lock.
TDX requires different handling for splitting under shared or exclusive
mmu_lock.
Under a shared mmu_lock, TDX cannot kick off all vCPUs to avoid BUSY error
from tdh_mem_page_demote(). As the current TDX module requires
tdh_mem_range_block() to be invoked before each tdh_mem_page_demote(), if a
BUSY error occurs, TDX must call tdh_mem_range_unblock() before returning
the error to the KVM MMU core to roll back the old SPTE and retry. However,
tdh_mem_range_unblock() may also fail due to contention.
Reject splitting huge pages under shared mmu_lock for mirror root in TDX
rather than KVM_BUG_ON() in KVM MMU core to allow for future real
implementation of demote under shared mmu_lock once non-blocking demote is
available.
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
---
RFC v2:
- WARN_ON_ONCE() and return error in tdx_sept_split_private_spt() if it's
invoked under shared mmu_lock. (rather than increase the next fault's
max_level in current vCPU via tdx->violation_gfn_start/end and
tdx->violation_request_level).
- TODO: Perform the real implementation of demote under shared mmu_lock
when new version of TDX module supporting non-blocking demote is
available.
RFC v1:
- New patch.
---
arch/x86/include/asm/kvm_host.h | 2 +-
arch/x86/kvm/mmu/tdp_mmu.c | 45 ++++++++++++++++++++-------------
arch/x86/kvm/vmx/tdx.c | 8 +++++-
3 files changed, 36 insertions(+), 19 deletions(-)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index e431ce0e3180..6cb5b422dd1d 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1841,7 +1841,7 @@ struct kvm_x86_ops {
/* Split the external page table into smaller page tables */
int (*split_external_spt)(struct kvm *kvm, gfn_t gfn, enum pg_level level,
- void *external_spt);
+ void *external_spt, bool mmu_lock_shared);
bool (*has_wbinvd_exit)(void);
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index a2c6e6e4773f..ce49cc850ed5 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -386,15 +386,14 @@ static void remove_external_spte(struct kvm *kvm, gfn_t gfn, u64 old_spte,
}
static int split_external_spt(struct kvm *kvm, gfn_t gfn, u64 old_spte,
- u64 new_spte, int level)
+ u64 new_spte, int level, bool shared)
{
void *external_spt = get_external_spt(gfn, new_spte, level);
int ret;
KVM_BUG_ON(!external_spt, kvm);
- ret = kvm_x86_call(split_external_spt)(kvm, gfn, level, external_spt);
-
+ ret = kvm_x86_call(split_external_spt)(kvm, gfn, level, external_spt, shared);
return ret;
}
/**
@@ -533,11 +532,19 @@ static int __must_check set_external_spte_present(struct kvm *kvm, tdp_ptep_t sp
{
bool was_present = is_shadow_present_pte(old_spte);
bool is_present = is_shadow_present_pte(new_spte);
+ bool was_leaf = was_present && is_last_spte(old_spte, level);
bool is_leaf = is_present && is_last_spte(new_spte, level);
kvm_pfn_t new_pfn = spte_to_pfn(new_spte);
int ret = 0;
- KVM_BUG_ON(was_present, kvm);
+ /*
+ * Caller ensures new_spte must be present.
+ * Current valid transitions:
+ * - leaf to non-leaf (demote)
+ * - !present to present leaf
+ * - !present to present non-leaf
+ */
+ KVM_BUG_ON(!(!was_present || (was_leaf && !is_leaf)), kvm);
lockdep_assert_held(&kvm->mmu_lock);
/*
@@ -548,18 +555,24 @@ static int __must_check set_external_spte_present(struct kvm *kvm, tdp_ptep_t sp
if (!try_cmpxchg64(rcu_dereference(sptep), &old_spte, FROZEN_SPTE))
return -EBUSY;
- /*
- * Use different call to either set up middle level
- * external page table, or leaf.
- */
- if (is_leaf) {
- ret = kvm_x86_call(set_external_spte)(kvm, gfn, level, new_pfn);
- } else {
- void *external_spt = get_external_spt(gfn, new_spte, level);
+ if (!was_present) {
+ /*
+ * Use different call to either set up middle level
+ * external page table, or leaf.
+ */
+ if (is_leaf) {
+ ret = kvm_x86_call(set_external_spte)(kvm, gfn, level, new_pfn);
+ } else {
+ void *external_spt = get_external_spt(gfn, new_spte, level);
- KVM_BUG_ON(!external_spt, kvm);
- ret = kvm_x86_call(link_external_spt)(kvm, gfn, level, external_spt);
+ KVM_BUG_ON(!external_spt, kvm);
+ ret = kvm_x86_call(link_external_spt)(kvm, gfn, level, external_spt);
+ }
+ } else if (was_leaf && !is_leaf) {
+ /* demote */
+ ret = split_external_spt(kvm, gfn, old_spte, new_spte, level, true);
}
+
if (ret)
__kvm_tdp_mmu_write_spte(sptep, old_spte);
else
@@ -789,7 +802,7 @@ static u64 tdp_mmu_set_spte(struct kvm *kvm, int as_id, tdp_ptep_t sptep,
if (!is_shadow_present_pte(new_spte))
remove_external_spte(kvm, gfn, old_spte, level);
else if (is_last_spte(old_spte, level) && !is_last_spte(new_spte, level))
- split_external_spt(kvm, gfn, old_spte, new_spte, level);
+ split_external_spt(kvm, gfn, old_spte, new_spte, level, false);
else
KVM_BUG_ON(1, kvm);
}
@@ -1308,8 +1321,6 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
sp->nx_huge_page_disallowed = fault->huge_page_disallowed;
if (is_shadow_present_pte(iter.old_spte)) {
- /* Don't support large page for mirrored roots (TDX) */
- KVM_BUG_ON(is_mirror_sptep(iter.sptep), vcpu->kvm);
r = tdp_mmu_split_huge_page(kvm, &iter, sp, true);
} else {
r = tdp_mmu_link_sp(kvm, &iter, sp, true);
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 8a60ba5b6595..035d81275be4 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1941,7 +1941,7 @@ static int tdx_spte_demote_private_spte(struct kvm *kvm, gfn_t gfn,
}
static int tdx_sept_split_private_spt(struct kvm *kvm, gfn_t gfn, enum pg_level level,
- void *private_spt)
+ void *private_spt, bool mmu_lock_shared)
{
struct page *page = virt_to_page(private_spt);
int ret;
@@ -1950,6 +1950,12 @@ static int tdx_sept_split_private_spt(struct kvm *kvm, gfn_t gfn, enum pg_level
level != PG_LEVEL_2M, kvm))
return -EINVAL;
+ if (WARN_ON_ONCE(mmu_lock_shared)) {
+ pr_warn_once("Splitting of GFN %llx level %d under shared lock occurs when KVM does not support it yet\n",
+ gfn, level);
+ return -EOPNOTSUPP;
+ }
+
ret = tdx_sept_zap_private_spte(kvm, gfn, level, page);
if (ret <= 0)
return ret;
--
2.43.2
^ permalink raw reply related [flat|nested] 129+ messages in thread
* Re: [RFC PATCH v2 11/23] KVM: x86: Reject splitting huge pages under shared mmu_lock for mirror root
2025-08-07 9:43 ` [RFC PATCH v2 11/23] KVM: x86: Reject splitting huge pages under shared mmu_lock for mirror root Yan Zhao
@ 2025-09-03 3:30 ` Binbin Wu
0 siblings, 0 replies; 129+ messages in thread
From: Binbin Wu @ 2025-09-03 3:30 UTC (permalink / raw)
To: Yan Zhao
Cc: pbonzini, seanjc, linux-kernel, kvm, x86, rick.p.edgecombe,
dave.hansen, kas, tabba, ackerleytng, quic_eberman, michael.roth,
david, vannapurve, vbabka, thomas.lendacky, pgonda, zhiquan1.li,
fan.du, jun.miao, ira.weiny, isaku.yamahata, xiaoyao.li,
chao.p.peng
On 8/7/2025 5:43 PM, Yan Zhao wrote:
> While removing the KVM_BUG_ON() for the mirror root before invoking
> tdp_mmu_split_huge_page() in the fault path, update the hook
> split_external_spt to pass in shared mmu_lock info and invoke the hook in
> set_external_spte_present() when splitting is detected. Reject the splitting
> in TDX if the splitting is under shared mmu_lock.
>
> TDX requires different handling for splitting under shared or exclusive
> mmu_lock.
>
> Under a shared mmu_lock, TDX cannot kick off all vCPUs to avoid BUSY error
> from tdh_mem_page_demote(). As the current TDX module requires
> tdh_mem_range_block() to be invoked before each tdh_mem_page_demote(), if a
> BUSY error occurs, TDX must call tdh_mem_range_unblock() before returning
> the error to the KVM MMU core to roll back the old SPTE and retry. However,
> tdh_mem_range_unblock() may also fail due to contention.
>
> Reject splitting huge pages under shared mmu_lock for mirror root in TDX
> rather than KVM_BUG_ON() in KVM MMU core to allow for future real
> implementation of demote under shared mmu_lock once non-blocking demote is
> available.
Prefer "blockless" used in the cover letter to non-blocking.
[...]
^ permalink raw reply [flat|nested] 129+ messages in thread
* [RFC PATCH v2 12/23] KVM: x86/mmu: Introduce kvm_split_cross_boundary_leafs()
2025-08-07 9:39 [RFC PATCH v2 00/23] KVM: TDX huge page support for private memory Yan Zhao
` (10 preceding siblings ...)
2025-08-07 9:43 ` [RFC PATCH v2 11/23] KVM: x86: Reject splitting huge pages under shared mmu_lock for mirror root Yan Zhao
@ 2025-08-07 9:43 ` Yan Zhao
2025-09-03 6:57 ` Binbin Wu
` (2 more replies)
2025-08-07 9:44 ` [RFC PATCH v2 13/23] KVM: x86: Introduce hugepage_set_guest_inhibit() Yan Zhao
` (10 subsequent siblings)
22 siblings, 3 replies; 129+ messages in thread
From: Yan Zhao @ 2025-08-07 9:43 UTC (permalink / raw)
To: pbonzini, seanjc
Cc: linux-kernel, kvm, x86, rick.p.edgecombe, dave.hansen, kas, tabba,
ackerleytng, quic_eberman, michael.roth, david, vannapurve,
vbabka, thomas.lendacky, pgonda, zhiquan1.li, fan.du, jun.miao,
ira.weiny, isaku.yamahata, xiaoyao.li, binbin.wu, chao.p.peng,
yan.y.zhao
Introduce kvm_split_cross_boundary_leafs() to split huge leaf entries that
cross the boundary of a specified range.
Splitting huge leaf entries that cross the boundary is essential before
zapping the range in the mirror root. This ensures that the subsequent zap
operation does not affect any GFNs outside the specified range. This is
crucial for the mirror root, as the private page table requires the guest's
ACCEPT operation after a GFN faults back.
The core of kvm_split_cross_boundary_leafs() leverages the main logic from
tdp_mmu_split_huge_pages_root(). It traverses the specified root and splits
huge leaf entries if they cross the range boundary. When splitting is
necessary, kvm->mmu_lock is temporarily released for memory allocation,
which means returning -ENOMEM is possible.
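As an illustration of the intended usage, a minimal caller-side sketch of the
return convention (the wrapper below is hypothetical and not part of this patch):

static int split_then_zap_sketch(struct kvm *kvm, struct kvm_gfn_range *range)
{
	int r;

	r = kvm_split_cross_boundary_leafs(kvm, range, /*shared=*/false);
	if (r < 0)
		return r;	/* e.g. -ENOMEM or -EOPNOTSUPP */
	if (r == 1)		/* a cross-boundary leaf was split */
		kvm_flush_remote_tlbs(kvm);

	/* Zapping [range->start, range->end) now cannot affect GFNs outside it. */
	return 0;
}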
Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
---
RFC v2:
- Rename the API to kvm_split_cross_boundary_leafs().
- Make the API to be usable for direct roots or under shared mmu_lock.
- Leverage the main logic from tdp_mmu_split_huge_pages_root(). (Rick)
RFC v1:
- Split patch.
- introduced API kvm_split_boundary_leafs(), refined the logic and
simplified the code.
---
arch/x86/kvm/mmu/mmu.c | 27 +++++++++++++++
arch/x86/kvm/mmu/tdp_mmu.c | 68 ++++++++++++++++++++++++++++++++++++--
arch/x86/kvm/mmu/tdp_mmu.h | 3 ++
include/linux/kvm_host.h | 2 ++
4 files changed, 97 insertions(+), 3 deletions(-)
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 9182192daa3a..13910ae05f76 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -1647,6 +1647,33 @@ static bool __kvm_rmap_zap_gfn_range(struct kvm *kvm,
start, end - 1, can_yield, true, flush);
}
+/*
+ * Split large leafs crossing the boundary of the specified range
+ *
+ * Return value:
+ * 0 : success, no flush is required;
+ * 1 : success, flush is required;
+ * <0: failure.
+ */
+int kvm_split_cross_boundary_leafs(struct kvm *kvm, struct kvm_gfn_range *range,
+ bool shared)
+{
+ bool ret = 0;
+
+ lockdep_assert_once(kvm->mmu_invalidate_in_progress ||
+ lockdep_is_held(&kvm->slots_lock) ||
+ srcu_read_lock_held(&kvm->srcu));
+
+ if (!range->may_block)
+ return -EOPNOTSUPP;
+
+ if (tdp_mmu_enabled)
+ ret = kvm_tdp_mmu_gfn_range_split_cross_boundary_leafs(kvm, range, shared);
+
+ return ret;
+}
+EXPORT_SYMBOL_GPL(kvm_split_cross_boundary_leafs);
+
bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
{
bool flush = false;
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index ce49cc850ed5..62a09a9655c3 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1574,10 +1574,17 @@ static int tdp_mmu_split_huge_page(struct kvm *kvm, struct tdp_iter *iter,
return ret;
}
+static bool iter_cross_boundary(struct tdp_iter *iter, gfn_t start, gfn_t end)
+{
+ return !(iter->gfn >= start &&
+ (iter->gfn + KVM_PAGES_PER_HPAGE(iter->level)) <= end);
+}
+
static int tdp_mmu_split_huge_pages_root(struct kvm *kvm,
struct kvm_mmu_page *root,
gfn_t start, gfn_t end,
- int target_level, bool shared)
+ int target_level, bool shared,
+ bool only_cross_bounday, bool *flush)
{
struct kvm_mmu_page *sp = NULL;
struct tdp_iter iter;
@@ -1589,6 +1596,13 @@ static int tdp_mmu_split_huge_pages_root(struct kvm *kvm,
* level into one lower level. For example, if we encounter a 1GB page
* we split it into 512 2MB pages.
*
+ * When only_cross_bounday is true, just split huge pages above the
+ * target level into one lower level if the huge pages cross the start
+ * or end boundary.
+ *
+ * No need to update @flush for !only_cross_bounday cases, which rely
+ * on the callers to do the TLB flush in the end.
+ *
* Since the TDP iterator uses a pre-order traversal, we are guaranteed
* to visit an SPTE before ever visiting its children, which means we
* will correctly recursively split huge pages that are more than one
@@ -1597,12 +1611,19 @@ static int tdp_mmu_split_huge_pages_root(struct kvm *kvm,
*/
for_each_tdp_pte_min_level(iter, kvm, root, target_level + 1, start, end) {
retry:
- if (tdp_mmu_iter_cond_resched(kvm, &iter, false, shared))
+ if (tdp_mmu_iter_cond_resched(kvm, &iter, *flush, shared)) {
+ if (only_cross_bounday)
+ *flush = false;
continue;
+ }
if (!is_shadow_present_pte(iter.old_spte) || !is_large_pte(iter.old_spte))
continue;
+ if (only_cross_bounday &&
+ !iter_cross_boundary(&iter, start, end))
+ continue;
+
if (!sp) {
rcu_read_unlock();
@@ -1637,6 +1658,8 @@ static int tdp_mmu_split_huge_pages_root(struct kvm *kvm,
goto retry;
sp = NULL;
+ if (only_cross_bounday)
+ *flush = true;
}
rcu_read_unlock();
@@ -1663,10 +1686,12 @@ void kvm_tdp_mmu_try_split_huge_pages(struct kvm *kvm,
{
struct kvm_mmu_page *root;
int r = 0;
+ bool flush = false;
kvm_lockdep_assert_mmu_lock_held(kvm, shared);
for_each_valid_tdp_mmu_root_yield_safe(kvm, root, slot->as_id) {
- r = tdp_mmu_split_huge_pages_root(kvm, root, start, end, target_level, shared);
+ r = tdp_mmu_split_huge_pages_root(kvm, root, start, end, target_level,
+ shared, false, &flush);
if (r) {
kvm_tdp_mmu_put_root(kvm, root);
break;
@@ -1674,6 +1699,43 @@ void kvm_tdp_mmu_try_split_huge_pages(struct kvm *kvm,
}
}
+/*
+ * Split large leafs which cross the specified boundary
+ */
+static int tdp_mmu_split_cross_boundary_leafs(struct kvm *kvm, struct kvm_mmu_page *root,
+ gfn_t start, gfn_t end, bool shared,
+ bool *flush)
+{
+ return tdp_mmu_split_huge_pages_root(kvm, root, start, end, PG_LEVEL_4K,
+ shared, true, flush);
+}
+
+int kvm_tdp_mmu_gfn_range_split_cross_boundary_leafs(struct kvm *kvm,
+ struct kvm_gfn_range *range,
+ bool shared)
+{
+ enum kvm_tdp_mmu_root_types types;
+ struct kvm_mmu_page *root;
+ bool flush = false;
+ int ret;
+
+ kvm_lockdep_assert_mmu_lock_held(kvm, shared);
+ types = kvm_gfn_range_filter_to_root_types(kvm, range->attr_filter);
+
+ __for_each_tdp_mmu_root_yield_safe(kvm, root, range->slot->as_id, types) {
+ ret = tdp_mmu_split_cross_boundary_leafs(kvm, root, range->start,
+ range->end, shared, &flush);
+ if (ret < 0) {
+ if (flush)
+ kvm_flush_remote_tlbs(kvm);
+
+ kvm_tdp_mmu_put_root(kvm, root);
+ return ret;
+ }
+ }
+ return flush;
+}
+
static bool tdp_mmu_need_write_protect(struct kvm *kvm, struct kvm_mmu_page *sp)
{
/*
diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
index 52acf99d40a0..332d47cce714 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.h
+++ b/arch/x86/kvm/mmu/tdp_mmu.h
@@ -69,6 +69,9 @@ void kvm_tdp_mmu_zap_all(struct kvm *kvm);
void kvm_tdp_mmu_invalidate_roots(struct kvm *kvm,
enum kvm_tdp_mmu_root_types root_types);
void kvm_tdp_mmu_zap_invalidated_roots(struct kvm *kvm, bool shared);
+int kvm_tdp_mmu_gfn_range_split_cross_boundary_leafs(struct kvm *kvm,
+ struct kvm_gfn_range *range,
+ bool shared);
int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault);
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index fb79d2b7decd..6137b76341e1 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -273,6 +273,8 @@ struct kvm_gfn_range {
bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range);
bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
+int kvm_split_cross_boundary_leafs(struct kvm *kvm, struct kvm_gfn_range *range,
+ bool shared);
#endif
enum {
--
2.43.2
^ permalink raw reply related [flat|nested] 129+ messages in thread
* Re: [RFC PATCH v2 12/23] KVM: x86/mmu: Introduce kvm_split_cross_boundary_leafs()
2025-08-07 9:43 ` [RFC PATCH v2 12/23] KVM: x86/mmu: Introduce kvm_split_cross_boundary_leafs() Yan Zhao
@ 2025-09-03 6:57 ` Binbin Wu
2025-09-03 9:44 ` Yan Zhao
2025-11-11 10:42 ` Huang, Kai
2025-11-19 6:31 ` Yan Zhao
2 siblings, 1 reply; 129+ messages in thread
From: Binbin Wu @ 2025-09-03 6:57 UTC (permalink / raw)
To: Yan Zhao
Cc: pbonzini, seanjc, linux-kernel, kvm, x86, rick.p.edgecombe,
dave.hansen, kas, tabba, ackerleytng, quic_eberman, michael.roth,
david, vannapurve, vbabka, thomas.lendacky, pgonda, zhiquan1.li,
fan.du, jun.miao, ira.weiny, isaku.yamahata, xiaoyao.li,
chao.p.peng
On 8/7/2025 5:43 PM, Yan Zhao wrote:
> Introduce kvm_split_cross_boundary_leafs() to split huge leaf entries that
> cross the boundary of a specified range.
>
> Splitting huge leaf entries that cross the boundary is essential before
> zapping the range in the mirror root. This ensures that the subsequent zap
> operation does not affect any GFNs outside the specified range. This is
> crucial for the mirror root, as the private page table requires the guest's
> ACCEPT operation after a GFN faults back.
>
> The core of kvm_split_cross_boundary_leafs() leverages the main logic from
> tdp_mmu_split_huge_pages_root(). It traverses the specified root and splits
> huge leaf entries if they cross the range boundary. When splitting is
> necessary, kvm->mmu_lock is temporarily released for memory allocation,
> which means returning -ENOMEM is possible.
>
> Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
> ---
> RFC v2:
> - Rename the API to kvm_split_cross_boundary_leafs().
> - Make the API to be usable for direct roots or under shared mmu_lock.
> - Leverage the main logic from tdp_mmu_split_huge_pages_root(). (Rick)
>
> RFC v1:
> - Split patch.
> - introduced API kvm_split_boundary_leafs(), refined the logic and
> simplified the code.
> ---
> arch/x86/kvm/mmu/mmu.c | 27 +++++++++++++++
> arch/x86/kvm/mmu/tdp_mmu.c | 68 ++++++++++++++++++++++++++++++++++++--
> arch/x86/kvm/mmu/tdp_mmu.h | 3 ++
> include/linux/kvm_host.h | 2 ++
> 4 files changed, 97 insertions(+), 3 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 9182192daa3a..13910ae05f76 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -1647,6 +1647,33 @@ static bool __kvm_rmap_zap_gfn_range(struct kvm *kvm,
> start, end - 1, can_yield, true, flush);
> }
>
> +/*
> + * Split large leafs crossing the boundary of the specified range
> + *
> + * Return value:
> + * 0 : success, no flush is required;
> + * 1 : success, flush is required;
> + * <0: failure.
> + */
> +int kvm_split_cross_boundary_leafs(struct kvm *kvm, struct kvm_gfn_range *range,
> + bool shared)
> +{
> + bool ret = 0;
> +
> + lockdep_assert_once(kvm->mmu_invalidate_in_progress ||
> + lockdep_is_held(&kvm->slots_lock) ||
> + srcu_read_lock_held(&kvm->srcu));
> +
> + if (!range->may_block)
> + return -EOPNOTSUPP;
> +
> + if (tdp_mmu_enabled)
> + ret = kvm_tdp_mmu_gfn_range_split_cross_boundary_leafs(kvm, range, shared);
> +
> + return ret;
> +}
> +EXPORT_SYMBOL_GPL(kvm_split_cross_boundary_leafs);
> +
> bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
> {
> bool flush = false;
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index ce49cc850ed5..62a09a9655c3 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -1574,10 +1574,17 @@ static int tdp_mmu_split_huge_page(struct kvm *kvm, struct tdp_iter *iter,
> return ret;
> }
>
> +static bool iter_cross_boundary(struct tdp_iter *iter, gfn_t start, gfn_t end)
> +{
> + return !(iter->gfn >= start &&
> + (iter->gfn + KVM_PAGES_PER_HPAGE(iter->level)) <= end);
> +}
> +
> static int tdp_mmu_split_huge_pages_root(struct kvm *kvm,
> struct kvm_mmu_page *root,
> gfn_t start, gfn_t end,
> - int target_level, bool shared)
> + int target_level, bool shared,
> + bool only_cross_bounday, bool *flush)
s/only_cross_bounday/only_cross_boundary
> {
> struct kvm_mmu_page *sp = NULL;
> struct tdp_iter iter;
> @@ -1589,6 +1596,13 @@ static int tdp_mmu_split_huge_pages_root(struct kvm *kvm,
> * level into one lower level. For example, if we encounter a 1GB page
> * we split it into 512 2MB pages.
> *
> + * When only_cross_bounday is true, just split huge pages above the
> + * target level into one lower level if the huge pages cross the start
> + * or end boundary.
> + *
> + * No need to update @flush for !only_cross_bounday cases, which rely
> + * on the callers to do the TLB flush in the end.
I think API-wise it's a bit confusing, although it's a local API.
If I just look at the API without digging into the function implementation, my
initial thought is that *flush will tell whether a TLB flush is needed or not.
Just update *flush unconditionally? Or move the comment into the function
description to call it out?
I also considered another option: combine the two inputs, i.e., if *flush is a
valid pointer, it means it's for only_cross_boundary; otherwise, just pass
NULL. But then I felt it was a bit risky to rely on the pointer to indicate the
scenario.
> + *
> * Since the TDP iterator uses a pre-order traversal, we are guaranteed
> * to visit an SPTE before ever visiting its children, which means we
> * will correctly recursively split huge pages that are more than one
> @@ -1597,12 +1611,19 @@ static int tdp_mmu_split_huge_pages_root(struct kvm *kvm,
> */
> for_each_tdp_pte_min_level(iter, kvm, root, target_level + 1, start, end) {
> retry:
> - if (tdp_mmu_iter_cond_resched(kvm, &iter, false, shared))
> + if (tdp_mmu_iter_cond_resched(kvm, &iter, *flush, shared)) {
> + if (only_cross_bounday)
> + *flush = false;
> continue;
> + }
>
> if (!is_shadow_present_pte(iter.old_spte) || !is_large_pte(iter.old_spte))
> continue;
>
> + if (only_cross_bounday &&
> + !iter_cross_boundary(&iter, start, end))
> + continue;
> +
> if (!sp) {
> rcu_read_unlock();
>
> @@ -1637,6 +1658,8 @@ static int tdp_mmu_split_huge_pages_root(struct kvm *kvm,
> goto retry;
>
> sp = NULL;
> + if (only_cross_bounday)
> + *flush = true;
> }
>
> rcu_read_unlock();
[...]
^ permalink raw reply [flat|nested] 129+ messages in thread
* Re: [RFC PATCH v2 12/23] KVM: x86/mmu: Introduce kvm_split_cross_boundary_leafs()
2025-09-03 6:57 ` Binbin Wu
@ 2025-09-03 9:44 ` Yan Zhao
0 siblings, 0 replies; 129+ messages in thread
From: Yan Zhao @ 2025-09-03 9:44 UTC (permalink / raw)
To: Binbin Wu
Cc: pbonzini, seanjc, linux-kernel, kvm, x86, rick.p.edgecombe,
dave.hansen, kas, tabba, ackerleytng, quic_eberman, michael.roth,
david, vannapurve, vbabka, thomas.lendacky, pgonda, zhiquan1.li,
fan.du, jun.miao, ira.weiny, isaku.yamahata, xiaoyao.li,
chao.p.peng
On Wed, Sep 03, 2025 at 02:57:07PM +0800, Binbin Wu wrote:
>
>
> On 8/7/2025 5:43 PM, Yan Zhao wrote:
> > Introduce kvm_split_cross_boundary_leafs() to split huge leaf entries that
> > cross the boundary of a specified range.
> >
> > Splitting huge leaf entries that cross the boundary is essential before
> > zapping the range in the mirror root. This ensures that the subsequent zap
> > operation does not affect any GFNs outside the specified range. This is
> > crucial for the mirror root, as the private page table requires the guest's
> > ACCEPT operation after a GFN faults back.
> >
> > The core of kvm_split_cross_boundary_leafs() leverages the main logic from
> > tdp_mmu_split_huge_pages_root(). It traverses the specified root and splits
> > huge leaf entries if they cross the range boundary. When splitting is
> > necessary, kvm->mmu_lock is temporarily released for memory allocation,
> > which means returning -ENOMEM is possible.
> >
> > Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
> > Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> > Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
> > ---
> > RFC v2:
> > - Rename the API to kvm_split_cross_boundary_leafs().
> > - Make the API to be usable for direct roots or under shared mmu_lock.
> > - Leverage the main logic from tdp_mmu_split_huge_pages_root(). (Rick)
> >
> > RFC v1:
> > - Split patch.
> > - introduced API kvm_split_boundary_leafs(), refined the logic and
> > simplified the code.
> > ---
> > arch/x86/kvm/mmu/mmu.c | 27 +++++++++++++++
> > arch/x86/kvm/mmu/tdp_mmu.c | 68 ++++++++++++++++++++++++++++++++++++--
> > arch/x86/kvm/mmu/tdp_mmu.h | 3 ++
> > include/linux/kvm_host.h | 2 ++
> > 4 files changed, 97 insertions(+), 3 deletions(-)
> >
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index 9182192daa3a..13910ae05f76 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -1647,6 +1647,33 @@ static bool __kvm_rmap_zap_gfn_range(struct kvm *kvm,
> > start, end - 1, can_yield, true, flush);
> > }
> > +/*
> > + * Split large leafs crossing the boundary of the specified range
> > + *
> > + * Return value:
> > + * 0 : success, no flush is required;
> > + * 1 : success, flush is required;
> > + * <0: failure.
> > + */
> > +int kvm_split_cross_boundary_leafs(struct kvm *kvm, struct kvm_gfn_range *range,
> > + bool shared)
> > +{
> > + bool ret = 0;
> > +
> > + lockdep_assert_once(kvm->mmu_invalidate_in_progress ||
> > + lockdep_is_held(&kvm->slots_lock) ||
> > + srcu_read_lock_held(&kvm->srcu));
> > +
> > + if (!range->may_block)
> > + return -EOPNOTSUPP;
> > +
> > + if (tdp_mmu_enabled)
> > + ret = kvm_tdp_mmu_gfn_range_split_cross_boundary_leafs(kvm, range, shared);
> > +
> > + return ret;
> > +}
> > +EXPORT_SYMBOL_GPL(kvm_split_cross_boundary_leafs);
> > +
> > bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
> > {
> > bool flush = false;
> > diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> > index ce49cc850ed5..62a09a9655c3 100644
> > --- a/arch/x86/kvm/mmu/tdp_mmu.c
> > +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> > @@ -1574,10 +1574,17 @@ static int tdp_mmu_split_huge_page(struct kvm *kvm, struct tdp_iter *iter,
> > return ret;
> > }
> > +static bool iter_cross_boundary(struct tdp_iter *iter, gfn_t start, gfn_t end)
> > +{
> > + return !(iter->gfn >= start &&
> > + (iter->gfn + KVM_PAGES_PER_HPAGE(iter->level)) <= end);
> > +}
> > +
> > static int tdp_mmu_split_huge_pages_root(struct kvm *kvm,
> > struct kvm_mmu_page *root,
> > gfn_t start, gfn_t end,
> > - int target_level, bool shared)
> > + int target_level, bool shared,
> > + bool only_cross_bounday, bool *flush)
> s/only_cross_bounday/only_cross_boundary
Will fix.
> > {
> > struct kvm_mmu_page *sp = NULL;
> > struct tdp_iter iter;
> > @@ -1589,6 +1596,13 @@ static int tdp_mmu_split_huge_pages_root(struct kvm *kvm,
> > * level into one lower level. For example, if we encounter a 1GB page
> > * we split it into 512 2MB pages.
> > *
> > + * When only_cross_bounday is true, just split huge pages above the
> > + * target level into one lower level if the huge pages cross the start
> > + * or end boundary.
> > + *
> > + * No need to update @flush for !only_cross_bounday cases, which rely
> > + * on the callers to do the TLB flush in the end.
>
> I think API-wise it's a bit confusing, although it's a local API.
> If I just look at the API without digging into the function implementation, my
> initial thought is that *flush will tell whether a TLB flush is needed or not.
>
> Just update *flush unconditionally? Or move the comment into the function
> description to call it out?
>
> I also considered another option: combine the two inputs, i.e., if *flush is a
> valid pointer, it means it's for only_cross_boundary; otherwise, just pass
> NULL. But then I felt it was a bit risky to rely on the pointer to indicate the
> scenario.
I feel it's better not to combine flush and only_cross_boundary.
Will add a function description to tdp_mmu_split_huge_pages_root().
> > + *
> > * Since the TDP iterator uses a pre-order traversal, we are guaranteed
> > * to visit an SPTE before ever visiting its children, which means we
> > * will correctly recursively split huge pages that are more than one
> > @@ -1597,12 +1611,19 @@ static int tdp_mmu_split_huge_pages_root(struct kvm *kvm,
> > */
> > for_each_tdp_pte_min_level(iter, kvm, root, target_level + 1, start, end) {
> > retry:
> > - if (tdp_mmu_iter_cond_resched(kvm, &iter, false, shared))
> > + if (tdp_mmu_iter_cond_resched(kvm, &iter, *flush, shared)) {
> > + if (only_cross_bounday)
> > + *flush = false;
> > continue;
> > + }
> > if (!is_shadow_present_pte(iter.old_spte) || !is_large_pte(iter.old_spte))
> > continue;
> > + if (only_cross_bounday &&
> > + !iter_cross_boundary(&iter, start, end))
> > + continue;
> > +
> > if (!sp) {
> > rcu_read_unlock();
> > @@ -1637,6 +1658,8 @@ static int tdp_mmu_split_huge_pages_root(struct kvm *kvm,
> > goto retry;
> > sp = NULL;
> > + if (only_cross_bounday)
> > + *flush = true;
> > }
> > rcu_read_unlock();
> [...]
^ permalink raw reply [flat|nested] 129+ messages in thread
* Re: [RFC PATCH v2 12/23] KVM: x86/mmu: Introduce kvm_split_cross_boundary_leafs()
2025-08-07 9:43 ` [RFC PATCH v2 12/23] KVM: x86/mmu: Introduce kvm_split_cross_boundary_leafs() Yan Zhao
2025-09-03 6:57 ` Binbin Wu
@ 2025-11-11 10:42 ` Huang, Kai
2025-11-13 8:54 ` Yan Zhao
2025-11-19 6:31 ` Yan Zhao
2 siblings, 1 reply; 129+ messages in thread
From: Huang, Kai @ 2025-11-11 10:42 UTC (permalink / raw)
To: pbonzini@redhat.com, seanjc@google.com, Zhao, Yan Y
Cc: quic_eberman@quicinc.com, kvm@vger.kernel.org, Li, Xiaoyao,
Du, Fan, Hansen, Dave, david@redhat.com, thomas.lendacky@amd.com,
vbabka@suse.cz, tabba@google.com, kas@kernel.org,
michael.roth@amd.com, Weiny, Ira, linux-kernel@vger.kernel.org,
binbin.wu@linux.intel.com, ackerleytng@google.com,
Yamahata, Isaku, Peng, Chao P, zhiquan1.li@intel.com,
Annapurve, Vishal, Edgecombe, Rick P, Miao, Jun, x86@kernel.org,
pgonda@google.com
On Thu, 2025-08-07 at 17:43 +0800, Yan Zhao wrote:
> static int tdp_mmu_split_huge_pages_root(struct kvm *kvm,
> struct kvm_mmu_page *root,
> gfn_t start, gfn_t end,
> - int target_level, bool shared)
> + int target_level, bool shared,
> + bool only_cross_bounday, bool *flush)
> {
> struct kvm_mmu_page *sp = NULL;
> struct tdp_iter iter;
> @@ -1589,6 +1596,13 @@ static int tdp_mmu_split_huge_pages_root(struct kvm *kvm,
> * level into one lower level. For example, if we encounter a 1GB page
> * we split it into 512 2MB pages.
> *
> + * When only_cross_bounday is true, just split huge pages above the
> + * target level into one lower level if the huge pages cross the start
> + * or end boundary.
> + *
> + * No need to update @flush for !only_cross_bounday cases, which rely
> + * on the callers to do the TLB flush in the end.
> + *
s/only_cross_bounday/only_cross_boundary
From tdp_mmu_split_huge_pages_root()'s perspective, it's quite odd to only
update 'flush' when 'only_cross_bounday' is true, because
'only_cross_bounday' can only result in less splitting.
I understand this is because splitting an S-EPT mapping needs a flush (at least
before non-block DEMOTE is implemented?). Would it be better to also let the
caller decide whether a TLB flush is needed? E.g., we can make
tdp_mmu_split_huge_pages_root() return whether any split has been done or
not. I think this should also work?
^ permalink raw reply [flat|nested] 129+ messages in thread
* Re: [RFC PATCH v2 12/23] KVM: x86/mmu: Introduce kvm_split_cross_boundary_leafs()
2025-11-11 10:42 ` Huang, Kai
@ 2025-11-13 8:54 ` Yan Zhao
2025-11-13 11:02 ` Huang, Kai
0 siblings, 1 reply; 129+ messages in thread
From: Yan Zhao @ 2025-11-13 8:54 UTC (permalink / raw)
To: Huang, Kai
Cc: pbonzini@redhat.com, seanjc@google.com, kvm@vger.kernel.org,
Li, Xiaoyao, Du, Fan, Hansen, Dave, david@redhat.com,
thomas.lendacky@amd.com, vbabka@suse.cz, tabba@google.com,
kas@kernel.org, michael.roth@amd.com, Weiny, Ira,
linux-kernel@vger.kernel.org, binbin.wu@linux.intel.com,
ackerleytng@google.com, Yamahata, Isaku, Peng, Chao P,
Annapurve, Vishal, Edgecombe, Rick P, Miao, Jun, x86@kernel.org,
pgonda@google.com
On Tue, Nov 11, 2025 at 06:42:55PM +0800, Huang, Kai wrote:
> On Thu, 2025-08-07 at 17:43 +0800, Yan Zhao wrote:
> > static int tdp_mmu_split_huge_pages_root(struct kvm *kvm,
> > struct kvm_mmu_page *root,
> > gfn_t start, gfn_t end,
> > - int target_level, bool shared)
> > + int target_level, bool shared,
> > + bool only_cross_bounday, bool *flush)
> > {
> > struct kvm_mmu_page *sp = NULL;
> > struct tdp_iter iter;
> > @@ -1589,6 +1596,13 @@ static int tdp_mmu_split_huge_pages_root(struct kvm *kvm,
> > * level into one lower level. For example, if we encounter a 1GB page
> > * we split it into 512 2MB pages.
> > *
> > + * When only_cross_bounday is true, just split huge pages above the
> > + * target level into one lower level if the huge pages cross the start
> > + * or end boundary.
> > + *
> > + * No need to update @flush for !only_cross_bounday cases, which rely
> > + * on the callers to do the TLB flush in the end.
> > + *
>
> s/only_cross_bounday/only_cross_boundary
>
> From tdp_mmu_split_huge_pages_root()'s perspective, it's quite odd to only
> update 'flush' when 'only_cross_bounday' is true, because
> 'only_cross_bounday' can only results in less splitting.
I have to say it's a reasonable point.
> I understand this is because splitting S-EPT mapping needs flush (at least
> before non-block DEMOTE is implemented?). Would it better to also let the
Actually the flush is only required for !TDX cases.
For TDX, either the flush has been performed internally within
tdx_sept_split_private_spt() or the flush is not required for future non-block
DEMOTE. So, the flush in KVM core on the mirror root may be skipped as a future
optimization for TDX if necessary.
> caller to decide whether TLB flush is needed? E.g., we can make
> tdp_mmu_split_huge_pages_root() return whether any split has been done or
> not. I think this should also work?
Do you mean just skipping the changes in the below "Hunk 1"?
Since tdp_mmu_split_huge_pages_root() originally did not do the flush by itself,
but relied on the end callers (i.e., kvm_mmu_slot_apply_flags(),
kvm_clear_dirty_log_protect(), and kvm_get_dirty_log_protect()) to do the flush
unconditionally, tdp_mmu_split_huge_pages_root() previously did not return
whether any split had been done or not.
So, if we want callers of kvm_split_cross_boundary_leafs() to do the flush only
after splitting occurs, we have to return whether a flush is required.
Then, in this patch, it seems only the changes in "Hunk 1" can be dropped.
Hunk 1
-------------------------------
for_each_tdp_pte_min_level(iter, kvm, root, target_level + 1, start, end) {
retry:
- if (tdp_mmu_iter_cond_resched(kvm, &iter, false, shared))
+ if (tdp_mmu_iter_cond_resched(kvm, &iter, *flush, shared)) {
+ if (only_cross_bounday)
+ *flush = false;
continue;
+ }
if (!is_shadow_present_pte(iter.old_spte) || !is_large_pte(iter.old_spte))
continue;
Hunk 2
-------------------------------
+ if (only_cross_bounday &&
+ !iter_cross_boundary(&iter, start, end))
+ continue;
+
if (!sp) {
rcu_read_unlock();
Hunk 3
-------------------------------
@@ -1637,6 +1658,8 @@ static int tdp_mmu_split_huge_pages_root(struct kvm *kvm,
goto retry;
sp = NULL;
+ if (only_cross_bounday)
+ *flush = true;
}
Do you think dropping of "Hunk 1" is worthwhile?
Would it be less odd if I make the following change?
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 9f479832a981..7bc1d1a5f3ce 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1589,6 +1589,7 @@ static int tdp_mmu_split_huge_pages_root(struct kvm *kvm,
{
struct kvm_mmu_page *sp = NULL;
struct tdp_iter iter;
+ bool caller_unconditional_flush = !only_cross_bounday;
rcu_read_lock();
@@ -1613,7 +1614,7 @@ static int tdp_mmu_split_huge_pages_root(struct kvm *kvm,
for_each_tdp_pte_min_level(iter, kvm, root, target_level + 1, start, end) {
retry:
if (tdp_mmu_iter_cond_resched(kvm, &iter, *flush, shared)) {
- if (only_cross_bounday)
+ if (!caller_unconditional_flush)
*flush = false;
continue;
}
@@ -1659,7 +1660,7 @@ static int tdp_mmu_split_huge_pages_root(struct kvm *kvm,
goto retry;
sp = NULL;
- if (only_cross_bounday)
+ if (!caller_unconditional_flush)
*flush = true;
}
^ permalink raw reply related [flat|nested] 129+ messages in thread
* Re: [RFC PATCH v2 12/23] KVM: x86/mmu: Introduce kvm_split_cross_boundary_leafs()
2025-11-13 8:54 ` Yan Zhao
@ 2025-11-13 11:02 ` Huang, Kai
2025-11-13 11:40 ` Huang, Kai
2025-11-14 6:09 ` Yan Zhao
0 siblings, 2 replies; 129+ messages in thread
From: Huang, Kai @ 2025-11-13 11:02 UTC (permalink / raw)
To: Zhao, Yan Y
Cc: Du, Fan, Li, Xiaoyao, kvm@vger.kernel.org, Hansen, Dave,
david@redhat.com, thomas.lendacky@amd.com, tabba@google.com,
vbabka@suse.cz, michael.roth@amd.com,
linux-kernel@vger.kernel.org, seanjc@google.com,
pbonzini@redhat.com, binbin.wu@linux.intel.com,
ackerleytng@google.com, kas@kernel.org, Weiny, Ira, Peng, Chao P,
Yamahata, Isaku, Annapurve, Vishal, Edgecombe, Rick P, Miao, Jun,
x86@kernel.org, pgonda@google.com
On Thu, 2025-11-13 at 16:54 +0800, Yan Zhao wrote:
> On Tue, Nov 11, 2025 at 06:42:55PM +0800, Huang, Kai wrote:
> > On Thu, 2025-08-07 at 17:43 +0800, Yan Zhao wrote:
> > > static int tdp_mmu_split_huge_pages_root(struct kvm *kvm,
> > > struct kvm_mmu_page *root,
> > > gfn_t start, gfn_t end,
> > > - int target_level, bool shared)
> > > + int target_level, bool shared,
> > > + bool only_cross_bounday, bool *flush)
> > > {
> > > struct kvm_mmu_page *sp = NULL;
> > > struct tdp_iter iter;
> > > @@ -1589,6 +1596,13 @@ static int tdp_mmu_split_huge_pages_root(struct kvm *kvm,
> > > * level into one lower level. For example, if we encounter a 1GB page
> > > * we split it into 512 2MB pages.
> > > *
> > > + * When only_cross_bounday is true, just split huge pages above the
> > > + * target level into one lower level if the huge pages cross the start
> > > + * or end boundary.
> > > + *
> > > + * No need to update @flush for !only_cross_bounday cases, which rely
> > > + * on the callers to do the TLB flush in the end.
> > > + *
> >
> > s/only_cross_bounday/only_cross_boundary
> >
> > From tdp_mmu_split_huge_pages_root()'s perspective, it's quite odd to only
> > update 'flush' when 'only_cross_bounday' is true, because
> > 'only_cross_bounday' can only results in less splitting.
> I have to say it's a reasonable point.
>
> > I understand this is because splitting S-EPT mapping needs flush (at least
> > before non-block DEMOTE is implemented?). Would it better to also let the
> Actually the flush is only required for !TDX cases.
>
> For TDX, either the flush has been performed internally within
> tdx_sept_split_private_spt()
>
AFAICT tdx_sept_split_private_spt() only does tdh_mem_track(), so KVM should
still kick all vCPUs out of guest mode so other vCPUs can actually flush the
TLB?
> or the flush is not required for future non-block
> DEMOTE. So, the flush in KVM core on the mirror root may be skipped as a future
> optimization for TDX if necessary.
>
> > caller to decide whether TLB flush is needed? E.g., we can make
> > tdp_mmu_split_huge_pages_root() return whether any split has been done or
> > not. I think this should also work?
> Do you mean just skipping the changes in the below "Hunk 1"?
>
> Since tdp_mmu_split_huge_pages_root() originally did not do flush by itself,
> which relied on the end callers (i.e.,kvm_mmu_slot_apply_flags(),
> kvm_clear_dirty_log_protect(), and kvm_get_dirty_log_protect()) to do the flush
> unconditionally, tdp_mmu_split_huge_pages_root() previously did not return
> whether any split has been done or not.
Right. But making it return whether any split has been done does no harm.
>
> So, if we want callers of kvm_split_cross_boundary_leafs() to do flush only
> after splitting occurs, we have to return whether flush is required.
But assuming we always return whether "split has been done", the caller can also
effectively know whether the flush is needed.
>
> Then, in this patch, seems only the changes in "Hunk 1" can be dropped.
I am thinking of dropping both "Hunk 1" and "Hunk 3". This at least makes
kvm_split_cross_boundary_leafs() more reasonable, IMHO.
Something like below:
@@ -1558,7 +1558,9 @@ static int tdp_mmu_split_huge_page(struct kvm *kvm, struct tdp_iter *iter,
static int tdp_mmu_split_huge_pages_root(struct kvm *kvm,
struct kvm_mmu_page *root,
gfn_t start, gfn_t end,
- int target_level, bool shared)
+ int target_level, bool shared,
+ bool only_cross_boundary,
+ bool *split)
{
struct kvm_mmu_page *sp = NULL;
struct tdp_iter iter;
@@ -1584,6 +1586,9 @@ static int tdp_mmu_split_huge_pages_root(struct kvm *kvm,
if (!is_shadow_present_pte(iter.old_spte) ||
!is_large_pte(iter.old_spte))
continue;
+ if (only_cross_boundary && !iter_cross_boundary(&iter, start, end))
+ continue;
+
if (!sp) {
rcu_read_unlock();
@@ -1618,6 +1623,7 @@ static int tdp_mmu_split_huge_pages_root(struct kvm *kvm,
goto retry;
sp = NULL;
+ *split = true;
}
rcu_read_unlock();
Btw, I have to follow up on this next week, since tomorrow is a public holiday.
^ permalink raw reply [flat|nested] 129+ messages in thread
* Re: [RFC PATCH v2 12/23] KVM: x86/mmu: Introduce kvm_split_cross_boundary_leafs()
2025-11-13 11:02 ` Huang, Kai
@ 2025-11-13 11:40 ` Huang, Kai
2025-11-14 6:09 ` Yan Zhao
1 sibling, 0 replies; 129+ messages in thread
From: Huang, Kai @ 2025-11-13 11:40 UTC (permalink / raw)
To: Zhao, Yan Y
Cc: Du, Fan, Li, Xiaoyao, kvm@vger.kernel.org, Hansen, Dave,
david@redhat.com, thomas.lendacky@amd.com, vbabka@suse.cz,
tabba@google.com, linux-kernel@vger.kernel.org, seanjc@google.com,
binbin.wu@linux.intel.com, kas@kernel.org, pbonzini@redhat.com,
ackerleytng@google.com, michael.roth@amd.com, Weiny, Ira,
Peng, Chao P, Yamahata, Isaku, Annapurve, Vishal,
Edgecombe, Rick P, Miao, Jun, x86@kernel.org, pgonda@google.com
On Thu, 2025-11-13 at 11:02 +0000, Huang, Kai wrote:
> I am thinking dropping both "Hunk 1" and "Hunk 3". This at least makes
> kvm_split_cross_boundary_leafs() more reasonable, IMHO.
>
> Something like below:
>
> @@ -1558,7 +1558,9 @@ static int tdp_mmu_split_huge_page(struct kvm *kvm, struct
> tdp_iter *iter,
> static int tdp_mmu_split_huge_pages_root(struct kvm *kvm,
> struct kvm_mmu_page *root,
> gfn_t start, gfn_t end,
> - int target_level, bool shared)
> + int target_level, bool shared,
> + bool only_cross_boundary,
> + bool *split)
> {
> struct kvm_mmu_page *sp = NULL;
> struct tdp_iter iter;
> @@ -1584,6 +1586,9 @@ static int tdp_mmu_split_huge_pages_root(struct kvm *kvm,
> if (!is_shadow_present_pte(iter.old_spte) ||
> !is_large_pte(iter.old_spte))
> continue;
>
> + if (only_cross_boundary && !iter_cross_boundary(&iter, start,
> end))
> + continue;
> +
> if (!sp) {
> rcu_read_unlock();
>
> @@ -1618,6 +1623,7 @@ static int tdp_mmu_split_huge_pages_root(struct kvm *kvm,
> goto retry;
>
> sp = NULL;
> + *split = true;
> }
Forgot to say, if needed, we can update @split only when it is a valid pointer:
if (split)
*split = true;
This allows the caller to just pass NULL when it doesn't care whether a split
has been done.
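And for illustration, a caller that does care could then do roughly the below
(just a sketch, not the actual call sites in this series):

        bool split = false;
        int ret;

        ret = tdp_mmu_split_huge_pages_root(kvm, root, start, end, target_level,
                                            shared, only_cross_boundary, &split);
        /* Flush only when something was actually split. */
        if (split)
                kvm_flush_remote_tlbs(kvm);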
^ permalink raw reply [flat|nested] 129+ messages in thread
* Re: [RFC PATCH v2 12/23] KVM: x86/mmu: Introduce kvm_split_cross_boundary_leafs()
2025-11-13 11:02 ` Huang, Kai
2025-11-13 11:40 ` Huang, Kai
@ 2025-11-14 6:09 ` Yan Zhao
2025-11-18 0:14 ` Huang, Kai
1 sibling, 1 reply; 129+ messages in thread
From: Yan Zhao @ 2025-11-14 6:09 UTC (permalink / raw)
To: Huang, Kai
Cc: Du, Fan, Li, Xiaoyao, kvm@vger.kernel.org, Hansen, Dave,
david@redhat.com, thomas.lendacky@amd.com, tabba@google.com,
vbabka@suse.cz, michael.roth@amd.com,
linux-kernel@vger.kernel.org, seanjc@google.com,
pbonzini@redhat.com, binbin.wu@linux.intel.com,
ackerleytng@google.com, kas@kernel.org, Weiny, Ira, Peng, Chao P,
Yamahata, Isaku, Annapurve, Vishal, Edgecombe, Rick P, Miao, Jun,
x86@kernel.org, pgonda@google.com
On Thu, Nov 13, 2025 at 07:02:59PM +0800, Huang, Kai wrote:
> On Thu, 2025-11-13 at 16:54 +0800, Yan Zhao wrote:
> > On Tue, Nov 11, 2025 at 06:42:55PM +0800, Huang, Kai wrote:
> > > On Thu, 2025-08-07 at 17:43 +0800, Yan Zhao wrote:
> > > > static int tdp_mmu_split_huge_pages_root(struct kvm *kvm,
> > > > struct kvm_mmu_page *root,
> > > > gfn_t start, gfn_t end,
> > > > - int target_level, bool shared)
> > > > + int target_level, bool shared,
> > > > + bool only_cross_bounday, bool *flush)
> > > > {
> > > > struct kvm_mmu_page *sp = NULL;
> > > > struct tdp_iter iter;
> > > > @@ -1589,6 +1596,13 @@ static int tdp_mmu_split_huge_pages_root(struct kvm *kvm,
> > > > * level into one lower level. For example, if we encounter a 1GB page
> > > > * we split it into 512 2MB pages.
> > > > *
> > > > + * When only_cross_bounday is true, just split huge pages above the
> > > > + * target level into one lower level if the huge pages cross the start
> > > > + * or end boundary.
> > > > + *
> > > > + * No need to update @flush for !only_cross_bounday cases, which rely
> > > > + * on the callers to do the TLB flush in the end.
> > > > + *
> > >
> > > s/only_cross_bounday/only_cross_boundary
> > >
> > > From tdp_mmu_split_huge_pages_root()'s perspective, it's quite odd to only
> > > update 'flush' when 'only_cross_bounday' is true, because
> > > 'only_cross_bounday' can only result in less splitting.
> > I have to say it's a reasonable point.
> >
> > > I understand this is because splitting S-EPT mapping needs flush (at least
> > > before non-block DEMOTE is implemented?). Would it be better to also let the
> > Actually the flush is only required for !TDX cases.
> >
> > For TDX, either the flush has been performed internally within
> > tdx_sept_split_private_spt()
> >
>
> AFAICT tdx_sept_split_private_spt() only does tdh_mem_track(), so KVM should
> still kick all vCPUs out of guest mode so other vCPUs can actually flush the
> TLB?
tdx_sept_split_private_spt() actually invokes tdx_track(), which kicks all
vCPUs out of guest mode by invoking
"kvm_make_all_cpus_request(kvm, KVM_REQ_OUTSIDE_GUEST_MODE)".
> > or the flush is not required for future non-block
> > DEMOTE. So, the flush in KVM core on the mirror root may be skipped as a future
> > optimization for TDX if necessary.
> >
> > > caller to decide whether TLB flush is needed? E.g., we can make
> > > tdp_mmu_split_huge_pages_root() return whether any split has been done or
> > > not. I think this should also work?
> > Do you mean just skipping the changes in the below "Hunk 1"?
> >
> > Since tdp_mmu_split_huge_pages_root() originally did not do flush by itself,
> > which relied on the end callers (i.e.,kvm_mmu_slot_apply_flags(),
> > kvm_clear_dirty_log_protect(), and kvm_get_dirty_log_protect()) to do the flush
> > unconditionally, tdp_mmu_split_huge_pages_root() previously did not return
> > whether any split has been done or not.
>
> Right. But making it return whether any split has been done does no harm.
>
> >
> > So, if we want callers of kvm_split_cross_boundary_leafs() to do flush only
> > after splitting occurs, we have to return whether flush is required.
>
> But assuming we always return whether "split has been done", the caller can also
> effectively know whether the flush is needed.
>
> >
> > Then, in this patch, seems only the changes in "Hunk 1" can be dropped.
>
> I am thinking dropping both "Hunk 1" and "Hunk 3". This at least makes
> kvm_split_cross_boundary_leafs() more reasonable, IMHO.
>
> Something like below:
>
> @@ -1558,7 +1558,9 @@ static int tdp_mmu_split_huge_page(struct kvm *kvm, struct
> tdp_iter *iter,
> static int tdp_mmu_split_huge_pages_root(struct kvm *kvm,
> struct kvm_mmu_page *root,
> gfn_t start, gfn_t end,
> - int target_level, bool shared)
> + int target_level, bool shared,
> + bool only_cross_boundary,
> + bool *split)
> {
> struct kvm_mmu_page *sp = NULL;
> struct tdp_iter iter;
> @@ -1584,6 +1586,9 @@ static int tdp_mmu_split_huge_pages_root(struct kvm *kvm,
> if (!is_shadow_present_pte(iter.old_spte) ||
> !is_large_pte(iter.old_spte))
> continue;
>
> + if (only_cross_boundary && !iter_cross_boundary(&iter, start,
> end))
> + continue;
> +
> if (!sp) {
> rcu_read_unlock();
>
> @@ -1618,6 +1623,7 @@ static int tdp_mmu_split_huge_pages_root(struct kvm *kvm,
> goto retry;
>
> sp = NULL;
> + *split = true;
> }
>
> rcu_read_unlock();
This looks more reasonable for tdp_mmu_split_huge_pages_root();
Given that splitting only adds a new page to the paging structure (unlike page
merging), I can't think of any current use cases that would be broken
by the lack of TLB flush before tdp_mmu_iter_cond_resched() releases the
mmu_lock.
This is because:
1) if the split is triggered in a fault path, the hardware shouldn't have cached
the old huge translation.
2) if the split is triggered in a zap or convert path,
- there shouldn't be concurrent faults on the range due to the protection of
mmu_invalidate_range*.
- for concurrent splits on the same range, though the other vCPUs may
temporarily see stale huge TLB entries after they believe they have
performed a split, they will be kicked off to flush the cache soon after
tdp_mmu_split_huge_pages_root() returns in the first vCPU/host thread.
This should be acceptable since I don't see any special guest needs that
rely on pure splits.
So I tend to agree with your suggestion though the implementation in this patch
is safer.
> Btw, I have to follow up on this next week, since tomorrow is a public holiday.
NP.
^ permalink raw reply [flat|nested] 129+ messages in thread
* Re: [RFC PATCH v2 12/23] KVM: x86/mmu: Introduce kvm_split_cross_boundary_leafs()
2025-11-14 6:09 ` Yan Zhao
@ 2025-11-18 0:14 ` Huang, Kai
2025-11-18 6:30 ` Yan Zhao
0 siblings, 1 reply; 129+ messages in thread
From: Huang, Kai @ 2025-11-18 0:14 UTC (permalink / raw)
To: Zhao, Yan Y
Cc: Du, Fan, Li, Xiaoyao, kvm@vger.kernel.org, Hansen, Dave,
david@redhat.com, thomas.lendacky@amd.com, vbabka@suse.cz,
tabba@google.com, linux-kernel@vger.kernel.org, seanjc@google.com,
binbin.wu@linux.intel.com, kas@kernel.org, pbonzini@redhat.com,
ackerleytng@google.com, michael.roth@amd.com, Weiny, Ira,
Peng, Chao P, Yamahata, Isaku, Annapurve, Vishal,
Edgecombe, Rick P, Miao, Jun, x86@kernel.org, pgonda@google.com
On Fri, 2025-11-14 at 14:09 +0800, Yan Zhao wrote:
> On Thu, Nov 13, 2025 at 07:02:59PM +0800, Huang, Kai wrote:
> > On Thu, 2025-11-13 at 16:54 +0800, Yan Zhao wrote:
> > > On Tue, Nov 11, 2025 at 06:42:55PM +0800, Huang, Kai wrote:
> > > > On Thu, 2025-08-07 at 17:43 +0800, Yan Zhao wrote:
> > > > > static int tdp_mmu_split_huge_pages_root(struct kvm *kvm,
> > > > > struct kvm_mmu_page *root,
> > > > > gfn_t start, gfn_t end,
> > > > > - int target_level, bool shared)
> > > > > + int target_level, bool shared,
> > > > > + bool only_cross_bounday, bool *flush)
> > > > > {
> > > > > struct kvm_mmu_page *sp = NULL;
> > > > > struct tdp_iter iter;
> > > > > @@ -1589,6 +1596,13 @@ static int tdp_mmu_split_huge_pages_root(struct kvm *kvm,
> > > > > * level into one lower level. For example, if we encounter a 1GB page
> > > > > * we split it into 512 2MB pages.
> > > > > *
> > > > > + * When only_cross_bounday is true, just split huge pages above the
> > > > > + * target level into one lower level if the huge pages cross the start
> > > > > + * or end boundary.
> > > > > + *
> > > > > + * No need to update @flush for !only_cross_bounday cases, which rely
> > > > > + * on the callers to do the TLB flush in the end.
> > > > > + *
> > > >
> > > > s/only_cross_bounday/only_cross_boundary
> > > >
> > > > From tdp_mmu_split_huge_pages_root()'s perspective, it's quite odd to only
> > > > update 'flush' when 'only_cross_bounday' is true, because
> > > > 'only_cross_bounday' can only result in less splitting.
> > > I have to say it's a reasonable point.
> > >
> > > > I understand this is because splitting S-EPT mapping needs flush (at least
> > > > before non-block DEMOTE is implemented?). Would it be better to also let the
> > > Actually the flush is only required for !TDX cases.
> > >
> > > For TDX, either the flush has been performed internally within
> > > tdx_sept_split_private_spt()
> > >
> >
> > AFAICT tdx_sept_split_private_spt() only does tdh_mem_track(), so KVM should
> > still kick all vCPUs out of guest mode so other vCPUs can actually flush the
> > TLB?
> tdx_sept_split_private_spt() actually invokes tdx_track(), which kicks all
> vCPUs out of guest mode by invoking
> "kvm_make_all_cpus_request(kvm, KVM_REQ_OUTSIDE_GUEST_MODE)".
Oh thanks for the reminder.
Then I am kinda confused about why you need to return @flush, especially when
'only_cross_boundary' is true, which is the TDX case.
So, stepping back to why this 'flush' needs to be returned:
- For TDX ('only_cross_boundary == true'):
The caller doesn't need to flush TLB because it has already been done when huge
page is actually split.
- For non-TDX case ('only_cross_boundary == false'):
AFAICT the only user of tdp_mmu_split_huge_pages_root() is "eager hugepage
splitting" during log-dirty. And per the current implementation there are
two callers of tdp_mmu_split_huge_pages_root():
kvm_mmu_try_split_huge_pages()
kvm_mmu_slot_try_split_huge_pages()
But they are both void functions which neither return whether a TLB flush is
needed nor do the TLB flush internally.
So I am kinda confused.
Perhaps you mean for "shared memory of TDX guest", the caller will also pass
'only_cross_boundary == true' and the caller needs to perform TLB flush?
[...]
> >
> > Something like below:
> >
> > @@ -1558,7 +1558,9 @@ static int tdp_mmu_split_huge_page(struct kvm *kvm, struct
> > tdp_iter *iter,
> > static int tdp_mmu_split_huge_pages_root(struct kvm *kvm,
> > struct kvm_mmu_page *root,
> > gfn_t start, gfn_t end,
> > - int target_level, bool shared)
> > + int target_level, bool shared,
> > + bool only_cross_boundary,
> > + bool *split)
> > {
> > struct kvm_mmu_page *sp = NULL;
> > struct tdp_iter iter;
> > @@ -1584,6 +1586,9 @@ static int tdp_mmu_split_huge_pages_root(struct kvm *kvm,
> > if (!is_shadow_present_pte(iter.old_spte) ||
> > !is_large_pte(iter.old_spte))
> > continue;
> >
> > + if (only_cross_boundary && !iter_cross_boundary(&iter, start,
> > end))
> > + continue;
> > +
> > if (!sp) {
> > rcu_read_unlock();
> >
> > @@ -1618,6 +1623,7 @@ static int tdp_mmu_split_huge_pages_root(struct kvm *kvm,
> > goto retry;
> >
> > sp = NULL;
> > + *split = true;
> > }
> >
> > rcu_read_unlock();
> This looks more reasonable for tdp_mmu_split_huge_pages_root();
>
> Given that splitting only adds a new page to the paging structure (unlike page
> merging), I currently can't think of any current use cases that would be broken
> by the lack of TLB flush before tdp_mmu_iter_cond_resched() releases the
> mmu_lock.
>
> This is because:
> 1) if the split is triggered in a fault path, the hardware shouldn't have cached
> the old huge translation.
> 2) if the split is triggered in a zap or convert path,
> - there shouldn't be concurrent faults on the range due to the protection of
> mmu_invalidate_range*.
> - for concurrent splits on the same range, though the other vCPUs may
> temporally see stale huge TLB entries after they believe they have
> performed a split, they will be kicked off to flush the cache soon after
> tdp_mmu_split_huge_pages_root() returns in the first vCPU/host thread.
> This should be acceptable since I don't see any special guest needs that
> rely on pure splits.
Perhaps we should just go straight to the point:
What does "hugepage split" do, and what's the consequence of not flushing TLB.
Per make_small_spte(), the new child PTEs will carry all bits of hugepage PTE
except they clear the 'hugepage bit (obviously)', and set the 'X' bit for NX
hugepage thing.
That means if we leave the stale hugepage TLB, the CPU is still able to find the
correct PFN and AFAICT there shouldn't be any other problem here. For any fault
due to the stale hugepage TLB missing the 'X' permission, AFAICT KVM will just
treat this as a spurious fault, which isn't nice but should have no harm.
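Roughly like below (a simplified sketch of that logic for a 2M -> 4K split; the
real make_small_spte() has more detail, e.g. the exact NX-hugepage condition):

        child_spte = huge_spte;
        /* Point the child at its 4K sub-page within the huge page. */
        child_spte |= (u64)index << PAGE_SHIFT;
        /* The children are leaf 4K SPTEs, so drop the huge-page bit... */
        child_spte &= ~PT_PAGE_SIZE_MASK;
        /* ...and (conditionally, in the real code) make it executable again
         * where the NX-hugepage mitigation had stripped X from the huge
         * mapping.
         */
        child_spte = make_spte_executable(child_spte);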
>
> So I tend to agree with your suggestion though the implementation in this patch
> is safer.
I am perhaps still missing something, as I am still trying to precisely
understand in what cases you want to flush TLB when splitting hugepage.
I kinda tend to think you eventually want to flush TLB because eventually you
want to _ZAP_. But needing to flush due to zap and needing to flush due to
split is kinda different I think.
^ permalink raw reply [flat|nested] 129+ messages in thread
* Re: [RFC PATCH v2 12/23] KVM: x86/mmu: Introduce kvm_split_cross_boundary_leafs()
2025-11-18 0:14 ` Huang, Kai
@ 2025-11-18 6:30 ` Yan Zhao
2025-11-18 8:59 ` Binbin Wu
2025-11-18 10:49 ` Huang, Kai
0 siblings, 2 replies; 129+ messages in thread
From: Yan Zhao @ 2025-11-18 6:30 UTC (permalink / raw)
To: Huang, Kai
Cc: Du, Fan, Li, Xiaoyao, kvm@vger.kernel.org, Hansen, Dave,
david@redhat.com, thomas.lendacky@amd.com, vbabka@suse.cz,
tabba@google.com, linux-kernel@vger.kernel.org, seanjc@google.com,
binbin.wu@linux.intel.com, kas@kernel.org, pbonzini@redhat.com,
ackerleytng@google.com, michael.roth@amd.com, Weiny, Ira,
Peng, Chao P, Yamahata, Isaku, Annapurve, Vishal,
Edgecombe, Rick P, Miao, Jun, x86@kernel.org, pgonda@google.com
On Tue, Nov 18, 2025 at 08:14:17AM +0800, Huang, Kai wrote:
> On Fri, 2025-11-14 at 14:09 +0800, Yan Zhao wrote:
> > On Thu, Nov 13, 2025 at 07:02:59PM +0800, Huang, Kai wrote:
> > > On Thu, 2025-11-13 at 16:54 +0800, Yan Zhao wrote:
> > > > On Tue, Nov 11, 2025 at 06:42:55PM +0800, Huang, Kai wrote:
> > > > > On Thu, 2025-08-07 at 17:43 +0800, Yan Zhao wrote:
> > > > > > static int tdp_mmu_split_huge_pages_root(struct kvm *kvm,
> > > > > > struct kvm_mmu_page *root,
> > > > > > gfn_t start, gfn_t end,
> > > > > > - int target_level, bool shared)
> > > > > > + int target_level, bool shared,
> > > > > > + bool only_cross_bounday, bool *flush)
> > > > > > {
> > > > > > struct kvm_mmu_page *sp = NULL;
> > > > > > struct tdp_iter iter;
> > > > > > @@ -1589,6 +1596,13 @@ static int tdp_mmu_split_huge_pages_root(struct kvm *kvm,
> > > > > > * level into one lower level. For example, if we encounter a 1GB page
> > > > > > * we split it into 512 2MB pages.
> > > > > > *
> > > > > > + * When only_cross_bounday is true, just split huge pages above the
> > > > > > + * target level into one lower level if the huge pages cross the start
> > > > > > + * or end boundary.
> > > > > > + *
> > > > > > + * No need to update @flush for !only_cross_bounday cases, which rely
> > > > > > + * on the callers to do the TLB flush in the end.
> > > > > > + *
> > > > >
> > > > > s/only_cross_bounday/only_cross_boundary
> > > > >
> > > > > From tdp_mmu_split_huge_pages_root()'s perspective, it's quite odd to only
> > > > > update 'flush' when 'only_cross_bounday' is true, because
> > > > > 'only_cross_bounday' can only result in less splitting.
> > > > I have to say it's a reasonable point.
> > > >
> > > > > I understand this is because splitting S-EPT mapping needs flush (at least
> > > > > before non-block DEMOTE is implemented?). Would it be better to also let the
> > > > Actually the flush is only required for !TDX cases.
> > > >
> > > > For TDX, either the flush has been performed internally within
> > > > tdx_sept_split_private_spt()
> > > >
> > >
> > > AFAICT tdx_sept_split_private_spt() only does tdh_mem_track(), so KVM should
> > > still kick all vCPUs out of guest mode so other vCPUs can actually flush the
> > > TLB?
> > tdx_sept_split_private_spt() actually invokes tdx_track(), which kicks all
> > vCPUs out of guest mode by invoking
> > "kvm_make_all_cpus_request(kvm, KVM_REQ_OUTSIDE_GUEST_MODE)".
>
> Oh thanks for the reminder.
>
> Then I am kinda confused about why you need to return @flush, especially when
> 'only_cross_boundary' is true, which is the TDX case.
> So, stepping back to why this 'flush' needs to be returned:
>
> - For TDX ('only_cross_boundary == true'):
>
> The caller doesn't need to flush TLB because it has already been done when huge
> page is actually split.
>
> - For non-TDX case ('only_cross_boundary == false'):
>
> AFAICT the only user of tdp_mmu_split_huge_pages_root() is "eager hugepage
> splitting" during log-dirty. And per the current implementation there are
> two callers of tdp_mmu_split_huge_pages_root():
>
> kvm_mmu_try_split_huge_pages()
> kvm_mmu_slot_try_split_huge_pages()
>
> But they are both void functions which neither return whether flush TLB is
> needed, nor do TLB flush internally.
Actually callers of the two void functions do the TLB flush unconditionally
in the end, i.e., in
kvm_mmu_slot_apply_flags(),
kvm_clear_dirty_log_protect(), and
kvm_get_dirty_log_protect().
> So I am kinda confused.
>
> Perhaps you mean for "shared memory of TDX guest", the caller will also pass
> 'only_cross_boundary == true' and the caller needs to perform TLB flush?
Sorry for the confusion.
Currently 'only_cross_boundary == true' is only for TDX private memory.
Returning flush is because kvm_split_cross_boundary_leafs() could potentially
be invoked for non-TDX cases as well in the future (though currently it's only
invoked for TDX). When that occurs, it's better to return flush so that the
caller doesn't have to flush unconditionally.
Another reason is to keep consistency with tdp_mmu_zap_leafs(), which returns
flush without differentiating whether the zap is for a mirror root or not. So,
though kvm_mmu_remote_flush() on the mirror root is not necessary, it's
intentionally left for a future optimization.
> [...]
>
> > >
> > > Something like below:
> > >
> > > @@ -1558,7 +1558,9 @@ static int tdp_mmu_split_huge_page(struct kvm *kvm, struct
> > > tdp_iter *iter,
> > > static int tdp_mmu_split_huge_pages_root(struct kvm *kvm,
> > > struct kvm_mmu_page *root,
> > > gfn_t start, gfn_t end,
> > > - int target_level, bool shared)
> > > + int target_level, bool shared,
> > > + bool only_cross_boundary,
> > > + bool *split)
> > > {
> > > struct kvm_mmu_page *sp = NULL;
> > > struct tdp_iter iter;
> > > @@ -1584,6 +1586,9 @@ static int tdp_mmu_split_huge_pages_root(struct kvm *kvm,
> > > if (!is_shadow_present_pte(iter.old_spte) ||
> > > !is_large_pte(iter.old_spte))
> > > continue;
> > >
> > > + if (only_cross_boundary && !iter_cross_boundary(&iter, start,
> > > end))
> > > + continue;
> > > +
> > > if (!sp) {
> > > rcu_read_unlock();
> > >
> > > @@ -1618,6 +1623,7 @@ static int tdp_mmu_split_huge_pages_root(struct kvm *kvm,
> > > goto retry;
> > >
> > > sp = NULL;
> > > + *split = true;
> > > }
> > >
> > > rcu_read_unlock();
> > This looks more reasonable for tdp_mmu_split_huge_pages_root();
> >
> > Given that splitting only adds a new page to the paging structure (unlike page
> > merging), I currently can't think of any current use cases that would be broken
> > by the lack of TLB flush before tdp_mmu_iter_cond_resched() releases the
> > mmu_lock.
> >
> > This is because:
> > 1) if the split is triggered in a fault path, the hardware shouldn't have cached
> > the old huge translation.
> > 2) if the split is triggered in a zap or convert path,
> > - there shouldn't be concurrent faults on the range due to the protection of
> > mmu_invalidate_range*.
> > - for concurrent splits on the same range, though the other vCPUs may
> > temporally see stale huge TLB entries after they believe they have
> > performed a split, they will be kicked off to flush the cache soon after
> > tdp_mmu_split_huge_pages_root() returns in the first vCPU/host thread.
> > This should be acceptable since I don't see any special guest needs that
> > rely on pure splits.
>
> Perhaps we should just go straight to the point:
>
> What does "hugepage split" do, and what's the consequence of not flushing TLB.
>
> Per make_small_spte(), the new child PTEs will carry all bits of hugepage PTE
> except they clear the 'hugepage bit (obviously)', and set the 'X' bit for NX
> hugepage thing.
>
> That means if we leave the stale hugepage TLB, the CPU is still able to find the
> correct PFN and AFAICT there shouldn't be any other problem here. For any fault
> due to the stale hugepage TLB missing the 'X' permission, AFAICT KVM will just
> treat this as a spurious fault, which isn't nice but should have no harm.
Right, that isn't nice, though no harm.
Besides, I'm thinking of a scenario which does not currently exist, though.
CPU 0                                 CPU 1
a1. split pages
a2. write protect pages
                                      b1. split pages
                                      b2. write protect pages
                                      b3. start dirty page tracking
a3. flush TLB
a4. start dirty page tracking
If CPU 1 does not flush TLB after b2 (e.g., because it finds the pages have been
split and write protected by a1&a2), it will miss some dirty pages.
Currently CPU 1 always flushes TLB before b3 unconditionally, so there's no
problem.
> > So I tend to agree with your suggestion though the implementation in this patch
> > is safer.
>
> I am perhaps still missing something, as I am still trying to precisely
> understand in what cases you want to flush TLB when splitting hugepage.
>
> I kinda tend to think you eventually want to flush TLB because eventually you
> want to _ZAP_. But needing to flush due to zap and needing to flush due to
> split is kinda different I think.
Though I currently can't find any use cases that depend on split alone, e.g.
if there's any feature requiring that the pages be 4KB without any additional
permission changes, I just wanted to make the code safer in case I missed any
edge cases.
We surely don't want the window in which CPUs see a mix of huge pages and small
pages to last long.
Flushing TLB before releasing the mmu_lock allows other threads operating on the
same range to see updated translations promptly.
^ permalink raw reply [flat|nested] 129+ messages in thread
* Re: [RFC PATCH v2 12/23] KVM: x86/mmu: Introduce kvm_split_cross_boundary_leafs()
2025-11-18 6:30 ` Yan Zhao
@ 2025-11-18 8:59 ` Binbin Wu
2025-11-18 10:49 ` Huang, Kai
1 sibling, 0 replies; 129+ messages in thread
From: Binbin Wu @ 2025-11-18 8:59 UTC (permalink / raw)
To: Yan Zhao, Huang, Kai
Cc: Du, Fan, Li, Xiaoyao, kvm@vger.kernel.org, Hansen, Dave,
david@redhat.com, thomas.lendacky@amd.com, vbabka@suse.cz,
tabba@google.com, linux-kernel@vger.kernel.org, seanjc@google.com,
kas@kernel.org, pbonzini@redhat.com, ackerleytng@google.com,
michael.roth@amd.com, Weiny, Ira, Peng, Chao P, Yamahata, Isaku,
Annapurve, Vishal, Edgecombe, Rick P, Miao, Jun, x86@kernel.org,
pgonda@google.com
On 11/18/2025 2:30 PM, Yan Zhao wrote:
[...]
>>>> Something like below:
>>>>
>>>> @@ -1558,7 +1558,9 @@ static int tdp_mmu_split_huge_page(struct kvm *kvm, struct
>>>> tdp_iter *iter,
>>>> static int tdp_mmu_split_huge_pages_root(struct kvm *kvm,
>>>> struct kvm_mmu_page *root,
>>>> gfn_t start, gfn_t end,
>>>> - int target_level, bool shared)
>>>> + int target_level, bool shared,
>>>> + bool only_cross_boundary,
>>>> + bool *split)
>>>> {
>>>> struct kvm_mmu_page *sp = NULL;
>>>> struct tdp_iter iter;
>>>> @@ -1584,6 +1586,9 @@ static int tdp_mmu_split_huge_pages_root(struct kvm *kvm,
>>>> if (!is_shadow_present_pte(iter.old_spte) ||
>>>> !is_large_pte(iter.old_spte))
>>>> continue;
>>>>
>>>> + if (only_cross_boundary && !iter_cross_boundary(&iter, start,
>>>> end))
>>>> + continue;
>>>> +
>>>> if (!sp) {
>>>> rcu_read_unlock();
>>>>
>>>> @@ -1618,6 +1623,7 @@ static int tdp_mmu_split_huge_pages_root(struct kvm *kvm,
>>>> goto retry;
>>>>
>>>> sp = NULL;
>>>> + *split = true;
>>>> }
>>>>
>>>> rcu_read_unlock();
>>> This looks more reasonable for tdp_mmu_split_huge_pages_root();
>>>
>>> Given that splitting only adds a new page to the paging structure (unlike page
>>> merging), I currently can't think of any current use cases that would be broken
>>> by the lack of TLB flush before tdp_mmu_iter_cond_resched() releases the
>>> mmu_lock.
>>>
>>> This is because:
>>> 1) if the split is triggered in a fault path, the hardware shouldn't have cached
>>> the old huge translation.
>>> 2) if the split is triggered in a zap or convert path,
>>> - there shouldn't be concurrent faults on the range due to the protection of
>>> mmu_invalidate_range*.
>>> - for concurrent splits on the same range, though the other vCPUs may
>>> temporally see stale huge TLB entries after they believe they have
>>> performed a split, they will be kicked off to flush the cache soon after
>>> tdp_mmu_split_huge_pages_root() returns in the first vCPU/host thread.
>>> This should be acceptable since I don't see any special guest needs that
>>> rely on pure splits.
>> Perhaps we should just go straight to the point:
>>
>> What does "hugepage split" do, and what's the consequence of not flushing TLB.
>>
>> Per make_small_spte(), the new child PTEs will carry all bits of hugepage PTE
>> except they clear the 'hugepage bit (obviously)', and set the 'X' bit for NX
>> hugepage thing.
>>
>> That means if we leave the stale hugepage TLB, the CPU is still able to find the
>> correct PFN and AFAICT there shouldn't be any other problem here.
The comments in tdp_mmu_split_huge_page() echo this.
/*
* Replace the huge spte with a pointer to the populated lower level
* page table. Since we are making this change without a TLB flush vCPUs
* will see a mix of the split mappings and the original huge mapping,
* depending on what's currently in their TLB. This is fine from a
* correctness standpoint since the translation will be the same either
* way.
*/
>> For any fault
>> due to the stale hugepage TLB missing the 'X' permission, AFAICT KVM will just
>> treat this as a spurious fault, which isn't nice but should have no harm.
> Right, that isn't nice, though no harm.
According to SDM "Operations that Invalidate Cached Mappings":
The following operations invalidate cached mappings as indicated:
- ...
- An EPT violation invalidates any guest-physical mappings (associated with
the current EPTRTA) that would be used to translate the guest-physical
address that caused the EPT violation. If that guest-physical address was
the translation of a linear address, the EPT violation also invalidates any
combined mappings for that linear address associated with the current PCID,
the current VPID and the current EPTRTA.
- ...
If other CPUs have the stale hugepage TLB entry, there may be one spurious
fault each.
Agree that it's not nice, but no harm.
>
> Besides, I'm thinking of a scenario which is not currently existing though.
>
> CPU 0 CPU 1
> a1. split pages
> a2. write protect pages
> b1. split pages
> b2. write protect pages
> b3. start dirty page tracking
> a3. flush TLB
> a4. start dirty page tracking
>
>
> If CPU 1 does not flush TLB after b2 (e.g., due to it finds the pages have been
> split and write protected by a1&a2), it will miss some dirty pages.
>
> Currently CPU 1 always flush TLB before b3 unconditionally, so there's no
> problem.
Yes, for this write protection case, the TLB should be flushed.
And currently, all (indirect) callers of tdp_mmu_split_huge_pages_root() do the
TLB flush unconditionally.
For the TDX case, the TLB flush is done via the hook of secure EPT hugepage
split.
It seems that the callers can decide whether the TLB flush is needed or not
based on the actions that follow the hugepage split, maybe with the info of
whether a split actually occurred or not.
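E.g., a conversion/punch-hole caller could fold the split's flush into the
zap's flush, roughly like below (just a sketch; the call sites and the 'range'
setup are assumptions, and the split helper's 0/1/<0 return contract is the one
posted in this patch):

        int r = kvm_split_cross_boundary_leafs(kvm, &range, true);

        if (r < 0)
                return r;
        /* r == 1: the split wants a flush; fold it into the zap's flush. */
        if (kvm_unmap_gfn_range(kvm, &range) || r)
                kvm_flush_remote_tlbs(kvm);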
>
>>> So I tend to agree with your suggestion though the implementation in this patch
>>> is safer.
>> I am perhaps still missing something, as I am still trying to precisely
>> understand in what cases you want to flush TLB when splitting hugepage.
>>
>> I kinda tend to think you eventually want to flush TLB because eventually you
>> want to _ZAP_. But needing to flush due to zap and needing to flush due to
>> split is kinda different I think.
> Though I currently couldn't find any use cases that depend on split alone, e.g.
> if there's any feature requiring the pages must be 4KB without any additional
> permission changes, I just wanted to make the code safer in case I missed any
> edge cases.
>
> We surely don't want the window for CPUs to see huge pages and small pages lasts
> long.
>
> Flushing TLB before releasing the mmu_lock allows other threads operating on the
> same range to see updated translations timely.
^ permalink raw reply [flat|nested] 129+ messages in thread
* Re: [RFC PATCH v2 12/23] KVM: x86/mmu: Introduce kvm_split_cross_boundary_leafs()
2025-11-18 6:30 ` Yan Zhao
2025-11-18 8:59 ` Binbin Wu
@ 2025-11-18 10:49 ` Huang, Kai
2025-11-19 3:41 ` Yan Zhao
2025-11-19 6:23 ` Yan Zhao
1 sibling, 2 replies; 129+ messages in thread
From: Huang, Kai @ 2025-11-18 10:49 UTC (permalink / raw)
To: Zhao, Yan Y
Cc: Du, Fan, Li, Xiaoyao, kvm@vger.kernel.org, Hansen, Dave,
david@redhat.com, thomas.lendacky@amd.com, tabba@google.com,
vbabka@suse.cz, kas@kernel.org, linux-kernel@vger.kernel.org,
seanjc@google.com, pbonzini@redhat.com, binbin.wu@linux.intel.com,
ackerleytng@google.com, michael.roth@amd.com, Weiny, Ira,
Peng, Chao P, Yamahata, Isaku, Annapurve, Vishal,
Edgecombe, Rick P, Miao, Jun, x86@kernel.org, pgonda@google.com
> >
> > - For non-TDX case ('only_cross_boundary == false'):
> >
> > AFAICT the only user of tdp_mmu_split_huge_pages_root() is "eager hugepage
> > splitting" during log-dirty. And per the current implementation there are
> > two callers of tdp_mmu_split_huge_pages_root():
> >
> > kvm_mmu_try_split_huge_pages()
> > kvm_mmu_slot_try_split_huge_pages()
> >
> > But they are both void functions which neither return whether flush TLB is
> > needed, nor do TLB flush internally.
> Actually callers of the two void functions do the TLB flush unconditionally
> in the end, i.e, in
> kvm_mmu_slot_apply_flags(),
> kvm_clear_dirty_log_protect(), and
> kvm_get_dirty_log_protect()).
Yeah, I didn't call this out.
>
> > So I am kinda confused.
> >
> > Perhaps you mean for "shared memory of TDX guest", the caller will also pass
> > 'only_cross_boundary == true' and the caller needs to perform TLB flush?
> Sorry for the confusion.
>
> Currently 'only_cross_boundary == true' is only for TDX private memory.
>
> Returning flush is because kvm_split_cross_boundary_leafs() is potentially
> possible to be invoked for non-TDX cases as well in future (though currently
> it's only invoked for TDX alone). When that occurs, it's better to return flush
> to avoid the caller having to do flush unconditionally.
Exactly what "future" cases are you referring to?
Why do we need to consider it *NOW*?
>
> Another reason is to keep consistency with tdp_mmu_zap_leafs(), which returns
> > flush without differentiating whether the zap is for a mirror root or not. So,
> though kvm_mmu_remote_flush() on mirror root is not necessary, it's
> intentionally left for future optimization.
You mean non-blocking DEMOTE won't need to flush TLB internally when splitting
but the caller needs to do the flush?
Anyway, none of the above is mentioned in the changelog. I think we need a
clear explanation in the changelog to justify the change.
>
> > [...]
> >
> > > >
> > > > Something like below:
> > > >
> > > > @@ -1558,7 +1558,9 @@ static int tdp_mmu_split_huge_page(struct kvm *kvm, struct
> > > > tdp_iter *iter,
> > > > static int tdp_mmu_split_huge_pages_root(struct kvm *kvm,
> > > > struct kvm_mmu_page *root,
> > > > gfn_t start, gfn_t end,
> > > > - int target_level, bool shared)
> > > > + int target_level, bool shared,
> > > > + bool only_cross_boundary,
> > > > + bool *split)
> > > > {
> > > > struct kvm_mmu_page *sp = NULL;
> > > > struct tdp_iter iter;
> > > > @@ -1584,6 +1586,9 @@ static int tdp_mmu_split_huge_pages_root(struct kvm *kvm,
> > > > if (!is_shadow_present_pte(iter.old_spte) ||
> > > > !is_large_pte(iter.old_spte))
> > > > continue;
> > > >
> > > > + if (only_cross_boundary && !iter_cross_boundary(&iter, start,
> > > > end))
> > > > + continue;
> > > > +
> > > > if (!sp) {
> > > > rcu_read_unlock();
> > > >
> > > > @@ -1618,6 +1623,7 @@ static int tdp_mmu_split_huge_pages_root(struct kvm *kvm,
> > > > goto retry;
> > > >
> > > > sp = NULL;
> > > > + *split = true;
> > > > }
> > > >
> > > > rcu_read_unlock();
> > > This looks more reasonable for tdp_mmu_split_huge_pages_root();
> > >
> > > Given that splitting only adds a new page to the paging structure (unlike page
> > > merging), I currently can't think of any current use cases that would be broken
> > > by the lack of TLB flush before tdp_mmu_iter_cond_resched() releases the
> > > mmu_lock.
> > >
> > > This is because:
> > > 1) if the split is triggered in a fault path, the hardware shouldn't have cached
> > > the old huge translation.
> > > 2) if the split is triggered in a zap or convert path,
> > > - there shouldn't be concurrent faults on the range due to the protection of
> > > mmu_invalidate_range*.
> > > - for concurrent splits on the same range, though the other vCPUs may
> > > temporally see stale huge TLB entries after they believe they have
> > > performed a split, they will be kicked off to flush the cache soon after
> > > tdp_mmu_split_huge_pages_root() returns in the first vCPU/host thread.
> > > This should be acceptable since I don't see any special guest needs that
> > > rely on pure splits.
> >
> > Perhaps we should just go straight to the point:
> >
> > What does "hugepage split" do, and what's the consequence of not flushing TLB.
> >
> > Per make_small_spte(), the new child PTEs will carry all bits of hugepage PTE
> > except they clear the 'hugepage bit (obviously)', and set the 'X' bit for NX
> > hugepage thing.
> >
> > That means if we leave the stale hugepage TLB, the CPU is still able to find the
> > correct PFN and AFAICT there shouldn't be any other problem here. For any fault
> > due to the stale hugepage TLB missing the 'X' permission, AFAICT KVM will just
> > treat this as a spurious fault, which isn't nice but should have no harm.
> Right, that isn't nice, though no harm.
>
> Besides, I'm thinking of a scenario which is not currently existing though.
>
> CPU 0 CPU 1
> a1. split pages
> a2. write protect pages
> b1. split pages
> b2. write protect pages
> b3. start dirty page tracking
> a3. flush TLB
> a4. start dirty page tracking
>
>
> If CPU 1 does not flush TLB after b2 (e.g., due to it finds the pages have been
> split and write protected by a1&a2), it will miss some dirty pages.
Do you have any actual concrete plan that foresees this happening in the
future? E.g., why would CPU1 want to skip the TLB flush after b2 due to a1&a2, etc.?
To be honest I don't think we should discuss those hypothetical problems.
>
> Currently CPU 1 always flush TLB before b3 unconditionally, so there's no
> problem.
>
> > > So I tend to agree with your suggestion though the implementation in this patch
> > > is safer.
> >
> > I am perhaps still missing something, as I am still trying to precisely
> > understand in what cases you want to flush TLB when splitting hugepage.
> >
> > I kinda tend to think you eventually want to flush TLB because eventually you
> > want to _ZAP_. But needing to flush due to zap and needing to flush due to
> > split is kinda different I think.
>
> Though I currently couldn't find any use cases that depend on split alone, e.g.
> if there's any feature requiring the pages must be 4KB without any additional
> permission changes, I just wanted to make the code safer in case I missed any
> edge cases.
>
> We surely don't want the window for CPUs to see huge pages and small pages lasts
> long.
>
> Flushing TLB before releasing the mmu_lock allows other threads operating on the
> same range to see updated translations timely.
In the upstream code most callers of tdp_mmu_iter_cond_resched() call it w/o
flushing TLB when yield happens, so the "window of stale TLB" already exists --
it's just not stale hugepage TLBs, but other stale TLBs.
But I agree it's not good to have stale TLBs, and looking at
recover_huge_pages_range(), it also does a TLB flush when yielding if a
hugepage merge has already happened.
So if you want to make tdp_mmu_split_huge_pages_root() handle TLB flush, perhaps
we can make it like recover_huge_pages_range(). But AFAICT we also want to make
tdp_mmu_split_huge_pages_root() return whether flush is needed, but not actually
perform TLB flush for non-yielding case, because otherwise we need to revisit
the log-dirty code to avoid duplicated TLB flush.
And then the 'only_cross_boundary' can be added to it.
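I.e., roughly the below pattern inside the iteration (a sketch modeled on
recover_huge_pages_range(), untested):

                if (tdp_mmu_iter_cond_resched(kvm, &iter, flush, shared)) {
                        /* The helper flushed (if needed) before yielding. */
                        flush = false;
                        continue;
                }

                /* ... split the huge page ... */
                flush = true;
        }
        /*
         * Report the still-pending flush back to the caller (e.g. via an out
         * parameter) rather than flushing here, so the log-dirty paths don't
         * end up flushing twice.
         */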
Btw, a second thought on the 'only_cross_boundary':
My first impression of 'only_cross_boundary' was that it's a little bit odd,
because you actually only need to split the hugepages where 'start' or 'end'
falls in the middle of a hugepage.
So alternatively, instead of yet adding another 'only_cross_boundary' to
tdp_mmu_split_huge_pages_root(), I think we can also make the caller check the
range and only call tdp_mmu_split_huge_pages_root() when the range crosses the
hugepage boundary?
E.g., for a range [1G, 2G), it doesn't cross any 2M boundary, thus the caller
can skip calling tdp_mmu_split_huge_pages_root(). If the range is [1G + 1M,
2G), then the caller can know only the first [1G, 1G + 2M) needs splitting.
This also saves an unnecessary iter walk for the rest of the range [1G + 2M, 2G).
I think if we only consider 2M hugepages but not 1G pages, then it should not be
that complicated to check the range and only call
tdp_mmu_split_huge_pages_root() for the range that truly needs splitting?
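E.g., roughly like below for the 2M-only case (a rough sketch with assumed call
sites, error handling ignored):

        gfn_t huge_pages = KVM_PAGES_PER_HPAGE(PG_LEVEL_2M);

        /* Only the partially covered huge pages at the two edges of
         * [start, end) can need splitting; the fully covered middle part
         * will be zapped as-is anyway.
         */
        if (!IS_ALIGNED(start, huge_pages))
                tdp_mmu_split_huge_pages_root(kvm, root,
                                              round_down(start, huge_pages),
                                              round_up(start, huge_pages),
                                              PG_LEVEL_4K, shared);
        if (!IS_ALIGNED(end, huge_pages))
                tdp_mmu_split_huge_pages_root(kvm, root,
                                              round_down(end, huge_pages),
                                              round_up(end, huge_pages),
                                              PG_LEVEL_4K, shared);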
^ permalink raw reply [flat|nested] 129+ messages in thread
* Re: [RFC PATCH v2 12/23] KVM: x86/mmu: Introduce kvm_split_cross_boundary_leafs()
2025-11-18 10:49 ` Huang, Kai
@ 2025-11-19 3:41 ` Yan Zhao
2026-01-06 10:35 ` Yan Zhao
2025-11-19 6:23 ` Yan Zhao
1 sibling, 1 reply; 129+ messages in thread
From: Yan Zhao @ 2025-11-19 3:41 UTC (permalink / raw)
To: Huang, Kai
Cc: Du, Fan, Li, Xiaoyao, kvm@vger.kernel.org, Hansen, Dave,
david@redhat.com, thomas.lendacky@amd.com, tabba@google.com,
vbabka@suse.cz, kas@kernel.org, linux-kernel@vger.kernel.org,
seanjc@google.com, pbonzini@redhat.com, binbin.wu@linux.intel.com,
ackerleytng@google.com, michael.roth@amd.com, Weiny, Ira,
Peng, Chao P, Yamahata, Isaku, Annapurve, Vishal,
Edgecombe, Rick P, Miao, Jun, x86@kernel.org, pgonda@google.com
Hi Kai and all,
Let me summarize my points clearly in advance:
(I guess I failed to do it explicitly in my previous mails [1][2]).
- I agree with Kai's suggestion to return a "bool *split" to callers of
kvm_split_cross_boundary_leafs(). The callers can choose to do TLB flush or
not, since we don't want them to do TLB flush unconditionally. (see the "Note"
below).
- I think it's OK to skip TLB flush before tdp_mmu_iter_cond_resched() releases
the mmu_lock in tdp_mmu_split_huge_pages_root(), as there's no known use case
impacted up to now, according to the analysis in [1].
- Invoking kvm_flush_remote_tlbs() for tdp_mmu_split_huge_pages_root() in this
series is for
a) code completeness.
kvm_split_cross_boundary_leafs() does not force that the root must be a
mirror root.
TDX alone doesn't require invoking kvm_flush_remote_tlbs() as it's done
implicitly in tdx_sept_split_private_spt(). TDX shared memory also does not
invoke kvm_split_cross_boundary_leafs().
b) code consistency.
kvm_unmap_gfn_range() also returns flush for callers to invoke
kvm_flush_remote_tlbs(), even when the range is of KVM_FILTER_PRIVATE
alone.
I'll update the patch with proper comments to explain the above points if you
agree.
Thanks
Yan
Note:
Currently there are 3 callers of kvm_split_cross_boundary_leafs():
1) tdx_check_accept_level(), which actually has no need to invoke
kvm_flush_remote_tlbs() since it splits mirror root only.
2) kvm_arch_pre_set_memory_attributes(), which can combine the flush together
with the TLB flush due to kvm_unmap_gfn_range().
3) kvm_gmem_split_private(), which is invoked by gmem punch_hole and gmem
conversion from private to shared. The caller can choose to do TLB flush
separately or together with kvm_gmem_zap() later.
[1] https://lore.kernel.org/all/aRbHtnMcoqM1gmL9@yzhao56-desk.sh.intel.com
[2] https://lore.kernel.org/all/aRwSkc10XQqY8RfE@yzhao56-desk.sh.intel.com
On Tue, Nov 18, 2025 at 06:49:31PM +0800, Huang, Kai wrote:
> > >
Will reply to the rest of your mail separately later.
^ permalink raw reply [flat|nested] 129+ messages in thread
* Re: [RFC PATCH v2 12/23] KVM: x86/mmu: Introduce kvm_split_cross_boundary_leafs()
2025-11-19 3:41 ` Yan Zhao
@ 2026-01-06 10:35 ` Yan Zhao
0 siblings, 0 replies; 129+ messages in thread
From: Yan Zhao @ 2026-01-06 10:35 UTC (permalink / raw)
To: Huang, Kai
Cc: Du, Fan, Li, Xiaoyao, kvm@vger.kernel.org, Hansen, Dave,
david@redhat.com, thomas.lendacky@amd.com, tabba@google.com,
vbabka@suse.cz, kas@kernel.org, linux-kernel@vger.kernel.org,
seanjc@google.com, pbonzini@redhat.com, binbin.wu@linux.intel.com,
ackerleytng@google.com, michael.roth@amd.com, Weiny, Ira,
Peng, Chao P, Yamahata, Isaku, Annapurve, Vishal,
Edgecombe, Rick P, Miao, Jun, x86@kernel.org, pgonda@google.com
On Wed, Nov 19, 2025 at 11:41:51AM +0800, Yan Zhao wrote:
> Hi Kai and all,
>
> Let me summarize my points clearly in advance:
> (I guess I failed to do it explicitly in my previous mails [1][2]).
>
> - I agree with Kai's suggestion to return a "bool *split" to callers of
> kvm_split_cross_boundary_leafs(). The callers can choose to do TLB flush or
> not, since we don't want them to do TLB flush unconditionally. (see the "Note"
> below).
Hi Kai,
Thanks for your review and bringing up the TLB flush issue!
After further thought, I finally chose not to return the split status in
kvm_split_cross_boundary_leafs(), because the split status is not accurate given
that we don't flush TLB before releasing mmu_lock in
tdp_mmu_split_huge_pages_root(). i.e., when the function returns split as false,
splits could still have occurred during the temporary release of mmu_lock.
So, I implemented the API like this:
(1) Do not return split status in kvm_split_cross_boundary_leafs().
(2) Let the caller decide whether and how to flush TLB according to the use
cases. e.g.,
- if it's for dirty tracking (e.g., splits before turning on PML),
unconditionally flush TLB.
- if it's in the fault path, e.g., tdx_check_accept_level(). No TLB flush is
required (current TDX's tdx_track() also ensures no need for a separate
flush).
- if it's for gmem punch hole or page conversions, the callers can delay the
TLB flush for splits and combine it with the flush for zaps.
I've posted this implementation in v3
https://lore.kernel.org/all/20260106101646.24809-1-yan.y.zhao@intel.com.
Please let me know if it doesn't look good.
Thanks
Yan
> - I think it's OK to skip TLB flush before tdp_mmu_iter_cond_resched() releases
> the mmu_lock in tdp_mmu_split_huge_pages_root(), as there's no known use case
> impacted up to now, according to the analysis in [1].
>
> - Invoke kvm_flush_remote_tlbs() for tdp_mmu_split_huge_pages_root() in this
> series is for
> a) code completeness.
> kvm_split_cross_boundary_leafs() does not force that the root must be a
> mirror root.
>
> TDX alone doesn't require invoking kvm_flush_remote_tlbs() as it's done
> implicitly in tdx_sept_split_private_spt(). TDX share memory also does not
> invoke kvm_split_cross_boundary_leafs().
>
> b) code consistency.
> kvm_unmap_gfn_range() also returns flush for callers to invoke
> kvm_flush_remote_tlbs(), even when the range is of KVM_FILTER_PRIVATE
> alone.
>
> I'll update the patch with proper comments to explain the above points if you
> are agreed.
>
> Thanks
> Yan
>
> Note:
> Currently there are 3 callers of kvm_split_cross_boundary_leafs():
> 1) tdx_check_accept_level(), which actually has no need to invoke
> kvm_flush_remote_tlbs() since it splits mirror root only.
>
> 2) kvm_arch_pre_set_memory_attributes(), which can combine the flush together
> with the TLB flush due to kvm_unmap_gfn_range().
>
> 3) kvm_gmem_split_private(), which is invoked by gmem punch_hole and gmem
> conversion from private to shared. The caller can choose to do TLB flush
> separately or together with kvm_gmem_zap() later.
>
>
> [1] https://lore.kernel.org/all/aRbHtnMcoqM1gmL9@yzhao56-desk.sh.intel.com
> [2] https://lore.kernel.org/all/aRwSkc10XQqY8RfE@yzhao56-desk.sh.intel.com
>
> On Tue, Nov 18, 2025 at 06:49:31PM +0800, Huang, Kai wrote:
> > > >
> Will reply to the rest of your mail separately later.
^ permalink raw reply [flat|nested] 129+ messages in thread
* Re: [RFC PATCH v2 12/23] KVM: x86/mmu: Introduce kvm_split_cross_boundary_leafs()
2025-11-18 10:49 ` Huang, Kai
2025-11-19 3:41 ` Yan Zhao
@ 2025-11-19 6:23 ` Yan Zhao
1 sibling, 0 replies; 129+ messages in thread
From: Yan Zhao @ 2025-11-19 6:23 UTC (permalink / raw)
To: Huang, Kai
Cc: Du, Fan, Li, Xiaoyao, kvm@vger.kernel.org, Hansen, Dave,
david@redhat.com, thomas.lendacky@amd.com, tabba@google.com,
vbabka@suse.cz, kas@kernel.org, linux-kernel@vger.kernel.org,
seanjc@google.com, pbonzini@redhat.com, binbin.wu@linux.intel.com,
ackerleytng@google.com, michael.roth@amd.com, Weiny, Ira,
Peng, Chao P, Yamahata, Isaku, Annapurve, Vishal,
Edgecombe, Rick P, Miao, Jun, x86@kernel.org, pgonda@google.com
On Tue, Nov 18, 2025 at 06:49:31PM +0800, Huang, Kai wrote:
> > > So I am kinda confused.
> > >
> > > Perhaps you mean for "shared memory of TDX guest", the caller will also pass
> > > 'only_cross_boundary == true' and the caller needs to perform TLB flush?
> > Sorry for the confusion.
> >
> > Currently 'only_cross_boundary == true' is only for TDX private memory.
> >
> > Returning flush is because kvm_split_cross_boundary_leafs() is potentially
> > possible to be invoked for non-TDX cases as well in future (though currently
> > it's only invoked for TDX alone). When that occurs, it's better to return flush
> > to avoid the caller having to do flush unconditionally.
>
> Exactly what "future" cases are you referring to?
>
> Why do we need to consider it *NOW*?
The API kvm_split_cross_boundary_leafs() does not force that the root must be
a mirror root. So, it's better to return "split" status and let the caller
decide whether to invoke kvm_flush_remote_tlbs(). This is for code completeness
and consistency as explained in [1].
[1] https://lore.kernel.org/all/aR08f%2Fn7j0RyGlUn@yzhao56-desk.sh.intel.com/
> >
> > Another reason is to keep consistency with tdp_mmu_zap_leafs(), which returns
> > flush without differentiating whether the zap is for a mirror root or not. So,
> > though kvm_mmu_remote_flush() on mirror root is not necessary, it's
> > intentionally left for future optimization.
>
> You mean non-blocking DEMOTE won't need to flush TLB internally when splitting
> but the caller needs to do the flush?
I mean, as an API implemented in KVM core, it's better not to have
kvm_split_cross_boundary_leafs() make any assumption on whether TDX
internally has performed the TLB flush or whether non-blocking DEMOTE needs a
flush.
We can return "split" and let callers decide whether to flush or not.
The optimization to avoid invoking kvm_flush_remote_tlbs() for zaps/splits on a
mirror root alone can be implemented in the future when necessary.
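E.g., conceptually something like below (a hypothetical sketch only; whether
such a check/helper is the right shape is part of that future decision):

        /* Hypothetical future optimization: the TDX module already tracked
         * and flushed the S-EPT translations via tdx_track(), so a zap/split
         * that only touched the mirror root wouldn't need another remote
         * flush from KVM core.
         */
        if (flush && !is_mirror_sp(root))
                kvm_flush_remote_tlbs(kvm);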
> Anyway, all of above are not mentioned in the changelog. I think we need a
> clear explanation in the changelog to justify the change.
Will do.
> > > [...]
> > >
> > > > >
> > > > > Something like below:
> > > > >
> > > > > @@ -1558,7 +1558,9 @@ static int tdp_mmu_split_huge_page(struct kvm *kvm, struct
> > > > > tdp_iter *iter,
> > > > > static int tdp_mmu_split_huge_pages_root(struct kvm *kvm,
> > > > > struct kvm_mmu_page *root,
> > > > > gfn_t start, gfn_t end,
> > > > > - int target_level, bool shared)
> > > > > + int target_level, bool shared,
> > > > > + bool only_cross_boundary,
> > > > > + bool *split)
> > > > > {
> > > > > struct kvm_mmu_page *sp = NULL;
> > > > > struct tdp_iter iter;
> > > > > @@ -1584,6 +1586,9 @@ static int tdp_mmu_split_huge_pages_root(struct kvm *kvm,
> > > > > if (!is_shadow_present_pte(iter.old_spte) ||
> > > > > !is_large_pte(iter.old_spte))
> > > > > continue;
> > > > >
> > > > > + if (only_cross_boundary && !iter_cross_boundary(&iter, start,
> > > > > end))
> > > > > + continue;
> > > > > +
> > > > > if (!sp) {
> > > > > rcu_read_unlock();
> > > > >
> > > > > @@ -1618,6 +1623,7 @@ static int tdp_mmu_split_huge_pages_root(struct kvm *kvm,
> > > > > goto retry;
> > > > >
> > > > > sp = NULL;
> > > > > + *split = true;
> > > > > }
> > > > >
> > > > > rcu_read_unlock();
> > > > This looks more reasonable for tdp_mmu_split_huge_pages_root();
> > > >
> > > > Given that splitting only adds a new page to the paging structure (unlike page
> > > > merging), I currently can't think of any current use cases that would be broken
> > > > by the lack of TLB flush before tdp_mmu_iter_cond_resched() releases the
> > > > mmu_lock.
> > > >
> > > > This is because:
> > > > 1) if the split is triggered in a fault path, the hardware shouldn't have cached
> > > > the old huge translation.
> > > > 2) if the split is triggered in a zap or convert path,
> > > > - there shouldn't be concurrent faults on the range due to the protection of
> > > > mmu_invalidate_range*.
> > > > - for concurrent splits on the same range, though the other vCPUs may
> > > > temporally see stale huge TLB entries after they believe they have
> > > > performed a split, they will be kicked off to flush the cache soon after
> > > > tdp_mmu_split_huge_pages_root() returns in the first vCPU/host thread.
> > > > This should be acceptable since I don't see any special guest needs that
> > > > rely on pure splits.
> > >
> > > Perhaps we should just go straight to the point:
> > >
> > > What does "hugepage split" do, and what's the consequence of not flushing TLB.
> > >
> > > Per make_small_spte(), the new child PTEs will carry all bits of hugepage PTE
> > > except they clear the 'hugepage bit (obviously)', and set the 'X' bit for NX
> > > hugepage thing.
> > >
> > > That means if we leave the stale hugepage TLB, the CPU is still able to find the
> > > correct PFN and AFAICT there shouldn't be any other problem here. For any fault
> > > due to the stale hugepage TLB missing the 'X' permission, AFAICT KVM will just
> > > treat this as a spurious fault, which isn't nice but should have no harm.
> > Right, that isn't nice, though no harm.
> >
> > Besides, I'm thinking of a scenario which is not currently existing though.
> >
> > CPU 0 CPU 1
> > a1. split pages
> > a2. write protect pages
> > b1. split pages
> > b2. write protect pages
> > b3. start dirty page tracking
> > a3. flush TLB
> > a4. start dirty page tracking
> >
> >
> > If CPU 1 does not flush TLB after b2 (e.g., due to it finds the pages have been
> > split and write protected by a1&a2), it will miss some dirty pages.
>
> Do you have any actual concrete plan to foresee this is likely to happen in the
> future? E.g., why CPU1 wants to skip TLB flush after b2 due to a1&a2 etc?
>
> To be honest I don't think we should discuss those hypothetical problems.
Sorry about the confusion.
I just wanted to express why I thought it's safer to do flush before releasing
mmu_lock. However, as I mentioned in the previous replies, I can't find any
current use cases impacted by skipping this flush. So I think it's ok not to
flush before releasing mmu_lock.
Will update the patch comment to explain why the skipping of flush is ok.
(I think the current upstream tdp_mmu_split_huge_pages_root() lacks a comment on why
it's safe not to do the flush before releasing mmu_lock).
> > Currently CPU 1 always flush TLB before b3 unconditionally, so there's no
> > problem.
> >
> > > > So I tend to agree with your suggestion though the implementation in this patch
> > > > is safer.
> > >
> > > I am perhaps still missing something, as I am still trying to precisely
> > > understand in what cases you want to flush TLB when splitting hugepage.
> > >
> > > I kinda tend to think you eventually want to flush TLB because eventually you
> > > want to _ZAP_. But needing to flush due to zap and needing to flush due to
> > > split is kinda different I think.
> >
> > Though I currently couldn't find any use cases that depend on split alone, e.g.
> > if there's any feature requiring the pages must be 4KB without any additional
> > permission changes, I just wanted to make the code safer in case I missed any
> > edge cases.
> >
> > We surely don't want the window for CPUs to see huge pages and small pages lasts
> > long.
> >
> > Flushing TLB before releasing the mmu_lock allows other threads operating on the
> > same range to see updated translations timely.
>
> In the upstream code most callers of tdp_mmu_iter_cond_resched() call it w/o
> flushing TLB when yield happens, so the "window of stale TLB" already exists --
> it's just not stale hugepage TLBs, but other stale TLBs.
>
> But I agree it's not good to have stale TLBs, and looking at
> recover_huge_pages_range(), it also does TLB flush when yielding if there's
> already hugepage merge happened.
>
> So if you want to make tdp_mmu_split_huge_pages_root() handle TLB flush, perhaps
> we can make it like recover_huge_pages_range(). But AFAICT we also want to make
> tdp_mmu_split_huge_pages_root() return whether flush is needed, but not actually
> perform TLB flush for non-yielding case, because otherwise we need to revisit
> the log-dirty code to avoid duplicated TLB flush.
>
> And then the 'only_cross_boundary' can be added to it.
>
> Btw, a second thought on the 'only_cross_boundary':
>
> My first glance of 'only_cross_boundary' was it's a little bit odd, because you
> actually only need to split the hugepage where 'start' and 'end' is in middle of
> a hugepage.
>
> So alternatively, instead of yet adding another 'only_cross_boundary' to
> tdp_mmu_split_huge_pages_root(), I think we can also make the caller check the
> range and only call tdp_mmu_split_huge_pages_root() when the range crosses the
> hugepage boundary?
I don't think it's good.
- The code to check if a range crosses a hugepage boundary is cumbersome and level
dependent. As you point out below, [1G, 1G + 2M) does not need splitting if
there are only 2M mappings, but it needs splitting when there are 1G mappings.
For an API implemented in KVM core, it's better not to assume there are no 1G
mappings. Not to mention TDX itself will support 1G in the future.
- When a range is determined as "truly needs splitting", e.g. [1G-8K, 1G+8K),
tdp_mmu_split_huge_pages_root() still needs to return 'split' status since
splitting may not occur due to no present mappings or no 2M mappings.
> E.g., for a range [1G, 2G), it's doesn't cross any 2M boundary, thus the caller
> can skip calling tdp_mmu_split_huge_pages_root(). If the range is [1G + 1M,
> 2G), then the caller can know only the first [1G, 1G + 2M) needs splitting.
> This also saves unnecessary iter walk for the rest range [1G + 2M, 2G).
>
> I think if we only consider 2M hugepage but not 1G page, then it should not be
> that complicated to check the range and only call
> tdp_mmu_split_huge_pages_root() for the range that is truly needs splitting?
^ permalink raw reply [flat|nested] 129+ messages in thread
* Re: [RFC PATCH v2 12/23] KVM: x86/mmu: Introduce kvm_split_cross_boundary_leafs()
2025-08-07 9:43 ` [RFC PATCH v2 12/23] KVM: x86/mmu: Introduce kvm_split_cross_boundary_leafs() Yan Zhao
2025-09-03 6:57 ` Binbin Wu
2025-11-11 10:42 ` Huang, Kai
@ 2025-11-19 6:31 ` Yan Zhao
2 siblings, 0 replies; 129+ messages in thread
From: Yan Zhao @ 2025-11-19 6:31 UTC (permalink / raw)
To: pbonzini, seanjc
Cc: linux-kernel, kvm, x86, rick.p.edgecombe, dave.hansen, kas, tabba,
ackerleytng, michael.roth, david, vannapurve, vbabka,
thomas.lendacky, pgonda, fan.du, jun.miao, ira.weiny,
isaku.yamahata, xiaoyao.li, binbin.wu, chao.p.peng
On Thu, Aug 07, 2025 at 05:43:58PM +0800, Yan Zhao wrote:
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 9182192daa3a..13910ae05f76 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -1647,6 +1647,33 @@ static bool __kvm_rmap_zap_gfn_range(struct kvm *kvm,
> start, end - 1, can_yield, true, flush);
> }
>
> +/*
> + * Split large leafs crossing the boundary of the specified range
> + *
> + * Return value:
> + * 0 : success, no flush is required;
> + * 1 : success, flush is required;
> + * <0: failure.
> + */
> +int kvm_split_cross_boundary_leafs(struct kvm *kvm, struct kvm_gfn_range *range,
> + bool shared)
> +{
> + bool ret = 0;
> +
> + lockdep_assert_once(kvm->mmu_invalidate_in_progress ||
> + lockdep_is_held(&kvm->slots_lock) ||
> + srcu_read_lock_held(&kvm->srcu));
> +
> + if (!range->may_block)
> + return -EOPNOTSUPP;
> +
> + if (tdp_mmu_enabled)
> + ret = kvm_tdp_mmu_gfn_range_split_cross_boundary_leafs(kvm, range, shared);
> +
> + return ret;
> +}
> +EXPORT_SYMBOL_GPL(kvm_split_cross_boundary_leafs);
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index fb79d2b7decd..6137b76341e1 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -273,6 +273,8 @@ struct kvm_gfn_range {
> bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range);
> bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
> bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
> +int kvm_split_cross_boundary_leafs(struct kvm *kvm, struct kvm_gfn_range *range,
> + bool shared);
Note to myself: Provide a default implementation for non-x86 platforms.
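For reference, a minimal sketch of what such a default could look like (the placement
and form are assumptions, e.g. a weak symbol overridden by the x86 implementation; the
actual code may differ):

/* Hypothetical fallback for architectures that don't implement splitting. */
int __weak kvm_split_cross_boundary_leafs(struct kvm *kvm, struct kvm_gfn_range *range,
					  bool shared)
{
	return -EOPNOTSUPP;
}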
^ permalink raw reply [flat|nested] 129+ messages in thread
* [RFC PATCH v2 13/23] KVM: x86: Introduce hugepage_set_guest_inhibit()
2025-08-07 9:39 [RFC PATCH v2 00/23] KVM: TDX huge page support for private memory Yan Zhao
` (11 preceding siblings ...)
2025-08-07 9:43 ` [RFC PATCH v2 12/23] KVM: x86/mmu: Introduce kvm_split_cross_boundary_leafs() Yan Zhao
@ 2025-08-07 9:44 ` Yan Zhao
2025-08-07 9:44 ` [RFC PATCH v2 14/23] KVM: TDX: Split and inhibit huge mappings if a VMExit carries level info Yan Zhao
` (9 subsequent siblings)
22 siblings, 0 replies; 129+ messages in thread
From: Yan Zhao @ 2025-08-07 9:44 UTC (permalink / raw)
To: pbonzini, seanjc
Cc: linux-kernel, kvm, x86, rick.p.edgecombe, dave.hansen, kas, tabba,
ackerleytng, quic_eberman, michael.roth, david, vannapurve,
vbabka, thomas.lendacky, pgonda, zhiquan1.li, fan.du, jun.miao,
ira.weiny, isaku.yamahata, xiaoyao.li, binbin.wu, chao.p.peng,
yan.y.zhao
TDX requires guests to accept S-EPT mappings created by the host KVM. Due
to the current implementation of the TDX module, if a guest accepts a GFN
at a lower level after KVM maps it at a higher level, the TDX module will
emulate an EPT violation VMExit to KVM instead of returning a size mismatch
error to the guest. If KVM fails to perform page splitting in the VMExit
handler, the guest's accept operation will be triggered again upon
re-entering the guest, causing a repeated EPT violation VMExit.
To facilitate passing the guest's accept level information to the KVM MMU
core and to prevent the repeated mapping of a GFN at different levels due
to different accept levels specified by different vCPUs, introduce the
interface hugepage_set_guest_inhibit(). This interface records, across
vCPUs, that mapping at a certain level is inhibited by the guest.
The KVM_LPAGE_GUEST_INHIBIT_FLAG bit is currently modified in one
direction (set), so no clear interface is provided.
Link: https://lore.kernel.org/all/a6ffe23fb97e64109f512fa43e9f6405236ed40a.camel@intel.com/ [1]
Suggested-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
---
RFC v2:
- new in RFC v2
---
arch/x86/kvm/mmu.h | 3 +++
arch/x86/kvm/mmu/mmu.c | 21 ++++++++++++++++++---
2 files changed, 21 insertions(+), 3 deletions(-)
diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index b122255c7d4e..c2d8819f3438 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -326,4 +326,7 @@ static inline bool kvm_is_gfn_alias(struct kvm *kvm, gfn_t gfn)
{
return gfn & kvm_gfn_direct_bits(kvm);
}
+
+void hugepage_set_guest_inhibit(struct kvm_memory_slot *slot, gfn_t gfn, int level);
+bool hugepage_test_guest_inhibit(struct kvm_memory_slot *slot, gfn_t gfn, int level);
#endif
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 13910ae05f76..1c639286aac2 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -721,12 +721,14 @@ static struct kvm_lpage_info *lpage_info_slot(gfn_t gfn,
}
/*
- * The most significant bit in disallow_lpage tracks whether or not memory
- * attributes are mixed, i.e. not identical for all gfns at the current level.
+ * The most 2 significant bits in disallow_lpage tracks whether or not memory
+ * attributes are mixed, i.e. not identical for all gfns at the current level,
+ * or whether or not guest inhibits the current level of hugepage at the gfn.
* The lower order bits are used to refcount other cases where a hugepage is
* disallowed, e.g. if KVM has shadow a page table at the gfn.
*/
#define KVM_LPAGE_MIXED_FLAG BIT(31)
+#define KVM_LPAGE_GUEST_INHIBIT_FLAG BIT(30)
static void update_gfn_disallow_lpage_count(const struct kvm_memory_slot *slot,
gfn_t gfn, int count)
@@ -739,7 +741,8 @@ static void update_gfn_disallow_lpage_count(const struct kvm_memory_slot *slot,
old = linfo->disallow_lpage;
linfo->disallow_lpage += count;
- WARN_ON_ONCE((old ^ linfo->disallow_lpage) & KVM_LPAGE_MIXED_FLAG);
+ WARN_ON_ONCE((old ^ linfo->disallow_lpage) &
+ (KVM_LPAGE_MIXED_FLAG | KVM_LPAGE_GUEST_INHIBIT_FLAG));
}
}
@@ -1647,6 +1650,18 @@ static bool __kvm_rmap_zap_gfn_range(struct kvm *kvm,
start, end - 1, can_yield, true, flush);
}
+bool hugepage_test_guest_inhibit(struct kvm_memory_slot *slot, gfn_t gfn, int level)
+{
+ return lpage_info_slot(gfn, slot, level)->disallow_lpage & KVM_LPAGE_GUEST_INHIBIT_FLAG;
+}
+EXPORT_SYMBOL_GPL(hugepage_test_guest_inhibit);
+
+void hugepage_set_guest_inhibit(struct kvm_memory_slot *slot, gfn_t gfn, int level)
+{
+ lpage_info_slot(gfn, slot, level)->disallow_lpage |= KVM_LPAGE_GUEST_INHIBIT_FLAG;
+}
+EXPORT_SYMBOL_GPL(hugepage_set_guest_inhibit);
+
/*
* Split large leafs crossing the boundary of the specified range
*
--
2.43.2
^ permalink raw reply related	[flat|nested] 129+ messages in thread
* [RFC PATCH v2 14/23] KVM: TDX: Split and inhibit huge mappings if a VMExit carries level info
2025-08-07 9:39 [RFC PATCH v2 00/23] KVM: TDX huge page support for private memory Yan Zhao
` (12 preceding siblings ...)
2025-08-07 9:44 ` [RFC PATCH v2 13/23] KVM: x86: Introduce hugepage_set_guest_inhibit() Yan Zhao
@ 2025-08-07 9:44 ` Yan Zhao
2025-09-03 7:36 ` Binbin Wu
` (3 more replies)
2025-08-07 9:44 ` [RFC PATCH v2 15/23] KVM: Change the return type of gfn_handler_t() from bool to int Yan Zhao
` (8 subsequent siblings)
22 siblings, 4 replies; 129+ messages in thread
From: Yan Zhao @ 2025-08-07 9:44 UTC (permalink / raw)
To: pbonzini, seanjc
Cc: linux-kernel, kvm, x86, rick.p.edgecombe, dave.hansen, kas, tabba,
ackerleytng, quic_eberman, michael.roth, david, vannapurve,
vbabka, thomas.lendacky, pgonda, zhiquan1.li, fan.du, jun.miao,
ira.weiny, isaku.yamahata, xiaoyao.li, binbin.wu, chao.p.peng,
yan.y.zhao
TDX requires guests to accept S-EPT mappings created by the host KVM. Due
to the current implementation of the TDX module, if a guest accepts a GFN
at a lower level after KVM maps it at a higher level, the TDX module will
emulate an EPT violation VMExit to KVM instead of returning a size mismatch
error to the guest. If KVM fails to perform page splitting in the VMExit
handler, the guest's accept operation will be triggered again upon
re-entering the guest, causing a repeated EPT violation VMExit.
The TDX module thus enables the EPT violation VMExit to carry the guest's
accept level when the VMExit is caused by the guest's accept operation.
Therefore, in TDX's EPT violation handler
(1) Set the guest inhibit bit in the lpage info to prevent KVM MMU core
from mapping at a higher a level than the guest's accept level.
(2) Split any existing huge mapping at the fault GFN to avoid unsupported
splitting under the shared mmu_lock by TDX.
Use write mmu_lock to pretect (1) and (2) for now. If future KVM TDX can
perform the actual splitting under shared mmu_lock with enhanced TDX
modules, (1) is possible to be called under shared mmu_lock, and (2) would
become unnecessary.
As an optimization, this patch calls hugepage_test_guest_inhibit() without
holding the mmu_lock to reduce the frequency of acquiring the write
mmu_lock. The write mmu_lock is thus only acquired if the guest inhibit bit
is not already set. This is safe because the guest inhibit bit is set in a
one-way manner while the splitting under the write mmu_lock is performed
before setting the guest inhibit bit.
Link: https://lore.kernel.org/all/a6ffe23fb97e64109f512fa43e9f6405236ed40a.camel@intel.com
Suggested-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
---
RFC v2
- Change tdx_get_accept_level() to tdx_check_accept_level().
- Invoke kvm_split_cross_boundary_leafs() and hugepage_set_guest_inhibit()
to change KVM mapping level in a global way according to guest accept
level. (Rick, Sean).
RFC v1:
- Introduce tdx_get_accept_level() to get guest accept level.
- Use tdx->violation_request_level and tdx->violation_gfn* to pass guest
accept level to tdx_gmem_private_max_mapping_level() to detemine KVM
mapping level.
---
arch/x86/kvm/vmx/tdx.c | 50 +++++++++++++++++++++++++++++++++++++
arch/x86/kvm/vmx/tdx_arch.h | 3 +++
2 files changed, 53 insertions(+)
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 035d81275be4..71115058e5e6 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -2019,6 +2019,53 @@ static inline bool tdx_is_sept_violation_unexpected_pending(struct kvm_vcpu *vcp
return !(eq & EPT_VIOLATION_PROT_MASK) && !(eq & EPT_VIOLATION_EXEC_FOR_RING3_LIN);
}
+static inline int tdx_check_accept_level(struct kvm_vcpu *vcpu, gfn_t gfn)
+{
+ struct kvm_memory_slot *slot = gfn_to_memslot(vcpu->kvm, gfn);
+ struct vcpu_tdx *tdx = to_tdx(vcpu);
+ struct kvm *kvm = vcpu->kvm;
+ u64 eeq_type, eeq_info;
+ int level = -1;
+
+ if (!slot)
+ return 0;
+
+ eeq_type = tdx->ext_exit_qualification & TDX_EXT_EXIT_QUAL_TYPE_MASK;
+ if (eeq_type != TDX_EXT_EXIT_QUAL_TYPE_ACCEPT)
+ return 0;
+
+ eeq_info = (tdx->ext_exit_qualification & TDX_EXT_EXIT_QUAL_INFO_MASK) >>
+ TDX_EXT_EXIT_QUAL_INFO_SHIFT;
+
+ level = (eeq_info & GENMASK(2, 0)) + 1;
+
+ if (level == PG_LEVEL_4K || level == PG_LEVEL_2M) {
+ if (!hugepage_test_guest_inhibit(slot, gfn, level + 1)) {
+ gfn_t base_gfn = gfn_round_for_level(gfn, level);
+ struct kvm_gfn_range gfn_range = {
+ .start = base_gfn,
+ .end = base_gfn + KVM_PAGES_PER_HPAGE(level),
+ .slot = slot,
+ .may_block = true,
+ .attr_filter = KVM_FILTER_PRIVATE,
+ };
+
+ scoped_guard(write_lock, &kvm->mmu_lock) {
+ int ret;
+
+ ret = kvm_split_cross_boundary_leafs(kvm, &gfn_range, false);
+ if (ret)
+ return ret;
+
+ hugepage_set_guest_inhibit(slot, gfn, level + 1);
+ if (level == PG_LEVEL_4K)
+ hugepage_set_guest_inhibit(slot, gfn, level + 2);
+ }
+ }
+ }
+ return 0;
+}
+
static int tdx_handle_ept_violation(struct kvm_vcpu *vcpu)
{
unsigned long exit_qual;
@@ -2044,6 +2091,9 @@ static int tdx_handle_ept_violation(struct kvm_vcpu *vcpu)
*/
exit_qual = EPT_VIOLATION_ACC_WRITE;
+ if (tdx_check_accept_level(vcpu, gpa_to_gfn(gpa)))
+ return RET_PF_RETRY;
+
/* Only private GPA triggers zero-step mitigation */
local_retry = true;
} else {
diff --git a/arch/x86/kvm/vmx/tdx_arch.h b/arch/x86/kvm/vmx/tdx_arch.h
index a30e880849e3..af006a73ee05 100644
--- a/arch/x86/kvm/vmx/tdx_arch.h
+++ b/arch/x86/kvm/vmx/tdx_arch.h
@@ -82,7 +82,10 @@ struct tdx_cpuid_value {
#define TDX_TD_ATTR_PERFMON BIT_ULL(63)
#define TDX_EXT_EXIT_QUAL_TYPE_MASK GENMASK(3, 0)
+#define TDX_EXT_EXIT_QUAL_TYPE_ACCEPT 1
#define TDX_EXT_EXIT_QUAL_TYPE_PENDING_EPT_VIOLATION 6
+#define TDX_EXT_EXIT_QUAL_INFO_MASK GENMASK(63, 32)
+#define TDX_EXT_EXIT_QUAL_INFO_SHIFT 32
/*
* TD_PARAMS is provided as an input to TDH_MNG_INIT, the size of which is 1024B.
*/
--
2.43.2
^ permalink raw reply related	[flat|nested] 129+ messages in thread
* Re: [RFC PATCH v2 14/23] KVM: TDX: Split and inhibit huge mappings if a VMExit carries level info
2025-08-07 9:44 ` [RFC PATCH v2 14/23] KVM: TDX: Split and inhibit huge mappings if a VMExit carries level info Yan Zhao
@ 2025-09-03 7:36 ` Binbin Wu
2025-09-03 9:37 ` Yan Zhao
2025-11-11 10:55 ` Huang, Kai
` (2 subsequent siblings)
3 siblings, 1 reply; 129+ messages in thread
From: Binbin Wu @ 2025-09-03 7:36 UTC (permalink / raw)
To: Yan Zhao
Cc: pbonzini, seanjc, linux-kernel, kvm, x86, rick.p.edgecombe,
dave.hansen, kas, tabba, ackerleytng, quic_eberman, michael.roth,
david, vannapurve, vbabka, thomas.lendacky, pgonda, zhiquan1.li,
fan.du, jun.miao, ira.weiny, isaku.yamahata, xiaoyao.li,
chao.p.peng
On 8/7/2025 5:44 PM, Yan Zhao wrote:
> TDX requires guests to accept S-EPT mappings created by the host KVM. Due
> to the current implementation of the TDX module, if a guest accepts a GFN
> at a lower level after KVM maps it at a higher level, the TDX module will
> emulate an EPT violation VMExit to KVM instead of returning a size mismatch
> error to the guest. If KVM fails to perform page splitting in the VMExit
> handler, the guest's accept operation will be triggered again upon
> re-entering the guest, causing a repeated EPT violation VMExit.
>
> The TDX module thus enables the EPT violation VMExit to carry the guest's
> accept level when the VMExit is caused by the guest's accept operation.
>
> Therefore, in TDX's EPT violation handler
> (1) Set the guest inhibit bit in the lpage info to prevent KVM MMU core
> from mapping at a higher a level than the guest's accept level.
>
> (2) Split any existing huge mapping at the fault GFN to avoid unsupported
> splitting under the shared mmu_lock by TDX.
>
> Use write mmu_lock to pretect (1) and (2) for now. If future KVM TDX can
> perform the actual splitting under shared mmu_lock with enhanced TDX
> modules, (1) is possible to be called under shared mmu_lock, and (2) would
> become unnecessary.
Are the descriptions for (1) and (2) reversed?
>
> As an optimization, this patch calls hugepage_test_guest_inhibit() without
> holding the mmu_lock to reduce the frequency of acquiring the write
> mmu_lock. The write mmu_lock is thus only acquired if the guest inhibit bit
> is not already set. This is safe because the guest inhibit bit is set in a
> one-way manner while the splitting under the write mmu_lock is performed
> before setting the guest inhibit bit.
>
> Link: https://lore.kernel.org/all/a6ffe23fb97e64109f512fa43e9f6405236ed40a.camel@intel.com
> Suggested-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> Suggested-by: Sean Christopherson <seanjc@google.com>
> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
> ---
> RFC v2
> - Change tdx_get_accept_level() to tdx_check_accept_level().
> - Invoke kvm_split_cross_boundary_leafs() and hugepage_set_guest_inhibit()
> to change KVM mapping level in a global way according to guest accept
> level. (Rick, Sean).
>
> RFC v1:
> - Introduce tdx_get_accept_level() to get guest accept level.
> - Use tdx->violation_request_level and tdx->violation_gfn* to pass guest
> accept level to tdx_gmem_private_max_mapping_level() to detemine KVM
> mapping level.
> ---
> arch/x86/kvm/vmx/tdx.c | 50 +++++++++++++++++++++++++++++++++++++
> arch/x86/kvm/vmx/tdx_arch.h | 3 +++
> 2 files changed, 53 insertions(+)
>
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index 035d81275be4..71115058e5e6 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -2019,6 +2019,53 @@ static inline bool tdx_is_sept_violation_unexpected_pending(struct kvm_vcpu *vcp
> return !(eq & EPT_VIOLATION_PROT_MASK) && !(eq & EPT_VIOLATION_EXEC_FOR_RING3_LIN);
> }
>
> +static inline int tdx_check_accept_level(struct kvm_vcpu *vcpu, gfn_t gfn)
> +{
> + struct kvm_memory_slot *slot = gfn_to_memslot(vcpu->kvm, gfn);
> + struct vcpu_tdx *tdx = to_tdx(vcpu);
> + struct kvm *kvm = vcpu->kvm;
> + u64 eeq_type, eeq_info;
> + int level = -1;
> +
> + if (!slot)
> + return 0;
> +
> + eeq_type = tdx->ext_exit_qualification & TDX_EXT_EXIT_QUAL_TYPE_MASK;
> + if (eeq_type != TDX_EXT_EXIT_QUAL_TYPE_ACCEPT)
> + return 0;
> +
> + eeq_info = (tdx->ext_exit_qualification & TDX_EXT_EXIT_QUAL_INFO_MASK) >>
> + TDX_EXT_EXIT_QUAL_INFO_SHIFT;
> +
> + level = (eeq_info & GENMASK(2, 0)) + 1;
> +
> + if (level == PG_LEVEL_4K || level == PG_LEVEL_2M) {
> + if (!hugepage_test_guest_inhibit(slot, gfn, level + 1)) {
> + gfn_t base_gfn = gfn_round_for_level(gfn, level);
> + struct kvm_gfn_range gfn_range = {
> + .start = base_gfn,
> + .end = base_gfn + KVM_PAGES_PER_HPAGE(level),
> + .slot = slot,
> + .may_block = true,
> + .attr_filter = KVM_FILTER_PRIVATE,
> + };
> +
> + scoped_guard(write_lock, &kvm->mmu_lock) {
> + int ret;
> +
> + ret = kvm_split_cross_boundary_leafs(kvm, &gfn_range, false);
> + if (ret)
> + return ret;
kvm_split_cross_boundary_leafs() calls kvm_tdp_mmu_gfn_range_split_cross_boundary_leafs(),
which can return 1 (flush needed) if any huge page crossing the boundary is split, so
returning directly when ret is non-zero doesn't seem right. Also, the TLB flush needs to
be taken care of, because kvm_tdp_mmu_gfn_range_split_cross_boundary_leafs() only does
the TLB flush itself for a negative return value.
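Something along these lines might address both points (a rough sketch only, assuming a
positive return value means "split happened, flush needed"; the unconditional
kvm_flush_remote_tlbs() is used here just for illustration):

	scoped_guard(write_lock, &kvm->mmu_lock) {
		int ret = kvm_split_cross_boundary_leafs(kvm, &gfn_range, false);

		if (ret < 0)
			return ret;
		if (ret)
			/* A huge leaf was split; flush the stale translations. */
			kvm_flush_remote_tlbs(kvm);

		hugepage_set_guest_inhibit(slot, gfn, level + 1);
		if (level == PG_LEVEL_4K)
			hugepage_set_guest_inhibit(slot, gfn, level + 2);
	}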
> +
> + hugepage_set_guest_inhibit(slot, gfn, level + 1);
> + if (level == PG_LEVEL_4K)
> + hugepage_set_guest_inhibit(slot, gfn, level + 2);
> + }
> + }
> + }
> + return 0;
> +}
> +
> static int tdx_handle_ept_violation(struct kvm_vcpu *vcpu)
> {
> unsigned long exit_qual;
> @@ -2044,6 +2091,9 @@ static int tdx_handle_ept_violation(struct kvm_vcpu *vcpu)
> */
> exit_qual = EPT_VIOLATION_ACC_WRITE;
>
> + if (tdx_check_accept_level(vcpu, gpa_to_gfn(gpa)))
> + return RET_PF_RETRY;
> +
> /* Only private GPA triggers zero-step mitigation */
> local_retry = true;
> } else {
> diff --git a/arch/x86/kvm/vmx/tdx_arch.h b/arch/x86/kvm/vmx/tdx_arch.h
> index a30e880849e3..af006a73ee05 100644
> --- a/arch/x86/kvm/vmx/tdx_arch.h
> +++ b/arch/x86/kvm/vmx/tdx_arch.h
> @@ -82,7 +82,10 @@ struct tdx_cpuid_value {
> #define TDX_TD_ATTR_PERFMON BIT_ULL(63)
>
> #define TDX_EXT_EXIT_QUAL_TYPE_MASK GENMASK(3, 0)
> +#define TDX_EXT_EXIT_QUAL_TYPE_ACCEPT 1
> #define TDX_EXT_EXIT_QUAL_TYPE_PENDING_EPT_VIOLATION 6
> +#define TDX_EXT_EXIT_QUAL_INFO_MASK GENMASK(63, 32)
> +#define TDX_EXT_EXIT_QUAL_INFO_SHIFT 32
> /*
> * TD_PARAMS is provided as an input to TDH_MNG_INIT, the size of which is 1024B.
> */
^ permalink raw reply	[flat|nested] 129+ messages in thread
* Re: [RFC PATCH v2 14/23] KVM: TDX: Split and inhibit huge mappings if a VMExit carries level info
2025-09-03 7:36 ` Binbin Wu
@ 2025-09-03 9:37 ` Yan Zhao
0 siblings, 0 replies; 129+ messages in thread
From: Yan Zhao @ 2025-09-03 9:37 UTC (permalink / raw)
To: Binbin Wu
Cc: pbonzini, seanjc, linux-kernel, kvm, x86, rick.p.edgecombe,
dave.hansen, kas, tabba, ackerleytng, quic_eberman, michael.roth,
david, vannapurve, vbabka, thomas.lendacky, pgonda, zhiquan1.li,
fan.du, jun.miao, ira.weiny, isaku.yamahata, xiaoyao.li,
chao.p.peng
On Wed, Sep 03, 2025 at 03:36:49PM +0800, Binbin Wu wrote:
>
>
> On 8/7/2025 5:44 PM, Yan Zhao wrote:
> > TDX requires guests to accept S-EPT mappings created by the host KVM. Due
> > to the current implementation of the TDX module, if a guest accepts a GFN
> > at a lower level after KVM maps it at a higher level, the TDX module will
> > emulate an EPT violation VMExit to KVM instead of returning a size mismatch
> > error to the guest. If KVM fails to perform page splitting in the VMExit
> > handler, the guest's accept operation will be triggered again upon
> > re-entering the guest, causing a repeated EPT violation VMExit.
> >
> > The TDX module thus enables the EPT violation VMExit to carry the guest's
> > accept level when the VMExit is caused by the guest's accept operation.
> >
> > Therefore, in TDX's EPT violation handler
> > (1) Set the guest inhibit bit in the lpage info to prevent KVM MMU core
> > from mapping at a higher a level than the guest's accept level.
> >
> > (2) Split any existing huge mapping at the fault GFN to avoid unsupported
> > splitting under the shared mmu_lock by TDX.
> >
> > Use write mmu_lock to pretect (1) and (2) for now. If future KVM TDX can
> > perform the actual splitting under shared mmu_lock with enhanced TDX
> > modules, (1) is possible to be called under shared mmu_lock, and (2) would
> > become unnecessary.
>
> Are the descriptions for (1) and (2) reversed?
No.
After supporting splitting under shared mmu_lock,
- setting guest inhibit bit can be performed under shared mmu_lock. (*)
- splitting existing huge mapping under write mmu_lock here would be unnecessary.
(*) is still required to convey the info of which max level the guest requires.
(as explained in "Open 1: How to pass guest's ACCEPT level info" in the
cover letter).
> > As an optimization, this patch calls hugepage_test_guest_inhibit() without
> > holding the mmu_lock to reduce the frequency of acquiring the write
> > mmu_lock. The write mmu_lock is thus only acquired if the guest inhibit bit
> > is not already set. This is safe because the guest inhibit bit is set in a
> > one-way manner while the splitting under the write mmu_lock is performed
> > before setting the guest inhibit bit.
> >
> > Link: https://lore.kernel.org/all/a6ffe23fb97e64109f512fa43e9f6405236ed40a.camel@intel.com
> > Suggested-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> > Suggested-by: Sean Christopherson <seanjc@google.com>
> > Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
> > ---
> > RFC v2
> > - Change tdx_get_accept_level() to tdx_check_accept_level().
> > - Invoke kvm_split_cross_boundary_leafs() and hugepage_set_guest_inhibit()
> > to change KVM mapping level in a global way according to guest accept
> > level. (Rick, Sean).
> >
> > RFC v1:
> > - Introduce tdx_get_accept_level() to get guest accept level.
> > - Use tdx->violation_request_level and tdx->violation_gfn* to pass guest
> > accept level to tdx_gmem_private_max_mapping_level() to detemine KVM
> > mapping level.
> > ---
> > arch/x86/kvm/vmx/tdx.c | 50 +++++++++++++++++++++++++++++++++++++
> > arch/x86/kvm/vmx/tdx_arch.h | 3 +++
> > 2 files changed, 53 insertions(+)
> >
> > diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> > index 035d81275be4..71115058e5e6 100644
> > --- a/arch/x86/kvm/vmx/tdx.c
> > +++ b/arch/x86/kvm/vmx/tdx.c
> > @@ -2019,6 +2019,53 @@ static inline bool tdx_is_sept_violation_unexpected_pending(struct kvm_vcpu *vcp
> > return !(eq & EPT_VIOLATION_PROT_MASK) && !(eq & EPT_VIOLATION_EXEC_FOR_RING3_LIN);
> > }
> > +static inline int tdx_check_accept_level(struct kvm_vcpu *vcpu, gfn_t gfn)
> > +{
> > + struct kvm_memory_slot *slot = gfn_to_memslot(vcpu->kvm, gfn);
> > + struct vcpu_tdx *tdx = to_tdx(vcpu);
> > + struct kvm *kvm = vcpu->kvm;
> > + u64 eeq_type, eeq_info;
> > + int level = -1;
> > +
> > + if (!slot)
> > + return 0;
> > +
> > + eeq_type = tdx->ext_exit_qualification & TDX_EXT_EXIT_QUAL_TYPE_MASK;
> > + if (eeq_type != TDX_EXT_EXIT_QUAL_TYPE_ACCEPT)
> > + return 0;
> > +
> > + eeq_info = (tdx->ext_exit_qualification & TDX_EXT_EXIT_QUAL_INFO_MASK) >>
> > + TDX_EXT_EXIT_QUAL_INFO_SHIFT;
> > +
> > + level = (eeq_info & GENMASK(2, 0)) + 1;
> > +
> > + if (level == PG_LEVEL_4K || level == PG_LEVEL_2M) {
> > + if (!hugepage_test_guest_inhibit(slot, gfn, level + 1)) {
> > + gfn_t base_gfn = gfn_round_for_level(gfn, level);
> > + struct kvm_gfn_range gfn_range = {
> > + .start = base_gfn,
> > + .end = base_gfn + KVM_PAGES_PER_HPAGE(level),
> > + .slot = slot,
> > + .may_block = true,
> > + .attr_filter = KVM_FILTER_PRIVATE,
> > + };
> > +
> > + scoped_guard(write_lock, &kvm->mmu_lock) {
> > + int ret;
> > +
> > + ret = kvm_split_cross_boundary_leafs(kvm, &gfn_range, false);
> > + if (ret)
> > + return ret;
>
> kvm_split_cross_boundary_leafs() calls kvm_tdp_mmu_gfn_range_split_cross_boundary_leafs(),
> which can return 1 (flush needed) if any huge page crossing the boundary is split, so
> returning directly when ret is non-zero doesn't seem right. Also, the TLB flush needs to
> be taken care of, because kvm_tdp_mmu_gfn_range_split_cross_boundary_leafs() only does
> the TLB flush itself for a negative return value.
Oh, good catch!
I forgot about the 2 facts. Will fix them.
> > +
> > + hugepage_set_guest_inhibit(slot, gfn, level + 1);
> > + if (level == PG_LEVEL_4K)
> > + hugepage_set_guest_inhibit(slot, gfn, level + 2);
> > + }
> > + }
> > + }
> > + return 0;
> > +}
> > +
> > static int tdx_handle_ept_violation(struct kvm_vcpu *vcpu)
> > {
> > unsigned long exit_qual;
> > @@ -2044,6 +2091,9 @@ static int tdx_handle_ept_violation(struct kvm_vcpu *vcpu)
> > */
> > exit_qual = EPT_VIOLATION_ACC_WRITE;
> > + if (tdx_check_accept_level(vcpu, gpa_to_gfn(gpa)))
> > + return RET_PF_RETRY;
> > +
> > /* Only private GPA triggers zero-step mitigation */
> > local_retry = true;
> > } else {
> > diff --git a/arch/x86/kvm/vmx/tdx_arch.h b/arch/x86/kvm/vmx/tdx_arch.h
> > index a30e880849e3..af006a73ee05 100644
> > --- a/arch/x86/kvm/vmx/tdx_arch.h
> > +++ b/arch/x86/kvm/vmx/tdx_arch.h
> > @@ -82,7 +82,10 @@ struct tdx_cpuid_value {
> > #define TDX_TD_ATTR_PERFMON BIT_ULL(63)
> > #define TDX_EXT_EXIT_QUAL_TYPE_MASK GENMASK(3, 0)
> > +#define TDX_EXT_EXIT_QUAL_TYPE_ACCEPT 1
> > #define TDX_EXT_EXIT_QUAL_TYPE_PENDING_EPT_VIOLATION 6
> > +#define TDX_EXT_EXIT_QUAL_INFO_MASK GENMASK(63, 32)
> > +#define TDX_EXT_EXIT_QUAL_INFO_SHIFT 32
> > /*
> > * TD_PARAMS is provided as an input to TDH_MNG_INIT, the size of which is 1024B.
> > */
>
^ permalink raw reply [flat|nested] 129+ messages in thread
* Re: [RFC PATCH v2 14/23] KVM: TDX: Split and inhibit huge mappings if a VMExit carries level info
2025-08-07 9:44 ` [RFC PATCH v2 14/23] KVM: TDX: Split and inhibit huge mappings if a VMExit carries level info Yan Zhao
2025-09-03 7:36 ` Binbin Wu
@ 2025-11-11 10:55 ` Huang, Kai
2025-11-14 1:42 ` Yan Zhao
2025-11-11 11:05 ` Huang, Kai
2025-11-19 5:51 ` Binbin Wu
3 siblings, 1 reply; 129+ messages in thread
From: Huang, Kai @ 2025-11-11 10:55 UTC (permalink / raw)
To: pbonzini@redhat.com, seanjc@google.com, Zhao, Yan Y
Cc: quic_eberman@quicinc.com, kvm@vger.kernel.org, Li, Xiaoyao,
Du, Fan, Hansen, Dave, david@redhat.com, thomas.lendacky@amd.com,
vbabka@suse.cz, tabba@google.com, kas@kernel.org,
michael.roth@amd.com, Weiny, Ira, linux-kernel@vger.kernel.org,
binbin.wu@linux.intel.com, ackerleytng@google.com,
Yamahata, Isaku, Peng, Chao P, zhiquan1.li@intel.com,
Annapurve, Vishal, Edgecombe, Rick P, Miao, Jun, x86@kernel.org,
pgonda@google.com
On Thu, 2025-08-07 at 17:44 +0800, Yan Zhao wrote:
> @@ -2044,6 +2091,9 @@ static int tdx_handle_ept_violation(struct kvm_vcpu *vcpu)
> */
> exit_qual = EPT_VIOLATION_ACC_WRITE;
>
> + if (tdx_check_accept_level(vcpu, gpa_to_gfn(gpa)))
> + return RET_PF_RETRY;
> +
I don't think you should return RET_PF_RETRY here.
This is still at very early stage of EPT violation. The caller of
tdx_handle_ept_violation() is expecting either 0, 1, or negative error code.
^ permalink raw reply [flat|nested] 129+ messages in thread
* Re: [RFC PATCH v2 14/23] KVM: TDX: Split and inhibit huge mappings if a VMExit carries level info
2025-11-11 10:55 ` Huang, Kai
@ 2025-11-14 1:42 ` Yan Zhao
2025-11-18 0:26 ` Huang, Kai
0 siblings, 1 reply; 129+ messages in thread
From: Yan Zhao @ 2025-11-14 1:42 UTC (permalink / raw)
To: Huang, Kai
Cc: pbonzini@redhat.com, seanjc@google.com, quic_eberman@quicinc.com,
kvm@vger.kernel.org, Li, Xiaoyao, Du, Fan, Hansen, Dave,
david@redhat.com, thomas.lendacky@amd.com, vbabka@suse.cz,
tabba@google.com, kas@kernel.org, michael.roth@amd.com,
Weiny, Ira, linux-kernel@vger.kernel.org,
binbin.wu@linux.intel.com, ackerleytng@google.com,
Yamahata, Isaku, Peng, Chao P, zhiquan1.li@intel.com,
Annapurve, Vishal, Edgecombe, Rick P, Miao, Jun, x86@kernel.org,
pgonda@google.com
On Tue, Nov 11, 2025 at 06:55:45PM +0800, Huang, Kai wrote:
> On Thu, 2025-08-07 at 17:44 +0800, Yan Zhao wrote:
> > @@ -2044,6 +2091,9 @@ static int tdx_handle_ept_violation(struct kvm_vcpu *vcpu)
> > */
> > exit_qual = EPT_VIOLATION_ACC_WRITE;
> >
> > + if (tdx_check_accept_level(vcpu, gpa_to_gfn(gpa)))
> > + return RET_PF_RETRY;
> > +
>
> I don't think you should return RET_PF_RETRY here.
>
> This is still at very early stage of EPT violation. The caller of
> tdx_handle_ept_violation() is expecting either 0, 1, or negative error code.
Hmm, strictly speaking, the caller of the EPT violation handler is expecting
0, >0, or negative error code.
vcpu_run
|->r = vcpu_enter_guest(vcpu);
| |->r = kvm_x86_call(handle_exit)(vcpu, exit_fastpath);
| | return r;
| if (r <= 0)
| break;
handle_ept_violation
|->return __vmx_handle_ept_violation(vcpu, gpa, exit_qualification);
tdx_handle_ept_violation
|->ret = __vmx_handle_ept_violation(vcpu, gpa, exit_qual);
| return ret;
The current VMX/TDX's EPT violation handlers returns RET_PF_* to the caller
since commit 7c5480386300 ("KVM: x86/mmu: Return RET_PF* instead of 1 in
kvm_mmu_page_fault") for the sake of zero-step mitigation.
This is no problem, because
enum {
RET_PF_CONTINUE = 0,
RET_PF_RETRY,
RET_PF_EMULATE,
RET_PF_WRITE_PROTECTED,
RET_PF_INVALID,
RET_PF_FIXED,
RET_PF_SPURIOUS,
};
/*
* Define RET_PF_CONTINUE as 0 to allow for
* - efficient machine code when checking for CONTINUE, e.g.
* "TEST %rax, %rax, JNZ", as all "stop!" values are non-zero,
* - kvm_mmu_do_page_fault() to return other RET_PF_* as a positive value.
*/
static_assert(RET_PF_CONTINUE == 0);
^ permalink raw reply [flat|nested] 129+ messages in thread* Re: [RFC PATCH v2 14/23] KVM: TDX: Split and inhibit huge mappings if a VMExit carries level info
2025-11-14 1:42 ` Yan Zhao
@ 2025-11-18 0:26 ` Huang, Kai
2025-11-18 2:44 ` Yan Zhao
0 siblings, 1 reply; 129+ messages in thread
From: Huang, Kai @ 2025-11-18 0:26 UTC (permalink / raw)
To: Zhao, Yan Y
Cc: kvm@vger.kernel.org, Li, Xiaoyao, quic_eberman@quicinc.com,
Hansen, Dave, david@redhat.com, thomas.lendacky@amd.com,
vbabka@suse.cz, tabba@google.com, Du, Fan, michael.roth@amd.com,
seanjc@google.com, binbin.wu@linux.intel.com,
linux-kernel@vger.kernel.org, pbonzini@redhat.com, Weiny, Ira,
kas@kernel.org, ackerleytng@google.com, Peng, Chao P,
zhiquan1.li@intel.com, Yamahata, Isaku, Annapurve, Vishal,
Edgecombe, Rick P, Miao, Jun, x86@kernel.org, pgonda@google.com
On Fri, 2025-11-14 at 09:42 +0800, Yan Zhao wrote:
> On Tue, Nov 11, 2025 at 06:55:45PM +0800, Huang, Kai wrote:
> > On Thu, 2025-08-07 at 17:44 +0800, Yan Zhao wrote:
> > > @@ -2044,6 +2091,9 @@ static int tdx_handle_ept_violation(struct kvm_vcpu *vcpu)
> > > */
> > > exit_qual = EPT_VIOLATION_ACC_WRITE;
> > >
> > > + if (tdx_check_accept_level(vcpu, gpa_to_gfn(gpa)))
> > > + return RET_PF_RETRY;
> > > +
> >
> > I don't think you should return RET_PF_RETRY here.
> >
> > This is still at very early stage of EPT violation. The caller of
> > tdx_handle_ept_violation() is expecting either 0, 1, or negative error code.
> Hmm, strictly speaking, the caller of the EPT violation handler is expecting
> 0, >0, or negative error code.
>
> vcpu_run
> |->r = vcpu_enter_guest(vcpu);
> | |->r = kvm_x86_call(handle_exit)(vcpu, exit_fastpath);
> | | return r;
> | if (r <= 0)
> | break;
>
> handle_ept_violation
> |->return __vmx_handle_ept_violation(vcpu, gpa, exit_qualification);
>
> tdx_handle_ept_violation
> |->ret = __vmx_handle_ept_violation(vcpu, gpa, exit_qual);
> | return ret;
>
> The current VMX/TDX's EPT violation handlers returns RET_PF_* to the caller
> since commit 7c5480386300 ("KVM: x86/mmu: Return RET_PF* instead of 1 in
> kvm_mmu_page_fault") for the sake of zero-step mitigation.
>
> This is no problem, because
>
> enum {
> RET_PF_CONTINUE = 0,
> RET_PF_RETRY,
> RET_PF_EMULATE,
> RET_PF_WRITE_PROTECTED,
> RET_PF_INVALID,
> RET_PF_FIXED,
> RET_PF_SPURIOUS,
> };
>
> /*
> * Define RET_PF_CONTINUE as 0 to allow for
> * - efficient machine code when checking for CONTINUE, e.g.
> * "TEST %rax, %rax, JNZ", as all "stop!" values are non-zero,
> * - kvm_mmu_do_page_fault() to return other RET_PF_* as a positive value.
> */
> static_assert(RET_PF_CONTINUE == 0);
Ah, OK.
But this makes KVM retry fault, when kvm_split_cross_boundary_leafs() fails, due
to -ENOMEM, presumably. While in the normal page fault handler path, -ENOMEM
will just return to userspace AFAICT.
This is not consistent, but I guess nobody cares, or noticed.
^ permalink raw reply [flat|nested] 129+ messages in thread* Re: [RFC PATCH v2 14/23] KVM: TDX: Split and inhibit huge mappings if a VMExit carries level info
2025-11-18 0:26 ` Huang, Kai
@ 2025-11-18 2:44 ` Yan Zhao
0 siblings, 0 replies; 129+ messages in thread
From: Yan Zhao @ 2025-11-18 2:44 UTC (permalink / raw)
To: Huang, Kai
Cc: kvm@vger.kernel.org, Li, Xiaoyao, quic_eberman@quicinc.com,
Hansen, Dave, david@redhat.com, thomas.lendacky@amd.com,
vbabka@suse.cz, tabba@google.com, Du, Fan, michael.roth@amd.com,
seanjc@google.com, binbin.wu@linux.intel.com,
linux-kernel@vger.kernel.org, pbonzini@redhat.com, Weiny, Ira,
kas@kernel.org, ackerleytng@google.com, Peng, Chao P,
zhiquan1.li@intel.com, Yamahata, Isaku, Annapurve, Vishal,
Edgecombe, Rick P, Miao, Jun, x86@kernel.org, pgonda@google.com
On Tue, Nov 18, 2025 at 08:26:42AM +0800, Huang, Kai wrote:
> On Fri, 2025-11-14 at 09:42 +0800, Yan Zhao wrote:
> > On Tue, Nov 11, 2025 at 06:55:45PM +0800, Huang, Kai wrote:
> > > On Thu, 2025-08-07 at 17:44 +0800, Yan Zhao wrote:
> > > > @@ -2044,6 +2091,9 @@ static int tdx_handle_ept_violation(struct kvm_vcpu *vcpu)
> > > > */
> > > > exit_qual = EPT_VIOLATION_ACC_WRITE;
> > > >
> > > > + if (tdx_check_accept_level(vcpu, gpa_to_gfn(gpa)))
> > > > + return RET_PF_RETRY;
> > > > +
> > >
> > > I don't think you should return RET_PF_RETRY here.
> > >
> > > This is still at very early stage of EPT violation. The caller of
> > > tdx_handle_ept_violation() is expecting either 0, 1, or negative error code.
> > Hmm, strictly speaking, the caller of the EPT violation handler is expecting
> > 0, >0, or negative error code.
> >
> > vcpu_run
> > |->r = vcpu_enter_guest(vcpu);
> > | |->r = kvm_x86_call(handle_exit)(vcpu, exit_fastpath);
> > | | return r;
> > | if (r <= 0)
> > | break;
> >
> > handle_ept_violation
> > |->return __vmx_handle_ept_violation(vcpu, gpa, exit_qualification);
> >
> > tdx_handle_ept_violation
> > |->ret = __vmx_handle_ept_violation(vcpu, gpa, exit_qual);
> > | return ret;
> >
> > The current VMX/TDX's EPT violation handlers returns RET_PF_* to the caller
> > since commit 7c5480386300 ("KVM: x86/mmu: Return RET_PF* instead of 1 in
> > kvm_mmu_page_fault") for the sake of zero-step mitigation.
> >
> > This is no problem, because
> >
> > enum {
> > RET_PF_CONTINUE = 0,
> > RET_PF_RETRY,
> > RET_PF_EMULATE,
> > RET_PF_WRITE_PROTECTED,
> > RET_PF_INVALID,
> > RET_PF_FIXED,
> > RET_PF_SPURIOUS,
> > };
> >
> > /*
> > * Define RET_PF_CONTINUE as 0 to allow for
> > * - efficient machine code when checking for CONTINUE, e.g.
> > * "TEST %rax, %rax, JNZ", as all "stop!" values are non-zero,
> > * - kvm_mmu_do_page_fault() to return other RET_PF_* as a positive value.
> > */
> > static_assert(RET_PF_CONTINUE == 0);
>
> Ah, OK.
>
> But this makes KVM retry fault, when kvm_split_cross_boundary_leafs() fails, due
> to -ENOMEM, presumably. While in the normal page fault handler path, -ENOMEM
> will just return to userspace AFAICT.
>
> This is not consistent, but I guess nobody cares, or noticed.
Oh, I got your point now.
Though retrying on -ENOMEM is also OK, returning ret to userspace for
consistency is a good point, given mmu_topup_memory_caches() returns -ENOMEM to
userspace.
I'll update it accordingly. Thanks!
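A minimal sketch of what the updated call site might look like (hypothetical; the
actual fix may differ), assuming tdx_check_accept_level() then returns 0 on success or
a negative error, with any needed TLB flush handled internally:

	int ret = tdx_check_accept_level(vcpu, gpa_to_gfn(gpa));

	if (ret < 0)
		return ret;	/* e.g. -ENOMEM is reported to userspace */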
^ permalink raw reply [flat|nested] 129+ messages in thread
* Re: [RFC PATCH v2 14/23] KVM: TDX: Split and inhibit huge mappings if a VMExit carries level info
2025-08-07 9:44 ` [RFC PATCH v2 14/23] KVM: TDX: Split and inhibit huge mappings if a VMExit carries level info Yan Zhao
2025-09-03 7:36 ` Binbin Wu
2025-11-11 10:55 ` Huang, Kai
@ 2025-11-11 11:05 ` Huang, Kai
2025-11-14 7:22 ` Yan Zhao
2025-11-19 5:51 ` Binbin Wu
3 siblings, 1 reply; 129+ messages in thread
From: Huang, Kai @ 2025-11-11 11:05 UTC (permalink / raw)
To: pbonzini@redhat.com, seanjc@google.com, Zhao, Yan Y
Cc: quic_eberman@quicinc.com, kvm@vger.kernel.org, Li, Xiaoyao,
Du, Fan, Hansen, Dave, david@redhat.com, thomas.lendacky@amd.com,
vbabka@suse.cz, tabba@google.com, kas@kernel.org,
michael.roth@amd.com, Weiny, Ira, linux-kernel@vger.kernel.org,
binbin.wu@linux.intel.com, ackerleytng@google.com,
Yamahata, Isaku, Peng, Chao P, zhiquan1.li@intel.com,
Annapurve, Vishal, Edgecombe, Rick P, Miao, Jun, x86@kernel.org,
pgonda@google.com
On Thu, 2025-08-07 at 17:44 +0800, Yan Zhao wrote:
> TDX requires guests to accept S-EPT mappings created by the host KVM. Due
> to the current implementation of the TDX module, if a guest accepts a GFN
> at a lower level after KVM maps it at a higher level, the TDX module will
> emulate an EPT violation VMExit to KVM instead of returning a size mismatch
> error to the guest. If KVM fails to perform page splitting in the VMExit
> handler, the guest's accept operation will be triggered again upon
> re-entering the guest, causing a repeated EPT violation VMExit.
>
> The TDX module thus enables the EPT violation VMExit to carry the guest's
> accept level when the VMExit is caused by the guest's accept operation.
>
> Therefore, in TDX's EPT violation handler
> (1) Set the guest inhibit bit in the lpage info to prevent KVM MMU core
> from mapping at a higher a level than the guest's accept level.
>
> (2) Split any existing huge mapping at the fault GFN to avoid unsupported
> splitting under the shared mmu_lock by TDX.
>
> Use write mmu_lock to pretect (1) and (2) for now. If future KVM TDX can
> perform the actual splitting under shared mmu_lock with enhanced TDX
> modules, (1) is possible to be called under shared mmu_lock, and (2) would
> become unnecessary.
>
> As an optimization, this patch calls hugepage_test_guest_inhibit() without
> holding the mmu_lock to reduce the frequency of acquiring the write
> mmu_lock. The write mmu_lock is thus only acquired if the guest inhibit bit
> is not already set. This is safe because the guest inhibit bit is set in a
> one-way manner while the splitting under the write mmu_lock is performed
> before setting the guest inhibit bit.
>
> Link: https://lore.kernel.org/all/a6ffe23fb97e64109f512fa43e9f6405236ed40a.camel@intel.com
> Suggested-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> Suggested-by: Sean Christopherson <seanjc@google.com>
> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
> ---
> RFC v2
> - Change tdx_get_accept_level() to tdx_check_accept_level().
> - Invoke kvm_split_cross_boundary_leafs() and hugepage_set_guest_inhibit()
> to change KVM mapping level in a global way according to guest accept
> level. (Rick, Sean).
>
> RFC v1:
> - Introduce tdx_get_accept_level() to get guest accept level.
> - Use tdx->violation_request_level and tdx->violation_gfn* to pass guest
> accept level to tdx_gmem_private_max_mapping_level() to detemine KVM
> mapping level.
> ---
> arch/x86/kvm/vmx/tdx.c | 50 +++++++++++++++++++++++++++++++++++++
> arch/x86/kvm/vmx/tdx_arch.h | 3 +++
> 2 files changed, 53 insertions(+)
>
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index 035d81275be4..71115058e5e6 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -2019,6 +2019,53 @@ static inline bool tdx_is_sept_violation_unexpected_pending(struct kvm_vcpu *vcp
> return !(eq & EPT_VIOLATION_PROT_MASK) && !(eq & EPT_VIOLATION_EXEC_FOR_RING3_LIN);
> }
>
> +static inline int tdx_check_accept_level(struct kvm_vcpu *vcpu, gfn_t gfn)
> +{
> + struct kvm_memory_slot *slot = gfn_to_memslot(vcpu->kvm, gfn);
> + struct vcpu_tdx *tdx = to_tdx(vcpu);
> + struct kvm *kvm = vcpu->kvm;
> + u64 eeq_type, eeq_info;
> + int level = -1;
> +
> + if (!slot)
> + return 0;
> +
> + eeq_type = tdx->ext_exit_qualification & TDX_EXT_EXIT_QUAL_TYPE_MASK;
> + if (eeq_type != TDX_EXT_EXIT_QUAL_TYPE_ACCEPT)
> + return 0;
> +
> + eeq_info = (tdx->ext_exit_qualification & TDX_EXT_EXIT_QUAL_INFO_MASK) >>
> + TDX_EXT_EXIT_QUAL_INFO_SHIFT;
> +
> + level = (eeq_info & GENMASK(2, 0)) + 1;
> +
> + if (level == PG_LEVEL_4K || level == PG_LEVEL_2M) {
> + if (!hugepage_test_guest_inhibit(slot, gfn, level + 1)) {
> + gfn_t base_gfn = gfn_round_for_level(gfn, level);
> + struct kvm_gfn_range gfn_range = {
> + .start = base_gfn,
> + .end = base_gfn + KVM_PAGES_PER_HPAGE(level),
> + .slot = slot,
> + .may_block = true,
> + .attr_filter = KVM_FILTER_PRIVATE,
> + };
> +
> + scoped_guard(write_lock, &kvm->mmu_lock) {
> + int ret;
> +
> + ret = kvm_split_cross_boundary_leafs(kvm, &gfn_range, false);
> + if (ret)
> + return ret;
> +
> + hugepage_set_guest_inhibit(slot, gfn, level + 1);
> + if (level == PG_LEVEL_4K)
> + hugepage_set_guest_inhibit(slot, gfn, level + 2);
> + }
> + }
> + }
Also, could you also clarify what's the current behaviour when the exit
doesn't have any level information?
Will 'level == PG_LEVEL_4K' in this case? Or will this function return
early right after check the eeq_type?
It's not mentioned anywhere in the changelog. The cover letter vaguely
says:
This mechanism allows support of huge pages for non-Linux TDs and
also removes the 4KB restriction on pre-fault mappings for Linux
TDs in RFC v1.
But it's not clear to me how this is solved.
^ permalink raw reply	[flat|nested] 129+ messages in thread
* Re: [RFC PATCH v2 14/23] KVM: TDX: Split and inhibit huge mappings if a VMExit carries level info
2025-11-11 11:05 ` Huang, Kai
@ 2025-11-14 7:22 ` Yan Zhao
2025-11-18 1:04 ` Huang, Kai
0 siblings, 1 reply; 129+ messages in thread
From: Yan Zhao @ 2025-11-14 7:22 UTC (permalink / raw)
To: Huang, Kai
Cc: pbonzini@redhat.com, seanjc@google.com, kvm@vger.kernel.org,
Li, Xiaoyao, Du, Fan, Hansen, Dave, david@redhat.com,
thomas.lendacky@amd.com, vbabka@suse.cz, tabba@google.com,
kas@kernel.org, michael.roth@amd.com, Weiny, Ira,
linux-kernel@vger.kernel.org, binbin.wu@linux.intel.com,
ackerleytng@google.com, Yamahata, Isaku, Peng, Chao P,
Annapurve, Vishal, Edgecombe, Rick P, Miao, Jun, x86@kernel.org,
pgonda@google.com
On Tue, Nov 11, 2025 at 07:05:28PM +0800, Huang, Kai wrote:
> On Thu, 2025-08-07 at 17:44 +0800, Yan Zhao wrote:
> > TDX requires guests to accept S-EPT mappings created by the host KVM. Due
> > to the current implementation of the TDX module, if a guest accepts a GFN
> > at a lower level after KVM maps it at a higher level, the TDX module will
> > emulate an EPT violation VMExit to KVM instead of returning a size mismatch
> > error to the guest. If KVM fails to perform page splitting in the VMExit
> > handler, the guest's accept operation will be triggered again upon
> > re-entering the guest, causing a repeated EPT violation VMExit.
> >
> > The TDX module thus enables the EPT violation VMExit to carry the guest's
> > accept level when the VMExit is caused by the guest's accept operation.
> >
> > Therefore, in TDX's EPT violation handler
> > (1) Set the guest inhibit bit in the lpage info to prevent KVM MMU core
> > from mapping at a higher a level than the guest's accept level.
> >
> > (2) Split any existing huge mapping at the fault GFN to avoid unsupported
> > splitting under the shared mmu_lock by TDX.
> >
> > Use write mmu_lock to pretect (1) and (2) for now. If future KVM TDX can
> > perform the actual splitting under shared mmu_lock with enhanced TDX
> > modules, (1) is possible to be called under shared mmu_lock, and (2) would
> > become unnecessary.
> >
> > As an optimization, this patch calls hugepage_test_guest_inhibit() without
> > holding the mmu_lock to reduce the frequency of acquiring the write
> > mmu_lock. The write mmu_lock is thus only acquired if the guest inhibit bit
> > is not already set. This is safe because the guest inhibit bit is set in a
> > one-way manner while the splitting under the write mmu_lock is performed
> > before setting the guest inhibit bit.
> >
> > Link: https://lore.kernel.org/all/a6ffe23fb97e64109f512fa43e9f6405236ed40a.camel@intel.com
> > Suggested-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> > Suggested-by: Sean Christopherson <seanjc@google.com>
> > Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
> > ---
> > RFC v2
> > - Change tdx_get_accept_level() to tdx_check_accept_level().
> > - Invoke kvm_split_cross_boundary_leafs() and hugepage_set_guest_inhibit()
> > to change KVM mapping level in a global way according to guest accept
> > level. (Rick, Sean).
> >
> > RFC v1:
> > - Introduce tdx_get_accept_level() to get guest accept level.
> > - Use tdx->violation_request_level and tdx->violation_gfn* to pass guest
> > accept level to tdx_gmem_private_max_mapping_level() to detemine KVM
> > mapping level.
> > ---
> > arch/x86/kvm/vmx/tdx.c | 50 +++++++++++++++++++++++++++++++++++++
> > arch/x86/kvm/vmx/tdx_arch.h | 3 +++
> > 2 files changed, 53 insertions(+)
> >
> > diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> > index 035d81275be4..71115058e5e6 100644
> > --- a/arch/x86/kvm/vmx/tdx.c
> > +++ b/arch/x86/kvm/vmx/tdx.c
> > @@ -2019,6 +2019,53 @@ static inline bool tdx_is_sept_violation_unexpected_pending(struct kvm_vcpu *vcp
> > return !(eq & EPT_VIOLATION_PROT_MASK) && !(eq & EPT_VIOLATION_EXEC_FOR_RING3_LIN);
> > }
> >
> > +static inline int tdx_check_accept_level(struct kvm_vcpu *vcpu, gfn_t gfn)
> > +{
> > + struct kvm_memory_slot *slot = gfn_to_memslot(vcpu->kvm, gfn);
> > + struct vcpu_tdx *tdx = to_tdx(vcpu);
> > + struct kvm *kvm = vcpu->kvm;
> > + u64 eeq_type, eeq_info;
> > + int level = -1;
> > +
> > + if (!slot)
> > + return 0;
> > +
> > + eeq_type = tdx->ext_exit_qualification & TDX_EXT_EXIT_QUAL_TYPE_MASK;
> > + if (eeq_type != TDX_EXT_EXIT_QUAL_TYPE_ACCEPT)
> > + return 0;
> > +
> > + eeq_info = (tdx->ext_exit_qualification & TDX_EXT_EXIT_QUAL_INFO_MASK) >>
> > + TDX_EXT_EXIT_QUAL_INFO_SHIFT;
> > +
> > + level = (eeq_info & GENMASK(2, 0)) + 1;
> > +
> > + if (level == PG_LEVEL_4K || level == PG_LEVEL_2M) {
> > + if (!hugepage_test_guest_inhibit(slot, gfn, level + 1)) {
> > + gfn_t base_gfn = gfn_round_for_level(gfn, level);
> > + struct kvm_gfn_range gfn_range = {
> > + .start = base_gfn,
> > + .end = base_gfn + KVM_PAGES_PER_HPAGE(level),
> > + .slot = slot,
> > + .may_block = true,
> > + .attr_filter = KVM_FILTER_PRIVATE,
> > + };
> > +
> > + scoped_guard(write_lock, &kvm->mmu_lock) {
> > + int ret;
> > +
> > + ret = kvm_split_cross_boundary_leafs(kvm, &gfn_range, false);
> > + if (ret)
> > + return ret;
> > +
> > + hugepage_set_guest_inhibit(slot, gfn, level + 1);
> > + if (level == PG_LEVEL_4K)
> > + hugepage_set_guest_inhibit(slot, gfn, level + 2);
> > + }
> > + }
> > + }
>
> Also, could you also clarify what's the current behaviour when the exit
> doesn't have any level information?
An EPT violation exit seen by KVM for TDs is emulated by the TDX module. The TDX
module provides the VMM with more detailed info through the exit's extended exit
qualification.
If an EPT violation exit is emulated due to the guest's ACCEPT operation, the
extended exit qualification is of type TDX_EXT_EXIT_QUAL_TYPE_ACCEPT. Since an
ACCEPT operation must provide a valid level (otherwise, the TDX module just
fails the guest ACCEPT without exiting to the VMM), the extended exit
qualification info must carry a valid level too: either PG_LEVEL_4K or
PG_LEVEL_2M.
So, if KVM sees an exit with no level info, the extended exit qualification is
not of type TDX_EXT_EXIT_QUAL_TYPE_ACCEPT in the first place. It could be of
type NONE or type PENDING_EPT_VIOLATION, depending on whether the guest is
configured with pending_ve_disable or whether the gpa is private. This kind of
exit is caused by the guest accessing memory without first accepting it.
> Will 'level == PG_LEVEL_4K' in this case? Or will this function return
> early right after check the eeq_type?
The function will return early right after check the eeq_type.
> It's not mentioned anywhere in the changelog. The cover letter vaguely
> says:
>
> This mechanism allows support of huge pages for non-Linux TDs and
> also removes the 4KB restriction on pre-fault mappings for Linux
> TDs in RFC v1.
>
> But it's not clear to me how this is solved.
I'll add a comment to tdx_check_accept_level() and update the patch log to make
the picture clearer.
Thanks for pointing it out.
^ permalink raw reply [flat|nested] 129+ messages in thread* Re: [RFC PATCH v2 14/23] KVM: TDX: Split and inhibit huge mappings if a VMExit carries level info
2025-11-14 7:22 ` Yan Zhao
@ 2025-11-18 1:04 ` Huang, Kai
2025-11-18 2:20 ` Yan Zhao
0 siblings, 1 reply; 129+ messages in thread
From: Huang, Kai @ 2025-11-18 1:04 UTC (permalink / raw)
To: Zhao, Yan Y
Cc: Du, Fan, Li, Xiaoyao, kvm@vger.kernel.org, Hansen, Dave,
david@redhat.com, thomas.lendacky@amd.com, tabba@google.com,
vbabka@suse.cz, michael.roth@amd.com,
linux-kernel@vger.kernel.org, seanjc@google.com,
pbonzini@redhat.com, binbin.wu@linux.intel.com,
ackerleytng@google.com, kas@kernel.org, Weiny, Ira, Peng, Chao P,
Yamahata, Isaku, Annapurve, Vishal, Edgecombe, Rick P, Miao, Jun,
x86@kernel.org, pgonda@google.com
On Fri, 2025-11-14 at 15:22 +0800, Yan Zhao wrote:
> > Will 'level == PG_LEVEL_4K' in this case? Or will this function return
> > early right after check the eeq_type?
> The function will return early right after check the eeq_type.
But for such case the fault handler will still return 2M and KVM will AUG 2M
page? Then if guest accepts 4K page, a new exit to KVM would happen?
But this time KVM is able to find the info that guest is accepting 4K and KVM
will split the 2M to 4K pages so we are good to go?
^ permalink raw reply [flat|nested] 129+ messages in thread
* Re: [RFC PATCH v2 14/23] KVM: TDX: Split and inhibit huge mappings if a VMExit carries level info
2025-11-18 1:04 ` Huang, Kai
@ 2025-11-18 2:20 ` Yan Zhao
2025-11-18 9:44 ` Huang, Kai
0 siblings, 1 reply; 129+ messages in thread
From: Yan Zhao @ 2025-11-18 2:20 UTC (permalink / raw)
To: Huang, Kai
Cc: Du, Fan, Li, Xiaoyao, kvm@vger.kernel.org, Hansen, Dave,
david@redhat.com, thomas.lendacky@amd.com, tabba@google.com,
vbabka@suse.cz, michael.roth@amd.com,
linux-kernel@vger.kernel.org, seanjc@google.com,
pbonzini@redhat.com, binbin.wu@linux.intel.com,
ackerleytng@google.com, kas@kernel.org, Weiny, Ira, Peng, Chao P,
Yamahata, Isaku, Annapurve, Vishal, Edgecombe, Rick P, Miao, Jun,
x86@kernel.org, pgonda@google.com
On Tue, Nov 18, 2025 at 09:04:20AM +0800, Huang, Kai wrote:
> On Fri, 2025-11-14 at 15:22 +0800, Yan Zhao wrote:
> > > Will 'level == PG_LEVEL_4K' in this case? Or will this function return
> > > early right after check the eeq_type?
> > The function will return early right after check the eeq_type.
>
> But for such case the fault handler will still return 2M and KVM will AUG 2M
> page? Then if guest accepts 4K page, a new exit to KVM would happen?
>
> But this time KVM is able to find the info that guest is accepting 4K and KVM
> will split the 2M to 4K pages so we are good to go?
If guest accesses a private memory without first accepting it (like non-Linux
guests), the sequence is:
1. Guest accesses a private memory.
2. KVM finds it can map the GFN at 2MB. So, AUG 2MB pages.
3. Guest accepts the GFN at 4KB.
4. KVM receives a EPT violation with eeq_type of ACCEPT and level 4KB
5. KVM splits the 2MB mapping.
6. Guest accepts successfully and accesses the page.
If guest first accepts a private memory before accessing (like Linux guests),
the sequence is:
1. Guest accepts a private memory at 4KB.
2. KVM receives a EPT violation with eeq_type of ACCEPT and level 4KB.
3. KVM AUG 4KB.
4. Guest accepts successfully and accesses the page.
^ permalink raw reply [flat|nested] 129+ messages in thread
* Re: [RFC PATCH v2 14/23] KVM: TDX: Split and inhibit huge mappings if a VMExit carries level info
2025-11-18 2:20 ` Yan Zhao
@ 2025-11-18 9:44 ` Huang, Kai
2025-11-19 2:58 ` Yan Zhao
0 siblings, 1 reply; 129+ messages in thread
From: Huang, Kai @ 2025-11-18 9:44 UTC (permalink / raw)
To: Zhao, Yan Y
Cc: Du, Fan, Li, Xiaoyao, kvm@vger.kernel.org, Hansen, Dave,
david@redhat.com, thomas.lendacky@amd.com, vbabka@suse.cz,
tabba@google.com, linux-kernel@vger.kernel.org, seanjc@google.com,
binbin.wu@linux.intel.com, kas@kernel.org, pbonzini@redhat.com,
ackerleytng@google.com, michael.roth@amd.com, Weiny, Ira,
Peng, Chao P, Yamahata, Isaku, Annapurve, Vishal,
Edgecombe, Rick P, Miao, Jun, x86@kernel.org, pgonda@google.com
On Tue, 2025-11-18 at 10:20 +0800, Yan Zhao wrote:
> On Tue, Nov 18, 2025 at 09:04:20AM +0800, Huang, Kai wrote:
> > On Fri, 2025-11-14 at 15:22 +0800, Yan Zhao wrote:
> > > > Will 'level == PG_LEVEL_4K' in this case? Or will this function return
> > > > early right after check the eeq_type?
> > > The function will return early right after check the eeq_type.
> >
> > But for such case the fault handler will still return 2M and KVM will AUG 2M
> > page? Then if guest accepts 4K page, a new exit to KVM would happen?
> >
> > But this time KVM is able to find the info that guest is accepting 4K and KVM
> > will split the 2M to 4K pages so we are good to go?
>
> If guest accesses a private memory without first accepting it (like non-Linux
> guests), the sequence is:
> 1. Guest accesses a private memory.
> 2. KVM finds it can map the GFN at 2MB. So, AUG 2MB pages.
> 3. Guest accepts the GFN at 4KB.
> 4. KVM receives a EPT violation with eeq_type of ACCEPT and level 4KB
> 5. KVM splits the 2MB mapping.
> 6. Guest accepts successfully and accesses the page.
Yeah looks good.
Btw, the change to make KVM AUG 2M when no accept level is specified is done in
patch 23. I think you can add some text to explain in that patch?
E.g., something like:
Always try to AUG 2M hugepage, even there's no accept level from the guest.
If the guest later accepts at 4K page, the TDX module will exit to KVM with
the actual accept level info and KVM will split to 4K pages. The guest then
will be able to accept the 4K pages successfully.
^ permalink raw reply [flat|nested] 129+ messages in thread
* Re: [RFC PATCH v2 14/23] KVM: TDX: Split and inhibit huge mappings if a VMExit carries level info
2025-11-18 9:44 ` Huang, Kai
@ 2025-11-19 2:58 ` Yan Zhao
0 siblings, 0 replies; 129+ messages in thread
From: Yan Zhao @ 2025-11-19 2:58 UTC (permalink / raw)
To: Huang, Kai
Cc: Du, Fan, Li, Xiaoyao, kvm@vger.kernel.org, Hansen, Dave,
david@redhat.com, thomas.lendacky@amd.com, vbabka@suse.cz,
tabba@google.com, linux-kernel@vger.kernel.org, seanjc@google.com,
binbin.wu@linux.intel.com, kas@kernel.org, pbonzini@redhat.com,
ackerleytng@google.com, michael.roth@amd.com, Weiny, Ira,
Peng, Chao P, Yamahata, Isaku, Annapurve, Vishal,
Edgecombe, Rick P, Miao, Jun, x86@kernel.org, pgonda@google.com
On Tue, Nov 18, 2025 at 05:44:25PM +0800, Huang, Kai wrote:
> On Tue, 2025-11-18 at 10:20 +0800, Yan Zhao wrote:
> > On Tue, Nov 18, 2025 at 09:04:20AM +0800, Huang, Kai wrote:
> > > On Fri, 2025-11-14 at 15:22 +0800, Yan Zhao wrote:
> > > > > Will 'level == PG_LEVEL_4K' in this case? Or will this function return
> > > > > early right after check the eeq_type?
> > > > The function will return early right after check the eeq_type.
> > >
> > > But for such case the fault handler will still return 2M and KVM will AUG 2M
> > > page? Then if guest accepts 4K page, a new exit to KVM would happen?
> > >
> > > But this time KVM is able to find the info that guest is accepting 4K and KVM
> > > will split the 2M to 4K pages so we are good to go?
> >
> > If guest accesses a private memory without first accepting it (like non-Linux
> > guests), the sequence is:
> > 1. Guest accesses a private memory.
> > 2. KVM finds it can map the GFN at 2MB. So, AUG 2MB pages.
> > 3. Guest accepts the GFN at 4KB.
> > 4. KVM receives a EPT violation with eeq_type of ACCEPT and level 4KB
> > 5. KVM splits the 2MB mapping.
> > 6. Guest accepts successfully and accesses the page.
>
> Yeah looks good.
>
> Btw, the change to make KVM AUG 2M when no accept level is specified is done in
> patch 23. I think you can add some text to explain in that patch?
>
> E.g., something like:
>
> Always try to AUG a 2M hugepage, even when there's no accept level from the
> guest. If the guest later accepts at 4K, the TDX module will exit to KVM with
> the actual accept level info and KVM will split to 4K pages. The guest then
> will be able to accept the 4K pages successfully.
It's a good idea.
I think it's also better to mention in patch 23 that returning 2M in
tdx_gmem_private_max_mapping_level() doesn't mean TDX will AUG 2M.
So, maybe
Always try to let KVM map at the 2MB level, though KVM may still map the page
at 4KB (i.e., passing in PG_LEVEL_4K to AUG) due to
- the backend folio being 4KB,
- disallow_lpage restrictions:
  a) mixed private/shared pages in the 2MB range
  b) level alignment due to slot base_gfn, slot size, and ugfn
  c) the guest_inhibit bit being set due to the guest accept level
When there's an accept level of 4KB from the guest, KVM will AUG the page at
4KB directly because the guest_inhibit bit is set, so the guest is able to
accept at 4KB successfully.
When there's no accept level from the guest, and there are no other
restrictions on the GFN range of a huge folio, KVM will AUG the page at 2MB
first.
If the guest later accepts at 4K, the TDX module will exit to KVM with the
actual accept level info and KVM will split to 4K pages and set the
guest_inhibit bit. The guest then will be able to accept the 4K pages
successfully.
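For reference, a minimal sketch of what such a hook could look like, purely
illustrative since patch 23 isn't quoted in this thread and the exact name and
signature here are assumptions based on the discussion above:

/*
 * Illustrative sketch only, not the actual patch 23 code. Returning
 * PG_LEVEL_2M here is just an upper bound: the MMU core may still map
 * at 4KB because of the backing folio size or the disallow_lpage /
 * guest_inhibit restrictions listed above.
 */
static int tdx_gmem_private_max_mapping_level(struct kvm *kvm, kvm_pfn_t pfn)
{
        return PG_LEVEL_2M;
}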
^ permalink raw reply [flat|nested] 129+ messages in thread
* Re: [RFC PATCH v2 14/23] KVM: TDX: Split and inhibit huge mappings if a VMExit carries level info
2025-08-07 9:44 ` [RFC PATCH v2 14/23] KVM: TDX: Split and inhibit huge mappings if a VMExit carries level info Yan Zhao
` (2 preceding siblings ...)
2025-11-11 11:05 ` Huang, Kai
@ 2025-11-19 5:51 ` Binbin Wu
2025-11-19 6:29 ` Yan Zhao
3 siblings, 1 reply; 129+ messages in thread
From: Binbin Wu @ 2025-11-19 5:51 UTC (permalink / raw)
To: Yan Zhao
Cc: pbonzini, seanjc, linux-kernel, kvm, x86, rick.p.edgecombe,
dave.hansen, kas, tabba, ackerleytng, quic_eberman, michael.roth,
david, vannapurve, vbabka, thomas.lendacky, pgonda, zhiquan1.li,
fan.du, jun.miao, ira.weiny, isaku.yamahata, xiaoyao.li,
chao.p.peng
On 8/7/2025 5:44 PM, Yan Zhao wrote:
> TDX requires guests to accept S-EPT mappings created by the host KVM. Due
> to the current implementation of the TDX module, if a guest accepts a GFN
> at a lower level after KVM maps it at a higher level, the TDX module will
> emulate an EPT violation VMExit to KVM instead of returning a size mismatch
> error to the guest. If KVM fails to perform page splitting in the VMExit
> handler, the guest's accept operation will be triggered again upon
> re-entering the guest, causing a repeated EPT violation VMExit.
>
> The TDX module thus enables the EPT violation VMExit to carry the guest's
> accept level when the VMExit is caused by the guest's accept operation.
>
> Therefore, in TDX's EPT violation handler
> (1) Set the guest inhibit bit in the lpage info to prevent KVM MMU core
> from mapping at a higher a level than the guest's accept level.
^
an extra 'a'
>
> (2) Split any existing huge mapping at the fault GFN to avoid unsupported
> splitting under the shared mmu_lock by TDX.
>
> Use write mmu_lock to pretect (1) and (2) for now. If future KVM TDX can
pretect -> protect
> perform the actual splitting under shared mmu_lock with enhanced TDX
> modules, (1) is possible to be called under shared mmu_lock, and (2) would
> become unnecessary.
>
> As an optimization, this patch calls hugepage_test_guest_inhibit() without
> holding the mmu_lock to reduce the frequency of acquiring the write
> mmu_lock. The write mmu_lock is thus only acquired if the guest inhibit bit
> is not already set. This is safe because the guest inhibit bit is set in a
> one-way manner while the splitting under the write mmu_lock is performed
> before setting the guest inhibit bit.
>
> Link: https://lore.kernel.org/all/a6ffe23fb97e64109f512fa43e9f6405236ed40a.camel@intel.com
> Suggested-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> Suggested-by: Sean Christopherson <seanjc@google.com>
> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
> ---
> RFC v2
> - Change tdx_get_accept_level() to tdx_check_accept_level().
> - Invoke kvm_split_cross_boundary_leafs() and hugepage_set_guest_inhibit()
> to change KVM mapping level in a global way according to guest accept
> level. (Rick, Sean).
>
> RFC v1:
> - Introduce tdx_get_accept_level() to get guest accept level.
> - Use tdx->violation_request_level and tdx->violation_gfn* to pass guest
> accept level to tdx_gmem_private_max_mapping_level() to detemine KVM
> mapping level.
> ---
> arch/x86/kvm/vmx/tdx.c | 50 +++++++++++++++++++++++++++++++++++++
> arch/x86/kvm/vmx/tdx_arch.h | 3 +++
> 2 files changed, 53 insertions(+)
>
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index 035d81275be4..71115058e5e6 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -2019,6 +2019,53 @@ static inline bool tdx_is_sept_violation_unexpected_pending(struct kvm_vcpu *vcp
> return !(eq & EPT_VIOLATION_PROT_MASK) && !(eq & EPT_VIOLATION_EXEC_FOR_RING3_LIN);
> }
>
> +static inline int tdx_check_accept_level(struct kvm_vcpu *vcpu, gfn_t gfn)
The function name sounds like it is just doing a check, but it may split a
hugepage on mismatch.
How about tdx_enforce_accept_level_mapping() or something else to reflect
the change that could be made?
> +{
> + struct kvm_memory_slot *slot = gfn_to_memslot(vcpu->kvm, gfn);
> + struct vcpu_tdx *tdx = to_tdx(vcpu);
> + struct kvm *kvm = vcpu->kvm;
> + u64 eeq_type, eeq_info;
> + int level = -1;
> +
> + if (!slot)
> + return 0;
> +
> + eeq_type = tdx->ext_exit_qualification & TDX_EXT_EXIT_QUAL_TYPE_MASK;
> + if (eeq_type != TDX_EXT_EXIT_QUAL_TYPE_ACCEPT)
> + return 0;
> +
> + eeq_info = (tdx->ext_exit_qualification & TDX_EXT_EXIT_QUAL_INFO_MASK) >>
> + TDX_EXT_EXIT_QUAL_INFO_SHIFT;
> +
> + level = (eeq_info & GENMASK(2, 0)) + 1;
> +
> + if (level == PG_LEVEL_4K || level == PG_LEVEL_2M) {
> + if (!hugepage_test_guest_inhibit(slot, gfn, level + 1)) {
> + gfn_t base_gfn = gfn_round_for_level(gfn, level);
> + struct kvm_gfn_range gfn_range = {
> + .start = base_gfn,
> + .end = base_gfn + KVM_PAGES_PER_HPAGE(level),
> + .slot = slot,
> + .may_block = true,
> + .attr_filter = KVM_FILTER_PRIVATE,
> + };
> +
> + scoped_guard(write_lock, &kvm->mmu_lock) {
> + int ret;
> +
> + ret = kvm_split_cross_boundary_leafs(kvm, &gfn_range, false);
> + if (ret)
> + return ret;
> +
> + hugepage_set_guest_inhibit(slot, gfn, level + 1);
> + if (level == PG_LEVEL_4K)
> + hugepage_set_guest_inhibit(slot, gfn, level + 2);
> + }
> + }
> + }
> + return 0;
> +}
> +
> static int tdx_handle_ept_violation(struct kvm_vcpu *vcpu)
> {
> unsigned long exit_qual;
> @@ -2044,6 +2091,9 @@ static int tdx_handle_ept_violation(struct kvm_vcpu *vcpu)
> */
> exit_qual = EPT_VIOLATION_ACC_WRITE;
>
> + if (tdx_check_accept_level(vcpu, gpa_to_gfn(gpa)))
> + return RET_PF_RETRY;
> +
> /* Only private GPA triggers zero-step mitigation */
> local_retry = true;
> } else {
> diff --git a/arch/x86/kvm/vmx/tdx_arch.h b/arch/x86/kvm/vmx/tdx_arch.h
> index a30e880849e3..af006a73ee05 100644
> --- a/arch/x86/kvm/vmx/tdx_arch.h
> +++ b/arch/x86/kvm/vmx/tdx_arch.h
> @@ -82,7 +82,10 @@ struct tdx_cpuid_value {
> #define TDX_TD_ATTR_PERFMON BIT_ULL(63)
>
> #define TDX_EXT_EXIT_QUAL_TYPE_MASK GENMASK(3, 0)
> +#define TDX_EXT_EXIT_QUAL_TYPE_ACCEPT 1
> #define TDX_EXT_EXIT_QUAL_TYPE_PENDING_EPT_VIOLATION 6
> +#define TDX_EXT_EXIT_QUAL_INFO_MASK GENMASK(63, 32)
> +#define TDX_EXT_EXIT_QUAL_INFO_SHIFT 32
> /*
> * TD_PARAMS is provided as an input to TDH_MNG_INIT, the size of which is 1024B.
> */
^ permalink raw reply [flat|nested] 129+ messages in thread
* Re: [RFC PATCH v2 14/23] KVM: TDX: Split and inhibit huge mappings if a VMExit carries level info
2025-11-19 5:51 ` Binbin Wu
@ 2025-11-19 6:29 ` Yan Zhao
2025-11-19 6:39 ` Binbin Wu
0 siblings, 1 reply; 129+ messages in thread
From: Yan Zhao @ 2025-11-19 6:29 UTC (permalink / raw)
To: Binbin Wu
Cc: pbonzini, seanjc, linux-kernel, kvm, x86, rick.p.edgecombe,
dave.hansen, kas, tabba, ackerleytng, quic_eberman, michael.roth,
david, vannapurve, vbabka, thomas.lendacky, pgonda, zhiquan1.li,
fan.du, jun.miao, ira.weiny, isaku.yamahata, xiaoyao.li,
chao.p.peng
On Wed, Nov 19, 2025 at 01:51:26PM +0800, Binbin Wu wrote:
>
>
> On 8/7/2025 5:44 PM, Yan Zhao wrote:
> > TDX requires guests to accept S-EPT mappings created by the host KVM. Due
> > to the current implementation of the TDX module, if a guest accepts a GFN
> > at a lower level after KVM maps it at a higher level, the TDX module will
> > emulate an EPT violation VMExit to KVM instead of returning a size mismatch
> > error to the guest. If KVM fails to perform page splitting in the VMExit
> > handler, the guest's accept operation will be triggered again upon
> > re-entering the guest, causing a repeated EPT violation VMExit.
> >
> > The TDX module thus enables the EPT violation VMExit to carry the guest's
> > accept level when the VMExit is caused by the guest's accept operation.
> >
> > Therefore, in TDX's EPT violation handler
> > (1) Set the guest inhibit bit in the lpage info to prevent KVM MMU core
> > from mapping at a higher a level than the guest's accept level.
> ^
> an extra 'a'
Thanks.
> >
> > (2) Split any existing huge mapping at the fault GFN to avoid unsupported
> > splitting under the shared mmu_lock by TDX.
> >
> > Use write mmu_lock to pretect (1) and (2) for now. If future KVM TDX can
>
> pretect -> protect
Thanks.
> > perform the actual splitting under shared mmu_lock with enhanced TDX
> > modules, (1) is possible to be called under shared mmu_lock, and (2) would
> > become unnecessary.
> >
> > As an optimization, this patch calls hugepage_test_guest_inhibit() without
> > holding the mmu_lock to reduce the frequency of acquiring the write
> > mmu_lock. The write mmu_lock is thus only acquired if the guest inhibit bit
> > is not already set. This is safe because the guest inhibit bit is set in a
> > one-way manner while the splitting under the write mmu_lock is performed
> > before setting the guest inhibit bit.
> >
> > Link: https://lore.kernel.org/all/a6ffe23fb97e64109f512fa43e9f6405236ed40a.camel@intel.com
> > Suggested-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> > Suggested-by: Sean Christopherson <seanjc@google.com>
> > Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
> > ---
> > RFC v2
> > - Change tdx_get_accept_level() to tdx_check_accept_level().
> > - Invoke kvm_split_cross_boundary_leafs() and hugepage_set_guest_inhibit()
> > to change KVM mapping level in a global way according to guest accept
> > level. (Rick, Sean).
> >
> > RFC v1:
> > - Introduce tdx_get_accept_level() to get guest accept level.
> > - Use tdx->violation_request_level and tdx->violation_gfn* to pass guest
> > accept level to tdx_gmem_private_max_mapping_level() to detemine KVM
> > mapping level.
> > ---
> > arch/x86/kvm/vmx/tdx.c | 50 +++++++++++++++++++++++++++++++++++++
> > arch/x86/kvm/vmx/tdx_arch.h | 3 +++
> > 2 files changed, 53 insertions(+)
> >
> > diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> > index 035d81275be4..71115058e5e6 100644
> > --- a/arch/x86/kvm/vmx/tdx.c
> > +++ b/arch/x86/kvm/vmx/tdx.c
> > @@ -2019,6 +2019,53 @@ static inline bool tdx_is_sept_violation_unexpected_pending(struct kvm_vcpu *vcp
> > return !(eq & EPT_VIOLATION_PROT_MASK) && !(eq & EPT_VIOLATION_EXEC_FOR_RING3_LIN);
> > }
> > +static inline int tdx_check_accept_level(struct kvm_vcpu *vcpu, gfn_t gfn)
>
> The function name sounds like it is just doing a check, but it may split a
> hugepage on mismatch.
>
> How about tdx_enforce_accept_level_mapping() or something else to reflect
> the change that could be made?
What about tdx_honor_guest_accept_level()?
> > +{
> > + struct kvm_memory_slot *slot = gfn_to_memslot(vcpu->kvm, gfn);
> > + struct vcpu_tdx *tdx = to_tdx(vcpu);
> > + struct kvm *kvm = vcpu->kvm;
> > + u64 eeq_type, eeq_info;
> > + int level = -1;
> > +
> > + if (!slot)
> > + return 0;
> > +
> > + eeq_type = tdx->ext_exit_qualification & TDX_EXT_EXIT_QUAL_TYPE_MASK;
> > + if (eeq_type != TDX_EXT_EXIT_QUAL_TYPE_ACCEPT)
> > + return 0;
> > +
> > + eeq_info = (tdx->ext_exit_qualification & TDX_EXT_EXIT_QUAL_INFO_MASK) >>
> > + TDX_EXT_EXIT_QUAL_INFO_SHIFT;
> > +
> > + level = (eeq_info & GENMASK(2, 0)) + 1;
> > +
> > + if (level == PG_LEVEL_4K || level == PG_LEVEL_2M) {
> > + if (!hugepage_test_guest_inhibit(slot, gfn, level + 1)) {
> > + gfn_t base_gfn = gfn_round_for_level(gfn, level);
> > + struct kvm_gfn_range gfn_range = {
> > + .start = base_gfn,
> > + .end = base_gfn + KVM_PAGES_PER_HPAGE(level),
> > + .slot = slot,
> > + .may_block = true,
> > + .attr_filter = KVM_FILTER_PRIVATE,
> > + };
> > +
> > + scoped_guard(write_lock, &kvm->mmu_lock) {
> > + int ret;
> > +
> > + ret = kvm_split_cross_boundary_leafs(kvm, &gfn_range, false);
> > + if (ret)
> > + return ret;
> > +
> > + hugepage_set_guest_inhibit(slot, gfn, level + 1);
> > + if (level == PG_LEVEL_4K)
> > + hugepage_set_guest_inhibit(slot, gfn, level + 2);
> > + }
> > + }
> > + }
> > + return 0;
> > +}
> > +
> > static int tdx_handle_ept_violation(struct kvm_vcpu *vcpu)
> > {
> > unsigned long exit_qual;
> > @@ -2044,6 +2091,9 @@ static int tdx_handle_ept_violation(struct kvm_vcpu *vcpu)
> > */
> > exit_qual = EPT_VIOLATION_ACC_WRITE;
> > + if (tdx_check_accept_level(vcpu, gpa_to_gfn(gpa)))
> > + return RET_PF_RETRY;
> > +
> > /* Only private GPA triggers zero-step mitigation */
> > local_retry = true;
> > } else {
> > diff --git a/arch/x86/kvm/vmx/tdx_arch.h b/arch/x86/kvm/vmx/tdx_arch.h
> > index a30e880849e3..af006a73ee05 100644
> > --- a/arch/x86/kvm/vmx/tdx_arch.h
> > +++ b/arch/x86/kvm/vmx/tdx_arch.h
> > @@ -82,7 +82,10 @@ struct tdx_cpuid_value {
> > #define TDX_TD_ATTR_PERFMON BIT_ULL(63)
> > #define TDX_EXT_EXIT_QUAL_TYPE_MASK GENMASK(3, 0)
> > +#define TDX_EXT_EXIT_QUAL_TYPE_ACCEPT 1
> > #define TDX_EXT_EXIT_QUAL_TYPE_PENDING_EPT_VIOLATION 6
> > +#define TDX_EXT_EXIT_QUAL_INFO_MASK GENMASK(63, 32)
> > +#define TDX_EXT_EXIT_QUAL_INFO_SHIFT 32
> > /*
> > * TD_PARAMS is provided as an input to TDH_MNG_INIT, the size of which is 1024B.
> > */
>
^ permalink raw reply [flat|nested] 129+ messages in thread
* Re: [RFC PATCH v2 14/23] KVM: TDX: Split and inhibit huge mappings if a VMExit carries level info
2025-11-19 6:29 ` Yan Zhao
@ 2025-11-19 6:39 ` Binbin Wu
0 siblings, 0 replies; 129+ messages in thread
From: Binbin Wu @ 2025-11-19 6:39 UTC (permalink / raw)
To: Yan Zhao
Cc: pbonzini, seanjc, linux-kernel, kvm, x86, rick.p.edgecombe,
dave.hansen, kas, tabba, ackerleytng, michael.roth, david,
vannapurve, vbabka, thomas.lendacky, pgonda, fan.du, jun.miao,
ira.weiny, isaku.yamahata, xiaoyao.li, chao.p.peng
On 11/19/2025 2:29 PM, Yan Zhao wrote:
[...]
>>> +static inline int tdx_check_accept_level(struct kvm_vcpu *vcpu, gfn_t gfn)
>> The function name sounds like it is just doing a check, but it may split a
>> hugepage on mismatch.
>>
>> How about tdx_enforce_accept_level_mapping() or something else to reflect
>> the change that could be made?
> What about tdx_honor_guest_accept_level()?
It looks good to me.
^ permalink raw reply [flat|nested] 129+ messages in thread
* [RFC PATCH v2 15/23] KVM: Change the return type of gfn_handler_t() from bool to int
2025-08-07 9:39 [RFC PATCH v2 00/23] KVM: TDX huge page support for private memory Yan Zhao
` (13 preceding siblings ...)
2025-08-07 9:44 ` [RFC PATCH v2 14/23] KVM: TDX: Split and inhibit huge mappings if a VMExit carries level info Yan Zhao
@ 2025-08-07 9:44 ` Yan Zhao
2025-08-07 9:44 ` [RFC PATCH v2 16/23] KVM: x86: Split cross-boundary mirror leafs for KVM_SET_MEMORY_ATTRIBUTES Yan Zhao
` (7 subsequent siblings)
22 siblings, 0 replies; 129+ messages in thread
From: Yan Zhao @ 2025-08-07 9:44 UTC (permalink / raw)
To: pbonzini, seanjc
Cc: linux-kernel, kvm, x86, rick.p.edgecombe, dave.hansen, kas, tabba,
ackerleytng, quic_eberman, michael.roth, david, vannapurve,
vbabka, thomas.lendacky, pgonda, zhiquan1.li, fan.du, jun.miao,
ira.weiny, isaku.yamahata, xiaoyao.li, binbin.wu, chao.p.peng,
yan.y.zhao
Modify the return type of gfn_handler_t() from bool to int. A negative
return value indicates failure, while a return value of 1 signifies success
with a flush required, and 0 denotes success without a flush required.
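For illustration, a handler following the new convention could look like the
hypothetical sketch below (kvm_example_handler is made up for this example):

/* Hypothetical example of the new gfn_handler_t return convention. */
static int kvm_example_handler(struct kvm *kvm, struct kvm_gfn_range *range)
{
        if (!range->slot)
                return -EINVAL;         /* failure */

        if (kvm_unmap_gfn_range(kvm, range))
                return 1;               /* success, TLB flush required */

        return 0;                       /* success, no flush needed */
}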
This adjustment prepares for a later change that will enable
kvm_pre_set_memory_attributes() to fail.
No functional changes expected.
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
---
RFC v2:
- No change
RFC v1:
- New patch.
---
arch/arm64/kvm/mmu.c | 8 ++++----
arch/loongarch/kvm/mmu.c | 8 ++++----
arch/mips/kvm/mmu.c | 6 +++---
arch/powerpc/kvm/book3s.c | 4 ++--
arch/powerpc/kvm/e500_mmu_host.c | 8 ++++----
arch/riscv/kvm/mmu.c | 12 ++++++------
arch/x86/kvm/mmu/mmu.c | 20 ++++++++++----------
include/linux/kvm_host.h | 12 ++++++------
virt/kvm/kvm_main.c | 24 ++++++++++++++++--------
9 files changed, 55 insertions(+), 47 deletions(-)
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 8b225450a4eb..991a6df0ca21 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -1999,12 +1999,12 @@ bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
return false;
}
-bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
+int kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
{
u64 size = (range->end - range->start) << PAGE_SHIFT;
if (!kvm->arch.mmu.pgt)
- return false;
+ return 0;
return KVM_PGT_FN(kvm_pgtable_stage2_test_clear_young)(kvm->arch.mmu.pgt,
range->start << PAGE_SHIFT,
@@ -2015,12 +2015,12 @@ bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
*/
}
-bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
+int kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
{
u64 size = (range->end - range->start) << PAGE_SHIFT;
if (!kvm->arch.mmu.pgt)
- return false;
+ return 0;
return KVM_PGT_FN(kvm_pgtable_stage2_test_clear_young)(kvm->arch.mmu.pgt,
range->start << PAGE_SHIFT,
diff --git a/arch/loongarch/kvm/mmu.c b/arch/loongarch/kvm/mmu.c
index ed956c5cf2cc..0542516c98eb 100644
--- a/arch/loongarch/kvm/mmu.c
+++ b/arch/loongarch/kvm/mmu.c
@@ -511,7 +511,7 @@ bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
range->end << PAGE_SHIFT, &ctx);
}
-bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
+int kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
{
kvm_ptw_ctx ctx;
@@ -523,15 +523,15 @@ bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
range->end << PAGE_SHIFT, &ctx);
}
-bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
+int kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
{
gpa_t gpa = range->start << PAGE_SHIFT;
kvm_pte_t *ptep = kvm_populate_gpa(kvm, NULL, gpa, 0);
if (ptep && kvm_pte_present(NULL, ptep) && kvm_pte_young(*ptep))
- return true;
+ return 1;
- return false;
+ return 0;
}
/*
diff --git a/arch/mips/kvm/mmu.c b/arch/mips/kvm/mmu.c
index d2c3b6b41f18..c26cc89c8e98 100644
--- a/arch/mips/kvm/mmu.c
+++ b/arch/mips/kvm/mmu.c
@@ -444,18 +444,18 @@ bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
return true;
}
-bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
+int kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
{
return kvm_mips_mkold_gpa_pt(kvm, range->start, range->end);
}
-bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
+int kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
{
gpa_t gpa = range->start << PAGE_SHIFT;
pte_t *gpa_pte = kvm_mips_pte_for_gpa(kvm, NULL, gpa);
if (!gpa_pte)
- return false;
+ return 0;
return pte_young(*gpa_pte);
}
diff --git a/arch/powerpc/kvm/book3s.c b/arch/powerpc/kvm/book3s.c
index d79c5d1098c0..9bf6e1cf64f1 100644
--- a/arch/powerpc/kvm/book3s.c
+++ b/arch/powerpc/kvm/book3s.c
@@ -886,12 +886,12 @@ bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
return kvm->arch.kvm_ops->unmap_gfn_range(kvm, range);
}
-bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
+int kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
{
return kvm->arch.kvm_ops->age_gfn(kvm, range);
}
-bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
+int kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
{
return kvm->arch.kvm_ops->test_age_gfn(kvm, range);
}
diff --git a/arch/powerpc/kvm/e500_mmu_host.c b/arch/powerpc/kvm/e500_mmu_host.c
index 06caf8bbbe2b..dd5411ee242e 100644
--- a/arch/powerpc/kvm/e500_mmu_host.c
+++ b/arch/powerpc/kvm/e500_mmu_host.c
@@ -697,16 +697,16 @@ bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
return kvm_e500_mmu_unmap_gfn(kvm, range);
}
-bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
+int kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
{
/* XXX could be more clever ;) */
- return false;
+ return 0;
}
-bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
+int kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
{
/* XXX could be more clever ;) */
- return false;
+ return 0;
}
/*****************************************/
diff --git a/arch/riscv/kvm/mmu.c b/arch/riscv/kvm/mmu.c
index 1087ea74567b..98c2fcd9229f 100644
--- a/arch/riscv/kvm/mmu.c
+++ b/arch/riscv/kvm/mmu.c
@@ -550,38 +550,38 @@ bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
return false;
}
-bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
+int kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
{
pte_t *ptep;
u32 ptep_level = 0;
u64 size = (range->end - range->start) << PAGE_SHIFT;
if (!kvm->arch.pgd)
- return false;
+ return 0;
WARN_ON(size != PAGE_SIZE && size != PMD_SIZE && size != PUD_SIZE);
if (!gstage_get_leaf_entry(kvm, range->start << PAGE_SHIFT,
&ptep, &ptep_level))
- return false;
+ return 0;
return ptep_test_and_clear_young(NULL, 0, ptep);
}
-bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
+int kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
{
pte_t *ptep;
u32 ptep_level = 0;
u64 size = (range->end - range->start) << PAGE_SHIFT;
if (!kvm->arch.pgd)
- return false;
+ return 0;
WARN_ON(size != PAGE_SIZE && size != PMD_SIZE && size != PUD_SIZE);
if (!gstage_get_leaf_entry(kvm, range->start << PAGE_SHIFT,
&ptep, &ptep_level))
- return false;
+ return 0;
return pte_young(ptep_get(ptep));
}
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 1c639286aac2..c71f8bb0b903 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -1806,7 +1806,7 @@ static bool kvm_may_have_shadow_mmu_sptes(struct kvm *kvm)
return !tdp_mmu_enabled || READ_ONCE(kvm->arch.indirect_shadow_pages);
}
-bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
+int kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
{
bool young = false;
@@ -1819,7 +1819,7 @@ bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
return young;
}
-bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
+int kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
{
bool young = false;
@@ -7841,8 +7841,8 @@ static void hugepage_set_mixed(struct kvm_memory_slot *slot, gfn_t gfn,
lpage_info_slot(gfn, slot, level)->disallow_lpage |= KVM_LPAGE_MIXED_FLAG;
}
-bool kvm_arch_pre_set_memory_attributes(struct kvm *kvm,
- struct kvm_gfn_range *range)
+int kvm_arch_pre_set_memory_attributes(struct kvm *kvm,
+ struct kvm_gfn_range *range)
{
struct kvm_memory_slot *slot = range->slot;
int level;
@@ -7859,10 +7859,10 @@ bool kvm_arch_pre_set_memory_attributes(struct kvm *kvm,
* a hugepage can be used for affected ranges.
*/
if (WARN_ON_ONCE(!kvm_arch_supports_gmem(kvm)))
- return false;
+ return 0;
if (WARN_ON_ONCE(range->end <= range->start))
- return false;
+ return 0;
/*
* If the head and tail pages of the range currently allow a hugepage,
@@ -7921,8 +7921,8 @@ static bool hugepage_has_attrs(struct kvm *kvm, struct kvm_memory_slot *slot,
return true;
}
-bool kvm_arch_post_set_memory_attributes(struct kvm *kvm,
- struct kvm_gfn_range *range)
+int kvm_arch_post_set_memory_attributes(struct kvm *kvm,
+ struct kvm_gfn_range *range)
{
unsigned long attrs = range->arg.attributes;
struct kvm_memory_slot *slot = range->slot;
@@ -7938,7 +7938,7 @@ bool kvm_arch_post_set_memory_attributes(struct kvm *kvm,
* SHARED may now allow hugepages.
*/
if (WARN_ON_ONCE(!kvm_arch_supports_gmem(kvm)))
- return false;
+ return 0;
/*
* The sequence matters here: upper levels consume the result of lower
@@ -7985,7 +7985,7 @@ bool kvm_arch_post_set_memory_attributes(struct kvm *kvm,
hugepage_set_mixed(slot, gfn, level);
}
}
- return false;
+ return 0;
}
void kvm_mmu_init_memslot_memory_attributes(struct kvm *kvm,
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 6137b76341e1..d03e4a70a6db 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -271,8 +271,8 @@ struct kvm_gfn_range {
bool lockless;
};
bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range);
-bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
-bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
+int kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
+int kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
int kvm_split_cross_boundary_leafs(struct kvm *kvm, struct kvm_gfn_range *range,
bool shared);
#endif
@@ -1537,7 +1537,7 @@ void *kvm_mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc);
void kvm_mmu_invalidate_begin(struct kvm *kvm);
void kvm_mmu_invalidate_range_add(struct kvm *kvm, gfn_t start, gfn_t end);
void kvm_mmu_invalidate_end(struct kvm *kvm);
-bool kvm_mmu_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range);
+int kvm_mmu_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range);
long kvm_arch_dev_ioctl(struct file *filp,
unsigned int ioctl, unsigned long arg);
@@ -2524,10 +2524,10 @@ static inline unsigned long kvm_get_memory_attributes(struct kvm *kvm, gfn_t gfn
bool kvm_range_has_memory_attributes(struct kvm *kvm, gfn_t start, gfn_t end,
unsigned long mask, unsigned long attrs);
-bool kvm_arch_pre_set_memory_attributes(struct kvm *kvm,
+int kvm_arch_pre_set_memory_attributes(struct kvm *kvm,
+ struct kvm_gfn_range *range);
+int kvm_arch_post_set_memory_attributes(struct kvm *kvm,
struct kvm_gfn_range *range);
-bool kvm_arch_post_set_memory_attributes(struct kvm *kvm,
- struct kvm_gfn_range *range);
/*
* Returns true if the given gfn's private/shared status (in the CoCo sense) is
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index fe86f3f627ba..8f87d6c6be3f 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -508,7 +508,7 @@ static inline struct kvm *mmu_notifier_to_kvm(struct mmu_notifier *mn)
return container_of(mn, struct kvm, mmu_notifier);
}
-typedef bool (*gfn_handler_t)(struct kvm *kvm, struct kvm_gfn_range *range);
+typedef int (*gfn_handler_t)(struct kvm *kvm, struct kvm_gfn_range *range);
typedef void (*on_lock_fn_t)(struct kvm *kvm);
@@ -592,6 +592,7 @@ static __always_inline kvm_mn_ret_t kvm_handle_hva_range(struct kvm *kvm,
kvm_for_each_memslot_in_hva_range(node, slots,
range->start, range->end - 1) {
unsigned long hva_start, hva_end;
+ int ret;
slot = container_of(node, struct kvm_memory_slot, hva_node[slots->node_idx]);
hva_start = max_t(unsigned long, range->start, slot->userspace_addr);
@@ -632,7 +633,9 @@ static __always_inline kvm_mn_ret_t kvm_handle_hva_range(struct kvm *kvm,
goto mmu_unlock;
}
}
- r.ret |= range->handler(kvm, &gfn_range);
+ ret = range->handler(kvm, &gfn_range);
+ WARN_ON_ONCE(ret < 0);
+ r.ret |= ret;
}
}
@@ -718,7 +721,7 @@ void kvm_mmu_invalidate_range_add(struct kvm *kvm, gfn_t start, gfn_t end)
}
}
-bool kvm_mmu_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
+int kvm_mmu_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
{
kvm_mmu_invalidate_range_add(kvm, range->start, range->end);
return kvm_unmap_gfn_range(kvm, range);
@@ -2469,7 +2472,8 @@ static __always_inline void kvm_handle_gfn_range(struct kvm *kvm,
struct kvm_memslots *slots;
struct kvm_memslot_iter iter;
bool found_memslot = false;
- bool ret = false;
+ bool flush = false;
+ int ret = 0;
int i;
gfn_range.arg = range->arg;
@@ -2502,19 +2506,23 @@ static __always_inline void kvm_handle_gfn_range(struct kvm *kvm,
range->on_lock(kvm);
}
- ret |= range->handler(kvm, &gfn_range);
+ ret = range->handler(kvm, &gfn_range);
+ if (ret < 0)
+ goto err;
+ flush |= ret;
}
}
- if (range->flush_on_ret && ret)
+err:
+ if (range->flush_on_ret && flush)
kvm_flush_remote_tlbs(kvm);
if (found_memslot)
KVM_MMU_UNLOCK(kvm);
}
-static bool kvm_pre_set_memory_attributes(struct kvm *kvm,
- struct kvm_gfn_range *range)
+static int kvm_pre_set_memory_attributes(struct kvm *kvm,
+ struct kvm_gfn_range *range)
{
/*
* Unconditionally add the range to the invalidation set, regardless of
--
2.43.2
^ permalink raw reply related [flat|nested] 129+ messages in thread
* [RFC PATCH v2 16/23] KVM: x86: Split cross-boundary mirror leafs for KVM_SET_MEMORY_ATTRIBUTES
2025-08-07 9:39 [RFC PATCH v2 00/23] KVM: TDX huge page support for private memory Yan Zhao
` (14 preceding siblings ...)
2025-08-07 9:44 ` [RFC PATCH v2 15/23] KVM: Change the return type of gfn_handler_t() from bool to int Yan Zhao
@ 2025-08-07 9:44 ` Yan Zhao
2025-08-07 9:45 ` [RFC PATCH v2 17/23] KVM: guest_memfd: Split for punch hole and private-to-shared conversion Yan Zhao
` (6 subsequent siblings)
22 siblings, 0 replies; 129+ messages in thread
From: Yan Zhao @ 2025-08-07 9:44 UTC (permalink / raw)
To: pbonzini, seanjc
Cc: linux-kernel, kvm, x86, rick.p.edgecombe, dave.hansen, kas, tabba,
ackerleytng, quic_eberman, michael.roth, david, vannapurve,
vbabka, thomas.lendacky, pgonda, zhiquan1.li, fan.du, jun.miao,
ira.weiny, isaku.yamahata, xiaoyao.li, binbin.wu, chao.p.peng,
yan.y.zhao
In TDX, private page tables require precise zapping because faulting back
the zapped mappings necessitates the guest's re-acceptance. Therefore,
before performing a zap for the private-to-shared conversion, rather than
zapping a huge leaf entry that crosses the boundary of the GFN range to be
zapped, split the leaf entry to ensure GFNs outside the conversion range
are not affected.
Invoke kvm_split_cross_boundary_leafs() in
kvm_arch_pre_set_memory_attributes() to split the huge leafs that cross
GFN range boundary before calling kvm_unmap_gfn_range() to zap the GFN
range that will be converted to shared.
When kvm_split_cross_boundary_leafs() fails, it is expected to internally
invoke kvm_flush_remote_tlbs() to flush any changes that have been
successfully completed.
Unlike kvm_unmap_gfn_range(), which cannot fail,
kvm_split_cross_boundary_leafs() may fail due to memory allocation for
splitting. Update kvm_handle_gfn_range() to propagate the error back to
kvm_vm_set_mem_attributes(), which can then fail the ioctl
KVM_SET_MEMORY_ATTRIBUTES.
The downside of the current implementation is that, though
kvm_split_cross_boundary_leafs() is invoked before kvm_unmap_gfn_range()
for each GFN range, the entire conversion range may consist of several GFN
ranges. If an out-of-memory error occurs during the splitting of a GFN
range, some previous GFN ranges may have been successfully split and
zapped, even though their page attributes remain unchanged due to the
splitting failure.
If it's necessary, a patch can be arranged to divide a single invocation of
"kvm_handle_gfn_range(kvm, &pre_set_range)" into two, e.g.,
kvm_handle_gfn_range(kvm, &pre_set_range_prepare_and_split)
kvm_handle_gfn_range(kvm, &pre_set_range_unmap),
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
---
RFC v2:
- update kvm_split_boundary_leafs() to kvm_split_cross_boundary_leafs() and
invoke it only for private-to-shared conversion.
RFC v1:
- new patch.
---
arch/x86/kvm/mmu/mmu.c | 13 ++++++++++---
virt/kvm/kvm_main.c | 13 +++++++++----
2 files changed, 19 insertions(+), 7 deletions(-)
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index c71f8bb0b903..f23d8fc59323 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -7845,7 +7845,9 @@ int kvm_arch_pre_set_memory_attributes(struct kvm *kvm,
struct kvm_gfn_range *range)
{
struct kvm_memory_slot *slot = range->slot;
+ bool flush = false;
int level;
+ int ret;
/*
* Zap SPTEs even if the slot can't be mapped PRIVATE. KVM x86 only
@@ -7894,12 +7896,17 @@ int kvm_arch_pre_set_memory_attributes(struct kvm *kvm,
}
/* Unmap the old attribute page. */
- if (range->arg.attributes & KVM_MEMORY_ATTRIBUTE_PRIVATE)
+ if (range->arg.attributes & KVM_MEMORY_ATTRIBUTE_PRIVATE) {
range->attr_filter = KVM_FILTER_SHARED;
- else
+ } else {
range->attr_filter = KVM_FILTER_PRIVATE;
+ ret = kvm_split_cross_boundary_leafs(kvm, range, false);
+ if (ret < 0)
+ return ret;
+ flush |= ret;
+ }
- return kvm_unmap_gfn_range(kvm, range);
+ return kvm_unmap_gfn_range(kvm, range) | flush;
}
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 8f87d6c6be3f..9dceecf34822 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2464,8 +2464,8 @@ bool kvm_range_has_memory_attributes(struct kvm *kvm, gfn_t start, gfn_t end,
return true;
}
-static __always_inline void kvm_handle_gfn_range(struct kvm *kvm,
- struct kvm_mmu_notifier_range *range)
+static __always_inline int kvm_handle_gfn_range(struct kvm *kvm,
+ struct kvm_mmu_notifier_range *range)
{
struct kvm_gfn_range gfn_range;
struct kvm_memory_slot *slot;
@@ -2519,6 +2519,8 @@ static __always_inline void kvm_handle_gfn_range(struct kvm *kvm,
if (found_memslot)
KVM_MMU_UNLOCK(kvm);
+
+ return ret < 0 ? ret : 0;
}
static int kvm_pre_set_memory_attributes(struct kvm *kvm,
@@ -2587,7 +2589,9 @@ static int kvm_vm_set_mem_attributes(struct kvm *kvm, gfn_t start, gfn_t end,
cond_resched();
}
- kvm_handle_gfn_range(kvm, &pre_set_range);
+ r = kvm_handle_gfn_range(kvm, &pre_set_range);
+ if (r)
+ goto out_unlock;
for (i = start; i < end; i++) {
r = xa_err(xa_store(&kvm->mem_attr_array, i, entry,
@@ -2596,7 +2600,8 @@ static int kvm_vm_set_mem_attributes(struct kvm *kvm, gfn_t start, gfn_t end,
cond_resched();
}
- kvm_handle_gfn_range(kvm, &post_set_range);
+ r = kvm_handle_gfn_range(kvm, &post_set_range);
+ KVM_BUG_ON(r, kvm);
out_unlock:
mutex_unlock(&kvm->slots_lock);
--
2.43.2
^ permalink raw reply related [flat|nested] 129+ messages in thread
* [RFC PATCH v2 17/23] KVM: guest_memfd: Split for punch hole and private-to-shared conversion
2025-08-07 9:39 [RFC PATCH v2 00/23] KVM: TDX huge page support for private memory Yan Zhao
` (15 preceding siblings ...)
2025-08-07 9:44 ` [RFC PATCH v2 16/23] KVM: x86: Split cross-boundary mirror leafs for KVM_SET_MEMORY_ATTRIBUTES Yan Zhao
@ 2025-08-07 9:45 ` Yan Zhao
2025-09-04 7:58 ` Binbin Wu
` (2 more replies)
2025-08-07 9:45 ` [RFC PATCH v2 18/23] x86/virt/tdx: Do not perform cache flushes unless CLFLUSH_BEFORE_ALLOC is set Yan Zhao
` (5 subsequent siblings)
22 siblings, 3 replies; 129+ messages in thread
From: Yan Zhao @ 2025-08-07 9:45 UTC (permalink / raw)
To: pbonzini, seanjc
Cc: linux-kernel, kvm, x86, rick.p.edgecombe, dave.hansen, kas, tabba,
ackerleytng, quic_eberman, michael.roth, david, vannapurve,
vbabka, thomas.lendacky, pgonda, zhiquan1.li, fan.du, jun.miao,
ira.weiny, isaku.yamahata, xiaoyao.li, binbin.wu, chao.p.peng,
yan.y.zhao
In TDX, private page tables require precise zapping because faulting back
the zapped mappings necessitates the guest's re-acceptance. Therefore,
before performing a zap for hole punching and private-to-shared
conversions, huge leafs that cross the boundary of the zapping GFN range in
the mirror page table must be split.
Splitting may result in an error. If this happens, hole punching and
private-to-shared conversion should bail out early and return an error to
userspace.
Splitting is not necessary for kvm_gmem_release() since the entire page
table is being zapped, nor for kvm_gmem_error_folio() as an SPTE must not
map more than one physical folio.
Therefore, in this patch,
- break kvm_gmem_invalidate_begin_and_zap() into
kvm_gmem_invalidate_begin() and kvm_gmem_zap() and have
kvm_gmem_release() and kvm_gmem_error_folio() to invoke them.
- have kvm_gmem_punch_hole() to invoke kvm_gmem_invalidate_begin(),
kvm_gmem_split_private(), and kvm_gmem_zap().
Bail out if kvm_gmem_split_private() returns error.
- drop the old kvm_gmem_unmap_private() and have private-to-shared
conversion to invoke kvm_gmem_split_private() and kvm_gmem_zap() instead.
Bail out if kvm_gmem_split_private() returns error.
Co-developed-by: Ackerley Tng <ackerleytng@google.com>
Signed-off-by: Ackerley Tng <ackerleytng@google.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
---
RFC v2:
- Rebased to [1]. As changes in this patch are gmem specific, they may need
to be updated if the implementation in [1] changes.
- Update kvm_split_boundary_leafs() to kvm_split_cross_boundary_leafs() and
invoke it before kvm_gmem_punch_hole() and private-to-shared conversion.
[1] https://lore.kernel.org/all/cover.1747264138.git.ackerleytng@google.com/
RFC v1:
- new patch.
---
virt/kvm/guest_memfd.c | 142 ++++++++++++++++++++++++-----------------
1 file changed, 84 insertions(+), 58 deletions(-)
diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index 67aa2285aa49..9edf33c482d7 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -318,14 +318,14 @@ static bool kvm_gmem_has_safe_refcount(struct address_space *mapping, pgoff_t st
return refcount_safe;
}
-static void kvm_gmem_unmap_private(struct kvm_gmem *gmem, pgoff_t start,
- pgoff_t end)
+static int kvm_gmem_split_private(struct kvm_gmem *gmem, pgoff_t start, pgoff_t end)
{
struct kvm_memory_slot *slot;
struct kvm *kvm = gmem->kvm;
unsigned long index;
bool locked = false;
bool flush = false;
+ int ret = 0;
xa_for_each_range(&gmem->bindings, index, slot, start, end - 1) {
pgoff_t pgoff = slot->gmem.pgoff;
@@ -335,7 +335,6 @@ static void kvm_gmem_unmap_private(struct kvm_gmem *gmem, pgoff_t start,
.end = slot->base_gfn + min(pgoff + slot->npages, end) - pgoff,
.slot = slot,
.may_block = true,
- /* This function is only concerned with private mappings. */
.attr_filter = KVM_FILTER_PRIVATE,
};
@@ -344,6 +343,47 @@ static void kvm_gmem_unmap_private(struct kvm_gmem *gmem, pgoff_t start,
locked = true;
}
+ ret = kvm_split_cross_boundary_leafs(kvm, &gfn_range, false);
+ if (ret < 0)
+ goto out;
+
+ flush |= ret;
+ ret = 0;
+ }
+out:
+ if (flush)
+ kvm_flush_remote_tlbs(kvm);
+
+ if (locked)
+ KVM_MMU_UNLOCK(kvm);
+
+ return ret;
+}
+
+static void kvm_gmem_zap(struct kvm_gmem *gmem, pgoff_t start, pgoff_t end,
+ enum kvm_gfn_range_filter filter)
+{
+ struct kvm_memory_slot *slot;
+ struct kvm *kvm = gmem->kvm;
+ unsigned long index;
+ bool locked = false;
+ bool flush = false;
+
+ xa_for_each_range(&gmem->bindings, index, slot, start, end - 1) {
+ pgoff_t pgoff = slot->gmem.pgoff;
+ struct kvm_gfn_range gfn_range = {
+ .start = slot->base_gfn + max(pgoff, start) - pgoff,
+ .end = slot->base_gfn + min(pgoff + slot->npages, end) - pgoff,
+ .slot = slot,
+ .may_block = true,
+ .attr_filter = filter,
+ };
+
+ if (!locked) {
+ KVM_MMU_LOCK(kvm);
+ locked = true;
+ }
+
flush |= kvm_mmu_unmap_gfn_range(kvm, &gfn_range);
}
@@ -514,6 +554,8 @@ static int kvm_gmem_convert_should_proceed(struct inode *inode,
struct conversion_work *work,
bool to_shared, pgoff_t *error_index)
{
+ int ret = 0;
+
if (to_shared) {
struct list_head *gmem_list;
struct kvm_gmem *gmem;
@@ -522,19 +564,24 @@ static int kvm_gmem_convert_should_proceed(struct inode *inode,
work_end = work->start + work->nr_pages;
gmem_list = &inode->i_mapping->i_private_list;
+ list_for_each_entry(gmem, gmem_list, entry) {
+ ret = kvm_gmem_split_private(gmem, work->start, work_end);
+ if (ret)
+ return ret;
+ }
list_for_each_entry(gmem, gmem_list, entry)
- kvm_gmem_unmap_private(gmem, work->start, work_end);
+ kvm_gmem_zap(gmem, work->start, work_end, KVM_FILTER_PRIVATE);
} else {
unmap_mapping_pages(inode->i_mapping, work->start,
work->nr_pages, false);
if (!kvm_gmem_has_safe_refcount(inode->i_mapping, work->start,
work->nr_pages, error_index)) {
- return -EAGAIN;
+ ret = -EAGAIN;
}
}
- return 0;
+ return ret;
}
static int kvm_gmem_restructure_folios_in_range(struct inode *inode,
@@ -1187,54 +1234,6 @@ static struct folio *kvm_gmem_get_folio(struct inode *inode, pgoff_t index)
return ERR_PTR(ret);
}
-static void kvm_gmem_invalidate_begin_and_zap(struct kvm_gmem *gmem,
- pgoff_t start, pgoff_t end)
-{
- bool flush = false, found_memslot = false;
- struct kvm_memory_slot *slot;
- struct kvm *kvm = gmem->kvm;
- unsigned long index;
-
- xa_for_each_range(&gmem->bindings, index, slot, start, end - 1) {
- enum kvm_gfn_range_filter filter;
- pgoff_t pgoff = slot->gmem.pgoff;
-
- filter = KVM_FILTER_PRIVATE;
- if (kvm_gmem_memslot_supports_shared(slot)) {
- /*
- * Unmapping would also cause invalidation, but cannot
- * rely on mmu_notifiers to do invalidation via
- * unmapping, since memory may not be mapped to
- * userspace.
- */
- filter |= KVM_FILTER_SHARED;
- }
-
- struct kvm_gfn_range gfn_range = {
- .start = slot->base_gfn + max(pgoff, start) - pgoff,
- .end = slot->base_gfn + min(pgoff + slot->npages, end) - pgoff,
- .slot = slot,
- .may_block = true,
- .attr_filter = filter,
- };
-
- if (!found_memslot) {
- found_memslot = true;
-
- KVM_MMU_LOCK(kvm);
- kvm_mmu_invalidate_begin(kvm);
- }
-
- flush |= kvm_mmu_unmap_gfn_range(kvm, &gfn_range);
- }
-
- if (flush)
- kvm_flush_remote_tlbs(kvm);
-
- if (found_memslot)
- KVM_MMU_UNLOCK(kvm);
-}
-
static void kvm_gmem_invalidate_end(struct kvm_gmem *gmem, pgoff_t start,
pgoff_t end)
{
@@ -1445,9 +1444,28 @@ static long kvm_gmem_punch_hole(struct inode *inode, loff_t offset, loff_t len)
filemap_invalidate_lock(inode->i_mapping);
list_for_each_entry(gmem, gmem_list, entry)
- kvm_gmem_invalidate_begin_and_zap(gmem, start, end);
+ kvm_gmem_invalidate_begin(gmem, start, end);
ret = 0;
+ list_for_each_entry(gmem, gmem_list, entry) {
+ ret = kvm_gmem_split_private(gmem, start, end);
+ if (ret)
+ goto out;
+ }
+ list_for_each_entry(gmem, gmem_list, entry) {
+ enum kvm_gfn_range_filter filter;
+
+ /*
+ * kvm_gmem_invalidate_begin() would have unmapped shared
+ * mappings via mmu notifiers, but only if those mappings were
+ * actually set up. Since guest_memfd cannot assume that shared
+ * mappings were set up, zap both private and shared mappings
+ * here. If shared mappings were zapped, this should not be
+ * expensive.
+ */
+ filter = KVM_FILTER_PRIVATE | KVM_FILTER_SHARED;
+ kvm_gmem_zap(gmem, start, end, filter);
+ }
if (kvm_gmem_has_custom_allocator(inode)) {
ret = kvm_gmem_truncate_inode_range(inode, offset, offset + len);
} else {
@@ -1455,6 +1473,7 @@ static long kvm_gmem_punch_hole(struct inode *inode, loff_t offset, loff_t len)
truncate_inode_pages_range(inode->i_mapping, offset, offset + len - 1);
}
+out:
list_for_each_entry(gmem, gmem_list, entry)
kvm_gmem_invalidate_end(gmem, start, end);
@@ -1576,7 +1595,8 @@ static int kvm_gmem_release(struct inode *inode, struct file *file)
* Zap all SPTEs pointed at by this file. Do not free the backing
* memory, as its lifetime is associated with the inode, not the file.
*/
- kvm_gmem_invalidate_begin_and_zap(gmem, 0, -1ul);
+ kvm_gmem_invalidate_begin(gmem, 0, -1ul);
+ kvm_gmem_zap(gmem, 0, -1ul, KVM_FILTER_PRIVATE | KVM_FILTER_SHARED);
kvm_gmem_invalidate_end(gmem, 0, -1ul);
list_del(&gmem->entry);
@@ -1906,8 +1926,14 @@ static int kvm_gmem_error_folio(struct address_space *mapping, struct folio *fol
start = folio->index;
end = start + folio_nr_pages(folio);
- list_for_each_entry(gmem, gmem_list, entry)
- kvm_gmem_invalidate_begin_and_zap(gmem, start, end);
+ /* The size of the SEPT will not exceed the size of the folio */
+ list_for_each_entry(gmem, gmem_list, entry) {
+ enum kvm_gfn_range_filter filter;
+
+ kvm_gmem_invalidate_begin(gmem, start, end);
+ filter = KVM_FILTER_PRIVATE | KVM_FILTER_SHARED;
+ kvm_gmem_zap(gmem, start, end, filter);
+ }
/*
* Do not truncate the range, what action is taken in response to the
--
2.43.2
^ permalink raw reply related [flat|nested] 129+ messages in thread
* Re: [RFC PATCH v2 17/23] KVM: guest_memfd: Split for punch hole and private-to-shared conversion
2025-08-07 9:45 ` [RFC PATCH v2 17/23] KVM: guest_memfd: Split for punch hole and private-to-shared conversion Yan Zhao
@ 2025-09-04 7:58 ` Binbin Wu
2025-09-04 9:48 ` Yan Zhao
2025-10-01 6:21 ` Ackerley Tng
2025-10-01 8:00 ` Ackerley Tng
2 siblings, 1 reply; 129+ messages in thread
From: Binbin Wu @ 2025-09-04 7:58 UTC (permalink / raw)
To: Yan Zhao
Cc: pbonzini, seanjc, linux-kernel, kvm, x86, rick.p.edgecombe,
dave.hansen, kas, tabba, ackerleytng, quic_eberman, michael.roth,
david, vannapurve, vbabka, thomas.lendacky, pgonda, zhiquan1.li,
fan.du, jun.miao, ira.weiny, isaku.yamahata, xiaoyao.li,
chao.p.peng
On 8/7/2025 5:45 PM, Yan Zhao wrote:
[...]
>
> @@ -514,6 +554,8 @@ static int kvm_gmem_convert_should_proceed(struct inode *inode,
> struct conversion_work *work,
> bool to_shared, pgoff_t *error_index)
> {
> + int ret = 0;
> +
> if (to_shared) {
> struct list_head *gmem_list;
> struct kvm_gmem *gmem;
> @@ -522,19 +564,24 @@ static int kvm_gmem_convert_should_proceed(struct inode *inode,
> work_end = work->start + work->nr_pages;
>
> gmem_list = &inode->i_mapping->i_private_list;
> + list_for_each_entry(gmem, gmem_list, entry) {
> + ret = kvm_gmem_split_private(gmem, work->start, work_end);
> + if (ret)
> + return ret;
> + }
> list_for_each_entry(gmem, gmem_list, entry)
> - kvm_gmem_unmap_private(gmem, work->start, work_end);
> + kvm_gmem_zap(gmem, work->start, work_end, KVM_FILTER_PRIVATE);
> } else {
> unmap_mapping_pages(inode->i_mapping, work->start,
> work->nr_pages, false);
>
> if (!kvm_gmem_has_safe_refcount(inode->i_mapping, work->start,
> work->nr_pages, error_index)) {
> - return -EAGAIN;
> + ret = -EAGAIN;
> }
Not from this patch.
When an if statement breaks into two lines, are curly braces needed?
> }
>
> - return 0;
> + return ret;
> }
>
[...]
> @@ -1906,8 +1926,14 @@ static int kvm_gmem_error_folio(struct address_space *mapping, struct folio *fol
> start = folio->index;
> end = start + folio_nr_pages(folio);
>
> - list_for_each_entry(gmem, gmem_list, entry)
> - kvm_gmem_invalidate_begin_and_zap(gmem, start, end);
> + /* The size of the SEPT will not exceed the size of the folio */
To me, the comment alone, without the context, doesn't directly convey that
splitting is not needed. If it's not too wordy, could you make it more informative?
> + list_for_each_entry(gmem, gmem_list, entry) {
> + enum kvm_gfn_range_filter filter;
> +
> + kvm_gmem_invalidate_begin(gmem, start, end);
> + filter = KVM_FILTER_PRIVATE | KVM_FILTER_SHARED;
> + kvm_gmem_zap(gmem, start, end, filter);
> + }
>
> /*
> * Do not truncate the range, what action is taken in response to the
^ permalink raw reply [flat|nested] 129+ messages in thread
* Re: [RFC PATCH v2 17/23] KVM: guest_memfd: Split for punch hole and private-to-shared conversion
2025-09-04 7:58 ` Binbin Wu
@ 2025-09-04 9:48 ` Yan Zhao
2025-09-04 11:07 ` Yan Zhao
0 siblings, 1 reply; 129+ messages in thread
From: Yan Zhao @ 2025-09-04 9:48 UTC (permalink / raw)
To: Binbin Wu
Cc: pbonzini, seanjc, linux-kernel, kvm, x86, rick.p.edgecombe,
dave.hansen, kas, tabba, ackerleytng, quic_eberman, michael.roth,
david, vannapurve, vbabka, thomas.lendacky, pgonda, fan.du,
jun.miao, ira.weiny, isaku.yamahata, xiaoyao.li, chao.p.peng
On Thu, Sep 04, 2025 at 03:58:54PM +0800, Binbin Wu wrote:
>
>
> On 8/7/2025 5:45 PM, Yan Zhao wrote:
> [...]
> > @@ -514,6 +554,8 @@ static int kvm_gmem_convert_should_proceed(struct inode *inode,
> > struct conversion_work *work,
> > bool to_shared, pgoff_t *error_index)
> > {
> > + int ret = 0;
> > +
> > if (to_shared) {
> > struct list_head *gmem_list;
> > struct kvm_gmem *gmem;
> > @@ -522,19 +564,24 @@ static int kvm_gmem_convert_should_proceed(struct inode *inode,
> > work_end = work->start + work->nr_pages;
> > gmem_list = &inode->i_mapping->i_private_list;
> > + list_for_each_entry(gmem, gmem_list, entry) {
> > + ret = kvm_gmem_split_private(gmem, work->start, work_end);
> > + if (ret)
> > + return ret;
> > + }
> > list_for_each_entry(gmem, gmem_list, entry)
> > - kvm_gmem_unmap_private(gmem, work->start, work_end);
> > + kvm_gmem_zap(gmem, work->start, work_end, KVM_FILTER_PRIVATE);
> > } else {
> > unmap_mapping_pages(inode->i_mapping, work->start,
> > work->nr_pages, false);
> > if (!kvm_gmem_has_safe_refcount(inode->i_mapping, work->start,
> > work->nr_pages, error_index)) {
> > - return -EAGAIN;
> > + ret = -EAGAIN;
> > }
>
> Not from this patch.
> When an if statement breaks into two lines, are curly braces needed?
Hmm, either one (with or without curly braces) can pass the check of
"scripts/checkpatch.pl --strict".
>
> > }
> > - return 0;
> > + return ret;
> > }
> [...]
> > @@ -1906,8 +1926,14 @@ static int kvm_gmem_error_folio(struct address_space *mapping, struct folio *fol
> > start = folio->index;
> > end = start + folio_nr_pages(folio);
> > - list_for_each_entry(gmem, gmem_list, entry)
> > - kvm_gmem_invalidate_begin_and_zap(gmem, start, end);
> > + /* The size of the SEPT will not exceed the size of the folio */
> To me, the comment alone, without the context, doesn't directly convey that
> splitting is not needed. If it's not too wordy, could you make it more informative?
What about:
The zap is limited to the range covered by a single folio.
As an S-EPT leaf entry can't cover a range larger than its backend folio size,
the zap can't cross two S-EPT leaf entries. So, no split is required.
>
> > + list_for_each_entry(gmem, gmem_list, entry) {
> > + enum kvm_gfn_range_filter filter;
> > +
> > + kvm_gmem_invalidate_begin(gmem, start, end);
> > + filter = KVM_FILTER_PRIVATE | KVM_FILTER_SHARED;
> > + kvm_gmem_zap(gmem, start, end, filter);
> > + }
> > /*
> > * Do not truncate the range, what action is taken in response to the
>
^ permalink raw reply [flat|nested] 129+ messages in thread
* Re: [RFC PATCH v2 17/23] KVM: guest_memfd: Split for punch hole and private-to-shared conversion
2025-09-04 9:48 ` Yan Zhao
@ 2025-09-04 11:07 ` Yan Zhao
0 siblings, 0 replies; 129+ messages in thread
From: Yan Zhao @ 2025-09-04 11:07 UTC (permalink / raw)
To: Binbin Wu, pbonzini, seanjc, linux-kernel, kvm, x86,
rick.p.edgecombe, dave.hansen, kas, tabba, ackerleytng,
michael.roth, david, vannapurve, vbabka, thomas.lendacky, pgonda,
fan.du, jun.miao, ira.weiny, isaku.yamahata, xiaoyao.li,
chao.p.peng
On Thu, Sep 04, 2025 at 05:48:42PM +0800, Yan Zhao wrote:
> On Thu, Sep 04, 2025 at 03:58:54PM +0800, Binbin Wu wrote:
> > > @@ -1906,8 +1926,14 @@ static int kvm_gmem_error_folio(struct address_space *mapping, struct folio *fol
> > > start = folio->index;
> > > end = start + folio_nr_pages(folio);
> > > - list_for_each_entry(gmem, gmem_list, entry)
> > > - kvm_gmem_invalidate_begin_and_zap(gmem, start, end);
> > > + /* The size of the SEPT will not exceed the size of the folio */
> > To me, the comment alone, without the context, doesn't directly convey that
> > splitting is not needed. If it's not too wordy, could you make it more informative?
> What about:
> The zap is limited to the range covered by a single folio.
> As an S-EPT leaf entry can't cover a range larger than its backend folio size,
> the zap can't cross two S-EPT leaf entries. So, no split is required.
Sorry, my brain just froze.
Should just be:
As a leaf SPTE can't cover a range larger than its backend folio size,
no splitting is required before the zap.
> > > + list_for_each_entry(gmem, gmem_list, entry) {
> > > + enum kvm_gfn_range_filter filter;
> > > +
> > > + kvm_gmem_invalidate_begin(gmem, start, end);
> > > + filter = KVM_FILTER_PRIVATE | KVM_FILTER_SHARED;
> > > + kvm_gmem_zap(gmem, start, end, filter);
> > > + }
> > > /*
> > > * Do not truncate the range, what action is taken in response to the
> >
^ permalink raw reply [flat|nested] 129+ messages in thread
* Re: [RFC PATCH v2 17/23] KVM: guest_memfd: Split for punch hole and private-to-shared conversion
2025-08-07 9:45 ` [RFC PATCH v2 17/23] KVM: guest_memfd: Split for punch hole and private-to-shared conversion Yan Zhao
2025-09-04 7:58 ` Binbin Wu
@ 2025-10-01 6:21 ` Ackerley Tng
2025-10-13 0:18 ` Yan Zhao
2025-10-01 8:00 ` Ackerley Tng
2 siblings, 1 reply; 129+ messages in thread
From: Ackerley Tng @ 2025-10-01 6:21 UTC (permalink / raw)
To: Yan Zhao, pbonzini, seanjc
Cc: linux-kernel, kvm, x86, rick.p.edgecombe, dave.hansen, kas, tabba,
quic_eberman, michael.roth, david, vannapurve, vbabka,
thomas.lendacky, pgonda, zhiquan1.li, fan.du, jun.miao, ira.weiny,
isaku.yamahata, xiaoyao.li, binbin.wu, chao.p.peng, yan.y.zhao
Yan Zhao <yan.y.zhao@intel.com> writes:
Thanks Yan! Just got around to looking at this, sorry about the delay!
> In TDX, private page tables require precise zapping because faulting back
> the zapped mappings necessitates the guest's re-acceptance. Therefore,
> before performing a zap for hole punching and private-to-shared
> conversions, huge leafs that cross the boundary of the zapping GFN range in
> the mirror page table must be split.
>
> Splitting may result in an error. If this happens, hole punching and
> private-to-shared conversion should bail out early and return an error to
> userspace.
>
> Splitting is not necessary for kvm_gmem_release() since the entire page
> table is being zapped, nor for kvm_gmem_error_folio() as an SPTE must not
> map more than one physical folio.
>
> Therefore, in this patch,
> - break kvm_gmem_invalidate_begin_and_zap() into
> kvm_gmem_invalidate_begin() and kvm_gmem_zap() and have
> kvm_gmem_release() and kvm_gmem_error_folio() to invoke them.
>
I think perhaps separating invalidate and zap could be a separate patch
from adding the split step into the flow; that would make this patch
smaller and easier to review.
No action required from you for now, I have the above part in a
separate patch already (not yet posted).
> - have kvm_gmem_punch_hole() to invoke kvm_gmem_invalidate_begin(),
> kvm_gmem_split_private(), and kvm_gmem_zap().
> Bail out if kvm_gmem_split_private() returns error.
>
IIUC the current upstream position is that hole punching will not
be permitted for ranges smaller than the page size for the entire
guest_memfd.
Hence no splitting required during hole punch?
+ 4K guest_memfd: no splitting required since the EPT entries will not
be larger than 4K anyway
+ 2M and 1G (x86) guest_memfd: no splitting required since the entire
EPT entry will have to go away for valid ranges (valid ranges are
either 2M or 1G anyway)
Does that sound right?
> - drop the old kvm_gmem_unmap_private() and have private-to-shared
> conversion to invoke kvm_gmem_split_private() and kvm_gmem_zap() instead.
> Bail out if kvm_gmem_split_private() returns error.
>
> Co-developed-by: Ackerley Tng <ackerleytng@google.com>
> Signed-off-by: Ackerley Tng <ackerleytng@google.com>
> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
>
> [...snip...]
>
> @@ -514,6 +554,8 @@ static int kvm_gmem_convert_should_proceed(struct inode *inode,
> struct conversion_work *work,
> bool to_shared, pgoff_t *error_index)
> {
> + int ret = 0;
> +
> if (to_shared) {
> struct list_head *gmem_list;
> struct kvm_gmem *gmem;
> @@ -522,19 +564,24 @@ static int kvm_gmem_convert_should_proceed(struct inode *inode,
> work_end = work->start + work->nr_pages;
>
> gmem_list = &inode->i_mapping->i_private_list;
> + list_for_each_entry(gmem, gmem_list, entry) {
> + ret = kvm_gmem_split_private(gmem, work->start, work_end);
> + if (ret)
> + return ret;
> + }
Will be refactoring the conversion steps a little for the next version
of this series, hence I'd like to ask about the requirements before
doing splitting.
The requirement is to split before zapping, right? Other than that
we technically don't need to split before checking for a safe refcount, right?
> list_for_each_entry(gmem, gmem_list, entry)
> - kvm_gmem_unmap_private(gmem, work->start, work_end);
> + kvm_gmem_zap(gmem, work->start, work_end, KVM_FILTER_PRIVATE);
> } else {
> unmap_mapping_pages(inode->i_mapping, work->start,
> work->nr_pages, false);
>
> if (!kvm_gmem_has_safe_refcount(inode->i_mapping, work->start,
> work->nr_pages, error_index)) {
> - return -EAGAIN;
> + ret = -EAGAIN;
> }
> }
>
> - return 0;
> + return ret;
> }
>
>
> [...snip...]
>
> @@ -1906,8 +1926,14 @@ static int kvm_gmem_error_folio(struct address_space *mapping, struct folio *fol
> start = folio->index;
> end = start + folio_nr_pages(folio);
>
> - list_for_each_entry(gmem, gmem_list, entry)
> - kvm_gmem_invalidate_begin_and_zap(gmem, start, end);
> + /* The size of the SEPT will not exceed the size of the folio */
I think splitting might be required here, but that depends on whether we
want to unmap just a part of the huge folio or whether we want to unmap
the entire folio.
Lots of open questions on memory failure handling, but for now I think
this makes sense.
> + list_for_each_entry(gmem, gmem_list, entry) {
> + enum kvm_gfn_range_filter filter;
> +
> + kvm_gmem_invalidate_begin(gmem, start, end);
> + filter = KVM_FILTER_PRIVATE | KVM_FILTER_SHARED;
> + kvm_gmem_zap(gmem, start, end, filter);
> + }
>
> /*
> * Do not truncate the range, what action is taken in response to the
> --
> 2.43.2
^ permalink raw reply	[flat|nested] 129+ messages in thread
* Re: [RFC PATCH v2 17/23] KVM: guest_memfd: Split for punch hole and private-to-shared conversion
2025-10-01 6:21 ` Ackerley Tng
@ 2025-10-13 0:18 ` Yan Zhao
0 siblings, 0 replies; 129+ messages in thread
From: Yan Zhao @ 2025-10-13 0:18 UTC (permalink / raw)
To: Ackerley Tng
Cc: pbonzini, seanjc, linux-kernel, kvm, x86, rick.p.edgecombe,
dave.hansen, kas, tabba, quic_eberman, michael.roth, david,
vannapurve, vbabka, thomas.lendacky, pgonda, zhiquan1.li, fan.du,
jun.miao, ira.weiny, isaku.yamahata, xiaoyao.li, binbin.wu,
chao.p.peng
Sorry for the delay. Just got back from vacation.
On Wed, Oct 01, 2025 at 06:21:47AM +0000, Ackerley Tng wrote:
> Yan Zhao <yan.y.zhao@intel.com> writes:
>
> Thanks Yan! Just got around to looking at this, sorry about the delay!
>
> > In TDX, private page tables require precise zapping because faulting back
> > the zapped mappings necessitates the guest's re-acceptance. Therefore,
> > before performing a zap for hole punching and private-to-shared
> > conversions, huge leafs that cross the boundary of the zapping GFN range in
> > the mirror page table must be split.
> >
> > Splitting may result in an error. If this happens, hole punching and
> > private-to-shared conversion should bail out early and return an error to
> > userspace.
> >
> > Splitting is not necessary for kvm_gmem_release() since the entire page
> > table is being zapped, nor for kvm_gmem_error_folio() as an SPTE must not
> > map more than one physical folio.
> >
> > Therefore, in this patch,
> > - break kvm_gmem_invalidate_begin_and_zap() into
> > kvm_gmem_invalidate_begin() and kvm_gmem_zap() and have
> > kvm_gmem_release() and kvm_gmem_error_folio() to invoke them.
> >
>
> I think perhaps separating invalidate and zap could be a separate patch
> from adding the split step into the flow; that would make this patch
> smaller and easier to review.
>
> No action required from you for now; I have the above part in a
> separate patch already (not yet posted).
>
> > - have kvm_gmem_punch_hole() to invoke kvm_gmem_invalidate_begin(),
> > kvm_gmem_split_private(), and kvm_gmem_zap().
> > Bail out if kvm_gmem_split_private() returns error.
> >
>
> IIUC the current upstream position is that hole punching will not
> be permitted for ranges smaller than the page size for the entire
> guest_memfd.
In hugetlbfs_fallocate(), the punch hole ranges are hpage size aligned.
start = offset >> hpage_shift;
end = (offset + len + hpage_size - 1) >> hpage_shift;
However, in the guest_memfd (at least in v2), the punch hole ranges are
just page aligned.
pgoff_t start = offset >> PAGE_SHIFT;
pgoff_t end = (offset + len) >> PAGE_SHIFT;
(Note, I noticed that the range calculation for invalidation is not the same as
that in kvm_gmem_truncate_inode_range(), where:
full_hpage_start = round_up(start, nr_per_huge_page);
full_hpage_end = round_down(end, nr_per_huge_page);
We should probably align these two implementations for consistency).
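To illustrate the difference (hypothetical numbers, assuming 2MB huge pages,
i.e. nr_per_huge_page == 512): punching offset = 1MB, len = 2MB gives
	start = 1MB >> PAGE_SHIFT = 256,  end = 3MB >> PAGE_SHIFT = 768,
	full_hpage_start = round_up(256, 512)   = 512,
	full_hpage_end   = round_down(768, 512) = 512,
i.e. the invalidation covers page indices [256, 768) while the full-hugepage
range is empty.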
> Hence no splitting required during hole punch?
>
> + 4K guest_memfd: no splitting required since the EPT entries will not
> be larger than 4K anyway
> + 2M and 1G (x86) guest_memfd: no splitting required since the entire
> EPT entry will have to go away for valid ranges (valid ranges are
> either 2M or 1G anyway)
>
> Does that sound right?
If future guest_memfd code could align the punch hole ranges to page size for
the entire guest_memfd, I think it's ok.
> > - drop the old kvm_gmem_unmap_private() and have private-to-shared
> > conversion to invoke kvm_gmem_split_private() and kvm_gmem_zap() instead.
> > Bail out if kvm_gmem_split_private() returns error.
> >
> > Co-developed-by: Ackerley Tng <ackerleytng@google.com>
> > Signed-off-by: Ackerley Tng <ackerleytng@google.com>
> > Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
> >
> > [...snip...]
> >
> > @@ -514,6 +554,8 @@ static int kvm_gmem_convert_should_proceed(struct inode *inode,
> > struct conversion_work *work,
> > bool to_shared, pgoff_t *error_index)
> > {
> > + int ret = 0;
> > +
> > if (to_shared) {
> > struct list_head *gmem_list;
> > struct kvm_gmem *gmem;
> > @@ -522,19 +564,24 @@ static int kvm_gmem_convert_should_proceed(struct inode *inode,
> > work_end = work->start + work->nr_pages;
> >
> > gmem_list = &inode->i_mapping->i_private_list;
> > + list_for_each_entry(gmem, gmem_list, entry) {
> > + ret = kvm_gmem_split_private(gmem, work->start, work_end);
> > + if (ret)
> > + return ret;
> > + }
>
> Will be refactoring the conversion steps a little for the next version
> of this series, hence I'd like to ask about the requirements before
> doing splitting.
>
> The requirement is to split before zapping, right? Other than that
> we technically don't need to split before checking for a safe refcount, right?
Yes, the requirement is to split before zapping for private-to-shared
conversion. TDX will not hold page refcounts for private pages anymore.
> > list_for_each_entry(gmem, gmem_list, entry)
> > - kvm_gmem_unmap_private(gmem, work->start, work_end);
> > + kvm_gmem_zap(gmem, work->start, work_end, KVM_FILTER_PRIVATE);
> > } else {
> > unmap_mapping_pages(inode->i_mapping, work->start,
> > work->nr_pages, false);
> >
> > if (!kvm_gmem_has_safe_refcount(inode->i_mapping, work->start,
> > work->nr_pages, error_index)) {
> > - return -EAGAIN;
> > + ret = -EAGAIN;
> > }
> > }
> >
> > - return 0;
> > + return ret;
> > }
> >
> >
> > [...snip...]
> >
> > @@ -1906,8 +1926,14 @@ static int kvm_gmem_error_folio(struct address_space *mapping, struct folio *fol
> > start = folio->index;
> > end = start + folio_nr_pages(folio);
> >
> > - list_for_each_entry(gmem, gmem_list, entry)
> > - kvm_gmem_invalidate_begin_and_zap(gmem, start, end);
> > + /* The size of the SEPT will not exceed the size of the folio */
>
> I think splitting might be required here, but that depends on whether we
> want to unmap just a part of the huge folio or whether we want to unmap
> the entire folio.
Ok. When that occurs, we can do the split according to the partial unmap range
info.
> Lots of open questions on memory failure handling, but for now I think
> this makes sense.
>
> > + list_for_each_entry(gmem, gmem_list, entry) {
> > + enum kvm_gfn_range_filter filter;
> > +
> > + kvm_gmem_invalidate_begin(gmem, start, end);
> > + filter = KVM_FILTER_PRIVATE | KVM_FILTER_SHARED;
> > + kvm_gmem_zap(gmem, start, end, filter);
> > + }
> >
> > /*
> > * Do not truncate the range, what action is taken in response to the
> > --
> > 2.43.2
^ permalink raw reply [flat|nested] 129+ messages in thread
* Re: [RFC PATCH v2 17/23] KVM: guest_memfd: Split for punch hole and private-to-shared conversion
2025-08-07 9:45 ` [RFC PATCH v2 17/23] KVM: guest_memfd: Split for punch hole and private-to-shared conversion Yan Zhao
2025-09-04 7:58 ` Binbin Wu
2025-10-01 6:21 ` Ackerley Tng
@ 2025-10-01 8:00 ` Ackerley Tng
2025-10-13 0:45 ` Yan Zhao
2 siblings, 1 reply; 129+ messages in thread
From: Ackerley Tng @ 2025-10-01 8:00 UTC (permalink / raw)
To: Yan Zhao, pbonzini, seanjc
Cc: linux-kernel, kvm, x86, rick.p.edgecombe, dave.hansen, kas, tabba,
quic_eberman, michael.roth, david, vannapurve, vbabka,
thomas.lendacky, pgonda, zhiquan1.li, fan.du, jun.miao, ira.weiny,
isaku.yamahata, xiaoyao.li, binbin.wu, chao.p.peng, yan.y.zhao
Yan Zhao <yan.y.zhao@intel.com> writes:
I was looking deeper into this patch since on my WIP tree I already had
the invalidate and zap steps separated out and had to do more to rebase
this patch :)
> In TDX, private page tables require precise zapping because faulting back
> the zapped mappings necessitates the guest's re-acceptance.
I feel that this statement could be better phrased, because all zapped
mappings require re-acceptance, not just those related to precise
zapping. Would this be better:
On private-to-shared conversions, page table entries must be zapped
from the Secure EPTs. Any pages mapped into Secure EPTs must be
accepted by the guest before they are used.
Hence, care must be taken to only precisely zap ranges requested for
private-to-shared conversion, since the guest is only prepared to
re-accept precisely the ranges it requested for conversion.
The guest may request to convert ranges not aligned with private
page table entry boundaries. To precisely zap these ranges, huge
leaves that span the boundaries of the requested ranges must be
split into smaller leaves, so that the split, smaller leaves now
align with the requested range for zapping.
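(Purely as an illustration: if a 2M leaf maps GFNs [N, N + 512) and the
guest requests conversion of only [N + 128, N + 256), the 2M leaf must
first be split into 4K leaves so that exactly those 128 pages can be
zapped.)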
> Therefore,
> before performing a zap for hole punching and private-to-shared
> conversions, huge leafs that cross the boundary of the zapping GFN range in
> the mirror page table must be split.
>
> Splitting may result in an error. If this happens, hole punching and
> private-to-shared conversion should bail out early and return an error to
> userspace.
>
> Splitting is not necessary for kvm_gmem_release() since the entire page
> table is being zapped, nor for kvm_gmem_error_folio() as an SPTE must not
> map more than one physical folio.
>
I think splitting is not necessary as long as aligned page table entries
are zapped. Splitting is also not necessary if the entire page table is
zapped but that's a superset of zapping aligned page table
entries. (Probably just a typo on your side.) Here's my attempt at
rephrasing this:
Splitting is not necessary for the cases where only aligned page
table entries are zapped, such as during kvm_gmem_release() where
the entire guest_memfd worth of memory is zapped, nor for
truncation, where truncation of pages within a huge folio is not
allowed.
> Therefore, in this patch,
> - break kvm_gmem_invalidate_begin_and_zap() into
> kvm_gmem_invalidate_begin() and kvm_gmem_zap() and have
> kvm_gmem_release() and kvm_gmem_error_folio() to invoke them.
>
> - have kvm_gmem_punch_hole() to invoke kvm_gmem_invalidate_begin(),
> kvm_gmem_split_private(), and kvm_gmem_zap().
> Bail out if kvm_gmem_split_private() returns error.
>
> - drop the old kvm_gmem_unmap_private() and have private-to-shared
> conversion to invoke kvm_gmem_split_private() and kvm_gmem_zap() instead.
> Bail out if kvm_gmem_split_private() returns error.
>
> Co-developed-by: Ackerley Tng <ackerleytng@google.com>
> Signed-off-by: Ackerley Tng <ackerleytng@google.com>
> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
> ---
> RFC v2:
> - Rebased to [1]. As changes in this patch are gmem specific, they may need
> to be updated if the implementation in [1] changes.
> - Update kvm_split_boundary_leafs() to kvm_split_cross_boundary_leafs() and
> invoke it before kvm_gmem_punch_hole() and private-to-shared conversion.
>
> [1] https://lore.kernel.org/all/cover.1747264138.git.ackerleytng@google.com/
>
> RFC v1:
> - new patch.
>
> [...snip...]
>
^ permalink raw reply	[flat|nested] 129+ messages in thread
* Re: [RFC PATCH v2 17/23] KVM: guest_memfd: Split for punch hole and private-to-shared conversion
2025-10-01 8:00 ` Ackerley Tng
@ 2025-10-13 0:45 ` Yan Zhao
0 siblings, 0 replies; 129+ messages in thread
From: Yan Zhao @ 2025-10-13 0:45 UTC (permalink / raw)
To: Ackerley Tng
Cc: pbonzini, seanjc, linux-kernel, kvm, x86, rick.p.edgecombe,
dave.hansen, kas, tabba, quic_eberman, michael.roth, david,
vannapurve, vbabka, thomas.lendacky, pgonda, zhiquan1.li, fan.du,
jun.miao, ira.weiny, isaku.yamahata, xiaoyao.li, binbin.wu,
chao.p.peng
On Wed, Oct 01, 2025 at 08:00:21AM +0000, Ackerley Tng wrote:
> Yan Zhao <yan.y.zhao@intel.com> writes:
>
> I was looking deeper into this patch since on my WIP tree I already had
> the invalidate and zap steps separated out and had to do more to rebase
> this patch :)
>
> > In TDX, private page tables require precise zapping because faulting back
> > the zapped mappings necessitates the guest's re-acceptance.
>
> I feel that this statement could be better phrased because all zapped
> mappings require re-acceptance, not just anything related to precise
> zapping. Would this be better:
>
> On private-to-shared conversions, page table entries must be zapped
> from the Secure EPTs. Any pages mapped into Secure EPTs must be
> accepted by the guest before they are used.
>
> Hence, care must be taken to only precisely zap ranges requested for
> private-to-shared conversion, since the guest is only prepared to
> re-accept precisely the ranges it requested for conversion.
>
> The guest may request to convert ranges not aligned with private
> page table entry boundaries. To precisely zap these ranges, huge
> leaves that span the boundaries of the requested ranges must be
> split into smaller leaves, so that the split, smaller leaves now
> align with the requested range for zapping.
LGTM. Thanks!
> > Therefore,
> > before performing a zap for hole punching and private-to-shared
> > conversions, huge leafs that cross the boundary of the zapping GFN range in
> > the mirror page table must be split.
> >
> > Splitting may result in an error. If this happens, hole punching and
> > private-to-shared conversion should bail out early and return an error to
> > userspace.
> >
> > Splitting is not necessary for kvm_gmem_release() since the entire page
> > table is being zapped, nor for kvm_gmem_error_folio() as an SPTE must not
> > map more than one physical folio.
> >
>
> I think splitting is not necessary as long as aligned page table entries
> are zapped. Splitting is also not necessary if the entire page table is
> zapped but that's a superset of zapping aligned page table
> entries. (Probably just a typo on your side.) Here's my attempt at
What is the typo you are referring to?
> rephrasing this:
>
> Splitting is not necessary for the cases where only aligned page
> table entries are zapped, such as during kvm_gmem_release() where
By "page table entries", you mean SPTEs, i.e., entries in the secondary MMU,
right?
> the entire guest_memfd worth of memory is zapped, nor for
> truncation, where truncation of pages within a huge folio is not
> allowed.
I think that "splitting is not required for truncation" is valid only based on
KVM's implementation where "an SPTE must not map more than one physical folio".
i.e., the SPTE entry size is <= folio size.
If KVM were implemented differently where one SPTE could cover multiple folios
(similar to IOMMU SLTP entries for shared memory, though this is unlikely to
happen), splitting would still be required before truncation.
^ permalink raw reply [flat|nested] 129+ messages in thread
* [RFC PATCH v2 18/23] x86/virt/tdx: Do not perform cache flushes unless CLFLUSH_BEFORE_ALLOC is set
2025-08-07 9:39 [RFC PATCH v2 00/23] KVM: TDX huge page support for private memory Yan Zhao
` (16 preceding siblings ...)
2025-08-07 9:45 ` [RFC PATCH v2 17/23] KVM: guest_memfd: Split for punch hole and private-to-shared conversion Yan Zhao
@ 2025-08-07 9:45 ` Yan Zhao
2025-08-11 21:10 ` Sagi Shahar
` (2 more replies)
2025-08-07 9:45 ` [RFC PATCH v2 19/23] KVM: TDX: Pass down pfn to split_external_spt() Yan Zhao
` (4 subsequent siblings)
22 siblings, 3 replies; 129+ messages in thread
From: Yan Zhao @ 2025-08-07 9:45 UTC (permalink / raw)
To: pbonzini, seanjc
Cc: linux-kernel, kvm, x86, rick.p.edgecombe, dave.hansen, kas, tabba,
ackerleytng, quic_eberman, michael.roth, david, vannapurve,
vbabka, thomas.lendacky, pgonda, zhiquan1.li, fan.du, jun.miao,
ira.weiny, isaku.yamahata, xiaoyao.li, binbin.wu, chao.p.peng,
yan.y.zhao
From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
The TDX module enumerates with a TDX_FEATURES0 bit if an explicit cache
flush is necessary when switching KeyID for a page, like before
handing the page over to a TD.
Currently, none of the TDX-capable platforms have this bit enabled.
Moreover, cache flushing with TDH.PHYMEM.PAGE.WBINVD fails if
Dynamic PAMT is active and the target page is not 4k. The SEAMCALL only
supports 4k pages and will fail if there is no PAMT_4K for the HPA.
Avoid performing these cache flushes unless the CLFLUSH_BEFORE_ALLOC bit
of TDX_FEATURES0 is set.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
---
RFC v2:
- Pulled from
git://git.kernel.org/pub/scm/linux/kernel/git/kas/linux.git tdx/dpamt-huge.
- Rebased on top of TDX huge page RFC v2 (Yan)
---
arch/x86/include/asm/tdx.h | 1 +
arch/x86/virt/vmx/tdx/tdx.c | 19 +++++++++++++------
2 files changed, 14 insertions(+), 6 deletions(-)
diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index f1bd74348b34..c058a82d4a97 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -15,6 +15,7 @@
/* Bit definitions of TDX_FEATURES0 metadata field */
#define TDX_FEATURES0_NO_RBP_MOD BIT_ULL(18)
+#define TDX_FEATURES0_CLFLUSH_BEFORE_ALLOC BIT_ULL(23)
#define TDX_FEATURES0_DYNAMIC_PAMT BIT_ULL(36)
#ifndef __ASSEMBLER__
diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index 9ed585bde062..b7a0ee0f4a50 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -1648,14 +1648,13 @@ static inline u64 tdx_tdvpr_pa(struct tdx_vp *td)
return page_to_phys(td->tdvpr_page);
}
-/*
- * The TDX module exposes a CLFLUSH_BEFORE_ALLOC bit to specify whether
- * a CLFLUSH of pages is required before handing them to the TDX module.
- * Be conservative and make the code simpler by doing the CLFLUSH
- * unconditionally.
- */
static void tdx_clflush_page(struct page *page)
{
+ u64 tdx_features0 = tdx_sysinfo.features.tdx_features0;
+
+ if (tdx_features0 & TDX_FEATURES0_CLFLUSH_BEFORE_ALLOC)
+ return;
+
clflush_cache_range(page_to_virt(page), PAGE_SIZE);
}
@@ -2030,8 +2029,12 @@ EXPORT_SYMBOL_GPL(tdh_phymem_cache_wb);
u64 tdh_phymem_page_wbinvd_tdr(struct tdx_td *td)
{
+ u64 tdx_features0 = tdx_sysinfo.features.tdx_features0;
struct tdx_module_args args = {};
+ if (tdx_features0 & TDX_FEATURES0_CLFLUSH_BEFORE_ALLOC)
+ return 0;
+
args.rcx = mk_keyed_paddr(tdx_global_keyid, td->tdr_page);
return seamcall(TDH_PHYMEM_PAGE_WBINVD, &args);
@@ -2041,10 +2044,14 @@ EXPORT_SYMBOL_GPL(tdh_phymem_page_wbinvd_tdr);
u64 tdh_phymem_page_wbinvd_hkid(u64 hkid, struct folio *folio,
unsigned long start_idx, unsigned long npages)
{
+ u64 tdx_features0 = tdx_sysinfo.features.tdx_features0;
struct page *start = folio_page(folio, start_idx);
struct tdx_module_args args = {};
u64 err;
+ if (tdx_features0 & TDX_FEATURES0_CLFLUSH_BEFORE_ALLOC)
+ return 0;
+
if (start_idx + npages > folio_nr_pages(folio))
return TDX_OPERAND_INVALID;
--
2.43.2
^ permalink raw reply related	[flat|nested] 129+ messages in thread
* Re: [RFC PATCH v2 18/23] x86/virt/tdx: Do not perform cache flushes unless CLFLUSH_BEFORE_ALLOC is set
2025-08-07 9:45 ` [RFC PATCH v2 18/23] x86/virt/tdx: Do not perform cache flushes unless CLFLUSH_BEFORE_ALLOC is set Yan Zhao
@ 2025-08-11 21:10 ` Sagi Shahar
2025-08-12 6:37 ` Yan Zhao
2025-09-04 8:16 ` Binbin Wu
2025-09-05 15:41 ` Edgecombe, Rick P
2 siblings, 1 reply; 129+ messages in thread
From: Sagi Shahar @ 2025-08-11 21:10 UTC (permalink / raw)
To: Yan Zhao
Cc: pbonzini, seanjc, linux-kernel, kvm, x86, rick.p.edgecombe,
dave.hansen, kas, tabba, ackerleytng, quic_eberman, michael.roth,
david, vannapurve, vbabka, thomas.lendacky, pgonda, zhiquan1.li,
fan.du, jun.miao, ira.weiny, isaku.yamahata, xiaoyao.li,
binbin.wu, chao.p.peng
On Thu, Aug 7, 2025 at 4:47 AM Yan Zhao <yan.y.zhao@intel.com> wrote:
>
> From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
>
> The TDX module enumerates with a TDX_FEATURES0 bit if an explicit cache
> flush is necessary when switching KeyID for a page, like before
> handing the page over to a TD.
>
> Currently, none of the TDX-capable platforms have this bit enabled.
>
> Moreover, cache flushing with TDH.PHYMEM.PAGE.WBINVD fails if
> Dynamic PAMT is active and the target page is not 4k. The SEAMCALL only
> supports 4k pages and will fail if there is no PAMT_4K for the HPA.
>
> Avoid performing these cache flushes unless the CLFLUSH_BEFORE_ALLOC bit
> of TDX_FEATURES0 is set.
>
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
> ---
> RFC v2:
> - Pulled from
> git://git.kernel.org/pub/scm/linux/kernel/git/kas/linux.git tdx/dpamt-huge.
> - Rebased on top of TDX huge page RFC v2 (Yan)
> ---
> arch/x86/include/asm/tdx.h | 1 +
> arch/x86/virt/vmx/tdx/tdx.c | 19 +++++++++++++------
> 2 files changed, 14 insertions(+), 6 deletions(-)
>
> diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
> index f1bd74348b34..c058a82d4a97 100644
> --- a/arch/x86/include/asm/tdx.h
> +++ b/arch/x86/include/asm/tdx.h
> @@ -15,6 +15,7 @@
>
> /* Bit definitions of TDX_FEATURES0 metadata field */
> #define TDX_FEATURES0_NO_RBP_MOD BIT_ULL(18)
> +#define TDX_FEATURES0_CLFLUSH_BEFORE_ALLOC BIT_ULL(23)
> #define TDX_FEATURES0_DYNAMIC_PAMT BIT_ULL(36)
>
> #ifndef __ASSEMBLER__
> diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> index 9ed585bde062..b7a0ee0f4a50 100644
> --- a/arch/x86/virt/vmx/tdx/tdx.c
> +++ b/arch/x86/virt/vmx/tdx/tdx.c
> @@ -1648,14 +1648,13 @@ static inline u64 tdx_tdvpr_pa(struct tdx_vp *td)
> return page_to_phys(td->tdvpr_page);
> }
>
> -/*
> - * The TDX module exposes a CLFLUSH_BEFORE_ALLOC bit to specify whether
> - * a CLFLUSH of pages is required before handing them to the TDX module.
> - * Be conservative and make the code simpler by doing the CLFLUSH
> - * unconditionally.
> - */
> static void tdx_clflush_page(struct page *page)
> {
> + u64 tdx_features0 = tdx_sysinfo.features.tdx_features0;
> +
> + if (tdx_features0 & TDX_FEATURES0_CLFLUSH_BEFORE_ALLOC)
> + return;
Isn't the logic here and below reversed? If the
TDX_FEATURES0_CLFLUSH_BEFORE_ALLOC bit is set, we want to perform the
clflush().
> +
> clflush_cache_range(page_to_virt(page), PAGE_SIZE);
> }
>
> @@ -2030,8 +2029,12 @@ EXPORT_SYMBOL_GPL(tdh_phymem_cache_wb);
>
> u64 tdh_phymem_page_wbinvd_tdr(struct tdx_td *td)
> {
> + u64 tdx_features0 = tdx_sysinfo.features.tdx_features0;
> struct tdx_module_args args = {};
>
> + if (tdx_features0 & TDX_FEATURES0_CLFLUSH_BEFORE_ALLOC)
> + return 0;
> +
> args.rcx = mk_keyed_paddr(tdx_global_keyid, td->tdr_page);
>
> return seamcall(TDH_PHYMEM_PAGE_WBINVD, &args);
> @@ -2041,10 +2044,14 @@ EXPORT_SYMBOL_GPL(tdh_phymem_page_wbinvd_tdr);
> u64 tdh_phymem_page_wbinvd_hkid(u64 hkid, struct folio *folio,
> unsigned long start_idx, unsigned long npages)
> {
> + u64 tdx_features0 = tdx_sysinfo.features.tdx_features0;
> struct page *start = folio_page(folio, start_idx);
> struct tdx_module_args args = {};
> u64 err;
>
> + if (tdx_features0 & TDX_FEATURES0_CLFLUSH_BEFORE_ALLOC)
> + return 0;
> +
> if (start_idx + npages > folio_nr_pages(folio))
> return TDX_OPERAND_INVALID;
>
> --
> 2.43.2
>
>
^ permalink raw reply	[flat|nested] 129+ messages in thread
* Re: [RFC PATCH v2 18/23] x86/virt/tdx: Do not perform cache flushes unless CLFLUSH_BEFORE_ALLOC is set
2025-08-11 21:10 ` Sagi Shahar
@ 2025-08-12 6:37 ` Yan Zhao
0 siblings, 0 replies; 129+ messages in thread
From: Yan Zhao @ 2025-08-12 6:37 UTC (permalink / raw)
To: Sagi Shahar
Cc: pbonzini, seanjc, linux-kernel, kvm, x86, rick.p.edgecombe,
dave.hansen, kas, tabba, ackerleytng, quic_eberman, michael.roth,
david, vannapurve, vbabka, thomas.lendacky, pgonda, zhiquan1.li,
fan.du, jun.miao, ira.weiny, isaku.yamahata, xiaoyao.li,
binbin.wu, chao.p.peng
On Mon, Aug 11, 2025 at 04:10:41PM -0500, Sagi Shahar wrote:
> On Thu, Aug 7, 2025 at 4:47 AM Yan Zhao <yan.y.zhao@intel.com> wrote:
> >
> > From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> >
> > The TDX module enumerates with a TDX_FEATURES0 bit if an explicit cache
> > flush is necessary when switching KeyID for a page, like before
> > handing the page over to a TD.
> >
> > Currently, none of the TDX-capable platforms have this bit enabled.
> >
> > Moreover, cache flushing with TDH.PHYMEM.PAGE.WBINVD fails if
> > Dynamic PAMT is active and the target page is not 4k. The SEAMCALL only
> > supports 4k pages and will fail if there is no PAMT_4K for the HPA.
I actually couldn't observe this failure on my side with DPAMT + hugepage
(without the shutdown optimization).
> > Avoid performing these cache flushes unless the CLFLUSH_BEFORE_ALLOC bit
> > of TDX_FEATURES0 is set.
> >
> > Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> > Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
> > ---
> > RFC v2:
> > - Pulled from
> > git://git.kernel.org/pub/scm/linux/kernel/git/kas/linux.git tdx/dpamt-huge.
> > - Rebased on top of TDX huge page RFC v2 (Yan)
> > ---
> > arch/x86/include/asm/tdx.h | 1 +
> > arch/x86/virt/vmx/tdx/tdx.c | 19 +++++++++++++------
> > 2 files changed, 14 insertions(+), 6 deletions(-)
> >
> > diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
> > index f1bd74348b34..c058a82d4a97 100644
> > --- a/arch/x86/include/asm/tdx.h
> > +++ b/arch/x86/include/asm/tdx.h
> > @@ -15,6 +15,7 @@
> >
> > /* Bit definitions of TDX_FEATURES0 metadata field */
> > #define TDX_FEATURES0_NO_RBP_MOD BIT_ULL(18)
> > +#define TDX_FEATURES0_CLFLUSH_BEFORE_ALLOC BIT_ULL(23)
> > #define TDX_FEATURES0_DYNAMIC_PAMT BIT_ULL(36)
> >
> > #ifndef __ASSEMBLER__
> > diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> > index 9ed585bde062..b7a0ee0f4a50 100644
> > --- a/arch/x86/virt/vmx/tdx/tdx.c
> > +++ b/arch/x86/virt/vmx/tdx/tdx.c
> > @@ -1648,14 +1648,13 @@ static inline u64 tdx_tdvpr_pa(struct tdx_vp *td)
> > return page_to_phys(td->tdvpr_page);
> > }
> >
> > -/*
> > - * The TDX module exposes a CLFLUSH_BEFORE_ALLOC bit to specify whether
> > - * a CLFLUSH of pages is required before handing them to the TDX module.
> > - * Be conservative and make the code simpler by doing the CLFLUSH
> > - * unconditionally.
> > - */
> > static void tdx_clflush_page(struct page *page)
> > {
> > + u64 tdx_features0 = tdx_sysinfo.features.tdx_features0;
> > +
> > + if (tdx_features0 & TDX_FEATURES0_CLFLUSH_BEFORE_ALLOC)
> > + return;
>
> Isn't the logic here and below reversed? If the
> TDX_FEATURES0_CLFLUSH_BEFORE_ALLOC bit is set, we want to perform the
> clflush().
Yes, I think so.
As my test machine has boot_cpu_has_bug(X86_BUG_TDX_PW_MCE) returning true, I
thought it was right to perform clflush() and overlooked this logical error.
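The check should be inverted, e.g. (untested sketch):

	if (!(tdx_features0 & TDX_FEATURES0_CLFLUSH_BEFORE_ALLOC))
		return;

and similarly the two TDH.PHYMEM.PAGE.WBINVD wrappers should only return
early when the bit is clear.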
> > clflush_cache_range(page_to_virt(page), PAGE_SIZE);
> > }
> >
> > @@ -2030,8 +2029,12 @@ EXPORT_SYMBOL_GPL(tdh_phymem_cache_wb);
> >
> > u64 tdh_phymem_page_wbinvd_tdr(struct tdx_td *td)
> > {
> > + u64 tdx_features0 = tdx_sysinfo.features.tdx_features0;
> > struct tdx_module_args args = {};
> >
> > + if (tdx_features0 & TDX_FEATURES0_CLFLUSH_BEFORE_ALLOC)
> > + return 0;
> > +
> > args.rcx = mk_keyed_paddr(tdx_global_keyid, td->tdr_page);
> >
> > return seamcall(TDH_PHYMEM_PAGE_WBINVD, &args);
> > @@ -2041,10 +2044,14 @@ EXPORT_SYMBOL_GPL(tdh_phymem_page_wbinvd_tdr);
> > u64 tdh_phymem_page_wbinvd_hkid(u64 hkid, struct folio *folio,
> > unsigned long start_idx, unsigned long npages)
> > {
> > + u64 tdx_features0 = tdx_sysinfo.features.tdx_features0;
> > struct page *start = folio_page(folio, start_idx);
> > struct tdx_module_args args = {};
> > u64 err;
> >
> > + if (tdx_features0 & TDX_FEATURES0_CLFLUSH_BEFORE_ALLOC)
> > + return 0;
> > +
> > if (start_idx + npages > folio_nr_pages(folio))
> > return TDX_OPERAND_INVALID;
> >
> > --
> > 2.43.2
> >
> >
>
^ permalink raw reply [flat|nested] 129+ messages in thread
* Re: [RFC PATCH v2 18/23] x86/virt/tdx: Do not perform cache flushes unless CLFLUSH_BEFORE_ALLOC is set
2025-08-07 9:45 ` [RFC PATCH v2 18/23] x86/virt/tdx: Do not perform cache flushes unless CLFLUSH_BEFORE_ALLOC is set Yan Zhao
2025-08-11 21:10 ` Sagi Shahar
@ 2025-09-04 8:16 ` Binbin Wu
2025-09-04 9:50 ` Yan Zhao
2025-09-05 15:41 ` Edgecombe, Rick P
2 siblings, 1 reply; 129+ messages in thread
From: Binbin Wu @ 2025-09-04 8:16 UTC (permalink / raw)
To: Yan Zhao
Cc: pbonzini, seanjc, linux-kernel, kvm, x86, rick.p.edgecombe,
dave.hansen, kas, tabba, ackerleytng, quic_eberman, michael.roth,
david, vannapurve, vbabka, thomas.lendacky, pgonda, zhiquan1.li,
fan.du, jun.miao, ira.weiny, isaku.yamahata, xiaoyao.li,
chao.p.peng
On 8/7/2025 5:45 PM, Yan Zhao wrote:
> From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
>
> The TDX module enumerates with a TDX_FEATURES0 bit if an explicit cache
> flush is necessary when switching KeyID for a page, like before
> handing the page over to a TD.
>
> Currently, none of the TDX-capable platforms have this bit enabled.
>
> Moreover, cache flushing with TDH.PHYMEM.PAGE.WBINVD fails if
> Dynamic PAMT is active and the target page is not 4k. The SEAMCALL only
> supports 4k pages and will fail if there is no PAMT_4K for the HPA.
>
> Avoid performing these cache flushes unless the CLFLUSH_BEFORE_ALLOC bit
> of TDX_FEATURES0 is set.
>
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
> ---
> RFC v2:
> - Pulled from
> git://git.kernel.org/pub/scm/linux/kernel/git/kas/linux.git tdx/dpamt-huge.
> - Rebased on top of TDX huge page RFC v2 (Yan)
> ---
> arch/x86/include/asm/tdx.h | 1 +
> arch/x86/virt/vmx/tdx/tdx.c | 19 +++++++++++++------
> 2 files changed, 14 insertions(+), 6 deletions(-)
>
> diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
> index f1bd74348b34..c058a82d4a97 100644
> --- a/arch/x86/include/asm/tdx.h
> +++ b/arch/x86/include/asm/tdx.h
> @@ -15,6 +15,7 @@
>
> /* Bit definitions of TDX_FEATURES0 metadata field */
> #define TDX_FEATURES0_NO_RBP_MOD BIT_ULL(18)
> +#define TDX_FEATURES0_CLFLUSH_BEFORE_ALLOC BIT_ULL(23)
> #define TDX_FEATURES0_DYNAMIC_PAMT BIT_ULL(36)
>
> #ifndef __ASSEMBLER__
> diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> index 9ed585bde062..b7a0ee0f4a50 100644
> --- a/arch/x86/virt/vmx/tdx/tdx.c
> +++ b/arch/x86/virt/vmx/tdx/tdx.c
> @@ -1648,14 +1648,13 @@ static inline u64 tdx_tdvpr_pa(struct tdx_vp *td)
> return page_to_phys(td->tdvpr_page);
> }
>
> -/*
> - * The TDX module exposes a CLFLUSH_BEFORE_ALLOC bit to specify whether
> - * a CLFLUSH of pages is required before handing them to the TDX module.
> - * Be conservative and make the code simpler by doing the CLFLUSH
> - * unconditionally.
> - */
> static void tdx_clflush_page(struct page *page)
> {
> + u64 tdx_features0 = tdx_sysinfo.features.tdx_features0;
> +
> + if (tdx_features0 & TDX_FEATURES0_CLFLUSH_BEFORE_ALLOC)
According to the cover letter, if TDX_FEATURES0_CLFLUSH_BEFORE_ALLOC is enabled,
an explicit cache flush is necessary.
Shouldn't this and below be:
if (!(tdx_features0 & TDX_FEATURES0_CLFLUSH_BEFORE_ALLOC))
> + return;
> +
> clflush_cache_range(page_to_virt(page), PAGE_SIZE);
> }
>
> @@ -2030,8 +2029,12 @@ EXPORT_SYMBOL_GPL(tdh_phymem_cache_wb);
>
> u64 tdh_phymem_page_wbinvd_tdr(struct tdx_td *td)
> {
> + u64 tdx_features0 = tdx_sysinfo.features.tdx_features0;
> struct tdx_module_args args = {};
>
> + if (tdx_features0 & TDX_FEATURES0_CLFLUSH_BEFORE_ALLOC)
> + return 0;
> +
> args.rcx = mk_keyed_paddr(tdx_global_keyid, td->tdr_page);
>
> return seamcall(TDH_PHYMEM_PAGE_WBINVD, &args);
> @@ -2041,10 +2044,14 @@ EXPORT_SYMBOL_GPL(tdh_phymem_page_wbinvd_tdr);
> u64 tdh_phymem_page_wbinvd_hkid(u64 hkid, struct folio *folio,
> unsigned long start_idx, unsigned long npages)
> {
> + u64 tdx_features0 = tdx_sysinfo.features.tdx_features0;
> struct page *start = folio_page(folio, start_idx);
> struct tdx_module_args args = {};
> u64 err;
>
> + if (tdx_features0 & TDX_FEATURES0_CLFLUSH_BEFORE_ALLOC)
> + return 0;
> +
> if (start_idx + npages > folio_nr_pages(folio))
> return TDX_OPERAND_INVALID;
>
^ permalink raw reply	[flat|nested] 129+ messages in thread
* Re: [RFC PATCH v2 18/23] x86/virt/tdx: Do not perform cache flushes unless CLFLUSH_BEFORE_ALLOC is set
2025-09-04 8:16 ` Binbin Wu
@ 2025-09-04 9:50 ` Yan Zhao
2025-09-05 9:05 ` Binbin Wu
0 siblings, 1 reply; 129+ messages in thread
From: Yan Zhao @ 2025-09-04 9:50 UTC (permalink / raw)
To: Binbin Wu
Cc: pbonzini, seanjc, linux-kernel, kvm, x86, rick.p.edgecombe,
dave.hansen, kas, tabba, ackerleytng, quic_eberman, michael.roth,
david, vannapurve, vbabka, thomas.lendacky, pgonda, zhiquan1.li,
fan.du, jun.miao, ira.weiny, isaku.yamahata, xiaoyao.li,
chao.p.peng
On Thu, Sep 04, 2025 at 04:16:27PM +0800, Binbin Wu wrote:
>
>
> On 8/7/2025 5:45 PM, Yan Zhao wrote:
> > From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> >
> > The TDX module enumerates with a TDX_FEATURES0 bit if an explicit cache
> > flush is necessary when switching KeyID for a page, like before
> > handing the page over to a TD.
> >
> > Currently, none of the TDX-capable platforms have this bit enabled.
> >
> > Moreover, cache flushing with TDH.PHYMEM.PAGE.WBINVD fails if
> > Dynamic PAMT is active and the target page is not 4k. The SEAMCALL only
> > supports 4k pages and will fail if there is no PAMT_4K for the HPA.
> >
> > Avoid performing these cache flushes unless the CLFLUSH_BEFORE_ALLOC bit
> > of TDX_FEATURES0 is set.
> >
> > Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> > Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
> > ---
> > RFC v2:
> > - Pulled from
> > git://git.kernel.org/pub/scm/linux/kernel/git/kas/linux.git tdx/dpamt-huge.
> > - Rebased on top of TDX huge page RFC v2 (Yan)
> > ---
> > arch/x86/include/asm/tdx.h | 1 +
> > arch/x86/virt/vmx/tdx/tdx.c | 19 +++++++++++++------
> > 2 files changed, 14 insertions(+), 6 deletions(-)
> >
> > diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
> > index f1bd74348b34..c058a82d4a97 100644
> > --- a/arch/x86/include/asm/tdx.h
> > +++ b/arch/x86/include/asm/tdx.h
> > @@ -15,6 +15,7 @@
> > /* Bit definitions of TDX_FEATURES0 metadata field */
> > #define TDX_FEATURES0_NO_RBP_MOD BIT_ULL(18)
> > +#define TDX_FEATURES0_CLFLUSH_BEFORE_ALLOC BIT_ULL(23)
> > #define TDX_FEATURES0_DYNAMIC_PAMT BIT_ULL(36)
> > #ifndef __ASSEMBLER__
> > diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> > index 9ed585bde062..b7a0ee0f4a50 100644
> > --- a/arch/x86/virt/vmx/tdx/tdx.c
> > +++ b/arch/x86/virt/vmx/tdx/tdx.c
> > @@ -1648,14 +1648,13 @@ static inline u64 tdx_tdvpr_pa(struct tdx_vp *td)
> > return page_to_phys(td->tdvpr_page);
> > }
> > -/*
> > - * The TDX module exposes a CLFLUSH_BEFORE_ALLOC bit to specify whether
> > - * a CLFLUSH of pages is required before handing them to the TDX module.
> > - * Be conservative and make the code simpler by doing the CLFLUSH
> > - * unconditionally.
> > - */
> > static void tdx_clflush_page(struct page *page)
> > {
> > + u64 tdx_features0 = tdx_sysinfo.features.tdx_features0;
> > +
> > + if (tdx_features0 & TDX_FEATURES0_CLFLUSH_BEFORE_ALLOC)
>
> According to the cover letter, if TDX_FEATURES0_CLFLUSH_BEFORE_ALLOC is enabled,
> an explicit cache flush is necessary.
> Shouldn't this and below be:
> if (!(tdx_features0 & TDX_FEATURES0_CLFLUSH_BEFORE_ALLOC))
Right, Sagi also reported it.
https://lore.kernel.org/kvm/CAAhR5DEZZfX0=9QwBrXhC+1fp1Z0w4Xbb3mXcn0OuW+45tsLwA@mail.gmail.com/
> > + return;
> > +
> > clflush_cache_range(page_to_virt(page), PAGE_SIZE);
> > }
> > @@ -2030,8 +2029,12 @@ EXPORT_SYMBOL_GPL(tdh_phymem_cache_wb);
> > u64 tdh_phymem_page_wbinvd_tdr(struct tdx_td *td)
> > {
> > + u64 tdx_features0 = tdx_sysinfo.features.tdx_features0;
> > struct tdx_module_args args = {};
> > + if (tdx_features0 & TDX_FEATURES0_CLFLUSH_BEFORE_ALLOC)
> > + return 0;
> > +
> > args.rcx = mk_keyed_paddr(tdx_global_keyid, td->tdr_page);
> > return seamcall(TDH_PHYMEM_PAGE_WBINVD, &args);
> > @@ -2041,10 +2044,14 @@ EXPORT_SYMBOL_GPL(tdh_phymem_page_wbinvd_tdr);
> > u64 tdh_phymem_page_wbinvd_hkid(u64 hkid, struct folio *folio,
> > unsigned long start_idx, unsigned long npages)
> > {
> > + u64 tdx_features0 = tdx_sysinfo.features.tdx_features0;
> > struct page *start = folio_page(folio, start_idx);
> > struct tdx_module_args args = {};
> > u64 err;
> > + if (tdx_features0 & TDX_FEATURES0_CLFLUSH_BEFORE_ALLOC)
> > + return 0;
> > +
> > if (start_idx + npages > folio_nr_pages(folio))
> > return TDX_OPERAND_INVALID;
>
^ permalink raw reply	[flat|nested] 129+ messages in thread
* Re: [RFC PATCH v2 18/23] x86/virt/tdx: Do not perform cache flushes unless CLFLUSH_BEFORE_ALLOC is set
2025-09-04 9:50 ` Yan Zhao
@ 2025-09-05 9:05 ` Binbin Wu
0 siblings, 0 replies; 129+ messages in thread
From: Binbin Wu @ 2025-09-05 9:05 UTC (permalink / raw)
To: Yan Zhao
Cc: pbonzini, seanjc, linux-kernel, kvm, x86, rick.p.edgecombe,
dave.hansen, kas, tabba, ackerleytng, quic_eberman, michael.roth,
david, vannapurve, vbabka, thomas.lendacky, pgonda, zhiquan1.li,
fan.du, jun.miao, ira.weiny, isaku.yamahata, xiaoyao.li,
chao.p.peng
On 9/4/2025 5:50 PM, Yan Zhao wrote:
> On Thu, Sep 04, 2025 at 04:16:27PM +0800, Binbin Wu wrote:
>>
>> On 8/7/2025 5:45 PM, Yan Zhao wrote:
>>> From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
>>>
>>> The TDX module enumerates with a TDX_FEATURES0 bit if an explicit cache
>>> flush is necessary when switching KeyID for a page, like before
>>> handing the page over to a TD.
>>>
>>> Currently, none of the TDX-capable platforms have this bit enabled.
>>>
>>> Moreover, cache flushing with TDH.PHYMEM.PAGE.WBINVD fails if
>>> Dynamic PAMT is active and the target page is not 4k. The SEAMCALL only
>>> supports 4k pages and will fail if there is no PAMT_4K for the HPA.
>>>
>>> Avoid performing these cache flushes unless the CLFLUSH_BEFORE_ALLOC bit
>>> of TDX_FEATURES0 is set.
>>>
>>> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
>>> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
>>> ---
>>> RFC v2:
>>> - Pulled from
>>> git://git.kernel.org/pub/scm/linux/kernel/git/kas/linux.git tdx/dpamt-huge.
>>> - Rebased on top of TDX huge page RFC v2 (Yan)
>>> ---
>>> arch/x86/include/asm/tdx.h | 1 +
>>> arch/x86/virt/vmx/tdx/tdx.c | 19 +++++++++++++------
>>> 2 files changed, 14 insertions(+), 6 deletions(-)
>>>
>>> diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
>>> index f1bd74348b34..c058a82d4a97 100644
>>> --- a/arch/x86/include/asm/tdx.h
>>> +++ b/arch/x86/include/asm/tdx.h
>>> @@ -15,6 +15,7 @@
>>> /* Bit definitions of TDX_FEATURES0 metadata field */
>>> #define TDX_FEATURES0_NO_RBP_MOD BIT_ULL(18)
>>> +#define TDX_FEATURES0_CLFLUSH_BEFORE_ALLOC BIT_ULL(23)
>>> #define TDX_FEATURES0_DYNAMIC_PAMT BIT_ULL(36)
>>> #ifndef __ASSEMBLER__
>>> diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
>>> index 9ed585bde062..b7a0ee0f4a50 100644
>>> --- a/arch/x86/virt/vmx/tdx/tdx.c
>>> +++ b/arch/x86/virt/vmx/tdx/tdx.c
>>> @@ -1648,14 +1648,13 @@ static inline u64 tdx_tdvpr_pa(struct tdx_vp *td)
>>> return page_to_phys(td->tdvpr_page);
>>> }
>>> -/*
>>> - * The TDX module exposes a CLFLUSH_BEFORE_ALLOC bit to specify whether
>>> - * a CLFLUSH of pages is required before handing them to the TDX module.
>>> - * Be conservative and make the code simpler by doing the CLFLUSH
>>> - * unconditionally.
>>> - */
>>> static void tdx_clflush_page(struct page *page)
>>> {
>>> + u64 tdx_features0 = tdx_sysinfo.features.tdx_features0;
>>> +
>>> + if (tdx_features0 & TDX_FEATURES0_CLFLUSH_BEFORE_ALLOC)
>> According to the cover letter, if TDX_FEATURES0_CLFLUSH_BEFORE_ALLOC is enabled,
>> an explicit cache flush is necessary.
>> Shouldn't this and below be:
>> if (!(tdx_features0 & TDX_FEATURES0_CLFLUSH_BEFORE_ALLOC))
> Right, Sagi also reported it.
> https://lore.kernel.org/kvm/CAAhR5DEZZfX0=9QwBrXhC+1fp1Z0w4Xbb3mXcn0OuW+45tsLwA@mail.gmail.com/
>
>>> + return;
>>> +
>>> clflush_cache_range(page_to_virt(page), PAGE_SIZE);
>>> }
>>> @@ -2030,8 +2029,12 @@ EXPORT_SYMBOL_GPL(tdh_phymem_cache_wb);
>>> u64 tdh_phymem_page_wbinvd_tdr(struct tdx_td *td)
>>> {
>>> + u64 tdx_features0 = tdx_sysinfo.features.tdx_features0;
>>> struct tdx_module_args args = {};
>>> + if (tdx_features0 & TDX_FEATURES0_CLFLUSH_BEFORE_ALLOC)
>>> + return 0;
>>> +
According to the description in the TDX module base spec (348549006),
CLFLUSH_BEFORE_ALLOC is related to the clflush requirement before adding a page
to the TDX module.
If it also applies to pages returned from the TDX module, I think it needs
to be called out somewhere.
>>> args.rcx = mk_keyed_paddr(tdx_global_keyid, td->tdr_page);
>>> return seamcall(TDH_PHYMEM_PAGE_WBINVD, &args);
>>> @@ -2041,10 +2044,14 @@ EXPORT_SYMBOL_GPL(tdh_phymem_page_wbinvd_tdr);
>>> u64 tdh_phymem_page_wbinvd_hkid(u64 hkid, struct folio *folio,
>>> unsigned long start_idx, unsigned long npages)
>>> {
>>> + u64 tdx_features0 = tdx_sysinfo.features.tdx_features0;
>>> struct page *start = folio_page(folio, start_idx);
>>> struct tdx_module_args args = {};
>>> u64 err;
>>> + if (tdx_features0 & TDX_FEATURES0_CLFLUSH_BEFORE_ALLOC)
>>> + return 0;
>>> +
>>> if (start_idx + npages > folio_nr_pages(folio))
>>> return TDX_OPERAND_INVALID;
^ permalink raw reply [flat|nested] 129+ messages in thread
* Re: [RFC PATCH v2 18/23] x86/virt/tdx: Do not perform cache flushes unless CLFLUSH_BEFORE_ALLOC is set
2025-08-07 9:45 ` [RFC PATCH v2 18/23] x86/virt/tdx: Do not perform cache flushes unless CLFLUSH_BEFORE_ALLOC is set Yan Zhao
2025-08-11 21:10 ` Sagi Shahar
2025-09-04 8:16 ` Binbin Wu
@ 2025-09-05 15:41 ` Edgecombe, Rick P
2025-09-15 6:05 ` Yan Zhao
2 siblings, 1 reply; 129+ messages in thread
From: Edgecombe, Rick P @ 2025-09-05 15:41 UTC (permalink / raw)
To: pbonzini@redhat.com, seanjc@google.com, Zhao, Yan Y
Cc: quic_eberman@quicinc.com, kvm@vger.kernel.org, Li, Xiaoyao,
Du, Fan, Hansen, Dave, david@redhat.com, thomas.lendacky@amd.com,
vbabka@suse.cz, tabba@google.com, kas@kernel.org,
michael.roth@amd.com, Weiny, Ira, linux-kernel@vger.kernel.org,
binbin.wu@linux.intel.com, ackerleytng@google.com,
Yamahata, Isaku, Peng, Chao P, zhiquan1.li@intel.com,
Annapurve, Vishal, Miao, Jun, pgonda@google.com, x86@kernel.org
On Thu, 2025-08-07 at 17:45 +0800, Yan Zhao wrote:
> From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
>
> The TDX module enumerates with a TDX_FEATURES0 bit if an explicit cache
> flush is necessary when switching KeyID for a page, like before
> handing the page over to a TD.
>
> Currently, none of the TDX-capable platforms have this bit enabled.
>
> Moreover, cache flushing with TDH.PHYMEM.PAGE.WBINVD fails if
> Dynamic PAMT is active and the target page is not 4k. The SEAMCALL only
> supports 4k pages and will fail if there is no PAMT_4K for the HPA.
>
> Avoid performing these cache flushes unless the CLFLUSH_BEFORE_ALLOC bit
> of TDX_FEATURES0 is set.
>
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
I think I mentioned this on some version of this patch already, but during the
base series we decided to assume CLFLUSH_BEFORE_ALLOC was always set for
simplicity. Let's try to be consistent.
Why prepare for some future TDX module that sets CLFLUSH_BEFORE_ALLOC *and* adds
new support for larger page sizes to TDH.PHYMEM.PAGE.WBINVD? It almost seems
like this is working around a bug.
^ permalink raw reply [flat|nested] 129+ messages in thread
* Re: [RFC PATCH v2 18/23] x86/virt/tdx: Do not perform cache flushes unless CLFLUSH_BEFORE_ALLOC is set
2025-09-05 15:41 ` Edgecombe, Rick P
@ 2025-09-15 6:05 ` Yan Zhao
0 siblings, 0 replies; 129+ messages in thread
From: Yan Zhao @ 2025-09-15 6:05 UTC (permalink / raw)
To: Edgecombe, Rick P
Cc: pbonzini@redhat.com, seanjc@google.com, kvm@vger.kernel.org,
Li, Xiaoyao, Du, Fan, Hansen, Dave, david@redhat.com,
thomas.lendacky@amd.com, vbabka@suse.cz, tabba@google.com,
kas@kernel.org, michael.roth@amd.com, Weiny, Ira,
linux-kernel@vger.kernel.org, binbin.wu@linux.intel.com,
ackerleytng@google.com, Yamahata, Isaku, Peng, Chao P,
Annapurve, Vishal, Miao, Jun, pgonda@google.com, x86@kernel.org
On Fri, Sep 05, 2025 at 11:41:41PM +0800, Edgecombe, Rick P wrote:
> On Thu, 2025-08-07 at 17:45 +0800, Yan Zhao wrote:
> > From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> >
> > The TDX module enumerates with a TDX_FEATURES0 bit if an explicit cache
> > flush is necessary when switching KeyID for a page, like before
> > handing the page over to a TD.
> >
> > Currently, none of the TDX-capable platforms have this bit enabled.
> >
> > Moreover, cache flushing with TDH.PHYMEM.PAGE.WBINVD fails if
> > Dynamic PAMT is active and the target page is not 4k. The SEAMCALL only
> > supports 4k pages and will fail if there is no PAMT_4K for the HPA.
Back when Kirill was enabling DPAMT for huge pages, TDH.PHYMEM.PAGE.WBINVD
did fail in this scenario.
However, this has been fixed in TDX_1.5.20, which is why I didn't encounter
this issue when posting this series.
> > Avoid performing these cache flushes unless the CLFLUSH_BEFORE_ALLOC bit
> > of TDX_FEATURES0 is set.
> >
> > Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> > Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
>
> I think I mentioned this on some version of this patch already, but during the
> base series we decided to assume CLFLUSH_BEFORE_ALLOC was always set for
> simplicity. Let's try to be consistent.
Right. Though CLFLUSH_BEFORE_ALLOC is always false in all current TDX modules,
Linux conservatively assumes it's always true.
> Why prepare for some future TDX module that sets CLFLUSH_BEFORE_ALLOC *and* adds
> new support for larger page sizes to TDH.PHYMEM.PAGE.WBINVD? It almost seems
> like this is working around a bug.
As the TDX module bug is gone, let's drop this patch to be consistent with the
policy of assuming CLFLUSH_BEFORE_ALLOC is always true.
^ permalink raw reply [flat|nested] 129+ messages in thread
* [RFC PATCH v2 19/23] KVM: TDX: Pass down pfn to split_external_spt()
2025-08-07 9:39 [RFC PATCH v2 00/23] KVM: TDX huge page support for private memory Yan Zhao
` (17 preceding siblings ...)
2025-08-07 9:45 ` [RFC PATCH v2 18/23] x86/virt/tdx: Do not perform cache flushes unless CLFLUSH_BEFORE_ALLOC is set Yan Zhao
@ 2025-08-07 9:45 ` Yan Zhao
2025-09-04 8:30 ` Binbin Wu
2025-08-07 9:45 ` [RFC PATCH v2 20/23] KVM: TDX: Handle Dynamic PAMT in tdh_mem_page_demote() Yan Zhao
` (3 subsequent siblings)
22 siblings, 1 reply; 129+ messages in thread
From: Yan Zhao @ 2025-08-07 9:45 UTC (permalink / raw)
To: pbonzini, seanjc
Cc: linux-kernel, kvm, x86, rick.p.edgecombe, dave.hansen, kas, tabba,
ackerleytng, quic_eberman, michael.roth, david, vannapurve,
vbabka, thomas.lendacky, pgonda, zhiquan1.li, fan.du, jun.miao,
ira.weiny, isaku.yamahata, xiaoyao.li, binbin.wu, chao.p.peng,
yan.y.zhao
From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Pass down pfn to kvm_x86_ops::split_external_spt(). It is required for
handling Dynamic PAMT in tdx_sept_split_private_spt().
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
---
RFC v2:
- Pulled from
git://git.kernel.org/pub/scm/linux/kernel/git/kas/linux.git tdx/dpamt-huge.
- Rebased on top of TDX huge page RFC v2 (Yan)
---
arch/x86/include/asm/kvm_host.h | 3 ++-
arch/x86/kvm/mmu/tdp_mmu.c | 6 +++++-
arch/x86/kvm/vmx/tdx.c | 3 ++-
3 files changed, 9 insertions(+), 3 deletions(-)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 6cb5b422dd1d..6b6c46c27390 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1841,7 +1841,8 @@ struct kvm_x86_ops {
/* Split the external page table into smaller page tables */
int (*split_external_spt)(struct kvm *kvm, gfn_t gfn, enum pg_level level,
- void *external_spt, bool mmu_lock_shared);
+ kvm_pfn_t pfn_for_gfn, void *external_spt,
+ bool mmu_lock_shared);
bool (*has_wbinvd_exit)(void);
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 62a09a9655c3..eb758aaa4374 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -389,11 +389,15 @@ static int split_external_spt(struct kvm *kvm, gfn_t gfn, u64 old_spte,
u64 new_spte, int level, bool shared)
{
void *external_spt = get_external_spt(gfn, new_spte, level);
+ kvm_pfn_t pfn_for_gfn = spte_to_pfn(old_spte);
int ret;
KVM_BUG_ON(!external_spt, kvm);
- ret = kvm_x86_call(split_external_spt)(kvm, gfn, level, external_spt, shared);
+ ret = kvm_x86_call(split_external_spt)(kvm, gfn, level,
+ pfn_for_gfn, external_spt,
+ shared);
+
return ret;
}
/**
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 71115058e5e6..24aa9aaad6d8 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1941,7 +1941,8 @@ static int tdx_spte_demote_private_spte(struct kvm *kvm, gfn_t gfn,
}
static int tdx_sept_split_private_spt(struct kvm *kvm, gfn_t gfn, enum pg_level level,
- void *private_spt, bool mmu_lock_shared)
+ kvm_pfn_t pfn_for_gfn, void *private_spt,
+ bool mmu_lock_shared)
{
struct page *page = virt_to_page(private_spt);
int ret;
--
2.43.2
^ permalink raw reply related	[flat|nested] 129+ messages in thread
* Re: [RFC PATCH v2 19/23] KVM: TDX: Pass down pfn to split_external_spt()
2025-08-07 9:45 ` [RFC PATCH v2 19/23] KVM: TDX: Pass down pfn to split_external_spt() Yan Zhao
@ 2025-09-04 8:30 ` Binbin Wu
0 siblings, 0 replies; 129+ messages in thread
From: Binbin Wu @ 2025-09-04 8:30 UTC (permalink / raw)
To: Yan Zhao
Cc: pbonzini, seanjc, linux-kernel, kvm, x86, rick.p.edgecombe,
dave.hansen, kas, tabba, ackerleytng, quic_eberman, michael.roth,
david, vannapurve, vbabka, thomas.lendacky, pgonda, zhiquan1.li,
fan.du, jun.miao, ira.weiny, isaku.yamahata, xiaoyao.li,
chao.p.peng
On 8/7/2025 5:45 PM, Yan Zhao wrote:
> From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
>
> Pass down pfn to kvm_x86_ops::split_external_spt(). It is required for
> handling Dynamic PAMT in tdx_sept_split_private_spt().
>
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
> ---
> RFC v2:
> - Pulled from
> git://git.kernel.org/pub/scm/linux/kernel/git/kas/linux.git tdx/dpamt-huge.
> - Rebased on top of TDX huge page RFC v2 (Yan)
> ---
> arch/x86/include/asm/kvm_host.h | 3 ++-
> arch/x86/kvm/mmu/tdp_mmu.c | 6 +++++-
> arch/x86/kvm/vmx/tdx.c | 3 ++-
> 3 files changed, 9 insertions(+), 3 deletions(-)
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 6cb5b422dd1d..6b6c46c27390 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1841,7 +1841,8 @@ struct kvm_x86_ops {
>
> /* Split the external page table into smaller page tables */
> int (*split_external_spt)(struct kvm *kvm, gfn_t gfn, enum pg_level level,
> - void *external_spt, bool mmu_lock_shared);
> + kvm_pfn_t pfn_for_gfn, void *external_spt,
> + bool mmu_lock_shared);
>
> bool (*has_wbinvd_exit)(void);
>
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 62a09a9655c3..eb758aaa4374 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -389,11 +389,15 @@ static int split_external_spt(struct kvm *kvm, gfn_t gfn, u64 old_spte,
> u64 new_spte, int level, bool shared)
> {
> void *external_spt = get_external_spt(gfn, new_spte, level);
> + kvm_pfn_t pfn_for_gfn = spte_to_pfn(old_spte);
> int ret;
>
> KVM_BUG_ON(!external_spt, kvm);
>
> - ret = kvm_x86_call(split_external_spt)(kvm, gfn, level, external_spt, shared);
> + ret = kvm_x86_call(split_external_spt)(kvm, gfn, level,
> + pfn_for_gfn, external_spt,
> + shared);
It can save one line by moving "pfn_for_gfn" up.
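E.g. (sketch):

	ret = kvm_x86_call(split_external_spt)(kvm, gfn, level, pfn_for_gfn,
					       external_spt, shared);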
> +
> return ret;
> }
> /**
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index 71115058e5e6..24aa9aaad6d8 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -1941,7 +1941,8 @@ static int tdx_spte_demote_private_spte(struct kvm *kvm, gfn_t gfn,
> }
>
> static int tdx_sept_split_private_spt(struct kvm *kvm, gfn_t gfn, enum pg_level level,
> - void *private_spt, bool mmu_lock_shared)
> + kvm_pfn_t pfn_for_gfn, void *private_spt,
> + bool mmu_lock_shared)
> {
> struct page *page = virt_to_page(private_spt);
> int ret;
^ permalink raw reply [flat|nested] 129+ messages in thread
* [RFC PATCH v2 20/23] KVM: TDX: Handle Dynamic PAMT in tdh_mem_page_demote()
2025-08-07 9:39 [RFC PATCH v2 00/23] KVM: TDX huge page support for private memory Yan Zhao
` (18 preceding siblings ...)
2025-08-07 9:45 ` [RFC PATCH v2 19/23] KVM: TDX: Pass down pfn to split_external_spt() Yan Zhao
@ 2025-08-07 9:45 ` Yan Zhao
2025-08-07 9:46 ` [RFC PATCH v2 21/23] KVM: TDX: Preallocate PAMT pages to be used in split path Yan Zhao
` (2 subsequent siblings)
22 siblings, 0 replies; 129+ messages in thread
From: Yan Zhao @ 2025-08-07 9:45 UTC (permalink / raw)
To: pbonzini, seanjc
Cc: linux-kernel, kvm, x86, rick.p.edgecombe, dave.hansen, kas, tabba,
ackerleytng, quic_eberman, michael.roth, david, vannapurve,
vbabka, thomas.lendacky, pgonda, zhiquan1.li, fan.du, jun.miao,
ira.weiny, isaku.yamahata, xiaoyao.li, binbin.wu, chao.p.peng,
yan.y.zhao
From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
If Dynamic PAMT is enabled, TDH.MEM.PAGE.DEMOTE will take the PAMT page
pair in registers R12 and R13.
Pass the pamt_pages list down to tdh_mem_page_demote() and populate
registers R12 and R13 from it.
Instead of using seamcall_ret(), use seamcall_saved_ret() as it can
handle registers above R11.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
---
RFC v2:
- Pulled from
git://git.kernel.org/pub/scm/linux/kernel/git/kas/linux.git tdx/dpamt-huge.
- Rebased on top of TDX huge page RFC v2 (Yan).
---
arch/x86/include/asm/tdx.h | 1 +
arch/x86/kvm/vmx/tdx.c | 4 ++--
arch/x86/virt/vmx/tdx/tdx.c | 13 +++++++++++--
3 files changed, 14 insertions(+), 4 deletions(-)
diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index c058a82d4a97..2e529f0c578a 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -180,6 +180,7 @@ u64 tdh_mng_create(struct tdx_td *td, u16 hkid);
u64 tdh_vp_create(struct tdx_td *td, struct tdx_vp *vp);
u64 tdh_mng_rd(struct tdx_td *td, u64 field, u64 *data);
u64 tdh_mem_page_demote(struct tdx_td *td, u64 gpa, int level, struct page *page,
+ struct list_head *pamt_pages,
u64 *ext_err1, u64 *ext_err2);
u64 tdh_mr_extend(struct tdx_td *td, u64 gpa, u64 *ext_err1, u64 *ext_err2);
u64 tdh_mr_finalize(struct tdx_td *td);
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 24aa9aaad6d8..9d24a1a86a23 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1924,12 +1924,12 @@ static int tdx_spte_demote_private_spte(struct kvm *kvm, gfn_t gfn,
u64 err, entry, level_state;
err = tdh_mem_page_demote(&kvm_tdx->td, gpa, tdx_level, page,
- &entry, &level_state);
+ NULL, &entry, &level_state);
if (unlikely(tdx_operand_busy(err))) {
tdx_no_vcpus_enter_start(kvm);
err = tdh_mem_page_demote(&kvm_tdx->td, gpa, tdx_level, page,
- &entry, &level_state);
+ NULL, &entry, &level_state);
tdx_no_vcpus_enter_stop(kvm);
}
diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index b7a0ee0f4a50..50f9d49f1c91 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -1825,6 +1825,7 @@ u64 tdh_mng_rd(struct tdx_td *td, u64 field, u64 *data)
EXPORT_SYMBOL_GPL(tdh_mng_rd);
u64 tdh_mem_page_demote(struct tdx_td *td, u64 gpa, int level, struct page *page,
+ struct list_head *pamt_pages,
u64 *ext_err1, u64 *ext_err2)
{
struct tdx_module_args args = {
@@ -1832,10 +1833,18 @@ u64 tdh_mem_page_demote(struct tdx_td *td, u64 gpa, int level, struct page *page
.rdx = tdx_tdr_pa(td),
.r8 = page_to_phys(page),
};
- u64 ret;
+ struct page *pamt_page;
+ u64 *p, ret;
+ if (level == TDX_PS_2M) {
+ p = &args.r12;
+ list_for_each_entry(pamt_page, pamt_pages, lru) {
+ *p = page_to_phys(pamt_page);
+ p++;
+ }
+ }
tdx_clflush_page(page);
- ret = seamcall_ret(TDH_MEM_PAGE_DEMOTE, &args);
+ ret = seamcall_saved_ret(TDH_MEM_PAGE_DEMOTE, &args);
*ext_err1 = args.rcx;
*ext_err2 = args.rdx;
--
2.43.2
^ permalink raw reply related	[flat|nested] 129+ messages in thread
* [RFC PATCH v2 21/23] KVM: TDX: Preallocate PAMT pages to be used in split path
2025-08-07 9:39 [RFC PATCH v2 00/23] KVM: TDX huge page support for private memory Yan Zhao
` (19 preceding siblings ...)
2025-08-07 9:45 ` [RFC PATCH v2 20/23] KVM: TDX: Handle Dynamic PAMT in tdh_mem_page_demote() Yan Zhao
@ 2025-08-07 9:46 ` Yan Zhao
2025-09-04 9:17 ` Binbin Wu
2025-12-05 6:14 ` Sagi Shahar
2025-08-07 9:46 ` [RFC PATCH v2 22/23] KVM: TDX: Handle Dynamic PAMT on page split Yan Zhao
2025-08-07 9:46 ` [RFC PATCH v2 23/23] KVM: TDX: Turn on PG_LEVEL_2M after TD is RUNNABLE Yan Zhao
22 siblings, 2 replies; 129+ messages in thread
From: Yan Zhao @ 2025-08-07 9:46 UTC (permalink / raw)
To: pbonzini, seanjc
Cc: linux-kernel, kvm, x86, rick.p.edgecombe, dave.hansen, kas, tabba,
ackerleytng, quic_eberman, michael.roth, david, vannapurve,
vbabka, thomas.lendacky, pgonda, zhiquan1.li, fan.du, jun.miao,
ira.weiny, isaku.yamahata, xiaoyao.li, binbin.wu, chao.p.peng,
yan.y.zhao
From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Preallocate a page to be used in the split_external_spt() path.
Kernel needs one PAMT page pair for external_spt and one that provided
directly to the TDH.MEM.PAGE.DEMOTE SEAMCALL.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Co-developed-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
---
RFC v2:
- Pulled from
git://git.kernel.org/pub/scm/linux/kernel/git/kas/linux.git tdx/dpamt-huge.
- Implemented the flow of topup pamt_page_cache in
tdp_mmu_split_huge_pages_root() (Yan)
---
arch/x86/include/asm/kvm_host.h | 2 ++
arch/x86/kvm/mmu/mmu.c | 1 +
arch/x86/kvm/mmu/tdp_mmu.c | 51 +++++++++++++++++++++++++++++++++
3 files changed, 54 insertions(+)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 6b6c46c27390..508b133df903 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1591,6 +1591,8 @@ struct kvm_arch {
#define SPLIT_DESC_CACHE_MIN_NR_OBJECTS (SPTE_ENT_PER_PAGE + 1)
struct kvm_mmu_memory_cache split_desc_cache;
+ struct kvm_mmu_memory_cache pamt_page_cache;
+
gfn_t gfn_direct_bits;
/*
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index f23d8fc59323..e581cee37f64 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -6848,6 +6848,7 @@ static void mmu_free_vm_memory_caches(struct kvm *kvm)
kvm_mmu_free_memory_cache(&kvm->arch.split_desc_cache);
kvm_mmu_free_memory_cache(&kvm->arch.split_page_header_cache);
kvm_mmu_free_memory_cache(&kvm->arch.split_shadow_page_cache);
+ kvm_mmu_free_memory_cache(&kvm->arch.pamt_page_cache);
}
void kvm_mmu_uninit_vm(struct kvm *kvm)
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index eb758aaa4374..064c4e823658 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1584,6 +1584,27 @@ static bool iter_cross_boundary(struct tdp_iter *iter, gfn_t start, gfn_t end)
(iter->gfn + KVM_PAGES_PER_HPAGE(iter->level)) <= end);
}
+static bool need_topup_mirror_caches(struct kvm *kvm)
+{
+ int nr = tdx_nr_pamt_pages() * 2;
+
+ return kvm_mmu_memory_cache_nr_free_objects(&kvm->arch.pamt_page_cache) < nr;
+}
+
+static int topup_mirror_caches(struct kvm *kvm)
+{
+ int r, nr;
+
+ /* One for external_spt, one for TDH.MEM.PAGE.DEMOTE */
+ nr = tdx_nr_pamt_pages() * 2;
+
+ r = kvm_mmu_topup_memory_cache(&kvm->arch.pamt_page_cache, nr);
+ if (r)
+ return r;
+
+ return 0;
+}
+
static int tdp_mmu_split_huge_pages_root(struct kvm *kvm,
struct kvm_mmu_page *root,
gfn_t start, gfn_t end,
@@ -1656,6 +1677,36 @@ static int tdp_mmu_split_huge_pages_root(struct kvm *kvm,
continue;
}
+ if (is_mirror_sp(root) && need_topup_mirror_caches(kvm)) {
+ int r;
+
+ rcu_read_unlock();
+
+ if (shared)
+ read_unlock(&kvm->mmu_lock);
+ else
+ write_unlock(&kvm->mmu_lock);
+
+ r = topup_mirror_caches(kvm);
+
+ if (shared)
+ read_lock(&kvm->mmu_lock);
+ else
+ write_lock(&kvm->mmu_lock);
+
+ if (r) {
+ trace_kvm_mmu_split_huge_page(iter.gfn,
+ iter.old_spte,
+ iter.level, r);
+ return r;
+ }
+
+ rcu_read_lock();
+
+ iter.yielded = true;
+ continue;
+ }
+
tdp_mmu_init_child_sp(sp, &iter);
if (tdp_mmu_split_huge_page(kvm, &iter, sp, shared))
--
2.43.2
^ permalink raw reply related [flat|nested] 129+ messages in thread
* Re: [RFC PATCH v2 21/23] KVM: TDX: Preallocate PAMT pages to be used in split path
2025-08-07 9:46 ` [RFC PATCH v2 21/23] KVM: TDX: Preallocate PAMT pages to be used in split path Yan Zhao
@ 2025-09-04 9:17 ` Binbin Wu
2025-09-04 9:58 ` Yan Zhao
2025-12-05 6:14 ` Sagi Shahar
1 sibling, 1 reply; 129+ messages in thread
From: Binbin Wu @ 2025-09-04 9:17 UTC (permalink / raw)
To: Yan Zhao
Cc: pbonzini, seanjc, linux-kernel, kvm, x86, rick.p.edgecombe,
dave.hansen, kas, tabba, ackerleytng, quic_eberman, michael.roth,
david, vannapurve, vbabka, thomas.lendacky, pgonda, zhiquan1.li,
fan.du, jun.miao, ira.weiny, isaku.yamahata, xiaoyao.li,
chao.p.peng
On 8/7/2025 5:46 PM, Yan Zhao wrote:
> From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
>
> Preallocate a page to be used in the split_external_spt() path.
Not just "a" page.
>
> Kernel needs one PAMT page pair for external_spt and one that provided
> directly to the TDH.MEM.PAGE.DEMOTE SEAMCALL.
>
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Co-developed-by: Yan Zhao <yan.y.zhao@intel.com>
> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
> ---
> RFC v2:
> - Pulled from
> git://git.kernel.org/pub/scm/linux/kernel/git/kas/linux.git tdx/dpamt-huge.
> - Implemented the flow of topup pamt_page_cache in
> tdp_mmu_split_huge_pages_root() (Yan)
> ---
> arch/x86/include/asm/kvm_host.h | 2 ++
> arch/x86/kvm/mmu/mmu.c | 1 +
> arch/x86/kvm/mmu/tdp_mmu.c | 51 +++++++++++++++++++++++++++++++++
> 3 files changed, 54 insertions(+)
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 6b6c46c27390..508b133df903 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1591,6 +1591,8 @@ struct kvm_arch {
> #define SPLIT_DESC_CACHE_MIN_NR_OBJECTS (SPTE_ENT_PER_PAGE + 1)
> struct kvm_mmu_memory_cache split_desc_cache;
>
> + struct kvm_mmu_memory_cache pamt_page_cache;
> +
> gfn_t gfn_direct_bits;
>
> /*
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index f23d8fc59323..e581cee37f64 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -6848,6 +6848,7 @@ static void mmu_free_vm_memory_caches(struct kvm *kvm)
> kvm_mmu_free_memory_cache(&kvm->arch.split_desc_cache);
> kvm_mmu_free_memory_cache(&kvm->arch.split_page_header_cache);
> kvm_mmu_free_memory_cache(&kvm->arch.split_shadow_page_cache);
> + kvm_mmu_free_memory_cache(&kvm->arch.pamt_page_cache);
> }
>
> void kvm_mmu_uninit_vm(struct kvm *kvm)
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index eb758aaa4374..064c4e823658 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -1584,6 +1584,27 @@ static bool iter_cross_boundary(struct tdp_iter *iter, gfn_t start, gfn_t end)
> (iter->gfn + KVM_PAGES_PER_HPAGE(iter->level)) <= end);
> }
>
> +static bool need_topup_mirror_caches(struct kvm *kvm)
> +{
> + int nr = tdx_nr_pamt_pages() * 2;
> +
> + return kvm_mmu_memory_cache_nr_free_objects(&kvm->arch.pamt_page_cache) < nr;
> +}
> +
> +static int topup_mirror_caches(struct kvm *kvm)
> +{
> + int r, nr;
> +
> + /* One for external_spt, one for TDH.MEM.PAGE.DEMOTE */
The comment is a bit confusing.
IIUC, external_spt is also for TDH.MEM.PAGE.DEMOTE.
and it's "one pair" for PAMT pages.
> + nr = tdx_nr_pamt_pages() * 2;
> +
> + r = kvm_mmu_topup_memory_cache(&kvm->arch.pamt_page_cache, nr);
> + if (r)
> + return r;
> +
> + return 0;
This could be simplified:
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 064c4e823658..35d052aa408c 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1593,16 +1593,12 @@ static bool need_topup_mirror_caches(struct kvm *kvm)
static int topup_mirror_caches(struct kvm *kvm)
{
- int r, nr;
+ int nr;
/* One for external_spt, one for TDH.MEM.PAGE.DEMOTE */
nr = tdx_nr_pamt_pages() * 2;
- r = kvm_mmu_topup_memory_cache(&kvm->arch.pamt_page_cache, nr);
- if (r)
- return r;
-
- return 0;
+ return kvm_mmu_topup_memory_cache(&kvm->arch.pamt_page_cache, nr);
}
> +}
> +
> static int tdp_mmu_split_huge_pages_root(struct kvm *kvm,
> struct kvm_mmu_page *root,
> gfn_t start, gfn_t end,
> @@ -1656,6 +1677,36 @@ static int tdp_mmu_split_huge_pages_root(struct kvm *kvm,
> continue;
> }
>
> + if (is_mirror_sp(root) && need_topup_mirror_caches(kvm)) {
> + int r;
> +
> + rcu_read_unlock();
> +
> + if (shared)
> + read_unlock(&kvm->mmu_lock);
> + else
> + write_unlock(&kvm->mmu_lock);
> +
> + r = topup_mirror_caches(kvm);
> +
> + if (shared)
> + read_lock(&kvm->mmu_lock);
> + else
> + write_lock(&kvm->mmu_lock);
> +
> + if (r) {
> + trace_kvm_mmu_split_huge_page(iter.gfn,
> + iter.old_spte,
> + iter.level, r);
> + return r;
> + }
> +
> + rcu_read_lock();
> +
> + iter.yielded = true;
> + continue;
> + }
> +
> tdp_mmu_init_child_sp(sp, &iter);
>
> if (tdp_mmu_split_huge_page(kvm, &iter, sp, shared))
^ permalink raw reply related [flat|nested] 129+ messages in thread* Re: [RFC PATCH v2 21/23] KVM: TDX: Preallocate PAMT pages to be used in split path
2025-09-04 9:17 ` Binbin Wu
@ 2025-09-04 9:58 ` Yan Zhao
0 siblings, 0 replies; 129+ messages in thread
From: Yan Zhao @ 2025-09-04 9:58 UTC (permalink / raw)
To: Binbin Wu
Cc: pbonzini, seanjc, linux-kernel, kvm, x86, rick.p.edgecombe,
dave.hansen, kas, tabba, ackerleytng, quic_eberman, michael.roth,
david, vannapurve, vbabka, thomas.lendacky, pgonda, fan.du,
jun.miao, ira.weiny, isaku.yamahata, xiaoyao.li, chao.p.peng
On Thu, Sep 04, 2025 at 05:17:40PM +0800, Binbin Wu wrote:
>
>
> On 8/7/2025 5:46 PM, Yan Zhao wrote:
> > From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> >
> > Preallocate a page to be used in the split_external_spt() path.
>
> Not just "a" page.
>
> >
> > Kernel needs one PAMT page pair for external_spt and one that provided
> > directly to the TDH.MEM.PAGE.DEMOTE SEAMCALL.
> >
> > Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> > Co-developed-by: Yan Zhao <yan.y.zhao@intel.com>
> > Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
> > ---
> > RFC v2:
> > - Pulled from
> > git://git.kernel.org/pub/scm/linux/kernel/git/kas/linux.git tdx/dpamt-huge.
> > - Implemented the flow of topup pamt_page_cache in
> > tdp_mmu_split_huge_pages_root() (Yan)
> > ---
> > arch/x86/include/asm/kvm_host.h | 2 ++
> > arch/x86/kvm/mmu/mmu.c | 1 +
> > arch/x86/kvm/mmu/tdp_mmu.c | 51 +++++++++++++++++++++++++++++++++
> > 3 files changed, 54 insertions(+)
> >
> > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> > index 6b6c46c27390..508b133df903 100644
> > --- a/arch/x86/include/asm/kvm_host.h
> > +++ b/arch/x86/include/asm/kvm_host.h
> > @@ -1591,6 +1591,8 @@ struct kvm_arch {
> > #define SPLIT_DESC_CACHE_MIN_NR_OBJECTS (SPTE_ENT_PER_PAGE + 1)
> > struct kvm_mmu_memory_cache split_desc_cache;
> > + struct kvm_mmu_memory_cache pamt_page_cache;
> > +
> > gfn_t gfn_direct_bits;
> > /*
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index f23d8fc59323..e581cee37f64 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -6848,6 +6848,7 @@ static void mmu_free_vm_memory_caches(struct kvm *kvm)
> > kvm_mmu_free_memory_cache(&kvm->arch.split_desc_cache);
> > kvm_mmu_free_memory_cache(&kvm->arch.split_page_header_cache);
> > kvm_mmu_free_memory_cache(&kvm->arch.split_shadow_page_cache);
> > + kvm_mmu_free_memory_cache(&kvm->arch.pamt_page_cache);
> > }
> > void kvm_mmu_uninit_vm(struct kvm *kvm)
> > diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> > index eb758aaa4374..064c4e823658 100644
> > --- a/arch/x86/kvm/mmu/tdp_mmu.c
> > +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> > @@ -1584,6 +1584,27 @@ static bool iter_cross_boundary(struct tdp_iter *iter, gfn_t start, gfn_t end)
> > (iter->gfn + KVM_PAGES_PER_HPAGE(iter->level)) <= end);
> > }
> > +static bool need_topup_mirror_caches(struct kvm *kvm)
> > +{
> > + int nr = tdx_nr_pamt_pages() * 2;
> > +
> > + return kvm_mmu_memory_cache_nr_free_objects(&kvm->arch.pamt_page_cache) < nr;
> > +}
> > +
> > +static int topup_mirror_caches(struct kvm *kvm)
> > +{
> > + int r, nr;
> > +
> > + /* One for external_spt, one for TDH.MEM.PAGE.DEMOTE */
>
> The comment is a bit confusing.
> IIUC, external_spt is also for TDH.MEM.PAGE.DEMOTE.
> and it's "one pair" for PAMT pages.
Should be:
one pair of PAMT pages for the page table page added by the split, and
another pair for the guest private page being demoted.
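In code form, the reworded comment might read like this in
topup_mirror_caches() (a sketch of the wording only, no functional change):

	/*
	 * One PAMT page pair for the page table page added by the split,
	 * and another pair for the guest private page being demoted.
	 */
	nr = tdx_nr_pamt_pages() * 2;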
> > + nr = tdx_nr_pamt_pages() * 2;
> > +
> > + r = kvm_mmu_topup_memory_cache(&kvm->arch.pamt_page_cache, nr);
> > + if (r)
> > + return r;
> > +
> > + return 0;
>
> This could be simplified:
Indeed.
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 064c4e823658..35d052aa408c 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -1593,16 +1593,12 @@ static bool need_topup_mirror_caches(struct kvm *kvm)
>
> static int topup_mirror_caches(struct kvm *kvm)
> {
> - int r, nr;
> + int nr;
>
> /* One for external_spt, one for TDH.MEM.PAGE.DEMOTE */
> nr = tdx_nr_pamt_pages() * 2;
>
> - r = kvm_mmu_topup_memory_cache(&kvm->arch.pamt_page_cache, nr);
> - if (r)
> - return r;
> -
> - return 0;
> + return kvm_mmu_topup_memory_cache(&kvm->arch.pamt_page_cache, nr);
> }
>
> > +}
> > +
> > static int tdp_mmu_split_huge_pages_root(struct kvm *kvm,
> > struct kvm_mmu_page *root,
> > gfn_t start, gfn_t end,
> > @@ -1656,6 +1677,36 @@ static int tdp_mmu_split_huge_pages_root(struct kvm *kvm,
> > continue;
> > }
> > + if (is_mirror_sp(root) && need_topup_mirror_caches(kvm)) {
> > + int r;
> > +
> > + rcu_read_unlock();
> > +
> > + if (shared)
> > + read_unlock(&kvm->mmu_lock);
> > + else
> > + write_unlock(&kvm->mmu_lock);
> > +
> > + r = topup_mirror_caches(kvm);
> > +
> > + if (shared)
> > + read_lock(&kvm->mmu_lock);
> > + else
> > + write_lock(&kvm->mmu_lock);
> > +
> > + if (r) {
> > + trace_kvm_mmu_split_huge_page(iter.gfn,
> > + iter.old_spte,
> > + iter.level, r);
> > + return r;
> > + }
> > +
> > + rcu_read_lock();
> > +
> > + iter.yielded = true;
> > + continue;
> > + }
> > +
> > tdp_mmu_init_child_sp(sp, &iter);
> > if (tdp_mmu_split_huge_page(kvm, &iter, sp, shared))
>
^ permalink raw reply [flat|nested] 129+ messages in thread
* Re: [RFC PATCH v2 21/23] KVM: TDX: Preallocate PAMT pages to be used in split path
2025-08-07 9:46 ` [RFC PATCH v2 21/23] KVM: TDX: Preallocate PAMT pages to be used in split path Yan Zhao
2025-09-04 9:17 ` Binbin Wu
@ 2025-12-05 6:14 ` Sagi Shahar
2025-12-08 5:49 ` Yan Zhao
1 sibling, 1 reply; 129+ messages in thread
From: Sagi Shahar @ 2025-12-05 6:14 UTC (permalink / raw)
To: Yan Zhao
Cc: pbonzini, seanjc, linux-kernel, kvm, x86, rick.p.edgecombe,
dave.hansen, kas, tabba, ackerleytng, quic_eberman, michael.roth,
david, vannapurve, vbabka, thomas.lendacky, pgonda, zhiquan1.li,
fan.du, jun.miao, ira.weiny, isaku.yamahata, xiaoyao.li,
binbin.wu, chao.p.peng
On Thu, Aug 7, 2025 at 4:48 AM Yan Zhao <yan.y.zhao@intel.com> wrote:
>
> From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
>
> Preallocate a page to be used in the split_external_spt() path.
>
> Kernel needs one PAMT page pair for external_spt and one that provided
> directly to the TDH.MEM.PAGE.DEMOTE SEAMCALL.
>
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Co-developed-by: Yan Zhao <yan.y.zhao@intel.com>
> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
> ---
> RFC v2:
> - Pulled from
> git://git.kernel.org/pub/scm/linux/kernel/git/kas/linux.git tdx/dpamt-huge.
> - Implemented the flow of topup pamt_page_cache in
> tdp_mmu_split_huge_pages_root() (Yan)
> ---
> arch/x86/include/asm/kvm_host.h | 2 ++
> arch/x86/kvm/mmu/mmu.c | 1 +
> arch/x86/kvm/mmu/tdp_mmu.c | 51 +++++++++++++++++++++++++++++++++
> 3 files changed, 54 insertions(+)
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 6b6c46c27390..508b133df903 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1591,6 +1591,8 @@ struct kvm_arch {
> #define SPLIT_DESC_CACHE_MIN_NR_OBJECTS (SPTE_ENT_PER_PAGE + 1)
> struct kvm_mmu_memory_cache split_desc_cache;
>
> + struct kvm_mmu_memory_cache pamt_page_cache;
> +
The latest DPAMT patches use a per-vcpu tdx_prealloc struct to handle
preallocating pages for pamt. I'm wondering if you've considered how
this would work here since some of the calls requiring pamt originate
from user space ioctls and therefore are not associated with a vcpu.
Since the tdx_prealloc is a per vcpu struct there are no race issues
when multiple vcpus need to add pamt pages but here it would be
trickier here because theoretically, multiple threads could split
different pages simultaneously.
> gfn_t gfn_direct_bits;
>
> /*
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index f23d8fc59323..e581cee37f64 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -6848,6 +6848,7 @@ static void mmu_free_vm_memory_caches(struct kvm *kvm)
> kvm_mmu_free_memory_cache(&kvm->arch.split_desc_cache);
> kvm_mmu_free_memory_cache(&kvm->arch.split_page_header_cache);
> kvm_mmu_free_memory_cache(&kvm->arch.split_shadow_page_cache);
> + kvm_mmu_free_memory_cache(&kvm->arch.pamt_page_cache);
> }
>
> void kvm_mmu_uninit_vm(struct kvm *kvm)
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index eb758aaa4374..064c4e823658 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -1584,6 +1584,27 @@ static bool iter_cross_boundary(struct tdp_iter *iter, gfn_t start, gfn_t end)
> (iter->gfn + KVM_PAGES_PER_HPAGE(iter->level)) <= end);
> }
>
> +static bool need_topup_mirror_caches(struct kvm *kvm)
> +{
> + int nr = tdx_nr_pamt_pages() * 2;
> +
> + return kvm_mmu_memory_cache_nr_free_objects(&kvm->arch.pamt_page_cache) < nr;
> +}
> +
> +static int topup_mirror_caches(struct kvm *kvm)
> +{
> + int r, nr;
> +
> + /* One for external_spt, one for TDH.MEM.PAGE.DEMOTE */
> + nr = tdx_nr_pamt_pages() * 2;
> +
> + r = kvm_mmu_topup_memory_cache(&kvm->arch.pamt_page_cache, nr);
> + if (r)
> + return r;
> +
> + return 0;
> +}
> +
> static int tdp_mmu_split_huge_pages_root(struct kvm *kvm,
> struct kvm_mmu_page *root,
> gfn_t start, gfn_t end,
> @@ -1656,6 +1677,36 @@ static int tdp_mmu_split_huge_pages_root(struct kvm *kvm,
> continue;
> }
>
> + if (is_mirror_sp(root) && need_topup_mirror_caches(kvm)) {
> + int r;
> +
> + rcu_read_unlock();
> +
> + if (shared)
> + read_unlock(&kvm->mmu_lock);
> + else
> + write_unlock(&kvm->mmu_lock);
> +
> + r = topup_mirror_caches(kvm);
> +
> + if (shared)
> + read_lock(&kvm->mmu_lock);
> + else
> + write_lock(&kvm->mmu_lock);
> +
> + if (r) {
> + trace_kvm_mmu_split_huge_page(iter.gfn,
> + iter.old_spte,
> + iter.level, r);
> + return r;
> + }
> +
> + rcu_read_lock();
> +
> + iter.yielded = true;
> + continue;
> + }
> +
> tdp_mmu_init_child_sp(sp, &iter);
>
> if (tdp_mmu_split_huge_page(kvm, &iter, sp, shared))
> --
> 2.43.2
>
>
^ permalink raw reply [flat|nested] 129+ messages in thread
* Re: [RFC PATCH v2 21/23] KVM: TDX: Preallocate PAMT pages to be used in split path
2025-12-05 6:14 ` Sagi Shahar
@ 2025-12-08 5:49 ` Yan Zhao
2025-12-11 1:42 ` Vishal Annapurve
0 siblings, 1 reply; 129+ messages in thread
From: Yan Zhao @ 2025-12-08 5:49 UTC (permalink / raw)
To: Sagi Shahar
Cc: pbonzini, seanjc, linux-kernel, kvm, x86, rick.p.edgecombe,
dave.hansen, kas, tabba, ackerleytng, quic_eberman, michael.roth,
david, vannapurve, vbabka, thomas.lendacky, pgonda, zhiquan1.li,
fan.du, jun.miao, ira.weiny, isaku.yamahata, xiaoyao.li,
binbin.wu, chao.p.peng
On Fri, Dec 05, 2025 at 12:14:46AM -0600, Sagi Shahar wrote:
> On Thu, Aug 7, 2025 at 4:48 AM Yan Zhao <yan.y.zhao@intel.com> wrote:
> >
> > From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> >
> > Preallocate a page to be used in the split_external_spt() path.
> >
> > Kernel needs one PAMT page pair for external_spt and one that provided
> > directly to the TDH.MEM.PAGE.DEMOTE SEAMCALL.
> >
> > Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> > Co-developed-by: Yan Zhao <yan.y.zhao@intel.com>
> > Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
> > ---
> > RFC v2:
> > - Pulled from
> > git://git.kernel.org/pub/scm/linux/kernel/git/kas/linux.git tdx/dpamt-huge.
> > - Implemented the flow of topup pamt_page_cache in
> > tdp_mmu_split_huge_pages_root() (Yan)
> > ---
> > arch/x86/include/asm/kvm_host.h | 2 ++
> > arch/x86/kvm/mmu/mmu.c | 1 +
> > arch/x86/kvm/mmu/tdp_mmu.c | 51 +++++++++++++++++++++++++++++++++
> > 3 files changed, 54 insertions(+)
> >
> > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> > index 6b6c46c27390..508b133df903 100644
> > --- a/arch/x86/include/asm/kvm_host.h
> > +++ b/arch/x86/include/asm/kvm_host.h
> > @@ -1591,6 +1591,8 @@ struct kvm_arch {
> > #define SPLIT_DESC_CACHE_MIN_NR_OBJECTS (SPTE_ENT_PER_PAGE + 1)
> > struct kvm_mmu_memory_cache split_desc_cache;
> >
> > + struct kvm_mmu_memory_cache pamt_page_cache;
> > +
>
> The latest DPAMT patches use a per-vcpu tdx_prealloc struct to handle
> preallocating pages for pamt. I'm wondering if you've considered how
> this would work here since some of the calls requiring pamt originate
> from user space ioctls and therefore are not associated with a vcpu.
I'll use a per-VM tdx_prealloc struct for splitting here, similar to the
per-VM pamt_page_cache.
diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h
index 43dd295b7fd6..91bea25da528 100644
--- a/arch/x86/kvm/vmx/tdx.h
+++ b/arch/x86/kvm/vmx/tdx.h
@@ -48,6 +48,9 @@ struct kvm_tdx {
* Set/unset is protected with kvm->mmu_lock.
*/
bool wait_for_sept_zap;
+
+ spinlock_t external_kvm_split_cache_lock;
+ struct tdx_prealloc prealloc_split_cache;
};
> Since the tdx_prealloc is a per vcpu struct there are no race issues
> when multiple vcpus need to add pamt pages but here it would be
> trickier here because theoretically, multiple threads could split
> different pages simultaneously.
A spin lock external_kvm_split_cache_lock is introduced to protect the cache
enqueue and dequeue.
(a) When tdp_mmu_split_huge_pages_root() is invoked under write mmu_lock:
- Since cache dequeue is already under write mmu_lock in
tdp_mmu_split_huge_page()-->tdx_sept_split_private_spte(), acquiring/
releasing another spin lock doesn't matter.
- Though the cache enqueue in topup_external_split_cache() is not protected
by mmu_lock, protecting enqueue with a spinlock should not reduce
concurrency.
(b) When tdp_mmu_split_huge_pages_root() is invoked under read mmu_lock:
Introducing a new spinlock may hurt concurrency for a brief duration (which
is necessary).
However, there's no known (future) use case for multiple threads invoking
tdp_mmu_split_huge_pages_root() on mirror root under shared mmu_lock.
For future splitting under shared mmu_lock in the fault path, we'll use
the per-vCPU tdx_prealloc instead of the per-VM cache. TDX can leverage
kvm_get_running_vcpu() to differentiate between the two caches.
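As a rough illustration of the TDX side of this (an assumption, not posted
code), the two callbacks referenced by the TDP MMU diff below could look
roughly like the following; only external_kvm_split_cache_lock and
prealloc_split_cache come from the struct kvm_tdx diff above, while the cache
internals (a page list plus a page count) and the function names are made up
for illustration:

static bool tdx_need_topup_external_per_vm_split_cache(struct kvm *kvm, int level)
{
	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
	/* One PAMT pair for the new S-EPT page, one for the 2MB page to demote. */
	int nr = tdx_nr_pamt_pages() * 2;
	bool need;

	spin_lock(&kvm_tdx->external_kvm_split_cache_lock);
	/* page_cnt is an assumed field of the per-VM split cache. */
	need = kvm_tdx->prealloc_split_cache.page_cnt < nr;
	spin_unlock(&kvm_tdx->external_kvm_split_cache_lock);

	return need;
}

static int tdx_topup_external_per_vm_split_cache(struct kvm *kvm, int level)
{
	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);

	/* Called with mmu_lock dropped, so plain GFP allocations are fine. */
	while (tdx_need_topup_external_per_vm_split_cache(kvm, level)) {
		struct page *page = alloc_page(GFP_KERNEL_ACCOUNT | __GFP_ZERO);

		if (!page)
			return -ENOMEM;

		spin_lock(&kvm_tdx->external_kvm_split_cache_lock);
		/* page_list is likewise an assumed field. */
		list_add_tail(&page->lru, &kvm_tdx->prealloc_split_cache.page_list);
		kvm_tdx->prealloc_split_cache.page_cnt++;
		spin_unlock(&kvm_tdx->external_kvm_split_cache_lock);
	}

	return 0;
}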
Here's the new diff in TDP MMU rebased to Dynamic PAMT v4.
+/*
+ * Check the per-VM external split cache under write mmu_lock or read mmu_lock
+ * in tdp_mmu_split_huge_pages_root().
+ *
+ * When need_topup_external_split_cache() returns true, the mmu_lock is held
+ * throughout
+ * (a) need_topup_external_split_cache(), and
+ * (b) the cache consumption (in tdx_sept_split_private_spte() called by
+ * tdp_mmu_split_huge_page()).
+ *
+ * Throughout the execution from (a) to (b):
+ * - With write mmu_lock, the per-VM external split cache is exclusively
+ * accessed by a single user. Therefore, the result returned from
+ * need_topup_external_split_cache() is accurate.
+ *
+ * - With read mmu_lock, the per-VM external split cache can be shared among
+ * multiple users. Cache consumption in tdx_sept_split_private_spte() thus
+ * needs to re-check the cache page count after acquiring its internal
+ * split cache lock, and return an error if the count is insufficient.
+ */
+static bool need_topup_external_split_cache(struct kvm *kvm, int level)
+{
+ return kvm_x86_call(need_topup_external_per_vm_split_cache)(kvm, level);
+}
+static int topup_external_split_cache(struct kvm *kvm, int level, bool shared)
+{
+ int r;
+
+ rcu_read_unlock();
+
+ if (shared)
+ read_unlock(&kvm->mmu_lock);
+ else
+ write_unlock(&kvm->mmu_lock);
+
+ r = kvm_x86_call(topup_external_per_vm_split_cache)(kvm, level);
+
+ if (shared)
+ read_lock(&kvm->mmu_lock);
+ else
+ write_lock(&kvm->mmu_lock);
+
+ if (!r)
+ rcu_read_lock();
+
+ return r;
+}
static int tdp_mmu_split_huge_pages_root(struct kvm *kvm,
struct kvm_mmu_page *root,
gfn_t start, gfn_t end,
@@ -1673,6 +1723,23 @@ static int tdp_mmu_split_huge_pages_root(struct kvm *kvm,
continue;
}
+ if (is_mirror_sp(root) &&
+ need_topup_external_split_cache(kvm, iter.level)) {
+ int r;
+
+ r = topup_external_split_cache(kvm, iter.level, shared);
+
+ if (r) {
+ trace_kvm_mmu_split_huge_page(iter.gfn,
+ iter.old_spte,
+ iter.level, r);
+ return r;
+ }
+
+ iter.yielded = true;
+ continue;
+ }
+
tdp_mmu_init_child_sp(sp, &iter);
if (tdp_mmu_split_huge_page(kvm, &iter, sp, shared))
^ permalink raw reply related [flat|nested] 129+ messages in thread
* Re: [RFC PATCH v2 21/23] KVM: TDX: Preallocate PAMT pages to be used in split path
2025-12-08 5:49 ` Yan Zhao
@ 2025-12-11 1:42 ` Vishal Annapurve
2025-12-11 2:36 ` Yan Zhao
0 siblings, 1 reply; 129+ messages in thread
From: Vishal Annapurve @ 2025-12-11 1:42 UTC (permalink / raw)
To: Yan Zhao
Cc: Sagi Shahar, pbonzini, seanjc, linux-kernel, kvm, x86,
rick.p.edgecombe, dave.hansen, kas, tabba, ackerleytng,
quic_eberman, michael.roth, david, vbabka, thomas.lendacky,
pgonda, zhiquan1.li, fan.du, jun.miao, ira.weiny, isaku.yamahata,
xiaoyao.li, binbin.wu, chao.p.peng
On Sun, Dec 7, 2025 at 9:51 PM Yan Zhao <yan.y.zhao@intel.com> wrote:
>
> > > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> > > index 6b6c46c27390..508b133df903 100644
> > > --- a/arch/x86/include/asm/kvm_host.h
> > > +++ b/arch/x86/include/asm/kvm_host.h
> > > @@ -1591,6 +1591,8 @@ struct kvm_arch {
> > > #define SPLIT_DESC_CACHE_MIN_NR_OBJECTS (SPTE_ENT_PER_PAGE + 1)
> > > struct kvm_mmu_memory_cache split_desc_cache;
> > >
> > > + struct kvm_mmu_memory_cache pamt_page_cache;
> > > +
> >
> > The latest DPAMT patches use a per-vcpu tdx_prealloc struct to handle
> > preallocating pages for pamt. I'm wondering if you've considered how
> > this would work here since some of the calls requiring pamt originate
> > from user space ioctls and therefore are not associated with a vcpu.
> I'll use a per-VM tdx_prealloc struct for splitting here, similar to the
> per-VM pamt_page_cache.
>
> diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h
> index 43dd295b7fd6..91bea25da528 100644
> --- a/arch/x86/kvm/vmx/tdx.h
> +++ b/arch/x86/kvm/vmx/tdx.h
> @@ -48,6 +48,9 @@ struct kvm_tdx {
> * Set/unset is protected with kvm->mmu_lock.
> */
> bool wait_for_sept_zap;
> +
> + spinlock_t external_kvm_split_cache_lock;
> + struct tdx_prealloc prealloc_split_cache;
> };
>
> > Since the tdx_prealloc is a per vcpu struct there are no race issues
> > when multiple vcpus need to add pamt pages but here it would be
> > trickier here because theoretically, multiple threads could split
> > different pages simultaneously.
> A spin lock external_kvm_split_cache_lock is introduced to protect the cache
> enqueue and dequeue.
> (a) When tdp_mmu_split_huge_pages_root() is invoked under write mmu_lock:
> - Since cache dequeue is already under write mmu_lock in
> tdp_mmu_split_huge_page()-->tdx_sept_split_private_spte(), acquiring/
> releasing another spin lock doesn't matter.
> - Though the cache enqueue in topup_external_split_cache() is not protected
> by mmu_lock, protecting enqueue with a spinlock should not reduce
> concurrency.
Even with the spin lock protecting the cache topup/consumption
operations, is it possible that one split operation consumes the
top-up performed by another split operation, causing the subsequent
consumptions to fail?
>
> (b) When tdp_mmu_split_huge_pages_root() is invoked under read mmu_lock:
> Introducing a new spinlock may hurt concurrency for a brief duration (which
> is necessary).
> However, there's no known (future) use case for multiple threads invoking
> tdp_mmu_split_huge_pages_root() on mirror root under shared mmu_lock.
>
> For future splitting under shared mmu_lock in the fault path, we'll use
> the per-vCPU tdx_prealloc instead of the per-VM cache. TDX can leverage
> kvm_get_running_vcpu() to differentiate between the two caches.
>
>
^ permalink raw reply [flat|nested] 129+ messages in thread* Re: [RFC PATCH v2 21/23] KVM: TDX: Preallocate PAMT pages to be used in split path
2025-12-11 1:42 ` Vishal Annapurve
@ 2025-12-11 2:36 ` Yan Zhao
0 siblings, 0 replies; 129+ messages in thread
From: Yan Zhao @ 2025-12-11 2:36 UTC (permalink / raw)
To: Vishal Annapurve
Cc: Sagi Shahar, pbonzini, seanjc, linux-kernel, kvm, x86,
rick.p.edgecombe, dave.hansen, kas, tabba, ackerleytng,
quic_eberman, michael.roth, david, vbabka, thomas.lendacky,
pgonda, zhiquan1.li, fan.du, jun.miao, ira.weiny, isaku.yamahata,
xiaoyao.li, binbin.wu, chao.p.peng
On Wed, Dec 10, 2025 at 05:42:44PM -0800, Vishal Annapurve wrote:
> On Sun, Dec 7, 2025 at 9:51 PM Yan Zhao <yan.y.zhao@intel.com> wrote:
> >
> > > > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> > > > index 6b6c46c27390..508b133df903 100644
> > > > --- a/arch/x86/include/asm/kvm_host.h
> > > > +++ b/arch/x86/include/asm/kvm_host.h
> > > > @@ -1591,6 +1591,8 @@ struct kvm_arch {
> > > > #define SPLIT_DESC_CACHE_MIN_NR_OBJECTS (SPTE_ENT_PER_PAGE + 1)
> > > > struct kvm_mmu_memory_cache split_desc_cache;
> > > >
> > > > + struct kvm_mmu_memory_cache pamt_page_cache;
> > > > +
> > >
> > > The latest DPAMT patches use a per-vcpu tdx_prealloc struct to handle
> > > preallocating pages for pamt. I'm wondering if you've considered how
> > > this would work here since some of the calls requiring pamt originate
> > > from user space ioctls and therefore are not associated with a vcpu.
> > I'll use a per-VM tdx_prealloc struct for splitting here, similar to the
> > per-VM pamt_page_cache.
> >
> > diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h
> > index 43dd295b7fd6..91bea25da528 100644
> > --- a/arch/x86/kvm/vmx/tdx.h
> > +++ b/arch/x86/kvm/vmx/tdx.h
> > @@ -48,6 +48,9 @@ struct kvm_tdx {
> > * Set/unset is protected with kvm->mmu_lock.
> > */
> > bool wait_for_sept_zap;
> > +
> > + spinlock_t external_kvm_split_cache_lock;
> > + struct tdx_prealloc prealloc_split_cache;
> > };
> >
> > > Since the tdx_prealloc is a per vcpu struct there are no race issues
> > > when multiple vcpus need to add pamt pages but here it would be
> > > trickier here because theoretically, multiple threads could split
> > > different pages simultaneously.
> > A spin lock external_kvm_split_cache_lock is introduced to protect the cache
> > enqueue and dequeue.
> > (a) When tdp_mmu_split_huge_pages_root() is invoked under write mmu_lock:
> > - Since cache dequeue is already under write mmu_lock in
> > tdp_mmu_split_huge_page()-->tdx_sept_split_private_spte(), acquiring/
> > releasing another spin lock doesn't matter.
> > - Though the cache enqueue in topup_external_split_cache() is not protected
> > by mmu_lock, protecting enqueue with a spinlock should not reduce
> > concurrency.
>
> Even with the spin lock protecting the cache topup/consumption
> operations, is it possible that one split operation consumes the
> top-up performed by another split operation, causing the subsequent
> consumptions to fail?
The sequence of check topup, topup, and consume is like this
1. write_lock(&kvm->mmu_lock)
check topup
2. write_unlock(&kvm->mmu_lock)
topup (get/put split lock to enqueue)
3. write_lock(&kvm->mmu_lock)
check topup (goto 2 if topup is necessary) (*)
get split lock
consume
put split lock
write_unlock(&kvm->mmu_lock)
Note: due to the "iter.yielded = true" and "continue" after the topup (see the
diff posted in my last reply), consumption does not directly follow the topup,
i.e., there is step 3.
Because of (*) in step 3, and because consumption happens under the write
mmu_lock, it's impossible for splits in other threads to consume pages
allocated for this split.
> > (b) When tdp_mmu_split_huge_pages_root() is invoked under read mmu_lock:
> > Introducing a new spinlock may hurt concurrency for a brief duration (which
> > is necessary).
Let's talk more about this future potential use cases (i.e., if there're
multiple callers of tdp_mmu_split_huge_pages_root() under shared mmu_lock).
The sequence would be
1. read_lock(&kvm->mmu_lock)
check topup
2. read_unlock(&kvm->mmu_lock)
topup (get/put split lock to enqueue)
3. read_lock(&kvm->mmu_lock)
check topup (goto 2 if topup is necessary)
get split lock
check topup (return retry if topup is necessary) (**)
consume
put split lock
read_unlock(&kvm->mmu_lock)
Because of (**) in step 3, and because consumption happens under the split lock
and the read mmu_lock, it's also impossible for splits in other threads to
consume pages allocated for this split.
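The consume side that implements the (**) re-check could then look roughly
like this (again a sketch: the cache internals and tdx_split_cache_consume()
are made up for illustration; only the lock and cache names follow the diff
above):

static struct page *tdx_split_cache_consume(struct kvm_tdx *kvm_tdx)
{
	struct page *page = NULL;

	spin_lock(&kvm_tdx->external_kvm_split_cache_lock);
	/*
	 * (**) Under read mmu_lock another splitter may have consumed the
	 * pages this thread topped up, so re-check before dequeuing; a NULL
	 * return tells the caller to retry (i.e. top up again).
	 */
	if (!list_empty(&kvm_tdx->prealloc_split_cache.page_list)) {
		page = list_first_entry(&kvm_tdx->prealloc_split_cache.page_list,
					struct page, lru);
		list_del(&page->lru);
		kvm_tdx->prealloc_split_cache.page_cnt--;
	}
	spin_unlock(&kvm_tdx->external_kvm_split_cache_lock);

	return page;
}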
> > However, there's no known (future) use case for multiple threads invoking
> > tdp_mmu_split_huge_pages_root() on mirror root under shared mmu_lock.
> >
> > For future splitting under shared mmu_lock in the fault path, we'll use
> > the per-vCPU tdx_prealloc instead of the per-VM cache. TDX can leverage
> > kvm_get_running_vcpu() to differentiate between the two caches.
> >
> >
^ permalink raw reply [flat|nested] 129+ messages in thread
* [RFC PATCH v2 22/23] KVM: TDX: Handle Dynamic PAMT on page split
2025-08-07 9:39 [RFC PATCH v2 00/23] KVM: TDX huge page support for private memory Yan Zhao
` (20 preceding siblings ...)
2025-08-07 9:46 ` [RFC PATCH v2 21/23] KVM: TDX: Preallocate PAMT pages to be used in split path Yan Zhao
@ 2025-08-07 9:46 ` Yan Zhao
2025-08-14 5:31 ` Vishal Annapurve
2025-08-07 9:46 ` [RFC PATCH v2 23/23] KVM: TDX: Turn on PG_LEVEL_2M after TD is RUNNABLE Yan Zhao
22 siblings, 1 reply; 129+ messages in thread
From: Yan Zhao @ 2025-08-07 9:46 UTC (permalink / raw)
To: pbonzini, seanjc
Cc: linux-kernel, kvm, x86, rick.p.edgecombe, dave.hansen, kas, tabba,
ackerleytng, quic_eberman, michael.roth, david, vannapurve,
vbabka, thomas.lendacky, pgonda, zhiquan1.li, fan.du, jun.miao,
ira.weiny, isaku.yamahata, xiaoyao.li, binbin.wu, chao.p.peng,
yan.y.zhao
From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Demoting a page from 2M to 4K requires an additional PAMT page pair to cover
the 2M range that is now mapped with 4K pages.
The EPT page also has to be covered in PAMT_4K.
Allocate both from the pre-allocated split PAMT pool.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
---
RFC v2:
- Pulled from
git://git.kernel.org/pub/scm/linux/kernel/git/kas/linux.git tdx/dpamt-huge.
- Rebased on top of TDX huge page RFC v2 (Yan).
---
arch/x86/include/asm/tdx.h | 4 ++++
arch/x86/kvm/vmx/tdx.c | 28 ++++++++++++++++++++++++----
arch/x86/virt/vmx/tdx/tdx.c | 11 +++++++----
3 files changed, 35 insertions(+), 8 deletions(-)
diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index 2e529f0c578a..da317981e95a 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -123,6 +123,10 @@ u32 tdx_get_nr_guest_keyids(void);
void tdx_guest_keyid_free(unsigned int keyid);
int tdx_nr_pamt_pages(void);
+atomic_t *tdx_get_pamt_refcount(unsigned long hpa);
+int tdx_alloc_pamt_pages(struct list_head *pamt_pages,
+ struct page *(alloc)(void *data), void *data);
+void tdx_free_pamt_pages(struct list_head *pamt_pages);
int tdx_pamt_get(struct page *page, enum pg_level level,
struct page *(alloc)(void *data), void *data);
void tdx_pamt_put(struct page *page, enum pg_level level);
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 9d24a1a86a23..6e061d659639 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1915,28 +1915,48 @@ static int tdx_sept_free_private_spt(struct kvm *kvm, gfn_t gfn,
return 0;
}
+static struct page *tdx_alloc_pamt_page_split(void *data)
+{
+ struct kvm *kvm = data;
+ void *p;
+
+ p = kvm_mmu_memory_cache_alloc(&kvm->arch.pamt_page_cache);
+ return virt_to_page(p);
+}
+
static int tdx_spte_demote_private_spte(struct kvm *kvm, gfn_t gfn,
- enum pg_level level, struct page *page)
+ enum pg_level level, struct page *page,
+ kvm_pfn_t pfn_for_gfn)
{
int tdx_level = pg_level_to_tdx_sept_level(level);
+ hpa_t hpa = pfn_to_hpa(pfn_for_gfn);
struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
gpa_t gpa = gfn_to_gpa(gfn);
u64 err, entry, level_state;
+ LIST_HEAD(pamt_pages);
+
+ tdx_pamt_get(page, PG_LEVEL_4K, tdx_alloc_pamt_page_split, kvm);
+ tdx_alloc_pamt_pages(&pamt_pages, tdx_alloc_pamt_page_split, kvm);
err = tdh_mem_page_demote(&kvm_tdx->td, gpa, tdx_level, page,
- NULL, &entry, &level_state);
+ &pamt_pages, &entry, &level_state);
if (unlikely(tdx_operand_busy(err))) {
tdx_no_vcpus_enter_start(kvm);
err = tdh_mem_page_demote(&kvm_tdx->td, gpa, tdx_level, page,
- NULL, &entry, &level_state);
+ &pamt_pages, &entry, &level_state);
tdx_no_vcpus_enter_stop(kvm);
}
if (KVM_BUG_ON(err, kvm)) {
+ tdx_free_pamt_pages(&pamt_pages);
+ tdx_pamt_put(page, PG_LEVEL_4K);
pr_tdx_error_2(TDH_MEM_PAGE_DEMOTE, err, entry, level_state);
return -EIO;
}
+
+ if (tdx_supports_dynamic_pamt(tdx_sysinfo))
+ atomic_set(tdx_get_pamt_refcount(hpa), PTRS_PER_PMD);
return 0;
}
@@ -1963,7 +1983,7 @@ static int tdx_sept_split_private_spt(struct kvm *kvm, gfn_t gfn, enum pg_level
tdx_track(kvm);
- return tdx_spte_demote_private_spte(kvm, gfn, level, page);
+ return tdx_spte_demote_private_spte(kvm, gfn, level, page, pfn_for_gfn);
}
static int tdx_sept_remove_private_spte(struct kvm *kvm, gfn_t gfn,
diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index 50f9d49f1c91..dbbddd00ec60 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -188,10 +188,11 @@ int tdx_cpu_enable(void)
}
EXPORT_SYMBOL_GPL(tdx_cpu_enable);
-static atomic_t *tdx_get_pamt_refcount(unsigned long hpa)
+atomic_t *tdx_get_pamt_refcount(unsigned long hpa)
{
return &pamt_refcounts[hpa / PMD_SIZE];
}
+EXPORT_SYMBOL_GPL(tdx_get_pamt_refcount);
static int pamt_refcount_populate(pte_t *pte, unsigned long addr, void *data)
{
@@ -2151,7 +2152,7 @@ static u64 tdh_phymem_pamt_remove(unsigned long hpa,
static DEFINE_SPINLOCK(pamt_lock);
-static void tdx_free_pamt_pages(struct list_head *pamt_pages)
+void tdx_free_pamt_pages(struct list_head *pamt_pages)
{
struct page *page;
@@ -2160,9 +2161,10 @@ static void tdx_free_pamt_pages(struct list_head *pamt_pages)
__free_page(page);
}
}
+EXPORT_SYMBOL_GPL(tdx_free_pamt_pages);
-static int tdx_alloc_pamt_pages(struct list_head *pamt_pages,
- struct page *(alloc)(void *data), void *data)
+int tdx_alloc_pamt_pages(struct list_head *pamt_pages,
+ struct page *(alloc)(void *data), void *data)
{
for (int i = 0; i < tdx_nr_pamt_pages(); i++) {
struct page *page;
@@ -2180,6 +2182,7 @@ static int tdx_alloc_pamt_pages(struct list_head *pamt_pages,
tdx_free_pamt_pages(pamt_pages);
return -ENOMEM;
}
+EXPORT_SYMBOL_GPL(tdx_alloc_pamt_pages);
static int tdx_pamt_add(atomic_t *pamt_refcount, unsigned long hpa,
struct list_head *pamt_pages)
--
2.43.2
^ permalink raw reply related [flat|nested] 129+ messages in thread
* Re: [RFC PATCH v2 22/23] KVM: TDX: Handle Dynamic PAMT on page split
2025-08-07 9:46 ` [RFC PATCH v2 22/23] KVM: TDX: Handle Dynamic PAMT on page split Yan Zhao
@ 2025-08-14 5:31 ` Vishal Annapurve
2025-08-14 18:29 ` Vishal Annapurve
2025-08-18 4:19 ` Yan Zhao
0 siblings, 2 replies; 129+ messages in thread
From: Vishal Annapurve @ 2025-08-14 5:31 UTC (permalink / raw)
To: Yan Zhao
Cc: pbonzini, seanjc, linux-kernel, kvm, x86, rick.p.edgecombe,
dave.hansen, kas, tabba, ackerleytng, quic_eberman, michael.roth,
david, vbabka, thomas.lendacky, pgonda, zhiquan1.li, fan.du,
jun.miao, ira.weiny, isaku.yamahata, xiaoyao.li, binbin.wu,
chao.p.peng
On Thu, Aug 7, 2025 at 2:46 AM Yan Zhao <yan.y.zhao@intel.com> wrote:
>
> From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> +static struct page *tdx_alloc_pamt_page_split(void *data)
> +{
> + struct kvm *kvm = data;
> + void *p;
> +
> + p = kvm_mmu_memory_cache_alloc(&kvm->arch.pamt_page_cache);
> + return virt_to_page(p);
> +}
> +
> static int tdx_spte_demote_private_spte(struct kvm *kvm, gfn_t gfn,
> - enum pg_level level, struct page *page)
> + enum pg_level level, struct page *page,
> + kvm_pfn_t pfn_for_gfn)
> {
> int tdx_level = pg_level_to_tdx_sept_level(level);
> + hpa_t hpa = pfn_to_hpa(pfn_for_gfn);
> struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
> gpa_t gpa = gfn_to_gpa(gfn);
> u64 err, entry, level_state;
> + LIST_HEAD(pamt_pages);
> +
> + tdx_pamt_get(page, PG_LEVEL_4K, tdx_alloc_pamt_page_split, kvm);
This invocation needs a return value check.
> + tdx_alloc_pamt_pages(&pamt_pages, tdx_alloc_pamt_page_split, kvm);
IIUC tdx_pamt_get() will result in pamt_pages allocation above, so
this step is not needed.
>
> err = tdh_mem_page_demote(&kvm_tdx->td, gpa, tdx_level, page,
> - NULL, &entry, &level_state);
> + &pamt_pages, &entry, &level_state);
>
> if (unlikely(tdx_operand_busy(err))) {
> tdx_no_vcpus_enter_start(kvm);
> err = tdh_mem_page_demote(&kvm_tdx->td, gpa, tdx_level, page,
> - NULL, &entry, &level_state);
> + &pamt_pages, &entry, &level_state);
> tdx_no_vcpus_enter_stop(kvm);
> }
>
> if (KVM_BUG_ON(err, kvm)) {
> + tdx_free_pamt_pages(&pamt_pages);
If tdx_alloc_pamt_pages() is not needed then this can be dropped as well.
> + tdx_pamt_put(page, PG_LEVEL_4K);
> pr_tdx_error_2(TDH_MEM_PAGE_DEMOTE, err, entry, level_state);
> return -EIO;
> }
> +
> + if (tdx_supports_dynamic_pamt(tdx_sysinfo))
> + atomic_set(tdx_get_pamt_refcount(hpa), PTRS_PER_PMD);
Should this be
atomic_set(tdx_get_pamt_refcount(hpa), PTRS_PER_PMD -1 );
as tdx_pamt_get would have increased the refcount by 1 already above?
^ permalink raw reply [flat|nested] 129+ messages in thread
* Re: [RFC PATCH v2 22/23] KVM: TDX: Handle Dynamic PAMT on page split
2025-08-14 5:31 ` Vishal Annapurve
@ 2025-08-14 18:29 ` Vishal Annapurve
2025-08-18 4:19 ` Yan Zhao
1 sibling, 0 replies; 129+ messages in thread
From: Vishal Annapurve @ 2025-08-14 18:29 UTC (permalink / raw)
To: Yan Zhao
Cc: pbonzini, seanjc, linux-kernel, kvm, x86, rick.p.edgecombe,
dave.hansen, kas, tabba, ackerleytng, quic_eberman, michael.roth,
david, vbabka, thomas.lendacky, pgonda, zhiquan1.li, fan.du,
jun.miao, ira.weiny, isaku.yamahata, xiaoyao.li, binbin.wu,
chao.p.peng
On Wed, Aug 13, 2025 at 10:31 PM Vishal Annapurve <vannapurve@google.com> wrote:
>
> On Thu, Aug 7, 2025 at 2:46 AM Yan Zhao <yan.y.zhao@intel.com> wrote:
> >
> > From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> > +static struct page *tdx_alloc_pamt_page_split(void *data)
> > +{
> > + struct kvm *kvm = data;
> > + void *p;
> > +
> > + p = kvm_mmu_memory_cache_alloc(&kvm->arch.pamt_page_cache);
> > + return virt_to_page(p);
> > +}
> > +
> > static int tdx_spte_demote_private_spte(struct kvm *kvm, gfn_t gfn,
> > - enum pg_level level, struct page *page)
> > + enum pg_level level, struct page *page,
> > + kvm_pfn_t pfn_for_gfn)
> > {
> > int tdx_level = pg_level_to_tdx_sept_level(level);
> > + hpa_t hpa = pfn_to_hpa(pfn_for_gfn);
> > struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
> > gpa_t gpa = gfn_to_gpa(gfn);
> > u64 err, entry, level_state;
> > + LIST_HEAD(pamt_pages);
> > +
> > + tdx_pamt_get(page, PG_LEVEL_4K, tdx_alloc_pamt_page_split, kvm);
>
> This invocation needs a return value check.
>
> > + tdx_alloc_pamt_pages(&pamt_pages, tdx_alloc_pamt_page_split, kvm);
>
> IIUC tdx_pamt_get() will result in pamt_pages allocation above, so
> this step is not needed.
I missed that one allocation is to cover the EPT page and the other is
for the HPA range backing the GPA mappings. So ignore the rest of my
comments except the one about the missing error handling for
tdx_pamt_get() and tdx_alloc_pamt_pages() in this patch.
^ permalink raw reply [flat|nested] 129+ messages in thread
* Re: [RFC PATCH v2 22/23] KVM: TDX: Handle Dynamic PAMT on page split
2025-08-14 5:31 ` Vishal Annapurve
2025-08-14 18:29 ` Vishal Annapurve
@ 2025-08-18 4:19 ` Yan Zhao
1 sibling, 0 replies; 129+ messages in thread
From: Yan Zhao @ 2025-08-18 4:19 UTC (permalink / raw)
To: Vishal Annapurve
Cc: pbonzini, seanjc, linux-kernel, kvm, x86, rick.p.edgecombe,
dave.hansen, kas, tabba, ackerleytng, quic_eberman, michael.roth,
david, vbabka, thomas.lendacky, pgonda, fan.du, jun.miao,
ira.weiny, isaku.yamahata, xiaoyao.li, binbin.wu, chao.p.peng
On Wed, Aug 13, 2025 at 10:31:27PM -0700, Vishal Annapurve wrote:
> On Thu, Aug 7, 2025 at 2:46 AM Yan Zhao <yan.y.zhao@intel.com> wrote:
> >
> > From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> > +static struct page *tdx_alloc_pamt_page_split(void *data)
> > +{
> > + struct kvm *kvm = data;
> > + void *p;
> > +
> > + p = kvm_mmu_memory_cache_alloc(&kvm->arch.pamt_page_cache);
> > + return virt_to_page(p);
> > +}
> > +
> > static int tdx_spte_demote_private_spte(struct kvm *kvm, gfn_t gfn,
> > - enum pg_level level, struct page *page)
> > + enum pg_level level, struct page *page,
> > + kvm_pfn_t pfn_for_gfn)
> > {
> > int tdx_level = pg_level_to_tdx_sept_level(level);
> > + hpa_t hpa = pfn_to_hpa(pfn_for_gfn);
> > struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
> > gpa_t gpa = gfn_to_gpa(gfn);
> > u64 err, entry, level_state;
> > + LIST_HEAD(pamt_pages);
> > +
> > + tdx_pamt_get(page, PG_LEVEL_4K, tdx_alloc_pamt_page_split, kvm);
>
> This invocation needs a return value check.
Ack.
> > + tdx_alloc_pamt_pages(&pamt_pages, tdx_alloc_pamt_page_split, kvm);
>
> IIUC tdx_pamt_get() will result in pamt_pages allocation above, so
> this step is not needed.
This step is to allocate pamt_pages for the guest 2MB page that needs splitting.
The above tdx_pamt_get() is for the EPT page to be added.
I'll add comments or update the param names for better clarity.
Regarding the absence of a return value check for tdx_alloc_pamt_pages(), I
think it's because tdx_alloc_pamt_page_split() retrieves pages from the
pamt_page_cache via kvm_mmu_memory_cache_alloc(), which is guaranteed to succeed
(otherwise, there's a BUG_ON() in kvm_mmu_memory_cache_alloc()).
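In code form, the acked fix could be as small as the following snippet in
tdx_spte_demote_private_spte() (a sketch; exact placement and cleanup are up to
the next version):

	int r;

	r = tdx_pamt_get(page, PG_LEVEL_4K, tdx_alloc_pamt_page_split, kvm);
	if (r)
		return r;

	/*
	 * No return value check on purpose: the alloc callback only pulls
	 * pages out of the pre-topped-up pamt_page_cache, and
	 * kvm_mmu_memory_cache_alloc() BUG()s instead of failing.
	 */
	tdx_alloc_pamt_pages(&pamt_pages, tdx_alloc_pamt_page_split, kvm);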
> >
> > err = tdh_mem_page_demote(&kvm_tdx->td, gpa, tdx_level, page,
> > - NULL, &entry, &level_state);
> > + &pamt_pages, &entry, &level_state);
> >
> > if (unlikely(tdx_operand_busy(err))) {
> > tdx_no_vcpus_enter_start(kvm);
> > err = tdh_mem_page_demote(&kvm_tdx->td, gpa, tdx_level, page,
> > - NULL, &entry, &level_state);
> > + &pamt_pages, &entry, &level_state);
> > tdx_no_vcpus_enter_stop(kvm);
> > }
> >
> > if (KVM_BUG_ON(err, kvm)) {
> > + tdx_free_pamt_pages(&pamt_pages);
>
> If tdx_alloc_pamt_pages() is not needed then this can be dropped as well.
>
> > + tdx_pamt_put(page, PG_LEVEL_4K);
> > pr_tdx_error_2(TDH_MEM_PAGE_DEMOTE, err, entry, level_state);
> > return -EIO;
> > }
> > +
> > + if (tdx_supports_dynamic_pamt(tdx_sysinfo))
> > + atomic_set(tdx_get_pamt_refcount(hpa), PTRS_PER_PMD);
>
> Should this be
> atomic_set(tdx_get_pamt_refcount(hpa), PTRS_PER_PMD -1 );
>
> as tdx_pamt_get would have increased the refcount by 1 already above?
This hpa is for the guest 2MB memory range. There shouldn't be any elevated
pamt_refcount for this range before a successful demote.
So atomic_set() to PTRS_PER_PMD looks correct, though atomic_add() seems even
safer.
^ permalink raw reply [flat|nested] 129+ messages in thread
* [RFC PATCH v2 23/23] KVM: TDX: Turn on PG_LEVEL_2M after TD is RUNNABLE
2025-08-07 9:39 [RFC PATCH v2 00/23] KVM: TDX huge page support for private memory Yan Zhao
` (21 preceding siblings ...)
2025-08-07 9:46 ` [RFC PATCH v2 22/23] KVM: TDX: Handle Dynamic PAMT on page split Yan Zhao
@ 2025-08-07 9:46 ` Yan Zhao
2025-11-11 11:25 ` Huang, Kai
22 siblings, 1 reply; 129+ messages in thread
From: Yan Zhao @ 2025-08-07 9:46 UTC (permalink / raw)
To: pbonzini, seanjc
Cc: linux-kernel, kvm, x86, rick.p.edgecombe, dave.hansen, kas, tabba,
ackerleytng, quic_eberman, michael.roth, david, vannapurve,
vbabka, thomas.lendacky, pgonda, zhiquan1.li, fan.du, jun.miao,
ira.weiny, isaku.yamahata, xiaoyao.li, binbin.wu, chao.p.peng,
yan.y.zhao
Turn on PG_LEVEL_2M in tdx_gmem_private_max_mapping_level() when TD is
RUNNABLE.
Update the warnings and KVM_BUG_ON() info elsewhere to match that 2MB
mappings are permitted after TD is RUNNABLE.
Opportunistically, remove the unused params "gfn" and "pfn" in
tdx_mem_page_record_premap_cnt().
Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
---
RFC v2:
- Merged RFC v1's patch 4 (forcing PG_LEVEL_4K before TD runnable) with
patch 9 (allowing PG_LEVEL_2M after TD runnable).
---
arch/x86/kvm/vmx/tdx.c | 29 +++++++++++++++--------------
1 file changed, 15 insertions(+), 14 deletions(-)
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 6e061d659639..a3e1ac044ee9 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1633,12 +1633,11 @@ static int tdx_mem_page_aug(struct kvm *kvm, gfn_t gfn,
* The counter has to be zero on KVM_TDX_FINALIZE_VM, to ensure that there
* are no half-initialized shared EPT pages.
*/
-static int tdx_mem_page_record_premap_cnt(struct kvm *kvm, gfn_t gfn,
- enum pg_level level, kvm_pfn_t pfn)
+static int tdx_mem_page_record_premap_cnt(struct kvm *kvm, enum pg_level level)
{
struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
- if (KVM_BUG_ON(kvm->arch.pre_fault_allowed, kvm))
+ if (KVM_BUG_ON(kvm->arch.pre_fault_allowed || level != PG_LEVEL_4K, kvm))
return -EINVAL;
/* nr_premapped will be decreased when tdh_mem_page_add() is called. */
@@ -1667,10 +1666,6 @@ static int tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn,
if (ret)
return ret;
- /* TODO: handle large pages. */
- if (KVM_BUG_ON(level != PG_LEVEL_4K, kvm))
- return -EINVAL;
-
/*
* Read 'pre_fault_allowed' before 'kvm_tdx->state'; see matching
* barrier in tdx_td_finalize().
@@ -1680,7 +1675,7 @@ static int tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn,
if (likely(kvm_tdx->state == TD_STATE_RUNNABLE))
ret = tdx_mem_page_aug(kvm, gfn, level, page);
else
- ret = tdx_mem_page_record_premap_cnt(kvm, gfn, level, pfn);
+ ret = tdx_mem_page_record_premap_cnt(kvm, level);
if (ret)
tdx_pamt_put(page, level);
@@ -1697,8 +1692,8 @@ static int tdx_sept_drop_private_spte(struct kvm *kvm, gfn_t gfn,
gpa_t gpa = gfn_to_gpa(gfn);
u64 err, entry, level_state;
- /* TODO: handle large pages. */
- if (KVM_BUG_ON(level != PG_LEVEL_4K, kvm))
+ /* Large page is not supported before TD runnable,*/
+ if (KVM_BUG_ON(kvm_tdx->state != TD_STATE_RUNNABLE && level != PG_LEVEL_4K, kvm))
return -EINVAL;
if (KVM_BUG_ON(!is_hkid_assigned(kvm_tdx), kvm))
@@ -1791,7 +1786,7 @@ static int tdx_sept_link_private_spt(struct kvm *kvm, gfn_t gfn,
static int tdx_is_sept_zap_err_due_to_premap(struct kvm_tdx *kvm_tdx, u64 err,
u64 entry, int level)
{
- if (!err || kvm_tdx->state == TD_STATE_RUNNABLE)
+ if (!err || kvm_tdx->state == TD_STATE_RUNNABLE || level > PG_LEVEL_4K)
return false;
if (err != (TDX_EPT_ENTRY_STATE_INCORRECT | TDX_OPERAND_ID_RCX))
@@ -1811,8 +1806,8 @@ static int tdx_sept_zap_private_spte(struct kvm *kvm, gfn_t gfn,
gpa_t gpa = gfn_to_gpa(gfn) & KVM_HPAGE_MASK(level);
u64 err, entry, level_state;
- /* For now large page isn't supported yet. */
- WARN_ON_ONCE(level != PG_LEVEL_4K);
+ /* Large page is not supported before TD runnable,*/
+ WARN_ON_ONCE(kvm_tdx->state != TD_STATE_RUNNABLE && level != PG_LEVEL_4K);
err = tdh_mem_range_block(&kvm_tdx->td, gpa, tdx_level, &entry, &level_state);
@@ -1993,6 +1988,9 @@ static int tdx_sept_remove_private_spte(struct kvm *kvm, gfn_t gfn,
struct folio *folio = page_folio(page);
int ret;
+ WARN_ON_ONCE(folio_page_idx(folio, page) + KVM_PAGES_PER_HPAGE(level) >
+ folio_nr_pages(folio));
+
if (!is_hkid_assigned(to_kvm_tdx(kvm))) {
KVM_BUG_ON(!kvm->vm_dead, kvm);
@@ -3470,7 +3468,10 @@ int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp)
int tdx_gmem_private_max_mapping_level(struct kvm *kvm, kvm_pfn_t pfn)
{
- return PG_LEVEL_4K;
+ if (unlikely(to_kvm_tdx(kvm)->state != TD_STATE_RUNNABLE))
+ return PG_LEVEL_4K;
+
+ return PG_LEVEL_2M;
}
static int tdx_online_cpu(unsigned int cpu)
--
2.43.2
^ permalink raw reply related [flat|nested] 129+ messages in thread
* Re: [RFC PATCH v2 23/23] KVM: TDX: Turn on PG_LEVEL_2M after TD is RUNNABLE
2025-08-07 9:46 ` [RFC PATCH v2 23/23] KVM: TDX: Turn on PG_LEVEL_2M after TD is RUNNABLE Yan Zhao
@ 2025-11-11 11:25 ` Huang, Kai
2025-11-14 8:34 ` Yan Zhao
0 siblings, 1 reply; 129+ messages in thread
From: Huang, Kai @ 2025-11-11 11:25 UTC (permalink / raw)
To: pbonzini@redhat.com, seanjc@google.com, Zhao, Yan Y
Cc: quic_eberman@quicinc.com, kvm@vger.kernel.org, Li, Xiaoyao,
Du, Fan, Hansen, Dave, david@redhat.com, thomas.lendacky@amd.com,
vbabka@suse.cz, tabba@google.com, kas@kernel.org,
michael.roth@amd.com, Weiny, Ira, linux-kernel@vger.kernel.org,
binbin.wu@linux.intel.com, ackerleytng@google.com,
Yamahata, Isaku, Peng, Chao P, zhiquan1.li@intel.com,
Annapurve, Vishal, Edgecombe, Rick P, Miao, Jun, x86@kernel.org,
pgonda@google.com
On Thu, 2025-08-07 at 17:46 +0800, Yan Zhao wrote:
> + /* Large page is not supported before TD runnable,*/
> + if (KVM_BUG_ON(kvm_tdx->state != TD_STATE_RUNNABLE && level != PG_LEVEL_4K, kvm))
> return -EINVAL;
Not a particular comment to this patch, but could you elaborate a little bit
why PROMOTE isn't supported in this series? This doesn't seem to be
mentioned anywhere in this series (not in the coverletter either).
E.g., theoretically, I think we can have a way to PROMOTE mappings for
initial memory pages (via TDH.MEM.PAGE.ADD), e.g., right before the TD is
becoming runnable?
^ permalink raw reply [flat|nested] 129+ messages in thread
* Re: [RFC PATCH v2 23/23] KVM: TDX: Turn on PG_LEVEL_2M after TD is RUNNABLE
2025-11-11 11:25 ` Huang, Kai
@ 2025-11-14 8:34 ` Yan Zhao
2025-11-18 0:56 ` Huang, Kai
0 siblings, 1 reply; 129+ messages in thread
From: Yan Zhao @ 2025-11-14 8:34 UTC (permalink / raw)
To: Huang, Kai
Cc: pbonzini@redhat.com, seanjc@google.com, kvm@vger.kernel.org,
Li, Xiaoyao, Du, Fan, Hansen, Dave, david@redhat.com,
thomas.lendacky@amd.com, vbabka@suse.cz, tabba@google.com,
kas@kernel.org, michael.roth@amd.com, Weiny, Ira,
linux-kernel@vger.kernel.org, binbin.wu@linux.intel.com,
ackerleytng@google.com, Yamahata, Isaku, Peng, Chao P,
Annapurve, Vishal, Edgecombe, Rick P, Miao, Jun, x86@kernel.org,
pgonda@google.com
On Tue, Nov 11, 2025 at 07:25:30PM +0800, Huang, Kai wrote:
> On Thu, 2025-08-07 at 17:46 +0800, Yan Zhao wrote:
> > + /* Large page is not supported before TD runnable,*/
> > + if (KVM_BUG_ON(kvm_tdx->state != TD_STATE_RUNNABLE && level != PG_LEVEL_4K, kvm))
> > return -EINVAL;
>
> Not a particular comment to this patch, but could you elaborate a little bit
> why PROMOTE isn't supported in this series? This doesn't seem to be
> mentioned anywhere in this series (not in the coverletter either).
I mentioned it briefly in the coverletter:
6. Page merging (page promotion)
Promotion is disallowed (in patch 7), because
- The current TDX module requires all 4KB leafs to be either all PENDING
or all ACCEPTED before a successful promotion to 2MB. This requirement
prevents successful page merging after partially converting a 2MB
range from private to shared and then back to private, which is the
primary scenario necessitating page promotion.
- tdh_mem_page_promote() depends on tdh_mem_range_block() in the current
TDX module. Consequently, handling BUSY errors is complex, as page
merging typically occurs in the fault path under a shared mmu_lock.
v1 explains it in more details (See section "Page merging (page promotion)" in
[*]).
[*] https://lore.kernel.org/all/20250424030033.32635-1-yan.y.zhao@intel.com/
> E.g., theoretically, I think we can have a way to PROMOTE mappings for
> initial memory pages (via TDH.MEM.PAGE.ADD), e.g., right before the TD is
> becoming runnable?
Right. Kirill also asked about it in v1 [1].
Though we no longer need to worry about the nr_premapped calculation after Sean's
cleanup series, I think there's no need to complicate the design for the initial
support, given the limited amount of initial memory pages.
In my environment, for a TD with 8GB memory, there are 1086 2MB mappings at
runtime, while the initial memory is merely 1049 4KB pages in total.
So, the gain is less than 2/1000.
Will call it out in the next version.
[1] https://lore.kernel.org/all/aAn3SSocw0XvaRye@yzhao56-desk.sh.intel.com/
^ permalink raw reply [flat|nested] 129+ messages in thread
* Re: [RFC PATCH v2 23/23] KVM: TDX: Turn on PG_LEVEL_2M after TD is RUNNABLE
2025-11-14 8:34 ` Yan Zhao
@ 2025-11-18 0:56 ` Huang, Kai
0 siblings, 0 replies; 129+ messages in thread
From: Huang, Kai @ 2025-11-18 0:56 UTC (permalink / raw)
To: Zhao, Yan Y
Cc: Du, Fan, Li, Xiaoyao, kvm@vger.kernel.org, Hansen, Dave,
david@redhat.com, thomas.lendacky@amd.com, tabba@google.com,
vbabka@suse.cz, michael.roth@amd.com,
linux-kernel@vger.kernel.org, seanjc@google.com,
pbonzini@redhat.com, binbin.wu@linux.intel.com,
ackerleytng@google.com, kas@kernel.org, Weiny, Ira, Peng, Chao P,
Yamahata, Isaku, Annapurve, Vishal, Edgecombe, Rick P, Miao, Jun,
x86@kernel.org, pgonda@google.com
On Fri, 2025-11-14 at 16:34 +0800, Yan Zhao wrote:
> On Tue, Nov 11, 2025 at 07:25:30PM +0800, Huang, Kai wrote:
> > On Thu, 2025-08-07 at 17:46 +0800, Yan Zhao wrote:
> > > + /* Large page is not supported before TD runnable,*/
> > > + if (KVM_BUG_ON(kvm_tdx->state != TD_STATE_RUNNABLE && level != PG_LEVEL_4K, kvm))
> > > return -EINVAL;
> >
> > Not a particular comment to this patch, but could you elaborate a little bit
> > why PROMOTE isn't supported in this series? This doesn't seem to be
> > mentioned anywhere in this series (not in the coverletter either).
> I mentioned it briefly in the coverletter:
>
> 6. Page merging (page promotion)
> Promotion is disallowed (in patch 7), because
> - The current TDX module requires all 4KB leafs to be either all PENDING
> or all ACCEPTED before a successful promotion to 2MB. This requirement
> prevents successful page merging after partially converting a 2MB
> range from private to shared and then back to private, which is the
> primary scenario necessitating page promotion.
> - tdh_mem_page_promote() depends on tdh_mem_range_block() in the current
> TDX module. Consequently, handling BUSY errors is complex, as page
> merging typically occurs in the fault path under a shared mmu_lock.
>
>
> v1 explains it in more details (See section "Page merging (page promotion)" in
> [*]).
>
> [*] https://lore.kernel.org/all/20250424030033.32635-1-yan.y.zhao@intel.com/
>
> > E.g., theoretically, I think we can have a way to PROMOTE mappings for
> > initial memory pages (via TDH.MEM.PAGE.ADD), e.g., right before the TD is
> > becoming runnable?
> Right. Kirill also asked about it in v1 [1].
>
> Though we no longer need to worry about the nr_premapped calculation after Sean's
> cleanup series, I think there's no need to complicate the design for the initial
> support, given the limited amount of initial memory pages.
>
> In my environment, for a TD with 8GB memory, there are 1086 2MB mappings at
> runtime, while the initial memory is merely 1049 4KB pages in total.
> So, the gain is less than 2/1000.
>
> Will call it out in the next version.
>
> [1] https://lore.kernel.org/all/aAn3SSocw0XvaRye@yzhao56-desk.sh.intel.com/
Agreed. Thanks.
^ permalink raw reply [flat|nested] 129+ messages in thread