* [PATCH V2 0/2] x86/tdx: Skip clearing reclaimed pages unless X86_BUG_TDX_PW_MCE is present
@ 2025-07-03 15:37 Adrian Hunter
  2025-07-03 15:37 ` [PATCH V2 1/2] x86/tdx: Eliminate duplicate code in tdx_clear_page() Adrian Hunter
  2025-07-03 15:37 ` [PATCH V2 2/2] x86/tdx: Skip clearing reclaimed pages unless X86_BUG_TDX_PW_MCE is present Adrian Hunter
  0 siblings, 2 replies; 21+ messages in thread
From: Adrian Hunter @ 2025-07-03 15:37 UTC (permalink / raw)
  To: Dave Hansen, pbonzini, seanjc, vannapurve
  Cc: Tony Luck, Borislav Petkov, Thomas Gleixner, Ingo Molnar, x86,
	H Peter Anvin, linux-kernel, kvm, rick.p.edgecombe,
	kirill.shutemov, kai.huang, reinette.chatre, xiaoyao.li,
	tony.lindgren, binbin.wu, isaku.yamahata, yan.y.zhao, chao.gao

Hi

Here are 2 small self-explanatory patches related to clearing TDX private
pages.

Patch 1 is a minor tidy-up.

Patch 2 skips the clearing step on unaffected platforms, which can
improve shutdown time by up to 40%.
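
For reference, a rough sketch of what the reclaimed-page reset ends up
looking like with both patches applied (names as used in the patches
below):

	void tdx_quirk_reset_paddr(unsigned long base, unsigned long size)
	{
		const void *zero_page = (const void *)page_address(ZERO_PAGE(0));
		unsigned long phys, end;

		/* Only erratum-affected platforms need the explicit reset. */
		if (!boot_cpu_has_bug(X86_BUG_TDX_PW_MCE))
			return;

		end = base + size;
		for (phys = base; phys < end; phys += 64)
			movdir64b(__va(phys), zero_page);

		/*
		 * MOVDIR64B stores use WC buffers.  Prevent following memory
		 * reads from seeing potentially poisoned cache.
		 */
		mb();
	}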


Changes in V2 (as requested by Dave):

      x86/tdx: Eliminate duplicate code in tdx_clear_page()
	Rename reset_tdx_pages() to tdx_quirk_reset_paddr()
	Call tdx_quirk_reset_paddr() directly

      x86/tdx: Skip clearing reclaimed pages unless X86_BUG_TDX_PW_MCE is present
	Improve the comment


Adrian Hunter (2):
      x86/tdx: Eliminate duplicate code in tdx_clear_page()
      x86/tdx: Skip clearing reclaimed pages unless X86_BUG_TDX_PW_MCE is present

 arch/x86/include/asm/tdx.h  |  2 ++
 arch/x86/kvm/vmx/tdx.c      | 25 +++----------------------
 arch/x86/virt/vmx/tdx/tdx.c | 13 +++++++++++--
 3 files changed, 16 insertions(+), 24 deletions(-)


Regards
Adrian


* [PATCH V2 1/2] x86/tdx: Eliminate duplicate code in tdx_clear_page()
  2025-07-03 15:37 [PATCH V2 0/2] x86/tdx: Skip clearing reclaimed pages unless X86_BUG_TDX_PW_MCE is present Adrian Hunter
@ 2025-07-03 15:37 ` Adrian Hunter
  2025-07-03 16:34   ` Kirill A. Shutemov
                     ` (4 more replies)
  2025-07-03 15:37 ` [PATCH V2 2/2] x86/tdx: Skip clearing reclaimed pages unless X86_BUG_TDX_PW_MCE is present Adrian Hunter
  1 sibling, 5 replies; 21+ messages in thread
From: Adrian Hunter @ 2025-07-03 15:37 UTC (permalink / raw)
  To: Dave Hansen, pbonzini, seanjc, vannapurve
  Cc: Tony Luck, Borislav Petkov, Thomas Gleixner, Ingo Molnar, x86,
	H Peter Anvin, linux-kernel, kvm, rick.p.edgecombe,
	kirill.shutemov, kai.huang, reinette.chatre, xiaoyao.li,
	tony.lindgren, binbin.wu, isaku.yamahata, yan.y.zhao, chao.gao

tdx_clear_page() and reset_tdx_pages() duplicate the TDX page clearing
logic.  Rename reset_tdx_pages() to tdx_quirk_reset_paddr() and use it
in place of tdx_clear_page().
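
Call sites then pass a physical address and size, for example (matching
the diff below):

	tdx_quirk_reset_paddr(page_to_phys(page), PAGE_SIZE);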

Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
---


Changes in V2:

	Rename reset_tdx_pages() to tdx_quirk_reset_paddr()
	Call tdx_quirk_reset_paddr() directly


 arch/x86/include/asm/tdx.h  |  2 ++
 arch/x86/kvm/vmx/tdx.c      | 25 +++----------------------
 arch/x86/virt/vmx/tdx/tdx.c |  5 +++--
 3 files changed, 8 insertions(+), 24 deletions(-)

diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index 7ddef3a69866..f66328404724 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -131,6 +131,8 @@ int tdx_guest_keyid_alloc(void);
 u32 tdx_get_nr_guest_keyids(void);
 void tdx_guest_keyid_free(unsigned int keyid);
 
+void tdx_quirk_reset_paddr(unsigned long base, unsigned long size);
+
 struct tdx_td {
 	/* TD root structure: */
 	struct page *tdr_page;
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index a08e7055d1db..031e36665757 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -276,25 +276,6 @@ static inline void tdx_disassociate_vp(struct kvm_vcpu *vcpu)
 	vcpu->cpu = -1;
 }
 
-static void tdx_clear_page(struct page *page)
-{
-	const void *zero_page = (const void *) page_to_virt(ZERO_PAGE(0));
-	void *dest = page_to_virt(page);
-	unsigned long i;
-
-	/*
-	 * The page could have been poisoned.  MOVDIR64B also clears
-	 * the poison bit so the kernel can safely use the page again.
-	 */
-	for (i = 0; i < PAGE_SIZE; i += 64)
-		movdir64b(dest + i, zero_page);
-	/*
-	 * MOVDIR64B store uses WC buffer.  Prevent following memory reads
-	 * from seeing potentially poisoned cache.
-	 */
-	__mb();
-}
-
 static void tdx_no_vcpus_enter_start(struct kvm *kvm)
 {
 	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
@@ -340,7 +321,7 @@ static int tdx_reclaim_page(struct page *page)
 
 	r = __tdx_reclaim_page(page);
 	if (!r)
-		tdx_clear_page(page);
+		tdx_quirk_reset_paddr(page_to_phys(page), PAGE_SIZE);
 	return r;
 }
 
@@ -589,7 +570,7 @@ static void tdx_reclaim_td_control_pages(struct kvm *kvm)
 		pr_tdx_error(TDH_PHYMEM_PAGE_WBINVD, err);
 		return;
 	}
-	tdx_clear_page(kvm_tdx->td.tdr_page);
+	tdx_quirk_reset_paddr(page_to_phys(kvm_tdx->td.tdr_page), PAGE_SIZE);
 
 	__free_page(kvm_tdx->td.tdr_page);
 	kvm_tdx->td.tdr_page = NULL;
@@ -1689,7 +1670,7 @@ static int tdx_sept_drop_private_spte(struct kvm *kvm, gfn_t gfn,
 		pr_tdx_error(TDH_PHYMEM_PAGE_WBINVD, err);
 		return -EIO;
 	}
-	tdx_clear_page(page);
+	tdx_quirk_reset_paddr(page_to_phys(page), PAGE_SIZE);
 	tdx_unpin(kvm, page);
 	return 0;
 }
diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index c7a9a087ccaf..14d93ed05bd2 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -637,7 +637,7 @@ static int tdmrs_set_up_pamt_all(struct tdmr_info_list *tdmr_list,
  * clear these pages.  Note this function doesn't flush cache of
  * these TDX private pages.  The caller should make sure of that.
  */
-static void reset_tdx_pages(unsigned long base, unsigned long size)
+void tdx_quirk_reset_paddr(unsigned long base, unsigned long size)
 {
 	const void *zero_page = (const void *)page_address(ZERO_PAGE(0));
 	unsigned long phys, end;
@@ -653,10 +653,11 @@ static void reset_tdx_pages(unsigned long base, unsigned long size)
 	 */
 	mb();
 }
+EXPORT_SYMBOL_GPL(tdx_quirk_reset_paddr);
 
 static void tdmr_reset_pamt(struct tdmr_info *tdmr)
 {
-	tdmr_do_pamt_func(tdmr, reset_tdx_pages);
+	tdmr_do_pamt_func(tdmr, tdx_quirk_reset_paddr);
 }
 
 static void tdmrs_reset_pamt_all(struct tdmr_info_list *tdmr_list)
-- 
2.48.1



* [PATCH V2 2/2] x86/tdx: Skip clearing reclaimed pages unless X86_BUG_TDX_PW_MCE is present
  2025-07-03 15:37 [PATCH V2 0/2] x86/tdx: Skip clearing reclaimed pages unless X86_BUG_TDX_PW_MCE is present Adrian Hunter
  2025-07-03 15:37 ` [PATCH V2 1/2] x86/tdx: Eliminate duplicate code in tdx_clear_page() Adrian Hunter
@ 2025-07-03 15:37 ` Adrian Hunter
  2025-07-03 16:44   ` Kirill A. Shutemov
                     ` (4 more replies)
  1 sibling, 5 replies; 21+ messages in thread
From: Adrian Hunter @ 2025-07-03 15:37 UTC (permalink / raw)
  To: Dave Hansen, pbonzini, seanjc, vannapurve
  Cc: Tony Luck, Borislav Petkov, Thomas Gleixner, Ingo Molnar, x86,
	H Peter Anvin, linux-kernel, kvm, rick.p.edgecombe,
	kirill.shutemov, kai.huang, reinette.chatre, xiaoyao.li,
	tony.lindgren, binbin.wu, isaku.yamahata, yan.y.zhao, chao.gao

Avoid clearing reclaimed TDX private pages unless the platform is affected
by the X86_BUG_TDX_PW_MCE erratum. This significantly reduces VM shutdown
time on unaffected systems.

Background

KVM currently clears reclaimed TDX private pages using MOVDIR64B, which:

   - Clears the TD Owner bit (which identifies TDX private memory) and
     integrity metadata without triggering integrity violations.
   - Clears poison from cache lines without consuming it, avoiding MCEs on
     access (refer TDX Module Base spec. 16.5. Handling Machine Check
     Events during Guest TD Operation).

The TDX module also uses MOVDIR64B to initialize private pages before use.
If cache flushing is needed, it sets TDX_FEATURES.CLFLUSH_BEFORE_ALLOC.
However, KVM currently flushes unconditionally, refer commit 94c477a751c7b
("x86/virt/tdx: Add SEAMCALL wrappers to add TD private pages")

In contrast, when private pages are reclaimed, the TDX Module handles
flushing via the TDH.PHYMEM.CACHE.WB SEAMCALL.

Problem

Clearing all private pages during VM shutdown is costly. For guests
with a large amount of memory it can take minutes.

Solution

TDX Module Base Architecture spec. documents that private pages reclaimed
from a TD should be initialized using MOVDIR64B, in order to avoid
integrity violation or TD bit mismatch detection when later being read
using a shared HKID, refer April 2025 spec. "Page Initialization" in
section "8.6.2. Platforms not Using ACT: Required Cache Flush and
Initialization by the Host VMM"

That is an overstatement and will be clarified in coming versions of the
spec. In fact, as outlined in "Table 16.2: Non-ACT Platforms Checks on
Memory" and "Table 16.3: Non-ACT Platforms Checks on Memory Reads in Li
Mode" in the same spec, there is no issue accessing such reclaimed pages
using a shared key that does not have integrity enabled. Linux always uses
KeyID 0 which never has integrity enabled. KeyID 0 is also the TME KeyID
which disallows integrity, refer "TME Policy/Encryption Algorithm" bit
description in "Intel Architecture Memory Encryption Technologies" spec
version 1.6 April 2025. So there is no need to clear pages to avoid
integrity violations.

There remains a risk of poison consumption. However, in the context of
TDX, it is expected that there would be a machine check associated with the
original poisoning. On some platforms that results in a panic. However
platforms may support "SEAM_NR" Machine Check capability, in which case
Linux machine check handler marks the page as poisoned, which prevents it
from being allocated anymore, refer commit 7911f145de5fe ("x86/mce:
Implement recovery for errors in TDX/SEAM non-root mode")

Improvement

By skipping the clearing step on unaffected platforms, shutdown time
can improve by up to 40%.

On platforms with the X86_BUG_TDX_PW_MCE erratum (SPR and EMR), continue
clearing because these platforms may trigger poison on partial writes to
previously-private pages, even with KeyID 0, refer commit 1e536e1068970
("x86/cpu: Detect TDX partial write machine check erratum")

Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
---


Changes in V2:

	Improve the comment


 arch/x86/virt/vmx/tdx/tdx.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index 14d93ed05bd2..4fa86188aa40 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -642,6 +642,14 @@ void tdx_quirk_reset_paddr(unsigned long base, unsigned long size)
 	const void *zero_page = (const void *)page_address(ZERO_PAGE(0));
 	unsigned long phys, end;
 
+	/*
+	 * Typically, any write to the page will convert it from TDX
+	 * private back to normal kernel memory. Systems with the
+	 * erratum need to do the conversion explicitly.
+	 */
+	if (!boot_cpu_has_bug(X86_BUG_TDX_PW_MCE))
+		return;
+
 	end = base + size;
 	for (phys = base; phys < end; phys += 64)
 		movdir64b(__va(phys), zero_page);
-- 
2.48.1



* Re: [PATCH V2 1/2] x86/tdx: Eliminate duplicate code in tdx_clear_page()
  2025-07-03 15:37 ` [PATCH V2 1/2] x86/tdx: Eliminate duplicate code in tdx_clear_page() Adrian Hunter
@ 2025-07-03 16:34   ` Kirill A. Shutemov
  2025-07-04  6:44   ` Binbin Wu
                     ` (3 subsequent siblings)
  4 siblings, 0 replies; 21+ messages in thread
From: Kirill A. Shutemov @ 2025-07-03 16:34 UTC (permalink / raw)
  To: Adrian Hunter
  Cc: Dave Hansen, pbonzini, seanjc, vannapurve, Tony Luck,
	Borislav Petkov, Thomas Gleixner, Ingo Molnar, x86, H Peter Anvin,
	linux-kernel, kvm, rick.p.edgecombe, kai.huang, reinette.chatre,
	xiaoyao.li, tony.lindgren, binbin.wu, isaku.yamahata, yan.y.zhao,
	chao.gao

On Thu, Jul 03, 2025 at 06:37:11PM +0300, Adrian Hunter wrote:
> tdx_clear_page() and reset_tdx_pages() duplicate the TDX page clearing
> logic.  Rename reset_tdx_pages() to tdx_quirk_reset_paddr() and use it
> in place of tdx_clear_page().
> 
> Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>

Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

-- 
  Kiryl Shutsemau / Kirill A. Shutemov


* Re: [PATCH V2 2/2] x86/tdx: Skip clearing reclaimed pages unless X86_BUG_TDX_PW_MCE is present
  2025-07-03 15:37 ` [PATCH V2 2/2] x86/tdx: Skip clearing reclaimed pages unless X86_BUG_TDX_PW_MCE is present Adrian Hunter
@ 2025-07-03 16:44   ` Kirill A. Shutemov
  2025-07-03 17:06   ` Vishal Annapurve
                     ` (3 subsequent siblings)
  4 siblings, 0 replies; 21+ messages in thread
From: Kirill A. Shutemov @ 2025-07-03 16:44 UTC (permalink / raw)
  To: Adrian Hunter
  Cc: Dave Hansen, pbonzini, seanjc, vannapurve, Tony Luck,
	Borislav Petkov, Thomas Gleixner, Ingo Molnar, x86, H Peter Anvin,
	linux-kernel, kvm, rick.p.edgecombe, kai.huang, reinette.chatre,
	xiaoyao.li, tony.lindgren, binbin.wu, isaku.yamahata, yan.y.zhao,
	chao.gao

On Thu, Jul 03, 2025 at 06:37:12PM +0300, Adrian Hunter wrote:
> Avoid clearing reclaimed TDX private pages unless the platform is affected
> by the X86_BUG_TDX_PW_MCE erratum. This significantly reduces VM shutdown
> time on unaffected systems.
> 
> [...]
> 
> Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>

Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

-- 
  Kiryl Shutsemau / Kirill A. Shutemov


* Re: [PATCH V2 2/2] x86/tdx: Skip clearing reclaimed pages unless X86_BUG_TDX_PW_MCE is present
  2025-07-03 15:37 ` [PATCH V2 2/2] x86/tdx: Skip clearing reclaimed pages unless X86_BUG_TDX_PW_MCE is present Adrian Hunter
  2025-07-03 16:44   ` Kirill A. Shutemov
@ 2025-07-03 17:06   ` Vishal Annapurve
  2025-07-04  5:37     ` Adrian Hunter
  2025-07-04  1:32   ` Xiaoyao Li
                     ` (2 subsequent siblings)
  4 siblings, 1 reply; 21+ messages in thread
From: Vishal Annapurve @ 2025-07-03 17:06 UTC (permalink / raw)
  To: Adrian Hunter
  Cc: Dave Hansen, pbonzini, seanjc, Tony Luck, Borislav Petkov,
	Thomas Gleixner, Ingo Molnar, x86, H Peter Anvin, linux-kernel,
	kvm, rick.p.edgecombe, kirill.shutemov, kai.huang,
	reinette.chatre, xiaoyao.li, tony.lindgren, binbin.wu,
	isaku.yamahata, yan.y.zhao, chao.gao

On Thu, Jul 3, 2025 at 8:37 AM Adrian Hunter <adrian.hunter@intel.com> wrote:
>
> Avoid clearing reclaimed TDX private pages unless the platform is affected
> by the X86_BUG_TDX_PW_MCE erratum. This significantly reduces VM shutdown
> time on unaffected systems.
>
> [...]

This patch looks good to me.

I would like to raise a related topic, is there any requirement for
zeroing pages on conversion from private to shared before
userspace/guest faults in the gpa ranges as shared?

If the answer is no for all CoCo architectures, then guest_memfd can
simply zero pages on allocation for all its users and not worry
about zeroing later.


* Re: [PATCH V2 2/2] x86/tdx: Skip clearing reclaimed pages unless X86_BUG_TDX_PW_MCE is present
  2025-07-03 15:37 ` [PATCH V2 2/2] x86/tdx: Skip clearing reclaimed pages unless X86_BUG_TDX_PW_MCE is present Adrian Hunter
  2025-07-03 16:44   ` Kirill A. Shutemov
  2025-07-03 17:06   ` Vishal Annapurve
@ 2025-07-04  1:32   ` Xiaoyao Li
  2025-07-04  6:44     ` Binbin Wu
  2025-07-07  2:09   ` Huang, Kai
  2025-07-07  3:16   ` Chao Gao
  4 siblings, 1 reply; 21+ messages in thread
From: Xiaoyao Li @ 2025-07-04  1:32 UTC (permalink / raw)
  To: Adrian Hunter, Dave Hansen, pbonzini, seanjc, vannapurve
  Cc: Tony Luck, Borislav Petkov, Thomas Gleixner, Ingo Molnar, x86,
	H Peter Anvin, linux-kernel, kvm, rick.p.edgecombe,
	kirill.shutemov, kai.huang, reinette.chatre, tony.lindgren,
	binbin.wu, isaku.yamahata, yan.y.zhao, chao.gao

On 7/3/2025 11:37 PM, Adrian Hunter wrote:
> Avoid clearing reclaimed TDX private pages unless the platform is affected
> by the X86_BUG_TDX_PW_MCE erratum. This significantly reduces VM shutdown
> time on unaffected systems.
> 
> [...]
> 
> @@ -642,6 +642,14 @@ void tdx_quirk_reset_paddr(unsigned long base, unsigned long size)
>   	const void *zero_page = (const void *)page_address(ZERO_PAGE(0));
>   	unsigned long phys, end;
>   
> +	/*
> +	 * Typically, any write to the page will convert it from TDX
> +	 * private back to normal kernel memory. Systems with the
> +	 * erratum need to do the conversion explicitly.

Can we call out that "systems with the erratum need to do the conversion
explicitly via MOVDIR64B"?

Without "via MOVDIR64B", it gives the impression that an explicit
conversion with any write is OK for systems with the erratum, and that the
following code just happens to use movdir64b().
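
For example, something along these lines (just a sketch of possible
wording, not a final proposal):

	/*
	 * Typically, any write to the page will convert it from TDX
	 * private back to normal kernel memory.  Systems with the
	 * X86_BUG_TDX_PW_MCE erratum must do the conversion explicitly
	 * via MOVDIR64B, which is why the reset below is skipped on
	 * everything else.
	 */
	if (!boot_cpu_has_bug(X86_BUG_TDX_PW_MCE))
		return;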

> +	 */
> +	if (!boot_cpu_has_bug(X86_BUG_TDX_PW_MCE))
> +		return;
> +
>   	end = base + size;
>   	for (phys = base; phys < end; phys += 64)
>   		movdir64b(__va(phys), zero_page);



* Re: [PATCH V2 2/2] x86/tdx: Skip clearing reclaimed pages unless X86_BUG_TDX_PW_MCE is present
  2025-07-03 17:06   ` Vishal Annapurve
@ 2025-07-04  5:37     ` Adrian Hunter
  2025-07-07 23:31       ` Vishal Annapurve
  0 siblings, 1 reply; 21+ messages in thread
From: Adrian Hunter @ 2025-07-04  5:37 UTC (permalink / raw)
  To: Vishal Annapurve
  Cc: Dave Hansen, pbonzini, seanjc, Tony Luck, Borislav Petkov,
	Thomas Gleixner, Ingo Molnar, x86, H Peter Anvin, linux-kernel,
	kvm, rick.p.edgecombe, kirill.shutemov, kai.huang,
	reinette.chatre, xiaoyao.li, tony.lindgren, binbin.wu,
	isaku.yamahata, yan.y.zhao, chao.gao

On 03/07/2025 20:06, Vishal Annapurve wrote:
> On Thu, Jul 3, 2025 at 8:37 AM Adrian Hunter <adrian.hunter@intel.com> wrote:
>>
>> Avoid clearing reclaimed TDX private pages unless the platform is affected
>> by the X86_BUG_TDX_PW_MCE erratum. This significantly reduces VM shutdown
>> time on unaffected systems.
>>
>> [...]
> 
> This patch looks good to me.
> 
> I would like to raise a related topic, is there any requirement for
> zeroing pages on conversion from private to shared before
> userspace/guest faults in the gpa ranges as shared?

For TDX, clearing must still be done for platforms with the
partial-write erratum (SPR and EMR).

> 
> If the answer is no for all CoCo architectures then guest_memfd can
> simply just zero pages on allocation for all it's users and not worry
> about zeroing later.

In fact TDX does not need private pages to be zeroed on allocation
because the TDX Module always does that.


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH V2 2/2] x86/tdx: Skip clearing reclaimed pages unless X86_BUG_TDX_PW_MCE is present
  2025-07-04  1:32   ` Xiaoyao Li
@ 2025-07-04  6:44     ` Binbin Wu
  0 siblings, 0 replies; 21+ messages in thread
From: Binbin Wu @ 2025-07-04  6:44 UTC (permalink / raw)
  To: Xiaoyao Li, Adrian Hunter
  Cc: Dave Hansen, pbonzini, seanjc, vannapurve, Tony Luck,
	Borislav Petkov, Thomas Gleixner, Ingo Molnar, x86, H Peter Anvin,
	linux-kernel, kvm, rick.p.edgecombe, kirill.shutemov, kai.huang,
	reinette.chatre, tony.lindgren, isaku.yamahata, yan.y.zhao,
	chao.gao



On 7/4/2025 9:32 AM, Xiaoyao Li wrote:
> On 7/3/2025 11:37 PM, Adrian Hunter wrote:
>> Avoid clearing reclaimed TDX private pages unless the platform is affected
>> by the X86_BUG_TDX_PW_MCE erratum. This significantly reduces VM shutdown
>> time on unaffected systems.
>>
>> [...]
>>
>> @@ -642,6 +642,14 @@ void tdx_quirk_reset_paddr(unsigned long base, unsigned long size)
>>       const void *zero_page = (const void *)page_address(ZERO_PAGE(0));
>>       unsigned long phys, end;
>>   +    /*
>> +     * Typically, any write to the page will convert it from TDX
>> +     * private back to normal kernel memory. Systems with the
>> +     * erratum need to do the conversion explicitly.
>
> Can we call out that "system with erratum need to do the conversion explicitly via MOVDIR64B" ?

The original description of the function already mentions it.
I am wondering whether it would be better to merge this comment into the
function description?


>
> Without "via MOVDIR64B", it leads to the impression that explicit conversion with any write is OK for system with the erratum, and maybe the following code just happened to use movdir64b().
>
>> +     */
>> +    if (!boot_cpu_has_bug(X86_BUG_TDX_PW_MCE))
>> +        return;
>> +
>>       end = base + size;
>>       for (phys = base; phys < end; phys += 64)
>>           movdir64b(__va(phys), zero_page);
>



* Re: [PATCH V2 1/2] x86/tdx: Eliminate duplicate code in tdx_clear_page()
  2025-07-03 15:37 ` [PATCH V2 1/2] x86/tdx: Eliminate duplicate code in tdx_clear_page() Adrian Hunter
  2025-07-03 16:34   ` Kirill A. Shutemov
@ 2025-07-04  6:44   ` Binbin Wu
  2025-07-04 15:33   ` Xiaoyao Li
                     ` (2 subsequent siblings)
  4 siblings, 0 replies; 21+ messages in thread
From: Binbin Wu @ 2025-07-04  6:44 UTC (permalink / raw)
  To: Adrian Hunter
  Cc: Dave Hansen, pbonzini, seanjc, vannapurve, Tony Luck,
	Borislav Petkov, Thomas Gleixner, Ingo Molnar, x86, H Peter Anvin,
	linux-kernel, kvm, rick.p.edgecombe, kirill.shutemov, kai.huang,
	reinette.chatre, xiaoyao.li, tony.lindgren, isaku.yamahata,
	yan.y.zhao, chao.gao



On 7/3/2025 11:37 PM, Adrian Hunter wrote:
> tdx_clear_page() and reset_tdx_pages() duplicate the TDX page clearing
> logic.  Rename reset_tdx_pages() to tdx_quirk_reset_paddr() and use it
> in place of tdx_clear_page().
>
> Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>

Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>



* Re: [PATCH V2 1/2] x86/tdx: Eliminate duplicate code in tdx_clear_page()
  2025-07-03 15:37 ` [PATCH V2 1/2] x86/tdx: Eliminate duplicate code in tdx_clear_page() Adrian Hunter
  2025-07-03 16:34   ` Kirill A. Shutemov
  2025-07-04  6:44   ` Binbin Wu
@ 2025-07-04 15:33   ` Xiaoyao Li
  2025-07-07  2:08   ` Huang, Kai
  2025-07-07 17:31   ` Edgecombe, Rick P
  4 siblings, 0 replies; 21+ messages in thread
From: Xiaoyao Li @ 2025-07-04 15:33 UTC (permalink / raw)
  To: Adrian Hunter, Dave Hansen, pbonzini, seanjc, vannapurve
  Cc: Tony Luck, Borislav Petkov, Thomas Gleixner, Ingo Molnar, x86,
	H Peter Anvin, linux-kernel, kvm, rick.p.edgecombe,
	kirill.shutemov, kai.huang, reinette.chatre, tony.lindgren,
	binbin.wu, isaku.yamahata, yan.y.zhao, chao.gao

On 7/3/2025 11:37 PM, Adrian Hunter wrote:
> tdx_clear_page() and reset_tdx_pages() duplicate the TDX page clearing
> logic.  Rename reset_tdx_pages() to tdx_quirk_reset_paddr() and use it
> in place of tdx_clear_page().
> 
> Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>

Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>


* Re: [PATCH V2 1/2] x86/tdx: Eliminate duplicate code in tdx_clear_page()
  2025-07-03 15:37 ` [PATCH V2 1/2] x86/tdx: Eliminate duplicate code in tdx_clear_page() Adrian Hunter
                     ` (2 preceding siblings ...)
  2025-07-04 15:33   ` Xiaoyao Li
@ 2025-07-07  2:08   ` Huang, Kai
  2025-07-07 17:31   ` Edgecombe, Rick P
  4 siblings, 0 replies; 21+ messages in thread
From: Huang, Kai @ 2025-07-07  2:08 UTC (permalink / raw)
  To: Annapurve, Vishal, pbonzini@redhat.com, Hunter, Adrian,
	seanjc@google.com, dave.hansen@linux.intel.com
  Cc: kvm@vger.kernel.org, Li, Xiaoyao, Zhao, Yan Y, Luck, Tony,
	tony.lindgren@linux.intel.com, binbin.wu@linux.intel.com,
	Chatre, Reinette, linux-kernel@vger.kernel.org, mingo@redhat.com,
	Yamahata, Isaku, kirill.shutemov@linux.intel.com,
	tglx@linutronix.de, hpa@zytor.com, Edgecombe, Rick P,
	bp@alien8.de, Gao, Chao, x86@kernel.org

On Thu, 2025-07-03 at 18:37 +0300, Hunter, Adrian wrote:
> tdx_clear_page() and reset_tdx_pages() duplicate the TDX page clearing
> logic.  Rename reset_tdx_pages() to tdx_quirk_reset_paddr() and use it
> in place of tdx_clear_page().
> 
> Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
> 

Acked-by: Kai Huang <kai.huang@intel.com>


* Re: [PATCH V2 2/2] x86/tdx: Skip clearing reclaimed pages unless X86_BUG_TDX_PW_MCE is present
  2025-07-03 15:37 ` [PATCH V2 2/2] x86/tdx: Skip clearing reclaimed pages unless X86_BUG_TDX_PW_MCE is present Adrian Hunter
                     ` (2 preceding siblings ...)
  2025-07-04  1:32   ` Xiaoyao Li
@ 2025-07-07  2:09   ` Huang, Kai
  2025-07-07  3:16   ` Chao Gao
  4 siblings, 0 replies; 21+ messages in thread
From: Huang, Kai @ 2025-07-07  2:09 UTC (permalink / raw)
  To: Annapurve, Vishal, pbonzini@redhat.com, Hunter, Adrian,
	seanjc@google.com, dave.hansen@linux.intel.com
  Cc: kvm@vger.kernel.org, Li, Xiaoyao, Zhao, Yan Y, Luck, Tony,
	tony.lindgren@linux.intel.com, binbin.wu@linux.intel.com,
	Chatre, Reinette, linux-kernel@vger.kernel.org, mingo@redhat.com,
	Yamahata, Isaku, kirill.shutemov@linux.intel.com,
	tglx@linutronix.de, hpa@zytor.com, Edgecombe, Rick P,
	bp@alien8.de, Gao, Chao, x86@kernel.org

On Thu, 2025-07-03 at 18:37 +0300, Hunter, Adrian wrote:
> Avoid clearing reclaimed TDX private pages unless the platform is affected
> by the X86_BUG_TDX_PW_MCE erratum. This significantly reduces VM shutdown
> time on unaffected systems.
> 
> [...]
> 
> Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
> 

Acked-by: Kai Huang <kai.huang@intel.com>


* Re: [PATCH V2 2/2] x86/tdx: Skip clearing reclaimed pages unless X86_BUG_TDX_PW_MCE is present
  2025-07-03 15:37 ` [PATCH V2 2/2] x86/tdx: Skip clearing reclaimed pages unless X86_BUG_TDX_PW_MCE is present Adrian Hunter
                     ` (3 preceding siblings ...)
  2025-07-07  2:09   ` Huang, Kai
@ 2025-07-07  3:16   ` Chao Gao
  2025-07-07  4:23     ` Dave Hansen
  4 siblings, 1 reply; 21+ messages in thread
From: Chao Gao @ 2025-07-07  3:16 UTC (permalink / raw)
  To: Adrian Hunter
  Cc: Dave Hansen, pbonzini, seanjc, vannapurve, Tony Luck,
	Borislav Petkov, Thomas Gleixner, Ingo Molnar, x86, H Peter Anvin,
	linux-kernel, kvm, rick.p.edgecombe, kirill.shutemov, kai.huang,
	reinette.chatre, xiaoyao.li, tony.lindgren, binbin.wu,
	isaku.yamahata, yan.y.zhao

On Thu, Jul 03, 2025 at 06:37:12PM +0300, Adrian Hunter wrote:
>Avoid clearing reclaimed TDX private pages unless the platform is affected
>by the X86_BUG_TDX_PW_MCE erratum. This significantly reduces VM shutdown
>time on unaffected systems.
>
>[...]
>
>There remains a risk of poison consumption. However, in the context of
>TDX, it is expected that there would be a machine check associated with the
>original poisoning. On some platforms that results in a panic. However
>platforms may support "SEAM_NR" Machine Check capability, in which case
>Linux machine check handler marks the page as poisoned, which prevents it
>from being allocated anymore, refer commit 7911f145de5fe ("x86/mce:
>Implement recovery for errors in TDX/SEAM non-root mode")

Even on a CPU w/ SEAM_NR and w/o X86_BUG_TDX_PW_MCE, is there still a risk of
poisoned memory being returned to the host kernel? Since only poison
consumption causes #MCE, if a poisoned page is never consumed in SEAM non-root
mode, there will be no #MCE, and the mentioned commit won't mark the page as
poisoned.

A reclaimed poisoned page could be reused and potentially cause a kernel panic.
While WBINVD could help, we believe it's not worth it as it will slow down the
vast majority of cases. Is my understanding correct?


* Re: [PATCH V2 2/2] x86/tdx: Skip clearing reclaimed pages unless X86_BUG_TDX_PW_MCE is present
  2025-07-07  3:16   ` Chao Gao
@ 2025-07-07  4:23     ` Dave Hansen
  2025-07-07  7:15       ` Chao Gao
  0 siblings, 1 reply; 21+ messages in thread
From: Dave Hansen @ 2025-07-07  4:23 UTC (permalink / raw)
  To: Chao Gao, Adrian Hunter
  Cc: Dave Hansen, pbonzini, seanjc, vannapurve, Tony Luck,
	Borislav Petkov, Thomas Gleixner, Ingo Molnar, x86, H Peter Anvin,
	linux-kernel, kvm, rick.p.edgecombe, kirill.shutemov, kai.huang,
	reinette.chatre, xiaoyao.li, tony.lindgren, binbin.wu,
	isaku.yamahata, yan.y.zhao

On 7/6/25 20:16, Chao Gao wrote:
> Even on a CPU w/ SEAM_NR and w/o X86_BUG_TDX_PW_MCE, is there still a risk of
> poisoned memory being returned to the host kernel? Since only poison
> consumption causes #MCE, if a poisoned page is never consumed in SEAM non-root
> mode, there will be no #MCE, and the mentioned commit won't mark the page as
> poisoned.
> 
> A reclaimed poisoned page could be reused and potentially cause a kernel panic.
> While WBINVD could help, we believe it's not worth it as it will slow down the
> vast majority of cases. Is my understanding correct?

How is this any different from any other kind of hardware poison?

Why should this specific kind of freeing (TDX private memory being freed
back to the host) operation be different from any other kind of free?


* Re: [PATCH V2 2/2] x86/tdx: Skip clearing reclaimed pages unless X86_BUG_TDX_PW_MCE is present
  2025-07-07  4:23     ` Dave Hansen
@ 2025-07-07  7:15       ` Chao Gao
  2025-07-07 11:39         ` Huang, Kai
  2025-07-07 14:32         ` Dave Hansen
  0 siblings, 2 replies; 21+ messages in thread
From: Chao Gao @ 2025-07-07  7:15 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Adrian Hunter, Dave Hansen, pbonzini, seanjc, vannapurve,
	Tony Luck, Borislav Petkov, Thomas Gleixner, Ingo Molnar, x86,
	H Peter Anvin, linux-kernel, kvm, rick.p.edgecombe,
	kirill.shutemov, kai.huang, reinette.chatre, xiaoyao.li,
	tony.lindgren, binbin.wu, isaku.yamahata, yan.y.zhao

On Sun, Jul 06, 2025 at 09:23:05PM -0700, Dave Hansen wrote:
>On 7/6/25 20:16, Chao Gao wrote:
>> Even on a CPU w/ SEAM_NR and w/o X86_BUG_TDX_PW_MCE, is there still a risk of
>> poisoned memory being returned to the host kernel? Since only poison
>> consumption causes #MCE, if a poisoned page is never consumed in SEAM non-root
>> mode, there will be no #MCE, and the mentioned commit won't mark the page as
>> poisoned.
>> 
>> A reclaimed poisoned page could be reused and potentially cause a kernel panic.
>> While WBINVD could help, we believe it's not worth it as it will slow down the
>> vast majority of cases. Is my understanding correct?
>
>How is this any different from any other kind of hardware poison?

I wasn't arguing that MOVDIR64B should be kept. I was highlighting the risk of
kernel panic on CPUs even without the partial write bug and guessing why it was
not worth fixing.

Regarding your question, the poison likely occurs due to software bugs rather
than hardware issues. And, as stated in the comment removed in patch 1, unlike
other hardware poison, this poison can be cleared using MOVDIR64B.

>
>Why should this specific kind of freeing (TDX private memory being freed
>back to the host) operation be different from any other kind of free?

To limit the impact of software bugs (e.g., TDX module bugs) to TDX guests
rather than affecting the entire kernel. Debugging a TDX module bug that
results in a #MCE in a random host context can be quite frustrating, right?
But, on the other hand, MOVDIR64B incurs a 40% slowdown when shutting down a
TD. So, it's a tradeoff between containing theoretical software bugs and
experiencing a 40% slowdown.

Personally, I also prefer to remove MOVDIR64B, but I also want to point out the
bug triage issue and the risk of kernel panic after removing MOVDIR64B.


* Re: [PATCH V2 2/2] x86/tdx: Skip clearing reclaimed pages unless X86_BUG_TDX_PW_MCE is present
  2025-07-07  7:15       ` Chao Gao
@ 2025-07-07 11:39         ` Huang, Kai
  2025-07-07 14:32         ` Dave Hansen
  1 sibling, 0 replies; 21+ messages in thread
From: Huang, Kai @ 2025-07-07 11:39 UTC (permalink / raw)
  To: Hansen, Dave, Gao, Chao
  Cc: kvm@vger.kernel.org, Li, Xiaoyao, Luck, Tony,
	dave.hansen@linux.intel.com, tony.lindgren@linux.intel.com,
	Zhao, Yan Y, Hunter, Adrian, linux-kernel@vger.kernel.org,
	seanjc@google.com, mingo@redhat.com, pbonzini@redhat.com,
	tglx@linutronix.de, Yamahata, Isaku,
	kirill.shutemov@linux.intel.com, Chatre, Reinette, hpa@zytor.com,
	Annapurve, Vishal, Edgecombe, Rick P, bp@alien8.de,
	binbin.wu@linux.intel.com, x86@kernel.org

On Mon, 2025-07-07 at 15:15 +0800, Chao Gao wrote:
> On Sun, Jul 06, 2025 at 09:23:05PM -0700, Dave Hansen wrote:
> > On 7/6/25 20:16, Chao Gao wrote:
> > > Even on a CPU w/ SEAM_NR and w/o X86_BUG_TDX_PW_MCE, is there still a risk of
> > > poisoned memory being returned to the host kernel? Since only poison
> > > consumption causes #MCE, if a poisoned page is never consumed in SEAM non-root
> > > mode, there will be no #MCE, and the mentioned commit won't mark the page as
> > > poisoned.
> > > 
> > > A reclaimed poisoned page could be reused and potentially cause a kernel panic.
> > > While WBINVD could help, we believe it's not worth it as it will slow down the
> > > vast majority of cases. Is my understanding correct?
> > 
> > How is this any different from any other kind of hardware poison?
> 
> I wasn't arguing that MOVDIR64B should be kept. I was highlighting the risk of
> kernel panic on CPUs even without the partial write bug and guessing why it was
> not worth fixing.
> 
> Regarding your question, the poison likely occurs due to software bugs rather
> than hardware issues. And, as stated in the comment removed in patch 1, unlike
> other hardware poison, this poison can be cleared using MOVDIR64B.
> 
> > 
> > Why should this specific kind of freeing (TDX private memory being freed
> > back to the host) operation be different from any other kind of free?
> 
> To limit the impact of software bugs (e.g., TDX module bugs) to TDX guests
> rather than affecting the entire kernel. Debugging a TDX module bug that
> results in a #MCE in a random host context can be quite frustrating, right?
> But, on the other hand, MOVDIR64B incurs a 40% slowdown when shutting down a
> TD. So, It's a tradeoff between containing theoretical software bugs and
> experiencing a 40% slowdown.
> 
> Personally, I also prefer to remove MOVDIR64B, but I also want to point out the
> bug triage issue and the risk of kernel panic after removing MOVDIR64B.

If we are only talking about the poison due to TD-mismatch or integrity
failure, per the TDX spec the CPU only marks the memory as poisoned when it
actually (performs a read and) consumes the bad data in SEAM non-root mode, in
which case there will be a subsequent #MCE from SEAM non-root mode.

A TDX module bug which causes the module itself to accidentally write TDX
private memory using a different KeyID won't mark the memory as poisoned.  A
further read (due to a bug) from the host kernel using KeyID 0 won't poison
the memory either.

A TDX module bug which causes the module itself to accidentally read TDX
private memory using a different KeyID poisons that memory and causes an #MCE
immediately in SEAM, but this is fatal to the system, so no poisoned memory
will be returned to the kernel.

In other words, I think it shouldn't be possible that a poisoned page is
never consumed in SEAM non-root but later returned to the host kernel.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH V2 2/2] x86/tdx: Skip clearing reclaimed pages unless X86_BUG_TDX_PW_MCE is present
  2025-07-07  7:15       ` Chao Gao
  2025-07-07 11:39         ` Huang, Kai
@ 2025-07-07 14:32         ` Dave Hansen
  2025-07-07 18:15           ` Edgecombe, Rick P
  1 sibling, 1 reply; 21+ messages in thread
From: Dave Hansen @ 2025-07-07 14:32 UTC (permalink / raw)
  To: Chao Gao
  Cc: Adrian Hunter, Dave Hansen, pbonzini, seanjc, vannapurve,
	Tony Luck, Borislav Petkov, Thomas Gleixner, Ingo Molnar, x86,
	H Peter Anvin, linux-kernel, kvm, rick.p.edgecombe,
	kirill.shutemov, kai.huang, reinette.chatre, xiaoyao.li,
	tony.lindgren, binbin.wu, isaku.yamahata, yan.y.zhao

On 7/7/25 00:15, Chao Gao wrote:
>> Why should this specific kind of freeing (TDX private memory being freed
>> back to the host) operation be different from any other kind of free?
> To limit the impact of software bugs (e.g., TDX module bugs) to TDX guests
> rather than affecting the entire kernel.

It's one thing if the TDX module is so constantly buggy that we're
getting tons of kernel crash reports that we track back to the TDX module.

It's quite another thing to add kernel complexity to preemptively lessen
the chance of a theoretical TDX bug.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH V2 1/2] x86/tdx: Eliminate duplicate code in tdx_clear_page()
  2025-07-03 15:37 ` [PATCH V2 1/2] x86/tdx: Eliminate duplicate code in tdx_clear_page() Adrian Hunter
                     ` (3 preceding siblings ...)
  2025-07-07  2:08   ` Huang, Kai
@ 2025-07-07 17:31   ` Edgecombe, Rick P
  4 siblings, 0 replies; 21+ messages in thread
From: Edgecombe, Rick P @ 2025-07-07 17:31 UTC (permalink / raw)
  To: Annapurve, Vishal, pbonzini@redhat.com, Hunter, Adrian,
	seanjc@google.com, dave.hansen@linux.intel.com
  Cc: kvm@vger.kernel.org, Li, Xiaoyao, Huang, Kai, Zhao, Yan Y,
	Luck, Tony, tony.lindgren@linux.intel.com,
	binbin.wu@linux.intel.com, Chatre, Reinette,
	linux-kernel@vger.kernel.org, mingo@redhat.com, Yamahata, Isaku,
	kirill.shutemov@linux.intel.com, tglx@linutronix.de,
	hpa@zytor.com, bp@alien8.de, Gao, Chao, x86@kernel.org

On Thu, 2025-07-03 at 18:37 +0300, Adrian Hunter wrote:
> tdx_clear_page() and reset_tdx_pages() duplicate the TDX page clearing
> logic.  Rename reset_tdx_pages() to tdx_quirk_reset_paddr() and use it
> in place of tdx_clear_page().
> 

There is a tiny functional change: reset_tdx_pages() uses mb(), whereas
tdx_clear_page() uses __mb(). The former includes KCSAN instrumentation.
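
For reference, a rough sketch of the distinction (based on
include/asm-generic/barrier.h in recent kernels; the exact form may vary):

#ifdef __mb
#define mb()	do { kcsan_mb(); __mb(); } while (0)
#endif

i.e. mb() is __mb() plus the KCSAN annotation, so switching call sites from
__mb() to mb() picks up the instrumentation as well as the barrier.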

The log should call this out as an opportunistic change.

> Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
> ---
> 
> 
> Changes in V2:
> 
> 	Rename reset_tdx_pages() to tdx_quirk_reset_paddr()
> 	Call tdx_quirk_reset_paddr() directly
> 
> 
>  arch/x86/include/asm/tdx.h  |  2 ++
>  arch/x86/kvm/vmx/tdx.c      | 25 +++----------------------
>  arch/x86/virt/vmx/tdx/tdx.c |  5 +++--
>  3 files changed, 8 insertions(+), 24 deletions(-)
> 
> diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
> index 7ddef3a69866..f66328404724 100644
> --- a/arch/x86/include/asm/tdx.h
> +++ b/arch/x86/include/asm/tdx.h
> @@ -131,6 +131,8 @@ int tdx_guest_keyid_alloc(void);
>  u32 tdx_get_nr_guest_keyids(void);
>  void tdx_guest_keyid_free(unsigned int keyid);
>  
> +void tdx_quirk_reset_paddr(unsigned long base, unsigned long size);
> +
>  struct tdx_td {
>  	/* TD root structure: */
>  	struct page *tdr_page;
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index a08e7055d1db..031e36665757 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -276,25 +276,6 @@ static inline void tdx_disassociate_vp(struct kvm_vcpu *vcpu)
>  	vcpu->cpu = -1;
>  }
>  
> -static void tdx_clear_page(struct page *page)
> -{
> -	const void *zero_page = (const void *) page_to_virt(ZERO_PAGE(0));
> -	void *dest = page_to_virt(page);
> -	unsigned long i;
> -
> -	/*
> -	 * The page could have been poisoned.  MOVDIR64B also clears
> -	 * the poison bit so the kernel can safely use the page again.
> -	 */
> -	for (i = 0; i < PAGE_SIZE; i += 64)
> -		movdir64b(dest + i, zero_page);
> -	/*
> -	 * MOVDIR64B store uses WC buffer.  Prevent following memory reads
> -	 * from seeing potentially poisoned cache.
> -	 */
> -	__mb();
> -}
> -
>  static void tdx_no_vcpus_enter_start(struct kvm *kvm)
>  {
>  	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
> @@ -340,7 +321,7 @@ static int tdx_reclaim_page(struct page *page)
>  
>  	r = __tdx_reclaim_page(page);
>  	if (!r)
> -		tdx_clear_page(page);
> +		tdx_quirk_reset_paddr(page_to_phys(page), PAGE_SIZE);
>  	return r;
>  }
>  
> @@ -589,7 +570,7 @@ static void tdx_reclaim_td_control_pages(struct kvm *kvm)
>  		pr_tdx_error(TDH_PHYMEM_PAGE_WBINVD, err);
>  		return;
>  	}
> -	tdx_clear_page(kvm_tdx->td.tdr_page);
> +	tdx_quirk_reset_paddr(page_to_phys(kvm_tdx->td.tdr_page), PAGE_SIZE);
>  
>  	__free_page(kvm_tdx->td.tdr_page);
>  	kvm_tdx->td.tdr_page = NULL;
> @@ -1689,7 +1670,7 @@ static int tdx_sept_drop_private_spte(struct kvm *kvm, gfn_t gfn,
>  		pr_tdx_error(TDH_PHYMEM_PAGE_WBINVD, err);
>  		return -EIO;
>  	}
> -	tdx_clear_page(page);
> +	tdx_quirk_reset_paddr(page_to_phys(page), PAGE_SIZE);
>  	tdx_unpin(kvm, page);
>  	return 0;
>  }
> diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> index c7a9a087ccaf..14d93ed05bd2 100644
> --- a/arch/x86/virt/vmx/tdx/tdx.c
> +++ b/arch/x86/virt/vmx/tdx/tdx.c
> @@ -637,7 +637,7 @@ static int tdmrs_set_up_pamt_all(struct tdmr_info_list *tdmr_list,
>   * clear these pages.  Note this function doesn't flush cache of
>   * these TDX private pages.  The caller should make sure of that.
>   */


Full comment in the base code:

/*
 * Convert TDX private pages back to normal by using MOVDIR64B to
 * clear these pages.  Note this function doesn't flush cache of
 * these TDX private pages.  The caller should make sure of that.
 */

I'm not sure why the original comment says the caller should make sure to
flush the cache, and the "tdx quirk" framing makes it even less clear. Can we
update the comment too? Maybe say something about what quirk is being referred to?

Probably the log needs to explain why this function is called "quirk" too.
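
Just to illustrate the kind of wording I mean (a rough sketch only; the
erratum tie-in reflects what patch 2 does, not the current code):

/*
 * Reset TDX private pages with MOVDIR64B.  MOVDIR64B clears the TD Owner
 * bit and any poison, so the kernel can safely reuse the memory with
 * KeyID 0.  This is a quirk for platforms with the X86_BUG_TDX_PW_MCE
 * partial-write erratum.  Note this function does not flush the cache of
 * these pages; the caller must ensure that.
 */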

> -static void reset_tdx_pages(unsigned long base, unsigned long size)
> +void tdx_quirk_reset_paddr(unsigned long base, unsigned long size)
>  {
>  	const void *zero_page = (const void *)page_address(ZERO_PAGE(0));
>  	unsigned long phys, end;
> @@ -653,10 +653,11 @@ static void reset_tdx_pages(unsigned long base, unsigned long size)
>  	 */
>  	mb();
>  }
> +EXPORT_SYMBOL_GPL(tdx_quirk_reset_paddr);
>  
>  static void tdmr_reset_pamt(struct tdmr_info *tdmr)
>  {
> -	tdmr_do_pamt_func(tdmr, reset_tdx_pages);
> +	tdmr_do_pamt_func(tdmr, tdx_quirk_reset_paddr);
>  }
>  
>  static void tdmrs_reset_pamt_all(struct tdmr_info_list *tdmr_list)


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH V2 2/2] x86/tdx: Skip clearing reclaimed pages unless X86_BUG_TDX_PW_MCE is present
  2025-07-07 14:32         ` Dave Hansen
@ 2025-07-07 18:15           ` Edgecombe, Rick P
  0 siblings, 0 replies; 21+ messages in thread
From: Edgecombe, Rick P @ 2025-07-07 18:15 UTC (permalink / raw)
  To: Hansen, Dave, Gao, Chao
  Cc: kvm@vger.kernel.org, Li, Xiaoyao, Huang, Kai, Luck, Tony,
	dave.hansen@linux.intel.com, Zhao, Yan Y, Hunter, Adrian,
	linux-kernel@vger.kernel.org, Chatre, Reinette, seanjc@google.com,
	pbonzini@redhat.com, mingo@redhat.com, tglx@linutronix.de,
	kirill.shutemov@linux.intel.com, binbin.wu@linux.intel.com,
	Yamahata, Isaku, hpa@zytor.com, Annapurve, Vishal, bp@alien8.de,
	tony.lindgren@linux.intel.com, x86@kernel.org

On Mon, 2025-07-07 at 07:32 -0700, Dave Hansen wrote:
> On 7/7/25 00:15, Chao Gao wrote:
> > > Why should this specific kind of freeing (TDX private memory being freed
> > > back to the host) operation be different from any other kind of free?
> > To limit the impact of software bugs (e.g., TDX module bugs) to TDX guests
> > rather than affecting the entire kernel.
> 
> It's one thing if the TDX module is so constantly buggy that we're
> getting tons of kernel crash reports that we track back to the TDX module.

Even if this happens, I think it would be good to limit kernel-side safety code
to finding TDX module bugs, not working around them.

> 
> It's quite another thing to add kernel complexity to preemptively lessen
> the chance of a theoretical TDX bug.

And lessen the chance of catching the bug and fixing it in the TDX module.
Otherwise we develop a "works by accident" solution that causes crashes for
unknown reasons if anyone removes code.

This pattern of adding defensive protections against TDX module bugs came up in
the TDX huge pages patches as well. Let's make the type of reasoning here the
norm.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH V2 2/2] x86/tdx: Skip clearing reclaimed pages unless X86_BUG_TDX_PW_MCE is present
  2025-07-04  5:37     ` Adrian Hunter
@ 2025-07-07 23:31       ` Vishal Annapurve
  0 siblings, 0 replies; 21+ messages in thread
From: Vishal Annapurve @ 2025-07-07 23:31 UTC (permalink / raw)
  To: Adrian Hunter
  Cc: Dave Hansen, pbonzini, seanjc, Tony Luck, Borislav Petkov,
	Thomas Gleixner, Ingo Molnar, x86, H Peter Anvin, linux-kernel,
	kvm, rick.p.edgecombe, kirill.shutemov, kai.huang,
	reinette.chatre, xiaoyao.li, tony.lindgren, binbin.wu,
	isaku.yamahata, yan.y.zhao, chao.gao

On Thu, Jul 3, 2025 at 10:38 PM Adrian Hunter <adrian.hunter@intel.com> wrote:
>
> On 03/07/2025 20:06, Vishal Annapurve wrote:
> > On Thu, Jul 3, 2025 at 8:37 AM Adrian Hunter <adrian.hunter@intel.com> wrote:
> >>
> >> Avoid clearing reclaimed TDX private pages unless the platform is affected
> >> by the X86_BUG_TDX_PW_MCE erratum. This significantly reduces VM shutdown
> >> time on unaffected systems.
> >>
> >> Background
> >>
> >> KVM currently clears reclaimed TDX private pages using MOVDIR64B, which:
> >>
> >>    - Clears the TD Owner bit (which identifies TDX private memory) and
> >>      integrity metadata without triggering integrity violations.
> >>    - Clears poison from cache lines without consuming it, avoiding MCEs on
> >>      access (refer TDX Module Base spec. 16.5. Handling Machine Check
> >>      Events during Guest TD Operation).
> >>
> >> The TDX module also uses MOVDIR64B to initialize private pages before use.
> >> If cache flushing is needed, it sets TDX_FEATURES.CLFLUSH_BEFORE_ALLOC.
> >> However, KVM currently flushes unconditionally, refer commit 94c477a751c7b
> >> ("x86/virt/tdx: Add SEAMCALL wrappers to add TD private pages")
> >>
> >> In contrast, when private pages are reclaimed, the TDX Module handles
> >> flushing via the TDH.PHYMEM.CACHE.WB SEAMCALL.
> >>
> >> Problem
> >>
> >> Clearing all private pages during VM shutdown is costly. For guests
> >> with a large amount of memory it can take minutes.
> >>
> >> Solution
> >>
> >> TDX Module Base Architecture spec. documents that private pages reclaimed
> >> from a TD should be initialized using MOVDIR64B, in order to avoid
> >> integrity violation or TD bit mismatch detection when later being read
> >> using a shared HKID, refer April 2025 spec. "Page Initialization" in
> >> section "8.6.2. Platforms not Using ACT: Required Cache Flush and
> >> Initialization by the Host VMM"
> >>
> >> That is an overstatement and will be clarified in coming versions of the
> >> spec. In fact, as outlined in "Table 16.2: Non-ACT Platforms Checks on
> >> Memory" and "Table 16.3: Non-ACT Platforms Checks on Memory Reads in Li
> >> Mode" in the same spec, there is no issue accessing such reclaimed pages
> >> using a shared key that does not have integrity enabled. Linux always uses
> >> KeyID 0 which never has integrity enabled. KeyID 0 is also the TME KeyID
> >> which disallows integrity, refer "TME Policy/Encryption Algorithm" bit
> >> description in "Intel Architecture Memory Encryption Technologies" spec
> >> version 1.6 April 2025. So there is no need to clear pages to avoid
> >> integrity violations.
> >>
> >> There remains a risk of poison consumption. However, in the context of
> >> TDX, it is expected that there would be a machine check associated with the
> >> original poisoning. On some platforms that results in a panic. However
> >> platforms may support "SEAM_NR" Machine Check capability, in which case
> >> Linux machine check handler marks the page as poisoned, which prevents it
> >> from being allocated anymore, refer commit 7911f145de5fe ("x86/mce:
> >> Implement recovery for errors in TDX/SEAM non-root mode")
> >>
> >> Improvement
> >>
> >> By skipping the clearing step on unaffected platforms, shutdown time
> >> can improve by up to 40%.
> >
> > This patch looks good to me.
> >
> > I would like to raise a related topic, is there any requirement for
> > zeroing pages on conversion from private to shared before
> > userspace/guest faults in the gpa ranges as shared?
>
> For TDX, clearing must still be done for platforms with the
> partial-write errata (SPR and EMR).
>

So I take it that vmm/guest_memfd can safely assume no responsibility
for clearing contents on conversion outside of the X86_BUG_TDX_PW_MCE
scenario, given that the spec doesn't dictate initial contents of
converted memory and no guest/host software should depend on the
initial values after conversion.

> >
> > If the answer is no for all CoCo architectures then guest_memfd can
> > simply just zero pages on allocation for all it's users and not worry
> > about zeroing later.
>
> In fact TDX does not need private pages to be zeroed on allocation
> because the TDX Module always does that.
>

guest_memfd-allocated pages may get faulted in as shared first. To
keep things simple, guest_memfd can start with the "just zero on
allocation" policy, which works for all current/future CoCo/non-CoCo
users of guest_memfd, and we can later iterate with any arch-specific
optimizations as needed.

^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2025-07-07 23:32 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-07-03 15:37 [PATCH V2 0/2] x86/tdx: Skip clearing reclaimed pages unless X86_BUG_TDX_PW_MCE is present Adrian Hunter
2025-07-03 15:37 ` [PATCH V2 1/2] x86/tdx: Eliminate duplicate code in tdx_clear_page() Adrian Hunter
2025-07-03 16:34   ` Kirill A. Shutemov
2025-07-04  6:44   ` Binbin Wu
2025-07-04 15:33   ` Xiaoyao Li
2025-07-07  2:08   ` Huang, Kai
2025-07-07 17:31   ` Edgecombe, Rick P
2025-07-03 15:37 ` [PATCH V2 2/2] x86/tdx: Skip clearing reclaimed pages unless X86_BUG_TDX_PW_MCE is present Adrian Hunter
2025-07-03 16:44   ` Kirill A. Shutemov
2025-07-03 17:06   ` Vishal Annapurve
2025-07-04  5:37     ` Adrian Hunter
2025-07-07 23:31       ` Vishal Annapurve
2025-07-04  1:32   ` Xiaoyao Li
2025-07-04  6:44     ` Binbin Wu
2025-07-07  2:09   ` Huang, Kai
2025-07-07  3:16   ` Chao Gao
2025-07-07  4:23     ` Dave Hansen
2025-07-07  7:15       ` Chao Gao
2025-07-07 11:39         ` Huang, Kai
2025-07-07 14:32         ` Dave Hansen
2025-07-07 18:15           ` Edgecombe, Rick P

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).