[PATCH v8 0/7] TDX host: kexec/kdump support

public inbox for kvm@vger.kernel.org

* [PATCH v8 0/7] TDX host: kexec/kdump support
@ 2025-09-01 16:09 Paolo Bonzini
  2025-09-01 16:09 ` [PATCH 1/7] x86/kexec: Consolidate relocate_kernel() function parameters Paolo Bonzini
                   ` (7 more replies)
  0 siblings, 8 replies; 38+ messages in thread
From: Paolo Bonzini @ 2025-09-01 16:09 UTC
  To: linux-kernel, kvm
  Cc: dave.hansen, bp, tglx, peterz, mingo, hpa, thomas.lendacky, x86,
	kas, rick.p.edgecombe, dwmw, kai.huang, seanjc, reinette.chatre,
	isaku.yamahata, dan.j.williams, ashish.kalra, nik.borisov,
	chao.gao, sagis, farrah.chen

Currently, kexec support and the TDX host are mutually exclusive in
Kconfig.  This series adds kexec support to the TDX host so that both
can be enabled in Kconfig.

With this series, the user can kexec (including crash kdump) to the new
kernel at any time regardless of whether TDX has been enabled in the
first kernel.  One limitation is that if the first kernel has ever
enabled TDX, the second kernel cannot use TDX for now.  Lifting this
limitation is future work on my TODO list.

This series should go in through the tip tree.

Thanks,

Paolo

v7->v8: stub out the new code when kexec is not enabled in the kernel.
	Of course even the smallest code change is subject to bikeshedding,
	and I chose my preferred color for the bikeshed.  But it's pastel
	green and I'm sure you'll agree that it's beautiful.


Kai Huang (7):
  x86/kexec: Consolidate relocate_kernel() function parameters
  x86/sme: Use percpu boolean to control WBINVD during kexec
  x86/virt/tdx: Mark memory cache state incoherent when making SEAMCALL
  x86/kexec: Disable kexec/kdump on platforms with TDX partial write
    erratum
  x86/virt/tdx: Remove the !KEXEC_CORE dependency
  x86/virt/tdx: Update the kexec section in the TDX documentation
  KVM: TDX: Explicitly do WBINVD when no more TDX SEAMCALLs

 Documentation/arch/x86/tdx.rst       | 14 ++++-----
 arch/x86/Kconfig                     |  1 -
 arch/x86/include/asm/kexec.h         | 12 ++++++--
 arch/x86/include/asm/processor.h     |  2 ++
 arch/x86/include/asm/tdx.h           | 31 +++++++++++++++++++-
 arch/x86/kernel/cpu/amd.c            | 17 +++++++++++
 arch/x86/kernel/machine_kexec_64.c   | 44 ++++++++++++++++++++++------
 arch/x86/kernel/process.c            | 24 +++++++--------
 arch/x86/kernel/relocate_kernel_64.S | 36 +++++++++++++++--------
 arch/x86/kvm/vmx/tdx.c               | 10 +++++++
 arch/x86/virt/vmx/tdx/tdx.c          | 23 +++++++++++++--
 11 files changed, 167 insertions(+), 47 deletions(-)

-- 
2.51.0



* [PATCH 1/7] x86/kexec: Consolidate relocate_kernel() function parameters
  2025-09-01 16:09 [PATCH v8 0/7] TDX host: kexec/kdump support Paolo Bonzini
@ 2025-09-01 16:09 ` Paolo Bonzini
  2025-09-01 16:09 ` [PATCH 2/7] x86/sme: Use percpu boolean to control WBINVD during kexec Paolo Bonzini
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 38+ messages in thread
From: Paolo Bonzini @ 2025-09-01 16:09 UTC
  To: linux-kernel, kvm
  Cc: dave.hansen, bp, tglx, peterz, mingo, hpa, thomas.lendacky, x86,
	kas, rick.p.edgecombe, dwmw, kai.huang, seanjc, reinette.chatre,
	isaku.yamahata, dan.j.williams, ashish.kalra, nik.borisov,
	chao.gao, sagis, farrah.chen

From: Kai Huang <kai.huang@intel.com>

During kexec, the kernel jumps to the new kernel in relocate_kernel(),
which is implemented in assembly; 32-bit and 64-bit each have their
own version.

Currently, for both 32-bit and 64-bit, the last two parameters of
relocate_kernel() are both 'unsigned int', but each conveys only a
boolean, i.e., one bit of information.  A single 'unsigned int' has
enough space to carry two bits of information, so there is no need to
pass the two booleans in two separate 'unsigned int' parameters.

Consolidate the last two function parameters of relocate_kernel() into a
single 'unsigned int' and pass flags instead.

Only consolidate the 64-bit version, although a similar optimization can
be done for the 32-bit version too.  Don't bother changing the 32-bit
version while it is working (since an assembly code change is required).
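
As a rough illustration of the consolidation described above, the
following userspace sketch packs the two former boolean arguments into
one flags word and tests individual bits the way the assembly does with
"testb".  The BIT() macro is redefined locally as a stand-in for
<linux/bits.h>, and reloc_flags() and reloc_flag_set() are hypothetical
helper names used only for this example; the patch itself builds the
flags inline in machine_kexec().

```c
#include <assert.h>
#include <stdbool.h>

/* Userspace stand-in for the kernel's BIT() macro from <linux/bits.h>. */
#define BIT(n)	(1U << (n))

/* Same bit assignments as the definitions added to asm/kexec.h. */
#define RELOC_KERNEL_PRESERVE_CONTEXT		BIT(0)
#define RELOC_KERNEL_HOST_MEM_ENC_ACTIVE	BIT(1)

/* Hypothetical helper: pack the two former boolean arguments into one word. */
static unsigned int reloc_flags(bool preserve_context, bool host_mem_enc_active)
{
	unsigned int flags = 0;

	if (preserve_context)
		flags |= RELOC_KERNEL_PRESERVE_CONTEXT;
	if (host_mem_enc_active)
		flags |= RELOC_KERNEL_HOST_MEM_ENC_ACTIVE;

	return flags;
}

/* Hypothetical helper: C equivalent of "testb $FLAG, %r11b" plus jz/jnz. */
static bool reloc_flag_set(unsigned int flags, unsigned int flag)
{
	return (flags & flag) != 0;
}
```

Because each flag occupies its own bit, either one can be tested without
disturbing the other, which is what lets a single register replace the
former %rcx/%r8 pair.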

Signed-off-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 arch/x86/include/asm/kexec.h         | 12 ++++++++++--
 arch/x86/kernel/machine_kexec_64.c   | 22 +++++++++++++---------
 arch/x86/kernel/relocate_kernel_64.S | 25 +++++++++++++++----------
 3 files changed, 38 insertions(+), 21 deletions(-)

diff --git a/arch/x86/include/asm/kexec.h b/arch/x86/include/asm/kexec.h
index f2ad77929d6e..12cebbcdb6c8 100644
--- a/arch/x86/include/asm/kexec.h
+++ b/arch/x86/include/asm/kexec.h
@@ -13,6 +13,15 @@
 # define KEXEC_DEBUG_EXC_HANDLER_SIZE	6 /* PUSHI, PUSHI, 2-byte JMP */
 #endif
 
+#ifdef CONFIG_X86_64
+
+#include <linux/bits.h>
+
+#define RELOC_KERNEL_PRESERVE_CONTEXT		BIT(0)
+#define RELOC_KERNEL_HOST_MEM_ENC_ACTIVE	BIT(1)
+
+#endif
+
 # define KEXEC_CONTROL_PAGE_SIZE	4096
 # define KEXEC_CONTROL_CODE_MAX_SIZE	2048
 
@@ -121,8 +130,7 @@ typedef unsigned long
 relocate_kernel_fn(unsigned long indirection_page,
 		   unsigned long pa_control_page,
 		   unsigned long start_address,
-		   unsigned int preserve_context,
-		   unsigned int host_mem_enc_active);
+		   unsigned int flags);
 #endif
 extern relocate_kernel_fn relocate_kernel;
 #define ARCH_HAS_KIMAGE_ARCH
diff --git a/arch/x86/kernel/machine_kexec_64.c b/arch/x86/kernel/machine_kexec_64.c
index 8593760c255a..fdd04b5bb70e 100644
--- a/arch/x86/kernel/machine_kexec_64.c
+++ b/arch/x86/kernel/machine_kexec_64.c
@@ -384,16 +384,10 @@ void __nocfi machine_kexec(struct kimage *image)
 {
 	unsigned long reloc_start = (unsigned long)__relocate_kernel_start;
 	relocate_kernel_fn *relocate_kernel_ptr;
-	unsigned int host_mem_enc_active;
+	unsigned int relocate_kernel_flags;
 	int save_ftrace_enabled;
 	void *control_page;
 
-	/*
-	 * This must be done before load_segments() since if call depth tracking
-	 * is used then GS must be valid to make any function calls.
-	 */
-	host_mem_enc_active = cc_platform_has(CC_ATTR_HOST_MEM_ENCRYPT);
-
 #ifdef CONFIG_KEXEC_JUMP
 	if (image->preserve_context)
 		save_processor_state();
@@ -427,6 +421,17 @@ void __nocfi machine_kexec(struct kimage *image)
 	 */
 	relocate_kernel_ptr = control_page + (unsigned long)relocate_kernel - reloc_start;
 
+	relocate_kernel_flags = 0;
+	if (image->preserve_context)
+		relocate_kernel_flags |= RELOC_KERNEL_PRESERVE_CONTEXT;
+
+	/*
+	 * This must be done before load_segments() since if call depth tracking
+	 * is used then GS must be valid to make any function calls.
+	 */
+	if (cc_platform_has(CC_ATTR_HOST_MEM_ENCRYPT))
+		relocate_kernel_flags |= RELOC_KERNEL_HOST_MEM_ENC_ACTIVE;
+
 	/*
 	 * The segment registers are funny things, they have both a
 	 * visible and an invisible part.  Whenever the visible part is
@@ -443,8 +448,7 @@ void __nocfi machine_kexec(struct kimage *image)
 	image->start = relocate_kernel_ptr((unsigned long)image->head,
 					   virt_to_phys(control_page),
 					   image->start,
-					   image->preserve_context,
-					   host_mem_enc_active);
+					   relocate_kernel_flags);
 
 #ifdef CONFIG_KEXEC_JUMP
 	if (image->preserve_context)
diff --git a/arch/x86/kernel/relocate_kernel_64.S b/arch/x86/kernel/relocate_kernel_64.S
index ea604f4d0b52..26e945f85d19 100644
--- a/arch/x86/kernel/relocate_kernel_64.S
+++ b/arch/x86/kernel/relocate_kernel_64.S
@@ -66,8 +66,7 @@ SYM_CODE_START_NOALIGN(relocate_kernel)
 	 * %rdi indirection_page
 	 * %rsi pa_control_page
 	 * %rdx start address
-	 * %rcx preserve_context
-	 * %r8  host_mem_enc_active
+	 * %rcx flags: RELOC_KERNEL_*
 	 */
 
 	/* Save the CPU context, used for jumping back */
@@ -111,7 +110,7 @@ SYM_CODE_START_NOALIGN(relocate_kernel)
 	/* save indirection list for jumping back */
 	movq	%rdi, pa_backup_pages_map(%rip)
 
-	/* Save the preserve_context to %r11 as swap_pages clobbers %rcx. */
+	/* Save the flags to %r11 as swap_pages clobbers %rcx. */
 	movq	%rcx, %r11
 
 	/* setup a new stack at the end of the physical control page */
@@ -129,9 +128,8 @@ SYM_CODE_START_LOCAL_NOALIGN(identity_mapped)
 	/*
 	 * %rdi	indirection page
 	 * %rdx start address
-	 * %r8 host_mem_enc_active
 	 * %r9 page table page
-	 * %r11 preserve_context
+	 * %r11 flags: RELOC_KERNEL_*
 	 * %r13 original CR4 when relocate_kernel() was invoked
 	 */
 
@@ -204,7 +202,7 @@ SYM_CODE_START_LOCAL_NOALIGN(identity_mapped)
 	 * entries that will conflict with the now unencrypted memory
 	 * used by kexec. Flush the caches before copying the kernel.
 	 */
-	testq	%r8, %r8
+	testb	$RELOC_KERNEL_HOST_MEM_ENC_ACTIVE, %r11b
 	jz .Lsme_off
 	wbinvd
 .Lsme_off:
@@ -220,7 +218,7 @@ SYM_CODE_START_LOCAL_NOALIGN(identity_mapped)
 	movq	%cr3, %rax
 	movq	%rax, %cr3
 
-	testq	%r11, %r11	/* preserve_context */
+	testb	$RELOC_KERNEL_PRESERVE_CONTEXT, %r11b
 	jnz .Lrelocate
 
 	/*
@@ -273,7 +271,13 @@ SYM_CODE_START_LOCAL_NOALIGN(identity_mapped)
 	ANNOTATE_NOENDBR
 	andq	$PAGE_MASK, %r8
 	lea	PAGE_SIZE(%r8), %rsp
-	movl	$1, %r11d	/* Ensure preserve_context flag is set */
+	/*
+	 * Ensure RELOC_KERNEL_PRESERVE_CONTEXT flag is set so that
+	 * swap_pages() can swap pages correctly.  Note all other
+	 * RELOC_KERNEL_* flags passed to relocate_kernel() are not
+	 * restored.
+	 */
+	movl	$RELOC_KERNEL_PRESERVE_CONTEXT, %r11d
 	call	swap_pages
 	movq	kexec_va_control_page(%rip), %rax
 0:	addq	$virtual_mapped - 0b, %rax
@@ -321,7 +325,7 @@ SYM_CODE_START_LOCAL_NOALIGN(swap_pages)
 	UNWIND_HINT_END_OF_STACK
 	/*
 	 * %rdi indirection page
-	 * %r11 preserve_context
+	 * %r11 flags: RELOC_KERNEL_*
 	 */
 	movq	%rdi, %rcx	/* Put the indirection_page in %rcx */
 	xorl	%edi, %edi
@@ -357,7 +361,8 @@ SYM_CODE_START_LOCAL_NOALIGN(swap_pages)
 	movq	%rdi, %rdx    /* Save destination page to %rdx */
 	movq	%rsi, %rax    /* Save source page to %rax */
 
-	testq	%r11, %r11    /* Only actually swap for ::preserve_context */
+	/* Only actually swap for ::preserve_context */
+	testb	$RELOC_KERNEL_PRESERVE_CONTEXT, %r11b
 	jz	.Lnoswap
 
 	/* copy source page to swap page */
-- 
2.51.0



* [PATCH 2/7] x86/sme: Use percpu boolean to control WBINVD during kexec
  2025-09-01 16:09 [PATCH v8 0/7] TDX host: kexec/kdump support Paolo Bonzini
  2025-09-01 16:09 ` [PATCH 1/7] x86/kexec: Consolidate relocate_kernel() function parameters Paolo Bonzini
@ 2025-09-01 16:09 ` Paolo Bonzini
  2025-09-01 16:09 ` [PATCH 3/7] x86/virt/tdx: Mark memory cache state incoherent when making SEAMCALL Paolo Bonzini
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 38+ messages in thread
From: Paolo Bonzini @ 2025-09-01 16:09 UTC
  To: linux-kernel, kvm
  Cc: dave.hansen, bp, tglx, peterz, mingo, hpa, thomas.lendacky, x86,
	kas, rick.p.edgecombe, dwmw, kai.huang, seanjc, reinette.chatre,
	isaku.yamahata, dan.j.williams, ashish.kalra, nik.borisov,
	chao.gao, sagis, farrah.chen

From: Kai Huang <kai.huang@intel.com>

TL;DR:

Prepare to unify how TDX and SME do cache flushing during kexec by
making a percpu boolean control whether to do the WBINVD.

-- Background --

On SME platforms, dirty cacheline aliases with and without the encryption
bit can coexist, and the CPU can flush them back to memory in random
order.  During kexec, the caches must be flushed before jumping to the
new kernel; otherwise the dirty cachelines could silently corrupt the
memory used by the new kernel because of their different encryption
properties.

TDX also needs a cache flush during kexec for the same reason.  It would
be good to have a generic way to flush the cache instead of scattering
checks for each feature all around.

When SME is enabled, the kernel basically encrypts all memory including
the kernel itself and a simple memory write from the kernel could dirty
cachelines.  Currently, the kernel uses WBINVD to flush the cache for
SME during kexec in two places:

1) the one in stop_this_cpu() for all remote CPUs when the kexec-ing CPU
   stops them;
2) the one in relocate_kernel() where the kexec-ing CPU jumps to the
   new kernel.

-- Solution --

Unlike SME, TDX can only dirty cachelines when it is used (i.e., when
SEAMCALLs are performed).  Since there are no more SEAMCALLs after the
aforementioned WBINVDs, reuse those WBINVDs for TDX.

To unify the approach for SME and TDX, use a percpu boolean to indicate
the cache may be in an incoherent state and needs flushing during kexec,
and set the boolean for SME.  TDX can then leverage it.

While SME could use a global flag (since it's enabled at early boot and
enabled on all CPUs), the percpu flag fits TDX better:

The percpu flag can be set when a CPU makes a SEAMCALL, and cleared when
another WBINVD on the CPU obviates the need for a kexec-time WBINVD.
Saving kexec-time WBINVD is valuable, because there is an existing
race[*] where kexec could proceed while another CPU is active.  WBINVD
could make this race worse, so it's worth skipping it when possible.
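
The percpu scheme described above can be modelled in userspace roughly
as follows.  This is only a sketch under stated assumptions: the percpu
variable is approximated by a plain array indexed by CPU number, WBINVD
is replaced by a counter, and mark_cache_incoherent(), early_flush() and
kexec_stop_cpu() are hypothetical names, not functions from this series.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS	4	/* arbitrary size for this userspace model */

/* Models the per-CPU cache_state_incoherent flag as a plain array. */
static bool cache_state_incoherent[NR_CPUS];

/* Counts simulated WBINVD executions instead of flushing anything. */
static unsigned int wbinvd_count;

/* Hypothetical: a CPU is about to do something that may dirty aliased lines. */
static void mark_cache_incoherent(int cpu)
{
	cache_state_incoherent[cpu] = true;
}

/*
 * Hypothetical: a WBINVD done for some other reason before kexec.
 * Clearing the flag afterwards is what lets kexec skip its own flush.
 */
static void early_flush(int cpu)
{
	wbinvd_count++;
	cache_state_incoherent[cpu] = false;
}

/* Models the kexec-time check: flush only on CPUs that are still flagged. */
static void kexec_stop_cpu(int cpu)
{
	if (cache_state_incoherent[cpu])
		wbinvd_count++;
}
```

In this model a CPU that never set the flag, or that already flushed and
cleared it, contributes no WBINVD at kexec time, which is the saving the
paragraph above argues for.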

-- Side effect to SME --

Today the first WBINVD in stop_this_cpu() is performed when SME is
*supported* by the platform, and the second WBINVD is done in
relocate_kernel() when SME is *activated* by the kernel.  Simplify this
by also doing the second WBINVD whenever the platform supports SME.
This allows the kernel to simply turn on the percpu boolean when
bringing up a CPU by checking whether the platform supports SME.

No other functional change intended.

[*] The aforementioned race:

During kexec, native_stop_other_cpus() is called to stop all remote CPUs
before jumping to the new kernel.  native_stop_other_cpus() first sends
normal REBOOT vector IPIs to stop remote CPUs and waits for them to
stop.  If that times out, it sends NMIs to stop the CPUs that are still
alive.  The race happens when native_stop_other_cpus() has to send NMIs
and could potentially result in a system hang (for more information
please see [1]).

Link: https://lore.kernel.org/kvm/b963fcd60abe26c7ec5dc20b42f1a2ebbcc72397.1750934177.git.kai.huang@intel.com/ [1]
Signed-off-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Tested-by: Tom Lendacky <thomas.lendacky@amd.com>
Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 arch/x86/include/asm/kexec.h         |  4 ++--
 arch/x86/include/asm/processor.h     |  2 ++
 arch/x86/kernel/cpu/amd.c            | 17 +++++++++++++++++
 arch/x86/kernel/machine_kexec_64.c   | 14 ++++++++++----
 arch/x86/kernel/process.c            | 24 +++++++++++-------------
 arch/x86/kernel/relocate_kernel_64.S | 13 ++++++++++---
 6 files changed, 52 insertions(+), 22 deletions(-)

diff --git a/arch/x86/include/asm/kexec.h b/arch/x86/include/asm/kexec.h
index 12cebbcdb6c8..5cfb27f26583 100644
--- a/arch/x86/include/asm/kexec.h
+++ b/arch/x86/include/asm/kexec.h
@@ -17,8 +17,8 @@
 
 #include <linux/bits.h>
 
-#define RELOC_KERNEL_PRESERVE_CONTEXT		BIT(0)
-#define RELOC_KERNEL_HOST_MEM_ENC_ACTIVE	BIT(1)
+#define RELOC_KERNEL_PRESERVE_CONTEXT	BIT(0)
+#define RELOC_KERNEL_CACHE_INCOHERENT	BIT(1)
 
 #endif
 
diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index bde58f6510ac..a24c7805acdb 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -731,6 +731,8 @@ void __noreturn stop_this_cpu(void *dummy);
 void microcode_check(struct cpuinfo_x86 *prev_info);
 void store_cpu_caps(struct cpuinfo_x86 *info);
 
+DECLARE_PER_CPU(bool, cache_state_incoherent);
+
 enum l1tf_mitigations {
 	L1TF_MITIGATION_OFF,
 	L1TF_MITIGATION_AUTO,
diff --git a/arch/x86/kernel/cpu/amd.c b/arch/x86/kernel/cpu/amd.c
index a6f88ca1a6b4..5398db4dedb4 100644
--- a/arch/x86/kernel/cpu/amd.c
+++ b/arch/x86/kernel/cpu/amd.c
@@ -545,6 +545,23 @@ static void early_detect_mem_encrypt(struct cpuinfo_x86 *c)
 {
 	u64 msr;
 
+	/*
+	 * Mark using WBINVD is needed during kexec on processors that
+	 * support SME. This provides support for performing a successful
+	 * kexec when going from SME inactive to SME active (or vice-versa).
+	 *
+	 * The cache must be cleared so that if there are entries with the
+	 * same physical address, both with and without the encryption bit,
+	 * they don't race each other when flushed and potentially end up
+	 * with the wrong entry being committed to memory.
+	 *
+	 * Test the CPUID bit directly because with mem_encrypt=off the
+	 * BSP will clear the X86_FEATURE_SME bit and the APs will not
+	 * see it set after that.
+	 */
+	if (c->extended_cpuid_level >= 0x8000001f && (cpuid_eax(0x8000001f) & BIT(0)))
+		__this_cpu_write(cache_state_incoherent, true);
+
 	/*
 	 * BIOS support is required for SME and SEV.
 	 *   For SME: If BIOS has enabled SME then adjust x86_phys_bits by
diff --git a/arch/x86/kernel/machine_kexec_64.c b/arch/x86/kernel/machine_kexec_64.c
index fdd04b5bb70e..34c303a92eaf 100644
--- a/arch/x86/kernel/machine_kexec_64.c
+++ b/arch/x86/kernel/machine_kexec_64.c
@@ -29,6 +29,7 @@
 #include <asm/set_memory.h>
 #include <asm/cpu.h>
 #include <asm/efi.h>
+#include <asm/processor.h>
 
 #ifdef CONFIG_ACPI
 /*
@@ -426,11 +427,11 @@ void __nocfi machine_kexec(struct kimage *image)
 		relocate_kernel_flags |= RELOC_KERNEL_PRESERVE_CONTEXT;
 
 	/*
-	 * This must be done before load_segments() since if call depth tracking
-	 * is used then GS must be valid to make any function calls.
+	 * This must be done before load_segments() since it resets
+	 * GS to 0 and percpu data needs the correct GS to work.
 	 */
-	if (cc_platform_has(CC_ATTR_HOST_MEM_ENCRYPT))
-		relocate_kernel_flags |= RELOC_KERNEL_HOST_MEM_ENC_ACTIVE;
+	if (this_cpu_read(cache_state_incoherent))
+		relocate_kernel_flags |= RELOC_KERNEL_CACHE_INCOHERENT;
 
 	/*
 	 * The segment registers are funny things, they have both a
@@ -441,6 +442,11 @@ void __nocfi machine_kexec(struct kimage *image)
 	 *
 	 * Take advantage of this here by force loading the segments,
 	 * before the GDT is zapped with an invalid value.
+	 *
+	 * load_segments() resets GS to 0.  Don't make any function call
+	 * after here since call depth tracking uses percpu variables to
+	 * operate (relocate_kernel() is explicitly ignored by call depth
+	 * tracking).
 	 */
 	load_segments();
 
diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index 1b7960cf6eb0..f2bbbeef5477 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -88,6 +88,16 @@ EXPORT_PER_CPU_SYMBOL(cpu_tss_rw);
 DEFINE_PER_CPU(bool, __tss_limit_invalid);
 EXPORT_PER_CPU_SYMBOL_GPL(__tss_limit_invalid);
 
+/*
+ * The cache may be in an incoherent state and needs flushing during kexec.
+ * E.g., on SME/TDX platforms, dirty cacheline aliases with and without
+ * encryption bit(s) can coexist and the cache needs to be flushed before
+ * booting to the new kernel to avoid the silent memory corruption due to
+ * dirty cachelines with different encryption property being written back
+ * to the memory.
+ */
+DEFINE_PER_CPU(bool, cache_state_incoherent);
+
 /*
  * this gets called so that we can store lazy state into memory and copy the
  * current task into the new thread.
@@ -827,19 +837,7 @@ void __noreturn stop_this_cpu(void *dummy)
 	disable_local_APIC();
 	mcheck_cpu_clear(c);
 
-	/*
-	 * Use wbinvd on processors that support SME. This provides support
-	 * for performing a successful kexec when going from SME inactive
-	 * to SME active (or vice-versa). The cache must be cleared so that
-	 * if there are entries with the same physical address, both with and
-	 * without the encryption bit, they don't race each other when flushed
-	 * and potentially end up with the wrong entry being committed to
-	 * memory.
-	 *
-	 * Test the CPUID bit directly because the machine might've cleared
-	 * X86_FEATURE_SME due to cmdline options.
-	 */
-	if (c->extended_cpuid_level >= 0x8000001f && (cpuid_eax(0x8000001f) & BIT(0)))
+	if (this_cpu_read(cache_state_incoherent))
 		wbinvd();
 
 	/*
diff --git a/arch/x86/kernel/relocate_kernel_64.S b/arch/x86/kernel/relocate_kernel_64.S
index 26e945f85d19..11e20bb13aca 100644
--- a/arch/x86/kernel/relocate_kernel_64.S
+++ b/arch/x86/kernel/relocate_kernel_64.S
@@ -198,14 +198,21 @@ SYM_CODE_START_LOCAL_NOALIGN(identity_mapped)
 	movq	%r9, %cr3
 
 	/*
+	 * If the memory cache is in incoherent state, e.g., due to
+	 * memory encryption, do WBINVD to flush cache.
+	 *
 	 * If SME is active, there could be old encrypted cache line
 	 * entries that will conflict with the now unencrypted memory
 	 * used by kexec. Flush the caches before copying the kernel.
+	 *
+	 * Note SME sets this flag to true when the platform supports
+	 * SME, so the WBINVD is performed even SME is not activated
+	 * by the kernel.  But this has no harm.
 	 */
-	testb	$RELOC_KERNEL_HOST_MEM_ENC_ACTIVE, %r11b
-	jz .Lsme_off
+	testb	$RELOC_KERNEL_CACHE_INCOHERENT, %r11b
+	jz .Lnowbinvd
 	wbinvd
-.Lsme_off:
+.Lnowbinvd:
 
 	call	swap_pages
 
-- 
2.51.0



* [PATCH 3/7] x86/virt/tdx: Mark memory cache state incoherent when making SEAMCALL
  2025-09-01 16:09 [PATCH v8 0/7] TDX host: kexec/kdump support Paolo Bonzini
  2025-09-01 16:09 ` [PATCH 1/7] x86/kexec: Consolidate relocate_kernel() function parameters Paolo Bonzini
  2025-09-01 16:09 ` [PATCH 2/7] x86/sme: Use percpu boolean to control WBINVD during kexec Paolo Bonzini
@ 2025-09-01 16:09 ` Paolo Bonzini
  2025-09-01 16:09 ` [PATCH 4/7] x86/kexec: Disable kexec/kdump on platforms with TDX partial write erratum Paolo Bonzini
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 38+ messages in thread
From: Paolo Bonzini @ 2025-09-01 16:09 UTC
  To: linux-kernel, kvm
  Cc: dave.hansen, bp, tglx, peterz, mingo, hpa, thomas.lendacky, x86,
	kas, rick.p.edgecombe, dwmw, kai.huang, seanjc, reinette.chatre,
	isaku.yamahata, dan.j.williams, ashish.kalra, nik.borisov,
	chao.gao, sagis, farrah.chen

From: Kai Huang <kai.huang@intel.com>

On TDX platforms, dirty cacheline aliases with and without encryption
bits can coexist, and the CPU can flush them back to memory in random
order.  During kexec, the caches must be flushed before jumping to the
new kernel; otherwise the dirty cachelines could silently corrupt the
memory used by the new kernel because of their different encryption
properties.

A percpu boolean is used to mark whether the cache of a given CPU may be
in an incoherent state, and kexec performs WBINVD on the CPUs with
that boolean turned on.

For TDX, only the TDX module or the TDX guests can generate dirty
cachelines of TDX private memory, i.e., they are only generated when the
kernel does a SEAMCALL.

Set that boolean when the kernel does SEAMCALL so that kexec can flush
the cache correctly.

The kernel provides both the __seamcall*() assembly functions and the
seamcall*() wrappers, which additionally retry in a loop when the
SEAMCALL fails for lack of entropy.  Most SEAMCALLs are made through
seamcall*(), except TDH.VP.ENTER and TDH.PHYMEM.PAGE.RDMD, which use
the __seamcall*() variants directly.

To cover the two special cases, add a new __seamcall_dirty_cache()
helper which only sets the percpu boolean and calls the __seamcall*(),
and change the special cases to use the new helper.  To cover all other
SEAMCALLs, change seamcall*() to call the new helper.

SEAMCALLs invoked via seamcall*() can be made from both task context
and IRQ-disabled context.  Given that a SEAMCALL is just a lengthy
instruction (e.g., thousands of cycles) from the kernel's point of view
and preempt_{disable|enable}() is cheap by comparison, just
unconditionally disable preemption while setting the boolean and making
the SEAMCALL.
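
The interaction between the retry loop, the preemption bracket and the
percpu flag can be sketched in userspace as below.  Everything here is a
model rather than kernel code: the error value and retry budget are
made-up constants, preemption is approximated by a counter, the percpu
flag by a single global, and fake_seamcall() is a hypothetical stand-in
that reports "no entropy" twice before succeeding.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative values only; the real constants live in the kernel's TDX headers. */
#define MODEL_RND_NO_ENTROPY	0xdeadULL
#define MODEL_RETRY_MAX		10

static bool cache_state_incoherent;	/* single-CPU stand-in for the percpu flag */
static int preempt_depth;		/* stand-in for the preemption count */

typedef uint64_t (*sc_func_t)(uint64_t fn);

/* Models __seamcall_dirty_cache(): set the flag, then make the call. */
static uint64_t seamcall_dirty_cache_model(sc_func_t func, uint64_t fn)
{
	assert(preempt_depth > 0);	/* caller must have disabled preemption */
	cache_state_incoherent = true;	/* set before the call, not after it */
	return func(fn);
}

/* Models sc_retry(): bracket each attempt, retry while entropy is short. */
static uint64_t sc_retry_model(sc_func_t func, uint64_t fn)
{
	int retry = MODEL_RETRY_MAX;
	uint64_t ret;

	do {
		preempt_depth++;	/* models preempt_disable() */
		ret = seamcall_dirty_cache_model(func, fn);
		preempt_depth--;	/* models preempt_enable() */
	} while (ret == MODEL_RND_NO_ENTROPY && --retry);

	return ret;
}

/* Fake SEAMCALL: reports "no entropy" twice, then succeeds with 0. */
static int fake_calls;

static uint64_t fake_seamcall(uint64_t fn)
{
	(void)fn;
	return ++fake_calls <= 2 ? MODEL_RND_NO_ENTROPY : 0;
}
```

Bracketing each attempt separately, rather than the whole loop, keeps
the preemption-disabled window no longer than a single call, and setting
the flag on the same CPU that then makes the call is the reason the
bracket is needed at all.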

Signed-off-by: Kai Huang <kai.huang@intel.com>
Tested-by: Farrah Chen <farrah.chen@intel.com>
Reviewed-by: Chao Gao <chao.gao@intel.com>
Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 arch/x86/include/asm/tdx.h  | 25 ++++++++++++++++++++++++-
 arch/x86/virt/vmx/tdx/tdx.c |  4 ++--
 2 files changed, 26 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index 57b46f05ff97..c178360c1fb1 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -102,10 +102,31 @@ u64 __seamcall_ret(u64 fn, struct tdx_module_args *args);
 u64 __seamcall_saved_ret(u64 fn, struct tdx_module_args *args);
 void tdx_init(void);

+#include <linux/preempt.h>
 #include <asm/archrandom.h>
+#include <asm/processor.h>

 typedef u64 (*sc_func_t)(u64 fn, struct tdx_module_args *args);

+static __always_inline u64 __seamcall_dirty_cache(sc_func_t func, u64 fn,
+						  struct tdx_module_args *args)
+{
+	lockdep_assert_preemption_disabled();
+
+	/*
+	 * SEAMCALLs are made to the TDX module and can generate dirty
+	 * cachelines of TDX private memory.  Mark cache state incoherent
+	 * so that the cache can be flushed during kexec.
+	 *
+	 * This needs to be done before actually making the SEAMCALL,
+	 * because kexec-ing CPU could send NMI to stop remote CPUs,
+	 * in which case even disabling IRQ won't help here.
+	 */
+	this_cpu_write(cache_state_incoherent, true);
+
+	return func(fn, args);
+}
+
 static __always_inline u64 sc_retry(sc_func_t func, u64 fn,
 			   struct tdx_module_args *args)
 {
@@ -113,7 +134,9 @@ static __always_inline u64 sc_retry(sc_func_t func, u64 fn,
 	u64 ret;

 	do {
-		ret = func(fn, args);
+		preempt_disable();
+		ret = __seamcall_dirty_cache(func, fn, args);
+		preempt_enable();
 	} while (ret == TDX_RND_NO_ENTROPY && --retry);

 	return ret;
diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index 823850399bb7..2abf53ed59c8 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -1268,7 +1268,7 @@ static bool paddr_is_tdx_private(unsigned long phys)
 		return false;

 	/* Get page type from the TDX module */
-	sret = __seamcall_ret(TDH_PHYMEM_PAGE_RDMD, &args);
+	sret = __seamcall_dirty_cache(__seamcall_ret, TDH_PHYMEM_PAGE_RDMD, &args);

 	/*
 	 * The SEAMCALL will not return success unless there is a
@@ -1524,7 +1524,7 @@ noinstr __flatten u64 tdh_vp_enter(struct tdx_vp *td, struct tdx_module_args *ar
 {
 	args->rcx = tdx_tdvpr_pa(td);

-	return __seamcall_saved_ret(TDH_VP_ENTER, args);
+	return __seamcall_dirty_cache(__seamcall_saved_ret, TDH_VP_ENTER, args);
 }
 EXPORT_SYMBOL_GPL(tdh_vp_enter);

-- 
2.51.0


* [PATCH 4/7] x86/kexec: Disable kexec/kdump on platforms with TDX partial write erratum
  2025-09-01 16:09 [PATCH v8 0/7] TDX host: kexec/kdump support Paolo Bonzini
                   ` (2 preceding siblings ...)
  2025-09-01 16:09 ` [PATCH 3/7] x86/virt/tdx: Mark memory cache state incoherent when making SEAMCALL Paolo Bonzini
@ 2025-09-01 16:09 ` Paolo Bonzini
  2025-09-30  1:38   ` Vishal Annapurve
  2025-10-26 23:33   ` Vishal Annapurve
  2025-09-01 16:09 ` [PATCH 5/7] x86/virt/tdx: Remove the !KEXEC_CORE dependency Paolo Bonzini
                   ` (3 subsequent siblings)
  7 siblings, 2 replies; 38+ messages in thread
From: Paolo Bonzini @ 2025-09-01 16:09 UTC
  To: linux-kernel, kvm
  Cc: dave.hansen, bp, tglx, peterz, mingo, hpa, thomas.lendacky, x86,
	kas, rick.p.edgecombe, dwmw, kai.huang, seanjc, reinette.chatre,
	isaku.yamahata, dan.j.williams, ashish.kalra, nik.borisov,
	chao.gao, sagis, farrah.chen, Binbin Wu

From: Kai Huang <kai.huang@intel.com>

Some early TDX-capable platforms have an erratum: A kernel partial
write (a write transaction of less than a cacheline landing at the
memory controller) to TDX private memory poisons that memory, and a
subsequent read triggers a machine check.

On those platforms, the old kernel must reset TDX private memory before
jumping to the new kernel; otherwise the new kernel may see unexpected
machine checks.  Currently the kernel doesn't track which pages are TDX
private pages.  For simplicity, just fail kexec/kdump on those platforms.

Leverage the existing machine_kexec_prepare() to fail kexec/kdump by
adding the check of the presence of the TDX erratum (which is only
checked for if the kernel is built with TDX host support).  This rejects
kexec/kdump when the kernel is loading the kexec/kdump kernel image.

The alternative is to reject kexec/kdump when the kernel is jumping to
the new kernel.  But for kexec this requires adding a new check (e.g.,
arch_kexec_allowed()) in the common code to fail kernel_kexec() at an
early stage.  Kdump (crash_kexec()) needs a similar check, but it's hard
to justify because crash_kexec() is not supposed to abort.

It's feasible to further relax this limitation, i.e., only fail kexec
when TDX is actually enabled by the kernel.  But this is still a half
measure compared to resetting TDX private memory so just do the simplest
thing for now.

The impact on userspace is that users will get an error when loading the
kexec/kdump kernel image:

  kexec_load failed: Operation not supported

This might be confusing to users, so also print the reason in dmesg:

  [..] kexec: Not allowed on platform with tdx_pw_mce bug.
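
A minimal userspace model of this load-time rejection is shown below.
It is a sketch under stated assumptions: platform_has_tdx_pw_mce_bug is
a hypothetical variable standing in for the kernel's CPU bug check,
kexec_prepare_model() is a hypothetical stand-in for the prepare hook,
and stderr stands in for dmesg.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-in for the kernel's check of the TDX erratum bug bit. */
static bool platform_has_tdx_pw_mce_bug;

/*
 * Hypothetical model of the prepare step: refuse to load an image on an
 * affected platform and report why, otherwise carry on.
 */
static int kexec_prepare_model(void)
{
	if (platform_has_tdx_pw_mce_bug) {
		fprintf(stderr, "kexec: Not allowed on platform with tdx_pw_mce bug\n");
		return -EOPNOTSUPP;
	}

	return 0;	/* the real function continues with page table setup */
}
```

Failing at load time means the error surfaces while userspace can still
react to it, instead of at the moment the system tries to jump to the
new kernel.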

Signed-off-by: Kai Huang <kai.huang@intel.com>
Tested-by: Farrah Chen <farrah.chen@intel.com>
Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 arch/x86/kernel/machine_kexec_64.c | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)

diff --git a/arch/x86/kernel/machine_kexec_64.c b/arch/x86/kernel/machine_kexec_64.c
index 34c303a92eaf..201137b98fb8 100644
--- a/arch/x86/kernel/machine_kexec_64.c
+++ b/arch/x86/kernel/machine_kexec_64.c
@@ -347,6 +347,22 @@ int machine_kexec_prepare(struct kimage *image)
 	unsigned long reloc_end = (unsigned long)__relocate_kernel_end;
 	int result;
 
+	/*
+	 * Some early TDX-capable platforms have an erratum.  A kernel
+	 * partial write (a write transaction of less than cacheline
+	 * lands at memory controller) to TDX private memory poisons that
+	 * memory, and a subsequent read triggers a machine check.
+	 *
+	 * On those platforms the old kernel must reset TDX private
+	 * memory before jumping to the new kernel otherwise the new
+	 * kernel may see unexpected machine check.  For simplicity
+	 * just fail kexec/kdump on those platforms.
+	 */
+	if (boot_cpu_has_bug(X86_BUG_TDX_PW_MCE)) {
+		pr_info_once("Not allowed on platform with tdx_pw_mce bug\n");
+		return -EOPNOTSUPP;
+	}
+
 	/* Setup the identity mapped 64bit page table */
 	result = init_pgtable(image, __pa(control_page));
 	if (result)
-- 
2.51.0



* [PATCH 5/7] x86/virt/tdx: Remove the !KEXEC_CORE dependency
  2025-09-01 16:09 [PATCH v8 0/7] TDX host: kexec/kdump support Paolo Bonzini
                   ` (3 preceding siblings ...)
  2025-09-01 16:09 ` [PATCH 4/7] x86/kexec: Disable kexec/kdump on platforms with TDX partial write erratum Paolo Bonzini
@ 2025-09-01 16:09 ` Paolo Bonzini
  2025-09-01 16:09 ` [PATCH 6/7] x86/virt/tdx: Update the kexec section in the TDX documentation Paolo Bonzini
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 38+ messages in thread
From: Paolo Bonzini @ 2025-09-01 16:09 UTC
  To: linux-kernel, kvm
  Cc: dave.hansen, bp, tglx, peterz, mingo, hpa, thomas.lendacky, x86,
	kas, rick.p.edgecombe, dwmw, kai.huang, seanjc, reinette.chatre,
	isaku.yamahata, dan.j.williams, ashish.kalra, nik.borisov,
	chao.gao, sagis, farrah.chen

From: Kai Huang <kai.huang@intel.com>

During kexec, it is now guaranteed that all dirty cachelines of TDX
private memory are flushed before jumping to the new kernel.  The TDX
private memory from the old kernel will remain TDX private memory in
the new kernel, but this is OK because kernel reads and writes to TDX
private memory never cause a machine check, except on platforms with
the TDX partial write erratum, which has already been handled.

It is safe to allow kexec to work together with TDX now.  Remove the
!KEXEC_CORE dependency.

Signed-off-by: Kai Huang <kai.huang@intel.com>
Tested-by: Farrah Chen <farrah.chen@intel.com>
Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 arch/x86/Kconfig | 1 -
 1 file changed, 1 deletion(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 348a193a3ede..217982814bd7 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1895,7 +1895,6 @@ config INTEL_TDX_HOST
 	depends on X86_X2APIC
 	select ARCH_KEEP_MEMBLOCK
 	depends on CONTIG_ALLOC
-	depends on !KEXEC_CORE
 	depends on X86_MCE
 	help
 	  Intel Trust Domain Extensions (TDX) protects guest VMs from malicious
-- 
2.51.0



* [PATCH 6/7] x86/virt/tdx: Update the kexec section in the TDX documentation
  2025-09-01 16:09 [PATCH v8 0/7] TDX host: kexec/kdump support Paolo Bonzini
                   ` (4 preceding siblings ...)
  2025-09-01 16:09 ` [PATCH 5/7] x86/virt/tdx: Remove the !KEXEC_CORE dependency Paolo Bonzini
@ 2025-09-01 16:09 ` Paolo Bonzini
  2025-09-01 16:09 ` [PATCH 7/7] KVM: TDX: Explicitly do WBINVD when no more TDX SEAMCALLs Paolo Bonzini
  2025-10-03 13:09 ` [PATCH v8 0/7] TDX host: kexec/kdump support Paolo Bonzini
  7 siblings, 0 replies; 38+ messages in thread
From: Paolo Bonzini @ 2025-09-01 16:09 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: dave.hansen, bp, tglx, peterz, mingo, hpa, thomas.lendacky, x86,
	kas, rick.p.edgecombe, dwmw, kai.huang, seanjc, reinette.chatre,
	isaku.yamahata, dan.j.williams, ashish.kalra, nik.borisov,
	chao.gao, sagis, farrah.chen

From: Kai Huang <kai.huang@intel.com>

TDX host kernel now supports kexec/kdump.  Update the documentation to
reflect that.

Opportunistically, remove the parentheses in "Kexec()" and move this
section under the "Erratum" section because the updated "Kexec" section
now refers to that erratum.

Signed-off-by: Kai Huang <kai.huang@intel.com>
Tested-by: Farrah Chen <farrah.chen@intel.com>
Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 Documentation/arch/x86/tdx.rst | 14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/Documentation/arch/x86/tdx.rst b/Documentation/arch/x86/tdx.rst
index 719043cd8b46..61670e7df2f7 100644
--- a/Documentation/arch/x86/tdx.rst
+++ b/Documentation/arch/x86/tdx.rst
@@ -142,13 +142,6 @@ but depends on the BIOS to behave correctly.
 Note TDX works with CPU logical online/offline, thus the kernel still
 allows to offline logical CPU and online it again.
 
-Kexec()
-~~~~~~~
-
-TDX host support currently lacks the ability to handle kexec.  For
-simplicity only one of them can be enabled in the Kconfig.  This will be
-fixed in the future.
-
 Erratum
 ~~~~~~~
 
@@ -171,6 +164,13 @@ If the platform has such erratum, the kernel prints additional message in
 machine check handler to tell user the machine check may be caused by
 kernel bug on TDX private memory.
 
+Kexec
+~~~~~~~
+
+Currently kexec doesn't work on the TDX platforms with the aforementioned
+erratum.  It fails when loading the kexec kernel image.  Otherwise it
+works normally.
+
 Interaction vs S3 and deeper states
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
-- 
2.51.0



* [PATCH 7/7] KVM: TDX: Explicitly do WBINVD when no more TDX SEAMCALLs
  2025-09-01 16:09 [PATCH v8 0/7] TDX host: kexec/kdump support Paolo Bonzini
                   ` (5 preceding siblings ...)
  2025-09-01 16:09 ` [PATCH 6/7] x86/virt/tdx: Update the kexec section in the TDX documentation Paolo Bonzini
@ 2025-09-01 16:09 ` Paolo Bonzini
  2025-10-03 13:09 ` [PATCH v8 0/7] TDX host: kexec/kdump support Paolo Bonzini
  7 siblings, 0 replies; 38+ messages in thread
From: Paolo Bonzini @ 2025-09-01 16:09 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: dave.hansen, bp, tglx, peterz, mingo, hpa, thomas.lendacky, x86,
	kas, rick.p.edgecombe, dwmw, kai.huang, seanjc, reinette.chatre,
	isaku.yamahata, dan.j.williams, ashish.kalra, nik.borisov,
	chao.gao, sagis, farrah.chen

From: Kai Huang <kai.huang@intel.com>

On TDX platforms, during kexec, the kernel needs to make sure there
are no dirty cachelines of TDX private memory before booting to the new
kernel, to avoid silent memory corruption in the new kernel.

To do this, the kernel has a percpu boolean to indicate whether the
cache of a CPU may be in an incoherent state.  During kexec, namely in
stop_this_cpu(), the kernel does WBINVD if that percpu boolean is true.
TDX turns on that percpu boolean on a CPU when the kernel does SEAMCALL,
thus making sure the cache will be flushed during kexec.

However, kexec has a race condition that, while remaining extremely rare,
would be more likely in the presence of a relatively long operation such
as WBINVD.

In particular, the kexec-ing CPU invokes native_stop_other_cpus()
to stop all remote CPUs before booting to the new kernel.
native_stop_other_cpus() then sends a REBOOT vector IPI to remote CPUs
and waits for them to stop; if that times out, it also sends NMIs to the
still-alive CPUs and waits again for them to stop.  If the race happens,
kexec proceeds before all CPUs have processed the NMI and stopped[1],
and the system hangs.

But after tdx_disable_virtualization_cpu(), no more TDX activity
can happen on this CPU.  When kexec is enabled, flush the cache
explicitly at that point; this moves the WBINVD to an earlier stage than
stop_this_cpu(), avoiding a possibly lengthy operation at a time when
it could cause this race.
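
As a rough illustration only (not part of the patch), the flag handling
described above can be modeled in userspace C.  All model_* names and
MODEL_NR_CPUS are hypothetical stand-ins; the real code uses
this_cpu_read()/this_cpu_write() on the percpu boolean and the wbinvd()
helper, and the "flush" here merely counts invocations.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_NR_CPUS 4

/* Hypothetical userspace stand-ins for the kernel's per-CPU state. */
static bool cache_state_incoherent[MODEL_NR_CPUS];
static int wbinvd_count[MODEL_NR_CPUS];

/* Stand-in for WBINVD: only counts how often the flush ran on a CPU. */
static void model_wbinvd(int cpu)
{
	wbinvd_count[cpu]++;
}

/* Any SEAMCALL may leave dirty private cachelines behind on this CPU. */
static void model_seamcall(int cpu)
{
	assert(cpu >= 0 && cpu < MODEL_NR_CPUS);
	cache_state_incoherent[cpu] = true;
}

/* Early flush, modeled on tdx_cpu_flush_cache_for_kexec(). */
static void model_flush_cache_for_kexec(int cpu)
{
	assert(cpu >= 0 && cpu < MODEL_NR_CPUS);
	if (!cache_state_incoherent[cpu])
		return;
	model_wbinvd(cpu);
	cache_state_incoherent[cpu] = false;
}

/* Late path: the check done in stop_this_cpu() during kexec. */
static void model_stop_this_cpu(int cpu)
{
	assert(cpu >= 0 && cpu < MODEL_NR_CPUS);
	if (cache_state_incoherent[cpu])
		model_wbinvd(cpu);
}
```

In this model, a CPU that went through the early flush reaches
model_stop_this_cpu() with the flag already cleared, so the late path
does no flush; a CPU that skipped the early flush still gets flushed
there as a fallback.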

[1] https://lore.kernel.org/kvm/b963fcd60abe26c7ec5dc20b42f1a2ebbcc72397.1750934177.git.kai.huang@intel.com/

Signed-off-by: Kai Huang <kai.huang@intel.com>
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
Tested-by: Farrah Chen <farrah.chen@intel.com>
[Make the new function a stub for !CONFIG_KEXEC_CORE. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 arch/x86/include/asm/tdx.h  |  6 ++++++
 arch/x86/kvm/vmx/tdx.c      | 10 ++++++++++
 arch/x86/virt/vmx/tdx/tdx.c | 19 +++++++++++++++++++
 3 files changed, 35 insertions(+)

diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index c178360c1fb1..6120461bd5ff 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -228,5 +228,11 @@ static inline const char *tdx_dump_mce_info(struct mce *m) { return NULL; }
 static inline const struct tdx_sys_info *tdx_get_sysinfo(void) { return NULL; }
 #endif	/* CONFIG_INTEL_TDX_HOST */
 
+#ifdef CONFIG_KEXEC_CORE
+void tdx_cpu_flush_cache_for_kexec(void);
+#else
+static inline void tdx_cpu_flush_cache_for_kexec(void) { }
+#endif
+
 #endif /* !__ASSEMBLER__ */
 #endif /* _ASM_X86_TDX_H */
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index f457b2e578b2..04b6d332c1af 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -423,6 +423,16 @@ void tdx_disable_virtualization_cpu(void)
 		tdx_flush_vp(&arg);
 	}
 	local_irq_restore(flags);
+
+	/*
+	 * Flush cache now if kexec is possible: this is necessary to avoid
+	 * having dirty private memory cachelines when the new kernel boots,
+	 * but WBINVD is a relatively expensive operation and doing it during
+	 * kexec can exacerbate races in native_stop_other_cpus().  Do it
+	 * now, since this is a safe moment and there is going to be no more
+	 * TDX activity on this CPU from this point on.
+	 */
+	tdx_cpu_flush_cache_for_kexec();
 }
 
 #define TDX_SEAMCALL_RETRIES 10000
diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index 2abf53ed59c8..330b560313af 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -1872,3 +1872,22 @@ u64 tdh_phymem_page_wbinvd_hkid(u64 hkid, struct page *page)
 	return seamcall(TDH_PHYMEM_PAGE_WBINVD, &args);
 }
 EXPORT_SYMBOL_GPL(tdh_phymem_page_wbinvd_hkid);
+
+#ifdef CONFIG_KEXEC_CORE
+void tdx_cpu_flush_cache_for_kexec(void)
+{
+	lockdep_assert_preemption_disabled();
+
+	if (!this_cpu_read(cache_state_incoherent))
+		return;
+
+	/*
+	 * Private memory cachelines need to be clean at the time of
+	 * kexec.  Write them back now, as the caller promises that
+	 * there should be no more SEAMCALLs on this CPU.
+	 */
+	wbinvd();
+	this_cpu_write(cache_state_incoherent, false);
+}
+EXPORT_SYMBOL_GPL(tdx_cpu_flush_cache_for_kexec);
+#endif
-- 
2.51.0



* Re: [PATCH 4/7] x86/kexec: Disable kexec/kdump on platforms with TDX partial write erratum
  2025-09-01 16:09 ` [PATCH 4/7] x86/kexec: Disable kexec/kdump on platforms with TDX partial write erratum Paolo Bonzini
@ 2025-09-30  1:38   ` Vishal Annapurve
  2025-09-30 21:32     ` Dave Hansen
  2025-10-26 23:33   ` Vishal Annapurve
  1 sibling, 1 reply; 38+ messages in thread
From: Vishal Annapurve @ 2025-09-30  1:38 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: linux-kernel, kvm, dave.hansen, bp, tglx, peterz, mingo, hpa,
	thomas.lendacky, x86, kas, rick.p.edgecombe, dwmw, kai.huang,
	seanjc, reinette.chatre, isaku.yamahata, dan.j.williams,
	ashish.kalra, nik.borisov, chao.gao, sagis, farrah.chen,
	Binbin Wu

On Mon, Sep 1, 2025 at 9:11 AM Paolo Bonzini <pbonzini@redhat.com> wrote:
>
> From: Kai Huang <kai.huang@intel.com>
>
> Some early TDX-capable platforms have an erratum: A kernel partial
> write (a write transaction of less than cacheline lands at memory
> controller) to TDX private memory poisons that memory, and a subsequent
> read triggers a machine check.
>
> On those platforms, the old kernel must reset TDX private memory before
> jumping to the new kernel, otherwise the new kernel may see unexpected
> machine check.  Currently the kernel doesn't track which page is a TDX
> private page.  For simplicity just fail kexec/kdump for those platforms.

Google has a use case that needs host kdump support on SPR/EMR
platforms. Disabling kdump disables the host's ability to dump very
critical information on host crashes altogether. Is Intel working on
enabling kdump support for platforms with the erratum?


* Re: [PATCH 4/7] x86/kexec: Disable kexec/kdump on platforms with TDX partial write erratum
  2025-09-30  1:38   ` Vishal Annapurve
@ 2025-09-30 21:32     ` Dave Hansen
  2025-10-01  2:05       ` Vishal Annapurve
  0 siblings, 1 reply; 38+ messages in thread
From: Dave Hansen @ 2025-09-30 21:32 UTC (permalink / raw)
  To: Vishal Annapurve, Paolo Bonzini
  Cc: linux-kernel, kvm, bp, tglx, peterz, mingo, hpa, thomas.lendacky,
	x86, kas, rick.p.edgecombe, dwmw, kai.huang, seanjc,
	reinette.chatre, isaku.yamahata, dan.j.williams, ashish.kalra,
	nik.borisov, chao.gao, sagis, farrah.chen, Binbin Wu

On 9/29/25 18:38, Vishal Annapurve wrote:
>> On those platforms, the old kernel must reset TDX private memory before
>> jumping to the new kernel, otherwise the new kernel may see unexpected
>> machine check.  Currently the kernel doesn't track which page is a TDX
>> private page.  For simplicity just fail kexec/kdump for those platforms.
> Google has a usecase that needs host kdump support on SPR/EMR
> platforms. Disabling kdump disables the host's ability to dump very
> critical information on host crashes altogether. Is Intel working on
> enabling kdump support for platforms with the erratum?

Nope.

Any workarounds are going to be slow and probably imperfect. That's not
a great match for kdump. I'm perfectly happy waiting for fixed hardware
from what I've seen.


* Re: [PATCH 4/7] x86/kexec: Disable kexec/kdump on platforms with TDX partial write erratum
  2025-09-30 21:32     ` Dave Hansen
@ 2025-10-01  2:05       ` Vishal Annapurve
  2025-10-01 14:32         ` Dave Hansen
  0 siblings, 1 reply; 38+ messages in thread
From: Vishal Annapurve @ 2025-10-01  2:05 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Paolo Bonzini, linux-kernel, kvm, bp, tglx, peterz, mingo, hpa,
	thomas.lendacky, x86, kas, rick.p.edgecombe, dwmw, kai.huang,
	seanjc, reinette.chatre, isaku.yamahata, dan.j.williams,
	ashish.kalra, nik.borisov, chao.gao, sagis, farrah.chen,
	Binbin Wu

On Tue, Sep 30, 2025 at 2:32 PM Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 9/29/25 18:38, Vishal Annapurve wrote:
> >> On those platforms, the old kernel must reset TDX private memory before
> >> jumping to the new kernel, otherwise the new kernel may see unexpected
> >> machine check.  Currently the kernel doesn't track which page is a TDX
> >> private page.  For simplicity just fail kexec/kdump for those platforms.
> > Google has a usecase that needs host kdump support on SPR/EMR
> > platforms. Disabling kdump disables the host's ability to dump very
> > critical information on host crashes altogether. Is Intel working on
> > enabling kdump support for platforms with the erratum?
>
> Nope.
>
> Any workarounds are going to be slow and probably imperfect. That's not

Do we really need to deploy workarounds that are complex and slow to
get kdump working for the majority of the scenarios? Has any analysis
been done of the risk of imperfect but simpler workarounds versus the
benefits of kdump functionality?

> a great match for kdump. I'm perfectly happy waiting for fixed hardware
> from what I've seen.

IIUC two CPU generations out there, SPR and EMR, are impacted by this
erratum, and just disabling kdump functionality IMO is not the best
solution here.


* Re: [PATCH 4/7] x86/kexec: Disable kexec/kdump on platforms with TDX partial write erratum
  2025-10-01  2:05       ` Vishal Annapurve
@ 2025-10-01 14:32         ` Dave Hansen
  2025-10-01 17:17           ` Vishal Annapurve
  0 siblings, 1 reply; 38+ messages in thread
From: Dave Hansen @ 2025-10-01 14:32 UTC (permalink / raw)
  To: Vishal Annapurve
  Cc: Paolo Bonzini, linux-kernel, kvm, bp, tglx, peterz, mingo, hpa,
	thomas.lendacky, x86, kas, rick.p.edgecombe, dwmw, kai.huang,
	seanjc, reinette.chatre, isaku.yamahata, dan.j.williams,
	ashish.kalra, nik.borisov, chao.gao, sagis, farrah.chen,
	Binbin Wu

On 9/30/25 19:05, Vishal Annapurve wrote:
...
>> Any workarounds are going to be slow and probably imperfect. That's not
> 
> Do we really need to deploy workarounds that are complex and slow to
> get kdump working for the majority of the scenarios? Is there any
> analysis done for the risk with imperfect and simpler workarounds vs
> benefits of kdump functionality?
> 
>> a great match for kdump. I'm perfectly happy waiting for fixed hardware
>> from what I've seen.
> 
> IIUC SPR/EMR - two CPU generations out there are impacted by this
> erratum and just disabling kdump functionality IMO is not the best
> solution here.

That's an eminently reasonable position. But we're speaking in broad
generalities and I'm unsure what you don't like about the status quo or
how you'd like to see things change.

Care to send along a patch representing the "best solution"? That should
clear things up.


* Re: [PATCH 4/7] x86/kexec: Disable kexec/kdump on platforms with TDX partial write erratum
  2025-10-01 14:32         ` Dave Hansen
@ 2025-10-01 17:17           ` Vishal Annapurve
  2025-10-01 18:00             ` Dave Hansen
  2025-10-02  6:59             ` Reshetova, Elena
  0 siblings, 2 replies; 38+ messages in thread
From: Vishal Annapurve @ 2025-10-01 17:17 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Paolo Bonzini, linux-kernel, kvm, bp, tglx, peterz, mingo, hpa,
	thomas.lendacky, x86, kas, rick.p.edgecombe, dwmw, kai.huang,
	seanjc, reinette.chatre, isaku.yamahata, dan.j.williams,
	ashish.kalra, nik.borisov, chao.gao, sagis, farrah.chen,
	Binbin Wu

On Wed, Oct 1, 2025 at 7:32 AM Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 9/30/25 19:05, Vishal Annapurve wrote:
> ...
> >> Any workarounds are going to be slow and probably imperfect. That's not
> >
> > Do we really need to deploy workarounds that are complex and slow to
> > get kdump working for the majority of the scenarios? Is there any
> > analysis done for the risk with imperfect and simpler workarounds vs
> > benefits of kdump functionality?
> >
> >> a great match for kdump. I'm perfectly happy waiting for fixed hardware
> >> from what I've seen.
> >
> > IIUC SPR/EMR - two CPU generations out there are impacted by this
> > erratum and just disabling kdump functionality IMO is not the best
> > solution here.
>
> That's an eminently reasonable position. But we're speaking in broad
> generalities and I'm unsure what you don't like about the status quo or
> how you'd like to see things change.

Looks like the decision to disable kdump was taken between [1] -> [2].
"The kernel currently doesn't track which page is TDX private memory.
It's not trivial to reset TDX private memory.  For simplicity, this
series simply disables kexec/kdump for such platforms.  This will be
enhanced in the future."

A patch [3] from the series [1] describes the issue as:
"This problem is triggered by "partial" writes where a write transaction
of less than cacheline lands at the memory controller.  The CPU does
these via non-temporal write instructions (like MOVNTI), or through
UC/WC memory mappings.  The issue can also be triggered away from the
CPU by devices doing partial writes via DMA."

And also mentions:
"Also note only the normal kexec needs to worry about this problem, but
not the crash kexec: 1) The kdump kernel only uses the special memory
reserved by the first kernel, and the reserved memory can never be used
by TDX in the first kernel; 2) The /proc/vmcore, which reflects the
first (crashed) kernel's memory, is only for read.  The read will never
"poison" TDX memory thus cause unexpected machine check (only partial
write does)."

What was the scenario that led to disabling kdump support altogether
given the above description?

[1] https://lore.kernel.org/lkml/cover.1727179214.git.kai.huang@intel.com/
[2] https://lore.kernel.org/all/cover.1741778537.git.kai.huang@intel.com/
[3] https://lore.kernel.org/lkml/6960ef6d7ee9398d164bf3997e6009df3e88cb67.1727179214.git.kai.huang@intel.com/

>
> Care to send along a patch representing the "best solution"? That should
> clear things up.
>


* Re: [PATCH 4/7] x86/kexec: Disable kexec/kdump on platforms with TDX partial write erratum
  2025-10-01 17:17           ` Vishal Annapurve
@ 2025-10-01 18:00             ` Dave Hansen
  2025-10-01 21:19               ` Huang, Kai
  2025-10-02  6:59             ` Reshetova, Elena
  1 sibling, 1 reply; 38+ messages in thread
From: Dave Hansen @ 2025-10-01 18:00 UTC (permalink / raw)
  To: Vishal Annapurve
  Cc: Paolo Bonzini, linux-kernel, kvm, bp, tglx, peterz, mingo, hpa,
	thomas.lendacky, x86, kas, rick.p.edgecombe, dwmw, kai.huang,
	seanjc, reinette.chatre, isaku.yamahata, dan.j.williams,
	ashish.kalra, nik.borisov, chao.gao, sagis, farrah.chen,
	Binbin Wu

On 10/1/25 10:17, Vishal Annapurve wrote:
> And also mentions:
> "Also note only the normal kexec needs to worry about this problem, but
> not the crash kexec: 1) The kdump kernel only uses the special memory
> reserved by the first kernel, and the reserved memory can never be used
> by TDX in the first kernel; 2) The /proc/vmcore, which reflects the
> first (crashed) kernel's memory, is only for read.  The read will never
> "poison" TDX memory thus cause unexpected machine check (only partial
> write does)."
> 
> What was the scenario that led to disabling kdump support altogether
> given the above description?

I think it was purely out of convenience so that the disabling could be
three lines of code.

I don't know off the top of my head if there's a simple enough way to
disable kexec but not kdump. When I applied the thing, I was probably
just considering kexec/kdump a monolithic thing and not thinking that
folks would want one but not the other.

Kai, did you have any other motivations?


* Re: [PATCH 4/7] x86/kexec: Disable kexec/kdump on platforms with TDX partial write erratum
  2025-10-01 18:00             ` Dave Hansen
@ 2025-10-01 21:19               ` Huang, Kai
  0 siblings, 0 replies; 38+ messages in thread
From: Huang, Kai @ 2025-10-01 21:19 UTC (permalink / raw)
  To: Hansen, Dave, Annapurve, Vishal
  Cc: kvm@vger.kernel.org, tglx@linutronix.de, ashish.kalra@amd.com,
	thomas.lendacky@amd.com, linux-kernel@vger.kernel.org,
	Chatre, Reinette, dwmw@amazon.co.uk, pbonzini@redhat.com,
	seanjc@google.com, Yamahata, Isaku, kas@kernel.org,
	mingo@redhat.com, nik.borisov@suse.com, hpa@zytor.com,
	peterz@infradead.org, sagis@google.com, Chen, Farrah,
	Edgecombe, Rick P, bp@alien8.de, binbin.wu@linux.intel.com,
	Gao, Chao, x86@kernel.org, Williams, Dan J

On Wed, 2025-10-01 at 11:00 -0700, Hansen, Dave wrote:
> On 10/1/25 10:17, Vishal Annapurve wrote:
> > And also mentions:
> > "Also note only the normal kexec needs to worry about this problem, but
> > not the crash kexec: 1) The kdump kernel only uses the special memory
> > reserved by the first kernel, and the reserved memory can never be used
> > by TDX in the first kernel; 2) The /proc/vmcore, which reflects the
> > first (crashed) kernel's memory, is only for read.  The read will never
> > "poison" TDX memory thus cause unexpected machine check (only partial
> > write does)."
> > 
> > What was the scenario that led to disabling kdump support altogether
> > given the above description?
> 
> I think it was purely out of convenience so that the disabling could be
> three lines of code.
> 
> I don't know off the top of my head if there's a simple enough way to
> disable kexec but not kdump. When I applied the thing, I was probably
> just considering kexec/kdump a monolithic thing and not thinking that
> folks would want one but not the other.
> 
> Kai, did you have any other motivations?

The "/proc/vmcore is only for read" is my understanding of how the kdump
kernel uses /proc/vmcore.  I used to disable only kexec and allow kdump
to work (something like the diff below [*]), but during the internal
review we decided to just disable both, since we cannot be sure that this
is 100% true for all kdump users.

This was raised by Vishal publicly before and was discussed here (in v3):

https://lore.kernel.org/kvm/f8dcbe257b3931aec9e199132b678bd7681b7efa.camel@intel.com/

[*]:

diff --git a/arch/x86/kernel/machine_kexec_64.c b/arch/x86/kernel/machine_kexec_64.c
index 15088d14904f..c7af4aa7dd6b 100644
--- a/arch/x86/kernel/machine_kexec_64.c
+++ b/arch/x86/kernel/machine_kexec_64.c
@@ -356,10 +356,11 @@ int machine_kexec_prepare(struct kimage *image)
         * On those platforms the old kernel must reset TDX private
         * memory before jumping to the new kernel otherwise the new
         * kernel may see unexpected machine check.  For simplicity
-        * just fail kexec/kdump on those platforms.
+        * just fail kexec on those platforms.  Still allow kdump since
+        * the kdump kernel only reads TDX memory and does not write it.
         */
-       if (boot_cpu_has_bug(X86_BUG_TDX_PW_MCE)) {
-               pr_info_once("Not allowed on platform with tdx_pw_mce bug\n");
+       if (boot_cpu_has_bug(X86_BUG_TDX_PW_MCE) && image->type != KEXEC_TYPE_CRASH) {
+               pr_info_once("Kexec not allowed on platform with tdx_pw_mce bug\n");
                return -EOPNOTSUPP;
                return -EOPNOTSUPP;
        }
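
For comparison, the two policies under discussion (fail both kexec and
kdump, as in the posted patch, versus fail only normal kexec, as in the
diff above) can be written as a small userspace sketch.  The model_*
names and the enum are hypothetical and only mirror the shape of the
check in machine_kexec_prepare(); this is not kernel code.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical mirror of the kernel's image types, for illustration only. */
enum model_kexec_type {
	MODEL_KEXEC_TYPE_DEFAULT,
	MODEL_KEXEC_TYPE_CRASH,
};

/* Policy of the posted patch: any image load fails on affected platforms. */
static int model_prepare_fail_all(bool has_tdx_pw_mce_bug,
				  enum model_kexec_type type)
{
	assert(type == MODEL_KEXEC_TYPE_DEFAULT ||
	       type == MODEL_KEXEC_TYPE_CRASH);
	if (has_tdx_pw_mce_bug)
		return -EOPNOTSUPP;
	return 0;
}

/* Alternative policy: only normal kexec fails, crash (kdump) images load. */
static int model_prepare_allow_kdump(bool has_tdx_pw_mce_bug,
				     enum model_kexec_type type)
{
	assert(type == MODEL_KEXEC_TYPE_DEFAULT ||
	       type == MODEL_KEXEC_TYPE_CRASH);
	if (has_tdx_pw_mce_bug && type != MODEL_KEXEC_TYPE_CRASH)
		return -EOPNOTSUPP;
	return 0;
}
```

The two functions differ only for a crash image on an affected
platform, which is exactly the case being debated in this thread.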


* RE: [PATCH 4/7] x86/kexec: Disable kexec/kdump on platforms with TDX partial write erratum
  2025-10-01 17:17           ` Vishal Annapurve
  2025-10-01 18:00             ` Dave Hansen
@ 2025-10-02  6:59             ` Reshetova, Elena
  2025-10-02  7:46               ` Juergen Gross
  1 sibling, 1 reply; 38+ messages in thread
From: Reshetova, Elena @ 2025-10-02  6:59 UTC (permalink / raw)
  To: Annapurve, Vishal, Hansen, Dave
  Cc: Paolo Bonzini, linux-kernel@vger.kernel.org, kvm@vger.kernel.org,
	bp@alien8.de, tglx@linutronix.de, peterz@infradead.org,
	mingo@redhat.com, hpa@zytor.com, thomas.lendacky@amd.com,
	x86@kernel.org, kas@kernel.org, Edgecombe, Rick P,
	dwmw@amazon.co.uk, Huang, Kai, seanjc@google.com,
	Chatre, Reinette, Yamahata, Isaku, Williams, Dan J,
	ashish.kalra@amd.com, nik.borisov@suse.com, Gao, Chao,
	sagis@google.com, Chen, Farrah, Binbin Wu

> On Wed, Oct 1, 2025 at 7:32 AM Dave Hansen <dave.hansen@intel.com>
> wrote:
> >
> > On 9/30/25 19:05, Vishal Annapurve wrote:
> > ...
> > >> Any workarounds are going to be slow and probably imperfect. That's not
> > >
> > > Do we really need to deploy workarounds that are complex and slow to
> > > get kdump working for the majority of the scenarios? Is there any
> > > analysis done for the risk with imperfect and simpler workarounds vs
> > > benefits of kdump functionality?
> > >
> > >> a great match for kdump. I'm perfectly happy waiting for fixed hardware
> > >> from what I've seen.
> > >
> > > IIUC SPR/EMR - two CPU generations out there are impacted by this
> > > erratum and just disabling kdump functionality IMO is not the best
> > > solution here.
> >
> > That's an eminently reasonable position. But we're speaking in broad
> > generalities and I'm unsure what you don't like about the status quo or
> > how you'd like to see things change.
> 
> Looks like the decision to disable kdump was taken between [1] -> [2].
> "The kernel currently doesn't track which page is TDX private memory.
> It's not trivial to reset TDX private memory.  For simplicity, this
> series simply disables kexec/kdump for such platforms.  This will be
> enhanced in the future."
> 
> A patch [3] from the series[1], describes the issue as:
> "This problem is triggered by "partial" writes where a write transaction
> of less than cacheline lands at the memory controller.  The CPU does
> these via non-temporal write instructions (like MOVNTI), or through
> UC/WC memory mappings.  The issue can also be triggered away from the
> CPU by devices doing partial writes via DMA."
> 
> And also mentions:
> "Also note only the normal kexec needs to worry about this problem, but
> not the crash kexec: 1) The kdump kernel only uses the special memory
> reserved by the first kernel, and the reserved memory can never be used
> by TDX in the first kernel; 2) The /proc/vmcore, which reflects the
> first (crashed) kernel's memory, is only for read.  The read will never
> "poison" TDX memory thus cause unexpected machine check (only partial
> write does)."

While the statement that the read will never poison the memory is correct,
the situation we can theoretically worry about is, in my understanding, the
following:

1. During its execution on a platform with the partial write problem, the host
OS or another actor executing outside of SEAM mode triggers a partial write into
a cache line that originally belonged to TDX private memory.
This is something that the host OS or other entities should not do, but it could
happen due to host OS bugs, etc.
2. The above causes the specified cache line to be poisoned by the memory
controller. However, here we assume that no one accesses this cache line from
the TDX module, TD guests or the host OS for the time being, and the problem
remains hidden.
3. The host OS crashes due to some other issue, the kdump crash kernel is
triggered, and kdump starts to read all the memory from the previous host kernel
to dump the diagnostics info.
4. At some point, the kdump crash kernel reaches the memory with the poisoned
cache line, consumes the poison, and the #MC is issued for the kernel space.
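
The four steps above can be sketched as a toy userspace model.  This is
only an illustration under simplifying assumptions: memory is an array
of cache lines, a partial write to a TDX private line sets a "poisoned"
bit, and a read of a poisoned line stands in for the #MC.  All model_*
names and constants are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_LINE_SIZE 64
#define MODEL_NR_LINES 8

struct model_line {
	bool tdx_private;	/* line is (or was) TDX private memory */
	bool poisoned;		/* set by a partial write, per the erratum */
};

static struct model_line model_mem[MODEL_NR_LINES];

/* A write shorter than a full cacheline to a private line poisons it. */
static void model_write(size_t line, size_t len)
{
	assert(line < MODEL_NR_LINES && len > 0 && len <= MODEL_LINE_SIZE);
	if (model_mem[line].tdx_private && len < MODEL_LINE_SIZE)
		model_mem[line].poisoned = true;
}

/* Returns true if the read consumes poison, i.e. would raise #MC. */
static bool model_read(size_t line)
{
	assert(line < MODEL_NR_LINES);
	return model_mem[line].poisoned;
}

/* kdump-style scan: read every line; return the first poisoned one or -1. */
static int model_dump_scan(void)
{
	for (size_t i = 0; i < MODEL_NR_LINES; i++)
		if (model_read(i))
			return (int)i;
	return -1;
}
```

In this model the poison stays hidden until something reads the line,
so a dump scan that touches all of the old kernel's memory is the first
reader to hit it.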

Isn't this the reason for also disabling kdump? Or am I missing something?

Best Regards,
Elena.


* Re: [PATCH 4/7] x86/kexec: Disable kexec/kdump on platforms with TDX partial write erratum
  2025-10-02  6:59             ` Reshetova, Elena
@ 2025-10-02  7:46               ` Juergen Gross
  2025-10-02  8:10                 ` Reshetova, Elena
  2025-10-02 15:06                 ` Dave Hansen
  0 siblings, 2 replies; 38+ messages in thread
From: Juergen Gross @ 2025-10-02  7:46 UTC (permalink / raw)
  To: Reshetova, Elena, Annapurve, Vishal, Hansen, Dave
  Cc: Paolo Bonzini, linux-kernel@vger.kernel.org, kvm@vger.kernel.org,
	bp@alien8.de, tglx@linutronix.de, peterz@infradead.org,
	mingo@redhat.com, hpa@zytor.com, thomas.lendacky@amd.com,
	x86@kernel.org, kas@kernel.org, Edgecombe, Rick P,
	dwmw@amazon.co.uk, Huang, Kai, seanjc@google.com,
	Chatre, Reinette, Yamahata, Isaku, Williams, Dan J,
	ashish.kalra@amd.com, nik.borisov@suse.com, Gao, Chao,
	sagis@google.com, Chen, Farrah, Binbin Wu



On 02.10.25 08:59, Reshetova, Elena wrote:
>> On Wed, Oct 1, 2025 at 7:32 AM Dave Hansen <dave.hansen@intel.com>
>> wrote:
>>>
>>> On 9/30/25 19:05, Vishal Annapurve wrote:
>>> ...
>>>>> Any workarounds are going to be slow and probably imperfect. That's not
>>>>
>>>> Do we really need to deploy workarounds that are complex and slow to
>>>> get kdump working for the majority of the scenarios? Is there any
>>>> analysis done for the risk with imperfect and simpler workarounds vs
>>>> benefits of kdump functionality?
>>>>
>>>>> a great match for kdump. I'm perfectly happy waiting for fixed hardware
>>>>> from what I've seen.
>>>>
>>>> IIUC SPR/EMR - two CPU generations out there are impacted by this
>>>> erratum and just disabling kdump functionality IMO is not the best
>>>> solution here.
>>>
>>> That's an eminently reasonable position. But we're speaking in broad
>>> generalities and I'm unsure what you don't like about the status quo or
>>> how you'd like to see things change.
>>
>> Looks like the decision to disable kdump was taken between [1] -> [2].
>> "The kernel currently doesn't track which page is TDX private memory.
>> It's not trivial to reset TDX private memory.  For simplicity, this
>> series simply disables kexec/kdump for such platforms.  This will be
>> enhanced in the future."
>>
>> A patch [3] from the series[1], describes the issue as:
>> "This problem is triggered by "partial" writes where a write transaction
>> of less than cacheline lands at the memory controller.  The CPU does
>> these via non-temporal write instructions (like MOVNTI), or through
>> UC/WC memory mappings.  The issue can also be triggered away from the
>> CPU by devices doing partial writes via DMA."
>>
>> And also mentions:
>> "Also note only the normal kexec needs to worry about this problem, but
>> not the crash kexec: 1) The kdump kernel only uses the special memory
>> reserved by the first kernel, and the reserved memory can never be used
>> by TDX in the first kernel; 2) The /proc/vmcore, which reflects the
>> first (crashed) kernel's memory, is only for read.  The read will never
>> "poison" TDX memory thus cause unexpected machine check (only partial
>> write does)."
> 
> While the statement that a read will never poison the memory is correct,
> the situation we can theoretically worry about is, in my understanding,
> the following:
> 
> 1. While running on a platform with the partial write problem, the host OS
> or another actor executing outside of SEAM mode triggers a partial write
> into a cache line that originally belonged to TDX private memory.
> This is something that the host OS or other entities should not do, but it
> could happen due to host OS bugs, etc.
> 2. The above causes the specified cache line to be poisoned by the memory
> controller. However, here we assume that no one accesses this cache line
> from the TDX module, TD guests or the host OS for the time being, and the
> problem remains hidden.
> 3. The host OS crashes due to some other issue, the kdump crash kernel is
> triggered, and kdump starts to read all the memory from the previous host
> kernel to dump the diagnostics info.
> 4. At some point, the kdump crash kernel reaches the memory with the
> poisoned cache line, consumes the poison, and a #MC is issued for the
> kernel space.
> 
> Isn't this the reason for also disabling kdump? Or am I missing something?

So let's compare the 2 cases with kdump enabled and disabled in your scenario
(crash of the host OS):

kdump enabled: No dump can be produced due to the #MC and system is rebooted.

kdump disabled: No dump is produced and system is rebooted after crash.

What is the main concern with kdump enabled? I don't see any disadvantage with
enabling it, just the advantage that in many cases a dump will be written.


Juergen

^ permalink raw reply	[flat|nested] 38+ messages in thread

* RE: [PATCH 4/7] x86/kexec: Disable kexec/kdump on platforms with TDX partial write erratum
  2025-10-02  7:46               ` Juergen Gross
@ 2025-10-02  8:10                 ` Reshetova, Elena
  2025-10-02 15:06                 ` Dave Hansen
  1 sibling, 0 replies; 38+ messages in thread
From: Reshetova, Elena @ 2025-10-02  8:10 UTC (permalink / raw)
  To: Juergen Gross, Annapurve, Vishal, Hansen, Dave
  Cc: Paolo Bonzini, linux-kernel@vger.kernel.org, kvm@vger.kernel.org,
	bp@alien8.de, tglx@linutronix.de, peterz@infradead.org,
	mingo@redhat.com, hpa@zytor.com, thomas.lendacky@amd.com,
	x86@kernel.org, kas@kernel.org, Edgecombe, Rick P,
	dwmw@amazon.co.uk, Huang, Kai, seanjc@google.com,
	Chatre, Reinette, Yamahata, Isaku, Williams, Dan J,
	ashish.kalra@amd.com, nik.borisov@suse.com, Gao, Chao,
	sagis@google.com, Chen, Farrah, Binbin Wu

> On 02.10.25 08:59, Reshetova, Elena wrote:
> >> On Wed, Oct 1, 2025 at 7:32 AM Dave Hansen <dave.hansen@intel.com>
> >> wrote:
> >>>
> >>> On 9/30/25 19:05, Vishal Annapurve wrote:
> >>> ...
> >>>>> Any workarounds are going to be slow and probably imperfect. That's not
> >>>>
> >>>> Do we really need to deploy workarounds that are complex and slow to
> >>>> get kdump working for the majority of the scenarios? Is there any
> >>>> analysis done for the risk with imperfect and simpler workarounds vs
> >>>> benefits of kdump functionality?
> >>>>
> >>>>> a great match for kdump. I'm perfectly happy waiting for fixed hardware
> >>>>> from what I've seen.
> >>>>
> >>>> IIUC SPR/EMR - two CPU generations out there are impacted by this
> >>>> erratum and just disabling kdump functionality IMO is not the best
> >>>> solution here.
> >>>
> >>> That's an eminently reasonable position. But we're speaking in broad
> >>> generalities and I'm unsure what you don't like about the status quo or
> >>> how you'd like to see things change.
> >>
> >> Looks like the decision to disable kdump was taken between [1] -> [2].
> >> "The kernel currently doesn't track which page is TDX private memory.
> >> It's not trivial to reset TDX private memory.  For simplicity, this
> >> series simply disables kexec/kdump for such platforms.  This will be
> >> enhanced in the future."
> >>
> >> A patch [3] from the series[1], describes the issue as:
> >> "This problem is triggered by "partial" writes where a write transaction
> >> of less than cacheline lands at the memory controller.  The CPU does
> >> these via non-temporal write instructions (like MOVNTI), or through
> >> UC/WC memory mappings.  The issue can also be triggered away from the
> >> CPU by devices doing partial writes via DMA."
> >>
> >> And also mentions:
> >> "Also note only the normal kexec needs to worry about this problem, but
> >> not the crash kexec: 1) The kdump kernel only uses the special memory
> >> reserved by the first kernel, and the reserved memory can never be used
> >> by TDX in the first kernel; 2) The /proc/vmcore, which reflects the
> >> first (crashed) kernel's memory, is only for read.  The read will never
> >> "poison" TDX memory thus cause unexpected machine check (only partial
> >> write does)."
> >
> > While the statement that a read will never poison the memory is correct,
> > the situation we can theoretically worry about is, in my understanding,
> > the following:
> >
> > 1. While running on a platform with the partial write problem, the host OS
> > or another actor executing outside of SEAM mode triggers a partial write
> > into a cache line that originally belonged to TDX private memory.
> > This is something that the host OS or other entities should not do, but it
> > could happen due to host OS bugs, etc.
> > 2. The above causes the specified cache line to be poisoned by the memory
> > controller. However, here we assume that no one accesses this cache line
> > from the TDX module, TD guests or the host OS for the time being, and the
> > problem remains hidden.
> > 3. The host OS crashes due to some other issue, the kdump crash kernel is
> > triggered, and kdump starts to read all the memory from the previous host
> > kernel to dump the diagnostics info.
> > 4. At some point, the kdump crash kernel reaches the memory with the
> > poisoned cache line, consumes the poison, and a #MC is issued for the
> > kernel space.
> >
> > Isn't this the reason for also disabling kdump? Or am I missing something?
> 
> So let's compare the 2 cases with kdump enabled and disabled in your scenario
> (crash of the host OS):
> 
> kdump enabled: No dump can be produced due to the #MC and system is rebooted.
> 
> kdump disabled: No dump is produced and system is rebooted after crash.
> 
> What is the main concern with kdump enabled? I don't see any disadvantage with
> enabling it, just the advantage that in many cases a dump will be written.

I am not in a position to judge what should be done about kdump in Linux,
nor am I arguing one way or the other.
I just wanted to fill the gap and explain the technical scenario above,
which I think was missing from this thread. Whatever decision is taken by
the community should rely on understanding the HW behaviour, so this is what
I tried to explain above.

Best Regards,
Elena.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 4/7] x86/kexec: Disable kexec/kdump on platforms with TDX partial write erratum
  2025-10-02  7:46               ` Juergen Gross
  2025-10-02  8:10                 ` Reshetova, Elena
@ 2025-10-02 15:06                 ` Dave Hansen
  2025-10-02 16:09                   ` Vishal Annapurve
  2025-10-07 13:31                   ` Jürgen Groß
  1 sibling, 2 replies; 38+ messages in thread
From: Dave Hansen @ 2025-10-02 15:06 UTC (permalink / raw)
  To: Juergen Gross, Reshetova, Elena, Annapurve, Vishal
  Cc: Paolo Bonzini, linux-kernel@vger.kernel.org, kvm@vger.kernel.org,
	bp@alien8.de, tglx@linutronix.de, peterz@infradead.org,
	mingo@redhat.com, hpa@zytor.com, thomas.lendacky@amd.com,
	x86@kernel.org, kas@kernel.org, Edgecombe, Rick P,
	dwmw@amazon.co.uk, Huang, Kai, seanjc@google.com,
	Chatre, Reinette, Yamahata, Isaku, Williams, Dan J,
	ashish.kalra@amd.com, nik.borisov@suse.com, Gao, Chao,
	sagis@google.com, Chen, Farrah, Binbin Wu

On 10/2/25 00:46, Juergen Gross wrote:
> So let's compare the 2 cases with kdump enabled and disabled in your
> scenario (crash of the host OS):
> 
> kdump enabled: No dump can be produced due to the #MC and system is
> rebooted.
> 
> kdump disabled: No dump is produced and system is rebooted after crash.
> 
> What is the main concern with kdump enabled? I don't see any
> disadvantage with enabling it, just the advantage that in many cases
> a dump will be written.

The disadvantage is that a kernel bug from long ago results in a machine
check. Machine checks are generally indicative of bad hardware. So the
disadvantage is that someone mistakes the long ago kernel bug for bad
hardware.

There are two ways of looking at this:

1. A theoretically fragile kdump is better than no kdump at all. All of
   the stars would have to align for kdump to _fail_ and we don't think
   that's going to happen often enough to matter.
2. kdump happens after kernel bugs. The machine checks happen because of
   kernel bugs. It's not a big stretch to think that, at scale, kdump is
   going to run into these #MCs on a regular basis.

Does that capture the two perspectives fairly?

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 4/7] x86/kexec: Disable kexec/kdump on platforms with TDX partial write erratum
  2025-10-02 15:06                 ` Dave Hansen
@ 2025-10-02 16:09                   ` Vishal Annapurve
  2025-10-18 15:54                     ` Vishal Annapurve
  2025-10-07 13:31                   ` Jürgen Groß
  1 sibling, 1 reply; 38+ messages in thread
From: Vishal Annapurve @ 2025-10-02 16:09 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Juergen Gross, Reshetova, Elena, Paolo Bonzini,
	linux-kernel@vger.kernel.org, kvm@vger.kernel.org, bp@alien8.de,
	tglx@linutronix.de, peterz@infradead.org, mingo@redhat.com,
	hpa@zytor.com, thomas.lendacky@amd.com, x86@kernel.org,
	kas@kernel.org, Edgecombe, Rick P, dwmw@amazon.co.uk, Huang, Kai,
	seanjc@google.com, Chatre, Reinette, Yamahata, Isaku,
	Williams, Dan J, ashish.kalra@amd.com, nik.borisov@suse.com,
	Gao, Chao, sagis@google.com, Chen, Farrah, Binbin Wu

On Thu, Oct 2, 2025 at 8:06 AM Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 10/2/25 00:46, Juergen Gross wrote:
> > So let's compare the 2 cases with kdump enabled and disabled in your
> > scenario (crash of the host OS):
> >
> > kdump enabled: No dump can be produced due to the #MC and system is
> > rebooted.
> >
> > kdump disabled: No dump is produced and system is rebooted after crash.
> >
> > What is the main concern with kdump enabled? I don't see any
> > disadvantage with enabling it, just the advantage that in many cases
> > a dump will be written.
>
> The disadvantage is that a kernel bug from long ago results in a machine
> check. Machine checks are generally indicative of bad hardware. So the
> disadvantage is that someone mistakes the long ago kernel bug for bad
> hardware.
>
> There are two ways of looking at this:
>
> 1. A theoretically fragile kdump is better than no kdump at all. All of
>    the stars would have to align for kdump to _fail_ and we don't think
>    that's going to happen often enough to matter.
> 2. kdump happens after kernel bugs. The machine checks happen because of
>    kernel bugs. It's not a big stretch to think that, at scale, kdump is
>    going to run into these #MCs on a regular basis.

Looking at Elena's response, I would say it's still *a* big stretch
for kdump to run into these #MCs on a regular basis, as the following
sequence is needed for the problematic scenario:
1) A host OS bug has to corrupt TDX private memory that is part of
kernel memory with a *partial write*.
    -> i.e. PAMT tables, SEPT tables, TD VCPU/VM metadata etc.
    -> IIUC corruption of guest memory is not a concern as that
belongs to userspace.
2) The TDX module/TD must not consume that poisoned memory.
    -> i.e. no walk of the metadata memory.
3) The host kernel needs to hit a bug that causes an orthogonal panic.

*partial writes* IIUC need special instructions.
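
To make "special instructions" concrete, here is a minimal user-space
sketch of a sub-cacheline non-temporal store, the kind of access the
erratum description quoted earlier in this thread names (MOVNTI). The
helper name store_u32_nontemporal() is made up for illustration, and the
assumption that the compiler lowers _mm_stream_si32() to MOVNTI holds for
typical x86 builds but is not guaranteed. On ordinary memory this store is
harmless; it only shows what a 4-byte write that bypasses the cache looks
like.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>	/* _mm_stream_si32(), _mm_sfence() */
#endif

/*
 * Hypothetical helper: write 4 bytes with a non-temporal hint.
 * 4 bytes is less than a 64-byte cache line, so this is a "partial"
 * write in the sense used by the erratum description.
 */
static void store_u32_nontemporal(int32_t *dst, int32_t value)
{
	assert(dst != NULL);
#if defined(__SSE2__)
	/* Typically compiled to MOVNTI on x86 (an assumption, see above). */
	_mm_stream_si32((int *)dst, (int)value);
	/* Order the non-temporal store before any later loads or stores. */
	_mm_sfence();
#else
	/* Fallback for non-x86 builds: a plain store, no cache bypass. */
	*dst = value;
#endif
}
```

Ordinary kernel code writing through cached mappings does not issue such
stores, which is why the list above treats step 1 as unlikely.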

>
> Does that capture the two perspectives fairly?

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v8 0/7] TDX host: kexec/kdump support
  2025-09-01 16:09 [PATCH v8 0/7] TDX host: kexec/kdump support Paolo Bonzini
                   ` (6 preceding siblings ...)
  2025-09-01 16:09 ` [PATCH 7/7] KVM: TDX: Explicitly do WBINVD when no more TDX SEAMCALLs Paolo Bonzini
@ 2025-10-03 13:09 ` Paolo Bonzini
  2025-10-03 13:54   ` David Woodhouse
  2025-10-03 14:05   ` Dave Hansen
  7 siblings, 2 replies; 38+ messages in thread
From: Paolo Bonzini @ 2025-10-03 13:09 UTC (permalink / raw)
  To: linux-kernel, kvm, dave.hansen
  Cc: bp, tglx, peterz, mingo, hpa, thomas.lendacky, x86, kas,
	rick.p.edgecombe, dwmw, kai.huang, seanjc, reinette.chatre,
	isaku.yamahata, dan.j.williams, ashish.kalra, nik.borisov,
	chao.gao, sagis, farrah.chen

Hi Dave and others,

any reason why this series was not pulled into 6.18? I was a bit
surprised not to see it...

Thanks,

Paolo

On Mon, Sep 1, 2025 at 6:09 PM Paolo Bonzini <pbonzini@redhat.com> wrote:
>
> Currently kexec() support and TDX host are mutually exclusive in the
> Kconfig.  This series adds the TDX host kexec support so that they can
> be both enabled in Kconfig.
>
> With this series, the user can kexec (including crash kdump) to the new
> kernel at any time regardless of whether TDX has been enabled in the
> first kernel.  One limitation is if the first kernel has ever enabled
> TDX, for now the second kernel cannot use TDX.  This is future work
> on my TODO list.
>
> This series should go in through the tip tree.
>
> Thanks,
>
> Paolo
>
> v7->v8: stub out the new code when kexec is not enabled in the kernel.
>         Of course even the smallest code change is subject to bikeshedding,
>         and I chose my preferred color for the bikeshed.  But it's pastel
>         green and I'm sure you'll agree that it's beautiful.
>
>
> Kai Huang (7):
>   x86/kexec: Consolidate relocate_kernel() function parameters
>   x86/sme: Use percpu boolean to control WBINVD during kexec
>   x86/virt/tdx: Mark memory cache state incoherent when making SEAMCALL
>   x86/kexec: Disable kexec/kdump on platforms with TDX partial write
>     erratum
>   x86/virt/tdx: Remove the !KEXEC_CORE dependency
>   x86/virt/tdx: Update the kexec section in the TDX documentation
>   KVM: TDX: Explicitly do WBINVD when no more TDX SEAMCALLs
>
>  Documentation/arch/x86/tdx.rst       | 14 ++++-----
>  arch/x86/Kconfig                     |  1 -
>  arch/x86/include/asm/kexec.h         | 12 ++++++--
>  arch/x86/include/asm/processor.h     |  2 ++
>  arch/x86/include/asm/tdx.h           | 31 +++++++++++++++++++-
>  arch/x86/kernel/cpu/amd.c            | 17 +++++++++++
>  arch/x86/kernel/machine_kexec_64.c   | 44 ++++++++++++++++++++++------
>  arch/x86/kernel/process.c            | 24 +++++++--------
>  arch/x86/kernel/relocate_kernel_64.S | 36 +++++++++++++++--------
>  arch/x86/kvm/vmx/tdx.c               | 10 +++++++
>  arch/x86/virt/vmx/tdx/tdx.c          | 23 +++++++++++++--
>  11 files changed, 167 insertions(+), 47 deletions(-)
>
> --
> 2.51.0


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v8 0/7] TDX host: kexec/kdump support
  2025-10-03 13:09 ` [PATCH v8 0/7] TDX host: kexec/kdump support Paolo Bonzini
@ 2025-10-03 13:54   ` David Woodhouse
  2025-10-03 14:05   ` Dave Hansen
  1 sibling, 0 replies; 38+ messages in thread
From: David Woodhouse @ 2025-10-03 13:54 UTC (permalink / raw)
  To: Paolo Bonzini, linux-kernel, kvm, dave.hansen
  Cc: bp, tglx, peterz, mingo, hpa, thomas.lendacky, x86, kas,
	rick.p.edgecombe, kai.huang, seanjc, reinette.chatre,
	isaku.yamahata, dan.j.williams, ashish.kalra, nik.borisov,
	chao.gao, sagis, farrah.chen

On Fri, 2025-10-03 at 15:09 +0200, Paolo Bonzini wrote:
> Hi Dave and others,
> 
> any reason why this series was not pulled into 6.18? I was a bit
> surprised not to see it...

At the very least I'd like to see the first patch with the
relocate_kernel() rework go in. I have some stuff lined up which would
conflict with that and needs to go after it, to speed up kexec by
keeping CPUs online, etc.



^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v8 0/7] TDX host: kexec/kdump support
  2025-10-03 13:09 ` [PATCH v8 0/7] TDX host: kexec/kdump support Paolo Bonzini
  2025-10-03 13:54   ` David Woodhouse
@ 2025-10-03 14:05   ` Dave Hansen
  1 sibling, 0 replies; 38+ messages in thread
From: Dave Hansen @ 2025-10-03 14:05 UTC (permalink / raw)
  To: Paolo Bonzini, linux-kernel, kvm
  Cc: bp, tglx, peterz, mingo, hpa, thomas.lendacky, x86, kas,
	rick.p.edgecombe, dwmw, kai.huang, seanjc, reinette.chatre,
	isaku.yamahata, dan.j.williams, ashish.kalra, nik.borisov,
	chao.gao, sagis, farrah.chen

On 10/3/25 06:09, Paolo Bonzini wrote:
> any reason why this series was not pulled into 6.18? I was a bit
> surprised not to see it...

The usual reasons. It fell through the cracks when it got posted and
nobody mentioned it until now.

I'll stick it on my list for after the merge window closes.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 4/7] x86/kexec: Disable kexec/kdump on platforms with TDX partial write erratum
  2025-10-02 15:06                 ` Dave Hansen
  2025-10-02 16:09                   ` Vishal Annapurve
@ 2025-10-07 13:31                   ` Jürgen Groß
  2025-10-08 15:40                     ` Dave Hansen
  1 sibling, 1 reply; 38+ messages in thread
From: Jürgen Groß @ 2025-10-07 13:31 UTC (permalink / raw)
  To: Dave Hansen, Reshetova, Elena, Annapurve, Vishal
  Cc: Paolo Bonzini, linux-kernel@vger.kernel.org, kvm@vger.kernel.org,
	bp@alien8.de, tglx@linutronix.de, peterz@infradead.org,
	mingo@redhat.com, hpa@zytor.com, thomas.lendacky@amd.com,
	x86@kernel.org, kas@kernel.org, Edgecombe, Rick P,
	dwmw@amazon.co.uk, Huang, Kai, seanjc@google.com,
	Chatre, Reinette, Yamahata, Isaku, Williams, Dan J,
	ashish.kalra@amd.com, nik.borisov@suse.com, Gao, Chao,
	sagis@google.com, Chen, Farrah, Binbin Wu


On 02.10.25 17:06, Dave Hansen wrote:
> On 10/2/25 00:46, Juergen Gross wrote:
>> So let's compare the 2 cases with kdump enabled and disabled in your
>> scenario (crash of the host OS):
>>
>> kdump enabled: No dump can be produced due to the #MC and system is
>> rebooted.
>>
>> kdump disabled: No dump is produced and system is rebooted after crash.
>>
>> What is the main concern with kdump enabled? I don't see any
>> disadvantage with enabling it, just the advantage that in many cases
>> a dump will be written.
>
> The disadvantage is that a kernel bug from long ago results in a machine
> check. Machine checks are generally indicative of bad hardware. So the
> disadvantage is that someone mistakes the long ago kernel bug for bad
> hardware.
> 
> There are two ways of looking at this:
> 
> 1. A theoretically fragile kdump is better than no kdump at all. All of
>     the stars would have to align for kdump to _fail_ and we don't think
>     that's going to happen often enough to matter.
> 2. kdump happens after kernel bugs. The machine checks happen because of
>     kernel bugs. It's not a big stretch to think that, at scale, kdump is
>     going to run into these #MCs on a regular basis.
> 
> Does that capture the two perspectives fairly?

Basically yes.

If we can't come to an agreement that kdump should be allowed in spite of
a potential #MC, maybe we could disable kdump only if TDX guests have been
active on the machine before? Disabling kdump on a distro kernel just because
TDX was enabled but without anyone having used TDX would be quite hard.


Juergen

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 4/7] x86/kexec: Disable kexec/kdump on platforms with TDX partial write erratum
  2025-10-07 13:31                   ` Jürgen Groß
@ 2025-10-08 15:40                     ` Dave Hansen
  2025-10-08 18:13                       ` Jürgen Groß
  0 siblings, 1 reply; 38+ messages in thread
From: Dave Hansen @ 2025-10-08 15:40 UTC (permalink / raw)
  To: Jürgen Groß, Reshetova, Elena, Annapurve, Vishal
  Cc: Paolo Bonzini, linux-kernel@vger.kernel.org, kvm@vger.kernel.org,
	bp@alien8.de, tglx@linutronix.de, peterz@infradead.org,
	mingo@redhat.com, hpa@zytor.com, thomas.lendacky@amd.com,
	x86@kernel.org, kas@kernel.org, Edgecombe, Rick P,
	dwmw@amazon.co.uk, Huang, Kai, seanjc@google.com,
	Chatre, Reinette, Yamahata, Isaku, Williams, Dan J,
	ashish.kalra@amd.com, nik.borisov@suse.com, Gao, Chao,
	sagis@google.com, Chen, Farrah, Binbin Wu

On 10/7/25 06:31, Jürgen Groß wrote:
> If we can't come to an agreement that kdump should be allowed in
> spite of a potential #MC, maybe we could disable kdump only if TDX
> guests have been active on the machine before?

How would we determine that?

We can't just call the TDX module to see because it might have been
running before but got shut down.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 4/7] x86/kexec: Disable kexec/kdump on platforms with TDX partial write erratum
  2025-10-08 15:40                     ` Dave Hansen
@ 2025-10-08 18:13                       ` Jürgen Groß
  0 siblings, 0 replies; 38+ messages in thread
From: Jürgen Groß @ 2025-10-08 18:13 UTC (permalink / raw)
  To: Dave Hansen, Reshetova, Elena, Annapurve, Vishal
  Cc: Paolo Bonzini, linux-kernel@vger.kernel.org, kvm@vger.kernel.org,
	bp@alien8.de, tglx@linutronix.de, peterz@infradead.org,
	mingo@redhat.com, hpa@zytor.com, thomas.lendacky@amd.com,
	x86@kernel.org, kas@kernel.org, Edgecombe, Rick P,
	dwmw@amazon.co.uk, Huang, Kai, seanjc@google.com,
	Chatre, Reinette, Yamahata, Isaku, Williams, Dan J,
	ashish.kalra@amd.com, nik.borisov@suse.com, Gao, Chao,
	sagis@google.com, Chen, Farrah, Binbin Wu


On 08.10.25 17:40, Dave Hansen wrote:
> On 10/7/25 06:31, Jürgen Groß wrote:
>> If we can't come to an agreement that kdump should be allowed in
>> spite of a potential #MC, maybe we could disable kdump only if TDX
>> guests have been active on the machine before?
> 
> How would we determine that?
> 
> We can't just call the TDX module to see because it might have been
> running before but got shut down.

Ah, okay, I didn't think of that.

Then we could add a kernel boot parameter to let the user opt in to kexec
being possible in spite of the potential #MC. I think this should cover it.
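
A minimal user-space sketch of what such an opt-in could look like,
purely for illustration. The token name "tdx_allow_kexec" and both helper
functions are hypothetical; nothing in this thread or in the series
defines them, and real kernel code would use the kernel's own parameter
parsing rather than strstr().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical parameter name, invented for this sketch. */
#define OPT_IN_TOKEN "tdx_allow_kexec"

/* Return true if the opt-in token appears as a whole space-separated word. */
static bool cmdline_has_opt_in(const char *cmdline)
{
	size_t len = strlen(OPT_IN_TOKEN);
	const char *p = cmdline;

	assert(cmdline != NULL);
	while ((p = strstr(p, OPT_IN_TOKEN)) != NULL) {
		bool starts_word = (p == cmdline) || (p[-1] == ' ');
		bool ends_word = (p[len] == '\0') || (p[len] == ' ');

		if (starts_word && ends_word)
			return true;
		p += len;	/* skip this partial match and keep looking */
	}
	return false;
}

/*
 * Policy: platforms without the erratum are never restricted; platforms
 * with it allow kexec only when the user explicitly opted in.
 */
static bool kexec_permitted(bool has_pw_erratum, const char *cmdline)
{
	if (!has_pw_erratum)
		return true;
	return cmdline_has_opt_in(cmdline);
}
```

With this shape the default stays conservative on affected platforms, and
the user who accepts the risk of a #MC takes that decision explicitly.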


Juergen

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 4/7] x86/kexec: Disable kexec/kdump on platforms with TDX partial write erratum
  2025-10-02 16:09                   ` Vishal Annapurve
@ 2025-10-18 15:54                     ` Vishal Annapurve
  2025-10-21 17:08                       ` Dave Hansen
  0 siblings, 1 reply; 38+ messages in thread
From: Vishal Annapurve @ 2025-10-18 15:54 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Juergen Gross, Reshetova, Elena, Paolo Bonzini,
	linux-kernel@vger.kernel.org, kvm@vger.kernel.org, bp@alien8.de,
	tglx@linutronix.de, peterz@infradead.org, mingo@redhat.com,
	hpa@zytor.com, thomas.lendacky@amd.com, x86@kernel.org,
	kas@kernel.org, Edgecombe, Rick P, dwmw@amazon.co.uk, Huang, Kai,
	seanjc@google.com, Chatre, Reinette, Yamahata, Isaku,
	Williams, Dan J, ashish.kalra@amd.com, nik.borisov@suse.com,
	Gao, Chao, sagis@google.com, Chen, Farrah, Binbin Wu

On Thu, Oct 2, 2025 at 9:09 AM Vishal Annapurve <vannapurve@google.com> wrote:
>
> On Thu, Oct 2, 2025 at 8:06 AM Dave Hansen <dave.hansen@intel.com> wrote:
> >
> > On 10/2/25 00:46, Juergen Gross wrote:
> > > So let's compare the 2 cases with kdump enabled and disabled in your
> > > scenario (crash of the host OS):
> > >
> > > kdump enabled: No dump can be produced due to the #MC and system is
> > > rebooted.
> > >
> > > kdump disabled: No dump is produced and system is rebooted after crash.
> > >
> > > What is the main concern with kdump enabled? I don't see any
> > > disadvantage with enabling it, just the advantage that in many cases
> > > a dump will be written.
> >
> > The disadvantage is that a kernel bug from long ago results in a machine
> > check. Machine checks are generally indicative of bad hardware. So the
> > disadvantage is that someone mistakes the long ago kernel bug for bad
> > hardware.
> >
> > There are two ways of looking at this:
> >
> > 1. A theoretically fragile kdump is better than no kdump at all. All of
> >    the stars would have to align for kdump to _fail_ and we don't think
> >    that's going to happen often enough to matter.
> > 2. kdump happens after kernel bugs. The machine checks happen because of
> >    kernel bugs. It's not a big stretch to think that, at scale, kdump is
> >    going to run into these #MCs on a regular basis.
>
> Looking at Elena's response, I would say it's still *a* big stretch
> for kdump to run into these #MCs on a regular basis as following
> sequence is needed for problematic scenario:
> 1) Host OS bug should corrupt TDX private memory with a *partial
> write*, that is part of kernel memory.
>     -> i.e. PAMT tables, SEPT tables, TD VCPU/VM metadata etc.
>     -> IIUC corruption of guest memory is not a concern as that
> belongs to userspace.
> 2) TDX Module/TD shouldn't consume that poisoned memory.
>     -> i.e. no walk of the metadata memory.
> 3) Host kernel needs to generate a bug that causes an orthogonal panic.
>
> *partial writes* IIUC need special instructions.

Circling back on this topic, I would like to reiterate a few points:
1) Google has been running workloads with the series [1] for ~2 years
now, and we haven't seen any issues with kdump functionality across kernel
bugs, real hardware issues, private memory corruption etc.
2) IMO rather than disabling kdump because of host kernel bugs
potentially corrupting private memory, it would be much more useful to
employ mechanisms like direct map removal to ensure host bugs leading
to private memory corruption are caught much earlier. Disabling kdump
doesn't help the problem here and just makes it worse for a vast
majority of other scenarios. On the other hand, enabling kdump doesn't
make the problem worse than it is.
   - Host IOMMU mappings should also ideally be restricted to the
regions that don't overlap with private memory regions.
3) With DPAMT support [2], the possibility of the host corrupting
private memory will reduce for the hosts not running confidential VMs
at all.

[1] https://lore.kernel.org/lkml/cover.1727179214.git.kai.huang@intel.com/
[2] https://lore.kernel.org/kvm/20250918232224.2202592-1-rick.p.edgecombe@intel.com/

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 4/7] x86/kexec: Disable kexec/kdump on platforms with TDX partial write erratum
  2025-10-18 15:54                     ` Vishal Annapurve
@ 2025-10-21 17:08                       ` Dave Hansen
  2025-10-22  2:50                         ` Vishal Annapurve
  0 siblings, 1 reply; 38+ messages in thread
From: Dave Hansen @ 2025-10-21 17:08 UTC (permalink / raw)
  To: Vishal Annapurve
  Cc: Juergen Gross, Reshetova, Elena, Paolo Bonzini,
	linux-kernel@vger.kernel.org, kvm@vger.kernel.org, bp@alien8.de,
	tglx@linutronix.de, peterz@infradead.org, mingo@redhat.com,
	hpa@zytor.com, thomas.lendacky@amd.com, x86@kernel.org,
	kas@kernel.org, Edgecombe, Rick P, dwmw@amazon.co.uk, Huang, Kai,
	seanjc@google.com, Chatre, Reinette, Yamahata, Isaku,
	Williams, Dan J, ashish.kalra@amd.com, nik.borisov@suse.com,
	Gao, Chao, sagis@google.com, Chen, Farrah, Binbin Wu

On 10/18/25 08:54, Vishal Annapurve wrote:
> Circling back on this topic, I would like to reiterate a few points:
> 1) Google has been running workloads with the series [1] for ~2 years
> now, we haven't seen any issues with kdump functionality across kernel
> bugs, real hardware issues, private memory corruption etc.

Great points and great info!

As a next step, I'd expect someone (at Google) to take this into
consideration and put together a series to have the kernel comprehend
those points.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 4/7] x86/kexec: Disable kexec/kdump on platforms with TDX partial write erratum
  2025-10-21 17:08                       ` Dave Hansen
@ 2025-10-22  2:50                         ` Vishal Annapurve
  2025-10-22 21:05                           ` Huang, Kai
  0 siblings, 1 reply; 38+ messages in thread
From: Vishal Annapurve @ 2025-10-22  2:50 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Juergen Gross, Reshetova, Elena, Paolo Bonzini,
	linux-kernel@vger.kernel.org, kvm@vger.kernel.org, bp@alien8.de,
	tglx@linutronix.de, peterz@infradead.org, mingo@redhat.com,
	hpa@zytor.com, thomas.lendacky@amd.com, x86@kernel.org,
	kas@kernel.org, Edgecombe, Rick P, dwmw@amazon.co.uk, Huang, Kai,
	seanjc@google.com, Chatre, Reinette, Yamahata, Isaku,
	Williams, Dan J, ashish.kalra@amd.com, nik.borisov@suse.com,
	Gao, Chao, sagis@google.com, Chen, Farrah, Binbin Wu

On Tue, Oct 21, 2025 at 10:08 AM Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 10/18/25 08:54, Vishal Annapurve wrote:
> > Circling back on this topic, I would like to reiterate a few points:
> > 1) Google has been running workloads with the series [1] for ~2 years
> > now, we haven't seen any issues with kdump functionality across kernel
> > bugs, real hardware issues, private memory corruption etc.
>
> Great points and great info!
>
> As a next step, I'd expect someone (at Google) to take this into
> consideration and put together a series to have the kernel comprehend
> those points.

Then is it safe to say that Intel doesn't consider:
* Adding the support to just reset PAMT memory [1] to this series and
* Modifying the logic in this patch [2] to enable kdump and keep kexec
support disabled in this series

as a viable direction upstream for now until a better solution comes along?

If not, can kdump be made optional as Juergen suggested?

[1] https://lore.kernel.org/lkml/6960ef6d7ee9398d164bf3997e6009df3e88cb67.1727179214.git.kai.huang@intel.com/
[2] https://lore.kernel.org/all/20250901160930.1785244-5-pbonzini@redhat.com/

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 4/7] x86/kexec: Disable kexec/kdump on platforms with TDX partial write erratum
  2025-10-22  2:50                         ` Vishal Annapurve
@ 2025-10-22 21:05                           ` Huang, Kai
  2025-10-23 16:54                             ` Vishal Annapurve
  0 siblings, 1 reply; 38+ messages in thread
From: Huang, Kai @ 2025-10-22 21:05 UTC (permalink / raw)
  To: Annapurve, Vishal, Hansen, Dave
  Cc: kvm@vger.kernel.org, binbin.wu@linux.intel.com, Gao, Chao,
	ashish.kalra@amd.com, thomas.lendacky@amd.com, Reshetova, Elena,
	linux-kernel@vger.kernel.org, kas@kernel.org, dwmw@amazon.co.uk,
	pbonzini@redhat.com, mingo@redhat.com, tglx@linutronix.de,
	Yamahata, Isaku, seanjc@google.com, nik.borisov@suse.com,
	hpa@zytor.com, peterz@infradead.org, sagis@google.com,
	Chen, Farrah, Edgecombe, Rick P, bp@alien8.de, Chatre, Reinette,
	jgross@suse.com, x86@kernel.org, Williams, Dan J

On Tue, 2025-10-21 at 19:50 -0700, Vishal Annapurve wrote:
> On Tue, Oct 21, 2025 at 10:08 AM Dave Hansen <dave.hansen@intel.com> wrote:
> > 
> > On 10/18/25 08:54, Vishal Annapurve wrote:
> > > Circling back on this topic, I would like to reiterate a few points:
> > > 1) Google has been running workloads with the series [1] for ~2 years
> > > now, we haven't seen any issues with kdump functionality across kernel
> > > bugs, real hardware issues, private memory corruption etc.
> > 
> > Great points and great info!
> > 
> > As a next step, I'd expect someone (at Google) to take this into
> > consideration and put together a series to have the kernel comprehend
> > those points.
> 
> Then is it safe to say that Intel doesn't consider:
> * Adding the support to just reset PAMT memory [1] to this series and

You need to reset all TDX private memory including TDX guest private
memory, S-EPT pages etc, and PAMT.  Resetting PAMT alone won't be enough,
and is pointless.

When [1] was posted, KVM TDX hadn't landed yet, so the only type of TDX
private memory was PAMT, but there's also a big comment there to point out
the in-kernel users should be responsible for resetting any TDX private
memory that they manage:

+	/*
+	 * It's ideal to cover all types of TDX private pages here, but
+	 * currently there's no unified way to tell whether a given page
+	 * is TDX private page or not.
+	 *
+	 * Only convert PAMT here.  All in-kernel TDX users (e.g., KVM)
+	 * are responsible for converting TDX private pages that are
+	 * managed by them by either registering reboot notifier or
+	 * shutdown syscore ops.
+	 */
+	tdmrs_reset_pamt_all(&tdx_tdmr_list);

> * Modifying the logic in this patch [2] to enable kdump and keep kexec
> support disabled in this series

Resetting TDX private memory is a complete solution which allows enabling both
kdump and kexec.  If we choose to reset TDX private memory, then we can
just revert [2].

> 
> as a viable direction upstream for now until a better solution comes along?

The alternative could be to simply modify [2] to allow kdump (but leave
TDX private memory untouched to the new kernel) but not normal kexec.  The
risk of doing so has already been covered in this thread AFAICT:

 1) If the kdump kernel does partial write to vmcore, the kdump kernel may
    see unexpected #MCE.
 2) As Elena pointed out, if the old kernel has a bug and somehow already
    does partial write to TDX private memory (which leads to poison), the
    consumption of such poison may be deferred to the kdump kernel.

> 
> If not, can kdump be made optional as Juergen suggested?

IIUC Juergen suggested:

  Then we could add a kernel boot parameter to let the user opt-in
  for kexec being possible in spite of the potential #MC.

I don't have an opinion on this, other than that I think the boot parameter
only makes sense if we do the "alternative" mentioned above, i.e., not
resetting TDX private memory.

> 
> [1] https://lore.kernel.org/lkml/6960ef6d7ee9398d164bf3997e6009df3e88cb67.1727179214.git.kai.huang@intel.com/
> [2] https://lore.kernel.org/all/20250901160930.1785244-5-pbonzini@redhat.com/
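
The three directions discussed above (keep [2] as is, reset all TDX private
memory and revert [2], or allow kdump only) can be compared with a small
user-space model.  This is only a sketch, not kernel code: the enum names and
image_load_allowed() are hypothetical and do not exist in the tree.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical image types; illustrative names, not kernel identifiers. */
enum image_type { IMAGE_KEXEC, IMAGE_KDUMP };

/* Hypothetical labels for the three directions discussed in this thread. */
enum policy {
	POLICY_REJECT_ALL,        /* patch [2] as posted */
	POLICY_RESET_PRIVATE_MEM, /* reset all TDX private memory, revert [2] */
	POLICY_KDUMP_ONLY,        /* leave private memory untouched */
};

/* Returns 0 if loading the image is allowed, -EOPNOTSUPP otherwise. */
static inline int image_load_allowed(bool has_erratum, enum policy policy,
				     enum image_type type)
{
	/* Platforms without the partial write erratum are never restricted. */
	if (!has_erratum)
		return 0;

	switch (policy) {
	case POLICY_RESET_PRIVATE_MEM:
		/* Private memory is reset first, so both paths are allowed. */
		return 0;
	case POLICY_KDUMP_ONLY:
		/* Best effort: the kdump kernel may still consume poison. */
		return type == IMAGE_KDUMP ? 0 : -EOPNOTSUPP;
	case POLICY_REJECT_ALL:
	default:
		return -EOPNOTSUPP;
	}
}
```

Under POLICY_KDUMP_ONLY the model only says the load is accepted; the two
risks listed above for the kdump kernel still apply.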

* Re: [PATCH 4/7] x86/kexec: Disable kexec/kdump on platforms with TDX partial write erratum
  2025-10-22 21:05                           ` Huang, Kai
@ 2025-10-23 16:54                             ` Vishal Annapurve
  0 siblings, 0 replies; 38+ messages in thread
From: Vishal Annapurve @ 2025-10-23 16:54 UTC (permalink / raw)
  To: Huang, Kai
  Cc: Hansen, Dave, kvm@vger.kernel.org, binbin.wu@linux.intel.com,
	Gao, Chao, ashish.kalra@amd.com, thomas.lendacky@amd.com,
	Reshetova, Elena, linux-kernel@vger.kernel.org, kas@kernel.org,
	dwmw@amazon.co.uk, pbonzini@redhat.com, mingo@redhat.com,
	tglx@linutronix.de, Yamahata, Isaku, seanjc@google.com,
	nik.borisov@suse.com, hpa@zytor.com, peterz@infradead.org,
	sagis@google.com, Chen, Farrah, Edgecombe, Rick P, bp@alien8.de,
	Chatre, Reinette, jgross@suse.com, x86@kernel.org,
	Williams, Dan J

On Wed, Oct 22, 2025 at 2:05 PM Huang, Kai <kai.huang@intel.com> wrote:
>
>
> > * Modifying the logic in this patch [2] to enable kdump and keep kexec
> > support disabled in this series
>
> Resetting TDX private memory is a complete solution which allows enabling both
> kdump and kexec.  If we choose to reset TDX private memory, then we can
> just revert [2].
>
> >
> > as a viable direction upstream for now until a better solution comes along?
>
> The alternative could be to simply modify [2] to allow kdump (but leave
> TDX private memory untouched to the new kernel) but not normal kexec.  The
> risk of doing so has already been covered in this thread AFAICT:
>
>  1) If the kdump kernel does partial write to vmcore, the kdump kernel may
>     see unexpected #MCE.

Ideally a kdump kernel should not write to vmcore.

>  2) As Elena pointed out, if the old kernel has a bug and somehow already
>     does partial write to TDX private memory (which leads to poison), the
>     consumption of such poison may be deferred to the kdump kernel.

Is this case very different from hardware memory failures leading to
poisoned memory ranges? i.e. kdump solution has an existing scenario
of possible poison consumption during generation of kdump.

Is it okay to advertise kdump functionality as best effort and
live with this caveat until a cleaner solution comes along?

>
> >
> > If not, can kdump be made optional as Juergen suggested?
>
> IIUC Juergen suggested:
>
>   Then we could add a kernel boot parameter to let the user opt-in
>   for kexec being possible in spite of the potential #MC.
>
> I don't have an opinion on this, other than that I think the boot parameter
> only makes sense if we do the "alternative" mentioned above, i.e., not
> resetting TDX private memory.
>
> >
> > [1] https://lore.kernel.org/lkml/6960ef6d7ee9398d164bf3997e6009df3e88cb67.1727179214.git.kai.huang@intel.com/
> > [2] https://lore.kernel.org/all/20250901160930.1785244-5-pbonzini@redhat.com/

* Re: [PATCH 4/7] x86/kexec: Disable kexec/kdump on platforms with TDX partial write erratum
  2025-09-01 16:09 ` [PATCH 4/7] x86/kexec: Disable kexec/kdump on platforms with TDX partial write erratum Paolo Bonzini
  2025-09-30  1:38   ` Vishal Annapurve
@ 2025-10-26 23:33   ` Vishal Annapurve
  2025-10-27  0:50     ` Huang, Kai
  1 sibling, 1 reply; 38+ messages in thread
From: Vishal Annapurve @ 2025-10-26 23:33 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: linux-kernel, kvm, dave.hansen, bp, tglx, peterz, mingo, hpa,
	thomas.lendacky, x86, kas, rick.p.edgecombe, dwmw, kai.huang,
	seanjc, reinette.chatre, isaku.yamahata, dan.j.williams,
	ashish.kalra, nik.borisov, chao.gao, sagis, farrah.chen,
	Binbin Wu

On Mon, Sep 1, 2025 at 9:11 AM Paolo Bonzini <pbonzini@redhat.com> wrote:
>
> From: Kai Huang <kai.huang@intel.com>
>
> Some early TDX-capable platforms have an erratum: A kernel partial
> write (a write transaction of less than cacheline lands at memory
> controller) to TDX private memory poisons that memory, and a subsequent
> read triggers a machine check.
>
> On those platforms, the old kernel must reset TDX private memory before
> jumping to the new kernel, otherwise the new kernel may see unexpected
> machine check.  Currently the kernel doesn't track which page is a TDX
> private page.  For simplicity just fail kexec/kdump for those platforms.
>
> Leverage the existing machine_kexec_prepare() to fail kexec/kdump by
> adding the check of the presence of the TDX erratum (which is only
> checked for if the kernel is built with TDX host support).  This rejects
> kexec/kdump when the kernel is loading the kexec/kdump kernel image.
>
> The alternative is to reject kexec/kdump when the kernel is jumping to
> the new kernel.  But for kexec this requires adding a new check (e.g.,
> arch_kexec_allowed()) in the common code to fail kernel_kexec() at early
> stage.  Kdump (crash_kexec()) needs similar check, but it's hard to
> justify because crash_kexec() is not supposed to abort.
>
> It's feasible to further relax this limitation, i.e., only fail kexec
> when TDX is actually enabled by the kernel.  But this is still a half
> measure compared to resetting TDX private memory so just do the simplest
> thing for now.

Hi Kai,

IIUC, kernel doesn't donate any of its available memory to TDX module
if TDX is not actually enabled (i.e. if "kvm.intel.tdx=y" kernel
command line parameter is missing).

Why is it unsafe to allow kexec/kdump if "kvm.intel.tdx=y" is not
supplied to the kernel?

>
> The impact to userspace is the users will get an error when loading the
> kexec/kdump kernel image:
>
>   kexec_load failed: Operation not supported
>
> This might be confusing to the users, thus also print the reason in the
> dmesg:
>

* Re: [PATCH 4/7] x86/kexec: Disable kexec/kdump on platforms with TDX partial write erratum
  2025-10-26 23:33   ` Vishal Annapurve
@ 2025-10-27  0:50     ` Huang, Kai
  2025-10-27 16:23       ` Edgecombe, Rick P
  0 siblings, 1 reply; 38+ messages in thread
From: Huang, Kai @ 2025-10-27  0:50 UTC (permalink / raw)
  To: pbonzini@redhat.com, Annapurve, Vishal
  Cc: kvm@vger.kernel.org, tglx@linutronix.de, Hansen, Dave,
	thomas.lendacky@amd.com, ashish.kalra@amd.com,
	linux-kernel@vger.kernel.org, Chatre, Reinette, dwmw@amazon.co.uk,
	kas@kernel.org, mingo@redhat.com, Yamahata, Isaku,
	nik.borisov@suse.com, seanjc@google.com, hpa@zytor.com,
	peterz@infradead.org, sagis@google.com, Chen, Farrah,
	Edgecombe, Rick P, bp@alien8.de, binbin.wu@linux.intel.com,
	Gao, Chao, x86@kernel.org, Williams, Dan J

On Sun, 2025-10-26 at 16:33 -0700, Vishal Annapurve wrote:
> > It's feasible to further relax this limitation, i.e., only fail kexec
> > when TDX is actually enabled by the kernel.  But this is still a half
> > measure compared to resetting TDX private memory so just do the simplest
> > thing for now.
> 
> Hi Kai,

Hi Vishal,

> 
> IIUC, kernel doesn't donate any of its available memory to TDX module
> if TDX is not actually enabled (i.e. if "kvm.intel.tdx=y" kernel
> command line parameter is missing).

Right (for now KVM is the only in-kernel TDX user).

> 
> Why is it unsafe to allow kexec/kdump if "kvm.intel.tdx=y" is not
> supplied to the kernel?

It can be relaxed.  Please see the above quoted text from the changelog:

 > It's feasible to further relax this limitation, i.e., only fail kexec
 > when TDX is actually enabled by the kernel.  But this is still a half
 > measure compared to resetting TDX private memory so just do the simplest
 > thing for now.

* Re: [PATCH 4/7] x86/kexec: Disable kexec/kdump on platforms with TDX partial write erratum
  2025-10-27  0:50     ` Huang, Kai
@ 2025-10-27 16:23       ` Edgecombe, Rick P
  2025-10-27 21:28         ` Huang, Kai
  0 siblings, 1 reply; 38+ messages in thread
From: Edgecombe, Rick P @ 2025-10-27 16:23 UTC (permalink / raw)
  To: pbonzini@redhat.com, Annapurve, Vishal, Huang, Kai
  Cc: kvm@vger.kernel.org, ashish.kalra@amd.com, Hansen, Dave,
	thomas.lendacky@amd.com, kas@kernel.org, seanjc@google.com,
	dwmw@amazon.co.uk, linux-kernel@vger.kernel.org, mingo@redhat.com,
	Yamahata, Isaku, nik.borisov@suse.com, Chatre, Reinette,
	hpa@zytor.com, peterz@infradead.org, sagis@google.com,
	Chen, Farrah, tglx@linutronix.de, bp@alien8.de,
	binbin.wu@linux.intel.com, Gao, Chao, x86@kernel.org,
	Williams, Dan J

On Mon, 2025-10-27 at 00:50 +0000, Huang, Kai wrote:
> > 
> > IIUC, kernel doesn't donate any of its available memory to TDX module
> > if TDX is not actually enabled (i.e. if "kvm.intel.tdx=y" kernel
> > command line parameter is missing).
> 
> Right (for now KVM is the only in-kernel TDX user).
> 
> > 
> > Why is it unsafe to allow kexec/kdump if "kvm.intel.tdx=y" is not
> > supplied to the kernel?
> 
> It can be relaxed.  Please see the above quoted text from the changelog:
> 
>  > It's feasible to further relax this limitation, i.e., only fail kexec
>  > when TDX is actually enabled by the kernel.  But this is still a half
>  > measure compared to resetting TDX private memory so just do the simplest
>  > thing for now.

I think KVM could be re-inserted with different module params? As in, the two
in-tree users could be two separate insertions of the KVM module. That seems
like something that could easily come up in the real world, if a user re-inserts
for the purpose of enabling TDX. I think the above quote was talking about
another way of checking if it's enabled.

* Re: [PATCH 4/7] x86/kexec: Disable kexec/kdump on platforms with TDX partial write erratum
  2025-10-27 16:23       ` Edgecombe, Rick P
@ 2025-10-27 21:28         ` Huang, Kai
  2025-10-28  0:07           ` Vishal Annapurve
  0 siblings, 1 reply; 38+ messages in thread
From: Huang, Kai @ 2025-10-27 21:28 UTC (permalink / raw)
  To: pbonzini@redhat.com, Annapurve, Vishal, Edgecombe, Rick P
  Cc: kvm@vger.kernel.org, ashish.kalra@amd.com, Hansen, Dave,
	thomas.lendacky@amd.com, kas@kernel.org, seanjc@google.com,
	dwmw@amazon.co.uk, linux-kernel@vger.kernel.org, mingo@redhat.com,
	Yamahata, Isaku, nik.borisov@suse.com, Chatre, Reinette,
	hpa@zytor.com, peterz@infradead.org, sagis@google.com,
	Chen, Farrah, tglx@linutronix.de, bp@alien8.de,
	binbin.wu@linux.intel.com, Gao, Chao, x86@kernel.org,
	Williams, Dan J

On Mon, 2025-10-27 at 16:23 +0000, Edgecombe, Rick P wrote:
> On Mon, 2025-10-27 at 00:50 +0000, Huang, Kai wrote:
> > > 
> > > IIUC, kernel doesn't donate any of its available memory to TDX module
> > > if TDX is not actually enabled (i.e. if "kvm.intel.tdx=y" kernel
> > > command line parameter is missing).
> > 
> > Right (for now KVM is the only in-kernel TDX user).
> > 
> > > 
> > > Why is it unsafe to allow kexec/kdump if "kvm.intel.tdx=y" is not
> > > supplied to the kernel?
> > 
> > It can be relaxed.  Please see the above quoted text from the changelog:
> > 
> >  > It's feasible to further relax this limitation, i.e., only fail kexec
> >  > when TDX is actually enabled by the kernel.  But this is still a half
> >  > measure compared to resetting TDX private memory so just do the simplest
> >  > thing for now.
> 
> I think KVM could be re-inserted with different module params? As in, the two
> in-tree users could be two separate insertions of the KVM module. That seems
> like something that could easily come up in the real world, if a user re-inserts
> for the purpose of enabling TDX. I think the above quote was talking about
> another way of checking if it's enabled.

Yes exactly.  We need to look at module status for that.
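
A rough user-space model of why a boot-time parameter snapshot is not enough
and the live module status is the thing to check.  This is a sketch under
assumptions: struct host, kvm_module_insert() and both check functions are
hypothetical names made up for illustration, not existing kernel interfaces.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical module states; illustrative names only. */
enum tdx_state { TDX_STATE_UNINITIALIZED, TDX_STATE_INITIALIZED };

struct host {
	bool pw_mce_erratum;    /* platform has the partial write erratum */
	bool tdx_param_at_boot; /* parameter value as seen at boot time */
	enum tdx_state state;   /* live TDX module state */
};

/* Models (re-)inserting KVM; enabling TDX initializes the TDX module. */
static inline void kvm_module_insert(struct host *h, bool enable_tdx)
{
	if (enable_tdx)
		h->state = TDX_STATE_INITIALIZED;
}

/* Check based only on the parameter seen at boot; misses a re-insert. */
static inline int check_by_boot_param(const struct host *h)
{
	return (h->pw_mce_erratum && h->tdx_param_at_boot) ? -EOPNOTSUPP : 0;
}

/* Check based on the live module state at the time the image is loaded. */
static inline int check_by_module_state(const struct host *h)
{
	return (h->pw_mce_erratum && h->state == TDX_STATE_INITIALIZED) ?
		-EOPNOTSUPP : 0;
}
```

In this model, after kvm_module_insert(&h, true) on a host that booted
without the parameter, check_by_boot_param() still allows the load while
check_by_module_state() rejects it.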

* Re: [PATCH 4/7] x86/kexec: Disable kexec/kdump on platforms with TDX partial write erratum
  2025-10-27 21:28         ` Huang, Kai
@ 2025-10-28  0:07           ` Vishal Annapurve
  2025-10-28  9:31             ` Huang, Kai
  0 siblings, 1 reply; 38+ messages in thread
From: Vishal Annapurve @ 2025-10-28  0:07 UTC (permalink / raw)
  To: Huang, Kai
  Cc: pbonzini@redhat.com, Edgecombe, Rick P, kvm@vger.kernel.org,
	ashish.kalra@amd.com, Hansen, Dave, thomas.lendacky@amd.com,
	kas@kernel.org, seanjc@google.com, dwmw@amazon.co.uk,
	linux-kernel@vger.kernel.org, mingo@redhat.com, Yamahata, Isaku,
	nik.borisov@suse.com, Chatre, Reinette, hpa@zytor.com,
	peterz@infradead.org, sagis@google.com, Chen, Farrah,
	tglx@linutronix.de, bp@alien8.de, binbin.wu@linux.intel.com,
	Gao, Chao, x86@kernel.org, Williams, Dan J

On Mon, Oct 27, 2025 at 2:28 PM Huang, Kai <kai.huang@intel.com> wrote:
>
> On Mon, 2025-10-27 at 16:23 +0000, Edgecombe, Rick P wrote:
> > On Mon, 2025-10-27 at 00:50 +0000, Huang, Kai wrote:
> > > >
> > > > IIUC, kernel doesn't donate any of its available memory to TDX module
> > > > if TDX is not actually enabled (i.e. if "kvm.intel.tdx=y" kernel
> > > > command line parameter is missing).
> > >
> > > Right (for now KVM is the only in-kernel TDX user).
> > >
> > > >
> > > > Why is it unsafe to allow kexec/kdump if "kvm.intel.tdx=y" is not
> > > > supplied to the kernel?
> > >
> > > It can be relaxed.  Please see the above quoted text from the changelog:
> > >
> > >  > It's feasible to further relax this limitation, i.e., only fail kexec
> > >  > when TDX is actually enabled by the kernel.  But this is still a half
> > >  > measure compared to resetting TDX private memory so just do the simplest
> > >  > thing for now.
> >
> > I think KVM could be re-inserted with different module params? As in, the two
> > in-tree users could be two separate insertions of the KVM module. That seems
> > like something that could easily come up in the real world, if a user re-inserts
> > for the purpose of enabling TDX. I think the above quote was talking about
> > another way of checking if it's enabled.
>
> Yes exactly.  We need to look at module status for that.

So, the right thing to do is to declare the host platform as affected
by PW_MCE_BUG only if TDX module is initialized, does that sound
correct?

* Re: [PATCH 4/7] x86/kexec: Disable kexec/kdump on platforms with TDX partial write erratum
  2025-10-28  0:07           ` Vishal Annapurve
@ 2025-10-28  9:31             ` Huang, Kai
  2025-11-03 16:44               ` Vishal Annapurve
  0 siblings, 1 reply; 38+ messages in thread
From: Huang, Kai @ 2025-10-28  9:31 UTC (permalink / raw)
  To: Annapurve, Vishal
  Cc: kvm@vger.kernel.org, x86@kernel.org, ashish.kalra@amd.com,
	Hansen, Dave, thomas.lendacky@amd.com, Williams, Dan J,
	kas@kernel.org, seanjc@google.com, dwmw@amazon.co.uk,
	pbonzini@redhat.com, mingo@redhat.com, Yamahata, Isaku,
	nik.borisov@suse.com, Chatre, Reinette, hpa@zytor.com,
	peterz@infradead.org, sagis@google.com, Chen, Farrah,
	Edgecombe, Rick P, linux-kernel@vger.kernel.org,
	binbin.wu@linux.intel.com, bp@alien8.de, tglx@linutronix.de,
	Gao, Chao

On Mon, 2025-10-27 at 17:07 -0700, Vishal Annapurve wrote:
> On Mon, Oct 27, 2025 at 2:28 PM Huang, Kai <kai.huang@intel.com> wrote:
> > 
> > On Mon, 2025-10-27 at 16:23 +0000, Edgecombe, Rick P wrote:
> > > On Mon, 2025-10-27 at 00:50 +0000, Huang, Kai wrote:
> > > > > 
> > > > > IIUC, kernel doesn't donate any of its available memory to TDX module
> > > > > if TDX is not actually enabled (i.e. if "kvm.intel.tdx=y" kernel
> > > > > command line parameter is missing).
> > > > 
> > > > Right (for now KVM is the only in-kernel TDX user).
> > > > 
> > > > > 
> > > > > Why is it unsafe to allow kexec/kdump if "kvm.intel.tdx=y" is not
> > > > > supplied to the kernel?
> > > > 
> > > > It can be relaxed.  Please see the above quoted text from the changelog:
> > > > 
> > > >  > It's feasible to further relax this limitation, i.e., only fail kexec
> > > >  > when TDX is actually enabled by the kernel.  But this is still a half
> > > >  > measure compared to resetting TDX private memory so just do the simplest
> > > >  > thing for now.
> > > 
> > > I think KVM could be re-inserted with different module params? As in, the two
> > > in-tree users could be two separate insertions of the KVM module. That seems
> > > like something that could easily come up in the real world, if a user re-inserts
> > > for the purpose of enabling TDX. I think the above quote was talking about
> > > another way of checking if it's enabled.
> > 
> > Yes exactly.  We need to look at module status for that.
> 
> So, the right thing to do is to declare the host platform as affected
> by PW_MCE_BUG only if TDX module is initialized, does that sound
> correct?

I was thinking something like this:

https://lore.kernel.org/lkml/20250416230259.97989-1-kai.huang@intel.com/

* Re: [PATCH 4/7] x86/kexec: Disable kexec/kdump on platforms with TDX partial write erratum
  2025-10-28  9:31             ` Huang, Kai
@ 2025-11-03 16:44               ` Vishal Annapurve
  0 siblings, 0 replies; 38+ messages in thread
From: Vishal Annapurve @ 2025-11-03 16:44 UTC (permalink / raw)
  To: Huang, Kai
  Cc: kvm@vger.kernel.org, x86@kernel.org, ashish.kalra@amd.com,
	Hansen, Dave, thomas.lendacky@amd.com, Williams, Dan J,
	kas@kernel.org, seanjc@google.com, dwmw@amazon.co.uk,
	pbonzini@redhat.com, mingo@redhat.com, Yamahata, Isaku,
	nik.borisov@suse.com, Chatre, Reinette, hpa@zytor.com,
	peterz@infradead.org, sagis@google.com, Chen, Farrah,
	Edgecombe, Rick P, linux-kernel@vger.kernel.org,
	binbin.wu@linux.intel.com, bp@alien8.de, tglx@linutronix.de,
	Gao, Chao

On Tue, Oct 28, 2025 at 2:31 AM Huang, Kai <kai.huang@intel.com> wrote:
>
> On Mon, 2025-10-27 at 17:07 -0700, Vishal Annapurve wrote:
> > On Mon, Oct 27, 2025 at 2:28 PM Huang, Kai <kai.huang@intel.com> wrote:
> > >
> > > On Mon, 2025-10-27 at 16:23 +0000, Edgecombe, Rick P wrote:
> > > > On Mon, 2025-10-27 at 00:50 +0000, Huang, Kai wrote:
> > > > > >
> > > > > > IIUC, kernel doesn't donate any of its available memory to TDX module
> > > > > > if TDX is not actually enabled (i.e. if "kvm.intel.tdx=y" kernel
> > > > > > command line parameter is missing).
> > > > >
> > > > > Right (for now KVM is the only in-kernel TDX user).
> > > > >
> > > > > >
> > > > > > Why is it unsafe to allow kexec/kdump if "kvm.intel.tdx=y" is not
> > > > > > supplied to the kernel?
> > > > >
> > > > > It can be relaxed.  Please see the above quoted text from the changelog:
> > > > >
> > > > >  > It's feasible to further relax this limitation, i.e., only fail kexec
> > > > >  > when TDX is actually enabled by the kernel.  But this is still a half
> > > > >  > measure compared to resetting TDX private memory so just do the simplest
> > > > >  > thing for now.
> > > >
> > > > I think KVM could be re-inserted with different module params? As in, the two
> > > > in-tree users could be two separate insertions of the KVM module. That seems
> > > > like something that could easily come up in the real world, if a user re-inserts
> > > > for the purpose of enabling TDX. I think the above quote was talking about
> > > > another way of checking if it's enabled.
> > >
> > > Yes exactly.  We need to look at module status for that.
> >
> > So, the right thing to do is to declare the host platform as affected
> > by PW_MCE_BUG only if TDX module is initialized, does that sound
> > correct?
>
> I was thinking something like this:
>
> https://lore.kernel.org/lkml/20250416230259.97989-1-kai.huang@intel.com/

This seems to be an important thing to make progress on. IMO,
disabling kexec/kdump even if the host doesn't plan to use TDX
functionality but wants to keep the build config enabled is a
regression.

I think explicitly doing TDX module initialization[1] ideally needs
something like the above series from Kai and possibly resetting the
PAMT memory during kexec/kdump at least on SPR/EMR CPUs. Otherwise
it's effectively impossible to enable CONFIG_INTEL_TDX_HOST and have
kexec/kdump working on the host even if no confidential workloads are
scheduled on such SPR/EMR hosts.

[1] https://lore.kernel.org/kvm/20251010220403.987927-4-seanjc@google.com/

Thread overview: 38+ messages
2025-09-01 16:09 [PATCH v8 0/7] TDX host: kexec/kdump support Paolo Bonzini
2025-09-01 16:09 ` [PATCH 1/7] x86/kexec: Consolidate relocate_kernel() function parameters Paolo Bonzini
2025-09-01 16:09 ` [PATCH 2/7] x86/sme: Use percpu boolean to control WBINVD during kexec Paolo Bonzini
2025-09-01 16:09 ` [PATCH 3/7] x86/virt/tdx: Mark memory cache state incoherent when making SEAMCALL Paolo Bonzini
2025-09-01 16:09 ` [PATCH 4/7] x86/kexec: Disable kexec/kdump on platforms with TDX partial write erratum Paolo Bonzini
2025-09-30  1:38   ` Vishal Annapurve
2025-09-30 21:32     ` Dave Hansen
2025-10-01  2:05       ` Vishal Annapurve
2025-10-01 14:32         ` Dave Hansen
2025-10-01 17:17           ` Vishal Annapurve
2025-10-01 18:00             ` Dave Hansen
2025-10-01 21:19               ` Huang, Kai
2025-10-02  6:59             ` Reshetova, Elena
2025-10-02  7:46               ` Juergen Gross
2025-10-02  8:10                 ` Reshetova, Elena
2025-10-02 15:06                 ` Dave Hansen
2025-10-02 16:09                   ` Vishal Annapurve
2025-10-18 15:54                     ` Vishal Annapurve
2025-10-21 17:08                       ` Dave Hansen
2025-10-22  2:50                         ` Vishal Annapurve
2025-10-22 21:05                           ` Huang, Kai
2025-10-23 16:54                             ` Vishal Annapurve
2025-10-07 13:31                   ` Jürgen Groß
2025-10-08 15:40                     ` Dave Hansen
2025-10-08 18:13                       ` Jürgen Groß
2025-10-26 23:33   ` Vishal Annapurve
2025-10-27  0:50     ` Huang, Kai
2025-10-27 16:23       ` Edgecombe, Rick P
2025-10-27 21:28         ` Huang, Kai
2025-10-28  0:07           ` Vishal Annapurve
2025-10-28  9:31             ` Huang, Kai
2025-11-03 16:44               ` Vishal Annapurve
2025-09-01 16:09 ` [PATCH 5/7] x86/virt/tdx: Remove the !KEXEC_CORE dependency Paolo Bonzini
2025-09-01 16:09 ` [PATCH 6/7] x86/virt/tdx: Update the kexec section in the TDX documentation Paolo Bonzini
2025-09-01 16:09 ` [PATCH 7/7] KVM: TDX: Explicitly do WBINVD when no more TDX SEAMCALLs Paolo Bonzini
2025-10-03 13:09 ` [PATCH v8 0/7] TDX host: kexec/kdump support Paolo Bonzini
2025-10-03 13:54   ` David Woodhouse
2025-10-03 14:05   ` Dave Hansen
