* [PATCH v2 0/5] TDX host: kexec/kdump support
@ 2025-05-10 11:20 Kai Huang
2025-05-10 11:20 ` [PATCH v2 1/5] x86/sme: Use percpu boolean to control wbinvd during kexec Kai Huang
` (6 more replies)
0 siblings, 7 replies; 15+ messages in thread
From: Kai Huang @ 2025-05-10 11:20 UTC (permalink / raw)
To: dave.hansen, bp, tglx, peterz, mingo
Cc: kirill.shutemov, hpa, x86, linux-kernel, pbonzini, seanjc,
rick.p.edgecombe, isaku.yamahata, reinette.chatre, dan.j.williams,
thomas.lendacky, ashish.kalra, nik.borisov, sagis
Hi Dave,
This series is the latest attempt to support kexec on TDX host following
your suggestion to use a percpu boolean to control WBINVD during kexec.
I'd appreciate your help reviewing it.
The last attempt was a single patch to make TDX and kexec mutually
exclusive at runtime while allowing both to be enabled in Kconfig, but
that turned out to be overkill. Dave proposed another option of using
a percpu boolean to track the need for flushing:
https://lore.kernel.org/lkml/20250416230259.97989-1-kai.huang@intel.com/
One advantage of the percpu boolean is that for TDX we could do an
optimization:
wbinvd();
percpu(boolean) = 0;
on all CPUs at an early stage to avoid having to do WBINVD in
stop_this_cpu() at kexec time. I made an RFC patch to show the idea:
https://github.com/intel/tdx/commit/d9f0123b1d63ba24f472da6971181939ce53d2e3
I'll also reply to this cover letter with that RFC patch in case anyone
wants to discuss it. Nevertheless, my expectation is that this series
can be merged first without the RFC patch; the optimization can be
followed up later.
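As a rough illustration of that optimization, here is a small user-space
model (hypothetical names and stand-in functions, not the RFC patch
itself): flushing each CPU early and clearing its flag means the
kexec-time path finds nothing left to do.

```c
#include <stdbool.h>

#define NR_CPUS 4

/* Model of the percpu cache_state_incoherent boolean. */
static bool cache_state_incoherent[NR_CPUS];

/* Stand-in for wbinvd(); just counts how many flushes happen. */
static int flush_count;
static void wbinvd_model(void) { flush_count++; }

/* A SEAMCALL marks the calling CPU's cache as possibly incoherent. */
static void seamcall_model(int cpu)
{
	cache_state_incoherent[cpu] = true;
}

/*
 * The proposed optimization: flush each CPU early (e.g., from a
 * reboot notifier) and clear its flag, so stop_this_cpu() at kexec
 * time has nothing left to flush.
 */
static void early_flush_all(void)
{
	for (int cpu = 0; cpu < NR_CPUS; cpu++) {
		if (cache_state_incoherent[cpu]) {
			wbinvd_model();
			cache_state_incoherent[cpu] = false;
		}
	}
}

/* What stop_this_cpu() would check at kexec time. */
static bool kexec_time_flush_needed(int cpu)
{
	return cache_state_incoherent[cpu];
}
```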
This series is tagged v2, since it's a closer follow-on to the RFC
patchset, which was posted before the single patch:
https://lore.kernel.org/all/cover.1741778537.git.kai.huang@intel.com/
This series is based on today's tip/master.
=== More information ===
TDX private memory is memory that is encrypted with private Host Key IDs
(HKID). If the kernel has ever enabled TDX, part of system memory
remains TDX private memory when kexec happens. E.g., the PAMT (Physical
Address Metadata Table) pages used by the TDX module to track each TDX
memory page's state are never freed once the TDX module is initialized.
TDX guests also have guest private memory and secure-EPT pages.
After kexec, the new kernel has no knowledge of which memory pages were
used as TDX private pages, and can use all memory as regular memory.
1) Cache flush
Per TDX 1.5 base spec "8.6.1. Platforms not Using ACT: Required Cache
Flush and Initialization by the Host VMM", to support kexec for TDX, the
kernel needs to flush the cache to make sure no dirty cachelines of TDX
private memory are left over for the new kernel (when the TDX module
reports TDX_FEATURES.CLFLUSH_BEFORE_ALLOC as 1 in the global metadata for
the platform). The kernel also needs to make sure there's no more TDX
activity (no SEAMCALL) after the cache flush, so that no new dirty
cachelines of TDX private memory are generated.
SME has a similar requirement. SME kexec support uses WBINVD to do the
cache flush, and WBINVD is able to flush cachelines associated with any
HKID. Reuse the WBINVD introduced by SME to flush the cache for TDX.
Currently the kernel explicitly checks whether the hardware supports SME
and only does WBINVD if it does. Instead of adding yet another
TDX-specific check, this series uses a percpu boolean to indicate
whether WBINVD is needed on that CPU during kexec.
2) Reset TDX private memory using MOVDIR64B
The TDX spec (the aforementioned section) also suggests the kernel
*should* use MOVDIR64B to clear a TDX private page before reusing it as
a regular one.
However, in reality the situation can be more flexible. Per TDX 1.5
base spec ("Table 16.2: Non-ACT Platforms Checks on Memory Reads in Ci
Mode" and "Table 16.3: Non-ACT Platforms Checks on Memory Reads in Li
Mode"), reads/writes to TDX private memory using a shared KeyID without
integrity check enabled will not poison the memory or cause a machine
check.
Note that on platforms with ACT (Access Control Table), there's no
integrity check involved, thus no machine check can happen due to memory
reads/writes using different KeyIDs.
KeyID 0 (the TME key) doesn't support integrity check. This series
chooses NOT to reset TDX private memory but to leave it as-is for the
new kernel. As mentioned above, in practice it is safe to do so.
3) One limitation
If the kernel has ever enabled TDX, after kexec the new kernel won't be
able to use TDX anymore. This is because when the new kernel tries to
initialize the TDX module, it will fail on the first SEAMCALL since the
module has already been initialized by the old kernel.
More (non-trivial) work will be needed for the new kernel to use TDX,
e.g., one solution is to reload the TDX module from the location where
the BIOS loads it (/boot/efi/EFI/TDX/). This series doesn't cover that,
but leaves it as future work.
4) Kdump support
This series also enables kdump with TDX, but no special handling is
needed for crash kexec (except turning on the Kconfig option):
- kdump kernel uses reserved memory from the old kernel as system ram,
and the old kernel will never use the reserved memory as TDX memory.
- /proc/vmcore contains TDX private memory pages. It's meaningless to
read them, but it doesn't do any harm either.
5) TDX "partial write machine check" erratum
On platforms with the TDX erratum, a partial write (a write transaction
of less than a cacheline that lands at the memory controller) to TDX
private memory poisons that memory, and a subsequent read triggers a
machine check. On those platforms, the kernel needs to reset TDX
private memory before jumping to the new kernel, otherwise the new
kernel may see unexpected machine checks.
The kernel currently doesn't track which pages are TDX private memory,
and resetting TDX private memory is not trivial. For simplicity, this
series simply disables kexec/kdump on such platforms. This can be
enhanced in the future.
Kai Huang (5):
x86/sme: Use percpu boolean to control wbinvd during kexec
x86/virt/tdx: Mark memory cache state incoherent when making SEAMCALL
x86/kexec: Disable kexec/kdump on platforms with TDX partial write
erratum
x86/virt/tdx: Remove the !KEXEC_CORE dependency
x86/virt/tdx: Update the kexec section in the TDX documentation
Documentation/arch/x86/tdx.rst | 14 ++++++-------
arch/x86/Kconfig | 1 -
arch/x86/include/asm/kexec.h | 2 +-
arch/x86/include/asm/processor.h | 2 ++
arch/x86/include/asm/tdx.h | 31 +++++++++++++++++++++++++++-
arch/x86/kernel/cpu/amd.c | 16 ++++++++++++++
arch/x86/kernel/machine_kexec_64.c | 31 +++++++++++++++++++++++-----
arch/x86/kernel/process.c | 16 +++-----------
arch/x86/kernel/relocate_kernel_64.S | 15 ++++++++++----
9 files changed, 96 insertions(+), 32 deletions(-)
base-commit: 5b036d3516a8be54c24c05d8d1dc86f9815ba53e
--
2.43.0
^ permalink raw reply [flat|nested] 15+ messages in thread
* [PATCH v2 1/5] x86/sme: Use percpu boolean to control wbinvd during kexec
2025-05-10 11:20 [PATCH v2 0/5] TDX host: kexec/kdump support Kai Huang
@ 2025-05-10 11:20 ` Kai Huang
2025-05-10 11:20 ` [PATCH v2 2/5] x86/virt/tdx: Mark memory cache state incoherent when making SEAMCALL Kai Huang
` (5 subsequent siblings)
6 siblings, 0 replies; 15+ messages in thread
From: Kai Huang @ 2025-05-10 11:20 UTC (permalink / raw)
To: dave.hansen, bp, tglx, peterz, mingo
Cc: kirill.shutemov, hpa, x86, linux-kernel, pbonzini, seanjc,
rick.p.edgecombe, isaku.yamahata, reinette.chatre, dan.j.williams,
thomas.lendacky, ashish.kalra, nik.borisov, sagis
On SME platforms, at the hardware level, dirty cachelines with and
without the encryption bit for the same memory can coexist, and the CPU
can flush
them back to memory in random order. During kexec, the caches must be
flushed before jumping to the new kernel to avoid silent memory
corruption when a cacheline with a different encryption property is
written back over whatever encryption properties the new kernel is
using.
TDX also needs a cache flush during kexec for the same reason. It would
be good to implement a generic way to flush the cache instead of scattering
checks for each feature all around.
During kexec, the kexec-ing CPU sends IPIs to all remote CPUs to stop
them before it boots to the new kernel. For SME the kernel basically
encrypts all memory including the kernel itself by manipulating page
tables. A simple memory write from the kernel could dirty cachelines.
Currently, the kernel uses WBINVD to flush cache for SME during kexec in
two places:
1) the one in stop_this_cpu() for all remote CPUs when the kexec-ing CPU
stops them;
2) the one in the relocate_kernel() where the kexec-ing CPU jumps to the
new kernel.
Unlike SME, TDX can only dirty cachelines when it is used (i.e., when
SEAMCALLs are performed). Since there are no more SEAMCALLs after the
aforementioned WBINVDs, leverage this for TDX.
In order to have a generic way to cover both SME and TDX, use a percpu
boolean to indicate that the cache may be in an incoherent state and
thus a cache flush is needed during kexec, and turn on the boolean for
SME. TDX can then leverage it by also turning the boolean on.
A percpu boolean isn't strictly needed for SME since it is activated at
very early boot time and on all CPUs. A global flag would be
sufficient. But using a percpu variable has two benefits. Foremost,
the state that is being tracked here (percpu cache coherency situation
requiring a flush) is percpu, so a percpu state is a more direct and
natural fit.
Secondly, it fits TDX's usage better, since the percpu var can be set
when a CPU makes a SEAMCALL and cleared when a later WBINVD on that CPU
obviates the need for a kexec-time WBINVD. Saving the kexec-time WBINVD
is valuable, because there is an existing race[*] where kexec could
proceed while another CPU is still active. WBINVD could make this race
worse, so it's worth skipping it when possible.
Today the first WBINVD, in stop_this_cpu(), is performed when SME is
*supported* by the platform, while the second WBINVD, in
relocate_kernel(), is done only when SME is *activated* by the kernel.
Simplify this by also doing the second WBINVD when the platform supports
SME. This allows the kernel to simply turn on the percpu boolean when
bringing up a CPU, by checking whether the platform supports SME.
No other functional change intended.
Also, machine_kexec() currently has a comment explaining why no function
call is allowed after load_segments(). After switching to the new
percpu boolean to control whether to perform WBINVD when calling
relocate_kernel(), the original reason no longer applies, but the
restriction still does. Expand the comment around load_segments() to
mention that, due to call depth tracking, no function call can be made
after load_segments().
[*] The "race" in native_stop_other_cpus()
Commit
1f5e7eb7868e: ("x86/smp: Make stop_other_cpus() more robust")
introduced a new 'cpus_stop_mask' to resolve an "intermittent lockups on
poweroff" issue which was caused by the WBINVD in stop_this_cpu().
Specifically, the new cpumask resolved the below problem mentioned in
that commit:
CPU0 CPU1
stop_other_cpus()
send_IPIs(REBOOT); stop_this_cpu()
while (num_online_cpus() > 1); set_online(false);
proceed... -> hang
wbinvd()
While it fixed the reported issue, that commit explained the new cpumask
"cannot plug all holes either".
This is because it doesn't address the "race" mentioned in the #3 in the
comment in native_stop_other_cpus():
/*
* 1) Send an IPI on the reboot vector to all other CPUs.
*
* The other CPUs should react on it after leaving critical
* sections and re-enabling interrupts. They might still hold
* locks, but there is nothing which can be done about that.
*
* 2) Wait for all other CPUs to report that they reached the
* HLT loop in stop_this_cpu()
*
* 3) If #2 timed out send an NMI to the CPUs which did not
* yet report
*
* 4) Wait for all other CPUs to report that they reached the
* HLT loop in stop_this_cpu()
*
* #3 can obviously race against a CPU reaching the HLT loop late.
* That CPU will have reported already and the "have all CPUs
* reached HLT" condition will be true despite the fact that the
* other CPU is still handling the NMI. Again, there is no
* protection against that as "disabled" APICs still respond to
* NMIs.
*/
Consider below case:
CPU 0 CPU 1
native_stop_other_cpus() stop_this_cpu()
// sends REBOOT IPI to stop remote CPUs
...
wbinvd();
// wait times out, try NMI
if (!cpumask_empty(&cpus_stop_mask)) {
for_each_cpu(cpu, &cpus_stop_mask) {
...
cpumask_clear_cpu(cpu,
&cpus_stop_mask);
hlt;
// send NMI --->
wakeup from hlt and run
stop_this_cpu():
// WAIT CPUs TO STOP
while (!cpumask_empty(
&cpus_stop_mask) &&
...)
}
...
proceed ... wbinvd();
...
hlt;
The "WAIT CPUs TO STOP" is supposed to wait until all remote CPUs have
stopped, but it actually quits immediately because the remote CPUs were
already cleared from cpus_stop_mask when stop_this_cpu() was called from
the REBOOT IPI.
Doing WBINVD in stop_this_cpu() could potentially increase the chance of
triggering the above "race", even though it is still rare.
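The window can be modeled in a few lines of user-space C (hypothetical
names, not kernel code): the remote CPU clears itself from the stop mask
on the REBOOT IPI path before its wbinvd, so the kexec-ing CPU's wait
loop can pass while the flush is still in flight.

```c
#include <stdbool.h>

#define NR_CPUS 2

static bool cpus_stop_mask[NR_CPUS];
static bool still_flushing[NR_CPUS];

/*
 * Model of the REBOOT-IPI path in stop_this_cpu(): the CPU removes
 * itself from cpus_stop_mask *before* the (long-running) wbinvd.
 */
static void remote_stop_this_cpu(int cpu)
{
	cpus_stop_mask[cpu] = false;	/* reported as stopped... */
	still_flushing[cpu] = true;	/* ...but wbinvd still running */
}

/*
 * Model of the "WAIT CPUs TO STOP" check after the NMI is sent: it
 * only consults cpus_stop_mask, so it cannot tell that a remote CPU
 * is still inside wbinvd.
 */
static bool all_cpus_reported_stopped(void)
{
	for (int cpu = 1; cpu < NR_CPUS; cpu++)
		if (cpus_stop_mask[cpu])
			return false;
	return true;
}
```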
Signed-off-by: Kai Huang <kai.huang@intel.com>
---
arch/x86/include/asm/kexec.h | 2 +-
arch/x86/include/asm/processor.h | 2 ++
arch/x86/kernel/cpu/amd.c | 16 ++++++++++++++++
arch/x86/kernel/machine_kexec_64.c | 15 ++++++++++-----
arch/x86/kernel/process.c | 16 +++-------------
arch/x86/kernel/relocate_kernel_64.S | 15 +++++++++++----
6 files changed, 43 insertions(+), 23 deletions(-)
diff --git a/arch/x86/include/asm/kexec.h b/arch/x86/include/asm/kexec.h
index f2ad77929d6e..d7e93522b93d 100644
--- a/arch/x86/include/asm/kexec.h
+++ b/arch/x86/include/asm/kexec.h
@@ -122,7 +122,7 @@ relocate_kernel_fn(unsigned long indirection_page,
unsigned long pa_control_page,
unsigned long start_address,
unsigned int preserve_context,
- unsigned int host_mem_enc_active);
+ unsigned int cache_incoherent);
#endif
extern relocate_kernel_fn relocate_kernel;
#define ARCH_HAS_KIMAGE_ARCH
diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index 50d34698036d..da442ca6a2e9 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -731,6 +731,8 @@ void __noreturn stop_this_cpu(void *dummy);
void microcode_check(struct cpuinfo_x86 *prev_info);
void store_cpu_caps(struct cpuinfo_x86 *info);
+DECLARE_PER_CPU(bool, cache_state_incoherent);
+
enum l1tf_mitigations {
L1TF_MITIGATION_OFF,
L1TF_MITIGATION_AUTO,
diff --git a/arch/x86/kernel/cpu/amd.c b/arch/x86/kernel/cpu/amd.c
index 13a48ec28f32..7918561f9382 100644
--- a/arch/x86/kernel/cpu/amd.c
+++ b/arch/x86/kernel/cpu/amd.c
@@ -494,6 +494,22 @@ static void early_detect_mem_encrypt(struct cpuinfo_x86 *c)
{
u64 msr;
+ /*
+ * Mark that wbinvd is needed during kexec on processors that
+ * support SME. This provides support for performing a successful
+ * kexec when going from SME inactive to SME active (or vice-versa).
+ *
+ * The cache must be cleared so that if there are entries with the
+ * same physical address, both with and without the encryption bit,
+ * they don't race each other when flushed and potentially end up
+ * with the wrong entry being committed to memory.
+ *
+ * Test the CPUID bit directly because the machine might've cleared
+ * X86_FEATURE_SME due to cmdline options.
+ */
+ if (c->extended_cpuid_level >= 0x8000001f && (cpuid_eax(0x8000001f) & BIT(0)))
+ __this_cpu_write(cache_state_incoherent, true);
+
/*
* BIOS support is required for SME and SEV.
* For SME: If BIOS has enabled SME then adjust x86_phys_bits by
diff --git a/arch/x86/kernel/machine_kexec_64.c b/arch/x86/kernel/machine_kexec_64.c
index 949c9e4bfad2..f5a7b1894fcf 100644
--- a/arch/x86/kernel/machine_kexec_64.c
+++ b/arch/x86/kernel/machine_kexec_64.c
@@ -29,6 +29,7 @@
#include <asm/set_memory.h>
#include <asm/cpu.h>
#include <asm/efi.h>
+#include <asm/processor.h>
#ifdef CONFIG_ACPI
/*
@@ -384,15 +385,15 @@ void __nocfi machine_kexec(struct kimage *image)
{
unsigned long reloc_start = (unsigned long)__relocate_kernel_start;
relocate_kernel_fn *relocate_kernel_ptr;
- unsigned int host_mem_enc_active;
+ unsigned int cache_incoherent;
int save_ftrace_enabled;
void *control_page;
/*
- * This must be done before load_segments() since if call depth tracking
- * is used then GS must be valid to make any function calls.
+ * This must be done before load_segments(), since it resets
+ * GS to 0 and percpu data needs the correct GS to work.
*/
- host_mem_enc_active = cc_platform_has(CC_ATTR_HOST_MEM_ENCRYPT);
+ cache_incoherent = this_cpu_read(cache_state_incoherent);
#ifdef CONFIG_KEXEC_JUMP
if (image->preserve_context)
@@ -436,6 +437,10 @@ void __nocfi machine_kexec(struct kimage *image)
*
* Take advantage of this here by force loading the segments,
* before the GDT is zapped with an invalid value.
+ *
+ * load_segments() resets GS to 0. Don't make any function call
+ * after here since call depth tracking uses percpu variables to
+ * operate (relocate_kernel() is explicitly ignored by call depth tracking).
*/
load_segments();
@@ -444,7 +449,7 @@ void __nocfi machine_kexec(struct kimage *image)
virt_to_phys(control_page),
image->start,
image->preserve_context,
- host_mem_enc_active);
+ cache_incoherent);
#ifdef CONFIG_KEXEC_JUMP
if (image->preserve_context)
diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index 4b668bc683c4..908e4c0a3453 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -88,6 +88,8 @@ EXPORT_PER_CPU_SYMBOL(cpu_tss_rw);
DEFINE_PER_CPU(bool, __tss_limit_invalid);
EXPORT_PER_CPU_SYMBOL_GPL(__tss_limit_invalid);
+DEFINE_PER_CPU(bool, cache_state_incoherent);
+
/*
* this gets called so that we can store lazy state into memory and copy the
* current task into the new thread.
@@ -813,19 +815,7 @@ void __noreturn stop_this_cpu(void *dummy)
disable_local_APIC();
mcheck_cpu_clear(c);
- /*
- * Use wbinvd on processors that support SME. This provides support
- * for performing a successful kexec when going from SME inactive
- * to SME active (or vice-versa). The cache must be cleared so that
- * if there are entries with the same physical address, both with and
- * without the encryption bit, they don't race each other when flushed
- * and potentially end up with the wrong entry being committed to
- * memory.
- *
- * Test the CPUID bit directly because the machine might've cleared
- * X86_FEATURE_SME due to cmdline options.
- */
- if (c->extended_cpuid_level >= 0x8000001f && (cpuid_eax(0x8000001f) & BIT(0)))
+ if (this_cpu_read(cache_state_incoherent))
wbinvd();
/*
diff --git a/arch/x86/kernel/relocate_kernel_64.S b/arch/x86/kernel/relocate_kernel_64.S
index ea604f4d0b52..34b3a5e4fe49 100644
--- a/arch/x86/kernel/relocate_kernel_64.S
+++ b/arch/x86/kernel/relocate_kernel_64.S
@@ -67,7 +67,7 @@ SYM_CODE_START_NOALIGN(relocate_kernel)
* %rsi pa_control_page
* %rdx start address
* %rcx preserve_context
- * %r8 host_mem_enc_active
+ * %r8 cache_incoherent
*/
/* Save the CPU context, used for jumping back */
@@ -129,7 +129,7 @@ SYM_CODE_START_LOCAL_NOALIGN(identity_mapped)
/*
* %rdi indirection page
* %rdx start address
- * %r8 host_mem_enc_active
+ * %r8 cache_incoherent
* %r9 page table page
* %r11 preserve_context
* %r13 original CR4 when relocate_kernel() was invoked
@@ -200,14 +200,21 @@ SYM_CODE_START_LOCAL_NOALIGN(identity_mapped)
movq %r9, %cr3
/*
+ * If the memory cache is in incoherent state, e.g., due to
+ * memory encryption, do wbinvd to flush cache.
+ *
* If SME is active, there could be old encrypted cache line
* entries that will conflict with the now unencrypted memory
* used by kexec. Flush the caches before copying the kernel.
+ *
+ * Note SME sets this flag to true when the platform supports
+ * SME, so the wbinvd is performed even when SME is not
+ * activated by the kernel. But this does no harm.
*/
testq %r8, %r8
- jz .Lsme_off
+ jz .Lnowbinvd
wbinvd
-.Lsme_off:
+.Lnowbinvd:
call swap_pages
--
2.43.0
^ permalink raw reply related [flat|nested] 15+ messages in thread
* [PATCH v2 2/5] x86/virt/tdx: Mark memory cache state incoherent when making SEAMCALL
2025-05-10 11:20 [PATCH v2 0/5] TDX host: kexec/kdump support Kai Huang
2025-05-10 11:20 ` [PATCH v2 1/5] x86/sme: Use percpu boolean to control wbinvd during kexec Kai Huang
@ 2025-05-10 11:20 ` Kai Huang
2025-05-14 10:10 ` [PATCH v2.1 " Kai Huang
2025-05-10 11:20 ` [PATCH v2 3/5] x86/kexec: Disable kexec/kdump on platforms with TDX partial write erratum Kai Huang
` (4 subsequent siblings)
6 siblings, 1 reply; 15+ messages in thread
From: Kai Huang @ 2025-05-10 11:20 UTC (permalink / raw)
To: dave.hansen, bp, tglx, peterz, mingo
Cc: kirill.shutemov, hpa, x86, linux-kernel, pbonzini, seanjc,
rick.p.edgecombe, isaku.yamahata, reinette.chatre, dan.j.williams,
thomas.lendacky, ashish.kalra, nik.borisov, sagis
On TDX platforms, at the hardware level, dirty cachelines with and
without a TDX KeyID can coexist, and the CPU can flush them back to memory in random
order. During kexec, the caches must be flushed before jumping to the
new kernel to avoid silent memory corruption when a cacheline with a
different encryption property is written back over whatever encryption
properties the new kernel is using.
A percpu boolean is used to mark whether the cache of a given CPU may be
in an incoherent state, and the kexec performs WBINVD on the CPUs with
that boolean turned on.
For TDX, only the TDX module or TDX guests can generate dirty cachelines
of TDX private memory, i.e., they are only generated when the kernel
does a SEAMCALL.
Turn on that boolean when the kernel does a SEAMCALL so that kexec can
correctly flush the cache. Note that not all SEAMCALL leaf functions
generate dirty cachelines of TDX private memory, but for simplicity
just treat them all as if they do.
SEAMCALL can be made from both task context and IRQ-disabled context.
Given that SEAMCALL is just a lengthy instruction (e.g., thousands of
cycles) from the kernel's point of view, and preempt_{disable|enable}()
is cheap compared to it, simply disable preemption unconditionally while
setting the percpu boolean and making the SEAMCALL.
Signed-off-by: Kai Huang <kai.huang@intel.com>
---
arch/x86/include/asm/tdx.h | 31 ++++++++++++++++++++++++++++++-
1 file changed, 30 insertions(+), 1 deletion(-)
diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index 4a1922ec80cf..d017e48958cd 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -97,9 +97,38 @@ u64 __seamcall_saved_ret(u64 fn, struct tdx_module_args *args);
void tdx_init(void);
#include <asm/archrandom.h>
+#include <asm/processor.h>
typedef u64 (*sc_func_t)(u64 fn, struct tdx_module_args *args);
+static inline u64 do_seamcall(sc_func_t func, u64 fn,
+ struct tdx_module_args *args)
+{
+ u64 ret;
+
+ preempt_disable();
+
+ /*
+ * SEAMCALLs are made to the TDX module and can generate dirty
+ * cachelines of TDX private memory. Mark cache state incoherent
+ * so that the cache can be flushed during kexec.
+ *
+ * Not all SEAMCALL leaf functions generate dirty cachelines
+ * but for simplicity just treat them all as if they do.
+ *
+ * This needs to be done before actually making the SEAMCALL,
+ * because kexec-ing CPU could send NMI to stop remote CPUs,
+ * in which case even disabling IRQ won't help here.
+ */
+ this_cpu_write(cache_state_incoherent, true);
+
+ ret = func(fn, args);
+
+ preempt_enable();
+
+ return ret;
+}
+
static inline u64 sc_retry(sc_func_t func, u64 fn,
struct tdx_module_args *args)
{
@@ -107,7 +136,7 @@ static inline u64 sc_retry(sc_func_t func, u64 fn,
u64 ret;
do {
- ret = func(fn, args);
+ ret = do_seamcall(func, fn, args);
} while (ret == TDX_RND_NO_ENTROPY && --retry);
return ret;
--
2.43.0
^ permalink raw reply related [flat|nested] 15+ messages in thread
* [PATCH v2 3/5] x86/kexec: Disable kexec/kdump on platforms with TDX partial write erratum
2025-05-10 11:20 [PATCH v2 0/5] TDX host: kexec/kdump support Kai Huang
2025-05-10 11:20 ` [PATCH v2 1/5] x86/sme: Use percpu boolean to control wbinvd during kexec Kai Huang
2025-05-10 11:20 ` [PATCH v2 2/5] x86/virt/tdx: Mark memory cache state incoherent when making SEAMCALL Kai Huang
@ 2025-05-10 11:20 ` Kai Huang
2025-05-10 11:20 ` [PATCH v2 4/5] x86/virt/tdx: Remove the !KEXEC_CORE dependency Kai Huang
` (3 subsequent siblings)
6 siblings, 0 replies; 15+ messages in thread
From: Kai Huang @ 2025-05-10 11:20 UTC (permalink / raw)
To: dave.hansen, bp, tglx, peterz, mingo
Cc: kirill.shutemov, hpa, x86, linux-kernel, pbonzini, seanjc,
rick.p.edgecombe, isaku.yamahata, reinette.chatre, dan.j.williams,
thomas.lendacky, ashish.kalra, nik.borisov, sagis
Some early TDX-capable platforms have an erratum: a kernel partial
write (a write transaction of less than a cacheline that lands at the
memory controller) to TDX private memory poisons that memory, and a
subsequent read triggers a machine check.
On those platforms, the old kernel must reset TDX private memory before
jumping to the new kernel, otherwise the new kernel may see unexpected
machine checks. Currently the kernel doesn't track which pages are TDX
private pages. For simplicity, just fail kexec/kdump on those platforms.
Leverage the existing machine_kexec_prepare() to fail kexec/kdump by
adding a check for the presence of the TDX erratum (which is only
checked when the kernel is built with TDX host support). This rejects
kexec/kdump when the kernel loads the kexec/kdump kernel image.
The alternative is to reject kexec/kdump when the kernel jumps to the
new kernel. But for kexec this requires adding a new check (e.g.,
arch_kexec_allowed()) in the common code to fail kernel_kexec() at an
early stage. Kdump (crash_kexec()) would need a similar check, but it's
hard to justify because crash_kexec() is not supposed to abort.
It's feasible to relax this limitation further, i.e., to only fail
kexec when TDX has actually been enabled by the kernel. But that is
still a half measure compared to resetting TDX private memory, so just
do the simplest thing for now.
The impact to userspace is that users will get an error when loading the
kexec/kdump kernel image:
kexec_load failed: Operation not supported
This might be confusing to users, so also print the reason in
dmesg:
[..] kexec: not allowed on platform with tdx_pw_mce bug.
Signed-off-by: Kai Huang <kai.huang@intel.com>
---
arch/x86/kernel/machine_kexec_64.c | 16 ++++++++++++++++
1 file changed, 16 insertions(+)
diff --git a/arch/x86/kernel/machine_kexec_64.c b/arch/x86/kernel/machine_kexec_64.c
index f5a7b1894fcf..39e133dab7cb 100644
--- a/arch/x86/kernel/machine_kexec_64.c
+++ b/arch/x86/kernel/machine_kexec_64.c
@@ -347,6 +347,22 @@ int machine_kexec_prepare(struct kimage *image)
unsigned long reloc_end = (unsigned long)__relocate_kernel_end;
int result;
+ /*
+ * Some early TDX-capable platforms have an erratum. A kernel
+ * partial write (a write transaction of less than a cacheline that
+ * lands at the memory controller) to TDX private memory poisons that
+ * memory, and a subsequent read triggers a machine check.
+ *
+ * On those platforms the old kernel must reset TDX private
+ * memory before jumping to the new kernel otherwise the new
+ * kernel may see unexpected machine check. For simplicity
+ * just fail kexec/kdump on those platforms.
+ */
+ if (boot_cpu_has_bug(X86_BUG_TDX_PW_MCE)) {
+ pr_info_once("Not allowed on platform with tdx_pw_mce bug\n");
+ return -EOPNOTSUPP;
+ }
+
/* Setup the identity mapped 64bit page table */
result = init_pgtable(image, __pa(control_page));
if (result)
--
2.43.0
^ permalink raw reply related [flat|nested] 15+ messages in thread
* [PATCH v2 4/5] x86/virt/tdx: Remove the !KEXEC_CORE dependency
2025-05-10 11:20 [PATCH v2 0/5] TDX host: kexec/kdump support Kai Huang
` (2 preceding siblings ...)
2025-05-10 11:20 ` [PATCH v2 3/5] x86/kexec: Disable kexec/kdump on platforms with TDX partial write erratum Kai Huang
@ 2025-05-10 11:20 ` Kai Huang
2025-05-10 11:20 ` [PATCH v2 5/5] x86/virt/tdx: Update the kexec section in the TDX documentation Kai Huang
` (2 subsequent siblings)
6 siblings, 0 replies; 15+ messages in thread
From: Kai Huang @ 2025-05-10 11:20 UTC (permalink / raw)
To: dave.hansen, bp, tglx, peterz, mingo
Cc: kirill.shutemov, hpa, x86, linux-kernel, pbonzini, seanjc,
rick.p.edgecombe, isaku.yamahata, reinette.chatre, dan.j.williams,
thomas.lendacky, ashish.kalra, nik.borisov, sagis
During kexec it is now guaranteed that all dirty cachelines of TDX
private memory are flushed before jumping to the new kernel. The TDX
private memory from the old kernel will remain TDX private memory in
the new kernel, but that is OK because kernel reads/writes to TDX
private memory never cause a machine check, except on the platforms
with the TDX partial write erratum, which has already been handled.
It is safe to allow kexec to work together with TDX now. Remove the
!KEXEC_CORE dependency.
Signed-off-by: Kai Huang <kai.huang@intel.com>
---
arch/x86/Kconfig | 1 -
1 file changed, 1 deletion(-)
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 0f63f550ab52..bd576bc8843a 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1923,7 +1923,6 @@ config INTEL_TDX_HOST
depends on X86_X2APIC
select ARCH_KEEP_MEMBLOCK
depends on CONTIG_ALLOC
- depends on !KEXEC_CORE
depends on X86_MCE
help
Intel Trust Domain Extensions (TDX) protects guest VMs from malicious
--
2.43.0
^ permalink raw reply related [flat|nested] 15+ messages in thread
* [PATCH v2 5/5] x86/virt/tdx: Update the kexec section in the TDX documentation
2025-05-10 11:20 [PATCH v2 0/5] TDX host: kexec/kdump support Kai Huang
` (3 preceding siblings ...)
2025-05-10 11:20 ` [PATCH v2 4/5] x86/virt/tdx: Remove the !KEXEC_CORE dependency Kai Huang
@ 2025-05-10 11:20 ` Kai Huang
2025-05-10 11:25 ` [RFC PATCH v2 6/5] KVM: TDX: Explicitly do WBINVD upon reboot notifier Kai Huang
2025-05-14 12:37 ` [PATCH v2 0/5] TDX host: kexec/kdump support Tom Lendacky
6 siblings, 0 replies; 15+ messages in thread
From: Kai Huang @ 2025-05-10 11:20 UTC (permalink / raw)
To: dave.hansen, bp, tglx, peterz, mingo
Cc: kirill.shutemov, hpa, x86, linux-kernel, pbonzini, seanjc,
rick.p.edgecombe, isaku.yamahata, reinette.chatre, dan.j.williams,
thomas.lendacky, ashish.kalra, nik.borisov, sagis
The TDX host kernel now supports kexec/kdump. Update the documentation
to reflect that.
Opportunistically, remove the parentheses in "Kexec()" and move this
section after the "Erratum" section, because the updated "Kexec" section
now refers to that erratum.
Signed-off-by: Kai Huang <kai.huang@intel.com>
---
Documentation/arch/x86/tdx.rst | 14 +++++++-------
1 file changed, 7 insertions(+), 7 deletions(-)
diff --git a/Documentation/arch/x86/tdx.rst b/Documentation/arch/x86/tdx.rst
index 719043cd8b46..61670e7df2f7 100644
--- a/Documentation/arch/x86/tdx.rst
+++ b/Documentation/arch/x86/tdx.rst
@@ -142,13 +142,6 @@ but depends on the BIOS to behave correctly.
Note TDX works with CPU logical online/offline, thus the kernel still
allows to offline logical CPU and online it again.
-Kexec()
-~~~~~~~
-
-TDX host support currently lacks the ability to handle kexec. For
-simplicity only one of them can be enabled in the Kconfig. This will be
-fixed in the future.
-
Erratum
~~~~~~~
@@ -171,6 +164,13 @@ If the platform has such erratum, the kernel prints additional message in
machine check handler to tell user the machine check may be caused by
kernel bug on TDX private memory.
+Kexec
+~~~~~
+
+Currently kexec doesn't work on the TDX platforms with the aforementioned
+erratum. It fails when loading the kexec kernel image. Otherwise it
+works normally.
+
Interaction vs S3 and deeper states
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
--
2.43.0
^ permalink raw reply related [flat|nested] 15+ messages in thread
* [RFC PATCH v2 6/5] KVM: TDX: Explicitly do WBINVD upon reboot notifier
2025-05-10 11:20 [PATCH v2 0/5] TDX host: kexec/kdump support Kai Huang
` (4 preceding siblings ...)
2025-05-10 11:20 ` [PATCH v2 5/5] x86/virt/tdx: Update the kexec section in the TDX documentation Kai Huang
@ 2025-05-10 11:25 ` Kai Huang
2025-06-13 11:36 ` Paolo Bonzini
2025-05-14 12:37 ` [PATCH v2 0/5] TDX host: kexec/kdump support Tom Lendacky
6 siblings, 1 reply; 15+ messages in thread
From: Kai Huang @ 2025-05-10 11:25 UTC (permalink / raw)
To: dave.hansen, bp, tglx, peterz, mingo
Cc: kirill.shutemov, hpa, x86, linux-kernel, pbonzini, seanjc,
rick.p.edgecombe, isaku.yamahata, reinette.chatre, dan.j.williams,
thomas.lendacky, ashish.kalra, nik.borisov, sagis
On TDX platforms, during kexec, the kernel needs to make sure there's no
dirty cachelines of TDX private memory before booting to the new kernel
to avoid silent memory corruption to the new kernel.
During kexec, the kexec-ing CPU firstly invokes native_stop_other_cpus()
to stop all remote CPUs before booting to the new kernel. The remote
CPUs will then execute stop_this_cpu() to stop themselves.
The kernel has a percpu boolean to indicate whether the cache of a CPU
may be in incoherent state. In stop_this_cpu(), the kernel does WBINVD
if that percpu boolean is true.
TDX turns on that percpu boolean on a CPU when the kernel does SEAMCALL.
This makes sure the caches will be flushed during kexec.
However, the native_stop_other_cpus() and stop_this_cpu() have a "race"
which is extremely rare to happen but if did could cause system to hang.
Specifically, the native_stop_other_cpus() firstly sends normal reboot
IPI to remote CPUs and waits one second for them to stop. If that times
out, native_stop_other_cpus() then sends NMIs to remote CPUs to stop
them.
The aforementioned race happens when NMIs are sent. Doing WBINVD in
stop_this_cpu() makes each CPU take longer time to stop and increases
the chance of the race to happen.
Register reboot notifier in KVM to explcitly flush caches upon reboot
for TDX. This brings doing WBINVD at earlier stage and aovids the
WBINVD in stop_this_cpu(), eliminating the possibility of increasing the
chance of the aforementioned race.
Signed-off-by: Kai Huang <kai.huang@intel.com>
---
arch/x86/include/asm/tdx.h | 3 +++
arch/x86/kvm/vmx/tdx.c | 45 +++++++++++++++++++++++++++++++++++++
arch/x86/virt/vmx/tdx/tdx.c | 9 ++++++++
3 files changed, 57 insertions(+)
diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index 91dc6e6bdd97..d0156bf0b966 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -221,6 +221,8 @@ u64 tdh_mem_page_remove(struct tdx_td *td, u64 gpa, u64 level, u64 *ext_err1, u6
u64 tdh_phymem_cache_wb(bool resume);
u64 tdh_phymem_page_wbinvd_tdr(struct tdx_td *td);
u64 tdh_phymem_page_wbinvd_hkid(u64 hkid, struct page *page);
+
+void tdx_cpu_flush_cache(void);
#else
static inline void tdx_init(void) { }
static inline int tdx_cpu_enable(void) { return -ENODEV; }
@@ -228,6 +230,7 @@ static inline int tdx_enable(void) { return -ENODEV; }
static inline u32 tdx_get_nr_guest_keyids(void) { return 0; }
static inline const char *tdx_dump_mce_info(struct mce *m) { return NULL; }
static inline const struct tdx_sys_info *tdx_get_sysinfo(void) { return NULL; }
+static inline void tdx_cpu_flush_cache(void) { }
#endif /* CONFIG_INTEL_TDX_HOST */
#endif /* !__ASSEMBLER__ */
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index b952bc673271..3b92b3999855 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -5,7 +5,9 @@
#include <asm/fpu/xcr.h>
#include <linux/misc_cgroup.h>
#include <linux/mmu_context.h>
+#include <linux/reboot.h>
#include <asm/tdx.h>
+#include <asm/processor.h>
#include "capabilities.h"
#include "mmu.h"
#include "x86_ops.h"
@@ -3278,6 +3280,33 @@ static int tdx_offline_cpu(unsigned int cpu)
return -EBUSY;
}
+static void smp_func_cpu_flush_cache(void *unused)
+{
+ tdx_cpu_flush_cache();
+}
+
+static int tdx_reboot_notify(struct notifier_block *nb, unsigned long code,
+ void *unused)
+{
+ /*
+ * Flush cache for all CPUs upon the reboot notifier. This
+ * avoids having to do WBINVD in stop_this_cpu() during kexec.
+ *
+ * Kexec calls native_stop_other_cpus() to stop remote CPUs
+ * before booting to new kernel, but that code has a "race"
+ * when the normal REBOOT IPI times out and NMIs are sent to
+ * remote CPUs to stop them. Doing WBINVD in stop_this_cpu()
+ * could potentially increase the possibility of the "race".
+ */
+ if (code == SYS_RESTART)
+ on_each_cpu(smp_func_cpu_flush_cache, NULL, 1);
+ return NOTIFY_DONE;
+}
+
+static struct notifier_block tdx_reboot_nb = {
+ .notifier_call = tdx_reboot_notify,
+};
+
static void __do_tdx_cleanup(void)
{
/*
@@ -3435,6 +3464,11 @@ void tdx_cleanup(void)
{
if (enable_tdx) {
misc_cg_set_capacity(MISC_CG_RES_TDX, 0);
+ /*
+ * Ignore the return value. See the comment in
+ * tdx_bringup().
+ */
+ unregister_reboot_notifier(&tdx_reboot_nb);
__tdx_cleanup();
kvm_disable_virtualization();
}
@@ -3518,6 +3552,17 @@ int __init tdx_bringup(void)
enable_tdx = 0;
}
+ if (enable_tdx)
+ /*
+ * Ignore the return value. @tdx_reboot_nb is used to flush
+ * cache for all CPUs upon rebooting to avoid having to do
+ * WBINVD in kexec while the kexec-ing CPU stops all remote
+ * CPUs. Failure to register isn't fatal, because if KVM
+ * doesn't flush cache explicitly upon rebooting the kexec
+ * will do it anyway.
+ */
+ register_reboot_notifier(&tdx_reboot_nb);
+
return r;
success_disable_tdx:
diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index af8798bc62ed..7478230cdc33 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -1890,3 +1890,12 @@ u64 tdh_phymem_page_wbinvd_hkid(u64 hkid, struct page *page)
return seamcall(TDH_PHYMEM_PAGE_WBINVD, &args);
}
EXPORT_SYMBOL_GPL(tdh_phymem_page_wbinvd_hkid);
+
+void tdx_cpu_flush_cache(void)
+{
+ lockdep_assert_preemption_disabled();
+
+ wbinvd();
+ this_cpu_write(cache_state_incoherent, false);
+}
+EXPORT_SYMBOL_GPL(tdx_cpu_flush_cache);
--
2.43.0
* [PATCH v2.1 2/5] x86/virt/tdx: Mark memory cache state incoherent when making SEAMCALL
2025-05-10 11:20 ` [PATCH v2 2/5] x86/virt/tdx: Mark memory cache state incoherent when making SEAMCALL Kai Huang
@ 2025-05-14 10:10 ` Kai Huang
2025-05-14 15:52 ` Dave Hansen
0 siblings, 1 reply; 15+ messages in thread
From: Kai Huang @ 2025-05-14 10:10 UTC (permalink / raw)
To: dave.hansen, bp, tglx, peterz, mingo
Cc: kirill.shutemov, hpa, x86, linux-kernel, pbonzini, seanjc,
rick.p.edgecombe, reinette.chatre, isaku.yamahata, dan.j.williams,
thomas.lendacky, ashish.kalra, nik.borisov, sagis
On TDX platforms, at the hardware level, dirty cachelines with and without
a TDX keyID can coexist, and the CPU can flush them back to memory in
random order. During kexec, the caches must be flushed before jumping to
the new kernel to avoid silent memory corruption when a cacheline with a
different encryption property is written back over whatever encryption
properties the new kernel is using.
A percpu boolean is used to mark whether the cache of a given CPU may be
in an incoherent state, and kexec performs WBINVD on the CPUs with
that boolean turned on.
For TDX, only the TDX module or the TDX guests can generate dirty
cachelines of TDX private memory, i.e., they are only generated when the
kernel does SEAMCALL.
Turn on that boolean when the kernel does SEAMCALL so that kexec can
correctly flush cache. Note not all SEAMCALL leaf functions generate
dirty cachelines of TDX private memory, but for simplicity, just treat
all of them do.
SEAMCALL can be made from both task context and IRQ-disabled context.
Given that SEAMCALL is just a lengthy instruction (e.g., thousands of
cycles) from the kernel's point of view and preempt_{disable|enable}() is
cheap compared to it, simply and unconditionally disable preemption while
setting the percpu boolean and making the SEAMCALL.
Signed-off-by: Kai Huang <kai.huang@intel.com>
---
v2 -> v2.1:
- Include <linux/preempt.h> to fix a build issue reported by LKP using
'x86_64-allyesconfig' config.
---
arch/x86/include/asm/tdx.h | 32 +++++++++++++++++++++++++++++++-
1 file changed, 31 insertions(+), 1 deletion(-)
diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index 4a1922ec80cf..e69021aee731 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -96,10 +96,40 @@ u64 __seamcall_ret(u64 fn, struct tdx_module_args *args);
u64 __seamcall_saved_ret(u64 fn, struct tdx_module_args *args);
void tdx_init(void);
+#include <linux/preempt.h>
#include <asm/archrandom.h>
+#include <asm/processor.h>
typedef u64 (*sc_func_t)(u64 fn, struct tdx_module_args *args);
+static inline u64 do_seamcall(sc_func_t func, u64 fn,
+ struct tdx_module_args *args)
+{
+ u64 ret;
+
+ preempt_disable();
+
+ /*
+ * SEAMCALLs are made to the TDX module and can generate dirty
+ * cachelines of TDX private memory. Mark cache state incoherent
+ * so that the cache can be flushed during kexec.
+ *
+ * Not all SEAMCALL leaf functions generate dirty cachelines
+ * but for simplicity just assume all of them do.
+ *
+ * This needs to be done before actually making the SEAMCALL,
+ * because kexec-ing CPU could send NMI to stop remote CPUs,
+ * in which case even disabling IRQ won't help here.
+ */
+ this_cpu_write(cache_state_incoherent, true);
+
+ ret = func(fn, args);
+
+ preempt_enable();
+
+ return ret;
+}
+
static inline u64 sc_retry(sc_func_t func, u64 fn,
struct tdx_module_args *args)
{
@@ -107,7 +137,7 @@ static inline u64 sc_retry(sc_func_t func, u64 fn,
u64 ret;
do {
- ret = func(fn, args);
+ ret = do_seamcall(func, fn, args);
} while (ret == TDX_RND_NO_ENTROPY && --retry);
return ret;
--
2.49.0
* Re: [PATCH v2 0/5] TDX host: kexec/kdump support
2025-05-10 11:20 [PATCH v2 0/5] TDX host: kexec/kdump support Kai Huang
` (5 preceding siblings ...)
2025-05-10 11:25 ` [RFC PATCH v2 6/5] KVM: TDX: Explicitly do WBINVD upon reboot notifier Kai Huang
@ 2025-05-14 12:37 ` Tom Lendacky
2025-05-14 22:09 ` Huang, Kai
2025-05-29 11:06 ` Huang, Kai
6 siblings, 2 replies; 15+ messages in thread
From: Tom Lendacky @ 2025-05-14 12:37 UTC (permalink / raw)
To: Kai Huang, dave.hansen, bp, tglx, peterz, mingo
Cc: kirill.shutemov, hpa, x86, linux-kernel, pbonzini, seanjc,
rick.p.edgecombe, isaku.yamahata, reinette.chatre, dan.j.williams,
ashish.kalra, nik.borisov, sagis
On 5/10/25 06:20, Kai Huang wrote:
> Hi Dave,
>
> This series is the latest attempt to support kexec on TDX host following
> your suggestion to use a percpu boolean to control WBINVD during kexec.
> I appreciate if you can help to review.
>
> The last attempt was to have one patch to make TDX and kexec mutually
> exclusive at runtime while allowing them to be both enabled in Kconfig,
> but it turns out to be overkill. Dave proposed another option of using
> a percpu boolean to track the need for flushing:
>
> https://lore.kernel.org/lkml/20250416230259.97989-1-kai.huang@intel.com/
>
> One advantage of the percpu boolean is for TDX we could do optimization
> to do:
>
> wbinvd();
> percpu(boolean) = 0;
>
> for all CPUs at early stage to avoid having to do WBINVD in
> stop_this_cpu() at kexec time. I made a RFC patch to show the idea:
>
> https://github.com/intel/tdx/commit/d9f0123b1d63ba24f472da6971181939ce53d2e3
>
> I'll also reply this RFC patch to this coverletter in case anyone wants
> to have a discussion. Nevertheless, my expectation is this series can be
> merged first w/o the RFC patch. We can follow up the optimizations
> later.
>
> This series is tagged v2, since it's a closer follow on to the RFC
> patchset, which was posted before the single patch:
>
> https://lore.kernel.org/all/cover.1741778537.git.kai.huang@intel.com/
>
> This series is based on today's tip/master.
I'm on PTO for the next few days, so I'll try to review/test the SME
related changes when I return.
Thanks,
Tom
>
* Re: [PATCH v2.1 2/5] x86/virt/tdx: Mark memory cache state incoherent when making SEAMCALL
2025-05-14 10:10 ` [PATCH v2.1 " Kai Huang
@ 2025-05-14 15:52 ` Dave Hansen
2025-05-14 22:13 ` Huang, Kai
0 siblings, 1 reply; 15+ messages in thread
From: Dave Hansen @ 2025-05-14 15:52 UTC (permalink / raw)
To: Kai Huang, bp, tglx, peterz, mingo
Cc: kirill.shutemov, hpa, x86, linux-kernel, pbonzini, seanjc,
rick.p.edgecombe, reinette.chatre, isaku.yamahata, dan.j.williams,
thomas.lendacky, ashish.kalra, nik.borisov, sagis
On 5/14/25 03:10, Kai Huang wrote:
> Turn on that boolean when the kernel does SEAMCALL so that kexec can
> correctly flush cache. Note not all SEAMCALL leaf functions generate
> dirty cachelines of TDX private memory, but for simplicity, just treat
> all of them do.
It's not just for simplicity.
There's no contract in place for when the TDX module will dirty memory
or not. A call that is "clean" today might dirty memory tomorrow.
The _only_ thing we know is that SEAMCALLs can dirty cachelines. We
don't know when or how they do it. This blurb makes it sound like it's
possible to optimize this. It's not.
* Re: [PATCH v2 0/5] TDX host: kexec/kdump support
2025-05-14 12:37 ` [PATCH v2 0/5] TDX host: kexec/kdump support Tom Lendacky
@ 2025-05-14 22:09 ` Huang, Kai
2025-05-29 11:06 ` Huang, Kai
1 sibling, 0 replies; 15+ messages in thread
From: Huang, Kai @ 2025-05-14 22:09 UTC (permalink / raw)
To: thomas.lendacky@amd.com, tglx@linutronix.de, peterz@infradead.org,
Hansen, Dave, mingo@redhat.com, bp@alien8.de
Cc: Edgecombe, Rick P, seanjc@google.com, x86@kernel.org,
sagis@google.com, hpa@zytor.com, linux-kernel@vger.kernel.org,
kirill.shutemov@linux.intel.com, Chatre, Reinette,
pbonzini@redhat.com, Williams, Dan J, Yamahata, Isaku,
ashish.kalra@amd.com, nik.borisov@suse.com
On Wed, 2025-05-14 at 07:37 -0500, Tom Lendacky wrote:
> On 5/10/25 06:20, Kai Huang wrote:
> > Hi Dave,
> >
> > This series is the latest attempt to support kexec on TDX host following
> > your suggestion to use a percpu boolean to control WBINVD during kexec.
> > I appreciate if you can help to review.
> >
> > The last attempt was to have one patch to make TDX and kexec mutually
> > exclusive at runtime while allowing them to be both enabled in Kconfig,
> > but it turns out to be overkill. Dave proposed another option of using
> > a percpu boolean to track the need for flushing:
> >
> > https://lore.kernel.org/lkml/20250416230259.97989-1-kai.huang@intel.com/
> >
> > One advantage of the percpu boolean is for TDX we could do optimization
> > to do:
> >
> > wbinvd();
> > percpu(boolean) = 0;
> >
> > for all CPUs at early stage to avoid having to do WBINVD in
> > stop_this_cpu() at kexec time. I made a RFC patch to show the idea:
> >
> > https://github.com/intel/tdx/commit/d9f0123b1d63ba24f472da6971181939ce53d2e3
> >
> > I'll also reply this RFC patch to this coverletter in case anyone wants
> > to have a discussion. Nevertheless, my expectation is this series can be
> > merged first w/o the RFC patch. We can follow up the optimizations
> > later.
> >
> > This series is tagged v2, since it's a closer follow on to the RFC
> > patchset, which was posted before the single patch:
> >
> > https://lore.kernel.org/all/cover.1741778537.git.kai.huang@intel.com/
> >
> > This series is based on today's tip/master.
>
> I'm on PTO for the next few days, so I'll try to review/test the SME
> related changes when I return.
>
Thanks Tom, I appreciate your help!
* Re: [PATCH v2.1 2/5] x86/virt/tdx: Mark memory cache state incoherent when making SEAMCALL
2025-05-14 15:52 ` Dave Hansen
@ 2025-05-14 22:13 ` Huang, Kai
0 siblings, 0 replies; 15+ messages in thread
From: Huang, Kai @ 2025-05-14 22:13 UTC (permalink / raw)
To: tglx@linutronix.de, peterz@infradead.org, mingo@redhat.com,
Hansen, Dave, bp@alien8.de
Cc: Edgecombe, Rick P, seanjc@google.com, x86@kernel.org,
sagis@google.com, hpa@zytor.com, linux-kernel@vger.kernel.org,
kirill.shutemov@linux.intel.com, Chatre, Reinette,
pbonzini@redhat.com, Williams, Dan J, Yamahata, Isaku,
ashish.kalra@amd.com, thomas.lendacky@amd.com,
nik.borisov@suse.com
On Wed, 2025-05-14 at 08:52 -0700, Hansen, Dave wrote:
> On 5/14/25 03:10, Kai Huang wrote:
> > Turn on that boolean when the kernel does SEAMCALL so that kexec can
> > correctly flush cache. Note not all SEAMCALL leaf functions generate
> > dirty cachelines of TDX private memory, but for simplicity, just treat
> > all of them do.
>
> It's not just for simplicity.
>
> There's no contract in place for when the TDX module will dirty memory
> or not. A call that is "clean" today might dirty memory tomorrow.
>
> The _only_ thing we know is that SEAMCALLs can dirty cachelines. We
> don't know when or how they do it. This blurb makes it sound like it's
> possible to optimize this. It's not.
I thought it should not be possible to generate dirty cachelines with TDX
private KeyIDs before the very first TDX KeyID (which is the global KeyID)
is configured, but you're right, I guess we'd better not assume that.
I'll remove the "Note ..." part. Thanks!
* Re: [PATCH v2 0/5] TDX host: kexec/kdump support
2025-05-14 12:37 ` [PATCH v2 0/5] TDX host: kexec/kdump support Tom Lendacky
2025-05-14 22:09 ` Huang, Kai
@ 2025-05-29 11:06 ` Huang, Kai
1 sibling, 0 replies; 15+ messages in thread
From: Huang, Kai @ 2025-05-29 11:06 UTC (permalink / raw)
To: thomas.lendacky@amd.com, tglx@linutronix.de, peterz@infradead.org,
Hansen, Dave, mingo@redhat.com, bp@alien8.de
Cc: Edgecombe, Rick P, seanjc@google.com, x86@kernel.org,
sagis@google.com, hpa@zytor.com, linux-kernel@vger.kernel.org,
kirill.shutemov@linux.intel.com, Chatre, Reinette,
pbonzini@redhat.com, Williams, Dan J, Yamahata, Isaku,
ashish.kalra@amd.com, nik.borisov@suse.com
On Wed, 2025-05-14 at 07:37 -0500, Tom Lendacky wrote:
> On 5/10/25 06:20, Kai Huang wrote:
> > Hi Dave,
> >
> > This series is the latest attempt to support kexec on TDX host following
> > your suggestion to use a percpu boolean to control WBINVD during kexec.
> > I appreciate if you can help to review.
> >
> > The last attempt was to have one patch to make TDX and kexec mutually
> > exclusive at runtime while allowing them to be both enabled in Kconfig,
> > but it turns out to be overkill. Dave proposed another option of using
> > a percpu boolean to track the need for flushing:
> >
> > https://lore.kernel.org/lkml/20250416230259.97989-1-kai.huang@intel.com/
> >
> > One advantage of the percpu boolean is for TDX we could do optimization
> > to do:
> >
> > wbinvd();
> > percpu(boolean) = 0;
> >
> > for all CPUs at early stage to avoid having to do WBINVD in
> > stop_this_cpu() at kexec time. I made a RFC patch to show the idea:
> >
> > https://github.com/intel/tdx/commit/d9f0123b1d63ba24f472da6971181939ce53d2e3
> >
> > I'll also reply this RFC patch to this coverletter in case anyone wants
> > to have a discussion. Nevertheless, my expectation is this series can be
> > merged first w/o the RFC patch. We can follow up the optimizations
> > later.
> >
> > This series is tagged v2, since it's a closer follow on to the RFC
> > patchset, which was posted before the single patch:
> >
> > https://lore.kernel.org/all/cover.1741778537.git.kai.huang@intel.com/
> >
> > This series is based on today's tip/master.
>
> I'm on PTO for the next few days, so I'll try to review/test the SME
> related changes when I return.
>
Hi Tom,
I hope you had a good time off. :-)
I'd appreciate it if you could help review/test. This series can still be
applied cleanly on today's tip/master. :-)
* Re: [RFC PATCH v2 6/5] KVM: TDX: Explicitly do WBINVD upon reboot notifier
2025-05-10 11:25 ` [RFC PATCH v2 6/5] KVM: TDX: Explicitly do WBINVD upon reboot notifier Kai Huang
@ 2025-06-13 11:36 ` Paolo Bonzini
2025-06-16 10:39 ` Huang, Kai
0 siblings, 1 reply; 15+ messages in thread
From: Paolo Bonzini @ 2025-06-13 11:36 UTC (permalink / raw)
To: Kai Huang, dave.hansen, bp, tglx, peterz, mingo
Cc: kirill.shutemov, hpa, x86, linux-kernel, seanjc, rick.p.edgecombe,
isaku.yamahata, reinette.chatre, dan.j.williams, thomas.lendacky,
ashish.kalra, nik.borisov, sagis
On 5/10/25 13:25, Kai Huang wrote:
> On TDX platforms, during kexec, the kernel needs to make sure there's no
> dirty cachelines of TDX private memory before booting to the new kernel
> to avoid silent memory corruption to the new kernel.
>
> During kexec, the kexec-ing CPU firstly invokes native_stop_other_cpus()
> to stop all remote CPUs before booting to the new kernel. The remote
> CPUs will then execute stop_this_cpu() to stop themselves.
>
> The kernel has a percpu boolean to indicate whether the cache of a CPU
> may be in incoherent state. In stop_this_cpu(), the kernel does WBINVD
> if that percpu boolean is true.
>
> TDX turns on that percpu boolean on a CPU when the kernel does SEAMCALL.
> > This makes sure the caches will be flushed during kexec.
>
> However, the native_stop_other_cpus() and stop_this_cpu() have a "race"
> which is extremely rare to happen but if did could cause system to hang.
s/if did//
> Specifically, the native_stop_other_cpus() firstly sends normal reboot
> > IPI to remote CPUs and waits one second for them to stop. If that times
> out, native_stop_other_cpus() then sends NMIs to remote CPUs to stop
> them.
>
> The aforementioned race happens when NMIs are sent. Doing WBINVD in
> stop_this_cpu() makes each CPU take longer time to stop and increases
> the chance of the race to happen.
>
> Register reboot notifier in KVM to explcitly flush caches upon reboot
> for TDX. This brings doing WBINVD at earlier stage and aovids the
> WBINVD in stop_this_cpu(), eliminating the possibility of increasing the
> chance of the aforementioned race.
"This moves the WBINVD to an earlier stage than stop_this_cpus(),
avoiding a possibly lengthy operation at a time where it could cause
this race."
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
Waiting for v3. :)
Paolo
> Signed-off-by: Kai Huang <kai.huang@intel.com>
> ---
> arch/x86/include/asm/tdx.h | 3 +++
> arch/x86/kvm/vmx/tdx.c | 45 +++++++++++++++++++++++++++++++++++++
> arch/x86/virt/vmx/tdx/tdx.c | 9 ++++++++
> 3 files changed, 57 insertions(+)
>
> diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
> index 91dc6e6bdd97..d0156bf0b966 100644
> --- a/arch/x86/include/asm/tdx.h
> +++ b/arch/x86/include/asm/tdx.h
> @@ -221,6 +221,8 @@ u64 tdh_mem_page_remove(struct tdx_td *td, u64 gpa, u64 level, u64 *ext_err1, u6
> u64 tdh_phymem_cache_wb(bool resume);
> u64 tdh_phymem_page_wbinvd_tdr(struct tdx_td *td);
> u64 tdh_phymem_page_wbinvd_hkid(u64 hkid, struct page *page);
> +
> +void tdx_cpu_flush_cache(void);
> #else
> static inline void tdx_init(void) { }
> static inline int tdx_cpu_enable(void) { return -ENODEV; }
> @@ -228,6 +230,7 @@ static inline int tdx_enable(void) { return -ENODEV; }
> static inline u32 tdx_get_nr_guest_keyids(void) { return 0; }
> static inline const char *tdx_dump_mce_info(struct mce *m) { return NULL; }
> static inline const struct tdx_sys_info *tdx_get_sysinfo(void) { return NULL; }
> +static inline void tdx_cpu_flush_cache(void) { }
> #endif /* CONFIG_INTEL_TDX_HOST */
>
> #endif /* !__ASSEMBLER__ */
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index b952bc673271..3b92b3999855 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -5,7 +5,9 @@
> #include <asm/fpu/xcr.h>
> #include <linux/misc_cgroup.h>
> #include <linux/mmu_context.h>
> +#include <linux/reboot.h>
> #include <asm/tdx.h>
> +#include <asm/processor.h>
> #include "capabilities.h"
> #include "mmu.h"
> #include "x86_ops.h"
> @@ -3278,6 +3280,33 @@ static int tdx_offline_cpu(unsigned int cpu)
> return -EBUSY;
> }
>
> +static void smp_func_cpu_flush_cache(void *unused)
> +{
> + tdx_cpu_flush_cache();
> +}
> +
> +static int tdx_reboot_notify(struct notifier_block *nb, unsigned long code,
> + void *unused)
> +{
> + /*
> + * Flush cache for all CPUs upon the reboot notifier. This
> + * avoids having to do WBINVD in stop_this_cpu() during kexec.
> + *
> + * Kexec calls native_stop_other_cpus() to stop remote CPUs
> + * before booting to new kernel, but that code has a "race"
> + * when the normal REBOOT IPI times out and NMIs are sent to
> + * remote CPUs to stop them. Doing WBINVD in stop_this_cpu()
> + * could potentially increase the possibility of the "race".
> + */
> + if (code == SYS_RESTART)
> + on_each_cpu(smp_func_cpu_flush_cache, NULL, 1);
> + return NOTIFY_DONE;
> +}
> +
> +static struct notifier_block tdx_reboot_nb = {
> + .notifier_call = tdx_reboot_notify,
> +};
> +
> static void __do_tdx_cleanup(void)
> {
> /*
> @@ -3435,6 +3464,11 @@ void tdx_cleanup(void)
> {
> if (enable_tdx) {
> misc_cg_set_capacity(MISC_CG_RES_TDX, 0);
> + /*
> + * Ignore the return value. See the comment in
> + * tdx_bringup().
> + */
> + unregister_reboot_notifier(&tdx_reboot_nb);
> __tdx_cleanup();
> kvm_disable_virtualization();
> }
> @@ -3518,6 +3552,17 @@ int __init tdx_bringup(void)
> enable_tdx = 0;
> }
>
> + if (enable_tdx)
> + /*
> + * Ignore the return value. @tdx_reboot_nb is used to flush
> + * cache for all CPUs upon rebooting to avoid having to do
> + * WBINVD in kexec while the kexec-ing CPU stops all remote
> + * CPUs. Failure to register isn't fatal, because if KVM
> + * doesn't flush cache explicitly upon rebooting the kexec
> + * will do it anyway.
> + */
> + register_reboot_notifier(&tdx_reboot_nb);
> +
> return r;
>
> success_disable_tdx:
> diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> index af8798bc62ed..7478230cdc33 100644
> --- a/arch/x86/virt/vmx/tdx/tdx.c
> +++ b/arch/x86/virt/vmx/tdx/tdx.c
> @@ -1890,3 +1890,12 @@ u64 tdh_phymem_page_wbinvd_hkid(u64 hkid, struct page *page)
> return seamcall(TDH_PHYMEM_PAGE_WBINVD, &args);
> }
> EXPORT_SYMBOL_GPL(tdh_phymem_page_wbinvd_hkid);
> +
> +void tdx_cpu_flush_cache(void)
> +{
> + lockdep_assert_preemption_disabled();
> +
> + wbinvd();
> + this_cpu_write(cache_state_incoherent, false);
> +}
> +EXPORT_SYMBOL_GPL(tdx_cpu_flush_cache);
* Re: [RFC PATCH v2 6/5] KVM: TDX: Explicitly do WBINVD upon reboot notifier
2025-06-13 11:36 ` Paolo Bonzini
@ 2025-06-16 10:39 ` Huang, Kai
0 siblings, 0 replies; 15+ messages in thread
From: Huang, Kai @ 2025-06-16 10:39 UTC (permalink / raw)
To: tglx@linutronix.de, peterz@infradead.org, pbonzini@redhat.com,
Hansen, Dave, mingo@redhat.com, bp@alien8.de
Cc: Edgecombe, Rick P, seanjc@google.com, x86@kernel.org,
sagis@google.com, hpa@zytor.com, Chatre, Reinette,
kirill.shutemov@linux.intel.com, Williams, Dan J,
linux-kernel@vger.kernel.org, thomas.lendacky@amd.com,
Yamahata, Isaku, ashish.kalra@amd.com, nik.borisov@suse.com
On Fri, 2025-06-13 at 13:36 +0200, Paolo Bonzini wrote:
> On 5/10/25 13:25, Kai Huang wrote:
> > On TDX platforms, during kexec, the kernel needs to make sure there's no
> > dirty cachelines of TDX private memory before booting to the new kernel
> > to avoid silent memory corruption to the new kernel.
> >
> > During kexec, the kexec-ing CPU firstly invokes native_stop_other_cpus()
> > to stop all remote CPUs before booting to the new kernel. The remote
> > CPUs will then execute stop_this_cpu() to stop themselves.
> >
> > The kernel has a percpu boolean to indicate whether the cache of a CPU
> > may be in incoherent state. In stop_this_cpu(), the kernel does WBINVD
> > if that percpu boolean is true.
> >
> > TDX turns on that percpu boolean on a CPU when the kernel does SEAMCALL.
> > > This makes sure the caches will be flushed during kexec.
> >
> > However, the native_stop_other_cpus() and stop_this_cpu() have a "race"
> > which is extremely rare to happen but if did could cause system to hang.
>
> s/if did//
>
> > Specifically, the native_stop_other_cpus() firstly sends normal reboot
> > > IPI to remote CPUs and waits one second for them to stop. If that times
> > out, native_stop_other_cpus() then sends NMIs to remote CPUs to stop
> > them.
> >
> > The aforementioned race happens when NMIs are sent. Doing WBINVD in
> > stop_this_cpu() makes each CPU take longer time to stop and increases
> > the chance of the race to happen.
> >
> > Register reboot notifier in KVM to explcitly flush caches upon reboot
> > for TDX. This brings doing WBINVD at earlier stage and aovids the
> > WBINVD in stop_this_cpu(), eliminating the possibility of increasing the
> > chance of the aforementioned race.
>
> "This moves the WBINVD to an earlier stage than stop_this_cpus(),
> avoiding a possibly lengthy operation at a time where it could cause
> this race."
>
> Acked-by: Paolo Bonzini <pbonzini@redhat.com>
>
> Waiting for v3. :)
Thanks Paolo, and will do!
end of thread, other threads:[~2025-06-16 10:39 UTC | newest]
Thread overview: 15+ messages
2025-05-10 11:20 [PATCH v2 0/5] TDX host: kexec/kdump support Kai Huang
2025-05-10 11:20 ` [PATCH v2 1/5] x86/sme: Use percpu boolean to control wbinvd during kexec Kai Huang
2025-05-10 11:20 ` [PATCH v2 2/5] x86/virt/tdx: Mark memory cache state incoherent when making SEAMCALL Kai Huang
2025-05-14 10:10 ` [PATCH v2.1 " Kai Huang
2025-05-14 15:52 ` Dave Hansen
2025-05-14 22:13 ` Huang, Kai
2025-05-10 11:20 ` [PATCH v2 3/5] x86/kexec: Disable kexec/kdump on platforms with TDX partial write erratum Kai Huang
2025-05-10 11:20 ` [PATCH v2 4/5] x86/virt/tdx: Remove the !KEXEC_CORE dependency Kai Huang
2025-05-10 11:20 ` [PATCH v2 5/5] x86/virt/tdx: Update the kexec section in the TDX documentation Kai Huang
2025-05-10 11:25 ` [RFC PATCH v2 6/5] KVM: TDX: Explicitly do WBINVD upon reboot notifier Kai Huang
2025-06-13 11:36 ` Paolo Bonzini
2025-06-16 10:39 ` Huang, Kai
2025-05-14 12:37 ` [PATCH v2 0/5] TDX host: kexec/kdump support Tom Lendacky
2025-05-14 22:09 ` Huang, Kai
2025-05-29 11:06 ` Huang, Kai