* [PATCH v3 0/6] TDX host: kexec/kdump support
@ 2025-06-26 10:48 Kai Huang
  2025-06-26 10:48 ` [PATCH v3 1/6] x86/sme: Use percpu boolean to control wbinvd during kexec Kai Huang
                   ` (5 more replies)
  0 siblings, 6 replies; 42+ messages in thread
From: Kai Huang @ 2025-06-26 10:48 UTC (permalink / raw)
  To: dave.hansen, bp, tglx, peterz, mingo, hpa, thomas.lendacky
  Cc: x86, kirill.shutemov, rick.p.edgecombe, linux-kernel, pbonzini,
	seanjc, kvm, reinette.chatre, isaku.yamahata, dan.j.williams,
	ashish.kalra, nik.borisov, sagis

This series is the latest attempt to support kexec on TDX host following
Dave's suggestion to use a percpu boolean to control WBINVD during
kexec.

Hi Tom,

The first patch touches AMD SME code.  I would appreciate it if you could
help review and test it.  Please let me know if there's anything I can help
with in return. :-)

I've tested on a TDX system and it worked as expected.

v2 -> v3 (all trivial changes):

 - Rebase on latest tip/master
   - change to use __always_inline for do_seamcall() in patch 2
 - Update patch 2 (changelog and code comment) to remove the sentence
   which says "not all SEAMCALLs generate dirty cachelines of TDX
   private memory but just treat all of them do."  -- Dave.
 - Add Farrah's Tested-by for all TDX patches.

The v2 had an informal RFC patch appended to show an optimization which
moves the WBINVD from the kexec phase to an earlier stage in KVM.  Paolo
commented on and Acked that patch (thanks!), so this v3 includes it as a
formal patch (patch 6).  Technically it is not strictly needed in this
series and could also be done in the future.

More history info can be found in v2:

 https://lore.kernel.org/lkml/cover.1746874095.git.kai.huang@intel.com/

=== More information ===

TDX private memory is memory that is encrypted with private Host Key IDs
(HKID).  If the kernel has ever enabled TDX, part of system memory
remains TDX private memory when kexec happens.  E.g., the PAMT (Physical
Address Metadata Table) pages used by the TDX module to track each TDX
memory page's state are never freed once the TDX module is initialized.
TDX guests also have guest private memory and secure-EPT pages.

After kexec, the new kernel will have no knowledge of which memory pages
were used as TDX private pages, and it can use all memory as regular memory.

1) Cache flush

Per TDX 1.5 base spec "8.6.1. Platforms not Using ACT: Required Cache
Flush and Initialization by the Host VMM", to support kexec for TDX, the
kernel needs to flush the cache to make sure there are no dirty cachelines
of TDX private memory left over for the new kernel (when the TDX module
reports TDX_FEATURES.CLFLUSH_BEFORE_ALLOC as 1 in the global metadata for
the platform).  The kernel also needs to make sure there is no more TDX
activity (no SEAMCALL) after the cache flush so that no new dirty
cachelines of TDX private memory are generated.

SME has a similar requirement.  SME kexec support uses WBINVD to do the
cache flush.  WBINVD is able to flush cachelines associated with any
HKID, so reuse the WBINVD introduced by SME to flush the cache for TDX.

Currently the kernel explicitly checks whether the hardware supports SME
and only does WBINVD if true.  Instead of adding yet another TDX-specific
check, this series uses a percpu boolean to indicate whether WBINVD is
needed on that CPU during kexec.
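
In concrete terms the approach boils down to the sketch below (condensed
from patch 1; the boolean is set per CPU and consumed on the kexec paths):

  /* Set when this CPU's cache may hold dirty aliases of encrypted memory. */
  DEFINE_PER_CPU(bool, cache_state_incoherent);

  /* In early_detect_mem_encrypt(), when the CPU supports SME: */
  __this_cpu_write(cache_state_incoherent, true);

  /* In stop_this_cpu() (and similarly before relocate_kernel()): */
  if (this_cpu_read(cache_state_incoherent))
          wbinvd();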

2) Reset TDX private memory using MOVDIR64B

The TDX spec (the aforementioned section) also suggests the kernel
*should* use MOVDIR64B to clear a TDX private page before the kernel
reuses it as a regular one.

However, in reality the situation can be more flexible.  Per TDX 1.5
base spec ("Table 16.2: Non-ACT Platforms Checks on Memory Reads in Ci
Mode" and "Table 16.3: Non-ACT Platforms Checks on Memory Reads in Li
Mode"), reads/writes to TDX private memory using a shared KeyID without
integrity check enabled will not poison the memory or cause a machine
check.

Note that on platforms with the ACT (Access Control Table), there is no
integrity check involved, thus no machine check can happen due to memory
reads/writes using different KeyIDs.

KeyID 0 (the TME key) doesn't support integrity check.  This series
chooses NOT to reset TDX private memory but to leave it as-is for the new
kernel.  As mentioned above, in practice it is safe to do so.
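
For completeness, resetting a private page with MOVDIR64B would look
roughly like the sketch below.  This is hypothetical and not part of this
series; it assumes the kernel's movdir64b() helper and uses the zero page
as the 64-byte source:

  /* Hypothetical: clear one TDX private page with MOVDIR64B. */
  static void tdx_reset_private_page(struct page *page)
  {
          const void *zero = page_to_virt(ZERO_PAGE(0));
          void *dst = page_to_virt(page);
          unsigned long i;

          /* MOVDIR64B does a 64-byte direct store per iteration. */
          for (i = 0; i < PAGE_SIZE; i += 64)
                  movdir64b(dst + i, zero);

          /* Make sure the stores are visible before the page is reused. */
          mb();
  }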

3) One limitation

If the kernel has ever enabled TDX, after kexec the new kernel won't be
able to use TDX anymore.  This is because when the new kernel tries to
initialize the TDX module it will fail on the first SEAMCALL, since the
module has already been initialized by the old kernel.

More (non-trivial) work will be needed for the new kernel to use TDX,
e.g., one solution is to just reload the TDX module from the location
where the BIOS loads it (/boot/efi/EFI/TDX/).  This series doesn't cover
this but leaves it as future work.

4) Kdump support

This series also enables kdump with TDX, but no special handling is
needed for crash kexec (except turning on the Kconfig option):

 - The kdump kernel uses reserved memory from the old kernel as system RAM,
   and the old kernel will never use the reserved memory as TDX memory.
 - /proc/vmcore contains TDX private memory pages.  It's meaningless to
   read them, but it doesn't do any harm either.

5) TDX "partial write machine check" erratum

On platforms with the TDX erratum, a partial write (a write transaction
of less than a cacheline that lands at the memory controller) to TDX
private memory poisons that memory, and a subsequent read triggers a
machine check.  On those platforms, the kernel needs to reset TDX private
memory before jumping to the new kernel, otherwise the new kernel may see
an unexpected machine check.

The kernel currently doesn't track which pages are TDX private memory,
and resetting TDX private memory is not trivial.  For simplicity, this
series simply disables kexec/kdump on such platforms.  This can be
enhanced in the future.
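
The check itself is trivial; it is essentially what patch 3 adds to
machine_kexec_prepare():

  /* Reject loading a kexec/kdump image on platforms with the erratum. */
  if (boot_cpu_has_bug(X86_BUG_TDX_PW_MCE)) {
          pr_info_once("Not allowed on platform with tdx_pw_mce bug\n");
          return -EOPNOTSUPP;
  }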



Kai Huang (6):
  x86/sme: Use percpu boolean to control wbinvd during kexec
  x86/virt/tdx: Mark memory cache state incoherent when making SEAMCALL
  x86/kexec: Disable kexec/kdump on platforms with TDX partial write
    erratum
  x86/virt/tdx: Remove the !KEXEC_CORE dependency
  x86/virt/tdx: Update the kexec section in the TDX documentation
  KVM: TDX: Explicitly do WBINVD upon reboot notifier

 Documentation/arch/x86/tdx.rst       | 14 ++++-----
 arch/x86/Kconfig                     |  1 -
 arch/x86/include/asm/kexec.h         |  2 +-
 arch/x86/include/asm/processor.h     |  2 ++
 arch/x86/include/asm/tdx.h           | 32 +++++++++++++++++++-
 arch/x86/kernel/cpu/amd.c            | 16 ++++++++++
 arch/x86/kernel/machine_kexec_64.c   | 31 +++++++++++++++----
 arch/x86/kernel/process.c            | 16 ++--------
 arch/x86/kernel/relocate_kernel_64.S | 15 +++++++---
 arch/x86/kvm/vmx/tdx.c               | 45 ++++++++++++++++++++++++++++
 arch/x86/virt/vmx/tdx/tdx.c          |  9 ++++++
 11 files changed, 151 insertions(+), 32 deletions(-)


base-commit: bda8bfc862a1bc1cb2de38145d99ae50ad90b667
-- 
2.49.0


^ permalink raw reply	[flat|nested] 42+ messages in thread

* [PATCH v3 1/6] x86/sme: Use percpu boolean to control wbinvd during kexec
  2025-06-26 10:48 [PATCH v3 0/6] TDX host: kexec/kdump support Kai Huang
@ 2025-06-26 10:48 ` Kai Huang
  2025-06-26 17:59   ` Edgecombe, Rick P
                     ` (2 more replies)
  2025-06-26 10:48 ` [PATCH v3 2/6] x86/virt/tdx: Mark memory cache state incoherent when making SEAMCALL Kai Huang
                   ` (4 subsequent siblings)
  5 siblings, 3 replies; 42+ messages in thread
From: Kai Huang @ 2025-06-26 10:48 UTC (permalink / raw)
  To: dave.hansen, bp, tglx, peterz, mingo, hpa, thomas.lendacky
  Cc: x86, kirill.shutemov, rick.p.edgecombe, linux-kernel, pbonzini,
	seanjc, kvm, reinette.chatre, isaku.yamahata, dan.j.williams,
	ashish.kalra, nik.borisov, sagis

On SME platforms, at hardware level dirty cachelines with and without
encryption bit of the same memory can coexist, and the CPU can flush
them back to memory in random order.  During kexec, the caches must be
flushed before jumping to the new kernel to avoid silent memory
corruption when a cacheline with a different encryption property is
written back over whatever encryption properties the new kernel is
using.

TDX also needs cache flush during kexec for the same reason.  It would
be good to implement a generic way to flush cache instead of scattering
checks for each feature all around.

During kexec, the kexec-ing CPU sends IPIs to all remote CPUs to stop
them before it boots to the new kernel.  For SME the kernel basically
encrypts all memory including the kernel itself by manipulating page
tables.  A simple memory write from the kernel could dirty cachelines.

Currently, the kernel uses WBINVD to flush cache for SME during kexec in
two places:

1) the one in stop_this_cpu() for all remote CPUs when the kexec-ing CPU
   stops them;
2) the one in the relocate_kernel() where the kexec-ing CPU jumps to the
   new kernel.

Unlike SME, TDX can only dirty cachelines when it is used (i.e., when
SEAMCALLs are performed).  Since there are no more SEAMCALLs after the
aforementioned WBINVDs, leverage this for TDX.

In order to have a generic way to cover both SME and TDX, use a percpu
boolean to indicate the cache may be in an incoherent state thus cache
flush is needed during kexec, and turn on the boolean for SME.  TDX can
then leverage it by also turning the boolean on.

A percpu boolean isn't strictly needed for SME since it is activated at
very early boot time and on all CPUs.  A global flag would be
sufficient.  But using a percpu variable has two benefits.  Foremost,
the state that is being tracked here (percpu cache coherency situation
requiring a flush) is percpu, so a percpu state is a more direct and
natural fit.

Secondly, it will fit TDX's usage better since the percpu var can be set
when a CPU makes a SEAMCALL, and cleared when another WBINVD on the CPU
obviates the need for a kexec-time WBINVD.  Saving kexec-time WBINVD is
valuable, because there is an existing race[*] where kexec could proceed
while another CPU is active.  WBINVD could make this race worse, so it's
worth skipping it when possible.

Today the first WBINVD in the stop_this_cpu() is performed when SME is
*supported* by the platform, and the second WBINVD is done in
relocate_kernel() when SME is *activated* by the kernel.  Make things
simple by changing to do the second WBINVD when the platform supports
SME.  This allows the kernel to simply turn on this percpu boolean when
bringing up a CPU by checking whether the platform supports SME.

No other functional change intended.

Also, currently machine_kexec() has a comment to explain why no function
call is allowed after load_segments().  After changing to use the new
percpu boolean to control whether to perform WBINVD when calling the
relocate_kernel(), that comment isn't needed anymore.  But it is still a
useful comment, so expand the comment around load_segments() to mention
that due to depth tracking no function call can be made after
load_segments().

[*] The "race" in native_stop_other_cpus()

Commit

  1f5e7eb7868e: ("x86/smp: Make stop_other_cpus() more robust")

introduced a new 'cpus_stop_mask' to resolve an "intermittent lockups on
poweroff" issue which was caused by the WBINVD in stop_this_cpu().
Specifically, the new cpumask resolved the below problem mentioned in
that commit:

    CPU0                                    CPU1

     stop_other_cpus()
       send_IPIs(REBOOT);                   stop_this_cpu()
       while (num_online_cpus() > 1);         set_online(false);
       proceed... -> hang
                                              wbinvd()

While it fixed the reported issue, that commit explained the new cpumask
"cannot plug all holes either".

This is because it doesn't address the "race" mentioned in the #3 in the
comment in native_stop_other_cpus():

        /*
         * 1) Send an IPI on the reboot vector to all other CPUs.
         *
         *    The other CPUs should react on it after leaving critical
         *    sections and re-enabling interrupts. They might still hold
         *    locks, but there is nothing which can be done about that.
         *
         * 2) Wait for all other CPUs to report that they reached the
         *    HLT loop in stop_this_cpu()
         *
         * 3) If #2 timed out send an NMI to the CPUs which did not
         *    yet report
         *
         * 4) Wait for all other CPUs to report that they reached the
         *    HLT loop in stop_this_cpu()
         *
         * #3 can obviously race against a CPU reaching the HLT loop late.
         * That CPU will have reported already and the "have all CPUs
         * reached HLT" condition will be true despite the fact that the
         * other CPU is still handling the NMI. Again, there is no
         * protection against that as "disabled" APICs still respond to
         * NMIs.
         */

Consider below case:

    CPU 0					CPU 1

    native_stop_other_cpus()			stop_this_cpu()

	// sends REBOOT IPI to stop remote CPUs
						...
						wbinvd();

	// wait times out, try NMI
	if (!cpumask_empty(&cpus_stop_mask)) {
		for_each_cpu(cpu, &cpus_stop_mask) {

						...
						cpumask_clear_cpu(cpu,
							&cpus_stop_mask);
						hlt;

			// send NMI 	--->
						wakeup from hlt and run
						stop_this_cpu():

			// WAIT CPUs TO STOP
			while (!cpumask_empty(
			    &cpus_stop_mask) &&
				...)
		}
						...
		proceed ...			wbinvd();
						...
						hlt;

The "WAIT CPUs TO STOP" is supposed to wait until all remote CPUs are
stopped, but actually it quits immediately because the remote CPUs have
been cleared in cpus_stop_mask when stop_this_cpu() is called from the
REBOOT IPI.

Doing WBINVD in stop_this_cpu() could potentially increase the chance of
triggering the above "race", even though it is still rare.

Signed-off-by: Kai Huang <kai.huang@intel.com>
---
 arch/x86/include/asm/kexec.h         |  2 +-
 arch/x86/include/asm/processor.h     |  2 ++
 arch/x86/kernel/cpu/amd.c            | 16 ++++++++++++++++
 arch/x86/kernel/machine_kexec_64.c   | 15 ++++++++++-----
 arch/x86/kernel/process.c            | 16 +++-------------
 arch/x86/kernel/relocate_kernel_64.S | 15 +++++++++++----
 6 files changed, 43 insertions(+), 23 deletions(-)

diff --git a/arch/x86/include/asm/kexec.h b/arch/x86/include/asm/kexec.h
index f2ad77929d6e..d7e93522b93d 100644
--- a/arch/x86/include/asm/kexec.h
+++ b/arch/x86/include/asm/kexec.h
@@ -122,7 +122,7 @@ relocate_kernel_fn(unsigned long indirection_page,
 		   unsigned long pa_control_page,
 		   unsigned long start_address,
 		   unsigned int preserve_context,
-		   unsigned int host_mem_enc_active);
+		   unsigned int cache_incoherent);
 #endif
 extern relocate_kernel_fn relocate_kernel;
 #define ARCH_HAS_KIMAGE_ARCH
diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index bde58f6510ac..a24c7805acdb 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -731,6 +731,8 @@ void __noreturn stop_this_cpu(void *dummy);
 void microcode_check(struct cpuinfo_x86 *prev_info);
 void store_cpu_caps(struct cpuinfo_x86 *info);
 
+DECLARE_PER_CPU(bool, cache_state_incoherent);
+
 enum l1tf_mitigations {
 	L1TF_MITIGATION_OFF,
 	L1TF_MITIGATION_AUTO,
diff --git a/arch/x86/kernel/cpu/amd.c b/arch/x86/kernel/cpu/amd.c
index f18f540db58c..4c7fde344216 100644
--- a/arch/x86/kernel/cpu/amd.c
+++ b/arch/x86/kernel/cpu/amd.c
@@ -503,6 +503,22 @@ static void early_detect_mem_encrypt(struct cpuinfo_x86 *c)
 {
 	u64 msr;
 
+	/*
+	 * Mark using wbinvd is needed during kexec on processors that
+	 * support SME. This provides support for performing a successful
+	 * kexec when going from SME inactive to SME active (or vice-versa).
+	 *
+	 * The cache must be cleared so that if there are entries with the
+	 * same physical address, both with and without the encryption bit,
+	 * they don't race each other when flushed and potentially end up
+	 * with the wrong entry being committed to memory.
+	 *
+	 * Test the CPUID bit directly because the machine might've cleared
+	 * X86_FEATURE_SME due to cmdline options.
+	 */
+	if (c->extended_cpuid_level >= 0x8000001f && (cpuid_eax(0x8000001f) & BIT(0)))
+		__this_cpu_write(cache_state_incoherent, true);
+
 	/*
 	 * BIOS support is required for SME and SEV.
 	 *   For SME: If BIOS has enabled SME then adjust x86_phys_bits by
diff --git a/arch/x86/kernel/machine_kexec_64.c b/arch/x86/kernel/machine_kexec_64.c
index 697fb99406e6..4519c7b75c49 100644
--- a/arch/x86/kernel/machine_kexec_64.c
+++ b/arch/x86/kernel/machine_kexec_64.c
@@ -29,6 +29,7 @@
 #include <asm/set_memory.h>
 #include <asm/cpu.h>
 #include <asm/efi.h>
+#include <asm/processor.h>
 
 #ifdef CONFIG_ACPI
 /*
@@ -384,15 +385,15 @@ void __nocfi machine_kexec(struct kimage *image)
 {
 	unsigned long reloc_start = (unsigned long)__relocate_kernel_start;
 	relocate_kernel_fn *relocate_kernel_ptr;
-	unsigned int host_mem_enc_active;
+	unsigned int cache_incoherent;
 	int save_ftrace_enabled;
 	void *control_page;
 
 	/*
-	 * This must be done before load_segments() since if call depth tracking
-	 * is used then GS must be valid to make any function calls.
+	 * This must be done before load_segments(), since it resets
+	 * GS to 0 and percpu data needs the correct GS to work.
 	 */
-	host_mem_enc_active = cc_platform_has(CC_ATTR_HOST_MEM_ENCRYPT);
+	cache_incoherent = this_cpu_read(cache_state_incoherent);
 
 #ifdef CONFIG_KEXEC_JUMP
 	if (image->preserve_context)
@@ -436,6 +437,10 @@ void __nocfi machine_kexec(struct kimage *image)
 	 *
 	 * Take advantage of this here by force loading the segments,
 	 * before the GDT is zapped with an invalid value.
+	 *
+	 * load_segments() resets GS to 0.  Don't make any function call
+	 * after here since call depth tracking uses percpu variables to
+	 * operate (relocate_kernel() is explicitly ignored by it).
 	 */
 	load_segments();
 
@@ -444,7 +449,7 @@ void __nocfi machine_kexec(struct kimage *image)
 					   virt_to_phys(control_page),
 					   image->start,
 					   image->preserve_context,
-					   host_mem_enc_active);
+					   cache_incoherent);
 
 #ifdef CONFIG_KEXEC_JUMP
 	if (image->preserve_context)
diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index 7b94851bb37e..6b5edfbefa9a 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -88,6 +88,8 @@ EXPORT_PER_CPU_SYMBOL(cpu_tss_rw);
 DEFINE_PER_CPU(bool, __tss_limit_invalid);
 EXPORT_PER_CPU_SYMBOL_GPL(__tss_limit_invalid);
 
+DEFINE_PER_CPU(bool, cache_state_incoherent);
+
 /*
  * this gets called so that we can store lazy state into memory and copy the
  * current task into the new thread.
@@ -827,19 +829,7 @@ void __noreturn stop_this_cpu(void *dummy)
 	disable_local_APIC();
 	mcheck_cpu_clear(c);
 
-	/*
-	 * Use wbinvd on processors that support SME. This provides support
-	 * for performing a successful kexec when going from SME inactive
-	 * to SME active (or vice-versa). The cache must be cleared so that
-	 * if there are entries with the same physical address, both with and
-	 * without the encryption bit, they don't race each other when flushed
-	 * and potentially end up with the wrong entry being committed to
-	 * memory.
-	 *
-	 * Test the CPUID bit directly because the machine might've cleared
-	 * X86_FEATURE_SME due to cmdline options.
-	 */
-	if (c->extended_cpuid_level >= 0x8000001f && (cpuid_eax(0x8000001f) & BIT(0)))
+	if (this_cpu_read(cache_state_incoherent))
 		wbinvd();
 
 	/*
diff --git a/arch/x86/kernel/relocate_kernel_64.S b/arch/x86/kernel/relocate_kernel_64.S
index ea604f4d0b52..34b3a5e4fe49 100644
--- a/arch/x86/kernel/relocate_kernel_64.S
+++ b/arch/x86/kernel/relocate_kernel_64.S
@@ -67,7 +67,7 @@ SYM_CODE_START_NOALIGN(relocate_kernel)
 	 * %rsi pa_control_page
 	 * %rdx start address
 	 * %rcx preserve_context
-	 * %r8  host_mem_enc_active
+	 * %r8  cache_incoherent
 	 */
 
 	/* Save the CPU context, used for jumping back */
@@ -129,7 +129,7 @@ SYM_CODE_START_LOCAL_NOALIGN(identity_mapped)
 	/*
 	 * %rdi	indirection page
 	 * %rdx start address
-	 * %r8 host_mem_enc_active
+	 * %r8 cache_incoherent
 	 * %r9 page table page
 	 * %r11 preserve_context
 	 * %r13 original CR4 when relocate_kernel() was invoked
@@ -200,14 +200,21 @@ SYM_CODE_START_LOCAL_NOALIGN(identity_mapped)
 	movq	%r9, %cr3
 
 	/*
+	 * If the memory cache is in incoherent state, e.g., due to
+	 * memory encryption, do wbinvd to flush cache.
+	 *
 	 * If SME is active, there could be old encrypted cache line
 	 * entries that will conflict with the now unencrypted memory
 	 * used by kexec. Flush the caches before copying the kernel.
+	 *
+	 * Note SME sets this flag to true when the platform supports
+	 * SME, so the wbinvd is performed even when SME is not
+	 * activated by the kernel.  This does no harm.
 	 */
 	testq	%r8, %r8
-	jz .Lsme_off
+	jz .Lnowbinvd
 	wbinvd
-.Lsme_off:
+.Lnowbinvd:
 
 	call	swap_pages
 
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH v3 2/6] x86/virt/tdx: Mark memory cache state incoherent when making SEAMCALL
  2025-06-26 10:48 [PATCH v3 0/6] TDX host: kexec/kdump support Kai Huang
  2025-06-26 10:48 ` [PATCH v3 1/6] x86/sme: Use percpu boolean to control wbinvd during kexec Kai Huang
@ 2025-06-26 10:48 ` Kai Huang
  2025-06-26 18:37   ` Edgecombe, Rick P
  2025-06-26 10:48 ` [PATCH v3 3/6] x86/kexec: Disable kexec/kdump on platforms with TDX partial write erratum Kai Huang
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 42+ messages in thread
From: Kai Huang @ 2025-06-26 10:48 UTC (permalink / raw)
  To: dave.hansen, bp, tglx, peterz, mingo, hpa, thomas.lendacky
  Cc: x86, kirill.shutemov, rick.p.edgecombe, linux-kernel, pbonzini,
	seanjc, kvm, reinette.chatre, isaku.yamahata, dan.j.williams,
	ashish.kalra, nik.borisov, sagis, Farrah Chen

On TDX platforms, at hardware level dirty cachelines with and without
TDX keyID can coexist, and CPU can flush them back to memory in random
order.  During kexec, the caches must be flushed before jumping to the
new kernel to avoid silent memory corruption when a cacheline with a
different encryption property is written back over whatever encryption
properties the new kernel is using.

A percpu boolean is used to mark whether the cache of a given CPU may be
in an incoherent state, and the kexec performs WBINVD on the CPUs with
that boolean turned on.

For TDX, only the TDX module or the TDX guests can generate dirty
cachelines of TDX private memory, i.e., they are only generated when the
kernel does SEAMCALL.

Turn on that boolean when the kernel does SEAMCALL so that kexec can
correctly flush cache.

SEAMCALL can be made from both task context and IRQ disabled context.
Given that a SEAMCALL is just a lengthy instruction (e.g., thousands of
cycles) from the kernel's point of view and preempt_{disable|enable}() is
cheap compared to it, simply unconditionally disable preemption while
setting the percpu boolean and making the SEAMCALL.

Signed-off-by: Kai Huang <kai.huang@intel.com>
Tested-by: Farrah Chen <farrah.chen@intel.com>
---

v2 -> v3:
 - Change to use __always_inline for do_seamcall() to avoid indirect
   call instructions when making SEAMCALLs.
 - Remove the sentence "not all SEAMCALLs generate dirty cachelines of
   TDX private memory but just treat all of them do." in changelog and
   the code comment. -- Dave

---
 arch/x86/include/asm/tdx.h | 29 ++++++++++++++++++++++++++++-
 1 file changed, 28 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index 7ddef3a69866..d4c624c69d7f 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -102,10 +102,37 @@ u64 __seamcall_ret(u64 fn, struct tdx_module_args *args);
 u64 __seamcall_saved_ret(u64 fn, struct tdx_module_args *args);
 void tdx_init(void);
 
+#include <linux/preempt.h>
 #include <asm/archrandom.h>
+#include <asm/processor.h>
 
 typedef u64 (*sc_func_t)(u64 fn, struct tdx_module_args *args);
 
+static __always_inline u64 do_seamcall(sc_func_t func, u64 fn,
+				       struct tdx_module_args *args)
+{
+	u64 ret;
+
+	preempt_disable();
+
+	/*
+	 * SEAMCALLs are made to the TDX module and can generate dirty
+	 * cachelines of TDX private memory.  Mark cache state incoherent
+	 * so that the cache can be flushed during kexec.
+	 *
+	 * This needs to be done before actually making the SEAMCALL,
+	 * because kexec-ing CPU could send NMI to stop remote CPUs,
+	 * in which case even disabling IRQ won't help here.
+	 */
+	this_cpu_write(cache_state_incoherent, true);
+
+	ret = func(fn, args);
+
+	preempt_enable();
+
+	return ret;
+}
+
 static __always_inline u64 sc_retry(sc_func_t func, u64 fn,
 			   struct tdx_module_args *args)
 {
@@ -113,7 +140,7 @@ static __always_inline u64 sc_retry(sc_func_t func, u64 fn,
 	u64 ret;
 
 	do {
-		ret = func(fn, args);
+		ret = do_seamcall(func, fn, args);
 	} while (ret == TDX_RND_NO_ENTROPY && --retry);
 
 	return ret;
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH v3 3/6] x86/kexec: Disable kexec/kdump on platforms with TDX partial write erratum
  2025-06-26 10:48 [PATCH v3 0/6] TDX host: kexec/kdump support Kai Huang
  2025-06-26 10:48 ` [PATCH v3 1/6] x86/sme: Use percpu boolean to control wbinvd during kexec Kai Huang
  2025-06-26 10:48 ` [PATCH v3 2/6] x86/virt/tdx: Mark memory cache state incoherent when making SEAMCALL Kai Huang
@ 2025-06-26 10:48 ` Kai Huang
  2025-06-26 18:49   ` Edgecombe, Rick P
                     ` (2 more replies)
  2025-06-26 10:48 ` [PATCH v3 4/6] x86/virt/tdx: Remove the !KEXEC_CORE dependency Kai Huang
                   ` (2 subsequent siblings)
  5 siblings, 3 replies; 42+ messages in thread
From: Kai Huang @ 2025-06-26 10:48 UTC (permalink / raw)
  To: dave.hansen, bp, tglx, peterz, mingo, hpa, thomas.lendacky
  Cc: x86, kirill.shutemov, rick.p.edgecombe, linux-kernel, pbonzini,
	seanjc, kvm, reinette.chatre, isaku.yamahata, dan.j.williams,
	ashish.kalra, nik.borisov, sagis, Farrah Chen

Some early TDX-capable platforms have an erratum: a kernel partial
write (a write transaction of less than a cacheline that lands at the
memory controller) to TDX private memory poisons that memory, and a
subsequent read triggers a machine check.

On those platforms, the old kernel must reset TDX private memory before
jumping to the new kernel, otherwise the new kernel may see an unexpected
machine check.  Currently the kernel doesn't track which page is a TDX
private page.  For simplicity just fail kexec/kdump on those platforms.

Leverage the existing machine_kexec_prepare() to fail kexec/kdump by
adding a check for the presence of the TDX erratum (which is only checked
for if the kernel is built with TDX host support).  This rejects
kexec/kdump when the kernel is loading the kexec/kdump kernel image.

The alternative is to reject kexec/kdump when the kernel is jumping to
the new kernel.  But for kexec this requires adding a new check (e.g.,
arch_kexec_allowed()) in the common code to fail kernel_kexec() at an
early stage.  Kdump (crash_kexec()) needs a similar check, but it's hard
to justify because crash_kexec() is not supposed to abort.

It's feasible to further relax this limitation, i.e., only fail kexec
when TDX is actually enabled by the kernel.  But this is still a half
measure compared to resetting TDX private memory so just do the simplest
thing for now.

The impact to userspace is that users will get an error when loading the
kexec/kdump kernel image:

  kexec_load failed: Operation not supported

This might be confusing to users, thus also print the reason in dmesg:

  [..] kexec: not allowed on platform with tdx_pw_mce bug.

Signed-off-by: Kai Huang <kai.huang@intel.com>
Tested-by: Farrah Chen <farrah.chen@intel.com>
---
 arch/x86/kernel/machine_kexec_64.c | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)

diff --git a/arch/x86/kernel/machine_kexec_64.c b/arch/x86/kernel/machine_kexec_64.c
index 4519c7b75c49..d5a85d786e61 100644
--- a/arch/x86/kernel/machine_kexec_64.c
+++ b/arch/x86/kernel/machine_kexec_64.c
@@ -347,6 +347,22 @@ int machine_kexec_prepare(struct kimage *image)
 	unsigned long reloc_end = (unsigned long)__relocate_kernel_end;
 	int result;
 
+	/*
+	 * Some early TDX-capable platforms have an erratum.  A kernel
+	 * partial write (a write transaction of less than a cacheline that
+	 * lands at the memory controller) to TDX private memory poisons
+	 * that memory, and a subsequent read triggers a machine check.
+	 *
+	 * On those platforms the old kernel must reset TDX private
+	 * memory before jumping to the new kernel, otherwise the new
+	 * kernel may see an unexpected machine check.  For simplicity
+	 * just fail kexec/kdump on those platforms.
+	 */
+	if (boot_cpu_has_bug(X86_BUG_TDX_PW_MCE)) {
+		pr_info_once("Not allowed on platform with tdx_pw_mce bug\n");
+		return -EOPNOTSUPP;
+	}
+
 	/* Setup the identity mapped 64bit page table */
 	result = init_pgtable(image, __pa(control_page));
 	if (result)
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH v3 4/6] x86/virt/tdx: Remove the !KEXEC_CORE dependency
  2025-06-26 10:48 [PATCH v3 0/6] TDX host: kexec/kdump support Kai Huang
                   ` (2 preceding siblings ...)
  2025-06-26 10:48 ` [PATCH v3 3/6] x86/kexec: Disable kexec/kdump on platforms with TDX partial write erratum Kai Huang
@ 2025-06-26 10:48 ` Kai Huang
  2025-06-26 18:49   ` Edgecombe, Rick P
  2025-06-26 10:48 ` [PATCH v3 5/6] x86/virt/tdx: Update the kexec section in the TDX documentation Kai Huang
  2025-06-26 10:48 ` [PATCH v3 6/6] KVM: TDX: Explicitly do WBINVD upon reboot notifier Kai Huang
  5 siblings, 1 reply; 42+ messages in thread
From: Kai Huang @ 2025-06-26 10:48 UTC (permalink / raw)
  To: dave.hansen, bp, tglx, peterz, mingo, hpa, thomas.lendacky
  Cc: x86, kirill.shutemov, rick.p.edgecombe, linux-kernel, pbonzini,
	seanjc, kvm, reinette.chatre, isaku.yamahata, dan.j.williams,
	ashish.kalra, nik.borisov, sagis, Farrah Chen

During kexec it is now guaranteed that all dirty cachelines of TDX
private memory are flushed before jumping to the new kernel.  The TDX
private memory from the old kernel will remain as TDX private memory in
the new kernel, but that is OK because kernel reads/writes to TDX private
memory will never cause a machine check, except on platforms with the
TDX partial write erratum, which has already been handled.

It is safe to allow kexec to work together with TDX now.  Remove the
!KEXEC_CORE dependency.

Signed-off-by: Kai Huang <kai.huang@intel.com>
Tested-by: Farrah Chen <farrah.chen@intel.com>
---
 arch/x86/Kconfig | 1 -
 1 file changed, 1 deletion(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 71019b3b54ea..ca1c9f9e59be 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1899,7 +1899,6 @@ config INTEL_TDX_HOST
 	depends on X86_X2APIC
 	select ARCH_KEEP_MEMBLOCK
 	depends on CONTIG_ALLOC
-	depends on !KEXEC_CORE
 	depends on X86_MCE
 	help
 	  Intel Trust Domain Extensions (TDX) protects guest VMs from malicious
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH v3 5/6] x86/virt/tdx: Update the kexec section in the TDX documentation
  2025-06-26 10:48 [PATCH v3 0/6] TDX host: kexec/kdump support Kai Huang
                   ` (3 preceding siblings ...)
  2025-06-26 10:48 ` [PATCH v3 4/6] x86/virt/tdx: Remove the !KEXEC_CORE dependency Kai Huang
@ 2025-06-26 10:48 ` Kai Huang
  2025-06-26 18:51   ` Edgecombe, Rick P
  2025-06-26 10:48 ` [PATCH v3 6/6] KVM: TDX: Explicitly do WBINVD upon reboot notifier Kai Huang
  5 siblings, 1 reply; 42+ messages in thread
From: Kai Huang @ 2025-06-26 10:48 UTC (permalink / raw)
  To: dave.hansen, bp, tglx, peterz, mingo, hpa, thomas.lendacky
  Cc: x86, kirill.shutemov, rick.p.edgecombe, linux-kernel, pbonzini,
	seanjc, kvm, reinette.chatre, isaku.yamahata, dan.j.williams,
	ashish.kalra, nik.borisov, sagis, Farrah Chen

The TDX host kernel now supports kexec/kdump.  Update the documentation to
reflect that.

Opportunistically, remove the parentheses in "Kexec()" and move this
section under the "Erratum" section because the updated "Kexec" section
now refers to that erratum.

Signed-off-by: Kai Huang <kai.huang@intel.com>
Tested-by: Farrah Chen <farrah.chen@intel.com>
---
 Documentation/arch/x86/tdx.rst | 14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/Documentation/arch/x86/tdx.rst b/Documentation/arch/x86/tdx.rst
index 719043cd8b46..61670e7df2f7 100644
--- a/Documentation/arch/x86/tdx.rst
+++ b/Documentation/arch/x86/tdx.rst
@@ -142,13 +142,6 @@ but depends on the BIOS to behave correctly.
 Note TDX works with CPU logical online/offline, thus the kernel still
 allows to offline logical CPU and online it again.
 
-Kexec()
-~~~~~~~
-
-TDX host support currently lacks the ability to handle kexec.  For
-simplicity only one of them can be enabled in the Kconfig.  This will be
-fixed in the future.
-
 Erratum
 ~~~~~~~
 
@@ -171,6 +164,13 @@ If the platform has such erratum, the kernel prints additional message in
 machine check handler to tell user the machine check may be caused by
 kernel bug on TDX private memory.
 
+Kexec
+~~~~~~~
+
+Currently kexec doesn't work on the TDX platforms with the aforementioned
+erratum.  It fails when loading the kexec kernel image.  Otherwise it
+works normally.
+
 Interaction vs S3 and deeper states
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH v3 6/6] KVM: TDX: Explicitly do WBINVD upon reboot notifier
  2025-06-26 10:48 [PATCH v3 0/6] TDX host: kexec/kdump support Kai Huang
                   ` (4 preceding siblings ...)
  2025-06-26 10:48 ` [PATCH v3 5/6] x86/virt/tdx: Update the kexec section in the TDX documentation Kai Huang
@ 2025-06-26 10:48 ` Kai Huang
  2025-06-27  0:01   ` Edgecombe, Rick P
                     ` (2 more replies)
  5 siblings, 3 replies; 42+ messages in thread
From: Kai Huang @ 2025-06-26 10:48 UTC (permalink / raw)
  To: dave.hansen, bp, tglx, peterz, mingo, hpa, thomas.lendacky
  Cc: x86, kirill.shutemov, rick.p.edgecombe, linux-kernel, pbonzini,
	seanjc, kvm, reinette.chatre, isaku.yamahata, dan.j.williams,
	ashish.kalra, nik.borisov, sagis, Farrah Chen

On TDX platforms, during kexec, the kernel needs to make sure there are
no dirty cachelines of TDX private memory before booting to the new
kernel, to avoid silent memory corruption in the new kernel.

During kexec, the kexec-ing CPU first invokes native_stop_other_cpus()
to stop all remote CPUs before booting to the new kernel.  The remote
CPUs will then execute stop_this_cpu() to stop themselves.

The kernel has a percpu boolean to indicate whether the cache of a CPU
may be in an incoherent state.  In stop_this_cpu(), the kernel does
WBINVD if that percpu boolean is true.

TDX turns on that percpu boolean on a CPU when the kernel does a
SEAMCALL.  This makes sure the caches will be flushed during kexec.

However, native_stop_other_cpus() and stop_this_cpu() have a "race"
which is extremely rare but could cause the system to hang.

Specifically, native_stop_other_cpus() first sends a normal reboot IPI
to remote CPUs and waits one second for them to stop.  If that times
out, native_stop_other_cpus() then sends NMIs to remote CPUs to stop
them.

The aforementioned race happens when NMIs are sent.  Doing WBINVD in
stop_this_cpu() makes each CPU take longer to stop and increases the
chance of the race happening.

Register a reboot notifier in KVM to explicitly flush caches upon
receiving the reboot notification (e.g., during kexec) for TDX.  This
moves the WBINVD to an earlier stage than stop_this_cpu(), avoiding a
possibly lengthy operation at a time when it could cause this race.

Signed-off-by: Kai Huang <kai.huang@intel.com>
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
Tested-by: Farrah Chen <farrah.chen@intel.com>
---

v2 -> v3:
 - Update changelog to address Paolo's comments and Add Paolo's Ack:
   https://lore.kernel.org/lkml/3a7c0856-6e7b-4d3d-b966-6f17f1aca42e@redhat.com/

---
 arch/x86/include/asm/tdx.h  |  3 +++
 arch/x86/kvm/vmx/tdx.c      | 45 +++++++++++++++++++++++++++++++++++++
 arch/x86/virt/vmx/tdx/tdx.c |  9 ++++++++
 3 files changed, 57 insertions(+)

diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index d4c624c69d7f..e6b11982c6c6 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -221,6 +221,8 @@ u64 tdh_mem_page_remove(struct tdx_td *td, u64 gpa, u64 level, u64 *ext_err1, u6
 u64 tdh_phymem_cache_wb(bool resume);
 u64 tdh_phymem_page_wbinvd_tdr(struct tdx_td *td);
 u64 tdh_phymem_page_wbinvd_hkid(u64 hkid, struct page *page);
+
+void tdx_cpu_flush_cache(void);
 #else
 static inline void tdx_init(void) { }
 static inline int tdx_cpu_enable(void) { return -ENODEV; }
@@ -228,6 +230,7 @@ static inline int tdx_enable(void)  { return -ENODEV; }
 static inline u32 tdx_get_nr_guest_keyids(void) { return 0; }
 static inline const char *tdx_dump_mce_info(struct mce *m) { return NULL; }
 static inline const struct tdx_sys_info *tdx_get_sysinfo(void) { return NULL; }
+static inline void tdx_cpu_flush_cache(void) { }
 #endif	/* CONFIG_INTEL_TDX_HOST */
 
 #endif /* !__ASSEMBLER__ */
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 1ad20c273f3b..c567a64a6cb0 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -5,7 +5,9 @@
 #include <asm/fpu/xcr.h>
 #include <linux/misc_cgroup.h>
 #include <linux/mmu_context.h>
+#include <linux/reboot.h>
 #include <asm/tdx.h>
+#include <asm/processor.h>
 #include "capabilities.h"
 #include "mmu.h"
 #include "x86_ops.h"
@@ -3347,6 +3349,33 @@ static int tdx_offline_cpu(unsigned int cpu)
 	return -EBUSY;
 }
 
+static void smp_func_cpu_flush_cache(void *unused)
+{
+	tdx_cpu_flush_cache();
+}
+
+static int tdx_reboot_notify(struct notifier_block *nb, unsigned long code,
+			     void *unused)
+{
+	/*
+	 * Flush cache for all CPUs upon the reboot notifier.  This
+	 * avoids having to do WBINVD in stop_this_cpu() during kexec.
+	 *
+	 * Kexec calls native_stop_other_cpus() to stop remote CPUs
+	 * before booting to the new kernel, but that code has a "race"
+	 * when the normal REBOOT IPI times out and NMIs are sent to
+	 * remote CPUs to stop them.  Doing WBINVD in stop_this_cpu()
+	 * could potentially increase the possibility of the "race".
+	 */
+	if (code == SYS_RESTART)
+		on_each_cpu(smp_func_cpu_flush_cache, NULL, 1);
+	return NOTIFY_DONE;
+}
+
+static struct notifier_block tdx_reboot_nb = {
+	.notifier_call = tdx_reboot_notify,
+};
+
 static void __do_tdx_cleanup(void)
 {
 	/*
@@ -3504,6 +3533,11 @@ void tdx_cleanup(void)
 {
 	if (enable_tdx) {
 		misc_cg_set_capacity(MISC_CG_RES_TDX, 0);
+		/*
+		 * Ignore the return value.  See the comment in
+		 * tdx_bringup().
+		 */
+		unregister_reboot_notifier(&tdx_reboot_nb);
 		__tdx_cleanup();
 		kvm_disable_virtualization();
 	}
@@ -3587,6 +3621,17 @@ int __init tdx_bringup(void)
 		enable_tdx = 0;
 	}
 
+	if (enable_tdx)
+		/*
+		 * Ignore the return value.  @tdx_reboot_nb is used to flush
+		 * cache for all CPUs upon rebooting to avoid having to do
+		 * WBINVD in kexec while the kexec-ing CPU stops all remote
+		 * CPUs.  Failure to register isn't fatal, because if KVM
+		 * doesn't flush the cache explicitly upon rebooting, kexec
+		 * will do it anyway.
+		 */
+		register_reboot_notifier(&tdx_reboot_nb);
+
 	return r;
 
 success_disable_tdx:
diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index c7a9a087ccaf..73425e9bee39 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -1870,3 +1870,12 @@ u64 tdh_phymem_page_wbinvd_hkid(u64 hkid, struct page *page)
 	return seamcall(TDH_PHYMEM_PAGE_WBINVD, &args);
 }
 EXPORT_SYMBOL_GPL(tdh_phymem_page_wbinvd_hkid);
+
+void tdx_cpu_flush_cache(void)
+{
+	lockdep_assert_preemption_disabled();
+
+	wbinvd();
+	this_cpu_write(cache_state_incoherent, false);
+}
+EXPORT_SYMBOL_GPL(tdx_cpu_flush_cache);
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* Re: [PATCH v3 1/6] x86/sme: Use percpu boolean to control wbinvd during kexec
  2025-06-26 10:48 ` [PATCH v3 1/6] x86/sme: Use percpu boolean to control wbinvd during kexec Kai Huang
@ 2025-06-26 17:59   ` Edgecombe, Rick P
  2025-06-26 18:42     ` Edgecombe, Rick P
  2025-06-27  0:37     ` Huang, Kai
  2025-06-27 15:08   ` Tom Lendacky
  2025-06-28 12:50   ` Borislav Petkov
  2 siblings, 2 replies; 42+ messages in thread
From: Edgecombe, Rick P @ 2025-06-26 17:59 UTC (permalink / raw)
  To: Hansen, Dave, Huang, Kai, bp@alien8.de, peterz@infradead.org,
	hpa@zytor.com, mingo@redhat.com, tglx@linutronix.de,
	thomas.lendacky@amd.com
  Cc: seanjc@google.com, x86@kernel.org, sagis@google.com,
	Chatre, Reinette, linux-kernel@vger.kernel.org,
	kirill.shutemov@linux.intel.com, Williams, Dan J,
	kvm@vger.kernel.org, pbonzini@redhat.com, Yamahata, Isaku,
	ashish.kalra@amd.com, nik.borisov@suse.com

On Thu, 2025-06-26 at 22:48 +1200, Kai Huang wrote:
> On SME platforms, at hardware level dirty cachelines with and without
> encryption bit of the same memory can coexist, and the CPU can flush
> them back to memory in random order.
> 

Nit: It's a bit of a run-on sentence. I know we struggled with this one. But I
think this might be easier:

  On SME platforms, dirty cacheline aliases with and without encryption bits
  can coexist, and the CPU can flush them back to memory in random order.

>   During kexec, the caches must be
> flushed before jumping to the new kernel to avoid silent memory
> corruption when a cacheline with a different encryption property is
> written back over whatever encryption properties the new kernel is
> using.
> 
> TDX also needs cache flush during kexec for the same reason.  It would
                ^ a
> be good to implement a generic way to flush cache instead of scattering
                                             ^ the
> checks for each feature all around.
> 
> During kexec, the kexec-ing CPU sends IPIs to all remote CPUs to stop
> them before it boots to the new kernel.  For SME the kernel basically
> encrypts all memory including the kernel itself by manipulating page
> tables.  A simple memory write from the kernel could dirty cachelines.
> 
> Currently, the kernel uses WBINVD to flush cache for SME during kexec in
                                             ^the
> two places:
> 
> 1) the one in stop_this_cpu() for all remote CPUs when the kexec-ing CPU
>    stops them;
> 2) the one in the relocate_kernel() where the kexec-ing CPU jumps to the
>    new kernel.
> 
> Unlike SME, TDX can only dirty cachelines when it is used (i.e., when
> SEAMCALLs are performed).  Since there are no more SEAMCALLs after the
> aforementioned WBINVDs, leverage this for TDX.
> 
> In order to have a generic way to cover both SME and TDX, use a percpu
> boolean to indicate the cache may be in an incoherent state thus cache
> flush is needed during kexec, and turn on the boolean for SME.  TDX can
> then leverage it by also turning the boolean on.
> 
> A percpu boolean isn't strictly needed for SME since it is activated at
> very early boot time and on all CPUs.  A global flag would be
> sufficient.  But using a percpu variable has two benefits.  Foremost,
> the state that is being tracked here (percpu cache coherency situation
> requiring a flush) is percpu, so a percpu state is a more direct and
> natural fit.
> 
> Secondly, it will fit TDX's usage better since the percpu var can be set
> when a CPU makes a SEAMCALL, and cleared when another WBINVD on the CPU
> obviates the need for a kexec-time WBINVD.  Saving kexec-time WBINVD is
> valuable, because there is an existing race[*] where kexec could proceed
> while another CPU is active.  WBINVD could make this race worse, so it's
> worth skipping it when possible.
> 
> Today the first WBINVD in the stop_this_cpu() is performed when SME is
> *supported* by the platform, and the second WBINVD is done in
> relocate_kernel() when SME is *activated* by the kernel.  Make things
> simple by changing to do the second WBINVD when the platform supports
> SME.  This allows the kernel to simply turn on this percpu boolean when
> bringing up a CPU by checking whether the platform supports SME.

Since the race is related to stop_this_cpu() it doesn't affect it. But it does
mean that the bring up CPU may not flush the cache if it takes a kdump kexec before
the per-cpu var is set? Of course the existing logic doesn't trigger until
cpuinfo_x86 is populated. What is the consequence?

Arguably the supported/enabled part could be moved to a separate earlier patch.
The code change would just get immediately replaced, but the benefit would be
that a bisect would show which part of the change was responsible.

> 
> No other functional change intended.
> 
> Also, currently machine_check() has a comment to explain why no function

Nittiest of the nits: I would move this section above "No function..." and drop
the "also". This is the same old "notes" critique I always make. Just when you
think you have got all the details and start to process, it asks you to update
your model.

> call is allowed after load_segments().  After changing to use the new
> percpu boolean to control whether to perform WBINVD when calling the
> relocate_kernel(), that comment isn't needed anymore.  But it is still a
> useful comment, so expand the comment around load_segments() to mention
> that due to depth tracking no function call can be made after
> load_segments().
> 
> [*] The "race" in native_stop_other_cpus()
> 
> Commit
> 
>   1f5e7eb7868e: ("x86/smp: Make stop_other_cpus() more robust")
> 
> introduced a new 'cpus_stop_mask' to resolve an "intermittent lockups on
> poweroff" issue which was caused by the WBINVD in stop_this_cpu().
> Specifically, the new cpumask resolved the below problem mentioned in
> that commit:
> 
>     CPU0                                    CPU1
> 
>      stop_other_cpus()
>        send_IPIs(REBOOT);                   stop_this_cpu()
>        while (num_online_cpus() > 1);         set_online(false);
>        proceed... -> hang
>                                               wbinvd()
> 
> While it fixed the reported issue, that commit explained the new cpumask
> "cannot plug all holes either".
> 
> This is because it doesn't address the "race" mentioned in the #3 in the
> comment in native_stop_other_cpus():
> 
>         /*
>          * 1) Send an IPI on the reboot vector to all other CPUs.
>          *
>          *    The other CPUs should react on it after leaving critical
>          *    sections and re-enabling interrupts. They might still hold
>          *    locks, but there is nothing which can be done about that.
>          *
>          * 2) Wait for all other CPUs to report that they reached the
>          *    HLT loop in stop_this_cpu()
>          *
>          * 3) If #2 timed out send an NMI to the CPUs which did not
>          *    yet report
>          *
>          * 4) Wait for all other CPUs to report that they reached the
>          *    HLT loop in stop_this_cpu()
>          *
>          * #3 can obviously race against a CPU reaching the HLT loop late.
>          * That CPU will have reported already and the "have all CPUs
>          * reached HLT" condition will be true despite the fact that the
>          * other CPU is still handling the NMI. Again, there is no
>          * protection against that as "disabled" APICs still respond to
>          * NMIs.
>          */
> 
> Consider below case:
> 
>     CPU 0					CPU 1
> 
>     native_stop_other_cpus()			stop_this_cpu()
> 
> 	// sends REBOOT IPI to stop remote CPUs
> 						...
> 						wbinvd();
> 
> 	// wait timesout, try NMI
> 	if (!cpumask_empty(&cpus_stop_mask)) {
> 		for_each_cpu(cpu, &cpus_stop_mask) {
> 
> 						...
> 						cpumask_clear_cpu(cpu,
> 							&cpus_stop_mask);
> 						hlt;
> 
> 			// send NMI 	--->
> 						wakeup from hlt and run
> 						stop_this_cpu():
> 
> 			// WAIT CPUs TO STOP
> 			while (!cpumask_empty(
> 			    &cpus_stop_mask) &&
> 				...)
> 		}
> 						...
> 		proceed ...			wbinvd();
> 						...
> 						hlt;
> 
> The "WAIT CPUs TO STOP" is supposed to wait until all remote CPUs are
> stopped, but actually it quits immediately because the remote CPUs have
> been cleared in cpus_stop_mask when stop_this_cpu() is called from the
> REBOOT IPI.
> 
> Doing WBINVD in stop_this_cpu() could potentially increase the chance to
> trigger the above "race" despite it's still rare to happen.
> 
> Signed-off-by: Kai Huang <kai.huang@intel.com>
> ---
>  arch/x86/include/asm/kexec.h         |  2 +-
>  arch/x86/include/asm/processor.h     |  2 ++
>  arch/x86/kernel/cpu/amd.c            | 16 ++++++++++++++++
>  arch/x86/kernel/machine_kexec_64.c   | 15 ++++++++++-----
>  arch/x86/kernel/process.c            | 16 +++-------------
>  arch/x86/kernel/relocate_kernel_64.S | 15 +++++++++++----
>  6 files changed, 43 insertions(+), 23 deletions(-)
> 
> diff --git a/arch/x86/include/asm/kexec.h b/arch/x86/include/asm/kexec.h
> index f2ad77929d6e..d7e93522b93d 100644
> --- a/arch/x86/include/asm/kexec.h
> +++ b/arch/x86/include/asm/kexec.h
> @@ -122,7 +122,7 @@ relocate_kernel_fn(unsigned long indirection_page,
>  		   unsigned long pa_control_page,
>  		   unsigned long start_address,
>  		   unsigned int preserve_context,
> -		   unsigned int host_mem_enc_active);
> +		   unsigned int cache_incoherent);
>  #endif
>  extern relocate_kernel_fn relocate_kernel;
>  #define ARCH_HAS_KIMAGE_ARCH
> diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
> index bde58f6510ac..a24c7805acdb 100644
> --- a/arch/x86/include/asm/processor.h
> +++ b/arch/x86/include/asm/processor.h
> @@ -731,6 +731,8 @@ void __noreturn stop_this_cpu(void *dummy);
>  void microcode_check(struct cpuinfo_x86 *prev_info);
>  void store_cpu_caps(struct cpuinfo_x86 *info);
>  
> +DECLARE_PER_CPU(bool, cache_state_incoherent);
> +
>  enum l1tf_mitigations {
>  	L1TF_MITIGATION_OFF,
>  	L1TF_MITIGATION_AUTO,
> diff --git a/arch/x86/kernel/cpu/amd.c b/arch/x86/kernel/cpu/amd.c
> index f18f540db58c..4c7fde344216 100644
> --- a/arch/x86/kernel/cpu/amd.c
> +++ b/arch/x86/kernel/cpu/amd.c
> @@ -503,6 +503,22 @@ static void early_detect_mem_encrypt(struct cpuinfo_x86 *c)
>  {
>  	u64 msr;
>  
> +	/*
> +	 * Mark using wbinvd is needed during kexec on processors that
> +	 * support SME. This provides support for performing a successful
> +	 * kexec when going from SME inactive to SME active (or vice-versa).
> +	 *
> +	 * The cache must be cleared so that if there are entries with the
> +	 * same physical address, both with and without the encryption bit,
> +	 * they don't race each other when flushed and potentially end up
> +	 * with the wrong entry being committed to memory.
> +	 *
> +	 * Test the CPUID bit directly because the machine might've cleared
> +	 * X86_FEATURE_SME due to cmdline options.
> +	 */
> +	if (c->extended_cpuid_level >= 0x8000001f && (cpuid_eax(0x8000001f) & BIT(0)))
> +		__this_cpu_write(cache_state_incoherent, true);
> +
>  	/*
>  	 * BIOS support is required for SME and SEV.
>  	 *   For SME: If BIOS has enabled SME then adjust x86_phys_bits by
> diff --git a/arch/x86/kernel/machine_kexec_64.c b/arch/x86/kernel/machine_kexec_64.c
> index 697fb99406e6..4519c7b75c49 100644
> --- a/arch/x86/kernel/machine_kexec_64.c
> +++ b/arch/x86/kernel/machine_kexec_64.c
> @@ -29,6 +29,7 @@
>  #include <asm/set_memory.h>
>  #include <asm/cpu.h>
>  #include <asm/efi.h>
> +#include <asm/processor.h>
>  
>  #ifdef CONFIG_ACPI
>  /*
> @@ -384,15 +385,15 @@ void __nocfi machine_kexec(struct kimage *image)
>  {
>  	unsigned long reloc_start = (unsigned long)__relocate_kernel_start;
>  	relocate_kernel_fn *relocate_kernel_ptr;
> -	unsigned int host_mem_enc_active;
> +	unsigned int cache_incoherent;
>  	int save_ftrace_enabled;
>  	void *control_page;
>  
>  	/*
> -	 * This must be done before load_segments() since if call depth tracking
> -	 * is used then GS must be valid to make any function calls.
> +	 * This must be done before load_segments(), since it resets
> +	 * GS to 0 and percpu data needs the correct GS to work.
>  	 */
> -	host_mem_enc_active = cc_platform_has(CC_ATTR_HOST_MEM_ENCRYPT);
> +	cache_incoherent = this_cpu_read(cache_state_incoherent);
>  
>  #ifdef CONFIG_KEXEC_JUMP
>  	if (image->preserve_context)
> @@ -436,6 +437,10 @@ void __nocfi machine_kexec(struct kimage *image)
>  	 *
>  	 * Take advantage of this here by force loading the segments,
>  	 * before the GDT is zapped with an invalid value.
> +	 *
> +	 * load_segments() resets GS to 0.  Don't make any function call
> +	 * after here since call depth tracking uses percpu variables to
> +	 * operate (relocate_kernel() is explicitly ignored by call depth
>  	 */
>  	load_segments();
>  
> @@ -444,7 +449,7 @@ void __nocfi machine_kexec(struct kimage *image)
>  					   virt_to_phys(control_page),
>  					   image->start,
>  					   image->preserve_context,
> -					   host_mem_enc_active);
> +					   cache_incoherent);
>  
>  #ifdef CONFIG_KEXEC_JUMP
>  	if (image->preserve_context)
> diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
> index 7b94851bb37e..6b5edfbefa9a 100644
> --- a/arch/x86/kernel/process.c
> +++ b/arch/x86/kernel/process.c
> @@ -88,6 +88,8 @@ EXPORT_PER_CPU_SYMBOL(cpu_tss_rw);
>  DEFINE_PER_CPU(bool, __tss_limit_invalid);
>  EXPORT_PER_CPU_SYMBOL_GPL(__tss_limit_invalid);
>  
> +DEFINE_PER_CPU(bool, cache_state_incoherent);
> +
>  /*
>   * this gets called so that we can store lazy state into memory and copy the
>   * current task into the new thread.
> @@ -827,19 +829,7 @@ void __noreturn stop_this_cpu(void *dummy)
>  	disable_local_APIC();
>  	mcheck_cpu_clear(c);
>  
> -	/*
> -	 * Use wbinvd on processors that support SME. This provides support
> -	 * for performing a successful kexec when going from SME inactive
> -	 * to SME active (or vice-versa). The cache must be cleared so that
> -	 * if there are entries with the same physical address, both with and
> -	 * without the encryption bit, they don't race each other when flushed
> -	 * and potentially end up with the wrong entry being committed to
> -	 * memory.
> -	 *
> -	 * Test the CPUID bit directly because the machine might've cleared
> -	 * X86_FEATURE_SME due to cmdline options.
> -	 */
> -	if (c->extended_cpuid_level >= 0x8000001f && (cpuid_eax(0x8000001f) & BIT(0)))
> +	if (this_cpu_read(cache_state_incoherent))
>  		wbinvd();
>  
>  	/*
> diff --git a/arch/x86/kernel/relocate_kernel_64.S b/arch/x86/kernel/relocate_kernel_64.S
> index ea604f4d0b52..34b3a5e4fe49 100644
> --- a/arch/x86/kernel/relocate_kernel_64.S
> +++ b/arch/x86/kernel/relocate_kernel_64.S
> @@ -67,7 +67,7 @@ SYM_CODE_START_NOALIGN(relocate_kernel)
>  	 * %rsi pa_control_page
>  	 * %rdx start address
>  	 * %rcx preserve_context
> -	 * %r8  host_mem_enc_active
> +	 * %r8  cache_incoherent
>  	 */
>  
>  	/* Save the CPU context, used for jumping back */
> @@ -129,7 +129,7 @@ SYM_CODE_START_LOCAL_NOALIGN(identity_mapped)
>  	/*
>  	 * %rdi	indirection page
>  	 * %rdx start address
> -	 * %r8 host_mem_enc_active
> +	 * %r8 cache_incoherent
>  	 * %r9 page table page
>  	 * %r11 preserve_context
>  	 * %r13 original CR4 when relocate_kernel() was invoked
> @@ -200,14 +200,21 @@ SYM_CODE_START_LOCAL_NOALIGN(identity_mapped)
>  	movq	%r9, %cr3
>  
>  	/*
> +	 * If the memory cache is in incoherent state, e.g., due to
> +	 * memory encryption, do wbinvd to flush cache.
> +	 *
>  	 * If SME is active, there could be old encrypted cache line
>  	 * entries that will conflict with the now unencrypted memory
>  	 * used by kexec. Flush the caches before copying the kernel.
> +	 *
> +	 * Note SME sets this flag to true when the platform supports
> +	 * SME, so the wbinvd is performed even SME is not activated
> +	 * by the kernel.  But this has no harm.
>  	 */
>  	testq	%r8, %r8
> -	jz .Lsme_off
> +	jz .Lnowbinvd
>  	wbinvd
> -.Lsme_off:
> +.Lnowbinvd:
>  
>  	call	swap_pages
>  


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v3 2/6] x86/virt/tdx: Mark memory cache state incoherent when making SEAMCALL
  2025-06-26 10:48 ` [PATCH v3 2/6] x86/virt/tdx: Mark memory cache state incoherent when making SEAMCALL Kai Huang
@ 2025-06-26 18:37   ` Edgecombe, Rick P
  2025-06-26 23:36     ` Huang, Kai
  0 siblings, 1 reply; 42+ messages in thread
From: Edgecombe, Rick P @ 2025-06-26 18:37 UTC (permalink / raw)
  To: Hansen, Dave, Huang, Kai, bp@alien8.de, peterz@infradead.org,
	hpa@zytor.com, mingo@redhat.com, tglx@linutronix.de,
	thomas.lendacky@amd.com
  Cc: seanjc@google.com, x86@kernel.org, sagis@google.com,
	Chatre, Reinette, linux-kernel@vger.kernel.org,
	kirill.shutemov@linux.intel.com, Williams, Dan J,
	kvm@vger.kernel.org, pbonzini@redhat.com, Yamahata, Isaku,
	ashish.kalra@amd.com, Chen, Farrah, nik.borisov@suse.com

On Thu, 2025-06-26 at 22:48 +1200, Kai Huang wrote:
> On TDX platforms, at the hardware level dirty cachelines with and without
> a TDX keyID can coexist, and the CPU can flush them back to memory in
> random order.
> 

Same comment as the previous patch.

>   During kexec, the caches must be flushed before jumping to the
> new kernel to avoid silent memory corruption when a cacheline with a
> different encryption property is written back over whatever encryption
> properties the new kernel is using.
> 
> A percpu boolean is used to mark whether the cache of a given CPU may be
> in an incoherent state, and the kexec performs WBINVD on the CPUs with
> that boolean turned on.
> 
> For TDX, only the TDX module or the TDX guests can generate dirty
> cachelines of TDX private memory, i.e., they are only generated when the
> kernel does SEAMCALL.
             ^a
> 
> Turn on that boolean when the kernel does SEAMCALL so that kexec can
Nit: "Turn on" is a little ambiguous. "Set"?

> correctly flush cache.
> 
> SEAMCALL can be made from both task context and IRQ disabled context.

SEAMCALLs

> Given SEAMCALL is just a lengthy instruction (e.g., thousands of cycles)
> from the kernel's point of view and preempt_{disable|enable}() is cheap
> compared to it, simply unconditionally disable preemption while setting
> the percpu boolean and making the SEAMCALL.
> 
> Signed-off-by: Kai Huang <kai.huang@intel.com>
> Tested-by: Farrah Chen <farrah.chen@intel.com>
> ---
> 
> v2 -> v3:
>  - Change to use __always_inline for do_seamcall() to avoid indirect
>    call instructions of making SEAMCALL.

How did this come about?

>  - Remove the sentence "not all SEAMCALLs generate dirty cachelines of
>    TDX private memory but just treat all of them do." in changelog and
>    the code comment. -- Dave
> 
> ---
>  arch/x86/include/asm/tdx.h | 29 ++++++++++++++++++++++++++++-
>  1 file changed, 28 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
> index 7ddef3a69866..d4c624c69d7f 100644
> --- a/arch/x86/include/asm/tdx.h
> +++ b/arch/x86/include/asm/tdx.h
> @@ -102,10 +102,37 @@ u64 __seamcall_ret(u64 fn, struct tdx_module_args *args);
>  u64 __seamcall_saved_ret(u64 fn, struct tdx_module_args *args);
>  void tdx_init(void);
>  
> +#include <linux/preempt.h>
>  #include <asm/archrandom.h>
> +#include <asm/processor.h>
>  
>  typedef u64 (*sc_func_t)(u64 fn, struct tdx_module_args *args);
>  
> +static __always_inline u64 do_seamcall(sc_func_t func, u64 fn,
> +				       struct tdx_module_args *args)
> +{

So now we have:

seamcall()
	sc_retry()
		do_seamcall()
			__seamcall()


do_seamcall() is only called from sc_retry(). Why add yet another helper in the
stack? You could just build it into sc_retry().

Oh, and __seamcall_*() variety is called directly too, so skips the
do_seamcall() per-cpu var logic in those cases. So, maybe do_seamcall() is
needed, but it needs a better name considering where it would get called from.

These wrappers need an overhaul I think, but maybe for now just have
do_dirty_seamcall() which is called from tdh_vp_enter() and sc_retry().

Oh no, actually scratch that! The inline/flatten issue will happen again if we
add the per-cpu vars to tdh_vp_enter()... Which means we probably need to set
the per-cpu var in tdx_vcpu_enter_exit(). And the other __seamcall() caller is
the machine check handler...

Am I missing something? It seems this patch is incomplete. If some of these
missed SEAMCALLs don't dirty a cacheline, then the justification that it works
by just covering all seamcalls needs to be updated.


Side topic. Do all the SEAMCALL wrappers calling into the seamcall_*() variety
of wrappers need the entropy retry logic? I think no, and some callers actually
depend on it not happening.


> +	u64 ret;
> +
> +	preempt_disable();
> +
> +	/*
> +	 * SEAMCALLs are made to the TDX module and can generate dirty
> +	 * cachelines of TDX private memory.  Mark cache state incoherent
> +	 * so that the cache can be flushed during kexec.
> +	 *
> +	 * This needs to be done before actually making the SEAMCALL,
> +	 * because the kexec-ing CPU could send an NMI to stop remote CPUs,
> +	 * in which case even disabling IRQs won't help here.
> +	 */
> +	this_cpu_write(cache_state_incoherent, true);
> +
> +	ret = func(fn, args);
> +
> +	preempt_enable();
> +
> +	return ret;
> +}
> +
>  static __always_inline u64 sc_retry(sc_func_t func, u64 fn,
>  			   struct tdx_module_args *args)
>  {
> @@ -113,7 +140,7 @@ static __always_inline u64 sc_retry(sc_func_t func, u64 fn,
>  	u64 ret;
>  
>  	do {
> -		ret = func(fn, args);
> +		ret = do_seamcall(func, fn, args);
>  	} while (ret == TDX_RND_NO_ENTROPY && --retry);
>  
>  	return ret;



* Re: [PATCH v3 1/6] x86/sme: Use percpu boolean to control wbinvd during kexec
  2025-06-26 17:59   ` Edgecombe, Rick P
@ 2025-06-26 18:42     ` Edgecombe, Rick P
  2025-06-27  0:30       ` Huang, Kai
  2025-06-27  0:37     ` Huang, Kai
  1 sibling, 1 reply; 42+ messages in thread
From: Edgecombe, Rick P @ 2025-06-26 18:42 UTC (permalink / raw)
  To: Hansen, Dave, Huang, Kai, bp@alien8.de, peterz@infradead.org,
	hpa@zytor.com, mingo@redhat.com, tglx@linutronix.de,
	thomas.lendacky@amd.com
  Cc: seanjc@google.com, x86@kernel.org, sagis@google.com,
	Chatre, Reinette, linux-kernel@vger.kernel.org,
	kirill.shutemov@linux.intel.com, Williams, Dan J,
	kvm@vger.kernel.org, pbonzini@redhat.com, Yamahata, Isaku,
	ashish.kalra@amd.com, nik.borisov@suse.com

On Thu, 2025-06-26 at 10:59 -0700, Rick Edgecombe wrote:
> Since the race is related to stop_this_cpu() it doesn't affect it. But it does
> mean that the bring up CPU may not flush the cache if takes a kdump kexec before
> the per-cpu var is set? Of course the existing logic doesn't trigger until
> cpuinfo_x86 is populated. What is the consequence?

Oh, doh! This stuff all happens before kdump could load. Right?


* Re: [PATCH v3 3/6] x86/kexec: Disable kexec/kdump on platforms with TDX partial write erratum
  2025-06-26 10:48 ` [PATCH v3 3/6] x86/kexec: Disable kexec/kdump on platforms with TDX partial write erratum Kai Huang
@ 2025-06-26 18:49   ` Edgecombe, Rick P
  2025-07-01  5:37   ` Binbin Wu
  2025-07-02  8:25   ` Chao Gao
  2 siblings, 0 replies; 42+ messages in thread
From: Edgecombe, Rick P @ 2025-06-26 18:49 UTC (permalink / raw)
  To: Hansen, Dave, Huang, Kai, bp@alien8.de, peterz@infradead.org,
	hpa@zytor.com, mingo@redhat.com, tglx@linutronix.de,
	thomas.lendacky@amd.com
  Cc: seanjc@google.com, x86@kernel.org, sagis@google.com,
	Chatre, Reinette, linux-kernel@vger.kernel.org,
	kirill.shutemov@linux.intel.com, Williams, Dan J,
	kvm@vger.kernel.org, pbonzini@redhat.com, Yamahata, Isaku,
	ashish.kalra@amd.com, Chen, Farrah, nik.borisov@suse.com

On Thu, 2025-06-26 at 22:48 +1200, Kai Huang wrote:
> Some early TDX-capable platforms have an erratum: a kernel partial
> write (a write transaction of less than a cacheline landing at the
> memory controller) to TDX private memory poisons that memory, and a
> subsequent read triggers a machine check.
> 
> On those platforms, the old kernel must reset TDX private memory before
> jumping to the new kernel, otherwise the new kernel may see an unexpected
> machine check.  Currently the kernel doesn't track which page is a TDX
> private page.  For simplicity just fail kexec/kdump for those platforms.
> 
> Leverage the existing machine_kexec_prepare() to fail kexec/kdump by
> adding the check of the presence of the TDX erratum (which is only
> checked for if the kernel is built with TDX host support).  This rejects
> kexec/kdump when the kernel is loading the kexec/kdump kernel image.
> 
> The alternative is to reject kexec/kdump when the kernel is jumping to
> the new kernel.  But for kexec this requires adding a new check (e.g.,
> arch_kexec_allowed()) in the common code to fail kernel_kexec() at early
> stage.  Kdump (crash_kexec()) needs similar check, but it's hard to
> justify because crash_kexec() is not supposed to abort.
> 
> It's feasible to further relax this limitation, i.e., only fail kexec
> when TDX is actually enabled by the kernel.  But this is still a half
> measure compared to resetting TDX private memory so just do the simplest
> thing for now.
> 
> The impact to userspace is that users will get an error when loading the
> kexec/kdump kernel image:
> 
>   kexec_load failed: Operation not supported
> 
> This might be confusing to the users, thus also print the reason in the
> dmesg:
> 
>   [..] kexec: not allowed on platform with tdx_pw_mce bug.
> 
> Signed-off-by: Kai Huang <kai.huang@intel.com>
> Tested-by: Farrah Chen <farrah.chen@intel.com>
> 

This does mean that kdump will not be allowed on these platforms if TDX is
configured in the BIOS, even if they don't set the kvm.tdx module param to
actually use it. Today it is not easy to accidentally turn on TDX in the BIOS,
so this would not usually happen by accident. Some future platforms might make
it easier, but today we don't support any kexec if TDX is configured. So
this still opens up more capability than existed before. All considered, I think
it's a good direction.

Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>


* Re: [PATCH v3 4/6] x86/virt/tdx: Remove the !KEXEC_CORE dependency
  2025-06-26 10:48 ` [PATCH v3 4/6] x86/virt/tdx: Remove the !KEXEC_CORE dependency Kai Huang
@ 2025-06-26 18:49   ` Edgecombe, Rick P
  0 siblings, 0 replies; 42+ messages in thread
From: Edgecombe, Rick P @ 2025-06-26 18:49 UTC (permalink / raw)
  To: Hansen, Dave, Huang, Kai, bp@alien8.de, peterz@infradead.org,
	hpa@zytor.com, mingo@redhat.com, tglx@linutronix.de,
	thomas.lendacky@amd.com
  Cc: seanjc@google.com, x86@kernel.org, sagis@google.com,
	Chatre, Reinette, linux-kernel@vger.kernel.org,
	kirill.shutemov@linux.intel.com, Williams, Dan J,
	kvm@vger.kernel.org, pbonzini@redhat.com, Yamahata, Isaku,
	ashish.kalra@amd.com, Chen, Farrah, nik.borisov@suse.com

On Thu, 2025-06-26 at 22:48 +1200, Kai Huang wrote:
> During kexec it is now guaranteed that all dirty cachelines of TDX
> private memory are flushed before jumping to the new kernel.  The TDX
> private memory from the old kernel will remain as TDX private memory in
> the new kernel, but it is OK because kernel read/write to TDX private
> memory will never cause machine check, except on the platforms with the
> TDX partial write erratum, which has already been handled.
> 
> It is safe to allow kexec to work together with TDX now.  Remove the
> !KEXEC_CORE dependency.
> 
> Signed-off-by: Kai Huang <kai.huang@intel.com>
> Tested-by: Farrah Chen <farrah.chen@intel.com>

Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>


* Re: [PATCH v3 5/6] x86/virt/tdx: Update the kexec section in the TDX documentation
  2025-06-26 10:48 ` [PATCH v3 5/6] x86/virt/tdx: Update the kexec section in the TDX documentation Kai Huang
@ 2025-06-26 18:51   ` Edgecombe, Rick P
  0 siblings, 0 replies; 42+ messages in thread
From: Edgecombe, Rick P @ 2025-06-26 18:51 UTC (permalink / raw)
  To: Hansen, Dave, Huang, Kai, bp@alien8.de, peterz@infradead.org,
	hpa@zytor.com, mingo@redhat.com, tglx@linutronix.de,
	thomas.lendacky@amd.com
  Cc: seanjc@google.com, x86@kernel.org, sagis@google.com,
	Chatre, Reinette, linux-kernel@vger.kernel.org,
	kirill.shutemov@linux.intel.com, Williams, Dan J,
	kvm@vger.kernel.org, pbonzini@redhat.com, Yamahata, Isaku,
	ashish.kalra@amd.com, Chen, Farrah, nik.borisov@suse.com

On Thu, 2025-06-26 at 22:48 +1200, Kai Huang wrote:
> TDX host kernel now supports kexec/kdump.  Update the documentation to
> reflect that.
> 
> Opportunistically, remove the parentheses in "Kexec()" and move this
> section under the "Erratum" section because the updated "Kexec" section
> now refers to that erratum.
> 
> Signed-off-by: Kai Huang <kai.huang@intel.com>
> Tested-by: Farrah Chen <farrah.chen@intel.com>
> ---

Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>


* Re: [PATCH v3 2/6] x86/virt/tdx: Mark memory cache state incoherent when making SEAMCALL
  2025-06-26 18:37   ` Edgecombe, Rick P
@ 2025-06-26 23:36     ` Huang, Kai
  2025-06-27  0:52       ` Edgecombe, Rick P
  0 siblings, 1 reply; 42+ messages in thread
From: Huang, Kai @ 2025-06-26 23:36 UTC (permalink / raw)
  To: Hansen, Dave, Edgecombe, Rick P, bp@alien8.de,
	peterz@infradead.org, hpa@zytor.com, mingo@redhat.com,
	tglx@linutronix.de, thomas.lendacky@amd.com
  Cc: nik.borisov@suse.com, seanjc@google.com, x86@kernel.org,
	sagis@google.com, Chatre, Reinette, linux-kernel@vger.kernel.org,
	kirill.shutemov@linux.intel.com, Williams, Dan J,
	kvm@vger.kernel.org, pbonzini@redhat.com, Yamahata, Isaku,
	ashish.kalra@amd.com, Chen, Farrah


(I'll fix all wording comments above)

> > 
> > v2 -> v3:
> >  - Change to use __always_inline for do_seamcall() to avoid indirect
> >    call instructions of making SEAMCALL.
> 
> How did this come about?

We recently had a "missing ENDBR" build warning fixed, which was caused
by the compiler failing to inline the 'static inline' sc_retry().  It got fixed by
changing to __always_inline, so we need to use __always_inline here too
otherwise the compiler may still refuse to inline it.

See commit 0b3bc018e86a ("x86/virt/tdx: Avoid indirect calls to TDX assembly
functions")
 
> 
> >  - Remove the sentence "not all SEAMCALLs generate dirty cachelines of
> >    TDX private memory but just treat all of them do." in changelog and
> >    the code comment. -- Dave
> > 
> > ---
> >  arch/x86/include/asm/tdx.h | 29 ++++++++++++++++++++++++++++-
> >  1 file changed, 28 insertions(+), 1 deletion(-)
> > 
> > diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
> > index 7ddef3a69866..d4c624c69d7f 100644
> > --- a/arch/x86/include/asm/tdx.h
> > +++ b/arch/x86/include/asm/tdx.h
> > @@ -102,10 +102,37 @@ u64 __seamcall_ret(u64 fn, struct tdx_module_args *args);
> >  u64 __seamcall_saved_ret(u64 fn, struct tdx_module_args *args);
> >  void tdx_init(void);
> >  
> > +#include <linux/preempt.h>
> >  #include <asm/archrandom.h>
> > +#include <asm/processor.h>
> >  
> >  typedef u64 (*sc_func_t)(u64 fn, struct tdx_module_args *args);
> >  
> > +static __always_inline u64 do_seamcall(sc_func_t func, u64 fn,
> > +				       struct tdx_module_args *args)
> > +{
> 
> So now we have:
> 
> seamcall()
> 	sc_retry()
> 		do_seamcall()
> 			__seamcall()
> 
> 
> do_seamcall() is only called from sc_retry(). Why add yet another helper in the
> stack? You could just build it into sc_retry().

It's just more readable if we have the do_seamcall().  It's always inlined
anyway.

> 
> Oh, and __seamcall_*() variety is called directly too, so skips the
> do_seamcall() per-cpu var logic in those cases. So, maybe do_seamcall() is
> needed, but it needs a better name considering where it would get called from.
> 
> These wrappers need an overhaul I think, but maybe for now just have
> do_dirty_seamcall() which is called from tdh_vp_enter() and sc_retry().

Right.  I forgot TDH.VP.ENTER and TDH_PHYMEM_PAGE_RDMD are called directly
using __seamcall*().

We can move preempt_disable()/enable() out of do_seamcall() to sc_retry()
and instead add a lockdep_assert_preemption_disabled() there, and then
change tdh_vp_enter() and paddr_is_tdx_private() to call do_seamcall()
instead.

> 
> Oh no, actually scratch that! The inline/flatten issue will happen again if we
> add the per-cpu vars to tdh_vp_enter()... Which means we probably need to set
> the per-cpu var in tdx_vcpu_enter_exit(). And the other __seamcall() caller is
> the machine check handler...

this_cpu_write() itself won't do any function call so it's fine.

Well, lockdep_assert_preemption_disabled() does have a WARN_ON_ONCE(), but
AFAICT using it in noinstr code is fine:

/*                                                                         
 * This instrumentation_begin() is strictly speaking incorrect; but it     
 * suppresses the complaints from WARN()s in noinstr code. If such a WARN()
 * were to trigger, we'd rather wreck the machine in an attempt to get the 
 * message out than not know about it.
 */                                                                        
#define __WARN_FLAGS(cond_str, flags)                           \          
do {                                                            \          
        __auto_type __flags = BUGFLAG_WARNING|(flags);          \          
        instrumentation_begin();                                \          
        _BUG_FLAGS(cond_str, ASM_UD2, __flags, ANNOTATE_REACHABLE(1b)); \  
        instrumentation_end();                                  \          
} while (0)  

We can also just remove the lockdep_assert_preemption_disabled() in
do_seamcall() if this is really a concern.

> 
> Am I missing something? It seems this patch is incomplete. If some of these
> missed SEAMCALLs don't dirty a cacheline, then the justification that it works
> by just covering all seamcalls needs to be updated.

I think we just want to treat all SEAMCALLs as if they can dirty cachelines.

> 
> 
> Side topic. Do all the SEAMCALL wrappers calling into the seamcall_*() variety
> of wrappers need the entropy retry logic? 
> 

The purpose of doing it in common code is that we don't need to have
duplicated code to handle running out of entropy for different SEAMCALLs.

> I think no, and some callers actually
> depend on it not happening.

Besides TDH.VP.ENTER and TDH.PHYMEM.PAGE.RDMD, for which we know running out of
entropy cannot happen, I am not aware of any SEAMCALL that "depends on"
it not happening.  Could you elaborate?


* Re: [PATCH v3 6/6] KVM: TDX: Explicitly do WBINVD upon reboot notifier
  2025-06-26 10:48 ` [PATCH v3 6/6] KVM: TDX: Explicitly do WBINVD upon reboot notifier Kai Huang
@ 2025-06-27  0:01   ` Edgecombe, Rick P
  2025-06-27  1:00     ` Huang, Kai
  2025-07-01  6:09   ` Binbin Wu
  2025-07-02  7:54   ` Chao Gao
  2 siblings, 1 reply; 42+ messages in thread
From: Edgecombe, Rick P @ 2025-06-27  0:01 UTC (permalink / raw)
  To: Hansen, Dave, Huang, Kai, bp@alien8.de, peterz@infradead.org,
	hpa@zytor.com, mingo@redhat.com, tglx@linutronix.de,
	thomas.lendacky@amd.com
  Cc: seanjc@google.com, x86@kernel.org, sagis@google.com,
	Chatre, Reinette, linux-kernel@vger.kernel.org,
	kirill.shutemov@linux.intel.com, Williams, Dan J,
	kvm@vger.kernel.org, pbonzini@redhat.com, Yamahata, Isaku,
	ashish.kalra@amd.com, Chen, Farrah, nik.borisov@suse.com

On Thu, 2025-06-26 at 22:48 +1200, Kai Huang wrote:
> On TDX platforms, during kexec, the kernel needs to make sure there's no
> dirty cachelines of TDX private memory before booting to the new kernel
> to avoid silent memory corruption to the new kernel.
> 
> During kexec, the kexec-ing CPU first invokes native_stop_other_cpus()
> to stop all remote CPUs before booting to the new kernel.  The remote
> CPUs will then execute stop_this_cpu() to stop themselves.
> 
> The kernel has a percpu boolean to indicate whether the cache of a CPU
> may be in an incoherent state.  In stop_this_cpu(), the kernel does WBINVD
> if that percpu boolean is true.
> 
> TDX turns on that percpu boolean on a CPU when the kernel does SEAMCALL.
> This makes sure the caches will be flushed during kexec.
> 
> However, the native_stop_other_cpus() and stop_this_cpu() have a "race"
> which is extremely rare to happen but could cause system to hang.
> 
> Specifically, native_stop_other_cpus() first sends a normal reboot
> IPI to remote CPUs and waits one second for them to stop.  If that times
> out, native_stop_other_cpus() then sends NMIs to remote CPUs to stop
> them.
> 
> The aforementioned race happens when NMIs are sent.  Doing WBINVD in
> stop_this_cpu() makes each CPU take longer time to stop and increases
> the chance of the race to happen.
> 
> Register a reboot notifier in KVM to explicitly flush caches upon
> receiving the reboot notification (e.g., during kexec) for TDX.  This
> moves the WBINVD to an earlier stage than stop_this_cpu(), avoiding a
> possibly lengthy operation at a time when it could cause this race.
> 

I think this will reduce the chance of the race, but it feels like the wrong way
to address the race. But I don't know how to properly fix it either. Maybe this
is just the nature of x86 NMIs, to have code like this.

Functionally it looks good, but a few nits to take or leave below.

> Signed-off-by: Kai Huang <kai.huang@intel.com>
> Acked-by: Paolo Bonzini <pbonzini@redhat.com>
> Tested-by: Farrah Chen <farrah.chen@intel.com>
> ---
> 
> v2 -> v3:
>  - Update changelog to address Paolo's comments and Add Paolo's Ack:
>    https://lore.kernel.org/lkml/3a7c0856-6e7b-4d3d-b966-6f17f1aca42e@redhat.com/
> 
> ---
>  arch/x86/include/asm/tdx.h  |  3 +++
>  arch/x86/kvm/vmx/tdx.c      | 45 +++++++++++++++++++++++++++++++++++++
>  arch/x86/virt/vmx/tdx/tdx.c |  9 ++++++++
>  3 files changed, 57 insertions(+)
> 
> diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
> index d4c624c69d7f..e6b11982c6c6 100644
> --- a/arch/x86/include/asm/tdx.h
> +++ b/arch/x86/include/asm/tdx.h
> @@ -221,6 +221,8 @@ u64 tdh_mem_page_remove(struct tdx_td *td, u64 gpa, u64 level, u64 *ext_err1, u6
>  u64 tdh_phymem_cache_wb(bool resume);
>  u64 tdh_phymem_page_wbinvd_tdr(struct tdx_td *td);
>  u64 tdh_phymem_page_wbinvd_hkid(u64 hkid, struct page *page);
> +

Nit: There is a new line here, but not below. I guess it's ok.

> +void tdx_cpu_flush_cache(void);
>  #else
>  static inline void tdx_init(void) { }
>  static inline int tdx_cpu_enable(void) { return -ENODEV; }
> @@ -228,6 +230,7 @@ static inline int tdx_enable(void)  { return -ENODEV; }
>  static inline u32 tdx_get_nr_guest_keyids(void) { return 0; }
>  static inline const char *tdx_dump_mce_info(struct mce *m) { return NULL; }
>  static inline const struct tdx_sys_info *tdx_get_sysinfo(void) { return NULL; }
> +static inline void tdx_cpu_flush_cache(void) { }
>  #endif	/* CONFIG_INTEL_TDX_HOST */
>  
>  #endif /* !__ASSEMBLER__ */
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index 1ad20c273f3b..c567a64a6cb0 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -5,7 +5,9 @@
>  #include <asm/fpu/xcr.h>
>  #include <linux/misc_cgroup.h>
>  #include <linux/mmu_context.h>
> +#include <linux/reboot.h>
>  #include <asm/tdx.h>
> +#include <asm/processor.h>
>  #include "capabilities.h"
>  #include "mmu.h"
>  #include "x86_ops.h"
> @@ -3347,6 +3349,33 @@ static int tdx_offline_cpu(unsigned int cpu)
>  	return -EBUSY;
>  }
>  
> +static void smp_func_cpu_flush_cache(void *unused)
> +{
> +	tdx_cpu_flush_cache();
> +}
> +
> +static int tdx_reboot_notify(struct notifier_block *nb, unsigned long code,
> +			     void *unused)
> +{
> +	/*
> +	 * Flush cache for all CPUs upon the reboot notifier.  This
> +	 * avoids having to do WBINVD in stop_this_cpu() during kexec.
> +	 *
> +	 * Kexec calls native_stop_other_cpus() to stop remote CPUs
> +	 * before booting to the new kernel, but that code has a "race"
> +	 * when the normal REBOOT IPI times out and NMIs are sent to
> +	 * remote CPUs to stop them.  Doing WBINVD in stop_this_cpu()
> +	 * could potentially increase the possibility of the "race".
> +	 */
> +	if (code == SYS_RESTART)
> +		on_each_cpu(smp_func_cpu_flush_cache, NULL, 1);
> +	return NOTIFY_DONE;
> +}
> +
> +static struct notifier_block tdx_reboot_nb = {
> +	.notifier_call = tdx_reboot_notify,
> +};
> +
>  static void __do_tdx_cleanup(void)
>  {
>  	/*
> @@ -3504,6 +3533,11 @@ void tdx_cleanup(void)
>  {
>  	if (enable_tdx) {
>  		misc_cg_set_capacity(MISC_CG_RES_TDX, 0);
> +		/*
> +		 * Ignore the return value.  See the comment in
> +		 * tdx_bringup().
> +		 */
> +		unregister_reboot_notifier(&tdx_reboot_nb);
>  		__tdx_cleanup();
>  		kvm_disable_virtualization();
>  	}
> @@ -3587,6 +3621,17 @@ int __init tdx_bringup(void)
>  		enable_tdx = 0;
>  	}
>  
> +	if (enable_tdx)
> +		/*
> +		 * Ignore the return value.  @tdx_reboot_nb is used to flush
> +		 * cache for all CPUs upon rebooting to avoid having to do
> +		 * WBINVD in kexec while the kexec-ing CPU stops all remote
> +		 * CPUs.  Failure to register isn't fatal, because if KVM
> +		 * doesn't flush cache explicitly upon rebooting the kexec
> +		 * will do it anyway.
> +		 */
> +		register_reboot_notifier(&tdx_reboot_nb);
> +

The comment should be inside a {}.

>  	return r;
>  
>  success_disable_tdx:
> diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> index c7a9a087ccaf..73425e9bee39 100644
> --- a/arch/x86/virt/vmx/tdx/tdx.c
> +++ b/arch/x86/virt/vmx/tdx/tdx.c
> @@ -1870,3 +1870,12 @@ u64 tdh_phymem_page_wbinvd_hkid(u64 hkid, struct page *page)
>  	return seamcall(TDH_PHYMEM_PAGE_WBINVD, &args);
>  }
>  EXPORT_SYMBOL_GPL(tdh_phymem_page_wbinvd_hkid);
> +
> +void tdx_cpu_flush_cache(void)
> +{
> +	lockdep_assert_preemption_disabled();
> +
> +	wbinvd();
> +	this_cpu_write(cache_state_incoherent, false);
> +}
> +EXPORT_SYMBOL_GPL(tdx_cpu_flush_cache);

Does this need to be here? Why not in KVM?


* Re: [PATCH v3 1/6] x86/sme: Use percpu boolean to control wbinvd during kexec
  2025-06-26 18:42     ` Edgecombe, Rick P
@ 2025-06-27  0:30       ` Huang, Kai
  2025-06-30  7:09         ` Binbin Wu
  0 siblings, 1 reply; 42+ messages in thread
From: Huang, Kai @ 2025-06-27  0:30 UTC (permalink / raw)
  To: Hansen, Dave, Edgecombe, Rick P, bp@alien8.de,
	peterz@infradead.org, hpa@zytor.com, mingo@redhat.com,
	tglx@linutronix.de, thomas.lendacky@amd.com
  Cc: seanjc@google.com, x86@kernel.org, sagis@google.com,
	Chatre, Reinette, linux-kernel@vger.kernel.org,
	kirill.shutemov@linux.intel.com, Williams, Dan J,
	kvm@vger.kernel.org, pbonzini@redhat.com, Yamahata, Isaku,
	ashish.kalra@amd.com, nik.borisov@suse.com

On Thu, 2025-06-26 at 18:42 +0000, Edgecombe, Rick P wrote:
> On Thu, 2025-06-26 at 10:59 -0700, Rick Edgecombe wrote:
> > Since the race is related to stop_this_cpu() it doesn't affect it. But it does
> > mean that the bring up CPU may not flush the cache if takes a kdump kexec before
> > the per-cpu var is set? Of course the existing logic doesn't trigger until
> > cpuinfo_x86 is populated. What is the consequence?
> 
> Oh, doh! This stuff all happens before kdump could load. Right?

I'll replace kdump with normal kexec in the above context, since the kdump case
(crash_kexec()) uses a different path from normal kexec when stopping remote
CPUs.  And there's no WBINVD involved in the crash dump code path for SME.

The reason, I guess, is that reading encrypted memory from /proc/vmcore is
meaningless anyway, therefore even if memory corruption happens it is fine.

Now for normal kexec:

For SME this boolean is never cleared once set when the CPU is brought up
for the first time.  So in practice, yes, when the kexec is performed the
boolean is already set for all CPUs.

But, you are right there's a case where the 'maxcpus' kernel commandline is
used, which results in only part of the CPUs being brought up during boot,
and the user can manually online the rest of the CPUs at runtime.

(Side note: 'maxcpus' should not be used on modern machines, see the end).

In this case, IIUC, _theoretically_, the native_stop_other_cpus() could race
with CPU onlining, because I didn't see CPU hotplug being explicitly disabled
before that.

But, native_stop_other_cpus() only sends IPIs to the CPUs which are
already set in cpu_online_mask, and when a CPU is already in
cpu_online_mask, the code to set the boolean has already run, so the
WBINVD will be done for that CPU.
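
(For reference, the "code to set the boolean" is roughly the bringup-time
version of the old stop_this_cpu() CPUID test -- an illustrative sketch
only, the exact hook patch 1 uses may differ; 'c' is the CPU's cpuinfo_x86:)

/* Illustrative sketch: latch the flag once when a CPU is brought up. */
if (c->extended_cpuid_level >= 0x8000001f &&
    (cpuid_eax(0x8000001f) & BIT(0)))
	this_cpu_write(cache_state_incoherent, true);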

The only possible issue that I could find is that one CPU becomes online after
cpus_stop_mask is already populated:

	CPU 0						CPU 1

							start_secondary()

	cpumask_copy(&cpus_stop_mask, cpu_online_mask);
	cpumask_clear_cpu(this_cpu, &cpus_stop_mask);		
							...
							ap_starting();
							...
							set_cpu_online();
							...

	if (!cpumask_empty(&cpus_stop_mask)) {                                    
                apic_send_IPI_allbutself(REBOOT_VECTOR);
		...

But again this issue (assuming it exists) is an existing issue which is not
related to this patch.

And I am not 100% sure whether this issue exists, since allowing CPU hotplug
during kexec doesn't seem reasonable to me, at least on x86, despite it
is indeed enabled in the kernel_kexec() common code:

        /*
         * migrate_to_reboot_cpu() disables CPU hotplug assuming that
         * no further code needs to use CPU hotplug (which is true in
         * the reboot case). However, the kexec path depends on using
         * CPU hotplug again; so re-enable it here.
         */
         cpu_hotplug_enable();
         pr_notice("Starting new kernel\n");
         machine_shutdown();

I tried git blame to find a clue but failed since the history was lost
during file moves/renames etc.  I suspect it is for other ARCHs.

Lastly, I believe 'maxcpus' should not be used on modern platforms anyway
due to the below:

static inline bool cpu_bootable(unsigned int cpu)
{
	...

        /*
         * On x86 it's required to boot all logical CPUs at least once so
         * that the init code can get a chance to set CR4.MCE on each
         * CPU. Otherwise, a broadcasted MCE observing CR4.MCE=0b on any
         * core will shutdown the machine.
         */
        return !cpumask_test_cpu(cpu, &cpus_booted_once_mask);
}

So using the boolean for SME really shouldn't have any issue.

							


* Re: [PATCH v3 1/6] x86/sme: Use percpu boolean to control wbinvd during kexec
  2025-06-26 17:59   ` Edgecombe, Rick P
  2025-06-26 18:42     ` Edgecombe, Rick P
@ 2025-06-27  0:37     ` Huang, Kai
  2025-06-27  0:39       ` Edgecombe, Rick P
  1 sibling, 1 reply; 42+ messages in thread
From: Huang, Kai @ 2025-06-27  0:37 UTC (permalink / raw)
  To: Hansen, Dave, Edgecombe, Rick P, bp@alien8.de,
	peterz@infradead.org, hpa@zytor.com, mingo@redhat.com,
	tglx@linutronix.de, thomas.lendacky@amd.com
  Cc: seanjc@google.com, x86@kernel.org, sagis@google.com,
	Chatre, Reinette, linux-kernel@vger.kernel.org,
	kirill.shutemov@linux.intel.com, Williams, Dan J,
	kvm@vger.kernel.org, pbonzini@redhat.com, Yamahata, Isaku,
	ashish.kalra@amd.com, nik.borisov@suse.com


(I'll address all wording comments, as always.)

> > Today the first WBINVD in the stop_this_cpu() is performed when SME is
> > *supported* by the platform, and the second WBINVD is done in
> > relocate_kernel() when SME is *activated* by the kernel.  Make things
> > simple by changing to do the second WBINVD when the platform supports
> > SME.  This allows the kernel to simply turn on this percpu boolean when
> > bringing up a CPU by checking whether the platform supports SME.
> 
> Since the race is related to stop_this_cpu() it doesn't affect it. But it does
> mean that the bring up CPU may not flush the cache if takes a kdump kexec before
> the per-cpu var is set? Of course the existing logic doesn't trigger until
> cpuinfo_x86 is populated. What is the consequence?

See another reply.

> 
> Arguably the supported/enabled part could be moved to a separate earlier patch.
> The code change would just get immediately replaced, but the benefit would be
> that a bisect would show which part of the change was responsible.

I am not a fan of splitting the new variable and the user into different
patches, as long as the patch isn't too big to review.  You need to review
them together anyway I think, so arguably putting them together is easier to
review.


* Re: [PATCH v3 1/6] x86/sme: Use percpu boolean to control wbinvd during kexec
  2025-06-27  0:37     ` Huang, Kai
@ 2025-06-27  0:39       ` Edgecombe, Rick P
  2025-06-27  1:06         ` Huang, Kai
  0 siblings, 1 reply; 42+ messages in thread
From: Edgecombe, Rick P @ 2025-06-27  0:39 UTC (permalink / raw)
  To: Hansen, Dave, bp@alien8.de, Huang, Kai, peterz@infradead.org,
	hpa@zytor.com, mingo@redhat.com, thomas.lendacky@amd.com,
	tglx@linutronix.de
  Cc: seanjc@google.com, x86@kernel.org, sagis@google.com,
	Chatre, Reinette, linux-kernel@vger.kernel.org,
	kirill.shutemov@linux.intel.com, Williams, Dan J,
	kvm@vger.kernel.org, pbonzini@redhat.com, Yamahata, Isaku,
	ashish.kalra@amd.com, nik.borisov@suse.com

On Fri, 2025-06-27 at 00:37 +0000, Huang, Kai wrote:
> > Arguably the supported/enabled part could be moved to a separate earlier
> > patch.
> > The code change would just get immediately replaced, but the benefit would
> > be
> > that a bisect would show which part of the change was responsible.
> 
> I am not a fan of splitting the new variable and the user into different
> patches, as long as the patch isn't too big to review.  You need to review
> them together anyway I think, so arguably putting them together is easier to
> review.

How about if Tom thinks there is any risk, we can split them for bisectability
help. Otherwise reduce the churn.


* Re: [PATCH v3 2/6] x86/virt/tdx: Mark memory cache state incoherent when making SEAMCALL
  2025-06-26 23:36     ` Huang, Kai
@ 2025-06-27  0:52       ` Edgecombe, Rick P
  2025-06-27  1:47         ` Huang, Kai
  0 siblings, 1 reply; 42+ messages in thread
From: Edgecombe, Rick P @ 2025-06-27  0:52 UTC (permalink / raw)
  To: Hansen, Dave, bp@alien8.de, Huang, Kai, peterz@infradead.org,
	hpa@zytor.com, mingo@redhat.com, thomas.lendacky@amd.com,
	tglx@linutronix.de
  Cc: ashish.kalra@amd.com, seanjc@google.com, x86@kernel.org,
	sagis@google.com, Chatre, Reinette, linux-kernel@vger.kernel.org,
	kirill.shutemov@linux.intel.com, Williams, Dan J,
	kvm@vger.kernel.org, pbonzini@redhat.com, nik.borisov@suse.com,
	Yamahata, Isaku, Chen, Farrah

On Thu, 2025-06-26 at 23:36 +0000, Huang, Kai wrote:
> (I'll fix all wording comments above)
> 
> > > 
> > > v2 -> v3:
> > >  - Change to use __always_inline for do_seamcall() to avoid indirect
> > >    call instructions of making SEAMCALL.
> > 
> > How did this come about?
> 
> We recently had a "missing ENDBR" build warning fixed, which was caused
> by the compiler failing to inline the 'static inline' sc_retry().  It got fixed by
> changing to __always_inline, so we need to use __always_inline here too
> otherwise the compiler may still refuse to inline it.

Oh, right.
> 
> See commit 0b3bc018e86a ("x86/virt/tdx: Avoid indirect calls to TDX assembly
> functions")
>  
> > 
> > >  - Remove the sentence "not all SEAMCALLs generate dirty cachelines of
> > >    TDX private memory but just treat all of them do." in changelog and
> > >    the code comment. -- Dave
> > > 
> > > ---
> > >  arch/x86/include/asm/tdx.h | 29 ++++++++++++++++++++++++++++-
> > >  1 file changed, 28 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
> > > index 7ddef3a69866..d4c624c69d7f 100644
> > > --- a/arch/x86/include/asm/tdx.h
> > > +++ b/arch/x86/include/asm/tdx.h
> > > @@ -102,10 +102,37 @@ u64 __seamcall_ret(u64 fn, struct tdx_module_args *args);
> > >  u64 __seamcall_saved_ret(u64 fn, struct tdx_module_args *args);
> > >  void tdx_init(void);
> > >  
> > > +#include <linux/preempt.h>
> > >  #include <asm/archrandom.h>
> > > +#include <asm/processor.h>
> > >  
> > >  typedef u64 (*sc_func_t)(u64 fn, struct tdx_module_args *args);
> > >  
> > > +static __always_inline u64 do_seamcall(sc_func_t func, u64 fn,
> > > +				       struct tdx_module_args *args)
> > > +{
> > 
> > So now we have:
> > 
> > seamcall()
> > 	sc_retry()
> > 		do_seamcall()
> > 			__seamcall()
> > 
> > 
> > do_seamcall() is only called from sc_retry(). Why add yet another helper in the
> > stack? You could just build it into sc_retry().
> 
> It's just more readable if we have the do_seamcall().  It's always inlined
> anyway.

Don't you think that is a questionable chain of names? I was thinking that we
might want to do a future cleanup of all these wrappers. But I wondered if it
was one of those "least worse" options kind of things, and you already tried
something and threw your hands up. I think the existing layers are already
questionable. Which we don't need to cleanup for this series.

> 
> > 
> > Oh, and __seamcall_*() variety is called directly too, so skips the
> > do_seamcall() per-cpu var logic in those cases. So, maybe do_seamcall() is
> > needed, but it needs a better name considering where it would get called from.
> > 
> > These wrappers need an overhaul I think, but maybe for now just have
> > do_dirty_seamcall() which is called from tdh_vp_enter() and sc_retry().
> 
> Right.  I forgot TDH.VP.ENTER and TDH_PHYMEM_PAGE_RDMD are called directly
> using __seamcall*().
> 
> We can move preempt_disable()/enable() out of do_seamcall() to sc_retry()
> and instead add a lockdep_assert_preemption_disabled() there, and then
> change tdh_vp_enter() and paddr_is_tdx_private() to call do_seamcall()
> instead.

Can you play around with it and find a good fix? It needs to mark the per-cpu
var and not cause the inline warnings for tdh_vp_enter().

> 
> > 
> > Oh no, actually scratch that! The inline/flatten issue will happen again if we
> > add the per-cpu vars to tdh_vp_enter()... Which means we probably need to set
> > the per-cpu var in tdx_vcpu_enter_exit(). And the other __seamcall() caller is
> > the machine check handler...
> 
> this_cpu_write() itself won't do any function call so it's fine.
> 
> Well, lockdep_assert_preemption_disabled() does have a WARN_ON_ONCE(), but
> AFAICT using it in noinstr code is fine:

I was looking at preempt_latency_start(). But yea, it looked like there were a
few that *shouldn't* be non-inlined, but as we saw recently...

> 
> /*                                                                         
>  * This instrumentation_begin() is strictly speaking incorrect; but it     
>  * suppresses the complaints from WARN()s in noinstr code. If such a WARN()
>  * were to trigger, we'd rather wreck the machine in an attempt to get the 
>  * message out than not know about it.
>  */                                                                        
> #define __WARN_FLAGS(cond_str, flags)                           \          
> do {                                                            \          
>         __auto_type __flags = BUGFLAG_WARNING|(flags);          \          
>         instrumentation_begin();                                \          
>         _BUG_FLAGS(cond_str, ASM_UD2, __flags, ANNOTATE_REACHABLE(1b)); \  
>         instrumentation_end();                                  \          
> } while (0)  
> 
> We can also just remove the lockdep_assert_preemption_disabled() in
> do_seamcall() if this is really a concern.

The concern is that a weird compiler/config generates a problem like this:
https://lore.kernel.org/lkml/20250624101351.8019-1-kai.huang@intel.com/

Do you think it's not valid?

> 
> > 
> > Am I missing something? It seems this patch is incomplete. If some of these
> > missed SEAMCALLs don't dirty a cacheline, then the justification that it works
> > by just covering all seamcalls needs to be updated.
> 
> I think we just want to treat all SEAMCALLs as if they can dirty cachelines.

Right, that was the idea. I was leaving open the option that it was on purpose
to avoid these other problems. But, yes, let's stick with the (hopefully)
simpler system.

> 
> > 
> > 
> > Side topic. Do all the SEAMCALL wrappers calling into the seamcall_*() variety
> > of wrappers need the entropy retry logic? 
> > 
> 
> The purpose of doing it in common code is that we don't need to have
> duplicated code to handle running out of entropy for different SEAMCALLs.
> 
> > I think no, and some callers actually
> > depend on it not happening.
> 
> Besides TDH.VP.ENTER and TDH.PHYMEM.PAGE.RDMD, for which we know running out of
> entropy cannot happen, I am not aware of any SEAMCALL that "depends on"
> it not happening.  Could you elaborate?

Some SEAMCALLs are expected to succeed, like in the BUSY error code breaking
schemes for the S-EPT ones.



* Re: [PATCH v3 6/6] KVM: TDX: Explicitly do WBINVD upon reboot notifier
  2025-06-27  0:01   ` Edgecombe, Rick P
@ 2025-06-27  1:00     ` Huang, Kai
  0 siblings, 0 replies; 42+ messages in thread
From: Huang, Kai @ 2025-06-27  1:00 UTC (permalink / raw)
  To: Hansen, Dave, Edgecombe, Rick P, bp@alien8.de,
	peterz@infradead.org, hpa@zytor.com, mingo@redhat.com,
	tglx@linutronix.de, thomas.lendacky@amd.com
  Cc: nik.borisov@suse.com, seanjc@google.com, x86@kernel.org,
	sagis@google.com, Chatre, Reinette, linux-kernel@vger.kernel.org,
	kirill.shutemov@linux.intel.com, Williams, Dan J,
	kvm@vger.kernel.org, pbonzini@redhat.com, Yamahata, Isaku,
	ashish.kalra@amd.com, Chen, Farrah


> > --- a/arch/x86/include/asm/tdx.h
> > +++ b/arch/x86/include/asm/tdx.h
> > @@ -221,6 +221,8 @@ u64 tdh_mem_page_remove(struct tdx_td *td, u64 gpa, u64 level, u64 *ext_err1, u6
> >  u64 tdh_phymem_cache_wb(bool resume);
> >  u64 tdh_phymem_page_wbinvd_tdr(struct tdx_td *td);
> >  u64 tdh_phymem_page_wbinvd_hkid(u64 hkid, struct page *page);
> > +
> 
> Nit: There is a new line here, but not below. I guess it's ok.

I will remove.


[...]

> > +	if (enable_tdx)
> > +		/*
> > +		 * Ignore the return value.  @tdx_reboot_nb is used to flush
> > +		 * cache for all CPUs upon rebooting to avoid having to do
> > +		 * WBINVD in kexec while the kexec-ing CPU stops all remote
> > +		 * CPUs.  Failure to register isn't fatal, because if KVM
> > +		 * doesn't flush cache explicitly upon rebooting the kexec
> > +		 * will do it anyway.
> > +		 */
> > +		register_reboot_notifier(&tdx_reboot_nb);
> > +
> 
> The comment should be inside a {}.

Will do.

> 
> >  	return r;
> >  
> >  success_disable_tdx:
> > diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> > index c7a9a087ccaf..73425e9bee39 100644
> > --- a/arch/x86/virt/vmx/tdx/tdx.c
> > +++ b/arch/x86/virt/vmx/tdx/tdx.c
> > @@ -1870,3 +1870,12 @@ u64 tdh_phymem_page_wbinvd_hkid(u64 hkid, struct page *page)
> >  	return seamcall(TDH_PHYMEM_PAGE_WBINVD, &args);
> >  }
> >  EXPORT_SYMBOL_GPL(tdh_phymem_page_wbinvd_hkid);
> > +
> > +void tdx_cpu_flush_cache(void)
> > +{
> > +	lockdep_assert_preemption_disabled();
> > +
> > +	wbinvd();
> > +	this_cpu_write(cache_state_incoherent, false);
> > +}
> > +EXPORT_SYMBOL_GPL(tdx_cpu_flush_cache);
> 
> Does this need to be here? Why not in KVM?

Otherwise the 'cache_state_incoherent' variable needs to be exported.  It's
good to hide the details in TDX core code too.
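
(For comparison, a rough sketch of what keeping it in KVM would require --
hypothetical, just to show what would need to be exported:)

/* The core kernel would have to export the percpu flag, e.g.: */
DECLARE_PER_CPU(bool, cache_state_incoherent);		/* in a header  */
EXPORT_PER_CPU_SYMBOL_GPL(cache_state_incoherent);	/* in process.c */

/* ... so that KVM could open-code the flush itself: */
static void smp_func_cpu_flush_cache(void *unused)
{
	wbinvd();
	this_cpu_write(cache_state_incoherent, false);
}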


* Re: [PATCH v3 1/6] x86/sme: Use percpu boolean to control wbinvd during kexec
  2025-06-27  0:39       ` Edgecombe, Rick P
@ 2025-06-27  1:06         ` Huang, Kai
  0 siblings, 0 replies; 42+ messages in thread
From: Huang, Kai @ 2025-06-27  1:06 UTC (permalink / raw)
  To: Hansen, Dave, Edgecombe, Rick P, bp@alien8.de,
	peterz@infradead.org, hpa@zytor.com, mingo@redhat.com,
	thomas.lendacky@amd.com, tglx@linutronix.de
  Cc: seanjc@google.com, x86@kernel.org, sagis@google.com,
	Chatre, Reinette, linux-kernel@vger.kernel.org,
	kirill.shutemov@linux.intel.com, Williams, Dan J,
	kvm@vger.kernel.org, pbonzini@redhat.com, Yamahata, Isaku,
	ashish.kalra@amd.com, nik.borisov@suse.com

On Fri, 2025-06-27 at 00:39 +0000, Edgecombe, Rick P wrote:
> On Fri, 2025-06-27 at 00:37 +0000, Huang, Kai wrote:
> > > Arguably the supported/enabled part could be moved to a separate earlier
> > > patch.
> > > The code change would just get immediately replaced, but the benefit would
> > > be
> > > that a bisect would show which part of the change was responsible.
> > 
> > I am not a fan of splitting the new variable and the user into different
> > patches, as long as the patch isn't too big to review.  You need to review
> > them together anyway I think, so arguably putting them together is easier to
> > review.
> 
> How about if Tom thinks there is any risk, we can split them for bisectability
> help. Otherwise reduce the churn.

Sure.  But I don't understand how this can impact bisect?

Let's assume we have patch 1 to introduce the boolean w/o user, and patch 2
to modify SME code.  If there's any issue, the bisect will point to the
patch 2.  If we have one patch then the bisect will point to this one patch
too.  Is there any difference?

The downside of splitting into two patches is that we need to make sure there's
no build warning when introducing a variable w/o user, for which we might
need to do additional things (e.g., it happens when adding a static function
w/o use).


* Re: [PATCH v3 2/6] x86/virt/tdx: Mark memory cache state incoherent when making SEAMCALL
  2025-06-27  0:52       ` Edgecombe, Rick P
@ 2025-06-27  1:47         ` Huang, Kai
  0 siblings, 0 replies; 42+ messages in thread
From: Huang, Kai @ 2025-06-27  1:47 UTC (permalink / raw)
  To: Hansen, Dave, Edgecombe, Rick P, bp@alien8.de,
	peterz@infradead.org, hpa@zytor.com, mingo@redhat.com,
	thomas.lendacky@amd.com, tglx@linutronix.de
  Cc: seanjc@google.com, x86@kernel.org, sagis@google.com,
	Chatre, Reinette, linux-kernel@vger.kernel.org,
	kirill.shutemov@linux.intel.com, Williams, Dan J,
	ashish.kalra@amd.com, kvm@vger.kernel.org, nik.borisov@suse.com,
	pbonzini@redhat.com, Chen, Farrah, Yamahata, Isaku

On Fri, 2025-06-27 at 00:52 +0000, Edgecombe, Rick P wrote:
> On Thu, 2025-06-26 at 23:36 +0000, Huang, Kai wrote:
> > (I'll fix all wording comments above)
> > 
> > > > 
> > > > v2 -> v3:
> > > >  - Change to use __always_inline for do_seamcall() to avoid indirect
> > > >    call instructions of making SEAMCALL.
> > > 
> > > How did this come about?
> > 
> > We recently had a "missing ENDBR" build warning fixed, which was caused
> > by the compiler failing to inline the 'static inline' sc_retry().  It got fixed by
> > changing to __always_inline, so we need to use __always_inline here too
> > otherwise the compiler may still refuse to inline it.
> 
> Oh, right.
> > 
> > See commit 0b3bc018e86a ("x86/virt/tdx: Avoid indirect calls to TDX assembly
> > functions")
> >  
> > > 
> > > >  - Remove the sentence "not all SEAMCALLs generate dirty cachelines of
> > > >    TDX private memory but just treat all of them do." in changelog and
> > > >    the code comment. -- Dave
> > > > 
> > > > ---
> > > >  arch/x86/include/asm/tdx.h | 29 ++++++++++++++++++++++++++++-
> > > >  1 file changed, 28 insertions(+), 1 deletion(-)
> > > > 
> > > > diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
> > > > index 7ddef3a69866..d4c624c69d7f 100644
> > > > --- a/arch/x86/include/asm/tdx.h
> > > > +++ b/arch/x86/include/asm/tdx.h
> > > > @@ -102,10 +102,37 @@ u64 __seamcall_ret(u64 fn, struct tdx_module_args *args);
> > > >  u64 __seamcall_saved_ret(u64 fn, struct tdx_module_args *args);
> > > >  void tdx_init(void);
> > > >  
> > > > +#include <linux/preempt.h>
> > > >  #include <asm/archrandom.h>
> > > > +#include <asm/processor.h>
> > > >  
> > > >  typedef u64 (*sc_func_t)(u64 fn, struct tdx_module_args *args);
> > > >  
> > > > +static __always_inline u64 do_seamcall(sc_func_t func, u64 fn,
> > > > +				       struct tdx_module_args *args)
> > > > +{
> > > 
> > > So now we have:
> > > 
> > > seamcall()
> > > 	sc_retry()
> > > 		do_seamcall()
> > > 			__seamcall()
> > > 
> > > 
> > > do_seamcall() is only called from sc_retry(). Why add yet another helper in the
> > > stack? You could just build it into sc_retry().
> > 
> > It's just more readable if we have the do_seamcall().  It's always inlined
> > anyway.
> 
> Don't you think that is a questionable chain of names? I was thinking that we
> might want to do a future cleanup of all these wrappers. But I wondered if it
> was one of those "least worse" options kind of things, and you already tried
> something and threw your hands up. I think the existing layers are already
> questionable. Which we don't need to cleanup for this series.

I agree we should revisit this in the future.

> 
> > 
> > > 
> > > Oh, and __seamcall_*() variety is called directly too, so skips the
> > > do_seamcall() per-cpu var logic in those cases. So, maybe do_seamcall() is
> > > needed, but it needs a better name considering where it would get called from.
> > > 
> > > These wrappers need an overhaul I think, but maybe for now just have
> > > do_dirty_seamcall() which is called from tdh_vp_enter() and sc_retry().
> > 
> > Right.  I forgot TDH.VP.ENTER and TDH_PHYMEM_PAGE_RDMD are called directly
> > using __seamcall*().
> > 
> > We can move preempt_disable()/enable() out of do_seamcall() to sc_retry()
> > and instead add a lockdep_assert_preemption_disabled() there, and then
> > change tdh_vp_enter() and paddr_is_tdx_private() to call do_seamcall()
> > instead.
> 
> Can you play around with it and find a good fix? It needs to mark the per-cpu
> var and not cause the inline warnings for tdh_vp_enter().

Yeah, already did.  The below diff [*] doesn't trigger the warning for
tdh_vp_enter() with either gcc or clang, using the Kconfig that can trigger the
warning fixed by the patch here:

https://lore.kernel.org/lkml/20250624101351.8019-1-kai.huang@intel.com/

I'll see whether I can do more.

> 
> > 
> > > 
> > > Oh no, actually scratch that! The inline/flatten issue will happen again if we
> > > add the per-cpu vars to tdh_vp_enter()... Which means we probably need to set
> > > the per-cpu var in tdx_vcpu_enter_exit(). And the other __seamcall() caller is
> > > the machine check handler...
> > 
> > this_cpu_write() itself won't do any function call so it's fine.
> > 
> > Well, lockdep_assert_preemption_disabled() does have a WARN_ON_ONCE(), but
> > AFAICT using it in noinstr code is fine:
> 
> I was looking at preempt_latency_start(). But yea, it looked like there were a
> few that *shouldn't* be non-inlined, but as we saw recently...

Note with the diff [*] tdh_vp_enter() will only call
lockdep_assert_preemption_disabled() which doesn't call
preempt_latency_start().

> 
> > 
> > /*                                                                         
> >  * This instrumentation_begin() is strictly speaking incorrect; but it     
> >  * suppresses the complaints from WARN()s in noinstr code. If such a WARN()
> >  * were to trigger, we'd rather wreck the machine in an attempt to get the 
> >  * message out than not know about it.
> >  */                                                                        
> > #define __WARN_FLAGS(cond_str, flags)                           \          
> > do {                                                            \          
> >         __auto_type __flags = BUGFLAG_WARNING|(flags);          \          
> >         instrumentation_begin();                                \          
> >         _BUG_FLAGS(cond_str, ASM_UD2, __flags, ANNOTATE_REACHABLE(1b)); \  
> >         instrumentation_end();                                  \          
> > } while (0)  
> > 
> > We can also just remove the lockdep_assert_preemption_disabled() in
> > do_seamcall() if this is really a concern.
> 
> The concern is that a weird compiler/config generates a problem like this:
> https://lore.kernel.org/lkml/20250624101351.8019-1-kai.huang@intel.com/
> 
> Do you think it's not valid?

I absolutely do.

However WARN_ON_ONCE() in noinstr is fine since there's
instrumentation_begin()/end() inside.

> 
> > 
> > > 
> > > Am I missing something? It seems this patch is incomplete. If some of these
> > > missed SEAMCALLs don't dirty a cacheline, then the justification that it works
> > > by just covering all seamcalls needs to be updated.
> > 
> > I think we just want to treat all SEAMCALLs as if they can dirty cachelines.
> 
> Right, that was the idea. I was leaving open the option that it was on purpose
> to avoid these other problems. But, yes, let's stick with the (hopefully)
> simpler system.
> 
> > 
> > > 
> > > 
> > > Side topic. Do all the SEAMCALL wrappers calling into the seamcall_*() variety
> > > of wrappers need the entropy retry logic? 
> > > 
> > 
> > The purpose of doing it in common code is that we don't need to have
> > duplicated code to handle running out of entropy for different SEAMCALLs.
> > 
> > > I think no, and some callers actually
> > > depend on it not happening.
> > 
> > Besides TDH.VP.ENTER and TDH.PHYMEM.PAGE.RDMD, for which we know running out of
> > entropy cannot happen, I am not aware of any SEAMCALL that "depends on"
> > it not happening.  Could you elaborate?
> 
> Some SEAMCALLs are expected to succeed, like in the BUSY error code breaking
> schemes for the S-EPT ones.

If they succeed then sc_retry() will just call the SEAMCALL once.  If we
really want some SEAMCALL not to handle running out of entropy at all,
e.g., because it can be called in a critical path, then we can change it
to call do_seamcall() instead.

[*] the incremental diff:

diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index d4c624c69d7f..6865f62436ad 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -113,7 +113,7 @@ static __always_inline u64 do_seamcall(sc_func_t func, u64 fn,
 {
        u64 ret;
 
-       preempt_disable();
+       lockdep_assert_preemption_disabled();
 
        /*
         * SEAMCALLs are made to the TDX module and can generate dirty
@@ -128,8 +128,6 @@ static __always_inline u64 do_seamcall(sc_func_t func, u64 fn,
 
        ret = func(fn, args);
 
-       preempt_enable();
-
        return ret;
 }
 
@@ -140,7 +138,9 @@ static __always_inline u64 sc_retry(sc_func_t func, u64 fn,
        u64 ret;
 
        do {
+               preempt_disable();
                ret = do_seamcall(func, fn, args);
+               preempt_enable();
        } while (ret == TDX_RND_NO_ENTROPY && --retry);
 
        return ret;
diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index c7a9a087ccaf..d6ee4e5a75d2 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -1266,7 +1266,7 @@ static bool paddr_is_tdx_private(unsigned long phys)
                return false;
 
        /* Get page type from the TDX module */
-       sret = __seamcall_ret(TDH_PHYMEM_PAGE_RDMD, &args);
+       sret = do_seamcall(__seamcall_ret, TDH_PHYMEM_PAGE_RDMD, &args);
 
        /*
         * The SEAMCALL will not return success unless there is a
@@ -1522,7 +1522,7 @@ noinstr __flatten u64 tdh_vp_enter(struct tdx_vp *td, struct tdx_module_args *ar
 {
        args->rcx = tdx_tdvpr_pa(td);
 
-       return __seamcall_saved_ret(TDH_VP_ENTER, args);
+       return do_seamcall(__seamcall_saved_ret, TDH_VP_ENTER, args);
 }
 EXPORT_SYMBOL_GPL(tdh_vp_enter);



* Re: [PATCH v3 1/6] x86/sme: Use percpu boolean to control wbinvd during kexec
  2025-06-26 10:48 ` [PATCH v3 1/6] x86/sme: Use percpu boolean to control wbinvd during kexec Kai Huang
  2025-06-26 17:59   ` Edgecombe, Rick P
@ 2025-06-27 15:08   ` Tom Lendacky
  2025-06-30 11:35     ` Huang, Kai
  2025-06-28 12:50   ` Borislav Petkov
  2 siblings, 1 reply; 42+ messages in thread
From: Tom Lendacky @ 2025-06-27 15:08 UTC (permalink / raw)
  To: Kai Huang, dave.hansen, bp, tglx, peterz, mingo, hpa
  Cc: x86, kirill.shutemov, rick.p.edgecombe, linux-kernel, pbonzini,
	seanjc, kvm, reinette.chatre, isaku.yamahata, dan.j.williams,
	ashish.kalra, nik.borisov, sagis

On 6/26/25 05:48, Kai Huang wrote:
> On SME platforms, at the hardware level dirty cachelines with and without
> the encryption bit of the same memory can coexist, and the CPU can flush
> them back to memory in random order.  During kexec, the caches must be
> flushed before jumping to the new kernel to avoid silent memory
> corruption when a cacheline with a different encryption property is
> written back over whatever encryption properties the new kernel is
> using.
> 
> TDX also needs cache flush during kexec for the same reason.  It would
> be good to implement a generic way to flush cache instead of scattering
> checks for each feature all around.
> 
> During kexec, the kexec-ing CPU sends IPIs to all remote CPUs to stop
> them before it boots to the new kernel.  For SME the kernel basically
> encrypts all memory including the kernel itself by manipulating page
> tables.  A simple memory write from the kernel could dirty cachelines.
> 
> Currently, the kernel uses WBINVD to flush cache for SME during kexec in
> two places:
> 
> 1) the one in stop_this_cpu() for all remote CPUs when the kexec-ing CPU
>    stops them;
> 2) the one in the relocate_kernel() where the kexec-ing CPU jumps to the
>    new kernel.
> 
> Unlike SME, TDX can only dirty cachelines when it is used (i.e., when
> SEAMCALLs are performed).  Since there are no more SEAMCALLs after the
> aforementioned WBINVDs, leverage this for TDX.
> 
> In order to have a generic way to cover both SME and TDX, use a percpu
> boolean to indicate the cache may be in an incoherent state thus cache
> flush is needed during kexec, and turn on the boolean for SME.  TDX can
> then leverage it by also turning the boolean on.
> 
> A percpu boolean isn't strictly needed for SME since it is activated at
> very early boot time and on all CPUs.  A global flag would be
> sufficient.  But using a percpu variable has two benefits.  Foremost,
> the state that is being tracked here (percpu cache coherency situation
> requiring a flush) is percpu, so a percpu state is a more direct and
> natural fit.
> 
> Secondly, it will fit TDX's usage better since the percpu var can be set
> when a CPU makes a SEAMCALL, and cleared when another WBINVD on the CPU
> obviates the need for a kexec-time WBINVD.  Saving kexec-time WBINVD is
> valuable, because there is an existing race[*] where kexec could proceed
> while another CPU is active.  WBINVD could make this race worse, so it's
> worth skipping it when possible.
> 
> Today the first WBINVD in the stop_this_cpu() is performed when SME is
> *supported* by the platform, and the second WBINVD is done in
> relocate_kernel() when SME is *activated* by the kernel.  Make things
> simple by changing to do the second WBINVD when the platform supports
> SME.  This allows the kernel to simply turn on this percpu boolean when
> bringing up a CPU by checking whether the platform supports SME.
> 
> No other functional change intended.
> 
> Also, currently machine_check() has a comment to explain why no function
> call is allowed after load_segments().  After changing to use the new
> percpu boolean to control whether to perform WBINVD when calling the
> relocate_kernel(), that comment isn't needed anymore.  But it is still a
> useful comment, so expand the comment around load_segments() to mention
> that due to depth tracking no function call can be made after
> load_segments().
> 
> [*] The "race" in native_stop_other_cpus()
> 
> Commit
> 
>   1f5e7eb7868e: ("x86/smp: Make stop_other_cpus() more robust")
> 
> introduced a new 'cpus_stop_mask' to resolve an "intermittent lockups on
> poweroff" issue which was caused by the WBINVD in stop_this_cpu().
> Specifically, the new cpumask resolved the below problem mentioned in
> that commit:
> 
>     CPU0                                    CPU1
> 
>      stop_other_cpus()
>        send_IPIs(REBOOT);                   stop_this_cpu()
>        while (num_online_cpus() > 1);         set_online(false);
>        proceed... -> hang
>                                               wbinvd()
> 
> While it fixed the reported issue, that commit explained the new cpumask
> "cannot plug all holes either".
> 
> This is because it doesn't address the "race" mentioned in the #3 in the
> comment in native_stop_other_cpus():
> 
>         /*
>          * 1) Send an IPI on the reboot vector to all other CPUs.
>          *
>          *    The other CPUs should react on it after leaving critical
>          *    sections and re-enabling interrupts. They might still hold
>          *    locks, but there is nothing which can be done about that.
>          *
>          * 2) Wait for all other CPUs to report that they reached the
>          *    HLT loop in stop_this_cpu()
>          *
>          * 3) If #2 timed out send an NMI to the CPUs which did not
>          *    yet report
>          *
>          * 4) Wait for all other CPUs to report that they reached the
>          *    HLT loop in stop_this_cpu()
>          *
>          * #3 can obviously race against a CPU reaching the HLT loop late.
>          * That CPU will have reported already and the "have all CPUs
>          * reached HLT" condition will be true despite the fact that the
>          * other CPU is still handling the NMI. Again, there is no
>          * protection against that as "disabled" APICs still respond to
>          * NMIs.
>          */
> 
> Consider below case:
> 
>     CPU 0					CPU 1
> 
>     native_stop_other_cpus()			stop_this_cpu()
> 
> 	// sends REBOOT IPI to stop remote CPUs
> 						...
> 						wbinvd();
> 
> 	// wait timesout, try NMI
> 	if (!cpumask_empty(&cpus_stop_mask)) {
> 		for_each_cpu(cpu, &cpus_stop_mask) {
> 
> 						...
> 						cpumask_clear_cpu(cpu,
> 							&cpus_stop_mask);
> 						hlt;
> 
> 			// send NMI 	--->
> 						wakeup from hlt and run
> 						stop_this_cpu():
> 
> 			// WAIT CPUs TO STOP
> 			while (!cpumask_empty(
> 			    &cpus_stop_mask) &&
> 				...)
> 		}
> 						...
> 		proceed ...			wbinvd();
> 						...
> 						hlt;
> 
> The "WAIT CPUs TO STOP" is supposed to wait until all remote CPUs are
> stopped, but actually it quits immediately because the remote CPUs have
> been cleared in cpus_stop_mask when stop_this_cpu() is called from the
> REBOOT IPI.
> 
> Doing WBINVD in stop_this_cpu() could potentially increase the chance to
> trigger the above "race" despite it's still rare to happen.
> 
> Signed-off-by: Kai Huang <kai.huang@intel.com>

Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Tested-by: Tom Lendacky <thomas.lendacky@amd.com>

> ---
>  arch/x86/include/asm/kexec.h         |  2 +-
>  arch/x86/include/asm/processor.h     |  2 ++
>  arch/x86/kernel/cpu/amd.c            | 16 ++++++++++++++++
>  arch/x86/kernel/machine_kexec_64.c   | 15 ++++++++++-----
>  arch/x86/kernel/process.c            | 16 +++-------------
>  arch/x86/kernel/relocate_kernel_64.S | 15 +++++++++++----
>  6 files changed, 43 insertions(+), 23 deletions(-)
> 

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v3 1/6] x86/sme: Use percpu boolean to control wbinvd during kexec
  2025-06-26 10:48 ` [PATCH v3 1/6] x86/sme: Use percpu boolean to control wbinvd during kexec Kai Huang
  2025-06-26 17:59   ` Edgecombe, Rick P
  2025-06-27 15:08   ` Tom Lendacky
@ 2025-06-28 12:50   ` Borislav Petkov
  2025-06-28 17:04     ` Tom Lendacky
  2025-06-30 11:34     ` Huang, Kai
  2 siblings, 2 replies; 42+ messages in thread
From: Borislav Petkov @ 2025-06-28 12:50 UTC (permalink / raw)
  To: Kai Huang
  Cc: dave.hansen, tglx, peterz, mingo, hpa, thomas.lendacky, x86,
	kirill.shutemov, rick.p.edgecombe, linux-kernel, pbonzini, seanjc,
	kvm, reinette.chatre, isaku.yamahata, dan.j.williams,
	ashish.kalra, nik.borisov, sagis

On Thu, Jun 26, 2025 at 10:48:47PM +1200, Kai Huang wrote:

...

> Doing WBINVD in stop_this_cpu() could potentially increase the chance to
> trigger the above "race" despite it's still rare to happen.

Oh the amount of text... 

Please run it and all your comments through AI to simplify formulations etc.
It is a lot to read.

> Signed-off-by: Kai Huang <kai.huang@intel.com>
> ---
>  arch/x86/include/asm/kexec.h         |  2 +-
>  arch/x86/include/asm/processor.h     |  2 ++
>  arch/x86/kernel/cpu/amd.c            | 16 ++++++++++++++++
>  arch/x86/kernel/machine_kexec_64.c   | 15 ++++++++++-----
>  arch/x86/kernel/process.c            | 16 +++-------------
>  arch/x86/kernel/relocate_kernel_64.S | 15 +++++++++++----
>  6 files changed, 43 insertions(+), 23 deletions(-)
> 
> diff --git a/arch/x86/include/asm/kexec.h b/arch/x86/include/asm/kexec.h
> index f2ad77929d6e..d7e93522b93d 100644
> --- a/arch/x86/include/asm/kexec.h
> +++ b/arch/x86/include/asm/kexec.h
> @@ -122,7 +122,7 @@ relocate_kernel_fn(unsigned long indirection_page,
>  		   unsigned long pa_control_page,
>  		   unsigned long start_address,
>  		   unsigned int preserve_context,
> -		   unsigned int host_mem_enc_active);
> +		   unsigned int cache_incoherent);

So preserve_context and cache_incoherent are both a *single* bit of
information. And we use two u32s for that?!?!

How about flags please?

>  #endif
>  extern relocate_kernel_fn relocate_kernel;
>  #define ARCH_HAS_KIMAGE_ARCH
> diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
> index bde58f6510ac..a24c7805acdb 100644
> --- a/arch/x86/include/asm/processor.h
> +++ b/arch/x86/include/asm/processor.h
> @@ -731,6 +731,8 @@ void __noreturn stop_this_cpu(void *dummy);
>  void microcode_check(struct cpuinfo_x86 *prev_info);
>  void store_cpu_caps(struct cpuinfo_x86 *info);
>  

So much text above - not a single comment here explaining what this var is
for.

> +DECLARE_PER_CPU(bool, cache_state_incoherent);
> +
>  enum l1tf_mitigations {
>  	L1TF_MITIGATION_OFF,
>  	L1TF_MITIGATION_AUTO,
> diff --git a/arch/x86/kernel/cpu/amd.c b/arch/x86/kernel/cpu/amd.c
> index f18f540db58c..4c7fde344216 100644
> --- a/arch/x86/kernel/cpu/amd.c
> +++ b/arch/x86/kernel/cpu/amd.c
> @@ -503,6 +503,22 @@ static void early_detect_mem_encrypt(struct cpuinfo_x86 *c)
>  {
>  	u64 msr;
>  
> +	/*
> +	 * Mark using wbinvd is needed during kexec on processors that

For all text: write insns in caps pls - WBINVD.

> +	 * support SME. This provides support for performing a successful
> +	 * kexec when going from SME inactive to SME active (or vice-versa).
> +	 *
> +	 * The cache must be cleared so that if there are entries with the
> +	 * same physical address, both with and without the encryption bit,
> +	 * they don't race each other when flushed and potentially end up
> +	 * with the wrong entry being committed to memory.
> +	 *
> +	 * Test the CPUID bit directly because the machine might've cleared
> +	 * X86_FEATURE_SME due to cmdline options.

Where?

That same function does the clearing later...

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v3 1/6] x86/sme: Use percpu boolean to control wbinvd during kexec
  2025-06-28 12:50   ` Borislav Petkov
@ 2025-06-28 17:04     ` Tom Lendacky
  2025-06-30 11:34       ` Huang, Kai
  2025-06-30 11:34     ` Huang, Kai
  1 sibling, 1 reply; 42+ messages in thread
From: Tom Lendacky @ 2025-06-28 17:04 UTC (permalink / raw)
  To: Borislav Petkov, Kai Huang
  Cc: dave.hansen, tglx, peterz, mingo, hpa, x86, kirill.shutemov,
	rick.p.edgecombe, linux-kernel, pbonzini, seanjc, kvm,
	reinette.chatre, isaku.yamahata, dan.j.williams, ashish.kalra,
	nik.borisov, sagis

On 6/28/25 07:50, Borislav Petkov wrote:
> On Thu, Jun 26, 2025 at 10:48:47PM +1200, Kai Huang wrote:
> 

>> +	 * support SME. This provides support for performing a successful
>> +	 * kexec when going from SME inactive to SME active (or vice-versa).
>> +	 *
>> +	 * The cache must be cleared so that if there are entries with the
>> +	 * same physical address, both with and without the encryption bit,
>> +	 * they don't race each other when flushed and potentially end up
>> +	 * with the wrong entry being committed to memory.
>> +	 *
>> +	 * Test the CPUID bit directly because the machine might've cleared
>> +	 * X86_FEATURE_SME due to cmdline options.
> 
> Where?
> 
> That same function does the clearing later...

I think he means that if this function does clear X86_FEATURE_SME during
the BSP boot, then when the APs boot they won't see the feature set, so
you have to check the CPUID information directly. So maybe that can be better
worded.

I did verify that booting with mem_encrypt=off will start with
X86_FEATURE_SME set, the BSP will clear it and then all APs will not see
it set after that.

Thanks,
Tom

> 

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v3 1/6] x86/sme: Use percpu boolean to control wbinvd during kexec
  2025-06-27  0:30       ` Huang, Kai
@ 2025-06-30  7:09         ` Binbin Wu
  0 siblings, 0 replies; 42+ messages in thread
From: Binbin Wu @ 2025-06-30  7:09 UTC (permalink / raw)
  To: Huang, Kai
  Cc: Hansen, Dave, Edgecombe, Rick P, bp@alien8.de,
	peterz@infradead.org, hpa@zytor.com, mingo@redhat.com,
	tglx@linutronix.de, thomas.lendacky@amd.com, seanjc@google.com,
	x86@kernel.org, sagis@google.com, Chatre, Reinette,
	linux-kernel@vger.kernel.org, kirill.shutemov@linux.intel.com,
	Williams, Dan J, kvm@vger.kernel.org, pbonzini@redhat.com,
	Yamahata, Isaku, ashish.kalra@amd.com, nik.borisov@suse.com



On 6/27/2025 8:30 AM, Huang, Kai wrote:
[...]
>
> And I am not 100% sure whether this issue exists, since allowing CPU hotplug
> during kexec doesn't seem reasonable to me at least on x86, despite it is
> indeed enabled in the kernel_kexec() common code:
>
>          /*
>           * migrate_to_reboot_cpu() disables CPU hotplug assuming that
>           * no further code needs to use CPU hotplug (which is true in
>           * the reboot case). However, the kexec path depends on using
>           * CPU hotplug again; so re-enable it here.
>           */
>           cpu_hotplug_enable();
>           pr_notice("Starting new kernel\n");
>           machine_shutdown();
>
> I tried to git blame to find a clue but failed since the history is lost
> during file move/renaming etc.  I suspect it is for other ARCHs.
>
I had a check in the git history; it's related to the powerpc kexec code.

First, commit e8e5c2155b00 ("powerpc/kexec: Fix orphaned offline CPUs across kexec")
added the code to online each present CPU.

Later, commit 011e4b02f1da ("powerpc, kexec: Fix "Processor X is stuck" issue
during kexec from ST mode") added the code in the common kernel_kexec() code to
enable hotplug to fix the stuck issue.


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v3 1/6] x86/sme: Use percpu boolean to control wbinvd during kexec
  2025-06-28 17:04     ` Tom Lendacky
@ 2025-06-30 11:34       ` Huang, Kai
  0 siblings, 0 replies; 42+ messages in thread
From: Huang, Kai @ 2025-06-30 11:34 UTC (permalink / raw)
  To: thomas.lendacky@amd.com, bp@alien8.de
  Cc: kvm@vger.kernel.org, ashish.kalra@amd.com, Hansen, Dave,
	kirill.shutemov@linux.intel.com, Chatre, Reinette,
	seanjc@google.com, pbonzini@redhat.com, mingo@redhat.com,
	Yamahata, Isaku, linux-kernel@vger.kernel.org, tglx@linutronix.de,
	nik.borisov@suse.com, hpa@zytor.com, peterz@infradead.org,
	sagis@google.com, Edgecombe, Rick P, x86@kernel.org,
	Williams, Dan J

On Sat, 2025-06-28 at 12:04 -0500, Tom Lendacky wrote:
> On 6/28/25 07:50, Borislav Petkov wrote:
> > On Thu, Jun 26, 2025 at 10:48:47PM +1200, Kai Huang wrote:
> > 
> 
> > > +	 * support SME. This provides support for performing a successful
> > > +	 * kexec when going from SME inactive to SME active (or vice-versa).
> > > +	 *
> > > +	 * The cache must be cleared so that if there are entries with the
> > > +	 * same physical address, both with and without the encryption bit,
> > > +	 * they don't race each other when flushed and potentially end up
> > > +	 * with the wrong entry being committed to memory.
> > > +	 *
> > > +	 * Test the CPUID bit directly because the machine might've cleared
> > > +	 * X86_FEATURE_SME due to cmdline options.
> > 
> > Where?
> > 
> > That same function does the clearing later...
> 
> I think he means that if this function does clear X86_FEATURE_SME during
> the BSP boot, then when the APs boot they won't see the feature set, so
> you have to check the CPUID information directly. So maybe that can be better
> worded.
> 
> I did verify that booting with mem_encrypt=off will start with
> X86_FEATURE_SME set, the BSP will clear it and then all APs will not see
> it set after that.

I think I actually meant it could be cleared by the 'clearcpuid' commandline. :-)

IIUC the AP taking out feature bits that are not set on the BSP happens at the
end of identify_cpu(), so it happens after early_detect_mem_encrypt(), which
is called in init_amd().

So mem_encrypt=off will eventually clear X86_FEATURE_SME, but when
early_detect_mem_encrypt() runs for the BSP and APs, IIRC it will still see the
X86_FEATURE_SME bit.

But IIUC the 'clearcpuid' commandline could still make
early_detect_mem_encrypt() unable to see the X86_FEATURE_SME bit on an SME
capable platform.
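
For illustration only, checking the hardware capability directly (rather than
the synthetic X86_FEATURE_SME flag) would look roughly like below.  This is
just a sketch; CPUID Fn8000_001F[EAX] bit 0 is the documented SME support
bit, and cache_state_incoherent is the percpu boolean from this series:

        /* CPUID Fn8000_001F[EAX] bit 0 reports SME support */
        if (cpuid_eax(0x8000001f) & BIT(0))
                this_cpu_write(cache_state_incoherent, true);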

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v3 1/6] x86/sme: Use percpu boolean to control wbinvd during kexec
  2025-06-28 12:50   ` Borislav Petkov
  2025-06-28 17:04     ` Tom Lendacky
@ 2025-06-30 11:34     ` Huang, Kai
  2025-07-01 12:12       ` Borislav Petkov
  1 sibling, 1 reply; 42+ messages in thread
From: Huang, Kai @ 2025-06-30 11:34 UTC (permalink / raw)
  To: bp@alien8.de
  Cc: kvm@vger.kernel.org, ashish.kalra@amd.com, Hansen, Dave,
	thomas.lendacky@amd.com, kirill.shutemov@linux.intel.com,
	Chatre, Reinette, seanjc@google.com, pbonzini@redhat.com,
	mingo@redhat.com, Yamahata, Isaku, linux-kernel@vger.kernel.org,
	tglx@linutronix.de, nik.borisov@suse.com, hpa@zytor.com,
	peterz@infradead.org, sagis@google.com, Edgecombe, Rick P,
	x86@kernel.org, Williams, Dan J

On Sat, 2025-06-28 at 14:50 +0200, Borislav Petkov wrote:
> On Thu, Jun 26, 2025 at 10:48:47PM +1200, Kai Huang wrote:
> 
> ...
> 
> > Doing WBINVD in stop_this_cpu() could potentially increase the chance to
> > trigger the above "race" despite it's still rare to happen.
> 
> Oh the amount of text... 
> 
> Please run it and all your comments through AI to simplify formulations etc.
> It is a lot to read.

Hi Boris,

Thanks for the comments.

Yeah I agree the text can be improved.  I tried to run AI to simplify but so
far I am not quite happy about the result yet.  I'll try more.

> 
> > Signed-off-by: Kai Huang <kai.huang@intel.com>
> > ---
> >  arch/x86/include/asm/kexec.h         |  2 +-
> >  arch/x86/include/asm/processor.h     |  2 ++
> >  arch/x86/kernel/cpu/amd.c            | 16 ++++++++++++++++
> >  arch/x86/kernel/machine_kexec_64.c   | 15 ++++++++++-----
> >  arch/x86/kernel/process.c            | 16 +++-------------
> >  arch/x86/kernel/relocate_kernel_64.S | 15 +++++++++++----
> >  6 files changed, 43 insertions(+), 23 deletions(-)
> > 
> > diff --git a/arch/x86/include/asm/kexec.h b/arch/x86/include/asm/kexec.h
> > index f2ad77929d6e..d7e93522b93d 100644
> > --- a/arch/x86/include/asm/kexec.h
> > +++ b/arch/x86/include/asm/kexec.h
> > @@ -122,7 +122,7 @@ relocate_kernel_fn(unsigned long indirection_page,
> >  		   unsigned long pa_control_page,
> >  		   unsigned long start_address,
> >  		   unsigned int preserve_context,
> > -		   unsigned int host_mem_enc_active);
> > +		   unsigned int cache_incoherent);
> 
> So preserve_context and cache_incoherent are both a *single* bit of
> information. And we use two u32s for that?!?!
> 
> How about flags please?

Yeah I agree a single u32 + flags is better.  However this is the problem in
the existing code (this patch only does renaming).

I think I can come up with a patch to clean this up and put it as the first
patch of this series, or I can do a patch to clean this up after this series
(either together with this series, or separately at a later time).  Which
way do you prefer?


> 
> >  #endif
> >  extern relocate_kernel_fn relocate_kernel;
> >  #define ARCH_HAS_KIMAGE_ARCH
> > diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
> > index bde58f6510ac..a24c7805acdb 100644
> > --- a/arch/x86/include/asm/processor.h
> > +++ b/arch/x86/include/asm/processor.h
> > @@ -731,6 +731,8 @@ void __noreturn stop_this_cpu(void *dummy);
> >  void microcode_check(struct cpuinfo_x86 *prev_info);
> >  void store_cpu_caps(struct cpuinfo_x86 *info);
> >  
> 
> So much text above - not a single comment here explaining what this var is
> for.

Agreed a comment is needed.  How about below?

  /*
   * The cache may be in an incoherent state (e.g., due to memory 
   * encryption) and needs flushing during kexec.
   */

> 
> > +DECLARE_PER_CPU(bool, cache_state_incoherent);
> > +
> >  enum l1tf_mitigations {
> >  	L1TF_MITIGATION_OFF,
> >  	L1TF_MITIGATION_AUTO,
> > diff --git a/arch/x86/kernel/cpu/amd.c b/arch/x86/kernel/cpu/amd.c
> > index f18f540db58c..4c7fde344216 100644
> > --- a/arch/x86/kernel/cpu/amd.c
> > +++ b/arch/x86/kernel/cpu/amd.c
> > @@ -503,6 +503,22 @@ static void early_detect_mem_encrypt(struct cpuinfo_x86 *c)
> >  {
> >  	u64 msr;
> >  
> > +	/*
> > +	 * Mark using wbinvd is needed during kexec on processors that
> 
> For all text: write insns in caps pls - WBINVD.

Will do.

> 
> > +	 * support SME. This provides support for performing a successful
> > +	 * kexec when going from SME inactive to SME active (or vice-versa).
> > +	 *
> > +	 * The cache must be cleared so that if there are entries with the
> > +	 * same physical address, both with and without the encryption bit,
> > +	 * they don't race each other when flushed and potentially end up
> > +	 * with the wrong entry being committed to memory.
> > +	 *
> > +	 * Test the CPUID bit directly because the machine might've cleared
> > +	 * X86_FEATURE_SME due to cmdline options.
> 
> Where?
> 
> That same function does the clearing later...

IIUC the X86_FEATURE_SME could be cleared via 'clearcpuid' kernel cmdline.

Please also see my reply to Tom.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v3 1/6] x86/sme: Use percpu boolean to control wbinvd during kexec
  2025-06-27 15:08   ` Tom Lendacky
@ 2025-06-30 11:35     ` Huang, Kai
  0 siblings, 0 replies; 42+ messages in thread
From: Huang, Kai @ 2025-06-30 11:35 UTC (permalink / raw)
  To: thomas.lendacky@amd.com, tglx@linutronix.de, peterz@infradead.org,
	Hansen, Dave, mingo@redhat.com, bp@alien8.de, hpa@zytor.com
  Cc: Edgecombe, Rick P, seanjc@google.com, x86@kernel.org,
	sagis@google.com, Chatre, Reinette, linux-kernel@vger.kernel.org,
	kirill.shutemov@linux.intel.com, Williams, Dan J,
	kvm@vger.kernel.org, pbonzini@redhat.com, Yamahata, Isaku,
	ashish.kalra@amd.com, nik.borisov@suse.com

On Fri, 2025-06-27 at 10:08 -0500, Tom Lendacky wrote:
> > Signed-off-by: Kai Huang <kai.huang@intel.com>
> 
> Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
> Tested-by: Tom Lendacky <thomas.lendacky@amd.com>

Thanks Tom!!

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v3 3/6] x86/kexec: Disable kexec/kdump on platforms with TDX partial write erratum
  2025-06-26 10:48 ` [PATCH v3 3/6] x86/kexec: Disable kexec/kdump on platforms with TDX partial write erratum Kai Huang
  2025-06-26 18:49   ` Edgecombe, Rick P
@ 2025-07-01  5:37   ` Binbin Wu
  2025-07-02  3:12     ` Huang, Kai
  2025-07-02  8:25   ` Chao Gao
  2 siblings, 1 reply; 42+ messages in thread
From: Binbin Wu @ 2025-07-01  5:37 UTC (permalink / raw)
  To: Kai Huang, dave.hansen, bp, tglx, peterz, mingo, hpa,
	thomas.lendacky
  Cc: x86, kirill.shutemov, rick.p.edgecombe, linux-kernel, pbonzini,
	seanjc, kvm, reinette.chatre, isaku.yamahata, dan.j.williams,
	ashish.kalra, nik.borisov, sagis, Farrah Chen



On 6/26/2025 6:48 PM, Kai Huang wrote:
> Some early TDX-capable platforms have an erratum: A kernel partial
> write (a write transaction of less than cacheline lands at memory
> controller) to TDX private memory poisons that memory, and a subsequent
> read triggers a machine check.
>
> On those platforms, the old kernel must reset TDX private memory before
> jumping to the new kernel, otherwise the new kernel may see unexpected
> machine check.  Currently the kernel doesn't track which page is a TDX
> private page.  For simplicity just fail kexec/kdump for those platforms.
>
> Leverage the existing machine_kexec_prepare() to fail kexec/kdump by
> adding the check of the presence of the TDX erratum (which is only
> checked for if the kernel is built with TDX host support).  This rejects
> kexec/kdump when the kernel is loading the kexec/kdump kernel image.
>
> The alternative is to reject kexec/kdump when the kernel is jumping to
> the new kernel.  But for kexec this requires adding a new check (e.g.,
> arch_kexec_allowed()) in the common code to fail kernel_kexec() at early
> stage.  Kdump (crash_kexec()) needs similar check, but it's hard to
> justify because crash_kexec() is not supposed to abort.
>
> It's feasible to further relax this limitation, i.e., only fail kexec
> when TDX is actually enabled by the kernel.  But this is still a half
> measure compared to resetting TDX private memory so just do the simplest
> thing for now.
>
> The impact to userspace is the users will get an error when loading the
> kexec/kdump kernel image:
>
>    kexec_load failed: Operation not supported
>
> This might be confusing to the users, thus also print the reason in the
> dmesg:
>
>    [..] kexec: not allowed on platform with tdx_pw_mce bug.
>
> Signed-off-by: Kai Huang <kai.huang@intel.com>
> Tested-by: Farrah Chen <farrah.chen@intel.com>
> ---
>   arch/x86/kernel/machine_kexec_64.c | 16 ++++++++++++++++
>   1 file changed, 16 insertions(+)
>
> diff --git a/arch/x86/kernel/machine_kexec_64.c b/arch/x86/kernel/machine_kexec_64.c
> index 4519c7b75c49..d5a85d786e61 100644
> --- a/arch/x86/kernel/machine_kexec_64.c
> +++ b/arch/x86/kernel/machine_kexec_64.c
> @@ -347,6 +347,22 @@ int machine_kexec_prepare(struct kimage *image)
>   	unsigned long reloc_end = (unsigned long)__relocate_kernel_end;
>   	int result;
>   
> +	/*
> +	 * Some early TDX-capable platforms have an erratum.  A kernel
> +	 * partial write (a write transaction of less than cacheline
> +	 * lands at memory controller) to TDX private memory poisons that
> +	 * memory, and a subsequent read triggers a machine check.
> +	 *
Nit: About the description of the erratum, maybe it's better to refer to the
comments of check_tdx_erratum() to avoid duplication. Also it gives a link to
how/when the bug is set.

Otherwise,
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>


> +	 * On those platforms the old kernel must reset TDX private
> +	 * memory before jumping to the new kernel otherwise the new
> +	 * kernel may see unexpected machine check.  For simplicity
> +	 * just fail kexec/kdump on those platforms.
> +	 */
> +	if (boot_cpu_has_bug(X86_BUG_TDX_PW_MCE)) {
> +		pr_info_once("Not allowed on platform with tdx_pw_mce bug\n");
> +		return -EOPNOTSUPP;
> +	}
> +
>   	/* Setup the identity mapped 64bit page table */
>   	result = init_pgtable(image, __pa(control_page));
>   	if (result)


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v3 6/6] KVM: TDX: Explicitly do WBINVD upon reboot notifier
  2025-06-26 10:48 ` [PATCH v3 6/6] KVM: TDX: Explicitly do WBINVD upon reboot notifier Kai Huang
  2025-06-27  0:01   ` Edgecombe, Rick P
@ 2025-07-01  6:09   ` Binbin Wu
  2025-07-02  3:14     ` Huang, Kai
  2025-07-02  7:54   ` Chao Gao
  2 siblings, 1 reply; 42+ messages in thread
From: Binbin Wu @ 2025-07-01  6:09 UTC (permalink / raw)
  To: Kai Huang, dave.hansen, bp, tglx, peterz, mingo, hpa,
	thomas.lendacky
  Cc: x86, kirill.shutemov, rick.p.edgecombe, linux-kernel, pbonzini,
	seanjc, kvm, reinette.chatre, isaku.yamahata, dan.j.williams,
	ashish.kalra, nik.borisov, sagis, Farrah Chen



On 6/26/2025 6:48 PM, Kai Huang wrote:
> On TDX platforms, during kexec, the kernel needs to make sure there's no
> dirty cachelines of TDX private memory before booting to the new kernel
> to avoid silent memory corruption to the new kernel.
>
> During kexec, the kexec-ing CPU firstly invokes native_stop_other_cpus()
> to stop all remote CPUs before booting to the new kernel.  The remote
> CPUs will then execute stop_this_cpu() to stop themselves.
>
> The kernel has a percpu boolean to indicate whether the cache of a CPU
> may be in incoherent state.  In stop_this_cpu(), the kernel does WBINVD
> if that percpu boolean is true.
>
> TDX turns on that percpu boolean on a CPU when the kernel does SEAMCALL.
> This makes sure the caches will be flushed during kexec.
>
> However, the native_stop_other_cpus() and stop_this_cpu() have a "race"
> which is extremely rare to happen but could cause system to hang.
>
> Specifically, the native_stop_other_cpus() firstly sends normal reboot
> IPI to remote CPUs and wait one second for them to stop.  If that times
> out, native_stop_other_cpus() then sends NMIs to remote CPUs to stop
> them.
>
> The aforementioned race happens when NMIs are sent.  Doing WBINVD in
> stop_this_cpu() makes each CPU take longer time to stop and increases
> the chance of the race to happen.
>
> Register reboot notifier in KVM to explicitly flush caches upon
> receiving reboot notifier (e.g., during kexec) for TDX.  This moves the
> WBINVD to an earlier stage than stop_this_cpus(), avoiding a possibly
> lengthy operation at a time where it could cause this race.
Two nits below.

Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>

>
> Signed-off-by: Kai Huang <kai.huang@intel.com>
> Acked-by: Paolo Bonzini <pbonzini@redhat.com>
> Tested-by: Farrah Chen <farrah.chen@intel.com>
> ---
>
> v2 -> v3:
>   - Update changelog to address Paolo's comments and Add Paolo's Ack:
>     https://lore.kernel.org/lkml/3a7c0856-6e7b-4d3d-b966-6f17f1aca42e@redhat.com/
>
> ---
>   arch/x86/include/asm/tdx.h  |  3 +++
>   arch/x86/kvm/vmx/tdx.c      | 45 +++++++++++++++++++++++++++++++++++++
>   arch/x86/virt/vmx/tdx/tdx.c |  9 ++++++++
>   3 files changed, 57 insertions(+)
>
[...]
> +
> +static int tdx_reboot_notify(struct notifier_block *nb, unsigned long code,
> +			     void *unused)
> +{
> +	/*
> +	 * Flush cache for all CPUs upon the reboot notifier.  This
> +	 * avoids having to do WBINVD in stop_this_cpu() during kexec.
> +	 *
> +	 * Kexec calls native_stop_other_cpus() to stop remote CPUs
> +	 * before booting to new kernel, but that code has a "race"
> +	 * when the normal REBOOT IPI timesout and NMIs are sent to

timesout should be times out or timeouts?

> +	 * remote CPUs to stop them.  Doing WBINVD in stop_this_cpu()
> +	 * could potentially increase the posibility of the "race".
s/posibility/possibility

> +	 */
> +	if (code == SYS_RESTART)
> +		on_each_cpu(smp_func_cpu_flush_cache, NULL, 1);
> +	return NOTIFY_DONE;
> +}
> +
>
[...]

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v3 1/6] x86/sme: Use percpu boolean to control wbinvd during kexec
  2025-06-30 11:34     ` Huang, Kai
@ 2025-07-01 12:12       ` Borislav Petkov
  2025-07-02  3:06         ` Huang, Kai
  0 siblings, 1 reply; 42+ messages in thread
From: Borislav Petkov @ 2025-07-01 12:12 UTC (permalink / raw)
  To: Huang, Kai
  Cc: kvm@vger.kernel.org, ashish.kalra@amd.com, Hansen, Dave,
	thomas.lendacky@amd.com, kirill.shutemov@linux.intel.com,
	Chatre, Reinette, seanjc@google.com, pbonzini@redhat.com,
	mingo@redhat.com, Yamahata, Isaku, linux-kernel@vger.kernel.org,
	tglx@linutronix.de, nik.borisov@suse.com, hpa@zytor.com,
	peterz@infradead.org, sagis@google.com, Edgecombe, Rick P,
	x86@kernel.org, Williams, Dan J

On Mon, Jun 30, 2025 at 11:34:34AM +0000, Huang, Kai wrote:
> Yeah I agree the text can be improved.  I tried to run AI to simplify but so
> far I am not quite happy about the result yet.  I'll try more.

Ask it to simplify it. I get it that you want to be exhaustive in your commit
message but there really is such thing as too much text.

Think of it this way: is the text I'm writing optimal when anyone needs to
read it in the future to know why a change has been done. If not, try to make
it so.

> Yeah I agree a single u32 + flags is better.  However this is the problem in
> the existing code (this patch only does renaming).
> 
> I think I can come up with a patch to clean this up and put it as the first
> patch of this series, or I can do a patch to clean this up after this series
> (either together with this series, or separately at a later time).  Which
> way do you prefer?

Clean ups go first, so yeah, please do a cleanup pre-patch.

>   /*
>    * The cache may be in an incoherent state (e.g., due to memory 
>    * encryption) and needs flushing during kexec.
>    */

Better than nothing. I'd try to explain with 1-2 sentences what can happen due
to memory encryption and why cache invalidation is required. So that the
comment is standalone and is not sending you on a wild goose chasing ride.

> IIUC the X86_FEATURE_SME could be cleared via 'clearcpuid' kernel cmdline.
> 
> Please also see my reply to Tom.

I know but we have never said that clearcpuid= should be used in production.
If you break the kernel using it, you get to keep the pieces. clearcpuid=
taints the kernel and screams bloody murder. So I'm not too worried about
that.

What is more relevant is this:

"I did verify that booting with mem_encrypt=off will start with
X86_FEATURE_SME set, the BSP will clear it and then all APs will not see
it set after that."

which should be put there in the comment.

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v3 1/6] x86/sme: Use percpu boolean to control wbinvd during kexec
  2025-07-01 12:12       ` Borislav Petkov
@ 2025-07-02  3:06         ` Huang, Kai
  0 siblings, 0 replies; 42+ messages in thread
From: Huang, Kai @ 2025-07-02  3:06 UTC (permalink / raw)
  To: bp@alien8.de
  Cc: kvm@vger.kernel.org, ashish.kalra@amd.com, Hansen, Dave,
	thomas.lendacky@amd.com, linux-kernel@vger.kernel.org,
	seanjc@google.com, Chatre, Reinette, pbonzini@redhat.com,
	tglx@linutronix.de, Yamahata, Isaku, nik.borisov@suse.com,
	mingo@redhat.com, hpa@zytor.com, peterz@infradead.org,
	sagis@google.com, Edgecombe, Rick P,
	kirill.shutemov@linux.intel.com, x86@kernel.org, Williams, Dan J

On Tue, 2025-07-01 at 14:12 +0200, Borislav Petkov wrote:
> On Mon, Jun 30, 2025 at 11:34:34AM +0000, Huang, Kai wrote:
> > Yeah I agree the text can be improved.  I tried to run AI to simplify but so
> > far I am not quite happy about the result yet.  I'll try more.
> 
> Ask it to simplify it. I get it that you want to be exhaustive in your commit
> message but there really is such thing as too much text.
> 
> Think of it this way: is the text I'm writing optimal when anyone needs to
> read it in the future to know why a change has been done. If not, try to make
> it so.

Yeah this is a good point.  Thanks for that.

After working with AI I came up with the below [*].  Basically I:

 - Added a "TL;DR" for a quick read, and broke the text into different
sections (e.g., "Background" ..) so it can be read more easily;
 - Removed some trivial things (e.g., the paragraph explaining a comment
change around 'call depth tracking');
 - Simplified the "race" description;
 - Simplified some wording to make sentences shorter.

I would appreciate it if you could take a look and see whether it is OK.

Btw, if we remove the "race" description, it could be trimmed down to half,
but I thought it could still be valuable there just in case someone wants to
read it years later.

> 
> > Yeah I agree a single u32 + flags is better.  However this is the problem in
> > the existing code (this patch only does renaming).
> > 
> > I think I can come up with a patch to clean this up and put it as the first
> > patch of this series, or I can do a patch to clean this up after this series
> > (either together with this series, or separately at a later time).  Which
> > way do you prefer?
> 
> Clean ups go first, so yeah, please do a cleanup pre-patch.

Sure will do.  Thanks.
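
For reference, the cleanup would roughly collapse the two u32 arguments into
a single flags argument, something like below (just a sketch; the
RELOC_KERNEL_* flag names are made up here and not from the existing code):

        #define RELOC_KERNEL_PRESERVE_CONTEXT   BIT(0)
        #define RELOC_KERNEL_CACHE_INCOHERENT   BIT(1)

        relocate_kernel_fn(unsigned long indirection_page,
                           unsigned long pa_control_page,
                           unsigned long start_address,
                           unsigned int flags);

with machine_kexec() building 'flags' from image->preserve_context and the
percpu boolean before jumping to relocate_kernel().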

> 
> >   /*
> >    * The cache may be in an incoherent state (e.g., due to memory 
> >    * encryption) and needs flushing during kexec.
> >    */
> 
> Better than nothing. I'd try to explain with 1-2 sentences what can happen due
> to memory encryption and why cache invalidation is required. So that the
> comment is standalone and is not sending you on a wild goose chasing ride.

Yeah, agreed on the comment being standalone.  How about the below?

/*
 * The cache may be in an incoherent state and needs flushing during kexec.
 * E.g., on SME/TDX platforms, dirty cacheline aliases with and without
 * encryption bit(s) can coexist and the cache needs to be flushed before
 * booting to the new kernel to avoid silent memory corruption due to
 * dirty cachelines with a different encryption property being written back
 * to memory.
 */

> 
> > IIUC the X86_FEATURE_SME could be cleared via 'clearcpuid' kernel cmdline.
> > 
> > Please also see my reply to Tom.
> 
> I know but we have never said that clearcpuid= should be used in production.

Thanks for the info.

> If you break the kernel using it, you get to keep the pieces. clearcpuid=
> taints the kernel and screams bloody murder. So I'm not too worried about
> that.
> 
> What is more relevant is this:
> 
> "I did verify that booting with mem_encrypt=off will start with
> X86_FEATURE_SME set, the BSP will clear it and then all APs will not see
> it set after that."
> 
> which should be put there in the comment.

Yeah agreed.  I read the code again and yeah Tom is right :-)

I didn't want to end up rewriting the original comment, so I came up
with the below:

        /*                                              
         * Mark using WBINVD is needed during kexec on processors that     
         * support SME. This provides support for performing a successful
         * kexec when going from SME inactive to SME active (or vice-versa).
         *                                                        
         * The cache must be cleared so that if there are entries with the 
         * same physical address, both with and without the encryption bit,
         * they don't race each other when flushed and potentially end up
         * with the wrong entry being committed to memory.                 
         *                                                              
         * Test the CPUID bit directly because with mem_encrypt=off the    
         * BSP will clear the X86_FEATURE_SME bit and the APs will not     
         * see it set after that.                                      
         */

Does this look good to you?


[*] The updated changelog:

TL;DR:

Use a percpu boolean to control whether to perform WBINVD during kexec,
unifying the cache flushing needed due to memory encryption for both
SME and TDX, and switch SME to use the boolean.

--- Background ---

On SME platforms, dirty cacheline aliases with and without the encryption
bit can coexist, and the CPU can flush them back to memory in random
order.  During kexec, the caches must be flushed before jumping to the
new kernel, otherwise the dirty cachelines could silently corrupt the
memory used by the new kernel due to a different encryption property.

TDX also needs a cache flush during kexec for the same reason.  It would
be good to have a generic way to flush the cache instead of scattering
checks for each feature all around.

When SME is enabled, the kernel basically encrypts all memory including
the kernel itself and a simple memory write from the kernel could dirty
cachelines.  Currently, the kernel uses WBINVD to flush the cache for
SME during kexec in two places:

1) the one in stop_this_cpu() for all remote CPUs when the kexec-ing CPU
   stops them;
2) the one in the relocate_kernel() where the kexec-ing CPU jumps to the
   new kernel.

--- Solution ---

Unlike SME, TDX can only dirty cachelines when it is used (i.e., when
SEAMCALLs are performed).  Since there are no more SEAMCALLs after the
aforementioned WBINVDs, leverage this for TDX.

To unify the approach for SME and TDX, use a percpu boolean to indicate
the cache may be in an incoherent state and needs flushing during kexec,
and set the boolean for SME.  TDX can then leverage it.

While SME could use a global flag (since it's enabled at early boot and
enabled on all CPUs), the percpu flag fits TDX better:

The percpu flag can be set when a CPU makes a SEAMCALL, and cleared when
another WBINVD on the CPU obviates the need for a kexec-time WBINVD.
Saving kexec-time WBINVD is valuable, because there is an existing
race[*] where kexec could proceed while another CPU is active.  WBINVD
could make this race worse, so it's worth skipping it when possible.

--- Side effect to SME ---

Today the first WBINVD in the stop_this_cpu() is performed when SME is
*supported* by the platform, and the second WBINVD is done in
relocate_kernel() when SME is *activated* by the kernel.  Make things
simple by changing to do the second WBINVD when the platform supports
SME.  This allows the kernel to simply turn on this percpu boolean when
bringing up a CPU by checking whether the platform supports SME.

No other functional change intended.

--- More Read ---

[*] The "race" in native_stop_other_cpus()

Commit

  1f5e7eb7868e: ("x86/smp: Make stop_other_cpus() more robust")

introduced a new 'cpus_stop_mask' to resolve an "intermittent lockups on
poweroff" issue which was caused by the WBINVD in stop_this_cpu().
Specifically, the new cpumask resolved the below problem mentioned in
that commit:

    CPU0                                    CPU1

     stop_other_cpus()
       send_IPIs(REBOOT);                   stop_this_cpu()
       while (num_online_cpus() > 1);         set_online(false);
       proceed... -> hang
                                              wbinvd()

While it fixed the reported issue, that commit explained the new cpumask
"cannot plug all holes either", and there's a "race" could still happen:

    CPU 0                                       CPU 1

    native_stop_other_cpus()                    stop_this_cpu()

        // sends REBOOT IPI to stop remote CPUs
                                                ...
                                                wbinvd();

        // wait times out, try NMI
        if (!cpumask_empty(&cpus_stop_mask)) {
                for_each_cpu(cpu, &cpus_stop_mask) {

                                                ...
                                                cpumask_clear_cpu(cpu,
                                                        &cpus_stop_mask);
                                                hlt;

                        // send NMI     --->
                                                wakeup from hlt and run
                                                stop_this_cpu():

                        // WAIT CPUs TO STOP
                        while (!cpumask_empty(
                            &cpus_stop_mask) &&
                                ...)
                }
                                                ...
                proceed ...                     wbinvd();
                                                ...
                                                hlt;

The "WAIT CPUs TO STOP" is supposed to wait until all remote CPUs are
stopped, but actually it quits immediately because the remote CPUs have
been cleared in cpus_stop_mask when stop_this_cpu() is called from the
REBOOT IPI.

Doing WBINVD in stop_this_cpu() could potentially increase the chance of
triggering the above "race", although it is still rare.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v3 3/6] x86/kexec: Disable kexec/kdump on platforms with TDX partial write erratum
  2025-07-01  5:37   ` Binbin Wu
@ 2025-07-02  3:12     ` Huang, Kai
  0 siblings, 0 replies; 42+ messages in thread
From: Huang, Kai @ 2025-07-02  3:12 UTC (permalink / raw)
  To: Hansen, Dave, bp@alien8.de, binbin.wu@linux.intel.com,
	peterz@infradead.org, hpa@zytor.com, mingo@redhat.com,
	tglx@linutronix.de, thomas.lendacky@amd.com
  Cc: Edgecombe, Rick P, seanjc@google.com, x86@kernel.org,
	sagis@google.com, Chatre, Reinette, linux-kernel@vger.kernel.org,
	kirill.shutemov@linux.intel.com, Williams, Dan J,
	kvm@vger.kernel.org, pbonzini@redhat.com, Yamahata, Isaku,
	ashish.kalra@amd.com, Chen, Farrah, nik.borisov@suse.com


> > +	/*
> > +	 * Some early TDX-capable platforms have an erratum.  A kernel
> > +	 * partial write (a write transaction of less than cacheline
> > +	 * lands at memory controller) to TDX private memory poisons that
> > +	 * memory, and a subsequent read triggers a machine check.
> > +	 *
> Nit: About the description of the erratum, maybe it's better to refer to the
> comments of check_tdx_erratum() to avoid duplication. Also it gives a link to
> how/when the bug is set.

I am not sure pointing to a comment in another place is desirable,
especially in another file.  It could be done in some cases, but IMHO in
general it's better to have a "standalone" comment so we don't need to jump
back and forth, or worry about the other comment being updated later.

> 
> Otherwise,
> Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>

Thanks.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v3 6/6] KVM: TDX: Explicitly do WBINVD upon reboot notifier
  2025-07-01  6:09   ` Binbin Wu
@ 2025-07-02  3:14     ` Huang, Kai
  0 siblings, 0 replies; 42+ messages in thread
From: Huang, Kai @ 2025-07-02  3:14 UTC (permalink / raw)
  To: Hansen, Dave, bp@alien8.de, binbin.wu@linux.intel.com,
	peterz@infradead.org, hpa@zytor.com, mingo@redhat.com,
	tglx@linutronix.de, thomas.lendacky@amd.com
  Cc: Edgecombe, Rick P, seanjc@google.com, x86@kernel.org,
	sagis@google.com, Chatre, Reinette, linux-kernel@vger.kernel.org,
	kirill.shutemov@linux.intel.com, Williams, Dan J,
	kvm@vger.kernel.org, pbonzini@redhat.com, Yamahata, Isaku,
	ashish.kalra@amd.com, Chen, Farrah, nik.borisov@suse.com


> Two nits below.
> 
> Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
> 
> > 
> > +	/*
> > +	 * Flush cache for all CPUs upon the reboot notifier.  This
> > +	 * avoids having to do WBINVD in stop_this_cpu() during kexec.
> > +	 *
> > +	 * Kexec calls native_stop_other_cpus() to stop remote CPUs
> > +	 * before booting to new kernel, but that code has a "race"
> > +	 * when the normal REBOOT IPI timesout and NMIs are sent to
> 
> timesout should be times out or timeouts?

I will use "times out".

> 
> > +	 * remote CPUs to stop them.  Doing WBINVD in stop_this_cpu()
> > +	 * could potentially increase the posibility of the "race".
> s/posibility/possibility
> 
> 

Oops I didn't check enough :-)

Will fix and thanks.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v3 6/6] KVM: TDX: Explicitly do WBINVD upon reboot notifier
  2025-06-26 10:48 ` [PATCH v3 6/6] KVM: TDX: Explicitly do WBINVD upon reboot notifier Kai Huang
  2025-06-27  0:01   ` Edgecombe, Rick P
  2025-07-01  6:09   ` Binbin Wu
@ 2025-07-02  7:54   ` Chao Gao
  2025-07-02  9:22     ` Huang, Kai
  2025-07-07 12:37     ` Huang, Kai
  2 siblings, 2 replies; 42+ messages in thread
From: Chao Gao @ 2025-07-02  7:54 UTC (permalink / raw)
  To: Kai Huang
  Cc: dave.hansen, bp, tglx, peterz, mingo, hpa, thomas.lendacky, x86,
	kirill.shutemov, rick.p.edgecombe, linux-kernel, pbonzini, seanjc,
	kvm, reinette.chatre, isaku.yamahata, dan.j.williams,
	ashish.kalra, nik.borisov, sagis, Farrah Chen

>--- a/arch/x86/virt/vmx/tdx/tdx.c
>+++ b/arch/x86/virt/vmx/tdx/tdx.c
>@@ -1870,3 +1870,12 @@ u64 tdh_phymem_page_wbinvd_hkid(u64 hkid, struct page *page)
> 	return seamcall(TDH_PHYMEM_PAGE_WBINVD, &args);
> }
> EXPORT_SYMBOL_GPL(tdh_phymem_page_wbinvd_hkid);
>+
>+void tdx_cpu_flush_cache(void)
>+{
>+	lockdep_assert_preemption_disabled();
>+
>+	wbinvd();

Shouldn't you check the per-CPU variable first? so that WBINVD can be
skipped if there is nothing incoherent.

And reboot notifier looks the wrong place for WBINVD. Because SEAMCALLs
(see below) called after the reboot notifier will set the per-CPU variable
again. So in some cases, this patch will result in an *extra* WBINVD
instead of moving WBINVD to an earlier stage.

kernel_kexec()
  ->kernel_restart_prepare()
    ->blocking_notifier_call_chain() // reboot notifier
  ->syscore_shutdown()
    -> ...
      ->tdx_disable_virtualization_cpu()
        ->tdx_flush_vp()

>+	this_cpu_write(cache_state_incoherent, false);
>+}
>+EXPORT_SYMBOL_GPL(tdx_cpu_flush_cache);

I wonder why we don't simply perform WBINVD in
vt_disable_virtualization_cpu() after VMXOFF, i.e.,

diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index d1e02e567b57..1ad3c28b8eff 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -19,6 +19,8 @@ static void vt_disable_virtualization_cpu(void)
	if (enable_tdx)
		tdx_disable_virtualization_cpu();
	vmx_disable_virtualization_cpu();
+	/* Explain why WBINVD is needed */
+	if (enable_tdx)
+		wbinvd();
 }
 
 static __init int vt_hardware_setup(void)

It can solve the cache line aliasing problem and is much simpler than
patches 1-2 and 6.

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* Re: [PATCH v3 3/6] x86/kexec: Disable kexec/kdump on platforms with TDX partial write erratum
  2025-06-26 10:48 ` [PATCH v3 3/6] x86/kexec: Disable kexec/kdump on platforms with TDX partial write erratum Kai Huang
  2025-06-26 18:49   ` Edgecombe, Rick P
  2025-07-01  5:37   ` Binbin Wu
@ 2025-07-02  8:25   ` Chao Gao
  2025-07-02  8:43     ` Huang, Kai
  2 siblings, 1 reply; 42+ messages in thread
From: Chao Gao @ 2025-07-02  8:25 UTC (permalink / raw)
  To: Kai Huang
  Cc: dave.hansen, bp, tglx, peterz, mingo, hpa, thomas.lendacky, x86,
	kirill.shutemov, rick.p.edgecombe, linux-kernel, pbonzini, seanjc,
	kvm, reinette.chatre, isaku.yamahata, dan.j.williams,
	ashish.kalra, nik.borisov, sagis, Farrah Chen

On Thu, Jun 26, 2025 at 10:48:49PM +1200, Kai Huang wrote:
>Some early TDX-capable platforms have an erratum: A kernel partial
>write (a write transaction of less than cacheline lands at memory
>controller) to TDX private memory poisons that memory, and a subsequent
>read triggers a machine check.
>
>On those platforms, the old kernel must reset TDX private memory before
>jumping to the new kernel, otherwise the new kernel may see unexpected
>machine check.  Currently the kernel doesn't track which page is a TDX
>private page.  For simplicity just fail kexec/kdump for those platforms.

My understanding is that the kdump kernel uses a small amount of memory
reserved at boot, which the crashed kernel never accesses. And the kdump
kernel reads the memory of the crashed kernel and doesn't overwrite it.
So it should be safe to allow kdump (i.e., no partial write to private
memory). Anything I missed?

(I am not asking to enable kdump in *this* series; I'm just trying to
understand the rationale behind disabling kdump)

>
>Leverage the existing machine_kexec_prepare() to fail kexec/kdump by
>adding the check of the presence of the TDX erratum (which is only
>checked for if the kernel is built with TDX host support).  This rejects
>kexec/kdump when the kernel is loading the kexec/kdump kernel image.
>
>The alternative is to reject kexec/kdump when the kernel is jumping to
>the new kernel.  But for kexec this requires adding a new check (e.g.,
>arch_kexec_allowed()) in the common code to fail kernel_kexec() at early
>stage.  Kdump (crash_kexec()) needs similar check, but it's hard to
>justify because crash_kexec() is not supposed to abort.
>
>It's feasible to further relax this limitation, i.e., only fail kexec
>when TDX is actually enabled by the kernel.  But this is still a half
>measure compared to resetting TDX private memory so just do the simplest
>thing for now.
>
>The impact to userspace is the users will get an error when loading the
>kexec/kdump kernel image:
>
>  kexec_load failed: Operation not supported
>
>This might be confusing to the users, thus also print the reason in the
>dmesg:
>
>  [..] kexec: not allowed on platform with tdx_pw_mce bug.
>
>Signed-off-by: Kai Huang <kai.huang@intel.com>
>Tested-by: Farrah Chen <farrah.chen@intel.com>

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v3 3/6] x86/kexec: Disable kexec/kdump on platforms with TDX partial write erratum
  2025-07-02  8:25   ` Chao Gao
@ 2025-07-02  8:43     ` Huang, Kai
  2025-07-02 22:16       ` Vishal Annapurve
  0 siblings, 1 reply; 42+ messages in thread
From: Huang, Kai @ 2025-07-02  8:43 UTC (permalink / raw)
  To: Gao, Chao
  Cc: kvm@vger.kernel.org, ashish.kalra@amd.com, Hansen, Dave,
	thomas.lendacky@amd.com, linux-kernel@vger.kernel.org,
	mingo@redhat.com, seanjc@google.com, pbonzini@redhat.com,
	tglx@linutronix.de, Yamahata, Isaku,
	kirill.shutemov@linux.intel.com, Chatre, Reinette,
	nik.borisov@suse.com, hpa@zytor.com, peterz@infradead.org,
	sagis@google.com, Chen, Farrah, Edgecombe, Rick P, bp@alien8.de,
	x86@kernel.org, Williams, Dan J

On Wed, 2025-07-02 at 16:25 +0800, Gao, Chao wrote:
> On Thu, Jun 26, 2025 at 10:48:49PM +1200, Kai Huang wrote:
> > Some early TDX-capable platforms have an erratum: A kernel partial
> > write (a write transaction of less than cacheline lands at memory
> > controller) to TDX private memory poisons that memory, and a subsequent
> > read triggers a machine check.
> > 
> > On those platforms, the old kernel must reset TDX private memory before
> > jumping to the new kernel, otherwise the new kernel may see unexpected
> > machine check.  Currently the kernel doesn't track which page is a TDX
> > private page.  For simplicity just fail kexec/kdump for those platforms.
> 
> My understanding is that the kdump kernel uses a small amount of memory
> reserved at boot, which the crashed kernel never accesses. And the kdump
> kernel reads the memory of the crashed kernel and doesn't overwrite it.
> So it should be safe to allow kdump (i.e., no partial write to private
> memory). Anything I missed?
> 
> (I am not asking to enable kdump in *this* series; I'm just trying to
> understand the rationale behind disabling kdump)

As you said it *should* be safe.  The kdump kernel should only read TDX
private memory but not write.  But I cannot say I am 100% sure (there are
many things involved when generating the kdump file such as memory
compression) so in internal discussion we thought we should just disable it.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v3 6/6] KVM: TDX: Explicitly do WBINVD upon reboot notifier
  2025-07-02  7:54   ` Chao Gao
@ 2025-07-02  9:22     ` Huang, Kai
  2025-07-07 12:37     ` Huang, Kai
  1 sibling, 0 replies; 42+ messages in thread
From: Huang, Kai @ 2025-07-02  9:22 UTC (permalink / raw)
  To: Gao, Chao
  Cc: kvm@vger.kernel.org, ashish.kalra@amd.com, Hansen, Dave,
	thomas.lendacky@amd.com, linux-kernel@vger.kernel.org,
	mingo@redhat.com, seanjc@google.com, pbonzini@redhat.com,
	tglx@linutronix.de, Yamahata, Isaku,
	kirill.shutemov@linux.intel.com, Chatre, Reinette,
	nik.borisov@suse.com, hpa@zytor.com, peterz@infradead.org,
	sagis@google.com, Chen, Farrah, Edgecombe, Rick P, bp@alien8.de,
	x86@kernel.org, Williams, Dan J

On Wed, 2025-07-02 at 15:54 +0800, Chao Gao wrote:
> > --- a/arch/x86/virt/vmx/tdx/tdx.c
> > +++ b/arch/x86/virt/vmx/tdx/tdx.c
> > @@ -1870,3 +1870,12 @@ u64 tdh_phymem_page_wbinvd_hkid(u64 hkid, struct page *page)
> > 	return seamcall(TDH_PHYMEM_PAGE_WBINVD, &args);
> > }
> > EXPORT_SYMBOL_GPL(tdh_phymem_page_wbinvd_hkid);
> > +
> > +void tdx_cpu_flush_cache(void)
> > +{
> > +	lockdep_assert_preemption_disabled();
> > +
> > +	wbinvd();
> 
> Shouldn't you check the per-CPU variable first? so that WBINVD can be
> skipped if there is nothing incoherent.

It is assumed the caller of this function knows that cache needs to be
flushed.  But I can do the additional check and skip the wbinvd().
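
Something roughly like below, on top of this patch (a sketch only, not the
final code):

        void tdx_cpu_flush_cache(void)
        {
                lockdep_assert_preemption_disabled();

                /* Nothing marked the cache incoherent on this CPU, skip the flush */
                if (!this_cpu_read(cache_state_incoherent))
                        return;

                wbinvd();
                this_cpu_write(cache_state_incoherent, false);
        }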

> 
> And reboot notifier looks the wrong place for WBINVD. Because SEAMCALLs
> (see below) called after the reboot notifier will set the per-CPU variable
> again. So in some cases, this patch will result in an *extra* WBINVD
> instead of moving WBINVD to an earlier stage.

I agree.  Rick and I had some discussion around this, and this patch can
still bring an optimization "in most cases", i.e., real users of kexec will
normally only do the kexec when no TD is running.

To make it complete, we should manually kill all TDs upon the reboot notifier.

I should call that out in the changelog though.

> 
> kernel_kexec()
>   ->kernel_restart_prepare()
>     ->blocking_notifier_call_chain() // reboot notifier
>   ->syscore_shutdown()
>     -> ...
>       ->tdx_disable_virtualization_cpu()
>         ->tdx_flush_vp()
> 
> > +	this_cpu_write(cache_state_incoherent, false);
> > +}
> > +EXPORT_SYMBOL_GPL(tdx_cpu_flush_cache);
> 
> I wonder why we don't simply perform WBINVD in
> vt_disable_virtualization_cpu() after VMXOFF, i.e.,
> 
> diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
> index d1e02e567b57..1ad3c28b8eff 100644
> --- a/arch/x86/kvm/vmx/main.c
> +++ b/arch/x86/kvm/vmx/main.c
> @@ -19,6 +19,8 @@ static void vt_disable_virtualization_cpu(void)
> 	if (enable_tdx)
> 		tdx_disable_virtualization_cpu();
> 	vmx_disable_virtualization_cpu();
> +	/* Explain why WBINVD is needed */
> +	if (enable_tdx)
> +		wbinvd();
>  }
> 
>  static __init int vt_hardware_setup(void)
> 
> It can solve the cache line aliasing problem and is much simpler than
> patches 1-2 and 6.

This sounds promising, but the emergency virtualization part needs similar
handling too.

Previously, the VMXOFF in the emergency virtualization path was done
explicitly in the core kernel, so this approach would have required
scattering vendor-specific checks in different places.  That wasn't nice.

Now the emergency virtualization disable itself is implemented in KVM too
(the core kernel only calls a function pointer), so I _think_ the situation
is better now if we do wbinvd() in the VMX disabling path.
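
For example, something along these lines (a sketch only; the vt_ wrapper
name below is hypothetical and the exact emergency-disable callback name
may differ from the current KVM code):

	static void vt_emergency_disable_virtualization_cpu(void)
	{
		vmx_emergency_disable_virtualization_cpu();	/* VMXOFF on this CPU */

		/* Flush potentially dirty TDX private cachelines after VMXOFF. */
		if (enable_tdx)
			wbinvd();
	}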

It will get more complicated when other kernel components (e.g. VT-d) need
to invoke SEAMCALLs, but by that time the VMX code should have been moved to
the core kernel, so doing WBINVD after VMXOFF sounds fine too.

One concern is that this approach doesn't play nicely with the pattern below:

	cpu_enable_virtualization();
	SEAMCALL();
	cpu_disable_virtualization();

but hopefully the VT-d code doesn't need to do anything like this.

The percpu boolean approach does have another advantage (although it's more
of a theoretical issue): it could also be used to cover other cases that
could leave the cache in an incoherent state:

https://lore.kernel.org/lkml/eb2e3b02-cf5e-4848-8f1d-9f3af8f9c96b@intel.com/

I'll think more on this.  I will be out for the rest of the week so I will
come back next week.

* Re: [PATCH v3 3/6] x86/kexec: Disable kexec/kdump on platforms with TDX partial write erratum
  2025-07-02  8:43     ` Huang, Kai
@ 2025-07-02 22:16       ` Vishal Annapurve
  2025-07-02 23:57         ` Edgecombe, Rick P
  0 siblings, 1 reply; 42+ messages in thread
From: Vishal Annapurve @ 2025-07-02 22:16 UTC (permalink / raw)
  To: Huang, Kai
  Cc: Gao, Chao, kvm@vger.kernel.org, ashish.kalra@amd.com,
	Hansen, Dave, thomas.lendacky@amd.com,
	linux-kernel@vger.kernel.org, mingo@redhat.com, seanjc@google.com,
	pbonzini@redhat.com, tglx@linutronix.de, Yamahata, Isaku,
	kirill.shutemov@linux.intel.com, Chatre, Reinette,
	nik.borisov@suse.com, hpa@zytor.com, peterz@infradead.org,
	sagis@google.com, Chen, Farrah, Edgecombe, Rick P, bp@alien8.de,
	x86@kernel.org, Williams, Dan J

On Wed, Jul 2, 2025 at 1:45 AM Huang, Kai <kai.huang@intel.com> wrote:
>
> On Wed, 2025-07-02 at 16:25 +0800, Gao, Chao wrote:
> > On Thu, Jun 26, 2025 at 10:48:49PM +1200, Kai Huang wrote:
> > > Some early TDX-capable platforms have an erratum: a kernel partial
> > > write (a write transaction of less than a cacheline landing at the
> > > memory controller) to TDX private memory poisons that memory, and a
> > > subsequent read triggers a machine check.
> > >
> > > On those platforms, the old kernel must reset TDX private memory before
> > > jumping to the new kernel, otherwise the new kernel may see unexpected
> > > machine checks.  Currently the kernel doesn't track which pages are TDX
> > > private pages.  For simplicity, just fail kexec/kdump on those platforms.
> >
> > My understanding is that the kdump kernel uses a small amount of memory
> > reserved at boot, which the crashed kernel never accesses. And the kdump
> > kernel reads the memory of the crashed kernel and doesn't overwrite it.
> > So it should be safe to allow kdump (i.e., no partial write to private
> > memory). Anything I missed?
> >
> > (I am not asking to enable kdump in *this* series; I'm just trying to
> > understand the rationale behind disabling kdump)
>
> As you said, it *should* be safe.  The kdump kernel should only read TDX
> private memory, not write to it.  But I cannot say I am 100% sure (there are
> many things involved in generating the kdump file, such as memory
> compression), so in internal discussion we decided to just disable it.

So what's the side effect of enabling kdump? In the worst case, the kdump
kernel crashes; in the most likely scenario, kdump will generate a lot of
important data for analyzing the host failure.

Allowing kdump seems to be a net positive outcome to me. Am I missing
something? If not, my vote would be to enable/allow kdump for such
platforms in this series itself.


* Re: [PATCH v3 3/6] x86/kexec: Disable kexec/kdump on platforms with TDX partial write erratum
  2025-07-02 22:16       ` Vishal Annapurve
@ 2025-07-02 23:57         ` Edgecombe, Rick P
  0 siblings, 0 replies; 42+ messages in thread
From: Edgecombe, Rick P @ 2025-07-02 23:57 UTC (permalink / raw)
  To: Annapurve, Vishal, Huang, Kai
  Cc: kvm@vger.kernel.org, ashish.kalra@amd.com, Hansen, Dave,
	thomas.lendacky@amd.com, kirill.shutemov@linux.intel.com,
	seanjc@google.com, Chatre, Reinette, pbonzini@redhat.com,
	mingo@redhat.com, Yamahata, Isaku, nik.borisov@suse.com,
	tglx@linutronix.de, hpa@zytor.com, peterz@infradead.org,
	sagis@google.com, Chen, Farrah, Gao, Chao,
	linux-kernel@vger.kernel.org, bp@alien8.de, x86@kernel.org,
	Williams, Dan J

On Wed, 2025-07-02 at 15:16 -0700, Vishal Annapurve wrote:
> > As you said, it *should* be safe.  The kdump kernel should only read TDX
> > private memory, not write to it.  But I cannot say I am 100% sure (there are
> > many things involved in generating the kdump file, such as memory
> > compression), so in internal discussion we decided to just disable it.
> 
> So what's the side effect of enabling kdump? In the worst case, the kdump
> kernel crashes; in the most likely scenario, kdump will generate a lot of
> important data for analyzing the host failure.
> 
> Allowing kdump seems to be a net positive outcome to me. Am I missing
> something? If not, my vote would be to enable/allow kdump for such
> platforms in this series itself.

This reasoning makes sense. But today there is no way to even configure kexec
when TDX is configured, which blocks TDX for distro-based hosts. Kdump can
always be expanded in a follow-up. The series has been tricky, so it's nice
not to have to tackle all the angles before getting at least some support back.


* Re: [PATCH v3 6/6] KVM: TDX: Explicitly do WBINVD upon reboot notifier
  2025-07-02  7:54   ` Chao Gao
  2025-07-02  9:22     ` Huang, Kai
@ 2025-07-07 12:37     ` Huang, Kai
  1 sibling, 0 replies; 42+ messages in thread
From: Huang, Kai @ 2025-07-07 12:37 UTC (permalink / raw)
  To: Gao, Chao
  Cc: kvm@vger.kernel.org, ashish.kalra@amd.com, Hansen, Dave,
	thomas.lendacky@amd.com, linux-kernel@vger.kernel.org,
	mingo@redhat.com, seanjc@google.com, pbonzini@redhat.com,
	tglx@linutronix.de, Yamahata, Isaku,
	kirill.shutemov@linux.intel.com, Chatre, Reinette,
	nik.borisov@suse.com, hpa@zytor.com, peterz@infradead.org,
	sagis@google.com, Chen, Farrah, Edgecombe, Rick P, bp@alien8.de,
	x86@kernel.org, Williams, Dan J

On Wed, 2025-07-02 at 15:54 +0800, Chao Gao wrote:
> I wonder why we don't simply perform WBINVD in
> vt_disable_virtualization_cpu() after VMXOFF, i.e.,
> 
> diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
> index d1e02e567b57..1ad3c28b8eff 100644
> --- a/arch/x86/kvm/vmx/main.c
> +++ b/arch/x86/kvm/vmx/main.c
> @@ -19,6 +19,8 @@ static void vt_disable_virtualization_cpu(void)
> 	if (enable_tdx)
> 		tdx_disable_virtualization_cpu();
> 	vmx_disable_virtualization_cpu();
> +	/* Explain why WBINVD is needed */
> +	if (enable_tdx)
> +		wbinvd();
>  }
>  
>  static __init int vt_hardware_setup(void)
> 
> It can solve the cache line aliasing problem and is much simpler than
> patches 1-2 and 6.

After more thinking, I think the percpu boolean doesn't conflict with what
you suggested above.  As Dave mentioned here[*], it ensures the kexec-time
wbinvd() still catches things if something was screwed up at an earlier
stage:

  ...
  hopefully at point after you're sure no more TDCALLs are being made. If
  you screw it up, no biggie: the kexec-time one will make up for it,
  exposing TDX systems to the kexec timing bugs. But if the on_each_cpu()
  thing works in the common case, you get no additional bug exposure.

Btw, the reason I wanted to do wbinvd() in the reboot notifier is that it is
also a good place to reset the TDX private memory managed by KVM on "partial
write machine check" erratum platforms (which isn't included in this patch
though).  We need to flush the cache before resetting TDX private memory.

So while doing wbinvd() after vmx_disable_virtualization_cpu() sounds
promising for the cache flush, it is not ideal for handling the erratum.
Using the reboot notifier keeps such code self-contained in KVM's tdx.c.

[*]:
https://lore.kernel.org/lkml/31e17bc8-2e9e-4e93-a912-3d54826e59d0@intel.com/
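
For reference, a rough sketch of that shape (not the actual patch; the
notifier wiring and tdx_flush_cache_on_cpu() below are illustrative only,
and the erratum handling is left as a comment):

	static void tdx_flush_cache_on_cpu(void *unused)
	{
		tdx_cpu_flush_cache();
	}

	static int tdx_reboot_notify(struct notifier_block *nb,
				     unsigned long code, void *unused)
	{
		/* Flush on every online CPU before the new kernel takes over. */
		on_each_cpu(tdx_flush_cache_on_cpu, NULL, 1);

		/*
		 * On "partial write machine check" erratum platforms, the TDX
		 * private pages managed by KVM could also be reset here.
		 */
		return NOTIFY_OK;
	}

	static struct notifier_block tdx_reboot_nb = {
		.notifier_call = tdx_reboot_notify,
	};

	/* Registered during init, e.g. register_reboot_notifier(&tdx_reboot_nb). */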


End of thread.

Thread overview: 42+ messages
2025-06-26 10:48 [PATCH v3 0/6] TDX host: kexec/kdump support Kai Huang
2025-06-26 10:48 ` [PATCH v3 1/6] x86/sme: Use percpu boolean to control wbinvd during kexec Kai Huang
2025-06-26 17:59   ` Edgecombe, Rick P
2025-06-26 18:42     ` Edgecombe, Rick P
2025-06-27  0:30       ` Huang, Kai
2025-06-30  7:09         ` Binbin Wu
2025-06-27  0:37     ` Huang, Kai
2025-06-27  0:39       ` Edgecombe, Rick P
2025-06-27  1:06         ` Huang, Kai
2025-06-27 15:08   ` Tom Lendacky
2025-06-30 11:35     ` Huang, Kai
2025-06-28 12:50   ` Borislav Petkov
2025-06-28 17:04     ` Tom Lendacky
2025-06-30 11:34       ` Huang, Kai
2025-06-30 11:34     ` Huang, Kai
2025-07-01 12:12       ` Borislav Petkov
2025-07-02  3:06         ` Huang, Kai
2025-06-26 10:48 ` [PATCH v3 2/6] x86/virt/tdx: Mark memory cache state incoherent when making SEAMCALL Kai Huang
2025-06-26 18:37   ` Edgecombe, Rick P
2025-06-26 23:36     ` Huang, Kai
2025-06-27  0:52       ` Edgecombe, Rick P
2025-06-27  1:47         ` Huang, Kai
2025-06-26 10:48 ` [PATCH v3 3/6] x86/kexec: Disable kexec/kdump on platforms with TDX partial write erratum Kai Huang
2025-06-26 18:49   ` Edgecombe, Rick P
2025-07-01  5:37   ` Binbin Wu
2025-07-02  3:12     ` Huang, Kai
2025-07-02  8:25   ` Chao Gao
2025-07-02  8:43     ` Huang, Kai
2025-07-02 22:16       ` Vishal Annapurve
2025-07-02 23:57         ` Edgecombe, Rick P
2025-06-26 10:48 ` [PATCH v3 4/6] x86/virt/tdx: Remove the !KEXEC_CORE dependency Kai Huang
2025-06-26 18:49   ` Edgecombe, Rick P
2025-06-26 10:48 ` [PATCH v3 5/6] x86/virt/tdx: Update the kexec section in the TDX documentation Kai Huang
2025-06-26 18:51   ` Edgecombe, Rick P
2025-06-26 10:48 ` [PATCH v3 6/6] KVM: TDX: Explicitly do WBINVD upon reboot notifier Kai Huang
2025-06-27  0:01   ` Edgecombe, Rick P
2025-06-27  1:00     ` Huang, Kai
2025-07-01  6:09   ` Binbin Wu
2025-07-02  3:14     ` Huang, Kai
2025-07-02  7:54   ` Chao Gao
2025-07-02  9:22     ` Huang, Kai
2025-07-07 12:37     ` Huang, Kai
