kvm.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v4 0/7] TDX host: kexec/kdump support
@ 2025-07-17 21:46 Kai Huang
  2025-07-17 21:46 ` [PATCH v4 1/7] x86/kexec: Consolidate relocate_kernel() function parameters Kai Huang
                   ` (8 more replies)
  0 siblings, 9 replies; 28+ messages in thread
From: Kai Huang @ 2025-07-17 21:46 UTC (permalink / raw)
  To: dave.hansen, bp, tglx, peterz, mingo, hpa, thomas.lendacky
  Cc: x86, kas, rick.p.edgecombe, dwmw, linux-kernel, pbonzini, seanjc,
	kvm, reinette.chatre, isaku.yamahata, dan.j.williams,
	ashish.kalra, nik.borisov, chao.gao, sagis

This series is the latest attempt to support kexec on TDX host following
Dave's suggestion to use a percpu boolean to control WBINVD during
kexec.

Hi Boris/Tom,

As requested, I added the first patch to cleanup the last two 'unsigned
int' parameters of the relocate_kernel() into one 'unsigned int' and pass
flags instead.  The patch 2 (patch 1 in v3) also gets updated based on
that.  Would you help to review?  Thanks.

I tested that both normal kexec and preserve_context kexec works (using
the tools/testing/selftests/kexec/test_kexec_jump.sh).  But I don't have
SME capable machine to test.

Hi Tom, I added your Reviewed-by and Tested-by in the patch 2 anyway
since I believe the change is trivial and straightforward).  But due to
the cleanup patch, I appreciate if you can help to test the first two
patches again.  Thanks a lot!

v3 -> v4:
 - Rebase to latest tip/master.
 - Add a cleanup patch to consolidate relocate_kernel()'s last two
   function parameters -- Boris.
 - Address comments received -- please see individual patches.
 - Collect tags (Tom, Rick, binbin).

 v3: https://lore.kernel.org/kvm/cover.1750934177.git.kai.huang@intel.com/

v2 -> v3 (all trivial changes):

 - Rebase on latest tip/master
   - change to use __always_inline for do_seamcall() in patch 2
 - Update patch 2 (changelog and code comment) to remove the sentence
   which says "not all SEAMCALLs generate dirty cachelines of TDX
   private memory but just treat all of them do."  -- Dave.
 - Add Farrah's Tested-by for all TDX patches.

The v2 had one informal RFC patch appended to show "some optimization"
which can move WBINVD from the kexec phase to an early stage in KVM.
Paolo commented and Acked that patch (thanks!), so this v3 made that
patch as a formal one (patch 6).  But technically it is not absolutely
needed in this series but can be done in the future.

More history info can be found in v2:

 https://lore.kernel.org/lkml/cover.1746874095.git.kai.huang@intel.com/

=== More information ===

TDX private memory is memory that is encrypted with private Host Key IDs
(HKID).  If the kernel has ever enabled TDX, part of system memory
remains TDX private memory when kexec happens.  E.g., the PAMT (Physical
Address Metadata Table) pages used by the TDX module to track each TDX
memory page's state are never freed once the TDX module is initialized.
TDX guests also have guest private memory and secure-EPT pages.

After kexec, the new kernel will have no knowledge of which memory page
was used as TDX private page and can use all memory as regular memory.

1) Cache flush

Per TDX 1.5 base spec "8.6.1.Platforms not Using ACT: Required Cache
Flush and Initialization by the Host VMM", to support kexec for TDX, the
kernel needs to flush cache to make sure there's no dirty cachelines of
TDX private memory left over to the new kernel (when the TDX module
reports TDX_FEATURES.CLFLUSH_BEFORE_ALLOC as 1 in the global metadata for
the platform).  The kernel also needs to make sure there's no more TDX
activity (no SEAMCALL) after cache flush so that no new dirty cachelines
of TDX private memory are generated.

SME has similar requirement.  SME kexec support uses WBINVD to do the
cache flush.  WBINVD is able to flush cachelines associated with any
HKID.  Reuse the WBINVD introduced by SME to flush cache for TDX.

Currently the kernel explicitly checks whether the hardware supports SME
and only does WBINVD if true.  Instead of adding yet another TDX
specific check, this series uses a percpu boolean to indicate whether
WBINVD is needed on that CPU during kexec.

2) Reset TDX private memory using MOVDIR64B

The TDX spec (the aforementioned section) also suggests the kernel
*should* use MOVDIR64B to clear TDX private page before the kernel
reuses it as regular one.

However, in reality the situation can be more flexible.  Per TDX 1.5
base spec ("Table 16.2: Non-ACT Platforms Checks on Memory Reads in Ci
Mode" and "Table 16.3: Non-ACT Platforms Checks on Memory Reads in Li
Mode"), the read/write to TDX private memory using shared KeyID without
integrity check enabled will not poison the memory and cause machine
check.

Note on the platforms with ACT (Access Control Table), there's no
integrity check involved thus no machine check is possible to happen due
to memory read/write using different KeyIDs.

KeyID 0 (TME key) doesn't support integrity check.  This series chooses
to NOT reset TDX private memory but leave TDX private memory as-is to the
new kernel.  As mentioned above, in practice it is safe to do so.

3) One limitation

If the kernel has ever enabled TDX, after kexec the new kernel won't be
able to use TDX anymore.  This is because when the new kernel tries to
initialize TDX module it will fail on the first SEAMCALL due to the
module has already been initialized by the old kernel.

More (non-trivial) work will be needed for the new kernel to use TDX,
e.g., one solution is to just reload the TDX module from the location
where BIOS loads the TDX module (/boot/efi/EFI/TDX/).  This series
doesn't cover this, but leave this as future work.

4) Kdump support

This series also enables kdump with TDX, but no special handling is
needed for crash kexec (except turning on the Kconfig option):

 - kdump kernel uses reserved memory from the old kernel as system ram,
   and the old kernel will never use the reserved memory as TDX memory.
 - /proc/vmcore contains TDX private memory pages.  It's meaningless to
   read them, but it doesn't do any harm either.

5) TDX "partial write machine check" erratum

On the platform with TDX erratum, a partial write (a write transaction
of less than a cacheline lands at memory controller) to TDX private
memory poisons that memory, and a subsequent read triggers machine
check.  On those platforms, the kernel needs to reset TDX private memory
before jumping to the new kernel otherwise the new kernel may see
unexpected machine check.

The kernel currently doesn't track which page is TDX private memory.
It's not trivial to reset TDX private memory.  For simplicity, this
series simply disables kexec/kdump for such platforms.  This can be
enhanced in the future.



Kai Huang (7):
  x86/kexec: Consolidate relocate_kernel() function parameters
  x86/sme: Use percpu boolean to control WBINVD during kexec
  x86/virt/tdx: Mark memory cache state incoherent when making SEAMCALL
  x86/kexec: Disable kexec/kdump on platforms with TDX partial write
    erratum
  x86/virt/tdx: Remove the !KEXEC_CORE dependency
  x86/virt/tdx: Update the kexec section in the TDX documentation
  KVM: TDX: Explicitly do WBINVD when no more TDX SEAMCALLs

 Documentation/arch/x86/tdx.rst       | 14 ++++-----
 arch/x86/Kconfig                     |  1 -
 arch/x86/include/asm/kexec.h         | 12 ++++++--
 arch/x86/include/asm/processor.h     |  2 ++
 arch/x86/include/asm/tdx.h           | 31 +++++++++++++++++++-
 arch/x86/kernel/cpu/amd.c            | 17 +++++++++++
 arch/x86/kernel/machine_kexec_64.c   | 43 ++++++++++++++++++++++------
 arch/x86/kernel/process.c            | 24 +++++++---------
 arch/x86/kernel/relocate_kernel_64.S | 30 +++++++++++--------
 arch/x86/kvm/vmx/tdx.c               | 12 ++++++++
 arch/x86/virt/vmx/tdx/tdx.c          | 16 +++++++++--
 11 files changed, 155 insertions(+), 47 deletions(-)


base-commit: e180b3a224cb519388c2f61ca7bc1eaf94cec1fb
-- 
2.50.0


^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH v4 1/7] x86/kexec: Consolidate relocate_kernel() function parameters
  2025-07-17 21:46 [PATCH v4 0/7] TDX host: kexec/kdump support Kai Huang
@ 2025-07-17 21:46 ` Kai Huang
  2025-07-21 14:40   ` Tom Lendacky
  2025-07-17 21:46 ` [PATCH v4 2/7] x86/sme: Use percpu boolean to control WBINVD during kexec Kai Huang
                   ` (7 subsequent siblings)
  8 siblings, 1 reply; 28+ messages in thread
From: Kai Huang @ 2025-07-17 21:46 UTC (permalink / raw)
  To: dave.hansen, bp, tglx, peterz, mingo, hpa, thomas.lendacky
  Cc: x86, kas, rick.p.edgecombe, dwmw, linux-kernel, pbonzini, seanjc,
	kvm, reinette.chatre, isaku.yamahata, dan.j.williams,
	ashish.kalra, nik.borisov, chao.gao, sagis

During kexec, the kernel jumps to the new kernel in relocate_kernel(),
which is implemented in assembly and both 32-bit and 64-bit have their
own version.

Currently, for both 32-bit and 64-bit, the last two parameters of the
relocate_kernel() are both 'unsigned int' but actually they only convey
a boolean, i.e., one bit information.  The 'unsigned int' has enough
space to carry two bits information therefore there's no need to pass
the two booleans in two separate 'unsigned int'.

Consolidate the last two function parameters of relocate_kernel() into a
single 'unsigned int' and pass flags instead.

Only consolidate the 64-bit version albeit the similar optimization can
be done for the 32-bit version too.  Don't bother changing the 32-bit
version while it is working (since assembly code change is required).

Signed-off-by: Kai Huang <kai.huang@intel.com>
---
 arch/x86/include/asm/kexec.h         | 12 ++++++++++--
 arch/x86/kernel/machine_kexec_64.c   | 22 +++++++++++++---------
 arch/x86/kernel/relocate_kernel_64.S | 19 +++++++++----------
 3 files changed, 32 insertions(+), 21 deletions(-)

diff --git a/arch/x86/include/asm/kexec.h b/arch/x86/include/asm/kexec.h
index f2ad77929d6e..5f09791dc4e9 100644
--- a/arch/x86/include/asm/kexec.h
+++ b/arch/x86/include/asm/kexec.h
@@ -13,6 +13,15 @@
 # define KEXEC_DEBUG_EXC_HANDLER_SIZE	6 /* PUSHI, PUSHI, 2-byte JMP */
 #endif
 
+#ifdef CONFIG_X86_64
+
+#include <linux/bits.h>
+
+#define RELOC_KERNEL_PRESERVE_CONTEXT	BIT(0)
+#define RELOC_KERNEL_HOST_MEM_ACTIVE	BIT(1)
+
+#endif
+
 # define KEXEC_CONTROL_PAGE_SIZE	4096
 # define KEXEC_CONTROL_CODE_MAX_SIZE	2048
 
@@ -121,8 +130,7 @@ typedef unsigned long
 relocate_kernel_fn(unsigned long indirection_page,
 		   unsigned long pa_control_page,
 		   unsigned long start_address,
-		   unsigned int preserve_context,
-		   unsigned int host_mem_enc_active);
+		   unsigned int flags);
 #endif
 extern relocate_kernel_fn relocate_kernel;
 #define ARCH_HAS_KIMAGE_ARCH
diff --git a/arch/x86/kernel/machine_kexec_64.c b/arch/x86/kernel/machine_kexec_64.c
index 697fb99406e6..25cff38f5e60 100644
--- a/arch/x86/kernel/machine_kexec_64.c
+++ b/arch/x86/kernel/machine_kexec_64.c
@@ -384,16 +384,10 @@ void __nocfi machine_kexec(struct kimage *image)
 {
 	unsigned long reloc_start = (unsigned long)__relocate_kernel_start;
 	relocate_kernel_fn *relocate_kernel_ptr;
-	unsigned int host_mem_enc_active;
+	unsigned int relocate_kernel_flags;
 	int save_ftrace_enabled;
 	void *control_page;
 
-	/*
-	 * This must be done before load_segments() since if call depth tracking
-	 * is used then GS must be valid to make any function calls.
-	 */
-	host_mem_enc_active = cc_platform_has(CC_ATTR_HOST_MEM_ENCRYPT);
-
 #ifdef CONFIG_KEXEC_JUMP
 	if (image->preserve_context)
 		save_processor_state();
@@ -427,6 +421,17 @@ void __nocfi machine_kexec(struct kimage *image)
 	 */
 	relocate_kernel_ptr = control_page + (unsigned long)relocate_kernel - reloc_start;
 
+	relocate_kernel_flags = 0;
+	if (image->preserve_context)
+		relocate_kernel_flags |= RELOC_KERNEL_PRESERVE_CONTEXT;
+
+	/*
+	 * This must be done before load_segments() since if call depth tracking
+	 * is used then GS must be valid to make any function calls.
+	 */
+	if (cc_platform_has(CC_ATTR_HOST_MEM_ENCRYPT))
+		relocate_kernel_flags |= RELOC_KERNEL_HOST_MEM_ACTIVE;
+
 	/*
 	 * The segment registers are funny things, they have both a
 	 * visible and an invisible part.  Whenever the visible part is
@@ -443,8 +448,7 @@ void __nocfi machine_kexec(struct kimage *image)
 	image->start = relocate_kernel_ptr((unsigned long)image->head,
 					   virt_to_phys(control_page),
 					   image->start,
-					   image->preserve_context,
-					   host_mem_enc_active);
+					   relocate_kernel_flags);
 
 #ifdef CONFIG_KEXEC_JUMP
 	if (image->preserve_context)
diff --git a/arch/x86/kernel/relocate_kernel_64.S b/arch/x86/kernel/relocate_kernel_64.S
index ea604f4d0b52..1dfa323b33d5 100644
--- a/arch/x86/kernel/relocate_kernel_64.S
+++ b/arch/x86/kernel/relocate_kernel_64.S
@@ -66,8 +66,7 @@ SYM_CODE_START_NOALIGN(relocate_kernel)
 	 * %rdi indirection_page
 	 * %rsi pa_control_page
 	 * %rdx start address
-	 * %rcx preserve_context
-	 * %r8  host_mem_enc_active
+	 * %rcx flags: RELOC_KERNEL_*
 	 */
 
 	/* Save the CPU context, used for jumping back */
@@ -111,7 +110,7 @@ SYM_CODE_START_NOALIGN(relocate_kernel)
 	/* save indirection list for jumping back */
 	movq	%rdi, pa_backup_pages_map(%rip)
 
-	/* Save the preserve_context to %r11 as swap_pages clobbers %rcx. */
+	/* Save the flags to %r11 as swap_pages clobbers %rcx. */
 	movq	%rcx, %r11
 
 	/* setup a new stack at the end of the physical control page */
@@ -129,9 +128,8 @@ SYM_CODE_START_LOCAL_NOALIGN(identity_mapped)
 	/*
 	 * %rdi	indirection page
 	 * %rdx start address
-	 * %r8 host_mem_enc_active
 	 * %r9 page table page
-	 * %r11 preserve_context
+	 * %r11 flags: RELOC_KERNEL_*
 	 * %r13 original CR4 when relocate_kernel() was invoked
 	 */
 
@@ -204,7 +202,7 @@ SYM_CODE_START_LOCAL_NOALIGN(identity_mapped)
 	 * entries that will conflict with the now unencrypted memory
 	 * used by kexec. Flush the caches before copying the kernel.
 	 */
-	testq	%r8, %r8
+	testq	$RELOC_KERNEL_HOST_MEM_ACTIVE, %r11
 	jz .Lsme_off
 	wbinvd
 .Lsme_off:
@@ -220,7 +218,7 @@ SYM_CODE_START_LOCAL_NOALIGN(identity_mapped)
 	movq	%cr3, %rax
 	movq	%rax, %cr3
 
-	testq	%r11, %r11	/* preserve_context */
+	testq	$RELOC_KERNEL_PRESERVE_CONTEXT, %r11
 	jnz .Lrelocate
 
 	/*
@@ -273,7 +271,7 @@ SYM_CODE_START_LOCAL_NOALIGN(identity_mapped)
 	ANNOTATE_NOENDBR
 	andq	$PAGE_MASK, %r8
 	lea	PAGE_SIZE(%r8), %rsp
-	movl	$1, %r11d	/* Ensure preserve_context flag is set */
+	movl	$RELOC_KERNEL_PRESERVE_CONTEXT, %r11d	/* Ensure preserve_context flag is set */
 	call	swap_pages
 	movq	kexec_va_control_page(%rip), %rax
 0:	addq	$virtual_mapped - 0b, %rax
@@ -321,7 +319,7 @@ SYM_CODE_START_LOCAL_NOALIGN(swap_pages)
 	UNWIND_HINT_END_OF_STACK
 	/*
 	 * %rdi indirection page
-	 * %r11 preserve_context
+	 * %r11 flags: RELOC_KERNEL_*
 	 */
 	movq	%rdi, %rcx	/* Put the indirection_page in %rcx */
 	xorl	%edi, %edi
@@ -357,7 +355,8 @@ SYM_CODE_START_LOCAL_NOALIGN(swap_pages)
 	movq	%rdi, %rdx    /* Save destination page to %rdx */
 	movq	%rsi, %rax    /* Save source page to %rax */
 
-	testq	%r11, %r11    /* Only actually swap for ::preserve_context */
+	/* Only actually swap for ::preserve_context */
+	testq	$RELOC_KERNEL_PRESERVE_CONTEXT, %r11
 	jz	.Lnoswap
 
 	/* copy source page to swap page */
-- 
2.50.0


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH v4 2/7] x86/sme: Use percpu boolean to control WBINVD during kexec
  2025-07-17 21:46 [PATCH v4 0/7] TDX host: kexec/kdump support Kai Huang
  2025-07-17 21:46 ` [PATCH v4 1/7] x86/kexec: Consolidate relocate_kernel() function parameters Kai Huang
@ 2025-07-17 21:46 ` Kai Huang
  2025-07-17 21:46 ` [PATCH v4 3/7] x86/virt/tdx: Mark memory cache state incoherent when making SEAMCALL Kai Huang
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 28+ messages in thread
From: Kai Huang @ 2025-07-17 21:46 UTC (permalink / raw)
  To: dave.hansen, bp, tglx, peterz, mingo, hpa, thomas.lendacky
  Cc: x86, kas, rick.p.edgecombe, dwmw, linux-kernel, pbonzini, seanjc,
	kvm, reinette.chatre, isaku.yamahata, dan.j.williams,
	ashish.kalra, nik.borisov, chao.gao, sagis

TL;DR:

Prepare to unify how TDX and SME do cache flushing during kexec by
making a percpu boolean control whether to do the WBINVD.

-- Background --

On SME platforms, dirty cacheline aliases with and without encryption
bit can coexist, and the CPU can flush them back to memory in random
order.  During kexec, the caches must be flushed before jumping to the
new kernel otherwise the dirty cachelines could silently corrupt the
memory used by the new kernel due to different encryption property.

TDX also needs a cache flush during kexec for the same reason.  It would
be good to have a generic way to flush the cache instead of scattering
checks for each feature all around.

When SME is enabled, the kernel basically encrypts all memory including
the kernel itself and a simple memory write from the kernel could dirty
cachelines.  Currently, the kernel uses WBINVD to flush the cache for
SME during kexec in two places:

1) the one in stop_this_cpu() for all remote CPUs when the kexec-ing CPU
   stops them;
2) the one in the relocate_kernel() where the kexec-ing CPU jumps to the
   new kernel.

-- Solution --

Unlike SME, TDX can only dirty cachelines when it is used (i.e., when
SEAMCALLs are performed).  Since there are no more SEAMCALLs after the
aforementioned WBINVDs, leverage this for TDX.

To unify the approach for SME and TDX, use a percpu boolean to indicate
the cache may be in an incoherent state and needs flushing during kexec,
and set the boolean for SME.  TDX can then leverage it.

While SME could use a global flag (since it's enabled at early boot and
enabled on all CPUs), the percpu flag fits TDX better:

The percpu flag can be set when a CPU makes a SEAMCALL, and cleared when
another WBINVD on the CPU obviates the need for a kexec-time WBINVD.
Saving kexec-time WBINVD is valuable, because there is an existing
race[*] where kexec could proceed while another CPU is active.  WBINVD
could make this race worse, so it's worth skipping it when possible.

-- Side effect to SME --

Today the first WBINVD in the stop_this_cpu() is performed when SME is
*supported* by the platform, and the second WBINVD is done in
relocate_kernel() when SME is *activated* by the kernel.  Make things
simple by changing to do the second WBINVD when the platform supports
SME.  This allows the kernel to simply turn on this percpu boolean when
bringing up a CPU by checking whether the platform supports SME.

No other functional change intended.

[*] The aforementioned race:

During kexec native_stop_other_cpus() is called to stop all remote CPUs
before jumping to the new kernel.  native_stop_other_cpus() firstly
sends normal REBOOT vector IPIs to stop remote CPUs and waits them to
stop.  If that times out, it sends NMI to stop the CPUs that are still
alive.  The race happens when native_stop_other_cpus() has to send NMIs
and could potentially result in the system hang (for more information
please see [1]).

Link: https://lore.kernel.org/kvm/b963fcd60abe26c7ec5dc20b42f1a2ebbcc72397.1750934177.git.kai.huang@intel.com/ [1]
Signed-off-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Tested-by: Tom Lendacky <thomas.lendacky@amd.com>
---

v3 -> v4:
 - Simplify the changelog using AI -- Boris
 - Call out "Test CPUID bit directly due to mem_encrypt=off" in the
   comment  -- Boris
 - Add a comment to explain the percpu boolean -- Boris
 - s/wbinvd/WBINVD -- Boris
 - Code update due to patch 1 being added

---
 arch/x86/include/asm/kexec.h         |  2 +-
 arch/x86/include/asm/processor.h     |  2 ++
 arch/x86/kernel/cpu/amd.c            | 17 +++++++++++++++++
 arch/x86/kernel/machine_kexec_64.c   | 13 +++++++++----
 arch/x86/kernel/process.c            | 24 +++++++++++-------------
 arch/x86/kernel/relocate_kernel_64.S | 13 ++++++++++---
 6 files changed, 50 insertions(+), 21 deletions(-)

diff --git a/arch/x86/include/asm/kexec.h b/arch/x86/include/asm/kexec.h
index 5f09791dc4e9..5cfb27f26583 100644
--- a/arch/x86/include/asm/kexec.h
+++ b/arch/x86/include/asm/kexec.h
@@ -18,7 +18,7 @@
 #include <linux/bits.h>
 
 #define RELOC_KERNEL_PRESERVE_CONTEXT	BIT(0)
-#define RELOC_KERNEL_HOST_MEM_ACTIVE	BIT(1)
+#define RELOC_KERNEL_CACHE_INCOHERENT	BIT(1)
 
 #endif
 
diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index bde58f6510ac..a24c7805acdb 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -731,6 +731,8 @@ void __noreturn stop_this_cpu(void *dummy);
 void microcode_check(struct cpuinfo_x86 *prev_info);
 void store_cpu_caps(struct cpuinfo_x86 *info);
 
+DECLARE_PER_CPU(bool, cache_state_incoherent);
+
 enum l1tf_mitigations {
 	L1TF_MITIGATION_OFF,
 	L1TF_MITIGATION_AUTO,
diff --git a/arch/x86/kernel/cpu/amd.c b/arch/x86/kernel/cpu/amd.c
index a5ece6ebe8a7..66a682be4a1a 100644
--- a/arch/x86/kernel/cpu/amd.c
+++ b/arch/x86/kernel/cpu/amd.c
@@ -545,6 +545,23 @@ static void early_detect_mem_encrypt(struct cpuinfo_x86 *c)
 {
 	u64 msr;
 
+	/*
+	 * Mark using WBINVD is needed during kexec on processors that
+	 * support SME. This provides support for performing a successful
+	 * kexec when going from SME inactive to SME active (or vice-versa).
+	 *
+	 * The cache must be cleared so that if there are entries with the
+	 * same physical address, both with and without the encryption bit,
+	 * they don't race each other when flushed and potentially end up
+	 * with the wrong entry being committed to memory.
+	 *
+	 * Test the CPUID bit directly because with mem_encrypt=off the
+	 * BSP will clear the X86_FEATURE_SME bit and the APs will not
+	 * see it set after that.
+	 */
+	if (c->extended_cpuid_level >= 0x8000001f && (cpuid_eax(0x8000001f) & BIT(0)))
+		__this_cpu_write(cache_state_incoherent, true);
+
 	/*
 	 * BIOS support is required for SME and SEV.
 	 *   For SME: If BIOS has enabled SME then adjust x86_phys_bits by
diff --git a/arch/x86/kernel/machine_kexec_64.c b/arch/x86/kernel/machine_kexec_64.c
index 25cff38f5e60..8f80a2e8cbb5 100644
--- a/arch/x86/kernel/machine_kexec_64.c
+++ b/arch/x86/kernel/machine_kexec_64.c
@@ -29,6 +29,7 @@
 #include <asm/set_memory.h>
 #include <asm/cpu.h>
 #include <asm/efi.h>
+#include <asm/processor.h>
 
 #ifdef CONFIG_ACPI
 /*
@@ -426,11 +427,11 @@ void __nocfi machine_kexec(struct kimage *image)
 		relocate_kernel_flags |= RELOC_KERNEL_PRESERVE_CONTEXT;
 
 	/*
-	 * This must be done before load_segments() since if call depth tracking
-	 * is used then GS must be valid to make any function calls.
+	 * This must be done before load_segments() since it resets
+	 * GS to 0 and percpu data needs the correct GS to work.
 	 */
-	if (cc_platform_has(CC_ATTR_HOST_MEM_ENCRYPT))
-		relocate_kernel_flags |= RELOC_KERNEL_HOST_MEM_ACTIVE;
+	if (this_cpu_read(cache_state_incoherent))
+		relocate_kernel_flags |= RELOC_KERNEL_CACHE_INCOHERENT;
 
 	/*
 	 * The segment registers are funny things, they have both a
@@ -441,6 +442,10 @@ void __nocfi machine_kexec(struct kimage *image)
 	 *
 	 * Take advantage of this here by force loading the segments,
 	 * before the GDT is zapped with an invalid value.
+	 *
+	 * load_segments() resets GS to 0.  Don't make any function call
+	 * after here since call depth tracking uses percpu variables to
+	 * operate (relocate_kernel() is explicitly ignored by call depth
 	 */
 	load_segments();
 
diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index 1b7960cf6eb0..f2bbbeef5477 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -88,6 +88,16 @@ EXPORT_PER_CPU_SYMBOL(cpu_tss_rw);
 DEFINE_PER_CPU(bool, __tss_limit_invalid);
 EXPORT_PER_CPU_SYMBOL_GPL(__tss_limit_invalid);
 
+/*
+ * The cache may be in an incoherent state and needs flushing during kexec.
+ * E.g., on SME/TDX platforms, dirty cacheline aliases with and without
+ * encryption bit(s) can coexist and the cache needs to be flushed before
+ * booting to the new kernel to avoid the silent memory corruption due to
+ * dirty cachelines with different encryption property being written back
+ * to the memory.
+ */
+DEFINE_PER_CPU(bool, cache_state_incoherent);
+
 /*
  * this gets called so that we can store lazy state into memory and copy the
  * current task into the new thread.
@@ -827,19 +837,7 @@ void __noreturn stop_this_cpu(void *dummy)
 	disable_local_APIC();
 	mcheck_cpu_clear(c);
 
-	/*
-	 * Use wbinvd on processors that support SME. This provides support
-	 * for performing a successful kexec when going from SME inactive
-	 * to SME active (or vice-versa). The cache must be cleared so that
-	 * if there are entries with the same physical address, both with and
-	 * without the encryption bit, they don't race each other when flushed
-	 * and potentially end up with the wrong entry being committed to
-	 * memory.
-	 *
-	 * Test the CPUID bit directly because the machine might've cleared
-	 * X86_FEATURE_SME due to cmdline options.
-	 */
-	if (c->extended_cpuid_level >= 0x8000001f && (cpuid_eax(0x8000001f) & BIT(0)))
+	if (this_cpu_read(cache_state_incoherent))
 		wbinvd();
 
 	/*
diff --git a/arch/x86/kernel/relocate_kernel_64.S b/arch/x86/kernel/relocate_kernel_64.S
index 1dfa323b33d5..2d02080546c8 100644
--- a/arch/x86/kernel/relocate_kernel_64.S
+++ b/arch/x86/kernel/relocate_kernel_64.S
@@ -198,14 +198,21 @@ SYM_CODE_START_LOCAL_NOALIGN(identity_mapped)
 	movq	%r9, %cr3
 
 	/*
+	 * If the memory cache is in incoherent state, e.g., due to
+	 * memory encryption, do WBINVD to flush cache.
+	 *
 	 * If SME is active, there could be old encrypted cache line
 	 * entries that will conflict with the now unencrypted memory
 	 * used by kexec. Flush the caches before copying the kernel.
+	 *
+	 * Note SME sets this flag to true when the platform supports
+	 * SME, so the WBINVD is performed even SME is not activated
+	 * by the kernel.  But this has no harm.
 	 */
-	testq	$RELOC_KERNEL_HOST_MEM_ACTIVE, %r11
-	jz .Lsme_off
+	testq	$RELOC_KERNEL_CACHE_INCOHERENT, %r11
+	jz .Lnowbinvd
 	wbinvd
-.Lsme_off:
+.Lnowbinvd:
 
 	call	swap_pages
 
-- 
2.50.0


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH v4 3/7] x86/virt/tdx: Mark memory cache state incoherent when making SEAMCALL
  2025-07-17 21:46 [PATCH v4 0/7] TDX host: kexec/kdump support Kai Huang
  2025-07-17 21:46 ` [PATCH v4 1/7] x86/kexec: Consolidate relocate_kernel() function parameters Kai Huang
  2025-07-17 21:46 ` [PATCH v4 2/7] x86/sme: Use percpu boolean to control WBINVD during kexec Kai Huang
@ 2025-07-17 21:46 ` Kai Huang
  2025-07-22 14:52   ` Chao Gao
  2025-07-17 21:46 ` [PATCH v4 4/7] x86/kexec: Disable kexec/kdump on platforms with TDX partial write erratum Kai Huang
                   ` (5 subsequent siblings)
  8 siblings, 1 reply; 28+ messages in thread
From: Kai Huang @ 2025-07-17 21:46 UTC (permalink / raw)
  To: dave.hansen, bp, tglx, peterz, mingo, hpa, thomas.lendacky
  Cc: x86, kas, rick.p.edgecombe, dwmw, linux-kernel, pbonzini, seanjc,
	kvm, reinette.chatre, isaku.yamahata, dan.j.williams,
	ashish.kalra, nik.borisov, chao.gao, sagis, Farrah Chen

On TDX platforms, dirty cacheline aliases with and without encryption
bits can coexist, and the cpu can flush them back to memory in random
order.  During kexec, the caches must be flushed before jumping to the
new kernel otherwise the dirty cachelines could silently corrupt the
memory used by the new kernel due to different encryption property.

A percpu boolean is used to mark whether the cache of a given CPU may be
in an incoherent state, and the kexec performs WBINVD on the CPUs with
that boolean turned on.

For TDX, only the TDX module or the TDX guests can generate dirty
cachelines of TDX private memory, i.e., they are only generated when the
kernel does a SEAMCALL.

Set that boolean when the kernel does SEAMCALL so that kexec can flush
the cache correctly.

The kernel provides both the __seamcall*() assembly functions and the
seamcall*() wrapper ones which additionally handle running out of
entropy error in a loop.  Most of the SEAMCALLs are called using the
seamcall*(), except TDH.VP.ENTER and TDH.PHYMEM.PAGE.RDMD which are
called using __seamcall*() variant directly.

To cover the two special cases, add a new helper do_seamcall() which
only sets the percpu boolean and then calls the __seamcall*(), and
change the special cases to use do_seamcall().  To cover all other
SEAMCALLs, change seamcall*() to call do_seamcall().

For the SEAMCALLs invoked via seamcall*(), they can be made from both
task context and IRQ disabled context.  Given SEAMCALL is just a lengthy
instruction (e.g., thousands of cycles) from kernel's point of view and
preempt_{disable|enable}() is cheap compared to it, just unconditionally
disable preemption during setting the boolean and making SEAMCALL.

Signed-off-by: Kai Huang <kai.huang@intel.com>
Tested-by: Farrah Chen <farrah.chen@intel.com>
---

v3 -> v4:
 - Set the boolean for TDH.VP.ENTER and TDH.PHYMEM.PAGE.RDMD. -Rick
 - Update the first paragraph to make it shorter -- Rick
 - Update changelog to mention the two special cases.

---
 arch/x86/include/asm/tdx.h  | 29 ++++++++++++++++++++++++++++-
 arch/x86/virt/vmx/tdx/tdx.c |  4 ++--
 2 files changed, 30 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index 7ddef3a69866..6865f62436ad 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -102,10 +102,35 @@ u64 __seamcall_ret(u64 fn, struct tdx_module_args *args);
 u64 __seamcall_saved_ret(u64 fn, struct tdx_module_args *args);
 void tdx_init(void);
 
+#include <linux/preempt.h>
 #include <asm/archrandom.h>
+#include <asm/processor.h>
 
 typedef u64 (*sc_func_t)(u64 fn, struct tdx_module_args *args);
 
+static __always_inline u64 do_seamcall(sc_func_t func, u64 fn,
+				       struct tdx_module_args *args)
+{
+	u64 ret;
+
+	lockdep_assert_preemption_disabled();
+
+	/*
+	 * SEAMCALLs are made to the TDX module and can generate dirty
+	 * cachelines of TDX private memory.  Mark cache state incoherent
+	 * so that the cache can be flushed during kexec.
+	 *
+	 * This needs to be done before actually making the SEAMCALL,
+	 * because kexec-ing CPU could send NMI to stop remote CPUs,
+	 * in which case even disabling IRQ won't help here.
+	 */
+	this_cpu_write(cache_state_incoherent, true);
+
+	ret = func(fn, args);
+
+	return ret;
+}
+
 static __always_inline u64 sc_retry(sc_func_t func, u64 fn,
 			   struct tdx_module_args *args)
 {
@@ -113,7 +138,9 @@ static __always_inline u64 sc_retry(sc_func_t func, u64 fn,
 	u64 ret;
 
 	do {
-		ret = func(fn, args);
+		preempt_disable();
+		ret = do_seamcall(func, fn, args);
+		preempt_enable();
 	} while (ret == TDX_RND_NO_ENTROPY && --retry);
 
 	return ret;
diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index c7a9a087ccaf..d6ee4e5a75d2 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -1266,7 +1266,7 @@ static bool paddr_is_tdx_private(unsigned long phys)
 		return false;
 
 	/* Get page type from the TDX module */
-	sret = __seamcall_ret(TDH_PHYMEM_PAGE_RDMD, &args);
+	sret = do_seamcall(__seamcall_ret, TDH_PHYMEM_PAGE_RDMD, &args);
 
 	/*
 	 * The SEAMCALL will not return success unless there is a
@@ -1522,7 +1522,7 @@ noinstr __flatten u64 tdh_vp_enter(struct tdx_vp *td, struct tdx_module_args *ar
 {
 	args->rcx = tdx_tdvpr_pa(td);
 
-	return __seamcall_saved_ret(TDH_VP_ENTER, args);
+	return do_seamcall(__seamcall_saved_ret, TDH_VP_ENTER, args);
 }
 EXPORT_SYMBOL_GPL(tdh_vp_enter);
 
-- 
2.50.0


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH v4 4/7] x86/kexec: Disable kexec/kdump on platforms with TDX partial write erratum
  2025-07-17 21:46 [PATCH v4 0/7] TDX host: kexec/kdump support Kai Huang
                   ` (2 preceding siblings ...)
  2025-07-17 21:46 ` [PATCH v4 3/7] x86/virt/tdx: Mark memory cache state incoherent when making SEAMCALL Kai Huang
@ 2025-07-17 21:46 ` Kai Huang
  2025-07-17 21:46 ` [PATCH v4 5/7] x86/virt/tdx: Remove the !KEXEC_CORE dependency Kai Huang
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 28+ messages in thread
From: Kai Huang @ 2025-07-17 21:46 UTC (permalink / raw)
  To: dave.hansen, bp, tglx, peterz, mingo, hpa, thomas.lendacky
  Cc: x86, kas, rick.p.edgecombe, dwmw, linux-kernel, pbonzini, seanjc,
	kvm, reinette.chatre, isaku.yamahata, dan.j.williams,
	ashish.kalra, nik.borisov, chao.gao, sagis, Farrah Chen

Some early TDX-capable platforms have an erratum: A kernel partial
write (a write transaction of less than cacheline lands at memory
controller) to TDX private memory poisons that memory, and a subsequent
read triggers a machine check.

On those platforms, the old kernel must reset TDX private memory before
jumping to the new kernel, otherwise the new kernel may see unexpected
machine check.  Currently the kernel doesn't track which page is a TDX
private page.  For simplicity just fail kexec/kdump for those platforms.

Leverage the existing machine_kexec_prepare() to fail kexec/kdump by
adding the check of the presence of the TDX erratum (which is only
checked for if the kernel is built with TDX host support).  This rejects
kexec/kdump when the kernel is loading the kexec/kdump kernel image.

The alternative is to reject kexec/kdump when the kernel is jumping to
the new kernel.  But for kexec this requires adding a new check (e.g.,
arch_kexec_allowed()) in the common code to fail kernel_kexec() at early
stage.  Kdump (crash_kexec()) needs similar check, but it's hard to
justify because crash_kexec() is not supposed to abort.

It's feasible to further relax this limitation, i.e., only fail kexec
when TDX is actually enabled by the kernel.  But this is still a half
measure compared to resetting TDX private memory so just do the simplest
thing for now.

The impact to userspace is the users will get an error when loading the
kexec/kdump kernel image:

  kexec_load failed: Operation not supported

This might be confusing to the users, thus also print the reason in the
dmesg:

  [..] kexec: not allowed on platform with tdx_pw_mce bug.

Signed-off-by: Kai Huang <kai.huang@intel.com>
Tested-by: Farrah Chen <farrah.chen@intel.com>
Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
---
 arch/x86/kernel/machine_kexec_64.c | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)

diff --git a/arch/x86/kernel/machine_kexec_64.c b/arch/x86/kernel/machine_kexec_64.c
index 8f80a2e8cbb5..411b80a7e49d 100644
--- a/arch/x86/kernel/machine_kexec_64.c
+++ b/arch/x86/kernel/machine_kexec_64.c
@@ -347,6 +347,22 @@ int machine_kexec_prepare(struct kimage *image)
 	unsigned long reloc_end = (unsigned long)__relocate_kernel_end;
 	int result;
 
+	/*
+	 * Some early TDX-capable platforms have an erratum.  A kernel
+	 * partial write (a write transaction of less than cacheline
+	 * lands at memory controller) to TDX private memory poisons that
+	 * memory, and a subsequent read triggers a machine check.
+	 *
+	 * On those platforms the old kernel must reset TDX private
+	 * memory before jumping to the new kernel otherwise the new
+	 * kernel may see unexpected machine check.  For simplicity
+	 * just fail kexec/kdump on those platforms.
+	 */
+	if (boot_cpu_has_bug(X86_BUG_TDX_PW_MCE)) {
+		pr_info_once("Not allowed on platform with tdx_pw_mce bug\n");
+		return -EOPNOTSUPP;
+	}
+
 	/* Setup the identity mapped 64bit page table */
 	result = init_pgtable(image, __pa(control_page));
 	if (result)
-- 
2.50.0


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH v4 5/7] x86/virt/tdx: Remove the !KEXEC_CORE dependency
  2025-07-17 21:46 [PATCH v4 0/7] TDX host: kexec/kdump support Kai Huang
                   ` (3 preceding siblings ...)
  2025-07-17 21:46 ` [PATCH v4 4/7] x86/kexec: Disable kexec/kdump on platforms with TDX partial write erratum Kai Huang
@ 2025-07-17 21:46 ` Kai Huang
  2025-07-17 21:46 ` [PATCH v4 6/7] x86/virt/tdx: Update the kexec section in the TDX documentation Kai Huang
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 28+ messages in thread
From: Kai Huang @ 2025-07-17 21:46 UTC (permalink / raw)
  To: dave.hansen, bp, tglx, peterz, mingo, hpa, thomas.lendacky
  Cc: x86, kas, rick.p.edgecombe, dwmw, linux-kernel, pbonzini, seanjc,
	kvm, reinette.chatre, isaku.yamahata, dan.j.williams,
	ashish.kalra, nik.borisov, chao.gao, sagis, Farrah Chen

During kexec it is now guaranteed that all dirty cachelines of TDX
private memory are flushed before jumping to the new kernel.  The TDX
private memory from the old kernel will remain as TDX private memory in
the new kernel, but it is OK because kernel read/write to TDX private
memory will never cause machine check, except on the platforms with the
TDX partial write erratum, which has already been handled.

It is safe to allow kexec to work together with TDX now.  Remove the
!KEXEC_CORE dependency.

Signed-off-by: Kai Huang <kai.huang@intel.com>
Tested-by: Farrah Chen <farrah.chen@intel.com>
Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
---
 arch/x86/Kconfig | 1 -
 1 file changed, 1 deletion(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index e924819ae133..41dfe282cf7a 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1900,7 +1900,6 @@ config INTEL_TDX_HOST
 	depends on X86_X2APIC
 	select ARCH_KEEP_MEMBLOCK
 	depends on CONTIG_ALLOC
-	depends on !KEXEC_CORE
 	depends on X86_MCE
 	help
 	  Intel Trust Domain Extensions (TDX) protects guest VMs from malicious
-- 
2.50.0


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH v4 6/7] x86/virt/tdx: Update the kexec section in the TDX documentation
  2025-07-17 21:46 [PATCH v4 0/7] TDX host: kexec/kdump support Kai Huang
                   ` (4 preceding siblings ...)
  2025-07-17 21:46 ` [PATCH v4 5/7] x86/virt/tdx: Remove the !KEXEC_CORE dependency Kai Huang
@ 2025-07-17 21:46 ` Kai Huang
  2025-07-17 21:46 ` [PATCH v4 7/7] KVM: TDX: Explicitly do WBINVD when no more TDX SEAMCALLs Kai Huang
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 28+ messages in thread
From: Kai Huang @ 2025-07-17 21:46 UTC (permalink / raw)
  To: dave.hansen, bp, tglx, peterz, mingo, hpa, thomas.lendacky
  Cc: x86, kas, rick.p.edgecombe, dwmw, linux-kernel, pbonzini, seanjc,
	kvm, reinette.chatre, isaku.yamahata, dan.j.williams,
	ashish.kalra, nik.borisov, chao.gao, sagis, Farrah Chen

TDX host kernel now supports kexec/kdump.  Update the documentation to
reflect that.

Opportunistically, remove the parentheses in "Kexec()" and move this
section under the "Erratum" section because the updated "Kexec" section
now refers to that erratum.

Signed-off-by: Kai Huang <kai.huang@intel.com>
Tested-by: Farrah Chen <farrah.chen@intel.com>
Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
---
 Documentation/arch/x86/tdx.rst | 14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/Documentation/arch/x86/tdx.rst b/Documentation/arch/x86/tdx.rst
index 719043cd8b46..61670e7df2f7 100644
--- a/Documentation/arch/x86/tdx.rst
+++ b/Documentation/arch/x86/tdx.rst
@@ -142,13 +142,6 @@ but depends on the BIOS to behave correctly.
 Note TDX works with CPU logical online/offline, thus the kernel still
 allows to offline logical CPU and online it again.
 
-Kexec()
-~~~~~~~
-
-TDX host support currently lacks the ability to handle kexec.  For
-simplicity only one of them can be enabled in the Kconfig.  This will be
-fixed in the future.
-
 Erratum
 ~~~~~~~
 
@@ -171,6 +164,13 @@ If the platform has such erratum, the kernel prints additional message in
 machine check handler to tell user the machine check may be caused by
 kernel bug on TDX private memory.
 
+Kexec
+~~~~~~~
+
+Currently kexec doesn't work on the TDX platforms with the aforementioned
+erratum.  It fails when loading the kexec kernel image.  Otherwise it
+works normally.
+
 Interaction vs S3 and deeper states
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
-- 
2.50.0


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH v4 7/7] KVM: TDX: Explicitly do WBINVD when no more TDX SEAMCALLs
  2025-07-17 21:46 [PATCH v4 0/7] TDX host: kexec/kdump support Kai Huang
                   ` (5 preceding siblings ...)
  2025-07-17 21:46 ` [PATCH v4 6/7] x86/virt/tdx: Update the kexec section in the TDX documentation Kai Huang
@ 2025-07-17 21:46 ` Kai Huang
  2025-07-21 13:08 ` [PATCH v4 0/7] TDX host: kexec/kdump support Tom Lendacky
  2025-07-22 13:50 ` Paolo Bonzini
  8 siblings, 0 replies; 28+ messages in thread
From: Kai Huang @ 2025-07-17 21:46 UTC (permalink / raw)
  To: dave.hansen, bp, tglx, peterz, mingo, hpa, thomas.lendacky
  Cc: x86, kas, rick.p.edgecombe, dwmw, linux-kernel, pbonzini, seanjc,
	kvm, reinette.chatre, isaku.yamahata, dan.j.williams,
	ashish.kalra, nik.borisov, chao.gao, sagis, Farrah Chen,
	Binbin Wu

On TDX platforms, during kexec, the kernel needs to make sure there are
no dirty cachelines of TDX private memory before booting to the new
kernel to avoid silent memory corruption to the new kernel.

During kexec, the kexec-ing CPU firstly invokes native_stop_other_cpus()
to stop all remote CPUs before booting to the new kernel.  The remote
CPUs will then execute stop_this_cpu() to stop themselves.

The kernel has a percpu boolean to indicate whether the cache of a CPU
may be in incoherent state.  In stop_this_cpu(), the kernel does WBINVD
if that percpu boolean is true.

TDX turns on that percpu boolean on a CPU when the kernel does SEAMCALL.
This makes sure the caches will be flushed during kexec.

However, the native_stop_other_cpus() and stop_this_cpu() have a "race"
which is extremely rare to happen but could cause the system to hang.

Specifically, the native_stop_other_cpus() firstly sends normal reboot
IPI to remote CPUs and waits one second for them to stop.  If that times
out, native_stop_other_cpus() then sends NMIs to remote CPUs to stop
them.

The aforementioned race happens when NMIs are sent.  Doing WBINVD in
stop_this_cpu() makes each CPU take longer time to stop and increases
the chance of the race happening.

Explicitly flush cache in tdx_disable_virtualization_cpu() after which
no more TDX activity can happen on this cpu.  This moves the WBINVD to
an earlier stage than stop_this_cpus(), avoiding a possibly lengthy
operation at a time where it could cause this race.

Signed-off-by: Kai Huang <kai.huang@intel.com>
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
Tested-by: Farrah Chen <farrah.chen@intel.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
---

v3 -> v4:
 - Change doing wbinvd() from rebooting notifier to
   tdx_disable_virtualization_cpu() to cover the case where more
   SEAMCALL can be made after cache flush, i.e., doing kexec when
   there's TD alive.  - Chao.
 - Add check to skip wbinvd if the boolean is false. -- Chao
 - Fix typo in the comment -- Binbin.

---
 arch/x86/include/asm/tdx.h  |  2 ++
 arch/x86/kvm/vmx/tdx.c      | 12 ++++++++++++
 arch/x86/virt/vmx/tdx/tdx.c | 12 ++++++++++++
 3 files changed, 26 insertions(+)

diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index 6865f62436ad..5f1f4e8594c0 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -221,6 +221,7 @@ u64 tdh_mem_page_remove(struct tdx_td *td, u64 gpa, u64 level, u64 *ext_err1, u6
 u64 tdh_phymem_cache_wb(bool resume);
 u64 tdh_phymem_page_wbinvd_tdr(struct tdx_td *td);
 u64 tdh_phymem_page_wbinvd_hkid(u64 hkid, struct page *page);
+void tdx_cpu_flush_cache(void);
 #else
 static inline void tdx_init(void) { }
 static inline int tdx_cpu_enable(void) { return -ENODEV; }
@@ -228,6 +229,7 @@ static inline int tdx_enable(void)  { return -ENODEV; }
 static inline u32 tdx_get_nr_guest_keyids(void) { return 0; }
 static inline const char *tdx_dump_mce_info(struct mce *m) { return NULL; }
 static inline const struct tdx_sys_info *tdx_get_sysinfo(void) { return NULL; }
+static inline void tdx_cpu_flush_cache(void) { }
 #endif	/* CONFIG_INTEL_TDX_HOST */
 
 #endif /* !__ASSEMBLER__ */
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index f31ccdeb905b..478baaa1bfb5 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -444,6 +444,18 @@ void tdx_disable_virtualization_cpu(void)
 		tdx_flush_vp(&arg);
 	}
 	local_irq_restore(flags);
+
+	/*
+	 * No more TDX activity on this CPU from here.  Flush cache to
+	 * avoid having to do WBINVD in stop_this_cpu() during kexec.
+	 *
+	 * Kexec calls native_stop_other_cpus() to stop remote CPUs
+	 * before booting to new kernel, but that code has a "race"
+	 * when the normal REBOOT IPI times out and NMIs are sent to
+	 * remote CPUs to stop them.  Doing WBINVD in stop_this_cpu()
+	 * could potentially increase the possibility of the "race".
+	 */
+	tdx_cpu_flush_cache();
 }
 
 #define TDX_SEAMCALL_RETRIES 10000
diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index d6ee4e5a75d2..c098a6e0382b 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -1870,3 +1870,15 @@ u64 tdh_phymem_page_wbinvd_hkid(u64 hkid, struct page *page)
 	return seamcall(TDH_PHYMEM_PAGE_WBINVD, &args);
 }
 EXPORT_SYMBOL_GPL(tdh_phymem_page_wbinvd_hkid);
+
+void tdx_cpu_flush_cache(void)
+{
+	lockdep_assert_preemption_disabled();
+
+	if (!this_cpu_read(cache_state_incoherent))
+		return;
+
+	wbinvd();
+	this_cpu_write(cache_state_incoherent, false);
+}
+EXPORT_SYMBOL_GPL(tdx_cpu_flush_cache);
-- 
2.50.0


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: [PATCH v4 0/7] TDX host: kexec/kdump support
  2025-07-17 21:46 [PATCH v4 0/7] TDX host: kexec/kdump support Kai Huang
                   ` (6 preceding siblings ...)
  2025-07-17 21:46 ` [PATCH v4 7/7] KVM: TDX: Explicitly do WBINVD when no more TDX SEAMCALLs Kai Huang
@ 2025-07-21 13:08 ` Tom Lendacky
  2025-07-21 14:50   ` Tom Lendacky
  2025-07-22 13:50 ` Paolo Bonzini
  8 siblings, 1 reply; 28+ messages in thread
From: Tom Lendacky @ 2025-07-21 13:08 UTC (permalink / raw)
  To: Kai Huang, dave.hansen, bp, tglx, peterz, mingo, hpa
  Cc: x86, kas, rick.p.edgecombe, dwmw, linux-kernel, pbonzini, seanjc,
	kvm, reinette.chatre, isaku.yamahata, dan.j.williams,
	ashish.kalra, nik.borisov, chao.gao, sagis

On 7/17/25 16:46, Kai Huang wrote:
> This series is the latest attempt to support kexec on TDX host following
> Dave's suggestion to use a percpu boolean to control WBINVD during
> kexec.
> 
> Hi Boris/Tom,
> 
> As requested, I added the first patch to cleanup the last two 'unsigned
> int' parameters of the relocate_kernel() into one 'unsigned int' and pass
> flags instead.  The patch 2 (patch 1 in v3) also gets updated based on
> that.  Would you help to review?  Thanks.
> 
> I tested that both normal kexec and preserve_context kexec works (using
> the tools/testing/selftests/kexec/test_kexec_jump.sh).  But I don't have
> SME capable machine to test.
> 
> Hi Tom, I added your Reviewed-by and Tested-by in the patch 2 anyway
> since I believe the change is trivial and straightforward).  But due to
> the cleanup patch, I appreciate if you can help to test the first two
> patches again.  Thanks a lot!

Everything is working, Thanks!

Tom

> 
> v3 -> v4:
>  - Rebase to latest tip/master.
>  - Add a cleanup patch to consolidate relocate_kernel()'s last two
>    function parameters -- Boris.
>  - Address comments received -- please see individual patches.
>  - Collect tags (Tom, Rick, binbin).
> 
>  v3: https://lore.kernel.org/kvm/cover.1750934177.git.kai.huang@intel.com/
> 
> v2 -> v3 (all trivial changes):
> 
>  - Rebase on latest tip/master
>    - change to use __always_inline for do_seamcall() in patch 2
>  - Update patch 2 (changelog and code comment) to remove the sentence
>    which says "not all SEAMCALLs generate dirty cachelines of TDX
>    private memory but just treat all of them do."  -- Dave.
>  - Add Farrah's Tested-by for all TDX patches.
> 
> The v2 had one informal RFC patch appended to show "some optimization"
> which can move WBINVD from the kexec phase to an early stage in KVM.
> Paolo commented and Acked that patch (thanks!), so this v3 made that
> patch as a formal one (patch 6).  But technically it is not absolutely
> needed in this series but can be done in the future.
> 
> More history info can be found in v2:
> 
>  https://lore.kernel.org/lkml/cover.1746874095.git.kai.huang@intel.com/
> 
> === More information ===
> 
> TDX private memory is memory that is encrypted with private Host Key IDs
> (HKID).  If the kernel has ever enabled TDX, part of system memory
> remains TDX private memory when kexec happens.  E.g., the PAMT (Physical
> Address Metadata Table) pages used by the TDX module to track each TDX
> memory page's state are never freed once the TDX module is initialized.
> TDX guests also have guest private memory and secure-EPT pages.
> 
> After kexec, the new kernel will have no knowledge of which memory page
> was used as TDX private page and can use all memory as regular memory.
> 
> 1) Cache flush
> 
> Per TDX 1.5 base spec "8.6.1.Platforms not Using ACT: Required Cache
> Flush and Initialization by the Host VMM", to support kexec for TDX, the
> kernel needs to flush cache to make sure there's no dirty cachelines of
> TDX private memory left over to the new kernel (when the TDX module
> reports TDX_FEATURES.CLFLUSH_BEFORE_ALLOC as 1 in the global metadata for
> the platform).  The kernel also needs to make sure there's no more TDX
> activity (no SEAMCALL) after cache flush so that no new dirty cachelines
> of TDX private memory are generated.
> 
> SME has similar requirement.  SME kexec support uses WBINVD to do the
> cache flush.  WBINVD is able to flush cachelines associated with any
> HKID.  Reuse the WBINVD introduced by SME to flush cache for TDX.
> 
> Currently the kernel explicitly checks whether the hardware supports SME
> and only does WBINVD if true.  Instead of adding yet another TDX
> specific check, this series uses a percpu boolean to indicate whether
> WBINVD is needed on that CPU during kexec.
> 
> 2) Reset TDX private memory using MOVDIR64B
> 
> The TDX spec (the aforementioned section) also suggests the kernel
> *should* use MOVDIR64B to clear TDX private page before the kernel
> reuses it as regular one.
> 
> However, in reality the situation can be more flexible.  Per TDX 1.5
> base spec ("Table 16.2: Non-ACT Platforms Checks on Memory Reads in Ci
> Mode" and "Table 16.3: Non-ACT Platforms Checks on Memory Reads in Li
> Mode"), the read/write to TDX private memory using shared KeyID without
> integrity check enabled will not poison the memory and cause machine
> check.
> 
> Note on the platforms with ACT (Access Control Table), there's no
> integrity check involved thus no machine check is possible to happen due
> to memory read/write using different KeyIDs.
> 
> KeyID 0 (TME key) doesn't support integrity check.  This series chooses
> to NOT reset TDX private memory but leave TDX private memory as-is to the
> new kernel.  As mentioned above, in practice it is safe to do so.
> 
> 3) One limitation
> 
> If the kernel has ever enabled TDX, after kexec the new kernel won't be
> able to use TDX anymore.  This is because when the new kernel tries to
> initialize TDX module it will fail on the first SEAMCALL due to the
> module has already been initialized by the old kernel.
> 
> More (non-trivial) work will be needed for the new kernel to use TDX,
> e.g., one solution is to just reload the TDX module from the location
> where BIOS loads the TDX module (/boot/efi/EFI/TDX/).  This series
> doesn't cover this, but leave this as future work.
> 
> 4) Kdump support
> 
> This series also enables kdump with TDX, but no special handling is
> needed for crash kexec (except turning on the Kconfig option):
> 
>  - kdump kernel uses reserved memory from the old kernel as system ram,
>    and the old kernel will never use the reserved memory as TDX memory.
>  - /proc/vmcore contains TDX private memory pages.  It's meaningless to
>    read them, but it doesn't do any harm either.
> 
> 5) TDX "partial write machine check" erratum
> 
> On the platform with TDX erratum, a partial write (a write transaction
> of less than a cacheline lands at memory controller) to TDX private
> memory poisons that memory, and a subsequent read triggers machine
> check.  On those platforms, the kernel needs to reset TDX private memory
> before jumping to the new kernel otherwise the new kernel may see
> unexpected machine check.
> 
> The kernel currently doesn't track which page is TDX private memory.
> It's not trivial to reset TDX private memory.  For simplicity, this
> series simply disables kexec/kdump for such platforms.  This can be
> enhanced in the future.
> 
> 
> 
> Kai Huang (7):
>   x86/kexec: Consolidate relocate_kernel() function parameters
>   x86/sme: Use percpu boolean to control WBINVD during kexec
>   x86/virt/tdx: Mark memory cache state incoherent when making SEAMCALL
>   x86/kexec: Disable kexec/kdump on platforms with TDX partial write
>     erratum
>   x86/virt/tdx: Remove the !KEXEC_CORE dependency
>   x86/virt/tdx: Update the kexec section in the TDX documentation
>   KVM: TDX: Explicitly do WBINVD when no more TDX SEAMCALLs
> 
>  Documentation/arch/x86/tdx.rst       | 14 ++++-----
>  arch/x86/Kconfig                     |  1 -
>  arch/x86/include/asm/kexec.h         | 12 ++++++--
>  arch/x86/include/asm/processor.h     |  2 ++
>  arch/x86/include/asm/tdx.h           | 31 +++++++++++++++++++-
>  arch/x86/kernel/cpu/amd.c            | 17 +++++++++++
>  arch/x86/kernel/machine_kexec_64.c   | 43 ++++++++++++++++++++++------
>  arch/x86/kernel/process.c            | 24 +++++++---------
>  arch/x86/kernel/relocate_kernel_64.S | 30 +++++++++++--------
>  arch/x86/kvm/vmx/tdx.c               | 12 ++++++++
>  arch/x86/virt/vmx/tdx/tdx.c          | 16 +++++++++--
>  11 files changed, 155 insertions(+), 47 deletions(-)
> 
> 
> base-commit: e180b3a224cb519388c2f61ca7bc1eaf94cec1fb

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v4 1/7] x86/kexec: Consolidate relocate_kernel() function parameters
  2025-07-17 21:46 ` [PATCH v4 1/7] x86/kexec: Consolidate relocate_kernel() function parameters Kai Huang
@ 2025-07-21 14:40   ` Tom Lendacky
  2025-07-21 21:20     ` Huang, Kai
  0 siblings, 1 reply; 28+ messages in thread
From: Tom Lendacky @ 2025-07-21 14:40 UTC (permalink / raw)
  To: Kai Huang, dave.hansen, bp, tglx, peterz, mingo, hpa
  Cc: x86, kas, rick.p.edgecombe, dwmw, linux-kernel, pbonzini, seanjc,
	kvm, reinette.chatre, isaku.yamahata, dan.j.williams,
	ashish.kalra, nik.borisov, chao.gao, sagis

On 7/17/25 16:46, Kai Huang wrote:
> During kexec, the kernel jumps to the new kernel in relocate_kernel(),
> which is implemented in assembly and both 32-bit and 64-bit have their
> own version.
> 
> Currently, for both 32-bit and 64-bit, the last two parameters of the
> relocate_kernel() are both 'unsigned int' but actually they only convey
> a boolean, i.e., one bit information.  The 'unsigned int' has enough
> space to carry two bits information therefore there's no need to pass
> the two booleans in two separate 'unsigned int'.
> 
> Consolidate the last two function parameters of relocate_kernel() into a
> single 'unsigned int' and pass flags instead.
> 
> Only consolidate the 64-bit version albeit the similar optimization can
> be done for the 32-bit version too.  Don't bother changing the 32-bit
> version while it is working (since assembly code change is required).
> 
> Signed-off-by: Kai Huang <kai.huang@intel.com>
> ---
>  arch/x86/include/asm/kexec.h         | 12 ++++++++++--
>  arch/x86/kernel/machine_kexec_64.c   | 22 +++++++++++++---------
>  arch/x86/kernel/relocate_kernel_64.S | 19 +++++++++----------
>  3 files changed, 32 insertions(+), 21 deletions(-)
> 
> diff --git a/arch/x86/include/asm/kexec.h b/arch/x86/include/asm/kexec.h
> index f2ad77929d6e..5f09791dc4e9 100644
> --- a/arch/x86/include/asm/kexec.h
> +++ b/arch/x86/include/asm/kexec.h
> @@ -13,6 +13,15 @@
>  # define KEXEC_DEBUG_EXC_HANDLER_SIZE	6 /* PUSHI, PUSHI, 2-byte JMP */
>  #endif
>  
> +#ifdef CONFIG_X86_64
> +
> +#include <linux/bits.h>
> +
> +#define RELOC_KERNEL_PRESERVE_CONTEXT	BIT(0)
> +#define RELOC_KERNEL_HOST_MEM_ACTIVE	BIT(1)

This isn't as descriptive with "ENC" removed from the name. It's almost
like you read this and think it should always be 1 because the kernel
always has host memory active.

> +
> +#endif
> +
>  # define KEXEC_CONTROL_PAGE_SIZE	4096
>  # define KEXEC_CONTROL_CODE_MAX_SIZE	2048
>  
> @@ -121,8 +130,7 @@ typedef unsigned long
>  relocate_kernel_fn(unsigned long indirection_page,
>  		   unsigned long pa_control_page,
>  		   unsigned long start_address,
> -		   unsigned int preserve_context,
> -		   unsigned int host_mem_enc_active);
> +		   unsigned int flags);
>  #endif
>  extern relocate_kernel_fn relocate_kernel;
>  #define ARCH_HAS_KIMAGE_ARCH
> diff --git a/arch/x86/kernel/machine_kexec_64.c b/arch/x86/kernel/machine_kexec_64.c
> index 697fb99406e6..25cff38f5e60 100644
> --- a/arch/x86/kernel/machine_kexec_64.c
> +++ b/arch/x86/kernel/machine_kexec_64.c
> @@ -384,16 +384,10 @@ void __nocfi machine_kexec(struct kimage *image)
>  {
>  	unsigned long reloc_start = (unsigned long)__relocate_kernel_start;
>  	relocate_kernel_fn *relocate_kernel_ptr;
> -	unsigned int host_mem_enc_active;
> +	unsigned int relocate_kernel_flags;
>  	int save_ftrace_enabled;
>  	void *control_page;
>  
> -	/*
> -	 * This must be done before load_segments() since if call depth tracking
> -	 * is used then GS must be valid to make any function calls.
> -	 */
> -	host_mem_enc_active = cc_platform_has(CC_ATTR_HOST_MEM_ENCRYPT);
> -
>  #ifdef CONFIG_KEXEC_JUMP
>  	if (image->preserve_context)
>  		save_processor_state();
> @@ -427,6 +421,17 @@ void __nocfi machine_kexec(struct kimage *image)
>  	 */
>  	relocate_kernel_ptr = control_page + (unsigned long)relocate_kernel - reloc_start;
>  
> +	relocate_kernel_flags = 0;
> +	if (image->preserve_context)
> +		relocate_kernel_flags |= RELOC_KERNEL_PRESERVE_CONTEXT;
> +
> +	/*
> +	 * This must be done before load_segments() since if call depth tracking
> +	 * is used then GS must be valid to make any function calls.
> +	 */
> +	if (cc_platform_has(CC_ATTR_HOST_MEM_ENCRYPT))
> +		relocate_kernel_flags |= RELOC_KERNEL_HOST_MEM_ACTIVE;
> +
>  	/*
>  	 * The segment registers are funny things, they have both a
>  	 * visible and an invisible part.  Whenever the visible part is
> @@ -443,8 +448,7 @@ void __nocfi machine_kexec(struct kimage *image)
>  	image->start = relocate_kernel_ptr((unsigned long)image->head,
>  					   virt_to_phys(control_page),
>  					   image->start,
> -					   image->preserve_context,
> -					   host_mem_enc_active);
> +					   relocate_kernel_flags);
>  
>  #ifdef CONFIG_KEXEC_JUMP
>  	if (image->preserve_context)
> diff --git a/arch/x86/kernel/relocate_kernel_64.S b/arch/x86/kernel/relocate_kernel_64.S
> index ea604f4d0b52..1dfa323b33d5 100644
> --- a/arch/x86/kernel/relocate_kernel_64.S
> +++ b/arch/x86/kernel/relocate_kernel_64.S
> @@ -66,8 +66,7 @@ SYM_CODE_START_NOALIGN(relocate_kernel)
>  	 * %rdi indirection_page
>  	 * %rsi pa_control_page
>  	 * %rdx start address
> -	 * %rcx preserve_context
> -	 * %r8  host_mem_enc_active
> +	 * %rcx flags: RELOC_KERNEL_*
>  	 */
>  
>  	/* Save the CPU context, used for jumping back */
> @@ -111,7 +110,7 @@ SYM_CODE_START_NOALIGN(relocate_kernel)
>  	/* save indirection list for jumping back */
>  	movq	%rdi, pa_backup_pages_map(%rip)
>  
> -	/* Save the preserve_context to %r11 as swap_pages clobbers %rcx. */
> +	/* Save the flags to %r11 as swap_pages clobbers %rcx. */
>  	movq	%rcx, %r11
>  
>  	/* setup a new stack at the end of the physical control page */
> @@ -129,9 +128,8 @@ SYM_CODE_START_LOCAL_NOALIGN(identity_mapped)
>  	/*
>  	 * %rdi	indirection page
>  	 * %rdx start address
> -	 * %r8 host_mem_enc_active
>  	 * %r9 page table page
> -	 * %r11 preserve_context
> +	 * %r11 flags: RELOC_KERNEL_*
>  	 * %r13 original CR4 when relocate_kernel() was invoked
>  	 */
>  
> @@ -204,7 +202,7 @@ SYM_CODE_START_LOCAL_NOALIGN(identity_mapped)
>  	 * entries that will conflict with the now unencrypted memory
>  	 * used by kexec. Flush the caches before copying the kernel.
>  	 */
> -	testq	%r8, %r8
> +	testq	$RELOC_KERNEL_HOST_MEM_ACTIVE, %r11

Hmmm... can't both bits be set at the same time? If so, then this will
fail. This should be doing bit tests now.

>  	jz .Lsme_off
>  	wbinvd
>  .Lsme_off:
> @@ -220,7 +218,7 @@ SYM_CODE_START_LOCAL_NOALIGN(identity_mapped)
>  	movq	%cr3, %rax
>  	movq	%rax, %cr3
>  
> -	testq	%r11, %r11	/* preserve_context */
> +	testq	$RELOC_KERNEL_PRESERVE_CONTEXT, %r11
>  	jnz .Lrelocate
>  
>  	/*
> @@ -273,7 +271,7 @@ SYM_CODE_START_LOCAL_NOALIGN(identity_mapped)
>  	ANNOTATE_NOENDBR
>  	andq	$PAGE_MASK, %r8
>  	lea	PAGE_SIZE(%r8), %rsp
> -	movl	$1, %r11d	/* Ensure preserve_context flag is set */
> +	movl	$RELOC_KERNEL_PRESERVE_CONTEXT, %r11d	/* Ensure preserve_context flag is set */

And this will clear any value that was in r11 vs setting a single bit.
Not sure it currently has any effect because r8 (where the memory
encryption setting was held) is modified just before this. But if any
bits are added in the future that are needed past here, this will be a
problem.

>  	call	swap_pages
>  	movq	kexec_va_control_page(%rip), %rax
>  0:	addq	$virtual_mapped - 0b, %rax
> @@ -321,7 +319,7 @@ SYM_CODE_START_LOCAL_NOALIGN(swap_pages)
>  	UNWIND_HINT_END_OF_STACK
>  	/*
>  	 * %rdi indirection page
> -	 * %r11 preserve_context
> +	 * %r11 flags: RELOC_KERNEL_*
>  	 */
>  	movq	%rdi, %rcx	/* Put the indirection_page in %rcx */
>  	xorl	%edi, %edi
> @@ -357,7 +355,8 @@ SYM_CODE_START_LOCAL_NOALIGN(swap_pages)
>  	movq	%rdi, %rdx    /* Save destination page to %rdx */
>  	movq	%rsi, %rax    /* Save source page to %rax */
>  
> -	testq	%r11, %r11    /* Only actually swap for ::preserve_context */
> +	/* Only actually swap for ::preserve_context */
> +	testq	$RELOC_KERNEL_PRESERVE_CONTEXT, %r11

Ditto here on the bit testing.

Thanks,
Tom

>  	jz	.Lnoswap
>  
>  	/* copy source page to swap page */

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v4 0/7] TDX host: kexec/kdump support
  2025-07-21 13:08 ` [PATCH v4 0/7] TDX host: kexec/kdump support Tom Lendacky
@ 2025-07-21 14:50   ` Tom Lendacky
  2025-07-21 21:30     ` Huang, Kai
  2025-07-21 21:30     ` Huang, Kai
  0 siblings, 2 replies; 28+ messages in thread
From: Tom Lendacky @ 2025-07-21 14:50 UTC (permalink / raw)
  To: Kai Huang, dave.hansen, bp, tglx, peterz, mingo, hpa
  Cc: x86, kas, rick.p.edgecombe, dwmw, linux-kernel, pbonzini, seanjc,
	kvm, reinette.chatre, isaku.yamahata, dan.j.williams,
	ashish.kalra, nik.borisov, chao.gao, sagis

On 7/21/25 08:08, Tom Lendacky wrote:
> On 7/17/25 16:46, Kai Huang wrote:
>> This series is the latest attempt to support kexec on TDX host following
>> Dave's suggestion to use a percpu boolean to control WBINVD during
>> kexec.
>>
>> Hi Boris/Tom,
>>
>> As requested, I added the first patch to cleanup the last two 'unsigned
>> int' parameters of the relocate_kernel() into one 'unsigned int' and pass
>> flags instead.  The patch 2 (patch 1 in v3) also gets updated based on
>> that.  Would you help to review?  Thanks.
>>
>> I tested that both normal kexec and preserve_context kexec works (using
>> the tools/testing/selftests/kexec/test_kexec_jump.sh).  But I don't have
>> SME capable machine to test.
>>
>> Hi Tom, I added your Reviewed-by and Tested-by in the patch 2 anyway
>> since I believe the change is trivial and straightforward).  But due to
>> the cleanup patch, I appreciate if you can help to test the first two
>> patches again.  Thanks a lot!
> 
> Everything is working, Thanks!

See my comments in patch #1. I didn't test with context preservation, so
that bit was never set. If it was, I think things would have failed.

Thanks,
Tom

> 
> Tom
> 
>>
>> v3 -> v4:
>>  - Rebase to latest tip/master.
>>  - Add a cleanup patch to consolidate relocate_kernel()'s last two
>>    function parameters -- Boris.
>>  - Address comments received -- please see individual patches.
>>  - Collect tags (Tom, Rick, binbin).
>>
>>  v3: https://lore.kernel.org/kvm/cover.1750934177.git.kai.huang@intel.com/
>>
>> v2 -> v3 (all trivial changes):
>>
>>  - Rebase on latest tip/master
>>    - change to use __always_inline for do_seamcall() in patch 2
>>  - Update patch 2 (changelog and code comment) to remove the sentence
>>    which says "not all SEAMCALLs generate dirty cachelines of TDX
>>    private memory but just treat all of them do."  -- Dave.
>>  - Add Farrah's Tested-by for all TDX patches.
>>
>> The v2 had one informal RFC patch appended to show "some optimization"
>> which can move WBINVD from the kexec phase to an early stage in KVM.
>> Paolo commented and Acked that patch (thanks!), so this v3 made that
>> patch as a formal one (patch 6).  But technically it is not absolutely
>> needed in this series but can be done in the future.
>>
>> More history info can be found in v2:
>>
>>  https://lore.kernel.org/lkml/cover.1746874095.git.kai.huang@intel.com/
>>
>> === More information ===
>>
>> TDX private memory is memory that is encrypted with private Host Key IDs
>> (HKID).  If the kernel has ever enabled TDX, part of system memory
>> remains TDX private memory when kexec happens.  E.g., the PAMT (Physical
>> Address Metadata Table) pages used by the TDX module to track each TDX
>> memory page's state are never freed once the TDX module is initialized.
>> TDX guests also have guest private memory and secure-EPT pages.
>>
>> After kexec, the new kernel will have no knowledge of which memory page
>> was used as TDX private page and can use all memory as regular memory.
>>
>> 1) Cache flush
>>
>> Per TDX 1.5 base spec "8.6.1.Platforms not Using ACT: Required Cache
>> Flush and Initialization by the Host VMM", to support kexec for TDX, the
>> kernel needs to flush cache to make sure there's no dirty cachelines of
>> TDX private memory left over to the new kernel (when the TDX module
>> reports TDX_FEATURES.CLFLUSH_BEFORE_ALLOC as 1 in the global metadata for
>> the platform).  The kernel also needs to make sure there's no more TDX
>> activity (no SEAMCALL) after cache flush so that no new dirty cachelines
>> of TDX private memory are generated.
>>
>> SME has similar requirement.  SME kexec support uses WBINVD to do the
>> cache flush.  WBINVD is able to flush cachelines associated with any
>> HKID.  Reuse the WBINVD introduced by SME to flush cache for TDX.
>>
>> Currently the kernel explicitly checks whether the hardware supports SME
>> and only does WBINVD if true.  Instead of adding yet another TDX
>> specific check, this series uses a percpu boolean to indicate whether
>> WBINVD is needed on that CPU during kexec.
>>
>> 2) Reset TDX private memory using MOVDIR64B
>>
>> The TDX spec (the aforementioned section) also suggests the kernel
>> *should* use MOVDIR64B to clear TDX private page before the kernel
>> reuses it as regular one.
>>
>> However, in reality the situation can be more flexible.  Per TDX 1.5
>> base spec ("Table 16.2: Non-ACT Platforms Checks on Memory Reads in Ci
>> Mode" and "Table 16.3: Non-ACT Platforms Checks on Memory Reads in Li
>> Mode"), the read/write to TDX private memory using shared KeyID without
>> integrity check enabled will not poison the memory and cause machine
>> check.
>>
>> Note on the platforms with ACT (Access Control Table), there's no
>> integrity check involved thus no machine check is possible to happen due
>> to memory read/write using different KeyIDs.
>>
>> KeyID 0 (TME key) doesn't support integrity check.  This series chooses
>> to NOT reset TDX private memory but leave TDX private memory as-is to the
>> new kernel.  As mentioned above, in practice it is safe to do so.
>>
>> 3) One limitation
>>
>> If the kernel has ever enabled TDX, after kexec the new kernel won't be
>> able to use TDX anymore.  This is because when the new kernel tries to
>> initialize TDX module it will fail on the first SEAMCALL due to the
>> module has already been initialized by the old kernel.
>>
>> More (non-trivial) work will be needed for the new kernel to use TDX,
>> e.g., one solution is to just reload the TDX module from the location
>> where BIOS loads the TDX module (/boot/efi/EFI/TDX/).  This series
>> doesn't cover this, but leave this as future work.
>>
>> 4) Kdump support
>>
>> This series also enables kdump with TDX, but no special handling is
>> needed for crash kexec (except turning on the Kconfig option):
>>
>>  - kdump kernel uses reserved memory from the old kernel as system ram,
>>    and the old kernel will never use the reserved memory as TDX memory.
>>  - /proc/vmcore contains TDX private memory pages.  It's meaningless to
>>    read them, but it doesn't do any harm either.
>>
>> 5) TDX "partial write machine check" erratum
>>
>> On the platform with TDX erratum, a partial write (a write transaction
>> of less than a cacheline lands at memory controller) to TDX private
>> memory poisons that memory, and a subsequent read triggers machine
>> check.  On those platforms, the kernel needs to reset TDX private memory
>> before jumping to the new kernel otherwise the new kernel may see
>> unexpected machine check.
>>
>> The kernel currently doesn't track which page is TDX private memory.
>> It's not trivial to reset TDX private memory.  For simplicity, this
>> series simply disables kexec/kdump for such platforms.  This can be
>> enhanced in the future.
>>
>>
>>
>> Kai Huang (7):
>>   x86/kexec: Consolidate relocate_kernel() function parameters
>>   x86/sme: Use percpu boolean to control WBINVD during kexec
>>   x86/virt/tdx: Mark memory cache state incoherent when making SEAMCALL
>>   x86/kexec: Disable kexec/kdump on platforms with TDX partial write
>>     erratum
>>   x86/virt/tdx: Remove the !KEXEC_CORE dependency
>>   x86/virt/tdx: Update the kexec section in the TDX documentation
>>   KVM: TDX: Explicitly do WBINVD when no more TDX SEAMCALLs
>>
>>  Documentation/arch/x86/tdx.rst       | 14 ++++-----
>>  arch/x86/Kconfig                     |  1 -
>>  arch/x86/include/asm/kexec.h         | 12 ++++++--
>>  arch/x86/include/asm/processor.h     |  2 ++
>>  arch/x86/include/asm/tdx.h           | 31 +++++++++++++++++++-
>>  arch/x86/kernel/cpu/amd.c            | 17 +++++++++++
>>  arch/x86/kernel/machine_kexec_64.c   | 43 ++++++++++++++++++++++------
>>  arch/x86/kernel/process.c            | 24 +++++++---------
>>  arch/x86/kernel/relocate_kernel_64.S | 30 +++++++++++--------
>>  arch/x86/kvm/vmx/tdx.c               | 12 ++++++++
>>  arch/x86/virt/vmx/tdx/tdx.c          | 16 +++++++++--
>>  11 files changed, 155 insertions(+), 47 deletions(-)
>>
>>
>> base-commit: e180b3a224cb519388c2f61ca7bc1eaf94cec1fb

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v4 1/7] x86/kexec: Consolidate relocate_kernel() function parameters
  2025-07-21 14:40   ` Tom Lendacky
@ 2025-07-21 21:20     ` Huang, Kai
  2025-07-21 21:27       ` Tom Lendacky
  0 siblings, 1 reply; 28+ messages in thread
From: Huang, Kai @ 2025-07-21 21:20 UTC (permalink / raw)
  To: thomas.lendacky@amd.com, tglx@linutronix.de, peterz@infradead.org,
	Hansen, Dave, mingo@redhat.com, bp@alien8.de, hpa@zytor.com
  Cc: ashish.kalra@amd.com, Edgecombe, Rick P, seanjc@google.com,
	x86@kernel.org, kas@kernel.org, Gao, Chao, Chatre, Reinette,
	linux-kernel@vger.kernel.org, Williams, Dan J, sagis@google.com,
	pbonzini@redhat.com, kvm@vger.kernel.org, dwmw@amazon.co.uk,
	Yamahata, Isaku, nik.borisov@suse.com

On Mon, 2025-07-21 at 09:40 -0500, Tom Lendacky wrote:
> On 7/17/25 16:46, Kai Huang wrote:
> > During kexec, the kernel jumps to the new kernel in relocate_kernel(),
> > which is implemented in assembly and both 32-bit and 64-bit have their
> > own version.
> > 
> > Currently, for both 32-bit and 64-bit, the last two parameters of the
> > relocate_kernel() are both 'unsigned int' but actually they only convey
> > a boolean, i.e., one bit information.  The 'unsigned int' has enough
> > space to carry two bits information therefore there's no need to pass
> > the two booleans in two separate 'unsigned int'.
> > 
> > Consolidate the last two function parameters of relocate_kernel() into a
> > single 'unsigned int' and pass flags instead.
> > 
> > Only consolidate the 64-bit version albeit the similar optimization can
> > be done for the 32-bit version too.  Don't bother changing the 32-bit
> > version while it is working (since assembly code change is required).
> > 
> > Signed-off-by: Kai Huang <kai.huang@intel.com>
> > ---
> >  arch/x86/include/asm/kexec.h         | 12 ++++++++++--
> >  arch/x86/kernel/machine_kexec_64.c   | 22 +++++++++++++---------
> >  arch/x86/kernel/relocate_kernel_64.S | 19 +++++++++----------
> >  3 files changed, 32 insertions(+), 21 deletions(-)
> > 
> > diff --git a/arch/x86/include/asm/kexec.h b/arch/x86/include/asm/kexec.h
> > index f2ad77929d6e..5f09791dc4e9 100644
> > --- a/arch/x86/include/asm/kexec.h
> > +++ b/arch/x86/include/asm/kexec.h
> > @@ -13,6 +13,15 @@
> >  # define KEXEC_DEBUG_EXC_HANDLER_SIZE	6 /* PUSHI, PUSHI, 2-byte JMP */
> >  #endif
> >  
> > +#ifdef CONFIG_X86_64
> > +
> > +#include <linux/bits.h>
> > +
> > +#define RELOC_KERNEL_PRESERVE_CONTEXT	BIT(0)
> > +#define RELOC_KERNEL_HOST_MEM_ACTIVE	BIT(1)
> 
> This isn't as descriptive with "ENC" removed from the name. It's almost
> like you read this and think it should always be 1 because the kernel
> always has host memory active.

Hi Tom,

Thanks for review.

Right.  I'll add "ENC" to the name.

> 
> > +
> > +#endif
> > +
> >  # define KEXEC_CONTROL_PAGE_SIZE	4096
> >  # define KEXEC_CONTROL_CODE_MAX_SIZE	2048
> >  
> > @@ -121,8 +130,7 @@ typedef unsigned long
> >  relocate_kernel_fn(unsigned long indirection_page,
> >  		   unsigned long pa_control_page,
> >  		   unsigned long start_address,
> > -		   unsigned int preserve_context,
> > -		   unsigned int host_mem_enc_active);
> > +		   unsigned int flags);
> >  #endif
> >  extern relocate_kernel_fn relocate_kernel;
> >  #define ARCH_HAS_KIMAGE_ARCH
> > diff --git a/arch/x86/kernel/machine_kexec_64.c b/arch/x86/kernel/machine_kexec_64.c
> > index 697fb99406e6..25cff38f5e60 100644
> > --- a/arch/x86/kernel/machine_kexec_64.c
> > +++ b/arch/x86/kernel/machine_kexec_64.c
> > @@ -384,16 +384,10 @@ void __nocfi machine_kexec(struct kimage *image)
> >  {
> >  	unsigned long reloc_start = (unsigned long)__relocate_kernel_start;
> >  	relocate_kernel_fn *relocate_kernel_ptr;
> > -	unsigned int host_mem_enc_active;
> > +	unsigned int relocate_kernel_flags;
> >  	int save_ftrace_enabled;
> >  	void *control_page;
> >  
> > -	/*
> > -	 * This must be done before load_segments() since if call depth tracking
> > -	 * is used then GS must be valid to make any function calls.
> > -	 */
> > -	host_mem_enc_active = cc_platform_has(CC_ATTR_HOST_MEM_ENCRYPT);
> > -
> >  #ifdef CONFIG_KEXEC_JUMP
> >  	if (image->preserve_context)
> >  		save_processor_state();
> > @@ -427,6 +421,17 @@ void __nocfi machine_kexec(struct kimage *image)
> >  	 */
> >  	relocate_kernel_ptr = control_page + (unsigned long)relocate_kernel - reloc_start;
> >  
> > +	relocate_kernel_flags = 0;
> > +	if (image->preserve_context)
> > +		relocate_kernel_flags |= RELOC_KERNEL_PRESERVE_CONTEXT;
> > +
> > +	/*
> > +	 * This must be done before load_segments() since if call depth tracking
> > +	 * is used then GS must be valid to make any function calls.
> > +	 */
> > +	if (cc_platform_has(CC_ATTR_HOST_MEM_ENCRYPT))
> > +		relocate_kernel_flags |= RELOC_KERNEL_HOST_MEM_ACTIVE;
> > +
> >  	/*
> >  	 * The segment registers are funny things, they have both a
> >  	 * visible and an invisible part.  Whenever the visible part is
> > @@ -443,8 +448,7 @@ void __nocfi machine_kexec(struct kimage *image)
> >  	image->start = relocate_kernel_ptr((unsigned long)image->head,
> >  					   virt_to_phys(control_page),
> >  					   image->start,
> > -					   image->preserve_context,
> > -					   host_mem_enc_active);
> > +					   relocate_kernel_flags);
> >  
> >  #ifdef CONFIG_KEXEC_JUMP
> >  	if (image->preserve_context)
> > diff --git a/arch/x86/kernel/relocate_kernel_64.S b/arch/x86/kernel/relocate_kernel_64.S
> > index ea604f4d0b52..1dfa323b33d5 100644
> > --- a/arch/x86/kernel/relocate_kernel_64.S
> > +++ b/arch/x86/kernel/relocate_kernel_64.S
> > @@ -66,8 +66,7 @@ SYM_CODE_START_NOALIGN(relocate_kernel)
> >  	 * %rdi indirection_page
> >  	 * %rsi pa_control_page
> >  	 * %rdx start address
> > -	 * %rcx preserve_context
> > -	 * %r8  host_mem_enc_active
> > +	 * %rcx flags: RELOC_KERNEL_*
> >  	 */
> >  
> >  	/* Save the CPU context, used for jumping back */
> > @@ -111,7 +110,7 @@ SYM_CODE_START_NOALIGN(relocate_kernel)
> >  	/* save indirection list for jumping back */
> >  	movq	%rdi, pa_backup_pages_map(%rip)
> >  
> > -	/* Save the preserve_context to %r11 as swap_pages clobbers %rcx. */
> > +	/* Save the flags to %r11 as swap_pages clobbers %rcx. */
> >  	movq	%rcx, %r11
> >  
> >  	/* setup a new stack at the end of the physical control page */
> > @@ -129,9 +128,8 @@ SYM_CODE_START_LOCAL_NOALIGN(identity_mapped)
> >  	/*
> >  	 * %rdi	indirection page
> >  	 * %rdx start address
> > -	 * %r8 host_mem_enc_active
> >  	 * %r9 page table page
> > -	 * %r11 preserve_context
> > +	 * %r11 flags: RELOC_KERNEL_*
> >  	 * %r13 original CR4 when relocate_kernel() was invoked
> >  	 */
> >  
> > @@ -204,7 +202,7 @@ SYM_CODE_START_LOCAL_NOALIGN(identity_mapped)
> >  	 * entries that will conflict with the now unencrypted memory
> >  	 * used by kexec. Flush the caches before copying the kernel.
> >  	 */
> > -	testq	%r8, %r8
> > +	testq	$RELOC_KERNEL_HOST_MEM_ACTIVE, %r11
> 
> Hmmm... can't both bits be set at the same time? If so, then this will
> fail. This should be doing bit tests now.

TEST instruction performs logical AND of the two operands, therefore the
above equals to:

	set ZF if "R11 AND BIT(1) == 0".

Whether there's any other bits set in R11 doesn't impact the above, right?
 
> 
> >  	jz .Lsme_off
> >  	wbinvd
> >  .Lsme_off:
> > @@ -220,7 +218,7 @@ SYM_CODE_START_LOCAL_NOALIGN(identity_mapped)
> >  	movq	%cr3, %rax
> >  	movq	%rax, %cr3
> >  
> > -	testq	%r11, %r11	/* preserve_context */
> > +	testq	$RELOC_KERNEL_PRESERVE_CONTEXT, %r11
> >  	jnz .Lrelocate
> >  
> >  	/*
> > @@ -273,7 +271,7 @@ SYM_CODE_START_LOCAL_NOALIGN(identity_mapped)
> >  	ANNOTATE_NOENDBR
> >  	andq	$PAGE_MASK, %r8
> >  	lea	PAGE_SIZE(%r8), %rsp
> > -	movl	$1, %r11d	/* Ensure preserve_context flag is set */
> > +	movl	$RELOC_KERNEL_PRESERVE_CONTEXT, %r11d	/* Ensure preserve_context flag is set */
> 
> And this will clear any value that was in r11 vs setting a single bit.
> Not sure it currently has any effect because r8 (where the memory
> encryption setting was held) is modified just before this. But if any
> bits are added in the future that are needed past here, this will be a
> problem.

Right.  It's just for the

	call swap_pages

right after it.  Nothing else later uses RELOC_KERNEL_PRESERVE_CONTEXT or
RELOC_KERNEL_HOST_MEM_ACTIVE.

Maybe we can add a comment to remind that all other flags are not restored
so if someone wants to add a new bit and use it at a later he/she can see?

	/*
	 * Ensure RELOC_KERNEL_PRESERVE_CONTEXT flag is set so swap_pages
	 * can do things correctly.  Note this doesn't restore any other 
	 * RELOC_KERNEL_* flags that were passed to relocate_kernel().
	 */
> 
> >  	call	swap_pages
> >  	movq	kexec_va_control_page(%rip), %rax
> >  0:	addq	$virtual_mapped - 0b, %rax
> > @@ -321,7 +319,7 @@ SYM_CODE_START_LOCAL_NOALIGN(swap_pages)
> >  	UNWIND_HINT_END_OF_STACK
> >  	/*
> >  	 * %rdi indirection page
> > -	 * %r11 preserve_context
> > +	 * %r11 flags: RELOC_KERNEL_*
> >  	 */
> >  	movq	%rdi, %rcx	/* Put the indirection_page in %rcx */
> >  	xorl	%edi, %edi
> > @@ -357,7 +355,8 @@ SYM_CODE_START_LOCAL_NOALIGN(swap_pages)
> >  	movq	%rdi, %rdx    /* Save destination page to %rdx */
> >  	movq	%rsi, %rax    /* Save source page to %rax */
> >  
> > -	testq	%r11, %r11    /* Only actually swap for ::preserve_context */
> > +	/* Only actually swap for ::preserve_context */
> > +	testq	$RELOC_KERNEL_PRESERVE_CONTEXT, %r11
> 
> Ditto here on the bit testing.

I don't see any problem?  Please see above.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v4 1/7] x86/kexec: Consolidate relocate_kernel() function parameters
  2025-07-21 21:20     ` Huang, Kai
@ 2025-07-21 21:27       ` Tom Lendacky
  2025-07-21 21:36         ` Huang, Kai
  0 siblings, 1 reply; 28+ messages in thread
From: Tom Lendacky @ 2025-07-21 21:27 UTC (permalink / raw)
  To: Huang, Kai, tglx@linutronix.de, peterz@infradead.org,
	Hansen, Dave, mingo@redhat.com, bp@alien8.de, hpa@zytor.com
  Cc: ashish.kalra@amd.com, Edgecombe, Rick P, seanjc@google.com,
	x86@kernel.org, kas@kernel.org, Gao, Chao, Chatre, Reinette,
	linux-kernel@vger.kernel.org, Williams, Dan J, sagis@google.com,
	pbonzini@redhat.com, kvm@vger.kernel.org, dwmw@amazon.co.uk,
	Yamahata, Isaku, nik.borisov@suse.com

On 7/21/25 16:20, Huang, Kai wrote:
> On Mon, 2025-07-21 at 09:40 -0500, Tom Lendacky wrote:
>> On 7/17/25 16:46, Kai Huang wrote:
>>> During kexec, the kernel jumps to the new kernel in relocate_kernel(),
>>> which is implemented in assembly and both 32-bit and 64-bit have their
>>> own version.
>>>
>>> Currently, for both 32-bit and 64-bit, the last two parameters of the
>>> relocate_kernel() are both 'unsigned int' but actually they only convey
>>> a boolean, i.e., one bit information.  The 'unsigned int' has enough
>>> space to carry two bits information therefore there's no need to pass
>>> the two booleans in two separate 'unsigned int'.
>>>
>>> Consolidate the last two function parameters of relocate_kernel() into a
>>> single 'unsigned int' and pass flags instead.
>>>
>>> Only consolidate the 64-bit version albeit the similar optimization can
>>> be done for the 32-bit version too.  Don't bother changing the 32-bit
>>> version while it is working (since assembly code change is required).
>>>
>>> Signed-off-by: Kai Huang <kai.huang@intel.com>
>>> ---
>>>  arch/x86/include/asm/kexec.h         | 12 ++++++++++--
>>>  arch/x86/kernel/machine_kexec_64.c   | 22 +++++++++++++---------
>>>  arch/x86/kernel/relocate_kernel_64.S | 19 +++++++++----------
>>>  3 files changed, 32 insertions(+), 21 deletions(-)
>>>

>>> @@ -204,7 +202,7 @@ SYM_CODE_START_LOCAL_NOALIGN(identity_mapped)
>>>  	 * entries that will conflict with the now unencrypted memory
>>>  	 * used by kexec. Flush the caches before copying the kernel.
>>>  	 */
>>> -	testq	%r8, %r8
>>> +	testq	$RELOC_KERNEL_HOST_MEM_ACTIVE, %r11
>>
>> Hmmm... can't both bits be set at the same time? If so, then this will
>> fail. This should be doing bit tests now.
> 
> TEST instruction performs logical AND of the two operands, therefore the
> above equals to:
> 
> 	set ZF if "R11 AND BIT(1) == 0".
> 
> Whether there's any other bits set in R11 doesn't impact the above, right?
>  

Doh! My bad, yes, not sure what I was thinking there.

Thanks,
Tom

>>
>>>  	jz .Lsme_off
>>>  	wbinvd
>>>  .Lsme_off:
>>> @@ -220,7 +218,7 @@ SYM_CODE_START_LOCAL_NOALIGN(identity_mapped)
>>>  	movq	%cr3, %rax
>>>  	movq	%rax, %cr3
>>>  
>>> -	testq	%r11, %r11	/* preserve_context */
>>> +	testq	$RELOC_KERNEL_PRESERVE_CONTEXT, %r11
>>>  	jnz .Lrelocate
>>>  
>>>  	/*
>>> @@ -273,7 +271,7 @@ SYM_CODE_START_LOCAL_NOALIGN(identity_mapped)
>>>  	ANNOTATE_NOENDBR
>>>  	andq	$PAGE_MASK, %r8
>>>  	lea	PAGE_SIZE(%r8), %rsp
>>> -	movl	$1, %r11d	/* Ensure preserve_context flag is set */
>>> +	movl	$RELOC_KERNEL_PRESERVE_CONTEXT, %r11d	/* Ensure preserve_context flag is set */
>>
>> And this will clear any value that was in r11 vs setting a single bit.
>> Not sure it currently has any effect because r8 (where the memory
>> encryption setting was held) is modified just before this. But if any
>> bits are added in the future that are needed past here, this will be a
>> problem.
> 
> Right.  It's just for the
> 
> 	call swap_pages
> 
> right after it.  Nothing else later uses RELOC_KERNEL_PRESERVE_CONTEXT or
> RELOC_KERNEL_HOST_MEM_ACTIVE.
> 
> Maybe we can add a comment to remind that all other flags are not restored
> so if someone wants to add a new bit and use it at a later he/she can see?
> 
> 	/*
> 	 * Ensure RELOC_KERNEL_PRESERVE_CONTEXT flag is set so swap_pages
> 	 * can do things correctly.  Note this doesn't restore any other 
> 	 * RELOC_KERNEL_* flags that were passed to relocate_kernel().
> 	 */
>>
>>>  	call	swap_pages
>>>  	movq	kexec_va_control_page(%rip), %rax
>>>  0:	addq	$virtual_mapped - 0b, %rax
>>> @@ -321,7 +319,7 @@ SYM_CODE_START_LOCAL_NOALIGN(swap_pages)
>>>  	UNWIND_HINT_END_OF_STACK
>>>  	/*
>>>  	 * %rdi indirection page
>>> -	 * %r11 preserve_context
>>> +	 * %r11 flags: RELOC_KERNEL_*
>>>  	 */
>>>  	movq	%rdi, %rcx	/* Put the indirection_page in %rcx */
>>>  	xorl	%edi, %edi
>>> @@ -357,7 +355,8 @@ SYM_CODE_START_LOCAL_NOALIGN(swap_pages)
>>>  	movq	%rdi, %rdx    /* Save destination page to %rdx */
>>>  	movq	%rsi, %rax    /* Save source page to %rax */
>>>  
>>> -	testq	%r11, %r11    /* Only actually swap for ::preserve_context */
>>> +	/* Only actually swap for ::preserve_context */
>>> +	testq	$RELOC_KERNEL_PRESERVE_CONTEXT, %r11
>>
>> Ditto here on the bit testing.
> 
> I don't see any problem?  Please see above.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v4 0/7] TDX host: kexec/kdump support
  2025-07-21 14:50   ` Tom Lendacky
@ 2025-07-21 21:30     ` Huang, Kai
  2025-07-21 21:30     ` Huang, Kai
  1 sibling, 0 replies; 28+ messages in thread
From: Huang, Kai @ 2025-07-21 21:30 UTC (permalink / raw)
  To: thomas.lendacky@amd.com, tglx@linutronix.de, peterz@infradead.org,
	Hansen, Dave, mingo@redhat.com, bp@alien8.de, hpa@zytor.com
  Cc: ashish.kalra@amd.com, Edgecombe, Rick P, seanjc@google.com,
	x86@kernel.org, kas@kernel.org, Gao, Chao, Chatre, Reinette,
	linux-kernel@vger.kernel.org, Williams, Dan J, sagis@google.com,
	pbonzini@redhat.com, kvm@vger.kernel.org, dwmw@amazon.co.uk,
	Yamahata, Isaku, nik.borisov@suse.com

On Mon, 2025-07-21 at 09:50 -0500, Tom Lendacky wrote:
> On 7/21/25 08:08, Tom Lendacky wrote:
> > On 7/17/25 16:46, Kai Huang wrote:
> > > This series is the latest attempt to support kexec on TDX host following
> > > Dave's suggestion to use a percpu boolean to control WBINVD during
> > > kexec.
> > > 
> > > Hi Boris/Tom,
> > > 
> > > As requested, I added the first patch to cleanup the last two 'unsigned
> > > int' parameters of the relocate_kernel() into one 'unsigned int' and pass
> > > flags instead.  The patch 2 (patch 1 in v3) also gets updated based on
> > > that.  Would you help to review?  Thanks.
> > > 
> > > I tested that both normal kexec and preserve_context kexec works (using
> > > the tools/testing/selftests/kexec/test_kexec_jump.sh).  But I don't have
> > > SME capable machine to test.
> > > 
> > > Hi Tom, I added your Reviewed-by and Tested-by in the patch 2 anyway
> > > since I believe the change is trivial and straightforward).  But due to
> > > the cleanup patch, I appreciate if you can help to test the first two
> > > patches again.  Thanks a lot!
> > 
> > Everything is working, Thanks!
> 
> See my comments in patch #1. I didn't test with context preservation, so
> that bit was never set. If it was, I think things would have failed.

I actually tested the test_kexec_jump.sh in kselftest as mentioned above
in a VM, as mentioned above.  I got "# kexec_jump succeeded [PASS]" so I
think it worked :-)  But unfortunately I don't know how to test
preserve_context kexec in any other way.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v4 0/7] TDX host: kexec/kdump support
  2025-07-21 14:50   ` Tom Lendacky
  2025-07-21 21:30     ` Huang, Kai
@ 2025-07-21 21:30     ` Huang, Kai
  1 sibling, 0 replies; 28+ messages in thread
From: Huang, Kai @ 2025-07-21 21:30 UTC (permalink / raw)
  To: thomas.lendacky@amd.com, tglx@linutronix.de, peterz@infradead.org,
	Hansen, Dave, mingo@redhat.com, bp@alien8.de, hpa@zytor.com
  Cc: ashish.kalra@amd.com, Edgecombe, Rick P, seanjc@google.com,
	x86@kernel.org, kas@kernel.org, Gao, Chao, Chatre, Reinette,
	linux-kernel@vger.kernel.org, Williams, Dan J, sagis@google.com,
	pbonzini@redhat.com, kvm@vger.kernel.org, dwmw@amazon.co.uk,
	Yamahata, Isaku, nik.borisov@suse.com

On Mon, 2025-07-21 at 09:50 -0500, Tom Lendacky wrote:
> On 7/21/25 08:08, Tom Lendacky wrote:
> > On 7/17/25 16:46, Kai Huang wrote:
> > > This series is the latest attempt to support kexec on TDX host following
> > > Dave's suggestion to use a percpu boolean to control WBINVD during
> > > kexec.
> > > 
> > > Hi Boris/Tom,
> > > 
> > > As requested, I added the first patch to cleanup the last two 'unsigned
> > > int' parameters of the relocate_kernel() into one 'unsigned int' and pass
> > > flags instead.  The patch 2 (patch 1 in v3) also gets updated based on
> > > that.  Would you help to review?  Thanks.
> > > 
> > > I tested that both normal kexec and preserve_context kexec works (using
> > > the tools/testing/selftests/kexec/test_kexec_jump.sh).  But I don't have
> > > SME capable machine to test.
> > > 
> > > Hi Tom, I added your Reviewed-by and Tested-by in the patch 2 anyway
> > > since I believe the change is trivial and straightforward).  But due to
> > > the cleanup patch, I appreciate if you can help to test the first two
> > > patches again.  Thanks a lot!
> > 
> > Everything is working, Thanks!
> 
> See my comments in patch #1. I didn't test with context preservation, so
> that bit was never set. If it was, I think things would have failed.

I actually tested the test_kexec_jump.sh in kselftest as mentioned above
in a VM.  I got "# kexec_jump succeeded [PASS]" so I think it worked :-) 
But unfortunately I don't know how to test preserve_context kexec in any
other way.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v4 1/7] x86/kexec: Consolidate relocate_kernel() function parameters
  2025-07-21 21:27       ` Tom Lendacky
@ 2025-07-21 21:36         ` Huang, Kai
  2025-07-21 23:05           ` H. Peter Anvin
  2025-07-22 14:58           ` Borislav Petkov
  0 siblings, 2 replies; 28+ messages in thread
From: Huang, Kai @ 2025-07-21 21:36 UTC (permalink / raw)
  To: thomas.lendacky@amd.com, tglx@linutronix.de, peterz@infradead.org,
	Hansen, Dave, mingo@redhat.com, bp@alien8.de, hpa@zytor.com
  Cc: Gao, Chao, Edgecombe, Rick P, seanjc@google.com, x86@kernel.org,
	kas@kernel.org, sagis@google.com, Chatre, Reinette,
	linux-kernel@vger.kernel.org, Williams, Dan J,
	nik.borisov@suse.com, ashish.kalra@amd.com, pbonzini@redhat.com,
	dwmw@amazon.co.uk, kvm@vger.kernel.org, Yamahata, Isaku

On Mon, 2025-07-21 at 16:27 -0500, Tom Lendacky wrote:
> > > > @@ -204,7 +202,7 @@ SYM_CODE_START_LOCAL_NOALIGN(identity_mapped)
> > > >   	 * entries that will conflict with the now unencrypted memory
> > > >   	 * used by kexec. Flush the caches before copying the kernel.
> > > >   	 */
> > > > -	testq	%r8, %r8
> > > > +	testq	$RELOC_KERNEL_HOST_MEM_ACTIVE, %r11
> > > 
> > > Hmmm... can't both bits be set at the same time? If so, then this will
> > > fail. This should be doing bit tests now.
> > 
> > TEST instruction performs logical AND of the two operands, therefore the
> > above equals to:
> > 
> >  	set ZF if "R11 AND BIT(1) == 0".
> > 
> > Whether there's any other bits set in R11 doesn't impact the above, right?
> >   
> 
> Doh! My bad, yes, not sure what I was thinking there.
> 

Np and thanks! I'll address your other comments but I'll see whether Boris
has any other comments first.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v4 1/7] x86/kexec: Consolidate relocate_kernel() function parameters
  2025-07-21 21:36         ` Huang, Kai
@ 2025-07-21 23:05           ` H. Peter Anvin
  2025-07-21 23:29             ` Huang, Kai
  2025-07-22 14:58           ` Borislav Petkov
  1 sibling, 1 reply; 28+ messages in thread
From: H. Peter Anvin @ 2025-07-21 23:05 UTC (permalink / raw)
  To: Huang, Kai, thomas.lendacky@amd.com, tglx@linutronix.de,
	peterz@infradead.org, Hansen, Dave, mingo@redhat.com,
	bp@alien8.de
  Cc: Gao, Chao, Edgecombe, Rick P, seanjc@google.com, x86@kernel.org,
	kas@kernel.org, sagis@google.com, Chatre, Reinette,
	linux-kernel@vger.kernel.org, Williams, Dan J,
	nik.borisov@suse.com, ashish.kalra@amd.com, pbonzini@redhat.com,
	dwmw@amazon.co.uk, kvm@vger.kernel.org, Yamahata, Isaku

On July 21, 2025 2:36:48 PM PDT, "Huang, Kai" <kai.huang@intel.com> wrote:
>On Mon, 2025-07-21 at 16:27 -0500, Tom Lendacky wrote:
>> > > > @@ -204,7 +202,7 @@ SYM_CODE_START_LOCAL_NOALIGN(identity_mapped)
>> > > >   	 * entries that will conflict with the now unencrypted memory
>> > > >   	 * used by kexec. Flush the caches before copying the kernel.
>> > > >   	 */
>> > > > -	testq	%r8, %r8
>> > > > +	testq	$RELOC_KERNEL_HOST_MEM_ACTIVE, %r11
>> > > 
>> > > Hmmm... can't both bits be set at the same time? If so, then this will
>> > > fail. This should be doing bit tests now.
>> > 
>> > TEST instruction performs logical AND of the two operands, therefore the
>> > above equals to:
>> > 
>> >  	set ZF if "R11 AND BIT(1) == 0".
>> > 
>> > Whether there's any other bits set in R11 doesn't impact the above, right?
>> >   
>> 
>> Doh! My bad, yes, not sure what I was thinking there.
>> 
>
>Np and thanks! I'll address your other comments but I'll see whether Boris
>has any other comments first.
>

You can use testb in this case to save 3 bytes, too.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* RE: [PATCH v4 1/7] x86/kexec: Consolidate relocate_kernel() function parameters
  2025-07-21 23:05           ` H. Peter Anvin
@ 2025-07-21 23:29             ` Huang, Kai
  2025-07-21 23:42               ` Huang, Kai
  0 siblings, 1 reply; 28+ messages in thread
From: Huang, Kai @ 2025-07-21 23:29 UTC (permalink / raw)
  To: H. Peter Anvin, thomas.lendacky@amd.com, tglx@linutronix.de,
	peterz@infradead.org, Hansen, Dave, mingo@redhat.com,
	bp@alien8.de
  Cc: Gao, Chao, Edgecombe, Rick P, seanjc@google.com, x86@kernel.org,
	kas@kernel.org, sagis@google.com, Chatre, Reinette,
	linux-kernel@vger.kernel.org, Williams, Dan J,
	nik.borisov@suse.com, ashish.kalra@amd.com, pbonzini@redhat.com,
	dwmw@amazon.co.uk, kvm@vger.kernel.org, Yamahata, Isaku

> On July 21, 2025 2:36:48 PM PDT, "Huang, Kai" <kai.huang@intel.com> wrote:
> >On Mon, 2025-07-21 at 16:27 -0500, Tom Lendacky wrote:
> >> > > > @@ -204,7 +202,7 @@
> SYM_CODE_START_LOCAL_NOALIGN(identity_mapped)
> >> > > >   	 * entries that will conflict with the now unencrypted memory
> >> > > >   	 * used by kexec. Flush the caches before copying the kernel.
> >> > > >   	 */
> >> > > > -	testq	%r8, %r8
> >> > > > +	testq	$RELOC_KERNEL_HOST_MEM_ACTIVE, %r11
> >> > >
> >> > > Hmmm... can't both bits be set at the same time? If so, then this
> >> > > will fail. This should be doing bit tests now.
> >> >
> >> > TEST instruction performs logical AND of the two operands,
> >> > therefore the above equals to:
> >> >
> >> >  	set ZF if "R11 AND BIT(1) == 0".
> >> >
> >> > Whether there's any other bits set in R11 doesn't impact the above, right?
> >> >
> >>
> >> Doh! My bad, yes, not sure what I was thinking there.
> >>
> >
> >Np and thanks! I'll address your other comments but I'll see whether
> >Boris has any other comments first.
> >
> 
> You can use testb in this case to save 3 bytes, too.

Yeah I can do that, thanks for the info!

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v4 1/7] x86/kexec: Consolidate relocate_kernel() function parameters
  2025-07-21 23:29             ` Huang, Kai
@ 2025-07-21 23:42               ` Huang, Kai
  2025-07-22  0:44                 ` H. Peter Anvin
  0 siblings, 1 reply; 28+ messages in thread
From: Huang, Kai @ 2025-07-21 23:42 UTC (permalink / raw)
  To: hpa@zytor.com, thomas.lendacky@amd.com, peterz@infradead.org,
	tglx@linutronix.de, Hansen, Dave, mingo@redhat.com, bp@alien8.de
  Cc: Gao, Chao, Edgecombe, Rick P, seanjc@google.com, x86@kernel.org,
	kas@kernel.org, sagis@google.com, Chatre, Reinette,
	linux-kernel@vger.kernel.org, Williams, Dan J,
	kvm@vger.kernel.org, ashish.kalra@amd.com, pbonzini@redhat.com,
	nik.borisov@suse.com, dwmw@amazon.co.uk, Yamahata, Isaku

On Mon, 2025-07-21 at 23:29 +0000, Huang, Kai wrote:
> > On July 21, 2025 2:36:48 PM PDT, "Huang, Kai" <kai.huang@intel.com> wrote:
> > > On Mon, 2025-07-21 at 16:27 -0500, Tom Lendacky wrote:
> > > > > > > @@ -204,7 +202,7 @@
> > SYM_CODE_START_LOCAL_NOALIGN(identity_mapped)
> > > > > > >   	 * entries that will conflict with the now unencrypted memory
> > > > > > >   	 * used by kexec. Flush the caches before copying the kernel.
> > > > > > >   	 */
> > > > > > > -	testq	%r8, %r8
> > > > > > > +	testq	$RELOC_KERNEL_HOST_MEM_ACTIVE, %r11
> > > > > > 
> > > > > > Hmmm... can't both bits be set at the same time? If so, then this
> > > > > > will fail. This should be doing bit tests now.
> > > > > 
> > > > > TEST instruction performs logical AND of the two operands,
> > > > > therefore the above equals to:
> > > > > 
> > > > >  	set ZF if "R11 AND BIT(1) == 0".
> > > > > 
> > > > > Whether there's any other bits set in R11 doesn't impact the above, right?
> > > > > 
> > > > 
> > > > Doh! My bad, yes, not sure what I was thinking there.
> > > > 
> > > 
> > > Np and thanks! I'll address your other comments but I'll see whether
> > > Boris has any other comments first.
> > > 
> > 
> > You can use testb in this case to save 3 bytes, too.
> 
> Yeah I can do that, thanks for the info!

I just tried.  I need to do:

	testb	$RELOC_KERNEL_HOST_MEM_ACTIVE, %r11b

in order to compile, otherwise using plain %r11 generates:

arch/x86/kernel/relocate_kernel_64.S:212: Error: `%r11' not allowed with
`testb'

I'll do some test and if there's no problem I'll switch to use this way,
assuming it still saves us 3-bytes.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v4 1/7] x86/kexec: Consolidate relocate_kernel() function parameters
  2025-07-21 23:42               ` Huang, Kai
@ 2025-07-22  0:44                 ` H. Peter Anvin
  2025-07-22  0:53                   ` Huang, Kai
  0 siblings, 1 reply; 28+ messages in thread
From: H. Peter Anvin @ 2025-07-22  0:44 UTC (permalink / raw)
  To: Huang, Kai, thomas.lendacky@amd.com, peterz@infradead.org,
	tglx@linutronix.de, Hansen, Dave, mingo@redhat.com, bp@alien8.de
  Cc: Gao, Chao, Edgecombe, Rick P, seanjc@google.com, x86@kernel.org,
	kas@kernel.org, sagis@google.com, Chatre, Reinette,
	linux-kernel@vger.kernel.org, Williams, Dan J,
	kvm@vger.kernel.org, ashish.kalra@amd.com, pbonzini@redhat.com,
	nik.borisov@suse.com, dwmw@amazon.co.uk, Yamahata, Isaku

On July 21, 2025 4:42:22 PM PDT, "Huang, Kai" <kai.huang@intel.com> wrote:
>On Mon, 2025-07-21 at 23:29 +0000, Huang, Kai wrote:
>> > On July 21, 2025 2:36:48 PM PDT, "Huang, Kai" <kai.huang@intel.com> wrote:
>> > > On Mon, 2025-07-21 at 16:27 -0500, Tom Lendacky wrote:
>> > > > > > > @@ -204,7 +202,7 @@
>> > SYM_CODE_START_LOCAL_NOALIGN(identity_mapped)
>> > > > > > >   	 * entries that will conflict with the now unencrypted memory
>> > > > > > >   	 * used by kexec. Flush the caches before copying the kernel.
>> > > > > > >   	 */
>> > > > > > > -	testq	%r8, %r8
>> > > > > > > +	testq	$RELOC_KERNEL_HOST_MEM_ACTIVE, %r11
>> > > > > > 
>> > > > > > Hmmm... can't both bits be set at the same time? If so, then this
>> > > > > > will fail. This should be doing bit tests now.
>> > > > > 
>> > > > > TEST instruction performs logical AND of the two operands,
>> > > > > therefore the above equals to:
>> > > > > 
>> > > > >  	set ZF if "R11 AND BIT(1) == 0".
>> > > > > 
>> > > > > Whether there's any other bits set in R11 doesn't impact the above, right?
>> > > > > 
>> > > > 
>> > > > Doh! My bad, yes, not sure what I was thinking there.
>> > > > 
>> > > 
>> > > Np and thanks! I'll address your other comments but I'll see whether
>> > > Boris has any other comments first.
>> > > 
>> > 
>> > You can use testb in this case to save 3 bytes, too.
>> 
>> Yeah I can do that, thanks for the info!
>
>I just tried.  I need to do:
>
>	testb	$RELOC_KERNEL_HOST_MEM_ACTIVE, %r11b
>
>in order to compile, otherwise using plain %r11 generates:
>
>arch/x86/kernel/relocate_kernel_64.S:212: Error: `%r11' not allowed with
>`testb'
>
>I'll do some test and if there's no problem I'll switch to use this way,
>assuming it still saves us 3-bytes.
>

That works just fine.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v4 1/7] x86/kexec: Consolidate relocate_kernel() function parameters
  2025-07-22  0:44                 ` H. Peter Anvin
@ 2025-07-22  0:53                   ` Huang, Kai
  2025-07-23  8:13                     ` Huang, Kai
  0 siblings, 1 reply; 28+ messages in thread
From: Huang, Kai @ 2025-07-22  0:53 UTC (permalink / raw)
  To: hpa@zytor.com, thomas.lendacky@amd.com, peterz@infradead.org,
	tglx@linutronix.de, Hansen, Dave, mingo@redhat.com, bp@alien8.de
  Cc: Gao, Chao, Edgecombe, Rick P, seanjc@google.com, x86@kernel.org,
	kas@kernel.org, sagis@google.com, Chatre, Reinette,
	linux-kernel@vger.kernel.org, Williams, Dan J, Yamahata, Isaku,
	kvm@vger.kernel.org, ashish.kalra@amd.com, nik.borisov@suse.com,
	pbonzini@redhat.com, dwmw@amazon.co.uk

On Mon, 2025-07-21 at 17:44 -0700, H. Peter Anvin wrote:
> On July 21, 2025 4:42:22 PM PDT, "Huang, Kai" <kai.huang@intel.com> wrote:
> > On Mon, 2025-07-21 at 23:29 +0000, Huang, Kai wrote:
> > > > On July 21, 2025 2:36:48 PM PDT, "Huang, Kai" <kai.huang@intel.com> wrote:
> > > > > On Mon, 2025-07-21 at 16:27 -0500, Tom Lendacky wrote:
> > > > > > > > > @@ -204,7 +202,7 @@
> > > > SYM_CODE_START_LOCAL_NOALIGN(identity_mapped)
> > > > > > > > >   	 * entries that will conflict with the now unencrypted memory
> > > > > > > > >   	 * used by kexec. Flush the caches before copying the kernel.
> > > > > > > > >   	 */
> > > > > > > > > -	testq	%r8, %r8
> > > > > > > > > +	testq	$RELOC_KERNEL_HOST_MEM_ACTIVE, %r11
> > > > > > > > 
> > > > > > > > Hmmm... can't both bits be set at the same time? If so, then this
> > > > > > > > will fail. This should be doing bit tests now.
> > > > > > > 
> > > > > > > TEST instruction performs logical AND of the two operands,
> > > > > > > therefore the above equals to:
> > > > > > > 
> > > > > > >  	set ZF if "R11 AND BIT(1) == 0".
> > > > > > > 
> > > > > > > Whether there's any other bits set in R11 doesn't impact the above, right?
> > > > > > > 
> > > > > > 
> > > > > > Doh! My bad, yes, not sure what I was thinking there.
> > > > > > 
> > > > > 
> > > > > Np and thanks! I'll address your other comments but I'll see whether
> > > > > Boris has any other comments first.
> > > > > 
> > > > 
> > > > You can use testb in this case to save 3 bytes, too.
> > > 
> > > Yeah I can do that, thanks for the info!
> > 
> > I just tried.  I need to do:
> > 
> > 	testb	$RELOC_KERNEL_HOST_MEM_ACTIVE, %r11b
> > 
> > in order to compile, otherwise using plain %r11 generates:
> > 
> > arch/x86/kernel/relocate_kernel_64.S:212: Error: `%r11' not allowed with
> > `testb'
> > 
> > I'll do some test and if there's no problem I'll switch to use this way,
> > assuming it still saves us 3-bytes.
> > 
> 
> That works just fine.

Yeah I have now tested. Thanks :-)

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v4 0/7] TDX host: kexec/kdump support
  2025-07-17 21:46 [PATCH v4 0/7] TDX host: kexec/kdump support Kai Huang
                   ` (7 preceding siblings ...)
  2025-07-21 13:08 ` [PATCH v4 0/7] TDX host: kexec/kdump support Tom Lendacky
@ 2025-07-22 13:50 ` Paolo Bonzini
  8 siblings, 0 replies; 28+ messages in thread
From: Paolo Bonzini @ 2025-07-22 13:50 UTC (permalink / raw)
  To: Kai Huang, dave.hansen, bp, tglx, peterz, mingo, hpa,
	thomas.lendacky
  Cc: x86, kas, rick.p.edgecombe, dwmw, linux-kernel, seanjc, kvm,
	reinette.chatre, isaku.yamahata, dan.j.williams, ashish.kalra,
	nik.borisov, chao.gao, sagis

On 7/17/25 23:46, Kai Huang wrote:
> This series is the latest attempt to support kexec on TDX host following
> Dave's suggestion to use a percpu boolean to control WBINVD during
> kexec.
> 
> Hi Boris/Tom,
> 
> As requested, I added the first patch to cleanup the last two 'unsigned
> int' parameters of the relocate_kernel() into one 'unsigned int' and pass
> flags instead.  The patch 2 (patch 1 in v3) also gets updated based on
> that.  Would you help to review?  Thanks.
> 
> I tested that both normal kexec and preserve_context kexec works (using
> the tools/testing/selftests/kexec/test_kexec_jump.sh).  But I don't have
> SME capable machine to test.
> 
> Hi Tom, I added your Reviewed-by and Tested-by in the patch 2 anyway
> since I believe the change is trivial and straightforward).  But due to
> the cleanup patch, I appreciate if you can help to test the first two
> patches again.  Thanks a lot!

I guess we're just waiting for a v5 with testb and an augmented comment 
in patch 1?

(Frankly, I am not sure I see any improvement with respect to v3.  When 
assembly is involved, what looks like a cleanup can make register usage 
more messy.  But since the code is there, I'm not going to ask Kai or 
anyone else to do anything).

Paolo


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v4 3/7] x86/virt/tdx: Mark memory cache state incoherent when making SEAMCALL
  2025-07-17 21:46 ` [PATCH v4 3/7] x86/virt/tdx: Mark memory cache state incoherent when making SEAMCALL Kai Huang
@ 2025-07-22 14:52   ` Chao Gao
  2025-07-23  1:59     ` Huang, Kai
  0 siblings, 1 reply; 28+ messages in thread
From: Chao Gao @ 2025-07-22 14:52 UTC (permalink / raw)
  To: Kai Huang
  Cc: dave.hansen, bp, tglx, peterz, mingo, hpa, thomas.lendacky, x86,
	kas, rick.p.edgecombe, dwmw, linux-kernel, pbonzini, seanjc, kvm,
	reinette.chatre, isaku.yamahata, dan.j.williams, ashish.kalra,
	nik.borisov, sagis, Farrah Chen

>+static __always_inline u64 do_seamcall(sc_func_t func, u64 fn,
>+				       struct tdx_module_args *args)
>+{
>+	u64 ret;
>+
>+	lockdep_assert_preemption_disabled();
>+
>+	/*
>+	 * SEAMCALLs are made to the TDX module and can generate dirty
>+	 * cachelines of TDX private memory.  Mark cache state incoherent
>+	 * so that the cache can be flushed during kexec.
>+	 *
>+	 * This needs to be done before actually making the SEAMCALL,
>+	 * because kexec-ing CPU could send NMI to stop remote CPUs,
>+	 * in which case even disabling IRQ won't help here.
>+	 */
>+	this_cpu_write(cache_state_incoherent, true);
>+
>+	ret = func(fn, args);
>+
>+	return ret;

@ret can be dropped here. Just

	return func(fn, args);

should work.

And tracking cache incoherent state at the per-CPU level seems to add
unnecessary complexity. It requires a new do_seamcall() wrapper, setting the
flag on every seamcall rather than just the first one (I'm not concerned about
performance; it just feels silly), and using preempt_disable()/enable(). In my
view, per-CPU tracking at most saves a WBINVD on a CPU that never runs
SEAMCALLs during KEXEC, which is quite marginal. Did I miss any other benefits?

>+}
>+
> static __always_inline u64 sc_retry(sc_func_t func, u64 fn,
> 			   struct tdx_module_args *args)
> {
>@@ -113,7 +138,9 @@ static __always_inline u64 sc_retry(sc_func_t func, u64 fn,
> 	u64 ret;
> 
> 	do {
>-		ret = func(fn, args);
>+		preempt_disable();
>+		ret = do_seamcall(func, fn, args);
>+		preempt_enable();
> 	} while (ret == TDX_RND_NO_ENTROPY && --retry);
> 
> 	return ret;
>diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
>index c7a9a087ccaf..d6ee4e5a75d2 100644
>--- a/arch/x86/virt/vmx/tdx/tdx.c
>+++ b/arch/x86/virt/vmx/tdx/tdx.c
>@@ -1266,7 +1266,7 @@ static bool paddr_is_tdx_private(unsigned long phys)
> 		return false;
> 
> 	/* Get page type from the TDX module */
>-	sret = __seamcall_ret(TDH_PHYMEM_PAGE_RDMD, &args);
>+	sret = do_seamcall(__seamcall_ret, TDH_PHYMEM_PAGE_RDMD, &args);
> 
> 	/*
> 	 * The SEAMCALL will not return success unless there is a
>@@ -1522,7 +1522,7 @@ noinstr __flatten u64 tdh_vp_enter(struct tdx_vp *td, struct tdx_module_args *ar
> {
> 	args->rcx = tdx_tdvpr_pa(td);
> 
>-	return __seamcall_saved_ret(TDH_VP_ENTER, args);
>+	return do_seamcall(__seamcall_saved_ret, TDH_VP_ENTER, args);
> }
> EXPORT_SYMBOL_GPL(tdh_vp_enter);
> 
>-- 
>2.50.0
>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v4 1/7] x86/kexec: Consolidate relocate_kernel() function parameters
  2025-07-21 21:36         ` Huang, Kai
  2025-07-21 23:05           ` H. Peter Anvin
@ 2025-07-22 14:58           ` Borislav Petkov
  2025-07-23  2:13             ` Huang, Kai
  1 sibling, 1 reply; 28+ messages in thread
From: Borislav Petkov @ 2025-07-22 14:58 UTC (permalink / raw)
  To: Huang, Kai
  Cc: thomas.lendacky@amd.com, tglx@linutronix.de, peterz@infradead.org,
	Hansen, Dave, mingo@redhat.com, hpa@zytor.com, Gao, Chao,
	Edgecombe, Rick P, seanjc@google.com, x86@kernel.org,
	kas@kernel.org, sagis@google.com, Chatre, Reinette,
	linux-kernel@vger.kernel.org, Williams, Dan J,
	nik.borisov@suse.com, ashish.kalra@amd.com, pbonzini@redhat.com,
	dwmw@amazon.co.uk, kvm@vger.kernel.org, Yamahata, Isaku

On Mon, Jul 21, 2025 at 09:36:48PM +0000, Huang, Kai wrote:
> Np and thanks! I'll address your other comments but I'll see whether Boris
> has any other comments first.

Nah, he hasn't. This looks exactly like what I had in mind so thanks for
doing it.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v4 3/7] x86/virt/tdx: Mark memory cache state incoherent when making SEAMCALL
  2025-07-22 14:52   ` Chao Gao
@ 2025-07-23  1:59     ` Huang, Kai
  2025-07-23  6:47       ` Chao Gao
  0 siblings, 1 reply; 28+ messages in thread
From: Huang, Kai @ 2025-07-23  1:59 UTC (permalink / raw)
  To: Gao, Chao
  Cc: kvm@vger.kernel.org, ashish.kalra@amd.com, Hansen, Dave,
	thomas.lendacky@amd.com, linux-kernel@vger.kernel.org,
	mingo@redhat.com, dwmw@amazon.co.uk, pbonzini@redhat.com,
	seanjc@google.com, tglx@linutronix.de, kas@kernel.org,
	Chatre, Reinette, Yamahata, Isaku, nik.borisov@suse.com,
	hpa@zytor.com, peterz@infradead.org, sagis@google.com,
	Chen, Farrah, Edgecombe, Rick P, bp@alien8.de, x86@kernel.org,
	Williams, Dan J

On Tue, 2025-07-22 at 22:52 +0800, Gao, Chao wrote:
> > +static __always_inline u64 do_seamcall(sc_func_t func, u64 fn,
> > +				       struct tdx_module_args *args)
> > +{
> > +	u64 ret;
> > +
> > +	lockdep_assert_preemption_disabled();
> > +
> > +	/*
> > +	 * SEAMCALLs are made to the TDX module and can generate dirty
> > +	 * cachelines of TDX private memory.  Mark cache state incoherent
> > +	 * so that the cache can be flushed during kexec.
> > +	 *
> > +	 * This needs to be done before actually making the SEAMCALL,
> > +	 * because kexec-ing CPU could send NMI to stop remote CPUs,
> > +	 * in which case even disabling IRQ won't help here.
> > +	 */
> > +	this_cpu_write(cache_state_incoherent, true);
> > +
> > +	ret = func(fn, args);
> > +
> > +	return ret;
> 
> @ret can be dropped here. Just
> 
> 	return func(fn, args);
> 
> should work.

Yeah thanks will do.

> 
> And tracking cache incoherent state at the per-CPU level seems to add
> unnecessary complexity. It requires a new do_seamcall() wrapper, setting the
> flag on every seamcall rather than just the first one (I'm not concerned about
> performance; it just feels silly), and using preempt_disable()/enable(). In my
> view, per-CPU tracking at most saves a WBINVD on a CPU that never runs
> SEAMCALLs during KEXEC, which is quite marginal. Did I miss any other benefits?

The cache state is percpu thus a percpu boolean is a natural fit.  Besides
the benefit you mentioned, it fits better if there are other cases which
could also lead to an incoherent state:

https://lore.kernel.org/lkml/eb2e3b02-cf5e-4848-8f1d-9f3af8f9c96b@intel.com/

Setting the boolean in the SEAMCALL common code makes the logic quite
simple:

  If you ever do a SEAMCALL, mark the cache in incoherent state.

Please see Dave's comment here:

https://lore.kernel.org/lkml/31e17bc8-2e9e-4e93-a912-3d54826e59d0@intel.com/

The new code around the common SEAMCALL is pretty marginal comparing to
the SEAMCALL itself (as you said), and it's pretty straightforward, i.e.,
logically less error prone IMHO, so I am not seeing it silly.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v4 1/7] x86/kexec: Consolidate relocate_kernel() function parameters
  2025-07-22 14:58           ` Borislav Petkov
@ 2025-07-23  2:13             ` Huang, Kai
  0 siblings, 0 replies; 28+ messages in thread
From: Huang, Kai @ 2025-07-23  2:13 UTC (permalink / raw)
  To: bp@alien8.de
  Cc: kvm@vger.kernel.org, ashish.kalra@amd.com, Hansen, Dave,
	thomas.lendacky@amd.com, linux-kernel@vger.kernel.org,
	seanjc@google.com, Chatre, Reinette, kas@kernel.org,
	mingo@redhat.com, tglx@linutronix.de, nik.borisov@suse.com,
	dwmw@amazon.co.uk, pbonzini@redhat.com, hpa@zytor.com,
	peterz@infradead.org, sagis@google.com, Gao, Chao,
	Edgecombe, Rick P, Yamahata, Isaku, x86@kernel.org,
	Williams, Dan J

On Tue, 2025-07-22 at 16:58 +0200, Borislav Petkov wrote:
> On Mon, Jul 21, 2025 at 09:36:48PM +0000, Huang, Kai wrote:
> > Np and thanks! I'll address your other comments but I'll see whether Boris
> > has any other comments first.
> 
> Nah, he hasn't. This looks exactly like what I had in mind so thanks for
> doing it.
> 

Hi Boris,

Thanks for the feedback.

I guess we don't need to wait other TDX patches to respin this, so I can
quickly address comments from Tom and hpa and send out v2 of this patch as
a standalone one if you prefer?  :-)

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v4 3/7] x86/virt/tdx: Mark memory cache state incoherent when making SEAMCALL
  2025-07-23  1:59     ` Huang, Kai
@ 2025-07-23  6:47       ` Chao Gao
  0 siblings, 0 replies; 28+ messages in thread
From: Chao Gao @ 2025-07-23  6:47 UTC (permalink / raw)
  To: Huang, Kai
  Cc: kvm@vger.kernel.org, ashish.kalra@amd.com, Hansen, Dave,
	thomas.lendacky@amd.com, linux-kernel@vger.kernel.org,
	mingo@redhat.com, dwmw@amazon.co.uk, pbonzini@redhat.com,
	seanjc@google.com, tglx@linutronix.de, kas@kernel.org,
	Chatre, Reinette, Yamahata, Isaku, nik.borisov@suse.com,
	hpa@zytor.com, peterz@infradead.org, sagis@google.com,
	Chen, Farrah, Edgecombe, Rick P, bp@alien8.de, x86@kernel.org,
	Williams, Dan J

>> And tracking cache incoherent state at the per-CPU level seems to add
>> unnecessary complexity. It requires a new do_seamcall() wrapper, setting the
>> flag on every seamcall rather than just the first one (I'm not concerned about
>> performance; it just feels silly), and using preempt_disable()/enable(). In my
>> view, per-CPU tracking at most saves a WBINVD on a CPU that never runs
>> SEAMCALLs during KEXEC, which is quite marginal. Did I miss any other benefits?
>
>The cache state is percpu thus a percpu boolean is a natural fit.  Besides
>the benefit you mentioned, it fits better if there are other cases which
>could also lead to an incoherent state:
>
>https://lore.kernel.org/lkml/eb2e3b02-cf5e-4848-8f1d-9f3af8f9c96b@intel.com/
>
>Setting the boolean in the SEAMCALL common code makes the logic quite
>simple:
>
>  If you ever do a SEAMCALL, mark the cache in incoherent state.
>
>Please see Dave's comment here:
>
>https://lore.kernel.org/lkml/31e17bc8-2e9e-4e93-a912-3d54826e59d0@intel.com/
>
>The new code around the common SEAMCALL is pretty marginal comparing to
>the SEAMCALL itself (as you said), and it's pretty straightforward, i.e.,
>logically less error prone IMHO, so I am not seeing it silly.

Sure, let's follow Dave's suggestion.

For anyone else who has the same question, see the discussion here:

https://lore.kernel.org/lkml/9a9380b55e1d01c650456e56be0949b531d88af5.camel@intel.com/
https://lore.kernel.org/lkml/6536c0cf614101eda89b3fe861f95ad0c1476cfd.camel@intel.com/

Both options, per-CPU variable and global variable, were evaluated, and the
agreed approach is to use the per-CPU variable. Apologize for the noise.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v4 1/7] x86/kexec: Consolidate relocate_kernel() function parameters
  2025-07-22  0:53                   ` Huang, Kai
@ 2025-07-23  8:13                     ` Huang, Kai
  0 siblings, 0 replies; 28+ messages in thread
From: Huang, Kai @ 2025-07-23  8:13 UTC (permalink / raw)
  To: hpa@zytor.com, thomas.lendacky@amd.com, peterz@infradead.org,
	tglx@linutronix.de, Hansen, Dave, mingo@redhat.com, bp@alien8.de
  Cc: Gao, Chao, Edgecombe, Rick P, seanjc@google.com, x86@kernel.org,
	kas@kernel.org, sagis@google.com, Chatre, Reinette,
	linux-kernel@vger.kernel.org, Williams, Dan J,
	pbonzini@redhat.com, kvm@vger.kernel.org, ashish.kalra@amd.com,
	Yamahata, Isaku, nik.borisov@suse.com, dwmw@amazon.co.uk

On Tue, 2025-07-22 at 00:53 +0000, Huang, Kai wrote:
> On Mon, 2025-07-21 at 17:44 -0700, H. Peter Anvin wrote:
> > On July 21, 2025 4:42:22 PM PDT, "Huang, Kai" <kai.huang@intel.com> wrote:
> > > On Mon, 2025-07-21 at 23:29 +0000, Huang, Kai wrote:
> > > > > On July 21, 2025 2:36:48 PM PDT, "Huang, Kai" <kai.huang@intel.com> wrote:
> > > > > > On Mon, 2025-07-21 at 16:27 -0500, Tom Lendacky wrote:
> > > > > > > > > > @@ -204,7 +202,7 @@
> > > > > SYM_CODE_START_LOCAL_NOALIGN(identity_mapped)
> > > > > > > > > >   	 * entries that will conflict with the now unencrypted memory
> > > > > > > > > >   	 * used by kexec. Flush the caches before copying the kernel.
> > > > > > > > > >   	 */
> > > > > > > > > > -	testq	%r8, %r8
> > > > > > > > > > +	testq	$RELOC_KERNEL_HOST_MEM_ACTIVE, %r11
> > > > > > > > > 
> > > > > > > > > Hmmm... can't both bits be set at the same time? If so, then this
> > > > > > > > > will fail. This should be doing bit tests now.
> > > > > > > > 
> > > > > > > > TEST instruction performs logical AND of the two operands,
> > > > > > > > therefore the above equals to:
> > > > > > > > 
> > > > > > > >  	set ZF if "R11 AND BIT(1) == 0".
> > > > > > > > 
> > > > > > > > Whether there's any other bits set in R11 doesn't impact the above, right?
> > > > > > > > 
> > > > > > > 
> > > > > > > Doh! My bad, yes, not sure what I was thinking there.
> > > > > > > 
> > > > > > 
> > > > > > Np and thanks! I'll address your other comments but I'll see whether
> > > > > > Boris has any other comments first.
> > > > > > 
> > > > > 
> > > > > You can use testb in this case to save 3 bytes, too.
> > > > 
> > > > Yeah I can do that, thanks for the info!
> > > 
> > > I just tried.  I need to do:
> > > 
> > > 	testb	$RELOC_KERNEL_HOST_MEM_ACTIVE, %r11b
> > > 
> > > in order to compile, otherwise using plain %r11 generates:
> > > 
> > > arch/x86/kernel/relocate_kernel_64.S:212: Error: `%r11' not allowed with
> > > `testb'
> > > 
> > > I'll do some test and if there's no problem I'll switch to use this way,
> > > assuming it still saves us 3-bytes.
> > > 
> > 
> > That works just fine.
> 
> Yeah I have now tested. Thanks :-)

Sorry for multiple emails.  One minor thing related to using testb:

With this patch there's one movl to R11:

    movl    $RELOC_KERNEL_PRESERVE_CONTEXT, %r11d

I think it would be better to make it consistent with testb:

    movb    $RELOC_KERNEL_PRESERVE_CONTEXT, %r11b

Please let me know if there's any concern?

^ permalink raw reply	[flat|nested] 28+ messages in thread

end of thread, other threads:[~2025-07-23  8:13 UTC | newest]

Thread overview: 28+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-07-17 21:46 [PATCH v4 0/7] TDX host: kexec/kdump support Kai Huang
2025-07-17 21:46 ` [PATCH v4 1/7] x86/kexec: Consolidate relocate_kernel() function parameters Kai Huang
2025-07-21 14:40   ` Tom Lendacky
2025-07-21 21:20     ` Huang, Kai
2025-07-21 21:27       ` Tom Lendacky
2025-07-21 21:36         ` Huang, Kai
2025-07-21 23:05           ` H. Peter Anvin
2025-07-21 23:29             ` Huang, Kai
2025-07-21 23:42               ` Huang, Kai
2025-07-22  0:44                 ` H. Peter Anvin
2025-07-22  0:53                   ` Huang, Kai
2025-07-23  8:13                     ` Huang, Kai
2025-07-22 14:58           ` Borislav Petkov
2025-07-23  2:13             ` Huang, Kai
2025-07-17 21:46 ` [PATCH v4 2/7] x86/sme: Use percpu boolean to control WBINVD during kexec Kai Huang
2025-07-17 21:46 ` [PATCH v4 3/7] x86/virt/tdx: Mark memory cache state incoherent when making SEAMCALL Kai Huang
2025-07-22 14:52   ` Chao Gao
2025-07-23  1:59     ` Huang, Kai
2025-07-23  6:47       ` Chao Gao
2025-07-17 21:46 ` [PATCH v4 4/7] x86/kexec: Disable kexec/kdump on platforms with TDX partial write erratum Kai Huang
2025-07-17 21:46 ` [PATCH v4 5/7] x86/virt/tdx: Remove the !KEXEC_CORE dependency Kai Huang
2025-07-17 21:46 ` [PATCH v4 6/7] x86/virt/tdx: Update the kexec section in the TDX documentation Kai Huang
2025-07-17 21:46 ` [PATCH v4 7/7] KVM: TDX: Explicitly do WBINVD when no more TDX SEAMCALLs Kai Huang
2025-07-21 13:08 ` [PATCH v4 0/7] TDX host: kexec/kdump support Tom Lendacky
2025-07-21 14:50   ` Tom Lendacky
2025-07-21 21:30     ` Huang, Kai
2025-07-21 21:30     ` Huang, Kai
2025-07-22 13:50 ` Paolo Bonzini

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).