public inbox for linux-kernel@vger.kernel.org
* [RFC PATCH 0/5] TDX host: kexec/kdump support
@ 2025-03-12 11:34 Kai Huang
  2025-03-12 11:34 ` [RFC PATCH 1/5] x86/kexec: Do unconditional WBINVD for bare-metal in stop_this_cpu() Kai Huang
                   ` (4 more replies)
  0 siblings, 5 replies; 34+ messages in thread
From: Kai Huang @ 2025-03-12 11:34 UTC (permalink / raw)
  To: dave.hansen, bp, tglx, peterz, mingo, kirill.shutemov
  Cc: hpa, x86, linux-kernel, pbonzini, seanjc, rick.p.edgecombe,
	reinette.chatre, isaku.yamahata, dan.j.williams, thomas.lendacky,
	ashish.kalra, dwmw, bhe, nik.borisov, sagis

Hi Dave,

This series is not ready for your review, but we want to move the
discussion to the public list at this point.  Please feel free to ignore
it until we get it a bit more polished.

-----------------------------------------------------------------------

TDX hosts do not currently support kexec: CONFIG_KEXEC_CORE must be
disabled in order to enable CONFIG_INTEL_TDX_HOST.  This is not
acceptable to distros; at least Red Hat wants both to be enabled, and
other customers also want to use kexec together with TDX.

This series adds TDX host kexec support.  With CONFIG_KEXEC_CORE and
CONFIG_INTEL_TDX_HOST both enabled, a TDX enabled kernel can kexec into
a new kernel and the kdump (crash kexec) can work as normal.

One limitation is that if the old kernel has ever enabled TDX, the new
kernel cannot use TDX.  Supporting this is future work.

One exception is that kexec/kdump is disabled when the platform has the
TDX "partial write machine check" erratum (and CONFIG_INTEL_TDX_HOST is
enabled).  See below for more information.

This was supposed to be v8, but I tagged the series as RFC because in
the recent internal review one point came up, regarding the use of
MOVDIR64B to reset TDX private memory, that I want to get feedback on
from the list.  Please see section "2) Reset TDX private memory using
MOVDIR64B" below for more information.

v7 -> this RFC: 

The major change is that, to keep the code change minimal, I removed
the patches which reset TDX private memory to make kexec work with the
TDX erratum.  Instead, a new patch simply disables kexec/kdump on such
platforms.

v7: https://lore.kernel.org/lkml/cover.1727179214.git.kai.huang@intel.com/

=== More information ===

TDX private memory is memory that is encrypted with private Host Key IDs
(HKID).  If the kernel has ever enabled TDX, part of system memory
remains TDX private memory when kexec happens.  E.g., the PAMT (Physical
Address Metadata Table) pages used by the TDX module to track each TDX
memory page's state are never freed once the TDX module is initialized.
TDX guests also have guest private memory and secure-EPT pages.

After kexec, the new kernel has no knowledge of which memory pages were
used as TDX private pages, and may use all memory as regular memory.

1) Cache flush

Per TDX 1.5 base spec "8.6.1. Platforms not Using ACT: Required Cache
Flush and Initialization by the Host VMM", to support kexec for TDX, the
kernel needs to flush the cache to make sure no dirty cachelines of TDX
private memory are left over for the new kernel (when the TDX module
reports TDX_FEATURES.CLFLUSH_BEFORE_ALLOC as 1 in the global metadata
for the platform).  The kernel also needs to make sure there is no
further TDX activity (no SEAMCALLs) after the cache flush so that no new
dirty cachelines of TDX private memory are generated.

SME has a similar requirement.  SME kexec support uses WBINVD to do the
cache flush; WBINVD can flush cachelines associated with any HKID.
Reuse the WBINVD introduced by SME to flush the cache for TDX.

Currently the kernel explicitly checks whether the hardware supports SME
and only does WBINVD if so.  Instead of adding yet another TDX-specific
check, this series does an unconditional WBINVD on bare metal for code
simplicity, since kexec is a slow path.

2) Reset TDX private memory using MOVDIR64B

The TDX spec (the aforementioned section) also suggests the kernel
*should* use MOVDIR64B to clear a TDX private page before the kernel
reuses it as a regular one.

However, in reality the situation is more flexible.  Per the TDX 1.5
base spec ("Table 16.2: Non-ACT Platforms Checks on Memory Reads in Ci
Mode" and "Table 16.3: Non-ACT Platforms Checks on Memory Reads in Li
Mode"), reads/writes to TDX private memory using a shared KeyID without
integrity check enabled will not poison the memory or cause a machine
check.

Note that on platforms with ACT (Access Control Table), there is no
integrity check involved, thus no machine check can happen due to
memory reads/writes using different KeyIDs.

Since it is not trivial to reset TDX private memory, this series
assumes KeyID 0 doesn't have integrity check enabled and chooses NOT to
reset TDX private memory, leaving it as-is for the new kernel.  As
mentioned above, in practice this is safe.

The worst case is that some day Intel enables integrity check for
KeyID 0 on some new platforms, in which case old kernels running on
those platforms may get a machine check after kexec.  But such a change
would not happen silently.  We could add a patch to reset TDX private
memory on those platforms and backport it to stable.  In the meantime,
we can enjoy the performance gain until that happens.

3) One limitation

If the kernel has ever enabled TDX, after kexec the new kernel won't be
able to use TDX anymore.  This is because when the new kernel tries to
initialize the TDX module, the first SEAMCALL fails since the module
has already been initialized by the old kernel.

More (non-trivial) work will be needed for the new kernel to use TDX;
e.g., one solution is to reload the TDX module from the location where
the BIOS loads it (/boot/efi/EFI/TDX/).  This series doesn't cover
that, but leaves it as future work.

4) Kdump support

This series also enables kdump with TDX.  No special handling is needed
for crash kexec (other than turning on the Kconfig option):

 - The kdump kernel uses reserved memory from the old kernel as system
   RAM, and the old kernel never uses the reserved memory as TDX memory.
 - /proc/vmcore contains TDX private memory pages.  It's meaningless to
   read them, but it doesn't do any harm either.

5) TDX "partial write machine check" erratum

On platforms with the TDX erratum, a partial write (a write transaction
of less than a cacheline that lands at the memory controller) to TDX
private memory poisons that memory, and a subsequent read triggers a
machine check.  On those platforms, the kernel needs to reset TDX
private memory before jumping to the new kernel, otherwise the new
kernel may see unexpected machine checks.

The kernel currently doesn't track which pages are TDX private memory,
and resetting TDX private memory is not trivial.  For simplicity, this
series simply disables kexec/kdump on such platforms.  This will be
enhanced in the future.


Kai Huang (5):
  x86/kexec: Do unconditional WBINVD for bare-metal in stop_this_cpu()
  x86/kexec: Do unconditional WBINVD for bare-metal in relocate_kernel()
  x86/kexec: Disable kexec/kdump on platforms with TDX partial write
    erratum
  x86/virt/tdx: Remove the !KEXEC_CORE dependency
  x86/virt/tdx: Update the kexec section in the TDX documentation

 Documentation/arch/x86/tdx.rst       | 17 +++++++++-------
 arch/x86/Kconfig                     |  1 -
 arch/x86/include/asm/kexec.h         |  2 +-
 arch/x86/kernel/machine_kexec_64.c   | 30 ++++++++++++++++++++--------
 arch/x86/kernel/process.c            | 21 +++++++++----------
 arch/x86/kernel/relocate_kernel_64.S | 15 +++++++++-----
 6 files changed, 54 insertions(+), 32 deletions(-)

-- 
2.48.1


^ permalink raw reply	[flat|nested] 34+ messages in thread

* [RFC PATCH 1/5] x86/kexec: Do unconditional WBINVD for bare-metal in stop_this_cpu()
  2025-03-12 11:34 [RFC PATCH 0/5] TDX host: kexec/kdump support Kai Huang
@ 2025-03-12 11:34 ` Kai Huang
  2025-03-13 18:40   ` Edgecombe, Rick P
  2025-03-12 11:34 ` [RFC PATCH 2/5] x86/kexec: Do unconditional WBINVD for bare-metal in relocate_kernel() Kai Huang
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 34+ messages in thread
From: Kai Huang @ 2025-03-12 11:34 UTC (permalink / raw)
  To: dave.hansen, bp, tglx, peterz, mingo, kirill.shutemov
  Cc: hpa, x86, linux-kernel, pbonzini, seanjc, rick.p.edgecombe,
	reinette.chatre, isaku.yamahata, dan.j.williams, thomas.lendacky,
	ashish.kalra, dwmw, bhe, nik.borisov, sagis, Dave Young

TL;DR:

Change stop_this_cpu() to do unconditional WBINVD on bare metal to
cover kexec support for both AMD SME and Intel TDX.  Previously there
_was_ an issue preventing this, but it has since been fixed.

Long version:

AMD SME uses the C-bit to determine whether to encrypt the memory or
not.  For the same physical memory address, dirty cachelines with and
without the C-bit can coexist and the CPU can flush them back to memory
in random order.  To support kexec for SME, the old kernel uses WBINVD
to flush the cache before booting the new kernel so that no stale dirty
cachelines are left over by the old kernel, which could otherwise
corrupt the new kernel's memory.

TDX uses 'KeyID' bits in the physical address for memory encryption and
has the same requirement.  To support kexec for TDX, the old kernel
needs to flush cache of TDX private memory.
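
The KeyID bits make different encryption domains aliases of the same
DRAM line, which is why dirty lines for both aliases can coexist.  The
arithmetic can be sketched as below; note the shift and width used here
are purely illustrative (real hardware enumerates the number and
position of KeyID bits, e.g. via IA32_TME_ACTIVATE), so treat the
constants as assumptions:

```c
#include <stdint.h>

/* Illustrative only: real platforms enumerate the KeyID bit position. */
#define KEYID_SHIFT	46

/*
 * A memory access is tagged with a KeyID by placing it in the upper
 * physical address bits, so each KeyID yields a distinct "alias"
 * address for the same underlying DRAM line.
 */
static uint64_t keyid_alias(uint64_t paddr, uint64_t keyid)
{
	return paddr | (keyid << KEYID_SHIFT);
}

/* Strip the KeyID bits to recover the underlying physical address. */
static uint64_t keyid_strip(uint64_t paddr, unsigned int keyid_bits)
{
	return paddr & ~(((1ULL << keyid_bits) - 1) << KEYID_SHIFT);
}
```

Two cachelines whose addresses differ only in the KeyID bits hit the
same DRAM location, hence the flush requirement before kexec.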

Currently, the kernel only performs WBINVD in stop_this_cpu() when SME
is supported by hardware.  Perform unconditional WBINVD to support TDX
instead of adding one more vendor-specific check.  Kexec is a slow path,
and the additional WBINVD is acceptable for the sake of simplicity and
maintainability.

Only do WBINVD on bare metal.  Doing WBINVD in a guest triggers an
unexpected exception (#VE or #VC) for TDX and SEV-ES/SEV-SNP guests,
and the guest may not be able to handle it (e.g., a TDX guest panics
if it sees such a #VE).

History of SME and kexec WBINVD:

There _was_ an issue preventing an unconditional WBINVD, but that has
been fixed.

Initial SME kexec support added an unconditional WBINVD in commit

  bba4ed011a52: ("x86/mm, kexec: Allow kexec to be used with SME")

This commit caused different Intel systems to hang or reset.

Without a clear root cause, a later commit

  f23d74f6c66c: ("x86/mm: Rework wbinvd, hlt operation in stop_this_cpu()")

fixed the Intel system hang issues by only doing WBINVD when hardware
supports SME.

A corner case [*] later revealed the root cause of the system hang
issues, which was fixed by commit

  1f5e7eb7868e: ("x86/smp: Make stop_other_cpus() more robust")

See [1][2] for more information.

Further testing of unconditional WBINVD on top of the above fix, on the
problematic machines where the issues were originally reported,
confirmed the issues could not be reproduced.

See [3][4] for more information.

Therefore, it is safe to do unconditional WBINVD for bare-metal now.

[*] The commit didn't check whether the CPUID leaf was available.
Reading an unsupported CPUID leaf on Intel returns garbage, resulting
in an unintended WBINVD, which caused the issue (the analysis of which
then revealed the final root cause).  The corner case was independently
fixed by commit
fixed by commit

  9b040453d444: ("x86/smp: Dont access non-existing CPUID leaf")

Link: https://lore.kernel.org/lkml/28a494ca-3173-4072-921c-6c5f5b257e79@amd.com/ [1]
Link: https://lore.kernel.org/lkml/24844584-8031-4b58-ba5c-f85ef2f4c718@amd.com/ [2]
Link: https://lore.kernel.org/lkml/20240221092856.GAZdXCWGJL7c9KLewv@fat_crate.local/ [3]
Link: https://lore.kernel.org/lkml/CALu+AoSZkq1kz-xjvHkkuJ3C71d0SM5ibEJurdgmkZqZvNp2dQ@mail.gmail.com/ [4]
Signed-off-by: Kai Huang <kai.huang@intel.com>
Suggested-by: Borislav Petkov <bp@alien8.de>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: Dave Young <dyoung@redhat.com>
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
---
 arch/x86/kernel/process.c | 21 +++++++++++----------
 1 file changed, 11 insertions(+), 10 deletions(-)

diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index 9c75d701011f..8475d9d2d8c4 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -819,18 +819,19 @@ void __noreturn stop_this_cpu(void *dummy)
 	mcheck_cpu_clear(c);
 
 	/*
-	 * Use wbinvd on processors that support SME. This provides support
-	 * for performing a successful kexec when going from SME inactive
-	 * to SME active (or vice-versa). The cache must be cleared so that
-	 * if there are entries with the same physical address, both with and
-	 * without the encryption bit, they don't race each other when flushed
-	 * and potentially end up with the wrong entry being committed to
-	 * memory.
+	 * Use wbinvd to support kexec for both SME (from inactive to active
+	 * or vice-versa) and TDX.  The cache must be cleared so that if there
+	 * are entries with the same physical address, both with and without
+	 * the encryption bit(s), they don't race each other when flushed and
+	 * potentially end up with the wrong entry being committed to memory.
 	 *
-	 * Test the CPUID bit directly because the machine might've cleared
-	 * X86_FEATURE_SME due to cmdline options.
+	 * stop_this_cpu() isn't a fast path, just do unconditional WBINVD for
+	 * bare-metal to cover both SME and TDX.  Do not do WBINVD in a guest
+	 * since performing one will result in an exception (#VE or #VC) for
+	 * TDX or SEV-ES/SEV-SNP guests which the guest may not be able to
+	 * handle (e.g., TDX guest panics if it sees #VE).
 	 */
-	if (c->extended_cpuid_level >= 0x8000001f && (cpuid_eax(0x8000001f) & BIT(0)))
+	if (!boot_cpu_has(X86_FEATURE_HYPERVISOR))
 		wbinvd();
 
 	/*
-- 
2.48.1



* [RFC PATCH 2/5] x86/kexec: Do unconditional WBINVD for bare-metal in relocate_kernel()
  2025-03-12 11:34 [RFC PATCH 0/5] TDX host: kexec/kdump support Kai Huang
  2025-03-12 11:34 ` [RFC PATCH 1/5] x86/kexec: Do unconditional WBINVD for bare-metal in stop_this_cpu() Kai Huang
@ 2025-03-12 11:34 ` Kai Huang
  2025-03-13 23:17   ` Edgecombe, Rick P
  2025-03-12 11:34 ` [RFC PATCH 3/5] x86/kexec: Disable kexec/kdump on platforms with TDX partial write erratum Kai Huang
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 34+ messages in thread
From: Kai Huang @ 2025-03-12 11:34 UTC (permalink / raw)
  To: dave.hansen, bp, tglx, peterz, mingo, kirill.shutemov
  Cc: hpa, x86, linux-kernel, pbonzini, seanjc, rick.p.edgecombe,
	reinette.chatre, isaku.yamahata, dan.j.williams, thomas.lendacky,
	ashish.kalra, dwmw, bhe, nik.borisov, sagis, Dave Young,
	David Kaplan

For both SME and TDX, dirty cachelines with and without the encryption
bit(s) of the same physical memory address can coexist and the CPU can
flush them back to memory in random order.  During kexec, the caches
must be flushed before jumping to the new kernel to avoid silent memory
corruption to the new kernel.

The WBINVD in stop_this_cpu() flushes caches for all remote CPUs when
they are being stopped.  For SME, the WBINVD in relocate_kernel()
flushes the cache for the last running CPU (which is doing kexec).

Similarly, to support kexec for TDX host, after stopping all remote CPUs
with cache flushed, the kernel needs to flush cache for the last running
CPU.

Use the existing WBINVD in relocate_kernel() to cover TDX host as well.

Just do unconditional WBINVD to cover both SME and TDX instead of
sprinkling additional vendor-specific checks.  Kexec is a slow path, and
the additional WBINVD is acceptable for the sake of simplicity and
maintainability.

But only do WBINVD on bare metal, because TDX guests and SEV-ES/SEV-SNP
guests would get an unexpected (and unnecessary) exception (#VE or #VC)
which the kernel is unable to handle at the time of relocate_kernel(),
since the kernel has torn down the IDT by then.

Remove the host_mem_enc_active local variable and directly pass
!cpu_feature_enabled(X86_FEATURE_HYPERVISOR) as an argument when
calling relocate_kernel().  cpu_feature_enabled() is always inlined,
not a function call, thus it is safe to use after load_segments() when
call depth tracking is enabled.

Signed-off-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: David Kaplan <david.kaplan@amd.com>
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Tested-by: David Kaplan <david.kaplan@amd.com>
---
 arch/x86/include/asm/kexec.h         |  2 +-
 arch/x86/kernel/machine_kexec_64.c   | 14 ++++++--------
 arch/x86/kernel/relocate_kernel_64.S | 15 ++++++++++-----
 3 files changed, 17 insertions(+), 14 deletions(-)

diff --git a/arch/x86/include/asm/kexec.h b/arch/x86/include/asm/kexec.h
index 8ad187462b68..48c313575262 100644
--- a/arch/x86/include/asm/kexec.h
+++ b/arch/x86/include/asm/kexec.h
@@ -123,7 +123,7 @@ relocate_kernel_fn(unsigned long indirection_page,
 		   unsigned long pa_control_page,
 		   unsigned long start_address,
 		   unsigned int preserve_context,
-		   unsigned int host_mem_enc_active);
+		   unsigned int bare_metal);
 #endif
 extern relocate_kernel_fn relocate_kernel;
 #define ARCH_HAS_KIMAGE_ARCH
diff --git a/arch/x86/kernel/machine_kexec_64.c b/arch/x86/kernel/machine_kexec_64.c
index a68f5a0a9f37..0e9808eeb63e 100644
--- a/arch/x86/kernel/machine_kexec_64.c
+++ b/arch/x86/kernel/machine_kexec_64.c
@@ -346,16 +346,9 @@ void __nocfi machine_kexec(struct kimage *image)
 {
 	unsigned long reloc_start = (unsigned long)__relocate_kernel_start;
 	relocate_kernel_fn *relocate_kernel_ptr;
-	unsigned int host_mem_enc_active;
 	int save_ftrace_enabled;
 	void *control_page;
 
-	/*
-	 * This must be done before load_segments() since if call depth tracking
-	 * is used then GS must be valid to make any function calls.
-	 */
-	host_mem_enc_active = cc_platform_has(CC_ATTR_HOST_MEM_ENCRYPT);
-
 #ifdef CONFIG_KEXEC_JUMP
 	if (image->preserve_context)
 		save_processor_state();
@@ -398,6 +391,11 @@ void __nocfi machine_kexec(struct kimage *image)
 	 *
 	 * I take advantage of this here by force loading the
 	 * segments, before I zap the gdt with an invalid value.
+	 *
+	 * load_segments() resets GS to 0.  Don't make any function call
+	 * after here since call depth tracking uses per-CPU variables to
+	 * operate (relocate_kernel() is explicitly ignored by call depth
+	 * tracking).
 	 */
 	load_segments();
 	/*
@@ -412,7 +410,7 @@ void __nocfi machine_kexec(struct kimage *image)
 					   virt_to_phys(control_page),
 					   image->start,
 					   image->preserve_context,
-					   host_mem_enc_active);
+					   !cpu_feature_enabled(X86_FEATURE_HYPERVISOR));
 
 #ifdef CONFIG_KEXEC_JUMP
 	if (image->preserve_context)
diff --git a/arch/x86/kernel/relocate_kernel_64.S b/arch/x86/kernel/relocate_kernel_64.S
index b44d8863e57f..dc1a59cd8139 100644
--- a/arch/x86/kernel/relocate_kernel_64.S
+++ b/arch/x86/kernel/relocate_kernel_64.S
@@ -50,7 +50,7 @@ SYM_CODE_START_NOALIGN(relocate_kernel)
 	 * %rsi pa_control_page
 	 * %rdx start address
 	 * %rcx preserve_context
-	 * %r8  host_mem_enc_active
+	 * %r8  bare_metal
 	 */
 
 	/* Save the CPU context, used for jumping back */
@@ -107,7 +107,7 @@ SYM_CODE_START_LOCAL_NOALIGN(identity_mapped)
 	/*
 	 * %rdi	indirection page
 	 * %rdx start address
-	 * %r8 host_mem_enc_active
+	 * %r8 bare_metal
 	 * %r9 page table page
 	 * %r11 preserve_context
 	 * %r13 original CR4 when relocate_kernel() was invoked
@@ -156,14 +156,19 @@ SYM_CODE_START_LOCAL_NOALIGN(identity_mapped)
 	movq	%r9, %cr3
 
 	/*
-	 * If SME is active, there could be old encrypted cache line
+	 * If SME/TDX is active, there could be old encrypted cache line
 	 * entries that will conflict with the now unencrypted memory
 	 * used by kexec. Flush the caches before copying the kernel.
+	 *
+	 * Do WBINVD for bare-metal only to cover both SME and TDX. Doing
+	 * WBINVD in guest results in an unexpected exception (#VE or #VC)
+	 * for TDX and SEV-ES/SNP guests which then crashes the guest (the
+	 * kernel has torn down the IDT).
 	 */
 	testq	%r8, %r8
-	jz .Lsme_off
+	jz .Lno_wbinvd
 	wbinvd
-.Lsme_off:
+.Lno_wbinvd:
 
 	call	swap_pages
 
-- 
2.48.1



* [RFC PATCH 3/5] x86/kexec: Disable kexec/kdump on platforms with TDX partial write erratum
  2025-03-12 11:34 [RFC PATCH 0/5] TDX host: kexec/kdump support Kai Huang
  2025-03-12 11:34 ` [RFC PATCH 1/5] x86/kexec: Do unconditional WBINVD for bare-metal in stop_this_cpu() Kai Huang
  2025-03-12 11:34 ` [RFC PATCH 2/5] x86/kexec: Do unconditional WBINVD for bare-metal in relocate_kernel() Kai Huang
@ 2025-03-12 11:34 ` Kai Huang
  2025-03-12 23:27   ` Edgecombe, Rick P
  2025-03-12 11:34 ` [RFC PATCH 4/5] x86/virt/tdx: Remove the !KEXEC_CORE dependency Kai Huang
  2025-03-12 11:34 ` [RFC PATCH 5/5] x86/virt/tdx: Update the kexec section in the TDX documentation Kai Huang
  4 siblings, 1 reply; 34+ messages in thread
From: Kai Huang @ 2025-03-12 11:34 UTC (permalink / raw)
  To: dave.hansen, bp, tglx, peterz, mingo, kirill.shutemov
  Cc: hpa, x86, linux-kernel, pbonzini, seanjc, rick.p.edgecombe,
	reinette.chatre, isaku.yamahata, dan.j.williams, thomas.lendacky,
	ashish.kalra, dwmw, bhe, nik.borisov, sagis

Some early TDX-capable platforms have an erratum: a partial write (a
write transaction of less than a cacheline that lands at the memory
controller) to TDX private memory poisons that memory, and a subsequent
read triggers a machine check.

On those platforms, the old kernel must reset TDX private memory before
jumping to the new kernel, otherwise the new kernel may see an
unexpected machine check.  Currently the kernel doesn't track which
pages are TDX private pages.  For simplicity, just fail kexec/kdump on
those platforms.

Leverage the existing machine_kexec_prepare() to fail kexec/kdump by
checking for the presence of the TDX erratum (which is only detected
when the kernel is built with TDX host support).  This rejects
kexec/kdump when the kexec/kdump kernel image is being loaded.

The alternative is to reject kexec/kdump when the kernel is jumping to
the new kernel.  But for kexec this requires adding a new check (e.g.,
arch_kexec_allowed()) in the common code to fail kernel_kexec() at an
early stage.  Kdump (crash_kexec()) needs a similar check, but it's
hard to justify because crash_kexec() is not supposed to abort.

It's feasible to further relax this limitation, i.e., only fail kexec
when TDX is actually enabled by the kernel.  But this is still a half
measure compared to resetting TDX private memory so just do the simplest
thing for now.

The impact to userspace is that users will get an error when loading
the kexec/kdump kernel image:

  kexec_load failed: Operation not supported

This might be confusing to users, so also print the reason in dmesg:

  [..] kexec: not allowed on platform with tdx_pw_mce bug.

Signed-off-by: Kai Huang <kai.huang@intel.com>
---
 arch/x86/kernel/machine_kexec_64.c | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)

diff --git a/arch/x86/kernel/machine_kexec_64.c b/arch/x86/kernel/machine_kexec_64.c
index 0e9808eeb63e..e438c4163960 100644
--- a/arch/x86/kernel/machine_kexec_64.c
+++ b/arch/x86/kernel/machine_kexec_64.c
@@ -311,6 +311,22 @@ int machine_kexec_prepare(struct kimage *image)
 	unsigned long reloc_end = (unsigned long)__relocate_kernel_end;
 	int result;
 
+	/*
+	 * Some early TDX-capable platforms have an erratum.  A kernel
+	 * partial write (a write transaction of less than cacheline
+	 * lands at memory controller) to TDX private memory poisons that
+	 * memory, and a subsequent read triggers a machine check.
+	 *
+	 * On those platforms the old kernel must reset TDX private
+	 * memory before jumping to the new kernel otherwise the new
+	 * kernel may see unexpected machine check.  For simplicity
+	 * just fail kexec/kdump on those platforms.
+	 */
+	if (boot_cpu_has_bug(X86_BUG_TDX_PW_MCE)) {
+		pr_info_once("Not allowed on platform with tdx_pw_mce bug\n");
+		return -EOPNOTSUPP;
+	}
+
 	/* Setup the identity mapped 64bit page table */
 	result = init_pgtable(image, __pa(control_page));
 	if (result)
-- 
2.48.1



* [RFC PATCH 4/5] x86/virt/tdx: Remove the !KEXEC_CORE dependency
  2025-03-12 11:34 [RFC PATCH 0/5] TDX host: kexec/kdump support Kai Huang
                   ` (2 preceding siblings ...)
  2025-03-12 11:34 ` [RFC PATCH 3/5] x86/kexec: Disable kexec/kdump on platforms with TDX partial write erratum Kai Huang
@ 2025-03-12 11:34 ` Kai Huang
  2025-03-12 11:34 ` [RFC PATCH 5/5] x86/virt/tdx: Update the kexec section in the TDX documentation Kai Huang
  4 siblings, 0 replies; 34+ messages in thread
From: Kai Huang @ 2025-03-12 11:34 UTC (permalink / raw)
  To: dave.hansen, bp, tglx, peterz, mingo, kirill.shutemov
  Cc: hpa, x86, linux-kernel, pbonzini, seanjc, rick.p.edgecombe,
	reinette.chatre, isaku.yamahata, dan.j.williams, thomas.lendacky,
	ashish.kalra, dwmw, bhe, nik.borisov, sagis

During kexec it is now guaranteed that all dirty cachelines of TDX
private memory are flushed before jumping to the new kernel.  The TDX
private memory from the old kernel remains TDX private memory in the
new kernel, but that is OK because kernel reads/writes to TDX private
memory never cause a machine check, except on platforms with the TDX
partial write erratum, which has already been handled.

It is safe to allow kexec to work together with TDX now.  Remove the
!KEXEC_CORE dependency.

Signed-off-by: Kai Huang <kai.huang@intel.com>
---
 arch/x86/Kconfig | 1 -
 1 file changed, 1 deletion(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index ac4d65ef54a5..2d423964beb9 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1977,7 +1977,6 @@ config INTEL_TDX_HOST
 	depends on X86_X2APIC
 	select ARCH_KEEP_MEMBLOCK
 	depends on CONTIG_ALLOC
-	depends on !KEXEC_CORE
 	depends on X86_MCE
 	help
 	  Intel Trust Domain Extensions (TDX) protects guest VMs from malicious
-- 
2.48.1



* [RFC PATCH 5/5] x86/virt/tdx: Update the kexec section in the TDX documentation
  2025-03-12 11:34 [RFC PATCH 0/5] TDX host: kexec/kdump support Kai Huang
                   ` (3 preceding siblings ...)
  2025-03-12 11:34 ` [RFC PATCH 4/5] x86/virt/tdx: Remove the !KEXEC_CORE dependency Kai Huang
@ 2025-03-12 11:34 ` Kai Huang
  4 siblings, 0 replies; 34+ messages in thread
From: Kai Huang @ 2025-03-12 11:34 UTC (permalink / raw)
  To: dave.hansen, bp, tglx, peterz, mingo, kirill.shutemov
  Cc: hpa, x86, linux-kernel, pbonzini, seanjc, rick.p.edgecombe,
	reinette.chatre, isaku.yamahata, dan.j.williams, thomas.lendacky,
	ashish.kalra, dwmw, bhe, nik.borisov, sagis

The TDX host kernel now supports kexec/kdump.  Update the documentation
to reflect that.

Signed-off-by: Kai Huang <kai.huang@intel.com>
---
 Documentation/arch/x86/tdx.rst | 17 ++++++++++-------
 1 file changed, 10 insertions(+), 7 deletions(-)

diff --git a/Documentation/arch/x86/tdx.rst b/Documentation/arch/x86/tdx.rst
index 719043cd8b46..8874c210e545 100644
--- a/Documentation/arch/x86/tdx.rst
+++ b/Documentation/arch/x86/tdx.rst
@@ -142,13 +142,6 @@ but depends on the BIOS to behave correctly.
 Note TDX works with CPU logical online/offline, thus the kernel still
 allows to offline logical CPU and online it again.
 
-Kexec()
-~~~~~~~
-
-TDX host support currently lacks the ability to handle kexec.  For
-simplicity only one of them can be enabled in the Kconfig.  This will be
-fixed in the future.
-
 Erratum
 ~~~~~~~
 
@@ -171,6 +164,16 @@ If the platform has such erratum, the kernel prints additional message in
 machine check handler to tell user the machine check may be caused by
 kernel bug on TDX private memory.
 
+Kexec
+~~~~~~~
+
+Kexec/kdump works normally when TDX is enabled in the kernel.  One
+limitation is that if the old kernel has ever enabled TDX, the new
+kernel won't be able to use TDX anymore.
+
+One exception is that kexec/kdump is disabled on platforms with the
+TDX "tdx_pw_mce" erratum.  This will be enhanced in the future.
+
 Interaction vs S3 and deeper states
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
-- 
2.48.1



* Re: [RFC PATCH 3/5] x86/kexec: Disable kexec/kdump on platforms with TDX partial write erratum
  2025-03-12 11:34 ` [RFC PATCH 3/5] x86/kexec: Disable kexec/kdump on platforms with TDX partial write erratum Kai Huang
@ 2025-03-12 23:27   ` Edgecombe, Rick P
  2025-03-13  0:57     ` Huang, Kai
  0 siblings, 1 reply; 34+ messages in thread
From: Edgecombe, Rick P @ 2025-03-12 23:27 UTC (permalink / raw)
  To: tglx@linutronix.de, peterz@infradead.org, mingo@redhat.com,
	Hansen, Dave, Huang, Kai, bp@alien8.de,
	kirill.shutemov@linux.intel.com
  Cc: nik.borisov@suse.com, bhe@redhat.com, seanjc@google.com,
	x86@kernel.org, sagis@google.com, hpa@zytor.com, Chatre, Reinette,
	Williams, Dan J, thomas.lendacky@amd.com, pbonzini@redhat.com,
	ashish.kalra@amd.com, Yamahata, Isaku,
	linux-kernel@vger.kernel.org, dwmw@amazon.co.uk

On Thu, 2025-03-13 at 00:34 +1300, Kai Huang wrote:
> The impact to userspace is the users will get an error when loading the
> kexec/kdump kernel image:
> 
>   kexec_load failed: Operation not supported
> 
> This might be confusing to the users, thus also print the reason in the
> dmesg:
> 
>   [..] kexec: not allowed on platform with tdx_pw_mce bug.

Continuing an internal discussion... A problem with the plan to more simply
start support for kexec by not supporting the errata platforms is that when
these platforms configure TDX they will lose kexec by default.

Probably a better default for a lot of kernels would be to have kexec work by
default, and require opt-in to use TDX (and lose kexec). One idea was a kernel
parameter be required to enable TDX on those platforms. But then we are starting
to add complexity to avoid other complexity (the errata platform kexec support).

Still, it may be a net win on complexity.


* Re: [RFC PATCH 3/5] x86/kexec: Disable kexec/kdump on platforms with TDX partial write erratum
  2025-03-12 23:27   ` Edgecombe, Rick P
@ 2025-03-13  0:57     ` Huang, Kai
  2025-03-13 17:18       ` Edgecombe, Rick P
  0 siblings, 1 reply; 34+ messages in thread
From: Huang, Kai @ 2025-03-13  0:57 UTC (permalink / raw)
  To: tglx@linutronix.de, peterz@infradead.org, mingo@redhat.com,
	Hansen, Dave, Edgecombe, Rick P, bp@alien8.de,
	kirill.shutemov@linux.intel.com
  Cc: dwmw@amazon.co.uk, linux-kernel@vger.kernel.org,
	seanjc@google.com, x86@kernel.org, sagis@google.com,
	hpa@zytor.com, Chatre, Reinette, Williams, Dan J,
	thomas.lendacky@amd.com, bhe@redhat.com, pbonzini@redhat.com,
	nik.borisov@suse.com, ashish.kalra@amd.com, Yamahata, Isaku

On Wed, 2025-03-12 at 23:27 +0000, Edgecombe, Rick P wrote:
> On Thu, 2025-03-13 at 00:34 +1300, Kai Huang wrote:
> > The impact to userspace is the users will get an error when loading the
> > kexec/kdump kernel image:
> > 
> >   kexec_load failed: Operation not supported
> > 
> > This might be confusing to the users, thus also print the reason in the
> > dmesg:
> > 
> >   [..] kexec: not allowed on platform with tdx_pw_mce bug.
> 
> Continuing an internal discussion... A problem with the plan to more simply
> start support for kexec by not supporting the errata platforms is that when
> these platforms configure TDX they will lose kexec by default.
> 
> Probably a better default for a lot of kernels would be to have kexec work by
> default, and require opt-in to use TDX (and lose kexec). One idea was a kernel
> parameter be required to enable TDX on those platforms. But then we are starting
> to add complexity to avoid other complexity (the errata platform kexec support).
> 
> Still, it may be a net win on complexity.

We can add a kernel parameter 'tdx_host={on|off}' and skip all TDX code (thus no
erratum detection) when it is off.  I suppose it will be useful in general
anyway even w/o the context of kexec.


* Re: [RFC PATCH 3/5] x86/kexec: Disable kexec/kdump on platforms with TDX partial write erratum
  2025-03-13  0:57     ` Huang, Kai
@ 2025-03-13 17:18       ` Edgecombe, Rick P
  2025-03-13 22:32         ` Huang, Kai
  0 siblings, 1 reply; 34+ messages in thread
From: Edgecombe, Rick P @ 2025-03-13 17:18 UTC (permalink / raw)
  To: tglx@linutronix.de, peterz@infradead.org, mingo@redhat.com,
	Hansen, Dave, Huang, Kai, bp@alien8.de,
	kirill.shutemov@linux.intel.com
  Cc: ashish.kalra@amd.com, Yamahata, Isaku, seanjc@google.com,
	x86@kernel.org, sagis@google.com, hpa@zytor.com, Chatre, Reinette,
	Williams, Dan J, thomas.lendacky@amd.com,
	linux-kernel@vger.kernel.org, pbonzini@redhat.com,
	dwmw@amazon.co.uk, bhe@redhat.com, nik.borisov@suse.com

On Thu, 2025-03-13 at 00:57 +0000, Huang, Kai wrote:
> > Continuing an internal discussion... A problem with the plan to more simply
> > start support for kexec by not supporting the errata platforms is that when
> > these platforms configure TDX they will lose kexec by default.
> > 
> > Probably a better default for a lot of kernels would be to have kexec work
> > by
> > default, and require opt-in to use TDX (and lose kexec). One idea was a
> > kernel
> > parameter be required to enable TDX on those platforms. But then we are
> > starting
> > to add complexity to avoid other complexity (the errata platform kexec
> > support).
> > 
> > Still, it may be a net win on complexity.
> 
> We can add a kernel parameter 'tdx_host={on|off}' and skip all TDX code (thus
> no
> erratum detection) when it is off.  I suppose it will be useful in general
> anyway even w/o the context of kexec.

What exactly are you thinking? Add a tdx_host parameter, but what is the default
behavior? When tdx_host=on with the errata, kexec must still be disabled, right?
Better to return an error, than proceed and crash.


* Re: [RFC PATCH 1/5] x86/kexec: Do unconditional WBINVD for bare-metal in stop_this_cpu()
  2025-03-12 11:34 ` [RFC PATCH 1/5] x86/kexec: Do unconditional WBINVD for bare-metal in stop_this_cpu() Kai Huang
@ 2025-03-13 18:40   ` Edgecombe, Rick P
  2025-03-14 10:03     ` Huang, Kai
                       ` (3 more replies)
  0 siblings, 4 replies; 34+ messages in thread
From: Edgecombe, Rick P @ 2025-03-13 18:40 UTC (permalink / raw)
  To: tglx@linutronix.de, peterz@infradead.org, mingo@redhat.com,
	Hansen, Dave, Huang, Kai, bp@alien8.de,
	kirill.shutemov@linux.intel.com
  Cc: nik.borisov@suse.com, bhe@redhat.com, seanjc@google.com,
	x86@kernel.org, dyoung@redhat.com, sagis@google.com,
	hpa@zytor.com, Chatre, Reinette, Williams, Dan J,
	thomas.lendacky@amd.com, pbonzini@redhat.com,
	ashish.kalra@amd.com, Yamahata, Isaku,
	linux-kernel@vger.kernel.org, dwmw@amazon.co.uk

On Thu, 2025-03-13 at 00:34 +1300, Kai Huang wrote:
> TL;DR:
> 
> Change to do unconditional WBINVD in stop_this_cpu() for bare metal to
> cover kexec support for both AMD SME and Intel TDX.  Previously there
> _was_ some issue preventing from doing so but now it has been fixed.
                             ^ Adding "the kernel" here would read a little
cleaner to me.


When I read "some issue" I start assuming it wasn't fully debugged and remains
an issue that no one understands. But below it sounds like it was root-caused.

> Long version:

It might make this long version easier to digest if it started with a "tell
them what you are going to tell them" paragraph.

> 
> AMD SME uses the C-bit to determine whether to encrypt the memory or
> not.  For the same physical memory address, dirty cachelines with and
> without the C-bit can coexist and the CPU can flush them back to memory
> in random order.  To support kexec for SME, the old kernel uses WBINVD
> to flush cache before booting to the new kernel so that no stale dirty
> cacheline are left over by the old kernel which could otherwise corrupt
> the new kernel's memory.
> 
> TDX uses 'KeyID' bits in the physical address for memory encryption and
> has the same requirement.  To support kexec for TDX, the old kernel
> needs to flush cache of TDX private memory.

This paragraph is a little jarring because it's not clear how it follows from
the first paragraph. It helps the reader to give some hints on how they should
organize the information as they go along. If it's too much of an info dump, it
puts an extra burden. They have to try to hold all of the facts in their head
until they can put together the bigger picture themselves.

The extra info about TDX using KeyID also seems unnecessary to the point (IIUC).

> 
> Currently, the kernel only performs WBINVD in stop_this_cpu() when SME
> is supported by hardware.  Perform unconditional WBINVD to support TDX
> instead of adding one more vendor-specific check.  Kexec is a slow path,
> and the additional WBINVD is acceptable for the sake of simplicity and
> maintainability.

Out of curiosity, do you know why this was not already needed for non-self snoop
CPUs? Why can't there be other cache modes that get written back after the new
kernel starts using the memory for something else?

> 
> Only do WBINVD on bare-metal.  Doing WBINVD in guest triggers unexpected
                                                ^the
> exception (#VE or #VC) for TDX and SEV-ES/SEV-SNP guests and the guest
> may not be able to handle such exception (e.g., TDX guests panics if it
                                                             ^panic
> sees such #VE).

It's a small thing, but I think you could skip the #VE or #VC info in here. All
they need to know to understand this patch is that TDX and some SEV kernels
cannot handle WBINVD. And TDX panics. (does SEV not?)

> 
> History of SME and kexec WBINVD:
> 
> There _was_ an issue preventing doing unconditional WBINVD but that has
> been fixed.
> 
> Initial SME kexec support added an unconditional WBINVD in commit
> 
>   bba4ed011a52: ("x86/mm, kexec: Allow kexec to be used with SME")
> 
> This commit caused different Intel systems to hang or reset.
> 
> Without a clear root cause, a later commit
> 
>   f23d74f6c66c: ("x86/mm: Rework wbinvd, hlt operation in stop_this_cpu()")
> 
> fixed the Intel system hang issues by only doing WBINVD when hardware
> supports SME.
> 
> A corner case [*] revealed the root cause of the system hang issues and
> was fixed by commit
> 
>   1f5e7eb7868e: ("x86/smp: Make stop_other_cpus() more robust")
> 
> See [1][2] for more information.
> 
> Further testing of doing unconditional WBINVD based on the above fix on
> the problematic machines (that issues were originally reported)
> confirmed the issues couldn't be reproduced.
> 
> See [3][4] for more information.
> 
> Therefore, it is safe to do unconditional WBINVD for bare-metal now.

Instead of a play-by-play, it might be more informative to summarize the edges
covered in this history:
 - Don't do anything that writes memory between wbinvd and native_halt(). This
includes function calls that touch the stack.
 - Avoid issuing wbinvd on multiple CPUs at the same time. As tglx implies it is
too expensive.
 - Don't race reboot by watching cpumask instead of num_online_cpus(). But there
is a race still.

Hmm, on the last one tglx says:
    The cpumask cannot plug all holes either, but it's better than a raw
    counter and allows to restrict the NMI fallback IPI to be sent only the
    CPUs which have not reported within the timeout window

Are we opening up more platforms to a race by unconditionally doing the wbinvd?
Can we be clarify that nothing bad happens if we lose the race? (and is it
true?)

> 
> [*] The commit didn't check whether the CPUID leaf is available or not.
> Reading an unsupported CPUID leaf on Intel returns garbage, resulting in an
> unintended WBINVD which caused the issue (the subsequent analysis revealed
> the final root cause).  The corner case was independently
> fixed by commit
> 
>   9b040453d444: ("x86/smp: Dont access non-existing CPUID leaf")
> 
> Link: https://lore.kernel.org/lkml/28a494ca-3173-4072-921c-6c5f5b257e79@amd.com/ [1]
> Link: https://lore.kernel.org/lkml/24844584-8031-4b58-ba5c-f85ef2f4c718@amd.com/ [2]
> Link: https://lore.kernel.org/lkml/20240221092856.GAZdXCWGJL7c9KLewv@fat_crate.local/ [3]
> Link: https://lore.kernel.org/lkml/CALu+AoSZkq1kz-xjvHkkuJ3C71d0SM5ibEJurdgmkZqZvNp2dQ@mail.gmail.com/ [4]
> Signed-off-by: Kai Huang <kai.huang@intel.com>
> Suggested-by: Borislav Petkov <bp@alien8.de>
> Cc: Tom Lendacky <thomas.lendacky@amd.com>
> Cc: Dave Young <dyoung@redhat.com>
> Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
> ---
>  arch/x86/kernel/process.c | 21 +++++++++++----------
>  1 file changed, 11 insertions(+), 10 deletions(-)
> 
> diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
> index 9c75d701011f..8475d9d2d8c4 100644
> --- a/arch/x86/kernel/process.c
> +++ b/arch/x86/kernel/process.c
> @@ -819,18 +819,19 @@ void __noreturn stop_this_cpu(void *dummy)
>  	mcheck_cpu_clear(c);
>  
>  	/*
> -	 * Use wbinvd on processors that support SME. This provides support
> -	 * for performing a successful kexec when going from SME inactive
> -	 * to SME active (or vice-versa). The cache must be cleared so that
> -	 * if there are entries with the same physical address, both with and
> -	 * without the encryption bit, they don't race each other when flushed
> -	 * and potentially end up with the wrong entry being committed to
> -	 * memory.
> +	 * Use wbinvd to support kexec for both SME (from inactive to active
> +	 * or vice-versa) and TDX.  The cache must be cleared so that if there
> +	 * are entries with the same physical address, both with and without
> +	 * the encryption bit(s), they don't race each other when flushed and
> +	 * potentially end up with the wrong entry being committed to memory.
>  	 *
> -	 * Test the CPUID bit directly because the machine might've cleared
> -	 * X86_FEATURE_SME due to cmdline options.
> +	 * stop_this_cpu() isn't a fast path, just do unconditional WBINVD for
> +	 * bare-metal to cover both SME and TDX.  Do not do WBINVD in a guest
> +	 * since performing one will result in an exception (#VE or #VC) for
> +	 * TDX or SEV-ES/SEV-SNP guests which the guest may not be able to
> +	 * handle (e.g., TDX guest panics if it sees #VE).
>  	 */
> -	if (c->extended_cpuid_level >= 0x8000001f && (cpuid_eax(0x8000001f) & BIT(0)))
> +	if (!boot_cpu_has(X86_FEATURE_HYPERVISOR))
>  		wbinvd();

I see that this already has Tom's RB, but I'm not sure how this works for AMD.
The original SME patch tried to avoid writing to memory by putting the wbinvd
immediately before the halt, but today it is further away. Below this hunk there
are more instructions that could dirty memory before the halt.  Ohh... it's new:
9 months ago 26ba7353caaa ("x86/smp: Add smp_ops.stop_this_cpu() callback") added
a function call that would touch the stack. I think it's wrong? And probably
introduced after this patch was originally written.

Then the cpuid_eax() could be non-inlined, but probably not. But the
boot_cpu_has() added in this patch could call out to kasan and dirty the stack.

So I think the existing SME case might be theoretically incorrect, and if so
this makes things very slightly worse.

>  
>  	/*



* Re: [RFC PATCH 3/5] x86/kexec: Disable kexec/kdump on platforms with TDX partial write erratum
  2025-03-13 17:18       ` Edgecombe, Rick P
@ 2025-03-13 22:32         ` Huang, Kai
  2025-03-13 22:47           ` Edgecombe, Rick P
  0 siblings, 1 reply; 34+ messages in thread
From: Huang, Kai @ 2025-03-13 22:32 UTC (permalink / raw)
  To: tglx@linutronix.de, peterz@infradead.org, mingo@redhat.com,
	Hansen, Dave, Edgecombe, Rick P, bp@alien8.de,
	kirill.shutemov@linux.intel.com
  Cc: nik.borisov@suse.com, bhe@redhat.com, seanjc@google.com,
	x86@kernel.org, sagis@google.com, hpa@zytor.com, Chatre, Reinette,
	Williams, Dan J, thomas.lendacky@amd.com, ashish.kalra@amd.com,
	pbonzini@redhat.com, Yamahata, Isaku,
	linux-kernel@vger.kernel.org, dwmw@amazon.co.uk

On Thu, 2025-03-13 at 17:18 +0000, Edgecombe, Rick P wrote:
> On Thu, 2025-03-13 at 00:57 +0000, Huang, Kai wrote:
> > > Continuing an internal discussion... A problem with the plan to more simply
> > > start support for kexec by not supporting the errata platforms is that when
> > > these platforms configure TDX they will lose kexec by default.
> > > 
> > > Probably a better default for a lot of kernels would be to have kexec work
> > > by
> > > default, and require opt-in to use TDX (and lose kexec). One idea was a
> > > kernel
> > > parameter be required to enable TDX on those platforms. But then we are
> > > starting
> > > to add complexity to avoid other complexity (the errata platform kexec
> > > support).
> > > 
> > > Still, it may be a net win on complexity.
> > 
> > We can add a kernel parameter 'tdx_host={on|off}' and skip all TDX code (thus
> > no
> > erratum detection) when it is off.  I suppose it will be useful in general
> > anyway even w/o the context of kexec.
> 
> What exactly are you thinking? Add a tdx_host parameter, but what is the default
> behavior? When tdx_host=on with the errata, kexec must still be disabled, right?
> Better to return an error, than proceed and crash.

The default behaviour is tdx_host=off in order not to disrupt kexec/kdump
behaviour on the TDX platforms with the erratum.  Distros will be able to ship
kernels with both CONFIG_KEXEC_CORE and CONFIG_INTEL_TDX_HOST on, with no
visible impact on users who don't care about TDX.

If the user is interested in TDX, tdx_host=on must be set on the kernel command
line, but in this case the user is expected to know that kexec/kdump only work
normally if the TDX platform doesn't have the erratum -- kexec/kdump are
disabled if the platform has the erratum.


* Re: [RFC PATCH 3/5] x86/kexec: Disable kexec/kdump on platforms with TDX partial write erratum
  2025-03-13 22:32         ` Huang, Kai
@ 2025-03-13 22:47           ` Edgecombe, Rick P
  2025-03-13 23:57             ` Huang, Kai
  0 siblings, 1 reply; 34+ messages in thread
From: Edgecombe, Rick P @ 2025-03-13 22:47 UTC (permalink / raw)
  To: tglx@linutronix.de, peterz@infradead.org, mingo@redhat.com,
	Hansen, Dave, Huang, Kai, bp@alien8.de,
	kirill.shutemov@linux.intel.com
  Cc: dwmw@amazon.co.uk, linux-kernel@vger.kernel.org,
	seanjc@google.com, x86@kernel.org, sagis@google.com,
	hpa@zytor.com, Chatre, Reinette, Williams, Dan J,
	thomas.lendacky@amd.com, bhe@redhat.com, ashish.kalra@amd.com,
	nik.borisov@suse.com, pbonzini@redhat.com, Yamahata, Isaku

On Thu, 2025-03-13 at 22:32 +0000, Huang, Kai wrote:
> > > 
> > > We can add a kernel parameter 'tdx_host={on|off}' and skip all TDX code
> > > (thus
> > > no
> > > erratum detection) when it is off.  I suppose it will be useful in general
> > > anyway even w/o the context of kexec.
> > 
> > What exactly are you thinking? Add a tdx_host parameter, but what is the
> > default
> > behavior? When tdx_host=on with the errata, kexec must still be disabled,
> > right?
> > Better to return an error, than proceed and crash.
> 
> The default behaviour is tdx_host=off in order to not disrupt kexec/kdump
> behaviour on the TDX platforms with erratum.  The distros will be able to ship
> kernels with both CONFIG_KEXEC_CORE and CONFIG_INTEL_TDX_HOST on, and no
> visible
> impact to the user who doesn't care about TDX.
> 
> If the user is interested in TDX, tdx_host=on must be set in the kernel
> command
> line, but in this case user is expected to know kexec/kdump can only work
> normally if the TDX platform doesn't have the erratum -- kexec/kdump are
> disabled if the platform has the erratum.

So this will switch all of TDX to be default off then, unless the kernel gets a
parameter set. In which case we could also just unlock the Kconfig with just one
small change. TDX and kexec would still be mutually exclusive, just at runtime.
We should try to flag Paolo and see what he thinks.

Or is the proposal to only be default tdx_host=off on the errata platforms? And
tdx_host=on otherwise?

It seems like this series is close though, and would probably be wanted sooner
rather than later.





* Re: [RFC PATCH 2/5] x86/kexec: Do unconditional WBINVD for bare-metal in relocate_kernel()
  2025-03-12 11:34 ` [RFC PATCH 2/5] x86/kexec: Do unconditional WBINVD for bare-metal in relocate_kernel() Kai Huang
@ 2025-03-13 23:17   ` Edgecombe, Rick P
  2025-03-14  9:44     ` Huang, Kai
  0 siblings, 1 reply; 34+ messages in thread
From: Edgecombe, Rick P @ 2025-03-13 23:17 UTC (permalink / raw)
  To: tglx@linutronix.de, peterz@infradead.org, mingo@redhat.com,
	Hansen, Dave, Huang, Kai, bp@alien8.de,
	kirill.shutemov@linux.intel.com
  Cc: ashish.kalra@amd.com, dyoung@redhat.com, thomas.lendacky@amd.com,
	linux-kernel@vger.kernel.org, seanjc@google.com,
	dwmw@amazon.co.uk, pbonzini@redhat.com, bhe@redhat.com,
	Yamahata, Isaku, nik.borisov@suse.com, Chatre, Reinette,
	hpa@zytor.com, sagis@google.com, david.kaplan@amd.com,
	x86@kernel.org, Williams, Dan J

On Thu, 2025-03-13 at 00:34 +1300, Kai Huang wrote:
> For both SME and TDX, dirty cachelines with and without the encryption
> bit(s) of the same physical memory address can coexist and the CPU can
> flush them back to memory in random order.
> 

A lot going on in this sentence, how about simplifying it:

For SME and TDX, multiple dirty cachelines for the same memory can co-exist, and
the CPU can flush them back to memory in a random order.


>   During kexec, the caches
> must be flushed before jumping to the new kernel to avoid silent memory
> corruption to the new kernel.

During kexec, the caches must be flushed before jumping to the new kernel to
avoid silent memory corruption when a cacheline with a different encryption
property is written back over whatever encryption properties the new kernel is
using.

...it distributes some of the details from the first sentence into the second.
Easier to read or no? I'm not sure.

> 
> The WBINVD in stop_this_cpu() flushes caches for all remote CPUs when
> they are being stopped.  For SME, the WBINVD in relocate_kernel()
> flushes the cache for the last running CPU (which is doing kexec).
> 
> Similarly, to support kexec for TDX host, after stopping all remote CPUs
> with cache flushed, the kernel needs to flush cache for the last running
> CPU.


I mentioned this in a previous version. I think you need to give some hint to
where you are going before you start listing facts. Like:

During kexec, WBINVD needs to be executed on each CPU for TDX and SME. On the
remote CPUs this is covered in stop_this_cpu() for both TDX and SME. For the
kexecing CPU, SME handles this in relocate_kernel(). This leaves the TDX case
for the kexec-ing CPU still to implement.

...it first says the overall problem to solve, then explains what is missing in
the current code to solve it. The reader is already thinking of what the
solutions should be and...

> 
> Use the existing WBINVD in relocate_kernel() to cover TDX host as well.

...they read the solution just as they are wondering about it. The reader can
feel like the change is aligned with their thinking.

> 
> Just do unconditional
> 

"Unconditional". Now I'm starting to think about how unconditional wbinvd will
be.

>  WBINVD to cover both SME and TDX instead of
> sprinkling additional vendor-specific checks.  Kexec is a slow path, and
> the additional WBINVD is acceptable for the sake of simplicity and
> maintainability.
> 
> But only do WBINVD for bare-metal
> 

But wait, now I'm learning it's not unconditional. I need to re-think what I
just evaluated. And I'm doubting the earlier statements because I just got
surprised.

>  because TDX guests and SEV-ES/SEV-SNP
> guests will get unexpected (and yet unnecessary) exception (#VE or #VC)
> which the kernel is unable to handle at the time of relocate_kernel()
> since the kernel has torn down the IDT.
> 
> Remove the host_mem_enc_active local variable and directly use
> !cpu_feature_enabled(X86_FEATURE_HYPERVISOR) as an argument of calling
> relocate_kernel().
> 

Start with the problem here. It just describes another change and I'm not sure
why when I start reading it.

By problem I mean that host_mem_enc_active doesn't fit the conditional anymore,
so it needs to be changed.

>   cpu_feature_enabled() is always inline but not a

I was just noticing this on the other patch. Actually it could call into some
kasan stuff.

> function call, thus it is safe to use after load_segments() when call
> depth tracking is enabled.

This function call tracking stuff is a wild card at the end. What about
describing the rules this function needs to follow due to call depth tracking,
and explain why the change does that.

> 
> Signed-off-by: Kai Huang <kai.huang@intel.com>
> Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Cc: Tom Lendacky <thomas.lendacky@amd.com>
> Cc: Dave Young <dyoung@redhat.com>
> Cc: David Kaplan <david.kaplan@amd.com>
> Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
> Tested-by: David Kaplan <david.kaplan@amd.com>
> ---
>  arch/x86/include/asm/kexec.h         |  2 +-
>  arch/x86/kernel/machine_kexec_64.c   | 14 ++++++--------
>  arch/x86/kernel/relocate_kernel_64.S | 15 ++++++++++-----
>  3 files changed, 17 insertions(+), 14 deletions(-)
> 
> diff --git a/arch/x86/include/asm/kexec.h b/arch/x86/include/asm/kexec.h
> index 8ad187462b68..48c313575262 100644
> --- a/arch/x86/include/asm/kexec.h
> +++ b/arch/x86/include/asm/kexec.h
> @@ -123,7 +123,7 @@ relocate_kernel_fn(unsigned long indirection_page,
>  		   unsigned long pa_control_page,
>  		   unsigned long start_address,
>  		   unsigned int preserve_context,
> -		   unsigned int host_mem_enc_active);
> +		   unsigned int bare_metal);
>  #endif
>  extern relocate_kernel_fn relocate_kernel;
>  #define ARCH_HAS_KIMAGE_ARCH
> diff --git a/arch/x86/kernel/machine_kexec_64.c b/arch/x86/kernel/machine_kexec_64.c
> index a68f5a0a9f37..0e9808eeb63e 100644
> --- a/arch/x86/kernel/machine_kexec_64.c
> +++ b/arch/x86/kernel/machine_kexec_64.c
> @@ -346,16 +346,9 @@ void __nocfi machine_kexec(struct kimage *image)
>  {
>  	unsigned long reloc_start = (unsigned long)__relocate_kernel_start;
>  	relocate_kernel_fn *relocate_kernel_ptr;
> -	unsigned int host_mem_enc_active;
>  	int save_ftrace_enabled;
>  	void *control_page;
>  
> -	/*
> -	 * This must be done before load_segments() since if call depth tracking
> -	 * is used then GS must be valid to make any function calls.
> -	 */
> -	host_mem_enc_active = cc_platform_has(CC_ATTR_HOST_MEM_ENCRYPT);
> -
>  #ifdef CONFIG_KEXEC_JUMP
>  	if (image->preserve_context)
>  		save_processor_state();
> @@ -398,6 +391,11 @@ void __nocfi machine_kexec(struct kimage *image)
>  	 *
>  	 * I take advantage of this here by force loading the
>  	 * segments, before I zap the gdt with an invalid value.
> +	 *
> +	 * load_segments() resets GS to 0.  Don't make any function call
> +	 * after here since call depth tracking uses per-CPU variables to
> +	 * operate (relocate_kernel() is explicitly ignored by call depth
> +	 * tracking).

I think I suggested you should call out the opportunistic change here in the
log. Did you disagree?

>  	 */
>  	load_segments();
>  	/*
> @@ -412,7 +410,7 @@ void __nocfi machine_kexec(struct kimage *image)
>  					   virt_to_phys(control_page),
>  					   image->start,
>  					   image->preserve_context,
> -					   host_mem_enc_active);
> +					   !cpu_feature_enabled(X86_FEATURE_HYPERVISOR));
>  
>  #ifdef CONFIG_KEXEC_JUMP
>  	if (image->preserve_context)
> diff --git a/arch/x86/kernel/relocate_kernel_64.S b/arch/x86/kernel/relocate_kernel_64.S
> index b44d8863e57f..dc1a59cd8139 100644
> --- a/arch/x86/kernel/relocate_kernel_64.S
> +++ b/arch/x86/kernel/relocate_kernel_64.S
> @@ -50,7 +50,7 @@ SYM_CODE_START_NOALIGN(relocate_kernel)
>  	 * %rsi pa_control_page
>  	 * %rdx start address
>  	 * %rcx preserve_context
> -	 * %r8  host_mem_enc_active
> +	 * %r8  bare_metal
>  	 */
>  
>  	/* Save the CPU context, used for jumping back */
> @@ -107,7 +107,7 @@ SYM_CODE_START_LOCAL_NOALIGN(identity_mapped)
>  	/*
>  	 * %rdi	indirection page
>  	 * %rdx start address
> -	 * %r8 host_mem_enc_active
> +	 * %r8 bare_metal
>  	 * %r9 page table page
>  	 * %r11 preserve_context
>  	 * %r13 original CR4 when relocate_kernel() was invoked
> @@ -156,14 +156,19 @@ SYM_CODE_START_LOCAL_NOALIGN(identity_mapped)
>  	movq	%r9, %cr3
>  
>  	/*
> -	 * If SME is active, there could be old encrypted cache line
> +	 * If SME/TDX is active, there could be old encrypted cache line
>  	 * entries that will conflict with the now unencrypted memory
>  	 * used by kexec. Flush the caches before copying the kernel.
> +	 *
> +	 * Do WBINVD for bare-metal only to cover both SME and TDX. Doing
> +	 * WBINVD in guest results in an unexpected exception (#VE or #VC)
> +	 * for TDX and SEV-ES/SNP guests which then crashes the guest (the
> +	 * kernel has torn down the IDT).
>  	 */
>  	testq	%r8, %r8
> -	jz .Lsme_off
> +	jz .Lno_wbinvd
>  	wbinvd
> -.Lsme_off:
> +.Lno_wbinvd:
>  
>  	call	swap_pages
>  



* Re: [RFC PATCH 3/5] x86/kexec: Disable kexec/kdump on platforms with TDX partial write erratum
  2025-03-13 22:47           ` Edgecombe, Rick P
@ 2025-03-13 23:57             ` Huang, Kai
  2025-03-14 19:03               ` Edgecombe, Rick P
  0 siblings, 1 reply; 34+ messages in thread
From: Huang, Kai @ 2025-03-13 23:57 UTC (permalink / raw)
  To: tglx@linutronix.de, peterz@infradead.org, mingo@redhat.com,
	Hansen, Dave, Edgecombe, Rick P, bp@alien8.de,
	kirill.shutemov@linux.intel.com
  Cc: pbonzini@redhat.com, Yamahata, Isaku, seanjc@google.com,
	x86@kernel.org, sagis@google.com, hpa@zytor.com, Chatre, Reinette,
	Williams, Dan J, thomas.lendacky@amd.com,
	linux-kernel@vger.kernel.org, ashish.kalra@amd.com,
	dwmw@amazon.co.uk, bhe@redhat.com, nik.borisov@suse.com

On Thu, 2025-03-13 at 22:47 +0000, Edgecombe, Rick P wrote:
> On Thu, 2025-03-13 at 22:32 +0000, Huang, Kai wrote:
> > > > 
> > > > We can add a kernel parameter 'tdx_host={on|off}' and skip all TDX code
> > > > (thus
> > > > no
> > > > erratum detection) when it is off.  I suppose it will be useful in general
> > > > anyway even w/o the context of kexec.
> > > 
> > > What exactly are you thinking? Add a tdx_host parameter, but what is the
> > > default
> > > behavior? When tdx_host=on with the errata, kexec must still be disabled,
> > > right?
> > > Better to return an error, than proceed and crash.
> > 
> > The default behaviour is tdx_host=off in order to not disrupt kexec/kdump
> > behaviour on the TDX platforms with erratum.  The distros will be able to ship
> > kernels with both CONFIG_KEXEC_CORE and CONFIG_INTEL_TDX_HOST on, and no
> > visible
> > impact to the user who doesn't care about TDX.
> > 
> > If the user is interested in TDX, tdx_host=on must be set in the kernel
> > command
> > line, but in this case user is expected to know kexec/kdump can only work
> > normally if the TDX platform doesn't have the erratum -- kexec/kdump are
> > disabled if the platform has the erratum.
> 
> So this will switch all of TDX to be default off then, unless the kernel gets a
> parameter set. 
> 

Currently TDX is also off by default in KVM.

> In which case we could also just unlock the Kconfig with just one
> small change. TDX and kexec would still mutually exclusive, but just at runtime.

Yeah, I am thinking this too, given the "keyID 0 integrity" work is still
ongoing.

> We should try to flag Paolo and see what he thinks.

I'd appreciate it if you could help with that.

> 
> Or is the proposal to only be default tdx_host=off on the errata platforms? And
> tdx_host=on otherwise?

The tricky thing is, naturally, we want to skip all the code in tdx_init() if
tdx_host=off, because there's no reason to do that detection/initialization if
we are not going to use TDX, e.g., we don't need this one:

	register_memory_notifier(&tdx_memory_nb);

... that means the erratum detection code will be skipped too.

If we only make tdx_host=off the default on erratum platforms, then we need to
do cleanup (e.g., unregister the above memory notifier) after the erratum is
detected.

This isn't nice and seems hacky.

Anyway, I don't see a problem with making tdx_host=off the default; as
mentioned above, TDX is off by default in KVM.



* Re: [RFC PATCH 2/5] x86/kexec: Do unconditional WBINVD for bare-metal in relocate_kernel()
  2025-03-13 23:17   ` Edgecombe, Rick P
@ 2025-03-14  9:44     ` Huang, Kai
  2025-03-18  3:54       ` Edgecombe, Rick P
  0 siblings, 1 reply; 34+ messages in thread
From: Huang, Kai @ 2025-03-14  9:44 UTC (permalink / raw)
  To: tglx@linutronix.de, peterz@infradead.org, mingo@redhat.com,
	Hansen, Dave, Edgecombe, Rick P, bp@alien8.de,
	kirill.shutemov@linux.intel.com
  Cc: ashish.kalra@amd.com, dyoung@redhat.com, thomas.lendacky@amd.com,
	linux-kernel@vger.kernel.org, seanjc@google.com,
	dwmw@amazon.co.uk, pbonzini@redhat.com, bhe@redhat.com,
	Yamahata, Isaku, nik.borisov@suse.com, Chatre, Reinette,
	hpa@zytor.com, sagis@google.com, david.kaplan@amd.com,
	x86@kernel.org, Williams, Dan J

On Thu, 2025-03-13 at 23:17 +0000, Edgecombe, Rick P wrote:
> On Thu, 2025-03-13 at 00:34 +1300, Kai Huang wrote:
> > For both SME and TDX, dirty cachelines with and without the encryption
> > bit(s) of the same physical memory address can coexist and the CPU can
> > flush them back to memory in random order.
> > 
> 
> A lot going on in this sentence, how about simplifying it:
> 
> For SME and TDX, multiple dirty cachelines for the same memory can co-exist, and
> the CPU can flush them back to memory in a random order.

"multiple" isn't accurate at least for SME.  How about:

 For SME and TDX, dirty cachelines with and without encryption bit(s) of 
 the same memory can coexist, and the CPU can flush them back to memory 
 in random order.

> 
> 
> >   During kexec, the caches
> > must be flushed before jumping to the new kernel to avoid silent memory
> > corruption to the new kernel.
> 
> During kexec, the caches must be flushed before jumping to the new kernel to
> avoid silent memory corruption when a cacheline with a different encryption
> property is written back over whatever encryption properties the new kernel is
> using.
> 
> ...it distributes some of the details from the first sentence into the second.
> Easier to read or no? I'm not sure.

I don't have a strong opinion; I see little difference.

I tend to keep the original wording since people have already reviewed it.

> 
> > 
> > The WBINVD in stop_this_cpu() flushes caches for all remote CPUs when
> > they are being stopped.  For SME, the WBINVD in relocate_kernel()
> > flushes the cache for the last running CPU (which is doing kexec).
> > 
> > Similarly, to support kexec for TDX host, after stopping all remote CPUs
> > with cache flushed, the kernel needs to flush cache for the last running
> > CPU.
> 
> 
> I mentioned this in a previous version. I think you need to give some hint to
> where you are going before you start listing facts. Like:
> 
> During kexec, WBINVD needs to be executed on each CPU for TDX and SME. 
> 
> On the
> remote CPUs this is covered in stop_this_cpu() for both TDX and SME. For the
> kexecing CPU, SME handles this in relocate_kernel(). This leaves the TDX case
> for the kexec-ing CPU still to implement.
> 
> ...it first says the overall problem to solve, then explains what is missing in
> the current code to solve it. The reader is already thinking of what the
> solutions should be and...

Will do.

> 
> > 
> > Use the existing WBINVD in relocate_kernel() to cover TDX host as well.
> 
> ...they read the solution just as they are wondering about it. The reader can
> feel like the change is aligned with their thinking.
> 
> > 
> > Just do unconditional
> > 
> 
> "Unconditional". Now I'm starting to think about how unconditional wbinvd will
> be.
> 
> >  WBINVD to cover both SME and TDX instead of
> > sprinkling additional vendor-specific checks.  Kexec is a slow path, and
> > the additional WBINVD is acceptable for the sake of simplicity and
> > maintainability.
> > 
> > But only do WBINVD for bare-metal
> > 
> 
> But wait, now I'm learning it's not unconditional. I need to re-think what I
> just evaluated. And I'm doubting the earlier statements because I just got
> surprised.

Do you mean you were surprised by my saying "unconditional" first and then
"for bare-metal"?  This is literally what the patch title says.  I don't see any
problem, but I can mention the "for bare-metal" part when I first say
"unconditional" above.

> 
> >  because TDX guests and SEV-ES/SEV-SNP
> > guests will get unexpected (and yet unnecessary) exception (#VE or #VC)
> > which the kernel is unable to handle at the time of relocate_kernel()
> > since the kernel has torn down the IDT.
> > 
> > Remove the host_mem_enc_active local variable and directly use
> > !cpu_feature_enabled(X86_FEATURE_HYPERVISOR) as an argument of calling
> > relocate_kernel().
> > 
> 
> Start with the problem here. It just describes another change and I'm not sure
> why when I start reading it.
> 
> By problem I mean that host_mem_enc_active doesn't fit the conditional anymore,
> so it needs to be changed.
> 
> >   cpu_feature_enabled() is always inline but not a
> 
> I was just noticing this on the other patch. Actually it could call into some
> kasan stuff.

Can you be more specific?  I am not seeing it.

> 
> > function call, thus it is safe to use after load_segments() when call
> > depth tracking is enabled.
> 
> This function call tracking stuff is a wild card at the end. What about
> describing the rules this function needs to follow due to call depth tracking,
> and explain why the change does that.

Below is what I had before.  Do you think it's better?  I replaced it with the
current text since Reinette commented that the original (which contains the
history etc.) was not necessary.

"
Commit 93c1800b3799 ("x86/kexec: Fix bug with call depth tracking")
moved calling 'cc_platform_has(CC_ATTR_HOST_MEM_ENCRYPT)' as an argument
of relocate_kernel() to an earlier place before load_segments() by
adding a variable 'host_mem_enc_active'.  The reason was the call to
cc_platform_has() after load_segments() caused a fault and system crash
when call depth tracking is active because load_segments() resets GS to
0 but call depth tracking uses per-CPU variable to operate.

Use !cpu_feature_enabled(X86_FEATURE_HYPERVISOR) to check whether the
kernel runs on bare-metal.  cpu_feature_enabled() is always inline but
not a function call, thus it is safe to use it after load_segments()
when call depth tracking is enabled.  Remove the 'host_mem_enc_active'
variable and use cpu_feature_enabled() directly as the argument when
calling relocate_kernel().
"


[...]

> > --- a/arch/x86/kernel/machine_kexec_64.c
> > +++ b/arch/x86/kernel/machine_kexec_64.c
> > @@ -346,16 +346,9 @@ void __nocfi machine_kexec(struct kimage *image)
> >  {
> >  	unsigned long reloc_start = (unsigned long)__relocate_kernel_start;
> >  	relocate_kernel_fn *relocate_kernel_ptr;
> > -	unsigned int host_mem_enc_active;
> >  	int save_ftrace_enabled;
> >  	void *control_page;
> >  
> > -	/*
> > -	 * This must be done before load_segments() since if call depth tracking
> > -	 * is used then GS must be valid to make any function calls.
> > -	 */
> > -	host_mem_enc_active = cc_platform_has(CC_ATTR_HOST_MEM_ENCRYPT);
> > -
> >  #ifdef CONFIG_KEXEC_JUMP
> >  	if (image->preserve_context)
> >  		save_processor_state();
> > @@ -398,6 +391,11 @@ void __nocfi machine_kexec(struct kimage *image)
> >  	 *
> >  	 * I take advantage of this here by force loading the
> >  	 * segments, before I zap the gdt with an invalid value.
> > +	 *
> > +	 * load_segments() resets GS to 0.  Don't make any function call
> > +	 * after here since call depth tracking uses per-CPU variables to
> > +	 * operate (relocate_kernel() is explicitly ignored by call depth
> > +	 * tracking).
> 
> I think I suggested you should call out the opportunistic change here in the
> log. Did you disagree?

I replied that this was suggested by David Kaplan, but I guess I forgot to reply
to the "opportunistic" part.

I don't think this is an opportunistic change.  It's a valid comment now that
the 'host_mem_enc_active' variable and the comment around it have been removed.




^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC PATCH 1/5] x86/kexec: Do unconditional WBINVD for bare-metal in stop_this_cpu()
  2025-03-13 18:40   ` Edgecombe, Rick P
@ 2025-03-14 10:03     ` Huang, Kai
  2025-03-14 15:11     ` Tom Lendacky
                       ` (2 subsequent siblings)
  3 siblings, 0 replies; 34+ messages in thread
From: Huang, Kai @ 2025-03-14 10:03 UTC (permalink / raw)
  To: tglx@linutronix.de, peterz@infradead.org, mingo@redhat.com,
	Hansen, Dave, Edgecombe, Rick P, bp@alien8.de,
	kirill.shutemov@linux.intel.com
  Cc: dwmw@amazon.co.uk, dyoung@redhat.com, seanjc@google.com,
	x86@kernel.org, linux-kernel@vger.kernel.org, sagis@google.com,
	hpa@zytor.com, Chatre, Reinette, bhe@redhat.com, Williams, Dan J,
	pbonzini@redhat.com, thomas.lendacky@amd.com,
	nik.borisov@suse.com, ashish.kalra@amd.com, Yamahata, Isaku

On Thu, 2025-03-13 at 18:40 +0000, Edgecombe, Rick P wrote:
> > diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
> > index 9c75d701011f..8475d9d2d8c4 100644
> > --- a/arch/x86/kernel/process.c
> > +++ b/arch/x86/kernel/process.c
> > @@ -819,18 +819,19 @@ void __noreturn stop_this_cpu(void *dummy)
> >   	mcheck_cpu_clear(c);
> >   
> >   	/*
> > -	 * Use wbinvd on processors that support SME. This provides support
> > -	 * for performing a successful kexec when going from SME inactive
> > -	 * to SME active (or vice-versa). The cache must be cleared so that
> > -	 * if there are entries with the same physical address, both with and
> > -	 * without the encryption bit, they don't race each other when flushed
> > -	 * and potentially end up with the wrong entry being committed to
> > -	 * memory.
> > +	 * Use wbinvd to support kexec for both SME (from inactive to active
> > +	 * or vice-versa) and TDX.  The cache must be cleared so that if there
> > +	 * are entries with the same physical address, both with and without
> > +	 * the encryption bit(s), they don't race each other when flushed and
> > +	 * potentially end up with the wrong entry being committed to memory.
> >   	 *
> > -	 * Test the CPUID bit directly because the machine might've cleared
> > -	 * X86_FEATURE_SME due to cmdline options.
> > +	 * stop_this_cpu() isn't a fast path, just do unconditional WBINVD for
> > +	 * bare-metal to cover both SME and TDX.  Do not do WBINVD in a guest
> > +	 * since performing one will result in an exception (#VE or #VC) for
> > +	 * TDX or SEV-ES/SEV-SNP guests which the guest may not be able to
> > +	 * handle (e.g., TDX guest panics if it sees #VE).
> >   	 */
> > -	if (c->extended_cpuid_level >= 0x8000001f && (cpuid_eax(0x8000001f) & BIT(0)))
> > +	if (!boot_cpu_has(X86_FEATURE_HYPERVISOR))
> >   		wbinvd();
> 
> I see that this already has Tom's RB, but I'm not sure how this works for AMD.
> The original SME patch tried to avoid writing to memory by putting the wbinvd
> immediately before the halt, but today it is further away. Below this hunk there
> are more instructions that could dirty memory before the halt.  Ohh... it's new.
> 9 months ago 26ba7353caaa ("x86/smp: Add smp_ops.stop_this_cpu() callback") adds
> a function call that would touch the stack. I think it's wrong? And probably
> introduced after this patch was originally written.

(I'll reply to the others separately since I am a little bit sick today.)

That callback is for the TDX guest (which doesn't invoke WBINVD during kexec).
It doesn't impact the SME host.

I kinda agree the code here could be improved, e.g., smp_ops.stop_this_cpu()
should perhaps be called before the WBINVD.

> 
> Then the cpuid_eax() could be non-inlined, but probably not. But the
> boot_cpu_has() added in this patch could call out to kasan and dirty the stack.

Could you elaborate on this?  I don't see how KASAN is involved.

> 
> So I think the existing SME case might be theoretically incorrect, and if so
> this makes things very slightly worse.

As explained above, the function call is empty on an SME host.



* Re: [RFC PATCH 1/5] x86/kexec: Do unconditional WBINVD for bare-metal in stop_this_cpu()
  2025-03-13 18:40   ` Edgecombe, Rick P
  2025-03-14 10:03     ` Huang, Kai
@ 2025-03-14 15:11     ` Tom Lendacky
  2025-03-14 16:28       ` Edgecombe, Rick P
  2025-03-17 10:11     ` Huang, Kai
  2025-03-17 12:52     ` kirill.shutemov
  3 siblings, 1 reply; 34+ messages in thread
From: Tom Lendacky @ 2025-03-14 15:11 UTC (permalink / raw)
  To: Edgecombe, Rick P, tglx@linutronix.de, peterz@infradead.org,
	mingo@redhat.com, Hansen, Dave, Huang, Kai, bp@alien8.de,
	kirill.shutemov@linux.intel.com
  Cc: nik.borisov@suse.com, bhe@redhat.com, seanjc@google.com,
	x86@kernel.org, dyoung@redhat.com, sagis@google.com,
	hpa@zytor.com, Chatre, Reinette, Williams, Dan J,
	pbonzini@redhat.com, ashish.kalra@amd.com, Yamahata, Isaku,
	linux-kernel@vger.kernel.org, dwmw@amazon.co.uk

On 3/13/25 13:40, Edgecombe, Rick P wrote:
> On Thu, 2025-03-13 at 00:34 +1300, Kai Huang wrote:
>> TL;DR:
>>
>> Change to do unconditional WBINVD in stop_this_cpu() for bare metal to
>> cover kexec support for both AMD SME and Intel TDX.  Previously there
>> _was_ some issue preventing from doing so but now it has been fixed.
>                              ^ Adding "the kernel" here would read a little
> cleaner to me.
> 
> 
> When I read "some issue" I start assuming it wasn't fully debugged and it is
> "some issue" that no one knows. But below it sounds like it was root caused.
> 
>> Long version:
> 
> It might make this easier to digest this long version if it start with a "tell
> them what you are going to tell them" paragraph.
> 
>>
>> AMD SME uses the C-bit to determine whether to encrypt the memory or
>> not.  For the same physical memory address, dirty cachelines with and
>> without the C-bit can coexist and the CPU can flush them back to memory
>> in random order.  To support kexec for SME, the old kernel uses WBINVD
>> to flush cache before booting to the new kernel so that no stale dirty
>> cacheline are left over by the old kernel which could otherwise corrupt
>> the new kernel's memory.
>>
>> TDX uses 'KeyID' bits in the physical address for memory encryption and
>> has the same requirement.  To support kexec for TDX, the old kernel
>> needs to flush cache of TDX private memory.
> 
> This paragraph is a little jarring because it's not clear how it follows from
> the first paragraph. It helps the reader to give some hints on how they should
> organize the information as they go along. If it's too much of an info dump, it
> puts an extra burden. They have to try to hold all of the facts in their head
> until they can put together the bigger picture themselves.
> 
> The extra info about TDX using KeyID also seems unnecessary to the point (IIUC).
> 
>>
>> Currently, the kernel only performs WBINVD in stop_this_cpu() when SME
>> is supported by hardware.  Perform unconditional WBINVD to support TDX
>> instead of adding one more vendor-specific check.  Kexec is a slow path,
>> and the additional WBINVD is acceptable for the sake of simplicity and
>> maintainability.
> 
> Out of curiosity, do you know why this was not already needed for non-self snoop
> CPUs? Why can't there be other cache modes that get written back after the new
> kernel starts using the memory for something else?
> 
>>
>> Only do WBINVD on bare-metal.  Doing WBINVD in guest triggers unexpected
>                                                 ^the
>> exception (#VE or #VC) for TDX and SEV-ES/SEV-SNP guests and the guest
>> may not be able to handle such exception (e.g., TDX guests panics if it
>                                                              ^panic
>> sees such #VE).
> 
> It's a small thing, but I think you could skip the #VE or #VC info in here. All
> they need to know to understand this patch is that TDX and some SEV kernels
> cannot handle WBINVD. And TDX panics. (does SEV not?)
> 
>>
>> History of SME and kexec WBINVD:
>>
>> There _was_ an issue preventing doing unconditional WBINVD but that has
>> been fixed.
>>
>> Initial SME kexec support added an unconditional WBINVD in commit
>>
>>   bba4ed011a52: ("x86/mm, kexec: Allow kexec to be used with SME")
>>
>> This commit caused different Intel systems to hang or reset.
>>
>> Without a clear root cause, a later commit
>>
>>   f23d74f6c66c: ("x86/mm: Rework wbinvd, hlt operation in stop_this_cpu()")
>>
>> fixed the Intel system hang issues by only doing WBINVD when hardware
>> supports SME.
>>
>> A corner case [*] revealed the root cause of the system hang issues and
>> was fixed by commit
>>
>>   1f5e7eb7868e: ("x86/smp: Make stop_other_cpus() more robust")
>>
>> See [1][2] for more information.
>>
>> Further testing of doing unconditional WBINVD based on the above fix on
>> the problematic machines (that issues were originally reported)
>> confirmed the issues couldn't be reproduced.
>>
>> See [3][4] for more information.
>>
>> Therefore, it is safe to do unconditional WBINVD for bare-metal now.
> 
> Instead of a play-by-play, it might be more informative to summarize the edges
> covered in this history:
>  - Don't do anything that writes memory between wbinvd and native_halt(). This
> includes function calls that touch the stack.
>  - Avoid issuing wbinvd on multiple CPUs at the same time. As tglx implies it is
> too expensive.
>  - Don't race reboot by watching cpumask instead of num_online_cpus(). But there
> is a race still.
> 
> Hmm, on the last one tglx says:
>     The cpumask cannot plug all holes either, but it's better than a raw
>     counter and allows to restrict the NMI fallback IPI to be sent only the
>     CPUs which have not reported within the timeout window
> 
> Are we opening up more platforms to a race by unconditionally doing the wbinvd?
> Can we be clarify that nothing bad happens if we lose the race? (and is it
> true?)
> 
>>
>> [*] The commit didn't check whether the CPUID leaf is available or not.
>> Making unsupported CPUID leaf on Intel returns garbage resulting in
>> unintended WBINVD which caused some issue (followed by the analysis and
>> the reveal of the final root cause).  The corner case was independently
>> fixed by commit
>>
>>   9b040453d444: ("x86/smp: Dont access non-existing CPUID leaf")
>>
>> Link: https://lore.kernel.org/lkml/28a494ca-3173-4072-921c-6c5f5b257e79@amd.com/ [1]
>> Link: https://lore.kernel.org/lkml/24844584-8031-4b58-ba5c-f85ef2f4c718@amd.com/ [2]
>> Link: https://lore.kernel.org/lkml/20240221092856.GAZdXCWGJL7c9KLewv@fat_crate.local/ [3]
>> Link: https://lore.kernel.org/lkml/CALu+AoSZkq1kz-xjvHkkuJ3C71d0SM5ibEJurdgmkZqZvNp2dQ@mail.gmail.com/ [4]
>> Signed-off-by: Kai Huang <kai.huang@intel.com>
>> Suggested-by: Borislav Petkov <bp@alien8.de>
>> Cc: Tom Lendacky <thomas.lendacky@amd.com>
>> Cc: Dave Young <dyoung@redhat.com>
>> Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
>> ---
>>  arch/x86/kernel/process.c | 21 +++++++++++----------
>>  1 file changed, 11 insertions(+), 10 deletions(-)
>>
>> diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
>> index 9c75d701011f..8475d9d2d8c4 100644
>> --- a/arch/x86/kernel/process.c
>> +++ b/arch/x86/kernel/process.c
>> @@ -819,18 +819,19 @@ void __noreturn stop_this_cpu(void *dummy)
>>  	mcheck_cpu_clear(c);
>>  
>>  	/*
>> -	 * Use wbinvd on processors that support SME. This provides support
>> -	 * for performing a successful kexec when going from SME inactive
>> -	 * to SME active (or vice-versa). The cache must be cleared so that
>> -	 * if there are entries with the same physical address, both with and
>> -	 * without the encryption bit, they don't race each other when flushed
>> -	 * and potentially end up with the wrong entry being committed to
>> -	 * memory.
>> +	 * Use wbinvd to support kexec for both SME (from inactive to active
>> +	 * or vice-versa) and TDX.  The cache must be cleared so that if there
>> +	 * are entries with the same physical address, both with and without
>> +	 * the encryption bit(s), they don't race each other when flushed and
>> +	 * potentially end up with the wrong entry being committed to memory.
>>  	 *
>> -	 * Test the CPUID bit directly because the machine might've cleared
>> -	 * X86_FEATURE_SME due to cmdline options.
>> +	 * stop_this_cpu() isn't a fast path, just do unconditional WBINVD for
>> +	 * bare-metal to cover both SME and TDX.  Do not do WBINVD in a guest
>> +	 * since performing one will result in an exception (#VE or #VC) for
>> +	 * TDX or SEV-ES/SEV-SNP guests which the guest may not be able to
>> +	 * handle (e.g., TDX guest panics if it sees #VE).
>>  	 */
>> -	if (c->extended_cpuid_level >= 0x8000001f && (cpuid_eax(0x8000001f) & BIT(0)))
>> +	if (!boot_cpu_has(X86_FEATURE_HYPERVISOR))
>>  		wbinvd();
> 
> I see that this already has Tom's RB, but I'm not sure how this works for AMD.
> The original SME patch tried to avoid writing to memory by putting the wbinvd
> immediately before the halt, but today it is further away. Below this hunk there
> are more instructions that could dirty memory before the halt.  Ohh... it's new.
> 9 months ago 26ba7353caaa ("x86/smp: Add smp_ops.stop_this_cpu() callback") adds
> a function call that would touch the stack. I think it's wrong? And probably
> introduced after this patch was originally written.
> 
> Then the cpuid_eax() could be non-inlined, but probably not. But the
> boot_cpu_has() added in this patch could call out to kasan and dirty the stack.
> 
> So I think the existing SME case might be theoretically incorrect, and if so
> this makes things very slightly worse.

But the wbinvd() is performed after those checks, so everything gets flushed.

Thanks,
Tom

> 
>>  
>>  	/*
> 


* Re: [RFC PATCH 1/5] x86/kexec: Do unconditional WBINVD for bare-metal in stop_this_cpu()
  2025-03-14 15:11     ` Tom Lendacky
@ 2025-03-14 16:28       ` Edgecombe, Rick P
  2025-03-14 18:18         ` Tom Lendacky
  0 siblings, 1 reply; 34+ messages in thread
From: Edgecombe, Rick P @ 2025-03-14 16:28 UTC (permalink / raw)
  To: Hansen, Dave, Huang, Kai, bp@alien8.de, peterz@infradead.org,
	mingo@redhat.com, thomas.lendacky@amd.com, tglx@linutronix.de,
	kirill.shutemov@linux.intel.com
  Cc: dyoung@redhat.com, seanjc@google.com, x86@kernel.org,
	linux-kernel@vger.kernel.org, sagis@google.com, hpa@zytor.com,
	Chatre, Reinette, bhe@redhat.com, Williams, Dan J,
	pbonzini@redhat.com, ashish.kalra@amd.com, nik.borisov@suse.com,
	Yamahata, Isaku, dwmw@amazon.co.uk

On Fri, 2025-03-14 at 10:11 -0500, Tom Lendacky wrote:
> > I see that this already has Tom's RB, but I'm not sure how this works for
> > AMD.
> > The original SME patch tried to avoid writing to memory by putting the
> > wbinvd
> > immediately before the halt, but today it is further away. Below this hunk
> > there
> > are more instructions that could dirty memory before the halt.  Ohh... it's
> > new.
> > 9 months ago 26ba7353caaa ("x86/smp: Add smp_ops.stop_this_cpu() callback")
> > adds
> > a function call that would touch the stack. I think it's wrong? And probably
> > introduced after this patch was originally written.
> > 
> > Then the cpuid_eax() could be non-inlined, but probably not. But the
> > boot_cpu_has() added in this patch could call out to kasan and dirty the
> > stack.
> > 
> > So I think the existing SME case might be theoretically incorrect, and if so
> > this makes things very slightly worse.
> 
> But the wbinvd() is performed after those checks, so everything gets flushed.

Oh, right, duh. Thanks for checking. Yea those shouldn't matter.

Does the stop_this_cpu() part never come into play for SME either? It looks like
it was added for TDX guest kexec, but is a general ACPI thing.

Regarding the kasan thing, I was looking at this too:
wbinvd()
cpumask_clear_cpu()
  clear_bit()
    instrument_atomic_write()
      kasan_check_write()
        __kasan_check_write() <- non-inline



* Re: [RFC PATCH 1/5] x86/kexec: Do unconditional WBINVD for bare-metal in stop_this_cpu()
  2025-03-14 16:28       ` Edgecombe, Rick P
@ 2025-03-14 18:18         ` Tom Lendacky
  2025-03-14 18:57           ` Edgecombe, Rick P
  0 siblings, 1 reply; 34+ messages in thread
From: Tom Lendacky @ 2025-03-14 18:18 UTC (permalink / raw)
  To: Edgecombe, Rick P, Hansen, Dave, Huang, Kai, bp@alien8.de,
	peterz@infradead.org, mingo@redhat.com, tglx@linutronix.de,
	kirill.shutemov@linux.intel.com
  Cc: dyoung@redhat.com, seanjc@google.com, x86@kernel.org,
	linux-kernel@vger.kernel.org, sagis@google.com, hpa@zytor.com,
	Chatre, Reinette, bhe@redhat.com, Williams, Dan J,
	pbonzini@redhat.com, ashish.kalra@amd.com, nik.borisov@suse.com,
	Yamahata, Isaku, dwmw@amazon.co.uk

On 3/14/25 11:28, Edgecombe, Rick P wrote:
> On Fri, 2025-03-14 at 10:11 -0500, Tom Lendacky wrote:
>>> I see that this already has Tom's RB, but I'm not sure how this works for
>>> AMD.
>>> The original SME patch tried to avoid writing to memory by putting the
>>> wbinvd
>>> immediately before the halt, but today it is further away. Below this hunk
>>> there
>>> are more instructions that could dirty memory before the halt.  Ohh... it's
>>> new.
>>> 9 months ago 26ba7353caaa ("x86/smp: Add smp_ops.stop_this_cpu() callback")
>>> adds
>>> a function call that would touch the stack. I think it's wrong? And probably
>>> introduced after this patch was originally written.
>>>
>>> Then the cpuid_eax() could be non-inlined, but probably not. But the
>>> boot_cpu_has() added in this patch could call out to kasan and dirty the
>>> stack.
>>>
>>> So I think the existing SME case might be theoretically incorrect, and if so
>>> this makes things very slightly worse.
>>
>> But the wbinvd() is performed after those checks, so everything gets flushed.
> 
> Oh, right, duh. Thanks for checking. Yea those shouldn't matter.
> 
> Does the stop_this_cpu() part never come into play for SME either? It looks like
> it was added for TDX guest kexec, but is a general ACPI thing.

It is a general ACPI thing, but I don't know of it being used by our MADT
tables.

> 
> Regarding the kasan thing, I was looking at this too:
> wbinvd()
> cpumask_clear_cpu()
>   clear_bit()
>     instrument_atomic_write()
>       kasan_check_write()
>         __kasan_check_write() <- non-inline

Yes, this does look worrisome. Too bad there isn't a way to turn off KASAN
for a single function.

Thanks,
Tom

> 


* Re: [RFC PATCH 1/5] x86/kexec: Do unconditional WBINVD for bare-metal in stop_this_cpu()
  2025-03-14 18:18         ` Tom Lendacky
@ 2025-03-14 18:57           ` Edgecombe, Rick P
  0 siblings, 0 replies; 34+ messages in thread
From: Edgecombe, Rick P @ 2025-03-14 18:57 UTC (permalink / raw)
  To: Hansen, Dave, Huang, Kai, bp@alien8.de, peterz@infradead.org,
	mingo@redhat.com, thomas.lendacky@amd.com, tglx@linutronix.de,
	kirill.shutemov@linux.intel.com
  Cc: dwmw@amazon.co.uk, dyoung@redhat.com, seanjc@google.com,
	x86@kernel.org, sagis@google.com, hpa@zytor.com, Chatre, Reinette,
	linux-kernel@vger.kernel.org, Williams, Dan J, bhe@redhat.com,
	pbonzini@redhat.com, nik.borisov@suse.com, ashish.kalra@amd.com,
	Yamahata, Isaku

On Fri, 2025-03-14 at 13:18 -0500, Tom Lendacky wrote:
> > Does the stop_this_cpu() part never come into play for SME either? It looks
> > like
> > it was added for TDX guest kexec, but is a general ACPI thing.
> 
> It is a general ACPI thing, but I don't know of it being used by our MADT
> tables.
> 
> > 
> > Regarding the kasan thing, I was looking at this too:
> > wbinvd()
> > cpumask_clear_cpu()
> >    clear_bit()
> >      instrument_atomic_write()
> >        kasan_check_write()
> >          __kasan_check_write() <- non-inline
> 
> Yes, this does look worrisome. Too bad there isn't a way to turn off KASAN
> for a single function.

I wonder why that one has an explicit call, compared to the compiler-generated
instrumentation. It makes me wonder if there are just some KASAN-skipping
bit-wise operations that could be used to fix it.

For stop_this_cpu(), not sure how to avoid it. Maybe just a raw jump to the
function pointer or something. It's not supposed to return. If it is actually
not an issue due to a foreseeable lack of real-world HW/firmware, a comment
would be a nice touch.

Ok, well based on your earlier point, the new code actually doesn't make things
worse for SME. For TDX the late function calls are not a problem. So we'll leave
it to you guys.

Thanks for responding so quickly.


* Re: [RFC PATCH 3/5] x86/kexec: Disable kexec/kdump on platforms with TDX partial write erratum
  2025-03-13 23:57             ` Huang, Kai
@ 2025-03-14 19:03               ` Edgecombe, Rick P
  2025-03-17  1:19                 ` Huang, Kai
  0 siblings, 1 reply; 34+ messages in thread
From: Edgecombe, Rick P @ 2025-03-14 19:03 UTC (permalink / raw)
  To: tglx@linutronix.de, peterz@infradead.org, mingo@redhat.com,
	Hansen, Dave, Huang, Kai, bp@alien8.de,
	kirill.shutemov@linux.intel.com
  Cc: nik.borisov@suse.com, bhe@redhat.com, seanjc@google.com,
	x86@kernel.org, sagis@google.com, hpa@zytor.com, Chatre, Reinette,
	Williams, Dan J, thomas.lendacky@amd.com, pbonzini@redhat.com,
	ashish.kalra@amd.com, Yamahata, Isaku,
	linux-kernel@vger.kernel.org, dwmw@amazon.co.uk

On Thu, 2025-03-13 at 23:57 +0000, Huang, Kai wrote:
> > So this will switch all of TDX to be default off then, unless the kernel
> > gets a
> > parameter set. 
> > 
> 
Currently in KVM, TDX is also off by default.

Good point. It begs the question of how many command line options the user
should have to pass to enable TDX.

> 
> > In which case we could also just unlock the Kconfig with just one
> > small change. TDX and kexec would still be mutually exclusive, but just at
> > runtime.
> 
Yeah, I am thinking this too, given the "keyID 0 integrity" thing is still
ongoing.

You mentioned offline that there used to be a command line option, but it was
removed after discussion with Dave. I went to look for it and only found this:
https://lore.kernel.org/lkml/7e63912a-895f-d3b3-3173-336beaa86d08@intel.com/

...where Dave just asks why it's needed. In the next version it's dropped.
Unless there is anything more, it doesn't seem like there was really any
objection.

> 
> > We should try to flag Paolo and see what he thinks.
> 
> I appreciate if you could help to do.
> 
> > 
> > Or is the proposal to only be default tdx_host=off on the errata platforms?
> > And
> > tdx_host=on otherwise?
> 
> The tricky thing is, naturally, we want to skip all the code in tdx_init() if
> tdx_host=off, because there's no reason to do those detection/initialization
> if
> > we are not going to use TDX, e.g., we don't need this one:
> 
> 	register_memory_notifier(&tdx_memory_nb);
> 
> .. that means the code of detecting erratum will be skipped too.
> 
> > If we only make tdx_host=off the default for erratum platforms, then we
> need to do cleanup (e.g., to unregister the above memory notifier).

This is a strange point. The errata detection is not dependent on the earlier
code in TDX init. It couldn't just be moved?

> 
> This isn't nice and seems hacky.
> 
> I don't see making tdx_host=off as default has problem, anyway, as mentioned
> above TDX is off by default in KVM.

Yea, tdx_host=!errata as a default value makes it more complicated.


So I think the situation is we need at least one kernel parameter. We already have one
for KVM, which controls the late initialization parts of TDX that we care about
here. So what about just using the existing one? I think we don't want two.

If KVM has not initialized TDX (based on its own TDX parameter), then kexec is
fine. It could work by exposing an interface for features to be exclusive with
TDX, since real TDX module initialization happens late anyway. I don't know if
it's better than a kernel one, but I don't see adding a second one going well.


Very, very rough:

diff --git a/arch/x86/kernel/machine_kexec_64.c
b/arch/x86/kernel/machine_kexec_64.c
index a68f5a0a9f37..bfea4e78c577 100644
--- a/arch/x86/kernel/machine_kexec_64.c
+++ b/arch/x86/kernel/machine_kexec_64.c
@@ -315,6 +315,12 @@ int machine_kexec_prepare(struct kimage *image)
        result = init_pgtable(image, __pa(control_page));
        if (result)
                return result;
+
+       if (tdx_exclude_feature()) {
+               pr_info_once("Not allowed once TDX has been used.\n");
+               return -EOPNOTSUPP;
+       }
+
        kexec_va_control_page = (unsigned long)control_page;
        kexec_pa_table_page = (unsigned long)__pa(image->arch.pgd);
 
diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index f5e2a937c1e7..9b1f42a1059c 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -1215,6 +1215,21 @@ int tdx_enable(void)
 }
 EXPORT_SYMBOL_GPL(tdx_enable);
 
+bool tdx_exclude_feature(void)
+{
+       bool ret = false;
+
+       mutex_lock(&tdx_module_lock);
+       if (tdx_module_status == TDX_MODULE_INITIALIZED)
+               ret = true;
+       else
+               tdx_module_status = TDX_MODULE_EXCLUDED;
+       mutex_unlock(&tdx_module_lock);
+
+       return ret;
+}
+
 static bool is_pamt_page(unsigned long phys)
 {
        struct tdmr_info_list *tdmr_list = &tdx_tdmr_list;




* Re: [RFC PATCH 3/5] x86/kexec: Disable kexec/kdump on platforms with TDX partial write erratum
  2025-03-14 19:03               ` Edgecombe, Rick P
@ 2025-03-17  1:19                 ` Huang, Kai
  2025-03-17 23:53                   ` Edgecombe, Rick P
  0 siblings, 1 reply; 34+ messages in thread
From: Huang, Kai @ 2025-03-17  1:19 UTC (permalink / raw)
  To: tglx@linutronix.de, peterz@infradead.org, mingo@redhat.com,
	Hansen, Dave, Edgecombe, Rick P, bp@alien8.de,
	kirill.shutemov@linux.intel.com
  Cc: dwmw@amazon.co.uk, linux-kernel@vger.kernel.org,
	seanjc@google.com, x86@kernel.org, sagis@google.com,
	hpa@zytor.com, Chatre, Reinette, Williams, Dan J,
	thomas.lendacky@amd.com, bhe@redhat.com, pbonzini@redhat.com,
	nik.borisov@suse.com, ashish.kalra@amd.com, Yamahata, Isaku

On Fri, 2025-03-14 at 19:03 +0000, Edgecombe, Rick P wrote:
> On Thu, 2025-03-13 at 23:57 +0000, Huang, Kai wrote:
> > > So this will switch all of TDX to be default off then, unless the kernel
> > > gets a
> > > parameter set. 
> > > 
> > 
> > Currently in KVM TDX is also default off.
> 
> Good point. It begs the question of how many command line options the user
> should have to pass to enable TDX.
> 
> > 
> > > In which case we could also just unlock the Kconfig with just one
> > > small change. TDX and kexec would still be mutually exclusive, but just at
> > > runtime.
> > 
> > Yeah, I am thinking this too, given the "keyID 0 integrity" thing is still
> > ongoing.
> 
> You mentioned offline that there used to be a command line option, but it was
> removed after discussion with Dave. I went to look for it and only found this:
> https://lore.kernel.org/lkml/7e63912a-895f-d3b3-3173-336beaa86d08@intel.com/
> 
> ...where Dave just asks why it's needed. In the next version it's dropped.
> Unless there is anything more, it doesn't seem like there was really any
> objection.

Thanks for digging. :-)

I couldn't find any solid reason to argue against Dave so I just dropped it.  I
could argue that "this allows people to disable TDX once and for all" but it was
not something mandatory at that time.

> 
> > 
> > > We should try to flag Paolo and see what he thinks.
> > 
> > I'd appreciate it if you could help to do so.
> > 
> > > 
> > > Or is the proposal to only be default tdx_host=off on the errata platforms?
> > > And
> > > tdx_host=on otherwise?
> > 
> > The tricky thing is, naturally, we want to skip all the code in tdx_init() if
> > tdx_host=off, because there's no reason to do that detection/initialization if
> > we are not going to use TDX, e.g., we don't need this one:
> > 
> > 	register_memory_notifier(&tdx_memory_nb);
> > 
> > .. that means the code of detecting erratum will be skipped too.
> > 
> > If we only make tdx_host=off the default for erratum platforms, then we
> > need to do cleanup (e.g., to unregister the above memory notifier).
> 
> This is a strange point. The errata detection is not dependent on the earlier
> code in TDX init. It couldn't just be moved?

Sorry, I don't quite follow your point, but it seems you agreed it's not a good
idea.

> 
> > 
> > This isn't nice and seems hacky.
> > 
> > I don't see making tdx_host=off as default has problem, anyway, as mentioned
> > above TDX is off by default in KVM.
> 
> Yea, tdx_host=!errata as a default value makes it more complicated.

Yes.

> 
> 
> So I think the situation is we need at least one kernel parameter. We already have one
> for KVM, which controls the late initialization parts of TDX that we care about
> here. So what about just using the existing one? I think we don't want two.

Logically, KVM is just one user of TDX.  I think whether KVM has a parameter
should not impact whether we introduce a kernel parameter for the TDX host
core kernel.

Dan also made a point that in the context of TDX Connect, there's a requirement
to make SEAMCALLs even when KVM is not going to run any TDX guests:

https://lore.kernel.org/kvm/cover.1730120881.git.kai.huang@intel.com/T/#m6928f5519de25def97d47fc6bbb77f5c3e958f7b

So I agree ideally we don't want two, but I think it is also OK if there's a
good reason to do so.

> 
> If KVM has not initialized TDX (based on its own TDX parameter), then kexec is
> fine. 
> 

For now.  In the future the TDX module could be initialized by other kernel
components.

> It could work by exposing an interface for features to be exclusive with
> TDX. Since real TDX module initialization happens late anyway. I don't know if
> it's better than a kernel one, but I don't see adding a second one going well.
> 
> 
> Very, very rough:
> 
> diff --git a/arch/x86/kernel/machine_kexec_64.c
> b/arch/x86/kernel/machine_kexec_64.c
> index a68f5a0a9f37..bfea4e78c577 100644
> --- a/arch/x86/kernel/machine_kexec_64.c
> +++ b/arch/x86/kernel/machine_kexec_64.c
> @@ -315,6 +315,12 @@ int machine_kexec_prepare(struct kimage *image)
>         result = init_pgtable(image, __pa(control_page));
>         if (result)
>                 return result;
> +
> +       if (tdx_exclude_feature()) {
> +               pr_info_once("Not allowed once TDX has been used.\n");
> +               return -EOPNOTSUPP;
> +       }
> +
>         kexec_va_control_page = (unsigned long)control_page;
>         kexec_pa_table_page = (unsigned long)__pa(image->arch.pgd);
>  
> diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> index f5e2a937c1e7..9b1f42a1059c 100644
> --- a/arch/x86/virt/vmx/tdx/tdx.c
> +++ b/arch/x86/virt/vmx/tdx/tdx.c
> @@ -1215,6 +1215,21 @@ int tdx_enable(void)
>  }
>  EXPORT_SYMBOL_GPL(tdx_enable);
>  
> +bool tdx_exclude_feature(void)
> +{
> +       bool ret = false;
> +
> +       mutex_lock(&tdx_module_lock);
> +       if (tdx_module_status == TDX_MODULE_INITIALIZED)
> +               ret = true;
> +       else
> +               tdx_module_status = TDX_MODULE_EXCLUDED;
> +       mutex_unlock(&tdx_module_lock);
> +
> +       return ret;
> +}

Assuming setting the module status to "excluded" means we are never able to
initialize the TDX module again.

The thing is, kexec has two phases: 1) loading the kernel image, and 2) actually
doing the kexec.  Your approach basically marks TDX unusable forever when a user
tries to load a kexec kernel image, but this is a little bit nasty because
loading a kexec kernel image successfully doesn't mean you have to actually do
the kexec, i.e., you can unload the image and move on.

I am not saying this doesn't work, but IMHO it is more straightforward to just
let the user make the decision via a kernel parameter.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC PATCH 1/5] x86/kexec: Do unconditional WBINVD for bare-metal in stop_this_cpu()
  2025-03-13 18:40   ` Edgecombe, Rick P
  2025-03-14 10:03     ` Huang, Kai
  2025-03-14 15:11     ` Tom Lendacky
@ 2025-03-17 10:11     ` Huang, Kai
  2025-03-18  3:41       ` Edgecombe, Rick P
  2025-03-17 12:52     ` kirill.shutemov
  3 siblings, 1 reply; 34+ messages in thread
From: Huang, Kai @ 2025-03-17 10:11 UTC (permalink / raw)
  To: tglx@linutronix.de, peterz@infradead.org, mingo@redhat.com,
	Hansen, Dave, Edgecombe, Rick P, bp@alien8.de,
	kirill.shutemov@linux.intel.com
  Cc: dwmw@amazon.co.uk, dyoung@redhat.com, seanjc@google.com,
	x86@kernel.org, linux-kernel@vger.kernel.org, sagis@google.com,
	hpa@zytor.com, Chatre, Reinette, bhe@redhat.com, Williams, Dan J,
	pbonzini@redhat.com, thomas.lendacky@amd.com,
	nik.borisov@suse.com, ashish.kalra@amd.com, Yamahata, Isaku

On Thu, 2025-03-13 at 18:40 +0000, Edgecombe, Rick P wrote:
> On Thu, 2025-03-13 at 00:34 +1300, Kai Huang wrote:
> > TL;DR:
> > 
> > Change to do unconditional WBINVD in stop_this_cpu() for bare metal to
> > cover kexec support for both AMD SME and Intel TDX.  Previously there
> > _was_ some issue preventing from doing so but now it has been fixed.
>                              ^ Adding "the kernel" here would read a little
> cleaner to me.

OK.

> 
> 
> When I read "some issue" I start assuming it wasn't fully debugged and it is
> "some issue" that no one knows. But below it sounds like it was root caused.

I am not sure what's wrong with "some issue".  I used "_was_" to convey that
this issue was fixed (and thus root caused).  Perhaps the "root caused" part
isn't clear?  I can explicitly call it out:

  Previously there _was_ some issue preventing the kernel from doing so but 
  now it has been root caused and fixed. 

> 
> > Long version:
> 
> It might make this easier to digest this long version if it start with a "tell
> them what you are going to tell them" paragraph.
> 
> > 
> > AMD SME uses the C-bit to determine whether to encrypt the memory or
> > not.  For the same physical memory address, dirty cachelines with and
> > without the C-bit can coexist and the CPU can flush them back to memory
> > in random order.  To support kexec for SME, the old kernel uses WBINVD
> > to flush cache before booting to the new kernel so that no stale dirty
> > cacheline are left over by the old kernel which could otherwise corrupt
> > the new kernel's memory.
> > 
> > TDX uses 'KeyID' bits in the physical address for memory encryption and
> > has the same requirement.  To support kexec for TDX, the old kernel
> > needs to flush cache of TDX private memory.
> 
> This paragraph is a little jarring because it's not clear how it follows from
> the first paragraph. It helps the reader to give some hints on how they should
> organize the information as they go along. If it's too much of an info dump, it
> puts an extra burden. They have to try to hold all of the facts in their head
> until they can put together the bigger picture themselves.
> 
> The extra info about TDX using KeyID also seems unnecessary to the point (IIUC).

I added the above two paragraphs mainly to address Reinette's concern regarding
"cache can be in coherent status due to memory encryption" being too vague.

I also think they are too verbose.  How about replacing the first two paragraphs
with what I have (after addressing your comments) in patch 2:

 For SME and TDX, dirty cachelines with and without encryption bit(s) of 
 the same memory can coexist, and the CPU can flush them back to memory 
 in random order.

?

> 
> > 
> > Currently, the kernel only performs WBINVD in stop_this_cpu() when SME
> > is supported by hardware.  Perform unconditional WBINVD to support TDX
> > instead of adding one more vendor-specific check.  Kexec is a slow path,
> > and the additional WBINVD is acceptable for the sake of simplicity and
> > maintainability.
> 
> Out of curiosity, do you know why this was not already needed for non-self snoop
> CPUs? 
> 

Self-snooping only deals with different memory types (UC, WB, etc.) but doesn't
handle memory encryption, which tags additional bit(s) into the physical
address.

> Why can't there be other cache modes that get written back after the new
> kernel starts using the memory for something else?

I assume teaching the cache coherence protocol to understand the additional
encryption bit(s) isn't something that can be supported easily for all vendors.

> 
> > 
> > Only do WBINVD on bare-metal.  Doing WBINVD in guest triggers unexpected
>                                                 ^the

OK.

> > exception (#VE or #VC) for TDX and SEV-ES/SEV-SNP guests and the guest
> > may not be able to handle such exception (e.g., TDX guests panics if it
>                                                              ^panic

Ah good catch.

> > sees such #VE).
> 
> It's a small thing, but I think you could skip the #VE or #VC info in here. All
> they need to know to understand this patch is that TDX and some SEV kernels
> cannot handle WBINVD. And TDX panics. (does SEV not?)

I can remove "(#VE or #VC)", unless Kirill/Tom object.

SEV does not panic here in stop_this_cpu(), but it does panic later in
relocate_kernel():

https://lore.kernel.org/lkml/e1d37efb8951eb1d38493687b10a21b23353e35a.1710811610.git.kai.huang@intel.com/t/#m13e38c50f93afbf1c15f506e96817b2a78444b1d

> 
> > 
> > History of SME and kexec WBINVD:
> > 
> > There _was_ an issue preventing doing unconditional WBINVD but that has
> > been fixed.
> > 
> > Initial SME kexec support added an unconditional WBINVD in commit
> > 
> >   bba4ed011a52: ("x86/mm, kexec: Allow kexec to be used with SME")
> > 
> > This commit caused different Intel systems to hang or reset.
> > 
> > Without a clear root cause, a later commit
> > 
> >   f23d74f6c66c: ("x86/mm: Rework wbinvd, hlt operation in stop_this_cpu()")
> > 
> > fixed the Intel system hang issues by only doing WBINVD when hardware
> > supports SME.
> > 
> > A corner case [*] revealed the root cause of the system hang issues and
> > was fixed by commit
> > 
> >   1f5e7eb7868e: ("x86/smp: Make stop_other_cpus() more robust")
> > 
> > See [1][2] for more information.
> > 
> > Further testing of doing unconditional WBINVD based on the above fix on
> > the problematic machines (that issues were originally reported)
> > confirmed the issues couldn't be reproduced.
> > 
> > See [3][4] for more information.
> > 
> > Therefore, it is safe to do unconditional WBINVD for bare-metal now.
> 
> Instead of a play-by-play, it might be more informative to summarize the edges
> covered in this history:

I think the above is also informative.  Boris suggested to keep the discussion
around those relevant commits in the changelog:

https://lore.kernel.org/linux-kernel/20240228110207.GCZd8Sr8mXHA2KTiLz@fat_crate.local/
https://lore.kernel.org/linux-kernel/20240415175912.GAZh1q8GgpY3tpJj5a@fat_crate.local/

>  - Don't do anything that writes memory between wbinvd and native_halt(). This
> includes function calls that touch the stack.

This is a requirement for SME, but changing to unconditional WBINVD on
bare-metal doesn't change this behaviour.  What's the purpose of mentioning it
here?

>  - Avoid issuing wbinvd on multiple CPUs at the same time. As tglx implies it is
> too expensive.

Boris suggested doing unconditional WBINVD.  The test by Dave Young also
confirms there was no issue on the once-problematic platforms:

https://lore.kernel.org/lkml/CALu+AoSZkq1kz-xjvHkkuJ3C71d0SM5ibEJurdgmkZqZvNp2dQ@mail.gmail.com/

>  - Don't race reboot by watching cpumask instead of num_online_cpus(). But there
> is a race still.
> 
> Hmm, on the last one tglx says:
>     The cpumask cannot plug all holes either, but it's better than a raw
>     counter and allows to restrict the NMI fallback IPI to be sent only the
>     CPUs which have not reported within the timeout window
> 
> Are we opening up more platforms to a race by unconditionally doing the wbinvd?
> Can we be clarify that nothing bad happens if we lose the race? (and is it
> true?)

IIUC in most cases the REBOOT vector will just do the job and stop_other_cpus()
won't need to send NMIs, thus I believe in most cases this shouldn't increase
the race.

I don't know what kind of platform will need an NMI to stop remote CPUs.  For
instance, AFAICT my SPR machine with 192 CPUs never needed to send an NMI in my
testing.

Dave Young tested on those once-problematic platforms and doing unconditional
WBINVD was fine.  This patchset (albeit not upstreamed) has also been tested by
customers in their environments.  I believe doing unconditional WBINVD is fine.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC PATCH 1/5] x86/kexec: Do unconditional WBINVD for bare-metal in stop_this_cpu()
  2025-03-13 18:40   ` Edgecombe, Rick P
                       ` (2 preceding siblings ...)
  2025-03-17 10:11     ` Huang, Kai
@ 2025-03-17 12:52     ` kirill.shutemov
  2025-03-17 21:59       ` Edgecombe, Rick P
  3 siblings, 1 reply; 34+ messages in thread
From: kirill.shutemov @ 2025-03-17 12:52 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: tglx@linutronix.de, peterz@infradead.org, mingo@redhat.com,
	Hansen, Dave, Huang, Kai, bp@alien8.de, nik.borisov@suse.com,
	bhe@redhat.com, seanjc@google.com, x86@kernel.org,
	dyoung@redhat.com, sagis@google.com, hpa@zytor.com,
	Chatre, Reinette, Williams, Dan J, thomas.lendacky@amd.com,
	pbonzini@redhat.com, ashish.kalra@amd.com, Yamahata, Isaku,
	linux-kernel@vger.kernel.org, dwmw@amazon.co.uk

On Thu, Mar 13, 2025 at 06:40:09PM +0000, Edgecombe, Rick P wrote:
> > Currently, the kernel only performs WBINVD in stop_this_cpu() when SME
> > is supported by hardware.  Perform unconditional WBINVD to support TDX
> > instead of adding one more vendor-specific check.  Kexec is a slow path,
> > and the additional WBINVD is acceptable for the sake of simplicity and
> > maintainability.
> 
> Out of curiosity, do you know why this was not already needed for non-self snoop
> CPUs? Why can't there be other cache modes that get written back after the new
> kernel starts using the memory for something else?

KeyID is a hack.  The memory controller is aware of KeyID, but the cache is
not: the cache considers KeyID as part of the physical address.  Two cache lines
for the same physical address with different KeyIDs are considered unrelated
from the cache coherency PoV.

-- 
  Kiryl Shutsemau / Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC PATCH 1/5] x86/kexec: Do unconditional WBINVD for bare-metal in stop_this_cpu()
  2025-03-17 12:52     ` kirill.shutemov
@ 2025-03-17 21:59       ` Edgecombe, Rick P
  2025-03-19 16:41         ` Dave Hansen
  0 siblings, 1 reply; 34+ messages in thread
From: Edgecombe, Rick P @ 2025-03-17 21:59 UTC (permalink / raw)
  To: kirill.shutemov@linux.intel.com
  Cc: Huang, Kai, ashish.kalra@amd.com, Hansen, Dave, dyoung@redhat.com,
	thomas.lendacky@amd.com, linux-kernel@vger.kernel.org,
	Chatre, Reinette, seanjc@google.com, pbonzini@redhat.com,
	mingo@redhat.com, tglx@linutronix.de, nik.borisov@suse.com,
	bhe@redhat.com, Yamahata, Isaku, hpa@zytor.com,
	peterz@infradead.org, sagis@google.com, bp@alien8.de,
	dwmw@amazon.co.uk, x86@kernel.org, Williams, Dan J

On Mon, 2025-03-17 at 14:52 +0200, kirill.shutemov@linux.intel.com wrote:
> On Thu, Mar 13, 2025 at 06:40:09PM +0000, Edgecombe, Rick P wrote:
> > > Currently, the kernel only performs WBINVD in stop_this_cpu() when SME
> > > is supported by hardware.  Perform unconditional WBINVD to support TDX
> > > instead of adding one more vendor-specific check.  Kexec is a slow path,
> > > and the additional WBINVD is acceptable for the sake of simplicity and
> > > maintainability.
> > 
> > Out of curiosity, do you know why this was not already needed for non-self snoop
> > CPUs? Why can't there be other cache modes that get written back after the new
> > kernel starts using the memory for something else?
> 
> KeyID is a hack. Memory controller is aware about KeyID, but not cache.
> Cache considers KeyID as part of physical address. Two cache lines for the
> same physical address with different KeyID are considered unrelated from
> cache coherency PoV.

Sure, but non-self-snoop CPUs can have trouble when PAT aliases cache types, I
guess. This came up in KVM recently.

So if the new kernel maps the same memory with a different memtype, I thought it
might be a similar problem.

There is a little bit here, but it doesn't mention snooping. Not an expert in
this area, btw.
https://www.kernel.org/doc/Documentation/x86/pat.txt

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC PATCH 3/5] x86/kexec: Disable kexec/kdump on platforms with TDX partial write erratum
  2025-03-17  1:19                 ` Huang, Kai
@ 2025-03-17 23:53                   ` Edgecombe, Rick P
  0 siblings, 0 replies; 34+ messages in thread
From: Edgecombe, Rick P @ 2025-03-17 23:53 UTC (permalink / raw)
  To: tglx@linutronix.de, peterz@infradead.org, mingo@redhat.com,
	Hansen, Dave, Huang, Kai, bp@alien8.de,
	kirill.shutemov@linux.intel.com
  Cc: ashish.kalra@amd.com, Yamahata, Isaku, seanjc@google.com,
	x86@kernel.org, sagis@google.com, hpa@zytor.com, Chatre, Reinette,
	Williams, Dan J, thomas.lendacky@amd.com,
	linux-kernel@vger.kernel.org, pbonzini@redhat.com,
	dwmw@amazon.co.uk, bhe@redhat.com, nik.borisov@suse.com

On Mon, 2025-03-17 at 01:19 +0000, Huang, Kai wrote:
> On Fri, 2025-03-14 at 19:03 +0000, Edgecombe, Rick P wrote:
> > On Thu, 2025-03-13 at 23:57 +0000, Huang, Kai wrote:
> > 
> > 
> > 
> > So I think the situation is we need at least one kernel parameter. We already have one
> > for KVM, which controls the late initialization parts of TDX that we care about
> > here. So what about just using the existing one? I think we don't want two.
> 
> Logically, KVM is one user of TDX.  I think whether KVM has a parameter should
> not impact whether we should introduce one kernel parameter for TDX host core-
> kernel.
> 
> Dan also made a point that in the context of TDX Connect, there's requirement to
> make SEAMCALLs even KVM is not going to run any TDX guest:
> 
> https://lore.kernel.org/kvm/cover.1730120881.git.kai.huang@intel.com/T/#m6928f5519de25def97d47fc6bbb77f5c3e958f7b
> 
> So I agree ideally we don't want two, but I think it is also OK if there's good
> reason to do so.

What is the good reason to have two, though? Do we just want one host-side one
and lose the KVM one? It seems adding kernel parameters to make code problems go
away is usually frowned upon.

> 
> > 
> > If KVM has not initialized TDX (based on its own TDX parameter), then kexec is
> > fine. 
> > 
> 
> For now.  In the future TDX module could be initialized by other kernel
> components.
> 
> > It could work by exposing an interface for features to be exclusive with
> > TDX. Since real TDX module initialization happens late anyway. I don't know if
> > it's better than a kernel one, but I don't see adding a second one going well.
> > 
> > 
> > Very, very rough:
> > 
> > diff --git a/arch/x86/kernel/machine_kexec_64.c
> > b/arch/x86/kernel/machine_kexec_64.c
> > index a68f5a0a9f37..bfea4e78c577 100644
> > --- a/arch/x86/kernel/machine_kexec_64.c
> > +++ b/arch/x86/kernel/machine_kexec_64.c
> > @@ -315,6 +315,12 @@ int machine_kexec_prepare(struct kimage *image)
> >         result = init_pgtable(image, __pa(control_page));
> >         if (result)
> >                 return result;
> > +
> > +       if (tdx_exclude_feature()) {
> > +               pr_info_once("Not allowed once TDX has been used.\n");
> > +               return -EOPNOTSUPP;
> > +       }
> > +
> >         kexec_va_control_page = (unsigned long)control_page;
> >         kexec_pa_table_page = (unsigned long)__pa(image->arch.pgd);
> >  
> > diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> > index f5e2a937c1e7..9b1f42a1059c 100644
> > --- a/arch/x86/virt/vmx/tdx/tdx.c
> > +++ b/arch/x86/virt/vmx/tdx/tdx.c
> > @@ -1215,6 +1215,21 @@ int tdx_enable(void)
> >  }
> >  EXPORT_SYMBOL_GPL(tdx_enable);
> >  
> > +bool tdx_exclude_feature(void)
> > +{
> > +       bool ret = false;
> > +
> > +       mutex_lock(&tdx_module_lock);
> > +       if (tdx_module_status == TDX_MODULE_INITIALIZED)
> > +               ret = true;
> > +       else
> > +               tdx_module_status = TDX_MODULE_EXCLUDED;
> > +       mutex_unlock(&tdx_module_lock);
> > +
> > +       return ret;
> > +}
> 
> Assuming setting module status to "excluded" means we are not able to initialize
> TDX module for ever.

I was going for the simplest approach without adding a new kernel parameter. But
in practice for distros, KVM will load at boot and it should work pretty much the
same. If there is the tdx parameter, kexec is disabled; otherwise it's enabled.
> 
> The thing is Kexec has two phases: 1) loading kernel image, and 2) actually do
> kexec.  Your approach basically marks TDX unusable for ever when a user tries to
> load a kxec kernel image, but this is a little bit nasty because loading kexec
> kernel image successfully doesn't mean you have to actually do the kexec, i.e.,
> you can unload the image and move on.

Compared to the tdx_host parameter, this means that sometimes the user may be
able to decide late whether they want TDX or kexec.

> 
> I am not saying this doesn't work, but IMHO it is more straightforward to just
> let user make decision via kernel parameter.

Straightforward, yes agree. It's easier to document and the code would be
simpler.

I'm ok trying the tdx_host method, but I do think we need a better reason for
having two tdx kernel parameters when there is only one user of TDX today
(KVM).

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC PATCH 1/5] x86/kexec: Do unconditional WBINVD for bare-metal in stop_this_cpu()
  2025-03-17 10:11     ` Huang, Kai
@ 2025-03-18  3:41       ` Edgecombe, Rick P
  2025-03-20  0:03         ` Huang, Kai
  0 siblings, 1 reply; 34+ messages in thread
From: Edgecombe, Rick P @ 2025-03-18  3:41 UTC (permalink / raw)
  To: tglx@linutronix.de, peterz@infradead.org, mingo@redhat.com,
	Hansen, Dave, Huang, Kai, bp@alien8.de,
	kirill.shutemov@linux.intel.com
  Cc: ashish.kalra@amd.com, dyoung@redhat.com, seanjc@google.com,
	x86@kernel.org, sagis@google.com, hpa@zytor.com, Chatre, Reinette,
	linux-kernel@vger.kernel.org, Williams, Dan J, bhe@redhat.com,
	pbonzini@redhat.com, dwmw@amazon.co.uk, nik.borisov@suse.com,
	thomas.lendacky@amd.com, Yamahata, Isaku

On Mon, 2025-03-17 at 10:11 +0000, Huang, Kai wrote:
> On Thu, 2025-03-13 at 18:40 +0000, Edgecombe, Rick P wrote:
> > On Thu, 2025-03-13 at 00:34 +1300, Kai Huang wrote:
> > > TL;DR:
> > > 
> > > Change to do unconditional WBINVD in stop_this_cpu() for bare metal to
> > > cover kexec support for both AMD SME and Intel TDX.  Previously there
> > > _was_ some issue preventing from doing so but now it has been fixed.
> >                               ^ Adding "the kernel" here would read a little
> > cleaner to me.
> 
> OK.
> 
> > 
> > 
> > When I read "some issue" I start assuming it wasn't fully debugged and it is
> > "some issue" that no one knows. But below it sounds like it was root caused.
> 
> I am not sure what's wrong with "some issue".  I used "_was_" to convey this
> issue was fixed (thus root caused).  Perhaps the "root caused" part isn't clear?
> I can explicitly call it out.
> 
>   Previously there _was_ some issue preventing the kernel from doing so but 
>   now it has been root caused and fixed. 

The problem is the phrase "some issue" sounds like the issue is not understood.
You could just change it to "an issue". I don't even think you need the "_"
around "_was_". Just "Previously there was an issue preventing the kernel..." is
more reassuring.


[snip]

> > 
> > Instead of a play-by-play, it might be more informative to summarize the edges
> > covered in this history:
> 
> I think the above is also informative.  Boris suggested to keep the discussion
> around those relevant commits in the changelog:
> 
> https://lore.kernel.org/linux-kernel/20240228110207.GCZd8Sr8mXHA2KTiLz@fat_crate.local/

Summary: Kirill says it's too verbose, can be dropped.

> https://lore.kernel.org/linux-kernel/20240415175912.GAZh1q8GgpY3tpJj5a@fat_crate.local/

Summary: Boris says not to drop it and it's ok to be verbose.

But what I'm saying is that the section doesn't summarize the issue well. It
just links to a bunch of commits for the reviewer to go figure it out
themselves. So I think explaining the takeaways instead of only linking to
threads wouldn't be objectionable. You don't need to drop the commit references,
just don't leave so much homework for the reader.

> 
> >   - Don't do anything that writes memory between wbinvd and native_halt(). This
> > includes function calls that touch the stack.
> 
> This is a requirement for SME, but changing to unconditional WBINVD on bare-metal
> doesn't change this behaviour.  What's the purpose of mentioning it here?

The purpose is to help the reviewer understand the delicate design of this
function so that they can evaluate whether the changes upset it.

> 
> >   - Avoid issuing wbinvd on multiple CPUs at the same time. As tglx implies it is
> > too expensive.
> 
> Boris suggested to do unconditional WBINVD.  The test by Dave Young also
> confirms there was no issue on the once-problematic platforms:
> 
> https://lore.kernel.org/lkml/CALu+AoSZkq1kz-xjvHkkuJ3C71d0SM5ibEJurdgmkZqZvNp2dQ@mail.gmail.com/

I'm not sure what your point is here. That there is no issue? This log points to
a commit that contradicts the assertions it is making. Especially since to
understand this part of the log, the reviewer is going to have to read those
referenced commits, don't you think it's a problem? It is opening new doubts.

> 
> >   - Don't race reboot by watching cpumask instead of num_online_cpus(). But there
> > is a race still.
> > 
> > Hmm, on the last one tglx says:
> >      The cpumask cannot plug all holes either, but it's better than a raw
> >      counter and allows to restrict the NMI fallback IPI to be sent only the
> >      CPUs which have not reported within the timeout window
> > 
> > Are we opening up more platforms to a race by unconditionally doing the wbinvd?
> > Can we be clarify that nothing bad happens if we lose the race? (and is it
> > true?)
> 
> IIUC in most cases the REBOOT vector will just do the job, stop_other_cpus()
> won't need to send NMIs thus I believe in most cases this shouldn't increase
> race.  
> 
> I don't know what kinda platform will need NMI to stop remote CPUs.  For
> instance, AFAICT my SPR machine with 192 CPUs never needed to send NMI in my
> testings.
> 
> Dave Young tested on those once-problematic platforms and doing unconditional
> wbinvd was fine.  This patchset (albeit not upstreamed) has also been tested by
> customers in their environment.  I believe doing unconditional WBINVD is fine.

Sounds too much like: "Someone tested it on a platform that used to have a
problem and at least that one is fixed, but we really don't understand what the
issue is".

I haven't tried to understand what race tglx was seeing, or what the consequence
is. I think we should understand and explain why it's harmless, especially since
this brings it up by linking the log that references it.



^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC PATCH 2/5] x86/kexec: Do unconditional WBINVD for bare-metal in relocate_kernel()
  2025-03-14  9:44     ` Huang, Kai
@ 2025-03-18  3:54       ` Edgecombe, Rick P
  2025-03-19  9:57         ` Huang, Kai
  0 siblings, 1 reply; 34+ messages in thread
From: Edgecombe, Rick P @ 2025-03-18  3:54 UTC (permalink / raw)
  To: tglx@linutronix.de, peterz@infradead.org, mingo@redhat.com,
	Hansen, Dave, Huang, Kai, bp@alien8.de,
	kirill.shutemov@linux.intel.com
  Cc: ashish.kalra@amd.com, dyoung@redhat.com, thomas.lendacky@amd.com,
	linux-kernel@vger.kernel.org, seanjc@google.com,
	dwmw@amazon.co.uk, pbonzini@redhat.com, bhe@redhat.com,
	Yamahata, Isaku, nik.borisov@suse.com, Chatre, Reinette,
	hpa@zytor.com, sagis@google.com, david.kaplan@amd.com,
	x86@kernel.org, Williams, Dan J

On Fri, 2025-03-14 at 09:44 +0000, Huang, Kai wrote:
> On Thu, 2025-03-13 at 23:17 +0000, Edgecombe, Rick P wrote:
> > On Thu, 2025-03-13 at 00:34 +1300, Kai Huang wrote:
> > > For both SME and TDX, dirty cachelines with and without the encryption
> > > bit(s) of the same physical memory address can coexist and the CPU can
> > > flush them back to memory in random order.
> > > 
> > 
> > A lot going on in this sentence, how about simplifying it:
> > 
> > For SME and TDX, multiple dirty cachelines for the same memory can co-exist, and
> > the CPU can flush them back to memory in a random order.
> 
> "multiple" isn't accurate at least for SME.  How about:
> 
>  For SME and TDX, dirty cachelines with and without encryption bit(s) of 
>  the same memory can coexist, and the CPU can flush them back to memory 
>  in random order.

Ok.

> 
> > 
> > 
> > >   During kexec, the caches
> > > must be flushed before jumping to the new kernel to avoid silent memory
> > > corruption to the new kernel.
> > 
> > During kexec, the caches must be flushed before jumping to the new kernel to
> > avoid silent memory corruption when a cacheline with a different encryption
> > property is written back over whatever encryption properties the new kernel is
> > using.
> > 
> > ...it distributes some of the details from the first sentence into the second.
> > Easier to read or no? I'm not sure.
> 
> I don't have an opinion.  I see no difference.
> 
> I tend to keep the original wording since people have already reviewed it.
> 
> > 
> > > 
> > > The WBINVD in stop_this_cpu() flushes caches for all remote CPUs when
> > > they are being stopped.  For SME, the WBINVD in relocate_kernel()
> > > flushes the cache for the last running CPU (which is doing kexec).
> > > 
> > > Similarly, to support kexec for TDX host, after stopping all remote CPUs
> > > with cache flushed, the kernel needs to flush cache for the last running
> > > CPU.
> > 
> > 
> > I mentioned this in a previous version. I think you need to give some hint to
> > where you are going before you start listing facts. Like:
> > 
> > During kexec, WBINVD needs to be executed on each CPU for TDX and SME. 
> > 
> > On the
> > remote CPUs this is covered in stop_this_cpu() for both TDX and SME. For the
> > kexecing CPU, SME handles this in relocate_kernel(). This leaves the TDX case
> > for the kexec-ing CPU still to implement.
> > 
> > ...it first says the overall problem to solve, then explains what is missing in
> > the current code to solve it. The reader is already thinking of what the
> > solutions should be and...
> 
> Will do.
> 
> > 
> > > 
> > > Use the existing WBINVD in relocate_kernel() to cover TDX host as well.
> > 
> > ...they read the solution just as they are wondering about it. The reader can
> > feel like the change is aligned with their thinking.
> > 
> > > 
> > > Just do unconditional
> > > 
> > 
> > "Unconditional". Now I'm starting to think about how unconditional wbinvd will
> > be.
> > 
> > >  WBINVD to cover both SME and TDX instead of
> > > sprinkling additional vendor-specific checks.  Kexec is a slow path, and
> > > the additional WBINVD is acceptable for the sake of simplicity and
> > > maintainability.
> > > 
> > > But only do WBINVD for bare-metal
> > > 
> > 
> > But wait, now I'm learning it's not unconditional. I need to re-think what I
> > just evaluated. And I'm doubting the earlier statements because I just got
> > surprised.
> 
> Do you mean you were surprised by my saying "unconditional" first and then
> saying "for bare-metal"?  This is literally what the patch title says.  I don't
> see any problem, but I can mention the "for bare-metal" part when I say
> "unconditional" above.

I don't know how to explain it more. Unconditional means, well, unconditional.
"Only do..." is a condition. It's not unconditional.

So don't say it is or it will be confusing.

> 
> > 
> > >  because TDX guests and SEV-ES/SEV-SNP
> > > guests will get unexpected (and yet unnecessary) exception (#VE or #VC)
> > > which the kernel is unable to handle at the time of relocate_kernel()
> > > since the kernel has torn down the IDT.
> > > 
> > > Remove the host_mem_enc_active local variable and directly use
> > > !cpu_feature_enabled(X86_FEATURE_HYPERVISOR) as an argument of calling
> > > relocate_kernel().
> > > 
> > 
> > Start with the problem here. It just describes another change and I'm not sure
> > why when I start reading it.
> > 
> > By problem I mean that host_mem_enc_active doesn't fit the conditional anymore,
> > so it needs to be changed.
> > 
> > >   cpu_feature_enabled() is always inline but not a
> > 
> > I was just noticing this on the other patch. Actually it could call into some
> > kasan stuff.
> 
> Can you be more specific? I am not seeing it.

You are right, I don't know what I was looking at.

> 
> > 
> > > function call, thus it is safe to use after load_segments() when call
> > > depth tracking is enabled.
> > 
> > This function call tracking stuff is a wild card at the end. What about
> > describing the rules this function needs to follow due to call depth tracking,
> > and explain why the change does that.
> 
> Below is what I had before.  Do you think it's better?  I replaced it with the
> current one since Reinette commented that the original (which contains history
> etc.) was not necessary.
> 
> "
> Commit 93c1800b3799 ("x86/kexec: Fix bug with call depth tracking")
> moved calling 'cc_platform_has(CC_ATTR_HOST_MEM_ENCRYPT)' as an argument
> of relocate_kernel() to an earlier place before load_segments() by
> adding a variable 'host_mem_enc_active'.  The reason was the call to
> cc_platform_has() after load_segments() caused a fault and system crash
> when call depth tracking is active because load_segments() resets GS to
> 0 but call depth tracking uses per-CPU variable to operate.

That does feel like a lot of details. But I think it still needs more info than
the current version has. Again, just describe the special requirements and why.
You don't need to skip the links, but jumping right into the history takes
more work to understand what is needed.

> 
> Use !cpu_feature_enabled(X86_FEATURE_HYPERVISOR) to check whether the
> kernel runs on bare-metal.  cpu_feature_enabled() is always inline but
> not a function call, thus it is safe to use it after load_segments()
> when call depth tracking is enabled.  Remove the 'host_mem_enc_active'
> variable and use cpu_feature_enabled() directly as the argument when
> calling relocate_kernel().
> "



> 
> 
> [...]
> 
> > > --- a/arch/x86/kernel/machine_kexec_64.c
> > > +++ b/arch/x86/kernel/machine_kexec_64.c
> > > @@ -346,16 +346,9 @@ void __nocfi machine_kexec(struct kimage *image)
> > >  {
> > >  	unsigned long reloc_start = (unsigned long)__relocate_kernel_start;
> > >  	relocate_kernel_fn *relocate_kernel_ptr;
> > > -	unsigned int host_mem_enc_active;
> > >  	int save_ftrace_enabled;
> > >  	void *control_page;
> > >  
> > > -	/*
> > > -	 * This must be done before load_segments() since if call depth tracking
> > > -	 * is used then GS must be valid to make any function calls.
> > > -	 */
> > > -	host_mem_enc_active = cc_platform_has(CC_ATTR_HOST_MEM_ENCRYPT);
> > > -
> > >  #ifdef CONFIG_KEXEC_JUMP
> > >  	if (image->preserve_context)
> > >  		save_processor_state();
> > > @@ -398,6 +391,11 @@ void __nocfi machine_kexec(struct kimage *image)
> > >  	 *
> > >  	 * I take advantage of this here by force loading the
> > >  	 * segments, before I zap the gdt with an invalid value.
> > > +	 *
> > > +	 * load_segments() resets GS to 0.  Don't make any function call
> > > +	 * after here since call depth tracking uses per-CPU variables to
> > > +	 * operate (relocate_kernel() is explicitly ignored by call depth
> > > +	 * tracking).
> > 
> > I think I suggested you should call out the opportunistic change here in the
> > log. Did you disagree?
> 
> I replied this was suggested by David Kaplan, but I guess I forgot to reply to
> the "opportunistic" part.
> 
> I don't think this is an opportunistic change.  It's a valid comment after the
> 'host_mem_enc_active' variable and the comment around it were removed.

It's valid before too. So it's a separate change. I think based on the recent
history we should be extra careful about slipping extra changes into the series.
I'm not against doing the change in this patch, just want to make sure it's not
perceived as being slipped in.

> 
> 
> 


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC PATCH 2/5] x86/kexec: Do unconditional WBINVD for bare-metal in relocate_kernel()
  2025-03-18  3:54       ` Edgecombe, Rick P
@ 2025-03-19  9:57         ` Huang, Kai
  2025-03-19 16:20           ` Edgecombe, Rick P
  0 siblings, 1 reply; 34+ messages in thread
From: Huang, Kai @ 2025-03-19  9:57 UTC (permalink / raw)
  To: tglx@linutronix.de, peterz@infradead.org, mingo@redhat.com,
	Hansen, Dave, Edgecombe, Rick P, bp@alien8.de,
	kirill.shutemov@linux.intel.com
  Cc: ashish.kalra@amd.com, dyoung@redhat.com, thomas.lendacky@amd.com,
	linux-kernel@vger.kernel.org, seanjc@google.com,
	dwmw@amazon.co.uk, pbonzini@redhat.com, bhe@redhat.com,
	Yamahata, Isaku, nik.borisov@suse.com, Chatre, Reinette,
	hpa@zytor.com, sagis@google.com, david.kaplan@amd.com,
	x86@kernel.org, Williams, Dan J

> > 
> > > > --- a/arch/x86/kernel/machine_kexec_64.c
> > > > +++ b/arch/x86/kernel/machine_kexec_64.c
> > > > @@ -346,16 +346,9 @@ void __nocfi machine_kexec(struct kimage *image)
> > > >  {
> > > >  	unsigned long reloc_start = (unsigned long)__relocate_kernel_start;
> > > >  	relocate_kernel_fn *relocate_kernel_ptr;
> > > > -	unsigned int host_mem_enc_active;
> > > >  	int save_ftrace_enabled;
> > > >  	void *control_page;
> > > >  
> > > > -	/*
> > > > -	 * This must be done before load_segments() since if call depth tracking
> > > > -	 * is used then GS must be valid to make any function calls.
> > > > -	 */
> > > > -	host_mem_enc_active = cc_platform_has(CC_ATTR_HOST_MEM_ENCRYPT);
> > > > -
> > > >  #ifdef CONFIG_KEXEC_JUMP
> > > >  	if (image->preserve_context)
> > > >  		save_processor_state();
> > > > @@ -398,6 +391,11 @@ void __nocfi machine_kexec(struct kimage *image)
> > > >  	 *
> > > >  	 * I take advantage of this here by force loading the
> > > >  	 * segments, before I zap the gdt with an invalid value.
> > > > +	 *
> > > > +	 * load_segments() resets GS to 0.  Don't make any function call
> > > > +	 * after here since call depth tracking uses per-CPU variables to
> > > > +	 * operate (relocate_kernel() is explicitly ignored by call depth
> > > > +	 * tracking).
> > > 
> > > I think I suggested you should call out the opportunistic change here in the
> > > log. Did you disagree?
> > 
> > I replied this was suggested by David Kaplan, but I guess I forgot to reply the
> > "opportunistic" part.
> > 
> > I don't think this is opportunistic change.  It's a valid comment after the 
> > 'host_mem_enc_active' variable and the comment around it were removed.
> 
> It's valid before too. So it's a separate change. 
> 

I tried to understand what you mean here, but I am not sure I am following.  My
thinking:

Before this code change, in the existing code there's a comment right before the
'host_mem_enc_active' variable to explain why this variable is needed (which is
because of depth tracking).

After we remove 'host_mem_enc_active' and the comment before it, there's no
comment mentioning anything about depth tracking here.  So compared to the
existing code, we lose information which is actually helpful.

To still keep the helpful information about the depth tracking, a new comment is
added before load_segments().

Could you explain why this is a separate/extra change?

Nevertheless, are you looking for something like below in the changelog?

  With the 'host_mem_enc_active' variable and the comment around it removed,
  the information about depth tracking no longer exists.  Expand the
  comment around load_segments() to mention that due to depth tracking
  no function call can be made after load_segments().


* Re: [RFC PATCH 2/5] x86/kexec: Do unconditional WBINVD for bare-metal in relocate_kernel()
  2025-03-19  9:57         ` Huang, Kai
@ 2025-03-19 16:20           ` Edgecombe, Rick P
  0 siblings, 0 replies; 34+ messages in thread
From: Edgecombe, Rick P @ 2025-03-19 16:20 UTC (permalink / raw)
  To: tglx@linutronix.de, peterz@infradead.org, mingo@redhat.com,
	Hansen, Dave, Huang, Kai, bp@alien8.de,
	kirill.shutemov@linux.intel.com
  Cc: ashish.kalra@amd.com, dyoung@redhat.com, thomas.lendacky@amd.com,
	linux-kernel@vger.kernel.org, seanjc@google.com,
	dwmw@amazon.co.uk, pbonzini@redhat.com, bhe@redhat.com,
	Yamahata, Isaku, nik.borisov@suse.com, Chatre, Reinette,
	hpa@zytor.com, sagis@google.com, david.kaplan@amd.com,
	x86@kernel.org, Williams, Dan J

On Wed, 2025-03-19 at 09:57 +0000, Huang, Kai wrote:
> Nevertheless, are you looking for something like below in the changelog?
> 
>   With the 'host_mem_enc_active' variable and the comment around it removed,
>   the information about depth tracking no longer exists.  Expand the
>   comment around load_segments() to mention that due to depth tracking
>   no function call can be made after load_segments().

Looks good to me.


* Re: [RFC PATCH 1/5] x86/kexec: Do unconditional WBINVD for bare-metal in stop_this_cpu()
  2025-03-17 21:59       ` Edgecombe, Rick P
@ 2025-03-19 16:41         ` Dave Hansen
  2025-03-19 17:03           ` Edgecombe, Rick P
  2025-03-19 21:42           ` Huang, Kai
  0 siblings, 2 replies; 34+ messages in thread
From: Dave Hansen @ 2025-03-19 16:41 UTC (permalink / raw)
  To: Edgecombe, Rick P, kirill.shutemov@linux.intel.com
  Cc: Huang, Kai, ashish.kalra@amd.com, dyoung@redhat.com,
	thomas.lendacky@amd.com, linux-kernel@vger.kernel.org,
	Chatre, Reinette, seanjc@google.com, pbonzini@redhat.com,
	mingo@redhat.com, tglx@linutronix.de, nik.borisov@suse.com,
	bhe@redhat.com, Yamahata, Isaku, hpa@zytor.com,
	peterz@infradead.org, sagis@google.com, bp@alien8.de,
	dwmw@amazon.co.uk, x86@kernel.org, Williams, Dan J

On 3/17/25 14:59, Edgecombe, Rick P wrote:
> Sure, but non-selfsnoop CPUs can have trouble when PAT aliases cachetypes, I
> guess. This came up in KVM recently.
> 
> So if new kernel maps the same memory with a different memtype I thought it
> might be a similar problem.
Yeah, both the KeyIDs and memtypes mismatches are places that normal
cache coherency breaks. They break it in different ways for sure, but
it's still broken in a way that software has to work around.

As for kexec vs. PAT memtypes, there are only even theoretical issues on
old hardware.  They _theoretically_ need a WBINVD at kexec. But there
might be enough other things happening during kexec (including other
WBINVD's) to keep us from getting bitten in practice.

I'm not going to lose any sleep over it though.


* Re: [RFC PATCH 1/5] x86/kexec: Do unconditional WBINVD for bare-metal in stop_this_cpu()
  2025-03-19 16:41         ` Dave Hansen
@ 2025-03-19 17:03           ` Edgecombe, Rick P
  2025-03-19 21:42           ` Huang, Kai
  1 sibling, 0 replies; 34+ messages in thread
From: Edgecombe, Rick P @ 2025-03-19 17:03 UTC (permalink / raw)
  To: kirill.shutemov@linux.intel.com, Hansen, Dave
  Cc: Huang, Kai, ashish.kalra@amd.com, dyoung@redhat.com,
	thomas.lendacky@amd.com, linux-kernel@vger.kernel.org,
	Chatre, Reinette, seanjc@google.com, pbonzini@redhat.com,
	mingo@redhat.com, Yamahata, Isaku, nik.borisov@suse.com,
	bhe@redhat.com, dwmw@amazon.co.uk, hpa@zytor.com,
	peterz@infradead.org, sagis@google.com, tglx@linutronix.de,
	bp@alien8.de, x86@kernel.org, Williams, Dan J

On Wed, 2025-03-19 at 09:41 -0700, Dave Hansen wrote:
> They _theoretically_ need a WBINVD at kexec. But there
> might be enough other things happening during kexec (including other
> WBINVD's) to keep us from getting bitten in practice.

Thanks Dave. It seems there have been a number of issues in this step, so just
trying to make sure I understand how it is supposed to work before we change it.


* Re: [RFC PATCH 1/5] x86/kexec: Do unconditional WBINVD for bare-metal in stop_this_cpu()
  2025-03-19 16:41         ` Dave Hansen
  2025-03-19 17:03           ` Edgecombe, Rick P
@ 2025-03-19 21:42           ` Huang, Kai
  1 sibling, 0 replies; 34+ messages in thread
From: Huang, Kai @ 2025-03-19 21:42 UTC (permalink / raw)
  To: kirill.shutemov@linux.intel.com, Hansen, Dave, Edgecombe, Rick P
  Cc: Williams, Dan J, ashish.kalra@amd.com, dyoung@redhat.com,
	thomas.lendacky@amd.com, linux-kernel@vger.kernel.org,
	Chatre, Reinette, seanjc@google.com, pbonzini@redhat.com,
	bhe@redhat.com, Yamahata, Isaku, nik.borisov@suse.com,
	mingo@redhat.com, dwmw@amazon.co.uk, hpa@zytor.com,
	sagis@google.com, peterz@infradead.org, bp@alien8.de,
	tglx@linutronix.de, x86@kernel.org

On Wed, 2025-03-19 at 09:41 -0700, Hansen, Dave wrote:
> As for kexec vs. PAT memtypes, there are only even theoretical issues on
> old hardware.  They _theoretically_ need a WBINVD at kexec. But there
> might be enough other things happening during kexec (including other
> WBINVD's) to keep us from getting bitten in practice.

From this perspective, I think it makes sense to do WBINVD for bare-metal
machines anyway?  Perhaps I can mention this in the changelog?


* Re: [RFC PATCH 1/5] x86/kexec: Do unconditional WBINVD for bare-metal in stop_this_cpu()
  2025-03-18  3:41       ` Edgecombe, Rick P
@ 2025-03-20  0:03         ` Huang, Kai
  0 siblings, 0 replies; 34+ messages in thread
From: Huang, Kai @ 2025-03-20  0:03 UTC (permalink / raw)
  To: tglx@linutronix.de, peterz@infradead.org, mingo@redhat.com,
	Hansen, Dave, Edgecombe, Rick P, bp@alien8.de,
	kirill.shutemov@linux.intel.com
  Cc: Yamahata, Isaku, dyoung@redhat.com, seanjc@google.com,
	x86@kernel.org, sagis@google.com, hpa@zytor.com, Chatre, Reinette,
	linux-kernel@vger.kernel.org, Williams, Dan J,
	ashish.kalra@amd.com, pbonzini@redhat.com, dwmw@amazon.co.uk,
	bhe@redhat.com, thomas.lendacky@amd.com, nik.borisov@suse.com

On Tue, 2025-03-18 at 03:41 +0000, Edgecombe, Rick P wrote:
> On Mon, 2025-03-17 at 10:11 +0000, Huang, Kai wrote:
> > On Thu, 2025-03-13 at 18:40 +0000, Edgecombe, Rick P wrote:
> > > On Thu, 2025-03-13 at 00:34 +1300, Kai Huang wrote:
> > > > TL;DR:
> > > > 
> > > > Change to do unconditional WBINVD in stop_this_cpu() for bare metal to
> > > > cover kexec support for both AMD SME and Intel TDX.  Previously there
> > > > _was_ some issue preventing from doing so but now it has been fixed.
> > >                               ^ Adding "the kernel" here would read a little
> > > cleaner to me.
> > 
> > OK.
> > 
> > > 
> > > 
> > > When I read "some issue" I start assuming it wasn't fully debugged and it is
> > > "some issue" that no one knows. But below it sounds like it was root caused.
> > 
> > I am not sure what's wrong with "some issue".  I used "_was_" to convey this
> > issue was fixed (thus root caused).  Perhaps the "root caused" part isn't clear?
> > I can explicitly call it out.
> > 
> >   Previously there _was_ some issue preventing the kernel from doing so but 
> >   now it has been root caused and fixed. 
> 
> The problem is the phrase "some issue" sounds like the issue is not understood.
> You could just change it to "an issue". I don't even think you need the "_"
> around "_was_". Just "Previously there was an issue preventing the kernel..." is
> more reassuring.

(sorry for getting back late).

Sure.

> 
> 
> [snip]
> 
> > > 
> > > Instead of a play-by-play, it might be more informative to summarize the edges
> > > covered in this history:
> > 
> > I think the above is also informative.  Boris suggested keeping the discussion
> > around those relevant commits in the changelog:
> > 
> > https://lore.kernel.org/linux-kernel/20240228110207.GCZd8Sr8mXHA2KTiLz@fat_crate.local/
> 
> Summary: Kirill says it's too verbose, can be dropped.
> 
> > https://lore.kernel.org/linux-kernel/20240415175912.GAZh1q8GgpY3tpJj5a@fat_crate.local/
> 
> Summary: Boris says not to drop it and it's ok to be verbose.
> 
> But what I'm saying is that the section doesn't summarize the issue well. It
> just links to a bunch of commits for the reviewer to go figure it out
> themselves. So I think explaining the takeaways instead of only linking to
> threads wouldn't be objectionable. You don't need to drop the commit references,
> just don't leave so much homework for the reader.

I'll refine per discussion with you.

> 
> > 
> > >   - Don't do anything that writes memory between wbinvd and native_halt(). This
> > > includes function calls that touch the stack.
> > 
> > This is a requirement for SME, but changing to unconditional WBINVD on
> > bare-metal doesn't change this behaviour.  What's the purpose of mentioning it
> > here?
> 
> The purpose is to help the reviewer understand the delicate design of this
> function so that they can evaluate whether the changes upset it.

As discussed, I will mention the SME problem that you discovered and that the
code change will not make anything worse for SME.

> 
> > 
> > >   - Avoid issuing wbinvd on multiple CPUs at the same time. As tglx implies it is
> > > too expensive.
> > 
> > Boris suggested to do unconditional WBINVD.  The test by Dave Young also
> > confirms there was no issue on the once-problematic platforms:
> > 
> > https://lore.kernel.org/lkml/CALu+AoSZkq1kz-xjvHkkuJ3C71d0SM5ibEJurdgmkZqZvNp2dQ@mail.gmail.com/
> 
> I'm not sure what your point is here. That there is no issue? 
> 

I meant "too expensive" doesn't mean there will be an issue, and indeed no issue
was discovered, at least on the once-problematic platforms, as tested by Dave
Young.

> This log points to
> a commit that contradicts the assertions it is making. Especially since to
> understand this part of the log, the reviewer is going to have to read those
> referenced commits, don't you think it's a problem? It is opening new doubts.

Sorry, I am not able to follow where the "contradiction" is.

Anyway, I have spent some time reading the code around stop_other_cpus() and
stop_this_cpu(), and it is now clear (at least to me) that doing WBINVD for all
bare-metal systems is safe.  Please see below.

> 
> > 
> > >   - Don't race reboot by watching cpumask instead of num_online_cpus(). But there
> > > is a race still.
> > > 
> > > Hmm, on the last one tglx says:
> > >      The cpumask cannot plug all holes either, but it's better than a raw
> > >      counter and allows to restrict the NMI fallback IPI to be sent only the
> > >      CPUs which have not reported within the timeout window
> > > 
> > > Are we opening up more platforms to a race by unconditionally doing the wbinvd?
> > > Can we be clarify that nothing bad happens if we lose the race? (and is it
> > > true?)
> > 
> > IIUC in most cases the REBOOT vector will just do the job; stop_other_cpus()
> > won't need to send NMIs, thus I believe in most cases this shouldn't increase
> > the race window.
> > 
> > I don't know what kind of platform would need NMIs to stop remote CPUs.  For
> > instance, AFAICT my SPR machine with 192 CPUs never needed to send NMIs in my
> > testing.
> > 
> > Dave Young tested on those once-problematic platforms and doing unconditional
> > WBINVD was fine.  This patchset (albeit not upstreamed) has also been tested by
> > customers in their environments.  I believe doing unconditional WBINVD is fine.
> 
> Sounds too much like: "Someone tested it on a platform that used to have a
> problem and at least that one is fixed, but we really don't understand what the
> issue is".

Not exactly.  I think commit 1f5e7eb7868e ("x86/smp: Make stop_other_cpus() more
robust") has actually fixed this issue.

AFAICT the root cause is the race below, mentioned in that commit's changelog:

    CPU0                                    CPU1
    
     stop_other_cpus()
       send_IPIs(REBOOT);                   stop_this_cpu()
       while (num_online_cpus() > 1);         set_online(false);
       proceed... -> hang
                                              wbinvd()
    
So the key is to make sure the "proceed..." part can only happen after all
wbinvd()s are done in the remote CPUs.

To achieve this, the above commit introduced two rounds of sending-requests-and-
waiting:

 1) The first round sends the REBOOT IPI to remote CPUs and waits 1 second for
    all remote CPUs to stop;

 2) If the first round times out, the second round sends NMIs to remote CPUs
    and waits for 10ms, or forever (as requested by the @wait parameter to
    smp_ops.stop_other_cpus()).

For the kexec case, the kernel will just wait until all remote CPUs are stopped,
see:

	kernel_kexec() ->
		machine_shutdown() ->
			native_machine_shutdown() -> 
				stop_other_cpus() ->
					smp_ops->stop_other_cpus(1)
> 
> I haven't tried to understand what race tglx was seeing, or what the consequence
> is. I think we should understand and explain why it's harmless, especially since
> this bring it up by linking the log that references it.
> 

As discussed with you offline, changing to do WBINVD on bare-metal only
increases the time needed to stop remote CPUs, so the potential impact is that
the first round of waiting may time out on some large systems (albeit it didn't
happen, at least on my 192-CPU machine).  But this is fine, since the second
round of waiting will actually wait until all WBINVDs have been done and all
remote CPUs have actually been stopped.

In other words, AFAICT the "race" you mentioned (presumably it is the one I
mentioned above) cannot happen in kexec() after changing to do wbinvd for bare-
metal.

^ permalink raw reply	[flat|nested] 34+ messages in thread
