kvm.vger.kernel.org archive mirror
* [PATCH v6 0/7] TDX host: kexec/kdump support
From: Kai Huang @ 2025-08-13 23:59 UTC
  To: dave.hansen, bp, tglx, peterz, mingo, hpa, thomas.lendacky
  Cc: x86, kas, rick.p.edgecombe, dwmw, linux-kernel, pbonzini, seanjc,
	kvm, reinette.chatre, isaku.yamahata, dan.j.williams,
	ashish.kalra, nik.borisov, chao.gao, sagis, farrah.chen

This series is the latest attempt to support kexec on TDX host following
Dave's suggestion to use a percpu boolean to control WBINVD during
kexec.

Hi Boris/Tom,

Thanks for your review on the first two patches.  Please let me know if
you have more comments.

Hi Dave,

Tom has provided Reviewed-by for the first two patches which change SME
code.  TDX patches also received RBs from multiple Intel TDX developers
(the last patch has Paolo's Acked-by too).  Could you help review this
series and, if it looks good to you, consider merging it?

v5 -> v6:
 - Regenerate based on latest tip/master.
 - Rename do_seamcall() to __seamcall_dirty_cache() - Rick.
 - Collect Reviewed-by tags from Tom, Rick, Chao (thanks!).

v5: https://lore.kernel.org/kvm/cover.1753679792.git.kai.huang@intel.com/

v4 -> v5:
 - Address comments from Tom, Hpa and Chao (nothing major)
   - RELOC_KERNEL_HOST_MEM_ACTIVE -> RELOC_KERNEL_HOST_MEM_ENC_ACTIVE
     in patch 1 (Tom)
   - Add a comment to explain only RELOC_KERNEL_PRESERVE_CONTEXT is
     restored after jumping back from peer kernel for preserved_context
     kexec in patch 1.
   - Use testb instead of testq to save 3 bytes in patch 1 (Hpa)
   - Remove the unneeded 'ret' local variable in do_seamcall() (Chao)

v4: https://lore.kernel.org/kvm/cover.1752730040.git.kai.huang@intel.com/

v3 -> v4:
 - Rebase to latest tip/master.
 - Add a cleanup patch to consolidate relocate_kernel()'s last two
   function parameters -- Boris.
 - Address comments received -- please see individual patches.
 - Collect tags (Tom, Rick, binbin).

v3: https://lore.kernel.org/kvm/cover.1750934177.git.kai.huang@intel.com/

(For more history please see the v3 cover letter.)

=== More information ===

TDX private memory is memory that is encrypted with private Host Key IDs
(HKID).  If the kernel has ever enabled TDX, part of system memory
remains TDX private memory when kexec happens.  E.g., the PAMT (Physical
Address Metadata Table) pages used by the TDX module to track each TDX
memory page's state are never freed once the TDX module is initialized.
TDX guests also have guest private memory and secure-EPT pages.

After kexec, the new kernel will have no knowledge of which memory pages
were used as TDX private pages and can use all memory as regular memory.

1) Cache flush

Per TDX 1.5 base spec "8.6.1.Platforms not Using ACT: Required Cache
Flush and Initialization by the Host VMM", to support kexec for TDX, the
kernel needs to flush the cache to make sure no dirty cachelines of
TDX private memory are left over for the new kernel (when the TDX module
reports TDX_FEATURES.CLFLUSH_BEFORE_ALLOC as 1 in the global metadata for
the platform).  The kernel also needs to make sure there's no more TDX
activity (no SEAMCALL) after cache flush so that no new dirty cachelines
of TDX private memory are generated.
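
To illustrate why the flush has to come after the last SEAMCALL, here is
a minimal userspace C model (not kernel code).  Every name in it
(cache_model, seamcall_model(), flush_for_kexec_model(),
safe_to_kexec_model()) is hypothetical and exists only for this sketch.
The "seamcalls_stopped" gate is an assumption of the model; it stands in
for whatever guarantees that no SEAMCALL runs after the flush.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of one CPU's cache state with respect to TDX. */
struct cache_model {
	unsigned int dirty_private_lines; /* dirty TDX-private cachelines */
	bool seamcalls_stopped;           /* set once the kexec flush ran */
};

/* A SEAMCALL may dirty TDX private cachelines; refuse it after the flush. */
static int seamcall_model(struct cache_model *c)
{
	assert(c != NULL);
	if (c->seamcalls_stopped)
		return -1;
	c->dirty_private_lines++;
	return 0;
}

/* Stop further SEAMCALLs first, then write back everything (like WBINVD). */
static void flush_for_kexec_model(struct cache_model *c)
{
	assert(c != NULL);
	c->seamcalls_stopped = true;
	c->dirty_private_lines = 0;
}

/* The new kernel must not inherit dirty TDX private cachelines. */
static bool safe_to_kexec_model(const struct cache_model *c)
{
	assert(c != NULL);
	return c->dirty_private_lines == 0;
}
```

In this model a SEAMCALL issued after flush_for_kexec_model() is
rejected, so safe_to_kexec_model() stays true once the flush is done.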

SME has a similar requirement.  SME kexec support uses WBINVD to do the
cache flush.  WBINVD is able to flush cachelines associated with any
HKID.  Reuse the WBINVD introduced by SME to flush cache for TDX.

Currently the kernel explicitly checks whether the hardware supports SME
and only does WBINVD if true.  Instead of adding yet another TDX
specific check, this series uses a percpu boolean to indicate whether
WBINVD is needed on that CPU during kexec.
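
The percpu-boolean idea can be sketched as a small userspace C model
(not the actual patch).  A plain array stands in for the percpu
variable, a counter stands in for executing WBINVD, and all identifiers
ending in _sim are hypothetical names chosen for this sketch only.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS_SIM 4	/* hypothetical CPU count for this model */

/* Stand-in for a percpu boolean: one flag per simulated CPU. */
static bool cache_incoherent_sim[NR_CPUS_SIM];
/* Counts how many times each simulated CPU "executed" WBINVD. */
static unsigned int wbinvd_count_sim[NR_CPUS_SIM];

/* Set by any feature (SME or TDX) that may leave dirty encrypted lines. */
static void mark_cache_incoherent_sim(int cpu)
{
	assert(cpu >= 0 && cpu < NR_CPUS_SIM);
	cache_incoherent_sim[cpu] = true;
}

/* Models WBINVD: after it, this CPU has nothing left to write back. */
static void wbinvd_sim(int cpu)
{
	assert(cpu >= 0 && cpu < NR_CPUS_SIM);
	wbinvd_count_sim[cpu]++;
	cache_incoherent_sim[cpu] = false;
}

/*
 * Models the per-CPU kexec stop path: no feature-specific check,
 * only the flag decides.  Returns true if a flush was performed.
 */
static bool kexec_stop_cpu_sim(int cpu)
{
	assert(cpu >= 0 && cpu < NR_CPUS_SIM);
	if (!cache_incoherent_sim[cpu])
		return false;
	wbinvd_sim(cpu);
	return true;
}
```

The point of the design is that the kexec path does not need to know
whether SME or TDX set the flag; a CPU that never set it skips the
flush entirely.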

2) Reset TDX private memory using MOVDIR64B

The TDX spec (the aforementioned section) also suggests the kernel
*should* use MOVDIR64B to clear a TDX private page before the kernel
reuses it as a regular one.

However, in reality the situation can be more flexible.  Per TDX 1.5
base spec ("Table 16.2: Non-ACT Platforms Checks on Memory Reads in Ci
Mode" and "Table 16.3: Non-ACT Platforms Checks on Memory Reads in Li
Mode"), a read or write to TDX private memory using a shared KeyID without
integrity checking enabled will neither poison the memory nor cause a
machine check.

Note that on platforms with ACT (Access Control Table), no integrity
check is involved, so no machine check can happen due to memory
reads/writes using different KeyIDs.

KeyID 0 (TME key) doesn't support integrity check.  This series chooses
to NOT reset TDX private memory but to leave it as-is for the new
kernel.  As mentioned above, in practice it is safe to do so.

3) One limitation

If the kernel has ever enabled TDX, after kexec the new kernel won't be
able to use TDX anymore.  When the new kernel tries to initialize the
TDX module, the first SEAMCALL fails because the module has already been
initialized by the old kernel.

More (non-trivial) work will be needed for the new kernel to use TDX,
e.g., one solution is to just reload the TDX module from the location
where BIOS loads the TDX module (/boot/efi/EFI/TDX/).  This series
doesn't cover this, but leaves it as future work.

4) Kdump support

This series also enables kdump with TDX, but no special handling is
needed for crash kexec (except turning on the Kconfig option):

 - kdump kernel uses reserved memory from the old kernel as system ram,
   and the old kernel will never use the reserved memory as TDX memory.
 - /proc/vmcore contains TDX private memory pages.  It's meaningless to
   read them, but it doesn't do any harm either.

5) TDX "partial write machine check" erratum

On platforms with this TDX erratum, a partial write (a write transaction
of less than a cacheline landing at the memory controller) to TDX private
memory poisons that memory, and a subsequent read triggers a machine
check.  On those platforms, the kernel needs to reset TDX private memory
before jumping to the new kernel; otherwise the new kernel may see an
unexpected machine check.

The kernel currently doesn't track which page is TDX private memory.
It's not trivial to reset TDX private memory.  For simplicity, this
series simply disables kexec/kdump for such platforms.  This can be
enhanced in the future.
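
As a rough illustration of "disable kexec/kdump on such platforms", the
userspace C sketch below models a load-time gate.  It is not the code in
the series: struct platform_model, its field, kexec_load_check_model()
and the choice of -EOPNOTSUPP as the error value are all assumptions
made for this example.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical description of the platform, for this model only. */
struct platform_model {
	bool partial_write_erratum; /* TDX "partial write machine check" */
};

/*
 * Models a check done when a kexec/kdump image is about to be loaded.
 * Returns 0 when kexec may proceed, or a negative errno when the
 * platform has the erratum and TDX private memory cannot be reset.
 */
static int kexec_load_check_model(const struct platform_model *p)
{
	assert(p != NULL);
	if (p->partial_write_erratum)
		return -EOPNOTSUPP;
	return 0;
}
```

Refusing at load time, rather than at the jump to the new kernel, means
the user sees the failure immediately instead of losing the machine
later.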


Kai Huang (7):
  x86/kexec: Consolidate relocate_kernel() function parameters
  x86/sme: Use percpu boolean to control WBINVD during kexec
  x86/virt/tdx: Mark memory cache state incoherent when making SEAMCALL
  x86/kexec: Disable kexec/kdump on platforms with TDX partial write
    erratum
  x86/virt/tdx: Remove the !KEXEC_CORE dependency
  x86/virt/tdx: Update the kexec section in the TDX documentation
  KVM: TDX: Explicitly do WBINVD when no more TDX SEAMCALLs

 Documentation/arch/x86/tdx.rst       | 14 ++++-----
 arch/x86/Kconfig                     |  1 -
 arch/x86/include/asm/kexec.h         | 12 ++++++--
 arch/x86/include/asm/processor.h     |  2 ++
 arch/x86/include/asm/tdx.h           | 27 ++++++++++++++++-
 arch/x86/kernel/cpu/amd.c            | 17 +++++++++++
 arch/x86/kernel/machine_kexec_64.c   | 44 ++++++++++++++++++++++------
 arch/x86/kernel/process.c            | 24 +++++++--------
 arch/x86/kernel/relocate_kernel_64.S | 36 +++++++++++++++--------
 arch/x86/kvm/vmx/tdx.c               | 12 ++++++++
 arch/x86/virt/vmx/tdx/tdx.c          | 16 ++++++++--
 11 files changed, 158 insertions(+), 47 deletions(-)


base-commit: 4b6b14d20bc04dcab6dd3ad0d5a50a0f473d1c18
-- 
2.50.1



Thread overview: 26+ messages
2025-08-13 23:59 [PATCH v6 0/7] TDX host: kexec/kdump support Kai Huang
2025-08-13 23:59 ` [PATCH v6 1/7] x86/kexec: Consolidate relocate_kernel() function parameters Kai Huang
2025-08-15 10:46   ` Borislav Petkov
2025-08-18  1:15     ` Huang, Kai
2025-08-13 23:59 ` [PATCH v6 2/7] x86/sme: Use percpu boolean to control WBINVD during kexec Kai Huang
2025-08-19 19:28   ` Borislav Petkov
2025-08-19 21:57     ` Huang, Kai
2025-08-13 23:59 ` [PATCH v6 3/7] x86/virt/tdx: Mark memory cache state incoherent when making SEAMCALL Kai Huang
2025-08-13 23:59 ` [PATCH v6 4/7] x86/kexec: Disable kexec/kdump on platforms with TDX partial write erratum Kai Huang
2025-08-13 23:59 ` [PATCH v6 5/7] x86/virt/tdx: Remove the !KEXEC_CORE dependency Kai Huang
2025-08-13 23:59 ` [PATCH v6 6/7] x86/virt/tdx: Update the kexec section in the TDX documentation Kai Huang
2025-08-13 23:59 ` [PATCH v6 7/7] KVM: TDX: Explicitly do WBINVD when no more TDX SEAMCALLs Kai Huang
2025-08-14 13:54   ` Sean Christopherson
2025-08-14 15:38     ` Edgecombe, Rick P
2025-08-14 18:00       ` Sean Christopherson
2025-08-14 22:19         ` Huang, Kai
2025-08-14 23:22           ` Sean Christopherson
2025-08-15  0:00             ` Huang, Kai
2025-08-19 10:31               ` Paolo Bonzini
2025-08-19 21:53                 ` Huang, Kai
2025-08-20  9:51                   ` Paolo Bonzini
2025-08-20 11:22                     ` Huang, Kai
2025-08-20 20:35                       ` Paolo Bonzini
2025-08-20 21:34                         ` Huang, Kai
2025-08-20 15:39                   ` Paolo Bonzini
2025-08-14 22:25     ` Huang, Kai
