All of lore.kernel.org
 help / color / mirror / Atom feed
From: Dave Hansen <dave.hansen@intel.com>
To: "Huang, Kai" <kai.huang@intel.com>, linux-kernel@vger.kernel.org
Cc: x86@kernel.org, kirill.shutemov@linux.intel.com,
	tglx@linutronix.de, bp@alien8.de, mingo@redhat.com,
	hpa@zytor.com, luto@kernel.org, peterz@infradead.org,
	thomas.lendacky@amd.com, chao.gao@intel.com, bhe@redhat.com,
	nik.borisov@suse.com, pbonzini@redhat.com
Subject: Re: [PATCH 3/4] x86/kexec(): Reset TDX private memory on platforms with TDX erratum
Date: Wed, 31 Jan 2024 13:21:39 -0800	[thread overview]
Message-ID: <12a927df-e437-40ff-ba4d-ceca5446c5b1@intel.com> (raw)
In-Reply-To: <b52ed259e0d487b3a968b98da91cb4f55a28ae82.1706698706.git.kai.huang@intel.com>

On 1/31/24 03:31, Huang, Kai wrote:
> --- a/arch/x86/kernel/machine_kexec_64.c
> +++ b/arch/x86/kernel/machine_kexec_64.c
> @@ -28,6 +28,7 @@
>  #include <asm/setup.h>
>  #include <asm/set_memory.h>
>  #include <asm/cpu.h>
> +#include <asm/tdx.h>
>  
>  #ifdef CONFIG_ACPI
>  /*
> @@ -298,9 +299,24 @@ void machine_kexec(struct kimage *image)
>  	void *control_page;
>  	int save_ftrace_enabled;
>  
> +	/*
> +	 * For platforms with TDX "partial write machine check" erratum,
> +	 * all TDX private pages need to be converted back to normal
> +	 * before booting to the new kernel, otherwise the new kernel
> +	 * may get unexpected machine check.
> +	 *
> +	 * But skip this when preserve_context is on.  The second kernel
> +	 * shouldn't write to the first kernel's memory anyway.  Skipping
> +	 * this also avoids killing TDX in the first kernel, which would
> +	 * require more complicated handling.
> +	 */

This is waaaaaaaaaaaaaaay too much information to stick in a generic
function.  Just call tdx_reset_memory() and make the argument about

>  #ifdef CONFIG_KEXEC_JUMP
>  	if (image->preserve_context)
>  		save_processor_state();
> +	else
> +		tdx_reset_memory();
> +#else
> +	tdx_reset_memory();
>  #endif

Wow, that's awfully hard to read.  I really wish folks' gag reflex would
kick in when they see stuff like this to get them to spend an additional
15 seconds to turn this into:

	if (IS_ENABLED(CONFIG_KEXEC_JUMP) && image->preserve_context)
		save_processor_state();
	else
		tdx_reset_memory();

>  	save_ftrace_enabled = __ftrace_enabled_save();
> diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> index 9f1fed458a32..0537b1b76c2b 100644
> --- a/arch/x86/virt/vmx/tdx/tdx.c
> +++ b/arch/x86/virt/vmx/tdx/tdx.c
> @@ -28,6 +28,7 @@
>  #include <linux/acpi.h>
>  #include <linux/suspend.h>
>  #include <linux/acpi.h>
> +#include <linux/reboot.h>
>  #include <asm/page.h>
>  #include <asm/special_insns.h>
>  #include <asm/msr-index.h>
> @@ -54,6 +55,8 @@ static DEFINE_MUTEX(tdx_module_lock);
>  /* All TDX-usable memory regions.  Protected by mem_hotplug_lock. */
>  static LIST_HEAD(tdx_memlist);
>  
> +static bool tdx_rebooting;
> +
>  typedef void (*sc_err_func_t)(u64 fn, u64 err, struct tdx_module_args *args);
>  
>  static inline void seamcall_err(u64 fn, u64 err, struct tdx_module_args *args)
> @@ -1187,6 +1190,9 @@ static int __tdx_enable(void)
>  {
>  	int ret;
>  
> +	if (tdx_rebooting)
> +		return -EAGAIN;

This whole 'tdx_rebooting' deserves to at least be in its own patch.
Also -EAGAIN seems to be a really odd return code for a permanent failure.

>  	ret = init_tdx_module();
>  	if (ret) {
>  		pr_err("module initialization failed (%d)\n", ret);
> @@ -1420,6 +1426,90 @@ static struct notifier_block tdx_memory_nb = {
>  	.notifier_call = tdx_memory_notifier,
>  };
>  
> +/*
> + * Convert TDX private pages back to normal on platforms with
> + * "partial write machine check" erratum.
> + *
> + * Called from machine_kexec() before booting to the new kernel.
> + */
> +void tdx_reset_memory(void)
> +{
> +	if (!boot_cpu_has(X86_FEATURE_TDX_HOST_PLATFORM))
> +		return;
> +
> +	/*
> +	 * Kernel read/write to TDX private memory doesn't
> +	 * cause machine check on hardware w/o this erratum.
> +	 */
> +	if (!boot_cpu_has_bug(X86_BUG_TDX_PW_MCE))
> +		return;
> +
> +	/* Called from kexec() when only rebooting cpu is alive */
> +	WARN_ON_ONCE(num_online_cpus() != 1);
> +
> +	/*
> +	 * tdx_reboot_notifier() waits until ongoing TDX module
> +	 * initialization to finish, and module initialization is
> +	 * rejected after that.  Therefore @tdx_module_status is
> +	 * stable here and can be read w/o holding lock.
> +	 */
> +	if (tdx_module_status != TDX_MODULE_INITIALIZED)
> +		return;

kexec() can't happen until after reboot notifiers are called.
tdx_reboot_notifier() makes 'tdx_module_status' stable, so no lock is
needed.

Right?

> +	/*
> +	 * Flush cache of all TDX private memory _before_ converting
> +	 * them back to avoid silent memory corruption.
> +	 */
> +	native_wbinvd();

Since this is single-threaded, it also needs to mention where all the
other CPU caches got flushed.

> +	/*
> +	 * Convert PAMTs back to normal.  All other cpus are already
> +	 * dead and TDMRs/PAMTs are stable.
> +	 *
> +	 * Ideally it's better to cover all types of TDX private pages
> +	 * here, but it's impractical:
> +	 *
> +	 *  - There's no existing infrastructure to tell whether a page
> +	 *    is TDX private memory or not.
> +	 *
> +	 *  - Using SEAMCALL to query TDX module isn't feasible either:
> +	 *    - VMX has been turned off by reaching here so SEAMCALL
> +	 *      cannot be made;
> +	 *    - Even SEAMCALL can be made the result from TDX module may

		    ^ if     ^ a          ^,

> +	 *      not be accurate (e.g., remote CPU can be stopped while
> +	 *      the kernel is in the middle of reclaiming TDX private
> +	 *      page and doing MOVDIR64B).
> +	 *
> +	 * One temporary solution could be just converting all memory
> +	 * pages, but it's problematic too, because not all pages are
> +	 * mapped as writable in direct mapping.  It can be done by
> +	 * switching to the identical mapping for kexec() or a new page
> +	 * table which maps all pages as writable, but the complexity is
> +	 * overkill.
> +	 *
> +	 * Thus instead of doing something dramatic to convert all pages,
> +	 * only convert PAMTs here.  Other kernel components which use
> +	 * TDX need to do the conversion on their own by intercepting the
> +	 * rebooting/shutdown notifier (KVM already does that).
> +	 */

I'd leave the extended alternatives discussion in the changelog, not here.

Focus on what _this_ is doing and why it is imperfect:

 1. Just reset the PAMDs
 2. This leaves A, B, and C undealt with
 3. The risk of leaving those is ...


> +	tdmrs_reset_pamt_all(&tdx_tdmr_list);
> +}
> +
> +static int tdx_reboot_notifier(struct notifier_block *nb, unsigned long mode,
> +			       void *unused)
> +{
> +	/* Wait ongoing TDX initialization to finish */
> +	mutex_lock(&tdx_module_lock);
> +	tdx_rebooting = true;
> +	mutex_unlock(&tdx_module_lock);
> +
> +	return NOTIFY_OK;
> +}

Why is 'tdx_rebooting' a new variable instead of a new state in
tdx_module_status?


  reply	other threads:[~2024-01-31 21:21 UTC|newest]

Thread overview: 42+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-01-31 11:31 [PATCH 0/4] TDX host: kexec() support Huang, Kai
2024-01-31 11:31 ` [PATCH 1/4] x86/coco: Add a new CC attribute to unify cache flush during kexec Huang, Kai
2024-02-19 16:16   ` Borislav Petkov
2024-02-19 19:45     ` Tom Lendacky
2024-02-19 20:32       ` Borislav Petkov
2024-02-19 22:09         ` Tom Lendacky
2024-02-20  2:57           ` Huang, Kai
2024-02-20 14:40             ` Tom Lendacky
2024-02-20 14:28           ` Borislav Petkov
2024-02-20 14:47             ` Tom Lendacky
2024-02-20 20:07               ` Huang, Kai
2024-02-20 22:30                 ` Tom Lendacky
2024-02-21  1:38                   ` Huang, Kai
2024-02-21  9:28                   ` Borislav Petkov
2024-02-22 11:49                     ` Huang, Kai
2024-02-23  3:13                       ` Dave Young
2024-02-23 10:41                     ` Dave Young
2024-02-28  2:54                       ` Dave Young
2024-02-28  9:21                         ` Huang, Kai
2024-02-28 11:02                           ` Borislav Petkov
2024-02-28 22:21                             ` Huang, Kai
2024-02-28 10:44                         ` Borislav Petkov
2024-02-20  3:12     ` Huang, Kai
2024-01-31 11:31 ` [PATCH 2/4] x86/virt/tdx: Advertise the CC_ATTR_HOST_MEM_INCOHERENT for TDX host Huang, Kai
2024-01-31 17:11   ` Dave Hansen
2024-02-01 14:42     ` Huang, Kai
2024-01-31 11:31 ` [PATCH 3/4] x86/kexec(): Reset TDX private memory on platforms with TDX erratum Huang, Kai
2024-01-31 21:21   ` Dave Hansen [this message]
2024-01-31 22:03     ` Kirill A. Shutemov
2024-02-01 14:22       ` Huang, Kai
2024-02-01 14:39         ` Kirill A. Shutemov
2024-02-01 14:47           ` Huang, Kai
2024-02-01 16:57           ` Dave Hansen
2024-02-05  6:49             ` Huang, Kai
2024-02-01 14:35     ` Huang, Kai
2024-02-02  0:54   ` Edgecombe, Rick P
2024-02-05  6:44     ` Huang, Kai
2024-01-31 11:31 ` [PATCH 4/4] x86/virt/tdx: Remove the !KEXEC_CORE dependency Huang, Kai
2024-02-01 18:28 ` [PATCH 0/4] TDX host: kexec() support Tom Lendacky
2024-02-05  6:50   ` Huang, Kai
2024-02-06 18:56     ` Kalra, Ashish
2024-02-07  1:43       ` Huang, Kai

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=12a927df-e437-40ff-ba4d-ceca5446c5b1@intel.com \
    --to=dave.hansen@intel.com \
    --cc=bhe@redhat.com \
    --cc=bp@alien8.de \
    --cc=chao.gao@intel.com \
    --cc=hpa@zytor.com \
    --cc=kai.huang@intel.com \
    --cc=kirill.shutemov@linux.intel.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=luto@kernel.org \
    --cc=mingo@redhat.com \
    --cc=nik.borisov@suse.com \
    --cc=pbonzini@redhat.com \
    --cc=peterz@infradead.org \
    --cc=tglx@linutronix.de \
    --cc=thomas.lendacky@amd.com \
    --cc=x86@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.