From: David Woodhouse <dwmw2@infradead.org>
To: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: "Huang, Kai" <kai.huang@intel.com>,
	Xiaoyao Li <xiaoyao.li@intel.com>,
	 Thomas Gleixner <tglx@linutronix.de>,
	Ingo Molnar <mingo@redhat.com>, Borislav Petkov <bp@alien8.de>,
	Dave Hansen <dave.hansen@linux.intel.com>,
	x86@kernel.org, "Rafael J. Wysocki" <rafael@kernel.org>,
	Peter Zijlstra <peterz@infradead.org>,
	Adrian Hunter <adrian.hunter@intel.com>,
	Kuppuswamy Sathyanarayanan
	<sathyanarayanan.kuppuswamy@linux.intel.com>,
	Elena Reshetova <elena.reshetova@intel.com>,
	Jun Nakajima <jun.nakajima@intel.com>,
	Rick Edgecombe <rick.p.edgecombe@intel.com>,
	Tom Lendacky <thomas.lendacky@amd.com>,
	"Kalra, Ashish" <ashish.kalra@amd.com>,
	Sean Christopherson <seanjc@google.com>,
	Baoquan He <bhe@redhat.com>,
	kexec@lists.infradead.org, linux-coco@lists.linux.dev,
	 linux-kernel@vger.kernel.org
Subject: Re: [PATCHv9 05/17] x86/kexec: Keep CR4.MCE set during kexec for TDX guest
Date: Mon, 17 Mar 2025 11:32:42 +0000
Message-ID: <0c6cab4ea4e898a62ecb0b047959f09011d3c85f.camel@infradead.org>
In-Reply-To: <uchg74rtpcpwlkxgqww2n6nh23p4ouaswqc737xy7y6rqzowtb@pbf4whogx2s4>

On Mon, 2025-03-17 at 13:03 +0200, Kirill A. Shutemov wrote:
> On Mon, Mar 17, 2025 at 09:27:16AM +0000, David Woodhouse wrote:
> > On Thu, 2024-04-04 at 12:32 +0300, Kirill A. Shutemov wrote:
> > > On Thu, Apr 04, 2024 at 10:40:34AM +1300, Huang, Kai wrote:
> > > > 
> > > > 
> > > > On 3/04/2024 4:42 am, Kirill A. Shutemov wrote:
> > > > > On Fri, Mar 29, 2024 at 06:48:21PM +0200, Kirill A. Shutemov wrote:
> > > > > > On Fri, Mar 29, 2024 at 11:21:32PM +0800, Xiaoyao Li wrote:
> > > > > > > On 3/25/2024 6:38 PM, Kirill A. Shutemov wrote:
> > > > > > > > TDX guests are not allowed to clear CR4.MCE. Attempt to clear it leads
> > > > > > > > to #VE.
> > > > > > > 
> > > > > > > Will we consider making it more safe and compatible for future to guard
> > > > > > > against X86_FEATURE_MCE as well?
> > > > > > > 
> > > > > > > If in the future, MCE becomes configurable for TD guest, then CR4.MCE might
> > > > > > > not be fixed1.
> > > > > > 
> > > > > > Good point.
> > > > > > 
> > > > > > > I guess we can leave it clear if it was clear. This should be easy
> > > > > > > enough. But we might want to clear it even if it was set, if clearing is allowed.
> > > > > > 
> > > > > > It would require some kind of indication that clearing MCE is fine. We
> > > > > > don't have such indication yet. Not sure we can reasonably future-proof
> > > > > > the code at this point.
> > > > > > 
> > > > > > But let me think more.
> > > > > 
> > > > > I think I will go with the variant below.
> > > > > 
> > > > > diff --git a/arch/x86/kernel/relocate_kernel_64.S b/arch/x86/kernel/relocate_kernel_64.S
> > > > > index 56cab1bb25f5..8e2037d78a1f 100644
> > > > > --- a/arch/x86/kernel/relocate_kernel_64.S
> > > > > +++ b/arch/x86/kernel/relocate_kernel_64.S
> > > > > @@ -5,6 +5,8 @@
> > > > >    */
> > > > >   #include <linux/linkage.h>
> > > > > +#include <linux/stringify.h>
> > > > > +#include <asm/alternative.h>
> > > > >   #include <asm/page_types.h>
> > > > >   #include <asm/kexec.h>
> > > > >   #include <asm/processor-flags.h>
> > > > > @@ -145,11 +147,17 @@ SYM_CODE_START_LOCAL_NOALIGN(identity_mapped)
> > > > >   	 * Set cr4 to a known state:
> > > > >   	 *  - physical address extension enabled
> > > > >   	 *  - 5-level paging, if it was enabled before
> > > > > +	 *  - Machine check exception on TDX guest, if it was enabled before.
> > > > > > +	 *    Clearing MCE might not be allowed in TDX guests, depending on setup.
> > > > 
> > > > Nit:  Perhaps we can just call out:
> > > > 
> > > > 	Clearing MCE is not allowed if it _was_ enabled before.
> > > > 
> > > > Which is always true I suppose.
> > > 
> > > > It is true now. Future TDX will allow clearing CR4.MCE, and we don't want
> > > > to flip it back on in this case.
> > 
> > And yet v12 of the patch which became commit de60613173df does
> > precisely that.
> > 
> > It uses the original contents of CR4 which are stored in %r13 (instead
> > of building a completely new set of bits for CR4 as before). So it
> > would never have *cleared* the CR4.MCE bit now anyway... what it does
> > is explicitly *set* the bit even if it wasn't set before?
> > 
> > This is what got committed, and I think we can just drop the
> > ALTERNATIVE line completely because it's redundant in the case that
> > CR4.MCE was already set, and *wrong* in the case that it wasn't already
> > set?
> 
> But we AND R13 against $(X86_CR4_PAE | X86_CR4_LA57). We will lose MCE if
> we drop the ALTERNATIVE.

Ah, yes.

> And we don't want MCE to be enabled during kexec for !TDX_GUEST:
> 
> https://lore.kernel.org/all/1144340e-dd95-ee3b-dabb-579f9a65b3c7@citrix.com/

Actually, now that I've added proper exception handling in relocate_kernel,
perhaps we could rethink that. But that's for the future.

> I think we should patch AND instruction to include X86_CR4_MCE on
> TDX_GUEST:
> ...
> -	andl	$(X86_CR4_PAE | X86_CR4_LA57), %r13d
> -	ALTERNATIVE "", __stringify(orl $X86_CR4_MCE, %r13d), X86_FEATURE_TDX_GUEST
> +	ALTERNATIVE __stringify(andl	$(X86_CR4_PAE | X86_CR4_LA57), %r13d), \
> +		    __stringify(andl	$(X86_CR4_PAE | X86_CR4_LA57 | X86_CR4_MCE), %r13d), X86_FEATURE_TDX_GUEST

Yeah... although the reason I'm looking at this is because I want to
kill the ALTERNATIVE so that I can move the relocate_kernel() function
into a data section:
https://lore.kernel.org/all/20241218212326.44qff3i5n6cxuu5d@jpoimboe/

So I think I'll do it like this instead:

diff --git a/arch/x86/include/asm/kexec.h b/arch/x86/include/asm/kexec.h
index 5081d0b9e290..bd9fc22a6be2 100644
--- a/arch/x86/include/asm/kexec.h
+++ b/arch/x86/include/asm/kexec.h
@@ -65,6 +65,7 @@ extern gate_desc kexec_debug_idt[];
 extern unsigned char kexec_debug_exc_vectors[];
 extern uint16_t kexec_debug_8250_port;
 extern unsigned long kexec_debug_8250_mmio32;
+extern uint32_t kexec_preserve_cr4_bits;
 #endif
 
 /*
diff --git a/arch/x86/kernel/machine_kexec_64.c b/arch/x86/kernel/machine_kexec_64.c
index 7abc7aa0261b..016862d2b544 100644
--- a/arch/x86/kernel/machine_kexec_64.c
+++ b/arch/x86/kernel/machine_kexec_64.c
@@ -353,6 +353,22 @@ int machine_kexec_prepare(struct kimage *image)
 	kexec_va_control_page = (unsigned long)control_page;
 	kexec_pa_table_page = (unsigned long)__pa(image->arch.pgd);
 
+	/*
+	 * The relocate_kernel assembly code sets CR4 to a subset of the bits
+	 * which were set during kernel runtime, including only:
+	 *  - physical address extension (which is always set in kernel)
+	 *  - 5-level paging (if it's enabled)
+	 *  - Machine check exception on TDX guests
+	 *
+	 * Clearing MCE may not be allowed in TDX guests, but it *should* be
+	 * cleared in the general case. Because of the conditional nature of
+	 * that, pass the set of bits in from the kernel for relocate_kernel
+	 * to do a simple 'andl' with them.
+	 */
+	kexec_preserve_cr4_bits = X86_CR4_PAE | X86_CR4_LA57;
+	if (cpu_feature_enabled(X86_FEATURE_TDX_GUEST))
+		kexec_preserve_cr4_bits |= X86_CR4_MCE;
+
 	if (image->type == KEXEC_TYPE_DEFAULT)
 		kexec_pa_swap_page = page_to_pfn(image->swap_page) << PAGE_SHIFT;
 
diff --git a/arch/x86/kernel/relocate_kernel_64.S b/arch/x86/kernel/relocate_kernel_64.S
index 4f8b7d318025..576b7bbdd55e 100644
--- a/arch/x86/kernel/relocate_kernel_64.S
+++ b/arch/x86/kernel/relocate_kernel_64.S
@@ -41,6 +41,7 @@ SYM_DATA(kexec_pa_swap_page, .quad 0)
 SYM_DATA_LOCAL(pa_backup_pages_map, .quad 0)
 SYM_DATA(kexec_debug_8250_mmio32, .quad 0)
 SYM_DATA(kexec_debug_8250_port, .word 0)
+SYM_DATA(kexec_preserve_cr4_bits, .long 0)
 
 	.balign 16
 SYM_DATA_START_LOCAL(kexec_debug_gdt)
@@ -183,17 +184,12 @@ SYM_CODE_START_LOCAL_NOALIGN(identity_mapped)
 	movq	%rax, %cr0
 
 	/*
-	 * Set cr4 to a known state:
-	 *  - physical address extension enabled
-	 *  - 5-level paging, if it was enabled before
-	 *  - Machine check exception on TDX guest, if it was enabled before.
-	 *    Clearing MCE might not be allowed in TDX guests, depending on setup.
+	 * Set CR4 to a known state, using the bitmask which was set in
+	 * machine_kexec_prepare().
 	 *
 	 * Use R13 that contains the original CR4 value, read in relocate_kernel().
-	 * PAE is always set in the original CR4.
 	 */
-	andl	$(X86_CR4_PAE | X86_CR4_LA57), %r13d
-	ALTERNATIVE "", __stringify(orl $X86_CR4_MCE, %r13d), X86_FEATURE_TDX_GUEST
+	andl	kexec_preserve_cr4_bits(%rip), %r13d
 	movq	%r13, %cr4
 
 	/* Flush the TLB (needed?) */
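
For anyone following along, the difference between the committed code and
the mask-based variants can be modelled in plain userspace C. This is only a
sketch, not kernel code: the function names cr4_committed(), preserve_mask()
and cr4_masked() are made up for illustration, and the tdx_guest parameter
stands in for cpu_feature_enabled(X86_FEATURE_TDX_GUEST). The bit positions
are the architectural ones (PAE = bit 5, MCE = bit 6, LA57 = bit 12).

```c
#include <stdbool.h>
#include <stdint.h>

/* Architectural CR4 bit positions, redefined locally for this sketch. */
#define X86_CR4_PAE	(1u << 5)
#define X86_CR4_MCE	(1u << 6)
#define X86_CR4_LA57	(1u << 12)

/*
 * Model of the committed code: 'andl' with PAE|LA57, then an
 * unconditional 'orl' of MCE on TDX guests. MCE ends up set on a TDX
 * guest even if it was clear in the original CR4.
 */
static uint32_t cr4_committed(uint32_t orig_cr4, bool tdx_guest)
{
	uint32_t cr4 = orig_cr4 & (X86_CR4_PAE | X86_CR4_LA57);

	if (tdx_guest)
		cr4 |= X86_CR4_MCE;
	return cr4;
}

/*
 * Model of the mask computed in machine_kexec_prepare() in the patch
 * above (hypothetical helper name).
 */
static uint32_t preserve_mask(bool tdx_guest)
{
	uint32_t mask = X86_CR4_PAE | X86_CR4_LA57;

	if (tdx_guest)
		mask |= X86_CR4_MCE;
	return mask;
}

/*
 * Model of the single 'andl' with the mask: MCE survives on a TDX guest
 * only if it was already set, and is always dropped otherwise.
 */
static uint32_t cr4_masked(uint32_t orig_cr4, bool tdx_guest)
{
	return orig_cr4 & preserve_mask(tdx_guest);
}
```

With an original CR4 of just X86_CR4_PAE on a TDX guest, cr4_committed()
returns PAE|MCE while cr4_masked() returns PAE alone, which is the
behavioural difference discussed above. In the non-TDX case both variants
drop MCE.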

