* [RFC PATCH v2 0/16] x86/kexec: Add exception handling for relocate_kernel and further yak-shaving
From: David Woodhouse @ 2024-11-22 22:38 UTC
To: kexec
Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
H. Peter Anvin, David Woodhouse, Kirill A. Shutemov, Kai Huang,
Nikolay Borisov, linux-kernel, Simon Horman, Dave Young,
Peter Zijlstra, jpoimboe
Make it easier to pass information into relocate_kernel by allowing it to
have actual variables which are set from the real kernel. To do this, move
it into the kernel's .data section, keeping its data and code together
with linker script rules. Execute it from the *copy* instead of its
original in the kernel data section, and clean it up a bit.
Then do what I originally started with, which is add a GDT+IDT and some
exception handling so we can actually catch problems instead of just
suffering a triple fault and wondering why the world hates us.
The serial output of the debug mode can be cleaned up a little, and it's
even now possible to pass in information about which serial port to write
to.
I'll also work on resyncing with the i386 code and applying as many of
these cleanups there as possible. And probably also make the 64-bit one
use a separate image->arch.pgd instead of lumping it into a single 8KiB
"control page" as we do on x86_64 at the moment.
But the basic cleanups are probably ready for another round of bikeshedding.
Testing the preserve_context mode with the following test case:
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>
#include <linux/kexec.h>
#include <linux/reboot.h>
#include <sys/reboot.h>
#include <sys/syscall.h>

int main (void)
{
	struct kexec_segment segment = {};
	unsigned char purgatory[] = {
		0x66, 0xba, 0xf8, 0x03,	// mov $0x3f8, %dx
		0xb0, 0x42,		// mov $0x42, %al
		0xee,			// outb %al, (%dx)
		0xc3,			// ret
	};
	int ret;

	segment.buf = &purgatory;
	segment.bufsz = sizeof(purgatory);
	segment.mem = (void *)0x400000;
	segment.memsz = 0x1000;
	ret = syscall(__NR_kexec_load, 0x400000, 1, &segment, KEXEC_PRESERVE_CONTEXT);
	if (ret) {
		perror("kexec_load");
		exit(1);
	}
	return 0;
}

David Woodhouse (16):
x86/kexec: Clean up and document register use in relocate_kernel_64.S
x86/kexec: Use named labels in swap_pages in relocate_kernel_64.S
x86/kexec: Restore GDT on return from preserve_context kexec
x86/kexec: Only swap pages for preserve_context mode
x86/kexec: Invoke copy of relocate_kernel() instead of the original
x86/kexec: Move relocate_kernel to kernel .data section
x86/kexec: Add data section to relocate_kernel
x86/kexec: Copy control page into place in machine_kexec_prepare()
x86/kexec: Drop page_list argument from relocate_kernel()
x86/kexec: Eliminate writes through kernel mapping of relocate_kernel page
x86/kexec: Clean up register usage in relocate_kernel()
x86/kexec: Mark relocate_kernel page as ROX instead of RWX
x86/kexec: Debugging support: load a GDT
x86/kexec: Debugging support: Load an IDT and basic exception entry points
x86/kexec: Debugging support: Dump registers on exception
[DO NOT MERGE] x86/kexec: enable DEBUG
arch/x86/include/asm/kexec.h | 13 +-
arch/x86/include/asm/sections.h | 1 +
arch/x86/kernel/machine_kexec_64.c | 55 +++--
arch/x86/kernel/relocate_kernel_64.S | 384 +++++++++++++++++++++++++++--------
arch/x86/kernel/vmlinux.lds.S | 12 +-
5 files changed, 358 insertions(+), 107 deletions(-)
* [RFC PATCH v2 01/16] x86/kexec: Clean up and document register use in relocate_kernel_64.S
From: David Woodhouse @ 2024-11-22 22:38 UTC
To: kexec
Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
H. Peter Anvin, David Woodhouse, Kirill A. Shutemov, Kai Huang,
Nikolay Borisov, linux-kernel, Simon Horman, Dave Young,
Peter Zijlstra, jpoimboe
From: David Woodhouse <dwmw@amazon.co.uk>
Add more comments explaining what each register contains, and save the
preserve_context flag to a non-clobbered register sooner, to keep things
simpler.
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Acked-by: Kai Huang <kai.huang@intel.com>
---
arch/x86/kernel/relocate_kernel_64.S | 18 ++++++++++++++----
1 file changed, 14 insertions(+), 4 deletions(-)
diff --git a/arch/x86/kernel/relocate_kernel_64.S b/arch/x86/kernel/relocate_kernel_64.S
index e9e88c342f75..7ee32bcb6e01 100644
--- a/arch/x86/kernel/relocate_kernel_64.S
+++ b/arch/x86/kernel/relocate_kernel_64.S
@@ -100,6 +100,9 @@ SYM_CODE_START_NOALIGN(relocate_kernel)
movq %r10, CP_PA_SWAP_PAGE(%r11)
movq %rdi, CP_PA_BACKUP_PAGES_MAP(%r11)
+ /* Save the preserve_context to %r11 as swap_pages clobbers %rcx. */
+ movq %rcx, %r11
+
/* Switch to the identity mapped page tables */
movq %r9, %cr3
@@ -116,6 +119,14 @@ SYM_CODE_END(relocate_kernel)
SYM_CODE_START_LOCAL_NOALIGN(identity_mapped)
UNWIND_HINT_END_OF_STACK
+ /*
+ * %rdi indirection page
+ * %rdx start address
+ * %r11 preserve_context
+ * %r12 host_mem_enc_active
+ * %r13 original CR4 when relocate_kernel() was invoked
+ */
+
/* set return address to 0 if not preserving context */
pushq $0
/* store the start address on the stack */
@@ -170,8 +181,6 @@ SYM_CODE_START_LOCAL_NOALIGN(identity_mapped)
wbinvd
.Lsme_off:
- /* Save the preserve_context to %r11 as swap_pages clobbers %rcx. */
- movq %rcx, %r11
call swap_pages
/*
@@ -183,13 +192,14 @@ SYM_CODE_START_LOCAL_NOALIGN(identity_mapped)
movq %cr3, %rax
movq %rax, %cr3
+ testq %r11, %r11 /* preserve_context */
+ jnz .Lrelocate
+
/*
* set all of the registers to known values
* leave %rsp alone
*/
- testq %r11, %r11
- jnz .Lrelocate
xorl %eax, %eax
xorl %ebx, %ebx
xorl %ecx, %ecx
--
2.47.0
* [RFC PATCH v2 02/16] x86/kexec: Use named labels in swap_pages in relocate_kernel_64.S
From: David Woodhouse @ 2024-11-22 22:38 UTC
To: kexec
Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
H. Peter Anvin, David Woodhouse, Kirill A. Shutemov, Kai Huang,
Nikolay Borisov, linux-kernel, Simon Horman, Dave Young,
Peter Zijlstra, jpoimboe
From: David Woodhouse <dwmw@amazon.co.uk>
Make the code a little more readable.
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Acked-by: Kai Huang <kai.huang@intel.com>
---
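For anyone following the control flow rather than the label names, the loop
can be modelled in C. This is only a rough user-space sketch under stated
assumptions, not kernel code: walk_indirection_list(), SKETCH_PAGE_SIZE and
the IND_* macro names are made up for illustration (the bit values are the
ones the testb instructions check), and it performs a plain copy per source
page rather than the three-way swap.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 4096u	/* assumption: 4KiB pages */

/* Flag bits as tested by swap_pages; the macro names are illustrative. */
#define IND_DESTINATION	0x1u
#define IND_INDIRECTION	0x2u
#define IND_DONE	0x4u
#define IND_SOURCE	0x8u

/* Equivalent of the andq $0xfffffffffffff000 mask on a 64-bit build. */
#define IND_ADDR_MASK	(~(uintptr_t)(SKETCH_PAGE_SIZE - 1))

/*
 * Hypothetical model of the swap_pages loop. 'head' plays the role of the
 * indirection_page argument. Returns the number of pages copied.
 */
static unsigned walk_indirection_list(uintptr_t head)
{
	uintptr_t entry = head;		/* %rcx */
	const uintptr_t *next = NULL;	/* %rbx */
	unsigned char *dest = NULL;	/* %rdi */
	unsigned copied = 0;

	for (;;) {
		if (entry & IND_DESTINATION) {
			dest = (unsigned char *)(entry & IND_ADDR_MASK);
		} else if (entry & IND_INDIRECTION) {
			next = (const uintptr_t *)(entry & IND_ADDR_MASK);
		} else if (entry & IND_DONE) {
			return copied;			/* .Ldone */
		} else if (entry & IND_SOURCE) {
			assert(dest != NULL);
			memcpy(dest, (const void *)(entry & IND_ADDR_MASK),
			       SKETCH_PAGE_SIZE);
			dest += SKETCH_PAGE_SIZE;	/* rep movsq advances %rdi */
			copied++;
		}
		/* .Lloop: read another word from the indirection page */
		assert(next != NULL);
		entry = *next++;
	}
}
```

The first entry is expected to be an indirection record, which is why the
sketch asserts that 'next' has been set before reading through it.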
arch/x86/kernel/relocate_kernel_64.S | 30 ++++++++++++++--------------
1 file changed, 15 insertions(+), 15 deletions(-)
diff --git a/arch/x86/kernel/relocate_kernel_64.S b/arch/x86/kernel/relocate_kernel_64.S
index 7ee32bcb6e01..ca01e3e2f097 100644
--- a/arch/x86/kernel/relocate_kernel_64.S
+++ b/arch/x86/kernel/relocate_kernel_64.S
@@ -272,31 +272,31 @@ SYM_CODE_START_LOCAL_NOALIGN(swap_pages)
movq %rdi, %rcx /* Put the indirection_page in %rcx */
xorl %edi, %edi
xorl %esi, %esi
- jmp 1f
+ jmp .Lstart /* Should start with an indirection record */
-0: /* top, read another word for the indirection page */
+.Lloop: /* top, read another word for the indirection page */
movq (%rbx), %rcx
addq $8, %rbx
-1:
+.Lstart:
testb $0x1, %cl /* is it a destination page? */
- jz 2f
+ jz .Lnotdest
movq %rcx, %rdi
andq $0xfffffffffffff000, %rdi
- jmp 0b
-2:
+ jmp .Lloop
+.Lnotdest:
testb $0x2, %cl /* is it an indirection page? */
- jz 2f
+ jz .Lnotind
movq %rcx, %rbx
andq $0xfffffffffffff000, %rbx
- jmp 0b
-2:
+ jmp .Lloop
+.Lnotind:
testb $0x4, %cl /* is it the done indicator? */
- jz 2f
- jmp 3f
-2:
+ jz .Lnotdone
+ jmp .Ldone
+.Lnotdone:
testb $0x8, %cl /* is it the source indicator? */
- jz 0b /* Ignore it otherwise */
+ jz .Lloop /* Ignore it otherwise */
movq %rcx, %rsi /* For ever source page do a copy */
andq $0xfffffffffffff000, %rsi
@@ -321,8 +321,8 @@ SYM_CODE_START_LOCAL_NOALIGN(swap_pages)
rep ; movsq
lea PAGE_SIZE(%rax), %rsi
- jmp 0b
-3:
+ jmp .Lloop
+.Ldone:
ANNOTATE_UNRET_SAFE
ret
int3
--
2.47.0
* [RFC PATCH v2 03/16] x86/kexec: Restore GDT on return from preserve_context kexec
From: David Woodhouse @ 2024-11-22 22:38 UTC
To: kexec
Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
H. Peter Anvin, David Woodhouse, Kirill A. Shutemov, Kai Huang,
Nikolay Borisov, linux-kernel, Simon Horman, Dave Young,
Peter Zijlstra, jpoimboe
From: David Woodhouse <dwmw@amazon.co.uk>
The restore_processor_state() function explicitly states that "the asm code
that gets us here will have restored a usable GDT". That wasn't true in the
case of returning from a preserve_context kexec. Make it so.
Without this, the kernel was depending on the called function to reload an
appropriate GDT.
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
---
arch/x86/kernel/relocate_kernel_64.S | 5 +++++
1 file changed, 5 insertions(+)
diff --git a/arch/x86/kernel/relocate_kernel_64.S b/arch/x86/kernel/relocate_kernel_64.S
index ca01e3e2f097..ed2ae50535dd 100644
--- a/arch/x86/kernel/relocate_kernel_64.S
+++ b/arch/x86/kernel/relocate_kernel_64.S
@@ -252,6 +252,11 @@ SYM_CODE_START_LOCAL_NOALIGN(virtual_mapped)
movq CR0(%r8), %r8
movq %rax, %cr3
movq %r8, %cr0
+
+ /* Saved in save_processor_state. */
+ movq $saved_context, %rax
+ lgdt saved_context_gdt_desc(%rax)
+
movq %rbp, %rax
popf
--
2.47.0
* [RFC PATCH v2 04/16] x86/kexec: Only swap pages for preserve_context mode
From: David Woodhouse @ 2024-11-22 22:38 UTC
To: kexec
Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
H. Peter Anvin, David Woodhouse, Kirill A. Shutemov, Kai Huang,
Nikolay Borisov, linux-kernel, Simon Horman, Dave Young,
Peter Zijlstra, jpoimboe
From: David Woodhouse <dwmw@amazon.co.uk>
There's no need to swap pages (which involves three memcopies for each
page) in the plain kexec case. Just do a single copy from source to
destination page.
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
---
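The difference between the two cases can be shown with a small C model. It is
a sketch only: move_page() and SKETCH_PAGE_SIZE are hypothetical names, and
memcpy() stands in for the rep movsq sequences. With preserve_context the old
contents of the destination have to survive (they end up in the source page),
hence the detour through the swap page. Without it a single copy is enough.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 4096u	/* assumption: 4KiB pages */

/*
 * Hypothetical model of the per-page work in swap_pages.
 * Returns the number of page-sized copies performed.
 */
static unsigned move_page(unsigned char *dest, unsigned char *src,
			  unsigned char *swap, bool preserve_context)
{
	assert(dest && src && swap);

	if (!preserve_context) {
		/* Plain kexec: the old destination contents are not needed. */
		memcpy(dest, src, SKETCH_PAGE_SIZE);
		return 1;
	}

	memcpy(swap, src, SKETCH_PAGE_SIZE);	/* source -> swap page */
	memcpy(src, dest, SKETCH_PAGE_SIZE);	/* destination -> source */
	memcpy(dest, swap, SKETCH_PAGE_SIZE);	/* swap page -> destination */
	return 3;
}
```

In the swap case the source page holds the previous destination contents
afterwards, which is what lets the original kernel be put back on return.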
arch/x86/kernel/relocate_kernel_64.S | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/arch/x86/kernel/relocate_kernel_64.S b/arch/x86/kernel/relocate_kernel_64.S
index ed2ae50535dd..92d5dbed3097 100644
--- a/arch/x86/kernel/relocate_kernel_64.S
+++ b/arch/x86/kernel/relocate_kernel_64.S
@@ -308,6 +308,9 @@ SYM_CODE_START_LOCAL_NOALIGN(swap_pages)
movq %rdi, %rdx /* Save destination page to %rdx */
movq %rsi, %rax /* Save source page to %rax */
+ testq %r11, %r11 /* Only actually swap for preserve_context */
+ jz .Lnoswap
+
/* copy source page to swap page */
movq %r10, %rdi
movl $512, %ecx
@@ -322,6 +325,7 @@ SYM_CODE_START_LOCAL_NOALIGN(swap_pages)
/* copy swap page to destination page */
movq %rdx, %rdi
movq %r10, %rsi
+.Lnoswap:
movl $512, %ecx
rep ; movsq
--
2.47.0
* [RFC PATCH v2 05/16] x86/kexec: Invoke copy of relocate_kernel() instead of the original
From: David Woodhouse @ 2024-11-22 22:38 UTC
To: kexec
Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
H. Peter Anvin, David Woodhouse, Kirill A. Shutemov, Kai Huang,
Nikolay Borisov, linux-kernel, Simon Horman, Dave Young,
Peter Zijlstra, jpoimboe
From: David Woodhouse <dwmw@amazon.co.uk>
This currently calls set_memory_x() from machine_kexec_prepare() just
like the 32-bit version does. That's actually a bit earlier than I'd
like, as it leaves the page RWX all the time the image is even *loaded*.
Subsequent commits will eliminate all the writes to the page between the
point it's marked executable in machine_kexec_prepare() and the time that
relocate_kernel() is running and has switched to the identmap %cr3, so
that it can be ROX. But that can't happen until it's moved to the .data
section of the kernel, and *that* can't happen until we start executing
the copy instead of executing it in place in the kernel .text. So break
the circular dependency in those commits by letting it be RWX for now.
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
---
arch/x86/kernel/machine_kexec_64.c | 28 +++++++++++++++++++++-------
arch/x86/kernel/relocate_kernel_64.S | 5 ++++-
2 files changed, 25 insertions(+), 8 deletions(-)
diff --git a/arch/x86/kernel/machine_kexec_64.c b/arch/x86/kernel/machine_kexec_64.c
index 9c9ac606893e..3aeb225a0b36 100644
--- a/arch/x86/kernel/machine_kexec_64.c
+++ b/arch/x86/kernel/machine_kexec_64.c
@@ -156,8 +156,8 @@ static int init_transition_pgtable(struct kimage *image, pgd_t *pgd)
pmd_t *pmd;
pte_t *pte;
- vaddr = (unsigned long)relocate_kernel;
- paddr = __pa(page_address(image->control_code_page)+PAGE_SIZE);
+ vaddr = (unsigned long)page_address(image->control_code_page) + PAGE_SIZE;
+ paddr = __pa(vaddr);
pgd += pgd_index(vaddr);
if (!pgd_present(*pgd)) {
p4d = (p4d_t *)get_zeroed_page(GFP_KERNEL);
@@ -296,6 +296,7 @@ static void load_segments(void)
int machine_kexec_prepare(struct kimage *image)
{
+ void *control_page = page_address(image->control_code_page) + PAGE_SIZE;
unsigned long start_pgtable;
int result;
@@ -307,11 +308,17 @@ int machine_kexec_prepare(struct kimage *image)
if (result)
return result;
+ set_memory_x((unsigned long)control_page, 1);
+
return 0;
}
void machine_kexec_cleanup(struct kimage *image)
{
+ void *control_page = page_address(image->control_code_page) + PAGE_SIZE;
+
+ set_memory_nx((unsigned long)control_page, 1);
+
free_transition_pgtable(image);
}
@@ -321,6 +328,11 @@ void machine_kexec_cleanup(struct kimage *image)
*/
void machine_kexec(struct kimage *image)
{
+ unsigned long (*relocate_kernel_ptr)(unsigned long indirection_page,
+ unsigned long page_list,
+ unsigned long start_address,
+ unsigned int preserve_context,
+ unsigned int host_mem_enc_active);
unsigned long page_list[PAGES_NR];
unsigned int host_mem_enc_active;
int save_ftrace_enabled;
@@ -369,6 +381,8 @@ void machine_kexec(struct kimage *image)
page_list[PA_SWAP_PAGE] = (page_to_pfn(image->swap_page)
<< PAGE_SHIFT);
+ relocate_kernel_ptr = control_page;
+
/*
* The segment registers are funny things, they have both a
* visible and an invisible part. Whenever the visible part is
@@ -388,11 +402,11 @@ void machine_kexec(struct kimage *image)
native_gdt_invalidate();
/* now call it */
- image->start = relocate_kernel((unsigned long)image->head,
- (unsigned long)page_list,
- image->start,
- image->preserve_context,
- host_mem_enc_active);
+ image->start = relocate_kernel_ptr((unsigned long)image->head,
+ (unsigned long)page_list,
+ image->start,
+ image->preserve_context,
+ host_mem_enc_active);
#ifdef CONFIG_KEXEC_JUMP
if (image->preserve_context)
diff --git a/arch/x86/kernel/relocate_kernel_64.S b/arch/x86/kernel/relocate_kernel_64.S
index 92d5dbed3097..70539b1b9545 100644
--- a/arch/x86/kernel/relocate_kernel_64.S
+++ b/arch/x86/kernel/relocate_kernel_64.S
@@ -39,6 +39,7 @@
#define CP_PA_TABLE_PAGE DATA(0x20)
#define CP_PA_SWAP_PAGE DATA(0x28)
#define CP_PA_BACKUP_PAGES_MAP DATA(0x30)
+#define CP_VA_CONTROL_PAGE DATA(0x38)
.text
.align PAGE_SIZE
@@ -99,6 +100,7 @@ SYM_CODE_START_NOALIGN(relocate_kernel)
movq %r9, CP_PA_TABLE_PAGE(%r11)
movq %r10, CP_PA_SWAP_PAGE(%r11)
movq %rdi, CP_PA_BACKUP_PAGES_MAP(%r11)
+ movq %r11, CP_VA_CONTROL_PAGE(%r11)
/* Save the preserve_context to %r11 as swap_pages clobbers %rcx. */
movq %rcx, %r11
@@ -235,7 +237,8 @@ SYM_CODE_START_LOCAL_NOALIGN(identity_mapped)
movq %rax, %cr3
lea PAGE_SIZE(%r8), %rsp
call swap_pages
- movq $virtual_mapped, %rax
+ movq CP_VA_CONTROL_PAGE(%r8), %rax
+ addq $(virtual_mapped - relocate_kernel), %rax
pushq %rax
ANNOTATE_UNRET_SAFE
ret
--
2.47.0
* [RFC PATCH v2 06/16] x86/kexec: Move relocate_kernel to kernel .data section
From: David Woodhouse @ 2024-11-22 22:38 UTC
To: kexec
Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
H. Peter Anvin, David Woodhouse, Kirill A. Shutemov, Kai Huang,
Nikolay Borisov, linux-kernel, Simon Horman, Dave Young,
Peter Zijlstra, jpoimboe
From: David Woodhouse <dwmw@amazon.co.uk>
Now that the copy is executed instead of the original, the relocate_kernel
page can live in the kernel's .data section. This will allow subsequent
commits to actually add real data to it and clean up the code somewhat as
well as making the control page ROX.
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
---
arch/x86/include/asm/sections.h | 1 +
arch/x86/kernel/machine_kexec_64.c | 4 +++-
arch/x86/kernel/relocate_kernel_64.S | 6 +-----
arch/x86/kernel/vmlinux.lds.S | 11 ++++++++++-
4 files changed, 15 insertions(+), 7 deletions(-)
diff --git a/arch/x86/include/asm/sections.h b/arch/x86/include/asm/sections.h
index 3fa87e5e11ab..30e8ee7006f9 100644
--- a/arch/x86/include/asm/sections.h
+++ b/arch/x86/include/asm/sections.h
@@ -5,6 +5,7 @@
#include <asm-generic/sections.h>
#include <asm/extable.h>
+extern char __relocate_kernel_start[], __relocate_kernel_end[];
extern char __brk_base[], __brk_limit[];
extern char __end_rodata_aligned[];
diff --git a/arch/x86/kernel/machine_kexec_64.c b/arch/x86/kernel/machine_kexec_64.c
index 3aeb225a0b36..048868d868ce 100644
--- a/arch/x86/kernel/machine_kexec_64.c
+++ b/arch/x86/kernel/machine_kexec_64.c
@@ -333,6 +333,8 @@ void machine_kexec(struct kimage *image)
unsigned long start_address,
unsigned int preserve_context,
unsigned int host_mem_enc_active);
+ unsigned long reloc_start = (unsigned long)__relocate_kernel_start;
+ unsigned long reloc_end = (unsigned long)__relocate_kernel_end;
unsigned long page_list[PAGES_NR];
unsigned int host_mem_enc_active;
int save_ftrace_enabled;
@@ -370,7 +372,7 @@ void machine_kexec(struct kimage *image)
}
control_page = page_address(image->control_code_page) + PAGE_SIZE;
- __memcpy(control_page, relocate_kernel, KEXEC_CONTROL_CODE_MAX_SIZE);
+ __memcpy(control_page, __relocate_kernel_start, reloc_end - reloc_start);
page_list[PA_CONTROL_PAGE] = virt_to_phys(control_page);
page_list[VA_CONTROL_PAGE] = (unsigned long)control_page;
diff --git a/arch/x86/kernel/relocate_kernel_64.S b/arch/x86/kernel/relocate_kernel_64.S
index 70539b1b9545..085dddf79476 100644
--- a/arch/x86/kernel/relocate_kernel_64.S
+++ b/arch/x86/kernel/relocate_kernel_64.S
@@ -41,10 +41,8 @@
#define CP_PA_BACKUP_PAGES_MAP DATA(0x30)
#define CP_VA_CONTROL_PAGE DATA(0x38)
- .text
- .align PAGE_SIZE
+ .section .text.relocate_kernel,"ax";
.code64
-SYM_CODE_START_NOALIGN(relocate_range)
SYM_CODE_START_NOALIGN(relocate_kernel)
UNWIND_HINT_END_OF_STACK
ANNOTATE_NOENDBR
@@ -340,5 +338,3 @@ SYM_CODE_START_LOCAL_NOALIGN(swap_pages)
int3
SYM_CODE_END(swap_pages)
- .skip KEXEC_CONTROL_CODE_MAX_SIZE - (. - relocate_kernel), 0xcc
-SYM_CODE_END(relocate_range);
diff --git a/arch/x86/kernel/vmlinux.lds.S b/arch/x86/kernel/vmlinux.lds.S
index b8c5741d2fb4..925a821134b5 100644
--- a/arch/x86/kernel/vmlinux.lds.S
+++ b/arch/x86/kernel/vmlinux.lds.S
@@ -95,7 +95,15 @@ const_pcpu_hot = pcpu_hot;
#define BSS_DECRYPTED
#endif
-
+#if defined(CONFIG_X86_64) && defined(CONFIG_KEXEC_CORE)
+#define KEXEC_RELOCATE_KERNEL \
+ . = ALIGN(0x100); \
+ __relocate_kernel_start = .; \
+ *(.text.relocate_kernel); \
+ __relocate_kernel_end = .;
+#else
+#define KEXEC_RELOCATE_KERNEL
+#endif
PHDRS {
text PT_LOAD FLAGS(5); /* R_E */
data PT_LOAD FLAGS(6); /* RW_ */
@@ -181,6 +189,7 @@ SECTIONS
DATA_DATA
CONSTRUCTORS
+ KEXEC_RELOCATE_KERNEL
/* rarely changed data like cpu maps */
READ_MOSTLY_DATA(INTERNODE_CACHE_BYTES)
--
2.47.0
* [RFC PATCH v2 07/16] x86/kexec: Add data section to relocate_kernel
From: David Woodhouse @ 2024-11-22 22:38 UTC
To: kexec
Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
H. Peter Anvin, David Woodhouse, Kirill A. Shutemov, Kai Huang,
Nikolay Borisov, linux-kernel, Simon Horman, Dave Young,
Peter Zijlstra, jpoimboe
From: David Woodhouse <dwmw@amazon.co.uk>
Now that the relocate_kernel page is handled sanely by a linker script
we can have actual data, and just use %rip-relative addressing to access
it.
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
---
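Since the blob is now copied as a unit and relocate_kernel may no longer sit
at its very start, anything that refers to a symbol in the copy has to go
through an offset from the start of the blob. A minimal illustration in plain
C follows. symbol_in_copy() is a hypothetical helper, not something this
patch adds. It just mirrors the
"control_page + (unsigned long)relocate_kernel - reloc_start" arithmetic.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper: given the start of the original blob, the address of
 * a symbol inside it, and the start of the copy, return the address of the
 * same symbol inside the copy.
 */
static void *symbol_in_copy(void *copy_start, const void *orig_start,
			    const void *orig_symbol)
{
	uintptr_t offset;

	/* The symbol must not lie before the start of the blob. */
	assert((uintptr_t)orig_symbol >= (uintptr_t)orig_start);
	offset = (uintptr_t)orig_symbol - (uintptr_t)orig_start;

	return (unsigned char *)copy_start + offset;
}
```

The same calculation applies to data symbols, which is what makes it possible
to treat the copied code and data as a single position-independent unit.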
arch/x86/kernel/machine_kexec_64.c | 7 +++-
arch/x86/kernel/relocate_kernel_64.S | 62 ++++++++++++++--------------
arch/x86/kernel/vmlinux.lds.S | 1 +
3 files changed, 37 insertions(+), 33 deletions(-)
diff --git a/arch/x86/kernel/machine_kexec_64.c b/arch/x86/kernel/machine_kexec_64.c
index 048868d868ce..123e9544506b 100644
--- a/arch/x86/kernel/machine_kexec_64.c
+++ b/arch/x86/kernel/machine_kexec_64.c
@@ -383,7 +383,12 @@ void machine_kexec(struct kimage *image)
page_list[PA_SWAP_PAGE] = (page_to_pfn(image->swap_page)
<< PAGE_SHIFT);
- relocate_kernel_ptr = control_page;
+ /*
+ * Allow for the possibility that relocate_kernel might not be at
+ * the very start of the page.
+ */
+ relocate_kernel_ptr = control_page + (unsigned long)relocate_kernel -
+ reloc_start;
/*
* The segment registers are funny things, they have both a
diff --git a/arch/x86/kernel/relocate_kernel_64.S b/arch/x86/kernel/relocate_kernel_64.S
index 085dddf79476..445ca56dabbe 100644
--- a/arch/x86/kernel/relocate_kernel_64.S
+++ b/arch/x86/kernel/relocate_kernel_64.S
@@ -23,23 +23,21 @@
#define PAGE_ATTR (_PAGE_PRESENT | _PAGE_RW | _PAGE_ACCESSED | _PAGE_DIRTY)
/*
- * control_page + KEXEC_CONTROL_CODE_MAX_SIZE
- * ~ control_page + PAGE_SIZE are used as data storage and stack for
- * jumping back
+ * The .text.relocate_kernel and .data.relocate_kernel sections are copied
+ * into the control page, and the remainder of the page is used as the stack.
*/
-#define DATA(offset) (KEXEC_CONTROL_CODE_MAX_SIZE+(offset))
+ .section .data.relocate_kernel,"a";
/* Minimal CPU state */
-#define RSP DATA(0x0)
-#define CR0 DATA(0x8)
-#define CR3 DATA(0x10)
-#define CR4 DATA(0x18)
-
-/* other data */
-#define CP_PA_TABLE_PAGE DATA(0x20)
-#define CP_PA_SWAP_PAGE DATA(0x28)
-#define CP_PA_BACKUP_PAGES_MAP DATA(0x30)
-#define CP_VA_CONTROL_PAGE DATA(0x38)
+SYM_DATA_LOCAL(saved_rsp, .quad 0)
+SYM_DATA_LOCAL(saved_cr0, .quad 0)
+SYM_DATA_LOCAL(saved_cr3, .quad 0)
+SYM_DATA_LOCAL(saved_cr4, .quad 0)
+ /* other data */
+SYM_DATA_LOCAL(va_control_page, .quad 0)
+SYM_DATA_LOCAL(pa_table_page, .quad 0)
+SYM_DATA_LOCAL(pa_swap_page, .quad 0)
+SYM_DATA_LOCAL(pa_backup_pages_map, .quad 0)
.section .text.relocate_kernel,"ax";
.code64
@@ -63,14 +61,13 @@ SYM_CODE_START_NOALIGN(relocate_kernel)
pushq %r15
pushf
- movq PTR(VA_CONTROL_PAGE)(%rsi), %r11
- movq %rsp, RSP(%r11)
+ movq %rsp, saved_rsp(%rip)
movq %cr0, %rax
- movq %rax, CR0(%r11)
+ movq %rax, saved_cr0(%rip)
movq %cr3, %rax
- movq %rax, CR3(%r11)
+ movq %rax, saved_cr3(%rip)
movq %cr4, %rax
- movq %rax, CR4(%r11)
+ movq %rax, saved_cr4(%rip)
/* Save CR4. Required to enable the right paging mode later. */
movq %rax, %r13
@@ -83,10 +80,11 @@ SYM_CODE_START_NOALIGN(relocate_kernel)
movq %r8, %r12
/*
- * get physical address of control page now
+ * get physical and virtual address of control page now
* this is impossible after page table switch
*/
movq PTR(PA_CONTROL_PAGE)(%rsi), %r8
+ movq PTR(VA_CONTROL_PAGE)(%rsi), %r11
/* get physical address of page table now too */
movq PTR(PA_TABLE_PAGE)(%rsi), %r9
@@ -95,10 +93,10 @@ SYM_CODE_START_NOALIGN(relocate_kernel)
movq PTR(PA_SWAP_PAGE)(%rsi), %r10
/* save some information for jumping back */
- movq %r9, CP_PA_TABLE_PAGE(%r11)
- movq %r10, CP_PA_SWAP_PAGE(%r11)
- movq %rdi, CP_PA_BACKUP_PAGES_MAP(%r11)
- movq %r11, CP_VA_CONTROL_PAGE(%r11)
+ movq %r9, pa_table_page(%rip)
+ movq %r10, pa_swap_page(%rip)
+ movq %rdi, pa_backup_pages_map(%rip)
+ movq %r11, va_control_page(%rip)
/* Save the preserve_context to %r11 as swap_pages clobbers %rcx. */
movq %rcx, %r11
@@ -229,13 +227,13 @@ SYM_CODE_START_LOCAL_NOALIGN(identity_mapped)
/* get the re-entry point of the peer system */
movq 0(%rsp), %rbp
leaq relocate_kernel(%rip), %r8
- movq CP_PA_SWAP_PAGE(%r8), %r10
- movq CP_PA_BACKUP_PAGES_MAP(%r8), %rdi
- movq CP_PA_TABLE_PAGE(%r8), %rax
+ movq pa_swap_page(%rip), %r10
+ movq pa_backup_pages_map(%rip), %rdi
+ movq pa_table_page(%rip), %rax
movq %rax, %cr3
lea PAGE_SIZE(%r8), %rsp
call swap_pages
- movq CP_VA_CONTROL_PAGE(%r8), %rax
+ movq va_control_page(%rip), %rax
addq $(virtual_mapped - relocate_kernel), %rax
pushq %rax
ANNOTATE_UNRET_SAFE
@@ -246,11 +244,11 @@ SYM_CODE_END(identity_mapped)
SYM_CODE_START_LOCAL_NOALIGN(virtual_mapped)
UNWIND_HINT_END_OF_STACK
ANNOTATE_NOENDBR // RET target, above
- movq RSP(%r8), %rsp
- movq CR4(%r8), %rax
+ movq saved_rsp(%rip), %rsp
+ movq saved_cr4(%rip), %rax
movq %rax, %cr4
- movq CR3(%r8), %rax
- movq CR0(%r8), %r8
+ movq saved_cr3(%rip), %rax
+ movq saved_cr0(%rip), %r8
movq %rax, %cr3
movq %r8, %cr0
diff --git a/arch/x86/kernel/vmlinux.lds.S b/arch/x86/kernel/vmlinux.lds.S
index 925a821134b5..324c1c42faae 100644
--- a/arch/x86/kernel/vmlinux.lds.S
+++ b/arch/x86/kernel/vmlinux.lds.S
@@ -100,6 +100,7 @@ const_pcpu_hot = pcpu_hot;
. = ALIGN(0x100); \
__relocate_kernel_start = .; \
*(.text.relocate_kernel); \
+ *(.data.relocate_kernel); \
__relocate_kernel_end = .;
#else
#define KEXEC_RELOCATE_KERNEL
--
2.47.0
* [RFC PATCH v2 08/16] x86/kexec: Copy control page into place in machine_kexec_prepare()
From: David Woodhouse @ 2024-11-22 22:38 UTC
To: kexec
Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
H. Peter Anvin, David Woodhouse, Kirill A. Shutemov, Kai Huang,
Nikolay Borisov, linux-kernel, Simon Horman, Dave Young,
Peter Zijlstra, jpoimboe
From: David Woodhouse <dwmw@amazon.co.uk>
There's no need for this to wait until the actual machine_kexec() invocation;
a subsequent change will mark the control page ROX so all writes should be
completed earlier.
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
---
arch/x86/kernel/machine_kexec_64.c | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)
diff --git a/arch/x86/kernel/machine_kexec_64.c b/arch/x86/kernel/machine_kexec_64.c
index 123e9544506b..60632a5a2a13 100644
--- a/arch/x86/kernel/machine_kexec_64.c
+++ b/arch/x86/kernel/machine_kexec_64.c
@@ -297,6 +297,8 @@ static void load_segments(void)
int machine_kexec_prepare(struct kimage *image)
{
void *control_page = page_address(image->control_code_page) + PAGE_SIZE;
+ unsigned long reloc_start = (unsigned long)__relocate_kernel_start;
+ unsigned long reloc_end = (unsigned long)__relocate_kernel_end;
unsigned long start_pgtable;
int result;
@@ -308,6 +310,8 @@ int machine_kexec_prepare(struct kimage *image)
if (result)
return result;
+ __memcpy(control_page, __relocate_kernel_start, reloc_end - reloc_start);
+
set_memory_x((unsigned long)control_page, 1);
return 0;
@@ -334,7 +338,6 @@ void machine_kexec(struct kimage *image)
unsigned int preserve_context,
unsigned int host_mem_enc_active);
unsigned long reloc_start = (unsigned long)__relocate_kernel_start;
- unsigned long reloc_end = (unsigned long)__relocate_kernel_end;
unsigned long page_list[PAGES_NR];
unsigned int host_mem_enc_active;
int save_ftrace_enabled;
@@ -372,7 +375,6 @@ void machine_kexec(struct kimage *image)
}
control_page = page_address(image->control_code_page) + PAGE_SIZE;
- __memcpy(control_page, __relocate_kernel_start, reloc_end - reloc_start);
page_list[PA_CONTROL_PAGE] = virt_to_phys(control_page);
page_list[VA_CONTROL_PAGE] = (unsigned long)control_page;
--
2.47.0
* [RFC PATCH v2 09/16] x86/kexec: Drop page_list argument from relocate_kernel()
From: David Woodhouse @ 2024-11-22 22:38 UTC
To: kexec
Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
H. Peter Anvin, David Woodhouse, Kirill A. Shutemov, Kai Huang,
Nikolay Borisov, linux-kernel, Simon Horman, Dave Young,
Peter Zijlstra, jpoimboe
From: David Woodhouse <dwmw@amazon.co.uk>
The kernel's virtual mapping of the relocate_kernel page currently needs
to be RWX because it is written to before the %cr3 switch.
Now that the relocate_kernel page has its own .data section and local
variables, it can also have *global* variables. So eliminate the separate
page_list argument, and write the same information directly to variables
in the relocate_kernel page instead. This way, the relocate_kernel code
itself doesn't need to copy it.
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
---
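A note on ordering, as a minimal userspace sketch rather than the kernel code itself: because the variables now live inside the relocate_kernel blob, they have to be filled in before the blob is copied to the control page, otherwise the copy carries stale zeroes. The struct layout and the names reloc_blob, reloc_template and prepare_control_page below are hypothetical stand-ins, not symbols from this patch.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the relocate_kernel blob: code plus variables. */
struct reloc_blob {
	unsigned char code[64];		/* placeholder for the relocation code */
	uint64_t va_control_page;	/* mirrors kexec_va_control_page */
	uint64_t pa_table_page;		/* mirrors kexec_pa_table_page */
	uint64_t pa_swap_page;		/* mirrors kexec_pa_swap_page */
};

/* The template that stays in the running kernel's data section. */
static struct reloc_blob reloc_template;

/*
 * Fill in the template's variables, then copy the whole blob into the
 * control page. Values written to the template after the copy would
 * never reach the control page.
 */
static void prepare_control_page(void *control_page, uint64_t va,
				 uint64_t pa_table, uint64_t pa_swap)
{
	assert(control_page != NULL);

	reloc_template.va_control_page = va;
	reloc_template.pa_table_page = pa_table;
	reloc_template.pa_swap_page = pa_swap;

	memcpy(control_page, &reloc_template, sizeof(reloc_template));
}
```

Calling prepare_control_page() with a buffer of at least sizeof(struct reloc_blob) bytes leaves the three values readable from the copy, which is the property the assembly relies on when it reads them %rip-relative.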
arch/x86/include/asm/kexec.h | 13 +++++-----
arch/x86/kernel/machine_kexec_64.c | 21 +++++++---------
arch/x86/kernel/relocate_kernel_64.S | 36 ++++++++++------------------
3 files changed, 27 insertions(+), 43 deletions(-)
diff --git a/arch/x86/include/asm/kexec.h b/arch/x86/include/asm/kexec.h
index ae5482a2f0ca..9af54743de90 100644
--- a/arch/x86/include/asm/kexec.h
+++ b/arch/x86/include/asm/kexec.h
@@ -8,12 +8,6 @@
# define PA_PGD 2
# define PA_SWAP_PAGE 3
# define PAGES_NR 4
-#else
-# define PA_CONTROL_PAGE 0
-# define VA_CONTROL_PAGE 1
-# define PA_TABLE_PAGE 2
-# define PA_SWAP_PAGE 3
-# define PAGES_NR 4
#endif
# define KEXEC_CONTROL_CODE_MAX_SIZE 2048
@@ -63,6 +57,11 @@ struct kimage;
/* The native architecture */
# define KEXEC_ARCH KEXEC_ARCH_X86_64
+
+extern unsigned long kexec_pa_control_page;
+extern unsigned long kexec_va_control_page;
+extern unsigned long kexec_pa_table_page;
+extern unsigned long kexec_pa_swap_page;
#endif
/*
@@ -125,7 +124,7 @@ relocate_kernel(unsigned long indirection_page,
#else
unsigned long
relocate_kernel(unsigned long indirection_page,
- unsigned long page_list,
+ unsigned long pa_control_page,
unsigned long start_address,
unsigned int preserve_context,
unsigned int host_mem_enc_active);
diff --git a/arch/x86/kernel/machine_kexec_64.c b/arch/x86/kernel/machine_kexec_64.c
index 60632a5a2a13..c653c2c22d63 100644
--- a/arch/x86/kernel/machine_kexec_64.c
+++ b/arch/x86/kernel/machine_kexec_64.c
@@ -309,6 +309,13 @@ int machine_kexec_prepare(struct kimage *image)
result = init_pgtable(image, start_pgtable);
if (result)
return result;
+ kexec_va_control_page = (unsigned long)control_page;
+ kexec_pa_table_page =
+ (unsigned long)__pa(page_address(image->control_code_page));
+
+ if (image->type == KEXEC_TYPE_DEFAULT)
+ kexec_pa_swap_page = (page_to_pfn(image->swap_page)
+ << PAGE_SHIFT);
__memcpy(control_page, __relocate_kernel_start, reloc_end - reloc_start);
@@ -333,12 +340,11 @@ void machine_kexec_cleanup(struct kimage *image)
void machine_kexec(struct kimage *image)
{
unsigned long (*relocate_kernel_ptr)(unsigned long indirection_page,
- unsigned long page_list,
+ unsigned long pa_control_page,
unsigned long start_address,
unsigned int preserve_context,
unsigned int host_mem_enc_active);
unsigned long reloc_start = (unsigned long)__relocate_kernel_start;
- unsigned long page_list[PAGES_NR];
unsigned int host_mem_enc_active;
int save_ftrace_enabled;
void *control_page;
@@ -376,15 +382,6 @@ void machine_kexec(struct kimage *image)
control_page = page_address(image->control_code_page) + PAGE_SIZE;
- page_list[PA_CONTROL_PAGE] = virt_to_phys(control_page);
- page_list[VA_CONTROL_PAGE] = (unsigned long)control_page;
- page_list[PA_TABLE_PAGE] =
- (unsigned long)__pa(page_address(image->control_code_page));
-
- if (image->type == KEXEC_TYPE_DEFAULT)
- page_list[PA_SWAP_PAGE] = (page_to_pfn(image->swap_page)
- << PAGE_SHIFT);
-
/*
* Allow for the possibility that relocate_kernel might not be at
* the very start of the page.
@@ -412,7 +409,7 @@ void machine_kexec(struct kimage *image)
/* now call it */
image->start = relocate_kernel_ptr((unsigned long)image->head,
- (unsigned long)page_list,
+ virt_to_phys(control_page),
image->start,
image->preserve_context,
host_mem_enc_active);
diff --git a/arch/x86/kernel/relocate_kernel_64.S b/arch/x86/kernel/relocate_kernel_64.S
index 445ca56dabbe..b9ad3ef0b982 100644
--- a/arch/x86/kernel/relocate_kernel_64.S
+++ b/arch/x86/kernel/relocate_kernel_64.S
@@ -34,9 +34,9 @@ SYM_DATA_LOCAL(saved_cr0, .quad 0)
SYM_DATA_LOCAL(saved_cr3, .quad 0)
SYM_DATA_LOCAL(saved_cr4, .quad 0)
/* other data */
-SYM_DATA_LOCAL(va_control_page, .quad 0)
-SYM_DATA_LOCAL(pa_table_page, .quad 0)
-SYM_DATA_LOCAL(pa_swap_page, .quad 0)
+SYM_DATA(kexec_va_control_page, .quad 0)
+SYM_DATA(kexec_pa_table_page, .quad 0)
+SYM_DATA(kexec_pa_swap_page, .quad 0)
SYM_DATA_LOCAL(pa_backup_pages_map, .quad 0)
.section .text.relocate_kernel,"ax";
@@ -46,7 +46,7 @@ SYM_CODE_START_NOALIGN(relocate_kernel)
ANNOTATE_NOENDBR
/*
* %rdi indirection_page
- * %rsi page_list
+ * %rsi pa_control_page
* %rdx start address
* %rcx preserve_context
* %r8 host_mem_enc_active
@@ -79,31 +79,19 @@ SYM_CODE_START_NOALIGN(relocate_kernel)
/* Save SME active flag */
movq %r8, %r12
- /*
- * get physical and virtual address of control page now
- * this is impossible after page table switch
- */
- movq PTR(PA_CONTROL_PAGE)(%rsi), %r8
- movq PTR(VA_CONTROL_PAGE)(%rsi), %r11
-
- /* get physical address of page table now too */
- movq PTR(PA_TABLE_PAGE)(%rsi), %r9
-
- /* get physical address of swap page now */
- movq PTR(PA_SWAP_PAGE)(%rsi), %r10
-
- /* save some information for jumping back */
- movq %r9, pa_table_page(%rip)
- movq %r10, pa_swap_page(%rip)
+ /* save indirection list for jumping back */
movq %rdi, pa_backup_pages_map(%rip)
- movq %r11, va_control_page(%rip)
/* Save the preserve_context to %r11 as swap_pages clobbers %rcx. */
movq %rcx, %r11
/* Switch to the identity mapped page tables */
+ movq kexec_pa_table_page(%rip), %r9
movq %r9, %cr3
+ /* Physical address of control page */
+ movq %rsi, %r8
+
/* setup a new stack at the end of the physical control page */
lea PAGE_SIZE(%r8), %rsp
@@ -227,13 +215,13 @@ SYM_CODE_START_LOCAL_NOALIGN(identity_mapped)
/* get the re-entry point of the peer system */
movq 0(%rsp), %rbp
leaq relocate_kernel(%rip), %r8
- movq pa_swap_page(%rip), %r10
+ movq kexec_pa_swap_page(%rip), %r10
movq pa_backup_pages_map(%rip), %rdi
- movq pa_table_page(%rip), %rax
+ movq kexec_pa_table_page(%rip), %rax
movq %rax, %cr3
lea PAGE_SIZE(%r8), %rsp
call swap_pages
- movq va_control_page(%rip), %rax
+ movq kexec_va_control_page(%rip), %rax
addq $(virtual_mapped - relocate_kernel), %rax
pushq %rax
ANNOTATE_UNRET_SAFE
--
2.47.0
* [RFC PATCH v2 10/16] x86/kexec: Eliminate writes through kernel mapping of relocate_kernel page
2024-11-22 22:38 [RFC PATCH v2 0/16] x86/kexec: Add exception handling for relocate_kernel and further yak-shaving David Woodhouse
` (8 preceding siblings ...)
2024-11-22 22:38 ` [RFC PATCH v2 09/16] x86/kexec: Drop page_list argument from relocate_kernel() David Woodhouse
@ 2024-11-22 22:38 ` David Woodhouse
2024-11-22 22:38 ` [RFC PATCH v2 11/16] x86/kexec: Clean up register usage in relocate_kernel() David Woodhouse
` (5 subsequent siblings)
15 siblings, 0 replies; 21+ messages in thread
From: David Woodhouse @ 2024-11-22 22:38 UTC (permalink / raw)
To: kexec
Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
H. Peter Anvin, David Woodhouse, Kirill A. Shutemov, Kai Huang,
Nikolay Borisov, linux-kernel, Simon Horman, Dave Young,
Peter Zijlstra, jpoimboe
From: David Woodhouse <dwmw@amazon.co.uk>
All writes to the relocate_kernel control page are now done *after* the
%cr3 switch via simple %rip-relative addressing, which means the DATA()
macro with its pointer arithmetic can also now be removed.
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
---
arch/x86/kernel/relocate_kernel_64.S | 29 ++++++++++++++--------------
1 file changed, 14 insertions(+), 15 deletions(-)
diff --git a/arch/x86/kernel/relocate_kernel_64.S b/arch/x86/kernel/relocate_kernel_64.S
index b9ad3ef0b982..5c6456467f08 100644
--- a/arch/x86/kernel/relocate_kernel_64.S
+++ b/arch/x86/kernel/relocate_kernel_64.S
@@ -61,21 +61,24 @@ SYM_CODE_START_NOALIGN(relocate_kernel)
pushq %r15
pushf
- movq %rsp, saved_rsp(%rip)
- movq %cr0, %rax
- movq %rax, saved_cr0(%rip)
- movq %cr3, %rax
- movq %rax, saved_cr3(%rip)
- movq %cr4, %rax
- movq %rax, saved_cr4(%rip)
-
- /* Save CR4. Required to enable the right paging mode later. */
- movq %rax, %r13
-
/* zero out flags, and disable interrupts */
pushq $0
popfq
+ /* Switch to the identity mapped page tables */
+ movq %cr3, %rax
+ movq kexec_pa_table_page(%rip), %r9
+ movq %r9, %cr3
+
+ /* Save %rsp and CRs. */
+ movq %rsp, saved_rsp(%rip)
+ movq %rax, saved_cr3(%rip)
+ movq %cr0, %rax
+ movq %rax, saved_cr0(%rip)
+ /* Leave CR4 in %r13 to enable the right paging mode later. */
+ movq %cr4, %r13
+ movq %r13, saved_cr4(%rip)
+
/* Save SME active flag */
movq %r8, %r12
@@ -85,10 +88,6 @@ SYM_CODE_START_NOALIGN(relocate_kernel)
/* Save the preserve_context to %r11 as swap_pages clobbers %rcx. */
movq %rcx, %r11
- /* Switch to the identity mapped page tables */
- movq kexec_pa_table_page(%rip), %r9
- movq %r9, %cr3
-
/* Physical address of control page */
movq %rsi, %r8
--
2.47.0
* [RFC PATCH v2 11/16] x86/kexec: Clean up register usage in relocate_kernel()
2024-11-22 22:38 [RFC PATCH v2 0/16] x86/kexec: Add exception handling for relocate_kernel and further yak-shaving David Woodhouse
` (9 preceding siblings ...)
2024-11-22 22:38 ` [RFC PATCH v2 10/16] x86/kexec: Eliminate writes through kernel mapping of relocate_kernel page David Woodhouse
@ 2024-11-22 22:38 ` David Woodhouse
2024-11-22 22:38 ` [RFC PATCH v2 12/16] x86/kexec: Mark relocate_kernel page as ROX instead of RWX David Woodhouse
` (4 subsequent siblings)
15 siblings, 0 replies; 21+ messages in thread
From: David Woodhouse @ 2024-11-22 22:38 UTC (permalink / raw)
To: kexec
Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
H. Peter Anvin, David Woodhouse, Kirill A. Shutemov, Kai Huang,
Nikolay Borisov, linux-kernel, Simon Horman, Dave Young,
Peter Zijlstra, jpoimboe
From: David Woodhouse <dwmw@amazon.co.uk>
The memory encryption flag is passed in %r8 because that's where the
calling convention puts it. Instead of moving it to %r12 and then using
%r8 for other things, just leave it in %r8 and use other registers
instead.
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
---
arch/x86/kernel/relocate_kernel_64.S | 17 ++++++-----------
1 file changed, 6 insertions(+), 11 deletions(-)
diff --git a/arch/x86/kernel/relocate_kernel_64.S b/arch/x86/kernel/relocate_kernel_64.S
index 5c6456467f08..51dc55ac4395 100644
--- a/arch/x86/kernel/relocate_kernel_64.S
+++ b/arch/x86/kernel/relocate_kernel_64.S
@@ -79,24 +79,18 @@ SYM_CODE_START_NOALIGN(relocate_kernel)
movq %cr4, %r13
movq %r13, saved_cr4(%rip)
- /* Save SME active flag */
- movq %r8, %r12
-
/* save indirection list for jumping back */
movq %rdi, pa_backup_pages_map(%rip)
/* Save the preserve_context to %r11 as swap_pages clobbers %rcx. */
movq %rcx, %r11
- /* Physical address of control page */
- movq %rsi, %r8
-
/* setup a new stack at the end of the physical control page */
- lea PAGE_SIZE(%r8), %rsp
+ lea PAGE_SIZE(%rsi), %rsp
/* jump to identity mapped page */
- addq $(identity_mapped - relocate_kernel), %r8
- pushq %r8
+ addq $(identity_mapped - relocate_kernel), %rsi
+ pushq %rsi
ANNOTATE_UNRET_SAFE
ret
int3
@@ -107,8 +101,9 @@ SYM_CODE_START_LOCAL_NOALIGN(identity_mapped)
/*
* %rdi indirection page
* %rdx start address
+ * %r8 host_mem_enc_active
+ * %r9 page table page
* %r11 preserve_context
- * %r12 host_mem_enc_active
* %r13 original CR4 when relocate_kernel() was invoked
*/
@@ -161,7 +156,7 @@ SYM_CODE_START_LOCAL_NOALIGN(identity_mapped)
* entries that will conflict with the now unencrypted memory
* used by kexec. Flush the caches before copying the kernel.
*/
- testq %r12, %r12
+ testq %r8, %r8
jz .Lsme_off
wbinvd
.Lsme_off:
--
2.47.0
* [RFC PATCH v2 12/16] x86/kexec: Mark relocate_kernel page as ROX instead of RWX
2024-11-22 22:38 [RFC PATCH v2 0/16] x86/kexec: Add exception handling for relocate_kernel and further yak-shaving David Woodhouse
` (10 preceding siblings ...)
2024-11-22 22:38 ` [RFC PATCH v2 11/16] x86/kexec: Clean up register usage in relocate_kernel() David Woodhouse
@ 2024-11-22 22:38 ` David Woodhouse
2024-11-22 22:38 ` [RFC PATCH v2 13/16] x86/kexec: Debugging support: load a GDT David Woodhouse
` (3 subsequent siblings)
15 siblings, 0 replies; 21+ messages in thread
From: David Woodhouse @ 2024-11-22 22:38 UTC (permalink / raw)
To: kexec
Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
H. Peter Anvin, David Woodhouse, Kirill A. Shutemov, Kai Huang,
Nikolay Borisov, linux-kernel, Simon Horman, Dave Young,
Peter Zijlstra, jpoimboe
From: David Woodhouse <dwmw@amazon.co.uk>
All writes to the page now happen before it gets marked as executable
(or after it's already switched to the identmap page tables where it's
OK to be RWX).
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
---
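For readers less familiar with the write-then-seal pattern, here is a userspace analogy, assuming a Linux system with mmap() and mprotect(). It is only a sketch: the kernel uses set_memory_rox() and friends on its own mapping, and the helper names install_rox and remove_rox are made up for this example.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Copy `len` bytes into a fresh page while it is still writable, then
 * switch the page to read+execute. Returns the mapping, or NULL on failure.
 */
static void *install_rox(const void *src, size_t len)
{
	long page_size = sysconf(_SC_PAGESIZE);
	void *page;

	if (page_size <= 0 || len > (size_t)page_size)
		return NULL;

	page = mmap(NULL, (size_t)page_size, PROT_READ | PROT_WRITE,
		    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (page == MAP_FAILED)
		return NULL;

	memcpy(page, src, len);	/* all writes happen here ... */

	/* ... and only afterwards does the page become executable. */
	if (mprotect(page, (size_t)page_size, PROT_READ | PROT_EXEC) != 0) {
		munmap(page, (size_t)page_size);
		return NULL;
	}
	return page;
}

/* Undo install_rox(): make the page writable and non-executable, then unmap. */
static int remove_rox(void *page)
{
	long page_size = sysconf(_SC_PAGESIZE);

	assert(page != NULL);
	if (mprotect(page, (size_t)page_size, PROT_READ | PROT_WRITE) != 0)
		return -1;
	return munmap(page, (size_t)page_size);
}
```

The teardown mirrors machine_kexec_cleanup() in spirit: the page goes back to writable and non-executable before it is released.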
arch/x86/kernel/machine_kexec_64.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/arch/x86/kernel/machine_kexec_64.c b/arch/x86/kernel/machine_kexec_64.c
index c653c2c22d63..2a294daeeb1a 100644
--- a/arch/x86/kernel/machine_kexec_64.c
+++ b/arch/x86/kernel/machine_kexec_64.c
@@ -319,7 +319,7 @@ int machine_kexec_prepare(struct kimage *image)
__memcpy(control_page, __relocate_kernel_start, reloc_end - reloc_start);
- set_memory_x((unsigned long)control_page, 1);
+ set_memory_rox((unsigned long)control_page, 1);
return 0;
}
@@ -329,6 +329,7 @@ void machine_kexec_cleanup(struct kimage *image)
void *control_page = page_address(image->control_code_page) + PAGE_SIZE;
set_memory_nx((unsigned long)control_page, 1);
+ set_memory_rw((unsigned long)control_page, 1);
free_transition_pgtable(image);
}
--
2.47.0
* [RFC PATCH v2 13/16] x86/kexec: Debugging support: load a GDT
2024-11-22 22:38 [RFC PATCH v2 0/16] x86/kexec: Add exception handling for relocate_kernel and further yak-shaving David Woodhouse
` (11 preceding siblings ...)
2024-11-22 22:38 ` [RFC PATCH v2 12/16] x86/kexec: Mark relocate_kernel page as ROX instead of RWX David Woodhouse
@ 2024-11-22 22:38 ` David Woodhouse
2024-11-22 22:38 ` [RFC PATCH v2 14/16] x86/kexec: Debugging support: Load an IDT and basic exception entry points David Woodhouse
` (2 subsequent siblings)
15 siblings, 0 replies; 21+ messages in thread
From: David Woodhouse @ 2024-11-22 22:38 UTC (permalink / raw)
To: kexec
Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
H. Peter Anvin, David Woodhouse, Kirill A. Shutemov, Kai Huang,
Nikolay Borisov, linux-kernel, Simon Horman, Dave Young,
Peter Zijlstra, jpoimboe
From: David Woodhouse <dwmw@amazon.co.uk>
There are some failure modes which lead to triple-faults in the
relocate_kernel function, which are pretty much undebuggable for normal
mortals.
Adding a GDT in the relocate_kernel environment is step 1 towards being
able to catch faults and do something more useful.
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
---
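To make the three magic descriptor values in the GDT below easier to check, here is a small decoder sketch in plain C. It assumes the standard x86 8-byte segment descriptor layout; the struct and function names (seg_desc, decode_desc, gdtr) are illustrative and do not appear in the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Fields of an 8-byte segment descriptor, unpacked. */
struct seg_desc {
	uint32_t base;
	uint32_t limit;		/* 20-bit limit, unit chosen by `granular` */
	uint8_t access;		/* P, DPL, S and type bits */
	bool executable;	/* type bit 3: code segment */
	bool long_mode;		/* L flag: 64-bit code segment */
	bool default_32;	/* D/B flag: 32-bit default operand size */
	bool granular;		/* G flag: limit counted in 4 KiB pages */
};

static struct seg_desc decode_desc(uint64_t raw)
{
	struct seg_desc d;

	d.limit = (uint32_t)(raw & 0xffff) |
		  (uint32_t)((raw >> 48) & 0xf) << 16;
	d.base = (uint32_t)((raw >> 16) & 0xffffff) |
		 (uint32_t)((raw >> 56) & 0xff) << 24;
	d.access = (uint8_t)(raw >> 40);
	d.executable = (d.access & 0x08) != 0;
	d.long_mode = ((raw >> 53) & 1) != 0;
	d.default_32 = ((raw >> 54) & 1) != 0;
	d.granular = ((raw >> 55) & 1) != 0;

	/* L and D/B both set is a reserved combination; reject it. */
	assert(!(d.long_mode && d.default_32));
	return d;
}

/* In-memory operand of lgdt in 64-bit mode: 2-byte limit, 8-byte base. */
struct __attribute__((packed)) gdtr {
	uint16_t limit;
	uint64_t base;
};
```

Decoding 0x00af9a000000ffff gives access byte 0x9a with L set and D/B clear (64-bit code), 0x00cf9a000000ffff gives the same access byte with D/B set instead (32-bit code), and 0x00cf92000000ffff gives access byte 0x92 (writable data). The packed struct is 10 bytes, which is why the assembly adds 10 to %rsp after lgdt.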
arch/x86/kernel/relocate_kernel_64.S | 27 +++++++++++++++++++++++++++
1 file changed, 27 insertions(+)
diff --git a/arch/x86/kernel/relocate_kernel_64.S b/arch/x86/kernel/relocate_kernel_64.S
index 51dc55ac4395..5c174829f794 100644
--- a/arch/x86/kernel/relocate_kernel_64.S
+++ b/arch/x86/kernel/relocate_kernel_64.S
@@ -39,6 +39,18 @@ SYM_DATA(kexec_pa_table_page, .quad 0)
SYM_DATA(kexec_pa_swap_page, .quad 0)
SYM_DATA_LOCAL(pa_backup_pages_map, .quad 0)
+#ifdef DEBUG
+SYM_DATA_START_LOCAL(reloc_kernel_gdt)
+ .balign 16
+ .word reloc_kernel_gdt_end - reloc_kernel_gdt - 1
+ .long 0
+ .word 0
+ .quad 0x00cf9a000000ffff /* __KERNEL32_CS */
+ .quad 0x00af9a000000ffff /* __KERNEL_CS */
+ .quad 0x00cf92000000ffff /* __KERNEL_DS */
+SYM_DATA_END_LABEL(reloc_kernel_gdt, SYM_L_LOCAL, reloc_kernel_gdt_end)
+#endif /* DEBUG */
+
.section .text.relocate_kernel,"ax";
.code64
SYM_CODE_START_NOALIGN(relocate_kernel)
@@ -112,6 +124,21 @@ SYM_CODE_START_LOCAL_NOALIGN(identity_mapped)
/* store the start address on the stack */
pushq %rdx
+#ifdef DEBUG
+ /* Create a GDTR (16 bits limit, 64 bits addr) on stack */
+ leaq reloc_kernel_gdt(%rip), %rax
+ pushq %rax
+ pushw (%rax)
+
+ /* Load the GDT, put the stack back */
+ lgdt (%rsp)
+ addq $10, %rsp
+
+ /* Test that we can load segments */
+ movq %ds, %rax
+ movq %rax, %ds
+#endif /* DEBUG */
+
/*
* Clear X86_CR4_CET (if it was set) such that we can clear CR0_WP
* below.
--
2.47.0
* [RFC PATCH v2 14/16] x86/kexec: Debugging support: Load an IDT and basic exception entry points
2024-11-22 22:38 [RFC PATCH v2 0/16] x86/kexec: Add exception handling for relocate_kernel and further yak-shaving David Woodhouse
` (12 preceding siblings ...)
2024-11-22 22:38 ` [RFC PATCH v2 13/16] x86/kexec: Debugging support: load a GDT David Woodhouse
@ 2024-11-22 22:38 ` David Woodhouse
2024-11-22 22:38 ` [RFC PATCH v2 15/16] x86/kexec: Debugging support: Dump registers on exception David Woodhouse
2024-11-22 22:38 ` [RFC PATCH v2 16/16] [DO NOT MERGE] x86/kexec: enable DEBUG David Woodhouse
15 siblings, 0 replies; 21+ messages in thread
From: David Woodhouse @ 2024-11-22 22:38 UTC (permalink / raw)
To: kexec
Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
H. Peter Anvin, David Woodhouse, Kirill A. Shutemov, Kai Huang,
Nikolay Borisov, linux-kernel, Simon Horman, Dave Young,
Peter Zijlstra, jpoimboe
From: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
---
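The shift-and-mask sequence in identity_mapped is easier to follow next to a C equivalent. This is a sketch under the assumption of the standard 16-byte long-mode gate layout; idt_gate, make_gate and fill_idt are hypothetical names, and unlike the assembly (which adds EXC_HANDLER_SIZE to the already packed low quadword) it rebuilds each gate from the full handler address.

```c
#include <assert.h>
#include <stdint.h>

#define EXC_HANDLER_SIZE 6	/* bytes per entry stub, as in the patch */
#define NR_VECTORS	 16
#define KERNEL_CS	 0x10	/* selector of the third reloc_kernel_gdt slot */
#define GATE_ATTR	 0x8f	/* type/attribute byte hard-coded in the patch */

/* One 16-byte long-mode IDT entry, as two quadwords. */
struct idt_gate {
	uint64_t lo;	/* offset 15:0, selector, IST, attributes, offset 31:16 */
	uint64_t hi;	/* offset 63:32; upper 32 bits reserved */
};

static struct idt_gate make_gate(uint64_t handler)
{
	struct idt_gate g;

	g.lo = (handler & 0xffff) |
	       ((uint64_t)KERNEL_CS << 16) |
	       ((uint64_t)GATE_ATTR << 40) |
	       (((handler >> 16) & 0xffff) << 48);
	g.hi = handler >> 32;
	return g;
}

/* Fill vectors 0-15, each pointing at its own fixed-size stub. */
static void fill_idt(struct idt_gate idt[NR_VECTORS], uint64_t exc_vectors)
{
	assert(idt != 0);
	for (int i = 0; i < NR_VECTORS; i++)
		idt[i] = make_gate(exc_vectors +
				   (uint64_t)i * EXC_HANDLER_SIZE);
}
```

For the example address 0x1234567890abcdef used in the assembly comments, make_gate() yields lo = 0x90ab8f000010cdef and hi = 0x12345678. Sixteen such gates occupy 0x100 bytes, which matches the .balign 0x100 and the 0xff limit pushed for lidt.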
arch/x86/kernel/relocate_kernel_64.S | 114 +++++++++++++++++++++++++++
1 file changed, 114 insertions(+)
diff --git a/arch/x86/kernel/relocate_kernel_64.S b/arch/x86/kernel/relocate_kernel_64.S
index 5c174829f794..4ace2577afc6 100644
--- a/arch/x86/kernel/relocate_kernel_64.S
+++ b/arch/x86/kernel/relocate_kernel_64.S
@@ -40,6 +40,9 @@ SYM_DATA(kexec_pa_swap_page, .quad 0)
SYM_DATA_LOCAL(pa_backup_pages_map, .quad 0)
#ifdef DEBUG
+ /* Size of each exception handler referenced by the IDT */
+#define EXC_HANDLER_SIZE 6 /* pushi, pushi, 2-byte jmp */
+
SYM_DATA_START_LOCAL(reloc_kernel_gdt)
.balign 16
.word reloc_kernel_gdt_end - reloc_kernel_gdt - 1
@@ -108,6 +111,11 @@ SYM_CODE_START_NOALIGN(relocate_kernel)
int3
SYM_CODE_END(relocate_kernel)
+#ifdef DEBUG
+ UNWIND_HINT_UNDEFINED
+ .balign 0x100 /* relocate_kernel will be overwritten with an IDT */
+#endif
+
SYM_CODE_START_LOCAL_NOALIGN(identity_mapped)
UNWIND_HINT_END_OF_STACK
/*
@@ -137,6 +145,52 @@ SYM_CODE_START_LOCAL_NOALIGN(identity_mapped)
/* Test that we can load segments */
movq %ds, %rax
movq %rax, %ds
+
+ /* Load address of relocate_kernel, at start of this page, into %rsi */
+ lea relocate_kernel(%rip), %rsi
+
+ /*
+ * Build an IDT descriptor in %rax/%rbx. The address is in the low 16
+ * and high 16 bits of %rax, and low 32 of %rbx. The middle 32 bits
+ * of %rax hold the selector/ist/flags which are hard-coded below.
+ */
+ movq %rsi, %rax // 1234567890abcdef
+
+ andq $-0xFFFF, %rax // 1234567890ab....
+ shlq $16, %rax // 567890ab........
+
+ movq $0x8F000010, %rcx // Present, DPL0, Trap Gate, __KERNEL_CS.
+ orq %rcx, %rax // 567890ab8F000010
+ shlq $16, %rax // 90ab8F000010....
+
+ movq %rsi, %rcx
+ andq $0xffff, %rcx // ............cdef
+ orq %rcx, %rax // 90ab8F000010cdef
+
+ movq %rsi, %rbx
+ shrq $32, %rbx
+
+ /*
+ * The descriptor was built using the address of relocate_kernel. Add
+ * the required offset to point to the actual entry points.
+ */
+ addq $(exc_vectors - relocate_kernel), %rax
+
+ /* Loop 16 times to handle exceptions 0-15 */
+ movq $16, %rcx
+1:
+ movq %rax, (%rsi)
+ movq %rbx, 8(%rsi)
+ addq $16, %rsi
+ addq $EXC_HANDLER_SIZE, %rax
+ loop 1b
+
+ /* Now put an IDTR on the stack (temporarily) to load it */
+ subq $0x100, %rsi
+ pushq %rsi
+ pushw $0xff
+ lidt (%rsp)
+ addq $10, %rsp
#endif /* DEBUG */
/*
@@ -345,3 +399,63 @@ SYM_CODE_START_LOCAL_NOALIGN(swap_pages)
int3
SYM_CODE_END(swap_pages)
+#ifdef DEBUG
+SYM_CODE_START_LOCAL_NOALIGN(exc_vectors)
+ /* Each of these is 6 bytes. */
+.macro vec_err exc
+ UNWIND_HINT_ENTRY
+ . = exc_vectors + (\exc * EXC_HANDLER_SIZE)
+ nop
+ nop
+ pushq $\exc
+ jmp exc_handler
+.endm
+
+.macro vec_noerr exc
+ UNWIND_HINT_ENTRY
+ . = exc_vectors + (\exc * EXC_HANDLER_SIZE)
+ pushq $0
+ pushq $\exc
+ jmp exc_handler
+.endm
+
+ vec_noerr 0 // #DE
+ vec_noerr 1 // #DB
+ vec_noerr 2 // #NMI
+ vec_noerr 3 // #BP
+ vec_noerr 4 // #OF
+ vec_noerr 5 // #BR
+ vec_noerr 6 // #UD
+ vec_noerr 7 // #NM
+ vec_err 8 // #DF
+ vec_noerr 9
+ vec_err 10 // #TS
+ vec_err 11 // #NP
+ vec_err 12 // #SS
+ vec_err 13 // #GP
+ vec_err 14 // #PF
+ vec_noerr 15
+SYM_CODE_END(exc_vectors)
+
+SYM_CODE_START_LOCAL_NOALIGN(exc_handler)
+ pushq %rax
+ pushq %rdx
+ movw $0x3f8, %dx
+ movb $'A', %al
+ outb %al, %dx
+ popq %rdx
+ popq %rax
+
+ /* Only return from int3 */
+ cmpq $3, (%rsp)
+ jne .Ldie
+
+ addq $16, %rsp
+ iretq
+
+.Ldie:
+ hlt
+ jmp .Ldie
+
+SYM_CODE_END(exc_handler)
+#endif /* DEBUG */
--
2.47.0
* [RFC PATCH v2 15/16] x86/kexec: Debugging support: Dump registers on exception
2024-11-22 22:38 [RFC PATCH v2 0/16] x86/kexec: Add exception handling for relocate_kernel and further yak-shaving David Woodhouse
` (13 preceding siblings ...)
2024-11-22 22:38 ` [RFC PATCH v2 14/16] x86/kexec: Debugging support: Load an IDT and basic exception entry points David Woodhouse
@ 2024-11-22 22:38 ` David Woodhouse
2024-11-22 22:38 ` [RFC PATCH v2 16/16] [DO NOT MERGE] x86/kexec: enable DEBUG David Woodhouse
15 siblings, 0 replies; 21+ messages in thread
From: David Woodhouse @ 2024-11-22 22:38 UTC (permalink / raw)
To: kexec
Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
H. Peter Anvin, David Woodhouse, Kirill A. Shutemov, Kai Huang,
Nikolay Borisov, linux-kernel, Simon Horman, Dave Young,
Peter Zijlstra, jpoimboe
From: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
---
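For reference, the hex-printing logic of pr_nybble and pr_qword can be sketched in C as follows. This version writes into a buffer instead of a serial port so that it can be tried in userspace; nybble_to_hex and format_qword are illustrative names, not part of the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Map a value's low four bits to a lowercase hex digit, like pr_nybble. */
static char nybble_to_hex(uint8_t v)
{
	char c = (char)((v & 0x0f) + '0');

	if (c >= 0x3a)			/* past '9': skip ahead to 'a' */
		c += 'a' - '0' - 10;
	return c;
}

/*
 * Write the 16 hex digits of `v` into `out`, most significant first,
 * followed by a newline and a NUL. `out` must hold at least 18 bytes.
 */
static void format_qword(uint64_t v, char out[18])
{
	assert(out != 0);
	for (int i = 0; i < 16; i++) {
		v = (v << 4) | (v >> 60);	/* rotate left by 4, like rolq $4 */
		out[i] = nybble_to_hex((uint8_t)v);
	}
	out[16] = '\n';
	out[17] = '\0';
}
```

Rotating before each digit is what makes the most significant nybble come out first while only ever looking at the low four bits, and after 16 rotations the value is back where it started.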
arch/x86/kernel/relocate_kernel_64.S | 83 +++++++++++++++++++++++++++-
1 file changed, 80 insertions(+), 3 deletions(-)
diff --git a/arch/x86/kernel/relocate_kernel_64.S b/arch/x86/kernel/relocate_kernel_64.S
index 4ace2577afc6..67f6853c7abe 100644
--- a/arch/x86/kernel/relocate_kernel_64.S
+++ b/arch/x86/kernel/relocate_kernel_64.S
@@ -400,6 +400,55 @@ SYM_CODE_START_LOCAL_NOALIGN(swap_pages)
SYM_CODE_END(swap_pages)
#ifdef DEBUG
+/*
+ * This allows other types of serial ports to be used.
+ * - %al: Character to be printed (must not clobber %rax)
+ * - %rdx: MMIO address or port.
+ */
+.macro pr_char
+ outb %al, %dx
+.endm
+
+/* Print the nybble in %bl, clobber %rax */
+SYM_CODE_START_LOCAL_NOALIGN(pr_nybble)
+ UNWIND_HINT_FUNC
+ movb %bl, %al
+ nop
+ andb $0x0f, %al
+ addb $0x30, %al
+ cmpb $0x3a, %al
+ jb 1f
+ addb $('a' - '0' - 10), %al
+1: pr_char
+ ANNOTATE_UNRET_SAFE
+ ret
+SYM_CODE_END(pr_nybble)
+
+SYM_CODE_START_LOCAL_NOALIGN(pr_qword)
+ UNWIND_HINT_FUNC
+ movq $16, %rcx
+1: rolq $4, %rbx
+ call pr_nybble
+ loop 1b
+ movb $'\n', %al
+ pr_char
+ ANNOTATE_UNRET_SAFE
+ ret
+SYM_CODE_END(pr_qword)
+
+.macro print_reg a, b, c, d, r
+ movb $\a, %al
+ pr_char
+ movb $\b, %al
+ pr_char
+ movb $\c, %al
+ pr_char
+ movb $\d, %al
+ pr_char
+ movq \r, %rbx
+ call pr_qword
+.endm
+
SYM_CODE_START_LOCAL_NOALIGN(exc_vectors)
/* Each of these is 6 bytes. */
.macro vec_err exc
@@ -439,11 +488,39 @@ SYM_CODE_END(exc_vectors)
SYM_CODE_START_LOCAL_NOALIGN(exc_handler)
pushq %rax
+ pushq %rbx
+ pushq %rcx
pushq %rdx
+
movw $0x3f8, %dx
- movb $'A', %al
- outb %al, %dx
+
+ /* rip and exception info */
+ print_reg 'E', 'x', 'c', ':', 32(%rsp)
+ print_reg 'E', 'r', 'r', ':', 40(%rsp)
+ print_reg 'r', 'i', 'p', ':', 48(%rsp)
+
+ /* We spilled these to the stack */
+ print_reg 'r', 'a', 'x', ':', 24(%rsp)
+ print_reg 'r', 'b', 'x', ':', 16(%rsp)
+ print_reg 'r', 'c', 'x', ':', 8(%rsp)
+ print_reg 'r', 'd', 'x', ':', (%rsp)
+
+ /* Other registers */
+ print_reg 'r', 's', 'i', ':', %rsi
+ print_reg 'r', 'd', 'i', ':', %rdi
+ print_reg 'r', '8', ' ', ':', %r8
+ print_reg 'r', '9', ' ', ':', %r9
+ print_reg 'r', '1', '0', ':', %r10
+ print_reg 'r', '1', '1', ':', %r11
+ print_reg 'r', '1', '2', ':', %r12
+ print_reg 'r', '1', '3', ':', %r13
+ print_reg 'r', '1', '4', ':', %r14
+ print_reg 'r', '1', '5', ':', %r15
+ print_reg 'c', 'r', '2', ':', %cr2
+
popq %rdx
+ popq %rcx
+ popq %rbx
popq %rax
/* Only return from int3 */
@@ -456,6 +533,6 @@ SYM_CODE_START_LOCAL_NOALIGN(exc_handler)
.Ldie:
hlt
jmp .Ldie
-
+ int3
SYM_CODE_END(exc_handler)
#endif /* DEBUG */
--
2.47.0
* [RFC PATCH v2 16/16] [DO NOT MERGE] x86/kexec: enable DEBUG
2024-11-22 22:38 [RFC PATCH v2 0/16] x86/kexec: Add exception handling for relocate_kernel and further yak-shaving David Woodhouse
` (14 preceding siblings ...)
2024-11-22 22:38 ` [RFC PATCH v2 15/16] x86/kexec: Debugging support: Dump registers on exception David Woodhouse
@ 2024-11-22 22:38 ` David Woodhouse
2024-11-25 9:21 ` Ingo Molnar
15 siblings, 1 reply; 21+ messages in thread
From: David Woodhouse @ 2024-11-22 22:38 UTC (permalink / raw)
To: kexec
Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
H. Peter Anvin, David Woodhouse, Kirill A. Shutemov, Kai Huang,
Nikolay Borisov, linux-kernel, Simon Horman, Dave Young,
Peter Zijlstra, jpoimboe
From: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
---
arch/x86/kernel/relocate_kernel_64.S | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/arch/x86/kernel/relocate_kernel_64.S b/arch/x86/kernel/relocate_kernel_64.S
index 67f6853c7abe..ebbd76c9a3e9 100644
--- a/arch/x86/kernel/relocate_kernel_64.S
+++ b/arch/x86/kernel/relocate_kernel_64.S
@@ -14,6 +14,8 @@
#include <asm/nospec-branch.h>
#include <asm/unwind_hints.h>
+#define DEBUG
+
/*
* Must be relocatable PIC code callable as a C function, in particular
* there must be a plain RET and not jump to return thunk.
@@ -191,6 +193,8 @@ SYM_CODE_START_LOCAL_NOALIGN(identity_mapped)
pushw $0xff
lidt (%rsp)
addq $10, %rsp
+
+ int3
#endif /* DEBUG */
/*
--
2.47.0
* Re: [RFC PATCH v2 16/16] [DO NOT MERGE] x86/kexec: enable DEBUG
2024-11-22 22:38 ` [RFC PATCH v2 16/16] [DO NOT MERGE] x86/kexec: enable DEBUG David Woodhouse
@ 2024-11-25 9:21 ` Ingo Molnar
2024-11-25 9:32 ` David Woodhouse
0 siblings, 1 reply; 21+ messages in thread
From: Ingo Molnar @ 2024-11-25 9:21 UTC (permalink / raw)
To: David Woodhouse
Cc: kexec, Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
x86, H. Peter Anvin, David Woodhouse, Kirill A. Shutemov,
Kai Huang, Nikolay Borisov, linux-kernel, Simon Horman,
Dave Young, Peter Zijlstra, jpoimboe
* David Woodhouse <dwmw2@infradead.org> wrote:
> From: David Woodhouse <dwmw@amazon.co.uk>
>
> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
> ---
> arch/x86/kernel/relocate_kernel_64.S | 4 ++++
> 1 file changed, 4 insertions(+)
>
> diff --git a/arch/x86/kernel/relocate_kernel_64.S b/arch/x86/kernel/relocate_kernel_64.S
> index 67f6853c7abe..ebbd76c9a3e9 100644
> --- a/arch/x86/kernel/relocate_kernel_64.S
> +++ b/arch/x86/kernel/relocate_kernel_64.S
> @@ -14,6 +14,8 @@
> #include <asm/nospec-branch.h>
> #include <asm/unwind_hints.h>
>
> +#define DEBUG
> +
> /*
> * Must be relocatable PIC code callable as a C function, in particular
> * there must be a plain RET and not jump to return thunk.
> @@ -191,6 +193,8 @@ SYM_CODE_START_LOCAL_NOALIGN(identity_mapped)
> pushw $0xff
> lidt (%rsp)
> addq $10, %rsp
> +
> + int3
> #endif /* DEBUG */
That's a really nice piece of debugging code written in assembly,
combined with the exception handling feature that generates debug
output to begin with. Epic effort. :-)
Just curious: did you write this code to debug the series, or was there
some original hair-tearing regression that motivated you? Is there an
upstream fix to marvel at and be horrified about in equal measure?
I'd argue that this debugging code probably needs a default-off Kconfig
option, even with the obvious hard-coded environmental limitations &
assumptions it has. Could be useful for very early debugging & would
preserve your effort without it bitrotting too obviously.
Thanks,
Ingo
* Re: [RFC PATCH v2 16/16] [DO NOT MERGE] x86/kexec: enable DEBUG
2024-11-25 9:21 ` Ingo Molnar
@ 2024-11-25 9:32 ` David Woodhouse
2024-11-25 20:34 ` Ingo Molnar
0 siblings, 1 reply; 21+ messages in thread
From: David Woodhouse @ 2024-11-25 9:32 UTC (permalink / raw)
To: Ingo Molnar
Cc: kexec, Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
x86, H. Peter Anvin, Kirill A. Shutemov, Kai Huang,
Nikolay Borisov, linux-kernel, Simon Horman, Dave Young,
Peter Zijlstra, jpoimboe
On Mon, 2024-11-25 at 10:21 +0100, Ingo Molnar wrote:
>
> * David Woodhouse <dwmw2@infradead.org> wrote:
>
> > From: David Woodhouse <dwmw@amazon.co.uk>
> >
> > Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
> > ---
> > arch/x86/kernel/relocate_kernel_64.S | 4 ++++
> > 1 file changed, 4 insertions(+)
> >
> > diff --git a/arch/x86/kernel/relocate_kernel_64.S b/arch/x86/kernel/relocate_kernel_64.S
> > index 67f6853c7abe..ebbd76c9a3e9 100644
> > --- a/arch/x86/kernel/relocate_kernel_64.S
> > +++ b/arch/x86/kernel/relocate_kernel_64.S
> > @@ -14,6 +14,8 @@
> > #include <asm/nospec-branch.h>
> > #include <asm/unwind_hints.h>
> >
> > +#define DEBUG
> > +
> > /*
> > * Must be relocatable PIC code callable as a C function, in particular
> > * there must be a plain RET and not jump to return thunk.
> > @@ -191,6 +193,8 @@ SYM_CODE_START_LOCAL_NOALIGN(identity_mapped)
> > pushw $0xff
> > lidt (%rsp)
> > addq $10, %rsp
> > +
> > + int3
> > #endif /* DEBUG */
>
> That's a really nice piece of debugging code written in assembly,
> combined with the exception handling feature that generates debug
> output to begin with. Epic effort. :-)
Thanks :)
> Just curious: did you write this code to debug the series, or was there
> some original hair-tearing regression that motivated you? Is there an
> upstream fix to marvel at and be horrified about in equal measure?
https://lore.kernel.org/all/2ab14f6f-2690-056b-cf9e-38a12dafd728@amd.com/t/#u
is the upstream fix. It's all the more horrifying because it was
already *fixed* upstream before I lost weeks of my life to chasing it.
And the trigger which actually made it *happen*, and made our
production systems allocate memory within that dangerous 1MiB region
adjacent to the RMP table, was a tweak to the NMI watchdog period...
leading to an assumption that we were getting stray perf NMIs during
the kexec, and a *long* wild goose chase based on that false
assumption...
Once I'd written the debug code, I just wanted to clean it up a bit and
push it out for the benefit of others; that *was* the main point of
this series. All the rest of the cleanups are just yak shaving.
The realisation that we never even explicitly mapped the control code
page and always just got lucky because it happened to be in the same
2MiB or 1GiB superpage as something else that we did map... was just a
bonus :)
(That one is fixed in v3 which I'll post shortly, and is already in
https://git.infradead.org/users/dwmw2/linux.git/shortlog/refs/heads/kexec-debug
)
> I'd argue that this debugging code probably needs a default-off Kconfig
> option, even with the obvious hard-coded environmental limitations &
> assumptions it has. Could be useful for very early debugging & would
> preserve your effort without it bitrotting too obviously.
Yeah. In v3 I've made it a config option, and made it use the
early_printk serial console (as long as that's an I/O based 8250; we
can add others too later).
* Re: [RFC PATCH v2 16/16] [DO NOT MERGE] x86/kexec: enable DEBUG
2024-11-25 9:32 ` David Woodhouse
@ 2024-11-25 20:34 ` Ingo Molnar
2024-11-25 20:46 ` David Woodhouse
0 siblings, 1 reply; 21+ messages in thread
From: Ingo Molnar @ 2024-11-25 20:34 UTC (permalink / raw)
To: David Woodhouse
Cc: kexec, Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
x86, H. Peter Anvin, Kirill A. Shutemov, Kai Huang,
Nikolay Borisov, linux-kernel, Simon Horman, Dave Young,
Peter Zijlstra, jpoimboe
* David Woodhouse <dwmw2@infradead.org> wrote:
> > Just curious: did you write this code to debug the series, or was
> > there some original hair-tearing regression that motivated you? Is
> > there an upstream fix to marvel at and be horrified about in
> > equal measure?
>
> https://lore.kernel.org/all/2ab14f6f-2690-056b-cf9e-38a12dafd728@amd.com/t/#u
> is the upstream fix.
Which ended up being the following upstream commit:
88a921aa3c6b ("x86/sev: Ensure that RMP table fixups are reserved")
Might make sense to add this commit reference to one of the central
patches of the GDT/IDT code, to document how this feature is able to
pin down very hard to debug regressions. (Even if the upstream fix was
done independently in probably luckier circumstances.)
> [...] It's all the more horrifying because it was already *fixed*
> upstream before I lost weeks of my life to chasing it. And the
> trigger which actually made it *happen*, and made our production
> systems allocate memory within that dangerous 1MiB region adjacent to
> the RMP table, was a tweak to the NMI watchdog period... leading to
> an assumption that we were getting stray perf NMIs during the kexec,
> and a *long* wild goose chase based on that false assumption...
:-/
> Once I'd written the debug code, I just wanted to clean it up a bit
> and push it out for the benefit of others; that *was* the main point
> of this series. All the rest of the cleanups are just yak shaving.
>
> The realisation that we never even explicitly mapped the control code
> page and always just got lucky because it happened to be in the same
> 2MiB or 1GiB superpage as something else that we did map... was just
> a bonus :)
I'm amazed and horrified in equal measure ;-)
> (That one is fixed in v3 which I'll post shortly, and is already in
> https://git.infradead.org/users/dwmw2/linux.git/shortlog/refs/heads/kexec-debug
> )
>
> > I'd argue that this debugging code probably needs a default-off Kconfig
> > option, even with the obvious hard-coded environmental limitations &
> > assumptions it has. Could be useful for very early debugging & would
> > preserve your effort without it bitrotting too obviously.
>
> Yeah. In v3 I've made it a config option, and made it use the
> early_printk serial console (as long as that's an I/O based 8250; we
> can add others too later).
That's lovely!
Thanks,
Ingo
* Re: [RFC PATCH v2 16/16] [DO NOT MERGE] x86/kexec: enable DEBUG
2024-11-25 20:34 ` Ingo Molnar
@ 2024-11-25 20:46 ` David Woodhouse
0 siblings, 0 replies; 21+ messages in thread
From: David Woodhouse @ 2024-11-25 20:46 UTC (permalink / raw)
To: Ingo Molnar
Cc: kexec, Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
x86, H. Peter Anvin, Kirill A. Shutemov, Kai Huang,
Nikolay Borisov, linux-kernel, Simon Horman, Dave Young,
Peter Zijlstra, jpoimboe
On Mon, 2024-11-25 at 21:34 +0100, Ingo Molnar wrote:
>
> > The realisation that we never even explicitly mapped the control code
> > page and always just got lucky because it happened to be in the same
> > 2MiB or 1GiB superpage as something else that we did map... was just
> > a bonus :)
>
> I'm amazed and horrified in equal measure ;-)
:)
The rest of today was dedicated to finding out that that isn't entirely
true. Mapping the control page explicitly was only helping because it
forced 2MiB mappings instead of a 1GiB mapping, and masked the fact
that PTI was causing the identmap code to scribble off the end of the
root PGD page...
It all just worked by pure fluke on x86_64 before, because x86_64 would
allocate an 8KiB control region and use the first half of it for the
PGD, and *then* copy the trampoline code into the second half, after
the identmap code had finished scribbling on it. So when I cleaned that
up to allocate the PGD separately and explicitly like i386 does, that's
why it exploded; not just due to allocation patterns.
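The layout described above can be modelled in userspace. This is only a rough sketch under stated assumptions, not kernel code: struct control_region, set_pgd_with_pti() and trampoline_clobbered() are hypothetical names, and the model assumes the PTI behaviour of mirroring lower-half (userspace) PGD entries into a shadow PGD located one page above the kernel PGD.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PAGE_SIZE    4096u
#define PTRS_PER_PGD 512u	/* 8-byte entries in one 4KiB page */

/* Hypothetical model of the old 8KiB x86_64 control region. */
struct control_region {
	uint64_t pgd[PTRS_PER_PGD];	/* first half: root page table */
	uint8_t trampoline[PAGE_SIZE];	/* second half: relocate_kernel copy */
};

/*
 * Model of a PGD write with PTI active (an assumption of this sketch):
 * entries in the lower half of the PGD are also written to a shadow
 * PGD at pgd + PAGE_SIZE, which here is the trampoline half.
 */
static void set_pgd_with_pti(struct control_region *cr, unsigned int idx,
			     uint64_t val)
{
	assert(idx < PTRS_PER_PGD);
	cr->pgd[idx] = val;
	if (idx < PTRS_PER_PGD / 2)
		memcpy(&cr->trampoline[idx * sizeof(val)], &val, sizeof(val));
}

/* Returns 1 if any trampoline byte no longer holds the fill value. */
static int trampoline_clobbered(const struct control_region *cr, uint8_t fill)
{
	for (size_t i = 0; i < PAGE_SIZE; i++)
		if (cr->trampoline[i] != fill)
			return 1;
	return 0;
}
```

In this model an identity-map entry (low index) lands in the second half of the region. That was harmless while the trampoline was copied in afterwards and overwrote the stray write; with a separately allocated 4KiB PGD the same write would fall on whatever page follows it.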
Still, I think I have a handle on fairly much everything that's broken,
except the occasional warning on the way back from
KEXEC_PRESERVE_CONTEXT thus:
[ 1.423464] ------------[ cut here ]------------
[ 1.423950] Interrupts enabled after irqrouter_resume+0x0/0x50
[ 1.424605] WARNING: CPU: 0 PID: 215 at drivers/base/syscore.c:103 syscore_resume+0x152/0x180
[ 1.425467] Modules linked in:
[ 1.425791] CPU: 0 UID: 0 PID: 215 Comm: kexec Not tainted 6.12.0-rc5+ #2015
[ 1.426498] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
[ 1.427628] RIP: 0010:syscore_resume+0x152/0x180
[ 1.428101] Code: 00 e9 e1 fe ff ff 80 3d b1 b8 c4 01 00 0f 85 21 ff ff ff 48 8b 73 18 48 c7 c7 32 b8 b6 ac c6 05 99 b8 c4 01 01 e8 9e 3f 55 ff <0f> 0b e9 03 ff ff ff 80 3d 87 b8 c4 01 00 0f 85 b8 fe ff ff 48 c7
[ 1.429913] RSP: 0018:ffffae9bc03bfd00 EFLAGS: 00010282
[ 1.430445] RAX: 0000000000000000 RBX: ffffffffad6fbb20 RCX: ffffffffad5636a8
[ 1.431153] RDX: 0000000000000000 RSI: 0000000000000003 RDI: 0000000000000001
[ 1.431869] RBP: 0000000028121969 R08: 0000000000000000 R09: 0000000000000000
[ 1.432594] R10: ffffae9bc03bfaa8 R11: 7075727265746e49 R12: ffffae9bc03bfd28
[ 1.433313] R13: ffffffffad471f60 R14: 00000000fee1dead R15: 0000000000000000
[ 1.434021] FS: 00007f77d4a45740(0000) GS:ffff91d0fd600000(0000) knlGS:0000000000000000
[ 1.434815] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1.435385] CR2: 00007f7e011e7070 CR3: 00000000012fe001 CR4: 0000000000170ef0
[ 1.436073] Call Trace:
[ 1.436334] <TASK>
[ 1.436558] ? syscore_resume+0x152/0x180
[ 1.436956] ? __warn.cold+0x93/0xfa
[ 1.437319] ? syscore_resume+0x152/0x180
[ 1.437717] ? report_bug+0xff/0x140
[ 1.438075] ? handle_bug+0x58/0x90
[ 1.438438] ? exc_invalid_op+0x17/0x70
[ 1.438826] ? asm_exc_invalid_op+0x1a/0x20
[ 1.439246] ? syscore_resume+0x152/0x180
[ 1.439644] kernel_kexec+0x10a/0x160
[ 1.440010] __do_sys_reboot+0x1fd/0x240
[ 1.440485] do_syscall_64+0x82/0x160
[ 1.440863] ? syscall_exit_to_user_mode+0x10/0x210
[ 1.441351] ? do_syscall_64+0x8e/0x160
[ 1.441735] ? exc_page_fault+0x7e/0x180
[ 1.442123] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 1.442623] RIP: 0033:0x7f77d4b5adb7
[ 1.442992] Code: c7 c0 ff ff ff ff eb be 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 89 fa be 69 19 12 28 bf ad de e1 fe b8 a9 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 01 c3 48 8b 15 49 50 0c 00 f7 d8 64 89 02 b8
[ 1.444757] RSP: 002b:00007ffc56bc30f8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a9
[ 1.445493] RAX: ffffffffffffffda RBX: 00007ffc56bc3260 RCX: 00007f77d4b5adb7
[ 1.446173] RDX: 0000000045584543 RSI: 0000000028121969 RDI: 00000000fee1dead
[ 1.446848] RBP: 00007ffc56bc32c0 R08: 000055e1cef3e010 R09: 0000000000000007
[ 1.447527] R10: 000055e1cef41020 R11: 0000000000000246 R12: 0000000000000001
[ 1.448219] R13: 000055e19046b896 R14: 000055e1cef3e4a0 R15: 0000000000000000
[ 1.448893] </TASK>
[ 1.449119] ---[ end trace 0000000000000000 ]---
[ 1.452539] Enabling non-boot CPUs ...
[ 1.452935] crash hp: kexec_trylock() failed, kdump image may be inaccurate
[ 1.453678] smpboot: Booting Node 0 Processor 1 APIC 0x1
[ 1.455531] CPU1 is up
[ 1.460031] virtio_blk virtio1: 2/0/0 default/read/poll queues
[ 1.465246] OOM killer enabled.
[ 1.465580] Restarting tasks ... done.