* [RFT PATCH v2 00/23] x86: strict separation of startup code
From: Ard Biesheuvel @ 2025-05-04 9:52 UTC
To: linux-kernel
Cc: linux-efi, x86, Ard Biesheuvel, Borislav Petkov, Ingo Molnar,
Dionna Amalie Glaze, Kevin Loughlin, Tom Lendacky
From: Ard Biesheuvel <ardb@kernel.org>
This RFT series implements a strict separation between startup code and
ordinary code, where startup code is built in a way that tolerates being
invoked from the initial 1:1 mapping of memory.
The existing approach of emitting this code into .head.text and
checking for absolute relocations in that section is not 100% safe, and
produces diagnostics that are sometimes difficult to interpret.
Instead, rely on symbol prefixes, similar to how this is implemented for
the EFI stub and for the startup code in the arm64 port. This ensures
that startup code can only call other startup code, unless a special
symbol alias is emitted that exposes a non-startup routine to the
startup code.
This is somewhat intrusive, as there are many data objects that are
referenced both by startup code and by ordinary code, and an alias needs
to be emitted for each of those. If startup code references anything
that has not been made available to it explicitly, a build time link
error will occur.
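For illustration, a rough sketch of how such an alias could be emitted,
assuming an arm64-style "__pi_" symbol prefix (the SYM_PIC_ALIAS()
helper introduced later in the series presumably wraps something along
these lines; the exact plumbing here is illustrative):

  /*
   * Startup objects are post-processed along the lines of
   *
   *   objcopy --prefix-symbols=__pi_
   *
   * so a startup-code call to memcpy() becomes a call to
   * __pi_memcpy(), which only links if ordinary code has opted the
   * symbol in with an explicit alias, e.g.:
   */
  extern void *__pi_memcpy(void *dst, const void *src, size_t len)
          __alias(memcpy);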
This ultimately allows the .head.text section to be dropped entirely, as
it no longer has any special significance. Instead, code that only
executes at boot is emitted into .init.text, as it should be.
Changes since RFC/v1:
- Include a major disentanglement/refactor of the SEV-SNP startup code,
so that only code that really needs to run from the 1:1 mapping is
included in the startup/ code
- Incorporate some early notes from Ingo
Build tested defconfig and allmodconfig
!!! Boot tested on non-SEV guest ONLY !!!
Again, I will need to lean on Tom to determine whether this breaks
SEV-SNP guest boot. As I mentioned before, I am still waiting for
SEV-SNP capable hardware to be delivered.
Cc: Borislav Petkov <bp@alien8.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Dionna Amalie Glaze <dionnaglaze@google.com>
Cc: Kevin Loughlin <kevinloughlin@google.com>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Ard Biesheuvel (23):
x86/boot: Move early_setup_idt() back into head64.c
x86/boot: Disregard __supported_pte_mask in __startup_64()
x86/boot: Drop global variables keeping track of LA57 state
x86/sev: Make sev_snp_enabled() a static function
x86/sev: Move instruction decoder into separate source file
x86/sev: Disentangle #VC handling code from startup code
x86/sev: Separate MSR and GHCB based snp_cpuid() via a callback
x86/sev: Fall back to early page state change code only during boot
x86/sev: Move GHCB page based HV communication out of startup code
x86/sev: Use boot SVSM CA for all startup and init code
x86/boot: Drop redundant RMPADJUST in SEV SVSM presence check
x86/sev: Unify SEV-SNP hypervisor feature check
x86/linkage: Add SYM_PIC_ALIAS() macro helper to emit symbol aliases
x86/boot: Add a bunch of PIC aliases
x86/boot: Provide __pti_set_user_pgtbl() to startup code
x86/sev: Provide PIC aliases for SEV related data objects
x86/sev: Move __sev_[get|put]_ghcb() into separate noinstr object
x86/sev: Export startup routines for ordinary use
x86/boot: Create a confined code area for startup code
x86/boot: Move startup code out of __head section
x86/boot: Disallow absolute symbol references in startup code
x86/boot: Revert "Reject absolute references in .head.text"
x86/boot: Get rid of the .head.text section
arch/x86/boot/compressed/Makefile | 6 +-
arch/x86/boot/compressed/misc.h | 12 +-
arch/x86/boot/compressed/pgtable_64.c | 12 -
arch/x86/boot/compressed/sev-handle-vc.c | 134 +++
arch/x86/boot/compressed/sev.c | 210 +---
arch/x86/boot/compressed/sev.h | 21 +-
arch/x86/boot/compressed/vmlinux.lds.S | 1 +
arch/x86/boot/startup/Makefile | 21 +
arch/x86/boot/startup/exports.h | 14 +
arch/x86/boot/startup/gdt_idt.c | 17 +-
arch/x86/boot/startup/map_kernel.c | 18 +-
arch/x86/boot/startup/sev-shared.c | 804 +-------------
arch/x86/boot/startup/sev-startup.c | 1169 +-------------------
arch/x86/boot/startup/sme.c | 45 +-
arch/x86/coco/core.c | 2 +
arch/x86/coco/sev/Makefile | 6 +-
arch/x86/coco/sev/core.c | 189 +++-
arch/x86/coco/sev/{sev-nmi.c => sev-noinstr.c} | 74 ++
arch/x86/coco/sev/vc-handle.c | 1060 ++++++++++++++++++
arch/x86/coco/sev/vc-shared.c | 614 ++++++++++
arch/x86/include/asm/init.h | 6 -
arch/x86/include/asm/linkage.h | 10 +
arch/x86/include/asm/pgtable_64_types.h | 43 +-
arch/x86/include/asm/setup.h | 2 +
arch/x86/include/asm/sev-internal.h | 30 +-
arch/x86/include/asm/sev.h | 78 ++
arch/x86/kernel/cpu/common.c | 3 +-
arch/x86/kernel/head64.c | 27 +-
arch/x86/kernel/head_32.S | 2 +-
arch/x86/kernel/head_64.S | 18 +-
arch/x86/kernel/setup.c | 1 +
arch/x86/kernel/vmlinux.lds.S | 11 +-
arch/x86/lib/memcpy_64.S | 1 +
arch/x86/lib/memset_64.S | 1 +
arch/x86/lib/retpoline.S | 2 +
arch/x86/mm/kasan_init_64.c | 3 -
arch/x86/mm/mem_encrypt_amd.c | 2 +
arch/x86/mm/mem_encrypt_boot.S | 6 +-
arch/x86/mm/pgtable.c | 1 +
arch/x86/platform/pvh/head.S | 2 +-
arch/x86/tools/relocs.c | 8 +-
tools/objtool/arch/x86/decode.c | 6 +-
42 files changed, 2367 insertions(+), 2325 deletions(-)
create mode 100644 arch/x86/boot/compressed/sev-handle-vc.c
create mode 100644 arch/x86/boot/startup/exports.h
rename arch/x86/coco/sev/{sev-nmi.c => sev-noinstr.c} (61%)
create mode 100644 arch/x86/coco/sev/vc-handle.c
create mode 100644 arch/x86/coco/sev/vc-shared.c
base-commit: 18ea89eae404d119ced26d80ac3e62255ce15409
--
2.49.0.906.g1f30a19c02-goog
* [RFT PATCH v2 01/23] x86/boot: Move early_setup_idt() back into head64.c
From: Ard Biesheuvel @ 2025-05-04 9:52 UTC
To: linux-kernel
Cc: linux-efi, x86, Ard Biesheuvel, Borislav Petkov, Ingo Molnar,
Dionna Amalie Glaze, Kevin Loughlin, Tom Lendacky
From: Ard Biesheuvel <ardb@kernel.org>
Move early_setup_idt() out of the startup code that is callable from
the 1:1 mapping - this is not needed - and instead, expose the
startup_64_load_idt() helper that does reside in __head directly. This
reduces the amount of code that needs special checks for 1:1 execution
suitability. In particular, it avoids dealing with the GHCB page (and
its physical address) in startup code, which runs from the 1:1 mapping,
making physical to virtual translations ambiguous.
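As a simplified illustration of that ambiguity (the real x86-64 __pa()
translation is more involved, but the principle is the same):

  /* Simplified virtual-to-physical translation for the kernel mapping */
  #define __pa_demo(x)  ((unsigned long)(x) - PAGE_OFFSET)

  /*
   * Via the kernel virtual mapping, &boot_ghcb_page is a virtual
   * address, and __pa_demo() yields its physical address. When the
   * same code runs from the 1:1 mapping, &boot_ghcb_page already
   * evaluates to the physical address, so subtracting PAGE_OFFSET
   * again produces garbage - the correct translation depends on where
   * the code happens to be executing.
   */
  unsigned long pa = __pa_demo(&boot_ghcb_page);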
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
---
arch/x86/boot/startup/gdt_idt.c | 15 +--------------
arch/x86/include/asm/setup.h | 1 +
arch/x86/kernel/head64.c | 12 ++++++++++++
3 files changed, 14 insertions(+), 14 deletions(-)
diff --git a/arch/x86/boot/startup/gdt_idt.c b/arch/x86/boot/startup/gdt_idt.c
index 7e34d0b426b1..a3112a69b06a 100644
--- a/arch/x86/boot/startup/gdt_idt.c
+++ b/arch/x86/boot/startup/gdt_idt.c
@@ -24,7 +24,7 @@
static gate_desc bringup_idt_table[NUM_EXCEPTION_VECTORS] __page_aligned_data;
/* This may run while still in the direct mapping */
-static void __head startup_64_load_idt(void *vc_handler)
+void __head startup_64_load_idt(void *vc_handler)
{
struct desc_ptr desc = {
.address = (unsigned long)rip_rel_ptr(bringup_idt_table),
@@ -43,19 +43,6 @@ static void __head startup_64_load_idt(void *vc_handler)
native_load_idt(&desc);
}
-/* This is used when running on kernel addresses */
-void early_setup_idt(void)
-{
- void *handler = NULL;
-
- if (IS_ENABLED(CONFIG_AMD_MEM_ENCRYPT)) {
- setup_ghcb();
- handler = vc_boot_ghcb;
- }
-
- startup_64_load_idt(handler);
-}
-
/*
* Setup boot CPU state needed before kernel switches to virtual addresses.
*/
diff --git a/arch/x86/include/asm/setup.h b/arch/x86/include/asm/setup.h
index ad9212df0ec0..6324f4c6c545 100644
--- a/arch/x86/include/asm/setup.h
+++ b/arch/x86/include/asm/setup.h
@@ -52,6 +52,7 @@ extern void reserve_standard_io_resources(void);
extern void i386_reserve_resources(void);
extern unsigned long __startup_64(unsigned long p2v_offset, struct boot_params *bp);
extern void startup_64_setup_gdt_idt(void);
+extern void startup_64_load_idt(void *vc_handler);
extern void early_setup_idt(void);
extern void __init do_early_exception(struct pt_regs *regs, int trapnr);
diff --git a/arch/x86/kernel/head64.c b/arch/x86/kernel/head64.c
index 6b68a206fa7f..29226f3ac064 100644
--- a/arch/x86/kernel/head64.c
+++ b/arch/x86/kernel/head64.c
@@ -303,3 +303,15 @@ void __init __noreturn x86_64_start_reservations(char *real_mode_data)
start_kernel();
}
+
+void early_setup_idt(void)
+{
+ void *handler = NULL;
+
+ if (IS_ENABLED(CONFIG_AMD_MEM_ENCRYPT)) {
+ setup_ghcb();
+ handler = vc_boot_ghcb;
+ }
+
+ startup_64_load_idt(handler);
+}
--
2.49.0.906.g1f30a19c02-goog
* [RFT PATCH v2 02/23] x86/boot: Disregard __supported_pte_mask in __startup_64()
From: Ard Biesheuvel @ 2025-05-04 9:52 UTC
To: linux-kernel
Cc: linux-efi, x86, Ard Biesheuvel, Borislav Petkov, Ingo Molnar,
Dionna Amalie Glaze, Kevin Loughlin, Tom Lendacky
From: Ard Biesheuvel <ardb@kernel.org>
__supported_pte_mask is statically initialized to U64_MAX and is not
assigned until long after the startup code that creates the initial
page tables has executed. So applying the mask there is unnecessary,
and can be avoided.
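For reference, the initializer in question looks roughly like this
(paraphrased from the x86 mm code):

  /* All PTE bits are treated as supported until feature fixups run: */
  pteval_t __supported_pte_mask __read_mostly = ~0;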
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
---
arch/x86/boot/startup/map_kernel.c | 2 --
1 file changed, 2 deletions(-)
diff --git a/arch/x86/boot/startup/map_kernel.c b/arch/x86/boot/startup/map_kernel.c
index 0eac3f17dbd3..099ae2559336 100644
--- a/arch/x86/boot/startup/map_kernel.c
+++ b/arch/x86/boot/startup/map_kernel.c
@@ -179,8 +179,6 @@ unsigned long __head __startup_64(unsigned long p2v_offset,
pud[(i + 1) % PTRS_PER_PUD] = (pudval_t)pmd + pgtable_flags;
pmd_entry = __PAGE_KERNEL_LARGE_EXEC & ~_PAGE_GLOBAL;
- /* Filter out unsupported __PAGE_KERNEL_* bits: */
- pmd_entry &= __supported_pte_mask;
pmd_entry += sme_get_me_mask();
pmd_entry += physaddr;
--
2.49.0.906.g1f30a19c02-goog
* [RFT PATCH v2 03/23] x86/boot: Drop global variables keeping track of LA57 state
From: Ard Biesheuvel @ 2025-05-04 9:52 UTC
To: linux-kernel
Cc: linux-efi, x86, Ard Biesheuvel, Borislav Petkov, Ingo Molnar,
Dionna Amalie Glaze, Kevin Loughlin, Tom Lendacky
From: Ard Biesheuvel <ardb@kernel.org>
On x86_64, the core kernel is entered in long mode, which implies that
paging is enabled. This means that the CR4.LA57 control bit is
guaranteed to be in sync with the number of paging levels used by the
kernel, and there is no need to store this in a variable.
There is also no need to use variables for storing the calculations of
pgdir_shift and ptrs_per_p4d, as they are easily determined on the fly.
This removes the need for two different sources of truth (i.e., early
and late) for determining whether 5-level paging is in use: CR4.LA57
always reflects the actual state, and never changes from the point of
view of the 64-bit core kernel. It also removes the need for exposing
the associated variables to the startup code. The only potential concern
is the cost of CR4 accesses, which can be mitigated using alternatives
patching based on feature detection.
Note that even the decompressor does not manipulate any page tables
before updating CR4.LA57, so it can also avoid the associated global
variables entirely. However, as it does not implement alternatives
patching, the associated ELF sections need to be discarded.
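To make the ALTERNATIVE_TERNARY() usage in the hunk below easier to
follow, here is a C-level sketch of what the new pgtable_l5_enabled()
boils down to (illustrative only, not the actual implementation):

  static inline bool pgtable_l5_enabled_demo(void)
  {
          /*
           * Early in boot, the emitted sequence reads CR4 and tests
           * the LA57 bit. Once alternatives patching has run, that
           * sequence is replaced with a single STC or CLC instruction
           * keyed off X86_FEATURE_LA57, so later callers never pay
           * for a CR4 access.
           */
          return native_read_cr4() & X86_CR4_LA57;
  }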
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
---
arch/x86/boot/compressed/misc.h | 4 --
arch/x86/boot/compressed/pgtable_64.c | 12 ------
arch/x86/boot/compressed/vmlinux.lds.S | 1 +
arch/x86/boot/startup/map_kernel.c | 12 +-----
arch/x86/boot/startup/sme.c | 9 ----
arch/x86/include/asm/pgtable_64_types.h | 43 ++++++++++----------
arch/x86/kernel/cpu/common.c | 2 -
arch/x86/kernel/head64.c | 11 -----
arch/x86/mm/kasan_init_64.c | 3 --
9 files changed, 24 insertions(+), 73 deletions(-)
diff --git a/arch/x86/boot/compressed/misc.h b/arch/x86/boot/compressed/misc.h
index dd8d1a85f671..450d27d0f449 100644
--- a/arch/x86/boot/compressed/misc.h
+++ b/arch/x86/boot/compressed/misc.h
@@ -16,9 +16,6 @@
#define __NO_FORTIFY
-/* cpu_feature_enabled() cannot be used this early */
-#define USE_EARLY_PGTABLE_L5
-
/*
* Boot stub deals with identity mappings, physical and virtual addresses are
* the same, so override these defines.
@@ -181,7 +178,6 @@ static inline int count_immovable_mem_regions(void) { return 0; }
#endif
/* ident_map_64.c */
-extern unsigned int __pgtable_l5_enabled, pgdir_shift, ptrs_per_p4d;
extern void kernel_add_identity_map(unsigned long start, unsigned long end);
/* Used by PAGE_KERN* macros: */
diff --git a/arch/x86/boot/compressed/pgtable_64.c b/arch/x86/boot/compressed/pgtable_64.c
index 5a6c7a190e5b..591d28f2feb6 100644
--- a/arch/x86/boot/compressed/pgtable_64.c
+++ b/arch/x86/boot/compressed/pgtable_64.c
@@ -10,13 +10,6 @@
#define BIOS_START_MIN 0x20000U /* 128K, less than this is insane */
#define BIOS_START_MAX 0x9f000U /* 640K, absolute maximum */
-#ifdef CONFIG_X86_5LEVEL
-/* __pgtable_l5_enabled needs to be in .data to avoid being cleared along with .bss */
-unsigned int __section(".data") __pgtable_l5_enabled;
-unsigned int __section(".data") pgdir_shift = 39;
-unsigned int __section(".data") ptrs_per_p4d = 1;
-#endif
-
/* Buffer to preserve trampoline memory */
static char trampoline_save[TRAMPOLINE_32BIT_SIZE];
@@ -127,11 +120,6 @@ asmlinkage void configure_5level_paging(struct boot_params *bp, void *pgtable)
native_cpuid_eax(0) >= 7 &&
(native_cpuid_ecx(7) & (1 << (X86_FEATURE_LA57 & 31)))) {
l5_required = true;
-
- /* Initialize variables for 5-level paging */
- __pgtable_l5_enabled = 1;
- pgdir_shift = 48;
- ptrs_per_p4d = 512;
}
/*
diff --git a/arch/x86/boot/compressed/vmlinux.lds.S b/arch/x86/boot/compressed/vmlinux.lds.S
index 3b2bc61c9408..d3e786ff7dcb 100644
--- a/arch/x86/boot/compressed/vmlinux.lds.S
+++ b/arch/x86/boot/compressed/vmlinux.lds.S
@@ -81,6 +81,7 @@ SECTIONS
*(.dynamic) *(.dynsym) *(.dynstr) *(.dynbss)
*(.hash) *(.gnu.hash)
*(.note.*)
+ *(.altinstructions .altinstr_replacement)
}
.got.plt (INFO) : {
diff --git a/arch/x86/boot/startup/map_kernel.c b/arch/x86/boot/startup/map_kernel.c
index 099ae2559336..11f99d907f76 100644
--- a/arch/x86/boot/startup/map_kernel.c
+++ b/arch/x86/boot/startup/map_kernel.c
@@ -16,19 +16,9 @@ extern unsigned int next_early_pgt;
static inline bool check_la57_support(void)
{
- if (!IS_ENABLED(CONFIG_X86_5LEVEL))
+ if (!pgtable_l5_enabled())
return false;
- /*
- * 5-level paging is detected and enabled at kernel decompression
- * stage. Only check if it has been enabled there.
- */
- if (!(native_read_cr4() & X86_CR4_LA57))
- return false;
-
- __pgtable_l5_enabled = 1;
- pgdir_shift = 48;
- ptrs_per_p4d = 512;
page_offset_base = __PAGE_OFFSET_BASE_L5;
vmalloc_base = __VMALLOC_BASE_L5;
vmemmap_base = __VMEMMAP_BASE_L5;
diff --git a/arch/x86/boot/startup/sme.c b/arch/x86/boot/startup/sme.c
index 5738b31c8e60..ade5ad5060e9 100644
--- a/arch/x86/boot/startup/sme.c
+++ b/arch/x86/boot/startup/sme.c
@@ -25,15 +25,6 @@
#undef CONFIG_PARAVIRT_XXL
#undef CONFIG_PARAVIRT_SPINLOCKS
-/*
- * This code runs before CPU feature bits are set. By default, the
- * pgtable_l5_enabled() function uses bit X86_FEATURE_LA57 to determine if
- * 5-level paging is active, so that won't work here. USE_EARLY_PGTABLE_L5
- * is provided to handle this situation and, instead, use a variable that
- * has been set by the early boot code.
- */
-#define USE_EARLY_PGTABLE_L5
-
#include <linux/kernel.h>
#include <linux/mm.h>
#include <linux/mem_encrypt.h>
diff --git a/arch/x86/include/asm/pgtable_64_types.h b/arch/x86/include/asm/pgtable_64_types.h
index 5bb782d856f2..40dceb7d80f5 100644
--- a/arch/x86/include/asm/pgtable_64_types.h
+++ b/arch/x86/include/asm/pgtable_64_types.h
@@ -6,7 +6,10 @@
#ifndef __ASSEMBLER__
#include <linux/types.h>
+#include <asm/alternative.h>
+#include <asm/cpufeatures.h>
#include <asm/kaslr.h>
+#include <asm/processor-flags.h>
/*
* These are used to make use of C type-checking..
@@ -21,28 +24,24 @@ typedef unsigned long pgprotval_t;
typedef struct { pteval_t pte; } pte_t;
typedef struct { pmdval_t pmd; } pmd_t;
-extern unsigned int __pgtable_l5_enabled;
-
-#ifdef CONFIG_X86_5LEVEL
-#ifdef USE_EARLY_PGTABLE_L5
-/*
- * cpu_feature_enabled() is not available in early boot code.
- * Use variable instead.
- */
-static inline bool pgtable_l5_enabled(void)
+static __always_inline __pure bool pgtable_l5_enabled(void)
{
- return __pgtable_l5_enabled;
-}
-#else
-#define pgtable_l5_enabled() cpu_feature_enabled(X86_FEATURE_LA57)
-#endif /* USE_EARLY_PGTABLE_L5 */
+ unsigned long r;
+ bool ret;
-#else
-#define pgtable_l5_enabled() 0
-#endif /* CONFIG_X86_5LEVEL */
+ if (!IS_ENABLED(CONFIG_X86_5LEVEL))
+ return false;
-extern unsigned int pgdir_shift;
-extern unsigned int ptrs_per_p4d;
+ asm(ALTERNATIVE_TERNARY(
+ "movq %%cr4, %[reg] \n\t btl %[la57], %k[reg]" CC_SET(c),
+ %P[feat], "stc", "clc")
+ : [reg] "=&r" (r), CC_OUT(c) (ret)
+ : [feat] "i" (X86_FEATURE_LA57),
+ [la57] "i" (X86_CR4_LA57_BIT)
+ : "cc");
+
+ return ret;
+}
#endif /* !__ASSEMBLER__ */
@@ -53,7 +52,7 @@ extern unsigned int ptrs_per_p4d;
/*
* PGDIR_SHIFT determines what a top-level page table entry can map
*/
-#define PGDIR_SHIFT pgdir_shift
+#define PGDIR_SHIFT (pgtable_l5_enabled() ? 48 : 39)
#define PTRS_PER_PGD 512
/*
@@ -61,7 +60,7 @@ extern unsigned int ptrs_per_p4d;
*/
#define P4D_SHIFT 39
#define MAX_PTRS_PER_P4D 512
-#define PTRS_PER_P4D ptrs_per_p4d
+#define PTRS_PER_P4D (pgtable_l5_enabled() ? 512 : 1)
#define P4D_SIZE (_AC(1, UL) << P4D_SHIFT)
#define P4D_MASK (~(P4D_SIZE - 1))
@@ -76,6 +75,8 @@ extern unsigned int ptrs_per_p4d;
#define PTRS_PER_PGD 512
#define MAX_PTRS_PER_P4D 1
+#define MAX_POSSIBLE_PHYSMEM_BITS 46
+
#endif /* CONFIG_X86_5LEVEL */
/*
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 12126adbc3a9..eb6a7f6e20c4 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -1,6 +1,4 @@
// SPDX-License-Identifier: GPL-2.0-only
-/* cpu_feature_enabled() cannot be used this early */
-#define USE_EARLY_PGTABLE_L5
#include <linux/memblock.h>
#include <linux/linkage.h>
diff --git a/arch/x86/kernel/head64.c b/arch/x86/kernel/head64.c
index 29226f3ac064..3d49abb1bb3a 100644
--- a/arch/x86/kernel/head64.c
+++ b/arch/x86/kernel/head64.c
@@ -5,9 +5,6 @@
* Copyright (C) 2000 Andrea Arcangeli <andrea@suse.de> SuSE
*/
-/* cpu_feature_enabled() cannot be used this early */
-#define USE_EARLY_PGTABLE_L5
-
#include <linux/init.h>
#include <linux/linkage.h>
#include <linux/types.h>
@@ -50,14 +47,6 @@ extern pmd_t early_dynamic_pgts[EARLY_DYNAMIC_PAGE_TABLES][PTRS_PER_PMD];
unsigned int __initdata next_early_pgt;
pmdval_t early_pmd_flags = __PAGE_KERNEL_LARGE & ~(_PAGE_GLOBAL | _PAGE_NX);
-#ifdef CONFIG_X86_5LEVEL
-unsigned int __pgtable_l5_enabled __ro_after_init;
-unsigned int pgdir_shift __ro_after_init = 39;
-EXPORT_SYMBOL(pgdir_shift);
-unsigned int ptrs_per_p4d __ro_after_init = 1;
-EXPORT_SYMBOL(ptrs_per_p4d);
-#endif
-
#ifdef CONFIG_DYNAMIC_MEMORY_LAYOUT
unsigned long page_offset_base __ro_after_init = __PAGE_OFFSET_BASE_L4;
EXPORT_SYMBOL(page_offset_base);
diff --git a/arch/x86/mm/kasan_init_64.c b/arch/x86/mm/kasan_init_64.c
index 0539efd0d216..7c4fafbd52cc 100644
--- a/arch/x86/mm/kasan_init_64.c
+++ b/arch/x86/mm/kasan_init_64.c
@@ -1,9 +1,6 @@
// SPDX-License-Identifier: GPL-2.0
#define pr_fmt(fmt) "kasan: " fmt
-/* cpu_feature_enabled() cannot be used this early */
-#define USE_EARLY_PGTABLE_L5
-
#include <linux/memblock.h>
#include <linux/kasan.h>
#include <linux/kdebug.h>
--
2.49.0.906.g1f30a19c02-goog
* [RFT PATCH v2 04/23] x86/sev: Make sev_snp_enabled() a static function
From: Ard Biesheuvel @ 2025-05-04 9:52 UTC
To: linux-kernel
Cc: linux-efi, x86, Ard Biesheuvel, Borislav Petkov, Ingo Molnar,
Dionna Amalie Glaze, Kevin Loughlin, Tom Lendacky
From: Ard Biesheuvel <ardb@kernel.org>
sev_snp_enabled() is no longer used outside of the source file that
defines it, so make it static and drop the extern declarations.
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
---
arch/x86/boot/compressed/sev.c | 2 +-
arch/x86/boot/compressed/sev.h | 2 --
2 files changed, 1 insertion(+), 3 deletions(-)
diff --git a/arch/x86/boot/compressed/sev.c b/arch/x86/boot/compressed/sev.c
index 478c65149cf0..bc52c0aa96d4 100644
--- a/arch/x86/boot/compressed/sev.c
+++ b/arch/x86/boot/compressed/sev.c
@@ -164,7 +164,7 @@ int svsm_perform_call_protocol(struct svsm_call *call)
return ret;
}
-bool sev_snp_enabled(void)
+static bool sev_snp_enabled(void)
{
return sev_status & MSR_AMD64_SEV_SNP_ENABLED;
}
diff --git a/arch/x86/boot/compressed/sev.h b/arch/x86/boot/compressed/sev.h
index 4e463f33186d..9d21af3a220d 100644
--- a/arch/x86/boot/compressed/sev.h
+++ b/arch/x86/boot/compressed/sev.h
@@ -10,13 +10,11 @@
#ifdef CONFIG_AMD_MEM_ENCRYPT
-bool sev_snp_enabled(void);
void snp_accept_memory(phys_addr_t start, phys_addr_t end);
u64 sev_get_status(void);
#else
-static inline bool sev_snp_enabled(void) { return false; }
static inline void snp_accept_memory(phys_addr_t start, phys_addr_t end) { }
static inline u64 sev_get_status(void) { return 0; }
--
2.49.0.906.g1f30a19c02-goog
* [RFT PATCH v2 05/23] x86/sev: Move instruction decoder into separate source file
From: Ard Biesheuvel @ 2025-05-04 9:52 UTC
To: linux-kernel
Cc: linux-efi, x86, Ard Biesheuvel, Borislav Petkov, Ingo Molnar,
Dionna Amalie Glaze, Kevin Loughlin, Tom Lendacky
From: Ard Biesheuvel <ardb@kernel.org>
As a first step towards disentangling the SEV #VC handling code - which
is shared between the decompressor and the core kernel - from the SEV
startup code, move the decompressor's copy of the instruction decoder
into a separate source file.
Code movement only - no functional change intended.
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
---
arch/x86/boot/compressed/Makefile | 6 +--
arch/x86/boot/compressed/misc.h | 7 +++
arch/x86/boot/compressed/sev-handle-vc.c | 51 ++++++++++++++++++++
arch/x86/boot/compressed/sev.c | 39 +--------------
4 files changed, 62 insertions(+), 41 deletions(-)
diff --git a/arch/x86/boot/compressed/Makefile b/arch/x86/boot/compressed/Makefile
index 0fcad7b7e007..f4f7b22d8113 100644
--- a/arch/x86/boot/compressed/Makefile
+++ b/arch/x86/boot/compressed/Makefile
@@ -44,10 +44,10 @@ KBUILD_CFLAGS += -D__DISABLE_EXPORTS
KBUILD_CFLAGS += $(call cc-option,-Wa$(comma)-mrelax-relocations=no)
KBUILD_CFLAGS += -include $(srctree)/include/linux/hidden.h
-# sev.c indirectly includes inat-table.h which is generated during
+# sev-handle-vc.c indirectly includes inat-table.c which is generated during
# compilation and stored in $(objtree). Add the directory to the includes so
# that the compiler finds it even with out-of-tree builds (make O=/some/path).
-CFLAGS_sev.o += -I$(objtree)/arch/x86/lib/
+CFLAGS_sev-handle-vc.o += -I$(objtree)/arch/x86/lib/
KBUILD_AFLAGS := $(KBUILD_CFLAGS) -D__ASSEMBLY__
@@ -96,7 +96,7 @@ ifdef CONFIG_X86_64
vmlinux-objs-y += $(obj)/idt_64.o $(obj)/idt_handlers_64.o
vmlinux-objs-$(CONFIG_AMD_MEM_ENCRYPT) += $(obj)/mem_encrypt.o
vmlinux-objs-y += $(obj)/pgtable_64.o
- vmlinux-objs-$(CONFIG_AMD_MEM_ENCRYPT) += $(obj)/sev.o
+ vmlinux-objs-$(CONFIG_AMD_MEM_ENCRYPT) += $(obj)/sev.o $(obj)/sev-handle-vc.o
endif
vmlinux-objs-$(CONFIG_ACPI) += $(obj)/acpi.o
diff --git a/arch/x86/boot/compressed/misc.h b/arch/x86/boot/compressed/misc.h
index 450d27d0f449..ccd3f4257bcd 100644
--- a/arch/x86/boot/compressed/misc.h
+++ b/arch/x86/boot/compressed/misc.h
@@ -133,6 +133,9 @@ static inline void console_init(void)
#endif
#ifdef CONFIG_AMD_MEM_ENCRYPT
+struct es_em_ctxt;
+struct insn;
+
void sev_enable(struct boot_params *bp);
void snp_check_features(void);
void sev_es_shutdown_ghcb(void);
@@ -140,6 +143,10 @@ extern bool sev_es_check_ghcb_fault(unsigned long address);
void snp_set_page_private(unsigned long paddr);
void snp_set_page_shared(unsigned long paddr);
void sev_prep_identity_maps(unsigned long top_level_pgt);
+
+enum es_result vc_decode_insn(struct es_em_ctxt *ctxt);
+bool insn_has_rep_prefix(struct insn *insn);
+void sev_insn_decode_init(void);
#else
static inline void sev_enable(struct boot_params *bp)
{
diff --git a/arch/x86/boot/compressed/sev-handle-vc.c b/arch/x86/boot/compressed/sev-handle-vc.c
new file mode 100644
index 000000000000..b1aa073b732c
--- /dev/null
+++ b/arch/x86/boot/compressed/sev-handle-vc.c
@@ -0,0 +1,51 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include "misc.h"
+
+#include <linux/kernel.h>
+#include <linux/string.h>
+#include <asm/insn.h>
+#include <asm/pgtable_types.h>
+#include <asm/ptrace.h>
+#include <asm/sev.h>
+
+#define __BOOT_COMPRESSED
+
+/* Basic instruction decoding support needed */
+#include "../../lib/inat.c"
+#include "../../lib/insn.c"
+
+/*
+ * Copy a version of this function here - insn-eval.c can't be used in
+ * pre-decompression code.
+ */
+bool insn_has_rep_prefix(struct insn *insn)
+{
+ insn_byte_t p;
+ int i;
+
+ insn_get_prefixes(insn);
+
+ for_each_insn_prefix(insn, i, p) {
+ if (p == 0xf2 || p == 0xf3)
+ return true;
+ }
+
+ return false;
+}
+
+enum es_result vc_decode_insn(struct es_em_ctxt *ctxt)
+{
+ char buffer[MAX_INSN_SIZE];
+ int ret;
+
+ memcpy(buffer, (unsigned char *)ctxt->regs->ip, MAX_INSN_SIZE);
+
+ ret = insn_decode(&ctxt->insn, buffer, MAX_INSN_SIZE, INSN_MODE_64);
+ if (ret < 0)
+ return ES_DECODE_FAILED;
+
+ return ES_OK;
+}
+
+extern void sev_insn_decode_init(void) __alias(inat_init_tables);
diff --git a/arch/x86/boot/compressed/sev.c b/arch/x86/boot/compressed/sev.c
index bc52c0aa96d4..5cd029c4f36d 100644
--- a/arch/x86/boot/compressed/sev.c
+++ b/arch/x86/boot/compressed/sev.c
@@ -29,25 +29,6 @@
static struct ghcb boot_ghcb_page __aligned(PAGE_SIZE);
struct ghcb *boot_ghcb;
-/*
- * Copy a version of this function here - insn-eval.c can't be used in
- * pre-decompression code.
- */
-static bool insn_has_rep_prefix(struct insn *insn)
-{
- insn_byte_t p;
- int i;
-
- insn_get_prefixes(insn);
-
- for_each_insn_prefix(insn, i, p) {
- if (p == 0xf2 || p == 0xf3)
- return true;
- }
-
- return false;
-}
-
/*
* Only a dummy for insn_get_seg_base() - Early boot-code is 64bit only and
* doesn't use segments.
@@ -74,20 +55,6 @@ static inline void sev_es_wr_ghcb_msr(u64 val)
boot_wrmsr(MSR_AMD64_SEV_ES_GHCB, &m);
}
-static enum es_result vc_decode_insn(struct es_em_ctxt *ctxt)
-{
- char buffer[MAX_INSN_SIZE];
- int ret;
-
- memcpy(buffer, (unsigned char *)ctxt->regs->ip, MAX_INSN_SIZE);
-
- ret = insn_decode(&ctxt->insn, buffer, MAX_INSN_SIZE, INSN_MODE_64);
- if (ret < 0)
- return ES_DECODE_FAILED;
-
- return ES_OK;
-}
-
static enum es_result vc_write_mem(struct es_em_ctxt *ctxt,
void *dst, char *buf, size_t size)
{
@@ -122,10 +89,6 @@ static bool fault_in_kernel_space(unsigned long address)
#define __BOOT_COMPRESSED
-/* Basic instruction decoding support needed */
-#include "../../lib/inat.c"
-#include "../../lib/insn.c"
-
extern struct svsm_ca *boot_svsm_caa;
extern u64 boot_svsm_caa_pa;
@@ -230,7 +193,7 @@ static bool early_setup_ghcb(void)
boot_ghcb = &boot_ghcb_page;
/* Initialize lookup tables for the instruction decoder */
- inat_init_tables();
+ sev_insn_decode_init();
/* SNP guest requires the GHCB GPA must be registered */
if (sev_snp_enabled())
--
2.49.0.906.g1f30a19c02-goog
* [RFT PATCH v2 06/23] x86/sev: Disentangle #VC handling code from startup code
From: Ard Biesheuvel @ 2025-05-04 9:52 UTC
To: linux-kernel
Cc: linux-efi, x86, Ard Biesheuvel, Borislav Petkov, Ingo Molnar,
Dionna Amalie Glaze, Kevin Loughlin, Tom Lendacky
From: Ard Biesheuvel <ardb@kernel.org>
Most of the SEV support code used to reside in a single C source file
that was included in two places: the core kernel, and the decompressor.
The code that is actually shared with the decompressor was moved into a
separate, shared source file under startup/, on the basis that the
decompressor also executes from the early 1:1 mapping of memory.
However, while the elaborate #VC handling and instruction decoding this
code involves is also performed by the decompressor, none of it actually
occurs in the core kernel at early boot, and therefore, it does not need
to be part of the confined early startup code.
So split off the #VC handling code and move it back into arch/x86/coco
where it came from, into another C source file that is included from
both the decompressor and the core kernel.
Code movement only - no functional change intended.
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
---
arch/x86/boot/compressed/misc.h | 1 +
arch/x86/boot/compressed/sev-handle-vc.c | 83 ++
arch/x86/boot/compressed/sev.c | 96 +-
arch/x86/boot/compressed/sev.h | 19 +
arch/x86/boot/startup/sev-shared.c | 519 +---------
arch/x86/boot/startup/sev-startup.c | 1022 -------------------
arch/x86/coco/sev/Makefile | 2 +-
arch/x86/coco/sev/vc-handle.c | 1061 ++++++++++++++++++++
arch/x86/coco/sev/vc-shared.c | 504 ++++++++++
arch/x86/include/asm/sev-internal.h | 8 -
arch/x86/include/asm/sev.h | 22 +
11 files changed, 1694 insertions(+), 1643 deletions(-)
diff --git a/arch/x86/boot/compressed/misc.h b/arch/x86/boot/compressed/misc.h
index ccd3f4257bcd..b33f1c5a486c 100644
--- a/arch/x86/boot/compressed/misc.h
+++ b/arch/x86/boot/compressed/misc.h
@@ -147,6 +147,7 @@ void sev_prep_identity_maps(unsigned long top_level_pgt);
enum es_result vc_decode_insn(struct es_em_ctxt *ctxt);
bool insn_has_rep_prefix(struct insn *insn);
void sev_insn_decode_init(void);
+bool early_setup_ghcb(void);
#else
static inline void sev_enable(struct boot_params *bp)
{
diff --git a/arch/x86/boot/compressed/sev-handle-vc.c b/arch/x86/boot/compressed/sev-handle-vc.c
index b1aa073b732c..89dd02de2a0f 100644
--- a/arch/x86/boot/compressed/sev-handle-vc.c
+++ b/arch/x86/boot/compressed/sev-handle-vc.c
@@ -1,6 +1,7 @@
// SPDX-License-Identifier: GPL-2.0
#include "misc.h"
+#include "sev.h"
#include <linux/kernel.h>
#include <linux/string.h>
@@ -8,6 +9,9 @@
#include <asm/pgtable_types.h>
#include <asm/ptrace.h>
#include <asm/sev.h>
+#include <asm/trapnr.h>
+#include <asm/trap_pf.h>
+#include <asm/fpu/xcr.h>
#define __BOOT_COMPRESSED
@@ -49,3 +53,82 @@ enum es_result vc_decode_insn(struct es_em_ctxt *ctxt)
}
extern void sev_insn_decode_init(void) __alias(inat_init_tables);
+
+/*
+ * Only a dummy for insn_get_seg_base() - Early boot-code is 64bit only and
+ * doesn't use segments.
+ */
+static unsigned long insn_get_seg_base(struct pt_regs *regs, int seg_reg_idx)
+{
+ return 0UL;
+}
+
+static enum es_result vc_write_mem(struct es_em_ctxt *ctxt,
+ void *dst, char *buf, size_t size)
+{
+ memcpy(dst, buf, size);
+
+ return ES_OK;
+}
+
+static enum es_result vc_read_mem(struct es_em_ctxt *ctxt,
+ void *src, char *buf, size_t size)
+{
+ memcpy(buf, src, size);
+
+ return ES_OK;
+}
+
+static enum es_result vc_ioio_check(struct es_em_ctxt *ctxt, u16 port, size_t size)
+{
+ return ES_OK;
+}
+
+static bool fault_in_kernel_space(unsigned long address)
+{
+ return false;
+}
+
+#define sev_printk(fmt, ...)
+
+#include "../../coco/sev/vc-shared.c"
+
+void do_boot_stage2_vc(struct pt_regs *regs, unsigned long exit_code)
+{
+ struct es_em_ctxt ctxt;
+ enum es_result result;
+
+ if (!boot_ghcb && !early_setup_ghcb())
+ sev_es_terminate(SEV_TERM_SET_GEN, GHCB_SEV_ES_GEN_REQ);
+
+ vc_ghcb_invalidate(boot_ghcb);
+ result = vc_init_em_ctxt(&ctxt, regs, exit_code);
+ if (result != ES_OK)
+ goto finish;
+
+ result = vc_check_opcode_bytes(&ctxt, exit_code);
+ if (result != ES_OK)
+ goto finish;
+
+ switch (exit_code) {
+ case SVM_EXIT_RDTSC:
+ case SVM_EXIT_RDTSCP:
+ result = vc_handle_rdtsc(boot_ghcb, &ctxt, exit_code);
+ break;
+ case SVM_EXIT_IOIO:
+ result = vc_handle_ioio(boot_ghcb, &ctxt);
+ break;
+ case SVM_EXIT_CPUID:
+ result = vc_handle_cpuid(boot_ghcb, &ctxt);
+ break;
+ default:
+ result = ES_UNSUPPORTED;
+ break;
+ }
+
+finish:
+ if (result == ES_OK)
+ vc_finish_insn(&ctxt);
+ else if (result != ES_RETRY)
+ sev_es_terminate(SEV_TERM_SET_GEN, GHCB_SEV_ES_GEN_REQ);
+}
diff --git a/arch/x86/boot/compressed/sev.c b/arch/x86/boot/compressed/sev.c
index 5cd029c4f36d..804166457022 100644
--- a/arch/x86/boot/compressed/sev.c
+++ b/arch/x86/boot/compressed/sev.c
@@ -24,63 +24,11 @@
#include <asm/cpuid.h>
#include "error.h"
-#include "../msr.h"
+#include "sev.h"
static struct ghcb boot_ghcb_page __aligned(PAGE_SIZE);
struct ghcb *boot_ghcb;
-/*
- * Only a dummy for insn_get_seg_base() - Early boot-code is 64bit only and
- * doesn't use segments.
- */
-static unsigned long insn_get_seg_base(struct pt_regs *regs, int seg_reg_idx)
-{
- return 0UL;
-}
-
-static inline u64 sev_es_rd_ghcb_msr(void)
-{
- struct msr m;
-
- boot_rdmsr(MSR_AMD64_SEV_ES_GHCB, &m);
-
- return m.q;
-}
-
-static inline void sev_es_wr_ghcb_msr(u64 val)
-{
- struct msr m;
-
- m.q = val;
- boot_wrmsr(MSR_AMD64_SEV_ES_GHCB, &m);
-}
-
-static enum es_result vc_write_mem(struct es_em_ctxt *ctxt,
- void *dst, char *buf, size_t size)
-{
- memcpy(dst, buf, size);
-
- return ES_OK;
-}
-
-static enum es_result vc_read_mem(struct es_em_ctxt *ctxt,
- void *src, char *buf, size_t size)
-{
- memcpy(buf, src, size);
-
- return ES_OK;
-}
-
-static enum es_result vc_ioio_check(struct es_em_ctxt *ctxt, u16 port, size_t size)
-{
- return ES_OK;
-}
-
-static bool fault_in_kernel_space(unsigned long address)
-{
- return false;
-}
-
#undef __init
#define __init
@@ -182,7 +130,7 @@ void snp_set_page_shared(unsigned long paddr)
__page_state_change(paddr, SNP_PAGE_STATE_SHARED);
}
-static bool early_setup_ghcb(void)
+bool early_setup_ghcb(void)
{
if (set_page_decrypted((unsigned long)&boot_ghcb_page))
return false;
@@ -266,46 +214,6 @@ bool sev_es_check_ghcb_fault(unsigned long address)
return ((address & PAGE_MASK) == (unsigned long)&boot_ghcb_page);
}
-void do_boot_stage2_vc(struct pt_regs *regs, unsigned long exit_code)
-{
- struct es_em_ctxt ctxt;
- enum es_result result;
-
- if (!boot_ghcb && !early_setup_ghcb())
- sev_es_terminate(SEV_TERM_SET_GEN, GHCB_SEV_ES_GEN_REQ);
-
- vc_ghcb_invalidate(boot_ghcb);
- result = vc_init_em_ctxt(&ctxt, regs, exit_code);
- if (result != ES_OK)
- goto finish;
-
- result = vc_check_opcode_bytes(&ctxt, exit_code);
- if (result != ES_OK)
- goto finish;
-
- switch (exit_code) {
- case SVM_EXIT_RDTSC:
- case SVM_EXIT_RDTSCP:
- result = vc_handle_rdtsc(boot_ghcb, &ctxt, exit_code);
- break;
- case SVM_EXIT_IOIO:
- result = vc_handle_ioio(boot_ghcb, &ctxt);
- break;
- case SVM_EXIT_CPUID:
- result = vc_handle_cpuid(boot_ghcb, &ctxt);
- break;
- default:
- result = ES_UNSUPPORTED;
- break;
- }
-
-finish:
- if (result == ES_OK)
- vc_finish_insn(&ctxt);
- else if (result != ES_RETRY)
- sev_es_terminate(SEV_TERM_SET_GEN, GHCB_SEV_ES_GEN_REQ);
-}
-
/*
* SNP_FEATURES_IMPL_REQ is the mask of SNP features that will need
* guest side implementation for proper functioning of the guest. If any
diff --git a/arch/x86/boot/compressed/sev.h b/arch/x86/boot/compressed/sev.h
index 9d21af3a220d..3ed7fb27bd4d 100644
--- a/arch/x86/boot/compressed/sev.h
+++ b/arch/x86/boot/compressed/sev.h
@@ -10,9 +10,28 @@
#ifdef CONFIG_AMD_MEM_ENCRYPT
+#include "../msr.h"
+
void snp_accept_memory(phys_addr_t start, phys_addr_t end);
u64 sev_get_status(void);
+static inline u64 sev_es_rd_ghcb_msr(void)
+{
+ struct msr m;
+
+ boot_rdmsr(MSR_AMD64_SEV_ES_GHCB, &m);
+
+ return m.q;
+}
+
+static inline void sev_es_wr_ghcb_msr(u64 val)
+{
+ struct msr m;
+
+ m.q = val;
+ boot_wrmsr(MSR_AMD64_SEV_ES_GHCB, &m);
+}
+
#else
static inline void snp_accept_memory(phys_addr_t start, phys_addr_t end) { }
diff --git a/arch/x86/boot/startup/sev-shared.c b/arch/x86/boot/startup/sev-shared.c
index 173f3d1f777a..7a706db87b93 100644
--- a/arch/x86/boot/startup/sev-shared.c
+++ b/arch/x86/boot/startup/sev-shared.c
@@ -14,13 +14,9 @@
#ifndef __BOOT_COMPRESSED
#define error(v) pr_err(v)
#define has_cpuflag(f) boot_cpu_has(f)
-#define sev_printk(fmt, ...) printk(fmt, ##__VA_ARGS__)
-#define sev_printk_rtl(fmt, ...) printk_ratelimited(fmt, ##__VA_ARGS__)
#else
#undef WARN
#define WARN(condition, format...) (!!(condition))
-#define sev_printk(fmt, ...)
-#define sev_printk_rtl(fmt, ...)
#undef vc_forward_exception
#define vc_forward_exception(c) panic("SNP: Hypervisor requested exception\n")
#endif
@@ -36,16 +32,6 @@
struct svsm_ca *boot_svsm_caa __ro_after_init;
u64 boot_svsm_caa_pa __ro_after_init;
-/* I/O parameters for CPUID-related helpers */
-struct cpuid_leaf {
- u32 fn;
- u32 subfn;
- u32 eax;
- u32 ebx;
- u32 ecx;
- u32 edx;
-};
-
/*
* Since feature negotiation related variables are set early in the boot
* process they must reside in the .data section so as not to be zeroed
@@ -151,33 +137,6 @@ bool sev_es_negotiate_protocol(void)
return true;
}
-static bool vc_decoding_needed(unsigned long exit_code)
-{
- /* Exceptions don't require to decode the instruction */
- return !(exit_code >= SVM_EXIT_EXCP_BASE &&
- exit_code <= SVM_EXIT_LAST_EXCP);
-}
-
-static enum es_result vc_init_em_ctxt(struct es_em_ctxt *ctxt,
- struct pt_regs *regs,
- unsigned long exit_code)
-{
- enum es_result ret = ES_OK;
-
- memset(ctxt, 0, sizeof(*ctxt));
- ctxt->regs = regs;
-
- if (vc_decoding_needed(exit_code))
- ret = vc_decode_insn(ctxt);
-
- return ret;
-}
-
-static void vc_finish_insn(struct es_em_ctxt *ctxt)
-{
- ctxt->regs->ip += ctxt->insn.length;
-}
-
static enum es_result verify_exception_info(struct ghcb *ghcb, struct es_em_ctxt *ctxt)
{
u32 ret;
@@ -627,7 +586,7 @@ snp_cpuid_postprocess(struct ghcb *ghcb, struct es_em_ctxt *ctxt,
* Returns -EOPNOTSUPP if feature not enabled. Any other non-zero return value
* should be treated as fatal by caller.
*/
-static int __head
+int __head
snp_cpuid(struct ghcb *ghcb, struct es_em_ctxt *ctxt, struct cpuid_leaf *leaf)
{
const struct snp_cpuid_table *cpuid_table = snp_cpuid_get_table();
@@ -737,391 +696,6 @@ void __head do_vc_no_ghcb(struct pt_regs *regs, unsigned long exit_code)
sev_es_terminate(SEV_TERM_SET_GEN, GHCB_SEV_ES_GEN_REQ);
}
-static enum es_result vc_insn_string_check(struct es_em_ctxt *ctxt,
- unsigned long address,
- bool write)
-{
- if (user_mode(ctxt->regs) && fault_in_kernel_space(address)) {
- ctxt->fi.vector = X86_TRAP_PF;
- ctxt->fi.error_code = X86_PF_USER;
- ctxt->fi.cr2 = address;
- if (write)
- ctxt->fi.error_code |= X86_PF_WRITE;
-
- return ES_EXCEPTION;
- }
-
- return ES_OK;
-}
-
-static enum es_result vc_insn_string_read(struct es_em_ctxt *ctxt,
- void *src, char *buf,
- unsigned int data_size,
- unsigned int count,
- bool backwards)
-{
- int i, b = backwards ? -1 : 1;
- unsigned long address = (unsigned long)src;
- enum es_result ret;
-
- ret = vc_insn_string_check(ctxt, address, false);
- if (ret != ES_OK)
- return ret;
-
- for (i = 0; i < count; i++) {
- void *s = src + (i * data_size * b);
- char *d = buf + (i * data_size);
-
- ret = vc_read_mem(ctxt, s, d, data_size);
- if (ret != ES_OK)
- break;
- }
-
- return ret;
-}
-
-static enum es_result vc_insn_string_write(struct es_em_ctxt *ctxt,
- void *dst, char *buf,
- unsigned int data_size,
- unsigned int count,
- bool backwards)
-{
- int i, s = backwards ? -1 : 1;
- unsigned long address = (unsigned long)dst;
- enum es_result ret;
-
- ret = vc_insn_string_check(ctxt, address, true);
- if (ret != ES_OK)
- return ret;
-
- for (i = 0; i < count; i++) {
- void *d = dst + (i * data_size * s);
- char *b = buf + (i * data_size);
-
- ret = vc_write_mem(ctxt, d, b, data_size);
- if (ret != ES_OK)
- break;
- }
-
- return ret;
-}
-
-#define IOIO_TYPE_STR BIT(2)
-#define IOIO_TYPE_IN 1
-#define IOIO_TYPE_INS (IOIO_TYPE_IN | IOIO_TYPE_STR)
-#define IOIO_TYPE_OUT 0
-#define IOIO_TYPE_OUTS (IOIO_TYPE_OUT | IOIO_TYPE_STR)
-
-#define IOIO_REP BIT(3)
-
-#define IOIO_ADDR_64 BIT(9)
-#define IOIO_ADDR_32 BIT(8)
-#define IOIO_ADDR_16 BIT(7)
-
-#define IOIO_DATA_32 BIT(6)
-#define IOIO_DATA_16 BIT(5)
-#define IOIO_DATA_8 BIT(4)
-
-#define IOIO_SEG_ES (0 << 10)
-#define IOIO_SEG_DS (3 << 10)
-
-static enum es_result vc_ioio_exitinfo(struct es_em_ctxt *ctxt, u64 *exitinfo)
-{
- struct insn *insn = &ctxt->insn;
- size_t size;
- u64 port;
-
- *exitinfo = 0;
-
- switch (insn->opcode.bytes[0]) {
- /* INS opcodes */
- case 0x6c:
- case 0x6d:
- *exitinfo |= IOIO_TYPE_INS;
- *exitinfo |= IOIO_SEG_ES;
- port = ctxt->regs->dx & 0xffff;
- break;
-
- /* OUTS opcodes */
- case 0x6e:
- case 0x6f:
- *exitinfo |= IOIO_TYPE_OUTS;
- *exitinfo |= IOIO_SEG_DS;
- port = ctxt->regs->dx & 0xffff;
- break;
-
- /* IN immediate opcodes */
- case 0xe4:
- case 0xe5:
- *exitinfo |= IOIO_TYPE_IN;
- port = (u8)insn->immediate.value & 0xffff;
- break;
-
- /* OUT immediate opcodes */
- case 0xe6:
- case 0xe7:
- *exitinfo |= IOIO_TYPE_OUT;
- port = (u8)insn->immediate.value & 0xffff;
- break;
-
- /* IN register opcodes */
- case 0xec:
- case 0xed:
- *exitinfo |= IOIO_TYPE_IN;
- port = ctxt->regs->dx & 0xffff;
- break;
-
- /* OUT register opcodes */
- case 0xee:
- case 0xef:
- *exitinfo |= IOIO_TYPE_OUT;
- port = ctxt->regs->dx & 0xffff;
- break;
-
- default:
- return ES_DECODE_FAILED;
- }
-
- *exitinfo |= port << 16;
-
- switch (insn->opcode.bytes[0]) {
- case 0x6c:
- case 0x6e:
- case 0xe4:
- case 0xe6:
- case 0xec:
- case 0xee:
- /* Single byte opcodes */
- *exitinfo |= IOIO_DATA_8;
- size = 1;
- break;
- default:
- /* Length determined by instruction parsing */
- *exitinfo |= (insn->opnd_bytes == 2) ? IOIO_DATA_16
- : IOIO_DATA_32;
- size = (insn->opnd_bytes == 2) ? 2 : 4;
- }
-
- switch (insn->addr_bytes) {
- case 2:
- *exitinfo |= IOIO_ADDR_16;
- break;
- case 4:
- *exitinfo |= IOIO_ADDR_32;
- break;
- case 8:
- *exitinfo |= IOIO_ADDR_64;
- break;
- }
-
- if (insn_has_rep_prefix(insn))
- *exitinfo |= IOIO_REP;
-
- return vc_ioio_check(ctxt, (u16)port, size);
-}
-
-static enum es_result vc_handle_ioio(struct ghcb *ghcb, struct es_em_ctxt *ctxt)
-{
- struct pt_regs *regs = ctxt->regs;
- u64 exit_info_1, exit_info_2;
- enum es_result ret;
-
- ret = vc_ioio_exitinfo(ctxt, &exit_info_1);
- if (ret != ES_OK)
- return ret;
-
- if (exit_info_1 & IOIO_TYPE_STR) {
-
- /* (REP) INS/OUTS */
-
- bool df = ((regs->flags & X86_EFLAGS_DF) == X86_EFLAGS_DF);
- unsigned int io_bytes, exit_bytes;
- unsigned int ghcb_count, op_count;
- unsigned long es_base;
- u64 sw_scratch;
-
- /*
- * For the string variants with rep prefix the amount of in/out
- * operations per #VC exception is limited so that the kernel
- * has a chance to take interrupts and re-schedule while the
- * instruction is emulated.
- */
- io_bytes = (exit_info_1 >> 4) & 0x7;
- ghcb_count = sizeof(ghcb->shared_buffer) / io_bytes;
-
- op_count = (exit_info_1 & IOIO_REP) ? regs->cx : 1;
- exit_info_2 = min(op_count, ghcb_count);
- exit_bytes = exit_info_2 * io_bytes;
-
- es_base = insn_get_seg_base(ctxt->regs, INAT_SEG_REG_ES);
-
- /* Read bytes of OUTS into the shared buffer */
- if (!(exit_info_1 & IOIO_TYPE_IN)) {
- ret = vc_insn_string_read(ctxt,
- (void *)(es_base + regs->si),
- ghcb->shared_buffer, io_bytes,
- exit_info_2, df);
- if (ret)
- return ret;
- }
-
- /*
- * Issue an VMGEXIT to the HV to consume the bytes from the
- * shared buffer or to have it write them into the shared buffer
- * depending on the instruction: OUTS or INS.
- */
- sw_scratch = __pa(ghcb) + offsetof(struct ghcb, shared_buffer);
- ghcb_set_sw_scratch(ghcb, sw_scratch);
- ret = sev_es_ghcb_hv_call(ghcb, ctxt, SVM_EXIT_IOIO,
- exit_info_1, exit_info_2);
- if (ret != ES_OK)
- return ret;
-
- /* Read bytes from shared buffer into the guest's destination. */
- if (exit_info_1 & IOIO_TYPE_IN) {
- ret = vc_insn_string_write(ctxt,
- (void *)(es_base + regs->di),
- ghcb->shared_buffer, io_bytes,
- exit_info_2, df);
- if (ret)
- return ret;
-
- if (df)
- regs->di -= exit_bytes;
- else
- regs->di += exit_bytes;
- } else {
- if (df)
- regs->si -= exit_bytes;
- else
- regs->si += exit_bytes;
- }
-
- if (exit_info_1 & IOIO_REP)
- regs->cx -= exit_info_2;
-
- ret = regs->cx ? ES_RETRY : ES_OK;
-
- } else {
-
- /* IN/OUT into/from rAX */
-
- int bits = (exit_info_1 & 0x70) >> 1;
- u64 rax = 0;
-
- if (!(exit_info_1 & IOIO_TYPE_IN))
- rax = lower_bits(regs->ax, bits);
-
- ghcb_set_rax(ghcb, rax);
-
- ret = sev_es_ghcb_hv_call(ghcb, ctxt, SVM_EXIT_IOIO, exit_info_1, 0);
- if (ret != ES_OK)
- return ret;
-
- if (exit_info_1 & IOIO_TYPE_IN) {
- if (!ghcb_rax_is_valid(ghcb))
- return ES_VMM_ERROR;
- regs->ax = lower_bits(ghcb->save.rax, bits);
- }
- }
-
- return ret;
-}
-
-static int vc_handle_cpuid_snp(struct ghcb *ghcb, struct es_em_ctxt *ctxt)
-{
- struct pt_regs *regs = ctxt->regs;
- struct cpuid_leaf leaf;
- int ret;
-
- leaf.fn = regs->ax;
- leaf.subfn = regs->cx;
- ret = snp_cpuid(ghcb, ctxt, &leaf);
- if (!ret) {
- regs->ax = leaf.eax;
- regs->bx = leaf.ebx;
- regs->cx = leaf.ecx;
- regs->dx = leaf.edx;
- }
-
- return ret;
-}
-
-static enum es_result vc_handle_cpuid(struct ghcb *ghcb,
- struct es_em_ctxt *ctxt)
-{
- struct pt_regs *regs = ctxt->regs;
- u32 cr4 = native_read_cr4();
- enum es_result ret;
- int snp_cpuid_ret;
-
- snp_cpuid_ret = vc_handle_cpuid_snp(ghcb, ctxt);
- if (!snp_cpuid_ret)
- return ES_OK;
- if (snp_cpuid_ret != -EOPNOTSUPP)
- return ES_VMM_ERROR;
-
- ghcb_set_rax(ghcb, regs->ax);
- ghcb_set_rcx(ghcb, regs->cx);
-
- if (cr4 & X86_CR4_OSXSAVE)
- /* Safe to read xcr0 */
- ghcb_set_xcr0(ghcb, xgetbv(XCR_XFEATURE_ENABLED_MASK));
- else
- /* xgetbv will cause #GP - use reset value for xcr0 */
- ghcb_set_xcr0(ghcb, 1);
-
- ret = sev_es_ghcb_hv_call(ghcb, ctxt, SVM_EXIT_CPUID, 0, 0);
- if (ret != ES_OK)
- return ret;
-
- if (!(ghcb_rax_is_valid(ghcb) &&
- ghcb_rbx_is_valid(ghcb) &&
- ghcb_rcx_is_valid(ghcb) &&
- ghcb_rdx_is_valid(ghcb)))
- return ES_VMM_ERROR;
-
- regs->ax = ghcb->save.rax;
- regs->bx = ghcb->save.rbx;
- regs->cx = ghcb->save.rcx;
- regs->dx = ghcb->save.rdx;
-
- return ES_OK;
-}
-
-static enum es_result vc_handle_rdtsc(struct ghcb *ghcb,
- struct es_em_ctxt *ctxt,
- unsigned long exit_code)
-{
- bool rdtscp = (exit_code == SVM_EXIT_RDTSCP);
- enum es_result ret;
-
- /*
- * The hypervisor should not be intercepting RDTSC/RDTSCP when Secure
- * TSC is enabled. A #VC exception will be generated if the RDTSC/RDTSCP
- * instructions are being intercepted. If this should occur and Secure
- * TSC is enabled, guest execution should be terminated as the guest
- * cannot rely on the TSC value provided by the hypervisor.
- */
- if (sev_status & MSR_AMD64_SNP_SECURE_TSC)
- return ES_VMM_ERROR;
-
- ret = sev_es_ghcb_hv_call(ghcb, ctxt, exit_code, 0, 0);
- if (ret != ES_OK)
- return ret;
-
- if (!(ghcb_rax_is_valid(ghcb) && ghcb_rdx_is_valid(ghcb) &&
- (!rdtscp || ghcb_rcx_is_valid(ghcb))))
- return ES_VMM_ERROR;
-
- ctxt->regs->ax = ghcb->save.rax;
- ctxt->regs->dx = ghcb->save.rdx;
- if (rdtscp)
- ctxt->regs->cx = ghcb->save.rcx;
-
- return ES_OK;
-}
-
struct cc_setup_data {
struct setup_data header;
u32 cc_blob_address;
@@ -1238,97 +812,6 @@ static void __head pvalidate_4k_page(unsigned long vaddr, unsigned long paddr,
}
}
-static enum es_result vc_check_opcode_bytes(struct es_em_ctxt *ctxt,
- unsigned long exit_code)
-{
- unsigned int opcode = (unsigned int)ctxt->insn.opcode.value;
- u8 modrm = ctxt->insn.modrm.value;
-
- switch (exit_code) {
-
- case SVM_EXIT_IOIO:
- case SVM_EXIT_NPF:
- /* handled separately */
- return ES_OK;
-
- case SVM_EXIT_CPUID:
- if (opcode == 0xa20f)
- return ES_OK;
- break;
-
- case SVM_EXIT_INVD:
- if (opcode == 0x080f)
- return ES_OK;
- break;
-
- case SVM_EXIT_MONITOR:
- /* MONITOR and MONITORX instructions generate the same error code */
- if (opcode == 0x010f && (modrm == 0xc8 || modrm == 0xfa))
- return ES_OK;
- break;
-
- case SVM_EXIT_MWAIT:
- /* MWAIT and MWAITX instructions generate the same error code */
- if (opcode == 0x010f && (modrm == 0xc9 || modrm == 0xfb))
- return ES_OK;
- break;
-
- case SVM_EXIT_MSR:
- /* RDMSR */
- if (opcode == 0x320f ||
- /* WRMSR */
- opcode == 0x300f)
- return ES_OK;
- break;
-
- case SVM_EXIT_RDPMC:
- if (opcode == 0x330f)
- return ES_OK;
- break;
-
- case SVM_EXIT_RDTSC:
- if (opcode == 0x310f)
- return ES_OK;
- break;
-
- case SVM_EXIT_RDTSCP:
- if (opcode == 0x010f && modrm == 0xf9)
- return ES_OK;
- break;
-
- case SVM_EXIT_READ_DR7:
- if (opcode == 0x210f &&
- X86_MODRM_REG(ctxt->insn.modrm.value) == 7)
- return ES_OK;
- break;
-
- case SVM_EXIT_VMMCALL:
- if (opcode == 0x010f && modrm == 0xd9)
- return ES_OK;
-
- break;
-
- case SVM_EXIT_WRITE_DR7:
- if (opcode == 0x230f &&
- X86_MODRM_REG(ctxt->insn.modrm.value) == 7)
- return ES_OK;
- break;
-
- case SVM_EXIT_WBINVD:
- if (opcode == 0x90f)
- return ES_OK;
- break;
-
- default:
- break;
- }
-
- sev_printk(KERN_ERR "Wrong/unhandled opcode bytes: 0x%x, exit_code: 0x%lx, rIP: 0x%lx\n",
- opcode, exit_code, ctxt->regs->ip);
-
- return ES_UNSUPPORTED;
-}
-
/*
* Maintain the GPA of the SVSM Calling Area (CA) in order to utilize the SVSM
* services needed when not running in VMPL0.
diff --git a/arch/x86/boot/startup/sev-startup.c b/arch/x86/boot/startup/sev-startup.c
index f901ce9680e6..435853a55768 100644
--- a/arch/x86/boot/startup/sev-startup.c
+++ b/arch/x86/boot/startup/sev-startup.c
@@ -9,7 +9,6 @@
#define pr_fmt(fmt) "SEV: " fmt
-#include <linux/sched/debug.h> /* For show_regs() */
#include <linux/percpu-defs.h>
#include <linux/cc_platform.h>
#include <linux/printk.h>
@@ -112,315 +111,6 @@ noinstr struct ghcb *__sev_get_ghcb(struct ghcb_state *state)
return ghcb;
}
-static int vc_fetch_insn_kernel(struct es_em_ctxt *ctxt,
- unsigned char *buffer)
-{
- return copy_from_kernel_nofault(buffer, (unsigned char *)ctxt->regs->ip, MAX_INSN_SIZE);
-}
-
-static enum es_result __vc_decode_user_insn(struct es_em_ctxt *ctxt)
-{
- char buffer[MAX_INSN_SIZE];
- int insn_bytes;
-
- insn_bytes = insn_fetch_from_user_inatomic(ctxt->regs, buffer);
- if (insn_bytes == 0) {
- /* Nothing could be copied */
- ctxt->fi.vector = X86_TRAP_PF;
- ctxt->fi.error_code = X86_PF_INSTR | X86_PF_USER;
- ctxt->fi.cr2 = ctxt->regs->ip;
- return ES_EXCEPTION;
- } else if (insn_bytes == -EINVAL) {
- /* Effective RIP could not be calculated */
- ctxt->fi.vector = X86_TRAP_GP;
- ctxt->fi.error_code = 0;
- ctxt->fi.cr2 = 0;
- return ES_EXCEPTION;
- }
-
- if (!insn_decode_from_regs(&ctxt->insn, ctxt->regs, buffer, insn_bytes))
- return ES_DECODE_FAILED;
-
- if (ctxt->insn.immediate.got)
- return ES_OK;
- else
- return ES_DECODE_FAILED;
-}
-
-static enum es_result __vc_decode_kern_insn(struct es_em_ctxt *ctxt)
-{
- char buffer[MAX_INSN_SIZE];
- int res, ret;
-
- res = vc_fetch_insn_kernel(ctxt, buffer);
- if (res) {
- ctxt->fi.vector = X86_TRAP_PF;
- ctxt->fi.error_code = X86_PF_INSTR;
- ctxt->fi.cr2 = ctxt->regs->ip;
- return ES_EXCEPTION;
- }
-
- ret = insn_decode(&ctxt->insn, buffer, MAX_INSN_SIZE, INSN_MODE_64);
- if (ret < 0)
- return ES_DECODE_FAILED;
- else
- return ES_OK;
-}
-
-static enum es_result vc_decode_insn(struct es_em_ctxt *ctxt)
-{
- if (user_mode(ctxt->regs))
- return __vc_decode_user_insn(ctxt);
- else
- return __vc_decode_kern_insn(ctxt);
-}
-
-static enum es_result vc_write_mem(struct es_em_ctxt *ctxt,
- char *dst, char *buf, size_t size)
-{
- unsigned long error_code = X86_PF_PROT | X86_PF_WRITE;
-
- /*
- * This function uses __put_user() independent of whether kernel or user
- * memory is accessed. This works fine because __put_user() does no
- * sanity checks of the pointer being accessed. All that it does is
- * to report when the access failed.
- *
- * Also, this function runs in atomic context, so __put_user() is not
- * allowed to sleep. The page-fault handler detects that it is running
- * in atomic context and will not try to take mmap_sem and handle the
- * fault, so additional pagefault_enable()/disable() calls are not
- * needed.
- *
- * The access can't be done via copy_to_user() here because
- * vc_write_mem() must not use string instructions to access unsafe
- * memory. The reason is that MOVS is emulated by the #VC handler by
- * splitting the move up into a read and a write and taking a nested #VC
- * exception on whatever of them is the MMIO access. Using string
- * instructions here would cause infinite nesting.
- */
- switch (size) {
- case 1: {
- u8 d1;
- u8 __user *target = (u8 __user *)dst;
-
- memcpy(&d1, buf, 1);
- if (__put_user(d1, target))
- goto fault;
- break;
- }
- case 2: {
- u16 d2;
- u16 __user *target = (u16 __user *)dst;
-
- memcpy(&d2, buf, 2);
- if (__put_user(d2, target))
- goto fault;
- break;
- }
- case 4: {
- u32 d4;
- u32 __user *target = (u32 __user *)dst;
-
- memcpy(&d4, buf, 4);
- if (__put_user(d4, target))
- goto fault;
- break;
- }
- case 8: {
- u64 d8;
- u64 __user *target = (u64 __user *)dst;
-
- memcpy(&d8, buf, 8);
- if (__put_user(d8, target))
- goto fault;
- break;
- }
- default:
- WARN_ONCE(1, "%s: Invalid size: %zu\n", __func__, size);
- return ES_UNSUPPORTED;
- }
-
- return ES_OK;
-
-fault:
- if (user_mode(ctxt->regs))
- error_code |= X86_PF_USER;
-
- ctxt->fi.vector = X86_TRAP_PF;
- ctxt->fi.error_code = error_code;
- ctxt->fi.cr2 = (unsigned long)dst;
-
- return ES_EXCEPTION;
-}
-
-static enum es_result vc_read_mem(struct es_em_ctxt *ctxt,
- char *src, char *buf, size_t size)
-{
- unsigned long error_code = X86_PF_PROT;
-
- /*
- * This function uses __get_user() independent of whether kernel or user
- * memory is accessed. This works fine because __get_user() does no
- * sanity checks of the pointer being accessed. All that it does is
- * to report when the access failed.
- *
- * Also, this function runs in atomic context, so __get_user() is not
- * allowed to sleep. The page-fault handler detects that it is running
- * in atomic context and will not try to take mmap_sem and handle the
- * fault, so additional pagefault_enable()/disable() calls are not
- * needed.
- *
- * The access can't be done via copy_from_user() here because
- * vc_read_mem() must not use string instructions to access unsafe
- * memory. The reason is that MOVS is emulated by the #VC handler by
- * splitting the move up into a read and a write and taking a nested #VC
- * exception on whichever of them is the MMIO access. Using string
- * instructions here would cause infinite nesting.
- */
- switch (size) {
- case 1: {
- u8 d1;
- u8 __user *s = (u8 __user *)src;
-
- if (__get_user(d1, s))
- goto fault;
- memcpy(buf, &d1, 1);
- break;
- }
- case 2: {
- u16 d2;
- u16 __user *s = (u16 __user *)src;
-
- if (__get_user(d2, s))
- goto fault;
- memcpy(buf, &d2, 2);
- break;
- }
- case 4: {
- u32 d4;
- u32 __user *s = (u32 __user *)src;
-
- if (__get_user(d4, s))
- goto fault;
- memcpy(buf, &d4, 4);
- break;
- }
- case 8: {
- u64 d8;
- u64 __user *s = (u64 __user *)src;
-
- if (__get_user(d8, s))
- goto fault;
- memcpy(buf, &d8, 8);
- break;
- }
- default:
- WARN_ONCE(1, "%s: Invalid size: %zu\n", __func__, size);
- return ES_UNSUPPORTED;
- }
-
- return ES_OK;
-
-fault:
- if (user_mode(ctxt->regs))
- error_code |= X86_PF_USER;
-
- ctxt->fi.vector = X86_TRAP_PF;
- ctxt->fi.error_code = error_code;
- ctxt->fi.cr2 = (unsigned long)src;
-
- return ES_EXCEPTION;
-}
-
-static enum es_result vc_slow_virt_to_phys(struct ghcb *ghcb, struct es_em_ctxt *ctxt,
- unsigned long vaddr, phys_addr_t *paddr)
-{
- unsigned long va = (unsigned long)vaddr;
- unsigned int level;
- phys_addr_t pa;
- pgd_t *pgd;
- pte_t *pte;
-
- pgd = __va(read_cr3_pa());
- pgd = &pgd[pgd_index(va)];
- pte = lookup_address_in_pgd(pgd, va, &level);
- if (!pte) {
- ctxt->fi.vector = X86_TRAP_PF;
- ctxt->fi.cr2 = vaddr;
- ctxt->fi.error_code = 0;
-
- if (user_mode(ctxt->regs))
- ctxt->fi.error_code |= X86_PF_USER;
-
- return ES_EXCEPTION;
- }
-
- if (WARN_ON_ONCE(pte_val(*pte) & _PAGE_ENC))
- /* Emulated MMIO to/from encrypted memory not supported */
- return ES_UNSUPPORTED;
-
- pa = (phys_addr_t)pte_pfn(*pte) << PAGE_SHIFT;
- pa |= va & ~page_level_mask(level);
-
- *paddr = pa;
-
- return ES_OK;
-}
-
-static enum es_result vc_ioio_check(struct es_em_ctxt *ctxt, u16 port, size_t size)
-{
- BUG_ON(size > 4);
-
- if (user_mode(ctxt->regs)) {
- struct thread_struct *t = &current->thread;
- struct io_bitmap *iobm = t->io_bitmap;
- size_t idx;
-
- if (!iobm)
- goto fault;
-
- for (idx = port; idx < port + size; ++idx) {
- if (test_bit(idx, iobm->bitmap))
- goto fault;
- }
- }
-
- return ES_OK;
-
-fault:
- ctxt->fi.vector = X86_TRAP_GP;
- ctxt->fi.error_code = 0;
-
- return ES_EXCEPTION;
-}
-
-static __always_inline void vc_forward_exception(struct es_em_ctxt *ctxt)
-{
- long error_code = ctxt->fi.error_code;
- int trapnr = ctxt->fi.vector;
-
- ctxt->regs->orig_ax = ctxt->fi.error_code;
-
- switch (trapnr) {
- case X86_TRAP_GP:
- exc_general_protection(ctxt->regs, error_code);
- break;
- case X86_TRAP_UD:
- exc_invalid_op(ctxt->regs);
- break;
- case X86_TRAP_PF:
- write_cr2(ctxt->fi.cr2);
- exc_page_fault(ctxt->regs, error_code);
- break;
- case X86_TRAP_AC:
- exc_alignment_check(ctxt->regs, error_code);
- break;
- default:
- pr_emerg("Unsupported exception in #VC instruction emulation - can't continue\n");
- BUG();
- }
-}
-
/* Include code shared with pre-decompression boot stage */
#include "sev-shared.c"
@@ -563,718 +253,6 @@ void __head early_snp_set_memory_shared(unsigned long vaddr, unsigned long paddr
early_set_pages_state(vaddr, paddr, npages, SNP_PAGE_STATE_SHARED);
}
-/* Writes to the SVSM CAA MSR are ignored */
-static enum es_result __vc_handle_msr_caa(struct pt_regs *regs, bool write)
-{
- if (write)
- return ES_OK;
-
- regs->ax = lower_32_bits(this_cpu_read(svsm_caa_pa));
- regs->dx = upper_32_bits(this_cpu_read(svsm_caa_pa));
-
- return ES_OK;
-}
-
-/*
- * TSC related accesses should not exit to the hypervisor when a guest is
- * executing with Secure TSC enabled, so special handling is required for
- * accesses of MSR_IA32_TSC and MSR_AMD64_GUEST_TSC_FREQ.
- */
-static enum es_result __vc_handle_secure_tsc_msrs(struct pt_regs *regs, bool write)
-{
- u64 tsc;
-
- /*
- * GUEST_TSC_FREQ should not be intercepted when Secure TSC is enabled.
- * Terminate the SNP guest when the interception is enabled.
- */
- if (regs->cx == MSR_AMD64_GUEST_TSC_FREQ)
- return ES_VMM_ERROR;
-
- /*
- * Writes: Writing to MSR_IA32_TSC can cause subsequent reads of the TSC
- * to return undefined values, so ignore all writes.
- *
- * Reads: Reads of MSR_IA32_TSC should return the current TSC value, use
- * the value returned by rdtsc_ordered().
- */
- if (write) {
- WARN_ONCE(1, "TSC MSR writes are verboten!\n");
- return ES_OK;
- }
-
- tsc = rdtsc_ordered();
- regs->ax = lower_32_bits(tsc);
- regs->dx = upper_32_bits(tsc);
-
- return ES_OK;
-}
-
-static enum es_result vc_handle_msr(struct ghcb *ghcb, struct es_em_ctxt *ctxt)
-{
- struct pt_regs *regs = ctxt->regs;
- enum es_result ret;
- bool write;
-
- /* Is it a WRMSR? */
- write = ctxt->insn.opcode.bytes[1] == 0x30;
-
- switch (regs->cx) {
- case MSR_SVSM_CAA:
- return __vc_handle_msr_caa(regs, write);
- case MSR_IA32_TSC:
- case MSR_AMD64_GUEST_TSC_FREQ:
- if (sev_status & MSR_AMD64_SNP_SECURE_TSC)
- return __vc_handle_secure_tsc_msrs(regs, write);
- break;
- default:
- break;
- }
-
- ghcb_set_rcx(ghcb, regs->cx);
- if (write) {
- ghcb_set_rax(ghcb, regs->ax);
- ghcb_set_rdx(ghcb, regs->dx);
- }
-
- ret = sev_es_ghcb_hv_call(ghcb, ctxt, SVM_EXIT_MSR, write, 0);
-
- if ((ret == ES_OK) && !write) {
- regs->ax = ghcb->save.rax;
- regs->dx = ghcb->save.rdx;
- }
-
- return ret;
-}
-
-static void __init vc_early_forward_exception(struct es_em_ctxt *ctxt)
-{
- int trapnr = ctxt->fi.vector;
-
- if (trapnr == X86_TRAP_PF)
- native_write_cr2(ctxt->fi.cr2);
-
- ctxt->regs->orig_ax = ctxt->fi.error_code;
- do_early_exception(ctxt->regs, trapnr);
-}
-
-static long *vc_insn_get_rm(struct es_em_ctxt *ctxt)
-{
- long *reg_array;
- int offset;
-
- reg_array = (long *)ctxt->regs;
- offset = insn_get_modrm_rm_off(&ctxt->insn, ctxt->regs);
-
- if (offset < 0)
- return NULL;
-
- offset /= sizeof(long);
-
- return reg_array + offset;
-}
-
-static enum es_result vc_do_mmio(struct ghcb *ghcb, struct es_em_ctxt *ctxt,
- unsigned int bytes, bool read)
-{
- u64 exit_code, exit_info_1, exit_info_2;
- unsigned long ghcb_pa = __pa(ghcb);
- enum es_result res;
- phys_addr_t paddr;
- void __user *ref;
-
- ref = insn_get_addr_ref(&ctxt->insn, ctxt->regs);
- if (ref == (void __user *)-1L)
- return ES_UNSUPPORTED;
-
- exit_code = read ? SVM_VMGEXIT_MMIO_READ : SVM_VMGEXIT_MMIO_WRITE;
-
- res = vc_slow_virt_to_phys(ghcb, ctxt, (unsigned long)ref, &paddr);
- if (res != ES_OK) {
- if (res == ES_EXCEPTION && !read)
- ctxt->fi.error_code |= X86_PF_WRITE;
-
- return res;
- }
-
- exit_info_1 = paddr;
- /* Can never be greater than 8 */
- exit_info_2 = bytes;
-
- ghcb_set_sw_scratch(ghcb, ghcb_pa + offsetof(struct ghcb, shared_buffer));
-
- return sev_es_ghcb_hv_call(ghcb, ctxt, exit_code, exit_info_1, exit_info_2);
-}
-
-/*
- * The MOVS instruction has two memory operands, which raises the
- * problem that it is not known whether the access to the source or the
- * destination caused the #VC exception (and hence whether an MMIO read
- * or write operation needs to be emulated).
- *
- * Instead of playing games with walking page-tables and trying to guess
- * whether the source or destination is an MMIO range, split the move
- * into two operations, a read and a write with only one memory operand.
- * This will cause a nested #VC exception on the MMIO address which can
- * then be handled.
- *
- * This implementation has the benefit that it also supports MOVS where
- * source _and_ destination are MMIO regions.
- *
- * It will slow MOVS on MMIO down a lot, but in SEV-ES guests it is a
- * rare operation. If it turns out to be a performance problem the split
- * operations can be moved to memcpy_fromio() and memcpy_toio().
- */
-static enum es_result vc_handle_mmio_movs(struct es_em_ctxt *ctxt,
- unsigned int bytes)
-{
- unsigned long ds_base, es_base;
- unsigned char *src, *dst;
- unsigned char buffer[8];
- enum es_result ret;
- bool rep;
- int off;
-
- ds_base = insn_get_seg_base(ctxt->regs, INAT_SEG_REG_DS);
- es_base = insn_get_seg_base(ctxt->regs, INAT_SEG_REG_ES);
-
- if (ds_base == -1L || es_base == -1L) {
- ctxt->fi.vector = X86_TRAP_GP;
- ctxt->fi.error_code = 0;
- return ES_EXCEPTION;
- }
-
- src = ds_base + (unsigned char *)ctxt->regs->si;
- dst = es_base + (unsigned char *)ctxt->regs->di;
-
- ret = vc_read_mem(ctxt, src, buffer, bytes);
- if (ret != ES_OK)
- return ret;
-
- ret = vc_write_mem(ctxt, dst, buffer, bytes);
- if (ret != ES_OK)
- return ret;
-
- if (ctxt->regs->flags & X86_EFLAGS_DF)
- off = -bytes;
- else
- off = bytes;
-
- ctxt->regs->si += off;
- ctxt->regs->di += off;
-
- rep = insn_has_rep_prefix(&ctxt->insn);
- if (rep)
- ctxt->regs->cx -= 1;
-
- if (!rep || ctxt->regs->cx == 0)
- return ES_OK;
- else
- return ES_RETRY;
-}
-
-static enum es_result vc_handle_mmio(struct ghcb *ghcb, struct es_em_ctxt *ctxt)
-{
- struct insn *insn = &ctxt->insn;
- enum insn_mmio_type mmio;
- unsigned int bytes = 0;
- enum es_result ret;
- u8 sign_byte;
- long *reg_data;
-
- mmio = insn_decode_mmio(insn, &bytes);
- if (mmio == INSN_MMIO_DECODE_FAILED)
- return ES_DECODE_FAILED;
-
- if (mmio != INSN_MMIO_WRITE_IMM && mmio != INSN_MMIO_MOVS) {
- reg_data = insn_get_modrm_reg_ptr(insn, ctxt->regs);
- if (!reg_data)
- return ES_DECODE_FAILED;
- }
-
- if (user_mode(ctxt->regs))
- return ES_UNSUPPORTED;
-
- switch (mmio) {
- case INSN_MMIO_WRITE:
- memcpy(ghcb->shared_buffer, reg_data, bytes);
- ret = vc_do_mmio(ghcb, ctxt, bytes, false);
- break;
- case INSN_MMIO_WRITE_IMM:
- memcpy(ghcb->shared_buffer, insn->immediate1.bytes, bytes);
- ret = vc_do_mmio(ghcb, ctxt, bytes, false);
- break;
- case INSN_MMIO_READ:
- ret = vc_do_mmio(ghcb, ctxt, bytes, true);
- if (ret)
- break;
-
- /* Zero-extend for 32-bit operation */
- if (bytes == 4)
- *reg_data = 0;
-
- memcpy(reg_data, ghcb->shared_buffer, bytes);
- break;
- case INSN_MMIO_READ_ZERO_EXTEND:
- ret = vc_do_mmio(ghcb, ctxt, bytes, true);
- if (ret)
- break;
-
- /* Zero extend based on operand size */
- memset(reg_data, 0, insn->opnd_bytes);
- memcpy(reg_data, ghcb->shared_buffer, bytes);
- break;
- case INSN_MMIO_READ_SIGN_EXTEND:
- ret = vc_do_mmio(ghcb, ctxt, bytes, true);
- if (ret)
- break;
-
- if (bytes == 1) {
- u8 *val = (u8 *)ghcb->shared_buffer;
-
- sign_byte = (*val & 0x80) ? 0xff : 0x00;
- } else {
- u16 *val = (u16 *)ghcb->shared_buffer;
-
- sign_byte = (*val & 0x8000) ? 0xff : 0x00;
- }
-
- /* Sign extend based on operand size */
- memset(reg_data, sign_byte, insn->opnd_bytes);
- memcpy(reg_data, ghcb->shared_buffer, bytes);
- break;
- case INSN_MMIO_MOVS:
- ret = vc_handle_mmio_movs(ctxt, bytes);
- break;
- default:
- ret = ES_UNSUPPORTED;
- break;
- }
-
- return ret;
-}
-
-static enum es_result vc_handle_dr7_write(struct ghcb *ghcb,
- struct es_em_ctxt *ctxt)
-{
- struct sev_es_runtime_data *data = this_cpu_read(runtime_data);
- long val, *reg = vc_insn_get_rm(ctxt);
- enum es_result ret;
-
- if (sev_status & MSR_AMD64_SNP_DEBUG_SWAP)
- return ES_VMM_ERROR;
-
- if (!reg)
- return ES_DECODE_FAILED;
-
- val = *reg;
-
- /* Upper 32 bits must be written as zeroes */
- if (val >> 32) {
- ctxt->fi.vector = X86_TRAP_GP;
- ctxt->fi.error_code = 0;
- return ES_EXCEPTION;
- }
-
- /* Clear out other reserved bits and set bit 10 */
- val = (val & 0xffff23ffL) | BIT(10);
-
- /* Early non-zero writes to DR7 are not supported */
- if (!data && (val & ~DR7_RESET_VALUE))
- return ES_UNSUPPORTED;
-
- /* Using a value of 0 for ExitInfo1 means RAX holds the value */
- ghcb_set_rax(ghcb, val);
- ret = sev_es_ghcb_hv_call(ghcb, ctxt, SVM_EXIT_WRITE_DR7, 0, 0);
- if (ret != ES_OK)
- return ret;
-
- if (data)
- data->dr7 = val;
-
- return ES_OK;
-}
-
-static enum es_result vc_handle_dr7_read(struct ghcb *ghcb,
- struct es_em_ctxt *ctxt)
-{
- struct sev_es_runtime_data *data = this_cpu_read(runtime_data);
- long *reg = vc_insn_get_rm(ctxt);
-
- if (sev_status & MSR_AMD64_SNP_DEBUG_SWAP)
- return ES_VMM_ERROR;
-
- if (!reg)
- return ES_DECODE_FAILED;
-
- if (data)
- *reg = data->dr7;
- else
- *reg = DR7_RESET_VALUE;
-
- return ES_OK;
-}
-
-static enum es_result vc_handle_wbinvd(struct ghcb *ghcb,
- struct es_em_ctxt *ctxt)
-{
- return sev_es_ghcb_hv_call(ghcb, ctxt, SVM_EXIT_WBINVD, 0, 0);
-}
-
-static enum es_result vc_handle_rdpmc(struct ghcb *ghcb, struct es_em_ctxt *ctxt)
-{
- enum es_result ret;
-
- ghcb_set_rcx(ghcb, ctxt->regs->cx);
-
- ret = sev_es_ghcb_hv_call(ghcb, ctxt, SVM_EXIT_RDPMC, 0, 0);
- if (ret != ES_OK)
- return ret;
-
- if (!(ghcb_rax_is_valid(ghcb) && ghcb_rdx_is_valid(ghcb)))
- return ES_VMM_ERROR;
-
- ctxt->regs->ax = ghcb->save.rax;
- ctxt->regs->dx = ghcb->save.rdx;
-
- return ES_OK;
-}
-
-static enum es_result vc_handle_monitor(struct ghcb *ghcb,
- struct es_em_ctxt *ctxt)
-{
- /*
- * Treat it as a NOP and do not leak a physical address to the
- * hypervisor.
- */
- return ES_OK;
-}
-
-static enum es_result vc_handle_mwait(struct ghcb *ghcb,
- struct es_em_ctxt *ctxt)
-{
- /* Treat the same as MONITOR/MONITORX */
- return ES_OK;
-}
-
-static enum es_result vc_handle_vmmcall(struct ghcb *ghcb,
- struct es_em_ctxt *ctxt)
-{
- enum es_result ret;
-
- ghcb_set_rax(ghcb, ctxt->regs->ax);
- ghcb_set_cpl(ghcb, user_mode(ctxt->regs) ? 3 : 0);
-
- if (x86_platform.hyper.sev_es_hcall_prepare)
- x86_platform.hyper.sev_es_hcall_prepare(ghcb, ctxt->regs);
-
- ret = sev_es_ghcb_hv_call(ghcb, ctxt, SVM_EXIT_VMMCALL, 0, 0);
- if (ret != ES_OK)
- return ret;
-
- if (!ghcb_rax_is_valid(ghcb))
- return ES_VMM_ERROR;
-
- ctxt->regs->ax = ghcb->save.rax;
-
- /*
- * Call sev_es_hcall_finish() after regs->ax is already set.
- * This allows the hypervisor handler to overwrite it again if
- * necessary.
- */
- if (x86_platform.hyper.sev_es_hcall_finish &&
- !x86_platform.hyper.sev_es_hcall_finish(ghcb, ctxt->regs))
- return ES_VMM_ERROR;
-
- return ES_OK;
-}
-
-static enum es_result vc_handle_trap_ac(struct ghcb *ghcb,
- struct es_em_ctxt *ctxt)
-{
- /*
- * Calling exc_alignment_check() directly does not work, because it
- * enables IRQs and the GHCB is active. Forward the exception and call
- * it later from vc_forward_exception().
- */
- ctxt->fi.vector = X86_TRAP_AC;
- ctxt->fi.error_code = 0;
- return ES_EXCEPTION;
-}
-
-static enum es_result vc_handle_exitcode(struct es_em_ctxt *ctxt,
- struct ghcb *ghcb,
- unsigned long exit_code)
-{
- enum es_result result = vc_check_opcode_bytes(ctxt, exit_code);
-
- if (result != ES_OK)
- return result;
-
- switch (exit_code) {
- case SVM_EXIT_READ_DR7:
- result = vc_handle_dr7_read(ghcb, ctxt);
- break;
- case SVM_EXIT_WRITE_DR7:
- result = vc_handle_dr7_write(ghcb, ctxt);
- break;
- case SVM_EXIT_EXCP_BASE + X86_TRAP_AC:
- result = vc_handle_trap_ac(ghcb, ctxt);
- break;
- case SVM_EXIT_RDTSC:
- case SVM_EXIT_RDTSCP:
- result = vc_handle_rdtsc(ghcb, ctxt, exit_code);
- break;
- case SVM_EXIT_RDPMC:
- result = vc_handle_rdpmc(ghcb, ctxt);
- break;
- case SVM_EXIT_INVD:
- pr_err_ratelimited("#VC exception for INVD??? Seriously???\n");
- result = ES_UNSUPPORTED;
- break;
- case SVM_EXIT_CPUID:
- result = vc_handle_cpuid(ghcb, ctxt);
- break;
- case SVM_EXIT_IOIO:
- result = vc_handle_ioio(ghcb, ctxt);
- break;
- case SVM_EXIT_MSR:
- result = vc_handle_msr(ghcb, ctxt);
- break;
- case SVM_EXIT_VMMCALL:
- result = vc_handle_vmmcall(ghcb, ctxt);
- break;
- case SVM_EXIT_WBINVD:
- result = vc_handle_wbinvd(ghcb, ctxt);
- break;
- case SVM_EXIT_MONITOR:
- result = vc_handle_monitor(ghcb, ctxt);
- break;
- case SVM_EXIT_MWAIT:
- result = vc_handle_mwait(ghcb, ctxt);
- break;
- case SVM_EXIT_NPF:
- result = vc_handle_mmio(ghcb, ctxt);
- break;
- default:
- /*
- * Unexpected #VC exception
- */
- result = ES_UNSUPPORTED;
- }
-
- return result;
-}
-
-static __always_inline bool is_vc2_stack(unsigned long sp)
-{
- return (sp >= __this_cpu_ist_bottom_va(VC2) && sp < __this_cpu_ist_top_va(VC2));
-}
-
-static __always_inline bool vc_from_invalid_context(struct pt_regs *regs)
-{
- unsigned long sp, prev_sp;
-
- sp = (unsigned long)regs;
- prev_sp = regs->sp;
-
- /*
- * If the code was already executing on the VC2 stack when the #VC
- * happened, let it proceed to the normal handling routine. This way the
- * code executing on the VC2 stack can cause #VC exceptions to get handled.
- */
- return is_vc2_stack(sp) && !is_vc2_stack(prev_sp);
-}
-
-static bool vc_raw_handle_exception(struct pt_regs *regs, unsigned long error_code)
-{
- struct ghcb_state state;
- struct es_em_ctxt ctxt;
- enum es_result result;
- struct ghcb *ghcb;
- bool ret = true;
-
- ghcb = __sev_get_ghcb(&state);
-
- vc_ghcb_invalidate(ghcb);
- result = vc_init_em_ctxt(&ctxt, regs, error_code);
-
- if (result == ES_OK)
- result = vc_handle_exitcode(&ctxt, ghcb, error_code);
-
- __sev_put_ghcb(&state);
-
- /* Done - now check the result */
- switch (result) {
- case ES_OK:
- vc_finish_insn(&ctxt);
- break;
- case ES_UNSUPPORTED:
- pr_err_ratelimited("Unsupported exit-code 0x%02lx in #VC exception (IP: 0x%lx)\n",
- error_code, regs->ip);
- ret = false;
- break;
- case ES_VMM_ERROR:
- pr_err_ratelimited("Failure in communication with VMM (exit-code 0x%02lx IP: 0x%lx)\n",
- error_code, regs->ip);
- ret = false;
- break;
- case ES_DECODE_FAILED:
- pr_err_ratelimited("Failed to decode instruction (exit-code 0x%02lx IP: 0x%lx)\n",
- error_code, regs->ip);
- ret = false;
- break;
- case ES_EXCEPTION:
- vc_forward_exception(&ctxt);
- break;
- case ES_RETRY:
- /* Nothing to do */
- break;
- default:
- pr_emerg("Unknown result in %s():%d\n", __func__, result);
- /*
- * Emulating the instruction which caused the #VC exception
- * failed - can't continue so print debug information
- */
- BUG();
- }
-
- return ret;
-}
-
-static __always_inline bool vc_is_db(unsigned long error_code)
-{
- return error_code == SVM_EXIT_EXCP_BASE + X86_TRAP_DB;
-}
-
-/*
- * Runtime #VC exception handler when raised from kernel mode. Runs in NMI mode
- * and will panic when an error happens.
- */
-DEFINE_IDTENTRY_VC_KERNEL(exc_vmm_communication)
-{
- irqentry_state_t irq_state;
-
- /*
- * With the current implementation it is always possible to switch to a
- * safe stack because #VC exceptions only happen at known places, like
- * intercepted instructions or accesses to MMIO areas/IO ports. They can
- * also happen with code instrumentation when the hypervisor intercepts
- * #DB, but the critical paths are forbidden to be instrumented, so #DB
- * exceptions currently also only happen in safe places.
- *
- * But keep this here in case the noinstr annotations are violated due
- * to a bug elsewhere.
- */
- if (unlikely(vc_from_invalid_context(regs))) {
- instrumentation_begin();
- panic("Can't handle #VC exception from unsupported context\n");
- instrumentation_end();
- }
-
- /*
- * Handle #DB before calling into !noinstr code to avoid recursive #DB.
- */
- if (vc_is_db(error_code)) {
- exc_debug(regs);
- return;
- }
-
- irq_state = irqentry_nmi_enter(regs);
-
- instrumentation_begin();
-
- if (!vc_raw_handle_exception(regs, error_code)) {
- /* Show some debug info */
- show_regs(regs);
-
- /* Ask the hypervisor to terminate via sev_es_terminate() */
- sev_es_terminate(SEV_TERM_SET_GEN, GHCB_SEV_ES_GEN_REQ);
-
- /* If that fails and we get here - just panic */
- panic("Returned from Terminate-Request to Hypervisor\n");
- }
-
- instrumentation_end();
- irqentry_nmi_exit(regs, irq_state);
-}
-
-/*
- * Runtime #VC exception handler when raised from user mode. Runs in IRQ mode
- * and will kill the current task with SIGBUS when an error happens.
- */
-DEFINE_IDTENTRY_VC_USER(exc_vmm_communication)
-{
- /*
- * Handle #DB before calling into !noinstr code to avoid recursive #DB.
- */
- if (vc_is_db(error_code)) {
- noist_exc_debug(regs);
- return;
- }
-
- irqentry_enter_from_user_mode(regs);
- instrumentation_begin();
-
- if (!vc_raw_handle_exception(regs, error_code)) {
- /*
- * Do not kill the machine if user-space triggered the
- * exception. Send SIGBUS instead and let user-space deal with
- * it.
- */
- force_sig_fault(SIGBUS, BUS_OBJERR, (void __user *)0);
- }
-
- instrumentation_end();
- irqentry_exit_to_user_mode(regs);
-}
-
-bool __init handle_vc_boot_ghcb(struct pt_regs *regs)
-{
- unsigned long exit_code = regs->orig_ax;
- struct es_em_ctxt ctxt;
- enum es_result result;
-
- vc_ghcb_invalidate(boot_ghcb);
-
- result = vc_init_em_ctxt(&ctxt, regs, exit_code);
- if (result == ES_OK)
- result = vc_handle_exitcode(&ctxt, boot_ghcb, exit_code);
-
- /* Done - now check the result */
- switch (result) {
- case ES_OK:
- vc_finish_insn(&ctxt);
- break;
- case ES_UNSUPPORTED:
- early_printk("PANIC: Unsupported exit-code 0x%02lx in early #VC exception (IP: 0x%lx)\n",
- exit_code, regs->ip);
- goto fail;
- case ES_VMM_ERROR:
- early_printk("PANIC: Failure in communication with VMM (exit-code 0x%02lx IP: 0x%lx)\n",
- exit_code, regs->ip);
- goto fail;
- case ES_DECODE_FAILED:
- early_printk("PANIC: Failed to decode instruction (exit-code 0x%02lx IP: 0x%lx)\n",
- exit_code, regs->ip);
- goto fail;
- case ES_EXCEPTION:
- vc_early_forward_exception(&ctxt);
- break;
- case ES_RETRY:
- /* Nothing to do */
- break;
- default:
- BUG();
- }
-
- return true;
-
-fail:
- show_regs(regs);
-
- sev_es_terminate(SEV_TERM_SET_GEN, GHCB_SEV_ES_GEN_REQ);
-}
-
/*
* Initial set up of SNP relies on information provided by the
* Confidential Computing blob, which can be passed to the kernel
diff --git a/arch/x86/coco/sev/Makefile b/arch/x86/coco/sev/Makefile
index 2919dcfc4b02..db3255b979bd 100644
--- a/arch/x86/coco/sev/Makefile
+++ b/arch/x86/coco/sev/Makefile
@@ -1,6 +1,6 @@
# SPDX-License-Identifier: GPL-2.0
-obj-y += core.o sev-nmi.o
+obj-y += core.o sev-nmi.o vc-handle.o
# Clang 14 and older may fail to respect __no_sanitize_undefined when inlining
UBSAN_SANITIZE_sev-nmi.o := n
diff --git a/arch/x86/coco/sev/vc-handle.c b/arch/x86/coco/sev/vc-handle.c
new file mode 100644
index 000000000000..b4895c648024
--- /dev/null
+++ b/arch/x86/coco/sev/vc-handle.c
@@ -0,0 +1,1061 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * AMD Memory Encryption Support
+ *
+ * Copyright (C) 2019 SUSE
+ *
+ * Author: Joerg Roedel <jroedel@suse.de>
+ */
+
+#define pr_fmt(fmt) "SEV: " fmt
+
+#include <linux/sched/debug.h> /* For show_regs() */
+#include <linux/cc_platform.h>
+#include <linux/printk.h>
+#include <linux/mm_types.h>
+#include <linux/kernel.h>
+#include <linux/mm.h>
+#include <linux/io.h>
+#include <linux/psp-sev.h>
+#include <uapi/linux/sev-guest.h>
+
+#include <asm/init.h>
+#include <asm/stacktrace.h>
+#include <asm/sev.h>
+#include <asm/sev-internal.h>
+#include <asm/insn-eval.h>
+#include <asm/fpu/xcr.h>
+#include <asm/processor.h>
+#include <asm/setup.h>
+#include <asm/traps.h>
+#include <asm/svm.h>
+#include <asm/smp.h>
+#include <asm/cpu.h>
+#include <asm/apic.h>
+#include <asm/cpuid.h>
+
+static enum es_result vc_slow_virt_to_phys(struct ghcb *ghcb, struct es_em_ctxt *ctxt,
+ unsigned long vaddr, phys_addr_t *paddr)
+{
+ unsigned long va = (unsigned long)vaddr;
+ unsigned int level;
+ phys_addr_t pa;
+ pgd_t *pgd;
+ pte_t *pte;
+
+ pgd = __va(read_cr3_pa());
+ pgd = &pgd[pgd_index(va)];
+ pte = lookup_address_in_pgd(pgd, va, &level);
+ if (!pte) {
+ ctxt->fi.vector = X86_TRAP_PF;
+ ctxt->fi.cr2 = vaddr;
+ ctxt->fi.error_code = 0;
+
+ if (user_mode(ctxt->regs))
+ ctxt->fi.error_code |= X86_PF_USER;
+
+ return ES_EXCEPTION;
+ }
+
+ if (WARN_ON_ONCE(pte_val(*pte) & _PAGE_ENC))
+ /* Emulated MMIO to/from encrypted memory not supported */
+ return ES_UNSUPPORTED;
+
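+ /*
+ * Compose the physical address from the translated PFN plus the offset
+ * of the VA inside the (possibly large) page: page_level_mask() keeps
+ * the page-aligned bits, so e.g. for a 2M mapping the low 21 bits of
+ * @vaddr carry over into @paddr unchanged.
+ */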
+ pa = (phys_addr_t)pte_pfn(*pte) << PAGE_SHIFT;
+ pa |= va & ~page_level_mask(level);
+
+ *paddr = pa;
+
+ return ES_OK;
+}
+
+static enum es_result vc_ioio_check(struct es_em_ctxt *ctxt, u16 port, size_t size)
+{
+ BUG_ON(size > 4);
+
+ if (user_mode(ctxt->regs)) {
+ struct thread_struct *t = &current->thread;
+ struct io_bitmap *iobm = t->io_bitmap;
+ size_t idx;
+
+ if (!iobm)
+ goto fault;
+
+ for (idx = port; idx < port + size; ++idx) {
+ if (test_bit(idx, iobm->bitmap))
+ goto fault;
+ }
+ }
+
+ return ES_OK;
+
+fault:
+ ctxt->fi.vector = X86_TRAP_GP;
+ ctxt->fi.error_code = 0;
+
+ return ES_EXCEPTION;
+}
+
+void vc_forward_exception(struct es_em_ctxt *ctxt)
+{
+ long error_code = ctxt->fi.error_code;
+ int trapnr = ctxt->fi.vector;
+
+ ctxt->regs->orig_ax = ctxt->fi.error_code;
+
+ switch (trapnr) {
+ case X86_TRAP_GP:
+ exc_general_protection(ctxt->regs, error_code);
+ break;
+ case X86_TRAP_UD:
+ exc_invalid_op(ctxt->regs);
+ break;
+ case X86_TRAP_PF:
+ write_cr2(ctxt->fi.cr2);
+ exc_page_fault(ctxt->regs, error_code);
+ break;
+ case X86_TRAP_AC:
+ exc_alignment_check(ctxt->regs, error_code);
+ break;
+ default:
+ pr_emerg("Unsupported exception in #VC instruction emulation - can't continue\n");
+ BUG();
+ }
+}
+
+static int vc_fetch_insn_kernel(struct es_em_ctxt *ctxt,
+ unsigned char *buffer)
+{
+ return copy_from_kernel_nofault(buffer, (unsigned char *)ctxt->regs->ip, MAX_INSN_SIZE);
+}
+
+static enum es_result __vc_decode_user_insn(struct es_em_ctxt *ctxt)
+{
+ char buffer[MAX_INSN_SIZE];
+ int insn_bytes;
+
+ insn_bytes = insn_fetch_from_user_inatomic(ctxt->regs, buffer);
+ if (insn_bytes == 0) {
+ /* Nothing could be copied */
+ ctxt->fi.vector = X86_TRAP_PF;
+ ctxt->fi.error_code = X86_PF_INSTR | X86_PF_USER;
+ ctxt->fi.cr2 = ctxt->regs->ip;
+ return ES_EXCEPTION;
+ } else if (insn_bytes == -EINVAL) {
+ /* Effective RIP could not be calculated */
+ ctxt->fi.vector = X86_TRAP_GP;
+ ctxt->fi.error_code = 0;
+ ctxt->fi.cr2 = 0;
+ return ES_EXCEPTION;
+ }
+
+ if (!insn_decode_from_regs(&ctxt->insn, ctxt->regs, buffer, insn_bytes))
+ return ES_DECODE_FAILED;
+
+ if (ctxt->insn.immediate.got)
+ return ES_OK;
+ else
+ return ES_DECODE_FAILED;
+}
+
+static enum es_result __vc_decode_kern_insn(struct es_em_ctxt *ctxt)
+{
+ char buffer[MAX_INSN_SIZE];
+ int res, ret;
+
+ res = vc_fetch_insn_kernel(ctxt, buffer);
+ if (res) {
+ ctxt->fi.vector = X86_TRAP_PF;
+ ctxt->fi.error_code = X86_PF_INSTR;
+ ctxt->fi.cr2 = ctxt->regs->ip;
+ return ES_EXCEPTION;
+ }
+
+ ret = insn_decode(&ctxt->insn, buffer, MAX_INSN_SIZE, INSN_MODE_64);
+ if (ret < 0)
+ return ES_DECODE_FAILED;
+ else
+ return ES_OK;
+}
+
+static enum es_result vc_decode_insn(struct es_em_ctxt *ctxt)
+{
+ if (user_mode(ctxt->regs))
+ return __vc_decode_user_insn(ctxt);
+ else
+ return __vc_decode_kern_insn(ctxt);
+}
+
+static enum es_result vc_write_mem(struct es_em_ctxt *ctxt,
+ char *dst, char *buf, size_t size)
+{
+ unsigned long error_code = X86_PF_PROT | X86_PF_WRITE;
+
+ /*
+ * This function uses __put_user() independent of whether kernel or user
+ * memory is accessed. This works fine because __put_user() does no
+ * sanity checks of the pointer being accessed. All that it does is
+ * to report when the access failed.
+ *
+ * Also, this function runs in atomic context, so __put_user() is not
+ * allowed to sleep. The page-fault handler detects that it is running
+ * in atomic context and will not try to take mmap_sem and handle the
+ * fault, so additional pagefault_enable()/disable() calls are not
+ * needed.
+ *
+ * The access can't be done via copy_to_user() here because
+ * vc_write_mem() must not use string instructions to access unsafe
+ * memory. The reason is that MOVS is emulated by the #VC handler by
+ * splitting the move up into a read and a write and taking a nested #VC
+ * exception on whichever of them is the MMIO access. Using string
+ * instructions here would cause infinite nesting.
+ */
+ switch (size) {
+ case 1: {
+ u8 d1;
+ u8 __user *target = (u8 __user *)dst;
+
+ memcpy(&d1, buf, 1);
+ if (__put_user(d1, target))
+ goto fault;
+ break;
+ }
+ case 2: {
+ u16 d2;
+ u16 __user *target = (u16 __user *)dst;
+
+ memcpy(&d2, buf, 2);
+ if (__put_user(d2, target))
+ goto fault;
+ break;
+ }
+ case 4: {
+ u32 d4;
+ u32 __user *target = (u32 __user *)dst;
+
+ memcpy(&d4, buf, 4);
+ if (__put_user(d4, target))
+ goto fault;
+ break;
+ }
+ case 8: {
+ u64 d8;
+ u64 __user *target = (u64 __user *)dst;
+
+ memcpy(&d8, buf, 8);
+ if (__put_user(d8, target))
+ goto fault;
+ break;
+ }
+ default:
+ WARN_ONCE(1, "%s: Invalid size: %zu\n", __func__, size);
+ return ES_UNSUPPORTED;
+ }
+
+ return ES_OK;
+
+fault:
+ if (user_mode(ctxt->regs))
+ error_code |= X86_PF_USER;
+
+ ctxt->fi.vector = X86_TRAP_PF;
+ ctxt->fi.error_code = error_code;
+ ctxt->fi.cr2 = (unsigned long)dst;
+
+ return ES_EXCEPTION;
+}
+
+static enum es_result vc_read_mem(struct es_em_ctxt *ctxt,
+ char *src, char *buf, size_t size)
+{
+ unsigned long error_code = X86_PF_PROT;
+
+ /*
+ * This function uses __get_user() independent of whether kernel or user
+ * memory is accessed. This works fine because __get_user() does no
+ * sanity checks of the pointer being accessed. All that it does is
+ * to report when the access failed.
+ *
+ * Also, this function runs in atomic context, so __get_user() is not
+ * allowed to sleep. The page-fault handler detects that it is running
+ * in atomic context and will not try to take mmap_sem and handle the
+ * fault, so additional pagefault_enable()/disable() calls are not
+ * needed.
+ *
+ * The access can't be done via copy_from_user() here because
+ * vc_read_mem() must not use string instructions to access unsafe
+ * memory. The reason is that MOVS is emulated by the #VC handler by
+ * splitting the move up into a read and a write and taking a nested #VC
+ * exception on whichever of them is the MMIO access. Using string
+ * instructions here would cause infinite nesting.
+ */
+ switch (size) {
+ case 1: {
+ u8 d1;
+ u8 __user *s = (u8 __user *)src;
+
+ if (__get_user(d1, s))
+ goto fault;
+ memcpy(buf, &d1, 1);
+ break;
+ }
+ case 2: {
+ u16 d2;
+ u16 __user *s = (u16 __user *)src;
+
+ if (__get_user(d2, s))
+ goto fault;
+ memcpy(buf, &d2, 2);
+ break;
+ }
+ case 4: {
+ u32 d4;
+ u32 __user *s = (u32 __user *)src;
+
+ if (__get_user(d4, s))
+ goto fault;
+ memcpy(buf, &d4, 4);
+ break;
+ }
+ case 8: {
+ u64 d8;
+ u64 __user *s = (u64 __user *)src;
+
+ if (__get_user(d8, s))
+ goto fault;
+ memcpy(buf, &d8, 8);
+ break;
+ }
+ default:
+ WARN_ONCE(1, "%s: Invalid size: %zu\n", __func__, size);
+ return ES_UNSUPPORTED;
+ }
+
+ return ES_OK;
+
+fault:
+ if (user_mode(ctxt->regs))
+ error_code |= X86_PF_USER;
+
+ ctxt->fi.vector = X86_TRAP_PF;
+ ctxt->fi.error_code = error_code;
+ ctxt->fi.cr2 = (unsigned long)src;
+
+ return ES_EXCEPTION;
+}
+
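+/*
+ * vc-shared.c below is compiled in more than one environment; this
+ * sev_printk() definition routes its diagnostics through printk() in the
+ * core kernel, while other users of the shared code may define it
+ * differently.
+ */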
+#define sev_printk(fmt, ...) printk(fmt, ##__VA_ARGS__)
+
+#include "vc-shared.c"
+
+/* Writes to the SVSM CAA MSR are ignored */
+static enum es_result __vc_handle_msr_caa(struct pt_regs *regs, bool write)
+{
+ if (write)
+ return ES_OK;
+
+ regs->ax = lower_32_bits(this_cpu_read(svsm_caa_pa));
+ regs->dx = upper_32_bits(this_cpu_read(svsm_caa_pa));
+
+ return ES_OK;
+}
+
+/*
+ * TSC related accesses should not exit to the hypervisor when a guest is
+ * executing with Secure TSC enabled, so special handling is required for
+ * accesses of MSR_IA32_TSC and MSR_AMD64_GUEST_TSC_FREQ.
+ */
+static enum es_result __vc_handle_secure_tsc_msrs(struct pt_regs *regs, bool write)
+{
+ u64 tsc;
+
+ /*
+ * GUEST_TSC_FREQ should not be intercepted when Secure TSC is enabled.
+ * Terminate the SNP guest when the interception is enabled.
+ */
+ if (regs->cx == MSR_AMD64_GUEST_TSC_FREQ)
+ return ES_VMM_ERROR;
+
+ /*
+ * Writes: Writing to MSR_IA32_TSC can cause subsequent reads of the TSC
+ * to return undefined values, so ignore all writes.
+ *
+ * Reads: Reads of MSR_IA32_TSC should return the current TSC value, use
+ * the value returned by rdtsc_ordered().
+ */
+ if (write) {
+ WARN_ONCE(1, "TSC MSR writes are verboten!\n");
+ return ES_OK;
+ }
+
+ tsc = rdtsc_ordered();
+ regs->ax = lower_32_bits(tsc);
+ regs->dx = upper_32_bits(tsc);
+
+ return ES_OK;
+}
+
+static enum es_result vc_handle_msr(struct ghcb *ghcb, struct es_em_ctxt *ctxt)
+{
+ struct pt_regs *regs = ctxt->regs;
+ enum es_result ret;
+ bool write;
+
+ /* Is it a WRMSR? */
+ write = ctxt->insn.opcode.bytes[1] == 0x30;
+
+ switch (regs->cx) {
+ case MSR_SVSM_CAA:
+ return __vc_handle_msr_caa(regs, write);
+ case MSR_IA32_TSC:
+ case MSR_AMD64_GUEST_TSC_FREQ:
+ if (sev_status & MSR_AMD64_SNP_SECURE_TSC)
+ return __vc_handle_secure_tsc_msrs(regs, write);
+ break;
+ default:
+ break;
+ }
+
+ ghcb_set_rcx(ghcb, regs->cx);
+ if (write) {
+ ghcb_set_rax(ghcb, regs->ax);
+ ghcb_set_rdx(ghcb, regs->dx);
+ }
+
+ ret = sev_es_ghcb_hv_call(ghcb, ctxt, SVM_EXIT_MSR, write, 0);
+
+ if ((ret == ES_OK) && !write) {
+ regs->ax = ghcb->save.rax;
+ regs->dx = ghcb->save.rdx;
+ }
+
+ return ret;
+}
+
+static void __init vc_early_forward_exception(struct es_em_ctxt *ctxt)
+{
+ int trapnr = ctxt->fi.vector;
+
+ if (trapnr == X86_TRAP_PF)
+ native_write_cr2(ctxt->fi.cr2);
+
+ ctxt->regs->orig_ax = ctxt->fi.error_code;
+ do_early_exception(ctxt->regs, trapnr);
+}
+
+static long *vc_insn_get_rm(struct es_em_ctxt *ctxt)
+{
+ long *reg_array;
+ int offset;
+
+ reg_array = (long *)ctxt->regs;
+ offset = insn_get_modrm_rm_off(&ctxt->insn, ctxt->regs);
+
+ if (offset < 0)
+ return NULL;
+
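+ /*
+ * insn_get_modrm_rm_off() returns a byte offset into struct pt_regs;
+ * convert it into an index into the register array.
+ */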
+ offset /= sizeof(long);
+
+ return reg_array + offset;
+}
+
+static enum es_result vc_do_mmio(struct ghcb *ghcb, struct es_em_ctxt *ctxt,
+ unsigned int bytes, bool read)
+{
+ u64 exit_code, exit_info_1, exit_info_2;
+ unsigned long ghcb_pa = __pa(ghcb);
+ enum es_result res;
+ phys_addr_t paddr;
+ void __user *ref;
+
+ ref = insn_get_addr_ref(&ctxt->insn, ctxt->regs);
+ if (ref == (void __user *)-1L)
+ return ES_UNSUPPORTED;
+
+ exit_code = read ? SVM_VMGEXIT_MMIO_READ : SVM_VMGEXIT_MMIO_WRITE;
+
+ res = vc_slow_virt_to_phys(ghcb, ctxt, (unsigned long)ref, &paddr);
+ if (res != ES_OK) {
+ if (res == ES_EXCEPTION && !read)
+ ctxt->fi.error_code |= X86_PF_WRITE;
+
+ return res;
+ }
+
+ exit_info_1 = paddr;
+ /* Can never be greater than 8 */
+ exit_info_2 = bytes;
+
+ ghcb_set_sw_scratch(ghcb, ghcb_pa + offsetof(struct ghcb, shared_buffer));
+
+ return sev_es_ghcb_hv_call(ghcb, ctxt, exit_code, exit_info_1, exit_info_2);
+}
+
+/*
+ * The MOVS instruction has two memory operands, which raises the
+ * problem that it is not known whether the access to the source or the
+ * destination caused the #VC exception (and hence whether an MMIO read
+ * or write operation needs to be emulated).
+ *
+ * Instead of playing games with walking page-tables and trying to guess
+ * whether the source or destination is an MMIO range, split the move
+ * into two operations, a read and a write with only one memory operand.
+ * This will cause a nested #VC exception on the MMIO address which can
+ * then be handled.
+ *
+ * This implementation has the benefit that it also supports MOVS where
+ * source _and_ destination are MMIO regions.
+ *
+ * It will slow MOVS on MMIO down a lot, but in SEV-ES guests it is a
+ * rare operation. If it turns out to be a performance problem the split
+ * operations can be moved to memcpy_fromio() and memcpy_toio().
+ */
+static enum es_result vc_handle_mmio_movs(struct es_em_ctxt *ctxt,
+ unsigned int bytes)
+{
+ unsigned long ds_base, es_base;
+ unsigned char *src, *dst;
+ unsigned char buffer[8];
+ enum es_result ret;
+ bool rep;
+ int off;
+
+ ds_base = insn_get_seg_base(ctxt->regs, INAT_SEG_REG_DS);
+ es_base = insn_get_seg_base(ctxt->regs, INAT_SEG_REG_ES);
+
+ if (ds_base == -1L || es_base == -1L) {
+ ctxt->fi.vector = X86_TRAP_GP;
+ ctxt->fi.error_code = 0;
+ return ES_EXCEPTION;
+ }
+
+ src = ds_base + (unsigned char *)ctxt->regs->si;
+ dst = es_base + (unsigned char *)ctxt->regs->di;
+
+ ret = vc_read_mem(ctxt, src, buffer, bytes);
+ if (ret != ES_OK)
+ return ret;
+
+ ret = vc_write_mem(ctxt, dst, buffer, bytes);
+ if (ret != ES_OK)
+ return ret;
+
+ if (ctxt->regs->flags & X86_EFLAGS_DF)
+ off = -bytes;
+ else
+ off = bytes;
+
+ ctxt->regs->si += off;
+ ctxt->regs->di += off;
+
+ rep = insn_has_rep_prefix(&ctxt->insn);
+ if (rep)
+ ctxt->regs->cx -= 1;
+
+ if (!rep || ctxt->regs->cx == 0)
+ return ES_OK;
+ else
+ return ES_RETRY;
+}
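+
+/*
+ * Illustration of the split emulation above: a REP MOVSB with RCX == 3 and
+ * an MMIO destination raises three #VC exceptions on the MOVS. Each one
+ * moves a single element via vc_read_mem()/vc_write_mem() (taking a nested
+ * #VC on the MMIO side), advances RSI/RDI, decrements RCX and returns
+ * ES_RETRY until RCX reaches zero, at which point ES_OK lets
+ * vc_finish_insn() step past the instruction.
+ */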
+
+static enum es_result vc_handle_mmio(struct ghcb *ghcb, struct es_em_ctxt *ctxt)
+{
+ struct insn *insn = &ctxt->insn;
+ enum insn_mmio_type mmio;
+ unsigned int bytes = 0;
+ enum es_result ret;
+ u8 sign_byte;
+ long *reg_data;
+
+ mmio = insn_decode_mmio(insn, &bytes);
+ if (mmio == INSN_MMIO_DECODE_FAILED)
+ return ES_DECODE_FAILED;
+
+ if (mmio != INSN_MMIO_WRITE_IMM && mmio != INSN_MMIO_MOVS) {
+ reg_data = insn_get_modrm_reg_ptr(insn, ctxt->regs);
+ if (!reg_data)
+ return ES_DECODE_FAILED;
+ }
+
+ if (user_mode(ctxt->regs))
+ return ES_UNSUPPORTED;
+
+ switch (mmio) {
+ case INSN_MMIO_WRITE:
+ memcpy(ghcb->shared_buffer, reg_data, bytes);
+ ret = vc_do_mmio(ghcb, ctxt, bytes, false);
+ break;
+ case INSN_MMIO_WRITE_IMM:
+ memcpy(ghcb->shared_buffer, insn->immediate1.bytes, bytes);
+ ret = vc_do_mmio(ghcb, ctxt, bytes, false);
+ break;
+ case INSN_MMIO_READ:
+ ret = vc_do_mmio(ghcb, ctxt, bytes, true);
+ if (ret)
+ break;
+
+ /* Zero-extend for 32-bit operation */
+ if (bytes == 4)
+ *reg_data = 0;
+
+ memcpy(reg_data, ghcb->shared_buffer, bytes);
+ break;
+ case INSN_MMIO_READ_ZERO_EXTEND:
+ ret = vc_do_mmio(ghcb, ctxt, bytes, true);
+ if (ret)
+ break;
+
+ /* Zero extend based on operand size */
+ memset(reg_data, 0, insn->opnd_bytes);
+ memcpy(reg_data, ghcb->shared_buffer, bytes);
+ break;
+ case INSN_MMIO_READ_SIGN_EXTEND:
+ ret = vc_do_mmio(ghcb, ctxt, bytes, true);
+ if (ret)
+ break;
+
+ if (bytes == 1) {
+ u8 *val = (u8 *)ghcb->shared_buffer;
+
+ sign_byte = (*val & 0x80) ? 0xff : 0x00;
+ } else {
+ u16 *val = (u16 *)ghcb->shared_buffer;
+
+ sign_byte = (*val & 0x8000) ? 0xff : 0x00;
+ }
+
+ /* Sign extend based on operand size */
+ memset(reg_data, sign_byte, insn->opnd_bytes);
+ memcpy(reg_data, ghcb->shared_buffer, bytes);
+ break;
+ case INSN_MMIO_MOVS:
+ ret = vc_handle_mmio_movs(ctxt, bytes);
+ break;
+ default:
+ ret = ES_UNSUPPORTED;
+ break;
+ }
+
+ return ret;
+}
+
+static enum es_result vc_handle_dr7_write(struct ghcb *ghcb,
+ struct es_em_ctxt *ctxt)
+{
+ struct sev_es_runtime_data *data = this_cpu_read(runtime_data);
+ long val, *reg = vc_insn_get_rm(ctxt);
+ enum es_result ret;
+
+ if (sev_status & MSR_AMD64_SNP_DEBUG_SWAP)
+ return ES_VMM_ERROR;
+
+ if (!reg)
+ return ES_DECODE_FAILED;
+
+ val = *reg;
+
+ /* Upper 32 bits must be written as zeroes */
+ if (val >> 32) {
+ ctxt->fi.vector = X86_TRAP_GP;
+ ctxt->fi.error_code = 0;
+ return ES_EXCEPTION;
+ }
+
+ /* Clear out other reserved bits and set bit 10 */
+ val = (val & 0xffff23ffL) | BIT(10);
+
+ /* Early non-zero writes to DR7 are not supported */
+ if (!data && (val & ~DR7_RESET_VALUE))
+ return ES_UNSUPPORTED;
+
+ /* Using a value of 0 for ExitInfo1 means RAX holds the value */
+ ghcb_set_rax(ghcb, val);
+ ret = sev_es_ghcb_hv_call(ghcb, ctxt, SVM_EXIT_WRITE_DR7, 0, 0);
+ if (ret != ES_OK)
+ return ret;
+
+ if (data)
+ data->dr7 = val;
+
+ return ES_OK;
+}
+
+static enum es_result vc_handle_dr7_read(struct ghcb *ghcb,
+ struct es_em_ctxt *ctxt)
+{
+ struct sev_es_runtime_data *data = this_cpu_read(runtime_data);
+ long *reg = vc_insn_get_rm(ctxt);
+
+ if (sev_status & MSR_AMD64_SNP_DEBUG_SWAP)
+ return ES_VMM_ERROR;
+
+ if (!reg)
+ return ES_DECODE_FAILED;
+
+ if (data)
+ *reg = data->dr7;
+ else
+ *reg = DR7_RESET_VALUE;
+
+ return ES_OK;
+}
+
+static enum es_result vc_handle_wbinvd(struct ghcb *ghcb,
+ struct es_em_ctxt *ctxt)
+{
+ return sev_es_ghcb_hv_call(ghcb, ctxt, SVM_EXIT_WBINVD, 0, 0);
+}
+
+static enum es_result vc_handle_rdpmc(struct ghcb *ghcb, struct es_em_ctxt *ctxt)
+{
+ enum es_result ret;
+
+ ghcb_set_rcx(ghcb, ctxt->regs->cx);
+
+ ret = sev_es_ghcb_hv_call(ghcb, ctxt, SVM_EXIT_RDPMC, 0, 0);
+ if (ret != ES_OK)
+ return ret;
+
+ if (!(ghcb_rax_is_valid(ghcb) && ghcb_rdx_is_valid(ghcb)))
+ return ES_VMM_ERROR;
+
+ ctxt->regs->ax = ghcb->save.rax;
+ ctxt->regs->dx = ghcb->save.rdx;
+
+ return ES_OK;
+}
+
+static enum es_result vc_handle_monitor(struct ghcb *ghcb,
+ struct es_em_ctxt *ctxt)
+{
+ /*
+ * Treat it as a NOP and do not leak a physical address to the
+ * hypervisor.
+ */
+ return ES_OK;
+}
+
+static enum es_result vc_handle_mwait(struct ghcb *ghcb,
+ struct es_em_ctxt *ctxt)
+{
+ /* Treat the same as MONITOR/MONITORX */
+ return ES_OK;
+}
+
+static enum es_result vc_handle_vmmcall(struct ghcb *ghcb,
+ struct es_em_ctxt *ctxt)
+{
+ enum es_result ret;
+
+ ghcb_set_rax(ghcb, ctxt->regs->ax);
+ ghcb_set_cpl(ghcb, user_mode(ctxt->regs) ? 3 : 0);
+
+ if (x86_platform.hyper.sev_es_hcall_prepare)
+ x86_platform.hyper.sev_es_hcall_prepare(ghcb, ctxt->regs);
+
+ ret = sev_es_ghcb_hv_call(ghcb, ctxt, SVM_EXIT_VMMCALL, 0, 0);
+ if (ret != ES_OK)
+ return ret;
+
+ if (!ghcb_rax_is_valid(ghcb))
+ return ES_VMM_ERROR;
+
+ ctxt->regs->ax = ghcb->save.rax;
+
+ /*
+ * Call sev_es_hcall_finish() after regs->ax is already set.
+ * This allows the hypervisor handler to overwrite it again if
+ * necessary.
+ */
+ if (x86_platform.hyper.sev_es_hcall_finish &&
+ !x86_platform.hyper.sev_es_hcall_finish(ghcb, ctxt->regs))
+ return ES_VMM_ERROR;
+
+ return ES_OK;
+}
+
+static enum es_result vc_handle_trap_ac(struct ghcb *ghcb,
+ struct es_em_ctxt *ctxt)
+{
+ /*
+ * Calling exc_alignment_check() directly does not work, because it
+ * enables IRQs and the GHCB is active. Forward the exception and call
+ * it later from vc_forward_exception().
+ */
+ ctxt->fi.vector = X86_TRAP_AC;
+ ctxt->fi.error_code = 0;
+ return ES_EXCEPTION;
+}
+
+static enum es_result vc_handle_exitcode(struct es_em_ctxt *ctxt,
+ struct ghcb *ghcb,
+ unsigned long exit_code)
+{
+ enum es_result result = vc_check_opcode_bytes(ctxt, exit_code);
+
+ if (result != ES_OK)
+ return result;
+
+ switch (exit_code) {
+ case SVM_EXIT_READ_DR7:
+ result = vc_handle_dr7_read(ghcb, ctxt);
+ break;
+ case SVM_EXIT_WRITE_DR7:
+ result = vc_handle_dr7_write(ghcb, ctxt);
+ break;
+ case SVM_EXIT_EXCP_BASE + X86_TRAP_AC:
+ result = vc_handle_trap_ac(ghcb, ctxt);
+ break;
+ case SVM_EXIT_RDTSC:
+ case SVM_EXIT_RDTSCP:
+ result = vc_handle_rdtsc(ghcb, ctxt, exit_code);
+ break;
+ case SVM_EXIT_RDPMC:
+ result = vc_handle_rdpmc(ghcb, ctxt);
+ break;
+ case SVM_EXIT_INVD:
+ pr_err_ratelimited("#VC exception for INVD??? Seriously???\n");
+ result = ES_UNSUPPORTED;
+ break;
+ case SVM_EXIT_CPUID:
+ result = vc_handle_cpuid(ghcb, ctxt);
+ break;
+ case SVM_EXIT_IOIO:
+ result = vc_handle_ioio(ghcb, ctxt);
+ break;
+ case SVM_EXIT_MSR:
+ result = vc_handle_msr(ghcb, ctxt);
+ break;
+ case SVM_EXIT_VMMCALL:
+ result = vc_handle_vmmcall(ghcb, ctxt);
+ break;
+ case SVM_EXIT_WBINVD:
+ result = vc_handle_wbinvd(ghcb, ctxt);
+ break;
+ case SVM_EXIT_MONITOR:
+ result = vc_handle_monitor(ghcb, ctxt);
+ break;
+ case SVM_EXIT_MWAIT:
+ result = vc_handle_mwait(ghcb, ctxt);
+ break;
+ case SVM_EXIT_NPF:
+ result = vc_handle_mmio(ghcb, ctxt);
+ break;
+ default:
+ /*
+ * Unexpected #VC exception
+ */
+ result = ES_UNSUPPORTED;
+ }
+
+ return result;
+}
+
+static __always_inline bool is_vc2_stack(unsigned long sp)
+{
+ return (sp >= __this_cpu_ist_bottom_va(VC2) && sp < __this_cpu_ist_top_va(VC2));
+}
+
+static __always_inline bool vc_from_invalid_context(struct pt_regs *regs)
+{
+ unsigned long sp, prev_sp;
+
+ sp = (unsigned long)regs;
+ prev_sp = regs->sp;
+
+ /*
+ * If the code was already executing on the VC2 stack when the #VC
+ * happened, let it proceed to the normal handling routine. This way the
+ * code executing on the VC2 stack can cause #VC exceptions to get handled.
+ */
+ return is_vc2_stack(sp) && !is_vc2_stack(prev_sp);
+}
+
+static bool vc_raw_handle_exception(struct pt_regs *regs, unsigned long error_code)
+{
+ struct ghcb_state state;
+ struct es_em_ctxt ctxt;
+ enum es_result result;
+ struct ghcb *ghcb;
+ bool ret = true;
+
+ ghcb = __sev_get_ghcb(&state);
+
+ vc_ghcb_invalidate(ghcb);
+ result = vc_init_em_ctxt(&ctxt, regs, error_code);
+
+ if (result == ES_OK)
+ result = vc_handle_exitcode(&ctxt, ghcb, error_code);
+
+ __sev_put_ghcb(&state);
+
+ /* Done - now check the result */
+ switch (result) {
+ case ES_OK:
+ vc_finish_insn(&ctxt);
+ break;
+ case ES_UNSUPPORTED:
+ pr_err_ratelimited("Unsupported exit-code 0x%02lx in #VC exception (IP: 0x%lx)\n",
+ error_code, regs->ip);
+ ret = false;
+ break;
+ case ES_VMM_ERROR:
+ pr_err_ratelimited("Failure in communication with VMM (exit-code 0x%02lx IP: 0x%lx)\n",
+ error_code, regs->ip);
+ ret = false;
+ break;
+ case ES_DECODE_FAILED:
+ pr_err_ratelimited("Failed to decode instruction (exit-code 0x%02lx IP: 0x%lx)\n",
+ error_code, regs->ip);
+ ret = false;
+ break;
+ case ES_EXCEPTION:
+ vc_forward_exception(&ctxt);
+ break;
+ case ES_RETRY:
+ /* Nothing to do */
+ break;
+ default:
+ pr_emerg("Unknown result in %s():%d\n", __func__, result);
+ /*
+ * Emulating the instruction which caused the #VC exception
+ * failed - can't continue so print debug information
+ */
+ BUG();
+ }
+
+ return ret;
+}
+
+static __always_inline bool vc_is_db(unsigned long error_code)
+{
+ return error_code == SVM_EXIT_EXCP_BASE + X86_TRAP_DB;
+}
+
+/*
+ * Runtime #VC exception handler when raised from kernel mode. Runs in NMI mode
+ * and will panic when an error happens.
+ */
+DEFINE_IDTENTRY_VC_KERNEL(exc_vmm_communication)
+{
+ irqentry_state_t irq_state;
+
+ /*
+ * With the current implementation it is always possible to switch to a
+ * safe stack because #VC exceptions only happen at known places, like
+ * intercepted instructions or accesses to MMIO areas/IO ports. They can
+ * also happen with code instrumentation when the hypervisor intercepts
+ * #DB, but the critical paths are forbidden to be instrumented, so #DB
+ * exceptions currently also only happen in safe places.
+ *
+ * But keep this here in case the noinstr annotations are violated due
+ * to a bug elsewhere.
+ */
+ if (unlikely(vc_from_invalid_context(regs))) {
+ instrumentation_begin();
+ panic("Can't handle #VC exception from unsupported context\n");
+ instrumentation_end();
+ }
+
+ /*
+ * Handle #DB before calling into !noinstr code to avoid recursive #DB.
+ */
+ if (vc_is_db(error_code)) {
+ exc_debug(regs);
+ return;
+ }
+
+ irq_state = irqentry_nmi_enter(regs);
+
+ instrumentation_begin();
+
+ if (!vc_raw_handle_exception(regs, error_code)) {
+ /* Show some debug info */
+ show_regs(regs);
+
+ /* Ask the hypervisor to terminate via sev_es_terminate() */
+ sev_es_terminate(SEV_TERM_SET_GEN, GHCB_SEV_ES_GEN_REQ);
+
+ /* If that fails and we get here - just panic */
+ panic("Returned from Terminate-Request to Hypervisor\n");
+ }
+
+ instrumentation_end();
+ irqentry_nmi_exit(regs, irq_state);
+}
+
+/*
+ * Runtime #VC exception handler when raised from user mode. Runs in IRQ mode
+ * and will kill the current task with SIGBUS when an error happens.
+ */
+DEFINE_IDTENTRY_VC_USER(exc_vmm_communication)
+{
+ /*
+ * Handle #DB before calling into !noinstr code to avoid recursive #DB.
+ */
+ if (vc_is_db(error_code)) {
+ noist_exc_debug(regs);
+ return;
+ }
+
+ irqentry_enter_from_user_mode(regs);
+ instrumentation_begin();
+
+ if (!vc_raw_handle_exception(regs, error_code)) {
+ /*
+ * Do not kill the machine if user-space triggered the
+ * exception. Send SIGBUS instead and let user-space deal with
+ * it.
+ */
+ force_sig_fault(SIGBUS, BUS_OBJERR, (void __user *)0);
+ }
+
+ instrumentation_end();
+ irqentry_exit_to_user_mode(regs);
+}
+
+bool __init handle_vc_boot_ghcb(struct pt_regs *regs)
+{
+ unsigned long exit_code = regs->orig_ax;
+ struct es_em_ctxt ctxt;
+ enum es_result result;
+
+ vc_ghcb_invalidate(boot_ghcb);
+
+ result = vc_init_em_ctxt(&ctxt, regs, exit_code);
+ if (result == ES_OK)
+ result = vc_handle_exitcode(&ctxt, boot_ghcb, exit_code);
+
+ /* Done - now check the result */
+ switch (result) {
+ case ES_OK:
+ vc_finish_insn(&ctxt);
+ break;
+ case ES_UNSUPPORTED:
+ early_printk("PANIC: Unsupported exit-code 0x%02lx in early #VC exception (IP: 0x%lx)\n",
+ exit_code, regs->ip);
+ goto fail;
+ case ES_VMM_ERROR:
+ early_printk("PANIC: Failure in communication with VMM (exit-code 0x%02lx IP: 0x%lx)\n",
+ exit_code, regs->ip);
+ goto fail;
+ case ES_DECODE_FAILED:
+ early_printk("PANIC: Failed to decode instruction (exit-code 0x%02lx IP: 0x%lx)\n",
+ exit_code, regs->ip);
+ goto fail;
+ case ES_EXCEPTION:
+ vc_early_forward_exception(&ctxt);
+ break;
+ case ES_RETRY:
+ /* Nothing to do */
+ break;
+ default:
+ BUG();
+ }
+
+ return true;
+
+fail:
+ show_regs(regs);
+
+ sev_es_terminate(SEV_TERM_SET_GEN, GHCB_SEV_ES_GEN_REQ);
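+
+ /* Not reached: sev_es_terminate() requests termination and then halts */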
+}
diff --git a/arch/x86/coco/sev/vc-shared.c b/arch/x86/coco/sev/vc-shared.c
new file mode 100644
index 000000000000..2c0ab0fdc060
--- /dev/null
+++ b/arch/x86/coco/sev/vc-shared.c
@@ -0,0 +1,504 @@
+// SPDX-License-Identifier: GPL-2.0
+
+static enum es_result vc_check_opcode_bytes(struct es_em_ctxt *ctxt,
+ unsigned long exit_code)
+{
+ unsigned int opcode = (unsigned int)ctxt->insn.opcode.value;
+ u8 modrm = ctxt->insn.modrm.value;
+
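+ /*
+ * Note: insn.opcode.value holds the opcode bytes in memory order, so a
+ * two-byte opcode such as CPUID (0x0f 0xa2) compares as 0xa20f here.
+ */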
+ switch (exit_code) {
+
+ case SVM_EXIT_IOIO:
+ case SVM_EXIT_NPF:
+ /* handled separately */
+ return ES_OK;
+
+ case SVM_EXIT_CPUID:
+ if (opcode == 0xa20f)
+ return ES_OK;
+ break;
+
+ case SVM_EXIT_INVD:
+ if (opcode == 0x080f)
+ return ES_OK;
+ break;
+
+ case SVM_EXIT_MONITOR:
+ /* MONITOR and MONITORX instructions generate the same error code */
+ if (opcode == 0x010f && (modrm == 0xc8 || modrm == 0xfa))
+ return ES_OK;
+ break;
+
+ case SVM_EXIT_MWAIT:
+ /* MWAIT and MWAITX instructions generate the same error code */
+ if (opcode == 0x010f && (modrm == 0xc9 || modrm == 0xfb))
+ return ES_OK;
+ break;
+
+ case SVM_EXIT_MSR:
+ /* RDMSR */
+ if (opcode == 0x320f ||
+ /* WRMSR */
+ opcode == 0x300f)
+ return ES_OK;
+ break;
+
+ case SVM_EXIT_RDPMC:
+ if (opcode == 0x330f)
+ return ES_OK;
+ break;
+
+ case SVM_EXIT_RDTSC:
+ if (opcode == 0x310f)
+ return ES_OK;
+ break;
+
+ case SVM_EXIT_RDTSCP:
+ if (opcode == 0x010f && modrm == 0xf9)
+ return ES_OK;
+ break;
+
+ case SVM_EXIT_READ_DR7:
+ if (opcode == 0x210f &&
+ X86_MODRM_REG(ctxt->insn.modrm.value) == 7)
+ return ES_OK;
+ break;
+
+ case SVM_EXIT_VMMCALL:
+ if (opcode == 0x010f && modrm == 0xd9)
+ return ES_OK;
+ break;
+
+ case SVM_EXIT_WRITE_DR7:
+ if (opcode == 0x230f &&
+ X86_MODRM_REG(ctxt->insn.modrm.value) == 7)
+ return ES_OK;
+ break;
+
+ case SVM_EXIT_WBINVD:
+ if (opcode == 0x90f)
+ return ES_OK;
+ break;
+
+ default:
+ break;
+ }
+
+ sev_printk(KERN_ERR "Wrong/unhandled opcode bytes: 0x%x, exit_code: 0x%lx, rIP: 0x%lx\n",
+ opcode, exit_code, ctxt->regs->ip);
+
+ return ES_UNSUPPORTED;
+}
+
+static bool vc_decoding_needed(unsigned long exit_code)
+{
+ /* Exceptions don't require decoding the instruction */
+ return !(exit_code >= SVM_EXIT_EXCP_BASE &&
+ exit_code <= SVM_EXIT_LAST_EXCP);
+}
+
+static enum es_result vc_init_em_ctxt(struct es_em_ctxt *ctxt,
+ struct pt_regs *regs,
+ unsigned long exit_code)
+{
+ enum es_result ret = ES_OK;
+
+ memset(ctxt, 0, sizeof(*ctxt));
+ ctxt->regs = regs;
+
+ if (vc_decoding_needed(exit_code))
+ ret = vc_decode_insn(ctxt);
+
+ return ret;
+}
+
+static void vc_finish_insn(struct es_em_ctxt *ctxt)
+{
+ ctxt->regs->ip += ctxt->insn.length;
+}
+
+static enum es_result vc_insn_string_check(struct es_em_ctxt *ctxt,
+ unsigned long address,
+ bool write)
+{
+ if (user_mode(ctxt->regs) && fault_in_kernel_space(address)) {
+ ctxt->fi.vector = X86_TRAP_PF;
+ ctxt->fi.error_code = X86_PF_USER;
+ ctxt->fi.cr2 = address;
+ if (write)
+ ctxt->fi.error_code |= X86_PF_WRITE;
+
+ return ES_EXCEPTION;
+ }
+
+ return ES_OK;
+}
+
+static enum es_result vc_insn_string_read(struct es_em_ctxt *ctxt,
+ void *src, char *buf,
+ unsigned int data_size,
+ unsigned int count,
+ bool backwards)
+{
+ int i, b = backwards ? -1 : 1;
+ unsigned long address = (unsigned long)src;
+ enum es_result ret;
+
+ ret = vc_insn_string_check(ctxt, address, false);
+ if (ret != ES_OK)
+ return ret;
+
+ for (i = 0; i < count; i++) {
+ void *s = src + (i * data_size * b);
+ char *d = buf + (i * data_size);
+
+ ret = vc_read_mem(ctxt, s, d, data_size);
+ if (ret != ES_OK)
+ break;
+ }
+
+ return ret;
+}
+
+static enum es_result vc_insn_string_write(struct es_em_ctxt *ctxt,
+ void *dst, char *buf,
+ unsigned int data_size,
+ unsigned int count,
+ bool backwards)
+{
+ int i, s = backwards ? -1 : 1;
+ unsigned long address = (unsigned long)dst;
+ enum es_result ret;
+
+ ret = vc_insn_string_check(ctxt, address, true);
+ if (ret != ES_OK)
+ return ret;
+
+ for (i = 0; i < count; i++) {
+ void *d = dst + (i * data_size * s);
+ char *b = buf + (i * data_size);
+
+ ret = vc_write_mem(ctxt, d, b, data_size);
+ if (ret != ES_OK)
+ break;
+ }
+
+ return ret;
+}
+
+#define IOIO_TYPE_STR BIT(2)
+#define IOIO_TYPE_IN 1
+#define IOIO_TYPE_INS (IOIO_TYPE_IN | IOIO_TYPE_STR)
+#define IOIO_TYPE_OUT 0
+#define IOIO_TYPE_OUTS (IOIO_TYPE_OUT | IOIO_TYPE_STR)
+
+#define IOIO_REP BIT(3)
+
+#define IOIO_ADDR_64 BIT(9)
+#define IOIO_ADDR_32 BIT(8)
+#define IOIO_ADDR_16 BIT(7)
+
+#define IOIO_DATA_32 BIT(6)
+#define IOIO_DATA_16 BIT(5)
+#define IOIO_DATA_8 BIT(4)
+
+#define IOIO_SEG_ES (0 << 10)
+#define IOIO_SEG_DS (3 << 10)
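+
+/*
+ * Worked example (illustrative): for "outb %al, $0x71", vc_ioio_exitinfo()
+ * below yields IOIO_TYPE_OUT | IOIO_DATA_8 | (0x71 << 16), plus the
+ * IOIO_ADDR_* bit matching the decoded address size of the instruction.
+ */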
+
+static enum es_result vc_ioio_exitinfo(struct es_em_ctxt *ctxt, u64 *exitinfo)
+{
+ struct insn *insn = &ctxt->insn;
+ size_t size;
+ u64 port;
+
+ *exitinfo = 0;
+
+ switch (insn->opcode.bytes[0]) {
+ /* INS opcodes */
+ case 0x6c:
+ case 0x6d:
+ *exitinfo |= IOIO_TYPE_INS;
+ *exitinfo |= IOIO_SEG_ES;
+ port = ctxt->regs->dx & 0xffff;
+ break;
+
+ /* OUTS opcodes */
+ case 0x6e:
+ case 0x6f:
+ *exitinfo |= IOIO_TYPE_OUTS;
+ *exitinfo |= IOIO_SEG_DS;
+ port = ctxt->regs->dx & 0xffff;
+ break;
+
+ /* IN immediate opcodes */
+ case 0xe4:
+ case 0xe5:
+ *exitinfo |= IOIO_TYPE_IN;
+ port = (u8)insn->immediate.value & 0xffff;
+ break;
+
+ /* OUT immediate opcodes */
+ case 0xe6:
+ case 0xe7:
+ *exitinfo |= IOIO_TYPE_OUT;
+ port = (u8)insn->immediate.value & 0xffff;
+ break;
+
+ /* IN register opcodes */
+ case 0xec:
+ case 0xed:
+ *exitinfo |= IOIO_TYPE_IN;
+ port = ctxt->regs->dx & 0xffff;
+ break;
+
+ /* OUT register opcodes */
+ case 0xee:
+ case 0xef:
+ *exitinfo |= IOIO_TYPE_OUT;
+ port = ctxt->regs->dx & 0xffff;
+ break;
+
+ default:
+ return ES_DECODE_FAILED;
+ }
+
+ *exitinfo |= port << 16;
+
+ switch (insn->opcode.bytes[0]) {
+ case 0x6c:
+ case 0x6e:
+ case 0xe4:
+ case 0xe6:
+ case 0xec:
+ case 0xee:
+ /* Single byte opcodes */
+ *exitinfo |= IOIO_DATA_8;
+ size = 1;
+ break;
+ default:
+ /* Length determined by instruction parsing */
+ *exitinfo |= (insn->opnd_bytes == 2) ? IOIO_DATA_16
+ : IOIO_DATA_32;
+ size = (insn->opnd_bytes == 2) ? 2 : 4;
+ }
+
+ switch (insn->addr_bytes) {
+ case 2:
+ *exitinfo |= IOIO_ADDR_16;
+ break;
+ case 4:
+ *exitinfo |= IOIO_ADDR_32;
+ break;
+ case 8:
+ *exitinfo |= IOIO_ADDR_64;
+ break;
+ }
+
+ if (insn_has_rep_prefix(insn))
+ *exitinfo |= IOIO_REP;
+
+ return vc_ioio_check(ctxt, (u16)port, size);
+}
+
+static enum es_result vc_handle_ioio(struct ghcb *ghcb, struct es_em_ctxt *ctxt)
+{
+ struct pt_regs *regs = ctxt->regs;
+ u64 exit_info_1, exit_info_2;
+ enum es_result ret;
+
+ ret = vc_ioio_exitinfo(ctxt, &exit_info_1);
+ if (ret != ES_OK)
+ return ret;
+
+ if (exit_info_1 & IOIO_TYPE_STR) {
+
+ /* (REP) INS/OUTS */
+
+ bool df = ((regs->flags & X86_EFLAGS_DF) == X86_EFLAGS_DF);
+ unsigned int io_bytes, exit_bytes;
+ unsigned int ghcb_count, op_count;
+ unsigned long es_base;
+ u64 sw_scratch;
+
+ /*
+ * For the string variants with rep prefix, the number of in/out
+ * operations per #VC exception is limited so that the kernel
+ * has a chance to take interrupts and re-schedule while the
+ * instruction is emulated.
+ */
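+ /*
+ * Bits 4-6 of exitinfo are IOIO_DATA_8/16/32, so this yields the
+ * element width in bytes: 1, 2 or 4.
+ */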
+ io_bytes = (exit_info_1 >> 4) & 0x7;
+ ghcb_count = sizeof(ghcb->shared_buffer) / io_bytes;
+
+ op_count = (exit_info_1 & IOIO_REP) ? regs->cx : 1;
+ exit_info_2 = min(op_count, ghcb_count);
+ exit_bytes = exit_info_2 * io_bytes;
+
+ es_base = insn_get_seg_base(ctxt->regs, INAT_SEG_REG_ES);
+
+ /* Read bytes of OUTS into the shared buffer */
+ if (!(exit_info_1 & IOIO_TYPE_IN)) {
+ ret = vc_insn_string_read(ctxt,
+ (void *)(es_base + regs->si),
+ ghcb->shared_buffer, io_bytes,
+ exit_info_2, df);
+ if (ret)
+ return ret;
+ }
+
+ /*
+ * Issue a VMGEXIT to the HV to consume the bytes from the
+ * shared buffer or to have it write them into the shared buffer
+ * depending on the instruction: OUTS or INS.
+ */
+ sw_scratch = __pa(ghcb) + offsetof(struct ghcb, shared_buffer);
+ ghcb_set_sw_scratch(ghcb, sw_scratch);
+ ret = sev_es_ghcb_hv_call(ghcb, ctxt, SVM_EXIT_IOIO,
+ exit_info_1, exit_info_2);
+ if (ret != ES_OK)
+ return ret;
+
+ /* Read bytes from shared buffer into the guest's destination. */
+ if (exit_info_1 & IOIO_TYPE_IN) {
+ ret = vc_insn_string_write(ctxt,
+ (void *)(es_base + regs->di),
+ ghcb->shared_buffer, io_bytes,
+ exit_info_2, df);
+ if (ret)
+ return ret;
+
+ if (df)
+ regs->di -= exit_bytes;
+ else
+ regs->di += exit_bytes;
+ } else {
+ if (df)
+ regs->si -= exit_bytes;
+ else
+ regs->si += exit_bytes;
+ }
+
+ if (exit_info_1 & IOIO_REP)
+ regs->cx -= exit_info_2;
+
+ ret = regs->cx ? ES_RETRY : ES_OK;
+
+ } else {
+
+ /* IN/OUT into/from rAX */
+
+ int bits = (exit_info_1 & 0x70) >> 1;
+ u64 rax = 0;
+
+ if (!(exit_info_1 & IOIO_TYPE_IN))
+ rax = lower_bits(regs->ax, bits);
+
+ ghcb_set_rax(ghcb, rax);
+
+ ret = sev_es_ghcb_hv_call(ghcb, ctxt, SVM_EXIT_IOIO, exit_info_1, 0);
+ if (ret != ES_OK)
+ return ret;
+
+ if (exit_info_1 & IOIO_TYPE_IN) {
+ if (!ghcb_rax_is_valid(ghcb))
+ return ES_VMM_ERROR;
+ regs->ax = lower_bits(ghcb->save.rax, bits);
+ }
+ }
+
+ return ret;
+}
+
+static int vc_handle_cpuid_snp(struct ghcb *ghcb, struct es_em_ctxt *ctxt)
+{
+ struct pt_regs *regs = ctxt->regs;
+ struct cpuid_leaf leaf;
+ int ret;
+
+ leaf.fn = regs->ax;
+ leaf.subfn = regs->cx;
+ ret = snp_cpuid(ghcb, ctxt, &leaf);
+ if (!ret) {
+ regs->ax = leaf.eax;
+ regs->bx = leaf.ebx;
+ regs->cx = leaf.ecx;
+ regs->dx = leaf.edx;
+ }
+
+ return ret;
+}
+
+static enum es_result vc_handle_cpuid(struct ghcb *ghcb,
+ struct es_em_ctxt *ctxt)
+{
+ struct pt_regs *regs = ctxt->regs;
+ u32 cr4 = native_read_cr4();
+ enum es_result ret;
+ int snp_cpuid_ret;
+
+ snp_cpuid_ret = vc_handle_cpuid_snp(ghcb, ctxt);
+ if (!snp_cpuid_ret)
+ return ES_OK;
+ if (snp_cpuid_ret != -EOPNOTSUPP)
+ return ES_VMM_ERROR;
+
+ ghcb_set_rax(ghcb, regs->ax);
+ ghcb_set_rcx(ghcb, regs->cx);
+
+ if (cr4 & X86_CR4_OSXSAVE)
+ /* Safe to read xcr0 */
+ ghcb_set_xcr0(ghcb, xgetbv(XCR_XFEATURE_ENABLED_MASK));
+ else
+		/* xgetbv will cause #UD - use reset value for xcr0 */
+ ghcb_set_xcr0(ghcb, 1);
+
+ ret = sev_es_ghcb_hv_call(ghcb, ctxt, SVM_EXIT_CPUID, 0, 0);
+ if (ret != ES_OK)
+ return ret;
+
+ if (!(ghcb_rax_is_valid(ghcb) &&
+ ghcb_rbx_is_valid(ghcb) &&
+ ghcb_rcx_is_valid(ghcb) &&
+ ghcb_rdx_is_valid(ghcb)))
+ return ES_VMM_ERROR;
+
+ regs->ax = ghcb->save.rax;
+ regs->bx = ghcb->save.rbx;
+ regs->cx = ghcb->save.rcx;
+ regs->dx = ghcb->save.rdx;
+
+ return ES_OK;
+}
+
+static enum es_result vc_handle_rdtsc(struct ghcb *ghcb,
+ struct es_em_ctxt *ctxt,
+ unsigned long exit_code)
+{
+ bool rdtscp = (exit_code == SVM_EXIT_RDTSCP);
+ enum es_result ret;
+
+ /*
+ * The hypervisor should not be intercepting RDTSC/RDTSCP when Secure
+ * TSC is enabled. A #VC exception will be generated if the RDTSC/RDTSCP
+ * instructions are being intercepted. If this should occur and Secure
+ * TSC is enabled, guest execution should be terminated as the guest
+ * cannot rely on the TSC value provided by the hypervisor.
+ */
+ if (sev_status & MSR_AMD64_SNP_SECURE_TSC)
+ return ES_VMM_ERROR;
+
+ ret = sev_es_ghcb_hv_call(ghcb, ctxt, exit_code, 0, 0);
+ if (ret != ES_OK)
+ return ret;
+
+ if (!(ghcb_rax_is_valid(ghcb) && ghcb_rdx_is_valid(ghcb) &&
+ (!rdtscp || ghcb_rcx_is_valid(ghcb))))
+ return ES_VMM_ERROR;
+
+ ctxt->regs->ax = ghcb->save.rax;
+ ctxt->regs->dx = ghcb->save.rdx;
+ if (rdtscp)
+ ctxt->regs->cx = ghcb->save.rcx;
+
+ return ES_OK;
+}
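For reference, a worked example of the exit info encoding constructed by
vc_ioio_exitinfo() above. This is a sketch which assumes that the IOIO_*
constants follow the GHCB/APM IOIO intercept layout used by this file:
direction in bit 0, string and REP flags in bits 2-3, operand size in
bits 4-6, address size in bits 7-9, segment in bits 10-12 and the port
number in bits 16-31.

    /*
     * Sketch: exit info for 'rep outsb' with dx == 0x3f8 in 64-bit mode,
     * assuming IOIO_TYPE_OUTS == BIT(2), IOIO_REP == BIT(3),
     * IOIO_DATA_8 == BIT(4), IOIO_ADDR_64 == BIT(9) and
     * IOIO_SEG_DS == 3 << 10.
     */
    u64 exitinfo = IOIO_TYPE_OUTS     /* 0x004: string op, direction OUT */
                 | IOIO_REP           /* 0x008: REP prefix present       */
                 | IOIO_DATA_8        /* 0x010: one byte per operation   */
                 | IOIO_ADDR_64       /* 0x200: 64-bit address size      */
                 | IOIO_SEG_DS        /* 0xc00: DS is the source segment */
                 | (0x3f8ULL << 16);  /* port number in bits 16-31       */

    /*
     * The one-hot size encoding is what makes the arithmetic in
     * vc_handle_ioio() work: io_bytes = (exitinfo >> 4) & 0x7 yields
     * 1/2/4 bytes, and bits = (exitinfo & 0x70) >> 1 yields 8/16/32.
     */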
diff --git a/arch/x86/include/asm/sev-internal.h b/arch/x86/include/asm/sev-internal.h
index a78f97208a39..b7232081f8f7 100644
--- a/arch/x86/include/asm/sev-internal.h
+++ b/arch/x86/include/asm/sev-internal.h
@@ -3,7 +3,6 @@
#define DR7_RESET_VALUE 0x400
extern struct ghcb boot_ghcb_page;
-extern struct ghcb *boot_ghcb;
extern u64 sev_hv_features;
extern u64 sev_secrets_pa;
@@ -59,8 +58,6 @@ DECLARE_PER_CPU(struct sev_es_save_area *, sev_vmsa);
void early_set_pages_state(unsigned long vaddr, unsigned long paddr,
unsigned long npages, enum psc_op op);
-void __noreturn sev_es_terminate(unsigned int set, unsigned int reason);
-
DECLARE_PER_CPU(struct svsm_ca *, svsm_caa);
DECLARE_PER_CPU(u64, svsm_caa_pa);
@@ -100,11 +97,6 @@ static __always_inline void sev_es_wr_ghcb_msr(u64 val)
native_wrmsr(MSR_AMD64_SEV_ES_GHCB, low, high);
}
-enum es_result sev_es_ghcb_hv_call(struct ghcb *ghcb,
- struct es_em_ctxt *ctxt,
- u64 exit_code, u64 exit_info_1,
- u64 exit_info_2);
-
void snp_register_ghcb_early(unsigned long paddr);
bool sev_es_negotiate_protocol(void);
bool sev_es_check_cpu_features(void);
diff --git a/arch/x86/include/asm/sev.h b/arch/x86/include/asm/sev.h
index a8661dfc9a9a..6158893786d6 100644
--- a/arch/x86/include/asm/sev.h
+++ b/arch/x86/include/asm/sev.h
@@ -521,6 +521,28 @@ static __always_inline void vc_ghcb_invalidate(struct ghcb *ghcb)
__builtin_memset(ghcb->save.valid_bitmap, 0, sizeof(ghcb->save.valid_bitmap));
}
+void vc_forward_exception(struct es_em_ctxt *ctxt);
+
+/* I/O parameters for CPUID-related helpers */
+struct cpuid_leaf {
+ u32 fn;
+ u32 subfn;
+ u32 eax;
+ u32 ebx;
+ u32 ecx;
+ u32 edx;
+};
+
+int snp_cpuid(struct ghcb *ghcb, struct es_em_ctxt *ctxt, struct cpuid_leaf *leaf);
+
+void __noreturn sev_es_terminate(unsigned int set, unsigned int reason);
+enum es_result sev_es_ghcb_hv_call(struct ghcb *ghcb,
+ struct es_em_ctxt *ctxt,
+ u64 exit_code, u64 exit_info_1,
+ u64 exit_info_2);
+
+extern struct ghcb *boot_ghcb;
+
#else /* !CONFIG_AMD_MEM_ENCRYPT */
#define snp_vmpl 0
--
2.49.0.906.g1f30a19c02-goog
* [RFT PATCH v2 07/23] x86/sev: Separate MSR and GHCB based snp_cpuid() via a callback
2025-05-04 9:52 [RFT PATCH v2 00/23] x86: strict separation of startup code Ard Biesheuvel
` (5 preceding siblings ...)
2025-05-04 9:52 ` [RFT PATCH v2 06/23] x86/sev: Disentangle #VC handling code from startup code Ard Biesheuvel
@ 2025-05-04 9:52 ` Ard Biesheuvel
2025-05-04 9:52 ` [RFT PATCH v2 08/23] x86/sev: Fall back to early page state change code only during boot Ard Biesheuvel
` (17 subsequent siblings)
24 siblings, 0 replies; 59+ messages in thread
From: Ard Biesheuvel @ 2025-05-04 9:52 UTC (permalink / raw)
To: linux-kernel
Cc: linux-efi, x86, Ard Biesheuvel, Borislav Petkov, Ingo Molnar,
Dionna Amalie Glaze, Kevin Loughlin, Tom Lendacky
From: Ard Biesheuvel <ardb@kernel.org>
There are two distinct callers of snp_cpuid(): one where the MSR
protocol is always used, and one where the GHCB page based interface is
always used.
The snp_cpuid() logic does not care about the distinction, which only
matters at a lower level. But the fact that it supports both interfaces
means that the GHCB page based logic is pulled into the early startup
code, where VA to PA conversions are problematic, given that it runs from
the 1:1 mapping of memory.
So keep snp_cpuid() itself in the startup code, but factor out the
hypervisor calls via a callback, so that the GHCB page handling can be
moved out.
Code refactoring only - no functional change intended.
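After this change, each caller selects the transport explicitly by
passing the matching callback; reduced to the two call sites as they
appear in the hunks below:

    /* startup code, running from the 1:1 mapping: MSR protocol only */
    ret = snp_cpuid(snp_cpuid_hv_no_ghcb, NULL, &leaf);

    /* full #VC handler, kernel mapping up: GHCB page based protocol */
    struct cpuid_ctx ctx = { ghcb, ctxt };

    ret = snp_cpuid(snp_cpuid_hv_ghcb, &ctx, &leaf);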
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
---
arch/x86/boot/startup/sev-shared.c | 58 ++++----------------
arch/x86/coco/sev/vc-shared.c | 49 ++++++++++++++++-
arch/x86/include/asm/sev.h | 3 +-
3 files changed, 61 insertions(+), 49 deletions(-)
diff --git a/arch/x86/boot/startup/sev-shared.c b/arch/x86/boot/startup/sev-shared.c
index 7a706db87b93..408e064a80d9 100644
--- a/arch/x86/boot/startup/sev-shared.c
+++ b/arch/x86/boot/startup/sev-shared.c
@@ -342,44 +342,7 @@ static int __sev_cpuid_hv_msr(struct cpuid_leaf *leaf)
return ret;
}
-static int __sev_cpuid_hv_ghcb(struct ghcb *ghcb, struct es_em_ctxt *ctxt, struct cpuid_leaf *leaf)
-{
- u32 cr4 = native_read_cr4();
- int ret;
-
- ghcb_set_rax(ghcb, leaf->fn);
- ghcb_set_rcx(ghcb, leaf->subfn);
-
- if (cr4 & X86_CR4_OSXSAVE)
- /* Safe to read xcr0 */
- ghcb_set_xcr0(ghcb, xgetbv(XCR_XFEATURE_ENABLED_MASK));
- else
- /* xgetbv will cause #UD - use reset value for xcr0 */
- ghcb_set_xcr0(ghcb, 1);
-
- ret = sev_es_ghcb_hv_call(ghcb, ctxt, SVM_EXIT_CPUID, 0, 0);
- if (ret != ES_OK)
- return ret;
-
- if (!(ghcb_rax_is_valid(ghcb) &&
- ghcb_rbx_is_valid(ghcb) &&
- ghcb_rcx_is_valid(ghcb) &&
- ghcb_rdx_is_valid(ghcb)))
- return ES_VMM_ERROR;
- leaf->eax = ghcb->save.rax;
- leaf->ebx = ghcb->save.rbx;
- leaf->ecx = ghcb->save.rcx;
- leaf->edx = ghcb->save.rdx;
-
- return ES_OK;
-}
-
-static int sev_cpuid_hv(struct ghcb *ghcb, struct es_em_ctxt *ctxt, struct cpuid_leaf *leaf)
-{
- return ghcb ? __sev_cpuid_hv_ghcb(ghcb, ctxt, leaf)
- : __sev_cpuid_hv_msr(leaf);
-}
/*
* This may be called early while still running on the initial identity
@@ -484,21 +447,21 @@ snp_cpuid_get_validated_func(struct cpuid_leaf *leaf)
return false;
}
-static void snp_cpuid_hv(struct ghcb *ghcb, struct es_em_ctxt *ctxt, struct cpuid_leaf *leaf)
+static void snp_cpuid_hv_no_ghcb(void *ctx, struct cpuid_leaf *leaf)
{
- if (sev_cpuid_hv(ghcb, ctxt, leaf))
+ if (__sev_cpuid_hv_msr(leaf))
sev_es_terminate(SEV_TERM_SET_LINUX, GHCB_TERM_CPUID_HV);
}
static int __head
-snp_cpuid_postprocess(struct ghcb *ghcb, struct es_em_ctxt *ctxt,
- struct cpuid_leaf *leaf)
+snp_cpuid_postprocess(void (*cpuid_hv)(void *ctx, struct cpuid_leaf *),
+ void *ctx, struct cpuid_leaf *leaf)
{
struct cpuid_leaf leaf_hv = *leaf;
switch (leaf->fn) {
case 0x1:
- snp_cpuid_hv(ghcb, ctxt, &leaf_hv);
+ cpuid_hv(ctx, &leaf_hv);
/* initial APIC ID */
leaf->ebx = (leaf_hv.ebx & GENMASK(31, 24)) | (leaf->ebx & GENMASK(23, 0));
@@ -517,7 +480,7 @@ snp_cpuid_postprocess(struct ghcb *ghcb, struct es_em_ctxt *ctxt,
break;
case 0xB:
leaf_hv.subfn = 0;
- snp_cpuid_hv(ghcb, ctxt, &leaf_hv);
+ cpuid_hv(ctx, &leaf_hv);
/* extended APIC ID */
leaf->edx = leaf_hv.edx;
@@ -565,7 +528,7 @@ snp_cpuid_postprocess(struct ghcb *ghcb, struct es_em_ctxt *ctxt,
}
break;
case 0x8000001E:
- snp_cpuid_hv(ghcb, ctxt, &leaf_hv);
+ cpuid_hv(ctx, &leaf_hv);
/* extended APIC ID */
leaf->eax = leaf_hv.eax;
@@ -587,7 +550,8 @@ snp_cpuid_postprocess(struct ghcb *ghcb, struct es_em_ctxt *ctxt,
* should be treated as fatal by caller.
*/
int __head
-snp_cpuid(struct ghcb *ghcb, struct es_em_ctxt *ctxt, struct cpuid_leaf *leaf)
+snp_cpuid(void (*cpuid_hv)(void *ctx, struct cpuid_leaf *),
+ void *ctx, struct cpuid_leaf *leaf)
{
const struct snp_cpuid_table *cpuid_table = snp_cpuid_get_table();
@@ -621,7 +585,7 @@ snp_cpuid(struct ghcb *ghcb, struct es_em_ctxt *ctxt, struct cpuid_leaf *leaf)
return 0;
}
- return snp_cpuid_postprocess(ghcb, ctxt, leaf);
+ return snp_cpuid_postprocess(cpuid_hv, ctx, leaf);
}
/*
@@ -648,7 +612,7 @@ void __head do_vc_no_ghcb(struct pt_regs *regs, unsigned long exit_code)
leaf.fn = fn;
leaf.subfn = subfn;
- ret = snp_cpuid(NULL, NULL, &leaf);
+ ret = snp_cpuid(snp_cpuid_hv_no_ghcb, NULL, &leaf);
if (!ret)
goto cpuid_done;
diff --git a/arch/x86/coco/sev/vc-shared.c b/arch/x86/coco/sev/vc-shared.c
index 2c0ab0fdc060..b4688f69102e 100644
--- a/arch/x86/coco/sev/vc-shared.c
+++ b/arch/x86/coco/sev/vc-shared.c
@@ -409,15 +409,62 @@ static enum es_result vc_handle_ioio(struct ghcb *ghcb, struct es_em_ctxt *ctxt)
return ret;
}
+static int __sev_cpuid_hv_ghcb(struct ghcb *ghcb, struct es_em_ctxt *ctxt, struct cpuid_leaf *leaf)
+{
+ u32 cr4 = native_read_cr4();
+ int ret;
+
+ ghcb_set_rax(ghcb, leaf->fn);
+ ghcb_set_rcx(ghcb, leaf->subfn);
+
+ if (cr4 & X86_CR4_OSXSAVE)
+ /* Safe to read xcr0 */
+ ghcb_set_xcr0(ghcb, xgetbv(XCR_XFEATURE_ENABLED_MASK));
+ else
+ /* xgetbv will cause #UD - use reset value for xcr0 */
+ ghcb_set_xcr0(ghcb, 1);
+
+ ret = sev_es_ghcb_hv_call(ghcb, ctxt, SVM_EXIT_CPUID, 0, 0);
+ if (ret != ES_OK)
+ return ret;
+
+ if (!(ghcb_rax_is_valid(ghcb) &&
+ ghcb_rbx_is_valid(ghcb) &&
+ ghcb_rcx_is_valid(ghcb) &&
+ ghcb_rdx_is_valid(ghcb)))
+ return ES_VMM_ERROR;
+
+ leaf->eax = ghcb->save.rax;
+ leaf->ebx = ghcb->save.rbx;
+ leaf->ecx = ghcb->save.rcx;
+ leaf->edx = ghcb->save.rdx;
+
+ return ES_OK;
+}
+
+struct cpuid_ctx {
+ struct ghcb *ghcb;
+ struct es_em_ctxt *ctxt;
+};
+
+static void snp_cpuid_hv_ghcb(void *p, struct cpuid_leaf *leaf)
+{
+ struct cpuid_ctx *ctx = p;
+
+ if (__sev_cpuid_hv_ghcb(ctx->ghcb, ctx->ctxt, leaf))
+ sev_es_terminate(SEV_TERM_SET_LINUX, GHCB_TERM_CPUID_HV);
+}
+
static int vc_handle_cpuid_snp(struct ghcb *ghcb, struct es_em_ctxt *ctxt)
{
+ struct cpuid_ctx ctx = { ghcb, ctxt };
struct pt_regs *regs = ctxt->regs;
struct cpuid_leaf leaf;
int ret;
leaf.fn = regs->ax;
leaf.subfn = regs->cx;
- ret = snp_cpuid(ghcb, ctxt, &leaf);
+ ret = snp_cpuid(snp_cpuid_hv_ghcb, &ctx, &leaf);
if (!ret) {
regs->ax = leaf.eax;
regs->bx = leaf.ebx;
diff --git a/arch/x86/include/asm/sev.h b/arch/x86/include/asm/sev.h
index 6158893786d6..f50a606be4f1 100644
--- a/arch/x86/include/asm/sev.h
+++ b/arch/x86/include/asm/sev.h
@@ -533,7 +533,8 @@ struct cpuid_leaf {
u32 edx;
};
-int snp_cpuid(struct ghcb *ghcb, struct es_em_ctxt *ctxt, struct cpuid_leaf *leaf);
+int snp_cpuid(void (*cpuid_hv)(void *ctx, struct cpuid_leaf *),
+ void *ctx, struct cpuid_leaf *leaf);
void __noreturn sev_es_terminate(unsigned int set, unsigned int reason);
enum es_result sev_es_ghcb_hv_call(struct ghcb *ghcb,
--
2.49.0.906.g1f30a19c02-goog
* [RFT PATCH v2 08/23] x86/sev: Fall back to early page state change code only during boot
2025-05-04 9:52 [RFT PATCH v2 00/23] x86: strict separation of startup code Ard Biesheuvel
` (6 preceding siblings ...)
2025-05-04 9:52 ` [RFT PATCH v2 07/23] x86/sev: Separate MSR and GHCB based snp_cpuid() via a callback Ard Biesheuvel
@ 2025-05-04 9:52 ` Ard Biesheuvel
2025-05-04 9:52 ` [RFT PATCH v2 09/23] x86/sev: Move GHCB page based HV communication out of startup code Ard Biesheuvel
` (16 subsequent siblings)
24 siblings, 0 replies; 59+ messages in thread
From: Ard Biesheuvel @ 2025-05-04 9:52 UTC (permalink / raw)
To: linux-kernel
Cc: linux-efi, x86, Ard Biesheuvel, Borislav Petkov, Ingo Molnar,
Dionna Amalie Glaze, Kevin Loughlin, Tom Lendacky
From: Ard Biesheuvel <ardb@kernel.org>
The core SEV-SNP page state change code will fall back to the early page
state change code if no boot_ghcb has been assigned yet. This is
unlikely to occur in practice, and it prevents the early code from being
emitted as __init, making it harder to reason about whether or not
startup code may be called at runtime.
So ensure that early_set_pages_state() is only called before freeing
initmem, and move the call into a separate function annotated as __ref,
so that the callee can be annotated as __init in a subsequent patch.
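For illustration, the general shape of the pattern: a __ref wrapper
that encodes the runtime invariant which makes calling (eventually)
__init code legal. This is a minimal sketch with a hypothetical callee:

    /*
     * __ref suppresses the section mismatch warning; the WARN_ON
     * documents why the reference is safe: the call can only be made
     * while the __init text is still mapped.
     */
    static bool __ref call_early_code(void)
    {
            if (WARN_ON(system_state >= SYSTEM_FREEING_INITMEM))
                    return false;       /* too late - __init code is gone */
            some_early_routine();       /* hypothetical __init callee */
            return true;
    }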
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
---
arch/x86/coco/sev/core.c | 16 +++++++++++++---
1 file changed, 13 insertions(+), 3 deletions(-)
diff --git a/arch/x86/coco/sev/core.c b/arch/x86/coco/sev/core.c
index ac400525de73..fd5cd7d9b5f2 100644
--- a/arch/x86/coco/sev/core.c
+++ b/arch/x86/coco/sev/core.c
@@ -501,14 +501,24 @@ static unsigned long __set_pages_state(struct snp_psc_desc *data, unsigned long
return vaddr;
}
+static bool __ref __early_set_pages_state(unsigned long vaddr,
+ unsigned long npages, int op)
+{
+ /* Use the MSR protocol when a GHCB is not yet available. */
+ if (boot_ghcb || WARN_ON(system_state >= SYSTEM_FREEING_INITMEM))
+ return false;
+
+ early_set_pages_state(vaddr, __pa(vaddr), npages, op);
+ return true;
+}
+
static void set_pages_state(unsigned long vaddr, unsigned long npages, int op)
{
struct snp_psc_desc desc;
unsigned long vaddr_end;
- /* Use the MSR protocol when a GHCB is not available. */
- if (!boot_ghcb)
- return early_set_pages_state(vaddr, __pa(vaddr), npages, op);
+ if (__early_set_pages_state(vaddr, npages, op))
+ return;
vaddr = vaddr & PAGE_MASK;
vaddr_end = vaddr + (npages << PAGE_SHIFT);
--
2.49.0.906.g1f30a19c02-goog
* [RFT PATCH v2 09/23] x86/sev: Move GHCB page based HV communication out of startup code
2025-05-04 9:52 [RFT PATCH v2 00/23] x86: strict separation of startup code Ard Biesheuvel
` (7 preceding siblings ...)
2025-05-04 9:52 ` [RFT PATCH v2 08/23] x86/sev: Fall back to early page state change code only during boot Ard Biesheuvel
@ 2025-05-04 9:52 ` Ard Biesheuvel
2025-05-04 9:52 ` [RFT PATCH v2 10/23] x86/sev: Use boot SVSM CA for all startup and init code Ard Biesheuvel
` (15 subsequent siblings)
24 siblings, 0 replies; 59+ messages in thread
From: Ard Biesheuvel @ 2025-05-04 9:52 UTC (permalink / raw)
To: linux-kernel
Cc: linux-efi, x86, Ard Biesheuvel, Borislav Petkov, Ingo Molnar,
Dionna Amalie Glaze, Kevin Loughlin, Tom Lendacky
From: Ard Biesheuvel <ardb@kernel.org>
Both the decompressor and the core kernel implement an early #VC
handler, which only deals with CPUID instructions, and a full-featured
one, which can handle any #VC exception.
The former communicates with the hypervisor using the MSR based
protocol, whereas the latter uses a shared GHCB page, which is
configured a bit later. The SVSM PVALIDATE code can use either
interface, but will prefer the GHCB page based one once available.
Accessing this shared GHCB page from the core kernel's startup code is
problematic, because it involves converting the GHCB address provided by
the caller to a physical address. In the startup code, which may execute
from a 1:1 mapping of memory, virtual to physical address translations
are therefore ambiguous, and should be avoided.
This means that exposing startup code dealing with the GHCB to callers
that execute from the ordinary kernel virtual mapping should be avoided
too. So move all GHCB page based communication out of the startup code,
and rely on the MSR protocol only for all communication occurring before
the kernel virtual mapping is up, as well as for all SVSM PVALIDATE calls
made via the 'early' page state change API (which may be called a bit
later as well).
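To illustrate why the GHCB based interface is off limits here, consider
a schematic version of the VA to PA conversion (the real __pa() on
x86_64 is more involved, so treat the arithmetic as illustrative only):

    /*
     * Schematic only: the same subtraction that is correct under the
     * kernel virtual mapping produces garbage under the 1:1 mapping,
     * and the code cannot tell which case applies.
     */
    static unsigned long ghcb_pa(struct ghcb *ghcb, bool on_1_1_map,
                                 unsigned long mapping_offset)
    {
            unsigned long va = (unsigned long)ghcb;

            /* kernel mapping: va == pa + offset; 1:1 mapping: va == pa */
            return on_1_1_map ? va : va - mapping_offset;
    }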
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
---
arch/x86/boot/compressed/sev.c | 20 ---
arch/x86/boot/startup/sev-shared.c | 160 +-------------------
arch/x86/boot/startup/sev-startup.c | 35 +----
arch/x86/coco/sev/core.c | 67 ++++++++
arch/x86/coco/sev/vc-shared.c | 63 ++++++++
arch/x86/include/asm/sev-internal.h | 4 +-
arch/x86/include/asm/sev.h | 56 ++++++-
7 files changed, 190 insertions(+), 215 deletions(-)
diff --git a/arch/x86/boot/compressed/sev.c b/arch/x86/boot/compressed/sev.c
index 804166457022..9b6eebc24e78 100644
--- a/arch/x86/boot/compressed/sev.c
+++ b/arch/x86/boot/compressed/sev.c
@@ -50,31 +50,11 @@ u64 svsm_get_caa_pa(void)
return boot_svsm_caa_pa;
}
-int svsm_perform_call_protocol(struct svsm_call *call);
-
u8 snp_vmpl;
/* Include code for early handlers */
#include "../../boot/startup/sev-shared.c"
-int svsm_perform_call_protocol(struct svsm_call *call)
-{
- struct ghcb *ghcb;
- int ret;
-
- if (boot_ghcb)
- ghcb = boot_ghcb;
- else
- ghcb = NULL;
-
- do {
- ret = ghcb ? svsm_perform_ghcb_protocol(ghcb, call)
- : svsm_perform_msr_protocol(call);
- } while (ret == -EAGAIN);
-
- return ret;
-}
-
static bool sev_snp_enabled(void)
{
return sev_status & MSR_AMD64_SEV_SNP_ENABLED;
diff --git a/arch/x86/boot/startup/sev-shared.c b/arch/x86/boot/startup/sev-shared.c
index 408e064a80d9..0709c8a8655a 100644
--- a/arch/x86/boot/startup/sev-shared.c
+++ b/arch/x86/boot/startup/sev-shared.c
@@ -17,8 +17,6 @@
#else
#undef WARN
#define WARN(condition, format...) (!!(condition))
-#undef vc_forward_exception
-#define vc_forward_exception(c) panic("SNP: Hypervisor requested exception\n")
#endif
/*
@@ -39,7 +37,7 @@ u64 boot_svsm_caa_pa __ro_after_init;
*
* GHCB protocol version negotiated with the hypervisor.
*/
-static u16 ghcb_version __ro_after_init;
+u16 ghcb_version __ro_after_init;
/* Copy of the SNP firmware's CPUID page. */
static struct snp_cpuid_table cpuid_table_copy __ro_after_init;
@@ -100,22 +98,6 @@ u64 get_hv_features(void)
return GHCB_MSR_HV_FT_RESP_VAL(val);
}
-void snp_register_ghcb_early(unsigned long paddr)
-{
- unsigned long pfn = paddr >> PAGE_SHIFT;
- u64 val;
-
- sev_es_wr_ghcb_msr(GHCB_MSR_REG_GPA_REQ_VAL(pfn));
- VMGEXIT();
-
- val = sev_es_rd_ghcb_msr();
-
- /* If the response GPA is not ours then abort the guest */
- if ((GHCB_RESP_CODE(val) != GHCB_MSR_REG_GPA_RESP) ||
- (GHCB_MSR_REG_GPA_RESP_VAL(val) != pfn))
- sev_es_terminate(SEV_TERM_SET_LINUX, GHCB_TERM_REGISTER);
-}
-
bool sev_es_negotiate_protocol(void)
{
u64 val;
@@ -137,86 +119,7 @@ bool sev_es_negotiate_protocol(void)
return true;
}
-static enum es_result verify_exception_info(struct ghcb *ghcb, struct es_em_ctxt *ctxt)
-{
- u32 ret;
-
- ret = ghcb->save.sw_exit_info_1 & GENMASK_ULL(31, 0);
- if (!ret)
- return ES_OK;
-
- if (ret == 1) {
- u64 info = ghcb->save.sw_exit_info_2;
- unsigned long v = info & SVM_EVTINJ_VEC_MASK;
-
- /* Check if exception information from hypervisor is sane. */
- if ((info & SVM_EVTINJ_VALID) &&
- ((v == X86_TRAP_GP) || (v == X86_TRAP_UD)) &&
- ((info & SVM_EVTINJ_TYPE_MASK) == SVM_EVTINJ_TYPE_EXEPT)) {
- ctxt->fi.vector = v;
-
- if (info & SVM_EVTINJ_VALID_ERR)
- ctxt->fi.error_code = info >> 32;
-
- return ES_EXCEPTION;
- }
- }
-
- return ES_VMM_ERROR;
-}
-
-static inline int svsm_process_result_codes(struct svsm_call *call)
-{
- switch (call->rax_out) {
- case SVSM_SUCCESS:
- return 0;
- case SVSM_ERR_INCOMPLETE:
- case SVSM_ERR_BUSY:
- return -EAGAIN;
- default:
- return -EINVAL;
- }
-}
-
-/*
- * Issue a VMGEXIT to call the SVSM:
- * - Load the SVSM register state (RAX, RCX, RDX, R8 and R9)
- * - Set the CA call pending field to 1
- * - Issue VMGEXIT
- * - Save the SVSM return register state (RAX, RCX, RDX, R8 and R9)
- * - Perform atomic exchange of the CA call pending field
- *
- * - See the "Secure VM Service Module for SEV-SNP Guests" specification for
- * details on the calling convention.
- * - The calling convention loosely follows the Microsoft X64 calling
- * convention by putting arguments in RCX, RDX, R8 and R9.
- * - RAX specifies the SVSM protocol/callid as input and the return code
- * as output.
- */
-static __always_inline void svsm_issue_call(struct svsm_call *call, u8 *pending)
-{
- register unsigned long rax asm("rax") = call->rax;
- register unsigned long rcx asm("rcx") = call->rcx;
- register unsigned long rdx asm("rdx") = call->rdx;
- register unsigned long r8 asm("r8") = call->r8;
- register unsigned long r9 asm("r9") = call->r9;
-
- call->caa->call_pending = 1;
-
- asm volatile("rep; vmmcall\n\t"
- : "+r" (rax), "+r" (rcx), "+r" (rdx), "+r" (r8), "+r" (r9)
- : : "memory");
-
- *pending = xchg(&call->caa->call_pending, *pending);
-
- call->rax_out = rax;
- call->rcx_out = rcx;
- call->rdx_out = rdx;
- call->r8_out = r8;
- call->r9_out = r9;
-}
-
-static int svsm_perform_msr_protocol(struct svsm_call *call)
+int svsm_perform_msr_protocol(struct svsm_call *call)
{
u8 pending = 0;
u64 val, resp;
@@ -247,63 +150,6 @@ static int svsm_perform_msr_protocol(struct svsm_call *call)
return svsm_process_result_codes(call);
}
-static int svsm_perform_ghcb_protocol(struct ghcb *ghcb, struct svsm_call *call)
-{
- struct es_em_ctxt ctxt;
- u8 pending = 0;
-
- vc_ghcb_invalidate(ghcb);
-
- /*
- * Fill in protocol and format specifiers. This can be called very early
- * in the boot, so use rip-relative references as needed.
- */
- ghcb->protocol_version = ghcb_version;
- ghcb->ghcb_usage = GHCB_DEFAULT_USAGE;
-
- ghcb_set_sw_exit_code(ghcb, SVM_VMGEXIT_SNP_RUN_VMPL);
- ghcb_set_sw_exit_info_1(ghcb, 0);
- ghcb_set_sw_exit_info_2(ghcb, 0);
-
- sev_es_wr_ghcb_msr(__pa(ghcb));
-
- svsm_issue_call(call, &pending);
-
- if (pending)
- return -EINVAL;
-
- switch (verify_exception_info(ghcb, &ctxt)) {
- case ES_OK:
- break;
- case ES_EXCEPTION:
- vc_forward_exception(&ctxt);
- fallthrough;
- default:
- return -EINVAL;
- }
-
- return svsm_process_result_codes(call);
-}
-
-enum es_result sev_es_ghcb_hv_call(struct ghcb *ghcb,
- struct es_em_ctxt *ctxt,
- u64 exit_code, u64 exit_info_1,
- u64 exit_info_2)
-{
- /* Fill in protocol and format specifiers */
- ghcb->protocol_version = ghcb_version;
- ghcb->ghcb_usage = GHCB_DEFAULT_USAGE;
-
- ghcb_set_sw_exit_code(ghcb, exit_code);
- ghcb_set_sw_exit_info_1(ghcb, exit_info_1);
- ghcb_set_sw_exit_info_2(ghcb, exit_info_2);
-
- sev_es_wr_ghcb_msr(__pa(ghcb));
- VMGEXIT();
-
- return verify_exception_info(ghcb, ctxt);
-}
-
static int __sev_cpuid_hv(u32 fn, int reg_idx, u32 *reg)
{
u64 val;
@@ -755,7 +601,7 @@ static void __head svsm_pval_4k_page(unsigned long paddr, bool validate)
call.rax = SVSM_CORE_CALL(SVSM_CORE_PVALIDATE);
call.rcx = pc_pa;
- ret = svsm_perform_call_protocol(&call);
+ ret = svsm_perform_msr_protocol(&call);
if (ret)
sev_es_terminate(SEV_TERM_SET_LINUX, GHCB_TERM_PVALIDATE);
diff --git a/arch/x86/boot/startup/sev-startup.c b/arch/x86/boot/startup/sev-startup.c
index 435853a55768..797ca3e29b12 100644
--- a/arch/x86/boot/startup/sev-startup.c
+++ b/arch/x86/boot/startup/sev-startup.c
@@ -139,39 +139,6 @@ noinstr void __sev_put_ghcb(struct ghcb_state *state)
}
}
-int svsm_perform_call_protocol(struct svsm_call *call)
-{
- struct ghcb_state state;
- unsigned long flags;
- struct ghcb *ghcb;
- int ret;
-
- /*
- * This can be called very early in the boot, use native functions in
- * order to avoid paravirt issues.
- */
- flags = native_local_irq_save();
-
- if (sev_cfg.ghcbs_initialized)
- ghcb = __sev_get_ghcb(&state);
- else if (boot_ghcb)
- ghcb = boot_ghcb;
- else
- ghcb = NULL;
-
- do {
- ret = ghcb ? svsm_perform_ghcb_protocol(ghcb, call)
- : svsm_perform_msr_protocol(call);
- } while (ret == -EAGAIN);
-
- if (sev_cfg.ghcbs_initialized)
- __sev_put_ghcb(&state);
-
- native_local_irq_restore(flags);
-
- return ret;
-}
-
void __head
early_set_pages_state(unsigned long vaddr, unsigned long paddr,
unsigned long npages, enum psc_op op)
@@ -325,7 +292,7 @@ static __head void svsm_setup(struct cc_blob_sev_info *cc_info)
call.caa = svsm_get_caa();
call.rax = SVSM_CORE_CALL(SVSM_CORE_REMAP_CA);
call.rcx = pa;
- ret = svsm_perform_call_protocol(&call);
+ ret = svsm_perform_msr_protocol(&call);
if (ret)
sev_es_terminate(SEV_TERM_SET_LINUX, GHCB_TERM_SVSM_CA_REMAP_FAIL);
diff --git a/arch/x86/coco/sev/core.c b/arch/x86/coco/sev/core.c
index fd5cd7d9b5f2..883b2719986d 100644
--- a/arch/x86/coco/sev/core.c
+++ b/arch/x86/coco/sev/core.c
@@ -153,6 +153,73 @@ static u64 __init get_jump_table_addr(void)
return ret;
}
+static int svsm_perform_ghcb_protocol(struct ghcb *ghcb, struct svsm_call *call)
+{
+ struct es_em_ctxt ctxt;
+ u8 pending = 0;
+
+ vc_ghcb_invalidate(ghcb);
+
+ /*
+ * Fill in protocol and format specifiers. This can be called very early
+ * in the boot, so use rip-relative references as needed.
+ */
+ ghcb->protocol_version = ghcb_version;
+ ghcb->ghcb_usage = GHCB_DEFAULT_USAGE;
+
+ ghcb_set_sw_exit_code(ghcb, SVM_VMGEXIT_SNP_RUN_VMPL);
+ ghcb_set_sw_exit_info_1(ghcb, 0);
+ ghcb_set_sw_exit_info_2(ghcb, 0);
+
+ sev_es_wr_ghcb_msr(__pa(ghcb));
+
+ svsm_issue_call(call, &pending);
+
+ if (pending)
+ return -EINVAL;
+
+ switch (verify_exception_info(ghcb, &ctxt)) {
+ case ES_OK:
+ break;
+ case ES_EXCEPTION:
+ vc_forward_exception(&ctxt);
+ fallthrough;
+ default:
+ return -EINVAL;
+ }
+
+ return svsm_process_result_codes(call);
+}
+
+static int svsm_perform_call_protocol(struct svsm_call *call)
+{
+ struct ghcb_state state;
+ unsigned long flags;
+ struct ghcb *ghcb;
+ int ret;
+
+ flags = native_local_irq_save();
+
+ if (sev_cfg.ghcbs_initialized)
+ ghcb = __sev_get_ghcb(&state);
+ else if (boot_ghcb)
+ ghcb = boot_ghcb;
+ else
+ ghcb = NULL;
+
+ do {
+ ret = ghcb ? svsm_perform_ghcb_protocol(ghcb, call)
+ : svsm_perform_msr_protocol(call);
+ } while (ret == -EAGAIN);
+
+ if (sev_cfg.ghcbs_initialized)
+ __sev_put_ghcb(&state);
+
+ native_local_irq_restore(flags);
+
+ return ret;
+}
+
static inline void __pval_terminate(u64 pfn, bool action, unsigned int page_size,
int ret, u64 svsm_ret)
{
diff --git a/arch/x86/coco/sev/vc-shared.c b/arch/x86/coco/sev/vc-shared.c
index b4688f69102e..6df758560803 100644
--- a/arch/x86/coco/sev/vc-shared.c
+++ b/arch/x86/coco/sev/vc-shared.c
@@ -409,6 +409,53 @@ static enum es_result vc_handle_ioio(struct ghcb *ghcb, struct es_em_ctxt *ctxt)
return ret;
}
+enum es_result verify_exception_info(struct ghcb *ghcb, struct es_em_ctxt *ctxt)
+{
+ u32 ret;
+
+ ret = ghcb->save.sw_exit_info_1 & GENMASK_ULL(31, 0);
+ if (!ret)
+ return ES_OK;
+
+ if (ret == 1) {
+ u64 info = ghcb->save.sw_exit_info_2;
+ unsigned long v = info & SVM_EVTINJ_VEC_MASK;
+
+ /* Check if exception information from hypervisor is sane. */
+ if ((info & SVM_EVTINJ_VALID) &&
+ ((v == X86_TRAP_GP) || (v == X86_TRAP_UD)) &&
+ ((info & SVM_EVTINJ_TYPE_MASK) == SVM_EVTINJ_TYPE_EXEPT)) {
+ ctxt->fi.vector = v;
+
+ if (info & SVM_EVTINJ_VALID_ERR)
+ ctxt->fi.error_code = info >> 32;
+
+ return ES_EXCEPTION;
+ }
+ }
+
+ return ES_VMM_ERROR;
+}
+
+enum es_result sev_es_ghcb_hv_call(struct ghcb *ghcb,
+ struct es_em_ctxt *ctxt,
+ u64 exit_code, u64 exit_info_1,
+ u64 exit_info_2)
+{
+ /* Fill in protocol and format specifiers */
+ ghcb->protocol_version = ghcb_version;
+ ghcb->ghcb_usage = GHCB_DEFAULT_USAGE;
+
+ ghcb_set_sw_exit_code(ghcb, exit_code);
+ ghcb_set_sw_exit_info_1(ghcb, exit_info_1);
+ ghcb_set_sw_exit_info_2(ghcb, exit_info_2);
+
+ sev_es_wr_ghcb_msr(__pa(ghcb));
+ VMGEXIT();
+
+ return verify_exception_info(ghcb, ctxt);
+}
+
static int __sev_cpuid_hv_ghcb(struct ghcb *ghcb, struct es_em_ctxt *ctxt, struct cpuid_leaf *leaf)
{
u32 cr4 = native_read_cr4();
@@ -549,3 +596,19 @@ static enum es_result vc_handle_rdtsc(struct ghcb *ghcb,
return ES_OK;
}
+
+void snp_register_ghcb_early(unsigned long paddr)
+{
+ unsigned long pfn = paddr >> PAGE_SHIFT;
+ u64 val;
+
+ sev_es_wr_ghcb_msr(GHCB_MSR_REG_GPA_REQ_VAL(pfn));
+ VMGEXIT();
+
+ val = sev_es_rd_ghcb_msr();
+
+ /* If the response GPA is not ours then abort the guest */
+ if ((GHCB_RESP_CODE(val) != GHCB_MSR_REG_GPA_RESP) ||
+ (GHCB_MSR_REG_GPA_RESP_VAL(val) != pfn))
+ sev_es_terminate(SEV_TERM_SET_LINUX, GHCB_TERM_REGISTER);
+}
diff --git a/arch/x86/include/asm/sev-internal.h b/arch/x86/include/asm/sev-internal.h
index b7232081f8f7..0d02e780beb3 100644
--- a/arch/x86/include/asm/sev-internal.h
+++ b/arch/x86/include/asm/sev-internal.h
@@ -80,7 +80,8 @@ static __always_inline u64 svsm_get_caa_pa(void)
return boot_svsm_caa_pa;
}
-int svsm_perform_call_protocol(struct svsm_call *call);
+enum es_result verify_exception_info(struct ghcb *ghcb, struct es_em_ctxt *ctxt);
+void vc_forward_exception(struct es_em_ctxt *ctxt);
static inline u64 sev_es_rd_ghcb_msr(void)
{
@@ -97,7 +98,6 @@ static __always_inline void sev_es_wr_ghcb_msr(u64 val)
native_wrmsr(MSR_AMD64_SEV_ES_GHCB, low, high);
}
-void snp_register_ghcb_early(unsigned long paddr);
bool sev_es_negotiate_protocol(void);
bool sev_es_check_cpu_features(void);
u64 get_hv_features(void);
diff --git a/arch/x86/include/asm/sev.h b/arch/x86/include/asm/sev.h
index f50a606be4f1..ca7168cc2118 100644
--- a/arch/x86/include/asm/sev.h
+++ b/arch/x86/include/asm/sev.h
@@ -485,6 +485,7 @@ static inline int pvalidate(unsigned long vaddr, bool rmp_psize, bool validate)
struct snp_guest_request_ioctl;
void setup_ghcb(void);
+void snp_register_ghcb_early(unsigned long paddr);
void early_snp_set_memory_private(unsigned long vaddr, unsigned long paddr,
unsigned long npages);
void early_snp_set_memory_shared(unsigned long vaddr, unsigned long paddr,
@@ -521,8 +522,6 @@ static __always_inline void vc_ghcb_invalidate(struct ghcb *ghcb)
__builtin_memset(ghcb->save.valid_bitmap, 0, sizeof(ghcb->save.valid_bitmap));
}
-void vc_forward_exception(struct es_em_ctxt *ctxt);
-
/* I/O parameters for CPUID-related helpers */
struct cpuid_leaf {
u32 fn;
@@ -533,15 +532,68 @@ struct cpuid_leaf {
u32 edx;
};
+int svsm_perform_msr_protocol(struct svsm_call *call);
int snp_cpuid(void (*cpuid_hv)(void *ctx, struct cpuid_leaf *),
void *ctx, struct cpuid_leaf *leaf);
+/*
+ * Issue a VMGEXIT to call the SVSM:
+ * - Load the SVSM register state (RAX, RCX, RDX, R8 and R9)
+ * - Set the CA call pending field to 1
+ * - Issue VMGEXIT
+ * - Save the SVSM return register state (RAX, RCX, RDX, R8 and R9)
+ * - Perform atomic exchange of the CA call pending field
+ *
+ * - See the "Secure VM Service Module for SEV-SNP Guests" specification for
+ * details on the calling convention.
+ * - The calling convention loosely follows the Microsoft X64 calling
+ * convention by putting arguments in RCX, RDX, R8 and R9.
+ * - RAX specifies the SVSM protocol/callid as input and the return code
+ * as output.
+ */
+static __always_inline void svsm_issue_call(struct svsm_call *call, u8 *pending)
+{
+ register unsigned long rax asm("rax") = call->rax;
+ register unsigned long rcx asm("rcx") = call->rcx;
+ register unsigned long rdx asm("rdx") = call->rdx;
+ register unsigned long r8 asm("r8") = call->r8;
+ register unsigned long r9 asm("r9") = call->r9;
+
+ call->caa->call_pending = 1;
+
+ asm volatile("rep; vmmcall\n\t"
+ : "+r" (rax), "+r" (rcx), "+r" (rdx), "+r" (r8), "+r" (r9)
+ : : "memory");
+
+ *pending = xchg(&call->caa->call_pending, *pending);
+
+ call->rax_out = rax;
+ call->rcx_out = rcx;
+ call->rdx_out = rdx;
+ call->r8_out = r8;
+ call->r9_out = r9;
+}
+
+static inline int svsm_process_result_codes(struct svsm_call *call)
+{
+ switch (call->rax_out) {
+ case SVSM_SUCCESS:
+ return 0;
+ case SVSM_ERR_INCOMPLETE:
+ case SVSM_ERR_BUSY:
+ return -EAGAIN;
+ default:
+ return -EINVAL;
+ }
+}
+
void __noreturn sev_es_terminate(unsigned int set, unsigned int reason);
enum es_result sev_es_ghcb_hv_call(struct ghcb *ghcb,
struct es_em_ctxt *ctxt,
u64 exit_code, u64 exit_info_1,
u64 exit_info_2);
+extern u16 ghcb_version;
extern struct ghcb *boot_ghcb;
#else /* !CONFIG_AMD_MEM_ENCRYPT */
--
2.49.0.906.g1f30a19c02-goog
* [RFT PATCH v2 10/23] x86/sev: Use boot SVSM CA for all startup and init code
2025-05-04 9:52 [RFT PATCH v2 00/23] x86: strict separation of startup code Ard Biesheuvel
` (8 preceding siblings ...)
2025-05-04 9:52 ` [RFT PATCH v2 09/23] x86/sev: Move GHCB page based HV communication out of startup code Ard Biesheuvel
@ 2025-05-04 9:52 ` Ard Biesheuvel
2025-05-04 9:52 ` [RFT PATCH v2 11/23] x86/boot: Drop redundant RMPADJUST in SEV SVSM presence check Ard Biesheuvel
` (14 subsequent siblings)
24 siblings, 0 replies; 59+ messages in thread
From: Ard Biesheuvel @ 2025-05-04 9:52 UTC (permalink / raw)
To: linux-kernel
Cc: linux-efi, x86, Ard Biesheuvel, Borislav Petkov, Ingo Molnar,
Dionna Amalie Glaze, Kevin Loughlin, Tom Lendacky
From: Ard Biesheuvel <ardb@kernel.org>
To avoid having to reason about whether or not to use the per-CPU SVSM
calling area when running startup and init code on the boot CPU, reuse
the boot SVSM calling area as the per-CPU area for CPU #0.
This removes the need to make the per-CPU variables and associated state
in sev_cfg accessible to the startup code once confined.
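The effective calling area selection after this change, condensed from
the hunks below; since CPU #0's per-CPU pointer is initialized to the
boot CA, the switch to per-CPU CAs is transparent for the boot CPU:

    static inline struct svsm_ca *svsm_get_caa(void)
    {
            /*
             * Before sev_es_init_vc_handling(): boot CA for everyone.
             * Afterwards: per-CPU CA, which for CPU #0 is the boot CA.
             */
            return sev_cfg.use_cas ? this_cpu_read(svsm_caa)
                                   : boot_svsm_caa;
    }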
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
---
arch/x86/boot/compressed/sev.c | 13 ------
arch/x86/boot/startup/sev-shared.c | 4 +-
arch/x86/boot/startup/sev-startup.c | 6 +--
arch/x86/coco/sev/core.c | 47 +++++++++-----------
arch/x86/include/asm/sev-internal.h | 16 -------
5 files changed, 26 insertions(+), 60 deletions(-)
diff --git a/arch/x86/boot/compressed/sev.c b/arch/x86/boot/compressed/sev.c
index 9b6eebc24e78..91e8140250f6 100644
--- a/arch/x86/boot/compressed/sev.c
+++ b/arch/x86/boot/compressed/sev.c
@@ -37,19 +37,6 @@ struct ghcb *boot_ghcb;
#define __BOOT_COMPRESSED
-extern struct svsm_ca *boot_svsm_caa;
-extern u64 boot_svsm_caa_pa;
-
-struct svsm_ca *svsm_get_caa(void)
-{
- return boot_svsm_caa;
-}
-
-u64 svsm_get_caa_pa(void)
-{
- return boot_svsm_caa_pa;
-}
-
u8 snp_vmpl;
/* Include code for early handlers */
diff --git a/arch/x86/boot/startup/sev-shared.c b/arch/x86/boot/startup/sev-shared.c
index 0709c8a8655a..b1f4b9b15045 100644
--- a/arch/x86/boot/startup/sev-shared.c
+++ b/arch/x86/boot/startup/sev-shared.c
@@ -585,10 +585,10 @@ static void __head svsm_pval_4k_page(unsigned long paddr, bool validate)
*/
flags = native_local_irq_save();
- call.caa = svsm_get_caa();
+ call.caa = boot_svsm_caa;
pc = (struct svsm_pvalidate_call *)call.caa->svsm_buffer;
- pc_pa = svsm_get_caa_pa() + offsetof(struct svsm_ca, svsm_buffer);
+ pc_pa = boot_svsm_caa_pa + offsetof(struct svsm_ca, svsm_buffer);
pc->num_entries = 1;
pc->cur_index = 0;
diff --git a/arch/x86/boot/startup/sev-startup.c b/arch/x86/boot/startup/sev-startup.c
index 797ca3e29b12..ca6a9863ffab 100644
--- a/arch/x86/boot/startup/sev-startup.c
+++ b/arch/x86/boot/startup/sev-startup.c
@@ -59,9 +59,6 @@ u64 sev_secrets_pa __ro_after_init;
/* For early boot SVSM communication */
struct svsm_ca boot_svsm_ca_page __aligned(PAGE_SIZE);
-DEFINE_PER_CPU(struct svsm_ca *, svsm_caa);
-DEFINE_PER_CPU(u64, svsm_caa_pa);
-
/*
* Nothing shall interrupt this code path while holding the per-CPU
* GHCB. The backup GHCB is only for NMIs interrupting this path.
@@ -261,6 +258,7 @@ static __head struct cc_blob_sev_info *find_cc_blob(struct boot_params *bp)
static __head void svsm_setup(struct cc_blob_sev_info *cc_info)
{
+ struct snp_secrets_page *secrets = (void *)cc_info->secrets_phys;
struct svsm_call call = {};
int ret;
u64 pa;
@@ -289,7 +287,7 @@ static __head void svsm_setup(struct cc_blob_sev_info *cc_info)
* RAX = 0 (Protocol=0, CallID=0)
* RCX = New CA GPA
*/
- call.caa = svsm_get_caa();
+ call.caa = (struct svsm_ca *)secrets->svsm_caa;
call.rax = SVSM_CORE_CALL(SVSM_CORE_REMAP_CA);
call.rcx = pa;
ret = svsm_perform_msr_protocol(&call);
diff --git a/arch/x86/coco/sev/core.c b/arch/x86/coco/sev/core.c
index 883b2719986d..36edf670ff19 100644
--- a/arch/x86/coco/sev/core.c
+++ b/arch/x86/coco/sev/core.c
@@ -45,6 +45,25 @@
#include <asm/cpuid.h>
#include <asm/cmdline.h>
+DEFINE_PER_CPU(struct svsm_ca *, svsm_caa);
+DEFINE_PER_CPU(u64, svsm_caa_pa);
+
+static inline struct svsm_ca *svsm_get_caa(void)
+{
+ if (sev_cfg.use_cas)
+ return this_cpu_read(svsm_caa);
+ else
+ return boot_svsm_caa;
+}
+
+static inline u64 svsm_get_caa_pa(void)
+{
+ if (sev_cfg.use_cas)
+ return this_cpu_read(svsm_caa_pa);
+ else
+ return boot_svsm_caa_pa;
+}
+
/* AP INIT values as documented in the APM2 section "Processor Initialization State" */
#define AP_INIT_CS_LIMIT 0xffff
#define AP_INIT_DS_LIMIT 0xffff
@@ -1207,7 +1226,8 @@ static void __init alloc_runtime_data(int cpu)
struct svsm_ca *caa;
/* Allocate the SVSM CA page if an SVSM is present */
- caa = memblock_alloc_or_panic(sizeof(*caa), PAGE_SIZE);
+ caa = cpu ? memblock_alloc_or_panic(sizeof(*caa), PAGE_SIZE)
+ : boot_svsm_caa;
per_cpu(svsm_caa, cpu) = caa;
per_cpu(svsm_caa_pa, cpu) = __pa(caa);
@@ -1261,32 +1281,9 @@ void __init sev_es_init_vc_handling(void)
init_ghcb(cpu);
}
- /* If running under an SVSM, switch to the per-cpu CA */
- if (snp_vmpl) {
- struct svsm_call call = {};
- unsigned long flags;
- int ret;
-
- local_irq_save(flags);
-
- /*
- * SVSM_CORE_REMAP_CA call:
- * RAX = 0 (Protocol=0, CallID=0)
- * RCX = New CA GPA
- */
- call.caa = svsm_get_caa();
- call.rax = SVSM_CORE_CALL(SVSM_CORE_REMAP_CA);
- call.rcx = this_cpu_read(svsm_caa_pa);
- ret = svsm_perform_call_protocol(&call);
- if (ret)
- panic("Can't remap the SVSM CA, ret=%d, rax_out=0x%llx\n",
- ret, call.rax_out);
-
+ if (snp_vmpl)
sev_cfg.use_cas = true;
- local_irq_restore(flags);
- }
-
sev_es_setup_play_dead();
/* Secondary CPUs use the runtime #VC handler */
diff --git a/arch/x86/include/asm/sev-internal.h b/arch/x86/include/asm/sev-internal.h
index 0d02e780beb3..4335711274e3 100644
--- a/arch/x86/include/asm/sev-internal.h
+++ b/arch/x86/include/asm/sev-internal.h
@@ -64,22 +64,6 @@ DECLARE_PER_CPU(u64, svsm_caa_pa);
extern struct svsm_ca *boot_svsm_caa;
extern u64 boot_svsm_caa_pa;
-static __always_inline struct svsm_ca *svsm_get_caa(void)
-{
- if (sev_cfg.use_cas)
- return this_cpu_read(svsm_caa);
- else
- return boot_svsm_caa;
-}
-
-static __always_inline u64 svsm_get_caa_pa(void)
-{
- if (sev_cfg.use_cas)
- return this_cpu_read(svsm_caa_pa);
- else
- return boot_svsm_caa_pa;
-}
-
enum es_result verify_exception_info(struct ghcb *ghcb, struct es_em_ctxt *ctxt);
void vc_forward_exception(struct es_em_ctxt *ctxt);
--
2.49.0.906.g1f30a19c02-goog
* [RFT PATCH v2 11/23] x86/boot: Drop redundant RMPADJUST in SEV SVSM presence check
2025-05-04 9:52 [RFT PATCH v2 00/23] x86: strict separation of startup code Ard Biesheuvel
` (9 preceding siblings ...)
2025-05-04 9:52 ` [RFT PATCH v2 10/23] x86/sev: Use boot SVSM CA for all startup and init code Ard Biesheuvel
@ 2025-05-04 9:52 ` Ard Biesheuvel
2025-05-04 9:52 ` [RFT PATCH v2 12/23] x86/sev: Unify SEV-SNP hypervisor feature check Ard Biesheuvel
` (13 subsequent siblings)
24 siblings, 0 replies; 59+ messages in thread
From: Ard Biesheuvel @ 2025-05-04 9:52 UTC (permalink / raw)
To: linux-kernel
Cc: linux-efi, x86, Ard Biesheuvel, Borislav Petkov, Ingo Molnar,
Dionna Amalie Glaze, Kevin Loughlin, Tom Lendacky
From: Ard Biesheuvel <ardb@kernel.org>
snp_vmpl will be assigned a non-zero value when executing at a VMPL
other than 0, and this is inferred from a call to RMPADJUST, which only
works when running at VMPL0.
This means that testing snp_vmpl is sufficient, and there is no need to
perform the same check again.
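The presence check thus collapses to the condition below: snp_vmpl == 0
already proves execution at VMPL0, and any other value requires
multi-VMPL support from the hypervisor.

    if (snp_vmpl > 0 && !(hv_features & GHCB_HV_FT_SNP_MULTI_VMPL))
            sev_es_terminate(SEV_TERM_SET_LINUX, GHCB_TERM_NOT_VMPL0);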
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
---
arch/x86/boot/compressed/sev.c | 20 +++-----------------
1 file changed, 3 insertions(+), 17 deletions(-)
diff --git a/arch/x86/boot/compressed/sev.c b/arch/x86/boot/compressed/sev.c
index 91e8140250f6..18ba81c0b0ed 100644
--- a/arch/x86/boot/compressed/sev.c
+++ b/arch/x86/boot/compressed/sev.c
@@ -420,30 +420,16 @@ void sev_enable(struct boot_params *bp)
*/
if (sev_status & MSR_AMD64_SEV_SNP_ENABLED) {
u64 hv_features;
- int ret;
hv_features = get_hv_features();
if (!(hv_features & GHCB_HV_FT_SNP))
sev_es_terminate(SEV_TERM_SET_GEN, GHCB_SNP_UNSUPPORTED);
/*
- * Enforce running at VMPL0 or with an SVSM.
- *
- * Use RMPADJUST (see the rmpadjust() function for a description of
- * what the instruction does) to update the VMPL1 permissions of a
- * page. If the guest is running at VMPL0, this will succeed. If the
- * guest is running at any other VMPL, this will fail. Linux SNP guests
- * only ever run at a single VMPL level so permission mask changes of a
- * lesser-privileged VMPL are a don't-care.
+ * Running at VMPL0 is required unless an SVSM is present and
+ * the hypervisor supports the required SVSM GHCB events.
*/
- ret = rmpadjust((unsigned long)&boot_ghcb_page, RMP_PG_SIZE_4K, 1);
-
- /*
- * Running at VMPL0 is not required if an SVSM is present and the hypervisor
- * supports the required SVSM GHCB events.
- */
- if (ret &&
- !(snp_vmpl && (hv_features & GHCB_HV_FT_SNP_MULTI_VMPL)))
+ if (snp_vmpl > 0 && !(hv_features & GHCB_HV_FT_SNP_MULTI_VMPL))
sev_es_terminate(SEV_TERM_SET_LINUX, GHCB_TERM_NOT_VMPL0);
}
--
2.49.0.906.g1f30a19c02-goog
* [RFT PATCH v2 12/23] x86/sev: Unify SEV-SNP hypervisor feature check
2025-05-04 9:52 [RFT PATCH v2 00/23] x86: strict separation of startup code Ard Biesheuvel
` (10 preceding siblings ...)
2025-05-04 9:52 ` [RFT PATCH v2 11/23] x86/boot: Drop redundant RMPADJUST in SEV SVSM presence check Ard Biesheuvel
@ 2025-05-04 9:52 ` Ard Biesheuvel
2025-05-04 9:52 ` [RFT PATCH v2 13/23] x86/linkage: Add SYM_PIC_ALIAS() macro helper to emit symbol aliases Ard Biesheuvel
` (12 subsequent siblings)
24 siblings, 0 replies; 59+ messages in thread
From: Ard Biesheuvel @ 2025-05-04 9:52 UTC (permalink / raw)
To: linux-kernel
Cc: linux-efi, x86, Ard Biesheuvel, Borislav Petkov, Ingo Molnar,
Dionna Amalie Glaze, Kevin Loughlin, Tom Lendacky
From: Ard Biesheuvel <ardb@kernel.org>
The decompressor and the core kernel both check the feature mask exposed
by the hypervisor, but test it in slightly different ways. This disparity
seems unintentional, and is simply a result of the fact that the two code
bases have evolved differently over time when it comes to setting up the
SEV-SNP execution context.
So move the HV feature check into a helper function and call that
instead. For the core kernel, move the check to an earlier boot stage,
right after the point where it is established that the guest is
executing in SEV-SNP mode.
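Both boot stages end up funnelling through the same helper, condensed
from the hunks below:

    /* decompressor, in sev_enable(): terminates on missing features */
    snp_check_hv_features();

    /* core kernel, in sme_enable(), once SNP mode is established */
    sev_hv_features = snp_check_hv_features();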
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
---
arch/x86/boot/compressed/sev.c | 19 +----------
arch/x86/boot/startup/sev-shared.c | 33 +++++++++++++++-----
arch/x86/boot/startup/sme.c | 2 ++
arch/x86/coco/sev/core.c | 11 -------
arch/x86/include/asm/sev-internal.h | 2 +-
arch/x86/include/asm/sev.h | 2 ++
6 files changed, 32 insertions(+), 37 deletions(-)
diff --git a/arch/x86/boot/compressed/sev.c b/arch/x86/boot/compressed/sev.c
index 18ba81c0b0ed..79550736ad2a 100644
--- a/arch/x86/boot/compressed/sev.c
+++ b/arch/x86/boot/compressed/sev.c
@@ -414,24 +414,7 @@ void sev_enable(struct boot_params *bp)
sev_es_terminate(SEV_TERM_SET_GEN, GHCB_SEV_ES_PROT_UNSUPPORTED);
}
- /*
- * SNP is supported in v2 of the GHCB spec which mandates support for HV
- * features.
- */
- if (sev_status & MSR_AMD64_SEV_SNP_ENABLED) {
- u64 hv_features;
-
- hv_features = get_hv_features();
- if (!(hv_features & GHCB_HV_FT_SNP))
- sev_es_terminate(SEV_TERM_SET_GEN, GHCB_SNP_UNSUPPORTED);
-
- /*
- * Running at VMPL0 is required unless an SVSM is present and
- * the hypervisor supports the required SVSM GHCB events.
- */
- if (snp_vmpl > 0 && !(hv_features & GHCB_HV_FT_SNP_MULTI_VMPL))
- sev_es_terminate(SEV_TERM_SET_LINUX, GHCB_TERM_NOT_VMPL0);
- }
+ snp_check_hv_features();
if (snp && !(sev_status & MSR_AMD64_SEV_SNP_ENABLED))
error("SEV-SNP supported indicated by CC blob, but not SEV status MSR.");
diff --git a/arch/x86/boot/startup/sev-shared.c b/arch/x86/boot/startup/sev-shared.c
index b1f4b9b15045..b72c2b9c40c3 100644
--- a/arch/x86/boot/startup/sev-shared.c
+++ b/arch/x86/boot/startup/sev-shared.c
@@ -78,16 +78,10 @@ sev_es_terminate(unsigned int set, unsigned int reason)
asm volatile("hlt\n" : : : "memory");
}
-/*
- * The hypervisor features are available from GHCB version 2 onward.
- */
-u64 get_hv_features(void)
+static u64 __head get_hv_features(void)
{
u64 val;
- if (ghcb_version < 2)
- return 0;
-
sev_es_wr_ghcb_msr(GHCB_MSR_HV_FT_REQ);
VMGEXIT();
@@ -98,6 +92,31 @@ u64 get_hv_features(void)
return GHCB_MSR_HV_FT_RESP_VAL(val);
}
+u64 __head snp_check_hv_features(void)
+{
+ /*
+ * SNP is supported in v2 of the GHCB spec which mandates support for HV
+ * features.
+ */
+ if (sev_status & MSR_AMD64_SEV_SNP_ENABLED) {
+ u64 hv_features;
+
+ hv_features = get_hv_features();
+ if (!(hv_features & GHCB_HV_FT_SNP))
+ sev_es_terminate(SEV_TERM_SET_GEN, GHCB_SNP_UNSUPPORTED);
+
+ /*
+ * Running at VMPL0 is required unless an SVSM is present and
+ * the hypervisor supports the required SVSM GHCB events.
+ */
+ if (snp_vmpl > 0 && !(hv_features & GHCB_HV_FT_SNP_MULTI_VMPL))
+ sev_es_terminate(SEV_TERM_SET_LINUX, GHCB_TERM_NOT_VMPL0);
+
+ return hv_features;
+ }
+ return 0;
+}
+
bool sev_es_negotiate_protocol(void)
{
u64 val;
diff --git a/arch/x86/boot/startup/sme.c b/arch/x86/boot/startup/sme.c
index ade5ad5060e9..7fc6a689cefe 100644
--- a/arch/x86/boot/startup/sme.c
+++ b/arch/x86/boot/startup/sme.c
@@ -524,6 +524,8 @@ void __head sme_enable(struct boot_params *bp)
if (snp_en ^ !!(msr & MSR_AMD64_SEV_SNP_ENABLED))
snp_abort();
+ sev_hv_features = snp_check_hv_features();
+
/* Check if memory encryption is enabled */
if (feature_mask == AMD_SME_BIT) {
if (!(bp->hdr.xloadflags & XLF_MEM_ENCRYPTION))
diff --git a/arch/x86/coco/sev/core.c b/arch/x86/coco/sev/core.c
index 36edf670ff19..106c231d8ded 100644
--- a/arch/x86/coco/sev/core.c
+++ b/arch/x86/coco/sev/core.c
@@ -1264,17 +1264,6 @@ void __init sev_es_init_vc_handling(void)
if (!sev_es_check_cpu_features())
panic("SEV-ES CPU Features missing");
- /*
- * SNP is supported in v2 of the GHCB spec which mandates support for HV
- * features.
- */
- if (cc_platform_has(CC_ATTR_GUEST_SEV_SNP)) {
- sev_hv_features = get_hv_features();
-
- if (!(sev_hv_features & GHCB_HV_FT_SNP))
- sev_es_terminate(SEV_TERM_SET_GEN, GHCB_SNP_UNSUPPORTED);
- }
-
/* Initialize per-cpu GHCB pages */
for_each_possible_cpu(cpu) {
alloc_runtime_data(cpu);
diff --git a/arch/x86/include/asm/sev-internal.h b/arch/x86/include/asm/sev-internal.h
index 4335711274e3..3aa59ad288ce 100644
--- a/arch/x86/include/asm/sev-internal.h
+++ b/arch/x86/include/asm/sev-internal.h
@@ -84,6 +84,6 @@ static __always_inline void sev_es_wr_ghcb_msr(u64 val)
bool sev_es_negotiate_protocol(void);
bool sev_es_check_cpu_features(void);
-u64 get_hv_features(void);
+u64 snp_check_hv_features(void);
const struct snp_cpuid_table *snp_cpuid_get_table(void);
diff --git a/arch/x86/include/asm/sev.h b/arch/x86/include/asm/sev.h
index ca7168cc2118..ca914f7384c6 100644
--- a/arch/x86/include/asm/sev.h
+++ b/arch/x86/include/asm/sev.h
@@ -418,6 +418,7 @@ struct svsm_call {
#ifdef CONFIG_AMD_MEM_ENCRYPT
extern u8 snp_vmpl;
+extern u64 sev_hv_features;
extern void __sev_es_ist_enter(struct pt_regs *regs);
extern void __sev_es_ist_exit(void);
@@ -495,6 +496,7 @@ void snp_set_memory_private(unsigned long vaddr, unsigned long npages);
void snp_set_wakeup_secondary_cpu(void);
bool snp_init(struct boot_params *bp);
void __noreturn snp_abort(void);
+u64 snp_check_hv_features(void);
void snp_dmi_setup(void);
int snp_issue_svsm_attest_req(u64 call_id, struct svsm_call *call, struct svsm_attest_call *input);
void snp_accept_memory(phys_addr_t start, phys_addr_t end);
--
2.49.0.906.g1f30a19c02-goog
* [RFT PATCH v2 13/23] x86/linkage: Add SYM_PIC_ALIAS() macro helper to emit symbol aliases
2025-05-04 9:52 [RFT PATCH v2 00/23] x86: strict separation of startup code Ard Biesheuvel
` (11 preceding siblings ...)
2025-05-04 9:52 ` [RFT PATCH v2 12/23] x86/sev: Unify SEV-SNP hypervisor feature check Ard Biesheuvel
@ 2025-05-04 9:52 ` Ard Biesheuvel
2025-05-04 14:20 ` [tip: x86/boot] " tip-bot2 for Ard Biesheuvel
2025-05-04 9:52 ` [RFT PATCH v2 14/23] x86/boot: Add a bunch of PIC aliases Ard Biesheuvel
` (11 subsequent siblings)
24 siblings, 1 reply; 59+ messages in thread
From: Ard Biesheuvel @ 2025-05-04 9:52 UTC (permalink / raw)
To: linux-kernel
Cc: linux-efi, x86, Ard Biesheuvel, Borislav Petkov, Ingo Molnar,
Dionna Amalie Glaze, Kevin Loughlin, Tom Lendacky
From: Ard Biesheuvel <ardb@kernel.org>
Startup code that may execute from the early 1:1 mapping of memory will
be confined into its own address space, and only be permitted to access
ordinary kernel symbols if this is known to be safe.
Introduce a macro helper SYM_PIC_ALIAS() that emits a __pi_ prefixed
alias for a symbol, which allows startup code to access it.
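Usage is the same from C and from assembly; the examples below are
taken from the next patch in this series:

    /* C: emit __pi_sev_status as an alias for sev_status */
    u64 sev_status __section(".data") = 0;
    SYM_PIC_ALIAS(sev_status);

    /* assembly: emit __pi_phys_base as an alias for phys_base */
    SYM_DATA(phys_base, .quad 0x0)
    SYM_PIC_ALIAS(phys_base);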
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
---
arch/x86/include/asm/linkage.h | 10 ++++++++++
1 file changed, 10 insertions(+)
diff --git a/arch/x86/include/asm/linkage.h b/arch/x86/include/asm/linkage.h
index b51d8a4673f5..9d38ae744a2e 100644
--- a/arch/x86/include/asm/linkage.h
+++ b/arch/x86/include/asm/linkage.h
@@ -141,5 +141,15 @@
#define SYM_FUNC_START_WEAK_NOALIGN(name) \
SYM_START(name, SYM_L_WEAK, SYM_A_NONE)
+/*
+ * Expose 'sym' to the startup code in arch/x86/boot/startup/, by emitting an
+ * alias prefixed with __pi_
+ */
+#ifdef __ASSEMBLER__
+#define SYM_PIC_ALIAS(sym) SYM_ALIAS(__pi_ ## sym, sym, SYM_L_GLOBAL)
+#else
+#define SYM_PIC_ALIAS(sym) extern typeof(sym) __PASTE(__pi_, sym) __alias(sym)
+#endif
+
#endif /* _ASM_X86_LINKAGE_H */
--
2.49.0.906.g1f30a19c02-goog
* [RFT PATCH v2 14/23] x86/boot: Add a bunch of PIC aliases
2025-05-04 9:52 [RFT PATCH v2 00/23] x86: strict separation of startup code Ard Biesheuvel
` (12 preceding siblings ...)
2025-05-04 9:52 ` [RFT PATCH v2 13/23] x86/linkage: Add SYM_PIC_ALIAS() macro helper to emit symbol aliases Ard Biesheuvel
@ 2025-05-04 9:52 ` Ard Biesheuvel
2025-05-04 14:20 ` [tip: x86/boot] " tip-bot2 for Ard Biesheuvel
2025-05-04 9:52 ` [RFT PATCH v2 15/23] x86/boot: Provide __pti_set_user_pgtbl() to startup code Ard Biesheuvel
` (10 subsequent siblings)
24 siblings, 1 reply; 59+ messages in thread
From: Ard Biesheuvel @ 2025-05-04 9:52 UTC (permalink / raw)
To: linux-kernel
Cc: linux-efi, x86, Ard Biesheuvel, Borislav Petkov, Ingo Molnar,
Dionna Amalie Glaze, Kevin Loughlin, Tom Lendacky
From: Ard Biesheuvel <ardb@kernel.org>
Add aliases for all the data objects that the startup code references -
this is needed so that this code can be moved into its own confined area
where it can only access symbols that have a __pi_ prefix.
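Symbols defined by the linker script cannot use the macro, so their
__pi_ aliases are spelled out directly in vmlinux.lds.S, e.g. (see the
hunk below):

    _text = .;
    __pi__text = .;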
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
---
arch/x86/coco/core.c | 2 ++
arch/x86/kernel/cpu/common.c | 1 +
arch/x86/kernel/head64.c | 4 ++++
arch/x86/kernel/head_64.S | 8 ++++++++
arch/x86/kernel/setup.c | 1 +
arch/x86/kernel/vmlinux.lds.S | 4 ++++
arch/x86/lib/memcpy_64.S | 1 +
arch/x86/lib/memset_64.S | 1 +
arch/x86/lib/retpoline.S | 2 ++
arch/x86/mm/mem_encrypt_amd.c | 2 ++
arch/x86/mm/pgtable.c | 1 +
tools/objtool/arch/x86/decode.c | 6 ++++--
12 files changed, 31 insertions(+), 2 deletions(-)
diff --git a/arch/x86/coco/core.c b/arch/x86/coco/core.c
index 9a0ddda3aa69..d4610af68114 100644
--- a/arch/x86/coco/core.c
+++ b/arch/x86/coco/core.c
@@ -18,7 +18,9 @@
#include <asm/processor.h>
enum cc_vendor cc_vendor __ro_after_init = CC_VENDOR_NONE;
+SYM_PIC_ALIAS(cc_vendor);
u64 cc_mask __ro_after_init;
+SYM_PIC_ALIAS(cc_mask);
static struct cc_attr_flags {
__u64 host_sev_snp : 1,
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index eb6a7f6e20c4..7b8753224f3e 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -240,6 +240,7 @@ DEFINE_PER_CPU_PAGE_ALIGNED(struct gdt_page, gdt_page) = { .gdt = {
#endif
} };
EXPORT_PER_CPU_SYMBOL_GPL(gdt_page);
+SYM_PIC_ALIAS(gdt_page);
#ifdef CONFIG_X86_64
static int __init x86_nopcid_setup(char *s)
diff --git a/arch/x86/kernel/head64.c b/arch/x86/kernel/head64.c
index 3d49abb1bb3a..b7da8b45b6d8 100644
--- a/arch/x86/kernel/head64.c
+++ b/arch/x86/kernel/head64.c
@@ -45,15 +45,19 @@
*/
extern pmd_t early_dynamic_pgts[EARLY_DYNAMIC_PAGE_TABLES][PTRS_PER_PMD];
unsigned int __initdata next_early_pgt;
+SYM_PIC_ALIAS(next_early_pgt);
pmdval_t early_pmd_flags = __PAGE_KERNEL_LARGE & ~(_PAGE_GLOBAL | _PAGE_NX);
#ifdef CONFIG_DYNAMIC_MEMORY_LAYOUT
unsigned long page_offset_base __ro_after_init = __PAGE_OFFSET_BASE_L4;
EXPORT_SYMBOL(page_offset_base);
+SYM_PIC_ALIAS(page_offset_base);
unsigned long vmalloc_base __ro_after_init = __VMALLOC_BASE_L4;
EXPORT_SYMBOL(vmalloc_base);
+SYM_PIC_ALIAS(vmalloc_base);
unsigned long vmemmap_base __ro_after_init = __VMEMMAP_BASE_L4;
EXPORT_SYMBOL(vmemmap_base);
+SYM_PIC_ALIAS(vmemmap_base);
#endif
/* Wipe all early page tables except for the kernel symbol map */
diff --git a/arch/x86/kernel/head_64.S b/arch/x86/kernel/head_64.S
index fefe2a25cf02..069420853304 100644
--- a/arch/x86/kernel/head_64.S
+++ b/arch/x86/kernel/head_64.S
@@ -573,6 +573,7 @@ SYM_CODE_START_NOALIGN(vc_no_ghcb)
/* Pure iret required here - don't use INTERRUPT_RETURN */
iretq
SYM_CODE_END(vc_no_ghcb)
+SYM_PIC_ALIAS(vc_no_ghcb);
#endif
#ifdef CONFIG_MITIGATION_PAGE_TABLE_ISOLATION
@@ -604,10 +605,12 @@ SYM_DATA_START_PTI_ALIGNED(early_top_pgt)
.quad level3_kernel_pgt - __START_KERNEL_map + _PAGE_TABLE_NOENC
.fill PTI_USER_PGD_FILL,8,0
SYM_DATA_END(early_top_pgt)
+SYM_PIC_ALIAS(early_top_pgt)
SYM_DATA_START_PAGE_ALIGNED(early_dynamic_pgts)
.fill 512*EARLY_DYNAMIC_PAGE_TABLES,8,0
SYM_DATA_END(early_dynamic_pgts)
+SYM_PIC_ALIAS(early_dynamic_pgts);
SYM_DATA(early_recursion_flag, .long 0)
@@ -651,6 +654,7 @@ SYM_DATA_START_PAGE_ALIGNED(level4_kernel_pgt)
.fill 511,8,0
.quad level3_kernel_pgt - __START_KERNEL_map + _PAGE_TABLE_NOENC
SYM_DATA_END(level4_kernel_pgt)
+SYM_PIC_ALIAS(level4_kernel_pgt)
#endif
SYM_DATA_START_PAGE_ALIGNED(level3_kernel_pgt)
@@ -659,6 +663,7 @@ SYM_DATA_START_PAGE_ALIGNED(level3_kernel_pgt)
.quad level2_kernel_pgt - __START_KERNEL_map + _KERNPG_TABLE_NOENC
.quad level2_fixmap_pgt - __START_KERNEL_map + _PAGE_TABLE_NOENC
SYM_DATA_END(level3_kernel_pgt)
+SYM_PIC_ALIAS(level3_kernel_pgt)
SYM_DATA_START_PAGE_ALIGNED(level2_kernel_pgt)
/*
@@ -676,6 +681,7 @@ SYM_DATA_START_PAGE_ALIGNED(level2_kernel_pgt)
*/
PMDS(0, __PAGE_KERNEL_LARGE_EXEC, KERNEL_IMAGE_SIZE/PMD_SIZE)
SYM_DATA_END(level2_kernel_pgt)
+SYM_PIC_ALIAS(level2_kernel_pgt)
SYM_DATA_START_PAGE_ALIGNED(level2_fixmap_pgt)
.fill (512 - 4 - FIXMAP_PMD_NUM),8,0
@@ -688,6 +694,7 @@ SYM_DATA_START_PAGE_ALIGNED(level2_fixmap_pgt)
/* 6 MB reserved space + a 2MB hole */
.fill 4,8,0
SYM_DATA_END(level2_fixmap_pgt)
+SYM_PIC_ALIAS(level2_fixmap_pgt)
SYM_DATA_START_PAGE_ALIGNED(level1_fixmap_pgt)
.rept (FIXMAP_PMD_NUM)
@@ -703,6 +710,7 @@ SYM_DATA(smpboot_control, .long 0)
.align 16
/* This must match the first entry in level2_kernel_pgt */
SYM_DATA(phys_base, .quad 0x0)
+SYM_PIC_ALIAS(phys_base);
EXPORT_SYMBOL(phys_base)
#include "../xen/xen-head.S"
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 9d2a13b37833..e0cf1595a0ab 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -134,6 +134,7 @@ struct ist_info ist_info;
struct cpuinfo_x86 boot_cpu_data __read_mostly;
EXPORT_SYMBOL(boot_cpu_data);
+SYM_PIC_ALIAS(boot_cpu_data);
#if !defined(CONFIG_X86_PAE) || defined(CONFIG_X86_64)
__visible unsigned long mmu_cr4_features __ro_after_init;
diff --git a/arch/x86/kernel/vmlinux.lds.S b/arch/x86/kernel/vmlinux.lds.S
index ccdc45e5b759..9340c74b680d 100644
--- a/arch/x86/kernel/vmlinux.lds.S
+++ b/arch/x86/kernel/vmlinux.lds.S
@@ -79,11 +79,13 @@ const_cpu_current_top_of_stack = cpu_current_top_of_stack;
#define BSS_DECRYPTED \
. = ALIGN(PMD_SIZE); \
__start_bss_decrypted = .; \
+ __pi___start_bss_decrypted = .; \
*(.bss..decrypted); \
. = ALIGN(PAGE_SIZE); \
__start_bss_decrypted_unused = .; \
. = ALIGN(PMD_SIZE); \
__end_bss_decrypted = .; \
+ __pi___end_bss_decrypted = .; \
#else
@@ -128,6 +130,7 @@ SECTIONS
/* Text and read-only data */
.text : AT(ADDR(.text) - LOAD_OFFSET) {
_text = .;
+ __pi__text = .;
_stext = .;
ALIGN_ENTRY_TEXT_BEGIN
*(.text..__x86.rethunk_untrain)
@@ -391,6 +394,7 @@ SECTIONS
. = ALIGN(PAGE_SIZE); /* keep VO_INIT_SIZE page aligned */
_end = .;
+ __pi__end = .;
#ifdef CONFIG_AMD_MEM_ENCRYPT
/*
diff --git a/arch/x86/lib/memcpy_64.S b/arch/x86/lib/memcpy_64.S
index 0ae2e1712e2e..12a23fa7c44c 100644
--- a/arch/x86/lib/memcpy_64.S
+++ b/arch/x86/lib/memcpy_64.S
@@ -41,6 +41,7 @@ SYM_FUNC_END(__memcpy)
EXPORT_SYMBOL(__memcpy)
SYM_FUNC_ALIAS_MEMFUNC(memcpy, __memcpy)
+SYM_PIC_ALIAS(memcpy)
EXPORT_SYMBOL(memcpy)
SYM_FUNC_START_LOCAL(memcpy_orig)
diff --git a/arch/x86/lib/memset_64.S b/arch/x86/lib/memset_64.S
index d66b710d628f..fb5a03cf5ab7 100644
--- a/arch/x86/lib/memset_64.S
+++ b/arch/x86/lib/memset_64.S
@@ -42,6 +42,7 @@ SYM_FUNC_END(__memset)
EXPORT_SYMBOL(__memset)
SYM_FUNC_ALIAS_MEMFUNC(memset, __memset)
+SYM_PIC_ALIAS(memset)
EXPORT_SYMBOL(memset)
SYM_FUNC_START_LOCAL(memset_orig)
diff --git a/arch/x86/lib/retpoline.S b/arch/x86/lib/retpoline.S
index a26c43abd47d..9f3116609c8c 100644
--- a/arch/x86/lib/retpoline.S
+++ b/arch/x86/lib/retpoline.S
@@ -40,6 +40,7 @@ SYM_INNER_LABEL(__x86_indirect_thunk_\reg, SYM_L_GLOBAL)
ALTERNATIVE_2 __stringify(RETPOLINE \reg), \
__stringify(lfence; ANNOTATE_RETPOLINE_SAFE; jmp *%\reg; int3), X86_FEATURE_RETPOLINE_LFENCE, \
__stringify(ANNOTATE_RETPOLINE_SAFE; jmp *%\reg), ALT_NOT(X86_FEATURE_RETPOLINE)
+SYM_PIC_ALIAS(__x86_indirect_thunk_\reg)
.endm
@@ -394,6 +395,7 @@ SYM_CODE_START(__x86_return_thunk)
#endif
int3
SYM_CODE_END(__x86_return_thunk)
+SYM_PIC_ALIAS(__x86_return_thunk)
EXPORT_SYMBOL(__x86_return_thunk)
#endif /* CONFIG_MITIGATION_RETHUNK */
diff --git a/arch/x86/mm/mem_encrypt_amd.c b/arch/x86/mm/mem_encrypt_amd.c
index 7490ff6d83b1..faf3a13fb6ba 100644
--- a/arch/x86/mm/mem_encrypt_amd.c
+++ b/arch/x86/mm/mem_encrypt_amd.c
@@ -40,7 +40,9 @@
* section is later cleared.
*/
u64 sme_me_mask __section(".data") = 0;
+SYM_PIC_ALIAS(sme_me_mask);
u64 sev_status __section(".data") = 0;
+SYM_PIC_ALIAS(sev_status);
u64 sev_check_data __section(".data") = 0;
EXPORT_SYMBOL(sme_me_mask);
diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index a05fcddfc811..b871f55c5d20 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -10,6 +10,7 @@
#ifdef CONFIG_DYNAMIC_PHYSICAL_MASK
phys_addr_t physical_mask __ro_after_init = (1ULL << __PHYSICAL_MASK_SHIFT) - 1;
EXPORT_SYMBOL(physical_mask);
+SYM_PIC_ALIAS(physical_mask);
#endif
pgtable_t pte_alloc_one(struct mm_struct *mm)
diff --git a/tools/objtool/arch/x86/decode.c b/tools/objtool/arch/x86/decode.c
index 3ce7b54003c2..331b9a744410 100644
--- a/tools/objtool/arch/x86/decode.c
+++ b/tools/objtool/arch/x86/decode.c
@@ -842,12 +842,14 @@ int arch_decode_hint_reg(u8 sp_reg, int *base)
bool arch_is_retpoline(struct symbol *sym)
{
- return !strncmp(sym->name, "__x86_indirect_", 15);
+ return !strncmp(sym->name, "__x86_indirect_", 15) ||
+ !strncmp(sym->name, "__pi___x86_indirect_", 20);
}
bool arch_is_rethunk(struct symbol *sym)
{
- return !strcmp(sym->name, "__x86_return_thunk");
+ return !strcmp(sym->name, "__x86_return_thunk") ||
+ !strcmp(sym->name, "__pi___x86_return_thunk");
}
bool arch_is_embedded_insn(struct symbol *sym)
--
2.49.0.906.g1f30a19c02-goog
* [RFT PATCH v2 15/23] x86/boot: Provide __pti_set_user_pgtbl() to startup code
2025-05-04 9:52 [RFT PATCH v2 00/23] x86: strict separation of startup code Ard Biesheuvel
` (13 preceding siblings ...)
2025-05-04 9:52 ` [RFT PATCH v2 14/23] x86/boot: Add a bunch of PIC aliases Ard Biesheuvel
@ 2025-05-04 9:52 ` Ard Biesheuvel
2025-05-04 14:20 ` [tip: x86/boot] " tip-bot2 for Ard Biesheuvel
2025-05-05 16:54 ` tip-bot2 for Ard Biesheuvel
2025-05-04 9:52 ` [RFT PATCH v2 16/23] x86/sev: Provide PIC aliases for SEV related data objects Ard Biesheuvel
` (9 subsequent siblings)
24 siblings, 2 replies; 59+ messages in thread
From: Ard Biesheuvel @ 2025-05-04 9:52 UTC (permalink / raw)
To: linux-kernel
Cc: linux-efi, x86, Ard Biesheuvel, Borislav Petkov, Ingo Molnar,
Dionna Amalie Glaze, Kevin Loughlin, Tom Lendacky
From: Ard Biesheuvel <ardb@kernel.org>
The SME encryption startup code populates page tables using the ordinary
set_pXX() helpers, and in a PTI build, these will call out to
__pti_set_user_pgtbl() to manipulate the shadow copy of the page tables
for user space.
This is unneeded for the startup code, which only manipulates the
swapper page tables, and so this call could be avoided in this
particular case. So instead of exposing the ordinary
__pti_set_user_pgtbl() to the startup code after it gets confined into
its own symbol space, provide an alternative which just returns pgd,
which is always correct in the startup context.
Annotate it as __weak for now; this will be dropped in a subsequent
patch.
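For context, the ordinary helper in arch/x86/mm/pti.c only mirrors PGD
entries that map user space; a simplified paraphrase (not the verbatim
implementation) looks like this:

	pgd_t __pti_set_user_pgtbl(pgd_t *pgdp, pgd_t pgd)
	{
		/* startup code only ever installs swapper entries ... */
		if (!pgdp_maps_userspace(pgdp))
			return pgd;

		/* ... so it never reaches the user page table mirroring */
		kernel_to_user_pgdp(pgdp)->pgd = pgd.pgd;
		return pgd;
	}

which is why unconditionally returning pgd is always correct for the
swapper-only startup path.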
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
---
arch/x86/boot/startup/sme.c | 9 +++++++++
1 file changed, 9 insertions(+)
diff --git a/arch/x86/boot/startup/sme.c b/arch/x86/boot/startup/sme.c
index 7fc6a689cefe..1e9f1f5a753c 100644
--- a/arch/x86/boot/startup/sme.c
+++ b/arch/x86/boot/startup/sme.c
@@ -557,3 +557,12 @@ void __head sme_enable(struct boot_params *bp)
cc_vendor = CC_VENDOR_AMD;
cc_set_mask(me_mask);
}
+
+#ifdef CONFIG_MITIGATION_PAGE_TABLE_ISOLATION
+/* Local version for startup code, which never operates on user page tables */
+__weak
+pgd_t __pti_set_user_pgtbl(pgd_t *pgdp, pgd_t pgd)
+{
+ return pgd;
+}
+#endif
--
2.49.0.906.g1f30a19c02-goog
* [RFT PATCH v2 16/23] x86/sev: Provide PIC aliases for SEV related data objects
2025-05-04 9:52 [RFT PATCH v2 00/23] x86: strict separation of startup code Ard Biesheuvel
` (14 preceding siblings ...)
2025-05-04 9:52 ` [RFT PATCH v2 15/23] x86/boot: Provide __pti_set_user_pgtbl() to startup code Ard Biesheuvel
@ 2025-05-04 9:52 ` Ard Biesheuvel
2025-05-04 9:52 ` [RFT PATCH v2 17/23] x86/sev: Move __sev_[get|put]_ghcb() into separate noinstr object Ard Biesheuvel
` (8 subsequent siblings)
24 siblings, 0 replies; 59+ messages in thread
From: Ard Biesheuvel @ 2025-05-04 9:52 UTC (permalink / raw)
To: linux-kernel
Cc: linux-efi, x86, Ard Biesheuvel, Borislav Petkov, Ingo Molnar,
Dionna Amalie Glaze, Kevin Loughlin, Tom Lendacky
From: Ard Biesheuvel <ardb@kernel.org>
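Move the SEV related data objects that the startup code shares with the
rest of the kernel into the core SEV code, and emit SYM_PIC_ALIAS()
definitions for them so that they remain accessible to the startup
code.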
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
---
arch/x86/boot/compressed/sev.c | 4 ++
arch/x86/boot/startup/sev-shared.c | 20 --------
arch/x86/boot/startup/sev-startup.c | 17 -------
arch/x86/coco/sev/core.c | 48 ++++++++++++++++++++
4 files changed, 52 insertions(+), 37 deletions(-)
diff --git a/arch/x86/boot/compressed/sev.c b/arch/x86/boot/compressed/sev.c
index 79550736ad2a..8118a270485f 100644
--- a/arch/x86/boot/compressed/sev.c
+++ b/arch/x86/boot/compressed/sev.c
@@ -38,6 +38,10 @@ struct ghcb *boot_ghcb;
#define __BOOT_COMPRESSED
u8 snp_vmpl;
+u16 ghcb_version;
+
+struct svsm_ca *boot_svsm_caa;
+u64 boot_svsm_caa_pa;
/* Include code for early handlers */
#include "../../boot/startup/sev-shared.c"
diff --git a/arch/x86/boot/startup/sev-shared.c b/arch/x86/boot/startup/sev-shared.c
index b72c2b9c40c3..9ab132bed77b 100644
--- a/arch/x86/boot/startup/sev-shared.c
+++ b/arch/x86/boot/startup/sev-shared.c
@@ -19,26 +19,6 @@
#define WARN(condition, format...) (!!(condition))
#endif
-/*
- * SVSM related information:
- * During boot, the page tables are set up as identity mapped and later
- * changed to use kernel virtual addresses. Maintain separate virtual and
- * physical addresses for the CAA to allow SVSM functions to be used during
- * early boot, both with identity mapped virtual addresses and proper kernel
- * virtual addresses.
- */
-struct svsm_ca *boot_svsm_caa __ro_after_init;
-u64 boot_svsm_caa_pa __ro_after_init;
-
-/*
- * Since feature negotiation related variables are set early in the boot
- * process they must reside in the .data section so as not to be zeroed
- * out when the .bss section is later cleared.
- *
- * GHCB protocol version negotiated with the hypervisor.
- */
-u16 ghcb_version __ro_after_init;
-
/* Copy of the SNP firmware's CPUID page. */
static struct snp_cpuid_table cpuid_table_copy __ro_after_init;
diff --git a/arch/x86/boot/startup/sev-startup.c b/arch/x86/boot/startup/sev-startup.c
index ca6a9863ffab..2c2a5a043f18 100644
--- a/arch/x86/boot/startup/sev-startup.c
+++ b/arch/x86/boot/startup/sev-startup.c
@@ -41,23 +41,6 @@
#include <asm/cpuid.h>
#include <asm/cmdline.h>
-/* For early boot hypervisor communication in SEV-ES enabled guests */
-struct ghcb boot_ghcb_page __bss_decrypted __aligned(PAGE_SIZE);
-
-/*
- * Needs to be in the .data section because we need it NULL before bss is
- * cleared
- */
-struct ghcb *boot_ghcb __section(".data");
-
-/* Bitmap of SEV features supported by the hypervisor */
-u64 sev_hv_features __ro_after_init;
-
-/* Secrets page physical address from the CC blob */
-u64 sev_secrets_pa __ro_after_init;
-
-/* For early boot SVSM communication */
-struct svsm_ca boot_svsm_ca_page __aligned(PAGE_SIZE);
/*
* Nothing shall interrupt this code path while holding the per-CPU
diff --git a/arch/x86/coco/sev/core.c b/arch/x86/coco/sev/core.c
index 106c231d8ded..33332c4299b9 100644
--- a/arch/x86/coco/sev/core.c
+++ b/arch/x86/coco/sev/core.c
@@ -45,6 +45,43 @@
#include <asm/cpuid.h>
#include <asm/cmdline.h>
+/* For early boot hypervisor communication in SEV-ES enabled guests */
+struct ghcb boot_ghcb_page __bss_decrypted __aligned(PAGE_SIZE);
+SYM_PIC_ALIAS(boot_ghcb_page);
+
+/*
+ * Needs to be in the .data section because we need it NULL before bss is
+ * cleared
+ */
+struct ghcb *boot_ghcb __section(".data");
+SYM_PIC_ALIAS(boot_ghcb);
+
+/* Bitmap of SEV features supported by the hypervisor */
+u64 sev_hv_features __ro_after_init;
+SYM_PIC_ALIAS(sev_hv_features);
+
+/* Secrets page physical address from the CC blob */
+u64 sev_secrets_pa __ro_after_init;
+SYM_PIC_ALIAS(sev_secrets_pa);
+
+/* For early boot SVSM communication */
+struct svsm_ca boot_svsm_ca_page __aligned(PAGE_SIZE);
+SYM_PIC_ALIAS(boot_svsm_ca_page);
+
+/*
+ * SVSM related information:
+ * During boot, the page tables are set up as identity mapped and later
+ * changed to use kernel virtual addresses. Maintain separate virtual and
+ * physical addresses for the CAA to allow SVSM functions to be used during
+ * early boot, both with identity mapped virtual addresses and proper kernel
+ * virtual addresses.
+ */
+struct svsm_ca *boot_svsm_caa __ro_after_init;
+SYM_PIC_ALIAS(boot_svsm_caa);
+
+u64 boot_svsm_caa_pa __ro_after_init;
+SYM_PIC_ALIAS(boot_svsm_caa_pa);
+
DEFINE_PER_CPU(struct svsm_ca *, svsm_caa);
DEFINE_PER_CPU(u64, svsm_caa_pa);
@@ -118,6 +155,17 @@ DEFINE_PER_CPU(struct sev_es_save_area *, sev_vmsa);
*/
u8 snp_vmpl __ro_after_init;
EXPORT_SYMBOL_GPL(snp_vmpl);
+SYM_PIC_ALIAS(snp_vmpl);
+
+/*
+ * Since feature negotiation related variables are set early in the boot
+ * process they must reside in the .data section so as not to be zeroed
+ * out when the .bss section is later cleared.
+ *
+ * GHCB protocol version negotiated with the hypervisor.
+ */
+u16 ghcb_version __ro_after_init;
+SYM_PIC_ALIAS(ghcb_version);
static u64 __init get_snp_jump_table_addr(void)
{
--
2.49.0.906.g1f30a19c02-goog
* [RFT PATCH v2 17/23] x86/sev: Move __sev_[get|put]_ghcb() into separate noinstr object
2025-05-04 9:52 [RFT PATCH v2 00/23] x86: strict separation of startup code Ard Biesheuvel
` (15 preceding siblings ...)
2025-05-04 9:52 ` [RFT PATCH v2 16/23] x86/sev: Provide PIC aliases for SEV related data objects Ard Biesheuvel
@ 2025-05-04 9:52 ` Ard Biesheuvel
2025-05-04 9:52 ` [RFT PATCH v2 18/23] x86/sev: Export startup routines for ordinary use Ard Biesheuvel
` (7 subsequent siblings)
24 siblings, 0 replies; 59+ messages in thread
From: Ard Biesheuvel @ 2025-05-04 9:52 UTC (permalink / raw)
To: linux-kernel
Cc: linux-efi, x86, Ard Biesheuvel, Borislav Petkov, Ingo Molnar,
Dionna Amalie Glaze, Kevin Loughlin, Tom Lendacky
From: Ard Biesheuvel <ardb@kernel.org>
Rename sev-nmi.c to sev-noinstr.c, and move the GHCB get/put routines
into it as well, so that all SEV code that must remain free of
instrumentation (and is therefore built with UBSAN and KASAN disabled)
lives in a single object.
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
---
arch/x86/boot/startup/sev-startup.c | 75 --------------------
arch/x86/coco/sev/Makefile | 6 +-
arch/x86/coco/sev/{sev-nmi.c => sev-noinstr.c} | 74 +++++++++++++++++++
3 files changed, 77 insertions(+), 78 deletions(-)
diff --git a/arch/x86/boot/startup/sev-startup.c b/arch/x86/boot/startup/sev-startup.c
index 2c2a5a043f18..6ba50723d546 100644
--- a/arch/x86/boot/startup/sev-startup.c
+++ b/arch/x86/boot/startup/sev-startup.c
@@ -41,84 +41,9 @@
#include <asm/cpuid.h>
#include <asm/cmdline.h>
-
-/*
- * Nothing shall interrupt this code path while holding the per-CPU
- * GHCB. The backup GHCB is only for NMIs interrupting this path.
- *
- * Callers must disable local interrupts around it.
- */
-noinstr struct ghcb *__sev_get_ghcb(struct ghcb_state *state)
-{
- struct sev_es_runtime_data *data;
- struct ghcb *ghcb;
-
- WARN_ON(!irqs_disabled());
-
- data = this_cpu_read(runtime_data);
- ghcb = &data->ghcb_page;
-
- if (unlikely(data->ghcb_active)) {
- /* GHCB is already in use - save its contents */
-
- if (unlikely(data->backup_ghcb_active)) {
- /*
- * Backup-GHCB is also already in use. There is no way
- * to continue here so just kill the machine. To make
- * panic() work, mark GHCBs inactive so that messages
- * can be printed out.
- */
- data->ghcb_active = false;
- data->backup_ghcb_active = false;
-
- instrumentation_begin();
- panic("Unable to handle #VC exception! GHCB and Backup GHCB are already in use");
- instrumentation_end();
- }
-
- /* Mark backup_ghcb active before writing to it */
- data->backup_ghcb_active = true;
-
- state->ghcb = &data->backup_ghcb;
-
- /* Backup GHCB content */
- *state->ghcb = *ghcb;
- } else {
- state->ghcb = NULL;
- data->ghcb_active = true;
- }
-
- return ghcb;
-}
-
/* Include code shared with pre-decompression boot stage */
#include "sev-shared.c"
-noinstr void __sev_put_ghcb(struct ghcb_state *state)
-{
- struct sev_es_runtime_data *data;
- struct ghcb *ghcb;
-
- WARN_ON(!irqs_disabled());
-
- data = this_cpu_read(runtime_data);
- ghcb = &data->ghcb_page;
-
- if (state->ghcb) {
- /* Restore GHCB from Backup */
- *ghcb = *state->ghcb;
- data->backup_ghcb_active = false;
- state->ghcb = NULL;
- } else {
- /*
- * Invalidate the GHCB so a VMGEXIT instruction issued
- * from userspace won't appear to be valid.
- */
- vc_ghcb_invalidate(ghcb);
- data->ghcb_active = false;
- }
-}
-
void __head
early_set_pages_state(unsigned long vaddr, unsigned long paddr,
unsigned long npages, enum psc_op op)
diff --git a/arch/x86/coco/sev/Makefile b/arch/x86/coco/sev/Makefile
index db3255b979bd..b348c9486315 100644
--- a/arch/x86/coco/sev/Makefile
+++ b/arch/x86/coco/sev/Makefile
@@ -1,9 +1,9 @@
# SPDX-License-Identifier: GPL-2.0
-obj-y += core.o sev-nmi.o vc-handle.o
+obj-y += core.o sev-noinstr.o vc-handle.o
# Clang 14 and older may fail to respect __no_sanitize_undefined when inlining
-UBSAN_SANITIZE_sev-nmi.o := n
+UBSAN_SANITIZE_sev-noinstr.o := n
# GCC may fail to respect __no_sanitize_address when inlining
-KASAN_SANITIZE_sev-nmi.o := n
+KASAN_SANITIZE_sev-noinstr.o := n
diff --git a/arch/x86/coco/sev/sev-nmi.c b/arch/x86/coco/sev/sev-noinstr.c
similarity index 61%
rename from arch/x86/coco/sev/sev-nmi.c
rename to arch/x86/coco/sev/sev-noinstr.c
index d8dfaddfb367..b527eafb6312 100644
--- a/arch/x86/coco/sev/sev-nmi.c
+++ b/arch/x86/coco/sev/sev-noinstr.c
@@ -106,3 +106,77 @@ void noinstr __sev_es_nmi_complete(void)
__sev_put_ghcb(&state);
}
+
+/*
+ * Nothing shall interrupt this code path while holding the per-CPU
+ * GHCB. The backup GHCB is only for NMIs interrupting this path.
+ *
+ * Callers must disable local interrupts around it.
+ */
+noinstr struct ghcb *__sev_get_ghcb(struct ghcb_state *state)
+{
+ struct sev_es_runtime_data *data;
+ struct ghcb *ghcb;
+
+ WARN_ON(!irqs_disabled());
+
+ data = this_cpu_read(runtime_data);
+ ghcb = &data->ghcb_page;
+
+ if (unlikely(data->ghcb_active)) {
+ /* GHCB is already in use - save its contents */
+
+ if (unlikely(data->backup_ghcb_active)) {
+ /*
+ * Backup-GHCB is also already in use. There is no way
+ * to continue here so just kill the machine. To make
+ * panic() work, mark GHCBs inactive so that messages
+ * can be printed out.
+ */
+ data->ghcb_active = false;
+ data->backup_ghcb_active = false;
+
+ instrumentation_begin();
+ panic("Unable to handle #VC exception! GHCB and Backup GHCB are already in use");
+ instrumentation_end();
+ }
+
+ /* Mark backup_ghcb active before writing to it */
+ data->backup_ghcb_active = true;
+
+ state->ghcb = &data->backup_ghcb;
+
+ /* Backup GHCB content */
+ *state->ghcb = *ghcb;
+ } else {
+ state->ghcb = NULL;
+ data->ghcb_active = true;
+ }
+
+ return ghcb;
+}
+
+noinstr void __sev_put_ghcb(struct ghcb_state *state)
+{
+ struct sev_es_runtime_data *data;
+ struct ghcb *ghcb;
+
+ WARN_ON(!irqs_disabled());
+
+ data = this_cpu_read(runtime_data);
+ ghcb = &data->ghcb_page;
+
+ if (state->ghcb) {
+ /* Restore GHCB from Backup */
+ *ghcb = *state->ghcb;
+ data->backup_ghcb_active = false;
+ state->ghcb = NULL;
+ } else {
+ /*
+ * Invalidate the GHCB so a VMGEXIT instruction issued
+ * from userspace won't appear to be valid.
+ */
+ vc_ghcb_invalidate(ghcb);
+ data->ghcb_active = false;
+ }
+}
--
2.49.0.906.g1f30a19c02-goog
* [RFT PATCH v2 18/23] x86/sev: Export startup routines for ordinary use
2025-05-04 9:52 [RFT PATCH v2 00/23] x86: strict separation of startup code Ard Biesheuvel
` (16 preceding siblings ...)
2025-05-04 9:52 ` [RFT PATCH v2 17/23] x86/sev: Move __sev_[get|put]_ghcb() into separate noinstr object Ard Biesheuvel
@ 2025-05-04 9:52 ` Ard Biesheuvel
2025-05-04 9:52 ` [RFT PATCH v2 19/23] x86/boot: Created a confined code area for startup code Ard Biesheuvel
` (6 subsequent siblings)
24 siblings, 0 replies; 59+ messages in thread
From: Ard Biesheuvel @ 2025-05-04 9:52 UTC (permalink / raw)
To: linux-kernel
Cc: linux-efi, x86, Ard Biesheuvel, Borislav Petkov, Ingo Molnar,
Dionna Amalie Glaze, Kevin Loughlin, Tom Lendacky
From: Ard Biesheuvel <ardb@kernel.org>
Create aliases that expose routines implemented in the startup code to
other code in the core kernel. This relies on the linker's PROVIDE()
directive, which defines the unprefixed symbol only if it is referenced
but not defined elsewhere, so ordinary callers can keep using the
existing names.
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
---
arch/x86/boot/startup/exports.h | 14 ++++++++++++++
arch/x86/kernel/vmlinux.lds.S | 2 ++
2 files changed, 16 insertions(+)
diff --git a/arch/x86/boot/startup/exports.h b/arch/x86/boot/startup/exports.h
new file mode 100644
index 000000000000..5cbc14c64d61
--- /dev/null
+++ b/arch/x86/boot/startup/exports.h
@@ -0,0 +1,14 @@
+
+/*
+ * The symbols below are functions that are implemented by the startup code,
+ * but called at runtime by the SEV code residing in the core kernel.
+ */
+PROVIDE(early_set_pages_state = __pi_early_set_pages_state);
+PROVIDE(early_snp_set_memory_private = __pi_early_snp_set_memory_private);
+PROVIDE(early_snp_set_memory_shared = __pi_early_snp_set_memory_shared);
+PROVIDE(get_hv_features = __pi_get_hv_features);
+PROVIDE(sev_es_check_cpu_features = __pi_sev_es_check_cpu_features);
+PROVIDE(sev_es_negotiate_protocol = __pi_sev_es_negotiate_protocol);
+PROVIDE(sev_es_terminate = __pi_sev_es_terminate);
+PROVIDE(snp_cpuid = __pi_snp_cpuid);
+PROVIDE(snp_cpuid_get_table = __pi_snp_cpuid_get_table);
diff --git a/arch/x86/kernel/vmlinux.lds.S b/arch/x86/kernel/vmlinux.lds.S
index 9340c74b680d..4aaa1693b262 100644
--- a/arch/x86/kernel/vmlinux.lds.S
+++ b/arch/x86/kernel/vmlinux.lds.S
@@ -517,3 +517,5 @@ xen_elfnote_entry_value =
xen_elfnote_phys32_entry_value =
ABSOLUTE(xen_elfnote_phys32_entry) + ABSOLUTE(pvh_start_xen - LOAD_OFFSET);
#endif
+
+#include "../boot/startup/exports.h"
--
2.49.0.906.g1f30a19c02-goog
* [RFT PATCH v2 19/23] x86/boot: Created a confined code area for startup code
2025-05-04 9:52 [RFT PATCH v2 00/23] x86: strict separation of startup code Ard Biesheuvel
` (17 preceding siblings ...)
2025-05-04 9:52 ` [RFT PATCH v2 18/23] x86/sev: Export startup routines for ordinary use Ard Biesheuvel
@ 2025-05-04 9:52 ` Ard Biesheuvel
2025-05-04 9:52 ` [RFT PATCH v2 20/23] x86/boot: Move startup code out of __head section Ard Biesheuvel
` (5 subsequent siblings)
24 siblings, 0 replies; 59+ messages in thread
From: Ard Biesheuvel @ 2025-05-04 9:52 UTC (permalink / raw)
To: linux-kernel
Cc: linux-efi, x86, Ard Biesheuvel, Borislav Petkov, Ingo Molnar,
Dionna Amalie Glaze, Kevin Loughlin, Tom Lendacky
From: Ard Biesheuvel <ardb@kernel.org>
In order to have tight control over which code may execute from the
early 1:1 mapping of memory, while still linking vmlinux as a single
executable, prefix all symbols in the startup code with __pi_ (for
position independent), and invoke the startup code from outside using
the __pi_ prefix.
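To illustrate the effect (using sme_enable() from this series as the
example): after objcopy runs with --prefix-symbols=__pi_, the only name
the linker sees for a startup routine is the prefixed one:

	/* before: the startup object defines */
	void sme_enable(struct boot_params *bp);

	/* after: the same code is only visible as */
	void __pi_sme_enable(struct boot_params *bp);

Undefined references are renamed as well, so a call from startup code
into ordinary kernel code only links if a SYM_PIC_ALIAS() provides the
__pi_ prefixed alias for the target symbol.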
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
---
arch/x86/boot/startup/Makefile | 13 +++++++++++++
arch/x86/boot/startup/sev-shared.c | 2 +-
arch/x86/boot/startup/sme.c | 1 -
arch/x86/coco/sev/core.c | 2 +-
arch/x86/coco/sev/vc-handle.c | 1 -
arch/x86/include/asm/setup.h | 1 +
arch/x86/include/asm/sev.h | 1 +
arch/x86/kernel/head64.c | 2 +-
arch/x86/kernel/head_64.S | 8 ++++----
arch/x86/mm/mem_encrypt_boot.S | 6 +++---
10 files changed, 25 insertions(+), 12 deletions(-)
diff --git a/arch/x86/boot/startup/Makefile b/arch/x86/boot/startup/Makefile
index b514f7e81332..c7e690091f81 100644
--- a/arch/x86/boot/startup/Makefile
+++ b/arch/x86/boot/startup/Makefile
@@ -28,3 +28,16 @@ lib-$(CONFIG_EFI_MIXED) += efi-mixed.o
# to be linked into the decompressor or the EFI stub but not vmlinux
#
$(patsubst %.o,$(obj)/%.o,$(lib-y)): OBJECT_FILES_NON_STANDARD := y
+
+#
+# Confine the startup code by prefixing all symbols with __pi_ (for position
+# independent). This ensures that startup code can only call other startup
+# code, or code that has explicitly been made accessible to it via a symbol
+# alias.
+#
+$(obj)/%.pi.o: OBJCOPYFLAGS := --prefix-symbols=__pi_
+$(obj)/%.pi.o: $(obj)/%.o FORCE
+ $(call if_changed,objcopy)
+
+extra-y := $(obj-y)
+obj-y := $(patsubst %.o,%.pi.o,$(obj-y))
diff --git a/arch/x86/boot/startup/sev-shared.c b/arch/x86/boot/startup/sev-shared.c
index 9ab132bed77b..0215f2355dfa 100644
--- a/arch/x86/boot/startup/sev-shared.c
+++ b/arch/x86/boot/startup/sev-shared.c
@@ -12,7 +12,7 @@
#include <asm/setup_data.h>
#ifndef __BOOT_COMPRESSED
-#define error(v) pr_err(v)
+#define error(v)
#define has_cpuflag(f) boot_cpu_has(f)
#else
#undef WARN
diff --git a/arch/x86/boot/startup/sme.c b/arch/x86/boot/startup/sme.c
index 1e9f1f5a753c..19dce2304636 100644
--- a/arch/x86/boot/startup/sme.c
+++ b/arch/x86/boot/startup/sme.c
@@ -560,7 +560,6 @@ void __head sme_enable(struct boot_params *bp)
#ifdef CONFIG_MITIGATION_PAGE_TABLE_ISOLATION
/* Local version for startup code, which never operates on user page tables */
-__weak
pgd_t __pti_set_user_pgtbl(pgd_t *pgdp, pgd_t pgd)
{
return pgd;
diff --git a/arch/x86/coco/sev/core.c b/arch/x86/coco/sev/core.c
index 33332c4299b9..98921cde217e 100644
--- a/arch/x86/coco/sev/core.c
+++ b/arch/x86/coco/sev/core.c
@@ -276,7 +276,7 @@ static int svsm_perform_call_protocol(struct svsm_call *call)
do {
ret = ghcb ? svsm_perform_ghcb_protocol(ghcb, call)
- : svsm_perform_msr_protocol(call);
+ : __pi_svsm_perform_msr_protocol(call);
} while (ret == -EAGAIN);
if (sev_cfg.ghcbs_initialized)
diff --git a/arch/x86/coco/sev/vc-handle.c b/arch/x86/coco/sev/vc-handle.c
index b4895c648024..0820accb9e80 100644
--- a/arch/x86/coco/sev/vc-handle.c
+++ b/arch/x86/coco/sev/vc-handle.c
@@ -1058,4 +1058,3 @@ bool __init handle_vc_boot_ghcb(struct pt_regs *regs)
sev_es_terminate(SEV_TERM_SET_GEN, GHCB_SEV_ES_GEN_REQ);
}
-
diff --git a/arch/x86/include/asm/setup.h b/arch/x86/include/asm/setup.h
index 6324f4c6c545..895d09faaf83 100644
--- a/arch/x86/include/asm/setup.h
+++ b/arch/x86/include/asm/setup.h
@@ -53,6 +53,7 @@ extern void i386_reserve_resources(void);
extern unsigned long __startup_64(unsigned long p2v_offset, struct boot_params *bp);
extern void startup_64_setup_gdt_idt(void);
extern void startup_64_load_idt(void *vc_handler);
+extern void __pi_startup_64_load_idt(void *vc_handler);
extern void early_setup_idt(void);
extern void __init do_early_exception(struct pt_regs *regs, int trapnr);
diff --git a/arch/x86/include/asm/sev.h b/arch/x86/include/asm/sev.h
index ca914f7384c6..cfec803a9b6c 100644
--- a/arch/x86/include/asm/sev.h
+++ b/arch/x86/include/asm/sev.h
@@ -535,6 +535,7 @@ struct cpuid_leaf {
};
int svsm_perform_msr_protocol(struct svsm_call *call);
+int __pi_svsm_perform_msr_protocol(struct svsm_call *call);
int snp_cpuid(void (*cpuid_hv)(void *ctx, struct cpuid_leaf *),
void *ctx, struct cpuid_leaf *leaf);
diff --git a/arch/x86/kernel/head64.c b/arch/x86/kernel/head64.c
index b7da8b45b6d8..23dc505bf609 100644
--- a/arch/x86/kernel/head64.c
+++ b/arch/x86/kernel/head64.c
@@ -306,5 +306,5 @@ void early_setup_idt(void)
handler = vc_boot_ghcb;
}
- startup_64_load_idt(handler);
+ __pi_startup_64_load_idt(handler);
}
diff --git a/arch/x86/kernel/head_64.S b/arch/x86/kernel/head_64.S
index 069420853304..88cbf932f7c2 100644
--- a/arch/x86/kernel/head_64.S
+++ b/arch/x86/kernel/head_64.S
@@ -71,7 +71,7 @@ SYM_CODE_START_NOALIGN(startup_64)
xorl %edx, %edx
wrmsr
- call startup_64_setup_gdt_idt
+ call __pi_startup_64_setup_gdt_idt
/* Now switch to __KERNEL_CS so IRET works reliably */
pushq $__KERNEL_CS
@@ -91,7 +91,7 @@ SYM_CODE_START_NOALIGN(startup_64)
* subsequent code. Pass the boot_params pointer as the first argument.
*/
movq %r15, %rdi
- call sme_enable
+ call __pi_sme_enable
#endif
/* Sanitize CPU configuration */
@@ -111,7 +111,7 @@ SYM_CODE_START_NOALIGN(startup_64)
* programmed into CR3.
*/
movq %r15, %rsi
- call __startup_64
+ call __pi___startup_64
/* Form the CR3 value being sure to include the CR3 modifier */
leaq early_top_pgt(%rip), %rcx
@@ -562,7 +562,7 @@ SYM_CODE_START_NOALIGN(vc_no_ghcb)
/* Call C handler */
movq %rsp, %rdi
movq ORIG_RAX(%rsp), %rsi
- call do_vc_no_ghcb
+ call __pi_do_vc_no_ghcb
/* Unwind pt_regs */
POP_REGS
diff --git a/arch/x86/mm/mem_encrypt_boot.S b/arch/x86/mm/mem_encrypt_boot.S
index f8a33b25ae86..edbf9c998848 100644
--- a/arch/x86/mm/mem_encrypt_boot.S
+++ b/arch/x86/mm/mem_encrypt_boot.S
@@ -16,7 +16,7 @@
.text
.code64
-SYM_FUNC_START(sme_encrypt_execute)
+SYM_FUNC_START(__pi_sme_encrypt_execute)
/*
* Entry parameters:
@@ -69,9 +69,9 @@ SYM_FUNC_START(sme_encrypt_execute)
ANNOTATE_UNRET_SAFE
ret
int3
-SYM_FUNC_END(sme_encrypt_execute)
+SYM_FUNC_END(__pi_sme_encrypt_execute)
-SYM_FUNC_START(__enc_copy)
+SYM_FUNC_START_LOCAL(__enc_copy)
ANNOTATE_NOENDBR
/*
* Routine used to encrypt memory in place.
--
2.49.0.906.g1f30a19c02-goog
* [RFT PATCH v2 20/23] x86/boot: Move startup code out of __head section
2025-05-04 9:52 [RFT PATCH v2 00/23] x86: strict separation of startup code Ard Biesheuvel
` (18 preceding siblings ...)
2025-05-04 9:52 ` [RFT PATCH v2 19/23] x86/boot: Created a confined code area for startup code Ard Biesheuvel
@ 2025-05-04 9:52 ` Ard Biesheuvel
2025-05-04 9:52 ` [RFT PATCH v2 21/23] x86/boot: Disallow absolute symbol references in startup code Ard Biesheuvel
` (4 subsequent siblings)
24 siblings, 0 replies; 59+ messages in thread
From: Ard Biesheuvel @ 2025-05-04 9:52 UTC (permalink / raw)
To: linux-kernel
Cc: linux-efi, x86, Ard Biesheuvel, Borislav Petkov, Ingo Molnar,
Dionna Amalie Glaze, Kevin Loughlin, Tom Lendacky
From: Ard Biesheuvel <ardb@kernel.org>
Move startup code out of the __head section, now that that section no
longer has any special significance. Move everything into .text or
.init.text as appropriate, so that code that only runs at boot is not
kept around unnecessarily.
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
---
arch/x86/boot/compressed/sev.c | 3 ---
arch/x86/boot/startup/gdt_idt.c | 4 +--
arch/x86/boot/startup/map_kernel.c | 4 +--
arch/x86/boot/startup/sev-shared.c | 28 ++++++++++----------
arch/x86/boot/startup/sev-startup.c | 14 +++++-----
arch/x86/boot/startup/sme.c | 26 +++++++++---------
arch/x86/include/asm/init.h | 6 -----
arch/x86/kernel/head_32.S | 2 +-
arch/x86/kernel/head_64.S | 2 +-
arch/x86/platform/pvh/head.S | 2 +-
10 files changed, 41 insertions(+), 50 deletions(-)
diff --git a/arch/x86/boot/compressed/sev.c b/arch/x86/boot/compressed/sev.c
index 8118a270485f..065525e7fb5e 100644
--- a/arch/x86/boot/compressed/sev.c
+++ b/arch/x86/boot/compressed/sev.c
@@ -32,9 +32,6 @@ struct ghcb *boot_ghcb;
#undef __init
#define __init
-#undef __head
-#define __head
-
#define __BOOT_COMPRESSED
u8 snp_vmpl;
diff --git a/arch/x86/boot/startup/gdt_idt.c b/arch/x86/boot/startup/gdt_idt.c
index a3112a69b06a..d16102abdaec 100644
--- a/arch/x86/boot/startup/gdt_idt.c
+++ b/arch/x86/boot/startup/gdt_idt.c
@@ -24,7 +24,7 @@
static gate_desc bringup_idt_table[NUM_EXCEPTION_VECTORS] __page_aligned_data;
/* This may run while still in the direct mapping */
-void __head startup_64_load_idt(void *vc_handler)
+void startup_64_load_idt(void *vc_handler)
{
struct desc_ptr desc = {
.address = (unsigned long)rip_rel_ptr(bringup_idt_table),
@@ -46,7 +46,7 @@ void __head startup_64_load_idt(void *vc_handler)
/*
* Setup boot CPU state needed before kernel switches to virtual addresses.
*/
-void __head startup_64_setup_gdt_idt(void)
+void __init startup_64_setup_gdt_idt(void)
{
struct gdt_page *gp = rip_rel_ptr((void *)(__force unsigned long)&gdt_page);
void *handler = NULL;
diff --git a/arch/x86/boot/startup/map_kernel.c b/arch/x86/boot/startup/map_kernel.c
index 11f99d907f76..e38f57b81745 100644
--- a/arch/x86/boot/startup/map_kernel.c
+++ b/arch/x86/boot/startup/map_kernel.c
@@ -26,7 +26,7 @@ static inline bool check_la57_support(void)
return true;
}
-static unsigned long __head sme_postprocess_startup(struct boot_params *bp,
+static unsigned long __init sme_postprocess_startup(struct boot_params *bp,
pmdval_t *pmd,
unsigned long p2v_offset)
{
@@ -80,7 +80,7 @@ static unsigned long __head sme_postprocess_startup(struct boot_params *bp,
* the 1:1 mapping of memory. Kernel virtual addresses can be determined by
* subtracting p2v_offset from the RIP-relative address.
*/
-unsigned long __head __startup_64(unsigned long p2v_offset,
+unsigned long __init __startup_64(unsigned long p2v_offset,
struct boot_params *bp)
{
pmd_t (*early_pgts)[PTRS_PER_PMD] = rip_rel_ptr(early_dynamic_pgts);
diff --git a/arch/x86/boot/startup/sev-shared.c b/arch/x86/boot/startup/sev-shared.c
index 0215f2355dfa..5d700565b5f1 100644
--- a/arch/x86/boot/startup/sev-shared.c
+++ b/arch/x86/boot/startup/sev-shared.c
@@ -42,7 +42,7 @@ bool __init sev_es_check_cpu_features(void)
return true;
}
-void __head __noreturn
+void __noreturn
sev_es_terminate(unsigned int set, unsigned int reason)
{
u64 val = GHCB_MSR_TERM_REQ;
@@ -58,7 +58,7 @@ sev_es_terminate(unsigned int set, unsigned int reason)
asm volatile("hlt\n" : : : "memory");
}
-static u64 __head get_hv_features(void)
+static u64 __init get_hv_features(void)
{
u64 val;
@@ -72,7 +72,7 @@ static u64 __head get_hv_features(void)
return GHCB_MSR_HV_FT_RESP_VAL(val);
}
-u64 __head snp_check_hv_features(void)
+u64 __init snp_check_hv_features(void)
{
/*
* SNP is supported in v2 of the GHCB spec which mandates support for HV
@@ -220,7 +220,7 @@ const struct snp_cpuid_table *snp_cpuid_get_table(void)
*
* Return: XSAVE area size on success, 0 otherwise.
*/
-static u32 __head snp_cpuid_calc_xsave_size(u64 xfeatures_en, bool compacted)
+static u32 snp_cpuid_calc_xsave_size(u64 xfeatures_en, bool compacted)
{
const struct snp_cpuid_table *cpuid_table = snp_cpuid_get_table();
u64 xfeatures_found = 0;
@@ -256,7 +256,7 @@ static u32 __head snp_cpuid_calc_xsave_size(u64 xfeatures_en, bool compacted)
return xsave_size;
}
-static bool __head
+static bool
snp_cpuid_get_validated_func(struct cpuid_leaf *leaf)
{
const struct snp_cpuid_table *cpuid_table = snp_cpuid_get_table();
@@ -298,7 +298,7 @@ static void snp_cpuid_hv_no_ghcb(void *ctx, struct cpuid_leaf *leaf)
sev_es_terminate(SEV_TERM_SET_LINUX, GHCB_TERM_CPUID_HV);
}
-static int __head
+static int
snp_cpuid_postprocess(void (*cpuid_hv)(void *ctx, struct cpuid_leaf *),
void *ctx, struct cpuid_leaf *leaf)
{
@@ -394,7 +394,7 @@ snp_cpuid_postprocess(void (*cpuid_hv)(void *ctx, struct cpuid_leaf *),
* Returns -EOPNOTSUPP if feature not enabled. Any other non-zero return value
* should be treated as fatal by caller.
*/
-int __head
+int
snp_cpuid(void (*cpuid_hv)(void *ctx, struct cpuid_leaf *),
void *ctx, struct cpuid_leaf *leaf)
{
@@ -438,7 +438,7 @@ snp_cpuid(void (*cpuid_hv)(void *ctx, struct cpuid_leaf *),
* page yet, so it only supports the MSR based communication with the
* hypervisor and only the CPUID exit-code.
*/
-void __head do_vc_no_ghcb(struct pt_regs *regs, unsigned long exit_code)
+void do_vc_no_ghcb(struct pt_regs *regs, unsigned long exit_code)
{
unsigned int subfn = lower_bits(regs->cx, 32);
unsigned int fn = lower_bits(regs->ax, 32);
@@ -514,7 +514,7 @@ struct cc_setup_data {
* Search for a Confidential Computing blob passed in as a setup_data entry
* via the Linux Boot Protocol.
*/
-static __head
+static __init
struct cc_blob_sev_info *find_cc_blob_setup_data(struct boot_params *bp)
{
struct cc_setup_data *sd = NULL;
@@ -542,7 +542,7 @@ struct cc_blob_sev_info *find_cc_blob_setup_data(struct boot_params *bp)
* mapping needs to be updated in sync with all the changes to virtual memory
* layout and related mapping facilities throughout the boot process.
*/
-static void __head setup_cpuid_table(const struct cc_blob_sev_info *cc_info)
+static void __init setup_cpuid_table(const struct cc_blob_sev_info *cc_info)
{
const struct snp_cpuid_table *cpuid_table_fw, *cpuid_table;
int i;
@@ -570,7 +570,7 @@ static void __head setup_cpuid_table(const struct cc_blob_sev_info *cc_info)
}
}
-static void __head svsm_pval_4k_page(unsigned long paddr, bool validate)
+static void svsm_pval_4k_page(unsigned long paddr, bool validate)
{
struct svsm_pvalidate_call *pc;
struct svsm_call call = {};
@@ -607,8 +607,8 @@ static void __head svsm_pval_4k_page(unsigned long paddr, bool validate)
native_local_irq_restore(flags);
}
-static void __head pvalidate_4k_page(unsigned long vaddr, unsigned long paddr,
- bool validate)
+static void pvalidate_4k_page(unsigned long vaddr, unsigned long paddr,
+ bool validate)
{
int ret;
@@ -625,7 +625,7 @@ static void __head pvalidate_4k_page(unsigned long vaddr, unsigned long paddr,
* Maintain the GPA of the SVSM Calling Area (CA) in order to utilize the SVSM
* services needed when not running in VMPL0.
*/
-static bool __head svsm_setup_ca(const struct cc_blob_sev_info *cc_info)
+static bool __init svsm_setup_ca(const struct cc_blob_sev_info *cc_info)
{
struct snp_secrets_page *secrets_page;
struct snp_cpuid_table *cpuid_table;
diff --git a/arch/x86/boot/startup/sev-startup.c b/arch/x86/boot/startup/sev-startup.c
index 6ba50723d546..7c1532ca1e87 100644
--- a/arch/x86/boot/startup/sev-startup.c
+++ b/arch/x86/boot/startup/sev-startup.c
@@ -44,7 +44,7 @@
/* Include code shared with pre-decompression boot stage */
#include "sev-shared.c"
-void __head
+void __init
early_set_pages_state(unsigned long vaddr, unsigned long paddr,
unsigned long npages, enum psc_op op)
{
@@ -90,7 +90,7 @@ early_set_pages_state(unsigned long vaddr, unsigned long paddr,
sev_es_terminate(SEV_TERM_SET_LINUX, GHCB_TERM_PSC);
}
-void __head early_snp_set_memory_private(unsigned long vaddr, unsigned long paddr,
+void __init early_snp_set_memory_private(unsigned long vaddr, unsigned long paddr,
unsigned long npages)
{
/*
@@ -109,7 +109,7 @@ void __head early_snp_set_memory_private(unsigned long vaddr, unsigned long padd
early_set_pages_state(vaddr, paddr, npages, SNP_PAGE_STATE_PRIVATE);
}
-void __head early_snp_set_memory_shared(unsigned long vaddr, unsigned long paddr,
+void __init early_snp_set_memory_shared(unsigned long vaddr, unsigned long paddr,
unsigned long npages)
{
/*
@@ -138,7 +138,7 @@ void __head early_snp_set_memory_shared(unsigned long vaddr, unsigned long paddr
*
* Scan for the blob in that order.
*/
-static __head struct cc_blob_sev_info *find_cc_blob(struct boot_params *bp)
+static struct cc_blob_sev_info *__init find_cc_blob(struct boot_params *bp)
{
struct cc_blob_sev_info *cc_info;
@@ -164,7 +164,7 @@ static __head struct cc_blob_sev_info *find_cc_blob(struct boot_params *bp)
return cc_info;
}
-static __head void svsm_setup(struct cc_blob_sev_info *cc_info)
+static void __init svsm_setup(struct cc_blob_sev_info *cc_info)
{
struct snp_secrets_page *secrets = (void *)cc_info->secrets_phys;
struct svsm_call call = {};
@@ -206,7 +206,7 @@ static __head void svsm_setup(struct cc_blob_sev_info *cc_info)
boot_svsm_caa_pa = pa;
}
-bool __head snp_init(struct boot_params *bp)
+bool __init snp_init(struct boot_params *bp)
{
struct cc_blob_sev_info *cc_info;
@@ -235,7 +235,7 @@ bool __head snp_init(struct boot_params *bp)
return true;
}
-void __head __noreturn snp_abort(void)
+void __init __noreturn snp_abort(void)
{
sev_es_terminate(SEV_TERM_SET_GEN, GHCB_SNP_UNSUPPORTED);
}
diff --git a/arch/x86/boot/startup/sme.c b/arch/x86/boot/startup/sme.c
index 19dce2304636..e8f48d8bb59a 100644
--- a/arch/x86/boot/startup/sme.c
+++ b/arch/x86/boot/startup/sme.c
@@ -82,7 +82,7 @@ struct sme_populate_pgd_data {
*/
static char sme_workarea[2 * PMD_SIZE] __section(".init.scratch");
-static void __head sme_clear_pgd(struct sme_populate_pgd_data *ppd)
+static void __init sme_clear_pgd(struct sme_populate_pgd_data *ppd)
{
unsigned long pgd_start, pgd_end, pgd_size;
pgd_t *pgd_p;
@@ -97,7 +97,7 @@ static void __head sme_clear_pgd(struct sme_populate_pgd_data *ppd)
memset(pgd_p, 0, pgd_size);
}
-static pud_t __head *sme_prepare_pgd(struct sme_populate_pgd_data *ppd)
+static pud_t __init *sme_prepare_pgd(struct sme_populate_pgd_data *ppd)
{
pgd_t *pgd;
p4d_t *p4d;
@@ -134,7 +134,7 @@ static pud_t __head *sme_prepare_pgd(struct sme_populate_pgd_data *ppd)
return pud;
}
-static void __head sme_populate_pgd_large(struct sme_populate_pgd_data *ppd)
+static void __init sme_populate_pgd_large(struct sme_populate_pgd_data *ppd)
{
pud_t *pud;
pmd_t *pmd;
@@ -150,7 +150,7 @@ static void __head sme_populate_pgd_large(struct sme_populate_pgd_data *ppd)
set_pmd(pmd, __pmd(ppd->paddr | ppd->pmd_flags));
}
-static void __head sme_populate_pgd(struct sme_populate_pgd_data *ppd)
+static void __init sme_populate_pgd(struct sme_populate_pgd_data *ppd)
{
pud_t *pud;
pmd_t *pmd;
@@ -176,7 +176,7 @@ static void __head sme_populate_pgd(struct sme_populate_pgd_data *ppd)
set_pte(pte, __pte(ppd->paddr | ppd->pte_flags));
}
-static void __head __sme_map_range_pmd(struct sme_populate_pgd_data *ppd)
+static void __init __sme_map_range_pmd(struct sme_populate_pgd_data *ppd)
{
while (ppd->vaddr < ppd->vaddr_end) {
sme_populate_pgd_large(ppd);
@@ -186,7 +186,7 @@ static void __head __sme_map_range_pmd(struct sme_populate_pgd_data *ppd)
}
}
-static void __head __sme_map_range_pte(struct sme_populate_pgd_data *ppd)
+static void __init __sme_map_range_pte(struct sme_populate_pgd_data *ppd)
{
while (ppd->vaddr < ppd->vaddr_end) {
sme_populate_pgd(ppd);
@@ -196,7 +196,7 @@ static void __head __sme_map_range_pte(struct sme_populate_pgd_data *ppd)
}
}
-static void __head __sme_map_range(struct sme_populate_pgd_data *ppd,
+static void __init __sme_map_range(struct sme_populate_pgd_data *ppd,
pmdval_t pmd_flags, pteval_t pte_flags)
{
unsigned long vaddr_end;
@@ -220,22 +220,22 @@ static void __head __sme_map_range(struct sme_populate_pgd_data *ppd,
__sme_map_range_pte(ppd);
}
-static void __head sme_map_range_encrypted(struct sme_populate_pgd_data *ppd)
+static void __init sme_map_range_encrypted(struct sme_populate_pgd_data *ppd)
{
__sme_map_range(ppd, PMD_FLAGS_ENC, PTE_FLAGS_ENC);
}
-static void __head sme_map_range_decrypted(struct sme_populate_pgd_data *ppd)
+static void __init sme_map_range_decrypted(struct sme_populate_pgd_data *ppd)
{
__sme_map_range(ppd, PMD_FLAGS_DEC, PTE_FLAGS_DEC);
}
-static void __head sme_map_range_decrypted_wp(struct sme_populate_pgd_data *ppd)
+static void __init sme_map_range_decrypted_wp(struct sme_populate_pgd_data *ppd)
{
__sme_map_range(ppd, PMD_FLAGS_DEC_WP, PTE_FLAGS_DEC_WP);
}
-static unsigned long __head sme_pgtable_calc(unsigned long len)
+static unsigned long __init sme_pgtable_calc(unsigned long len)
{
unsigned long entries = 0, tables = 0;
@@ -272,7 +272,7 @@ static unsigned long __head sme_pgtable_calc(unsigned long len)
return entries + tables;
}
-void __head sme_encrypt_kernel(struct boot_params *bp)
+void __init sme_encrypt_kernel(struct boot_params *bp)
{
unsigned long workarea_start, workarea_end, workarea_len;
unsigned long execute_start, execute_end, execute_len;
@@ -476,7 +476,7 @@ void __head sme_encrypt_kernel(struct boot_params *bp)
native_write_cr3(__native_read_cr3());
}
-void __head sme_enable(struct boot_params *bp)
+void __init sme_enable(struct boot_params *bp)
{
unsigned int eax, ebx, ecx, edx;
unsigned long feature_mask;
diff --git a/arch/x86/include/asm/init.h b/arch/x86/include/asm/init.h
index 8b1b1abcef15..01ccdd168df0 100644
--- a/arch/x86/include/asm/init.h
+++ b/arch/x86/include/asm/init.h
@@ -2,12 +2,6 @@
#ifndef _ASM_X86_INIT_H
#define _ASM_X86_INIT_H
-#if defined(CONFIG_CC_IS_CLANG) && CONFIG_CLANG_VERSION < 170000
-#define __head __section(".head.text") __no_sanitize_undefined __no_stack_protector
-#else
-#define __head __section(".head.text") __no_sanitize_undefined
-#endif
-
struct x86_mapping_info {
void *(*alloc_pgt_page)(void *); /* allocate buf for page table */
void (*free_pgt_page)(void *, void *); /* free buf for page table */
diff --git a/arch/x86/kernel/head_32.S b/arch/x86/kernel/head_32.S
index 2e42056d2306..5962ff2a189a 100644
--- a/arch/x86/kernel/head_32.S
+++ b/arch/x86/kernel/head_32.S
@@ -61,7 +61,7 @@ RESERVE_BRK(pagetables, INIT_MAP_SIZE)
* any particular GDT layout, because we load our own as soon as we
* can.
*/
-__HEAD
+ __INIT
SYM_CODE_START(startup_32)
movl pa(initial_stack),%ecx
diff --git a/arch/x86/kernel/head_64.S b/arch/x86/kernel/head_64.S
index 88cbf932f7c2..ac1116f623d3 100644
--- a/arch/x86/kernel/head_64.S
+++ b/arch/x86/kernel/head_64.S
@@ -33,7 +33,7 @@
* because we need identity-mapped pages.
*/
- __HEAD
+ __INIT
.code64
SYM_CODE_START_NOALIGN(startup_64)
UNWIND_HINT_END_OF_STACK
diff --git a/arch/x86/platform/pvh/head.S b/arch/x86/platform/pvh/head.S
index cfa18ec7d55f..16aa1f018b80 100644
--- a/arch/x86/platform/pvh/head.S
+++ b/arch/x86/platform/pvh/head.S
@@ -24,7 +24,7 @@
#include <asm/nospec-branch.h>
#include <xen/interface/elfnote.h>
- __HEAD
+ __INIT
/*
* Entry point for PVH guests.
--
2.49.0.906.g1f30a19c02-goog
* [RFT PATCH v2 21/23] x86/boot: Disallow absolute symbol references in startup code
2025-05-04 9:52 [RFT PATCH v2 00/23] x86: strict separation of startup code Ard Biesheuvel
` (19 preceding siblings ...)
2025-05-04 9:52 ` [RFT PATCH v2 20/23] x86/boot: Move startup code out of __head section Ard Biesheuvel
@ 2025-05-04 9:52 ` Ard Biesheuvel
2025-05-04 9:52 ` [RFT PATCH v2 22/23] x86/boot: Revert "Reject absolute references in .head.text" Ard Biesheuvel
` (3 subsequent siblings)
24 siblings, 0 replies; 59+ messages in thread
From: Ard Biesheuvel @ 2025-05-04 9:52 UTC (permalink / raw)
To: linux-kernel
Cc: linux-efi, x86, Ard Biesheuvel, Borislav Petkov, Ingo Molnar,
Dionna Amalie Glaze, Kevin Loughlin, Tom Lendacky
From: Ard Biesheuvel <ardb@kernel.org>
Check that the objects built under arch/x86/boot/startup do not contain
any absolute symbol reference. Given that the code is built with -fPIC,
such references can only be emitted using R_X86_64_64 relocations, so
checking that those are absent is sufficient.
Note that debug sections and the __patchable_function_entries section may
contain such relocations nonetheless, but these are unnecessary in the
startup code, so they can be dropped first.
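For example (hypothetical snippet, not taken from this series),
statically initializing a function pointer in startup code:

	/* the initializer needs an R_X86_64_64 relocation to hold
	   the absolute address of the target */
	static void (*handler)(struct pt_regs *, unsigned long) = do_vc_no_ghcb;

would now fail the build, while RIP-relative references (e.g. via
rip_rel_ptr()) remain permitted.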
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
---
arch/x86/boot/startup/Makefile | 12 ++++++++++--
1 file changed, 10 insertions(+), 2 deletions(-)
diff --git a/arch/x86/boot/startup/Makefile b/arch/x86/boot/startup/Makefile
index c7e690091f81..7b83c8f597ef 100644
--- a/arch/x86/boot/startup/Makefile
+++ b/arch/x86/boot/startup/Makefile
@@ -35,9 +35,17 @@ $(patsubst %.o,$(obj)/%.o,$(lib-y)): OBJECT_FILES_NON_STANDARD := y
# code, or code that has explicitly been made accessible to it via a symbol
# alias.
#
-$(obj)/%.pi.o: OBJCOPYFLAGS := --prefix-symbols=__pi_
+$(obj)/%.pi.o: OBJCOPYFLAGS := --prefix-symbols=__pi_ --strip-debug \
+ --remove-section=.rela__patchable_function_entries
$(obj)/%.pi.o: $(obj)/%.o FORCE
- $(call if_changed,objcopy)
+ $(call if_changed,piobjcopy)
+
+quiet_cmd_piobjcopy = $(quiet_cmd_objcopy)
+ cmd_piobjcopy = $(cmd_objcopy); \
+ if $(READELF) -r $(@) | grep R_X86_64_64; then \
+ echo "$@: R_X86_64_64 references not allowed in startup code" >&2; \
+ /bin/false; \
+ fi
extra-y := $(obj-y)
obj-y := $(patsubst %.o,%.pi.o,$(obj-y))
--
2.49.0.906.g1f30a19c02-goog
* [RFT PATCH v2 22/23] x86/boot: Revert "Reject absolute references in .head.text"
2025-05-04 9:52 [RFT PATCH v2 00/23] x86: strict separation of startup code Ard Biesheuvel
` (20 preceding siblings ...)
2025-05-04 9:52 ` [RFT PATCH v2 21/23] x86/boot: Disallow absolute symbol references in startup code Ard Biesheuvel
@ 2025-05-04 9:52 ` Ard Biesheuvel
2025-05-04 9:52 ` [RFT PATCH v2 23/23] x86/boot: Get rid of the .head.text section Ard Biesheuvel
` (2 subsequent siblings)
24 siblings, 0 replies; 59+ messages in thread
From: Ard Biesheuvel @ 2025-05-04 9:52 UTC (permalink / raw)
To: linux-kernel
Cc: linux-efi, x86, Ard Biesheuvel, Borislav Petkov, Ingo Molnar,
Dionna Amalie Glaze, Kevin Loughlin, Tom Lendacky
From: Ard Biesheuvel <ardb@kernel.org>
This reverts commit faf0ed487415f76fe4acf7980ce360901f5e1698.
The startup code is checked directly for the absence of absolute symbol
references, so checking the .head.text section in the relocs tool is no
longer needed.
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
---
arch/x86/tools/relocs.c | 8 +-------
1 file changed, 1 insertion(+), 7 deletions(-)
diff --git a/arch/x86/tools/relocs.c b/arch/x86/tools/relocs.c
index 5778bc498415..e5a2b9a912d1 100644
--- a/arch/x86/tools/relocs.c
+++ b/arch/x86/tools/relocs.c
@@ -740,10 +740,10 @@ static void walk_relocs(int (*process)(struct section *sec, Elf_Rel *rel,
static int do_reloc64(struct section *sec, Elf_Rel *rel, ElfW(Sym) *sym,
const char *symname)
{
- int headtext = !strcmp(sec_name(sec->shdr.sh_info), ".head.text");
unsigned r_type = ELF64_R_TYPE(rel->r_info);
ElfW(Addr) offset = rel->r_offset;
int shn_abs = (sym->st_shndx == SHN_ABS) && !is_reloc(S_REL, symname);
+
if (sym->st_shndx == SHN_UNDEF)
return 0;
@@ -783,12 +783,6 @@ static int do_reloc64(struct section *sec, Elf_Rel *rel, ElfW(Sym) *sym,
break;
}
- if (headtext) {
- die("Absolute reference to symbol '%s' not permitted in .head.text\n",
- symname);
- break;
- }
-
/*
* Relocation offsets for 64 bit kernels are output
* as 32 bits and sign extended back to 64 bits when
--
2.49.0.906.g1f30a19c02-goog
* [RFT PATCH v2 23/23] x86/boot: Get rid of the .head.text section
2025-05-04 9:52 [RFT PATCH v2 00/23] x86: strict separation of startup code Ard Biesheuvel
` (21 preceding siblings ...)
2025-05-04 9:52 ` [RFT PATCH v2 22/23] x86/boot: Revert "Reject absolute references in .head.text" Ard Biesheuvel
@ 2025-05-04 9:52 ` Ard Biesheuvel
2025-05-04 14:04 ` [RFT PATCH v2 00/23] x86: strict separation of startup code Ingo Molnar
2025-05-07 9:52 ` Borislav Petkov
24 siblings, 0 replies; 59+ messages in thread
From: Ard Biesheuvel @ 2025-05-04 9:52 UTC (permalink / raw)
To: linux-kernel
Cc: linux-efi, x86, Ard Biesheuvel, Borislav Petkov, Ingo Molnar,
Dionna Amalie Glaze, Kevin Loughlin, Tom Lendacky
From: Ard Biesheuvel <ardb@kernel.org>
The .head.text section is now empty, so it can be dropped from the
linker script.
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
---
arch/x86/kernel/vmlinux.lds.S | 5 -----
1 file changed, 5 deletions(-)
diff --git a/arch/x86/kernel/vmlinux.lds.S b/arch/x86/kernel/vmlinux.lds.S
index 4aaa1693b262..af8bc4a6f6c3 100644
--- a/arch/x86/kernel/vmlinux.lds.S
+++ b/arch/x86/kernel/vmlinux.lds.S
@@ -160,11 +160,6 @@ SECTIONS
} :text = 0xcccccccc
- /* bootstrapping code */
- .head.text : AT(ADDR(.head.text) - LOAD_OFFSET) {
- HEAD_TEXT
- } :text = 0xcccccccc
-
/* End of text section, which should occupy whole number of pages */
_etext = .;
. = ALIGN(PAGE_SIZE);
--
2.49.0.906.g1f30a19c02-goog
* Re: [RFT PATCH v2 03/23] x86/boot: Drop global variables keeping track of LA57 state
2025-05-04 9:52 ` [RFT PATCH v2 03/23] x86/boot: Drop global variables keeping track of LA57 state Ard Biesheuvel
@ 2025-05-04 13:50 ` Ingo Molnar
2025-05-04 14:46 ` Ard Biesheuvel
2025-05-04 14:58 ` Linus Torvalds
0 siblings, 2 replies; 59+ messages in thread
From: Ingo Molnar @ 2025-05-04 13:50 UTC (permalink / raw)
To: Ard Biesheuvel
Cc: linux-kernel, linux-efi, x86, Ard Biesheuvel, Borislav Petkov,
Dionna Amalie Glaze, Kevin Loughlin, Tom Lendacky, Linus Torvalds
* Ard Biesheuvel <ardb+git@google.com> wrote:
> From: Ard Biesheuvel <ardb@kernel.org>
>
> On x86_64, the core kernel is entered in long mode, which implies that
> paging is enabled. This means that the CR4.LA57 control bit is
> guaranteed to be in sync with the number of paging levels used by the
> kernel, and there is no need to store this in a variable.
>
> There is also no need to use variables for storing the calculations of
> pgdir_shift and ptrs_per_p4d, as they are easily determined on the fly.
>
> This removes the need for two different sources of truth (i.e., early
> and late) for determining whether 5-level paging is in use: CR4.LA57
> always reflects the actual state, and never changes from the point of
> view of the 64-bit core kernel. It also removes the need for exposing
> the associated variables to the startup code. The only potential concern
> is the cost of CR4 accesses, which can be mitigated using alternatives
> patching based on feature detection.
>
> Note that even the decompressor does not manipulate any page tables
> before updating CR4.LA57, so it can also avoid the associated global
> variables entirely. However, as it does not implement alternatives
> patching, the associated ELF sections need to be discarded.
>
> Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
> ---
> arch/x86/boot/compressed/misc.h | 4 --
> arch/x86/boot/compressed/pgtable_64.c | 12 ------
> arch/x86/boot/compressed/vmlinux.lds.S | 1 +
> arch/x86/boot/startup/map_kernel.c | 12 +-----
> arch/x86/boot/startup/sme.c | 9 ----
> arch/x86/include/asm/pgtable_64_types.h | 43 ++++++++++----------
> arch/x86/kernel/cpu/common.c | 2 -
> arch/x86/kernel/head64.c | 11 -----
> arch/x86/mm/kasan_init_64.c | 3 --
> 9 files changed, 24 insertions(+), 73 deletions(-)
So this patch breaks the build & creates header dependency hell on
x86-64 allnoconfig:
starship:~/tip> m kernel/pid.o
DESCEND objtool
CC arch/x86/kernel/asm-offsets.s
INSTALL libsubcmd_headers
In file included from ./arch/x86/include/asm/pgtable_64_types.h:5,
from ./arch/x86/include/asm/pgtable_types.h:283,
from ./arch/x86/include/asm/processor.h:21,
from ./arch/x86/include/asm/cpufeature.h:5,
from ./arch/x86/include/asm/thread_info.h:59,
from ./include/linux/thread_info.h:60,
from ./include/linux/spinlock.h:60,
from ./include/linux/swait.h:7,
from ./include/linux/completion.h:12,
from ./include/linux/crypto.h:15,
from arch/x86/kernel/asm-offsets.c:9:
./arch/x86/include/asm/sparsemem.h:29:34: warning: "pgtable_l5_enabled" is not defined, evaluates to 0 [-Wundef]
29 | # define MAX_PHYSMEM_BITS (pgtable_l5_enabled() ? 52 : 46)
| ^~~~~~~~~~~~~~~~~~
./include/linux/page-flags-layout.h:31:26: note: in expansion of macro ‘MAX_PHYSMEM_BITS’
31 | #define SECTIONS_SHIFT (MAX_PHYSMEM_BITS - SECTION_SIZE_BITS)
Plus I'm not sure I'm happy about this kind of complexity getting
embedded deep within low level MM primitives:
static __always_inline __pure bool pgtable_l5_enabled(void)
{
unsigned long r;
bool ret;
if (!IS_ENABLED(CONFIG_X86_5LEVEL))
return false;
asm(ALTERNATIVE_TERNARY(
"movq %%cr4, %[reg] \n\t btl %[la57], %k[reg]" CC_SET(c),
%P[feat], "stc", "clc")
: [reg] "=&r" (r), CC_OUT(c) (ret)
: [feat] "i" (X86_FEATURE_LA57),
[la57] "i" (X86_CR4_LA57_BIT)
: "cc");
return ret;
}
it's basically everywhere:
arch/x86/include/asm/page_64_types.h:#define __VIRTUAL_MASK_SHIFT (pgtable_l5_enabled() ? 56 : 47)
arch/x86/include/asm/paravirt.h: if (pgtable_l5_enabled()) \
arch/x86/include/asm/paravirt.h: if (pgtable_l5_enabled()) \
arch/x86/include/asm/pgalloc.h: if (!pgtable_l5_enabled())
arch/x86/include/asm/pgalloc.h: if (!pgtable_l5_enabled())
arch/x86/include/asm/pgalloc.h: if (pgtable_l5_enabled())
arch/x86/include/asm/pgtable.h:#define pgd_clear(pgd) (pgtable_l5_enabled() ? native_pgd_clear(pgd) : 0)
arch/x86/include/asm/pgtable.h: if (!pgtable_l5_enabled())
arch/x86/include/asm/pgtable.h: if (!pgtable_l5_enabled())
arch/x86/include/asm/pgtable.h: if (!pgtable_l5_enabled())
arch/x86/include/asm/pgtable.h: if (!pgtable_l5_enabled())
arch/x86/include/asm/pgtable_32_types.h:#define pgtable_l5_enabled() 0
arch/x86/include/asm/pgtable_64.h: return !pgtable_l5_enabled();
arch/x86/include/asm/pgtable_64.h: if (pgtable_l5_enabled() ||
arch/x86/include/asm/pgtable_64_types.h:static __always_inline __pure bool pgtable_l5_enabled(void)
arch/x86/include/asm/pgtable_64_types.h:#define PGDIR_SHIFT (pgtable_l5_enabled() ? 48 : 39)
arch/x86/include/asm/pgtable_64_types.h:#define PTRS_PER_P4D (pgtable_l5_enabled() ? 512 : 1)
arch/x86/include/asm/pgtable_64_types.h:# define VMALLOC_SIZE_TB (pgtable_l5_enabled() ? VMALLOC_SIZE_TB_L5 : VMALLOC_SIZE_TB_L4)
arch/x86/include/asm/sparsemem.h:# define MAX_PHYSMEM_BITS (pgtable_l5_enabled() ? 52 : 46)
Inlined approximately a gazillion times. (449 times on x86 defconfig.
Yes, I just counted it.)
And it's not even worth it, as it generates horrendous code:
154: 0f 20 e0 mov %cr4,%rax
157: 0f ba e0 0c bt $0xc,%eax
... while CR4 access might be faster these days, it's certainly not as
fast as simple percpu access. Plus it clobbers a register (RAX in the
example above), which is unnecessary for a flag test.
Cannot pgtable_l5_enabled() be a single, simple percpu flag or so?
And yes, this creates another layer for these values - but thus
decouples low level MM from detection & implementation complexities,
which is a plus ...
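Something like this minimal sketch, perhaps (untested; assumes the flag
is set once during early boot CPU setup, before any user of these
helpers runs):

	DECLARE_PER_CPU_READ_MOSTLY(bool, pgtable_l5);

	static __always_inline bool pgtable_l5_enabled(void)
	{
		if (!IS_ENABLED(CONFIG_X86_5LEVEL))
			return false;

		/* a single percpu load instead of a CR4 read */
		return this_cpu_read(pgtable_l5);
	}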
Thanks,
Ingo
* Re: [RFT PATCH v2 00/23] x86: strict separation of startup code
2025-05-04 9:52 [RFT PATCH v2 00/23] x86: strict separation of startup code Ard Biesheuvel
` (22 preceding siblings ...)
2025-05-04 9:52 ` [RFT PATCH v2 23/23] x86/boot: Get rid of the .head.text section Ard Biesheuvel
@ 2025-05-04 14:04 ` Ingo Molnar
2025-05-04 14:55 ` Ard Biesheuvel
2025-05-07 9:52 ` Borislav Petkov
24 siblings, 1 reply; 59+ messages in thread
From: Ingo Molnar @ 2025-05-04 14:04 UTC (permalink / raw)
To: Ard Biesheuvel
Cc: linux-kernel, linux-efi, x86, Ard Biesheuvel, Borislav Petkov,
Dionna Amalie Glaze, Kevin Loughlin, Tom Lendacky
* Ard Biesheuvel <ardb+git@google.com> wrote:
> Ard Biesheuvel (23):
> x86/boot: Move early_setup_gdt() back into head64.c
> x86/boot: Disregard __supported_pte_mask in __startup_64()
> x86/boot: Drop global variables keeping track of LA57 state
> x86/sev: Make sev_snp_enabled() a static function
> x86/sev: Move instruction decoder into separate source file
> x86/sev: Disentangle #VC handling code from startup code
> x86/sev: Separate MSR and GHCB based snp_cpuid() via a callback
> x86/sev: Fall back to early page state change code only during boot
> x86/sev: Move GHCB page based HV communication out of startup code
> x86/sev: Use boot SVSM CA for all startup and init code
> x86/boot: Drop redundant RMPADJUST in SEV SVSM presence check
> x86/sev: Unify SEV-SNP hypervisor feature check
> x86/linkage: Add SYM_PIC_ALIAS() macro helper to emit symbol aliases
> x86/boot: Add a bunch of PIC aliases
> x86/boot: Provide __pti_set_user_pgtbl() to startup code
> x86/sev: Provide PIC aliases for SEV related data objects
> x86/sev: Move __sev_[get|put]_ghcb() into separate noinstr object
> x86/sev: Export startup routines for ordinary use
> x86/boot: Created a confined code area for startup code
> x86/boot: Move startup code out of __head section
> x86/boot: Disallow absolute symbol references in startup code
> x86/boot: Revert "Reject absolute references in .head.text"
> x86/boot: Get rid of the .head.text section
> 42 files changed, 2367 insertions(+), 2325 deletions(-)
So to move this forward I applied the following 7 patches to
tip:x86/boot:
x86/boot: Move early_setup_gdt() back into head64.c
x86/boot: Disregard __supported_pte_mask in __startup_64()
x86/sev: Make sev_snp_enabled() a static function
x86/sev: Move instruction decoder into separate source file
x86/linkage: Add SYM_PIC_ALIAS() macro helper to emit symbol aliases
x86/boot: Add a bunch of PIC aliases
x86/boot: Provide __pti_set_user_pgtbl() to startup code
Which I believe are independent of SEV testing.
I also merged in pending upstream fixes, including:
8ed12ab1319b ("x86/boot/sev: Support memory acceptance in the EFI stub under SVSM")
Which should make tip:x86/boot a good base for your series going
forward?
Thanks,
Ingo
^ permalink raw reply [flat|nested] 59+ messages in thread
* [tip: x86/boot] x86/boot: Provide __pti_set_user_pgtbl() to startup code
2025-05-04 9:52 ` [RFT PATCH v2 15/23] x86/boot: Provide __pti_set_user_pgtbl() to startup code Ard Biesheuvel
@ 2025-05-04 14:20 ` tip-bot2 for Ard Biesheuvel
2025-05-05 16:03 ` Borislav Petkov
2025-05-05 16:54 ` tip-bot2 for Ard Biesheuvel
1 sibling, 1 reply; 59+ messages in thread
From: tip-bot2 for Ard Biesheuvel @ 2025-05-04 14:20 UTC (permalink / raw)
To: linux-tip-commits
Cc: Ard Biesheuvel, Ingo Molnar, Arnd Bergmann, David Woodhouse,
Dionna Amalie Glaze, H. Peter Anvin, Kees Cook, Kevin Loughlin,
Len Brown, Linus Torvalds, Rafael J. Wysocki, Tom Lendacky,
linux-efi, x86, linux-kernel
The following commit has been merged into the x86/boot branch of tip:
Commit-ID: 5297886f0cc45db5f4a804caf359e6e7874ee864
Gitweb: https://git.kernel.org/tip/5297886f0cc45db5f4a804caf359e6e7874ee864
Author: Ard Biesheuvel <ardb@kernel.org>
AuthorDate: Sun, 04 May 2025 11:52:45 +02:00
Committer: Ingo Molnar <mingo@kernel.org>
CommitterDate: Sun, 04 May 2025 15:59:43 +02:00
x86/boot: Provide __pti_set_user_pgtbl() to startup code
The SME encryption startup code populates page tables using the ordinary
set_pXX() helpers, and in a PTI build, these will call out to
__pti_set_user_pgtbl() to manipulate the shadow copy of the page tables
for user space.
This is unneeded for the startup code, which only manipulates the
swapper page tables, and so this call could be avoided in this
particular case. So instead of exposing the ordinary
__pti_set_user_pgtbl() to the startup code after it gets confined into
its own symbol space, provide an alternative which just returns pgd,
which is always correct in the startup context.
Annotate it as __weak for now; this will be dropped in a subsequent
patch.
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: David Woodhouse <dwmw@amazon.co.uk>
Cc: Dionna Amalie Glaze <dionnaglaze@google.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Kevin Loughlin <kevinloughlin@google.com>
Cc: Len Brown <len.brown@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: linux-efi@vger.kernel.org
Link: https://lore.kernel.org/r/20250504095230.2932860-40-ardb+git@google.com
---
arch/x86/boot/startup/sme.c | 9 +++++++++
1 file changed, 9 insertions(+)
diff --git a/arch/x86/boot/startup/sme.c b/arch/x86/boot/startup/sme.c
index 5738b31..753cd20 100644
--- a/arch/x86/boot/startup/sme.c
+++ b/arch/x86/boot/startup/sme.c
@@ -564,3 +564,12 @@ void __head sme_enable(struct boot_params *bp)
cc_vendor = CC_VENDOR_AMD;
cc_set_mask(me_mask);
}
+
+#ifdef CONFIG_MITIGATION_PAGE_TABLE_ISOLATION
+/* Local version for startup code, which never operates on user page tables */
+__weak
+pgd_t __pti_set_user_pgtbl(pgd_t *pgdp, pgd_t pgd)
+{
+ return pgd;
+}
+#endif
^ permalink raw reply related [flat|nested] 59+ messages in thread
* [tip: x86/boot] x86/boot: Add a bunch of PIC aliases
2025-05-04 9:52 ` [RFT PATCH v2 14/23] x86/boot: Add a bunch of PIC aliases Ard Biesheuvel
@ 2025-05-04 14:20 ` tip-bot2 for Ard Biesheuvel
0 siblings, 0 replies; 59+ messages in thread
From: tip-bot2 for Ard Biesheuvel @ 2025-05-04 14:20 UTC (permalink / raw)
To: linux-tip-commits
Cc: Ard Biesheuvel, Ingo Molnar, Arnd Bergmann, David Woodhouse,
Dionna Amalie Glaze, H. Peter Anvin, Kees Cook, Kevin Loughlin,
Len Brown, Linus Torvalds, Rafael J. Wysocki, Tom Lendacky,
linux-efi, x86, linux-kernel
The following commit has been merged into the x86/boot branch of tip:
Commit-ID: 419cbaf6a56a6e4b7e6d2278302c197f55dec830
Gitweb: https://git.kernel.org/tip/419cbaf6a56a6e4b7e6d2278302c197f55dec830
Author: Ard Biesheuvel <ardb@kernel.org>
AuthorDate: Sun, 04 May 2025 11:52:44 +02:00
Committer: Ingo Molnar <mingo@kernel.org>
CommitterDate: Sun, 04 May 2025 15:59:43 +02:00
x86/boot: Add a bunch of PIC aliases
Add aliases for all the data objects that the startup code references -
this is needed so that this code can be moved into its own confined area
where it can only access symbols that have a __pi_ prefix.
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: David Woodhouse <dwmw@amazon.co.uk>
Cc: Dionna Amalie Glaze <dionnaglaze@google.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Kevin Loughlin <kevinloughlin@google.com>
Cc: Len Brown <len.brown@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: linux-efi@vger.kernel.org
Link: https://lore.kernel.org/r/20250504095230.2932860-39-ardb+git@google.com
---
arch/x86/coco/core.c | 2 ++
arch/x86/kernel/cpu/common.c | 1 +
arch/x86/kernel/head64.c | 4 ++++
arch/x86/kernel/head_64.S | 8 ++++++++
arch/x86/kernel/setup.c | 1 +
arch/x86/kernel/vmlinux.lds.S | 4 ++++
arch/x86/lib/memcpy_64.S | 1 +
arch/x86/lib/memset_64.S | 1 +
arch/x86/lib/retpoline.S | 2 ++
arch/x86/mm/mem_encrypt_amd.c | 2 ++
arch/x86/mm/pgtable.c | 1 +
tools/objtool/arch/x86/decode.c | 6 ++++--
12 files changed, 31 insertions(+), 2 deletions(-)
diff --git a/arch/x86/coco/core.c b/arch/x86/coco/core.c
index 9a0ddda..d4610af 100644
--- a/arch/x86/coco/core.c
+++ b/arch/x86/coco/core.c
@@ -18,7 +18,9 @@
#include <asm/processor.h>
enum cc_vendor cc_vendor __ro_after_init = CC_VENDOR_NONE;
+SYM_PIC_ALIAS(cc_vendor);
u64 cc_mask __ro_after_init;
+SYM_PIC_ALIAS(cc_mask);
static struct cc_attr_flags {
__u64 host_sev_snp : 1,
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 12126ad..f0f8548 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -242,6 +242,7 @@ DEFINE_PER_CPU_PAGE_ALIGNED(struct gdt_page, gdt_page) = { .gdt = {
#endif
} };
EXPORT_PER_CPU_SYMBOL_GPL(gdt_page);
+SYM_PIC_ALIAS(gdt_page);
#ifdef CONFIG_X86_64
static int __init x86_nopcid_setup(char *s)
diff --git a/arch/x86/kernel/head64.c b/arch/x86/kernel/head64.c
index 29226f3..510fb41 100644
--- a/arch/x86/kernel/head64.c
+++ b/arch/x86/kernel/head64.c
@@ -48,6 +48,7 @@
*/
extern pmd_t early_dynamic_pgts[EARLY_DYNAMIC_PAGE_TABLES][PTRS_PER_PMD];
unsigned int __initdata next_early_pgt;
+SYM_PIC_ALIAS(next_early_pgt);
pmdval_t early_pmd_flags = __PAGE_KERNEL_LARGE & ~(_PAGE_GLOBAL | _PAGE_NX);
#ifdef CONFIG_X86_5LEVEL
@@ -61,10 +62,13 @@ EXPORT_SYMBOL(ptrs_per_p4d);
#ifdef CONFIG_DYNAMIC_MEMORY_LAYOUT
unsigned long page_offset_base __ro_after_init = __PAGE_OFFSET_BASE_L4;
EXPORT_SYMBOL(page_offset_base);
+SYM_PIC_ALIAS(page_offset_base);
unsigned long vmalloc_base __ro_after_init = __VMALLOC_BASE_L4;
EXPORT_SYMBOL(vmalloc_base);
+SYM_PIC_ALIAS(vmalloc_base);
unsigned long vmemmap_base __ro_after_init = __VMEMMAP_BASE_L4;
EXPORT_SYMBOL(vmemmap_base);
+SYM_PIC_ALIAS(vmemmap_base);
#endif
/* Wipe all early page tables except for the kernel symbol map */
diff --git a/arch/x86/kernel/head_64.S b/arch/x86/kernel/head_64.S
index fefe2a2..0694208 100644
--- a/arch/x86/kernel/head_64.S
+++ b/arch/x86/kernel/head_64.S
@@ -573,6 +573,7 @@ SYM_CODE_START_NOALIGN(vc_no_ghcb)
/* Pure iret required here - don't use INTERRUPT_RETURN */
iretq
SYM_CODE_END(vc_no_ghcb)
+SYM_PIC_ALIAS(vc_no_ghcb);
#endif
#ifdef CONFIG_MITIGATION_PAGE_TABLE_ISOLATION
@@ -604,10 +605,12 @@ SYM_DATA_START_PTI_ALIGNED(early_top_pgt)
.quad level3_kernel_pgt - __START_KERNEL_map + _PAGE_TABLE_NOENC
.fill PTI_USER_PGD_FILL,8,0
SYM_DATA_END(early_top_pgt)
+SYM_PIC_ALIAS(early_top_pgt)
SYM_DATA_START_PAGE_ALIGNED(early_dynamic_pgts)
.fill 512*EARLY_DYNAMIC_PAGE_TABLES,8,0
SYM_DATA_END(early_dynamic_pgts)
+SYM_PIC_ALIAS(early_dynamic_pgts);
SYM_DATA(early_recursion_flag, .long 0)
@@ -651,6 +654,7 @@ SYM_DATA_START_PAGE_ALIGNED(level4_kernel_pgt)
.fill 511,8,0
.quad level3_kernel_pgt - __START_KERNEL_map + _PAGE_TABLE_NOENC
SYM_DATA_END(level4_kernel_pgt)
+SYM_PIC_ALIAS(level4_kernel_pgt)
#endif
SYM_DATA_START_PAGE_ALIGNED(level3_kernel_pgt)
@@ -659,6 +663,7 @@ SYM_DATA_START_PAGE_ALIGNED(level3_kernel_pgt)
.quad level2_kernel_pgt - __START_KERNEL_map + _KERNPG_TABLE_NOENC
.quad level2_fixmap_pgt - __START_KERNEL_map + _PAGE_TABLE_NOENC
SYM_DATA_END(level3_kernel_pgt)
+SYM_PIC_ALIAS(level3_kernel_pgt)
SYM_DATA_START_PAGE_ALIGNED(level2_kernel_pgt)
/*
@@ -676,6 +681,7 @@ SYM_DATA_START_PAGE_ALIGNED(level2_kernel_pgt)
*/
PMDS(0, __PAGE_KERNEL_LARGE_EXEC, KERNEL_IMAGE_SIZE/PMD_SIZE)
SYM_DATA_END(level2_kernel_pgt)
+SYM_PIC_ALIAS(level2_kernel_pgt)
SYM_DATA_START_PAGE_ALIGNED(level2_fixmap_pgt)
.fill (512 - 4 - FIXMAP_PMD_NUM),8,0
@@ -688,6 +694,7 @@ SYM_DATA_START_PAGE_ALIGNED(level2_fixmap_pgt)
/* 6 MB reserved space + a 2MB hole */
.fill 4,8,0
SYM_DATA_END(level2_fixmap_pgt)
+SYM_PIC_ALIAS(level2_fixmap_pgt)
SYM_DATA_START_PAGE_ALIGNED(level1_fixmap_pgt)
.rept (FIXMAP_PMD_NUM)
@@ -703,6 +710,7 @@ SYM_DATA(smpboot_control, .long 0)
.align 16
/* This must match the first entry in level2_kernel_pgt */
SYM_DATA(phys_base, .quad 0x0)
+SYM_PIC_ALIAS(phys_base);
EXPORT_SYMBOL(phys_base)
#include "../xen/xen-head.S"
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 9d2a13b..e0cf159 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -134,6 +134,7 @@ struct ist_info ist_info;
struct cpuinfo_x86 boot_cpu_data __read_mostly;
EXPORT_SYMBOL(boot_cpu_data);
+SYM_PIC_ALIAS(boot_cpu_data);
#if !defined(CONFIG_X86_PAE) || defined(CONFIG_X86_64)
__visible unsigned long mmu_cr4_features __ro_after_init;
diff --git a/arch/x86/kernel/vmlinux.lds.S b/arch/x86/kernel/vmlinux.lds.S
index ccdc45e..9340c74 100644
--- a/arch/x86/kernel/vmlinux.lds.S
+++ b/arch/x86/kernel/vmlinux.lds.S
@@ -79,11 +79,13 @@ const_cpu_current_top_of_stack = cpu_current_top_of_stack;
#define BSS_DECRYPTED \
. = ALIGN(PMD_SIZE); \
__start_bss_decrypted = .; \
+ __pi___start_bss_decrypted = .; \
*(.bss..decrypted); \
. = ALIGN(PAGE_SIZE); \
__start_bss_decrypted_unused = .; \
. = ALIGN(PMD_SIZE); \
__end_bss_decrypted = .; \
+ __pi___end_bss_decrypted = .; \
#else
@@ -128,6 +130,7 @@ SECTIONS
/* Text and read-only data */
.text : AT(ADDR(.text) - LOAD_OFFSET) {
_text = .;
+ __pi__text = .;
_stext = .;
ALIGN_ENTRY_TEXT_BEGIN
*(.text..__x86.rethunk_untrain)
@@ -391,6 +394,7 @@ SECTIONS
. = ALIGN(PAGE_SIZE); /* keep VO_INIT_SIZE page aligned */
_end = .;
+ __pi__end = .;
#ifdef CONFIG_AMD_MEM_ENCRYPT
/*
diff --git a/arch/x86/lib/memcpy_64.S b/arch/x86/lib/memcpy_64.S
index 0ae2e17..12a23fa 100644
--- a/arch/x86/lib/memcpy_64.S
+++ b/arch/x86/lib/memcpy_64.S
@@ -41,6 +41,7 @@ SYM_FUNC_END(__memcpy)
EXPORT_SYMBOL(__memcpy)
SYM_FUNC_ALIAS_MEMFUNC(memcpy, __memcpy)
+SYM_PIC_ALIAS(memcpy)
EXPORT_SYMBOL(memcpy)
SYM_FUNC_START_LOCAL(memcpy_orig)
diff --git a/arch/x86/lib/memset_64.S b/arch/x86/lib/memset_64.S
index d66b710..fb5a03c 100644
--- a/arch/x86/lib/memset_64.S
+++ b/arch/x86/lib/memset_64.S
@@ -42,6 +42,7 @@ SYM_FUNC_END(__memset)
EXPORT_SYMBOL(__memset)
SYM_FUNC_ALIAS_MEMFUNC(memset, __memset)
+SYM_PIC_ALIAS(memset)
EXPORT_SYMBOL(memset)
SYM_FUNC_START_LOCAL(memset_orig)
diff --git a/arch/x86/lib/retpoline.S b/arch/x86/lib/retpoline.S
index a26c43a..9f31166 100644
--- a/arch/x86/lib/retpoline.S
+++ b/arch/x86/lib/retpoline.S
@@ -40,6 +40,7 @@ SYM_INNER_LABEL(__x86_indirect_thunk_\reg, SYM_L_GLOBAL)
ALTERNATIVE_2 __stringify(RETPOLINE \reg), \
__stringify(lfence; ANNOTATE_RETPOLINE_SAFE; jmp *%\reg; int3), X86_FEATURE_RETPOLINE_LFENCE, \
__stringify(ANNOTATE_RETPOLINE_SAFE; jmp *%\reg), ALT_NOT(X86_FEATURE_RETPOLINE)
+SYM_PIC_ALIAS(__x86_indirect_thunk_\reg)
.endm
@@ -394,6 +395,7 @@ SYM_CODE_START(__x86_return_thunk)
#endif
int3
SYM_CODE_END(__x86_return_thunk)
+SYM_PIC_ALIAS(__x86_return_thunk)
EXPORT_SYMBOL(__x86_return_thunk)
#endif /* CONFIG_MITIGATION_RETHUNK */
diff --git a/arch/x86/mm/mem_encrypt_amd.c b/arch/x86/mm/mem_encrypt_amd.c
index 7490ff6..faf3a13 100644
--- a/arch/x86/mm/mem_encrypt_amd.c
+++ b/arch/x86/mm/mem_encrypt_amd.c
@@ -40,7 +40,9 @@
* section is later cleared.
*/
u64 sme_me_mask __section(".data") = 0;
+SYM_PIC_ALIAS(sme_me_mask);
u64 sev_status __section(".data") = 0;
+SYM_PIC_ALIAS(sev_status);
u64 sev_check_data __section(".data") = 0;
EXPORT_SYMBOL(sme_me_mask);
diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index f7ae44d..e62af72 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -10,6 +10,7 @@
#ifdef CONFIG_DYNAMIC_PHYSICAL_MASK
phys_addr_t physical_mask __ro_after_init = (1ULL << __PHYSICAL_MASK_SHIFT) - 1;
EXPORT_SYMBOL(physical_mask);
+SYM_PIC_ALIAS(physical_mask);
#endif
pgtable_t pte_alloc_one(struct mm_struct *mm)
diff --git a/tools/objtool/arch/x86/decode.c b/tools/objtool/arch/x86/decode.c
index 3ce7b54..331b9a7 100644
--- a/tools/objtool/arch/x86/decode.c
+++ b/tools/objtool/arch/x86/decode.c
@@ -842,12 +842,14 @@ int arch_decode_hint_reg(u8 sp_reg, int *base)
bool arch_is_retpoline(struct symbol *sym)
{
- return !strncmp(sym->name, "__x86_indirect_", 15);
+ return !strncmp(sym->name, "__x86_indirect_", 15) ||
+ !strncmp(sym->name, "__pi___x86_indirect_", 20);
}
bool arch_is_rethunk(struct symbol *sym)
{
- return !strcmp(sym->name, "__x86_return_thunk");
+ return !strcmp(sym->name, "__x86_return_thunk") ||
+ !strcmp(sym->name, "__pi___x86_return_thunk");
}
bool arch_is_embedded_insn(struct symbol *sym)
^ permalink raw reply related [flat|nested] 59+ messages in thread
* [tip: x86/boot] x86/linkage: Add SYM_PIC_ALIAS() macro helper to emit symbol aliases
2025-05-04 9:52 ` [RFT PATCH v2 13/23] x86/linkage: Add SYM_PIC_ALIAS() macro helper to emit symbol aliases Ard Biesheuvel
@ 2025-05-04 14:20 ` tip-bot2 for Ard Biesheuvel
0 siblings, 0 replies; 59+ messages in thread
From: tip-bot2 for Ard Biesheuvel @ 2025-05-04 14:20 UTC (permalink / raw)
To: linux-tip-commits
Cc: Ard Biesheuvel, Ingo Molnar, Arnd Bergmann, David Woodhouse,
Dionna Amalie Glaze, H. Peter Anvin, Kees Cook, Kevin Loughlin,
Len Brown, Linus Torvalds, Rafael J. Wysocki, Tom Lendacky,
linux-efi, x86, linux-kernel
The following commit has been merged into the x86/boot branch of tip:
Commit-ID: f932adcc8650eb169efd0fd19f2a450c260d306c
Gitweb: https://git.kernel.org/tip/f932adcc8650eb169efd0fd19f2a450c260d306c
Author: Ard Biesheuvel <ardb@kernel.org>
AuthorDate: Sun, 04 May 2025 11:52:43 +02:00
Committer: Ingo Molnar <mingo@kernel.org>
CommitterDate: Sun, 04 May 2025 15:59:43 +02:00
x86/linkage: Add SYM_PIC_ALIAS() macro helper to emit symbol aliases
Startup code that may execute from the early 1:1 mapping of memory will
be confined into its own address space, and only be permitted to access
ordinary kernel symbols if this is known to be safe.
Introduce a macro helper SYM_PIC_ALIAS() that emits a __pi_ prefixed
alias for a symbol, which allows startup code to access it.
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: David Woodhouse <dwmw@amazon.co.uk>
Cc: Dionna Amalie Glaze <dionnaglaze@google.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Kevin Loughlin <kevinloughlin@google.com>
Cc: Len Brown <len.brown@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: linux-efi@vger.kernel.org
Link: https://lore.kernel.org/r/20250504095230.2932860-38-ardb+git@google.com
---
arch/x86/include/asm/linkage.h | 10 ++++++++++
1 file changed, 10 insertions(+)
diff --git a/arch/x86/include/asm/linkage.h b/arch/x86/include/asm/linkage.h
index b51d8a4..9d38ae7 100644
--- a/arch/x86/include/asm/linkage.h
+++ b/arch/x86/include/asm/linkage.h
@@ -141,5 +141,15 @@
#define SYM_FUNC_START_WEAK_NOALIGN(name) \
SYM_START(name, SYM_L_WEAK, SYM_A_NONE)
+/*
+ * Expose 'sym' to the startup code in arch/x86/boot/startup/, by emitting an
+ * alias prefixed with __pi_
+ */
+#ifdef __ASSEMBLER__
+#define SYM_PIC_ALIAS(sym) SYM_ALIAS(__pi_ ## sym, sym, SYM_L_GLOBAL)
+#else
+#define SYM_PIC_ALIAS(sym) extern typeof(sym) __PASTE(__pi_, sym) __alias(sym)
+#endif
+
#endif /* _ASM_X86_LINKAGE_H */
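For reference, the two expansions for typical users, derived from the
definition above (sme_me_mask is a C object, phys_base an assembly
symbol; both gain aliases in a follow-up patch):

/* C, e.g. in arch/x86/mm/mem_encrypt_amd.c: */
u64 sme_me_mask;
SYM_PIC_ALIAS(sme_me_mask);
/* expands to: extern typeof(sme_me_mask) __pi_sme_me_mask __alias(sme_me_mask); */

/* assembly, e.g. in arch/x86/kernel/head_64.S: */
SYM_PIC_ALIAS(phys_base)
/* expands to: SYM_ALIAS(__pi_phys_base, phys_base, SYM_L_GLOBAL) */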
^ permalink raw reply related [flat|nested] 59+ messages in thread
* [tip: x86/boot] x86/sev: Move instruction decoder into separate source file
2025-05-04 9:52 ` [RFT PATCH v2 05/23] x86/sev: Move instruction decoder into separate source file Ard Biesheuvel
@ 2025-05-04 14:20 ` tip-bot2 for Ard Biesheuvel
2025-05-05 14:48 ` [RFT PATCH v2 05/23] " Tom Lendacky
2025-05-07 9:58 ` Borislav Petkov
2 siblings, 0 replies; 59+ messages in thread
From: tip-bot2 for Ard Biesheuvel @ 2025-05-04 14:20 UTC (permalink / raw)
To: linux-tip-commits
Cc: Ard Biesheuvel, Ingo Molnar, Arnd Bergmann, David Woodhouse,
Dionna Amalie Glaze, H. Peter Anvin, Kees Cook, Kevin Loughlin,
Len Brown, Linus Torvalds, Rafael J. Wysocki, Tom Lendacky,
linux-efi, x86, linux-kernel
The following commit has been merged into the x86/boot branch of tip:
Commit-ID: ae862964cbc562582576da91b0b92742a574c437
Gitweb: https://git.kernel.org/tip/ae862964cbc562582576da91b0b92742a574c437
Author: Ard Biesheuvel <ardb@kernel.org>
AuthorDate: Sun, 04 May 2025 11:52:35 +02:00
Committer: Ingo Molnar <mingo@kernel.org>
CommitterDate: Sun, 04 May 2025 15:53:06 +02:00
x86/sev: Move instruction decoder into separate source file
As a first step towards disentangling the SEV #VC handling code, which
is shared between the decompressor and the core kernel, from the SEV
startup code, move the decompressor's copy of the instruction decoder
into a separate source file.
Code movement only - no functional change intended.
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: David Woodhouse <dwmw@amazon.co.uk>
Cc: Dionna Amalie Glaze <dionnaglaze@google.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Kevin Loughlin <kevinloughlin@google.com>
Cc: Len Brown <len.brown@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: linux-efi@vger.kernel.org
Link: https://lore.kernel.org/r/20250504095230.2932860-30-ardb+git@google.com
---
arch/x86/boot/compressed/Makefile | 6 +--
arch/x86/boot/compressed/misc.h | 7 +++-
arch/x86/boot/compressed/sev-handle-vc.c | 51 +++++++++++++++++++++++-
arch/x86/boot/compressed/sev.c | 39 +------------------
4 files changed, 62 insertions(+), 41 deletions(-)
create mode 100644 arch/x86/boot/compressed/sev-handle-vc.c
diff --git a/arch/x86/boot/compressed/Makefile b/arch/x86/boot/compressed/Makefile
index 0fcad7b..f4f7b22 100644
--- a/arch/x86/boot/compressed/Makefile
+++ b/arch/x86/boot/compressed/Makefile
@@ -44,10 +44,10 @@ KBUILD_CFLAGS += -D__DISABLE_EXPORTS
KBUILD_CFLAGS += $(call cc-option,-Wa$(comma)-mrelax-relocations=no)
KBUILD_CFLAGS += -include $(srctree)/include/linux/hidden.h
-# sev.c indirectly includes inat-table.h which is generated during
+# sev-handle-vc.c indirectly includes inat-table.c which is generated during
# compilation and stored in $(objtree). Add the directory to the includes so
# that the compiler finds it even with out-of-tree builds (make O=/some/path).
-CFLAGS_sev.o += -I$(objtree)/arch/x86/lib/
+CFLAGS_sev-handle-vc.o += -I$(objtree)/arch/x86/lib/
KBUILD_AFLAGS := $(KBUILD_CFLAGS) -D__ASSEMBLY__
@@ -96,7 +96,7 @@ ifdef CONFIG_X86_64
vmlinux-objs-y += $(obj)/idt_64.o $(obj)/idt_handlers_64.o
vmlinux-objs-$(CONFIG_AMD_MEM_ENCRYPT) += $(obj)/mem_encrypt.o
vmlinux-objs-y += $(obj)/pgtable_64.o
- vmlinux-objs-$(CONFIG_AMD_MEM_ENCRYPT) += $(obj)/sev.o
+ vmlinux-objs-$(CONFIG_AMD_MEM_ENCRYPT) += $(obj)/sev.o $(obj)/sev-handle-vc.o
endif
vmlinux-objs-$(CONFIG_ACPI) += $(obj)/acpi.o
diff --git a/arch/x86/boot/compressed/misc.h b/arch/x86/boot/compressed/misc.h
index dd8d1a8..04de48c 100644
--- a/arch/x86/boot/compressed/misc.h
+++ b/arch/x86/boot/compressed/misc.h
@@ -136,6 +136,9 @@ static inline void console_init(void)
#endif
#ifdef CONFIG_AMD_MEM_ENCRYPT
+struct es_em_ctxt;
+struct insn;
+
void sev_enable(struct boot_params *bp);
void snp_check_features(void);
void sev_es_shutdown_ghcb(void);
@@ -143,6 +146,10 @@ extern bool sev_es_check_ghcb_fault(unsigned long address);
void snp_set_page_private(unsigned long paddr);
void snp_set_page_shared(unsigned long paddr);
void sev_prep_identity_maps(unsigned long top_level_pgt);
+
+enum es_result vc_decode_insn(struct es_em_ctxt *ctxt);
+bool insn_has_rep_prefix(struct insn *insn);
+void sev_insn_decode_init(void);
#else
static inline void sev_enable(struct boot_params *bp)
{
diff --git a/arch/x86/boot/compressed/sev-handle-vc.c b/arch/x86/boot/compressed/sev-handle-vc.c
new file mode 100644
index 0000000..b1aa073
--- /dev/null
+++ b/arch/x86/boot/compressed/sev-handle-vc.c
@@ -0,0 +1,51 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include "misc.h"
+
+#include <linux/kernel.h>
+#include <linux/string.h>
+#include <asm/insn.h>
+#include <asm/pgtable_types.h>
+#include <asm/ptrace.h>
+#include <asm/sev.h>
+
+#define __BOOT_COMPRESSED
+
+/* Basic instruction decoding support needed */
+#include "../../lib/inat.c"
+#include "../../lib/insn.c"
+
+/*
+ * Copy a version of this function here - insn-eval.c can't be used in
+ * pre-decompression code.
+ */
+bool insn_has_rep_prefix(struct insn *insn)
+{
+ insn_byte_t p;
+ int i;
+
+ insn_get_prefixes(insn);
+
+ for_each_insn_prefix(insn, i, p) {
+ if (p == 0xf2 || p == 0xf3)
+ return true;
+ }
+
+ return false;
+}
+
+enum es_result vc_decode_insn(struct es_em_ctxt *ctxt)
+{
+ char buffer[MAX_INSN_SIZE];
+ int ret;
+
+ memcpy(buffer, (unsigned char *)ctxt->regs->ip, MAX_INSN_SIZE);
+
+ ret = insn_decode(&ctxt->insn, buffer, MAX_INSN_SIZE, INSN_MODE_64);
+ if (ret < 0)
+ return ES_DECODE_FAILED;
+
+ return ES_OK;
+}
+
+extern void sev_insn_decode_init(void) __alias(inat_init_tables);
diff --git a/arch/x86/boot/compressed/sev.c b/arch/x86/boot/compressed/sev.c
index f4b7f17..6fe7e85 100644
--- a/arch/x86/boot/compressed/sev.c
+++ b/arch/x86/boot/compressed/sev.c
@@ -30,25 +30,6 @@ static struct ghcb boot_ghcb_page __aligned(PAGE_SIZE);
struct ghcb *boot_ghcb;
/*
- * Copy a version of this function here - insn-eval.c can't be used in
- * pre-decompression code.
- */
-static bool insn_has_rep_prefix(struct insn *insn)
-{
- insn_byte_t p;
- int i;
-
- insn_get_prefixes(insn);
-
- for_each_insn_prefix(insn, i, p) {
- if (p == 0xf2 || p == 0xf3)
- return true;
- }
-
- return false;
-}
-
-/*
* Only a dummy for insn_get_seg_base() - Early boot-code is 64bit only and
* doesn't use segments.
*/
@@ -74,20 +55,6 @@ static inline void sev_es_wr_ghcb_msr(u64 val)
boot_wrmsr(MSR_AMD64_SEV_ES_GHCB, &m);
}
-static enum es_result vc_decode_insn(struct es_em_ctxt *ctxt)
-{
- char buffer[MAX_INSN_SIZE];
- int ret;
-
- memcpy(buffer, (unsigned char *)ctxt->regs->ip, MAX_INSN_SIZE);
-
- ret = insn_decode(&ctxt->insn, buffer, MAX_INSN_SIZE, INSN_MODE_64);
- if (ret < 0)
- return ES_DECODE_FAILED;
-
- return ES_OK;
-}
-
static enum es_result vc_write_mem(struct es_em_ctxt *ctxt,
void *dst, char *buf, size_t size)
{
@@ -122,10 +89,6 @@ static bool fault_in_kernel_space(unsigned long address)
#define __BOOT_COMPRESSED
-/* Basic instruction decoding support needed */
-#include "../../lib/inat.c"
-#include "../../lib/insn.c"
-
extern struct svsm_ca *boot_svsm_caa;
extern u64 boot_svsm_caa_pa;
@@ -230,7 +193,7 @@ static bool early_setup_ghcb(void)
boot_ghcb = &boot_ghcb_page;
/* Initialize lookup tables for the instruction decoder */
- inat_init_tables();
+ sev_insn_decode_init();
/* SNP guest requires the GHCB GPA must be registered */
if (sev_snp_enabled())
^ permalink raw reply related [flat|nested] 59+ messages in thread
* [tip: x86/boot] x86/sev: Make sev_snp_enabled() a static function
2025-05-04 9:52 ` [RFT PATCH v2 04/23] x86/sev: Make sev_snp_enabled() a static function Ard Biesheuvel
@ 2025-05-04 14:20 ` tip-bot2 for Ard Biesheuvel
0 siblings, 0 replies; 59+ messages in thread
From: tip-bot2 for Ard Biesheuvel @ 2025-05-04 14:20 UTC (permalink / raw)
To: linux-tip-commits
Cc: Ard Biesheuvel, Ingo Molnar, Arnd Bergmann, David Woodhouse,
Dionna Amalie Glaze, H. Peter Anvin, Kees Cook, Kevin Loughlin,
Len Brown, Linus Torvalds, Rafael J. Wysocki, Tom Lendacky,
linux-efi, x86, linux-kernel
The following commit has been merged into the x86/boot branch of tip:
Commit-ID: fae89bbfdd9d692e3cb6eace89f4994177731b8c
Gitweb: https://git.kernel.org/tip/fae89bbfdd9d692e3cb6eace89f4994177731b8c
Author: Ard Biesheuvel <ardb@kernel.org>
AuthorDate: Sun, 04 May 2025 11:52:34 +02:00
Committer: Ingo Molnar <mingo@kernel.org>
CommitterDate: Sun, 04 May 2025 15:53:06 +02:00
x86/sev: Make sev_snp_enabled() a static function
sev_snp_enabled() is no longer used outside of the source file that
defines it, so make it static and drop the extern declarations.
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: David Woodhouse <dwmw@amazon.co.uk>
Cc: Dionna Amalie Glaze <dionnaglaze@google.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Kevin Loughlin <kevinloughlin@google.com>
Cc: Len Brown <len.brown@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: linux-efi@vger.kernel.org
Link: https://lore.kernel.org/r/20250504095230.2932860-29-ardb+git@google.com
---
arch/x86/boot/compressed/sev.c | 2 +-
arch/x86/boot/compressed/sev.h | 2 --
2 files changed, 1 insertion(+), 3 deletions(-)
diff --git a/arch/x86/boot/compressed/sev.c b/arch/x86/boot/compressed/sev.c
index 2ccad0a..f4b7f17 100644
--- a/arch/x86/boot/compressed/sev.c
+++ b/arch/x86/boot/compressed/sev.c
@@ -164,7 +164,7 @@ int svsm_perform_call_protocol(struct svsm_call *call)
return ret;
}
-bool sev_snp_enabled(void)
+static bool sev_snp_enabled(void)
{
return sev_status & MSR_AMD64_SEV_SNP_ENABLED;
}
diff --git a/arch/x86/boot/compressed/sev.h b/arch/x86/boot/compressed/sev.h
index d390038..e87af54 100644
--- a/arch/x86/boot/compressed/sev.h
+++ b/arch/x86/boot/compressed/sev.h
@@ -10,14 +10,12 @@
#ifdef CONFIG_AMD_MEM_ENCRYPT
-bool sev_snp_enabled(void);
void snp_accept_memory(phys_addr_t start, phys_addr_t end);
u64 sev_get_status(void);
bool early_is_sevsnp_guest(void);
#else
-static inline bool sev_snp_enabled(void) { return false; }
static inline void snp_accept_memory(phys_addr_t start, phys_addr_t end) { }
static inline u64 sev_get_status(void) { return 0; }
static inline bool early_is_sevsnp_guest(void) { return false; }
^ permalink raw reply related [flat|nested] 59+ messages in thread
* [tip: x86/boot] x86/boot: Disregard __supported_pte_mask in __startup_64()
2025-05-04 9:52 ` [RFT PATCH v2 02/23] x86/boot: Disregard __supported_pte_mask in __startup_64() Ard Biesheuvel
@ 2025-05-04 14:20 ` tip-bot2 for Ard Biesheuvel
0 siblings, 0 replies; 59+ messages in thread
From: tip-bot2 for Ard Biesheuvel @ 2025-05-04 14:20 UTC (permalink / raw)
To: linux-tip-commits
Cc: Ard Biesheuvel, Ingo Molnar, Arnd Bergmann, David Woodhouse,
Dionna Amalie Glaze, H. Peter Anvin, Kees Cook, Kevin Loughlin,
Len Brown, Linus Torvalds, Rafael J. Wysocki, Tom Lendacky,
linux-efi, x86, linux-kernel
The following commit has been merged into the x86/boot branch of tip:
Commit-ID: b3464a36f7f2499d517e8334e07ddd6eefcd67c1
Gitweb: https://git.kernel.org/tip/b3464a36f7f2499d517e8334e07ddd6eefcd67c1
Author: Ard Biesheuvel <ardb@kernel.org>
AuthorDate: Sun, 04 May 2025 11:52:32 +02:00
Committer: Ingo Molnar <mingo@kernel.org>
CommitterDate: Sun, 04 May 2025 15:27:23 +02:00
x86/boot: Disregard __supported_pte_mask in __startup_64()
__supported_pte_mask is statically initialized to U64_MAX and never
assigned until long after the startup code that creates the initial
page tables has run. So applying the mask is unnecessary, and can be
avoided.
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: David Woodhouse <dwmw@amazon.co.uk>
Cc: Dionna Amalie Glaze <dionnaglaze@google.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Kevin Loughlin <kevinloughlin@google.com>
Cc: Len Brown <len.brown@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: linux-efi@vger.kernel.org
Link: https://lore.kernel.org/r/20250504095230.2932860-27-ardb+git@google.com
---
arch/x86/boot/startup/map_kernel.c | 2 --
1 file changed, 2 deletions(-)
diff --git a/arch/x86/boot/startup/map_kernel.c b/arch/x86/boot/startup/map_kernel.c
index 0eac3f1..099ae25 100644
--- a/arch/x86/boot/startup/map_kernel.c
+++ b/arch/x86/boot/startup/map_kernel.c
@@ -179,8 +179,6 @@ unsigned long __head __startup_64(unsigned long p2v_offset,
pud[(i + 1) % PTRS_PER_PUD] = (pudval_t)pmd + pgtable_flags;
pmd_entry = __PAGE_KERNEL_LARGE_EXEC & ~_PAGE_GLOBAL;
- /* Filter out unsupported __PAGE_KERNEL_* bits: */
- pmd_entry &= __supported_pte_mask;
pmd_entry += sme_get_me_mask();
pmd_entry += physaddr;
^ permalink raw reply related [flat|nested] 59+ messages in thread
* [tip: x86/boot] x86/boot: Move early_setup_gdt() back into head64.c
2025-05-04 9:52 ` [RFT PATCH v2 01/23] x86/boot: Move early_setup_gdt() back into head64.c Ard Biesheuvel
@ 2025-05-04 14:20 ` tip-bot2 for Ard Biesheuvel
0 siblings, 0 replies; 59+ messages in thread
From: tip-bot2 for Ard Biesheuvel @ 2025-05-04 14:20 UTC (permalink / raw)
To: linux-tip-commits
Cc: Ard Biesheuvel, Ingo Molnar, Arnd Bergmann, David Woodhouse,
Dionna Amalie Glaze, H. Peter Anvin, Kees Cook, Kevin Loughlin,
Len Brown, Linus Torvalds, Rafael J. Wysocki, Tom Lendacky,
linux-efi, x86, linux-kernel
The following commit has been merged into the x86/boot branch of tip:
Commit-ID: bd4a58beaaf1f4aff025282c6e8b130bdb4a29e4
Gitweb: https://git.kernel.org/tip/bd4a58beaaf1f4aff025282c6e8b130bdb4a29e4
Author: Ard Biesheuvel <ardb@kernel.org>
AuthorDate: Sun, 04 May 2025 11:52:31 +02:00
Committer: Ingo Molnar <mingo@kernel.org>
CommitterDate: Sun, 04 May 2025 15:27:23 +02:00
x86/boot: Move early_setup_gdt() back into head64.c
Move early_setup_gdt() out of the startup code that is callable from the
1:1 mapping - this is not needed, and instead, it is better to expose
the helper that does reside in __head directly.
This reduces the amount of code that needs special checks for 1:1
execution suitability. In particular, it avoids dealing with the GHCB
page (and its physical address) in startup code, which runs from the
1:1 mapping, making physical to virtual translations ambiguous.
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: David Woodhouse <dwmw@amazon.co.uk>
Cc: Dionna Amalie Glaze <dionnaglaze@google.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Kevin Loughlin <kevinloughlin@google.com>
Cc: Len Brown <len.brown@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: linux-efi@vger.kernel.org
Link: https://lore.kernel.org/r/20250504095230.2932860-26-ardb+git@google.com
---
arch/x86/boot/startup/gdt_idt.c | 15 +--------------
arch/x86/include/asm/setup.h | 1 +
arch/x86/kernel/head64.c | 12 ++++++++++++
3 files changed, 14 insertions(+), 14 deletions(-)
diff --git a/arch/x86/boot/startup/gdt_idt.c b/arch/x86/boot/startup/gdt_idt.c
index 7e34d0b..a3112a6 100644
--- a/arch/x86/boot/startup/gdt_idt.c
+++ b/arch/x86/boot/startup/gdt_idt.c
@@ -24,7 +24,7 @@
static gate_desc bringup_idt_table[NUM_EXCEPTION_VECTORS] __page_aligned_data;
/* This may run while still in the direct mapping */
-static void __head startup_64_load_idt(void *vc_handler)
+void __head startup_64_load_idt(void *vc_handler)
{
struct desc_ptr desc = {
.address = (unsigned long)rip_rel_ptr(bringup_idt_table),
@@ -43,19 +43,6 @@ static void __head startup_64_load_idt(void *vc_handler)
native_load_idt(&desc);
}
-/* This is used when running on kernel addresses */
-void early_setup_idt(void)
-{
- void *handler = NULL;
-
- if (IS_ENABLED(CONFIG_AMD_MEM_ENCRYPT)) {
- setup_ghcb();
- handler = vc_boot_ghcb;
- }
-
- startup_64_load_idt(handler);
-}
-
/*
* Setup boot CPU state needed before kernel switches to virtual addresses.
*/
diff --git a/arch/x86/include/asm/setup.h b/arch/x86/include/asm/setup.h
index ad9212d..6324f4c 100644
--- a/arch/x86/include/asm/setup.h
+++ b/arch/x86/include/asm/setup.h
@@ -52,6 +52,7 @@ extern void reserve_standard_io_resources(void);
extern void i386_reserve_resources(void);
extern unsigned long __startup_64(unsigned long p2v_offset, struct boot_params *bp);
extern void startup_64_setup_gdt_idt(void);
+extern void startup_64_load_idt(void *vc_handler);
extern void early_setup_idt(void);
extern void __init do_early_exception(struct pt_regs *regs, int trapnr);
diff --git a/arch/x86/kernel/head64.c b/arch/x86/kernel/head64.c
index 6b68a20..29226f3 100644
--- a/arch/x86/kernel/head64.c
+++ b/arch/x86/kernel/head64.c
@@ -303,3 +303,15 @@ void __init __noreturn x86_64_start_reservations(char *real_mode_data)
start_kernel();
}
+
+void early_setup_idt(void)
+{
+ void *handler = NULL;
+
+ if (IS_ENABLED(CONFIG_AMD_MEM_ENCRYPT)) {
+ setup_ghcb();
+ handler = vc_boot_ghcb;
+ }
+
+ startup_64_load_idt(handler);
+}
^ permalink raw reply related [flat|nested] 59+ messages in thread
* Re: [RFT PATCH v2 03/23] x86/boot: Drop global variables keeping track of LA57 state
2025-05-04 13:50 ` Ingo Molnar
@ 2025-05-04 14:46 ` Ard Biesheuvel
2025-05-04 14:58 ` Linus Torvalds
1 sibling, 0 replies; 59+ messages in thread
From: Ard Biesheuvel @ 2025-05-04 14:46 UTC (permalink / raw)
To: Ingo Molnar
Cc: Ard Biesheuvel, linux-kernel, linux-efi, x86, Borislav Petkov,
Dionna Amalie Glaze, Kevin Loughlin, Tom Lendacky, Linus Torvalds
On Sun, 4 May 2025 at 15:51, Ingo Molnar <mingo@kernel.org> wrote:
>
>
> * Ard Biesheuvel <ardb+git@google.com> wrote:
>
> > From: Ard Biesheuvel <ardb@kernel.org>
> >
> > On x86_64, the core kernel is entered in long mode, which implies that
> > paging is enabled. This means that the CR4.LA57 control bit is
> > guaranteed to be in sync with the number of paging levels used by the
> > kernel, and there is no need to store this in a variable.
> >
> > There is also no need to use variables for storing the calculations of
> > pgdir_shift and ptrs_per_p4d, as they are easily determined on the fly.
> >
> > This removes the need for two different sources of truth (i.e., early
> > and late) for determining whether 5-level paging is in use: CR4.LA57
> > always reflects the actual state, and never changes from the point of
> > view of the 64-bit core kernel. It also removes the need for exposing
> > the associated variables to the startup code. The only potential concern
> > is the cost of CR4 accesses, which can be mitigated using alternatives
> > patching based on feature detection.
> >
> > Note that even the decompressor does not manipulate any page tables
> > before updating CR4.LA57, so it can also avoid the associated global
> > variables entirely. However, as it does not implement alternatives
> > patching, the associated ELF sections need to be discarded.
> >
> > Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
> > ---
> > arch/x86/boot/compressed/misc.h | 4 --
> > arch/x86/boot/compressed/pgtable_64.c | 12 ------
> > arch/x86/boot/compressed/vmlinux.lds.S | 1 +
> > arch/x86/boot/startup/map_kernel.c | 12 +-----
> > arch/x86/boot/startup/sme.c | 9 ----
> > arch/x86/include/asm/pgtable_64_types.h | 43 ++++++++++----------
> > arch/x86/kernel/cpu/common.c | 2 -
> > arch/x86/kernel/head64.c | 11 -----
> > arch/x86/mm/kasan_init_64.c | 3 --
> > 9 files changed, 24 insertions(+), 73 deletions(-)
>
> So this patch breaks the build & creates header dependency hell on
> x86-64 allnoconfig:
>
Ugh
...
> Plus I'm not sure I'm happy about this kind of complexity getting
> embedded deep within low level MM primitives:
>
> static __always_inline __pure bool pgtable_l5_enabled(void)
> {
> unsigned long r;
> bool ret;
>
> if (!IS_ENABLED(CONFIG_X86_5LEVEL))
> return false;
>
> asm(ALTERNATIVE_TERNARY(
> "movq %%cr4, %[reg] \n\t btl %[la57], %k[reg]" CC_SET(c),
> %P[feat], "stc", "clc")
> : [reg] "=&r" (r), CC_OUT(c) (ret)
> : [feat] "i" (X86_FEATURE_LA57),
> [la57] "i" (X86_CR4_LA57_BIT)
> : "cc");
>
> return ret;
> }
>
...
>
> Inlined approximately a gazillion times. (449 times on x86 defconfig.
> Yes, I just counted it.)
>
> And it's not even worth it, as it generates horrendous code:
>
> 154: 0f 20 e0 mov %cr4,%rax
> 157: 0f ba e0 0c bt $0xc,%eax
>
> ... while CR4 access might be faster these days, it's certainly not as
> fast as simple percpu access. Plus it clobbers a register (RAX in the
> example above), which is unnecessary for a flag test.
>
It's an alternative, so this will be patched into stc or clc. But it
will still clobber a register.
> Cannot pgtable_l5_enabled() be a single, simple percpu flag or so?
>
We can just drop this patch, and I'll work around it by adding another
couple of SYM_PIC_ALIAS()es for these variables.
> And yes, this creates another layer for these values - but thus
> decouples low level MM from detection & implementation complexities,
> which is a plus ...
>
If you prefer to retain the early vs late distinction, where the late
one is more efficient, we could replace the early variant with the CR4
access. But frankly, if we go down that road, I'd prefer to just share
these early variables with the startup code.
^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [RFT PATCH v2 00/23] x86: strict separation of startup code
2025-05-04 14:04 ` [RFT PATCH v2 00/23] x86: strict separation of startup code Ingo Molnar
@ 2025-05-04 14:55 ` Ard Biesheuvel
2025-05-05 5:08 ` Ingo Molnar
0 siblings, 1 reply; 59+ messages in thread
From: Ard Biesheuvel @ 2025-05-04 14:55 UTC (permalink / raw)
To: Ingo Molnar
Cc: Ard Biesheuvel, linux-kernel, linux-efi, x86, Borislav Petkov,
Dionna Amalie Glaze, Kevin Loughlin, Tom Lendacky
On Sun, 4 May 2025 at 16:04, Ingo Molnar <mingo@kernel.org> wrote:
>
>
...
>
> So to move this forward I applied the following 7 patches to
> tip:x86/boot:
>
> x86/boot: Move early_setup_gdt() back into head64.c
> x86/boot: Disregard __supported_pte_mask in __startup_64()
> x86/sev: Make sev_snp_enabled() a static function
> x86/sev: Move instruction decoder into separate source file
> x86/linkage: Add SYM_PIC_ALIAS() macro helper to emit symbol aliases
> x86/boot: Add a bunch of PIC aliases
> x86/boot: Provide __pti_set_user_pgtbl() to startup code
>
> Which I believe are independent of SEV testing.
>
Excellent.
> I also merged in pending upstream fixes, including:
>
> 8ed12ab1319b ("x86/boot/sev: Support memory acceptance in the EFI stub under SVSM")
>
> Which should make tip:x86/boot a good base for your series going
> forward?
>
Yes, that helps a lot, thanks.
Please also consider the patch
x86/sev: Disentangle #VC handling code from startup code
11 files changed, 1694 insertions(+), 1643 deletions(-)
It just moves code around, but it is rather large and is likely to
cause merge conflicts if it lives out of tree for too long. The +/-
delta is mostly down to the fact that a new file vc-handle.c is added
which duplicates most of the #includes of the file that it was split
off from.
^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [RFT PATCH v2 03/23] x86/boot: Drop global variables keeping track of LA57 state
2025-05-04 13:50 ` Ingo Molnar
2025-05-04 14:46 ` Ard Biesheuvel
@ 2025-05-04 14:58 ` Linus Torvalds
2025-05-04 19:33 ` Ard Biesheuvel
2025-05-05 21:07 ` Ingo Molnar
1 sibling, 2 replies; 59+ messages in thread
From: Linus Torvalds @ 2025-05-04 14:58 UTC (permalink / raw)
To: Ingo Molnar
Cc: Ard Biesheuvel, linux-kernel, linux-efi, x86, Ard Biesheuvel,
Borislav Petkov, Dionna Amalie Glaze, Kevin Loughlin,
Tom Lendacky
On Sun, 4 May 2025 at 06:51, Ingo Molnar <mingo@kernel.org> wrote:
>
> Cannot pgtable_l5_enabled() be a single, simple percpu flag or so?
Isn't this logically something that should just be a static key? For
some reason I thought we did that already, but looking around, that
was only in the critical user access routines (and got replaced by the
'runtime-const' stuff)
But I guess that what Ard wants to get rid of is the variable itself,
and for early boot using static keys sounds like a bad idea.
Honestly, looking at this, I think we should fix the *users* of
pgtable_l5_enabled().
Because I think there are two distinct classes:
- the stuff in <asm/pgtable.h> is exposed as the generic page table
accessor macros to "real" code, and should probably use a static key
(ie things like pgd_clear() / pgd_none() and friends)
- but in code like __kernel_physical_mapping_init() feels to me like
using the value in %cr4 actually makes sense
but that looks like a much bigger and fundamental patch than Ard's.
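For the first class, the shape would be something like this minimal
sketch (key name hypothetical, untested):

DEFINE_STATIC_KEY_FALSE(pgtable_l5_key);

static __always_inline bool pgtable_l5_enabled(void)
{
	/* jump label: patched to a NOP or JMP at boot, nothing clobbered */
	return static_branch_likely(&pgtable_l5_key);
}

with static_branch_enable(&pgtable_l5_key) called once the paging depth
is known - which is exactly where the early-boot ordering problem bites.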
Linus
^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [RFT PATCH v2 03/23] x86/boot: Drop global variables keeping track of LA57 state
2025-05-04 14:58 ` Linus Torvalds
@ 2025-05-04 19:33 ` Ard Biesheuvel
2025-05-05 21:07 ` Ingo Molnar
1 sibling, 0 replies; 59+ messages in thread
From: Ard Biesheuvel @ 2025-05-04 19:33 UTC (permalink / raw)
To: Linus Torvalds
Cc: Ingo Molnar, Ard Biesheuvel, linux-kernel, linux-efi, x86,
Borislav Petkov, Dionna Amalie Glaze, Kevin Loughlin,
Tom Lendacky
On Sun, 4 May 2025 at 16:58, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Sun, 4 May 2025 at 06:51, Ingo Molnar <mingo@kernel.org> wrote:
> >
> > Cannot pgtable_l5_enabled() be a single, simple percpu flag or so?
>
> Isn't this logically something that should just be a static key? For
> some reason I thought we did that already, but looking around, that
> was only in the critical user access routines (and got replaced by the
> 'runtime-const' stuff)
>
The ordinary, late version of pgtable_l5_enabled() is #define'd to
cpu_feature_enabled(), which is runtime patched in a slightly more
efficient manner than the alternative I'm proposing here. It is
conceptually the same as a static key, i.e., the asm goto() gets
patched into a JMP or a NOP.
Whether or not using runtime constants is worth the hassle for
pgdir_shift and ptrs_per_p4d is a different question; I'll keep them
as ordinary globals for now, and drop the conditional expressions.
> But I guess that what Ard wants to get rid of is the variable itself,
> and for early boot using static keys sounds like a bad idea.
>
I wanted to get rid of the variable, and then noticed the nasty
'#define USE_EARLY_PGTABLE_L5' thing, so I attempted to fix that as
well. (Having to keep track of which code may be called early, late or
both is how I ended up in this swamp to begin with.)
> Honestly, looking at this, I think we should fix the *users* of
> pgtable_l5_enabled().
>
> Because I think there are two distinct classes:
>
> - the stuff in <asm/pgtable.h> is exposed as the generic page table
> accessor macros to "real" code, and should probably use a static key
> (ie things like pgd_clear() / pgd_none() and friends)
>
> - but in code like __kernel_physical_mapping_init() feels to me like
> using the value in %cr4 actually makes sense
>
> but that looks like a much bigger and fundamental patch than Ard's.
>
I think we should just rely on cpu_feature_enabled(), which can in
fact be called early as long as the capability gets set. So instead of
setting __pgtable_l5_enabled, we should just call
'set_cpu_cap(&boot_cpu_data, X86_FEATURE_LA57)' so that even early
callers can use the ordinary pgtable_l5_enabled(), and we can drop the
early flavor. (The decompressor will need its own version but it can
just inspect CR4.LA57 directly)
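Concretely, something along these lines (untested sketch):

/* early boot, as soon as the paging depth is known: */
if (native_read_cr4() & X86_CR4_LA57)
	set_cpu_cap(&boot_cpu_data, X86_FEATURE_LA57);

/* one definition then serves early and late callers alike: */
#define pgtable_l5_enabled() cpu_feature_enabled(X86_FEATURE_LA57)

/* decompressor-only variant, reading CR4 directly: */
static inline bool l5_enabled(void)
{
	return !!(native_read_cr4() & X86_CR4_LA57);
}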
^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [RFT PATCH v2 00/23] x86: strict separation of startup code
2025-05-04 14:55 ` Ard Biesheuvel
@ 2025-05-05 5:08 ` Ingo Molnar
0 siblings, 0 replies; 59+ messages in thread
From: Ingo Molnar @ 2025-05-05 5:08 UTC (permalink / raw)
To: Ard Biesheuvel
Cc: Ard Biesheuvel, linux-kernel, linux-efi, x86, Borislav Petkov,
Dionna Amalie Glaze, Kevin Loughlin, Tom Lendacky
* Ard Biesheuvel <ardb@kernel.org> wrote:
> On Sun, 4 May 2025 at 16:04, Ingo Molnar <mingo@kernel.org> wrote:
> >
> >
> ...
> >
> > So to move this forward I applied the following 7 patches to
> > tip:x86/boot:
> >
> > x86/boot: Move early_setup_gdt() back into head64.c
> > x86/boot: Disregard __supported_pte_mask in __startup_64()
> > x86/sev: Make sev_snp_enabled() a static function
> > x86/sev: Move instruction decoder into separate source file
> > x86/linkage: Add SYM_PIC_ALIAS() macro helper to emit symbol aliases
> > x86/boot: Add a bunch of PIC aliases
> > x86/boot: Provide __pti_set_user_pgtbl() to startup code
> >
> > Which I believe are independent of SEV testing.
> >
>
> Excellent.
>
> > I also merged in pending upstream fixes, including:
> >
> > 8ed12ab1319b ("x86/boot/sev: Support memory acceptance in the EFI stub under SVSM")
> >
> > Which should make tip:x86/boot a good base for your series going
> > forward?
> >
>
> Yes, that helps a lot, thanks.
>
> Please also consider the patch
>
> x86/sev: Disentangle #VC handling code from startup code
> 11 files changed, 1694 insertions(+), 1643 deletions(-)
>
> It just moves code around, but it is rather large and is likely to
> cause merge conflicts if it lives out of tree for too long. The +/-
> delta is mostly down to the fact that a new file vc-handle.c is added
> which duplicates most of the #includes of the file that it was split
> off from.
Sure, applied, will push it out after testing. I almost applied it
yesterday, for these exact reasons.
Thanks,
Ingo
^ permalink raw reply [flat|nested] 59+ messages in thread
* [tip: x86/boot] x86/sev: Disentangle #VC handling code from startup code
2025-05-04 9:52 ` [RFT PATCH v2 06/23] x86/sev: Disentangle #VC handling code from startup code Ard Biesheuvel
@ 2025-05-05 5:31 ` tip-bot2 for Ard Biesheuvel
2025-05-05 14:58 ` [RFT PATCH v2 06/23] " Tom Lendacky
2025-05-05 16:54 ` [tip: x86/boot] " tip-bot2 for Ard Biesheuvel
2 siblings, 0 replies; 59+ messages in thread
From: tip-bot2 for Ard Biesheuvel @ 2025-05-05 5:31 UTC (permalink / raw)
To: linux-tip-commits
Cc: Ard Biesheuvel, Ingo Molnar, Arnd Bergmann, David Woodhouse,
Dionna Amalie Glaze, H. Peter Anvin, Kees Cook, Kevin Loughlin,
Len Brown, Linus Torvalds, Rafael J. Wysocki, Tom Lendacky,
linux-efi, x86, linux-kernel
The following commit has been merged into the x86/boot branch of tip:
Commit-ID: ed4d95d033e359f9445e85bf5a768a5859a5830b
Gitweb: https://git.kernel.org/tip/ed4d95d033e359f9445e85bf5a768a5859a5830b
Author: Ard Biesheuvel <ardb@kernel.org>
AuthorDate: Sun, 04 May 2025 11:52:36 +02:00
Committer: Ingo Molnar <mingo@kernel.org>
CommitterDate: Mon, 05 May 2025 07:07:29 +02:00
x86/sev: Disentangle #VC handling code from startup code
Most of the SEV support code used to reside in a single C source file
that was included in two places: the core kernel, and the decompressor.
The code that is actually shared with the decompressor was moved into a
separate, shared source file under startup/, on the basis that the
decompressor also executes from the early 1:1 mapping of memory.
However, while the elaborate #VC handling and instruction decoding that
it involves is also performed by the decompressor, it does not actually
occur in the core kernel at early boot, and therefore, does not need to
be part of the confined early startup code.
So split off the #VC handling code and move it back into arch/x86/coco
where it came from, into another C source file that is included from
both the decompressor and the core kernel.
Code movement only - no functional change intended.
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: David Woodhouse <dwmw@amazon.co.uk>
Cc: Dionna Amalie Glaze <dionnaglaze@google.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Kevin Loughlin <kevinloughlin@google.com>
Cc: Len Brown <len.brown@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: linux-efi@vger.kernel.org
Link: https://lore.kernel.org/r/20250504095230.2932860-31-ardb+git@google.com
---
arch/x86/boot/compressed/misc.h | 1 +-
arch/x86/boot/compressed/sev-handle-vc.c | 83 ++-
arch/x86/boot/compressed/sev.c | 96 +--
arch/x86/boot/compressed/sev.h | 19 +-
arch/x86/boot/startup/sev-shared.c | 519 +----------
arch/x86/boot/startup/sev-startup.c | 1022 +--------------------
arch/x86/coco/sev/Makefile | 2 +-
arch/x86/coco/sev/vc-handle.c | 1061 +++++++++++++++++++++-
arch/x86/coco/sev/vc-shared.c | 504 ++++++++++-
arch/x86/include/asm/sev-internal.h | 8 +-
arch/x86/include/asm/sev.h | 22 +-
11 files changed, 1694 insertions(+), 1643 deletions(-)
create mode 100644 arch/x86/coco/sev/vc-handle.c
create mode 100644 arch/x86/coco/sev/vc-shared.c
diff --git a/arch/x86/boot/compressed/misc.h b/arch/x86/boot/compressed/misc.h
index 04de48c..db10486 100644
--- a/arch/x86/boot/compressed/misc.h
+++ b/arch/x86/boot/compressed/misc.h
@@ -150,6 +150,7 @@ void sev_prep_identity_maps(unsigned long top_level_pgt);
enum es_result vc_decode_insn(struct es_em_ctxt *ctxt);
bool insn_has_rep_prefix(struct insn *insn);
void sev_insn_decode_init(void);
+bool early_setup_ghcb(void);
#else
static inline void sev_enable(struct boot_params *bp)
{
diff --git a/arch/x86/boot/compressed/sev-handle-vc.c b/arch/x86/boot/compressed/sev-handle-vc.c
index b1aa073..89dd02d 100644
--- a/arch/x86/boot/compressed/sev-handle-vc.c
+++ b/arch/x86/boot/compressed/sev-handle-vc.c
@@ -1,6 +1,7 @@
// SPDX-License-Identifier: GPL-2.0
#include "misc.h"
+#include "sev.h"
#include <linux/kernel.h>
#include <linux/string.h>
@@ -8,6 +9,9 @@
#include <asm/pgtable_types.h>
#include <asm/ptrace.h>
#include <asm/sev.h>
+#include <asm/trapnr.h>
+#include <asm/trap_pf.h>
+#include <asm/fpu/xcr.h>
#define __BOOT_COMPRESSED
@@ -49,3 +53,82 @@ enum es_result vc_decode_insn(struct es_em_ctxt *ctxt)
}
extern void sev_insn_decode_init(void) __alias(inat_init_tables);
+
+/*
+ * Only a dummy for insn_get_seg_base() - Early boot-code is 64bit only and
+ * doesn't use segments.
+ */
+static unsigned long insn_get_seg_base(struct pt_regs *regs, int seg_reg_idx)
+{
+ return 0UL;
+}
+
+static enum es_result vc_write_mem(struct es_em_ctxt *ctxt,
+ void *dst, char *buf, size_t size)
+{
+ memcpy(dst, buf, size);
+
+ return ES_OK;
+}
+
+static enum es_result vc_read_mem(struct es_em_ctxt *ctxt,
+ void *src, char *buf, size_t size)
+{
+ memcpy(buf, src, size);
+
+ return ES_OK;
+}
+
+static enum es_result vc_ioio_check(struct es_em_ctxt *ctxt, u16 port, size_t size)
+{
+ return ES_OK;
+}
+
+static bool fault_in_kernel_space(unsigned long address)
+{
+ return false;
+}
+
+#define sev_printk(fmt, ...)
+
+#include "../../coco/sev/vc-shared.c"
+
+void do_boot_stage2_vc(struct pt_regs *regs, unsigned long exit_code)
+{
+ struct es_em_ctxt ctxt;
+ enum es_result result;
+
+ if (!boot_ghcb && !early_setup_ghcb())
+ sev_es_terminate(SEV_TERM_SET_GEN, GHCB_SEV_ES_GEN_REQ);
+
+ vc_ghcb_invalidate(boot_ghcb);
+ result = vc_init_em_ctxt(&ctxt, regs, exit_code);
+ if (result != ES_OK)
+ goto finish;
+
+ result = vc_check_opcode_bytes(&ctxt, exit_code);
+ if (result != ES_OK)
+ goto finish;
+
+ switch (exit_code) {
+ case SVM_EXIT_RDTSC:
+ case SVM_EXIT_RDTSCP:
+ result = vc_handle_rdtsc(boot_ghcb, &ctxt, exit_code);
+ break;
+ case SVM_EXIT_IOIO:
+ result = vc_handle_ioio(boot_ghcb, &ctxt);
+ break;
+ case SVM_EXIT_CPUID:
+ result = vc_handle_cpuid(boot_ghcb, &ctxt);
+ break;
+ default:
+ result = ES_UNSUPPORTED;
+ break;
+ }
+
+finish:
+ if (result == ES_OK)
+ vc_finish_insn(&ctxt);
+ else if (result != ES_RETRY)
+ sev_es_terminate(SEV_TERM_SET_GEN, GHCB_SEV_ES_GEN_REQ);
+}
diff --git a/arch/x86/boot/compressed/sev.c b/arch/x86/boot/compressed/sev.c
index 6fe7e85..612b443 100644
--- a/arch/x86/boot/compressed/sev.c
+++ b/arch/x86/boot/compressed/sev.c
@@ -24,63 +24,11 @@
#include <asm/cpuid.h>
#include "error.h"
-#include "../msr.h"
+#include "sev.h"
static struct ghcb boot_ghcb_page __aligned(PAGE_SIZE);
struct ghcb *boot_ghcb;
-/*
- * Only a dummy for insn_get_seg_base() - Early boot-code is 64bit only and
- * doesn't use segments.
- */
-static unsigned long insn_get_seg_base(struct pt_regs *regs, int seg_reg_idx)
-{
- return 0UL;
-}
-
-static inline u64 sev_es_rd_ghcb_msr(void)
-{
- struct msr m;
-
- boot_rdmsr(MSR_AMD64_SEV_ES_GHCB, &m);
-
- return m.q;
-}
-
-static inline void sev_es_wr_ghcb_msr(u64 val)
-{
- struct msr m;
-
- m.q = val;
- boot_wrmsr(MSR_AMD64_SEV_ES_GHCB, &m);
-}
-
-static enum es_result vc_write_mem(struct es_em_ctxt *ctxt,
- void *dst, char *buf, size_t size)
-{
- memcpy(dst, buf, size);
-
- return ES_OK;
-}
-
-static enum es_result vc_read_mem(struct es_em_ctxt *ctxt,
- void *src, char *buf, size_t size)
-{
- memcpy(buf, src, size);
-
- return ES_OK;
-}
-
-static enum es_result vc_ioio_check(struct es_em_ctxt *ctxt, u16 port, size_t size)
-{
- return ES_OK;
-}
-
-static bool fault_in_kernel_space(unsigned long address)
-{
- return false;
-}
-
#undef __init
#define __init
@@ -182,7 +130,7 @@ void snp_set_page_shared(unsigned long paddr)
__page_state_change(paddr, SNP_PAGE_STATE_SHARED);
}
-static bool early_setup_ghcb(void)
+bool early_setup_ghcb(void)
{
if (set_page_decrypted((unsigned long)&boot_ghcb_page))
return false;
@@ -266,46 +214,6 @@ bool sev_es_check_ghcb_fault(unsigned long address)
return ((address & PAGE_MASK) == (unsigned long)&boot_ghcb_page);
}
-void do_boot_stage2_vc(struct pt_regs *regs, unsigned long exit_code)
-{
- struct es_em_ctxt ctxt;
- enum es_result result;
-
- if (!boot_ghcb && !early_setup_ghcb())
- sev_es_terminate(SEV_TERM_SET_GEN, GHCB_SEV_ES_GEN_REQ);
-
- vc_ghcb_invalidate(boot_ghcb);
- result = vc_init_em_ctxt(&ctxt, regs, exit_code);
- if (result != ES_OK)
- goto finish;
-
- result = vc_check_opcode_bytes(&ctxt, exit_code);
- if (result != ES_OK)
- goto finish;
-
- switch (exit_code) {
- case SVM_EXIT_RDTSC:
- case SVM_EXIT_RDTSCP:
- result = vc_handle_rdtsc(boot_ghcb, &ctxt, exit_code);
- break;
- case SVM_EXIT_IOIO:
- result = vc_handle_ioio(boot_ghcb, &ctxt);
- break;
- case SVM_EXIT_CPUID:
- result = vc_handle_cpuid(boot_ghcb, &ctxt);
- break;
- default:
- result = ES_UNSUPPORTED;
- break;
- }
-
-finish:
- if (result == ES_OK)
- vc_finish_insn(&ctxt);
- else if (result != ES_RETRY)
- sev_es_terminate(SEV_TERM_SET_GEN, GHCB_SEV_ES_GEN_REQ);
-}
-
/*
* SNP_FEATURES_IMPL_REQ is the mask of SNP features that will need
* guest side implementation for proper functioning of the guest. If any
diff --git a/arch/x86/boot/compressed/sev.h b/arch/x86/boot/compressed/sev.h
index e87af54..92f79c2 100644
--- a/arch/x86/boot/compressed/sev.h
+++ b/arch/x86/boot/compressed/sev.h
@@ -10,10 +10,29 @@
#ifdef CONFIG_AMD_MEM_ENCRYPT
+#include "../msr.h"
+
void snp_accept_memory(phys_addr_t start, phys_addr_t end);
u64 sev_get_status(void);
bool early_is_sevsnp_guest(void);
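+
+/* Access the GHCB MSR, which holds the GHCB GPA and carries the MSR protocol */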
+static inline u64 sev_es_rd_ghcb_msr(void)
+{
+ struct msr m;
+
+ boot_rdmsr(MSR_AMD64_SEV_ES_GHCB, &m);
+
+ return m.q;
+}
+
+static inline void sev_es_wr_ghcb_msr(u64 val)
+{
+ struct msr m;
+
+ m.q = val;
+ boot_wrmsr(MSR_AMD64_SEV_ES_GHCB, &m);
+}
+
#else
static inline void snp_accept_memory(phys_addr_t start, phys_addr_t end) { }
diff --git a/arch/x86/boot/startup/sev-shared.c b/arch/x86/boot/startup/sev-shared.c
index 173f3d1..7a706db 100644
--- a/arch/x86/boot/startup/sev-shared.c
+++ b/arch/x86/boot/startup/sev-shared.c
@@ -14,13 +14,9 @@
#ifndef __BOOT_COMPRESSED
#define error(v) pr_err(v)
#define has_cpuflag(f) boot_cpu_has(f)
-#define sev_printk(fmt, ...) printk(fmt, ##__VA_ARGS__)
-#define sev_printk_rtl(fmt, ...) printk_ratelimited(fmt, ##__VA_ARGS__)
#else
#undef WARN
#define WARN(condition, format...) (!!(condition))
-#define sev_printk(fmt, ...)
-#define sev_printk_rtl(fmt, ...)
#undef vc_forward_exception
#define vc_forward_exception(c) panic("SNP: Hypervisor requested exception\n")
#endif
@@ -36,16 +32,6 @@
struct svsm_ca *boot_svsm_caa __ro_after_init;
u64 boot_svsm_caa_pa __ro_after_init;
-/* I/O parameters for CPUID-related helpers */
-struct cpuid_leaf {
- u32 fn;
- u32 subfn;
- u32 eax;
- u32 ebx;
- u32 ecx;
- u32 edx;
-};
-
/*
* Since feature negotiation related variables are set early in the boot
* process they must reside in the .data section so as not to be zeroed
@@ -151,33 +137,6 @@ bool sev_es_negotiate_protocol(void)
return true;
}
-static bool vc_decoding_needed(unsigned long exit_code)
-{
- /* Exceptions don't require to decode the instruction */
- return !(exit_code >= SVM_EXIT_EXCP_BASE &&
- exit_code <= SVM_EXIT_LAST_EXCP);
-}
-
-static enum es_result vc_init_em_ctxt(struct es_em_ctxt *ctxt,
- struct pt_regs *regs,
- unsigned long exit_code)
-{
- enum es_result ret = ES_OK;
-
- memset(ctxt, 0, sizeof(*ctxt));
- ctxt->regs = regs;
-
- if (vc_decoding_needed(exit_code))
- ret = vc_decode_insn(ctxt);
-
- return ret;
-}
-
-static void vc_finish_insn(struct es_em_ctxt *ctxt)
-{
- ctxt->regs->ip += ctxt->insn.length;
-}
-
static enum es_result verify_exception_info(struct ghcb *ghcb, struct es_em_ctxt *ctxt)
{
u32 ret;
@@ -627,7 +586,7 @@ snp_cpuid_postprocess(struct ghcb *ghcb, struct es_em_ctxt *ctxt,
* Returns -EOPNOTSUPP if feature not enabled. Any other non-zero return value
* should be treated as fatal by caller.
*/
-static int __head
+int __head
snp_cpuid(struct ghcb *ghcb, struct es_em_ctxt *ctxt, struct cpuid_leaf *leaf)
{
const struct snp_cpuid_table *cpuid_table = snp_cpuid_get_table();
@@ -737,391 +696,6 @@ fail:
sev_es_terminate(SEV_TERM_SET_GEN, GHCB_SEV_ES_GEN_REQ);
}
-static enum es_result vc_insn_string_check(struct es_em_ctxt *ctxt,
- unsigned long address,
- bool write)
-{
- if (user_mode(ctxt->regs) && fault_in_kernel_space(address)) {
- ctxt->fi.vector = X86_TRAP_PF;
- ctxt->fi.error_code = X86_PF_USER;
- ctxt->fi.cr2 = address;
- if (write)
- ctxt->fi.error_code |= X86_PF_WRITE;
-
- return ES_EXCEPTION;
- }
-
- return ES_OK;
-}
-
-static enum es_result vc_insn_string_read(struct es_em_ctxt *ctxt,
- void *src, char *buf,
- unsigned int data_size,
- unsigned int count,
- bool backwards)
-{
- int i, b = backwards ? -1 : 1;
- unsigned long address = (unsigned long)src;
- enum es_result ret;
-
- ret = vc_insn_string_check(ctxt, address, false);
- if (ret != ES_OK)
- return ret;
-
- for (i = 0; i < count; i++) {
- void *s = src + (i * data_size * b);
- char *d = buf + (i * data_size);
-
- ret = vc_read_mem(ctxt, s, d, data_size);
- if (ret != ES_OK)
- break;
- }
-
- return ret;
-}
-
-static enum es_result vc_insn_string_write(struct es_em_ctxt *ctxt,
- void *dst, char *buf,
- unsigned int data_size,
- unsigned int count,
- bool backwards)
-{
- int i, s = backwards ? -1 : 1;
- unsigned long address = (unsigned long)dst;
- enum es_result ret;
-
- ret = vc_insn_string_check(ctxt, address, true);
- if (ret != ES_OK)
- return ret;
-
- for (i = 0; i < count; i++) {
- void *d = dst + (i * data_size * s);
- char *b = buf + (i * data_size);
-
- ret = vc_write_mem(ctxt, d, b, data_size);
- if (ret != ES_OK)
- break;
- }
-
- return ret;
-}
-
-#define IOIO_TYPE_STR BIT(2)
-#define IOIO_TYPE_IN 1
-#define IOIO_TYPE_INS (IOIO_TYPE_IN | IOIO_TYPE_STR)
-#define IOIO_TYPE_OUT 0
-#define IOIO_TYPE_OUTS (IOIO_TYPE_OUT | IOIO_TYPE_STR)
-
-#define IOIO_REP BIT(3)
-
-#define IOIO_ADDR_64 BIT(9)
-#define IOIO_ADDR_32 BIT(8)
-#define IOIO_ADDR_16 BIT(7)
-
-#define IOIO_DATA_32 BIT(6)
-#define IOIO_DATA_16 BIT(5)
-#define IOIO_DATA_8 BIT(4)
-
-#define IOIO_SEG_ES (0 << 10)
-#define IOIO_SEG_DS (3 << 10)
-
-static enum es_result vc_ioio_exitinfo(struct es_em_ctxt *ctxt, u64 *exitinfo)
-{
- struct insn *insn = &ctxt->insn;
- size_t size;
- u64 port;
-
- *exitinfo = 0;
-
- switch (insn->opcode.bytes[0]) {
- /* INS opcodes */
- case 0x6c:
- case 0x6d:
- *exitinfo |= IOIO_TYPE_INS;
- *exitinfo |= IOIO_SEG_ES;
- port = ctxt->regs->dx & 0xffff;
- break;
-
- /* OUTS opcodes */
- case 0x6e:
- case 0x6f:
- *exitinfo |= IOIO_TYPE_OUTS;
- *exitinfo |= IOIO_SEG_DS;
- port = ctxt->regs->dx & 0xffff;
- break;
-
- /* IN immediate opcodes */
- case 0xe4:
- case 0xe5:
- *exitinfo |= IOIO_TYPE_IN;
- port = (u8)insn->immediate.value & 0xffff;
- break;
-
- /* OUT immediate opcodes */
- case 0xe6:
- case 0xe7:
- *exitinfo |= IOIO_TYPE_OUT;
- port = (u8)insn->immediate.value & 0xffff;
- break;
-
- /* IN register opcodes */
- case 0xec:
- case 0xed:
- *exitinfo |= IOIO_TYPE_IN;
- port = ctxt->regs->dx & 0xffff;
- break;
-
- /* OUT register opcodes */
- case 0xee:
- case 0xef:
- *exitinfo |= IOIO_TYPE_OUT;
- port = ctxt->regs->dx & 0xffff;
- break;
-
- default:
- return ES_DECODE_FAILED;
- }
-
- *exitinfo |= port << 16;
-
- switch (insn->opcode.bytes[0]) {
- case 0x6c:
- case 0x6e:
- case 0xe4:
- case 0xe6:
- case 0xec:
- case 0xee:
- /* Single byte opcodes */
- *exitinfo |= IOIO_DATA_8;
- size = 1;
- break;
- default:
- /* Length determined by instruction parsing */
- *exitinfo |= (insn->opnd_bytes == 2) ? IOIO_DATA_16
- : IOIO_DATA_32;
- size = (insn->opnd_bytes == 2) ? 2 : 4;
- }
-
- switch (insn->addr_bytes) {
- case 2:
- *exitinfo |= IOIO_ADDR_16;
- break;
- case 4:
- *exitinfo |= IOIO_ADDR_32;
- break;
- case 8:
- *exitinfo |= IOIO_ADDR_64;
- break;
- }
-
- if (insn_has_rep_prefix(insn))
- *exitinfo |= IOIO_REP;
-
- return vc_ioio_check(ctxt, (u16)port, size);
-}
-
-static enum es_result vc_handle_ioio(struct ghcb *ghcb, struct es_em_ctxt *ctxt)
-{
- struct pt_regs *regs = ctxt->regs;
- u64 exit_info_1, exit_info_2;
- enum es_result ret;
-
- ret = vc_ioio_exitinfo(ctxt, &exit_info_1);
- if (ret != ES_OK)
- return ret;
-
- if (exit_info_1 & IOIO_TYPE_STR) {
-
- /* (REP) INS/OUTS */
-
- bool df = ((regs->flags & X86_EFLAGS_DF) == X86_EFLAGS_DF);
- unsigned int io_bytes, exit_bytes;
- unsigned int ghcb_count, op_count;
- unsigned long es_base;
- u64 sw_scratch;
-
- /*
- * For the string variants with a rep prefix, the number of in/out
- * operations per #VC exception is limited so that the kernel
- * has a chance to take interrupts and re-schedule while the
- * instruction is emulated.
- */
- io_bytes = (exit_info_1 >> 4) & 0x7;
- ghcb_count = sizeof(ghcb->shared_buffer) / io_bytes;
-
- op_count = (exit_info_1 & IOIO_REP) ? regs->cx : 1;
- exit_info_2 = min(op_count, ghcb_count);
- exit_bytes = exit_info_2 * io_bytes;
-
- es_base = insn_get_seg_base(ctxt->regs, INAT_SEG_REG_ES);
-
- /* Read bytes of OUTS into the shared buffer */
- if (!(exit_info_1 & IOIO_TYPE_IN)) {
- ret = vc_insn_string_read(ctxt,
- (void *)(es_base + regs->si),
- ghcb->shared_buffer, io_bytes,
- exit_info_2, df);
- if (ret)
- return ret;
- }
-
- /*
- * Issue a VMGEXIT to the HV to consume the bytes from the
- * shared buffer or to have it write them into the shared buffer
- * depending on the instruction: OUTS or INS.
- */
- sw_scratch = __pa(ghcb) + offsetof(struct ghcb, shared_buffer);
- ghcb_set_sw_scratch(ghcb, sw_scratch);
- ret = sev_es_ghcb_hv_call(ghcb, ctxt, SVM_EXIT_IOIO,
- exit_info_1, exit_info_2);
- if (ret != ES_OK)
- return ret;
-
- /* Read bytes from shared buffer into the guest's destination. */
- if (exit_info_1 & IOIO_TYPE_IN) {
- ret = vc_insn_string_write(ctxt,
- (void *)(es_base + regs->di),
- ghcb->shared_buffer, io_bytes,
- exit_info_2, df);
- if (ret)
- return ret;
-
- if (df)
- regs->di -= exit_bytes;
- else
- regs->di += exit_bytes;
- } else {
- if (df)
- regs->si -= exit_bytes;
- else
- regs->si += exit_bytes;
- }
-
- if (exit_info_1 & IOIO_REP)
- regs->cx -= exit_info_2;
-
- ret = regs->cx ? ES_RETRY : ES_OK;
-
- } else {
-
- /* IN/OUT into/from rAX */
-
- int bits = (exit_info_1 & 0x70) >> 1;
- u64 rax = 0;
-
- if (!(exit_info_1 & IOIO_TYPE_IN))
- rax = lower_bits(regs->ax, bits);
-
- ghcb_set_rax(ghcb, rax);
-
- ret = sev_es_ghcb_hv_call(ghcb, ctxt, SVM_EXIT_IOIO, exit_info_1, 0);
- if (ret != ES_OK)
- return ret;
-
- if (exit_info_1 & IOIO_TYPE_IN) {
- if (!ghcb_rax_is_valid(ghcb))
- return ES_VMM_ERROR;
- regs->ax = lower_bits(ghcb->save.rax, bits);
- }
- }
-
- return ret;
-}
-
-static int vc_handle_cpuid_snp(struct ghcb *ghcb, struct es_em_ctxt *ctxt)
-{
- struct pt_regs *regs = ctxt->regs;
- struct cpuid_leaf leaf;
- int ret;
-
- leaf.fn = regs->ax;
- leaf.subfn = regs->cx;
- ret = snp_cpuid(ghcb, ctxt, &leaf);
- if (!ret) {
- regs->ax = leaf.eax;
- regs->bx = leaf.ebx;
- regs->cx = leaf.ecx;
- regs->dx = leaf.edx;
- }
-
- return ret;
-}
-
-static enum es_result vc_handle_cpuid(struct ghcb *ghcb,
- struct es_em_ctxt *ctxt)
-{
- struct pt_regs *regs = ctxt->regs;
- u32 cr4 = native_read_cr4();
- enum es_result ret;
- int snp_cpuid_ret;
-
- snp_cpuid_ret = vc_handle_cpuid_snp(ghcb, ctxt);
- if (!snp_cpuid_ret)
- return ES_OK;
- if (snp_cpuid_ret != -EOPNOTSUPP)
- return ES_VMM_ERROR;
-
- ghcb_set_rax(ghcb, regs->ax);
- ghcb_set_rcx(ghcb, regs->cx);
-
- if (cr4 & X86_CR4_OSXSAVE)
- /* Safe to read xcr0 */
- ghcb_set_xcr0(ghcb, xgetbv(XCR_XFEATURE_ENABLED_MASK));
- else
- /* xgetbv will cause #GP - use reset value for xcr0 */
- ghcb_set_xcr0(ghcb, 1);
-
- ret = sev_es_ghcb_hv_call(ghcb, ctxt, SVM_EXIT_CPUID, 0, 0);
- if (ret != ES_OK)
- return ret;
-
- if (!(ghcb_rax_is_valid(ghcb) &&
- ghcb_rbx_is_valid(ghcb) &&
- ghcb_rcx_is_valid(ghcb) &&
- ghcb_rdx_is_valid(ghcb)))
- return ES_VMM_ERROR;
-
- regs->ax = ghcb->save.rax;
- regs->bx = ghcb->save.rbx;
- regs->cx = ghcb->save.rcx;
- regs->dx = ghcb->save.rdx;
-
- return ES_OK;
-}
-
-static enum es_result vc_handle_rdtsc(struct ghcb *ghcb,
- struct es_em_ctxt *ctxt,
- unsigned long exit_code)
-{
- bool rdtscp = (exit_code == SVM_EXIT_RDTSCP);
- enum es_result ret;
-
- /*
- * The hypervisor should not be intercepting RDTSC/RDTSCP when Secure
- * TSC is enabled. A #VC exception will be generated if the RDTSC/RDTSCP
- * instructions are being intercepted. If this should occur and Secure
- * TSC is enabled, guest execution should be terminated as the guest
- * cannot rely on the TSC value provided by the hypervisor.
- */
- if (sev_status & MSR_AMD64_SNP_SECURE_TSC)
- return ES_VMM_ERROR;
-
- ret = sev_es_ghcb_hv_call(ghcb, ctxt, exit_code, 0, 0);
- if (ret != ES_OK)
- return ret;
-
- if (!(ghcb_rax_is_valid(ghcb) && ghcb_rdx_is_valid(ghcb) &&
- (!rdtscp || ghcb_rcx_is_valid(ghcb))))
- return ES_VMM_ERROR;
-
- ctxt->regs->ax = ghcb->save.rax;
- ctxt->regs->dx = ghcb->save.rdx;
- if (rdtscp)
- ctxt->regs->cx = ghcb->save.rcx;
-
- return ES_OK;
-}
-
struct cc_setup_data {
struct setup_data header;
u32 cc_blob_address;
@@ -1238,97 +812,6 @@ static void __head pvalidate_4k_page(unsigned long vaddr, unsigned long paddr,
}
}
-static enum es_result vc_check_opcode_bytes(struct es_em_ctxt *ctxt,
- unsigned long exit_code)
-{
- unsigned int opcode = (unsigned int)ctxt->insn.opcode.value;
- u8 modrm = ctxt->insn.modrm.value;
-
- switch (exit_code) {
-
- case SVM_EXIT_IOIO:
- case SVM_EXIT_NPF:
- /* handled separately */
- return ES_OK;
-
- case SVM_EXIT_CPUID:
- if (opcode == 0xa20f)
- return ES_OK;
- break;
-
- case SVM_EXIT_INVD:
- if (opcode == 0x080f)
- return ES_OK;
- break;
-
- case SVM_EXIT_MONITOR:
- /* MONITOR and MONITORX instructions generate the same error code */
- if (opcode == 0x010f && (modrm == 0xc8 || modrm == 0xfa))
- return ES_OK;
- break;
-
- case SVM_EXIT_MWAIT:
- /* MWAIT and MWAITX instructions generate the same error code */
- if (opcode == 0x010f && (modrm == 0xc9 || modrm == 0xfb))
- return ES_OK;
- break;
-
- case SVM_EXIT_MSR:
- /* RDMSR */
- if (opcode == 0x320f ||
- /* WRMSR */
- opcode == 0x300f)
- return ES_OK;
- break;
-
- case SVM_EXIT_RDPMC:
- if (opcode == 0x330f)
- return ES_OK;
- break;
-
- case SVM_EXIT_RDTSC:
- if (opcode == 0x310f)
- return ES_OK;
- break;
-
- case SVM_EXIT_RDTSCP:
- if (opcode == 0x010f && modrm == 0xf9)
- return ES_OK;
- break;
-
- case SVM_EXIT_READ_DR7:
- if (opcode == 0x210f &&
- X86_MODRM_REG(ctxt->insn.modrm.value) == 7)
- return ES_OK;
- break;
-
- case SVM_EXIT_VMMCALL:
- if (opcode == 0x010f && modrm == 0xd9)
- return ES_OK;
-
- break;
-
- case SVM_EXIT_WRITE_DR7:
- if (opcode == 0x230f &&
- X86_MODRM_REG(ctxt->insn.modrm.value) == 7)
- return ES_OK;
- break;
-
- case SVM_EXIT_WBINVD:
- if (opcode == 0x90f)
- return ES_OK;
- break;
-
- default:
- break;
- }
-
- sev_printk(KERN_ERR "Wrong/unhandled opcode bytes: 0x%x, exit_code: 0x%lx, rIP: 0x%lx\n",
- opcode, exit_code, ctxt->regs->ip);
-
- return ES_UNSUPPORTED;
-}
-
/*
* Maintain the GPA of the SVSM Calling Area (CA) in order to utilize the SVSM
* services needed when not running in VMPL0.
diff --git a/arch/x86/boot/startup/sev-startup.c b/arch/x86/boot/startup/sev-startup.c
index f901ce9..435853a 100644
--- a/arch/x86/boot/startup/sev-startup.c
+++ b/arch/x86/boot/startup/sev-startup.c
@@ -9,7 +9,6 @@
#define pr_fmt(fmt) "SEV: " fmt
-#include <linux/sched/debug.h> /* For show_regs() */
#include <linux/percpu-defs.h>
#include <linux/cc_platform.h>
#include <linux/printk.h>
@@ -112,315 +111,6 @@ noinstr struct ghcb *__sev_get_ghcb(struct ghcb_state *state)
return ghcb;
}
-static int vc_fetch_insn_kernel(struct es_em_ctxt *ctxt,
- unsigned char *buffer)
-{
- return copy_from_kernel_nofault(buffer, (unsigned char *)ctxt->regs->ip, MAX_INSN_SIZE);
-}
-
-static enum es_result __vc_decode_user_insn(struct es_em_ctxt *ctxt)
-{
- char buffer[MAX_INSN_SIZE];
- int insn_bytes;
-
- insn_bytes = insn_fetch_from_user_inatomic(ctxt->regs, buffer);
- if (insn_bytes == 0) {
- /* Nothing could be copied */
- ctxt->fi.vector = X86_TRAP_PF;
- ctxt->fi.error_code = X86_PF_INSTR | X86_PF_USER;
- ctxt->fi.cr2 = ctxt->regs->ip;
- return ES_EXCEPTION;
- } else if (insn_bytes == -EINVAL) {
- /* Effective RIP could not be calculated */
- ctxt->fi.vector = X86_TRAP_GP;
- ctxt->fi.error_code = 0;
- ctxt->fi.cr2 = 0;
- return ES_EXCEPTION;
- }
-
- if (!insn_decode_from_regs(&ctxt->insn, ctxt->regs, buffer, insn_bytes))
- return ES_DECODE_FAILED;
-
- if (ctxt->insn.immediate.got)
- return ES_OK;
- else
- return ES_DECODE_FAILED;
-}
-
-static enum es_result __vc_decode_kern_insn(struct es_em_ctxt *ctxt)
-{
- char buffer[MAX_INSN_SIZE];
- int res, ret;
-
- res = vc_fetch_insn_kernel(ctxt, buffer);
- if (res) {
- ctxt->fi.vector = X86_TRAP_PF;
- ctxt->fi.error_code = X86_PF_INSTR;
- ctxt->fi.cr2 = ctxt->regs->ip;
- return ES_EXCEPTION;
- }
-
- ret = insn_decode(&ctxt->insn, buffer, MAX_INSN_SIZE, INSN_MODE_64);
- if (ret < 0)
- return ES_DECODE_FAILED;
- else
- return ES_OK;
-}
-
-static enum es_result vc_decode_insn(struct es_em_ctxt *ctxt)
-{
- if (user_mode(ctxt->regs))
- return __vc_decode_user_insn(ctxt);
- else
- return __vc_decode_kern_insn(ctxt);
-}
-
-static enum es_result vc_write_mem(struct es_em_ctxt *ctxt,
- char *dst, char *buf, size_t size)
-{
- unsigned long error_code = X86_PF_PROT | X86_PF_WRITE;
-
- /*
- * This function uses __put_user() independent of whether kernel or user
- * memory is accessed. This works fine because __put_user() does no
- * sanity checks of the pointer being accessed. All that it does is
- * to report when the access failed.
- *
- * Also, this function runs in atomic context, so __put_user() is not
- * allowed to sleep. The page-fault handler detects that it is running
- * in atomic context and will not try to take mmap_sem and handle the
- * fault, so additional pagefault_enable()/disable() calls are not
- * needed.
- *
- * The access can't be done via copy_to_user() here because
- * vc_write_mem() must not use string instructions to access unsafe
- * memory. The reason is that MOVS is emulated by the #VC handler by
- * splitting the move up into a read and a write and taking a nested #VC
- * exception on whatever of them is the MMIO access. Using string
- * instructions here would cause infinite nesting.
- */
- switch (size) {
- case 1: {
- u8 d1;
- u8 __user *target = (u8 __user *)dst;
-
- memcpy(&d1, buf, 1);
- if (__put_user(d1, target))
- goto fault;
- break;
- }
- case 2: {
- u16 d2;
- u16 __user *target = (u16 __user *)dst;
-
- memcpy(&d2, buf, 2);
- if (__put_user(d2, target))
- goto fault;
- break;
- }
- case 4: {
- u32 d4;
- u32 __user *target = (u32 __user *)dst;
-
- memcpy(&d4, buf, 4);
- if (__put_user(d4, target))
- goto fault;
- break;
- }
- case 8: {
- u64 d8;
- u64 __user *target = (u64 __user *)dst;
-
- memcpy(&d8, buf, 8);
- if (__put_user(d8, target))
- goto fault;
- break;
- }
- default:
- WARN_ONCE(1, "%s: Invalid size: %zu\n", __func__, size);
- return ES_UNSUPPORTED;
- }
-
- return ES_OK;
-
-fault:
- if (user_mode(ctxt->regs))
- error_code |= X86_PF_USER;
-
- ctxt->fi.vector = X86_TRAP_PF;
- ctxt->fi.error_code = error_code;
- ctxt->fi.cr2 = (unsigned long)dst;
-
- return ES_EXCEPTION;
-}
-
-static enum es_result vc_read_mem(struct es_em_ctxt *ctxt,
- char *src, char *buf, size_t size)
-{
- unsigned long error_code = X86_PF_PROT;
-
- /*
- * This function uses __get_user() independent of whether kernel or user
- * memory is accessed. This works fine because __get_user() does no
- * sanity checks of the pointer being accessed. All that it does is
- * to report when the access failed.
- *
- * Also, this function runs in atomic context, so __get_user() is not
- * allowed to sleep. The page-fault handler detects that it is running
- * in atomic context and will not try to take mmap_sem and handle the
- * fault, so additional pagefault_enable()/disable() calls are not
- * needed.
- *
- * The access can't be done via copy_from_user() here because
- * vc_read_mem() must not use string instructions to access unsafe
- * memory. The reason is that MOVS is emulated by the #VC handler by
- * splitting the move up into a read and a write and taking a nested #VC
- * exception on whatever of them is the MMIO access. Using string
- * instructions here would cause infinite nesting.
- */
- switch (size) {
- case 1: {
- u8 d1;
- u8 __user *s = (u8 __user *)src;
-
- if (__get_user(d1, s))
- goto fault;
- memcpy(buf, &d1, 1);
- break;
- }
- case 2: {
- u16 d2;
- u16 __user *s = (u16 __user *)src;
-
- if (__get_user(d2, s))
- goto fault;
- memcpy(buf, &d2, 2);
- break;
- }
- case 4: {
- u32 d4;
- u32 __user *s = (u32 __user *)src;
-
- if (__get_user(d4, s))
- goto fault;
- memcpy(buf, &d4, 4);
- break;
- }
- case 8: {
- u64 d8;
- u64 __user *s = (u64 __user *)src;
-
- if (__get_user(d8, s))
- goto fault;
- memcpy(buf, &d8, 8);
- break;
- }
- default:
- WARN_ONCE(1, "%s: Invalid size: %zu\n", __func__, size);
- return ES_UNSUPPORTED;
- }
-
- return ES_OK;
-
-fault:
- if (user_mode(ctxt->regs))
- error_code |= X86_PF_USER;
-
- ctxt->fi.vector = X86_TRAP_PF;
- ctxt->fi.error_code = error_code;
- ctxt->fi.cr2 = (unsigned long)src;
-
- return ES_EXCEPTION;
-}
-
-static enum es_result vc_slow_virt_to_phys(struct ghcb *ghcb, struct es_em_ctxt *ctxt,
- unsigned long vaddr, phys_addr_t *paddr)
-{
- unsigned long va = (unsigned long)vaddr;
- unsigned int level;
- phys_addr_t pa;
- pgd_t *pgd;
- pte_t *pte;
-
- pgd = __va(read_cr3_pa());
- pgd = &pgd[pgd_index(va)];
- pte = lookup_address_in_pgd(pgd, va, &level);
- if (!pte) {
- ctxt->fi.vector = X86_TRAP_PF;
- ctxt->fi.cr2 = vaddr;
- ctxt->fi.error_code = 0;
-
- if (user_mode(ctxt->regs))
- ctxt->fi.error_code |= X86_PF_USER;
-
- return ES_EXCEPTION;
- }
-
- if (WARN_ON_ONCE(pte_val(*pte) & _PAGE_ENC))
- /* Emulated MMIO to/from encrypted memory not supported */
- return ES_UNSUPPORTED;
-
- pa = (phys_addr_t)pte_pfn(*pte) << PAGE_SHIFT;
- pa |= va & ~page_level_mask(level);
-
- *paddr = pa;
-
- return ES_OK;
-}
-
-static enum es_result vc_ioio_check(struct es_em_ctxt *ctxt, u16 port, size_t size)
-{
- BUG_ON(size > 4);
-
- if (user_mode(ctxt->regs)) {
- struct thread_struct *t = &current->thread;
- struct io_bitmap *iobm = t->io_bitmap;
- size_t idx;
-
- if (!iobm)
- goto fault;
-
- for (idx = port; idx < port + size; ++idx) {
- if (test_bit(idx, iobm->bitmap))
- goto fault;
- }
- }
-
- return ES_OK;
-
-fault:
- ctxt->fi.vector = X86_TRAP_GP;
- ctxt->fi.error_code = 0;
-
- return ES_EXCEPTION;
-}
-
-static __always_inline void vc_forward_exception(struct es_em_ctxt *ctxt)
-{
- long error_code = ctxt->fi.error_code;
- int trapnr = ctxt->fi.vector;
-
- ctxt->regs->orig_ax = ctxt->fi.error_code;
-
- switch (trapnr) {
- case X86_TRAP_GP:
- exc_general_protection(ctxt->regs, error_code);
- break;
- case X86_TRAP_UD:
- exc_invalid_op(ctxt->regs);
- break;
- case X86_TRAP_PF:
- write_cr2(ctxt->fi.cr2);
- exc_page_fault(ctxt->regs, error_code);
- break;
- case X86_TRAP_AC:
- exc_alignment_check(ctxt->regs, error_code);
- break;
- default:
- pr_emerg("Unsupported exception in #VC instruction emulation - can't continue\n");
- BUG();
- }
-}
-
/* Include code shared with pre-decompression boot stage */
#include "sev-shared.c"
@@ -563,718 +253,6 @@ void __head early_snp_set_memory_shared(unsigned long vaddr, unsigned long paddr
early_set_pages_state(vaddr, paddr, npages, SNP_PAGE_STATE_SHARED);
}
-/* Writes to the SVSM CAA MSR are ignored */
-static enum es_result __vc_handle_msr_caa(struct pt_regs *regs, bool write)
-{
- if (write)
- return ES_OK;
-
- regs->ax = lower_32_bits(this_cpu_read(svsm_caa_pa));
- regs->dx = upper_32_bits(this_cpu_read(svsm_caa_pa));
-
- return ES_OK;
-}
-
-/*
- * TSC related accesses should not exit to the hypervisor when a guest is
- * executing with Secure TSC enabled, so special handling is required for
- * accesses of MSR_IA32_TSC and MSR_AMD64_GUEST_TSC_FREQ.
- */
-static enum es_result __vc_handle_secure_tsc_msrs(struct pt_regs *regs, bool write)
-{
- u64 tsc;
-
- /*
- * GUEST_TSC_FREQ should not be intercepted when Secure TSC is enabled.
- * Terminate the SNP guest when the interception is enabled.
- */
- if (regs->cx == MSR_AMD64_GUEST_TSC_FREQ)
- return ES_VMM_ERROR;
-
- /*
- * Writes: Writing to MSR_IA32_TSC can cause subsequent reads of the TSC
- * to return undefined values, so ignore all writes.
- *
- * Reads: Reads of MSR_IA32_TSC should return the current TSC value, use
- * the value returned by rdtsc_ordered().
- */
- if (write) {
- WARN_ONCE(1, "TSC MSR writes are verboten!\n");
- return ES_OK;
- }
-
- tsc = rdtsc_ordered();
- regs->ax = lower_32_bits(tsc);
- regs->dx = upper_32_bits(tsc);
-
- return ES_OK;
-}
-
-static enum es_result vc_handle_msr(struct ghcb *ghcb, struct es_em_ctxt *ctxt)
-{
- struct pt_regs *regs = ctxt->regs;
- enum es_result ret;
- bool write;
-
- /* Is it a WRMSR? */
- write = ctxt->insn.opcode.bytes[1] == 0x30;
-
- switch (regs->cx) {
- case MSR_SVSM_CAA:
- return __vc_handle_msr_caa(regs, write);
- case MSR_IA32_TSC:
- case MSR_AMD64_GUEST_TSC_FREQ:
- if (sev_status & MSR_AMD64_SNP_SECURE_TSC)
- return __vc_handle_secure_tsc_msrs(regs, write);
- break;
- default:
- break;
- }
-
- ghcb_set_rcx(ghcb, regs->cx);
- if (write) {
- ghcb_set_rax(ghcb, regs->ax);
- ghcb_set_rdx(ghcb, regs->dx);
- }
-
- ret = sev_es_ghcb_hv_call(ghcb, ctxt, SVM_EXIT_MSR, write, 0);
-
- if ((ret == ES_OK) && !write) {
- regs->ax = ghcb->save.rax;
- regs->dx = ghcb->save.rdx;
- }
-
- return ret;
-}
-
-static void __init vc_early_forward_exception(struct es_em_ctxt *ctxt)
-{
- int trapnr = ctxt->fi.vector;
-
- if (trapnr == X86_TRAP_PF)
- native_write_cr2(ctxt->fi.cr2);
-
- ctxt->regs->orig_ax = ctxt->fi.error_code;
- do_early_exception(ctxt->regs, trapnr);
-}
-
-static long *vc_insn_get_rm(struct es_em_ctxt *ctxt)
-{
- long *reg_array;
- int offset;
-
- reg_array = (long *)ctxt->regs;
- offset = insn_get_modrm_rm_off(&ctxt->insn, ctxt->regs);
-
- if (offset < 0)
- return NULL;
-
- offset /= sizeof(long);
-
- return reg_array + offset;
-}
-
-static enum es_result vc_do_mmio(struct ghcb *ghcb, struct es_em_ctxt *ctxt,
- unsigned int bytes, bool read)
-{
- u64 exit_code, exit_info_1, exit_info_2;
- unsigned long ghcb_pa = __pa(ghcb);
- enum es_result res;
- phys_addr_t paddr;
- void __user *ref;
-
- ref = insn_get_addr_ref(&ctxt->insn, ctxt->regs);
- if (ref == (void __user *)-1L)
- return ES_UNSUPPORTED;
-
- exit_code = read ? SVM_VMGEXIT_MMIO_READ : SVM_VMGEXIT_MMIO_WRITE;
-
- res = vc_slow_virt_to_phys(ghcb, ctxt, (unsigned long)ref, &paddr);
- if (res != ES_OK) {
- if (res == ES_EXCEPTION && !read)
- ctxt->fi.error_code |= X86_PF_WRITE;
-
- return res;
- }
-
- exit_info_1 = paddr;
- /* Can never be greater than 8 */
- exit_info_2 = bytes;
-
- ghcb_set_sw_scratch(ghcb, ghcb_pa + offsetof(struct ghcb, shared_buffer));
-
- return sev_es_ghcb_hv_call(ghcb, ctxt, exit_code, exit_info_1, exit_info_2);
-}
-
-/*
- * The MOVS instruction has two memory operands, which raises the
- * problem that it is not known whether the access to the source or the
- * destination caused the #VC exception (and hence whether an MMIO read
- * or write operation needs to be emulated).
- *
- * Instead of playing games with walking page-tables and trying to guess
- * whether the source or destination is an MMIO range, split the move
- * into two operations, a read and a write with only one memory operand.
- * This will cause a nested #VC exception on the MMIO address which can
- * then be handled.
- *
- * This implementation has the benefit that it also supports MOVS where
- * source _and_ destination are MMIO regions.
- *
- * It will slow MOVS on MMIO down a lot, but in SEV-ES guests it is a
- * rare operation. If it turns out to be a performance problem the split
- * operations can be moved to memcpy_fromio() and memcpy_toio().
- */
-static enum es_result vc_handle_mmio_movs(struct es_em_ctxt *ctxt,
- unsigned int bytes)
-{
- unsigned long ds_base, es_base;
- unsigned char *src, *dst;
- unsigned char buffer[8];
- enum es_result ret;
- bool rep;
- int off;
-
- ds_base = insn_get_seg_base(ctxt->regs, INAT_SEG_REG_DS);
- es_base = insn_get_seg_base(ctxt->regs, INAT_SEG_REG_ES);
-
- if (ds_base == -1L || es_base == -1L) {
- ctxt->fi.vector = X86_TRAP_GP;
- ctxt->fi.error_code = 0;
- return ES_EXCEPTION;
- }
-
- src = ds_base + (unsigned char *)ctxt->regs->si;
- dst = es_base + (unsigned char *)ctxt->regs->di;
-
- ret = vc_read_mem(ctxt, src, buffer, bytes);
- if (ret != ES_OK)
- return ret;
-
- ret = vc_write_mem(ctxt, dst, buffer, bytes);
- if (ret != ES_OK)
- return ret;
-
- if (ctxt->regs->flags & X86_EFLAGS_DF)
- off = -bytes;
- else
- off = bytes;
-
- ctxt->regs->si += off;
- ctxt->regs->di += off;
-
- rep = insn_has_rep_prefix(&ctxt->insn);
- if (rep)
- ctxt->regs->cx -= 1;
-
- if (!rep || ctxt->regs->cx == 0)
- return ES_OK;
- else
- return ES_RETRY;
-}
-
-static enum es_result vc_handle_mmio(struct ghcb *ghcb, struct es_em_ctxt *ctxt)
-{
- struct insn *insn = &ctxt->insn;
- enum insn_mmio_type mmio;
- unsigned int bytes = 0;
- enum es_result ret;
- u8 sign_byte;
- long *reg_data;
-
- mmio = insn_decode_mmio(insn, &bytes);
- if (mmio == INSN_MMIO_DECODE_FAILED)
- return ES_DECODE_FAILED;
-
- if (mmio != INSN_MMIO_WRITE_IMM && mmio != INSN_MMIO_MOVS) {
- reg_data = insn_get_modrm_reg_ptr(insn, ctxt->regs);
- if (!reg_data)
- return ES_DECODE_FAILED;
- }
-
- if (user_mode(ctxt->regs))
- return ES_UNSUPPORTED;
-
- switch (mmio) {
- case INSN_MMIO_WRITE:
- memcpy(ghcb->shared_buffer, reg_data, bytes);
- ret = vc_do_mmio(ghcb, ctxt, bytes, false);
- break;
- case INSN_MMIO_WRITE_IMM:
- memcpy(ghcb->shared_buffer, insn->immediate1.bytes, bytes);
- ret = vc_do_mmio(ghcb, ctxt, bytes, false);
- break;
- case INSN_MMIO_READ:
- ret = vc_do_mmio(ghcb, ctxt, bytes, true);
- if (ret)
- break;
-
- /* Zero-extend for 32-bit operation */
- if (bytes == 4)
- *reg_data = 0;
-
- memcpy(reg_data, ghcb->shared_buffer, bytes);
- break;
- case INSN_MMIO_READ_ZERO_EXTEND:
- ret = vc_do_mmio(ghcb, ctxt, bytes, true);
- if (ret)
- break;
-
- /* Zero extend based on operand size */
- memset(reg_data, 0, insn->opnd_bytes);
- memcpy(reg_data, ghcb->shared_buffer, bytes);
- break;
- case INSN_MMIO_READ_SIGN_EXTEND:
- ret = vc_do_mmio(ghcb, ctxt, bytes, true);
- if (ret)
- break;
-
- if (bytes == 1) {
- u8 *val = (u8 *)ghcb->shared_buffer;
-
- sign_byte = (*val & 0x80) ? 0xff : 0x00;
- } else {
- u16 *val = (u16 *)ghcb->shared_buffer;
-
- sign_byte = (*val & 0x8000) ? 0xff : 0x00;
- }
-
- /* Sign extend based on operand size */
- memset(reg_data, sign_byte, insn->opnd_bytes);
- memcpy(reg_data, ghcb->shared_buffer, bytes);
- break;
- case INSN_MMIO_MOVS:
- ret = vc_handle_mmio_movs(ctxt, bytes);
- break;
- default:
- ret = ES_UNSUPPORTED;
- break;
- }
-
- return ret;
-}
-
-static enum es_result vc_handle_dr7_write(struct ghcb *ghcb,
- struct es_em_ctxt *ctxt)
-{
- struct sev_es_runtime_data *data = this_cpu_read(runtime_data);
- long val, *reg = vc_insn_get_rm(ctxt);
- enum es_result ret;
-
- if (sev_status & MSR_AMD64_SNP_DEBUG_SWAP)
- return ES_VMM_ERROR;
-
- if (!reg)
- return ES_DECODE_FAILED;
-
- val = *reg;
-
- /* Upper 32 bits must be written as zeroes */
- if (val >> 32) {
- ctxt->fi.vector = X86_TRAP_GP;
- ctxt->fi.error_code = 0;
- return ES_EXCEPTION;
- }
-
- /* Clear out other reserved bits and set bit 10 */
- val = (val & 0xffff23ffL) | BIT(10);
-
- /* Early non-zero writes to DR7 are not supported */
- if (!data && (val & ~DR7_RESET_VALUE))
- return ES_UNSUPPORTED;
-
- /* Using a value of 0 for ExitInfo1 means RAX holds the value */
- ghcb_set_rax(ghcb, val);
- ret = sev_es_ghcb_hv_call(ghcb, ctxt, SVM_EXIT_WRITE_DR7, 0, 0);
- if (ret != ES_OK)
- return ret;
-
- if (data)
- data->dr7 = val;
-
- return ES_OK;
-}
-
-static enum es_result vc_handle_dr7_read(struct ghcb *ghcb,
- struct es_em_ctxt *ctxt)
-{
- struct sev_es_runtime_data *data = this_cpu_read(runtime_data);
- long *reg = vc_insn_get_rm(ctxt);
-
- if (sev_status & MSR_AMD64_SNP_DEBUG_SWAP)
- return ES_VMM_ERROR;
-
- if (!reg)
- return ES_DECODE_FAILED;
-
- if (data)
- *reg = data->dr7;
- else
- *reg = DR7_RESET_VALUE;
-
- return ES_OK;
-}
-
-static enum es_result vc_handle_wbinvd(struct ghcb *ghcb,
- struct es_em_ctxt *ctxt)
-{
- return sev_es_ghcb_hv_call(ghcb, ctxt, SVM_EXIT_WBINVD, 0, 0);
-}
-
-static enum es_result vc_handle_rdpmc(struct ghcb *ghcb, struct es_em_ctxt *ctxt)
-{
- enum es_result ret;
-
- ghcb_set_rcx(ghcb, ctxt->regs->cx);
-
- ret = sev_es_ghcb_hv_call(ghcb, ctxt, SVM_EXIT_RDPMC, 0, 0);
- if (ret != ES_OK)
- return ret;
-
- if (!(ghcb_rax_is_valid(ghcb) && ghcb_rdx_is_valid(ghcb)))
- return ES_VMM_ERROR;
-
- ctxt->regs->ax = ghcb->save.rax;
- ctxt->regs->dx = ghcb->save.rdx;
-
- return ES_OK;
-}
-
-static enum es_result vc_handle_monitor(struct ghcb *ghcb,
- struct es_em_ctxt *ctxt)
-{
- /*
- * Treat it as a NOP and do not leak a physical address to the
- * hypervisor.
- */
- return ES_OK;
-}
-
-static enum es_result vc_handle_mwait(struct ghcb *ghcb,
- struct es_em_ctxt *ctxt)
-{
- /* Treat the same as MONITOR/MONITORX */
- return ES_OK;
-}
-
-static enum es_result vc_handle_vmmcall(struct ghcb *ghcb,
- struct es_em_ctxt *ctxt)
-{
- enum es_result ret;
-
- ghcb_set_rax(ghcb, ctxt->regs->ax);
- ghcb_set_cpl(ghcb, user_mode(ctxt->regs) ? 3 : 0);
-
- if (x86_platform.hyper.sev_es_hcall_prepare)
- x86_platform.hyper.sev_es_hcall_prepare(ghcb, ctxt->regs);
-
- ret = sev_es_ghcb_hv_call(ghcb, ctxt, SVM_EXIT_VMMCALL, 0, 0);
- if (ret != ES_OK)
- return ret;
-
- if (!ghcb_rax_is_valid(ghcb))
- return ES_VMM_ERROR;
-
- ctxt->regs->ax = ghcb->save.rax;
-
- /*
- * Call sev_es_hcall_finish() after regs->ax is already set.
- * This allows the hypervisor handler to overwrite it again if
- * necessary.
- */
- if (x86_platform.hyper.sev_es_hcall_finish &&
- !x86_platform.hyper.sev_es_hcall_finish(ghcb, ctxt->regs))
- return ES_VMM_ERROR;
-
- return ES_OK;
-}
-
-static enum es_result vc_handle_trap_ac(struct ghcb *ghcb,
- struct es_em_ctxt *ctxt)
-{
- /*
- * Calling exc_alignment_check() directly does not work, because it
- * enables IRQs and the GHCB is active. Forward the exception and call
- * it later from vc_forward_exception().
- */
- ctxt->fi.vector = X86_TRAP_AC;
- ctxt->fi.error_code = 0;
- return ES_EXCEPTION;
-}
-
-static enum es_result vc_handle_exitcode(struct es_em_ctxt *ctxt,
- struct ghcb *ghcb,
- unsigned long exit_code)
-{
- enum es_result result = vc_check_opcode_bytes(ctxt, exit_code);
-
- if (result != ES_OK)
- return result;
-
- switch (exit_code) {
- case SVM_EXIT_READ_DR7:
- result = vc_handle_dr7_read(ghcb, ctxt);
- break;
- case SVM_EXIT_WRITE_DR7:
- result = vc_handle_dr7_write(ghcb, ctxt);
- break;
- case SVM_EXIT_EXCP_BASE + X86_TRAP_AC:
- result = vc_handle_trap_ac(ghcb, ctxt);
- break;
- case SVM_EXIT_RDTSC:
- case SVM_EXIT_RDTSCP:
- result = vc_handle_rdtsc(ghcb, ctxt, exit_code);
- break;
- case SVM_EXIT_RDPMC:
- result = vc_handle_rdpmc(ghcb, ctxt);
- break;
- case SVM_EXIT_INVD:
- pr_err_ratelimited("#VC exception for INVD??? Seriously???\n");
- result = ES_UNSUPPORTED;
- break;
- case SVM_EXIT_CPUID:
- result = vc_handle_cpuid(ghcb, ctxt);
- break;
- case SVM_EXIT_IOIO:
- result = vc_handle_ioio(ghcb, ctxt);
- break;
- case SVM_EXIT_MSR:
- result = vc_handle_msr(ghcb, ctxt);
- break;
- case SVM_EXIT_VMMCALL:
- result = vc_handle_vmmcall(ghcb, ctxt);
- break;
- case SVM_EXIT_WBINVD:
- result = vc_handle_wbinvd(ghcb, ctxt);
- break;
- case SVM_EXIT_MONITOR:
- result = vc_handle_monitor(ghcb, ctxt);
- break;
- case SVM_EXIT_MWAIT:
- result = vc_handle_mwait(ghcb, ctxt);
- break;
- case SVM_EXIT_NPF:
- result = vc_handle_mmio(ghcb, ctxt);
- break;
- default:
- /*
- * Unexpected #VC exception
- */
- result = ES_UNSUPPORTED;
- }
-
- return result;
-}
-
-static __always_inline bool is_vc2_stack(unsigned long sp)
-{
- return (sp >= __this_cpu_ist_bottom_va(VC2) && sp < __this_cpu_ist_top_va(VC2));
-}
-
-static __always_inline bool vc_from_invalid_context(struct pt_regs *regs)
-{
- unsigned long sp, prev_sp;
-
- sp = (unsigned long)regs;
- prev_sp = regs->sp;
-
- /*
- * If the code was already executing on the VC2 stack when the #VC
- * happened, let it proceed to the normal handling routine. This way the
- * code executing on the VC2 stack can cause #VC exceptions to get handled.
- */
- return is_vc2_stack(sp) && !is_vc2_stack(prev_sp);
-}
-
-static bool vc_raw_handle_exception(struct pt_regs *regs, unsigned long error_code)
-{
- struct ghcb_state state;
- struct es_em_ctxt ctxt;
- enum es_result result;
- struct ghcb *ghcb;
- bool ret = true;
-
- ghcb = __sev_get_ghcb(&state);
-
- vc_ghcb_invalidate(ghcb);
- result = vc_init_em_ctxt(&ctxt, regs, error_code);
-
- if (result == ES_OK)
- result = vc_handle_exitcode(&ctxt, ghcb, error_code);
-
- __sev_put_ghcb(&state);
-
- /* Done - now check the result */
- switch (result) {
- case ES_OK:
- vc_finish_insn(&ctxt);
- break;
- case ES_UNSUPPORTED:
- pr_err_ratelimited("Unsupported exit-code 0x%02lx in #VC exception (IP: 0x%lx)\n",
- error_code, regs->ip);
- ret = false;
- break;
- case ES_VMM_ERROR:
- pr_err_ratelimited("Failure in communication with VMM (exit-code 0x%02lx IP: 0x%lx)\n",
- error_code, regs->ip);
- ret = false;
- break;
- case ES_DECODE_FAILED:
- pr_err_ratelimited("Failed to decode instruction (exit-code 0x%02lx IP: 0x%lx)\n",
- error_code, regs->ip);
- ret = false;
- break;
- case ES_EXCEPTION:
- vc_forward_exception(&ctxt);
- break;
- case ES_RETRY:
- /* Nothing to do */
- break;
- default:
- pr_emerg("Unknown result in %s():%d\n", __func__, result);
- /*
- * Emulating the instruction which caused the #VC exception
- * failed - can't continue so print debug information
- */
- BUG();
- }
-
- return ret;
-}
-
-static __always_inline bool vc_is_db(unsigned long error_code)
-{
- return error_code == SVM_EXIT_EXCP_BASE + X86_TRAP_DB;
-}
-
-/*
- * Runtime #VC exception handler when raised from kernel mode. Runs in NMI mode
- * and will panic when an error happens.
- */
-DEFINE_IDTENTRY_VC_KERNEL(exc_vmm_communication)
-{
- irqentry_state_t irq_state;
-
- /*
- * With the current implementation it is always possible to switch to a
- * safe stack because #VC exceptions only happen at known places, like
- * intercepted instructions or accesses to MMIO areas/IO ports. They can
- * also happen with code instrumentation when the hypervisor intercepts
- * #DB, but the critical paths are forbidden to be instrumented, so #DB
- * exceptions currently also only happen in safe places.
- *
- * But keep this here in case the noinstr annotations are violated due
- * to a bug elsewhere.
- */
- if (unlikely(vc_from_invalid_context(regs))) {
- instrumentation_begin();
- panic("Can't handle #VC exception from unsupported context\n");
- instrumentation_end();
- }
-
- /*
- * Handle #DB before calling into !noinstr code to avoid recursive #DB.
- */
- if (vc_is_db(error_code)) {
- exc_debug(regs);
- return;
- }
-
- irq_state = irqentry_nmi_enter(regs);
-
- instrumentation_begin();
-
- if (!vc_raw_handle_exception(regs, error_code)) {
- /* Show some debug info */
- show_regs(regs);
-
- /* Ask hypervisor to sev_es_terminate */
- sev_es_terminate(SEV_TERM_SET_GEN, GHCB_SEV_ES_GEN_REQ);
-
- /* If that fails and we get here - just panic */
- panic("Returned from Terminate-Request to Hypervisor\n");
- }
-
- instrumentation_end();
- irqentry_nmi_exit(regs, irq_state);
-}
-
-/*
- * Runtime #VC exception handler when raised from user mode. Runs in IRQ mode
- * and will kill the current task with SIGBUS when an error happens.
- */
-DEFINE_IDTENTRY_VC_USER(exc_vmm_communication)
-{
- /*
- * Handle #DB before calling into !noinstr code to avoid recursive #DB.
- */
- if (vc_is_db(error_code)) {
- noist_exc_debug(regs);
- return;
- }
-
- irqentry_enter_from_user_mode(regs);
- instrumentation_begin();
-
- if (!vc_raw_handle_exception(regs, error_code)) {
- /*
- * Do not kill the machine if user-space triggered the
- * exception. Send SIGBUS instead and let user-space deal with
- * it.
- */
- force_sig_fault(SIGBUS, BUS_OBJERR, (void __user *)0);
- }
-
- instrumentation_end();
- irqentry_exit_to_user_mode(regs);
-}
-
-bool __init handle_vc_boot_ghcb(struct pt_regs *regs)
-{
- unsigned long exit_code = regs->orig_ax;
- struct es_em_ctxt ctxt;
- enum es_result result;
-
- vc_ghcb_invalidate(boot_ghcb);
-
- result = vc_init_em_ctxt(&ctxt, regs, exit_code);
- if (result == ES_OK)
- result = vc_handle_exitcode(&ctxt, boot_ghcb, exit_code);
-
- /* Done - now check the result */
- switch (result) {
- case ES_OK:
- vc_finish_insn(&ctxt);
- break;
- case ES_UNSUPPORTED:
- early_printk("PANIC: Unsupported exit-code 0x%02lx in early #VC exception (IP: 0x%lx)\n",
- exit_code, regs->ip);
- goto fail;
- case ES_VMM_ERROR:
- early_printk("PANIC: Failure in communication with VMM (exit-code 0x%02lx IP: 0x%lx)\n",
- exit_code, regs->ip);
- goto fail;
- case ES_DECODE_FAILED:
- early_printk("PANIC: Failed to decode instruction (exit-code 0x%02lx IP: 0x%lx)\n",
- exit_code, regs->ip);
- goto fail;
- case ES_EXCEPTION:
- vc_early_forward_exception(&ctxt);
- break;
- case ES_RETRY:
- /* Nothing to do */
- break;
- default:
- BUG();
- }
-
- return true;
-
-fail:
- show_regs(regs);
-
- sev_es_terminate(SEV_TERM_SET_GEN, GHCB_SEV_ES_GEN_REQ);
-}
-
/*
* Initial set up of SNP relies on information provided by the
* Confidential Computing blob, which can be passed to the kernel
diff --git a/arch/x86/coco/sev/Makefile b/arch/x86/coco/sev/Makefile
index 2919dcf..db3255b 100644
--- a/arch/x86/coco/sev/Makefile
+++ b/arch/x86/coco/sev/Makefile
@@ -1,6 +1,6 @@
# SPDX-License-Identifier: GPL-2.0
-obj-y += core.o sev-nmi.o
+obj-y += core.o sev-nmi.o vc-handle.o
# Clang 14 and older may fail to respect __no_sanitize_undefined when inlining
UBSAN_SANITIZE_sev-nmi.o := n
diff --git a/arch/x86/coco/sev/vc-handle.c b/arch/x86/coco/sev/vc-handle.c
new file mode 100644
index 0000000..b4895c6
--- /dev/null
+++ b/arch/x86/coco/sev/vc-handle.c
@@ -0,0 +1,1061 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * AMD Memory Encryption Support
+ *
+ * Copyright (C) 2019 SUSE
+ *
+ * Author: Joerg Roedel <jroedel@suse.de>
+ */
+
+#define pr_fmt(fmt) "SEV: " fmt
+
+#include <linux/sched/debug.h> /* For show_regs() */
+#include <linux/cc_platform.h>
+#include <linux/printk.h>
+#include <linux/mm_types.h>
+#include <linux/kernel.h>
+#include <linux/mm.h>
+#include <linux/io.h>
+#include <linux/psp-sev.h>
+#include <uapi/linux/sev-guest.h>
+
+#include <asm/init.h>
+#include <asm/stacktrace.h>
+#include <asm/sev.h>
+#include <asm/sev-internal.h>
+#include <asm/insn-eval.h>
+#include <asm/fpu/xcr.h>
+#include <asm/processor.h>
+#include <asm/setup.h>
+#include <asm/traps.h>
+#include <asm/svm.h>
+#include <asm/smp.h>
+#include <asm/cpu.h>
+#include <asm/apic.h>
+#include <asm/cpuid.h>
+
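+/*
+ * Resolve a virtual address to its physical address by walking the live
+ * page tables referenced by CR3. This is needed because the address may
+ * be a user or ioremap mapping for which __pa() is not valid, which is
+ * what makes the translation "slow" but generally applicable.
+ */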
+static enum es_result vc_slow_virt_to_phys(struct ghcb *ghcb, struct es_em_ctxt *ctxt,
+ unsigned long vaddr, phys_addr_t *paddr)
+{
+ unsigned long va = (unsigned long)vaddr;
+ unsigned int level;
+ phys_addr_t pa;
+ pgd_t *pgd;
+ pte_t *pte;
+
+ pgd = __va(read_cr3_pa());
+ pgd = &pgd[pgd_index(va)];
+ pte = lookup_address_in_pgd(pgd, va, &level);
+ if (!pte) {
+ ctxt->fi.vector = X86_TRAP_PF;
+ ctxt->fi.cr2 = vaddr;
+ ctxt->fi.error_code = 0;
+
+ if (user_mode(ctxt->regs))
+ ctxt->fi.error_code |= X86_PF_USER;
+
+ return ES_EXCEPTION;
+ }
+
+ if (WARN_ON_ONCE(pte_val(*pte) & _PAGE_ENC))
+ /* Emulated MMIO to/from encrypted memory not supported */
+ return ES_UNSUPPORTED;
+
+ pa = (phys_addr_t)pte_pfn(*pte) << PAGE_SHIFT;
+ pa |= va & ~page_level_mask(level);
+
+ *paddr = pa;
+
+ return ES_OK;
+}
+
+static enum es_result vc_ioio_check(struct es_em_ctxt *ctxt, u16 port, size_t size)
+{
+ BUG_ON(size > 4);
+
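+ /*
+ * For user mode, emulate the CPU's I/O permission bitmap check: a set
+ * bit in the task's I/O bitmap means access to that port is denied.
+ */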
+ if (user_mode(ctxt->regs)) {
+ struct thread_struct *t = &current->thread;
+ struct io_bitmap *iobm = t->io_bitmap;
+ size_t idx;
+
+ if (!iobm)
+ goto fault;
+
+ for (idx = port; idx < port + size; ++idx) {
+ if (test_bit(idx, iobm->bitmap))
+ goto fault;
+ }
+ }
+
+ return ES_OK;
+
+fault:
+ ctxt->fi.vector = X86_TRAP_GP;
+ ctxt->fi.error_code = 0;
+
+ return ES_EXCEPTION;
+}
+
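+/*
+ * Forward an exception recorded during #VC instruction emulation to the
+ * regular exception handler for the saved vector.
+ */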
+void vc_forward_exception(struct es_em_ctxt *ctxt)
+{
+ long error_code = ctxt->fi.error_code;
+ int trapnr = ctxt->fi.vector;
+
+ ctxt->regs->orig_ax = ctxt->fi.error_code;
+
+ switch (trapnr) {
+ case X86_TRAP_GP:
+ exc_general_protection(ctxt->regs, error_code);
+ break;
+ case X86_TRAP_UD:
+ exc_invalid_op(ctxt->regs);
+ break;
+ case X86_TRAP_PF:
+ write_cr2(ctxt->fi.cr2);
+ exc_page_fault(ctxt->regs, error_code);
+ break;
+ case X86_TRAP_AC:
+ exc_alignment_check(ctxt->regs, error_code);
+ break;
+ default:
+ pr_emerg("Unsupported exception in #VC instruction emulation - can't continue\n");
+ BUG();
+ }
+}
+
+static int vc_fetch_insn_kernel(struct es_em_ctxt *ctxt,
+ unsigned char *buffer)
+{
+ return copy_from_kernel_nofault(buffer, (unsigned char *)ctxt->regs->ip, MAX_INSN_SIZE);
+}
+
+static enum es_result __vc_decode_user_insn(struct es_em_ctxt *ctxt)
+{
+ char buffer[MAX_INSN_SIZE];
+ int insn_bytes;
+
+ insn_bytes = insn_fetch_from_user_inatomic(ctxt->regs, buffer);
+ if (insn_bytes == 0) {
+ /* Nothing could be copied */
+ ctxt->fi.vector = X86_TRAP_PF;
+ ctxt->fi.error_code = X86_PF_INSTR | X86_PF_USER;
+ ctxt->fi.cr2 = ctxt->regs->ip;
+ return ES_EXCEPTION;
+ } else if (insn_bytes == -EINVAL) {
+ /* Effective RIP could not be calculated */
+ ctxt->fi.vector = X86_TRAP_GP;
+ ctxt->fi.error_code = 0;
+ ctxt->fi.cr2 = 0;
+ return ES_EXCEPTION;
+ }
+
+ if (!insn_decode_from_regs(&ctxt->insn, ctxt->regs, buffer, insn_bytes))
+ return ES_DECODE_FAILED;
+
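+ /*
+ * The immediate is the last field the decoder parses, so .got being
+ * set indicates the instruction was decoded in full.
+ */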
+ if (ctxt->insn.immediate.got)
+ return ES_OK;
+ else
+ return ES_DECODE_FAILED;
+}
+
+static enum es_result __vc_decode_kern_insn(struct es_em_ctxt *ctxt)
+{
+ char buffer[MAX_INSN_SIZE];
+ int res, ret;
+
+ res = vc_fetch_insn_kernel(ctxt, buffer);
+ if (res) {
+ ctxt->fi.vector = X86_TRAP_PF;
+ ctxt->fi.error_code = X86_PF_INSTR;
+ ctxt->fi.cr2 = ctxt->regs->ip;
+ return ES_EXCEPTION;
+ }
+
+ ret = insn_decode(&ctxt->insn, buffer, MAX_INSN_SIZE, INSN_MODE_64);
+ if (ret < 0)
+ return ES_DECODE_FAILED;
+ else
+ return ES_OK;
+}
+
+static enum es_result vc_decode_insn(struct es_em_ctxt *ctxt)
+{
+ if (user_mode(ctxt->regs))
+ return __vc_decode_user_insn(ctxt);
+ else
+ return __vc_decode_kern_insn(ctxt);
+}
+
+static enum es_result vc_write_mem(struct es_em_ctxt *ctxt,
+ char *dst, char *buf, size_t size)
+{
+ unsigned long error_code = X86_PF_PROT | X86_PF_WRITE;
+
+ /*
+ * This function uses __put_user() independent of whether kernel or user
+ * memory is accessed. This works fine because __put_user() does no
+ * sanity checks of the pointer being accessed. All that it does is
+ * to report when the access failed.
+ *
+ * Also, this function runs in atomic context, so __put_user() is not
+ * allowed to sleep. The page-fault handler detects that it is running
+ * in atomic context and will not try to take mmap_sem and handle the
+ * fault, so additional pagefault_enable()/disable() calls are not
+ * needed.
+ *
+ * The access can't be done via copy_to_user() here because
+ * vc_write_mem() must not use string instructions to access unsafe
+ * memory. The reason is that MOVS is emulated by the #VC handler by
+ * splitting the move up into a read and a write and taking a nested #VC
+ * exception on whatever of them is the MMIO access. Using string
+ * instructions here would cause infinite nesting.
+ */
+ switch (size) {
+ case 1: {
+ u8 d1;
+ u8 __user *target = (u8 __user *)dst;
+
+ memcpy(&d1, buf, 1);
+ if (__put_user(d1, target))
+ goto fault;
+ break;
+ }
+ case 2: {
+ u16 d2;
+ u16 __user *target = (u16 __user *)dst;
+
+ memcpy(&d2, buf, 2);
+ if (__put_user(d2, target))
+ goto fault;
+ break;
+ }
+ case 4: {
+ u32 d4;
+ u32 __user *target = (u32 __user *)dst;
+
+ memcpy(&d4, buf, 4);
+ if (__put_user(d4, target))
+ goto fault;
+ break;
+ }
+ case 8: {
+ u64 d8;
+ u64 __user *target = (u64 __user *)dst;
+
+ memcpy(&d8, buf, 8);
+ if (__put_user(d8, target))
+ goto fault;
+ break;
+ }
+ default:
+ WARN_ONCE(1, "%s: Invalid size: %zu\n", __func__, size);
+ return ES_UNSUPPORTED;
+ }
+
+ return ES_OK;
+
+fault:
+ if (user_mode(ctxt->regs))
+ error_code |= X86_PF_USER;
+
+ ctxt->fi.vector = X86_TRAP_PF;
+ ctxt->fi.error_code = error_code;
+ ctxt->fi.cr2 = (unsigned long)dst;
+
+ return ES_EXCEPTION;
+}
+
+static enum es_result vc_read_mem(struct es_em_ctxt *ctxt,
+ char *src, char *buf, size_t size)
+{
+ unsigned long error_code = X86_PF_PROT;
+
+ /*
+ * This function uses __get_user() independent of whether kernel or user
+ * memory is accessed. This works fine because __get_user() does no
+ * sanity checks of the pointer being accessed. All that it does is
+ * to report when the access failed.
+ *
+ * Also, this function runs in atomic context, so __get_user() is not
+ * allowed to sleep. The page-fault handler detects that it is running
+ * in atomic context and will not try to take mmap_sem and handle the
+ * fault, so additional pagefault_enable()/disable() calls are not
+ * needed.
+ *
+ * The access can't be done via copy_from_user() here because
+ * vc_read_mem() must not use string instructions to access unsafe
+ * memory. The reason is that MOVS is emulated by the #VC handler by
+ * splitting the move up into a read and a write and taking a nested #VC
+ * exception on whatever of them is the MMIO access. Using string
+ * instructions here would cause infinite nesting.
+ */
+ switch (size) {
+ case 1: {
+ u8 d1;
+ u8 __user *s = (u8 __user *)src;
+
+ if (__get_user(d1, s))
+ goto fault;
+ memcpy(buf, &d1, 1);
+ break;
+ }
+ case 2: {
+ u16 d2;
+ u16 __user *s = (u16 __user *)src;
+
+ if (__get_user(d2, s))
+ goto fault;
+ memcpy(buf, &d2, 2);
+ break;
+ }
+ case 4: {
+ u32 d4;
+ u32 __user *s = (u32 __user *)src;
+
+ if (__get_user(d4, s))
+ goto fault;
+ memcpy(buf, &d4, 4);
+ break;
+ }
+ case 8: {
+ u64 d8;
+ u64 __user *s = (u64 __user *)src;
+
+ if (__get_user(d8, s))
+ goto fault;
+ memcpy(buf, &d8, 8);
+ break;
+ }
+ default:
+ WARN_ONCE(1, "%s: Invalid size: %zu\n", __func__, size);
+ return ES_UNSUPPORTED;
+ }
+
+ return ES_OK;
+
+fault:
+ if (user_mode(ctxt->regs))
+ error_code |= X86_PF_USER;
+
+ ctxt->fi.vector = X86_TRAP_PF;
+ ctxt->fi.error_code = error_code;
+ ctxt->fi.cr2 = (unsigned long)src;
+
+ return ES_EXCEPTION;
+}
+
+#define sev_printk(fmt, ...) printk(fmt, ##__VA_ARGS__)
+
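+/* Compile the shared #VC exit-code handlers into this translation unit */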
+#include "vc-shared.c"
+
+/* Writes to the SVSM CAA MSR are ignored */
+static enum es_result __vc_handle_msr_caa(struct pt_regs *regs, bool write)
+{
+ if (write)
+ return ES_OK;
+
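+ /* Reads return the physical address of this CPU's SVSM calling area */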
+ regs->ax = lower_32_bits(this_cpu_read(svsm_caa_pa));
+ regs->dx = upper_32_bits(this_cpu_read(svsm_caa_pa));
+
+ return ES_OK;
+}
+
+/*
+ * TSC related accesses should not exit to the hypervisor when a guest is
+ * executing with Secure TSC enabled, so special handling is required for
+ * accesses of MSR_IA32_TSC and MSR_AMD64_GUEST_TSC_FREQ.
+ */
+static enum es_result __vc_handle_secure_tsc_msrs(struct pt_regs *regs, bool write)
+{
+ u64 tsc;
+
+ /*
+ * GUEST_TSC_FREQ should not be intercepted when Secure TSC is enabled.
+ * Terminate the SNP guest when the interception is enabled.
+ */
+ if (regs->cx == MSR_AMD64_GUEST_TSC_FREQ)
+ return ES_VMM_ERROR;
+
+ /*
+ * Writes: Writing to MSR_IA32_TSC can cause subsequent reads of the TSC
+ * to return undefined values, so ignore all writes.
+ *
+ * Reads: Reads of MSR_IA32_TSC should return the current TSC value, use
+ * the value returned by rdtsc_ordered().
+ */
+ if (write) {
+ WARN_ONCE(1, "TSC MSR writes are verboten!\n");
+ return ES_OK;
+ }
+
+ tsc = rdtsc_ordered();
+ regs->ax = lower_32_bits(tsc);
+ regs->dx = upper_32_bits(tsc);
+
+ return ES_OK;
+}
+
+static enum es_result vc_handle_msr(struct ghcb *ghcb, struct es_em_ctxt *ctxt)
+{
+ struct pt_regs *regs = ctxt->regs;
+ enum es_result ret;
+ bool write;
+
+ /* Is it a WRMSR? */
+ write = ctxt->insn.opcode.bytes[1] == 0x30;
+
+ switch (regs->cx) {
+ case MSR_SVSM_CAA:
+ return __vc_handle_msr_caa(regs, write);
+ case MSR_IA32_TSC:
+ case MSR_AMD64_GUEST_TSC_FREQ:
+ if (sev_status & MSR_AMD64_SNP_SECURE_TSC)
+ return __vc_handle_secure_tsc_msrs(regs, write);
+ break;
+ default:
+ break;
+ }
+
+ ghcb_set_rcx(ghcb, regs->cx);
+ if (write) {
+ ghcb_set_rax(ghcb, regs->ax);
+ ghcb_set_rdx(ghcb, regs->dx);
+ }
+
+ ret = sev_es_ghcb_hv_call(ghcb, ctxt, SVM_EXIT_MSR, write, 0);
+
+ if ((ret == ES_OK) && !write) {
+ regs->ax = ghcb->save.rax;
+ regs->dx = ghcb->save.rdx;
+ }
+
+ return ret;
+}
+
+static void __init vc_early_forward_exception(struct es_em_ctxt *ctxt)
+{
+ int trapnr = ctxt->fi.vector;
+
+ if (trapnr == X86_TRAP_PF)
+ native_write_cr2(ctxt->fi.cr2);
+
+ ctxt->regs->orig_ax = ctxt->fi.error_code;
+ do_early_exception(ctxt->regs, trapnr);
+}
+
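+/*
+ * Return a pointer into pt_regs for the register named by the ModRM r/m
+ * field of the decoded instruction, or NULL if it cannot be determined.
+ */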
+static long *vc_insn_get_rm(struct es_em_ctxt *ctxt)
+{
+ long *reg_array;
+ int offset;
+
+ reg_array = (long *)ctxt->regs;
+ offset = insn_get_modrm_rm_off(&ctxt->insn, ctxt->regs);
+
+ if (offset < 0)
+ return NULL;
+
+ offset /= sizeof(long);
+
+ return reg_array + offset;
+}
+
+static enum es_result vc_do_mmio(struct ghcb *ghcb, struct es_em_ctxt *ctxt,
+ unsigned int bytes, bool read)
+{
+ u64 exit_code, exit_info_1, exit_info_2;
+ unsigned long ghcb_pa = __pa(ghcb);
+ enum es_result res;
+ phys_addr_t paddr;
+ void __user *ref;
+
+ ref = insn_get_addr_ref(&ctxt->insn, ctxt->regs);
+ if (ref == (void __user *)-1L)
+ return ES_UNSUPPORTED;
+
+ exit_code = read ? SVM_VMGEXIT_MMIO_READ : SVM_VMGEXIT_MMIO_WRITE;
+
+ res = vc_slow_virt_to_phys(ghcb, ctxt, (unsigned long)ref, &paddr);
+ if (res != ES_OK) {
+ if (res == ES_EXCEPTION && !read)
+ ctxt->fi.error_code |= X86_PF_WRITE;
+
+ return res;
+ }
+
+ exit_info_1 = paddr;
+ /* Can never be greater than 8 */
+ exit_info_2 = bytes;
+
+ ghcb_set_sw_scratch(ghcb, ghcb_pa + offsetof(struct ghcb, shared_buffer));
+
+ return sev_es_ghcb_hv_call(ghcb, ctxt, exit_code, exit_info_1, exit_info_2);
+}
+
+/*
+ * The MOVS instruction has two memory operands, which raises the
+ * problem that it is not known whether the access to the source or the
+ * destination caused the #VC exception (and hence whether an MMIO read
+ * or write operation needs to be emulated).
+ *
+ * Instead of playing games with walking page-tables and trying to guess
+ * whether the source or destination is an MMIO range, split the move
+ * into two operations, a read and a write with only one memory operand.
+ * This will cause a nested #VC exception on the MMIO address which can
+ * then be handled.
+ *
+ * This implementation has the benefit that it also supports MOVS where
+ * source _and_ destination are MMIO regions.
+ *
+ * It will slow MOVS on MMIO down a lot, but in SEV-ES guests it is a
+ * rare operation. If it turns out to be a performance problem the split
+ * operations can be moved to memcpy_fromio() and memcpy_toio().
+ */
+static enum es_result vc_handle_mmio_movs(struct es_em_ctxt *ctxt,
+ unsigned int bytes)
+{
+ unsigned long ds_base, es_base;
+ unsigned char *src, *dst;
+ unsigned char buffer[8];
+ enum es_result ret;
+ bool rep;
+ int off;
+
+ ds_base = insn_get_seg_base(ctxt->regs, INAT_SEG_REG_DS);
+ es_base = insn_get_seg_base(ctxt->regs, INAT_SEG_REG_ES);
+
+ if (ds_base == -1L || es_base == -1L) {
+ ctxt->fi.vector = X86_TRAP_GP;
+ ctxt->fi.error_code = 0;
+ return ES_EXCEPTION;
+ }
+
+ src = ds_base + (unsigned char *)ctxt->regs->si;
+ dst = es_base + (unsigned char *)ctxt->regs->di;
+
+ ret = vc_read_mem(ctxt, src, buffer, bytes);
+ if (ret != ES_OK)
+ return ret;
+
+ ret = vc_write_mem(ctxt, dst, buffer, bytes);
+ if (ret != ES_OK)
+ return ret;
+
+ if (ctxt->regs->flags & X86_EFLAGS_DF)
+ off = -bytes;
+ else
+ off = bytes;
+
+ ctxt->regs->si += off;
+ ctxt->regs->di += off;
+
+ rep = insn_has_rep_prefix(&ctxt->insn);
+ if (rep)
+ ctxt->regs->cx -= 1;
+
+ if (!rep || ctxt->regs->cx == 0)
+ return ES_OK;
+ else
+ return ES_RETRY;
+}
+
+static enum es_result vc_handle_mmio(struct ghcb *ghcb, struct es_em_ctxt *ctxt)
+{
+ struct insn *insn = &ctxt->insn;
+ enum insn_mmio_type mmio;
+ unsigned int bytes = 0;
+ enum es_result ret;
+ u8 sign_byte;
+ long *reg_data;
+
+ mmio = insn_decode_mmio(insn, &bytes);
+ if (mmio == INSN_MMIO_DECODE_FAILED)
+ return ES_DECODE_FAILED;
+
+ if (mmio != INSN_MMIO_WRITE_IMM && mmio != INSN_MMIO_MOVS) {
+ reg_data = insn_get_modrm_reg_ptr(insn, ctxt->regs);
+ if (!reg_data)
+ return ES_DECODE_FAILED;
+ }
+
+ if (user_mode(ctxt->regs))
+ return ES_UNSUPPORTED;
+
+ switch (mmio) {
+ case INSN_MMIO_WRITE:
+ memcpy(ghcb->shared_buffer, reg_data, bytes);
+ ret = vc_do_mmio(ghcb, ctxt, bytes, false);
+ break;
+ case INSN_MMIO_WRITE_IMM:
+ memcpy(ghcb->shared_buffer, insn->immediate1.bytes, bytes);
+ ret = vc_do_mmio(ghcb, ctxt, bytes, false);
+ break;
+ case INSN_MMIO_READ:
+ ret = vc_do_mmio(ghcb, ctxt, bytes, true);
+ if (ret)
+ break;
+
+ /* Zero-extend for 32-bit operation */
+ if (bytes == 4)
+ *reg_data = 0;
+
+ memcpy(reg_data, ghcb->shared_buffer, bytes);
+ break;
+ case INSN_MMIO_READ_ZERO_EXTEND:
+ ret = vc_do_mmio(ghcb, ctxt, bytes, true);
+ if (ret)
+ break;
+
+ /* Zero extend based on operand size */
+ memset(reg_data, 0, insn->opnd_bytes);
+ memcpy(reg_data, ghcb->shared_buffer, bytes);
+ break;
+ case INSN_MMIO_READ_SIGN_EXTEND:
+ ret = vc_do_mmio(ghcb, ctxt, bytes, true);
+ if (ret)
+ break;
+
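+ /* Pick the fill byte based on the sign bit of the value just read */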
+ if (bytes == 1) {
+ u8 *val = (u8 *)ghcb->shared_buffer;
+
+ sign_byte = (*val & 0x80) ? 0xff : 0x00;
+ } else {
+ u16 *val = (u16 *)ghcb->shared_buffer;
+
+ sign_byte = (*val & 0x8000) ? 0xff : 0x00;
+ }
+
+ /* Sign extend based on operand size */
+ memset(reg_data, sign_byte, insn->opnd_bytes);
+ memcpy(reg_data, ghcb->shared_buffer, bytes);
+ break;
+ case INSN_MMIO_MOVS:
+ ret = vc_handle_mmio_movs(ctxt, bytes);
+ break;
+ default:
+ ret = ES_UNSUPPORTED;
+ break;
+ }
+
+ return ret;
+}
+
+static enum es_result vc_handle_dr7_write(struct ghcb *ghcb,
+ struct es_em_ctxt *ctxt)
+{
+ struct sev_es_runtime_data *data = this_cpu_read(runtime_data);
+ long val, *reg = vc_insn_get_rm(ctxt);
+ enum es_result ret;
+
+ if (sev_status & MSR_AMD64_SNP_DEBUG_SWAP)
+ return ES_VMM_ERROR;
+
+ if (!reg)
+ return ES_DECODE_FAILED;
+
+ val = *reg;
+
+ /* Upper 32 bits must be written as zeroes */
+ if (val >> 32) {
+ ctxt->fi.vector = X86_TRAP_GP;
+ ctxt->fi.error_code = 0;
+ return ES_EXCEPTION;
+ }
+
+ /* Clear out the other reserved bits and set bit 10, which always reads as 1 */
+ val = (val & 0xffff23ffL) | BIT(10);
+
+ /* Early non-zero writes to DR7 are not supported */
+ if (!data && (val & ~DR7_RESET_VALUE))
+ return ES_UNSUPPORTED;
+
+ /* Using a value of 0 for ExitInfo1 means RAX holds the value */
+ ghcb_set_rax(ghcb, val);
+ ret = sev_es_ghcb_hv_call(ghcb, ctxt, SVM_EXIT_WRITE_DR7, 0, 0);
+ if (ret != ES_OK)
+ return ret;
+
+ if (data)
+ data->dr7 = val;
+
+ return ES_OK;
+}
+
+static enum es_result vc_handle_dr7_read(struct ghcb *ghcb,
+ struct es_em_ctxt *ctxt)
+{
+ struct sev_es_runtime_data *data = this_cpu_read(runtime_data);
+ long *reg = vc_insn_get_rm(ctxt);
+
+ if (sev_status & MSR_AMD64_SNP_DEBUG_SWAP)
+ return ES_VMM_ERROR;
+
+ if (!reg)
+ return ES_DECODE_FAILED;
+
+ if (data)
+ *reg = data->dr7;
+ else
+ *reg = DR7_RESET_VALUE;
+
+ return ES_OK;
+}
+
+static enum es_result vc_handle_wbinvd(struct ghcb *ghcb,
+ struct es_em_ctxt *ctxt)
+{
+ return sev_es_ghcb_hv_call(ghcb, ctxt, SVM_EXIT_WBINVD, 0, 0);
+}
+
+static enum es_result vc_handle_rdpmc(struct ghcb *ghcb, struct es_em_ctxt *ctxt)
+{
+ enum es_result ret;
+
+ ghcb_set_rcx(ghcb, ctxt->regs->cx);
+
+ ret = sev_es_ghcb_hv_call(ghcb, ctxt, SVM_EXIT_RDPMC, 0, 0);
+ if (ret != ES_OK)
+ return ret;
+
+ if (!(ghcb_rax_is_valid(ghcb) && ghcb_rdx_is_valid(ghcb)))
+ return ES_VMM_ERROR;
+
+ ctxt->regs->ax = ghcb->save.rax;
+ ctxt->regs->dx = ghcb->save.rdx;
+
+ return ES_OK;
+}
+
+static enum es_result vc_handle_monitor(struct ghcb *ghcb,
+ struct es_em_ctxt *ctxt)
+{
+ /*
+ * Treat it as a NOP and do not leak a physical address to the
+ * hypervisor.
+ */
+ return ES_OK;
+}
+
+static enum es_result vc_handle_mwait(struct ghcb *ghcb,
+ struct es_em_ctxt *ctxt)
+{
+ /* Treat the same as MONITOR/MONITORX */
+ return ES_OK;
+}
+
+static enum es_result vc_handle_vmmcall(struct ghcb *ghcb,
+ struct es_em_ctxt *ctxt)
+{
+ enum es_result ret;
+
+ ghcb_set_rax(ghcb, ctxt->regs->ax);
+ ghcb_set_cpl(ghcb, user_mode(ctxt->regs) ? 3 : 0);
+
+ if (x86_platform.hyper.sev_es_hcall_prepare)
+ x86_platform.hyper.sev_es_hcall_prepare(ghcb, ctxt->regs);
+
+ ret = sev_es_ghcb_hv_call(ghcb, ctxt, SVM_EXIT_VMMCALL, 0, 0);
+ if (ret != ES_OK)
+ return ret;
+
+ if (!ghcb_rax_is_valid(ghcb))
+ return ES_VMM_ERROR;
+
+ ctxt->regs->ax = ghcb->save.rax;
+
+ /*
+ * Call sev_es_hcall_finish() after regs->ax is already set.
+ * This allows the hypervisor handler to overwrite it again if
+ * necessary.
+ */
+ if (x86_platform.hyper.sev_es_hcall_finish &&
+ !x86_platform.hyper.sev_es_hcall_finish(ghcb, ctxt->regs))
+ return ES_VMM_ERROR;
+
+ return ES_OK;
+}
+
+static enum es_result vc_handle_trap_ac(struct ghcb *ghcb,
+ struct es_em_ctxt *ctxt)
+{
+ /*
+ * Calling exc_alignment_check() directly does not work, because it
+ * enables IRQs and the GHCB is active. Forward the exception and call
+ * it later from vc_forward_exception().
+ */
+ ctxt->fi.vector = X86_TRAP_AC;
+ ctxt->fi.error_code = 0;
+ return ES_EXCEPTION;
+}
+
+static enum es_result vc_handle_exitcode(struct es_em_ctxt *ctxt,
+ struct ghcb *ghcb,
+ unsigned long exit_code)
+{
+ enum es_result result = vc_check_opcode_bytes(ctxt, exit_code);
+
+ if (result != ES_OK)
+ return result;
+
+ switch (exit_code) {
+ case SVM_EXIT_READ_DR7:
+ result = vc_handle_dr7_read(ghcb, ctxt);
+ break;
+ case SVM_EXIT_WRITE_DR7:
+ result = vc_handle_dr7_write(ghcb, ctxt);
+ break;
+ case SVM_EXIT_EXCP_BASE + X86_TRAP_AC:
+ result = vc_handle_trap_ac(ghcb, ctxt);
+ break;
+ case SVM_EXIT_RDTSC:
+ case SVM_EXIT_RDTSCP:
+ result = vc_handle_rdtsc(ghcb, ctxt, exit_code);
+ break;
+ case SVM_EXIT_RDPMC:
+ result = vc_handle_rdpmc(ghcb, ctxt);
+ break;
+ case SVM_EXIT_INVD:
+ pr_err_ratelimited("#VC exception for INVD??? Seriously???\n");
+ result = ES_UNSUPPORTED;
+ break;
+ case SVM_EXIT_CPUID:
+ result = vc_handle_cpuid(ghcb, ctxt);
+ break;
+ case SVM_EXIT_IOIO:
+ result = vc_handle_ioio(ghcb, ctxt);
+ break;
+ case SVM_EXIT_MSR:
+ result = vc_handle_msr(ghcb, ctxt);
+ break;
+ case SVM_EXIT_VMMCALL:
+ result = vc_handle_vmmcall(ghcb, ctxt);
+ break;
+ case SVM_EXIT_WBINVD:
+ result = vc_handle_wbinvd(ghcb, ctxt);
+ break;
+ case SVM_EXIT_MONITOR:
+ result = vc_handle_monitor(ghcb, ctxt);
+ break;
+ case SVM_EXIT_MWAIT:
+ result = vc_handle_mwait(ghcb, ctxt);
+ break;
+ case SVM_EXIT_NPF:
+ result = vc_handle_mmio(ghcb, ctxt);
+ break;
+ default:
+ /*
+ * Unexpected #VC exception
+ */
+ result = ES_UNSUPPORTED;
+ }
+
+ return result;
+}
+
+static __always_inline bool is_vc2_stack(unsigned long sp)
+{
+ return (sp >= __this_cpu_ist_bottom_va(VC2) && sp < __this_cpu_ist_top_va(VC2));
+}
+
+static __always_inline bool vc_from_invalid_context(struct pt_regs *regs)
+{
+ unsigned long sp, prev_sp;
+
+ sp = (unsigned long)regs;
+ prev_sp = regs->sp;
+
+ /*
+ * If the code was already executing on the VC2 stack when the #VC
+ * happened, let it proceed to the normal handling routine. This way the
+ * code executing on the VC2 stack can cause #VC exceptions to get handled.
+ */
+ return is_vc2_stack(sp) && !is_vc2_stack(prev_sp);
+}
+
+static bool vc_raw_handle_exception(struct pt_regs *regs, unsigned long error_code)
+{
+ struct ghcb_state state;
+ struct es_em_ctxt ctxt;
+ enum es_result result;
+ struct ghcb *ghcb;
+ bool ret = true;
+
+ ghcb = __sev_get_ghcb(&state);
+
+ vc_ghcb_invalidate(ghcb);
+ result = vc_init_em_ctxt(&ctxt, regs, error_code);
+
+ if (result == ES_OK)
+ result = vc_handle_exitcode(&ctxt, ghcb, error_code);
+
+ __sev_put_ghcb(&state);
+
+ /* Done - now check the result */
+ switch (result) {
+ case ES_OK:
+ vc_finish_insn(&ctxt);
+ break;
+ case ES_UNSUPPORTED:
+ pr_err_ratelimited("Unsupported exit-code 0x%02lx in #VC exception (IP: 0x%lx)\n",
+ error_code, regs->ip);
+ ret = false;
+ break;
+ case ES_VMM_ERROR:
+ pr_err_ratelimited("Failure in communication with VMM (exit-code 0x%02lx IP: 0x%lx)\n",
+ error_code, regs->ip);
+ ret = false;
+ break;
+ case ES_DECODE_FAILED:
+ pr_err_ratelimited("Failed to decode instruction (exit-code 0x%02lx IP: 0x%lx)\n",
+ error_code, regs->ip);
+ ret = false;
+ break;
+ case ES_EXCEPTION:
+ vc_forward_exception(&ctxt);
+ break;
+ case ES_RETRY:
+ /* Nothing to do */
+ break;
+ default:
+ pr_emerg("Unknown result in %s():%d\n", __func__, result);
+ /*
+ * Emulating the instruction which caused the #VC exception
+ * failed - can't continue so print debug information
+ */
+ BUG();
+ }
+
+ return ret;
+}
+
+static __always_inline bool vc_is_db(unsigned long error_code)
+{
+ return error_code == SVM_EXIT_EXCP_BASE + X86_TRAP_DB;
+}
+
+/*
+ * Runtime #VC exception handler when raised from kernel mode. Runs in NMI mode
+ * and will panic when an error happens.
+ */
+DEFINE_IDTENTRY_VC_KERNEL(exc_vmm_communication)
+{
+ irqentry_state_t irq_state;
+
+ /*
+ * With the current implementation it is always possible to switch to a
+ * safe stack because #VC exceptions only happen at known places, like
+ * intercepted instructions or accesses to MMIO areas/IO ports. They can
+ * also happen with code instrumentation when the hypervisor intercepts
+ * #DB, but the critical paths are forbidden to be instrumented, so #DB
+ * exceptions currently also only happen in safe places.
+ *
+ * But keep this here in case the noinstr annotations are violated due
+ * to a bug elsewhere.
+ */
+ if (unlikely(vc_from_invalid_context(regs))) {
+ instrumentation_begin();
+ panic("Can't handle #VC exception from unsupported context\n");
+ instrumentation_end();
+ }
+
+ /*
+ * Handle #DB before calling into !noinstr code to avoid recursive #DB.
+ */
+ if (vc_is_db(error_code)) {
+ exc_debug(regs);
+ return;
+ }
+
+ irq_state = irqentry_nmi_enter(regs);
+
+ instrumentation_begin();
+
+ if (!vc_raw_handle_exception(regs, error_code)) {
+ /* Show some debug info */
+ show_regs(regs);
+
+ /* Ask hypervisor to sev_es_terminate */
+ sev_es_terminate(SEV_TERM_SET_GEN, GHCB_SEV_ES_GEN_REQ);
+
+ /* If that fails and we get here - just panic */
+ panic("Returned from Terminate-Request to Hypervisor\n");
+ }
+
+ instrumentation_end();
+ irqentry_nmi_exit(regs, irq_state);
+}
+
+/*
+ * Runtime #VC exception handler when raised from user mode. Runs in IRQ mode
+ * and will kill the current task with SIGBUS when an error happens.
+ */
+DEFINE_IDTENTRY_VC_USER(exc_vmm_communication)
+{
+ /*
+ * Handle #DB before calling into !noinstr code to avoid recursive #DB.
+ */
+ if (vc_is_db(error_code)) {
+ noist_exc_debug(regs);
+ return;
+ }
+
+ irqentry_enter_from_user_mode(regs);
+ instrumentation_begin();
+
+ if (!vc_raw_handle_exception(regs, error_code)) {
+ /*
+ * Do not kill the machine if user-space triggered the
+ * exception. Send SIGBUS instead and let user-space deal with
+ * it.
+ */
+ force_sig_fault(SIGBUS, BUS_OBJERR, (void __user *)0);
+ }
+
+ instrumentation_end();
+ irqentry_exit_to_user_mode(regs);
+}
+
+bool __init handle_vc_boot_ghcb(struct pt_regs *regs)
+{
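+ /* The early #VC entry code stores the exception's error code in orig_ax */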
+ unsigned long exit_code = regs->orig_ax;
+ struct es_em_ctxt ctxt;
+ enum es_result result;
+
+ vc_ghcb_invalidate(boot_ghcb);
+
+ result = vc_init_em_ctxt(&ctxt, regs, exit_code);
+ if (result == ES_OK)
+ result = vc_handle_exitcode(&ctxt, boot_ghcb, exit_code);
+
+ /* Done - now check the result */
+ switch (result) {
+ case ES_OK:
+ vc_finish_insn(&ctxt);
+ break;
+ case ES_UNSUPPORTED:
+ early_printk("PANIC: Unsupported exit-code 0x%02lx in early #VC exception (IP: 0x%lx)\n",
+ exit_code, regs->ip);
+ goto fail;
+ case ES_VMM_ERROR:
+ early_printk("PANIC: Failure in communication with VMM (exit-code 0x%02lx IP: 0x%lx)\n",
+ exit_code, regs->ip);
+ goto fail;
+ case ES_DECODE_FAILED:
+ early_printk("PANIC: Failed to decode instruction (exit-code 0x%02lx IP: 0x%lx)\n",
+ exit_code, regs->ip);
+ goto fail;
+ case ES_EXCEPTION:
+ vc_early_forward_exception(&ctxt);
+ break;
+ case ES_RETRY:
+ /* Nothing to do */
+ break;
+ default:
+ BUG();
+ }
+
+ return true;
+
+fail:
+ show_regs(regs);
+
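+ /* sev_es_terminate() does not return, so this path needs no return value */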
+ sev_es_terminate(SEV_TERM_SET_GEN, GHCB_SEV_ES_GEN_REQ);
+}
+
diff --git a/arch/x86/coco/sev/vc-shared.c b/arch/x86/coco/sev/vc-shared.c
new file mode 100644
index 0000000..2c0ab0f
--- /dev/null
+++ b/arch/x86/coco/sev/vc-shared.c
@@ -0,0 +1,504 @@
+// SPDX-License-Identifier: GPL-2.0
+
+static enum es_result vc_check_opcode_bytes(struct es_em_ctxt *ctxt,
+ unsigned long exit_code)
+{
+ unsigned int opcode = (unsigned int)ctxt->insn.opcode.value;
+ u8 modrm = ctxt->insn.modrm.value;
+
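+ /*
+ * insn.opcode.value holds the opcode bytes in instruction order, so on
+ * little-endian x86 the two-byte opcode 0F A2 (CPUID) compares as
+ * 0xa20f - hence the byte-swapped look of the constants below.
+ */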
+ switch (exit_code) {
+
+ case SVM_EXIT_IOIO:
+ case SVM_EXIT_NPF:
+ /* handled separately */
+ return ES_OK;
+
+ case SVM_EXIT_CPUID:
+ if (opcode == 0xa20f)
+ return ES_OK;
+ break;
+
+ case SVM_EXIT_INVD:
+ if (opcode == 0x080f)
+ return ES_OK;
+ break;
+
+ case SVM_EXIT_MONITOR:
+ /* MONITOR and MONITORX instructions generate the same error code */
+ if (opcode == 0x010f && (modrm == 0xc8 || modrm == 0xfa))
+ return ES_OK;
+ break;
+
+ case SVM_EXIT_MWAIT:
+ /* MWAIT and MWAITX instructions generate the same error code */
+ if (opcode == 0x010f && (modrm == 0xc9 || modrm == 0xfb))
+ return ES_OK;
+ break;
+
+ case SVM_EXIT_MSR:
+ /* RDMSR */
+ if (opcode == 0x320f ||
+ /* WRMSR */
+ opcode == 0x300f)
+ return ES_OK;
+ break;
+
+ case SVM_EXIT_RDPMC:
+ if (opcode == 0x330f)
+ return ES_OK;
+ break;
+
+ case SVM_EXIT_RDTSC:
+ if (opcode == 0x310f)
+ return ES_OK;
+ break;
+
+ case SVM_EXIT_RDTSCP:
+ if (opcode == 0x010f && modrm == 0xf9)
+ return ES_OK;
+ break;
+
+ case SVM_EXIT_READ_DR7:
+ if (opcode == 0x210f &&
+ X86_MODRM_REG(ctxt->insn.modrm.value) == 7)
+ return ES_OK;
+ break;
+
+ case SVM_EXIT_VMMCALL:
+ if (opcode == 0x010f && modrm == 0xd9)
+ return ES_OK;
+
+ break;
+
+ case SVM_EXIT_WRITE_DR7:
+ if (opcode == 0x230f &&
+ X86_MODRM_REG(ctxt->insn.modrm.value) == 7)
+ return ES_OK;
+ break;
+
+ case SVM_EXIT_WBINVD:
+ if (opcode == 0x90f)
+ return ES_OK;
+ break;
+
+ default:
+ break;
+ }
+
+ sev_printk(KERN_ERR "Wrong/unhandled opcode bytes: 0x%x, exit_code: 0x%lx, rIP: 0x%lx\n",
+ opcode, exit_code, ctxt->regs->ip);
+
+ return ES_UNSUPPORTED;
+}
+
+static bool vc_decoding_needed(unsigned long exit_code)
+{
+ /* Exceptions don't require the instruction to be decoded */
+ return !(exit_code >= SVM_EXIT_EXCP_BASE &&
+ exit_code <= SVM_EXIT_LAST_EXCP);
+}
+
+static enum es_result vc_init_em_ctxt(struct es_em_ctxt *ctxt,
+ struct pt_regs *regs,
+ unsigned long exit_code)
+{
+ enum es_result ret = ES_OK;
+
+ memset(ctxt, 0, sizeof(*ctxt));
+ ctxt->regs = regs;
+
+ if (vc_decoding_needed(exit_code))
+ ret = vc_decode_insn(ctxt);
+
+ return ret;
+}
+
+static void vc_finish_insn(struct es_em_ctxt *ctxt)
+{
+ ctxt->regs->ip += ctxt->insn.length;
+}
+
+static enum es_result vc_insn_string_check(struct es_em_ctxt *ctxt,
+ unsigned long address,
+ bool write)
+{
+ if (user_mode(ctxt->regs) && fault_in_kernel_space(address)) {
+ ctxt->fi.vector = X86_TRAP_PF;
+ ctxt->fi.error_code = X86_PF_USER;
+ ctxt->fi.cr2 = address;
+ if (write)
+ ctxt->fi.error_code |= X86_PF_WRITE;
+
+ return ES_EXCEPTION;
+ }
+
+ return ES_OK;
+}
+
+static enum es_result vc_insn_string_read(struct es_em_ctxt *ctxt,
+ void *src, char *buf,
+ unsigned int data_size,
+ unsigned int count,
+ bool backwards)
+{
+ int i, b = backwards ? -1 : 1;
+ unsigned long address = (unsigned long)src;
+ enum es_result ret;
+
+ ret = vc_insn_string_check(ctxt, address, false);
+ if (ret != ES_OK)
+ return ret;
+
+ for (i = 0; i < count; i++) {
+ void *s = src + (i * data_size * b);
+ char *d = buf + (i * data_size);
+
+ ret = vc_read_mem(ctxt, s, d, data_size);
+ if (ret != ES_OK)
+ break;
+ }
+
+ return ret;
+}
+
+static enum es_result vc_insn_string_write(struct es_em_ctxt *ctxt,
+ void *dst, char *buf,
+ unsigned int data_size,
+ unsigned int count,
+ bool backwards)
+{
+ int i, s = backwards ? -1 : 1;
+ unsigned long address = (unsigned long)dst;
+ enum es_result ret;
+
+ ret = vc_insn_string_check(ctxt, address, true);
+ if (ret != ES_OK)
+ return ret;
+
+ for (i = 0; i < count; i++) {
+ void *d = dst + (i * data_size * s);
+ char *b = buf + (i * data_size);
+
+ ret = vc_write_mem(ctxt, d, b, data_size);
+ if (ret != ES_OK)
+ break;
+ }
+
+ return ret;
+}
+
+#define IOIO_TYPE_STR BIT(2)
+#define IOIO_TYPE_IN 1
+#define IOIO_TYPE_INS (IOIO_TYPE_IN | IOIO_TYPE_STR)
+#define IOIO_TYPE_OUT 0
+#define IOIO_TYPE_OUTS (IOIO_TYPE_OUT | IOIO_TYPE_STR)
+
+#define IOIO_REP BIT(3)
+
+#define IOIO_ADDR_64 BIT(9)
+#define IOIO_ADDR_32 BIT(8)
+#define IOIO_ADDR_16 BIT(7)
+
+#define IOIO_DATA_32 BIT(6)
+#define IOIO_DATA_16 BIT(5)
+#define IOIO_DATA_8 BIT(4)
+
+#define IOIO_SEG_ES (0 << 10)
+#define IOIO_SEG_DS (3 << 10)
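+
+/*
+ * The flags above assemble the IOIO SW_EXITINFO1 value for the GHCB
+ * protocol: access type/string/rep in the low bits, operand size in
+ * bits 6:4, address size in bits 9:7, segment in bits 12:10 and the
+ * port number in bits 31:16.
+ */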
+
+static enum es_result vc_ioio_exitinfo(struct es_em_ctxt *ctxt, u64 *exitinfo)
+{
+ struct insn *insn = &ctxt->insn;
+ size_t size;
+ u64 port;
+
+ *exitinfo = 0;
+
+ switch (insn->opcode.bytes[0]) {
+ /* INS opcodes */
+ case 0x6c:
+ case 0x6d:
+ *exitinfo |= IOIO_TYPE_INS;
+ *exitinfo |= IOIO_SEG_ES;
+ port = ctxt->regs->dx & 0xffff;
+ break;
+
+ /* OUTS opcodes */
+ case 0x6e:
+ case 0x6f:
+ *exitinfo |= IOIO_TYPE_OUTS;
+ *exitinfo |= IOIO_SEG_DS;
+ port = ctxt->regs->dx & 0xffff;
+ break;
+
+ /* IN immediate opcodes */
+ case 0xe4:
+ case 0xe5:
+ *exitinfo |= IOIO_TYPE_IN;
+ port = (u8)insn->immediate.value & 0xffff;
+ break;
+
+ /* OUT immediate opcodes */
+ case 0xe6:
+ case 0xe7:
+ *exitinfo |= IOIO_TYPE_OUT;
+ port = (u8)insn->immediate.value & 0xffff;
+ break;
+
+ /* IN register opcodes */
+ case 0xec:
+ case 0xed:
+ *exitinfo |= IOIO_TYPE_IN;
+ port = ctxt->regs->dx & 0xffff;
+ break;
+
+ /* OUT register opcodes */
+ case 0xee:
+ case 0xef:
+ *exitinfo |= IOIO_TYPE_OUT;
+ port = ctxt->regs->dx & 0xffff;
+ break;
+
+ default:
+ return ES_DECODE_FAILED;
+ }
+
+ *exitinfo |= port << 16;
+
+ switch (insn->opcode.bytes[0]) {
+ case 0x6c:
+ case 0x6e:
+ case 0xe4:
+ case 0xe6:
+ case 0xec:
+ case 0xee:
+ /* Single byte opcodes */
+ *exitinfo |= IOIO_DATA_8;
+ size = 1;
+ break;
+ default:
+ /* Length determined by instruction parsing */
+ *exitinfo |= (insn->opnd_bytes == 2) ? IOIO_DATA_16
+ : IOIO_DATA_32;
+ size = (insn->opnd_bytes == 2) ? 2 : 4;
+ }
+
+ switch (insn->addr_bytes) {
+ case 2:
+ *exitinfo |= IOIO_ADDR_16;
+ break;
+ case 4:
+ *exitinfo |= IOIO_ADDR_32;
+ break;
+ case 8:
+ *exitinfo |= IOIO_ADDR_64;
+ break;
+ }
+
+ if (insn_has_rep_prefix(insn))
+ *exitinfo |= IOIO_REP;
+
+ return vc_ioio_check(ctxt, (u16)port, size);
+}
+
+static enum es_result vc_handle_ioio(struct ghcb *ghcb, struct es_em_ctxt *ctxt)
+{
+ struct pt_regs *regs = ctxt->regs;
+ u64 exit_info_1, exit_info_2;
+ enum es_result ret;
+
+ ret = vc_ioio_exitinfo(ctxt, &exit_info_1);
+ if (ret != ES_OK)
+ return ret;
+
+ if (exit_info_1 & IOIO_TYPE_STR) {
+
+ /* (REP) INS/OUTS */
+
+ bool df = ((regs->flags & X86_EFLAGS_DF) == X86_EFLAGS_DF);
+ unsigned int io_bytes, exit_bytes;
+ unsigned int ghcb_count, op_count;
+ unsigned long es_base;
+ u64 sw_scratch;
+
+ /*
+ * For the string variants with rep prefix the amount of in/out
+ * operations per #VC exception is limited so that the kernel
+ * has a chance to take interrupts and re-schedule while the
+ * instruction is emulated.
+ */
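+ /* Bits 6:4 of exit_info_1 directly encode the operand size: 1, 2 or 4 bytes */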
+ io_bytes = (exit_info_1 >> 4) & 0x7;
+ ghcb_count = sizeof(ghcb->shared_buffer) / io_bytes;
+
+ op_count = (exit_info_1 & IOIO_REP) ? regs->cx : 1;
+ exit_info_2 = min(op_count, ghcb_count);
+ exit_bytes = exit_info_2 * io_bytes;
+
+ es_base = insn_get_seg_base(ctxt->regs, INAT_SEG_REG_ES);
+
+ /* Read bytes of OUTS into the shared buffer */
+ if (!(exit_info_1 & IOIO_TYPE_IN)) {
+ ret = vc_insn_string_read(ctxt,
+ (void *)(es_base + regs->si),
+ ghcb->shared_buffer, io_bytes,
+ exit_info_2, df);
+ if (ret)
+ return ret;
+ }
+
+ /*
+ * Issue a VMGEXIT to the HV to consume the bytes from the
+ * shared buffer or to have it write them into the shared buffer
+ * depending on the instruction: OUTS or INS.
+ */
+ sw_scratch = __pa(ghcb) + offsetof(struct ghcb, shared_buffer);
+ ghcb_set_sw_scratch(ghcb, sw_scratch);
+ ret = sev_es_ghcb_hv_call(ghcb, ctxt, SVM_EXIT_IOIO,
+ exit_info_1, exit_info_2);
+ if (ret != ES_OK)
+ return ret;
+
+ /* Read bytes from shared buffer into the guest's destination. */
+ if (exit_info_1 & IOIO_TYPE_IN) {
+ ret = vc_insn_string_write(ctxt,
+ (void *)(es_base + regs->di),
+ ghcb->shared_buffer, io_bytes,
+ exit_info_2, df);
+ if (ret)
+ return ret;
+
+ if (df)
+ regs->di -= exit_bytes;
+ else
+ regs->di += exit_bytes;
+ } else {
+ if (df)
+ regs->si -= exit_bytes;
+ else
+ regs->si += exit_bytes;
+ }
+
+ if (exit_info_1 & IOIO_REP)
+ regs->cx -= exit_info_2;
+
+ ret = regs->cx ? ES_RETRY : ES_OK;
+
+ } else {
+
+ /* IN/OUT into/from rAX */
+
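+ /* Turn the IOIO_DATA_* flag in bits 6:4 into an operand width in bits (8/16/32) */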
+ int bits = (exit_info_1 & 0x70) >> 1;
+ u64 rax = 0;
+
+ if (!(exit_info_1 & IOIO_TYPE_IN))
+ rax = lower_bits(regs->ax, bits);
+
+ ghcb_set_rax(ghcb, rax);
+
+ ret = sev_es_ghcb_hv_call(ghcb, ctxt, SVM_EXIT_IOIO, exit_info_1, 0);
+ if (ret != ES_OK)
+ return ret;
+
+ if (exit_info_1 & IOIO_TYPE_IN) {
+ if (!ghcb_rax_is_valid(ghcb))
+ return ES_VMM_ERROR;
+ regs->ax = lower_bits(ghcb->save.rax, bits);
+ }
+ }
+
+ return ret;
+}
+
+static int vc_handle_cpuid_snp(struct ghcb *ghcb, struct es_em_ctxt *ctxt)
+{
+ struct pt_regs *regs = ctxt->regs;
+ struct cpuid_leaf leaf;
+ int ret;
+
+ leaf.fn = regs->ax;
+ leaf.subfn = regs->cx;
+ ret = snp_cpuid(ghcb, ctxt, &leaf);
+ if (!ret) {
+ regs->ax = leaf.eax;
+ regs->bx = leaf.ebx;
+ regs->cx = leaf.ecx;
+ regs->dx = leaf.edx;
+ }
+
+ return ret;
+}
+
+static enum es_result vc_handle_cpuid(struct ghcb *ghcb,
+ struct es_em_ctxt *ctxt)
+{
+ struct pt_regs *regs = ctxt->regs;
+ u32 cr4 = native_read_cr4();
+ enum es_result ret;
+ int snp_cpuid_ret;
+
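+ /*
+ * Try the SNP CPUID table first; -EOPNOTSUPP means the feature is not
+ * enabled, in which case the request is forwarded to the hypervisor
+ * via the GHCB below. Any other error is fatal.
+ */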
+ snp_cpuid_ret = vc_handle_cpuid_snp(ghcb, ctxt);
+ if (!snp_cpuid_ret)
+ return ES_OK;
+ if (snp_cpuid_ret != -EOPNOTSUPP)
+ return ES_VMM_ERROR;
+
+ ghcb_set_rax(ghcb, regs->ax);
+ ghcb_set_rcx(ghcb, regs->cx);
+
+ if (cr4 & X86_CR4_OSXSAVE)
+ /* Safe to read xcr0 */
+ ghcb_set_xcr0(ghcb, xgetbv(XCR_XFEATURE_ENABLED_MASK));
+ else
+ /* xgetbv will cause #GP - use reset value for xcr0 */
+ ghcb_set_xcr0(ghcb, 1);
+
+ ret = sev_es_ghcb_hv_call(ghcb, ctxt, SVM_EXIT_CPUID, 0, 0);
+ if (ret != ES_OK)
+ return ret;
+
+ if (!(ghcb_rax_is_valid(ghcb) &&
+ ghcb_rbx_is_valid(ghcb) &&
+ ghcb_rcx_is_valid(ghcb) &&
+ ghcb_rdx_is_valid(ghcb)))
+ return ES_VMM_ERROR;
+
+ regs->ax = ghcb->save.rax;
+ regs->bx = ghcb->save.rbx;
+ regs->cx = ghcb->save.rcx;
+ regs->dx = ghcb->save.rdx;
+
+ return ES_OK;
+}
+
+static enum es_result vc_handle_rdtsc(struct ghcb *ghcb,
+ struct es_em_ctxt *ctxt,
+ unsigned long exit_code)
+{
+ bool rdtscp = (exit_code == SVM_EXIT_RDTSCP);
+ enum es_result ret;
+
+ /*
+ * The hypervisor should not be intercepting RDTSC/RDTSCP when Secure
+ * TSC is enabled. A #VC exception will be generated if the RDTSC/RDTSCP
+ * instructions are being intercepted. If this should occur and Secure
+ * TSC is enabled, guest execution should be terminated as the guest
+ * cannot rely on the TSC value provided by the hypervisor.
+ */
+ if (sev_status & MSR_AMD64_SNP_SECURE_TSC)
+ return ES_VMM_ERROR;
+
+ ret = sev_es_ghcb_hv_call(ghcb, ctxt, exit_code, 0, 0);
+ if (ret != ES_OK)
+ return ret;
+
+ if (!(ghcb_rax_is_valid(ghcb) && ghcb_rdx_is_valid(ghcb) &&
+ (!rdtscp || ghcb_rcx_is_valid(ghcb))))
+ return ES_VMM_ERROR;
+
+ ctxt->regs->ax = ghcb->save.rax;
+ ctxt->regs->dx = ghcb->save.rdx;
+ if (rdtscp)
+ ctxt->regs->cx = ghcb->save.rcx;
+
+ return ES_OK;
+}
diff --git a/arch/x86/include/asm/sev-internal.h b/arch/x86/include/asm/sev-internal.h
index a78f972..b723208 100644
--- a/arch/x86/include/asm/sev-internal.h
+++ b/arch/x86/include/asm/sev-internal.h
@@ -3,7 +3,6 @@
#define DR7_RESET_VALUE 0x400
extern struct ghcb boot_ghcb_page;
-extern struct ghcb *boot_ghcb;
extern u64 sev_hv_features;
extern u64 sev_secrets_pa;
@@ -59,8 +58,6 @@ DECLARE_PER_CPU(struct sev_es_save_area *, sev_vmsa);
void early_set_pages_state(unsigned long vaddr, unsigned long paddr,
unsigned long npages, enum psc_op op);
-void __noreturn sev_es_terminate(unsigned int set, unsigned int reason);
-
DECLARE_PER_CPU(struct svsm_ca *, svsm_caa);
DECLARE_PER_CPU(u64, svsm_caa_pa);
@@ -100,11 +97,6 @@ static __always_inline void sev_es_wr_ghcb_msr(u64 val)
native_wrmsr(MSR_AMD64_SEV_ES_GHCB, low, high);
}
-enum es_result sev_es_ghcb_hv_call(struct ghcb *ghcb,
- struct es_em_ctxt *ctxt,
- u64 exit_code, u64 exit_info_1,
- u64 exit_info_2);
-
void snp_register_ghcb_early(unsigned long paddr);
bool sev_es_negotiate_protocol(void);
bool sev_es_check_cpu_features(void);
diff --git a/arch/x86/include/asm/sev.h b/arch/x86/include/asm/sev.h
index a8661df..6158893 100644
--- a/arch/x86/include/asm/sev.h
+++ b/arch/x86/include/asm/sev.h
@@ -521,6 +521,28 @@ static __always_inline void vc_ghcb_invalidate(struct ghcb *ghcb)
__builtin_memset(ghcb->save.valid_bitmap, 0, sizeof(ghcb->save.valid_bitmap));
}
+void vc_forward_exception(struct es_em_ctxt *ctxt);
+
+/* I/O parameters for CPUID-related helpers */
+struct cpuid_leaf {
+ u32 fn;
+ u32 subfn;
+ u32 eax;
+ u32 ebx;
+ u32 ecx;
+ u32 edx;
+};
+
+int snp_cpuid(struct ghcb *ghcb, struct es_em_ctxt *ctxt, struct cpuid_leaf *leaf);
+
+void __noreturn sev_es_terminate(unsigned int set, unsigned int reason);
+enum es_result sev_es_ghcb_hv_call(struct ghcb *ghcb,
+ struct es_em_ctxt *ctxt,
+ u64 exit_code, u64 exit_info_1,
+ u64 exit_info_2);
+
+extern struct ghcb *boot_ghcb;
+
#else /* !CONFIG_AMD_MEM_ENCRYPT */
#define snp_vmpl 0
^ permalink raw reply related [flat|nested] 59+ messages in thread
* Re: [RFT PATCH v2 05/23] x86/sev: Move instruction decoder into separate source file
2025-05-04 9:52 ` [RFT PATCH v2 05/23] x86/sev: Move instruction decoder into separate source file Ard Biesheuvel
2025-05-04 14:20 ` [tip: x86/boot] " tip-bot2 for Ard Biesheuvel
@ 2025-05-05 14:48 ` Tom Lendacky
2025-05-05 14:50 ` Ard Biesheuvel
2025-05-07 9:58 ` Borislav Petkov
2 siblings, 1 reply; 59+ messages in thread
From: Tom Lendacky @ 2025-05-05 14:48 UTC (permalink / raw)
To: Ard Biesheuvel, linux-kernel
Cc: linux-efi, x86, Ard Biesheuvel, Borislav Petkov, Ingo Molnar,
Dionna Amalie Glaze, Kevin Loughlin
On 5/4/25 04:52, Ard Biesheuvel wrote:
> From: Ard Biesheuvel <ardb@kernel.org>
>
> As a first step towards disentangling the SEV #VC handling code -which
> is shared between the decompressor and the core kernel- from the SEV
> startup code, move the decompressor's copy of the instruction decoder
> into a separate source file.
>
> Code movement only - no functional change intended.
>
> Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
> ---
> arch/x86/boot/compressed/Makefile | 6 +--
> arch/x86/boot/compressed/misc.h | 7 +++
> arch/x86/boot/compressed/sev-handle-vc.c | 51 ++++++++++++++++++++
> arch/x86/boot/compressed/sev.c | 39 +--------------
> 4 files changed, 62 insertions(+), 41 deletions(-)
>
> diff --git a/arch/x86/boot/compressed/Makefile b/arch/x86/boot/compressed/Makefile
> index 0fcad7b7e007..f4f7b22d8113 100644
> --- a/arch/x86/boot/compressed/Makefile
> +++ b/arch/x86/boot/compressed/Makefile
> @@ -44,10 +44,10 @@ KBUILD_CFLAGS += -D__DISABLE_EXPORTS
> KBUILD_CFLAGS += $(call cc-option,-Wa$(comma)-mrelax-relocations=no)
> KBUILD_CFLAGS += -include $(srctree)/include/linux/hidden.h
>
> -# sev.c indirectly includes inat-table.h which is generated during
> +# sev-decode-insn.c indirectly includes inat-table.c which is generated during
did you mean sev-handle-vc.c ?
Thanks,
Tom
> # compilation and stored in $(objtree). Add the directory to the includes so
> # that the compiler finds it even with out-of-tree builds (make O=/some/path).
> -CFLAGS_sev.o += -I$(objtree)/arch/x86/lib/
> +CFLAGS_sev-handle-vc.o += -I$(objtree)/arch/x86/lib/
>
> KBUILD_AFLAGS := $(KBUILD_CFLAGS) -D__ASSEMBLY__
>
> @@ -96,7 +96,7 @@ ifdef CONFIG_X86_64
> vmlinux-objs-y += $(obj)/idt_64.o $(obj)/idt_handlers_64.o
> vmlinux-objs-$(CONFIG_AMD_MEM_ENCRYPT) += $(obj)/mem_encrypt.o
> vmlinux-objs-y += $(obj)/pgtable_64.o
> - vmlinux-objs-$(CONFIG_AMD_MEM_ENCRYPT) += $(obj)/sev.o
> + vmlinux-objs-$(CONFIG_AMD_MEM_ENCRYPT) += $(obj)/sev.o $(obj)/sev-handle-vc.o
> endif
>
> vmlinux-objs-$(CONFIG_ACPI) += $(obj)/acpi.o
^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [RFT PATCH v2 05/23] x86/sev: Move instruction decoder into separate source file
2025-05-05 14:48 ` [RFT PATCH v2 05/23] " Tom Lendacky
@ 2025-05-05 14:50 ` Ard Biesheuvel
0 siblings, 0 replies; 59+ messages in thread
From: Ard Biesheuvel @ 2025-05-05 14:50 UTC (permalink / raw)
To: Tom Lendacky
Cc: Ard Biesheuvel, linux-kernel, linux-efi, x86, Borislav Petkov,
Ingo Molnar, Dionna Amalie Glaze, Kevin Loughlin
On Mon, 5 May 2025 at 16:48, Tom Lendacky <thomas.lendacky@amd.com> wrote:
>
> On 5/4/25 04:52, Ard Biesheuvel wrote:
> > From: Ard Biesheuvel <ardb@kernel.org>
> >
> > As a first step towards disentangling the SEV #VC handling code -which
> > is shared between the decompressor and the core kernel- from the SEV
> > startup code, move the decompressor's copy of the instruction decoder
> > into a separate source file.
> >
> > Code movement only - no functional change intended.
> >
> > Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
> > ---
> > arch/x86/boot/compressed/Makefile | 6 +--
> > arch/x86/boot/compressed/misc.h | 7 +++
> > arch/x86/boot/compressed/sev-handle-vc.c | 51 ++++++++++++++++++++
> > arch/x86/boot/compressed/sev.c | 39 +--------------
> > 4 files changed, 62 insertions(+), 41 deletions(-)
> >
> > diff --git a/arch/x86/boot/compressed/Makefile b/arch/x86/boot/compressed/Makefile
> > index 0fcad7b7e007..f4f7b22d8113 100644
> > --- a/arch/x86/boot/compressed/Makefile
> > +++ b/arch/x86/boot/compressed/Makefile
> > @@ -44,10 +44,10 @@ KBUILD_CFLAGS += -D__DISABLE_EXPORTS
> > KBUILD_CFLAGS += $(call cc-option,-Wa$(comma)-mrelax-relocations=no)
> > KBUILD_CFLAGS += -include $(srctree)/include/linux/hidden.h
> >
> > -# sev.c indirectly includes inat-table.h which is generated during
> > +# sev-decode-insn.c indirectly includes inat-table.c which is generated during
>
> did you mean sev-handle-vc.c ?
>
Ah yes - I renamed that at some point and forgot to update the comment.
^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [RFT PATCH v2 06/23] x86/sev: Disentangle #VC handling code from startup code
2025-05-04 9:52 ` [RFT PATCH v2 06/23] x86/sev: Disentangle #VC handling code from startup code Ard Biesheuvel
2025-05-05 5:31 ` [tip: x86/boot] " tip-bot2 for Ard Biesheuvel
@ 2025-05-05 14:58 ` Tom Lendacky
2025-05-05 16:54 ` [tip: x86/boot] " tip-bot2 for Ard Biesheuvel
2 siblings, 0 replies; 59+ messages in thread
From: Tom Lendacky @ 2025-05-05 14:58 UTC (permalink / raw)
To: Ard Biesheuvel, linux-kernel
Cc: linux-efi, x86, Ard Biesheuvel, Borislav Petkov, Ingo Molnar,
Dionna Amalie Glaze, Kevin Loughlin
On 5/4/25 04:52, Ard Biesheuvel wrote:
> From: Ard Biesheuvel <ardb@kernel.org>
>
> Most of the SEV support code used to reside in a single C source file
> that was included in two places: the core kernel, and the decompressor.
>
> The code that is actually shared with the decompressor was moved into a
> separate, shared source file under startup/, on the basis that the
> decompressor also executes from the early 1:1 mapping of memory.
>
> However, while the elaborate #VC handling and instruction decoding that
> it involves is also performed by the decompressor, it does not actually
> occur in the core kernel at early boot, and therefore, does not need to
> be part of the confined early startup code.
>
> So split off the #VC handling code and move it back into arch/x86/coco
> where it came from, into another C source file that is included from
> both the decompressor and the core kernel.
>
> Code movement only - no functional change intended.
>
> Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
> ---
> arch/x86/boot/compressed/misc.h | 1 +
> arch/x86/boot/compressed/sev-handle-vc.c | 83 ++
> arch/x86/boot/compressed/sev.c | 96 +-
> arch/x86/boot/compressed/sev.h | 19 +
> arch/x86/boot/startup/sev-shared.c | 519 +---------
> arch/x86/boot/startup/sev-startup.c | 1022 -------------------
> arch/x86/coco/sev/Makefile | 2 +-
> arch/x86/coco/sev/vc-handle.c | 1061 ++++++++++++++++++++
Probably vc-handler.c would be a better name. And actually the same
format for the compressed file, sev-vc-handler.c, would be nice.
Thanks,
Tom
> arch/x86/coco/sev/vc-shared.c | 504 ++++++++++
> arch/x86/include/asm/sev-internal.h | 8 -
> arch/x86/include/asm/sev.h | 22 +
> 11 files changed, 1694 insertions(+), 1643 deletions(-)
>
^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [tip: x86/boot] x86/boot: Provide __pti_set_user_pgtbl() to startup code
2025-05-04 14:20 ` [tip: x86/boot] " tip-bot2 for Ard Biesheuvel
@ 2025-05-05 16:03 ` Borislav Petkov
2025-05-05 16:19 ` Ard Biesheuvel
0 siblings, 1 reply; 59+ messages in thread
From: Borislav Petkov @ 2025-05-05 16:03 UTC (permalink / raw)
To: linux-kernel
Cc: linux-tip-commits, Ard Biesheuvel, Ingo Molnar, Arnd Bergmann,
David Woodhouse, Dionna Amalie Glaze, H. Peter Anvin, Kees Cook,
Kevin Loughlin, Len Brown, Linus Torvalds, Rafael J. Wysocki,
Tom Lendacky, linux-efi, x86
On Sun, May 04, 2025 at 02:20:04PM -0000, tip-bot2 for Ard Biesheuvel wrote:
> The following commit has been merged into the x86/boot branch of tip:
>
> Commit-ID: 5297886f0cc45db5f4a804caf359e6e7874ee864
> Gitweb: https://git.kernel.org/tip/5297886f0cc45db5f4a804caf359e6e7874ee864
> Author: Ard Biesheuvel <ardb@kernel.org>
> AuthorDate: Sun, 04 May 2025 11:52:45 +02:00
> Committer: Ingo Molnar <mingo@kernel.org>
> CommitterDate: Sun, 04 May 2025 15:59:43 +02:00
>
> x86/boot: Provide __pti_set_user_pgtbl() to startup code
>
> The SME encryption startup code populates page tables using the ordinary
> set_pXX() helpers, and in a PTI build, these will call out to
> __pti_set_user_pgtbl() to manipulate the shadow copy of the page tables
> for user space.
>
> This is unneeded for the startup code, which only manipulates the
> swapper page tables, and so this call could be avoided in this
> particular case. So instead of exposing the ordinary
> __pti_set_user_pgtbl() to the startup code after it gets confined into
> its own symbol space, provide an alternative which just returns pgd,
> which is always correct in the startup context.
>
> Annotate it as __weak for now; this will be dropped in a subsequent
> patch.
>
> Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
> Signed-off-by: Ingo Molnar <mingo@kernel.org>
> Cc: Arnd Bergmann <arnd@arndb.de>
> Cc: David Woodhouse <dwmw@amazon.co.uk>
> Cc: Dionna Amalie Glaze <dionnaglaze@google.com>
> Cc: H. Peter Anvin <hpa@zytor.com>
> Cc: Kees Cook <keescook@chromium.org>
> Cc: Kevin Loughlin <kevinloughlin@google.com>
> Cc: Len Brown <len.brown@intel.com>
> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> Cc: Tom Lendacky <thomas.lendacky@amd.com>
> Cc: linux-efi@vger.kernel.org
> Link: https://lore.kernel.org/r/20250504095230.2932860-40-ardb+git@google.com
> ---
> arch/x86/boot/startup/sme.c | 9 +++++++++
> 1 file changed, 9 insertions(+)
>
> diff --git a/arch/x86/boot/startup/sme.c b/arch/x86/boot/startup/sme.c
> index 5738b31..753cd20 100644
> --- a/arch/x86/boot/startup/sme.c
> +++ b/arch/x86/boot/startup/sme.c
> @@ -564,3 +564,12 @@ void __head sme_enable(struct boot_params *bp)
> cc_vendor = CC_VENDOR_AMD;
> cc_set_mask(me_mask);
> }
> +
> +#ifdef CONFIG_MITIGATION_PAGE_TABLE_ISOLATION
> +/* Local version for startup code, which never operates on user page tables */
> +__weak
> +pgd_t __pti_set_user_pgtbl(pgd_t *pgdp, pgd_t pgd)
> +{
> + return pgd;
> +}
> +#endif
[ 1.227968] smp: Brought up 1 node, 32 CPUs
[ 1.231576] smpboot: Total of 32 processors activated (191999.61 BogoMIPS)
[ 1.247644] ------------[ cut here ]------------
[ 1.248697] WARNING: CPU: 17 PID: 104 at kernel/jump_label.c:276 __static_key_slow_dec_cpuslocked+0x2a/0x80
[ 1.251592] Modules linked in:
[ 1.252370] CPU: 17 UID: 0 PID: 104 Comm: kworker/17:0 Not tainted 6.15.0-rc4+ #2 PREEMPT(voluntary)
[ 1.253173] node 0 deferred pages initialised in 12ms
[ 1.254539] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS unknown 02/02/2022
[ 1.257490] Workqueue: events unaccepted_cleanup_work
[ 1.258698] RIP: 0010:__static_key_slow_dec_cpuslocked+0x2a/0x80
[ 1.259574] Code: 0f 1f 44 00 00 53 48 89 fb e8 82 66 e5 ff 8b 03 85 c0 78 16 74 54 83 f8 01 74 11 8d 50 ff f0 0f b1 13 75 ec 5b e9 66 bf 8c 00 <0f> 0b 48 c7 c7 00 44 54 82 e8 a8 5f 8c 00 8b 03 83 f8 ff 74 33 85
[ 1.266446] Memory: 7574396K/8381588K available (13737K kernel code, 2487K rwdata, 6056K rodata, 3916K init, 3592K bss, 791248K reserved, 0K cma-reserved)
[ 1.263574] RSP: 0018:ffffc9000048fe40 EFLAGS: 00010286
[ 1.267573] RAX: 00000000ffffffff RBX: ffffffff82f10d00 RCX: 0000000000000000
[ 1.269388] RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffffffff82f10d00
[ 1.275578] RBP: ffff88827b7d4d98 R08: 8080808080808080 R09: ffff888100b59100
[ 1.277441] R10: ffff888100050cc0 R11: fefefefefefefeff R12: ffff88827326e300
[ 1.279574] R13: ffff888100bc1940 R14: ffff8881000e2405 R15: ffff8881000e2400
[ 1.281460] FS: 0000000000000000(0000) GS:ffff8882f0400000(0000) knlGS:0000000000000000
[ 1.281460] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1.281460] CR2: 0000000000000000 CR3: 0008000002c22000 CR4: 00000000003506f0
[ 1.287576] Call Trace:
[ 1.288381] <TASK>
[ 1.289015] static_key_slow_dec+0x1f/0x40
[ 1.289980] process_one_work+0x171/0x330
[ 1.290988] worker_thread+0x247/0x390
[ 1.291576] ? __pfx_worker_thread+0x10/0x10
[ 1.292773] kthread+0x107/0x240
[ 1.293536] ? __pfx_kthread+0x10/0x10
[ 1.295573] ret_from_fork+0x30/0x50
[ 1.295575] ? __pfx_kthread+0x10/0x10
[ 1.296580] ret_from_fork_asm+0x1a/0x30
[ 1.297507] </TASK>
[ 1.298085] ---[ end trace 0000000000000000 ]---
mingo simply doesn't want to listen and stop queueing untested patches.
So lemme whack this one.
My SNP guest had CONFIG_MITIGATION_PAGE_TABLE_ISOLATION=y leading to the
above.
Thx.
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [tip: x86/boot] x86/boot: Provide __pti_set_user_pgtbl() to startup code
2025-05-05 16:03 ` Borislav Petkov
@ 2025-05-05 16:19 ` Ard Biesheuvel
2025-05-05 16:47 ` Borislav Petkov
0 siblings, 1 reply; 59+ messages in thread
From: Ard Biesheuvel @ 2025-05-05 16:19 UTC (permalink / raw)
To: Borislav Petkov
Cc: linux-kernel, linux-tip-commits, Ingo Molnar, Arnd Bergmann,
David Woodhouse, Dionna Amalie Glaze, H. Peter Anvin, Kees Cook,
Kevin Loughlin, Len Brown, Linus Torvalds, Rafael J. Wysocki,
Tom Lendacky, linux-efi, x86
On Mon, 5 May 2025 at 18:04, Borislav Petkov <bp@alien8.de> wrote:
>
> On Sun, May 04, 2025 at 02:20:04PM -0000, tip-bot2 for Ard Biesheuvel wrote:
> > The following commit has been merged into the x86/boot branch of tip:
> >
> > Commit-ID: 5297886f0cc45db5f4a804caf359e6e7874ee864
> > Gitweb: https://git.kernel.org/tip/5297886f0cc45db5f4a804caf359e6e7874ee864
> > Author: Ard Biesheuvel <ardb@kernel.org>
> > AuthorDate: Sun, 04 May 2025 11:52:45 +02:00
> > Committer: Ingo Molnar <mingo@kernel.org>
> > CommitterDate: Sun, 04 May 2025 15:59:43 +02:00
> >
> > x86/boot: Provide __pti_set_user_pgtbl() to startup code
> >
> > The SME encryption startup code populates page tables using the ordinary
> > set_pXX() helpers, and in a PTI build, these will call out to
> > __pti_set_user_pgtbl() to manipulate the shadow copy of the page tables
> > for user space.
> >
> > This is unneeded for the startup code, which only manipulates the
> > swapper page tables, and so this call could be avoided in this
> > particular case. So instead of exposing the ordinary
> > __pti_set_user_pgtbl() to the startup code after it gets confined into
> > its own symbol space, provide an alternative which just returns pgd,
> > which is always correct in the startup context.
> >
> > Annotate it as __weak for now; this will be dropped in a subsequent
> > patch.
> >
> > Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
> > Signed-off-by: Ingo Molnar <mingo@kernel.org>
> > Cc: Arnd Bergmann <arnd@arndb.de>
> > Cc: David Woodhouse <dwmw@amazon.co.uk>
> > Cc: Dionna Amalie Glaze <dionnaglaze@google.com>
> > Cc: H. Peter Anvin <hpa@zytor.com>
> > Cc: Kees Cook <keescook@chromium.org>
> > Cc: Kevin Loughlin <kevinloughlin@google.com>
> > Cc: Len Brown <len.brown@intel.com>
> > Cc: Linus Torvalds <torvalds@linux-foundation.org>
> > Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> > Cc: Tom Lendacky <thomas.lendacky@amd.com>
> > Cc: linux-efi@vger.kernel.org
> > Link: https://lore.kernel.org/r/20250504095230.2932860-40-ardb+git@google.com
> > ---
> > arch/x86/boot/startup/sme.c | 9 +++++++++
> > 1 file changed, 9 insertions(+)
> >
> > diff --git a/arch/x86/boot/startup/sme.c b/arch/x86/boot/startup/sme.c
> > index 5738b31..753cd20 100644
> > --- a/arch/x86/boot/startup/sme.c
> > +++ b/arch/x86/boot/startup/sme.c
> > @@ -564,3 +564,12 @@ void __head sme_enable(struct boot_params *bp)
> > cc_vendor = CC_VENDOR_AMD;
> > cc_set_mask(me_mask);
> > }
> > +
> > +#ifdef CONFIG_MITIGATION_PAGE_TABLE_ISOLATION
> > +/* Local version for startup code, which never operates on user page tables */
> > +__weak
> > +pgd_t __pti_set_user_pgtbl(pgd_t *pgdp, pgd_t pgd)
> > +{
> > + return pgd;
> > +}
> > +#endif
>
> [ 1.227968] smp: Brought up 1 node, 32 CPUs
> [ 1.231576] smpboot: Total of 32 processors activated (191999.61 BogoMIPS)
> [ 1.247644] ------------[ cut here ]------------
> [ 1.248697] WARNING: CPU: 17 PID: 104 at kernel/jump_label.c:276 __static_key_slow_dec_cpuslocked+0x2a/0x80
> [ 1.251592] Modules linked in:
> [ 1.252370] CPU: 17 UID: 0 PID: 104 Comm: kworker/17:0 Not tainted 6.15.0-rc4+ #2 PREEMPT(voluntary)
> [ 1.253173] node 0 deferred pages initialised in 12ms
> [ 1.254539] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS unknown 02/02/2022
> [ 1.257490] Workqueue: events unaccepted_cleanup_work
> [ 1.258698] RIP: 0010:__static_key_slow_dec_cpuslocked+0x2a/0x80
> [ 1.259574] Code: 0f 1f 44 00 00 53 48 89 fb e8 82 66 e5 ff 8b 03 85 c0 78 16 74 54 83 f8 01 74 11 8d 50 ff f0 0f b1 13 75 ec 5b e9 66 bf 8c 00 <0f> 0b 48 c7 c7 00 44 54 82 e8 a8 5f 8c 00 8b 03 83 f8 ff 74 33 85
> [ 1.266446] Memory: 7574396K/8381588K available (13737K kernel code, 2487K rwdata, 6056K rodata, 3916K init, 3592K bss, 791248K reserved, 0K cma-reserved)
> [ 1.263574] RSP: 0018:ffffc9000048fe40 EFLAGS: 00010286
> [ 1.267573] RAX: 00000000ffffffff RBX: ffffffff82f10d00 RCX: 0000000000000000
> [ 1.269388] RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffffffff82f10d00
> [ 1.275578] RBP: ffff88827b7d4d98 R08: 8080808080808080 R09: ffff888100b59100
> [ 1.277441] R10: ffff888100050cc0 R11: fefefefefefefeff R12: ffff88827326e300
> [ 1.279574] R13: ffff888100bc1940 R14: ffff8881000e2405 R15: ffff8881000e2400
> [ 1.281460] FS: 0000000000000000(0000) GS:ffff8882f0400000(0000) knlGS:0000000000000000
> [ 1.281460] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 1.281460] CR2: 0000000000000000 CR3: 0008000002c22000 CR4: 00000000003506f0
> [ 1.287576] Call Trace:
> [ 1.288381] <TASK>
> [ 1.289015] static_key_slow_dec+0x1f/0x40
> [ 1.289980] process_one_work+0x171/0x330
> [ 1.290988] worker_thread+0x247/0x390
> [ 1.291576] ? __pfx_worker_thread+0x10/0x10
> [ 1.292773] kthread+0x107/0x240
> [ 1.293536] ? __pfx_kthread+0x10/0x10
> [ 1.295573] ret_from_fork+0x30/0x50
> [ 1.295575] ? __pfx_kthread+0x10/0x10
> [ 1.296580] ret_from_fork_asm+0x1a/0x30
> [ 1.297507] </TASK>
> [ 1.298085] ---[ end trace 0000000000000000 ]---
>
> mingo simply doesn't want to listen and stop queueing untested patches.
>
> So lemme whack this one.
>
> My SNP guest had CONFIG_MITIGATION_PAGE_TABLE_ISOLATION=y leading to the
> above.
>
This patch by itself does nothing. The symbol is __weak for the time
being, and given that this code does not have its __pi_ prefixes yet,
this function will be superseded by the existing one. (If you remove
the __weak you will get a linker error)
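
To illustrate - a minimal sketch, with the names as they appear in the
patch: the linker prefers a strong definition over a weak one, so

  /* arch/x86/boot/startup/sme.c - weak startup-local stub */
  __weak pgd_t __pti_set_user_pgtbl(pgd_t *pgdp, pgd_t pgd)
  {
          return pgd;
  }

loses to the existing strong __pti_set_user_pgtbl() in
arch/x86/mm/pti.c, and every caller keeps resolving to the latter until
the __pi_ symbol prefixing from the later patches takes effect.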
Are you sure this patch is causing the issue?
^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [tip: x86/boot] x86/boot: Provide __pti_set_user_pgtbl() to startup code
2025-05-05 16:19 ` Ard Biesheuvel
@ 2025-05-05 16:47 ` Borislav Petkov
2025-05-06 7:18 ` Borislav Petkov
0 siblings, 1 reply; 59+ messages in thread
From: Borislav Petkov @ 2025-05-05 16:47 UTC (permalink / raw)
To: Ard Biesheuvel
Cc: linux-kernel, linux-tip-commits, Ingo Molnar, Arnd Bergmann,
David Woodhouse, Dionna Amalie Glaze, H. Peter Anvin, Kees Cook,
Kevin Loughlin, Len Brown, Linus Torvalds, Rafael J. Wysocki,
Tom Lendacky, linux-efi, x86
On Mon, May 05, 2025 at 06:19:49PM +0200, Ard Biesheuvel wrote:
> This patch by itself does nothing. The symbol is __weak for the time
> being, and given that this code does not have its __pi_ prefixes yet,
> this function will be superseded by the existing one. (If you remove
> the __weak you will get a linker error)
>
> Are you sure this patch is causing the issue?
My by-foot bisection of x86/boot is below.
I don't know if this patch uncovers something or maybe changes placement...
Tom also pointed to
4067196a5227 ("mm/page_alloc: fix deadlock on cpu_hotplug_lock in __accept_page()")
as potentially related.
And now I went and tried to reproduce the warning and it fired on top of:
419cbaf6a56a ("x86/boot: Add a bunch of PIC aliases")
which is the previous patch.
Ufff, this is one of those which don't reproduce reliably because it was fine
on that commit in the previous run. Nasty.
Lemme go dig more.
---
ed4d95d033e3 (HEAD, refs/remotes/tip/x86/boot) x86/sev: Disentangle #VC handling code from startup code
<--- NOT OK
[ 1.227968] smp: Brought up 1 node, 32 CPUs
[ 1.231576] smpboot: Total of 32 processors activated (191999.61 BogoMIPS)
[ 1.247644] ------------[ cut here ]------------
[ 1.248697] WARNING: CPU: 17 PID: 104 at kernel/jump_label.c:276 __static_key_slow_dec_cpuslocked+0x2a/0x80
[ 1.251592] Modules linked in:
[ 1.252370] CPU: 17 UID: 0 PID: 104 Comm: kworker/17:0 Not tainted 6.15.0-rc4+ #2 PREEMPT(voluntary)
[ 1.253173] node 0 deferred pages initialised in 12ms
[ 1.254539] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS unknown 02/02/2022
[ 1.257490] Workqueue: events unaccepted_cleanup_work
[ 1.258698] RIP: 0010:__static_key_slow_dec_cpuslocked+0x2a/0x80
[ 1.259574] Code: 0f 1f 44 00 00 53 48 89 fb e8 82 66 e5 ff 8b 03 85 c0 78 16 74 54 83 f8 01 74 11 8d 50 ff f0 0f b1 13 75 ec 5b e9 66 bf 8c 00 <0f> 0b 48 c7 c7 00 44 54 82 e8 a8 5f 8c 00 8b 03 83 f8 ff 74 33 85
[ 1.266446] Memory: 7574396K/8381588K available (13737K kernel code, 2487K rwdata, 6056K rodata, 3916K init, 3592K bss, 791248K reserved, 0K cma-reserved)
[ 1.263574] RSP: 0018:ffffc9000048fe40 EFLAGS: 00010286
[ 1.267573] RAX: 00000000ffffffff RBX: ffffffff82f10d00 RCX: 0000000000000000
[ 1.269388] RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffffffff82f10d00
[ 1.275578] RBP: ffff88827b7d4d98 R08: 8080808080808080 R09: ffff888100b59100
[ 1.277441] R10: ffff888100050cc0 R11: fefefefefefefeff R12: ffff88827326e300
[ 1.279574] R13: ffff888100bc1940 R14: ffff8881000e2405 R15: ffff8881000e2400
[ 1.281460] FS: 0000000000000000(0000) GS:ffff8882f0400000(0000) knlGS:0000000000000000
[ 1.281460] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1.281460] CR2: 0000000000000000 CR3: 0008000002c22000 CR4: 00000000003506f0
[ 1.287576] Call Trace:
[ 1.288381] <TASK>
[ 1.289015] static_key_slow_dec+0x1f/0x40
[ 1.289980] process_one_work+0x171/0x330
[ 1.290988] worker_thread+0x247/0x390
[ 1.291576] ? __pfx_worker_thread+0x10/0x10
[ 1.292773] kthread+0x107/0x240
[ 1.293536] ? __pfx_kthread+0x10/0x10
[ 1.295573] ret_from_fork+0x30/0x50
[ 1.295575] ? __pfx_kthread+0x10/0x10
[ 1.296580] ret_from_fork_asm+0x1a/0x30
[ 1.297507] </TASK>
[ 1.298085] ---[ end trace 0000000000000000 ]---
5297886f0cc4 x86/boot: Provide __pti_set_user_pgtbl() to startup code
<--- OK
419cbaf6a56a x86/boot: Add a bunch of PIC aliases
f932adcc8650 x86/linkage: Add SYM_PIC_ALIAS() macro helper to emit symbol aliases
<--- OK
ae862964cbc5 x86/sev: Move instruction decoder into separate source file
fae89bbfdd9d x86/sev: Make sev_snp_enabled() a static function
b3464a36f7f2 x86/boot: Disregard __supported_pte_mask in __startup_64()
bd4a58beaaf1 x86/boot: Move early_setup_gdt() back into head64.c
<--- OK
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
^ permalink raw reply [flat|nested] 59+ messages in thread
* [tip: x86/boot] x86/sev: Disentangle #VC handling code from startup code
2025-05-04 9:52 ` [RFT PATCH v2 06/23] x86/sev: Disentangle #VC handling code from startup code Ard Biesheuvel
2025-05-05 5:31 ` [tip: x86/boot] " tip-bot2 for Ard Biesheuvel
2025-05-05 14:58 ` [RFT PATCH v2 06/23] " Tom Lendacky
@ 2025-05-05 16:54 ` tip-bot2 for Ard Biesheuvel
2 siblings, 0 replies; 59+ messages in thread
From: tip-bot2 for Ard Biesheuvel @ 2025-05-05 16:54 UTC (permalink / raw)
To: linux-tip-commits
Cc: Ard Biesheuvel, Ingo Molnar, Borislav Petkov (AMD), Arnd Bergmann,
David Woodhouse, Dionna Amalie Glaze, H. Peter Anvin, Kees Cook,
Kevin Loughlin, Len Brown, Linus Torvalds, Rafael J. Wysocki,
Tom Lendacky, linux-efi, x86, linux-kernel
The following commit has been merged into the x86/boot branch of tip:
Commit-ID: c7d49ba5f7523bffd32f789824f4b03e9d13b56e
Gitweb: https://git.kernel.org/tip/c7d49ba5f7523bffd32f789824f4b03e9d13b56e
Author: Ard Biesheuvel <ardb@kernel.org>
AuthorDate: Sun, 04 May 2025 11:52:36 +02:00
Committer: Borislav Petkov (AMD) <bp@alien8.de>
CommitterDate: Mon, 05 May 2025 18:49:14 +02:00
x86/sev: Disentangle #VC handling code from startup code
Most of the SEV support code used to reside in a single C source file
that was included in two places: the core kernel, and the decompressor.
The code that is actually shared with the decompressor was moved into a
separate, shared source file under startup/, on the basis that the
decompressor also executes from the early 1:1 mapping of memory.
However, while the elaborate #VC handling and instruction decoding that
it involves is also performed by the decompressor, it does not actually
occur in the core kernel at early boot, and therefore, does not need to
be part of the confined early startup code.
So split off the #VC handling code and move it back into arch/x86/coco
where it came from, into another C source file that is included from
both the decompressor and the core kernel.
Code movement only - no functional change intended.
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: David Woodhouse <dwmw@amazon.co.uk>
Cc: Dionna Amalie Glaze <dionnaglaze@google.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Kevin Loughlin <kevinloughlin@google.com>
Cc: Len Brown <len.brown@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: linux-efi@vger.kernel.org
Link: https://lore.kernel.org/r/20250504095230.2932860-31-ardb+git@google.com
---
arch/x86/boot/compressed/misc.h | 1 +-
arch/x86/boot/compressed/sev-handle-vc.c | 83 ++-
arch/x86/boot/compressed/sev.c | 96 +--
arch/x86/boot/compressed/sev.h | 19 +-
arch/x86/boot/startup/sev-shared.c | 519 +----------
arch/x86/boot/startup/sev-startup.c | 1022 +--------------------
arch/x86/coco/sev/Makefile | 2 +-
arch/x86/coco/sev/vc-handle.c | 1061 +++++++++++++++++++++-
arch/x86/coco/sev/vc-shared.c | 504 ++++++++++-
arch/x86/include/asm/sev-internal.h | 8 +-
arch/x86/include/asm/sev.h | 22 +-
11 files changed, 1694 insertions(+), 1643 deletions(-)
create mode 100644 arch/x86/coco/sev/vc-handle.c
create mode 100644 arch/x86/coco/sev/vc-shared.c
diff --git a/arch/x86/boot/compressed/misc.h b/arch/x86/boot/compressed/misc.h
index 04de48c..db10486 100644
--- a/arch/x86/boot/compressed/misc.h
+++ b/arch/x86/boot/compressed/misc.h
@@ -150,6 +150,7 @@ void sev_prep_identity_maps(unsigned long top_level_pgt);
enum es_result vc_decode_insn(struct es_em_ctxt *ctxt);
bool insn_has_rep_prefix(struct insn *insn);
void sev_insn_decode_init(void);
+bool early_setup_ghcb(void);
#else
static inline void sev_enable(struct boot_params *bp)
{
diff --git a/arch/x86/boot/compressed/sev-handle-vc.c b/arch/x86/boot/compressed/sev-handle-vc.c
index b1aa073..89dd02d 100644
--- a/arch/x86/boot/compressed/sev-handle-vc.c
+++ b/arch/x86/boot/compressed/sev-handle-vc.c
@@ -1,6 +1,7 @@
// SPDX-License-Identifier: GPL-2.0
#include "misc.h"
+#include "sev.h"
#include <linux/kernel.h>
#include <linux/string.h>
@@ -8,6 +9,9 @@
#include <asm/pgtable_types.h>
#include <asm/ptrace.h>
#include <asm/sev.h>
+#include <asm/trapnr.h>
+#include <asm/trap_pf.h>
+#include <asm/fpu/xcr.h>
#define __BOOT_COMPRESSED
@@ -49,3 +53,82 @@ enum es_result vc_decode_insn(struct es_em_ctxt *ctxt)
}
extern void sev_insn_decode_init(void) __alias(inat_init_tables);
+
+/*
+ * Only a dummy for insn_get_seg_base() - Early boot-code is 64bit only and
+ * doesn't use segments.
+ */
+static unsigned long insn_get_seg_base(struct pt_regs *regs, int seg_reg_idx)
+{
+ return 0UL;
+}
+
+static enum es_result vc_write_mem(struct es_em_ctxt *ctxt,
+ void *dst, char *buf, size_t size)
+{
+ memcpy(dst, buf, size);
+
+ return ES_OK;
+}
+
+static enum es_result vc_read_mem(struct es_em_ctxt *ctxt,
+ void *src, char *buf, size_t size)
+{
+ memcpy(buf, src, size);
+
+ return ES_OK;
+}
+
+static enum es_result vc_ioio_check(struct es_em_ctxt *ctxt, u16 port, size_t size)
+{
+ return ES_OK;
+}
+
+static bool fault_in_kernel_space(unsigned long address)
+{
+ return false;
+}
+
+#define sev_printk(fmt, ...)
+
+#include "../../coco/sev/vc-shared.c"
+
+void do_boot_stage2_vc(struct pt_regs *regs, unsigned long exit_code)
+{
+ struct es_em_ctxt ctxt;
+ enum es_result result;
+
+ if (!boot_ghcb && !early_setup_ghcb())
+ sev_es_terminate(SEV_TERM_SET_GEN, GHCB_SEV_ES_GEN_REQ);
+
+ vc_ghcb_invalidate(boot_ghcb);
+ result = vc_init_em_ctxt(&ctxt, regs, exit_code);
+ if (result != ES_OK)
+ goto finish;
+
+ result = vc_check_opcode_bytes(&ctxt, exit_code);
+ if (result != ES_OK)
+ goto finish;
+
+ switch (exit_code) {
+ case SVM_EXIT_RDTSC:
+ case SVM_EXIT_RDTSCP:
+ result = vc_handle_rdtsc(boot_ghcb, &ctxt, exit_code);
+ break;
+ case SVM_EXIT_IOIO:
+ result = vc_handle_ioio(boot_ghcb, &ctxt);
+ break;
+ case SVM_EXIT_CPUID:
+ result = vc_handle_cpuid(boot_ghcb, &ctxt);
+ break;
+ default:
+ result = ES_UNSUPPORTED;
+ break;
+ }
+
+finish:
+ if (result == ES_OK)
+ vc_finish_insn(&ctxt);
+ else if (result != ES_RETRY)
+ sev_es_terminate(SEV_TERM_SET_GEN, GHCB_SEV_ES_GEN_REQ);
+}
diff --git a/arch/x86/boot/compressed/sev.c b/arch/x86/boot/compressed/sev.c
index 6fe7e85..612b443 100644
--- a/arch/x86/boot/compressed/sev.c
+++ b/arch/x86/boot/compressed/sev.c
@@ -24,63 +24,11 @@
#include <asm/cpuid.h>
#include "error.h"
-#include "../msr.h"
+#include "sev.h"
static struct ghcb boot_ghcb_page __aligned(PAGE_SIZE);
struct ghcb *boot_ghcb;
-/*
- * Only a dummy for insn_get_seg_base() - Early boot-code is 64bit only and
- * doesn't use segments.
- */
-static unsigned long insn_get_seg_base(struct pt_regs *regs, int seg_reg_idx)
-{
- return 0UL;
-}
-
-static inline u64 sev_es_rd_ghcb_msr(void)
-{
- struct msr m;
-
- boot_rdmsr(MSR_AMD64_SEV_ES_GHCB, &m);
-
- return m.q;
-}
-
-static inline void sev_es_wr_ghcb_msr(u64 val)
-{
- struct msr m;
-
- m.q = val;
- boot_wrmsr(MSR_AMD64_SEV_ES_GHCB, &m);
-}
-
-static enum es_result vc_write_mem(struct es_em_ctxt *ctxt,
- void *dst, char *buf, size_t size)
-{
- memcpy(dst, buf, size);
-
- return ES_OK;
-}
-
-static enum es_result vc_read_mem(struct es_em_ctxt *ctxt,
- void *src, char *buf, size_t size)
-{
- memcpy(buf, src, size);
-
- return ES_OK;
-}
-
-static enum es_result vc_ioio_check(struct es_em_ctxt *ctxt, u16 port, size_t size)
-{
- return ES_OK;
-}
-
-static bool fault_in_kernel_space(unsigned long address)
-{
- return false;
-}
-
#undef __init
#define __init
@@ -182,7 +130,7 @@ void snp_set_page_shared(unsigned long paddr)
__page_state_change(paddr, SNP_PAGE_STATE_SHARED);
}
-static bool early_setup_ghcb(void)
+bool early_setup_ghcb(void)
{
if (set_page_decrypted((unsigned long)&boot_ghcb_page))
return false;
@@ -266,46 +214,6 @@ bool sev_es_check_ghcb_fault(unsigned long address)
return ((address & PAGE_MASK) == (unsigned long)&boot_ghcb_page);
}
-void do_boot_stage2_vc(struct pt_regs *regs, unsigned long exit_code)
-{
- struct es_em_ctxt ctxt;
- enum es_result result;
-
- if (!boot_ghcb && !early_setup_ghcb())
- sev_es_terminate(SEV_TERM_SET_GEN, GHCB_SEV_ES_GEN_REQ);
-
- vc_ghcb_invalidate(boot_ghcb);
- result = vc_init_em_ctxt(&ctxt, regs, exit_code);
- if (result != ES_OK)
- goto finish;
-
- result = vc_check_opcode_bytes(&ctxt, exit_code);
- if (result != ES_OK)
- goto finish;
-
- switch (exit_code) {
- case SVM_EXIT_RDTSC:
- case SVM_EXIT_RDTSCP:
- result = vc_handle_rdtsc(boot_ghcb, &ctxt, exit_code);
- break;
- case SVM_EXIT_IOIO:
- result = vc_handle_ioio(boot_ghcb, &ctxt);
- break;
- case SVM_EXIT_CPUID:
- result = vc_handle_cpuid(boot_ghcb, &ctxt);
- break;
- default:
- result = ES_UNSUPPORTED;
- break;
- }
-
-finish:
- if (result == ES_OK)
- vc_finish_insn(&ctxt);
- else if (result != ES_RETRY)
- sev_es_terminate(SEV_TERM_SET_GEN, GHCB_SEV_ES_GEN_REQ);
-}
-
/*
* SNP_FEATURES_IMPL_REQ is the mask of SNP features that will need
* guest side implementation for proper functioning of the guest. If any
diff --git a/arch/x86/boot/compressed/sev.h b/arch/x86/boot/compressed/sev.h
index e87af54..92f79c2 100644
--- a/arch/x86/boot/compressed/sev.h
+++ b/arch/x86/boot/compressed/sev.h
@@ -10,10 +10,29 @@
#ifdef CONFIG_AMD_MEM_ENCRYPT
+#include "../msr.h"
+
void snp_accept_memory(phys_addr_t start, phys_addr_t end);
u64 sev_get_status(void);
bool early_is_sevsnp_guest(void);
+static inline u64 sev_es_rd_ghcb_msr(void)
+{
+ struct msr m;
+
+ boot_rdmsr(MSR_AMD64_SEV_ES_GHCB, &m);
+
+ return m.q;
+}
+
+static inline void sev_es_wr_ghcb_msr(u64 val)
+{
+ struct msr m;
+
+ m.q = val;
+ boot_wrmsr(MSR_AMD64_SEV_ES_GHCB, &m);
+}
+
#else
static inline void snp_accept_memory(phys_addr_t start, phys_addr_t end) { }
diff --git a/arch/x86/boot/startup/sev-shared.c b/arch/x86/boot/startup/sev-shared.c
index 173f3d1..7a706db 100644
--- a/arch/x86/boot/startup/sev-shared.c
+++ b/arch/x86/boot/startup/sev-shared.c
@@ -14,13 +14,9 @@
#ifndef __BOOT_COMPRESSED
#define error(v) pr_err(v)
#define has_cpuflag(f) boot_cpu_has(f)
-#define sev_printk(fmt, ...) printk(fmt, ##__VA_ARGS__)
-#define sev_printk_rtl(fmt, ...) printk_ratelimited(fmt, ##__VA_ARGS__)
#else
#undef WARN
#define WARN(condition, format...) (!!(condition))
-#define sev_printk(fmt, ...)
-#define sev_printk_rtl(fmt, ...)
#undef vc_forward_exception
#define vc_forward_exception(c) panic("SNP: Hypervisor requested exception\n")
#endif
@@ -36,16 +32,6 @@
struct svsm_ca *boot_svsm_caa __ro_after_init;
u64 boot_svsm_caa_pa __ro_after_init;
-/* I/O parameters for CPUID-related helpers */
-struct cpuid_leaf {
- u32 fn;
- u32 subfn;
- u32 eax;
- u32 ebx;
- u32 ecx;
- u32 edx;
-};
-
/*
* Since feature negotiation related variables are set early in the boot
* process they must reside in the .data section so as not to be zeroed
@@ -151,33 +137,6 @@ bool sev_es_negotiate_protocol(void)
return true;
}
-static bool vc_decoding_needed(unsigned long exit_code)
-{
- /* Exceptions don't require to decode the instruction */
- return !(exit_code >= SVM_EXIT_EXCP_BASE &&
- exit_code <= SVM_EXIT_LAST_EXCP);
-}
-
-static enum es_result vc_init_em_ctxt(struct es_em_ctxt *ctxt,
- struct pt_regs *regs,
- unsigned long exit_code)
-{
- enum es_result ret = ES_OK;
-
- memset(ctxt, 0, sizeof(*ctxt));
- ctxt->regs = regs;
-
- if (vc_decoding_needed(exit_code))
- ret = vc_decode_insn(ctxt);
-
- return ret;
-}
-
-static void vc_finish_insn(struct es_em_ctxt *ctxt)
-{
- ctxt->regs->ip += ctxt->insn.length;
-}
-
static enum es_result verify_exception_info(struct ghcb *ghcb, struct es_em_ctxt *ctxt)
{
u32 ret;
@@ -627,7 +586,7 @@ snp_cpuid_postprocess(struct ghcb *ghcb, struct es_em_ctxt *ctxt,
* Returns -EOPNOTSUPP if feature not enabled. Any other non-zero return value
* should be treated as fatal by caller.
*/
-static int __head
+int __head
snp_cpuid(struct ghcb *ghcb, struct es_em_ctxt *ctxt, struct cpuid_leaf *leaf)
{
const struct snp_cpuid_table *cpuid_table = snp_cpuid_get_table();
@@ -737,391 +696,6 @@ fail:
sev_es_terminate(SEV_TERM_SET_GEN, GHCB_SEV_ES_GEN_REQ);
}
-static enum es_result vc_insn_string_check(struct es_em_ctxt *ctxt,
- unsigned long address,
- bool write)
-{
- if (user_mode(ctxt->regs) && fault_in_kernel_space(address)) {
- ctxt->fi.vector = X86_TRAP_PF;
- ctxt->fi.error_code = X86_PF_USER;
- ctxt->fi.cr2 = address;
- if (write)
- ctxt->fi.error_code |= X86_PF_WRITE;
-
- return ES_EXCEPTION;
- }
-
- return ES_OK;
-}
-
-static enum es_result vc_insn_string_read(struct es_em_ctxt *ctxt,
- void *src, char *buf,
- unsigned int data_size,
- unsigned int count,
- bool backwards)
-{
- int i, b = backwards ? -1 : 1;
- unsigned long address = (unsigned long)src;
- enum es_result ret;
-
- ret = vc_insn_string_check(ctxt, address, false);
- if (ret != ES_OK)
- return ret;
-
- for (i = 0; i < count; i++) {
- void *s = src + (i * data_size * b);
- char *d = buf + (i * data_size);
-
- ret = vc_read_mem(ctxt, s, d, data_size);
- if (ret != ES_OK)
- break;
- }
-
- return ret;
-}
-
-static enum es_result vc_insn_string_write(struct es_em_ctxt *ctxt,
- void *dst, char *buf,
- unsigned int data_size,
- unsigned int count,
- bool backwards)
-{
- int i, s = backwards ? -1 : 1;
- unsigned long address = (unsigned long)dst;
- enum es_result ret;
-
- ret = vc_insn_string_check(ctxt, address, true);
- if (ret != ES_OK)
- return ret;
-
- for (i = 0; i < count; i++) {
- void *d = dst + (i * data_size * s);
- char *b = buf + (i * data_size);
-
- ret = vc_write_mem(ctxt, d, b, data_size);
- if (ret != ES_OK)
- break;
- }
-
- return ret;
-}
-
-#define IOIO_TYPE_STR BIT(2)
-#define IOIO_TYPE_IN 1
-#define IOIO_TYPE_INS (IOIO_TYPE_IN | IOIO_TYPE_STR)
-#define IOIO_TYPE_OUT 0
-#define IOIO_TYPE_OUTS (IOIO_TYPE_OUT | IOIO_TYPE_STR)
-
-#define IOIO_REP BIT(3)
-
-#define IOIO_ADDR_64 BIT(9)
-#define IOIO_ADDR_32 BIT(8)
-#define IOIO_ADDR_16 BIT(7)
-
-#define IOIO_DATA_32 BIT(6)
-#define IOIO_DATA_16 BIT(5)
-#define IOIO_DATA_8 BIT(4)
-
-#define IOIO_SEG_ES (0 << 10)
-#define IOIO_SEG_DS (3 << 10)
-
-static enum es_result vc_ioio_exitinfo(struct es_em_ctxt *ctxt, u64 *exitinfo)
-{
- struct insn *insn = &ctxt->insn;
- size_t size;
- u64 port;
-
- *exitinfo = 0;
-
- switch (insn->opcode.bytes[0]) {
- /* INS opcodes */
- case 0x6c:
- case 0x6d:
- *exitinfo |= IOIO_TYPE_INS;
- *exitinfo |= IOIO_SEG_ES;
- port = ctxt->regs->dx & 0xffff;
- break;
-
- /* OUTS opcodes */
- case 0x6e:
- case 0x6f:
- *exitinfo |= IOIO_TYPE_OUTS;
- *exitinfo |= IOIO_SEG_DS;
- port = ctxt->regs->dx & 0xffff;
- break;
-
- /* IN immediate opcodes */
- case 0xe4:
- case 0xe5:
- *exitinfo |= IOIO_TYPE_IN;
- port = (u8)insn->immediate.value & 0xffff;
- break;
-
- /* OUT immediate opcodes */
- case 0xe6:
- case 0xe7:
- *exitinfo |= IOIO_TYPE_OUT;
- port = (u8)insn->immediate.value & 0xffff;
- break;
-
- /* IN register opcodes */
- case 0xec:
- case 0xed:
- *exitinfo |= IOIO_TYPE_IN;
- port = ctxt->regs->dx & 0xffff;
- break;
-
- /* OUT register opcodes */
- case 0xee:
- case 0xef:
- *exitinfo |= IOIO_TYPE_OUT;
- port = ctxt->regs->dx & 0xffff;
- break;
-
- default:
- return ES_DECODE_FAILED;
- }
-
- *exitinfo |= port << 16;
-
- switch (insn->opcode.bytes[0]) {
- case 0x6c:
- case 0x6e:
- case 0xe4:
- case 0xe6:
- case 0xec:
- case 0xee:
- /* Single byte opcodes */
- *exitinfo |= IOIO_DATA_8;
- size = 1;
- break;
- default:
- /* Length determined by instruction parsing */
- *exitinfo |= (insn->opnd_bytes == 2) ? IOIO_DATA_16
- : IOIO_DATA_32;
- size = (insn->opnd_bytes == 2) ? 2 : 4;
- }
-
- switch (insn->addr_bytes) {
- case 2:
- *exitinfo |= IOIO_ADDR_16;
- break;
- case 4:
- *exitinfo |= IOIO_ADDR_32;
- break;
- case 8:
- *exitinfo |= IOIO_ADDR_64;
- break;
- }
-
- if (insn_has_rep_prefix(insn))
- *exitinfo |= IOIO_REP;
-
- return vc_ioio_check(ctxt, (u16)port, size);
-}
-
-static enum es_result vc_handle_ioio(struct ghcb *ghcb, struct es_em_ctxt *ctxt)
-{
- struct pt_regs *regs = ctxt->regs;
- u64 exit_info_1, exit_info_2;
- enum es_result ret;
-
- ret = vc_ioio_exitinfo(ctxt, &exit_info_1);
- if (ret != ES_OK)
- return ret;
-
- if (exit_info_1 & IOIO_TYPE_STR) {
-
- /* (REP) INS/OUTS */
-
- bool df = ((regs->flags & X86_EFLAGS_DF) == X86_EFLAGS_DF);
- unsigned int io_bytes, exit_bytes;
- unsigned int ghcb_count, op_count;
- unsigned long es_base;
- u64 sw_scratch;
-
- /*
- * For the string variants with rep prefix the amount of in/out
- * operations per #VC exception is limited so that the kernel
- * has a chance to take interrupts and re-schedule while the
- * instruction is emulated.
- */
- io_bytes = (exit_info_1 >> 4) & 0x7;
- ghcb_count = sizeof(ghcb->shared_buffer) / io_bytes;
-
- op_count = (exit_info_1 & IOIO_REP) ? regs->cx : 1;
- exit_info_2 = min(op_count, ghcb_count);
- exit_bytes = exit_info_2 * io_bytes;
-
- es_base = insn_get_seg_base(ctxt->regs, INAT_SEG_REG_ES);
-
- /* Read bytes of OUTS into the shared buffer */
- if (!(exit_info_1 & IOIO_TYPE_IN)) {
- ret = vc_insn_string_read(ctxt,
- (void *)(es_base + regs->si),
- ghcb->shared_buffer, io_bytes,
- exit_info_2, df);
- if (ret)
- return ret;
- }
-
- /*
- * Issue an VMGEXIT to the HV to consume the bytes from the
- * shared buffer or to have it write them into the shared buffer
- * depending on the instruction: OUTS or INS.
- */
- sw_scratch = __pa(ghcb) + offsetof(struct ghcb, shared_buffer);
- ghcb_set_sw_scratch(ghcb, sw_scratch);
- ret = sev_es_ghcb_hv_call(ghcb, ctxt, SVM_EXIT_IOIO,
- exit_info_1, exit_info_2);
- if (ret != ES_OK)
- return ret;
-
- /* Read bytes from shared buffer into the guest's destination. */
- if (exit_info_1 & IOIO_TYPE_IN) {
- ret = vc_insn_string_write(ctxt,
- (void *)(es_base + regs->di),
- ghcb->shared_buffer, io_bytes,
- exit_info_2, df);
- if (ret)
- return ret;
-
- if (df)
- regs->di -= exit_bytes;
- else
- regs->di += exit_bytes;
- } else {
- if (df)
- regs->si -= exit_bytes;
- else
- regs->si += exit_bytes;
- }
-
- if (exit_info_1 & IOIO_REP)
- regs->cx -= exit_info_2;
-
- ret = regs->cx ? ES_RETRY : ES_OK;
-
- } else {
-
- /* IN/OUT into/from rAX */
-
- int bits = (exit_info_1 & 0x70) >> 1;
- u64 rax = 0;
-
- if (!(exit_info_1 & IOIO_TYPE_IN))
- rax = lower_bits(regs->ax, bits);
-
- ghcb_set_rax(ghcb, rax);
-
- ret = sev_es_ghcb_hv_call(ghcb, ctxt, SVM_EXIT_IOIO, exit_info_1, 0);
- if (ret != ES_OK)
- return ret;
-
- if (exit_info_1 & IOIO_TYPE_IN) {
- if (!ghcb_rax_is_valid(ghcb))
- return ES_VMM_ERROR;
- regs->ax = lower_bits(ghcb->save.rax, bits);
- }
- }
-
- return ret;
-}
-
-static int vc_handle_cpuid_snp(struct ghcb *ghcb, struct es_em_ctxt *ctxt)
-{
- struct pt_regs *regs = ctxt->regs;
- struct cpuid_leaf leaf;
- int ret;
-
- leaf.fn = regs->ax;
- leaf.subfn = regs->cx;
- ret = snp_cpuid(ghcb, ctxt, &leaf);
- if (!ret) {
- regs->ax = leaf.eax;
- regs->bx = leaf.ebx;
- regs->cx = leaf.ecx;
- regs->dx = leaf.edx;
- }
-
- return ret;
-}
-
-static enum es_result vc_handle_cpuid(struct ghcb *ghcb,
- struct es_em_ctxt *ctxt)
-{
- struct pt_regs *regs = ctxt->regs;
- u32 cr4 = native_read_cr4();
- enum es_result ret;
- int snp_cpuid_ret;
-
- snp_cpuid_ret = vc_handle_cpuid_snp(ghcb, ctxt);
- if (!snp_cpuid_ret)
- return ES_OK;
- if (snp_cpuid_ret != -EOPNOTSUPP)
- return ES_VMM_ERROR;
-
- ghcb_set_rax(ghcb, regs->ax);
- ghcb_set_rcx(ghcb, regs->cx);
-
- if (cr4 & X86_CR4_OSXSAVE)
- /* Safe to read xcr0 */
- ghcb_set_xcr0(ghcb, xgetbv(XCR_XFEATURE_ENABLED_MASK));
- else
- /* xgetbv will cause #GP - use reset value for xcr0 */
- ghcb_set_xcr0(ghcb, 1);
-
- ret = sev_es_ghcb_hv_call(ghcb, ctxt, SVM_EXIT_CPUID, 0, 0);
- if (ret != ES_OK)
- return ret;
-
- if (!(ghcb_rax_is_valid(ghcb) &&
- ghcb_rbx_is_valid(ghcb) &&
- ghcb_rcx_is_valid(ghcb) &&
- ghcb_rdx_is_valid(ghcb)))
- return ES_VMM_ERROR;
-
- regs->ax = ghcb->save.rax;
- regs->bx = ghcb->save.rbx;
- regs->cx = ghcb->save.rcx;
- regs->dx = ghcb->save.rdx;
-
- return ES_OK;
-}
-
-static enum es_result vc_handle_rdtsc(struct ghcb *ghcb,
- struct es_em_ctxt *ctxt,
- unsigned long exit_code)
-{
- bool rdtscp = (exit_code == SVM_EXIT_RDTSCP);
- enum es_result ret;
-
- /*
- * The hypervisor should not be intercepting RDTSC/RDTSCP when Secure
- * TSC is enabled. A #VC exception will be generated if the RDTSC/RDTSCP
- * instructions are being intercepted. If this should occur and Secure
- * TSC is enabled, guest execution should be terminated as the guest
- * cannot rely on the TSC value provided by the hypervisor.
- */
- if (sev_status & MSR_AMD64_SNP_SECURE_TSC)
- return ES_VMM_ERROR;
-
- ret = sev_es_ghcb_hv_call(ghcb, ctxt, exit_code, 0, 0);
- if (ret != ES_OK)
- return ret;
-
- if (!(ghcb_rax_is_valid(ghcb) && ghcb_rdx_is_valid(ghcb) &&
- (!rdtscp || ghcb_rcx_is_valid(ghcb))))
- return ES_VMM_ERROR;
-
- ctxt->regs->ax = ghcb->save.rax;
- ctxt->regs->dx = ghcb->save.rdx;
- if (rdtscp)
- ctxt->regs->cx = ghcb->save.rcx;
-
- return ES_OK;
-}
-
struct cc_setup_data {
struct setup_data header;
u32 cc_blob_address;
@@ -1238,97 +812,6 @@ static void __head pvalidate_4k_page(unsigned long vaddr, unsigned long paddr,
}
}
-static enum es_result vc_check_opcode_bytes(struct es_em_ctxt *ctxt,
- unsigned long exit_code)
-{
- unsigned int opcode = (unsigned int)ctxt->insn.opcode.value;
- u8 modrm = ctxt->insn.modrm.value;
-
- switch (exit_code) {
-
- case SVM_EXIT_IOIO:
- case SVM_EXIT_NPF:
- /* handled separately */
- return ES_OK;
-
- case SVM_EXIT_CPUID:
- if (opcode == 0xa20f)
- return ES_OK;
- break;
-
- case SVM_EXIT_INVD:
- if (opcode == 0x080f)
- return ES_OK;
- break;
-
- case SVM_EXIT_MONITOR:
- /* MONITOR and MONITORX instructions generate the same error code */
- if (opcode == 0x010f && (modrm == 0xc8 || modrm == 0xfa))
- return ES_OK;
- break;
-
- case SVM_EXIT_MWAIT:
- /* MWAIT and MWAITX instructions generate the same error code */
- if (opcode == 0x010f && (modrm == 0xc9 || modrm == 0xfb))
- return ES_OK;
- break;
-
- case SVM_EXIT_MSR:
- /* RDMSR */
- if (opcode == 0x320f ||
- /* WRMSR */
- opcode == 0x300f)
- return ES_OK;
- break;
-
- case SVM_EXIT_RDPMC:
- if (opcode == 0x330f)
- return ES_OK;
- break;
-
- case SVM_EXIT_RDTSC:
- if (opcode == 0x310f)
- return ES_OK;
- break;
-
- case SVM_EXIT_RDTSCP:
- if (opcode == 0x010f && modrm == 0xf9)
- return ES_OK;
- break;
-
- case SVM_EXIT_READ_DR7:
- if (opcode == 0x210f &&
- X86_MODRM_REG(ctxt->insn.modrm.value) == 7)
- return ES_OK;
- break;
-
- case SVM_EXIT_VMMCALL:
- if (opcode == 0x010f && modrm == 0xd9)
- return ES_OK;
-
- break;
-
- case SVM_EXIT_WRITE_DR7:
- if (opcode == 0x230f &&
- X86_MODRM_REG(ctxt->insn.modrm.value) == 7)
- return ES_OK;
- break;
-
- case SVM_EXIT_WBINVD:
- if (opcode == 0x90f)
- return ES_OK;
- break;
-
- default:
- break;
- }
-
- sev_printk(KERN_ERR "Wrong/unhandled opcode bytes: 0x%x, exit_code: 0x%lx, rIP: 0x%lx\n",
- opcode, exit_code, ctxt->regs->ip);
-
- return ES_UNSUPPORTED;
-}
-
/*
* Maintain the GPA of the SVSM Calling Area (CA) in order to utilize the SVSM
* services needed when not running in VMPL0.
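
A note on the IOIO exitinfo encoding used by the handlers moving out of
this file: the data-size flags in SW_EXITINFO1 are one-hot in bits 4-6,
which is what lets vc_handle_ioio() derive both the byte count and the bit
width with plain shifts instead of a lookup table:

	/* IOIO_DATA_8  = BIT(4) = 0x10: (0x10 >> 4) & 7 = 1 byte,  (0x10 & 0x70) >> 1 =  8 bits */
	/* IOIO_DATA_16 = BIT(5) = 0x20: (0x20 >> 4) & 7 = 2 bytes, (0x20 & 0x70) >> 1 = 16 bits */
	/* IOIO_DATA_32 = BIT(6) = 0x40: (0x40 >> 4) & 7 = 4 bytes, (0x40 & 0x70) >> 1 = 32 bits */
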
diff --git a/arch/x86/boot/startup/sev-startup.c b/arch/x86/boot/startup/sev-startup.c
index f901ce9..435853a 100644
--- a/arch/x86/boot/startup/sev-startup.c
+++ b/arch/x86/boot/startup/sev-startup.c
@@ -9,7 +9,6 @@
#define pr_fmt(fmt) "SEV: " fmt
-#include <linux/sched/debug.h> /* For show_regs() */
#include <linux/percpu-defs.h>
#include <linux/cc_platform.h>
#include <linux/printk.h>
@@ -112,315 +111,6 @@ noinstr struct ghcb *__sev_get_ghcb(struct ghcb_state *state)
return ghcb;
}
-static int vc_fetch_insn_kernel(struct es_em_ctxt *ctxt,
- unsigned char *buffer)
-{
- return copy_from_kernel_nofault(buffer, (unsigned char *)ctxt->regs->ip, MAX_INSN_SIZE);
-}
-
-static enum es_result __vc_decode_user_insn(struct es_em_ctxt *ctxt)
-{
- char buffer[MAX_INSN_SIZE];
- int insn_bytes;
-
- insn_bytes = insn_fetch_from_user_inatomic(ctxt->regs, buffer);
- if (insn_bytes == 0) {
- /* Nothing could be copied */
- ctxt->fi.vector = X86_TRAP_PF;
- ctxt->fi.error_code = X86_PF_INSTR | X86_PF_USER;
- ctxt->fi.cr2 = ctxt->regs->ip;
- return ES_EXCEPTION;
- } else if (insn_bytes == -EINVAL) {
- /* Effective RIP could not be calculated */
- ctxt->fi.vector = X86_TRAP_GP;
- ctxt->fi.error_code = 0;
- ctxt->fi.cr2 = 0;
- return ES_EXCEPTION;
- }
-
- if (!insn_decode_from_regs(&ctxt->insn, ctxt->regs, buffer, insn_bytes))
- return ES_DECODE_FAILED;
-
- if (ctxt->insn.immediate.got)
- return ES_OK;
- else
- return ES_DECODE_FAILED;
-}
-
-static enum es_result __vc_decode_kern_insn(struct es_em_ctxt *ctxt)
-{
- char buffer[MAX_INSN_SIZE];
- int res, ret;
-
- res = vc_fetch_insn_kernel(ctxt, buffer);
- if (res) {
- ctxt->fi.vector = X86_TRAP_PF;
- ctxt->fi.error_code = X86_PF_INSTR;
- ctxt->fi.cr2 = ctxt->regs->ip;
- return ES_EXCEPTION;
- }
-
- ret = insn_decode(&ctxt->insn, buffer, MAX_INSN_SIZE, INSN_MODE_64);
- if (ret < 0)
- return ES_DECODE_FAILED;
- else
- return ES_OK;
-}
-
-static enum es_result vc_decode_insn(struct es_em_ctxt *ctxt)
-{
- if (user_mode(ctxt->regs))
- return __vc_decode_user_insn(ctxt);
- else
- return __vc_decode_kern_insn(ctxt);
-}
-
-static enum es_result vc_write_mem(struct es_em_ctxt *ctxt,
- char *dst, char *buf, size_t size)
-{
- unsigned long error_code = X86_PF_PROT | X86_PF_WRITE;
-
- /*
- * This function uses __put_user() independent of whether kernel or user
- * memory is accessed. This works fine because __put_user() does no
- * sanity checks of the pointer being accessed. All that it does is
- * to report when the access failed.
- *
- * Also, this function runs in atomic context, so __put_user() is not
- * allowed to sleep. The page-fault handler detects that it is running
- * in atomic context and will not try to take mmap_sem and handle the
- * fault, so additional pagefault_enable()/disable() calls are not
- * needed.
- *
- * The access can't be done via copy_to_user() here because
- * vc_write_mem() must not use string instructions to access unsafe
- * memory. The reason is that MOVS is emulated by the #VC handler by
- * splitting the move up into a read and a write and taking a nested #VC
- * exception on whatever of them is the MMIO access. Using string
- * instructions here would cause infinite nesting.
- */
- switch (size) {
- case 1: {
- u8 d1;
- u8 __user *target = (u8 __user *)dst;
-
- memcpy(&d1, buf, 1);
- if (__put_user(d1, target))
- goto fault;
- break;
- }
- case 2: {
- u16 d2;
- u16 __user *target = (u16 __user *)dst;
-
- memcpy(&d2, buf, 2);
- if (__put_user(d2, target))
- goto fault;
- break;
- }
- case 4: {
- u32 d4;
- u32 __user *target = (u32 __user *)dst;
-
- memcpy(&d4, buf, 4);
- if (__put_user(d4, target))
- goto fault;
- break;
- }
- case 8: {
- u64 d8;
- u64 __user *target = (u64 __user *)dst;
-
- memcpy(&d8, buf, 8);
- if (__put_user(d8, target))
- goto fault;
- break;
- }
- default:
- WARN_ONCE(1, "%s: Invalid size: %zu\n", __func__, size);
- return ES_UNSUPPORTED;
- }
-
- return ES_OK;
-
-fault:
- if (user_mode(ctxt->regs))
- error_code |= X86_PF_USER;
-
- ctxt->fi.vector = X86_TRAP_PF;
- ctxt->fi.error_code = error_code;
- ctxt->fi.cr2 = (unsigned long)dst;
-
- return ES_EXCEPTION;
-}
-
-static enum es_result vc_read_mem(struct es_em_ctxt *ctxt,
- char *src, char *buf, size_t size)
-{
- unsigned long error_code = X86_PF_PROT;
-
- /*
- * This function uses __get_user() independent of whether kernel or user
- * memory is accessed. This works fine because __get_user() does no
- * sanity checks of the pointer being accessed. All that it does is
- * to report when the access failed.
- *
- * Also, this function runs in atomic context, so __get_user() is not
- * allowed to sleep. The page-fault handler detects that it is running
- * in atomic context and will not try to take mmap_sem and handle the
- * fault, so additional pagefault_enable()/disable() calls are not
- * needed.
- *
- * The access can't be done via copy_from_user() here because
- * vc_read_mem() must not use string instructions to access unsafe
- * memory. The reason is that MOVS is emulated by the #VC handler by
- * splitting the move up into a read and a write and taking a nested #VC
- * exception on whatever of them is the MMIO access. Using string
- * instructions here would cause infinite nesting.
- */
- switch (size) {
- case 1: {
- u8 d1;
- u8 __user *s = (u8 __user *)src;
-
- if (__get_user(d1, s))
- goto fault;
- memcpy(buf, &d1, 1);
- break;
- }
- case 2: {
- u16 d2;
- u16 __user *s = (u16 __user *)src;
-
- if (__get_user(d2, s))
- goto fault;
- memcpy(buf, &d2, 2);
- break;
- }
- case 4: {
- u32 d4;
- u32 __user *s = (u32 __user *)src;
-
- if (__get_user(d4, s))
- goto fault;
- memcpy(buf, &d4, 4);
- break;
- }
- case 8: {
- u64 d8;
- u64 __user *s = (u64 __user *)src;
- if (__get_user(d8, s))
- goto fault;
- memcpy(buf, &d8, 8);
- break;
- }
- default:
- WARN_ONCE(1, "%s: Invalid size: %zu\n", __func__, size);
- return ES_UNSUPPORTED;
- }
-
- return ES_OK;
-
-fault:
- if (user_mode(ctxt->regs))
- error_code |= X86_PF_USER;
-
- ctxt->fi.vector = X86_TRAP_PF;
- ctxt->fi.error_code = error_code;
- ctxt->fi.cr2 = (unsigned long)src;
-
- return ES_EXCEPTION;
-}
-
-static enum es_result vc_slow_virt_to_phys(struct ghcb *ghcb, struct es_em_ctxt *ctxt,
- unsigned long vaddr, phys_addr_t *paddr)
-{
- unsigned long va = (unsigned long)vaddr;
- unsigned int level;
- phys_addr_t pa;
- pgd_t *pgd;
- pte_t *pte;
-
- pgd = __va(read_cr3_pa());
- pgd = &pgd[pgd_index(va)];
- pte = lookup_address_in_pgd(pgd, va, &level);
- if (!pte) {
- ctxt->fi.vector = X86_TRAP_PF;
- ctxt->fi.cr2 = vaddr;
- ctxt->fi.error_code = 0;
-
- if (user_mode(ctxt->regs))
- ctxt->fi.error_code |= X86_PF_USER;
-
- return ES_EXCEPTION;
- }
-
- if (WARN_ON_ONCE(pte_val(*pte) & _PAGE_ENC))
- /* Emulated MMIO to/from encrypted memory not supported */
- return ES_UNSUPPORTED;
-
- pa = (phys_addr_t)pte_pfn(*pte) << PAGE_SHIFT;
- pa |= va & ~page_level_mask(level);
-
- *paddr = pa;
-
- return ES_OK;
-}
-
-static enum es_result vc_ioio_check(struct es_em_ctxt *ctxt, u16 port, size_t size)
-{
- BUG_ON(size > 4);
-
- if (user_mode(ctxt->regs)) {
- struct thread_struct *t = &current->thread;
- struct io_bitmap *iobm = t->io_bitmap;
- size_t idx;
-
- if (!iobm)
- goto fault;
-
- for (idx = port; idx < port + size; ++idx) {
- if (test_bit(idx, iobm->bitmap))
- goto fault;
- }
- }
-
- return ES_OK;
-
-fault:
- ctxt->fi.vector = X86_TRAP_GP;
- ctxt->fi.error_code = 0;
-
- return ES_EXCEPTION;
-}
-
-static __always_inline void vc_forward_exception(struct es_em_ctxt *ctxt)
-{
- long error_code = ctxt->fi.error_code;
- int trapnr = ctxt->fi.vector;
-
- ctxt->regs->orig_ax = ctxt->fi.error_code;
-
- switch (trapnr) {
- case X86_TRAP_GP:
- exc_general_protection(ctxt->regs, error_code);
- break;
- case X86_TRAP_UD:
- exc_invalid_op(ctxt->regs);
- break;
- case X86_TRAP_PF:
- write_cr2(ctxt->fi.cr2);
- exc_page_fault(ctxt->regs, error_code);
- break;
- case X86_TRAP_AC:
- exc_alignment_check(ctxt->regs, error_code);
- break;
- default:
- pr_emerg("Unsupported exception in #VC instruction emulation - can't continue\n");
- BUG();
- }
-}
-
/* Include code shared with pre-decompression boot stage */
#include "sev-shared.c"
@@ -563,718 +253,6 @@ void __head early_snp_set_memory_shared(unsigned long vaddr, unsigned long paddr
early_set_pages_state(vaddr, paddr, npages, SNP_PAGE_STATE_SHARED);
}
-/* Writes to the SVSM CAA MSR are ignored */
-static enum es_result __vc_handle_msr_caa(struct pt_regs *regs, bool write)
-{
- if (write)
- return ES_OK;
-
- regs->ax = lower_32_bits(this_cpu_read(svsm_caa_pa));
- regs->dx = upper_32_bits(this_cpu_read(svsm_caa_pa));
-
- return ES_OK;
-}
-
-/*
- * TSC related accesses should not exit to the hypervisor when a guest is
- * executing with Secure TSC enabled, so special handling is required for
- * accesses of MSR_IA32_TSC and MSR_AMD64_GUEST_TSC_FREQ.
- */
-static enum es_result __vc_handle_secure_tsc_msrs(struct pt_regs *regs, bool write)
-{
- u64 tsc;
-
- /*
- * GUEST_TSC_FREQ should not be intercepted when Secure TSC is enabled.
- * Terminate the SNP guest when the interception is enabled.
- */
- if (regs->cx == MSR_AMD64_GUEST_TSC_FREQ)
- return ES_VMM_ERROR;
-
- /*
- * Writes: Writing to MSR_IA32_TSC can cause subsequent reads of the TSC
- * to return undefined values, so ignore all writes.
- *
- * Reads: Reads of MSR_IA32_TSC should return the current TSC value, use
- * the value returned by rdtsc_ordered().
- */
- if (write) {
- WARN_ONCE(1, "TSC MSR writes are verboten!\n");
- return ES_OK;
- }
-
- tsc = rdtsc_ordered();
- regs->ax = lower_32_bits(tsc);
- regs->dx = upper_32_bits(tsc);
-
- return ES_OK;
-}
-
-static enum es_result vc_handle_msr(struct ghcb *ghcb, struct es_em_ctxt *ctxt)
-{
- struct pt_regs *regs = ctxt->regs;
- enum es_result ret;
- bool write;
-
- /* Is it a WRMSR? */
- write = ctxt->insn.opcode.bytes[1] == 0x30;
-
- switch (regs->cx) {
- case MSR_SVSM_CAA:
- return __vc_handle_msr_caa(regs, write);
- case MSR_IA32_TSC:
- case MSR_AMD64_GUEST_TSC_FREQ:
- if (sev_status & MSR_AMD64_SNP_SECURE_TSC)
- return __vc_handle_secure_tsc_msrs(regs, write);
- break;
- default:
- break;
- }
-
- ghcb_set_rcx(ghcb, regs->cx);
- if (write) {
- ghcb_set_rax(ghcb, regs->ax);
- ghcb_set_rdx(ghcb, regs->dx);
- }
-
- ret = sev_es_ghcb_hv_call(ghcb, ctxt, SVM_EXIT_MSR, write, 0);
-
- if ((ret == ES_OK) && !write) {
- regs->ax = ghcb->save.rax;
- regs->dx = ghcb->save.rdx;
- }
-
- return ret;
-}
-
-static void __init vc_early_forward_exception(struct es_em_ctxt *ctxt)
-{
- int trapnr = ctxt->fi.vector;
-
- if (trapnr == X86_TRAP_PF)
- native_write_cr2(ctxt->fi.cr2);
-
- ctxt->regs->orig_ax = ctxt->fi.error_code;
- do_early_exception(ctxt->regs, trapnr);
-}
-
-static long *vc_insn_get_rm(struct es_em_ctxt *ctxt)
-{
- long *reg_array;
- int offset;
-
- reg_array = (long *)ctxt->regs;
- offset = insn_get_modrm_rm_off(&ctxt->insn, ctxt->regs);
-
- if (offset < 0)
- return NULL;
-
- offset /= sizeof(long);
-
- return reg_array + offset;
-}
-static enum es_result vc_do_mmio(struct ghcb *ghcb, struct es_em_ctxt *ctxt,
- unsigned int bytes, bool read)
-{
- u64 exit_code, exit_info_1, exit_info_2;
- unsigned long ghcb_pa = __pa(ghcb);
- enum es_result res;
- phys_addr_t paddr;
- void __user *ref;
-
- ref = insn_get_addr_ref(&ctxt->insn, ctxt->regs);
- if (ref == (void __user *)-1L)
- return ES_UNSUPPORTED;
-
- exit_code = read ? SVM_VMGEXIT_MMIO_READ : SVM_VMGEXIT_MMIO_WRITE;
-
- res = vc_slow_virt_to_phys(ghcb, ctxt, (unsigned long)ref, &paddr);
- if (res != ES_OK) {
- if (res == ES_EXCEPTION && !read)
- ctxt->fi.error_code |= X86_PF_WRITE;
-
- return res;
- }
-
- exit_info_1 = paddr;
- /* Can never be greater than 8 */
- exit_info_2 = bytes;
-
- ghcb_set_sw_scratch(ghcb, ghcb_pa + offsetof(struct ghcb, shared_buffer));
-
- return sev_es_ghcb_hv_call(ghcb, ctxt, exit_code, exit_info_1, exit_info_2);
-}
-
-/*
- * The MOVS instruction has two memory operands, which raises the
- * problem that it is not known whether the access to the source or the
- * destination caused the #VC exception (and hence whether an MMIO read
- * or write operation needs to be emulated).
- *
- * Instead of playing games with walking page-tables and trying to guess
- * whether the source or destination is an MMIO range, split the move
- * into two operations, a read and a write with only one memory operand.
- * This will cause a nested #VC exception on the MMIO address which can
- * then be handled.
- *
- * This implementation has the benefit that it also supports MOVS where
- * source _and_ destination are MMIO regions.
- *
- * It will slow MOVS on MMIO down a lot, but in SEV-ES guests it is a
- * rare operation. If it turns out to be a performance problem the split
- * operations can be moved to memcpy_fromio() and memcpy_toio().
- */
-static enum es_result vc_handle_mmio_movs(struct es_em_ctxt *ctxt,
- unsigned int bytes)
-{
- unsigned long ds_base, es_base;
- unsigned char *src, *dst;
- unsigned char buffer[8];
- enum es_result ret;
- bool rep;
- int off;
-
- ds_base = insn_get_seg_base(ctxt->regs, INAT_SEG_REG_DS);
- es_base = insn_get_seg_base(ctxt->regs, INAT_SEG_REG_ES);
-
- if (ds_base == -1L || es_base == -1L) {
- ctxt->fi.vector = X86_TRAP_GP;
- ctxt->fi.error_code = 0;
- return ES_EXCEPTION;
- }
-
- src = ds_base + (unsigned char *)ctxt->regs->si;
- dst = es_base + (unsigned char *)ctxt->regs->di;
-
- ret = vc_read_mem(ctxt, src, buffer, bytes);
- if (ret != ES_OK)
- return ret;
-
- ret = vc_write_mem(ctxt, dst, buffer, bytes);
- if (ret != ES_OK)
- return ret;
-
- if (ctxt->regs->flags & X86_EFLAGS_DF)
- off = -bytes;
- else
- off = bytes;
-
- ctxt->regs->si += off;
- ctxt->regs->di += off;
-
- rep = insn_has_rep_prefix(&ctxt->insn);
- if (rep)
- ctxt->regs->cx -= 1;
-
- if (!rep || ctxt->regs->cx == 0)
- return ES_OK;
- else
- return ES_RETRY;
-}
-
-static enum es_result vc_handle_mmio(struct ghcb *ghcb, struct es_em_ctxt *ctxt)
-{
- struct insn *insn = &ctxt->insn;
- enum insn_mmio_type mmio;
- unsigned int bytes = 0;
- enum es_result ret;
- u8 sign_byte;
- long *reg_data;
-
- mmio = insn_decode_mmio(insn, &bytes);
- if (mmio == INSN_MMIO_DECODE_FAILED)
- return ES_DECODE_FAILED;
-
- if (mmio != INSN_MMIO_WRITE_IMM && mmio != INSN_MMIO_MOVS) {
- reg_data = insn_get_modrm_reg_ptr(insn, ctxt->regs);
- if (!reg_data)
- return ES_DECODE_FAILED;
- }
-
- if (user_mode(ctxt->regs))
- return ES_UNSUPPORTED;
-
- switch (mmio) {
- case INSN_MMIO_WRITE:
- memcpy(ghcb->shared_buffer, reg_data, bytes);
- ret = vc_do_mmio(ghcb, ctxt, bytes, false);
- break;
- case INSN_MMIO_WRITE_IMM:
- memcpy(ghcb->shared_buffer, insn->immediate1.bytes, bytes);
- ret = vc_do_mmio(ghcb, ctxt, bytes, false);
- break;
- case INSN_MMIO_READ:
- ret = vc_do_mmio(ghcb, ctxt, bytes, true);
- if (ret)
- break;
-
- /* Zero-extend for 32-bit operation */
- if (bytes == 4)
- *reg_data = 0;
-
- memcpy(reg_data, ghcb->shared_buffer, bytes);
- break;
- case INSN_MMIO_READ_ZERO_EXTEND:
- ret = vc_do_mmio(ghcb, ctxt, bytes, true);
- if (ret)
- break;
-
- /* Zero extend based on operand size */
- memset(reg_data, 0, insn->opnd_bytes);
- memcpy(reg_data, ghcb->shared_buffer, bytes);
- break;
- case INSN_MMIO_READ_SIGN_EXTEND:
- ret = vc_do_mmio(ghcb, ctxt, bytes, true);
- if (ret)
- break;
-
- if (bytes == 1) {
- u8 *val = (u8 *)ghcb->shared_buffer;
-
- sign_byte = (*val & 0x80) ? 0xff : 0x00;
- } else {
- u16 *val = (u16 *)ghcb->shared_buffer;
-
- sign_byte = (*val & 0x8000) ? 0xff : 0x00;
- }
-
- /* Sign extend based on operand size */
- memset(reg_data, sign_byte, insn->opnd_bytes);
- memcpy(reg_data, ghcb->shared_buffer, bytes);
- break;
- case INSN_MMIO_MOVS:
- ret = vc_handle_mmio_movs(ctxt, bytes);
- break;
- default:
- ret = ES_UNSUPPORTED;
- break;
- }
-
- return ret;
-}
-
-static enum es_result vc_handle_dr7_write(struct ghcb *ghcb,
- struct es_em_ctxt *ctxt)
-{
- struct sev_es_runtime_data *data = this_cpu_read(runtime_data);
- long val, *reg = vc_insn_get_rm(ctxt);
- enum es_result ret;
-
- if (sev_status & MSR_AMD64_SNP_DEBUG_SWAP)
- return ES_VMM_ERROR;
-
- if (!reg)
- return ES_DECODE_FAILED;
-
- val = *reg;
-
- /* Upper 32 bits must be written as zeroes */
- if (val >> 32) {
- ctxt->fi.vector = X86_TRAP_GP;
- ctxt->fi.error_code = 0;
- return ES_EXCEPTION;
- }
-
- /* Clear out other reserved bits and set bit 10 */
- val = (val & 0xffff23ffL) | BIT(10);
-
- /* Early non-zero writes to DR7 are not supported */
- if (!data && (val & ~DR7_RESET_VALUE))
- return ES_UNSUPPORTED;
-
- /* Using a value of 0 for ExitInfo1 means RAX holds the value */
- ghcb_set_rax(ghcb, val);
- ret = sev_es_ghcb_hv_call(ghcb, ctxt, SVM_EXIT_WRITE_DR7, 0, 0);
- if (ret != ES_OK)
- return ret;
-
- if (data)
- data->dr7 = val;
-
- return ES_OK;
-}
-
-static enum es_result vc_handle_dr7_read(struct ghcb *ghcb,
- struct es_em_ctxt *ctxt)
-{
- struct sev_es_runtime_data *data = this_cpu_read(runtime_data);
- long *reg = vc_insn_get_rm(ctxt);
-
- if (sev_status & MSR_AMD64_SNP_DEBUG_SWAP)
- return ES_VMM_ERROR;
-
- if (!reg)
- return ES_DECODE_FAILED;
-
- if (data)
- *reg = data->dr7;
- else
- *reg = DR7_RESET_VALUE;
-
- return ES_OK;
-}
-
-static enum es_result vc_handle_wbinvd(struct ghcb *ghcb,
- struct es_em_ctxt *ctxt)
-{
- return sev_es_ghcb_hv_call(ghcb, ctxt, SVM_EXIT_WBINVD, 0, 0);
-}
-
-static enum es_result vc_handle_rdpmc(struct ghcb *ghcb, struct es_em_ctxt *ctxt)
-{
- enum es_result ret;
-
- ghcb_set_rcx(ghcb, ctxt->regs->cx);
-
- ret = sev_es_ghcb_hv_call(ghcb, ctxt, SVM_EXIT_RDPMC, 0, 0);
- if (ret != ES_OK)
- return ret;
-
- if (!(ghcb_rax_is_valid(ghcb) && ghcb_rdx_is_valid(ghcb)))
- return ES_VMM_ERROR;
-
- ctxt->regs->ax = ghcb->save.rax;
- ctxt->regs->dx = ghcb->save.rdx;
-
- return ES_OK;
-}
-
-static enum es_result vc_handle_monitor(struct ghcb *ghcb,
- struct es_em_ctxt *ctxt)
-{
- /*
- * Treat it as a NOP and do not leak a physical address to the
- * hypervisor.
- */
- return ES_OK;
-}
-
-static enum es_result vc_handle_mwait(struct ghcb *ghcb,
- struct es_em_ctxt *ctxt)
-{
- /* Treat the same as MONITOR/MONITORX */
- return ES_OK;
-}
-
-static enum es_result vc_handle_vmmcall(struct ghcb *ghcb,
- struct es_em_ctxt *ctxt)
-{
- enum es_result ret;
-
- ghcb_set_rax(ghcb, ctxt->regs->ax);
- ghcb_set_cpl(ghcb, user_mode(ctxt->regs) ? 3 : 0);
-
- if (x86_platform.hyper.sev_es_hcall_prepare)
- x86_platform.hyper.sev_es_hcall_prepare(ghcb, ctxt->regs);
-
- ret = sev_es_ghcb_hv_call(ghcb, ctxt, SVM_EXIT_VMMCALL, 0, 0);
- if (ret != ES_OK)
- return ret;
-
- if (!ghcb_rax_is_valid(ghcb))
- return ES_VMM_ERROR;
-
- ctxt->regs->ax = ghcb->save.rax;
-
- /*
- * Call sev_es_hcall_finish() after regs->ax is already set.
- * This allows the hypervisor handler to overwrite it again if
- * necessary.
- */
- if (x86_platform.hyper.sev_es_hcall_finish &&
- !x86_platform.hyper.sev_es_hcall_finish(ghcb, ctxt->regs))
- return ES_VMM_ERROR;
-
- return ES_OK;
-}
-
-static enum es_result vc_handle_trap_ac(struct ghcb *ghcb,
- struct es_em_ctxt *ctxt)
-{
- /*
- * Calling ecx_alignment_check() directly does not work, because it
- * enables IRQs and the GHCB is active. Forward the exception and call
- * it later from vc_forward_exception().
- */
- ctxt->fi.vector = X86_TRAP_AC;
- ctxt->fi.error_code = 0;
- return ES_EXCEPTION;
-}
-
-static enum es_result vc_handle_exitcode(struct es_em_ctxt *ctxt,
- struct ghcb *ghcb,
- unsigned long exit_code)
-{
- enum es_result result = vc_check_opcode_bytes(ctxt, exit_code);
-
- if (result != ES_OK)
- return result;
-
- switch (exit_code) {
- case SVM_EXIT_READ_DR7:
- result = vc_handle_dr7_read(ghcb, ctxt);
- break;
- case SVM_EXIT_WRITE_DR7:
- result = vc_handle_dr7_write(ghcb, ctxt);
- break;
- case SVM_EXIT_EXCP_BASE + X86_TRAP_AC:
- result = vc_handle_trap_ac(ghcb, ctxt);
- break;
- case SVM_EXIT_RDTSC:
- case SVM_EXIT_RDTSCP:
- result = vc_handle_rdtsc(ghcb, ctxt, exit_code);
- break;
- case SVM_EXIT_RDPMC:
- result = vc_handle_rdpmc(ghcb, ctxt);
- break;
- case SVM_EXIT_INVD:
- pr_err_ratelimited("#VC exception for INVD??? Seriously???\n");
- result = ES_UNSUPPORTED;
- break;
- case SVM_EXIT_CPUID:
- result = vc_handle_cpuid(ghcb, ctxt);
- break;
- case SVM_EXIT_IOIO:
- result = vc_handle_ioio(ghcb, ctxt);
- break;
- case SVM_EXIT_MSR:
- result = vc_handle_msr(ghcb, ctxt);
- break;
- case SVM_EXIT_VMMCALL:
- result = vc_handle_vmmcall(ghcb, ctxt);
- break;
- case SVM_EXIT_WBINVD:
- result = vc_handle_wbinvd(ghcb, ctxt);
- break;
- case SVM_EXIT_MONITOR:
- result = vc_handle_monitor(ghcb, ctxt);
- break;
- case SVM_EXIT_MWAIT:
- result = vc_handle_mwait(ghcb, ctxt);
- break;
- case SVM_EXIT_NPF:
- result = vc_handle_mmio(ghcb, ctxt);
- break;
- default:
- /*
- * Unexpected #VC exception
- */
- result = ES_UNSUPPORTED;
- }
-
- return result;
-}
-
-static __always_inline bool is_vc2_stack(unsigned long sp)
-{
- return (sp >= __this_cpu_ist_bottom_va(VC2) && sp < __this_cpu_ist_top_va(VC2));
-}
-
-static __always_inline bool vc_from_invalid_context(struct pt_regs *regs)
-{
- unsigned long sp, prev_sp;
-
- sp = (unsigned long)regs;
- prev_sp = regs->sp;
-
- /*
- * If the code was already executing on the VC2 stack when the #VC
- * happened, let it proceed to the normal handling routine. This way the
- * code executing on the VC2 stack can cause #VC exceptions to get handled.
- */
- return is_vc2_stack(sp) && !is_vc2_stack(prev_sp);
-}
-
-static bool vc_raw_handle_exception(struct pt_regs *regs, unsigned long error_code)
-{
- struct ghcb_state state;
- struct es_em_ctxt ctxt;
- enum es_result result;
- struct ghcb *ghcb;
- bool ret = true;
-
- ghcb = __sev_get_ghcb(&state);
-
- vc_ghcb_invalidate(ghcb);
- result = vc_init_em_ctxt(&ctxt, regs, error_code);
-
- if (result == ES_OK)
- result = vc_handle_exitcode(&ctxt, ghcb, error_code);
-
- __sev_put_ghcb(&state);
-
- /* Done - now check the result */
- switch (result) {
- case ES_OK:
- vc_finish_insn(&ctxt);
- break;
- case ES_UNSUPPORTED:
- pr_err_ratelimited("Unsupported exit-code 0x%02lx in #VC exception (IP: 0x%lx)\n",
- error_code, regs->ip);
- ret = false;
- break;
- case ES_VMM_ERROR:
- pr_err_ratelimited("Failure in communication with VMM (exit-code 0x%02lx IP: 0x%lx)\n",
- error_code, regs->ip);
- ret = false;
- break;
- case ES_DECODE_FAILED:
- pr_err_ratelimited("Failed to decode instruction (exit-code 0x%02lx IP: 0x%lx)\n",
- error_code, regs->ip);
- ret = false;
- break;
- case ES_EXCEPTION:
- vc_forward_exception(&ctxt);
- break;
- case ES_RETRY:
- /* Nothing to do */
- break;
- default:
- pr_emerg("Unknown result in %s():%d\n", __func__, result);
- /*
- * Emulating the instruction which caused the #VC exception
- * failed - can't continue so print debug information
- */
- BUG();
- }
-
- return ret;
-}
-
-static __always_inline bool vc_is_db(unsigned long error_code)
-{
- return error_code == SVM_EXIT_EXCP_BASE + X86_TRAP_DB;
-}
-
-/*
- * Runtime #VC exception handler when raised from kernel mode. Runs in NMI mode
- * and will panic when an error happens.
- */
-DEFINE_IDTENTRY_VC_KERNEL(exc_vmm_communication)
-{
- irqentry_state_t irq_state;
-
- /*
- * With the current implementation it is always possible to switch to a
- * safe stack because #VC exceptions only happen at known places, like
- * intercepted instructions or accesses to MMIO areas/IO ports. They can
- * also happen with code instrumentation when the hypervisor intercepts
- * #DB, but the critical paths are forbidden to be instrumented, so #DB
- * exceptions currently also only happen in safe places.
- *
- * But keep this here in case the noinstr annotations are violated due
- * to bug elsewhere.
- */
- if (unlikely(vc_from_invalid_context(regs))) {
- instrumentation_begin();
- panic("Can't handle #VC exception from unsupported context\n");
- instrumentation_end();
- }
-
- /*
- * Handle #DB before calling into !noinstr code to avoid recursive #DB.
- */
- if (vc_is_db(error_code)) {
- exc_debug(regs);
- return;
- }
-
- irq_state = irqentry_nmi_enter(regs);
-
- instrumentation_begin();
-
- if (!vc_raw_handle_exception(regs, error_code)) {
- /* Show some debug info */
- show_regs(regs);
-
- /* Ask hypervisor to sev_es_terminate */
- sev_es_terminate(SEV_TERM_SET_GEN, GHCB_SEV_ES_GEN_REQ);
-
- /* If that fails and we get here - just panic */
- panic("Returned from Terminate-Request to Hypervisor\n");
- }
-
- instrumentation_end();
- irqentry_nmi_exit(regs, irq_state);
-}
-
-/*
- * Runtime #VC exception handler when raised from user mode. Runs in IRQ mode
- * and will kill the current task with SIGBUS when an error happens.
- */
-DEFINE_IDTENTRY_VC_USER(exc_vmm_communication)
-{
- /*
- * Handle #DB before calling into !noinstr code to avoid recursive #DB.
- */
- if (vc_is_db(error_code)) {
- noist_exc_debug(regs);
- return;
- }
-
- irqentry_enter_from_user_mode(regs);
- instrumentation_begin();
-
- if (!vc_raw_handle_exception(regs, error_code)) {
- /*
- * Do not kill the machine if user-space triggered the
- * exception. Send SIGBUS instead and let user-space deal with
- * it.
- */
- force_sig_fault(SIGBUS, BUS_OBJERR, (void __user *)0);
- }
-
- instrumentation_end();
- irqentry_exit_to_user_mode(regs);
-}
-
-bool __init handle_vc_boot_ghcb(struct pt_regs *regs)
-{
- unsigned long exit_code = regs->orig_ax;
- struct es_em_ctxt ctxt;
- enum es_result result;
-
- vc_ghcb_invalidate(boot_ghcb);
-
- result = vc_init_em_ctxt(&ctxt, regs, exit_code);
- if (result == ES_OK)
- result = vc_handle_exitcode(&ctxt, boot_ghcb, exit_code);
-
- /* Done - now check the result */
- switch (result) {
- case ES_OK:
- vc_finish_insn(&ctxt);
- break;
- case ES_UNSUPPORTED:
- early_printk("PANIC: Unsupported exit-code 0x%02lx in early #VC exception (IP: 0x%lx)\n",
- exit_code, regs->ip);
- goto fail;
- case ES_VMM_ERROR:
- early_printk("PANIC: Failure in communication with VMM (exit-code 0x%02lx IP: 0x%lx)\n",
- exit_code, regs->ip);
- goto fail;
- case ES_DECODE_FAILED:
- early_printk("PANIC: Failed to decode instruction (exit-code 0x%02lx IP: 0x%lx)\n",
- exit_code, regs->ip);
- goto fail;
- case ES_EXCEPTION:
- vc_early_forward_exception(&ctxt);
- break;
- case ES_RETRY:
- /* Nothing to do */
- break;
- default:
- BUG();
- }
-
- return true;
-
-fail:
- show_regs(regs);
-
- sev_es_terminate(SEV_TERM_SET_GEN, GHCB_SEV_ES_GEN_REQ);
-}
-
/*
* Initial set up of SNP relies on information provided by the
* Confidential Computing blob, which can be passed to the kernel
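
For orientation, the runtime #VC flow that leaves this file, condensed from
vc_raw_handle_exception() above with the error paths elided:

	static bool vc_flow_sketch(struct pt_regs *regs, unsigned long error_code)
	{
		struct ghcb_state state;
		struct es_em_ctxt ctxt;
		enum es_result result;
		struct ghcb *ghcb = __sev_get_ghcb(&state);	/* per-CPU GHCB, NMI safe */

		vc_ghcb_invalidate(ghcb);	/* drop stale valid-bitmap state */
		result = vc_init_em_ctxt(&ctxt, regs, error_code);	/* fetch + decode */
		if (result == ES_OK)
			result = vc_handle_exitcode(&ctxt, ghcb, error_code);
		__sev_put_ghcb(&state);
		if (result == ES_OK)
			vc_finish_insn(&ctxt);	/* step RIP past the emulated insn */
		return result == ES_OK;
	}
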
diff --git a/arch/x86/coco/sev/Makefile b/arch/x86/coco/sev/Makefile
index 2919dcf..db3255b 100644
--- a/arch/x86/coco/sev/Makefile
+++ b/arch/x86/coco/sev/Makefile
@@ -1,6 +1,6 @@
# SPDX-License-Identifier: GPL-2.0
-obj-y += core.o sev-nmi.o
+obj-y += core.o sev-nmi.o vc-handle.o
# Clang 14 and older may fail to respect __no_sanitize_undefined when inlining
UBSAN_SANITIZE_sev-nmi.o := n
diff --git a/arch/x86/coco/sev/vc-handle.c b/arch/x86/coco/sev/vc-handle.c
new file mode 100644
index 0000000..b4895c6
--- /dev/null
+++ b/arch/x86/coco/sev/vc-handle.c
@@ -0,0 +1,1061 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * AMD Memory Encryption Support
+ *
+ * Copyright (C) 2019 SUSE
+ *
+ * Author: Joerg Roedel <jroedel@suse.de>
+ */
+
+#define pr_fmt(fmt) "SEV: " fmt
+
+#include <linux/sched/debug.h> /* For show_regs() */
+#include <linux/cc_platform.h>
+#include <linux/printk.h>
+#include <linux/mm_types.h>
+#include <linux/kernel.h>
+#include <linux/mm.h>
+#include <linux/io.h>
+#include <linux/psp-sev.h>
+#include <uapi/linux/sev-guest.h>
+
+#include <asm/init.h>
+#include <asm/stacktrace.h>
+#include <asm/sev.h>
+#include <asm/sev-internal.h>
+#include <asm/insn-eval.h>
+#include <asm/fpu/xcr.h>
+#include <asm/processor.h>
+#include <asm/setup.h>
+#include <asm/traps.h>
+#include <asm/svm.h>
+#include <asm/smp.h>
+#include <asm/cpu.h>
+#include <asm/apic.h>
+#include <asm/cpuid.h>
+
+static enum es_result vc_slow_virt_to_phys(struct ghcb *ghcb, struct es_em_ctxt *ctxt,
+ unsigned long vaddr, phys_addr_t *paddr)
+{
+ unsigned long va = (unsigned long)vaddr;
+ unsigned int level;
+ phys_addr_t pa;
+ pgd_t *pgd;
+ pte_t *pte;
+
+ pgd = __va(read_cr3_pa());
+ pgd = &pgd[pgd_index(va)];
+ pte = lookup_address_in_pgd(pgd, va, &level);
+ if (!pte) {
+ ctxt->fi.vector = X86_TRAP_PF;
+ ctxt->fi.cr2 = vaddr;
+ ctxt->fi.error_code = 0;
+
+ if (user_mode(ctxt->regs))
+ ctxt->fi.error_code |= X86_PF_USER;
+
+ return ES_EXCEPTION;
+ }
+
+ if (WARN_ON_ONCE(pte_val(*pte) & _PAGE_ENC))
+ /* Emulated MMIO to/from encrypted memory not supported */
+ return ES_UNSUPPORTED;
+
+ pa = (phys_addr_t)pte_pfn(*pte) << PAGE_SHIFT;
+ pa |= va & ~page_level_mask(level);
+
+ *paddr = pa;
+
+ return ES_OK;
+}
+
+static enum es_result vc_ioio_check(struct es_em_ctxt *ctxt, u16 port, size_t size)
+{
+ BUG_ON(size > 4);
+
+ if (user_mode(ctxt->regs)) {
+ struct thread_struct *t = &current->thread;
+ struct io_bitmap *iobm = t->io_bitmap;
+ size_t idx;
+
+ if (!iobm)
+ goto fault;
+
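+ /* as in the hardware TSS I/O bitmap, a set bit denies access to the port */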
+ for (idx = port; idx < port + size; ++idx) {
+ if (test_bit(idx, iobm->bitmap))
+ goto fault;
+ }
+ }
+
+ return ES_OK;
+
+fault:
+ ctxt->fi.vector = X86_TRAP_GP;
+ ctxt->fi.error_code = 0;
+
+ return ES_EXCEPTION;
+}
+
+void vc_forward_exception(struct es_em_ctxt *ctxt)
+{
+ long error_code = ctxt->fi.error_code;
+ int trapnr = ctxt->fi.vector;
+
+ ctxt->regs->orig_ax = ctxt->fi.error_code;
+
+ switch (trapnr) {
+ case X86_TRAP_GP:
+ exc_general_protection(ctxt->regs, error_code);
+ break;
+ case X86_TRAP_UD:
+ exc_invalid_op(ctxt->regs);
+ break;
+ case X86_TRAP_PF:
+ write_cr2(ctxt->fi.cr2);
+ exc_page_fault(ctxt->regs, error_code);
+ break;
+ case X86_TRAP_AC:
+ exc_alignment_check(ctxt->regs, error_code);
+ break;
+ default:
+ pr_emerg("Unsupported exception in #VC instruction emulation - can't continue\n");
+ BUG();
+ }
+}
+
+static int vc_fetch_insn_kernel(struct es_em_ctxt *ctxt,
+ unsigned char *buffer)
+{
+ return copy_from_kernel_nofault(buffer, (unsigned char *)ctxt->regs->ip, MAX_INSN_SIZE);
+}
+
+static enum es_result __vc_decode_user_insn(struct es_em_ctxt *ctxt)
+{
+ char buffer[MAX_INSN_SIZE];
+ int insn_bytes;
+
+ insn_bytes = insn_fetch_from_user_inatomic(ctxt->regs, buffer);
+ if (insn_bytes == 0) {
+ /* Nothing could be copied */
+ ctxt->fi.vector = X86_TRAP_PF;
+ ctxt->fi.error_code = X86_PF_INSTR | X86_PF_USER;
+ ctxt->fi.cr2 = ctxt->regs->ip;
+ return ES_EXCEPTION;
+ } else if (insn_bytes == -EINVAL) {
+ /* Effective RIP could not be calculated */
+ ctxt->fi.vector = X86_TRAP_GP;
+ ctxt->fi.error_code = 0;
+ ctxt->fi.cr2 = 0;
+ return ES_EXCEPTION;
+ }
+
+ if (!insn_decode_from_regs(&ctxt->insn, ctxt->regs, buffer, insn_bytes))
+ return ES_DECODE_FAILED;
+
+ if (ctxt->insn.immediate.got)
+ return ES_OK;
+ else
+ return ES_DECODE_FAILED;
+}
+
+static enum es_result __vc_decode_kern_insn(struct es_em_ctxt *ctxt)
+{
+ char buffer[MAX_INSN_SIZE];
+ int res, ret;
+
+ res = vc_fetch_insn_kernel(ctxt, buffer);
+ if (res) {
+ ctxt->fi.vector = X86_TRAP_PF;
+ ctxt->fi.error_code = X86_PF_INSTR;
+ ctxt->fi.cr2 = ctxt->regs->ip;
+ return ES_EXCEPTION;
+ }
+
+ ret = insn_decode(&ctxt->insn, buffer, MAX_INSN_SIZE, INSN_MODE_64);
+ if (ret < 0)
+ return ES_DECODE_FAILED;
+ else
+ return ES_OK;
+}
+
+static enum es_result vc_decode_insn(struct es_em_ctxt *ctxt)
+{
+ if (user_mode(ctxt->regs))
+ return __vc_decode_user_insn(ctxt);
+ else
+ return __vc_decode_kern_insn(ctxt);
+}
+
+static enum es_result vc_write_mem(struct es_em_ctxt *ctxt,
+ char *dst, char *buf, size_t size)
+{
+ unsigned long error_code = X86_PF_PROT | X86_PF_WRITE;
+
+ /*
+ * This function uses __put_user() independent of whether kernel or user
+ * memory is accessed. This works fine because __put_user() does no
+ * sanity checks of the pointer being accessed. All that it does is
+ * to report when the access failed.
+ *
+ * Also, this function runs in atomic context, so __put_user() is not
+ * allowed to sleep. The page-fault handler detects that it is running
+ * in atomic context and will not try to take mmap_sem and handle the
+ * fault, so additional pagefault_enable()/disable() calls are not
+ * needed.
+ *
+ * The access can't be done via copy_to_user() here because
+ * vc_write_mem() must not use string instructions to access unsafe
+ * memory. The reason is that MOVS is emulated by the #VC handler by
+ * splitting the move up into a read and a write and taking a nested #VC
+ * exception on whatever of them is the MMIO access. Using string
+ * instructions here would cause infinite nesting.
+ */
+ switch (size) {
+ case 1: {
+ u8 d1;
+ u8 __user *target = (u8 __user *)dst;
+
+ memcpy(&d1, buf, 1);
+ if (__put_user(d1, target))
+ goto fault;
+ break;
+ }
+ case 2: {
+ u16 d2;
+ u16 __user *target = (u16 __user *)dst;
+
+ memcpy(&d2, buf, 2);
+ if (__put_user(d2, target))
+ goto fault;
+ break;
+ }
+ case 4: {
+ u32 d4;
+ u32 __user *target = (u32 __user *)dst;
+
+ memcpy(&d4, buf, 4);
+ if (__put_user(d4, target))
+ goto fault;
+ break;
+ }
+ case 8: {
+ u64 d8;
+ u64 __user *target = (u64 __user *)dst;
+
+ memcpy(&d8, buf, 8);
+ if (__put_user(d8, target))
+ goto fault;
+ break;
+ }
+ default:
+ WARN_ONCE(1, "%s: Invalid size: %zu\n", __func__, size);
+ return ES_UNSUPPORTED;
+ }
+
+ return ES_OK;
+
+fault:
+ if (user_mode(ctxt->regs))
+ error_code |= X86_PF_USER;
+
+ ctxt->fi.vector = X86_TRAP_PF;
+ ctxt->fi.error_code = error_code;
+ ctxt->fi.cr2 = (unsigned long)dst;
+
+ return ES_EXCEPTION;
+}
+
+static enum es_result vc_read_mem(struct es_em_ctxt *ctxt,
+ char *src, char *buf, size_t size)
+{
+ unsigned long error_code = X86_PF_PROT;
+
+ /*
+ * This function uses __get_user() independent of whether kernel or user
+ * memory is accessed. This works fine because __get_user() does no
+ * sanity checks of the pointer being accessed. All that it does is
+ * to report when the access failed.
+ *
+ * Also, this function runs in atomic context, so __get_user() is not
+ * allowed to sleep. The page-fault handler detects that it is running
+ * in atomic context and will not try to take mmap_sem and handle the
+ * fault, so additional pagefault_enable()/disable() calls are not
+ * needed.
+ *
+ * The access can't be done via copy_from_user() here because
+ * vc_read_mem() must not use string instructions to access unsafe
+ * memory. The reason is that MOVS is emulated by the #VC handler by
+ * splitting the move up into a read and a write and taking a nested #VC
+ * exception on whatever of them is the MMIO access. Using string
+ * instructions here would cause infinite nesting.
+ */
+ switch (size) {
+ case 1: {
+ u8 d1;
+ u8 __user *s = (u8 __user *)src;
+
+ if (__get_user(d1, s))
+ goto fault;
+ memcpy(buf, &d1, 1);
+ break;
+ }
+ case 2: {
+ u16 d2;
+ u16 __user *s = (u16 __user *)src;
+
+ if (__get_user(d2, s))
+ goto fault;
+ memcpy(buf, &d2, 2);
+ break;
+ }
+ case 4: {
+ u32 d4;
+ u32 __user *s = (u32 __user *)src;
+
+ if (__get_user(d4, s))
+ goto fault;
+ memcpy(buf, &d4, 4);
+ break;
+ }
+ case 8: {
+ u64 d8;
+ u64 __user *s = (u64 __user *)src;
+
+ if (__get_user(d8, s))
+ goto fault;
+ memcpy(buf, &d8, 8);
+ break;
+ }
+ default:
+ WARN_ONCE(1, "%s: Invalid size: %zu\n", __func__, size);
+ return ES_UNSUPPORTED;
+ }
+
+ return ES_OK;
+
+fault:
+ if (user_mode(ctxt->regs))
+ error_code |= X86_PF_USER;
+
+ ctxt->fi.vector = X86_TRAP_PF;
+ ctxt->fi.error_code = error_code;
+ ctxt->fi.cr2 = (unsigned long)src;
+
+ return ES_EXCEPTION;
+}
+
+#define sev_printk(fmt, ...) printk(fmt, ##__VA_ARGS__)
+
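+/*
+ * The exit-code handlers shared with the boot stages live in vc-shared.c.
+ * Building them by #include lets each environment supply its own helpers,
+ * such as the vc_read_mem()/vc_write_mem() and sev_printk() flavors above.
+ */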
+#include "vc-shared.c"
+
+/* Writes to the SVSM CAA MSR are ignored */
+static enum es_result __vc_handle_msr_caa(struct pt_regs *regs, bool write)
+{
+ if (write)
+ return ES_OK;
+
+ regs->ax = lower_32_bits(this_cpu_read(svsm_caa_pa));
+ regs->dx = upper_32_bits(this_cpu_read(svsm_caa_pa));
+
+ return ES_OK;
+}
+
+/*
+ * TSC related accesses should not exit to the hypervisor when a guest is
+ * executing with Secure TSC enabled, so special handling is required for
+ * accesses of MSR_IA32_TSC and MSR_AMD64_GUEST_TSC_FREQ.
+ */
+static enum es_result __vc_handle_secure_tsc_msrs(struct pt_regs *regs, bool write)
+{
+ u64 tsc;
+
+ /*
+ * GUEST_TSC_FREQ should not be intercepted when Secure TSC is enabled.
+ * Terminate the SNP guest when the interception is enabled.
+ */
+ if (regs->cx == MSR_AMD64_GUEST_TSC_FREQ)
+ return ES_VMM_ERROR;
+
+ /*
+ * Writes: Writing to MSR_IA32_TSC can cause subsequent reads of the TSC
+ * to return undefined values, so ignore all writes.
+ *
+ * Reads: Reads of MSR_IA32_TSC should return the current TSC value, use
+ * the value returned by rdtsc_ordered().
+ */
+ if (write) {
+ WARN_ONCE(1, "TSC MSR writes are verboten!\n");
+ return ES_OK;
+ }
+
+ tsc = rdtsc_ordered();
+ regs->ax = lower_32_bits(tsc);
+ regs->dx = upper_32_bits(tsc);
+
+ return ES_OK;
+}
+
+static enum es_result vc_handle_msr(struct ghcb *ghcb, struct es_em_ctxt *ctxt)
+{
+ struct pt_regs *regs = ctxt->regs;
+ enum es_result ret;
+ bool write;
+
+ /* Is it a WRMSR? */
+ write = ctxt->insn.opcode.bytes[1] == 0x30;
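+ /* WRMSR is 0f 30 and RDMSR is 0f 32, so opcode byte 1 tells them apart */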
+
+ switch (regs->cx) {
+ case MSR_SVSM_CAA:
+ return __vc_handle_msr_caa(regs, write);
+ case MSR_IA32_TSC:
+ case MSR_AMD64_GUEST_TSC_FREQ:
+ if (sev_status & MSR_AMD64_SNP_SECURE_TSC)
+ return __vc_handle_secure_tsc_msrs(regs, write);
+ break;
+ default:
+ break;
+ }
+
+ ghcb_set_rcx(ghcb, regs->cx);
+ if (write) {
+ ghcb_set_rax(ghcb, regs->ax);
+ ghcb_set_rdx(ghcb, regs->dx);
+ }
+
+ ret = sev_es_ghcb_hv_call(ghcb, ctxt, SVM_EXIT_MSR, write, 0);
+
+ if ((ret == ES_OK) && !write) {
+ regs->ax = ghcb->save.rax;
+ regs->dx = ghcb->save.rdx;
+ }
+
+ return ret;
+}
+
+static void __init vc_early_forward_exception(struct es_em_ctxt *ctxt)
+{
+ int trapnr = ctxt->fi.vector;
+
+ if (trapnr == X86_TRAP_PF)
+ native_write_cr2(ctxt->fi.cr2);
+
+ ctxt->regs->orig_ax = ctxt->fi.error_code;
+ do_early_exception(ctxt->regs, trapnr);
+}
+
+static long *vc_insn_get_rm(struct es_em_ctxt *ctxt)
+{
+ long *reg_array;
+ int offset;
+
+ reg_array = (long *)ctxt->regs;
+ offset = insn_get_modrm_rm_off(&ctxt->insn, ctxt->regs);
+
+ if (offset < 0)
+ return NULL;
+
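+ /* insn_get_modrm_rm_off() returned a byte offset into pt_regs - index it as an array of longs */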
+ offset /= sizeof(long);
+
+ return reg_array + offset;
+}
+
+static enum es_result vc_do_mmio(struct ghcb *ghcb, struct es_em_ctxt *ctxt,
+ unsigned int bytes, bool read)
+{
+ u64 exit_code, exit_info_1, exit_info_2;
+ unsigned long ghcb_pa = __pa(ghcb);
+ enum es_result res;
+ phys_addr_t paddr;
+ void __user *ref;
+
+ ref = insn_get_addr_ref(&ctxt->insn, ctxt->regs);
+ if (ref == (void __user *)-1L)
+ return ES_UNSUPPORTED;
+
+ exit_code = read ? SVM_VMGEXIT_MMIO_READ : SVM_VMGEXIT_MMIO_WRITE;
+
+ res = vc_slow_virt_to_phys(ghcb, ctxt, (unsigned long)ref, &paddr);
+ if (res != ES_OK) {
+ if (res == ES_EXCEPTION && !read)
+ ctxt->fi.error_code |= X86_PF_WRITE;
+
+ return res;
+ }
+
+ exit_info_1 = paddr;
+ /* Can never be greater than 8 */
+ exit_info_2 = bytes;
+
+ ghcb_set_sw_scratch(ghcb, ghcb_pa + offsetof(struct ghcb, shared_buffer));
+
+ return sev_es_ghcb_hv_call(ghcb, ctxt, exit_code, exit_info_1, exit_info_2);
+}
+
+/*
+ * The MOVS instruction has two memory operands, which raises the
+ * problem that it is not known whether the access to the source or the
+ * destination caused the #VC exception (and hence whether an MMIO read
+ * or write operation needs to be emulated).
+ *
+ * Instead of playing games with walking page-tables and trying to guess
+ * whether the source or destination is an MMIO range, split the move
+ * into two operations, a read and a write with only one memory operand.
+ * This will cause a nested #VC exception on the MMIO address which can
+ * then be handled.
+ *
+ * This implementation has the benefit that it also supports MOVS where
+ * source _and_ destination are MMIO regions.
+ *
+ * It will slow MOVS on MMIO down a lot, but in SEV-ES guests it is a
+ * rare operation. If it turns out to be a performance problem the split
+ * operations can be moved to memcpy_fromio() and memcpy_toio().
+ */
+static enum es_result vc_handle_mmio_movs(struct es_em_ctxt *ctxt,
+ unsigned int bytes)
+{
+ unsigned long ds_base, es_base;
+ unsigned char *src, *dst;
+ unsigned char buffer[8];
+ enum es_result ret;
+ bool rep;
+ int off;
+
+ ds_base = insn_get_seg_base(ctxt->regs, INAT_SEG_REG_DS);
+ es_base = insn_get_seg_base(ctxt->regs, INAT_SEG_REG_ES);
+
+ if (ds_base == -1L || es_base == -1L) {
+ ctxt->fi.vector = X86_TRAP_GP;
+ ctxt->fi.error_code = 0;
+ return ES_EXCEPTION;
+ }
+
+ src = ds_base + (unsigned char *)ctxt->regs->si;
+ dst = es_base + (unsigned char *)ctxt->regs->di;
+
+ ret = vc_read_mem(ctxt, src, buffer, bytes);
+ if (ret != ES_OK)
+ return ret;
+
+ ret = vc_write_mem(ctxt, dst, buffer, bytes);
+ if (ret != ES_OK)
+ return ret;
+
+ if (ctxt->regs->flags & X86_EFLAGS_DF)
+ off = -bytes;
+ else
+ off = bytes;
+
+ ctxt->regs->si += off;
+ ctxt->regs->di += off;
+
+ rep = insn_has_rep_prefix(&ctxt->insn);
+ if (rep)
+ ctxt->regs->cx -= 1;
+
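+ /*
+ * ES_RETRY leaves RIP untouched, so the MOVS re-executes and the next
+ * iteration traps into the #VC handler again.
+ */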
+ if (!rep || ctxt->regs->cx == 0)
+ return ES_OK;
+ else
+ return ES_RETRY;
+}
+
+static enum es_result vc_handle_mmio(struct ghcb *ghcb, struct es_em_ctxt *ctxt)
+{
+ struct insn *insn = &ctxt->insn;
+ enum insn_mmio_type mmio;
+ unsigned int bytes = 0;
+ enum es_result ret;
+ u8 sign_byte;
+ long *reg_data;
+
+ mmio = insn_decode_mmio(insn, &bytes);
+ if (mmio == INSN_MMIO_DECODE_FAILED)
+ return ES_DECODE_FAILED;
+
+ if (mmio != INSN_MMIO_WRITE_IMM && mmio != INSN_MMIO_MOVS) {
+ reg_data = insn_get_modrm_reg_ptr(insn, ctxt->regs);
+ if (!reg_data)
+ return ES_DECODE_FAILED;
+ }
+
+ if (user_mode(ctxt->regs))
+ return ES_UNSUPPORTED;
+
+ switch (mmio) {
+ case INSN_MMIO_WRITE:
+ memcpy(ghcb->shared_buffer, reg_data, bytes);
+ ret = vc_do_mmio(ghcb, ctxt, bytes, false);
+ break;
+ case INSN_MMIO_WRITE_IMM:
+ memcpy(ghcb->shared_buffer, insn->immediate1.bytes, bytes);
+ ret = vc_do_mmio(ghcb, ctxt, bytes, false);
+ break;
+ case INSN_MMIO_READ:
+ ret = vc_do_mmio(ghcb, ctxt, bytes, true);
+ if (ret)
+ break;
+
+ /* Zero-extend for 32-bit operation */
+ if (bytes == 4)
+ *reg_data = 0;
+
+ memcpy(reg_data, ghcb->shared_buffer, bytes);
+ break;
+ case INSN_MMIO_READ_ZERO_EXTEND:
+ ret = vc_do_mmio(ghcb, ctxt, bytes, true);
+ if (ret)
+ break;
+
+ /* Zero extend based on operand size */
+ memset(reg_data, 0, insn->opnd_bytes);
+ memcpy(reg_data, ghcb->shared_buffer, bytes);
+ break;
+ case INSN_MMIO_READ_SIGN_EXTEND:
+ ret = vc_do_mmio(ghcb, ctxt, bytes, true);
+ if (ret)
+ break;
+
+ if (bytes == 1) {
+ u8 *val = (u8 *)ghcb->shared_buffer;
+
+ sign_byte = (*val & 0x80) ? 0xff : 0x00;
+ } else {
+ u16 *val = (u16 *)ghcb->shared_buffer;
+
+ sign_byte = (*val & 0x8000) ? 0xff : 0x00;
+ }
+
+ /* Sign extend based on operand size */
+ memset(reg_data, sign_byte, insn->opnd_bytes);
+ memcpy(reg_data, ghcb->shared_buffer, bytes);
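+ /* e.g. a 1-byte read of 0x90 into a 32-bit operand yields 0xffffff90 */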
+ break;
+ case INSN_MMIO_MOVS:
+ ret = vc_handle_mmio_movs(ctxt, bytes);
+ break;
+ default:
+ ret = ES_UNSUPPORTED;
+ break;
+ }
+
+ return ret;
+}
+
+static enum es_result vc_handle_dr7_write(struct ghcb *ghcb,
+ struct es_em_ctxt *ctxt)
+{
+ struct sev_es_runtime_data *data = this_cpu_read(runtime_data);
+ long val, *reg = vc_insn_get_rm(ctxt);
+ enum es_result ret;
+
+ if (sev_status & MSR_AMD64_SNP_DEBUG_SWAP)
+ return ES_VMM_ERROR;
+
+ if (!reg)
+ return ES_DECODE_FAILED;
+
+ val = *reg;
+
+ /* Upper 32 bits must be written as zeroes */
+ if (val >> 32) {
+ ctxt->fi.vector = X86_TRAP_GP;
+ ctxt->fi.error_code = 0;
+ return ES_EXCEPTION;
+ }
+
+ /* Clear out other reserved bits and set bit 10 */
+ val = (val & 0xffff23ffL) | BIT(10);
+
+ /* Early non-zero writes to DR7 are not supported */
+ if (!data && (val & ~DR7_RESET_VALUE))
+ return ES_UNSUPPORTED;
+
+ /* Using a value of 0 for ExitInfo1 means RAX holds the value */
+ ghcb_set_rax(ghcb, val);
+ ret = sev_es_ghcb_hv_call(ghcb, ctxt, SVM_EXIT_WRITE_DR7, 0, 0);
+ if (ret != ES_OK)
+ return ret;
+
+ if (data)
+ data->dr7 = val;
+
+ return ES_OK;
+}
+
+static enum es_result vc_handle_dr7_read(struct ghcb *ghcb,
+ struct es_em_ctxt *ctxt)
+{
+ struct sev_es_runtime_data *data = this_cpu_read(runtime_data);
+ long *reg = vc_insn_get_rm(ctxt);
+
+ if (sev_status & MSR_AMD64_SNP_DEBUG_SWAP)
+ return ES_VMM_ERROR;
+
+ if (!reg)
+ return ES_DECODE_FAILED;
+
+ if (data)
+ *reg = data->dr7;
+ else
+ *reg = DR7_RESET_VALUE;
+
+ return ES_OK;
+}
+
+static enum es_result vc_handle_wbinvd(struct ghcb *ghcb,
+ struct es_em_ctxt *ctxt)
+{
+ return sev_es_ghcb_hv_call(ghcb, ctxt, SVM_EXIT_WBINVD, 0, 0);
+}
+
+static enum es_result vc_handle_rdpmc(struct ghcb *ghcb, struct es_em_ctxt *ctxt)
+{
+ enum es_result ret;
+
+ ghcb_set_rcx(ghcb, ctxt->regs->cx);
+
+ ret = sev_es_ghcb_hv_call(ghcb, ctxt, SVM_EXIT_RDPMC, 0, 0);
+ if (ret != ES_OK)
+ return ret;
+
+ if (!(ghcb_rax_is_valid(ghcb) && ghcb_rdx_is_valid(ghcb)))
+ return ES_VMM_ERROR;
+
+ ctxt->regs->ax = ghcb->save.rax;
+ ctxt->regs->dx = ghcb->save.rdx;
+
+ return ES_OK;
+}
+
+static enum es_result vc_handle_monitor(struct ghcb *ghcb,
+ struct es_em_ctxt *ctxt)
+{
+ /*
+ * Treat it as a NOP and do not leak a physical address to the
+ * hypervisor.
+ */
+ return ES_OK;
+}
+
+static enum es_result vc_handle_mwait(struct ghcb *ghcb,
+ struct es_em_ctxt *ctxt)
+{
+ /* Treat the same as MONITOR/MONITORX */
+ return ES_OK;
+}
+
+static enum es_result vc_handle_vmmcall(struct ghcb *ghcb,
+ struct es_em_ctxt *ctxt)
+{
+ enum es_result ret;
+
+ ghcb_set_rax(ghcb, ctxt->regs->ax);
+ ghcb_set_cpl(ghcb, user_mode(ctxt->regs) ? 3 : 0);
+
+ if (x86_platform.hyper.sev_es_hcall_prepare)
+ x86_platform.hyper.sev_es_hcall_prepare(ghcb, ctxt->regs);
+
+ ret = sev_es_ghcb_hv_call(ghcb, ctxt, SVM_EXIT_VMMCALL, 0, 0);
+ if (ret != ES_OK)
+ return ret;
+
+ if (!ghcb_rax_is_valid(ghcb))
+ return ES_VMM_ERROR;
+
+ ctxt->regs->ax = ghcb->save.rax;
+
+ /*
+ * Call sev_es_hcall_finish() after regs->ax is already set.
+ * This allows the hypervisor handler to overwrite it again if
+ * necessary.
+ */
+ if (x86_platform.hyper.sev_es_hcall_finish &&
+ !x86_platform.hyper.sev_es_hcall_finish(ghcb, ctxt->regs))
+ return ES_VMM_ERROR;
+
+ return ES_OK;
+}
+
+static enum es_result vc_handle_trap_ac(struct ghcb *ghcb,
+ struct es_em_ctxt *ctxt)
+{
+ /*
+ * Calling exc_alignment_check() directly does not work, because it
+ * enables IRQs and the GHCB is active. Forward the exception and call
+ * it later from vc_forward_exception().
+ */
+ ctxt->fi.vector = X86_TRAP_AC;
+ ctxt->fi.error_code = 0;
+ return ES_EXCEPTION;
+}
+
+static enum es_result vc_handle_exitcode(struct es_em_ctxt *ctxt,
+ struct ghcb *ghcb,
+ unsigned long exit_code)
+{
+ enum es_result result = vc_check_opcode_bytes(ctxt, exit_code);
+
+ if (result != ES_OK)
+ return result;
+
+ switch (exit_code) {
+ case SVM_EXIT_READ_DR7:
+ result = vc_handle_dr7_read(ghcb, ctxt);
+ break;
+ case SVM_EXIT_WRITE_DR7:
+ result = vc_handle_dr7_write(ghcb, ctxt);
+ break;
+ case SVM_EXIT_EXCP_BASE + X86_TRAP_AC:
+ result = vc_handle_trap_ac(ghcb, ctxt);
+ break;
+ case SVM_EXIT_RDTSC:
+ case SVM_EXIT_RDTSCP:
+ result = vc_handle_rdtsc(ghcb, ctxt, exit_code);
+ break;
+ case SVM_EXIT_RDPMC:
+ result = vc_handle_rdpmc(ghcb, ctxt);
+ break;
+ case SVM_EXIT_INVD:
+ pr_err_ratelimited("#VC exception for INVD??? Seriously???\n");
+ result = ES_UNSUPPORTED;
+ break;
+ case SVM_EXIT_CPUID:
+ result = vc_handle_cpuid(ghcb, ctxt);
+ break;
+ case SVM_EXIT_IOIO:
+ result = vc_handle_ioio(ghcb, ctxt);
+ break;
+ case SVM_EXIT_MSR:
+ result = vc_handle_msr(ghcb, ctxt);
+ break;
+ case SVM_EXIT_VMMCALL:
+ result = vc_handle_vmmcall(ghcb, ctxt);
+ break;
+ case SVM_EXIT_WBINVD:
+ result = vc_handle_wbinvd(ghcb, ctxt);
+ break;
+ case SVM_EXIT_MONITOR:
+ result = vc_handle_monitor(ghcb, ctxt);
+ break;
+ case SVM_EXIT_MWAIT:
+ result = vc_handle_mwait(ghcb, ctxt);
+ break;
+ case SVM_EXIT_NPF:
+ result = vc_handle_mmio(ghcb, ctxt);
+ break;
+ default:
+ /*
+ * Unexpected #VC exception
+ */
+ result = ES_UNSUPPORTED;
+ }
+
+ return result;
+}
+
+static __always_inline bool is_vc2_stack(unsigned long sp)
+{
+ return (sp >= __this_cpu_ist_bottom_va(VC2) && sp < __this_cpu_ist_top_va(VC2));
+}
+
+static __always_inline bool vc_from_invalid_context(struct pt_regs *regs)
+{
+ unsigned long sp, prev_sp;
+
+ sp = (unsigned long)regs;
+ prev_sp = regs->sp;
+
+ /*
+ * If the code was already executing on the VC2 stack when the #VC
+ * happened, let it proceed to the normal handling routine. This way the
+ * code executing on the VC2 stack can cause #VC exceptions to get handled.
+ */
+ return is_vc2_stack(sp) && !is_vc2_stack(prev_sp);
+}
+
+static bool vc_raw_handle_exception(struct pt_regs *regs, unsigned long error_code)
+{
+ struct ghcb_state state;
+ struct es_em_ctxt ctxt;
+ enum es_result result;
+ struct ghcb *ghcb;
+ bool ret = true;
+
+ ghcb = __sev_get_ghcb(&state);
+
+ vc_ghcb_invalidate(ghcb);
+ result = vc_init_em_ctxt(&ctxt, regs, error_code);
+
+ if (result == ES_OK)
+ result = vc_handle_exitcode(&ctxt, ghcb, error_code);
+
+ __sev_put_ghcb(&state);
+
+ /* Done - now check the result */
+ switch (result) {
+ case ES_OK:
+ vc_finish_insn(&ctxt);
+ break;
+ case ES_UNSUPPORTED:
+ pr_err_ratelimited("Unsupported exit-code 0x%02lx in #VC exception (IP: 0x%lx)\n",
+ error_code, regs->ip);
+ ret = false;
+ break;
+ case ES_VMM_ERROR:
+ pr_err_ratelimited("Failure in communication with VMM (exit-code 0x%02lx IP: 0x%lx)\n",
+ error_code, regs->ip);
+ ret = false;
+ break;
+ case ES_DECODE_FAILED:
+ pr_err_ratelimited("Failed to decode instruction (exit-code 0x%02lx IP: 0x%lx)\n",
+ error_code, regs->ip);
+ ret = false;
+ break;
+ case ES_EXCEPTION:
+ vc_forward_exception(&ctxt);
+ break;
+ case ES_RETRY:
+ /* Nothing to do */
+ break;
+ default:
+ pr_emerg("Unknown result in %s():%d\n", __func__, result);
+ /*
+ * Emulating the instruction which caused the #VC exception
+ * failed - can't continue so print debug information
+ */
+ BUG();
+ }
+
+ return ret;
+}
+
+static __always_inline bool vc_is_db(unsigned long error_code)
+{
+ return error_code == SVM_EXIT_EXCP_BASE + X86_TRAP_DB;
+}
+
+/*
+ * Runtime #VC exception handler when raised from kernel mode. Runs in NMI mode
+ * and will panic when an error happens.
+ */
+DEFINE_IDTENTRY_VC_KERNEL(exc_vmm_communication)
+{
+ irqentry_state_t irq_state;
+
+ /*
+ * With the current implementation it is always possible to switch to a
+ * safe stack because #VC exceptions only happen at known places, like
+ * intercepted instructions or accesses to MMIO areas/IO ports. They can
+ * also happen with code instrumentation when the hypervisor intercepts
+ * #DB, but the critical paths are forbidden to be instrumented, so #DB
+ * exceptions currently also only happen in safe places.
+ *
+ * But keep this here in case the noinstr annotations are violated due
+ * to a bug elsewhere.
+ */
+ if (unlikely(vc_from_invalid_context(regs))) {
+ instrumentation_begin();
+ panic("Can't handle #VC exception from unsupported context\n");
+ instrumentation_end();
+ }
+
+ /*
+ * Handle #DB before calling into !noinstr code to avoid recursive #DB.
+ */
+ if (vc_is_db(error_code)) {
+ exc_debug(regs);
+ return;
+ }
+
+ irq_state = irqentry_nmi_enter(regs);
+
+ instrumentation_begin();
+
+ if (!vc_raw_handle_exception(regs, error_code)) {
+ /* Show some debug info */
+ show_regs(regs);
+
+ /* Ask hypervisor to sev_es_terminate */
+ sev_es_terminate(SEV_TERM_SET_GEN, GHCB_SEV_ES_GEN_REQ);
+
+ /* If that fails and we get here - just panic */
+ panic("Returned from Terminate-Request to Hypervisor\n");
+ }
+
+ instrumentation_end();
+ irqentry_nmi_exit(regs, irq_state);
+}
+
+/*
+ * Runtime #VC exception handler when raised from user mode. Runs in IRQ mode
+ * and will kill the current task with SIGBUS when an error happens.
+ */
+DEFINE_IDTENTRY_VC_USER(exc_vmm_communication)
+{
+ /*
+ * Handle #DB before calling into !noinstr code to avoid recursive #DB.
+ */
+ if (vc_is_db(error_code)) {
+ noist_exc_debug(regs);
+ return;
+ }
+
+ irqentry_enter_from_user_mode(regs);
+ instrumentation_begin();
+
+ if (!vc_raw_handle_exception(regs, error_code)) {
+ /*
+ * Do not kill the machine if user-space triggered the
+ * exception. Send SIGBUS instead and let user-space deal with
+ * it.
+ */
+ force_sig_fault(SIGBUS, BUS_OBJERR, (void __user *)0);
+ }
+
+ instrumentation_end();
+ irqentry_exit_to_user_mode(regs);
+}
+
+bool __init handle_vc_boot_ghcb(struct pt_regs *regs)
+{
+ unsigned long exit_code = regs->orig_ax;
+ struct es_em_ctxt ctxt;
+ enum es_result result;
+
+ vc_ghcb_invalidate(boot_ghcb);
+
+ result = vc_init_em_ctxt(&ctxt, regs, exit_code);
+ if (result == ES_OK)
+ result = vc_handle_exitcode(&ctxt, boot_ghcb, exit_code);
+
+ /* Done - now check the result */
+ switch (result) {
+ case ES_OK:
+ vc_finish_insn(&ctxt);
+ break;
+ case ES_UNSUPPORTED:
+ early_printk("PANIC: Unsupported exit-code 0x%02lx in early #VC exception (IP: 0x%lx)\n",
+ exit_code, regs->ip);
+ goto fail;
+ case ES_VMM_ERROR:
+ early_printk("PANIC: Failure in communication with VMM (exit-code 0x%02lx IP: 0x%lx)\n",
+ exit_code, regs->ip);
+ goto fail;
+ case ES_DECODE_FAILED:
+ early_printk("PANIC: Failed to decode instruction (exit-code 0x%02lx IP: 0x%lx)\n",
+ exit_code, regs->ip);
+ goto fail;
+ case ES_EXCEPTION:
+ vc_early_forward_exception(&ctxt);
+ break;
+ case ES_RETRY:
+ /* Nothing to do */
+ break;
+ default:
+ BUG();
+ }
+
+ return true;
+
+fail:
+ show_regs(regs);
+
+ sev_es_terminate(SEV_TERM_SET_GEN, GHCB_SEV_ES_GEN_REQ);
+}
+
diff --git a/arch/x86/coco/sev/vc-shared.c b/arch/x86/coco/sev/vc-shared.c
new file mode 100644
index 0000000..2c0ab0f
--- /dev/null
+++ b/arch/x86/coco/sev/vc-shared.c
@@ -0,0 +1,504 @@
+// SPDX-License-Identifier: GPL-2.0
+
+static enum es_result vc_check_opcode_bytes(struct es_em_ctxt *ctxt,
+ unsigned long exit_code)
+{
+ unsigned int opcode = (unsigned int)ctxt->insn.opcode.value;
+ u8 modrm = ctxt->insn.modrm.value;
+
+ switch (exit_code) {
+
+ case SVM_EXIT_IOIO:
+ case SVM_EXIT_NPF:
+ /* handled separately */
+ return ES_OK;
+
+ case SVM_EXIT_CPUID:
+ if (opcode == 0xa20f)
+ return ES_OK;
+ break;
+
+ case SVM_EXIT_INVD:
+ if (opcode == 0x080f)
+ return ES_OK;
+ break;
+
+ case SVM_EXIT_MONITOR:
+ /* MONITOR and MONITORX instructions generate the same error code */
+ if (opcode == 0x010f && (modrm == 0xc8 || modrm == 0xfa))
+ return ES_OK;
+ break;
+
+ case SVM_EXIT_MWAIT:
+ /* MWAIT and MWAITX instructions generate the same error code */
+ if (opcode == 0x010f && (modrm == 0xc9 || modrm == 0xfb))
+ return ES_OK;
+ break;
+
+ case SVM_EXIT_MSR:
+ /* RDMSR */
+ if (opcode == 0x320f ||
+ /* WRMSR */
+ opcode == 0x300f)
+ return ES_OK;
+ break;
+
+ case SVM_EXIT_RDPMC:
+ if (opcode == 0x330f)
+ return ES_OK;
+ break;
+
+ case SVM_EXIT_RDTSC:
+ if (opcode == 0x310f)
+ return ES_OK;
+ break;
+
+ case SVM_EXIT_RDTSCP:
+ if (opcode == 0x010f && modrm == 0xf9)
+ return ES_OK;
+ break;
+
+ case SVM_EXIT_READ_DR7:
+ if (opcode == 0x210f &&
+ X86_MODRM_REG(ctxt->insn.modrm.value) == 7)
+ return ES_OK;
+ break;
+
+ case SVM_EXIT_VMMCALL:
+ if (opcode == 0x010f && modrm == 0xd9)
+ return ES_OK;
+
+ break;
+
+ case SVM_EXIT_WRITE_DR7:
+ if (opcode == 0x230f &&
+ X86_MODRM_REG(ctxt->insn.modrm.value) == 7)
+ return ES_OK;
+ break;
+
+ case SVM_EXIT_WBINVD:
+ if (opcode == 0x90f)
+ return ES_OK;
+ break;
+
+ default:
+ break;
+ }
+
+ sev_printk(KERN_ERR "Wrong/unhandled opcode bytes: 0x%x, exit_code: 0x%lx, rIP: 0x%lx\n",
+ opcode, exit_code, ctxt->regs->ip);
+
+ return ES_UNSUPPORTED;
+}
+
+static bool vc_decoding_needed(unsigned long exit_code)
+{
+ /* Exceptions don't require decoding the instruction */
+ return !(exit_code >= SVM_EXIT_EXCP_BASE &&
+ exit_code <= SVM_EXIT_LAST_EXCP);
+}
+
+static enum es_result vc_init_em_ctxt(struct es_em_ctxt *ctxt,
+ struct pt_regs *regs,
+ unsigned long exit_code)
+{
+ enum es_result ret = ES_OK;
+
+ memset(ctxt, 0, sizeof(*ctxt));
+ ctxt->regs = regs;
+
+ if (vc_decoding_needed(exit_code))
+ ret = vc_decode_insn(ctxt);
+
+ return ret;
+}
+
+static void vc_finish_insn(struct es_em_ctxt *ctxt)
+{
+ ctxt->regs->ip += ctxt->insn.length;
+}
+
+static enum es_result vc_insn_string_check(struct es_em_ctxt *ctxt,
+ unsigned long address,
+ bool write)
+{
+ if (user_mode(ctxt->regs) && fault_in_kernel_space(address)) {
+ ctxt->fi.vector = X86_TRAP_PF;
+ ctxt->fi.error_code = X86_PF_USER;
+ ctxt->fi.cr2 = address;
+ if (write)
+ ctxt->fi.error_code |= X86_PF_WRITE;
+
+ return ES_EXCEPTION;
+ }
+
+ return ES_OK;
+}
+
+static enum es_result vc_insn_string_read(struct es_em_ctxt *ctxt,
+ void *src, char *buf,
+ unsigned int data_size,
+ unsigned int count,
+ bool backwards)
+{
+ int i, b = backwards ? -1 : 1;
+ unsigned long address = (unsigned long)src;
+ enum es_result ret;
+
+ ret = vc_insn_string_check(ctxt, address, false);
+ if (ret != ES_OK)
+ return ret;
+
+ for (i = 0; i < count; i++) {
+ void *s = src + (i * data_size * b);
+ char *d = buf + (i * data_size);
+
+ ret = vc_read_mem(ctxt, s, d, data_size);
+ if (ret != ES_OK)
+ break;
+ }
+
+ return ret;
+}
+
+static enum es_result vc_insn_string_write(struct es_em_ctxt *ctxt,
+ void *dst, char *buf,
+ unsigned int data_size,
+ unsigned int count,
+ bool backwards)
+{
+ int i, s = backwards ? -1 : 1;
+ unsigned long address = (unsigned long)dst;
+ enum es_result ret;
+
+ ret = vc_insn_string_check(ctxt, address, true);
+ if (ret != ES_OK)
+ return ret;
+
+ for (i = 0; i < count; i++) {
+ void *d = dst + (i * data_size * s);
+ char *b = buf + (i * data_size);
+
+ ret = vc_write_mem(ctxt, d, b, data_size);
+ if (ret != ES_OK)
+ break;
+ }
+
+ return ret;
+}
+
+#define IOIO_TYPE_STR BIT(2)
+#define IOIO_TYPE_IN 1
+#define IOIO_TYPE_INS (IOIO_TYPE_IN | IOIO_TYPE_STR)
+#define IOIO_TYPE_OUT 0
+#define IOIO_TYPE_OUTS (IOIO_TYPE_OUT | IOIO_TYPE_STR)
+
+#define IOIO_REP BIT(3)
+
+#define IOIO_ADDR_64 BIT(9)
+#define IOIO_ADDR_32 BIT(8)
+#define IOIO_ADDR_16 BIT(7)
+
+#define IOIO_DATA_32 BIT(6)
+#define IOIO_DATA_16 BIT(5)
+#define IOIO_DATA_8 BIT(4)
+
+#define IOIO_SEG_ES (0 << 10)
+#define IOIO_SEG_DS (3 << 10)
+
+static enum es_result vc_ioio_exitinfo(struct es_em_ctxt *ctxt, u64 *exitinfo)
+{
+ struct insn *insn = &ctxt->insn;
+ size_t size;
+ u64 port;
+
+ *exitinfo = 0;
+
+ switch (insn->opcode.bytes[0]) {
+ /* INS opcodes */
+ case 0x6c:
+ case 0x6d:
+ *exitinfo |= IOIO_TYPE_INS;
+ *exitinfo |= IOIO_SEG_ES;
+ port = ctxt->regs->dx & 0xffff;
+ break;
+
+ /* OUTS opcodes */
+ case 0x6e:
+ case 0x6f:
+ *exitinfo |= IOIO_TYPE_OUTS;
+ *exitinfo |= IOIO_SEG_DS;
+ port = ctxt->regs->dx & 0xffff;
+ break;
+
+ /* IN immediate opcodes */
+ case 0xe4:
+ case 0xe5:
+ *exitinfo |= IOIO_TYPE_IN;
+ port = (u8)insn->immediate.value & 0xffff;
+ break;
+
+ /* OUT immediate opcodes */
+ case 0xe6:
+ case 0xe7:
+ *exitinfo |= IOIO_TYPE_OUT;
+ port = (u8)insn->immediate.value & 0xffff;
+ break;
+
+ /* IN register opcodes */
+ case 0xec:
+ case 0xed:
+ *exitinfo |= IOIO_TYPE_IN;
+ port = ctxt->regs->dx & 0xffff;
+ break;
+
+ /* OUT register opcodes */
+ case 0xee:
+ case 0xef:
+ *exitinfo |= IOIO_TYPE_OUT;
+ port = ctxt->regs->dx & 0xffff;
+ break;
+
+ default:
+ return ES_DECODE_FAILED;
+ }
+
+ *exitinfo |= port << 16;
+
+ switch (insn->opcode.bytes[0]) {
+ case 0x6c:
+ case 0x6e:
+ case 0xe4:
+ case 0xe6:
+ case 0xec:
+ case 0xee:
+ /* Single byte opcodes */
+ *exitinfo |= IOIO_DATA_8;
+ size = 1;
+ break;
+ default:
+ /* Length determined by instruction parsing */
+ *exitinfo |= (insn->opnd_bytes == 2) ? IOIO_DATA_16
+ : IOIO_DATA_32;
+ size = (insn->opnd_bytes == 2) ? 2 : 4;
+ }
+
+ switch (insn->addr_bytes) {
+ case 2:
+ *exitinfo |= IOIO_ADDR_16;
+ break;
+ case 4:
+ *exitinfo |= IOIO_ADDR_32;
+ break;
+ case 8:
+ *exitinfo |= IOIO_ADDR_64;
+ break;
+ }
+
+ if (insn_has_rep_prefix(insn))
+ *exitinfo |= IOIO_REP;
+
+ return vc_ioio_check(ctxt, (u16)port, size);
+}
+
+static enum es_result vc_handle_ioio(struct ghcb *ghcb, struct es_em_ctxt *ctxt)
+{
+ struct pt_regs *regs = ctxt->regs;
+ u64 exit_info_1, exit_info_2;
+ enum es_result ret;
+
+ ret = vc_ioio_exitinfo(ctxt, &exit_info_1);
+ if (ret != ES_OK)
+ return ret;
+
+ if (exit_info_1 & IOIO_TYPE_STR) {
+
+ /* (REP) INS/OUTS */
+
+ bool df = ((regs->flags & X86_EFLAGS_DF) == X86_EFLAGS_DF);
+ unsigned int io_bytes, exit_bytes;
+ unsigned int ghcb_count, op_count;
+ unsigned long es_base;
+ u64 sw_scratch;
+
+ /*
+ * For the string variants with rep prefix the amount of in/out
+ * operations per #VC exception is limited so that the kernel
+ * has a chance to take interrupts and re-schedule while the
+ * instruction is emulated.
+ */
+ io_bytes = (exit_info_1 >> 4) & 0x7;
+ ghcb_count = sizeof(ghcb->shared_buffer) / io_bytes;
+
+ op_count = (exit_info_1 & IOIO_REP) ? regs->cx : 1;
+ exit_info_2 = min(op_count, ghcb_count);
+ exit_bytes = exit_info_2 * io_bytes;
+
+ es_base = insn_get_seg_base(ctxt->regs, INAT_SEG_REG_ES);
+
+ /* Read bytes of OUTS into the shared buffer */
+ if (!(exit_info_1 & IOIO_TYPE_IN)) {
+ ret = vc_insn_string_read(ctxt,
+ (void *)(es_base + regs->si),
+ ghcb->shared_buffer, io_bytes,
+ exit_info_2, df);
+ if (ret)
+ return ret;
+ }
+
+ /*
+ * Issue a VMGEXIT to the HV to consume the bytes from the
+ * shared buffer or to have it write them into the shared buffer
+ * depending on the instruction: OUTS or INS.
+ */
+ sw_scratch = __pa(ghcb) + offsetof(struct ghcb, shared_buffer);
+ ghcb_set_sw_scratch(ghcb, sw_scratch);
+ ret = sev_es_ghcb_hv_call(ghcb, ctxt, SVM_EXIT_IOIO,
+ exit_info_1, exit_info_2);
+ if (ret != ES_OK)
+ return ret;
+
+ /* Read bytes from shared buffer into the guest's destination. */
+ if (exit_info_1 & IOIO_TYPE_IN) {
+ ret = vc_insn_string_write(ctxt,
+ (void *)(es_base + regs->di),
+ ghcb->shared_buffer, io_bytes,
+ exit_info_2, df);
+ if (ret)
+ return ret;
+
+ if (df)
+ regs->di -= exit_bytes;
+ else
+ regs->di += exit_bytes;
+ } else {
+ if (df)
+ regs->si -= exit_bytes;
+ else
+ regs->si += exit_bytes;
+ }
+
+ if (exit_info_1 & IOIO_REP)
+ regs->cx -= exit_info_2;
+
+ ret = regs->cx ? ES_RETRY : ES_OK;
+
+ } else {
+
+ /* IN/OUT into/from rAX */
+
+ int bits = (exit_info_1 & 0x70) >> 1;
+ u64 rax = 0;
+
+ if (!(exit_info_1 & IOIO_TYPE_IN))
+ rax = lower_bits(regs->ax, bits);
+
+ ghcb_set_rax(ghcb, rax);
+
+ ret = sev_es_ghcb_hv_call(ghcb, ctxt, SVM_EXIT_IOIO, exit_info_1, 0);
+ if (ret != ES_OK)
+ return ret;
+
+ if (exit_info_1 & IOIO_TYPE_IN) {
+ if (!ghcb_rax_is_valid(ghcb))
+ return ES_VMM_ERROR;
+ regs->ax = lower_bits(ghcb->save.rax, bits);
+ }
+ }
+
+ return ret;
+}
+
+static int vc_handle_cpuid_snp(struct ghcb *ghcb, struct es_em_ctxt *ctxt)
+{
+ struct pt_regs *regs = ctxt->regs;
+ struct cpuid_leaf leaf;
+ int ret;
+
+ leaf.fn = regs->ax;
+ leaf.subfn = regs->cx;
+ ret = snp_cpuid(ghcb, ctxt, &leaf);
+ if (!ret) {
+ regs->ax = leaf.eax;
+ regs->bx = leaf.ebx;
+ regs->cx = leaf.ecx;
+ regs->dx = leaf.edx;
+ }
+
+ return ret;
+}
+
+static enum es_result vc_handle_cpuid(struct ghcb *ghcb,
+ struct es_em_ctxt *ctxt)
+{
+ struct pt_regs *regs = ctxt->regs;
+ u32 cr4 = native_read_cr4();
+ enum es_result ret;
+ int snp_cpuid_ret;
+
+ snp_cpuid_ret = vc_handle_cpuid_snp(ghcb, ctxt);
+ if (!snp_cpuid_ret)
+ return ES_OK;
+ if (snp_cpuid_ret != -EOPNOTSUPP)
+ return ES_VMM_ERROR;
+
+ ghcb_set_rax(ghcb, regs->ax);
+ ghcb_set_rcx(ghcb, regs->cx);
+
+ if (cr4 & X86_CR4_OSXSAVE)
+ /* Safe to read xcr0 */
+ ghcb_set_xcr0(ghcb, xgetbv(XCR_XFEATURE_ENABLED_MASK));
+ else
+ /* xgetbv will cause #GP - use reset value for xcr0 */
+ ghcb_set_xcr0(ghcb, 1);
+
+ ret = sev_es_ghcb_hv_call(ghcb, ctxt, SVM_EXIT_CPUID, 0, 0);
+ if (ret != ES_OK)
+ return ret;
+
+ if (!(ghcb_rax_is_valid(ghcb) &&
+ ghcb_rbx_is_valid(ghcb) &&
+ ghcb_rcx_is_valid(ghcb) &&
+ ghcb_rdx_is_valid(ghcb)))
+ return ES_VMM_ERROR;
+
+ regs->ax = ghcb->save.rax;
+ regs->bx = ghcb->save.rbx;
+ regs->cx = ghcb->save.rcx;
+ regs->dx = ghcb->save.rdx;
+
+ return ES_OK;
+}
+
+static enum es_result vc_handle_rdtsc(struct ghcb *ghcb,
+ struct es_em_ctxt *ctxt,
+ unsigned long exit_code)
+{
+ bool rdtscp = (exit_code == SVM_EXIT_RDTSCP);
+ enum es_result ret;
+
+ /*
+ * The hypervisor should not be intercepting RDTSC/RDTSCP when Secure
+ * TSC is enabled. A #VC exception will be generated if the RDTSC/RDTSCP
+ * instructions are being intercepted. If this should occur and Secure
+ * TSC is enabled, guest execution should be terminated as the guest
+ * cannot rely on the TSC value provided by the hypervisor.
+ */
+ if (sev_status & MSR_AMD64_SNP_SECURE_TSC)
+ return ES_VMM_ERROR;
+
+ ret = sev_es_ghcb_hv_call(ghcb, ctxt, exit_code, 0, 0);
+ if (ret != ES_OK)
+ return ret;
+
+ if (!(ghcb_rax_is_valid(ghcb) && ghcb_rdx_is_valid(ghcb) &&
+ (!rdtscp || ghcb_rcx_is_valid(ghcb))))
+ return ES_VMM_ERROR;
+
+ ctxt->regs->ax = ghcb->save.rax;
+ ctxt->regs->dx = ghcb->save.rdx;
+ if (rdtscp)
+ ctxt->regs->cx = ghcb->save.rcx;
+
+ return ES_OK;
+}
diff --git a/arch/x86/include/asm/sev-internal.h b/arch/x86/include/asm/sev-internal.h
index a78f972..b723208 100644
--- a/arch/x86/include/asm/sev-internal.h
+++ b/arch/x86/include/asm/sev-internal.h
@@ -3,7 +3,6 @@
#define DR7_RESET_VALUE 0x400
extern struct ghcb boot_ghcb_page;
-extern struct ghcb *boot_ghcb;
extern u64 sev_hv_features;
extern u64 sev_secrets_pa;
@@ -59,8 +58,6 @@ DECLARE_PER_CPU(struct sev_es_save_area *, sev_vmsa);
void early_set_pages_state(unsigned long vaddr, unsigned long paddr,
unsigned long npages, enum psc_op op);
-void __noreturn sev_es_terminate(unsigned int set, unsigned int reason);
-
DECLARE_PER_CPU(struct svsm_ca *, svsm_caa);
DECLARE_PER_CPU(u64, svsm_caa_pa);
@@ -100,11 +97,6 @@ static __always_inline void sev_es_wr_ghcb_msr(u64 val)
native_wrmsr(MSR_AMD64_SEV_ES_GHCB, low, high);
}
-enum es_result sev_es_ghcb_hv_call(struct ghcb *ghcb,
- struct es_em_ctxt *ctxt,
- u64 exit_code, u64 exit_info_1,
- u64 exit_info_2);
-
void snp_register_ghcb_early(unsigned long paddr);
bool sev_es_negotiate_protocol(void);
bool sev_es_check_cpu_features(void);
diff --git a/arch/x86/include/asm/sev.h b/arch/x86/include/asm/sev.h
index a8661df..6158893 100644
--- a/arch/x86/include/asm/sev.h
+++ b/arch/x86/include/asm/sev.h
@@ -521,6 +521,28 @@ static __always_inline void vc_ghcb_invalidate(struct ghcb *ghcb)
__builtin_memset(ghcb->save.valid_bitmap, 0, sizeof(ghcb->save.valid_bitmap));
}
+void vc_forward_exception(struct es_em_ctxt *ctxt);
+
+/* I/O parameters for CPUID-related helpers */
+struct cpuid_leaf {
+ u32 fn;
+ u32 subfn;
+ u32 eax;
+ u32 ebx;
+ u32 ecx;
+ u32 edx;
+};
+
+int snp_cpuid(struct ghcb *ghcb, struct es_em_ctxt *ctxt, struct cpuid_leaf *leaf);
+
+void __noreturn sev_es_terminate(unsigned int set, unsigned int reason);
+enum es_result sev_es_ghcb_hv_call(struct ghcb *ghcb,
+ struct es_em_ctxt *ctxt,
+ u64 exit_code, u64 exit_info_1,
+ u64 exit_info_2);
+
+extern struct ghcb *boot_ghcb;
+
#else /* !CONFIG_AMD_MEM_ENCRYPT */
#define snp_vmpl 0
^ permalink raw reply related [flat|nested] 59+ messages in thread
* [tip: x86/boot] x86/boot: Provide __pti_set_user_pgtbl() to startup code
2025-05-04 9:52 ` [RFT PATCH v2 15/23] x86/boot: Provide __pti_set_user_pgtbl() to startup code Ard Biesheuvel
2025-05-04 14:20 ` [tip: x86/boot] " tip-bot2 for Ard Biesheuvel
@ 2025-05-05 16:54 ` tip-bot2 for Ard Biesheuvel
1 sibling, 0 replies; 59+ messages in thread
From: tip-bot2 for Ard Biesheuvel @ 2025-05-05 16:54 UTC (permalink / raw)
To: linux-tip-commits
Cc: Ard Biesheuvel, Ingo Molnar, Borislav Petkov (AMD), Arnd Bergmann,
David Woodhouse, Dionna Amalie Glaze, H. Peter Anvin, Kees Cook,
Kevin Loughlin, Len Brown, Linus Torvalds, Rafael J. Wysocki,
Tom Lendacky, linux-efi, x86, linux-kernel
The following commit has been merged into the x86/boot branch of tip:
Commit-ID: f92d3fe32874e83986b9edc330ccc9bc9faaa92a
Gitweb: https://git.kernel.org/tip/f92d3fe32874e83986b9edc330ccc9bc9faaa92a
Author: Ard Biesheuvel <ardb@kernel.org>
AuthorDate: Sun, 04 May 2025 11:52:45 +02:00
Committer: Borislav Petkov (AMD) <bp@alien8.de>
CommitterDate: Mon, 05 May 2025 18:48:58 +02:00
x86/boot: Provide __pti_set_user_pgtbl() to startup code
The SME encryption startup code populates page tables using the ordinary
set_pXX() helpers, and in a PTI build, these will call out to
__pti_set_user_pgtbl() to manipulate the shadow copy of the page tables
for user space.
This is unneeded for the startup code, which only manipulates the
swapper page tables, and so this call could be avoided in this
particular case. So instead of exposing the ordinary
__pti_set_user_pgtbl() to the startup code after it gets confined into
its own symbol space, provide an alternative which just returns pgd,
which is always correct in the startup context.
Annotate it as __weak for now; this will be dropped in a subsequent
patch.
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: David Woodhouse <dwmw@amazon.co.uk>
Cc: Dionna Amalie Glaze <dionnaglaze@google.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Kevin Loughlin <kevinloughlin@google.com>
Cc: Len Brown <len.brown@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: linux-efi@vger.kernel.org
Link: https://lore.kernel.org/r/20250504095230.2932860-40-ardb+git@google.com
---
arch/x86/boot/startup/sme.c | 9 +++++++++
1 file changed, 9 insertions(+)
diff --git a/arch/x86/boot/startup/sme.c b/arch/x86/boot/startup/sme.c
index 5738b31..753cd20 100644
--- a/arch/x86/boot/startup/sme.c
+++ b/arch/x86/boot/startup/sme.c
@@ -564,3 +564,12 @@ void __head sme_enable(struct boot_params *bp)
cc_vendor = CC_VENDOR_AMD;
cc_set_mask(me_mask);
}
+
+#ifdef CONFIG_MITIGATION_PAGE_TABLE_ISOLATION
+/* Local version for startup code, which never operates on user page tables */
+__weak
+pgd_t __pti_set_user_pgtbl(pgd_t *pgdp, pgd_t pgd)
+{
+ return pgd;
+}
+#endif
^ permalink raw reply related [flat|nested] 59+ messages in thread
* Re: [RFT PATCH v2 03/23] x86/boot: Drop global variables keeping track of LA57 state
2025-05-04 14:58 ` Linus Torvalds
2025-05-04 19:33 ` Ard Biesheuvel
@ 2025-05-05 21:07 ` Ingo Molnar
2025-05-05 21:24 ` Ingo Molnar
2025-05-05 21:26 ` Linus Torvalds
1 sibling, 2 replies; 59+ messages in thread
From: Ingo Molnar @ 2025-05-05 21:07 UTC (permalink / raw)
To: Linus Torvalds
Cc: Ard Biesheuvel, linux-kernel, linux-efi, x86, Ard Biesheuvel,
Borislav Petkov, Dionna Amalie Glaze, Kevin Loughlin,
Tom Lendacky
* Linus Torvalds <torvalds@linux-foundation.org> wrote:
> Honestly, looking at this, I think we should fix the *users* of
> pgtable_l5_enabled().
Agreed.
> Because I think there are two distinct classes:
>
> - the stuff in <asm/pgtable.h> is exposed as the generic page table
> accessor macros to "real" code, and should probably use a static key
> (ie things like pgd_clear() / pgd_none() and friends)
>
> - but in code like __kernel_physical_mapping_init() feels to me like
> using the value in %cr4 actually makes sense
So out of curiosity I measured where exactly pgtable_l5_enabled() is
used - at least in terms of inlining frequency.
Here's the histogram of first-level pgtable_l5_enabled() inlining uses
on x86-64-defconfig (there's over 600 of them):
1 ffffffff844cde60 stat__pgtable_l5_enabled____p4d_free_tlb()
1 ffffffff844cde64 stat__pgtable_l5_enabled__pgd_populate_safe()
1 ffffffff844cde80 stat__pgtable_l5_enabled__native_set_p4d()
1 ffffffff844cde84 stat__pgtable_l5_enabled__mm_p4d_folded()
2 ffffffff844cde8c stat__pgtable_l5_enabled__paravirt_pgd_clear()
5 ffffffff844cde68 stat__pgtable_l5_enabled__pgd_populate()
13 ffffffff844cde98 stat__pgtable_l5_enabled____VIRTUAL_MASK_SHIFT
14 ffffffff844cde78 stat__pgtable_l5_enabled__pgd_present()
16 ffffffff844cde70 stat__pgtable_l5_enabled__pgd_bad()
16 ffffffff844cdea0 stat__pgtable_l5_enabled__other()
20 ffffffff844cde88 stat__pgtable_l5_enabled__paravirt_set_pgd()
29 ffffffff844cde5c stat__pgtable_l5_enabled__VMALLOC_SIZE_TB
46 ffffffff844cde7c stat__pgtable_l5_enabled__PTRS_PER_P4D
49 ffffffff844cde6c stat__pgtable_l5_enabled__pgd_none()
60 ffffffff844cde74 stat__pgtable_l5_enabled__p4d_offset()
156 ffffffff844cde9c stat__pgtable_l5_enabled__PGDIR_SHIFT
179 ffffffff844cde94 stat__pgtable_l5_enabled__MAX_PHYSMEM_BITS
---
609 TOTAL
Limitations:
- 'Inlining frequency' is admittedly only an imperfect, statistical
proxy for 'importance'.
- This metric doesn't properly discount boot-time-only use either,
although by a quick guesstimate it's less than 10% of all use: the
startup code uses pgtable_l5_enabled() only 13 times. (But there's
more boot-time use.)
Anyway, with these limitations in mind, we can see that the top 5
usecases cover about 80% of all uses:
- MAX_PHYSMEM_BITS: (inlined 179 times)
arch/x86/include/asm/sparsemem.h:# define MAX_PHYSMEM_BITS (pgtable_l5_enabled() ? 52 : 46)
This could be implemented via a precomputed, constant percpu value
(per_cpu__x86_MAX_PHYSMEM_BITS) of 52 vs. 46, eliminating not just
the CR4 access, but also a branch, at the cost of a percpu memory
access. (Which should still be a win on all microarchitectures IMO.)
Alternatively, since this value is a semi-constant of 52 vs. 46, we
could also, I suspect, ALTERNATIVES-patch MAX_PHYSMEM_BITS in as an
immediate constant value? Any reason this shouldn't work:
static inline unsigned int __MAX_PHYSMEM_BITS(void)
{
unsigned int bits;
asm_inline (ALTERNATIVE("movl $46, %0", "movl $52, %0", X86_FEATURE_LA57) :"=g" (bits));
return bits;
}
#define MAX_PHYSMEM_BITS __MAX_PHYSMEM_BITS()
... or something like that? This would result in the best code
generation IMO, by far. (It would even make use of the
zero-extension property of a 32-bit MOVL, further compressing the
opcode to only 5 bytes or so.)
We'd even create a secondary helper macro for this, something like:
#define ALTERNATIVE_CONST_U32(__val1, __val2, __feature) \
({ \
u32 __val; \
\
asm_inline (ALTERNATIVE("movl $" #__val1 ", %0", "movl $" __val2 ", %0", __feature) :"=g" (__val)); \
\
__val; \
})
...
#define MAX_PHYSMEM_BITS ALTERNATIVE_CONST_U32(46, 52, X86_FEATURE_LA57)
(Or so. Totally untested.)
- PGDIR_SHIFT: (inlined 156 times)
arch/x86/include/asm/pgtable_64_types.h:#define PGDIR_SHIFT (pgtable_l5_enabled() ? 48 : 39)
This too could be implemented via a precomputed constant percpu
value (per_cpu__x86_PGDIR_SHIFT), eliminating a branch as well, or
via ALTERNATIVE() immediate operand constants.
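E.g. with the (hypothetical, untested) ALTERNATIVE_CONST_U32() helper
sketched above:
	#define PGDIR_SHIFT ALTERNATIVE_CONST_U32(39, 48, X86_FEATURE_LA57)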
- p4d_offset(): (inlined 60 times)
static inline p4d_t *p4d_offset(pgd_t *pgd, unsigned long address)
{
if (!pgtable_l5_enabled())
return (p4d_t *)pgd;
return (p4d_t *)pgd_page_vaddr(*pgd) + p4d_index(address);
}
Here pgtable_l5_enabled() is used as a binary flag.
- pgd_none(): (inlined 49 times)
static inline int pgd_none(pgd_t pgd)
{
if (!pgtable_l5_enabled())
return 0;
return !native_pgd_val(pgd);
}
Binary flag use as well, although the compiler might eliminate the
branch here and replace it with 'AND !native_pgd_val(pgd)', because
native_pgd_val() is really simple. Or if the compiler doesn't do it,
we could code it up as a (hopefully) branch-less return of:
static inline int pgd_none(pgd_t pgd)
{
return pgtable_l5_enabled() & !native_pgd_val(pgd);
}
(If I got the conversion right.)
- PTRS_PER_P4D: (inlined 46 times)
arch/x86/include/asm/pgtable_64_types.h:#define PTRS_PER_P4D (pgtable_l5_enabled() ? 512 : 1)
This too could be implemented via a precomputed constant percpu
value (per_cpu__x86_PTRS_PER_P4D), eliminating a branch,
or via an ALTERNATIVE() immediate constant.
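E.g., again assuming the hypothetical helper above:
	#define PTRS_PER_P4D ALTERNATIVE_CONST_U32(1, 512, X86_FEATURE_LA57)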
> but that looks like a much bigger and fundamental patch than Ard's.
Definitely. Using various derived percpu values will also complicate
bootstrapping, with the usual complications of whether these values are
entirely in sync with the chosen paging mode the kernel uses.
Anyway, I'm not against Ard's simplification patch as a first step, and
any optimizations can be layered on top of that.
Thanks,
Ingo
^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [RFT PATCH v2 03/23] x86/boot: Drop global variables keeping track of LA57 state
2025-05-05 21:07 ` Ingo Molnar
@ 2025-05-05 21:24 ` Ingo Molnar
2025-05-05 22:30 ` Ard Biesheuvel
2025-05-05 21:26 ` Linus Torvalds
1 sibling, 1 reply; 59+ messages in thread
From: Ingo Molnar @ 2025-05-05 21:24 UTC (permalink / raw)
To: Linus Torvalds
Cc: Ard Biesheuvel, linux-kernel, linux-efi, x86, Ard Biesheuvel,
Borislav Petkov, Dionna Amalie Glaze, Kevin Loughlin,
Tom Lendacky
* Ingo Molnar <mingo@kernel.org> wrote:
> Anyway, with these limitations in mind, we can see that the top 5
> usecases cover about 80% of all uses:
>
> - MAX_PHYSMEM_BITS: (inlined 179 times)
>
> arch/x86/include/asm/sparsemem.h:# define MAX_PHYSMEM_BITS (pgtable_l5_enabled() ? 52 : 46)
>
> This could be implemented via a precomputed, constant percpu value
> (per_cpu__x86_MAX_PHYSMEM_BITS) of 52 vs. 46, eliminating not just
> the CR4 access, but also a branch, at the cost of a percpu memory
> access. (Which should still be a win on all microarchitectures IMO.)
>
> Alternatively, since this value is a semi-constant of 52 vs. 46, we
> could also, I suspect, ALTERNATIVES-patch MAX_PHYSMEM_BITS in as an
> immediate constant value? Any reason this shouldn't work:
>
> static inline unsigned int __MAX_PHYSMEM_BITS(void)
> {
> unsigned int bits;
>
> asm_inline (ALTERNATIVE("movl $46, %0", "movl $52, %0", X86_FEATURE_LA57) :"=g" (bits));
>
> return bits;
> }
> #define MAX_PHYSMEM_BITS __MAX_PHYSMEM_BITS()
>
> ... or something like that? This would result in the best code
> generation IMO, by far. (It would even make use of the
> zero-extension property of a 32-bit MOVL, further compressing the
> opcode to only 5 bytes or so.)
>
> We'd even create a secondary helper macro for this, something like:
>
> > #define ALTERNATIVE_CONST_U32(__val1, __val2, __feature) \
> ({ \
> u32 __val; \
> \
> > asm_inline (ALTERNATIVE("movl $" #__val1 ", %0", "movl $" #__val2 ", %0", __feature) :"=g" (__val)); \
> \
> __val; \
> })
>
> ...
>
> #define MAX_PHYSMEM_BITS ALTERNATIVE_CONST_U32(46, 52, X86_FEATURE_LA57)
>
> (Or so. Totally untested.)
BTW., I keep comparing it to the CR4 access, which is a bit unfair,
since that's only one out of the 3 variants of ALTERNATIVE_TERNARY():
static __always_inline __pure bool pgtable_l5_enabled(void)
{
unsigned long r;
bool ret;
if (!IS_ENABLED(CONFIG_X86_5LEVEL))
return false;
asm(ALTERNATIVE_TERNARY(
"movq %%cr4, %[reg] \n\t btl %[la57], %k[reg]" CC_SET(c),
%P[feat], "stc", "clc")
: [reg] "=&r" (r), CC_OUT(c) (ret)
: [feat] "i" (X86_FEATURE_LA57),
[la57] "i" (X86_CR4_LA57_BIT)
: "cc");
return ret;
}
The STC and CLC variants will probably be the more common outcomes on
modern CPUs.
But I still think the ALTERNATIVE_CONST_U32() approach I outline above
generates superior code for the binary-values cases, which covers 3 out
of the top 5 uses of pgtable_l5_enabled().
For non-constant branching uses of pgtable_l5_enabled() I suspect the
STC/CLC approach above is pretty good, although the 'cc' constraint
will clobber all flags I suspect, while ALTERNATIVE_CONST_U32()
doesn't? Ie. with ALTERNATIVE_CONST_U32() we just load the resulting
constant into a register, with no additional branches and with flags
undisturbed.
Also, is STC/CLC always just as fast as the testing of an immediate (or
a static branch), on CPUs we care about?
Thanks,
Ingo
^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [RFT PATCH v2 03/23] x86/boot: Drop global variables keeping track of LA57 state
2025-05-05 21:07 ` Ingo Molnar
2025-05-05 21:24 ` Ingo Molnar
@ 2025-05-05 21:26 ` Linus Torvalds
2025-05-05 21:51 ` Ingo Molnar
1 sibling, 1 reply; 59+ messages in thread
From: Linus Torvalds @ 2025-05-05 21:26 UTC (permalink / raw)
To: Ingo Molnar
Cc: Ard Biesheuvel, linux-kernel, linux-efi, x86, Ard Biesheuvel,
Borislav Petkov, Dionna Amalie Glaze, Kevin Loughlin,
Tom Lendacky
On Mon, 5 May 2025 at 14:07, Ingo Molnar <mingo@kernel.org> wrote:
>
> - MAX_PHYSMEM_BITS: (inlined 179 times)
>
> arch/x86/include/asm/sparsemem.h:# define MAX_PHYSMEM_BITS (pgtable_l5_enabled() ? 52 : 46)
>
> This could be implemented via a precomputed, constant percpu value
> (per_cpu__x86_MAX_PHYSMEM_BITS) of 52 vs. 46, eliminating not just
> the CR4 access, but also a branch, at the cost of a percpu memory
> access. (Which should still be a win on all microarchitectures IMO.)
This is literally why I did the "runtime-const" stuff. Exactly for
simple constants that you don't want to load from percpu memory
because it's just annoying.
Now, we only have 64-bit constant values which is very wasteful, and
we could just do a signed byte constant if we cared.
(We also have a "shift 32-bit value right by a constant amount",
which actually does use a signed byte, but it's masked by 0x1f because
that's how 32-bit shifts work).
I doubt we care - I doubt any of this MAX_PHYSMEM_BITS use is actually
performance-critical.
The runtime-const stuff would be trivial to use here if we really want to.
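For MAX_PHYSMEM_BITS, something like this untested sketch
('max_physmem_bits' is a made-up name, and the matching
RUNTIME_CONST(ptr, max_physmem_bits) entry in vmlinux.lds.S plus the
variable definition are omitted):
	extern unsigned long max_physmem_bits;
	#define MAX_PHYSMEM_BITS runtime_const_ptr(max_physmem_bits)
	/* boot-time fixup, once the paging mode is known: */
	max_physmem_bits = pgtable_l5_enabled() ? 52 : 46;
	runtime_const_init(ptr, max_physmem_bits);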
> - PGDIR_SHIFT: (inlined 156 times)
Several of those are actually of the form
#define PGDIR_SIZE (1UL << PGDIR_SHIFT)
so you artificially see PGDIR_SHIFT as the important part, even though
it's often a different constant entirely that just gets generated
using it.
> - p4d_offset(): (inlined 60 times)
> Here pgtable_l5_enabled() is used as a binary flag.
static branch would probably work best, and as Ard says, just using
cpu_feature_enabled() would just fix it..
> - pgd_none(): (inlined 49 times)
> Binary flag use as well, although the compiler might eliminate the
> branch here and replace it with 'AND !native_pgd_val(pgd)'
This could easily be done as runtime-const.
But again, I doubt it's all that performance-critical.
> - PTRS_PER_P4D: (inlined 46 times)
> This too could be implemented via a precomputed constant percpu
> value (per_cpu__x86_PTRS_PER_P4D), eliminating a branch,
> or via an ALTERNATIVE() immediate constant.
Again, we do have that, although the 64-bit constant is a bit wasteful.
The reason runtime-const does a 64-bit constant is that the actual
performance-critical cases were for big constants (TASK_SIZE) and for
pointers (hash table pointers).
Linus
^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [RFT PATCH v2 03/23] x86/boot: Drop global variables keeping track of LA57 state
2025-05-05 21:26 ` Linus Torvalds
@ 2025-05-05 21:51 ` Ingo Molnar
0 siblings, 0 replies; 59+ messages in thread
From: Ingo Molnar @ 2025-05-05 21:51 UTC (permalink / raw)
To: Linus Torvalds
Cc: Ard Biesheuvel, linux-kernel, linux-efi, x86, Ard Biesheuvel,
Borislav Petkov, Dionna Amalie Glaze, Kevin Loughlin,
Tom Lendacky
* Linus Torvalds <torvalds@linux-foundation.org> wrote:
> > - PGDIR_SHIFT: (inlined 156 times)
>
> Several of those are actually of the form
>
> #define PGDIR_SIZE (1UL << PGDIR_SHIFT)
>
> so you artificially see PGDIR_SHIFT as the important part, even though
> it's often a different constant entirely that just gets generated
> using it.
Yeah, I only examined the first level use.
Here's the stats for PGDIR_SIZE:
31 ffffffff844cde5c <stat__pgtable_l5_enabled_PGDIR_SIZE>
157 ffffffff844cdea0 <stat__pgtable_l5_enabled_PGDIR_SHIFT> # total: includes PGDIR_SIZE
Ie. only about 19% of PGDIR_SHIFT use is PGDIR_SIZE.
> > - PTRS_PER_P4D: (inlined 46 times)
> > This too could be implemented via a precomputed constant percpu
> > value (per_cpu__x86_PTRS_PER_P4D), eliminating a branch,
> > or via an ALTERNATIVE() immediate constant.
>
> Again, we do have that, although the 64-bit constant is a bit wasteful.
Adds 4 bytes to the size of the MOVQ. Not the end of the world, but I
suspect for x86-specific values that flag off a CPU-feature flag (which
is the case here) we can use the ALTERNATIVE_CONST_U32() trick and have
it all in a single place.
> The reason runtime-const does a 64-bit constant is that the actual
> performance-critical cases were for big constants (TASK_SIZE) and for
> pointers (hash table pointers).
I remembered that we had something in this area, but I grepped for
'alternative.*const' which found nothing, while runtime_const uses its
own simple text-patching method a.k.a. runtime_const_fixup(). :-)
Thanks,
Ingo
^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [RFT PATCH v2 03/23] x86/boot: Drop global variables keeping track of LA57 state
2025-05-05 21:24 ` Ingo Molnar
@ 2025-05-05 22:30 ` Ard Biesheuvel
0 siblings, 0 replies; 59+ messages in thread
From: Ard Biesheuvel @ 2025-05-05 22:30 UTC (permalink / raw)
To: Ingo Molnar
Cc: Linus Torvalds, Ard Biesheuvel, linux-kernel, linux-efi, x86,
Borislav Petkov, Dionna Amalie Glaze, Kevin Loughlin,
Tom Lendacky
On Mon, 5 May 2025 at 23:25, Ingo Molnar <mingo@kernel.org> wrote:
>
>
> * Ingo Molnar <mingo@kernel.org> wrote:
>
> > Anyway, with these limitations in mind, we can see that the top 5
> > usecases cover about 80% of all uses:
> >
> > - MAX_PHYSMEM_BITS: (inlined 179 times)
> >
> > arch/x86/include/asm/sparsemem.h:# define MAX_PHYSMEM_BITS (pgtable_l5_enabled() ? 52 : 46)
> >
> > This could be implemented via a precomputed, constant percpu value
> > (per_cpu__x86_MAX_PHYSMEM_BITS) of 52 vs. 46, eliminating not just
> > the CR4 access, but also a branch, at the cost of a percpu memory
> > access. (Which should still be a win on all microarchitectures IMO.)
> >
> > Alternatively, since this value is a semi-constant of 52 vs. 46, we
> > could also, I suspect, ALTERNATIVES-patch MAX_PHYSMEM_BITS in as an
> > immediate constant value? Any reason this shouldn't work:
> >
> > static inline unsigned int __MAX_PHYSMEM_BITS(void)
> > {
> > unsigned int bits;
> >
> > asm_inline (ALTERNATIVE("movl $46, %0", "movl $52, %0", X86_FEATURE_LA57) :"=g" (bits));
> >
> > return bits;
> > }
> > #define MAX_PHYSMEM_BITS __MAX_PHYSMEM_BITS()
> >
> > ... or something like that? This would result in the best code
> > generation IMO, by far. (It would even make use of the
> > zero-extension property of a 32-bit MOVL, further compressing the
> > opcode to only 5 bytes or so.)
> >
> > We'd even create a secondary helper macro for this, something like:
> >
> > #define ALTERNATIVE_CONST_U32(__val1, __val2, __feature) \
> > ({ \
> > u32 __val; \
> > \
> > asm_inline (ALTERNATIVE("movl $" #__val1 ", %0", "movl $" #__val2 ", %0", __feature) :"=g" (__val)); \
> > \
> > __val; \
> > })
> >
> > ...
> >
> > #define MAX_PHYSMEM_BITS ALTERNATIVE_CONST_U32(46, 52, X86_FEATURE_LA57)
> >
> > (Or so. Totally untested.)
>
> BTW., I keep comparing it to the CR4 access, which is a bit unfair,
> since that's only one out of the 3 variants of ALTERNATIVE_TERNARY():
>
> static __always_inline __pure bool pgtable_l5_enabled(void)
> {
> unsigned long r;
> bool ret;
>
> if (!IS_ENABLED(CONFIG_X86_5LEVEL))
> return false;
>
> asm(ALTERNATIVE_TERNARY(
> "movq %%cr4, %[reg] \n\t btl %[la57], %k[reg]" CC_SET(c),
> %P[feat], "stc", "clc")
> : [reg] "=&r" (r), CC_OUT(c) (ret)
> : [feat] "i" (X86_FEATURE_LA57),
> [la57] "i" (X86_CR4_LA57_BIT)
> : "cc");
>
> return ret;
> }
>
> The STC and CLC variants will probably be the more common outcomes on
> modern CPUs.
>
The CR4 access is only used before alternatives patching; after that,
either the STC or the CLC is patched in.
> But I still think the ALTERNATIVE_CONST_U32() approach I outline above
> generates superior code for the binary-values cases, which covers 3 out
> of the top 5 uses of pgtable_l5_enabled().
>
> For non-constant branching uses of pgtable_l5_enabled() I suspect the
> STC/CLC approach above is pretty good, although the 'cc' constraint
> will clobber all flags I suspect, while ALTERNATIVE_CONST_U32()
> doesn't? Ie. with ALTERNATIVE_CONST_U32() we just load the resulting
> constant into a register, with no additional branches and with flags
> undisturbed.
>
> Also, is STC/CLC always just as fast as the testing of an immediate (or
> a static branch), on CPUs we care about?
>
The reasoning behind using the C flag rather than asm goto is
precisely the fact that in many cases, pgtable_l5_enabled() is not
used for control flow but for picking between constants, and this
approach permits the compiler [in theory] to resolve it without a
branch. But I didn't inspect the resulting codegen, and just patching
in a MOV immediate is obviously more efficient than that.
I could just use asm goto here, and implement something similar to
cpu_feature_enabled(), but in a way that removes the need for
USE_EARLY_PGTABLE_L5. Then, we could get rid of the
__pgtable_l5_enabled variable in a separate patch, and base it on a
test of CR4.LA57 in the early init path.
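IOW - as an untested sketch - the end state could simply be
	#define pgtable_l5_enabled() cpu_feature_enabled(X86_FEATURE_LA57)
with the early init code setting/clearing X86_FEATURE_LA57 in
boot_cpu_data before the first user, so that the unpatched fallback
path of static_cpu_has() (which tests boot_cpu_data) gives the right
answer even before alternatives patching.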
Then, any optimizations related to constants defined in terms of
(pgtable_l5_enabled() ? foo : bar) could be layered on top of that.
Having runtime constants is the obvious choice, although I'm skeptical
whether it's worth the hassle. I also haven't yet looked into what it
would entail to share these runtime constants with the startup code.
^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [tip: x86/boot] x86/boot: Provide __pti_set_user_pgtbl() to startup code
2025-05-05 16:47 ` Borislav Petkov
@ 2025-05-06 7:18 ` Borislav Petkov
0 siblings, 0 replies; 59+ messages in thread
From: Borislav Petkov @ 2025-05-06 7:18 UTC (permalink / raw)
To: Ard Biesheuvel, Kirill A. Shutemov
Cc: linux-kernel, linux-tip-commits, Ingo Molnar, Arnd Bergmann,
David Woodhouse, Dionna Amalie Glaze, H. Peter Anvin, Kees Cook,
Kevin Loughlin, Len Brown, Linus Torvalds, Rafael J. Wysocki,
Tom Lendacky, linux-efi, x86
+ Kirill.
This looks like this unaccepted_cleanup_work() thing being unsynchronized...
On Mon, May 05, 2025 at 06:47:59PM +0200, Borislav Petkov wrote:
> On Mon, May 05, 2025 at 06:19:49PM +0200, Ard Biesheuvel wrote:
> > This patch by itself does nothing. The symbol is __weak for the time
> > being, and given that this code does not have its __pi_ prefixes yet,
> > this function will be superseded by the existing one. (If you remove
> > the __weak you will get a linker error)
> >
> > Are you sure this patch is causing the issue?
>
> My by-foot bisection of x86/boot is below.
>
> I don't know if this patch uncovers something or maybe changes placement...
>
> Tom also pointed to
>
> 4067196a5227 ("mm/page_alloc: fix deadlock on cpu_hotplug_lock in __accept_page()")
>
> as potentially related.
>
> And now I went and tried to reproduce the warning and it fired ontop of:
>
> 419cbaf6a56a ("x86/boot: Add a bunch of PIC aliases")
>
> which is the previous patch.
>
> Ufff, this is one of those which don't reproduce reliably because it was fine
> on that commit in the previous run. Nasty.
>
> Lemme go dig more.
>
> ---
>
> ed4d95d033e3 (HEAD, refs/remotes/tip/x86/boot) x86/sev: Disentangle #VC handling code from startup code
>
> <--- NOT
>
> [ 1.227968] smp: Brought up 1 node, 32 CPUs
> [ 1.231576] smpboot: Total of 32 processors activated (191999.61 BogoMIPS)
> [ 1.247644] ------------[ cut here ]------------
> [ 1.248697] WARNING: CPU: 17 PID: 104 at kernel/jump_label.c:276 __static_key_slow_dec_cpuslocked+0x2a/0x80
> [ 1.251592] Modules linked in:
> [ 1.252370] CPU: 17 UID: 0 PID: 104 Comm: kworker/17:0 Not tainted 6.15.0-rc4+ #2 PREEMPT(voluntary)
> [ 1.253173] node 0 deferred pages initialised in 12ms
> [ 1.254539] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS unknown 02/02/2022
> [ 1.257490] Workqueue: events unaccepted_cleanup_work
> [ 1.258698] RIP: 0010:__static_key_slow_dec_cpuslocked+0x2a/0x80
> [ 1.259574] Code: 0f 1f 44 00 00 53 48 89 fb e8 82 66 e5 ff 8b 03 85 c0 78 16 74 54 83 f8 01 74 11 8d 50 ff f0 0f b1 13 75 ec 5b e9 66 bf 8c 00 <0f> 0b 48 c7 c7 00 44 54 82 e8 a8 5f 8c 00 8b 03 83 f8 ff 74 33 85
> [ 1.266446] Memory: 7574396K/8381588K available (13737K kernel code, 2487K rwdata, 6056K rodata, 3916K init, 3592K bss, 791248K reserved, 0K cma-reserved)
> [ 1.263574] RSP: 0018:ffffc9000048fe40 EFLAGS: 00010286
> [ 1.267573] RAX: 00000000ffffffff RBX: ffffffff82f10d00 RCX: 0000000000000000
> [ 1.269388] RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffffffff82f10d00
> [ 1.275578] RBP: ffff88827b7d4d98 R08: 8080808080808080 R09: ffff888100b59100
> [ 1.277441] R10: ffff888100050cc0 R11: fefefefefefefeff R12: ffff88827326e300
> [ 1.279574] R13: ffff888100bc1940 R14: ffff8881000e2405 R15: ffff8881000e2400
> [ 1.281460] FS: 0000000000000000(0000) GS:ffff8882f0400000(0000) knlGS:0000000000000000
> [ 1.281460] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 1.281460] CR2: 0000000000000000 CR3: 0008000002c22000 CR4: 00000000003506f0
> [ 1.287576] Call Trace:
> [ 1.288381] <TASK>
> [ 1.289015] static_key_slow_dec+0x1f/0x40
> [ 1.289980] process_one_work+0x171/0x330
> [ 1.290988] worker_thread+0x247/0x390
> [ 1.291576] ? __pfx_worker_thread+0x10/0x10
> [ 1.292773] kthread+0x107/0x240
> [ 1.293536] ? __pfx_kthread+0x10/0x10
> [ 1.295573] ret_from_fork+0x30/0x50
> [ 1.295575] ? __pfx_kthread+0x10/0x10
> [ 1.296580] ret_from_fork_asm+0x1a/0x30
> [ 1.297507] </TASK>
> [ 1.298085] ---[ end trace 0000000000000000 ]---
>
> 5297886f0cc4 x86/boot: Provide __pti_set_user_pgtbl() to startup code
>
> <--- OK
>
> 419cbaf6a56a x86/boot: Add a bunch of PIC aliases
> f932adcc8650 x86/linkage: Add SYM_PIC_ALIAS() macro helper to emit symbol aliases
>
> <--- OK
>
> ae862964cbc5 x86/sev: Move instruction decoder into separate source file
> fae89bbfdd9d x86/sev: Make sev_snp_enabled() a static function
> b3464a36f7f2 x86/boot: Disregard __supported_pte_mask in __startup_64()
> bd4a58beaaf1 x86/boot: Move early_setup_gdt() back into head64.c
>
> <--- OK
>
>
> --
> Regards/Gruss,
> Boris.
>
> https://people.kernel.org/tglx/notes-about-netiquette
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [RFT PATCH v2 00/23] x86: strict separation of startup code
2025-05-04 9:52 [RFT PATCH v2 00/23] x86: strict separation of startup code Ard Biesheuvel
` (23 preceding siblings ...)
2025-05-04 14:04 ` [RFT PATCH v2 00/23] x86: strict separation of startup code Ingo Molnar
@ 2025-05-07 9:52 ` Borislav Petkov
2025-05-07 12:05 ` Ard Biesheuvel
24 siblings, 1 reply; 59+ messages in thread
From: Borislav Petkov @ 2025-05-07 9:52 UTC (permalink / raw)
To: Ard Biesheuvel
Cc: linux-kernel, linux-efi, x86, Ard Biesheuvel, Ingo Molnar,
Dionna Amalie Glaze, Kevin Loughlin, Tom Lendacky
On Sun, May 04, 2025 at 11:52:30AM +0200, Ard Biesheuvel wrote:
> arch/x86/boot/compressed/Makefile | 6 +-
> arch/x86/boot/compressed/misc.h | 12 +-
> arch/x86/boot/compressed/pgtable_64.c | 12 -
> arch/x86/boot/compressed/sev-handle-vc.c | 134 +++
> arch/x86/boot/compressed/sev.c | 210 +---
> arch/x86/boot/compressed/sev.h | 21 +-
> arch/x86/boot/compressed/vmlinux.lds.S | 1 +
> arch/x86/boot/startup/Makefile | 21 +
> arch/x86/boot/startup/exports.h | 14 +
> arch/x86/boot/startup/gdt_idt.c | 17 +-
> arch/x86/boot/startup/map_kernel.c | 18 +-
> arch/x86/boot/startup/sev-shared.c | 804 +-------------
> arch/x86/boot/startup/sev-startup.c | 1169 +-------------------
> arch/x86/boot/startup/sme.c | 45 +-
> arch/x86/coco/core.c | 2 +
> arch/x86/coco/sev/Makefile | 6 +-
> arch/x86/coco/sev/core.c | 189 +++-
> arch/x86/coco/sev/{sev-nmi.c => sev-noinstr.c} | 74 ++
Can we drop the "sev-" prefix to filenames which are already in sev/
filepaths?
> arch/x86/coco/sev/vc-handle.c | 1060 ++++++++++++++++++
> arch/x86/coco/sev/vc-shared.c | 614 ++++++++++
> arch/x86/include/asm/init.h | 6 -
> arch/x86/include/asm/linkage.h | 10 +
> arch/x86/include/asm/pgtable_64_types.h | 43 +-
> arch/x86/include/asm/setup.h | 2 +
> arch/x86/include/asm/sev-internal.h | 30 +-
> arch/x86/include/asm/sev.h | 78 ++
Pfff, sev-internal and sev.
I guess I'll know how the new structure will look once I go through this
but there are so many sev* files now.
Can we tone that down pls, through aggregation, moving up into headers and so
on?
Thx.
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [RFT PATCH v2 05/23] x86/sev: Move instruction decoder into separate source file
2025-05-04 9:52 ` [RFT PATCH v2 05/23] x86/sev: Move instruction decoder into separate source file Ard Biesheuvel
2025-05-04 14:20 ` [tip: x86/boot] " tip-bot2 for Ard Biesheuvel
2025-05-05 14:48 ` [RFT PATCH v2 05/23] " Tom Lendacky
@ 2025-05-07 9:58 ` Borislav Petkov
2025-05-07 11:49 ` Ard Biesheuvel
2 siblings, 1 reply; 59+ messages in thread
From: Borislav Petkov @ 2025-05-07 9:58 UTC (permalink / raw)
To: Ard Biesheuvel
Cc: linux-kernel, linux-efi, x86, Ard Biesheuvel, Ingo Molnar,
Dionna Amalie Glaze, Kevin Loughlin, Tom Lendacky
On Sun, May 04, 2025 at 11:52:35AM +0200, Ard Biesheuvel wrote:
> From: Ard Biesheuvel <ardb@kernel.org>
>
> As a first step towards disentangling the SEV #VC handling code -which
> is shared between the decompressor and the core kernel- from the SEV
> startup code, move the decompressor's copy of the instruction decoder
> into a separate source file.
Why?
I'm still unclear why that happens.
I'd like to read some blurb in those commit messages which explains the big
picture: the insn decoder bits are going to be in the bla mapping because...
, and because... and this is wonderful because ...
Thx.
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [RFT PATCH v2 05/23] x86/sev: Move instruction decoder into separate source file
2025-05-07 9:58 ` Borislav Petkov
@ 2025-05-07 11:49 ` Ard Biesheuvel
2025-05-08 11:08 ` Borislav Petkov
0 siblings, 1 reply; 59+ messages in thread
From: Ard Biesheuvel @ 2025-05-07 11:49 UTC (permalink / raw)
To: Borislav Petkov
Cc: Ard Biesheuvel, linux-kernel, linux-efi, x86, Ingo Molnar,
Dionna Amalie Glaze, Kevin Loughlin, Tom Lendacky
On Wed, 7 May 2025 at 11:58, Borislav Petkov <bp@alien8.de> wrote:
>
> On Sun, May 04, 2025 at 11:52:35AM +0200, Ard Biesheuvel wrote:
> > From: Ard Biesheuvel <ardb@kernel.org>
> >
> > As a first step towards disentangling the SEV #VC handling code -which
> > is shared between the decompressor and the core kernel- from the SEV
> > startup code, move the decompressor's copy of the instruction decoder
> > into a separate source file.
>
> Why?
>
> I'm still unclear why that happens.
>
> I'd like to read some blurb in those commit messages which explains the big
> picture: the insn decoder bits are going to be in the bla mapping because...
> , and because... and this is wonderful because ...
>
Sure, I can add some more prose. I'll add something along the lines of
"Some of the SEV code that is shared between the decompressor and the
kernel proper runs very early in the latter, and therefore needs to be
built in a special way. This does not apply to all of that shared
code, though - some is used both by the decompressor and by the
kernel proper, but at a much later stage. That code can be built as
ordinary, position-dependent code with instrumentation enabled,
etc.
The #VC handling machinery and the associated instruction decoder are
conceptually separate from the SEV initialization code, and are never
used on the early startup path in the core kernel. So start separating
it from the SEV startup code, by moving the decompressor's copy of the
instruction decoder to a separate source file. In a subsequent patch,
the shared #VC handling code will be moved into a separate shared
source file, which will be included here too and no longer into sev.c.
That way, it no longer gets included into the early SEV startup code,
and can be built in the ordinary way."
Does that help?
^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [RFT PATCH v2 00/23] x86: strict separation of startup code
2025-05-07 9:52 ` Borislav Petkov
@ 2025-05-07 12:05 ` Ard Biesheuvel
2025-05-08 10:55 ` Borislav Petkov
0 siblings, 1 reply; 59+ messages in thread
From: Ard Biesheuvel @ 2025-05-07 12:05 UTC (permalink / raw)
To: Borislav Petkov
Cc: Ard Biesheuvel, linux-kernel, linux-efi, x86, Ingo Molnar,
Dionna Amalie Glaze, Kevin Loughlin, Tom Lendacky
On Wed, 7 May 2025 at 11:53, Borislav Petkov <bp@alien8.de> wrote:
>
> On Sun, May 04, 2025 at 11:52:30AM +0200, Ard Biesheuvel wrote:
> > arch/x86/boot/compressed/Makefile | 6 +-
> > arch/x86/boot/compressed/misc.h | 12 +-
> > arch/x86/boot/compressed/pgtable_64.c | 12 -
> > arch/x86/boot/compressed/sev-handle-vc.c | 134 +++
> > arch/x86/boot/compressed/sev.c | 210 +---
> > arch/x86/boot/compressed/sev.h | 21 +-
> > arch/x86/boot/compressed/vmlinux.lds.S | 1 +
> > arch/x86/boot/startup/Makefile | 21 +
> > arch/x86/boot/startup/exports.h | 14 +
> > arch/x86/boot/startup/gdt_idt.c | 17 +-
> > arch/x86/boot/startup/map_kernel.c | 18 +-
> > arch/x86/boot/startup/sev-shared.c | 804 +-------------
> > arch/x86/boot/startup/sev-startup.c | 1169 +-------------------
> > arch/x86/boot/startup/sme.c | 45 +-
> > arch/x86/coco/core.c | 2 +
> > arch/x86/coco/sev/Makefile | 6 +-
> > arch/x86/coco/sev/core.c | 189 +++-
> > arch/x86/coco/sev/{sev-nmi.c => sev-noinstr.c} | 74 ++
>
> Can we drop the "sev-" prefix to filenames which are already in sev/
> filepaths?
>
Sure.
> > arch/x86/coco/sev/vc-handle.c | 1060 ++++++++++++++++++
> > arch/x86/coco/sev/vc-shared.c | 614 ++++++++++
> > arch/x86/include/asm/init.h | 6 -
> > arch/x86/include/asm/linkage.h | 10 +
> > arch/x86/include/asm/pgtable_64_types.h | 43 +-
> > arch/x86/include/asm/setup.h | 2 +
> > arch/x86/include/asm/sev-internal.h | 30 +-
> > arch/x86/include/asm/sev.h | 78 ++
>
> Pfff, sev-internal and sev.
>
> I guess I'll know how the new structure looks once I go through this,
> but there are so many sev* files now.
>
> Can we tone that down pls, through aggregation, moving up into headers and so
> on?
>
I think the main issue with this code is that everything was in a
single source file, with no structure or layering whatsoever.
So I'd actually argue for splitting this up even more rather than
bundling it all together, although the sev.h vs sev-internal.h
distinction is a bit dubious as it stands - it would be better to
split this along functional lines.
I added sev-internal.h so that the single mother-of-all-source-files
could be hacked up without exposing to external users implementation
details that were hidden before. I.e., the high-level APIs that other
callers need should live in sev.h, and the implementation of those
APIs should be carved up meaningfully. For example, perhaps the #VC
handling stuff (which now lives in a separate source file) could be
exposed via a sev-vc.h, and only included in places where that
particular functionality is actually used.
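Roughly, the layering would look like this (a hedged sketch; sev-vc.h
is hypothetical at this point, and the declaration is only an example):

/* arch/x86/include/asm/sev.h - high-level API for external callers */
void snp_set_memory_shared(unsigned long vaddr, unsigned long npages);

/*
 * arch/x86/include/asm/sev-internal.h - implementation details shared
 * between the SEV objects themselves; not for external inclusion
 */

/*
 * arch/x86/include/asm/sev-vc.h (hypothetical) - #VC handling
 * interface, included only where that functionality is used
 */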
* Re: [RFT PATCH v2 00/23] x86: strict separation of startup code
2025-05-07 12:05 ` Ard Biesheuvel
@ 2025-05-08 10:55 ` Borislav Petkov
0 siblings, 0 replies; 59+ messages in thread
From: Borislav Petkov @ 2025-05-08 10:55 UTC (permalink / raw)
To: Ard Biesheuvel
Cc: Ard Biesheuvel, linux-kernel, linux-efi, x86, Ingo Molnar,
Dionna Amalie Glaze, Kevin Loughlin, Tom Lendacky
On Wed, May 07, 2025 at 02:05:37PM +0200, Ard Biesheuvel wrote:
> So I'd actually argue for splitting this up even more rather than
> bundling it all together,
Splitting it into logically separated bits? Yes. Just because: no.
The SEV stuff is still largely in motion, so once it starts settling
down, splitting it is the first thing I'll go do if it still irks me.
> although the sev.h vs sev-internal.h distinction is a bit dubious as it
> stands - it would be better to split this along functional lines.
>
> I added sev-internal.h so that the single mother-of-all-source-files
> could be hacked up without exposing to external users implementation
> details that were hidden before. I.e., the high-level APIs that other
> callers need should live in sev.h, and the implementation of those
> APIs should be carved up meaningfully. For example, perhaps the #VC
> handling stuff (which now lives in a separate source file) could be
> exposed via a sev-vc.h, and only included in places where that
> particular functionality is actually used.
Right, please put those considerations in the commit messages - it helps a lot
with the review.
Thx.
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
* Re: [RFT PATCH v2 05/23] x86/sev: Move instruction decoder into separate source file
2025-05-07 11:49 ` Ard Biesheuvel
@ 2025-05-08 11:08 ` Borislav Petkov
0 siblings, 0 replies; 59+ messages in thread
From: Borislav Petkov @ 2025-05-08 11:08 UTC (permalink / raw)
To: Ard Biesheuvel
Cc: Ard Biesheuvel, linux-kernel, linux-efi, x86, Ingo Molnar,
Dionna Amalie Glaze, Kevin Loughlin, Tom Lendacky
On Wed, May 07, 2025 at 01:49:19PM +0200, Ard Biesheuvel wrote:
> Sure, I can add some more prose. I'll add something along the lines of
>
> "Some of the SEV code that is shared between the decompressor and the
> kernel proper runs very early in the latter, and therefore needs to be
> built in a special way. This does not apply to all of that shared
> code, though - some is used both by the decompressor, and by the
> kernel proper but at a much later stage. That code can be built as
> ordinary, position dependent code with instrumentations enabled etc
> etc.
>
> The #VC handling machinery and the associated instruction decoder are
> conceptually separate from the SEV initialization code, and are never
> used on the early startup path in the core kernel. So start separating
> it from the SEV startup code, by moving the decompressor's copy of the
> instruction decoder to a separate source file. In a subsequent patch,
> the shared #VC handling code will be moved into a separate shared
> source file, which will be included here too and no longer into sev.c.
> That way, it no longer gets included into the early SEV startup code,
> and can be built in the ordinary way."
>
> Does that help?
Yap, definitely.
So the logic is: everything in startup/ and everything that startup/
*includes* is going to end up being PIC, and the rest is ordinary.
I guess that's one rule to separate it on.
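(For illustration - a hedged sketch of the aliasing mechanism that
rule relies on; the macro body below is an assumed shape, not the
literal SYM_PIC_ALIAS() definition from patch 13:)

/*
 * Startup objects only see prefixed (e.g. __pi_) symbols, so a
 * reference from startup code to an ordinary symbol only links if an
 * explicit alias is emitted for it.
 */
#define SYM_PIC_ALIAS(sym) \
	extern typeof(sym) __pi_##sym __attribute__((__alias__(#sym)))

unsigned long phys_base;
SYM_PIC_ALIAS(phys_base);	/* startup code may now reference it */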
Thanks!
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
end of thread, other threads:[~2025-05-08 11:08 UTC | newest]
Thread overview: 59+ messages -- links below jump to the message on this page --
2025-05-04 9:52 [RFT PATCH v2 00/23] x86: strict separation of startup code Ard Biesheuvel
2025-05-04 9:52 ` [RFT PATCH v2 01/23] x86/boot: Move early_setup_gdt() back into head64.c Ard Biesheuvel
2025-05-04 14:20 ` [tip: x86/boot] " tip-bot2 for Ard Biesheuvel
2025-05-04 9:52 ` [RFT PATCH v2 02/23] x86/boot: Disregard __supported_pte_mask in __startup_64() Ard Biesheuvel
2025-05-04 14:20 ` [tip: x86/boot] " tip-bot2 for Ard Biesheuvel
2025-05-04 9:52 ` [RFT PATCH v2 03/23] x86/boot: Drop global variables keeping track of LA57 state Ard Biesheuvel
2025-05-04 13:50 ` Ingo Molnar
2025-05-04 14:46 ` Ard Biesheuvel
2025-05-04 14:58 ` Linus Torvalds
2025-05-04 19:33 ` Ard Biesheuvel
2025-05-05 21:07 ` Ingo Molnar
2025-05-05 21:24 ` Ingo Molnar
2025-05-05 22:30 ` Ard Biesheuvel
2025-05-05 21:26 ` Linus Torvalds
2025-05-05 21:51 ` Ingo Molnar
2025-05-04 9:52 ` [RFT PATCH v2 04/23] x86/sev: Make sev_snp_enabled() a static function Ard Biesheuvel
2025-05-04 14:20 ` [tip: x86/boot] " tip-bot2 for Ard Biesheuvel
2025-05-04 9:52 ` [RFT PATCH v2 05/23] x86/sev: Move instruction decoder into separate source file Ard Biesheuvel
2025-05-04 14:20 ` [tip: x86/boot] " tip-bot2 for Ard Biesheuvel
2025-05-05 14:48 ` [RFT PATCH v2 05/23] " Tom Lendacky
2025-05-05 14:50 ` Ard Biesheuvel
2025-05-07 9:58 ` Borislav Petkov
2025-05-07 11:49 ` Ard Biesheuvel
2025-05-08 11:08 ` Borislav Petkov
2025-05-04 9:52 ` [RFT PATCH v2 06/23] x86/sev: Disentangle #VC handling code from startup code Ard Biesheuvel
2025-05-05 5:31 ` [tip: x86/boot] " tip-bot2 for Ard Biesheuvel
2025-05-05 14:58 ` [RFT PATCH v2 06/23] " Tom Lendacky
2025-05-05 16:54 ` [tip: x86/boot] " tip-bot2 for Ard Biesheuvel
2025-05-04 9:52 ` [RFT PATCH v2 07/23] x86/sev: Separate MSR and GHCB based snp_cpuid() via a callback Ard Biesheuvel
2025-05-04 9:52 ` [RFT PATCH v2 08/23] x86/sev: Fall back to early page state change code only during boot Ard Biesheuvel
2025-05-04 9:52 ` [RFT PATCH v2 09/23] x86/sev: Move GHCB page based HV communication out of startup code Ard Biesheuvel
2025-05-04 9:52 ` [RFT PATCH v2 10/23] x86/sev: Use boot SVSM CA for all startup and init code Ard Biesheuvel
2025-05-04 9:52 ` [RFT PATCH v2 11/23] x86/boot: Drop redundant RMPADJUST in SEV SVSM presence check Ard Biesheuvel
2025-05-04 9:52 ` [RFT PATCH v2 12/23] x86/sev: Unify SEV-SNP hypervisor feature check Ard Biesheuvel
2025-05-04 9:52 ` [RFT PATCH v2 13/23] x86/linkage: Add SYM_PIC_ALIAS() macro helper to emit symbol aliases Ard Biesheuvel
2025-05-04 14:20 ` [tip: x86/boot] " tip-bot2 for Ard Biesheuvel
2025-05-04 9:52 ` [RFT PATCH v2 14/23] x86/boot: Add a bunch of PIC aliases Ard Biesheuvel
2025-05-04 14:20 ` [tip: x86/boot] " tip-bot2 for Ard Biesheuvel
2025-05-04 9:52 ` [RFT PATCH v2 15/23] x86/boot: Provide __pti_set_user_pgtbl() to startup code Ard Biesheuvel
2025-05-04 14:20 ` [tip: x86/boot] " tip-bot2 for Ard Biesheuvel
2025-05-05 16:03 ` Borislav Petkov
2025-05-05 16:19 ` Ard Biesheuvel
2025-05-05 16:47 ` Borislav Petkov
2025-05-06 7:18 ` Borislav Petkov
2025-05-05 16:54 ` tip-bot2 for Ard Biesheuvel
2025-05-04 9:52 ` [RFT PATCH v2 16/23] x86/sev: Provide PIC aliases for SEV related data objects Ard Biesheuvel
2025-05-04 9:52 ` [RFT PATCH v2 17/23] x86/sev: Move __sev_[get|put]_ghcb() into separate noinstr object Ard Biesheuvel
2025-05-04 9:52 ` [RFT PATCH v2 18/23] x86/sev: Export startup routines for ordinary use Ard Biesheuvel
2025-05-04 9:52 ` [RFT PATCH v2 19/23] x86/boot: Created a confined code area for startup code Ard Biesheuvel
2025-05-04 9:52 ` [RFT PATCH v2 20/23] x86/boot: Move startup code out of __head section Ard Biesheuvel
2025-05-04 9:52 ` [RFT PATCH v2 21/23] x86/boot: Disallow absolute symbol references in startup code Ard Biesheuvel
2025-05-04 9:52 ` [RFT PATCH v2 22/23] x86/boot: Revert "Reject absolute references in .head.text" Ard Biesheuvel
2025-05-04 9:52 ` [RFT PATCH v2 23/23] x86/boot: Get rid of the .head.text section Ard Biesheuvel
2025-05-04 14:04 ` [RFT PATCH v2 00/23] x86: strict separation of startup code Ingo Molnar
2025-05-04 14:55 ` Ard Biesheuvel
2025-05-05 5:08 ` Ingo Molnar
2025-05-07 9:52 ` Borislav Petkov
2025-05-07 12:05 ` Ard Biesheuvel
2025-05-08 10:55 ` Borislav Petkov