* [PATCH 00/22] x86: FRED support, part 1 (stacks and exceptions)
@ 2025-08-08 20:22 Andrew Cooper
2025-08-08 20:22 ` [PATCH 01/22] x86/msr: Rename MSR_INTERRUPT_SSP_TABLE to MSR_ISST Andrew Cooper
` (22 more replies)
0 siblings, 23 replies; 120+ messages in thread
From: Andrew Cooper @ 2025-08-08 20:22 UTC (permalink / raw)
To: Xen-devel; +Cc: Andrew Cooper
Patches 1-12 are cleanup and rearrangement in order to fit FRED in more
nicely.
Patches 13-22 are the start of FRED support, setting up the stacks and
exception handling.
By the end of this series, Xen can run a PVH dom0 (if NMIs aren't in use -
VT-x has a really nasty corner case here), or can get to the point of
launching a PV dom0. Running a PV dom0 is going to take a similar quantity of
work.
All development work has been done with Intel SIMICS. I don't yet have real
hardware to test on.
The GitLab CI AlderLake box, sporting both Shadow Stacks and S3, has been
working overtime checking that S3 works on every relevant patch boundary.
https://gitlab.com/xen-project/hardware/xen-staging/-/pipelines/1974989391
Andrew Cooper (22):
x86/msr: Rename MSR_INTERRUPT_SSP_TABLE to MSR_ISST
x86/msr: Rename wrmsr_ns() to wrmsrns(), and take 64bit value
x86/traps: Drop incorrect BUILD_BUG_ON() and comment in load_system_tables()
x86/idt: Minor improvements to _update_gate_addr_lower()
x86/traps: Rename early_traps_init() to bsp_early_traps_init()
x86/traps: Introduce bsp_traps_reinit()
x86/spec-ctrl: Rework init_shadow_spec_ctrl_state() to take an info pointer
x86/traps: Introduce ap_early_traps_init() and set up exception handling earlier
x86/traps: Move load_system_tables() into traps-setup.c
x86/traps: Move subarch_percpu_traps_init() into traps-setup.c
x86/traps: Fold x86_64/traps.c into traps.c
x86/traps: Unexport show_code() and show_stack_overflow()
x86: FRED enumerations
x86/traps: Extend struct cpu_user_regs/cpu_info with FRED fields
x86/traps: Introduce opt_fred
x86/boot: Adjust CR4 handling around ap_early_traps_init()
x86/S3: Switch to using RSTORSSP to recover SSP on resume
x86/traps: Set MSR_PL0_SSP in load_system_tables()
x86/boot: Use RSTORSSP to establish SSP
x86/traps: Alter switch_stack_and_jump() for FRED mode
x86/traps: Introduce FRED entrypoints
x86/traps: Enable FRED when requested
docs/misc/xen-command-line.pandoc | 10 +
xen/arch/x86/Kconfig | 4 +
xen/arch/x86/acpi/wakeup_prot.S | 57 +--
xen/arch/x86/boot/x86_64.S | 53 ++-
xen/arch/x86/cpu/common.c | 120 ------
xen/arch/x86/include/asm/asm-defns.h | 9 +
xen/arch/x86/include/asm/asm_defns.h | 65 +++
xen/arch/x86/include/asm/cpu-user-regs.h | 71 +++-
xen/arch/x86/include/asm/cpufeature.h | 3 +
xen/arch/x86/include/asm/current.h | 14 +-
xen/arch/x86/include/asm/idt.h | 13 +-
xen/arch/x86/include/asm/msr-index.h | 13 +-
xen/arch/x86/include/asm/msr.h | 12 +-
xen/arch/x86/include/asm/processor.h | 2 -
xen/arch/x86/include/asm/prot-key.h | 4 +-
xen/arch/x86/include/asm/spec_ctrl.h | 4 +-
xen/arch/x86/include/asm/system.h | 3 -
xen/arch/x86/include/asm/traps.h | 7 +-
xen/arch/x86/include/asm/x86-defns.h | 1 +
xen/arch/x86/msr.c | 4 +-
xen/arch/x86/setup.c | 37 +-
xen/arch/x86/smpboot.c | 17 +-
xen/arch/x86/spec_ctrl.c | 2 +-
xen/arch/x86/traps-setup.c | 358 +++++++++++++++-
xen/arch/x86/traps.c | 448 +++++++++++++++++++-
xen/arch/x86/x86_64/Makefile | 2 +-
xen/arch/x86/x86_64/entry-fred.S | 35 ++
xen/arch/x86/x86_64/traps.c | 414 ------------------
xen/include/public/arch-x86/cpufeatureset.h | 3 +
xen/include/xen/macros.h | 2 +
30 files changed, 1142 insertions(+), 645 deletions(-)
create mode 100644 xen/arch/x86/x86_64/entry-fred.S
delete mode 100644 xen/arch/x86/x86_64/traps.c
--
2.39.5
* [PATCH 01/22] x86/msr: Rename MSR_INTERRUPT_SSP_TABLE to MSR_ISST
2025-08-08 20:22 [PATCH 00/22] x86: FRED support, part 1 (stacks and exceptions) Andrew Cooper
@ 2025-08-08 20:22 ` Andrew Cooper
2025-08-12 8:06 ` Jan Beulich
2025-08-08 20:22 ` [PATCH 02/22] x86/msr: Rename wrmsr_ns() to wrmsrns(), and take 64bit value Andrew Cooper
` (21 subsequent siblings)
22 siblings, 1 reply; 120+ messages in thread
From: Andrew Cooper @ 2025-08-08 20:22 UTC (permalink / raw)
To: Xen-devel; +Cc: Andrew Cooper, Jan Beulich, Roger Pau Monné
The name AMD chose is rather more concise.
No functional change.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
CC: Jan Beulich <JBeulich@suse.com>
CC: Roger Pau Monné <roger.pau@citrix.com>
---
xen/arch/x86/cpu/common.c | 2 +-
xen/arch/x86/include/asm/msr-index.h | 2 +-
xen/arch/x86/msr.c | 4 ++--
3 files changed, 4 insertions(+), 4 deletions(-)
diff --git a/xen/arch/x86/cpu/common.c b/xen/arch/x86/cpu/common.c
index 97bdda1d4a25..f6ec5c9df522 100644
--- a/xen/arch/x86/cpu/common.c
+++ b/xen/arch/x86/cpu/common.c
@@ -933,7 +933,7 @@ void load_system_tables(void)
wrss(df_ssp, _p(df_ssp));
}
- wrmsrl(MSR_INTERRUPT_SSP_TABLE, (unsigned long)ist_ssp);
+ wrmsrl(MSR_ISST, (unsigned long)ist_ssp);
}
BUILD_BUG_ON(sizeof(*tss) <= 0x67); /* Mandated by the architecture. */
diff --git a/xen/arch/x86/include/asm/msr-index.h b/xen/arch/x86/include/asm/msr-index.h
index 2e7e2aff9a33..428d993ee89b 100644
--- a/xen/arch/x86/include/asm/msr-index.h
+++ b/xen/arch/x86/include/asm/msr-index.h
@@ -157,7 +157,7 @@
#define MSR_PL1_SSP 0x000006a5
#define MSR_PL2_SSP 0x000006a6
#define MSR_PL3_SSP 0x000006a7
-#define MSR_INTERRUPT_SSP_TABLE 0x000006a8
+#define MSR_ISST 0x000006a8
#define MSR_PKRS 0x000006e1
diff --git a/xen/arch/x86/msr.c b/xen/arch/x86/msr.c
index 2cd46b6c8afa..1bf117cbd80f 100644
--- a/xen/arch/x86/msr.c
+++ b/xen/arch/x86/msr.c
@@ -138,7 +138,7 @@ int guest_rdmsr(struct vcpu *v, uint32_t msr, uint64_t *val)
case MSR_RTIT_OUTPUT_BASE ... MSR_RTIT_ADDR_B(7):
case MSR_U_CET:
case MSR_S_CET:
- case MSR_PL0_SSP ... MSR_INTERRUPT_SSP_TABLE:
+ case MSR_PL0_SSP ... MSR_ISST:
case MSR_AMD64_LWP_CFG:
case MSR_AMD64_LWP_CBADDR:
case MSR_PPIN_CTL:
@@ -442,7 +442,7 @@ int guest_wrmsr(struct vcpu *v, uint32_t msr, uint64_t val)
case MSR_RTIT_OUTPUT_BASE ... MSR_RTIT_ADDR_B(7):
case MSR_U_CET:
case MSR_S_CET:
- case MSR_PL0_SSP ... MSR_INTERRUPT_SSP_TABLE:
+ case MSR_PL0_SSP ... MSR_ISST:
case MSR_AMD64_LWP_CFG:
case MSR_AMD64_LWP_CBADDR:
case MSR_PPIN_CTL:
--
2.39.5
* [PATCH 02/22] x86/msr: Rename wrmsr_ns() to wrmsrns(), and take 64bit value
2025-08-08 20:22 [PATCH 00/22] x86: FRED support, part 1 (stacks and exceptions) Andrew Cooper
2025-08-08 20:22 ` [PATCH 01/22] x86/msr: Rename MSR_INTERRUPT_SSP_TABLE to MSR_ISST Andrew Cooper
@ 2025-08-08 20:22 ` Andrew Cooper
2025-08-11 6:36 ` Andrew Cooper
2025-08-08 20:22 ` [PATCH 03/22] x86/traps: Drop incorrect BUILD_BUG_ON() and comment in load_system_tables() Andrew Cooper
` (20 subsequent siblings)
22 siblings, 1 reply; 120+ messages in thread
From: Andrew Cooper @ 2025-08-08 20:22 UTC (permalink / raw)
To: Xen-devel; +Cc: Andrew Cooper, Jan Beulich, Roger Pau Monné
In hindsight, having the wrapper name not be the instruction mnemonic was a
poor choice. Also, PKS turns out to be quite rare in wanting a split value.
Switch to using a single 64bit value in preparation for new users.
No functional change.
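
For illustration only, a minimal sketch of how a single 64bit value maps onto the
EDX:EAX pair that WRMSR/WRMSRNS consume. The helper names msr_val_lo() and
msr_val_hi() are hypothetical and not part of Xen; the patch itself simply
passes val and val >> 32 as the "a" and "d" operands, relying on the
instruction ignoring the upper halves of RAX/RDX.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helpers, for illustration only; not part of Xen. */

/* Low 32 bits of the MSR value, as consumed from EAX. */
static inline uint32_t msr_val_lo(uint64_t val)
{
    return (uint32_t)val;
}

/* High 32 bits of the MSR value, as consumed from EDX. */
static inline uint32_t msr_val_hi(uint64_t val)
{
    return (uint32_t)(val >> 32);
}

/* A 32bit quantity such as PKRS zero-extends, so the high half is 0. */
static inline void check_pkrs_style_value(uint32_t pkrs)
{
    assert(msr_val_hi(pkrs) == 0);
    assert(msr_val_lo(pkrs) == pkrs);
}
```

This is why wrpkrs() can drop its explicit 0 for the high half: the uint32_t
argument is zero-extended when converted to the uint64_t parameter.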
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
CC: Jan Beulich <JBeulich@suse.com>
CC: Roger Pau Monné <roger.pau@citrix.com>
---
xen/arch/x86/include/asm/msr.h | 4 ++--
xen/arch/x86/include/asm/prot-key.h | 4 ++--
2 files changed, 4 insertions(+), 4 deletions(-)
diff --git a/xen/arch/x86/include/asm/msr.h b/xen/arch/x86/include/asm/msr.h
index 4c4f18b3a54d..b6b85b04c3fd 100644
--- a/xen/arch/x86/include/asm/msr.h
+++ b/xen/arch/x86/include/asm/msr.h
@@ -39,7 +39,7 @@ static inline void wrmsrl(unsigned int msr, uint64_t val)
}
/* Non-serialising WRMSR, when available. Falls back to a serialising WRMSR. */
-static inline void wrmsr_ns(uint32_t msr, uint32_t lo, uint32_t hi)
+static inline void wrmsrns(uint32_t msr, uint64_t val)
{
/*
* WRMSR is 2 bytes. WRMSRNS is 3 bytes. Pad WRMSR with a redundant CS
@@ -47,7 +47,7 @@ static inline void wrmsr_ns(uint32_t msr, uint32_t lo, uint32_t hi)
*/
alternative_input(".byte 0x2e; wrmsr",
".byte 0x0f,0x01,0xc6", X86_FEATURE_WRMSRNS,
- "c" (msr), "a" (lo), "d" (hi));
+ "c" (msr), "a" (val), "d" (val >> 32));
}
/* rdmsr with exception handling */
diff --git a/xen/arch/x86/include/asm/prot-key.h b/xen/arch/x86/include/asm/prot-key.h
index 0cbecc2df401..3e9c2eaef415 100644
--- a/xen/arch/x86/include/asm/prot-key.h
+++ b/xen/arch/x86/include/asm/prot-key.h
@@ -72,14 +72,14 @@ static inline void wrpkrs(uint32_t pkrs)
{
*this_pkrs = pkrs;
- wrmsr_ns(MSR_PKRS, pkrs, 0);
+ wrmsrns(MSR_PKRS, pkrs);
}
}
static inline void wrpkrs_and_cache(uint32_t pkrs)
{
this_cpu(pkrs) = pkrs;
- wrmsr_ns(MSR_PKRS, pkrs, 0);
+ wrmsrns(MSR_PKRS, pkrs);
}
#endif /* ASM_PROT_KEY_H */
--
2.39.5
* [PATCH 03/22] x86/traps: Drop incorrect BUILD_BUG_ON() and comment in load_system_tables()
2025-08-08 20:22 [PATCH 00/22] x86: FRED support, part 1 (stacks and exceptions) Andrew Cooper
2025-08-08 20:22 ` [PATCH 01/22] x86/msr: Rename MSR_INTERRUPT_SSP_TABLE to MSR_ISST Andrew Cooper
2025-08-08 20:22 ` [PATCH 02/22] x86/msr: Rename wrmsr_ns() to wrmsrns(), and take 64bit value Andrew Cooper
@ 2025-08-08 20:22 ` Andrew Cooper
2025-08-12 8:11 ` Jan Beulich
2025-08-08 20:22 ` [PATCH 04/22] x86/idt: Minor improvements to _update_gate_addr_lower() Andrew Cooper
` (19 subsequent siblings)
22 siblings, 1 reply; 120+ messages in thread
From: Andrew Cooper @ 2025-08-08 20:22 UTC (permalink / raw)
To: Xen-devel; +Cc: Andrew Cooper, Jan Beulich, Roger Pau Monné
This was added erroneously by me.
Hardware task switching does demand a TSS of at least 0x67 bytes, but that's
not relevant in 64bit, and not relevant for Xen since commit
5d1181a5ea5e ("xen: Remove x86_32 build target.") in 2012.
We already load a 0-length TSS in early_traps_init() demonstrating that it's
possible.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
CC: Jan Beulich <JBeulich@suse.com>
CC: Roger Pau Monné <roger.pau@citrix.com>
---
xen/arch/x86/cpu/common.c | 2 --
1 file changed, 2 deletions(-)
diff --git a/xen/arch/x86/cpu/common.c b/xen/arch/x86/cpu/common.c
index f6ec5c9df522..cdc41248d4e9 100644
--- a/xen/arch/x86/cpu/common.c
+++ b/xen/arch/x86/cpu/common.c
@@ -936,8 +936,6 @@ void load_system_tables(void)
wrmsrl(MSR_ISST, (unsigned long)ist_ssp);
}
- BUILD_BUG_ON(sizeof(*tss) <= 0x67); /* Mandated by the architecture. */
-
_set_tssldt_desc(gdt + TSS_ENTRY, (unsigned long)tss,
sizeof(*tss) - 1, SYS_DESC_tss_avail);
if ( IS_ENABLED(CONFIG_PV32) )
--
2.39.5
* [PATCH 04/22] x86/idt: Minor improvements to _update_gate_addr_lower()
2025-08-08 20:22 [PATCH 00/22] x86: FRED support, part 1 (stacks and exceptions) Andrew Cooper
` (2 preceding siblings ...)
2025-08-08 20:22 ` [PATCH 03/22] x86/traps: Drop incorrect BUILD_BUG_ON() and comment in load_system_tables() Andrew Cooper
@ 2025-08-08 20:22 ` Andrew Cooper
2025-08-12 8:16 ` Jan Beulich
2025-08-08 20:22 ` [PATCH 05/22] x86/traps: Rename early_traps_init() to bsp_early_traps_init() Andrew Cooper
` (18 subsequent siblings)
22 siblings, 1 reply; 120+ messages in thread
From: Andrew Cooper @ 2025-08-08 20:22 UTC (permalink / raw)
To: Xen-devel; +Cc: Andrew Cooper, Jan Beulich, Roger Pau Monné
After some experimentation, using .a/.b makes far better logic than using the
named fields, as both Clang and GCC spill idte to the stack when named fields
are used.
GCC seems to do something very daft for the addr1 field. It takes addr,
shifts it by 32, then ANDs with 0xffff000000000000UL, which requires
manifesting a MOVABS.
Clang follows the C, whereby it ANDs with $imm32, then shifts, avoiding the
MOVABS entirely.
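
As a sanity check of the "no functional change" claim, here is a standalone
sketch comparing the old and new expressions. gate_t is a simplified stand-in
for idt_entry_t (assumed here to be just two 64bit halves, .a and .b), and
update_old()/update_new() are hypothetical names; neither is Xen code.

```c
#include <assert.h>
#include <stdint.h>

/* Simplified stand-in for idt_entry_t: two 64bit halves (assumption). */
typedef struct {
    uint64_t a, b;
} gate_t;

/* The expression as it was before the patch. */
static gate_t update_old(gate_t gate, uint64_t addr)
{
    gate_t idte;

    idte.a = gate.a;
    idte.b = addr >> 32;
    idte.a &= 0x0000FFFFFFFF0000ULL;
    idte.a |= ((addr & 0xFFFF0000ULL) << 32) | (addr & 0xFFFFULL);

    return idte;
}

/* The expression after the patch: mask with an imm32 first, then shift. */
static gate_t update_new(gate_t gate, uint64_t addr)
{
    uint32_t addr1 = addr & 0xffff0000U; /* Bits 16-31 of the handler. */
    gate_t idte;

    idte.b = addr >> 32;
    idte.a = gate.a & 0x0000ffffffff0000ULL;
    idte.a |= (uint64_t)addr1 << 32;
    idte.a |= addr & 0xffff;

    return idte;
}

/* Both forms must agree for any gate contents and handler address. */
static void check_equivalent(gate_t gate, uint64_t addr)
{
    gate_t o = update_old(gate, addr), n = update_new(gate, addr);

    assert(o.a == n.a && o.b == n.b);
}
```

Bits 16-47 of .a (selector, IST and type) are preserved from the existing
gate, while the handler address is scattered into bits 0-15 and 48-63 of .a
and the low 32 bits of .b.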
No functional change.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
CC: Jan Beulich <JBeulich@suse.com>
CC: Roger Pau Monné <roger.pau@citrix.com>
I'm disappointed about how poor the code generation is when assigning to named
fields, but I suppose it is a harder problem for the compiler to figure out.
add/remove: 0/0 grow/shrink: 0/2 up/down: 0/-24 (-24)
Function old new delta
machine_kexec 356 348 -8
traps_init 434 418 -16
---
xen/arch/x86/include/asm/idt.h | 13 +++++++------
1 file changed, 7 insertions(+), 6 deletions(-)
diff --git a/xen/arch/x86/include/asm/idt.h b/xen/arch/x86/include/asm/idt.h
index f613d5693e0e..b5e570a77fae 100644
--- a/xen/arch/x86/include/asm/idt.h
+++ b/xen/arch/x86/include/asm/idt.h
@@ -92,15 +92,16 @@ static inline void _set_gate_lower(idt_entry_t *gate, unsigned long type,
* Update the lower half handler of an IDT entry, without changing any other
* configuration.
*/
-static inline void _update_gate_addr_lower(idt_entry_t *gate, void *addr)
+static inline void _update_gate_addr_lower(idt_entry_t *gate, void *_addr)
{
+ unsigned long addr = (unsigned long)_addr;
+ unsigned int addr1 = addr & 0xffff0000U; /* GCC force better codegen. */
idt_entry_t idte;
- idte.a = gate->a;
- idte.b = ((unsigned long)(addr) >> 32);
- idte.a &= 0x0000FFFFFFFF0000ULL;
- idte.a |= (((unsigned long)(addr) & 0xFFFF0000UL) << 32) |
- ((unsigned long)(addr) & 0xFFFFUL);
+ idte.b = addr >> 32;
+ idte.a = gate->a & 0x0000ffffffff0000UL;
+ idte.a |= (unsigned long)addr1 << 32;
+ idte.a |= addr & 0xffff;
_write_gate_lower(gate, &idte);
}
--
2.39.5
* [PATCH 05/22] x86/traps: Rename early_traps_init() to bsp_early_traps_init()
2025-08-08 20:22 [PATCH 00/22] x86: FRED support, part 1 (stacks and exceptions) Andrew Cooper
` (3 preceding siblings ...)
2025-08-08 20:22 ` [PATCH 04/22] x86/idt: Minor improvements to _update_gate_addr_lower() Andrew Cooper
@ 2025-08-08 20:22 ` Andrew Cooper
2025-08-12 8:17 ` Jan Beulich
2025-08-08 20:22 ` [PATCH 06/22] x86/traps: Introduce bsp_traps_reinit() Andrew Cooper
` (17 subsequent siblings)
22 siblings, 1 reply; 120+ messages in thread
From: Andrew Cooper @ 2025-08-08 20:22 UTC (permalink / raw)
To: Xen-devel; +Cc: Andrew Cooper, Jan Beulich, Roger Pau Monné
We're going to want to introduce an AP version shortly.
No functional change.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
CC: Jan Beulich <JBeulich@suse.com>
CC: Roger Pau Monné <roger.pau@citrix.com>
---
xen/arch/x86/include/asm/traps.h | 2 +-
xen/arch/x86/setup.c | 2 +-
xen/arch/x86/traps-setup.c | 2 +-
3 files changed, 3 insertions(+), 3 deletions(-)
diff --git a/xen/arch/x86/include/asm/traps.h b/xen/arch/x86/include/asm/traps.h
index 72c33a33e283..7414420e57d8 100644
--- a/xen/arch/x86/include/asm/traps.h
+++ b/xen/arch/x86/include/asm/traps.h
@@ -7,7 +7,7 @@
#ifndef ASM_TRAP_H
#define ASM_TRAP_H
-void early_traps_init(void);
+void bsp_early_traps_init(void);
void traps_init(void);
void percpu_traps_init(void);
diff --git a/xen/arch/x86/setup.c b/xen/arch/x86/setup.c
index 1543dd251cc6..64f02699e1aa 100644
--- a/xen/arch/x86/setup.c
+++ b/xen/arch/x86/setup.c
@@ -1134,7 +1134,7 @@ void asmlinkage __init noreturn __start_xen(void)
percpu_init_areas();
- early_traps_init();
+ bsp_early_traps_init();
smp_prepare_boot_cpu();
sort_exception_tables();
diff --git a/xen/arch/x86/traps-setup.c b/xen/arch/x86/traps-setup.c
index a8385b26ae9b..7713f427d344 100644
--- a/xen/arch/x86/traps-setup.c
+++ b/xen/arch/x86/traps-setup.c
@@ -62,7 +62,7 @@ static void __init init_ler(void)
* boot_gdt is already loaded, and bsp_idt[] is constructed without IST
* settings, so we don't need a TSS configured yet.
*/
-void __init early_traps_init(void)
+void __init bsp_early_traps_init(void)
{
const struct desc_ptr idtr = {
.base = (unsigned long)bsp_idt,
--
2.39.5
* [PATCH 06/22] x86/traps: Introduce bsp_traps_reinit()
2025-08-08 20:22 [PATCH 00/22] x86: FRED support, part 1 (stacks and exceptions) Andrew Cooper
` (4 preceding siblings ...)
2025-08-08 20:22 ` [PATCH 05/22] x86/traps: Rename early_traps_init() to bsp_early_traps_init() Andrew Cooper
@ 2025-08-08 20:22 ` Andrew Cooper
2025-08-12 8:19 ` Jan Beulich
2025-08-08 20:22 ` [PATCH 07/22] x86/spec-ctrl: Rework init_shadow_spec_ctrl_state() to take an info pointer Andrew Cooper
` (16 subsequent siblings)
22 siblings, 1 reply; 120+ messages in thread
From: Andrew Cooper @ 2025-08-08 20:22 UTC (permalink / raw)
To: Xen-devel; +Cc: Andrew Cooper, Jan Beulich, Roger Pau Monné
... to abstract away updating the references to the old BSP stack.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
CC: Jan Beulich <JBeulich@suse.com>
CC: Roger Pau Monné <roger.pau@citrix.com>
---
xen/arch/x86/include/asm/traps.h | 1 +
xen/arch/x86/setup.c | 6 +-----
xen/arch/x86/traps-setup.c | 9 +++++++++
3 files changed, 11 insertions(+), 5 deletions(-)
diff --git a/xen/arch/x86/include/asm/traps.h b/xen/arch/x86/include/asm/traps.h
index 7414420e57d8..6ae451d3fc70 100644
--- a/xen/arch/x86/include/asm/traps.h
+++ b/xen/arch/x86/include/asm/traps.h
@@ -9,6 +9,7 @@
void bsp_early_traps_init(void);
void traps_init(void);
+void bsp_traps_reinit(void);
void percpu_traps_init(void);
extern unsigned int ler_msr;
diff --git a/xen/arch/x86/setup.c b/xen/arch/x86/setup.c
index 64f02699e1aa..c8c408e02436 100644
--- a/xen/arch/x86/setup.c
+++ b/xen/arch/x86/setup.c
@@ -902,11 +902,7 @@ static void __init noreturn reinit_bsp_stack(void)
unsigned long *stack = (void*)(get_stack_bottom() & ~(STACK_SIZE - 1));
int rc;
- /* Update TSS and ISTs */
- load_system_tables();
-
- /* Update SYSCALL trampolines */
- percpu_traps_init();
+ bsp_traps_reinit();
stack_base[0] = stack;
diff --git a/xen/arch/x86/traps-setup.c b/xen/arch/x86/traps-setup.c
index 7713f427d344..370f4d5f7b60 100644
--- a/xen/arch/x86/traps-setup.c
+++ b/xen/arch/x86/traps-setup.c
@@ -107,6 +107,15 @@ void __init traps_init(void)
percpu_traps_init();
}
+/*
+ * Re-initialise all state referencing the early-boot stack.
+ */
+void bsp_traps_reinit(void)
+{
+ load_system_tables();
+ percpu_traps_init();
+}
+
/*
* Set up per-CPU linkage registers for exception, interrupt and syscall
* handling.
--
2.39.5
* [PATCH 07/22] x86/spec-ctrl: Rework init_shadow_spec_ctrl_state() to take an info pointer
2025-08-08 20:22 [PATCH 00/22] x86: FRED support, part 1 (stacks and exceptions) Andrew Cooper
` (5 preceding siblings ...)
2025-08-08 20:22 ` [PATCH 06/22] x86/traps: Introduce bsp_traps_reinit() Andrew Cooper
@ 2025-08-08 20:22 ` Andrew Cooper
2025-08-12 8:27 ` Jan Beulich
2025-08-08 20:23 ` [PATCH 08/22] x86/traps: Introduce ap_early_traps_init() and set up exception handling earlier Andrew Cooper
` (15 subsequent siblings)
22 siblings, 1 reply; 120+ messages in thread
From: Andrew Cooper @ 2025-08-08 20:22 UTC (permalink / raw)
To: Xen-devel; +Cc: Andrew Cooper, Jan Beulich, Roger Pau Monné
We're going to want to reuse it for a remote stack shortly.
No functional change.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
CC: Jan Beulich <JBeulich@suse.com>
CC: Roger Pau Monné <roger.pau@citrix.com>
---
xen/arch/x86/include/asm/spec_ctrl.h | 4 +---
xen/arch/x86/setup.c | 2 +-
xen/arch/x86/smpboot.c | 2 +-
xen/arch/x86/spec_ctrl.c | 2 +-
4 files changed, 4 insertions(+), 6 deletions(-)
diff --git a/xen/arch/x86/include/asm/spec_ctrl.h b/xen/arch/x86/include/asm/spec_ctrl.h
index 6724d3812029..3d92928f9439 100644
--- a/xen/arch/x86/include/asm/spec_ctrl.h
+++ b/xen/arch/x86/include/asm/spec_ctrl.h
@@ -99,10 +99,8 @@ extern bool opt_bp_spec_reduce;
*/
extern paddr_t l1tf_addr_mask, l1tf_safe_maddr;
-static inline void init_shadow_spec_ctrl_state(void)
+static inline void init_shadow_spec_ctrl_state(struct cpu_info *info)
{
- struct cpu_info *info = get_cpu_info();
-
info->shadow_spec_ctrl = 0;
info->xen_spec_ctrl = default_xen_spec_ctrl;
info->scf = default_scf;
diff --git a/xen/arch/x86/setup.c b/xen/arch/x86/setup.c
index c8c408e02436..6fb42c5a5f95 100644
--- a/xen/arch/x86/setup.c
+++ b/xen/arch/x86/setup.c
@@ -1126,7 +1126,7 @@ void asmlinkage __init noreturn __start_xen(void)
/* Critical region without exception handling. Any fault is deadly! */
- init_shadow_spec_ctrl_state();
+ init_shadow_spec_ctrl_state(info);
percpu_init_areas();
diff --git a/xen/arch/x86/smpboot.c b/xen/arch/x86/smpboot.c
index 302be4341bf3..ce4862dde5a7 100644
--- a/xen/arch/x86/smpboot.c
+++ b/xen/arch/x86/smpboot.c
@@ -332,7 +332,7 @@ void asmlinkage start_secondary(void)
set_current(idle_vcpu[cpu]);
this_cpu(curr_vcpu) = idle_vcpu[cpu];
rdmsrl(MSR_EFER, this_cpu(efer));
- init_shadow_spec_ctrl_state();
+ init_shadow_spec_ctrl_state(info);
/*
* Just as during early bootstrap, it is convenient here to disable
diff --git a/xen/arch/x86/spec_ctrl.c b/xen/arch/x86/spec_ctrl.c
index feae0d710f8e..1ff3d6835d9d 100644
--- a/xen/arch/x86/spec_ctrl.c
+++ b/xen/arch/x86/spec_ctrl.c
@@ -2226,7 +2226,7 @@ void __init init_speculation_mitigations(void)
opt_eager_fpu = should_use_eager_fpu();
/* (Re)init BSP state now that default_scf has been calculated. */
- init_shadow_spec_ctrl_state();
+ init_shadow_spec_ctrl_state(get_cpu_info());
/*
* For microcoded IBRS only (i.e. Intel, pre eIBRS), it is recommended to
--
2.39.5
* [PATCH 08/22] x86/traps: Introduce ap_early_traps_init() and set up exception handling earlier
2025-08-08 20:22 [PATCH 00/22] x86: FRED support, part 1 (stacks and exceptions) Andrew Cooper
` (6 preceding siblings ...)
2025-08-08 20:22 ` [PATCH 07/22] x86/spec-ctrl: Rework init_shadow_spec_ctrl_state() to take an info pointer Andrew Cooper
@ 2025-08-08 20:23 ` Andrew Cooper
2025-08-12 8:41 ` Jan Beulich
2025-08-14 18:07 ` [PATCH v1.1 08/22] x86/traps: Introduce percpu_early_traps_init() " Andrew Cooper
2025-08-08 20:23 ` [PATCH 09/22] x86/traps: Move load_system_tables() into traps-setup.c Andrew Cooper
` (14 subsequent siblings)
22 siblings, 2 replies; 120+ messages in thread
From: Andrew Cooper @ 2025-08-08 20:23 UTC (permalink / raw)
To: Xen-devel; +Cc: Andrew Cooper, Jan Beulich, Roger Pau Monné
As things stand, we set up AP/S3 exception handling marginally after the
fragile activity of setting up shadow stacks. Shadow stack setup is going to
get more complicated under FRED.
Introduce ap_early_traps_init() and call it ahead of setting up shadow stacks.
To start with, call load_system_tables() which is sufficient to set up full
exception handling.
In order to handle exceptions, current and the speculation controls need to
work. cpu_smpboot_alloc() already constructs some of the AP's top-of-stack
block, so have it set up a little more.
This gets us complete exception coverage of setting up shadow stacks, rather
than dying with a triple fault.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
CC: Jan Beulich <JBeulich@suse.com>
CC: Roger Pau Monné <roger.pau@citrix.com>
---
xen/arch/x86/acpi/wakeup_prot.S | 5 +++--
xen/arch/x86/boot/x86_64.S | 5 ++++-
xen/arch/x86/smpboot.c | 17 ++++-------------
xen/arch/x86/traps-setup.c | 12 ++++++++++++
4 files changed, 23 insertions(+), 16 deletions(-)
diff --git a/xen/arch/x86/acpi/wakeup_prot.S b/xen/arch/x86/acpi/wakeup_prot.S
index 92af6230b31f..60eca4010042 100644
--- a/xen/arch/x86/acpi/wakeup_prot.S
+++ b/xen/arch/x86/acpi/wakeup_prot.S
@@ -63,6 +63,9 @@ LABEL(s3_resume)
pushq %rax
lretq
1:
+ /* Set up early exceptions and CET before entering C properly. */
+ call ap_early_traps_init
+
#if defined(CONFIG_XEN_SHSTK) || defined(CONFIG_XEN_IBT)
call xen_msr_s_cet_value
test %eax, %eax
@@ -117,8 +120,6 @@ LABEL(s3_resume)
.L_cet_done:
#endif /* CONFIG_XEN_SHSTK || CONFIG_XEN_IBT */
- call load_system_tables
-
/* Restore CR4 from the cpuinfo block. */
GET_STACK_END(bx)
mov STACK_CPUINFO_FIELD(cr4)(%rbx), %rax
diff --git a/xen/arch/x86/boot/x86_64.S b/xen/arch/x86/boot/x86_64.S
index 95a6b6cf63bd..0dfcc8a88a40 100644
--- a/xen/arch/x86/boot/x86_64.S
+++ b/xen/arch/x86/boot/x86_64.S
@@ -30,7 +30,10 @@ ENTRY(__high_start)
test %ebx,%ebx
jz .L_bsp
- /* APs. Set up CET before entering C properly. */
+ /* APs. Set up early exceptions and CET before entering C properly. */
+
+ call ap_early_traps_init
+
#if defined(CONFIG_XEN_SHSTK) || defined(CONFIG_XEN_IBT)
call xen_msr_s_cet_value
test %eax, %eax
diff --git a/xen/arch/x86/smpboot.c b/xen/arch/x86/smpboot.c
index ce4862dde5a7..8af6556999d7 100644
--- a/xen/arch/x86/smpboot.c
+++ b/xen/arch/x86/smpboot.c
@@ -327,12 +327,7 @@ void asmlinkage start_secondary(void)
struct cpu_info *info = get_cpu_info();
unsigned int cpu = smp_processor_id();
- /* Critical region without IDT or TSS. Any fault is deadly! */
-
- set_current(idle_vcpu[cpu]);
- this_cpu(curr_vcpu) = idle_vcpu[cpu];
rdmsrl(MSR_EFER, this_cpu(efer));
- init_shadow_spec_ctrl_state(info);
/*
* Just as during early bootstrap, it is convenient here to disable
@@ -352,14 +347,6 @@ void asmlinkage start_secondary(void)
*/
spin_debug_disable();
- get_cpu_info()->use_pv_cr3 = false;
- get_cpu_info()->xen_cr3 = 0;
- get_cpu_info()->pv_cr3 = 0;
-
- load_system_tables();
-
- /* Full exception support from here on in. */
-
if ( cpu_has_pks )
wrpkrs_and_cache(0); /* Must be before setting CR4.PKS */
@@ -1064,8 +1051,12 @@ static int cpu_smpboot_alloc(unsigned int cpu)
goto out;
info = get_cpu_info_from_stack((unsigned long)stack_base[cpu]);
+ memset(info, 0, sizeof(*info));
+ init_shadow_spec_ctrl_state(info);
info->processor_id = cpu;
info->per_cpu_offset = __per_cpu_offset[cpu];
+ info->current_vcpu = idle_vcpu[cpu];
+ per_cpu(curr_vcpu, cpu) = idle_vcpu[cpu];
gdt = per_cpu(gdt, cpu) ?: alloc_xenheap_pages(0, memflags);
if ( gdt == NULL )
diff --git a/xen/arch/x86/traps-setup.c b/xen/arch/x86/traps-setup.c
index 370f4d5f7b60..8ca379c9e4cb 100644
--- a/xen/arch/x86/traps-setup.c
+++ b/xen/arch/x86/traps-setup.c
@@ -127,3 +127,15 @@ void percpu_traps_init(void)
if ( cpu_has_xen_lbr )
wrmsrl(MSR_IA32_DEBUGCTLMSR, IA32_DEBUGCTLMSR_LBR);
}
+
+/*
+ * Configure exception handling on APs and S3. Called before entering C
+ * properly, and before shadow stacks are activated.
+ *
+ * boot_gdt is currently loaded, and we must switch to our local GDT. The
+ * local IDT has unknown IST-ness.
+ */
+void asmlinkage ap_early_traps_init(void)
+{
+ load_system_tables();
+}
--
2.39.5
* [PATCH 09/22] x86/traps: Move load_system_tables() into traps-setup.c
2025-08-08 20:22 [PATCH 00/22] x86: FRED support, part 1 (stacks and exceptions) Andrew Cooper
` (7 preceding siblings ...)
2025-08-08 20:23 ` [PATCH 08/22] x86/traps: Introduce ap_early_traps_init() and set up exception handling earlier Andrew Cooper
@ 2025-08-08 20:23 ` Andrew Cooper
2025-08-12 9:19 ` Jan Beulich
2025-08-12 9:43 ` Nicola Vetrini
2025-08-08 20:23 ` [PATCH 10/22] x86/traps: Move subarch_percpu_traps_init() " Andrew Cooper
` (13 subsequent siblings)
22 siblings, 2 replies; 120+ messages in thread
From: Andrew Cooper @ 2025-08-08 20:23 UTC (permalink / raw)
To: Xen-devel; +Cc: Andrew Cooper, Jan Beulich, Roger Pau Monné
Switch it to Xen coding style and fix MISRA violations. Make it static as
there are no external callers now.
Since commit a35816b5cae8 ("x86/traps: Introduce early_traps_init() and
simplify setup"), load_system_tables() is called later on the BSP, so the
SYS_STATE_early_boot check can be dropped from the safety BUG_ON().
Move the BUILD_BUG_ON() into build_assertions(), and introduce an
endof_field() helper to make the expression clearer.
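
The hunk adding endof_field() is not visible in this part of the thread, so
the following is only a sketch of what such a helper plausibly looks like,
assuming it means "offset of the first byte past a member". The macro body
and the demo_regs/demo_info structures are assumptions for illustration, not
the definitions from xen/include/xen/macros.h or struct cpu_info.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed shape of the helper: offset just past the end of a member. */
#define endof_field(type, member) \
    (offsetof(type, member) + sizeof(((type *)0)->member))

/* Hypothetical stand-ins for cpu_user_regs / cpu_info. */
struct demo_regs {
    uint64_t r[4];          /* 32 bytes */
};

struct demo_info {
    struct demo_regs regs;  /* Must end on a 16-byte boundary. */
    uint64_t extra[2];
};

/*
 * Build-time check in the spirit of the BUILD_BUG_ON(): the end of the
 * register block, i.e. the bottom of the stack, must be 16-byte aligned.
 */
static_assert((endof_field(struct demo_info, regs) & 0xf) == 0,
              "end of regs must be 16-byte aligned");
```

Compared with subtracting two sizeof() values, naming the member whose end
matters states the intent of the alignment requirement directly.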
Swap wrmsrl(MSR_ISST, ...) for wrmsrns(). No serialisation is needed at this
point.
No functional change.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
CC: Jan Beulich <JBeulich@suse.com>
CC: Roger Pau Monné <roger.pau@citrix.com>
---
xen/arch/x86/cpu/common.c | 118 --------------------------
xen/arch/x86/include/asm/system.h | 1 -
xen/arch/x86/traps-setup.c | 132 ++++++++++++++++++++++++++++++
xen/include/xen/macros.h | 2 +
4 files changed, 134 insertions(+), 119 deletions(-)
diff --git a/xen/arch/x86/cpu/common.c b/xen/arch/x86/cpu/common.c
index cdc41248d4e9..da05015578aa 100644
--- a/xen/arch/x86/cpu/common.c
+++ b/xen/arch/x86/cpu/common.c
@@ -843,124 +843,6 @@ void print_cpu_info(unsigned int cpu)
static cpumask_t cpu_initialized;
-/*
- * Sets up system tables and descriptors.
- *
- * - Sets up TSS with stack pointers, including ISTs
- * - Inserts TSS selector into regular and compat GDTs
- * - Loads GDT, IDT, TR then null LDT
- * - Sets up IST references in the IDT
- */
-void load_system_tables(void)
-{
- unsigned int i, cpu = smp_processor_id();
- unsigned long stack_bottom = get_stack_bottom(),
- stack_top = stack_bottom & ~(STACK_SIZE - 1);
- /*
- * NB: define tss_page as a local variable because clang 3.5 doesn't
- * support using ARRAY_SIZE against per-cpu variables.
- */
- struct tss_page *tss_page = &this_cpu(tss_page);
- idt_entry_t *idt = this_cpu(idt);
-
- /* The TSS may be live. Disuade any clever optimisations. */
- volatile struct tss64 *tss = &tss_page->tss;
- seg_desc_t *gdt =
- this_cpu(gdt) - FIRST_RESERVED_GDT_ENTRY;
-
- const struct desc_ptr gdtr = {
- .base = (unsigned long)gdt,
- .limit = LAST_RESERVED_GDT_BYTE,
- };
- const struct desc_ptr idtr = {
- .base = (unsigned long)idt,
- .limit = sizeof(bsp_idt) - 1,
- };
-
- /*
- * Set up the TSS. Warning - may be live, and the NMI/#MC must remain
- * valid on every instruction boundary. (Note: these are all
- * semantically ACCESS_ONCE() due to tss's volatile qualifier.)
- *
- * rsp0 refers to the primary stack. #MC, NMI, #DB and #DF handlers
- * each get their own stacks. No IO Bitmap.
- */
- tss->rsp0 = stack_bottom;
- tss->ist[IST_MCE - 1] = stack_top + (1 + IST_MCE) * PAGE_SIZE;
- tss->ist[IST_NMI - 1] = stack_top + (1 + IST_NMI) * PAGE_SIZE;
- tss->ist[IST_DB - 1] = stack_top + (1 + IST_DB) * PAGE_SIZE;
- tss->ist[IST_DF - 1] = stack_top + (1 + IST_DF) * PAGE_SIZE;
- tss->bitmap = IOBMP_INVALID_OFFSET;
-
- /* All other stack pointers poisioned. */
- for ( i = IST_MAX; i < ARRAY_SIZE(tss->ist); ++i )
- tss->ist[i] = 0x8600111111111111ul;
- tss->rsp1 = 0x8600111111111111ul;
- tss->rsp2 = 0x8600111111111111ul;
-
- /*
- * Set up the shadow stack IST. Used entries must point at the
- * supervisor stack token. Unused entries are poisoned.
- *
- * This IST Table may be live, and the NMI/#MC entries must
- * remain valid on every instruction boundary, hence the
- * volatile qualifier.
- */
- if (cpu_has_xen_shstk) {
- volatile uint64_t *ist_ssp = tss_page->ist_ssp;
- unsigned long
- mce_ssp = stack_top + (IST_MCE * IST_SHSTK_SIZE) - 8,
- nmi_ssp = stack_top + (IST_NMI * IST_SHSTK_SIZE) - 8,
- db_ssp = stack_top + (IST_DB * IST_SHSTK_SIZE) - 8,
- df_ssp = stack_top + (IST_DF * IST_SHSTK_SIZE) - 8;
-
- ist_ssp[0] = 0x8600111111111111ul;
- ist_ssp[IST_MCE] = mce_ssp;
- ist_ssp[IST_NMI] = nmi_ssp;
- ist_ssp[IST_DB] = db_ssp;
- ist_ssp[IST_DF] = df_ssp;
- for ( i = IST_DF + 1; i < ARRAY_SIZE(tss_page->ist_ssp); ++i )
- ist_ssp[i] = 0x8600111111111111ul;
-
- if (IS_ENABLED(CONFIG_XEN_SHSTK) && rdssp() != SSP_NO_SHSTK) {
- /*
- * Rewrite supervisor tokens when shadow stacks are
- * active. This resets any busy bits left across S3.
- */
- wrss(mce_ssp, _p(mce_ssp));
- wrss(nmi_ssp, _p(nmi_ssp));
- wrss(db_ssp, _p(db_ssp));
- wrss(df_ssp, _p(df_ssp));
- }
-
- wrmsrl(MSR_ISST, (unsigned long)ist_ssp);
- }
-
- _set_tssldt_desc(gdt + TSS_ENTRY, (unsigned long)tss,
- sizeof(*tss) - 1, SYS_DESC_tss_avail);
- if ( IS_ENABLED(CONFIG_PV32) )
- _set_tssldt_desc(
- this_cpu(compat_gdt) - FIRST_RESERVED_GDT_ENTRY + TSS_ENTRY,
- (unsigned long)tss, sizeof(*tss) - 1, SYS_DESC_tss_busy);
-
- per_cpu(full_gdt_loaded, cpu) = false;
- lgdt(&gdtr);
- lidt(&idtr);
- ltr(TSS_SELECTOR);
- lldt(0);
-
- enable_each_ist(idt);
-
- /*
- * Bottom-of-stack must be 16-byte aligned!
- *
- * Defer checks until exception support is sufficiently set up.
- */
- BUILD_BUG_ON((sizeof(struct cpu_info) -
- sizeof(struct cpu_user_regs)) & 0xf);
- BUG_ON(system_state != SYS_STATE_early_boot && (stack_bottom & 0xf));
-}
-
static void skinit_enable_intr(void)
{
uint64_t val;
diff --git a/xen/arch/x86/include/asm/system.h b/xen/arch/x86/include/asm/system.h
index 57446c5b465c..3cdc56e4ba6d 100644
--- a/xen/arch/x86/include/asm/system.h
+++ b/xen/arch/x86/include/asm/system.h
@@ -256,7 +256,6 @@ static inline int local_irq_is_enabled(void)
#define BROKEN_ACPI_Sx 0x0001
#define BROKEN_INIT_AFTER_S1 0x0002
-void load_system_tables(void);
void subarch_percpu_traps_init(void);
#endif
diff --git a/xen/arch/x86/traps-setup.c b/xen/arch/x86/traps-setup.c
index 8ca379c9e4cb..13b8fcf0ba51 100644
--- a/xen/arch/x86/traps-setup.c
+++ b/xen/arch/x86/traps-setup.c
@@ -7,6 +7,7 @@
#include <asm/idt.h>
#include <asm/msr.h>
+#include <asm/shstk.h>
#include <asm/system.h>
#include <asm/traps.h>
@@ -19,6 +20,124 @@ boolean_param("ler", opt_ler);
void nocall entry_PF(void);
+/*
+ * Sets up system tables and descriptors for IDT delivery.
+ *
+ * - Sets up TSS with stack pointers, including ISTs
+ * - Inserts TSS selector into regular and compat GDTs
+ * - Loads GDT, IDT, TR then null LDT
+ * - Sets up IST references in the IDT
+ */
+static void load_system_tables(void)
+{
+ unsigned int i, cpu = smp_processor_id();
+ unsigned long stack_bottom = get_stack_bottom(),
+ stack_top = stack_bottom & ~(STACK_SIZE - 1);
+ /*
+ * NB: define tss_page as a local variable because clang 3.5 doesn't
+ * support using ARRAY_SIZE against per-cpu variables.
+ */
+ struct tss_page *tss_page = &this_cpu(tss_page);
+ idt_entry_t *idt = this_cpu(idt);
+
+ /* The TSS may be live. Dissuade any clever optimisations. */
+ volatile struct tss64 *tss = &tss_page->tss;
+ seg_desc_t *gdt =
+ this_cpu(gdt) - FIRST_RESERVED_GDT_ENTRY;
+
+ const struct desc_ptr gdtr = {
+ .base = (unsigned long)gdt,
+ .limit = LAST_RESERVED_GDT_BYTE,
+ };
+ const struct desc_ptr idtr = {
+ .base = (unsigned long)idt,
+ .limit = sizeof(bsp_idt) - 1,
+ };
+
+ /*
+ * Set up the TSS. Warning - may be live, and the NMI/#MC must remain
+ * valid on every instruction boundary. (Note: these are all
+ * semantically ACCESS_ONCE() due to tss's volatile qualifier.)
+ *
+ * rsp0 refers to the primary stack. #MC, NMI, #DB and #DF handlers
+ * each get their own stacks. No IO Bitmap.
+ */
+ tss->rsp0 = stack_bottom;
+ tss->ist[IST_MCE - 1] = stack_top + (1 + IST_MCE) * PAGE_SIZE;
+ tss->ist[IST_NMI - 1] = stack_top + (1 + IST_NMI) * PAGE_SIZE;
+ tss->ist[IST_DB - 1] = stack_top + (1 + IST_DB) * PAGE_SIZE;
+ tss->ist[IST_DF - 1] = stack_top + (1 + IST_DF) * PAGE_SIZE;
+ tss->bitmap = IOBMP_INVALID_OFFSET;
+
+ /* All other stack pointers poisoned. */
+ for ( i = IST_MAX; i < ARRAY_SIZE(tss->ist); ++i )
+ tss->ist[i] = 0x8600111111111111UL;
+ tss->rsp1 = 0x8600111111111111UL;
+ tss->rsp2 = 0x8600111111111111UL;
+
+ /*
+ * Set up the shadow stack IST. Used entries must point at the
+ * supervisor stack token. Unused entries are poisoned.
+ *
+ * This IST Table may be live, and the NMI/#MC entries must
+ * remain valid on every instruction boundary, hence the
+ * volatile qualifier.
+ */
+ if ( cpu_has_xen_shstk )
+ {
+ volatile uint64_t *ist_ssp = tss_page->ist_ssp;
+ unsigned long
+ mce_ssp = stack_top + (IST_MCE * IST_SHSTK_SIZE) - 8,
+ nmi_ssp = stack_top + (IST_NMI * IST_SHSTK_SIZE) - 8,
+ db_ssp = stack_top + (IST_DB * IST_SHSTK_SIZE) - 8,
+ df_ssp = stack_top + (IST_DF * IST_SHSTK_SIZE) - 8;
+
+ ist_ssp[0] = 0x8600111111111111UL;
+ ist_ssp[IST_MCE] = mce_ssp;
+ ist_ssp[IST_NMI] = nmi_ssp;
+ ist_ssp[IST_DB] = db_ssp;
+ ist_ssp[IST_DF] = df_ssp;
+ for ( i = IST_DF + 1; i < ARRAY_SIZE(tss_page->ist_ssp); ++i )
+ ist_ssp[i] = 0x8600111111111111UL;
+
+ if ( IS_ENABLED(CONFIG_XEN_SHSTK) && rdssp() != SSP_NO_SHSTK )
+ {
+ /*
+ * Rewrite supervisor tokens when shadow stacks are
+ * active. This resets any busy bits left across S3.
+ */
+ wrss(mce_ssp, _p(mce_ssp));
+ wrss(nmi_ssp, _p(nmi_ssp));
+ wrss(db_ssp, _p(db_ssp));
+ wrss(df_ssp, _p(df_ssp));
+ }
+
+ wrmsrns(MSR_ISST, (unsigned long)ist_ssp);
+ }
+
+ _set_tssldt_desc(gdt + TSS_ENTRY, (unsigned long)tss,
+ sizeof(*tss) - 1, SYS_DESC_tss_avail);
+ if ( IS_ENABLED(CONFIG_PV32) )
+ _set_tssldt_desc(
+ this_cpu(compat_gdt) - FIRST_RESERVED_GDT_ENTRY + TSS_ENTRY,
+ (unsigned long)tss, sizeof(*tss) - 1, SYS_DESC_tss_busy);
+
+ per_cpu(full_gdt_loaded, cpu) = false;
+ lgdt(&gdtr);
+ lidt(&idtr);
+ ltr(TSS_SELECTOR);
+ lldt(0);
+
+ enable_each_ist(idt);
+
+ /*
+ * tss->rsp0 must be 16-byte aligned.
+ *
+ * Defer checks until exception support is sufficiently set up.
+ */
+ BUG_ON(stack_bottom & 15);
+}
+
static void __init init_ler(void)
{
unsigned int msr = 0;
@@ -139,3 +258,16 @@ void asmlinkage ap_early_traps_init(void)
{
load_system_tables();
}
+
+static void __init __maybe_unused build_assertions(void)
+{
+ /*
+ * This is best-effort (it doesn't cover some padding corner cases), but
+ * is preferable to hitting the check at boot time.
+ *
+ * tss->rsp0, pointing at the end of cpu_info.guest_cpu_user_regs, must be
+ * 16-byte aligned.
+ */
+ BUILD_BUG_ON((sizeof(struct cpu_info) -
+ endof_field(struct cpu_info, guest_cpu_user_regs)) & 15);
+}
diff --git a/xen/include/xen/macros.h b/xen/include/xen/macros.h
index cd528fbdb127..726ba221e0d8 100644
--- a/xen/include/xen/macros.h
+++ b/xen/include/xen/macros.h
@@ -102,6 +102,8 @@
*/
#define sizeof_field(type, member) sizeof(((type *)NULL)->member)
+#define endof_field(type, member) (offsetof(type, member) + sizeof_field(type, member))
+
/* Cast an arbitrary integer to a pointer. */
#define _p(x) ((void *)(unsigned long)(x))
--
2.39.5
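The stack arithmetic in load_system_tables() can be checked in isolation.
The sketch below is only an illustration: PAGE_SIZE, STACK_SIZE,
IST_SHSTK_SIZE and the IST_* numbering are assumed values chosen for the
example (the real definitions live in Xen's headers), and the helper names
are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed values for illustration only. */
#define PAGE_SIZE       4096UL
#define STACK_SIZE      (8 * PAGE_SIZE)
#define IST_SHSTK_SIZE  1024UL

/* Assumed IST numbering: slot 0 is "no IST", slots 1..4 are in use. */
enum { IST_MCE = 1, IST_NMI, IST_DB, IST_DF, IST_MAX = IST_DF };

/*
 * "stack_top" as computed in the patch: stack_bottom with the low bits
 * masked off, i.e. the lowest address of the STACK_SIZE-aligned block.
 */
static unsigned long stack_top_of(unsigned long stack_bottom)
{
    return stack_bottom & ~(STACK_SIZE - 1);
}

/* Initial stack pointer for an IST slot: one past the end of page 'ist'. */
static unsigned long ist_stack_ptr(unsigned long stack_top, unsigned int ist)
{
    return stack_top + (1 + ist) * PAGE_SIZE;
}

/* Address of the supervisor token: last 8 bytes of the slot's shadow stack. */
static unsigned long ist_shstk_token(unsigned long stack_top, unsigned int ist)
{
    return stack_top + (ist * IST_SHSTK_SIZE) - 8;
}
```

With these assumed sizes, a stack_bottom of 0x107e40 gives a block base of
0x100000, an #MC stack pointer of 0x102000 and an #MC supervisor token at
0x1003f8.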
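The build-time check added in build_assertions() depends on the new
endof_field() helper. A minimal standalone sketch of the same idea follows,
using hypothetical stand-in structures (struct demo_regs and struct
demo_info are not Xen's types) and C11 _Static_assert in place of
BUILD_BUG_ON().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define sizeof_field(type, member) sizeof(((type *)NULL)->member)
#define endof_field(type, member) \
    (offsetof(type, member) + sizeof_field(type, member))

/* Hypothetical stand-in for struct cpu_user_regs: 20 * 8 = 160 bytes. */
struct demo_regs {
    uint64_t gpr[15];
    uint64_t rip, cs, rflags, rsp, ss;
};

/* Hypothetical stand-in for struct cpu_info: regs first, then a 16-byte tail. */
struct demo_info {
    struct demo_regs guest_cpu_user_regs;
    uint32_t processor_id;
    uint32_t flags;
    uint64_t per_cpu_offset;
};

/* Number of bytes between the end of the regs block and the end of the struct. */
static size_t bytes_after_regs(void)
{
    return sizeof(struct demo_info) -
           endof_field(struct demo_info, guest_cpu_user_regs);
}

/*
 * If the end of the structure is 16-byte aligned, the end of the regs block
 * is too, provided the tail after it is a multiple of 16 bytes.
 */
_Static_assert(((sizeof(struct demo_info) -
                 endof_field(struct demo_info, guest_cpu_user_regs)) & 15) == 0,
               "end of guest_cpu_user_regs must be 16-byte aligned");
```

Adding a single uint64_t to the tail of struct demo_info makes the static
assertion fail at compile time, which is the behaviour the patch prefers
over hitting the BUG_ON() at boot.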
* [PATCH 10/22] x86/traps: Move subarch_percpu_traps_init() into traps-setup.c
2025-08-08 20:22 [PATCH 00/22] x86: FRED support, part 1 (stacks and exceptions) Andrew Cooper
` (8 preceding siblings ...)
2025-08-08 20:23 ` [PATCH 09/22] x86/traps: Move load_system_tables() into traps-setup.c Andrew Cooper
@ 2025-08-08 20:23 ` Andrew Cooper
2025-08-11 8:17 ` Andrew Cooper
2025-08-08 20:23 ` [PATCH 11/22] x86/traps: Fold x86_64/traps.c into traps.c Andrew Cooper
` (12 subsequent siblings)
22 siblings, 1 reply; 120+ messages in thread
From: Andrew Cooper @ 2025-08-08 20:23 UTC (permalink / raw)
To: Xen-devel; +Cc: Andrew Cooper, Jan Beulich, Roger Pau Monné
... along with the supporting functions. Switch to Xen coding style, and make
static as there are no external callers.
Rename to legacy_syscall_init() as a more accurate name.
No functional change.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
CC: Jan Beulich <JBeulich@suse.com>
CC: Roger Pau Monné <roger.pau@citrix.com>
---
xen/arch/x86/include/asm/system.h | 2 -
xen/arch/x86/traps-setup.c | 97 ++++++++++++++++++++++++++++++-
xen/arch/x86/x86_64/traps.c | 92 -----------------------------
3 files changed, 95 insertions(+), 96 deletions(-)
diff --git a/xen/arch/x86/include/asm/system.h b/xen/arch/x86/include/asm/system.h
index 3cdc56e4ba6d..6c2800d8158d 100644
--- a/xen/arch/x86/include/asm/system.h
+++ b/xen/arch/x86/include/asm/system.h
@@ -256,6 +256,4 @@ static inline int local_irq_is_enabled(void)
#define BROKEN_ACPI_Sx 0x0001
#define BROKEN_INIT_AFTER_S1 0x0002
-void subarch_percpu_traps_init(void);
-
#endif
diff --git a/xen/arch/x86/traps-setup.c b/xen/arch/x86/traps-setup.c
index 13b8fcf0ba51..fbae7072c292 100644
--- a/xen/arch/x86/traps-setup.c
+++ b/xen/arch/x86/traps-setup.c
@@ -2,13 +2,15 @@
/*
* Configuration of event handling for all CPUs.
*/
+#include <xen/domain_page.h>
#include <xen/init.h>
#include <xen/param.h>
+#include <asm/endbr.h>
#include <asm/idt.h>
#include <asm/msr.h>
#include <asm/shstk.h>
-#include <asm/system.h>
+#include <asm/stubs.h>
#include <asm/traps.h>
DEFINE_PER_CPU_READ_MOSTLY(idt_entry_t *, idt);
@@ -19,6 +21,8 @@ static bool __initdata opt_ler;
boolean_param("ler", opt_ler);
void nocall entry_PF(void);
+void nocall lstar_enter(void);
+void nocall cstar_enter(void);
/*
* Sets up system tables and descriptors for IDT delivery.
@@ -138,6 +142,95 @@ static void load_system_tables(void)
BUG_ON(stack_bottom & 15);
}
+static unsigned int write_stub_trampoline(
+ unsigned char *stub, unsigned long stub_va,
+ unsigned long stack_bottom, unsigned long target_va)
+{
+ unsigned char *p = stub;
+
+ if ( cpu_has_xen_ibt )
+ {
+ place_endbr64(p);
+ p += 4;
+ }
+
+ /* Store guest %rax into %ss slot */
+ /* movabsq %rax, stack_bottom - 8 */
+ *p++ = 0x48;
+ *p++ = 0xa3;
+ *(uint64_t *)p = stack_bottom - 8;
+ p += 8;
+
+ /* Store guest %rsp in %rax */
+ /* movq %rsp, %rax */
+ *p++ = 0x48;
+ *p++ = 0x89;
+ *p++ = 0xe0;
+
+ /* Switch to Xen stack */
+ /* movabsq $stack_bottom - 8, %rsp */
+ *p++ = 0x48;
+ *p++ = 0xbc;
+ *(uint64_t *)p = stack_bottom - 8;
+ p += 8;
+
+ /* jmp target_va */
+ *p++ = 0xe9;
+ *(int32_t *)p = target_va - (stub_va + (p - stub) + 4);
+ p += 4;
+
+ /* Round up to a multiple of 16 bytes. */
+ return ROUNDUP(p - stub, 16);
+}
+
+static void legacy_syscall_init(void)
+{
+ unsigned long stack_bottom = get_stack_bottom();
+ unsigned long stub_va = this_cpu(stubs.addr);
+ unsigned char *stub_page;
+ unsigned int offset;
+
+ /* No PV guests? No need to set up SYSCALL/SYSENTER infrastructure. */
+ if ( !IS_ENABLED(CONFIG_PV) )
+ return;
+
+ stub_page = map_domain_page(_mfn(this_cpu(stubs.mfn)));
+
+ /*
+ * Trampoline for SYSCALL entry from 64-bit mode. The VT-x HVM vcpu
+ * context switch logic relies on the SYSCALL trampoline being at the
+ * start of the stubs.
+ */
+ wrmsrl(MSR_LSTAR, stub_va);
+ offset = write_stub_trampoline(stub_page + (stub_va & ~PAGE_MASK),
+ stub_va, stack_bottom,
+ (unsigned long)lstar_enter);
+ stub_va += offset;
+
+ if ( cpu_has_sep )
+ {
+ /* SYSENTER entry. */
+ wrmsrl(MSR_IA32_SYSENTER_ESP, stack_bottom);
+ wrmsrl(MSR_IA32_SYSENTER_EIP, (unsigned long)sysenter_entry);
+ wrmsr(MSR_IA32_SYSENTER_CS, __HYPERVISOR_CS, 0);
+ }
+
+ /* Trampoline for SYSCALL entry from compatibility mode. */
+ wrmsrl(MSR_CSTAR, stub_va);
+ offset += write_stub_trampoline(stub_page + (stub_va & ~PAGE_MASK),
+ stub_va, stack_bottom,
+ (unsigned long)cstar_enter);
+
+ /* Don't consume more than half of the stub space here. */
+ ASSERT(offset <= STUB_BUF_SIZE / 2);
+
+ unmap_domain_page(stub_page);
+
+ /* Common SYSCALL parameters. */
+ wrmsrl(MSR_STAR, XEN_MSR_STAR);
+ wrmsrl(MSR_SYSCALL_MASK, XEN_SYSCALL_MASK);
+}
+
static void __init init_ler(void)
{
unsigned int msr = 0;
@@ -241,7 +334,7 @@ void bsp_traps_reinit(void)
*/
void percpu_traps_init(void)
{
- subarch_percpu_traps_init();
+ legacy_syscall_init();
if ( cpu_has_xen_lbr )
wrmsrl(MSR_IA32_DEBUGCTLMSR, IA32_DEBUGCTLMSR_LBR);
diff --git a/xen/arch/x86/x86_64/traps.c b/xen/arch/x86/x86_64/traps.c
index 34adf55e48df..81e64466e47e 100644
--- a/xen/arch/x86/x86_64/traps.c
+++ b/xen/arch/x86/x86_64/traps.c
@@ -311,98 +311,6 @@ void asmlinkage noreturn do_double_fault(struct cpu_user_regs *regs)
panic("DOUBLE FAULT -- system shutdown\n");
}
-static unsigned int write_stub_trampoline(
- unsigned char *stub, unsigned long stub_va,
- unsigned long stack_bottom, unsigned long target_va)
-{
- unsigned char *p = stub;
-
- if ( cpu_has_xen_ibt )
- {
- place_endbr64(p);
- p += 4;
- }
-
- /* Store guest %rax into %ss slot */
- /* movabsq %rax, stack_bottom - 8 */
- *p++ = 0x48;
- *p++ = 0xa3;
- *(uint64_t *)p = stack_bottom - 8;
- p += 8;
-
- /* Store guest %rsp in %rax */
- /* movq %rsp, %rax */
- *p++ = 0x48;
- *p++ = 0x89;
- *p++ = 0xe0;
-
- /* Switch to Xen stack */
- /* movabsq $stack_bottom - 8, %rsp */
- *p++ = 0x48;
- *p++ = 0xbc;
- *(uint64_t *)p = stack_bottom - 8;
- p += 8;
-
- /* jmp target_va */
- *p++ = 0xe9;
- *(int32_t *)p = target_va - (stub_va + (p - stub) + 4);
- p += 4;
-
- /* Round up to a multiple of 16 bytes. */
- return ROUNDUP(p - stub, 16);
-}
-
-void nocall lstar_enter(void);
-void nocall cstar_enter(void);
-
-void subarch_percpu_traps_init(void)
-{
- unsigned long stack_bottom = get_stack_bottom();
- unsigned long stub_va = this_cpu(stubs.addr);
- unsigned char *stub_page;
- unsigned int offset;
-
- /* No PV guests? No need to set up SYSCALL/SYSENTER infrastructure. */
- if ( !IS_ENABLED(CONFIG_PV) )
- return;
-
- stub_page = map_domain_page(_mfn(this_cpu(stubs.mfn)));
-
- /*
- * Trampoline for SYSCALL entry from 64-bit mode. The VT-x HVM vcpu
- * context switch logic relies on the SYSCALL trampoline being at the
- * start of the stubs.
- */
- wrmsrl(MSR_LSTAR, stub_va);
- offset = write_stub_trampoline(stub_page + (stub_va & ~PAGE_MASK),
- stub_va, stack_bottom,
- (unsigned long)lstar_enter);
- stub_va += offset;
-
- if ( cpu_has_sep )
- {
- /* SYSENTER entry. */
- wrmsrl(MSR_IA32_SYSENTER_ESP, stack_bottom);
- wrmsrl(MSR_IA32_SYSENTER_EIP, (unsigned long)sysenter_entry);
- wrmsr(MSR_IA32_SYSENTER_CS, __HYPERVISOR_CS, 0);
- }
-
- /* Trampoline for SYSCALL entry from compatibility mode. */
- wrmsrl(MSR_CSTAR, stub_va);
- offset += write_stub_trampoline(stub_page + (stub_va & ~PAGE_MASK),
- stub_va, stack_bottom,
- (unsigned long)cstar_enter);
-
- /* Don't consume more than half of the stub space here. */
- ASSERT(offset <= STUB_BUF_SIZE / 2);
-
- unmap_domain_page(stub_page);
-
- /* Common SYSCALL parameters. */
- wrmsrl(MSR_STAR, XEN_MSR_STAR);
- wrmsrl(MSR_SYSCALL_MASK, XEN_SYSCALL_MASK);
-}
-
/*
* Local variables:
* mode: C
--
2.39.5
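The byte layout produced by write_stub_trampoline() can be reproduced in a
plain user-space buffer. This is a sketch under assumptions:
emit_trampoline() and its ibt flag are hypothetical names, the ENDBR64 bytes
are written directly rather than via place_endbr64(), and memcpy() stands in
for the unaligned stores, so the immediates are only laid out correctly on a
little-endian host.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define ROUNDUP(x, a) (((x) + (a) - 1) & ~((a) - 1))

static unsigned int emit_trampoline(unsigned char *stub, uint64_t stub_va,
                                    uint64_t stack_bottom, uint64_t target_va,
                                    int ibt)
{
    static const unsigned char endbr64[] = { 0xf3, 0x0f, 0x1e, 0xfa };
    unsigned char *p = stub;
    uint64_t slot = stack_bottom - 8;
    int32_t disp;

    if ( ibt )
    {
        memcpy(p, endbr64, sizeof(endbr64));
        p += sizeof(endbr64);
    }

    /* movabsq %rax, stack_bottom - 8 */
    *p++ = 0x48;
    *p++ = 0xa3;
    memcpy(p, &slot, 8);
    p += 8;

    /* movq %rsp, %rax */
    *p++ = 0x48;
    *p++ = 0x89;
    *p++ = 0xe0;

    /* movabsq $stack_bottom - 8, %rsp */
    *p++ = 0x48;
    *p++ = 0xbc;
    memcpy(p, &slot, 8);
    p += 8;

    /*
     * jmp target_va.  The rel32 operand is relative to the end of the
     * instruction, i.e. the address 4 bytes past the displacement field.
     * The narrowing cast assumes the usual two's complement wrap-around.
     */
    *p++ = 0xe9;
    disp = (int32_t)(target_va - (stub_va + (uint64_t)(p - stub) + 4));
    memcpy(p, &disp, 4);
    p += 4;

    /* Round up to a multiple of 16 bytes. */
    return ROUNDUP((unsigned int)(p - stub), 16u);
}
```

Without the ENDBR64 prefix the sequence is 28 bytes long and with it 32
bytes, so either way each trampoline occupies 32 bytes of the stub buffer.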
* [PATCH 11/22] x86/traps: Fold x86_64/traps.c into traps.c
2025-08-08 20:22 [PATCH 00/22] x86: FRED support, part 1 (stacks and exceptions) Andrew Cooper
` (9 preceding siblings ...)
2025-08-08 20:23 ` [PATCH 10/22] x86/traps: Move subarch_percpu_traps_init() " Andrew Cooper
@ 2025-08-08 20:23 ` Andrew Cooper
2025-08-12 9:53 ` Jan Beulich
2025-08-08 20:23 ` [PATCH 12/22] x86/traps: Unexport show_code() and show_stack_overflow() Andrew Cooper
` (11 subsequent siblings)
22 siblings, 1 reply; 120+ messages in thread
From: Andrew Cooper @ 2025-08-08 20:23 UTC (permalink / raw)
To: Xen-devel; +Cc: Andrew Cooper, Jan Beulich, Roger Pau Monné
It's now just the double fault handler and various state dumping functions.
Swap u64 for uint64_t, and fix a few other minor style issues.
No functional change.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
CC: Jan Beulich <JBeulich@suse.com>
CC: Roger Pau Monné <roger.pau@citrix.com>
---
xen/arch/x86/traps.c | 291 +++++++++++++++++++++++++++++++
xen/arch/x86/x86_64/Makefile | 1 -
xen/arch/x86/x86_64/traps.c | 322 -----------------------------------
3 files changed, 291 insertions(+), 323 deletions(-)
delete mode 100644 xen/arch/x86/x86_64/traps.c
diff --git a/xen/arch/x86/traps.c b/xen/arch/x86/traps.c
index 238d923dd188..ab8ff36acfe6 100644
--- a/xen/arch/x86/traps.c
+++ b/xen/arch/x86/traps.c
@@ -26,6 +26,7 @@
#include <xen/sched.h>
#include <xen/softirq.h>
#include <xen/trace.h>
+#include <xen/version.h>
#include <xen/watchdog.h>
#include <asm/apic.h>
@@ -87,6 +88,272 @@ const unsigned int nmi_cpu;
#define stack_words_per_line 4
#define ESP_BEFORE_EXCEPTION(regs) ((unsigned long *)(regs)->rsp)
+struct extra_state
+{
+ unsigned long cr0, cr2, cr3, cr4;
+ unsigned long fsb, gsb, gss;
+ uint16_t ds, es, fs, gs;
+};
+
+static void print_xen_info(void)
+{
+ char taint_str[TAINT_STRING_MAX_LEN];
+
+ printk("----[ Xen-%d.%d%s x86_64 %s %s ]----\n",
+ xen_major_version(), xen_minor_version(), xen_extra_version(),
+ xen_build_info(), print_tainted(taint_str));
+}
+
+enum context {
+ CTXT_hypervisor,
+ CTXT_pv_guest,
+ CTXT_hvm_guest,
+};
+
+static void read_registers(struct extra_state *state)
+{
+ state->cr0 = read_cr0();
+ state->cr2 = read_cr2();
+ state->cr3 = read_cr3();
+ state->cr4 = read_cr4();
+
+ state->fsb = read_fs_base();
+ state->gsb = read_gs_base();
+ state->gss = read_gs_shadow();
+
+ asm ( "mov %%ds, %0" : "=m" (state->ds) );
+ asm ( "mov %%es, %0" : "=m" (state->es) );
+ asm ( "mov %%fs, %0" : "=m" (state->fs) );
+ asm ( "mov %%gs, %0" : "=m" (state->gs) );
+}
+
+static void get_hvm_registers(struct vcpu *v, struct cpu_user_regs *regs,
+ struct extra_state *state)
+{
+ struct segment_register sreg;
+
+ state->cr0 = v->arch.hvm.guest_cr[0];
+ state->cr2 = v->arch.hvm.guest_cr[2];
+ state->cr3 = v->arch.hvm.guest_cr[3];
+ state->cr4 = v->arch.hvm.guest_cr[4];
+
+ hvm_get_segment_register(v, x86_seg_cs, &sreg);
+ regs->cs = sreg.sel;
+
+ hvm_get_segment_register(v, x86_seg_ds, &sreg);
+ state->ds = sreg.sel;
+
+ hvm_get_segment_register(v, x86_seg_es, &sreg);
+ state->es = sreg.sel;
+
+ hvm_get_segment_register(v, x86_seg_fs, &sreg);
+ state->fs = sreg.sel;
+ state->fsb = sreg.base;
+
+ hvm_get_segment_register(v, x86_seg_gs, &sreg);
+ state->gs = sreg.sel;
+ state->gsb = sreg.base;
+
+ hvm_get_segment_register(v, x86_seg_ss, &sreg);
+ regs->ss = sreg.sel;
+
+ state->gss = hvm_get_reg(v, MSR_SHADOW_GS_BASE);
+}
+
+static void _show_registers(
+ const struct cpu_user_regs *regs, const struct extra_state *state,
+ enum context context, const struct vcpu *v)
+{
+ static const char *const context_names[] = {
+ [CTXT_hypervisor] = "hypervisor",
+ [CTXT_pv_guest] = "pv guest",
+ [CTXT_hvm_guest] = "hvm guest",
+ };
+
+ printk("RIP: %04x:[<%016lx>]", regs->cs, regs->rip);
+ if ( context == CTXT_hypervisor )
+ printk(" %pS", _p(regs->rip));
+ printk("\nRFLAGS: %016lx ", regs->rflags);
+ if ( (context == CTXT_pv_guest) && v && v->vcpu_info_area.map )
+ printk("EM: %d ", !!vcpu_info(v, evtchn_upcall_mask));
+ printk("CONTEXT: %s", context_names[context]);
+ if ( v && !is_idle_vcpu(v) )
+ printk(" (%pv)", v);
+
+ printk("\nrax: %016lx rbx: %016lx rcx: %016lx\n",
+ regs->rax, regs->rbx, regs->rcx);
+ printk("rdx: %016lx rsi: %016lx rdi: %016lx\n",
+ regs->rdx, regs->rsi, regs->rdi);
+ printk("rbp: %016lx rsp: %016lx r8: %016lx\n",
+ regs->rbp, regs->rsp, regs->r8);
+ printk("r9: %016lx r10: %016lx r11: %016lx\n",
+ regs->r9, regs->r10, regs->r11);
+ printk("r12: %016lx r13: %016lx r14: %016lx\n",
+ regs->r12, regs->r13, regs->r14);
+ printk("r15: %016lx cr0: %016lx cr4: %016lx\n",
+ regs->r15, state->cr0, state->cr4);
+ printk("cr3: %016lx cr2: %016lx\n", state->cr3, state->cr2);
+ printk("fsb: %016lx gsb: %016lx gss: %016lx\n",
+ state->fsb, state->gsb, state->gss);
+ printk("ds: %04x es: %04x fs: %04x gs: %04x ss: %04x cs: %04x\n",
+ state->ds, state->es, state->fs,
+ state->gs, regs->ss, regs->cs);
+}
+
+void show_registers(const struct cpu_user_regs *regs)
+{
+ struct cpu_user_regs fault_regs = *regs;
+ struct extra_state fault_state;
+ enum context context;
+ struct vcpu *v = system_state >= SYS_STATE_smp_boot ? current : NULL;
+
+ if ( guest_mode(regs) && is_hvm_vcpu(v) )
+ {
+ get_hvm_registers(v, &fault_regs, &fault_state);
+ context = CTXT_hvm_guest;
+ }
+ else
+ {
+ read_registers(&fault_state);
+
+ if ( guest_mode(regs) )
+ {
+ context = CTXT_pv_guest;
+ fault_state.cr2 = arch_get_cr2(v);
+ }
+ else
+ {
+ context = CTXT_hypervisor;
+ }
+ }
+
+ print_xen_info();
+ printk("CPU: %d\n", smp_processor_id());
+ _show_registers(&fault_regs, &fault_state, context, v);
+
+ if ( ler_msr && !guest_mode(regs) )
+ {
+ uint64_t from, to;
+
+ rdmsrl(ler_msr, from);
+ rdmsrl(ler_msr + 1, to);
+
+ /* Upper bits may store metadata. Re-canonicalise for printing. */
+ printk("ler: from %016"PRIx64" [%ps]\n",
+ from, _p(canonicalise_addr(from)));
+ printk(" to %016"PRIx64" [%ps]\n",
+ to, _p(canonicalise_addr(to)));
+ }
+}
+
+void vcpu_show_registers(struct vcpu *v)
+{
+ const struct cpu_user_regs *regs = &v->arch.user_regs;
+ struct cpu_user_regs aux_regs;
+ struct extra_state state;
+ enum context context;
+
+ if ( is_hvm_vcpu(v) )
+ {
+ aux_regs = *regs;
+ get_hvm_registers(v, &aux_regs, &state);
+ regs = &aux_regs;
+ context = CTXT_hvm_guest;
+ }
+ else
+ {
+ bool kernel = guest_kernel_mode(v, regs);
+ unsigned long gsb, gss;
+
+ state.cr0 = v->arch.pv.ctrlreg[0];
+ state.cr2 = arch_get_cr2(v);
+ state.cr3 = pagetable_get_paddr(kernel
+ ? v->arch.guest_table
+ : v->arch.guest_table_user);
+ state.cr4 = v->arch.pv.ctrlreg[4];
+
+ gsb = v->arch.pv.gs_base_user;
+ gss = v->arch.pv.gs_base_kernel;
+ if ( kernel )
+ SWAP(gsb, gss);
+
+ state.fsb = v->arch.pv.fs_base;
+ state.gsb = gsb;
+ state.gss = gss;
+
+ state.ds = v->arch.pv.ds;
+ state.es = v->arch.pv.es;
+ state.fs = v->arch.pv.fs;
+ state.gs = v->arch.pv.gs;
+
+ context = CTXT_pv_guest;
+ }
+
+ _show_registers(regs, &state, context, v);
+}
+
+void show_page_walk(unsigned long addr)
+{
+ unsigned long pfn, mfn = read_cr3() >> PAGE_SHIFT;
+ l4_pgentry_t l4e, *l4t;
+ l3_pgentry_t l3e, *l3t;
+ l2_pgentry_t l2e, *l2t;
+ l1_pgentry_t l1e, *l1t;
+
+ printk("Pagetable walk from %016lx:\n", addr);
+ if ( !is_canonical_address(addr) )
+ return;
+
+ l4t = map_domain_page(_mfn(mfn));
+ l4e = l4t[l4_table_offset(addr)];
+ unmap_domain_page(l4t);
+ mfn = l4e_get_pfn(l4e);
+ pfn = mfn_valid(_mfn(mfn)) && machine_to_phys_mapping_valid ?
+ get_gpfn_from_mfn(mfn) : INVALID_M2P_ENTRY;
+ printk(" L4[0x%03lx] = %"PRIpte" %016lx\n",
+ l4_table_offset(addr), l4e_get_intpte(l4e), pfn);
+ if ( !(l4e_get_flags(l4e) & _PAGE_PRESENT) ||
+ !mfn_valid(_mfn(mfn)) )
+ return;
+
+ l3t = map_domain_page(_mfn(mfn));
+ l3e = l3t[l3_table_offset(addr)];
+ unmap_domain_page(l3t);
+ mfn = l3e_get_pfn(l3e);
+ pfn = mfn_valid(_mfn(mfn)) && machine_to_phys_mapping_valid ?
+ get_gpfn_from_mfn(mfn) : INVALID_M2P_ENTRY;
+ printk(" L3[0x%03lx] = %"PRIpte" %016lx%s\n",
+ l3_table_offset(addr), l3e_get_intpte(l3e), pfn,
+ (l3e_get_flags(l3e) & _PAGE_PSE) ? " (PSE)" : "");
+ if ( !(l3e_get_flags(l3e) & _PAGE_PRESENT) ||
+ (l3e_get_flags(l3e) & _PAGE_PSE) ||
+ !mfn_valid(_mfn(mfn)) )
+ return;
+
+ l2t = map_domain_page(_mfn(mfn));
+ l2e = l2t[l2_table_offset(addr)];
+ unmap_domain_page(l2t);
+ mfn = l2e_get_pfn(l2e);
+ pfn = mfn_valid(_mfn(mfn)) && machine_to_phys_mapping_valid ?
+ get_gpfn_from_mfn(mfn) : INVALID_M2P_ENTRY;
+ printk(" L2[0x%03lx] = %"PRIpte" %016lx%s\n",
+ l2_table_offset(addr), l2e_get_intpte(l2e), pfn,
+ (l2e_get_flags(l2e) & _PAGE_PSE) ? " (PSE)" : "");
+ if ( !(l2e_get_flags(l2e) & _PAGE_PRESENT) ||
+ (l2e_get_flags(l2e) & _PAGE_PSE) ||
+ !mfn_valid(_mfn(mfn)) )
+ return;
+
+ l1t = map_domain_page(_mfn(mfn));
+ l1e = l1t[l1_table_offset(addr)];
+ unmap_domain_page(l1t);
+ mfn = l1e_get_pfn(l1e);
+ pfn = mfn_valid(_mfn(mfn)) && machine_to_phys_mapping_valid ?
+ get_gpfn_from_mfn(mfn) : INVALID_M2P_ENTRY;
+ printk(" L1[0x%03lx] = %"PRIpte" %016lx\n",
+ l1_table_offset(addr), l1e_get_intpte(l1e), pfn);
+}
+
void show_code(const struct cpu_user_regs *regs)
{
unsigned char insns_before[8] = {}, insns_after[16] = {};
@@ -762,6 +1029,30 @@ const char *vector_name(unsigned int vec)
return (vec < ARRAY_SIZE(names) && names[vec][0]) ? names[vec] : "???";
}
+void asmlinkage do_double_fault(struct cpu_user_regs *regs)
+{
+ unsigned int cpu;
+ struct extra_state state;
+
+ console_force_unlock();
+
+ asm ( "lsll %[sel], %[limit]" : [limit] "=r" (cpu)
+ : [sel] "r" (PER_CPU_SELECTOR) );
+
+ /* Find information saved during fault and dump it to the console. */
+ printk("*** DOUBLE FAULT ***\n");
+ print_xen_info();
+
+ read_registers(&state);
+
+ printk("CPU: %d\n", cpu);
+ _show_registers(regs, &state, CTXT_hypervisor, NULL);
+ show_code(regs);
+ show_stack_overflow(cpu, regs);
+
+ panic("DOUBLE FAULT -- system shutdown\n");
+}
+
/*
* This is called for faults at very unexpected times (e.g., when interrupts
* are disabled). In such situations we can't do much that is safe. We try to
diff --git a/xen/arch/x86/x86_64/Makefile b/xen/arch/x86/x86_64/Makefile
index 472b2bab523d..f20763088740 100644
--- a/xen/arch/x86/x86_64/Makefile
+++ b/xen/arch/x86/x86_64/Makefile
@@ -1,7 +1,6 @@
obj-$(CONFIG_PV32) += compat/
obj-bin-y += entry.o
-obj-y += traps.o
obj-$(CONFIG_KEXEC) += machine_kexec.o
obj-y += pci.o
obj-y += acpi_mmcfg.o
diff --git a/xen/arch/x86/x86_64/traps.c b/xen/arch/x86/x86_64/traps.c
deleted file mode 100644
index 81e64466e47e..000000000000
--- a/xen/arch/x86/x86_64/traps.c
+++ /dev/null
@@ -1,322 +0,0 @@
-#include <xen/console.h>
-#include <xen/errno.h>
-#include <xen/guest_access.h>
-#include <xen/hypercall.h>
-#include <xen/init.h>
-#include <xen/irq.h>
-#include <xen/lib.h>
-#include <xen/mm.h>
-#include <xen/sched.h>
-#include <xen/shutdown.h>
-#include <xen/symbols.h>
-#include <xen/version.h>
-#include <xen/watchdog.h>
-
-#include <asm/current.h>
-#include <asm/endbr.h>
-#include <asm/event.h>
-#include <asm/flushtlb.h>
-#include <asm/hvm/hvm.h>
-#include <asm/msr.h>
-#include <asm/nmi.h>
-#include <asm/page.h>
-#include <asm/shared.h>
-#include <asm/stubs.h>
-#include <asm/traps.h>
-
-struct extra_state
-{
- unsigned long cr0, cr2, cr3, cr4;
- unsigned long fsb, gsb, gss;
- uint16_t ds, es, fs, gs;
-};
-
-static void print_xen_info(void)
-{
- char taint_str[TAINT_STRING_MAX_LEN];
-
- printk("----[ Xen-%d.%d%s x86_64 %s %s ]----\n",
- xen_major_version(), xen_minor_version(), xen_extra_version(),
- xen_build_info(), print_tainted(taint_str));
-}
-
-enum context { CTXT_hypervisor, CTXT_pv_guest, CTXT_hvm_guest };
-
-static void read_registers(struct extra_state *state)
-{
- state->cr0 = read_cr0();
- state->cr2 = read_cr2();
- state->cr3 = read_cr3();
- state->cr4 = read_cr4();
-
- state->fsb = read_fs_base();
- state->gsb = read_gs_base();
- state->gss = read_gs_shadow();
-
- asm ( "mov %%ds, %0" : "=m" (state->ds) );
- asm ( "mov %%es, %0" : "=m" (state->es) );
- asm ( "mov %%fs, %0" : "=m" (state->fs) );
- asm ( "mov %%gs, %0" : "=m" (state->gs) );
-}
-
-static void get_hvm_registers(struct vcpu *v, struct cpu_user_regs *regs,
- struct extra_state *state)
-{
- struct segment_register sreg;
-
- state->cr0 = v->arch.hvm.guest_cr[0];
- state->cr2 = v->arch.hvm.guest_cr[2];
- state->cr3 = v->arch.hvm.guest_cr[3];
- state->cr4 = v->arch.hvm.guest_cr[4];
-
- hvm_get_segment_register(v, x86_seg_cs, &sreg);
- regs->cs = sreg.sel;
-
- hvm_get_segment_register(v, x86_seg_ds, &sreg);
- state->ds = sreg.sel;
-
- hvm_get_segment_register(v, x86_seg_es, &sreg);
- state->es = sreg.sel;
-
- hvm_get_segment_register(v, x86_seg_fs, &sreg);
- state->fs = sreg.sel;
- state->fsb = sreg.base;
-
- hvm_get_segment_register(v, x86_seg_gs, &sreg);
- state->gs = sreg.sel;
- state->gsb = sreg.base;
-
- hvm_get_segment_register(v, x86_seg_ss, &sreg);
- regs->ss = sreg.sel;
-
- state->gss = hvm_get_reg(v, MSR_SHADOW_GS_BASE);
-}
-
-static void _show_registers(
- const struct cpu_user_regs *regs, const struct extra_state *state,
- enum context context, const struct vcpu *v)
-{
- static const char *const context_names[] = {
- [CTXT_hypervisor] = "hypervisor",
- [CTXT_pv_guest] = "pv guest",
- [CTXT_hvm_guest] = "hvm guest"
- };
-
- printk("RIP: %04x:[<%016lx>]", regs->cs, regs->rip);
- if ( context == CTXT_hypervisor )
- printk(" %pS", _p(regs->rip));
- printk("\nRFLAGS: %016lx ", regs->rflags);
- if ( (context == CTXT_pv_guest) && v && v->vcpu_info_area.map )
- printk("EM: %d ", !!vcpu_info(v, evtchn_upcall_mask));
- printk("CONTEXT: %s", context_names[context]);
- if ( v && !is_idle_vcpu(v) )
- printk(" (%pv)", v);
-
- printk("\nrax: %016lx rbx: %016lx rcx: %016lx\n",
- regs->rax, regs->rbx, regs->rcx);
- printk("rdx: %016lx rsi: %016lx rdi: %016lx\n",
- regs->rdx, regs->rsi, regs->rdi);
- printk("rbp: %016lx rsp: %016lx r8: %016lx\n",
- regs->rbp, regs->rsp, regs->r8);
- printk("r9: %016lx r10: %016lx r11: %016lx\n",
- regs->r9, regs->r10, regs->r11);
- printk("r12: %016lx r13: %016lx r14: %016lx\n",
- regs->r12, regs->r13, regs->r14);
- printk("r15: %016lx cr0: %016lx cr4: %016lx\n",
- regs->r15, state->cr0, state->cr4);
- printk("cr3: %016lx cr2: %016lx\n", state->cr3, state->cr2);
- printk("fsb: %016lx gsb: %016lx gss: %016lx\n",
- state->fsb, state->gsb, state->gss);
- printk("ds: %04x es: %04x fs: %04x gs: %04x "
- "ss: %04x cs: %04x\n",
- state->ds, state->es, state->fs,
- state->gs, regs->ss, regs->cs);
-}
-
-void show_registers(const struct cpu_user_regs *regs)
-{
- struct cpu_user_regs fault_regs = *regs;
- struct extra_state fault_state;
- enum context context;
- struct vcpu *v = system_state >= SYS_STATE_smp_boot ? current : NULL;
-
- if ( guest_mode(regs) && is_hvm_vcpu(v) )
- {
- get_hvm_registers(v, &fault_regs, &fault_state);
- context = CTXT_hvm_guest;
- }
- else
- {
- read_registers(&fault_state);
-
- if ( guest_mode(regs) )
- {
- context = CTXT_pv_guest;
- fault_state.cr2 = arch_get_cr2(v);
- }
- else
- {
- context = CTXT_hypervisor;
- }
- }
-
- print_xen_info();
- printk("CPU: %d\n", smp_processor_id());
- _show_registers(&fault_regs, &fault_state, context, v);
-
- if ( ler_msr && !guest_mode(regs) )
- {
- u64 from, to;
-
- rdmsrl(ler_msr, from);
- rdmsrl(ler_msr + 1, to);
-
- /* Upper bits may store metadata. Re-canonicalise for printing. */
- printk("ler: from %016"PRIx64" [%ps]\n",
- from, _p(canonicalise_addr(from)));
- printk(" to %016"PRIx64" [%ps]\n",
- to, _p(canonicalise_addr(to)));
- }
-}
-
-void vcpu_show_registers(struct vcpu *v)
-{
- const struct cpu_user_regs *regs = &v->arch.user_regs;
- struct cpu_user_regs aux_regs;
- struct extra_state state;
- enum context context;
-
- if ( is_hvm_vcpu(v) )
- {
- aux_regs = *regs;
- get_hvm_registers(v, &aux_regs, &state);
- regs = &aux_regs;
- context = CTXT_hvm_guest;
- }
- else
- {
- bool kernel = guest_kernel_mode(v, regs);
- unsigned long gsb, gss;
-
- state.cr0 = v->arch.pv.ctrlreg[0];
- state.cr2 = arch_get_cr2(v);
- state.cr3 = pagetable_get_paddr(kernel
- ? v->arch.guest_table
- : v->arch.guest_table_user);
- state.cr4 = v->arch.pv.ctrlreg[4];
-
- gsb = v->arch.pv.gs_base_user;
- gss = v->arch.pv.gs_base_kernel;
- if ( kernel )
- SWAP(gsb, gss);
-
- state.fsb = v->arch.pv.fs_base;
- state.gsb = gsb;
- state.gss = gss;
-
- state.ds = v->arch.pv.ds;
- state.es = v->arch.pv.es;
- state.fs = v->arch.pv.fs;
- state.gs = v->arch.pv.gs;
-
- context = CTXT_pv_guest;
- }
-
- _show_registers(regs, &state, context, v);
-}
-
-void show_page_walk(unsigned long addr)
-{
- unsigned long pfn, mfn = read_cr3() >> PAGE_SHIFT;
- l4_pgentry_t l4e, *l4t;
- l3_pgentry_t l3e, *l3t;
- l2_pgentry_t l2e, *l2t;
- l1_pgentry_t l1e, *l1t;
-
- printk("Pagetable walk from %016lx:\n", addr);
- if ( !is_canonical_address(addr) )
- return;
-
- l4t = map_domain_page(_mfn(mfn));
- l4e = l4t[l4_table_offset(addr)];
- unmap_domain_page(l4t);
- mfn = l4e_get_pfn(l4e);
- pfn = mfn_valid(_mfn(mfn)) && machine_to_phys_mapping_valid ?
- get_gpfn_from_mfn(mfn) : INVALID_M2P_ENTRY;
- printk(" L4[0x%03lx] = %"PRIpte" %016lx\n",
- l4_table_offset(addr), l4e_get_intpte(l4e), pfn);
- if ( !(l4e_get_flags(l4e) & _PAGE_PRESENT) ||
- !mfn_valid(_mfn(mfn)) )
- return;
-
- l3t = map_domain_page(_mfn(mfn));
- l3e = l3t[l3_table_offset(addr)];
- unmap_domain_page(l3t);
- mfn = l3e_get_pfn(l3e);
- pfn = mfn_valid(_mfn(mfn)) && machine_to_phys_mapping_valid ?
- get_gpfn_from_mfn(mfn) : INVALID_M2P_ENTRY;
- printk(" L3[0x%03lx] = %"PRIpte" %016lx%s\n",
- l3_table_offset(addr), l3e_get_intpte(l3e), pfn,
- (l3e_get_flags(l3e) & _PAGE_PSE) ? " (PSE)" : "");
- if ( !(l3e_get_flags(l3e) & _PAGE_PRESENT) ||
- (l3e_get_flags(l3e) & _PAGE_PSE) ||
- !mfn_valid(_mfn(mfn)) )
- return;
-
- l2t = map_domain_page(_mfn(mfn));
- l2e = l2t[l2_table_offset(addr)];
- unmap_domain_page(l2t);
- mfn = l2e_get_pfn(l2e);
- pfn = mfn_valid(_mfn(mfn)) && machine_to_phys_mapping_valid ?
- get_gpfn_from_mfn(mfn) : INVALID_M2P_ENTRY;
- printk(" L2[0x%03lx] = %"PRIpte" %016lx%s\n",
- l2_table_offset(addr), l2e_get_intpte(l2e), pfn,
- (l2e_get_flags(l2e) & _PAGE_PSE) ? " (PSE)" : "");
- if ( !(l2e_get_flags(l2e) & _PAGE_PRESENT) ||
- (l2e_get_flags(l2e) & _PAGE_PSE) ||
- !mfn_valid(_mfn(mfn)) )
- return;
-
- l1t = map_domain_page(_mfn(mfn));
- l1e = l1t[l1_table_offset(addr)];
- unmap_domain_page(l1t);
- mfn = l1e_get_pfn(l1e);
- pfn = mfn_valid(_mfn(mfn)) && machine_to_phys_mapping_valid ?
- get_gpfn_from_mfn(mfn) : INVALID_M2P_ENTRY;
- printk(" L1[0x%03lx] = %"PRIpte" %016lx\n",
- l1_table_offset(addr), l1e_get_intpte(l1e), pfn);
-}
-
-void asmlinkage noreturn do_double_fault(struct cpu_user_regs *regs)
-{
- unsigned int cpu;
- struct extra_state state;
-
- console_force_unlock();
-
- asm ( "lsll %[sel], %[limit]" : [limit] "=r" (cpu)
- : [sel] "r" (PER_CPU_SELECTOR) );
-
- /* Find information saved during fault and dump it to the console. */
- printk("*** DOUBLE FAULT ***\n");
- print_xen_info();
-
- read_registers(&state);
-
- printk("CPU: %d\n", cpu);
- _show_registers(regs, &state, CTXT_hypervisor, NULL);
- show_code(regs);
- show_stack_overflow(cpu, regs);
-
- panic("DOUBLE FAULT -- system shutdown\n");
-}
-
-/*
- * Local variables:
- * mode: C
- * c-file-style: "BSD"
- * c-basic-offset: 4
- * tab-width: 4
- * indent-tabs-mode: nil
- * End:
- */
--
2.39.5
* [PATCH 12/22] x86/traps: Unexport show_code() and show_stack_overflow()
2025-08-08 20:22 [PATCH 00/22] x86: FRED support, part 1 (stacks and exceptions) Andrew Cooper
` (10 preceding siblings ...)
2025-08-08 20:23 ` [PATCH 11/22] x86/traps: Fold x86_64/traps.c into traps.c Andrew Cooper
@ 2025-08-08 20:23 ` Andrew Cooper
2025-08-12 9:54 ` Jan Beulich
2025-08-08 20:23 ` [PATCH 13/22] x86: FRED enumerations Andrew Cooper
` (10 subsequent siblings)
22 siblings, 1 reply; 120+ messages in thread
From: Andrew Cooper @ 2025-08-08 20:23 UTC (permalink / raw)
To: Xen-devel; +Cc: Andrew Cooper, Jan Beulich, Roger Pau Monné
These can become static now the two traps.c have been merged.
No functional change.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
CC: Jan Beulich <JBeulich@suse.com>
CC: Roger Pau Monné <roger.pau@citrix.com>
---
xen/arch/x86/include/asm/processor.h | 2 --
xen/arch/x86/traps.c | 4 ++--
2 files changed, 2 insertions(+), 4 deletions(-)
diff --git a/xen/arch/x86/include/asm/processor.h b/xen/arch/x86/include/asm/processor.h
index 2799d59e385a..1342241742ac 100644
--- a/xen/arch/x86/include/asm/processor.h
+++ b/xen/arch/x86/include/asm/processor.h
@@ -333,8 +333,6 @@ extern void write_ptbase(struct vcpu *v);
/* PAUSE (encoding: REP NOP) is a good thing to insert into busy-wait loops. */
#define cpu_relax() asm volatile ( "pause" ::: "memory" )
-void show_code(const struct cpu_user_regs *regs);
-void show_stack_overflow(unsigned int cpu, const struct cpu_user_regs *regs);
void show_registers(const struct cpu_user_regs *regs);
#define dump_execution_state() run_in_exception_handler(show_execution_state)
void show_page_walk(unsigned long addr);
diff --git a/xen/arch/x86/traps.c b/xen/arch/x86/traps.c
index ab8ff36acfe6..270b93ed623e 100644
--- a/xen/arch/x86/traps.c
+++ b/xen/arch/x86/traps.c
@@ -354,7 +354,7 @@ void show_page_walk(unsigned long addr)
l1_table_offset(addr), l1e_get_intpte(l1e), pfn);
}
-void show_code(const struct cpu_user_regs *regs)
+static void show_code(const struct cpu_user_regs *regs)
{
unsigned char insns_before[8] = {}, insns_after[16] = {};
unsigned int i, tmp, missing_before, missing_after;
@@ -838,7 +838,7 @@ static void show_stack(const struct cpu_user_regs *regs)
show_trace(regs);
}
-void show_stack_overflow(unsigned int cpu, const struct cpu_user_regs *regs)
+static void show_stack_overflow(unsigned int cpu, const struct cpu_user_regs *regs)
{
unsigned long esp = regs->rsp;
unsigned long curr_stack_base = esp & ~(STACK_SIZE - 1);
--
2.39.5
* [PATCH 13/22] x86: FRED enumerations
2025-08-08 20:22 [PATCH 00/22] x86: FRED support, part 1 (stacks and exceptions) Andrew Cooper
` (11 preceding siblings ...)
2025-08-08 20:23 ` [PATCH 12/22] x86/traps: Unexport show_code() and show_stack_overflow() Andrew Cooper
@ 2025-08-08 20:23 ` Andrew Cooper
2025-08-13 12:28 ` Andrew Cooper
` (2 more replies)
2025-08-08 20:23 ` [PATCH 14/22] x86/traps: Extend struct cpu_user_regs/cpu_info with FRED fields Andrew Cooper
` (9 subsequent siblings)
22 siblings, 3 replies; 120+ messages in thread
From: Andrew Cooper @ 2025-08-08 20:23 UTC (permalink / raw)
To: Xen-devel; +Cc: Andrew Cooper, Jan Beulich, Roger Pau Monné
Of note, CR4.FRED is bit 32 and cannot be enabled outside of 64bit mode.
Most supported toolchains don't understand the FRED instructions yet. ERETU
and ERETS are easy to wrap (they are encoded as REPZ/REPNE CLAC), while LKGS is
more complicated and deferred for now.
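For illustration only (not part of the patch), the two points above can be
sketched in plain C. The byte sequences match the macros added to
asm-defns.h below; is_prefixed_clac() and CR4_FRED are hypothetical names
used just for this sketch.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* CLAC is 0f 01 ca.  The FRED returns reuse those bytes behind a prefix. */
static const uint8_t clac_bytes[]  = { 0x0f, 0x01, 0xca };
static const uint8_t eretu_bytes[] = { 0xf3, 0x0f, 0x01, 0xca }; /* REPZ  CLAC */
static const uint8_t erets_bytes[] = { 0xf2, 0x0f, 0x01, 0xca }; /* REPNE CLAC */

static_assert(sizeof(eretu_bytes) == sizeof(clac_bytes) + 1,
              "one prefix byte in front of CLAC");

/* Hypothetical helper: does insn[] consist of 'prefix' followed by CLAC? */
static int is_prefixed_clac(const uint8_t insn[4], uint8_t prefix)
{
    return insn[0] == prefix &&
           memcmp(insn + 1, clac_bytes, sizeof(clac_bytes)) == 0;
}

/* CR4.FRED is bit 32, so the constant only fits in a 64-bit type. */
#define CR4_FRED (UINT64_C(1) << 32)
```

The 64-bit constant is why any code path that handles CR4 through a 32-bit
register or variable silently loses the FRED bit.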
I have intentionally named the FRED MSRs differently to the spec. In the
spec, the stack pointer names alias the TSS fields of the same name, despite
very different semantics.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
CC: Jan Beulich <JBeulich@suse.com>
CC: Roger Pau Monné <roger.pau@citrix.com>
---
xen/arch/x86/Kconfig | 4 ++++
xen/arch/x86/include/asm/asm-defns.h | 9 +++++++++
xen/arch/x86/include/asm/cpufeature.h | 3 +++
xen/arch/x86/include/asm/msr-index.h | 11 +++++++++++
xen/arch/x86/include/asm/x86-defns.h | 1 +
xen/include/public/arch-x86/cpufeatureset.h | 3 +++
6 files changed, 31 insertions(+)
diff --git a/xen/arch/x86/Kconfig b/xen/arch/x86/Kconfig
index a45ce106e210..90cbad13a7c7 100644
--- a/xen/arch/x86/Kconfig
+++ b/xen/arch/x86/Kconfig
@@ -57,6 +57,10 @@ config HAS_CC_CET_IBT
# Retpoline check to work around https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93654
def_bool $(cc-option,-fcf-protection=branch -mmanual-endbr -mindirect-branch=thunk-extern) && $(as-instr,endbr64)
+config HAS_AS_FRED
+ # binutils >= 2.41 or LLVM >= 19
+ def_bool $(as-instr,eretu;lkgs %ax)
+
menu "Architecture Features"
source "arch/x86/Kconfig.cpu"
diff --git a/xen/arch/x86/include/asm/asm-defns.h b/xen/arch/x86/include/asm/asm-defns.h
index 61a5faf90446..2e5200b94b82 100644
--- a/xen/arch/x86/include/asm/asm-defns.h
+++ b/xen/arch/x86/include/asm/asm-defns.h
@@ -4,6 +4,15 @@
.byte 0x0f, 0x01, 0xfc
.endm
+#ifndef CONFIG_HAS_AS_FRED
+.macro eretu
+ .byte 0xf3, 0x0f, 0x01, 0xca
+.endm
+.macro erets
+ .byte 0xf2, 0x0f, 0x01, 0xca
+.endm
+#endif
+
/*
* Call a noreturn function. This could be JMP, but CALL results in a more
* helpful backtrace. BUG is to catch functions which do decide to return...
diff --git a/xen/arch/x86/include/asm/cpufeature.h b/xen/arch/x86/include/asm/cpufeature.h
index 441a7ecc494b..b6cf0c8dfc7c 100644
--- a/xen/arch/x86/include/asm/cpufeature.h
+++ b/xen/arch/x86/include/asm/cpufeature.h
@@ -246,6 +246,9 @@ static inline bool boot_cpu_has(unsigned int feat)
#define cpu_has_avx_vnni boot_cpu_has(X86_FEATURE_AVX_VNNI)
#define cpu_has_avx512_bf16 boot_cpu_has(X86_FEATURE_AVX512_BF16)
#define cpu_has_cmpccxadd boot_cpu_has(X86_FEATURE_CMPCCXADD)
+#define cpu_has_fred boot_cpu_has(X86_FEATURE_FRED)
+#define cpu_has_lkgs boot_cpu_has(X86_FEATURE_LKGS)
+#define cpu_has_nmi_src boot_cpu_has(X86_FEATURE_NMI_SRC)
#define cpu_has_avx_ifma boot_cpu_has(X86_FEATURE_AVX_IFMA)
/* CPUID level 0x80000021.eax */
diff --git a/xen/arch/x86/include/asm/msr-index.h b/xen/arch/x86/include/asm/msr-index.h
index 428d993ee89b..bb48d16f0c6d 100644
--- a/xen/arch/x86/include/asm/msr-index.h
+++ b/xen/arch/x86/include/asm/msr-index.h
@@ -115,6 +115,17 @@
#define MCU_OPT_CTRL_GDS_MIT_DIS (_AC(1, ULL) << 4)
#define MCU_OPT_CTRL_GDS_MIT_LOCK (_AC(1, ULL) << 5)
+#define MSR_FRED_RSP_SL0 0x000001cc
+#define MSR_FRED_RSP_SL1 0x000001cd
+#define MSR_FRED_RSP_SL2 0x000001ce
+#define MSR_FRED_RSP_SL3 0x000001cf
+#define MSR_FRED_STK_LVLS 0x000001d0
+#define MSR_FRED_SSP_SL0 MSR_PL0_SSP
+#define MSR_FRED_SSP_SL1 0x000001d1
+#define MSR_FRED_SSP_SL2 0x000001d2
+#define MSR_FRED_SSP_SL3 0x000001d3
+#define MSR_FRED_CONFIG 0x000001d4
+
#define MSR_RTIT_OUTPUT_BASE 0x00000560
#define MSR_RTIT_OUTPUT_MASK 0x00000561
#define MSR_RTIT_CTL 0x00000570
diff --git a/xen/arch/x86/include/asm/x86-defns.h b/xen/arch/x86/include/asm/x86-defns.h
index 23579c471f4a..513f18b3be4e 100644
--- a/xen/arch/x86/include/asm/x86-defns.h
+++ b/xen/arch/x86/include/asm/x86-defns.h
@@ -75,6 +75,7 @@
#define X86_CR4_PKE 0x00400000 /* enable PKE */
#define X86_CR4_CET 0x00800000 /* Control-flow Enforcement Technology */
#define X86_CR4_PKS 0x01000000 /* Protection Key Supervisor */
+#define X86_CR4_FRED 0x100000000 /* Flexible Return and Event Delivery */
#define X86_CR8_VALID_MASK 0xf
diff --git a/xen/include/public/arch-x86/cpufeatureset.h b/xen/include/public/arch-x86/cpufeatureset.h
index f7312e0b04e7..9cd778586f10 100644
--- a/xen/include/public/arch-x86/cpufeatureset.h
+++ b/xen/include/public/arch-x86/cpufeatureset.h
@@ -310,7 +310,10 @@ XEN_CPUFEATURE(ARCH_PERF_MON, 10*32+8) /* Architectural Perfmon */
XEN_CPUFEATURE(FZRM, 10*32+10) /*A Fast Zero-length REP MOVSB */
XEN_CPUFEATURE(FSRS, 10*32+11) /*A Fast Short REP STOSB */
XEN_CPUFEATURE(FSRCS, 10*32+12) /*A Fast Short REP CMPSB/SCASB */
+XEN_CPUFEATURE(FRED, 10*32+17) /* Flexible Return and Event Delivery */
+XEN_CPUFEATURE(LKGS, 10*32+18) /* Load Kernel GS instruction */
XEN_CPUFEATURE(WRMSRNS, 10*32+19) /*S WRMSR Non-Serialising */
+XEN_CPUFEATURE(NMI_SRC, 10*32+20) /* NMI-Source Reporting */
XEN_CPUFEATURE(AMX_FP16, 10*32+21) /* AMX FP16 instruction */
XEN_CPUFEATURE(AVX_IFMA, 10*32+23) /*A AVX-IFMA Instructions */
XEN_CPUFEATURE(LAM, 10*32+26) /* Linear Address Masking */
--
2.39.5
* [PATCH 14/22] x86/traps: Extend struct cpu_user_regs/cpu_info with FRED fields
2025-08-08 20:22 [PATCH 00/22] x86: FRED support, part 1 (stacks and exceptions) Andrew Cooper
` (12 preceding siblings ...)
2025-08-08 20:23 ` [PATCH 13/22] x86: FRED enumerations Andrew Cooper
@ 2025-08-08 20:23 ` Andrew Cooper
2025-08-14 13:12 ` Jan Beulich
2025-08-08 20:23 ` [PATCH 15/22] x86/traps: Introduce opt_fred Andrew Cooper
` (8 subsequent siblings)
22 siblings, 1 reply; 120+ messages in thread
From: Andrew Cooper @ 2025-08-08 20:23 UTC (permalink / raw)
To: Xen-devel; +Cc: Andrew Cooper, Jan Beulich, Roger Pau Monné
The FRED on-stack format is larger than the IDT format, but is by and large
compatible. FRED reuses space above cs and ss for extra metadata, some of
which is purely informational, and some of which causes additional effects in
ERET{U,S}.
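As a rough sketch of what "space above ss" means in practice, the 64-bit ss
slot can be decoded with shifts and masks. The bit positions below are read
off the fred_ss bitfield layout in this patch; struct fred_ss_view and
decode_fred_ssx() are hypothetical names for illustration and are not Xen
code.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical flat view of the FRED ss slot; not a Xen type. */
struct fred_ss_view {
    uint16_t ss;              /* bits  0-15: selector                  */
    unsigned int sti;         /* bit  16: was blocked-by-STI           */
    unsigned int swint;       /* bit  17: SYSCALL/SYSENTER/INT $N      */
    unsigned int nmi;         /* bit  18: was an NMI                   */
    uint8_t vector;           /* bits 32-39: event vector              */
    unsigned int type;        /* bits 48-51: event type                */
    unsigned int enclave;     /* bit  56: event taken in SGX mode      */
    unsigned int lm;          /* bit  57: was in Long Mode             */
    unsigned int nested;      /* bit  58: exception during delivery    */
    unsigned int insnlen;     /* bits 60-63: instruction length        */
};

static struct fred_ss_view decode_fred_ssx(uint64_t ssx)
{
    struct fred_ss_view v = {
        .ss      = (uint16_t)(ssx & 0xffff),
        .sti     = (unsigned int)((ssx >> 16) & 1),
        .swint   = (unsigned int)((ssx >> 17) & 1),
        .nmi     = (unsigned int)((ssx >> 18) & 1),
        .vector  = (uint8_t)((ssx >> 32) & 0xff),
        .type    = (unsigned int)((ssx >> 48) & 0xf),
        .enclave = (unsigned int)((ssx >> 56) & 1),
        .lm      = (unsigned int)((ssx >> 57) & 1),
        .nested  = (unsigned int)((ssx >> 58) & 1),
        .insnlen = (unsigned int)((ssx >> 60) & 0xf),
    };

    return v;
}
```

Only the low 32 bits influence ERET{U,S}; everything from the vector upwards
is informational, which is why software can copy the vector into
entry_vector and treat the rest of the frame like an IDT one.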
Follow Linux's choice of naming for fred_{c,s}s structures, to make it very
clear at the point of use that it's dependent on FRED.
There is also the event data field and reserved fields, but we cannot include
these in struct cpu_user_regs without reintroducing OoB structure accesses in
the non-FRED case. See commit 6065a05adf15 ("x86/traps: 'Fix' safety of
read_registers() in #DF path") for more details.
Instead, use a new struct fred_info and position it suitably in struct
cpu_info. This boundary will be loaded into MSR_FRED_RSP_SL0, and must be
64-byte aligned.
This does add 16 bytes back into struct cpu_info, undoing the saving we made
by dropping the vm86 data segment selectors.
No functional change.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
CC: Jan Beulich <JBeulich@suse.com>
CC: Roger Pau Monné <roger.pau@citrix.com>
---
xen/arch/x86/include/asm/cpu-user-regs.h | 71 ++++++++++++++++++++++--
xen/arch/x86/include/asm/current.h | 2 +
xen/arch/x86/traps-setup.c | 5 ++
3 files changed, 74 insertions(+), 4 deletions(-)
diff --git a/xen/arch/x86/include/asm/cpu-user-regs.h b/xen/arch/x86/include/asm/cpu-user-regs.h
index d700a3ef3447..06d9cbfbe6ba 100644
--- a/xen/arch/x86/include/asm/cpu-user-regs.h
+++ b/xen/arch/x86/include/asm/cpu-user-regs.h
@@ -30,6 +30,10 @@ struct cpu_user_regs
/*
* During IDT delivery for exceptions with an error code, hardware pushes
* to this point. Entry_vector is filled in by software.
+ *
+ * During FRED delivery, hardware always pushes to this point. Software
+ * copies fred_ss.vector into entry_vector so most interrupt/exception
+ * handling can be FRED-agnostic.
*/
uint32_t error_code;
@@ -42,17 +46,76 @@ struct cpu_user_regs
*/
union { uint64_t rip; uint32_t eip; uint16_t ip; };
- uint16_t cs, _pad0[1];
- uint8_t saved_upcall_mask; /* PV (v)rflags.IF == !saved_upcall_mask */
- uint8_t _pad1[3];
+ union {
+ struct {
+ uint16_t cs;
+ unsigned long :16;
+ uint8_t saved_upcall_mask; /* PV (v)rflags.IF == !saved_upcall_mask */
+ };
+ unsigned long csx;
+ struct {
+ /*
+ * Bits 0 thru 31 control ERET{U,S} behaviour, and are state of the
+ * interrupted context.
+ */
+ uint16_t cs;
+ unsigned int sl:2; /* Stack Level */
+ bool wfe:1; /* Wait-for-ENDBRANCH state */
+ } fred_cs;
+ };
union { uint64_t rflags; uint32_t eflags; uint16_t flags; };
union { uint64_t rsp; uint32_t esp; uint16_t sp; uint8_t spl; };
- uint16_t ss, _pad2[3];
+ union {
+ uint16_t ss;
+ unsigned long ssx;
+ struct {
+ /*
+ * Bits 0 thru 31 control ERET{U,S} behaviour, and are state about
+ * the event which occurred.
+ */
+ uint16_t ss;
+ bool sti:1; /* Was blocked-by-STI, and not cancelled */
+ bool swint:1; /* Was a SYSCALL/SYSENTER/INT $N */
+ bool nmi:1; /* Was an NMI. */
+ unsigned long :13;
+
+ /*
+ * Bits 32 thru 63 are ignored by ERET{U,S} and are informative
+ * only.
+ */
+ uint8_t vector;
+ unsigned long :8;
+ unsigned int type:4; /* X86_ET_* */
+ unsigned long :4;
+ bool enclave:1; /* Event taken in SGX mode */
+ bool lm:1; /* Was in Long Mode */
+ bool nested:1; /* Exception during event delivery, clear for #DF */
+ unsigned long :1;
+ unsigned int insnlen:4; /* .type >= SW_INT */
+ } fred_ss;
+ };
/*
* For IDT delivery, tss->rsp0 points to this boundary as embedded within
* struct cpu_info. It must be 16-byte aligned.
*/
};
+struct fred_info
+{
+ /*
+ * Event Data. For:
+ * #DB: PENDING_DBG (%dr6 with positive polarity)
+ * NMI: NMI-Source Bitmap (on capable hardware)
+ * #PF: %cr2
+ * #NM: MSR_XFD_ERR (only XFD-induced #NMs)
+ */
+ uint64_t edata;
+ uint64_t _rsvd;
+
+ /*
+ * For FRED delivery, MSR_FRED_RSP_SL0 points to this boundary as embedded
+ * within struct cpu_info. It must be 64-byte aligned.
+ */
+};
#endif /* X86_CPU_USER_REGS_H */
diff --git a/xen/arch/x86/include/asm/current.h b/xen/arch/x86/include/asm/current.h
index a7c9473428b2..962eb76a82b3 100644
--- a/xen/arch/x86/include/asm/current.h
+++ b/xen/arch/x86/include/asm/current.h
@@ -38,6 +38,8 @@ struct vcpu;
struct cpu_info {
struct cpu_user_regs guest_cpu_user_regs;
+ struct fred_info _fred; /* Only used when FRED is active. */
+
unsigned int processor_id;
unsigned int verw_sel;
struct vcpu *current_vcpu;
diff --git a/xen/arch/x86/traps-setup.c b/xen/arch/x86/traps-setup.c
index fbae7072c292..37202c17fcea 100644
--- a/xen/arch/x86/traps-setup.c
+++ b/xen/arch/x86/traps-setup.c
@@ -360,7 +360,12 @@ static void __init __maybe_unused build_assertions(void)
*
* tss->rsp0, pointing at the end of cpu_info.guest_cpu_user_regs, must be
* 16-byte aligned.
+ *
+ * MSR_FRED_RSP_SL0, pointing to the end of cpu_info._fred must be 64-byte
+ * aligned.
*/
BUILD_BUG_ON((sizeof(struct cpu_info) -
endof_field(struct cpu_info, guest_cpu_user_regs)) & 15);
+ BUILD_BUG_ON((sizeof(struct cpu_info) -
+ endof_field(struct cpu_info, _fred)) & 63);
}
--
2.39.5
* [PATCH 15/22] x86/traps: Introduce opt_fred
2025-08-08 20:22 [PATCH 00/22] x86: FRED support, part 1 (stacks and exceptions) Andrew Cooper
` (13 preceding siblings ...)
2025-08-08 20:23 ` [PATCH 14/22] x86/traps: Extend struct cpu_user_regs/cpu_info with FRED fields Andrew Cooper
@ 2025-08-08 20:23 ` Andrew Cooper
2025-08-14 13:30 ` Jan Beulich
2025-08-08 20:23 ` [PATCH 16/22] x86/boot: Adjust CR4 handling around ap_early_traps_init() Andrew Cooper
` (7 subsequent siblings)
22 siblings, 1 reply; 120+ messages in thread
From: Andrew Cooper @ 2025-08-08 20:23 UTC (permalink / raw)
To: Xen-devel; +Cc: Andrew Cooper
... disabled by default. There is a lot of work before FRED can be enabled by
default.
One part of FRED, the LKGS (Load Kernel GS) instruction, is enumerated
separately but is mandatory as FRED disallows the SWAPGS instruction.
Therefore, both CPUID bits must be checked.
FRED formally removes the use of Ring1 and Ring2, meaning we cannot run 32bit
PV guests. Therefore, don't enable FRED by default in shim mode. OTOH, if
FRED is active, then PV32 needs disabling like with CET.
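The selection logic described above can be sketched as a pure function,
which may make the interaction of the inputs easier to see. This mirrors the
traps_init() hunk below but is only an illustration; resolve_fred() and its
parameter names are hypothetical, and the -1 "default" case is dormant in
this patch because opt_fred is initialised to 0.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical model of the opt_fred decision:
 *   opt       0 = off, 1 = on, -1 = "use the default"
 *   has_fred  CPUID FRED bit
 *   has_lkgs  CPUID LKGS bit (mandatory, as FRED disallows SWAPGS)
 *   shim      running as the PV shim (needs 32bit PV guests)
 */
static int8_t resolve_fred(int8_t opt, bool has_fred, bool has_lkgs, bool shim)
{
    /* Both CPUID bits are required; otherwise the request is ignored. */
    if ( !has_fred || !has_lkgs )
        return 0;

    /* Default: on, except in shim mode where PV32 must keep working. */
    if ( opt == -1 )
        return !shim;

    /* An explicit user choice is honoured as-is. */
    return opt;
}
```

A non-zero result is also the point at which PV32 has to be turned off, as
the hunk does via opt_pv32.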
No functional change.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
docs/misc/xen-command-line.pandoc | 10 ++++++++++
xen/arch/x86/include/asm/traps.h | 4 ++++
xen/arch/x86/traps-setup.c | 31 +++++++++++++++++++++++++++++++
3 files changed, 45 insertions(+)
diff --git a/docs/misc/xen-command-line.pandoc b/docs/misc/xen-command-line.pandoc
index 6865a61220ca..f293d973a8e8 100644
--- a/docs/misc/xen-command-line.pandoc
+++ b/docs/misc/xen-command-line.pandoc
@@ -1283,6 +1283,16 @@ requirement can be relaxed. This option is particularly useful for nested
virtualization, to allow the L1 hypervisor to use EPT even if the L0 hypervisor
does not provide `VM_ENTRY_LOAD_GUEST_PAT`.
+### fred (x86)
+> `= <bool>`
+
+> Default: `false`
+
+Flexible Return and Event Delivery is an overhaul of interrupt, exception and
+system call handling, fixing many corner cases in the x86 architecture, and
+expected in hardware from 2025. Support in Xen is a work in progress and
+disabled by default.
+
### gnttab
> `= List of [ max-ver:<integer>, transitive=<bool>, transfer=<bool> ]`
diff --git a/xen/arch/x86/include/asm/traps.h b/xen/arch/x86/include/asm/traps.h
index 6ae451d3fc70..73097e957d05 100644
--- a/xen/arch/x86/include/asm/traps.h
+++ b/xen/arch/x86/include/asm/traps.h
@@ -7,6 +7,10 @@
#ifndef ASM_TRAP_H
#define ASM_TRAP_H
+#include <xen/types.h>
+
+extern int8_t opt_fred;
+
void bsp_early_traps_init(void);
void traps_init(void);
void bsp_traps_reinit(void);
diff --git a/xen/arch/x86/traps-setup.c b/xen/arch/x86/traps-setup.c
index 37202c17fcea..3b5e4969a375 100644
--- a/xen/arch/x86/traps-setup.c
+++ b/xen/arch/x86/traps-setup.c
@@ -9,6 +9,8 @@
#include <asm/endbr.h>
#include <asm/idt.h>
#include <asm/msr.h>
+#include <asm/pv/domain.h>
+#include <asm/pv/shim.h>
#include <asm/shstk.h>
#include <asm/stubs.h>
#include <asm/traps.h>
@@ -20,6 +22,9 @@ unsigned int __ro_after_init ler_msr;
static bool __initdata opt_ler;
boolean_param("ler", opt_ler);
+int8_t __ro_after_init opt_fred = 0; /* -1 when supported. */
+boolean_param("fred", opt_fred);
+
void nocall entry_PF(void);
void nocall lstar_enter(void);
void nocall cstar_enter(void);
@@ -305,6 +310,32 @@ void __init traps_init(void)
/* Replace early pagefault with real pagefault handler. */
_update_gate_addr_lower(&bsp_idt[X86_EXC_PF], entry_PF);
+ if ( !cpu_has_fred || !cpu_has_lkgs )
+ {
+ if ( opt_fred )
+ printk(XENLOG_WARNING "FRED not available, ignoring\n");
+ opt_fred = false;
+ }
+
+ if ( opt_fred == -1 )
+ opt_fred = !pv_shim;
+
+ if ( opt_fred )
+ {
+#ifdef CONFIG_PV32
+ if ( opt_pv32 )
+ {
+ opt_pv32 = 0;
+ printk(XENLOG_INFO "Disabling PV32 due to FRED\n");
+ }
+#endif
+ printk("Using FRED event delivery\n");
+ }
+ else
+ {
+ printk("Using IDT event delivery\n");
+ }
+
load_system_tables();
init_ler();
--
2.39.5
* [PATCH 16/22] x86/boot: Adjust CR4 handling around ap_early_traps_init()
2025-08-08 20:22 [PATCH 00/22] x86: FRED support, part 1 (stacks and exceptions) Andrew Cooper
` (14 preceding siblings ...)
2025-08-08 20:23 ` [PATCH 15/22] x86/traps: Introduce opt_fred Andrew Cooper
@ 2025-08-08 20:23 ` Andrew Cooper
2025-08-14 14:47 ` Jan Beulich
2025-08-08 20:23 ` [PATCH 17/22] x86/S3: Switch to using RSTORSSP to recover SSP on resume Andrew Cooper
` (6 subsequent siblings)
22 siblings, 1 reply; 120+ messages in thread
From: Andrew Cooper @ 2025-08-08 20:23 UTC (permalink / raw)
To: Xen-devel; +Cc: Andrew Cooper, Jan Beulich, Roger Pau Monné
ap_early_traps_init() will shortly be setting CR4.FRED. This requires that
cpu_info->cr4 is already set up, and that the enablement of CET doesn't
truncate FRED back out because of its 32bit logic.
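To illustrate the truncation hazard in C terms (a sketch, not the assembly in
this patch): the old pattern loads a fresh 32-bit constant, while the new
pattern ORs CET into the cached 64-bit value. MINIMAL_CR4 is a stand-in for
XEN_MINIMAL_CR4 and its exact value here is an assumption; the function names
are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define CR4_PAE     (UINT64_C(1) << 5)
#define CR4_PGE     (UINT64_C(1) << 7)
#define CR4_CET     (UINT64_C(1) << 23)
#define CR4_FRED    (UINT64_C(1) << 32)
#define MINIMAL_CR4 (CR4_PAE | CR4_PGE) /* assumed stand-in for XEN_MINIMAL_CR4 */

static_assert((uint32_t)CR4_FRED == 0,
              "CR4.FRED does not survive 32-bit truncation");

/* Old pattern: a 32-bit constant, ignoring whatever is already enabled. */
static uint64_t cr4_enable_cet_old(uint64_t cached_cr4)
{
    (void)cached_cr4;
    return (uint32_t)(MINIMAL_CR4 | CR4_CET);
}

/* New pattern: read-modify-write of the cached 64-bit value. */
static uint64_t cr4_enable_cet_new(uint64_t cached_cr4)
{
    return cached_cr4 | CR4_CET;
}
```

With FRED already set in the cached value, only the second form keeps bit 32
set once CET is enabled.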
For __high_start(), defer re-loading XEN_MINIMAL_CR4 until after %rsp is set
up and we can store the result in the cr4 field too.
For s3_resume(), explicitly re-load XEN_MINIMAL_CR4. Later when loading all
features, use the mmu_cr4_features variable which is how the rest of Xen
performs this operation.
No functional change, yet.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
CC: Jan Beulich <JBeulich@suse.com>
CC: Roger Pau Monné <roger.pau@citrix.com>
---
xen/arch/x86/acpi/wakeup_prot.S | 18 ++++++++++++++----
xen/arch/x86/boot/x86_64.S | 15 ++++++++++-----
2 files changed, 24 insertions(+), 9 deletions(-)
diff --git a/xen/arch/x86/acpi/wakeup_prot.S b/xen/arch/x86/acpi/wakeup_prot.S
index 60eca4010042..dfc8c6ac6e8c 100644
--- a/xen/arch/x86/acpi/wakeup_prot.S
+++ b/xen/arch/x86/acpi/wakeup_prot.S
@@ -63,6 +63,14 @@ LABEL(s3_resume)
pushq %rax
lretq
1:
+
+ GET_STACK_END(15)
+
+ /* Enable minimal CR4 features. */
+ mov $XEN_MINIMAL_CR4, %eax
+ mov %rax, STACK_CPUINFO_FIELD(cr4)(%r15)
+ mov %rax, %cr4
+
/* Set up early exceptions and CET before entering C properly. */
call ap_early_traps_init
@@ -77,7 +85,9 @@ LABEL(s3_resume)
wrmsr
/* Enable CR4.CET. */
- mov $XEN_MINIMAL_CR4 | X86_CR4_CET, %ecx
+ mov $X86_CR4_CET, %ecx
+ or STACK_CPUINFO_FIELD(cr4)(%r15), %rcx
+ mov %rcx, STACK_CPUINFO_FIELD(cr4)(%r15)
mov %rcx, %cr4
/* WARNING! call/ret now fatal (iff SHSTK) until SETSSBSY loads SSP */
@@ -120,9 +130,9 @@ LABEL(s3_resume)
.L_cet_done:
#endif /* CONFIG_XEN_SHSTK || CONFIG_XEN_IBT */
- /* Restore CR4 from the cpuinfo block. */
- GET_STACK_END(bx)
- mov STACK_CPUINFO_FIELD(cr4)(%rbx), %rax
+ /* Load all CR4 settings. */
+ mov mmu_cr4_features(%rip), %rax
+ mov %rax, STACK_CPUINFO_FIELD(cr4)(%r15)
mov %rax, %cr4
call mtrr_bp_restore
diff --git a/xen/arch/x86/boot/x86_64.S b/xen/arch/x86/boot/x86_64.S
index 0dfcc8a88a40..631ea2f8236e 100644
--- a/xen/arch/x86/boot/x86_64.S
+++ b/xen/arch/x86/boot/x86_64.S
@@ -11,16 +11,19 @@ ENTRY(__high_start)
mov %ecx,%gs
mov %ecx,%ss
- /* Enable minimal CR4 features. */
- mov $XEN_MINIMAL_CR4,%rcx
- mov %rcx,%cr4
-
mov stack_start(%rip),%rsp
/* Reset EFLAGS (subsumes CLI and CLD). */
pushq $0
popf
+ GET_STACK_END(15)
+
+ /* Enable minimal CR4 features. */
+ mov $XEN_MINIMAL_CR4, %eax
+ mov %rax, STACK_CPUINFO_FIELD(cr4)(%r15)
+ mov %rax, %cr4
+
/* Reload code selector. */
pushq $__HYPERVISOR_CS
leaq 1f(%rip),%rax
@@ -45,7 +48,9 @@ ENTRY(__high_start)
wrmsr
/* Enable CR4.CET. */
- mov $XEN_MINIMAL_CR4 | X86_CR4_CET, %ecx
+ mov $X86_CR4_CET, %ecx
+ or STACK_CPUINFO_FIELD(cr4)(%r15), %rcx
+ mov %rcx, STACK_CPUINFO_FIELD(cr4)(%r15)
mov %rcx, %cr4
/* WARNING! call/ret now fatal (iff SHSTK) until SETSSBSY loads SSP */
--
2.39.5
* [PATCH 17/22] x86/S3: Switch to using RSTORSSP to recover SSP on resume
2025-08-08 20:22 [PATCH 00/22] x86: FRED support, part 1 (stacks and exceptions) Andrew Cooper
` (15 preceding siblings ...)
2025-08-08 20:23 ` [PATCH 16/22] x86/boot: Adjust CR4 handling around ap_early_traps_init() Andrew Cooper
@ 2025-08-08 20:23 ` Andrew Cooper
2025-08-14 14:54 ` Jan Beulich
2025-08-08 20:23 ` [PATCH 18/22] x86/traps: Set MSR_PL0_SSP in load_system_tables() Andrew Cooper
` (5 subsequent siblings)
22 siblings, 1 reply; 120+ messages in thread
From: Andrew Cooper @ 2025-08-08 20:23 UTC (permalink / raw)
To: Xen-devel; +Cc: Andrew Cooper, Jan Beulich, Roger Pau Monné
Under FRED, SETSSBSY is unavailable, and we want to be setting up FRED prior
to setting up shadow stacks. Luckily, RSTORSSP will also work in this case.
This involves a new type of shadow stack token, the Restore Token, which is
distinguished from the Supervisor Token by pointing to the adjacent slot on
the shadow stack rather than pointing at itself.
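A small sketch of the two token values, for illustration only. The helper
names are hypothetical; the arithmetic follows the "&token + 8 + 64BIT
(bit 0)" comment in the hunk below, and assumes the token slot is 8-byte
aligned.

```c
#include <assert.h>
#include <stdint.h>

/* Supervisor Token: the slot holds its own address (busy bit 0 clear). */
static uint64_t supervisor_token(uint64_t token_addr)
{
    return token_addr;
}

/*
 * Restore Token: the slot holds the address of the adjacent slot above it,
 * with bit 0 set to indicate 64-bit mode.
 */
static uint64_t restore_token(uint64_t token_addr)
{
    return (token_addr + 8) | 1;
}
```

In the S3 path the token is written at SSP - 8, so its value works out to
SSP + 1, which is what the `lea 1(%rdi), %rax` in the hunk computes.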
In the short term, this logic still needs to load MSR_PL0_SSP.
No functional change.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
CC: Jan Beulich <JBeulich@suse.com>
CC: Roger Pau Monné <roger.pau@citrix.com>
---
xen/arch/x86/acpi/wakeup_prot.S | 31 +++++++++++++++----------------
1 file changed, 15 insertions(+), 16 deletions(-)
diff --git a/xen/arch/x86/acpi/wakeup_prot.S b/xen/arch/x86/acpi/wakeup_prot.S
index dfc8c6ac6e8c..6ddc4011d8b6 100644
--- a/xen/arch/x86/acpi/wakeup_prot.S
+++ b/xen/arch/x86/acpi/wakeup_prot.S
@@ -90,7 +90,7 @@ LABEL(s3_resume)
mov %rcx, STACK_CPUINFO_FIELD(cr4)(%r15)
mov %rcx, %cr4
- /* WARNING! call/ret now fatal (iff SHSTK) until SETSSBSY loads SSP */
+ /* WARNING! CALL/RET now fatal (iff SHSTK) until RSTORSSP loads SSP */
#if defined(CONFIG_XEN_SHSTK)
test $CET_SHSTK_EN, %al
@@ -98,32 +98,31 @@ LABEL(s3_resume)
/*
* Restoring SSP is a little complicated, because we are intercepting
- * an in-use shadow stack. Write a temporary token under the stack,
- * so SETSSBSY will successfully load a value useful for us, then
- * reset MSR_PL0_SSP to its usual value and pop the temporary token.
+ * an in-use shadow stack. Write a Restore Token under the stack, and
+ * use RSTORSSP to load it. RSTORSSP converts the token to a
+ * Previous-SSP Token, which we discard.
*/
mov saved_ssp(%rip), %rdi
- /* Construct the temporary supervisor token under SSP. */
- sub $8, %rdi
-
- /* Load it into MSR_PL0_SSP. */
+ /* Calculate MSR_PL0_SSP from SSP. */
mov $MSR_PL0_SSP, %ecx
mov %rdi, %rdx
shr $32, %rdx
mov %edi, %eax
- wrmsr
-
- /* Write the temporary token onto the shadow stack, and activate it. */
- wrssq %rdi, (%rdi)
- setssbsy
-
- /* Reset MSR_PL0_SSP back to its normal value. */
and $~(STACK_SIZE - 1), %eax
or $(PRIMARY_SHSTK_SLOT + 1) * PAGE_SIZE - 8, %eax
wrmsr
- /* Pop the temporary token off the stack. */
+ /*
+ * A Restore Token's value is &token + 8 + 64BIT (bit 0).
+ * We want to put this on the shstk at SSP - 8.
+ */
+ lea 1(%rdi), %rax
+ sub $8, %rdi
+ wrssq %rax, (%rdi)
+ rstorssp (%rdi)
+
+ /* Discard the Previous-SSP Token from the shstk. */
mov $2, %eax
incsspd %eax
#endif /* CONFIG_XEN_SHSTK */
--
2.39.5
* [PATCH 18/22] x86/traps: Set MSR_PL0_SSP in load_system_tables()
2025-08-08 20:22 [PATCH 00/22] x86: FRED support, part 1 (stacks and exceptions) Andrew Cooper
` (16 preceding siblings ...)
2025-08-08 20:23 ` [PATCH 17/22] x86/S3: Switch to using RSTORSSP to recover SSP on resume Andrew Cooper
@ 2025-08-08 20:23 ` Andrew Cooper
2025-08-14 15:00 ` Jan Beulich
2025-08-08 20:23 ` [PATCH 19/22] x86/boot: Use RSTORSSP to establish SSP Andrew Cooper
` (4 subsequent siblings)
22 siblings, 1 reply; 120+ messages in thread
From: Andrew Cooper @ 2025-08-08 20:23 UTC (permalink / raw)
To: Xen-devel; +Cc: Andrew Cooper, Jan Beulich, Roger Pau Monné
FRED and IDT differ by a Supervisor Token on the base of the shstk. This
means that the value they load into MSR_PL0_SSP differs by 8.
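For reference, the IDT-mode value can be sketched as below. This is an
illustration rather than the Xen code: the PAGE_SIZE, STACK_SIZE and
PRIMARY_SHSTK_SLOT values are assumptions chosen for the example, and
idt_pl0_ssp() is a hypothetical helper mirroring the and/or sequence removed
from the assembly.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SIZE          UINT64_C(4096)
#define STACK_SIZE         (8 * PAGE_SIZE) /* assumed for this sketch */
#define PRIMARY_SHSTK_SLOT 5               /* assumed for this sketch */

static_assert((STACK_SIZE & (STACK_SIZE - 1)) == 0,
              "stack size must be a power of two for the mask to work");

/*
 * IDT mode: MSR_PL0_SSP points at the Supervisor Token, which occupies the
 * last 8 bytes of the primary shadow stack page.
 */
static uint64_t idt_pl0_ssp(uint64_t addr_in_stack)
{
    uint64_t stack_top = addr_in_stack & ~(STACK_SIZE - 1);

    return stack_top + (PRIMARY_SHSTK_SLOT + 1) * PAGE_SIZE - 8;
}
```

Computing this once in load_system_tables() means the assembly paths no
longer need to know which of the two layouts is in use.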
s3_resume() in particular has logic which is otherwise invariant of FRED mode,
and must not clobber a FRED MSR_PL0_SSP with an IDT one.
This simplifies the AP path too. Updating reinit_bsp_stack() is deferred
until later.
No functional change.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
CC: Jan Beulich <JBeulich@suse.com>
CC: Roger Pau Monné <roger.pau@citrix.com>
---
xen/arch/x86/acpi/wakeup_prot.S | 9 ---------
xen/arch/x86/boot/x86_64.S | 12 +++---------
xen/arch/x86/traps-setup.c | 2 ++
3 files changed, 5 insertions(+), 18 deletions(-)
diff --git a/xen/arch/x86/acpi/wakeup_prot.S b/xen/arch/x86/acpi/wakeup_prot.S
index 6ddc4011d8b6..c800cd28a7c9 100644
--- a/xen/arch/x86/acpi/wakeup_prot.S
+++ b/xen/arch/x86/acpi/wakeup_prot.S
@@ -104,15 +104,6 @@ LABEL(s3_resume)
*/
mov saved_ssp(%rip), %rdi
- /* Calculate MSR_PL0_SSP from SSP. */
- mov $MSR_PL0_SSP, %ecx
- mov %rdi, %rdx
- shr $32, %rdx
- mov %edi, %eax
- and $~(STACK_SIZE - 1), %eax
- or $(PRIMARY_SHSTK_SLOT + 1) * PAGE_SIZE - 8, %eax
- wrmsr
-
/*
* A Restore Token's value is &token + 8 + 64BIT (bit 0).
* We want to put this on the shstk at SSP - 8.
diff --git a/xen/arch/x86/boot/x86_64.S b/xen/arch/x86/boot/x86_64.S
index 631ea2f8236e..ebb91d5e3f60 100644
--- a/xen/arch/x86/boot/x86_64.S
+++ b/xen/arch/x86/boot/x86_64.S
@@ -65,17 +65,11 @@ ENTRY(__high_start)
or $(PRIMARY_SHSTK_SLOT + 1) * PAGE_SIZE - 8, %rdx
/*
- * Write a new supervisor token. Doesn't matter on boot, but for S3
- * resume this clears the busy bit.
+ * Write a new Supervisor Token. It doesn't matter the first time a
+ * CPU boots, but for S3 resume or CPU hot re-add, this clears the
+ * busy bit.
*/
wrssq %rdx, (%rdx)
-
- /* Point MSR_PL0_SSP at the token. */
- mov $MSR_PL0_SSP, %ecx
- mov %edx, %eax
- shr $32, %rdx
- wrmsr
-
setssbsy
#endif /* CONFIG_XEN_SHSTK */
diff --git a/xen/arch/x86/traps-setup.c b/xen/arch/x86/traps-setup.c
index 3b5e4969a375..c4825fc1b11a 100644
--- a/xen/arch/x86/traps-setup.c
+++ b/xen/arch/x86/traps-setup.c
@@ -96,6 +96,7 @@ static void load_system_tables(void)
{
volatile uint64_t *ist_ssp = tss_page->ist_ssp;
unsigned long
+ ssp = stack_top + (PRIMARY_SHSTK_SLOT + 1) * PAGE_SIZE - 8,
mce_ssp = stack_top + (IST_MCE * IST_SHSTK_SIZE) - 8,
nmi_ssp = stack_top + (IST_NMI * IST_SHSTK_SIZE) - 8,
db_ssp = stack_top + (IST_DB * IST_SHSTK_SIZE) - 8,
@@ -122,6 +123,7 @@ static void load_system_tables(void)
}
wrmsrns(MSR_ISST, (unsigned long)ist_ssp);
+ wrmsrns(MSR_PL0_SSP, (unsigned long)ssp);
}
_set_tssldt_desc(gdt + TSS_ENTRY, (unsigned long)tss,
--
2.39.5
* [PATCH 19/22] x86/boot: Use RSTORSSP to establish SSP
2025-08-08 20:22 [PATCH 00/22] x86: FRED support, part 1 (stacks and exceptions) Andrew Cooper
` (17 preceding siblings ...)
2025-08-08 20:23 ` [PATCH 18/22] x86/traps: Set MSR_PL0_SSP in load_system_tables() Andrew Cooper
@ 2025-08-08 20:23 ` Andrew Cooper
2025-08-14 15:11 ` Jan Beulich
2025-08-08 20:23 ` [PATCH 20/22] x86/traps: Alter switch_stack_and_jump() for FRED mode Andrew Cooper
` (3 subsequent siblings)
22 siblings, 1 reply; 120+ messages in thread
From: Andrew Cooper @ 2025-08-08 20:23 UTC (permalink / raw)
To: Xen-devel; +Cc: Andrew Cooper, Jan Beulich, Roger Pau Monné
Under FRED, SETSSBSY is unavailable, and we want to be setting up FRED prior
to setting up shadow stacks. As we still need Supervisor Tokens in IDT mode,
we need mode-specific logic to establish SSP.
In FRED mode, write a Restore Token, RSTORSSP it, and discard the resulting
Previous-SSP token.
No change outside of FRED mode.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
CC: Jan Beulich <JBeulich@suse.com>
CC: Roger Pau Monné <roger.pau@citrix.com>
---
xen/arch/x86/boot/x86_64.S | 23 +++++++++++++++++++++--
xen/arch/x86/setup.c | 27 ++++++++++++++++++++++++---
2 files changed, 45 insertions(+), 5 deletions(-)
diff --git a/xen/arch/x86/boot/x86_64.S b/xen/arch/x86/boot/x86_64.S
index ebb91d5e3f60..138501f52158 100644
--- a/xen/arch/x86/boot/x86_64.S
+++ b/xen/arch/x86/boot/x86_64.S
@@ -53,17 +53,21 @@ ENTRY(__high_start)
mov %rcx, STACK_CPUINFO_FIELD(cr4)(%r15)
mov %rcx, %cr4
- /* WARNING! call/ret now fatal (iff SHSTK) until SETSSBSY loads SSP */
+ /* WARNING! CALL/RET now fatal (iff SHSTK) until SETSSBSY/RSTORSSP loads SSP */
#if defined(CONFIG_XEN_SHSTK)
test $CET_SHSTK_EN, %al
jz .L_ap_cet_done
- /* Derive the supervisor token address from %rsp. */
+ /* Derive the token address from %rsp. */
mov %rsp, %rdx
and $~(STACK_SIZE - 1), %rdx
or $(PRIMARY_SHSTK_SLOT + 1) * PAGE_SIZE - 8, %rdx
+ /* Establishing SSP depends on whether we're using FRED or IDT mode. */
+ bt $32 /* ilog2(X86_CR4_FRED) */, %rcx
+ jc .L_fred_shstk
+
/*
* Write a new Supervisor Token. It doesn't matter the first time a
* CPU boots, but for S3 resume or CPU hot re-add, this clears the
@@ -71,6 +75,21 @@ ENTRY(__high_start)
*/
wrssq %rdx, (%rdx)
setssbsy
+ jmp .L_ap_cet_done
+
+.L_fred_shstk:
+
+ /*
+ * Write a Restore Token, value: &token + 8 + 64BIT (bit 0) at the
+ * base of the shstk (which isn't in use yet).
+ */
+ lea 9(%rdx), %rdi
+ wrssq %rdi, (%rdx)
+ rstorssp (%rdx)
+
+ /* Discard the Previous-SSP Token from the shstk. */
+ mov $2, %edx
+ incsspd %edx
#endif /* CONFIG_XEN_SHSTK */
.L_ap_cet_done:
diff --git a/xen/arch/x86/setup.c b/xen/arch/x86/setup.c
index 6fb42c5a5f95..c5dd2051dffe 100644
--- a/xen/arch/x86/setup.c
+++ b/xen/arch/x86/setup.c
@@ -53,6 +53,7 @@
#include <asm/prot-key.h>
#include <asm/pv/domain.h>
#include <asm/setup.h>
+#include <asm/shstk.h>
#include <asm/smp.h>
#include <asm/spec_ctrl.h>
#include <asm/stubs.h>
@@ -912,10 +913,30 @@ static void __init noreturn reinit_bsp_stack(void)
if ( cpu_has_xen_shstk )
{
- wrmsrl(MSR_PL0_SSP,
- (unsigned long)stack + (PRIMARY_SHSTK_SLOT + 1) * PAGE_SIZE - 8);
wrmsrl(MSR_S_CET, xen_msr_s_cet_value());
- asm volatile ("setssbsy" ::: "memory");
+
+ /*
+ * IDT and FRED differ by a Supervisor Token on the shadow stack, and
+ * therefore by the value in MSR_PL0_SSP.
+ *
+ * In IDT mode, we use SETSSBSY to mark the Supervisor Token as busy.
+ * In FRED mode, there is no token, so we need a transient Restore
+ * Token to establish SSP.
+ */
+ if ( opt_fred )
+ {
+ unsigned long *token =
+ (void *)stack + (PRIMARY_SHSTK_SLOT + 1) * PAGE_SIZE - 8;
+
+ wrss((unsigned long)token + 9, token);
+ asm volatile ( "rstorssp %0" : "+m" (*token) );
+ /*
+ * We need to discard the resulting Previous-SSP Token, but
+ * reset_stack_and_jump() will do that for us.
+ */
+ }
+ else
+ asm volatile ( "setssbsy" ::: "memory" );
}
reset_stack_and_jump(init_done);
--
2.39.5
* [PATCH 20/22] x86/traps: Alter switch_stack_and_jump() for FRED mode
2025-08-08 20:22 [PATCH 00/22] x86: FRED support, part 1 (stacks and exceptions) Andrew Cooper
` (18 preceding siblings ...)
2025-08-08 20:23 ` [PATCH 19/22] x86/boot: Use RSTORSSP to establish SSP Andrew Cooper
@ 2025-08-08 20:23 ` Andrew Cooper
2025-08-14 15:35 ` Jan Beulich
2025-08-08 20:23 ` [PATCH 21/22] x86/traps: Introduce FRED entrypoints Andrew Cooper
` (2 subsequent siblings)
22 siblings, 1 reply; 120+ messages in thread
From: Andrew Cooper @ 2025-08-08 20:23 UTC (permalink / raw)
To: Xen-devel; +Cc: Andrew Cooper, Jan Beulich, Roger Pau Monné
FRED and IDT differ by a Supervisor Token on the base of the shstk. This
means that switch_stack_and_jump() needs to discard one extra word when FRED
is active.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
CC: Jan Beulich <JBeulich@suse.com>
CC: Roger Pau Monné <roger.pau@citrix.com>
RFC. I don't like this, but it does work.
This emits opt_fred logic outside of CONFIG_XEN_SHSTK. But frankly, the
construct is already too unwieldy, and all options I can think of make it
more so.
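As a rough model of the arithmetic the macro performs, the sketch below
computes how many 8-byte shadow stack words get discarded in each mode. The
function name is hypothetical, and the layout constants are assumptions rather
than values taken from this patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed layout constants, for illustration only. */
#define PAGE_SIZE          4096UL
#define STACK_SIZE         (8 * PAGE_SIZE)
#define PRIMARY_SHSTK_SLOT 5UL

/*
 * Number of 8-byte words between the current SSP and the point the shadow
 * stack is reset to.  In IDT mode the Supervisor Token stays in place, so
 * the reset point is 8 bytes lower than in FRED mode.
 */
static unsigned int shstk_words_to_discard(uint64_t ssp, bool fred)
{
    unsigned long token_offset =
        (PRIMARY_SHSTK_SLOT + 1) * PAGE_SIZE - (fred ? 0 : 8);
    unsigned long ssp_offset = ssp & (STACK_SIZE - 1);

    return (token_offset - ssp_offset) >> 3;
}
```

For the same SSP, the FRED result is always one word larger than the IDT
result, which is the "one extra word" in the commit message.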
---
xen/arch/x86/include/asm/current.h | 9 +++++----
1 file changed, 5 insertions(+), 4 deletions(-)
diff --git a/xen/arch/x86/include/asm/current.h b/xen/arch/x86/include/asm/current.h
index 962eb76a82b3..24d7d906a8c6 100644
--- a/xen/arch/x86/include/asm/current.h
+++ b/xen/arch/x86/include/asm/current.h
@@ -11,6 +11,7 @@
#include <xen/page-size.h>
#include <asm/cpu-user-regs.h>
+#include <asm/traps.h>
/*
* Xen's cpu stacks are 8 pages (8-page aligned), arranged as:
@@ -154,7 +155,6 @@ unsigned long get_stack_dump_bottom (unsigned long sp);
"rdsspd %[ssp];" \
"cmp $1, %[ssp];" \
"je .L_shstk_done.%=;" /* CET not active? Skip. */ \
- "mov $%c[skstk_base], %[val];" \
"and $%c[stack_mask], %[ssp];" \
"sub %[ssp], %[val];" \
"shr $3, %[val];" \
@@ -177,6 +177,8 @@ unsigned long get_stack_dump_bottom (unsigned long sp);
#define switch_stack_and_jump(fn, instr, constr) \
({ \
+ unsigned int token_offset = \
+ (PRIMARY_SHSTK_SLOT + 1) * PAGE_SIZE - (opt_fred ? 0 : 8); \
unsigned int tmp; \
BUILD_BUG_ON(!ssaj_has_attr_noreturn(fn)); \
__asm__ __volatile__ ( \
@@ -184,12 +186,11 @@ unsigned long get_stack_dump_bottom (unsigned long sp);
"mov %[stk], %%rsp;" \
CHECK_FOR_LIVEPATCH_WORK \
instr "[fun]" \
- : [val] "=&r" (tmp), \
+ : [val] "=r" (tmp), \
[ssp] "=&r" (tmp) \
: [stk] "r" (guest_cpu_user_regs()), \
[fun] constr (fn), \
- [skstk_base] "i" \
- ((PRIMARY_SHSTK_SLOT + 1) * PAGE_SIZE - 8), \
+ "[val]" (token_offset), \
[stack_mask] "i" (STACK_SIZE - 1), \
_ASM_BUGFRAME_INFO(BUGFRAME_bug, __LINE__, \
__FILE__, NULL) \
--
2.39.5
* [PATCH 21/22] x86/traps: Introduce FRED entrypoints
2025-08-08 20:22 [PATCH 00/22] x86: FRED support, part 1 (stacks and exceptions) Andrew Cooper
` (19 preceding siblings ...)
2025-08-08 20:23 ` [PATCH 20/22] x86/traps: Alter switch_stack_and_jump() for FRED mode Andrew Cooper
@ 2025-08-08 20:23 ` Andrew Cooper
2025-08-11 11:38 ` Andrew Cooper
` (2 more replies)
2025-08-08 20:23 ` [PATCH 22/22] x86/traps: Enable FRED when requested Andrew Cooper
2025-08-08 23:49 ` [PATCH 23/22] x86/vmx: Adjust NMI handling for FRED Andrew Cooper
22 siblings, 3 replies; 120+ messages in thread
From: Andrew Cooper @ 2025-08-08 20:23 UTC (permalink / raw)
To: Xen-devel; +Cc: Andrew Cooper, Jan Beulich, Roger Pau Monné
Under FRED, there's one entrypoint from Ring 3, and one from Ring 0.
FRED gives us a good stack (even for SYSCALL/SYSENTER), and a unified event
frame on the stack, meaning that all software needs to do is spill the GPRs
with a line of PUSHes. Introduce PUSH_AND_CLEAR_GPRS and POP_GPRS for this
purpose.
Introduce entry_FRED_R0() which to a first approximation is complete for all
event handling within Xen.
entry_FRED_R0() needs deriving from entry_FRED_R3(), so introduce a basic
handler. There is more work required to make the return-to-guest path work
under FRED, so leave a BUG clearly in place.
Also introduce entry_from_{xen,pv}() to be the C level handlers. By simply
copying regs->fred_ss.vector into regs->entry_vector, we can reuse all the
existing fault handlers.
Extend fatal_trap() to render the event type, including by name, when FRED is
active. This is slightly complicated, because X86_ET_OTHER must not use
vector_name() or SYSCALL and SYSENTER get rendered as #BP and #DB. Also,
{read,write}_gs_shadow() needs modifying to avoid the SWAPGS instruction,
which is disallowed in FRED mode.
This is sufficient to handle all interrupts and exceptions encountered during
development, including plenty of Double Faults.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
CC: Jan Beulich <JBeulich@suse.com>
CC: Roger Pau Monné <roger.pau@citrix.com>
SIMICS hasn't been updated to FRED v9, and still wants ENDBR instructions
at the entrypoints.
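To illustrate why X86_ET_OTHER needs its own name table, here is a small
standalone sketch of the lookup split. It is not the code from this patch: the
enum values, the shortened vector table and the function name are assumptions
made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Assumed event type encodings, for illustration only. */
enum {
    ET_EXT_INTR = 0, ET_NMI = 2, ET_HW_EXC = 3, ET_SW_INT = 4,
    ET_PRIV_SW_EXC = 5, ET_SW_EXC = 6, ET_OTHER = 7,
};

#define ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

/*
 * Pick a human readable name for (type, vector).  Exception-like types use
 * the exception vector names; ET_OTHER has its own vector namespace and must
 * not be looked up in the exception table.  Returns NULL when the vector has
 * no meaningful name for the type.
 */
static const char *event_vec_name(unsigned int type, unsigned int vec)
{
    /* Shortened exception table, just enough for the example. */
    static const char *const exc[] = { "#DE", "#DB", "NMI", "#BP" };
    static const char *const other[] = { "MTF", "SYSCALL", "SYSENTER" };

    switch ( type )
    {
    case ET_HW_EXC:
    case ET_SW_INT:
    case ET_PRIV_SW_EXC:
    case ET_SW_EXC:
        return vec < ARRAY_SIZE(exc) ? exc[vec] : "???";

    case ET_OTHER:
        return vec < ARRAY_SIZE(other) ? other[vec] : "???";

    default:
        return NULL;
    }
}
```

The same vector number resolves to different names depending on the event
type, so the type has to be checked before any table is consulted.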
---
xen/arch/x86/include/asm/asm_defns.h | 65 ++++++++++++
xen/arch/x86/include/asm/msr.h | 8 +-
xen/arch/x86/traps.c | 153 ++++++++++++++++++++++++++-
xen/arch/x86/x86_64/Makefile | 1 +
xen/arch/x86/x86_64/entry-fred.S | 35 ++++++
5 files changed, 256 insertions(+), 6 deletions(-)
create mode 100644 xen/arch/x86/x86_64/entry-fred.S
diff --git a/xen/arch/x86/include/asm/asm_defns.h b/xen/arch/x86/include/asm/asm_defns.h
index 72a0082d319d..a81a4043d0f1 100644
--- a/xen/arch/x86/include/asm/asm_defns.h
+++ b/xen/arch/x86/include/asm/asm_defns.h
@@ -315,6 +315,71 @@ static always_inline void stac(void)
subq $-(UREGS_error_code-UREGS_r15+\adj), %rsp
.endm
+/*
+ * Push and clear GPRs
+ */
+.macro PUSH_AND_CLEAR_GPRS
+ push %rdi
+ xor %edi, %edi
+ push %rsi
+ xor %esi, %esi
+ push %rdx
+ xor %edx, %edx
+ push %rcx
+ xor %ecx, %ecx
+ push %rax
+ xor %eax, %eax
+ push %r8
+ xor %r8d, %r8d
+ push %r9
+ xor %r9d, %r9d
+ push %r10
+ xor %r10d, %r10d
+ push %r11
+ xor %r11d, %r11d
+ push %rbx
+ xor %ebx, %ebx
+ push %rbp
+#ifdef CONFIG_FRAME_POINTER
+/* Indicate special exception stack frame by inverting the frame pointer. */
+ mov %rsp, %rbp
+ notq %rbp
+#else
+ xor %ebp, %ebp
+#endif
+ push %r12
+ xor %r12d, %r12d
+ push %r13
+ xor %r13d, %r13d
+ push %r14
+ xor %r14d, %r14d
+ push %r15
+ xor %r15d, %r15d
+.endm
+
+/*
+ * POP GPRs from a UREGS_* frame on the stack. Does not modify flags.
+ *
+ * @rax: Alternative destination for the %rax value on the stack.
+ */
+.macro POP_GPRS rax=%rax
+ pop %r15
+ pop %r14
+ pop %r13
+ pop %r12
+ pop %rbp
+ pop %rbx
+ pop %r11
+ pop %r10
+ pop %r9
+ pop %r8
+ pop \rax
+ pop %rcx
+ pop %rdx
+ pop %rsi
+ pop %rdi
+.endm
+
#ifdef CONFIG_PV32
#define CR4_PV32_RESTORE \
ALTERNATIVE_2 "", \
diff --git a/xen/arch/x86/include/asm/msr.h b/xen/arch/x86/include/asm/msr.h
index b6b85b04c3fd..01f510315ffe 100644
--- a/xen/arch/x86/include/asm/msr.h
+++ b/xen/arch/x86/include/asm/msr.h
@@ -202,9 +202,9 @@ static inline unsigned long read_gs_base(void)
static inline unsigned long read_gs_shadow(void)
{
- unsigned long base;
+ unsigned long base, cr4 = read_cr4();
- if ( read_cr4() & X86_CR4_FSGSBASE )
+ if ( !(cr4 & X86_CR4_FRED) && (cr4 & X86_CR4_FSGSBASE) )
{
asm volatile ( "swapgs" );
base = __rdgsbase();
@@ -234,7 +234,9 @@ static inline void write_gs_base(unsigned long base)
static inline void write_gs_shadow(unsigned long base)
{
- if ( read_cr4() & X86_CR4_FSGSBASE )
+ unsigned long cr4 = read_cr4();
+
+ if ( !(cr4 & X86_CR4_FRED) && (cr4 & X86_CR4_FSGSBASE) )
{
asm volatile ( "swapgs\n\t"
"wrgsbase %0\n\t"
diff --git a/xen/arch/x86/traps.c b/xen/arch/x86/traps.c
index 270b93ed623e..e67a428e4362 100644
--- a/xen/arch/x86/traps.c
+++ b/xen/arch/x86/traps.c
@@ -1013,6 +1013,32 @@ void show_execution_state_nmi(const cpumask_t *mask, bool show_all)
printk("Non-responding CPUs: {%*pbl}\n", CPUMASK_PR(&show_state_mask));
}
+static const char *x86_et_name(unsigned int type)
+{
+ static const char *const names[] = {
+ [X86_ET_EXT_INTR] = "EXT_INTR",
+ [X86_ET_NMI] = "NMI",
+ [X86_ET_HW_EXC] = "HW_EXC",
+ [X86_ET_SW_INT] = "SW_INT",
+ [X86_ET_PRIV_SW_EXC] = "PRIV_SW_EXEC",
+ [X86_ET_SW_EXC] = "SW_EXEC",
+ [X86_ET_OTHER] = "OTHER",
+ };
+
+ return (type < ARRAY_SIZE(names) && names[type]) ? names[type] : "???";
+}
+
+static const char *x86_et_other_name(unsigned int vec)
+{
+ static const char *const names[] = {
+ [0] = "MTF",
+ [1] = "SYSCALL",
+ [2] = "SYSENTER",
+ };
+
+ return (vec < ARRAY_SIZE(names) && names[vec][0]) ? names[vec] : "???";
+}
+
const char *vector_name(unsigned int vec)
{
static const char names[][4] = {
@@ -1091,9 +1117,42 @@ void fatal_trap(const struct cpu_user_regs *regs, bool show_remote)
}
}
- panic("FATAL TRAP: vec %u, %s[%04x]%s\n",
- trapnr, vector_name(trapnr), regs->error_code,
- (regs->eflags & X86_EFLAGS_IF) ? "" : " IN INTERRUPT CONTEXT");
+ if ( read_cr4() & X86_CR4_FRED )
+ {
+ bool render_ec = false;
+ const char *vec_name = NULL;
+
+ switch ( regs->fred_ss.type )
+ {
+ case X86_ET_HW_EXC:
+ case X86_ET_SW_INT:
+ case X86_ET_PRIV_SW_EXC:
+ case X86_ET_SW_EXC:
+ render_ec = true;
+ vec_name = vector_name(regs->fred_ss.vector);
+ break;
+
+ case X86_ET_OTHER:
+ vec_name = x86_et_other_name(regs->fred_ss.vector);
+ break;
+ }
+
+ if ( render_ec )
+ panic("Fatal TRAP: type %u, %s, vec %u, %s[%04x]%s\n",
+ regs->fred_ss.type, x86_et_name(regs->fred_ss.type),
+ regs->fred_ss.vector, vec_name ?: "",
+ regs->error_code,
+ (regs->eflags & X86_EFLAGS_IF) ? "" : " IN INTERRUPT CONTEXT");
+ else
+ panic("Fatal TRAP: type %u, %s, vec %u, %s%s\n",
+ regs->fred_ss.type, x86_et_name(regs->fred_ss.type),
+ regs->fred_ss.vector, vec_name ?: "",
+ (regs->eflags & X86_EFLAGS_IF) ? "" : " IN INTERRUPT CONTEXT");
+ }
+ else
+ panic("FATAL TRAP: vec %u, %s[%04x]%s\n",
+ trapnr, vector_name(trapnr), regs->error_code,
+ (regs->eflags & X86_EFLAGS_IF) ? "" : " IN INTERRUPT CONTEXT");
}
void asmlinkage noreturn do_unhandled_trap(struct cpu_user_regs *regs)
@@ -2181,6 +2240,94 @@ void asmlinkage check_ist_exit(const struct cpu_user_regs *regs, bool ist_exit)
}
#endif
+void asmlinkage entry_from_pv(struct cpu_user_regs *regs)
+{
+ /* Copy fred_ss.vector into entry_vector as IDT delivery would have done. */
+ regs->entry_vector = regs->fred_ss.vector;
+
+ switch ( regs->fred_ss.type )
+ {
+ case X86_ET_EXT_INTR:
+ do_IRQ(regs);
+ break;
+
+ case X86_ET_NMI:
+ do_nmi(regs);
+ break;
+
+ case X86_ET_HW_EXC:
+ case X86_ET_SW_INT:
+ case X86_ET_PRIV_SW_EXC:
+ case X86_ET_SW_EXC:
+ goto fatal;
+
+ default:
+ goto fatal;
+ }
+
+ return;
+
+ fatal:
+ fatal_trap(regs, false);
+}
+
+void asmlinkage entry_from_xen(struct cpu_user_regs *regs)
+{
+ /* Copy fred_ss.vector into entry_vector as IDT delivery would have done. */
+ regs->entry_vector = regs->fred_ss.vector;
+
+ switch ( regs->fred_ss.type )
+ {
+ case X86_ET_EXT_INTR:
+ do_IRQ(regs);
+ break;
+
+ case X86_ET_NMI:
+ do_nmi(regs);
+ break;
+
+ case X86_ET_HW_EXC:
+ case X86_ET_SW_INT:
+ case X86_ET_PRIV_SW_EXC:
+ case X86_ET_SW_EXC:
+ switch ( regs->fred_ss.vector )
+ {
+ case X86_EXC_PF: do_page_fault(regs); break;
+ case X86_EXC_GP: do_general_protection(regs); break;
+ case X86_EXC_UD: do_invalid_op(regs); break;
+ case X86_EXC_NM: do_device_not_available(regs); break;
+ case X86_EXC_BP: do_int3(regs); break;
+ case X86_EXC_DB: do_debug(regs); break;
+ case X86_EXC_DF: do_double_fault(regs); break;
+
+ case X86_EXC_DE:
+ case X86_EXC_OF:
+ case X86_EXC_BR:
+ case X86_EXC_NP:
+ case X86_EXC_SS:
+ case X86_EXC_MF:
+ case X86_EXC_AC:
+ case X86_EXC_XM:
+ do_trap(regs);
+ break;
+
+ case X86_EXC_CP: do_entry_CP(regs); break;
+
+ default:
+ goto fatal;
+ }
+ break;
+
+ default:
+ goto fatal;
+ }
+
+ return;
+
+ fatal:
+ fatal_trap(regs, false);
+}
+
/*
* Local variables:
* mode: C
diff --git a/xen/arch/x86/x86_64/Makefile b/xen/arch/x86/x86_64/Makefile
index f20763088740..5ec933539adb 100644
--- a/xen/arch/x86/x86_64/Makefile
+++ b/xen/arch/x86/x86_64/Makefile
@@ -1,6 +1,7 @@
obj-$(CONFIG_PV32) += compat/
obj-bin-y += entry.o
+obj-bin-y += entry-fred.o
obj-$(CONFIG_KEXEC) += machine_kexec.o
obj-y += pci.o
obj-y += acpi_mmcfg.o
diff --git a/xen/arch/x86/x86_64/entry-fred.S b/xen/arch/x86/x86_64/entry-fred.S
new file mode 100644
index 000000000000..88d262b91f92
--- /dev/null
+++ b/xen/arch/x86/x86_64/entry-fred.S
@@ -0,0 +1,35 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+
+ .file "x86_64/entry-fred.S"
+
+#include <asm/asm_defns.h>
+#include <asm/page.h>
+
+ .section .text.entry, "ax", @progbits
+
+ /* The Ring3 entry point is required to be 4k aligned. */
+
+FUNC(entry_FRED_R3, 4096)
+ PUSH_AND_CLEAR_GPRS
+
+ mov %rsp, %rdi
+ call entry_from_pv
+
+ BUG /* TODO - return to guest path */
+
+ POP_GPRS
+ eretu
+END(entry_FRED_R3)
+
+ /* The Ring0 entrypoint is at Ring3 + 256. */
+ .org entry_FRED_R3 + 256, 0xcc
+
+FUNC_LOCAL(entry_FRED_R0, 0)
+ PUSH_AND_CLEAR_GPRS
+
+ mov %rsp, %rdi
+ call entry_from_xen
+
+ POP_GPRS
+ erets
+END(entry_FRED_R0)
--
2.39.5
* [PATCH 22/22] x86/traps: Enable FRED when requested
2025-08-08 20:22 [PATCH 00/22] x86: FRED support, part 1 (stacks and exceptions) Andrew Cooper
` (20 preceding siblings ...)
2025-08-08 20:23 ` [PATCH 21/22] x86/traps: Introduce FRED entrypoints Andrew Cooper
@ 2025-08-08 20:23 ` Andrew Cooper
2025-08-18 9:35 ` Jan Beulich
2025-08-08 23:49 ` [PATCH 23/22] x86/vmx: Adjust NMI handling for FRED Andrew Cooper
22 siblings, 1 reply; 120+ messages in thread
From: Andrew Cooper @ 2025-08-08 20:23 UTC (permalink / raw)
To: Xen-devel; +Cc: Andrew Cooper, Jan Beulich, Roger Pau Monné
With the shadow stack and exception handling adjustments in place, we can now
activate FRED when appropriate. Note that opt_fred is still disabled by
default.
Introduce init_fred() to set up all the MSRs relevant for FRED. FRED uses
MSR_STAR (entries from Ring3 only), and MSR_FRED_SSP_SL0 aliases MSR_PL0_SSP
when CET-SS is active. Otherwise, they're all new MSRs.
With init_fred() existing, load_system_tables() and legacy_syscall_init()
should only be used when setting up IDT delivery. Insert ASSERT()s to this
effect, and adjust the various *_init() functions to make this property true.
Per the documentation, ap_early_traps_init() is responsible for switching off
the boot GDT, which needs doing even in FRED mode.
Finally, set CR4.FRED in {bsp,ap}_early_traps_init().
Xen can now boot in FRED mode up until starting a PV guest, where it faults
because IRET is not permitted to change privilege.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
CC: Jan Beulich <JBeulich@suse.com>
CC: Roger Pau Monné <roger.pau@citrix.com>
In principle we can stop allocating the IDT and TSS for CPUs now, although I
want to get shutdown and kexec working before making this optimisation, in
case there's something I've overlooked.
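As an aside on the MSR_FRED_STK_LVLS value written by init_fred(), the sketch
below shows the two-bits-per-vector packing it relies on. The helper names are
hypothetical and the vector numbers (#DF = 8, #MC = 18) are stated as
assumptions; this is an illustration, not code from the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed exception vector numbers, for illustration only. */
#define X86_EXC_DF  8
#define X86_EXC_MC 18

/* Set the 2-bit stack level for one vector (vec must be below 32). */
static uint64_t stk_lvls_set(uint64_t lvls, unsigned int vec,
                             unsigned int level)
{
    lvls &= ~(3ULL << (vec * 2));

    return lvls | ((uint64_t)(level & 3) << (vec * 2));
}

/* Read back the 2-bit stack level for one vector. */
static unsigned int stk_lvls_get(uint64_t lvls, unsigned int vec)
{
    return (lvls >> (vec * 2)) & 3;
}
```

Any vector left at level 0 stays on the current stack level, which is why only
#MC and #DF appear in the value.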
---
xen/arch/x86/include/asm/current.h | 3 ++
xen/arch/x86/traps-setup.c | 78 +++++++++++++++++++++++++++---
2 files changed, 75 insertions(+), 6 deletions(-)
diff --git a/xen/arch/x86/include/asm/current.h b/xen/arch/x86/include/asm/current.h
index 24d7d906a8c6..046740447db0 100644
--- a/xen/arch/x86/include/asm/current.h
+++ b/xen/arch/x86/include/asm/current.h
@@ -24,6 +24,9 @@
* 2 - NMI IST stack
* 1 - #MC IST stack
* 0 - IST Shadow Stacks (4x 1k, read-only)
+ *
+ * In FRED mode, #DB and NMI do not need special stacks, so their stacks are
+ * unused.
*/
/*
diff --git a/xen/arch/x86/traps-setup.c b/xen/arch/x86/traps-setup.c
index c4825fc1b11a..fdcfc7f5777d 100644
--- a/xen/arch/x86/traps-setup.c
+++ b/xen/arch/x86/traps-setup.c
@@ -26,6 +26,7 @@ int8_t __ro_after_init opt_fred = 0; /* -1 when supported. */
boolean_param("fred", opt_fred);
void nocall entry_PF(void);
+void nocall entry_FRED_R3(void);
void nocall lstar_enter(void);
void nocall cstar_enter(void);
@@ -63,6 +64,8 @@ static void load_system_tables(void)
.limit = sizeof(bsp_idt) - 1,
};
+ ASSERT(opt_fred == 0);
+
/*
* Set up the TSS. Warning - may be live, and the NMI/#MC must remain
* valid on every instruction boundary. (Note: these are all
@@ -197,6 +200,8 @@ static void legacy_syscall_init(void)
unsigned char *stub_page;
unsigned int offset;
+ ASSERT(opt_fred == 0);
+
/* No PV guests? No need to set up SYSCALL/SYSENTER infrastructure. */
if ( !IS_ENABLED(CONFIG_PV) )
return;
@@ -274,6 +279,44 @@ static void __init init_ler(void)
setup_force_cpu_cap(X86_FEATURE_XEN_LBR);
}
+/*
+ * Set up all MSRs relevant for FRED event delivery.
+ *
+ * Xen does not use any of the optional config in MSR_FRED_CONFIG, so all that
+ * is needed is the entrypoint.
+ *
+ * Because FRED always provides a good stack, NMI and #DB do not need any
+ * special treatment. Only #DF needs another stack level, and #MC for the
+ * offchance that Xen's main stack suffers an uncorrectable error.
+ *
+ * FRED reuses MSR_STAR to provide the segment selector values to load on
+ * entry from Ring3. Entry from Ring0 leave %cs and %ss unmodified.
+ */
+static void init_fred(void)
+{
+ unsigned long stack_top = get_stack_bottom() & ~(STACK_SIZE - 1);
+
+ ASSERT(opt_fred == 1);
+
+ wrmsrns(MSR_STAR, XEN_MSR_STAR);
+ wrmsrns(MSR_FRED_CONFIG, (unsigned long)entry_FRED_R3);
+
+ wrmsrns(MSR_FRED_RSP_SL0, (unsigned long)(&get_cpu_info()->_fred + 1));
+ wrmsrns(MSR_FRED_RSP_SL1, 0);
+ wrmsrns(MSR_FRED_RSP_SL2, stack_top + (1 + IST_MCE) * PAGE_SIZE);
+ wrmsrns(MSR_FRED_RSP_SL3, stack_top + (1 + IST_DF) * PAGE_SIZE);
+ wrmsrns(MSR_FRED_STK_LVLS, ((2UL << (X86_EXC_MC * 2)) |
+ (3UL << (X86_EXC_DF * 2))));
+
+ if ( cpu_has_xen_shstk )
+ {
+ wrmsrns(MSR_FRED_SSP_SL0, stack_top + (PRIMARY_SHSTK_SLOT + 1) * PAGE_SIZE);
+ wrmsrns(MSR_FRED_SSP_SL1, 0);
+ wrmsrns(MSR_FRED_SSP_SL2, stack_top + (IST_MCE * IST_SHSTK_SIZE));
+ wrmsrns(MSR_FRED_SSP_SL3, stack_top + (IST_DF * IST_SHSTK_SIZE));
+ }
+}
+
/*
* Configure basic exception handling. This is prior to parsing the command
* line or configuring a console, and needs to be as simple as possible.
@@ -331,15 +374,18 @@ void __init traps_init(void)
printk(XENLOG_INFO "Disabling PV32 due to FRED\n");
}
#endif
+ init_fred();
+ set_in_cr4(X86_CR4_FRED);
+
printk("Using FRED event delivery\n");
}
else
{
+ load_system_tables();
+
printk("Using IDT event delivery\n");
}
- load_system_tables();
-
init_ler();
/* Cache {,compat_}gdt_l1e now that physically relocation is done. */
@@ -357,8 +403,13 @@ void __init traps_init(void)
*/
void bsp_traps_reinit(void)
{
- load_system_tables();
- percpu_traps_init();
+ if ( opt_fred )
+ init_fred();
+ else
+ {
+ load_system_tables();
+ percpu_traps_init();
+ }
}
/*
@@ -367,7 +418,8 @@ void bsp_traps_reinit(void)
*/
void percpu_traps_init(void)
{
- legacy_syscall_init();
+ if ( !opt_fred )
+ legacy_syscall_init();
if ( cpu_has_xen_lbr )
wrmsrl(MSR_IA32_DEBUGCTLMSR, IA32_DEBUGCTLMSR_LBR);
@@ -382,7 +434,21 @@ void percpu_traps_init(void)
*/
void asmlinkage ap_early_traps_init(void)
{
- load_system_tables();
+ if ( opt_fred )
+ {
+ const seg_desc_t *gdt = this_cpu(gdt) - FIRST_RESERVED_GDT_ENTRY;
+ const struct desc_ptr gdtr = {
+ .base = (unsigned long)gdt,
+ .limit = LAST_RESERVED_GDT_BYTE,
+ };
+
+ lgdt(&gdtr);
+
+ init_fred();
+ write_cr4(read_cr4() | X86_CR4_FRED);
+ }
+ else
+ load_system_tables();
}
static void __init __maybe_unused build_assertions(void)
--
2.39.5
* [PATCH 23/22] x86/vmx: Adjust NMI handling for FRED
2025-08-08 20:22 [PATCH 00/22] x86: FRED support, part 1 (stacks and exceptions) Andrew Cooper
` (21 preceding siblings ...)
2025-08-08 20:23 ` [PATCH 22/22] x86/traps: Enable FRED when requested Andrew Cooper
@ 2025-08-08 23:49 ` Andrew Cooper
2025-08-18 10:02 ` Jan Beulich
22 siblings, 1 reply; 120+ messages in thread
From: Andrew Cooper @ 2025-08-08 23:49 UTC (permalink / raw)
To: Xen-devel; +Cc: Andrew Cooper, Jan Beulich, Roger Pau Monné
Handling NMIs in VT-x is awkward. It turns out that using INT $2 happens to
be correct for IDT delivery, and can be made to be correct under FRED too.
Xen can now boot in FRED mode and run PVH dom0 even with the watchdog enabled.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
CC: Jan Beulich <JBeulich@suse.com>
CC: Roger Pau Monné <roger.pau@citrix.com>
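A simplified model of the frame rewrite may help when reading the traps.c
hunk. The struct below is a hypothetical stand-in for the relevant fred_ss
fields (the real layout is a packed bitfield), and the event type values are
assumptions.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed event type encodings, for illustration only. */
enum { ET_NMI = 2, ET_SW_INT = 4 };

/* Hypothetical, unpacked stand-in for the relevant FRED frame fields. */
struct fred_info {
    bool swint, nmi;
    unsigned int type, vector, insnlen;
};

/*
 * If the frame describes an INT $2, rewrite it to look like a real NMI and
 * return true so the caller can dispatch again.  Otherwise leave it alone.
 */
static bool rewrite_int2_as_nmi(struct fred_info *f)
{
    if ( f->type != ET_SW_INT || f->vector != 2 )
        return false;

    f->swint = false;
    f->nmi = true;
    f->type = ET_NMI;
    f->insnlen = 0;

    return true;
}
```

After the rewrite the frame no longer matches the software-interrupt case, so
a second pass through the dispatcher takes the NMI path exactly once.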
---
xen/arch/x86/hvm/vmx/vmx.c | 14 ++++++++++++--
xen/arch/x86/traps.c | 16 +++++++++++++++-
2 files changed, 27 insertions(+), 3 deletions(-)
diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c
index cb82d52ef035..577a5e2d59c6 100644
--- a/xen/arch/x86/hvm/vmx/vmx.c
+++ b/xen/arch/x86/hvm/vmx/vmx.c
@@ -4209,8 +4209,18 @@ void asmlinkage vmx_vmexit_handler(struct cpu_user_regs *regs)
((intr_info & INTR_INFO_INTR_TYPE_MASK) ==
MASK_INSR(X86_ET_NMI, INTR_INFO_INTR_TYPE_MASK)) )
{
- do_nmi(regs);
- enable_nmis();
+ /*
+ * If we exited because of an NMI, NMIs are blocked in hardware,
+ * but software is expected to invoke the handler.
+ *
+ * Use INT $2. Combined with the current state, it is the correct
+ * architectural state for the NMI handler, and the IRET on the
+ * way back out will unblock NMIs.
+ *
+ * In FRED mode, we can spot this trick and cause the ERETS to
+ * unblock NMIs too.
+ */
+ asm ("int $2");
}
break;
case EXIT_REASON_MCE_DURING_VMENTRY:
diff --git a/xen/arch/x86/traps.c b/xen/arch/x86/traps.c
index f58e6dcf13b7..16dd335cadb2 100644
--- a/xen/arch/x86/traps.c
+++ b/xen/arch/x86/traps.c
@@ -2285,8 +2285,22 @@ void asmlinkage entry_from_xen(struct cpu_user_regs *regs)
do_nmi(regs);
break;
- case X86_ET_HW_EXC:
case X86_ET_SW_INT:
+ if ( regs->fred_ss.vector == 2 )
+ {
+ /*
+ * Explicit request from the the VMExit handler. Rewrite the FRED
+ * frame to look like it was a real NMI, and go around again.
+ */
+ regs->fred_ss.swint = false;
+ regs->fred_ss.nmi = true;
+ regs->fred_ss.type = X86_ET_NMI;
+ regs->fred_ss.insnlen = 0;
+
+ return entry_from_xen(regs);
+ }
+ fallthrough;
+ case X86_ET_HW_EXC:
case X86_ET_PRIV_SW_EXC:
case X86_ET_SW_EXC:
switch ( regs->fred_ss.vector )
--
2.39.5
* Re: [PATCH 02/22] x86/msr: Rename wrmsr_ns() to wrmsrns(), and take 64bit value
2025-08-08 20:22 ` [PATCH 02/22] x86/msr: Rename wrmsr_ns() to wrmsrns(), and take 64bit value Andrew Cooper
@ 2025-08-11 6:36 ` Andrew Cooper
2025-08-12 8:08 ` Jan Beulich
0 siblings, 1 reply; 120+ messages in thread
From: Andrew Cooper @ 2025-08-11 6:36 UTC (permalink / raw)
To: Xen-devel; +Cc: Jan Beulich, Roger Pau Monné
On 08/08/2025 9:22 pm, Andrew Cooper wrote:
> In hindsight, having the wrapper name not be the instruction mnemonic was a
> poor choice. Also, PKS turns out to be quite rare in wanting a split value.
>
> Switch to using a single 64bit value in preparation for new users.
>
> No functional change.
>
> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
> ---
> CC: Jan Beulich <JBeulich@suse.com>
> CC: Roger Pau Monné <roger.pau@citrix.com>
> ---
> xen/arch/x86/include/asm/msr.h | 4 ++--
> xen/arch/x86/include/asm/prot-key.h | 4 ++--
> 2 files changed, 4 insertions(+), 4 deletions(-)
>
> diff --git a/xen/arch/x86/include/asm/msr.h b/xen/arch/x86/include/asm/msr.h
> index 4c4f18b3a54d..b6b85b04c3fd 100644
> --- a/xen/arch/x86/include/asm/msr.h
> +++ b/xen/arch/x86/include/asm/msr.h
> @@ -39,7 +39,7 @@ static inline void wrmsrl(unsigned int msr, uint64_t val)
> }
>
> /* Non-serialising WRMSR, when available. Falls back to a serialising WRMSR. */
> -static inline void wrmsr_ns(uint32_t msr, uint32_t lo, uint32_t hi)
> +static inline void wrmsrns(uint32_t msr, uint64_t val)
> {
> /*
> * WRMSR is 2 bytes. WRMSRNS is 3 bytes. Pad WRMSR with a redundant CS
> @@ -47,7 +47,7 @@ static inline void wrmsr_ns(uint32_t msr, uint32_t lo, uint32_t hi)
> */
> alternative_input(".byte 0x2e; wrmsr",
> ".byte 0x0f,0x01,0xc6", X86_FEATURE_WRMSRNS,
> - "c" (msr), "a" (lo), "d" (hi));
> + "c" (msr), "a" (val), "d" (val >> 32));
> }
It turns out this causes poor code generation for MSR_STAR.
I've adjusted it to:
@@ -39,8 +39,10 @@ static inline void wrmsrl(unsigned int msr, uint64_t val)
}
/* Non-serialising WRMSR, when available. Falls back to a serialising WRMSR. */
-static inline void wrmsr_ns(uint32_t msr, uint32_t lo, uint32_t hi)
+static inline void wrmsrns(uint32_t msr, uint64_t val)
{
+ uint32_t lo = val, hi = val >> 32;
+
/*
* WRMSR is 2 bytes. WRMSRNS is 3 bytes. Pad WRMSR with a redundant CS
* prefix to avoid a trailing NOP.
which stops the compiler from loading the high half of %rax too.
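For illustration, the split can be modelled outside of the inline asm as
below. The struct and function names are hypothetical; the point is only that
each half is an explicit 32-bit value before it reaches a register constraint.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical holder for the two halves handed to WRMSR/WRMSRNS. */
struct msr_halves {
    uint32_t lo, hi;
};

/* Split a 64-bit MSR value into the %eax (low) and %edx (high) halves. */
static struct msr_halves split_msr_value(uint64_t val)
{
    struct msr_halves h = {
        .lo = (uint32_t)val,
        .hi = (uint32_t)(val >> 32),
    };

    return h;
}
```

With 32-bit operands, a compile-time constant such as MSR_STAR's value can be
materialised as two 32-bit immediates rather than one 64-bit load.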
~Andrew
* Re: [PATCH 10/22] x86/traps: Move subarch_percpu_traps_init() into traps-setup.c
2025-08-08 20:23 ` [PATCH 10/22] x86/traps: Move subarch_percpu_traps_init() " Andrew Cooper
@ 2025-08-11 8:17 ` Andrew Cooper
2025-08-12 9:52 ` Jan Beulich
0 siblings, 1 reply; 120+ messages in thread
From: Andrew Cooper @ 2025-08-11 8:17 UTC (permalink / raw)
To: Xen-devel; +Cc: Jan Beulich, Roger Pau Monné
On 08/08/2025 9:23 pm, Andrew Cooper wrote:
> ... along with the supporting functions. Switch to Xen coding style, and make
> static as there are no external callers.
>
> Rename to legacy_syscall_init() as a more accurate name.
>
> No functional change.
>
> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
> ---
> CC: Jan Beulich <JBeulich@suse.com>
> CC: Roger Pau Monné <roger.pau@citrix.com>
> ---
> xen/arch/x86/include/asm/system.h | 2 -
> xen/arch/x86/traps-setup.c | 97 ++++++++++++++++++++++++++++++-
> xen/arch/x86/x86_64/traps.c | 92 -----------------------------
> 3 files changed, 95 insertions(+), 96 deletions(-)
>
> diff --git a/xen/arch/x86/include/asm/system.h b/xen/arch/x86/include/asm/system.h
> index 3cdc56e4ba6d..6c2800d8158d 100644
> --- a/xen/arch/x86/include/asm/system.h
> +++ b/xen/arch/x86/include/asm/system.h
> @@ -256,6 +256,4 @@ static inline int local_irq_is_enabled(void)
> #define BROKEN_ACPI_Sx 0x0001
> #define BROKEN_INIT_AFTER_S1 0x0002
>
> -void subarch_percpu_traps_init(void);
> -
> #endif
> diff --git a/xen/arch/x86/traps-setup.c b/xen/arch/x86/traps-setup.c
> index 13b8fcf0ba51..fbae7072c292 100644
> --- a/xen/arch/x86/traps-setup.c
> +++ b/xen/arch/x86/traps-setup.c
> @@ -2,13 +2,15 @@
> /*
> * Configuration of event handling for all CPUs.
> */
> +#include <xen/domain_page.h>
> #include <xen/init.h>
> #include <xen/param.h>
>
> +#include <asm/endbr.h>
> #include <asm/idt.h>
> #include <asm/msr.h>
> #include <asm/shstk.h>
> -#include <asm/system.h>
> +#include <asm/stubs.h>
> #include <asm/traps.h>
>
> DEFINE_PER_CPU_READ_MOSTLY(idt_entry_t *, idt);
> @@ -19,6 +21,8 @@ static bool __initdata opt_ler;
> boolean_param("ler", opt_ler);
>
> void nocall entry_PF(void);
> +void nocall lstar_enter(void);
> +void nocall cstar_enter(void);
>
> /*
> * Sets up system tables and descriptors for IDT devliery.
> @@ -138,6 +142,95 @@ static void load_system_tables(void)
> BUG_ON(stack_bottom & 15);
> }
>
> +static unsigned int write_stub_trampoline(
> + unsigned char *stub, unsigned long stub_va,
> + unsigned long stack_bottom, unsigned long target_va)
> +{
> + unsigned char *p = stub;
> +
> + if ( cpu_has_xen_ibt )
> + {
> + place_endbr64(p);
> + p += 4;
> + }
> +
> + /* Store guest %rax into %ss slot */
> + /* movabsq %rax, stack_bottom - 8 */
> + *p++ = 0x48;
> + *p++ = 0xa3;
> + *(uint64_t *)p = stack_bottom - 8;
> + p += 8;
> +
> + /* Store guest %rsp in %rax */
> + /* movq %rsp, %rax */
> + *p++ = 0x48;
> + *p++ = 0x89;
> + *p++ = 0xe0;
> +
> + /* Switch to Xen stack */
> + /* movabsq $stack_bottom - 8, %rsp */
> + *p++ = 0x48;
> + *p++ = 0xbc;
> + *(uint64_t *)p = stack_bottom - 8;
> + p += 8;
> +
> + /* jmp target_va */
> + *p++ = 0xe9;
> + *(int32_t *)p = target_va - (stub_va + (p - stub) + 4);
> + p += 4;
> +
> + /* Round up to a multiple of 16 bytes. */
> + return ROUNDUP(p - stub, 16);
> +}
> +
> +static void legacy_syscall_init(void)
> +{
> + unsigned long stack_bottom = get_stack_bottom();
> + unsigned long stub_va = this_cpu(stubs.addr);
> + unsigned char *stub_page;
> + unsigned int offset;
> +
> + /* No PV guests? No need to set up SYSCALL/SYSENTER infrastructure. */
> + if ( !IS_ENABLED(CONFIG_PV) )
> + return;
> +
> + stub_page = map_domain_page(_mfn(this_cpu(stubs.mfn)));
> +
> + /*
> + * Trampoline for SYSCALL entry from 64-bit mode. The VT-x HVM vcpu
> + * context switch logic relies on the SYSCALL trampoline being at the
> + * start of the stubs.
> + */
> + wrmsrl(MSR_LSTAR, stub_va);
> + offset = write_stub_trampoline(stub_page + (stub_va & ~PAGE_MASK),
> + stub_va, stack_bottom,
> + (unsigned long)lstar_enter);
> + stub_va += offset;
> +
> + if ( cpu_has_sep )
> + {
> + /* SYSENTER entry. */
> + wrmsrl(MSR_IA32_SYSENTER_ESP, stack_bottom);
> + wrmsrl(MSR_IA32_SYSENTER_EIP, (unsigned long)sysenter_entry);
> + wrmsr(MSR_IA32_SYSENTER_CS, __HYPERVISOR_CS, 0);
> + }
> +
> + /* Trampoline for SYSCALL entry from compatibility mode. */
> + wrmsrl(MSR_CSTAR, stub_va);
> + offset += write_stub_trampoline(stub_page + (stub_va & ~PAGE_MASK),
> + stub_va, stack_bottom,
> + (unsigned long)cstar_enter);
> +
> + /* Don't consume more than half of the stub space here. */
> + ASSERT(offset <= STUB_BUF_SIZE / 2);
> +
> + unmap_domain_page(stub_page);
> +
> + /* Common SYSCALL parameters. */
> + wrmsrl(MSR_STAR, XEN_MSR_STAR);
> + wrmsrl(MSR_SYSCALL_MASK, XEN_SYSCALL_MASK);
> +}
These want adjusting to use wrmsrns(), similarly to the previous patch.
Fixed locally.
~Andrew
* Re: [PATCH 21/22] x86/traps: Introduce FRED entrypoints
2025-08-08 20:23 ` [PATCH 21/22] x86/traps: Introduce FRED entrypoints Andrew Cooper
@ 2025-08-11 11:38 ` Andrew Cooper
2025-08-14 15:57 ` Jan Beulich
2025-08-18 10:03 ` Jan Beulich
2 siblings, 0 replies; 120+ messages in thread
From: Andrew Cooper @ 2025-08-11 11:38 UTC (permalink / raw)
To: Xen-devel; +Cc: Jan Beulich, Roger Pau Monné
On 08/08/2025 9:23 pm, Andrew Cooper wrote:
> diff --git a/xen/arch/x86/traps.c b/xen/arch/x86/traps.c
> index 270b93ed623e..e67a428e4362 100644
> --- a/xen/arch/x86/traps.c
> +++ b/xen/arch/x86/traps.c
> @@ -2181,6 +2240,94 @@ void asmlinkage check_ist_exit(const struct cpu_user_regs *regs, bool ist_exit)
> }
> #endif
>
> +void asmlinkage entry_from_pv(struct cpu_user_regs *regs)
> +{
> + /* Copy fred_ss.vector into entry_vector as IDT delivery would have done. */
> + regs->entry_vector = regs->fred_ss.vector;
> +
> + switch ( regs->fred_ss.type )
> + {
> + case X86_ET_EXT_INTR:
> + do_IRQ(regs);
> + break;
> +
> + case X86_ET_NMI:
> + do_nmi(regs);
> + break;
> +
> + case X86_ET_HW_EXC:
> + case X86_ET_SW_INT:
> + case X86_ET_PRIV_SW_EXC:
> + case X86_ET_SW_EXC:
> + goto fatal;
> +
> + default:
> + goto fatal;
> + }
> +
> + return;
> +
> + fatal:
> + fatal_trap(regs, false);
> +}
> +
> +void asmlinkage entry_from_xen(struct cpu_user_regs *regs)
> +{
> + /* Copy fred_ss.vector into entry_vector as IDT delivery would have done. */
> + regs->entry_vector = regs->fred_ss.vector;
> +
> + switch ( regs->fred_ss.type )
> + {
> + case X86_ET_EXT_INTR:
> + do_IRQ(regs);
> + break;
> +
> + case X86_ET_NMI:
> + do_nmi(regs);
> + break;
> +
> + case X86_ET_HW_EXC:
> + case X86_ET_SW_INT:
> + case X86_ET_PRIV_SW_EXC:
> + case X86_ET_SW_EXC:
> + switch ( regs->fred_ss.vector )
> + {
> + case X86_EXC_PF: do_page_fault(regs); break;
> + case X86_EXC_GP: do_general_protection(regs); break;
> + case X86_EXC_UD: do_invalid_op(regs); break;
> + case X86_EXC_NM: do_device_not_available(regs); break;
> + case X86_EXC_BP: do_int3(regs); break;
> + case X86_EXC_DB: do_debug(regs); break;
> + case X86_EXC_DF: do_double_fault(regs); break;
> +
> + case X86_EXC_DE:
> + case X86_EXC_OF:
> + case X86_EXC_BR:
> + case X86_EXC_NP:
> + case X86_EXC_SS:
> + case X86_EXC_MF:
> + case X86_EXC_AC:
> + case X86_EXC_XM:
> + do_trap(regs);
> + break;
> +
> + case X86_EXC_CP: do_entry_CP(regs); break;
> +
> + default:
> + goto fatal;
> + }
> + break;
> +
> + default:
> + goto fatal;
> + }
> +
> + return;
> +
> + fatal:
> + fatal_trap(regs, false);
> +}
Having started work on the PV support, I think this patch needs to
change somewhat.
I've split #DB and #PF to have separate IDT prologues, which in turn
gives us uniform handling of IRQ re-enabling for synchronous actions.
But, async still needs special handling.
I think we want something that looks more like:
switch ( regs->fred_ss.type )
{
case X86_ET_EXT_INTR: return do_IRQ(regs);
case X86_ET_NMI: return do_nmi(regs);
case X86_ET_HW_EXC:
if ( regs->fred_ss.vector == X86_EXC_MC )
return do_machine_check(regs);
break;
}
if ( regs->eflags & X86_EFLAGS_IF ) // From Xen only
local_irq_enable(); // From both
switch ( regs->fred_ss.type )
Either way, it's probably not worth focusing too much on how the C in
this patch looks for now.
~Andrew
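
For illustration, the two-stage dispatch proposed above can be sketched as a
self-contained classifier. This is only a sketch under assumptions: the enum,
struct and macro names are hypothetical stand-ins rather than Xen's
definitions, the real handlers are replaced by return codes, and it models the
from-Xen variant, where IRQs are re-enabled only if the interrupted context
had them enabled.

```c
#include <assert.h>

/* Hypothetical stand-ins; the real definitions live in Xen's headers. */
enum event_type { ET_EXT_INTR, ET_NMI, ET_HW_EXC, ET_SW_INT };

#define EXC_PF    14          /* #PF vector */
#define EXC_MC    18          /* #MC vector */
#define EFLAGS_IF (1u << 9)   /* EFLAGS.IF  */

struct regs_sketch {
    enum event_type type;
    unsigned int vector;
    unsigned int eflags;
};

/* What the entry path would do next, in place of calling real handlers. */
enum action {
    ACT_IRQ,            /* do_IRQ()-like, IRQs stay off            */
    ACT_NMI,            /* do_nmi()-like, IRQs stay off            */
    ACT_MCE,            /* do_machine_check()-like, IRQs stay off  */
    ACT_SYNC_IRQS_ON,   /* synchronous event, IRQs re-enabled      */
    ACT_SYNC_IRQS_OFF,  /* synchronous event, IRQs left disabled   */
};

static enum action classify_event(const struct regs_sketch *regs)
{
    /* Stage 1: asynchronous events are peeled off before touching IRQs. */
    switch ( regs->type )
    {
    case ET_EXT_INTR:
        return ACT_IRQ;

    case ET_NMI:
        return ACT_NMI;

    case ET_HW_EXC:
        if ( regs->vector == EXC_MC )
            return ACT_MCE;
        break;

    default:
        break;
    }

    /* Stage 2: everything left is synchronous and handled uniformly. */
    return (regs->eflags & EFLAGS_IF) ? ACT_SYNC_IRQS_ON : ACT_SYNC_IRQS_OFF;
}
```

The second switch on the event type would then follow the IRQ decision, so
none of the synchronous handlers needs its own IRQ re-enabling logic.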
* Re: [PATCH 01/22] x86/msr: Rename MSR_INTERRUPT_SSP_TABLE to MSR_ISST
2025-08-08 20:22 ` [PATCH 01/22] x86/msr: Rename MSR_INTERRUPT_SSP_TABLE to MSR_ISST Andrew Cooper
@ 2025-08-12 8:06 ` Jan Beulich
2025-08-13 9:02 ` Andrew Cooper
0 siblings, 1 reply; 120+ messages in thread
From: Jan Beulich @ 2025-08-12 8:06 UTC (permalink / raw)
To: Andrew Cooper; +Cc: Roger Pau Monné, Xen-devel
On 08.08.2025 22:22, Andrew Cooper wrote:
> The name AMD chose is rather more concise.
>
> No functional change.
>
> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
I'm okay with the (much) shorter name, so:
Acked-by: Jan Beulich <jbeulich@suse.com>
But I can't make the connection to AMD: It's INTERRUPT_SST_TABLE (figure)
or INTERRUPT_SSP_TABLE (text) there, afaics. And ISST_ADDR in yet further
places.
Jan
* Re: [PATCH 02/22] x86/msr: Rename wrmsr_ns() to wrmsrns(), and take 64bit value
2025-08-11 6:36 ` Andrew Cooper
@ 2025-08-12 8:08 ` Jan Beulich
0 siblings, 0 replies; 120+ messages in thread
From: Jan Beulich @ 2025-08-12 8:08 UTC (permalink / raw)
To: Andrew Cooper; +Cc: Roger Pau Monné, Xen-devel
On 11.08.2025 08:36, Andrew Cooper wrote:
> On 08/08/2025 9:22 pm, Andrew Cooper wrote:
>> In hindsight, having the wrapper name not be the instruction mnemonic was a
>> poor choice. Also, PKS turns out to be quite rare in wanting a split value.
>>
>> Switch to using a single 64bit value in preparation for new users.
>>
>> No functional change.
>>
>> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
>> ---
>> CC: Jan Beulich <JBeulich@suse.com>
>> CC: Roger Pau Monné <roger.pau@citrix.com>
>> ---
>> xen/arch/x86/include/asm/msr.h | 4 ++--
>> xen/arch/x86/include/asm/prot-key.h | 4 ++--
>> 2 files changed, 4 insertions(+), 4 deletions(-)
>>
>> diff --git a/xen/arch/x86/include/asm/msr.h b/xen/arch/x86/include/asm/msr.h
>> index 4c4f18b3a54d..b6b85b04c3fd 100644
>> --- a/xen/arch/x86/include/asm/msr.h
>> +++ b/xen/arch/x86/include/asm/msr.h
>> @@ -39,7 +39,7 @@ static inline void wrmsrl(unsigned int msr, uint64_t val)
>> }
>>
>> /* Non-serialising WRMSR, when available. Falls back to a serialising WRMSR. */
>> -static inline void wrmsr_ns(uint32_t msr, uint32_t lo, uint32_t hi)
>> +static inline void wrmsrns(uint32_t msr, uint64_t val)
>> {
>> /*
>> * WRMSR is 2 bytes. WRMSRNS is 3 bytes. Pad WRMSR with a redundant CS
>> @@ -47,7 +47,7 @@ static inline void wrmsr_ns(uint32_t msr, uint32_t lo, uint32_t hi)
>> */
>> alternative_input(".byte 0x2e; wrmsr",
>> ".byte 0x0f,0x01,0xc6", X86_FEATURE_WRMSRNS,
>> - "c" (msr), "a" (lo), "d" (hi));
>> + "c" (msr), "a" (val), "d" (val >> 32));
>> }
>
> It turns out this causes poor code generation for MSR_STAR.
>
> I've adjusted it to:
>
> @@ -39,8 +39,10 @@ static inline void wrmsrl(unsigned int msr, uint64_t val)
> }
>
> /* Non-serialising WRMSR, when available. Falls back to a serialising WRMSR. */
> -static inline void wrmsr_ns(uint32_t msr, uint32_t lo, uint32_t hi)
> +static inline void wrmsrns(uint32_t msr, uint64_t val)
> {
> + uint32_t lo = val, hi = val >> 32;
> +
> /*
> * WRMSR is 2 bytes. WRMSRNS is 3 bytes. Pad WRMSR with a redundant CS
> * prefix to avoid a trailing NOP.
>
>
> which stops the compiler from loading the high half of %rax too.
Acked-by: Jan Beulich <jbeulich@suse.com>
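
For illustration, the split performed by the adjusted wrapper can be sketched
in plain C. This is a minimal sketch: struct msr_halves and split_msr_value()
are hypothetical names, and no WRMSR is executed. The point is only that
explicit 32-bit locals give the "a" and "d" operands 32-bit types, so the
compiler has no reason to materialise the upper half of %rax.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper type: the two halves WRMSR expects in EDX:EAX. */
struct msr_halves {
    uint32_t lo;   /* would be passed via "a" (EAX) */
    uint32_t hi;   /* would be passed via "d" (EDX) */
};

static inline struct msr_halves split_msr_value(uint64_t val)
{
    struct msr_halves h = {
        .lo = (uint32_t)val,          /* truncate to the low 32 bits */
        .hi = (uint32_t)(val >> 32),  /* high 32 bits                */
    };

    return h;
}
```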
* Re: [PATCH 03/22] x86/traps: Drop incorrect BUILD_BUG_ON() and comment in load_system_tables()
2025-08-08 20:22 ` [PATCH 03/22] x86/traps: Drop incorrect BUILD_BUG_ON() and comment in load_system_tables() Andrew Cooper
@ 2025-08-12 8:11 ` Jan Beulich
2025-08-13 9:40 ` Andrew Cooper
0 siblings, 1 reply; 120+ messages in thread
From: Jan Beulich @ 2025-08-12 8:11 UTC (permalink / raw)
To: Andrew Cooper; +Cc: Roger Pau Monné, Xen-devel
On 08.08.2025 22:22, Andrew Cooper wrote:
> This was added erroneously by me.
>
> Hardware task switching does demand a TSS of at least 0x67 bytes, but that's
> not relevant in 64bit, and not relevant for Xen since commit
> 5d1181a5ea5e ("xen: Remove x86_32 build target.") in 2012.
>
> We already load a 0-length TSS in early_traps_init() demonstrating that it's
> possible.
>
> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
> ---
> CC: Jan Beulich <JBeulich@suse.com>
> CC: Roger Pau Monné <roger.pau@citrix.com>
> ---
> xen/arch/x86/cpu/common.c | 2 --
> 1 file changed, 2 deletions(-)
>
> diff --git a/xen/arch/x86/cpu/common.c b/xen/arch/x86/cpu/common.c
> index f6ec5c9df522..cdc41248d4e9 100644
> --- a/xen/arch/x86/cpu/common.c
> +++ b/xen/arch/x86/cpu/common.c
> @@ -936,8 +936,6 @@ void load_system_tables(void)
> wrmsrl(MSR_ISST, (unsigned long)ist_ssp);
> }
>
> - BUILD_BUG_ON(sizeof(*tss) <= 0x67); /* Mandated by the architecture. */
> -
> _set_tssldt_desc(gdt + TSS_ENTRY, (unsigned long)tss,
> sizeof(*tss) - 1, SYS_DESC_tss_avail);
> if ( IS_ENABLED(CONFIG_PV32) )
Well, the comment is wrong. Whether the BUILD_BUG_ON() itself is also wrong
depends on our intentions with the structure. Don't we need it to be that
size for everything (incl I/O bitmap) to work correctly elsewhere?
Jan
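
For reference, the 0x67 figure is the limit (size minus one) of the
architectural 64-bit TSS. A minimal sketch, assuming a packed layout that
follows the hardware format; struct tss64_sketch and tss_limit() are
hypothetical names, not Xen's struct tss64, and the packed attribute is a
GCC/Clang extension:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sketch of the 64-bit TSS hardware layout (104 bytes, no I/O bitmap). */
struct tss64_sketch {
    uint32_t rsvd0;
    uint64_t rsp0, rsp1, rsp2;
    uint64_t rsvd1;
    uint64_t ist[7];
    uint64_t rsvd2;
    uint16_t rsvd3;
    uint16_t bitmap;            /* I/O bitmap base offset */
} __attribute__((packed));

/* The I/O bitmap offset field is the last two bytes of the structure. */
static_assert(offsetof(struct tss64_sketch, bitmap) == 0x66,
              "bitmap field must sit at offset 0x66");
static_assert(sizeof(struct tss64_sketch) == 0x68,
              "64-bit TSS is 0x68 bytes");

/* Descriptor limits are inclusive, hence size - 1. */
static inline uint32_t tss_limit(void)
{
    return sizeof(struct tss64_sketch) - 1;
}
```

With such a layout, a check of the form sizeof(*tss) <= 0x67 can only fire if
the structure definition itself shrinks.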
* Re: [PATCH 04/22] x86/idt: Minor improvements to _update_gate_addr_lower()
2025-08-08 20:22 ` [PATCH 04/22] x86/idt: Minor improvements to _update_gate_addr_lower() Andrew Cooper
@ 2025-08-12 8:16 ` Jan Beulich
2025-08-13 9:48 ` Andrew Cooper
0 siblings, 1 reply; 120+ messages in thread
From: Jan Beulich @ 2025-08-12 8:16 UTC (permalink / raw)
To: Andrew Cooper; +Cc: Roger Pau Monné, Xen-devel
On 08.08.2025 22:22, Andrew Cooper wrote:
> After some experimentation, using .a/.b makes far better logic than using the
> named fields, as both Clang and GCC spill idte to the stack when named fields
> are used.
>
> GCC seems to do something very daft for the addr1 field. It takes addr,
> shifts it by 32, then ANDs with 0xffff000000000000UL, which requires
> manifesting a MOVABS.
>
> Clang follows the C, whereby it ANDs with $imm32, then shifts, avoiding the
> MOVABS entirely.
>
> No functional change.
>
> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
albeit I have to admit that I'm not quite happy about ...
> --- a/xen/arch/x86/include/asm/idt.h
> +++ b/xen/arch/x86/include/asm/idt.h
> @@ -92,15 +92,16 @@ static inline void _set_gate_lower(idt_entry_t *gate, unsigned long type,
> * Update the lower half handler of an IDT entry, without changing any other
> * configuration.
> */
> -static inline void _update_gate_addr_lower(idt_entry_t *gate, void *addr)
> +static inline void _update_gate_addr_lower(idt_entry_t *gate, void *_addr)
> {
> + unsigned long addr = (unsigned long)_addr;
> + unsigned int addr1 = addr & 0xffff0000U; /* GCC force better codegen. */
> idt_entry_t idte;
> - idte.a = gate->a;
>
> - idte.b = ((unsigned long)(addr) >> 32);
> - idte.a &= 0x0000FFFFFFFF0000ULL;
> - idte.a |= (((unsigned long)(addr) & 0xFFFF0000UL) << 32) |
> - ((unsigned long)(addr) & 0xFFFFUL);
> + idte.b = addr >> 32;
> + idte.a = gate->a & 0x0000ffffffff0000UL;
> + idte.a |= (unsigned long)addr1 << 32;
... the cast here. Yet perhaps gcc still generates a MOVABS when you make
addr1 unsigned long?
As to the comment next to the variable declaration: Could I talk you into
adding a colon after "GCC"? Without one, the comment reads somewhat
ambiguously to me.
Jan
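
For illustration, the old and new address packing can be compared outside of
Xen. This is a sketch under assumptions: gate_sketch_t and both function
names are hypothetical, and the final OR of the low 16 address bits in the new
form is assumed, because the quoted hunk is cut off before that point.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for a 16-byte IDT entry viewed as two words. */
typedef struct { uint64_t a, b; } gate_sketch_t;

/* Old form: mask first, then shift, on 64-bit quantities throughout. */
static gate_sketch_t update_addr_old(gate_sketch_t gate, uint64_t addr)
{
    gate_sketch_t e;

    e.b  = addr >> 32;
    e.a  = gate.a & 0x0000ffffffff0000ULL;   /* keep selector + attributes */
    e.a |= ((addr & 0xffff0000ULL) << 32) | (addr & 0xffffULL);

    return e;
}

/* New form: a 32-bit temporary holds address bits 16-31 before the shift. */
static gate_sketch_t update_addr_new(gate_sketch_t gate, uint64_t addr)
{
    uint32_t addr1 = addr & 0xffff0000U;
    gate_sketch_t e;

    e.b  = addr >> 32;
    e.a  = gate.a & 0x0000ffffffff0000ULL;
    e.a |= (uint64_t)addr1 << 32;            /* address bits 16-31 -> 48-63 */
    e.a |= addr & 0xffffULL;                 /* assumed: low 16 bits        */

    return e;
}
```

Both forms produce the same words; the difference discussed in the patch is
purely in the instructions the compiler emits.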
* Re: [PATCH 05/22] x86/traps: Rename early_traps_init() to bsp_early_traps_init()
2025-08-08 20:22 ` [PATCH 05/22] x86/traps: Rename early_traps_init() to bsp_early_traps_init() Andrew Cooper
@ 2025-08-12 8:17 ` Jan Beulich
0 siblings, 0 replies; 120+ messages in thread
From: Jan Beulich @ 2025-08-12 8:17 UTC (permalink / raw)
To: Andrew Cooper; +Cc: Roger Pau Monné, Xen-devel
On 08.08.2025 22:22, Andrew Cooper wrote:
> We're going to want to introduce an AP version shortly.
>
> No functional change.
>
> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
* Re: [PATCH 06/22] x86/traps: Introduce bsp_traps_reinit()
2025-08-08 20:22 ` [PATCH 06/22] x86/traps: Introduce bsp_traps_reinit() Andrew Cooper
@ 2025-08-12 8:19 ` Jan Beulich
2025-08-13 9:51 ` Andrew Cooper
0 siblings, 1 reply; 120+ messages in thread
From: Jan Beulich @ 2025-08-12 8:19 UTC (permalink / raw)
To: Andrew Cooper; +Cc: Roger Pau Monné, Xen-devel
On 08.08.2025 22:22, Andrew Cooper wrote:
> ... to abstract away updating the references to the old BSP stack.
>
> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
If it helps later:
Acked-by: Jan Beulich <jbeulich@suse.com>
with ...
> --- a/xen/arch/x86/traps-setup.c
> +++ b/xen/arch/x86/traps-setup.c
> @@ -107,6 +107,15 @@ void __init traps_init(void)
> percpu_traps_init();
> }
>
> +/*
> + * Re-initialise all state referencing the early-boot stack.
> + */
> +void bsp_traps_reinit(void)
... __init added here.
Jan
* Re: [PATCH 07/22] x86/spec-ctrl: Rework init_shadow_spec_ctrl_state() to take an info pointer
2025-08-08 20:22 ` [PATCH 07/22] x86/spec-ctrl: Rework init_shadow_spec_ctrl_state() to take an info pointer Andrew Cooper
@ 2025-08-12 8:27 ` Jan Beulich
2025-08-13 10:35 ` Andrew Cooper
0 siblings, 1 reply; 120+ messages in thread
From: Jan Beulich @ 2025-08-12 8:27 UTC (permalink / raw)
To: Andrew Cooper; +Cc: Roger Pau Monné, Xen-devel
On 08.08.2025 22:22, Andrew Cooper wrote:
> We're going to want to reuse it for a remote stack shortly.
Are we? From the titles of subsequent patches I can't judge where that would
be, so it's hard to peek ahead. And iirc earlier on it was a conscious decision
to only ever run this locally.
> No functional change.
>
> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Nevertheless, trusting that you have a good reason:
Acked-by: Jan Beulich <jbeulich@suse.com>
Jan
* Re: [PATCH 08/22] x86/traps: Introduce ap_early_traps_init() and set up exception handling earlier
2025-08-08 20:23 ` [PATCH 08/22] x86/traps: Introduce ap_early_traps_init() and set up exception handling earlier Andrew Cooper
@ 2025-08-12 8:41 ` Jan Beulich
2025-08-13 11:13 ` Andrew Cooper
2025-08-14 18:07 ` [PATCH v1.1 08/22] x86/traps: Introduce percpu_early_traps_init() " Andrew Cooper
1 sibling, 1 reply; 120+ messages in thread
From: Jan Beulich @ 2025-08-12 8:41 UTC (permalink / raw)
To: Andrew Cooper; +Cc: Roger Pau Monné, Xen-devel
On 08.08.2025 22:23, Andrew Cooper wrote:
> --- a/xen/arch/x86/acpi/wakeup_prot.S
> +++ b/xen/arch/x86/acpi/wakeup_prot.S
> @@ -63,6 +63,9 @@ LABEL(s3_resume)
> pushq %rax
> lretq
> 1:
> + /* Set up early exceptions and CET before entering C properly. */
> + call ap_early_traps_init
But this is the BSP?
> --- a/xen/arch/x86/smpboot.c
> +++ b/xen/arch/x86/smpboot.c
> @@ -327,12 +327,7 @@ void asmlinkage start_secondary(void)
> struct cpu_info *info = get_cpu_info();
> unsigned int cpu = smp_processor_id();
>
> - /* Critical region without IDT or TSS. Any fault is deadly! */
> -
> - set_current(idle_vcpu[cpu]);
> - this_cpu(curr_vcpu) = idle_vcpu[cpu];
> rdmsrl(MSR_EFER, this_cpu(efer));
> - init_shadow_spec_ctrl_state(info);
>
> /*
> * Just as during early bootstrap, it is convenient here to disable
> @@ -352,14 +347,6 @@ void asmlinkage start_secondary(void)
> */
> spin_debug_disable();
>
> - get_cpu_info()->use_pv_cr3 = false;
> - get_cpu_info()->xen_cr3 = 0;
> - get_cpu_info()->pv_cr3 = 0;
> -
> - load_system_tables();
> -
> - /* Full exception support from here on in. */
> -
> if ( cpu_has_pks )
> wrpkrs_and_cache(0); /* Must be before setting CR4.PKS */
>
> @@ -1064,8 +1051,12 @@ static int cpu_smpboot_alloc(unsigned int cpu)
> goto out;
>
> info = get_cpu_info_from_stack((unsigned long)stack_base[cpu]);
> + memset(info, 0, sizeof(*info));
Why do we suddenly need this? Or is this just out of an abundance of
caution (while making the individual ->*_cr3 writes unnecessary)?
> + init_shadow_spec_ctrl_state(info);
May I suggest to move this further down a little, at least ...
> info->processor_id = cpu;
... past here? Just in case other values in the struct may be needed
in the function at some point.
> info->per_cpu_offset = __per_cpu_offset[cpu];
> + info->current_vcpu = idle_vcpu[cpu];
To be able to spot this, I think it wants /* set_current() */ or some
such.
> + per_cpu(curr_vcpu, cpu) = idle_vcpu[cpu];
It's a little odd to do this early (and remotely), but it looks all fine
with how the variable is currently used.
Jan
* Re: [PATCH 09/22] x86/traps: Move load_system_tables() into traps-setup.c
2025-08-08 20:23 ` [PATCH 09/22] x86/traps: Move load_system_tables() into traps-setup.c Andrew Cooper
@ 2025-08-12 9:19 ` Jan Beulich
2025-08-13 11:25 ` Andrew Cooper
2025-08-12 9:43 ` Nicola Vetrini
1 sibling, 1 reply; 120+ messages in thread
From: Jan Beulich @ 2025-08-12 9:19 UTC (permalink / raw)
To: Andrew Cooper; +Cc: Roger Pau Monné, Xen-devel
On 08.08.2025 22:23, Andrew Cooper wrote:
> Switch it to Xen coding style and fix MISRA violations.
Those were all ul -> UL suffix transformations, afaics?
> Make it static as
> there are no external callers now.
>
> Since commit a35816b5cae8 ("x86/traps: Introduce early_traps_init() and
> simplify setup"), load_system_tables() is called later on the BSP, so the
> SYS_STATE_early_boot check can be dropped from the safety BUG_ON().
>
> Move the BUILD_BUG_ON() into build_assertions(),
I'm not quite convinced of this move - having the related BUILD_BUG_ON()
and BUG_ON() next to each other would seem better to me. (Same would
apply to the TSS size related BUILD_BUG_ON(), if in the earlier patch
we ended up agreeing that only the comment wants dropping there.)
> @@ -139,3 +258,16 @@ void asmlinkage ap_early_traps_init(void)
> {
> load_system_tables();
> }
> +
> +static void __init __maybe_unused build_assertions(void)
> +{
> + /*
> + * This is best-effort (it doesn't cover some padding corner cases), but
> + * is preforable to hitting the check at boot time.
Nit: "preferable"
Jan
* Re: [PATCH 09/22] x86/traps: Move load_system_tables() into traps-setup.c
2025-08-08 20:23 ` [PATCH 09/22] x86/traps: Move load_system_tables() into traps-setup.c Andrew Cooper
2025-08-12 9:19 ` Jan Beulich
@ 2025-08-12 9:43 ` Nicola Vetrini
2025-08-13 11:36 ` Andrew Cooper
1 sibling, 1 reply; 120+ messages in thread
From: Nicola Vetrini @ 2025-08-12 9:43 UTC (permalink / raw)
To: Andrew Cooper; +Cc: Xen-devel, Jan Beulich, Roger Pau Monné
On 2025-08-08 22:23, Andrew Cooper wrote:
> Switch it to Xen coding style and fix MISRA violations. Make it static
> as
> there are no external callers now.
>
> Since commit a35816b5cae8 ("x86/traps: Introduce early_traps_init() and
> simplify setup"), load_system_tables() is called later on the BSP, so
> the
> SYS_STATE_early_boot check can be dropped from the safety BUG_ON().
>
> Move the BUILD_BUG_ON() into build_assertions(), and introduce an
> endof_field() helper to make the expression clearer.
>
> Swap wrmsrl(MSR_ISST, ...) for wrmsrns(). No serialisation is needed
> at this
> point.
>
> No functional change.
>
> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
> ---
> CC: Jan Beulich <JBeulich@suse.com>
> CC: Roger Pau Monné <roger.pau@citrix.com>
> ---
> xen/arch/x86/cpu/common.c | 118 --------------------------
> xen/arch/x86/include/asm/system.h | 1 -
> xen/arch/x86/traps-setup.c | 132 ++++++++++++++++++++++++++++++
> xen/include/xen/macros.h | 2 +
> 4 files changed, 134 insertions(+), 119 deletions(-)
>
> diff --git a/xen/arch/x86/cpu/common.c b/xen/arch/x86/cpu/common.c
> index cdc41248d4e9..da05015578aa 100644
> --- a/xen/arch/x86/cpu/common.c
> +++ b/xen/arch/x86/cpu/common.c
> @@ -843,124 +843,6 @@ void print_cpu_info(unsigned int cpu)
>
> static cpumask_t cpu_initialized;
>
> -/*
> - * Sets up system tables and descriptors.
> - *
> - * - Sets up TSS with stack pointers, including ISTs
> - * - Inserts TSS selector into regular and compat GDTs
> - * - Loads GDT, IDT, TR then null LDT
> - * - Sets up IST references in the IDT
> - */
> -void load_system_tables(void)
> -{
> - unsigned int i, cpu = smp_processor_id();
> - unsigned long stack_bottom = get_stack_bottom(),
> - stack_top = stack_bottom & ~(STACK_SIZE - 1);
> - /*
> - * NB: define tss_page as a local variable because clang 3.5 doesn't
> - * support using ARRAY_SIZE against per-cpu variables.
> - */
> - struct tss_page *tss_page = &this_cpu(tss_page);
> - idt_entry_t *idt = this_cpu(idt);
> -
> - /* The TSS may be live. Disuade any clever optimisations. */
> - volatile struct tss64 *tss = &tss_page->tss;
> - seg_desc_t *gdt =
> - this_cpu(gdt) - FIRST_RESERVED_GDT_ENTRY;
> -
> - const struct desc_ptr gdtr = {
> - .base = (unsigned long)gdt,
> - .limit = LAST_RESERVED_GDT_BYTE,
> - };
> - const struct desc_ptr idtr = {
> - .base = (unsigned long)idt,
> - .limit = sizeof(bsp_idt) - 1,
> - };
> -
> - /*
> - * Set up the TSS. Warning - may be live, and the NMI/#MC must
> remain
> - * valid on every instruction boundary. (Note: these are all
> - * semantically ACCESS_ONCE() due to tss's volatile qualifier.)
> - *
> - * rsp0 refers to the primary stack. #MC, NMI, #DB and #DF handlers
> - * each get their own stacks. No IO Bitmap.
> - */
> - tss->rsp0 = stack_bottom;
> - tss->ist[IST_MCE - 1] = stack_top + (1 + IST_MCE) * PAGE_SIZE;
> - tss->ist[IST_NMI - 1] = stack_top + (1 + IST_NMI) * PAGE_SIZE;
> - tss->ist[IST_DB - 1] = stack_top + (1 + IST_DB) * PAGE_SIZE;
> - tss->ist[IST_DF - 1] = stack_top + (1 + IST_DF) * PAGE_SIZE;
> - tss->bitmap = IOBMP_INVALID_OFFSET;
> -
> - /* All other stack pointers poisioned. */
> - for ( i = IST_MAX; i < ARRAY_SIZE(tss->ist); ++i )
> - tss->ist[i] = 0x8600111111111111ul;
> - tss->rsp1 = 0x8600111111111111ul;
> - tss->rsp2 = 0x8600111111111111ul;
> -
> - /*
> - * Set up the shadow stack IST. Used entries must point at the
> - * supervisor stack token. Unused entries are poisoned.
> - *
> - * This IST Table may be live, and the NMI/#MC entries must
> - * remain valid on every instruction boundary, hence the
> - * volatile qualifier.
> - */
> - if (cpu_has_xen_shstk) {
> - volatile uint64_t *ist_ssp = tss_page->ist_ssp;
> - unsigned long
> - mce_ssp = stack_top + (IST_MCE * IST_SHSTK_SIZE) - 8,
> - nmi_ssp = stack_top + (IST_NMI * IST_SHSTK_SIZE) - 8,
> - db_ssp = stack_top + (IST_DB * IST_SHSTK_SIZE) - 8,
> - df_ssp = stack_top + (IST_DF * IST_SHSTK_SIZE) - 8;
> -
> - ist_ssp[0] = 0x8600111111111111ul;
> - ist_ssp[IST_MCE] = mce_ssp;
> - ist_ssp[IST_NMI] = nmi_ssp;
> - ist_ssp[IST_DB] = db_ssp;
> - ist_ssp[IST_DF] = df_ssp;
> - for ( i = IST_DF + 1; i < ARRAY_SIZE(tss_page->ist_ssp); ++i )
> - ist_ssp[i] = 0x8600111111111111ul;
> -
> - if (IS_ENABLED(CONFIG_XEN_SHSTK) && rdssp() != SSP_NO_SHSTK) {
> - /*
> - * Rewrite supervisor tokens when shadow stacks are
> - * active. This resets any busy bits left across S3.
> - */
> - wrss(mce_ssp, _p(mce_ssp));
> - wrss(nmi_ssp, _p(nmi_ssp));
> - wrss(db_ssp, _p(db_ssp));
> - wrss(df_ssp, _p(df_ssp));
> - }
> -
> - wrmsrl(MSR_ISST, (unsigned long)ist_ssp);
> - }
> -
> - _set_tssldt_desc(gdt + TSS_ENTRY, (unsigned long)tss,
> - sizeof(*tss) - 1, SYS_DESC_tss_avail);
> - if ( IS_ENABLED(CONFIG_PV32) )
> - _set_tssldt_desc(
> - this_cpu(compat_gdt) - FIRST_RESERVED_GDT_ENTRY + TSS_ENTRY,
> - (unsigned long)tss, sizeof(*tss) - 1, SYS_DESC_tss_busy);
> -
> - per_cpu(full_gdt_loaded, cpu) = false;
> - lgdt(&gdtr);
> - lidt(&idtr);
> - ltr(TSS_SELECTOR);
> - lldt(0);
> -
> - enable_each_ist(idt);
> -
> - /*
> - * Bottom-of-stack must be 16-byte aligned!
> - *
> - * Defer checks until exception support is sufficiently set up.
> - */
> - BUILD_BUG_ON((sizeof(struct cpu_info) -
> - sizeof(struct cpu_user_regs)) & 0xf);
> - BUG_ON(system_state != SYS_STATE_early_boot && (stack_bottom & 0xf));
> -}
> -
> static void skinit_enable_intr(void)
> {
> uint64_t val;
> diff --git a/xen/arch/x86/include/asm/system.h
> b/xen/arch/x86/include/asm/system.h
> index 57446c5b465c..3cdc56e4ba6d 100644
> --- a/xen/arch/x86/include/asm/system.h
> +++ b/xen/arch/x86/include/asm/system.h
> @@ -256,7 +256,6 @@ static inline int local_irq_is_enabled(void)
> #define BROKEN_ACPI_Sx 0x0001
> #define BROKEN_INIT_AFTER_S1 0x0002
>
> -void load_system_tables(void);
> void subarch_percpu_traps_init(void);
>
> #endif
> diff --git a/xen/arch/x86/traps-setup.c b/xen/arch/x86/traps-setup.c
> index 8ca379c9e4cb..13b8fcf0ba51 100644
> --- a/xen/arch/x86/traps-setup.c
> +++ b/xen/arch/x86/traps-setup.c
> @@ -7,6 +7,7 @@
>
> #include <asm/idt.h>
> #include <asm/msr.h>
> +#include <asm/shstk.h>
> #include <asm/system.h>
> #include <asm/traps.h>
>
> @@ -19,6 +20,124 @@ boolean_param("ler", opt_ler);
>
> void nocall entry_PF(void);
>
> +/*
> + * Sets up system tables and descriptors for IDT devliery.
> + *
> + * - Sets up TSS with stack pointers, including ISTs
> + * - Inserts TSS selector into regular and compat GDTs
> + * - Loads GDT, IDT, TR then null LDT
> + * - Sets up IST references in the IDT
> + */
> +static void load_system_tables(void)
> +{
> + unsigned int i, cpu = smp_processor_id();
> + unsigned long stack_bottom = get_stack_bottom(),
> + stack_top = stack_bottom & ~(STACK_SIZE - 1);
> + /*
> + * NB: define tss_page as a local variable because clang 3.5
> doesn't
> + * support using ARRAY_SIZE against per-cpu variables.
> + */
> + struct tss_page *tss_page = &this_cpu(tss_page);
> + idt_entry_t *idt = this_cpu(idt);
> +
Given the clang baseline this might not be needed anymore?
> + /* The TSS may be live. Disuade any clever optimisations. */
> + volatile struct tss64 *tss = &tss_page->tss;
> + seg_desc_t *gdt =
> + this_cpu(gdt) - FIRST_RESERVED_GDT_ENTRY;
> +
> + const struct desc_ptr gdtr = {
> + .base = (unsigned long)gdt,
> + .limit = LAST_RESERVED_GDT_BYTE,
> + };
> + const struct desc_ptr idtr = {
> + .base = (unsigned long)idt,
> + .limit = sizeof(bsp_idt) - 1,
> + };
> +
> + /*
> + * Set up the TSS. Warning - may be live, and the NMI/#MC must
> remain
> + * valid on every instruction boundary. (Note: these are all
> + * semantically ACCESS_ONCE() due to tss's volatile qualifier.)
> + *
> + * rsp0 refers to the primary stack. #MC, NMI, #DB and #DF
> handlers
> + * each get their own stacks. No IO Bitmap.
> + */
> + tss->rsp0 = stack_bottom;
> + tss->ist[IST_MCE - 1] = stack_top + (1 + IST_MCE) * PAGE_SIZE;
> + tss->ist[IST_NMI - 1] = stack_top + (1 + IST_NMI) * PAGE_SIZE;
> + tss->ist[IST_DB - 1] = stack_top + (1 + IST_DB) * PAGE_SIZE;
> + tss->ist[IST_DF - 1] = stack_top + (1 + IST_DF) * PAGE_SIZE;
> + tss->bitmap = IOBMP_INVALID_OFFSET;
> +
> + /* All other stack pointers poisioned. */
> + for ( i = IST_MAX; i < ARRAY_SIZE(tss->ist); ++i )
> + tss->ist[i] = 0x8600111111111111UL;
> + tss->rsp1 = 0x8600111111111111UL;
> + tss->rsp2 = 0x8600111111111111UL;
> +
> + /*
> + * Set up the shadow stack IST. Used entries must point at the
> + * supervisor stack token. Unused entries are poisoned.
> + *
> + * This IST Table may be live, and the NMI/#MC entries must
> + * remain valid on every instruction boundary, hence the
> + * volatile qualifier.
> + */
> + if ( cpu_has_xen_shstk )
> + {
> + volatile uint64_t *ist_ssp = tss_page->ist_ssp;
> + unsigned long
> + mce_ssp = stack_top + (IST_MCE * IST_SHSTK_SIZE) - 8,
> + nmi_ssp = stack_top + (IST_NMI * IST_SHSTK_SIZE) - 8,
> + db_ssp = stack_top + (IST_DB * IST_SHSTK_SIZE) - 8,
> + df_ssp = stack_top + (IST_DF * IST_SHSTK_SIZE) - 8;
> +
> + ist_ssp[0] = 0x8600111111111111UL;
> + ist_ssp[IST_MCE] = mce_ssp;
> + ist_ssp[IST_NMI] = nmi_ssp;
> + ist_ssp[IST_DB] = db_ssp;
> + ist_ssp[IST_DF] = df_ssp;
> + for ( i = IST_DF + 1; i < ARRAY_SIZE(tss_page->ist_ssp); ++i )
> + ist_ssp[i] = 0x8600111111111111UL;
> +
> + if ( IS_ENABLED(CONFIG_XEN_SHSTK) && rdssp() != SSP_NO_SHSTK )
> + {
> + /*
> + * Rewrite supervisor tokens when shadow stacks are
> + * active. This resets any busy bits left across S3.
> + */
> + wrss(mce_ssp, _p(mce_ssp));
> + wrss(nmi_ssp, _p(nmi_ssp));
> + wrss(db_ssp, _p(db_ssp));
> + wrss(df_ssp, _p(df_ssp));
> + }
> +
> + wrmsrns(MSR_ISST, (unsigned long)ist_ssp);
> + }
> +
> + _set_tssldt_desc(gdt + TSS_ENTRY, (unsigned long)tss,
> + sizeof(*tss) - 1, SYS_DESC_tss_avail);
> + if ( IS_ENABLED(CONFIG_PV32) )
> + _set_tssldt_desc(
> + this_cpu(compat_gdt) - FIRST_RESERVED_GDT_ENTRY +
> TSS_ENTRY,
> + (unsigned long)tss, sizeof(*tss) - 1, SYS_DESC_tss_busy);
> +
> + per_cpu(full_gdt_loaded, cpu) = false;
> + lgdt(&gdtr);
> + lidt(&idtr);
> + ltr(TSS_SELECTOR);
> + lldt(0);
> +
> + enable_each_ist(idt);
> +
> + /*
> + * tss->rsp0 must be 16-byte aligned.
> + *
> + * Defer checks until exception support is sufficiently set up.
> + */
> + BUG_ON(stack_bottom & 15);
> +}
> +
> static void __init init_ler(void)
> {
> unsigned int msr = 0;
> @@ -139,3 +258,16 @@ void asmlinkage ap_early_traps_init(void)
> {
> load_system_tables();
> }
> +
> +static void __init __maybe_unused build_assertions(void)
> +{
> + /*
> + * This is best-effort (it doesn't cover some padding corner
> cases), but
> + * is preforable to hitting the check at boot time.
> + *
> + * tss->rsp0, pointing at the end of cpu_info.guest_cpu_user_regs,
> must be
> + * 16-byte aligned.
> + */
> + BUILD_BUG_ON((sizeof(struct cpu_info) -
> + endof_field(struct cpu_info, guest_cpu_user_regs)) &
> 15);
> +}
> diff --git a/xen/include/xen/macros.h b/xen/include/xen/macros.h
> index cd528fbdb127..726ba221e0d8 100644
> --- a/xen/include/xen/macros.h
> +++ b/xen/include/xen/macros.h
> @@ -102,6 +102,8 @@
> */
> #define sizeof_field(type, member) sizeof(((type *)NULL)->member)
>
> +#define endof_field(type, member) (offsetof(type, member) +
> sizeof_field(type, member))
> +
> /* Cast an arbitrary integer to a pointer. */
> #define _p(x) ((void *)(unsigned long)(x))
--
Nicola Vetrini, B.Sc.
Software Engineer
BUGSENG (https://bugseng.com)
LinkedIn: https://www.linkedin.com/in/nicola-vetrini-a42471253
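
For illustration, the endof_field() helper from the quoted patch can be
exercised on a toy structure. This is a sketch only: the two macros are copied
from the quoted hunk, while struct user_regs_sketch, struct cpu_info_sketch
and their layout are invented for the example and do not reflect Xen's real
struct cpu_info.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define sizeof_field(type, member) sizeof(((type *)NULL)->member)
#define endof_field(type, member) \
    (offsetof(type, member) + sizeof_field(type, member))

/* Invented layout: 176 bytes of register state. */
struct user_regs_sketch {
    uint64_t gprs[22];
};

/* Invented layout: 8 bytes before the register block, 32 bytes after it. */
struct cpu_info_sketch {
    uint64_t lead;
    struct user_regs_sketch guest_cpu_user_regs;
    uint64_t tail[4];
};

/*
 * Bytes between the end of guest_cpu_user_regs and the end of the structure.
 * For the stack pointer at the end of the register block to be 16-byte
 * aligned, this distance must be a multiple of 16.
 */
static_assert(((sizeof(struct cpu_info_sketch) -
                endof_field(struct cpu_info_sketch, guest_cpu_user_regs)) & 15)
              == 0, "tail after guest_cpu_user_regs must keep 16-byte alignment");

static inline size_t tail_bytes(void)
{
    return sizeof(struct cpu_info_sketch) -
           endof_field(struct cpu_info_sketch, guest_cpu_user_regs);
}
```

Unlike subtracting sizeof() of the member, endof_field() also accounts for
anything placed in front of the member, which is why the invented layout above
carries a leading field.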
* Re: [PATCH 10/22] x86/traps: Move subarch_percpu_traps_init() into traps-setup.c
2025-08-11 8:17 ` Andrew Cooper
@ 2025-08-12 9:52 ` Jan Beulich
2025-08-13 11:53 ` Andrew Cooper
0 siblings, 1 reply; 120+ messages in thread
From: Jan Beulich @ 2025-08-12 9:52 UTC (permalink / raw)
To: Andrew Cooper; +Cc: Roger Pau Monné, Xen-devel
On 11.08.2025 10:17, Andrew Cooper wrote:
> On 08/08/2025 9:23 pm, Andrew Cooper wrote:
>> ... along with the supporting functions. Switch to Xen coding style, and make
>> static as there are no external callers.
>>
>> Rename to legacy_syscall_init() as a more accurate name.
>>
>> No functional change.
>>
>> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
>> ---
>> CC: Jan Beulich <JBeulich@suse.com>
>> CC: Roger Pau Monné <roger.pau@citrix.com>
>> ---
>> xen/arch/x86/include/asm/system.h | 2 -
>> xen/arch/x86/traps-setup.c | 97 ++++++++++++++++++++++++++++++-
>> xen/arch/x86/x86_64/traps.c | 92 -----------------------------
>> 3 files changed, 95 insertions(+), 96 deletions(-)
>>
>> diff --git a/xen/arch/x86/include/asm/system.h b/xen/arch/x86/include/asm/system.h
>> index 3cdc56e4ba6d..6c2800d8158d 100644
>> --- a/xen/arch/x86/include/asm/system.h
>> +++ b/xen/arch/x86/include/asm/system.h
>> @@ -256,6 +256,4 @@ static inline int local_irq_is_enabled(void)
>> #define BROKEN_ACPI_Sx 0x0001
>> #define BROKEN_INIT_AFTER_S1 0x0002
>>
>> -void subarch_percpu_traps_init(void);
>> -
>> #endif
>> diff --git a/xen/arch/x86/traps-setup.c b/xen/arch/x86/traps-setup.c
>> index 13b8fcf0ba51..fbae7072c292 100644
>> --- a/xen/arch/x86/traps-setup.c
>> +++ b/xen/arch/x86/traps-setup.c
>> @@ -2,13 +2,15 @@
>> /*
>> * Configuration of event handling for all CPUs.
>> */
>> +#include <xen/domain_page.h>
>> #include <xen/init.h>
>> #include <xen/param.h>
>>
>> +#include <asm/endbr.h>
>> #include <asm/idt.h>
>> #include <asm/msr.h>
>> #include <asm/shstk.h>
>> -#include <asm/system.h>
>> +#include <asm/stubs.h>
>> #include <asm/traps.h>
>>
>> DEFINE_PER_CPU_READ_MOSTLY(idt_entry_t *, idt);
>> @@ -19,6 +21,8 @@ static bool __initdata opt_ler;
>> boolean_param("ler", opt_ler);
>>
>> void nocall entry_PF(void);
>> +void nocall lstar_enter(void);
>> +void nocall cstar_enter(void);
>>
>> /*
>> * Sets up system tables and descriptors for IDT devliery.
>> @@ -138,6 +142,95 @@ static void load_system_tables(void)
>> BUG_ON(stack_bottom & 15);
>> }
>>
>> +static unsigned int write_stub_trampoline(
>> + unsigned char *stub, unsigned long stub_va,
>> + unsigned long stack_bottom, unsigned long target_va)
>> +{
>> + unsigned char *p = stub;
>> +
>> + if ( cpu_has_xen_ibt )
>> + {
>> + place_endbr64(p);
>> + p += 4;
>> + }
>> +
>> + /* Store guest %rax into %ss slot */
>> + /* movabsq %rax, stack_bottom - 8 */
>> + *p++ = 0x48;
>> + *p++ = 0xa3;
>> + *(uint64_t *)p = stack_bottom - 8;
>> + p += 8;
>> +
>> + /* Store guest %rsp in %rax */
>> + /* movq %rsp, %rax */
>> + *p++ = 0x48;
>> + *p++ = 0x89;
>> + *p++ = 0xe0;
>> +
>> + /* Switch to Xen stack */
>> + /* movabsq $stack_bottom - 8, %rsp */
>> + *p++ = 0x48;
>> + *p++ = 0xbc;
>> + *(uint64_t *)p = stack_bottom - 8;
>> + p += 8;
>> +
>> + /* jmp target_va */
>> + *p++ = 0xe9;
>> + *(int32_t *)p = target_va - (stub_va + (p - stub) + 4);
>> + p += 4;
>> +
>> + /* Round up to a multiple of 16 bytes. */
>> + return ROUNDUP(p - stub, 16);
>> +}
>> +
>> +static void legacy_syscall_init(void)
>> +{
>> + unsigned long stack_bottom = get_stack_bottom();
>> + unsigned long stub_va = this_cpu(stubs.addr);
>> + unsigned char *stub_page;
>> + unsigned int offset;
>> +
>> + /* No PV guests? No need to set up SYSCALL/SYSENTER infrastructure. */
>> + if ( !IS_ENABLED(CONFIG_PV) )
>> + return;
>> +
>> + stub_page = map_domain_page(_mfn(this_cpu(stubs.mfn)));
>> +
>> + /*
>> + * Trampoline for SYSCALL entry from 64-bit mode. The VT-x HVM vcpu
>> + * context switch logic relies on the SYSCALL trampoline being at the
>> + * start of the stubs.
>> + */
>> + wrmsrl(MSR_LSTAR, stub_va);
>> + offset = write_stub_trampoline(stub_page + (stub_va & ~PAGE_MASK),
>> + stub_va, stack_bottom,
>> + (unsigned long)lstar_enter);
>> + stub_va += offset;
>> +
>> + if ( cpu_has_sep )
>> + {
>> + /* SYSENTER entry. */
>> + wrmsrl(MSR_IA32_SYSENTER_ESP, stack_bottom);
>> + wrmsrl(MSR_IA32_SYSENTER_EIP, (unsigned long)sysenter_entry);
>> + wrmsr(MSR_IA32_SYSENTER_CS, __HYPERVISOR_CS, 0);
>> + }
>> +
>> + /* Trampoline for SYSCALL entry from compatibility mode. */
>> + wrmsrl(MSR_CSTAR, stub_va);
>> + offset += write_stub_trampoline(stub_page + (stub_va & ~PAGE_MASK),
>> + stub_va, stack_bottom,
>> + (unsigned long)cstar_enter);
>> +
>> + /* Don't consume more than half of the stub space here. */
>> + ASSERT(offset <= STUB_BUF_SIZE / 2);
>> +
>> + unmap_domain_page(stub_page);
>> +
>> + /* Common SYSCALL parameters. */
>> + wrmsrl(MSR_STAR, XEN_MSR_STAR);
>> + wrmsrl(MSR_SYSCALL_MASK, XEN_SYSCALL_MASK);
>> +}
>
> These want adjusting to use wrmsrns(), similarly to the previous patch.
> Fixed locally.
Also the one higher in the function, I suppose.
Acked-by: Jan Beulich <jbeulich@suse.com>
Jan
* Re: [PATCH 11/22] x86/traps: Fold x86_64/traps.c into traps.c
2025-08-08 20:23 ` [PATCH 11/22] x86/traps: Fold x86_64/traps.c into traps.c Andrew Cooper
@ 2025-08-12 9:53 ` Jan Beulich
0 siblings, 0 replies; 120+ messages in thread
From: Jan Beulich @ 2025-08-12 9:53 UTC (permalink / raw)
To: Andrew Cooper; +Cc: Roger Pau Monné, Xen-devel
On 08.08.2025 22:23, Andrew Cooper wrote:
> It's now just the double fault handler and various state dumping functions.
Ah, yes - long awaited consolidation.
> Swap u64 for uint64_t, and fix a few other minor style issues.
>
> No functional change.
>
> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
* Re: [PATCH 12/22] x86/traps: Unexport show_code() and show_stack_overflow()
2025-08-08 20:23 ` [PATCH 12/22] x86/traps: Unexport show_code() and show_stack_overflow() Andrew Cooper
@ 2025-08-12 9:54 ` Jan Beulich
0 siblings, 0 replies; 120+ messages in thread
From: Jan Beulich @ 2025-08-12 9:54 UTC (permalink / raw)
To: Andrew Cooper; +Cc: Roger Pau Monné, Xen-devel
On 08.08.2025 22:23, Andrew Cooper wrote:
> These can become static now the two traps.c have been merged.
>
> No functional change.
>
> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
* Re: [PATCH 01/22] x86/msr: Rename MSR_INTERRUPT_SSP_TABLE to MSR_ISST
2025-08-12 8:06 ` Jan Beulich
@ 2025-08-13 9:02 ` Andrew Cooper
0 siblings, 0 replies; 120+ messages in thread
From: Andrew Cooper @ 2025-08-13 9:02 UTC (permalink / raw)
To: Jan Beulich; +Cc: Roger Pau Monné, Xen-devel
On 12/08/2025 9:06 am, Jan Beulich wrote:
> On 08.08.2025 22:22, Andrew Cooper wrote:
>> The name AMD chose is rather more concise.
>>
>> No functional change.
>>
>> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
> I'm okay with the (much) shorter name, so:
> Acked-by: Jan Beulich <jbeulich@suse.com>
Thanks.
>
> But I can't make the connection to AMD: It's INTERRUPT_SST_TABLE (figure)
> or INTERRUPT_SSP_TABLE (text) there, afaics. And ISST_ADDR in yet further
> places.
ISST has a better association with IST than the long form, which is
ambiguous if turned into an initialism. The ADDR suffix I dropped like
we do elsewhere.
But now you point it out, there is a confusing mix in AMD. I'm pretty
sure SST is a plain typo, and I've fed this back.
I'll tweak the wording a little bit.
~Andrew
* Re: [PATCH 03/22] x86/traps: Drop incorrect BUILD_BUG_ON() and comment in load_system_tables()
2025-08-12 8:11 ` Jan Beulich
@ 2025-08-13 9:40 ` Andrew Cooper
2025-08-14 8:50 ` Jan Beulich
0 siblings, 1 reply; 120+ messages in thread
From: Andrew Cooper @ 2025-08-13 9:40 UTC (permalink / raw)
To: Jan Beulich; +Cc: Roger Pau Monné, Xen-devel
On 12/08/2025 9:11 am, Jan Beulich wrote:
> On 08.08.2025 22:22, Andrew Cooper wrote:
>> This was added erroneously by me.
>>
>> Hardware task switching does demand a TSS of at least 0x67 bytes, but that's
>> not relevant in 64bit, and not relevant for Xen since commit
>> 5d1181a5ea5e ("xen: Remove x86_32 build target.") in 2012.
>>
>> We already load a 0-length TSS in early_traps_init() demonstrating that it's
>> possible.
>>
>> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
>> ---
>> CC: Jan Beulich <JBeulich@suse.com>
>> CC: Roger Pau Monné <roger.pau@citrix.com>
>> ---
>> xen/arch/x86/cpu/common.c | 2 --
>> 1 file changed, 2 deletions(-)
>>
>> diff --git a/xen/arch/x86/cpu/common.c b/xen/arch/x86/cpu/common.c
>> index f6ec5c9df522..cdc41248d4e9 100644
>> --- a/xen/arch/x86/cpu/common.c
>> +++ b/xen/arch/x86/cpu/common.c
>> @@ -936,8 +936,6 @@ void load_system_tables(void)
>> wrmsrl(MSR_ISST, (unsigned long)ist_ssp);
>> }
>>
>> - BUILD_BUG_ON(sizeof(*tss) <= 0x67); /* Mandated by the architecture. */
>> -
>> _set_tssldt_desc(gdt + TSS_ENTRY, (unsigned long)tss,
>> sizeof(*tss) - 1, SYS_DESC_tss_avail);
>> if ( IS_ENABLED(CONFIG_PV32) )
> Well, the comment is wrong. Whether the BUILD_BUG_ON() itself is also wrong
> depends on our intentions with the structure. Don't we need it to be that
> size for everything (incl I/O bitmap) to work correctly elsewhere?
We don't use the IO bitmap. We've talked about it a few times, but
never got it sorted.
Xen's TSS could be as short as 0x37 (covering IST3) and still work
correctly and safely (as there's no task switching).
A failure to read tss->iopb is the same as a failure to read the bitmap
itself. In fact, it's probably marginally faster for users of
IOBMP_INVALID_OFFSET as it fails one step earlier.
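
For reference, a minimal sketch of where the 0x67 figure in the dropped check comes from, assuming the architectural 64-bit TSS layout. The struct and helper names (demo_tss64, demo_tss_limit) are hypothetical and not Xen's own definitions; the packed attribute is a GCC/Clang extension.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a 64-bit TSS; field names are illustrative. */
struct demo_tss64 {
    uint32_t rsvd0;
    uint64_t rsp0, rsp1, rsp2;
    uint64_t rsvd1;
    uint64_t ist[7];
    uint64_t rsvd2;
    uint16_t rsvd3;
    uint16_t iopb;          /* I/O permission bitmap offset */
} __attribute__((packed));

/* Segment limit for a descriptor covering the whole structure. */
static unsigned int demo_tss_limit(void)
{
    return sizeof(struct demo_tss64) - 1;
}
```

With this layout the iopb field is the last thing in the structure, so a shorter limit simply makes the iopb read itself fail. Offsets of individual fields can be checked the same way with offsetof().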
~Andrew
* Re: [PATCH 04/22] x86/idt: Minor improvements to _update_gate_addr_lower()
2025-08-12 8:16 ` Jan Beulich
@ 2025-08-13 9:48 ` Andrew Cooper
0 siblings, 0 replies; 120+ messages in thread
From: Andrew Cooper @ 2025-08-13 9:48 UTC (permalink / raw)
To: Jan Beulich; +Cc: Roger Pau Monné, Xen-devel
On 12/08/2025 9:16 am, Jan Beulich wrote:
> On 08.08.2025 22:22, Andrew Cooper wrote:
>> After some experimentation, using .a/.b makes far better logic than using the
>> named fields, as both Clang and GCC spill idte to the stack when named fields
>> are used.
>>
>> GCC seems to do something very daft for the addr1 field. It takes addr,
>> shifts it by 32, then ANDs with 0xffff000000000000UL, which requires
>> manifesting a MOVABS.
>>
>> Clang follows the C, whereby it ANDs with $imm32, then shifts, avoiding the
>> MOVABS entirely.
>>
>> No functional change.
>>
>> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
> Acked-by: Jan Beulich <jbeulich@suse.com>
Thanks.
> albeit I have to admit that I'm not quite happy about ...
>
>> --- a/xen/arch/x86/include/asm/idt.h
>> +++ b/xen/arch/x86/include/asm/idt.h
>> @@ -92,15 +92,16 @@ static inline void _set_gate_lower(idt_entry_t *gate, unsigned long type,
>> * Update the lower half handler of an IDT entry, without changing any other
>> * configuration.
>> */
>> -static inline void _update_gate_addr_lower(idt_entry_t *gate, void *addr)
>> +static inline void _update_gate_addr_lower(idt_entry_t *gate, void *_addr)
>> {
>> + unsigned long addr = (unsigned long)_addr;
>> + unsigned int addr1 = addr & 0xffff0000U; /* GCC force better codegen. */
>> idt_entry_t idte;
>> - idte.a = gate->a;
>>
>> - idte.b = ((unsigned long)(addr) >> 32);
>> - idte.a &= 0x0000FFFFFFFF0000ULL;
>> - idte.a |= (((unsigned long)(addr) & 0xFFFF0000UL) << 32) |
>> - ((unsigned long)(addr) & 0xFFFFUL);
>> + idte.b = addr >> 32;
>> + idte.a = gate->a & 0x0000ffffffff0000UL;
>> + idte.a |= (unsigned long)addr1 << 32;
> ... the cast here. Yet perhaps gcc still generates a MOVABS when you make
> addr1 unsigned long?
Correct. Forcing the mask operation to be 32bit is the only way I found
of avoiding the MOVABS.
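
A minimal sketch of the two forms side by side, to show they compute the same gate. The type and function names (demo_idt_entry_t, demo_update_old, demo_update_new) are hypothetical, and the final OR of the low 16 address bits in the new form is an assumption, as the quoted hunk stops before it.

```c
#include <stdint.h>

/* Hypothetical two-word view of a 64-bit IDT entry. */
typedef struct { uint64_t a, b; } demo_idt_entry_t;

/* Old form: 64-bit mask, then shift. */
static demo_idt_entry_t demo_update_old(demo_idt_entry_t gate, uint64_t addr)
{
    demo_idt_entry_t idte;

    idte.a = gate.a;
    idte.b = addr >> 32;
    idte.a &= 0x0000ffffffff0000ULL;
    idte.a |= ((addr & 0xffff0000ULL) << 32) | (addr & 0xffffULL);

    return idte;
}

/* New form: the mask is applied as a 32-bit operation before widening. */
static demo_idt_entry_t demo_update_new(demo_idt_entry_t gate, uint64_t addr)
{
    uint32_t addr1 = addr & 0xffff0000U; /* bits 31:16 of the handler */
    demo_idt_entry_t idte;

    idte.b = addr >> 32;
    idte.a = gate.a & 0x0000ffffffff0000ULL;
    idte.a |= (uint64_t)addr1 << 32;
    idte.a |= addr & 0xffff;             /* assumed: low 16 bits */

    return idte;
}
```

Only the selector and attribute bits of the original gate survive in both versions; whether the compiler emits a MOVABS for either is a code generation detail this sketch does not demonstrate.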
>
> As to the comment next to the variable declaration: Could I talk you into
> adding a colon after "GCC"? Without one, the comment reads somewhat
> ambiguously to me.
Ok.
~Andrew
* Re: [PATCH 06/22] x86/traps: Introduce bsp_traps_reinit()
2025-08-12 8:19 ` Jan Beulich
@ 2025-08-13 9:51 ` Andrew Cooper
0 siblings, 0 replies; 120+ messages in thread
From: Andrew Cooper @ 2025-08-13 9:51 UTC (permalink / raw)
To: Jan Beulich; +Cc: Roger Pau Monné, Xen-devel
On 12/08/2025 9:19 am, Jan Beulich wrote:
> On 08.08.2025 22:22, Andrew Cooper wrote:
>> ... to abstract away updating the references to the old BSP stack.
>>
>> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
> If it helps later:
> Acked-by: Jan Beulich <jbeulich@suse.com>
> with ..
>
>> --- a/xen/arch/x86/traps-setup.c
>> +++ b/xen/arch/x86/traps-setup.c
>> @@ -107,6 +107,15 @@ void __init traps_init(void)
>> percpu_traps_init();
>> }
>>
>> +/*
>> + * Re-initialise all state referencing the early-boot stack.
>> + */
>> +void bsp_traps_reinit(void)
> ... __init added here.
Oops, yes. Fixed. Thanks.
~Andrew
* Re: [PATCH 07/22] x86/spec-ctrl: Rework init_shadow_spec_ctrl_state() to take an info pointer
2025-08-12 8:27 ` Jan Beulich
@ 2025-08-13 10:35 ` Andrew Cooper
0 siblings, 0 replies; 120+ messages in thread
From: Andrew Cooper @ 2025-08-13 10:35 UTC (permalink / raw)
To: Jan Beulich; +Cc: Roger Pau Monné, Xen-devel
On 12/08/2025 9:27 am, Jan Beulich wrote:
> On 08.08.2025 22:22, Andrew Cooper wrote:
>> We're going to want to reuse it for a remote stack shortly.
> Are we? From the titles of subsequent patches I can't judge where that would
> be, so it's hard to peek ahead. And iirc earlier on it was a conscious decision
> to only ever run this locally.
I don't recall that. I recall bugs which occurred because there were
several variables and we failed to sync one of them.
Either way, it's very clearly the right course of action to take now.
>
>> No functional change.
>>
>> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
> Nevertheless, trusting that you have a good reason:
> Acked-by: Jan Beulich <jbeulich@suse.com>
Thanks.
~Andrew
* Re: [PATCH 08/22] x86/traps: Introduce ap_early_traps_init() and set up exception handling earlier
2025-08-12 8:41 ` Jan Beulich
@ 2025-08-13 11:13 ` Andrew Cooper
2025-08-14 8:53 ` Jan Beulich
0 siblings, 1 reply; 120+ messages in thread
From: Andrew Cooper @ 2025-08-13 11:13 UTC (permalink / raw)
To: Jan Beulich; +Cc: Roger Pau Monné, Xen-devel
On 12/08/2025 9:41 am, Jan Beulich wrote:
> On 08.08.2025 22:23, Andrew Cooper wrote:
>> --- a/xen/arch/x86/acpi/wakeup_prot.S
>> +++ b/xen/arch/x86/acpi/wakeup_prot.S
>> @@ -63,6 +63,9 @@ LABEL(s3_resume)
>> pushq %rax
>> lretq
>> 1:
>> + /* Set up early exceptions and CET before entering C properly. */
>> + call ap_early_traps_init
> But this is the BSP?
By the end of the cleanup, what we have is:
At boot only:
* really early init, basic exception handling only
* regular init (inc syscall trampolines)
* late re-init as we change the stack linear address
For everything else (APs, S3, hot-online):
* early, full exception handling
* regular init (inc syscall trampolines)
Currently, these are named:
* bsp_early_traps_init()
* traps_init()
* bsp_traps_reinit()
and
* ap_early_traps_init()
* percpu_traps_init()
Perhaps ap_early_traps_init() should be named
percpu_early_traps_init()? But I'm open to suggestions.
>
>> --- a/xen/arch/x86/smpboot.c
>> +++ b/xen/arch/x86/smpboot.c
>> @@ -327,12 +327,7 @@ void asmlinkage start_secondary(void)
>> struct cpu_info *info = get_cpu_info();
>> unsigned int cpu = smp_processor_id();
>>
>> - /* Critical region without IDT or TSS. Any fault is deadly! */
>> -
>> - set_current(idle_vcpu[cpu]);
>> - this_cpu(curr_vcpu) = idle_vcpu[cpu];
>> rdmsrl(MSR_EFER, this_cpu(efer));
>> - init_shadow_spec_ctrl_state(info);
>>
>> /*
>> * Just as during early bootstrap, it is convenient here to disable
>> @@ -352,14 +347,6 @@ void asmlinkage start_secondary(void)
>> */
>> spin_debug_disable();
>>
>> - get_cpu_info()->use_pv_cr3 = false;
>> - get_cpu_info()->xen_cr3 = 0;
>> - get_cpu_info()->pv_cr3 = 0;
>> -
>> - load_system_tables();
>> -
>> - /* Full exception support from here on in. */
>> -
>> if ( cpu_has_pks )
>> wrpkrs_and_cache(0); /* Must be before setting CR4.PKS */
>>
>> @@ -1064,8 +1051,12 @@ static int cpu_smpboot_alloc(unsigned int cpu)
>> goto out;
>>
>> info = get_cpu_info_from_stack((unsigned long)stack_base[cpu]);
>> + memset(info, 0, sizeof(*info));
> Why do we suddenly need this? Or is this just out of an abundance of
> caution (while making the individual ->*_cr3 writes unnecessary)?
cpu_alloc_stack() explicitly uses alloc_xenheap_pages() which uses
MEMF_no_scrub. It will usually be zeroed memory because we allocate
them all at the start of day, but it also has a habit of being 0xc2'd
when running under Xen.
Also yes, I do dislike the ad-hoc zeroes of misc fields.
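
A minimal sketch of the difference, assuming an allocator that hands back poisoned rather than zeroed memory. The struct and helpers (demo_cpu_info, demo_poison, demo_init_info) are hypothetical stand-ins, not the real cpu_info or allocator.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical subset of per-CPU info fields. */
struct demo_cpu_info {
    unsigned int processor_id;
    unsigned long xen_cr3, pv_cr3;
    bool use_pv_cr3;
};

/* Simulate unscrubbed memory carrying a 0xc2 poison pattern. */
static void demo_poison(void *buf, size_t len)
{
    memset(buf, 0xc2, len);
}

/* Zero the whole block first, then fill in only what is known. */
static void demo_init_info(struct demo_cpu_info *info, unsigned int cpu)
{
    memset(info, 0, sizeof(*info));
    info->processor_id = cpu;
}
```

One memset() covers every field, including ones added later, which is what makes the individual zero writes redundant.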
>
>> + init_shadow_spec_ctrl_state(info);
> May I suggest to move this further down a little, at least ...
>
>> info->processor_id = cpu;
> ... past here? Just in case other values in the struct may be needed
> in the function at some point.
Ok.
>
>> info->per_cpu_offset = __per_cpu_offset[cpu];
>> + info->current_vcpu = idle_vcpu[cpu];
> To be able to spot this, I think it wants /* set_current() */ or some
> such.
Ok.
>
>> + per_cpu(curr_vcpu, cpu) = idle_vcpu[cpu];
> It's a little odd to do this early (and remotely), but it looks all fine
> with how the variable is currently used.
It did take a little while for me to conclude that it is safe, but yes -
it does relax a lot of ordering constraints for AP bringup.
~Andrew
* Re: [PATCH 09/22] x86/traps: Move load_system_tables() into traps-setup.c
2025-08-12 9:19 ` Jan Beulich
@ 2025-08-13 11:25 ` Andrew Cooper
2025-08-14 8:55 ` Jan Beulich
0 siblings, 1 reply; 120+ messages in thread
From: Andrew Cooper @ 2025-08-13 11:25 UTC (permalink / raw)
To: Jan Beulich; +Cc: Roger Pau Monné, Xen-devel
On 12/08/2025 10:19 am, Jan Beulich wrote:
> On 08.08.2025 22:23, Andrew Cooper wrote:
>> Switch it to Xen coding style and fix MISRA violations.
> Those were all ul -> UL suffix transformations, afaics?
Yes.
>
>> Make it static as
>> there are no external callers now.
>>
>> Since commit a35816b5cae8 ("x86/traps: Introduce early_traps_init() and
>> simplify setup"), load_system_tables() is called later on the BSP, so the
>> SYS_STATE_early_boot check can be dropped from the safety BUG_ON().
>>
>> Move the BUILD_BUG_ON() into build_assertions(),
> I'm not quite convinced of this move - having the related BUILD_BUG_ON()
> and BUG_ON() next to each other would seem better to me.
I don't see a specific reason for them to be together, and the comment
explains what's going on.
With FRED, we want a related BUILD_BUG_ON(), but there's no equivalent
BUG_ON() because MSR_RSP_SL0 will #GP on being misaligned.
>> @@ -139,3 +258,16 @@ void asmlinkage ap_early_traps_init(void)
>> {
>> load_system_tables();
>> }
>> +
>> +static void __init __maybe_unused build_assertions(void)
>> +{
>> + /*
>> + * This is best-effort (it doesn't cover some padding corner cases), but
>> + * is preforable to hitting the check at boot time.
> Nit: "preferable"
Fixed.
~Andrew
* Re: [PATCH 09/22] x86/traps: Move load_system_tables() into traps-setup.c
2025-08-12 9:43 ` Nicola Vetrini
@ 2025-08-13 11:36 ` Andrew Cooper
2025-08-14 7:26 ` Jan Beulich
0 siblings, 1 reply; 120+ messages in thread
From: Andrew Cooper @ 2025-08-13 11:36 UTC (permalink / raw)
To: Nicola Vetrini; +Cc: Xen-devel, Jan Beulich, Roger Pau Monné
On 12/08/2025 10:43 am, Nicola Vetrini wrote:
> On 2025-08-08 22:23, Andrew Cooper wrote:
>> diff --git a/xen/arch/x86/traps-setup.c b/xen/arch/x86/traps-setup.c
>> index 8ca379c9e4cb..13b8fcf0ba51 100644
>> --- a/xen/arch/x86/traps-setup.c
>> +++ b/xen/arch/x86/traps-setup.c
>> @@ -19,6 +20,124 @@ boolean_param("ler", opt_ler);
>>
>> void nocall entry_PF(void);
>>
>> +/*
>> + * Sets up system tables and descriptors for IDT devliery.
>> + *
>> + * - Sets up TSS with stack pointers, including ISTs
>> + * - Inserts TSS selector into regular and compat GDTs
>> + * - Loads GDT, IDT, TR then null LDT
>> + * - Sets up IST references in the IDT
>> + */
>> +static void load_system_tables(void)
>> +{
>> + unsigned int i, cpu = smp_processor_id();
>> + unsigned long stack_bottom = get_stack_bottom(),
>> + stack_top = stack_bottom & ~(STACK_SIZE - 1);
>> + /*
>> + * NB: define tss_page as a local variable because clang 3.5 doesn't
>> + * support using ARRAY_SIZE against per-cpu variables.
>> + */
>> + struct tss_page *tss_page = &this_cpu(tss_page);
>> + idt_entry_t *idt = this_cpu(idt);
>> +
>
> Given the clang baseline this might not be needed anymore?
Hmm. While true, looking at 51461114e26, the code is definitely better
written with the tss_page variable and we wouldn't want to go back to
the old form.
I think that I'll simply drop the comment.
~Andrew
P.S.
Generally speaking, because of the RELOC_HIDE() in this_cpu(), any time
you ever want two accesses to a variable, it's better (code gen wise) to
construct a pointer to it and use the pointer multiple times.
I don't understand why there's a RELOC_HIDE() in this_cpu(). The
justification doesn't make sense, but I've not had time to explore what
happens if we take it out.
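
A minimal sketch of the pattern, using a simplified stand-in for the RELOC_HIDE() idiom (an empty asm that launders the pointer). Everything prefixed demo_/DEMO_ is hypothetical, the per-CPU area is modelled as a two-element array, and the statement expression and asm need GCC or Clang.

```c
/* Simplified stand-in: pass the pointer through an empty asm, add off. */
#define DEMO_RELOC_HIDE(ptr, off)                                   \
    ({ unsigned long ptr_;                                          \
       __asm__ ( "" : "=r" (ptr_) : "0" ((unsigned long)(ptr)) );   \
       (__typeof__(ptr))(ptr_ + (off)); })

struct demo_tss { unsigned long rsp0, ist[7]; };

/* [0] plays the "master" copy, [1] the current CPU's copy. */
static struct demo_tss demo_copies[2];
static unsigned long demo_percpu_offset = sizeof(struct demo_tss);

#define demo_this_cpu_tss() \
    (*DEMO_RELOC_HIDE(&demo_copies[0], demo_percpu_offset))

/* Each use expands the macro again, so the address may be recomputed. */
static void demo_setup_repeated(unsigned long stack)
{
    demo_this_cpu_tss().rsp0 = stack;
    demo_this_cpu_tss().ist[0] = stack - 0x1000;
}

/* Derive the pointer once and reuse it for every access. */
static void demo_setup_cached(unsigned long stack)
{
    struct demo_tss *tss = &demo_this_cpu_tss();

    tss->rsp0 = stack;
    tss->ist[0] = stack - 0x1000;
}
```

Both functions write the same values to the current CPU's copy and leave the "master" copy alone; the difference is only in how often the address is derived.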
* Re: [PATCH 10/22] x86/traps: Move subarch_percpu_traps_init() into traps-setup.c
2025-08-12 9:52 ` Jan Beulich
@ 2025-08-13 11:53 ` Andrew Cooper
2025-08-14 8:58 ` Jan Beulich
0 siblings, 1 reply; 120+ messages in thread
From: Andrew Cooper @ 2025-08-13 11:53 UTC (permalink / raw)
To: Jan Beulich; +Cc: Roger Pau Monné, Xen-devel
On 12/08/2025 10:52 am, Jan Beulich wrote:
> On 11.08.2025 10:17, Andrew Cooper wrote:
>> On 08/08/2025 9:23 pm, Andrew Cooper wrote:
>>> ... along with the supporting functions. Switch to Xen coding style, and make
>>> static as there are no external callers.
>>>
>>> Rename to legacy_syscall_init() as a more accurate name.
>>>
>>> No functional change.
>>>
>>> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
>>> ---
>>> CC: Jan Beulich <JBeulich@suse.com>
>>> CC: Roger Pau Monné <roger.pau@citrix.com>
>>> ---
>>> xen/arch/x86/include/asm/system.h | 2 -
>>> xen/arch/x86/traps-setup.c | 97 ++++++++++++++++++++++++++++++-
>>> xen/arch/x86/x86_64/traps.c | 92 -----------------------------
>>> 3 files changed, 95 insertions(+), 96 deletions(-)
>>>
>>> diff --git a/xen/arch/x86/include/asm/system.h b/xen/arch/x86/include/asm/system.h
>>> index 3cdc56e4ba6d..6c2800d8158d 100644
>>> --- a/xen/arch/x86/include/asm/system.h
>>> +++ b/xen/arch/x86/include/asm/system.h
>>> @@ -256,6 +256,4 @@ static inline int local_irq_is_enabled(void)
>>> #define BROKEN_ACPI_Sx 0x0001
>>> #define BROKEN_INIT_AFTER_S1 0x0002
>>>
>>> -void subarch_percpu_traps_init(void);
>>> -
>>> #endif
>>> diff --git a/xen/arch/x86/traps-setup.c b/xen/arch/x86/traps-setup.c
>>> index 13b8fcf0ba51..fbae7072c292 100644
>>> --- a/xen/arch/x86/traps-setup.c
>>> +++ b/xen/arch/x86/traps-setup.c
>>> @@ -2,13 +2,15 @@
>>> /*
>>> * Configuration of event handling for all CPUs.
>>> */
>>> +#include <xen/domain_page.h>
>>> #include <xen/init.h>
>>> #include <xen/param.h>
>>>
>>> +#include <asm/endbr.h>
>>> #include <asm/idt.h>
>>> #include <asm/msr.h>
>>> #include <asm/shstk.h>
>>> -#include <asm/system.h>
>>> +#include <asm/stubs.h>
>>> #include <asm/traps.h>
>>>
>>> DEFINE_PER_CPU_READ_MOSTLY(idt_entry_t *, idt);
>>> @@ -19,6 +21,8 @@ static bool __initdata opt_ler;
>>> boolean_param("ler", opt_ler);
>>>
>>> void nocall entry_PF(void);
>>> +void nocall lstar_enter(void);
>>> +void nocall cstar_enter(void);
>>>
>>> /*
>>> * Sets up system tables and descriptors for IDT devliery.
>>> @@ -138,6 +142,95 @@ static void load_system_tables(void)
>>> BUG_ON(stack_bottom & 15);
>>> }
>>>
>>> +static unsigned int write_stub_trampoline(
>>> + unsigned char *stub, unsigned long stub_va,
>>> + unsigned long stack_bottom, unsigned long target_va)
>>> +{
>>> + unsigned char *p = stub;
>>> +
>>> + if ( cpu_has_xen_ibt )
>>> + {
>>> + place_endbr64(p);
>>> + p += 4;
>>> + }
>>> +
>>> + /* Store guest %rax into %ss slot */
>>> + /* movabsq %rax, stack_bottom - 8 */
>>> + *p++ = 0x48;
>>> + *p++ = 0xa3;
>>> + *(uint64_t *)p = stack_bottom - 8;
>>> + p += 8;
>>> +
>>> + /* Store guest %rsp in %rax */
>>> + /* movq %rsp, %rax */
>>> + *p++ = 0x48;
>>> + *p++ = 0x89;
>>> + *p++ = 0xe0;
>>> +
>>> + /* Switch to Xen stack */
>>> + /* movabsq $stack_bottom - 8, %rsp */
>>> + *p++ = 0x48;
>>> + *p++ = 0xbc;
>>> + *(uint64_t *)p = stack_bottom - 8;
>>> + p += 8;
>>> +
>>> + /* jmp target_va */
>>> + *p++ = 0xe9;
>>> + *(int32_t *)p = target_va - (stub_va + (p - stub) + 4);
>>> + p += 4;
>>> +
>>> + /* Round up to a multiple of 16 bytes. */
>>> + return ROUNDUP(p - stub, 16);
>>> +}
>>> +
>>> +static void legacy_syscall_init(void)
>>> +{
>>> + unsigned long stack_bottom = get_stack_bottom();
>>> + unsigned long stub_va = this_cpu(stubs.addr);
>>> + unsigned char *stub_page;
>>> + unsigned int offset;
>>> +
>>> + /* No PV guests? No need to set up SYSCALL/SYSENTER infrastructure. */
>>> + if ( !IS_ENABLED(CONFIG_PV) )
>>> + return;
>>> +
>>> + stub_page = map_domain_page(_mfn(this_cpu(stubs.mfn)));
>>> +
>>> + /*
>>> + * Trampoline for SYSCALL entry from 64-bit mode. The VT-x HVM vcpu
>>> + * context switch logic relies on the SYSCALL trampoline being at the
>>> + * start of the stubs.
>>> + */
>>> + wrmsrl(MSR_LSTAR, stub_va);
>>> + offset = write_stub_trampoline(stub_page + (stub_va & ~PAGE_MASK),
>>> + stub_va, stack_bottom,
>>> + (unsigned long)lstar_enter);
>>> + stub_va += offset;
>>> +
>>> + if ( cpu_has_sep )
>>> + {
>>> + /* SYSENTER entry. */
>>> + wrmsrl(MSR_IA32_SYSENTER_ESP, stack_bottom);
>>> + wrmsrl(MSR_IA32_SYSENTER_EIP, (unsigned long)sysenter_entry);
>>> + wrmsr(MSR_IA32_SYSENTER_CS, __HYPERVISOR_CS, 0);
>>> + }
>>> +
>>> + /* Trampoline for SYSCALL entry from compatibility mode. */
>>> + wrmsrl(MSR_CSTAR, stub_va);
>>> + offset += write_stub_trampoline(stub_page + (stub_va & ~PAGE_MASK),
>>> + stub_va, stack_bottom,
>>> + (unsigned long)cstar_enter);
>>> +
>>> + /* Don't consume more than half of the stub space here. */
>>> + ASSERT(offset <= STUB_BUF_SIZE / 2);
>>> +
>>> + unmap_domain_page(stub_page);
>>> +
>>> + /* Common SYSCALL parameters. */
>>> + wrmsrl(MSR_STAR, XEN_MSR_STAR);
>>> + wrmsrl(MSR_SYSCALL_MASK, XEN_SYSCALL_MASK);
>>> +}
>> These want adjusting to use wrmsrns(), similarly to the previous patch.
>> Fixed locally.
> Also the one higher in the function, I suppose.
All of them.
I'm not aware of anywhere where we want serialising behaviour, except for
ICR which is buggily non-serialising and has workarounds.
But I'm also not sure enough of this to suggest that we make wrmsr() be
wrmsrns() by default.
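
As an aside on the trampoline quoted above, a minimal user-space sketch of the byte emission and the rel32 computation for the final jmp. The function name demo_write_trampoline is hypothetical; the sketch omits the optional endbr64 and the rounding to 16 bytes, and uses memcpy() for the immediates rather than the casts in the patch.

```c
#include <stdint.h>
#include <string.h>

/* Emit the trampoline into stub; returns the number of bytes written. */
static unsigned int demo_write_trampoline(
    unsigned char *stub, uint64_t stub_va,
    uint64_t stack_bottom, uint64_t target_va)
{
    unsigned char *p = stub;
    uint64_t slot = stack_bottom - 8;
    int32_t disp;

    /* movabsq %rax, stack_bottom - 8 */
    *p++ = 0x48;
    *p++ = 0xa3;
    memcpy(p, &slot, sizeof(slot));
    p += sizeof(slot);

    /* movq %rsp, %rax */
    *p++ = 0x48;
    *p++ = 0x89;
    *p++ = 0xe0;

    /* movabsq $stack_bottom - 8, %rsp */
    *p++ = 0x48;
    *p++ = 0xbc;
    memcpy(p, &slot, sizeof(slot));
    p += sizeof(slot);

    /* jmp target_va: rel32 is relative to the end of the instruction. */
    *p++ = 0xe9;
    disp = (int32_t)(target_va - (stub_va + (uint64_t)(p - stub) + 4));
    memcpy(p, &disp, sizeof(disp));
    p += sizeof(disp);

    return (unsigned int)(p - stub);
}
```

The displacement is taken from the stub's virtual address rather than from the address of the temporary mapping, since that is where the code will execute from.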
> Acked-by: Jan Beulich <jbeulich@suse.com>
Thanks.
~Andrew
* Re: [PATCH 13/22] x86: FRED enumerations
2025-08-08 20:23 ` [PATCH 13/22] x86: FRED enumerations Andrew Cooper
@ 2025-08-13 12:28 ` Andrew Cooper
2025-08-14 7:30 ` Jan Beulich
2025-08-14 11:20 ` Jan Beulich
2025-08-18 9:02 ` Jan Beulich
2 siblings, 1 reply; 120+ messages in thread
From: Andrew Cooper @ 2025-08-13 12:28 UTC (permalink / raw)
To: Xen-devel; +Cc: Jan Beulich, Roger Pau Monné
On 08/08/2025 9:23 pm, Andrew Cooper wrote:
> diff --git a/xen/arch/x86/Kconfig b/xen/arch/x86/Kconfig
> index a45ce106e210..90cbad13a7c7 100644
> --- a/xen/arch/x86/Kconfig
> +++ b/xen/arch/x86/Kconfig
> @@ -57,6 +57,10 @@ config HAS_CC_CET_IBT
> # Retpoline check to work around https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93654
> def_bool $(cc-option,-fcf-protection=branch -mmanual-endbr -mindirect-branch=thunk-extern) && $(as-instr,endbr64)
>
> +config HAS_AS_FRED
> + # binutils >= 2.41 or LLVM >= 19
> + def_bool $(as-instr,eretu;lkgs %ax)
> +
> menu "Architecture Features"
>
> source "arch/x86/Kconfig.cpu"
> diff --git a/xen/arch/x86/include/asm/asm-defns.h b/xen/arch/x86/include/asm/asm-defns.h
> index 61a5faf90446..2e5200b94b82 100644
> --- a/xen/arch/x86/include/asm/asm-defns.h
> +++ b/xen/arch/x86/include/asm/asm-defns.h
> @@ -4,6 +4,15 @@
> .byte 0x0f, 0x01, 0xfc
> .endm
>
> +#ifndef CONFIG_HAS_AS_FRED
> +.macro eretu
> + .byte 0xf3, 0x0f, 0x01, 0xca
> +.endm
> +.macro erets
> + .byte 0xf2, 0x0f, 0x01, 0xca
> +.endm
> +#endif
Seeing as I know you are going to be unhappy with the Kconfig...
I think I'm dev complete on the PV support now, and there's not an LKGS
in sight.
We don't strictly need the conditional in asm-defns.h, and if we don't
need it in C either then we can drop the Kconfig entirely.
~Andrew
* Re: [PATCH 09/22] x86/traps: Move load_system_tables() into traps-setup.c
2025-08-13 11:36 ` Andrew Cooper
@ 2025-08-14 7:26 ` Jan Beulich
2025-08-14 18:20 ` Andrew Cooper
0 siblings, 1 reply; 120+ messages in thread
From: Jan Beulich @ 2025-08-14 7:26 UTC (permalink / raw)
To: Andrew Cooper; +Cc: Xen-devel, Roger Pau Monné, Nicola Vetrini
On 13.08.2025 13:36, Andrew Cooper wrote:
> On 12/08/2025 10:43 am, Nicola Vetrini wrote:
>> On 2025-08-08 22:23, Andrew Cooper wrote:
>>> diff --git a/xen/arch/x86/traps-setup.c b/xen/arch/x86/traps-setup.c
>>> index 8ca379c9e4cb..13b8fcf0ba51 100644
>>> --- a/xen/arch/x86/traps-setup.c
>>> +++ b/xen/arch/x86/traps-setup.c
>>> @@ -19,6 +20,124 @@ boolean_param("ler", opt_ler);
>>>
>>> void nocall entry_PF(void);
>>>
>>> +/*
>>> + * Sets up system tables and descriptors for IDT devliery.
>>> + *
>>> + * - Sets up TSS with stack pointers, including ISTs
>>> + * - Inserts TSS selector into regular and compat GDTs
>>> + * - Loads GDT, IDT, TR then null LDT
>>> + * - Sets up IST references in the IDT
>>> + */
>>> +static void load_system_tables(void)
>>> +{
>>> + unsigned int i, cpu = smp_processor_id();
>>> + unsigned long stack_bottom = get_stack_bottom(),
>>> + stack_top = stack_bottom & ~(STACK_SIZE - 1);
>>> + /*
>>> + * NB: define tss_page as a local variable because clang 3.5 doesn't
>>> + * support using ARRAY_SIZE against per-cpu variables.
>>> + */
>>> + struct tss_page *tss_page = &this_cpu(tss_page);
>>> + idt_entry_t *idt = this_cpu(idt);
>>> +
>>
>> Given the clang baseline this might not be needed anymore?
>
> Hmm. While true, looking at 51461114e26, the code is definitely better
> written with the tss_page variable and we wouldn't want to go back to
> the old form.
>
> I think that I'll simply drop the comment.
>
> ~Andrew
>
> P.S.
>
> Generally speaking, because of the RELOC_HIDE() in this_cpu(), any time
> you ever want two accesses to a variable, it's better (code gen wise) to
> construct a pointer to it and use the pointer multiple times.
>
> I don't understand why there's a RELOC_HIDE() in this_cpu(). The
> justification doesn't make sense, but I've not had time to explore what
> happens if we take it out.
There's no justification in xen/percpu.h?
My understanding is that we simply may not expose any accesses to per_cpu_*
variables directly to the compiler, or there's a risk that it might access
the "master" variable (i.e. CPU0's on at least x86).
Jan
* Re: [PATCH 13/22] x86: FRED enumerations
2025-08-13 12:28 ` Andrew Cooper
@ 2025-08-14 7:30 ` Jan Beulich
0 siblings, 0 replies; 120+ messages in thread
From: Jan Beulich @ 2025-08-14 7:30 UTC (permalink / raw)
To: Andrew Cooper; +Cc: Roger Pau Monné, Xen-devel
On 13.08.2025 14:28, Andrew Cooper wrote:
> On 08/08/2025 9:23 pm, Andrew Cooper wrote:
>> diff --git a/xen/arch/x86/Kconfig b/xen/arch/x86/Kconfig
>> index a45ce106e210..90cbad13a7c7 100644
>> --- a/xen/arch/x86/Kconfig
>> +++ b/xen/arch/x86/Kconfig
>> @@ -57,6 +57,10 @@ config HAS_CC_CET_IBT
>> # Retpoline check to work around https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93654
>> def_bool $(cc-option,-fcf-protection=branch -mmanual-endbr -mindirect-branch=thunk-extern) && $(as-instr,endbr64)
>>
>> +config HAS_AS_FRED
>> + # binutils >= 2.41 or LLVM >= 19
>> + def_bool $(as-instr,eretu;lkgs %ax)
>> +
>> menu "Architecture Features"
>>
>> source "arch/x86/Kconfig.cpu"
>> diff --git a/xen/arch/x86/include/asm/asm-defns.h b/xen/arch/x86/include/asm/asm-defns.h
>> index 61a5faf90446..2e5200b94b82 100644
>> --- a/xen/arch/x86/include/asm/asm-defns.h
>> +++ b/xen/arch/x86/include/asm/asm-defns.h
>> @@ -4,6 +4,15 @@
>> .byte 0x0f, 0x01, 0xfc
>> .endm
>>
>> +#ifndef CONFIG_HAS_AS_FRED
>> +.macro eretu
>> + .byte 0xf3, 0x0f, 0x01, 0xca
>> +.endm
>> +.macro erets
>> + .byte 0xf2, 0x0f, 0x01, 0xca
>> +.endm
>> +#endif
>
> Seeing as I know you are going to be unhappy with the Kconfig...
Well, we've got several, so adding one more isn't going to be the end of
the world. What we really need to do is finally settle for one method,
and then switch everything over to whatever it is going to be. I've
taken note to set up another design session on the topic in San Jose.
> I think I'm dev complete on the PV support now, and there's not an LKGS
> in sight.
>
> We don't strictly need the conditional in asm-defns.h, and if we don't
> need it in C either then we can drop the Kconfig entirely.
Yeah, might be best for now.
Jan
* Re: [PATCH 03/22] x86/traps: Drop incorrect BUILD_BUG_ON() and comment in load_system_tables()
2025-08-13 9:40 ` Andrew Cooper
@ 2025-08-14 8:50 ` Jan Beulich
0 siblings, 0 replies; 120+ messages in thread
From: Jan Beulich @ 2025-08-14 8:50 UTC (permalink / raw)
To: Andrew Cooper; +Cc: Roger Pau Monné, Xen-devel
On 13.08.2025 11:40, Andrew Cooper wrote:
> On 12/08/2025 9:11 am, Jan Beulich wrote:
>> On 08.08.2025 22:22, Andrew Cooper wrote:
>>> This was added erroneously by me.
>>>
>>> Hardware task switching does demand a TSS of at least 0x67 bytes, but that's
>>> not relevant in 64bit, and not relevant for Xen since commit
>>> 5d1181a5ea5e ("xen: Remove x86_32 build target.") in 2012.
>>>
>>> We already load a 0-length TSS in early_traps_init() demonstrating that it's
>>> possible.
>>>
>>> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
>>> ---
>>> CC: Jan Beulich <JBeulich@suse.com>
>>> CC: Roger Pau Monné <roger.pau@citrix.com>
>>> ---
>>> xen/arch/x86/cpu/common.c | 2 --
>>> 1 file changed, 2 deletions(-)
>>>
>>> diff --git a/xen/arch/x86/cpu/common.c b/xen/arch/x86/cpu/common.c
>>> index f6ec5c9df522..cdc41248d4e9 100644
>>> --- a/xen/arch/x86/cpu/common.c
>>> +++ b/xen/arch/x86/cpu/common.c
>>> @@ -936,8 +936,6 @@ void load_system_tables(void)
>>> wrmsrl(MSR_ISST, (unsigned long)ist_ssp);
>>> }
>>>
>>> - BUILD_BUG_ON(sizeof(*tss) <= 0x67); /* Mandated by the architecture. */
>>> -
>>> _set_tssldt_desc(gdt + TSS_ENTRY, (unsigned long)tss,
>>> sizeof(*tss) - 1, SYS_DESC_tss_avail);
>>> if ( IS_ENABLED(CONFIG_PV32) )
>> Well, the comment is wrong. Whether the BUILD_BUG_ON() itself is also wrong
>> depends on our intentions with the structure. Don't we need it to be that
>> size for everything (incl I/O bitmap) to work correctly elsewhere?
>
> We don't use the IO bitmap. We've talked about it a few times, but
> never got it sorted.
>
> Xen's TSS could be as short as 0x37 (covering IST3) and still work
> correctly and safely (as there's no task switching).
Then shouldn't we have a BUILD_BUG_ON(sizeof(*tss) <= 0x37) here? Hmm,
arguably that gets pretty close to useless, though, so
Acked-by: Jan Beulich <jbeulich@suse.com>
> A failure to read tss->iopb is the same as a failure to read the bitmap
> itself. In fact, it's probably marginally faster for users of
> IOBMP_INVALID_OFFSET as it fails one step earlier.
(Provided there are no errata lurking anywhere.)
Jan
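
For illustration, a minimal sketch of where the 0x67 figure discussed above comes from, assuming the architectural 64-bit TSS layout. The names sketch_tss64 and sketch_tss_limit() are hypothetical and are not Xen's own definitions; the packed attribute assumes GCC or Clang.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a 64-bit TSS; field names are illustrative. */
struct __attribute__((packed)) sketch_tss64 {
    uint32_t rsvd0;
    uint64_t rsp0, rsp1, rsp2;
    uint64_t rsvd1;
    uint64_t ist[7];
    uint64_t rsvd2;
    uint16_t rsvd3;
    uint16_t iopb;      /* I/O permission bitmap offset */
};

/* The full structure is 0x68 bytes, with iopb as its last two bytes. */
_Static_assert(sizeof(struct sketch_tss64) == 0x68, "unexpected TSS size");
_Static_assert(offsetof(struct sketch_tss64, iopb) == 0x66, "unexpected iopb offset");

/* A descriptor limit is the index of the last valid byte: size - 1. */
static unsigned int sketch_tss_limit(void)
{
    return sizeof(struct sketch_tss64) - 1;
}
```

With this layout the limit covering the whole structure is 0x67, which is the value hardware task switching demands. A shorter limit only cuts off trailing fields, which is the case being argued about above.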
^ permalink raw reply [flat|nested] 120+ messages in thread
* Re: [PATCH 08/22] x86/traps: Introduce ap_early_traps_init() and set up exception handling earlier
2025-08-13 11:13 ` Andrew Cooper
@ 2025-08-14 8:53 ` Jan Beulich
0 siblings, 0 replies; 120+ messages in thread
From: Jan Beulich @ 2025-08-14 8:53 UTC (permalink / raw)
To: Andrew Cooper; +Cc: Roger Pau Monné, Xen-devel
On 13.08.2025 13:13, Andrew Cooper wrote:
> On 12/08/2025 9:41 am, Jan Beulich wrote:
>> On 08.08.2025 22:23, Andrew Cooper wrote:
>>> --- a/xen/arch/x86/acpi/wakeup_prot.S
>>> +++ b/xen/arch/x86/acpi/wakeup_prot.S
>>> @@ -63,6 +63,9 @@ LABEL(s3_resume)
>>> pushq %rax
>>> lretq
>>> 1:
>>> + /* Set up early exceptions and CET before entering C properly. */
>>> + call ap_early_traps_init
>> But this is the BSP?
>
> By the end of the cleanup, what we have is:
>
> At boot only:
> * really early init, basic exception handling only
> * regular init (inc syscall trampolines)
> * late re-init as we change the stack linear address
>
> For everything else (APs, S3, hot-online):
> * early, full exception handling
> * regular init (inc syscall trampolines)
>
>
> Currently, these are named:
> * bsp_early_traps_init()
> * traps_init()
> * bsp_traps_reinit()
>
> and
> * ap_early_traps_init()
> * percpu_traps_init()
>
>
> Perhaps ap_early_traps_init() should be named
> percpu_early_traps_init()? But I'm open to suggestions.
That name looks like a better fit to me, yes.
Jan
^ permalink raw reply [flat|nested] 120+ messages in thread
* Re: [PATCH 09/22] x86/traps: Move load_system_tables() into traps-setup.c
2025-08-13 11:25 ` Andrew Cooper
@ 2025-08-14 8:55 ` Jan Beulich
2025-08-14 18:09 ` Andrew Cooper
0 siblings, 1 reply; 120+ messages in thread
From: Jan Beulich @ 2025-08-14 8:55 UTC (permalink / raw)
To: Andrew Cooper; +Cc: Roger Pau Monné, Xen-devel
On 13.08.2025 13:25, Andrew Cooper wrote:
> On 12/08/2025 10:19 am, Jan Beulich wrote:
>> On 08.08.2025 22:23, Andrew Cooper wrote:
>>> Since commit a35816b5cae8 ("x86/traps: Introduce early_traps_init() and
>>> simplify setup"), load_system_tables() is called later on the BSP, so the
>>> SYS_STATE_early_boot check can be dropped from the safety BUG_ON().
>>>
>>> Move the BUILD_BUG_ON() into build_assertions(),
>> I'm not quite convinced of this move - having the related BUILD_BUG_ON()
>> and BUG_ON() next to each other would seem better to me.
>
> I don't see a specific reason for them to be together, and the comment
> explains what's going on.
>
> With FRED, we want a related BUILD_BUG_ON(), but there's no equivalent
> BUG_ON() because MSR_RSP_SL0 will #GP on being misaligned.
That BUILD_BUG_ON() could then sit next to the MSR write? Unless of course
that ends up sitting in an assembly source.
Jan
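
For illustration, a rough sketch of pairing a compile-time alignment check with the value it protects, in the spirit of keeping the BUILD_BUG_ON() next to the MSR write. Everything here is hypothetical: SKETCH_STACK_ALIGN, struct sketch_stack_frame and both helpers are invented for the example, and the 64-byte figure is an assumption about what the stack level MSRs require, not something stated in this thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed alignment requirement; hypothetical, for illustration only. */
#define SKETCH_STACK_ALIGN 64u

/* Hypothetical block reserved at the top of a per-CPU stack. */
struct sketch_stack_frame {
    unsigned char bytes[192];
};

/*
 * Compile-time check: if the reserved block is a multiple of the alignment,
 * then an aligned stack end yields an aligned stack pointer value.
 */
_Static_assert(sizeof(struct sketch_stack_frame) % SKETCH_STACK_ALIGN == 0,
               "reserved block would misalign the stack pointer");

/* Runtime equivalent of a BUG_ON(addr & (align - 1)) style check. */
static bool sketch_is_aligned(uint64_t addr, uint64_t align)
{
    return (addr & (align - 1)) == 0;
}

/* Value that would be written to the stack pointer MSR. */
static uint64_t sketch_stack_top(uint64_t stack_end)
{
    return stack_end - sizeof(struct sketch_stack_frame);
}
```

The compile-time assertion covers the structure size, while the hardware fault on a misaligned write covers the runtime case, so no separate runtime check is needed next to it.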
^ permalink raw reply [flat|nested] 120+ messages in thread
* Re: [PATCH 10/22] x86/traps: Move subarch_percpu_traps_init() into traps-setup.c
2025-08-13 11:53 ` Andrew Cooper
@ 2025-08-14 8:58 ` Jan Beulich
2025-08-14 10:17 ` Andrew Cooper
0 siblings, 1 reply; 120+ messages in thread
From: Jan Beulich @ 2025-08-14 8:58 UTC (permalink / raw)
To: Andrew Cooper; +Cc: Roger Pau Monné, Xen-devel
On 13.08.2025 13:53, Andrew Cooper wrote:
> On 12/08/2025 10:52 am, Jan Beulich wrote:
>> On 11.08.2025 10:17, Andrew Cooper wrote:
>>> On 08/08/2025 9:23 pm, Andrew Cooper wrote:
>>>> ... along with the supporting functions. Switch to Xen coding style, and make
>>>> static as there are no external callers.
>>>>
>>>> Rename to legacy_syscall_init() as a more accurate name.
>>>>
>>>> No functional change.
>>>>
>>>> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
>>>> ---
>>>> CC: Jan Beulich <JBeulich@suse.com>
>>>> CC: Roger Pau Monné <roger.pau@citrix.com>
>>>> ---
>>>> xen/arch/x86/include/asm/system.h | 2 -
>>>> xen/arch/x86/traps-setup.c | 97 ++++++++++++++++++++++++++++++-
>>>> xen/arch/x86/x86_64/traps.c | 92 -----------------------------
>>>> 3 files changed, 95 insertions(+), 96 deletions(-)
>>>>
>>>> diff --git a/xen/arch/x86/include/asm/system.h b/xen/arch/x86/include/asm/system.h
>>>> index 3cdc56e4ba6d..6c2800d8158d 100644
>>>> --- a/xen/arch/x86/include/asm/system.h
>>>> +++ b/xen/arch/x86/include/asm/system.h
>>>> @@ -256,6 +256,4 @@ static inline int local_irq_is_enabled(void)
>>>> #define BROKEN_ACPI_Sx 0x0001
>>>> #define BROKEN_INIT_AFTER_S1 0x0002
>>>>
>>>> -void subarch_percpu_traps_init(void);
>>>> -
>>>> #endif
>>>> diff --git a/xen/arch/x86/traps-setup.c b/xen/arch/x86/traps-setup.c
>>>> index 13b8fcf0ba51..fbae7072c292 100644
>>>> --- a/xen/arch/x86/traps-setup.c
>>>> +++ b/xen/arch/x86/traps-setup.c
>>>> @@ -2,13 +2,15 @@
>>>> /*
>>>> * Configuration of event handling for all CPUs.
>>>> */
>>>> +#include <xen/domain_page.h>
>>>> #include <xen/init.h>
>>>> #include <xen/param.h>
>>>>
>>>> +#include <asm/endbr.h>
>>>> #include <asm/idt.h>
>>>> #include <asm/msr.h>
>>>> #include <asm/shstk.h>
>>>> -#include <asm/system.h>
>>>> +#include <asm/stubs.h>
>>>> #include <asm/traps.h>
>>>>
>>>> DEFINE_PER_CPU_READ_MOSTLY(idt_entry_t *, idt);
>>>> @@ -19,6 +21,8 @@ static bool __initdata opt_ler;
>>>> boolean_param("ler", opt_ler);
>>>>
>>>> void nocall entry_PF(void);
>>>> +void nocall lstar_enter(void);
>>>> +void nocall cstar_enter(void);
>>>>
>>>> /*
>>>> * Sets up system tables and descriptors for IDT devliery.
>>>> @@ -138,6 +142,95 @@ static void load_system_tables(void)
>>>> BUG_ON(stack_bottom & 15);
>>>> }
>>>>
>>>> +static unsigned int write_stub_trampoline(
>>>> + unsigned char *stub, unsigned long stub_va,
>>>> + unsigned long stack_bottom, unsigned long target_va)
>>>> +{
>>>> + unsigned char *p = stub;
>>>> +
>>>> + if ( cpu_has_xen_ibt )
>>>> + {
>>>> + place_endbr64(p);
>>>> + p += 4;
>>>> + }
>>>> +
>>>> + /* Store guest %rax into %ss slot */
>>>> + /* movabsq %rax, stack_bottom - 8 */
>>>> + *p++ = 0x48;
>>>> + *p++ = 0xa3;
>>>> + *(uint64_t *)p = stack_bottom - 8;
>>>> + p += 8;
>>>> +
>>>> + /* Store guest %rsp in %rax */
>>>> + /* movq %rsp, %rax */
>>>> + *p++ = 0x48;
>>>> + *p++ = 0x89;
>>>> + *p++ = 0xe0;
>>>> +
>>>> + /* Switch to Xen stack */
>>>> + /* movabsq $stack_bottom - 8, %rsp */
>>>> + *p++ = 0x48;
>>>> + *p++ = 0xbc;
>>>> + *(uint64_t *)p = stack_bottom - 8;
>>>> + p += 8;
>>>> +
>>>> + /* jmp target_va */
>>>> + *p++ = 0xe9;
>>>> + *(int32_t *)p = target_va - (stub_va + (p - stub) + 4);
>>>> + p += 4;
>>>> +
>>>> + /* Round up to a multiple of 16 bytes. */
>>>> + return ROUNDUP(p - stub, 16);
>>>> +}
>>>> +
>>>> +static void legacy_syscall_init(void)
>>>> +{
>>>> + unsigned long stack_bottom = get_stack_bottom();
>>>> + unsigned long stub_va = this_cpu(stubs.addr);
>>>> + unsigned char *stub_page;
>>>> + unsigned int offset;
>>>> +
>>>> + /* No PV guests? No need to set up SYSCALL/SYSENTER infrastructure. */
>>>> + if ( !IS_ENABLED(CONFIG_PV) )
>>>> + return;
>>>> +
>>>> + stub_page = map_domain_page(_mfn(this_cpu(stubs.mfn)));
>>>> +
>>>> + /*
>>>> + * Trampoline for SYSCALL entry from 64-bit mode. The VT-x HVM vcpu
>>>> + * context switch logic relies on the SYSCALL trampoline being at the
>>>> + * start of the stubs.
>>>> + */
>>>> + wrmsrl(MSR_LSTAR, stub_va);
>>>> + offset = write_stub_trampoline(stub_page + (stub_va & ~PAGE_MASK),
>>>> + stub_va, stack_bottom,
>>>> + (unsigned long)lstar_enter);
>>>> + stub_va += offset;
>>>> +
>>>> + if ( cpu_has_sep )
>>>> + {
>>>> + /* SYSENTER entry. */
>>>> + wrmsrl(MSR_IA32_SYSENTER_ESP, stack_bottom);
>>>> + wrmsrl(MSR_IA32_SYSENTER_EIP, (unsigned long)sysenter_entry);
>>>> + wrmsr(MSR_IA32_SYSENTER_CS, __HYPERVISOR_CS, 0);
>>>> + }
>>>> +
>>>> + /* Trampoline for SYSCALL entry from compatibility mode. */
>>>> + wrmsrl(MSR_CSTAR, stub_va);
>>>> + offset += write_stub_trampoline(stub_page + (stub_va & ~PAGE_MASK),
>>>> + stub_va, stack_bottom,
>>>> + (unsigned long)cstar_enter);
>>>> +
>>>> + /* Don't consume more than half of the stub space here. */
>>>> + ASSERT(offset <= STUB_BUF_SIZE / 2);
>>>> +
>>>> + unmap_domain_page(stub_page);
>>>> +
>>>> + /* Common SYSCALL parameters. */
>>>> + wrmsrl(MSR_STAR, XEN_MSR_STAR);
>>>> + wrmsrl(MSR_SYSCALL_MASK, XEN_SYSCALL_MASK);
>>>> +}
>>> These want adjusting to use wrmsrns(), similarly to the previous patch.
>>> Fixed locally.
>> Also the one higher in the function, I suppose.
>
> All of them.
>
> I'm not aware of anywhere where we want serialising behaviour, except for
> ICR which is buggily non-serialising and has workarounds.
>
> But I'm also not sure enough of this to suggest that we make wrmsr() be
> wrmsrns() by default.
I'm pretty sure we don't want this. If nothing else then to avoid code bloat
for MSR writes which are non-serializing even in the original form.
Jan
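
For illustration, a user-space sketch of the byte sequence that write_stub_trampoline() in the quoted patch emits, assuming a little-endian host. sketch_write_trampoline() and its ibt parameter are hypothetical stand-ins (the patch tests cpu_has_xen_ibt and calls place_endbr64() instead), and memcpy() replaces the pointer casts so the sketch carries no alignment assumptions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_ROUNDUP16(x) (((x) + 15u) & ~15u)

/* Hypothetical user-space model; returns the space consumed, in bytes. */
static unsigned int sketch_write_trampoline(
    unsigned char *stub, uint64_t stub_va,
    uint64_t stack_bottom, uint64_t target_va, bool ibt)
{
    static const unsigned char endbr64[] = { 0xf3, 0x0f, 0x1e, 0xfa };
    unsigned char *p = stub;
    uint64_t slot = stack_bottom - 8;
    int32_t disp;

    if ( ibt )
    {
        memcpy(p, endbr64, sizeof(endbr64));
        p += sizeof(endbr64);
    }

    /* movabsq %rax, slot: store guest %rax into the %ss slot. */
    *p++ = 0x48;
    *p++ = 0xa3;
    memcpy(p, &slot, sizeof(slot));
    p += sizeof(slot);

    /* movq %rsp, %rax: keep the guest %rsp in %rax. */
    *p++ = 0x48;
    *p++ = 0x89;
    *p++ = 0xe0;

    /* movabsq $slot, %rsp: switch to the hypervisor stack. */
    *p++ = 0x48;
    *p++ = 0xbc;
    memcpy(p, &slot, sizeof(slot));
    p += sizeof(slot);

    /* jmp rel32: displacement is relative to the end of the instruction. */
    *p++ = 0xe9;
    disp = (int32_t)(target_va - (stub_va + (uint64_t)(p - stub) + 4));
    memcpy(p, &disp, sizeof(disp));
    p += sizeof(disp);

    /* Round up to a multiple of 16 bytes. */
    return SKETCH_ROUNDUP16((unsigned int)(p - stub));
}
```

The sequence is 28 bytes without the ENDBR64 prefix and 32 bytes with it, so either way one trampoline occupies 32 bytes after rounding.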
^ permalink raw reply [flat|nested] 120+ messages in thread
* Re: [PATCH 10/22] x86/traps: Move subarch_percpu_traps_init() into traps-setup.c
2025-08-14 8:58 ` Jan Beulich
@ 2025-08-14 10:17 ` Andrew Cooper
2025-08-14 10:52 ` Jan Beulich
0 siblings, 1 reply; 120+ messages in thread
From: Andrew Cooper @ 2025-08-14 10:17 UTC (permalink / raw)
To: Jan Beulich; +Cc: Roger Pau Monné, Xen-devel
On 14/08/2025 9:58 am, Jan Beulich wrote:
> On 13.08.2025 13:53, Andrew Cooper wrote:
>> On 12/08/2025 10:52 am, Jan Beulich wrote:
>>> On 11.08.2025 10:17, Andrew Cooper wrote:
>>>> On 08/08/2025 9:23 pm, Andrew Cooper wrote:
>>>>> ... along with the supporting functions. Switch to Xen coding style, and make
>>>>> static as there are no external callers.
>>>>>
>>>>> Rename to legacy_syscall_init() as a more accurate name.
>>>>>
>>>>> No functional change.
>>>>>
>>>>> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
>>>>> ---
>>>>> CC: Jan Beulich <JBeulich@suse.com>
>>>>> CC: Roger Pau Monné <roger.pau@citrix.com>
>>>>> ---
>>>>> xen/arch/x86/include/asm/system.h | 2 -
>>>>> xen/arch/x86/traps-setup.c | 97 ++++++++++++++++++++++++++++++-
>>>>> xen/arch/x86/x86_64/traps.c | 92 -----------------------------
>>>>> 3 files changed, 95 insertions(+), 96 deletions(-)
>>>>>
>>>>> diff --git a/xen/arch/x86/include/asm/system.h b/xen/arch/x86/include/asm/system.h
>>>>> index 3cdc56e4ba6d..6c2800d8158d 100644
>>>>> --- a/xen/arch/x86/include/asm/system.h
>>>>> +++ b/xen/arch/x86/include/asm/system.h
>>>>> @@ -256,6 +256,4 @@ static inline int local_irq_is_enabled(void)
>>>>> #define BROKEN_ACPI_Sx 0x0001
>>>>> #define BROKEN_INIT_AFTER_S1 0x0002
>>>>>
>>>>> -void subarch_percpu_traps_init(void);
>>>>> -
>>>>> #endif
>>>>> diff --git a/xen/arch/x86/traps-setup.c b/xen/arch/x86/traps-setup.c
>>>>> index 13b8fcf0ba51..fbae7072c292 100644
>>>>> --- a/xen/arch/x86/traps-setup.c
>>>>> +++ b/xen/arch/x86/traps-setup.c
>>>>> @@ -2,13 +2,15 @@
>>>>> /*
>>>>> * Configuration of event handling for all CPUs.
>>>>> */
>>>>> +#include <xen/domain_page.h>
>>>>> #include <xen/init.h>
>>>>> #include <xen/param.h>
>>>>>
>>>>> +#include <asm/endbr.h>
>>>>> #include <asm/idt.h>
>>>>> #include <asm/msr.h>
>>>>> #include <asm/shstk.h>
>>>>> -#include <asm/system.h>
>>>>> +#include <asm/stubs.h>
>>>>> #include <asm/traps.h>
>>>>>
>>>>> DEFINE_PER_CPU_READ_MOSTLY(idt_entry_t *, idt);
>>>>> @@ -19,6 +21,8 @@ static bool __initdata opt_ler;
>>>>> boolean_param("ler", opt_ler);
>>>>>
>>>>> void nocall entry_PF(void);
>>>>> +void nocall lstar_enter(void);
>>>>> +void nocall cstar_enter(void);
>>>>>
>>>>> /*
>>>>> * Sets up system tables and descriptors for IDT devliery.
>>>>> @@ -138,6 +142,95 @@ static void load_system_tables(void)
>>>>> BUG_ON(stack_bottom & 15);
>>>>> }
>>>>>
>>>>> +static unsigned int write_stub_trampoline(
>>>>> + unsigned char *stub, unsigned long stub_va,
>>>>> + unsigned long stack_bottom, unsigned long target_va)
>>>>> +{
>>>>> + unsigned char *p = stub;
>>>>> +
>>>>> + if ( cpu_has_xen_ibt )
>>>>> + {
>>>>> + place_endbr64(p);
>>>>> + p += 4;
>>>>> + }
>>>>> +
>>>>> + /* Store guest %rax into %ss slot */
>>>>> + /* movabsq %rax, stack_bottom - 8 */
>>>>> + *p++ = 0x48;
>>>>> + *p++ = 0xa3;
>>>>> + *(uint64_t *)p = stack_bottom - 8;
>>>>> + p += 8;
>>>>> +
>>>>> + /* Store guest %rsp in %rax */
>>>>> + /* movq %rsp, %rax */
>>>>> + *p++ = 0x48;
>>>>> + *p++ = 0x89;
>>>>> + *p++ = 0xe0;
>>>>> +
>>>>> + /* Switch to Xen stack */
>>>>> + /* movabsq $stack_bottom - 8, %rsp */
>>>>> + *p++ = 0x48;
>>>>> + *p++ = 0xbc;
>>>>> + *(uint64_t *)p = stack_bottom - 8;
>>>>> + p += 8;
>>>>> +
>>>>> + /* jmp target_va */
>>>>> + *p++ = 0xe9;
>>>>> + *(int32_t *)p = target_va - (stub_va + (p - stub) + 4);
>>>>> + p += 4;
>>>>> +
>>>>> + /* Round up to a multiple of 16 bytes. */
>>>>> + return ROUNDUP(p - stub, 16);
>>>>> +}
>>>>> +
>>>>> +static void legacy_syscall_init(void)
>>>>> +{
>>>>> + unsigned long stack_bottom = get_stack_bottom();
>>>>> + unsigned long stub_va = this_cpu(stubs.addr);
>>>>> + unsigned char *stub_page;
>>>>> + unsigned int offset;
>>>>> +
>>>>> + /* No PV guests? No need to set up SYSCALL/SYSENTER infrastructure. */
>>>>> + if ( !IS_ENABLED(CONFIG_PV) )
>>>>> + return;
>>>>> +
>>>>> + stub_page = map_domain_page(_mfn(this_cpu(stubs.mfn)));
>>>>> +
>>>>> + /*
>>>>> + * Trampoline for SYSCALL entry from 64-bit mode. The VT-x HVM vcpu
>>>>> + * context switch logic relies on the SYSCALL trampoline being at the
>>>>> + * start of the stubs.
>>>>> + */
>>>>> + wrmsrl(MSR_LSTAR, stub_va);
>>>>> + offset = write_stub_trampoline(stub_page + (stub_va & ~PAGE_MASK),
>>>>> + stub_va, stack_bottom,
>>>>> + (unsigned long)lstar_enter);
>>>>> + stub_va += offset;
>>>>> +
>>>>> + if ( cpu_has_sep )
>>>>> + {
>>>>> + /* SYSENTER entry. */
>>>>> + wrmsrl(MSR_IA32_SYSENTER_ESP, stack_bottom);
>>>>> + wrmsrl(MSR_IA32_SYSENTER_EIP, (unsigned long)sysenter_entry);
>>>>> + wrmsr(MSR_IA32_SYSENTER_CS, __HYPERVISOR_CS, 0);
>>>>> + }
>>>>> +
>>>>> + /* Trampoline for SYSCALL entry from compatibility mode. */
>>>>> + wrmsrl(MSR_CSTAR, stub_va);
>>>>> + offset += write_stub_trampoline(stub_page + (stub_va & ~PAGE_MASK),
>>>>> + stub_va, stack_bottom,
>>>>> + (unsigned long)cstar_enter);
>>>>> +
>>>>> + /* Don't consume more than half of the stub space here. */
>>>>> + ASSERT(offset <= STUB_BUF_SIZE / 2);
>>>>> +
>>>>> + unmap_domain_page(stub_page);
>>>>> +
>>>>> + /* Common SYSCALL parameters. */
>>>>> + wrmsrl(MSR_STAR, XEN_MSR_STAR);
>>>>> + wrmsrl(MSR_SYSCALL_MASK, XEN_SYSCALL_MASK);
>>>>> +}
>>>> These want adjusting to use wrmsrns(), similarly to the previous patch.
>>>> Fixed locally.
>>> Also the one higher in the function, I suppose.
>> All of them.
>>
>> I'm not aware of anywhere where we want serialising behaviour, except for
>> ICR which is buggily non-serialising and has workarounds.
>>
>> But I'm also not sure enough of this to suggest that we make wrmsr() be
>> wrmsrns() by default.
> I'm pretty sure we don't want this. If nothing else then to avoid code bloat
> for MSR writes which are non-serializing even in the original form.
Even that's complicated.
For FRED, FS/GS_BASE/KERN need changes because the lack of SWAPGS forces
MSR accesses even if we do have FSGSBASE active.
Writes to these were made non-serialising in Zen2 and later, but are
still serialising on Intel. i.e. they need converting to WRMSRNS even
though plain WRMSR would be "fine" on all AMD systems (either because
it's the only option, or because it's non-serialising).
~Andrew
^ permalink raw reply [flat|nested] 120+ messages in thread
* Re: [PATCH 10/22] x86/traps: Move subarch_percpu_traps_init() into traps-setup.c
2025-08-14 10:17 ` Andrew Cooper
@ 2025-08-14 10:52 ` Jan Beulich
2025-08-14 11:02 ` Andrew Cooper
0 siblings, 1 reply; 120+ messages in thread
From: Jan Beulich @ 2025-08-14 10:52 UTC (permalink / raw)
To: Andrew Cooper; +Cc: Roger Pau Monné, Xen-devel
On 14.08.2025 12:17, Andrew Cooper wrote:
> On 14/08/2025 9:58 am, Jan Beulich wrote:
>> On 13.08.2025 13:53, Andrew Cooper wrote:
>>> On 12/08/2025 10:52 am, Jan Beulich wrote:
>>>> On 11.08.2025 10:17, Andrew Cooper wrote:
>>>>> On 08/08/2025 9:23 pm, Andrew Cooper wrote:
>>>>>> ... along with the supporting functions. Switch to Xen coding style, and make
>>>>>> static as there are no external callers.
>>>>>>
>>>>>> Rename to legacy_syscall_init() as a more accurate name.
>>>>>>
>>>>>> No functional change.
>>>>>>
>>>>>> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
>>>>>> ---
>>>>>> CC: Jan Beulich <JBeulich@suse.com>
>>>>>> CC: Roger Pau Monné <roger.pau@citrix.com>
>>>>>> ---
>>>>>> xen/arch/x86/include/asm/system.h | 2 -
>>>>>> xen/arch/x86/traps-setup.c | 97 ++++++++++++++++++++++++++++++-
>>>>>> xen/arch/x86/x86_64/traps.c | 92 -----------------------------
>>>>>> 3 files changed, 95 insertions(+), 96 deletions(-)
>>>>>>
>>>>>> diff --git a/xen/arch/x86/include/asm/system.h b/xen/arch/x86/include/asm/system.h
>>>>>> index 3cdc56e4ba6d..6c2800d8158d 100644
>>>>>> --- a/xen/arch/x86/include/asm/system.h
>>>>>> +++ b/xen/arch/x86/include/asm/system.h
>>>>>> @@ -256,6 +256,4 @@ static inline int local_irq_is_enabled(void)
>>>>>> #define BROKEN_ACPI_Sx 0x0001
>>>>>> #define BROKEN_INIT_AFTER_S1 0x0002
>>>>>>
>>>>>> -void subarch_percpu_traps_init(void);
>>>>>> -
>>>>>> #endif
>>>>>> diff --git a/xen/arch/x86/traps-setup.c b/xen/arch/x86/traps-setup.c
>>>>>> index 13b8fcf0ba51..fbae7072c292 100644
>>>>>> --- a/xen/arch/x86/traps-setup.c
>>>>>> +++ b/xen/arch/x86/traps-setup.c
>>>>>> @@ -2,13 +2,15 @@
>>>>>> /*
>>>>>> * Configuration of event handling for all CPUs.
>>>>>> */
>>>>>> +#include <xen/domain_page.h>
>>>>>> #include <xen/init.h>
>>>>>> #include <xen/param.h>
>>>>>>
>>>>>> +#include <asm/endbr.h>
>>>>>> #include <asm/idt.h>
>>>>>> #include <asm/msr.h>
>>>>>> #include <asm/shstk.h>
>>>>>> -#include <asm/system.h>
>>>>>> +#include <asm/stubs.h>
>>>>>> #include <asm/traps.h>
>>>>>>
>>>>>> DEFINE_PER_CPU_READ_MOSTLY(idt_entry_t *, idt);
>>>>>> @@ -19,6 +21,8 @@ static bool __initdata opt_ler;
>>>>>> boolean_param("ler", opt_ler);
>>>>>>
>>>>>> void nocall entry_PF(void);
>>>>>> +void nocall lstar_enter(void);
>>>>>> +void nocall cstar_enter(void);
>>>>>>
>>>>>> /*
>>>>>> * Sets up system tables and descriptors for IDT devliery.
>>>>>> @@ -138,6 +142,95 @@ static void load_system_tables(void)
>>>>>> BUG_ON(stack_bottom & 15);
>>>>>> }
>>>>>>
>>>>>> +static unsigned int write_stub_trampoline(
>>>>>> + unsigned char *stub, unsigned long stub_va,
>>>>>> + unsigned long stack_bottom, unsigned long target_va)
>>>>>> +{
>>>>>> + unsigned char *p = stub;
>>>>>> +
>>>>>> + if ( cpu_has_xen_ibt )
>>>>>> + {
>>>>>> + place_endbr64(p);
>>>>>> + p += 4;
>>>>>> + }
>>>>>> +
>>>>>> + /* Store guest %rax into %ss slot */
>>>>>> + /* movabsq %rax, stack_bottom - 8 */
>>>>>> + *p++ = 0x48;
>>>>>> + *p++ = 0xa3;
>>>>>> + *(uint64_t *)p = stack_bottom - 8;
>>>>>> + p += 8;
>>>>>> +
>>>>>> + /* Store guest %rsp in %rax */
>>>>>> + /* movq %rsp, %rax */
>>>>>> + *p++ = 0x48;
>>>>>> + *p++ = 0x89;
>>>>>> + *p++ = 0xe0;
>>>>>> +
>>>>>> + /* Switch to Xen stack */
>>>>>> + /* movabsq $stack_bottom - 8, %rsp */
>>>>>> + *p++ = 0x48;
>>>>>> + *p++ = 0xbc;
>>>>>> + *(uint64_t *)p = stack_bottom - 8;
>>>>>> + p += 8;
>>>>>> +
>>>>>> + /* jmp target_va */
>>>>>> + *p++ = 0xe9;
>>>>>> + *(int32_t *)p = target_va - (stub_va + (p - stub) + 4);
>>>>>> + p += 4;
>>>>>> +
>>>>>> + /* Round up to a multiple of 16 bytes. */
>>>>>> + return ROUNDUP(p - stub, 16);
>>>>>> +}
>>>>>> +
>>>>>> +static void legacy_syscall_init(void)
>>>>>> +{
>>>>>> + unsigned long stack_bottom = get_stack_bottom();
>>>>>> + unsigned long stub_va = this_cpu(stubs.addr);
>>>>>> + unsigned char *stub_page;
>>>>>> + unsigned int offset;
>>>>>> +
>>>>>> + /* No PV guests? No need to set up SYSCALL/SYSENTER infrastructure. */
>>>>>> + if ( !IS_ENABLED(CONFIG_PV) )
>>>>>> + return;
>>>>>> +
>>>>>> + stub_page = map_domain_page(_mfn(this_cpu(stubs.mfn)));
>>>>>> +
>>>>>> + /*
>>>>>> + * Trampoline for SYSCALL entry from 64-bit mode. The VT-x HVM vcpu
>>>>>> + * context switch logic relies on the SYSCALL trampoline being at the
>>>>>> + * start of the stubs.
>>>>>> + */
>>>>>> + wrmsrl(MSR_LSTAR, stub_va);
>>>>>> + offset = write_stub_trampoline(stub_page + (stub_va & ~PAGE_MASK),
>>>>>> + stub_va, stack_bottom,
>>>>>> + (unsigned long)lstar_enter);
>>>>>> + stub_va += offset;
>>>>>> +
>>>>>> + if ( cpu_has_sep )
>>>>>> + {
>>>>>> + /* SYSENTER entry. */
>>>>>> + wrmsrl(MSR_IA32_SYSENTER_ESP, stack_bottom);
>>>>>> + wrmsrl(MSR_IA32_SYSENTER_EIP, (unsigned long)sysenter_entry);
>>>>>> + wrmsr(MSR_IA32_SYSENTER_CS, __HYPERVISOR_CS, 0);
>>>>>> + }
>>>>>> +
>>>>>> + /* Trampoline for SYSCALL entry from compatibility mode. */
>>>>>> + wrmsrl(MSR_CSTAR, stub_va);
>>>>>> + offset += write_stub_trampoline(stub_page + (stub_va & ~PAGE_MASK),
>>>>>> + stub_va, stack_bottom,
>>>>>> + (unsigned long)cstar_enter);
>>>>>> +
>>>>>> + /* Don't consume more than half of the stub space here. */
>>>>>> + ASSERT(offset <= STUB_BUF_SIZE / 2);
>>>>>> +
>>>>>> + unmap_domain_page(stub_page);
>>>>>> +
>>>>>> + /* Common SYSCALL parameters. */
>>>>>> + wrmsrl(MSR_STAR, XEN_MSR_STAR);
>>>>>> + wrmsrl(MSR_SYSCALL_MASK, XEN_SYSCALL_MASK);
>>>>>> +}
>>>>> These want adjusting to use wrmsrns(), similarly to the previous patch.
>>>>> Fixed locally.
>>>> Also the one higher in the function, I suppose.
>>> All of them.
>>>
>>> I'm not aware of anywhere where we want serialising behaviour, except for
>>> ICR which is buggily non-serialising and has workarounds.
>>>
>>> But I'm also not sure enough of this to suggest that we make wrmsr() be
>>> wrmsrns() by default.
>> I'm pretty sure we don't want this. If nothing else then to avoid code bloat
>> for MSR writes which are non-serializing even in the original form.
>
> Even that's complicated.
>
> For FRED, FS/GS_BASE/KERN need changes because the lack of SWAPGS forces
> MSR accesses even if we do have FSGSBASE active.
>
> Writes to these were made non-serialising in Zen2 and later, but are
> still serialising on Intel. i.e. they need converting to WRMSRNS even
> though plain WRMSR would be "fine" on all AMD systems (either because
> it's the only option, or because it's non-serialising).
Right, such would need converting. But x2APIC MSR accesses, for example,
shouldn't have a need.
Jan
^ permalink raw reply [flat|nested] 120+ messages in thread
* Re: [PATCH 10/22] x86/traps: Move subarch_percpu_traps_init() into traps-setup.c
2025-08-14 10:52 ` Jan Beulich
@ 2025-08-14 11:02 ` Andrew Cooper
0 siblings, 0 replies; 120+ messages in thread
From: Andrew Cooper @ 2025-08-14 11:02 UTC (permalink / raw)
To: Jan Beulich; +Cc: Roger Pau Monné, Xen-devel
On 14/08/2025 11:52 am, Jan Beulich wrote:
> On 14.08.2025 12:17, Andrew Cooper wrote:
>> On 14/08/2025 9:58 am, Jan Beulich wrote:
>>> On 13.08.2025 13:53, Andrew Cooper wrote:
>>>> On 12/08/2025 10:52 am, Jan Beulich wrote:
>>>>> On 11.08.2025 10:17, Andrew Cooper wrote:
>>>>>> On 08/08/2025 9:23 pm, Andrew Cooper wrote:
>>>>>>> ... along with the supporting functions. Switch to Xen coding style, and make
>>>>>>> static as there are no external callers.
>>>>>>>
>>>>>>> Rename to legacy_syscall_init() as a more accurate name.
>>>>>>>
>>>>>>> No functional change.
>>>>>>>
>>>>>>> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
>>>>>>> ---
>>>>>>> CC: Jan Beulich <JBeulich@suse.com>
>>>>>>> CC: Roger Pau Monné <roger.pau@citrix.com>
>>>>>>> ---
>>>>>>> xen/arch/x86/include/asm/system.h | 2 -
>>>>>>> xen/arch/x86/traps-setup.c | 97 ++++++++++++++++++++++++++++++-
>>>>>>> xen/arch/x86/x86_64/traps.c | 92 -----------------------------
>>>>>>> 3 files changed, 95 insertions(+), 96 deletions(-)
>>>>>>>
>>>>>>> diff --git a/xen/arch/x86/include/asm/system.h b/xen/arch/x86/include/asm/system.h
>>>>>>> index 3cdc56e4ba6d..6c2800d8158d 100644
>>>>>>> --- a/xen/arch/x86/include/asm/system.h
>>>>>>> +++ b/xen/arch/x86/include/asm/system.h
>>>>>>> @@ -256,6 +256,4 @@ static inline int local_irq_is_enabled(void)
>>>>>>> #define BROKEN_ACPI_Sx 0x0001
>>>>>>> #define BROKEN_INIT_AFTER_S1 0x0002
>>>>>>>
>>>>>>> -void subarch_percpu_traps_init(void);
>>>>>>> -
>>>>>>> #endif
>>>>>>> diff --git a/xen/arch/x86/traps-setup.c b/xen/arch/x86/traps-setup.c
>>>>>>> index 13b8fcf0ba51..fbae7072c292 100644
>>>>>>> --- a/xen/arch/x86/traps-setup.c
>>>>>>> +++ b/xen/arch/x86/traps-setup.c
>>>>>>> @@ -2,13 +2,15 @@
>>>>>>> /*
>>>>>>> * Configuration of event handling for all CPUs.
>>>>>>> */
>>>>>>> +#include <xen/domain_page.h>
>>>>>>> #include <xen/init.h>
>>>>>>> #include <xen/param.h>
>>>>>>>
>>>>>>> +#include <asm/endbr.h>
>>>>>>> #include <asm/idt.h>
>>>>>>> #include <asm/msr.h>
>>>>>>> #include <asm/shstk.h>
>>>>>>> -#include <asm/system.h>
>>>>>>> +#include <asm/stubs.h>
>>>>>>> #include <asm/traps.h>
>>>>>>>
>>>>>>> DEFINE_PER_CPU_READ_MOSTLY(idt_entry_t *, idt);
>>>>>>> @@ -19,6 +21,8 @@ static bool __initdata opt_ler;
>>>>>>> boolean_param("ler", opt_ler);
>>>>>>>
>>>>>>> void nocall entry_PF(void);
>>>>>>> +void nocall lstar_enter(void);
>>>>>>> +void nocall cstar_enter(void);
>>>>>>>
>>>>>>> /*
>>>>>>> * Sets up system tables and descriptors for IDT devliery.
>>>>>>> @@ -138,6 +142,95 @@ static void load_system_tables(void)
>>>>>>> BUG_ON(stack_bottom & 15);
>>>>>>> }
>>>>>>>
>>>>>>> +static unsigned int write_stub_trampoline(
>>>>>>> + unsigned char *stub, unsigned long stub_va,
>>>>>>> + unsigned long stack_bottom, unsigned long target_va)
>>>>>>> +{
>>>>>>> + unsigned char *p = stub;
>>>>>>> +
>>>>>>> + if ( cpu_has_xen_ibt )
>>>>>>> + {
>>>>>>> + place_endbr64(p);
>>>>>>> + p += 4;
>>>>>>> + }
>>>>>>> +
>>>>>>> + /* Store guest %rax into %ss slot */
>>>>>>> + /* movabsq %rax, stack_bottom - 8 */
>>>>>>> + *p++ = 0x48;
>>>>>>> + *p++ = 0xa3;
>>>>>>> + *(uint64_t *)p = stack_bottom - 8;
>>>>>>> + p += 8;
>>>>>>> +
>>>>>>> + /* Store guest %rsp in %rax */
>>>>>>> + /* movq %rsp, %rax */
>>>>>>> + *p++ = 0x48;
>>>>>>> + *p++ = 0x89;
>>>>>>> + *p++ = 0xe0;
>>>>>>> +
>>>>>>> + /* Switch to Xen stack */
>>>>>>> + /* movabsq $stack_bottom - 8, %rsp */
>>>>>>> + *p++ = 0x48;
>>>>>>> + *p++ = 0xbc;
>>>>>>> + *(uint64_t *)p = stack_bottom - 8;
>>>>>>> + p += 8;
>>>>>>> +
>>>>>>> + /* jmp target_va */
>>>>>>> + *p++ = 0xe9;
>>>>>>> + *(int32_t *)p = target_va - (stub_va + (p - stub) + 4);
>>>>>>> + p += 4;
>>>>>>> +
>>>>>>> + /* Round up to a multiple of 16 bytes. */
>>>>>>> + return ROUNDUP(p - stub, 16);
>>>>>>> +}
>>>>>>> +
>>>>>>> +static void legacy_syscall_init(void)
>>>>>>> +{
>>>>>>> + unsigned long stack_bottom = get_stack_bottom();
>>>>>>> + unsigned long stub_va = this_cpu(stubs.addr);
>>>>>>> + unsigned char *stub_page;
>>>>>>> + unsigned int offset;
>>>>>>> +
>>>>>>> + /* No PV guests? No need to set up SYSCALL/SYSENTER infrastructure. */
>>>>>>> + if ( !IS_ENABLED(CONFIG_PV) )
>>>>>>> + return;
>>>>>>> +
>>>>>>> + stub_page = map_domain_page(_mfn(this_cpu(stubs.mfn)));
>>>>>>> +
>>>>>>> + /*
>>>>>>> + * Trampoline for SYSCALL entry from 64-bit mode. The VT-x HVM vcpu
>>>>>>> + * context switch logic relies on the SYSCALL trampoline being at the
>>>>>>> + * start of the stubs.
>>>>>>> + */
>>>>>>> + wrmsrl(MSR_LSTAR, stub_va);
>>>>>>> + offset = write_stub_trampoline(stub_page + (stub_va & ~PAGE_MASK),
>>>>>>> + stub_va, stack_bottom,
>>>>>>> + (unsigned long)lstar_enter);
>>>>>>> + stub_va += offset;
>>>>>>> +
>>>>>>> + if ( cpu_has_sep )
>>>>>>> + {
>>>>>>> + /* SYSENTER entry. */
>>>>>>> + wrmsrl(MSR_IA32_SYSENTER_ESP, stack_bottom);
>>>>>>> + wrmsrl(MSR_IA32_SYSENTER_EIP, (unsigned long)sysenter_entry);
>>>>>>> + wrmsr(MSR_IA32_SYSENTER_CS, __HYPERVISOR_CS, 0);
>>>>>>> + }
>>>>>>> +
>>>>>>> + /* Trampoline for SYSCALL entry from compatibility mode. */
>>>>>>> + wrmsrl(MSR_CSTAR, stub_va);
>>>>>>> + offset += write_stub_trampoline(stub_page + (stub_va & ~PAGE_MASK),
>>>>>>> + stub_va, stack_bottom,
>>>>>>> + (unsigned long)cstar_enter);
>>>>>>> +
>>>>>>> + /* Don't consume more than half of the stub space here. */
>>>>>>> + ASSERT(offset <= STUB_BUF_SIZE / 2);
>>>>>>> +
>>>>>>> + unmap_domain_page(stub_page);
>>>>>>> +
>>>>>>> + /* Common SYSCALL parameters. */
>>>>>>> + wrmsrl(MSR_STAR, XEN_MSR_STAR);
>>>>>>> + wrmsrl(MSR_SYSCALL_MASK, XEN_SYSCALL_MASK);
>>>>>>> +}
>>>>>> These want adjusting to use wrmsrns(), similarly to the previous patch.
>>>>>> Fixed locally.
>>>>> Also the one higher in the function, I suppose.
>>>> All of them.
>>>>
>>>> I'm not aware of anywhere where we want serialising behaviour, except for
>>>> ICR which is buggily non-serialising and has workarounds.
>>>>
>>>> But I'm also not sure enough of this to suggest that we make wrmsr() be
>>>> wrmsrns() by default.
>>> I'm pretty sure we don't want this. If nothing else then to avoid code bloat
>>> for MSR writes which are non-serializing even in the original form.
>> Even that's complicated.
>>
>> For FRED, FS/GS_BASE/KERN need changes because the lack of SWAPGS forces
>> MSR accesses even if we do have FSGSBASE active.
>>
>> Writes to these were made non-serialising in Zen2 and later, but are
>> still serialising on Intel. i.e. they need converting to WRMSRNS even
>> though plain WRMSR would be "fine" on all AMD systems (either because
>> it's the only option, or because it's non-serialising).
> Right, such would need converting. But x2APIC MSR accesses, for example,
> shouldn't have a need.
For serialising-ness, yes, but they still want to be MSR_IMM when
available, at which point the code bloat price is already paid.
~Andrew
^ permalink raw reply [flat|nested] 120+ messages in thread
* Re: [PATCH 13/22] x86: FRED enumerations
2025-08-08 20:23 ` [PATCH 13/22] x86: FRED enumerations Andrew Cooper
2025-08-13 12:28 ` Andrew Cooper
@ 2025-08-14 11:20 ` Jan Beulich
2025-08-14 11:42 ` Andrew Cooper
` (2 more replies)
2025-08-18 9:02 ` Jan Beulich
2 siblings, 3 replies; 120+ messages in thread
From: Jan Beulich @ 2025-08-14 11:20 UTC (permalink / raw)
To: Andrew Cooper; +Cc: Roger Pau Monné, Xen-devel
On 08.08.2025 22:23, Andrew Cooper wrote:
> Of note, CR4.FRED is bit 32 and cannot be enabled outside of 64bit mode.
>
> Most supported toolchains don't understand the FRED instructions yet. ERETU
> and ERETS are easy to wrap (they are encoded as REPZ/REPNE CLAC), while LKGS is
> more complicated and deferred for now.
>
> I have intentionally named the FRED MSRs differently to the spec. In the
> spec, the stack pointer names alias the TSS fields of the same name, despite
> very different semantics.
>
> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
with ...
> --- a/xen/arch/x86/Kconfig
> +++ b/xen/arch/x86/Kconfig
> @@ -57,6 +57,10 @@ config HAS_CC_CET_IBT
> # Retpoline check to work around https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93654
> def_bool $(cc-option,-fcf-protection=branch -mmanual-endbr -mindirect-branch=thunk-extern) && $(as-instr,endbr64)
>
> +config HAS_AS_FRED
> + # binutils >= 2.41 or LLVM >= 19
> + def_bool $(as-instr,eretu;lkgs %ax)
..., as per your reply, this preferably dropped (without me insisting), and
with ...
> --- a/xen/arch/x86/include/asm/x86-defns.h
> +++ b/xen/arch/x86/include/asm/x86-defns.h
> @@ -75,6 +75,7 @@
> #define X86_CR4_PKE 0x00400000 /* enable PKE */
> #define X86_CR4_CET 0x00800000 /* Control-flow Enforcement Technology */
> #define X86_CR4_PKS 0x01000000 /* Protection Key Supervisor */
> +#define X86_CR4_FRED 0x100000000 /* Fast Return and Event Delivery */
... a UL suffix added here for Misra.
> --- a/xen/include/public/arch-x86/cpufeatureset.h
> +++ b/xen/include/public/arch-x86/cpufeatureset.h
> @@ -310,7 +310,10 @@ XEN_CPUFEATURE(ARCH_PERF_MON, 10*32+8) /* Architectural Perfmon */
> XEN_CPUFEATURE(FZRM, 10*32+10) /*A Fast Zero-length REP MOVSB */
> XEN_CPUFEATURE(FSRS, 10*32+11) /*A Fast Short REP STOSB */
> XEN_CPUFEATURE(FSRCS, 10*32+12) /*A Fast Short REP CMPSB/SCASB */
> +XEN_CPUFEATURE(FRED, 10*32+17) /* Fast Return and Event Delivery */
> +XEN_CPUFEATURE(LKGS, 10*32+18) /* Load Kernel GS instruction */
> XEN_CPUFEATURE(WRMSRNS, 10*32+19) /*S WRMSR Non-Serialising */
> +XEN_CPUFEATURE(NMI_SRC, 10*32+20) /* NMI-Source Reporting */
> XEN_CPUFEATURE(AMX_FP16, 10*32+21) /* AMX FP16 instruction */
> XEN_CPUFEATURE(AVX_IFMA, 10*32+23) /*A AVX-IFMA Instructions */
> XEN_CPUFEATURE(LAM, 10*32+26) /* Linear Address Masking */
I'd like to note that we could long have had this if my long-pending emulator
patch had gone in at some point.
Jan
^ permalink raw reply [flat|nested] 120+ messages in thread
* Re: [PATCH 13/22] x86: FRED enumerations
2025-08-14 11:20 ` Jan Beulich
@ 2025-08-14 11:42 ` Andrew Cooper
2025-08-14 11:44 ` Jan Beulich
2025-08-14 13:19 ` Jan Beulich
2025-08-21 21:23 ` Andrew Cooper
2 siblings, 1 reply; 120+ messages in thread
From: Andrew Cooper @ 2025-08-14 11:42 UTC (permalink / raw)
To: Jan Beulich; +Cc: Roger Pau Monné, Xen-devel
On 14/08/2025 12:20 pm, Jan Beulich wrote:
> On 08.08.2025 22:23, Andrew Cooper wrote:
>> Of note, CR4.FRED is bit 32 and cannot be enabled outside of 64bit mode.
>>
>> Most supported toolchains don't understand the FRED instructions yet. ERETU
>> and ERETS are easy to wrap (they are encoded as REPZ/REPNE CLAC), while LKGS is
>> more complicated and deferred for now.
>>
>> I have intentionally named the FRED MSRs differently to the spec. In the
>> spec, the stack pointer names alias the TSS fields of the same name, despite
>> very different semantics.
>>
>> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
> Acked-by: Jan Beulich <jbeulich@suse.com>
Thanks.
>> --- a/xen/arch/x86/include/asm/x86-defns.h
>> +++ b/xen/arch/x86/include/asm/x86-defns.h
>> @@ -75,6 +75,7 @@
>> #define X86_CR4_PKE 0x00400000 /* enable PKE */
>> #define X86_CR4_CET 0x00800000 /* Control-flow Enforcement Technology */
>> #define X86_CR4_PKS 0x01000000 /* Protection Key Supervisor */
>> +#define X86_CR4_FRED 0x100000000 /* Fast Return and Event Delivery */
> ... a UL suffix added here for Misra.
I was surprised, but Eclair is entirely fine with this.
~Andrew
^ permalink raw reply [flat|nested] 120+ messages in thread
* Re: [PATCH 13/22] x86: FRED enumerations
2025-08-14 11:42 ` Andrew Cooper
@ 2025-08-14 11:44 ` Jan Beulich
2025-08-14 11:47 ` Andrew Cooper
0 siblings, 1 reply; 120+ messages in thread
From: Jan Beulich @ 2025-08-14 11:44 UTC (permalink / raw)
To: Andrew Cooper; +Cc: Roger Pau Monné, Xen-devel
On 14.08.2025 13:42, Andrew Cooper wrote:
> On 14/08/2025 12:20 pm, Jan Beulich wrote:
>> On 08.08.2025 22:23, Andrew Cooper wrote:
>>> --- a/xen/arch/x86/include/asm/x86-defns.h
>>> +++ b/xen/arch/x86/include/asm/x86-defns.h
>>> @@ -75,6 +75,7 @@
>>> #define X86_CR4_PKE 0x00400000 /* enable PKE */
>>> #define X86_CR4_CET 0x00800000 /* Control-flow Enforcement Technology */
>>> #define X86_CR4_PKS 0x01000000 /* Protection Key Supervisor */
>>> +#define X86_CR4_FRED 0x100000000 /* Fast Return and Event Delivery */
>> ... a UL suffix added here for Misra.
>
> I was surprised, but Eclair is entirely fine with this.
And there is a use of the identifier in a monitored C file?
Jan
^ permalink raw reply [flat|nested] 120+ messages in thread
* Re: [PATCH 13/22] x86: FRED enumerations
2025-08-14 11:44 ` Jan Beulich
@ 2025-08-14 11:47 ` Andrew Cooper
2025-08-14 19:37 ` Nicola Vetrini
0 siblings, 1 reply; 120+ messages in thread
From: Andrew Cooper @ 2025-08-14 11:47 UTC (permalink / raw)
To: Jan Beulich; +Cc: Roger Pau Monné, Xen-devel
On 14/08/2025 12:44 pm, Jan Beulich wrote:
> On 14.08.2025 13:42, Andrew Cooper wrote:
>> On 14/08/2025 12:20 pm, Jan Beulich wrote:
>>> On 08.08.2025 22:23, Andrew Cooper wrote:
>>>> --- a/xen/arch/x86/include/asm/x86-defns.h
>>>> +++ b/xen/arch/x86/include/asm/x86-defns.h
>>>> @@ -75,6 +75,7 @@
>>>> #define X86_CR4_PKE 0x00400000 /* enable PKE */
>>>> #define X86_CR4_CET 0x00800000 /* Control-flow Enforcement Technology */
>>>> #define X86_CR4_PKS 0x01000000 /* Protection Key Supervisor */
>>>> +#define X86_CR4_FRED 0x100000000 /* Fast Return and Event Delivery */
>>> ... a UL suffix added here for Misra.
>> I was surprised, but Eclair is entirely fine with this.
> And there is a use of the identifier in a monitored C file?
Yes. traps-setup.c which definitely has not been added to an exclusion
list.
~Andrew
^ permalink raw reply [flat|nested] 120+ messages in thread
* Re: [PATCH 14/22] x86/traps: Extend struct cpu_user_regs/cpu_info with FRED fields
2025-08-08 20:23 ` [PATCH 14/22] x86/traps: Extend struct cpu_user_regs/cpu_info with FRED fields Andrew Cooper
@ 2025-08-14 13:12 ` Jan Beulich
2025-08-14 15:07 ` Andrew Cooper
0 siblings, 1 reply; 120+ messages in thread
From: Jan Beulich @ 2025-08-14 13:12 UTC (permalink / raw)
To: Andrew Cooper; +Cc: Roger Pau Monné, Xen-devel
On 08.08.2025 22:23, Andrew Cooper wrote:
> @@ -42,17 +46,76 @@ struct cpu_user_regs
> */
>
> union { uint64_t rip; uint32_t eip; uint16_t ip; };
> - uint16_t cs, _pad0[1];
> - uint8_t saved_upcall_mask; /* PV (v)rflags.IF == !saved_upcall_mask */
> - uint8_t _pad1[3];
> + union {
> + struct {
> + uint16_t cs;
> + unsigned long :16;
> + uint8_t saved_upcall_mask; /* PV (v)rflags.IF == !saved_upcall_mask */
Would this better be reproduced ...
> + };
> + unsigned long csx;
> + struct {
> + /*
> + * Bits 0 thru 31 control ERET{U,S} behaviour, and is state of the
> + * interrupted context.
> + */
> + uint16_t cs;
> + unsigned int sl:2; /* Stack Level */
> + bool wfe:1; /* Wait-for-ENDBRANCH state */
... here as well, just like you reproduce "cs"?
> + } fred_cs;
> + };
> union { uint64_t rflags; uint32_t eflags; uint16_t flags; };
> union { uint64_t rsp; uint32_t esp; uint16_t sp; uint8_t spl; };
> - uint16_t ss, _pad2[3];
> + union {
> + uint16_t ss;
> + unsigned long ssx;
What use do you foresee for this and "csx"?
> + struct {
> + /*
> + * Bits 0 thru 31 control ERET{U,S} behaviour, and is state about
> + * the event which occured.
> + */
> + uint16_t ss;
> + bool sti:1; /* Was blocked-by-STI, and not cancelled */
> + bool swint:1; /* Was a SYSCALL/SYSENTER/INT $N */
> + bool nmi:1; /* Was an NMI. */
> + unsigned long :13;
> +
> + /*
> + * Bits 32 thru 63 are ignored by ERET{U,S} and are informative
> + * only.
> + */
> + uint8_t vector;
> + unsigned long :8;
> + unsigned int type:4; /* X86_ET_* */
> + unsigned long :4;
> + bool enclave:1; /* Event taken in SGX mode */
> + bool lm:1; /* Was in Long Mode */
The bit indicates 64-bit mode aiui, not long mode (without which FRED isn't even
available).
> --- a/xen/arch/x86/include/asm/current.h
> +++ b/xen/arch/x86/include/asm/current.h
> @@ -38,6 +38,8 @@ struct vcpu;
>
> struct cpu_info {
> struct cpu_user_regs guest_cpu_user_regs;
> + struct fred_info _fred; /* Only used when FRED is active. */
Any particular need for the leading underscore?
Jan
^ permalink raw reply [flat|nested] 120+ messages in thread
* Re: [PATCH 13/22] x86: FRED enumerations
2025-08-14 11:20 ` Jan Beulich
2025-08-14 11:42 ` Andrew Cooper
@ 2025-08-14 13:19 ` Jan Beulich
2025-08-14 18:45 ` Andrew Cooper
2025-08-21 21:23 ` Andrew Cooper
2 siblings, 1 reply; 120+ messages in thread
From: Jan Beulich @ 2025-08-14 13:19 UTC (permalink / raw)
To: Andrew Cooper; +Cc: Roger Pau Monné, Xen-devel
On 14.08.2025 13:20, Jan Beulich wrote:
> On 08.08.2025 22:23, Andrew Cooper wrote:
>> --- a/xen/include/public/arch-x86/cpufeatureset.h
>> +++ b/xen/include/public/arch-x86/cpufeatureset.h
>> @@ -310,7 +310,10 @@ XEN_CPUFEATURE(ARCH_PERF_MON, 10*32+8) /* Architectural Perfmon */
>> XEN_CPUFEATURE(FZRM, 10*32+10) /*A Fast Zero-length REP MOVSB */
>> XEN_CPUFEATURE(FSRS, 10*32+11) /*A Fast Short REP STOSB */
>> XEN_CPUFEATURE(FSRCS, 10*32+12) /*A Fast Short REP CMPSB/SCASB */
>> +XEN_CPUFEATURE(FRED, 10*32+17) /* Fast Return and Event Delivery */
>> +XEN_CPUFEATURE(LKGS, 10*32+18) /* Load Kernel GS instruction */
>> XEN_CPUFEATURE(WRMSRNS, 10*32+19) /*S WRMSR Non-Serialising */
>> +XEN_CPUFEATURE(NMI_SRC, 10*32+20) /* NMI-Source Reporting */
>> XEN_CPUFEATURE(AMX_FP16, 10*32+21) /* AMX FP16 instruction */
>> XEN_CPUFEATURE(AVX_IFMA, 10*32+23) /*A AVX-IFMA Instructions */
>> XEN_CPUFEATURE(LAM, 10*32+26) /* Linear Address Masking */
>
> I'd like to note that we could long have had this if my long-pending emulator
> patch had gone in at some point.
Actually what I further have there, and what in the context of patch 15 I
notice you should have here is
--- a/xen/tools/gen-cpuid.py
+++ b/xen/tools/gen-cpuid.py
@@ -278,7 +278,8 @@ def crunch_numbers(state):
# superpages, PCID and PKU are only available in 4 level paging.
# NO_LMSL indicates the absense of Long Mode Segment Limits, which
# have been dropped in hardware.
- LM: [CX16, PCID, LAHF_LM, PAGE1GB, PKU, NO_LMSL, AMX_TILE, CMPCCXADD],
+ LM: [CX16, PCID, LAHF_LM, PAGE1GB, PKU, NO_LMSL, AMX_TILE, CMPCCXADD,
+ LKGS],
# AMD K6-2+ and K6-III processors shipped with 3DNow+, beyond the
# standard 3DNow in the earlier K6 processors.
@@ -347,6 +348,9 @@ def crunch_numbers(state):
# computational instructions. All further AMX features are built on top
# of AMX-TILE.
AMX_TILE: [AMX_BF16, AMX_INT8, AMX_FP16, AMX_COMPLEX],
+
+ # FRED builds on the LKGS instruction.
+ LKGS: [FRED],
}
deep_features = tuple(sorted(deps.keys()))
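
As an illustration of what the two dependency entries above imply, here is a minimal C sketch (not the gen-cpuid.py logic itself) of transitive clearing: hiding LM takes LKGS with it, and hiding LKGS takes FRED with it. The feature indices, the deps[] table and clear_with_dependents() are hypothetical names invented for this sketch.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical feature indices, for illustration only. */
enum { F_LM, F_LKGS, F_FRED, F_NR };

/* deps[f] is the set of features which directly depend on f. */
static const uint32_t deps[F_NR] = {
    [F_LM]   = 1u << F_LKGS, /* LKGS is only usable in Long Mode. */
    [F_LKGS] = 1u << F_FRED, /* FRED builds on the LKGS instruction. */
};

/* Clear @feat from @fs, along with everything depending on it. */
static inline uint32_t clear_with_dependents(uint32_t fs, unsigned int feat)
{
    uint32_t todo = 1u << feat;

    while ( todo )
    {
        unsigned int f = 0;

        /* Pick the lowest pending feature. */
        while ( !(todo & (1u << f)) )
            f++;

        todo &= ~(1u << f);
        fs   &= ~(1u << f);

        /* Queue direct dependents which are still set. */
        todo |= deps[f] & fs;
    }

    return fs;
}
```

With such a table in place, a separate cpu_has_lkgs check next to cpu_has_fred becomes redundant, because FRED can never be visible without LKGS.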
Jan
^ permalink raw reply [flat|nested] 120+ messages in thread
* Re: [PATCH 15/22] x86/traps: Introduce opt_fred
2025-08-08 20:23 ` [PATCH 15/22] x86/traps: Introduce opt_fred Andrew Cooper
@ 2025-08-14 13:30 ` Jan Beulich
2025-08-14 19:16 ` Andrew Cooper
0 siblings, 1 reply; 120+ messages in thread
From: Jan Beulich @ 2025-08-14 13:30 UTC (permalink / raw)
To: Andrew Cooper; +Cc: Xen-devel
On 08.08.2025 22:23, Andrew Cooper wrote:
> ... disabled by default. There is a lot of work before FRED can be enabled by
> default.
>
> One part of FRED, the LKGS (Load Kernel GS) instruction, is enumerated
> separately but is mandatory as FRED disallows the SWAPGS instruction.
> Therefore, both CPUID bits must be checked.
See my (further) reply to patch 13 - I think FRED simply ought to depend on
LKGS.
> @@ -20,6 +22,9 @@ unsigned int __ro_after_init ler_msr;
> static bool __initdata opt_ler;
> boolean_param("ler", opt_ler);
>
> +int8_t __ro_after_init opt_fred = 0; /* -1 when supported. */
I'm a little puzzled by the comment? DYM "once default-enabled"? Then ...
> @@ -305,6 +310,32 @@ void __init traps_init(void)
> /* Replace early pagefault with real pagefault handler. */
> _update_gate_addr_lower(&bsp_idt[X86_EXC_PF], entry_PF);
>
> + if ( !cpu_has_fred || !cpu_has_lkgs )
> + {
> + if ( opt_fred )
... this won't work anymore once the initializer is changed.
> + printk(XENLOG_WARNING "FRED not available, ignoring\n");
> + opt_fred = false;
Better use 0 here?
> + }
> +
> + if ( opt_fred == -1 )
> + opt_fred = !pv_shim;
Imo it would be better to have the initializer be -1 right away, and comment
out the "!pv_shim" here, until we mean it to be default-enabled.
> + if ( opt_fred )
> + {
> +#ifdef CONFIG_PV32
> + if ( opt_pv32 )
> + {
> + opt_pv32 = 0;
> + printk(XENLOG_INFO "Disabling PV32 due to FRED\n");
> + }
> +#endif
> + printk("Using FRED event delivery\n");
> + }
> + else
> + {
> + printk("Using IDT event delivery\n");
> + }
Could I talk you into omitting the figure braces here? Hmm, or perhaps you
mean to later move code here.
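
To make the tri-state suggestion concrete, a minimal C sketch, assuming -1 means "no explicit choice", 0 means off and 1 means on. The helper names are hypothetical, and the default of 0 stands in for the commented-out "!pv_shim" until FRED is meant to be default-enabled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Warn only if the user explicitly asked for FRED and it is unavailable. */
static inline bool fred_should_warn(int8_t opt, bool available)
{
    return !available && opt == 1;
}

/* Resolve the tri-state option into a final 0/1 setting. */
static inline int8_t resolve_opt_fred(int8_t opt, bool has_fred, bool has_lkgs)
{
    if ( !has_fred || !has_lkgs )
        return 0;

    if ( opt == -1 )
        opt = 0; /* Would become !pv_shim once default-enabled. */

    return opt;
}
```

With the initializer at -1, the "not available" warning keys off an explicit request (opt == 1) rather than off any non-zero value, so it keeps working once the default changes.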
Jan
^ permalink raw reply [flat|nested] 120+ messages in thread
* Re: [PATCH 16/22] x86/boot: Adjust CR4 handling around ap_early_traps_init()
2025-08-08 20:23 ` [PATCH 16/22] x86/boot: Adjust CR4 handling around ap_early_traps_init() Andrew Cooper
@ 2025-08-14 14:47 ` Jan Beulich
2025-08-14 14:54 ` Andrew Cooper
0 siblings, 1 reply; 120+ messages in thread
From: Jan Beulich @ 2025-08-14 14:47 UTC (permalink / raw)
To: Andrew Cooper; +Cc: Roger Pau Monné, Xen-devel
On 08.08.2025 22:23, Andrew Cooper wrote:
> ap_early_traps_init() will shortly be setting CR4.FRED. This requires that
> cpu_info->cr4 is already set up, and that the enablement of CET doesn't
> truncate FRED back out because of its 32bit logic.
>
> For __high_start(), defer re-loading XEN_MINIMAL_CR4 until after %rsp is set
> up and we can store the result in the cr4 field too.
>
> For s3_resume(), explicitly re-load XEN_MINIMAL_CR4. Later when loading all
> features, use the mmu_cr4_features variable which is how the rest of Xen
> performs this operation.
>
> No functional change, yet.
>
> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
> --- a/xen/arch/x86/acpi/wakeup_prot.S
> +++ b/xen/arch/x86/acpi/wakeup_prot.S
> @@ -63,6 +63,14 @@ LABEL(s3_resume)
> pushq %rax
> lretq
> 1:
> +
> + GET_STACK_END(15)
> +
> + /* Enable minimal CR4 features. */
> + mov $XEN_MINIMAL_CR4, %eax
> + mov %rax, STACK_CPUINFO_FIELD(cr4)(%r15)
Strictly speaking this and ...
> --- a/xen/arch/x86/boot/x86_64.S
> +++ b/xen/arch/x86/boot/x86_64.S
> @@ -11,16 +11,19 @@ ENTRY(__high_start)
> mov %ecx,%gs
> mov %ecx,%ss
>
> - /* Enable minimal CR4 features. */
> - mov $XEN_MINIMAL_CR4,%rcx
> - mov %rcx,%cr4
> -
> mov stack_start(%rip),%rsp
>
> /* Reset EFLAGS (subsumes CLI and CLD). */
> pushq $0
> popf
>
> + GET_STACK_END(15)
> +
> + /* Enable minimal CR4 features. */
> + mov $XEN_MINIMAL_CR4, %eax
> + mov %rax, STACK_CPUINFO_FIELD(cr4)(%r15)
... this could be 32-bit stores, even in the longer run.
Jan
^ permalink raw reply [flat|nested] 120+ messages in thread
* Re: [PATCH 17/22] x86/S3: Switch to using RSTORSSP to recover SSP on resume
2025-08-08 20:23 ` [PATCH 17/22] x86/S3: Switch to using RSTORSSP to recover SSP on resume Andrew Cooper
@ 2025-08-14 14:54 ` Jan Beulich
0 siblings, 0 replies; 120+ messages in thread
From: Jan Beulich @ 2025-08-14 14:54 UTC (permalink / raw)
To: Andrew Cooper; +Cc: Roger Pau Monné, Xen-devel
On 08.08.2025 22:23, Andrew Cooper wrote:
> Under FRED, SETSSBSY is unavailable, and we want to be setting up FRED prior
> to setting up shadow stacks. Luckily, RSTORSSP will also work in this case.
>
> This involves a new type of shadow stack token, the Restore Token, which is
> distinguished from the Supervisor Token by pointing to the adjacent slot on
> the shadow stack rather than pointing at itself.
>
> In the short term, this logic still needs to load MSR_PL0_SSP.
>
> No functional change.
>
> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
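
For readers unfamiliar with the two token types, a minimal C sketch of the values involved, as described in the commit message: a Supervisor Token holds its own address, while a Restore Token holds the address of the adjacent (next higher) slot. The helper names are hypothetical. The mode bit in bit 0 of the Restore Token is an assumption based on a reading of the CET specification, not something stated in this thread.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: bit 0 of a Restore Token marks 64-bit mode. */
#define RESTORE_TOKEN_MODE64 UINT64_C(1)

/* Supervisor Token: points at itself, busy bit (bit 0) clear. */
static inline uint64_t supervisor_token(uint64_t slot)
{
    return slot;
}

/* Restore Token: points at the slot above the one it is written to. */
static inline uint64_t restore_token(uint64_t slot)
{
    return (slot + 8) | RESTORE_TOKEN_MODE64;
}
```

In both cases slot is the 8-byte aligned shadow stack address the token is written to.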
^ permalink raw reply [flat|nested] 120+ messages in thread
* Re: [PATCH 16/22] x86/boot: Adjust CR4 handling around ap_early_traps_init()
2025-08-14 14:47 ` Jan Beulich
@ 2025-08-14 14:54 ` Andrew Cooper
2025-08-14 14:56 ` Jan Beulich
0 siblings, 1 reply; 120+ messages in thread
From: Andrew Cooper @ 2025-08-14 14:54 UTC (permalink / raw)
To: Jan Beulich; +Cc: Roger Pau Monné, Xen-devel
On 14/08/2025 3:47 pm, Jan Beulich wrote:
> On 08.08.2025 22:23, Andrew Cooper wrote:
>> ap_early_traps_init() will shortly be setting CR4.FRED. This requires that
>> cpu_info->cr4 is already set up, and that the enablement of CET doesn't
>> truncate FRED back out because of its 32bit logic.
>>
>> For __high_start(), defer re-loading XEN_MINIMAL_CR4 until after %rsp is set
>> up and we can store the result in the cr4 field too.
>>
>> For s3_resume(), explicitly re-load XEN_MINIMAL_CR4. Later when loading all
>> features, use the mmu_cr4_features variable which is how the rest of Xen
>> performs this operation.
>>
>> No functional change, yet.
>>
>> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Thanks.
Unfortunately, ...
>
>> --- a/xen/arch/x86/acpi/wakeup_prot.S
>> +++ b/xen/arch/x86/acpi/wakeup_prot.S
>> @@ -63,6 +63,14 @@ LABEL(s3_resume)
>> pushq %rax
>> lretq
>> 1:
>> +
>> + GET_STACK_END(15)
>> +
>> + /* Enable minimal CR4 features. */
>> + mov $XEN_MINIMAL_CR4, %eax
>> + mov %rax, STACK_CPUINFO_FIELD(cr4)(%r15)
> Strictly speaking this and ...
>
>> --- a/xen/arch/x86/boot/x86_64.S
>> +++ b/xen/arch/x86/boot/x86_64.S
>> @@ -11,16 +11,19 @@ ENTRY(__high_start)
>> mov %ecx,%gs
>> mov %ecx,%ss
>>
>> - /* Enable minimal CR4 features. */
>> - mov $XEN_MINIMAL_CR4,%rcx
>> - mov %rcx,%cr4
>> -
>> mov stack_start(%rip),%rsp
>>
>> /* Reset EFLAGS (subsumes CLI and CLD). */
>> pushq $0
>> popf
>>
>> + GET_STACK_END(15)
>> +
>> + /* Enable minimal CR4 features. */
>> + mov $XEN_MINIMAL_CR4, %eax
>> + mov %rax, STACK_CPUINFO_FIELD(cr4)(%r15)
> ... this could be 32-bit stores, even in the longer run.
... no, they can't.
The store also serves to clear out stale X86_CR4_FRED, prior to FRED
getting reconfigured again.
fatal_trap() uses info->cr4 to decide whether it's safe to look at the
extended FRED metadata. Strictly speaking I probably ought to read the
real CR4 (in read_registers too), but using a 32bit store here would
extend a 1-instruction window into quite a larger window where exception
handling would not work quite right.
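
A minimal C model of the difference, for illustration only. store64() and store32() are hypothetical helpers standing in for a 64-bit and a 32-bit store to the cached cr4 field; they are not Xen functions.

```c
#include <assert.h>
#include <stdint.h>

#define X86_CR4_FRED (UINT64_C(1) << 32)

/* CR4.FRED does not survive truncation to 32 bits. */
static_assert((uint32_t)X86_CR4_FRED == 0, "FRED lives in the upper half");

/* 64-bit store of a zero-extended value: the whole field is replaced. */
static inline uint64_t store64(uint64_t old, uint32_t val)
{
    (void)old;
    return val;
}

/* 32-bit store: only the low half is replaced, the upper half is stale. */
static inline uint64_t store32(uint64_t old, uint32_t val)
{
    return (old & ~UINT64_C(0xffffffff)) | val;
}
```

With a stale FRED bit in the old value, only the 64-bit store leaves the field without X86_CR4_FRED set.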
~Andrew
^ permalink raw reply [flat|nested] 120+ messages in thread
* Re: [PATCH 16/22] x86/boot: Adjust CR4 handling around ap_early_traps_init()
2025-08-14 14:54 ` Andrew Cooper
@ 2025-08-14 14:56 ` Jan Beulich
2025-08-14 19:22 ` Andrew Cooper
0 siblings, 1 reply; 120+ messages in thread
From: Jan Beulich @ 2025-08-14 14:56 UTC (permalink / raw)
To: Andrew Cooper; +Cc: Roger Pau Monné, Xen-devel
On 14.08.2025 16:54, Andrew Cooper wrote:
> On 14/08/2025 3:47 pm, Jan Beulich wrote:
>> On 08.08.2025 22:23, Andrew Cooper wrote:
>>> --- a/xen/arch/x86/acpi/wakeup_prot.S
>>> +++ b/xen/arch/x86/acpi/wakeup_prot.S
>>> @@ -63,6 +63,14 @@ LABEL(s3_resume)
>>> pushq %rax
>>> lretq
>>> 1:
>>> +
>>> + GET_STACK_END(15)
>>> +
>>> + /* Enable minimal CR4 features. */
>>> + mov $XEN_MINIMAL_CR4, %eax
>>> + mov %rax, STACK_CPUINFO_FIELD(cr4)(%r15)
>> Strictly speaking this and ...
>>
>>> --- a/xen/arch/x86/boot/x86_64.S
>>> +++ b/xen/arch/x86/boot/x86_64.S
>>> @@ -11,16 +11,19 @@ ENTRY(__high_start)
>>> mov %ecx,%gs
>>> mov %ecx,%ss
>>>
>>> - /* Enable minimal CR4 features. */
>>> - mov $XEN_MINIMAL_CR4,%rcx
>>> - mov %rcx,%cr4
>>> -
>>> mov stack_start(%rip),%rsp
>>>
>>> /* Reset EFLAGS (subsumes CLI and CLD). */
>>> pushq $0
>>> popf
>>>
>>> + GET_STACK_END(15)
>>> +
>>> + /* Enable minimal CR4 features. */
>>> + mov $XEN_MINIMAL_CR4, %eax
>>> + mov %rax, STACK_CPUINFO_FIELD(cr4)(%r15)
>> ... this could be 32-bit stores, even in the longer run.
>
> ... no, they can't.
>
> The store also serves to clear out stale X86_CR4_FRED, prior to FRED
> getting reconfigured again.
>
> fatal_trap() uses info->cr4 to decide whether it's safe to look at the
> extended FRED metadata. Strictly speaking I probably ought to read the
> real CR4 (in read_registers too), but using a 32bit store here would
> extend a 1-instruction window into quite a larger window where exception
> handling would not work quite right.
Oh, I see. Mind me asking to add brief comments there to this effect?
Jan
^ permalink raw reply [flat|nested] 120+ messages in thread
* Re: [PATCH 18/22] x86/traps: Set MSR_PL0_SSP in load_system_tables()
2025-08-08 20:23 ` [PATCH 18/22] x86/traps: Set MSR_PL0_SSP in load_system_tables() Andrew Cooper
@ 2025-08-14 15:00 ` Jan Beulich
2025-08-14 19:37 ` Andrew Cooper
0 siblings, 1 reply; 120+ messages in thread
From: Jan Beulich @ 2025-08-14 15:00 UTC (permalink / raw)
To: Andrew Cooper; +Cc: Roger Pau Monné, Xen-devel
On 08.08.2025 22:23, Andrew Cooper wrote:
> FRED and IDT differ by a Supervisor Token on the base of the shstk. This
> means that the value they load into MSR_PL0_SSP differs by 8.
>
> s3_resume() in particular has logic which is otherwise invariant of FRED mode,
> and must not clobber a FRED MSR_PL0_SSP with an IDT one.
>
> This also simplifies the AP path too. Updating reinit_bsp_stack() is deferred
> until later.
>
> No functional change.
>
> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
I wonder why this was originally done in assembly in the first place, when
we aim at reducing the assembly code we have.
> --- a/xen/arch/x86/boot/x86_64.S
> +++ b/xen/arch/x86/boot/x86_64.S
> @@ -65,17 +65,11 @@ ENTRY(__high_start)
> or $(PRIMARY_SHSTK_SLOT + 1) * PAGE_SIZE - 8, %rdx
>
> /*
> - * Write a new supervisor token. Doesn't matter on boot, but for S3
> - * resume this clears the busy bit.
> + * Write a new Supervisor Token. It doesn't matter the first time a
> + * CPU boots, but for S3 resume or CPU hot re-add, this clears the
> + * busy bit.
> */
> wrssq %rdx, (%rdx)
> -
> - /* Point MSR_PL0_SSP at the token. */
> - mov $MSR_PL0_SSP, %ecx
> - mov %edx, %eax
> - shr $32, %rdx
> - wrmsr
> -
> setssbsy
This is ending up a little odd: The comment says the write is to clear the
busy bit, when that's re-set immediately afterwards.
Jan
^ permalink raw reply [flat|nested] 120+ messages in thread
* Re: [PATCH 14/22] x86/traps: Extend struct cpu_user_regs/cpu_info with FRED fields
2025-08-14 13:12 ` Jan Beulich
@ 2025-08-14 15:07 ` Andrew Cooper
0 siblings, 0 replies; 120+ messages in thread
From: Andrew Cooper @ 2025-08-14 15:07 UTC (permalink / raw)
To: Jan Beulich; +Cc: Roger Pau Monné, Xen-devel
On 14/08/2025 2:12 pm, Jan Beulich wrote:
> On 08.08.2025 22:23, Andrew Cooper wrote:
>> @@ -42,17 +46,76 @@ struct cpu_user_regs
>> */
>>
>> union { uint64_t rip; uint32_t eip; uint16_t ip; };
>> - uint16_t cs, _pad0[1];
>> - uint8_t saved_upcall_mask; /* PV (v)rflags.IF == !saved_upcall_mask */
>> - uint8_t _pad1[3];
>> + union {
>> + struct {
>> + uint16_t cs;
>> + unsigned long :16;
>> + uint8_t saved_upcall_mask; /* PV (v)rflags.IF == !saved_upcall_mask */
> Would this better be reproduced ...
>
>> + };
>> + unsigned long csx;
>> + struct {
>> + /*
>> + * Bits 0 thru 31 control ERET{U,S} behaviour, and is state of the
>> + * interrupted context.
>> + */
>> + uint16_t cs;
>> + unsigned int sl:2; /* Stack Level */
>> + bool wfe:1; /* Wait-for-ENDBRANCH state */
> ... here as well, just like you reproduce "cs"?
saved_upcall_mask is a property of an in-guest IRET frame only. It is
only produced in create_bounce_frame, and never consumed by Xen.
It needs to exist in this structure so asm-offsets.c can generate a
constant.
Also, be aware that there are new features being planned which rely on FRED.
>
>> + } fred_cs;
>> + };
>> union { uint64_t rflags; uint32_t eflags; uint16_t flags; };
>> union { uint64_t rsp; uint32_t esp; uint16_t sp; uint8_t spl; };
>> - uint16_t ss, _pad2[3];
>> + union {
>> + uint16_t ss;
>> + unsigned long ssx;
> What use do you foresee for this and "csx"?
That also came from Linux. I'm using it to zero the control metadata so
ERETU behaves more like IRET.
>
>> + struct {
>> + /*
>> + * Bits 0 thru 31 control ERET{U,S} behaviour, and is state about
>> + * the event which occured.
>> + */
>> + uint16_t ss;
>> + bool sti:1; /* Was blocked-by-STI, and not cancelled */
>> + bool swint:1; /* Was a SYSCALL/SYSENTER/INT $N */
>> + bool nmi:1; /* Was an NMI. */
>> + unsigned long :13;
>> +
>> + /*
>> + * Bits 32 thru 63 are ignored by ERET{U,S} and are informative
>> + * only.
>> + */
>> + uint8_t vector;
>> + unsigned long :8;
>> + unsigned int type:4; /* X86_ET_* */
>> + unsigned long :4;
>> + bool enclave:1; /* Event taken in SGX mode */
>> + bool lm:1; /* Was in Long Mode */
> The bit indicates 64-bit mode aiui, not long mode (without which FRED isn't even
> available).
Oh, yes. This is something that changed across revisions, and I wrote
this patch to an older spec.
It's %cs.l of the interrupted context, so I probably should just drop the m.
>
>> --- a/xen/arch/x86/include/asm/current.h
>> +++ b/xen/arch/x86/include/asm/current.h
>> @@ -38,6 +38,8 @@ struct vcpu;
>>
>> struct cpu_info {
>> struct cpu_user_regs guest_cpu_user_regs;
>> + struct fred_info _fred; /* Only used when FRED is active. */
> Any particular need for the leading underscore?
Somewhat, yes. It's not safe to reference this field, except for
loading MSR_PL0_RSP.
Everyone else should use cpu_regs_fred_info() to get the fred_info,
which has a safety ASSERT().
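
For reference, a minimal C sketch decoding the 64-bit ss slot with shifts and masks. Bit positions follow the field order of the quoted struct (which relies on the compiler's bitfield layout), the "l" name assumes the rename discussed above, and struct fred_ss_info plus both helpers are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical decoded view; names mirror the quoted struct. */
struct fred_ss_info {
    uint16_t ss;
    bool sti, swint, nmi;   /* Bits 16, 17, 18 */
    uint8_t vector;         /* Bits 32 thru 39 */
    unsigned int type;      /* Bits 48 thru 51, X86_ET_* */
    bool enclave;           /* Bit 56 */
    bool l;                 /* Bit 57, %cs.l of the interrupted context */
};

static inline struct fred_ss_info fred_decode_ss(uint64_t ssx)
{
    struct fred_ss_info i = {
        .ss      = ssx & 0xffff,
        .sti     = (ssx >> 16) & 1,
        .swint   = (ssx >> 17) & 1,
        .nmi     = (ssx >> 18) & 1,
        .vector  = (ssx >> 32) & 0xff,
        .type    = (ssx >> 48) & 0xf,
        .enclave = (ssx >> 56) & 1,
        .l       = (ssx >> 57) & 1,
    };

    return i;
}

/* Writing the whole slot with just the selector zeroes all metadata. */
static inline uint64_t fred_ssx_selector_only(uint16_t ss)
{
    return ss;
}
```

The second helper models the stated use of ssx: assigning the full-width field clears the control bits in one go.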
~Andrew
^ permalink raw reply [flat|nested] 120+ messages in thread
* Re: [PATCH 19/22] x86/boot: Use RSTORSSP to establish SSP
2025-08-08 20:23 ` [PATCH 19/22] x86/boot: Use RSTORSSP to establish SSP Andrew Cooper
@ 2025-08-14 15:11 ` Jan Beulich
2025-08-14 20:09 ` Andrew Cooper
0 siblings, 1 reply; 120+ messages in thread
From: Jan Beulich @ 2025-08-14 15:11 UTC (permalink / raw)
To: Andrew Cooper; +Cc: Roger Pau Monné, Xen-devel
On 08.08.2025 22:23, Andrew Cooper wrote:
> Under FRED, SETSSBSY is unavailable, and we want to be setting up FRED prior
> to setting up shadow stacks. As we still need Supervisor Tokens in IDT mode,
> we need mode-specific logic to establish SSP.
>
> In FRED mode, write a Restore Token, RSTORSSP it, and discard the resulting
> Previous-SSP token.
>
> No change outside of FRED mode.
>
> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Why is it that in patch 17 you could use identical code, but here you can't?
> @@ -912,10 +913,30 @@ static void __init noreturn reinit_bsp_stack(void)
>
> if ( cpu_has_xen_shstk )
> {
> - wrmsrl(MSR_PL0_SSP,
> - (unsigned long)stack + (PRIMARY_SHSTK_SLOT + 1) * PAGE_SIZE - 8);
Does this removal perhaps belong elsewhere, especially with "No change
outside of FRED mode" in the description?
Jan
^ permalink raw reply [flat|nested] 120+ messages in thread
* Re: [PATCH 20/22] x86/traps: Alter switch_stack_and_jump() for FRED mode
2025-08-08 20:23 ` [PATCH 20/22] x86/traps: Alter switch_stack_and_jump() for FRED mode Andrew Cooper
@ 2025-08-14 15:35 ` Jan Beulich
2025-08-14 20:55 ` Andrew Cooper
0 siblings, 1 reply; 120+ messages in thread
From: Jan Beulich @ 2025-08-14 15:35 UTC (permalink / raw)
To: Andrew Cooper; +Cc: Roger Pau Monné, Xen-devel
On 08.08.2025 22:23, Andrew Cooper wrote:
> FRED and IDT differ by a Supervisor Token on the base of the shstk. This
> means that switch_stack_and_jump() needs to discard one extra word when FRED
> is active.
>
> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
> ---
> CC: Jan Beulich <JBeulich@suse.com>
> CC: Roger Pau Monné <roger.pau@citrix.com>
>
> RFC. I don't like this, but it does work.
>
> This emits opt_fred logic outside of CONFIG_XEN_SHSTK.
opt_fred and XEN_SHSTK are orthogonal, so that's fine anyway. What I guess
you may mean is that you now have a shstk-related calculation outside of
a respective #ifdef. Given the simplicity of the calculation, ...
> But frankly, the
> construct is already too unwieldy, and all options I can think of make it
> more so.
... I agree having it like this is okay.
> @@ -154,7 +155,6 @@ unsigned long get_stack_dump_bottom (unsigned long sp);
> "rdsspd %[ssp];" \
> "cmp $1, %[ssp];" \
> "je .L_shstk_done.%=;" /* CET not active? Skip. */ \
> - "mov $%c[skstk_base], %[val];" \
> "and $%c[stack_mask], %[ssp];" \
> "sub %[ssp], %[val];" \
> "shr $3, %[val];" \
With the latter two insns here, ...
> @@ -177,6 +177,8 @@ unsigned long get_stack_dump_bottom (unsigned long sp);
>
> #define switch_stack_and_jump(fn, instr, constr) \
> ({ \
> + unsigned int token_offset = \
> + (PRIMARY_SHSTK_SLOT + 1) * PAGE_SIZE - (opt_fred ? 0 : 8); \
> unsigned int tmp; \
> BUILD_BUG_ON(!ssaj_has_attr_noreturn(fn)); \
> __asm__ __volatile__ ( \
> @@ -184,12 +186,11 @@ unsigned long get_stack_dump_bottom (unsigned long sp);
> "mov %[stk], %%rsp;" \
> CHECK_FOR_LIVEPATCH_WORK \
> instr "[fun]" \
> - : [val] "=&r" (tmp), \
> + : [val] "=r" (tmp), \
... I don't think you can legitimately drop the & from here? With it
retained:
Reviewed-by: Jan Beulich <jbeulich@suse.com>
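
The token_offset expression in the hunk above, spelled out as a small C sketch. The PRIMARY_SHSTK_SLOT value of 5 is an assumption made purely for illustration, and shstk_token_offset() is a hypothetical helper.

```c
#include <assert.h>
#include <stdbool.h>

#define PAGE_SIZE          4096u
#define PRIMARY_SHSTK_SLOT 5u /* Assumed value, for illustration only. */

/*
 * In IDT mode a Supervisor Token occupies the top slot of the shadow
 * stack, so the base is 8 bytes below the end of the page.  In FRED mode
 * there is no such token.
 */
static inline unsigned int shstk_token_offset(bool fred)
{
    return (PRIMARY_SHSTK_SLOT + 1) * PAGE_SIZE - (fred ? 0 : 8);
}
```

The two modes differ by exactly one 8-byte word, which is the extra word switch_stack_and_jump() has to discard.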
Jan
^ permalink raw reply [flat|nested] 120+ messages in thread
* Re: [PATCH 21/22] x86/traps: Introduce FRED entrypoints
2025-08-08 20:23 ` [PATCH 21/22] x86/traps: Introduce FRED entrypoints Andrew Cooper
2025-08-11 11:38 ` Andrew Cooper
@ 2025-08-14 15:57 ` Jan Beulich
2025-08-14 20:40 ` Andrew Cooper
2025-08-18 10:03 ` Jan Beulich
2 siblings, 1 reply; 120+ messages in thread
From: Jan Beulich @ 2025-08-14 15:57 UTC (permalink / raw)
To: Andrew Cooper; +Cc: Roger Pau Monné, Xen-devel
On 08.08.2025 22:23, Andrew Cooper wrote:
> --- a/xen/arch/x86/include/asm/asm_defns.h
> +++ b/xen/arch/x86/include/asm/asm_defns.h
> @@ -315,6 +315,71 @@ static always_inline void stac(void)
> subq $-(UREGS_error_code-UREGS_r15+\adj), %rsp
> .endm
>
> +/*
> + * Push and clear GPRs
> + */
> +.macro PUSH_AND_CLEAR_GPRS
> + push %rdi
> + xor %edi, %edi
> + push %rsi
> + xor %esi, %esi
> + push %rdx
> + xor %edx, %edx
> + push %rcx
> + xor %ecx, %ecx
> + push %rax
> + xor %eax, %eax
> + push %r8
> + xor %r8d, %r8d
> + push %r9
> + xor %r9d, %r9d
> + push %r10
> + xor %r10d, %r10d
> + push %r11
> + xor %r11d, %r11d
> + push %rbx
> + xor %ebx, %ebx
> + push %rbp
> +#ifdef CONFIG_FRAME_POINTER
> +/* Indicate special exception stack frame by inverting the frame pointer. */
> + mov %rsp, %rbp
> + notq %rbp
> +#else
> + xor %ebp, %ebp
> +#endif
> + push %r12
> + xor %r12d, %r12d
> + push %r13
> + xor %r13d, %r13d
> + push %r14
> + xor %r14d, %r14d
> + push %r15
> + xor %r15d, %r15d
> +.endm
> +
> +/*
> + * POP GPRs from a UREGS_* frame on the stack. Does not modify flags.
> + *
> + * @rax: Alternative destination for the %rax value on the stack.
> + */
> +.macro POP_GPRS rax=%rax
> + pop %r15
> + pop %r14
> + pop %r13
> + pop %r12
> + pop %rbp
> + pop %rbx
> + pop %r11
> + pop %r10
> + pop %r9
> + pop %r8
> + pop \rax
> + pop %rcx
> + pop %rdx
> + pop %rsi
> + pop %rdi
> +.endm
Hmm, yes, differences are apparently large enough to warrant the redundancy
with SAVE_ALL / RESTORE_ALL.
> --- a/xen/arch/x86/include/asm/msr.h
> +++ b/xen/arch/x86/include/asm/msr.h
> @@ -202,9 +202,9 @@ static inline unsigned long read_gs_base(void)
>
> static inline unsigned long read_gs_shadow(void)
> {
> - unsigned long base;
> + unsigned long base, cr4 = read_cr4();
>
> - if ( read_cr4() & X86_CR4_FSGSBASE )
> + if ( !(cr4 & X86_CR4_FRED) && (cr4 & X86_CR4_FSGSBASE) )
> {
> asm volatile ( "swapgs" );
> base = __rdgsbase();
> @@ -234,7 +234,9 @@ static inline void write_gs_base(unsigned long base)
>
> static inline void write_gs_shadow(unsigned long base)
> {
> - if ( read_cr4() & X86_CR4_FSGSBASE )
> + unsigned long cr4 = read_cr4();
> +
> + if ( !(cr4 & X86_CR4_FRED) && (cr4 & X86_CR4_FSGSBASE) )
> {
> asm volatile ( "swapgs\n\t"
> "wrgsbase %0\n\t"
I don't quite get how these changes fit into this patch.
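
For reference, the condition both hunks introduce, as a standalone C sketch. gs_shadow_via_swapgs() is a hypothetical helper name, and the rationale in its comment is taken from the patch 15 commit message (FRED disallows SWAPGS).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define X86_CR4_FSGSBASE UINT64_C(0x00010000)
#define X86_CR4_FRED     (UINT64_C(1) << 32)

/*
 * The SWAPGS + {RD,WR}GSBASE fast path is only usable when FSGSBASE is
 * enabled and FRED is not active, as FRED disallows SWAPGS.
 */
static inline bool gs_shadow_via_swapgs(uint64_t cr4)
{
    return !(cr4 & X86_CR4_FRED) && (cr4 & X86_CR4_FSGSBASE);
}
```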
> --- a/xen/arch/x86/traps.c
> +++ b/xen/arch/x86/traps.c
> @@ -1013,6 +1013,32 @@ void show_execution_state_nmi(const cpumask_t *mask, bool show_all)
> printk("Non-responding CPUs: {%*pbl}\n", CPUMASK_PR(&show_state_mask));
> }
>
> +static const char *x86_et_name(unsigned int type)
> +{
> + static const char *const names[] = {
> + [X86_ET_EXT_INTR] = "EXT_INTR",
> + [X86_ET_NMI] = "NMI",
> + [X86_ET_HW_EXC] = "HW_EXC",
> + [X86_ET_SW_INT] = "SW_INT",
> + [X86_ET_PRIV_SW_EXC] = "PRIV_SW_EXEC",
> + [X86_ET_SW_EXC] = "SW_EXEC",
> + [X86_ET_OTHER] = "OTHER",
> + };
> +
> + return (type < ARRAY_SIZE(names) && names[type]) ? names[type] : "???";
> +}
> +
> +static const char *x86_et_other_name(unsigned int vec)
This isn't really a vector, is it?
> +{
> + static const char *const names[] = {
> + [0] = "MTF",
> + [1] = "SYSCALL",
> + [2] = "SYSENTER",
> + };
> +
> + return (vec < ARRAY_SIZE(names) && names[vec][0]) ? names[vec] : "???";
Did you mean to check names[vec] for being NULL? Or is this a leftover
from the array being something like names[][10]?
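
To show why the NULL check matters for an array of pointers, a minimal C sketch with a deliberately sparse table. Entry [4] and the sparse_name() helper are hypothetical and exist only to create a gap.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

static inline const char *sparse_name(unsigned int idx)
{
    static const char *const names[] = {
        [0] = "MTF",
        [1] = "SYSCALL",
        [2] = "SYSENTER",
        [4] = "HYPOTHETICAL", /* Leaves names[3] as a NULL pointer. */
    };

    /*
     * With an array of pointers, gaps are NULL, so test the pointer.
     * Testing names[idx][0] would dereference NULL for idx == 3; that form
     * only makes sense for a two-dimensional char array.
     */
    return (idx < ARRAY_SIZE(names) && names[idx]) ? names[idx] : "???";
}
```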
> --- a/xen/arch/x86/x86_64/Makefile
> +++ b/xen/arch/x86/x86_64/Makefile
> @@ -1,6 +1,7 @@
> obj-$(CONFIG_PV32) += compat/
>
> obj-bin-y += entry.o
> +obj-bin-y += entry-fred.o
For the ordering here, ...
> --- /dev/null
> +++ b/xen/arch/x86/x86_64/entry-fred.S
> @@ -0,0 +1,35 @@
> +/* SPDX-License-Identifier: GPL-2.0-or-later */
> +
> + .file "x86_64/entry-fred.S"
> +
> +#include <asm/asm_defns.h>
> +#include <asm/page.h>
> +
> + .section .text.entry, "ax", @progbits
> +
> + /* The Ring3 entry point is required to be 4k aligned. */
> +
> +FUNC(entry_FRED_R3, 4096)
... doesn't this 4k-alignment requirement suggest we want to put
entry-fred.o first? Also, might it be more natural to use PAGE_SIZE
here?
Jan
^ permalink raw reply [flat|nested] 120+ messages in thread
* [PATCH v1.1 08/22] x86/traps: Introduce percpu_early_traps_init() and set up exception handling earlier
2025-08-08 20:23 ` [PATCH 08/22] x86/traps: Introduce ap_early_traps_init() and set up exception handling earlier Andrew Cooper
2025-08-12 8:41 ` Jan Beulich
@ 2025-08-14 18:07 ` Andrew Cooper
2025-08-15 9:24 ` Jan Beulich
1 sibling, 1 reply; 120+ messages in thread
From: Andrew Cooper @ 2025-08-14 18:07 UTC (permalink / raw)
To: Xen-devel; +Cc: Andrew Cooper, Jan Beulich, Roger Pau Monné
As things stand, we set up AP/S3 exception handling marginally after the
fragile activity of setting up shadow stacks. Shadow stack setup is going to
get more complicated under FRED.
Introduce percpu_early_traps_init() and call it ahead of setting up shadow
stacks. To start with, call load_system_tables() which is sufficient to set
up full exception handling.
In order to handle exceptions, current and the speculation controls need to
work. cpu_smpboot_alloc() already constructs some of the AP's top-of-stack
block, so have it set up a little more. Zero the whole structure to subsume
other misc setup.
This gets us complete exception coverage of setting up shadow stacks, rather
than dying with a triple fault.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
CC: Jan Beulich <JBeulich@suse.com>
CC: Roger Pau Monné <roger.pau@citrix.com>
v2:
* Rename to percpu_early_traps_init()
* Reorder setup
---
xen/arch/x86/acpi/wakeup_prot.S | 5 +++--
xen/arch/x86/boot/x86_64.S | 5 ++++-
xen/arch/x86/smpboot.c | 19 ++++++-------------
xen/arch/x86/traps-setup.c | 12 ++++++++++++
4 files changed, 25 insertions(+), 16 deletions(-)
diff --git a/xen/arch/x86/acpi/wakeup_prot.S b/xen/arch/x86/acpi/wakeup_prot.S
index 92af6230b31f..cc40fddc38d4 100644
--- a/xen/arch/x86/acpi/wakeup_prot.S
+++ b/xen/arch/x86/acpi/wakeup_prot.S
@@ -63,6 +63,9 @@ LABEL(s3_resume)
pushq %rax
lretq
1:
+ /* Set up early exceptions and CET before entering C properly. */
+ call percpu_early_traps_init
+
#if defined(CONFIG_XEN_SHSTK) || defined(CONFIG_XEN_IBT)
call xen_msr_s_cet_value
test %eax, %eax
@@ -117,8 +120,6 @@ LABEL(s3_resume)
.L_cet_done:
#endif /* CONFIG_XEN_SHSTK || CONFIG_XEN_IBT */
- call load_system_tables
-
/* Restore CR4 from the cpuinfo block. */
GET_STACK_END(bx)
mov STACK_CPUINFO_FIELD(cr4)(%rbx), %rax
diff --git a/xen/arch/x86/boot/x86_64.S b/xen/arch/x86/boot/x86_64.S
index 95a6b6cf63bd..d0e7449a149f 100644
--- a/xen/arch/x86/boot/x86_64.S
+++ b/xen/arch/x86/boot/x86_64.S
@@ -30,7 +30,10 @@ ENTRY(__high_start)
test %ebx,%ebx
jz .L_bsp
- /* APs. Set up CET before entering C properly. */
+ /* APs. Set up early exceptions and CET before entering C properly. */
+
+ call percpu_early_traps_init
+
#if defined(CONFIG_XEN_SHSTK) || defined(CONFIG_XEN_IBT)
call xen_msr_s_cet_value
test %eax, %eax
diff --git a/xen/arch/x86/smpboot.c b/xen/arch/x86/smpboot.c
index ce4862dde5a7..efb5adb3a12a 100644
--- a/xen/arch/x86/smpboot.c
+++ b/xen/arch/x86/smpboot.c
@@ -327,12 +327,7 @@ void asmlinkage start_secondary(void)
struct cpu_info *info = get_cpu_info();
unsigned int cpu = smp_processor_id();
- /* Critical region without IDT or TSS. Any fault is deadly! */
-
- set_current(idle_vcpu[cpu]);
- this_cpu(curr_vcpu) = idle_vcpu[cpu];
rdmsrl(MSR_EFER, this_cpu(efer));
- init_shadow_spec_ctrl_state(info);
/*
* Just as during early bootstrap, it is convenient here to disable
@@ -352,14 +347,6 @@ void asmlinkage start_secondary(void)
*/
spin_debug_disable();
- get_cpu_info()->use_pv_cr3 = false;
- get_cpu_info()->xen_cr3 = 0;
- get_cpu_info()->pv_cr3 = 0;
-
- load_system_tables();
-
- /* Full exception support from here on in. */
-
if ( cpu_has_pks )
wrpkrs_and_cache(0); /* Must be before setting CR4.PKS */
@@ -1064,9 +1051,15 @@ static int cpu_smpboot_alloc(unsigned int cpu)
goto out;
info = get_cpu_info_from_stack((unsigned long)stack_base[cpu]);
+ memset(info, 0, sizeof(*info));
info->processor_id = cpu;
info->per_cpu_offset = __per_cpu_offset[cpu];
+ init_shadow_spec_ctrl_state(info);
+
+ info->current_vcpu = idle_vcpu[cpu]; /* set_current() */
+ per_cpu(curr_vcpu, cpu) = idle_vcpu[cpu];
+
gdt = per_cpu(gdt, cpu) ?: alloc_xenheap_pages(0, memflags);
if ( gdt == NULL )
goto out;
diff --git a/xen/arch/x86/traps-setup.c b/xen/arch/x86/traps-setup.c
index 99257bbb16ec..758c67b335bd 100644
--- a/xen/arch/x86/traps-setup.c
+++ b/xen/arch/x86/traps-setup.c
@@ -127,3 +127,15 @@ void percpu_traps_init(void)
if ( cpu_has_xen_lbr )
wrmsrl(MSR_IA32_DEBUGCTLMSR, IA32_DEBUGCTLMSR_LBR);
}
+
+/*
+ * Configure exception handling on APs and S3. Called before entering C
+ * properly, and before shadow stacks are activated.
+ *
+ * boot_gdt is currently loaded, and we must switch to our local GDT. The
+ * local IDT has unknown IST-ness.
+ */
+void asmlinkage percpu_early_traps_init(void)
+{
+ load_system_tables();
+}
--
2.39.5
^ permalink raw reply related [flat|nested] 120+ messages in thread
* Re: [PATCH 09/22] x86/traps: Move load_system_tables() into traps-setup.c
2025-08-14 8:55 ` Jan Beulich
@ 2025-08-14 18:09 ` Andrew Cooper
2025-08-15 8:22 ` Jan Beulich
0 siblings, 1 reply; 120+ messages in thread
From: Andrew Cooper @ 2025-08-14 18:09 UTC (permalink / raw)
To: Jan Beulich; +Cc: Roger Pau Monné, Xen-devel
On 14/08/2025 9:55 am, Jan Beulich wrote:
> On 13.08.2025 13:25, Andrew Cooper wrote:
>> On 12/08/2025 10:19 am, Jan Beulich wrote:
>>> On 08.08.2025 22:23, Andrew Cooper wrote:
>>>> Since commit a35816b5cae8 ("x86/traps: Introduce early_traps_init() and
>>>> simplify setup"), load_system_tables() is called later on the BSP, so the
>>>> SYS_STATE_early_boot check can be dropped from the safety BUG_ON().
>>>>
>>>> Move the BUILD_BUG_ON() into build_assertions(),
>>> I'm not quite convinced of this move - having the related BUILD_BUG_ON()
>>> and BUG_ON() next to each other would seem better to me.
>> I don't see a specific reason for them to be together, and the comment
>> explains what's going on.
>>
>> With FRED, we want a related BUILD_BUG_ON(), but there's no equivalent
>> BUG_ON() because MSR_RSP_SL0 will #GP on being misaligned.
> That BUILD_BUG_ON() could then sit next to the MSR write? Unless of course
> that ends up sitting in an assembly source.
It's the bottom hunk in patch 14, which you've looked at now.
Personally, I think both BUILD_BUG_ON()'s should be together, because
they are related.
~Andrew
^ permalink raw reply [flat|nested] 120+ messages in thread
* Re: [PATCH 09/22] x86/traps: Move load_system_tables() into traps-setup.c
2025-08-14 7:26 ` Jan Beulich
@ 2025-08-14 18:20 ` Andrew Cooper
2025-08-15 8:30 ` Jan Beulich
0 siblings, 1 reply; 120+ messages in thread
From: Andrew Cooper @ 2025-08-14 18:20 UTC (permalink / raw)
To: Jan Beulich; +Cc: Xen-devel, Roger Pau Monné, Nicola Vetrini
On 14/08/2025 8:26 am, Jan Beulich wrote:
> On 13.08.2025 13:36, Andrew Cooper wrote:
>> On 12/08/2025 10:43 am, Nicola Vetrini wrote:
>>> On 2025-08-08 22:23, Andrew Cooper wrote:
>>>> diff --git a/xen/arch/x86/traps-setup.c b/xen/arch/x86/traps-setup.c
>>>> index 8ca379c9e4cb..13b8fcf0ba51 100644
>>>> --- a/xen/arch/x86/traps-setup.c
>>>> +++ b/xen/arch/x86/traps-setup.c
>>>> @@ -19,6 +20,124 @@ boolean_param("ler", opt_ler);
>>>>
>>>> void nocall entry_PF(void);
>>>>
>>>> +/*
>>>> + * Sets up system tables and descriptors for IDT delivery.
>>>> + *
>>>> + * - Sets up TSS with stack pointers, including ISTs
>>>> + * - Inserts TSS selector into regular and compat GDTs
>>>> + * - Loads GDT, IDT, TR then null LDT
>>>> + * - Sets up IST references in the IDT
>>>> + */
>>>> +static void load_system_tables(void)
>>>> +{
>>>> + unsigned int i, cpu = smp_processor_id();
>>>> + unsigned long stack_bottom = get_stack_bottom(),
>>>> + stack_top = stack_bottom & ~(STACK_SIZE - 1);
>>>> + /*
>>>> + * NB: define tss_page as a local variable because clang 3.5 doesn't
>>>> + * support using ARRAY_SIZE against per-cpu variables.
>>>> + */
>>>> + struct tss_page *tss_page = &this_cpu(tss_page);
>>>> + idt_entry_t *idt = this_cpu(idt);
>>>> +
>>> Given the clang baseline this might not be needed anymore?
>> Hmm. While true, looking at 51461114e26, the code is definitely better
>> written with the tss_page variable and we wouldn't want to go back to
>> the old form.
>>
>> I think that I'll simply drop the comment.
>>
>> ~Andrew
>>
>> P.S.
>>
>> Generally speaking, because of the RELOC_HIDE() in this_cpu(), any time
>> you ever want two accesses to a variable, it's better (code gen wise) to
>> construct a pointer to it and use the point multiple times.
>>
>> I don't understand why there's a RELOC_HIDE() in this_cpu(). The
>> justification doesn't make sense, but I've not had time to explore what
>> happens if we take it out.
> There's no justification in xen/percpu.h?
Well, it's given in compiler.h by RELOC_HIDE().
/* This macro obfuscates arithmetic on a variable address so that gcc
shouldn't recognize the original var, and make assumptions about it */
But this is far from convincing.
>
> My understanding is that we simply may not expose any accesses to per_cpu_*
> variables directly to the compiler, or there's a risk that it might access
> the "master" variable (i.e. CPU0's on at least x86).
RELOC_HIDE() doesn't do anything about the correctness of the pointer
arithmetic expression to make the access work.
I don't see how a correct expression can ever access CPU0's data by
accident.
~Andrew
^ permalink raw reply [flat|nested] 120+ messages in thread
* Re: [PATCH 13/22] x86: FRED enumerations
2025-08-14 13:19 ` Jan Beulich
@ 2025-08-14 18:45 ` Andrew Cooper
2025-08-15 8:34 ` Jan Beulich
0 siblings, 1 reply; 120+ messages in thread
From: Andrew Cooper @ 2025-08-14 18:45 UTC (permalink / raw)
To: Jan Beulich; +Cc: Roger Pau Monné, Xen-devel
On 14/08/2025 2:19 pm, Jan Beulich wrote:
> On 14.08.2025 13:20, Jan Beulich wrote:
>> On 08.08.2025 22:23, Andrew Cooper wrote:
>>> --- a/xen/include/public/arch-x86/cpufeatureset.h
>>> +++ b/xen/include/public/arch-x86/cpufeatureset.h
>>> @@ -310,7 +310,10 @@ XEN_CPUFEATURE(ARCH_PERF_MON, 10*32+8) /* Architectural Perfmon */
>>> XEN_CPUFEATURE(FZRM, 10*32+10) /*A Fast Zero-length REP MOVSB */
>>> XEN_CPUFEATURE(FSRS, 10*32+11) /*A Fast Short REP STOSB */
>>> XEN_CPUFEATURE(FSRCS, 10*32+12) /*A Fast Short REP CMPSB/SCASB */
>>> +XEN_CPUFEATURE(FRED, 10*32+17) /* Fast Return and Event Delivery */
>>> +XEN_CPUFEATURE(LKGS, 10*32+18) /* Load Kernel GS instruction */
>>> XEN_CPUFEATURE(WRMSRNS, 10*32+19) /*S WRMSR Non-Serialising */
>>> +XEN_CPUFEATURE(NMI_SRC, 10*32+20) /* NMI-Source Reporting */
>>> XEN_CPUFEATURE(AMX_FP16, 10*32+21) /* AMX FP16 instruction */
>>> XEN_CPUFEATURE(AVX_IFMA, 10*32+23) /*A AVX-IFMA Instructions */
>>> XEN_CPUFEATURE(LAM, 10*32+26) /* Linear Address Masking */
>> I'd like to note that we could long have had this if my long-pending emulator
>> patch had gone in at some point.
> Actually what I further have there, and what in the context of patch 15 I
> notice you should have here is
>
> --- a/xen/tools/gen-cpuid.py
> +++ b/xen/tools/gen-cpuid.py
> @@ -278,7 +278,8 @@ def crunch_numbers(state):
> # superpages, PCID and PKU are only available in 4 level paging.
> # NO_LMSL indicates the absense of Long Mode Segment Limits, which
> # have been dropped in hardware.
> - LM: [CX16, PCID, LAHF_LM, PAGE1GB, PKU, NO_LMSL, AMX_TILE, CMPCCXADD],
> + LM: [CX16, PCID, LAHF_LM, PAGE1GB, PKU, NO_LMSL, AMX_TILE, CMPCCXADD,
> + LKGS],
>
> # AMD K6-2+ and K6-III processors shipped with 3DNow+, beyond the
> # standard 3DNow in the earlier K6 processors.
> @@ -347,6 +348,9 @@ def crunch_numbers(state):
> # computational instructions. All further AMX features are built on top
> # of AMX-TILE.
> AMX_TILE: [AMX_BF16, AMX_INT8, AMX_FP16, AMX_COMPLEX],
> +
> + # FRED builds on the LKGS instruction.
> + LKGS: [FRED],
> }
>
> deep_features = tuple(sorted(deps.keys()))
Hmm. Yes, but normally this is part of guest enablement.
Having now done the Xen work and concluded that we don't actually need
LKGS, I'm rethinking the linkage here. It's probably the right thing to
do in practice, but probably needs a bit more in the way of
justification. "built on" doesn't quite cut it IMO.
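As a rough illustration of what the proposed LM -> LKGS -> FRED linkage would mean in practice, here is a minimal C sketch of cascading dependency removal. It is not how gen-cpuid.py is implemented; the DEMO_* bit positions and demo_apply_deps() are hypothetical names invented for this example.

```c
#include <stdint.h>

/* Illustrative bit positions only; not the real featureset layout. */
#define DEMO_LM    (UINT32_C(1) << 0)
#define DEMO_LKGS  (UINT32_C(1) << 1)
#define DEMO_FRED  (UINT32_C(1) << 2)

/*
 * Clear every feature whose prerequisite is missing.  The LM -> LKGS
 * step runs before LKGS -> FRED, so hiding LM cascades down to FRED.
 */
static uint32_t demo_apply_deps(uint32_t feats)
{
    if ( !(feats & DEMO_LM) )
        feats &= ~DEMO_LKGS;

    if ( !(feats & DEMO_LKGS) )
        feats &= ~DEMO_FRED;

    return feats;
}
```

With this shape, a policy that hides LKGS can never end up advertising FRED, which is the practical effect of the dependency being discussed.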
~Andrew
^ permalink raw reply [flat|nested] 120+ messages in thread
* Re: [PATCH 15/22] x86/traps: Introduce opt_fred
2025-08-14 13:30 ` Jan Beulich
@ 2025-08-14 19:16 ` Andrew Cooper
2025-08-15 8:37 ` Jan Beulich
0 siblings, 1 reply; 120+ messages in thread
From: Andrew Cooper @ 2025-08-14 19:16 UTC (permalink / raw)
To: Jan Beulich; +Cc: Xen-devel, Roger Pau Monné
On 14/08/2025 2:30 pm, Jan Beulich wrote:
> On 08.08.2025 22:23, Andrew Cooper wrote:
>> ... disabled by default. There is a lot of work before FRED can be enabled by
>> default.
>>
>> One part of FRED, the LKGS (Load Kernel GS) instruction, is enumerated
>> separately but is mandatory as FRED disallows the SWAPGS instruction.
>> Therefore, both CPUID bits must be checked.
> See my (further) reply to patch 13 - I think FRED simply ought to depend on
> LKGS.
>
>> @@ -20,6 +22,9 @@ unsigned int __ro_after_init ler_msr;
>> static bool __initdata opt_ler;
>> boolean_param("ler", opt_ler);
>>
>> +int8_t __ro_after_init opt_fred = 0; /* -1 when supported. */
> I'm a little puzzled by the comment? DYM "once default-enabled"?
Well, I have this temporary patch
https://gitlab.com/xen-project/hardware/xen-staging/-/commit/70ef6a1178a411a29b7b1745a1112e267ffb6245
that will turn into a real patch when we enable FRED by default.
As much as anything else, it was just a TODO.
> Then ...
>
>> @@ -305,6 +310,32 @@ void __init traps_init(void)
>> /* Replace early pagefault with real pagefault handler. */
>> _update_gate_addr_lower(&bsp_idt[X86_EXC_PF], entry_PF);
>>
>> + if ( !cpu_has_fred || !cpu_has_lkgs )
>> + {
>> + if ( opt_fred )
> ... this won't work anymore once the initializer is changed.
Hmm yes. That wants to be an == 1 check. Fixed.
>
>> + printk(XENLOG_WARNING "FRED not available, ignoring\n");
>> + opt_fred = false;
> Better use 0 here?
>
>> + }
>> +
>> + if ( opt_fred == -1 )
>> + opt_fred = !pv_shim;
> Imo it would be better to have the initializer be -1 right away, and comment
> out the "!pv_shim" here, until we mean it to be default-enabled.
It cannot be -1, or Xen will fail spectacularly on any FRED capable
hardware. Setting to -1 is the point at which FRED becomes security
supported.
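To make the tristate semantics concrete, the following is a small standalone model of the logic being discussed, assuming -1 means "default", 0 means "off" and 1 means "explicitly requested". The struct and demo_resolve_fred() are hypothetical and only mirror the shape of the hunk in traps_init(), with the warning tied to an explicit request (the "== 1" check).

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical result type: the resolved option, and whether to warn. */
struct demo_fred_decision {
    int8_t opt;   /* 0 = use IDT delivery, 1 = use FRED delivery */
    bool warn;    /* true if the user asked for FRED but cannot have it */
};

/*
 * opt:        -1 default, 0 disabled, 1 explicitly enabled.
 * hw_capable: stands in for cpu_has_fred && cpu_has_lkgs.
 * pv_shim:    stands in for the pv_shim flag.
 */
static struct demo_fred_decision demo_resolve_fred(int8_t opt, bool hw_capable,
                                                   bool pv_shim)
{
    struct demo_fred_decision d = { .opt = opt, .warn = false };

    if ( !hw_capable )
    {
        d.warn = (opt == 1);   /* Only warn on an explicit request. */
        d.opt = 0;
    }

    if ( d.opt == -1 )
        d.opt = !pv_shim;      /* Default: on, except in the PV shim. */

    return d;
}
```

In this model a default of -1 on hardware without FRED resolves silently to 0, whereas a plain truthiness test on the option would also have warned for the -1 default.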
>
>> + if ( opt_fred )
>> + {
>> +#ifdef CONFIG_PV32
>> + if ( opt_pv32 )
>> + {
>> + opt_pv32 = 0;
>> + printk(XENLOG_INFO "Disabling PV32 due to FRED\n");
>> + }
>> +#endif
>> + printk("Using FRED event delivery\n");
>> + }
>> + else
>> + {
>> + printk("Using IDT event delivery\n");
>> + }
> Could I talk you into omitting the figure braces here? Hmm, or perhaps you
> mean to later move code here.
Indeed, patch 22.
~Andrew
^ permalink raw reply [flat|nested] 120+ messages in thread
* Re: [PATCH 16/22] x86/boot: Adjust CR4 handling around ap_early_traps_init()
2025-08-14 14:56 ` Jan Beulich
@ 2025-08-14 19:22 ` Andrew Cooper
0 siblings, 0 replies; 120+ messages in thread
From: Andrew Cooper @ 2025-08-14 19:22 UTC (permalink / raw)
To: Jan Beulich; +Cc: Roger Pau Monné, Xen-devel
On 14/08/2025 3:56 pm, Jan Beulich wrote:
> On 14.08.2025 16:54, Andrew Cooper wrote:
>> On 14/08/2025 3:47 pm, Jan Beulich wrote:
>>> On 08.08.2025 22:23, Andrew Cooper wrote:
>>>> --- a/xen/arch/x86/boot/x86_64.S
>>>> +++ b/xen/arch/x86/boot/x86_64.S
>>>> @@ -11,16 +11,19 @@ ENTRY(__high_start)
>>>> mov %ecx,%gs
>>>> mov %ecx,%ss
>>>>
>>>> - /* Enable minimal CR4 features. */
>>>> - mov $XEN_MINIMAL_CR4,%rcx
>>>> - mov %rcx,%cr4
>>>> -
>>>> mov stack_start(%rip),%rsp
>>>>
>>>> /* Reset EFLAGS (subsumes CLI and CLD). */
>>>> pushq $0
>>>> popf
>>>>
>>>> + GET_STACK_END(15)
>>>> +
>>>> + /* Enable minimal CR4 features. */
>>>> + mov $XEN_MINIMAL_CR4, %eax
>>>> + mov %rax, STACK_CPUINFO_FIELD(cr4)(%r15)
>>> ... this could be 32-bit stores, even in the longer run.
>> ... no, they can't.
>>
>> The store also serves to clear out stale X86_CR4_FRED, prior to FRED
>> getting reconfigured again.
>>
>> fatal_trap() uses info->cr4 to decide whether it's safe to look at the
>> extended FRED metadata. Strictly speaking I probably ought to read the
>> real CR4 (in read_registers too), but using a 32bit store here would
>> extend a 1-instruction window into quite a larger window where exception
>> handling would not work quite right.
> Oh, I see. Mind me asking to add brief comments there to this effect?
I've changed it to:
/* Enable minimal CR4 features, sync cached state. */
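For readers following along, the reason a 32-bit store is not enough can be modelled in plain C. X86_CR4_FRED is bit 32 (0x100000000 in patch 13), so a store that only touches the low half of the cached value leaves a stale FRED bit behind. DEMO_MINIMAL_CR4 and the two helpers are placeholders for this sketch, not Xen definitions.

```c
#include <stdint.h>

#define X86_CR4_FRED      (UINT64_C(1) << 32)  /* 0x100000000, per patch 13 */
#define DEMO_MINIMAL_CR4  UINT64_C(0xa0)       /* placeholder value only */

/* Model of a 64-bit store: the whole cached value is replaced. */
static uint64_t demo_store64(uint64_t cached, uint64_t val)
{
    (void)cached;
    return val;
}

/* Model of a 32-bit store: only the low 32 bits are replaced. */
static uint64_t demo_store32(uint64_t cached, uint64_t val)
{
    return (cached & ~UINT64_C(0xffffffff)) | (val & UINT64_C(0xffffffff));
}
```

Starting from a cached value with X86_CR4_FRED set, demo_store64() yields a value with the bit clear, while demo_store32() keeps it set, which is the window fatal_trap() would otherwise be exposed to.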
~Andrew
^ permalink raw reply [flat|nested] 120+ messages in thread
* Re: [PATCH 13/22] x86: FRED enumerations
2025-08-14 11:47 ` Andrew Cooper
@ 2025-08-14 19:37 ` Nicola Vetrini
2025-08-14 19:44 ` Andrew Cooper
2025-08-14 20:18 ` Nicola Vetrini
0 siblings, 2 replies; 120+ messages in thread
From: Nicola Vetrini @ 2025-08-14 19:37 UTC (permalink / raw)
To: Andrew Cooper; +Cc: Jan Beulich, Roger Pau Monné, Xen-devel
On 2025-08-14 13:47, Andrew Cooper wrote:
> On 14/08/2025 12:44 pm, Jan Beulich wrote:
>> On 14.08.2025 13:42, Andrew Cooper wrote:
>>> On 14/08/2025 12:20 pm, Jan Beulich wrote:
>>>> On 08.08.2025 22:23, Andrew Cooper wrote:
>>>>> --- a/xen/arch/x86/include/asm/x86-defns.h
>>>>> +++ b/xen/arch/x86/include/asm/x86-defns.h
>>>>> @@ -75,6 +75,7 @@
>>>>> #define X86_CR4_PKE 0x00400000 /* enable PKE */
>>>>> #define X86_CR4_CET 0x00800000 /* Control-flow Enforcement Technology */
>>>>> #define X86_CR4_PKS 0x01000000 /* Protection Key Supervisor */
>>>>> +#define X86_CR4_FRED 0x100000000 /* Fast Return and Event Delivery */
>>>> ... a UL suffix added here for Misra.
>>> I was surprised, but Eclair is entirely fine with this.
>> And there is a use of the identifier in a monitored C file?
>
> Yes. traps-setup.c which definitely has not been added to an exclusion
> list.
>
Might look into it before the end of the week, if time allows. Is [1]
the right branch to look at?
[1]
https://gitlab.com/xen-project/hardware/xen-staging/-/commits/andrew/fred
> ~Andrew
--
Nicola Vetrini, B.Sc.
Software Engineer
BUGSENG (https://bugseng.com)
LinkedIn: https://www.linkedin.com/in/nicola-vetrini-a42471253
^ permalink raw reply [flat|nested] 120+ messages in thread
* Re: [PATCH 18/22] x86/traps: Set MSR_PL0_SSP in load_system_tables()
2025-08-14 15:00 ` Jan Beulich
@ 2025-08-14 19:37 ` Andrew Cooper
2025-08-15 8:52 ` Jan Beulich
0 siblings, 1 reply; 120+ messages in thread
From: Andrew Cooper @ 2025-08-14 19:37 UTC (permalink / raw)
To: Jan Beulich; +Cc: Roger Pau Monné, Xen-devel
On 14/08/2025 4:00 pm, Jan Beulich wrote:
> On 08.08.2025 22:23, Andrew Cooper wrote:
>> FRED and IDT differ by a Supervisor Token on the base of the shstk. This
>> means that the value they load into MSR_PL0_SSP differs by 8.
>>
>> s3_resume() in particular has logic which is otherwise invariant of FRED mode,
>> and must not clobber a FRED MSR_PL0_SSP with an IDT one.
>>
>> This also simplifies the AP path too. Updating reinit_bsp_stack() is deferred
>> until later.
>>
>> No functional change.
>>
>> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
> I wonder why this was originally done in assembly in the first place, when
> we aim at reducing the assembly code we have.
It took several iterations (and releases) to get the setup of the
supervisor tokens correct.
I can't even recall if we had load_system_tables() working like this
when the first code went in. It may have been following a subsequent
clean-up series.
>
>> --- a/xen/arch/x86/boot/x86_64.S
>> +++ b/xen/arch/x86/boot/x86_64.S
>> @@ -65,17 +65,11 @@ ENTRY(__high_start)
>> or $(PRIMARY_SHSTK_SLOT + 1) * PAGE_SIZE - 8, %rdx
>>
>> /*
>> - * Write a new supervisor token. Doesn't matter on boot, but for S3
>> - * resume this clears the busy bit.
>> + * Write a new Supervisor Token. It doesn't matter the first time a
>> + * CPU boots, but for S3 resume or CPU hot re-add, this clears the
>> + * busy bit.
>> */
>> wrssq %rdx, (%rdx)
>> -
>> - /* Point MSR_PL0_SSP at the token. */
>> - mov $MSR_PL0_SSP, %ecx
>> - mov %edx, %eax
>> - shr $32, %rdx
>> - wrmsr
>> -
>> setssbsy
> This is ending up a little odd: The comment says the write is to clear the
> busy bit, when that's re-set immediately afterwards.
That comment is about the wrssq. I suppose what isn't said is that
setssbsy will fault if not. How about ", so SETSSBSY can set it again" ?
~Andrew
^ permalink raw reply [flat|nested] 120+ messages in thread
* Re: [PATCH 13/22] x86: FRED enumerations
2025-08-14 19:37 ` Nicola Vetrini
@ 2025-08-14 19:44 ` Andrew Cooper
2025-08-14 21:27 ` Nicola Vetrini
2025-08-14 20:18 ` Nicola Vetrini
1 sibling, 1 reply; 120+ messages in thread
From: Andrew Cooper @ 2025-08-14 19:44 UTC (permalink / raw)
To: Nicola Vetrini; +Cc: Jan Beulich, Roger Pau Monné, Xen-devel
On 14/08/2025 8:37 pm, Nicola Vetrini wrote:
> On 2025-08-14 13:47, Andrew Cooper wrote:
>> On 14/08/2025 12:44 pm, Jan Beulich wrote:
>>> On 14.08.2025 13:42, Andrew Cooper wrote:
>>>> On 14/08/2025 12:20 pm, Jan Beulich wrote:
>>>>> On 08.08.2025 22:23, Andrew Cooper wrote:
>>>>>> --- a/xen/arch/x86/include/asm/x86-defns.h
>>>>>> +++ b/xen/arch/x86/include/asm/x86-defns.h
>>>>>> @@ -75,6 +75,7 @@
>>>>>> #define X86_CR4_PKE 0x00400000 /* enable PKE */
>>>>>> #define X86_CR4_CET 0x00800000 /* Control-flow Enforcement Technology */
>>>>>> #define X86_CR4_PKS 0x01000000 /* Protection Key Supervisor */
>>>>>> +#define X86_CR4_FRED 0x100000000 /* Fast Return and Event Delivery */
>>>>> ... a UL suffix added here for Misra.
>>>> I was surprised, but Eclair is entirely fine with this.
>>> And there is a use of the identifier in a monitored C file?
>>
>> Yes. traps-setup.c which definitely has not been added to an exclusion
>> list.
>>
>
> Might look into it before the end of the week, if time allows. Is [1]
> the right branch to look at?
>
> [1]
> https://gitlab.com/xen-project/hardware/xen-staging/-/commits/andrew/fred
Yes, although I am force pushing this with fixes as I find them.
In the latest run at the time of writing, I had one trivial R8.4
violation to fix, and all other clean rules came up fine. I expect the
next run to be clean.
One thing that might be relevant, IIRC it's implementation defined
behaviour what happens to constants which are wider than int to begin with.
~Andrew
^ permalink raw reply [flat|nested] 120+ messages in thread
* Re: [PATCH 19/22] x86/boot: Use RSTORSSP to establish SSP
2025-08-14 15:11 ` Jan Beulich
@ 2025-08-14 20:09 ` Andrew Cooper
2025-08-15 9:03 ` Jan Beulich
0 siblings, 1 reply; 120+ messages in thread
From: Andrew Cooper @ 2025-08-14 20:09 UTC (permalink / raw)
To: Jan Beulich; +Cc: Roger Pau Monné, Xen-devel
On 14/08/2025 4:11 pm, Jan Beulich wrote:
> On 08.08.2025 22:23, Andrew Cooper wrote:
>> Under FRED, SETSSBSY is unavailable, and we want to be setting up FRED prior
>> to setting up shadow stacks. As we still need Supervisor Tokens in IDT mode,
>> we need mode-specific logic to establish SSP.
>>
>> In FRED mode, write a Restore Token, RSTORSSP it, and discard the resulting
>> Previous-SSP token.
>>
>> No change outside of FRED mode.
>>
>> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
> Why is it that in patch 17 you could use identical code, but here you can't?
This caught me out at first too.
For S3, we're going from "no shadow stack" to "back to where we were on
an active shadow stack". All we need to do is get saved_ssp back into
the SSP register.
Here, we're going from "no shadow stack" to "on a good, empty, shadow
stack". For FRED we only need to load a value into SSP, but in IDT mode
we must also arrange to create a busy Supervisor Token on the base of
the stack.
We could in principle conditionally write a busy supervisor token, then
unconditionally RSTORSSP, but that's even more complicated to follow IMO.
It also flips around our position of "clear the busy bit on S3/hotplug"
to "set it even on first-bringup".
>
>> @@ -912,10 +913,30 @@ static void __init noreturn reinit_bsp_stack(void)
>>
>> if ( cpu_has_xen_shstk )
>> {
>> - wrmsrl(MSR_PL0_SSP,
>> - (unsigned long)stack + (PRIMARY_SHSTK_SLOT + 1) * PAGE_SIZE - 8);
> Does this removal perhaps belong elsewhere, especially with "No change
> outside of FRED mode" in the description?
This is the "Updating reinit_bsp_stack() is deferred until later." note
in the previous patch.
This hunk was illegible without the split, although I have to admit that
I can't quite remember why now.
~Andrew
^ permalink raw reply [flat|nested] 120+ messages in thread
* Re: [PATCH 13/22] x86: FRED enumerations
2025-08-14 19:37 ` Nicola Vetrini
2025-08-14 19:44 ` Andrew Cooper
@ 2025-08-14 20:18 ` Nicola Vetrini
1 sibling, 0 replies; 120+ messages in thread
From: Nicola Vetrini @ 2025-08-14 20:18 UTC (permalink / raw)
To: Andrew Cooper; +Cc: Jan Beulich, Roger Pau Monné, Xen-devel
On 2025-08-14 21:37, Nicola Vetrini wrote:
> On 2025-08-14 13:47, Andrew Cooper wrote:
>> On 14/08/2025 12:44 pm, Jan Beulich wrote:
>>> On 14.08.2025 13:42, Andrew Cooper wrote:
>>>> On 14/08/2025 12:20 pm, Jan Beulich wrote:
>>>>> On 08.08.2025 22:23, Andrew Cooper wrote:
>>>>>> --- a/xen/arch/x86/include/asm/x86-defns.h
>>>>>> +++ b/xen/arch/x86/include/asm/x86-defns.h
>>>>>> @@ -75,6 +75,7 @@
>>>>>> #define X86_CR4_PKE 0x00400000 /* enable PKE */
>>>>>> #define X86_CR4_CET 0x00800000 /* Control-flow Enforcement Technology */
>>>>>> #define X86_CR4_PKS 0x01000000 /* Protection Key Supervisor */
>>>>>> +#define X86_CR4_FRED 0x100000000 /* Fast Return and Event Delivery */
>>>>> ... a UL suffix added here for Misra.
>>>> I was surprised, but Eclair is entirely fine with this.
>>> And there is a use of the identifier in a monitored C file?
>>
>> Yes. traps-setup.c which definitely has not been added to an exclusion
>> list.
>>
>
> Might look into it before the end of the week, if time allows. Is [1]
> the right branch to look at?
>
> [1]
> https://gitlab.com/xen-project/hardware/xen-staging/-/commits/andrew/fred
>
Actually thinking about it, this is indeed represented in a signed type
in x86: according to the standard it's just a signed long [2], therefore
the rule does not apply. Then you would have a R10 violation due to the
conversion at the callsite. I would put a U regardless, since the
intention was clearly to have an unsigned long.
[2] https://godbolt.org/z/rT5j4j9vj
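The same observation can be reproduced locally with C11 _Generic. This is a sketch that assumes an LP64 target (32-bit int, 64-bit long), as on x86-64 Linux; on other data models the unsuffixed constant may resolve to a different type. DEMO_TYPE_NAME and the two helper functions are names made up for this example.

```c
/* Map an expression to the name of its integer type (C11 _Generic). */
#define DEMO_TYPE_NAME(x) _Generic((x),            \
    int: "int",                                    \
    unsigned int: "unsigned int",                  \
    long: "long",                                  \
    unsigned long: "unsigned long",                \
    long long: "long long",                        \
    unsigned long long: "unsigned long long")

/* The constant as written in the patch, with no suffix. */
static const char *demo_unsuffixed_type(void)
{
    return DEMO_TYPE_NAME(0x100000000);
}

/* The same constant with the UL suffix suggested in review. */
static const char *demo_ul_suffixed_type(void)
{
    return DEMO_TYPE_NAME(0x100000000UL);
}
```

Under the LP64 assumption the first helper reports a signed type and the second an unsigned one, which matches the reasoning above about why the unsuffixed form is not flagged yet still converts at the use site.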
>> ~Andrew
--
Nicola Vetrini, B.Sc.
Software Engineer
BUGSENG (https://bugseng.com)
LinkedIn: https://www.linkedin.com/in/nicola-vetrini-a42471253
^ permalink raw reply [flat|nested] 120+ messages in thread
* Re: [PATCH 21/22] x86/traps: Introduce FRED entrypoints
2025-08-14 15:57 ` Jan Beulich
@ 2025-08-14 20:40 ` Andrew Cooper
2025-08-15 9:22 ` Jan Beulich
2025-08-18 8:59 ` Jan Beulich
0 siblings, 2 replies; 120+ messages in thread
From: Andrew Cooper @ 2025-08-14 20:40 UTC (permalink / raw)
To: Jan Beulich; +Cc: Roger Pau Monné, Xen-devel
On 14/08/2025 4:57 pm, Jan Beulich wrote:
> On 08.08.2025 22:23, Andrew Cooper wrote:
>> --- a/xen/arch/x86/include/asm/asm_defns.h
>> +++ b/xen/arch/x86/include/asm/asm_defns.h
>> @@ -315,6 +315,71 @@ static always_inline void stac(void)
>> subq $-(UREGS_error_code-UREGS_r15+\adj), %rsp
>> .endm
>>
>> +/*
>> + * Push and clear GPRs
>> + */
>> +.macro PUSH_AND_CLEAR_GPRS
>> + push %rdi
>> + xor %edi, %edi
>> + push %rsi
>> + xor %esi, %esi
>> + push %rdx
>> + xor %edx, %edx
>> + push %rcx
>> + xor %ecx, %ecx
>> + push %rax
>> + xor %eax, %eax
>> + push %r8
>> + xor %r8d, %r8d
>> + push %r9
>> + xor %r9d, %r9d
>> + push %r10
>> + xor %r10d, %r10d
>> + push %r11
>> + xor %r11d, %r11d
>> + push %rbx
>> + xor %ebx, %ebx
>> + push %rbp
>> +#ifdef CONFIG_FRAME_POINTER
>> +/* Indicate special exception stack frame by inverting the frame pointer. */
>> + mov %rsp, %rbp
>> + notq %rbp
>> +#else
>> + xor %ebp, %ebp
>> +#endif
>> + push %r12
>> + xor %r12d, %r12d
>> + push %r13
>> + xor %r13d, %r13d
>> + push %r14
>> + xor %r14d, %r14d
>> + push %r15
>> + xor %r15d, %r15d
>> +.endm
>> +
>> +/*
>> + * POP GPRs from a UREGS_* frame on the stack. Does not modify flags.
>> + *
>> + * @rax: Alternative destination for the %rax value on the stack.
>> + */
>> +.macro POP_GPRS rax=%rax
>> + pop %r15
>> + pop %r14
>> + pop %r13
>> + pop %r12
>> + pop %rbp
>> + pop %rbx
>> + pop %r11
>> + pop %r10
>> + pop %r9
>> + pop %r8
>> + pop \rax
>> + pop %rcx
>> + pop %rdx
>> + pop %rsi
>> + pop %rdi
>> +.endm
> Hmm, yes, differences are apparently large enough to warrant the redundancy
> with SAVE_ALL / RESTORE_ALL.
You may recall this construct from prior attempts I've had to remove
SAVE_ALL/RESTORE_ALL, even with the \rax parameter for SVM. I still
intend to complete that work at some point.
>
>> --- a/xen/arch/x86/include/asm/msr.h
>> +++ b/xen/arch/x86/include/asm/msr.h
>> @@ -202,9 +202,9 @@ static inline unsigned long read_gs_base(void)
>>
>> static inline unsigned long read_gs_shadow(void)
>> {
>> - unsigned long base;
>> + unsigned long base, cr4 = read_cr4();
>>
>> - if ( read_cr4() & X86_CR4_FSGSBASE )
>> + if ( !(cr4 & X86_CR4_FRED) && (cr4 & X86_CR4_FSGSBASE) )
>> {
>> asm volatile ( "swapgs" );
>> base = __rdgsbase();
>> @@ -234,7 +234,9 @@ static inline void write_gs_base(unsigned long base)
>>
>> static inline void write_gs_shadow(unsigned long base)
>> {
>> - if ( read_cr4() & X86_CR4_FSGSBASE )
>> + unsigned long cr4 = read_cr4();
>> +
>> + if ( !(cr4 & X86_CR4_FRED) && (cr4 & X86_CR4_FSGSBASE) )
>> {
>> asm volatile ( "swapgs\n\t"
>> "wrgsbase %0\n\t"
> I don't quite get how these changes fit into this patch.
Without the change, read_registers() suffers #UD because of the SWAPGS.
This recurses until hitting the guard page, then repeats the same on the
#DF stack. And because stacks work nicely under FRED, you eventually
hit #DF's guard page.
Strictly speaking it's only read_gs_shadow() which we need to change to
make exception handling work, but I fixed both at the same time.
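The condition in both hunks boils down to a single predicate on CR4, sketched below as a standalone function. X86_CR4_FSGSBASE is bit 16 and X86_CR4_FRED is bit 32; demo_can_use_swapgs() is a hypothetical helper for illustration and does not exist in Xen.

```c
#include <stdbool.h>
#include <stdint.h>

#define X86_CR4_FSGSBASE  UINT64_C(0x00010000)   /* CR4 bit 16 */
#define X86_CR4_FRED      UINT64_C(0x100000000)  /* CR4 bit 32 */

/*
 * The SWAPGS + {RD,WR}GSBASE fast path is only usable when FSGSBASE is
 * enabled and FRED is not, because SWAPGS raises #UD once FRED is active.
 */
static bool demo_can_use_swapgs(uint64_t cr4)
{
    return !(cr4 & X86_CR4_FRED) && (cr4 & X86_CR4_FSGSBASE);
}
```

When the predicate is false the helpers take their existing non-SWAPGS fallback path, so the accessors remain usable from the exception path in FRED mode.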
That said, I have actually cleaned this codepath up with the MSR work
because the code gen in read_registers() is terrible. Due to
no-strict-aliasing, every store into state-> forces a recalculation of
get_cpu_info(), meaning that read_cr4() cannot be hoisted, and there's a
branch in every helper.
I'm still not sure how best to fit it into this series.
>
>> --- a/xen/arch/x86/traps.c
>> +++ b/xen/arch/x86/traps.c
>> @@ -1013,6 +1013,32 @@ void show_execution_state_nmi(const cpumask_t *mask, bool show_all)
>> printk("Non-responding CPUs: {%*pbl}\n", CPUMASK_PR(&show_state_mask));
>> }
>>
>> +static const char *x86_et_name(unsigned int type)
>> +{
>> + static const char *const names[] = {
>> + [X86_ET_EXT_INTR] = "EXT_INTR",
>> + [X86_ET_NMI] = "NMI",
>> + [X86_ET_HW_EXC] = "HW_EXC",
>> + [X86_ET_SW_INT] = "SW_INT",
>> + [X86_ET_PRIV_SW_EXC] = "PRIV_SW_EXEC",
>> + [X86_ET_SW_EXC] = "SW_EXEC",
>> + [X86_ET_OTHER] = "OTHER",
>> + };
>> +
>> + return (type < ARRAY_SIZE(names) && names[type]) ? names[type] : "???";
>> +}
>> +
>> +static const char *x86_et_other_name(unsigned int vec)
> This isn't really a vector, is it?
Well - you are decoding the field name regs->fred_ss.vector.
Also I see I've typo'd EXEC in two of those names.
>
>> +{
>> + static const char *const names[] = {
>> + [0] = "MTF",
>> + [1] = "SYSCALL",
>> + [2] = "SYSENTER",
>> + };
>> +
>> + return (vec < ARRAY_SIZE(names) && names[vec][0]) ? names[vec] : "???";
> Did you mean to check names[vec] for being NULL? Or is this a leftover
> from the array being something like names[][10]?
Oh, bad copy&paste. Will fix.
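For reference, a minimal standalone sketch of the intended lookup pattern: with an array of pointers built from designated initialisers, any unlisted slot is NULL, so the pointer itself has to be tested before use. The [4] entry is a hypothetical addition made purely to create a gap at index 3, and demo_other_name() is not the function from the patch.

```c
#include <stddef.h>

#define ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

static const char *demo_other_name(unsigned int idx)
{
    static const char *const names[] = {
        [0] = "MTF",
        [1] = "SYSCALL",
        [2] = "SYSENTER",
        [4] = "DEMO",      /* Hypothetical entry; leaves names[3] == NULL. */
    };

    /*
     * Test the pointer, not names[idx][0]: dereferencing the NULL slot at
     * index 3 would be undefined behaviour.
     */
    return (idx < ARRAY_SIZE(names) && names[idx]) ? names[idx] : "???";
}
```

The names[idx][0] form is only safe for a two-dimensional char array such as names[][10], where unlisted rows are empty strings rather than NULL pointers.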
>
>> --- a/xen/arch/x86/x86_64/Makefile
>> +++ b/xen/arch/x86/x86_64/Makefile
>> @@ -1,6 +1,7 @@
>> obj-$(CONFIG_PV32) += compat/
>>
>> obj-bin-y += entry.o
>> +obj-bin-y += entry-fred.o
> For the ordering here, ...
>
>> --- /dev/null
>> +++ b/xen/arch/x86/x86_64/entry-fred.S
>> @@ -0,0 +1,35 @@
>> +/* SPDX-License-Identifier: GPL-2.0-or-later */
>> +
>> + .file "x86_64/entry-fred.S"
>> +
>> +#include <asm/asm_defns.h>
>> +#include <asm/page.h>
>> +
>> + .section .text.entry, "ax", @progbits
>> +
>> + /* The Ring3 entry point is required to be 4k aligned. */
>> +
>> +FUNC(entry_FRED_R3, 4096)
> ... doesn't this 4k-alignment requirement suggest we want to put
> entry-fred.o first?
Perhaps, but that is quite subtle. I did also consider a
.text.entry.page_aligned section, but .text.entry only matters for XPTI
which (as agreed), I'm not intending to implement in FRED mode unless it
proves to be necessary.
Also IIRC there's still a symbol bug where _sentrytext takes priority
over entry_FRED_R3, so the backtrace is effectively wrong.
(These are all bad excuses, but some parts of this series are rather old.)
> Also, might it be more natural to use PAGE_SIZE
> here?
I did debate that, but the spec uses 0xfff, not pages, even if the
pipeline surely does have an optimisation for chopping 12 metadata bits
off the bottom of a pointer.
~Andrew
^ permalink raw reply [flat|nested] 120+ messages in thread
* Re: [PATCH 20/22] x86/traps: Alter switch_stack_and_jump() for FRED mode
2025-08-14 15:35 ` Jan Beulich
@ 2025-08-14 20:55 ` Andrew Cooper
2025-08-15 9:10 ` Jan Beulich
0 siblings, 1 reply; 120+ messages in thread
From: Andrew Cooper @ 2025-08-14 20:55 UTC (permalink / raw)
To: Jan Beulich; +Cc: Roger Pau Monné, Xen-devel
On 14/08/2025 4:35 pm, Jan Beulich wrote:
> On 08.08.2025 22:23, Andrew Cooper wrote:
>> FRED and IDT differ by a Supervisor Token on the base of the shstk. This
>> means that switch_stack_and_jump() needs to discard one extra word when FRED
>> is active.
>>
>> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
>> ---
>> CC: Jan Beulich <JBeulich@suse.com>
>> CC: Roger Pau Monné <roger.pau@citrix.com>
>>
>> RFC. I don't like this, but it does work.
>>
>> This emits opt_fred logic outside of CONFIG_XEN_SHSTK.
> opt_fred and XEN_SHSTK are orthogonal, so that's fine anyway. What I guess
> you may mean is that you now have a shstk-related calculation outside of
> a respective #ifdef.
I really mean "outside of the path where shadow stacks are known to be
active", i.e. inside the middle of SHADOW_STACK_WORK.
> Given the simplicity of the calculation, ...
>
>> But frankly, the
>> construct is already too unwieldy, and all options I can think of make it
>> more so.
> ... I agree having it like this is okay.
Yes, but it is a read of a global even when it's not used.
And as a tangent, we probably want __ro_after_init_read_mostly too. The
read mostly is about cache locality, and is applicable even to the
__ro_after_init section.
>
>> @@ -154,7 +155,6 @@ unsigned long get_stack_dump_bottom (unsigned long sp);
>> "rdsspd %[ssp];" \
>> "cmp $1, %[ssp];" \
>> "je .L_shstk_done.%=;" /* CET not active? Skip. */ \
>> - "mov $%c[skstk_base], %[val];" \
>> "and $%c[stack_mask], %[ssp];" \
>> "sub %[ssp], %[val];" \
>> "shr $3, %[val];" \
> With the latter two insns here, ...
>
>> @@ -177,6 +177,8 @@ unsigned long get_stack_dump_bottom (unsigned long sp);
>>
>> #define switch_stack_and_jump(fn, instr, constr) \
>> ({ \
>> + unsigned int token_offset = \
>> + (PRIMARY_SHSTK_SLOT + 1) * PAGE_SIZE - (opt_fred ? 0 : 8); \
>> unsigned int tmp; \
>> BUILD_BUG_ON(!ssaj_has_attr_noreturn(fn)); \
>> __asm__ __volatile__ ( \
>> @@ -184,12 +186,11 @@ unsigned long get_stack_dump_bottom (unsigned long sp);
>> "mov %[stk], %%rsp;" \
>> CHECK_FOR_LIVEPATCH_WORK \
>> instr "[fun]" \
>> - : [val] "=&r" (tmp), \
>> + : [val] "=r" (tmp), \
> ... I don't think you can legitimately drop the & from here? With it
> retained:
> Reviewed-by: Jan Beulich <jbeulich@suse.com>
You chopped the bit which has an explicit input for "[val]", making the
earlyclobber incorrect.
IIRC, one version of Clang complained.
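
As an illustration of the arithmetic under discussion, here is a plain C sketch of the token offset and the word count that the "sub; shr $3" pair computes. The values of PAGE_SIZE and PRIMARY_SHSTK_SLOT are assumptions for the sake of a standalone example, and the helper names are hypothetical; the real code is the inline asm in switch_stack_and_jump().

```c
#include <stdbool.h>

/* Assumed values for illustration only; Xen's headers are authoritative. */
#define PAGE_SIZE          4096u
#define PRIMARY_SHSTK_SLOT 5u

/*
 * Offset (within the stack block) that SSP should end up at.  In IDT mode
 * a Supervisor Token occupies the last word of the shadow stack page, so
 * the target is 8 bytes lower than in FRED mode.
 */
static unsigned int shstk_token_offset(bool fred)
{
    return (PRIMARY_SHSTK_SLOT + 1) * PAGE_SIZE - (fred ? 0 : 8);
}

/*
 * Number of 8-byte words to discard, given the current SSP offset within
 * the stack block (the result of "and $stack_mask, %[ssp]").
 */
static unsigned int words_to_discard(unsigned int ssp_offset, bool fred)
{
    return (shstk_token_offset(fred) - ssp_offset) >> 3;
}
```

For any given SSP, the FRED result is exactly one word larger than the IDT result, which matches the "discard one extra word" wording in the commit message.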
~Andrew
^ permalink raw reply [flat|nested] 120+ messages in thread
* Re: [PATCH 13/22] x86: FRED enumerations
2025-08-14 19:44 ` Andrew Cooper
@ 2025-08-14 21:27 ` Nicola Vetrini
0 siblings, 0 replies; 120+ messages in thread
From: Nicola Vetrini @ 2025-08-14 21:27 UTC (permalink / raw)
To: Andrew Cooper; +Cc: Jan Beulich, Roger Pau Monné, Xen-devel
On 2025-08-14 21:44, Andrew Cooper wrote:
> On 14/08/2025 8:37 pm, Nicola Vetrini wrote:
>> On 2025-08-14 13:47, Andrew Cooper wrote:
>>> On 14/08/2025 12:44 pm, Jan Beulich wrote:
>>>> On 14.08.2025 13:42, Andrew Cooper wrote:
>>>>> On 14/08/2025 12:20 pm, Jan Beulich wrote:
>>>>>> On 08.08.2025 22:23, Andrew Cooper wrote:
>>>>>>> --- a/xen/arch/x86/include/asm/x86-defns.h
>>>>>>> +++ b/xen/arch/x86/include/asm/x86-defns.h
>>>>>>> @@ -75,6 +75,7 @@
>>>>>>> #define X86_CR4_PKE 0x00400000 /* enable PKE */
>>>>>>> #define X86_CR4_CET 0x00800000 /* Control-flow
>>>>>>> Enforcement Technology */
>>>>>>> #define X86_CR4_PKS 0x01000000 /* Protection Key
>>>>>>> Supervisor */
>>>>>>> +#define X86_CR4_FRED 0x100000000 /* Fast Return and Event
>>>>>>> Delivery */
>>>>>> ... a UL suffix added here for Misra.
>>>>> I was surprised, but Eclair is entirely fine with this.
>>>> And there is a use of the identifier in a monitored C file?
>>>
>>> Yes. traps-setup.c which definitely has not been added to an
>>> exclusion
>>> list.
>>>
>>
>> Might look into it before the end of the week, if time allows. Is [1]
>> the right branch to look at?
>>
>> [1]
>> https://gitlab.com/xen-project/hardware/xen-staging/-/commits/andrew/fred
>
> Yes, although I am force pushing this with fixes as I find them.
>
> In the latest run at the time of writing, I had one trivial R8.4
> violation to fix, and all other clean rules came up fine. I expect the
> next run to be clean.
>
> One thing that might be relevant, IIRC it's implementation defined
> behaviour what happens to constants which are wider than int to begin
> with.
>
See my other reply in this thread; the rule is concerned with the type
of the constant present after preprocessing, which is defined in C99
6.4.4.1.5. As such, I'm not sure what you're hinting at regarding IDB
here. Of course I might be wrong.
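
A small C11 sketch of the typing rule being referred to, assuming a target where int is 32 bits. TYPE_NAME(), pks_type() and fred_type() are hypothetical helpers for this example. An unsuffixed hexadecimal constant takes the first type in the standard's list (int, unsigned int, long, unsigned long, ...) that can represent its value, so 0x100000000 never ends up as an int, and on an LP64 target it would be a signed long.

```c
/* Same values as in x86-defns.h, repeated so the example stands alone. */
#define X86_CR4_PKS  0x01000000
#define X86_CR4_FRED 0x100000000

/* Hypothetical helper: report the type the compiler gave an expression. */
#define TYPE_NAME(x) _Generic((x),                  \
    int: "int",                                     \
    unsigned int: "unsigned int",                   \
    long: "long",                                   \
    unsigned long: "unsigned long",                 \
    long long: "long long",                         \
    unsigned long long: "unsigned long long")

static const char *pks_type(void)
{
    return TYPE_NAME(X86_CR4_PKS);
}

static const char *fred_type(void)
{
    return TYPE_NAME(X86_CR4_FRED);
}
```

If the rule in question only targets constants that end up with an unsigned type, a constant which lands in a signed 64-bit type would not trip it, which may be why the analysis comes back clean without a suffix. That reading of the rule is an assumption here, not something confirmed in this thread.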
> ~Andrew
--
Nicola Vetrini, B.Sc.
Software Engineer
BUGSENG (https://bugseng.com)
LinkedIn: https://www.linkedin.com/in/nicola-vetrini-a42471253
^ permalink raw reply [flat|nested] 120+ messages in thread
* Re: [PATCH 09/22] x86/traps: Move load_system_tables() into traps-setup.c
2025-08-14 18:09 ` Andrew Cooper
@ 2025-08-15 8:22 ` Jan Beulich
2025-08-15 8:28 ` Andrew Cooper
0 siblings, 1 reply; 120+ messages in thread
From: Jan Beulich @ 2025-08-15 8:22 UTC (permalink / raw)
To: Andrew Cooper; +Cc: Roger Pau Monné, Xen-devel
On 14.08.2025 20:09, Andrew Cooper wrote:
> On 14/08/2025 9:55 am, Jan Beulich wrote:
>> On 13.08.2025 13:25, Andrew Cooper wrote:
>>> On 12/08/2025 10:19 am, Jan Beulich wrote:
>>>> On 08.08.2025 22:23, Andrew Cooper wrote:
>>>>> Since commit a35816b5cae8 ("x86/traps: Introduce early_traps_init() and
>>>>> simplify setup"), load_system_tables() is called later on the BSP, so the
>>>>> SYS_STATE_early_boot check can be dropped from the safety BUG_ON().
>>>>>
>>>>> Move the BUILD_BUG_ON() into build_assertions(),
>>>> I'm not quite convinced of this move - having the related BUILD_BUG_ON()
>>>> and BUG_ON() next to each other would seem better to me.
>>> I don't see a specific reason for them to be together, and the comment
>>> explains what's going on.
>>>
>>> With FRED, we want a related BUILD_BUG_ON(), but there's no equivalent
>>> BUG_ON() because MSR_RSP_SL0 will #GP on being misaligned.
>> That BUILD_BUG_ON() could then sit next to the MSR write? Unless of course
>> that ends up sitting in an assembly source.
>
> It's the bottom hunk in patch 14, which you've looked at now.
>
> Personally, I think both BUILD_BUG_ON()'s should be together, because
> they are related.
I don't really agree, but I also won't insist on my preference to be followed.
IOW please keep as is.
Jan
^ permalink raw reply [flat|nested] 120+ messages in thread
* Re: [PATCH 09/22] x86/traps: Move load_system_tables() into traps-setup.c
2025-08-15 8:22 ` Jan Beulich
@ 2025-08-15 8:28 ` Andrew Cooper
2025-08-15 8:32 ` Jan Beulich
0 siblings, 1 reply; 120+ messages in thread
From: Andrew Cooper @ 2025-08-15 8:28 UTC (permalink / raw)
To: Jan Beulich; +Cc: Roger Pau Monné, Xen-devel
On 15/08/2025 9:22 am, Jan Beulich wrote:
> On 14.08.2025 20:09, Andrew Cooper wrote:
>> On 14/08/2025 9:55 am, Jan Beulich wrote:
>>> On 13.08.2025 13:25, Andrew Cooper wrote:
>>>> On 12/08/2025 10:19 am, Jan Beulich wrote:
>>>>> On 08.08.2025 22:23, Andrew Cooper wrote:
>>>>>> Since commit a35816b5cae8 ("x86/traps: Introduce early_traps_init() and
>>>>>> simplify setup"), load_system_tables() is called later on the BSP, so the
>>>>>> SYS_STATE_early_boot check can be dropped from the safety BUG_ON().
>>>>>>
>>>>>> Move the BUILD_BUG_ON() into build_assertions(),
>>>>> I'm not quite convinced of this move - having the related BUILD_BUG_ON()
>>>>> and BUG_ON() next to each other would seem better to me.
>>>> I don't see a specific reason for them to be together, and the comment
>>>> explains what's going on.
>>>>
>>>> With FRED, we want a related BUILD_BUG_ON(), but there's no equivalent
>>>> BUG_ON() because MSR_RSP_SL0 will #GP on being misaligned.
>>> That BUILD_BUG_ON() could then sit next to the MSR write? Unless of course
>>> that ends up sitting in an assembly source.
>> It's the bottom hunk in patch 14, which you've looked at now.
>>
>> Personally, I think both BUILD_BUG_ON()'s should be together, because
>> they are related.
> I don't really agree, but I also won't insist on my preference to be followed.
> IOW please keep as is.
Thank you. Can I consider this to be A-by then? (This, and the rename
to percpu_early_traps_init() are the only two remaining items in the
entire first half of the series.)
~Andrew
^ permalink raw reply [flat|nested] 120+ messages in thread
* Re: [PATCH 09/22] x86/traps: Move load_system_tables() into traps-setup.c
2025-08-14 18:20 ` Andrew Cooper
@ 2025-08-15 8:30 ` Jan Beulich
2025-08-15 8:40 ` Nicola Vetrini
0 siblings, 1 reply; 120+ messages in thread
From: Jan Beulich @ 2025-08-15 8:30 UTC (permalink / raw)
To: Andrew Cooper; +Cc: Xen-devel, Roger Pau Monné, Nicola Vetrini
On 14.08.2025 20:20, Andrew Cooper wrote:
> On 14/08/2025 8:26 am, Jan Beulich wrote:
>> On 13.08.2025 13:36, Andrew Cooper wrote:
>>> On 12/08/2025 10:43 am, Nicola Vetrini wrote:
>>>> On 2025-08-08 22:23, Andrew Cooper wrote:
>>>>> diff --git a/xen/arch/x86/traps-setup.c b/xen/arch/x86/traps-setup.c
>>>>> index 8ca379c9e4cb..13b8fcf0ba51 100644
>>>>> --- a/xen/arch/x86/traps-setup.c
>>>>> +++ b/xen/arch/x86/traps-setup.c
>>>>> @@ -19,6 +20,124 @@ boolean_param("ler", opt_ler);
>>>>>
>>>>> void nocall entry_PF(void);
>>>>>
>>>>> +/*
>>>>> + * Sets up system tables and descriptors for IDT delivery.
>>>>> + *
>>>>> + * - Sets up TSS with stack pointers, including ISTs
>>>>> + * - Inserts TSS selector into regular and compat GDTs
>>>>> + * - Loads GDT, IDT, TR then null LDT
>>>>> + * - Sets up IST references in the IDT
>>>>> + */
>>>>> +static void load_system_tables(void)
>>>>> +{
>>>>> + unsigned int i, cpu = smp_processor_id();
>>>>> + unsigned long stack_bottom = get_stack_bottom(),
>>>>> + stack_top = stack_bottom & ~(STACK_SIZE - 1);
>>>>> + /*
>>>>> + * NB: define tss_page as a local variable because clang 3.5
>>>>> doesn't
>>>>> + * support using ARRAY_SIZE against per-cpu variables.
>>>>> + */
>>>>> + struct tss_page *tss_page = &this_cpu(tss_page);
>>>>> + idt_entry_t *idt = this_cpu(idt);
>>>>> +
>>>> Given the clang baseline this might not be needed anymore?
>>> Hmm. While true, looking at 51461114e26, the code is definitely better
>>> written with the tss_page variable and we wouldn't want to go back to
>>> the old form.
>>>
>>> I think that I'll simply drop the comment.
>>>
>>> ~Andrew
>>>
>>> P.S.
>>>
>>> Generally speaking, because of the RELOC_HIDE() in this_cpu(), any time
>>> you ever want two accesses to a variable, it's better (code gen wise) to
>>> construct a pointer to it and use the pointer multiple times.
>>>
>>> I don't understand why there's a RELOC_HIDE() in this_cpu(). The
>>> justification doesn't make sense, but I've not had time to explore what
>>> happens if we take it out.
>> There's no justification in xen/percpu.h?
>
> Well, it's given in compiler.h by RELOC_HIDE().
>
> /* This macro obfuscates arithmetic on a variable address so that gcc
> shouldn't recognize the original var, and make assumptions about it */
>
>
> But this is far from convincing.
>
>>
>> My understanding is that we simply may not expose any accesses to per_cpu_*
>> variables directly to the compiler, or there's a risk that it might access
>> the "master" variable (i.e. CPU0's on at least x86).
>
> RELOC_HIDE() doesn't do anything about the correctness of the pointer
> arithmetic expression to make the access work.
>
> I don't see how a correct expression can ever access CPU0's data by
> accident.
Hmm, upon another look I agree. I wonder whether we inherited this from
Linux, where in turn it may have been merely a workaround to deal with
preemptible code not correctly accessing per-CPU data (i.e. not
accounting for get_per_cpu_offset() not being stable across preemption).
Yet then per_cpu() would have been of similar concern when "cpu" isn't
properly re-fetched after any possible preemption point ...
Jan
^ permalink raw reply [flat|nested] 120+ messages in thread
* Re: [PATCH 09/22] x86/traps: Move load_system_tables() into traps-setup.c
2025-08-15 8:28 ` Andrew Cooper
@ 2025-08-15 8:32 ` Jan Beulich
0 siblings, 0 replies; 120+ messages in thread
From: Jan Beulich @ 2025-08-15 8:32 UTC (permalink / raw)
To: Andrew Cooper; +Cc: Roger Pau Monné, Xen-devel
On 15.08.2025 10:28, Andrew Cooper wrote:
> On 15/08/2025 9:22 am, Jan Beulich wrote:
>> On 14.08.2025 20:09, Andrew Cooper wrote:
>>> On 14/08/2025 9:55 am, Jan Beulich wrote:
>>>> On 13.08.2025 13:25, Andrew Cooper wrote:
>>>>> On 12/08/2025 10:19 am, Jan Beulich wrote:
>>>>>> On 08.08.2025 22:23, Andrew Cooper wrote:
>>>>>>> Since commit a35816b5cae8 ("x86/traps: Introduce early_traps_init() and
>>>>>>> simplify setup"), load_system_tables() is called later on the BSP, so the
>>>>>>> SYS_STATE_early_boot check can be dropped from the safety BUG_ON().
>>>>>>>
>>>>>>> Move the BUILD_BUG_ON() into build_assertions(),
>>>>>> I'm not quite convinced of this move - having the related BUILD_BUG_ON()
>>>>>> and BUG_ON() next to each other would seem better to me.
>>>>> I don't see a specific reason for them to be together, and the comment
>>>>> explains what's going on.
>>>>>
>>>>> With FRED, we want a related BUILD_BUG_ON(), but there's no equivalent
>>>>> BUG_ON() because MSR_RSP_SL0 will #GP on being misaligned.
>>>> That BUILD_BUG_ON() could then sit next to the MSR write? Unless of course
>>>> that ends up sitting in an assembly source.
>>> It's the bottom hunk in patch 14, which you've looked at now.
>>>
>>> Personally, I think both BUILD_BUG_ON()'s should be together, because
>>> they are related.
>> I don't really agree, but I also won't insist on my preference to be followed.
>> IOW please keep as is.
>
> Thank you. Can I consider this to be A-by then? (This, and the rename
> to percpu_early_traps_init() are the only two remaining items in the
> entire first half of the series.)
Yes:
Acked-by: Jan Beulich <jbeulich@suse.com>
Jan
^ permalink raw reply [flat|nested] 120+ messages in thread
* Re: [PATCH 13/22] x86: FRED enumerations
2025-08-14 18:45 ` Andrew Cooper
@ 2025-08-15 8:34 ` Jan Beulich
0 siblings, 0 replies; 120+ messages in thread
From: Jan Beulich @ 2025-08-15 8:34 UTC (permalink / raw)
To: Andrew Cooper; +Cc: Roger Pau Monné, Xen-devel
On 14.08.2025 20:45, Andrew Cooper wrote:
> On 14/08/2025 2:19 pm, Jan Beulich wrote:
>> On 14.08.2025 13:20, Jan Beulich wrote:
>>> On 08.08.2025 22:23, Andrew Cooper wrote:
>>>> --- a/xen/include/public/arch-x86/cpufeatureset.h
>>>> +++ b/xen/include/public/arch-x86/cpufeatureset.h
>>>> @@ -310,7 +310,10 @@ XEN_CPUFEATURE(ARCH_PERF_MON, 10*32+8) /* Architectural Perfmon */
>>>> XEN_CPUFEATURE(FZRM, 10*32+10) /*A Fast Zero-length REP MOVSB */
>>>> XEN_CPUFEATURE(FSRS, 10*32+11) /*A Fast Short REP STOSB */
>>>> XEN_CPUFEATURE(FSRCS, 10*32+12) /*A Fast Short REP CMPSB/SCASB */
>>>> +XEN_CPUFEATURE(FRED, 10*32+17) /* Fast Return and Event Delivery */
>>>> +XEN_CPUFEATURE(LKGS, 10*32+18) /* Load Kernel GS instruction */
>>>> XEN_CPUFEATURE(WRMSRNS, 10*32+19) /*S WRMSR Non-Serialising */
>>>> +XEN_CPUFEATURE(NMI_SRC, 10*32+20) /* NMI-Source Reporting */
>>>> XEN_CPUFEATURE(AMX_FP16, 10*32+21) /* AMX FP16 instruction */
>>>> XEN_CPUFEATURE(AVX_IFMA, 10*32+23) /*A AVX-IFMA Instructions */
>>>> XEN_CPUFEATURE(LAM, 10*32+26) /* Linear Address Masking */
>>> I'd like to note that we could long have had this if my long-pending emulator
>>> patch had gone in at some point.
>> Actually what I further have there, and what in the context of patch 15 I
>> notice you should have here is
>>
>> --- a/xen/tools/gen-cpuid.py
>> +++ b/xen/tools/gen-cpuid.py
>> @@ -278,7 +278,8 @@ def crunch_numbers(state):
>> # superpages, PCID and PKU are only available in 4 level paging.
>> # NO_LMSL indicates the absense of Long Mode Segment Limits, which
>> # have been dropped in hardware.
>> - LM: [CX16, PCID, LAHF_LM, PAGE1GB, PKU, NO_LMSL, AMX_TILE, CMPCCXADD],
>> + LM: [CX16, PCID, LAHF_LM, PAGE1GB, PKU, NO_LMSL, AMX_TILE, CMPCCXADD,
>> + LKGS],
>>
>> # AMD K6-2+ and K6-III processors shipped with 3DNow+, beyond the
>> # standard 3DNow in the earlier K6 processors.
>> @@ -347,6 +348,9 @@ def crunch_numbers(state):
>> # computational instructions. All further AMX features are built on top
>> # of AMX-TILE.
>> AMX_TILE: [AMX_BF16, AMX_INT8, AMX_FP16, AMX_COMPLEX],
>> +
>> + # FRED builds on the LKGS instruction.
>> + LKGS: [FRED],
>> }
>>
>> deep_features = tuple(sorted(deps.keys()))
>
> Hmm. Yes, but normally this is part of guest enablement.
Hence why I would have wanted the emulator patch to long have gone in.
> Having now done the Xen work and concluded that we don't actually need
> LKGS, I'm rethinking the linkage here. It's probably the right thing to
> do in practice, but probably needs a bit more in the way of
> justification. "built on" doesn't quite cut it IMO.
The comment can be adjusted (or dropped). Imo "builds on" is quite adequate,
as without SWAPGS you won't really have a way to get things working in an
OS without having LKGS available. In Xen we're in a slightly different
position ...
Jan
^ permalink raw reply [flat|nested] 120+ messages in thread
* Re: [PATCH 15/22] x86/traps: Introduce opt_fred
2025-08-14 19:16 ` Andrew Cooper
@ 2025-08-15 8:37 ` Jan Beulich
2025-08-21 21:52 ` Andrew Cooper
0 siblings, 1 reply; 120+ messages in thread
From: Jan Beulich @ 2025-08-15 8:37 UTC (permalink / raw)
To: Andrew Cooper; +Cc: Xen-devel, Roger Pau Monné
On 14.08.2025 21:16, Andrew Cooper wrote:
> On 14/08/2025 2:30 pm, Jan Beulich wrote:
>> On 08.08.2025 22:23, Andrew Cooper wrote:
>>> ... disabled by default. There is a lot of work before FRED can be enabled by
>>> default.
>>>
>>> One part of FRED, the LKGS (Load Kernel GS) instruction, is enumerated
>>> separately but is mandatory as FRED disallows the SWAPGS instruction.
>>> Therefore, both CPUID bits must be checked.
>> See my (further) reply to patch 13 - I think FRED simply ought to depend on
>> LKGS.
>>
>>> @@ -20,6 +22,9 @@ unsigned int __ro_after_init ler_msr;
>>> static bool __initdata opt_ler;
>>> boolean_param("ler", opt_ler);
>>>
>>> +int8_t __ro_after_init opt_fred = 0; /* -1 when supported. */
>> I'm a little puzzled by the comment? DYM "once default-enabled"?
>
> Well, I have this temporary patch
> https://gitlab.com/xen-project/hardware/xen-staging/-/commit/70ef6a1178a411a29b7b1745a1112e267ffb6245
> that will turn into a real patch when we enable FRED by default.
>
> As much as anything else, it was just a TODO.
>
>
>> Then ...
>>
>>> @@ -305,6 +310,32 @@ void __init traps_init(void)
>>> /* Replace early pagefault with real pagefault handler. */
>>> _update_gate_addr_lower(&bsp_idt[X86_EXC_PF], entry_PF);
>>>
>>> + if ( !cpu_has_fred || !cpu_has_lkgs )
>>> + {
>>> + if ( opt_fred )
>> ... this won't work anymore once the initializer is changed.
>
> Hmm yes. That wants to be an == 1 check. Fixed.
>
>>
>>> + printk(XENLOG_WARNING "FRED not available, ignoring\n");
>>> + opt_fred = false;
>> Better use 0 here?
>>
>>> + }
>>> +
>>> + if ( opt_fred == -1 )
>>> + opt_fred = !pv_shim;
>> Imo it would be better to have the initializer be -1 right away, and comment
>> out the "!pv_shim" here, until we mean it to be default-enabled.
>
> It cannot be -1, or Xen will fail spectacularly on any FRED capable
> hardware. Setting to -1 is the point at which FRED becomes security
> supported.
I guess I'm not following: If it was -1, and if the code here was
    if ( opt_fred < 0 )
        opt_fred = 0 /* !pv_shim */;
why would things "fail spectacularly" unless someone passed "fred" on
the command line?
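
To make the tri-state semantics concrete, here is a standalone sketch of how the resolution could look with a -1 initialiser, combining the "== 1" warning check agreed above with the "< 0" default. resolve_opt_fred() and its parameters are hypothetical; the real logic sits inline in traps_init() and uses cpu_has_fred, cpu_has_lkgs and printk().

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical model of the option handling:
 *   -1 = no explicit choice, 0 = disabled, 1 = explicitly requested.
 * hw_capable stands in for (cpu_has_fred && cpu_has_lkgs).
 * *warn is set when the "FRED not available, ignoring" message would fire.
 */
static int8_t resolve_opt_fred(int8_t opt, bool hw_capable, bool *warn)
{
    *warn = false;

    if ( !hw_capable )
    {
        /* Only complain if the user asked for FRED explicitly. */
        *warn = (opt == 1);
        return 0;
    }

    if ( opt < 0 )
        opt = 0; /* Would become !pv_shim once default-enabled. */

    return opt;
}
```

Under this model, a -1 initialiser resolves to 0 on capable hardware unless "fred" is given on the command line, which is the behaviour the question above is probing.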
Jan
^ permalink raw reply [flat|nested] 120+ messages in thread
* Re: [PATCH 09/22] x86/traps: Move load_system_tables() into traps-setup.c
2025-08-15 8:30 ` Jan Beulich
@ 2025-08-15 8:40 ` Nicola Vetrini
2025-08-15 8:49 ` Jan Beulich
0 siblings, 1 reply; 120+ messages in thread
From: Nicola Vetrini @ 2025-08-15 8:40 UTC (permalink / raw)
To: Jan Beulich; +Cc: Andrew Cooper, Xen-devel, Roger Pau Monné
On 2025-08-15 10:30, Jan Beulich wrote:
> On 14.08.2025 20:20, Andrew Cooper wrote:
>> On 14/08/2025 8:26 am, Jan Beulich wrote:
>>> On 13.08.2025 13:36, Andrew Cooper wrote:
>>>> On 12/08/2025 10:43 am, Nicola Vetrini wrote:
>>>>> On 2025-08-08 22:23, Andrew Cooper wrote:
>>>>>> diff --git a/xen/arch/x86/traps-setup.c
>>>>>> b/xen/arch/x86/traps-setup.c
>>>>>> index 8ca379c9e4cb..13b8fcf0ba51 100644
>>>>>> --- a/xen/arch/x86/traps-setup.c
>>>>>> +++ b/xen/arch/x86/traps-setup.c
>>>>>> @@ -19,6 +20,124 @@ boolean_param("ler", opt_ler);
>>>>>>
>>>>>> void nocall entry_PF(void);
>>>>>>
>>>>>> +/*
>>>>>> + * Sets up system tables and descriptors for IDT delivery.
>>>>>> + *
>>>>>> + * - Sets up TSS with stack pointers, including ISTs
>>>>>> + * - Inserts TSS selector into regular and compat GDTs
>>>>>> + * - Loads GDT, IDT, TR then null LDT
>>>>>> + * - Sets up IST references in the IDT
>>>>>> + */
>>>>>> +static void load_system_tables(void)
>>>>>> +{
>>>>>> + unsigned int i, cpu = smp_processor_id();
>>>>>> + unsigned long stack_bottom = get_stack_bottom(),
>>>>>> + stack_top = stack_bottom & ~(STACK_SIZE - 1);
>>>>>> + /*
>>>>>> + * NB: define tss_page as a local variable because clang 3.5
>>>>>> doesn't
>>>>>> + * support using ARRAY_SIZE against per-cpu variables.
>>>>>> + */
>>>>>> + struct tss_page *tss_page = &this_cpu(tss_page);
>>>>>> + idt_entry_t *idt = this_cpu(idt);
>>>>>> +
>>>>> Given the clang baseline this might not be needed anymore?
>>>> Hmm. While true, looking at 51461114e26, the code is definitely
>>>> better
>>>> written with the tss_page variable and we wouldn't want to go back
>>>> to
>>>> the old form.
>>>>
>>>> I think that I'll simply drop the comment.
>>>>
>>>> ~Andrew
>>>>
>>>> P.S.
>>>>
>>>> Generally speaking, because of the RELOC_HIDE() in this_cpu(), any
>>>> time
>>>> you ever want two accesses to a variable, it's better (code gen
>>>> wise) to
>>>> construct a pointer to it and use the pointer multiple times.
>>>>
>>>> I don't understand why there's a RELOC_HIDE() in this_cpu(). The
>>>> justification doesn't make sense, but I've not had time to explore
>>>> what
>>>> happens if we take it out.
>>> There's no justification in xen/percpu.h?
>>
>> Well, it's given in compiler.h by RELOC_HIDE().
>>
>> /* This macro obfuscates arithmetic on a variable address so that gcc
>> shouldn't recognize the original var, and make assumptions about it
>> */
>>
>>
>> But this is far from convincing.
>>
>>>
>>> My understanding is that we simply may not expose any accesses to
>>> per_cpu_*
>>> variables directly to the compiler, or there's a risk that it might
>>> access
>>> the "master" variable (i.e. CPU0's on at least x86).
>>
>> RELOC_HIDE() doesn't do anything about the correctness of the pointer
>> arithmetic expression to make the access work.
>>
>> I don't see how a correct expression can ever access CPU0's data by
>> accident.
>
> Hmm, upon another look I agree. I wonder whether we inherited this from
> Linux, where in turn it may have been merely a workaround to deal with
> preemptible code not correctly accessing per-CPU data (i.e. not
> accounting for get_per_cpu_offset() not being stable across
> preemption).
> Yet then per_cpu() would have been of similar concern when "cpu" isn't
> properly re-fetched after any possible preemption point ...
>
> Jan
Probably inherited with a stripped-down comment on top of RELOC_HIDE,
see [1]. In a way it does make sense that the compiler may decide to
optimize based on this assumption, though I don't know whether wrapping
is meant to happen with per-CPU variables.
[1]
https://elixir.bootlin.com/linux/v6.16/source/include/linux/compiler-gcc.h#L31
--
Nicola Vetrini, B.Sc.
Software Engineer
BUGSENG (https://bugseng.com)
LinkedIn: https://www.linkedin.com/in/nicola-vetrini-a42471253
^ permalink raw reply [flat|nested] 120+ messages in thread
* Re: [PATCH 09/22] x86/traps: Move load_system_tables() into traps-setup.c
2025-08-15 8:40 ` Nicola Vetrini
@ 2025-08-15 8:49 ` Jan Beulich
0 siblings, 0 replies; 120+ messages in thread
From: Jan Beulich @ 2025-08-15 8:49 UTC (permalink / raw)
To: Nicola Vetrini; +Cc: Andrew Cooper, Xen-devel, Roger Pau Monné
On 15.08.2025 10:40, Nicola Vetrini wrote:
> On 2025-08-15 10:30, Jan Beulich wrote:
>> On 14.08.2025 20:20, Andrew Cooper wrote:
>>> On 14/08/2025 8:26 am, Jan Beulich wrote:
>>>> On 13.08.2025 13:36, Andrew Cooper wrote:
>>>>> On 12/08/2025 10:43 am, Nicola Vetrini wrote:
>>>>>> On 2025-08-08 22:23, Andrew Cooper wrote:
>>>>>>> diff --git a/xen/arch/x86/traps-setup.c
>>>>>>> b/xen/arch/x86/traps-setup.c
>>>>>>> index 8ca379c9e4cb..13b8fcf0ba51 100644
>>>>>>> --- a/xen/arch/x86/traps-setup.c
>>>>>>> +++ b/xen/arch/x86/traps-setup.c
>>>>>>> @@ -19,6 +20,124 @@ boolean_param("ler", opt_ler);
>>>>>>>
>>>>>>> void nocall entry_PF(void);
>>>>>>>
>>>>>>> +/*
>>>>>>> + * Sets up system tables and descriptors for IDT delivery.
>>>>>>> + *
>>>>>>> + * - Sets up TSS with stack pointers, including ISTs
>>>>>>> + * - Inserts TSS selector into regular and compat GDTs
>>>>>>> + * - Loads GDT, IDT, TR then null LDT
>>>>>>> + * - Sets up IST references in the IDT
>>>>>>> + */
>>>>>>> +static void load_system_tables(void)
>>>>>>> +{
>>>>>>> + unsigned int i, cpu = smp_processor_id();
>>>>>>> + unsigned long stack_bottom = get_stack_bottom(),
>>>>>>> + stack_top = stack_bottom & ~(STACK_SIZE - 1);
>>>>>>> + /*
>>>>>>> + * NB: define tss_page as a local variable because clang 3.5
>>>>>>> doesn't
>>>>>>> + * support using ARRAY_SIZE against per-cpu variables.
>>>>>>> + */
>>>>>>> + struct tss_page *tss_page = &this_cpu(tss_page);
>>>>>>> + idt_entry_t *idt = this_cpu(idt);
>>>>>>> +
>>>>>> Given the clang baseline this might not be needed anymore?
>>>>> Hmm. While true, looking at 51461114e26, the code is definitely
>>>>> better
>>>>> written with the tss_page variable and we wouldn't want to go back
>>>>> to
>>>>> the old form.
>>>>>
>>>>> I think that I'll simply drop the comment.
>>>>>
>>>>> ~Andrew
>>>>>
>>>>> P.S.
>>>>>
>>>>> Generally speaking, because of the RELOC_HIDE() in this_cpu(), any
>>>>> time
>>>>> you ever want two accesses to a variable, it's better (code gen
>>>>> wise) to
>>>>> construct a pointer to it and use the pointer multiple times.
>>>>>
>>>>> I don't understand why there's a RELOC_HIDE() in this_cpu(). The
>>>>> justification doesn't make sense, but I've not had time to explore
>>>>> what
>>>>> happens if we take it out.
>>>> There's no justification in xen/percpu.h?
>>>
>>> Well, it's given in compiler.h by RELOC_HIDE().
>>>
>>> /* This macro obfuscates arithmetic on a variable address so that gcc
>>> shouldn't recognize the original var, and make assumptions about it
>>> */
>>>
>>>
>>> But this is far from convincing.
>>>
>>>>
>>>> My understanding is that we simply may not expose any accesses to
>>>> per_cpu_*
>>>> variables directly to the compiler, or there's a risk that it might
>>>> access
>>>> the "master" variable (i.e. CPU0's on at least x86).
>>>
>>> RELOC_HIDE() doesn't do anything about the correctness of the pointer
>>> arithmetic expression to make the access work.
>>>
>>> I don't see how a correct expression can ever access CPU0's data by
>>> accident.
>>
>> Hmm, upon another look I agree. I wonder whether we inherited this from
>> Linux, where in turn it may have been merely a workaround to deal with
>> preemptible code not correctly accessing per-CPU data (i.e. not
>> accounting for get_per_cpu_offset() not being stable across
>> preemption).
>> Yet then per_cpu() would have been of similar concern when "cpu" isn't
>> properly re-fetched after any possible preemption point ...
>
> Probably inherited with a stripped-down comment on top of RELOC_HIDE,
> see [1]. In a way it does make sense that the compiler may decide to
> optimize based on this assumption, though I don't know whether wrapping
> is meant to happen with per-CPU variables.
I wouldn't call it "meant to", but wrapping certainly is possible. This
is arch-independent code, and hence whether any wrapping would occur
depends on the VA layout of the individual architectures.
Jan
^ permalink raw reply [flat|nested] 120+ messages in thread
* Re: [PATCH 18/22] x86/traps: Set MSR_PL0_SSP in load_system_tables()
2025-08-14 19:37 ` Andrew Cooper
@ 2025-08-15 8:52 ` Jan Beulich
2025-08-15 13:49 ` Andrew Cooper
0 siblings, 1 reply; 120+ messages in thread
From: Jan Beulich @ 2025-08-15 8:52 UTC (permalink / raw)
To: Andrew Cooper; +Cc: Roger Pau Monné, Xen-devel
On 14.08.2025 21:37, Andrew Cooper wrote:
> On 14/08/2025 4:00 pm, Jan Beulich wrote:
>> On 08.08.2025 22:23, Andrew Cooper wrote:
>>> --- a/xen/arch/x86/boot/x86_64.S
>>> +++ b/xen/arch/x86/boot/x86_64.S
>>> @@ -65,17 +65,11 @@ ENTRY(__high_start)
>>> or $(PRIMARY_SHSTK_SLOT + 1) * PAGE_SIZE - 8, %rdx
>>>
>>> /*
>>> - * Write a new supervisor token. Doesn't matter on boot, but for S3
>>> - * resume this clears the busy bit.
>>> + * Write a new Supervisor Token. It doesn't matter the first time a
>>> + * CPU boots, but for S3 resume or CPU hot re-add, this clears the
>>> + * busy bit.
>>> */
>>> wrssq %rdx, (%rdx)
>>> -
>>> - /* Point MSR_PL0_SSP at the token. */
>>> - mov $MSR_PL0_SSP, %ecx
>>> - mov %edx, %eax
>>> - shr $32, %rdx
>>> - wrmsr
>>> -
>>> setssbsy
>> This is ending up a little odd: The comment says the write is to clear the
>> busy bit, when that's re-set immediately afterwards.
>
> That comment is about the wrssq. I suppose what isn't said is that
> setssbsy will fault if not. How about ", so SETSSBSY can set it again" ?
That would improve things, but then it's still unclear to me why we need to
go through this. If the busy bit is already set, why clear it just to set
it again? Or perhaps asked differently: Wouldn't we be better off if we
cleared the busy bit when a CPU went offline?
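
The interaction being discussed can be sketched with a deliberately simplified software model. The function names are hypothetical, and real hardware performs further checks that are omitted here. The model only captures the point made above: SETSSBSY faults unless the token holds its own address with the busy bit clear, so a stale busy token left over from before S3 or CPU offline has to be rewritten first.

```c
#include <stdbool.h>
#include <stdint.h>

#define SHSTK_TOKEN_BUSY 1ULL /* Bit 0 of a Supervisor Token. */

/* Model of "wrssq %rdx, (%rdx)": store the token's own address, not busy. */
static void model_write_fresh_token(uint64_t *token, uint64_t token_addr)
{
    *token = token_addr;
}

/*
 * Simplified model of SETSSBSY.  Returns false where real hardware would
 * fault: the token is already busy, or does not hold its own address.
 */
static bool model_setssbsy(uint64_t *token, uint64_t token_addr)
{
    if ( *token != token_addr )
        return false;

    *token |= SHSTK_TOKEN_BUSY;
    return true;
}
```

Clearing the busy bit on the offline path instead, as suggested, would leave the token in the same state that the unconditional rewrite produces on the way back up.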
Jan
^ permalink raw reply [flat|nested] 120+ messages in thread
* Re: [PATCH 19/22] x86/boot: Use RSTORSSP to establish SSP
2025-08-14 20:09 ` Andrew Cooper
@ 2025-08-15 9:03 ` Jan Beulich
2025-08-21 22:09 ` Andrew Cooper
0 siblings, 1 reply; 120+ messages in thread
From: Jan Beulich @ 2025-08-15 9:03 UTC (permalink / raw)
To: Andrew Cooper; +Cc: Roger Pau Monné, Xen-devel
On 14.08.2025 22:09, Andrew Cooper wrote:
> On 14/08/2025 4:11 pm, Jan Beulich wrote:
>> On 08.08.2025 22:23, Andrew Cooper wrote:
>>> Under FRED, SETSSBSY is unavailable, and we want to be setting up FRED prior
>>> to setting up shadow stacks. As we still need Supervisor Tokens in IDT mode,
>>> we need mode-specific logic to establish SSP.
>>>
>>> In FRED mode, write a Restore Token, RSTORSSP it, and discard the resulting
>>> Previous-SSP token.
>>>
>>> No change outside of FRED mode.
>>>
>>> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
>> Why is it that in patch 17 you could use identical code, but here you can't?
>
> This caught me out at first too.
>
> For S3, we're going from "no shadow stack" to "back to where we were on
> an active shadow stack". All we need to do is get saved_ssp back into
> the SSP register.
>
> Here, we're going from "no shadow stack" to "on a good, empty, shadow
> stack". For FRED we only need to load a value into SSP, but in IDT mode
> we must also arrange to create a busy Supervisor Token on the base of
> the stack.
>
> We could in principle conditionally write a busy supervisor token, then
> unconditionally RSTORSSP, but that's even more complicated to follow IMO.
Why would the write need to be conditional? Can't we write what effectively
is already there? Or is it more a safety measure to avoid the write when
it's supposed to be unnecessary, to avoid papering over bugs?
>>> @@ -912,10 +913,30 @@ static void __init noreturn reinit_bsp_stack(void)
>>>
>>> if ( cpu_has_xen_shstk )
>>> {
>>> - wrmsrl(MSR_PL0_SSP,
>>> - (unsigned long)stack + (PRIMARY_SHSTK_SLOT + 1) * PAGE_SIZE - 8);
>> Does this removal perhaps belong elsewhere, especially with "No change
>> outside of FRED mode" in the description?
>
> This is the "Updating reinit_bsp_stack() is deferred until later." note
> in the previous patch.
>
> This hunk was illegible without the split, although I have to admit that
> I can't quite remember why now.
Hmm, if it is to stay like this, would you mind adding a respective remark
also in the description here?
Jan
* Re: [PATCH 20/22] x86/traps: Alter switch_stack_and_jump() for FRED mode
2025-08-14 20:55 ` Andrew Cooper
@ 2025-08-15 9:10 ` Jan Beulich
2025-08-21 22:56 ` Andrew Cooper
0 siblings, 1 reply; 120+ messages in thread
From: Jan Beulich @ 2025-08-15 9:10 UTC (permalink / raw)
To: Andrew Cooper; +Cc: Roger Pau Monné, Xen-devel
On 14.08.2025 22:55, Andrew Cooper wrote:
> On 14/08/2025 4:35 pm, Jan Beulich wrote:
>> On 08.08.2025 22:23, Andrew Cooper wrote:
>>> FRED and IDT differ by a Supervisor Token on the base of the shstk. This
>>> means that switch_stack_and_jump() needs to discard one extra word when FRED
>>> is active.
>>>
>>> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
>>> ---
>>> CC: Jan Beulich <JBeulich@suse.com>
>>> CC: Roger Pau Monné <roger.pau@citrix.com>
>>>
>>> RFC. I don't like this, but it does work.
>>>
>>> This emits opt_fred logic outside of CONFIG_XEN_SHSTK.
>> opt_fred and XEN_SHSTK are orthogonal, so that's fine anyway. What I guess
>> you may mean is that you now have a shstk-related calculation outside of
>> a respective #ifdef.
>
> I really mean "outside of the path where shadow stacks are known to be
> active", i.e. inside the middle of SHADOW_STACK_WORK
>
>> Given the simplicity of the calculation, ...
>>
>>> But frankly, the
>>> construct is already too unwieldy, and all options I can think of make it
>>> more so.
>> ... I agree having it like this is okay.
>
> Yes, but it is a read of a global even when it's not used.
>
> And as a tangent, we probably want __ro_after_init_read_mostly too. The
> read mostly is about cache locality, and is applicable even to the
> __ro_after_init section.
Not really: __read_mostly is to keep stuff rarely written apart from stuff
more frequently written (cache locality, yes). There's not going to be any
frequently written data next to a __ro_after_init item; it's all r/o post-
boot. And I don't think we care much during boot.
>>> @@ -154,7 +155,6 @@ unsigned long get_stack_dump_bottom (unsigned long sp);
>>> "rdsspd %[ssp];" \
>>> "cmp $1, %[ssp];" \
>>> "je .L_shstk_done.%=;" /* CET not active? Skip. */ \
>>> - "mov $%c[skstk_base], %[val];" \
>>> "and $%c[stack_mask], %[ssp];" \
>>> "sub %[ssp], %[val];" \
>>> "shr $3, %[val];" \
>> With the latter two insns here, ...
>>
>>> @@ -177,6 +177,8 @@ unsigned long get_stack_dump_bottom (unsigned long sp);
>>>
>>> #define switch_stack_and_jump(fn, instr, constr) \
>>> ({ \
>>> + unsigned int token_offset = \
>>> + (PRIMARY_SHSTK_SLOT + 1) * PAGE_SIZE - (opt_fred ? 0 : 8); \
>>> unsigned int tmp; \
>>> BUILD_BUG_ON(!ssaj_has_attr_noreturn(fn)); \
>>> __asm__ __volatile__ ( \
>>> @@ -184,12 +186,11 @@ unsigned long get_stack_dump_bottom (unsigned long sp);
>>> "mov %[stk], %%rsp;" \
>>> CHECK_FOR_LIVEPATCH_WORK \
>>> instr "[fun]" \
>>> - : [val] "=&r" (tmp), \
>>> + : [val] "=r" (tmp), \
>> ... I don't think you can legitimately drop the & from here? With it
>> retained:
>> Reviewed-by: Jan Beulich <jbeulich@suse.com>
>
> You chopped the bit which has an explicit input for "[val]", making the
> earlyclobber incorrect.
I was wondering whether there was a connection there, but ...
> IIRC, one version of Clang complained.
... that's not good. Without the early-clobber the asm() isn't quite
correct imo. If the same value appeared as another input, the compiler
may validly tie both together, assuming the register stays intact until
the very last insn (and hence even that last insn could still use the
register as an input). IOW if there's a Clang issue here, I think it
may need working around explicitly.
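
To make the concern concrete, here is a minimal sketch. It is not the Xen
macro: add_twice() and its operands are made up for illustration, and the
non-x86 branch exists only so the snippet builds elsewhere. The output is
written before the last read of an input, which is the case "&" exists for:

```c
/*
 * Hypothetical helper, returning a + 2 * b.
 *
 * [out] is written by the first instruction, while [b] is still read by the
 * two that follow.  Without the early-clobber ("&") the compiler is free to
 * place [out] and [b] in the same register, and the MOV would destroy b.
 */
static unsigned long add_twice(unsigned long a, unsigned long b)
{
#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
    unsigned long out;

    __asm__ ( "mov %[a], %[out]\n\t"
              "add %[b], %[out]\n\t"
              "add %[b], %[out]"
              : [out] "=&r" (out)
              : [a] "r" (a), [b] "r" (b)
              : "cc" );

    return out;
#else
    /* Portable fallback, only so the sketch builds on other targets. */
    return a + 2 * b;
#endif
}
```

Dropping the "&" here may still happen to work with a given compiler and
optimisation level, which is what makes this class of bug easy to miss.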
Jan
* Re: [PATCH 21/22] x86/traps: Introduce FRED entrypoints
2025-08-14 20:40 ` Andrew Cooper
@ 2025-08-15 9:22 ` Jan Beulich
2025-08-18 8:59 ` Jan Beulich
1 sibling, 0 replies; 120+ messages in thread
From: Jan Beulich @ 2025-08-15 9:22 UTC (permalink / raw)
To: Andrew Cooper; +Cc: Roger Pau Monné, Xen-devel
On 14.08.2025 22:40, Andrew Cooper wrote:
> On 14/08/2025 4:57 pm, Jan Beulich wrote:
>> On 08.08.2025 22:23, Andrew Cooper wrote:
>>> --- a/xen/arch/x86/include/asm/msr.h
>>> +++ b/xen/arch/x86/include/asm/msr.h
>>> @@ -202,9 +202,9 @@ static inline unsigned long read_gs_base(void)
>>>
>>> static inline unsigned long read_gs_shadow(void)
>>> {
>>> - unsigned long base;
>>> + unsigned long base, cr4 = read_cr4();
>>>
>>> - if ( read_cr4() & X86_CR4_FSGSBASE )
>>> + if ( !(cr4 & X86_CR4_FRED) && (cr4 & X86_CR4_FSGSBASE) )
>>> {
>>> asm volatile ( "swapgs" );
>>> base = __rdgsbase();
>>> @@ -234,7 +234,9 @@ static inline void write_gs_base(unsigned long base)
>>>
>>> static inline void write_gs_shadow(unsigned long base)
>>> {
>>> - if ( read_cr4() & X86_CR4_FSGSBASE )
>>> + unsigned long cr4 = read_cr4();
>>> +
>>> + if ( !(cr4 & X86_CR4_FRED) && (cr4 & X86_CR4_FSGSBASE) )
>>> {
>>> asm volatile ( "swapgs\n\t"
>>> "wrgsbase %0\n\t"
>> I don't quite get how these changes fit into this patch.
>
> Without the change, read_registers() suffers #UD because of the SWAPGS.
>
> This recurses until hitting the guard page, then repeats the same on the
> #DF stack. And because stacks work nicely under FRED, you eventually
> hit #DF's guard page.
>
> Strictly speaking it's only read_gs_shadow() which we need to change to
> make exception handling work, but I fixed both at the same time.
>
> That said, I have actually cleaned this codepath up with the MSR work
> because the code gen in read_registers() is terrible. Due to
> no-strict-aliasing, every store into state-> forces a recalculation of
> get_cpu_info(), meaning that read_cr4() cannot be hoisted, and there's a
> branch in every helper.
>
> I'm still not sure how best to fit it into this series.
Could these two hunks move to another prereq patch, then coming with its
own description?
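
For reference, the predicate both hunks share can be sketched in isolation.
This is an illustration only: swapgs_path_usable() is a made-up name, and the
bit positions are FSGSBASE at bit 16 and FRED at bit 32 (the latter per the
description of patch 13). Bit 32 is the reason the constants must be 64 bits
wide:

```c
#include <stdbool.h>
#include <stdint.h>

/* Local definitions for the sketch; Xen has its own in its headers. */
#define SKETCH_CR4_FSGSBASE (1ULL << 16)
#define SKETCH_CR4_FRED     (1ULL << 32)  /* needs a 64-bit constant */

/*
 * Hypothetical helper: SWAPGS may only be used when FSGSBASE is enabled
 * and FRED is not, because SWAPGS is disallowed in FRED mode.
 */
static bool swapgs_path_usable(uint64_t cr4)
{
    return !(cr4 & SKETCH_CR4_FRED) && (cr4 & SKETCH_CR4_FSGSBASE);
}
```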
>>> --- a/xen/arch/x86/traps.c
>>> +++ b/xen/arch/x86/traps.c
>>> @@ -1013,6 +1013,32 @@ void show_execution_state_nmi(const cpumask_t *mask, bool show_all)
>>> printk("Non-responding CPUs: {%*pbl}\n", CPUMASK_PR(&show_state_mask));
>>> }
>>>
>>> +static const char *x86_et_name(unsigned int type)
>>> +{
>>> + static const char *const names[] = {
>>> + [X86_ET_EXT_INTR] = "EXT_INTR",
>>> + [X86_ET_NMI] = "NMI",
>>> + [X86_ET_HW_EXC] = "HW_EXC",
>>> + [X86_ET_SW_INT] = "SW_INT",
>>> + [X86_ET_PRIV_SW_EXC] = "PRIV_SW_EXEC",
>>> + [X86_ET_SW_EXC] = "SW_EXEC",
>>> + [X86_ET_OTHER] = "OTHER",
>>> + };
>>> +
>>> + return (type < ARRAY_SIZE(names) && names[type]) ? names[type] : "???";
>>> +}
>>> +
>>> +static const char *x86_et_other_name(unsigned int vec)
>> This isn't really a vector, is it?
>
> Well - you are decoding the field name regs->fred_ss.vector.
Hmm, yes, the field is re-used, but I have trouble viewing these as vectors.
Anyway - I would prefer a rename, but I won't insist.
>>> --- a/xen/arch/x86/x86_64/Makefile
>>> +++ b/xen/arch/x86/x86_64/Makefile
>>> @@ -1,6 +1,7 @@
>>> obj-$(CONFIG_PV32) += compat/
>>>
>>> obj-bin-y += entry.o
>>> +obj-bin-y += entry-fred.o
>> For the ordering here, ...
>>
>>> --- /dev/null
>>> +++ b/xen/arch/x86/x86_64/entry-fred.S
>>> @@ -0,0 +1,35 @@
>>> +/* SPDX-License-Identifier: GPL-2.0-or-later */
>>> +
>>> + .file "x86_64/entry-fred.S"
>>> +
>>> +#include <asm/asm_defns.h>
>>> +#include <asm/page.h>
>>> +
>>> + .section .text.entry, "ax", @progbits
>>> +
>>> + /* The Ring3 entry point is required to be 4k aligned. */
>>> +
>>> +FUNC(entry_FRED_R3, 4096)
>> ... doesn't this 4k-alignment requirement suggest we want to put
>> entry-fred.o first?
>
> Perhaps, but that is quite subtle. I did also consider a
> .text.entry.page_aligned section, but .text.entry only matters for XPTI
> which (as agreed), I'm not intending to implement in FRED mode unless it
> proves to be necessary.
>
> Also IIRC there's still a symbol bug where _stextentry takes priority
> over entry_FRED_R3, so the backtrace is effectively wrong.
>
> (These are all bad excuses, but some parts of this series are rather old.)
Are you sure this is still the case with entry_FRED_R<n> properly typed as
functions (while _stextentry has no type)? When choosing which symbol to
display, objdump prefers typed over type-less symbols:
/* Sort function and object symbols before global symbols before
local symbols before section symbols before debugging symbols. */
>> Also, might it be more natural to use PAGE_SIZE
>> here?
>
> I did debate that, but the spec uses 0xfff, not pages, even if the
> pipeline surely does have an optimisation for chopping 12 metadata bits
> off the bottom of a pointer.
Plus pretty certainly a page boundary is meant here, no matter how things
are worded.
Jan
* Re: [PATCH v1.1 08/22] x86/traps: Introduce percpu_early_traps_init() and set up exception handling earlier
2025-08-14 18:07 ` [PATCH v1.1 08/22] x86/traps: Introduce percpu_early_traps_init() " Andrew Cooper
@ 2025-08-15 9:24 ` Jan Beulich
0 siblings, 0 replies; 120+ messages in thread
From: Jan Beulich @ 2025-08-15 9:24 UTC (permalink / raw)
To: Andrew Cooper; +Cc: Roger Pau Monné, Xen-devel
On 14.08.2025 20:07, Andrew Cooper wrote:
> As things stand, we set up AP/S3 exception handling marginally after the
> fragile activity of setting up shadow stacks. Shadow stack setup is going to
> get more complicated under FRED.
>
> Introduce percpu_early_traps_init() and call it ahead of setting up shadow
> stacks. To start with, call load_system_tables() which is sufficient to set
> up full exception handling.
>
> In order to handle exceptions, current and the speculation controls need to
> work. cpu_smpboot_alloc() already constructs some of the AP's top-of-stack
> block, so have it set up a little more. Zero the whole structure to subsume
> other misc setup.
>
> This gets us complete exception coverage of setting up shadow stacks, rather
> than dying with a triple fault.
>
> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
* Re: [PATCH 18/22] x86/traps: Set MSR_PL0_SSP in load_system_tables()
2025-08-15 8:52 ` Jan Beulich
@ 2025-08-15 13:49 ` Andrew Cooper
0 siblings, 0 replies; 120+ messages in thread
From: Andrew Cooper @ 2025-08-15 13:49 UTC (permalink / raw)
To: Jan Beulich; +Cc: Roger Pau Monné, Xen-devel
On 15/08/2025 9:52 am, Jan Beulich wrote:
> On 14.08.2025 21:37, Andrew Cooper wrote:
>> On 14/08/2025 4:00 pm, Jan Beulich wrote:
>>> On 08.08.2025 22:23, Andrew Cooper wrote:
>>>> --- a/xen/arch/x86/boot/x86_64.S
>>>> +++ b/xen/arch/x86/boot/x86_64.S
>>>> @@ -65,17 +65,11 @@ ENTRY(__high_start)
>>>> or $(PRIMARY_SHSTK_SLOT + 1) * PAGE_SIZE - 8, %rdx
>>>>
>>>> /*
>>>> - * Write a new supervisor token. Doesn't matter on boot, but for S3
>>>> - * resume this clears the busy bit.
>>>> + * Write a new Supervisor Token. It doesn't matter the first time a
>>>> + * CPU boots, but for S3 resume or CPU hot re-add, this clears the
>>>> + * busy bit.
>>>> */
>>>> wrssq %rdx, (%rdx)
>>>> -
>>>> - /* Point MSR_PL0_SSP at the token. */
>>>> - mov $MSR_PL0_SSP, %ecx
>>>> - mov %edx, %eax
>>>> - shr $32, %rdx
>>>> - wrmsr
>>>> -
>>>> setssbsy
>>> This is ending up a little odd: The comment says the write is to clear the
>>> busy bit, when that's re-set immediately afterwards.
>> That comment is about the wrssq. I suppose what isn't said is that
>> setssbsy will fault if not. How about ", so SETSSBSY can set it again" ?
> That would improve things, but then it's still unclear to me why we need to
> go through this. If the busy bit is already set, why clear it just to set
> it again?
Because the behaviour of SETSSBSY is along the lines of:
    if ( test_and_set(busy) )
        SSP = token;
    else
        #GP
It really is quite an ugly instruction, and I'm not sorry to see it go.
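
As a toy model in plain C, assuming a Supervisor Token is simply its own
address with bit 0 as the busy bit (which is what the wrssq above writes).
All names are made up, the fault is reduced to a false return, and the real
instruction performs the check-and-set atomically on the shadow stack page:

```c
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_TOKEN_BUSY 1ULL  /* bit 0 of the token */

/*
 * Model of SETSSBSY.  Succeeds only if the token is valid and not busy, in
 * which case it marks the token busy and loads SSP.  Returns false where
 * hardware would fault instead.
 */
static bool model_setssbsy(uint64_t *token, uint64_t token_addr, uint64_t *ssp)
{
    if ( *token != token_addr )  /* already busy, or not a token at all */
        return false;

    *token = token_addr | SKETCH_TOKEN_BUSY;
    *ssp = token_addr;

    return true;
}

/* Model of "wrssq %rdx, (%rdx)": write a fresh, non-busy token. */
static void model_write_fresh_token(uint64_t *token, uint64_t token_addr)
{
    *token = token_addr;
}
```

A token left busy by a previous boot therefore has to be rewritten before
model_setssbsy() can succeed again, which is the S3 and hot re-add case.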
> Or perhaps asked differently: Wouldn't we be better off if we
> cleared the busy bit when a CPU went offline?
That's actually quite hard. The shadow stack is in use until the CPU
takes an INIT, so ought to have its busy bit set.
In practice, clearing the primary shstk busy bit is probably safe to do
when we're certain we'll never return to PV context again.
But this is still a larger change (needs wrss() in every play dead) and
more fragile than just clearing it before we're about to use the stack
fresh.
~Andrew
* Re: [PATCH 21/22] x86/traps: Introduce FRED entrypoints
2025-08-14 20:40 ` Andrew Cooper
2025-08-15 9:22 ` Jan Beulich
@ 2025-08-18 8:59 ` Jan Beulich
1 sibling, 0 replies; 120+ messages in thread
From: Jan Beulich @ 2025-08-18 8:59 UTC (permalink / raw)
To: Andrew Cooper; +Cc: Roger Pau Monné, Xen-devel
On 14.08.2025 22:40, Andrew Cooper wrote:
> On 14/08/2025 4:57 pm, Jan Beulich wrote:
>> On 08.08.2025 22:23, Andrew Cooper wrote:
>>> --- /dev/null
>>> +++ b/xen/arch/x86/x86_64/entry-fred.S
>>> @@ -0,0 +1,35 @@
>>> +/* SPDX-License-Identifier: GPL-2.0-or-later */
>>> +
>>> + .file "x86_64/entry-fred.S"
>>> +
>>> +#include <asm/asm_defns.h>
>>> +#include <asm/page.h>
>>> +
>>> + .section .text.entry, "ax", @progbits
>>> +
>>> + /* The Ring3 entry point is required to be 4k aligned. */
>>> +
>>> +FUNC(entry_FRED_R3, 4096)
>> ... doesn't this 4k-alignment requirement suggest we want to put
>> entry-fred.o first?
>
> Perhaps, but that is quite subtle. I did also consider a
> .text.entry.page_aligned section, but .text.entry only matters for XPTI
> which (as agreed), I'm not intending to implement in FRED mode unless it
> proves to be necessary.
>
> Also IIRC there's still a symbol bug where _stextentry takes priority
> over entry_FRED_R3, so the backtrace is effectively wrong.
>
> (These are all bad excuses, but some parts of this series are rather old.)
>
>> Also, might it be more natural to use PAGE_SIZE
>> here?
>
> I did debate that, but the spec uses 0xfff, not pages, even if the
> pipeline surely does have an optimisation for chopping 12 metadata bits
> off the bottom of a pointer.
I found this, though:
"Bits 63:12 contain the upper bits of the linear address of a page in memory
containing event handlers. FRED event delivery will load RIP to refer to an
entry point on this page. See Section 5.1.1."
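
Read that way, the arithmetic can be sketched as below. The helper names are
hypothetical, and the +256 offset for the Ring 0 entry point is my reading of
the spec rather than something stated in this thread:

```c
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_FRED_LOW_MASK  0xfffULL  /* bits 11:0 are not address bits */
#define SKETCH_FRED_R0_OFFSET 256       /* assumption, see above */

/* An entry point is only representable if its low 12 bits are clear. */
static bool fred_entry_is_page_aligned(uint64_t entry)
{
    return !(entry & SKETCH_FRED_LOW_MASK);
}

/* Ring 3 events enter at the start of the page named by bits 63:12. */
static uint64_t fred_ring3_entry(uint64_t config)
{
    return config & ~SKETCH_FRED_LOW_MASK;
}

/* Ring 0 events enter at a fixed offset into the same page. */
static uint64_t fred_ring0_entry(uint64_t config)
{
    return fred_ring3_entry(config) + SKETCH_FRED_R0_OFFSET;
}
```

Whichever constant is spelled in FUNC(), the result is the same 4096-byte
alignment; the question is only which spelling conveys the intent better.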
Jan
* Re: [PATCH 13/22] x86: FRED enumerations
2025-08-08 20:23 ` [PATCH 13/22] x86: FRED enumerations Andrew Cooper
2025-08-13 12:28 ` Andrew Cooper
2025-08-14 11:20 ` Jan Beulich
@ 2025-08-18 9:02 ` Jan Beulich
2 siblings, 0 replies; 120+ messages in thread
From: Jan Beulich @ 2025-08-18 9:02 UTC (permalink / raw)
To: Andrew Cooper; +Cc: Roger Pau Monné, Xen-devel
On 08.08.2025 22:23, Andrew Cooper wrote:
> Of note, CR4.FRED is bit 32 and cannot be enabled outside of 64bit mode.
>
> Most supported toolchains don't understand the FRED instructions yet. ERETU
> and ERETS are easy to wrap (they are encoded as REPZ/REPNE CLAC), while LKGS is
> more complicated and deferred for now.
>
> I have intentionally named the FRED MSRs differently to the spec. In the
> spec, the stack pointer names alias the TSS fields of the same name, despite
> very different semantics.
Hmm, looking at this again I'm not entirely convinced: Staying in sync with
the spec also has its merits, and the FRED infix is sufficiently distinguishing
imo.
Jan
* Re: [PATCH 22/22] x86/traps: Enable FRED when requested
2025-08-08 20:23 ` [PATCH 22/22] x86/traps: Enable FRED when requested Andrew Cooper
@ 2025-08-18 9:35 ` Jan Beulich
2025-08-18 9:47 ` Andrew Cooper
0 siblings, 1 reply; 120+ messages in thread
From: Jan Beulich @ 2025-08-18 9:35 UTC (permalink / raw)
To: Andrew Cooper; +Cc: Roger Pau Monné, Xen-devel
On 08.08.2025 22:23, Andrew Cooper wrote:
> With the shadow stack and exception handling adjustments in place, we can now
> activate FRED when appropriate. Note that opt_fred is still disabled by
> default.
>
> Introduce init_fred() to set up all the MSRs relevant for FRED. FRED uses
> MSR_STAR (entries from Ring3 only), and MSR_FRED_SSP_SL0 aliases MSR_PL0_SSP
> when CET-SS is active. Otherwise, they're all new MSRs.
>
> With init_fred() existing, load_system_tables() and legacy_syscall_init()
> should only be used when setting up IDT delivery. Insert ASSERT()s to this
> effect, and adjust the various *_init() functions to make this property true.
>
> Per the documentation, ap_early_traps_init() is responsible for switching off
> the boot GDT, which needs doing even in FRED mode.
>
> Finally, set CR4.FRED in {bsp,ap}_early_traps_init().
Probably you've done that already, but these last two paragraphs will need
updating following patch 08 v1.1.
> Xen can now boot in FRED mode up until starting a PV guest, where it faults
> because IRET is not permitted to change privilege.
>
> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
> @@ -274,6 +279,44 @@ static void __init init_ler(void)
> setup_force_cpu_cap(X86_FEATURE_XEN_LBR);
> }
>
> +/*
> + * Set up all MSRs relevant for FRED event delivery.
> + *
> + * Xen does not use any of the optional config in MSR_FRED_CONFIG, so all that
> + * is needed is the entrypoint.
> + *
> + * Because FRED always provides a good stack, NMI and #DB do not need any
> + * special treatment. Only #DF needs another stack level, and #MC for the
> + * offchance that Xen's main stack suffers an uncorrectable error.
> + *
> + * FRED reuses MSR_STAR to provide the segment selector values to load on
> + * entry from Ring3. Entry from Ring0 leave %cs and %ss unmodified.
> + */
> +static void init_fred(void)
> +{
> + unsigned long stack_top = get_stack_bottom() & ~(STACK_SIZE - 1);
> +
> + ASSERT(opt_fred == 1);
> +
> + wrmsrns(MSR_STAR, XEN_MSR_STAR);
> + wrmsrns(MSR_FRED_CONFIG, (unsigned long)entry_FRED_R3);
> +
> + wrmsrns(MSR_FRED_RSP_SL0, (unsigned long)(&get_cpu_info()->_fred + 1));
> + wrmsrns(MSR_FRED_RSP_SL1, 0);
In the event of a bug somewhere causing this slot to be accessed, is the
wrapping behavior well-defined, resulting in an attempt to write to the
top end of VA space? (Then again, if the wrapping itself caused a fault,
the overall effect would be largely the same - in many cases #DF.)
Jan
* Re: [PATCH 22/22] x86/traps: Enable FRED when requested
2025-08-18 9:35 ` Jan Beulich
@ 2025-08-18 9:47 ` Andrew Cooper
2025-08-18 9:53 ` Jan Beulich
0 siblings, 1 reply; 120+ messages in thread
From: Andrew Cooper @ 2025-08-18 9:47 UTC (permalink / raw)
To: Jan Beulich; +Cc: Roger Pau Monné, Xen-devel
On 18/08/2025 10:35 am, Jan Beulich wrote:
> On 08.08.2025 22:23, Andrew Cooper wrote:
>> With the shadow stack and exception handling adjustments in place, we can now
>> activate FRED when appropriate. Note that opt_fred is still disabled by
>> default.
>>
>> Introduce init_fred() to set up all the MSRs relevant for FRED. FRED uses
>> MSR_STAR (entries from Ring3 only), and MSR_FRED_SSP_SL0 aliases MSR_PL0_SSP
>> when CET-SS is active. Otherwise, they're all new MSRs.
>>
>> With init_fred() existing, load_system_tables() and legacy_syscall_init()
>> should only be used when setting up IDT delivery. Insert ASSERT()s to this
>> effect, and adjust the various *_init() functions to make this property true.
>>
>> Per the documentation, ap_early_traps_init() is responsible for switching off
>> the boot GDT, which needs doing even in FRED mode.
>>
>> Finally, set CR4.FRED in {bsp,ap}_early_traps_init().
> Probably you've done that already, but these last two paragraphs will need
> updating following patch 08 v1.1.
It's on my list, but not done yet.
>
>> Xen can now boot in FRED mode up until starting a PV guest, where it faults
>> because IRET is not permitted to change privilege.
>>
>> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Thanks, but I fear this patch has changed too much. I'll take a
decision when I've cleaned up the integration of the PV work.
>
>> @@ -274,6 +279,44 @@ static void __init init_ler(void)
>> setup_force_cpu_cap(X86_FEATURE_XEN_LBR);
>> }
>>
>> +/*
>> + * Set up all MSRs relevant for FRED event delivery.
>> + *
>> + * Xen does not use any of the optional config in MSR_FRED_CONFIG, so all that
>> + * is needed is the entrypoint.
>> + *
>> + * Because FRED always provides a good stack, NMI and #DB do not need any
>> + * special treatment. Only #DF needs another stack level, and #MC for the
>> + * offchance that Xen's main stack suffers an uncorrectable error.
>> + *
>> + * FRED reuses MSR_STAR to provide the segment selector values to load on
>> + * entry from Ring3. Entry from Ring0 leave %cs and %ss unmodified.
>> + */
>> +static void init_fred(void)
>> +{
>> + unsigned long stack_top = get_stack_bottom() & ~(STACK_SIZE - 1);
>> +
>> + ASSERT(opt_fred == 1);
>> +
>> + wrmsrns(MSR_STAR, XEN_MSR_STAR);
>> + wrmsrns(MSR_FRED_CONFIG, (unsigned long)entry_FRED_R3);
>> +
>> + wrmsrns(MSR_FRED_RSP_SL0, (unsigned long)(&get_cpu_info()->_fred + 1));
>> + wrmsrns(MSR_FRED_RSP_SL1, 0);
> In the event of a bug somewhere causing this slot to be accessed, is the
> wrapping behavior well-defined, resulting in an attempt to write to the
> top end of VA space? (Then again, if the wrapping itself caused a fault,
> the overall effect would be largely the same - in many cases #DF.)
The wrapping is well defined - like other cases, it goes to the top of
address space, but that's owned by PV guests. SMAP ought to mitigate
what would otherwise be a priv-esc.
With IDT, we poisoned the unused pointers with non-canonical addresses,
but that's not possible here, as they're MSRs and checked at this point,
rather than when they're used.
I suspect the best we can do is reuse the #DB or NMI stacks, and
intentionally reverse the regular and shadow stack pointers, meaning
that any attempt to use SL1 will hit a guard page and escalate to #DF.
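
A rough sketch of the two points above, assuming 48-bit virtual addresses
(4-level paging) and made-up helper names. Real event delivery pushes a whole
frame rather than one word, so first_store_below() only shows where the
wrapped accesses begin:

```c
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_VADDR_BITS 48  /* assumption: 4-level paging */

/* Canonical means bits 63:47 are all copies of bit 47. */
static bool is_canonical(uint64_t addr)
{
    uint64_t upper = addr >> (SKETCH_VADDR_BITS - 1);

    return upper == 0 || upper == (UINT64_MAX >> (SKETCH_VADDR_BITS - 1));
}

/* Address of the first 8-byte store below a stack pointer value. */
static uint64_t first_store_below(uint64_t rsp)
{
    return rsp - 8;  /* unsigned arithmetic wraps, as the hardware does */
}
```

With a stack pointer of 0, the first store lands at 0xfffffffffffffff8, which
is canonical. A non-canonical poison value would not wrap into guest-owned
space, but it is exactly the kind of value the MSR write rejects up front.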
~Andrew
* Re: [PATCH 22/22] x86/traps: Enable FRED when requested
2025-08-18 9:47 ` Andrew Cooper
@ 2025-08-18 9:53 ` Jan Beulich
0 siblings, 0 replies; 120+ messages in thread
From: Jan Beulich @ 2025-08-18 9:53 UTC (permalink / raw)
To: Andrew Cooper; +Cc: Roger Pau Monné, Xen-devel
On 18.08.2025 11:47, Andrew Cooper wrote:
> On 18/08/2025 10:35 am, Jan Beulich wrote:
>> On 08.08.2025 22:23, Andrew Cooper wrote:
>>> With the shadow stack and exception handling adjustments in place, we can now
>>> activate FRED when appropriate. Note that opt_fred is still disabled by
>>> default.
>>>
>>> Introduce init_fred() to set up all the MSRs relevant for FRED. FRED uses
>>> MSR_STAR (entries from Ring3 only), and MSR_FRED_SSP_SL0 aliases MSR_PL0_SSP
>>> when CET-SS is active. Otherwise, they're all new MSRs.
>>>
>>> With init_fred() existing, load_system_tables() and legacy_syscall_init()
>>> should only be used when setting up IDT delivery. Insert ASSERT()s to this
>>> effect, and adjust the various *_init() functions to make this property true.
>>>
>>> Per the documentation, ap_early_traps_init() is responsible for switching off
>>> the boot GDT, which needs doing even in FRED mode.
>>>
>>> Finally, set CR4.FRED in {bsp,ap}_early_traps_init().
>> Probably you've done that already, but these last two paragraphs will need
>> updating following patch 08 v1.1.
>
> It's on my list, but not done yet.
>
>>
>>> Xen can now boot in FRED mode up until starting a PV guest, where it faults
>>> because IRET is not permitted to change privilege.
>>>
>>> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
>> Reviewed-by: Jan Beulich <jbeulich@suse.com>
>
> Thanks, but I fear this patch has changed too much. I'll take a
> decision when I've cleaned up the integration of the PV work.
>
>>
>>> @@ -274,6 +279,44 @@ static void __init init_ler(void)
>>> setup_force_cpu_cap(X86_FEATURE_XEN_LBR);
>>> }
>>>
>>> +/*
>>> + * Set up all MSRs relevant for FRED event delivery.
>>> + *
>>> + * Xen does not use any of the optional config in MSR_FRED_CONFIG, so all that
>>> + * is needed is the entrypoint.
>>> + *
>>> + * Because FRED always provides a good stack, NMI and #DB do not need any
>>> + * special treatment. Only #DF needs another stack level, and #MC for the
>>> + * offchance that Xen's main stack suffers an uncorrectable error.
>>> + *
>>> + * FRED reuses MSR_STAR to provide the segment selector values to load on
>>> + * entry from Ring3. Entry from Ring0 leave %cs and %ss unmodified.
>>> + */
>>> +static void init_fred(void)
>>> +{
>>> + unsigned long stack_top = get_stack_bottom() & ~(STACK_SIZE - 1);
>>> +
>>> + ASSERT(opt_fred == 1);
>>> +
>>> + wrmsrns(MSR_STAR, XEN_MSR_STAR);
>>> + wrmsrns(MSR_FRED_CONFIG, (unsigned long)entry_FRED_R3);
>>> +
>>> + wrmsrns(MSR_FRED_RSP_SL0, (unsigned long)(&get_cpu_info()->_fred + 1));
>>> + wrmsrns(MSR_FRED_RSP_SL1, 0);
>> In the event of a bug somewhere causing this slot to be accessed, is the
>> wrapping behavior well-defined, resulting in an attempt to write to the
>> top end of VA space? (Then again, if the wrapping itself caused a fault,
>> the overall effect would be largely the same - in many cases #DF.)
>
> The wrapping is well defined - like other cases, it goes to the top of
> address space, but that's owned by PV guests. SMAP ought to mitigate
> what would otherwise be a priv-esc.
>
> With IDT, we poisoned the unused pointers with non-canonical addresses,
> but that's not possible here, as they're MSRs and checked at this point,
> rather than when they're used.
>
> I suspect the best we can do is reuse the #DB or NMI stacks, and
> intentionally reverse the regular and shadow stack pointers, meaning
> that any attempt to use SL1 will hit a guard page and escalate to #DF.
I was wondering whether to store the upper end of zero_page[]. Or else
point into entirely unmapped space.
Jan
* Re: [PATCH 23/22] x86/vmx: Adjust NMI handling for FRED
2025-08-08 23:49 ` [PATCH 23/22] x86/vmx: Adjust NMI handling for FRED Andrew Cooper
@ 2025-08-18 10:02 ` Jan Beulich
2025-08-18 17:18 ` Andrew Cooper
0 siblings, 1 reply; 120+ messages in thread
From: Jan Beulich @ 2025-08-18 10:02 UTC (permalink / raw)
To: Andrew Cooper; +Cc: Roger Pau Monné, Xen-devel
On 09.08.2025 01:49, Andrew Cooper wrote:
> --- a/xen/arch/x86/hvm/vmx/vmx.c
> +++ b/xen/arch/x86/hvm/vmx/vmx.c
> @@ -4209,8 +4209,18 @@ void asmlinkage vmx_vmexit_handler(struct cpu_user_regs *regs)
> ((intr_info & INTR_INFO_INTR_TYPE_MASK) ==
> MASK_INSR(X86_ET_NMI, INTR_INFO_INTR_TYPE_MASK)) )
> {
> - do_nmi(regs);
> - enable_nmis();
> + /*
> + * If we exited because of an NMI, NMIs are blocked in hardware,
> + * but software is expected to invoke the handler.
> + *
> + * Use INT $2. Combined with the current state, it is the correct
> + * architectural state for the NMI handler,
Not quite, I would say: For profiling (and anything else which may want to
look at the outer context's register state from within the handler) we'd
always appear to have been in Xen when the NMI "occurred".
> and the IRET on the
> + * way back out will unblock NMIs.
> + *
> + * In FRED mode, we can spot this trick and cause the ERETS to
> + * unblock NMIs too.
> + */
> + asm ("int $2");
> }
> break;
> case EXIT_REASON_MCE_DURING_VMENTRY:
> --- a/xen/arch/x86/traps.c
> +++ b/xen/arch/x86/traps.c
> @@ -2285,8 +2285,22 @@ void asmlinkage entry_from_xen(struct cpu_user_regs *regs)
> do_nmi(regs);
> break;
>
> - case X86_ET_HW_EXC:
> case X86_ET_SW_INT:
> + if ( regs->fred_ss.vector == 2 )
> + {
> + /*
> + * Explicit request from the the VMExit handler. Rewrite the FRED
> + * frame to look like it was a real NMI, and go around again.
> + */
> + regs->fred_ss.swint = false;
> + regs->fred_ss.nmi = true;
> + regs->fred_ss.type = X86_ET_NMI;
> + regs->fred_ss.insnlen = 0;
> +
> + return entry_from_xen(regs);
Any particular reason to use recursion here (which the compiler may or may
not transform)? In fact I'm having trouble seeing why you couldn't invoke
do_nmi() here directly.
Jan
* Re: [PATCH 21/22] x86/traps: Introduce FRED entrypoints
2025-08-08 20:23 ` [PATCH 21/22] x86/traps: Introduce FRED entrypoints Andrew Cooper
2025-08-11 11:38 ` Andrew Cooper
2025-08-14 15:57 ` Jan Beulich
@ 2025-08-18 10:03 ` Jan Beulich
2025-08-18 10:09 ` Andrew Cooper
2 siblings, 1 reply; 120+ messages in thread
From: Jan Beulich @ 2025-08-18 10:03 UTC (permalink / raw)
To: Andrew Cooper; +Cc: Roger Pau Monné, Xen-devel
On 08.08.2025 22:23, Andrew Cooper wrote:
> Under FRED, there's one entrypoint from Ring 3, and one from Ring 0.
>
> FRED gives us a good stack (even for SYSCALL/SYSENTER), and a unified event
> frame on the stack, meaning that all software needs to do is spill the GPRs
> with a line of PUSHes. Introduce PUSH_AND_CLEAR_GPRS and POP_GPRS for this
> purpose.
>
> Introduce entry_FRED_R0() which to a first approximation is complete for all
> event handling within Xen.
>
> entry_FRED_R0() needs deriving from entry_FRED_R3(), so introduce a basic
> handler. There is more work required to make the return-to-guest path work
> under FRED, so leave a BUG clearly in place.
>
> Also introduce entry_from_{xen,pv}() to be the C level handlers. By simply
> copying regs->fred_ss.vector into regs->entry_vector, we can reuse all the
> existing fault handlers.
>
> Extend fatal_trap() to render the event type, including by name, when FRED is
> active. This is slightly complicated, because X86_ET_OTHER must not use
> vector_name() or SYSCALL and SYSENTER get rendered as #BP and #DB. Also,
> {read,write}_gs_shadow() needs modifying to avoid the SWAPGS instruction,
> which is disallowed in FRED mode.
>
> This is sufficient to handle all interrupts and exceptions encountered during
> development, including plenty of Double Faults.
>
> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
> ---
> CC: Jan Beulich <JBeulich@suse.com>
> CC: Roger Pau Monné <roger.pau@citrix.com>
>
> SIMICS hasn't been updated to the FRED v9, and still wants ENDBR instructions
> at the entrypoints.
> ---
> xen/arch/x86/include/asm/asm_defns.h | 65 ++++++++++++
> xen/arch/x86/include/asm/msr.h | 8 +-
> xen/arch/x86/traps.c | 153 ++++++++++++++++++++++++++-
> xen/arch/x86/x86_64/Makefile | 1 +
> xen/arch/x86/x86_64/entry-fred.S | 35 ++++++
> 5 files changed, 256 insertions(+), 6 deletions(-)
> create mode 100644 xen/arch/x86/x86_64/entry-fred.S
>
> diff --git a/xen/arch/x86/include/asm/asm_defns.h b/xen/arch/x86/include/asm/asm_defns.h
> index 72a0082d319d..a81a4043d0f1 100644
> --- a/xen/arch/x86/include/asm/asm_defns.h
> +++ b/xen/arch/x86/include/asm/asm_defns.h
> @@ -315,6 +315,71 @@ static always_inline void stac(void)
> subq $-(UREGS_error_code-UREGS_r15+\adj), %rsp
> .endm
>
> +/*
> + * Push and clear GPRs
> + */
> +.macro PUSH_AND_CLEAR_GPRS
> + push %rdi
> + xor %edi, %edi
> + push %rsi
> + xor %esi, %esi
> + push %rdx
> + xor %edx, %edx
> + push %rcx
> + xor %ecx, %ecx
> + push %rax
> + xor %eax, %eax
> + push %r8
> + xor %r8d, %r8d
> + push %r9
> + xor %r9d, %r9d
> + push %r10
> + xor %r10d, %r10d
> + push %r11
> + xor %r11d, %r11d
> + push %rbx
> + xor %ebx, %ebx
> + push %rbp
> +#ifdef CONFIG_FRAME_POINTER
> +/* Indicate special exception stack frame by inverting the frame pointer. */
> + mov %rsp, %rbp
> + notq %rbp
> +#else
> + xor %ebp, %ebp
> +#endif
> + push %r12
> + xor %r12d, %r12d
> + push %r13
> + xor %r13d, %r13d
> + push %r14
> + xor %r14d, %r14d
> + push %r15
> + xor %r15d, %r15d
> +.endm
> +
> +/*
> + * POP GPRs from a UREGS_* frame on the stack. Does not modify flags.
> + *
> + * @rax: Alternative destination for the %rax value on the stack.
> + */
> +.macro POP_GPRS rax=%rax
> + pop %r15
> + pop %r14
> + pop %r13
> + pop %r12
> + pop %rbp
> + pop %rbx
> + pop %r11
> + pop %r10
> + pop %r9
> + pop %r8
> + pop \rax
> + pop %rcx
> + pop %rdx
> + pop %rsi
> + pop %rdi
> +.endm
> +
> #ifdef CONFIG_PV32
> #define CR4_PV32_RESTORE \
> ALTERNATIVE_2 "", \
> diff --git a/xen/arch/x86/include/asm/msr.h b/xen/arch/x86/include/asm/msr.h
> index b6b85b04c3fd..01f510315ffe 100644
> --- a/xen/arch/x86/include/asm/msr.h
> +++ b/xen/arch/x86/include/asm/msr.h
> @@ -202,9 +202,9 @@ static inline unsigned long read_gs_base(void)
>
> static inline unsigned long read_gs_shadow(void)
> {
> - unsigned long base;
> + unsigned long base, cr4 = read_cr4();
>
> - if ( read_cr4() & X86_CR4_FSGSBASE )
> + if ( !(cr4 & X86_CR4_FRED) && (cr4 & X86_CR4_FSGSBASE) )
> {
> asm volatile ( "swapgs" );
> base = __rdgsbase();
> @@ -234,7 +234,9 @@ static inline void write_gs_base(unsigned long base)
>
> static inline void write_gs_shadow(unsigned long base)
> {
> - if ( read_cr4() & X86_CR4_FSGSBASE )
> + unsigned long cr4 = read_cr4();
> +
> + if ( !(cr4 & X86_CR4_FRED) && (cr4 & X86_CR4_FSGSBASE) )
> {
> asm volatile ( "swapgs\n\t"
> "wrgsbase %0\n\t"
> diff --git a/xen/arch/x86/traps.c b/xen/arch/x86/traps.c
> index 270b93ed623e..e67a428e4362 100644
> --- a/xen/arch/x86/traps.c
> +++ b/xen/arch/x86/traps.c
> @@ -1013,6 +1013,32 @@ void show_execution_state_nmi(const cpumask_t *mask, bool show_all)
> printk("Non-responding CPUs: {%*pbl}\n", CPUMASK_PR(&show_state_mask));
> }
>
> +static const char *x86_et_name(unsigned int type)
> +{
> + static const char *const names[] = {
> + [X86_ET_EXT_INTR] = "EXT_INTR",
> + [X86_ET_NMI] = "NMI",
> + [X86_ET_HW_EXC] = "HW_EXC",
> + [X86_ET_SW_INT] = "SW_INT",
> + [X86_ET_PRIV_SW_EXC] = "PRIV_SW_EXEC",
> + [X86_ET_SW_EXC] = "SW_EXEC",
> + [X86_ET_OTHER] = "OTHER",
> + };
> +
> + return (type < ARRAY_SIZE(names) && names[type]) ? names[type] : "???";
> +}
> +
> +static const char *x86_et_other_name(unsigned int vec)
> +{
> + static const char *const names[] = {
> + [0] = "MTF",
> + [1] = "SYSCALL",
> + [2] = "SYSENTER",
> + };
> +
> + return (vec < ARRAY_SIZE(names) && names[vec][0]) ? names[vec] : "???";
> +}
> +
> const char *vector_name(unsigned int vec)
> {
> static const char names[][4] = {
> @@ -1091,9 +1117,42 @@ void fatal_trap(const struct cpu_user_regs *regs, bool show_remote)
> }
> }
>
> - panic("FATAL TRAP: vec %u, %s[%04x]%s\n",
> - trapnr, vector_name(trapnr), regs->error_code,
> - (regs->eflags & X86_EFLAGS_IF) ? "" : " IN INTERRUPT CONTEXT");
> + if ( read_cr4() & X86_CR4_FRED )
> + {
> + bool render_ec = false;
> + const char *vec_name = NULL;
> +
> + switch ( regs->fred_ss.type )
> + {
> + case X86_ET_HW_EXC:
> + case X86_ET_SW_INT:
> + case X86_ET_PRIV_SW_EXC:
> + case X86_ET_SW_EXC:
> + render_ec = true;
> + vec_name = vector_name(regs->fred_ss.vector);
> + break;
> +
> + case X86_ET_OTHER:
> + vec_name = x86_et_other_name(regs->fred_ss.vector);
> + break;
> + }
> +
> + if ( render_ec )
> + panic("Fatal TRAP: type %u, %s, vec %u, %s[%04x]%s\n",
> + regs->fred_ss.type, x86_et_name(regs->fred_ss.type),
> + regs->fred_ss.vector, vec_name ?: "",
> + regs->error_code,
> + (regs->eflags & X86_EFLAGS_IF) ? "" : " IN INTERRUPT CONTEXT");
> + else
> + panic("Fatal TRAP: type %u, %s, vec %u, %s%s\n",
> + regs->fred_ss.type, x86_et_name(regs->fred_ss.type),
> + regs->fred_ss.vector, vec_name ?: "",
> + (regs->eflags & X86_EFLAGS_IF) ? "" : " IN INTERRUPT CONTEXT");
> + }
> + else
> + panic("FATAL TRAP: vec %u, %s[%04x]%s\n",
> + trapnr, vector_name(trapnr), regs->error_code,
> + (regs->eflags & X86_EFLAGS_IF) ? "" : " IN INTERRUPT CONTEXT");
> }
>
> void asmlinkage noreturn do_unhandled_trap(struct cpu_user_regs *regs)
> @@ -2181,6 +2240,94 @@ void asmlinkage check_ist_exit(const struct cpu_user_regs *regs, bool ist_exit)
> }
> #endif
>
> +void asmlinkage entry_from_pv(struct cpu_user_regs *regs)
> +{
> + /* Copy fred_ss.vector into entry_vector as IDT delivery would have done. */
> + regs->entry_vector = regs->fred_ss.vector;
> +
> + switch ( regs->fred_ss.type )
> + {
> + case X86_ET_EXT_INTR:
> + do_IRQ(regs);
> + break;
> +
> + case X86_ET_NMI:
> + do_nmi(regs);
> + break;
> +
> + case X86_ET_HW_EXC:
> + case X86_ET_SW_INT:
> + case X86_ET_PRIV_SW_EXC:
> + case X86_ET_SW_EXC:
> + goto fatal;
> +
> + default:
> + goto fatal;
> + }
> +
> + return;
> +
> + fatal:
> + fatal_trap(regs, false);
> +}
Noticed only now: Shouldn't this be surrounded with #ifdef CONFIG_PV (with
knock-on effects elsewhere)?
Jan
^ permalink raw reply [flat|nested] 120+ messages in thread
* Re: [PATCH 21/22] x86/traps: Introduce FRED entrypoints
2025-08-18 10:03 ` Jan Beulich
@ 2025-08-18 10:09 ` Andrew Cooper
0 siblings, 0 replies; 120+ messages in thread
From: Andrew Cooper @ 2025-08-18 10:09 UTC (permalink / raw)
To: Jan Beulich; +Cc: Roger Pau Monné, Xen-devel
On 18/08/2025 11:03 am, Jan Beulich wrote:
> On 08.08.2025 22:23, Andrew Cooper wrote:
>> diff --git a/xen/arch/x86/traps.c b/xen/arch/x86/traps.c
>> index 270b93ed623e..e67a428e4362 100644
>> --- a/xen/arch/x86/traps.c
>> +++ b/xen/arch/x86/traps.c
>> @@ -2181,6 +2240,94 @@ void asmlinkage check_ist_exit(const struct cpu_user_regs *regs, bool ist_exit)
>> }
>> #endif
>>
>> +void asmlinkage entry_from_pv(struct cpu_user_regs *regs)
>> +{
>> + /* Copy fred_ss.vector into entry_vector as IDT delivery would have done. */
>> + regs->entry_vector = regs->fred_ss.vector;
>> +
>> + switch ( regs->fred_ss.type )
>> + {
>> + case X86_ET_EXT_INTR:
>> + do_IRQ(regs);
>> + break;
>> +
>> + case X86_ET_NMI:
>> + do_nmi(regs);
>> + break;
>> +
>> + case X86_ET_HW_EXC:
>> + case X86_ET_SW_INT:
>> + case X86_ET_PRIV_SW_EXC:
>> + case X86_ET_SW_EXC:
>> + goto fatal;
>> +
>> + default:
>> + goto fatal;
>> + }
>> +
>> + return;
>> +
>> + fatal:
>> + fatal_trap(regs, false);
>> +}
> Noticed only now: Shouldn't this be surrounded with #ifdef CONFIG_PV (with
> knock-on effects elsewhere)?
Randconfig had a fun time with CONFIG_PV.
I've got an early "if ( !IS_ENABLED(CONFIG_PV) ) goto fatal;" but there's
a bit of extra complexity in the ASM. entry_FRED_R3() needs to exist
even outside of CONFIG_PV, despite things like test_all_events being
conditional.
Also, this is changing a bit as part of getting PV support working.
~Andrew
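
For readers following along, the early bail-out described above can be sketched in plain C. This is only an illustration, assuming the intent is to treat every event from Ring 3 as fatal when PV support is compiled out. DEMO_CONFIG_PV stands in for IS_ENABLED(CONFIG_PV), and the demo_* names and event-type numbers are hypothetical stand-ins, not the Xen definitions.

```c
#include <assert.h>

/* Hypothetical stand-in for IS_ENABLED(CONFIG_PV); 0 models a !PV build. */
#ifndef DEMO_CONFIG_PV
#define DEMO_CONFIG_PV 0
#endif

enum demo_outcome { DEMO_HANDLED, DEMO_FATAL };

/* Illustrative event types; the numbering is an assumption of this sketch. */
enum { DEMO_ET_EXT_INTR = 0, DEMO_ET_NMI = 2, DEMO_ET_HW_EXC = 3 };

static enum demo_outcome demo_entry_from_pv(unsigned int type)
{
    /* With PV compiled out, nothing should ever arrive from Ring 3. */
    if ( !DEMO_CONFIG_PV )
        return DEMO_FATAL;

    switch ( type )
    {
    case DEMO_ET_EXT_INTR:
    case DEMO_ET_NMI:
        return DEMO_HANDLED;

    default:
        return DEMO_FATAL;
    }
}
```

Keeping the function body and bailing out early, rather than wrapping it in #ifdef, lets an assembly entrypoint keep referencing the C symbol in every configuration.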
^ permalink raw reply [flat|nested] 120+ messages in thread
* Re: [PATCH 23/22] x86/vmx: Adjust NMI handling for FRED
2025-08-18 10:02 ` Jan Beulich
@ 2025-08-18 17:18 ` Andrew Cooper
2025-08-19 6:31 ` Jan Beulich
0 siblings, 1 reply; 120+ messages in thread
From: Andrew Cooper @ 2025-08-18 17:18 UTC (permalink / raw)
To: Jan Beulich; +Cc: Roger Pau Monné, Xen-devel
On 18/08/2025 11:02 am, Jan Beulich wrote:
> On 09.08.2025 01:49, Andrew Cooper wrote:
>> --- a/xen/arch/x86/hvm/vmx/vmx.c
>> +++ b/xen/arch/x86/hvm/vmx/vmx.c
>> @@ -4209,8 +4209,18 @@ void asmlinkage vmx_vmexit_handler(struct cpu_user_regs *regs)
>> ((intr_info & INTR_INFO_INTR_TYPE_MASK) ==
>> MASK_INSR(X86_ET_NMI, INTR_INFO_INTR_TYPE_MASK)) )
>> {
>> - do_nmi(regs);
>> - enable_nmis();
>> + /*
>> + * If we exited because of an NMI, NMIs are blocked in hardware,
>> + * but software is expected to invoke the handler.
>> + *
>> + * Use INT $2. Combined with the current state, it is the correct
>> + * architectural state for the NMI handler,
> Not quite, I would say: For profiling (and anything else which may want to
> look at the outer context's register state from within the handler) we'd
> always appear to have been in Xen when the NMI "occurred".
We are always inside Xen when the NMI "occurred".
In fact there's a latent bug I didn't spot before. Nothing appears to do so,
but if anything in do_nmi() were to look at regs->entry_vector, it
would see stack rubble (release build) or poison (debug build).
Having gone searching, it's only the watchdog and oprofile which
configure perf counters with NMIs. vPMU uses fixed interrupts, which
further calls into question its utility.
>
>> and the IRET on the
>> + * way back out will unblock NMIs.
>> + *
>> + * In FRED mode, we can spot this trick and cause the ERETS to
>> + * unblock NMIs too.
>> + */
>> + asm ("int $2");
>> }
>> break;
>> case EXIT_REASON_MCE_DURING_VMENTRY:
>> --- a/xen/arch/x86/traps.c
>> +++ b/xen/arch/x86/traps.c
>> @@ -2285,8 +2285,22 @@ void asmlinkage entry_from_xen(struct cpu_user_regs *regs)
>> do_nmi(regs);
>> break;
>>
>> - case X86_ET_HW_EXC:
>> case X86_ET_SW_INT:
>> + if ( regs->fred_ss.vector == 2 )
>> + {
>> + /*
>> + * Explicit request from the VMExit handler. Rewrite the FRED
>> + * frame to look like it was a real NMI, and go around again.
>> + */
>> + regs->fred_ss.swint = false;
>> + regs->fred_ss.nmi = true;
>> + regs->fred_ss.type = X86_ET_NMI;
>> + regs->fred_ss.insnlen = 0;
>> +
>> + return entry_from_xen(regs);
> Any particular reason to use recursion here (which the compiler may or may
> not transform)? In fact I'm having trouble seeing why you couldn't invoke
> do_nmi() here directly.
In my first version of entry_from_xen(), this was necessary to get the
right behaviour. GCC did manage to transform it into a call to do_nmi().
But this has changed somewhat now, so I think I can do it with a fallthrough.
~Andrew
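
The fallthrough idea mentioned above can be sketched outside of Xen as follows. This is a minimal model, not the eventual patch: struct demo_fred_ss, the DEMO_ET_* values, demo_nmi_count (standing in for the effect of do_nmi()) and demo_dispatch() are all hypothetical names introduced for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative event types; the numbering is an assumption of this sketch. */
enum { DEMO_ET_NMI = 2, DEMO_ET_SW_INT = 4 };

/* Cut-down model of the FRED event information in the stack frame. */
struct demo_fred_ss {
    unsigned int vector;
    unsigned int type;
    unsigned int insnlen;
    bool swint;
    bool nmi;
};

static unsigned int demo_nmi_count; /* stands in for calling do_nmi() */

/* Returns true if the event was handled, false for the fatal path. */
static bool demo_entry_from_xen(struct demo_fred_ss *ss)
{
    switch ( ss->type )
    {
    case DEMO_ET_SW_INT:
        if ( ss->vector != 2 )
            return false;

        /* Rewrite the frame to look like a real NMI ... */
        ss->swint = false;
        ss->nmi = true;
        ss->type = DEMO_ET_NMI;
        ss->insnlen = 0;
        /* ... and fall through to the NMI case instead of recursing. */
        /* fallthrough */
    case DEMO_ET_NMI:
        demo_nmi_count++;
        return true;

    default:
        return false;
    }
}

/* Build a frame, dispatch it, and report the resulting type (-1 if fatal). */
static int demo_dispatch(unsigned int type, unsigned int vector)
{
    struct demo_fred_ss ss = { .vector = vector, .type = type };

    ss.swint = (type == DEMO_ET_SW_INT);

    return demo_entry_from_xen(&ss) ? (int)ss.type : -1;
}
```

Placing the SW_INT case directly ahead of the NMI case is what makes the fallthrough possible; no recursion and no reliance on the compiler's tail-call transformation is needed.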
^ permalink raw reply [flat|nested] 120+ messages in thread
* Re: [PATCH 23/22] x86/vmx: Adjust NMI handling for FRED
2025-08-18 17:18 ` Andrew Cooper
@ 2025-08-19 6:31 ` Jan Beulich
0 siblings, 0 replies; 120+ messages in thread
From: Jan Beulich @ 2025-08-19 6:31 UTC (permalink / raw)
To: Andrew Cooper; +Cc: Roger Pau Monné, Xen-devel
On 18.08.2025 19:18, Andrew Cooper wrote:
> On 18/08/2025 11:02 am, Jan Beulich wrote:
>> On 09.08.2025 01:49, Andrew Cooper wrote:
>>> --- a/xen/arch/x86/hvm/vmx/vmx.c
>>> +++ b/xen/arch/x86/hvm/vmx/vmx.c
>>> @@ -4209,8 +4209,18 @@ void asmlinkage vmx_vmexit_handler(struct cpu_user_regs *regs)
>>> ((intr_info & INTR_INFO_INTR_TYPE_MASK) ==
>>> MASK_INSR(X86_ET_NMI, INTR_INFO_INTR_TYPE_MASK)) )
>>> {
>>> - do_nmi(regs);
>>> - enable_nmis();
>>> + /*
>>> + * If we exited because of an NMI, NMIs are blocked in hardware,
>>> + * but software is expected to invoke the handler.
>>> + *
>>> + * Use INT $2. Combined with the current state, it is the correct
>>> + * architectural state for the NMI handler,
>> Not quite, I would say: For profiling (and anything else which may want to
>> look at the outer context's register state from within the handler) we'd
>> always appear to have been in Xen when the NMI "occurred".
>
> We are always inside Xen when the NMI "occurred".
How so? The perception is based on "regs", isn't it? They represent
guest context here, just with ...
> In fact there's a latent bug I didn't spot before. Nothing appears to do so,
> but if anything in do_nmi() were to look at regs->entry_vector, it
> would see stack rubble (release build) or poison (debug build).
... a few fields (apparently wrongly) not filled.
Jan
^ permalink raw reply [flat|nested] 120+ messages in thread
* Re: [PATCH 13/22] x86: FRED enumerations
2025-08-14 11:20 ` Jan Beulich
2025-08-14 11:42 ` Andrew Cooper
2025-08-14 13:19 ` Jan Beulich
@ 2025-08-21 21:23 ` Andrew Cooper
2 siblings, 0 replies; 120+ messages in thread
From: Andrew Cooper @ 2025-08-21 21:23 UTC (permalink / raw)
To: Jan Beulich; +Cc: Roger Pau Monné, Xen-devel
On 14/08/2025 12:20 pm, Jan Beulich wrote:
> On 08.08.2025 22:23, Andrew Cooper wrote:
>> Of note, CR4.FRED is bit 32 and cannot be enabled outside of 64bit mode.
>>
>> Most supported toolchains don't understand the FRED instructions yet. ERETU
>> and ERETS are easy to wrap (they are encoded as REPZ/REPNE CLAC), while LKGS is
>> more complicated and deferred for now.
>>
>> I have intentionally named the FRED MSRs differently to the spec. In the
>> spec, the stack pointer names alias the TSS fields of the same name, despite
>> very different semantics.
>>
>> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
> Acked-by: Jan Beulich <jbeulich@suse.com>
> with ...
>
>> --- a/xen/arch/x86/Kconfig
>> +++ b/xen/arch/x86/Kconfig
>> @@ -57,6 +57,10 @@ config HAS_CC_CET_IBT
>> # Retpoline check to work around https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93654
>> def_bool $(cc-option,-fcf-protection=branch -mmanual-endbr -mindirect-branch=thunk-extern) && $(as-instr,endbr64)
>>
>> +config HAS_AS_FRED
>> + # binutils >= 2.41 or LLVM >= 19
>> + def_bool $(as-instr,eretu;lkgs %ax)
> ..., as per your reply, this preferably dropped
Having now got the PV side complete (I think), I can indeed drop this,
but I need
diff --git a/xen/arch/x86/include/asm/cpufeatures.h b/xen/arch/x86/include/asm/cpufeatures.h
index 71308d9dafc8..0a98676c1604 100644
--- a/xen/arch/x86/include/asm/cpufeatures.h
+++ b/xen/arch/x86/include/asm/cpufeatures.h
@@ -18,7 +18,7 @@ XEN_CPUFEATURE(ARCH_PERFMON, X86_SYNTH( 3)) /* Intel Architectural PerfMon
XEN_CPUFEATURE(TSC_RELIABLE, X86_SYNTH( 4)) /* TSC is known to be reliable */
XEN_CPUFEATURE(XTOPOLOGY, X86_SYNTH( 5)) /* cpu topology enum extensions */
XEN_CPUFEATURE(CPUID_FAULTING, X86_SYNTH( 6)) /* cpuid faulting */
-/* Bit 7 unused */
+XEN_CPUFEATURE(XEN_FRED, X86_SYNTH( 7)) /* Xen uses FRED */
XEN_CPUFEATURE(APERFMPERF, X86_SYNTH( 8)) /* APERFMPERF */
XEN_CPUFEATURE(MFENCE_RDTSC, X86_SYNTH( 9)) /* MFENCE synchronizes RDTSC */
XEN_CPUFEATURE(XEN_SMEP, X86_SYNTH(10)) /* SMEP gets used by Xen itself */
too for a fastpath in assembly. I've folded it into this patch.
~Andrew
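
As an aside on the "REPZ/REPNE CLAC" remark quoted above, the wrapping can be illustrated by composing the byte sequences by hand. The sketch below assumes the usual prefix bytes (0xf3 for REPZ, 0xf2 for REPNE) and the CLAC opcode bytes 0f 01 ca, and assumes ERETU takes the REPZ form and ERETS the REPNE form, matching the order in the quoted text. Check these against the FRED specification before relying on them. The demo_* names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_PREFIX_REPNE 0xf2u
#define DEMO_PREFIX_REPZ  0xf3u

/*
 * Pack "prefix + CLAC" into a 32-bit value in little-endian order, so the
 * prefix is the first byte in memory, as a .byte directive would emit it.
 */
static uint32_t demo_prefixed_clac(uint32_t prefix)
{
    return prefix | (0x0fu << 8) | (0x01u << 16) | (0xcau << 24);
}

/* Assumed mapping: ERETU is the REPZ form, ERETS the REPNE form. */
static uint32_t demo_eretu(void) { return demo_prefixed_clac(DEMO_PREFIX_REPZ); }
static uint32_t demo_erets(void) { return demo_prefixed_clac(DEMO_PREFIX_REPNE); }
```

A toolchain that does not know the mnemonics can then emit the same four bytes directly.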
^ permalink raw reply related [flat|nested] 120+ messages in thread
* Re: [PATCH 15/22] x86/traps: Introduce opt_fred
2025-08-15 8:37 ` Jan Beulich
@ 2025-08-21 21:52 ` Andrew Cooper
2025-08-25 9:08 ` Jan Beulich
0 siblings, 1 reply; 120+ messages in thread
From: Andrew Cooper @ 2025-08-21 21:52 UTC (permalink / raw)
To: Jan Beulich; +Cc: Xen-devel, Roger Pau Monné
On 15/08/2025 9:37 am, Jan Beulich wrote:
> On 14.08.2025 21:16, Andrew Cooper wrote:
>> On 14/08/2025 2:30 pm, Jan Beulich wrote:
>>> On 08.08.2025 22:23, Andrew Cooper wrote:
>>>> ... disabled by default. There is a lot of work before FRED can be enabled by
>>>> default.
>>>>
>>>> One part of FRED, the LKGS (Load Kernel GS) instruction, is enumerated
>>>> separately but is mandatory as FRED disallows the SWAPGS instruction.
>>>> Therefore, both CPUID bits must be checked.
>>> See my (further) reply to patch 13 - I think FRED simply ought to depend on
>>> LKGS.
>>>
>>>> @@ -20,6 +22,9 @@ unsigned int __ro_after_init ler_msr;
>>>> static bool __initdata opt_ler;
>>>> boolean_param("ler", opt_ler);
>>>>
>>>> +int8_t __ro_after_init opt_fred = 0; /* -1 when supported. */
>>> I'm a little puzzled by the comment? DYM "once default-enabled"?
>> Well, I have this temporary patch
>> https://gitlab.com/xen-project/hardware/xen-staging/-/commit/70ef6a1178a411a29b7b1745a1112e267ffb6245
>> that will turn into a real patch when we enable FRED by default.
>>
>> As much as anything else, it was just a TODO.
>>
>>
>>> Then ...
>>>
>>>> @@ -305,6 +310,32 @@ void __init traps_init(void)
>>>> /* Replace early pagefault with real pagefault handler. */
>>>> _update_gate_addr_lower(&bsp_idt[X86_EXC_PF], entry_PF);
>>>>
>>>> + if ( !cpu_has_fred || !cpu_has_lkgs )
>>>> + {
>>>> + if ( opt_fred )
>>> ... this won't work anymore once the initializer is changed.
>> Hmm yes. That wants to be an == 1 check. Fixed.
>>
>>>> + printk(XENLOG_WARNING "FRED not available, ignoring\n");
>>>> + opt_fred = false;
>>> Better use 0 here?
>>>
>>>> + }
>>>> +
>>>> + if ( opt_fred == -1 )
>>>> + opt_fred = !pv_shim;
>>> Imo it would be better to have the initializer be -1 right away, and comment
>>> out the "!pv_shim" here, until we mean it to be default-enabled.
>> It cannot be -1, or Xen will fail spectacularly on any FRED capable
>> hardware. Setting to -1 is the point at which FRED becomes security
>> supported.
> I guess I'm not following: If it was -1, and if the code here was
>
> if ( opt_fred < 0 )
> opt_fred = 0 /* !pv_shim */;
>
> why would things "fail spectacularly" unless someone passed "fred" on
> the command line?
Oh, that would work, but why bother? It's simply a less readable form
of mine, and if we're going to nitpick, it's commented out code.
~Andrew
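
The tri-state handling being debated above can be modelled as a small pure function. This is a sketch under assumptions, not the code from the patch: demo_resolve_fred(), demo_should_warn() and the default_on parameter are hypothetical, with default_on standing in for the one-line policy switch that would be flipped once FRED is enabled by default.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * opt: -1 = no explicit choice, 0 = disabled, 1 = explicitly requested.
 * hw_ok models "cpu_has_fred && cpu_has_lkgs".
 */
static int8_t demo_resolve_fred(int8_t opt, bool hw_ok, bool pv_shim,
                                bool default_on)
{
    if ( !hw_ok )
        return 0;                       /* Unavailable, whatever was asked. */

    if ( opt == -1 )
        opt = default_on && !pv_shim;   /* Apply the default policy. */

    return opt;
}

/* Only an explicit request on incapable hardware deserves a warning. */
static bool demo_should_warn(int8_t opt, bool hw_ok)
{
    return !hw_ok && opt == 1;
}
```

Comparing against 1 rather than testing for non-zero keeps the warning correct regardless of whether the initializer is 0 or -1.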
^ permalink raw reply [flat|nested] 120+ messages in thread
* Re: [PATCH 19/22] x86/boot: Use RSTORSSP to establish SSP
2025-08-15 9:03 ` Jan Beulich
@ 2025-08-21 22:09 ` Andrew Cooper
2025-08-25 9:12 ` Jan Beulich
0 siblings, 1 reply; 120+ messages in thread
From: Andrew Cooper @ 2025-08-21 22:09 UTC (permalink / raw)
To: Jan Beulich; +Cc: Roger Pau Monné, Xen-devel
On 15/08/2025 10:03 am, Jan Beulich wrote:
> On 14.08.2025 22:09, Andrew Cooper wrote:
>> On 14/08/2025 4:11 pm, Jan Beulich wrote:
>>> On 08.08.2025 22:23, Andrew Cooper wrote:
>>>> Under FRED, SETSSBSY is unavailable, and we want to be setting up FRED prior
>>>> to setting up shadow stacks. As we still need Supervisor Tokens in IDT mode,
>>>> we need mode-specific logic to establish SSP.
>>>>
>>>> In FRED mode, write a Restore Token, RSTORSSP it, and discard the resulting
>>>> Previous-SSP token.
>>>>
>>>> No change outside of FRED mode.
>>>>
>>>> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
>>> Why is it that in patch 17 you could use identical code, but here you can't?
>> This caught me out at first too.
>>
>> For S3, we're going from "no shadow stack" to "back to where we were on
>> an active shadow stack". All we need to do is get saved_ssp back into
>> the SSP register.
>>
>> Here, we're going from "no shadow stack" to "on a good, empty, shadow
>> stack". For FRED we only need to load a value into SSP, but in IDT mode
>> we must also arrange to create a busy Supervisor Token on the base of
>> the stack.
>>
>> We could in principle conditionally write a busy supervisor token, then
>> unconditionally RSTORSSP, but that's even more complicated to follow IMO.
> Why would the write need to be conditional?
Because the tokens are different. One has the value &addr, and one has
&addr + 9.
The Supervisor Shadow Stack Token for IDT needs to survive for the
lifetime of Xen, while the Restore Token for FRED is temporary and
discarded by the logic added in this patch.
> Can't we write what effectively
> is already there? Or is it more a safety measure to avoid the write when
> it's supposed to be unnecessary, to avoid papering over bugs?
I genuinely don't understand what you're trying to suggest here.
~Andrew
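
To make the "&addr" versus "&addr + 9" distinction above concrete, here is a small sketch of how the two token values relate to the address they are stored at. It assumes the CET layouts as I understand them: a supervisor token holds its own address (with bit 0 later used as the busy bit), while a 64-bit restore token holds the address just above itself with bit 0 set. The demo_* helpers are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Supervisor shadow stack token, not yet busy: its own 8-byte aligned address. */
static uint64_t demo_supervisor_token(uint64_t addr)
{
    assert((addr & 7) == 0);

    return addr;
}

/* Restore token: the slot above the token, with bit 0 marking 64-bit mode. */
static uint64_t demo_restore_token(uint64_t addr)
{
    assert((addr & 7) == 0);

    return (addr + 8) | 1;
}
```

For an aligned addr the restore token therefore works out to addr + 9, which is why a single unconditional write cannot serve both modes.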
^ permalink raw reply [flat|nested] 120+ messages in thread
* Re: [PATCH 20/22] x86/traps: Alter switch_stack_and_jump() for FRED mode
2025-08-15 9:10 ` Jan Beulich
@ 2025-08-21 22:56 ` Andrew Cooper
2025-08-25 9:19 ` Jan Beulich
0 siblings, 1 reply; 120+ messages in thread
From: Andrew Cooper @ 2025-08-21 22:56 UTC (permalink / raw)
To: Jan Beulich; +Cc: Roger Pau Monné, Xen-devel
On 15/08/2025 10:10 am, Jan Beulich wrote:
> On 14.08.2025 22:55, Andrew Cooper wrote:
>> On 14/08/2025 4:35 pm, Jan Beulich wrote:
>>> On 08.08.2025 22:23, Andrew Cooper wrote:
>>>> FRED and IDT differ by a Supervisor Token on the base of the shstk. This
>>>> means that switch_stack_and_jump() needs to discard one extra word when FRED
>>>> is active.
>>>>
>>>> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
>>>> ---
>>>> CC: Jan Beulich <JBeulich@suse.com>
>>>> CC: Roger Pau Monné <roger.pau@citrix.com>
>>>>
>>>> RFC. I don't like this, but it does work.
>>>>
>>>> This emits opt_fred logic outside of CONFIG_XEN_SHSTK.
>>> opt_fred and XEN_SHSTK are orthogonal, so that's fine anyway. What I guess
>>> you may mean is that you now have a shstk-related calculation outside of
>>> a respective #ifdef.
>> I really mean "outside of the path where shadow stacks are known to be
>> active", i.e. inside the middle of SHADOW_STACK_WORK
>>
>>> Given the simplicity of the calculation, ...
>>>
>>>> But frankly, the
>>>> construct is already too unwieldy, and all options I can think of make it
>>>> more so.
>>> ... I agree having it like this is okay.
>> Yes, but it is a read of a global even when it's not used.
>>
>> And as a tangent, we probably want __ro_after_init_read_mostly too. The
>> read mostly is about cache locality, and is applicable even to the
>> __ro_after_init section.
> Not really: __read_mostly is to keep stuff rarely written apart from stuff
> more frequently written (cache locality, yes). There's not going to be any
> frequently written data next to a __ro_after_init item; it's all r/o post-
> boot. And I don't think we care much during boot.
It's not about boot, but hot variables do need grouping. opt_fred is
read on fastpaths, whereas trampoline_phys is not.
>
>>>> @@ -154,7 +155,6 @@ unsigned long get_stack_dump_bottom (unsigned long sp);
>>>> "rdsspd %[ssp];" \
>>>> "cmp $1, %[ssp];" \
>>>> "je .L_shstk_done.%=;" /* CET not active? Skip. */ \
>>>> - "mov $%c[skstk_base], %[val];" \
>>>> "and $%c[stack_mask], %[ssp];" \
>>>> "sub %[ssp], %[val];" \
>>>> "shr $3, %[val];" \
>>> With the latter two insns here, ...
>>>
>>>> @@ -177,6 +177,8 @@ unsigned long get_stack_dump_bottom (unsigned long sp);
>>>>
>>>> #define switch_stack_and_jump(fn, instr, constr) \
>>>> ({ \
>>>> + unsigned int token_offset = \
>>>> + (PRIMARY_SHSTK_SLOT + 1) * PAGE_SIZE - (opt_fred ? 0 : 8); \
>>>> unsigned int tmp; \
>>>> BUILD_BUG_ON(!ssaj_has_attr_noreturn(fn)); \
>>>> __asm__ __volatile__ ( \
>>>> @@ -184,12 +186,11 @@ unsigned long get_stack_dump_bottom (unsigned long sp);
>>>> "mov %[stk], %%rsp;" \
>>>> CHECK_FOR_LIVEPATCH_WORK \
>>>> instr "[fun]" \
>>>> - : [val] "=&r" (tmp), \
>>>> + : [val] "=r" (tmp), \
>>> ... I don't think you can legitimately drop the & from here? With it
>>> retained:
>>> Reviewed-by: Jan Beulich <jbeulich@suse.com>
>> You chopped the bit which has an explicit input for "[val]", making the
>> earlyclobber incorrect.
> I was wondering whether there was a connection there, but ...
>
>> IIRC, one version of Clang complained.
> ... that's not good. Without the early-clobber the asm() isn't quite
> correct imo. If the same value appeared as another input, the compiler
> may validly tie both together, assuming the register stays intact until
> the very last insn (and hence even that last insn could still use the
> register as an input). IOW if there's a Clang issue here, I think it
> may need working around explicitly.
Given that I need an alternative anyway, this becomes much easier, and
shrinks to this single hunk:
diff --git a/xen/arch/x86/include/asm/current.h b/xen/arch/x86/include/asm/current.h
index c1eb27b1c4c2..35cc61fa88e7 100644
--- a/xen/arch/x86/include/asm/current.h
+++ b/xen/arch/x86/include/asm/current.h
@@ -154,7 +154,9 @@ unsigned long get_stack_dump_bottom (unsigned long sp);
"rdsspd %[ssp];" \
"cmp $1, %[ssp];" \
"je .L_shstk_done.%=;" /* CET not active? Skip. */ \
- "mov $%c[skstk_base], %[val];" \
+ ALTERNATIVE("mov $%c[skstk_base], %[val];", \
+ "mov $%c[skstk_base] + 8, %[val];", \
+ X86_FEATURE_XEN_FRED) \
"and $%c[stack_mask], %[ssp];" \
"sub %[ssp], %[val];" \
"shr $3, %[val];" \
~Andrew
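
The arithmetic performed by the mov/and/sub/shr sequence above can be written out in C. This is an illustration only: the page size, stack size and PRIMARY_SHSTK_SLOT value below are assumed for the example rather than taken from this thread, and demo_words_to_discard() is a hypothetical helper.

```c
#include <assert.h>
#include <stdbool.h>

#define DEMO_PAGE_SIZE          4096u
#define DEMO_STACK_SIZE         (8 * DEMO_PAGE_SIZE)  /* assumed stack size */
#define DEMO_PRIMARY_SHSTK_SLOT 5u                    /* assumed slot index */

/*
 * Number of 8-byte shadow stack words between the current SSP and the base.
 * In IDT mode the base sits below the Supervisor Token; in FRED mode there
 * is no token, so the base is 8 bytes higher.
 */
static unsigned int demo_words_to_discard(unsigned long ssp, bool fred)
{
    unsigned int base = (DEMO_PRIMARY_SHSTK_SLOT + 1) * DEMO_PAGE_SIZE -
                        (fred ? 0 : 8);
    unsigned int off = ssp & (DEMO_STACK_SIZE - 1);

    return (base - off) >> 3;
}
```

With an SSP 24 bytes below the top of the shadow stack page, this gives two words in IDT mode and three in FRED mode, i.e. the one extra word the patch description refers to.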
^ permalink raw reply related [flat|nested] 120+ messages in thread
* Re: [PATCH 15/22] x86/traps: Introduce opt_fred
2025-08-21 21:52 ` Andrew Cooper
@ 2025-08-25 9:08 ` Jan Beulich
0 siblings, 0 replies; 120+ messages in thread
From: Jan Beulich @ 2025-08-25 9:08 UTC (permalink / raw)
To: Andrew Cooper; +Cc: Xen-devel, Roger Pau Monné
On 21.08.2025 23:52, Andrew Cooper wrote:
> On 15/08/2025 9:37 am, Jan Beulich wrote:
>> On 14.08.2025 21:16, Andrew Cooper wrote:
>>> On 14/08/2025 2:30 pm, Jan Beulich wrote:
>>>> On 08.08.2025 22:23, Andrew Cooper wrote:
>>>>> ... disabled by default. There is a lot of work before FRED can be enabled by
>>>>> default.
>>>>>
>>>>> One part of FRED, the LKGS (Load Kernel GS) instruction, is enumerated
>>>>> separately but is mandatory as FRED disallows the SWAPGS instruction.
>>>>> Therefore, both CPUID bits must be checked.
>>>> See my (further) reply to patch 13 - I think FRED simply ought to depend on
>>>> LKGS.
>>>>
>>>>> @@ -20,6 +22,9 @@ unsigned int __ro_after_init ler_msr;
>>>>> static bool __initdata opt_ler;
>>>>> boolean_param("ler", opt_ler);
>>>>>
>>>>> +int8_t __ro_after_init opt_fred = 0; /* -1 when supported. */
>>>> I'm a little puzzled by the comment? DYM "once default-enabled"?
>>> Well, I have this temporary patch
>>> https://gitlab.com/xen-project/hardware/xen-staging/-/commit/70ef6a1178a411a29b7b1745a1112e267ffb6245
>>> that will turn into a real patch when we enable FRED by default.
>>>
>>> As much as anything else, it was just a TODO.
>>>
>>>
>>>> Then ...
>>>>
>>>>> @@ -305,6 +310,32 @@ void __init traps_init(void)
>>>>> /* Replace early pagefault with real pagefault handler. */
>>>>> _update_gate_addr_lower(&bsp_idt[X86_EXC_PF], entry_PF);
>>>>>
>>>>> + if ( !cpu_has_fred || !cpu_has_lkgs )
>>>>> + {
>>>>> + if ( opt_fred )
>>>> ... this won't work anymore once the initializer is changed.
>>> Hmm yes. That wants to be an == 1 check. Fixed.
>>>
>>>>> + printk(XENLOG_WARNING "FRED not available, ignoring\n");
>>>>> + opt_fred = false;
>>>> Better use 0 here?
>>>>
>>>>> + }
>>>>> +
>>>>> + if ( opt_fred == -1 )
>>>>> + opt_fred = !pv_shim;
>>>> Imo it would be better to have the initializer be -1 right away, and comment
>>>> out the "!pv_shim" here, until we mean it to be default-enabled.
>>> It cannot be -1, or Xen will fail spectacularly on any FRED capable
>>> hardware. Setting to -1 is the point at which FRED becomes security
>>> supported.
>> I guess I'm not following: If it was -1, and if the code here was
>>
>> if ( opt_fred < 0 )
>> opt_fred = 0 /* !pv_shim */;
>>
>> why would things "fail spectacularly" unless someone passed "fred" on
>> the command line?
>
> Oh, that would work, but why bother? It's simply a less readable form
> of mine, and if we're going to nitpick, it's commented out code.
Indeed, I was aware of Misra's dislike when writing the reply.
In any event - I'm okay with about any approach as long as the adjustment
to make (once FRED becomes supported) is both clear upfront and simple to
make (read: preferably a single line change). Readability is, as we know
from other recent instances, subjective. In the case here I think it
follows from my original comment that things weren't quite clear according
to my reading.
Jan
^ permalink raw reply [flat|nested] 120+ messages in thread
* Re: [PATCH 19/22] x86/boot: Use RSTORSSP to establish SSP
2025-08-21 22:09 ` Andrew Cooper
@ 2025-08-25 9:12 ` Jan Beulich
0 siblings, 0 replies; 120+ messages in thread
From: Jan Beulich @ 2025-08-25 9:12 UTC (permalink / raw)
To: Andrew Cooper; +Cc: Roger Pau Monné, Xen-devel
On 22.08.2025 00:09, Andrew Cooper wrote:
> On 15/08/2025 10:03 am, Jan Beulich wrote:
>> On 14.08.2025 22:09, Andrew Cooper wrote:
>>> On 14/08/2025 4:11 pm, Jan Beulich wrote:
>>>> On 08.08.2025 22:23, Andrew Cooper wrote:
>>>>> Under FRED, SETSSBSY is unavailable, and we want to be setting up FRED prior
>>>>> to setting up shadow stacks. As we still need Supervisor Tokens in IDT mode,
>>>>> we need mode-specific logic to establish SSP.
>>>>>
>>>>> In FRED mode, write a Restore Token, RSTORSSP it, and discard the resulting
>>>>> Previous-SSP token.
>>>>>
>>>>> No change outside of FRED mode.
>>>>>
>>>>> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
>>>> Why is it that in patch 17 you could use identical code, but here you can't?
>>> This caught me out at first too.
>>>
>>> For S3, we're going from "no shadow stack" to "back to where we were on
>>> an active shadow stack". All we need to do is get saved_ssp back into
>>> the SSP register.
>>>
>>> Here, we're going from "no shadow stack" to "on a good, empty, shadow
>>> stack". For FRED we only need to load a value into SSP, but in IDT mode
>>> we must also arrange to create a busy Supervisor Token on the base of
>>> the stack.
>>>
>>> We could in principle conditionally write a busy supervisor token, then
>>> unconditionally RSTORSSP, but that's even more complicated to follow IMO.
>> Why would the write need to be conditional?
>
> Because the tokens are different. One has the value &addr, and one has
> &addr + 9.
>
> The Supervisor Shadow Stack Token for IDT needs to survive for the
> lifetime of Xen, while the Restore Token for FRED is temporary and
> discarded by the logic added in this patch.
>
>> Can't we write what effectively
>> is already there? Or is it more a safety measure to avoid the write when
>> it's supposed to be unnecessary, to avoid papering over bugs?
>
> I genuinely don't understand what you're trying to suggest here.
I think I misunderstood your earlier reply, so the questions probably
indeed didn't make a whole lot of sense.
Jan
^ permalink raw reply [flat|nested] 120+ messages in thread
* Re: [PATCH 20/22] x86/traps: Alter switch_stack_and_jump() for FRED mode
2025-08-21 22:56 ` Andrew Cooper
@ 2025-08-25 9:19 ` Jan Beulich
0 siblings, 0 replies; 120+ messages in thread
From: Jan Beulich @ 2025-08-25 9:19 UTC (permalink / raw)
To: Andrew Cooper; +Cc: Roger Pau Monné, Xen-devel
On 22.08.2025 00:56, Andrew Cooper wrote:
> On 15/08/2025 10:10 am, Jan Beulich wrote:
>> On 14.08.2025 22:55, Andrew Cooper wrote:
>>> On 14/08/2025 4:35 pm, Jan Beulich wrote:
>>>> On 08.08.2025 22:23, Andrew Cooper wrote:
>>>>> @@ -154,7 +155,6 @@ unsigned long get_stack_dump_bottom (unsigned long sp);
>>>>> "rdsspd %[ssp];" \
>>>>> "cmp $1, %[ssp];" \
>>>>> "je .L_shstk_done.%=;" /* CET not active? Skip. */ \
>>>>> - "mov $%c[skstk_base], %[val];" \
>>>>> "and $%c[stack_mask], %[ssp];" \
>>>>> "sub %[ssp], %[val];" \
>>>>> "shr $3, %[val];" \
>>>> With the latter two insns here, ...
>>>>
>>>>> @@ -177,6 +177,8 @@ unsigned long get_stack_dump_bottom (unsigned long sp);
>>>>>
>>>>> #define switch_stack_and_jump(fn, instr, constr) \
>>>>> ({ \
>>>>> + unsigned int token_offset = \
>>>>> + (PRIMARY_SHSTK_SLOT + 1) * PAGE_SIZE - (opt_fred ? 0 : 8); \
>>>>> unsigned int tmp; \
>>>>> BUILD_BUG_ON(!ssaj_has_attr_noreturn(fn)); \
>>>>> __asm__ __volatile__ ( \
>>>>> @@ -184,12 +186,11 @@ unsigned long get_stack_dump_bottom (unsigned long sp);
>>>>> "mov %[stk], %%rsp;" \
>>>>> CHECK_FOR_LIVEPATCH_WORK \
>>>>> instr "[fun]" \
>>>>> - : [val] "=&r" (tmp), \
>>>>> + : [val] "=r" (tmp), \
>>>> ... I don't think you can legitimately drop the & from here? With it
>>>> retained:
>>>> Reviewed-by: Jan Beulich <jbeulich@suse.com>
>>> You chopped the bit which has an explicit input for "[val]", making the
>>> earlyclobber incorrect.
>> I was wondering whether there was a connection there, but ...
>>
>>> IIRC, one version of Clang complained.
>> ... that's not good. Without the early-clobber the asm() isn't quite
>> correct imo. If the same value appeared as another input, the compiler
>> may validly tie both together, assuming the register stays intact until
>> the very last insn (and hence even that last insn could still use the
>> register as an input). IOW if there's a Clang issue here, I think it
>> may need working around explicitly.
>
> Given that I need an alternative anyway, this becomes much easier, and
> shrinks to this single hunk:
>
> diff --git a/xen/arch/x86/include/asm/current.h b/xen/arch/x86/include/asm/current.h
> index c1eb27b1c4c2..35cc61fa88e7 100644
> --- a/xen/arch/x86/include/asm/current.h
> +++ b/xen/arch/x86/include/asm/current.h
> @@ -154,7 +154,9 @@ unsigned long get_stack_dump_bottom (unsigned long sp);
> "rdsspd %[ssp];" \
> "cmp $1, %[ssp];" \
> "je .L_shstk_done.%=;" /* CET not active? Skip. */ \
> - "mov $%c[skstk_base], %[val];" \
> + ALTERNATIVE("mov $%c[skstk_base], %[val];", \
> + "mov $%c[skstk_base] + 8, %[val];", \
> + X86_FEATURE_XEN_FRED) \
> "and $%c[stack_mask], %[ssp];" \
> "sub %[ssp], %[val];" \
> "shr $3, %[val];" \
Oh, okay. But then please again without unnecessary use of $%c constructs,
when just % will do.
Tangential: Now that I look at this again, what's the 1st 'k' standing
for in skstk_base? Was that maybe meant to be 'h'?
Jan
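
For readers following the arithmetic in the hunks quoted above, here is a minimal C sketch of what the asm appears to compute: the base value loaded into [val] (8 bytes higher under FRED), minus the masked SSP, shifted right by 3 to give a count of 8-byte shadow stack slots. The values chosen for PAGE_SIZE, STACK_SIZE and PRIMARY_SHSTK_SLOT are assumptions for illustration, and the helper names shstk_token_offset() and shstk_slots_to_pop() are hypothetical; none of this is taken from the Xen tree.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative values only; assumptions, not copied from Xen headers. */
#define PAGE_SIZE          4096u
#define STACK_SIZE         (8u * PAGE_SIZE)
#define PRIMARY_SHSTK_SLOT 5u

/*
 * Offset, within the per-CPU stack block, that the asm loads into [val].
 * Mirrors the token_offset expression from the original patch: the
 * non-FRED value sits 8 bytes below the end of the shadow stack page,
 * while the FRED value is the end of the page itself.
 */
static unsigned int shstk_token_offset(bool fred)
{
    return (PRIMARY_SHSTK_SLOT + 1) * PAGE_SIZE - (fred ? 0 : 8);
}

/*
 * Models the "and stack_mask / sub / shr $3" sequence: the number of
 * 8-byte shadow stack slots between the current SSP and the base value.
 */
static unsigned int shstk_slots_to_pop(uint64_t ssp, bool fred)
{
    unsigned int val = shstk_token_offset(fred);
    unsigned int off = (unsigned int)(ssp & (STACK_SIZE - 1));

    assert(off <= val); /* SSP must not be above the base value */

    return (val - off) >> 3;
}
```

With an SSP sitting 24 bytes below the non-FRED base, this model yields 3 slots without FRED and 4 with it, which matches the "+ 8" in the ALTERNATIVE variant.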
Thread overview: 120+ messages
2025-08-08 20:22 [PATCH 00/22] x86: FRED support, part 1 (stacks and exceptions) Andrew Cooper
2025-08-08 20:22 ` [PATCH 01/22] x86/msr: Rename MSR_INTERRUPT_SSP_TABLE to MSR_ISST Andrew Cooper
2025-08-12 8:06 ` Jan Beulich
2025-08-13 9:02 ` Andrew Cooper
2025-08-08 20:22 ` [PATCH 02/22] x86/msr: Rename wrmsr_ns() to wrmsrns(), and take 64bit value Andrew Cooper
2025-08-11 6:36 ` Andrew Cooper
2025-08-12 8:08 ` Jan Beulich
2025-08-08 20:22 ` [PATCH 03/22] x86/traps: Drop incorrect BUILD_BUG_ON() and comment in load_system_tables() Andrew Cooper
2025-08-12 8:11 ` Jan Beulich
2025-08-13 9:40 ` Andrew Cooper
2025-08-14 8:50 ` Jan Beulich
2025-08-08 20:22 ` [PATCH 04/22] x86/idt: Minor improvements to _update_gate_addr_lower() Andrew Cooper
2025-08-12 8:16 ` Jan Beulich
2025-08-13 9:48 ` Andrew Cooper
2025-08-08 20:22 ` [PATCH 05/22] x86/traps: Rename early_traps_init() to bsp_early_traps_init() Andrew Cooper
2025-08-12 8:17 ` Jan Beulich
2025-08-08 20:22 ` [PATCH 06/22] x86/traps: Introduce bsp_traps_reinit() Andrew Cooper
2025-08-12 8:19 ` Jan Beulich
2025-08-13 9:51 ` Andrew Cooper
2025-08-08 20:22 ` [PATCH 07/22] x86/spec-ctrl: Rework init_shadow_spec_ctrl_state() to take an info pointer Andrew Cooper
2025-08-12 8:27 ` Jan Beulich
2025-08-13 10:35 ` Andrew Cooper
2025-08-08 20:23 ` [PATCH 08/22] x86/traps: Introduce ap_early_traps_init() and set up exception handling earlier Andrew Cooper
2025-08-12 8:41 ` Jan Beulich
2025-08-13 11:13 ` Andrew Cooper
2025-08-14 8:53 ` Jan Beulich
2025-08-14 18:07 ` [PATCH v1.1 08/22] x86/traps: Introduce percpu_early_traps_init() " Andrew Cooper
2025-08-15 9:24 ` Jan Beulich
2025-08-08 20:23 ` [PATCH 09/22] x86/traps: Move load_system_tables() into traps-setup.c Andrew Cooper
2025-08-12 9:19 ` Jan Beulich
2025-08-13 11:25 ` Andrew Cooper
2025-08-14 8:55 ` Jan Beulich
2025-08-14 18:09 ` Andrew Cooper
2025-08-15 8:22 ` Jan Beulich
2025-08-15 8:28 ` Andrew Cooper
2025-08-15 8:32 ` Jan Beulich
2025-08-12 9:43 ` Nicola Vetrini
2025-08-13 11:36 ` Andrew Cooper
2025-08-14 7:26 ` Jan Beulich
2025-08-14 18:20 ` Andrew Cooper
2025-08-15 8:30 ` Jan Beulich
2025-08-15 8:40 ` Nicola Vetrini
2025-08-15 8:49 ` Jan Beulich
2025-08-08 20:23 ` [PATCH 10/22] x86/traps: Move subarch_percpu_traps_init() " Andrew Cooper
2025-08-11 8:17 ` Andrew Cooper
2025-08-12 9:52 ` Jan Beulich
2025-08-13 11:53 ` Andrew Cooper
2025-08-14 8:58 ` Jan Beulich
2025-08-14 10:17 ` Andrew Cooper
2025-08-14 10:52 ` Jan Beulich
2025-08-14 11:02 ` Andrew Cooper
2025-08-08 20:23 ` [PATCH 11/22] x86/traps: Fold x86_64/traps.c into traps.c Andrew Cooper
2025-08-12 9:53 ` Jan Beulich
2025-08-08 20:23 ` [PATCH 12/22] x86/traps: Unexport show_code() and show_stack_overflow() Andrew Cooper
2025-08-12 9:54 ` Jan Beulich
2025-08-08 20:23 ` [PATCH 13/22] x86: FRED enumerations Andrew Cooper
2025-08-13 12:28 ` Andrew Cooper
2025-08-14 7:30 ` Jan Beulich
2025-08-14 11:20 ` Jan Beulich
2025-08-14 11:42 ` Andrew Cooper
2025-08-14 11:44 ` Jan Beulich
2025-08-14 11:47 ` Andrew Cooper
2025-08-14 19:37 ` Nicola Vetrini
2025-08-14 19:44 ` Andrew Cooper
2025-08-14 21:27 ` Nicola Vetrini
2025-08-14 20:18 ` Nicola Vetrini
2025-08-14 13:19 ` Jan Beulich
2025-08-14 18:45 ` Andrew Cooper
2025-08-15 8:34 ` Jan Beulich
2025-08-21 21:23 ` Andrew Cooper
2025-08-18 9:02 ` Jan Beulich
2025-08-08 20:23 ` [PATCH 14/22] x86/traps: Extend struct cpu_user_regs/cpu_info with FRED fields Andrew Cooper
2025-08-14 13:12 ` Jan Beulich
2025-08-14 15:07 ` Andrew Cooper
2025-08-08 20:23 ` [PATCH 15/22] x86/traps: Introduce opt_fred Andrew Cooper
2025-08-14 13:30 ` Jan Beulich
2025-08-14 19:16 ` Andrew Cooper
2025-08-15 8:37 ` Jan Beulich
2025-08-21 21:52 ` Andrew Cooper
2025-08-25 9:08 ` Jan Beulich
2025-08-08 20:23 ` [PATCH 16/22] x86/boot: Adjust CR4 handling around ap_early_traps_init() Andrew Cooper
2025-08-14 14:47 ` Jan Beulich
2025-08-14 14:54 ` Andrew Cooper
2025-08-14 14:56 ` Jan Beulich
2025-08-14 19:22 ` Andrew Cooper
2025-08-08 20:23 ` [PATCH 17/22] x86/S3: Switch to using RSTORSSP to recover SSP on resume Andrew Cooper
2025-08-14 14:54 ` Jan Beulich
2025-08-08 20:23 ` [PATCH 18/22] x86/traps: Set MSR_PL0_SSP in load_system_tables() Andrew Cooper
2025-08-14 15:00 ` Jan Beulich
2025-08-14 19:37 ` Andrew Cooper
2025-08-15 8:52 ` Jan Beulich
2025-08-15 13:49 ` Andrew Cooper
2025-08-08 20:23 ` [PATCH 19/22] x86/boot: Use RSTORSSP to establish SSP Andrew Cooper
2025-08-14 15:11 ` Jan Beulich
2025-08-14 20:09 ` Andrew Cooper
2025-08-15 9:03 ` Jan Beulich
2025-08-21 22:09 ` Andrew Cooper
2025-08-25 9:12 ` Jan Beulich
2025-08-08 20:23 ` [PATCH 20/22] x86/traps: Alter switch_stack_and_jump() for FRED mode Andrew Cooper
2025-08-14 15:35 ` Jan Beulich
2025-08-14 20:55 ` Andrew Cooper
2025-08-15 9:10 ` Jan Beulich
2025-08-21 22:56 ` Andrew Cooper
2025-08-25 9:19 ` Jan Beulich
2025-08-08 20:23 ` [PATCH 21/22] x86/traps: Introduce FRED entrypoints Andrew Cooper
2025-08-11 11:38 ` Andrew Cooper
2025-08-14 15:57 ` Jan Beulich
2025-08-14 20:40 ` Andrew Cooper
2025-08-15 9:22 ` Jan Beulich
2025-08-18 8:59 ` Jan Beulich
2025-08-18 10:03 ` Jan Beulich
2025-08-18 10:09 ` Andrew Cooper
2025-08-08 20:23 ` [PATCH 22/22] x86/traps: Enable FRED when requested Andrew Cooper
2025-08-18 9:35 ` Jan Beulich
2025-08-18 9:47 ` Andrew Cooper
2025-08-18 9:53 ` Jan Beulich
2025-08-08 23:49 ` [PATCH 23/22] x86/vmx: Adjust NMI handling for FRED Andrew Cooper
2025-08-18 10:02 ` Jan Beulich
2025-08-18 17:18 ` Andrew Cooper
2025-08-19 6:31 ` Jan Beulich