xen-devel.lists.xenproject.org archive mirror
* [PATCH v2 00/23] x86: FRED support
@ 2025-08-28 15:03 Andrew Cooper
  2025-08-28 15:03 ` [PATCH v2 01/23] x86: FRED enumerations Andrew Cooper
                   ` (22 more replies)
  0 siblings, 23 replies; 56+ messages in thread
From: Andrew Cooper @ 2025-08-28 15:03 UTC (permalink / raw)
  To: Xen-devel; +Cc: Andrew Cooper, Jan Beulich, Roger Pau Monné

This is now everything needed to get to tech preview, with the caveat that it
still hasn't been tried on real hardware yet, so it is staying experimental for
now.

The first few patches have been seen before, but the latter half of the series
is new, adding support for running PV guests when Xen is using FRED.  Some
work here has influenced earlier patches.

See patches for details.

https://gitlab.com/xen-project/hardware/xen-staging/-/pipelines/2009286102

This series is still based on the MSR cleanup series posted previously.

Andrew Cooper (23):
  x86: FRED enumerations
  x86/traps: Extend struct cpu_user_regs/cpu_info with FRED fields
  x86/traps: Introduce opt_fred
  x86/boot: Adjust CR4 handling around percpu_early_traps_init()
  x86/S3: Switch to using RSTORSSP to recover SSP on resume
  x86/traps: Set MSR_PL0_SSP in load_system_tables()
  x86/boot: Use RSTORSSP to establish SSP
  x86/traps: Alter switch_stack_and_jump() for FRED mode
  x86/traps: Skip Supervisor Shadow Stack tokens in FRED mode
  x86/traps: Make an IDT-specific #DB helper
  x86/traps: Make an IDT-specific #PF helper
  x86/fsgsbase: Make gskern accesses safe under FRED
  x86/traps: Introduce FRED entrypoints
  x86/traps: Enable FRED when requested
  x86/pv: Deduplicate is_canonical_address() in do_set_segment_base()
  x86/entry: Alter how IRET faults are recognised
  x86/entry: Drop the pre exception table infrastructure
  x86/entry: Rework the comment about SYSCALL and DF
  x86/pv: Adjust GS handling for FRED mode
  x86/pv: Exception handling in FRED mode
  x86/pv: ERETU error handling
  x86/pv: System call handling in FRED mode
  x86/pv: Adjust eflags handling for FRED mode

 docs/misc/xen-command-line.pandoc           |  10 +
 xen/arch/x86/acpi/wakeup_prot.S             |  52 +-
 xen/arch/x86/boot/x86_64.S                  |  48 +-
 xen/arch/x86/domain.c                       |  26 +-
 xen/arch/x86/extable.c                      |  14 -
 xen/arch/x86/hvm/domain.c                   |   4 +-
 xen/arch/x86/include/asm/asm-defns.h        |   8 +
 xen/arch/x86/include/asm/asm_defns.h        |  76 ++-
 xen/arch/x86/include/asm/cpu-user-regs.h    |  71 ++-
 xen/arch/x86/include/asm/cpufeature.h       |   3 +
 xen/arch/x86/include/asm/cpufeatures.h      |   2 +-
 xen/arch/x86/include/asm/current.h          |   9 +-
 xen/arch/x86/include/asm/domain.h           |   5 +
 xen/arch/x86/include/asm/fsgsbase.h         |   8 +-
 xen/arch/x86/include/asm/hypercall.h        |   5 +
 xen/arch/x86/include/asm/msr-index.h        |  11 +
 xen/arch/x86/include/asm/traps.h            |   6 +
 xen/arch/x86/include/asm/uaccess.h          |   2 -
 xen/arch/x86/include/asm/x86-defns.h        |   8 +
 xen/arch/x86/mm.c                           |  12 +-
 xen/arch/x86/pv/dom0_build.c                |   2 +-
 xen/arch/x86/pv/domain.c                    |  22 +-
 xen/arch/x86/pv/iret.c                      |   8 +-
 xen/arch/x86/pv/misc-hypercalls.c           |  42 +-
 xen/arch/x86/pv/traps.c                     |  33 ++
 xen/arch/x86/setup.c                        |  35 +-
 xen/arch/x86/traps-setup.c                  | 127 ++++-
 xen/arch/x86/traps.c                        | 511 ++++++++++++++++++--
 xen/arch/x86/x86_64/Makefile                |   1 +
 xen/arch/x86/x86_64/compat/entry.S          |   3 +-
 xen/arch/x86/x86_64/entry-fred.S            |  57 +++
 xen/arch/x86/x86_64/entry.S                 |  46 +-
 xen/arch/x86/xen.lds.S                      |   5 -
 xen/include/public/arch-x86/cpufeatureset.h |   3 +
 34 files changed, 1106 insertions(+), 169 deletions(-)
 create mode 100644 xen/arch/x86/x86_64/entry-fred.S

-- 
2.39.5




* [PATCH v2 01/23] x86: FRED enumerations
  2025-08-28 15:03 [PATCH v2 00/23] x86: FRED support Andrew Cooper
@ 2025-08-28 15:03 ` Andrew Cooper
  2025-08-28 15:33   ` Andrew Cooper
  2025-08-28 15:03 ` [PATCH v2 02/23] x86/traps: Extend struct cpu_user_regs/cpu_info with FRED fields Andrew Cooper
                   ` (21 subsequent siblings)
  22 siblings, 1 reply; 56+ messages in thread
From: Andrew Cooper @ 2025-08-28 15:03 UTC (permalink / raw)
  To: Xen-devel; +Cc: Andrew Cooper, Jan Beulich, Jan Beulich, Roger Pau Monné

Of note, CR4.FRED is bit 32 and cannot be enabled outside of 64bit mode.

Most supported toolchains don't understand the FRED instructions yet.  ERETU
and ERETS are easy to wrap (they are encoded as REPZ/REPNE CLAC), while LKGS is
more complicated and deferred for now.

I have intentionally named the FRED MSRs differently to the spec.  In the
spec, the stack pointer names alias the TSS fields of the same name, despite
very different semantics.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
---
CC: Jan Beulich <JBeulich@suse.com>
CC: Roger Pau Monné <roger.pau@citrix.com>

v2:
 * Drop CONFIG_HAS_AS_FRED
---
 xen/arch/x86/include/asm/asm-defns.h        |  8 ++++++++
 xen/arch/x86/include/asm/cpufeature.h       |  3 +++
 xen/arch/x86/include/asm/cpufeatures.h      |  2 +-
 xen/arch/x86/include/asm/msr-index.h        | 11 +++++++++++
 xen/arch/x86/include/asm/x86-defns.h        |  1 +
 xen/include/public/arch-x86/cpufeatureset.h |  3 +++
 6 files changed, 27 insertions(+), 1 deletion(-)

diff --git a/xen/arch/x86/include/asm/asm-defns.h b/xen/arch/x86/include/asm/asm-defns.h
index 61a5faf90446..239dc3af096c 100644
--- a/xen/arch/x86/include/asm/asm-defns.h
+++ b/xen/arch/x86/include/asm/asm-defns.h
@@ -4,6 +4,14 @@
     .byte 0x0f, 0x01, 0xfc
 .endm
 
+/* binutils >= 2.41 or LLVM >= 19 */
+.macro eretu
+    .byte 0xf3, 0x0f, 0x01, 0xca
+.endm
+.macro erets
+    .byte 0xf2, 0x0f, 0x01, 0xca
+.endm
+
 /*
  * Call a noreturn function.  This could be JMP, but CALL results in a more
  * helpful backtrace.  BUG is to catch functions which do decide to return...
diff --git a/xen/arch/x86/include/asm/cpufeature.h b/xen/arch/x86/include/asm/cpufeature.h
index 441a7ecc494b..b6cf0c8dfc7c 100644
--- a/xen/arch/x86/include/asm/cpufeature.h
+++ b/xen/arch/x86/include/asm/cpufeature.h
@@ -246,6 +246,9 @@ static inline bool boot_cpu_has(unsigned int feat)
 #define cpu_has_avx_vnni        boot_cpu_has(X86_FEATURE_AVX_VNNI)
 #define cpu_has_avx512_bf16     boot_cpu_has(X86_FEATURE_AVX512_BF16)
 #define cpu_has_cmpccxadd       boot_cpu_has(X86_FEATURE_CMPCCXADD)
+#define cpu_has_fred            boot_cpu_has(X86_FEATURE_FRED)
+#define cpu_has_lkgs            boot_cpu_has(X86_FEATURE_LKGS)
+#define cpu_has_nmi_src         boot_cpu_has(X86_FEATURE_NMI_SRC)
 #define cpu_has_avx_ifma        boot_cpu_has(X86_FEATURE_AVX_IFMA)
 
 /* CPUID level 0x80000021.eax */
diff --git a/xen/arch/x86/include/asm/cpufeatures.h b/xen/arch/x86/include/asm/cpufeatures.h
index 71308d9dafc8..0a98676c1604 100644
--- a/xen/arch/x86/include/asm/cpufeatures.h
+++ b/xen/arch/x86/include/asm/cpufeatures.h
@@ -18,7 +18,7 @@ XEN_CPUFEATURE(ARCH_PERFMON,      X86_SYNTH( 3)) /* Intel Architectural PerfMon
 XEN_CPUFEATURE(TSC_RELIABLE,      X86_SYNTH( 4)) /* TSC is known to be reliable */
 XEN_CPUFEATURE(XTOPOLOGY,         X86_SYNTH( 5)) /* cpu topology enum extensions */
 XEN_CPUFEATURE(CPUID_FAULTING,    X86_SYNTH( 6)) /* cpuid faulting */
-/* Bit 7 unused */
+XEN_CPUFEATURE(XEN_FRED,          X86_SYNTH( 7)) /* Xen uses FRED */
 XEN_CPUFEATURE(APERFMPERF,        X86_SYNTH( 8)) /* APERFMPERF */
 XEN_CPUFEATURE(MFENCE_RDTSC,      X86_SYNTH( 9)) /* MFENCE synchronizes RDTSC */
 XEN_CPUFEATURE(XEN_SMEP,          X86_SYNTH(10)) /* SMEP gets used by Xen itself */
diff --git a/xen/arch/x86/include/asm/msr-index.h b/xen/arch/x86/include/asm/msr-index.h
index 428d993ee89b..bb48d16f0c6d 100644
--- a/xen/arch/x86/include/asm/msr-index.h
+++ b/xen/arch/x86/include/asm/msr-index.h
@@ -115,6 +115,17 @@
 #define  MCU_OPT_CTRL_GDS_MIT_DIS           (_AC(1, ULL) <<  4)
 #define  MCU_OPT_CTRL_GDS_MIT_LOCK          (_AC(1, ULL) <<  5)
 
+#define MSR_FRED_RSP_SL0                    0x000001cc
+#define MSR_FRED_RSP_SL1                    0x000001cd
+#define MSR_FRED_RSP_SL2                    0x000001ce
+#define MSR_FRED_RSP_SL3                    0x000001cf
+#define MSR_FRED_STK_LVLS                   0x000001d0
+#define MSR_FRED_SSP_SL0                    MSR_PL0_SSP
+#define MSR_FRED_SSP_SL1                    0x000001d1
+#define MSR_FRED_SSP_SL2                    0x000001d2
+#define MSR_FRED_SSP_SL3                    0x000001d3
+#define MSR_FRED_CONFIG                     0x000001d4
+
 #define MSR_RTIT_OUTPUT_BASE                0x00000560
 #define MSR_RTIT_OUTPUT_MASK                0x00000561
 #define MSR_RTIT_CTL                        0x00000570
diff --git a/xen/arch/x86/include/asm/x86-defns.h b/xen/arch/x86/include/asm/x86-defns.h
index 23579c471f4a..0a0ba83de786 100644
--- a/xen/arch/x86/include/asm/x86-defns.h
+++ b/xen/arch/x86/include/asm/x86-defns.h
@@ -75,6 +75,7 @@
 #define X86_CR4_PKE        0x00400000 /* enable PKE */
 #define X86_CR4_CET        0x00800000 /* Control-flow Enforcement Technology */
 #define X86_CR4_PKS        0x01000000 /* Protection Key Supervisor */
+#define X86_CR4_FRED       (_AC(1, ULL) << 32) /* Fast Return and Event Delivery */
 
 #define X86_CR8_VALID_MASK 0xf
 
diff --git a/xen/include/public/arch-x86/cpufeatureset.h b/xen/include/public/arch-x86/cpufeatureset.h
index 990b1d13f301..af69cf3822eb 100644
--- a/xen/include/public/arch-x86/cpufeatureset.h
+++ b/xen/include/public/arch-x86/cpufeatureset.h
@@ -310,7 +310,10 @@ XEN_CPUFEATURE(ARCH_PERF_MON, 10*32+8) /*   Architectural Perfmon */
 XEN_CPUFEATURE(FZRM,         10*32+10) /*A  Fast Zero-length REP MOVSB */
 XEN_CPUFEATURE(FSRS,         10*32+11) /*A  Fast Short REP STOSB */
 XEN_CPUFEATURE(FSRCS,        10*32+12) /*A  Fast Short REP CMPSB/SCASB */
+XEN_CPUFEATURE(FRED,         10*32+17) /*   Fast Return and Event Delivery */
+XEN_CPUFEATURE(LKGS,         10*32+18) /*   Load Kernel GS instruction */
 XEN_CPUFEATURE(WRMSRNS,      10*32+19) /*S  WRMSR Non-Serialising */
+XEN_CPUFEATURE(NMI_SRC,      10*32+20) /*   NMI-Source Reporting */
 XEN_CPUFEATURE(AMX_FP16,     10*32+21) /*   AMX FP16 instruction */
 XEN_CPUFEATURE(AVX_IFMA,     10*32+23) /*A  AVX-IFMA Instructions */
 XEN_CPUFEATURE(LAM,          10*32+26) /*   Linear Address Masking */
-- 
2.39.5




* [PATCH v2 02/23] x86/traps: Extend struct cpu_user_regs/cpu_info with FRED fields
  2025-08-28 15:03 [PATCH v2 00/23] x86: FRED support Andrew Cooper
  2025-08-28 15:03 ` [PATCH v2 01/23] x86: FRED enumerations Andrew Cooper
@ 2025-08-28 15:03 ` Andrew Cooper
  2025-08-28 15:18   ` Jan Beulich
  2025-08-28 15:03 ` [PATCH v2 03/23] x86/traps: Introduce opt_fred Andrew Cooper
                   ` (20 subsequent siblings)
  22 siblings, 1 reply; 56+ messages in thread
From: Andrew Cooper @ 2025-08-28 15:03 UTC (permalink / raw)
  To: Xen-devel; +Cc: Andrew Cooper, Jan Beulich, Roger Pau Monné

The FRED on-stack format is larger than the IDT format, but is by and large
compatible.  FRED reuses space above cs and ss for extra metadata, some of
which is purely informational, and some of which causes additional effects in
ERET{U,S}.

Follow Linux's choice of naming for fred_{c,s}s structures, to make it very
clear at the point of use that it's dependent on FRED.

There is also the event data field and reserved fields, but we cannot include
these in struct cpu_user_regs without reintroducing OoB structure accesses in
the non-FRED case.  See commit 6065a05adf15 ("x86/traps: 'Fix' safety of
read_registers() in #DF path"). for more details.

Instead, use a new struct fred_info and position it suitably in struct
cpu_info.  This boundary will be loaded into MSR_FRED_RSP_SL0, and must be
64-byte aligned.

This does add 16 bytes back into struct cpu_info, undoing the saving we made
by dropping the vm86 data segment selectors.

No functional change.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
CC: Jan Beulich <JBeulich@suse.com>
CC: Roger Pau Monné <roger.pau@citrix.com>

v2:
 * .lm -> .l
 * Tweak comments
---
 xen/arch/x86/include/asm/cpu-user-regs.h | 71 ++++++++++++++++++++++--
 xen/arch/x86/include/asm/current.h       |  2 +
 xen/arch/x86/traps-setup.c               |  5 ++
 3 files changed, 74 insertions(+), 4 deletions(-)

diff --git a/xen/arch/x86/include/asm/cpu-user-regs.h b/xen/arch/x86/include/asm/cpu-user-regs.h
index 5b283a2f6d02..92aeca0aaa88 100644
--- a/xen/arch/x86/include/asm/cpu-user-regs.h
+++ b/xen/arch/x86/include/asm/cpu-user-regs.h
@@ -30,6 +30,10 @@ struct cpu_user_regs
     /*
      * During IDT delivery for exceptions with an error code, hardware pushes
      * to this point.  Entry_vector is filled in by software.
+     *
+     * During FRED delivery, hardware always pushes to this point.  Software
+     * copies fred_ss.vector into entry_vector so most interrupt/exception
+     * handling can be FRED-agnostic.
      */
 
     uint32_t error_code;
@@ -42,18 +46,77 @@ struct cpu_user_regs
      */
 
     union { uint64_t rip;    uint32_t eip;    uint16_t ip; };
-    uint16_t cs, _pad0[1];
-    uint8_t  saved_upcall_mask; /* PV (v)rflags.IF == !saved_upcall_mask */
-    uint8_t  _pad1[3];
+    union {
+        struct {
+            uint16_t      cs;
+            unsigned long :16;
+            uint8_t       saved_upcall_mask; /* PV (v)rflags.IF == !saved_upcall_mask */
+        };
+        unsigned long     csx;
+        struct {
+            /*
+             * Bits 0 to 31 control ERET{U,S} behaviour, and are state of the
+             * interrupted context.
+             */
+            uint16_t      cs;
+            unsigned int  sl:2;      /* Stack Level */
+            bool          wfe:1;     /* Wait-for-ENDBRANCH state */
+        } fred_cs;
+    };
     union { uint64_t rflags; uint32_t eflags; uint16_t flags; };
     union { uint64_t rsp;    uint32_t esp;    uint16_t sp;    uint8_t spl; };
-    uint16_t ss, _pad2[3];
+    union {
+        uint16_t          ss;
+        unsigned long     ssx;
+        struct {
+            /*
+             * Bits 0 to 31 control ERET{U,S} behaviour, and are state about
+             * the event which occurred.
+             */
+            uint16_t      ss;
+            bool          sti:1;     /* Was blocked-by-STI, and not cancelled */
+            bool          swint:1;   /* Was a SYSCALL/SYSENTER/INT $N.  On ERETx, pend_DB iff TF */
+            bool          nmi:1;     /* Was an NMI. */
+            unsigned long :13;
+
+            /*
+             * Bits 32 to 63 are ignored by ERET{U,S} and are informative
+             * only.
+             */
+            uint8_t       vector;
+            unsigned long :8;
+            unsigned int  type:4;    /* X86_ET_* */
+            unsigned long :4;
+            bool          enclave:1; /* Event taken in SGX mode */
+            bool          l:1;       /* Event taken in 64bit mode (old %cs.l) */
+            bool          nested:1;  /* Exception during event delivery (clear for #DF) */
+            unsigned long :1;
+            unsigned int  insnlen:4; /* .type >= SW_INT */
+        } fred_ss;
+    };
 
     /*
      * For IDT delivery, tss->rsp0 points to this boundary as embedded within
      * struct cpu_info.  It must be 16-byte aligned.
      */
 };
+struct fred_info
+{
+    /*
+     * Event Data.  For:
+     *   #DB: PENDING_DBG (%dr6 with positive polarity)
+     *   NMI: NMI-Source Bitmap (on capable hardware)
+     *   #PF: %cr2
+     *   #NM: MSR_XFD_ERR (only XFD-induced #NMs)
+     */
+    uint64_t edata;
+    uint64_t _rsvd;
+
+    /*
+     * For FRED delivery, MSR_FRED_RSP_SL0 points to this boundary as embedded
+     * within struct cpu_info.  It must be 64-byte aligned.
+     */
+};
 
 static inline uint64_t msr_fold(const struct cpu_user_regs *regs)
 {
diff --git a/xen/arch/x86/include/asm/current.h b/xen/arch/x86/include/asm/current.h
index fd30422707d9..c1eb27b1c4c2 100644
--- a/xen/arch/x86/include/asm/current.h
+++ b/xen/arch/x86/include/asm/current.h
@@ -38,6 +38,8 @@ struct vcpu;
 
 struct cpu_info {
     struct cpu_user_regs guest_cpu_user_regs;
+    struct fred_info _fred; /* Only used when FRED is active. */
+
     unsigned int processor_id;
     unsigned int verw_sel;
     struct vcpu *current_vcpu;
diff --git a/xen/arch/x86/traps-setup.c b/xen/arch/x86/traps-setup.c
index 25581acf1158..c89280270fbb 100644
--- a/xen/arch/x86/traps-setup.c
+++ b/xen/arch/x86/traps-setup.c
@@ -354,7 +354,12 @@ static void __init __maybe_unused build_assertions(void)
      *
      * tss->rsp0, pointing at the end of cpu_info.guest_cpu_user_regs, must be
      * 16-byte aligned.
+     *
+     * MSR_FRED_RSP_SL0, pointing to the end of cpu_info._fred, must be 64-byte
+     * aligned.
      */
     BUILD_BUG_ON((sizeof(struct cpu_info) -
                   endof_field(struct cpu_info, guest_cpu_user_regs)) & 15);
+    BUILD_BUG_ON((sizeof(struct cpu_info) -
+                  endof_field(struct cpu_info, _fred)) & 63);
 }
-- 
2.39.5




* [PATCH v2 03/23] x86/traps: Introduce opt_fred
  2025-08-28 15:03 [PATCH v2 00/23] x86: FRED support Andrew Cooper
  2025-08-28 15:03 ` [PATCH v2 01/23] x86: FRED enumerations Andrew Cooper
  2025-08-28 15:03 ` [PATCH v2 02/23] x86/traps: Extend struct cpu_user_regs/cpu_info with FRED fields Andrew Cooper
@ 2025-08-28 15:03 ` Andrew Cooper
  2025-08-28 15:20   ` Jan Beulich
  2025-09-01 13:15   ` Jan Beulich
  2025-08-28 15:03 ` [PATCH v2 04/23] x86/boot: Adjust CR4 handling around percpu_early_traps_init() Andrew Cooper
                   ` (19 subsequent siblings)
  22 siblings, 2 replies; 56+ messages in thread
From: Andrew Cooper @ 2025-08-28 15:03 UTC (permalink / raw)
  To: Xen-devel; +Cc: Andrew Cooper, Jan Beulich, Roger Pau Monné

... disabled by default.  There is a lot of work before FRED can be enabled by
default.

One part of FRED, the LKGS (Load Kernel GS) instruction, is enumerated
separately but is mandatory as FRED disallows the SWAPGS instruction.
Normally, we'd have to check both CPUID bits, but Xen does not use GS like
most other software, and can manage without the LKGS instruction.

FRED formally removes the use of Ring1 and Ring2, meaning we cannot run 32bit
PV guests.  Therefore, don't enable FRED by default in shim mode.  OTOH, if
FRED is active, then PV32 needs disabling like with CET.

No functional change.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
CC: Jan Beulich <JBeulich@suse.com>
CC: Roger Pau Monné <roger.pau@citrix.com>

v2:
 * Fix check for warning.
 * Drop check for LKGS.
---
 docs/misc/xen-command-line.pandoc | 10 +++++++++
 xen/arch/x86/include/asm/traps.h  |  4 ++++
 xen/arch/x86/traps-setup.c        | 36 +++++++++++++++++++++++++++++++
 3 files changed, 50 insertions(+)

diff --git a/docs/misc/xen-command-line.pandoc b/docs/misc/xen-command-line.pandoc
index a75b6c930195..25cebdc1110f 100644
--- a/docs/misc/xen-command-line.pandoc
+++ b/docs/misc/xen-command-line.pandoc
@@ -1284,6 +1284,16 @@ requirement can be relaxed.  This option is particularly useful for nested
 virtualization, to allow the L1 hypervisor to use EPT even if the L0 hypervisor
 does not provide `VM_ENTRY_LOAD_GUEST_PAT`.
 
+### fred (x86)
+> `= <bool>`
+
+> Default: `false`
+
+Flexible Return and Event Delivery is an overhaul of interrupt, exception and
+system call handling, fixing many corner cases in the x86 architecture, and
+expected in hardware from 2025.  Support in Xen is a work in progress and
+disabled by default.
+
 ### gnttab
 > `= List of [ max-ver:<integer>, transitive=<bool>, transfer=<bool> ]`
 
diff --git a/xen/arch/x86/include/asm/traps.h b/xen/arch/x86/include/asm/traps.h
index 6ae451d3fc70..73097e957d05 100644
--- a/xen/arch/x86/include/asm/traps.h
+++ b/xen/arch/x86/include/asm/traps.h
@@ -7,6 +7,10 @@
 #ifndef ASM_TRAP_H
 #define ASM_TRAP_H
 
+#include <xen/types.h>
+
+extern int8_t opt_fred;
+
 void bsp_early_traps_init(void);
 void traps_init(void);
 void bsp_traps_reinit(void);
diff --git a/xen/arch/x86/traps-setup.c b/xen/arch/x86/traps-setup.c
index c89280270fbb..6e2af58ba0a5 100644
--- a/xen/arch/x86/traps-setup.c
+++ b/xen/arch/x86/traps-setup.c
@@ -9,6 +9,8 @@
 #include <asm/endbr.h>
 #include <asm/idt.h>
 #include <asm/msr.h>
+#include <asm/pv/domain.h>
+#include <asm/pv/shim.h>
 #include <asm/shstk.h>
 #include <asm/stubs.h>
 #include <asm/traps.h>
@@ -20,6 +22,9 @@ unsigned int __ro_after_init ler_msr;
 static bool __initdata opt_ler;
 boolean_param("ler", opt_ler);
 
+int8_t __ro_after_init opt_fred = 0;
+boolean_param("fred", opt_fred);
+
 void nocall entry_PF(void);
 void nocall lstar_enter(void);
 void nocall cstar_enter(void);
@@ -299,6 +304,37 @@ void __init traps_init(void)
     /* Replace early pagefault with real pagefault handler. */
     _update_gate_addr_lower(&bsp_idt[X86_EXC_PF], entry_PF);
 
+    /*
+     * Xen doesn't use GS like most software does, and doesn't need the LKGS
+     * instruction in order to manage PV guests.  No need to check for it.
+     */
+    if ( !cpu_has_fred )
+    {
+        if ( opt_fred == 1 )
+            printk(XENLOG_WARNING "FRED not available, ignoring\n");
+        opt_fred = 0;
+    }
+
+    if ( opt_fred == -1 )
+        opt_fred = !pv_shim;
+
+    if ( opt_fred )
+    {
+#ifdef CONFIG_PV32
+        if ( opt_pv32 )
+        {
+            opt_pv32 = 0;
+            printk(XENLOG_INFO "Disabling PV32 due to FRED\n");
+        }
+#endif
+        setup_force_cpu_cap(X86_FEATURE_XEN_FRED);
+        printk("Using FRED event delivery\n");
+    }
+    else
+    {
+        printk("Using IDT event delivery\n");
+    }
+
     load_system_tables();
 
     init_ler();
-- 
2.39.5




* [PATCH v2 04/23] x86/boot: Adjust CR4 handling around percpu_early_traps_init()
  2025-08-28 15:03 [PATCH v2 00/23] x86: FRED support Andrew Cooper
                   ` (2 preceding siblings ...)
  2025-08-28 15:03 ` [PATCH v2 03/23] x86/traps: Introduce opt_fred Andrew Cooper
@ 2025-08-28 15:03 ` Andrew Cooper
  2025-08-28 15:03 ` [PATCH v2 05/23] x86/S3: Switch to using RSTORSSP to recover SSP on resume Andrew Cooper
                   ` (18 subsequent siblings)
  22 siblings, 0 replies; 56+ messages in thread
From: Andrew Cooper @ 2025-08-28 15:03 UTC (permalink / raw)
  To: Xen-devel; +Cc: Andrew Cooper, Jan Beulich, Jan Beulich, Roger Pau Monné

percpu_early_traps_init() will shortly be setting CR4.FRED.  This requires that
cpu_info->cr4 is already set up, and that the enablement of CET doesn't
truncate X86_CR4_FRED back out because of 32bit logic.

For __high_start(), defer re-loading XEN_MINIMAL_CR4 until after %rsp is set
up and we can store the result in the cr4 field too.

For s3_resume(), explicitly re-load XEN_MINIMAL_CR4.  Later when loading all
features, use the mmu_cr4_features variable which is how the rest of Xen
performs this operation.

No functional change, yet.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
---
CC: Jan Beulich <JBeulich@suse.com>
CC: Roger Pau Monné <roger.pau@citrix.com>

v2:
 * Extend comments
---
 xen/arch/x86/acpi/wakeup_prot.S | 18 ++++++++++++++----
 xen/arch/x86/boot/x86_64.S      | 15 ++++++++++-----
 2 files changed, 24 insertions(+), 9 deletions(-)

diff --git a/xen/arch/x86/acpi/wakeup_prot.S b/xen/arch/x86/acpi/wakeup_prot.S
index cc40fddc38d4..0f02ea7b4b9a 100644
--- a/xen/arch/x86/acpi/wakeup_prot.S
+++ b/xen/arch/x86/acpi/wakeup_prot.S
@@ -63,6 +63,14 @@ LABEL(s3_resume)
         pushq   %rax
         lretq
 1:
+
+        GET_STACK_END(15)
+
+        /* Enable minimal CR4 features, sync cached state. */
+        mov     $XEN_MINIMAL_CR4, %eax
+        mov     %rax, STACK_CPUINFO_FIELD(cr4)(%r15)
+        mov     %rax, %cr4
+
         /* Set up early exceptions and CET before entering C properly. */
         call    percpu_early_traps_init
 
@@ -77,7 +85,9 @@ LABEL(s3_resume)
         wrmsr
 
         /* Enable CR4.CET. */
-        mov     $XEN_MINIMAL_CR4 | X86_CR4_CET, %ecx
+        mov     $X86_CR4_CET, %ecx
+        or      STACK_CPUINFO_FIELD(cr4)(%r15), %rcx
+        mov     %rcx, STACK_CPUINFO_FIELD(cr4)(%r15)
         mov     %rcx, %cr4
 
         /* WARNING! call/ret now fatal (iff SHSTK) until SETSSBSY loads SSP */
@@ -120,9 +130,9 @@ LABEL(s3_resume)
 .L_cet_done:
 #endif /* CONFIG_XEN_SHSTK || CONFIG_XEN_IBT */
 
-        /* Restore CR4 from the cpuinfo block. */
-        GET_STACK_END(bx)
-        mov     STACK_CPUINFO_FIELD(cr4)(%rbx), %rax
+        /* Load all CR4 settings. */
+        mov     mmu_cr4_features(%rip), %rax
+        mov     %rax, STACK_CPUINFO_FIELD(cr4)(%r15)
         mov     %rax, %cr4
 
         call    mtrr_bp_restore
diff --git a/xen/arch/x86/boot/x86_64.S b/xen/arch/x86/boot/x86_64.S
index d0e7449a149f..3a5ad2764448 100644
--- a/xen/arch/x86/boot/x86_64.S
+++ b/xen/arch/x86/boot/x86_64.S
@@ -11,16 +11,19 @@ ENTRY(__high_start)
         mov     %ecx,%gs
         mov     %ecx,%ss
 
-        /* Enable minimal CR4 features. */
-        mov     $XEN_MINIMAL_CR4,%rcx
-        mov     %rcx,%cr4
-
         mov     stack_start(%rip),%rsp
 
         /* Reset EFLAGS (subsumes CLI and CLD). */
         pushq   $0
         popf
 
+        GET_STACK_END(15)
+
+        /* Enable minimal CR4 features, sync cached state. */
+        mov     $XEN_MINIMAL_CR4, %eax
+        mov     %rax, STACK_CPUINFO_FIELD(cr4)(%r15)
+        mov     %rax, %cr4
+
         /* Reload code selector. */
         pushq   $__HYPERVISOR_CS
         leaq    1f(%rip),%rax
@@ -45,7 +48,9 @@ ENTRY(__high_start)
         wrmsr
 
         /* Enable CR4.CET. */
-        mov     $XEN_MINIMAL_CR4 | X86_CR4_CET, %ecx
+        mov     $X86_CR4_CET, %ecx
+        or      STACK_CPUINFO_FIELD(cr4)(%r15), %rcx
+        mov     %rcx, STACK_CPUINFO_FIELD(cr4)(%r15)
         mov     %rcx, %cr4
 
         /* WARNING! call/ret now fatal (iff SHSTK) until SETSSBSY loads SSP */
-- 
2.39.5




* [PATCH v2 05/23] x86/S3: Switch to using RSTORSSP to recover SSP on resume
  2025-08-28 15:03 [PATCH v2 00/23] x86: FRED support Andrew Cooper
                   ` (3 preceding siblings ...)
  2025-08-28 15:03 ` [PATCH v2 04/23] x86/boot: Adjust CR4 handling around percpu_early_traps_init() Andrew Cooper
@ 2025-08-28 15:03 ` Andrew Cooper
  2025-08-28 15:03 ` [PATCH v2 06/23] x86/traps: Set MSR_PL0_SSP in load_system_tables() Andrew Cooper
                   ` (17 subsequent siblings)
  22 siblings, 0 replies; 56+ messages in thread
From: Andrew Cooper @ 2025-08-28 15:03 UTC (permalink / raw)
  To: Xen-devel; +Cc: Andrew Cooper, Jan Beulich, Jan Beulich, Roger Pau Monné

Under FRED, SETSSBSY is disallowed, and we want to be setting up FRED prior to
setting up shadow stacks.  Luckily, RSTORSSP will also work in this case.

This involves a new type of shadow stack token, the Restore Token, which is
distinguished from the Supervisor Token by pointing to the adjacent slot on
the shadow stack rather than pointing at itself.
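
As an illustration (a minimal sketch, not part of this patch; "base" is a
hypothetical address of the token slot):

    /*
     * Sketch only.  For a token stored at address 'base' on the shstk:
     *  - a Supervisor Token points at itself (bit 0 is the busy flag),
     *  - a Restore Token points at the adjacent (higher) slot, with bit 0
     *    set to mean a 64bit context.
     */
    static unsigned long supervisor_token(unsigned long base)
    {
        return base;
    }

    static unsigned long restore_token(unsigned long base)
    {
        return (base + 8) | 1;
    }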

In the short term, this logic still needs to load MSR_PL0_SSP.

No functional change.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
---
CC: Jan Beulich <JBeulich@suse.com>
CC: Roger Pau Monné <roger.pau@citrix.com>
---
 xen/arch/x86/acpi/wakeup_prot.S | 31 +++++++++++++++----------------
 1 file changed, 15 insertions(+), 16 deletions(-)

diff --git a/xen/arch/x86/acpi/wakeup_prot.S b/xen/arch/x86/acpi/wakeup_prot.S
index 0f02ea7b4b9a..fceb4ca353f7 100644
--- a/xen/arch/x86/acpi/wakeup_prot.S
+++ b/xen/arch/x86/acpi/wakeup_prot.S
@@ -90,7 +90,7 @@ LABEL(s3_resume)
         mov     %rcx, STACK_CPUINFO_FIELD(cr4)(%r15)
         mov     %rcx, %cr4
 
-        /* WARNING! call/ret now fatal (iff SHSTK) until SETSSBSY loads SSP */
+        /* WARNING! CALL/RET now fatal (iff SHSTK) until RSTORSSP loads SSP */
 
 #if defined(CONFIG_XEN_SHSTK)
         test    $CET_SHSTK_EN, %al
@@ -98,32 +98,31 @@ LABEL(s3_resume)
 
         /*
          * Restoring SSP is a little complicated, because we are intercepting
-         * an in-use shadow stack.  Write a temporary token under the stack,
-         * so SETSSBSY will successfully load a value useful for us, then
-         * reset MSR_PL0_SSP to its usual value and pop the temporary token.
+         * an in-use shadow stack.  Write a Restore Token under the stack, and
+         * use RSTORSSP to load it.  RSTORSSP converts the token to a
+         * Previous-SSP Token, which we discard.
          */
         mov     saved_ssp(%rip), %rdi
 
-        /* Construct the temporary supervisor token under SSP. */
-        sub     $8, %rdi
-
-        /* Load it into MSR_PL0_SSP. */
+        /* Calculate MSR_PL0_SSP from SSP. */
         mov     $MSR_PL0_SSP, %ecx
         mov     %rdi, %rdx
         shr     $32, %rdx
         mov     %edi, %eax
-        wrmsr
-
-        /* Write the temporary token onto the shadow stack, and activate it. */
-        wrssq   %rdi, (%rdi)
-        setssbsy
-
-        /* Reset MSR_PL0_SSP back to its normal value. */
         and     $~(STACK_SIZE - 1), %eax
         or      $(PRIMARY_SHSTK_SLOT + 1) * PAGE_SIZE - 8, %eax
         wrmsr
 
-        /* Pop the temporary token off the stack. */
+        /*
+         * A Restore Token's value is &token + 8 + 64BIT (bit 0).
+         * We want to put this on the shstk at SSP - 8.
+         */
+        lea     1(%rdi), %rax
+        sub     $8, %rdi
+        wrssq   %rax, (%rdi)
+        rstorssp (%rdi)
+
+        /* Discard the Previous-SSP Token from the shstk. */
         mov     $2, %eax
         incsspd %eax
 #endif /* CONFIG_XEN_SHSTK */
-- 
2.39.5




* [PATCH v2 06/23] x86/traps: Set MSR_PL0_SSP in load_system_tables()
  2025-08-28 15:03 [PATCH v2 00/23] x86: FRED support Andrew Cooper
                   ` (4 preceding siblings ...)
  2025-08-28 15:03 ` [PATCH v2 05/23] x86/S3: Switch to using RSTORSSP to recover SSP on resume Andrew Cooper
@ 2025-08-28 15:03 ` Andrew Cooper
  2025-09-01  9:23   ` Jan Beulich
  2025-08-28 15:03 ` [PATCH v2 07/23] x86/boot: Use RSTORSSP to establish SSP Andrew Cooper
                   ` (16 subsequent siblings)
  22 siblings, 1 reply; 56+ messages in thread
From: Andrew Cooper @ 2025-08-28 15:03 UTC (permalink / raw)
  To: Xen-devel; +Cc: Andrew Cooper, Jan Beulich, Roger Pau Monné

FRED and IDT differ by a Supervisor Token on the base of the shstk.  This
means that the value they load into MSR_PL0_SSP differs by 8.
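
For illustration, a sketch of that arithmetic (constants as in the hunk below;
the helper itself is hypothetical, not code being added, and the direction is
inferred from the later switch_stack_and_jump() change):

    /*
     * Sketch only.  IDT mode keeps a Supervisor Token in the final 8 bytes of
     * the primary shstk, and MSR_PL0_SSP points at that token.  FRED mode has
     * no token, so the value loaded is 8 higher.
     */
    static unsigned long pl0_ssp(unsigned long stack_top, bool fred)
    {
        unsigned long top = stack_top + (PRIMARY_SHSTK_SLOT + 1) * PAGE_SIZE;

        return fred ? top : top - 8;
    }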

s3_resume() in particular has logic which is otherwise invariant of FRED mode,
and must not clobber a FRED MSR_PL0_SSP with an IDT one.

This also simplifies the AP path.  Updating reinit_bsp_stack() is deferred
until later.

No functional change.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
CC: Jan Beulich <JBeulich@suse.com>
CC: Roger Pau Monné <roger.pau@citrix.com>

v2:
 * Extend comment about clearing the busy bit.
 * Move reinit_bsp_stack() hunk into this patch.
---
 xen/arch/x86/acpi/wakeup_prot.S |  9 ---------
 xen/arch/x86/boot/x86_64.S      | 12 +++---------
 xen/arch/x86/setup.c            |  2 --
 xen/arch/x86/traps-setup.c      |  2 ++
 4 files changed, 5 insertions(+), 20 deletions(-)

diff --git a/xen/arch/x86/acpi/wakeup_prot.S b/xen/arch/x86/acpi/wakeup_prot.S
index fceb4ca353f7..ba0bd77806b8 100644
--- a/xen/arch/x86/acpi/wakeup_prot.S
+++ b/xen/arch/x86/acpi/wakeup_prot.S
@@ -104,15 +104,6 @@ LABEL(s3_resume)
          */
         mov     saved_ssp(%rip), %rdi
 
-        /* Calculate MSR_PL0_SSP from SSP. */
-        mov     $MSR_PL0_SSP, %ecx
-        mov     %rdi, %rdx
-        shr     $32, %rdx
-        mov     %edi, %eax
-        and     $~(STACK_SIZE - 1), %eax
-        or      $(PRIMARY_SHSTK_SLOT + 1) * PAGE_SIZE - 8, %eax
-        wrmsr
-
         /*
          * A Restore Token's value is &token + 8 + 64BIT (bit 0).
          * We want to put this on the shstk at SSP - 8.
diff --git a/xen/arch/x86/boot/x86_64.S b/xen/arch/x86/boot/x86_64.S
index 3a5ad2764448..11a7e9d3bd23 100644
--- a/xen/arch/x86/boot/x86_64.S
+++ b/xen/arch/x86/boot/x86_64.S
@@ -65,17 +65,11 @@ ENTRY(__high_start)
         or      $(PRIMARY_SHSTK_SLOT + 1) * PAGE_SIZE - 8, %rdx
 
         /*
-         * Write a new supervisor token.  Doesn't matter on boot, but for S3
-         * resume this clears the busy bit.
+         * Write a new Supervisor Token.  It doesn't matter the first time a
+         * CPU boots, but for S3 resume or hotplug this clears the busy bit so
+         * SETSSBSY can set it again.
          */
         wrssq   %rdx, (%rdx)
-
-        /* Point MSR_PL0_SSP at the token. */
-        mov     $MSR_PL0_SSP, %ecx
-        mov     %edx, %eax
-        shr     $32, %rdx
-        wrmsr
-
         setssbsy
 
 #endif /* CONFIG_XEN_SHSTK */
diff --git a/xen/arch/x86/setup.c b/xen/arch/x86/setup.c
index 6c81841426a4..a810bdf6d352 100644
--- a/xen/arch/x86/setup.c
+++ b/xen/arch/x86/setup.c
@@ -907,8 +907,6 @@ static void __init noreturn reinit_bsp_stack(void)
 
     if ( cpu_has_xen_shstk )
     {
-        wrmsrl(MSR_PL0_SSP,
-               (unsigned long)stack + (PRIMARY_SHSTK_SLOT + 1) * PAGE_SIZE - 8);
         wrmsrl(MSR_S_CET, xen_msr_s_cet_value());
         asm volatile ("setssbsy" ::: "memory");
     }
diff --git a/xen/arch/x86/traps-setup.c b/xen/arch/x86/traps-setup.c
index 6e2af58ba0a5..d77be8f83921 100644
--- a/xen/arch/x86/traps-setup.c
+++ b/xen/arch/x86/traps-setup.c
@@ -92,6 +92,7 @@ static void load_system_tables(void)
     {
         volatile uint64_t *ist_ssp = tss_page->ist_ssp;
         unsigned long
+            ssp = stack_top + (PRIMARY_SHSTK_SLOT + 1) * PAGE_SIZE - 8,
             mce_ssp = stack_top + (IST_MCE * IST_SHSTK_SIZE) - 8,
             nmi_ssp = stack_top + (IST_NMI * IST_SHSTK_SIZE) - 8,
             db_ssp  = stack_top + (IST_DB  * IST_SHSTK_SIZE) - 8,
@@ -118,6 +119,7 @@ static void load_system_tables(void)
         }
 
         wrmsrns(MSR_ISST, (unsigned long)ist_ssp);
+        wrmsrns(MSR_PL0_SSP, (unsigned long)ssp);
     }
 
     _set_tssldt_desc(gdt + TSS_ENTRY, (unsigned long)tss,
-- 
2.39.5




* [PATCH v2 07/23] x86/boot: Use RSTORSSP to establish SSP
  2025-08-28 15:03 [PATCH v2 00/23] x86: FRED support Andrew Cooper
                   ` (5 preceding siblings ...)
  2025-08-28 15:03 ` [PATCH v2 06/23] x86/traps: Set MSR_PL0_SSP in load_system_tables() Andrew Cooper
@ 2025-08-28 15:03 ` Andrew Cooper
  2025-09-01  9:28   ` Jan Beulich
  2025-08-28 15:03 ` [PATCH v2 08/23] x86/traps: Alter switch_stack_and_jump() for FRED mode Andrew Cooper
                   ` (15 subsequent siblings)
  22 siblings, 1 reply; 56+ messages in thread
From: Andrew Cooper @ 2025-08-28 15:03 UTC (permalink / raw)
  To: Xen-devel; +Cc: Andrew Cooper, Jan Beulich, Roger Pau Monné

Under FRED, SETSSBSY is disallowed, and we want to be setting up FRED prior to
setting up shadow stacks.  As we still need Supervisor Tokens in IDT mode, we
need mode-specific logic to establish SSP.

In FRED mode, write a Restore Token, RSTORSSP it, and discard the resulting
Previous-SSP token.

No change outside of FRED mode.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
CC: Jan Beulich <JBeulich@suse.com>
CC: Roger Pau Monné <roger.pau@citrix.com>

v2:
 * Some logic moved into prior patch.
---
 xen/arch/x86/boot/x86_64.S | 23 +++++++++++++++++++++--
 xen/arch/x86/setup.c       | 25 ++++++++++++++++++++++++-
 2 files changed, 45 insertions(+), 3 deletions(-)

diff --git a/xen/arch/x86/boot/x86_64.S b/xen/arch/x86/boot/x86_64.S
index 11a7e9d3bd23..9705d03f849c 100644
--- a/xen/arch/x86/boot/x86_64.S
+++ b/xen/arch/x86/boot/x86_64.S
@@ -53,17 +53,21 @@ ENTRY(__high_start)
         mov     %rcx, STACK_CPUINFO_FIELD(cr4)(%r15)
         mov     %rcx, %cr4
 
-        /* WARNING! call/ret now fatal (iff SHSTK) until SETSSBSY loads SSP */
+        /* WARNING! CALL/RET now fatal (iff SHSTK) until SETSSBSY/RSTORSSP loads SSP */
 
 #if defined(CONFIG_XEN_SHSTK)
         test    $CET_SHSTK_EN, %al
         jz      .L_ap_cet_done
 
-        /* Derive the supervisor token address from %rsp. */
+        /* Derive the token address from %rsp. */
         mov     %rsp, %rdx
         and     $~(STACK_SIZE - 1), %rdx
         or      $(PRIMARY_SHSTK_SLOT + 1) * PAGE_SIZE - 8, %rdx
 
+        /* Establishing SSP differs between IDT or FRED mode. */
+        bt      $32 /* ilog2(X86_CR4_FRED) */, %rcx
+        jc      .L_fred_shstk
+
         /*
          * Write a new Supervisor Token.  It doesn't matter the first time a
          * CPU boots, but for S3 resume or hotplug this clears the busy bit so
@@ -71,6 +75,21 @@ ENTRY(__high_start)
          */
         wrssq   %rdx, (%rdx)
         setssbsy
+        jmp     .L_ap_cet_done
+
+.L_fred_shstk:
+
+        /*
+         * Write a Restore Token, value: &token + 8 + 64BIT (bit 0) at the
+         * base of the shstk (which isn't in use yet).
+         */
+        lea     9(%rdx), %rdi
+        wrssq   %rdi, (%rdx)
+        rstorssp (%rdx)
+
+        /* Discard the Previous-SSP Token from the shstk. */
+        mov     $2, %edx
+        incsspd %edx
 
 #endif /* CONFIG_XEN_SHSTK */
 .L_ap_cet_done:
diff --git a/xen/arch/x86/setup.c b/xen/arch/x86/setup.c
index a810bdf6d352..73799fcc684c 100644
--- a/xen/arch/x86/setup.c
+++ b/xen/arch/x86/setup.c
@@ -49,6 +49,7 @@
 #include <asm/prot-key.h>
 #include <asm/pv/domain.h>
 #include <asm/setup.h>
+#include <asm/shstk.h>
 #include <asm/smp.h>
 #include <asm/spec_ctrl.h>
 #include <asm/stubs.h>
@@ -908,7 +909,29 @@ static void __init noreturn reinit_bsp_stack(void)
     if ( cpu_has_xen_shstk )
     {
         wrmsrl(MSR_S_CET, xen_msr_s_cet_value());
-        asm volatile ("setssbsy" ::: "memory");
+
+        /*
+         * IDT and FRED differ by a Supervisor Token on the shadow stack, and
+         * therefore by the value in MSR_PL0_SSP.
+         *
+         * In IDT mode, we use SETSSBSY to mark the Supervisor Token as busy.
+         * In FRED mode, there is no token, so we need to create a temporary
+         * Restore Token to establish SSP.
+         */
+        if ( opt_fred )
+        {
+            unsigned long *token =
+                (void *)stack + (PRIMARY_SHSTK_SLOT + 1) * PAGE_SIZE - 8;
+
+            wrss((unsigned long)token + 9, token);
+            asm volatile ( "rstorssp %0" : "+m" (*token) );
+            /*
+             * We need to discard the resulting Previous-SSP Token, but
+             * reset_stack_and_jump() will do that for us.
+             */
+        }
+        else
+            asm volatile ( "setssbsy" ::: "memory" );
     }
 
     reset_stack_and_jump(init_done);
-- 
2.39.5




* [PATCH v2 08/23] x86/traps: Alter switch_stack_and_jump() for FRED mode
  2025-08-28 15:03 [PATCH v2 00/23] x86: FRED support Andrew Cooper
                   ` (6 preceding siblings ...)
  2025-08-28 15:03 ` [PATCH v2 07/23] x86/boot: Use RSTORSSP to establish SSP Andrew Cooper
@ 2025-08-28 15:03 ` Andrew Cooper
  2025-09-01  9:30   ` Jan Beulich
  2025-08-28 15:03 ` [PATCH v2 09/23] x86/traps: Skip Supervisor Shadow Stack tokens in " Andrew Cooper
                   ` (14 subsequent siblings)
  22 siblings, 1 reply; 56+ messages in thread
From: Andrew Cooper @ 2025-08-28 15:03 UTC (permalink / raw)
  To: Xen-devel; +Cc: Andrew Cooper, Jan Beulich, Roger Pau Monné

FRED and IDT differ by a Supervisor Token on the base of the shstk.  This
means that switch_stack_and_jump() needs to discard one extra word when FRED
is active.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
CC: Jan Beulich <JBeulich@suse.com>
CC: Roger Pau Monné <roger.pau@citrix.com>

v2:
 * Use X86_FEATURE_XEN_FRED
---
 xen/arch/x86/include/asm/current.h | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/xen/arch/x86/include/asm/current.h b/xen/arch/x86/include/asm/current.h
index c1eb27b1c4c2..35cc61fa88e7 100644
--- a/xen/arch/x86/include/asm/current.h
+++ b/xen/arch/x86/include/asm/current.h
@@ -154,7 +154,9 @@ unsigned long get_stack_dump_bottom (unsigned long sp);
     "rdsspd %[ssp];"                                            \
     "cmp $1, %[ssp];"                                           \
     "je .L_shstk_done.%=;" /* CET not active?  Skip. */         \
-    "mov $%c[skstk_base], %[val];"                              \
+    ALTERNATIVE("mov $%c[skstk_base], %[val];",                 \
+                "mov $%c[skstk_base] + 8, %[val];",             \
+                X86_FEATURE_XEN_FRED)                           \
     "and $%c[stack_mask], %[ssp];"                              \
     "sub %[ssp], %[val];"                                       \
     "shr $3, %[val];"                                           \
-- 
2.39.5




* [PATCH v2 09/23] x86/traps: Skip Supervisor Shadow Stack tokens in FRED mode
  2025-08-28 15:03 [PATCH v2 00/23] x86: FRED support Andrew Cooper
                   ` (7 preceding siblings ...)
  2025-08-28 15:03 ` [PATCH v2 08/23] x86/traps: Alter switch_stack_and_jump() for FRED mode Andrew Cooper
@ 2025-08-28 15:03 ` Andrew Cooper
  2025-09-01  9:38   ` Jan Beulich
  2025-08-28 15:03 ` [PATCH v2 10/23] x86/traps: Make an IDT-specific #DB helper Andrew Cooper
                   ` (13 subsequent siblings)
  22 siblings, 1 reply; 56+ messages in thread
From: Andrew Cooper @ 2025-08-28 15:03 UTC (permalink / raw)
  To: Xen-devel; +Cc: Andrew Cooper, Jan Beulich, Roger Pau Monné

FRED doesn't use Supervisor Shadow Stack tokens.  Skip setting them up.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
CC: Jan Beulich <JBeulich@suse.com>
CC: Roger Pau Monné <roger.pau@citrix.com>

v2:
 * New
---
 xen/arch/x86/mm.c    | 12 +++++++++---
 xen/arch/x86/setup.c |  8 ++++----
 2 files changed, 13 insertions(+), 7 deletions(-)

diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index b929d15d0050..043e6aa9d73a 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -129,6 +129,7 @@
 #include <asm/shadow.h>
 #include <asm/shared.h>
 #include <asm/trampoline.h>
+#include <asm/traps.h>
 #include <asm/x86_emulate.h>
 
 #include <public/memory.h>
@@ -6441,8 +6442,13 @@ static void write_sss_token(unsigned long *ptr)
 
 void memguard_guard_stack(void *p)
 {
-    /* IST Shadow stacks.  4x 1k in stack page 0. */
-    if ( IS_ENABLED(CONFIG_XEN_SHSTK) )
+    /*
+     * IST Shadow stacks.  4x 1k in stack page 0.
+     *
+     * With IDT delivery, we need Supervisor Shadow Stack tokens at the base
+     * of each stack.  With FRED delivery, these no longer exist.
+     */
+    if ( IS_ENABLED(CONFIG_XEN_SHSTK) && !opt_fred )
     {
         write_sss_token(p + (IST_MCE * IST_SHSTK_SIZE) - 8);
         write_sss_token(p + (IST_NMI * IST_SHSTK_SIZE) - 8);
@@ -6453,7 +6459,7 @@ void memguard_guard_stack(void *p)
 
     /* Primary Shadow Stack.  1x 4k in stack page 5. */
     p += PRIMARY_SHSTK_SLOT * PAGE_SIZE;
-    if ( IS_ENABLED(CONFIG_XEN_SHSTK) )
+    if ( IS_ENABLED(CONFIG_XEN_SHSTK) && !opt_fred )
         write_sss_token(p + PAGE_SIZE - 8);
 
     map_pages_to_xen((unsigned long)p, virt_to_mfn(p), 1, PAGE_HYPERVISOR_SHSTK);
diff --git a/xen/arch/x86/setup.c b/xen/arch/x86/setup.c
index 73799fcc684c..c767d0451574 100644
--- a/xen/arch/x86/setup.c
+++ b/xen/arch/x86/setup.c
@@ -1920,10 +1920,6 @@ void asmlinkage __init noreturn __start_xen(void)
 
     system_state = SYS_STATE_boot;
 
-    bsp_stack = cpu_alloc_stack(0);
-    if ( !bsp_stack )
-        panic("No memory for BSP stack\n");
-
     console_init_ring();
     vesa_init();
 
@@ -2077,6 +2073,10 @@ void asmlinkage __init noreturn __start_xen(void)
 
     traps_init(); /* Needs stubs allocated. */
 
+    bsp_stack = cpu_alloc_stack(0); /* Needs to know IDT vs FRED */
+    if ( !bsp_stack )
+        panic("No memory for BSP stack\n");
+
     cpu_init();
 
     rcu_init();
-- 
2.39.5




* [PATCH v2 10/23] x86/traps: Make an IDT-specific #DB helper
  2025-08-28 15:03 [PATCH v2 00/23] x86: FRED support Andrew Cooper
                   ` (8 preceding siblings ...)
  2025-08-28 15:03 ` [PATCH v2 09/23] x86/traps: Skip Supervisor Shadow Stack tokens in " Andrew Cooper
@ 2025-08-28 15:03 ` Andrew Cooper
  2025-09-01  9:41   ` Jan Beulich
  2025-08-28 15:03 ` [PATCH v2 11/23] x86/traps: Make an IDT-specific #PF helper Andrew Cooper
                   ` (12 subsequent siblings)
  22 siblings, 1 reply; 56+ messages in thread
From: Andrew Cooper @ 2025-08-28 15:03 UTC (permalink / raw)
  To: Xen-devel; +Cc: Andrew Cooper, Jan Beulich, Roger Pau Monné

FRED provides PENDING_DBG in the stack frame, avoiding the need to read
%dr6 manually.

Rename do_debug() to handle_DB(), and update it to take a dbg field using
positive polarity.
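
For illustration (a sketch, mirroring what the new handle_DB_IDT() does):

    /*
     * Sketch only.  %dr6 has mixed polarity: a few bits (e.g. RTM) read as 1
     * when the condition did NOT occur.  XORing with the architectural
     * default yields positive polarity, matching FRED's PENDING_DBG, where a
     * set bit always means the condition occurred.
     */
    static unsigned long dr6_to_pending_dbg(unsigned long dr6)
    {
        return dr6 ^ X86_DR6_DEFAULT;
    }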

Introduce a new handle_DB_IDT() which reads %dr6.

No functional change.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
CC: Jan Beulich <JBeulich@suse.com>
CC: Roger Pau Monné <roger.pau@citrix.com>

v2:
 * New
---
 xen/arch/x86/traps.c        | 28 +++++++++++++++++-----------
 xen/arch/x86/x86_64/entry.S |  2 +-
 2 files changed, 18 insertions(+), 12 deletions(-)

diff --git a/xen/arch/x86/traps.c b/xen/arch/x86/traps.c
index 7ae46ae20f98..0372f1c386a8 100644
--- a/xen/arch/x86/traps.c
+++ b/xen/arch/x86/traps.c
@@ -1992,14 +1992,11 @@ void asmlinkage do_device_not_available(struct cpu_user_regs *regs)
 
 void nocall sysenter_eflags_saved(void);
 
-void asmlinkage do_debug(struct cpu_user_regs *regs)
+/* Handle #DB.  @dbg is PENDING_DBG, a.k.a. %dr6 with positive polarity. */
+static void handle_DB(struct cpu_user_regs *regs, unsigned long dbg)
 {
-    unsigned long dr6;
     struct vcpu *v = current;
 
-    /* Stash dr6 as early as possible. */
-    dr6 = read_debugreg(6);
-
     /*
      * At the time of writing (March 2018), on the subject of %dr6:
      *
@@ -2066,13 +2063,13 @@ void asmlinkage do_debug(struct cpu_user_regs *regs)
          * If however we do, safety measures need to be enacted.  Use a big
          * hammer and clear all debug settings.
          */
-        if ( dr6 & (DR_TRAP3 | DR_TRAP2 | DR_TRAP1 | DR_TRAP0) )
+        if ( dbg & (DR_TRAP3 | DR_TRAP2 | DR_TRAP1 | DR_TRAP0) )
         {
             unsigned int bp, dr7 = read_debugreg(7);
 
             for ( bp = 0; bp < 4; ++bp )
             {
-                if ( (dr6 & (1u << bp)) && /* Breakpoint triggered? */
+                if ( (dbg & (1u << bp)) && /* Breakpoint triggered? */
                      (dr7 & (3u << (bp * DR_ENABLE_SIZE))) && /* Enabled? */
                      ((dr7 & (3u << ((bp * DR_CONTROL_SIZE) + /* Insn? */
                                      DR_CONTROL_SHIFT))) == DR_RW_EXECUTE) )
@@ -2093,9 +2090,9 @@ void asmlinkage do_debug(struct cpu_user_regs *regs)
          * so ensure the message is ratelimited.
          */
         gprintk(XENLOG_WARNING,
-                "Hit #DB in Xen context: %04x:%p [%ps], stk %04x:%p, dr6 %lx\n",
+                "Hit #DB in Xen context: %04x:%p [%ps], stk %04x:%p, dbg %lx\n",
                 regs->cs, _p(regs->rip), _p(regs->rip),
-                regs->ss, _p(regs->rsp), dr6);
+                regs->ss, _p(regs->rsp), dbg);
 
         return;
     }
@@ -2107,7 +2104,7 @@ void asmlinkage do_debug(struct cpu_user_regs *regs)
      * by debugging actions completed behind it's back.
      */
     v->arch.dr6 = x86_merge_dr6(v->domain->arch.cpu_policy,
-                                v->arch.dr6, dr6 ^ X86_DR6_DEFAULT);
+                                v->arch.dr6, dbg);
 
     if ( guest_kernel_mode(v, regs) && v->domain->debugger_attached )
     {
@@ -2115,7 +2112,16 @@ void asmlinkage do_debug(struct cpu_user_regs *regs)
         return;
     }
 
-    pv_inject_DB(dr6 ^ X86_DR6_DEFAULT);
+    pv_inject_DB(dbg);
+}
+
+/*
+ * When using IDT delivery, it is our responsibility to read %dr6.  Convert it
+ * to positive polarity.
+ */
+void asmlinkage handle_DB_IDT(struct cpu_user_regs *regs)
+{
+    handle_DB(regs, read_debugreg(6) ^ X86_DR6_DEFAULT);
 }
 
 void asmlinkage do_entry_CP(struct cpu_user_regs *regs)
diff --git a/xen/arch/x86/x86_64/entry.S b/xen/arch/x86/x86_64/entry.S
index 39c7b9d17f9e..789687488c5f 100644
--- a/xen/arch/x86/x86_64/entry.S
+++ b/xen/arch/x86/x86_64/entry.S
@@ -1171,7 +1171,7 @@ FUNC(handle_ist_exception)
 .L_ ## vec ## _done:
 
         DISPATCH(X86_EXC_NMI, do_nmi)
-        DISPATCH(X86_EXC_DB,  do_debug)
+        DISPATCH(X86_EXC_DB,  handle_DB_IDT)
         DISPATCH(X86_EXC_MC,  do_machine_check)
 #undef DISPATCH
 
-- 
2.39.5




* [PATCH v2 11/23] x86/traps: Make an IDT-specific #PF helper
  2025-08-28 15:03 [PATCH v2 00/23] x86: FRED support Andrew Cooper
                   ` (9 preceding siblings ...)
  2025-08-28 15:03 ` [PATCH v2 10/23] x86/traps: Make an IDT-specific #DB helper Andrew Cooper
@ 2025-08-28 15:03 ` Andrew Cooper
  2025-09-01  9:46   ` Jan Beulich
  2025-08-28 15:03 ` [PATCH v2 12/23] x86/fsgsbase: Make gskern accesses safe under FRED Andrew Cooper
                   ` (11 subsequent siblings)
  22 siblings, 1 reply; 56+ messages in thread
From: Andrew Cooper @ 2025-08-28 15:03 UTC (permalink / raw)
  To: Xen-devel; +Cc: Andrew Cooper, Jan Beulich, Roger Pau Monné

FRED provides %cr2 in the stack frame, avoiding the need to read %cr2
manually.

Rename do_page_fault() to handle_PF(), and update it to take cr2, still named
addr for consistency.

Introduce a new handle_PF_IDT() which reads %cr2 and conditionally re-enables
interrupts.

No functional change.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
CC: Jan Beulich <JBeulich@suse.com>
CC: Roger Pau Monné <roger.pau@citrix.com>

v2:
 * New
---
 xen/arch/x86/traps.c        | 26 ++++++++++++++------------
 xen/arch/x86/x86_64/entry.S |  2 +-
 2 files changed, 15 insertions(+), 13 deletions(-)

diff --git a/xen/arch/x86/traps.c b/xen/arch/x86/traps.c
index 0372f1c386a8..c11d72d47027 100644
--- a/xen/arch/x86/traps.c
+++ b/xen/arch/x86/traps.c
@@ -1670,21 +1670,10 @@ static int fixup_page_fault(unsigned long addr, struct cpu_user_regs *regs)
     return 0;
 }
 
-void asmlinkage do_page_fault(struct cpu_user_regs *regs)
+static void handle_PF(struct cpu_user_regs *regs, unsigned long addr /* cr2 */)
 {
-    unsigned long addr;
     unsigned int error_code;
 
-    addr = read_cr2();
-
-    /*
-     * Don't re-enable interrupts if we were running an IRQ-off region when
-     * we hit the page fault, or we'll break that code.
-     */
-    ASSERT(!local_irq_is_enabled());
-    if ( regs->flags & X86_EFLAGS_IF )
-        local_irq_enable();
-
     /* fixup_page_fault() might change regs->error_code, so cache it here. */
     error_code = regs->error_code;
 
@@ -1745,6 +1734,19 @@ void asmlinkage do_page_fault(struct cpu_user_regs *regs)
     pv_inject_page_fault(regs->error_code, addr);
 }
 
+/*
+ * When using IDT delivery, it is our responsibility to read %cr2.
+ */
+void asmlinkage handle_PF_IDT(struct cpu_user_regs *regs)
+{
+    unsigned long addr = read_cr2();
+
+    if ( regs->flags & X86_EFLAGS_IF )
+        local_irq_enable();
+
+    handle_PF(regs, addr);
+}
+
 /*
  * Early #PF handler to print CR2, error code, and stack.
  *
diff --git a/xen/arch/x86/x86_64/entry.S b/xen/arch/x86/x86_64/entry.S
index 789687488c5f..c02245ac064c 100644
--- a/xen/arch/x86/x86_64/entry.S
+++ b/xen/arch/x86/x86_64/entry.S
@@ -871,7 +871,7 @@ handle_exception_saved:
          * reading %cr2. Otherwise a page fault in the nested interrupt handler
          * would corrupt %cr2.
          */
-        DISPATCH(X86_EXC_PF, do_page_fault)
+        DISPATCH(X86_EXC_PF, handle_PF_IDT)
 
         /* Only re-enable IRQs if they were active before taking the fault */
         testb $X86_EFLAGS_IF >> 8, UREGS_eflags + 1(%rsp)
-- 
2.39.5




* [PATCH v2 12/23] x86/fsgsbase: Make gskern accesses safe under FRED
  2025-08-28 15:03 [PATCH v2 00/23] x86: FRED support Andrew Cooper
                   ` (10 preceding siblings ...)
  2025-08-28 15:03 ` [PATCH v2 11/23] x86/traps: Make an IDT-specific #PF helper Andrew Cooper
@ 2025-08-28 15:03 ` Andrew Cooper
  2025-09-01  9:57   ` Jan Beulich
  2025-08-28 15:03 ` [PATCH v2 13/23] x86/traps: Introduce FRED entrypoints Andrew Cooper
                   ` (10 subsequent siblings)
  22 siblings, 1 reply; 56+ messages in thread
From: Andrew Cooper @ 2025-08-28 15:03 UTC (permalink / raw)
  To: Xen-devel; +Cc: Andrew Cooper, Jan Beulich, Roger Pau Monné

Under FRED, the SWAPGS instruction is disallowed.  Therefore we must use the
MSR path instead.

read_registers() is in the show_registers() path, so this allows Xen to render
its current state without suffering #UD (and recursing until the stack guard
page is hit).

All hardware with FRED is expected to have some kind of non-serialising access
to these registers.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
CC: Jan Beulich <JBeulich@suse.com>
CC: Roger Pau Monné <roger.pau@citrix.com>

v2:
 * Broken out of subsequent patch.  Rebased over MSR cleanup.

In principle, the following can also be used for read_registers()

    diff --git a/xen/arch/x86/traps.c b/xen/arch/x86/traps.c
    index 5799770a2f71..0b0fdf2c5ac4 100644
    --- a/xen/arch/x86/traps.c
    +++ b/xen/arch/x86/traps.c
    @@ -125,16 +125,21 @@ static void read_registers(struct extra_state *state)
         state->cr3 = read_cr3();
         state->cr4 = read_cr4();

    -    if ( !(state->cr4 & X86_CR4_FRED) && (state->cr4 & X86_CR4_FSGSBASE) )
    +    if ( state->cr4 & X86_CR4_FSGSBASE )
         {
             state->fsb = __rdfsbase();
             state->gsb = __rdgsbase();
    +
    +        if ( state->cr4 & X86_CR4_FRED )
    +            goto gskern_fred;
    +
             state->gss = __rdgskern();
         }
         else
         {
             state->fsb = rdmsr(MSR_FS_BASE);
             state->gsb = rdmsr(MSR_GS_BASE);
    +    gskern_fred:
             state->gss = rdmsr(MSR_SHADOW_GS_BASE);
         }

but I'm not sure that it's a good enough improvement to warrant the
complexity.
---
 xen/arch/x86/include/asm/fsgsbase.h | 8 ++++++--
 xen/arch/x86/traps.c                | 2 +-
 2 files changed, 7 insertions(+), 3 deletions(-)

diff --git a/xen/arch/x86/include/asm/fsgsbase.h b/xen/arch/x86/include/asm/fsgsbase.h
index 24862a6bfea7..5faa3a324332 100644
--- a/xen/arch/x86/include/asm/fsgsbase.h
+++ b/xen/arch/x86/include/asm/fsgsbase.h
@@ -79,7 +79,9 @@ static inline unsigned long read_gs_base(void)
 
 static inline unsigned long read_gs_shadow(void)
 {
-    if ( read_cr4() & X86_CR4_FSGSBASE )
+    unsigned long cr4 = read_cr4();
+
+    if ( !(cr4 & X86_CR4_FRED) && (cr4 & X86_CR4_FSGSBASE) )
         return __rdgs_shadow();
     else
         return rdmsr(MSR_SHADOW_GS_BASE);
@@ -103,7 +105,9 @@ static inline void write_gs_base(unsigned long base)
 
 static inline void write_gs_shadow(unsigned long base)
 {
-    if ( read_cr4() & X86_CR4_FSGSBASE )
+    unsigned long cr4 = read_cr4();
+
+    if ( !(cr4 & X86_CR4_FRED) && (cr4 & X86_CR4_FSGSBASE) )
         __wrgs_shadow(base);
     else
         wrmsrns(MSR_SHADOW_GS_BASE, base);
diff --git a/xen/arch/x86/traps.c b/xen/arch/x86/traps.c
index c11d72d47027..66308e7c9edf 100644
--- a/xen/arch/x86/traps.c
+++ b/xen/arch/x86/traps.c
@@ -118,7 +118,7 @@ static void read_registers(struct extra_state *state)
     state->cr3 = read_cr3();
     state->cr4 = read_cr4();
 
-    if ( state->cr4 & X86_CR4_FSGSBASE )
+    if ( !(state->cr4 & X86_CR4_FRED) && (state->cr4 & X86_CR4_FSGSBASE) )
     {
         state->fsb = __rdfsbase();
         state->gsb = __rdgsbase();
-- 
2.39.5




* [PATCH v2 13/23] x86/traps: Introduce FRED entrypoints
  2025-08-28 15:03 [PATCH v2 00/23] x86: FRED support Andrew Cooper
                   ` (11 preceding siblings ...)
  2025-08-28 15:03 ` [PATCH v2 12/23] x86/fsgsbase: Make gskern accesses safe under FRED Andrew Cooper
@ 2025-08-28 15:03 ` Andrew Cooper
  2025-09-01 10:26   ` Jan Beulich
  2025-08-28 15:04 ` [PATCH v2 14/23] x86/traps: Enable FRED when requested Andrew Cooper
                   ` (9 subsequent siblings)
  22 siblings, 1 reply; 56+ messages in thread
From: Andrew Cooper @ 2025-08-28 15:03 UTC (permalink / raw)
  To: Xen-devel; +Cc: Andrew Cooper, Jan Beulich, Roger Pau Monné

Under FRED, there's one entrypoint from Ring 3, and one from Ring 0.

FRED gives us a good stack (even for SYSCALL/SYSENTER), and a unified event
frame on the stack, meaning that all software needs to do is spill the GPRs
with a line of PUSHes.  Introduce PUSH_AND_CLEAR_GPRS and POP_GPRS for this
purpose.

Introduce entry_FRED_R0(), which to a first approximation is complete for all
event handling within Xen.

entry_FRED_R0()'s position needs deriving from entry_FRED_R3() (the Ring0
entrypoint sits at Ring3 + 0x100), so introduce a basic Ring3 handler too.
There is more work required to make the return-to-guest path work under FRED,
so leave a BUG clearly in place.

Also introduce entry_from_{xen,pv}() to be the C level handlers.  By simply
copying regs->fred_ss.vector into regs->entry_vector, we can reuse all the
existing fault handlers.

Extend fatal_trap() to render the event type, including by name, when FRED is
active.  This is slightly complicated, because X86_ET_OTHER must not use
vector_name(), otherwise SYSCALL and SYSENTER get rendered as #BP and #DB.
Also,
{read,write}_gs_shadow() needs modifying to avoid the SWAPGS instruction,
which is disallowed in FRED mode.

This is sufficient to handle all interrupts and exceptions encountered during
development, including plenty of Double Faults.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
CC: Jan Beulich <JBeulich@suse.com>
CC: Roger Pau Monné <roger.pau@citrix.com>

v2:
 * Don't render a vector name for X86_ET_SW_INT
 * Fix typos in names[]
 * Link entry-fred.o first

SIMICS hasn't been updated to v9 of the FRED spec, and still wants ENDBR
instructions
at the entrypoints.
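
For readers unfamiliar with FRED, a rough sketch (not part of the patch) of the
auxiliary data that cpu_regs_fred_info() exposes.  The real definition comes
from the earlier cpu_user_regs patch; the field name beyond edata is
illustrative only:

    /*
     * Sketch only.  FRED pushes a larger event frame; the fields beyond the
     * legacy-compatible ones captured in struct cpu_user_regs sit directly
     * above it on the stack, hence the (regs + 1) in cpu_regs_fred_info().
     */
    struct fred_info {
        uint64_t edata;      /* %cr2 for #PF, %dr6 for #DB, 0 otherwise */
        uint64_t _rsvd;      /* illustrative name for the remaining space */
    };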
---
 xen/arch/x86/include/asm/asm_defns.h |  65 +++++++++++
 xen/arch/x86/traps.c                 | 159 ++++++++++++++++++++++++++-
 xen/arch/x86/x86_64/Makefile         |   1 +
 xen/arch/x86/x86_64/entry-fred.S     |  33 ++++++
 4 files changed, 255 insertions(+), 3 deletions(-)
 create mode 100644 xen/arch/x86/x86_64/entry-fred.S

diff --git a/xen/arch/x86/include/asm/asm_defns.h b/xen/arch/x86/include/asm/asm_defns.h
index 72a0082d319d..a81a4043d0f1 100644
--- a/xen/arch/x86/include/asm/asm_defns.h
+++ b/xen/arch/x86/include/asm/asm_defns.h
@@ -315,6 +315,71 @@ static always_inline void stac(void)
         subq  $-(UREGS_error_code-UREGS_r15+\adj), %rsp
 .endm
 
+/*
+ * Push and clear GPRs
+ */
+.macro PUSH_AND_CLEAR_GPRS
+        push  %rdi
+        xor   %edi, %edi
+        push  %rsi
+        xor   %esi, %esi
+        push  %rdx
+        xor   %edx, %edx
+        push  %rcx
+        xor   %ecx, %ecx
+        push  %rax
+        xor   %eax, %eax
+        push  %r8
+        xor   %r8d, %r8d
+        push  %r9
+        xor   %r9d, %r9d
+        push  %r10
+        xor   %r10d, %r10d
+        push  %r11
+        xor   %r11d, %r11d
+        push  %rbx
+        xor   %ebx, %ebx
+        push  %rbp
+#ifdef CONFIG_FRAME_POINTER
+/* Indicate special exception stack frame by inverting the frame pointer. */
+        mov   %rsp, %rbp
+        notq  %rbp
+#else
+        xor   %ebp, %ebp
+#endif
+        push  %r12
+        xor   %r12d, %r12d
+        push  %r13
+        xor   %r13d, %r13d
+        push  %r14
+        xor   %r14d, %r14d
+        push  %r15
+        xor   %r15d, %r15d
+.endm
+
+/*
+ * POP GPRs from a UREGS_* frame on the stack.  Does not modify flags.
+ *
+ * @rax: Alternative destination for the %rax value on the stack.
+ */
+.macro POP_GPRS rax=%rax
+        pop   %r15
+        pop   %r14
+        pop   %r13
+        pop   %r12
+        pop   %rbp
+        pop   %rbx
+        pop   %r11
+        pop   %r10
+        pop   %r9
+        pop   %r8
+        pop   \rax
+        pop   %rcx
+        pop   %rdx
+        pop   %rsi
+        pop   %rdi
+.endm
+
 #ifdef CONFIG_PV32
 #define CR4_PV32_RESTORE                               \
     ALTERNATIVE_2 "",                                  \
diff --git a/xen/arch/x86/traps.c b/xen/arch/x86/traps.c
index 66308e7c9edf..67763bec0dc5 100644
--- a/xen/arch/x86/traps.c
+++ b/xen/arch/x86/traps.c
@@ -89,6 +89,13 @@ const unsigned int nmi_cpu;
 #define stack_words_per_line 4
 #define ESP_BEFORE_EXCEPTION(regs) ((unsigned long *)(regs)->rsp)
 
+/* Only valid to use when FRED is active. */
+static inline struct fred_info *cpu_regs_fred_info(struct cpu_user_regs *regs)
+{
+    ASSERT(read_cr4() & X86_CR4_FRED);
+    return (void *)(regs + 1);
+}
+
 struct extra_state
 {
     unsigned long cr0, cr2, cr3, cr4;
@@ -1023,6 +1030,32 @@ void show_execution_state_nmi(const cpumask_t *mask, bool show_all)
         printk("Non-responding CPUs: {%*pbl}\n", CPUMASK_PR(&show_state_mask));
 }
 
+static const char *x86_et_name(unsigned int type)
+{
+    static const char *const names[] = {
+        [X86_ET_EXT_INTR]    = "EXT_INTR",
+        [X86_ET_NMI]         = "NMI",
+        [X86_ET_HW_EXC]      = "HW_EXC",
+        [X86_ET_SW_INT]      = "SW_INT",
+        [X86_ET_PRIV_SW_EXC] = "PRIV_SW_EXC",
+        [X86_ET_SW_EXC]      = "SW_EXC",
+        [X86_ET_OTHER]       = "OTHER",
+    };
+
+    return (type < ARRAY_SIZE(names) && names[type]) ? names[type] : "???";
+}
+
+static const char *x86_et_other_name(unsigned int what)
+{
+    static const char *const names[] = {
+        [0] = "MTF",
+        [1] = "SYSCALL",
+        [2] = "SYSENTER",
+    };
+
+    return (what < ARRAY_SIZE(names) && names[what]) ? names[what] : "???";
+}
+
 const char *vector_name(unsigned int vec)
 {
     static const char names[][4] = {
@@ -1101,9 +1134,41 @@ void fatal_trap(const struct cpu_user_regs *regs, bool show_remote)
         }
     }
 
-    panic("FATAL TRAP: vec %u, %s[%04x]%s\n",
-          trapnr, vector_name(trapnr), regs->error_code,
-          (regs->eflags & X86_EFLAGS_IF) ? "" : " IN INTERRUPT CONTEXT");
+    if ( read_cr4() & X86_CR4_FRED )
+    {
+        bool render_ec = false;
+        const char *vec_name = NULL;
+
+        switch ( regs->fred_ss.type )
+        {
+        case X86_ET_HW_EXC:
+        case X86_ET_PRIV_SW_EXC:
+        case X86_ET_SW_EXC:
+            render_ec = true;
+            vec_name = vector_name(regs->fred_ss.vector);
+            break;
+
+        case X86_ET_OTHER:
+            vec_name = x86_et_other_name(regs->fred_ss.vector);
+            break;
+        }
+
+        if ( render_ec )
+            panic("Fatal TRAP: type %u, %s, vec %u, %s[%04x]%s\n",
+                  regs->fred_ss.type, x86_et_name(regs->fred_ss.type),
+                  regs->fred_ss.vector, vec_name ?: "",
+                  regs->error_code,
+                  (regs->eflags & X86_EFLAGS_IF) ? "" : " IN INTERRUPT CONTEXT");
+        else
+            panic("Fatal TRAP: type %u, %s, vec %u, %s%s\n",
+                  regs->fred_ss.type, x86_et_name(regs->fred_ss.type),
+                  regs->fred_ss.vector, vec_name ?: "",
+                  (regs->eflags & X86_EFLAGS_IF) ? "" : " IN INTERRUPT CONTEXT");
+    }
+    else
+        panic("FATAL TRAP: vec %u, %s[%04x]%s\n",
+              trapnr, vector_name(trapnr), regs->error_code,
+              (regs->eflags & X86_EFLAGS_IF) ? "" : " IN INTERRUPT CONTEXT");
 }
 
 void asmlinkage noreturn do_unhandled_trap(struct cpu_user_regs *regs)
@@ -2198,6 +2263,94 @@ void asmlinkage check_ist_exit(const struct cpu_user_regs *regs, bool ist_exit)
 }
 #endif
 
+void asmlinkage entry_from_pv(struct cpu_user_regs *regs)
+{
+    /* Copy fred_ss.vector into entry_vector as IDT delivery would have done. */
+    regs->entry_vector = regs->fred_ss.vector;
+
+    fatal_trap(regs, false);
+}
+
+void asmlinkage entry_from_xen(struct cpu_user_regs *regs)
+{
+    struct fred_info *fi = cpu_regs_fred_info(regs);
+    uint8_t type = regs->fred_ss.type;
+
+    /* Copy fred_ss.vector into entry_vector as IDT delivery would have done. */
+    regs->entry_vector = regs->fred_ss.vector;
+
+    /*
+     * First, handle the asynchronous or fatal events.  These are either
+     * unrelated to the interrupted context, or may not have valid context
+     * recorded, and all have special rules on how/whether to re-enable IRQs.
+     */
+    switch ( type )
+    {
+    case X86_ET_EXT_INTR:
+        return do_IRQ(regs);
+
+    case X86_ET_NMI:
+        return do_nmi(regs);
+
+    case X86_ET_HW_EXC:
+        switch ( regs->fred_ss.vector )
+        {
+        case X86_EXC_DF: return do_double_fault(regs);
+        case X86_EXC_MC: return do_machine_check(regs);
+        }
+        break;
+    }
+
+    /*
+     * With the asynchronous events handled, what remains are the synchronous
+     * ones.  If we interrupted an IRQs-on region, we should re-enable IRQs
+     * now; for #PF and #DB, %cr2 and %dr6 are on the stack in edata.
+     */
+    if ( regs->eflags & X86_EFLAGS_IF )
+        local_irq_enable();
+
+    switch ( type )
+    {
+    case X86_ET_HW_EXC:
+    case X86_ET_PRIV_SW_EXC:
+    case X86_ET_SW_EXC:
+        switch ( regs->fred_ss.vector )
+        {
+        case X86_EXC_PF:  handle_PF(regs, fi->edata); break;
+        case X86_EXC_GP:  do_general_protection(regs); break;
+        case X86_EXC_UD:  do_invalid_op(regs); break;
+        case X86_EXC_NM:  do_device_not_available(regs); break;
+        case X86_EXC_BP:  do_int3(regs); break;
+        case X86_EXC_DB:  handle_DB(regs, fi->edata); break;
+
+        case X86_EXC_DE:
+        case X86_EXC_OF:
+        case X86_EXC_BR:
+        case X86_EXC_NP:
+        case X86_EXC_SS:
+        case X86_EXC_MF:
+        case X86_EXC_AC:
+        case X86_EXC_XM:
+            do_trap(regs);
+            break;
+
+        case X86_EXC_CP:  do_entry_CP(regs); break;
+
+        default:
+            goto fatal;
+        }
+        break;
+
+    default:
+        goto fatal;
+    }
+
+    return;
+
+ fatal:
+    fatal_trap(regs, false);
+}
+
 /*
  * Local variables:
  * mode: C
diff --git a/xen/arch/x86/x86_64/Makefile b/xen/arch/x86/x86_64/Makefile
index f20763088740..c0a0b6603221 100644
--- a/xen/arch/x86/x86_64/Makefile
+++ b/xen/arch/x86/x86_64/Makefile
@@ -1,5 +1,6 @@
 obj-$(CONFIG_PV32) += compat/
 
+obj-bin-y += entry-fred.o
 obj-bin-y += entry.o
 obj-$(CONFIG_KEXEC) += machine_kexec.o
 obj-y += pci.o
diff --git a/xen/arch/x86/x86_64/entry-fred.S b/xen/arch/x86/x86_64/entry-fred.S
new file mode 100644
index 000000000000..3c3320df22cb
--- /dev/null
+++ b/xen/arch/x86/x86_64/entry-fred.S
@@ -0,0 +1,33 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+
+        .file "x86_64/entry-fred.S"
+
+#include <asm/asm_defns.h>
+#include <asm/page.h>
+
+        .section .text.entry, "ax", @progbits
+
+        /* The Ring3 entry point is required to be 4k aligned. */
+
+FUNC(entry_FRED_R3, 4096)
+        PUSH_AND_CLEAR_GPRS
+
+        mov     %rsp, %rdi
+        call    entry_from_pv
+
+        POP_GPRS
+        eretu
+END(entry_FRED_R3)
+
+        /* The Ring0 entrypoint is at Ring3 + 0x100. */
+        .org entry_FRED_R3 + 0x100, 0xcc
+
+FUNC_LOCAL(entry_FRED_R0, 0)
+        PUSH_AND_CLEAR_GPRS
+
+        mov     %rsp, %rdi
+        call    entry_from_xen
+
+        POP_GPRS
+        erets
+END(entry_FRED_R0)
-- 
2.39.5



^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [PATCH v2 14/23] x86/traps: Enable FRED when requested
  2025-08-28 15:03 [PATCH v2 00/23] x86: FRED support Andrew Cooper
                   ` (12 preceding siblings ...)
  2025-08-28 15:03 ` [PATCH v2 13/23] x86/traps: Introduce FRED entrypoints Andrew Cooper
@ 2025-08-28 15:04 ` Andrew Cooper
  2025-09-01 10:50   ` Jan Beulich
  2025-08-28 15:04 ` [PATCH v2 15/23] x86/pv: Deduplicate is_canonical_address() in do_set_segment_base() Andrew Cooper
                   ` (8 subsequent siblings)
  22 siblings, 1 reply; 56+ messages in thread
From: Andrew Cooper @ 2025-08-28 15:04 UTC (permalink / raw)
  To: Xen-devel; +Cc: Andrew Cooper, Jan Beulich, Roger Pau Monné

With the shadow stack and exception handling adjustments in place, we can now
activate FRED when appropriate.  Note that opt_fred is still disabled by
default.

Introduce init_fred() to set up all the MSRs relevant for FRED.  FRED uses
MSR_STAR (entries from Ring3 only), and MSR_FRED_SSP_SL0 aliases MSR_PL0_SSP
when CET-SS is active.  Otherwise, they're all new MSRs.

With init_fred() existing, load_system_tables() and legacy_syscall_init()
should only be used when setting up IDT delivery.  Insert ASSERT()s to this
effect, and adjust the various *_init() functions to make this property true.

Per the documentation, ap_early_traps_init() is responsible for switching off
the boot GDT, which needs doing even in FRED mode.

Finally, set CR4.FRED in {bsp,ap}_early_traps_init().

Xen can now boot in FRED mode up until starting a PV guest, where it faults
because IRET is not permitted to change privilege.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
CC: Jan Beulich <JBeulich@suse.com>
CC: Roger Pau Monné <roger.pau@citrix.com>

v2:
 * Explain the lack of BUG_ON()
 * Poison SL1

In principle we can stop allocating the IDT and TSS for CPUs now, although I
want to get shutdown and kexec working before making this optimisation, in
case there's something I've overlooked.
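
For clarity, MSR_FRED_STK_LVLS encodes a 2-bit stack level per exception
vector.  A sketch (not part of the patch) of the literal used in init_fred(),
built with an illustrative helper:

    /* Each vector gets a 2-bit stack level; unlisted vectors stay on SL0. */
    #define FRED_STK_LVL(vec, sl)  ((unsigned long)(sl) << ((vec) * 2))

    /* Equivalent to the value written in init_fred(): #MC -> SL2, #DF -> SL3. */
    #define XEN_FRED_STK_LVLS  (FRED_STK_LVL(X86_EXC_MC, 2) | \
                                FRED_STK_LVL(X86_EXC_DF, 3))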
---
 xen/arch/x86/include/asm/current.h |  3 ++
 xen/arch/x86/include/asm/traps.h   |  2 +
 xen/arch/x86/traps-setup.c         | 86 +++++++++++++++++++++++++++---
 3 files changed, 85 insertions(+), 6 deletions(-)

diff --git a/xen/arch/x86/include/asm/current.h b/xen/arch/x86/include/asm/current.h
index 35cc61fa88e7..53b0d3cf143d 100644
--- a/xen/arch/x86/include/asm/current.h
+++ b/xen/arch/x86/include/asm/current.h
@@ -23,6 +23,9 @@
  * 2 - NMI IST stack
  * 1 - #MC IST stack
  * 0 - IST Shadow Stacks (4x 1k, read-only)
+ *
+ * In FRED mode, #DB and NMI do not need special stacks, so their IST stacks
+ * are unused.
  */
 
 /*
diff --git a/xen/arch/x86/include/asm/traps.h b/xen/arch/x86/include/asm/traps.h
index 73097e957d05..5d7504bc44d1 100644
--- a/xen/arch/x86/include/asm/traps.h
+++ b/xen/arch/x86/include/asm/traps.h
@@ -16,6 +16,8 @@ void traps_init(void);
 void bsp_traps_reinit(void);
 void percpu_traps_init(void);
 
+void nocall entry_FRED_R3(void);
+
 extern unsigned int ler_msr;
 
 const char *vector_name(unsigned int vec);
diff --git a/xen/arch/x86/traps-setup.c b/xen/arch/x86/traps-setup.c
index d77be8f83921..535b53969678 100644
--- a/xen/arch/x86/traps-setup.c
+++ b/xen/arch/x86/traps-setup.c
@@ -59,6 +59,8 @@ static void load_system_tables(void)
         .limit = sizeof(bsp_idt) - 1,
     };
 
+    ASSERT(opt_fred == 0);
+
     /*
      * Set up the TSS.  Warning - may be live, and the NMI/#MC must remain
      * valid on every instruction boundary.  (Note: these are all
@@ -191,6 +193,8 @@ static void legacy_syscall_init(void)
     unsigned char *stub_page;
     unsigned int offset;
 
+    ASSERT(opt_fred == 0);
+
     /* No PV guests?  No need to set up SYSCALL/SYSENTER infrastructure. */
     if ( !IS_ENABLED(CONFIG_PV) )
         return;
@@ -268,6 +272,52 @@ static void __init init_ler(void)
     setup_force_cpu_cap(X86_FEATURE_XEN_LBR);
 }
 
+/*
+ * Set up all MSRs relevant for FRED event delivery.
+ *
+ * Xen does not use any of the optional config in MSR_FRED_CONFIG, so all that
+ * is needed is the entrypoint.
+ *
+ * Because FRED always provides a good stack, NMI and #DB do not need any
+ * special treatment.  Only #DF needs another stack level, and #MC for the
+ * offchance that Xen's main stack suffers an uncorrectable error.
+ *
+ * This makes Stack Level 1 unused, but we reuse #DB's stacks for it, with the
+ * regular and shadow stacks reversed, as poison to guarantee that any use
+ * escalates to #DF.
+ *
+ * FRED reuses MSR_STAR to provide the segment selector values to load on
+ * entry from Ring3.  Entry from Ring0 leaves %cs and %ss unmodified.
+ */
+static void init_fred(void)
+{
+    unsigned long stack_top = get_stack_bottom() & ~(STACK_SIZE - 1);
+
+    ASSERT(opt_fred == 1);
+
+    wrmsrns(MSR_STAR, XEN_MSR_STAR);
+    wrmsrns(MSR_FRED_CONFIG, (unsigned long)entry_FRED_R3);
+
+    /*
+     * MSR_FRED_RSP_* all come with a 64-byte alignment check, avoiding the
+     * need for an explicit BUG_ON().
+     */
+    wrmsrns(MSR_FRED_RSP_SL0, (unsigned long)(&get_cpu_info()->_fred + 1));
+    wrmsrns(MSR_FRED_RSP_SL1, stack_top + (IST_DB * IST_SHSTK_SIZE)); /* Poison */
+    wrmsrns(MSR_FRED_RSP_SL2, stack_top + (1 + IST_MCE)  * PAGE_SIZE);
+    wrmsrns(MSR_FRED_RSP_SL3, stack_top + (1 + IST_DF)   * PAGE_SIZE);
+    wrmsrns(MSR_FRED_STK_LVLS, ((2UL << (X86_EXC_MC * 2)) |
+                                (3UL << (X86_EXC_DF * 2))));
+
+    if ( cpu_has_xen_shstk )
+    {
+        wrmsrns(MSR_FRED_SSP_SL0, stack_top + (PRIMARY_SHSTK_SLOT + 1) * PAGE_SIZE);
+        wrmsrns(MSR_FRED_SSP_SL1, stack_top + (1 + IST_DF)  * PAGE_SIZE); /* Poison */
+        wrmsrns(MSR_FRED_SSP_SL2, stack_top + (IST_MCE * IST_SHSTK_SIZE));
+        wrmsrns(MSR_FRED_SSP_SL3, stack_top + (IST_DF  * IST_SHSTK_SIZE));
+    }
+}
+
 /*
  * Configure basic exception handling.  This is prior to parsing the command
  * line or configuring a console, and needs to be as simple as possible.
@@ -329,16 +379,20 @@ void __init traps_init(void)
             printk(XENLOG_INFO "Disabling PV32 due to FRED\n");
         }
 #endif
+
+        init_fred();
+        set_in_cr4(X86_CR4_FRED);
+
         setup_force_cpu_cap(X86_FEATURE_XEN_FRED);
         printk("Using FRED event delivery\n");
     }
     else
     {
+        load_system_tables();
+
         printk("Using IDT event delivery\n");
     }
 
-    load_system_tables();
-
     init_ler();
 
     /* Cache {,compat_}gdt_l1e now that physically relocation is done. */
@@ -356,8 +410,13 @@ void __init traps_init(void)
  */
 void __init bsp_traps_reinit(void)
 {
-    load_system_tables();
-    percpu_traps_init();
+    if ( opt_fred )
+        init_fred();
+    else
+    {
+        load_system_tables();
+        percpu_traps_init();
+    }
 }
 
 /*
@@ -366,7 +425,8 @@ void __init bsp_traps_reinit(void)
  */
 void percpu_traps_init(void)
 {
-    legacy_syscall_init();
+    if ( !opt_fred )
+        legacy_syscall_init();
 
     if ( cpu_has_xen_lbr )
         wrmsrl(MSR_IA32_DEBUGCTLMSR, IA32_DEBUGCTLMSR_LBR);
@@ -381,7 +441,21 @@ void percpu_traps_init(void)
  */
 void asmlinkage percpu_early_traps_init(void)
 {
-    load_system_tables();
+    if ( opt_fred )
+    {
+        const seg_desc_t *gdt = this_cpu(gdt) - FIRST_RESERVED_GDT_ENTRY;
+        const struct desc_ptr gdtr = {
+            .base = (unsigned long)gdt,
+            .limit = LAST_RESERVED_GDT_BYTE,
+        };
+
+        lgdt(&gdtr);
+
+        init_fred();
+        write_cr4(read_cr4() | X86_CR4_FRED);
+    }
+    else
+        load_system_tables();
 }
 
 static void __init __maybe_unused build_assertions(void)
-- 
2.39.5



^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [PATCH v2 15/23] x86/pv: Deduplicate is_canonical_address() in do_set_segment_base()
  2025-08-28 15:03 [PATCH v2 00/23] x86: FRED support Andrew Cooper
                   ` (13 preceding siblings ...)
  2025-08-28 15:04 ` [PATCH v2 14/23] x86/traps: Enable FRED when requested Andrew Cooper
@ 2025-08-28 15:04 ` Andrew Cooper
  2025-09-01 11:22   ` Jan Beulich
  2025-08-28 15:04 ` [PATCH v2 16/23] x86/entry: Alter how IRET faults are recognised Andrew Cooper
                   ` (7 subsequent siblings)
  22 siblings, 1 reply; 56+ messages in thread
From: Andrew Cooper @ 2025-08-28 15:04 UTC (permalink / raw)
  To: Xen-devel; +Cc: Andrew Cooper, Jan Beulich, Roger Pau Monné

This is really a rearrangement to make adding FRED support easier.

No functional change.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
CC: Jan Beulich <JBeulich@suse.com>
CC: Roger Pau Monné <roger.pau@citrix.com>

v2:
 * New

There is a marginal code size improvement:

  add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-46 (-46)
  Function                                     old     new   delta
  do_set_segment_base                          496     450     -46

but it does get undone by the FRED support.
---
 xen/arch/x86/pv/misc-hypercalls.c | 32 ++++++++++++++++---------------
 1 file changed, 17 insertions(+), 15 deletions(-)

diff --git a/xen/arch/x86/pv/misc-hypercalls.c b/xen/arch/x86/pv/misc-hypercalls.c
index 7a37f16bf038..4c2abeb4add8 100644
--- a/xen/arch/x86/pv/misc-hypercalls.c
+++ b/xen/arch/x86/pv/misc-hypercalls.c
@@ -176,27 +176,29 @@ long do_set_segment_base(unsigned int which, unsigned long base)
     switch ( which )
     {
     case SEGBASE_FS:
-        if ( is_canonical_address(base) )
-            write_fs_base(base);
-        else
+    case SEGBASE_GS_USER:
+    case SEGBASE_GS_KERNEL:
+        if ( !is_canonical_address(base) )
+        {
             ret = -EINVAL;
-        break;
+            break;
+        }
 
-    case SEGBASE_GS_USER:
-        if ( is_canonical_address(base) )
+        switch ( which )
         {
-            write_gs_shadow(base);
+        case SEGBASE_FS:
+            write_fs_base(base);
+            break;
+
+        case SEGBASE_GS_USER:
             v->arch.pv.gs_base_user = base;
-        }
-        else
-            ret = -EINVAL;
-        break;
+            write_gs_shadow(base);
+            break;
 
-    case SEGBASE_GS_KERNEL:
-        if ( is_canonical_address(base) )
+        case SEGBASE_GS_KERNEL:
             write_gs_base(base);
-        else
-            ret = -EINVAL;
+            break;
+        }
         break;
 
     case SEGBASE_GS_USER_SEL:
-- 
2.39.5



^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [PATCH v2 16/23] x86/entry: Alter how IRET faults are recognised
  2025-08-28 15:03 [PATCH v2 00/23] x86: FRED support Andrew Cooper
                   ` (14 preceding siblings ...)
  2025-08-28 15:04 ` [PATCH v2 15/23] x86/pv: Deduplicate is_canonical_address() in do_set_segment_base() Andrew Cooper
@ 2025-08-28 15:04 ` Andrew Cooper
  2025-09-01 11:41   ` Jan Beulich
  2025-08-28 15:04 ` [PATCH v2 17/23] x86/entry: Drop the pre exception table infrastructure Andrew Cooper
                   ` (6 subsequent siblings)
  22 siblings, 1 reply; 56+ messages in thread
From: Andrew Cooper @ 2025-08-28 15:04 UTC (permalink / raw)
  To: Xen-devel; +Cc: Andrew Cooper, Jan Beulich, Roger Pau Monné

Right now we have two IRET instructions that can fault for guest reasons, and
the pre exception table gives handle_exception as the fixup for both.

Instead, we can have compat_restore_all_guest() use restore_all_guest()'s IRET
which gives us just a single position to handle specially.

In exception_with_ints_disabled(), remove search_pre_exception_table() and use
a simpler check.  Explain how the recovery works, because this isn't the first
time I've had to figure it out.

The reference to iret_to_guest highlights that any checking here is specific
to CONFIG_PV, so exclude it in !PV builds.

Later in exception_with_ints_disabled(), it suffices to load %ecx rather than
%rcx, and remove a stray semi-colon from the rep movsq.

No functional change.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
CC: Jan Beulich <JBeulich@suse.com>
CC: Roger Pau Monné <roger.pau@citrix.com>

v2:
 * New
---
 xen/arch/x86/x86_64/compat/entry.S |  3 +--
 xen/arch/x86/x86_64/entry.S        | 31 ++++++++++++++++++++++--------
 2 files changed, 24 insertions(+), 10 deletions(-)

diff --git a/xen/arch/x86/x86_64/compat/entry.S b/xen/arch/x86/x86_64/compat/entry.S
index d7b381ea546d..39925d80a677 100644
--- a/xen/arch/x86/x86_64/compat/entry.S
+++ b/xen/arch/x86/x86_64/compat/entry.S
@@ -167,8 +167,7 @@ FUNC(compat_restore_all_guest)
             scf=STK_REL(CPUINFO_scf,      CPUINFO_rip), \
             sel=STK_REL(CPUINFO_verw_sel, CPUINFO_rip)
 
-.Lft0:  iretq
-        _ASM_PRE_EXTABLE(.Lft0, handle_exception)
+        jmp     iret_to_guest
 END(compat_restore_all_guest)
 
 /* Callers can cope with both %rax and %rcx being clobbered. */
diff --git a/xen/arch/x86/x86_64/entry.S b/xen/arch/x86/x86_64/entry.S
index c02245ac064c..01b431793b7b 100644
--- a/xen/arch/x86/x86_64/entry.S
+++ b/xen/arch/x86/x86_64/entry.S
@@ -241,8 +241,9 @@ iret_exit_to_guest:
         SPEC_CTRL_COND_VERW     /* Req: %rsp=eframe                    Clob: efl */
 
         addq  $8,%rsp
-.Lft0:  iretq
-        _ASM_PRE_EXTABLE(.Lft0, handle_exception)
+
+LABEL(iret_to_guest, 0)
+        iretq
 END(restore_all_guest)
 
 /*
@@ -920,10 +921,23 @@ handle_exception_saved:
 exception_with_ints_disabled:
         testb $3,UREGS_cs(%rsp)         # interrupts disabled outside Xen?
         jnz   FATAL_exception_with_ints_disabled
-        movq  %rsp,%rdi
-        call  search_pre_exception_table
-        testq %rax,%rax                 # no fixup code for faulting EIP?
-        jz    .Ldispatch_exceptions
+
+#ifndef CONFIG_PV
+        /* No PV?  No IRETs-to-guest to worry about. */
+        jmp .Ldispatch_exceptions
+#else
+        /* Check to see if the exception was on the IRET to guest context. */
+        lea   iret_to_guest(%rip), %rax
+        cmp   %rax, UREGS_rip(%rsp)
+        jne   .Ldispatch_exceptions
+
+        /*
+         * Recovery is at handle_exception.  It may be necessary to make space
+         * on the interrupted stack for ec/ev, after which the current ec/ev
+         * is copied to make it appear as if this exception occurred in guest
+         * context.
+         */
+        lea   handle_exception(%rip), %rax
         movq  %rax,UREGS_rip(%rsp)      # fixup regular stack
 
 #ifdef CONFIG_XEN_SHSTK
@@ -940,13 +954,14 @@ exception_with_ints_disabled:
         movq  %rsp,%rsi
         subq  $8,%rsp
         movq  %rsp,%rdi
-        movq  $UREGS_kernel_sizeof/8,%rcx
-        rep;  movsq                     # make room for ec/ev
+        mov   $UREGS_kernel_sizeof/8, %ecx
+        rep movsq                       # make room for ec/ev
 1:      movq  UREGS_error_code(%rsp),%rax # ec/ev
         movq  %rax,UREGS_kernel_sizeof(%rsp)
         mov   %r15, STACK_CPUINFO_FIELD(xen_cr3)(%r14)
         mov   %r13b, STACK_CPUINFO_FIELD(use_pv_cr3)(%r14)
         jmp   restore_all_xen           # return to fixup code
+#endif /* !CONFIG_PV */
 
 /* No special register assumptions. */
 FATAL_exception_with_ints_disabled:
-- 
2.39.5



^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [PATCH v2 17/23] x86/entry: Drop the pre exception table infrastructure
  2025-08-28 15:03 [PATCH v2 00/23] x86: FRED support Andrew Cooper
                   ` (15 preceding siblings ...)
  2025-08-28 15:04 ` [PATCH v2 16/23] x86/entry: Alter how IRET faults are recognised Andrew Cooper
@ 2025-08-28 15:04 ` Andrew Cooper
  2025-09-01 11:45   ` Jan Beulich
  2025-08-28 15:04 ` [PATCH v2 18/23] x86/entry: Rework the comment about SYSCALL and DF Andrew Cooper
                   ` (5 subsequent siblings)
  22 siblings, 1 reply; 56+ messages in thread
From: Andrew Cooper @ 2025-08-28 15:04 UTC (permalink / raw)
  To: Xen-devel; +Cc: Andrew Cooper, Jan Beulich, Roger Pau Monné

It is no longer used.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
CC: Jan Beulich <JBeulich@suse.com>
CC: Roger Pau Monné <roger.pau@citrix.com>

v2:
 * New
---
 xen/arch/x86/extable.c               | 14 --------------
 xen/arch/x86/include/asm/asm_defns.h | 11 ++++-------
 xen/arch/x86/include/asm/uaccess.h   |  2 --
 xen/arch/x86/xen.lds.S               |  5 -----
 4 files changed, 4 insertions(+), 28 deletions(-)

diff --git a/xen/arch/x86/extable.c b/xen/arch/x86/extable.c
index cf637d0921e4..a9b6c6b904f5 100644
--- a/xen/arch/x86/extable.c
+++ b/xen/arch/x86/extable.c
@@ -61,7 +61,6 @@ void init_or_livepatch sort_exception_table(struct exception_table_entry *start,
 void __init sort_exception_tables(void)
 {
     sort_exception_table(__start___ex_table, __stop___ex_table);
-    sort_exception_table(__start___pre_ex_table, __stop___pre_ex_table);
 }
 
 static unsigned long
@@ -219,16 +218,3 @@ int __init cf_check stub_selftest(void)
 }
 __initcall(stub_selftest);
 #endif /* CONFIG_SELF_TESTS */
-
-unsigned long asmlinkage search_pre_exception_table(struct cpu_user_regs *regs)
-{
-    unsigned long addr = regs->rip;
-    unsigned long fixup = search_one_extable(
-        __start___pre_ex_table, __stop___pre_ex_table, addr);
-    if ( fixup )
-    {
-        dprintk(XENLOG_INFO, "Pre-exception: %p -> %p\n", _p(addr), _p(fixup));
-        perfc_incr(exception_fixed);
-    }
-    return fixup;
-}
diff --git a/xen/arch/x86/include/asm/asm_defns.h b/xen/arch/x86/include/asm/asm_defns.h
index a81a4043d0f1..d7eafedf0e4c 100644
--- a/xen/arch/x86/include/asm/asm_defns.h
+++ b/xen/arch/x86/include/asm/asm_defns.h
@@ -65,22 +65,19 @@ register unsigned long current_stack_pointer asm("rsp");
 
 /* Exception table entry */
 #ifdef __ASSEMBLY__
-# define _ASM__EXTABLE(sfx, from, to)             \
-    .section .ex_table##sfx, "a" ;                \
+# define _ASM_EXTABLE(from, to)                   \
+    .section .ex_table, "a" ;                     \
     .balign 4 ;                                   \
     .long _ASM_EX(from), _ASM_EX(to) ;            \
     .previous
 #else
-# define _ASM__EXTABLE(sfx, from, to)             \
-    " .section .ex_table" #sfx ",\"a\"\n"         \
+# define _ASM_EXTABLE(from, to)                   \
+    " .section .ex_table,\"a\"\n"                 \
     " .balign 4\n"                                \
     " .long " _ASM_EX(from) ", " _ASM_EX(to) "\n" \
     " .previous\n"
 #endif
 
-#define _ASM_EXTABLE(from, to)     _ASM__EXTABLE(, from, to)
-#define _ASM_PRE_EXTABLE(from, to) _ASM__EXTABLE(.pre, from, to)
-
 #ifdef __ASSEMBLY__
 
 .macro BUILD_BUG_ON condstr, cond:vararg
diff --git a/xen/arch/x86/include/asm/uaccess.h b/xen/arch/x86/include/asm/uaccess.h
index 719d053936b9..4c41a0fe0426 100644
--- a/xen/arch/x86/include/asm/uaccess.h
+++ b/xen/arch/x86/include/asm/uaccess.h
@@ -410,8 +410,6 @@ struct exception_table_entry
 };
 extern struct exception_table_entry __start___ex_table[];
 extern struct exception_table_entry __stop___ex_table[];
-extern struct exception_table_entry __start___pre_ex_table[];
-extern struct exception_table_entry __stop___pre_ex_table[];
 
 union stub_exception_token {
     struct {
diff --git a/xen/arch/x86/xen.lds.S b/xen/arch/x86/xen.lds.S
index 966e514f2034..66075bc0ae6d 100644
--- a/xen/arch/x86/xen.lds.S
+++ b/xen/arch/x86/xen.lds.S
@@ -119,11 +119,6 @@ SECTIONS
        *(.ex_table)
        __stop___ex_table = .;
 
-       /* Pre-exception table */
-       __start___pre_ex_table = .;
-       *(.ex_table.pre)
-       __stop___pre_ex_table = .;
-
        . = ALIGN(PAGE_SIZE);
        __ro_after_init_end = .;
 
-- 
2.39.5



^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [PATCH v2 18/23] x86/entry: Rework the comment about SYSCALL and DF
  2025-08-28 15:03 [PATCH v2 00/23] x86: FRED support Andrew Cooper
                   ` (16 preceding siblings ...)
  2025-08-28 15:04 ` [PATCH v2 17/23] x86/entry: Drop the pre exception table infrastructure Andrew Cooper
@ 2025-08-28 15:04 ` Andrew Cooper
  2025-09-01 11:52   ` Jan Beulich
  2025-08-28 15:04 ` [PATCH v2 19/23] x86/pv: Adjust GS handling for FRED mode Andrew Cooper
                   ` (4 subsequent siblings)
  22 siblings, 1 reply; 56+ messages in thread
From: Andrew Cooper @ 2025-08-28 15:04 UTC (permalink / raw)
  To: Xen-devel; +Cc: Andrew Cooper, Jan Beulich, Roger Pau Monné

It's soon going to be needed in a second location.

Right now it's misleading, saying that nothing else would be cleared.  It also
misses the more important point that SYSCALLs are treated like all other
interrupts and exceptions, and undergo normal flags handling there.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
CC: Jan Beulich <JBeulich@suse.com>
CC: Roger Pau Monné <roger.pau@citrix.com>

v2:
 * New
---
 xen/arch/x86/x86_64/entry.S | 7 +++----
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/xen/arch/x86/x86_64/entry.S b/xen/arch/x86/x86_64/entry.S
index 01b431793b7b..ca446c6ff0ce 100644
--- a/xen/arch/x86/x86_64/entry.S
+++ b/xen/arch/x86/x86_64/entry.S
@@ -39,10 +39,9 @@ FUNC_LOCAL(switch_to_kernel)
         leal  (,%rcx,TBF_INTERRUPT),%ecx
 
         /*
-         * The PV ABI hardcodes the (guest-inaccessible and virtual)
-         * SYSCALL_MASK MSR such that DF (and nothing else) would be cleared.
-         * Note that the equivalent of IF (VGCF_syscall_disables_events) is
-         * dealt with separately above.
+         * The PV ABI, given no virtual SYSCALL_MASK, hardcodes that DF is
+         * cleared.  Other flags are handled in the same way as interrupts and
+         * exceptions in create_bounce_frame().
          */
         mov   $~X86_EFLAGS_DF, %esi
 
-- 
2.39.5



^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [PATCH v2 19/23] x86/pv: Adjust GS handling for FRED mode
  2025-08-28 15:03 [PATCH v2 00/23] x86: FRED support Andrew Cooper
                   ` (17 preceding siblings ...)
  2025-08-28 15:04 ` [PATCH v2 18/23] x86/entry: Rework the comment about SYSCALL and DF Andrew Cooper
@ 2025-08-28 15:04 ` Andrew Cooper
  2025-09-01 12:02   ` Jan Beulich
  2025-08-28 15:04 ` [PATCH v2 20/23] x86/pv: Exception handling in " Andrew Cooper
                   ` (3 subsequent siblings)
  22 siblings, 1 reply; 56+ messages in thread
From: Andrew Cooper @ 2025-08-28 15:04 UTC (permalink / raw)
  To: Xen-devel; +Cc: Andrew Cooper, Jan Beulich, Roger Pau Monné

When FRED is active, hardware automatically swaps GS when changing privilege,
and the SWAPGS instruction is disallowed.

For native OSes using GS as the thread local pointer this is a massive
improvement on the pre-FRED architecture, but under Xen it makes handling PV
guests more complicated.  Specifically, it means that GS_BASE and GS_SHADOW
are the opposite way around in FRED mode, as opposed to IDT mode.

This leads to the following changes:

  * In load_segments(), we have to load both GSes.  Account for this in the
    SWAP() condition and avoid the path with SWAPGS.

  * In save_segments(), we need to read GS_KERN rather than GS_BASE.

  * In toggle_guest_mode(), we need to emulate SWAPGS.

  * In do_set_segment_base(), merge the SEGBASE_GS_{USER,KERNEL} cases and
    take FRED into account when choosing which base to update.

    SEGBASE_GS_USER_SEL was already an LKGS invocation (decades before FRED)
    so under FRED it needs to be a simple MOV %gs.  Simply skip the SWAPGSes.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
CC: Jan Beulich <JBeulich@suse.com>
CC: Roger Pau Monné <roger.pau@citrix.com>

v2:
 * New

I think this functions, but it's not ideal.  The conditions are asymmetric and
awkward.

In principle, MSR_IMM can be as performant as FSGSBASE.  They can literally be
the same microcode if the microline indexing allows.

Otherwise, the FSGSBASE instructions will be more performant than MSR
accesses (no need to decode %ecx), even with non-serialising writes (which all
FRED hardware should have).  However, use of FSGSBASE often comes with SWAPGS
and that can't be used under FRED.
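
For reference, the effect of the XOR used in do_set_segment_base() below,
written out (not part of the patch):

    /*
     * (which == SEGBASE_GS_KERNEL) ^ opt_fred selects the hardware slot:
     *
     *   which              IDT mode             FRED mode
     *   SEGBASE_GS_KERNEL  write_gs_base()      write_gs_shadow()
     *   SEGBASE_GS_USER    write_gs_shadow()    write_gs_base()
     *
     * i.e. the guest kernel's GS lives in the opposite slot when FRED is
     * active, matching the swap hardware performs on event delivery.
     */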
---
 xen/arch/x86/domain.c             | 22 +++++++++++++++++-----
 xen/arch/x86/pv/domain.c          | 22 ++++++++++++++++++++--
 xen/arch/x86/pv/misc-hypercalls.c | 16 ++++++++++------
 3 files changed, 47 insertions(+), 13 deletions(-)

diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
index 8089ff929bf7..64922869a625 100644
--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -1819,9 +1819,10 @@ static void load_segments(struct vcpu *n)
 
         /*
          * Figure out which way around gsb/gss want to be.  gsb needs to be
-         * the active context, and gss needs to be the inactive context.
+         * the active context, and gss needs to be the inactive context,
+         * unless we're in FRED mode where they're reversed.
          */
-        if ( !(n->arch.flags & TF_kernel_mode) )
+        if ( !(n->arch.flags & TF_kernel_mode) ^ opt_fred )
             SWAP(gsb, gss);
 
         if ( using_svm() && (n->arch.pv.fs | n->arch.pv.gs) <= 3 )
@@ -1842,7 +1843,9 @@ static void load_segments(struct vcpu *n)
 
     if ( !fs_gs_done && !compat )
     {
-        if ( read_cr4() & X86_CR4_FSGSBASE )
+        unsigned long cr4 = read_cr4();
+
+        if ( !(cr4 & X86_CR4_FRED) && (cr4 & X86_CR4_FSGSBASE) )
         {
             __wrgsbase(gss);
             __wrfsbase(n->arch.pv.fs_base);
@@ -1959,6 +1962,9 @@ static void load_segments(struct vcpu *n)
  * Guests however cannot use SWAPGS, so there is no mechanism to modify the
  * inactive GS base behind Xen's back.  Therefore, Xen's copy of the inactive
  * GS base is still accurate, and doesn't need reading back from hardware.
+ *
+ * Under FRED, hardware automatically swaps GS for us, so GS_KERN is the
+ * active GS from the guest's point of view.
  */
 static void save_segments(struct vcpu *v)
 {
@@ -1974,12 +1980,18 @@ static void save_segments(struct vcpu *v)
         if ( read_cr4() & X86_CR4_FSGSBASE )
         {
             fs_base = __rdfsbase();
-            gs_base = __rdgsbase();
+            if ( opt_fred )
+                gs_base = rdmsr(MSR_SHADOW_GS_BASE);
+            else
+                gs_base = __rdgsbase();
         }
         else
         {
             fs_base = rdmsr(MSR_FS_BASE);
-            gs_base = rdmsr(MSR_GS_BASE);
+            if ( opt_fred )
+                gs_base = rdmsr(MSR_SHADOW_GS_BASE);
+            else
+                gs_base = rdmsr(MSR_GS_BASE);
         }
 
         v->arch.pv.fs_base = fs_base;
diff --git a/xen/arch/x86/pv/domain.c b/xen/arch/x86/pv/domain.c
index 9c4785c187dd..5a7b69da5000 100644
--- a/xen/arch/x86/pv/domain.c
+++ b/xen/arch/x86/pv/domain.c
@@ -14,9 +14,10 @@
 #include <asm/cpufeature.h>
 #include <asm/fsgsbase.h>
 #include <asm/invpcid.h>
-#include <asm/spec_ctrl.h>
 #include <asm/pv/domain.h>
 #include <asm/shadow.h>
+#include <asm/spec_ctrl.h>
+#include <asm/traps.h>
 
 #ifdef CONFIG_PV32
 int8_t __read_mostly opt_pv32 = -1;
@@ -480,11 +481,28 @@ void toggle_guest_mode(struct vcpu *v)
      * subsequent context switch won't bother re-reading it.
      */
     gs_base = read_gs_base();
+
+    /*
+     * In FRED mode, not only are the two GSes the other way around (i.e. we
+     * want to read GS_KERN here), but the SWAPGS instruction is also
+     * disallowed, so we have to emulate it.
+     */
+    if ( opt_fred )
+    {
+        unsigned long gs_kern = rdmsr(MSR_SHADOW_GS_BASE);
+
+        wrmsrns(MSR_SHADOW_GS_BASE, gs_base);
+        write_gs_base(gs_kern);
+
+        gs_base = gs_kern;
+    }
+    else
+        asm volatile ( "swapgs" );
+
     if ( v->arch.flags & TF_kernel_mode )
         v->arch.pv.gs_base_kernel = gs_base;
     else
         v->arch.pv.gs_base_user = gs_base;
-    asm volatile ( "swapgs" );
 
     _toggle_guest_pt(v);
 
diff --git a/xen/arch/x86/pv/misc-hypercalls.c b/xen/arch/x86/pv/misc-hypercalls.c
index 4c2abeb4add8..2c9cf50638db 100644
--- a/xen/arch/x86/pv/misc-hypercalls.c
+++ b/xen/arch/x86/pv/misc-hypercalls.c
@@ -11,6 +11,7 @@
 
 #include <asm/debugreg.h>
 #include <asm/fsgsbase.h>
+#include <asm/traps.h>
 
 long do_set_debugreg(int reg, unsigned long value)
 {
@@ -192,11 +193,12 @@ long do_set_segment_base(unsigned int which, unsigned long base)
 
         case SEGBASE_GS_USER:
             v->arch.pv.gs_base_user = base;
-            write_gs_shadow(base);
-            break;
-
+            fallthrough;
         case SEGBASE_GS_KERNEL:
-            write_gs_base(base);
+            if ( (which == SEGBASE_GS_KERNEL) ^ opt_fred )
+                write_gs_base(base);
+            else
+                write_gs_shadow(base);
             break;
         }
         break;
@@ -209,7 +211,8 @@ long do_set_segment_base(unsigned int which, unsigned long base)
          * We wish to update the user %gs from the GDT/LDT.  Currently, the
          * guest kernel's GS_BASE is in context.
          */
-        asm volatile ( "swapgs" );
+        if ( !opt_fred )
+            asm volatile ( "swapgs" );
 
         if ( sel > 3 )
             /* Fix up RPL for non-NUL selectors. */
@@ -247,7 +250,8 @@ long do_set_segment_base(unsigned int which, unsigned long base)
         /* Update the cache of the inactive base, as read from the GDT/LDT. */
         v->arch.pv.gs_base_user = read_gs_base();
 
-        asm volatile ( safe_swapgs );
+        if ( !opt_fred )
+            asm volatile ( safe_swapgs );
         break;
     }
 
-- 
2.39.5



^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [PATCH v2 20/23] x86/pv: Exception handling in FRED mode
  2025-08-28 15:03 [PATCH v2 00/23] x86: FRED support Andrew Cooper
                   ` (18 preceding siblings ...)
  2025-08-28 15:04 ` [PATCH v2 19/23] x86/pv: Adjust GS handling for FRED mode Andrew Cooper
@ 2025-08-28 15:04 ` Andrew Cooper
  2025-09-01 12:23   ` Jan Beulich
  2025-09-01 12:57   ` Jan Beulich
  2025-08-28 15:04 ` [PATCH v2 21/23] x86/pv: ERETU error handling Andrew Cooper
                   ` (2 subsequent siblings)
  22 siblings, 2 replies; 56+ messages in thread
From: Andrew Cooper @ 2025-08-28 15:04 UTC (permalink / raw)
  To: Xen-devel; +Cc: Andrew Cooper, Jan Beulich, Roger Pau Monné

Under FRED, entry_from_pv() handles everything.  To start with, implement
exception handling in the same manner as entry_from_xen(), although we can
unconditionally enable interrupts after the async/fatal events.

After entry_from_pv() returns, test_all_events() needs to run to perform
exception and interrupt injection.  Split entry_FRED_R3() into two and
introduce eretu_exit_to_guest() as the latter half, reached directly from
restore_all_guest() when FRED is active.

For all of this, there is a slightly complicated relationship with CONFIG_PV.
entry_FRED_R3() must exist irrespective of CONFIG_PV, because it's the
entrypoint registered with hardware.  For simplicity, entry_from_pv() is
always called, but it collapses into fatal_trap() in the !PV case.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
CC: Jan Beulich <JBeulich@suse.com>
CC: Roger Pau Monné <roger.pau@citrix.com>

v2:
 * New
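
For orientation, the intended flow for a Ring3 event once this patch is in
place, pieced together from the hunks below (sketch only):

    /*
     * entry_FRED_R3:                    hardware delivers the ring3 event
     *     PUSH_AND_CLEAR_GPRS
     *     call entry_from_pv            C-level exception/interrupt handling
     *     jmp  test_all_events          softirq processing, event injection
     *         ...
     *     restore_all_guest:
     *         jmp eretu_exit_to_guest   patched in via X86_FEATURE_XEN_FRED
     *             POP_GPRS
     *             eretu
     */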
---
 xen/arch/x86/traps.c             | 76 +++++++++++++++++++++++++++++++-
 xen/arch/x86/x86_64/entry-fred.S | 13 +++++-
 xen/arch/x86/x86_64/entry.S      |  4 +-
 3 files changed, 90 insertions(+), 3 deletions(-)

diff --git a/xen/arch/x86/traps.c b/xen/arch/x86/traps.c
index 67763bec0dc5..72df446a6a78 100644
--- a/xen/arch/x86/traps.c
+++ b/xen/arch/x86/traps.c
@@ -2265,9 +2265,83 @@ void asmlinkage check_ist_exit(const struct cpu_user_regs *regs, bool ist_exit)
 
 void asmlinkage entry_from_pv(struct cpu_user_regs *regs)
 {
+    struct fred_info *fi = cpu_regs_fred_info(regs);
+    uint8_t type = regs->fred_ss.type;
+    uint8_t vec = regs->fred_ss.vector;
+
     /* Copy fred_ss.vector into entry_vector as IDT delivery would have done. */
-    regs->entry_vector = regs->fred_ss.vector;
+    regs->entry_vector = vec;
+
+    if ( !IS_ENABLED(CONFIG_PV) )
+        goto fatal;
+
+    /*
+     * First, handle the asynchronous or fatal events.  These are either
+     * unrelated to the interrupted context, or may not have valid context
+     * recorded, and all have special rules on how/whether to re-enable IRQs.
+     */
+    switch ( type )
+    {
+    case X86_ET_EXT_INTR:
+        return do_IRQ(regs);
 
+    case X86_ET_NMI:
+        return do_nmi(regs);
+
+    case X86_ET_HW_EXC:
+        switch ( vec )
+        {
+        case X86_EXC_DF: return do_double_fault(regs);
+        case X86_EXC_MC: return do_machine_check(regs);
+        }
+        break;
+    }
+
+    /*
+     * With the asynchronous events handled, what remains are the synchronous
+     * ones.  Guest context always had interrupts enabled.
+     */
+    local_irq_enable();
+
+    switch ( type )
+    {
+    case X86_ET_HW_EXC:
+    case X86_ET_PRIV_SW_EXC:
+    case X86_ET_SW_EXC:
+        switch ( vec )
+        {
+        case X86_EXC_PF:  handle_PF(regs, fi->edata); break;
+        case X86_EXC_GP:  do_general_protection(regs); break;
+        case X86_EXC_UD:  do_invalid_op(regs); break;
+        case X86_EXC_NM:  do_device_not_available(regs); break;
+        case X86_EXC_BP:  do_int3(regs); break;
+        case X86_EXC_DB:  handle_DB(regs, fi->edata); break;
+
+        case X86_EXC_DE:
+        case X86_EXC_OF:
+        case X86_EXC_BR:
+        case X86_EXC_NP:
+        case X86_EXC_SS:
+        case X86_EXC_MF:
+        case X86_EXC_AC:
+        case X86_EXC_XM:
+            do_trap(regs);
+            break;
+
+        case X86_EXC_CP:  do_entry_CP(regs); break;
+
+        default:
+            goto fatal;
+        }
+        break;
+
+    default:
+        goto fatal;
+    }
+
+    return;
+
+ fatal:
     fatal_trap(regs, false);
 }
 
diff --git a/xen/arch/x86/x86_64/entry-fred.S b/xen/arch/x86/x86_64/entry-fred.S
index 3c3320df22cb..07684f38a078 100644
--- a/xen/arch/x86/x86_64/entry-fred.S
+++ b/xen/arch/x86/x86_64/entry-fred.S
@@ -15,9 +15,20 @@ FUNC(entry_FRED_R3, 4096)
         mov     %rsp, %rdi
         call    entry_from_pv
 
+#ifndef CONFIG_PV
+        BUG     /* Not Reached */
+#else
+        GET_STACK_END(14)
+        movq    STACK_CPUINFO_FIELD(current_vcpu)(%r14), %rbx
+
+        jmp     test_all_events
+#endif
+END(entry_FRED_R3)
+
+FUNC(eretu_exit_to_guest)
         POP_GPRS
         eretu
-END(entry_FRED_R3)
+END(eretu_exit_to_guest)
 
         /* The Ring0 entrypoint is at Ring3 + 0x100. */
         .org entry_FRED_R3 + 0x100, 0xcc
diff --git a/xen/arch/x86/x86_64/entry.S b/xen/arch/x86/x86_64/entry.S
index ca446c6ff0ce..0692163faa44 100644
--- a/xen/arch/x86/x86_64/entry.S
+++ b/xen/arch/x86/x86_64/entry.S
@@ -63,7 +63,7 @@ UNLIKELY_END(syscall_no_callback)
         /* Conditionally clear DF */
         and   %esi, UREGS_eflags(%rsp)
 /* %rbx: struct vcpu */
-test_all_events:
+LABEL(test_all_events, 0)
         ASSERT_NOT_IN_ATOMIC
         cli                             # tests must not race interrupts
 /*test_softirqs:*/
@@ -152,6 +152,8 @@ END(switch_to_kernel)
 FUNC_LOCAL(restore_all_guest)
         ASSERT_INTERRUPTS_DISABLED
 
+        ALTERNATIVE "", "jmp eretu_exit_to_guest", X86_FEATURE_XEN_FRED
+
         /* Stash guest SPEC_CTRL value while we can read struct vcpu. */
         mov VCPU_arch_msrs(%rbx), %rdx
         mov VCPUMSR_spec_ctrl_raw(%rdx), %r15d
-- 
2.39.5



^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [PATCH v2 21/23] x86/pv: ERETU error handling
  2025-08-28 15:03 [PATCH v2 00/23] x86: FRED support Andrew Cooper
                   ` (19 preceding siblings ...)
  2025-08-28 15:04 ` [PATCH v2 20/23] x86/pv: Exception handling in " Andrew Cooper
@ 2025-08-28 15:04 ` Andrew Cooper
  2025-09-01 13:13   ` Jan Beulich
  2025-08-28 15:04 ` [PATCH v2 22/23] x86/pv: System call handling in FRED mode Andrew Cooper
  2025-08-28 15:04 ` [PATCH v2 23/23] x86/pv: Adjust eflags handling for " Andrew Cooper
  22 siblings, 1 reply; 56+ messages in thread
From: Andrew Cooper @ 2025-08-28 15:04 UTC (permalink / raw)
  To: Xen-devel; +Cc: Andrew Cooper, Jan Beulich, Roger Pau Monné

ERETU can fault for guest reasons, and like IRET needs special handling to
forward the error into the guest.

As this is largely written in C, take the opportunity to better classify the
sources of error, and in particular, not forward errors that are actually
Xen's fault into the guest, opting for a domain crash instead.

Because ERETU does not re-enable NMIs if it faults, a corner case exists if an
NMI was taken while in guest context and the ERETU back out to the guest then
faults.  Recovery must involve an ERETS carrying the interrupted context's NMI
flag.

See the comments for full details.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
CC: Jan Beulich <JBeulich@suse.com>
CC: Roger Pau Monné <roger.pau@citrix.com>

v2:
 * New
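
A sketch of the intended recovery flow, pieced together from the hunks below:

    /*
     * eretu faults
     *   -> entry_FRED_R0 / entry_from_xen()       regs->rip == eretu
     *   -> handle_eretu_event()
     *        guest's fault: rewrite the interrupted (guest) frame with the
     *                       new exception info, recover at entry_FRED_R3
     *        Xen's fault:   recover at eretu_error_dom_crash
     *   -> ERETS back out, carrying the interrupted frame's NMI flag
     */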
---
 xen/arch/x86/traps.c             | 115 +++++++++++++++++++++++++++++++
 xen/arch/x86/x86_64/entry-fred.S |  13 ++++
 2 files changed, 128 insertions(+)

diff --git a/xen/arch/x86/traps.c b/xen/arch/x86/traps.c
index 72df446a6a78..e10b4e771824 100644
--- a/xen/arch/x86/traps.c
+++ b/xen/arch/x86/traps.c
@@ -2345,6 +2345,113 @@ void asmlinkage entry_from_pv(struct cpu_user_regs *regs)
     fatal_trap(regs, false);
 }
 
+void nocall eretu_error_dom_crash(void);
+
+/*
+ * Classify an event at the ERETU instruction, and handle if possible.
+ * Returns @true if handled, @false if the event should continue down the
+ * normal handlers.
+ */
+static bool handle_eretu_event(struct cpu_user_regs *regs)
+{
+    unsigned long recover;
+
+    /*
+     * WARNING: The GPRs in gregs overlap with regs.  Only gregs->error_code
+     *          and later are legitimate to access.
+     */
+    struct cpu_user_regs *gregs =
+        _p(regs->rsp - offsetof(struct cpu_user_regs, error_code));
+
+    /*
+     * The asynchronous or fatal events (INTR, NMI, #MC, #DF) have been dealt
+     * with, meaning we only have synchronous ones to consider.  Anything
+     * which isn't a hardware exception wants handling normally.
+     */
+    if ( regs->fred_ss.type != X86_ET_HW_EXC )
+        return false;
+
+    /*
+     * Guests are permitted to write non-present GDT/LDT entries.  Therefore
+     * #NP[sel] (%cs) and #SS[sel] (%ss) must be handled as guest errors.  The
+     * only other source of #SS is for a bad %ss-relative memory access in
+     * Xen, and if the stack is that bad, we'll have escalated to #DF.
+     *
+     * #PF can happen from ERETU accessing the GDT/LDT.  Xen may translate
+     * these into #GP for the guest, so must be handled as guest errors.  In
+     * theory we can get #PF for a bad instruction fetch or bad stack access,
+     * but either of these will be fatal and not end up here.
+     */
+    switch ( regs->fred_ss.vector )
+    {
+    case X86_EXC_GP:
+        /*
+         * #GP[0] can occur because of a NULL %cs or %ss (which are a guest
+         * error), but some #GP[0]'s are errors in Xen (ERETU at SL != 0), or
+         * errors of Xen handling guest state (bad metadata).  These magic
+         * numbers came from the FRED Spec; they check that ERETU is trying to
+         * return to Ring 3, and that reserved or inapplicable bits are 0.
+         */
+        if ( regs->error_code == 0 && (gregs->cs & ~3) && (gregs->ss & ~3) &&
+             (regs->fred_cs.sl != 0 ||
+              (gregs->csx    & 0xffffffffffff0003UL) != 3 ||
+              (gregs->rflags & 0xffffffffffc2b02aUL) != 2 ||
+              (gregs->ssx    &         0xfff80003UL) != 3) )
+        {
+            recover = (unsigned long)eretu_error_dom_crash;
+
+            if ( regs->fred_cs.sl )
+                gprintk(XENLOG_ERR, "ERETU at SL %u\n", regs->fred_cs.sl);
+            else
+                gprintk(XENLOG_ERR, "Bad return state: csx %#lx, rflags %#lx, ssx %#x\n",
+                        gregs->csx, gregs->rflags, (unsigned int)gregs->ssx);
+            break;
+        }
+        fallthrough;
+    case X86_EXC_NP:
+    case X86_EXC_SS:
+    case X86_EXC_PF:
+        recover = (unsigned long)entry_FRED_R3;
+        break;
+
+        /*
+         * Handle everything else normally.  #BP and #DB would be debugging
+         * activities in Xen.  In theory we can get #UD if CR4.FRED gets
+         * cleared, but in practice if that were the case we wouldn't be here
+         * handling the result.
+         */
+    default:
+        return false;
+    }
+
+    this_cpu(last_extable_addr) = regs->rip;
+
+    /*
+     * Everything else is recoverable, one way or another.
+     *
+     * If an NMI was taken in guest context and the ERETU faulted, NMIs will
+     * still be blocked.  Therefore we copy the interrupted frame's NMI status
+     * into our own, and must ERETS as part of recovery.
+     */
+    regs->fred_ss.nmi = gregs->fred_ss.nmi;
+
+    /*
+     * Next, copy the exception information from the current frame back onto
+     * the interrupted frame, preserving the interrupted frame's %cs and %ss.
+     */
+    *cpu_regs_fred_info(regs) = *cpu_regs_fred_info(gregs);
+    gregs->ssx = (regs->ssx & ~0xffff) | gregs->ss;
+    gregs->csx = (regs->csx & ~0xffff) | gregs->cs;
+    gregs->error_code   = regs->error_code;
+    gregs->entry_vector = regs->entry_vector;
+
+    fixup_exception_return(regs, recover, 0);
+
+    return true;
+}
+
+void nocall eretu(void);
+
 void asmlinkage entry_from_xen(struct cpu_user_regs *regs)
 {
     struct fred_info *fi = cpu_regs_fred_info(regs);
@@ -2383,6 +2490,14 @@ void asmlinkage entry_from_xen(struct cpu_user_regs *regs)
     if ( regs->eflags & X86_EFLAGS_IF )
         local_irq_enable();
 
+    /*
+     * An event taken at the ERETU instruction may be because of guest state
+     * and in that case will need special handling.
+     */
+    if ( unlikely(regs->rip == (unsigned long)eretu) &&
+         handle_eretu_event(regs) )
+        return;
+
     switch ( type )
     {
     case X86_ET_HW_EXC:
diff --git a/xen/arch/x86/x86_64/entry-fred.S b/xen/arch/x86/x86_64/entry-fred.S
index 07684f38a078..8b5cafb866e2 100644
--- a/xen/arch/x86/x86_64/entry-fred.S
+++ b/xen/arch/x86/x86_64/entry-fred.S
@@ -27,9 +27,22 @@ END(entry_FRED_R3)
 
 FUNC(eretu_exit_to_guest)
         POP_GPRS
+
+        /*
+         * Exceptions here are handled by redirecting either to
+         * entry_FRED_R3() (for an error to be passed to the guest), or to
+         * eretu_error_dom_crash() (for a Xen error handling guest state).
+         */
+LABEL(eretu, 0)
         eretu
 END(eretu_exit_to_guest)
 
+FUNC(eretu_error_dom_crash)
+        PUSH_AND_CLEAR_GPRS
+        sti
+        call    asm_domain_crash_synchronous  /* Does not return */
+END(eretu_error_dom_crash)
+
         /* The Ring0 entrypoint is at Ring3 + 0x100. */
         .org entry_FRED_R3 + 0x100, 0xcc
 
-- 
2.39.5



^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [PATCH v2 22/23] x86/pv: System call handling in FRED mode
  2025-08-28 15:03 [PATCH v2 00/23] x86: FRED support Andrew Cooper
                   ` (20 preceding siblings ...)
  2025-08-28 15:04 ` [PATCH v2 21/23] x86/pv: ERETU error handling Andrew Cooper
@ 2025-08-28 15:04 ` Andrew Cooper
  2025-09-01 13:49   ` Jan Beulich
  2025-08-28 15:04 ` [PATCH v2 23/23] x86/pv: Adjust eflags handling for " Andrew Cooper
  22 siblings, 1 reply; 56+ messages in thread
From: Andrew Cooper @ 2025-08-28 15:04 UTC (permalink / raw)
  To: Xen-devel; +Cc: Andrew Cooper, Jan Beulich, Roger Pau Monné

Under FRED, entry_from_pv() handles everything, even system calls.  This means
more of our logic is written in C now, rather than assembly.

In order to facilitate this, introduce pv_inject_callback(), which reuses
the struct trap_bounce infrastructure to inject the syscall/sysenter callbacks.
This in turn requires some !PV compatibility for pv_inject_callback() and
pv_hypercall() which can both be ASSERT_UNREACHABLE().

For each of INT $N, SYSCALL and SYSENTER, FRED gives us the interrupted context
which was previously lost.  As the guest can't see FRED, Xen has to lose state
in the same way to maintain the prior behaviour.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
CC: Jan Beulich <JBeulich@suse.com>
CC: Roger Pau Monné <roger.pau@citrix.com>

v2:
 * New
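
For reference, the callback types the new helper accepts, and the paths they
are expected to serve (sketch only; the exact dispatch is in the
entry_from_pv() hunk below):

    /* One call per path, using the callback types the helper accepts. */
    pv_inject_callback(CALLBACKTYPE_syscall);    /* 64-bit SYSCALL */
    pv_inject_callback(CALLBACKTYPE_syscall32);  /* 32-bit SYSCALL */
    pv_inject_callback(CALLBACKTYPE_sysenter);   /* SYSENTER       */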
---
 xen/arch/x86/include/asm/domain.h    |   5 ++
 xen/arch/x86/include/asm/hypercall.h |   5 ++
 xen/arch/x86/pv/traps.c              |  33 +++++++++
 xen/arch/x86/traps.c                 | 107 +++++++++++++++++++++++++++
 4 files changed, 150 insertions(+)

diff --git a/xen/arch/x86/include/asm/domain.h b/xen/arch/x86/include/asm/domain.h
index 5df8c7825333..b374decccc9c 100644
--- a/xen/arch/x86/include/asm/domain.h
+++ b/xen/arch/x86/include/asm/domain.h
@@ -712,11 +712,16 @@ int arch_set_info_hvm_guest(struct vcpu *v, const struct vcpu_hvm_context *ctx);
 
 #ifdef CONFIG_PV
 void pv_inject_event(const struct x86_event *event);
+void pv_inject_callback(unsigned int type);
 #else
 static inline void pv_inject_event(const struct x86_event *event)
 {
     ASSERT_UNREACHABLE();
 }
+static inline void pv_inject_callback(unsigned int type)
+{
+    ASSERT_UNREACHABLE();
+}
 #endif
 
 static inline void pv_inject_hw_exception(unsigned int vector, int errcode)
diff --git a/xen/arch/x86/include/asm/hypercall.h b/xen/arch/x86/include/asm/hypercall.h
index f6e9e2313b3c..1010332a47e9 100644
--- a/xen/arch/x86/include/asm/hypercall.h
+++ b/xen/arch/x86/include/asm/hypercall.h
@@ -20,6 +20,11 @@
 
 #ifdef CONFIG_PV
 void pv_hypercall(struct cpu_user_regs *regs);
+#else
+static inline void pv_hypercall(struct cpu_user_regs *regs)
+{
+    ASSERT_UNREACHABLE();
+}
 #endif
 
 void pv_ring1_init_hypercall_page(void *ptr);
diff --git a/xen/arch/x86/pv/traps.c b/xen/arch/x86/pv/traps.c
index c3c0976c440f..e7314d8703d9 100644
--- a/xen/arch/x86/pv/traps.c
+++ b/xen/arch/x86/pv/traps.c
@@ -19,6 +19,8 @@
 #include <asm/shared.h>
 #include <asm/traps.h>
 
+#include <public/callback.h>
+
 void pv_inject_event(const struct x86_event *event)
 {
     struct vcpu *curr = current;
@@ -95,6 +97,37 @@ void pv_inject_event(const struct x86_event *event)
     }
 }
 
+void pv_inject_callback(unsigned int type)
+{
+    struct vcpu *curr = current;
+    struct trap_bounce *tb = &curr->arch.pv.trap_bounce;
+    unsigned long rip = 0;
+    bool irq = false;
+
+    ASSERT(is_pv_64bit_vcpu(curr));
+
+    switch ( type )
+    {
+    case CALLBACKTYPE_syscall:
+        rip = curr->arch.pv.syscall_callback_eip;
+        irq = curr->arch.pv.vgc_flags & VGCF_syscall_disables_events;
+        break;
+
+    case CALLBACKTYPE_syscall32:
+        rip = curr->arch.pv.syscall32_callback_eip;
+        irq = curr->arch.pv.syscall32_disables_events;
+        break;
+
+    case CALLBACKTYPE_sysenter:
+        rip = curr->arch.pv.sysenter_callback_eip;
+        irq = curr->arch.pv.sysenter_disables_events;
+        break;
+    }
+
+    tb->flags = TBF_EXCEPTION | (irq ? TBF_INTERRUPT : 0);
+    tb->eip = rip;
+}
+
 /*
  * Called from asm to set up the MCE trapbounce info.
  * Returns false no callback is set up, else true.
diff --git a/xen/arch/x86/traps.c b/xen/arch/x86/traps.c
index e10b4e771824..9211067cd688 100644
--- a/xen/arch/x86/traps.c
+++ b/xen/arch/x86/traps.c
@@ -18,6 +18,7 @@
 #include <xen/delay.h>
 #include <xen/domain_page.h>
 #include <xen/guest_access.h>
+#include <xen/hypercall.h>
 #include <xen/init.h>
 #include <xen/mm.h>
 #include <xen/paging.h>
@@ -52,6 +53,8 @@
 #include <asm/uaccess.h>
 #include <asm/xenoprof.h>
 
+#include <public/callback.h>
+
 /*
  * opt_nmi: one of 'ignore', 'dom0', or 'fatal'.
  *  fatal:  Xen prints diagnostic message and then hangs.
@@ -2266,6 +2269,7 @@ void asmlinkage check_ist_exit(const struct cpu_user_regs *regs, bool ist_exit)
 void asmlinkage entry_from_pv(struct cpu_user_regs *regs)
 {
     struct fred_info *fi = cpu_regs_fred_info(regs);
+    struct vcpu *curr = current;
     uint8_t type = regs->fred_ss.type;
     uint8_t vec = regs->fred_ss.vector;
 
@@ -2305,6 +2309,27 @@ void asmlinkage entry_from_pv(struct cpu_user_regs *regs)
 
     switch ( type )
     {
+    case X86_ET_SW_INT:
+        /*
+         * INT $3/4 are indistinguishable from INT3/INTO under IDT, and are
+         * permitted by Xen without the guest kernel having a choice.  Let
+         * them fall through into X86_ET_HW_EXC, as #BP in particular needs
+         * handling by do_int3() in case an external debugger is attached.
+         */
+        if ( vec != X86_EXC_BP && vec != X86_EXC_OF )
+        {
+            const struct trap_info *ti = &curr->arch.pv.trap_ctxt[vec];
+
+            if ( permit_softint(TI_GET_DPL(ti), curr, regs) )
+                pv_inject_sw_interrupt(vec);
+            else
+            {
+                regs->rip -= 2;
+                pv_inject_hw_exception(X86_EXC_GP, (vec << 3) | X86_XEC_IDT);
+            }
+            break;
+        }
+        fallthrough;
     case X86_ET_HW_EXC:
     case X86_ET_PRIV_SW_EXC:
     case X86_ET_SW_EXC:
@@ -2335,6 +2360,88 @@ void asmlinkage entry_from_pv(struct cpu_user_regs *regs)
         }
         break;
 
+    case X86_ET_OTHER:
+        switch ( regs->fred_ss.vector )
+        {
+        case 1: /* SYSCALL */
+        {
+            /*
+             * FRED delivery preserves the interrupted %cs/%ss, but previously
+             * SYSCALL lost the interrupted selectors, and SYSRET forced the
+             * use of the ones in MSR_STAR.
+             *
+             * The guest isn't aware of FRED, so recreate the legacy
+             * behaviour, including the guess of instruction length for
+             * faults.
+             *
+             * The non-FRED SYSCALL path sets TRAP_syscall in entry_vector to
+             * signal that SYSRET can be used, but this isn't relevant in FRED
+             * mode.
+             *
+             * When setting the selectors, clear all upper metadata again for
+             * backwards compatibility.  In particular fred_ss.swint becomes
+             * pend_DB on ERETx, and nothing else in the pv_hypercall() would
+             * clean up.
+             */
+            bool l = regs->fred_ss.l;
+
+            regs->ssx = l ? FLAT_KERNEL_SS   : FLAT_USER_SS32;
+            regs->csx = l ? FLAT_KERNEL_CS64 : FLAT_USER_CS32;
+
+            if ( guest_kernel_mode(curr, regs) )
+                pv_hypercall(regs);
+            else if ( (l ? curr->arch.pv.syscall_callback_eip
+                         : curr->arch.pv.syscall32_callback_eip) == 0 )
+            {
+                regs->rip -= 2;
+                pv_inject_hw_exception(X86_EXC_UD, X86_EVENT_NO_EC);
+            }
+            else
+            {
+                /*
+                 * The PV ABI, given no virtual SYSCALL_MASK, hardcodes that
+                 * DF is cleared.  Other flags are handled in the same way as
+                 * interrupts and exceptions in create_bounce_frame().
+                 */
+                regs->eflags &= ~X86_EFLAGS_DF;
+                pv_inject_callback(l ? CALLBACKTYPE_syscall
+                                     : CALLBACKTYPE_syscall32);
+            }
+            break;
+        }
+
+        case 2: /* SYSENTER */
+            /*
+             * FRED delivery preserves the interrupted state, but previously
+             * SYSENTER discarded almost everything.
+             *
+             * The guest isn't aware of FRED, so recreate the legacy
+             * behaviour, including the guess of instruction length for
+             * faults.
+             *
+             * When setting the selectors, clear all upper metadata.  In
+             * particular fred_ss.swint becomes pend_DB on ERETx.
+             */
+            regs->ssx = FLAT_USER_SS;
+            regs->rsp = 0;
+            regs->eflags &= ~(X86_EFLAGS_VM | X86_EFLAGS_IF);
+            regs->csx = 3;
+            regs->rip = 0;
+
+            if ( !curr->arch.pv.sysenter_callback_eip )
+            {
+                regs->rip -= 2;
+                pv_inject_hw_exception(X86_EXC_GP, 0);
+            }
+            else
+                pv_inject_callback(CALLBACKTYPE_sysenter);
+            break;
+
+        default:
+            goto fatal;
+        }
+        break;
+
     default:
         goto fatal;
     }
-- 
2.39.5



^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [PATCH v2 23/23] x86/pv: Adjust eflags handling for FRED mode
  2025-08-28 15:03 [PATCH v2 00/23] x86: FRED support Andrew Cooper
                   ` (21 preceding siblings ...)
  2025-08-28 15:04 ` [PATCH v2 22/23] x86/pv: System call handling in FRED mode Andrew Cooper
@ 2025-08-28 15:04 ` Andrew Cooper
  2025-09-01 14:17   ` Jan Beulich
  22 siblings, 1 reply; 56+ messages in thread
From: Andrew Cooper @ 2025-08-28 15:04 UTC (permalink / raw)
  To: Xen-devel; +Cc: Andrew Cooper, Jan Beulich, Roger Pau Monné

ERETU, unlike IRET, requires the sticky-1 bit (bit 1, i.e. X86_EFLAGS_MBS) to
be set, and the reserved bits to be clear.  Notably this means that
dom0_construct() must set X86_EFLAGS_MBS in order for a PV dom0 to start.

Adjust arch_set_info_guest*() and hypercall_iret() which consume flags to
clamp the reserved bits.

This is a minor ABI change, but by the same argument as commit
9f892f84c279 ("x86/domctl: Stop using XLAT_cpu_user_regs()"), this change will
happen naturally when the vCPU schedules.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
CC: Jan Beulich <JBeulich@suse.com>
CC: Roger Pau Monné <roger.pau@citrix.com>

v2:
 * New

The handling of VM is complicated.

It turns out that it's simply ignored by IRET in Long Mode (i.e. clearing it
in commit 0e47f92b0725 ("x86: force EFLAGS.IF on when exiting to PV guests")
wasn't actually necessary), but ERETU does care.

But it's unclear how to handle this in arch_set_info().  We must preserve it
for HVM guests (which can use vm86 mode).  PV32 has special handling, but
only in hypercall_iret(), not in arch_set_info().
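
For reference while reviewing: with the definitions in this patch,
X86_EFLAGS_ALL works out to 0x3f7fd5, i.e. all architecturally defined flags
in bits 0-21, excluding the reserved bits 1, 3, 5 and 15.  The clamping
therefore keeps the defined flags, forces bit 1 on via X86_EFLAGS_MBS, and
zeroes bits 3, 5, 15 and everything from bit 22 upwards, which is what ERETU
demands.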
---
 xen/arch/x86/domain.c                | 4 ++--
 xen/arch/x86/hvm/domain.c            | 4 ++--
 xen/arch/x86/include/asm/x86-defns.h | 7 +++++++
 xen/arch/x86/pv/dom0_build.c         | 2 +-
 xen/arch/x86/pv/iret.c               | 8 +++++---
 5 files changed, 17 insertions(+), 8 deletions(-)

diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
index 64922869a625..c1880324f7a9 100644
--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -1273,7 +1273,7 @@ int arch_set_info_guest(
         v->arch.user_regs.rax               = c.nat->user_regs.rax;
         v->arch.user_regs.rip               = c.nat->user_regs.rip;
         v->arch.user_regs.cs                = c.nat->user_regs.cs;
-        v->arch.user_regs.rflags            = c.nat->user_regs.rflags;
+        v->arch.user_regs.rflags            = (c.nat->user_regs.rflags & X86_EFLAGS_ALL) | X86_EFLAGS_MBS;
         v->arch.user_regs.rsp               = c.nat->user_regs.rsp;
         v->arch.user_regs.ss                = c.nat->user_regs.ss;
         v->arch.pv.es                       = c.nat->user_regs.es;
@@ -1297,7 +1297,7 @@ int arch_set_info_guest(
         v->arch.user_regs.eax               = c.cmp->user_regs.eax;
         v->arch.user_regs.eip               = c.cmp->user_regs.eip;
         v->arch.user_regs.cs                = c.cmp->user_regs.cs;
-        v->arch.user_regs.eflags            = c.cmp->user_regs.eflags;
+        v->arch.user_regs.eflags            = (c.cmp->user_regs.eflags & X86_EFLAGS_ALL) | X86_EFLAGS_MBS;
         v->arch.user_regs.esp               = c.cmp->user_regs.esp;
         v->arch.user_regs.ss                = c.cmp->user_regs.ss;
         v->arch.pv.es                       = c.cmp->user_regs.es;
diff --git a/xen/arch/x86/hvm/domain.c b/xen/arch/x86/hvm/domain.c
index 048f29ae4911..1e874d598952 100644
--- a/xen/arch/x86/hvm/domain.c
+++ b/xen/arch/x86/hvm/domain.c
@@ -194,7 +194,7 @@ int arch_set_info_hvm_guest(struct vcpu *v, const struct vcpu_hvm_context *ctx)
         uregs->rsi    = regs->esi;
         uregs->rdi    = regs->edi;
         uregs->rip    = regs->eip;
-        uregs->rflags = regs->eflags;
+        uregs->rflags = (regs->eflags & X86_EFLAGS_ALL) | X86_EFLAGS_MBS;
 
         v->arch.hvm.guest_cr[0] = regs->cr0;
         v->arch.hvm.guest_cr[3] = regs->cr3;
@@ -245,7 +245,7 @@ int arch_set_info_hvm_guest(struct vcpu *v, const struct vcpu_hvm_context *ctx)
         uregs->rsi    = regs->rsi;
         uregs->rdi    = regs->rdi;
         uregs->rip    = regs->rip;
-        uregs->rflags = regs->rflags;
+        uregs->rflags = (regs->rflags & X86_EFLAGS_ALL) | X86_EFLAGS_MBS;
 
         v->arch.hvm.guest_cr[0] = regs->cr0;
         v->arch.hvm.guest_cr[3] = regs->cr3;
diff --git a/xen/arch/x86/include/asm/x86-defns.h b/xen/arch/x86/include/asm/x86-defns.h
index 0a0ba83de786..edeb0b4ff95a 100644
--- a/xen/arch/x86/include/asm/x86-defns.h
+++ b/xen/arch/x86/include/asm/x86-defns.h
@@ -27,6 +27,13 @@
     (X86_EFLAGS_CF | X86_EFLAGS_PF | X86_EFLAGS_AF |   \
      X86_EFLAGS_ZF | X86_EFLAGS_SF | X86_EFLAGS_OF)
 
+#define X86_EFLAGS_ALL                                          \
+    (X86_EFLAGS_ARITH_MASK | X86_EFLAGS_TF | X86_EFLAGS_IF |    \
+     X86_EFLAGS_DF | X86_EFLAGS_OF | X86_EFLAGS_IOPL |          \
+     X86_EFLAGS_NT | X86_EFLAGS_RF | X86_EFLAGS_VM |            \
+     X86_EFLAGS_AC | X86_EFLAGS_VIF | X86_EFLAGS_VIP |          \
+     X86_EFLAGS_ID)
+
 /*
  * Intel CPU flags in CR0
  */
diff --git a/xen/arch/x86/pv/dom0_build.c b/xen/arch/x86/pv/dom0_build.c
index 21158ce1812e..f9bbbea2ff70 100644
--- a/xen/arch/x86/pv/dom0_build.c
+++ b/xen/arch/x86/pv/dom0_build.c
@@ -1021,7 +1021,7 @@ static int __init dom0_construct(const struct boot_domain *bd)
     regs->rip = parms.virt_entry;
     regs->rsp = vstack_end;
     regs->rsi = vstartinfo_start;
-    regs->eflags = X86_EFLAGS_IF;
+    regs->eflags = X86_EFLAGS_IF | X86_EFLAGS_MBS;
 
     /*
      * We don't call arch_set_info_guest(), so some initialisation needs doing
diff --git a/xen/arch/x86/pv/iret.c b/xen/arch/x86/pv/iret.c
index d3a1fb2c685b..39ce316b8d91 100644
--- a/xen/arch/x86/pv/iret.c
+++ b/xen/arch/x86/pv/iret.c
@@ -80,8 +80,9 @@ long do_iret(void)
 
     regs->rip    = iret_saved.rip;
     regs->cs     = iret_saved.cs | 3; /* force guest privilege */
-    regs->rflags = ((iret_saved.rflags & ~(X86_EFLAGS_IOPL|X86_EFLAGS_VM))
-                    | X86_EFLAGS_IF);
+    regs->rflags = ((iret_saved.rflags & X86_EFLAGS_ALL &
+                     ~(X86_EFLAGS_IOPL | X86_EFLAGS_VM)) |
+                    X86_EFLAGS_IF | X86_EFLAGS_MBS);
     regs->rsp    = iret_saved.rsp;
     regs->ss     = iret_saved.ss | 3; /* force guest privilege */
 
@@ -143,7 +144,8 @@ int compat_iret(void)
     if ( VM_ASSIST(v->domain, architectural_iopl) )
         v->arch.pv.iopl = eflags & X86_EFLAGS_IOPL;
 
-    regs->eflags = (eflags & ~X86_EFLAGS_IOPL) | X86_EFLAGS_IF;
+    regs->eflags = ((eflags & X86_EFLAGS_ALL & ~X86_EFLAGS_IOPL) |
+                    X86_EFLAGS_IF | X86_EFLAGS_MBS);
 
     if ( unlikely(eflags & X86_EFLAGS_VM) )
     {
-- 
2.39.5



^ permalink raw reply related	[flat|nested] 56+ messages in thread

* Re: [PATCH v2 02/23] x86/traps: Extend struct cpu_user_regs/cpu_info with FRED fields
  2025-08-28 15:03 ` [PATCH v2 02/23] x86/traps: Extend struct cpu_user_regs/cpu_info with FRED fields Andrew Cooper
@ 2025-08-28 15:18   ` Jan Beulich
  0 siblings, 0 replies; 56+ messages in thread
From: Jan Beulich @ 2025-08-28 15:18 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: Roger Pau Monné, Xen-devel

On 28.08.2025 17:03, Andrew Cooper wrote:
> The FRED on-stack format is larger than the IDT format, but is by and large
> compatible.  FRED reuses space above cs and ss for extra metadata, some of
> which is purely informational, and some of which causes additional effects in
> ERET{U,S}.
> 
> Follow Linux's choice of naming for fred_{c,s}s structures, to make it very
> clear at the point of use that it's dependent on FRED.
> 
> There is also the event data field and reserved fields, but we cannot include
> these in struct cpu_user_regs without reintroducing OoB structure accesses in
> the non-FRED case.  See commit 6065a05adf15 ("x86/traps: 'Fix' safety of
> read_registers() in #DF path") for more details.
> 
> Instead, use a new struct fred_info and position it suitably in struct
> cpu_info.  This boundary will be loaded into MSR_FRED_RSP_SL0, and must be
> 64-byte aligned.
> 
> This does add 16 bytes back into struct cpu_info, undoing the saving we made
> by dropping the vm86 data segment selectors.
> 
> No functional change.
> 
> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>

Reviewed-by: Jan Beulich <jbeulich@suse.com>



^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v2 03/23] x86/traps: Introduce opt_fred
  2025-08-28 15:03 ` [PATCH v2 03/23] x86/traps: Introduce opt_fred Andrew Cooper
@ 2025-08-28 15:20   ` Jan Beulich
  2025-09-01 13:15   ` Jan Beulich
  1 sibling, 0 replies; 56+ messages in thread
From: Jan Beulich @ 2025-08-28 15:20 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: Roger Pau Monné, Xen-devel

On 28.08.2025 17:03, Andrew Cooper wrote:
> ... disabled by default.  There is a lot of work before FRED can be enabled by
> default.
> 
> One part of FRED, the LKGS (Load Kernel GS) instruction, is enumerated
> separately but is mandatory as FRED disallows the SWAPGS instruction.
> Normally, we'd have to check both CPUID bits, but Xen does not use GS like
> most other software, and can manage without the LKGS instruction.
> 
> FRED formally removes the use of Ring1 and Ring2, meaning we cannot run 32bit
> PV guests.  Therefore, don't enable FRED by default in shim mode.  OTOH, if
> FRED is active, then PV32 needs disabling like with CET.
> 
> No functional change.
> 
> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>

Reviewed-by: Jan Beulich <jbeulich@suse.com>



^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v2 01/23] x86: FRED enumerations
  2025-08-28 15:03 ` [PATCH v2 01/23] x86: FRED enumerations Andrew Cooper
@ 2025-08-28 15:33   ` Andrew Cooper
  0 siblings, 0 replies; 56+ messages in thread
From: Andrew Cooper @ 2025-08-28 15:33 UTC (permalink / raw)
  To: Xen-devel; +Cc: Jan Beulich, Roger Pau Monné

On 28/08/2025 4:03 pm, Andrew Cooper wrote:
> Of note, CR4.FRED is bit 32 and cannot be enabled outside of 64bit mode.
>
> Most supported toolchains don't understand the FRED instructions yet.  ERETU
> and ERETS are easy to wrap (they encoded as REPZ/REPNE CLAC), while LKGS is
> more complicated and deferred for now.

I've realised this is stale now.  I'll replace with:

Most supported toolchains don't understand the FRED instructions yet. 
ERETU and ERETS are easy to wrap (they are encoded as REPZ/REPNE CLAC),
while Xen turns out to have no need for LKGS.

>
> I have intentionally named the FRED MSRs differently to the spec.  In the
> spec, the stack pointer names alias the TSS fields of the same name, despite
> very different semantics.
>
> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
> Acked-by: Jan Beulich <jbeulich@suse.com>
> ---
> CC: Jan Beulich <JBeulich@suse.com>
> CC: Roger Pau Monné <roger.pau@citrix.com>
>
> v2:
>  * Drop CONFIG_HAS_AS_FRED

I should have also noted "Add X86_FEATURE_XEN_FRED" here.

~Andrew


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v2 06/23] x86/traps: Set MSR_PL0_SSP in load_system_tables()
  2025-08-28 15:03 ` [PATCH v2 06/23] x86/traps: Set MSR_PL0_SSP in load_system_tables() Andrew Cooper
@ 2025-09-01  9:23   ` Jan Beulich
  2025-09-01 15:31     ` Andrew Cooper
  0 siblings, 1 reply; 56+ messages in thread
From: Jan Beulich @ 2025-09-01  9:23 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: Roger Pau Monné, Xen-devel

On 28.08.2025 17:03, Andrew Cooper wrote:
> FRED and IDT differ by a Supervisor Token on the base of the shstk.  This
> means that the value they load into MSR_PL0_SSP differs by 8.
> 
> s3_resume() in particular has logic which is otherwise invariant of FRED mode,
> and must not clobber a FRED MSR_PL0_SSP with an IDT one.
> 
> This also simplifies the AP path.  Updating reinit_bsp_stack() is deferred
> until later.

This last sentence looks to be ...

> No functional change.
> 
> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
> ---
> CC: Jan Beulich <JBeulich@suse.com>
> CC: Roger Pau Monné <roger.pau@citrix.com>
> 
> v2:
>  * Extend comment about clearing the busy bit.
>  * Move reinit_bsp_stack() hunk into this patch.

... stale, according to this. Other than that:
Reviewed-by: Jan Beulich <jbeulich@suse.com>

Jan


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v2 07/23] x86/boot: Use RSTORSSP to establish SSP
  2025-08-28 15:03 ` [PATCH v2 07/23] x86/boot: Use RSTORSSP to establish SSP Andrew Cooper
@ 2025-09-01  9:28   ` Jan Beulich
  2025-09-01 15:33     ` Andrew Cooper
  0 siblings, 1 reply; 56+ messages in thread
From: Jan Beulich @ 2025-09-01  9:28 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: Roger Pau Monné, Xen-devel

On 28.08.2025 17:03, Andrew Cooper wrote:
> @@ -908,7 +909,29 @@ static void __init noreturn reinit_bsp_stack(void)
>      if ( cpu_has_xen_shstk )
>      {
>          wrmsrl(MSR_S_CET, xen_msr_s_cet_value());
> -        asm volatile ("setssbsy" ::: "memory");
> +
> +        /*
> +         * IDT and FRED differ by a Supervisor Token on the shadow stack, and
> +         * therefore by the value in MSR_PL0_SSP.

Beside not being overly relevant here afaict, is this last part of the sentence
actually correct? Patch 06 doesn't write different values into the MSR.

Jan


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v2 08/23] x86/traps: Alter switch_stack_and_jump() for FRED mode
  2025-08-28 15:03 ` [PATCH v2 08/23] x86/traps: Alter switch_stack_and_jump() for FRED mode Andrew Cooper
@ 2025-09-01  9:30   ` Jan Beulich
  0 siblings, 0 replies; 56+ messages in thread
From: Jan Beulich @ 2025-09-01  9:30 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: Roger Pau Monné, Xen-devel

On 28.08.2025 17:03, Andrew Cooper wrote:
> FRED and IDT differ by a Supervisor Token on the base of the shstk.  This
> means that switch_stack_and_jump() needs to discard one extra word when FRED
> is active.
> 
> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>

Reviewed-by: Jan Beulich <jbeulich@suse.com>

However, I'd much prefer if ...

> --- a/xen/arch/x86/include/asm/current.h
> +++ b/xen/arch/x86/include/asm/current.h
> @@ -154,7 +154,9 @@ unsigned long get_stack_dump_bottom (unsigned long sp);
>      "rdsspd %[ssp];"                                            \
>      "cmp $1, %[ssp];"                                           \
>      "je .L_shstk_done.%=;" /* CET not active?  Skip. */         \
> -    "mov $%c[skstk_base], %[val];"                              \
> +    ALTERNATIVE("mov $%c[skstk_base], %[val];",                 \
> +                "mov $%c[skstk_base] + 8, %[val];",             \

... the unnecessarily complicated $%c here could be replaced by plain %.

Jan

> +                X86_FEATURE_XEN_FRED)                           \
>      "and $%c[stack_mask], %[ssp];"                              \
>      "sub %[ssp], %[val];"                                       \
>      "shr $3, %[val];"                                           \



^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v2 09/23] x86/traps: Skip Supervisor Shadow Stack tokens in FRED mode
  2025-08-28 15:03 ` [PATCH v2 09/23] x86/traps: Skip Supervisor Shadow Stack tokens in " Andrew Cooper
@ 2025-09-01  9:38   ` Jan Beulich
  0 siblings, 0 replies; 56+ messages in thread
From: Jan Beulich @ 2025-09-01  9:38 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: Roger Pau Monné, Xen-devel

On 28.08.2025 17:03, Andrew Cooper wrote:
> FRED doesn't use Supervisor Shadow Stack tokens.  Skip setting them up.
> 
> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>

In principle
Reviewed-by: Jan Beulich <jbeulich@suse.com>

However, ...

> --- a/xen/arch/x86/setup.c
> +++ b/xen/arch/x86/setup.c
> @@ -1920,10 +1920,6 @@ void asmlinkage __init noreturn __start_xen(void)
>  
>      system_state = SYS_STATE_boot;
>  
> -    bsp_stack = cpu_alloc_stack(0);
> -    if ( !bsp_stack )
> -        panic("No memory for BSP stack\n");
> -
>      console_init_ring();
>      vesa_init();
>  
> @@ -2077,6 +2073,10 @@ void asmlinkage __init noreturn __start_xen(void)
>  
>      traps_init(); /* Needs stubs allocated. */
>  
> +    bsp_stack = cpu_alloc_stack(0); /* Needs to know IDT vs FRED */
> +    if ( !bsp_stack )
> +        panic("No memory for BSP stack\n");

... while this is the earliest possible point now, can we please consider
moving it yet further down? E.g. next to the setting of system_state to
SYS_STATE_smp_boot, i.e. in particular past IRQ and spin-debug enabling?

Jan


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v2 10/23] x86/traps: Make an IDT-specific #DB helper
  2025-08-28 15:03 ` [PATCH v2 10/23] x86/traps: Make an IDT-specific #DB helper Andrew Cooper
@ 2025-09-01  9:41   ` Jan Beulich
  0 siblings, 0 replies; 56+ messages in thread
From: Jan Beulich @ 2025-09-01  9:41 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: Roger Pau Monné, Xen-devel

On 28.08.2025 17:03, Andrew Cooper wrote:
> FRED provides PENDING_DBG in the stack frame, avoiding the need to read
> %dr6 manually.
> 
> Rename do_debug() to handle_DB(), and update it to take a dbg field using
> positive polarity.
> 
> Introduce a new handle_DB_IDT() which reads %dr6.
> 
> No functional change.
> 
> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>

Reviewed-by: Jan Beulich <jbeulich@suse.com>



^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v2 11/23] x86/traps: Make an IDT-specific #PF helper
  2025-08-28 15:03 ` [PATCH v2 11/23] x86/traps: Make an IDT-specific #PF helper Andrew Cooper
@ 2025-09-01  9:46   ` Jan Beulich
  2025-09-01 15:36     ` Andrew Cooper
  0 siblings, 1 reply; 56+ messages in thread
From: Jan Beulich @ 2025-09-01  9:46 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: Roger Pau Monné, Xen-devel

On 28.08.2025 17:03, Andrew Cooper wrote:
> FRED provides %cr2 in the stack frame, avoiding the need to read %cr2
> manually.
> 
> Rename do_page_fault() to handle_PF(), and update it to take cr2, still named
> addr for consistency.
> 
> Introduce a new handle_PF_IDT() which reads %cr2 and conditionally re-enables
> interrupts.

Why does this IRQ-re-enabling move to the IDT-specific function? Do you intend
to do the re-enabling yet earlier in FRED mode?

Jan


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v2 12/23] x86/fsgsbase: Make gskern accesses safe under FRED
  2025-08-28 15:03 ` [PATCH v2 12/23] x86/fsgsbase: Make gskern accesses safe under FRED Andrew Cooper
@ 2025-09-01  9:57   ` Jan Beulich
  0 siblings, 0 replies; 56+ messages in thread
From: Jan Beulich @ 2025-09-01  9:57 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: Roger Pau Monné, Xen-devel

On 28.08.2025 17:03, Andrew Cooper wrote:
> In principle, the following can also be used for read_registers()
> 
>     diff --git a/xen/arch/x86/traps.c b/xen/arch/x86/traps.c
>     index 5799770a2f71..0b0fdf2c5ac4 100644
>     --- a/xen/arch/x86/traps.c
>     +++ b/xen/arch/x86/traps.c
>     @@ -125,16 +125,21 @@ static void read_registers(struct extra_state *state)
>          state->cr3 = read_cr3();
>          state->cr4 = read_cr4();
> 
>     -    if ( !(state->cr4 & X86_CR4_FRED) && (state->cr4 & X86_CR4_FSGSBASE) )
>     +    if ( state->cr4 & X86_CR4_FSGSBASE )
>          {
>              state->fsb = __rdfsbase();
>              state->gsb = __rdgsbase();
>     +
>     +        if ( state->cr4 & X86_CR4_FRED )
>     +            goto gskern_fred;
>     +
>              state->gss = __rdgskern();

I'm irritated by this patch context here vs ...

> --- a/xen/arch/x86/include/asm/fsgsbase.h
> +++ b/xen/arch/x86/include/asm/fsgsbase.h
> @@ -79,7 +79,9 @@ static inline unsigned long read_gs_base(void)
>  
>  static inline unsigned long read_gs_shadow(void)
>  {
> -    if ( read_cr4() & X86_CR4_FSGSBASE )
> +    unsigned long cr4 = read_cr4();
> +
> +    if ( !(cr4 & X86_CR4_FRED) && (cr4 & X86_CR4_FSGSBASE) )
>          return __rdgs_shadow();

... the one here. Was the former (and the subject of the patch) not updated
yet (kern => shadow)? On the assumption that that's what has happened, and
hence preferably with the subject also adjusted
Reviewed-by: Jan Beulich <jbeulich@suse.com>

As to the alternative, I in particular don't overly like the goto there. I
would consider that an option only if in turn a simplification elsewhere
resulted.

Jan


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v2 13/23] x86/traps: Introduce FRED entrypoints
  2025-08-28 15:03 ` [PATCH v2 13/23] x86/traps: Introduce FRED entrypoints Andrew Cooper
@ 2025-09-01 10:26   ` Jan Beulich
  0 siblings, 0 replies; 56+ messages in thread
From: Jan Beulich @ 2025-09-01 10:26 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: Roger Pau Monné, Xen-devel

On 28.08.2025 17:03, Andrew Cooper wrote:
> Under FRED, there's one entrypoint from Ring 3, and one from Ring 0.
> 
> FRED gives us a good stack (even for SYSCALL/SYSENTER), and a unified event
> frame on the stack, meaning that all software needs to do is spill the GPRs
> with a line of PUSHes.  Introduce PUSH_AND_CLEAR_GPRS and POP_GPRS for this
> purpose.
> 
> Introduce entry_FRED_R0() which to a first approximation is complete for all
> event handling within Xen.
> 
> entry_FRED_R0() needs deriving from entry_FRED_R3(), so introduce a basic
> handler.  There is more work required to make the return-to-guest path work
> under FRED, so leave a BUG clearly in place.
> 
> Also introduce entry_from_{xen,pv}() to be the C level handlers.  By simply
> copying regs->fred_ss.vector into regs->entry_vector, we can reuse all the
> existing fault handlers.
> 
> Extend fatal_trap() to render the event type, including by name, when FRED is
> active.  This is slightly complicated, because X86_ET_OTHER must not use
> vector_name(), or else SYSCALL and SYSENTER get rendered as #BP and #DB.  Also,
> {read,write}_gs_shadow() needs modifying to avoid the SWAPGS instruction,
> which is disallowed in FRED mode.

This last sentence looks to be stale now.

> --- a/xen/arch/x86/include/asm/asm_defns.h
> +++ b/xen/arch/x86/include/asm/asm_defns.h
> @@ -315,6 +315,71 @@ static always_inline void stac(void)
>          subq  $-(UREGS_error_code-UREGS_r15+\adj), %rsp
>  .endm
>  
> +/*
> + * Push and clear GPRs
> + */
> +.macro PUSH_AND_CLEAR_GPRS
> +        push  %rdi
> +        xor   %edi, %edi
> +        push  %rsi
> +        xor   %esi, %esi
> +        push  %rdx
> +        xor   %edx, %edx
> +        push  %rcx
> +        xor   %ecx, %ecx
> +        push  %rax
> +        xor   %eax, %eax
> +        push  %r8
> +        xor   %r8d, %r8d
> +        push  %r9
> +        xor   %r9d, %r9d
> +        push  %r10
> +        xor   %r10d, %r10d
> +        push  %r11
> +        xor   %r11d, %r11d
> +        push  %rbx
> +        xor   %ebx, %ebx
> +        push  %rbp
> +#ifdef CONFIG_FRAME_POINTER
> +/* Indicate special exception stack frame by inverting the frame pointer. */
> +        mov   %rsp, %rbp
> +        notq  %rbp
> +#else
> +        xor   %ebp, %ebp
> +#endif
> +        push  %r12
> +        xor   %r12d, %r12d
> +        push  %r13
> +        xor   %r13d, %r13d
> +        push  %r14
> +        xor   %r14d, %r14d
> +        push  %r15
> +        xor   %r15d, %r15d
> +.endm
> +
> +/*
> + * POP GPRs from a UREGS_* frame on the stack.  Does not modify flags.
> + *
> + * @rax: Alternative destination for the %rax value on the stack.
> + */
> +.macro POP_GPRS rax=%rax

The parameter isn't really needed, at least not just yet. Do you have firm
plans for using it (presumably in the SVM code ahead of VMRUN)? Else I'd
suggest to omit it, as it's fragile anyway: One can't use an arbitrary
register for it; only ...

> +        pop   %r15
> +        pop   %r14
> +        pop   %r13
> +        pop   %r12
> +        pop   %rbp
> +        pop   %rbx
> +        pop   %r11
> +        pop   %r10
> +        pop   %r9
> +        pop   %r8
> +        pop   \rax
> +        pop   %rcx
> +        pop   %rdx
> +        pop   %rsi
> +        pop   %rdi

... the ones POPed later are candidates. Their order isn't really visible
at use sites of the macro, though. (A possibility would be to indicate
via argument only _that_ the %rax slot wants discarding, without indicating
by use of which register.)
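
E.g. something along these lines (untested sketch, only to illustrate the
shape I mean):

    .macro POP_GPRS skip_rax=0
            pop   %r15
            pop   %r14
            pop   %r13
            pop   %r12
            pop   %rbp
            pop   %rbx
            pop   %r11
            pop   %r10
            pop   %r9
            pop   %r8
    .if \skip_rax
            /* Discard the %rax slot; lea rather than add, to leave flags alone. */
            lea   8(%rsp), %rsp
    .else
            pop   %rax
    .endif
            pop   %rcx
            pop   %rdx
            pop   %rsi
            pop   %rdi
    .endm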

> --- a/xen/arch/x86/traps.c
> +++ b/xen/arch/x86/traps.c
> @@ -89,6 +89,13 @@ const unsigned int nmi_cpu;
>  #define stack_words_per_line 4
>  #define ESP_BEFORE_EXCEPTION(regs) ((unsigned long *)(regs)->rsp)
>  
> +/* Only valid to use when FRED is active. */
> +static inline struct fred_info *cpu_regs_fred_info(struct cpu_user_regs *regs)
> +{
> +    ASSERT(read_cr4() & X86_CR4_FRED);
> +    return (void *)(regs + 1);

Maybe better

    &container_of(regs, struct cpu_info, guest_cpu_user_regs)->_fred;

?

> @@ -1101,9 +1134,41 @@ void fatal_trap(const struct cpu_user_regs *regs, bool show_remote)
>          }
>      }
>  
> -    panic("FATAL TRAP: vec %u, %s[%04x]%s\n",
> -          trapnr, vector_name(trapnr), regs->error_code,
> -          (regs->eflags & X86_EFLAGS_IF) ? "" : " IN INTERRUPT CONTEXT");
> +    if ( read_cr4() & X86_CR4_FRED )
> +    {
> +        bool render_ec = false;
> +        const char *vec_name = NULL;
> +
> +        switch ( regs->fred_ss.type )
> +        {
> +        case X86_ET_HW_EXC:
> +        case X86_ET_PRIV_SW_EXC:
> +        case X86_ET_SW_EXC:
> +            render_ec = true;
> +            vec_name = vector_name(regs->fred_ss.vector);
> +            break;
> +
> +        case X86_ET_OTHER:
> +            vec_name = x86_et_other_name(regs->fred_ss.vector);
> +            break;
> +        }
> +
> +        if ( render_ec )
> +            panic("Fatal TRAP: type %u, %s, vec %u, %s[%04x]%s\n",

Is it deliberate that here and ...

> +                  regs->fred_ss.type, x86_et_name(regs->fred_ss.type),
> +                  regs->fred_ss.vector, vec_name ?: "",
> +                  regs->error_code,
> +                  (regs->eflags & X86_EFLAGS_IF) ? "" : " IN INTERRUPT CONTEXT");
> +        else
> +            panic("Fatal TRAP: type %u, %s, vec %u, %s%s\n",

.. here it's "Fatal", when ....

> +                  regs->fred_ss.type, x86_et_name(regs->fred_ss.type),
> +                  regs->fred_ss.vector, vec_name ?: "",
> +                  (regs->eflags & X86_EFLAGS_IF) ? "" : " IN INTERRUPT CONTEXT");
> +    }
> +    else
> +        panic("FATAL TRAP: vec %u, %s[%04x]%s\n",

... here it's "FATAL"? (Personally I prefer the all capitals form.)

> +              trapnr, vector_name(trapnr), regs->error_code,
> +              (regs->eflags & X86_EFLAGS_IF) ? "" : " IN INTERRUPT CONTEXT");

Just as a remark (the "else" thing still needs sorting) - without "else"
this would be a smaller diff.

> @@ -2198,6 +2263,94 @@ void asmlinkage check_ist_exit(const struct cpu_user_regs *regs, bool ist_exit)
>  }
>  #endif
>  
> +void asmlinkage entry_from_pv(struct cpu_user_regs *regs)
> +{
> +    /* Copy fred_ss.vector into entry_vector as IDT delivery would have done. */
> +    regs->entry_vector = regs->fred_ss.vector;
> +
> +    fatal_trap(regs, false);
> +}
> +
> +void asmlinkage entry_from_xen(struct cpu_user_regs *regs)
> +{
> +    struct fred_info *fi = cpu_regs_fred_info(regs);
> +    uint8_t type = regs->fred_ss.type;
> +
> +    /* Copy fred_ss.vector into entry_vector as IDT delivery would have done. */
> +    regs->entry_vector = regs->fred_ss.vector;
> +
> +    /*
> +     * First, handle the asynchronous or fatal events.  These are either
> +     * unrelated to the interrupted context, or may not have valid context
> +     * recorded, and all have special rules on how/whether to re-enable IRQs.
> +     */
> +    switch ( type )
> +    {
> +    case X86_ET_EXT_INTR:
> +        return do_IRQ(regs);
> +
> +    case X86_ET_NMI:
> +        return do_nmi(regs);
> +
> +    case X86_ET_HW_EXC:
> +        switch ( regs->fred_ss.vector )
> +        {
> +        case X86_EXC_DF: return do_double_fault(regs);
> +        case X86_EXC_MC: return do_machine_check(regs);
> +        }
> +        break;
> +    }
> +
> +    /*
> +     * With the asynchronous events handled, what remains are the synchronous
> +     * ones.  If we interrupted an IRQs-on region, we should re-enable IRQs
> +     * now; for #PF and #DB, %cr2 and %dr6 are on the stack in edata.
> +     */
> +    if ( regs->eflags & X86_EFLAGS_IF )
> +        local_irq_enable();

Ah yes, here we go (as to my question on an earlier patch).

> +    switch ( type )
> +    {
> +    case X86_ET_HW_EXC:
> +    case X86_ET_PRIV_SW_EXC:
> +    case X86_ET_SW_EXC:
> +        switch ( regs->fred_ss.vector )
> +        {
> +        case X86_EXC_PF:  handle_PF(regs, fi->edata); break;
> +        case X86_EXC_GP:  do_general_protection(regs); break;
> +        case X86_EXC_UD:  do_invalid_op(regs); break;
> +        case X86_EXC_NM:  do_device_not_available(regs); break;
> +        case X86_EXC_BP:  do_int3(regs); break;
> +        case X86_EXC_DB:  handle_DB(regs, fi->edata); break;
> +
> +        case X86_EXC_DE:
> +        case X86_EXC_OF:
> +        case X86_EXC_BR:
> +        case X86_EXC_NP:
> +        case X86_EXC_SS:
> +        case X86_EXC_MF:
> +        case X86_EXC_AC:
> +        case X86_EXC_XM:
> +            do_trap(regs);
> +            break;
> +
> +        case X86_EXC_CP:  do_entry_CP(regs); break;

Any reason this isn't grouped with the other single-line cases?

Jan


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v2 14/23] x86/traps: Enable FRED when requested
  2025-08-28 15:04 ` [PATCH v2 14/23] x86/traps: Enable FRED when requested Andrew Cooper
@ 2025-09-01 10:50   ` Jan Beulich
  0 siblings, 0 replies; 56+ messages in thread
From: Jan Beulich @ 2025-09-01 10:50 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: Roger Pau Monné, Xen-devel

On 28.08.2025 17:04, Andrew Cooper wrote:
> With the shadow stack and exception handling adjustements in place, we can now
> activate FRED when appropriate.  Note that opt_fred is still disabled by
> default.
> 
> Introduce init_fred() to set up all the MSRs relevant for FRED.  FRED uses
> MSR_STAR (entries from Ring3 only), and MSR_FRED_SSP_SL0 aliases MSR_PL0_SSP
> when CET-SS is active.  Otherwise, they're all new MSRs.
> 
> With init_fred() existing, load_system_tables() and legacy_syscall_init()
> should only be used when setting up IDT delivery.  Insert ASSERT()s to this
> effect, and adjust the various *_init() functions to make this property true.
> 
> Per the documentation, ap_early_traps_init() is responsible for switching off
> the boot GDT, which needs doing even in FRED mode.
> 
> Finally, set CR4.FRED in {bsp,ap}_early_traps_init().

Nit: That's {bsp,percpu}_traps_init() now, aiui.

> --- a/xen/arch/x86/include/asm/traps.h
> +++ b/xen/arch/x86/include/asm/traps.h
> @@ -16,6 +16,8 @@ void traps_init(void);
>  void bsp_traps_reinit(void);
>  void percpu_traps_init(void);
>  
> +void nocall entry_FRED_R3(void);

Can't we constrain this decl to the sole C file needing it?

> @@ -268,6 +272,52 @@ static void __init init_ler(void)
>      setup_force_cpu_cap(X86_FEATURE_XEN_LBR);
>  }
>  
> +/*
> + * Set up all MSRs relevant for FRED event delivery.
> + *
> + * Xen does not use any of the optional config in MSR_FRED_CONFIG, so all that
> + * is needed is the entrypoint.
> + *
> + * Because FRED always provides a good stack, NMI and #DB do not need any
> + * special treatment.  Only #DF needs another stack level, and #MC for the
> + * offchance that Xen's main stack suffers an uncorrectable error.
> + *
> + * This makes Stack Level 1 unused, but we use #DB's stacks, and with the
> + * regular and shadow stacks reversed as poison to guarantee that any use
> + * escalates to #DF.
> + *
> + * FRED reuses MSR_STAR to provide the segment selector values to load on
> + * entry from Ring3.  Entry from Ring0 leaves %cs and %ss unmodified.
> + */
> +static void init_fred(void)
> +{
> +    unsigned long stack_top = get_stack_bottom() & ~(STACK_SIZE - 1);
> +
> +    ASSERT(opt_fred == 1);
> +
> +    wrmsrns(MSR_STAR, XEN_MSR_STAR);
> +    wrmsrns(MSR_FRED_CONFIG, (unsigned long)entry_FRED_R3);
> +
> +    /*
> +     * MSR_FRED_RSP_* all come with an 64-byte alignment check, avoiding the
> +     * need for an explicit BUG_ON().
> +     */
> +    wrmsrns(MSR_FRED_RSP_SL0, (unsigned long)(&get_cpu_info()->_fred + 1));
> +    wrmsrns(MSR_FRED_RSP_SL1, stack_top + (IST_DB * IST_SHSTK_SIZE)); /* Poison */

So the use of IST_SHSTK_SIZE here (and PAGE_SIZE below) is to have the SL1
stacks "the wrong way round". I first thought this was a mistake, not the
least also due to the typo below. I think this wants commenting upon.

> +    wrmsrns(MSR_FRED_RSP_SL2, stack_top + (1 + IST_MCE)  * PAGE_SIZE);
> +    wrmsrns(MSR_FRED_RSP_SL3, stack_top + (1 + IST_DF)   * PAGE_SIZE);
> +    wrmsrns(MSR_FRED_STK_LVLS, ((2UL << (X86_EXC_MC * 2)) |
> +                                (3UL << (X86_EXC_DF * 2))));
> +
> +    if ( cpu_has_xen_shstk )
> +    {
> +        wrmsrns(MSR_FRED_SSP_SL0, stack_top + (PRIMARY_SHSTK_SLOT + 1) * PAGE_SIZE);
> +        wrmsrns(MSR_FRED_RSP_SL1, stack_top + (1 + IST_DF)  * PAGE_SIZE); /* Poison */

MSR_FRED_SSP_SL1 and presumably IST_DB?

Also (nit) both of these lines are too long; the double blank ahead of * on
the latter one probably also wants dropping.

> +        wrmsrns(MSR_FRED_SSP_SL2, stack_top + (IST_MCE * IST_SHSTK_SIZE));
> +        wrmsrns(MSR_FRED_SSP_SL3, stack_top + (IST_DF  * IST_SHSTK_SIZE));
> +    }
> +}

Because of the intentional asymmetry for SL1, maybe the writing of
MSR_FRED_STK_LVLS would better move below here?

> @@ -356,8 +410,13 @@ void __init traps_init(void)
>   */
>  void __init bsp_traps_reinit(void)
>  {
> -    load_system_tables();
> -    percpu_traps_init();
> +    if ( opt_fred )
> +        init_fred();
> +    else
> +    {
> +        load_system_tables();
> +        percpu_traps_init();

Doesn't this need to stay outside of the if/else, ...

> +    }
>  }
>  
>  /*
> @@ -366,7 +425,8 @@ void __init bsp_traps_reinit(void)
>   */
>  void percpu_traps_init(void)
>  {
> -    legacy_syscall_init();
> +    if ( !opt_fred )
> +        legacy_syscall_init();
>  
>      if ( cpu_has_xen_lbr )
>          wrmsrl(MSR_IA32_DEBUGCTLMSR, IA32_DEBUGCTLMSR_LBR);

... for this to still be done?

Jan


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v2 15/23] x86/pv: Deduplicate is_canonical_address() in do_set_segment_base()
  2025-08-28 15:04 ` [PATCH v2 15/23] x86/pv: Deduplicate is_canonical_address() in do_set_segment_base() Andrew Cooper
@ 2025-09-01 11:22   ` Jan Beulich
  0 siblings, 0 replies; 56+ messages in thread
From: Jan Beulich @ 2025-09-01 11:22 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: Roger Pau Monné, Xen-devel

On 28.08.2025 17:04, Andrew Cooper wrote:
> This is really a rearrangement to make adding FRED support easier.
> 
> No functional change.
> 
> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
> ---
> CC: Jan Beulich <JBeulich@suse.com>
> CC: Roger Pau Monné <roger.pau@citrix.com>
> 
> v2:
>  * New
> 
> There is a marginal code size improvement:
> 
>   add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-46 (-46)
>   Function                                     old     new   delta
>   do_set_segment_base                          496     450     -46

While this is quite nice, the nested switch()es aren't as much. Still
Acked-by: Jan Beulich <jbeulich@suse.com>
with possibly a default case added to the new inner switch(), just to
please Misra.

Jan


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v2 16/23] x86/entry: Alter how IRET faults are recognised
  2025-08-28 15:04 ` [PATCH v2 16/23] x86/entry: Alter how IRET faults are recognised Andrew Cooper
@ 2025-09-01 11:41   ` Jan Beulich
  0 siblings, 0 replies; 56+ messages in thread
From: Jan Beulich @ 2025-09-01 11:41 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: Roger Pau Monné, Xen-devel

On 28.08.2025 17:04, Andrew Cooper wrote:
> Right now we have two IRET instructions that can fault for guest reasons, and
> the pre exception table gives handle_exception as the fixup for both.
> 
> Instead, we can have compat_restore_all_guest() use restore_all_guest()'s IRET
> which gives us just a single position to handle specially.
> 
> In exception_with_ints_disabled(), remove search_pre_exception_table() and use
> a simpler check.

And, peeking ahead, a similar check will then appear for ERETU. Probably indeed
a fair exchange seeing that in the next patch you drop the pre-exception stuff
altogether.

>  Explain how the recovery works, because this isn't the first
> time I've had to figure it out.
> 
> The reference to iret_to_guest highlights that any checking here is specific
> to CONFIG_PV, so exclude it in !PV builds.
> 
> Later in exception_with_ints_disabled(), it suffices to load %ecx rather than
> %rcx, and remove a stray semi-colon from the rep movsq.
> 
> No functional change.
> 
> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>

Reviewed-by: Jan Beulich <jbeulich@suse.com>



^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v2 17/23] x86/entry: Drop the pre exception table infrastructure
  2025-08-28 15:04 ` [PATCH v2 17/23] x86/entry: Drop the pre exception table infrastructure Andrew Cooper
@ 2025-09-01 11:45   ` Jan Beulich
  0 siblings, 0 replies; 56+ messages in thread
From: Jan Beulich @ 2025-09-01 11:45 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: Roger Pau Monné, Xen-devel

On 28.08.2025 17:04, Andrew Cooper wrote:
> It is no longer used.
> 
> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>

Acked-by: Jan Beulich <jbeulich@suse.com>



^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v2 18/23] x86/entry: Rework the comment about SYSCALL and DF
  2025-08-28 15:04 ` [PATCH v2 18/23] x86/entry: Rework the comment about SYSCALL and DF Andrew Cooper
@ 2025-09-01 11:52   ` Jan Beulich
  0 siblings, 0 replies; 56+ messages in thread
From: Jan Beulich @ 2025-09-01 11:52 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: Roger Pau Monné, Xen-devel

On 28.08.2025 17:04, Andrew Cooper wrote:
> It's soon going to be needed in a second location.
> 
> Right now it's misleading saying that nothing else would be cleared.  It's
> missing the more important point that SYSCALLs are treated like all other
> interrupts and exceptions, and undergo normal flags handling there.
> 
> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>

Acked-by: Jan Beulich <jbeulich@suse.com>



^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v2 19/23] x86/pv: Adjust GS handling for FRED mode
  2025-08-28 15:04 ` [PATCH v2 19/23] x86/pv: Adjust GS handling for FRED mode Andrew Cooper
@ 2025-09-01 12:02   ` Jan Beulich
  0 siblings, 0 replies; 56+ messages in thread
From: Jan Beulich @ 2025-09-01 12:02 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: Roger Pau Monné, Xen-devel

On 28.08.2025 17:04, Andrew Cooper wrote:
> When FRED is active, hardware automatically swaps GS when changing privilege,
> and the SWAPGS instruction is disallowed.
> 
> For native OSes using GS as the thread local pointer this is a massive
> improvement on the pre-FRED architecture, but under Xen it makes handling PV
> guests more complicated.  Specifically, it means that GS_BASE and GS_SHADOW
> are the opposite way around in FRED mode, as opposed to IDT mode.
> 
> This leads to the following changes:
> 
>   * In load_segments(), we have to load both GSes.  Account for this in the
>     SWAP() condition and avoid the path with SWAPGS.
> 
>   * In save_segments(), we need to read GS_KERN rather than GS_BASE.

GS_SHADOW in our terminology, that is. (Also again in code comments,
and there's also a variable named gs_kern.)

>   * In toggle_guest_mode(), we need to emulate SWAPGS.
> 
>   * In do_set_segment_base(), merge the SEGBASE_GS_{USER,KERNEL} cases and
>     take FRED into account when choosing which base to update.
> 
>     SEGBASE_GS_USER_SEL was already an LKGS invocation (decades before FRED)
>     so under FRED needs to be a simple MOV %gs.  Simply skip the SWAPGSes.
> 
> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
> ---
> CC: Jan Beulich <JBeulich@suse.com>
> CC: Roger Pau Monné <roger.pau@citrix.com>
> 
> v2:
>  * New
> 
> I think this functions, but it's not ideal.  The conditions are asymmetric and
> awkward.

It's not as bad as I expected it to be after reading this remark.

Preferably with the naming adjusted:
Reviewed-by: Jan Beulich <jbeulich@suse.com>

Jan


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v2 20/23] x86/pv: Exception handling in FRED mode
  2025-08-28 15:04 ` [PATCH v2 20/23] x86/pv: Exception handling in " Andrew Cooper
@ 2025-09-01 12:23   ` Jan Beulich
  2025-09-01 12:57   ` Jan Beulich
  1 sibling, 0 replies; 56+ messages in thread
From: Jan Beulich @ 2025-09-01 12:23 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: Roger Pau Monné, Xen-devel

On 28.08.2025 17:04, Andrew Cooper wrote:
> --- a/xen/arch/x86/traps.c
> +++ b/xen/arch/x86/traps.c
> @@ -2265,9 +2265,83 @@ void asmlinkage check_ist_exit(const struct cpu_user_regs *regs, bool ist_exit)
>  
>  void asmlinkage entry_from_pv(struct cpu_user_regs *regs)
>  {
> +    struct fred_info *fi = cpu_regs_fred_info(regs);
> +    uint8_t type = regs->fred_ss.type;
> +    uint8_t vec = regs->fred_ss.vector;
> +
>      /* Copy fred_ss.vector into entry_vector as IDT delivery would have done. */
> -    regs->entry_vector = regs->fred_ss.vector;
> +    regs->entry_vector = vec;
> +
> +    if ( !IS_ENABLED(CONFIG_PV) )
> +        goto fatal;
> +
> +    /*
> +     * First, handle the asynchronous or fatal events.  These are either
> +     * unrelated to the interrupted context, or may not have valid context
> +     * recorded, and all have special rules on how/whether to re-enable IRQs.
> +     */
> +    switch ( type )
> +    {
> +    case X86_ET_EXT_INTR:
> +        return do_IRQ(regs);
>  
> +    case X86_ET_NMI:
> +        return do_nmi(regs);
> +
> +    case X86_ET_HW_EXC:
> +        switch ( vec )
> +        {
> +        case X86_EXC_DF: return do_double_fault(regs);
> +        case X86_EXC_MC: return do_machine_check(regs);
> +        }
> +        break;
> +    }

This switch() is identical to entry_from_xen()'s. Fold into a helper?

> +    /*
> +     * With the asynchronous events handled, what remains are the synchronous
> +     * ones.  Guest context always had interrupts enabled.
> +     */
> +    local_irq_enable();

In the comment, maybe s/Guest/PV guest/?

> +    switch ( type )
> +    {
> +    case X86_ET_HW_EXC:
> +    case X86_ET_PRIV_SW_EXC:
> +    case X86_ET_SW_EXC:
> +        switch ( vec )
> +        {
> +        case X86_EXC_PF:  handle_PF(regs, fi->edata); break;
> +        case X86_EXC_GP:  do_general_protection(regs); break;
> +        case X86_EXC_UD:  do_invalid_op(regs); break;
> +        case X86_EXC_NM:  do_device_not_available(regs); break;
> +        case X86_EXC_BP:  do_int3(regs); break;
> +        case X86_EXC_DB:  handle_DB(regs, fi->edata); break;
> +
> +        case X86_EXC_DE:
> +        case X86_EXC_OF:
> +        case X86_EXC_BR:
> +        case X86_EXC_NP:
> +        case X86_EXC_SS:
> +        case X86_EXC_MF:
> +        case X86_EXC_AC:
> +        case X86_EXC_XM:
> +            do_trap(regs);
> +            break;
> +
> +        case X86_EXC_CP:  do_entry_CP(regs); break;
> +
> +        default:
> +            goto fatal;
> +        }
> +        break;

This again looks identical to when entry_from_xen() has. Maybe, instead of
a helper for each switch(), we could have a common always-inline function
(with all necessary parametrization) that both invoke?
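
E.g. (shape only; the name is invented, and the differences between the Xen
and PV paths would still need parametrising or handling in the callers):

    static always_inline bool handle_sync_exception(struct cpu_user_regs *regs,
                                                    const struct fred_info *fi)
    {
        switch ( regs->fred_ss.vector )
        {
        case X86_EXC_PF:  handle_PF(regs, fi->edata); break;
        case X86_EXC_GP:  do_general_protection(regs); break;
        case X86_EXC_UD:  do_invalid_op(regs); break;
        case X86_EXC_NM:  do_device_not_available(regs); break;
        case X86_EXC_BP:  do_int3(regs); break;
        case X86_EXC_DB:  handle_DB(regs, fi->edata); break;
        case X86_EXC_CP:  do_entry_CP(regs); break;

        case X86_EXC_DE:
        case X86_EXC_OF:
        case X86_EXC_BR:
        case X86_EXC_NP:
        case X86_EXC_SS:
        case X86_EXC_MF:
        case X86_EXC_AC:
        case X86_EXC_XM:
            do_trap(regs);
            break;

        default:
            return false; /* caller decides how to escalate */
        }

        return true;
    }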

> --- a/xen/arch/x86/x86_64/entry.S
> +++ b/xen/arch/x86/x86_64/entry.S
> @@ -63,7 +63,7 @@ UNLIKELY_END(syscall_no_callback)
>          /* Conditionally clear DF */
>          and   %esi, UREGS_eflags(%rsp)
>  /* %rbx: struct vcpu */
> -test_all_events:
> +LABEL(test_all_events, 0)
>          ASSERT_NOT_IN_ATOMIC
>          cli                             # tests must not race interrupts
>  /*test_softirqs:*/
> @@ -152,6 +152,8 @@ END(switch_to_kernel)
>  FUNC_LOCAL(restore_all_guest)
>          ASSERT_INTERRUPTS_DISABLED
>  
> +        ALTERNATIVE "", "jmp eretu_exit_to_guest", X86_FEATURE_XEN_FRED
> +
>          /* Stash guest SPEC_CTRL value while we can read struct vcpu. */
>          mov VCPU_arch_msrs(%rbx), %rdx

I assume it's deliberate that you don't "consume" this insn into the
alternative, but without the description saying anything it's not quite
clear why.

Jan


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v2 20/23] x86/pv: Exception handling in FRED mode
  2025-08-28 15:04 ` [PATCH v2 20/23] x86/pv: Exception handling in " Andrew Cooper
  2025-09-01 12:23   ` Jan Beulich
@ 2025-09-01 12:57   ` Jan Beulich
  2025-09-01 13:27     ` Andrew Cooper
  1 sibling, 1 reply; 56+ messages in thread
From: Jan Beulich @ 2025-09-01 12:57 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: Roger Pau Monné, Xen-devel

On 28.08.2025 17:04, Andrew Cooper wrote:
> --- a/xen/arch/x86/traps.c
> +++ b/xen/arch/x86/traps.c
> @@ -2265,9 +2265,83 @@ void asmlinkage check_ist_exit(const struct cpu_user_regs *regs, bool ist_exit)
>  
>  void asmlinkage entry_from_pv(struct cpu_user_regs *regs)
>  {
> +    struct fred_info *fi = cpu_regs_fred_info(regs);
> +    uint8_t type = regs->fred_ss.type;
> +    uint8_t vec = regs->fred_ss.vector;
> +
>      /* Copy fred_ss.vector into entry_vector as IDT delivery would have done. */
> -    regs->entry_vector = regs->fred_ss.vector;
> +    regs->entry_vector = vec;
> +
> +    if ( !IS_ENABLED(CONFIG_PV) )
> +        goto fatal;
> +
> +    /*
> +     * First, handle the asynchronous or fatal events.  These are either
> +     * unrelated to the interrupted context, or may not have valid context
> +     * recorded, and all have special rules on how/whether to re-enable IRQs.
> +     */
> +    switch ( type )
> +    {
> +    case X86_ET_EXT_INTR:
> +        return do_IRQ(regs);
>  
> +    case X86_ET_NMI:
> +        return do_nmi(regs);
> +
> +    case X86_ET_HW_EXC:
> +        switch ( vec )
> +        {
> +        case X86_EXC_DF: return do_double_fault(regs);
> +        case X86_EXC_MC: return do_machine_check(regs);

Looking at patch 21, I came to wonder where it is that we're moving back to
SL0 in the #MC case (which may not be fatal), for ERETU to not fault.

Jan


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v2 21/23] x86/pv: ERETU error handling
  2025-08-28 15:04 ` [PATCH v2 21/23] x86/pv: ERETU error handling Andrew Cooper
@ 2025-09-01 13:13   ` Jan Beulich
  0 siblings, 0 replies; 56+ messages in thread
From: Jan Beulich @ 2025-09-01 13:13 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: Roger Pau Monné, Xen-devel

On 28.08.2025 17:04, Andrew Cooper wrote:
> --- a/xen/arch/x86/traps.c
> +++ b/xen/arch/x86/traps.c
> @@ -2345,6 +2345,113 @@ void asmlinkage entry_from_pv(struct cpu_user_regs *regs)
>      fatal_trap(regs, false);
>  }
>  
> +void nocall eretu_error_dom_crash(void);
> +
> +/*
> + * Classify an event at the ERETU instruction, and handle if possible.
> + * Returns @true if handled, @false if the event should continue down the
> + * normal handlers.
> + */
> +static bool handle_eretu_event(struct cpu_user_regs *regs)
> +{
> +    unsigned long recover;
> +
> +    /*
> +     * WARNING: The GPRs in gregs overlaps with regs.  Only gregs->error_code
> +     *          and later are legitimate to access.
> +     */
> +    struct cpu_user_regs *gregs =
> +        _p(regs->rsp - offsetof(struct cpu_user_regs, error_code));
> +
> +    /*
> +     * The asynchronous or fatal events (INTR, NMI, #MC, #DF) have been dealt
> +     * with, meaning we only have syncrhonous ones to consider.  Anything
> +     * which isn't a hardware exception wants handling normally.
> +     */
> +    if ( regs->fred_ss.type != X86_ET_HW_EXC )
> +        return false;
> +
> +    /*
> +     * Guests are permitted to write non-present GDT/LDT entries.  Therefore
> +     * #NP[sel] (%cs) and #SS[sel] (%ss) must be handled as guest errors.  The
> +     * only other source of #SS is for a bad %ss-relative memory access in
> +     * Xen, and if the stack is that bad, we'll have escalated to #DF.
> +     *
> +     * #PF can happen from ERETU accessing the GDT/LDT.  Xen may translate
> +     * these into #GP for the guest, so must be handled as guest errors.  In
> +     * theory we can get #PF for a bad instruction fetch or bad stack access,
> +     * but either of these will be fatal and not end up here.
> +     */
> +    switch ( regs->fred_ss.vector )
> +    {
> +    case X86_EXC_GP:
> +        /*
> +         * #GP[0] can occur because of a NULL %cs or %ss (which are a guest
> +         * error), but some #GP[0]'s are errors in Xen (ERETU at SL != 0), or
> +         * errors of Xen handling guest state (bad metadata).  These magic
> +         * numbers came from the FRED Spec; they check that ERETU is trying to
> +         * return to Ring 3, and that reserved or inapplicable bits are 0.
> +         */
> +        if ( regs->error_code == 0 && (gregs->cs & ~3) && (gregs->ss & ~3) &&
> +             (regs->fred_cs.sl != 0 ||
> +              (gregs->csx    & 0xffffffffffff0003UL) != 3 ||
> +              (gregs->rflags & 0xffffffffffc2b02aUL) != 2 ||
> +              (gregs->ssx    &         0xfff80003UL) != 3) )

I understand that for ->csx and ->ssx sensible alternatives to using such magic
constants don't really exist. For ->rflags, though, I think it would be nice if
we could use constants we have.
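
E.g. (untested, and assuming the X86_EFLAGS_ALL addition from patch 23 can be
made visible here) the rflags constant looks to be expressible as

    (~(unsigned long)X86_EFLAGS_ALL |
     X86_EFLAGS_IOPL | X86_EFLAGS_VM | X86_EFLAGS_MBS)

which would also make it more obvious that the check is "reserved bits, IOPL
and VM all clear, plus the must-be-set bit actually set".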

Jan


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v2 03/23] x86/traps: Introduce opt_fred
  2025-08-28 15:03 ` [PATCH v2 03/23] x86/traps: Introduce opt_fred Andrew Cooper
  2025-08-28 15:20   ` Jan Beulich
@ 2025-09-01 13:15   ` Jan Beulich
  1 sibling, 0 replies; 56+ messages in thread
From: Jan Beulich @ 2025-09-01 13:15 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: Roger Pau Monné, Xen-devel

On 28.08.2025 17:03, Andrew Cooper wrote:
> @@ -20,6 +22,9 @@ unsigned int __ro_after_init ler_msr;
>  static bool __initdata opt_ler;
>  boolean_param("ler", opt_ler);
>  
> +int8_t __ro_after_init opt_fred = 0;

Can't this be __initdata now that we have ...

> @@ -299,6 +304,37 @@ void __init traps_init(void)
>      /* Replace early pagefault with real pagefault handler. */
>      _update_gate_addr_lower(&bsp_idt[X86_EXC_PF], entry_PF);
>  
> +    /*
> +     * Xen doesn't use GS like most software does, and doesn't need the LKGS
> +     * instruction in order to manage PV guests.  No need to check for it.
> +     */
> +    if ( !cpu_has_fred )
> +    {
> +        if ( opt_fred == 1 )
> +            printk(XENLOG_WARNING "FRED not available, ignoring\n");
> +        opt_fred = 0;
> +    }
> +
> +    if ( opt_fred == -1 )
> +        opt_fred = !pv_shim;
> +
> +    if ( opt_fred )
> +    {
> +#ifdef CONFIG_PV32
> +        if ( opt_pv32 )
> +        {
> +            opt_pv32 = 0;
> +            printk(XENLOG_INFO "Disabling PV32 due to FRED\n");
> +        }
> +#endif
> +        setup_force_cpu_cap(X86_FEATURE_XEN_FRED);

... this? All non-__init uses could use the synthetic feature bit.

Jan


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v2 20/23] x86/pv: Exception handling in FRED mode
  2025-09-01 12:57   ` Jan Beulich
@ 2025-09-01 13:27     ` Andrew Cooper
  2025-09-01 13:55       ` Jan Beulich
  0 siblings, 1 reply; 56+ messages in thread
From: Andrew Cooper @ 2025-09-01 13:27 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Roger Pau Monné, Xen-devel

On 01/09/2025 1:57 pm, Jan Beulich wrote:
> On 28.08.2025 17:04, Andrew Cooper wrote:
>> --- a/xen/arch/x86/traps.c
>> +++ b/xen/arch/x86/traps.c
>> @@ -2265,9 +2265,83 @@ void asmlinkage check_ist_exit(const struct cpu_user_regs *regs, bool ist_exit)
>>  
>>  void asmlinkage entry_from_pv(struct cpu_user_regs *regs)
>>  {
>> +    struct fred_info *fi = cpu_regs_fred_info(regs);
>> +    uint8_t type = regs->fred_ss.type;
>> +    uint8_t vec = regs->fred_ss.vector;
>> +
>>      /* Copy fred_ss.vector into entry_vector as IDT delivery would have done. */
>> -    regs->entry_vector = regs->fred_ss.vector;
>> +    regs->entry_vector = vec;
>> +
>> +    if ( !IS_ENABLED(CONFIG_PV) )
>> +        goto fatal;
>> +
>> +    /*
>> +     * First, handle the asynchronous or fatal events.  These are either
>> +     * unrelated to the interrupted context, or may not have valid context
>> +     * recorded, and all have special rules on how/whether to re-enable IRQs.
>> +     */
>> +    switch ( type )
>> +    {
>> +    case X86_ET_EXT_INTR:
>> +        return do_IRQ(regs);
>>  
>> +    case X86_ET_NMI:
>> +        return do_nmi(regs);
>> +
>> +    case X86_ET_HW_EXC:
>> +        switch ( vec )
>> +        {
>> +        case X86_EXC_DF: return do_double_fault(regs);
>> +        case X86_EXC_MC: return do_machine_check(regs);
> Looking at patch 21, I came to wonder where it is that we're moving back to
> SL0 in the #MC case (which may not be fatal), for ERETU to not fault.

(Almost) any event taken in Ring3 enters Ring0 at SL0, even those with
custom STK_LVLS configuration.

See 5.1.2 Determining the New Values for Stack Level, RSP, and SSP

Nested exceptions (i.e. contributory faults) and #DF can end up at SL > 0
with a Ring 3 frame.  In principle you'd need to do recovery based on
regs->fred_ss.nested but Xen doesn't have any contributory exceptions
configured like this.

Under FRED, there are far fewer ways to take a contributory fault:
pagetable corruption for the entrypoint or stack, or hitting a stack
guard page.  Hitting the guard page will #DF; the others will triple fault.

~Andrew


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v2 22/23] x86/pv: System call handling in FRED mode
  2025-08-28 15:04 ` [PATCH v2 22/23] x86/pv: System call handling in FRED mode Andrew Cooper
@ 2025-09-01 13:49   ` Jan Beulich
  0 siblings, 0 replies; 56+ messages in thread
From: Jan Beulich @ 2025-09-01 13:49 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: Roger Pau Monné, Xen-devel

On 28.08.2025 17:04, Andrew Cooper wrote:
> Under FRED, entry_from_pv() handles everything, even system calls.  This means
> more of our logic is written in C now, rather than assembly.
> 
> In order to facilitate this, introduce pv_inject_callback(), which reuses
> struct trap_bounce infrastructure to inject the syscall/sysenter callbacks.
> This in turn requires some !PV compatibility for pv_inject_callback() and
> pv_hypercall() which can both be ASSERT_UNREACHABLE().
> 
> For each of INT $N, SYSCALL and SYSENTER, FRED gives us interrupted context
> which was previously lost.  As the guest can't see FRED, Xen has to lose state
> in the same way to maintain the prior behaviour.

In principle we could expose a new capability to the guest allowing it to
request that we preserve state. Question of course is whether that would
be of any practical use.

> --- a/xen/arch/x86/include/asm/domain.h
> +++ b/xen/arch/x86/include/asm/domain.h
> @@ -712,11 +712,16 @@ int arch_set_info_hvm_guest(struct vcpu *v, const struct vcpu_hvm_context *ctx);
>  
>  #ifdef CONFIG_PV
>  void pv_inject_event(const struct x86_event *event);
> +void pv_inject_callback(unsigned int type);
>  #else
>  static inline void pv_inject_event(const struct x86_event *event)
>  {
>      ASSERT_UNREACHABLE();
>  }
> +static inline void pv_inject_callback(unsigned int type)
> +{
> +    ASSERT_UNREACHABLE();
> +}
>  #endif

We don't really need this, nor ...

> --- a/xen/arch/x86/include/asm/hypercall.h
> +++ b/xen/arch/x86/include/asm/hypercall.h
> @@ -20,6 +20,11 @@
>  
>  #ifdef CONFIG_PV
>  void pv_hypercall(struct cpu_user_regs *regs);
> +#else
> +static inline void pv_hypercall(struct cpu_user_regs *regs)
> +{
> +    ASSERT_UNREACHABLE();
> +}
>  #endif

... this, do we? If you expose the decls outside of the #ifdef, I can't help
the impression that all call sites will simply be DCE-ed (thanks to the
!IS_ENABLED(CONFIG_PV) check at the top of entry_from_pv()).
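
IOW, just the declarations (sketch):

    /*
     * Visible regardless of CONFIG_PV: with !CONFIG_PV all call sites sit
     * behind the IS_ENABLED(CONFIG_PV) check and are eliminated, so neither
     * definitions nor stubs are needed.
     */
    void pv_hypercall(struct cpu_user_regs *regs);
    void pv_inject_callback(unsigned int type);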

> --- a/xen/arch/x86/pv/traps.c
> +++ b/xen/arch/x86/pv/traps.c
> @@ -19,6 +19,8 @@
>  #include <asm/shared.h>
>  #include <asm/traps.h>
>  
> +#include <public/callback.h>
> +
>  void pv_inject_event(const struct x86_event *event)
>  {
>      struct vcpu *curr = current;
> @@ -95,6 +97,37 @@ void pv_inject_event(const struct x86_event *event)
>      }
>  }
>  
> +void pv_inject_callback(unsigned int type)
> +{
> +    struct vcpu *curr = current;
> +    struct trap_bounce *tb = &curr->arch.pv.trap_bounce;
> +    unsigned long rip = 0;
> +    bool irq = false;

Move the latter two initializers into a default: case, after an
ASSERT_UNREACHABLE()?
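
I.e. something along these lines (only a sketch; the syscall case body is a
guess at what the rest of the patch does):

    unsigned long rip;
    bool irq;

    switch ( type )
    {
    case CALLBACKTYPE_syscall:
        rip = curr->arch.pv.syscall_callback_eip;  /* assumed field name */
        irq = true;                                /* assumption */
        break;

    /* ... other callback types ... */

    default:
        ASSERT_UNREACHABLE();
        rip = 0;
        irq = false;
        break;
    }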

> +    ASSERT(is_pv_64bit_vcpu(curr));

I was first wondering why you check this here, but yes, PV32 is disabled when
FRED is enabled. IOW if a new use for this function turned up, this could
validly be relaxed.

> @@ -2305,6 +2309,27 @@ void asmlinkage entry_from_pv(struct cpu_user_regs *regs)
>  
>      switch ( type )
>      {
> +    case X86_ET_SW_INT:
> +        /*
> +         * INT $3/4 are indistinguishable from INT3/INTO under IDT, and are

Didn't we discuss this the other day? They are distinguishable, as long as you
set their gates' DPL to 0. Just that we use DPL 3. Hence I think this wants
wording a little differently, to make clear it's our (possibly wrong) choice.

> +         * permitted by Xen without the guest kernel having a choice.  Let

Doesn't the guest have a choice by using TI_SET_DPL() suitably?
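
(Guest side, roughly the below is what I have in mind; the handler name is
made up.)

    /* Register INT $3 with DPL 0, restricting its use to the guest kernel. */
    struct trap_info traps[2] = {
        {
            .vector  = 3,                          /* #BP / INT $3 */
            .cs      = FLAT_KERNEL_CS,
            .address = (unsigned long)int3_entry,  /* hypothetical handler */
        },
        {},                                        /* zero entry terminates */
    };

    TI_SET_DPL(&traps[0], 0);
    HYPERVISOR_set_trap_table(traps);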

> +         * them fall through into X86_ET_HW_EXC, as #BP in particular needs
> +         * handling by do_int3() in case an external debugger is attached.
> +         */

I don't understand this, though. An external debugger would better not place
breakpoints using CD 03, so I think we'd better wire these up the normal INT nn
way. And for #OF I also don't think we need to make an exception.

Jan


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v2 20/23] x86/pv: Exception handling in FRED mode
  2025-09-01 13:27     ` Andrew Cooper
@ 2025-09-01 13:55       ` Jan Beulich
  0 siblings, 0 replies; 56+ messages in thread
From: Jan Beulich @ 2025-09-01 13:55 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: Roger Pau Monné, Xen-devel

On 01.09.2025 15:27, Andrew Cooper wrote:
> On 01/09/2025 1:57 pm, Jan Beulich wrote:
>> On 28.08.2025 17:04, Andrew Cooper wrote:
>>> --- a/xen/arch/x86/traps.c
>>> +++ b/xen/arch/x86/traps.c
>>> @@ -2265,9 +2265,83 @@ void asmlinkage check_ist_exit(const struct cpu_user_regs *regs, bool ist_exit)
>>>  
>>>  void asmlinkage entry_from_pv(struct cpu_user_regs *regs)
>>>  {
>>> +    struct fred_info *fi = cpu_regs_fred_info(regs);
>>> +    uint8_t type = regs->fred_ss.type;
>>> +    uint8_t vec = regs->fred_ss.vector;
>>> +
>>>      /* Copy fred_ss.vector into entry_vector as IDT delivery would have done. */
>>> -    regs->entry_vector = regs->fred_ss.vector;
>>> +    regs->entry_vector = vec;
>>> +
>>> +    if ( !IS_ENABLED(CONFIG_PV) )
>>> +        goto fatal;
>>> +
>>> +    /*
>>> +     * First, handle the asynchronous or fatal events.  These are either
>>> +     * unrelated to the interrupted context, or may not have valid context
>>> +     * recorded, and all have special rules on how/whether to re-enable IRQs.
>>> +     */
>>> +    switch ( type )
>>> +    {
>>> +    case X86_ET_EXT_INTR:
>>> +        return do_IRQ(regs);
>>>  
>>> +    case X86_ET_NMI:
>>> +        return do_nmi(regs);
>>> +
>>> +    case X86_ET_HW_EXC:
>>> +        switch ( vec )
>>> +        {
>>> +        case X86_EXC_DF: return do_double_fault(regs);
>>> +        case X86_EXC_MC: return do_machine_check(regs);
>> Looking at patch 21, I came to wonder where it is that we're moving back to
>> SL0 in the #MC case (which may not be fatal), for ERETU to not fault.
> 
> (Almost) any event taken in Ring3 enters Ring0 at SL0, even those with
> custom STK_LVLS configuration.
> 
> See 5.1.2 Determining the New Values for Stack Level, RSP, and SSP

Oh, right - that's something I still need to properly settle in a corner of
my brain.

Jan


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v2 23/23] x86/pv: Adjust eflags handling for FRED mode
  2025-08-28 15:04 ` [PATCH v2 23/23] x86/pv: Adjust eflags handling for " Andrew Cooper
@ 2025-09-01 14:17   ` Jan Beulich
  0 siblings, 0 replies; 56+ messages in thread
From: Jan Beulich @ 2025-09-01 14:17 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: Roger Pau Monné, Xen-devel

On 28.08.2025 17:04, Andrew Cooper wrote:
> ERETU, unlike IRET, requires the sticky-1 bit (bit 2) be set, and reserved
> bits to be clear.  Notably this means that dom0_construct() must set
> X86_EFLAGS_MBS in order for a PV dom0 to start.
> 
> Adjust arch_set_info_guest*() and hypercall_iret() which consume flags to
> clamp the reserved bits.
> 
> This is a minor ABI change, but by the same argument as commit
> 9f892f84c279 ("x86/domctl: Stop using XLAT_cpu_user_regs()"), this change will
> happen naturally when the vCPU schedules.

It's not that similar, is it? MBS will be observed set once guest context is
entered, irrespective of any scheduling. So it's entirely benign if we set
it up-front, except of course for a back-to-back set/get.

> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
> ---
> CC: Jan Beulich <JBeulich@suse.com>
> CC: Roger Pau Monné <roger.pau@citrix.com>
> 
> v2:
>  * New
> 
> The handling of VM is complicated.
> 
> It turns out that it's simply ignored by IRET in Long Mode (i.e. clearing it
> in commit 0e47f92b0725 ("x86: force EFLAGS.IF on when exiting to PV guests")
> wasn't actually necessary) but ERETU does care.
> 
> But, it's unclear how to handle this in arch_set_info().  We must preserve
> it for HVM guests (which can use vm86 mode).  PV32 has special handling but
> only in hypercall_iret(), not in arch_set_info().

I think we need to either reject or clear VM, NT, IOPL, and whatever else
would make ERETU unhappy (for IOPL we already do so). It simply is of no
use to ...

> --- a/xen/arch/x86/domain.c
> +++ b/xen/arch/x86/domain.c
> @@ -1273,7 +1273,7 @@ int arch_set_info_guest(
>          v->arch.user_regs.rax               = c.nat->user_regs.rax;
>          v->arch.user_regs.rip               = c.nat->user_regs.rip;
>          v->arch.user_regs.cs                = c.nat->user_regs.cs;
> -        v->arch.user_regs.rflags            = c.nat->user_regs.rflags;
> +        v->arch.user_regs.rflags            = (c.nat->user_regs.rflags & X86_EFLAGS_ALL) | X86_EFLAGS_MBS;
>          v->arch.user_regs.rsp               = c.nat->user_regs.rsp;
>          v->arch.user_regs.ss                = c.nat->user_regs.ss;
>          v->arch.pv.es                       = c.nat->user_regs.es;
> @@ -1297,7 +1297,7 @@ int arch_set_info_guest(
>          v->arch.user_regs.eax               = c.cmp->user_regs.eax;
>          v->arch.user_regs.eip               = c.cmp->user_regs.eip;
>          v->arch.user_regs.cs                = c.cmp->user_regs.cs;
> -        v->arch.user_regs.eflags            = c.cmp->user_regs.eflags;
> +        v->arch.user_regs.eflags            = (c.cmp->user_regs.eflags & X86_EFLAGS_ALL) | X86_EFLAGS_MBS;
>          v->arch.user_regs.esp               = c.cmp->user_regs.esp;
>          v->arch.user_regs.ss                = c.cmp->user_regs.ss;
>          v->arch.pv.es                       = c.cmp->user_regs.es;

... accept the bits here, just for the first exit to guest mode to fault on
the ERETU. The guest would have a hard time to recover from that, I expect.
Yet perhaps we should do this only conditionally when FRED is active. Then
again a VM migrating from a pre-FRED host to a FRED one might observe the
(minor) behavioral change later on.
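
If we do go the clearing route, a single helper would keep the call sites in
sync; something like (name invented, and whether VM wants clearing for HVM is
exactly the open question above):

    /* Clamp a guest supplied eflags image to something ERETU will accept. */
    static inline unsigned long pv_clamp_guest_eflags(unsigned long eflags)
    {
        return (eflags & X86_EFLAGS_ALL &
                ~(X86_EFLAGS_VM | X86_EFLAGS_NT | X86_EFLAGS_IOPL)) |
               X86_EFLAGS_MBS;
    }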

> --- a/xen/arch/x86/hvm/domain.c
> +++ b/xen/arch/x86/hvm/domain.c
> @@ -194,7 +194,7 @@ int arch_set_info_hvm_guest(struct vcpu *v, const struct vcpu_hvm_context *ctx)
>          uregs->rsi    = regs->esi;
>          uregs->rdi    = regs->edi;
>          uregs->rip    = regs->eip;
> -        uregs->rflags = regs->eflags;
> +        uregs->rflags = (regs->eflags & X86_EFLAGS_ALL) | X86_EFLAGS_MBS;
>  
>          v->arch.hvm.guest_cr[0] = regs->cr0;
>          v->arch.hvm.guest_cr[3] = regs->cr3;
> @@ -245,7 +245,7 @@ int arch_set_info_hvm_guest(struct vcpu *v, const struct vcpu_hvm_context *ctx)
>          uregs->rsi    = regs->rsi;
>          uregs->rdi    = regs->rdi;
>          uregs->rip    = regs->rip;
> -        uregs->rflags = regs->rflags;
> +        uregs->rflags = (regs->rflags & X86_EFLAGS_ALL) | X86_EFLAGS_MBS;

Why would the HVM code need changing at all? We never ERETU there.

Jan


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v2 06/23] x86/traps: Set MSR_PL0_SSP in load_system_tables()
  2025-09-01  9:23   ` Jan Beulich
@ 2025-09-01 15:31     ` Andrew Cooper
  0 siblings, 0 replies; 56+ messages in thread
From: Andrew Cooper @ 2025-09-01 15:31 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Roger Pau Monné, Xen-devel

On 01/09/2025 10:23 am, Jan Beulich wrote:
> On 28.08.2025 17:03, Andrew Cooper wrote:
>> FRED and IDT differ by a Supervisor Token on the base of the shstk.  This
>> means that the value they load into MSR_PL0_SSP differs by 8.
>>
>> s3_resume() in particular has logic which is otherwise invariant of FRED mode,
>> and must not clobber a FRED MSR_PL0_SSP with an IDT one.
>>
>> This also simplifies the AP path.  Updating reinit_bsp_stack() is deferred
>> until later.
> This last sentence looks to be ...
>
>> No functional change.
>>
>> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
>> ---
>> CC: Jan Beulich <JBeulich@suse.com>
>> CC: Roger Pau Monné <roger.pau@citrix.com>
>>
>> v2:
>>  * Extend comment about clearing the busy bit.
>>  * Move reinit_bsp_stack() hunk into this patch.
> ... stale, according to this. Other than that:
> Reviewed-by: Jan Beulich <jbeulich@suse.com>

Oops yes.  I'll drop.

~Andrew


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v2 07/23] x86/boot: Use RSTORSSP to establish SSP
  2025-09-01  9:28   ` Jan Beulich
@ 2025-09-01 15:33     ` Andrew Cooper
  2025-09-01 15:41       ` Jan Beulich
  0 siblings, 1 reply; 56+ messages in thread
From: Andrew Cooper @ 2025-09-01 15:33 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Roger Pau Monné, Xen-devel

On 01/09/2025 10:28 am, Jan Beulich wrote:
> On 28.08.2025 17:03, Andrew Cooper wrote:
>> @@ -908,7 +909,29 @@ static void __init noreturn reinit_bsp_stack(void)
>>      if ( cpu_has_xen_shstk )
>>      {
>>          wrmsrl(MSR_S_CET, xen_msr_s_cet_value());
>> -        asm volatile ("setssbsy" ::: "memory");
>> +
>> +        /*
>> +         * IDT and FRED differ by a Supervisor Token on the shadow stack, and
>> +         * therefore by the value in MSR_PL0_SSP.
> Beside not being overly relevant here afaict, is this last part of the sentence
> actually correct? Patch 06 doesn't write different values into the MSR.

It is correct, but also well hidden.

#define MSR_FRED_SSP_SL0                    MSR_PL0_SSP

I suppose I should write MSR_PL0_SSP/MSR_FRED_SSP_SL0 here to
highlight the logically different names for the two modes.

~Andrew


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v2 11/23] x86/traps: Make an IDT-specific #PF helper
  2025-09-01  9:46   ` Jan Beulich
@ 2025-09-01 15:36     ` Andrew Cooper
  2025-09-01 15:42       ` Jan Beulich
  0 siblings, 1 reply; 56+ messages in thread
From: Andrew Cooper @ 2025-09-01 15:36 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Roger Pau Monné, Xen-devel

On 01/09/2025 10:46 am, Jan Beulich wrote:
> On 28.08.2025 17:03, Andrew Cooper wrote:
>> FRED provides %cr2 in the stack frame, avoiding the need to read %cr2
>> manually.
>>
>> Rename do_page_fault() to handle_PF(), and update it to take cr2, still named
>> addr for consistency.
>>
>> Introduce a new handle_PF_IDT() which reads %cr2 and conditionally re-enables
>> interrupts.
> Why does this IRQ-re-enabling move to the IDT-specific function? Do you intend
> to do the re-enabling yet earlier in FRED mode?

It is updated in a common path under FRED.

#PF (and #DB, but that one didn't get updated recently) are the weird cases in
IDT mode.
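
I.e. the IDT-specific wrapper ends up as roughly (not quoting the patch
verbatim; handle_PF()'s exact signature is as described in the commit
message):

    void asmlinkage handle_PF_IDT(struct cpu_user_regs *regs)
    {
        unsigned long addr = read_cr2();

        /* Re-enable IRQs if the interrupted context had them enabled. */
        if ( regs->eflags & X86_EFLAGS_IF )
            local_irq_enable();

        handle_PF(regs, addr);
    }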

~Andrew


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v2 07/23] x86/boot: Use RSTORSSP to establish SSP
  2025-09-01 15:33     ` Andrew Cooper
@ 2025-09-01 15:41       ` Jan Beulich
  2025-09-01 18:53         ` Andrew Cooper
  0 siblings, 1 reply; 56+ messages in thread
From: Jan Beulich @ 2025-09-01 15:41 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: Roger Pau Monné, Xen-devel

On 01.09.2025 17:33, Andrew Cooper wrote:
> On 01/09/2025 10:28 am, Jan Beulich wrote:
>> On 28.08.2025 17:03, Andrew Cooper wrote:
>>> @@ -908,7 +909,29 @@ static void __init noreturn reinit_bsp_stack(void)
>>>      if ( cpu_has_xen_shstk )
>>>      {
>>>          wrmsrl(MSR_S_CET, xen_msr_s_cet_value());
>>> -        asm volatile ("setssbsy" ::: "memory");
>>> +
>>> +        /*
>>> +         * IDT and FRED differ by a Supervisor Token on the shadow stack, and
>>> +         * therefore by the value in MSR_PL0_SSP.
>> Beside not being overly relevant here afaict, is this last part of the sentence
>> actually correct? Patch 06 doesn't write different values into the MSR.
> 
> It is correct, but also well hidden.
> 
> #define MSR_FRED_SSP_SL0                    MSR_PL0_SSP
> 
> I suppose I should should write MSR_PL0_SSP/MSR_FRED_SSP_SL0 here to
> highlight the logically different names for the two modes.

But the code following the comment doesn't access any MSR. That's what
first tripped me up; only then did I go looking for, and fail to spot, the
two different writes. Now that you point out the aliasing it becomes clear
that until patch 14 it is simply impossible to find that other write.

Jan


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v2 11/23] x86/traps: Make an IDT-specific #PF helper
  2025-09-01 15:36     ` Andrew Cooper
@ 2025-09-01 15:42       ` Jan Beulich
  0 siblings, 0 replies; 56+ messages in thread
From: Jan Beulich @ 2025-09-01 15:42 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: Roger Pau Monné, Xen-devel

On 01.09.2025 17:36, Andrew Cooper wrote:
> On 01/09/2025 10:46 am, Jan Beulich wrote:
>> On 28.08.2025 17:03, Andrew Cooper wrote:
>>> FRED provides %cr2 in the stack frame, avoiding the need to read %cr2
>>> manually.
>>>
>>> Rename do_page_fault() to handle_PF(), and update it to take cr2, still named
>>> addr for consistency.
>>>
>>> Introduce a new handle_PF_IDT() which reads %cr2 and conditionally re-enables
>>> interrupts.
>> Why does this IRQ-re-enabling move to the IDT-specific function? Do you intend
>> to do the re-enabling yet earlier in FRED mode?
> 
> It is updated in a common path under FRED.

Right, having spotted it in the meantime:
Reviewed-by: Jan Beulich <jbeulich@suse.com>

Jan


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v2 07/23] x86/boot: Use RSTORSSP to establish SSP
  2025-09-01 15:41       ` Jan Beulich
@ 2025-09-01 18:53         ` Andrew Cooper
  2025-09-02  6:17           ` Jan Beulich
  0 siblings, 1 reply; 56+ messages in thread
From: Andrew Cooper @ 2025-09-01 18:53 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Roger Pau Monné, Xen-devel

On 01/09/2025 4:41 pm, Jan Beulich wrote:
> On 01.09.2025 17:33, Andrew Cooper wrote:
>> On 01/09/2025 10:28 am, Jan Beulich wrote:
>>> On 28.08.2025 17:03, Andrew Cooper wrote:
>>>> @@ -908,7 +909,29 @@ static void __init noreturn reinit_bsp_stack(void)
>>>>      if ( cpu_has_xen_shstk )
>>>>      {
>>>>          wrmsrl(MSR_S_CET, xen_msr_s_cet_value());
>>>> -        asm volatile ("setssbsy" ::: "memory");
>>>> +
>>>> +        /*
>>>> +         * IDT and FRED differ by a Supervisor Token on the shadow stack, and
>>>> +         * therefore by the value in MSR_PL0_SSP.
>>> Beside not being overly relevant here afaict, is this last part of the sentence
>>> actually correct? Patch 06 doesn't write different values into the MSR.
>> It is correct, but also well hidden.
>>
>> #define MSR_FRED_SSP_SL0                    MSR_PL0_SSP
>>
>> I suppose I should should write MSR_PL0_SSP/MSR_FRED_SSP_SL0 here to
>> highlight the logically different names for the two modes.
> But the code following the comment doesn't access any MSR. That's what
> first tripped me up. It was only then that I wasn't able to spot the two
> different writes. Now that you point out the aliasing it becomes clear
> that until patch 14 it is simply impossible to find that other write.

I suppose the MSR value isn't relevant now that neither path writes the
value.  The first iteration had both writes here.

I guess I can drop that paragraph, and just have the second?

~Andrew


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v2 07/23] x86/boot: Use RSTORSSP to establish SSP
  2025-09-01 18:53         ` Andrew Cooper
@ 2025-09-02  6:17           ` Jan Beulich
  0 siblings, 0 replies; 56+ messages in thread
From: Jan Beulich @ 2025-09-02  6:17 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: Roger Pau Monné, Xen-devel

On 01.09.2025 20:53, Andrew Cooper wrote:
> On 01/09/2025 4:41 pm, Jan Beulich wrote:
>> On 01.09.2025 17:33, Andrew Cooper wrote:
>>> On 01/09/2025 10:28 am, Jan Beulich wrote:
>>>> On 28.08.2025 17:03, Andrew Cooper wrote:
>>>>> @@ -908,7 +909,29 @@ static void __init noreturn reinit_bsp_stack(void)
>>>>>      if ( cpu_has_xen_shstk )
>>>>>      {
>>>>>          wrmsrl(MSR_S_CET, xen_msr_s_cet_value());
>>>>> -        asm volatile ("setssbsy" ::: "memory");
>>>>> +
>>>>> +        /*
>>>>> +         * IDT and FRED differ by a Supervisor Token on the shadow stack, and
>>>>> +         * therefore by the value in MSR_PL0_SSP.
>>>> Beside not being overly relevant here afaict, is this last part of the sentence
>>>> actually correct? Patch 06 doesn't write different values into the MSR.
>>> It is correct, but also well hidden.
>>>
>>> #define MSR_FRED_SSP_SL0                    MSR_PL0_SSP
>>>
>>> I suppose I should should write MSR_PL0_SSP/MSR_FRED_SSP_SL0 here to
>>> highlight the logically different names for the two modes.
>> But the code following the comment doesn't access any MSR. That's what
>> first tripped me up. It was only then that I wasn't able to spot the two
>> different writes. Now that you point out the aliasing it becomes clear
>> that until patch 14 it is simply impossible to find that other write.
> 
> I suppose the MSR value isn't relevant now that neither paths write the
> value.  The first iteration had both writes here.
> 
> I guess I can drop that paragraph, and just have the second?

I'd keep everything up to the comma (plus the other paragraph of course).

Jan


^ permalink raw reply	[flat|nested] 56+ messages in thread

end of thread, other threads:[~2025-09-02  6:17 UTC | newest]

Thread overview: 56+ messages
2025-08-28 15:03 [PATCH v2 00/23] x86: FRED support Andrew Cooper
2025-08-28 15:03 ` [PATCH v2 01/23] x86: FRED enumerations Andrew Cooper
2025-08-28 15:33   ` Andrew Cooper
2025-08-28 15:03 ` [PATCH v2 02/23] x86/traps: Extend struct cpu_user_regs/cpu_info with FRED fields Andrew Cooper
2025-08-28 15:18   ` Jan Beulich
2025-08-28 15:03 ` [PATCH v2 03/23] x86/traps: Introduce opt_fred Andrew Cooper
2025-08-28 15:20   ` Jan Beulich
2025-09-01 13:15   ` Jan Beulich
2025-08-28 15:03 ` [PATCH v2 04/23] x86/boot: Adjust CR4 handling around percpu_early_traps_init() Andrew Cooper
2025-08-28 15:03 ` [PATCH v2 05/23] x86/S3: Switch to using RSTORSSP to recover SSP on resume Andrew Cooper
2025-08-28 15:03 ` [PATCH v2 06/23] x86/traps: Set MSR_PL0_SSP in load_system_tables() Andrew Cooper
2025-09-01  9:23   ` Jan Beulich
2025-09-01 15:31     ` Andrew Cooper
2025-08-28 15:03 ` [PATCH v2 07/23] x86/boot: Use RSTORSSP to establish SSP Andrew Cooper
2025-09-01  9:28   ` Jan Beulich
2025-09-01 15:33     ` Andrew Cooper
2025-09-01 15:41       ` Jan Beulich
2025-09-01 18:53         ` Andrew Cooper
2025-09-02  6:17           ` Jan Beulich
2025-08-28 15:03 ` [PATCH v2 08/23] x86/traps: Alter switch_stack_and_jump() for FRED mode Andrew Cooper
2025-09-01  9:30   ` Jan Beulich
2025-08-28 15:03 ` [PATCH v2 09/23] x86/traps: Skip Supervisor Shadow Stack tokens in " Andrew Cooper
2025-09-01  9:38   ` Jan Beulich
2025-08-28 15:03 ` [PATCH v2 10/23] x86/traps: Make an IDT-specific #DB helper Andrew Cooper
2025-09-01  9:41   ` Jan Beulich
2025-08-28 15:03 ` [PATCH v2 11/23] x86/traps: Make an IDT-specific #PF helper Andrew Cooper
2025-09-01  9:46   ` Jan Beulich
2025-09-01 15:36     ` Andrew Cooper
2025-09-01 15:42       ` Jan Beulich
2025-08-28 15:03 ` [PATCH v2 12/23] x86/fsgsbase: Make gskern accesses safe under FRED Andrew Cooper
2025-09-01  9:57   ` Jan Beulich
2025-08-28 15:03 ` [PATCH v2 13/23] x86/traps: Introduce FRED entrypoints Andrew Cooper
2025-09-01 10:26   ` Jan Beulich
2025-08-28 15:04 ` [PATCH v2 14/23] x86/traps: Enable FRED when requested Andrew Cooper
2025-09-01 10:50   ` Jan Beulich
2025-08-28 15:04 ` [PATCH v2 15/23] x86/pv: Deduplicate is_canonical_address() in do_set_segment_base() Andrew Cooper
2025-09-01 11:22   ` Jan Beulich
2025-08-28 15:04 ` [PATCH v2 16/23] x86/entry: Alter how IRET faults are recognised Andrew Cooper
2025-09-01 11:41   ` Jan Beulich
2025-08-28 15:04 ` [PATCH v2 17/23] x86/entry: Drop the pre exception table infrastructure Andrew Cooper
2025-09-01 11:45   ` Jan Beulich
2025-08-28 15:04 ` [PATCH v2 18/23] x86/entry: Rework the comment about SYSCALL and DF Andrew Cooper
2025-09-01 11:52   ` Jan Beulich
2025-08-28 15:04 ` [PATCH v2 19/23] x86/pv: Adjust GS handling for FRED mode Andrew Cooper
2025-09-01 12:02   ` Jan Beulich
2025-08-28 15:04 ` [PATCH v2 20/23] x86/pv: Exception handling in " Andrew Cooper
2025-09-01 12:23   ` Jan Beulich
2025-09-01 12:57   ` Jan Beulich
2025-09-01 13:27     ` Andrew Cooper
2025-09-01 13:55       ` Jan Beulich
2025-08-28 15:04 ` [PATCH v2 21/23] x86/pv: ERETU error handling Andrew Cooper
2025-09-01 13:13   ` Jan Beulich
2025-08-28 15:04 ` [PATCH v2 22/23] x86/pv: System call handling in FRED mode Andrew Cooper
2025-09-01 13:49   ` Jan Beulich
2025-08-28 15:04 ` [PATCH v2 23/23] x86/pv: Adjust eflags handling for " Andrew Cooper
2025-09-01 14:17   ` Jan Beulich
