* [PATCH v3 01/22] x86/msr: Change rdmsr() to have normal API
2025-10-03 22:53 [PATCH v3 for-4.21 00/22] x86: FRED support Andrew Cooper
@ 2025-10-03 22:53 ` Andrew Cooper
2025-10-07 15:47 ` Jan Beulich
2025-10-03 22:53 ` [PATCH v3 02/22] x86/msr: Change wrmsr() to take a single parameter Andrew Cooper
` (21 subsequent siblings)
22 siblings, 1 reply; 40+ messages in thread
From: Andrew Cooper @ 2025-10-03 22:53 UTC (permalink / raw)
To: Xen-devel; +Cc: Andrew Cooper, Jan Beulich, Roger Pau Monné
We want a consistent MSR API, and these want to be named rdmsr() and wrmsr(),
but not with their current APIs. The current rdmsr() flavours, writing to
their parameters by name, make for code that reads like invalid C and is
unergonomic to use in lots of cases.
Change the API, and update the callers all in one go. Where appropriate,
update the write side to wrmsrns() as per the recommendation.
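As an illustrative sketch of the conversion pattern (MSR_FOO and the local
variables are placeholders, not code from this series):

    uint32_t lo, hi;

    /* Old API: out-parameters, written to by name. */
    rdmsr(MSR_FOO, lo, hi);
    val = ((uint64_t)hi << 32) | lo;

    /* New API: a normal function returning the 64-bit value. */
    val = rdmsr(MSR_FOO);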
No functional change.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
CC: Jan Beulich <JBeulich@suse.com>
CC: Roger Pau Monné <roger.pau@citrix.com>
v3:
* Fix m10Ah typos.
* Don't truncate MSR_MCU_OPT_CTRL / MSR_PB_OPT_CTRL.
* Swap u32 for uint32_t in _vmx_cpu_up().
* Switch to always_inline to reduce churn in later patch.
---
xen/arch/x86/acpi/cpufreq/powernow.c | 12 ++++++-----
xen/arch/x86/cpu/amd.c | 6 +++---
xen/arch/x86/cpu/common.c | 20 +++++++++++--------
xen/arch/x86/cpu/intel.c | 30 ++++++++++++++--------------
xen/arch/x86/genapic/x2apic.c | 5 +----
xen/arch/x86/hvm/vmx/vmcs.c | 30 +++++++++++++++++++++-------
xen/arch/x86/include/asm/msr.h | 30 ++++++++++++++++++++++++----
xen/arch/x86/include/asm/prot-key.h | 6 +-----
xen/arch/x86/tsx.c | 27 ++++++++++---------------
9 files changed, 99 insertions(+), 67 deletions(-)
diff --git a/xen/arch/x86/acpi/cpufreq/powernow.c b/xen/arch/x86/acpi/cpufreq/powernow.c
index 12fca45b4551..71ac0b45261a 100644
--- a/xen/arch/x86/acpi/cpufreq/powernow.c
+++ b/xen/arch/x86/acpi/cpufreq/powernow.c
@@ -140,23 +140,25 @@ static int cf_check powernow_cpufreq_target(
static void amd_fixup_frequency(struct xen_processor_px *px)
{
- u32 hi, lo, fid, did;
+ uint64_t val;
+ uint32_t fid, did;
int index = px->control & 0x00000007;
const struct cpuinfo_x86 *c = &current_cpu_data;
if ((c->x86 != 0x10 || c->x86_model >= 10) && c->x86 != 0x11)
return;
- rdmsr(MSR_PSTATE_DEF_BASE + index, lo, hi);
+ val = rdmsr(MSR_PSTATE_DEF_BASE + index);
+
/*
* MSR C001_0064+:
* Bit 63: PstateEn. Read-write. If set, the P-state is valid.
*/
- if (!(hi & (1U << 31)))
+ if (!(val & (1UL << 63)))
return;
- fid = lo & 0x3f;
- did = (lo >> 6) & 7;
+ fid = val & 0x3f;
+ did = (val >> 6) & 7;
if (c->x86 == 0x10)
px->core_frequency = (100 * (fid + 16)) >> did;
else
diff --git a/xen/arch/x86/cpu/amd.c b/xen/arch/x86/cpu/amd.c
index 9767f6353973..43481daa8e26 100644
--- a/xen/arch/x86/cpu/amd.c
+++ b/xen/arch/x86/cpu/amd.c
@@ -1369,9 +1369,9 @@ static void cf_check init_amd(struct cpuinfo_x86 *c)
}
if (cpu_has(c, X86_FEATURE_EFRO)) {
- rdmsr(MSR_K8_HWCR, l, h);
- l |= (1 << 27); /* Enable read-only APERF/MPERF bit */
- wrmsr(MSR_K8_HWCR, l, h);
+ /* Enable read-only APERF/MPERF bit */
+ wrmsrns(MSR_K8_HWCR,
+ rdmsr(MSR_K8_HWCR) | (1 << 27));
}
/* Prevent TSC drift in non single-processor, single-core platforms. */
diff --git a/xen/arch/x86/cpu/common.c b/xen/arch/x86/cpu/common.c
index 8c8bf1a806c6..37820a3a08ab 100644
--- a/xen/arch/x86/cpu/common.c
+++ b/xen/arch/x86/cpu/common.c
@@ -329,6 +329,7 @@ static inline u32 phys_pkg_id(u32 cpuid_apic, int index_msb)
void __init early_cpu_init(bool verbose)
{
struct cpuinfo_x86 *c = &boot_cpu_data;
+ uint64_t val;
u32 eax, ebx, ecx, edx;
c->x86_cache_alignment = 32;
@@ -412,10 +413,11 @@ void __init early_cpu_init(bool verbose)
&c->x86_capability[FEATURESET_7c0],
&c->x86_capability[FEATURESET_7d0]);
- if (test_bit(X86_FEATURE_ARCH_CAPS, c->x86_capability))
- rdmsr(MSR_ARCH_CAPABILITIES,
- c->x86_capability[FEATURESET_m10Al],
- c->x86_capability[FEATURESET_m10Ah]);
+ if (test_bit(X86_FEATURE_ARCH_CAPS, c->x86_capability)) {
+ val = rdmsr(MSR_ARCH_CAPABILITIES);
+ c->x86_capability[FEATURESET_m10Al] = val;
+ c->x86_capability[FEATURESET_m10Ah] = val >> 32;
+ }
if (max_subleaf >= 1)
cpuid_count(7, 1, &eax, &ebx, &ecx,
@@ -467,6 +469,7 @@ void reset_cpuinfo(struct cpuinfo_x86 *c, bool keep_basic)
static void generic_identify(struct cpuinfo_x86 *c)
{
+ uint64_t val;
u32 eax, ebx, ecx, edx, tmp;
/* Get vendor name */
@@ -566,10 +569,11 @@ static void generic_identify(struct cpuinfo_x86 *c)
&c->x86_capability[FEATURESET_Da1],
&tmp, &tmp, &tmp);
- if (test_bit(X86_FEATURE_ARCH_CAPS, c->x86_capability))
- rdmsr(MSR_ARCH_CAPABILITIES,
- c->x86_capability[FEATURESET_m10Al],
- c->x86_capability[FEATURESET_m10Ah]);
+ if (test_bit(X86_FEATURE_ARCH_CAPS, c->x86_capability)) {
+ val = rdmsr(MSR_ARCH_CAPABILITIES);
+ c->x86_capability[FEATURESET_m10Al] = val;
+ c->x86_capability[FEATURESET_m10Ah] = val >> 32;
+ }
}
/*
diff --git a/xen/arch/x86/cpu/intel.c b/xen/arch/x86/cpu/intel.c
index ecca11f04db8..6f71365b7ea0 100644
--- a/xen/arch/x86/cpu/intel.c
+++ b/xen/arch/x86/cpu/intel.c
@@ -23,17 +23,17 @@ static uint32_t __ro_after_init mcu_opt_ctrl_val;
void update_mcu_opt_ctrl(void)
{
- uint32_t mask = mcu_opt_ctrl_mask, lo, hi;
+ uint64_t mask = mcu_opt_ctrl_mask, val;
if ( !mask )
return;
- rdmsr(MSR_MCU_OPT_CTRL, lo, hi);
+ val = rdmsr(MSR_MCU_OPT_CTRL);
- lo &= ~mask;
- lo |= mcu_opt_ctrl_val;
+ val &= ~mask;
+ val |= mcu_opt_ctrl_val;
- wrmsr(MSR_MCU_OPT_CTRL, lo, hi);
+ wrmsrns(MSR_MCU_OPT_CTRL, val);
}
void __init set_in_mcu_opt_ctrl(uint32_t mask, uint32_t val)
@@ -51,17 +51,17 @@ static uint32_t __ro_after_init pb_opt_ctrl_val;
void update_pb_opt_ctrl(void)
{
- uint32_t mask = pb_opt_ctrl_mask, lo, hi;
+ uint64_t mask = pb_opt_ctrl_mask, val;
if ( !mask )
return;
- rdmsr(MSR_PB_OPT_CTRL, lo, hi);
+ val = rdmsr(MSR_PB_OPT_CTRL);
- lo &= ~mask;
- lo |= pb_opt_ctrl_val;
+ val &= ~mask;
+ val |= pb_opt_ctrl_val;
- wrmsr(MSR_PB_OPT_CTRL, lo, hi);
+ wrmsrns(MSR_PB_OPT_CTRL, val);
}
void __init set_in_pb_opt_ctrl(uint32_t mask, uint32_t val)
@@ -456,15 +456,15 @@ static void __init probe_mwait_errata(void)
*/
static void Intel_errata_workarounds(struct cpuinfo_x86 *c)
{
- unsigned long lo, hi;
+ uint64_t val;
if ((c->x86 == 15) && (c->x86_model == 1) && (c->x86_mask == 1)) {
- rdmsr (MSR_IA32_MISC_ENABLE, lo, hi);
- if ((lo & (1<<9)) == 0) {
+ val = rdmsr(MSR_IA32_MISC_ENABLE);
+ if ((val & (1 << 9)) == 0) {
printk (KERN_INFO "CPU: C0 stepping P4 Xeon detected.\n");
printk (KERN_INFO "CPU: Disabling hardware prefetching (Errata 037)\n");
- lo |= (1<<9); /* Disable hw prefetching */
- wrmsr (MSR_IA32_MISC_ENABLE, lo, hi);
+ val |= (1 << 9); /* Disable hw prefetching */
+ wrmsrns(MSR_IA32_MISC_ENABLE, val);
}
}
diff --git a/xen/arch/x86/genapic/x2apic.c b/xen/arch/x86/genapic/x2apic.c
index 1d55eb6b8a41..58157c217ee8 100644
--- a/xen/arch/x86/genapic/x2apic.c
+++ b/xen/arch/x86/genapic/x2apic.c
@@ -268,14 +268,11 @@ const struct genapic *__init apic_x2apic_probe(void)
void __init check_x2apic_preenabled(void)
{
- u32 lo, hi;
-
if ( !cpu_has_x2apic )
return;
/* Check whether x2apic mode was already enabled by the BIOS. */
- rdmsr(MSR_APIC_BASE, lo, hi);
- if ( lo & APIC_BASE_EXTD )
+ if ( rdmsr(MSR_APIC_BASE) & APIC_BASE_EXTD )
{
printk("x2APIC mode is already enabled by BIOS.\n");
x2apic_enabled = 1;
diff --git a/xen/arch/x86/hvm/vmx/vmcs.c b/xen/arch/x86/hvm/vmx/vmcs.c
index ab8b1c87ec0f..b639818b6ea6 100644
--- a/xen/arch/x86/hvm/vmx/vmcs.c
+++ b/xen/arch/x86/hvm/vmx/vmcs.c
@@ -207,9 +207,13 @@ static void __init vmx_display_features(void)
static u32 adjust_vmx_controls(
const char *name, u32 ctl_min, u32 ctl_opt, u32 msr, bool *mismatch)
{
+ uint64_t val;
u32 vmx_msr_low, vmx_msr_high, ctl = ctl_min | ctl_opt;
- rdmsr(msr, vmx_msr_low, vmx_msr_high);
+ val = rdmsr(msr);
+
+ vmx_msr_low = val;
+ vmx_msr_high = val >> 32;
ctl &= vmx_msr_high; /* bit == 0 in high word ==> must be zero */
ctl |= vmx_msr_low; /* bit == 1 in low word ==> must be one */
@@ -258,10 +262,13 @@ static int vmx_init_vmcs_config(bool bsp)
{
u32 vmx_basic_msr_low, vmx_basic_msr_high, min, opt;
struct vmx_caps caps = {};
- u64 _vmx_misc_cap = 0;
+ uint64_t _vmx_misc_cap = 0, val;
bool mismatch = false;
- rdmsr(MSR_IA32_VMX_BASIC, vmx_basic_msr_low, vmx_basic_msr_high);
+ val = rdmsr(MSR_IA32_VMX_BASIC);
+
+ vmx_basic_msr_low = val;
+ vmx_basic_msr_high = val >> 32;
min = (PIN_BASED_EXT_INTR_MASK |
PIN_BASED_NMI_EXITING);
@@ -366,7 +373,10 @@ static int vmx_init_vmcs_config(bool bsp)
if ( caps.secondary_exec_control & (SECONDARY_EXEC_ENABLE_EPT |
SECONDARY_EXEC_ENABLE_VPID) )
{
- rdmsr(MSR_IA32_VMX_EPT_VPID_CAP, caps.ept, caps.vpid);
+ val = rdmsr(MSR_IA32_VMX_EPT_VPID_CAP);
+
+ caps.ept = val;
+ caps.vpid = val >> 32;
if ( !opt_ept_ad )
caps.ept &= ~VMX_EPT_AD_BIT;
@@ -408,9 +418,15 @@ static int vmx_init_vmcs_config(bool bsp)
* We check VMX_BASIC_MSR[55] to correctly handle default controls.
*/
uint32_t must_be_one, must_be_zero, msr = MSR_IA32_VMX_PROCBASED_CTLS;
+
if ( vmx_basic_msr_high & (VMX_BASIC_DEFAULT1_ZERO >> 32) )
msr = MSR_IA32_VMX_TRUE_PROCBASED_CTLS;
- rdmsr(msr, must_be_one, must_be_zero);
+
+ val = rdmsr(msr);
+
+ must_be_one = val;
+ must_be_zero = val >> 32;
+
if ( must_be_one & (CPU_BASED_INVLPG_EXITING |
CPU_BASED_CR3_LOAD_EXITING |
CPU_BASED_CR3_STORE_EXITING) )
@@ -699,7 +715,7 @@ void cf_check vmx_cpu_dead(unsigned int cpu)
static int _vmx_cpu_up(bool bsp)
{
- u32 eax, edx;
+ uint32_t eax;
int rc, bios_locked, cpu = smp_processor_id();
u64 cr0, vmx_cr0_fixed0, vmx_cr0_fixed1;
@@ -719,7 +735,7 @@ static int _vmx_cpu_up(bool bsp)
return -EINVAL;
}
- rdmsr(MSR_IA32_FEATURE_CONTROL, eax, edx);
+ eax = rdmsr(MSR_IA32_FEATURE_CONTROL);
bios_locked = !!(eax & IA32_FEATURE_CONTROL_LOCK);
if ( bios_locked )
diff --git a/xen/arch/x86/include/asm/msr.h b/xen/arch/x86/include/asm/msr.h
index 901770555b8c..188a50f9cea4 100644
--- a/xen/arch/x86/include/asm/msr.h
+++ b/xen/arch/x86/include/asm/msr.h
@@ -9,10 +9,32 @@
#include <asm/asm_defns.h>
#include <asm/msr-index.h>
-#define rdmsr(msr,val1,val2) \
- __asm__ __volatile__("rdmsr" \
- : "=a" (val1), "=d" (val2) \
- : "c" (msr))
+/*
+ * MSR APIs. Most logic is expected to use:
+ *
+ * uint64_t foo = rdmsr(MSR_BAR);
+ * wrmsrns(MSR_BAR, foo);
+ *
+ * In addition, *_safe() wrappers exist to cope gracefully with a #GP.
+ *
+ *
+ * All legacy forms are to be phased out:
+ *
+ * rdmsrl(MSR_FOO, val);
+ * wrmsr(MSR_FOO, lo, hi);
+ * wrmsrl(MSR_FOO, val);
+ */
+
+static always_inline uint64_t rdmsr(unsigned int msr)
+{
+ unsigned long lo, hi;
+
+ asm volatile ( "rdmsr"
+ : "=a" (lo), "=d" (hi)
+ : "c" (msr) );
+
+ return (hi << 32) | lo;
+}
#define rdmsrl(msr,val) do { unsigned long a__,b__; \
__asm__ __volatile__("rdmsr" \
diff --git a/xen/arch/x86/include/asm/prot-key.h b/xen/arch/x86/include/asm/prot-key.h
index 3e9c2eaef415..8fb15b5c32e9 100644
--- a/xen/arch/x86/include/asm/prot-key.h
+++ b/xen/arch/x86/include/asm/prot-key.h
@@ -52,11 +52,7 @@ DECLARE_PER_CPU(uint32_t, pkrs);
static inline uint32_t rdpkrs(void)
{
- uint32_t pkrs, tmp;
-
- rdmsr(MSR_PKRS, pkrs, tmp);
-
- return pkrs;
+ return rdmsr(MSR_PKRS);
}
static inline uint32_t rdpkrs_and_cache(void)
diff --git a/xen/arch/x86/tsx.c b/xen/arch/x86/tsx.c
index 2a0c7c08a2ba..fe9f0ab4f792 100644
--- a/xen/arch/x86/tsx.c
+++ b/xen/arch/x86/tsx.c
@@ -42,6 +42,8 @@ void tsx_init(void)
{
static bool __read_mostly once;
+ uint64_t val;
+
/*
* This function is first called between microcode being loaded, and
* CPUID being scanned generally. early_cpu_init() has already prepared
@@ -62,8 +64,6 @@ void tsx_init(void)
* On a TAA-vulnerable or later part with at least the May 2020
* microcode mitigating SRBDS.
*/
- uint64_t val;
-
rdmsrl(MSR_MCU_OPT_CTRL, val);
/*
@@ -118,8 +118,6 @@ void tsx_init(void)
if ( cpu_has_tsx_force_abort )
{
- uint64_t val;
-
/*
* On an early TSX-enabled Skylake part subject to the memory
* ordering erratum, with at least the March 2019 microcode.
@@ -250,18 +248,17 @@ void tsx_init(void)
* controlled, we have or will set MSR_MCU_OPT_CTRL.RTM_ALLOW to
* let TSX_CTRL.RTM_DISABLE be usable.
*/
- uint32_t hi, lo;
- rdmsr(MSR_TSX_CTRL, lo, hi);
+ val = rdmsr(MSR_TSX_CTRL);
/* Check bottom bit only. Higher bits are various sentinels. */
rtm_disabled = !(opt_tsx & 1);
- lo &= ~(TSX_CTRL_RTM_DISABLE | TSX_CTRL_CPUID_CLEAR);
+ val &= ~(TSX_CTRL_RTM_DISABLE | TSX_CTRL_CPUID_CLEAR);
if ( rtm_disabled )
- lo |= TSX_CTRL_RTM_DISABLE | TSX_CTRL_CPUID_CLEAR;
+ val |= TSX_CTRL_RTM_DISABLE | TSX_CTRL_CPUID_CLEAR;
- wrmsr(MSR_TSX_CTRL, lo, hi);
+ wrmsrns(MSR_TSX_CTRL, val);
}
else if ( cpu_has_tsx_force_abort )
{
@@ -269,14 +266,12 @@ void tsx_init(void)
* On an early TSX-enable Skylake part subject to the memory ordering
* erratum, with at least the March 2019 microcode.
*/
- uint32_t hi, lo;
-
- rdmsr(MSR_TSX_FORCE_ABORT, lo, hi);
+ val = rdmsr(MSR_TSX_FORCE_ABORT);
/* Check bottom bit only. Higher bits are various sentinels. */
rtm_disabled = !(opt_tsx & 1);
- lo &= ~(TSX_FORCE_ABORT_RTM | TSX_CPUID_CLEAR | TSX_ENABLE_RTM);
+ val &= ~(TSX_FORCE_ABORT_RTM | TSX_CPUID_CLEAR | TSX_ENABLE_RTM);
if ( cpu_has_rtm_always_abort )
{
@@ -291,7 +286,7 @@ void tsx_init(void)
* - TSX_FORCE_ABORT.ENABLE_RTM may be used to opt in to
* re-enabling RTM, at the users own risk.
*/
- lo |= rtm_disabled ? TSX_CPUID_CLEAR : TSX_ENABLE_RTM;
+ val |= rtm_disabled ? TSX_CPUID_CLEAR : TSX_ENABLE_RTM;
}
else
{
@@ -304,10 +299,10 @@ void tsx_init(void)
* setting TSX_FORCE_ABORT.FORCE_ABORT_RTM.
*/
if ( rtm_disabled )
- lo |= TSX_FORCE_ABORT_RTM;
+ val |= TSX_FORCE_ABORT_RTM;
}
- wrmsr(MSR_TSX_FORCE_ABORT, lo, hi);
+ wrmsrns(MSR_TSX_FORCE_ABORT, val);
}
else if ( opt_tsx >= 0 )
printk_once(XENLOG_WARNING
--
2.39.5
* Re: [PATCH v3 01/22] x86/msr: Change rdmsr() to have normal API
2025-10-03 22:53 ` [PATCH v3 01/22] x86/msr: Change rdmsr() to have normal API Andrew Cooper
@ 2025-10-07 15:47 ` Jan Beulich
0 siblings, 0 replies; 40+ messages in thread
From: Jan Beulich @ 2025-10-07 15:47 UTC (permalink / raw)
To: Andrew Cooper; +Cc: Roger Pau Monné, Xen-devel
On 04.10.2025 00:53, Andrew Cooper wrote:
> We want a consistent MSR API, and these want to be named rdmsr() and wrmsr(),
> but not with their current APIs. The current rdmsr() flavours, writing to
> their parameters by name, make for code that reads like invalid C and is
> unergonomic to use in lots of cases.
>
> Change the API, and update the callers all in one go. Where appropriate,
> update the write side to wrmsrns() as per the recommendation.
>
> No functional change.
>
> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
* [PATCH v3 02/22] x86/msr: Change wrmsr() to take a single parameter
2025-10-03 22:53 [PATCH v3 for-4.21 00/22] x86: FRED support Andrew Cooper
2025-10-03 22:53 ` [PATCH v3 01/22] x86/msr: Change rdmsr() to have normal API Andrew Cooper
@ 2025-10-03 22:53 ` Andrew Cooper
2025-10-04 0:11 ` Demi Marie Obenour
2025-10-03 22:53 ` [PATCH v3 03/22] x86/fsgsbase: Split out __{rd,wr}gs_shadow() helpers Andrew Cooper
` (20 subsequent siblings)
22 siblings, 1 reply; 40+ messages in thread
From: Andrew Cooper @ 2025-10-03 22:53 UTC (permalink / raw)
To: Xen-devel; +Cc: Andrew Cooper, Jan Beulich, Jan Beulich, Roger Pau Monné
Mirroring the cleanup to rdmsr(), do the same to wrmsr(). It now has the same
API as wrmsrl(), but we'll want to drop that wrapper in due course.
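As a sketch, with MSR_FOO as a placeholder:

    wrmsr(MSR_FOO, lo, hi); /* old: two 32-bit halves */
    wrmsr(MSR_FOO, val);    /* new: one 64-bit value */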
It's telling that almost all remaining users pass in 0. Most are converted
directly to WRMSRNS, but a few are not.
MSR_VIRT_SPEC_CTRL is unconditionally intercepted is orders of magnitude more
expensive than just serialising. In disable_lapic_nmi_watchdog(), the P4 case
won't run on hardware which has anything more than plain WRMSR.
For CTR_WRITE() in op_model_athlon.c there is a logical change in behaviour,
but it's fixing a bug. Performance counters typically get written to -(count)
as they generate an interrupt on overflow. The performance counters even in
the K8 were 48 bits wide, and this wrmsr() not being a wrmsrl() appears to
have been an oversight in commit b5103d692aa7 ("x86 oprofile: use
rdmsrl/wrmsrl") which converted all other users, and appears to be the last
time there was an attempt to unify the MSR APIs.
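As a worked example with a hypothetical count wider than 32 bits (values
illustrative, and assuming l is a 64-bit quantity):

    l = 0x100000001;
    old: wrmsr(msr, -(unsigned int)l, -1)  /* => 0xffffffffffffffff */
    new: wrmsr(msr, -l)                    /* => 0xfffffffeffffffff */

The old form pinned the high half at -1, losing the borrow out of the low
word for any count exceeding 32 bits.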
No practical change.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
---
CC: Jan Beulich <JBeulich@suse.com>
CC: Roger Pau Monné <roger.pau@citrix.com>
v3:
* Swap to wrmsrns() in setup_k7_watchdog()
* Reinstate correct bracketing in op_model_athlon.c's CTR_WRITE(), drop
useless do{}while().
---
xen/arch/x86/cpu/amd.c | 2 +-
xen/arch/x86/hvm/vmx/vmcs.c | 2 +-
xen/arch/x86/include/asm/msr.h | 20 ++++++++++----------
xen/arch/x86/nmi.c | 18 +++++++++---------
xen/arch/x86/oprofile/op_model_athlon.c | 2 +-
5 files changed, 22 insertions(+), 22 deletions(-)
diff --git a/xen/arch/x86/cpu/amd.c b/xen/arch/x86/cpu/amd.c
index 43481daa8e26..9b02e1ba675c 100644
--- a/xen/arch/x86/cpu/amd.c
+++ b/xen/arch/x86/cpu/amd.c
@@ -934,7 +934,7 @@ void amd_set_legacy_ssbd(bool enable)
return;
if (cpu_has_virt_ssbd)
- wrmsr(MSR_VIRT_SPEC_CTRL, enable ? SPEC_CTRL_SSBD : 0, 0);
+ wrmsr(MSR_VIRT_SPEC_CTRL, enable ? SPEC_CTRL_SSBD : 0);
else if (amd_legacy_ssbd)
core_set_legacy_ssbd(enable);
else
diff --git a/xen/arch/x86/hvm/vmx/vmcs.c b/xen/arch/x86/hvm/vmx/vmcs.c
index b639818b6ea6..cd5ac8a5f0e3 100644
--- a/xen/arch/x86/hvm/vmx/vmcs.c
+++ b/xen/arch/x86/hvm/vmx/vmcs.c
@@ -754,7 +754,7 @@ static int _vmx_cpu_up(bool bsp)
eax |= IA32_FEATURE_CONTROL_ENABLE_VMXON_OUTSIDE_SMX;
if ( test_bit(X86_FEATURE_SMX, &boot_cpu_data.x86_capability) )
eax |= IA32_FEATURE_CONTROL_ENABLE_VMXON_INSIDE_SMX;
- wrmsr(MSR_IA32_FEATURE_CONTROL, eax, 0);
+ wrmsrns(MSR_IA32_FEATURE_CONTROL, eax);
}
if ( (rc = vmx_init_vmcs_config(bsp)) != 0 )
diff --git a/xen/arch/x86/include/asm/msr.h b/xen/arch/x86/include/asm/msr.h
index 188a50f9cea4..941a7612f4ba 100644
--- a/xen/arch/x86/include/asm/msr.h
+++ b/xen/arch/x86/include/asm/msr.h
@@ -15,13 +15,17 @@
* uint64_t foo = rdmsr(MSR_BAR);
* wrmsrns(MSR_BAR, foo);
*
+ * and, if architectural serialisation is necessary, or there are other
+ * reasons that WRMSRNS is inapplicable, then:
+ *
+ * wrmsr(MSR_BAR, foo);
+ *
* In addition, *_safe() wrappers exist to cope gracefully with a #GP.
*
*
* All legacy forms are to be phased out:
*
* rdmsrl(MSR_FOO, val);
- * wrmsr(MSR_FOO, lo, hi);
* wrmsrl(MSR_FOO, val);
*/
@@ -43,17 +47,13 @@ static always_inline uint64_t rdmsr(unsigned int msr)
val = a__ | ((u64)b__<<32); \
} while(0)
-#define wrmsr(msr,val1,val2) \
- __asm__ __volatile__("wrmsr" \
- : /* no outputs */ \
- : "c" (msr), "a" (val1), "d" (val2))
-
-static inline void wrmsrl(unsigned int msr, uint64_t val)
+static inline void wrmsr(unsigned int msr, uint64_t val)
{
- uint32_t lo = val, hi = val >> 32;
+ uint32_t lo = val, hi = val >> 32;
- wrmsr(msr, lo, hi);
+ asm volatile ( "wrmsr" :: "a" (lo), "d" (hi), "c" (msr) );
}
+#define wrmsrl(msr, val) wrmsr(msr, val)
/* Non-serialising WRMSR, when available. Falls back to a serialising WRMSR. */
static inline void wrmsrns(uint32_t msr, uint64_t val)
@@ -150,7 +150,7 @@ static inline void wrmsr_tsc_aux(uint32_t val)
if ( *this_tsc_aux != val )
{
- wrmsr(MSR_TSC_AUX, val, 0);
+ wrmsrns(MSR_TSC_AUX, val);
*this_tsc_aux = val;
}
}
diff --git a/xen/arch/x86/nmi.c b/xen/arch/x86/nmi.c
index 9793fa23168d..a0c9194ff032 100644
--- a/xen/arch/x86/nmi.c
+++ b/xen/arch/x86/nmi.c
@@ -218,16 +218,16 @@ void disable_lapic_nmi_watchdog(void)
return;
switch (boot_cpu_data.x86_vendor) {
case X86_VENDOR_AMD:
- wrmsr(MSR_K7_EVNTSEL0, 0, 0);
+ wrmsrns(MSR_K7_EVNTSEL0, 0);
break;
case X86_VENDOR_INTEL:
switch (boot_cpu_data.x86) {
case 6:
- wrmsr(MSR_P6_EVNTSEL(0), 0, 0);
+ wrmsrns(MSR_P6_EVNTSEL(0), 0);
break;
case 15:
- wrmsr(MSR_P4_IQ_CCCR0, 0, 0);
- wrmsr(MSR_P4_CRU_ESCR0, 0, 0);
+ wrmsr(MSR_P4_IQ_CCCR0, 0);
+ wrmsr(MSR_P4_CRU_ESCR0, 0);
break;
}
break;
@@ -282,7 +282,7 @@ static void clear_msr_range(unsigned int base, unsigned int n)
unsigned int i;
for (i = 0; i < n; i++)
- wrmsr(base+i, 0, 0);
+ wrmsrns(base + i, 0);
}
static inline void write_watchdog_counter(const char *descr)
@@ -308,11 +308,11 @@ static void setup_k7_watchdog(void)
| K7_EVNTSEL_USR
| K7_NMI_EVENT;
- wrmsr(MSR_K7_EVNTSEL0, evntsel, 0);
+ wrmsrns(MSR_K7_EVNTSEL0, evntsel);
write_watchdog_counter("K7_PERFCTR0");
apic_write(APIC_LVTPC, APIC_DM_NMI);
evntsel |= K7_EVNTSEL_ENABLE;
- wrmsr(MSR_K7_EVNTSEL0, evntsel, 0);
+ wrmsrns(MSR_K7_EVNTSEL0, evntsel);
}
static void setup_p6_watchdog(unsigned counter)
@@ -338,11 +338,11 @@ static void setup_p6_watchdog(unsigned counter)
| P6_EVNTSEL_USR
| counter;
- wrmsr(MSR_P6_EVNTSEL(0), evntsel, 0);
+ wrmsrns(MSR_P6_EVNTSEL(0), evntsel);
write_watchdog_counter("P6_PERFCTR0");
apic_write(APIC_LVTPC, APIC_DM_NMI);
evntsel |= P6_EVNTSEL0_ENABLE;
- wrmsr(MSR_P6_EVNTSEL(0), evntsel, 0);
+ wrmsrns(MSR_P6_EVNTSEL(0), evntsel);
}
static void setup_p4_watchdog(void)
diff --git a/xen/arch/x86/oprofile/op_model_athlon.c b/xen/arch/x86/oprofile/op_model_athlon.c
index bf897a4b6328..4c016624a69b 100644
--- a/xen/arch/x86/oprofile/op_model_athlon.c
+++ b/xen/arch/x86/oprofile/op_model_athlon.c
@@ -34,7 +34,7 @@
#define MAX_COUNTERS FAM15H_NUM_COUNTERS
#define CTR_READ(msr_content,msrs,c) do {rdmsrl(msrs->counters[(c)].addr, (msr_content));} while (0)
-#define CTR_WRITE(l,msrs,c) do {wrmsr(msrs->counters[(c)].addr, -(unsigned int)(l), -1);} while (0)
+#define CTR_WRITE(l,msrs,c) wrmsr(msrs->counters[(c)].addr, -(l))
#define CTR_OVERFLOWED(n) (!((n) & (1ULL<<31)))
#define CTRL_READ(msr_content,msrs,c) do {rdmsrl(msrs->controls[(c)].addr, (msr_content));} while (0)
--
2.39.5
* Re: [PATCH v3 02/22] x86/msr: Change wrmsr() to take a single parameter
2025-10-03 22:53 ` [PATCH v3 02/22] x86/msr: Change wrmsr() to take a single parameter Andrew Cooper
@ 2025-10-04 0:11 ` Demi Marie Obenour
2025-10-04 0:14 ` Andrew Cooper
0 siblings, 1 reply; 40+ messages in thread
From: Demi Marie Obenour @ 2025-10-04 0:11 UTC (permalink / raw)
To: Andrew Cooper, Xen-devel; +Cc: Jan Beulich, Roger Pau Monné
On 10/3/25 18:53, Andrew Cooper wrote:
> Mirroring the cleanup to rdmsr(), do the same to wrmsr(). It now has the same
> API as wrmsrl(), but we'll want to drop that wrapper in due course.
>
> It's telling that almost all remaining users pass in 0. Most are converted
> directly to WRMSRNS, but a few are not.
>
> MSR_VIRT_SPEC_CTRL is unconditionally intercepted is orders of magnitude more
"is unconditionally intercepted is" is this a typo?
> expensive than just serialising. In disable_lapic_nmi_watchdog(), the P4 case
> won't run on hardware which has anything more than plain WRMSR.
>
> For CTR_WRITE() in op_model_athlon.c there is a logical change in behaviour,
> but it's fixing a bug. Performance counters typically get written to -(count)
> as they generate an interrupt on overflow. The performance counters even in
> the K8 were 48 bits wide, and this wrmsr() not being a wrmsrl() appears to
> have been an oversight in commit b5103d692aa7 ("x86 oprofile: use
> rdmsrl/wrmsrl") which converted all other users, and appears to be the last
> time there was an attempt to unify the MSR APIs.
>
> No practical change.
>
> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
> Acked-by: Jan Beulich <jbeulich@suse.com>
> ---
> CC: Jan Beulich <JBeulich@suse.com>
> CC: Roger Pau Monné <roger.pau@citrix.com>
>
> v3:
> * Swap to wrmsrns() in setup_k7_watchdog()
> * Reinstate correct bracketing in op_model_athlon.c's CTR_WRITE(), drop
> useless do{}while().
> ---
> xen/arch/x86/cpu/amd.c | 2 +-
> xen/arch/x86/hvm/vmx/vmcs.c | 2 +-
> xen/arch/x86/include/asm/msr.h | 20 ++++++++++----------
> xen/arch/x86/nmi.c | 18 +++++++++---------
> xen/arch/x86/oprofile/op_model_athlon.c | 2 +-
> 5 files changed, 22 insertions(+), 22 deletions(-)
>
> diff --git a/xen/arch/x86/cpu/amd.c b/xen/arch/x86/cpu/amd.c
> index 43481daa8e26..9b02e1ba675c 100644
> --- a/xen/arch/x86/cpu/amd.c
> +++ b/xen/arch/x86/cpu/amd.c
> @@ -934,7 +934,7 @@ void amd_set_legacy_ssbd(bool enable)
> return;
>
> if (cpu_has_virt_ssbd)
> - wrmsr(MSR_VIRT_SPEC_CTRL, enable ? SPEC_CTRL_SSBD : 0, 0);
> + wrmsr(MSR_VIRT_SPEC_CTRL, enable ? SPEC_CTRL_SSBD : 0);
> else if (amd_legacy_ssbd)
> core_set_legacy_ssbd(enable);
> else
> diff --git a/xen/arch/x86/hvm/vmx/vmcs.c b/xen/arch/x86/hvm/vmx/vmcs.c
> index b639818b6ea6..cd5ac8a5f0e3 100644
> --- a/xen/arch/x86/hvm/vmx/vmcs.c
> +++ b/xen/arch/x86/hvm/vmx/vmcs.c
> @@ -754,7 +754,7 @@ static int _vmx_cpu_up(bool bsp)
> eax |= IA32_FEATURE_CONTROL_ENABLE_VMXON_OUTSIDE_SMX;
> if ( test_bit(X86_FEATURE_SMX, &boot_cpu_data.x86_capability) )
> eax |= IA32_FEATURE_CONTROL_ENABLE_VMXON_INSIDE_SMX;
> - wrmsr(MSR_IA32_FEATURE_CONTROL, eax, 0);
> + wrmsrns(MSR_IA32_FEATURE_CONTROL, eax);
> }
>
> if ( (rc = vmx_init_vmcs_config(bsp)) != 0 )
> diff --git a/xen/arch/x86/include/asm/msr.h b/xen/arch/x86/include/asm/msr.h
> index 188a50f9cea4..941a7612f4ba 100644
> --- a/xen/arch/x86/include/asm/msr.h
> +++ b/xen/arch/x86/include/asm/msr.h
> @@ -15,13 +15,17 @@
> * uint64_t foo = rdmsr(MSR_BAR);
> * wrmsrns(MSR_BAR, foo);
> *
> + * and, if architectural serialisation is necessary, or there are other
> + * reasons that WRMSRNS is inapplicable, then:
> + *
> + * wrmsr(MSR_BAR, foo);
> + *
> * In addition, *_safe() wrappers exist to cope gracefully with a #GP.
> *
> *
> * All legacy forms are to be phased out:
> *
> * rdmsrl(MSR_FOO, val);
> - * wrmsr(MSR_FOO, lo, hi);
> * wrmsrl(MSR_FOO, val);
> */
>
> @@ -43,17 +47,13 @@ static always_inline uint64_t rdmsr(unsigned int msr)
> val = a__ | ((u64)b__<<32); \
> } while(0)
>
> -#define wrmsr(msr,val1,val2) \
> - __asm__ __volatile__("wrmsr" \
> - : /* no outputs */ \
> - : "c" (msr), "a" (val1), "d" (val2))
> -
> -static inline void wrmsrl(unsigned int msr, uint64_t val)
> +static inline void wrmsr(unsigned int msr, uint64_t val)
> {
> - uint32_t lo = val, hi = val >> 32;
> + uint32_t lo = val, hi = val >> 32;
>
> - wrmsr(msr, lo, hi);
> + asm volatile ( "wrmsr" :: "a" (lo), "d" (hi), "c" (msr) );
> }
> +#define wrmsrl(msr, val) wrmsr(msr, val)
>
> /* Non-serialising WRMSR, when available. Falls back to a serialising WRMSR. */
> static inline void wrmsrns(uint32_t msr, uint64_t val)
> @@ -150,7 +150,7 @@ static inline void wrmsr_tsc_aux(uint32_t val)
>
> if ( *this_tsc_aux != val )
> {
> - wrmsr(MSR_TSC_AUX, val, 0);
> + wrmsrns(MSR_TSC_AUX, val);
> *this_tsc_aux = val;
> }
> }
> diff --git a/xen/arch/x86/nmi.c b/xen/arch/x86/nmi.c
> index 9793fa23168d..a0c9194ff032 100644
> --- a/xen/arch/x86/nmi.c
> +++ b/xen/arch/x86/nmi.c
> @@ -218,16 +218,16 @@ void disable_lapic_nmi_watchdog(void)
> return;
> switch (boot_cpu_data.x86_vendor) {
> case X86_VENDOR_AMD:
> - wrmsr(MSR_K7_EVNTSEL0, 0, 0);
> + wrmsrns(MSR_K7_EVNTSEL0, 0);
> break;
> case X86_VENDOR_INTEL:
> switch (boot_cpu_data.x86) {
> case 6:
> - wrmsr(MSR_P6_EVNTSEL(0), 0, 0);
> + wrmsrns(MSR_P6_EVNTSEL(0), 0);
> break;
> case 15:
> - wrmsr(MSR_P4_IQ_CCCR0, 0, 0);
> - wrmsr(MSR_P4_CRU_ESCR0, 0, 0);
> + wrmsr(MSR_P4_IQ_CCCR0, 0);
> + wrmsr(MSR_P4_CRU_ESCR0, 0);
> break;
> }
> break;
> @@ -282,7 +282,7 @@ static void clear_msr_range(unsigned int base, unsigned int n)
> unsigned int i;
>
> for (i = 0; i < n; i++)
> - wrmsr(base+i, 0, 0);
> + wrmsrns(base + i, 0);
> }
>
> static inline void write_watchdog_counter(const char *descr)
> @@ -308,11 +308,11 @@ static void setup_k7_watchdog(void)
> | K7_EVNTSEL_USR
> | K7_NMI_EVENT;
>
> - wrmsr(MSR_K7_EVNTSEL0, evntsel, 0);
> + wrmsrns(MSR_K7_EVNTSEL0, evntsel);
> write_watchdog_counter("K7_PERFCTR0");
> apic_write(APIC_LVTPC, APIC_DM_NMI);
> evntsel |= K7_EVNTSEL_ENABLE;
> - wrmsr(MSR_K7_EVNTSEL0, evntsel, 0);
> + wrmsrns(MSR_K7_EVNTSEL0, evntsel);
> }
>
> static void setup_p6_watchdog(unsigned counter)
> @@ -338,11 +338,11 @@ static void setup_p6_watchdog(unsigned counter)
> | P6_EVNTSEL_USR
> | counter;
>
> - wrmsr(MSR_P6_EVNTSEL(0), evntsel, 0);
> + wrmsrns(MSR_P6_EVNTSEL(0), evntsel);
> write_watchdog_counter("P6_PERFCTR0");
> apic_write(APIC_LVTPC, APIC_DM_NMI);
> evntsel |= P6_EVNTSEL0_ENABLE;
> - wrmsr(MSR_P6_EVNTSEL(0), evntsel, 0);
> + wrmsrns(MSR_P6_EVNTSEL(0), evntsel);
> }
>
> static void setup_p4_watchdog(void)
> diff --git a/xen/arch/x86/oprofile/op_model_athlon.c b/xen/arch/x86/oprofile/op_model_athlon.c
> index bf897a4b6328..4c016624a69b 100644
> --- a/xen/arch/x86/oprofile/op_model_athlon.c
> +++ b/xen/arch/x86/oprofile/op_model_athlon.c
> @@ -34,7 +34,7 @@
> #define MAX_COUNTERS FAM15H_NUM_COUNTERS
>
> #define CTR_READ(msr_content,msrs,c) do {rdmsrl(msrs->counters[(c)].addr, (msr_content));} while (0)
> -#define CTR_WRITE(l,msrs,c) do {wrmsr(msrs->counters[(c)].addr, -(unsigned int)(l), -1);} while (0)
> +#define CTR_WRITE(l,msrs,c) wrmsr(msrs->counters[(c)].addr, -(l))
> #define CTR_OVERFLOWED(n) (!((n) & (1ULL<<31)))
>
> #define CTRL_READ(msr_content,msrs,c) do {rdmsrl(msrs->controls[(c)].addr, (msr_content));} while (0)
--
Sincerely,
Demi Marie Obenour (she/her/hers)
* Re: [PATCH v3 02/22] x86/msr: Change wrmsr() to take a single parameter
2025-10-04 0:11 ` Demi Marie Obenour
@ 2025-10-04 0:14 ` Andrew Cooper
0 siblings, 0 replies; 40+ messages in thread
From: Andrew Cooper @ 2025-10-04 0:14 UTC (permalink / raw)
To: Demi Marie Obenour, Andrew Cooper, Xen-devel
Cc: Jan Beulich, Roger Pau Monné
On 04/10/2025 1:11 am, Demi Marie Obenour wrote:
> On 10/3/25 18:53, Andrew Cooper wrote:
>> Mirroring the cleanup to rdmsr(), do the same to wrmsr(). It now has the same
>> API as wrmsrl(), but we'll want to drop that wrapper in due course.
>>
>> It's telling that almost all remaining users pass in 0. Most are converted
>> directly to WRMSRNS, but a few are not.
>>
>> MSR_VIRT_SPEC_CTRL is unconditionally intercepted is orders of magnitude more
> "is unconditionally intercepted is" is this a typo?
There should be an 'and' in there. Fixed locally.
~Andrew
>> expensive than just serialising. In disable_lapic_nmi_watchdog(), the P4 case
>> won't run on hardware which has anything more than plain WRMSR.
>>
* [PATCH v3 03/22] x86/fsgsbase: Split out __{rd,wr}gs_shadow() helpers
2025-10-03 22:53 [PATCH v3 for-4.21 00/22] x86: FRED support Andrew Cooper
2025-10-03 22:53 ` [PATCH v3 01/22] x86/msr: Change rdmsr() to have normal API Andrew Cooper
2025-10-03 22:53 ` [PATCH v3 02/22] x86/msr: Change wrmsr() to take a single parameter Andrew Cooper
@ 2025-10-03 22:53 ` Andrew Cooper
2025-10-07 15:49 ` Jan Beulich
2025-10-03 22:53 ` [PATCH v3 04/22] x86/fsgsbase: Update fs/gs helpers to use wrmsrns() Andrew Cooper
` (19 subsequent siblings)
22 siblings, 1 reply; 40+ messages in thread
From: Andrew Cooper @ 2025-10-03 22:53 UTC (permalink / raw)
To: Xen-devel; +Cc: Andrew Cooper, Jan Beulich, Roger Pau Monné
Right now they're inline in {read,write}_gs_shadow(), but we're going to need
to use these elsewhere to support FRED.
No functional change.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
CC: Jan Beulich <JBeulich@suse.com>
CC: Roger Pau Monné <roger.pau@citrix.com>
v3:
* Rename to __{rd,wr}gs_shadow()
---
xen/arch/x86/include/asm/fsgsbase.h | 36 ++++++++++++++++++-----------
1 file changed, 23 insertions(+), 13 deletions(-)
diff --git a/xen/arch/x86/include/asm/fsgsbase.h b/xen/arch/x86/include/asm/fsgsbase.h
index 03e6a85d31ce..557703842691 100644
--- a/xen/arch/x86/include/asm/fsgsbase.h
+++ b/xen/arch/x86/include/asm/fsgsbase.h
@@ -32,6 +32,17 @@ static inline unsigned long __rdgsbase(void)
return base;
}
+static inline unsigned long __rdgs_shadow(void)
+{
+ unsigned long base;
+
+ asm_inline volatile ( "swapgs\n\t"
+ "rdgsbase %0\n\t"
+ "swapgs" : "=r" (base) );
+
+ return base;
+}
+
static inline void __wrfsbase(unsigned long base)
{
asm volatile ( "wrfsbase %0" :: "r" (base) );
@@ -42,6 +53,14 @@ static inline void __wrgsbase(unsigned long base)
asm volatile ( "wrgsbase %0" :: "r" (base) );
}
+static inline void __wrgs_shadow(unsigned long base)
+{
+ asm_inline volatile ( "swapgs\n\t"
+ "wrgsbase %0\n\t"
+ "swapgs"
+ :: "r" (base) );
+}
+
static inline unsigned long read_fs_base(void)
{
unsigned long base;
@@ -71,13 +90,9 @@ static inline unsigned long read_gs_shadow(void)
unsigned long base;
if ( read_cr4() & X86_CR4_FSGSBASE )
- {
- asm volatile ( "swapgs" );
- base = __rdgsbase();
- asm volatile ( "swapgs" );
- }
- else
- rdmsrl(MSR_SHADOW_GS_BASE, base);
+ return __rdgs_shadow();
+
+ rdmsrl(MSR_SHADOW_GS_BASE, base);
return base;
}
@@ -101,12 +116,7 @@ static inline void write_gs_base(unsigned long base)
static inline void write_gs_shadow(unsigned long base)
{
if ( read_cr4() & X86_CR4_FSGSBASE )
- {
- asm volatile ( "swapgs\n\t"
- "wrgsbase %0\n\t"
- "swapgs"
- :: "r" (base) );
- }
+ __wrgs_shadow(base);
else
wrmsrl(MSR_SHADOW_GS_BASE, base);
}
--
2.39.5
* [PATCH v3 04/22] x86/fsgsbase: Update fs/gs helpers to use wrmsrns()
2025-10-03 22:53 [PATCH v3 for-4.21 00/22] x86: FRED support Andrew Cooper
` (2 preceding siblings ...)
2025-10-03 22:53 ` [PATCH v3 03/22] x86/fsgsbase: Split out __{rd,wr}gs_shadow() helpers Andrew Cooper
@ 2025-10-03 22:53 ` Andrew Cooper
2025-10-07 15:53 ` Jan Beulich
2025-10-03 22:53 ` [PATCH v3 05/22] x86/fsgsbase: Improve code generation in read_registers() Andrew Cooper
` (18 subsequent siblings)
22 siblings, 1 reply; 40+ messages in thread
From: Andrew Cooper @ 2025-10-03 22:53 UTC (permalink / raw)
To: Xen-devel; +Cc: Andrew Cooper, Jan Beulich, Roger Pau Monné
... and rdmsr() while here.
Most of these accesses are in fastpaths and do not need serialising behaviour,
but the write side is serialising on all Intel hardware as well as older AMD
hardware.
No functional change.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
CC: Jan Beulich <JBeulich@suse.com>
CC: Roger Pau Monné <roger.pau@citrix.com>
---
xen/arch/x86/domain.c | 10 +++++-----
xen/arch/x86/hvm/vmx/vmx.c | 4 ++--
xen/arch/x86/include/asm/fsgsbase.h | 30 +++++++++--------------------
3 files changed, 16 insertions(+), 28 deletions(-)
diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
index 19fd86ce88d2..8089ff929bf7 100644
--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -1851,9 +1851,9 @@ static void load_segments(struct vcpu *n)
}
else
{
- wrmsrl(MSR_FS_BASE, n->arch.pv.fs_base);
- wrmsrl(MSR_GS_BASE, gsb);
- wrmsrl(MSR_SHADOW_GS_BASE, gss);
+ wrmsrns(MSR_FS_BASE, n->arch.pv.fs_base);
+ wrmsrns(MSR_GS_BASE, gsb);
+ wrmsrns(MSR_SHADOW_GS_BASE, gss);
}
}
@@ -1978,8 +1978,8 @@ static void save_segments(struct vcpu *v)
}
else
{
- rdmsrl(MSR_FS_BASE, fs_base);
- rdmsrl(MSR_GS_BASE, gs_base);
+ fs_base = rdmsr(MSR_FS_BASE);
+ gs_base = rdmsr(MSR_GS_BASE);
}
v->arch.pv.fs_base = fs_base;
diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c
index e2b5077654ef..01bc67460aae 100644
--- a/xen/arch/x86/hvm/vmx/vmx.c
+++ b/xen/arch/x86/hvm/vmx/vmx.c
@@ -2737,8 +2737,8 @@ static uint64_t cf_check vmx_get_reg(struct vcpu *v, unsigned int reg)
case MSR_SHADOW_GS_BASE:
if ( v != curr )
return v->arch.hvm.vmx.shadow_gs;
- rdmsrl(MSR_SHADOW_GS_BASE, val);
- return val;
+ else
+ return rdmsr(MSR_SHADOW_GS_BASE);
}
/* Logic which maybe requires remote VMCS acquisition. */
diff --git a/xen/arch/x86/include/asm/fsgsbase.h b/xen/arch/x86/include/asm/fsgsbase.h
index 557703842691..24862a6bfea7 100644
--- a/xen/arch/x86/include/asm/fsgsbase.h
+++ b/xen/arch/x86/include/asm/fsgsbase.h
@@ -63,38 +63,26 @@ static inline void __wrgs_shadow(unsigned long base)
static inline unsigned long read_fs_base(void)
{
- unsigned long base;
-
if ( read_cr4() & X86_CR4_FSGSBASE )
return __rdfsbase();
-
- rdmsrl(MSR_FS_BASE, base);
-
- return base;
+ else
+ return rdmsr(MSR_FS_BASE);
}
static inline unsigned long read_gs_base(void)
{
- unsigned long base;
-
if ( read_cr4() & X86_CR4_FSGSBASE )
return __rdgsbase();
-
- rdmsrl(MSR_GS_BASE, base);
-
- return base;
+ else
+ return rdmsr(MSR_GS_BASE);
}
static inline unsigned long read_gs_shadow(void)
{
- unsigned long base;
-
if ( read_cr4() & X86_CR4_FSGSBASE )
return __rdgs_shadow();
-
- rdmsrl(MSR_SHADOW_GS_BASE, base);
-
- return base;
+ else
+ return rdmsr(MSR_SHADOW_GS_BASE);
}
static inline void write_fs_base(unsigned long base)
@@ -102,7 +90,7 @@ static inline void write_fs_base(unsigned long base)
if ( read_cr4() & X86_CR4_FSGSBASE )
__wrfsbase(base);
else
- wrmsrl(MSR_FS_BASE, base);
+ wrmsrns(MSR_FS_BASE, base);
}
static inline void write_gs_base(unsigned long base)
@@ -110,7 +98,7 @@ static inline void write_gs_base(unsigned long base)
if ( read_cr4() & X86_CR4_FSGSBASE )
__wrgsbase(base);
else
- wrmsrl(MSR_GS_BASE, base);
+ wrmsrns(MSR_GS_BASE, base);
}
static inline void write_gs_shadow(unsigned long base)
@@ -118,7 +106,7 @@ static inline void write_gs_shadow(unsigned long base)
if ( read_cr4() & X86_CR4_FSGSBASE )
__wrgs_shadow(base);
else
- wrmsrl(MSR_SHADOW_GS_BASE, base);
+ wrmsrns(MSR_SHADOW_GS_BASE, base);
}
#endif /* X86_FSGSBASE_H */
--
2.39.5
* Re: [PATCH v3 04/22] x86/fsgsbase: Update fs/gs helpers to use wrmsrns()
2025-10-03 22:53 ` [PATCH v3 04/22] x86/fsgsbase: Update fs/gs helpers to use wrmsrns() Andrew Cooper
@ 2025-10-07 15:53 ` Jan Beulich
0 siblings, 0 replies; 40+ messages in thread
From: Jan Beulich @ 2025-10-07 15:53 UTC (permalink / raw)
To: Andrew Cooper; +Cc: Roger Pau Monné, Xen-devel
On 04.10.2025 00:53, Andrew Cooper wrote:
> ... and rdmsr() while here.
>
> Most of these accesses are in fastpaths and do not need serialising behaviour,
> but the write side is serialising on all Intel hardware as well as older AMD
> hardware.
>
> No functional change.
>
> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
in the interest of not blocking the tidying over ...
> --- a/xen/arch/x86/hvm/vmx/vmx.c
> +++ b/xen/arch/x86/hvm/vmx/vmx.c
> @@ -2737,8 +2737,8 @@ static uint64_t cf_check vmx_get_reg(struct vcpu *v, unsigned int reg)
> case MSR_SHADOW_GS_BASE:
> if ( v != curr )
> return v->arch.hvm.vmx.shadow_gs;
> - rdmsrl(MSR_SHADOW_GS_BASE, val);
> - return val;
> + else
> + return rdmsr(MSR_SHADOW_GS_BASE);
... the addition of useless (and confusing, and possibly being Misra violations)
"else" here and elsewhere. If you were to drop them, feel free to "upgrade" to
R-b.
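
I.e. a sketch of the shape being asked for:

    if ( read_cr4() & X86_CR4_FSGSBASE )
        return __rdfsbase();

    return rdmsr(MSR_FS_BASE);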
Jan
* [PATCH v3 05/22] x86/fsgsbase: Improve code generation in read_registers()
2025-10-03 22:53 [PATCH v3 for-4.21 00/22] x86: FRED support Andrew Cooper
` (3 preceding siblings ...)
2025-10-03 22:53 ` [PATCH v3 04/22] x86/fsgsbase: Update fs/gs helpers to use wrmsrns() Andrew Cooper
@ 2025-10-03 22:53 ` Andrew Cooper
2025-10-04 0:13 ` Demi Marie Obenour
2025-10-07 15:54 ` Jan Beulich
2025-10-03 22:53 ` [PATCH v3 06/22] x86/boot: Use RSTORSSP to establish SSP Andrew Cooper
` (17 subsequent siblings)
22 siblings, 2 replies; 40+ messages in thread
From: Andrew Cooper @ 2025-10-03 22:53 UTC (permalink / raw)
To: Xen-devel; +Cc: Andrew Cooper, Jan Beulich, Roger Pau Monné
It turns out that using the higher level helpers adjacent to each other like
this leads to terrible code generation. Due to -fno-strict-aliasing, the
store into state-> invalidates the read_cr4() address calculation (which is
really cpu_info->cr4 under the hood), meaning that it can't be hoisted.
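As a sketch of the hazard (the helpers read cr4, i.e. cpu_info->cr4,
internally):

    state->fsb = read_fs_base();   /* reads cr4, then stores via 'state' */
    state->gsb = read_gs_base();   /* 'state' may alias cpu_info, so cr4 */
    state->gss = read_gs_shadow(); /* must be re-derived and re-read */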
As a result we get "locate the top of stack block, get cr4, and see if
FSGSBASE is set" repeated 3 times, and an unreasoanble number of basic blocks.
Hoist the calculation manually, which results in two basic blocks.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
CC: Jan Beulich <JBeulich@suse.com>
CC: Roger Pau Monné <roger.pau@citrix.com>
Side-by-side disassembly: https://termbin.com/9xfq
---
xen/arch/x86/traps.c | 15 ++++++++++++---
1 file changed, 12 insertions(+), 3 deletions(-)
diff --git a/xen/arch/x86/traps.c b/xen/arch/x86/traps.c
index 0c5393cb2166..545c42a10862 100644
--- a/xen/arch/x86/traps.c
+++ b/xen/arch/x86/traps.c
@@ -118,9 +118,18 @@ static void read_registers(struct extra_state *state)
state->cr3 = read_cr3();
state->cr4 = read_cr4();
- state->fsb = read_fs_base();
- state->gsb = read_gs_base();
- state->gss = read_gs_shadow();
+ if ( state->cr4 & X86_CR4_FSGSBASE )
+ {
+ state->fsb = __rdfsbase();
+ state->gsb = __rdgsbase();
+ state->gss = __rdgs_shadow();
+ }
+ else
+ {
+ state->fsb = rdmsr(MSR_FS_BASE);
+ state->gsb = rdmsr(MSR_GS_BASE);
+ state->gss = rdmsr(MSR_SHADOW_GS_BASE);
+ }
asm ( "mov %%ds, %0" : "=m" (state->ds) );
asm ( "mov %%es, %0" : "=m" (state->es) );
--
2.39.5
* Re: [PATCH v3 05/22] x86/fsgsbase: Improve code generation in read_registers()
2025-10-03 22:53 ` [PATCH v3 05/22] x86/fsgsbase: Improve code generation in read_registers() Andrew Cooper
@ 2025-10-04 0:13 ` Demi Marie Obenour
2025-10-07 15:54 ` Jan Beulich
1 sibling, 0 replies; 40+ messages in thread
From: Demi Marie Obenour @ 2025-10-04 0:13 UTC (permalink / raw)
To: Andrew Cooper, Xen-devel; +Cc: Jan Beulich, Roger Pau Monné
On 10/3/25 18:53, Andrew Cooper wrote:
> It turns out that using the higher level helpers adjacent to each other like
> this leads to terrible code generation. Due to -fno-strict-aliasing, the
> store into state-> invalidates the read_cr4() address calculation (which is
> really cpu_info->cr4 under the hood), meaning that it can't be hoisted.
It would be nice if compilers could warn when they
could not optimize because of something like this.
--
Sincerely,
Demi Marie Obenour (she/her/hers)
* Re: [PATCH v3 05/22] x86/fsgsbase: Improve code generation in read_registers()
2025-10-03 22:53 ` [PATCH v3 05/22] x86/fsgsbase: Improve code generation in read_registers() Andrew Cooper
2025-10-04 0:13 ` Demi Marie Obenour
@ 2025-10-07 15:54 ` Jan Beulich
1 sibling, 0 replies; 40+ messages in thread
From: Jan Beulich @ 2025-10-07 15:54 UTC (permalink / raw)
To: Andrew Cooper; +Cc: Roger Pau Monné, Xen-devel
On 04.10.2025 00:53, Andrew Cooper wrote:
> It turns out that using the higher level helpers adjacent to each other like
> this leads to terrible code generation. Due to -fno-strict-aliasing, the
> store into state-> invalidates the read_cr4() address calculation (which is
> really cpu_info->cr4 under the hood), meaning that it can't be hoisted.
>
> As a result we get "locate the top of stack block, get cr4, and see if
> FSGSBASE is set" repeated 3 times, and an unreasonable number of basic blocks.
>
> Hoist the calculation manually, which results in two basic blocks.
>
> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
albeit preferably with ...
> --- a/xen/arch/x86/traps.c
> +++ b/xen/arch/x86/traps.c
> @@ -118,9 +118,18 @@ static void read_registers(struct extra_state *state)
> state->cr3 = read_cr3();
> state->cr4 = read_cr4();
>
> - state->fsb = read_fs_base();
> - state->gsb = read_gs_base();
> - state->gss = read_gs_shadow();
> + if ( state->cr4 & X86_CR4_FSGSBASE )
> + {
> + state->fsb = __rdfsbase();
> + state->gsb = __rdgsbase();
> + state->gss = __rdgs_shadow();
> + }
> + else
> + {
> + state->fsb = rdmsr(MSR_FS_BASE);
> + state->gsb = rdmsr(MSR_GS_BASE);
> + state->gss = rdmsr(MSR_SHADOW_GS_BASE);
> + }
... a brief comment added towards the deliberate open-coding.
Jan
* [PATCH v3 06/22] x86/boot: Use RSTORSSP to establish SSP
2025-10-03 22:53 [PATCH v3 for-4.21 00/22] x86: FRED support Andrew Cooper
` (4 preceding siblings ...)
2025-10-03 22:53 ` [PATCH v3 05/22] x86/fsgsbase: Improve code generation in read_registers() Andrew Cooper
@ 2025-10-03 22:53 ` Andrew Cooper
2025-10-07 15:57 ` Jan Beulich
2025-10-03 22:53 ` [PATCH v3 07/22] x86/traps: Alter switch_stack_and_jump() for FRED mode Andrew Cooper
` (16 subsequent siblings)
22 siblings, 1 reply; 40+ messages in thread
From: Andrew Cooper @ 2025-10-03 22:53 UTC (permalink / raw)
To: Xen-devel; +Cc: Andrew Cooper, Jan Beulich, Roger Pau Monné
Under FRED, SETSSBSY is disallowed, and we want to be setting up FRED prior to
setting up shadow stacks. As we still need Supervisor Tokens in IDT mode, we
need mode-specific logic to establish SSP.
In FRED mode, write a Restore Token, RSTORSSP it, and discard the resulting
Previous-SSP token.
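As a sketch of the sequence (assuming the token occupies the last word of the
shstk, as in the code below; 64BIT is bit 0 of a Restore Token):

    token = shstk_base + PAGE_SIZE - 8   /* last word of the shstk */
    wrss:      *token = (token + 8) | 1  /* Restore Token, 64BIT set */
    rstorssp:  SSP = token               /* *token becomes Previous-SSP */
    incsspd 2: SSP += 8                  /* discard the Previous-SSP Token */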
No change outside of FRED mode.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
CC: Jan Beulich <JBeulich@suse.com>
CC: Roger Pau Monné <roger.pau@citrix.com>
v3:
* Adjust the comment in reinit_bsp_stack().
v2:
* Some logic moved into prior patch.
---
xen/arch/x86/boot/x86_64.S | 23 +++++++++++++++++++++--
xen/arch/x86/setup.c | 25 ++++++++++++++++++++++++-
2 files changed, 45 insertions(+), 3 deletions(-)
diff --git a/xen/arch/x86/boot/x86_64.S b/xen/arch/x86/boot/x86_64.S
index 11a7e9d3bd23..9705d03f849c 100644
--- a/xen/arch/x86/boot/x86_64.S
+++ b/xen/arch/x86/boot/x86_64.S
@@ -53,17 +53,21 @@ ENTRY(__high_start)
mov %rcx, STACK_CPUINFO_FIELD(cr4)(%r15)
mov %rcx, %cr4
- /* WARNING! call/ret now fatal (iff SHSTK) until SETSSBSY loads SSP */
+ /* WARNING! CALL/RET now fatal (iff SHSTK) until SETSSBSY/RSTORSSP loads SSP */
#if defined(CONFIG_XEN_SHSTK)
test $CET_SHSTK_EN, %al
jz .L_ap_cet_done
- /* Derive the supervisor token address from %rsp. */
+ /* Derive the token address from %rsp. */
mov %rsp, %rdx
and $~(STACK_SIZE - 1), %rdx
or $(PRIMARY_SHSTK_SLOT + 1) * PAGE_SIZE - 8, %rdx
+ /* Establishing SSP differs between IDT or FRED mode. */
+ bt $32 /* ilog2(X86_CR4_FRED) */, %rcx
+ jc .L_fred_shstk
+
/*
* Write a new Supervisor Token. It doesn't matter the first time a
* CPU boots, but for S3 resume or hotplug this clears the busy bit so
@@ -71,6 +75,21 @@ ENTRY(__high_start)
*/
wrssq %rdx, (%rdx)
setssbsy
+ jmp .L_ap_cet_done
+
+.L_fred_shstk:
+
+ /*
+ * Write a Restore Token, value: &token + 8 + 64BIT (bit 0) at the
+ * base of the shstk (which isn't in use yet).
+ */
+ lea 9(%rdx), %rdi
+ wrssq %rdi, (%rdx)
+ rstorssp (%rdx)
+
+ /* Discard the Previous-SSP Token from the shstk. */
+ mov $2, %edx
+ incsspd %edx
#endif /* CONFIG_XEN_SHSTK */
.L_ap_cet_done:
diff --git a/xen/arch/x86/setup.c b/xen/arch/x86/setup.c
index 872a8c63f94a..44da5efa1d20 100644
--- a/xen/arch/x86/setup.c
+++ b/xen/arch/x86/setup.c
@@ -49,6 +49,7 @@
#include <asm/prot-key.h>
#include <asm/pv/domain.h>
#include <asm/setup.h>
+#include <asm/shstk.h>
#include <asm/smp.h>
#include <asm/spec_ctrl.h>
#include <asm/stubs.h>
@@ -908,7 +909,29 @@ static void __init noreturn reinit_bsp_stack(void)
if ( cpu_has_xen_shstk )
{
wrmsrl(MSR_S_CET, xen_msr_s_cet_value());
- asm volatile ("setssbsy" ::: "memory");
+
+ /*
+ * IDT and FRED differ by a Supervisor Token on the shadow stack.
+ *
+ * In IDT mode, we use SETSSBSY (itself using MSR_PL0_SSP, configured
+ * previously) to mark the Supervisor Token as Busy. In FRED mode,
+ * there is no token, so we need to create a temporary Restore Token
+ * to establish SSP.
+ */
+ if ( opt_fred )
+ {
+ unsigned long *token =
+ (void *)stack + (PRIMARY_SHSTK_SLOT + 1) * PAGE_SIZE - 8;
+
+ wrss((unsigned long)token + 9, token);
+ asm volatile ( "rstorssp %0" : "+m" (*token) );
+ /*
+ * We need to discard the resulting Previous-SSP Token, but
+ * reset_stack_and_jump() will do that for us.
+ */
+ }
+ else
+ asm volatile ( "setssbsy" ::: "memory" );
}
reset_stack_and_jump(init_done);
--
2.39.5
* Re: [PATCH v3 06/22] x86/boot: Use RSTORSSP to establish SSP
2025-10-03 22:53 ` [PATCH v3 06/22] x86/boot: Use RSTORSSP to establish SSP Andrew Cooper
@ 2025-10-07 15:57 ` Jan Beulich
0 siblings, 0 replies; 40+ messages in thread
From: Jan Beulich @ 2025-10-07 15:57 UTC (permalink / raw)
To: Andrew Cooper; +Cc: Roger Pau Monné, Xen-devel
On 04.10.2025 00:53, Andrew Cooper wrote:
> Under FRED, SETSSBSY is disallowed, and we want to be setting up FRED prior to
> setting up shadow stacks. As we still need Supervisor Tokens in IDT mode, we
> need mode-specific logic to establish SSP.
>
> In FRED mode, write a Restore Token, RSTORSSP it, and discard the resulting
> Previous-SSP token.
>
> No change outside of FRED mode.
>
> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
* [PATCH v3 07/22] x86/traps: Alter switch_stack_and_jump() for FRED mode
2025-10-03 22:53 [PATCH v3 for-4.21 00/22] x86: FRED support Andrew Cooper
` (5 preceding siblings ...)
2025-10-03 22:53 ` [PATCH v3 06/22] x86/boot: Use RSTORSSP to establish SSP Andrew Cooper
@ 2025-10-03 22:53 ` Andrew Cooper
2025-10-07 15:58 ` Jan Beulich
2025-10-03 22:53 ` [PATCH v3 08/22] x86/traps: Skip Supervisor Shadow Stack tokens in " Andrew Cooper
` (15 subsequent siblings)
22 siblings, 1 reply; 40+ messages in thread
From: Andrew Cooper @ 2025-10-03 22:53 UTC (permalink / raw)
To: Xen-devel; +Cc: Andrew Cooper, Jan Beulich, Jan Beulich, Roger Pau Monné
FRED and IDT differ by a Supervisor Token on the base of the shstk. This
means that switch_stack_and_jump() needs to discard one extra word when FRED
is active.
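As a sketch of the geometry ('top' being the 4k-aligned end of the primary
shstk slot):

    idt:  target = top - 8;  /* leave the Supervisor Token in place */
    fred: target = top;      /* no token, so discard that word too */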
Fix a typo in the parameter name, which should be shstk_base.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
---
CC: Jan Beulich <JBeulich@suse.com>
CC: Roger Pau Monné <roger.pau@citrix.com>
Leave as $%c. Otherwise it doesn't assemble correctly presented with $$24568
to parse as an instruction immediate.
v3:
* Fix a typo in the parameter name.
v2:
* Use X86_FEATURE_XEN_FRED
---
xen/arch/x86/include/asm/current.h | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)
diff --git a/xen/arch/x86/include/asm/current.h b/xen/arch/x86/include/asm/current.h
index c1eb27b1c4c2..62817e8476ec 100644
--- a/xen/arch/x86/include/asm/current.h
+++ b/xen/arch/x86/include/asm/current.h
@@ -154,7 +154,9 @@ unsigned long get_stack_dump_bottom (unsigned long sp);
"rdsspd %[ssp];" \
"cmp $1, %[ssp];" \
"je .L_shstk_done.%=;" /* CET not active? Skip. */ \
- "mov $%c[skstk_base], %[val];" \
+ ALTERNATIVE("mov $%c[shstk_base], %[val];", \
+ "mov $%c[shstk_base] + 8, %[val];", \
+ X86_FEATURE_XEN_FRED) \
"and $%c[stack_mask], %[ssp];" \
"sub %[ssp], %[val];" \
"shr $3, %[val];" \
@@ -188,7 +190,7 @@ unsigned long get_stack_dump_bottom (unsigned long sp);
[ssp] "=&r" (tmp) \
: [stk] "r" (guest_cpu_user_regs()), \
[fun] constr (fn), \
- [skstk_base] "i" \
+ [shstk_base] "i" \
((PRIMARY_SHSTK_SLOT + 1) * PAGE_SIZE - 8), \
[stack_mask] "i" (STACK_SIZE - 1), \
_ASM_BUGFRAME_INFO(BUGFRAME_bug, __LINE__, \
--
2.39.5
* Re: [PATCH v3 07/22] x86/traps: Alter switch_stack_and_jump() for FRED mode
2025-10-03 22:53 ` [PATCH v3 07/22] x86/traps: Alter switch_stack_and_jump() for FRED mode Andrew Cooper
@ 2025-10-07 15:58 ` Jan Beulich
0 siblings, 0 replies; 40+ messages in thread
From: Jan Beulich @ 2025-10-07 15:58 UTC (permalink / raw)
To: Andrew Cooper; +Cc: Roger Pau Monné, Xen-devel
On 04.10.2025 00:53, Andrew Cooper wrote:
> FRED and IDT differ by a Supervisor Token on the base of the shstk. This
> means that switch_stack_and_jump() needs to discard one extra word when FRED
> is active.
>
> Fix a typo in the parameter name, which should be shstk_base.
>
> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
> Reviewed-by: Jan Beulich <jbeulich@suse.com>
> ---
> CC: Jan Beulich <JBeulich@suse.com>
> CC: Roger Pau Monné <roger.pau@citrix.com>
>
> Leave as $%c. Otherwise it doesn't assemble correctly presented with $$24568
> to parse as an instruction immediate.
I don't follow. Where would the 2nd $ come from if you write ...
> --- a/xen/arch/x86/include/asm/current.h
> +++ b/xen/arch/x86/include/asm/current.h
> @@ -154,7 +154,9 @@ unsigned long get_stack_dump_bottom (unsigned long sp);
> "rdsspd %[ssp];" \
> "cmp $1, %[ssp];" \
> "je .L_shstk_done.%=;" /* CET not active? Skip. */ \
> - "mov $%c[skstk_base], %[val];" \
> + ALTERNATIVE("mov $%c[shstk_base], %[val];", \
> + "mov $%c[shstk_base] + 8, %[val];", \
> + X86_FEATURE_XEN_FRED) \
ALTERNATIVE("mov %[shstk_base], %[val];", \
"mov %[shstk_base] + 8, %[val];", \
X86_FEATURE_XEN_FRED) \
Jan
* [PATCH v3 08/22] x86/traps: Skip Supervisor Shadow Stack tokens in FRED mode
2025-10-03 22:53 [PATCH v3 for-4.21 00/22] x86: FRED support Andrew Cooper
` (6 preceding siblings ...)
2025-10-03 22:53 ` [PATCH v3 07/22] x86/traps: Alter switch_stack_and_jump() for FRED mode Andrew Cooper
@ 2025-10-03 22:53 ` Andrew Cooper
2025-10-03 22:53 ` [PATCH v3 09/22] x86/traps: Make an IDT-specific #DB helper Andrew Cooper
` (14 subsequent siblings)
22 siblings, 0 replies; 40+ messages in thread
From: Andrew Cooper @ 2025-10-03 22:53 UTC (permalink / raw)
To: Xen-devel; +Cc: Andrew Cooper, Jan Beulich, Jan Beulich, Roger Pau Monné
FRED doesn't use Supervisor Shadow Stack tokens. Skip setting them up.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
---
CC: Jan Beulich <JBeulich@suse.com>
CC: Roger Pau Monné <roger.pau@citrix.com>
v3:
* Move further still
v2:
* New
---
xen/arch/x86/mm.c | 12 +++++++++---
xen/arch/x86/setup.c | 8 ++++----
2 files changed, 13 insertions(+), 7 deletions(-)
diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index b929d15d0050..043e6aa9d73a 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -129,6 +129,7 @@
#include <asm/shadow.h>
#include <asm/shared.h>
#include <asm/trampoline.h>
+#include <asm/traps.h>
#include <asm/x86_emulate.h>
#include <public/memory.h>
@@ -6441,8 +6442,13 @@ static void write_sss_token(unsigned long *ptr)
void memguard_guard_stack(void *p)
{
- /* IST Shadow stacks. 4x 1k in stack page 0. */
- if ( IS_ENABLED(CONFIG_XEN_SHSTK) )
+ /*
+ * IST Shadow stacks. 4x 1k in stack page 0.
+ *
+ * With IDT delivery, we need Supervisor Shadow Stack tokens at the base
+ * of each stack. With FRED delivery, these no longer exist.
+ */
+ if ( IS_ENABLED(CONFIG_XEN_SHSTK) && !opt_fred )
{
write_sss_token(p + (IST_MCE * IST_SHSTK_SIZE) - 8);
write_sss_token(p + (IST_NMI * IST_SHSTK_SIZE) - 8);
@@ -6453,7 +6459,7 @@ void memguard_guard_stack(void *p)
/* Primary Shadow Stack. 1x 4k in stack page 5. */
p += PRIMARY_SHSTK_SLOT * PAGE_SIZE;
- if ( IS_ENABLED(CONFIG_XEN_SHSTK) )
+ if ( IS_ENABLED(CONFIG_XEN_SHSTK) && !opt_fred )
write_sss_token(p + PAGE_SIZE - 8);
map_pages_to_xen((unsigned long)p, virt_to_mfn(p), 1, PAGE_HYPERVISOR_SHSTK);
diff --git a/xen/arch/x86/setup.c b/xen/arch/x86/setup.c
index 44da5efa1d20..160a9611f456 100644
--- a/xen/arch/x86/setup.c
+++ b/xen/arch/x86/setup.c
@@ -1923,10 +1923,6 @@ void asmlinkage __init noreturn __start_xen(void)
system_state = SYS_STATE_boot;
- bsp_stack = cpu_alloc_stack(0);
- if ( !bsp_stack )
- panic("No memory for BSP stack\n");
-
console_init_ring();
vesa_init();
@@ -2111,6 +2107,10 @@ void asmlinkage __init noreturn __start_xen(void)
console_init_postirq();
+ bsp_stack = cpu_alloc_stack(0); /* Needs to know IDT vs FRED */
+ if ( !bsp_stack )
+ panic("No memory for BSP stack\n");
+
system_state = SYS_STATE_smp_boot;
do_presmp_initcalls();
--
2.39.5
* [PATCH v3 09/22] x86/traps: Make an IDT-specific #DB helper
2025-10-03 22:53 [PATCH v3 for-4.21 00/22] x86: FRED support Andrew Cooper
` (7 preceding siblings ...)
2025-10-03 22:53 ` [PATCH v3 08/22] x86/traps: Skip Supervisor Shadow Stack tokens in " Andrew Cooper
@ 2025-10-03 22:53 ` Andrew Cooper
2025-10-03 22:53 ` [PATCH v3 10/22] x86/traps: Make an IDT-specific #PF helper Andrew Cooper
` (13 subsequent siblings)
22 siblings, 0 replies; 40+ messages in thread
From: Andrew Cooper @ 2025-10-03 22:53 UTC (permalink / raw)
To: Xen-devel; +Cc: Andrew Cooper, Jan Beulich, Jan Beulich, Roger Pau Monné
FRED provides PENDING_DBG in the stack frame, avoiding the need to read
%dr6 manually.
Rename do_debug() to handle_DB(), and update it to take a dbg field using
positive polarity.
Introduce a new handle_DB_IDT() which reads %dr6.
No functional change.
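For reference (a sketch; the real conversion is in handle_DB_IDT() below),
%dr6 has a mix of default-set and default-clear bits, so a single XOR against
X86_DR6_DEFAULT yields set == event pending, matching FRED's PENDING_DBG
format:

    unsigned long dbg = read_debugreg(6) ^ X86_DR6_DEFAULT;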
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
---
CC: Jan Beulich <JBeulich@suse.com>
CC: Roger Pau Monné <roger.pau@citrix.com>
v2:
* New
---
xen/arch/x86/traps.c | 28 +++++++++++++++++-----------
xen/arch/x86/x86_64/entry.S | 2 +-
2 files changed, 18 insertions(+), 12 deletions(-)
diff --git a/xen/arch/x86/traps.c b/xen/arch/x86/traps.c
index 545c42a10862..3fd0f5709a52 100644
--- a/xen/arch/x86/traps.c
+++ b/xen/arch/x86/traps.c
@@ -1992,14 +1992,11 @@ void asmlinkage do_device_not_available(struct cpu_user_regs *regs)
void nocall sysenter_eflags_saved(void);
-void asmlinkage do_debug(struct cpu_user_regs *regs)
+/* Handle #DB. @dbg is PENDING_DBG, a.k.a. %dr6 with positive polarity. */
+static void handle_DB(struct cpu_user_regs *regs, unsigned long dbg)
{
- unsigned long dr6;
struct vcpu *v = current;
- /* Stash dr6 as early as possible. */
- dr6 = read_debugreg(6);
-
/*
* At the time of writing (March 2018), on the subject of %dr6:
*
@@ -2067,13 +2064,13 @@ void asmlinkage do_debug(struct cpu_user_regs *regs)
* If however we do, safety measures need to be enacted. Use a big
* hammer and clear all debug settings.
*/
- if ( dr6 & (DR_TRAP3 | DR_TRAP2 | DR_TRAP1 | DR_TRAP0) )
+ if ( dbg & (DR_TRAP3 | DR_TRAP2 | DR_TRAP1 | DR_TRAP0) )
{
unsigned int bp, dr7 = read_debugreg(7);
for ( bp = 0; bp < 4; ++bp )
{
- if ( (dr6 & (1u << bp)) && /* Breakpoint triggered? */
+ if ( (dbg & (1u << bp)) && /* Breakpoint triggered? */
(dr7 & (3u << (bp * DR_ENABLE_SIZE))) && /* Enabled? */
((dr7 & (3u << ((bp * DR_CONTROL_SIZE) + /* Insn? */
DR_CONTROL_SHIFT))) == DR_RW_EXECUTE) )
@@ -2094,9 +2091,9 @@ void asmlinkage do_debug(struct cpu_user_regs *regs)
* so ensure the message is ratelimited.
*/
gprintk(XENLOG_WARNING,
- "Hit #DB in Xen context: %04x:%p [%ps], stk %04x:%p, dr6 %lx\n",
+ "Hit #DB in Xen context: %04x:%p [%ps], stk %04x:%p, dbg %lx\n",
regs->cs, _p(regs->rip), _p(regs->rip),
- regs->ss, _p(regs->rsp), dr6);
+ regs->ss, _p(regs->rsp), dbg);
return;
}
@@ -2108,7 +2105,7 @@ void asmlinkage do_debug(struct cpu_user_regs *regs)
* by debugging actions completed behind its back.
*/
v->arch.dr6 = x86_merge_dr6(v->domain->arch.cpu_policy,
- v->arch.dr6, dr6 ^ X86_DR6_DEFAULT);
+ v->arch.dr6, dbg);
if ( guest_kernel_mode(v, regs) && v->domain->debugger_attached )
{
@@ -2116,7 +2113,16 @@ void asmlinkage do_debug(struct cpu_user_regs *regs)
return;
}
- pv_inject_DB(dr6 ^ X86_DR6_DEFAULT);
+ pv_inject_DB(dbg);
+}
+
+/*
+ * When using IDT delivery, it is our responsibility to read %dr6. Convert it
+ * to positive polarity.
+ */
+void asmlinkage handle_DB_IDT(struct cpu_user_regs *regs)
+{
+ handle_DB(regs, read_debugreg(6) ^ X86_DR6_DEFAULT);
}
void asmlinkage do_entry_CP(struct cpu_user_regs *regs)
diff --git a/xen/arch/x86/x86_64/entry.S b/xen/arch/x86/x86_64/entry.S
index 39c7b9d17f9e..789687488c5f 100644
--- a/xen/arch/x86/x86_64/entry.S
+++ b/xen/arch/x86/x86_64/entry.S
@@ -1171,7 +1171,7 @@ FUNC(handle_ist_exception)
.L_ ## vec ## _done:
DISPATCH(X86_EXC_NMI, do_nmi)
- DISPATCH(X86_EXC_DB, do_debug)
+ DISPATCH(X86_EXC_DB, handle_DB_IDT)
DISPATCH(X86_EXC_MC, do_machine_check)
#undef DISPATCH
--
2.39.5
^ permalink raw reply related [flat|nested] 40+ messages in thread
* [PATCH v3 10/22] x86/traps: Make an IDT-specific #PF helper
2025-10-03 22:53 [PATCH v3 for-4.21 00/22] x86: FRED support Andrew Cooper
` (8 preceding siblings ...)
2025-10-03 22:53 ` [PATCH v3 09/22] x86/traps: Make an IDT-specific #DB helper Andrew Cooper
@ 2025-10-03 22:53 ` Andrew Cooper
2025-10-03 22:53 ` [PATCH v3 11/22] x86/fsgsbase: Make gskern accesses safe under FRED Andrew Cooper
` (12 subsequent siblings)
22 siblings, 0 replies; 40+ messages in thread
From: Andrew Cooper @ 2025-10-03 22:53 UTC (permalink / raw)
To: Xen-devel; +Cc: Andrew Cooper, Jan Beulich, Roger Pau Monné
FRED provides %cr2 in the stack frame, avoiding the need to read %cr2
manually.
Rename do_page_fault() to handle_PF(), and update it to take cr2, still named
addr for consistency.
Introduce a new handle_PF_IDT() which reads %cr2 and conditionally re-enables
interrupts.
No functional change.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
---
CC: Jan Beulich <JBeulich@suse.com>
CC: Roger Pau Monné <roger.pau@citrix.com>
v2:
* New
---
xen/arch/x86/traps.c | 26 ++++++++++++++------------
xen/arch/x86/x86_64/entry.S | 2 +-
2 files changed, 15 insertions(+), 13 deletions(-)
diff --git a/xen/arch/x86/traps.c b/xen/arch/x86/traps.c
index 3fd0f5709a52..d42973660db0 100644
--- a/xen/arch/x86/traps.c
+++ b/xen/arch/x86/traps.c
@@ -1670,21 +1670,10 @@ static int fixup_page_fault(unsigned long addr, struct cpu_user_regs *regs)
return 0;
}
-void asmlinkage do_page_fault(struct cpu_user_regs *regs)
+static void handle_PF(struct cpu_user_regs *regs, unsigned long addr /* cr2 */)
{
- unsigned long addr;
unsigned int error_code;
- addr = read_cr2();
-
- /*
- * Don't re-enable interrupts if we were running an IRQ-off region when
- * we hit the page fault, or we'll break that code.
- */
- ASSERT(!local_irq_is_enabled());
- if ( regs->flags & X86_EFLAGS_IF )
- local_irq_enable();
-
/* fixup_page_fault() might change regs->error_code, so cache it here. */
error_code = regs->error_code;
@@ -1745,6 +1734,19 @@ void asmlinkage do_page_fault(struct cpu_user_regs *regs)
pv_inject_page_fault(regs->error_code, addr);
}
+/*
+ * When using IDT delivery, it is our responsibility to read %cr2.
+ */
+void asmlinkage handle_PF_IDT(struct cpu_user_regs *regs)
+{
+ unsigned long addr = read_cr2();
+
+ if ( regs->flags & X86_EFLAGS_IF )
+ local_irq_enable();
+
+ handle_PF(regs, addr);
+}
+
/*
* Early #PF handler to print CR2, error code, and stack.
*
diff --git a/xen/arch/x86/x86_64/entry.S b/xen/arch/x86/x86_64/entry.S
index 789687488c5f..c02245ac064c 100644
--- a/xen/arch/x86/x86_64/entry.S
+++ b/xen/arch/x86/x86_64/entry.S
@@ -871,7 +871,7 @@ handle_exception_saved:
* reading %cr2. Otherwise a page fault in the nested interrupt handler
* would corrupt %cr2.
*/
- DISPATCH(X86_EXC_PF, do_page_fault)
+ DISPATCH(X86_EXC_PF, handle_PF_IDT)
/* Only re-enable IRQs if they were active before taking the fault */
testb $X86_EFLAGS_IF >> 8, UREGS_eflags + 1(%rsp)
--
2.39.5
^ permalink raw reply related [flat|nested] 40+ messages in thread
* [PATCH v3 11/22] x86/fsgsbase: Make gskern accesses safe under FRED
2025-10-03 22:53 [PATCH v3 for-4.21 00/22] x86: FRED support Andrew Cooper
` (9 preceding siblings ...)
2025-10-03 22:53 ` [PATCH v3 10/22] x86/traps: Make an IDT-specific #PF helper Andrew Cooper
@ 2025-10-03 22:53 ` Andrew Cooper
2025-10-03 22:53 ` [PATCH v3 12/22] x86/traps: Introduce FRED entrypoints Andrew Cooper
` (11 subsequent siblings)
22 siblings, 0 replies; 40+ messages in thread
From: Andrew Cooper @ 2025-10-03 22:53 UTC (permalink / raw)
To: Xen-devel; +Cc: Andrew Cooper, Jan Beulich, Roger Pau Monné
Under FRED, the SWAPGS instruction is disallowed. Therefore we must use the
MSR path instead.
read_registers() is in the show_registers() path, so this allows Xen to render
its current state without suffering #UD (and recursing until the stack guard
page is hit).
All hardware with FRED is expected to have some kind of non-serialising access
to these registers.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
---
CC: Jan Beulich <JBeulich@suse.com>
CC: Roger Pau Monné <roger.pau@citrix.com>
v2:
* Broken out of subsequent patch. Rebased over MSR cleanup.
---
xen/arch/x86/include/asm/fsgsbase.h | 8 ++++++--
xen/arch/x86/traps.c | 2 +-
2 files changed, 7 insertions(+), 3 deletions(-)
diff --git a/xen/arch/x86/include/asm/fsgsbase.h b/xen/arch/x86/include/asm/fsgsbase.h
index 24862a6bfea7..5faa3a324332 100644
--- a/xen/arch/x86/include/asm/fsgsbase.h
+++ b/xen/arch/x86/include/asm/fsgsbase.h
@@ -79,7 +79,9 @@ static inline unsigned long read_gs_base(void)
static inline unsigned long read_gs_shadow(void)
{
- if ( read_cr4() & X86_CR4_FSGSBASE )
+ unsigned long cr4 = read_cr4();
+
+ if ( !(cr4 & X86_CR4_FRED) && (cr4 & X86_CR4_FSGSBASE) )
return __rdgs_shadow();
else
return rdmsr(MSR_SHADOW_GS_BASE);
@@ -103,7 +105,9 @@ static inline void write_gs_base(unsigned long base)
static inline void write_gs_shadow(unsigned long base)
{
- if ( read_cr4() & X86_CR4_FSGSBASE )
+ unsigned long cr4 = read_cr4();
+
+ if ( !(cr4 & X86_CR4_FRED) && (cr4 & X86_CR4_FSGSBASE) )
__wrgs_shadow(base);
else
wrmsrns(MSR_SHADOW_GS_BASE, base);
diff --git a/xen/arch/x86/traps.c b/xen/arch/x86/traps.c
index d42973660db0..2e3efe45edf4 100644
--- a/xen/arch/x86/traps.c
+++ b/xen/arch/x86/traps.c
@@ -118,7 +118,7 @@ static void read_registers(struct extra_state *state)
state->cr3 = read_cr3();
state->cr4 = read_cr4();
- if ( state->cr4 & X86_CR4_FSGSBASE )
+ if ( !(state->cr4 & X86_CR4_FRED) && (state->cr4 & X86_CR4_FSGSBASE) )
{
state->fsb = __rdfsbase();
state->gsb = __rdgsbase();
--
2.39.5
^ permalink raw reply related [flat|nested] 40+ messages in thread
* [PATCH v3 12/22] x86/traps: Introduce FRED entrypoints
2025-10-03 22:53 [PATCH v3 for-4.21 00/22] x86: FRED support Andrew Cooper
` (10 preceding siblings ...)
2025-10-03 22:53 ` [PATCH v3 11/22] x86/fsgsbase: Make gskern accesses safe under FRED Andrew Cooper
@ 2025-10-03 22:53 ` Andrew Cooper
2025-10-08 8:50 ` Jan Beulich
2025-10-16 14:54 ` Jan Beulich
2025-10-03 22:53 ` [PATCH v3 13/22] x86/traps: Enable FRED when requested Andrew Cooper
` (10 subsequent siblings)
22 siblings, 2 replies; 40+ messages in thread
From: Andrew Cooper @ 2025-10-03 22:53 UTC (permalink / raw)
To: Xen-devel; +Cc: Andrew Cooper, Jan Beulich, Roger Pau Monné
Under FRED, there's one entrypoint from Ring 3, and one from Ring 0.
FRED gives us a good stack (even for SYSCALL/SYSENTER), and a unified event
frame on the stack, meaning that all software needs to do is spill the GPRs
with a line of PUSHes. Introduce PUSH_AND_CLEAR_GPRS and POP_GPRS for this
purpose.
Introduce entry_FRED_R0(), which to a first approximation is complete for all
event handling within Xen.
entry_FRED_R0() needs deriving from entry_FRED_R3(), so introduce a basic
handler. There is more work required to make the return-to-guest path work
under FRED.
Also introduce entry_from_{xen,pv}() to be the C level handlers. By simply
copying regs->fred_ss.vector into regs->entry_vector, we can reuse all the
existing fault handlers.
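In other words (sketch; the real hunks are below), each C entrypoint begins
with a single assignment:

    regs->entry_vector = regs->fred_ss.vector;  /* mimic IDT delivery */

after which do_trap(), do_general_protection() and friends work unmodified.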
Extend fatal_trap() to render the event type, including by name, when FRED is
active. This is slightly complicated, because X86_ET_OTHER must not use
vector_name(), or SYSCALL and SYSENTER get rendered as #BP and #DB.
This is sufficient to handle all interrupts and exceptions encountered during
development, including plenty of Double Faults.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
CC: Jan Beulich <JBeulich@suse.com>
CC: Roger Pau Monné <roger.pau@citrix.com>
v3:
* Adjust commit message to remove stale details
* Adjust formatting in fatal_trap()
* Group CP with others. It's probably wrong for perf, but that's out the
window anyway now that we're letting a compiler make the decision tree.
v2:
* Don't render a vector name for X86_ET_SW_INT
* Fix typos in names[]
* Link entry-fred.o first
SIMICS hasn't been updated to the FRED v9 spec, and still wants ENDBR instructions
at the entrypoints.
---
xen/arch/x86/include/asm/asm_defns.h | 65 ++++++++++++
xen/arch/x86/traps.c | 152 +++++++++++++++++++++++++++
xen/arch/x86/x86_64/Makefile | 1 +
xen/arch/x86/x86_64/entry-fred.S | 33 ++++++
4 files changed, 251 insertions(+)
create mode 100644 xen/arch/x86/x86_64/entry-fred.S
diff --git a/xen/arch/x86/include/asm/asm_defns.h b/xen/arch/x86/include/asm/asm_defns.h
index 72a0082d319d..a81a4043d0f1 100644
--- a/xen/arch/x86/include/asm/asm_defns.h
+++ b/xen/arch/x86/include/asm/asm_defns.h
@@ -315,6 +315,71 @@ static always_inline void stac(void)
subq $-(UREGS_error_code-UREGS_r15+\adj), %rsp
.endm
+/*
+ * Push and clear GPRs
+ */
+.macro PUSH_AND_CLEAR_GPRS
+ push %rdi
+ xor %edi, %edi
+ push %rsi
+ xor %esi, %esi
+ push %rdx
+ xor %edx, %edx
+ push %rcx
+ xor %ecx, %ecx
+ push %rax
+ xor %eax, %eax
+ push %r8
+ xor %r8d, %r8d
+ push %r9
+ xor %r9d, %r9d
+ push %r10
+ xor %r10d, %r10d
+ push %r11
+ xor %r11d, %r11d
+ push %rbx
+ xor %ebx, %ebx
+ push %rbp
+#ifdef CONFIG_FRAME_POINTER
+/* Indicate special exception stack frame by inverting the frame pointer. */
+ mov %rsp, %rbp
+ notq %rbp
+#else
+ xor %ebp, %ebp
+#endif
+ push %r12
+ xor %r12d, %r12d
+ push %r13
+ xor %r13d, %r13d
+ push %r14
+ xor %r14d, %r14d
+ push %r15
+ xor %r15d, %r15d
+.endm
+
+/*
+ * POP GPRs from a UREGS_* frame on the stack. Does not modify flags.
+ *
+ * @rax: Alternative destination for the %rax value on the stack.
+ */
+.macro POP_GPRS rax=%rax
+ pop %r15
+ pop %r14
+ pop %r13
+ pop %r12
+ pop %rbp
+ pop %rbx
+ pop %r11
+ pop %r10
+ pop %r9
+ pop %r8
+ pop \rax
+ pop %rcx
+ pop %rdx
+ pop %rsi
+ pop %rdi
+.endm
+
#ifdef CONFIG_PV32
#define CR4_PV32_RESTORE \
ALTERNATIVE_2 "", \
diff --git a/xen/arch/x86/traps.c b/xen/arch/x86/traps.c
index 2e3efe45edf4..0027f096a6c3 100644
--- a/xen/arch/x86/traps.c
+++ b/xen/arch/x86/traps.c
@@ -89,6 +89,13 @@ const unsigned int nmi_cpu;
#define stack_words_per_line 4
#define ESP_BEFORE_EXCEPTION(regs) ((unsigned long *)(regs)->rsp)
+/* Only valid to use when FRED is active. */
+static inline struct fred_info *cpu_regs_fred_info(struct cpu_user_regs *regs)
+{
+ ASSERT(read_cr4() & X86_CR4_FRED);
+ return &container_of(regs, struct cpu_info, guest_cpu_user_regs)->_fred;
+}
+
struct extra_state
{
unsigned long cr0, cr2, cr3, cr4;
@@ -1023,6 +1030,32 @@ void show_execution_state_nmi(const cpumask_t *mask, bool show_all)
printk("Non-responding CPUs: {%*pbl}\n", CPUMASK_PR(&show_state_mask));
}
+static const char *x86_et_name(unsigned int type)
+{
+ static const char *const names[] = {
+ [X86_ET_EXT_INTR] = "EXT_INTR",
+ [X86_ET_NMI] = "NMI",
+ [X86_ET_HW_EXC] = "HW_EXC",
+ [X86_ET_SW_INT] = "SW_INT",
+ [X86_ET_PRIV_SW_EXC] = "PRIV_SW_EXC",
+ [X86_ET_SW_EXC] = "SW_EXC",
+ [X86_ET_OTHER] = "OTHER",
+ };
+
+ return (type < ARRAY_SIZE(names) && names[type]) ? names[type] : "???";
+}
+
+static const char *x86_et_other_name(unsigned int what)
+{
+ static const char *const names[] = {
+ [0] = "MTF",
+ [1] = "SYSCALL",
+ [2] = "SYSENTER",
+ };
+
+ return (what < ARRAY_SIZE(names) && names[what]) ? names[what] : "???";
+}
+
const char *vector_name(unsigned int vec)
{
static const char names[][4] = {
@@ -1101,6 +1134,38 @@ void fatal_trap(const struct cpu_user_regs *regs, bool show_remote)
}
}
+ if ( read_cr4() & X86_CR4_FRED )
+ {
+ bool render_ec = false;
+ const char *vec_name = NULL;
+
+ switch ( regs->fred_ss.type )
+ {
+ case X86_ET_HW_EXC:
+ case X86_ET_PRIV_SW_EXC:
+ case X86_ET_SW_EXC:
+ render_ec = true;
+ vec_name = vector_name(regs->fred_ss.vector);
+ break;
+
+ case X86_ET_OTHER:
+ vec_name = x86_et_other_name(regs->fred_ss.vector);
+ break;
+ }
+
+ if ( render_ec )
+ panic("FATAL TRAP: type %u, %s, vec %u, %s[%04x]%s\n",
+ regs->fred_ss.type, x86_et_name(regs->fred_ss.type),
+ regs->fred_ss.vector, vec_name ?: "",
+ regs->error_code,
+ (regs->eflags & X86_EFLAGS_IF) ? "" : " IN INTERRUPT CONTEXT");
+ else
+ panic("FATAL TRAP: type %u, %s, vec %u, %s%s\n",
+ regs->fred_ss.type, x86_et_name(regs->fred_ss.type),
+ regs->fred_ss.vector, vec_name ?: "",
+ (regs->eflags & X86_EFLAGS_IF) ? "" : " IN INTERRUPT CONTEXT");
+ }
+
panic("FATAL TRAP: vec %u, %s[%04x]%s\n",
trapnr, vector_name(trapnr), regs->error_code,
(regs->eflags & X86_EFLAGS_IF) ? "" : " IN INTERRUPT CONTEXT");
@@ -2199,6 +2264,93 @@ void asmlinkage check_ist_exit(const struct cpu_user_regs *regs, bool ist_exit)
}
#endif
+void asmlinkage entry_from_pv(struct cpu_user_regs *regs)
+{
+ /* Copy fred_ss.vector into entry_vector as IDT delivery would have done. */
+ regs->entry_vector = regs->fred_ss.vector;
+
+ fatal_trap(regs, false);
+}
+
+void asmlinkage entry_from_xen(struct cpu_user_regs *regs)
+{
+ struct fred_info *fi = cpu_regs_fred_info(regs);
+ uint8_t type = regs->fred_ss.type;
+
+ /* Copy fred_ss.vector into entry_vector as IDT delivery would have done. */
+ regs->entry_vector = regs->fred_ss.vector;
+
+ /*
+ * First, handle the asynchronous or fatal events. These are either
+ * unrelated to the interrupted context, or may not have valid context
+ * recorded, and all have special rules on how/whether to re-enable IRQs.
+ */
+ switch ( type )
+ {
+ case X86_ET_EXT_INTR:
+ return do_IRQ(regs);
+
+ case X86_ET_NMI:
+ return do_nmi(regs);
+
+ case X86_ET_HW_EXC:
+ switch ( regs->fred_ss.vector )
+ {
+ case X86_EXC_DF: return do_double_fault(regs);
+ case X86_EXC_MC: return do_machine_check(regs);
+ }
+ break;
+ }
+
+ /*
+ * With the asynchronous events handled, what remains are the synchronous
+ * ones. If we interrupted an IRQs-on region, we should re-enable IRQs
+ * now; for #PF and #DB, %cr2 and %dr6 are on the stack in edata.
+ */
+ if ( regs->eflags & X86_EFLAGS_IF )
+ local_irq_enable();
+
+ switch ( type )
+ {
+ case X86_ET_HW_EXC:
+ case X86_ET_PRIV_SW_EXC:
+ case X86_ET_SW_EXC:
+ switch ( regs->fred_ss.vector )
+ {
+ case X86_EXC_PF: handle_PF(regs, fi->edata); break;
+ case X86_EXC_GP: do_general_protection(regs); break;
+ case X86_EXC_UD: do_invalid_op(regs); break;
+ case X86_EXC_NM: do_device_not_available(regs); break;
+ case X86_EXC_BP: do_int3(regs); break;
+ case X86_EXC_DB: handle_DB(regs, fi->edata); break;
+ case X86_EXC_CP: do_entry_CP(regs); break;
+
+ case X86_EXC_DE:
+ case X86_EXC_OF:
+ case X86_EXC_BR:
+ case X86_EXC_NP:
+ case X86_EXC_SS:
+ case X86_EXC_MF:
+ case X86_EXC_AC:
+ case X86_EXC_XM:
+ do_trap(regs);
+ break;
+
+ default:
+ goto fatal;
+ }
+ break;
+
+ default:
+ goto fatal;
+ }
+
+ return;
+
+ fatal:
+ fatal_trap(regs, false);
+}
+
/*
* Local variables:
* mode: C
diff --git a/xen/arch/x86/x86_64/Makefile b/xen/arch/x86/x86_64/Makefile
index f20763088740..c0a0b6603221 100644
--- a/xen/arch/x86/x86_64/Makefile
+++ b/xen/arch/x86/x86_64/Makefile
@@ -1,5 +1,6 @@
obj-$(CONFIG_PV32) += compat/
+obj-bin-y += entry-fred.o
obj-bin-y += entry.o
obj-$(CONFIG_KEXEC) += machine_kexec.o
obj-y += pci.o
diff --git a/xen/arch/x86/x86_64/entry-fred.S b/xen/arch/x86/x86_64/entry-fred.S
new file mode 100644
index 000000000000..3c3320df22cb
--- /dev/null
+++ b/xen/arch/x86/x86_64/entry-fred.S
@@ -0,0 +1,33 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+
+ .file "x86_64/entry-fred.S"
+
+#include <asm/asm_defns.h>
+#include <asm/page.h>
+
+ .section .text.entry, "ax", @progbits
+
+ /* The Ring3 entry point is required to be 4k aligned. */
+
+FUNC(entry_FRED_R3, 4096)
+ PUSH_AND_CLEAR_GPRS
+
+ mov %rsp, %rdi
+ call entry_from_pv
+
+ POP_GPRS
+ eretu
+END(entry_FRED_R3)
+
+ /* The Ring0 entrypoint is at Ring3 + 0x100. */
+ .org entry_FRED_R3 + 0x100, 0xcc
+
+FUNC_LOCAL(entry_FRED_R0, 0)
+ PUSH_AND_CLEAR_GPRS
+
+ mov %rsp, %rdi
+ call entry_from_xen
+
+ POP_GPRS
+ erets
+END(entry_FRED_R0)
--
2.39.5
^ permalink raw reply related [flat|nested] 40+ messages in thread
* Re: [PATCH v3 12/22] x86/traps: Introduce FRED entrypoints
2025-10-03 22:53 ` [PATCH v3 12/22] x86/traps: Introduce FRED entrypoints Andrew Cooper
@ 2025-10-08 8:50 ` Jan Beulich
2025-10-16 14:54 ` Jan Beulich
1 sibling, 0 replies; 40+ messages in thread
From: Jan Beulich @ 2025-10-08 8:50 UTC (permalink / raw)
To: Andrew Cooper; +Cc: Roger Pau Monné, Xen-devel
On 04.10.2025 00:53, Andrew Cooper wrote:
> Under FRED, there's one entrypoint from Ring 3, and one from Ring 0.
>
> FRED gives us a good stack (even for SYSCALL/SYSENTER), and a unified event
> frame on the stack, meaning that all software needs to do is spill the GPRs
> with a line of PUSHes. Introduce PUSH_AND_CLEAR_GPRS and POP_GPRS for this
> purpose.
>
> Introduce entry_FRED_R0(), which to a first approximation is complete for all
> event handling within Xen.
>
> entry_FRED_R0() needs deriving from entry_FRED_R3(), so introduce a basic
> handler. There is more work required to make the return-to-guest path work
> under FRED.
>
> Also introduce entry_from_{xen,pv}() to be the C level handlers. By simply
> copying regs->fred_ss.vector into regs->entry_vector, we can reuse all the
> existing fault handlers.
>
> Extend fatal_trap() to render the event type, including by name, when FRED is
> active. This is slightly complicated, because X86_ET_OTHER must not use
> vector_name(), or SYSCALL and SYSENTER get rendered as #BP and #DB.
>
> This is sufficient to handle all interrupts and exceptions encountered during
> development, including plenty of Double Faults.
>
> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH v3 12/22] x86/traps: Introduce FRED entrypoints
2025-10-03 22:53 ` [PATCH v3 12/22] x86/traps: Introduce FRED entrypoints Andrew Cooper
2025-10-08 8:50 ` Jan Beulich
@ 2025-10-16 14:54 ` Jan Beulich
1 sibling, 0 replies; 40+ messages in thread
From: Jan Beulich @ 2025-10-16 14:54 UTC (permalink / raw)
To: Andrew Cooper; +Cc: Roger Pau Monné, Xen-devel
On 04.10.2025 00:53, Andrew Cooper wrote:
> --- a/xen/arch/x86/include/asm/asm_defns.h
> +++ b/xen/arch/x86/include/asm/asm_defns.h
> @@ -315,6 +315,71 @@ static always_inline void stac(void)
> subq $-(UREGS_error_code-UREGS_r15+\adj), %rsp
> .endm
>
> +/*
> + * Push and clear GPRs
> + */
> +.macro PUSH_AND_CLEAR_GPRS
> + push %rdi
> + xor %edi, %edi
> + push %rsi
> + xor %esi, %esi
> + push %rdx
> + xor %edx, %edx
> + push %rcx
> + xor %ecx, %ecx
> + push %rax
> + xor %eax, %eax
> + push %r8
> + xor %r8d, %r8d
> + push %r9
> + xor %r9d, %r9d
> + push %r10
> + xor %r10d, %r10d
> + push %r11
> + xor %r11d, %r11d
> + push %rbx
> + xor %ebx, %ebx
> + push %rbp
> +#ifdef CONFIG_FRAME_POINTER
> +/* Indicate special exception stack frame by inverting the frame pointer. */
> + mov %rsp, %rbp
> + notq %rbp
> +#else
> + xor %ebp, %ebp
> +#endif
> + push %r12
> + xor %r12d, %r12d
> + push %r13
> + xor %r13d, %r13d
> + push %r14
> + xor %r14d, %r14d
> + push %r15
> + xor %r15d, %r15d
> +.endm
> +
> +/*
> + * POP GPRs from a UREGS_* frame on the stack. Does not modify flags.
> + *
> + * @rax: Alternative destination for the %rax value on the stack.
> + */
> +.macro POP_GPRS rax=%rax
> + pop %r15
> + pop %r14
> + pop %r13
> + pop %r12
> + pop %rbp
> + pop %rbx
> + pop %r11
> + pop %r10
> + pop %r9
> + pop %r8
> + pop \rax
For placing the entry_ssp field I came back looking here. I notice there
was a v2 comment regarding the rax macro parameter. Besides there being
a limitation on which registers may be used, a %rsp-relative memory
operand can't be (cleanly) used either. I think if you really want to
retain that parameter, some warning needs adding to the comment.
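As an illustration of the %rsp pitfall (hypothetical use; per the SDM, a POP
with an %rsp-based memory destination computes its effective address only
after %rsp has been incremented):

    POP_GPRS rax=8(%rsp)    /* 8(%rsp) is resolved after the final pop
                             * has bumped %rsp, naming a different slot
                             * than the caller probably intended */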
Jan
^ permalink raw reply [flat|nested] 40+ messages in thread
* [PATCH v3 13/22] x86/traps: Enable FRED when requested
2025-10-03 22:53 [PATCH v3 for-4.21 00/22] x86: FRED support Andrew Cooper
` (11 preceding siblings ...)
2025-10-03 22:53 ` [PATCH v3 12/22] x86/traps: Introduce FRED entrypoints Andrew Cooper
@ 2025-10-03 22:53 ` Andrew Cooper
2025-10-08 8:54 ` Jan Beulich
2025-10-03 22:53 ` [PATCH v3 14/22] x86/pv: Deduplicate is_canonical_address() in do_set_segment_base() Andrew Cooper
` (9 subsequent siblings)
22 siblings, 1 reply; 40+ messages in thread
From: Andrew Cooper @ 2025-10-03 22:53 UTC (permalink / raw)
To: Xen-devel; +Cc: Andrew Cooper, Jan Beulich, Roger Pau Monné
With the shadow stack and exception handling adjustments in place, we can now
activate FRED when appropriate. Note that opt_fred is still disabled by
default.
Introduce init_fred() to set up all the MSRs relevant for FRED. FRED uses
MSR_STAR (entries from Ring3 only), and MSR_FRED_SSP_SL0 aliases MSR_PL0_SSP
when CET-SS is active. Otherwise, they're all new MSRs.
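As a sketch of the MSR_FRED_STK_LVLS encoding used in init_fred() below (the
MSR holds a 2-bit stack level per exception vector, with unlisted vectors
defaulting to Stack Level 0):

    stk_lvls = (2UL << (X86_EXC_MC * 2)) |   /* #MC -> Stack Level 2 */
               (3UL << (X86_EXC_DF * 2));    /* #DF -> Stack Level 3 */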
With init_fred() existing, load_system_tables() and legacy_syscall_init()
should only be used when setting up IDT delivery. Insert ASSERT()s to this
effect, and adjust the various *_init() functions to make this property true.
Per the documentation, percpu_early_traps_init() is responsible for switching
away from the boot GDT, which needs doing even in FRED mode.
Finally, set CR4.FRED in traps_init()/percpu_early_traps_init().
Xen can now boot in FRED mode up until starting a PV guest, where it faults
because IRET is not permitted to change privilege.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
CC: Jan Beulich <JBeulich@suse.com>
CC: Roger Pau Monné <roger.pau@citrix.com>
v3:
* Fix poisoning of SL1 pointers.
* Adjust bsp_traps_reinit(). It probably doesn't matter.
v2:
* Explain the lack of BUG_ON()
* Posion SL1
In principle we can stop allocating the IDT and TSS for CPUs now, although I
want to get shutdown and kexec working before making this optimisation, in
case there's something I've overlooked.
---
xen/arch/x86/include/asm/current.h | 3 ++
xen/arch/x86/include/asm/traps.h | 2 +
xen/arch/x86/traps-setup.c | 83 ++++++++++++++++++++++++++++--
3 files changed, 83 insertions(+), 5 deletions(-)
diff --git a/xen/arch/x86/include/asm/current.h b/xen/arch/x86/include/asm/current.h
index 62817e8476ec..6139980ab115 100644
--- a/xen/arch/x86/include/asm/current.h
+++ b/xen/arch/x86/include/asm/current.h
@@ -23,6 +23,9 @@
* 2 - NMI IST stack
* 1 - #MC IST stack
* 0 - IST Shadow Stacks (4x 1k, read-only)
+ *
+ * In FRED mode, #DB and NMI do not need special stacks, so their IST stacks
+ * are unused.
*/
/*
diff --git a/xen/arch/x86/include/asm/traps.h b/xen/arch/x86/include/asm/traps.h
index 73097e957d05..5d7504bc44d1 100644
--- a/xen/arch/x86/include/asm/traps.h
+++ b/xen/arch/x86/include/asm/traps.h
@@ -16,6 +16,8 @@ void traps_init(void);
void bsp_traps_reinit(void);
void percpu_traps_init(void);
+void nocall entry_FRED_R3(void);
+
extern unsigned int ler_msr;
const char *vector_name(unsigned int vec);
diff --git a/xen/arch/x86/traps-setup.c b/xen/arch/x86/traps-setup.c
index d77be8f83921..d937209ae606 100644
--- a/xen/arch/x86/traps-setup.c
+++ b/xen/arch/x86/traps-setup.c
@@ -59,6 +59,8 @@ static void load_system_tables(void)
.limit = sizeof(bsp_idt) - 1,
};
+ ASSERT(opt_fred == 0);
+
/*
* Set up the TSS. Warning - may be live, and the NMI/#MC must remain
* valid on every instruction boundary. (Note: these are all
@@ -191,6 +193,8 @@ static void legacy_syscall_init(void)
unsigned char *stub_page;
unsigned int offset;
+ ASSERT(opt_fred == 0);
+
/* No PV guests? No need to set up SYSCALL/SYSENTER infrastructure. */
if ( !IS_ENABLED(CONFIG_PV) )
return;
@@ -268,6 +272,52 @@ static void __init init_ler(void)
setup_force_cpu_cap(X86_FEATURE_XEN_LBR);
}
+/*
+ * Set up all MSRs relevant for FRED event delivery.
+ *
+ * Xen does not use any of the optional config in MSR_FRED_CONFIG, so all that
+ * is needed is the entrypoint.
+ *
+ * Because FRED always provides a good stack, NMI and #DB do not need any
+ * special treatment. Only #DF needs another stack level, and #MC for the
+ * off-chance that Xen's main stack suffers an uncorrectable error.
+ *
+ * This makes Stack Level 1 unused. We point it at #DB's stacks, with the
+ * regular and shadow stacks reversed, as poison to guarantee that any use
+ * escalates to #DF.
+ *
+ * FRED reuses MSR_STAR to provide the segment selector values to load on
+ * entry from Ring3. Entry from Ring0 leaves %cs and %ss unmodified.
+ */
+static void init_fred(void)
+{
+ unsigned long stack_top = get_stack_bottom() & ~(STACK_SIZE - 1);
+
+ ASSERT(opt_fred == 1);
+
+ wrmsrns(MSR_STAR, XEN_MSR_STAR);
+ wrmsrns(MSR_FRED_CONFIG, (unsigned long)entry_FRED_R3);
+
+ /*
+ * MSR_FRED_RSP_* all come with a 64-byte alignment check, avoiding the
+ * need for an explicit BUG_ON().
+ */
+ wrmsrns(MSR_FRED_RSP_SL0, (unsigned long)(&get_cpu_info()->_fred + 1));
+ wrmsrns(MSR_FRED_RSP_SL1, stack_top + (IST_DB * IST_SHSTK_SIZE)); /* Poison */
+ wrmsrns(MSR_FRED_RSP_SL2, stack_top + (1 + IST_MCE) * PAGE_SIZE);
+ wrmsrns(MSR_FRED_RSP_SL3, stack_top + (1 + IST_DF) * PAGE_SIZE);
+ wrmsrns(MSR_FRED_STK_LVLS, ((2UL << (X86_EXC_MC * 2)) |
+ (3UL << (X86_EXC_DF * 2))));
+
+ if ( cpu_has_xen_shstk )
+ {
+ wrmsrns(MSR_FRED_SSP_SL0, stack_top + (PRIMARY_SHSTK_SLOT + 1) * PAGE_SIZE);
+ wrmsrns(MSR_FRED_SSP_SL1, stack_top + (1 + IST_DB) * PAGE_SIZE); /* Poison */
+ wrmsrns(MSR_FRED_SSP_SL2, stack_top + (IST_MCE * IST_SHSTK_SIZE));
+ wrmsrns(MSR_FRED_SSP_SL3, stack_top + (IST_DF * IST_SHSTK_SIZE));
+ }
+}
+
/*
* Configure basic exception handling. This is prior to parsing the command
* line or configuring a console, and needs to be as simple as possible.
@@ -329,16 +379,20 @@ void __init traps_init(void)
printk(XENLOG_INFO "Disabling PV32 due to FRED\n");
}
#endif
+
+ init_fred();
+ set_in_cr4(X86_CR4_FRED);
+
setup_force_cpu_cap(X86_FEATURE_XEN_FRED);
printk("Using FRED event delivery\n");
}
else
{
+ load_system_tables();
+
printk("Using IDT event delivery\n");
}
- load_system_tables();
-
init_ler();
/* Cache {,compat_}gdt_l1e now that physical relocation is done. */
@@ -356,7 +410,11 @@ void __init traps_init(void)
*/
void __init bsp_traps_reinit(void)
{
- load_system_tables();
+ if ( opt_fred )
+ init_fred();
+ else
+ load_system_tables();
+
percpu_traps_init();
}
@@ -366,7 +424,8 @@ void __init bsp_traps_reinit(void)
*/
void percpu_traps_init(void)
{
- legacy_syscall_init();
+ if ( !opt_fred )
+ legacy_syscall_init();
if ( cpu_has_xen_lbr )
wrmsrl(MSR_IA32_DEBUGCTLMSR, IA32_DEBUGCTLMSR_LBR);
@@ -381,7 +440,21 @@ void percpu_traps_init(void)
*/
void asmlinkage percpu_early_traps_init(void)
{
- load_system_tables();
+ if ( opt_fred )
+ {
+ const seg_desc_t *gdt = this_cpu(gdt) - FIRST_RESERVED_GDT_ENTRY;
+ const struct desc_ptr gdtr = {
+ .base = (unsigned long)gdt,
+ .limit = LAST_RESERVED_GDT_BYTE,
+ };
+
+ lgdt(&gdtr);
+
+ init_fred();
+ write_cr4(read_cr4() | X86_CR4_FRED);
+ }
+ else
+ load_system_tables();
}
static void __init __maybe_unused build_assertions(void)
--
2.39.5
^ permalink raw reply related [flat|nested] 40+ messages in thread
* Re: [PATCH v3 13/22] x86/traps: Enable FRED when requested
2025-10-03 22:53 ` [PATCH v3 13/22] x86/traps: Enable FRED when requested Andrew Cooper
@ 2025-10-08 8:54 ` Jan Beulich
0 siblings, 0 replies; 40+ messages in thread
From: Jan Beulich @ 2025-10-08 8:54 UTC (permalink / raw)
To: Andrew Cooper; +Cc: Roger Pau Monné, Xen-devel
On 04.10.2025 00:53, Andrew Cooper wrote:
> With the shadow stack and exception handling adjustments in place, we can now
> activate FRED when appropriate. Note that opt_fred is still disabled by
> default.
>
> Introduce init_fred() to set up all the MSRs relevant for FRED. FRED uses
> MSR_STAR (entries from Ring3 only), and MSR_FRED_SSP_SL0 aliases MSR_PL0_SSP
> when CET-SS is active. Otherwise, they're all new MSRs.
>
> With init_fred() existing, load_system_tables() and legacy_syscall_init()
> should only be used when setting up IDT delivery. Insert ASSERT()s to this
> effect, and adjust the various *_init() functions to make this property true.
>
> Per the documentation, percpu_early_traps_init() is responsible for switching
> away from the boot GDT, which needs doing even in FRED mode.
>
> Finally, set CR4.FRED in traps_init()/percpu_early_traps_init().
>
> Xen can now boot in FRED mode up until starting a PV guest, where it faults
> because IRET is not permitted to change privilege.
>
> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
^ permalink raw reply [flat|nested] 40+ messages in thread
* [PATCH v3 14/22] x86/pv: Deduplicate is_canonical_address() in do_set_segment_base()
2025-10-03 22:53 [PATCH v3 for-4.21 00/22] x86: FRED support Andrew Cooper
` (12 preceding siblings ...)
2025-10-03 22:53 ` [PATCH v3 13/22] x86/traps: Enable FRED when requested Andrew Cooper
@ 2025-10-03 22:53 ` Andrew Cooper
2025-10-03 22:53 ` [PATCH v3 15/22] x86/entry: Alter how IRET faults are recognised Andrew Cooper
` (8 subsequent siblings)
22 siblings, 0 replies; 40+ messages in thread
From: Andrew Cooper @ 2025-10-03 22:53 UTC (permalink / raw)
To: Xen-devel; +Cc: Andrew Cooper, Jan Beulich, Roger Pau Monné
This is really a rearrangement to make adding FRED support easier.
No functional change.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
---
CC: Jan Beulich <JBeulich@suse.com>
CC: Roger Pau Monné <roger.pau@citrix.com>
v2:
* New
There is a marginal code size improvement:
add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-46 (-46)
Function old new delta
do_set_segment_base 496 450 -46
but it does get undone by the FRED support.
---
xen/arch/x86/pv/misc-hypercalls.c | 32 ++++++++++++++++---------------
1 file changed, 17 insertions(+), 15 deletions(-)
diff --git a/xen/arch/x86/pv/misc-hypercalls.c b/xen/arch/x86/pv/misc-hypercalls.c
index 7a37f16bf038..4c2abeb4add8 100644
--- a/xen/arch/x86/pv/misc-hypercalls.c
+++ b/xen/arch/x86/pv/misc-hypercalls.c
@@ -176,27 +176,29 @@ long do_set_segment_base(unsigned int which, unsigned long base)
switch ( which )
{
case SEGBASE_FS:
- if ( is_canonical_address(base) )
- write_fs_base(base);
- else
+ case SEGBASE_GS_USER:
+ case SEGBASE_GS_KERNEL:
+ if ( !is_canonical_address(base) )
+ {
ret = -EINVAL;
- break;
+ break;
+ }
- case SEGBASE_GS_USER:
- if ( is_canonical_address(base) )
+ switch ( which )
{
- write_gs_shadow(base);
+ case SEGBASE_FS:
+ write_fs_base(base);
+ break;
+
+ case SEGBASE_GS_USER:
v->arch.pv.gs_base_user = base;
- }
- else
- ret = -EINVAL;
- break;
+ write_gs_shadow(base);
+ break;
- case SEGBASE_GS_KERNEL:
- if ( is_canonical_address(base) )
+ case SEGBASE_GS_KERNEL:
write_gs_base(base);
- else
- ret = -EINVAL;
+ break;
+ }
break;
case SEGBASE_GS_USER_SEL:
--
2.39.5
^ permalink raw reply related [flat|nested] 40+ messages in thread
* [PATCH v3 15/22] x86/entry: Alter how IRET faults are recognised
2025-10-03 22:53 [PATCH v3 for-4.21 00/22] x86: FRED support Andrew Cooper
` (13 preceding siblings ...)
2025-10-03 22:53 ` [PATCH v3 14/22] x86/pv: Deduplicate is_canonical_address() in do_set_segment_base() Andrew Cooper
@ 2025-10-03 22:53 ` Andrew Cooper
2025-10-03 22:53 ` [PATCH v3 16/22] x86/entry: Drop the pre exception table infrastructure Andrew Cooper
` (7 subsequent siblings)
22 siblings, 0 replies; 40+ messages in thread
From: Andrew Cooper @ 2025-10-03 22:53 UTC (permalink / raw)
To: Xen-devel; +Cc: Andrew Cooper, Jan Beulich, Roger Pau Monné
Right now we have two IRET instructions that can fault for guest reasons, and
the pre exception table gives handle_exception as the fixup for both.
Instead, we can have compat_restore_all_guest() use restore_all_guest()'s IRET
which gives us just a single position to handle specially.
In exception_with_ints_disabled(), remove search_pre_exception_table() and use
a simpler check. Explain how the recovery works, because this isn't the first
time I've had to reverse engineer it for my own understanding.
The reference to iret_to_guest highlights that any checking here is specific
to CONFIG_PV, so exclude it in !PV builds.
Later in exception_with_ints_disabled(), it suffices to load %ecx rather than
%rcx; also remove a stray semicolon from the rep movsq.
No functional change.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
---
CC: Jan Beulich <JBeulich@suse.com>
CC: Roger Pau Monné <roger.pau@citrix.com>
v2:
* New
---
xen/arch/x86/x86_64/compat/entry.S | 3 +--
xen/arch/x86/x86_64/entry.S | 31 ++++++++++++++++++++++--------
2 files changed, 24 insertions(+), 10 deletions(-)
diff --git a/xen/arch/x86/x86_64/compat/entry.S b/xen/arch/x86/x86_64/compat/entry.S
index d7b381ea546d..39925d80a677 100644
--- a/xen/arch/x86/x86_64/compat/entry.S
+++ b/xen/arch/x86/x86_64/compat/entry.S
@@ -167,8 +167,7 @@ FUNC(compat_restore_all_guest)
scf=STK_REL(CPUINFO_scf, CPUINFO_rip), \
sel=STK_REL(CPUINFO_verw_sel, CPUINFO_rip)
-.Lft0: iretq
- _ASM_PRE_EXTABLE(.Lft0, handle_exception)
+ jmp iret_to_guest
END(compat_restore_all_guest)
/* Callers can cope with both %rax and %rcx being clobbered. */
diff --git a/xen/arch/x86/x86_64/entry.S b/xen/arch/x86/x86_64/entry.S
index c02245ac064c..01b431793b7b 100644
--- a/xen/arch/x86/x86_64/entry.S
+++ b/xen/arch/x86/x86_64/entry.S
@@ -241,8 +241,9 @@ iret_exit_to_guest:
SPEC_CTRL_COND_VERW /* Req: %rsp=eframe Clob: efl */
addq $8,%rsp
-.Lft0: iretq
- _ASM_PRE_EXTABLE(.Lft0, handle_exception)
+
+LABEL(iret_to_guest, 0)
+ iretq
END(restore_all_guest)
/*
@@ -920,10 +921,23 @@ handle_exception_saved:
exception_with_ints_disabled:
testb $3,UREGS_cs(%rsp) # interrupts disabled outside Xen?
jnz FATAL_exception_with_ints_disabled
- movq %rsp,%rdi
- call search_pre_exception_table
- testq %rax,%rax # no fixup code for faulting EIP?
- jz .Ldispatch_exceptions
+
+#ifndef CONFIG_PV
+ /* No PV? No IRETs-to-guest to worry about. */
+ jmp .Ldispatch_exceptions
+#else
+ /* Check to see if the exception was on the IRET to guest context. */
+ lea iret_to_guest(%rip), %rax
+ cmp %rax, UREGS_rip(%rsp)
+ jne .Ldispatch_exceptions
+
+ /*
+ * Recovery is at handle_exception. It may be necessary to make space
+ * on the interrupted stack for ec/ev, after which the current ec/ev
+ * is copied to make it appear as if this exception occurred in guest
+ * context.
+ */
+ lea handle_exception(%rip), %rax
movq %rax,UREGS_rip(%rsp) # fixup regular stack
#ifdef CONFIG_XEN_SHSTK
@@ -940,13 +954,14 @@ exception_with_ints_disabled:
movq %rsp,%rsi
subq $8,%rsp
movq %rsp,%rdi
- movq $UREGS_kernel_sizeof/8,%rcx
- rep; movsq # make room for ec/ev
+ mov $UREGS_kernel_sizeof/8, %ecx
+ rep movsq # make room for ec/ev
1: movq UREGS_error_code(%rsp),%rax # ec/ev
movq %rax,UREGS_kernel_sizeof(%rsp)
mov %r15, STACK_CPUINFO_FIELD(xen_cr3)(%r14)
mov %r13b, STACK_CPUINFO_FIELD(use_pv_cr3)(%r14)
jmp restore_all_xen # return to fixup code
+#endif /* !CONFIG_PV */
/* No special register assumptions. */
FATAL_exception_with_ints_disabled:
--
2.39.5
^ permalink raw reply related [flat|nested] 40+ messages in thread
* [PATCH v3 16/22] x86/entry: Drop the pre exception table infrastructure
2025-10-03 22:53 [PATCH v3 for-4.21 00/22] x86: FRED support Andrew Cooper
` (14 preceding siblings ...)
2025-10-03 22:53 ` [PATCH v3 15/22] x86/entry: Alter how IRET faults are recognised Andrew Cooper
@ 2025-10-03 22:53 ` Andrew Cooper
2025-10-03 22:53 ` [PATCH v3 17/22] x86/entry: Rework the comment about SYSCALL and DF Andrew Cooper
` (6 subsequent siblings)
22 siblings, 0 replies; 40+ messages in thread
From: Andrew Cooper @ 2025-10-03 22:53 UTC (permalink / raw)
To: Xen-devel; +Cc: Andrew Cooper, Jan Beulich, Roger Pau Monné
It is no longer used.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
---
CC: Jan Beulich <JBeulich@suse.com>
CC: Roger Pau Monné <roger.pau@citrix.com>
v2:
* New
---
xen/arch/x86/extable.c | 14 --------------
xen/arch/x86/include/asm/asm_defns.h | 11 ++++-------
xen/arch/x86/include/asm/uaccess.h | 2 --
xen/arch/x86/xen.lds.S | 5 -----
4 files changed, 4 insertions(+), 28 deletions(-)
diff --git a/xen/arch/x86/extable.c b/xen/arch/x86/extable.c
index cf637d0921e4..a9b6c6b904f5 100644
--- a/xen/arch/x86/extable.c
+++ b/xen/arch/x86/extable.c
@@ -61,7 +61,6 @@ void init_or_livepatch sort_exception_table(struct exception_table_entry *start,
void __init sort_exception_tables(void)
{
sort_exception_table(__start___ex_table, __stop___ex_table);
- sort_exception_table(__start___pre_ex_table, __stop___pre_ex_table);
}
static unsigned long
@@ -219,16 +218,3 @@ int __init cf_check stub_selftest(void)
}
__initcall(stub_selftest);
#endif /* CONFIG_SELF_TESTS */
-
-unsigned long asmlinkage search_pre_exception_table(struct cpu_user_regs *regs)
-{
- unsigned long addr = regs->rip;
- unsigned long fixup = search_one_extable(
- __start___pre_ex_table, __stop___pre_ex_table, addr);
- if ( fixup )
- {
- dprintk(XENLOG_INFO, "Pre-exception: %p -> %p\n", _p(addr), _p(fixup));
- perfc_incr(exception_fixed);
- }
- return fixup;
-}
diff --git a/xen/arch/x86/include/asm/asm_defns.h b/xen/arch/x86/include/asm/asm_defns.h
index a81a4043d0f1..d7eafedf0e4c 100644
--- a/xen/arch/x86/include/asm/asm_defns.h
+++ b/xen/arch/x86/include/asm/asm_defns.h
@@ -65,22 +65,19 @@ register unsigned long current_stack_pointer asm("rsp");
/* Exception table entry */
#ifdef __ASSEMBLY__
-# define _ASM__EXTABLE(sfx, from, to) \
- .section .ex_table##sfx, "a" ; \
+# define _ASM_EXTABLE(from, to) \
+ .section .ex_table, "a" ; \
.balign 4 ; \
.long _ASM_EX(from), _ASM_EX(to) ; \
.previous
#else
-# define _ASM__EXTABLE(sfx, from, to) \
- " .section .ex_table" #sfx ",\"a\"\n" \
+# define _ASM_EXTABLE(from, to) \
+ " .section .ex_table,\"a\"\n" \
" .balign 4\n" \
" .long " _ASM_EX(from) ", " _ASM_EX(to) "\n" \
" .previous\n"
#endif
-#define _ASM_EXTABLE(from, to) _ASM__EXTABLE(, from, to)
-#define _ASM_PRE_EXTABLE(from, to) _ASM__EXTABLE(.pre, from, to)
-
#ifdef __ASSEMBLY__
.macro BUILD_BUG_ON condstr, cond:vararg
diff --git a/xen/arch/x86/include/asm/uaccess.h b/xen/arch/x86/include/asm/uaccess.h
index 719d053936b9..4c41a0fe0426 100644
--- a/xen/arch/x86/include/asm/uaccess.h
+++ b/xen/arch/x86/include/asm/uaccess.h
@@ -410,8 +410,6 @@ struct exception_table_entry
};
extern struct exception_table_entry __start___ex_table[];
extern struct exception_table_entry __stop___ex_table[];
-extern struct exception_table_entry __start___pre_ex_table[];
-extern struct exception_table_entry __stop___pre_ex_table[];
union stub_exception_token {
struct {
diff --git a/xen/arch/x86/xen.lds.S b/xen/arch/x86/xen.lds.S
index 966e514f2034..66075bc0ae6d 100644
--- a/xen/arch/x86/xen.lds.S
+++ b/xen/arch/x86/xen.lds.S
@@ -119,11 +119,6 @@ SECTIONS
*(.ex_table)
__stop___ex_table = .;
- /* Pre-exception table */
- __start___pre_ex_table = .;
- *(.ex_table.pre)
- __stop___pre_ex_table = .;
-
. = ALIGN(PAGE_SIZE);
__ro_after_init_end = .;
--
2.39.5
^ permalink raw reply related [flat|nested] 40+ messages in thread
* [PATCH v3 17/22] x86/entry: Rework the comment about SYSCALL and DF
2025-10-03 22:53 [PATCH v3 for-4.21 00/22] x86: FRED support Andrew Cooper
` (15 preceding siblings ...)
2025-10-03 22:53 ` [PATCH v3 16/22] x86/entry: Drop the pre exception table infrastructure Andrew Cooper
@ 2025-10-03 22:53 ` Andrew Cooper
2025-10-03 22:53 ` [PATCH v3 18/22] x86/pv: Adjust GS handling for FRED mode Andrew Cooper
` (5 subsequent siblings)
22 siblings, 0 replies; 40+ messages in thread
From: Andrew Cooper @ 2025-10-03 22:53 UTC (permalink / raw)
To: Xen-devel; +Cc: Andrew Cooper, Jan Beulich, Roger Pau Monné
It's soon going to be needed in a second location.
Right now it's misleading saying that nothing else would be cleared. It's
missing the more important point that SYSCALLs are treated like all other
interrupts and exceptions, and undergo normal flags handling there.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
---
CC: Jan Beulich <JBeulich@suse.com>
CC: Roger Pau Monné <roger.pau@citrix.com>
v2:
* New
---
xen/arch/x86/x86_64/entry.S | 7 +++----
1 file changed, 3 insertions(+), 4 deletions(-)
diff --git a/xen/arch/x86/x86_64/entry.S b/xen/arch/x86/x86_64/entry.S
index 01b431793b7b..ca446c6ff0ce 100644
--- a/xen/arch/x86/x86_64/entry.S
+++ b/xen/arch/x86/x86_64/entry.S
@@ -39,10 +39,9 @@ FUNC_LOCAL(switch_to_kernel)
leal (,%rcx,TBF_INTERRUPT),%ecx
/*
- * The PV ABI hardcodes the (guest-inaccessible and virtual)
- * SYSCALL_MASK MSR such that DF (and nothing else) would be cleared.
- * Note that the equivalent of IF (VGCF_syscall_disables_events) is
- * dealt with separately above.
+ * The PV ABI, given no virtual SYSCALL_MASK, hardcodes that DF is
+ * cleared. Other flags are handled in the same way as interrupts and
+ * exceptions in create_bounce_frame().
*/
mov $~X86_EFLAGS_DF, %esi
--
2.39.5
^ permalink raw reply related [flat|nested] 40+ messages in thread
* [PATCH v3 18/22] x86/pv: Adjust GS handling for FRED mode
2025-10-03 22:53 [PATCH v3 for-4.21 00/22] x86: FRED support Andrew Cooper
` (16 preceding siblings ...)
2025-10-03 22:53 ` [PATCH v3 17/22] x86/entry: Rework the comment about SYSCALL and DF Andrew Cooper
@ 2025-10-03 22:53 ` Andrew Cooper
2025-10-03 22:53 ` [PATCH v3 19/22] x86/pv: Guest exception handling in " Andrew Cooper
` (4 subsequent siblings)
22 siblings, 0 replies; 40+ messages in thread
From: Andrew Cooper @ 2025-10-03 22:53 UTC (permalink / raw)
To: Xen-devel; +Cc: Andrew Cooper, Jan Beulich, Roger Pau Monné
When FRED is active, hardware automatically swaps GS when changing privilege,
and the SWAPGS instruction is disallowed.
For native OSes using GS as the thread local pointer this is a massive
improvement on the pre-FRED architecture, but under Xen it makes handling PV
guests more complicated. Specifically, it means that GS_BASE and GS_SHADOW
are the opposite way around in FRED mode, as opposed to IDT mode.
This leads to the following changes:
* In load_segments(), we have to load both GSes. Account for this in the
SWAP() condition and avoid the path with SWAPGS.
* In save_segments(), we need to read GS_SHADOW rather than GS_BASE.
* In toggle_guest_mode(), we need to emulate SWAPGS.
* In do_set_segment_base(), merge the SEGBASE_GS_{USER,KERNEL} cases and
take FRED into account when choosing which base to update.
SEGBASE_GS_USER_SEL was already an LKGS invocation (decades before FRED)
so under FRED it needs to be just a MOV %gs. Simply skip the SWAPGSes.
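As a summary of the resulting rule (a sketch only; guest_active_gs_base() is
a hypothetical helper, the other names are as used in this series), the
interrupted guest's active GS base, as seen from Xen, is:

    static unsigned long guest_active_gs_base(void)
    {
        return opt_fred ? rdmsr(MSR_SHADOW_GS_BASE)  /* hardware swapped */
                        : rdmsr(MSR_GS_BASE);
    }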
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
---
CC: Jan Beulich <JBeulich@suse.com>
CC: Roger Pau Monné <roger.pau@citrix.com>
v3:
* Rename things
v2:
* New
I think this functions correctly, but it's not ideal. The conditions are asymmetric and
awkward.
---
xen/arch/x86/domain.c | 22 +++++++++++++++++-----
xen/arch/x86/pv/domain.c | 22 ++++++++++++++++++++--
xen/arch/x86/pv/misc-hypercalls.c | 16 ++++++++++------
3 files changed, 47 insertions(+), 13 deletions(-)
diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
index 8089ff929bf7..ce08f91be3af 100644
--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -1819,9 +1819,10 @@ static void load_segments(struct vcpu *n)
/*
* Figure out which way around gsb/gss want to be. gsb needs to be
- * the active context, and gss needs to be the inactive context.
+ * the active context, and gss needs to be the inactive context,
+ * unless we're in FRED mode where they're reversed.
*/
- if ( !(n->arch.flags & TF_kernel_mode) )
+ if ( !(n->arch.flags & TF_kernel_mode) ^ opt_fred )
SWAP(gsb, gss);
if ( using_svm() && (n->arch.pv.fs | n->arch.pv.gs) <= 3 )
@@ -1842,7 +1843,9 @@ static void load_segments(struct vcpu *n)
if ( !fs_gs_done && !compat )
{
- if ( read_cr4() & X86_CR4_FSGSBASE )
+ unsigned long cr4 = read_cr4();
+
+ if ( !(cr4 & X86_CR4_FRED) && (cr4 & X86_CR4_FSGSBASE) )
{
__wrgsbase(gss);
__wrfsbase(n->arch.pv.fs_base);
@@ -1959,6 +1962,9 @@ static void load_segments(struct vcpu *n)
* Guests however cannot use SWAPGS, so there is no mechanism to modify the
* inactive GS base behind Xen's back. Therefore, Xen's copy of the inactive
* GS base is still accurate, and doesn't need reading back from hardware.
+ *
+ * Under FRED, hardware automatically swaps GS for us, so SHADOW_GS is the
+ * active GS from the guest's point of view.
*/
static void save_segments(struct vcpu *v)
{
@@ -1974,12 +1980,18 @@ static void save_segments(struct vcpu *v)
if ( read_cr4() & X86_CR4_FSGSBASE )
{
fs_base = __rdfsbase();
- gs_base = __rdgsbase();
+ if ( opt_fred )
+ gs_base = rdmsr(MSR_SHADOW_GS_BASE);
+ else
+ gs_base = __rdgsbase();
}
else
{
fs_base = rdmsr(MSR_FS_BASE);
- gs_base = rdmsr(MSR_GS_BASE);
+ if ( opt_fred )
+ gs_base = rdmsr(MSR_SHADOW_GS_BASE);
+ else
+ gs_base = rdmsr(MSR_GS_BASE);
}
v->arch.pv.fs_base = fs_base;
diff --git a/xen/arch/x86/pv/domain.c b/xen/arch/x86/pv/domain.c
index 9c4785c187dd..369af444c29b 100644
--- a/xen/arch/x86/pv/domain.c
+++ b/xen/arch/x86/pv/domain.c
@@ -14,9 +14,10 @@
#include <asm/cpufeature.h>
#include <asm/fsgsbase.h>
#include <asm/invpcid.h>
-#include <asm/spec_ctrl.h>
#include <asm/pv/domain.h>
#include <asm/shadow.h>
+#include <asm/spec_ctrl.h>
+#include <asm/traps.h>
#ifdef CONFIG_PV32
int8_t __read_mostly opt_pv32 = -1;
@@ -480,11 +481,28 @@ void toggle_guest_mode(struct vcpu *v)
* subsequent context switch won't bother re-reading it.
*/
gs_base = read_gs_base();
+
+ /*
+ * In FRED mode, not only are the two GSes the other way around (i.e. we
+ * want to read GS_SHADOW here), but the SWAPGS instruction is also
+ * disallowed, so we have to emulate it.
+ */
+ if ( opt_fred )
+ {
+ unsigned long gs_shadow = rdmsr(MSR_SHADOW_GS_BASE);
+
+ wrmsrns(MSR_SHADOW_GS_BASE, gs_base);
+ write_gs_base(gs_shadow);
+
+ gs_base = gs_shadow;
+ }
+ else
+ asm volatile ( "swapgs" );
+
if ( v->arch.flags & TF_kernel_mode )
v->arch.pv.gs_base_kernel = gs_base;
else
v->arch.pv.gs_base_user = gs_base;
- asm volatile ( "swapgs" );
_toggle_guest_pt(v);
diff --git a/xen/arch/x86/pv/misc-hypercalls.c b/xen/arch/x86/pv/misc-hypercalls.c
index 4c2abeb4add8..2c9cf50638db 100644
--- a/xen/arch/x86/pv/misc-hypercalls.c
+++ b/xen/arch/x86/pv/misc-hypercalls.c
@@ -11,6 +11,7 @@
#include <asm/debugreg.h>
#include <asm/fsgsbase.h>
+#include <asm/traps.h>
long do_set_debugreg(int reg, unsigned long value)
{
@@ -192,11 +193,12 @@ long do_set_segment_base(unsigned int which, unsigned long base)
case SEGBASE_GS_USER:
v->arch.pv.gs_base_user = base;
- write_gs_shadow(base);
- break;
-
+ fallthrough;
case SEGBASE_GS_KERNEL:
- write_gs_base(base);
+ if ( (which == SEGBASE_GS_KERNEL) ^ opt_fred )
+ write_gs_base(base);
+ else
+ write_gs_shadow(base);
break;
}
break;
@@ -209,7 +211,8 @@ long do_set_segment_base(unsigned int which, unsigned long base)
* We wish to update the user %gs from the GDT/LDT. Currently, the
* guest kernel's GS_BASE is in context.
*/
- asm volatile ( "swapgs" );
+ if ( !opt_fred )
+ asm volatile ( "swapgs" );
if ( sel > 3 )
/* Fix up RPL for non-NUL selectors. */
@@ -247,7 +250,8 @@ long do_set_segment_base(unsigned int which, unsigned long base)
/* Update the cache of the inactive base, as read from the GDT/LDT. */
v->arch.pv.gs_base_user = read_gs_base();
- asm volatile ( safe_swapgs );
+ if ( !opt_fred )
+ asm volatile ( safe_swapgs );
break;
}
--
2.39.5
^ permalink raw reply related [flat|nested] 40+ messages in thread
* [PATCH v3 19/22] x86/pv: Guest exception handling in FRED mode
2025-10-03 22:53 [PATCH v3 for-4.21 00/22] x86: FRED support Andrew Cooper
` (17 preceding siblings ...)
2025-10-03 22:53 ` [PATCH v3 18/22] x86/pv: Adjust GS handling for FRED mode Andrew Cooper
@ 2025-10-03 22:53 ` Andrew Cooper
2025-10-08 12:28 ` Jan Beulich
2025-10-03 22:53 ` [PATCH v3 20/22] x86/pv: ERETU error handling Andrew Cooper
` (3 subsequent siblings)
22 siblings, 1 reply; 40+ messages in thread
From: Andrew Cooper @ 2025-10-03 22:53 UTC (permalink / raw)
To: Xen-devel; +Cc: Andrew Cooper, Jan Beulich, Roger Pau Monné
Under FRED, entry_from_pv() handles everything. To start with, implement
exception handling in the same manner as entry_from_xen(), although we can
unconditionally enable interrupts after the async/fatal events.
After entry_from_pv() returns, test_all_events() needs to run to perform
exception and interrupt injection. Split entry_FRED_R3() into two and
introduce eretu_exit_to_guest() as the latter half, coming unilaterally from
restore_all_guest().
For all of this, there is a slightly complicated relationship with CONFIG_PV.
entry_FRED_R3() must exist irrespective of CONFIG_PV, because it's the
entrypoint registered with hardware. For simplicity, entry_from_pv() is
always called, but it collapses into fatal_trap() in the !PV case.
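That is, in a !PV build the function folds down to roughly:

    void asmlinkage entry_from_pv(struct cpu_user_regs *regs)
    {
        regs->entry_vector = regs->fred_ss.vector;
        fatal_trap(regs, false);
    }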
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
CC: Jan Beulich <JBeulich@suse.com>
CC: Roger Pau Monné <roger.pau@citrix.com>
v3:
* Adjust comments.
* Group CP with others. It's definitely wrong for perf, but that's out the
window anyway now that we're letting a compiler make the decision tree.
v2:
* New
---
xen/arch/x86/traps.c | 75 +++++++++++++++++++++++++++++++-
xen/arch/x86/x86_64/entry-fred.S | 13 +++++-
xen/arch/x86/x86_64/entry.S | 4 +-
3 files changed, 89 insertions(+), 3 deletions(-)
diff --git a/xen/arch/x86/traps.c b/xen/arch/x86/traps.c
index 0027f096a6c3..3f7db11c247b 100644
--- a/xen/arch/x86/traps.c
+++ b/xen/arch/x86/traps.c
@@ -2266,9 +2266,82 @@ void asmlinkage check_ist_exit(const struct cpu_user_regs *regs, bool ist_exit)
void asmlinkage entry_from_pv(struct cpu_user_regs *regs)
{
+ struct fred_info *fi = cpu_regs_fred_info(regs);
+ uint8_t type = regs->fred_ss.type;
+ uint8_t vec = regs->fred_ss.vector;
+
/* Copy fred_ss.vector into entry_vector as IDT delivery would have done. */
- regs->entry_vector = regs->fred_ss.vector;
+ regs->entry_vector = vec;
+
+ if ( !IS_ENABLED(CONFIG_PV) )
+ goto fatal;
+
+ /*
+ * First, handle the asynchronous or fatal events. These are either
+ * unrelated to the interrupted context, or may not have valid context
+ * recorded, and all have special rules on how/whether to re-enable IRQs.
+ */
+ switch ( type )
+ {
+ case X86_ET_EXT_INTR:
+ return do_IRQ(regs);
+
+ case X86_ET_NMI:
+ return do_nmi(regs);
+
+ case X86_ET_HW_EXC:
+ switch ( vec )
+ {
+ case X86_EXC_DF: return do_double_fault(regs);
+ case X86_EXC_MC: return do_machine_check(regs);
+ }
+ break;
+ }
+
+ /*
+ * With the asynchronous events handled, what remains are the synchronous
+ * ones. PV guest context always had interrupts enabled.
+ */
+ local_irq_enable();
+
+ switch ( type )
+ {
+ case X86_ET_HW_EXC:
+ case X86_ET_PRIV_SW_EXC:
+ case X86_ET_SW_EXC:
+ switch ( vec )
+ {
+ case X86_EXC_PF: handle_PF(regs, fi->edata); break;
+ case X86_EXC_GP: do_general_protection(regs); break;
+ case X86_EXC_UD: do_invalid_op(regs); break;
+ case X86_EXC_NM: do_device_not_available(regs); break;
+ case X86_EXC_BP: do_int3(regs); break;
+ case X86_EXC_DB: handle_DB(regs, fi->edata); break;
+ case X86_EXC_CP: do_entry_CP(regs); break;
+
+ case X86_EXC_DE:
+ case X86_EXC_OF:
+ case X86_EXC_BR:
+ case X86_EXC_NP:
+ case X86_EXC_SS:
+ case X86_EXC_MF:
+ case X86_EXC_AC:
+ case X86_EXC_XM:
+ do_trap(regs);
+ break;
+ default:
+ goto fatal;
+ }
+ break;
+
+ default:
+ goto fatal;
+ }
+
+ return;
+
+ fatal:
fatal_trap(regs, false);
}
diff --git a/xen/arch/x86/x86_64/entry-fred.S b/xen/arch/x86/x86_64/entry-fred.S
index 3c3320df22cb..a1ff9a4a9747 100644
--- a/xen/arch/x86/x86_64/entry-fred.S
+++ b/xen/arch/x86/x86_64/entry-fred.S
@@ -15,9 +15,20 @@ FUNC(entry_FRED_R3, 4096)
mov %rsp, %rdi
call entry_from_pv
+#ifdef CONFIG_PV
+ GET_STACK_END(14)
+ movq STACK_CPUINFO_FIELD(current_vcpu)(%r14), %rbx
+
+ jmp test_all_events
+#else
+ BUG /* Not Reached */
+#endif
+END(entry_FRED_R3)
+
+FUNC(eretu_exit_to_guest)
POP_GPRS
eretu
-END(entry_FRED_R3)
+END(eretu_exit_to_guest)
/* The Ring0 entrypoint is at Ring3 + 0x100. */
.org entry_FRED_R3 + 0x100, 0xcc
diff --git a/xen/arch/x86/x86_64/entry.S b/xen/arch/x86/x86_64/entry.S
index ca446c6ff0ce..0692163faa44 100644
--- a/xen/arch/x86/x86_64/entry.S
+++ b/xen/arch/x86/x86_64/entry.S
@@ -63,7 +63,7 @@ UNLIKELY_END(syscall_no_callback)
/* Conditionally clear DF */
and %esi, UREGS_eflags(%rsp)
/* %rbx: struct vcpu */
-test_all_events:
+LABEL(test_all_events, 0)
ASSERT_NOT_IN_ATOMIC
cli # tests must not race interrupts
/*test_softirqs:*/
@@ -152,6 +152,8 @@ END(switch_to_kernel)
FUNC_LOCAL(restore_all_guest)
ASSERT_INTERRUPTS_DISABLED
+ ALTERNATIVE "", "jmp eretu_exit_to_guest", X86_FEATURE_XEN_FRED
+
/* Stash guest SPEC_CTRL value while we can read struct vcpu. */
mov VCPU_arch_msrs(%rbx), %rdx
mov VCPUMSR_spec_ctrl_raw(%rdx), %r15d
--
2.39.5
^ permalink raw reply related [flat|nested] 40+ messages in thread
* Re: [PATCH v3 19/22] x86/pv: Guest exception handling in FRED mode
2025-10-03 22:53 ` [PATCH v3 19/22] x86/pv: Guest exception handling in " Andrew Cooper
@ 2025-10-08 12:28 ` Jan Beulich
0 siblings, 0 replies; 40+ messages in thread
From: Jan Beulich @ 2025-10-08 12:28 UTC (permalink / raw)
To: Andrew Cooper; +Cc: Roger Pau Monné, Xen-devel
On 04.10.2025 00:53, Andrew Cooper wrote:
> Under FRED, entry_from_pv() handles everything. To start with, implement
> exception handling in the same manner as entry_from_xen(), although we can
> unconditionally enable interrupts after the async/fatal events.
>
> After entry_from_pv() returns, test_all_events() needs to run to perform
> exception and interrupt injection. Split entry_FRED_R3() into two and
> introduce eretu_exit_to_guest() as the latter half, coming unilaterally from
> restore_all_guest().
>
> For all of this, there is a slightly complicated relationship with CONFIG_PV.
> entry_FRED_R3() must exist irrespective of CONFIG_PV, because it's the
> entrypoint registered with hardware. For simplicity, entry_from_pv() is
> always called, but it collapses into fatal_trap() in the !PV case.
>
> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Nevertheless ...
> --- a/xen/arch/x86/traps.c
> +++ b/xen/arch/x86/traps.c
> @@ -2266,9 +2266,82 @@ void asmlinkage check_ist_exit(const struct cpu_user_regs *regs, bool ist_exit)
>
> void asmlinkage entry_from_pv(struct cpu_user_regs *regs)
> {
> + struct fred_info *fi = cpu_regs_fred_info(regs);
> + uint8_t type = regs->fred_ss.type;
> + uint8_t vec = regs->fred_ss.vector;
> +
> /* Copy fred_ss.vector into entry_vector as IDT delivery would have done. */
> - regs->entry_vector = regs->fred_ss.vector;
> + regs->entry_vector = vec;
> +
> + if ( !IS_ENABLED(CONFIG_PV) )
> + goto fatal;
> +
> + /*
> + * First, handle the asynchronous or fatal events. These are either
> + * unrelated to the interrupted context, or may not have valid context
> + * recorded, and all have special rules on how/whether to re-enable IRQs.
> + */
> + switch ( type )
> + {
> + case X86_ET_EXT_INTR:
> + return do_IRQ(regs);
> +
> + case X86_ET_NMI:
> + return do_nmi(regs);
> +
> + case X86_ET_HW_EXC:
> + switch ( vec )
> + {
> + case X86_EXC_DF: return do_double_fault(regs);
> + case X86_EXC_MC: return do_machine_check(regs);
> + }
> + break;
> + }
> +
> + /*
> + * With the asynchronous events handled, what remains are the synchronous
> + * ones. PV guest context always had interrupts enabled.
> + */
> + local_irq_enable();
> +
> + switch ( type )
> + {
> + case X86_ET_HW_EXC:
> + case X86_ET_PRIV_SW_EXC:
> + case X86_ET_SW_EXC:
> + switch ( vec )
> + {
> + case X86_EXC_PF: handle_PF(regs, fi->edata); break;
> + case X86_EXC_GP: do_general_protection(regs); break;
> + case X86_EXC_UD: do_invalid_op(regs); break;
> + case X86_EXC_NM: do_device_not_available(regs); break;
> + case X86_EXC_BP: do_int3(regs); break;
> + case X86_EXC_DB: handle_DB(regs, fi->edata); break;
> + case X86_EXC_CP: do_entry_CP(regs); break;
> +
> + case X86_EXC_DE:
> + case X86_EXC_OF:
> + case X86_EXC_BR:
> + case X86_EXC_NP:
> + case X86_EXC_SS:
> + case X86_EXC_MF:
> + case X86_EXC_AC:
> + case X86_EXC_XM:
> + do_trap(regs);
> + break;
>
> + default:
> + goto fatal;
> + }
> + break;
> +
> + default:
> + goto fatal;
> + }
> +
> + return;
> +
> + fatal:
> fatal_trap(regs, false);
> }
... I'm still somewhat bothered by this almost entirely duplicating the
other entry function, i.e. I continue to wonder if we wouldn't be better
off by eliminating that duplication (say by way of an always_inline
helper with a suitable extra parameter).
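Something along these lines, perhaps (purely illustrative - the helper
name and parameter are invented, and the dispatch bodies would move
wholesale out of the two entry functions):

    static always_inline void dispatch_fred_event(struct cpu_user_regs *regs,
                                                  bool from_pv)
    {
        /* Asynchronous / fatal events first, as both callers do today. */
        switch ( regs->fred_ss.type )
        {
        case X86_ET_EXT_INTR: return do_IRQ(regs);
        case X86_ET_NMI:      return do_nmi(regs);
        /* ... */
        }

        /* The one point at which the two entry paths genuinely differ. */
        if ( from_pv || (regs->eflags & X86_EFLAGS_IF) )
            local_irq_enable();

        /* ... common synchronous dispatch ... */
    }

leaving entry_from_pv() and entry_from_xen() as thin wrappers around it.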
> --- a/xen/arch/x86/x86_64/entry.S
> +++ b/xen/arch/x86/x86_64/entry.S
> @@ -63,7 +63,7 @@ UNLIKELY_END(syscall_no_callback)
> /* Conditionally clear DF */
> and %esi, UREGS_eflags(%rsp)
> /* %rbx: struct vcpu */
> -test_all_events:
> +LABEL(test_all_events, 0)
> ASSERT_NOT_IN_ATOMIC
> cli # tests must not race interrupts
> /*test_softirqs:*/
> @@ -152,6 +152,8 @@ END(switch_to_kernel)
> FUNC_LOCAL(restore_all_guest)
> ASSERT_INTERRUPTS_DISABLED
>
> + ALTERNATIVE "", "jmp eretu_exit_to_guest", X86_FEATURE_XEN_FRED
> +
> /* Stash guest SPEC_CTRL value while we can read struct vcpu. */
> mov VCPU_arch_msrs(%rbx), %rdx
I also continue to wonder if we wouldn't do a tiny bit better by using
ALTERNATIVE "mov VCPU_arch_msrs(%rbx), %rdx", \
"jmp eretu_exit_to_guest", \
X86_FEATURE_XEN_FRED
Or by converting the few jumps to restore_all_guest to alternatives
(duplicating the ASSERT_INTERRUPTS_DISABLED there).
Jan
^ permalink raw reply [flat|nested] 40+ messages in thread
* [PATCH v3 20/22] x86/pv: ERETU error handling
2025-10-03 22:53 [PATCH v3 for-4.21 00/22] x86: FRED support Andrew Cooper
` (18 preceding siblings ...)
2025-10-03 22:53 ` [PATCH v3 19/22] x86/pv: Guest exception handling in " Andrew Cooper
@ 2025-10-03 22:53 ` Andrew Cooper
2025-10-08 12:36 ` Jan Beulich
2025-10-03 22:53 ` [PATCH v3 21/22] x86/pv: System call handling in FRED mode Andrew Cooper
` (2 subsequent siblings)
22 siblings, 1 reply; 40+ messages in thread
From: Andrew Cooper @ 2025-10-03 22:53 UTC (permalink / raw)
To: Xen-devel; +Cc: Andrew Cooper, Jan Beulich, Roger Pau Monné
ERETU can fault for guest reasons, and like IRET needs special handling to
forward the error into the guest.
As this is largely written in C, take the opportunity to better classify the
sources of error, and in particular, not forward errors that are actually
Xen's fault into the guest, opting for a domain crash instead.
Because ERETU does not enable NMIs if it faults, a corner case exists if an
NMI was taken while in guest context, and the ERETU back out faults. Recovery
must involve an ERETS with the interrupted context's NMI flag.
See the comments for full details.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
CC: Jan Beulich <JBeulich@suse.com>
CC: Roger Pau Monné <roger.pau@citrix.com>
v2:
* New
---
xen/arch/x86/traps.c | 115 +++++++++++++++++++++++++++++++
xen/arch/x86/x86_64/entry-fred.S | 13 ++++
2 files changed, 128 insertions(+)
diff --git a/xen/arch/x86/traps.c b/xen/arch/x86/traps.c
index 3f7db11c247b..955cff32d75f 100644
--- a/xen/arch/x86/traps.c
+++ b/xen/arch/x86/traps.c
@@ -2345,6 +2345,113 @@ void asmlinkage entry_from_pv(struct cpu_user_regs *regs)
fatal_trap(regs, false);
}
+void nocall eretu_error_dom_crash(void);
+
+/*
+ * Classify an event at the ERETU instruction, and handle if possible.
+ * Returns @true if handled, @false if the event should continue down the
+ * normal handlers.
+ */
+static bool handle_eretu_event(struct cpu_user_regs *regs)
+{
+ unsigned long recover;
+
+ /*
+ * WARNING: The GPRs in gregs overlap with regs. Only gregs->error_code
+ * and later are legitimate to access.
+ */
+ struct cpu_user_regs *gregs =
+ _p(regs->rsp - offsetof(struct cpu_user_regs, error_code));
+
+ /*
+ * The asynchronous or fatal events (INTR, NMI, #MC, #DF) have been dealt
+ * with, meaning we only have synchronous ones to consider. Anything
+ * which isn't a hardware exception wants handling normally.
+ */
+ if ( regs->fred_ss.type != X86_ET_HW_EXC )
+ return false;
+
+ /*
+ * Guests are permitted to write non-present GDT/LDT entries. Therefore
+ * #NP[sel] (%cs) and #SS[sel] (%ss) must be handled as guest errors. The
+ * only other source of #SS is for a bad %ss-relative memory access in
+ * Xen, and if the stack is that bad, we'll have escalated to #DF.
+ *
+ * #PF can happen from ERETU accessing the GDT/LDT. Xen may translate
+ * these into #GP for the guest, so must be handled as guest errors. In
+ * theory we can get #PF for a bad instruction fetch or bad stack access,
+ * but either of these will be fatal and not end up here.
+ */
+ switch ( regs->fred_ss.vector )
+ {
+ case X86_EXC_GP:
+ /*
+ * #GP[0] can occur because of a NULL %cs or %ss (which are a guest
+ * error), but some #GP[0]'s are errors in Xen (ERETU at SL != 0), or
+ * errors of Xen handling guest state (bad metadata). These magic
+ * numbers came from the FRED Spec; they check that ERETU is trying to
+ * return to Ring 3, and that reserved or inapplicable bits are 0.
+ */
+ if ( regs->error_code == 0 && (gregs->cs & ~3) && (gregs->ss & ~3) &&
+ (regs->fred_cs.sl != 0 ||
+ (gregs->csx & 0xffffffffffff0003UL) != 3 ||
+ (gregs->rflags & 0xffffffffffc2b02aUL) != 2 ||
+ (gregs->ssx & 0xfff80003UL) != 3) )
+ {
+ recover = (unsigned long)eretu_error_dom_crash;
+
+ if ( regs->fred_cs.sl )
+ gprintk(XENLOG_ERR, "ERETU at SL %u\n", regs->fred_cs.sl);
+ else
+ gprintk(XENLOG_ERR, "Bad return state: csx %#lx, rflags %#lx, ssx %#x\n",
+ gregs->csx, gregs->rflags, (unsigned int)gregs->ssx);
+ break;
+ }
+ fallthrough;
+ case X86_EXC_NP:
+ case X86_EXC_SS:
+ case X86_EXC_PF:
+ recover = (unsigned long)entry_FRED_R3;
+ break;
+
+ /*
+ * Handle everything else normally. #BP and #DB would be debugging
+ * activities in Xen. In theory we can get #UD if CR4.FRED gets
+ * cleared, but in practice if that were the case we wouldn't be here
+ * handling the result.
+ */
+ default:
+ return false;
+ }
+
+ this_cpu(last_extable_addr) = regs->rip;
+
+ /*
+ * Everything else is recoverable, one way or another.
+ *
+ * If an NMI was taken in guest context and the ERETU faulted, NMIs will
+ * still be blocked. Therefore we copy the interrupted frame's NMI status
+ * into our own, and must ERETS as part of recovery.
+ */
+ regs->fred_ss.nmi = gregs->fred_ss.nmi;
+
+ /*
+ * Next, copy the exception information from the current frame back onto
+ * the interrupted frame, preserving the interrupted frame's %cs and %ss.
+ */
+ *cpu_regs_fred_info(regs) = *cpu_regs_fred_info(gregs);
+ gregs->ssx = (regs->ssx & ~0xffff) | gregs->ss;
+ gregs->csx = (regs->csx & ~0xffff) | gregs->cs;
+ gregs->error_code = regs->error_code;
+ gregs->entry_vector = regs->entry_vector;
+
+ fixup_exception_return(regs, recover, 0);
+
+ return true;
+}
+
+void nocall eretu(void);
+
void asmlinkage entry_from_xen(struct cpu_user_regs *regs)
{
struct fred_info *fi = cpu_regs_fred_info(regs);
@@ -2383,6 +2490,14 @@ void asmlinkage entry_from_xen(struct cpu_user_regs *regs)
if ( regs->eflags & X86_EFLAGS_IF )
local_irq_enable();
+ /*
+ * An event taken at the ERETU instruction may be because of guest state
+ * and in that case will need special handling.
+ */
+ if ( unlikely(regs->rip == (unsigned long)eretu) &&
+ handle_eretu_event(regs) )
+ return;
+
switch ( type )
{
case X86_ET_HW_EXC:
diff --git a/xen/arch/x86/x86_64/entry-fred.S b/xen/arch/x86/x86_64/entry-fred.S
index a1ff9a4a9747..2fa57beb930c 100644
--- a/xen/arch/x86/x86_64/entry-fred.S
+++ b/xen/arch/x86/x86_64/entry-fred.S
@@ -27,9 +27,22 @@ END(entry_FRED_R3)
FUNC(eretu_exit_to_guest)
POP_GPRS
+
+ /*
+ * Exceptions here are handled by redirecting either to
+ * entry_FRED_R3() (for an error to be passed to the guest), or to
+ * eretu_error_dom_crash() (for a Xen error handling guest state).
+ */
+LABEL(eretu, 0)
eretu
END(eretu_exit_to_guest)
+FUNC(eretu_error_dom_crash)
+ PUSH_AND_CLEAR_GPRS
+ sti
+ call asm_domain_crash_synchronous /* Does not return */
+END(eretu_error_dom_crash)
+
/* The Ring0 entrypoint is at Ring3 + 0x100. */
.org entry_FRED_R3 + 0x100, 0xcc
--
2.39.5
^ permalink raw reply related [flat|nested] 40+ messages in thread
* Re: [PATCH v3 20/22] x86/pv: ERETU error handling
2025-10-03 22:53 ` [PATCH v3 20/22] x86/pv: ERETU error handling Andrew Cooper
@ 2025-10-08 12:36 ` Jan Beulich
0 siblings, 0 replies; 40+ messages in thread
From: Jan Beulich @ 2025-10-08 12:36 UTC (permalink / raw)
To: Andrew Cooper; +Cc: Roger Pau Monné, Xen-devel
On 04.10.2025 00:53, Andrew Cooper wrote:
> ERETU can fault for guest reasons, and like IRET needs special handling to
> forward the error into the guest.
>
> As this is largely written in C, take the opportunity to better classify the
> sources of error, and in particular, not forward errors that are actually
> Xen's fault into the guest, opting for a domain crash instead.
>
> Because ERETU does not enable NMIs if it faults, a corner case exists if an
> NMI was taken while in guest context, and the ERETU back out faults. Recovery
> must involve an ERETS with the interrupted context's NMI flag.
>
> See the comments for full details.
>
> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
albeit ...
> --- a/xen/arch/x86/traps.c
> +++ b/xen/arch/x86/traps.c
> @@ -2345,6 +2345,113 @@ void asmlinkage entry_from_pv(struct cpu_user_regs *regs)
> fatal_trap(regs, false);
> }
>
> +void nocall eretu_error_dom_crash(void);
> +
> +/*
> + * Classify an event at the ERETU instruction, and handle if possible.
> + * Returns @true if handled, @false if the event should continue down the
> + * normal handlers.
> + */
> +static bool handle_eretu_event(struct cpu_user_regs *regs)
> +{
> + unsigned long recover;
> +
> + /*
> + * WARNING: The GPRs in gregs overlap with regs. Only gregs->error_code
> + * and later are legitimate to access.
> + */
> + struct cpu_user_regs *gregs =
> + _p(regs->rsp - offsetof(struct cpu_user_regs, error_code));
> +
> + /*
> + * The asynchronous or fatal events (INTR, NMI, #MC, #DF) have been dealt
> + * with, meaning we only have synchronous ones to consider. Anything
> + * which isn't a hardware exception wants handling normally.
> + */
> + if ( regs->fred_ss.type != X86_ET_HW_EXC )
> + return false;
> +
> + /*
> + * Guests are permitted to write non-present GDT/LDT entries. Therefore
> + * #NP[sel] (%cs) and #SS[sel] (%ss) must be handled as guest errors. The
> + * only other source of #SS is for a bad %ss-relative memory access in
> + * Xen, and if the stack is that bad, we'll have escalated to #DF.
> + *
> + * #PF can happen from ERETU accessing the GDT/LDT. Xen may translate
> + * these into #GP for the guest, so must be handled as guest errors. In
> + * theory we can get #PF for a bad instruction fetch or bad stack access,
> + * but either of these will be fatal and not end up here.
> + */
> + switch ( regs->fred_ss.vector )
> + {
> + case X86_EXC_GP:
> + /*
> + * #GP[0] can occur because of a NULL %cs or %ss (which are a guest
> + * error), but some #GP[0]'s are errors in Xen (ERETU at SL != 0), or
> + * errors of Xen handling guest state (bad metadata). These magic
> + * numbers came from the FRED Spec; they check that ERETU is trying to
> + * return to Ring 3, and that reserved or inapplicable bits are 0.
> + */
> + if ( regs->error_code == 0 && (gregs->cs & ~3) && (gregs->ss & ~3) &&
> + (regs->fred_cs.sl != 0 ||
> + (gregs->csx & 0xffffffffffff0003UL) != 3 ||
> + (gregs->rflags & 0xffffffffffc2b02aUL) != 2 ||
> + (gregs->ssx & 0xfff80003UL) != 3) )
... I continue to dislike these uses of literal numbers.
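For example (mask names invented here purely for illustration; the values
are the ones the patch already uses):

    #define ERETU_CSX_RSVD_RPL   0xffffffffffff0003UL
    #define ERETU_RFLAGS_RSVD    0xffffffffffc2b02aUL
    #define ERETU_SSX_RSVD_RPL   0xfff80003UL

    ...
    (regs->fred_cs.sl != 0 ||
     (gregs->csx & ERETU_CSX_RSVD_RPL) != 3 ||
     (gregs->rflags & ERETU_RFLAGS_RSVD) != 2 ||
     (gregs->ssx & ERETU_SSX_RSVD_RPL) != 3) )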
Jan
^ permalink raw reply [flat|nested] 40+ messages in thread
* [PATCH v3 21/22] x86/pv: System call handling in FRED mode
2025-10-03 22:53 [PATCH v3 for-4.21 00/22] x86: FRED support Andrew Cooper
` (19 preceding siblings ...)
2025-10-03 22:53 ` [PATCH v3 20/22] x86/pv: ERETU error handling Andrew Cooper
@ 2025-10-03 22:53 ` Andrew Cooper
2025-10-08 13:45 ` Jan Beulich
2025-10-03 22:53 ` [PATCH v3 22/22] x86: Clamp reserved bits in eflags more aggressively Andrew Cooper
2025-10-17 13:24 ` [PATCH v3 for-4.21 00/22] x86: FRED support Oleksii Kurochko
22 siblings, 1 reply; 40+ messages in thread
From: Andrew Cooper @ 2025-10-03 22:53 UTC (permalink / raw)
To: Xen-devel; +Cc: Andrew Cooper, Jan Beulich, Roger Pau Monné
Under FRED, entry_from_pv() handles everything, even system calls. This means
more of our logic is written in C now, rather than assembly.
In order to facilitate this, introduce pv_inject_callback(), which reuses
struct trap_bounce infrastructure to inject the syscall/sysenter callbacks.
This in turn requires some !PV compatibility for pv_inject_callback() and
pv_hypercall() which can both be ASSERT_UNREACHABLE().
For each of INT $N, SYSCALL and SYSENTER, FRED gives us interrupted context
which was previously lost. As the guest can't see FRED, Xen has to lose state
in the same way to maintain the prior behaviour.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
CC: Jan Beulich <JBeulich@suse.com>
CC: Roger Pau Monné <roger.pau@citrix.com>
v3:
* Simplify DCE handling.
* Add ASSERT_UNREACHABLE() to pv_inject_callback().
* Adjust comment for X86_ET_SW_INT
v2:
* New
---
xen/arch/x86/include/asm/domain.h | 2 +
xen/arch/x86/include/asm/hypercall.h | 2 -
xen/arch/x86/pv/traps.c | 39 ++++++++++
xen/arch/x86/traps.c | 110 +++++++++++++++++++++++++++
4 files changed, 151 insertions(+), 2 deletions(-)
diff --git a/xen/arch/x86/include/asm/domain.h b/xen/arch/x86/include/asm/domain.h
index 5df8c7825333..828f42c3e448 100644
--- a/xen/arch/x86/include/asm/domain.h
+++ b/xen/arch/x86/include/asm/domain.h
@@ -710,6 +710,8 @@ void arch_vcpu_regs_init(struct vcpu *v);
struct vcpu_hvm_context;
int arch_set_info_hvm_guest(struct vcpu *v, const struct vcpu_hvm_context *ctx);
+void pv_inject_callback(unsigned int type);
+
#ifdef CONFIG_PV
void pv_inject_event(const struct x86_event *event);
#else
diff --git a/xen/arch/x86/include/asm/hypercall.h b/xen/arch/x86/include/asm/hypercall.h
index f6e9e2313b3c..ded3c24d40e2 100644
--- a/xen/arch/x86/include/asm/hypercall.h
+++ b/xen/arch/x86/include/asm/hypercall.h
@@ -18,9 +18,7 @@
#define __HYPERVISOR_paging_domctl_cont __HYPERVISOR_arch_1
-#ifdef CONFIG_PV
void pv_hypercall(struct cpu_user_regs *regs);
-#endif
void pv_ring1_init_hypercall_page(void *ptr);
void pv_ring3_init_hypercall_page(void *ptr);
diff --git a/xen/arch/x86/pv/traps.c b/xen/arch/x86/pv/traps.c
index c3c0976c440f..00de03412639 100644
--- a/xen/arch/x86/pv/traps.c
+++ b/xen/arch/x86/pv/traps.c
@@ -19,6 +19,8 @@
#include <asm/shared.h>
#include <asm/traps.h>
+#include <public/callback.h>
+
void pv_inject_event(const struct x86_event *event)
{
struct vcpu *curr = current;
@@ -95,6 +97,43 @@ void pv_inject_event(const struct x86_event *event)
}
}
+void pv_inject_callback(unsigned int type)
+{
+ struct vcpu *curr = current;
+ struct trap_bounce *tb = &curr->arch.pv.trap_bounce;
+ unsigned long rip;
+ bool irq;
+
+ ASSERT(is_pv_64bit_vcpu(curr));
+
+ switch ( type )
+ {
+ case CALLBACKTYPE_syscall:
+ rip = curr->arch.pv.syscall_callback_eip;
+ irq = curr->arch.pv.vgc_flags & VGCF_syscall_disables_events;
+ break;
+
+ case CALLBACKTYPE_syscall32:
+ rip = curr->arch.pv.syscall32_callback_eip;
+ irq = curr->arch.pv.syscall32_disables_events;
+ break;
+
+ case CALLBACKTYPE_sysenter:
+ rip = curr->arch.pv.sysenter_callback_eip;
+ irq = curr->arch.pv.sysenter_disables_events;
+ break;
+
+ default:
+ ASSERT_UNREACHABLE();
+ rip = 0;
+ irq = false;
+ break;
+ }
+
+ tb->flags = TBF_EXCEPTION | (irq ? TBF_INTERRUPT : 0);
+ tb->eip = rip;
+}
+
/*
* Called from asm to set up the MCE trapbounce info.
* Returns false if no callback is set up, else true.
diff --git a/xen/arch/x86/traps.c b/xen/arch/x86/traps.c
index 955cff32d75f..5f89928d8128 100644
--- a/xen/arch/x86/traps.c
+++ b/xen/arch/x86/traps.c
@@ -18,6 +18,7 @@
#include <xen/delay.h>
#include <xen/domain_page.h>
#include <xen/guest_access.h>
+#include <xen/hypercall.h>
#include <xen/init.h>
#include <xen/mm.h>
#include <xen/paging.h>
@@ -52,6 +53,8 @@
#include <asm/uaccess.h>
#include <asm/xenoprof.h>
+#include <public/callback.h>
+
/*
* opt_nmi: one of 'ignore', 'dom0', or 'fatal'.
* fatal: Xen prints diagnostic message and then hangs.
@@ -2267,6 +2270,7 @@ void asmlinkage check_ist_exit(const struct cpu_user_regs *regs, bool ist_exit)
void asmlinkage entry_from_pv(struct cpu_user_regs *regs)
{
struct fred_info *fi = cpu_regs_fred_info(regs);
+ struct vcpu *curr = current;
uint8_t type = regs->fred_ss.type;
uint8_t vec = regs->fred_ss.vector;
@@ -2306,6 +2310,30 @@ void asmlinkage entry_from_pv(struct cpu_user_regs *regs)
switch ( type )
{
+ case X86_ET_SW_INT:
+ /*
+ * For better or worse, Xen writes IDT vectors 3 and 4 with DPL3 (so
+ * INT3/INTO work), making INT $3/4 indistinguishable, and the guest
+ * choice of DPL for these vectors is ignored.
+ *
+ * Have them fall through into X86_ET_HW_EXC, as #BP in particular
+ * needs handling by do_int3() in case an external debugger is
+ * attached.
+ */
+ if ( vec != X86_EXC_BP && vec != X86_EXC_OF )
+ {
+ const struct trap_info *ti = &curr->arch.pv.trap_ctxt[vec];
+
+ if ( permit_softint(TI_GET_DPL(ti), curr, regs) )
+ pv_inject_sw_interrupt(vec);
+ else
+ {
+ regs->rip -= 2;
+ pv_inject_hw_exception(X86_EXC_GP, (vec << 3) | X86_XEC_IDT);
+ }
+ break;
+ }
+ fallthrough;
case X86_ET_HW_EXC:
case X86_ET_PRIV_SW_EXC:
case X86_ET_SW_EXC:
@@ -2335,6 +2363,88 @@ void asmlinkage entry_from_pv(struct cpu_user_regs *regs)
}
break;
+ case X86_ET_OTHER:
+ switch ( regs->fred_ss.vector )
+ {
+ case 1: /* SYSCALL */
+ {
+ /*
+ * FRED delivery preserves the interrupted %cs/%ss, but previously
+ * SYSCALL lost the interrupted selectors, and SYSRET forced the
+ * use of the ones in MSR_STAR.
+ *
+ * The guest isn't aware of FRED, so recreate the legacy
+ * behaviour, including the guess of instruction length for
+ * faults.
+ *
+ * The non-FRED SYSCALL path sets TRAP_syscall in entry_vector to
+ * signal that SYSRET can be used, but this isn't relevant in FRED
+ * mode.
+ *
+ * When setting the selectors, clear all upper metadata again for
+ * backwards compatibility. In particular fred_ss.swint becomes
+ * pend_DB on ERETx, and nothing else in the pv_hypercall() would
+ * clean up.
+ */
+ bool l = regs->fred_ss.l;
+
+ regs->ssx = l ? FLAT_KERNEL_SS : FLAT_USER_SS32;
+ regs->csx = l ? FLAT_KERNEL_CS64 : FLAT_USER_CS32;
+
+ if ( guest_kernel_mode(curr, regs) )
+ pv_hypercall(regs);
+ else if ( (l ? curr->arch.pv.syscall_callback_eip
+ : curr->arch.pv.syscall32_callback_eip) == 0 )
+ {
+ regs->rip -= 2;
+ pv_inject_hw_exception(X86_EXC_UD, X86_EVENT_NO_EC);
+ }
+ else
+ {
+ /*
+ * The PV ABI, given no virtual SYSCALL_MASK, hardcodes that
+ * DF is cleared. Other flags are handled in the same way as
+ * interrupts and exceptions in create_bounce_frame().
+ */
+ regs->eflags &= ~X86_EFLAGS_DF;
+ pv_inject_callback(l ? CALLBACKTYPE_syscall
+ : CALLBACKTYPE_syscall32);
+ }
+ break;
+ }
+
+ case 2: /* SYSENTER */
+ /*
+ * FRED delivery preserves the interrupted state, but previously
+ * SYSENTER discarded almost everything.
+ *
+ * The guest isn't aware of FRED, so recreate the legacy
+ * behaviour, including the guess of instruction length for
+ * faults.
+ *
+ * When setting the selectors, clear all upper metadata. In
+ * particular fred_ss.swint becomes pend_DB on ERETx.
+ */
+ regs->ssx = FLAT_USER_SS;
+ regs->rsp = 0;
+ regs->eflags &= ~(X86_EFLAGS_VM | X86_EFLAGS_IF);
+ regs->csx = 3;
+ regs->rip = 0;
+
+ if ( !curr->arch.pv.sysenter_callback_eip )
+ {
+ regs->rip -= 2;
+ pv_inject_hw_exception(X86_EXC_GP, 0);
+ }
+ else
+ pv_inject_callback(CALLBACKTYPE_sysenter);
+ break;
+
+ default:
+ goto fatal;
+ }
+ break;
+
default:
goto fatal;
}
--
2.39.5
^ permalink raw reply related [flat|nested] 40+ messages in thread
* Re: [PATCH v3 21/22] x86/pv: System call handling in FRED mode
2025-10-03 22:53 ` [PATCH v3 21/22] x86/pv: System call handling in FRED mode Andrew Cooper
@ 2025-10-08 13:45 ` Jan Beulich
0 siblings, 0 replies; 40+ messages in thread
From: Jan Beulich @ 2025-10-08 13:45 UTC (permalink / raw)
To: Andrew Cooper; +Cc: Roger Pau Monné, Xen-devel
On 04.10.2025 00:53, Andrew Cooper wrote:
> Under FRED, entry_from_pv() handles everything, even system calls. This means
> more of our logic is written in C now, rather than assembly.
>
> In order to facilitate this, introduce pv_inject_callback(), which reuses
> struct trap_bounce infrastructure to inject the syscall/sysenter callbacks.
> This in turn requires some !PV compatibility for pv_inject_callback() and
> pv_hypercall() which can both be ASSERT_UNREACHABLE().
>
> For each of INT $N, SYSCALL and SYSENTER, FRED gives us interrupted context
> which was previously lost. As the guest can't see FRED, Xen has to lose state
> in the same way to maintain the prior behaviour.
>
> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
^ permalink raw reply [flat|nested] 40+ messages in thread
* [PATCH v3 22/22] x86: Clamp reserved bits in eflags more aggressively
2025-10-03 22:53 [PATCH v3 for-4.21 00/22] x86: FRED support Andrew Cooper
` (20 preceding siblings ...)
2025-10-03 22:53 ` [PATCH v3 21/22] x86/pv: System call handling in FRED mode Andrew Cooper
@ 2025-10-03 22:53 ` Andrew Cooper
2025-10-08 13:50 ` Jan Beulich
2025-10-17 13:24 ` [PATCH v3 for-4.21 00/22] x86: FRED support Oleksii Kurochko
22 siblings, 1 reply; 40+ messages in thread
From: Andrew Cooper @ 2025-10-03 22:53 UTC (permalink / raw)
To: Xen-devel; +Cc: Andrew Cooper, Jan Beulich, Roger Pau Monné
ERETU, unlike IRET, requires the sticky-1 bit (bit 2) be set, and reserved
bits to be clear. Notably this means that dom0_construct() must set
X86_EFLAGS_MBS it in order for a PV dom0 to start.
Xen has been overly lax with reserved bit handling. Adjust
arch_set_info_guest*() and hypercall_iret() which consume flags to clamp the
reserved bits for all guest types.
This is a minor ABI change, but by the same argument as commit
9f892f84c279 ("x86/domctl: Stop using XLAT_cpu_user_regs()"), the reserved
bits would get clamped naturally by hardware when the vCPU is run.
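
As a concrete illustration, using the X86_EFLAGS_ALL/X86_EFLAGS_MBS values
from this patch:

    uint64_t rflags = ~0UL;          /* from an unaware/hostile toolstack */

    rflags = (rflags & X86_EFLAGS_ALL) | X86_EFLAGS_MBS;

    /* rflags == 0x3f7fd7: bits 3, 5, 15 and 22+ cleared, bit 1 forced set */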
This allows PV guests to start when Xen is using FRED mode.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
CC: Jan Beulich <JBeulich@suse.com>
CC: Roger Pau Monné <roger.pau@citrix.com>
v3:
* Rewrite the commit message.
v2:
* New
The handling of VM is complicated.
It turns out that it's simply ignored by IRET in Long Mode (i.e. clearing it in
commit 0e47f92b0725 ("x86: force EFLAGS.IF on when exiting to PV guests")
wasn't actually necessary) but ERETU does care.
But, it's unclear how to handle this in in arch_set_info(). We must preserve
it for HVM guests (which can use vm86 mode). PV32 has special handling but
only in hypercall_iret(), not in arch_set_info().
---
xen/arch/x86/domain.c | 4 ++--
xen/arch/x86/hvm/domain.c | 4 ++--
xen/arch/x86/include/asm/x86-defns.h | 7 +++++++
xen/arch/x86/pv/dom0_build.c | 2 +-
xen/arch/x86/pv/iret.c | 8 +++++---
5 files changed, 17 insertions(+), 8 deletions(-)
diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
index ce08f91be3af..423d0a6af4f3 100644
--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -1273,7 +1273,7 @@ int arch_set_info_guest(
v->arch.user_regs.rax = c.nat->user_regs.rax;
v->arch.user_regs.rip = c.nat->user_regs.rip;
v->arch.user_regs.cs = c.nat->user_regs.cs;
- v->arch.user_regs.rflags = c.nat->user_regs.rflags;
+ v->arch.user_regs.rflags = (c.nat->user_regs.rflags & X86_EFLAGS_ALL) | X86_EFLAGS_MBS;
v->arch.user_regs.rsp = c.nat->user_regs.rsp;
v->arch.user_regs.ss = c.nat->user_regs.ss;
v->arch.pv.es = c.nat->user_regs.es;
@@ -1297,7 +1297,7 @@ int arch_set_info_guest(
v->arch.user_regs.eax = c.cmp->user_regs.eax;
v->arch.user_regs.eip = c.cmp->user_regs.eip;
v->arch.user_regs.cs = c.cmp->user_regs.cs;
- v->arch.user_regs.eflags = c.cmp->user_regs.eflags;
+ v->arch.user_regs.eflags = (c.cmp->user_regs.eflags & X86_EFLAGS_ALL) | X86_EFLAGS_MBS;
v->arch.user_regs.esp = c.cmp->user_regs.esp;
v->arch.user_regs.ss = c.cmp->user_regs.ss;
v->arch.pv.es = c.cmp->user_regs.es;
diff --git a/xen/arch/x86/hvm/domain.c b/xen/arch/x86/hvm/domain.c
index 048f29ae4911..1e874d598952 100644
--- a/xen/arch/x86/hvm/domain.c
+++ b/xen/arch/x86/hvm/domain.c
@@ -194,7 +194,7 @@ int arch_set_info_hvm_guest(struct vcpu *v, const struct vcpu_hvm_context *ctx)
uregs->rsi = regs->esi;
uregs->rdi = regs->edi;
uregs->rip = regs->eip;
- uregs->rflags = regs->eflags;
+ uregs->rflags = (regs->eflags & X86_EFLAGS_ALL) | X86_EFLAGS_MBS;
v->arch.hvm.guest_cr[0] = regs->cr0;
v->arch.hvm.guest_cr[3] = regs->cr3;
@@ -245,7 +245,7 @@ int arch_set_info_hvm_guest(struct vcpu *v, const struct vcpu_hvm_context *ctx)
uregs->rsi = regs->rsi;
uregs->rdi = regs->rdi;
uregs->rip = regs->rip;
- uregs->rflags = regs->rflags;
+ uregs->rflags = (regs->rflags & X86_EFLAGS_ALL) | X86_EFLAGS_MBS;
v->arch.hvm.guest_cr[0] = regs->cr0;
v->arch.hvm.guest_cr[3] = regs->cr3;
diff --git a/xen/arch/x86/include/asm/x86-defns.h b/xen/arch/x86/include/asm/x86-defns.h
index 0a0ba83de786..edeb0b4ff95a 100644
--- a/xen/arch/x86/include/asm/x86-defns.h
+++ b/xen/arch/x86/include/asm/x86-defns.h
@@ -27,6 +27,13 @@
(X86_EFLAGS_CF | X86_EFLAGS_PF | X86_EFLAGS_AF | \
X86_EFLAGS_ZF | X86_EFLAGS_SF | X86_EFLAGS_OF)
+#define X86_EFLAGS_ALL \
+ (X86_EFLAGS_ARITH_MASK | X86_EFLAGS_TF | X86_EFLAGS_IF | \
+ X86_EFLAGS_DF | X86_EFLAGS_OF | X86_EFLAGS_IOPL | \
+ X86_EFLAGS_NT | X86_EFLAGS_RF | X86_EFLAGS_VM | \
+ X86_EFLAGS_AC | X86_EFLAGS_VIF | X86_EFLAGS_VIP | \
+ X86_EFLAGS_ID)
+
/*
* Intel CPU flags in CR0
*/
diff --git a/xen/arch/x86/pv/dom0_build.c b/xen/arch/x86/pv/dom0_build.c
index 21158ce1812e..f9bbbea2ff70 100644
--- a/xen/arch/x86/pv/dom0_build.c
+++ b/xen/arch/x86/pv/dom0_build.c
@@ -1021,7 +1021,7 @@ static int __init dom0_construct(const struct boot_domain *bd)
regs->rip = parms.virt_entry;
regs->rsp = vstack_end;
regs->rsi = vstartinfo_start;
- regs->eflags = X86_EFLAGS_IF;
+ regs->eflags = X86_EFLAGS_IF | X86_EFLAGS_MBS;
/*
* We don't call arch_set_info_guest(), so some initialisation needs doing
diff --git a/xen/arch/x86/pv/iret.c b/xen/arch/x86/pv/iret.c
index d3a1fb2c685b..39ce316b8d91 100644
--- a/xen/arch/x86/pv/iret.c
+++ b/xen/arch/x86/pv/iret.c
@@ -80,8 +80,9 @@ long do_iret(void)
regs->rip = iret_saved.rip;
regs->cs = iret_saved.cs | 3; /* force guest privilege */
- regs->rflags = ((iret_saved.rflags & ~(X86_EFLAGS_IOPL|X86_EFLAGS_VM))
- | X86_EFLAGS_IF);
+ regs->rflags = ((iret_saved.rflags & X86_EFLAGS_ALL &
+ ~(X86_EFLAGS_IOPL | X86_EFLAGS_VM)) |
+ X86_EFLAGS_IF | X86_EFLAGS_MBS);
regs->rsp = iret_saved.rsp;
regs->ss = iret_saved.ss | 3; /* force guest privilege */
@@ -143,7 +144,8 @@ int compat_iret(void)
if ( VM_ASSIST(v->domain, architectural_iopl) )
v->arch.pv.iopl = eflags & X86_EFLAGS_IOPL;
- regs->eflags = (eflags & ~X86_EFLAGS_IOPL) | X86_EFLAGS_IF;
+ regs->eflags = ((eflags & X86_EFLAGS_ALL & ~X86_EFLAGS_IOPL) |
+ X86_EFLAGS_IF | X86_EFLAGS_MBS);
if ( unlikely(eflags & X86_EFLAGS_VM) )
{
--
2.39.5
^ permalink raw reply related [flat|nested] 40+ messages in thread
* Re: [PATCH v3 22/22] x86: Clamp reserved bits in eflags more aggressively
2025-10-03 22:53 ` [PATCH v3 22/22] x86: Clamp reserved bits in eflags more aggressively Andrew Cooper
@ 2025-10-08 13:50 ` Jan Beulich
0 siblings, 0 replies; 40+ messages in thread
From: Jan Beulich @ 2025-10-08 13:50 UTC (permalink / raw)
To: Andrew Cooper; +Cc: Roger Pau Monné, Xen-devel
On 04.10.2025 00:53, Andrew Cooper wrote:
> ERETU, unlike IRET, requires the sticky-1 bit (bit 2) be set, and reserved
> bits to be clear. Notably this means that dom0_construct() must set
> X86_EFLAGS_MBS it in order for a PV dom0 to start.
Nit: Seemingly stray "it".
> Xen has been overly lax with reserved bit handling. Adjust
> arch_set_info_guest*() and hypercall_iret() which consume flags to clamp the
> reserved bits for all guest types.
>
> This is a minor ABI change, but by the same argument as commit
> 9f892f84c279 ("x86/domctl: Stop using XLAT_cpu_user_regs()"), the reserved
> bits would get clamped naturally by hardware when the vCPU is run.
>
> This allows PV guests to start when Xen is using FRED mode.
>
> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
> ---
> CC: Jan Beulich <JBeulich@suse.com>
> CC: Roger Pau Monné <roger.pau@citrix.com>
>
> v3:
> * Rewrite the commit message.
> v2:
> * New
>
> The handling of VM is complicated.
>
> It turns out that it's simply ignored by IRET in Long Mode (i.e. clearing it in
> commit 0e47f92b0725 ("x86: force EFLAGS.IF on when exiting to PV guests")
> wasn't actually necessary) but ERETU does care.
>
> But, it's unclear how to handle this in arch_set_info(). We must preserve
> it for HVM guests (which can use vm86 mode). PV32 has special handling but
> only in hypercall_iret(), not in arch_set_info().
And what's wrong with adding special handling? (Apart from this the patch looks
good to me.)
Jan
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH v3 for-4.21 00/22] x86: FRED support
2025-10-03 22:53 [PATCH v3 for-4.21 00/22] x86: FRED support Andrew Cooper
` (21 preceding siblings ...)
2025-10-03 22:53 ` [PATCH v3 22/22] x86: Clamp reserved bits in eflags more aggressively Andrew Cooper
@ 2025-10-17 13:24 ` Oleksii Kurochko
22 siblings, 0 replies; 40+ messages in thread
From: Oleksii Kurochko @ 2025-10-17 13:24 UTC (permalink / raw)
To: Andrew Cooper, Xen-devel; +Cc: Jan Beulich, Roger Pau Monné
On 10/4/25 12:53 AM, Andrew Cooper wrote:
> This is the combined MSR cleanup and FRED series. Some patches of both v2's
> have already been committed.
>
> I have moved the MSR_IMM patch out. It's not strictly needed for FRED, and
> really needs to go behind Jan's patch to use mergable sections for
> altinstructions which is definitely not making 4.21 at this point.
>
> I almost got access to a piece of real hardware in time, but that fell through
> at the last minute.
>
> In terms of timing, I know we're getting very tight for 4.21, but there has
> been an awful lot of disruption with travel and holidays recently. Half the
> patches are already acked/reviewed, but can't easily go in due to logical
> dependencies (I suspect patches 15-17 could be committed right away as they're
> pretty independent.)
>
> Therefore I'd like to ask Oleksii whether the nominal release ack still
> stands.
If you're asking about patches 15-17, then they could still be merged now:
Release-Acked-by: Oleksii Kurochko <oleksii.kurochko@gmail.com>
>
> https://gitlab.com/xen-project/hardware/xen-staging/-/pipelines/2079705485
Btw, is it expected that some tests failed?
Thanks.
~ Oleksii
>
> Andrew Cooper (22):
> x86/msr: Change rdmsr() to have normal API
> x86/msr: Change wrmsr() to take a single parameter
> x86/fsgsbase: Split out __{rd,wr}gs_shadow() helpers
> x86/fsgsbase: Update fs/gs helpers to use wrmsrns()
> x86/fsgsbase: Improve code generation in read_registers()
> x86/boot: Use RSTORSSP to establish SSP
> x86/traps: Alter switch_stack_and_jump() for FRED mode
> x86/traps: Skip Supervisor Shadow Stack tokens in FRED mode
> x86/traps: Make an IDT-specific #DB helper
> x86/traps: Make an IDT-specific #PF helper
> x86/fsgsbase: Make gskern accesses safe under FRED
> x86/traps: Introduce FRED entrypoints
> x86/traps: Enable FRED when requested
> x86/pv: Deduplicate is_canonical_address() in do_set_segment_base()
> x86/entry: Alter how IRET faults are recognised
> x86/entry: Drop the pre exception table infrastructure
> x86/entry: Rework the comment about SYSCALL and DF
> x86/pv: Adjust GS handling for FRED mode
> x86/pv: Guest exception handling in FRED mode
> x86/pv: ERETU error handling
> x86/pv: System call handling in FRED mode
> x86: Clamp reserved bits in eflags more aggressively
>
> xen/arch/x86/acpi/cpufreq/powernow.c | 12 +-
> xen/arch/x86/boot/x86_64.S | 23 +-
> xen/arch/x86/cpu/amd.c | 8 +-
> xen/arch/x86/cpu/common.c | 20 +-
> xen/arch/x86/cpu/intel.c | 30 +-
> xen/arch/x86/domain.c | 34 +-
> xen/arch/x86/extable.c | 14 -
> xen/arch/x86/genapic/x2apic.c | 5 +-
> xen/arch/x86/hvm/domain.c | 4 +-
> xen/arch/x86/hvm/vmx/vmcs.c | 32 +-
> xen/arch/x86/hvm/vmx/vmx.c | 4 +-
> xen/arch/x86/include/asm/asm_defns.h | 76 +++-
> xen/arch/x86/include/asm/current.h | 9 +-
> xen/arch/x86/include/asm/domain.h | 2 +
> xen/arch/x86/include/asm/fsgsbase.h | 66 +--
> xen/arch/x86/include/asm/hypercall.h | 2 -
> xen/arch/x86/include/asm/msr.h | 48 ++-
> xen/arch/x86/include/asm/prot-key.h | 6 +-
> xen/arch/x86/include/asm/traps.h | 2 +
> xen/arch/x86/include/asm/uaccess.h | 2 -
> xen/arch/x86/include/asm/x86-defns.h | 7 +
> xen/arch/x86/mm.c | 12 +-
> xen/arch/x86/nmi.c | 18 +-
> xen/arch/x86/oprofile/op_model_athlon.c | 2 +-
> xen/arch/x86/pv/dom0_build.c | 2 +-
> xen/arch/x86/pv/domain.c | 22 +-
> xen/arch/x86/pv/iret.c | 8 +-
> xen/arch/x86/pv/misc-hypercalls.c | 42 +-
> xen/arch/x86/pv/traps.c | 39 ++
> xen/arch/x86/setup.c | 33 +-
> xen/arch/x86/traps-setup.c | 83 +++-
> xen/arch/x86/traps.c | 519 ++++++++++++++++++++++--
> xen/arch/x86/tsx.c | 27 +-
> xen/arch/x86/x86_64/Makefile | 1 +
> xen/arch/x86/x86_64/compat/entry.S | 3 +-
> xen/arch/x86/x86_64/entry-fred.S | 57 +++
> xen/arch/x86/x86_64/entry.S | 46 ++-
> xen/arch/x86/xen.lds.S | 5 -
> 38 files changed, 1076 insertions(+), 249 deletions(-)
> create mode 100644 xen/arch/x86/x86_64/entry-fred.S
>
^ permalink raw reply [flat|nested] 40+ messages in thread