* [PATCH v4 00/14] x86: FRED support
@ 2026-02-27 23:16 Andrew Cooper
2026-02-27 23:16 ` [PATCH v4 01/14] x86/pv: Don't assume that INT $imm8 instructions are two bytes long Andrew Cooper
` (13 more replies)
0 siblings, 14 replies; 43+ messages in thread
From: Andrew Cooper @ 2026-02-27 23:16 UTC (permalink / raw)
To: Xen-devel; +Cc: Andrew Cooper, Jan Beulich, Roger Pau Monné
This version of the series has finally run on real hardware, Intel
PantherLake. Notable changes:
* Rework TSS setup, given an unpleasant discovery about VT-x.
* Fix an INT $N emulation bug in IDT mode, discovered by XTF tests written to
check that FRED behaves the same.
* Document aspects of the PV ABI now that they've been thoroughly reverse
engineered.
By the end of patch 6, PVH dom0 works.
By the end of the series, PV dom0 works.
https://gitlab.com/xen-project/hardware/xen-staging/-/pipelines/2354867216
Andrew Cooper (14):
x86/pv: Don't assume that INT $imm8 instructions are two bytes long
docs/guest-guide: Describe the PV traps and entrypoints ABI
x86/boot: Move gdt_l1e caching out of traps_init()
x86/boot: Document the ordering dependency of _svm_cpu_up()
x86/traps: Move traps_init() earlier on boot
x86/traps: Don't configure Supervisor Shadow Stack tokens in FRED mode
x86/traps: Introduce FRED entrypoints
x86/traps: Enable FRED when requested
x86/pv: Adjust GS handling for FRED mode
x86/pv: Guest exception handling in FRED mode
x86/pv: ERETU error handling
x86/pv: System call handling in FRED mode
x86: Clamp reserved bits in eflags more aggressively
x86/traps: Use fatal_trap() for #UD and #GP
docs/glossary.rst | 3 +
docs/guest-guide/x86/index.rst | 1 +
docs/guest-guide/x86/pv-traps.rst | 123 +++++++
xen/arch/x86/cpu/common.c | 4 +-
xen/arch/x86/domain.c | 22 +-
xen/arch/x86/hvm/domain.c | 4 +-
xen/arch/x86/hvm/svm/svm.c | 16 +
xen/arch/x86/include/asm/asm_defns.h | 63 ++++
xen/arch/x86/include/asm/current.h | 3 +
xen/arch/x86/include/asm/domain.h | 2 +
xen/arch/x86/include/asm/hypercall.h | 2 -
xen/arch/x86/include/asm/pv/traps.h | 2 +
xen/arch/x86/include/asm/traps.h | 2 +
xen/arch/x86/include/asm/x86-defns.h | 7 +
xen/arch/x86/mm.c | 14 +-
xen/arch/x86/pv/dom0_build.c | 2 +-
xen/arch/x86/pv/domain.c | 22 +-
xen/arch/x86/pv/emul-priv-op.c | 72 +++-
xen/arch/x86/pv/iret.c | 8 +-
xen/arch/x86/pv/misc-hypercalls.c | 16 +-
xen/arch/x86/pv/traps.c | 39 +++
xen/arch/x86/setup.c | 20 +-
xen/arch/x86/smpboot.c | 11 +
xen/arch/x86/traps-setup.c | 147 +++++++-
xen/arch/x86/traps.c | 486 ++++++++++++++++++++++++++-
xen/arch/x86/x86_64/Makefile | 1 +
xen/arch/x86/x86_64/entry-fred.S | 57 ++++
xen/arch/x86/x86_64/entry.S | 4 +-
28 files changed, 1091 insertions(+), 62 deletions(-)
create mode 100644 docs/guest-guide/x86/pv-traps.rst
create mode 100644 xen/arch/x86/x86_64/entry-fred.S
--
2.39.5
* [PATCH v4 01/14] x86/pv: Don't assume that INT $imm8 instructions are two bytes long
2026-02-27 23:16 [PATCH v4 00/14] x86: FRED support Andrew Cooper
@ 2026-02-27 23:16 ` Andrew Cooper
2026-03-02 11:03 ` Jan Beulich
2026-02-27 23:16 ` [PATCH v4 02/14] docs/guest-guide: Describe the PV traps and entrypoints ABI Andrew Cooper
` (12 subsequent siblings)
13 siblings, 1 reply; 43+ messages in thread
From: Andrew Cooper @ 2026-02-27 23:16 UTC (permalink / raw)
To: Xen-devel; +Cc: Andrew Cooper, Jan Beulich, Roger Pau Monné
For INT $N instructions (besides $0x80 for which there is a dedicated fast
path), handling is mostly fault-based because of DPL0 gates in the IDT. This
means that when the guest kernel allows the instruction too, Xen must
increment %rip to the end of the instruction before passing a trap to the
guest kernel.
When an INT $N instruction has a prefix, it's longer than two bytes, and Xen
will deliver the "trap" with %rip pointing into the middle of the instruction.
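For illustration (standard encodings, nothing specific to this patch): a plain
INT $imm8 is two bytes, while a segment-override-prefixed form is three:

  cd 20        int    $0x20      # 2 bytes
  2e cd 20     cs int $0x20      # 3 bytes; the prefix is redundant but legal

With the old "+= 2", the prefixed form gets its trap delivered with %rip
pointing at the final imm8 byte, rather than at the following instruction.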
Introduce a new pv_emulate_sw_interrupt() which uses x86_insn_length() to
determine the instruction length, rather than assuming two.
This is a change in behaviour for PV guests, but the prior behaviour cannot
reasonably be said to be intentional.
This change does not affect the INT $0x80 fastpath. Prefixed INT $N
instructions occur almost exclusively in test code or exploits, and INT $0x80
appears to be the only user-usable interrupt gate in contemporary PV guests.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
CC: Jan Beulich <JBeulich@suse.com>
CC: Roger Pau Monné <roger.pau@citrix.com>
v4:
* New
---
xen/arch/x86/include/asm/pv/traps.h | 2 ++
xen/arch/x86/pv/emul-priv-op.c | 48 +++++++++++++++++++++++++++++
xen/arch/x86/traps.c | 3 +-
3 files changed, 51 insertions(+), 2 deletions(-)
diff --git a/xen/arch/x86/include/asm/pv/traps.h b/xen/arch/x86/include/asm/pv/traps.h
index 8c201190923d..16e9a8d2aa3f 100644
--- a/xen/arch/x86/include/asm/pv/traps.h
+++ b/xen/arch/x86/include/asm/pv/traps.h
@@ -17,6 +17,7 @@
int pv_raise_nmi(struct vcpu *v);
int pv_emulate_privileged_op(struct cpu_user_regs *regs);
+void pv_emulate_sw_interrupt(struct cpu_user_regs *regs);
void pv_emulate_gate_op(struct cpu_user_regs *regs);
bool pv_emulate_invalid_op(struct cpu_user_regs *regs);
@@ -31,6 +32,7 @@ static inline bool pv_trap_callback_registered(const struct vcpu *v,
static inline int pv_raise_nmi(struct vcpu *v) { return -EOPNOTSUPP; }
static inline int pv_emulate_privileged_op(struct cpu_user_regs *regs) { return 0; }
+static inline void pv_emulate_sw_interrupt(struct cpu_user_regs *regs) {}
static inline void pv_emulate_gate_op(struct cpu_user_regs *regs) {}
static inline bool pv_emulate_invalid_op(struct cpu_user_regs *regs) { return true; }
diff --git a/xen/arch/x86/pv/emul-priv-op.c b/xen/arch/x86/pv/emul-priv-op.c
index a3c1fd12621d..87d3bbcf901f 100644
--- a/xen/arch/x86/pv/emul-priv-op.c
+++ b/xen/arch/x86/pv/emul-priv-op.c
@@ -8,6 +8,7 @@
*/
#include <xen/domain_page.h>
+#include <xen/err.h>
#include <xen/event.h>
#include <xen/guest_access.h>
#include <xen/hypercall.h>
@@ -1401,6 +1402,53 @@ int pv_emulate_privileged_op(struct cpu_user_regs *regs)
return 0;
}
+/*
+ * Hardware already decoded the INT $N instruction and determined that there
+ * was a DPL issue, hence the #GP. Xen has already determined that the guest
+ * kernel has permitted this software interrupt.
+ *
+ * All that is needed is the instruction length, to turn the fault into a
+ * trap. All errors are turned back into the original #GP, as that's the
+ * action that really happened.
+ */
+void pv_emulate_sw_interrupt(struct cpu_user_regs *regs)
+{
+ struct vcpu *curr = current;
+ struct domain *currd = curr->domain;
+ struct priv_op_ctxt ctxt = {
+ .ctxt.regs = regs,
+ .ctxt.lma = !is_pv_32bit_domain(currd),
+ };
+ struct x86_emulate_state *state;
+ uint8_t vector = regs->error_code >> 3;
+ unsigned int len, ar;
+
+ if ( !pv_emul_read_descriptor(regs->cs, curr, &ctxt.cs.base,
+ &ctxt.cs.limit, &ar, 1) ||
+ !(ar & _SEGMENT_S) ||
+ !(ar & _SEGMENT_P) ||
+ !(ar & _SEGMENT_CODE) )
+ goto error;
+
+ state = x86_decode_insn(&ctxt.ctxt, insn_fetch);
+ if ( IS_ERR_OR_NULL(state) )
+ goto error;
+
+ len = x86_insn_length(state, &ctxt.ctxt);
+ x86_emulate_free_state(state);
+
+ /* Note: Checked slightly late to simplify 'state' handling. */
+ if ( ctxt.ctxt.opcode != 0xcd /* INT $imm8 */ )
+ goto error;
+
+ regs->rip += len;
+ pv_inject_sw_interrupt(vector);
+ return;
+
+ error:
+ pv_inject_hw_exception(X86_EXC_GP, regs->entry_vector);
+}
+
/*
* Local variables:
* mode: C
diff --git a/xen/arch/x86/traps.c b/xen/arch/x86/traps.c
index 5feac88d6c0b..907fb4c186c0 100644
--- a/xen/arch/x86/traps.c
+++ b/xen/arch/x86/traps.c
@@ -1379,8 +1379,7 @@ void do_general_protection(struct cpu_user_regs *regs)
if ( permit_softint(TI_GET_DPL(ti), v, regs) )
{
- regs->rip += 2;
- pv_inject_sw_interrupt(vector);
+ pv_emulate_sw_interrupt(regs);
return;
}
}
--
2.39.5
* [PATCH v4 02/14] docs/guest-guide: Describe the PV traps and entrypoints ABI
2026-02-27 23:16 [PATCH v4 00/14] x86: FRED support Andrew Cooper
2026-02-27 23:16 ` [PATCH v4 01/14] x86/pv: Don't assume that INT $imm8 instructions are two bytes long Andrew Cooper
@ 2026-02-27 23:16 ` Andrew Cooper
2026-03-02 11:19 ` Jan Beulich
2026-02-27 23:16 ` [PATCH v4 03/14] x86/boot: Move gdt_l1e caching out of traps_init() Andrew Cooper
` (11 subsequent siblings)
13 siblings, 1 reply; 43+ messages in thread
From: Andrew Cooper @ 2026-02-27 23:16 UTC (permalink / raw)
To: Xen-devel; +Cc: Andrew Cooper, Jan Beulich, Roger Pau Monné
... seeing as I've had to thoroughly reverse engineer it for FRED and make
tweaks in places.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
CC: Jan Beulich <JBeulich@suse.com>
CC: Roger Pau Monné <roger.pau@citrix.com>
Obviously there's a lot more in need of doing, but this is at least a start.
v4:
* New
---
docs/glossary.rst | 3 +
docs/guest-guide/x86/index.rst | 1 +
docs/guest-guide/x86/pv-traps.rst | 123 ++++++++++++++++++++++++++++++
3 files changed, 127 insertions(+)
create mode 100644 docs/guest-guide/x86/pv-traps.rst
diff --git a/docs/glossary.rst b/docs/glossary.rst
index 6adeec77e14c..c8ab2386bc6e 100644
--- a/docs/glossary.rst
+++ b/docs/glossary.rst
@@ -43,6 +43,9 @@ Glossary
Sapphire Rapids (Server, 2023) CPUs. AMD support only CET-SS, starting
with Zen3 (Both client and server, 2020) CPUs.
+ event channel
+ A paravirtual facility for guests to send and receive interrupts.
+
guest
The term 'guest' has two different meanings, depending on context, and
should not be confused with :term:`domain`.
diff --git a/docs/guest-guide/x86/index.rst b/docs/guest-guide/x86/index.rst
index 502968490d9d..5b38ae397a9f 100644
--- a/docs/guest-guide/x86/index.rst
+++ b/docs/guest-guide/x86/index.rst
@@ -7,3 +7,4 @@ x86
:maxdepth: 2
hypercall-abi
+ pv-traps
diff --git a/docs/guest-guide/x86/pv-traps.rst b/docs/guest-guide/x86/pv-traps.rst
new file mode 100644
index 000000000000..2ff18e2f9454
--- /dev/null
+++ b/docs/guest-guide/x86/pv-traps.rst
@@ -0,0 +1,123 @@
+.. SPDX-License-Identifier: CC-BY-4.0
+
+PV Traps and Entrypoints
+========================
+
+.. note::
+
+ The details here are specific to 64bit builds of Xen. Details for 32bit
+ builds of Xen are different, and not discussed further.
+
+PV guests are subject to Xen's linkage setup for events (interrupts,
+exceptions and system calls). x86's IDT architecture and limitations are the
+majority influence on the PV ABI.
+
+All external interrupts are routed to PV guests via the :term:`Event Channel`
+interface, and not discussed further here.
+
+What remain are exceptions, and the instructions which cause control
+transfers. In the x86 architecture, the instructions relevant for PV guests
+are:
+
+ * ``INT3``, which generates ``#BP``.
+
+ * ``INTO``, which generates ``#OF`` only if the overflow flag is set. It is
+ only usable in compatibility mode, and will ``#UD`` in 64bit mode.
+
+ * ``CALL (far)`` referencing a gate in the GDT.
+
+ * ``INT $N``, which invokes an arbitrary IDT gate. These four instructions
+ so far all check the gate DPL and will ``#GP`` otherwise.
+
+ * ``INT1``, also known as ``ICEBP``, which generates ``#DB``. This
+ instruction does *not* check DPL, and can be used unconditionally by
+ userspace.
+
+ * ``SYSCALL``, which enters CPL0 as configured by the ``{C,L,}STAR`` MSRs.
+ It is usable if enabled by ``MSR_EFER.SCE``, and will ``#UD`` otherwise.
+ On Intel parts, ``SYSCALL`` is unusable outside of 64bit mode.
+
+ * ``SYSENTER``, which enters CPL0 as configured by the ``SEP`` MSRs. It is
+ usable if enabled by ``MSR_SYSENTER_CS`` having a non-NUL selector, and
+ will ``#GP`` otherwise. On AMD parts, ``SYSENTER`` is unusable in Long
+ mode.
+
+
+Xen's configuration
+-------------------
+
+Xen maintains a complete IDT, with most gates configured with DPL0. This
+causes most ``INT $N`` instructions to ``#GP``. This allows Xen to emulate
+the instruction, referring to the guest kernel's vDPL choice.
+
+ * Vectors 3 ``#BP`` and 4 ``#OF`` are DPL3, in order to allow the ``INT3``
+ and ``INTO`` instructions to function in userspace.
+
+ * Vector 0x80 is DPL3 in order to implement the legacy system call fastpath
+ commonly found in UNIXes.
+
+ * Vector 0x82 is DPL1 when PV32 is enabled, allowing the guest kernel to make
+ hypercalls to Xen. All other cases (PV32 guest userspace, and both PV64
+ modes) operate in CPL3 and this vector behaves like all others to ``INT
+ $N`` instructions.
+
+A range of the GDT is guest-owned, allowing for call gates. During audit, Xen
+forces all call gates to DPL0, causing their use to ``#GP``, allowing for
+emulation.
+
+Xen enables ``SYSCALL`` in all cases as it is mandatory in 64bit mode, and
+enables ``SYSENTER`` when available in 64bit mode.
+
+When Xen is using FRED delivery the hardware configuration is substantially
+different, but the behaviour for guests remains as unchanged as possible.
+
+
+PV Guest's configuration
+------------------------
+
+The PV ABI contains the "trap table", modelled very closely on the IDT. It is
+manipulated by ``HYPERCALL_set_trap_table``, has 256 entries, each containing
+a code segment selector, an address, and flags. A guest is expected to
+configure handlers for all exceptions; failure to do so is terminal similar to
+a Triple Fault.
+
+Part of the GDT is guest owned with descriptors audited by Xen. This range
+can be manipulated with ``HYPERVISOR_set_gdt`` and
+``HYPERVISOR_update_descriptor``.
+
+Other entrypoints are configured via ``HYPERVISOR_callback_op``. Of note here
+are the callback types ``syscall``, ``syscall32`` (relevant for AMD parts) and
+``sysenter`` (relevant for Intel parts).
+
+.. warning::
+
+ Prior to Xen 4.15, there was no check that the ``syscall`` or ``syscall32``
+ callbacks had been registered before attempting to deliver via them.
+ Guests are strongly advised to ensure the entrypoints are registered before
+ running userspace.
+
+
+Notes
+-----
+
+``INT3`` vs ``INT $3`` and ``INTO`` vs ``INT $4`` are hard to distinguish
+architecturally as both forms have a DPL check and use the same IDT vectors.
+Because Xen configures both as DPL3, the ``INT $`` forms do not fault for
+emulation, and are treated as if they were exceptions. This means the guest
+can't block these instructions by trying to configure them with vDPL0.
+
+The instructions which trap into Xen (``INT $0x80``, ``SYSCALL``,
+``SYSENTER``) but can be disabled by guest configuration need turning back
+into faults for the guest kernel to process.
+
+ * When using IDT delivery, instruction lengths are not provided by hardware
+ and Xen does not account for possible prefixes. ``%rip`` only gets rewound
+ by the length of the unprefixed instruction. This is observable, but not
+ expected to be an issue in practice.
+
+ * When Xen is using FRED delivery, the full instruction length is provided by
+ hardware, and ``%rip`` is rewound fully.
+
+While both PV32 and PV64 guests are permitted to write Call Gates into the
+GDT, emulation is only wired up for PV32. At the time of writing, the x86
+maintainers feel no specific need to fix this omission.
--
2.39.5
* [PATCH v4 03/14] x86/boot: Move gdt_l1e caching out of traps_init()
2026-02-27 23:16 [PATCH v4 00/14] x86: FRED support Andrew Cooper
2026-02-27 23:16 ` [PATCH v4 01/14] x86/pv: Don't assume that INT $imm8 instructions are two bytes long Andrew Cooper
2026-02-27 23:16 ` [PATCH v4 02/14] docs/guest-guide: Describe the PV traps and entrypoints ABI Andrew Cooper
@ 2026-02-27 23:16 ` Andrew Cooper
2026-03-02 11:33 ` Jan Beulich
2026-02-27 23:16 ` [PATCH v4 04/14] x86/boot: Document the ordering dependency of _svm_cpu_up() Andrew Cooper
` (10 subsequent siblings)
13 siblings, 1 reply; 43+ messages in thread
From: Andrew Cooper @ 2026-02-27 23:16 UTC (permalink / raw)
To: Xen-devel; +Cc: Andrew Cooper, Jan Beulich, Roger Pau Monné
Commit 564d261687c0 ("x86/ctxt-switch: Document and improve GDT handling") put
the initialisation of {,compat_}gdt_l1e into traps_init() but this wasn't a
great choice. Instead, put it in smp_prepare_cpus() which performs the BSP
preparation of variables normally set up by cpu_smpboot_alloc() for APs.
This removes an implicit dependency that prevents traps_init() moving earlier
than move_xen() in the boot sequence.
No functional change.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
CC: Jan Beulich <JBeulich@suse.com>
CC: Roger Pau Monné <roger.pau@citrix.com>
v4:
* New
I'm on the fence about the ASSERT(), but I'm getting rather tired of unstated
dependencies. For a PV64 guest using SYSEXIT to enter the guest, it's the
first interrupt/exception which references the GDT, which could be after the
guest is running.
---
xen/arch/x86/domain.c | 2 ++
xen/arch/x86/smpboot.c | 11 +++++++++++
xen/arch/x86/traps-setup.c | 7 -------
3 files changed, 13 insertions(+), 7 deletions(-)
diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
index 8eb1509782ef..e658c2d647b7 100644
--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -2029,6 +2029,8 @@ static always_inline bool need_full_gdt(const struct domain *d)
static void update_xen_slot_in_full_gdt(const struct vcpu *v, unsigned int cpu)
{
+ ASSERT(per_cpu(gdt_l1e, cpu).l1); /* Confirm these have been cached. */
+
l1e_write(pv_gdt_ptes(v) + FIRST_RESERVED_GDT_PAGE,
!is_pv_32bit_vcpu(v) ? per_cpu(gdt_l1e, cpu)
: per_cpu(compat_gdt_l1e, cpu));
diff --git a/xen/arch/x86/smpboot.c b/xen/arch/x86/smpboot.c
index 961bdf53331c..491cbbba33ae 100644
--- a/xen/arch/x86/smpboot.c
+++ b/xen/arch/x86/smpboot.c
@@ -1167,6 +1167,17 @@ void __init smp_prepare_cpus(void)
initialize_cpu_data(0); /* Final full version of the data */
print_cpu_info(0);
+ /*
+ * Cache {,compat_}gdt_l1e for the BSP now that physical relocation is
+ * done. It must be after physical relocation of Xen, and before the
+ * first context_switch().
+ */
+ this_cpu(gdt_l1e) =
+ l1e_from_pfn(virt_to_mfn(boot_gdt), __PAGE_HYPERVISOR_RW);
+ if ( IS_ENABLED(CONFIG_PV32) )
+ this_cpu(compat_gdt_l1e) =
+ l1e_from_pfn(virt_to_mfn(boot_compat_gdt), __PAGE_HYPERVISOR_RW);
+
boot_cpu_physical_apicid = get_apic_id();
x86_cpu_to_apicid[0] = boot_cpu_physical_apicid;
diff --git a/xen/arch/x86/traps-setup.c b/xen/arch/x86/traps-setup.c
index d77be8f83921..c5fc71c75bca 100644
--- a/xen/arch/x86/traps-setup.c
+++ b/xen/arch/x86/traps-setup.c
@@ -341,13 +341,6 @@ void __init traps_init(void)
init_ler();
- /* Cache {,compat_}gdt_l1e now that physically relocation is done. */
- this_cpu(gdt_l1e) =
- l1e_from_pfn(virt_to_mfn(boot_gdt), __PAGE_HYPERVISOR_RW);
- if ( IS_ENABLED(CONFIG_PV32) )
- this_cpu(compat_gdt_l1e) =
- l1e_from_pfn(virt_to_mfn(boot_compat_gdt), __PAGE_HYPERVISOR_RW);
-
percpu_traps_init();
}
--
2.39.5
* [PATCH v4 04/14] x86/boot: Document the ordering dependency of _svm_cpu_up()
2026-02-27 23:16 [PATCH v4 00/14] x86: FRED support Andrew Cooper
` (2 preceding siblings ...)
2026-02-27 23:16 ` [PATCH v4 03/14] x86/boot: Move gdt_l1e caching out of traps_init() Andrew Cooper
@ 2026-02-27 23:16 ` Andrew Cooper
2026-03-02 11:35 ` Jan Beulich
2026-03-02 15:20 ` Andrew Cooper
2026-02-27 23:16 ` [PATCH v4 05/14] x86/traps: Move traps_init() earlier on boot Andrew Cooper
` (9 subsequent siblings)
13 siblings, 2 replies; 43+ messages in thread
From: Andrew Cooper @ 2026-02-27 23:16 UTC (permalink / raw)
To: Xen-devel; +Cc: Andrew Cooper, Jan Beulich, Roger Pau Monné
Let's just say this took an unreasonable amount of time and effort to track
down when trying to move traps_init() earlier during boot.
When the SYSCALL linkage MSRs are not configured ahead of _svm_cpu_up() on the
BSP, the first context switch into PV uses svm_load_segs() and clobbers the
later-set-up linkage with the 0's cached here, causing hypercalls issued by
the PV guest to enter at 0 in supervisor mode on the user stack.
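Spelling out the ordering (a sketch of my understanding; svm_load_segs() is the
VMLOAD-based optimisation, and VMLOAD reloads the SYSCALL MSRs too):

  traps_init()
    legacy_syscall_init()         /* programs MSR_LSTAR & friends     */
  ... presmp initcalls ...
  _svm_cpu_up(bsp)
    svm_vmsave_pa(host_vmcb)      /* snapshots the live SYSCALL MSRs  */
  ...
  first context switch into PV
    svm_load_segs()               /* VMLOAD restores the snapshot     */

Swap the first two steps around and the snapshot holds whatever was in the MSRs
before Xen, which the first PV context switch then reinstates.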
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
CC: Jan Beulich <JBeulich@suse.com>
CC: Roger Pau Monné <roger.pau@citrix.com>
v4:
* New
It occurs to me that it's not actually 0's we cache here. It's whatever
context was left from prior to Xen. We still don't reliably clean unused
MSRs.
---
xen/arch/x86/hvm/svm/svm.c | 16 ++++++++++++++++
xen/arch/x86/setup.c | 2 +-
2 files changed, 17 insertions(+), 1 deletion(-)
diff --git a/xen/arch/x86/hvm/svm/svm.c b/xen/arch/x86/hvm/svm/svm.c
index 18ba837738c6..f1e02d919cae 100644
--- a/xen/arch/x86/hvm/svm/svm.c
+++ b/xen/arch/x86/hvm/svm/svm.c
@@ -35,6 +35,7 @@
#include <asm/p2m.h>
#include <asm/paging.h>
#include <asm/processor.h>
+#include <asm/traps.h>
#include <asm/vm_event.h>
#include <asm/x86_emulate.h>
@@ -1581,6 +1582,21 @@ static int _svm_cpu_up(bool bsp)
/* Initialize OSVW bits to be used by guests */
svm_host_osvw_init();
+ /*
+ * VMSAVE writes out the current full FS, GS, LDTR and TR segments, and
+ * the GS_SHADOW, SYSENTER and SYSCALL linkage MSRs.
+ *
+ * The segment data gets modified by the svm_load_segs() optimisation for
+ * PV context switches, but all values get reloaded at that point, as well
+ * as during context switch from SVM.
+ *
+ * If PV guests are available (and FRED is not in use), it is critical
+ * that the SYSCALL linkage MSRs have been configured at this juncture.
+ */
+ ASSERT(opt_fred >= 0); /* Confirm that FRED-ness has been resolved */
+ if ( IS_ENABLED(CONFIG_PV) && !opt_fred )
+ ASSERT(rdmsr(MSR_LSTAR));
+
svm_vmsave_pa(per_cpu(host_vmcb, cpu));
return 0;
diff --git a/xen/arch/x86/setup.c b/xen/arch/x86/setup.c
index 27c63d1d97c9..675de3a649ea 100644
--- a/xen/arch/x86/setup.c
+++ b/xen/arch/x86/setup.c
@@ -2078,7 +2078,7 @@ void asmlinkage __init noreturn __start_xen(void)
&this_cpu(stubs).mfn);
BUG_ON(!this_cpu(stubs.addr));
- traps_init(); /* Needs stubs allocated. */
+ traps_init(); /* Needs stubs allocated, must be before presmp_initcalls. */
cpu_init();
--
2.39.5
* [PATCH v4 05/14] x86/traps: Move traps_init() earlier on boot
2026-02-27 23:16 [PATCH v4 00/14] x86: FRED support Andrew Cooper
` (3 preceding siblings ...)
2026-02-27 23:16 ` [PATCH v4 04/14] x86/boot: Document the ordering dependency of _svm_cpu_up() Andrew Cooper
@ 2026-02-27 23:16 ` Andrew Cooper
2026-03-02 11:39 ` Jan Beulich
2026-02-27 23:16 ` [PATCH v4 06/14] x86/traps: Don't configure Supervisor Shadow Stack tokens in FRED mode Andrew Cooper
` (8 subsequent siblings)
13 siblings, 1 reply; 43+ messages in thread
From: Andrew Cooper @ 2026-02-27 23:16 UTC (permalink / raw)
To: Xen-devel; +Cc: Andrew Cooper, Jan Beulich, Roger Pau Monné
We wish to make use of opt_fred earlier on boot, which involves moving
traps_init() earlier, but this comes with several ordering complications.
The feature word containing FRED needs collecting in early_cpu_init(), and
legacy_syscall_init() cannot be called that early because it relies on the
stubs being allocated, yet must be called ahead of cpu_init() so the SYSCALL
linkage MSRs are set up before being cached.
Delaying legacy_syscall_init() is easy enough based on a system_state check.
Reuse bsp_traps_reinit() to cause a call to legacy_syscall_init() to occur at
the same point as previously.
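The resulting BSP sequence, roughly (a sketch; exact call sites are in the diff
below):

  __start_xen()
    traps_init()            /* now early; legacy_syscall_init() skipped      */
    ...
    /* stub pages allocated */
    bsp_traps_reinit()      /* the old traps_init() call site                */
      percpu_traps_init()   /* system_state > early_boot, so                 */
                            /* legacy_syscall_init() runs here               */
    cpu_init()
    ...
    do_presmp_initcalls()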
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
CC: Jan Beulich <JBeulich@suse.com>
CC: Roger Pau Monné <roger.pau@citrix.com>
v4:
* New
I don't particularly like this solution, but the layout of these functions
changes for FRED. Any adjustments need to consider the logic at the end of the
series, not at this point.
---
xen/arch/x86/cpu/common.c | 4 +++-
xen/arch/x86/setup.c | 4 +++-
xen/arch/x86/traps-setup.c | 12 +++++++++++-
3 files changed, 17 insertions(+), 3 deletions(-)
diff --git a/xen/arch/x86/cpu/common.c b/xen/arch/x86/cpu/common.c
index bfa63fcfb721..5d0523a78b52 100644
--- a/xen/arch/x86/cpu/common.c
+++ b/xen/arch/x86/cpu/common.c
@@ -407,7 +407,9 @@ void __init early_cpu_init(bool verbose)
}
if (max_subleaf >= 1)
- cpuid_count(7, 1, &eax, &ebx, &ecx,
+ cpuid_count(7, 1,
+ &c->x86_capability[FEATURESET_7a1],
+ &ebx, &ecx,
&c->x86_capability[FEATURESET_7d1]);
}
diff --git a/xen/arch/x86/setup.c b/xen/arch/x86/setup.c
index 675de3a649ea..0816a713e1c8 100644
--- a/xen/arch/x86/setup.c
+++ b/xen/arch/x86/setup.c
@@ -1386,6 +1386,8 @@ void asmlinkage __init noreturn __start_xen(void)
else
panic("Bootloader provided no memory information\n");
+ traps_init();
+
/* Choose shadow stack early, to set infrastructure up appropriately. */
if ( !boot_cpu_has(X86_FEATURE_CET_SS) )
opt_xen_shstk = 0;
@@ -2078,7 +2080,7 @@ void asmlinkage __init noreturn __start_xen(void)
&this_cpu(stubs).mfn);
BUG_ON(!this_cpu(stubs.addr));
- traps_init(); /* Needs stubs allocated, must be before presmp_initcalls. */
+ bsp_traps_reinit(); /* Needs stubs allocated, must be before presmp_initcalls. */
cpu_init();
diff --git a/xen/arch/x86/traps-setup.c b/xen/arch/x86/traps-setup.c
index c5fc71c75bca..b2c161943d1e 100644
--- a/xen/arch/x86/traps-setup.c
+++ b/xen/arch/x86/traps-setup.c
@@ -346,6 +346,10 @@ void __init traps_init(void)
/*
* Re-initialise all state referencing the early-boot stack.
+ *
+ * This is called twice during boot, first to ensure legacy_syscall_init() has
+ * run (deferred from earlier), and second when the virtual address of the BSP
+ * stack changes.
*/
void __init bsp_traps_reinit(void)
{
@@ -359,7 +363,13 @@ void __init bsp_traps_reinit(void)
*/
void percpu_traps_init(void)
{
- legacy_syscall_init();
+ /*
+ * Skip legacy_syscall_init() at early boot. It requires the stubs being
+ * allocated, limiting the placement of the traps_init() call, and gets
+ * re-done anyway by bsp_traps_reinit().
+ */
+ if ( system_state > SYS_STATE_early_boot )
+ legacy_syscall_init();
if ( cpu_has_xen_lbr )
wrmsrl(MSR_IA32_DEBUGCTLMSR, IA32_DEBUGCTLMSR_LBR);
--
2.39.5
* [PATCH v4 06/14] x86/traps: Don't configure Supervisor Shadow Stack tokens in FRED mode
2026-02-27 23:16 [PATCH v4 00/14] x86: FRED support Andrew Cooper
` (4 preceding siblings ...)
2026-02-27 23:16 ` [PATCH v4 05/14] x86/traps: Move traps_init() earlier on boot Andrew Cooper
@ 2026-02-27 23:16 ` Andrew Cooper
2026-03-02 14:50 ` Jan Beulich
2026-02-27 23:16 ` [PATCH v4 07/14] x86/traps: Introduce FRED entrypoints Andrew Cooper
` (7 subsequent siblings)
13 siblings, 1 reply; 43+ messages in thread
From: Andrew Cooper @ 2026-02-27 23:16 UTC (permalink / raw)
To: Xen-devel; +Cc: Andrew Cooper, Jan Beulich, Roger Pau Monné
FRED doesn't use Supervisor Shadow Stack tokens. This means that:
1) memguard_guard_stack() should not write Supervisor Shadow Stack Tokens.
2) cpu_has_bug_shstk_fracture is no longer relevant when deciding whether or
not to enable Shadow Stacks in the first place.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
CC: Jan Beulich <JBeulich@suse.com>
CC: Roger Pau Monné <roger.pau@citrix.com>
v4:
* Adjust for cpu_has_bug_shstk_fracture.
* Reworked entirely in light of the prior 3 patches.
The SDM explicitly points out the shstk fracture vs FRED case, yet PTL
enumerates CET-SSS (immunity to shstk fracture). I can only assume that there
are other Intel CPUs with FRED but without CET-SSS.
---
xen/arch/x86/mm.c | 14 +++++++++++---
xen/arch/x86/setup.c | 16 ++++++++++------
2 files changed, 21 insertions(+), 9 deletions(-)
diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index 0d0d5292953b..4c404b6c134f 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -129,6 +129,7 @@
#include <asm/shadow.h>
#include <asm/shared.h>
#include <asm/trampoline.h>
+#include <asm/traps.h>
#include <asm/x86_emulate.h>
#include <public/memory.h>
@@ -6441,8 +6442,15 @@ static void write_sss_token(unsigned long *ptr)
void memguard_guard_stack(void *p)
{
- /* IST Shadow stacks. 4x 1k in stack page 0. */
- if ( IS_ENABLED(CONFIG_XEN_SHSTK) )
+ ASSERT(opt_fred >= 0); /* Confirm that FRED-ness has been resolved */
+
+ /*
+ * IST Shadow stacks. 4x 1k in stack page 0.
+ *
+ * With IDT delivery, we need Supervisor Shadow Stack tokens at the base
+ * of each stack. With FRED delivery, these no longer exist.
+ */
+ if ( IS_ENABLED(CONFIG_XEN_SHSTK) && !opt_fred )
{
write_sss_token(p + (IST_MCE * IST_SHSTK_SIZE) - 8);
write_sss_token(p + (IST_NMI * IST_SHSTK_SIZE) - 8);
@@ -6453,7 +6461,7 @@ void memguard_guard_stack(void *p)
/* Primary Shadow Stack. 1x 4k in stack page 5. */
p += PRIMARY_SHSTK_SLOT * PAGE_SIZE;
- if ( IS_ENABLED(CONFIG_XEN_SHSTK) )
+ if ( IS_ENABLED(CONFIG_XEN_SHSTK) && !opt_fred )
write_sss_token(p + PAGE_SIZE - 8);
map_pages_to_xen((unsigned long)p, virt_to_mfn(p), 1, PAGE_HYPERVISOR_SHSTK);
diff --git a/xen/arch/x86/setup.c b/xen/arch/x86/setup.c
index 0816a713e1c8..8e59c9801afe 100644
--- a/xen/arch/x86/setup.c
+++ b/xen/arch/x86/setup.c
@@ -1412,15 +1412,19 @@ void asmlinkage __init noreturn __start_xen(void)
boot_cpu_data.x86_vendor == X86_VENDOR_INTEL &&
!boot_cpu_has(X86_FEATURE_CET_SSS);
+ ASSERT(opt_fred >= 0); /* Confirm that FRED-ness has been resolved */
+
/*
- * On bare metal, assume that Xen won't be impacted by shstk
- * fracturing problems. Under virt, be more conservative and disable
- * shstk by default.
+ * If FRED is in use, Supervisor Shadow Stack tokens are not used and
+ * shstk fracturing is of no consequence. Otherwise:
+ * - On bare metal, assume that Xen won't be impacted by shstk
+ * fracturing problems.
+ * - Under virt, be more conservative and disable shstk by default.
*/
if ( opt_xen_shstk == -1 )
opt_xen_shstk =
- cpu_has_hypervisor ? !cpu_has_bug_shstk_fracture
- : true;
+ opt_fred || (cpu_has_hypervisor ? !cpu_has_bug_shstk_fracture
+ : true);
if ( opt_xen_shstk )
{
@@ -1925,7 +1929,7 @@ void asmlinkage __init noreturn __start_xen(void)
system_state = SYS_STATE_boot;
- bsp_stack = cpu_alloc_stack(0);
+ bsp_stack = cpu_alloc_stack(0); /* Needs to know IDT vs FRED */
if ( !bsp_stack )
panic("No memory for BSP stack\n");
--
2.39.5
* [PATCH v4 07/14] x86/traps: Introduce FRED entrypoints
2026-02-27 23:16 [PATCH v4 00/14] x86: FRED support Andrew Cooper
` (5 preceding siblings ...)
2026-02-27 23:16 ` [PATCH v4 06/14] x86/traps: Don't configure Supervisor Shadow Stack tokens in FRED mode Andrew Cooper
@ 2026-02-27 23:16 ` Andrew Cooper
2026-02-27 23:16 ` [PATCH v4 08/14] x86/traps: Enable FRED when requested Andrew Cooper
` (6 subsequent siblings)
13 siblings, 0 replies; 43+ messages in thread
From: Andrew Cooper @ 2026-02-27 23:16 UTC (permalink / raw)
To: Xen-devel; +Cc: Andrew Cooper, Jan Beulich, Jan Beulich, Roger Pau Monné
Under FRED, there's one entrypoint from Ring 3, and one from Ring 0.
FRED gives us a good stack (even for SYSCALL/SYSENTER), and a unified event
frame on the stack, meaing that all software needs to do is spill the GPRs
with a line of PUSHes. Introduce PUSH_AND_CLEAR_GPRS for this purpose, along
with the matching POP_GPRS.
Introduce entry_FRED_R0(), which to a first approximation is complete for all
event handling within Xen.
entry_FRED_R0()'s position needs deriving from entry_FRED_R3() (Ring3 + 0x100),
so introduce a basic Ring3 handler too. There is more work required to make the
return-to-guest path work
under FRED.
Also introduce entry_from_{xen,pv}() to be the C level handlers. By simply
copying regs->fred_ss.vector into regs->entry_vector, we can reuse all the
existing fault handlers.
Extend fatal_trap() to render the event type, including by name, when FRED is
active. This is slightly complicated, because X86_ET_OTHER must not use
vector_name() or SYSCALL and SYSENTER get rendered as #BP and #DB.
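As an illustration of the new output, a ring-0 #GP ends up rendered along the
lines of (assuming X86_ET_HW_EXC carries the VT-x style numeric value of 3; the
exact number isn't visible in this patch):

  FATAL TRAP: type 3, HW_EXC, vec 13, #GP[0000]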
This is sufficient to handle all interrupts and exceptions encountered during
development, including plenty of Double Faults.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
---
CC: Jan Beulich <JBeulich@suse.com>
CC: Roger Pau Monné <roger.pau@citrix.com>
v4:
* Drop POP_GPRS %rax parameter for now.
* Treat nested events as fatal. Even if the condition didn't escalate to
#DF, it still indicates an error with the linkage setup.
v3:
* Adjust commit message to remove stale details
* Adjust formatting in fatal_trap()
* Group CP with others. It's probably wrong for perf, but that's out the
window anyway now that we're letting a compiler make the decision tree.
v2:
* Don't render a vector name for X86_ET_SW_INT
* Fix typos in names[]
* Link entry-fred.o first
SIMICS hasn't been updated to the FRED v9, and still wants ENDBR instructions
at the entrypoints.
---
xen/arch/x86/include/asm/asm_defns.h | 63 +++++++++++
xen/arch/x86/traps.c | 155 +++++++++++++++++++++++++++
xen/arch/x86/x86_64/Makefile | 1 +
xen/arch/x86/x86_64/entry-fred.S | 33 ++++++
4 files changed, 252 insertions(+)
create mode 100644 xen/arch/x86/x86_64/entry-fred.S
diff --git a/xen/arch/x86/include/asm/asm_defns.h b/xen/arch/x86/include/asm/asm_defns.h
index 4a21a7b46684..0dd63270fc7c 100644
--- a/xen/arch/x86/include/asm/asm_defns.h
+++ b/xen/arch/x86/include/asm/asm_defns.h
@@ -312,6 +312,69 @@ static always_inline void stac(void)
subq $-(UREGS_error_code-UREGS_r15+\adj), %rsp
.endm
+/*
+ * Push and clear GPRs
+ */
+.macro PUSH_AND_CLEAR_GPRS
+ push %rdi
+ xor %edi, %edi
+ push %rsi
+ xor %esi, %esi
+ push %rdx
+ xor %edx, %edx
+ push %rcx
+ xor %ecx, %ecx
+ push %rax
+ xor %eax, %eax
+ push %r8
+ xor %r8d, %r8d
+ push %r9
+ xor %r9d, %r9d
+ push %r10
+ xor %r10d, %r10d
+ push %r11
+ xor %r11d, %r11d
+ push %rbx
+ xor %ebx, %ebx
+ push %rbp
+#ifdef CONFIG_FRAME_POINTER
+/* Indicate special exception stack frame by inverting the frame pointer. */
+ mov %rsp, %rbp
+ not %rbp
+#else
+ xor %ebp, %ebp
+#endif
+ push %r12
+ xor %r12d, %r12d
+ push %r13
+ xor %r13d, %r13d
+ push %r14
+ xor %r14d, %r14d
+ push %r15
+ xor %r15d, %r15d
+.endm
+
+/*
+ * POP GPRs from a UREGS_* frame on the stack. Does not modify flags.
+ */
+.macro POP_GPRS
+ pop %r15
+ pop %r14
+ pop %r13
+ pop %r12
+ pop %rbp
+ pop %rbx
+ pop %r11
+ pop %r10
+ pop %r9
+ pop %r8
+ pop %rax
+ pop %rcx
+ pop %rdx
+ pop %rsi
+ pop %rdi
+.endm
+
#ifdef CONFIG_PV32
#define CR4_PV32_RESTORE \
ALTERNATIVE_2 "", \
diff --git a/xen/arch/x86/traps.c b/xen/arch/x86/traps.c
index 907fb4c186c0..48667c71d591 100644
--- a/xen/arch/x86/traps.c
+++ b/xen/arch/x86/traps.c
@@ -88,6 +88,13 @@ const unsigned int nmi_cpu;
#define stack_words_per_line 4
#define ESP_BEFORE_EXCEPTION(regs) ((unsigned long *)(regs)->rsp)
+/* Only valid to use when FRED is active. */
+static inline struct fred_info *cpu_regs_fred_info(struct cpu_user_regs *regs)
+{
+ ASSERT(read_cr4() & X86_CR4_FRED);
+ return &container_of(regs, struct cpu_info, guest_cpu_user_regs)->_fred;
+}
+
struct extra_state
{
unsigned long cr0, cr2, cr3, cr4;
@@ -1028,6 +1035,32 @@ void show_execution_state_nmi(const cpumask_t *mask, bool show_all)
printk("Non-responding CPUs: {%*pbl}\n", CPUMASK_PR(&show_state_mask));
}
+static const char *x86_et_name(unsigned int type)
+{
+ static const char *const names[] = {
+ [X86_ET_EXT_INTR] = "EXT_INTR",
+ [X86_ET_NMI] = "NMI",
+ [X86_ET_HW_EXC] = "HW_EXC",
+ [X86_ET_SW_INT] = "SW_INT",
+ [X86_ET_PRIV_SW_EXC] = "PRIV_SW_EXC",
+ [X86_ET_SW_EXC] = "SW_EXC",
+ [X86_ET_OTHER] = "OTHER",
+ };
+
+ return (type < ARRAY_SIZE(names) && names[type]) ? names[type] : "???";
+}
+
+static const char *x86_et_other_name(unsigned int what)
+{
+ static const char *const names[] = {
+ [0] = "MTF",
+ [1] = "SYSCALL",
+ [2] = "SYSENTER",
+ };
+
+ return (what < ARRAY_SIZE(names) && names[what]) ? names[what] : "???";
+}
+
const char *vector_name(unsigned int vec)
{
static const char names[][4] = {
@@ -1106,6 +1139,38 @@ void fatal_trap(const struct cpu_user_regs *regs, bool show_remote)
}
}
+ if ( read_cr4() & X86_CR4_FRED )
+ {
+ bool render_ec = false;
+ const char *vec_name = NULL;
+
+ switch ( regs->fred_ss.type )
+ {
+ case X86_ET_HW_EXC:
+ case X86_ET_PRIV_SW_EXC:
+ case X86_ET_SW_EXC:
+ render_ec = true;
+ vec_name = vector_name(regs->fred_ss.vector);
+ break;
+
+ case X86_ET_OTHER:
+ vec_name = x86_et_other_name(regs->fred_ss.vector);
+ break;
+ }
+
+ if ( render_ec )
+ panic("FATAL TRAP: type %u, %s, vec %u, %s[%04x]%s\n",
+ regs->fred_ss.type, x86_et_name(regs->fred_ss.type),
+ regs->fred_ss.vector, vec_name ?: "",
+ regs->error_code,
+ (regs->eflags & X86_EFLAGS_IF) ? "" : " IN INTERRUPT CONTEXT");
+ else
+ panic("FATAL TRAP: type %u, %s, vec %u, %s%s\n",
+ regs->fred_ss.type, x86_et_name(regs->fred_ss.type),
+ regs->fred_ss.vector, vec_name ?: "",
+ (regs->eflags & X86_EFLAGS_IF) ? "" : " IN INTERRUPT CONTEXT");
+ }
+
panic("FATAL TRAP: vec %u, %s[%04x]%s\n",
trapnr, vector_name(trapnr), regs->error_code,
(regs->eflags & X86_EFLAGS_IF) ? "" : " IN INTERRUPT CONTEXT");
@@ -2199,6 +2264,96 @@ void asmlinkage check_ist_exit(const struct cpu_user_regs *regs, bool ist_exit)
}
#endif
+void asmlinkage entry_from_pv(struct cpu_user_regs *regs)
+{
+ /* Copy fred_ss.vector into entry_vector as IDT delivery would have done. */
+ regs->entry_vector = regs->fred_ss.vector;
+
+ fatal_trap(regs, false);
+}
+
+void asmlinkage entry_from_xen(struct cpu_user_regs *regs)
+{
+ struct fred_info *fi = cpu_regs_fred_info(regs);
+ uint8_t type = regs->fred_ss.type;
+
+ /* Copy fred_ss.vector into entry_vector as IDT delivery would have done. */
+ regs->entry_vector = regs->fred_ss.vector;
+
+ /*
+ * First, handle the asynchronous or fatal events. These are either
+ * unrelated to the interrupted context, or may not have valid context
+ * recorded, and all have special rules on how/whether to re-enable IRQs.
+ */
+ if ( regs->fred_ss.nested )
+ goto fatal;
+
+ switch ( type )
+ {
+ case X86_ET_EXT_INTR:
+ return do_IRQ(regs);
+
+ case X86_ET_NMI:
+ return do_nmi(regs);
+
+ case X86_ET_HW_EXC:
+ switch ( regs->fred_ss.vector )
+ {
+ case X86_EXC_DF: return do_double_fault(regs);
+ case X86_EXC_MC: return do_machine_check(regs);
+ }
+ break;
+ }
+
+ /*
+ * With the asynchronous events handled, what remains are the synchronous
+ * ones. If we interrupted an IRQs-on region, we should re-enable IRQs
+ * now; for #PF and #DB, %cr2 and PENDING_DBG are on the stack in edata.
+ */
+ if ( regs->eflags & X86_EFLAGS_IF )
+ local_irq_enable();
+
+ switch ( type )
+ {
+ case X86_ET_HW_EXC:
+ case X86_ET_PRIV_SW_EXC:
+ case X86_ET_SW_EXC:
+ switch ( regs->fred_ss.vector )
+ {
+ case X86_EXC_PF: handle_PF(regs, fi->edata); break;
+ case X86_EXC_GP: do_general_protection(regs); break;
+ case X86_EXC_UD: do_invalid_op(regs); break;
+ case X86_EXC_NM: do_device_not_available(regs); break;
+ case X86_EXC_BP: do_int3(regs); break;
+ case X86_EXC_DB: handle_DB(regs, fi->edata); break;
+ case X86_EXC_CP: do_entry_CP(regs); break;
+
+ case X86_EXC_DE:
+ case X86_EXC_OF:
+ case X86_EXC_BR:
+ case X86_EXC_NP:
+ case X86_EXC_SS:
+ case X86_EXC_MF:
+ case X86_EXC_AC:
+ case X86_EXC_XM:
+ do_trap(regs);
+ break;
+
+ default:
+ goto fatal;
+ }
+ break;
+
+ default:
+ goto fatal;
+ }
+
+ return;
+
+ fatal:
+ fatal_trap(regs, false);
+}
+
/*
* Local variables:
* mode: C
diff --git a/xen/arch/x86/x86_64/Makefile b/xen/arch/x86/x86_64/Makefile
index f20763088740..c0a0b6603221 100644
--- a/xen/arch/x86/x86_64/Makefile
+++ b/xen/arch/x86/x86_64/Makefile
@@ -1,5 +1,6 @@
obj-$(CONFIG_PV32) += compat/
+obj-bin-y += entry-fred.o
obj-bin-y += entry.o
obj-$(CONFIG_KEXEC) += machine_kexec.o
obj-y += pci.o
diff --git a/xen/arch/x86/x86_64/entry-fred.S b/xen/arch/x86/x86_64/entry-fred.S
new file mode 100644
index 000000000000..3c3320df22cb
--- /dev/null
+++ b/xen/arch/x86/x86_64/entry-fred.S
@@ -0,0 +1,33 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+
+ .file "x86_64/entry-fred.S"
+
+#include <asm/asm_defns.h>
+#include <asm/page.h>
+
+ .section .text.entry, "ax", @progbits
+
+ /* The Ring3 entry point is required to be 4k aligned. */
+
+FUNC(entry_FRED_R3, 4096)
+ PUSH_AND_CLEAR_GPRS
+
+ mov %rsp, %rdi
+ call entry_from_pv
+
+ POP_GPRS
+ eretu
+END(entry_FRED_R3)
+
+ /* The Ring0 entrypoint is at Ring3 + 0x100. */
+ .org entry_FRED_R3 + 0x100, 0xcc
+
+FUNC_LOCAL(entry_FRED_R0, 0)
+ PUSH_AND_CLEAR_GPRS
+
+ mov %rsp, %rdi
+ call entry_from_xen
+
+ POP_GPRS
+ erets
+END(entry_FRED_R0)
--
2.39.5
* [PATCH v4 08/14] x86/traps: Enable FRED when requested
2026-02-27 23:16 [PATCH v4 00/14] x86: FRED support Andrew Cooper
` (6 preceding siblings ...)
2026-02-27 23:16 ` [PATCH v4 07/14] x86/traps: Introduce FRED entrypoints Andrew Cooper
@ 2026-02-27 23:16 ` Andrew Cooper
2026-03-02 16:12 ` Jan Beulich
2026-02-27 23:16 ` [PATCH v4 09/14] x86/pv: Adjust GS handling for FRED mode Andrew Cooper
` (5 subsequent siblings)
13 siblings, 1 reply; 43+ messages in thread
From: Andrew Cooper @ 2026-02-27 23:16 UTC (permalink / raw)
To: Xen-devel; +Cc: Andrew Cooper, Jan Beulich, Roger Pau Monné
With the shadow stack and exception handling adjustments in place, we can now
activate FRED when appropriate. Note that opt_fred is still disabled by
default until more infrastructure is in place.
Introduce init_fred() to set up all the MSRs relevant for FRED. FRED uses
MSR_STAR (entries from Ring3 only), and MSR_FRED_SSP_SL0 aliases MSR_PL0_SSP
when CET-SS is active. Otherwise, they're all new MSRs.
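For reference, MSR_FRED_STK_LVLS holds a 2-bit stack level per exception
vector, so the value init_fred() programs decodes as (a worked example, not
additional code):

  MSR_FRED_STK_LVLS = (2UL << (X86_EXC_MC * 2))   /* #MC (vec 18) -> SL2 */
                    | (3UL << (X86_EXC_DF * 2));  /* #DF (vec  8) -> SL3 */
  /* Every other vector stays 0, i.e. Stack Level 0. */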
Also introduce init_fred_tss(). At this juncture we need a TSS set up, even
if it is mostly unused. Reinsert the BUILD_BUG_ON() checking the size of the
TSS against 0x67, this time with a more precise comment.
With init_fred() existing, load_system_tables() and legacy_syscall_init()
should only be used when setting up IDT delivery. Insert ASSERT()s to this
effect, and adjust the various init functions to make this property true.
The FRED initialisation path still needs to write to all system table
registers at least once, even if only to invalidate them. Per the
documentation, percpu_early_traps_init() is responsible for switching off the
boot GDT, which also needs doing even in FRED mode.
Finally, set CR4.FRED in traps_init()/percpu_early_traps_init().
Xen can now boot in FRED mode and run a PVH dom0. PV guests still need more
work before they can be run under FRED.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
CC: Jan Beulich <JBeulich@suse.com>
CC: Roger Pau Monné <roger.pau@citrix.com>
[*] PVH Dom0 on an Intel PantherLake CPU.
v4:
* Rework TSS handling entirely, following the VT-x (re)discovery.
v3:
* Fix poisoning of SL1 pointers.
* Adjust bsp_traps_reinit(). It probably doesn't matter.
---
xen/arch/x86/include/asm/current.h | 3 +
xen/arch/x86/include/asm/traps.h | 2 +
xen/arch/x86/traps-setup.c | 130 +++++++++++++++++++++++++++--
3 files changed, 130 insertions(+), 5 deletions(-)
diff --git a/xen/arch/x86/include/asm/current.h b/xen/arch/x86/include/asm/current.h
index 62817e8476ec..6139980ab115 100644
--- a/xen/arch/x86/include/asm/current.h
+++ b/xen/arch/x86/include/asm/current.h
@@ -23,6 +23,9 @@
* 2 - NMI IST stack
* 1 - #MC IST stack
* 0 - IST Shadow Stacks (4x 1k, read-only)
+ *
+ * In FRED mode, #DB and NMI do not need special stacks, so their IST stacks
+ * are unused.
*/
/*
diff --git a/xen/arch/x86/include/asm/traps.h b/xen/arch/x86/include/asm/traps.h
index 73097e957d05..5d7504bc44d1 100644
--- a/xen/arch/x86/include/asm/traps.h
+++ b/xen/arch/x86/include/asm/traps.h
@@ -16,6 +16,8 @@ void traps_init(void);
void bsp_traps_reinit(void);
void percpu_traps_init(void);
+void nocall entry_FRED_R3(void);
+
extern unsigned int ler_msr;
const char *vector_name(unsigned int vec);
diff --git a/xen/arch/x86/traps-setup.c b/xen/arch/x86/traps-setup.c
index b2c161943d1e..4c8744eebe3c 100644
--- a/xen/arch/x86/traps-setup.c
+++ b/xen/arch/x86/traps-setup.c
@@ -59,6 +59,8 @@ static void load_system_tables(void)
.limit = sizeof(bsp_idt) - 1,
};
+ ASSERT(opt_fred == 0);
+
/*
* Set up the TSS. Warning - may be live, and the NMI/#MC must remain
* valid on every instruction boundary. (Note: these are all
@@ -191,6 +193,8 @@ static void legacy_syscall_init(void)
unsigned char *stub_page;
unsigned int offset;
+ ASSERT(opt_fred == 0);
+
/* No PV guests? No need to set up SYSCALL/SYSENTER infrastructure. */
if ( !IS_ENABLED(CONFIG_PV) )
return;
@@ -268,6 +272,76 @@ static void __init init_ler(void)
setup_force_cpu_cap(X86_FEATURE_XEN_LBR);
}
+/*
+ * Set up all MSRs relevant for FRED event delivery.
+ *
+ * Xen does not use any of the optional config in MSR_FRED_CONFIG, so all that
+ * is needed is the entrypoint.
+ *
+ * Because FRED always provides a good stack, NMI and #DB do not need any
+ * special treatment. Only #DF needs another stack level, and #MC for the
+ * off chance that Xen's main stack suffers an uncorrectable error.
+ *
+ * This makes Stack Level 1 unused, but we use #DB's stacks, with the
+ * regular and shadow stacks reversed as poison to guarantee that any use
+ * escalates to #DF.
+ *
+ * FRED reuses MSR_STAR to provide the segment selector values to load on
+ * entry from Ring3. Entry from Ring0 leaves %cs and %ss unmodified.
+ */
+static void init_fred(void)
+{
+ unsigned long stack_top = get_stack_bottom() & ~(STACK_SIZE - 1);
+
+ ASSERT(opt_fred == 1);
+
+ wrmsrns(MSR_STAR, XEN_MSR_STAR);
+ wrmsrns(MSR_FRED_CONFIG, (unsigned long)entry_FRED_R3);
+
+ /*
+ * MSR_FRED_RSP_* all come with a 64-byte alignment check, avoiding the
+ * need for an explicit BUG_ON().
+ */
+ wrmsrns(MSR_FRED_RSP_SL0, (unsigned long)(&get_cpu_info()->_fred + 1));
+ wrmsrns(MSR_FRED_RSP_SL1, stack_top + (IST_DB * IST_SHSTK_SIZE)); /* Poison */
+ wrmsrns(MSR_FRED_RSP_SL2, stack_top + (1 + IST_MCE) * PAGE_SIZE);
+ wrmsrns(MSR_FRED_RSP_SL3, stack_top + (1 + IST_DF) * PAGE_SIZE);
+ wrmsrns(MSR_FRED_STK_LVLS, ((2UL << (X86_EXC_MC * 2)) |
+ (3UL << (X86_EXC_DF * 2))));
+
+ if ( cpu_has_xen_shstk )
+ {
+ wrmsrns(MSR_FRED_SSP_SL0, stack_top + (PRIMARY_SHSTK_SLOT + 1) * PAGE_SIZE);
+ wrmsrns(MSR_FRED_SSP_SL1, stack_top + (1 + IST_DB) * PAGE_SIZE); /* Poison */
+ wrmsrns(MSR_FRED_SSP_SL2, stack_top + (IST_MCE * IST_SHSTK_SIZE));
+ wrmsrns(MSR_FRED_SSP_SL3, stack_top + (IST_DF * IST_SHSTK_SIZE));
+ }
+}
+
+/*
+ * Set up a minimal TSS and selector for use in FRED mode.
+ *
+ * With FRED moving the stack pointers into MSRs, we would like to avoid
+ * having a TSS at all, but:
+ * - VT-x VMExit unconditionally sets TR.limit to 0x67, meaning that
+ * HOST_TR_BASE needs to point to a good TSS.
+ * - show_stack_overflow() cross-checks tss->rsp0.
+ *
+ * Fill in rsp0 and the bitmap offset, and load a zero-length TR. If VT-x
+ * does get used, it will clobber TR to refer to this_cpu(tss_page).tss.
+ */
+static void init_fred_tss(void)
+{
+ seg_desc_t *gdt = this_cpu(gdt) - FIRST_RESERVED_GDT_ENTRY;
+ struct tss64 *tss = &this_cpu(tss_page).tss;
+
+ tss->rsp0 = get_stack_bottom();
+ tss->bitmap = IOBMP_INVALID_OFFSET;
+
+ _set_tssldt_desc(gdt + TSS_ENTRY, 0, 0, SYS_DESC_tss_avail);
+ ltr(TSS_SELECTOR);
+}
+
/*
* Configure basic exception handling. This is prior to parsing the command
* line or configuring a console, and needs to be as simple as possible.
@@ -322,6 +396,8 @@ void __init traps_init(void)
if ( opt_fred )
{
+ const struct desc_ptr idtr = {};
+
#ifdef CONFIG_PV32
if ( opt_pv32 )
{
@@ -329,16 +405,27 @@ void __init traps_init(void)
printk(XENLOG_INFO "Disabling PV32 due to FRED\n");
}
#endif
+
+ init_fred();
+ set_in_cr4(X86_CR4_FRED);
+
+ /*
+ * Invalidate the IDT as it's not used. Set up a minimal TSS. The
+ * LDT was configured by bsp_early_traps_init().
+ */
+ lidt(&idtr);
+ init_fred_tss();
+
setup_force_cpu_cap(X86_FEATURE_XEN_FRED);
printk("Using FRED event delivery\n");
}
else
{
+ load_system_tables();
+
printk("Using IDT event delivery\n");
}
- load_system_tables();
-
init_ler();
percpu_traps_init();
@@ -353,7 +440,11 @@ void __init traps_init(void)
*/
void __init bsp_traps_reinit(void)
{
- load_system_tables();
+ if ( opt_fred )
+ init_fred();
+ else
+ load_system_tables();
+
percpu_traps_init();
}
@@ -368,7 +459,7 @@ void percpu_traps_init(void)
* allocated, limiting the placement of the traps_init() call, and gets
* re-done anyway by bsp_traps_reinit().
*/
- if ( system_state > SYS_STATE_early_boot )
+ if ( !opt_fred && system_state > SYS_STATE_early_boot )
legacy_syscall_init();
if ( cpu_has_xen_lbr )
@@ -384,7 +475,29 @@ void percpu_traps_init(void)
*/
void asmlinkage percpu_early_traps_init(void)
{
- load_system_tables();
+ if ( opt_fred )
+ {
+ seg_desc_t *gdt = this_cpu(gdt) - FIRST_RESERVED_GDT_ENTRY;
+ const struct desc_ptr gdtr = {
+ .base = (unsigned long)gdt,
+ .limit = LAST_RESERVED_GDT_BYTE,
+ }, idtr = {};
+
+ lgdt(&gdtr);
+
+ init_fred();
+ write_cr4(read_cr4() | X86_CR4_FRED);
+
+ /*
+ * Invalidate the IDT (not used) and LDT (not set up yet). Set up a
+ * minimal TSS.
+ */
+ lidt(&idtr);
+ init_fred_tss();
+ lldt(0);
+ }
+ else
+ load_system_tables();
}
static void __init __maybe_unused build_assertions(void)
@@ -403,4 +516,11 @@ static void __init __maybe_unused build_assertions(void)
endof_field(struct cpu_info, guest_cpu_user_regs)) & 15);
BUILD_BUG_ON((sizeof(struct cpu_info) -
endof_field(struct cpu_info, _fred)) & 63);
+
+ /*
+ * The x86 architecture is happy with TR.limit being less than 0x67, but
+ * VT-x is not. VMExit unconditionally sets the limit to 0x67, meaning
+ * that HOST_TR_BASE needs to refer to a good TSS of at least this size.
+ */
+ BUILD_BUG_ON(sizeof(struct tss64) <= 0x67);
}
--
2.39.5
* [PATCH v4 09/14] x86/pv: Adjust GS handling for FRED mode
2026-02-27 23:16 [PATCH v4 00/14] x86: FRED support Andrew Cooper
` (7 preceding siblings ...)
2026-02-27 23:16 ` [PATCH v4 08/14] x86/traps: Enable FRED when requested Andrew Cooper
@ 2026-02-27 23:16 ` Andrew Cooper
2026-03-02 16:24 ` Jan Beulich
2026-03-04 17:18 ` [PATCH v4.1 " Andrew Cooper
2026-02-27 23:16 ` [PATCH v4 10/14] x86/pv: Guest exception handling in " Andrew Cooper
` (4 subsequent siblings)
13 siblings, 2 replies; 43+ messages in thread
From: Andrew Cooper @ 2026-02-27 23:16 UTC (permalink / raw)
To: Xen-devel; +Cc: Andrew Cooper, Jan Beulich, Jan Beulich, Roger Pau Monné
When FRED is active, hardware automatically swaps GS when changing privilege,
and the SWAPGS instruction is disallowed.
For native OSes using GS as the thread local pointer this is a massive
improvement on the pre-FRED architecture, but under Xen it makes handling PV
guests more complicated. Specifically, it means that GS_BASE and GS_SHADOW
are the opposite way around in FRED mode, as opposed to IDT mode.
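In other words, while Xen is running with a PV vCPU scheduled (a summary of my
mental model for the changes below):

                              IDT delivery         FRED delivery
  guest's active GS base      MSR_GS_BASE          MSR_SHADOW_GS_BASE
  guest's inactive GS base    MSR_SHADOW_GS_BASE   MSR_GS_BASE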
This leads to the following changes:
* In load_segments(), we have to load both GSes. Account for this in the
SWAP() condition and avoid the path with SWAGS.
* In save_segments(), we need to read GS_SHADOW rather than GS_BASE.
* In toggle_guest_mode(), we need to emulate SWAPGS.
* In {read,write}_msr() which access the live registers, GS_SHADOW and
GS_BASE need swapping.
* In do_set_segment_base(), merge the SEGBASE_GS_{USER,KERNEL} cases and
take FRED into account when choosing which base to update.
SEGBASE_GS_USER_SEL already had LKGS semantics (decades before FRED existed),
so under FRED it needs to be just a MOV to %gs. Simply skip the SWAPGSes.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
---
CC: Jan Beulich <JBeulich@suse.com>
CC: Roger Pau Monné <roger.pau@citrix.com>
v4:
* Adjust GS accesses for emulated {RD,WR}MSR too.
---
xen/arch/x86/domain.c | 16 +++++++++++-----
xen/arch/x86/pv/domain.c | 22 ++++++++++++++++++++--
xen/arch/x86/pv/emul-priv-op.c | 24 +++++++++++++++---------
xen/arch/x86/pv/misc-hypercalls.c | 16 ++++++++++------
4 files changed, 56 insertions(+), 22 deletions(-)
diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
index e658c2d647b7..9c1f6ef76d52 100644
--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -1791,9 +1791,10 @@ static void load_segments(struct vcpu *n)
/*
* Figure out which way around gsb/gss want to be. gsb needs to be
- * the active context, and gss needs to be the inactive context.
+ * the active context, and gss needs to be the inactive context,
+ * unless we're in FRED mode where they're reversed.
*/
- if ( !(n->arch.flags & TF_kernel_mode) )
+ if ( !(n->arch.flags & TF_kernel_mode) ^ opt_fred )
SWAP(gsb, gss);
if ( using_svm() && (n->arch.pv.fs | n->arch.pv.gs) <= 3 )
@@ -1814,7 +1815,9 @@ static void load_segments(struct vcpu *n)
if ( !fs_gs_done && !compat )
{
- if ( read_cr4() & X86_CR4_FSGSBASE )
+ unsigned long cr4 = read_cr4();
+
+ if ( !(cr4 & X86_CR4_FRED) && (cr4 & X86_CR4_FSGSBASE) )
{
__wrgsbase(gss);
__wrfsbase(n->arch.pv.fs_base);
@@ -1931,6 +1934,9 @@ static void load_segments(struct vcpu *n)
* Guests however cannot use SWAPGS, so there is no mechanism to modify the
* inactive GS base behind Xen's back. Therefore, Xen's copy of the inactive
* GS base is still accurate, and doesn't need reading back from hardware.
+ *
+ * Under FRED, hardware automatically swaps GS for us, so SHADOW_GS is the
+ * active GS from the guest's point of view.
*/
static void save_segments(struct vcpu *v)
{
@@ -1946,12 +1952,12 @@ static void save_segments(struct vcpu *v)
if ( read_cr4() & X86_CR4_FSGSBASE )
{
fs_base = __rdfsbase();
- gs_base = __rdgsbase();
+ gs_base = opt_fred ? rdmsr(MSR_SHADOW_GS_BASE) : __rdgsbase();
}
else
{
fs_base = rdmsr(MSR_FS_BASE);
- gs_base = rdmsr(MSR_GS_BASE);
+ gs_base = opt_fred ? rdmsr(MSR_SHADOW_GS_BASE) : rdmsr(MSR_GS_BASE);
}
v->arch.pv.fs_base = fs_base;
diff --git a/xen/arch/x86/pv/domain.c b/xen/arch/x86/pv/domain.c
index d16583a7454d..b85abb5ed903 100644
--- a/xen/arch/x86/pv/domain.c
+++ b/xen/arch/x86/pv/domain.c
@@ -14,9 +14,10 @@
#include <asm/cpufeature.h>
#include <asm/fsgsbase.h>
#include <asm/invpcid.h>
-#include <asm/spec_ctrl.h>
#include <asm/pv/domain.h>
#include <asm/shadow.h>
+#include <asm/spec_ctrl.h>
+#include <asm/traps.h>
#ifdef CONFIG_PV32
int8_t __read_mostly opt_pv32 = -1;
@@ -514,11 +515,28 @@ void toggle_guest_mode(struct vcpu *v)
* subsequent context switch won't bother re-reading it.
*/
gs_base = read_gs_base();
+
+ /*
+ * In FRED mode, not only are the two GSes the other way around (i.e. we
+ * want to read GS_SHADOW here), the SWAPGS instruction is disallowed so
+ * we have to emulate it.
+ */
+ if ( opt_fred )
+ {
+ unsigned long gs_shadow = rdmsr(MSR_SHADOW_GS_BASE);
+
+ wrmsrns(MSR_SHADOW_GS_BASE, gs_base);
+ write_gs_base(gs_shadow);
+
+ gs_base = gs_shadow;
+ }
+ else
+ asm volatile ( "swapgs" );
+
if ( v->arch.flags & TF_kernel_mode )
v->arch.pv.gs_base_kernel = gs_base;
else
v->arch.pv.gs_base_user = gs_base;
- asm volatile ( "swapgs" );
_toggle_guest_pt(v);
diff --git a/xen/arch/x86/pv/emul-priv-op.c b/xen/arch/x86/pv/emul-priv-op.c
index 87d3bbcf901f..81153084129a 100644
--- a/xen/arch/x86/pv/emul-priv-op.c
+++ b/xen/arch/x86/pv/emul-priv-op.c
@@ -25,6 +25,7 @@
#include <asm/pv/traps.h>
#include <asm/shared.h>
#include <asm/stubs.h>
+#include <asm/traps.h>
#include <xsm/xsm.h>
@@ -926,7 +927,7 @@ static int cf_check read_msr(
case MSR_GS_BASE:
if ( !cp->extd.lm )
break;
- *val = read_gs_base();
+ *val = opt_fred ? rdmsr(MSR_SHADOW_GS_BASE) : read_gs_base();
return X86EMUL_OKAY;
case MSR_SHADOW_GS_BASE:
@@ -1066,17 +1067,22 @@ static int cf_check write_msr(
if ( !cp->extd.lm || !is_canonical_address(val) )
break;
- if ( reg == MSR_FS_BASE )
- write_fs_base(val);
- else if ( reg == MSR_GS_BASE )
- write_gs_base(val);
- else if ( reg == MSR_SHADOW_GS_BASE )
+ switch ( reg )
{
- write_gs_shadow(val);
+ case MSR_FS_BASE:
+ write_fs_base(val);
+ break;
+
+ case MSR_SHADOW_GS_BASE:
curr->arch.pv.gs_base_user = val;
+ fallthrough;
+ case MSR_GS_BASE:
+ if ( (reg == MSR_GS_BASE) ^ opt_fred )
+ write_gs_base(val);
+ else
+ write_gs_shadow(val);
+ break;
}
- else
- ASSERT_UNREACHABLE();
return X86EMUL_OKAY;
case MSR_EFER:
diff --git a/xen/arch/x86/pv/misc-hypercalls.c b/xen/arch/x86/pv/misc-hypercalls.c
index 4c2abeb4add8..2c9cf50638db 100644
--- a/xen/arch/x86/pv/misc-hypercalls.c
+++ b/xen/arch/x86/pv/misc-hypercalls.c
@@ -11,6 +11,7 @@
#include <asm/debugreg.h>
#include <asm/fsgsbase.h>
+#include <asm/traps.h>
long do_set_debugreg(int reg, unsigned long value)
{
@@ -192,11 +193,12 @@ long do_set_segment_base(unsigned int which, unsigned long base)
case SEGBASE_GS_USER:
v->arch.pv.gs_base_user = base;
- write_gs_shadow(base);
- break;
-
+ fallthrough;
case SEGBASE_GS_KERNEL:
- write_gs_base(base);
+ if ( (which == SEGBASE_GS_KERNEL) ^ opt_fred )
+ write_gs_base(base);
+ else
+ write_gs_shadow(base);
break;
}
break;
@@ -209,7 +211,8 @@ long do_set_segment_base(unsigned int which, unsigned long base)
* We wish to update the user %gs from the GDT/LDT. Currently, the
* guest kernel's GS_BASE is in context.
*/
- asm volatile ( "swapgs" );
+ if ( !opt_fred )
+ asm volatile ( "swapgs" );
if ( sel > 3 )
/* Fix up RPL for non-NUL selectors. */
@@ -247,7 +250,8 @@ long do_set_segment_base(unsigned int which, unsigned long base)
/* Update the cache of the inactive base, as read from the GDT/LDT. */
v->arch.pv.gs_base_user = read_gs_base();
- asm volatile ( safe_swapgs );
+ if ( !opt_fred )
+ asm volatile ( safe_swapgs );
break;
}
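
A minimal sketch (not part of the posted patch) of the four cases covered by
the `(which == SEGBASE_GS_KERNEL) ^ opt_fred' selection above, derived from
this hunk and the toggle_guest_mode() comment about the two GSes being the
other way around under FRED:

    /*
     * Illustration only: which base each combination ends up writing.
     *
     *   which               opt_fred   written via
     *   SEGBASE_GS_KERNEL   false      write_gs_base()
     *   SEGBASE_GS_KERNEL   true       write_gs_shadow()
     *   SEGBASE_GS_USER     false      write_gs_shadow()
     *   SEGBASE_GS_USER     true       write_gs_base()
     */
    if ( (which == SEGBASE_GS_KERNEL) ^ opt_fred )
        write_gs_base(base);
    else
        write_gs_shadow(base);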
--
2.39.5
^ permalink raw reply related [flat|nested] 43+ messages in thread
* [PATCH v4 10/14] x86/pv: Guest exception handling in FRED mode
2026-02-27 23:16 [PATCH v4 00/14] x86: FRED support Andrew Cooper
` (8 preceding siblings ...)
2026-02-27 23:16 ` [PATCH v4 09/14] x86/pv: Adjust GS handling for FRED mode Andrew Cooper
@ 2026-02-27 23:16 ` Andrew Cooper
2026-02-27 23:16 ` [PATCH v4 11/14] x86/pv: ERETU error handling Andrew Cooper
` (3 subsequent siblings)
13 siblings, 0 replies; 43+ messages in thread
From: Andrew Cooper @ 2026-02-27 23:16 UTC (permalink / raw)
To: Xen-devel; +Cc: Andrew Cooper, Jan Beulich, Jan Beulich, Roger Pau Monné
Under FRED, entry_from_pv() handles everything. To start with, implement
exception handling in the same manner as entry_from_xen(), although we can
unconditionally enable interrupts after the async/fatal events.
After entry_from_pv() returns, test_all_events() needs to run to perform
exception and interrupt injection. Split entry_FRED_R3() into two and
introduce eretu_exit_to_guest() as the latter half, coming unilaterally from
restore_all_guest().
For all of this, there is a slightly complicated relationship with CONFIG_PV.
entry_FRED_R3() must exist irrespective of CONFIG_PV, because it's the
entrypoint registered with hardware. For simplicity, entry_from_pv() is
always called, but it collapses into fatal_trap() in the !PV case.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
---
CC: Jan Beulich <JBeulich@suse.com>
CC: Roger Pau Monné <roger.pau@citrix.com>
v4:
* Treat nested events as fatal.
v3:
* Adjust comments.
* Group CP with others. It's definitely wrong for perf, but that's out the
window anyway now that we're letting a compiler make the decision tree.
v2:
* New
---
xen/arch/x86/traps.c | 78 +++++++++++++++++++++++++++++++-
xen/arch/x86/x86_64/entry-fred.S | 13 +++++-
xen/arch/x86/x86_64/entry.S | 4 +-
3 files changed, 92 insertions(+), 3 deletions(-)
diff --git a/xen/arch/x86/traps.c b/xen/arch/x86/traps.c
index 48667c71d591..7563576fb477 100644
--- a/xen/arch/x86/traps.c
+++ b/xen/arch/x86/traps.c
@@ -2266,9 +2266,85 @@ void asmlinkage check_ist_exit(const struct cpu_user_regs *regs, bool ist_exit)
void asmlinkage entry_from_pv(struct cpu_user_regs *regs)
{
+ struct fred_info *fi = cpu_regs_fred_info(regs);
+ uint8_t type = regs->fred_ss.type;
+ uint8_t vec = regs->fred_ss.vector;
+
/* Copy fred_ss.vector into entry_vector as IDT delivery would have done. */
- regs->entry_vector = regs->fred_ss.vector;
+ regs->entry_vector = vec;
+
+ if ( !IS_ENABLED(CONFIG_PV) )
+ goto fatal;
+
+ /*
+ * First, handle the asynchronous or fatal events. These are either
+ * unrelated to the interrupted context, or may not have valid context
+ * recorded, and all have special rules on how/whether to re-enable IRQs.
+ */
+ if ( regs->fred_ss.nested )
+ goto fatal;
+
+ switch ( type )
+ {
+ case X86_ET_EXT_INTR:
+ return do_IRQ(regs);
+
+ case X86_ET_NMI:
+ return do_nmi(regs);
+
+ case X86_ET_HW_EXC:
+ switch ( vec )
+ {
+ case X86_EXC_DF: return do_double_fault(regs);
+ case X86_EXC_MC: return do_machine_check(regs);
+ }
+ break;
+ }
+ /*
+ * With the asynchronous events handled, what remains are the synchronous
+ * ones. PV guest context always had interrupts enabled.
+ */
+ local_irq_enable();
+
+ switch ( type )
+ {
+ case X86_ET_HW_EXC:
+ case X86_ET_PRIV_SW_EXC:
+ case X86_ET_SW_EXC:
+ switch ( vec )
+ {
+ case X86_EXC_PF: handle_PF(regs, fi->edata); break;
+ case X86_EXC_GP: do_general_protection(regs); break;
+ case X86_EXC_UD: do_invalid_op(regs); break;
+ case X86_EXC_NM: do_device_not_available(regs); break;
+ case X86_EXC_BP: do_int3(regs); break;
+ case X86_EXC_DB: handle_DB(regs, fi->edata); break;
+ case X86_EXC_CP: do_entry_CP(regs); break;
+
+ case X86_EXC_DE:
+ case X86_EXC_OF:
+ case X86_EXC_BR:
+ case X86_EXC_NP:
+ case X86_EXC_SS:
+ case X86_EXC_MF:
+ case X86_EXC_AC:
+ case X86_EXC_XM:
+ do_trap(regs);
+ break;
+
+ default:
+ goto fatal;
+ }
+ break;
+
+ default:
+ goto fatal;
+ }
+
+ return;
+
+ fatal:
fatal_trap(regs, false);
}
diff --git a/xen/arch/x86/x86_64/entry-fred.S b/xen/arch/x86/x86_64/entry-fred.S
index 3c3320df22cb..a1ff9a4a9747 100644
--- a/xen/arch/x86/x86_64/entry-fred.S
+++ b/xen/arch/x86/x86_64/entry-fred.S
@@ -15,9 +15,20 @@ FUNC(entry_FRED_R3, 4096)
mov %rsp, %rdi
call entry_from_pv
+#ifdef CONFIG_PV
+ GET_STACK_END(14)
+ movq STACK_CPUINFO_FIELD(current_vcpu)(%r14), %rbx
+
+ jmp test_all_events
+#else
+ BUG /* Not Reached */
+#endif
+END(entry_FRED_R3)
+
+FUNC(eretu_exit_to_guest)
POP_GPRS
eretu
-END(entry_FRED_R3)
+END(eretu_exit_to_guest)
/* The Ring0 entrypoint is at Ring3 + 0x100. */
.org entry_FRED_R3 + 0x100, 0xcc
diff --git a/xen/arch/x86/x86_64/entry.S b/xen/arch/x86/x86_64/entry.S
index 8b83082413a5..17ca6a493906 100644
--- a/xen/arch/x86/x86_64/entry.S
+++ b/xen/arch/x86/x86_64/entry.S
@@ -63,7 +63,7 @@ UNLIKELY_END(syscall_no_callback)
/* Conditionally clear DF */
and %esi, UREGS_eflags(%rsp)
/* %rbx: struct vcpu */
-test_all_events:
+LABEL(test_all_events, 0)
ASSERT_NOT_IN_ATOMIC
cli # tests must not race interrupts
/*test_softirqs:*/
@@ -152,6 +152,8 @@ END(switch_to_kernel)
FUNC_LOCAL(restore_all_guest)
ASSERT_INTERRUPTS_DISABLED
+ ALTERNATIVE "", "jmp eretu_exit_to_guest", X86_FEATURE_XEN_FRED
+
/* Stash guest SPEC_CTRL value while we can read struct vcpu. */
mov VCPU_arch_msrs(%rbx), %rdx
mov VCPUMSR_spec_ctrl_raw(%rdx), %r15d
--
2.39.5
^ permalink raw reply related [flat|nested] 43+ messages in thread
* [PATCH v4 11/14] x86/pv: ERETU error handling
2026-02-27 23:16 [PATCH v4 00/14] x86: FRED support Andrew Cooper
` (9 preceding siblings ...)
2026-02-27 23:16 ` [PATCH v4 10/14] x86/pv: Guest exception handling in " Andrew Cooper
@ 2026-02-27 23:16 ` Andrew Cooper
2026-02-27 23:16 ` [PATCH v4 12/14] x86/pv: System call handling in FRED mode Andrew Cooper
` (2 subsequent siblings)
13 siblings, 0 replies; 43+ messages in thread
From: Andrew Cooper @ 2026-02-27 23:16 UTC (permalink / raw)
To: Xen-devel; +Cc: Andrew Cooper, Jan Beulich, Jan Beulich, Roger Pau Monné
ERETU can fault for guest reasons, and like IRET needs special handling to
forward the error into the guest.
As this is largely written in C, take the opportunity to better classify the
sources of error, and in particular, not forward errors that are actually
Xen's fault into the guest, opting for a domain crash instead.
Because ERETU does not enable NMIs if it faults, a corner case exists if an
NMI was taken while in guest context, and the ERETU back out faults. Recovery
must involve an ERETS with the interrupted context's NMI flag.
See the comments for full details.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
---
CC: Jan Beulich <JBeulich@suse.com>
CC: Roger Pau Monné <roger.pau@citrix.com>
v4:
* Tweak comments.
v2:
* New
---
xen/arch/x86/traps.c | 115 +++++++++++++++++++++++++++++++
xen/arch/x86/x86_64/entry-fred.S | 13 ++++
2 files changed, 128 insertions(+)
diff --git a/xen/arch/x86/traps.c b/xen/arch/x86/traps.c
index 7563576fb477..2f40f628cbff 100644
--- a/xen/arch/x86/traps.c
+++ b/xen/arch/x86/traps.c
@@ -2348,6 +2348,113 @@ void asmlinkage entry_from_pv(struct cpu_user_regs *regs)
fatal_trap(regs, false);
}
+void nocall eretu_error_dom_crash(void);
+
+/*
+ * Classify an event at the ERETU instruction, and handle if possible.
+ * Returns @true if handled, @false if the event should continue down the
+ * normal handlers.
+ */
+static bool handle_eretu_event(struct cpu_user_regs *regs)
+{
+ unsigned long recover;
+
+ /*
+ * WARNING: The GPRs in gregs overlap with regs. Only gregs->error_code
+ * and later are legitimate to access.
+ */
+ struct cpu_user_regs *gregs =
+ _p(regs->rsp - offsetof(struct cpu_user_regs, error_code));
+
+ /*
+ * The asynchronous or fatal events (INTR, NMI, #MC, #DF) have been dealt
+ * with, meaning we only have synchronous ones to consider. Anything
+ * which isn't a hardware exception (e.g. #BP) wants handling normally.
+ */
+ if ( regs->fred_ss.type != X86_ET_HW_EXC )
+ return false;
+
+ /*
+ * Guests are permitted to write non-present GDT/LDT entries. Therefore
+ * #NP[sel] (%cs) and #SS[sel] (%ss) must be handled as guest errors. The
+ * only other source of #SS is for a bad %ss-relative memory access in
+ * Xen, and if the stack is that bad, we'll have escalated to #DF.
+ *
+ * #PF can happen from ERETU accessing the GDT/LDT. Xen may translate
+ * these into #GP for the guest, so must be handled as guest errors. In
+ * theory we can get #PF for a bad instruction fetch or bad stack access,
+ * but either of these will be fatal and not end up here.
+ */
+ switch ( regs->fred_ss.vector )
+ {
+ case X86_EXC_GP:
+ /*
+ * #GP[0] can occur because of a NULL %cs or %ss (which are a guest
+ * error), but some #GP[0]'s are errors in Xen (ERETU at SL != 0), or
+ * errors of Xen's handling of guest state (bad metadata).
+ *
+ * These magic numbers came from the FRED Spec; they check that ERETU
+ * is trying to return to Ring 3, and that reserved or inapplicable
+ * bits are 0.
+ */
+ if ( regs->error_code == 0 && (gregs->cs & ~3) && (gregs->ss & ~3) &&
+ (regs->fred_cs.sl != 0 ||
+ (gregs->csx & 0xffffffffffff0003UL) != 3 ||
+ (gregs->rflags & 0xffffffffffc2b02aUL) != 2 ||
+ (gregs->ssx & 0xfff80003UL) != 3) )
+ {
+ recover = (unsigned long)eretu_error_dom_crash;
+
+ if ( regs->fred_cs.sl )
+ gprintk(XENLOG_ERR, "ERETU at SL %u\n", regs->fred_cs.sl);
+ else
+ gprintk(XENLOG_ERR, "Bad return state: csx %#lx, rflags %#lx, ssx %#x\n",
+ gregs->csx, gregs->rflags, (unsigned int)gregs->ssx);
+ break;
+ }
+ fallthrough;
+ case X86_EXC_NP:
+ case X86_EXC_SS:
+ case X86_EXC_PF:
+ recover = (unsigned long)entry_FRED_R3;
+ break;
+
+ /*
+ * Handle everything else normally. e.g. #DB would be debugging
+ * activities in Xen. In theory we can get #UD if CR4.FRED gets
+ * cleared, but in practice if that were the case we wouldn't be here
+ * handling the result.
+ */
+ default:
+ return false;
+ }
+
+ this_cpu(last_extable_addr) = regs->rip;
+
+ /*
+ * If an NMI was taken in guest context and the ERETU faulted, NMIs will
+ * still be blocked. Therefore we copy the interrupted frame's NMI status
+ * into our own, and must ERETS as part of recovery.
+ */
+ regs->fred_ss.nmi = gregs->fred_ss.nmi;
+
+ /*
+ * Next, copy the exception information from the current frame back onto
+ * the interrupted frame, preserving the interrupted frame's %cs and %ss.
+ */
+ *cpu_regs_fred_info(regs) = *cpu_regs_fred_info(gregs);
+ gregs->ssx = (regs->ssx & ~0xffff) | gregs->ss;
+ gregs->csx = (regs->csx & ~0xffff) | gregs->cs;
+ gregs->error_code = regs->error_code;
+ gregs->entry_vector = regs->entry_vector;
+
+ fixup_exception_return(regs, recover, 0);
+
+ return true;
+}
+
+void nocall eretu(void);
+
void asmlinkage entry_from_xen(struct cpu_user_regs *regs)
{
struct fred_info *fi = cpu_regs_fred_info(regs);
@@ -2389,6 +2496,14 @@ void asmlinkage entry_from_xen(struct cpu_user_regs *regs)
if ( regs->eflags & X86_EFLAGS_IF )
local_irq_enable();
+ /*
+ * An event taken at the ERETU instruction may be because of bad guest
+ * state. If so, it will need special handling.
+ */
+ if ( unlikely(regs->rip == (unsigned long)eretu) &&
+ handle_eretu_event(regs) )
+ return;
+
switch ( type )
{
case X86_ET_HW_EXC:
diff --git a/xen/arch/x86/x86_64/entry-fred.S b/xen/arch/x86/x86_64/entry-fred.S
index a1ff9a4a9747..2fa57beb930c 100644
--- a/xen/arch/x86/x86_64/entry-fred.S
+++ b/xen/arch/x86/x86_64/entry-fred.S
@@ -27,9 +27,22 @@ END(entry_FRED_R3)
FUNC(eretu_exit_to_guest)
POP_GPRS
+
+ /*
+ * Exceptions here are handled by redirecting either to
+ * entry_FRED_R3() (for an error to be passed to the guest), or to
+ * eretu_error_dom_crash() (for a Xen error handling guest state).
+ */
+LABEL(eretu, 0)
eretu
END(eretu_exit_to_guest)
+FUNC(eretu_error_dom_crash)
+ PUSH_AND_CLEAR_GPRS
+ sti
+ call asm_domain_crash_synchronous /* Does not return */
+END(eretu_error_dom_crash)
+
/* The Ring0 entrypoint is at Ring3 + 0x100. */
.org entry_FRED_R3 + 0x100, 0xcc
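
A rough sketch (not part of the posted patch) of the stack relationship that
handle_eretu_event() relies on, assuming the usual cpu_user_regs layout with
the GPRs below error_code. When ERETU faults, POP_GPRS has already run, so
the interrupted %rsp recorded in the new event's frame points at the
error_code member of the frame Xen was returning to:

    /* Recover the guest-bound frame from the faulting frame's saved %rsp. */
    struct cpu_user_regs *gregs =
        _p(regs->rsp - offsetof(struct cpu_user_regs, error_code));

    /*
     * Only gregs->error_code and later are valid; the GPR slots below that
     * address overlap the new event's own frame (regs), hence the warning
     * in handle_eretu_event().
     */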
--
2.39.5
^ permalink raw reply related [flat|nested] 43+ messages in thread
* [PATCH v4 12/14] x86/pv: System call handling in FRED mode
2026-02-27 23:16 [PATCH v4 00/14] x86: FRED support Andrew Cooper
` (10 preceding siblings ...)
2026-02-27 23:16 ` [PATCH v4 11/14] x86/pv: ERETU error handling Andrew Cooper
@ 2026-02-27 23:16 ` Andrew Cooper
2026-03-09 22:25 ` Andrew Cooper
2026-02-27 23:16 ` [PATCH v4 13/14] x86: Clamp reserved bits in eflags more aggressively Andrew Cooper
2026-02-27 23:16 ` [PATCH v4 14/14] x86/traps: Use fatal_trap() for #UD and #GP Andrew Cooper
13 siblings, 1 reply; 43+ messages in thread
From: Andrew Cooper @ 2026-02-27 23:16 UTC (permalink / raw)
To: Xen-devel; +Cc: Andrew Cooper, Jan Beulich, Jan Beulich, Roger Pau Monné
Under FRED, entry_from_pv() handles everything, even system call instructions.
This means more of our logic is written in C now, rather than assembly.
In order to facilitate this, introduce pv_inject_callback(), which reuses
struct trap_bounce infrastructure to inject the syscall/sysenter callbacks.
This in turn requires some !PV compatibility for pv_inject_callback() and
pv_hypercall() which can both be ASSERT_UNREACHABLE().
For each of INT $N, SYSCALL and SYSENTER, FRED gives us interrupted context
which was previously lost. As the guest can't see FRED, Xen has to lose state
in the same way to maintain the prior behaviour.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
---
CC: Jan Beulich <JBeulich@suse.com>
CC: Roger Pau Monné <roger.pau@citrix.com>
v3:
* Simplify DCE handling.
* Add ASSERT_UNREACHABLE() to pv_inject_callback().
* Adjust comment for X86_ET_SW_INT
v2:
* New
---
xen/arch/x86/include/asm/domain.h | 2 +
xen/arch/x86/include/asm/hypercall.h | 2 -
xen/arch/x86/pv/traps.c | 39 ++++++++
xen/arch/x86/traps.c | 131 +++++++++++++++++++++++++++
4 files changed, 172 insertions(+), 2 deletions(-)
diff --git a/xen/arch/x86/include/asm/domain.h b/xen/arch/x86/include/asm/domain.h
index 94b0cf7f1d95..ad7f6adb2cb9 100644
--- a/xen/arch/x86/include/asm/domain.h
+++ b/xen/arch/x86/include/asm/domain.h
@@ -725,6 +725,8 @@ void arch_vcpu_regs_init(struct vcpu *v);
struct vcpu_hvm_context;
int arch_set_info_hvm_guest(struct vcpu *v, const struct vcpu_hvm_context *ctx);
+void pv_inject_callback(unsigned int type);
+
#ifdef CONFIG_PV
void pv_inject_event(const struct x86_event *event);
#else
diff --git a/xen/arch/x86/include/asm/hypercall.h b/xen/arch/x86/include/asm/hypercall.h
index bf2f0e169aef..d042a61d1702 100644
--- a/xen/arch/x86/include/asm/hypercall.h
+++ b/xen/arch/x86/include/asm/hypercall.h
@@ -18,9 +18,7 @@
#define __HYPERVISOR_paging_domctl_cont __HYPERVISOR_arch_1
-#ifdef CONFIG_PV
void pv_hypercall(struct cpu_user_regs *regs);
-#endif
void pv_ring1_init_hypercall_page(void *p);
void pv_ring3_init_hypercall_page(void *p);
diff --git a/xen/arch/x86/pv/traps.c b/xen/arch/x86/pv/traps.c
index b0395b99145a..c863ab9d372a 100644
--- a/xen/arch/x86/pv/traps.c
+++ b/xen/arch/x86/pv/traps.c
@@ -20,6 +20,8 @@
#include <asm/shared.h>
#include <asm/traps.h>
+#include <public/callback.h>
+
void pv_inject_event(const struct x86_event *event)
{
struct vcpu *curr = current;
@@ -96,6 +98,43 @@ void pv_inject_event(const struct x86_event *event)
}
}
+void pv_inject_callback(unsigned int type)
+{
+ struct vcpu *curr = current;
+ struct trap_bounce *tb = &curr->arch.pv.trap_bounce;
+ unsigned long rip;
+ bool irq;
+
+ ASSERT(is_pv_64bit_vcpu(curr));
+
+ switch ( type )
+ {
+ case CALLBACKTYPE_syscall:
+ rip = curr->arch.pv.syscall_callback_eip;
+ irq = curr->arch.pv.vgc_flags & VGCF_syscall_disables_events;
+ break;
+
+ case CALLBACKTYPE_syscall32:
+ rip = curr->arch.pv.syscall32_callback_eip;
+ irq = curr->arch.pv.syscall32_disables_events;
+ break;
+
+ case CALLBACKTYPE_sysenter:
+ rip = curr->arch.pv.sysenter_callback_eip;
+ irq = curr->arch.pv.sysenter_disables_events;
+ break;
+
+ default:
+ ASSERT_UNREACHABLE();
+ rip = 0;
+ irq = false;
+ break;
+ }
+
+ tb->flags = TBF_EXCEPTION | (irq ? TBF_INTERRUPT : 0);
+ tb->eip = rip;
+}
+
/*
* Called from asm to set up the MCE trapbounce info.
* Returns false no callback is set up, else true.
diff --git a/xen/arch/x86/traps.c b/xen/arch/x86/traps.c
index 2f40f628cbff..e2c35a046e6b 100644
--- a/xen/arch/x86/traps.c
+++ b/xen/arch/x86/traps.c
@@ -18,6 +18,7 @@
#include <xen/delay.h>
#include <xen/domain_page.h>
#include <xen/guest_access.h>
+#include <xen/hypercall.h>
#include <xen/init.h>
#include <xen/mm.h>
#include <xen/paging.h>
@@ -51,6 +52,8 @@
#include <asm/traps.h>
#include <asm/uaccess.h>
+#include <public/callback.h>
+
/*
* opt_nmi: one of 'ignore', 'dom0', or 'fatal'.
* fatal: Xen prints diagnostic message and then hangs.
@@ -2267,6 +2270,7 @@ void asmlinkage check_ist_exit(const struct cpu_user_regs *regs, bool ist_exit)
void asmlinkage entry_from_pv(struct cpu_user_regs *regs)
{
struct fred_info *fi = cpu_regs_fred_info(regs);
+ struct vcpu *curr = current;
uint8_t type = regs->fred_ss.type;
uint8_t vec = regs->fred_ss.vector;
@@ -2309,6 +2313,38 @@ void asmlinkage entry_from_pv(struct cpu_user_regs *regs)
switch ( type )
{
+ case X86_ET_SW_INT:
+ /*
+ * For better or worse, Xen writes IDT vectors 3 and 4 with DPL3 (so
+ * INT3/INTO work), making INT $3/4 indistinguishable, and the guest
+ * choice of DPL for these vectors is ignored.
+ *
+ * Have them fall through into X86_ET_HW_EXC, as #BP in particular
+ * needs handling by do_int3() in case an external debugger is
+ * attached.
+ *
+ * As the event type is provided, INT $N instructions don't need #GP
+ * tricks to spot, and INT $0x80 doesn't need a fastpath. As the
+ * guest is necessarily PV64, INT $0x82 has no special meaning either.
+ *
+ * When converting to a fault, hardware finally gives us enough
+ * information to account for prefixes, so provide the more correct
+ * behaviour rather than assuming the instruction was two bytes long.
+ */
+ if ( vec != X86_EXC_BP && vec != X86_EXC_OF )
+ {
+ const struct trap_info *ti = &curr->arch.pv.trap_ctxt[vec];
+
+ if ( permit_softint(TI_GET_DPL(ti), curr, regs) )
+ pv_inject_sw_interrupt(vec);
+ else
+ {
+ regs->rip -= regs->fred_ss.insnlen;
+ pv_inject_hw_exception(X86_EXC_GP, (vec << 3) | X86_XEC_IDT);
+ }
+ break;
+ }
+ fallthrough;
case X86_ET_HW_EXC:
case X86_ET_PRIV_SW_EXC:
case X86_ET_SW_EXC:
@@ -2338,6 +2374,101 @@ void asmlinkage entry_from_pv(struct cpu_user_regs *regs)
}
break;
+ case X86_ET_OTHER:
+ switch ( regs->fred_ss.vector )
+ {
+ case 1: /* SYSCALL */
+ {
+ /*
+ * FRED delivery preserves the interrupted %cs/%ss, but previously
+ * SYSCALL lost the interrupted selectors, and SYSRET forced the
+ * use of the ones in MSR_STAR.
+ *
+ * The guest isn't aware of FRED, so recreate the legacy
+ * behaviour.
+ *
+ * The non-FRED SYSCALL path sets TRAP_syscall in entry_vector to
+ * signal that SYSRET can be used, but this isn't relevant in FRED
+ * mode.
+ *
+ * When setting the selectors, clear all upper metadata again for
+ * backwards compatibility. In particular fred_ss.swint becomes
+ * pend_DB on ERETx, and nothing else in the pv_hypercall() would
+ * clean up.
+ *
+ * When converting to a fault, hardware finally gives us enough
+ * information to account for prefixes, so provide the more
+ * correct behaviour rather than assuming the instruction was two
+ * bytes long.
+ */
+ bool l = regs->fred_ss.l;
+ unsigned int len = regs->fred_ss.insnlen;
+
+ regs->ssx = l ? FLAT_KERNEL_SS : FLAT_USER_SS32;
+ regs->csx = l ? FLAT_KERNEL_CS64 : FLAT_USER_CS32;
+
+ if ( guest_kernel_mode(curr, regs) )
+ pv_hypercall(regs);
+ else if ( (l ? curr->arch.pv.syscall_callback_eip
+ : curr->arch.pv.syscall32_callback_eip) == 0 )
+ {
+ regs->rip -= len;
+ pv_inject_hw_exception(X86_EXC_UD, X86_EVENT_NO_EC);
+ }
+ else
+ {
+ /*
+ * The PV ABI, given no virtual SYSCALL_MASK, hardcodes that
+ * DF is cleared. Other flags are handled in the same way as
+ * interrupts and exceptions in create_bounce_frame().
+ */
+ regs->eflags &= ~X86_EFLAGS_DF;
+ pv_inject_callback(l ? CALLBACKTYPE_syscall
+ : CALLBACKTYPE_syscall32);
+ }
+ break;
+ }
+
+ case 2: /* SYSENTER */
+ {
+ /*
+ * FRED delivery preserves the interrupted state, but previously
+ * SYSENTER discarded almost everything.
+ *
+ * The guest isn't aware of FRED, so recreate the legacy
+ * behaviour.
+ *
+ * When setting the selectors, clear all upper metadata. In
+ * particular fred_ss.swint becomes pend_DB on ERETx.
+ *
+ * When converting to a fault, hardware finally gives us enough
+ * information to account for prefixes, so provide the more
+ * correct behaviour rather than assuming the instruction was two
+ * bytes long.
+ */
+ unsigned int len = regs->fred_ss.insnlen;
+
+ regs->ssx = FLAT_USER_SS;
+ regs->rsp = 0;
+ regs->eflags &= ~(X86_EFLAGS_VM | X86_EFLAGS_IF);
+ regs->csx = 3;
+ regs->rip = 0;
+
+ if ( !curr->arch.pv.sysenter_callback_eip )
+ {
+ regs->rip -= len;
+ pv_inject_hw_exception(X86_EXC_GP, 0);
+ }
+ else
+ pv_inject_callback(CALLBACKTYPE_sysenter);
+ break;
+ }
+
+ default:
+ goto fatal;
+ }
+ break;
+
default:
goto fatal;
}
--
2.39.5
^ permalink raw reply related [flat|nested] 43+ messages in thread
* [PATCH v4 13/14] x86: Clamp reserved bits in eflags more aggressively
2026-02-27 23:16 [PATCH v4 00/14] x86: FRED support Andrew Cooper
` (11 preceding siblings ...)
2026-02-27 23:16 ` [PATCH v4 12/14] x86/pv: System call handling in FRED mode Andrew Cooper
@ 2026-02-27 23:16 ` Andrew Cooper
2026-03-02 16:35 ` Jan Beulich
2026-03-11 17:58 ` [PATCH v4.1 13/14] x86: Clamp " Andrew Cooper
2026-02-27 23:16 ` [PATCH v4 14/14] x86/traps: Use fatal_trap() for #UD and #GP Andrew Cooper
13 siblings, 2 replies; 43+ messages in thread
From: Andrew Cooper @ 2026-02-27 23:16 UTC (permalink / raw)
To: Xen-devel; +Cc: Andrew Cooper, Jan Beulich, Roger Pau Monné
ERETU, unlike IRET, requires the sticky-1 bit (bit 2) be set, and reserved
bits to be clear. Notably this means that dom0_construct() must set
X86_EFLAGS_MBS in order for a PV dom0 to start.
Xen has been overly lax with reserved bit handling. Adjust
arch_set_info_guest*() and hypercall_iret() which consume flags to clamp the
reserved bits for all guest types.
This is a minor ABI change, but by the same argument as commit
9f892f84c279 ("x86/domctl: Stop using XLAT_cpu_user_regs()"); the reserved
bits would get clamped like this naturally by hardware when the vCPU is run.
This allows PV guests to start when Xen is using FRED mode.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
CC: Jan Beulich <JBeulich@suse.com>
CC: Roger Pau Monné <roger.pau@citrix.com>
Still slightly RFC. Testing still in progress.
v3:
* Rewrite the commit message.
v2:
* New
The handling of VM is complicated.
It turns out that it's simply ignored by IRET in Long Mode (i.e. clearing it in
commit 0e47f92b0725 ("x86: force EFLAGS.IF on when exiting to PV guests")
wasn't actually necessary), but ERETU does care.
But, it's unclear how to handle this in in arch_set_info(). We must preserve
it for HVM guests (which can use vm86 mode). PV32 has special handling but
only in hypercall_iret(), not in arch_set_info().
---
xen/arch/x86/domain.c | 4 ++--
xen/arch/x86/hvm/domain.c | 4 ++--
xen/arch/x86/include/asm/x86-defns.h | 7 +++++++
xen/arch/x86/pv/dom0_build.c | 2 +-
xen/arch/x86/pv/iret.c | 8 +++++---
5 files changed, 17 insertions(+), 8 deletions(-)
diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
index 9c1f6ef76d52..1372a65d8123 100644
--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -1244,7 +1244,7 @@ int arch_set_info_guest(
v->arch.user_regs.rax = c.nat->user_regs.rax;
v->arch.user_regs.rip = c.nat->user_regs.rip;
v->arch.user_regs.cs = c.nat->user_regs.cs;
- v->arch.user_regs.rflags = c.nat->user_regs.rflags;
+ v->arch.user_regs.rflags = (c.nat->user_regs.rflags & X86_EFLAGS_ALL) | X86_EFLAGS_MBS;
v->arch.user_regs.rsp = c.nat->user_regs.rsp;
v->arch.user_regs.ss = c.nat->user_regs.ss;
v->arch.pv.es = c.nat->user_regs.es;
@@ -1268,7 +1268,7 @@ int arch_set_info_guest(
v->arch.user_regs.eax = c.cmp->user_regs.eax;
v->arch.user_regs.eip = c.cmp->user_regs.eip;
v->arch.user_regs.cs = c.cmp->user_regs.cs;
- v->arch.user_regs.eflags = c.cmp->user_regs.eflags;
+ v->arch.user_regs.eflags = (c.cmp->user_regs.eflags & X86_EFLAGS_ALL) | X86_EFLAGS_MBS;
v->arch.user_regs.esp = c.cmp->user_regs.esp;
v->arch.user_regs.ss = c.cmp->user_regs.ss;
v->arch.pv.es = c.cmp->user_regs.es;
diff --git a/xen/arch/x86/hvm/domain.c b/xen/arch/x86/hvm/domain.c
index 155d61db13f8..a0e811ea47a0 100644
--- a/xen/arch/x86/hvm/domain.c
+++ b/xen/arch/x86/hvm/domain.c
@@ -194,7 +194,7 @@ int arch_set_info_hvm_guest(struct vcpu *v, const struct vcpu_hvm_context *ctx)
uregs->rsi = regs->esi;
uregs->rdi = regs->edi;
uregs->rip = regs->eip;
- uregs->rflags = regs->eflags;
+ uregs->rflags = (regs->eflags & X86_EFLAGS_ALL) | X86_EFLAGS_MBS;
v->arch.hvm.guest_cr[0] = regs->cr0;
v->arch.hvm.guest_cr[3] = regs->cr3;
@@ -245,7 +245,7 @@ int arch_set_info_hvm_guest(struct vcpu *v, const struct vcpu_hvm_context *ctx)
uregs->rsi = regs->rsi;
uregs->rdi = regs->rdi;
uregs->rip = regs->rip;
- uregs->rflags = regs->rflags;
+ uregs->rflags = (regs->rflags & X86_EFLAGS_ALL) | X86_EFLAGS_MBS;
v->arch.hvm.guest_cr[0] = regs->cr0;
v->arch.hvm.guest_cr[3] = regs->cr3;
diff --git a/xen/arch/x86/include/asm/x86-defns.h b/xen/arch/x86/include/asm/x86-defns.h
index 0a0ba83de786..edeb0b4ff95a 100644
--- a/xen/arch/x86/include/asm/x86-defns.h
+++ b/xen/arch/x86/include/asm/x86-defns.h
@@ -27,6 +27,13 @@
(X86_EFLAGS_CF | X86_EFLAGS_PF | X86_EFLAGS_AF | \
X86_EFLAGS_ZF | X86_EFLAGS_SF | X86_EFLAGS_OF)
+#define X86_EFLAGS_ALL \
+ (X86_EFLAGS_ARITH_MASK | X86_EFLAGS_TF | X86_EFLAGS_IF | \
+ X86_EFLAGS_DF | X86_EFLAGS_OF | X86_EFLAGS_IOPL | \
+ X86_EFLAGS_NT | X86_EFLAGS_RF | X86_EFLAGS_VM | \
+ X86_EFLAGS_AC | X86_EFLAGS_VIF | X86_EFLAGS_VIP | \
+ X86_EFLAGS_ID)
+
/*
* Intel CPU flags in CR0
*/
diff --git a/xen/arch/x86/pv/dom0_build.c b/xen/arch/x86/pv/dom0_build.c
index 9a11a0a16b4e..075a3646c2a3 100644
--- a/xen/arch/x86/pv/dom0_build.c
+++ b/xen/arch/x86/pv/dom0_build.c
@@ -1024,7 +1024,7 @@ static int __init dom0_construct(const struct boot_domain *bd)
regs->rip = parms.virt_entry;
regs->rsp = vstack_end;
regs->rsi = vstartinfo_start;
- regs->eflags = X86_EFLAGS_IF;
+ regs->eflags = X86_EFLAGS_IF | X86_EFLAGS_MBS;
/*
* We don't call arch_set_info_guest(), so some initialisation needs doing
diff --git a/xen/arch/x86/pv/iret.c b/xen/arch/x86/pv/iret.c
index d3a1fb2c685b..39ce316b8d91 100644
--- a/xen/arch/x86/pv/iret.c
+++ b/xen/arch/x86/pv/iret.c
@@ -80,8 +80,9 @@ long do_iret(void)
regs->rip = iret_saved.rip;
regs->cs = iret_saved.cs | 3; /* force guest privilege */
- regs->rflags = ((iret_saved.rflags & ~(X86_EFLAGS_IOPL|X86_EFLAGS_VM))
- | X86_EFLAGS_IF);
+ regs->rflags = ((iret_saved.rflags & X86_EFLAGS_ALL &
+ ~(X86_EFLAGS_IOPL | X86_EFLAGS_VM)) |
+ X86_EFLAGS_IF | X86_EFLAGS_MBS);
regs->rsp = iret_saved.rsp;
regs->ss = iret_saved.ss | 3; /* force guest privilege */
@@ -143,7 +144,8 @@ int compat_iret(void)
if ( VM_ASSIST(v->domain, architectural_iopl) )
v->arch.pv.iopl = eflags & X86_EFLAGS_IOPL;
- regs->eflags = (eflags & ~X86_EFLAGS_IOPL) | X86_EFLAGS_IF;
+ regs->eflags = ((eflags & X86_EFLAGS_ALL & ~X86_EFLAGS_IOPL) |
+ X86_EFLAGS_IF | X86_EFLAGS_MBS);
if ( unlikely(eflags & X86_EFLAGS_VM) )
{
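
For readers skimming the hunks above, the core clamp repeated in each of them
is the same expression (a restatement for clarity, not something the patch
adds separately):

    /* Keep only architecturally defined flags; force the sticky-1 bit (bit 2). */
    eflags = (eflags & X86_EFLAGS_ALL) | X86_EFLAGS_MBS;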
--
2.39.5
^ permalink raw reply related [flat|nested] 43+ messages in thread
* [PATCH v4 14/14] x86/traps: Use fatal_trap() for #UD and #GP
2026-02-27 23:16 [PATCH v4 00/14] x86: FRED support Andrew Cooper
` (12 preceding siblings ...)
2026-02-27 23:16 ` [PATCH v4 13/14] x86: Clamp reserved bits in eflags more aggressively Andrew Cooper
@ 2026-02-27 23:16 ` Andrew Cooper
2026-03-02 16:39 ` Jan Beulich
2026-03-02 16:40 ` Jan Beulich
13 siblings, 2 replies; 43+ messages in thread
From: Andrew Cooper @ 2026-02-27 23:16 UTC (permalink / raw)
To: Xen-devel; +Cc: Andrew Cooper, Jan Beulich, Roger Pau Monné
This renders the diagnostics in a more uniform way.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
CC: Jan Beulich <JBeulich@suse.com>
CC: Roger Pau Monné <roger.pau@citrix.com>
v4:
* New
PF and CP are more complicated and not converted yet.
---
xen/arch/x86/traps.c | 6 ++----
1 file changed, 2 insertions(+), 4 deletions(-)
diff --git a/xen/arch/x86/traps.c b/xen/arch/x86/traps.c
index e2c35a046e6b..c04ab484ad27 100644
--- a/xen/arch/x86/traps.c
+++ b/xen/arch/x86/traps.c
@@ -1375,8 +1375,7 @@ void asmlinkage do_invalid_op(struct cpu_user_regs *regs)
if ( likely(extable_fixup(regs, true)) )
return;
- show_execution_state(regs);
- panic("FATAL TRAP: vector = %d (invalid opcode)\n", X86_EXC_UD);
+ fatal_trap(regs, false);
}
void asmlinkage do_int3(struct cpu_user_regs *regs)
@@ -1475,8 +1474,7 @@ void do_general_protection(struct cpu_user_regs *regs)
return;
hardware_gp:
- show_execution_state(regs);
- panic("GENERAL PROTECTION FAULT\n[error_code=%04x]\n", regs->error_code);
+ fatal_trap(regs, false);
}
#ifdef CONFIG_PV
--
2.39.5
^ permalink raw reply related [flat|nested] 43+ messages in thread
* Re: [PATCH v4 01/14] x86/pv: Don't assume that INT $imm8 instructions are two bytes long
2026-02-27 23:16 ` [PATCH v4 01/14] x86/pv: Don't assume that INT $imm8 instructions are two bytes long Andrew Cooper
@ 2026-03-02 11:03 ` Jan Beulich
2026-03-02 11:43 ` Andrew Cooper
0 siblings, 1 reply; 43+ messages in thread
From: Jan Beulich @ 2026-03-02 11:03 UTC (permalink / raw)
To: Andrew Cooper; +Cc: Roger Pau Monné, Xen-devel
On 28.02.2026 00:16, Andrew Cooper wrote:
> For INT $N instructions (besides $0x80 for which there is a dedicated fast
> path), handling is mostly fault-based because of DPL0 gates in the IDT. This
> means that when the guest kernel allows the instruction too, Xen must
> increment %rip to the end of the instruction before passing a trap to the
> guest kernel.
>
> When an INT $N instruction has a prefix, it's longer than two bytes, and Xen
> will deliver the "trap" with %rip pointing into the middle of the instruction.
>
> Introduce a new pv_emulate_sw_interrupt() which uses x86_insn_length() to
> determine the instruction length, rather than assuming two.
>
> This is a change in behaviour for PV guests, but the prior behaviour cannot
> reasonably be said to be intentional.
>
> This change does not affect the INT $0x80 fastpath. Prefixed INT $N
> instructions occur almost exclusively in test code or exploits, and INT $0x80
> appears to be the only user-usable interrupt gate in contemporary PV guests.
Whereas for the slow path, while the subtracting of 2 from %rip there isn't
quite right either, the insn size determination here would then simply yield
2 as well, so all is good for that case as well.
> @@ -1401,6 +1402,53 @@ int pv_emulate_privileged_op(struct cpu_user_regs *regs)
> return 0;
> }
>
> +/*
> + * Hardware already decoded the INT $N instruction and determined that there
> + * was a DPL issue, hence the #GP. Xen has already determined that the guest
> + * kernel has permitted this software interrupt.
> + *
> + * All that is needed is the instruction length, to turn the fault into a
> + * trap. All errors are turned back into the original #GP, as that's the
> + * action that really happened.
> + */
> +void pv_emulate_sw_interrupt(struct cpu_user_regs *regs)
> +{
> + struct vcpu *curr = current;
> + struct domain *currd = curr->domain;
> + struct priv_op_ctxt ctxt = {
> + .ctxt.regs = regs,
> + .ctxt.lma = !is_pv_32bit_domain(currd),
The difference may not be overly significant here, but 64-bit guests can run
32-bit code, so setting .lma seems wrong in that case. As it ought to be
largely benign, perhaps the code could even be left as is, just with a comment
to clarify things?
> + };
> + struct x86_emulate_state *state;
> + uint8_t vector = regs->error_code >> 3;
> + unsigned int len, ar;
> +
> + if ( !pv_emul_read_descriptor(regs->cs, curr, &ctxt.cs.base,
> + &ctxt.cs.limit, &ar, 1) ||
> + !(ar & _SEGMENT_S) ||
> + !(ar & _SEGMENT_P) ||
> + !(ar & _SEGMENT_CODE) )
> + goto error;
> +
> + state = x86_decode_insn(&ctxt.ctxt, insn_fetch);
> + if ( IS_ERR_OR_NULL(state) )
> + goto error;
> +
> + len = x86_insn_length(state, &ctxt.ctxt);
> + x86_emulate_free_state(state);
> +
> + /* Note: Checked slightly late to simplify 'state' handling. */
> + if ( ctxt.ctxt.opcode != 0xcd /* INT $imm8 */ )
> + goto error;
> +
> + regs->rip += len;
> + pv_inject_sw_interrupt(vector);
> + return;
> +
> + error:
> + pv_inject_hw_exception(X86_EXC_GP, regs->entry_vector);
DYM regs->error_code here? Might it alternatively make sense to return a
boolean here, for ...
> --- a/xen/arch/x86/traps.c
> +++ b/xen/arch/x86/traps.c
> @@ -1379,8 +1379,7 @@ void do_general_protection(struct cpu_user_regs *regs)
>
> if ( permit_softint(TI_GET_DPL(ti), v, regs) )
> {
> - regs->rip += 2;
> - pv_inject_sw_interrupt(vector);
> + pv_emulate_sw_interrupt(regs);
> return;
... the return here to become conditional, leveraging the #GP injection at
the bottom of this function?
Jan
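
A minimal sketch of the shape being suggested here (hypothetical; not what
the posted patch does), with pv_emulate_sw_interrupt() returning true when it
injected the software interrupt and false on any decode error:

    if ( permit_softint(TI_GET_DPL(ti), v, regs) )
    {
        if ( pv_emulate_sw_interrupt(regs) )
            return;
        /* Otherwise fall through to the #GP injection below. */
    }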
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH v4 02/14] docs/guest-guide: Describe the PV traps and entrypoints ABI
2026-02-27 23:16 ` [PATCH v4 02/14] docs/guest-guide: Describe the PV traps and entrypoints ABI Andrew Cooper
@ 2026-03-02 11:19 ` Jan Beulich
2026-03-02 14:47 ` Andrew Cooper
0 siblings, 1 reply; 43+ messages in thread
From: Jan Beulich @ 2026-03-02 11:19 UTC (permalink / raw)
To: Andrew Cooper; +Cc: Roger Pau Monné, Xen-devel
On 28.02.2026 00:16, Andrew Cooper wrote:
> ... seeing as I've had to thoroughly reverse engineer it for FRED and make
> tweaks in places.
>
> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
> --- /dev/null
> +++ b/docs/guest-guide/x86/pv-traps.rst
> @@ -0,0 +1,123 @@
> +.. SPDX-License-Identifier: CC-BY-4.0
> +
> +PV Traps and Entrypoints
> +========================
> +
> +.. note::
> +
> + The details here are specific to 64bit builds of Xen. Details for 32bit
> + builds of Xen, are different and not discussed further.
Nit: Stray comma?
> +PV guests are subject to Xen's linkage setup for events (interrupts,
> +exceptions and system calls). x86's IDT architecture and limitations are the
> +majority influence on the PV ABI.
> +
> +All external interrupts are routed to PV guests via the :term:`Event Channel`
> +interface, and not discussed further here.
> +
> +What remain are exceptions, and the instructions which cause a control
> +transfer. In the x86 architecture, the instructions relevant for PV guests
> +are:
> +
> + * ``INT3``, which generates ``#BP``.
> +
> + * ``INTO``, which generates ``#OF`` only if the overflow flag is set. It is
> + only usable in compatibility mode, and will ``#UD`` in 64bit mode.
> +
> + * ``CALL (far)`` referencing a gate in the GDT.
> +
> + * ``INT $N``, which invokes an arbitrary IDT gate. These four instructions
> + so far all check the gate DPL and will ``#GP`` otherwise.
> +
> + * ``INT1``, also known as ``ICEBP``, which generates ``#DB``. This
> + instruction does *not* check DPL, and can be used unconditionally by
> + userspace.
> +
> + * ``SYSCALL``, which enters CPL0 as configured by the ``{C,L,}STAR`` MSRs.
> + It is usable if enabled by ``MSR_EFER.SCE``, and will ``#UD`` otherwise.
> + On Intel parts, ``SYSCALL`` is unusable outside of 64bit mode.
> +
> + * ``SYSENTER``, which enters CPL0 as configured by the ``SEP`` MSRs. It is
> + usable if enabled by ``MSR_SYSENTER_CS`` having a non-NUL selector, and
> + will ``#GP`` otherwise. On AMD parts, ``SYSENTER`` is unusable in Long
> + mode.
The UD<n> family of insns is kind of a hybrid: They explicitly generate #UD,
and hence do a control transfer. Same for at least BOUND. It's not quite clear
whether they should be enumerated here as well.
> +Xen's configuration
> +-------------------
> +
> +Xen maintains a complete IDT, with most gates configured with DPL0. This
> +causes most ``INT $N`` instructions to ``#GP``. This allows Xen to emulate
> +the instruction, referring to the guest kernel's vDPL choice.
> +
> + * Vectors 3 ``#BP`` and 4 ``#OF`` are DPL3, in order to allow the ``INT3``
> + and ``INTO`` instructions to function in userspace.
> +
> + * Vector 0x80 is DPL3 in order to implement the legacy system call fastpath
> + commonly found in UNIXes.
Much like we make this DPL0 when PV=n, should we perhaps make vectors 3 and 4
DPL0 as well in that case (just for formality's sake)? Maybe 4, like 9, would
even want to be an autogen entry point then?
Jan
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH v4 03/14] x86/boot: Move gdt_l1e caching out of traps_init()
2026-02-27 23:16 ` [PATCH v4 03/14] x86/boot: Move gdt_l1e caching out of traps_init() Andrew Cooper
@ 2026-03-02 11:33 ` Jan Beulich
0 siblings, 0 replies; 43+ messages in thread
From: Jan Beulich @ 2026-03-02 11:33 UTC (permalink / raw)
To: Andrew Cooper; +Cc: Roger Pau Monné, Xen-devel
On 28.02.2026 00:16, Andrew Cooper wrote:
> Commit 564d261687c0 ("x86/ctxt-switch: Document and improve GDT handling") put
> the initialisation of {,compat_}gdt_l1e into traps_init() but this wasn't a
> great choice. Instead, put it in smp_prepare_cpus() which performs the BSP
> preparation of variables normally set up by cpu_smpboot_alloc() for APs.
>
> This removes an implicit dependency that prevents traps_init() moving earlier
> than move_xen() in the boot sequence.
>
> No functional change.
>
> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
> I'm on the fence about the ASSERT(), but I'm getting rather tired of unstated
> dependencies. For a PV64 guest using SYSEXIT to enter the guest, it's the
> first interrupt/exception which references the GDT, which could be after the
> guest is running.
I think that's okay to have there. "Unstated dependencies" is of course a wide
field, and going too far with assertions is also a risk.
Jan
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH v4 04/14] x86/boot: Document the ordering dependency of _svm_cpu_up()
2026-02-27 23:16 ` [PATCH v4 04/14] x86/boot: Document the ordering dependency of _svm_cpu_up() Andrew Cooper
@ 2026-03-02 11:35 ` Jan Beulich
2026-03-02 15:20 ` Andrew Cooper
1 sibling, 0 replies; 43+ messages in thread
From: Jan Beulich @ 2026-03-02 11:35 UTC (permalink / raw)
To: Andrew Cooper; +Cc: Roger Pau Monné, Xen-devel
On 28.02.2026 00:16, Andrew Cooper wrote:
Let's just say this took an unreasonable amount of time and effort to track
> down, when trying to move traps_init() earlier during boot.
>
> When the SYSCALL linkage MSRs are not configured ahead of _svm_cpu_up() on the
> BSP, the first context switch into PV uses svm_load_segs() and clobbers the
later-set-up linkage with the 0's cached here, causing hypercalls issued by
> the PV guest to enter at 0 in supervisor mode on the user stack.
>
> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH v4 05/14] x86/traps: Move traps_init() earlier on boot
2026-02-27 23:16 ` [PATCH v4 05/14] x86/traps: Move traps_init() earlier on boot Andrew Cooper
@ 2026-03-02 11:39 ` Jan Beulich
2026-03-02 15:32 ` Andrew Cooper
0 siblings, 1 reply; 43+ messages in thread
From: Jan Beulich @ 2026-03-02 11:39 UTC (permalink / raw)
To: Andrew Cooper; +Cc: Roger Pau Monné, Xen-devel
On 28.02.2026 00:16, Andrew Cooper wrote:
> We wish to make use of opt_fred earlier on boot, which involves moving
> traps_init() earlier, but this comes with several ordering complications.
>
> The feature word containing FRED needs collecting in early_cpu_init(), and
> legacy_syscall_init() cannot be called that early because it relies on the
> stubs being allocated, yet must be called ahead of cpu_init() so the SYSCALL
> linkage MSRs are set up before being cached.
>
> Delaying legacy_syscall_init() is easy enough based on a system_state check.
> Reuse bsp_traps_reinit() to cause a call to legacy_syscall_init() to occur at
> the same point as previously.
>
> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Irrespective ...
> @@ -359,7 +363,13 @@ void __init bsp_traps_reinit(void)
> */
> void percpu_traps_init(void)
> {
> - legacy_syscall_init();
> + /*
> + * Skip legacy_syscall_init() at early boot. It requires the stubs being
> + * allocated, limiting the placement of the traps_init() call, and gets
> + * re-done anyway by bsp_traps_reinit().
> + */
> + if ( system_state > SYS_STATE_early_boot )
> + legacy_syscall_init();
... I wonder if simply pulling this out of this function wouldn't be slightly
neater. To me at least, syscall/sysenter are only a remote form of "trap".
Jan
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH v4 01/14] x86/pv: Don't assume that INT $imm8 instructions are two bytes long
2026-03-02 11:03 ` Jan Beulich
@ 2026-03-02 11:43 ` Andrew Cooper
2026-03-02 12:57 ` Jan Beulich
0 siblings, 1 reply; 43+ messages in thread
From: Andrew Cooper @ 2026-03-02 11:43 UTC (permalink / raw)
To: Jan Beulich; +Cc: Andrew Cooper, Roger Pau Monné, Xen-devel
On 02/03/2026 11:03 am, Jan Beulich wrote:
> On 28.02.2026 00:16, Andrew Cooper wrote:
>> For INT $N instructions (besides $0x80 for which there is a dedicated fast
>> path), handling is mostly fault-based because of DPL0 gates in the IDT. This
>> means that when the guest kernel allows the instruction too, Xen must
>> increment %rip to the end of the instruction before passing a trap to the
>> guest kernel.
>>
>> When an INT $N instruction has a prefix, it's longer than two bytes, and Xen
>> will deliver the "trap" with %rip pointing into the middle of the instruction.
>>
>> Introduce a new pv_emulate_sw_interrupt() which uses x86_insn_length() to
>> determine the instruction length, rather than assuming two.
>>
>> This is a change in behaviour for PV guests, but the prior behaviour cannot
>> reasonably be said to be intentional.
>>
>> This change does not affect the INT $0x80 fastpath. Prefixed INT $N
>> instructions occur almost exclusively in test code or exploits, and INT $0x80
>> appears to be the only user-usable interrupt gate in contemporary PV guests.
> Whereas for the slow path, while the subtracting of 2 from %rip there isn't
> quite right either, the insn size determination here would then simply yield
> 2 as well, so all is good for that case as well.
I've covered that in the docs patch (patch 2). Because INT $0x80 is
DPL3 and therefore traps, this is the best we can do.
>
>> @@ -1401,6 +1402,53 @@ int pv_emulate_privileged_op(struct cpu_user_regs *regs)
>> return 0;
>> }
>>
>> +/*
>> + * Hardware already decoded the INT $N instruction and determinted that there
>> + * was a DPL issue, hence the #GP. Xen has already determined that the guest
>> + * kernel has permitted this software interrupt.
>> + *
>> + * All that is needed is the instruction length, to turn the fault into a
>> + * trap. All errors are turned back into the original #GP, as that's the
>> + * action that really happened.
>> + */
>> +void pv_emulate_sw_interrupt(struct cpu_user_regs *regs)
>> +{
>> + struct vcpu *curr = current;
>> + struct domain *currd = curr->domain;
>> + struct priv_op_ctxt ctxt = {
>> + .ctxt.regs = regs,
>> + .ctxt.lma = !is_pv_32bit_domain(currd),
> The difference may not be overly significant here, but 64-bit guests can run
> 32-bit code, so setting .lma seems wrong in that case. As it ought to be
> largely benign, perhaps the code could even be left as is, just with a comment
> to clarify things?
LMA must be set for a 64bit guest. Are you confusing it with %cs.l ?
What's potentially wrong is having LMA clear for a 32bit guest, but this
is how pv_emulate_privileged_op() behaves. LMA is active in real
hardware when running in a compatibility mode segment.
I don't think anything actually cares about LMA.
pv_emul_read_descriptor() doesn't audit L and instead relies on us not
permitting a PV32 guest to write a 64bit code segment.
>
>> + };
>> + struct x86_emulate_state *state;
>> + uint8_t vector = regs->error_code >> 3;
>> + unsigned int len, ar;
>> +
>> + if ( !pv_emul_read_descriptor(regs->cs, curr, &ctxt.cs.base,
>> + &ctxt.cs.limit, &ar, 1) ||
>> + !(ar & _SEGMENT_S) ||
>> + !(ar & _SEGMENT_P) ||
>> + !(ar & _SEGMENT_CODE) )
>> + goto error;
>> +
>> + state = x86_decode_insn(&ctxt.ctxt, insn_fetch);
>> + if ( IS_ERR_OR_NULL(state) )
>> + goto error;
>> +
>> + len = x86_insn_length(state, &ctxt.ctxt);
>> + x86_emulate_free_state(state);
>> +
>> + /* Note: Checked slightly late to simplify 'state' handling. */
>> + if ( ctxt.ctxt.opcode != 0xcd /* INT $imm8 */ )
>> + goto error;
>> +
>> + regs->rip += len;
>> + pv_inject_sw_interrupt(vector);
>> + return;
>> +
>> + error:
>> + pv_inject_hw_exception(X86_EXC_GP, regs->entry_vector);
> DYM regs->error_code here?
Oh. I'm sure I fixed this bug already. I wonder where the fix got lost.
Yes, it should be regs->error_code.
> Might it alternatively make sense to return a
> boolean here, for ...
>
>> --- a/xen/arch/x86/traps.c
>> +++ b/xen/arch/x86/traps.c
>> @@ -1379,8 +1379,7 @@ void do_general_protection(struct cpu_user_regs *regs)
>>
>> if ( permit_softint(TI_GET_DPL(ti), v, regs) )
>> {
>> - regs->rip += 2;
>> - pv_inject_sw_interrupt(vector);
>> + pv_emulate_sw_interrupt(regs);
>> return;
> ... the return here to become conditional, leveraging the #GP injection at
> the bottom of this function?
To make this bool, I need to insert a new label into the function. I
considered that, but delayed it. do_general_protection() wants a lot
more cleaning up than just this, and proportionality is a concern.
What I was actually considering was splitting out a new pv_handle_GP()
function to remove the ifdef-ary, and doing a wholesale rework at that
point.
~Andrew
P.S. Something I'm still trying to figure out is how to make
guest_mode() able to DCE based on the caller being
entry_from_{xen,pv}(), because the function can be bifurcated for FRED.
It doesn't appear that the assume() constructs work, probably because
do_general_protection() can't be inlined due to IDT mode.
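
For illustration, the kind of hint described in this P.S. (hypothetical; as
noted above it does not currently achieve the DCE, because
do_general_protection() stays out of line in IDT builds):

    void asmlinkage entry_from_pv(struct cpu_user_regs *regs)
    {
        assume(guest_mode(regs));     /* hint: only guest frames arrive here */

        do_general_protection(regs);  /* out-of-line, so the hint can't
                                         propagate into the callee */
    }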
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH v4 01/14] x86/pv: Don't assume that INT $imm8 instructions are two bytes long
2026-03-02 11:43 ` Andrew Cooper
@ 2026-03-02 12:57 ` Jan Beulich
2026-03-02 16:39 ` Andrew Cooper
0 siblings, 1 reply; 43+ messages in thread
From: Jan Beulich @ 2026-03-02 12:57 UTC (permalink / raw)
To: Andrew Cooper; +Cc: Roger Pau Monné, Xen-devel
On 02.03.2026 12:43, Andrew Cooper wrote:
> On 02/03/2026 11:03 am, Jan Beulich wrote:
>> On 28.02.2026 00:16, Andrew Cooper wrote:
>>> @@ -1401,6 +1402,53 @@ int pv_emulate_privileged_op(struct cpu_user_regs *regs)
>>> return 0;
>>> }
>>>
>>> +/*
>>> + * Hardware already decoded the INT $N instruction and determined that there
>>> + * was a DPL issue, hence the #GP. Xen has already determined that the guest
>>> + * kernel has permitted this software interrupt.
>>> + *
>>> + * All that is needed is the instruction length, to turn the fault into a
>>> + * trap. All errors are turned back into the original #GP, as that's the
>>> + * action that really happened.
>>> + */
>>> +void pv_emulate_sw_interrupt(struct cpu_user_regs *regs)
>>> +{
>>> + struct vcpu *curr = current;
>>> + struct domain *currd = curr->domain;
>>> + struct priv_op_ctxt ctxt = {
>>> + .ctxt.regs = regs,
>>> + .ctxt.lma = !is_pv_32bit_domain(currd),
>> The difference may not be overly significant here, but 64-bit guests can run
>> 32-bit code, so setting .lma seems wrong in that case. As it ought to be
>> largely benign, perhaps the code could even be left as is, just with a comment
>> to clarify things?
>
> LMA must be set for a 64bit guest. Are you confusing it with %cs.l ?
Indeed I am, sorry.
>>> + struct x86_emulate_state *state;
>>> + uint8_t vector = regs->error_code >> 3;
>>> + unsigned int len, ar;
>>> +
>>> + if ( !pv_emul_read_descriptor(regs->cs, curr, &ctxt.cs.base,
>>> + &ctxt.cs.limit, &ar, 1) ||
>>> + !(ar & _SEGMENT_S) ||
>>> + !(ar & _SEGMENT_P) ||
>>> + !(ar & _SEGMENT_CODE) )
>>> + goto error;
>>> +
>>> + state = x86_decode_insn(&ctxt.ctxt, insn_fetch);
>>> + if ( IS_ERR_OR_NULL(state) )
>>> + goto error;
>>> +
>>> + len = x86_insn_length(state, &ctxt.ctxt);
>>> + x86_emulate_free_state(state);
>>> +
>>> + /* Note: Checked slightly late to simplify 'state' handling. */
>>> + if ( ctxt.ctxt.opcode != 0xcd /* INT $imm8 */ )
>>> + goto error;
>>> +
>>> + regs->rip += len;
>>> + pv_inject_sw_interrupt(vector);
>>> + return;
>>> +
>>> + error:
>>> + pv_inject_hw_exception(X86_EXC_GP, regs->entry_vector);
>> DYM regs->error_code here?
>
> Oh. I'm sure I fixed this bug already. I wonder where the fix got lost.
>
> Yes, it should be regs->error_code.
Then (plus with my confusion above sorted)
Reviewed-by: Jan Beulich <jbeulich@suse.com>
>> Might it alternatively make sense to return a
>> boolean here, for ...
>>
>>> --- a/xen/arch/x86/traps.c
>>> +++ b/xen/arch/x86/traps.c
>>> @@ -1379,8 +1379,7 @@ void do_general_protection(struct cpu_user_regs *regs)
>>>
>>> if ( permit_softint(TI_GET_DPL(ti), v, regs) )
>>> {
>>> - regs->rip += 2;
>>> - pv_inject_sw_interrupt(vector);
>>> + pv_emulate_sw_interrupt(regs);
>>> return;
>> ... the return here to become conditional, leveraging the #GP injection at
>> the bottom of this function?
>
> To make this bool, I need to insert a new label into the function.
Why would that be? Simply skipping the return and falling through will do,
afaics.
> I
> considered that, but delayed it. do_general_protection() wants a lot
> more cleaning up than just this, and proportionality is a concern.
Whatever you exactly mean with this.
Jan
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH v4 02/14] docs/guest-guide: Describe the PV traps and entrypoints ABI
2026-03-02 11:19 ` Jan Beulich
@ 2026-03-02 14:47 ` Andrew Cooper
0 siblings, 0 replies; 43+ messages in thread
From: Andrew Cooper @ 2026-03-02 14:47 UTC (permalink / raw)
To: Jan Beulich; +Cc: Andrew Cooper, Roger Pau Monné, Xen-devel
On 02/03/2026 11:19 am, Jan Beulich wrote:
> On 28.02.2026 00:16, Andrew Cooper wrote:
>> ... seeing as I've had to thoroughly reverse engineer it for FRED and make
>> tweaks in places.
>>
>> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
> Acked-by: Jan Beulich <jbeulich@suse.com>
Thanks.
>
>> --- /dev/null
>> +++ b/docs/guest-guide/x86/pv-traps.rst
>> @@ -0,0 +1,123 @@
>> +.. SPDX-License-Identifier: CC-BY-4.0
>> +
>> +PV Traps and Entrypoints
>> +========================
>> +
>> +.. note::
>> +
>> + The details here are specific to 64bit builds of Xen. Details for 32bit
>> + builds of Xen, are different and not discussed further.
> Nit: Stray comma?
Yes. From a sentence refactor. Will drop.
>
>> +PV guests are subject to Xen's linkage setup for events (interrupts,
>> +exceptions and system calls). x86's IDT architecture and limitations are the
>> +majority influence on the PV ABI.
>> +
>> +All external interrupts are routed to PV guests via the :term:`Event Channel`
>> +interface, and not discussed further here.
>> +
>> +What remain are exceptions, and the instructions which cause a control
>> +transfer. In the x86 architecture, the instructions relevant for PV guests
>> +are:
>> +
>> + * ``INT3``, which generates ``#BP``.
>> +
>> + * ``INTO``, which generates ``#OF`` only if the overflow flag is set. It is
>> + only usable in compatibility mode, and will ``#UD`` in 64bit mode.
>> +
>> + * ``CALL (far)`` referencing a gate in the GDT.
>> +
>> + * ``INT $N``, which invokes an arbitrary IDT gate. These four instructions
>> + so far all check the gate DPL and will ``#GP`` otherwise.
>> +
>> + * ``INT1``, also known as ``ICEBP``, which generates ``#DB``. This
>> + instruction does *not* check DPL, and can be used unconditionally by
>> + userspace.
>> +
>> + * ``SYSCALL``, which enters CPL0 as configured by the ``{C,L,}STAR`` MSRs.
>> + It is usable if enabled by ``MSR_EFER.SCE``, and will ``#UD`` otherwise.
>> + On Intel parts, ``SYSCALL`` is unusable outside of 64bit mode.
>> +
>> + * ``SYSENTER``, which enters CPL0 as configured by the ``SEP`` MSRs. It is
>> + usable if enabled by ``MSR_SYSENTER_CS`` having a non-NUL selector, and
>> + will ``#GP`` otherwise. On AMD parts, ``SYSENTER`` is unusable in Long
>> + mode.
> The UD<n> family of insns is kind of a hybrid: They explicitly generate #UD,
> and hence do a control transfer. Same for at least BOUND. It's not quite clear
> whether they should be enumerated here as well.
UDn and BOUND are strictly faults, not traps. They're type 3 (hardware
exception) and provide no instruction length.
The simplest implementation of UDn is nothing. The decoder already
needs a signal for "not an instruction I know" which is wired into #UD.
All the manual does by defining these "instructions" is promise that
nothing else will be allocated in that opcode space.
BOUND is weird. I'm not sure what more can be said about it.
Either way, #UD and #BR are not interestingly different from e.g. #PF
from a PV guest's point of view.
>
>> +Xen's configuration
>> +-------------------
>> +
>> +Xen maintains a complete IDT, with most gates configured with DPL0. This
>> +causes most ``INT $N`` instructions to ``#GP``. This allows Xen to emulate
>> +the instruction, referring to the guest kernel's vDPL choice.
>> +
>> + * Vectors 3 ``#BP`` and 4 ``#OF`` are DPL3, in order to allow the ``INT3``
>> + and ``INTO`` instructions to function in userspace.
>> +
>> + * Vector 0x80 is DPL3 in order to implement the legacy system call fastpath
>> + commonly found in UNIXes.
> Much like we make this DPL0 when PV=n, should we perhaps make vectors 3 and 4
> DPL0 as well in that case (just for formality's sake)? Maybe 4, like 9, would
> even want to be an autogen entry point then?
We could, but does that gain us anything?
For 0x80, we get another vector to use for regular interrupts, but that
doesn't work for the vectors below 0x10.
~Andrew
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH v4 06/14] x86/traps: Don't configure Supervisor Shadow Stack tokens in FRED mode
2026-02-27 23:16 ` [PATCH v4 06/14] x86/traps: Don't configure Supervisor Shadow Stack tokens in FRED mode Andrew Cooper
@ 2026-03-02 14:50 ` Jan Beulich
2026-03-02 15:47 ` Andrew Cooper
0 siblings, 1 reply; 43+ messages in thread
From: Jan Beulich @ 2026-03-02 14:50 UTC (permalink / raw)
To: Andrew Cooper; +Cc: Roger Pau Monné, Xen-devel
On 28.02.2026 00:16, Andrew Cooper wrote:
> FRED doesn't use Supervisor Shadow Stack tokens. This means that:
>
> 1) memguard_guard_stack() should not write Supervisor Shadow Stack Tokens.
> 2) cpu_has_bug_shstk_fracture is no longer relevant when deciding whether or
> not to enable Shadow Stacks in the first place.
>
> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
> The SDM explicitly points out the shstk fracture vs FRED case, yet PTL
> enumerates CET-SSS (immunity to shstk fracture). I can only assume that there
> are other Intel CPUs with FRED but without CET-SSS.
Isn't CET-SSS still relevant to OSes not using FRED (much like you do for
the fred=no case)?
Jan
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH v4 04/14] x86/boot: Document the ordering dependency of _svm_cpu_up()
2026-02-27 23:16 ` [PATCH v4 04/14] x86/boot: Document the ordering dependency of _svm_cpu_up() Andrew Cooper
2026-03-02 11:35 ` Jan Beulich
@ 2026-03-02 15:20 ` Andrew Cooper
2026-03-02 15:34 ` Jan Beulich
1 sibling, 1 reply; 43+ messages in thread
From: Andrew Cooper @ 2026-03-02 15:20 UTC (permalink / raw)
To: Xen-devel; +Cc: Andrew Cooper, Jan Beulich, Roger Pau Monné
On 27/02/2026 11:16 pm, Andrew Cooper wrote:
> Let's just say this took an unreasonable amount of time and effort to track
> down, when trying to move traps_init() earlier during boot.
>
> When the SYSCALL linkage MSRs are not configured ahead of _svm_cpu_up() on the
> BSP, the first context switch into PV uses svm_load_segs() and clobbers the
> later-set-up linkage with the 0's cached here, causing hypercalls issued by
> the PV guest to enter at 0 in supervisor mode on the user stack.
>
> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
> ---
> CC: Jan Beulich <JBeulich@suse.com>
> CC: Roger Pau Monné <roger.pau@citrix.com>
>
> v4:
> * New
>
> It occurs to me that it's not actually 0's we cache here. It's whatever
> context was left from prior to Xen. We still don't reliably clean unused
> MSRs.
> ---
> xen/arch/x86/hvm/svm/svm.c | 16 ++++++++++++++++
> xen/arch/x86/setup.c | 2 +-
> 2 files changed, 17 insertions(+), 1 deletion(-)
>
> diff --git a/xen/arch/x86/hvm/svm/svm.c b/xen/arch/x86/hvm/svm/svm.c
> index 18ba837738c6..f1e02d919cae 100644
> --- a/xen/arch/x86/hvm/svm/svm.c
> +++ b/xen/arch/x86/hvm/svm/svm.c
> @@ -35,6 +35,7 @@
> #include <asm/p2m.h>
> #include <asm/paging.h>
> #include <asm/processor.h>
> +#include <asm/traps.h>
> #include <asm/vm_event.h>
> #include <asm/x86_emulate.h>
>
> @@ -1581,6 +1582,21 @@ static int _svm_cpu_up(bool bsp)
> /* Initialize OSVW bits to be used by guests */
> svm_host_osvw_init();
>
> + /*
> + * VMSAVE writes out the current full FS, GS, LDTR and TR segments, and
> + * the GS_SHADOW, SYSENTER and SYSCALL linkage MSRs.
> + *
> + * The segment data gets modified by the svm_load_segs() optimisation for
> + * PV context switches, but all values get reloaded at that point, as well
> + * as during context switch from SVM.
> + *
> + * If PV guests are available (and FRED is not in use), it is critical
> + * that the SYSCALL linkage MSRs have been configured at this juncture.
> + */
> + ASSERT(opt_fred >= 0); /* Confirm that FRED-ness has been resolved */
> + if ( IS_ENABLED(CONFIG_PV) && !opt_fred )
> + ASSERT(rdmsr(MSR_LSTAR));
It has occurred to me that this is subtly wrong. While FRED doesn't use
LSTAR/SFMASK, it does reuse STAR.
So this needs to be:

    if ( IS_ENABLED(CONFIG_PV) )
        ASSERT(rdmsr(MSR_STAR));

with the include dropped, and the final sentence adjusted to say "even
with FRED".
~Andrew
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH v4 05/14] x86/traps: Move traps_init() earlier on boot
2026-03-02 11:39 ` Jan Beulich
@ 2026-03-02 15:32 ` Andrew Cooper
0 siblings, 0 replies; 43+ messages in thread
From: Andrew Cooper @ 2026-03-02 15:32 UTC (permalink / raw)
To: Jan Beulich; +Cc: Andrew Cooper, Roger Pau Monné, Xen-devel
On 02/03/2026 11:39 am, Jan Beulich wrote:
> On 28.02.2026 00:16, Andrew Cooper wrote:
>> We wish to make use of opt_fred earlier on boot, which involves moving
>> traps_init() earlier, but this comes with several ordering complications.
>>
>> The feature word containing FRED needs collecting in early_cpu_init(), and
>> legacy_syscall_init() cannot be called that early because it relies on the
>> stubs being allocated, yet must be called ahead of cpu_init() so the SYSCALL
>> linkage MSRs are set up before being cached.
>>
>> Delaying legacy_syscall_init() is easy enough based on a system_state check.
>> Reuse bsp_traps_reinit() to cause a call to legacy_syscall_init() to occur at
>> the same point as previously.
>>
>> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Thanks.
>
> Irrespective ...
>
>> @@ -359,7 +363,13 @@ void __init bsp_traps_reinit(void)
>> */
>> void percpu_traps_init(void)
>> {
>> - legacy_syscall_init();
>> + /*
>> + * Skip legacy_syscall_init() at early boot. It requires the stubs being
>> + * allocated, limiting the placement of the traps_init() call, and gets
>> + * re-done anyway by bsp_traps_reinit().
>> + */
>> + if ( system_state > SYS_STATE_early_boot )
>> + legacy_syscall_init();
> ... I wonder if simply pulling this out of this function wouldn't be slightly
> neater. To me at least, syscall/sysenter are only a remote form of "trap".
I'm not a massive fan of how we (well, Linux) uses "traps" when it's
different from the x86 term of the same name.
But, setting up the syscall stub has always been part of traps_init(),
and for FRED it's combined.
As noted, this changes again as FRED gets plumbed in, so really you need
to look at patch 8. I'm not a massive fan of how it's ended up, but I
can't think of anything simpler.
~Andrew
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH v4 04/14] x86/boot: Document the ordering dependency of _svm_cpu_up()
2026-03-02 15:20 ` Andrew Cooper
@ 2026-03-02 15:34 ` Jan Beulich
2026-03-02 15:42 ` Andrew Cooper
0 siblings, 1 reply; 43+ messages in thread
From: Jan Beulich @ 2026-03-02 15:34 UTC (permalink / raw)
To: Andrew Cooper; +Cc: Roger Pau Monné, Xen-devel
On 02.03.2026 16:20, Andrew Cooper wrote:
> On 27/02/2026 11:16 pm, Andrew Cooper wrote:
>> Let's just say this took an unreasonable amount of time and effort to track
>> down, when trying to move traps_init() earlier during boot.
>>
>> When the SYSCALL linkage MSRs are not configured ahead of _svm_cpu_up() on the
>> BSP, the first context switch into PV uses svm_load_segs() and clobbers the
>> later-set-up linkage with the 0's cached here, causing hypercalls issued by
>> the PV guest to enter at 0 in supervisor mode on the user stack.
>>
>> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
>> ---
>> CC: Jan Beulich <JBeulich@suse.com>
>> CC: Roger Pau Monné <roger.pau@citrix.com>
>>
>> v4:
>> * New
>>
>> It occurs to me that it's not actually 0's we cache here. It's whatever
>> context was left from prior to Xen. We still don't reliably clean unused
>> MSRs.
Actually, with this, ...
>> --- a/xen/arch/x86/hvm/svm/svm.c
>> +++ b/xen/arch/x86/hvm/svm/svm.c
>> @@ -35,6 +35,7 @@
>> #include <asm/p2m.h>
>> #include <asm/paging.h>
>> #include <asm/processor.h>
>> +#include <asm/traps.h>
>> #include <asm/vm_event.h>
>> #include <asm/x86_emulate.h>
>>
>> @@ -1581,6 +1582,21 @@ static int _svm_cpu_up(bool bsp)
>> /* Initialize OSVW bits to be used by guests */
>> svm_host_osvw_init();
>>
>> + /*
>> + * VMSAVE writes out the current full FS, GS, LDTR and TR segments, and
>> + * the GS_SHADOW, SYSENTER and SYSCALL linkage MSRs.
>> + *
>> + * The segment data gets modified by the svm_load_segs() optimisation for
>> + * PV context switches, but all values get reloaded at that point, as well
>> + * as during context switch from SVM.
>> + *
>> + * If PV guests are available (and FRED is not in use), it is critical
>> + * that the SYSCALL linkage MSRs have been configured at this juncture.
>> + */
>> + ASSERT(opt_fred >= 0); /* Confirm that FRED-ness has been resolved */
>> + if ( IS_ENABLED(CONFIG_PV) && !opt_fred )
>> + ASSERT(rdmsr(MSR_LSTAR));
>
> It has occurred to me that this is subtly wrong. While FRED doesn't use
> LSTAR/SFMASK, it does reuse STAR.
>
> So this needs to be:
>
>     if ( IS_ENABLED(CONFIG_PV) )
>         ASSERT(rdmsr(MSR_STAR));
>
> with the include dropped, and the final sentence adjusted to say "even
> with FRED".
... if we inherit a non-zero value, is the assertion of much use this way?
Jan
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH v4 04/14] x86/boot: Document the ordering dependency of _svm_cpu_up()
2026-03-02 15:34 ` Jan Beulich
@ 2026-03-02 15:42 ` Andrew Cooper
0 siblings, 0 replies; 43+ messages in thread
From: Andrew Cooper @ 2026-03-02 15:42 UTC (permalink / raw)
To: Jan Beulich; +Cc: Andrew Cooper, Roger Pau Monné, Xen-devel
On 02/03/2026 3:34 pm, Jan Beulich wrote:
> On 02.03.2026 16:20, Andrew Cooper wrote:
>> On 27/02/2026 11:16 pm, Andrew Cooper wrote:
>>> Let's just say this took an unreasonable amount of time and effort to track
>>> down, when trying to move traps_init() earlier during boot.
>>>
>>> When the SYSCALL linkage MSRs are not configured ahead of _svm_cpu_up() on the
>>> BSP, the first context switch into PV uses svm_load_segs() and clobbers the
>>> later-set-up linkage with the 0's cached here, causing hypercalls issued by
>>> the PV guest to enter at 0 in supervisor mode on the user stack.
>>>
>>> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
>>> ---
>>> CC: Jan Beulich <JBeulich@suse.com>
>>> CC: Roger Pau Monné <roger.pau@citrix.com>
>>>
>>> v4:
>>> * New
>>>
>>> It occurs to me that it's not actually 0's we cache here. It's whatever
>>> context was left from prior to Xen. We still don't reliably clean unused
>>> MSRs.
> Actually, with this, ...
>
>>> --- a/xen/arch/x86/hvm/svm/svm.c
>>> +++ b/xen/arch/x86/hvm/svm/svm.c
>>> @@ -35,6 +35,7 @@
>>> #include <asm/p2m.h>
>>> #include <asm/paging.h>
>>> #include <asm/processor.h>
>>> +#include <asm/traps.h>
>>> #include <asm/vm_event.h>
>>> #include <asm/x86_emulate.h>
>>>
>>> @@ -1581,6 +1582,21 @@ static int _svm_cpu_up(bool bsp)
>>> /* Initialize OSVW bits to be used by guests */
>>> svm_host_osvw_init();
>>>
>>> + /*
>>> + * VMSAVE writes out the current full FS, GS, LDTR and TR segments, and
>>> + * the GS_SHADOW, SYSENTER and SYSCALL linkage MSRs.
>>> + *
>>> + * The segment data gets modified by the svm_load_segs() optimisation for
>>> + * PV context switches, but all values get reloaded at that point, as well
>>> + * as during context switch from SVM.
>>> + *
>>> + * If PV guests are available (and FRED is not in use), it is critical
>>> + * that the SYSCALL linkage MSRs have been configured at this juncture.
>>> + */
>>> + ASSERT(opt_fred >= 0); /* Confirm that FRED-ness has been resolved */
>>> + if ( IS_ENABLED(CONFIG_PV) && !opt_fred )
>>> + ASSERT(rdmsr(MSR_LSTAR));
>> It has occurred to me that this is subtly wrong. While FRED doesn't use
>> LSTAR/SFMASK, it does reuse STAR.
>>
>> So this needs to be:
>>
>>     if ( IS_ENABLED(CONFIG_PV) )
>>         ASSERT(rdmsr(MSR_STAR));
>>
>> with the include dropped, and the final sentence adjusted to say "even
>> with FRED".
> ... if we inherit a non-zero value, is the assertion of much use this way?
The inherited case is normally when we're KEXEC'd into. That doesn't
happen very much with Xen, and is more of a concern with Linux.
But yes, IMO the assertion is still useful. CI boots from clean, so the
ASSERT() will catch accidental code movement which violates the dependency.
~Andrew
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH v4 06/14] x86/traps: Don't configure Supervisor Shadow Stack tokens in FRED mode
2026-03-02 14:50 ` Jan Beulich
@ 2026-03-02 15:47 ` Andrew Cooper
0 siblings, 0 replies; 43+ messages in thread
From: Andrew Cooper @ 2026-03-02 15:47 UTC (permalink / raw)
To: Jan Beulich; +Cc: Andrew Cooper, Roger Pau Monné, Xen-devel
On 02/03/2026 2:50 pm, Jan Beulich wrote:
> On 28.02.2026 00:16, Andrew Cooper wrote:
>> FRED doesn't use Supervisor Shadow Stack tokens. This means that:
>>
>> 1) memguard_guard_stack() should not write Supervisor Shadow Stack Tokens.
>> 2) cpu_has_bug_shstk_fracture is no longer relevant when deciding whether or
>> not to enable Shadow Stacks in the first place.
>>
>> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Thanks.
>> The SDM explicitly points out the shstk fracture vs FRED case, yet PTL
>> enumerates CET-SSS (immunity to shstk fracture). I can only assume that there
>> are other Intel CPUs with FRED but without CET-SSS.
> Isn't CET-SSS still relevant to OSes not using FRED (much like you do for
> the fred=no case)?
Yes, CET-SSS is relevant outside of FRED mode.
I just don't see the point of the note if all FRED systems will
enumerate CET-SSS.
~Andrew
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH v4 08/14] x86/traps: Enable FRED when requested
2026-02-27 23:16 ` [PATCH v4 08/14] x86/traps: Enable FRED when requested Andrew Cooper
@ 2026-03-02 16:12 ` Jan Beulich
2026-03-03 13:44 ` Andrew Cooper
0 siblings, 1 reply; 43+ messages in thread
From: Jan Beulich @ 2026-03-02 16:12 UTC (permalink / raw)
To: Andrew Cooper; +Cc: Roger Pau Monné, Xen-devel
On 28.02.2026 00:16, Andrew Cooper wrote:
> With the shadow stack and exception handling adjustments in place, we can now
> activate FRED when appropriate. Note that opt_fred is still disabled by
> default until more infrastructure is in place.
>
> Introduce init_fred() to set up all the MSRs relevant for FRED. FRED uses
> MSR_STAR (entries from Ring3 only), and MSR_FRED_SSP_SL0 aliases MSR_PL0_SSP
> when CET-SS is active. Otherwise, they're all new MSRs.
>
> Also introduce init_fred_tss(). At this juncture we need a TSS set up, even
> if it is mostly unused. Reinsert the BUILD_BUG_ON() checking the size of the
> TSS against 0x67, this time with a more precise comment.
>
> With init_fred() existing, load_system_tables() and legacy_syscall_init()
> should only be used when setting up IDT delivery. Insert ASSERT()s to this
> effect, and adjust the various init functions to make this property true.
>
> The FRED initialisation path still needs to write to all system table
> registers at least once, even if only to invalidate them. Per the
> documentation, percpu_early_traps_init() is responsible for switching off the
> boot GDT, which also needs doing even in FRED mode.
>
> Finally, set CR4.FRED in traps_init()/percpu_early_traps_init().
>
> Xen can now boot in FRED mode and run a PVH dom0. PV guests still need more
> work before they can be run under FRED.
>
> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
> [*] PVH Dom0 on an Intel PantherLake CPU.
What other part is this remark connected to?
> @@ -353,7 +440,11 @@ void __init traps_init(void)
> */
> void __init bsp_traps_reinit(void)
> {
> - load_system_tables();
> + if ( opt_fred )
> + init_fred();
> + else
> + load_system_tables();
> +
> percpu_traps_init();
> }
I see now what you meant in reply to comments on an earlier patch.
Jan
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH v4 09/14] x86/pv: Adjust GS handling for FRED mode
2026-02-27 23:16 ` [PATCH v4 09/14] x86/pv: Adjust GS handling for FRED mode Andrew Cooper
@ 2026-03-02 16:24 ` Jan Beulich
2026-03-04 17:18 ` [PATCH v4.1 " Andrew Cooper
1 sibling, 0 replies; 43+ messages in thread
From: Jan Beulich @ 2026-03-02 16:24 UTC (permalink / raw)
To: Andrew Cooper; +Cc: Roger Pau Monné, Xen-devel
On 28.02.2026 00:16, Andrew Cooper wrote:
> When FRED is active, hardware automatically swaps GS when changing privilege,
> and the SWAPGS instruction is disallowed.
>
> For native OSes using GS as the thread local pointer this is a massive
> improvement on the pre-FRED architecture, but under Xen it makes handling PV
> guests more complicated. Specifically, it means that GS_BASE and GS_SHADOW
> are the opposite way around in FRED mode, as opposed to IDT mode.
>
> This leads to the following changes:
>
> * In load_segments(), we have to load both GSes. Account for this in the
> SWAP() condition and avoid the path with SWAPGS.
>
> * In save_segments(), we need to read GS_SHADOW rather than GS_BASE.
>
> * In toggle_guest_mode(), we need to emulate SWAPGS.
>
> * In {read,write}_msr() which access the live registers, GS_SHADOW and
> GS_BASE need swapping.
>
> * In do_set_segment_base(), merge the SEGBASE_GS_{USER,KERNEL} cases and
> take FRED into account when choosing which base to update.
>
> SEGBASE_GS_USER_SEL was already an LKGS invocation (decades before FRED)
> so under FRED needs to be just a MOV %gs. Simply skip the SWAPGSes.
>
> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
> Reviewed-by: Jan Beulich <jbeulich@suse.com>
> ---
> CC: Jan Beulich <JBeulich@suse.com>
> CC: Roger Pau Monné <roger.pau@citrix.com>
>
> v4:
> * Adjust GS accesses for emulated {RD,WR}MSR too.
I think this ...
> @@ -926,7 +927,7 @@ static int cf_check read_msr(
> case MSR_GS_BASE:
> if ( !cp->extd.lm )
> break;
> - *val = read_gs_base();
> + *val = opt_fred ? rdmsr(MSR_SHADOW_GS_BASE) : read_gs_base();
> return X86EMUL_OKAY;
... calls for a comment both here and ...
> @@ -1066,17 +1067,22 @@ static int cf_check write_msr(
> if ( !cp->extd.lm || !is_canonical_address(val) )
> break;
>
> - if ( reg == MSR_FS_BASE )
> - write_fs_base(val);
> - else if ( reg == MSR_GS_BASE )
> - write_gs_base(val);
> - else if ( reg == MSR_SHADOW_GS_BASE )
> + switch ( reg )
> {
> - write_gs_shadow(val);
> + case MSR_FS_BASE:
> + write_fs_base(val);
> + break;
> +
> + case MSR_SHADOW_GS_BASE:
> curr->arch.pv.gs_base_user = val;
> + fallthrough;
> + case MSR_GS_BASE:
> + if ( (reg == MSR_GS_BASE) ^ opt_fred )
> + write_gs_base(val);
> + else
> + write_gs_shadow(val);
> + break;
... here, much like you do about everywhere else.
> @@ -192,11 +193,12 @@ long do_set_segment_base(unsigned int which, unsigned long base)
>
> case SEGBASE_GS_USER:
> v->arch.pv.gs_base_user = base;
> - write_gs_shadow(base);
> - break;
> -
> + fallthrough;
> case SEGBASE_GS_KERNEL:
> - write_gs_base(base);
> + if ( (which == SEGBASE_GS_KERNEL) ^ opt_fred )
> + write_gs_base(base);
> + else
> + write_gs_shadow(base);
> break;
> }
> break;
Same perhaps here, and ...
> @@ -209,7 +211,8 @@ long do_set_segment_base(unsigned int which, unsigned long base)
> * We wish to update the user %gs from the GDT/LDT. Currently, the
> * guest kernel's GS_BASE is in context.
> */
> - asm volatile ( "swapgs" );
> + if ( !opt_fred )
> + asm volatile ( "swapgs" );
... the comment in context could also do with inserting "unless using FRED"
or some such.
Jan
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH v4 13/14] x86: Clamp reserved bits in eflags more aggressively
2026-02-27 23:16 ` [PATCH v4 13/14] x86: Clamp reserved bits in eflags more aggressively Andrew Cooper
@ 2026-03-02 16:35 ` Jan Beulich
2026-03-11 17:58 ` [PATCH v4.1 13/14] x86: Clamp " Andrew Cooper
1 sibling, 0 replies; 43+ messages in thread
From: Jan Beulich @ 2026-03-02 16:35 UTC (permalink / raw)
To: Andrew Cooper; +Cc: Roger Pau Monné, Xen-devel
On 28.02.2026 00:16, Andrew Cooper wrote:
> ERETU, unlike IRET, requires the sticky-1 bit (bit 2) be set, and reserved
> bits to be clear. Notably this means that dom0_construct() must set
> X86_EFLAGS_MBS in order for a PV dom0 to start.
>
> Xen has been overly lax with reserved bit handling. Adjust
> arch_set_info_guest*() and hypercall_iret() which consume flags to clamp the
> reserved bits for all guest types.
>
> This is a minor ABI change, but by the same argument as commit
> 9f892f84c279 ("x86/domctl: Stop using XLAT_cpu_user_regs()"); the reserved
> bits would get clamped like this naturally by hardware when the vCPU is run.
>
> This allows PV guests to start when Xen is using FRED mode.
>
> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
> ---
> CC: Jan Beulich <JBeulich@suse.com>
> CC: Roger Pau Monné <roger.pau@citrix.com>
>
> Still slightly RFC. Testing still in progress.
>
> v3:
> * Rewrite the commit message.
> v2:
> * New
>
> The handling of VM is complicated.
>
> It turns out that it's simply ignored by IRET in Long Mode (i.e. clearing it
> in commit 0e47f92b0725 ("x86: force EFLAGS.IF on when exiting to PV guests")
> wasn't actually necessary) but ERETU does care.
>
> But, it's unclear how to handle this in arch_set_info(). We must preserve
> it for HVM guests (which can use vm86 mode). PV32 has special handling but
> only in hypercall_iret(), not in arch_set_info().
Any reason you don't ...
> --- a/xen/arch/x86/domain.c
> +++ b/xen/arch/x86/domain.c
> @@ -1244,7 +1244,7 @@ int arch_set_info_guest(
> v->arch.user_regs.rax = c.nat->user_regs.rax;
> v->arch.user_regs.rip = c.nat->user_regs.rip;
> v->arch.user_regs.cs = c.nat->user_regs.cs;
> - v->arch.user_regs.rflags = c.nat->user_regs.rflags;
> + v->arch.user_regs.rflags = (c.nat->user_regs.rflags & X86_EFLAGS_ALL) | X86_EFLAGS_MBS;
> v->arch.user_regs.rsp = c.nat->user_regs.rsp;
> v->arch.user_regs.ss = c.nat->user_regs.ss;
> v->arch.pv.es = c.nat->user_regs.es;
> @@ -1268,7 +1268,7 @@ int arch_set_info_guest(
> v->arch.user_regs.eax = c.cmp->user_regs.eax;
> v->arch.user_regs.eip = c.cmp->user_regs.eip;
> v->arch.user_regs.cs = c.cmp->user_regs.cs;
> - v->arch.user_regs.eflags = c.cmp->user_regs.eflags;
> + v->arch.user_regs.eflags = (c.cmp->user_regs.eflags & X86_EFLAGS_ALL) | X86_EFLAGS_MBS;
> v->arch.user_regs.esp = c.cmp->user_regs.esp;
> v->arch.user_regs.ss = c.cmp->user_regs.ss;
> v->arch.pv.es = c.cmp->user_regs.es;
... filter it out here conditionally upon domain type?
Jan
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH v4 01/14] x86/pv: Don't assume that INT $imm8 instructions are two bytes long
2026-03-02 12:57 ` Jan Beulich
@ 2026-03-02 16:39 ` Andrew Cooper
0 siblings, 0 replies; 43+ messages in thread
From: Andrew Cooper @ 2026-03-02 16:39 UTC (permalink / raw)
To: Jan Beulich; +Cc: Andrew Cooper, Roger Pau Monné, Xen-devel
On 02/03/2026 12:57 pm, Jan Beulich wrote:
> On 02.03.2026 12:43, Andrew Cooper wrote:
>> On 02/03/2026 11:03 am, Jan Beulich wrote:
>>> On 28.02.2026 00:16, Andrew Cooper wrote:
>>>> @@ -1401,6 +1402,53 @@ int pv_emulate_privileged_op(struct cpu_user_regs *regs)
>>>> return 0;
>>>> }
>>>>
>>>> +/*
>>>> + * Hardware already decoded the INT $N instruction and determined that there
>>>> + * was a DPL issue, hence the #GP. Xen has already determined that the guest
>>>> + * kernel has permitted this software interrupt.
>>>> + *
>>>> + * All that is needed is the instruction length, to turn the fault into a
>>>> + * trap. All errors are turned back into the original #GP, as that's the
>>>> + * action that really happened.
>>>> + */
>>>> +void pv_emulate_sw_interrupt(struct cpu_user_regs *regs)
>>>> +{
>>>> + struct vcpu *curr = current;
>>>> + struct domain *currd = curr->domain;
>>>> + struct priv_op_ctxt ctxt = {
>>>> + .ctxt.regs = regs,
>>>> + .ctxt.lma = !is_pv_32bit_domain(currd),
>>> The difference may not be overly significant here, but 64-bit guests can run
>>> 32-bit code, so setting .lma seems wrong in that case. As it ought to be
>>> largely benign, perhaps to code could even be left as is, just with a comment
>>> to clarify things?
>> LMA must be set for a 64bit guest. Are you confusing it with %cs.l ?
> Indeed I am, sorry.
>
>>>> + struct x86_emulate_state *state;
>>>> + uint8_t vector = regs->error_code >> 3;
>>>> + unsigned int len, ar;
>>>> +
>>>> + if ( !pv_emul_read_descriptor(regs->cs, curr, &ctxt.cs.base,
>>>> + &ctxt.cs.limit, &ar, 1) ||
>>>> + !(ar & _SEGMENT_S) ||
>>>> + !(ar & _SEGMENT_P) ||
>>>> + !(ar & _SEGMENT_CODE) )
>>>> + goto error;
>>>> +
>>>> + state = x86_decode_insn(&ctxt.ctxt, insn_fetch);
>>>> + if ( IS_ERR_OR_NULL(state) )
>>>> + goto error;
>>>> +
>>>> + len = x86_insn_length(state, &ctxt.ctxt);
>>>> + x86_emulate_free_state(state);
>>>> +
>>>> + /* Note: Checked slightly late to simplify 'state' handling. */
>>>> + if ( ctxt.ctxt.opcode != 0xcd /* INT $imm8 */ )
>>>> + goto error;
>>>> +
>>>> + regs->rip += len;
>>>> + pv_inject_sw_interrupt(vector);
>>>> + return;
>>>> +
>>>> + error:
>>>> + pv_inject_hw_exception(X86_EXC_GP, regs->entry_vector);
>>> DYM regs->error_code here?
>> Oh. I'm sure I fixed this bug already. I wonder where the fix got lost.
>>
>> Yes, it should be regs->error_code.
> Then (plus with my confusion above sorted)
> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Thanks.
>
>>> Might it alternatively make sense to return a
>>> boolean here, for ...
>>>
>>>> --- a/xen/arch/x86/traps.c
>>>> +++ b/xen/arch/x86/traps.c
>>>> @@ -1379,8 +1379,7 @@ void do_general_protection(struct cpu_user_regs *regs)
>>>>
>>>> if ( permit_softint(TI_GET_DPL(ti), v, regs) )
>>>> {
>>>> - regs->rip += 2;
>>>> - pv_inject_sw_interrupt(vector);
>>>> + pv_emulate_sw_interrupt(regs);
>>>> return;
>>> ... the return here to become conditional, leveraging the #GP injection at
>>> the bottom of this function?
>> To make this bool, I need to insert a new label into the function.
> Why would that be? Simply skipping the return and falling through will do,
> afaics.
>
>> I
>> considered that, but delayed it. do_general_protection() wants a lot
>> more cleaning up than just this, and proportionability is a concern.
> Whatever you exactly mean with this.
Hmm. That was supposed to say backportability, but I have no idea how it
ended up like that.
The other advantage of being void functions is that they can be tailcalled.
Anyway, I have a plan for cleanup once FRED is settled, which looks a
little like this:
    handle_GP_IDT()
        if ( guest_regs() )
            return handle_GP_guest()
        else
            return handle_GP_xen()

    handle_GP_guest()
        ...

    handle_GP_xen()
        ...
where the two FRED entrypoints can now call the context-specific
function rather than the generic one.
This does involve duplicating the X86_XEC_EXT check which is the only
common aspect in the #GP handler. Next I need to figure out whether the
other handlers can be rearranged similarly.
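A slightly more concrete sketch of the #GP split above (illustrative only;
guest_mode() is the existing predicate, and the handle_GP_*() helpers are just
the names from the plan, they don't exist yet):

    /* IDT entrypoint: decide which context was interrupted, then dispatch. */
    void handle_GP_IDT(struct cpu_user_regs *regs)
    {
        if ( guest_mode(regs) )
            handle_GP_guest(regs);   /* repeats the X86_XEC_EXT check */
        else
            handle_GP_xen(regs);     /* ditto */
    }

The FRED entrypoints would call handle_GP_guest() or handle_GP_xen() directly,
as they already know which context was interrupted.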
~Andrew
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH v4 14/14] x86/traps: Use fatal_trap() for #UD and #GP
2026-02-27 23:16 ` [PATCH v4 14/14] x86/traps: Use fatal_trap() for #UD and #GP Andrew Cooper
@ 2026-03-02 16:39 ` Jan Beulich
2026-03-02 16:40 ` Jan Beulich
1 sibling, 0 replies; 43+ messages in thread
From: Jan Beulich @ 2026-03-02 16:39 UTC (permalink / raw)
To: Andrew Cooper; +Cc: Roger Pau Monné, Xen-devel
On 28.02.2026 00:16, Andrew Cooper wrote:
> This renders the diagnostics in a more uniform way.
>
> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH v4 14/14] x86/traps: Use fatal_trap() for #UD and #GP
2026-02-27 23:16 ` [PATCH v4 14/14] x86/traps: Use fatal_trap() for #UD and #GP Andrew Cooper
2026-03-02 16:39 ` Jan Beulich
@ 2026-03-02 16:40 ` Jan Beulich
1 sibling, 0 replies; 43+ messages in thread
From: Jan Beulich @ 2026-03-02 16:40 UTC (permalink / raw)
To: Andrew Cooper; +Cc: Roger Pau Monné, Xen-devel
On 28.02.2026 00:16, Andrew Cooper wrote:
> This renders the diagnostics in a more uniform way.
>
> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH v4 08/14] x86/traps: Enable FRED when requested
2026-03-02 16:12 ` Jan Beulich
@ 2026-03-03 13:44 ` Andrew Cooper
0 siblings, 0 replies; 43+ messages in thread
From: Andrew Cooper @ 2026-03-03 13:44 UTC (permalink / raw)
To: Jan Beulich; +Cc: Andrew Cooper, Roger Pau Monné, Xen-devel
On 02/03/2026 4:12 pm, Jan Beulich wrote:
> On 28.02.2026 00:16, Andrew Cooper wrote:
>> With the shadow stack and exception handling adjustments in place, we can now
>> activate FRED when appropriate. Note that opt_fred is still disabled by
>> default until more infrastructure is in place.
>>
>> Introduce init_fred() to set up all the MSRs relevant for FRED. FRED uses
>> MSR_STAR (entries from Ring3 only), and MSR_FRED_SSP_SL0 aliases MSR_PL0_SSP
>> when CET-SS is active. Otherwise, they're all new MSRs.
>>
>> Also introduce init_fred_tss(). At this juncture we need a TSS set up, even
>> if it is mostly unused. Reinsert the BUILD_BUG_ON() checking the size of the
>> TSS against 0x67, this time with a more precise comment.
>>
>> With init_fred() existing, load_system_tables() and legacy_syscall_init()
>> should only be used when setting up IDT delivery. Insert ASSERT()s to this
>> effect, and adjust the various init functions to make this property true.
>>
>> The FRED initialisation path still needs to write to all system table
>> registers at least once, even if only to invalidate them. Per the
>> documentation, percpu_early_traps_init() is responsible for switching off the
>> boot GDT, which also needs doing even in FRED mode.
>>
>> Finally, set CR4.FRED in traps_init()/percpu_early_traps_init().
>>
>> Xen can now boot in FRED mode and run a PVH dom0. PV guests still need more
>> work before they can be run under FRED.
>>
>> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Thanks.
>
>> [*] PVH Dom0 on an Intel PantherLake CPU.
> What other part is this remark connected to?
Ah - the commit message. Specifically, that I've only tested VT-x, not
SVM PVH dom0.
~Andrew
^ permalink raw reply [flat|nested] 43+ messages in thread
* [PATCH v4.1 09/14] x86/pv: Adjust GS handling for FRED mode
2026-02-27 23:16 ` [PATCH v4 09/14] x86/pv: Adjust GS handling for FRED mode Andrew Cooper
2026-03-02 16:24 ` Jan Beulich
@ 2026-03-04 17:18 ` Andrew Cooper
2026-03-05 10:00 ` Jan Beulich
1 sibling, 1 reply; 43+ messages in thread
From: Andrew Cooper @ 2026-03-04 17:18 UTC (permalink / raw)
To: Xen-devel; +Cc: Andrew Cooper, Jan Beulich, Jan Beulich, Roger Pau Monné
When FRED is active, hardware automatically swaps GS when changing privilege,
and the SWAPGS instruction is disallowed.
For native OSes using GS as the thread local pointer this is a massive
improvement on the pre-FRED architecture, but under Xen it makes handling PV
guests more complicated. Specifically, it means that GS_BASE and GS_SHADOW
are the opposite way around in FRED mode, as opposed to IDT mode.
This leads to the following changes:
* In load_segments(), we already load both GSes. Account for FRED in the
SWAP() condition and avoid the path with SWAPGS.
* In save_segments(), we need to read GS_SHADOW rather than GS_BASE.
* In toggle_guest_mode(), we need to emulate SWAPGS.
* In {read,write}_msr() which access the live registers, GS_SHADOW and
GS_BASE need swapping.
* In do_set_segment_base(), merge the SEGBASE_GS_{USER,KERNEL} cases and
take FRED into account when choosing which base to update.
SEGBASE_GS_USER_SEL was already an LKGS invocation (decades before FRED)
so under FRED needs to be just a MOV %gs. Simply skip the SWAPGSes.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
---
CC: Jan Beulich <JBeulich@suse.com>
CC: Roger Pau Monné <roger.pau@citrix.com>
v4.1:
* Extra comments
v4:
* Adjust GS accesses for emulated {RD,WR}MSR too.
---
xen/arch/x86/domain.c | 16 +++++++++++-----
xen/arch/x86/pv/domain.c | 22 ++++++++++++++++++++--
xen/arch/x86/pv/emul-priv-op.c | 26 +++++++++++++++++---------
xen/arch/x86/pv/misc-hypercalls.c | 23 +++++++++++++++--------
4 files changed, 63 insertions(+), 24 deletions(-)
diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
index e658c2d647b7..9c1f6ef76d52 100644
--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -1791,9 +1791,10 @@ static void load_segments(struct vcpu *n)
/*
* Figure out which way around gsb/gss want to be. gsb needs to be
- * the active context, and gss needs to be the inactive context.
+ * the active context, and gss needs to be the inactive context,
+ * unless we're in FRED mode where they're reversed.
*/
- if ( !(n->arch.flags & TF_kernel_mode) )
+ if ( !(n->arch.flags & TF_kernel_mode) ^ opt_fred )
SWAP(gsb, gss);
if ( using_svm() && (n->arch.pv.fs | n->arch.pv.gs) <= 3 )
@@ -1814,7 +1815,9 @@ static void load_segments(struct vcpu *n)
if ( !fs_gs_done && !compat )
{
- if ( read_cr4() & X86_CR4_FSGSBASE )
+ unsigned long cr4 = read_cr4();
+
+ if ( !(cr4 & X86_CR4_FRED) && (cr4 & X86_CR4_FSGSBASE) )
{
__wrgsbase(gss);
__wrfsbase(n->arch.pv.fs_base);
@@ -1931,6 +1934,9 @@ static void load_segments(struct vcpu *n)
* Guests however cannot use SWAPGS, so there is no mechanism to modify the
* inactive GS base behind Xen's back. Therefore, Xen's copy of the inactive
* GS base is still accurate, and doesn't need reading back from hardware.
+ *
+ * Under FRED, hardware automatically swaps GS for us, so SHADOW_GS is the
+ * active GS from the guest's point of view.
*/
static void save_segments(struct vcpu *v)
{
@@ -1946,12 +1952,12 @@ static void save_segments(struct vcpu *v)
if ( read_cr4() & X86_CR4_FSGSBASE )
{
fs_base = __rdfsbase();
- gs_base = __rdgsbase();
+ gs_base = opt_fred ? rdmsr(MSR_SHADOW_GS_BASE) : __rdgsbase();
}
else
{
fs_base = rdmsr(MSR_FS_BASE);
- gs_base = rdmsr(MSR_GS_BASE);
+ gs_base = opt_fred ? rdmsr(MSR_SHADOW_GS_BASE) : rdmsr(MSR_GS_BASE);
}
v->arch.pv.fs_base = fs_base;
diff --git a/xen/arch/x86/pv/domain.c b/xen/arch/x86/pv/domain.c
index d16583a7454d..b85abb5ed903 100644
--- a/xen/arch/x86/pv/domain.c
+++ b/xen/arch/x86/pv/domain.c
@@ -14,9 +14,10 @@
#include <asm/cpufeature.h>
#include <asm/fsgsbase.h>
#include <asm/invpcid.h>
-#include <asm/spec_ctrl.h>
#include <asm/pv/domain.h>
#include <asm/shadow.h>
+#include <asm/spec_ctrl.h>
+#include <asm/traps.h>
#ifdef CONFIG_PV32
int8_t __read_mostly opt_pv32 = -1;
@@ -514,11 +515,28 @@ void toggle_guest_mode(struct vcpu *v)
* subsequent context switch won't bother re-reading it.
*/
gs_base = read_gs_base();
+
+ /*
+ * In FRED mode, not only are the two GSes the other way around (i.e. we
+ * want to read GS_SHADOW here), the SWAPGS instruction is disallowed so
+ * we have to emulate it.
+ */
+ if ( opt_fred )
+ {
+ unsigned long gs_shadow = rdmsr(MSR_SHADOW_GS_BASE);
+
+ wrmsrns(MSR_SHADOW_GS_BASE, gs_base);
+ write_gs_base(gs_shadow);
+
+ gs_base = gs_shadow;
+ }
+ else
+ asm volatile ( "swapgs" );
+
if ( v->arch.flags & TF_kernel_mode )
v->arch.pv.gs_base_kernel = gs_base;
else
v->arch.pv.gs_base_user = gs_base;
- asm volatile ( "swapgs" );
_toggle_guest_pt(v);
diff --git a/xen/arch/x86/pv/emul-priv-op.c b/xen/arch/x86/pv/emul-priv-op.c
index 64d47ab677a4..53676b30219c 100644
--- a/xen/arch/x86/pv/emul-priv-op.c
+++ b/xen/arch/x86/pv/emul-priv-op.c
@@ -25,6 +25,7 @@
#include <asm/pv/traps.h>
#include <asm/shared.h>
#include <asm/stubs.h>
+#include <asm/traps.h>
#include <xsm/xsm.h>
@@ -926,7 +927,8 @@ static int cf_check read_msr(
case MSR_GS_BASE:
if ( !cp->extd.lm )
break;
- *val = read_gs_base();
+ /* Under FRED, GS is automatically swapped on privilege change. */
+ *val = opt_fred ? rdmsr(MSR_SHADOW_GS_BASE) : read_gs_base();
return X86EMUL_OKAY;
case MSR_SHADOW_GS_BASE:
@@ -1066,17 +1068,23 @@ static int cf_check write_msr(
if ( !cp->extd.lm || !is_canonical_address(val) )
break;
- if ( reg == MSR_FS_BASE )
- write_fs_base(val);
- else if ( reg == MSR_GS_BASE )
- write_gs_base(val);
- else if ( reg == MSR_SHADOW_GS_BASE )
+ switch ( reg )
{
- write_gs_shadow(val);
+ case MSR_FS_BASE:
+ write_fs_base(val);
+ break;
+
+ case MSR_SHADOW_GS_BASE:
curr->arch.pv.gs_base_user = val;
+ fallthrough;
+ case MSR_GS_BASE:
+ /* Under FRED, GS is automatically swapped on privilege change. */
+ if ( (reg == MSR_GS_BASE) ^ opt_fred )
+ write_gs_base(val);
+ else
+ write_gs_shadow(val);
+ break;
}
- else
- ASSERT_UNREACHABLE();
return X86EMUL_OKAY;
case MSR_EFER:
diff --git a/xen/arch/x86/pv/misc-hypercalls.c b/xen/arch/x86/pv/misc-hypercalls.c
index 4c2abeb4add8..7e915d86b724 100644
--- a/xen/arch/x86/pv/misc-hypercalls.c
+++ b/xen/arch/x86/pv/misc-hypercalls.c
@@ -11,6 +11,7 @@
#include <asm/debugreg.h>
#include <asm/fsgsbase.h>
+#include <asm/traps.h>
long do_set_debugreg(int reg, unsigned long value)
{
@@ -192,11 +193,13 @@ long do_set_segment_base(unsigned int which, unsigned long base)
case SEGBASE_GS_USER:
v->arch.pv.gs_base_user = base;
- write_gs_shadow(base);
- break;
-
+ fallthrough;
case SEGBASE_GS_KERNEL:
- write_gs_base(base);
+ /* Under FRED, GS is automatically swapped on privilege change. */
+ if ( (which == SEGBASE_GS_KERNEL) ^ opt_fred )
+ write_gs_base(base);
+ else
+ write_gs_shadow(base);
break;
}
break;
@@ -206,10 +209,13 @@ long do_set_segment_base(unsigned int which, unsigned long base)
unsigned int sel = (uint16_t)base;
/*
- * We wish to update the user %gs from the GDT/LDT. Currently, the
- * guest kernel's GS_BASE is in context.
+ * We wish to update the user %gs from the GDT/LDT. Currently, we are
+ * in guest kernel context.
+ *
+ * Under IDT, this means updating GS_SHADOW. Under FRED, plain GS.
*/
- asm volatile ( "swapgs" );
+ if ( !opt_fred )
+ asm volatile ( "swapgs" );
if ( sel > 3 )
/* Fix up RPL for non-NUL selectors. */
@@ -247,7 +253,8 @@ long do_set_segment_base(unsigned int which, unsigned long base)
/* Update the cache of the inactive base, as read from the GDT/LDT. */
v->arch.pv.gs_base_user = read_gs_base();
- asm volatile ( safe_swapgs );
+ if ( !opt_fred )
+ asm volatile ( safe_swapgs );
break;
}
--
2.39.5
^ permalink raw reply related [flat|nested] 43+ messages in thread
* Re: [PATCH v4.1 09/14] x86/pv: Adjust GS handling for FRED mode
2026-03-04 17:18 ` [PATCH v4.1 " Andrew Cooper
@ 2026-03-05 10:00 ` Jan Beulich
0 siblings, 0 replies; 43+ messages in thread
From: Jan Beulich @ 2026-03-05 10:00 UTC (permalink / raw)
To: Andrew Cooper; +Cc: Roger Pau Monné, Xen-devel
On 04.03.2026 18:18, Andrew Cooper wrote:
> When FRED is active, hardware automatically swaps GS when changing privilege,
> and the SWAPGS instruction is disallowed.
>
> For native OSes using GS as the thread local pointer this is a massive
> improvement on the pre-FRED architecture, but under Xen it makes handling PV
> guests more complicated. Specifically, it means that GS_BASE and GS_SHADOW
> are the opposite way around in FRED mode, as opposed to IDT mode.
>
> This leads to the following changes:
>
> * In load_segments(), we already load both GSes. Account for FRED in the
> SWAP() condition and avoid the path with SWAPGS.
>
> * In save_segments(), we need to read GS_SHADOW rather than GS_BASE.
>
> * In toggle_guest_mode(), we need to emulate SWAPGS.
>
> * In {read,write}_msr() which access the live registers, GS_SHADOW and
> GS_BASE need swapping.
>
> * In do_set_segment_base(), merge the SEGBASE_GS_{USER,KERNEL} cases and
> take FRED into account when choosing which base to update.
>
> SEGBASE_GS_USER_SEL was already an LKGS invocation (decades before FRED)
> so under FRED needs to be just a MOV %gs. Simply skip the SWAPGSes.
>
> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
> Reviewed-by: Jan Beulich <jbeulich@suse.com>
> ---
> CC: Jan Beulich <JBeulich@suse.com>
> CC: Roger Pau Monné <roger.pau@citrix.com>
>
> v4.1:
> * Extra comments
Thanks.
Jan
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH v4 12/14] x86/pv: System call handling in FRED mode
2026-02-27 23:16 ` [PATCH v4 12/14] x86/pv: System call handling in FRED mode Andrew Cooper
@ 2026-03-09 22:25 ` Andrew Cooper
2026-03-10 7:16 ` Jan Beulich
0 siblings, 1 reply; 43+ messages in thread
From: Andrew Cooper @ 2026-03-09 22:25 UTC (permalink / raw)
To: Xen-devel; +Cc: Andrew Cooper, Jan Beulich, Roger Pau Monné
On 27/02/2026 11:16 pm, Andrew Cooper wrote:
> diff --git a/xen/arch/x86/traps.c b/xen/arch/x86/traps.c
> index 2f40f628cbff..e2c35a046e6b 100644
> --- a/xen/arch/x86/traps.c
> +++ b/xen/arch/x86/traps.c
> ...
> + case 2: /* SYSENTER */
> + {
> + /*
> + * FRED delivery preserves the interrupted state, but previously
> + * SYSENTER discarded almost everything.
> + *
> + * The guest isn't aware of FRED, so recreate the legacy
> + * behaviour.
> + *
> + * When setting the selectors, clear all upper metadata. In
> + * particular fred_ss.swint becomes pend_DB on ERETx.
> + *
> + * When converting to a fault, hardware finally gives us enough
> + * information to account for prefixes, so provide the more
> + * correct behaviour rather than assuming the instruction was two
> + * bytes long.
> + */
> + unsigned int len = regs->fred_ss.insnlen;
> +
> + regs->ssx = FLAT_USER_SS;
> + regs->rsp = 0;
> + regs->eflags &= ~(X86_EFLAGS_VM | X86_EFLAGS_IF);
> + regs->csx = 3;
> + regs->rip = 0;
> +
> + if ( !curr->arch.pv.sysenter_callback_eip )
> + {
> + regs->rip -= len;
> + pv_inject_hw_exception(X86_EXC_GP, 0);
> + }
> + else
> + pv_inject_callback(CALLBACKTYPE_sysenter);
> + break;
This isn't actually a correct transformation of the IDT code. When the
SYSENTER entrypoint isn't registered, this delivers a #GP at
0003:fffffffffffffffe.
The simple fix to get back to IDT behaviour is to simply drop the
subtraction of len.
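As a rough sketch (based only on the hunk quoted above, not a tested patch),
the IDT-matching variant is:

    if ( !curr->arch.pv.sysenter_callback_eip )
        /* %rip is already 0, so the #GP points at 0, as in IDT mode. */
        pv_inject_hw_exception(X86_EXC_GP, 0);
    else
        pv_inject_callback(CALLBACKTYPE_sysenter);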
In FRED mode, we can finally point the #GP at the SYSENTER instruction,
rather than delivering at 0. We could even provide the success case
pointing sensibly too.
The question is should we? Until now, the differences between FRED and
IDT mode are minimal. This would be a major difference, and it's for
SYSENTER, which is all but unused. I'm erring on the side of "match IDT".
~Andrew
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH v4 12/14] x86/pv: System call handling in FRED mode
2026-03-09 22:25 ` Andrew Cooper
@ 2026-03-10 7:16 ` Jan Beulich
0 siblings, 0 replies; 43+ messages in thread
From: Jan Beulich @ 2026-03-10 7:16 UTC (permalink / raw)
To: Andrew Cooper; +Cc: Roger Pau Monné, Xen-devel
On 09.03.2026 23:25, Andrew Cooper wrote:
> On 27/02/2026 11:16 pm, Andrew Cooper wrote:
>> diff --git a/xen/arch/x86/traps.c b/xen/arch/x86/traps.c
>> index 2f40f628cbff..e2c35a046e6b 100644
>> --- a/xen/arch/x86/traps.c
>> +++ b/xen/arch/x86/traps.c
>> ...
>> + case 2: /* SYSENTER */
>> + {
>> + /*
>> + * FRED delivery preserves the interrupted state, but previously
>> + * SYSENTER discarded almost everything.
>> + *
>> + * The guest isn't aware of FRED, so recreate the legacy
>> + * behaviour.
>> + *
>> + * When setting the selectors, clear all upper metadata. In
>> + * particular fred_ss.swint becomes pend_DB on ERETx.
>> + *
>> + * When converting to a fault, hardware finally gives us enough
>> + * information to account for prefixes, so provide the more
>> + * correct behaviour rather than assuming the instruction was two
>> + * bytes long.
>> + */
>> + unsigned int len = regs->fred_ss.insnlen;
>> +
>> + regs->ssx = FLAT_USER_SS;
>> + regs->rsp = 0;
>> + regs->eflags &= ~(X86_EFLAGS_VM | X86_EFLAGS_IF);
>> + regs->csx = 3;
>> + regs->rip = 0;
>> +
>> + if ( !curr->arch.pv.sysenter_callback_eip )
>> + {
>> + regs->rip -= len;
>> + pv_inject_hw_exception(X86_EXC_GP, 0);
>> + }
>> + else
>> + pv_inject_callback(CALLBACKTYPE_sysenter);
>> + break;
>
> This isn't actually a correct transformation of the IDT code. When the
> SYSENTER entrypoint isn't registered, this delivers a #GP at
> 0003:fffffffffffffffe.
>
> The simple fix to get back to IDT behaviour is to simply drop the
> subtraction of len.
>
> In FRED mode, we can finally point the #GP at the SYSENTER instruction,
> rather than delivering at 0. We could even provide the success case
> pointing sensibly too.
>
> The question is should we? Until now, the differences between FRED and
> IDT mode are minimal. This would be a major difference, and it's for
> SYSENTER, which is all but unused. I'm erring on the side of "match IDT".
I agree. Down the road we could introduce an opt-in "better behavior" mode
when running under FRED (also covering other aspects previously discussed).
Jan
^ permalink raw reply [flat|nested] 43+ messages in thread
* [PATCH v4.1 13/14] x86: Clamp bits in eflags more aggressively
2026-02-27 23:16 ` [PATCH v4 13/14] x86: Clamp reserved bits in eflags more aggressively Andrew Cooper
2026-03-02 16:35 ` Jan Beulich
@ 2026-03-11 17:58 ` Andrew Cooper
2026-03-12 8:15 ` Jan Beulich
1 sibling, 1 reply; 43+ messages in thread
From: Andrew Cooper @ 2026-03-11 17:58 UTC (permalink / raw)
To: Xen-devel; +Cc: Andrew Cooper, Jan Beulich, Roger Pau Monné
In FRED mode, ERET is stricter than IRET about flags. Notably this means:
* The vm86 bit (bit 17) and IOPL (bits 12,13) must be clear.
* The sticky-1 reserved bit (bit 2) must be set, so dom0_construct() needs to
set X86_EFLAGS_MBS in order for a PV dom0 to start.
* All other reserved bits must be clear.
Xen has been overly lax with reserved bit handling. Adjust
arch_set_info_guest*() and hypercall_iret() which consume flags to clamp the
reserved bits for all guest types.
This is a minor ABI change, but by the same argument as commit
9f892f84c279 ("x86/domctl: Stop using XLAT_cpu_user_regs()"); the reserved
bits would get clamped like this naturally by hardware when the vCPU is run.
The handling of vm86 is also different. Guests under 32bit Xen really could
use vm86 mode, but Long Mode disallows vm86 mode and IRET simply ignores the
bit. Xen's behaviour for a PV32 guest trying to use vm86 mode under a 64bit
Xen is to arrange to deliver #GP at the target of the IRET, rather than to
fail the IRET itself.
However there's no filter filtering in arch_set_info_guest() itself, and it
can't arrange to queue a #GP at the target, so do the next best thing and fail
the hypercall. This is not expected to create an issue for PV guests, as the
result of such an arch_set_info_guest() previously would be to run supposedly
Real Mode code as Protected Mode code.
This allows PV guests to start when Xen is using FRED mode.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
CC: Jan Beulich <JBeulich@suse.com>
CC: Roger Pau Monné <roger.pau@citrix.com>
v4.1:
* Adjust VM handling.
* Rewrite commit message.
v3:
* Rewrite the commit message.
v2:
* New
It turns out that it's simply ignored by IRET in Long Mode (i.e. clearing it
in commit 0e47f92b0725 ("x86: force EFLAGS.IF on when exiting to PV guests")
wasn't actually necessary) but ERETU does care.
---
xen/arch/x86/domain.c | 24 ++++++++++++++++++++++--
xen/arch/x86/hvm/domain.c | 4 ++--
xen/arch/x86/include/asm/x86-defns.h | 7 +++++++
xen/arch/x86/pv/dom0_build.c | 2 +-
xen/arch/x86/pv/iret.c | 8 +++++---
5 files changed, 37 insertions(+), 8 deletions(-)
diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
index 868c26036dd9..4664264b2f5d 100644
--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -1193,6 +1193,14 @@ int arch_set_info_guest(
if ( !__addr_ok(c.nat->ldt_base) )
return -EINVAL;
+
+ /*
+ * IRET in Long Mode discards EFLAGS.VM, but in FRED mode ERET
+ * cares that it is zero.
+ *
+ * Guests can't see FRED, so emulate IRET behaviour.
+ */
+ c.nat->user_regs.rflags &= ~X86_EFLAGS_VM;
}
#ifdef CONFIG_COMPAT
else
@@ -1205,6 +1213,18 @@ int arch_set_info_guest(
for ( i = 0; i < ARRAY_SIZE(c.cmp->trap_ctxt); i++ )
fixup_guest_code_selector(d, c.cmp->trap_ctxt[i].cs);
+
+ /*
+ * Under 32bit Xen, PV guests could really use vm86 mode. Under
+ * 64bit Xen, vm86 mode can't be entered even by PV32 guests.
+ *
+ * For backwards compatibility, compat HYPERCALL_iret will arrange
+ * to deliver #GP at the target of the IRET rather than to fail
+ * the IRET itself, but we can't arrange for the same behaviour
+ * here. Reject the hypercall as the next best option.
+ */
+ if ( c.cmp->user_regs.eflags & X86_EFLAGS_VM )
+ return -EINVAL;
}
#endif
@@ -1244,7 +1264,7 @@ int arch_set_info_guest(
v->arch.user_regs.rax = c.nat->user_regs.rax;
v->arch.user_regs.rip = c.nat->user_regs.rip;
v->arch.user_regs.cs = c.nat->user_regs.cs;
- v->arch.user_regs.rflags = c.nat->user_regs.rflags;
+ v->arch.user_regs.rflags = (c.nat->user_regs.rflags & X86_EFLAGS_ALL) | X86_EFLAGS_MBS;
v->arch.user_regs.rsp = c.nat->user_regs.rsp;
v->arch.user_regs.ss = c.nat->user_regs.ss;
v->arch.pv.es = c.nat->user_regs.es;
@@ -1268,7 +1288,7 @@ int arch_set_info_guest(
v->arch.user_regs.eax = c.cmp->user_regs.eax;
v->arch.user_regs.eip = c.cmp->user_regs.eip;
v->arch.user_regs.cs = c.cmp->user_regs.cs;
- v->arch.user_regs.eflags = c.cmp->user_regs.eflags;
+ v->arch.user_regs.eflags = (c.cmp->user_regs.eflags & X86_EFLAGS_ALL) | X86_EFLAGS_MBS;
v->arch.user_regs.esp = c.cmp->user_regs.esp;
v->arch.user_regs.ss = c.cmp->user_regs.ss;
v->arch.pv.es = c.cmp->user_regs.es;
diff --git a/xen/arch/x86/hvm/domain.c b/xen/arch/x86/hvm/domain.c
index 155d61db13f8..a0e811ea47a0 100644
--- a/xen/arch/x86/hvm/domain.c
+++ b/xen/arch/x86/hvm/domain.c
@@ -194,7 +194,7 @@ int arch_set_info_hvm_guest(struct vcpu *v, const struct vcpu_hvm_context *ctx)
uregs->rsi = regs->esi;
uregs->rdi = regs->edi;
uregs->rip = regs->eip;
- uregs->rflags = regs->eflags;
+ uregs->rflags = (regs->eflags & X86_EFLAGS_ALL) | X86_EFLAGS_MBS;
v->arch.hvm.guest_cr[0] = regs->cr0;
v->arch.hvm.guest_cr[3] = regs->cr3;
@@ -245,7 +245,7 @@ int arch_set_info_hvm_guest(struct vcpu *v, const struct vcpu_hvm_context *ctx)
uregs->rsi = regs->rsi;
uregs->rdi = regs->rdi;
uregs->rip = regs->rip;
- uregs->rflags = regs->rflags;
+ uregs->rflags = (regs->rflags & X86_EFLAGS_ALL) | X86_EFLAGS_MBS;
v->arch.hvm.guest_cr[0] = regs->cr0;
v->arch.hvm.guest_cr[3] = regs->cr3;
diff --git a/xen/arch/x86/include/asm/x86-defns.h b/xen/arch/x86/include/asm/x86-defns.h
index 0a0ba83de786..edeb0b4ff95a 100644
--- a/xen/arch/x86/include/asm/x86-defns.h
+++ b/xen/arch/x86/include/asm/x86-defns.h
@@ -27,6 +27,13 @@
(X86_EFLAGS_CF | X86_EFLAGS_PF | X86_EFLAGS_AF | \
X86_EFLAGS_ZF | X86_EFLAGS_SF | X86_EFLAGS_OF)
+#define X86_EFLAGS_ALL \
+ (X86_EFLAGS_ARITH_MASK | X86_EFLAGS_TF | X86_EFLAGS_IF | \
+ X86_EFLAGS_DF | X86_EFLAGS_OF | X86_EFLAGS_IOPL | \
+ X86_EFLAGS_NT | X86_EFLAGS_RF | X86_EFLAGS_VM | \
+ X86_EFLAGS_AC | X86_EFLAGS_VIF | X86_EFLAGS_VIP | \
+ X86_EFLAGS_ID)
+
/*
* Intel CPU flags in CR0
*/
diff --git a/xen/arch/x86/pv/dom0_build.c b/xen/arch/x86/pv/dom0_build.c
index 9a11a0a16b4e..075a3646c2a3 100644
--- a/xen/arch/x86/pv/dom0_build.c
+++ b/xen/arch/x86/pv/dom0_build.c
@@ -1024,7 +1024,7 @@ static int __init dom0_construct(const struct boot_domain *bd)
regs->rip = parms.virt_entry;
regs->rsp = vstack_end;
regs->rsi = vstartinfo_start;
- regs->eflags = X86_EFLAGS_IF;
+ regs->eflags = X86_EFLAGS_IF | X86_EFLAGS_MBS;
/*
* We don't call arch_set_info_guest(), so some initialisation needs doing
diff --git a/xen/arch/x86/pv/iret.c b/xen/arch/x86/pv/iret.c
index d3a1fb2c685b..39ce316b8d91 100644
--- a/xen/arch/x86/pv/iret.c
+++ b/xen/arch/x86/pv/iret.c
@@ -80,8 +80,9 @@ long do_iret(void)
regs->rip = iret_saved.rip;
regs->cs = iret_saved.cs | 3; /* force guest privilege */
- regs->rflags = ((iret_saved.rflags & ~(X86_EFLAGS_IOPL|X86_EFLAGS_VM))
- | X86_EFLAGS_IF);
+ regs->rflags = ((iret_saved.rflags & X86_EFLAGS_ALL &
+ ~(X86_EFLAGS_IOPL | X86_EFLAGS_VM)) |
+ X86_EFLAGS_IF | X86_EFLAGS_MBS);
regs->rsp = iret_saved.rsp;
regs->ss = iret_saved.ss | 3; /* force guest privilege */
@@ -143,7 +144,8 @@ int compat_iret(void)
if ( VM_ASSIST(v->domain, architectural_iopl) )
v->arch.pv.iopl = eflags & X86_EFLAGS_IOPL;
- regs->eflags = (eflags & ~X86_EFLAGS_IOPL) | X86_EFLAGS_IF;
+ regs->eflags = ((eflags & X86_EFLAGS_ALL & ~X86_EFLAGS_IOPL) |
+ X86_EFLAGS_IF | X86_EFLAGS_MBS);
if ( unlikely(eflags & X86_EFLAGS_VM) )
{
--
2.39.5
^ permalink raw reply related [flat|nested] 43+ messages in thread
* Re: [PATCH v4.1 13/14] x86: Clamp bits in eflags more aggressively
2026-03-11 17:58 ` [PATCH v4.1 13/14] x86: Clamp " Andrew Cooper
@ 2026-03-12 8:15 ` Jan Beulich
2026-03-12 12:36 ` Andrew Cooper
0 siblings, 1 reply; 43+ messages in thread
From: Jan Beulich @ 2026-03-12 8:15 UTC (permalink / raw)
To: Andrew Cooper; +Cc: Roger Pau Monné, Xen-devel
On 11.03.2026 18:58, Andrew Cooper wrote:
> In FRED mode, ERET is stricter than IRET about flags. Notably this means:
>
> * The vm86 bit (bit 17) and IOPL (bits 12,13) must be clear.
> * The sticky-1 reserved bit (bit 2) must be set, so dom0_construct() needs to
> set X86_EFLAGS_MBS in order for a PV dom0 to start.
> * All other reserved bits must be clear.
>
> Xen has been overly lax with reserved bit handling. Adjust
> arch_set_info_guest*() and hypercall_iret() which consume flags to clamp the
> reserved bits for all guest types.
>
> This is a minor ABI change, but by the same argument as commit
> 9f892f84c279 ("x86/domctl: Stop using XLAT_cpu_user_regs()"); the reserved
> bits would get clamped like this naturally by hardware when the vCPU is run.
>
> The handling of vm86 is also different. Guests under 32bit Xen really could
> use vm86 mode, but Long Mode disallows vm86 mode and IRET simply ignores the
> bit. Xen's behaviour for a PV32 guest trying to use vm86 mode under a 64bit
> Xen is to arrange to deliver #GP at the target of the IRET, rather than to
> fail the IRET itself.
>
> However there's no filter filtering in arch_set_info_guest() itself, and it
Nit: Excess "filter"?
> can't arrange to queue a #GP at the target, so do the next best thing and fail
> the hypercall. This is not expected to create an issue for PV guests, as the
> result of such an arch_set_info_guest() previously would be to run supposedly
> Real Mode code as Protected Mode code.
>
> This allows PV guests to start when Xen is using FRED mode.
>
> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Nevertheless, a largely unrelated remark and two suggestions:
> --- a/xen/arch/x86/domain.c
> +++ b/xen/arch/x86/domain.c
> @@ -1193,6 +1193,14 @@ int arch_set_info_guest(
>
> if ( !__addr_ok(c.nat->ldt_base) )
> return -EINVAL;
Seeing this still in context: I had some trouble locating the position where
you're making the change, as in my local tree this is long gone. Is there
any chance we could make progress on "x86/PV: consolidate LDT checks" [1]?
> +
> + /*
> + * IRET in Long Mode discards EFLAGS.VM, but in FRED mode ERET
> + * cares that it is zero.
> + *
> + * Guests can't see FRED, so emulate IRET behaviour.
> + */
> + c.nat->user_regs.rflags &= ~X86_EFLAGS_VM;
> }
> #ifdef CONFIG_COMPAT
> else
> @@ -1205,6 +1213,18 @@ int arch_set_info_guest(
>
> for ( i = 0; i < ARRAY_SIZE(c.cmp->trap_ctxt); i++ )
> fixup_guest_code_selector(d, c.cmp->trap_ctxt[i].cs);
> +
> + /*
> + * Under 32bit Xen, PV guests could really use vm86 mode. Under
> + * 64bit Xen, vm86 mode can't be entered even by PV32 guests.
> + *
> + * For backwards compatibility, compat HYPERCALL_iret will arrange
> + * to deliver #GP at the target of the IRET rather than to fail
> + * the IRET itself, but we can't arrange for the same behaviour
> + * here. Reject the hypercall as the next best option.
> + */
> + if ( c.cmp->user_regs.eflags & X86_EFLAGS_VM )
> + return -EINVAL;
Technically we could support VM86 mode, by fully emulating things. Hence I
think -EOPNOTSUPP would be more appropriate.
> }
> #endif
Having all of the EFLAGS handling together would be nice. IOPL and IF handling
sit further down. Could I talk you into moving these additions down there? Yes,
there are downsides to that: It looks to need another "compat" conditional, and
it would further the mix of state updates and error checks. Yet I still think
having all of the EFLAGS stuff together is a benefit.
Jan
[1] https://lists.xen.org/archives/html/xen-devel/2023-09/msg00157.html
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH v4.1 13/14] x86: Clamp bits in eflags more aggressively
2026-03-12 8:15 ` Jan Beulich
@ 2026-03-12 12:36 ` Andrew Cooper
0 siblings, 0 replies; 43+ messages in thread
From: Andrew Cooper @ 2026-03-12 12:36 UTC (permalink / raw)
To: Jan Beulich; +Cc: Andrew Cooper, Roger Pau Monné, Xen-devel
On 12/03/2026 8:15 am, Jan Beulich wrote:
> On 11.03.2026 18:58, Andrew Cooper wrote:
>> In FRED mode, ERET is stricter than IRET about flags. Notably this means:
>>
>> * The vm86 bit (bit 17) and IOPL (bits 12,13) must be clear.
>> * The sticky-1 reserved bit (bit 2) must be set, so dom0_construct() needs to
>> set X86_EFLAGS_MBS in order for a PV dom0 to start.
>> * All other reserved bits must be clear.
>>
>> Xen has been overly lax with reserved bit handling. Adjust
>> arch_set_info_guest*() and hypercall_iret() which consume flags to clamp the
>> reserved bits for all guest types.
>>
>> This is a minor ABI change, but by the same argument as commit
>> 9f892f84c279 ("x86/domctl: Stop using XLAT_cpu_user_regs()"); the reserved
>> bits would get clamped like this naturally by hardware when the vCPU is run.
>>
>> The handling of vm86 is also different. Guests under 32bit Xen really could
>> use vm86 mode, but Long Mode disallows vm86 mode and IRET simply ignores the
>> bit. Xen's behaviour for a PV32 guest trying to use vm86 mode under a 64bit
>> Xen is to arrange to deliver #GP at the target of the IRET, rather than to
>> fail the IRET itself.
>>
>> However there's no filter filtering in arch_set_info_guest() itself, and it
> Nit: Excess "filter"?
Yes. I noticed that immediately after sending.
>
>> can't arrange to queue a #GP at the target, so do the next best thing and fail
>> the hypercall. This is not expected to create an issue for PV guests, as the
>> result of such an arch_set_info_guest() previously would be to run supposedly
>> Real Mode code as Protected Mode code.
>>
>> This allows PV guests to start when Xen is using FRED mode.
>>
>> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Thanks.
>
> Nevertheless, a largely unrelated remark and two suggestions:
>
>> --- a/xen/arch/x86/domain.c
>> +++ b/xen/arch/x86/domain.c
>> @@ -1193,6 +1193,14 @@ int arch_set_info_guest(
>>
>> if ( !__addr_ok(c.nat->ldt_base) )
>> return -EINVAL;
> Seeing this still in context: I had some trouble locating the position where
> you're making the change, as in my local tree this is long gone. Is there
> any chance we could make progress on "x86/PV: consolidate LDT checks" [1]?
I'll have another look, but this patch is going to need to go in first
as it needs backporting to 4.21.
>
>> +
>> + /*
>> + * IRET in Long Mode discards EFLAGS.VM, but in FRED mode ERET
>> + * cares that it is zero.
>> + *
>> + * Guests can't see FRED, so emulate IRET behaviour.
>> + */
>> + c.nat->user_regs.rflags &= ~X86_EFLAGS_VM;
>> }
>> #ifdef CONFIG_COMPAT
>> else
>> @@ -1205,6 +1213,18 @@ int arch_set_info_guest(
>>
>> for ( i = 0; i < ARRAY_SIZE(c.cmp->trap_ctxt); i++ )
>> fixup_guest_code_selector(d, c.cmp->trap_ctxt[i].cs);
>> +
>> + /*
>> + * Under 32bit Xen, PV guests could really use vm86 mode. Under
>> + * 64bit Xen, vm86 mode can't be entered even by PV32 guests.
>> + *
>> + * For backwards compatibility, compat HYPERCALL_iret will arrange
>> + * to deliver #GP at the target of the IRET rather than to fail
>> + * the IRET itself, but we can't arrange for the same behaviour
>> + * here. Reject the hypercall as the next best option.
>> + */
>> + if ( c.cmp->user_regs.eflags & X86_EFLAGS_VM )
>> + return -EINVAL;
> Technically we could support VM86 mode, by fully emulating things. Hence I
> think -EOPNOTSUPP would be more appropriate.
Sorry, but I think you're rather too late on that suggestion. Anyone
wanting vm86 mode can use a VM.
>
>> }
>> #endif
> Having all of the EFLAGS handling together would be nice. IOPL and IF handling
> sit further down. Could I talk you into moving these additions down there?
No, but not for ...
> Yes,
> there are downsides to that: It looks to need another "compat" conditional, and
> it would further the mix of state updates and error checks. Yet I still think
> having all of the EFLAGS stuff together is a benefit.
... these reasons. The later position is after the point at which it's
buggy to fail the hypercall, because we've already reset the FPU amongst
other things.
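[A minimal sketch of that ordering constraint.  The types and helpers below
are stand-ins chosen for illustration, not Xen's actual arch_set_info_guest()
code: the point is only that every check which can fail the hypercall must
precede the first destructive update.]

#include <errno.h>

struct guest_context { unsigned long eflags; };
struct vcpu { int fpu_initialised; };

#define X86_EFLAGS_VM 0x00020000u

static void fpu_reset(struct vcpu *v) { v->fpu_initialised = 0; }

static int set_vcpu_context(struct vcpu *v, const struct guest_context *c)
{
    /* Phase 1: validation only - nothing in *v has been touched yet. */
    if ( c->eflags & X86_EFLAGS_VM )
        return -EINVAL;

    /* Phase 2: destructive updates - failing after this point is a bug,
     * as the caller would be left with a half-modified vCPU. */
    fpu_reset(v);

    return 0;
}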
This is a dire function in need of a lot of work. I'm just leaving it
no more broken than it was before.
~Andrew
^ permalink raw reply [flat|nested] 43+ messages in thread
end of thread, other threads:[~2026-03-12 12:37 UTC | newest]
Thread overview: 43+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2026-02-27 23:16 [PATCH v4 00/14] x86: FRED support Andrew Cooper
2026-02-27 23:16 ` [PATCH v4 01/14] x86/pv: Don't assume that INT $imm8 instructions are two bytes long Andrew Cooper
2026-03-02 11:03 ` Jan Beulich
2026-03-02 11:43 ` Andrew Cooper
2026-03-02 12:57 ` Jan Beulich
2026-03-02 16:39 ` Andrew Cooper
2026-02-27 23:16 ` [PATCH v4 02/14] docs/guest-guide: Describe the PV traps and entrypoints ABI Andrew Cooper
2026-03-02 11:19 ` Jan Beulich
2026-03-02 14:47 ` Andrew Cooper
2026-02-27 23:16 ` [PATCH v4 03/14] x86/boot: Move gdt_l1e caching out of traps_init() Andrew Cooper
2026-03-02 11:33 ` Jan Beulich
2026-02-27 23:16 ` [PATCH v4 04/14] x86/boot: Document the ordering dependency of _svm_cpu_up() Andrew Cooper
2026-03-02 11:35 ` Jan Beulich
2026-03-02 15:20 ` Andrew Cooper
2026-03-02 15:34 ` Jan Beulich
2026-03-02 15:42 ` Andrew Cooper
2026-02-27 23:16 ` [PATCH v4 05/14] x86/traps: Move traps_init() earlier on boot Andrew Cooper
2026-03-02 11:39 ` Jan Beulich
2026-03-02 15:32 ` Andrew Cooper
2026-02-27 23:16 ` [PATCH v4 06/14] x86/traps: Don't configure Supervisor Shadow Stack tokens in FRED mode Andrew Cooper
2026-03-02 14:50 ` Jan Beulich
2026-03-02 15:47 ` Andrew Cooper
2026-02-27 23:16 ` [PATCH v4 07/14] x86/traps: Introduce FRED entrypoints Andrew Cooper
2026-02-27 23:16 ` [PATCH v4 08/14] x86/traps: Enable FRED when requested Andrew Cooper
2026-03-02 16:12 ` Jan Beulich
2026-03-03 13:44 ` Andrew Cooper
2026-02-27 23:16 ` [PATCH v4 09/14] x86/pv: Adjust GS handling for FRED mode Andrew Cooper
2026-03-02 16:24 ` Jan Beulich
2026-03-04 17:18 ` [PATCH v4.1 " Andrew Cooper
2026-03-05 10:00 ` Jan Beulich
2026-02-27 23:16 ` [PATCH v4 10/14] x86/pv: Guest exception handling in " Andrew Cooper
2026-02-27 23:16 ` [PATCH v4 11/14] x86/pv: ERETU error handling Andrew Cooper
2026-02-27 23:16 ` [PATCH v4 12/14] x86/pv: System call handling in FRED mode Andrew Cooper
2026-03-09 22:25 ` Andrew Cooper
2026-03-10 7:16 ` Jan Beulich
2026-02-27 23:16 ` [PATCH v4 13/14] x86: Clamp reserved bits in eflags more aggressively Andrew Cooper
2026-03-02 16:35 ` Jan Beulich
2026-03-11 17:58 ` [PATCH v4.1 13/14] x86: Clamp " Andrew Cooper
2026-03-12 8:15 ` Jan Beulich
2026-03-12 12:36 ` Andrew Cooper
2026-02-27 23:16 ` [PATCH v4 14/14] x86/traps: Use fatal_trap() for #UD and #GP Andrew Cooper
2026-03-02 16:39 ` Jan Beulich
2026-03-02 16:40 ` Jan Beulich