linux-hyperv.vger.kernel.org archive mirror
* [PATCH v10 00/38] x86: enable FRED for x86-64
@ 2023-09-14  4:47 Xin Li
  2023-09-14  4:47 ` [PATCH v10 01/38] x86/cpufeatures: Add the cpu feature bit for WRMSRNS Xin Li
                   ` (37 more replies)
  0 siblings, 38 replies; 88+ messages in thread
From: Xin Li @ 2023-09-14  4:47 UTC (permalink / raw)
  To: linux-doc, linux-kernel, linux-edac, linux-hyperv, kvm, xen-devel
  Cc: tglx, mingo, bp, dave.hansen, x86, hpa, luto, pbonzini, seanjc,
	peterz, jgross, ravi.v.shankar, mhiramat, andrew.cooper3,
	jiangshanlai

This patch set enables the Intel flexible return and event delivery
(FRED) architecture for x86-64.

The FRED architecture defines simple new transitions that change
privilege level (ring transitions). The FRED architecture was
designed with the following goals:

1) Improve overall performance and response time by replacing event
   delivery through the interrupt descriptor table (IDT event
   delivery) and event return by the IRET instruction with lower
   latency transitions.

2) Improve software robustness by ensuring that event delivery
   establishes the full supervisor context and that event return
   establishes the full user context.

The new transitions defined by the FRED architecture are FRED event
delivery and, for returning from events, two FRED return instructions.
FRED event delivery can effect a transition from ring 3 to ring 0, but
it is used also to deliver events incident to ring 0. One FRED
instruction (ERETU) effects a return from ring 0 to ring 3, while the
other (ERETS) returns while remaining in ring 0. Collectively, FRED
event delivery and the FRED return instructions are FRED transitions.

Search for the latest FRED spec in most search engines with this search pattern:

  site:intel.com FRED (flexible return and event delivery) specification

As of now there is no publicly available CPU supporting FRED, so the Intel
Simics® Simulator is used as the software development and testing vehicle.
It can be downloaded from:
  https://www.intel.com/content/www/us/en/developer/articles/tool/simics-simulator.html

To enable FRED, the Simics package 8112 QSP-CPU needs to be installed with CPU
model configured as:
	$cpu_comp_class = "x86-experimental-fred"


Changes since v9:
* Set unused sysvec table entries to fred_handle_spurious_interrupt()
  in fred_complete_exception_setup() (Thomas Gleixner).
* Shove the whole thing into arch/x86/entry/entry_64_fred.S for invoking
  external_interrupt() and fred_exc_nmi() (Sean Christopherson).
* Correct and improve a few comments (Sean Christopherson).
* Merge the two IRQ/NMI asm entries into one as it's fine to invoke
  noinstr code from regular code (Thomas Gleixner).
* Setup the long mode and NMI flags in the augmented SS field of FRED
  stack frame in C instead of asm (Thomas Gleixner).
* Don't use jump tables, indirect jumps are expensive (Thomas Gleixner).
* Except #NMI/#DB/#MCE, FRED really can share the exception handlers
  with IDT (Thomas Gleixner).
* Avoid the sysvec_* idt_entry muck, do it at a central place, reuse code
  instead of blindly copying it, which breaks the performance optimized
  sysvec entries like reschedule_ipi (Thomas Gleixner).
* Add asm_ prefix to FRED asm entry points (Thomas Gleixner).
* Disable #DB to avoid endless recursion and stack overflow when a
  watchpoint/breakpoint is set in the code path which is executed by
  #DB handler (Thomas Gleixner).
* Introduce a new structure fred_ss to denote the FRED flags above SS
  selector, which avoids FRED_SSX_ macros and makes the code simpler
  and easier to read (Thomas Gleixner).
* Use type u64 to define FRED bit fields instead of type unsigned int
  (Thomas Gleixner).
* Avoid a type cast by defining X86_CR4_FRED as 0 on 32-bit (Thomas
  Gleixner).
* Add the WRMSRNS instruction support (Thomas Gleixner).

Changes since v8:
* Move the FRED initialization patch after all required changes are in
  place (Thomas Gleixner).
* Don't do syscall early out in fred_entry_from_user() before there are
  proper performance numbers and justifications (Thomas Gleixner).
* Add the control exception handler to the FRED exception handler table
  (Thomas Gleixner).
* Introduce a macro sysvec_install() to derive the asm handler name from
  a C handler, which simplifies the code and avoids an ugly typecast
  (Thomas Gleixner).
* Remove junk code that assumes no local APIC on x86_64 (Thomas Gleixner).
* Put IDTENTRY changes in a separate patch (Thomas Gleixner).
* Use high-order 48 bits above the lowest 16 bit SS only when FRED is
  enabled (Thomas Gleixner).
* Explain why writing directly to the IA32_KERNEL_GS_BASE MSR is
  doing the right thing (Thomas Gleixner).
* Reword some patch descriptions (Thomas Gleixner).
* Add a new macro VMX_DO_FRED_EVENT_IRQOFF for FRED instead of
  refactoring VMX_DO_EVENT_IRQOFF (Sean Christopherson).
* Do NOT use a trampoline, just LEA+PUSH the return RIP, PUSH the error
  code, and jump to the FRED kernel entry point for NMI or call
  external_interrupt() for IRQs (Sean Christopherson).
* Call external_interrupt() only when FRED is enabled, and convert the
  non-FRED handling to external_interrupt() after FRED lands (Sean
  Christopherson).
* Use __packed instead of __attribute__((__packed__)) (Borislav Petkov).
* Put all comments above the members, like the rest of the file does
  (Borislav Petkov).
* Reflect the FRED spec 5.0 change that ERETS and ERETU add 8 to %rsp
  before popping the return context from the stack.
* Reflect stack frame definition changes from FRED spec 3.0 to 5.0.
* Add ENDBR to the FRED_ENTER asm macro after kernel IBT is added to
  FRED base line in FRED spec 5.0.
* Add a document which briefly introduces FRED features.
* Remove 2 patches, "allow FRED systems to use interrupt vectors
  0x10-0x1f" and "allow dynamic stack frame size", from this patch set,
  as they are "optimizations" only.
* Send 2 patches, "header file for event types" and "do not modify the
  DPL bits for a null selector", as pre-FRED patches.

Changes since v7:
* Always call external_interrupt() for VMX IRQ handling on x86_64, thus
  avoiding re-entering the noinstr code.
* Create a FRED stack frame when FRED is compiled-in but not enabled, which
  uses some extra stack space but simplifies the code.
* Add a log message when FRED is enabled.

Changes since v6:
* Add a comment to explain why it is safe to write to a previous FRED stack
  frame. (Lai Jiangshan).
* Export fred_entrypoint_kernel(), required when kvm-intel built as a module.
* Reserve a REDZONE for CALL emulation and align RSP to a 64-byte boundary
  before pushing a new FRED stack frame.
* Replace pt_regs csx flags prefix FRED_CSL_ with FRED_CSX_.

Changes since v5:
* Initialize system_interrupt_handlers with dispatch_table_spurious_interrupt()
  instead of NULL to get rid of a branch (Peter Zijlstra).
* Disallow #DB inside #MCE for robustness' sake (Peter Zijlstra).
* Add a comment for FRED stack level settings (Lai Jiangshan).
* Move the NMI bit from an invalid stack frame, which caused ERETU to fault,
  to the fault handler's stack frame, thus to unblock NMI ASAP if NMI is blocked
  (Lai Jiangshan).
* Refactor VMX_DO_EVENT_IRQOFF to handle IRQ/NMI in IRQ/NMI induced VM exits
  when FRED is enabled (Sean Christopherson).

Changes since v4:
* Do NOT use the term "injection", which in the KVM context means to
  reinject an event into the guest (Sean Christopherson).
* Add the explanation of why to execute "int $2" to invoke the NMI handler
  in NMI caused VM exits (Sean Christopherson).
* Use cs/ss instead of csx/ssx when initializing the pt_regs structure
  for calling external_interrupt(), otherwise it breaks i386 build.

Changes since v3:
* Call external_interrupt() to handle IRQ in IRQ caused VM exits.
* Execute "int $2" to handle NMI in NMI caused VM exits.
* Rename csl/ssl of the pt_regs structure to csx/ssx (x for extended)
  (Andrew Cooper).

Changes since v2:
* Improve comments for changes in arch/x86/include/asm/idtentry.h.

Changes since v1:
* Call irqentry_nmi_{enter,exit}() in both the IDT and FRED debug fault
  kernel handlers (Peter Zijlstra).
* Initialize a FRED exception handler to fred_bad_event() instead of NULL
  if no FRED handler defined for an exception vector (Peter Zijlstra).
* Push calling irqentry_{enter,exit}() and instrumentation_{begin,end}()
  down into individual FRED exception handlers, instead of in the dispatch
  framework (Peter Zijlstra).


H. Peter Anvin (Intel) (22):
  x86/fred: Add Kconfig option for FRED (CONFIG_X86_FRED)
  x86/cpufeatures: Add the cpu feature bit for FRED
  x86/fred: Disable FRED support if CONFIG_X86_FRED is disabled
  x86/opcode: Add ERET[US] instructions to the x86 opcode map
  x86/objtool: Teach objtool about ERET[US]
  x86/cpu: Add X86_CR4_FRED macro
  x86/cpu: Add MSR numbers for FRED configuration
  x86/fred: Add a new header file for FRED definitions
  x86/fred: Reserve space for the FRED stack frame
  x86/fred: Update MSR_IA32_FRED_RSP0 during task switch
  x86/fred: Disallow the swapgs instruction when FRED is enabled
  x86/fred: No ESPFIX needed when FRED is enabled
  x86/fred: Allow single-step trap and NMI when starting a new task
  x86/fred: Make exc_page_fault() work for FRED
  x86/fred: Add a debug fault entry stub for FRED
  x86/fred: Add a NMI entry stub for FRED
  x86/fred: FRED entry/exit and dispatch code
  x86/traps: Add sysvec_install() to install a system interrupt handler
  x86/fred: Let ret_from_fork_asm() jmp to asm_fred_exit_user when FRED
    is enabled
  x86/fred: Add fred_syscall_init()
  x86/fred: Add FRED initialization functions
  x86/fred: Invoke FRED initialization code to enable FRED

Peter Zijlstra (Intel) (1):
  x86/entry/calling: Allow PUSH_AND_CLEAR_REGS being used beyond actual
    entry code

Xin Li (15):
  x86/cpufeatures: Add the cpu feature bit for WRMSRNS
  x86/opcode: Add the WRMSRNS instruction to the x86 opcode map
  x86/msr: Add the WRMSRNS instruction support
  x86/entry: Remove idtentry_sysvec from entry_{32,64}.S
  x86/trapnr: Add event type macros to <asm/trapnr.h>
  Documentation/x86/64: Add a documentation for FRED
  x86/fred: Disable FRED by default in its early stage
  x86/ptrace: Cleanup the definition of the pt_regs structure
  x86/ptrace: Add FRED additional information to the pt_regs structure
  x86/idtentry: Incorporate definitions/declarations of the FRED entries
  x86/fred: Add a machine check entry stub for FRED
  x86/fred: Fixup fault on ERETU by jumping to fred_entrypoint_user
  x86/entry: Add fred_entry_from_kvm() for VMX to handle IRQ/NMI
  KVM: VMX: Call fred_entry_from_kvm() for IRQ/NMI handling
  x86/syscall: Split IDT syscall setup code into idt_syscall_init()

 .../admin-guide/kernel-parameters.txt         |   3 +
 Documentation/arch/x86/x86_64/fred.rst        |  98 ++++++
 Documentation/arch/x86/x86_64/index.rst       |   1 +
 arch/x86/Kconfig                              |   9 +
 arch/x86/entry/Makefile                       |   5 +-
 arch/x86/entry/calling.h                      |  15 +-
 arch/x86/entry/entry_32.S                     |   4 -
 arch/x86/entry/entry_64.S                     |  14 +-
 arch/x86/entry/entry_64_fred.S                | 129 ++++++++
 arch/x86/entry/entry_fred.c                   | 279 ++++++++++++++++++
 arch/x86/entry/vsyscall/vsyscall_64.c         |   2 +-
 arch/x86/include/asm/asm-prototypes.h         |   1 +
 arch/x86/include/asm/cpufeatures.h            |   2 +
 arch/x86/include/asm/desc.h                   |   2 -
 arch/x86/include/asm/disabled-features.h      |   8 +-
 arch/x86/include/asm/extable_fixup_types.h    |   4 +-
 arch/x86/include/asm/fred.h                   |  97 ++++++
 arch/x86/include/asm/idtentry.h               |  88 +++++-
 arch/x86/include/asm/msr-index.h              |  13 +-
 arch/x86/include/asm/msr.h                    |  18 ++
 arch/x86/include/asm/ptrace.h                 |  85 +++++-
 arch/x86/include/asm/switch_to.h              |   8 +-
 arch/x86/include/asm/thread_info.h            |  12 +-
 arch/x86/include/asm/trapnr.h                 |  12 +
 arch/x86/include/asm/vmx.h                    |  17 +-
 arch/x86/include/uapi/asm/processor-flags.h   |   7 +
 arch/x86/kernel/Makefile                      |   1 +
 arch/x86/kernel/cpu/acrn.c                    |   4 +-
 arch/x86/kernel/cpu/common.c                  |  53 +++-
 arch/x86/kernel/cpu/cpuid-deps.c              |   2 +
 arch/x86/kernel/cpu/mce/core.c                |  26 ++
 arch/x86/kernel/cpu/mshyperv.c                |  15 +-
 arch/x86/kernel/espfix_64.c                   |   8 +
 arch/x86/kernel/fred.c                        |  59 ++++
 arch/x86/kernel/idt.c                         |   4 +-
 arch/x86/kernel/irqinit.c                     |   7 +-
 arch/x86/kernel/kvm.c                         |   2 +-
 arch/x86/kernel/nmi.c                         |  28 ++
 arch/x86/kernel/process_64.c                  |  67 ++++-
 arch/x86/kernel/traps.c                       |  48 ++-
 arch/x86/kvm/vmx/vmx.c                        |  12 +-
 arch/x86/lib/x86-opcode-map.txt               |   4 +-
 arch/x86/mm/extable.c                         |  79 +++++
 arch/x86/mm/fault.c                           |   5 +-
 drivers/xen/events/events_base.c              |   2 +-
 tools/arch/x86/include/asm/cpufeatures.h      |   2 +
 .../arch/x86/include/asm/disabled-features.h  |   8 +-
 tools/arch/x86/include/asm/msr-index.h        |  13 +-
 tools/arch/x86/lib/x86-opcode-map.txt         |   4 +-
 tools/objtool/arch/x86/decode.c               |  19 +-
 50 files changed, 1291 insertions(+), 114 deletions(-)
 create mode 100644 Documentation/arch/x86/x86_64/fred.rst
 create mode 100644 arch/x86/entry/entry_64_fred.S
 create mode 100644 arch/x86/entry/entry_fred.c
 create mode 100644 arch/x86/include/asm/fred.h
 create mode 100644 arch/x86/kernel/fred.c

-- 
2.34.1



* [PATCH v10 01/38] x86/cpufeatures: Add the cpu feature bit for WRMSRNS
  2023-09-14  4:47 [PATCH v10 00/38] x86: enable FRED for x86-64 Xin Li
@ 2023-09-14  4:47 ` Xin Li
  2023-09-14  4:47 ` [PATCH v10 02/38] x86/opcode: Add the WRMSRNS instruction to the x86 opcode map Xin Li
                   ` (36 subsequent siblings)
  37 siblings, 0 replies; 88+ messages in thread
From: Xin Li @ 2023-09-14  4:47 UTC (permalink / raw)
  To: linux-doc, linux-kernel, linux-edac, linux-hyperv, kvm, xen-devel
  Cc: tglx, mingo, bp, dave.hansen, x86, hpa, luto, pbonzini, seanjc,
	peterz, jgross, ravi.v.shankar, mhiramat, andrew.cooper3,
	jiangshanlai

WRMSRNS is an instruction that behaves exactly like WRMSR, with
the only difference being that it is not a serializing instruction
by default. Under certain conditions, WRMSRNS may replace WRMSR to
improve performance.

Add the CPU feature bit for WRMSRNS.
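
For readers unfamiliar with the encoding, the sketch below decodes the
(12*32+19) value used in this patch into a capability word index and a bit
position within that word. Only the X86_FEATURE_WRMSRNS value comes from the
patch; feature_word(), feature_bit() and feature_present() are hypothetical
helper names for illustration and are not the kernel's own accessors.

```c
#include <assert.h>
#include <stdint.h>

/* Value taken from the patch: capability word 12, bit 19. */
#define X86_FEATURE_WRMSRNS	(12*32+19)

/* Hypothetical helper: index of the 32-bit capability word. */
static inline unsigned int feature_word(unsigned int feature)
{
	return feature / 32;
}

/* Hypothetical helper: bit position inside that word. */
static inline unsigned int feature_bit(unsigned int feature)
{
	return feature % 32;
}

/*
 * Hypothetical helper: test a feature in a caller-supplied array of
 * 32-bit capability words. The array layout is an assumption of this
 * sketch.
 */
static inline int feature_present(const uint32_t *caps, unsigned int feature)
{
	assert(caps != 0);
	return (caps[feature_word(feature)] >> feature_bit(feature)) & 1;
}
```

With this layout, a capability array whose word 12 has bit 19 set would
report the feature as present.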

Tested-by: Shan Kang <shan.kang@intel.com>
Signed-off-by: Xin Li <xin3.li@intel.com>
---
 arch/x86/include/asm/cpufeatures.h       | 1 +
 tools/arch/x86/include/asm/cpufeatures.h | 1 +
 2 files changed, 2 insertions(+)

diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index 58cb9495e40f..330876d34b68 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -322,6 +322,7 @@
 #define X86_FEATURE_FSRS		(12*32+11) /* "" Fast short REP STOSB */
 #define X86_FEATURE_FSRC		(12*32+12) /* "" Fast short REP {CMPSB,SCASB} */
 #define X86_FEATURE_LKGS		(12*32+18) /* "" Load "kernel" (userspace) GS */
+#define X86_FEATURE_WRMSRNS		(12*32+19) /* "" Non-Serializing Write to Model Specific Register instruction */
 #define X86_FEATURE_AMX_FP16		(12*32+21) /* "" AMX fp16 Support */
 #define X86_FEATURE_AVX_IFMA            (12*32+23) /* "" Support for VPMADD52[H,L]UQ */
 #define X86_FEATURE_LAM			(12*32+26) /* Linear Address Masking */
diff --git a/tools/arch/x86/include/asm/cpufeatures.h b/tools/arch/x86/include/asm/cpufeatures.h
index 798e60b5454b..1b9d86ba5bc2 100644
--- a/tools/arch/x86/include/asm/cpufeatures.h
+++ b/tools/arch/x86/include/asm/cpufeatures.h
@@ -318,6 +318,7 @@
 #define X86_FEATURE_FSRS		(12*32+11) /* "" Fast short REP STOSB */
 #define X86_FEATURE_FSRC		(12*32+12) /* "" Fast short REP {CMPSB,SCASB} */
 #define X86_FEATURE_LKGS		(12*32+18) /* "" Load "kernel" (userspace) GS */
+#define X86_FEATURE_WRMSRNS		(12*32+19) /* "" Non-Serializing Write to Model Specific Register instruction */
 #define X86_FEATURE_AMX_FP16		(12*32+21) /* "" AMX fp16 Support */
 #define X86_FEATURE_AVX_IFMA            (12*32+23) /* "" Support for VPMADD52[H,L]UQ */
 #define X86_FEATURE_LAM			(12*32+26) /* Linear Address Masking */
-- 
2.34.1



* [PATCH v10 02/38] x86/opcode: Add the WRMSRNS instruction to the x86 opcode map
  2023-09-14  4:47 [PATCH v10 00/38] x86: enable FRED for x86-64 Xin Li
  2023-09-14  4:47 ` [PATCH v10 01/38] x86/cpufeatures: Add the cpu feature bit for WRMSRNS Xin Li
@ 2023-09-14  4:47 ` Xin Li
  2023-09-15  5:47   ` Masami Hiramatsu
  2023-09-14  4:47 ` [PATCH v10 03/38] x86/msr: Add the WRMSRNS instruction support Xin Li
                   ` (35 subsequent siblings)
  37 siblings, 1 reply; 88+ messages in thread
From: Xin Li @ 2023-09-14  4:47 UTC (permalink / raw)
  To: linux-doc, linux-kernel, linux-edac, linux-hyperv, kvm, xen-devel
  Cc: tglx, mingo, bp, dave.hansen, x86, hpa, luto, pbonzini, seanjc,
	peterz, jgross, ravi.v.shankar, mhiramat, andrew.cooper3,
	jiangshanlai

Add the opcode used by WRMSRNS, which is the non-serializing version of
WRMSR and may replace it to improve performance, to the x86 opcode map.

Tested-by: Shan Kang <shan.kang@intel.com>
Signed-off-by: Xin Li <xin3.li@intel.com>
---
 arch/x86/lib/x86-opcode-map.txt       | 2 +-
 tools/arch/x86/lib/x86-opcode-map.txt | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/lib/x86-opcode-map.txt b/arch/x86/lib/x86-opcode-map.txt
index 5168ee0360b2..1efe1d9bf5ce 100644
--- a/arch/x86/lib/x86-opcode-map.txt
+++ b/arch/x86/lib/x86-opcode-map.txt
@@ -1051,7 +1051,7 @@ GrpTable: Grp6
 EndTable
 
 GrpTable: Grp7
-0: SGDT Ms | VMCALL (001),(11B) | VMLAUNCH (010),(11B) | VMRESUME (011),(11B) | VMXOFF (100),(11B) | PCONFIG (101),(11B) | ENCLV (000),(11B)
+0: SGDT Ms | VMCALL (001),(11B) | VMLAUNCH (010),(11B) | VMRESUME (011),(11B) | VMXOFF (100),(11B) | PCONFIG (101),(11B) | ENCLV (000),(11B) | WRMSRNS (110),(11B)
 1: SIDT Ms | MONITOR (000),(11B) | MWAIT (001),(11B) | CLAC (010),(11B) | STAC (011),(11B) | ENCLS (111),(11B)
 2: LGDT Ms | XGETBV (000),(11B) | XSETBV (001),(11B) | VMFUNC (100),(11B) | XEND (101)(11B) | XTEST (110)(11B) | ENCLU (111),(11B)
 3: LIDT Ms
diff --git a/tools/arch/x86/lib/x86-opcode-map.txt b/tools/arch/x86/lib/x86-opcode-map.txt
index 5168ee0360b2..1efe1d9bf5ce 100644
--- a/tools/arch/x86/lib/x86-opcode-map.txt
+++ b/tools/arch/x86/lib/x86-opcode-map.txt
@@ -1051,7 +1051,7 @@ GrpTable: Grp6
 EndTable
 
 GrpTable: Grp7
-0: SGDT Ms | VMCALL (001),(11B) | VMLAUNCH (010),(11B) | VMRESUME (011),(11B) | VMXOFF (100),(11B) | PCONFIG (101),(11B) | ENCLV (000),(11B)
+0: SGDT Ms | VMCALL (001),(11B) | VMLAUNCH (010),(11B) | VMRESUME (011),(11B) | VMXOFF (100),(11B) | PCONFIG (101),(11B) | ENCLV (000),(11B) | WRMSRNS (110),(11B)
 1: SIDT Ms | MONITOR (000),(11B) | MWAIT (001),(11B) | CLAC (010),(11B) | STAC (011),(11B) | ENCLS (111),(11B)
 2: LGDT Ms | XGETBV (000),(11B) | XSETBV (001),(11B) | VMFUNC (100),(11B) | XEND (101)(11B) | XTEST (110)(11B) | ENCLU (111),(11B)
 3: LIDT Ms
-- 
2.34.1



* [PATCH v10 03/38] x86/msr: Add the WRMSRNS instruction support
  2023-09-14  4:47 [PATCH v10 00/38] x86: enable FRED for x86-64 Xin Li
  2023-09-14  4:47 ` [PATCH v10 01/38] x86/cpufeatures: Add the cpu feature bit for WRMSRNS Xin Li
  2023-09-14  4:47 ` [PATCH v10 02/38] x86/opcode: Add the WRMSRNS instruction to the x86 opcode map Xin Li
@ 2023-09-14  4:47 ` Xin Li
  2023-09-14  6:02   ` Juergen Gross
                     ` (2 more replies)
  2023-09-14  4:47 ` [PATCH v10 04/38] x86/entry: Remove idtentry_sysvec from entry_{32,64}.S Xin Li
                   ` (34 subsequent siblings)
  37 siblings, 3 replies; 88+ messages in thread
From: Xin Li @ 2023-09-14  4:47 UTC (permalink / raw)
  To: linux-doc, linux-kernel, linux-edac, linux-hyperv, kvm, xen-devel
  Cc: tglx, mingo, bp, dave.hansen, x86, hpa, luto, pbonzini, seanjc,
	peterz, jgross, ravi.v.shankar, mhiramat, andrew.cooper3,
	jiangshanlai

Add an always inline API __wrmsrns() to embed the WRMSRNS instruction
into the code.
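
As an illustration of the calling convention, the user-space sketch below
shows how a 64-bit MSR value is split into the low and high 32-bit halves
that the instruction takes in EAX and EDX, which is what the wrmsrns()
wrapper in this patch does implicitly when it passes val and val >> 32 to
u32 parameters. struct msr_halves, split_msr_value() and
split_msr_value_selfcheck() are hypothetical names used only here; the
sketch does not execute WRMSRNS.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical container for the two register halves (EAX, EDX). */
struct msr_halves {
	uint32_t low;	/* bits 31:0, loaded into EAX */
	uint32_t high;	/* bits 63:32, loaded into EDX */
};

/* Hypothetical helper: split a 64-bit MSR value with explicit casts. */
static inline struct msr_halves split_msr_value(uint64_t val)
{
	struct msr_halves h = {
		.low  = (uint32_t)val,
		.high = (uint32_t)(val >> 32),
	};
	return h;
}

/* Usage example: the halves recombine to the original value. */
static inline void split_msr_value_selfcheck(void)
{
	uint64_t val = 0x1122334455667788ULL;
	struct msr_halves h = split_msr_value(val);

	assert((((uint64_t)h.high << 32) | h.low) == val);
}
```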

Tested-by: Shan Kang <shan.kang@intel.com>
Signed-off-by: Xin Li <xin3.li@intel.com>
---
 arch/x86/include/asm/msr.h | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

diff --git a/arch/x86/include/asm/msr.h b/arch/x86/include/asm/msr.h
index 65ec1965cd28..c284ff9ebe67 100644
--- a/arch/x86/include/asm/msr.h
+++ b/arch/x86/include/asm/msr.h
@@ -97,6 +97,19 @@ static __always_inline void __wrmsr(unsigned int msr, u32 low, u32 high)
 		     : : "c" (msr), "a"(low), "d" (high) : "memory");
 }
 
+/*
+ * WRMSRNS behaves exactly like WRMSR with the only difference being
+ * that it is not a serializing instruction by default.
+ */
+static __always_inline void __wrmsrns(u32 msr, u32 low, u32 high)
+{
+	/* Instruction opcode for WRMSRNS; supported in binutils >= 2.40. */
+	asm volatile("1: .byte 0x0f,0x01,0xc6\n"
+		     "2:\n"
+		     _ASM_EXTABLE_TYPE(1b, 2b, EX_TYPE_WRMSR)
+		     : : "c" (msr), "a"(low), "d" (high));
+}
+
 #define native_rdmsr(msr, val1, val2)			\
 do {							\
 	u64 __val = __rdmsr((msr));			\
@@ -297,6 +310,11 @@ do {							\
 
 #endif	/* !CONFIG_PARAVIRT_XXL */
 
+static __always_inline void wrmsrns(u32 msr, u64 val)
+{
+	__wrmsrns(msr, val, val >> 32);
+}
+
 /*
  * 64-bit version of wrmsr_safe():
  */
-- 
2.34.1



* [PATCH v10 04/38] x86/entry: Remove idtentry_sysvec from entry_{32,64}.S
  2023-09-14  4:47 [PATCH v10 00/38] x86: enable FRED for x86-64 Xin Li
                   ` (2 preceding siblings ...)
  2023-09-14  4:47 ` [PATCH v10 03/38] x86/msr: Add the WRMSRNS instruction support Xin Li
@ 2023-09-14  4:47 ` Xin Li
  2023-09-14  4:47 ` [PATCH v10 05/38] x86/trapnr: Add event type macros to <asm/trapnr.h> Xin Li
                   ` (33 subsequent siblings)
  37 siblings, 0 replies; 88+ messages in thread
From: Xin Li @ 2023-09-14  4:47 UTC (permalink / raw)
  To: linux-doc, linux-kernel, linux-edac, linux-hyperv, kvm, xen-devel
  Cc: tglx, mingo, bp, dave.hansen, x86, hpa, luto, pbonzini, seanjc,
	peterz, jgross, ravi.v.shankar, mhiramat, andrew.cooper3,
	jiangshanlai

idtentry_sysvec is really just DECLARE_IDTENTRY defined in
<asm/idtentry.h>, so there is no need to define it separately.

Tested-by: Shan Kang <shan.kang@intel.com>
Signed-off-by: Xin Li <xin3.li@intel.com>
---
 arch/x86/entry/entry_32.S       | 4 ----
 arch/x86/entry/entry_64.S       | 8 --------
 arch/x86/include/asm/idtentry.h | 2 +-
 3 files changed, 1 insertion(+), 13 deletions(-)

diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S
index 6e6af42e044a..e0f22ad8ff7e 100644
--- a/arch/x86/entry/entry_32.S
+++ b/arch/x86/entry/entry_32.S
@@ -649,10 +649,6 @@ SYM_CODE_START_LOCAL(asm_\cfunc)
 SYM_CODE_END(asm_\cfunc)
 .endm
 
-.macro idtentry_sysvec vector cfunc
-	idtentry \vector asm_\cfunc \cfunc has_error_code=0
-.endm
-
 /*
  * Include the defines which emit the idt entries which are shared
  * shared between 32 and 64 bit and emit the __irqentry_text_* markers
diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 43606de22511..9cdb61ea91de 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -432,14 +432,6 @@ SYM_CODE_END(\asmsym)
 	idtentry \vector asm_\cfunc \cfunc has_error_code=1
 .endm
 
-/*
- * System vectors which invoke their handlers directly and are not
- * going through the regular common device interrupt handling code.
- */
-.macro idtentry_sysvec vector cfunc
-	idtentry \vector asm_\cfunc \cfunc has_error_code=0
-.endm
-
 /**
  * idtentry_mce_db - Macro to generate entry stubs for #MC and #DB
  * @vector:		Vector number
diff --git a/arch/x86/include/asm/idtentry.h b/arch/x86/include/asm/idtentry.h
index 05fd175cec7d..cfca68f6cb84 100644
--- a/arch/x86/include/asm/idtentry.h
+++ b/arch/x86/include/asm/idtentry.h
@@ -447,7 +447,7 @@ __visible noinstr void func(struct pt_regs *regs,			\
 
 /* System vector entries */
 #define DECLARE_IDTENTRY_SYSVEC(vector, func)				\
-	idtentry_sysvec vector func
+	DECLARE_IDTENTRY(vector, func)
 
 #ifdef CONFIG_X86_64
 # define DECLARE_IDTENTRY_MCE(vector, func)				\
-- 
2.34.1



* [PATCH v10 05/38] x86/trapnr: Add event type macros to <asm/trapnr.h>
  2023-09-14  4:47 [PATCH v10 00/38] x86: enable FRED for x86-64 Xin Li
                   ` (3 preceding siblings ...)
  2023-09-14  4:47 ` [PATCH v10 04/38] x86/entry: Remove idtentry_sysvec from entry_{32,64}.S Xin Li
@ 2023-09-14  4:47 ` Xin Li
  2023-09-14 14:22   ` andrew.cooper3
  2023-09-14  4:47 ` [PATCH v10 06/38] Documentation/x86/64: Add a documentation for FRED Xin Li
                   ` (32 subsequent siblings)
  37 siblings, 1 reply; 88+ messages in thread
From: Xin Li @ 2023-09-14  4:47 UTC (permalink / raw)
  To: linux-doc, linux-kernel, linux-edac, linux-hyperv, kvm, xen-devel
  Cc: tglx, mingo, bp, dave.hansen, x86, hpa, luto, pbonzini, seanjc,
	peterz, jgross, ravi.v.shankar, mhiramat, andrew.cooper3,
	jiangshanlai

Intel VT-x classifies events into eight different types, and FRED
inherits this classification for event identification. As such, event
type becomes a common x86 concept and should be defined in a common
x86 header.

Add event type macros to <asm/trapnr.h>, and use them in <asm/vmx.h>.
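
To make the relationship concrete, the sketch below shows how a VMX
interruption-type value is derived from a common event type by shifting it
into bits 10:8, and how the event type can be recovered from an
interruption-information word. The EVENT_TYPE_* values and the << 8 shift
are taken from this patch; SKETCH_INTR_TYPE_SHIFT, SKETCH_INTR_TYPE_MASK,
intr_info_event_type() and the selfcheck function are hypothetical names for
illustration, and the sample word 0x80000202 is an assumed example value
(valid bit set, type NMI, vector 2).

```c
#include <assert.h>
#include <stdint.h>

/* Event type codes, as defined by this patch in <asm/trapnr.h>. */
#define EVENT_TYPE_EXTINT	0
#define EVENT_TYPE_NMI		2
#define EVENT_TYPE_HWEXC	3

/* VMX encoding from this patch: the event type sits at bits 10:8. */
#define INTR_TYPE_NMI_INTR	(EVENT_TYPE_NMI << 8)

/* Hypothetical names for the field position, for this sketch only. */
#define SKETCH_INTR_TYPE_SHIFT	8
#define SKETCH_INTR_TYPE_MASK	(7u << SKETCH_INTR_TYPE_SHIFT)

/* Hypothetical helper: extract the common event type from an info word. */
static inline unsigned int intr_info_event_type(uint32_t intr_info)
{
	return (intr_info & SKETCH_INTR_TYPE_MASK) >> SKETCH_INTR_TYPE_SHIFT;
}

/* Usage example with an assumed sample word: valid, NMI, vector 2. */
static inline void intr_info_event_type_selfcheck(void)
{
	uint32_t intr_info = 0x80000202u;

	assert(intr_info_event_type(intr_info) == EVENT_TYPE_NMI);
}
```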

Suggested-by: H. Peter Anvin (Intel) <hpa@zytor.com>
Tested-by: Shan Kang <shan.kang@intel.com>
Signed-off-by: Xin Li <xin3.li@intel.com>
---
 arch/x86/include/asm/trapnr.h | 12 ++++++++++++
 arch/x86/include/asm/vmx.h    | 17 +++++++++--------
 2 files changed, 21 insertions(+), 8 deletions(-)

diff --git a/arch/x86/include/asm/trapnr.h b/arch/x86/include/asm/trapnr.h
index f5d2325aa0b7..ab7e4c9d666f 100644
--- a/arch/x86/include/asm/trapnr.h
+++ b/arch/x86/include/asm/trapnr.h
@@ -2,6 +2,18 @@
 #ifndef _ASM_X86_TRAPNR_H
 #define _ASM_X86_TRAPNR_H
 
+/*
+ * Event type codes used by both FRED and Intel VT-x
+ */
+#define EVENT_TYPE_EXTINT	0	// External interrupt
+#define EVENT_TYPE_RESERVED	1
+#define EVENT_TYPE_NMI		2	// NMI
+#define EVENT_TYPE_HWEXC	3	// Hardware originated traps, exceptions
+#define EVENT_TYPE_SWINT	4	// INT n
+#define EVENT_TYPE_PRIV_SWEXC	5	// INT1
+#define EVENT_TYPE_SWEXC	6	// INT0, INT3
+#define EVENT_TYPE_OTHER	7	// FRED SYSCALL/SYSENTER, VT-x MTF
+
 /* Interrupts/Exceptions */
 
 #define X86_TRAP_DE		 0	/* Divide-by-zero */
diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
index 0e73616b82f3..c84acfefcd31 100644
--- a/arch/x86/include/asm/vmx.h
+++ b/arch/x86/include/asm/vmx.h
@@ -17,6 +17,7 @@
 #include <linux/types.h>
 
 #include <uapi/asm/vmx.h>
+#include <asm/trapnr.h>
 #include <asm/vmxfeatures.h>
 
 #define VMCS_CONTROL_BIT(x)	BIT(VMX_FEATURE_##x & 0x1f)
@@ -374,14 +375,14 @@ enum vmcs_field {
 #define VECTORING_INFO_DELIVER_CODE_MASK    	INTR_INFO_DELIVER_CODE_MASK
 #define VECTORING_INFO_VALID_MASK       	INTR_INFO_VALID_MASK
 
-#define INTR_TYPE_EXT_INTR              (0 << 8) /* external interrupt */
-#define INTR_TYPE_RESERVED              (1 << 8) /* reserved */
-#define INTR_TYPE_NMI_INTR		(2 << 8) /* NMI */
-#define INTR_TYPE_HARD_EXCEPTION	(3 << 8) /* processor exception */
-#define INTR_TYPE_SOFT_INTR             (4 << 8) /* software interrupt */
-#define INTR_TYPE_PRIV_SW_EXCEPTION	(5 << 8) /* ICE breakpoint - undocumented */
-#define INTR_TYPE_SOFT_EXCEPTION	(6 << 8) /* software exception */
-#define INTR_TYPE_OTHER_EVENT           (7 << 8) /* other event */
+#define INTR_TYPE_EXT_INTR		(EVENT_TYPE_EXTINT << 8)	/* external interrupt */
+#define INTR_TYPE_RESERVED		(EVENT_TYPE_RESERVED << 8)	/* reserved */
+#define INTR_TYPE_NMI_INTR		(EVENT_TYPE_NMI << 8)		/* NMI */
+#define INTR_TYPE_HARD_EXCEPTION	(EVENT_TYPE_HWEXC << 8)		/* processor exception */
+#define INTR_TYPE_SOFT_INTR		(EVENT_TYPE_SWINT << 8)		/* software interrupt */
+#define INTR_TYPE_PRIV_SW_EXCEPTION	(EVENT_TYPE_PRIV_SWEXC << 8)	/* ICE breakpoint - undocumented */
+#define INTR_TYPE_SOFT_EXCEPTION	(EVENT_TYPE_SWEXC << 8)		/* software exception */
+#define INTR_TYPE_OTHER_EVENT		(EVENT_TYPE_OTHER << 8)		/* other event */
 
 /* GUEST_INTERRUPTIBILITY_INFO flags. */
 #define GUEST_INTR_STATE_STI		0x00000001
-- 
2.34.1



* [PATCH v10 06/38] Documentation/x86/64: Add a documentation for FRED
  2023-09-14  4:47 [PATCH v10 00/38] x86: enable FRED for x86-64 Xin Li
                   ` (4 preceding siblings ...)
  2023-09-14  4:47 ` [PATCH v10 05/38] x86/trapnr: Add event type macros to <asm/trapnr.h> Xin Li
@ 2023-09-14  4:47 ` Xin Li
  2023-09-20  9:44   ` Nikolay Borisov
  2023-09-14  4:47 ` [PATCH v10 07/38] x86/fred: Add Kconfig option for FRED (CONFIG_X86_FRED) Xin Li
                   ` (31 subsequent siblings)
  37 siblings, 1 reply; 88+ messages in thread
From: Xin Li @ 2023-09-14  4:47 UTC (permalink / raw)
  To: linux-doc, linux-kernel, linux-edac, linux-hyperv, kvm, xen-devel
  Cc: tglx, mingo, bp, dave.hansen, x86, hpa, luto, pbonzini, seanjc,
	peterz, jgross, ravi.v.shankar, mhiramat, andrew.cooper3,
	jiangshanlai

Briefly introduce FRED and its advantages over IDT event delivery.

Signed-off-by: Xin Li <xin3.li@intel.com>
---
 Documentation/arch/x86/x86_64/fred.rst  | 98 +++++++++++++++++++++++++
 Documentation/arch/x86/x86_64/index.rst |  1 +
 2 files changed, 99 insertions(+)
 create mode 100644 Documentation/arch/x86/x86_64/fred.rst

diff --git a/Documentation/arch/x86/x86_64/fred.rst b/Documentation/arch/x86/x86_64/fred.rst
new file mode 100644
index 000000000000..a4ebb95f92c8
--- /dev/null
+++ b/Documentation/arch/x86/x86_64/fred.rst
@@ -0,0 +1,98 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=========================================
+Flexible Return and Event Delivery (FRED)
+=========================================
+
+Overview
+========
+
+The FRED architecture defines simple new transitions that change
+privilege level (ring transitions). The FRED architecture was
+designed with the following goals:
+
+1) Improve overall performance and response time by replacing event
+   delivery through the interrupt descriptor table (IDT event
+   delivery) and event return by the IRET instruction with lower
+   latency transitions.
+
+2) Improve software robustness by ensuring that event delivery
+   establishes the full supervisor context and that event return
+   establishes the full user context.
+
+The new transitions defined by the FRED architecture are FRED event
+delivery and, for returning from events, two FRED return instructions.
+FRED event delivery can effect a transition from ring 3 to ring 0, but
+it is used also to deliver events incident to ring 0. One FRED
+instruction (ERETU) effects a return from ring 0 to ring 3, while the
+other (ERETS) returns while remaining in ring 0. Collectively, FRED
+event delivery and the FRED return instructions are FRED transitions.
+
+In addition to these transitions, the FRED architecture defines a new
+instruction (LKGS) for managing the state of the GS segment register.
+The LKGS instruction can be used by 64-bit operating systems that do
+not use the new FRED transitions.
+
+Furthermore, the FRED architecture is easy to extend for future CPU
+architectures.
+
+Software based event dispatching
+================================
+
+FRED operates differently from IDT in terms of event handling. Instead
+of directly dispatching an event to its handler based on the event
+vector, FRED requires the software to dispatch an event to its handler
+based on both the event's type and vector. Therefore, an event dispatch
+framework must be implemented to facilitate the event-to-handler
+dispatch process. The FRED event dispatch framework takes control
+once an event is delivered, and employs a two-level dispatch.
+
+The first level dispatching is event type based, and the second level
+dispatching is event vector based.
+
+Full supervisor/user context
+============================
+
+FRED event delivery and return atomically save and restore the full
+supervisor/user context. This avoids the problem of transient states
+due to %cr2 and/or %dr6, and removes the need to handle all the ugly
+corner cases caused by half-baked entry states.
+
+FRED allows explicit unblocking of NMI with the new event return
+instructions ERETS/ERETU, avoiding the mess caused by IRET, which always
+unblocks NMI, e.g., when an exception happens during NMI handling.
+
+FRED always restores the full value of %rsp, thus ESPFIX is no longer
+needed when FRED is enabled.
+
+LKGS
+====
+
+LKGS behaves like the MOV to GS instruction except that it loads the
+base address into the IA32_KERNEL_GS_BASE MSR instead of the GS
+segment’s descriptor cache. LKGS thus avoids mucking with the kernel
+GS, i.e., an operating system can always operate with its own GS base
+address.
+
+Because FRED event delivery from ring 3 and ERETU both swap the value
+of the GS base address and that of the IA32_KERNEL_GS_BASE MSR, and
+because the LKGS instruction is available, the SWAPGS instruction is
+no longer needed when FRED is enabled and is therefore disallowed
+(#UD).
+
+Stack levels
+============
+
+Four stack levels, 0 to 3, are introduced to replace the nonreentrant
+IST for event handling, and each stack level should be configured to
+use a dedicated stack.
+
+The current stack level either stays unchanged or goes higher upon
+FRED event delivery. If unchanged, the CPU keeps using the current
+event stack. If higher, the CPU switches to a new event stack specified
+by the MSR of the new stack level, i.e., MSR_IA32_FRED_RSP{1,2,3}.
+
+Only execution of a FRED return instruction, ERET{U,S}, can lower
+the current stack level, causing the CPU to switch back to the stack
+it was on before a previous event delivery that promoted the stack
+level.
diff --git a/Documentation/arch/x86/x86_64/index.rst b/Documentation/arch/x86/x86_64/index.rst
index a56070fc8e77..ad15e9bd623f 100644
--- a/Documentation/arch/x86/x86_64/index.rst
+++ b/Documentation/arch/x86/x86_64/index.rst
@@ -15,3 +15,4 @@ x86_64 Support
    cpu-hotplug-spec
    machinecheck
    fsgs
+   fred
-- 
2.34.1



* [PATCH v10 07/38] x86/fred: Add Kconfig option for FRED (CONFIG_X86_FRED)
  2023-09-14  4:47 [PATCH v10 00/38] x86: enable FRED for x86-64 Xin Li
                   ` (5 preceding siblings ...)
  2023-09-14  4:47 ` [PATCH v10 06/38] Documentation/x86/64: Add a documentation for FRED Xin Li
@ 2023-09-14  4:47 ` Xin Li
  2023-09-14  4:47 ` [PATCH v10 08/38] x86/cpufeatures: Add the cpu feature bit for FRED Xin Li
                   ` (30 subsequent siblings)
  37 siblings, 0 replies; 88+ messages in thread
From: Xin Li @ 2023-09-14  4:47 UTC (permalink / raw)
  To: linux-doc, linux-kernel, linux-edac, linux-hyperv, kvm, xen-devel
  Cc: tglx, mingo, bp, dave.hansen, x86, hpa, luto, pbonzini, seanjc,
	peterz, jgross, ravi.v.shankar, mhiramat, andrew.cooper3,
	jiangshanlai

From: "H. Peter Anvin (Intel)" <hpa@zytor.com>

Add the configuration option CONFIG_X86_FRED to enable FRED.

Signed-off-by: H. Peter Anvin (Intel) <hpa@zytor.com>
Tested-by: Shan Kang <shan.kang@intel.com>
Signed-off-by: Xin Li <xin3.li@intel.com>
---
 arch/x86/Kconfig | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 3b3594f96330..cae126417427 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -496,6 +496,15 @@ config X86_CPU_RESCTRL
 
 	  Say N if unsure.
 
+config X86_FRED
+	bool "Flexible Return and Event Delivery"
+	depends on X86_64
+	help
+	  When enabled, try to use Flexible Return and Event Delivery
+	  instead of the legacy SYSCALL/SYSENTER/IDT architecture for
+	  ring transitions and exception/interrupt handling if the
+	  system supports it.
+
 if X86_32
 config X86_BIGSMP
 	bool "Support for big SMP systems with more than 8 CPUs"
-- 
2.34.1



* [PATCH v10 08/38] x86/cpufeatures: Add the cpu feature bit for FRED
  2023-09-14  4:47 [PATCH v10 00/38] x86: enable FRED for x86-64 Xin Li
                   ` (6 preceding siblings ...)
  2023-09-14  4:47 ` [PATCH v10 07/38] x86/fred: Add Kconfig option for FRED (CONFIG_X86_FRED) Xin Li
@ 2023-09-14  4:47 ` Xin Li
  2023-09-14  6:03   ` Juergen Gross
  2023-09-14  4:47 ` [PATCH v10 09/38] x86/fred: Disable FRED support if CONFIG_X86_FRED is disabled Xin Li
                   ` (29 subsequent siblings)
  37 siblings, 1 reply; 88+ messages in thread
From: Xin Li @ 2023-09-14  4:47 UTC (permalink / raw)
  To: linux-doc, linux-kernel, linux-edac, linux-hyperv, kvm, xen-devel
  Cc: tglx, mingo, bp, dave.hansen, x86, hpa, luto, pbonzini, seanjc,
	peterz, jgross, ravi.v.shankar, mhiramat, andrew.cooper3,
	jiangshanlai

From: "H. Peter Anvin (Intel)" <hpa@zytor.com>

Any FRED CPU will always have the following features as its baseline:
  1) LKGS, which loads the attributes of the GS segment as MOV to GS
     does, but loads the base address into the IA32_KERNEL_GS_BASE MSR
     instead of the GS segment’s descriptor cache.
  2) WRMSRNS, non-serializing WRMSR for faster MSR writes.

Add the FRED CPU feature bit and record LKGS and WRMSRNS as its
dependencies.

Signed-off-by: H. Peter Anvin (Intel) <hpa@zytor.com>
Tested-by: Shan Kang <shan.kang@intel.com>
Signed-off-by: Xin Li <xin3.li@intel.com>
---
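
Note for readers: the two cpuid_deps[] entries added below express that
FRED depends on LKGS and on WRMSRNS, so clearing either prerequisite is
expected to clear FRED as well. The following is only a minimal userspace
sketch of that propagation. The enum, the table and clear_feature() are
hypothetical names for illustration and are not the kernel's code.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical feature IDs for this sketch; not the kernel's encoding. */
enum feature { FEAT_LKGS, FEAT_WRMSRNS, FEAT_FRED, FEAT_COUNT };

struct dep {
	enum feature feature;	/* the dependent feature */
	enum feature depends;	/* the prerequisite it needs */
};

/* Mirrors the two entries added by this patch: FRED needs LKGS and WRMSRNS. */
static const struct dep deps[] = {
	{ FEAT_FRED, FEAT_LKGS },
	{ FEAT_FRED, FEAT_WRMSRNS },
};

/* Clear one feature, then keep clearing dependents until nothing changes. */
static void clear_feature(bool enabled[FEAT_COUNT], enum feature f)
{
	bool changed = true;

	enabled[f] = false;
	while (changed) {
		changed = false;
		for (size_t i = 0; i < sizeof(deps) / sizeof(deps[0]); i++) {
			if (!enabled[deps[i].depends] && enabled[deps[i].feature]) {
				enabled[deps[i].feature] = false;
				changed = true;
			}
		}
	}
}
```

In this sketch, clearing FEAT_LKGS also clears FEAT_FRED and leaves
FEAT_WRMSRNS untouched.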
 arch/x86/include/asm/cpufeatures.h       | 1 +
 arch/x86/kernel/cpu/cpuid-deps.c         | 2 ++
 tools/arch/x86/include/asm/cpufeatures.h | 1 +
 3 files changed, 4 insertions(+)

diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index 330876d34b68..57ae93dc1e52 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -321,6 +321,7 @@
 #define X86_FEATURE_FZRM		(12*32+10) /* "" Fast zero-length REP MOVSB */
 #define X86_FEATURE_FSRS		(12*32+11) /* "" Fast short REP STOSB */
 #define X86_FEATURE_FSRC		(12*32+12) /* "" Fast short REP {CMPSB,SCASB} */
+#define X86_FEATURE_FRED		(12*32+17) /* Flexible Return and Event Delivery */
 #define X86_FEATURE_LKGS		(12*32+18) /* "" Load "kernel" (userspace) GS */
 #define X86_FEATURE_WRMSRNS		(12*32+19) /* "" Non-Serializing Write to Model Specific Register instruction */
 #define X86_FEATURE_AMX_FP16		(12*32+21) /* "" AMX fp16 Support */
diff --git a/arch/x86/kernel/cpu/cpuid-deps.c b/arch/x86/kernel/cpu/cpuid-deps.c
index e462c1d3800a..b7174209d855 100644
--- a/arch/x86/kernel/cpu/cpuid-deps.c
+++ b/arch/x86/kernel/cpu/cpuid-deps.c
@@ -82,6 +82,8 @@ static const struct cpuid_dep cpuid_deps[] = {
 	{ X86_FEATURE_XFD,			X86_FEATURE_XGETBV1   },
 	{ X86_FEATURE_AMX_TILE,			X86_FEATURE_XFD       },
 	{ X86_FEATURE_SHSTK,			X86_FEATURE_XSAVES    },
+	{ X86_FEATURE_FRED,			X86_FEATURE_LKGS      },
+	{ X86_FEATURE_FRED,			X86_FEATURE_WRMSRNS   },
 	{}
 };
 
diff --git a/tools/arch/x86/include/asm/cpufeatures.h b/tools/arch/x86/include/asm/cpufeatures.h
index 1b9d86ba5bc2..18bab7987d7f 100644
--- a/tools/arch/x86/include/asm/cpufeatures.h
+++ b/tools/arch/x86/include/asm/cpufeatures.h
@@ -317,6 +317,7 @@
 #define X86_FEATURE_FZRM		(12*32+10) /* "" Fast zero-length REP MOVSB */
 #define X86_FEATURE_FSRS		(12*32+11) /* "" Fast short REP STOSB */
 #define X86_FEATURE_FSRC		(12*32+12) /* "" Fast short REP {CMPSB,SCASB} */
+#define X86_FEATURE_FRED		(12*32+17) /* Flexible Return and Event Delivery */
 #define X86_FEATURE_LKGS		(12*32+18) /* "" Load "kernel" (userspace) GS */
 #define X86_FEATURE_WRMSRNS		(12*32+19) /* "" Non-Serializing Write to Model Specific Register instruction */
 #define X86_FEATURE_AMX_FP16		(12*32+21) /* "" AMX fp16 Support */
-- 
2.34.1



* [PATCH v10 09/38] x86/fred: Disable FRED support if CONFIG_X86_FRED is disabled
  2023-09-14  4:47 [PATCH v10 00/38] x86: enable FRED for x86-64 Xin Li
                   ` (7 preceding siblings ...)
  2023-09-14  4:47 ` [PATCH v10 08/38] x86/cpufeatures: Add the cpu feature bit for FRED Xin Li
@ 2023-09-14  4:47 ` Xin Li
  2023-09-20 10:19   ` Nikolay Borisov
  2023-09-14  4:47 ` [PATCH v10 10/38] x86/fred: Disable FRED by default in its early stage Xin Li
                   ` (28 subsequent siblings)
  37 siblings, 1 reply; 88+ messages in thread
From: Xin Li @ 2023-09-14  4:47 UTC (permalink / raw)
  To: linux-doc, linux-kernel, linux-edac, linux-hyperv, kvm, xen-devel
  Cc: tglx, mingo, bp, dave.hansen, x86, hpa, luto, pbonzini, seanjc,
	peterz, jgross, ravi.v.shankar, mhiramat, andrew.cooper3,
	jiangshanlai

From: "H. Peter Anvin (Intel)" <hpa@zytor.com>

Add CONFIG_X86_FRED to <asm/disabled-features.h> to make
cpu_feature_enabled() work correctly with FRED.

Originally-by: Megha Dey <megha.dey@intel.com>
Signed-off-by: H. Peter Anvin (Intel) <hpa@zytor.com>
Tested-by: Shan Kang <shan.kang@intel.com>
Signed-off-by: Xin Li <xin3.li@intel.com>
---
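
Note for readers: a DISABLE_* mask only has meaning within the 32-bit
capability word that holds the feature. The sketch below recomputes the
word index and bit mask for X86_FEATURE_FRED from the value defined in
patch 08. feature_word() and feature_mask() are hypothetical helper names
used only for this illustration.

```c
#include <stdint.h>

#define X86_FEATURE_FRED	(12*32+17)	/* value from the patch 08 hunk */

/* Index of the 32-bit capability word that holds the feature bit. */
static unsigned int feature_word(unsigned int feature)
{
	return feature >> 5;
}

/* Single-bit mask of the feature within its capability word. */
static uint32_t feature_mask(unsigned int feature)
{
	return UINT32_C(1) << (feature & 31);
}
```

For X86_FEATURE_FRED this yields word 12 and bit 17.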
 arch/x86/include/asm/disabled-features.h       | 8 +++++++-
 tools/arch/x86/include/asm/disabled-features.h | 8 +++++++-
 2 files changed, 14 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/disabled-features.h b/arch/x86/include/asm/disabled-features.h
index 702d93fdd10e..3cde57cb5093 100644
--- a/arch/x86/include/asm/disabled-features.h
+++ b/arch/x86/include/asm/disabled-features.h
@@ -117,6 +117,12 @@
 #define DISABLE_IBT	(1 << (X86_FEATURE_IBT & 31))
 #endif
 
+#ifdef CONFIG_X86_FRED
+# define DISABLE_FRED	0
+#else
+# define DISABLE_FRED	(1 << (X86_FEATURE_FRED & 31))
+#endif
+
 /*
  * Make sure to add features to the correct mask
  */
@@ -134,7 +140,7 @@
 #define DISABLED_MASK11	(DISABLE_RETPOLINE|DISABLE_RETHUNK|DISABLE_UNRET| \
 			 DISABLE_CALL_DEPTH_TRACKING|DISABLE_USER_SHSTK)
-#define DISABLED_MASK12	(DISABLE_LAM)
+#define DISABLED_MASK12	(DISABLE_FRED|DISABLE_LAM)
 #define DISABLED_MASK13	0
 #define DISABLED_MASK14	0
 #define DISABLED_MASK15	0
 #define DISABLED_MASK16	(DISABLE_PKU|DISABLE_OSPKE|DISABLE_LA57|DISABLE_UMIP| \
diff --git a/tools/arch/x86/include/asm/disabled-features.h b/tools/arch/x86/include/asm/disabled-features.h
index fafe9be7a6f4..d540ecdd8812 100644
--- a/tools/arch/x86/include/asm/disabled-features.h
+++ b/tools/arch/x86/include/asm/disabled-features.h
@@ -105,6 +105,12 @@
 # define DISABLE_TDX_GUEST	(1 << (X86_FEATURE_TDX_GUEST & 31))
 #endif
 
+#ifdef CONFIG_X86_FRED
+# define DISABLE_FRED	0
+#else
+# define DISABLE_FRED	(1 << (X86_FEATURE_FRED & 31))
+#endif
+
 /*
  * Make sure to add features to the correct mask
  */
@@ -122,7 +128,7 @@
 #define DISABLED_MASK11	(DISABLE_RETPOLINE|DISABLE_RETHUNK|DISABLE_UNRET| \
 			 DISABLE_CALL_DEPTH_TRACKING)
-#define DISABLED_MASK12	(DISABLE_LAM)
+#define DISABLED_MASK12	(DISABLE_FRED|DISABLE_LAM)
 #define DISABLED_MASK13	0
 #define DISABLED_MASK14	0
 #define DISABLED_MASK15	0
 #define DISABLED_MASK16	(DISABLE_PKU|DISABLE_OSPKE|DISABLE_LA57|DISABLE_UMIP| \
-- 
2.34.1



* [PATCH v10 10/38] x86/fred: Disable FRED by default in its early stage
  2023-09-14  4:47 [PATCH v10 00/38] x86: enable FRED for x86-64 Xin Li
                   ` (8 preceding siblings ...)
  2023-09-14  4:47 ` [PATCH v10 09/38] x86/fred: Disable FRED support if CONFIG_X86_FRED is disabled Xin Li
@ 2023-09-14  4:47 ` Xin Li
  2023-09-14  4:47 ` [PATCH v10 11/38] x86/opcode: Add ERET[US] instructions to the x86 opcode map Xin Li
                   ` (27 subsequent siblings)
  37 siblings, 0 replies; 88+ messages in thread
From: Xin Li @ 2023-09-14  4:47 UTC (permalink / raw)
  To: linux-doc, linux-kernel, linux-edac, linux-hyperv, kvm, xen-devel
  Cc: tglx, mingo, bp, dave.hansen, x86, hpa, luto, pbonzini, seanjc,
	peterz, jgross, ravi.v.shankar, mhiramat, andrew.cooper3,
	jiangshanlai

Disable FRED by default while it is in its early stage. To enable FRED,
the new kernel command line option "fred" must be specified.

Tested-by: Shan Kang <shan.kang@intel.com>
Signed-off-by: Xin Li <xin3.li@intel.com>
---
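
Note for readers: the hunk in common.c clears X86_FEATURE_FRED unless the
boolean option "fred" is found on the command line. Below is a rough
userspace sketch of that opt-in logic. has_bool_option() and
fred_enabled() are hypothetical stand-ins, assuming whole-word,
space-delimited matching, and are not the kernel's
cmdline_find_option_bool() implementation.

```c
#include <stdbool.h>
#include <string.h>

/*
 * Hypothetical helper: true if `option` appears in `cmdline` as a whole,
 * space-delimited word.
 */
static bool has_bool_option(const char *cmdline, const char *option)
{
	size_t len = strlen(option);
	const char *p = cmdline;

	if (len == 0)
		return false;

	while ((p = strstr(p, option)) != NULL) {
		bool starts = (p == cmdline) || (p[-1] == ' ');
		bool ends = (p[len] == '\0') || (p[len] == ' ');

		if (starts && ends)
			return true;
		p++;	/* keep searching after this partial match */
	}
	return false;
}

/* FRED stays usable only if the CPU reports it and "fred" was passed. */
static bool fred_enabled(bool cpu_has_fred, const char *cmdline)
{
	return cpu_has_fred && has_bool_option(cmdline, "fred");
}
```

With this sketch, "ro quiet fred" opts in, while "ro nofred quiet" and
"fred=1" do not.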
 Documentation/admin-guide/kernel-parameters.txt | 3 +++
 arch/x86/kernel/cpu/common.c                    | 3 +++
 2 files changed, 6 insertions(+)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 0a1731a0f0ef..42def5cc7552 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -1525,6 +1525,9 @@
 			Warning: use of this parameter will taint the kernel
 			and may cause unknown problems.
 
+	fred		[X86-64]
+			Enable flexible return and event delivery
+
 	ftrace=[tracer]
 			[FTRACE] will set and start the specified tracer
 			as early as possible in order to facilitate early
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 382d4e6b848d..317b4877e9c7 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -1486,6 +1486,9 @@ static void __init cpu_parse_early_param(void)
 	char *argptr = arg, *opt;
 	int arglen, taint = 0;
 
+	if (!cmdline_find_option_bool(boot_command_line, "fred"))
+		setup_clear_cpu_cap(X86_FEATURE_FRED);
+
 #ifdef CONFIG_X86_32
 	if (cmdline_find_option_bool(boot_command_line, "no387"))
 #ifdef CONFIG_MATH_EMULATION
-- 
2.34.1



* [PATCH v10 11/38] x86/opcode: Add ERET[US] instructions to the x86 opcode map
  2023-09-14  4:47 [PATCH v10 00/38] x86: enable FRED for x86-64 Xin Li
                   ` (9 preceding siblings ...)
  2023-09-14  4:47 ` [PATCH v10 10/38] x86/fred: Disable FRED by default in its early stage Xin Li
@ 2023-09-14  4:47 ` Xin Li
  2023-09-14  4:47 ` [PATCH v10 12/38] x86/objtool: Teach objtool about ERET[US] Xin Li
                   ` (26 subsequent siblings)
  37 siblings, 0 replies; 88+ messages in thread
From: Xin Li @ 2023-09-14  4:47 UTC (permalink / raw)
  To: linux-doc, linux-kernel, linux-edac, linux-hyperv, kvm, xen-devel
  Cc: tglx, mingo, bp, dave.hansen, x86, hpa, luto, pbonzini, seanjc,
	peterz, jgross, ravi.v.shankar, mhiramat, andrew.cooper3,
	jiangshanlai

From: "H. Peter Anvin (Intel)" <hpa@zytor.com>

ERETU returns from an event handler while making a transition to ring 3,
and ERETS returns from an event handler while staying in ring 0.

Add instruction opcodes used by ERET[US] to the x86 opcode map; opcode
numbers are per FRED spec v5.0.

Signed-off-by: H. Peter Anvin (Intel) <hpa@zytor.com>
Tested-by: Shan Kang <shan.kang@intel.com>
Signed-off-by: Xin Li <xin3.li@intel.com>
Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
---
 arch/x86/lib/x86-opcode-map.txt       | 2 +-
 tools/arch/x86/lib/x86-opcode-map.txt | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/lib/x86-opcode-map.txt b/arch/x86/lib/x86-opcode-map.txt
index 1efe1d9bf5ce..12af572201a2 100644
--- a/arch/x86/lib/x86-opcode-map.txt
+++ b/arch/x86/lib/x86-opcode-map.txt
@@ -1052,7 +1052,7 @@ EndTable
 
 GrpTable: Grp7
 0: SGDT Ms | VMCALL (001),(11B) | VMLAUNCH (010),(11B) | VMRESUME (011),(11B) | VMXOFF (100),(11B) | PCONFIG (101),(11B) | ENCLV (000),(11B) | WRMSRNS (110),(11B)
-1: SIDT Ms | MONITOR (000),(11B) | MWAIT (001),(11B) | CLAC (010),(11B) | STAC (011),(11B) | ENCLS (111),(11B)
+1: SIDT Ms | MONITOR (000),(11B) | MWAIT (001),(11B) | CLAC (010),(11B) | STAC (011),(11B) | ENCLS (111),(11B) | ERETU (F3),(010),(11B) | ERETS (F2),(010),(11B)
 2: LGDT Ms | XGETBV (000),(11B) | XSETBV (001),(11B) | VMFUNC (100),(11B) | XEND (101)(11B) | XTEST (110)(11B) | ENCLU (111),(11B)
 3: LIDT Ms
 4: SMSW Mw/Rv
diff --git a/tools/arch/x86/lib/x86-opcode-map.txt b/tools/arch/x86/lib/x86-opcode-map.txt
index 1efe1d9bf5ce..12af572201a2 100644
--- a/tools/arch/x86/lib/x86-opcode-map.txt
+++ b/tools/arch/x86/lib/x86-opcode-map.txt
@@ -1052,7 +1052,7 @@ EndTable
 
 GrpTable: Grp7
 0: SGDT Ms | VMCALL (001),(11B) | VMLAUNCH (010),(11B) | VMRESUME (011),(11B) | VMXOFF (100),(11B) | PCONFIG (101),(11B) | ENCLV (000),(11B) | WRMSRNS (110),(11B)
-1: SIDT Ms | MONITOR (000),(11B) | MWAIT (001),(11B) | CLAC (010),(11B) | STAC (011),(11B) | ENCLS (111),(11B)
+1: SIDT Ms | MONITOR (000),(11B) | MWAIT (001),(11B) | CLAC (010),(11B) | STAC (011),(11B) | ENCLS (111),(11B) | ERETU (F3),(010),(11B) | ERETS (F2),(010),(11B)
 2: LGDT Ms | XGETBV (000),(11B) | XSETBV (001),(11B) | VMFUNC (100),(11B) | XEND (101)(11B) | XTEST (110)(11B) | ENCLU (111),(11B)
 3: LIDT Ms
 4: SMSW Mw/Rv
-- 
2.34.1



* [PATCH v10 12/38] x86/objtool: Teach objtool about ERET[US]
  2023-09-14  4:47 [PATCH v10 00/38] x86: enable FRED for x86-64 Xin Li
                   ` (10 preceding siblings ...)
  2023-09-14  4:47 ` [PATCH v10 11/38] x86/opcode: Add ERET[US] instructions to the x86 opcode map Xin Li
@ 2023-09-14  4:47 ` Xin Li
  2023-09-14  4:47 ` [PATCH v10 13/38] x86/cpu: Add X86_CR4_FRED macro Xin Li
                   ` (25 subsequent siblings)
  37 siblings, 0 replies; 88+ messages in thread
From: Xin Li @ 2023-09-14  4:47 UTC (permalink / raw)
  To: linux-doc, linux-kernel, linux-edac, linux-hyperv, kvm, xen-devel
  Cc: tglx, mingo, bp, dave.hansen, x86, hpa, luto, pbonzini, seanjc,
	peterz, jgross, ravi.v.shankar, mhiramat, andrew.cooper3,
	jiangshanlai

From: "H. Peter Anvin (Intel)" <hpa@zytor.com>

Update the objtool decoder to know about the ERET[US] instructions
(type INSN_CONTEXT_SWITCH).

Signed-off-by: H. Peter Anvin (Intel) <hpa@zytor.com>
Tested-by: Shan Kang <shan.kang@intel.com>
Signed-off-by: Xin Li <xin3.li@intel.com>
---
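
Note for readers: ERETU and ERETS share the 0F 01 CA byte sequence with
CLAC and differ only in the mandatory F3/F2 prefix, which is why the
decoder has to look at the last prefix before checking the ModRM byte.
The sketch below is a simplified standalone classifier, assuming at most
one F2/F3 prefix and nothing else in front of the opcode. The enum and
classify() are hypothetical names and not objtool code.

```c
#include <stddef.h>
#include <stdint.h>

enum insn_kind { KIND_OTHER, KIND_CLAC, KIND_STAC, KIND_ERETU, KIND_ERETS };

/*
 * Classify a raw byte sequence around the two-byte opcode 0F 01.
 * Only a single optional F2/F3 prefix is handled here.
 */
static enum insn_kind classify(const uint8_t *b, size_t len)
{
	uint8_t prefix = 0;

	if (len > 0 && (b[0] == 0xf2 || b[0] == 0xf3)) {
		prefix = b[0];
		b++;
		len--;
	}
	if (len < 3 || b[0] != 0x0f || b[1] != 0x01)
		return KIND_OTHER;

	if (b[2] == 0xca) {
		if (prefix == 0xf3)
			return KIND_ERETU;	/* F3 0F 01 CA */
		if (prefix == 0xf2)
			return KIND_ERETS;	/* F2 0F 01 CA */
		return KIND_CLAC;		/* 0F 01 CA */
	}
	if (b[2] == 0xcb && prefix == 0)
		return KIND_STAC;		/* 0F 01 CB */
	return KIND_OTHER;
}
```

objtool itself does not distinguish ERETU from ERETS; both become
INSN_CONTEXT_SWITCH in the hunk below.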
 tools/objtool/arch/x86/decode.c | 19 ++++++++++++++-----
 1 file changed, 14 insertions(+), 5 deletions(-)

diff --git a/tools/objtool/arch/x86/decode.c b/tools/objtool/arch/x86/decode.c
index c0f25d00181e..6999f478c155 100644
--- a/tools/objtool/arch/x86/decode.c
+++ b/tools/objtool/arch/x86/decode.c
@@ -509,11 +509,20 @@ int arch_decode_instruction(struct objtool_file *file, const struct section *sec
 
 		if (op2 == 0x01) {
 
-			if (modrm == 0xca)
-				insn->type = INSN_CLAC;
-			else if (modrm == 0xcb)
-				insn->type = INSN_STAC;
-
+			switch (insn_last_prefix_id(&ins)) {
+			case INAT_PFX_REPE:
+			case INAT_PFX_REPNE:
+				if (modrm == 0xca)
+					/* eretu/erets */
+					insn->type = INSN_CONTEXT_SWITCH;
+				break;
+			default:
+				if (modrm == 0xca)
+					insn->type = INSN_CLAC;
+				else if (modrm == 0xcb)
+					insn->type = INSN_STAC;
+				break;
+			}
 		} else if (op2 >= 0x80 && op2 <= 0x8f) {
 
 			insn->type = INSN_JUMP_CONDITIONAL;
-- 
2.34.1



* [PATCH v10 13/38] x86/cpu: Add X86_CR4_FRED macro
  2023-09-14  4:47 [PATCH v10 00/38] x86: enable FRED for x86-64 Xin Li
                   ` (11 preceding siblings ...)
  2023-09-14  4:47 ` [PATCH v10 12/38] x86/objtool: Teach objtool about ERET[US] Xin Li
@ 2023-09-14  4:47 ` Xin Li
  2023-09-20 10:50   ` Nikolay Borisov
  2023-09-14  4:47 ` [PATCH v10 14/38] x86/cpu: Add MSR numbers for FRED configuration Xin Li
                   ` (24 subsequent siblings)
  37 siblings, 1 reply; 88+ messages in thread
From: Xin Li @ 2023-09-14  4:47 UTC (permalink / raw)
  To: linux-doc, linux-kernel, linux-edac, linux-hyperv, kvm, xen-devel
  Cc: tglx, mingo, bp, dave.hansen, x86, hpa, luto, pbonzini, seanjc,
	peterz, jgross, ravi.v.shankar, mhiramat, andrew.cooper3,
	jiangshanlai

From: "H. Peter Anvin (Intel)" <hpa@zytor.com>

Add X86_CR4_FRED macro for the FRED bit in %cr4. This bit must not be
changed after initialization, so add it to the pinned CR4 bits.

Signed-off-by: H. Peter Anvin (Intel) <hpa@zytor.com>
Tested-by: Shan Kang <shan.kang@intel.com>
Signed-off-by: Xin Li <xin3.li@intel.com>
---

Changes since v9:
* Avoid a type cast by defining X86_CR4_FRED as 0 on 32-bit (Thomas
  Gleixner).
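
Note for readers: X86_CR4_FRED is bit 32, so it only fits in a 64-bit
CR4 value, and adding it to the pinned mask means a later CR4 write
cannot drop it. The sketch below illustrates both points in userspace.
CR4_SMEP and CR4_FRED are local stand-ins for the kernel macros, and
apply_pinning() is a hypothetical helper, not the kernel's CR4 write
path.

```c
#include <stdint.h>

#define CR4_SMEP	(UINT64_C(1) << 20)	/* stand-in for X86_CR4_SMEP */
#define CR4_FRED	(UINT64_C(1) << 32)	/* stand-in for X86_CR4_FRED */

/*
 * Force every pinned bit back to its recorded value, whatever the
 * caller requested for it. Unpinned bits pass through unchanged.
 */
static uint64_t apply_pinning(uint64_t requested, uint64_t pinned_mask,
			      uint64_t pinned_bits)
{
	return (requested & ~pinned_mask) | (pinned_bits & pinned_mask);
}
```

A request that omits CR4_FRED still comes back with it set once the bit
is pinned, and truncating CR4_FRED to 32 bits gives 0, which matches the
32-bit definition of X86_CR4_FRED in this patch.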
---
 arch/x86/include/uapi/asm/processor-flags.h | 7 +++++++
 arch/x86/kernel/cpu/common.c                | 5 ++---
 2 files changed, 9 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/uapi/asm/processor-flags.h b/arch/x86/include/uapi/asm/processor-flags.h
index d898432947ff..f1a4adc78272 100644
--- a/arch/x86/include/uapi/asm/processor-flags.h
+++ b/arch/x86/include/uapi/asm/processor-flags.h
@@ -139,6 +139,13 @@
 #define X86_CR4_LAM_SUP_BIT	28 /* LAM for supervisor pointers */
 #define X86_CR4_LAM_SUP		_BITUL(X86_CR4_LAM_SUP_BIT)
 
+#ifdef __x86_64__
+#define X86_CR4_FRED_BIT	32 /* enable FRED kernel entry */
+#define X86_CR4_FRED		_BITUL(X86_CR4_FRED_BIT)
+#else
+#define X86_CR4_FRED		(0)
+#endif
+
 /*
  * x86-64 Task Priority Register, CR8
  */
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 317b4877e9c7..42511209469b 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -400,9 +400,8 @@ static __always_inline void setup_umip(struct cpuinfo_x86 *c)
 }
 
 /* These bits should not change their value after CPU init is finished. */
-static const unsigned long cr4_pinned_mask =
-	X86_CR4_SMEP | X86_CR4_SMAP | X86_CR4_UMIP |
-	X86_CR4_FSGSBASE | X86_CR4_CET;
+static const unsigned long cr4_pinned_mask = X86_CR4_SMEP | X86_CR4_SMAP | X86_CR4_UMIP |
+					     X86_CR4_FSGSBASE | X86_CR4_CET | X86_CR4_FRED;
 static DEFINE_STATIC_KEY_FALSE_RO(cr_pinning);
 static unsigned long cr4_pinned_bits __ro_after_init;
 
-- 
2.34.1



* [PATCH v10 14/38] x86/cpu: Add MSR numbers for FRED configuration
  2023-09-14  4:47 [PATCH v10 00/38] x86: enable FRED for x86-64 Xin Li
                   ` (12 preceding siblings ...)
  2023-09-14  4:47 ` [PATCH v10 13/38] x86/cpu: Add X86_CR4_FRED macro Xin Li
@ 2023-09-14  4:47 ` Xin Li
  2023-09-14  4:47 ` [PATCH v10 15/38] x86/ptrace: Cleanup the definition of the pt_regs structure Xin Li
                   ` (23 subsequent siblings)
  37 siblings, 0 replies; 88+ messages in thread
From: Xin Li @ 2023-09-14  4:47 UTC (permalink / raw)
  To: linux-doc, linux-kernel, linux-edac, linux-hyperv, kvm, xen-devel
  Cc: tglx, mingo, bp, dave.hansen, x86, hpa, luto, pbonzini, seanjc,
	peterz, jgross, ravi.v.shankar, mhiramat, andrew.cooper3,
	jiangshanlai

From: "H. Peter Anvin (Intel)" <hpa@zytor.com>

Add MSR numbers for the FRED configuration registers per FRED spec 5.0.

Originally-by: Megha Dey <megha.dey@intel.com>
Signed-off-by: H. Peter Anvin (Intel) <hpa@zytor.com>
Tested-by: Shan Kang <shan.kang@intel.com>
Signed-off-by: Xin Li <xin3.li@intel.com>
---
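
Note for readers: the four regular stack pointer MSRs are numbered
consecutively (0x1cc to 0x1cf), so the MSR for a stack level can be
derived from the level. The sketch below shows that arithmetic.
fred_rsp_msr() is a hypothetical helper and is not added by this patch.
The shadow stack MSRs are not covered, because level 0 aliases
MSR_IA32_PL0_SSP rather than continuing the sequence.

```c
#include <assert.h>
#include <stdint.h>

#define MSR_IA32_FRED_RSP0	0x1cc	/* Level 0 stack pointer */
#define MSR_IA32_FRED_RSP3	0x1cf	/* Level 3 stack pointer */

/* Map a FRED stack level (0..3) to its regular stack pointer MSR. */
static uint32_t fred_rsp_msr(unsigned int level)
{
	assert(level <= 3);	/* only four stack levels exist */
	return MSR_IA32_FRED_RSP0 + level;
}
```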
 arch/x86/include/asm/msr-index.h       | 13 ++++++++++++-
 tools/arch/x86/include/asm/msr-index.h | 13 ++++++++++++-
 2 files changed, 24 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index 1d111350197f..972d15404420 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -36,8 +36,19 @@
 #define EFER_FFXSR		(1<<_EFER_FFXSR)
 #define EFER_AUTOIBRS		(1<<_EFER_AUTOIBRS)
 
-/* Intel MSRs. Some also available on other CPUs */
+/* FRED MSRs */
+#define MSR_IA32_FRED_RSP0	0x1cc			/* Level 0 stack pointer */
+#define MSR_IA32_FRED_RSP1	0x1cd			/* Level 1 stack pointer */
+#define MSR_IA32_FRED_RSP2	0x1ce			/* Level 2 stack pointer */
+#define MSR_IA32_FRED_RSP3	0x1cf			/* Level 3 stack pointer */
+#define MSR_IA32_FRED_STKLVLS	0x1d0			/* Exception stack levels */
+#define MSR_IA32_FRED_SSP0	MSR_IA32_PL0_SSP	/* Level 0 shadow stack pointer */
+#define MSR_IA32_FRED_SSP1	0x1d1			/* Level 1 shadow stack pointer */
+#define MSR_IA32_FRED_SSP2	0x1d2			/* Level 2 shadow stack pointer */
+#define MSR_IA32_FRED_SSP3	0x1d3			/* Level 3 shadow stack pointer */
+#define MSR_IA32_FRED_CONFIG	0x1d4			/* Entrypoint and interrupt stack level */
 
+/* Intel MSRs. Some also available on other CPUs */
 #define MSR_TEST_CTRL				0x00000033
 #define MSR_TEST_CTRL_SPLIT_LOCK_DETECT_BIT	29
 #define MSR_TEST_CTRL_SPLIT_LOCK_DETECT		BIT(MSR_TEST_CTRL_SPLIT_LOCK_DETECT_BIT)
diff --git a/tools/arch/x86/include/asm/msr-index.h b/tools/arch/x86/include/asm/msr-index.h
index a00a53e15ab7..fc75e3ca47d9 100644
--- a/tools/arch/x86/include/asm/msr-index.h
+++ b/tools/arch/x86/include/asm/msr-index.h
@@ -36,8 +36,19 @@
 #define EFER_FFXSR		(1<<_EFER_FFXSR)
 #define EFER_AUTOIBRS		(1<<_EFER_AUTOIBRS)
 
-/* Intel MSRs. Some also available on other CPUs */
+/* FRED MSRs */
+#define MSR_IA32_FRED_RSP0	0x1cc			/* Level 0 stack pointer */
+#define MSR_IA32_FRED_RSP1	0x1cd			/* Level 1 stack pointer */
+#define MSR_IA32_FRED_RSP2	0x1ce			/* Level 2 stack pointer */
+#define MSR_IA32_FRED_RSP3	0x1cf			/* Level 3 stack pointer */
+#define MSR_IA32_FRED_STKLVLS	0x1d0			/* Exception stack levels */
+#define MSR_IA32_FRED_SSP0	MSR_IA32_PL0_SSP	/* Level 0 shadow stack pointer */
+#define MSR_IA32_FRED_SSP1	0x1d1			/* Level 1 shadow stack pointer */
+#define MSR_IA32_FRED_SSP2	0x1d2			/* Level 2 shadow stack pointer */
+#define MSR_IA32_FRED_SSP3	0x1d3			/* Level 3 shadow stack pointer */
+#define MSR_IA32_FRED_CONFIG	0x1d4			/* Entrypoint and interrupt stack level */
 
+/* Intel MSRs. Some also available on other CPUs */
 #define MSR_TEST_CTRL				0x00000033
 #define MSR_TEST_CTRL_SPLIT_LOCK_DETECT_BIT	29
 #define MSR_TEST_CTRL_SPLIT_LOCK_DETECT		BIT(MSR_TEST_CTRL_SPLIT_LOCK_DETECT_BIT)
-- 
2.34.1



* [PATCH v10 15/38] x86/ptrace: Cleanup the definition of the pt_regs structure
  2023-09-14  4:47 [PATCH v10 00/38] x86: enable FRED for x86-64 Xin Li
                   ` (13 preceding siblings ...)
  2023-09-14  4:47 ` [PATCH v10 14/38] x86/cpu: Add MSR numbers for FRED configuration Xin Li
@ 2023-09-14  4:47 ` Xin Li
  2023-09-14  4:47 ` [PATCH v10 16/38] x86/ptrace: Add FRED additional information to " Xin Li
                   ` (22 subsequent siblings)
  37 siblings, 0 replies; 88+ messages in thread
From: Xin Li @ 2023-09-14  4:47 UTC (permalink / raw)
  To: linux-doc, linux-kernel, linux-edac, linux-hyperv, kvm, xen-devel
  Cc: tglx, mingo, bp, dave.hansen, x86, hpa, luto, pbonzini, seanjc,
	peterz, jgross, ravi.v.shankar, mhiramat, andrew.cooper3,
	jiangshanlai

struct pt_regs is hard to read because the member or section related
comments are not aligned with the members.

The 'cs' and 'ss' members of pt_regs are of type 'unsigned long' while
in reality they are only 16 bits wide. This works so far as the
remaining space is unused, but FRED will use the remaining bits for
other purposes.

To prepare for FRED:

  - Clean up the formatting
  - Convert 'cs' and 'ss' to u16 and embed them into a union
    with a u64
  - Fix up the related printk() format strings

Signed-off-by: H. Peter Anvin (Intel) <hpa@zytor.com>
Tested-by: Shan Kang <shan.kang@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Xin Li <xin3.li@intel.com>
---
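
Note for readers: the union keeps the 64-bit stack slot while exposing
only the low 16 bits as the selector, which is also why the printk()
format for 'cs' changes from %lx to %x. The sketch below shows the
aliasing in userspace, assuming a little-endian target such as x86-64.
struct frame_slot and selector_of() are hypothetical names for this
illustration.

```c
#include <stdint.h>

/* One 64-bit frame slot, viewed either whole or as a 16-bit selector. */
struct frame_slot {
	union {
		uint64_t csx;	/* the full 64-bit data slot */
		uint16_t cs;	/* selector: low 16 bits on little-endian */
	};
};

/* Return the selector stored in the low 16 bits of a raw slot value. */
static uint16_t selector_of(uint64_t slot)
{
	struct frame_slot f = { .csx = slot };

	return f.cs;
}
```

Bits above the selector do not leak into `cs`, so code that compares or
prints the selector keeps working once FRED stores extra information in
the upper bits.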
 arch/x86/entry/vsyscall/vsyscall_64.c |  2 +-
 arch/x86/include/asm/ptrace.h         | 44 +++++++++++++++++++--------
 arch/x86/kernel/process_64.c          |  2 +-
 3 files changed, 33 insertions(+), 15 deletions(-)

diff --git a/arch/x86/entry/vsyscall/vsyscall_64.c b/arch/x86/entry/vsyscall/vsyscall_64.c
index e0ca8120aea8..a3c0df11d0e6 100644
--- a/arch/x86/entry/vsyscall/vsyscall_64.c
+++ b/arch/x86/entry/vsyscall/vsyscall_64.c
@@ -76,7 +76,7 @@ static void warn_bad_vsyscall(const char *level, struct pt_regs *regs,
 	if (!show_unhandled_signals)
 		return;
 
-	printk_ratelimited("%s%s[%d] %s ip:%lx cs:%lx sp:%lx ax:%lx si:%lx di:%lx\n",
+	printk_ratelimited("%s%s[%d] %s ip:%lx cs:%x sp:%lx ax:%lx si:%lx di:%lx\n",
 			   level, current->comm, task_pid_nr(current),
 			   message, regs->ip, regs->cs,
 			   regs->sp, regs->ax, regs->si, regs->di);
diff --git a/arch/x86/include/asm/ptrace.h b/arch/x86/include/asm/ptrace.h
index f4db78b09c8f..f08ea073edd6 100644
--- a/arch/x86/include/asm/ptrace.h
+++ b/arch/x86/include/asm/ptrace.h
@@ -57,17 +57,19 @@ struct pt_regs {
 #else /* __i386__ */
 
 struct pt_regs {
-/*
- * C ABI says these regs are callee-preserved. They aren't saved on kernel entry
- * unless syscall needs a complete, fully filled "struct pt_regs".
- */
+	/*
+	 * C ABI says these regs are callee-preserved. They aren't saved on
+	 * kernel entry unless syscall needs a complete, fully filled
+	 * "struct pt_regs".
+	 */
 	unsigned long r15;
 	unsigned long r14;
 	unsigned long r13;
 	unsigned long r12;
 	unsigned long bp;
 	unsigned long bx;
-/* These regs are callee-clobbered. Always saved on kernel entry. */
+
+	/* These regs are callee-clobbered. Always saved on kernel entry. */
 	unsigned long r11;
 	unsigned long r10;
 	unsigned long r9;
@@ -77,18 +79,34 @@ struct pt_regs {
 	unsigned long dx;
 	unsigned long si;
 	unsigned long di;
-/*
- * On syscall entry, this is syscall#. On CPU exception, this is error code.
- * On hw interrupt, it's IRQ number:
- */
+
+	/*
+	 * orig_ax is used on entry for:
+	 * - the syscall number (syscall, sysenter, int80)
+	 * - error_code stored by the CPU on traps and exceptions
+	 * - the interrupt number for device interrupts
+	 */
 	unsigned long orig_ax;
-/* Return frame for iretq */
+
+	/* The IRETQ return frame starts here */
 	unsigned long ip;
-	unsigned long cs;
+
+	union {
+		u64	csx;	// The full 64-bit data slot containing CS
+		u16	cs;	// CS selector
+	};
+
 	unsigned long flags;
 	unsigned long sp;
-	unsigned long ss;
-/* top of stack page */
+
+	union {
+		u64	ssx;	// The full 64-bit data slot containing SS
+		u16	ss;	// SS selector
+	};
+
+	/*
+	 * Top of stack on IDT systems.
+	 */
 };
 
 #endif /* !__i386__ */
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index 33b268747bb7..0f78b58021bb 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -117,7 +117,7 @@ void __show_regs(struct pt_regs *regs, enum show_regs_mode mode,
 
 	printk("%sFS:  %016lx(%04x) GS:%016lx(%04x) knlGS:%016lx\n",
 	       log_lvl, fs, fsindex, gs, gsindex, shadowgs);
-	printk("%sCS:  %04lx DS: %04x ES: %04x CR0: %016lx\n",
+	printk("%sCS:  %04x DS: %04x ES: %04x CR0: %016lx\n",
 		log_lvl, regs->cs, ds, es, cr0);
 	printk("%sCR2: %016lx CR3: %016lx CR4: %016lx\n",
 		log_lvl, cr2, cr3, cr4);
-- 
2.34.1



* [PATCH v10 16/38] x86/ptrace: Add FRED additional information to the pt_regs structure
  2023-09-14  4:47 [PATCH v10 00/38] x86: enable FRED for x86-64 Xin Li
                   ` (14 preceding siblings ...)
  2023-09-14  4:47 ` [PATCH v10 15/38] x86/ptrace: Cleanup the definition of the pt_regs structure Xin Li
@ 2023-09-14  4:47 ` Xin Li
  2023-09-20 12:57   ` Nikolay Borisov
  2023-09-14  4:47 ` [PATCH v10 17/38] x86/fred: Add a new header file for FRED definitions Xin Li
                   ` (21 subsequent siblings)
  37 siblings, 1 reply; 88+ messages in thread
From: Xin Li @ 2023-09-14  4:47 UTC (permalink / raw)
  To: linux-doc, linux-kernel, linux-edac, linux-hyperv, kvm, xen-devel
  Cc: tglx, mingo, bp, dave.hansen, x86, hpa, luto, pbonzini, seanjc,
	peterz, jgross, ravi.v.shankar, mhiramat, andrew.cooper3,
	jiangshanlai

FRED defines additional information in the upper 48 bits of the cs/ss
fields. Therefore add the information definitions into the pt_regs
structure.

In particular, introduce a new structure fred_ss to denote the FRED
flags above the SS selector, which avoids FRED_SSX_ macros and makes the
code simpler and easier to read.

Signed-off-by: H. Peter Anvin (Intel) <hpa@zytor.com>
Tested-by: Shan Kang <shan.kang@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Xin Li <xin3.li@intel.com>
---

Changes since v9:
* Introduce a new structure fred_ss to denote the FRED flags above SS
  selector, which avoids FRED_SSX_ macros and makes the code simpler
  and easier to read (Thomas Gleixner).
* Use type u64 to define FRED bit fields instead of type unsigned int
  (Thomas Gleixner).

Changes since v8:
* Reflect stack frame definition changes from FRED spec 3.0 to 5.0.
* Use __packed instead of __attribute__((__packed__)) (Borislav Petkov).
* Put all comments above the members, like the rest of the file does
  (Borislav Petkov).

Changes since v3:
* Rename csl/ssl of the pt_regs structure to csx/ssx (x for extended)
  (Andrew Cooper).
---
 arch/x86/include/asm/ptrace.h | 51 +++++++++++++++++++++++++++++++----
 1 file changed, 46 insertions(+), 5 deletions(-)

diff --git a/arch/x86/include/asm/ptrace.h b/arch/x86/include/asm/ptrace.h
index f08ea073edd6..5786c8ca5f4c 100644
--- a/arch/x86/include/asm/ptrace.h
+++ b/arch/x86/include/asm/ptrace.h
@@ -56,6 +56,25 @@ struct pt_regs {
 
 #else /* __i386__ */
 
+struct fred_ss {
+	u64	ss	: 16,	// SS selector
+		sti	:  1,	// STI state
+		swevent	:  1,	// Set if syscall, sysenter or INT n
+		nmi	:  1,	// Event is NMI type
+			: 13,
+		vector	:  8,	// Event vector
+			:  8,
+		type	:  4,	// Event type
+			:  4,
+		enclave	:  1,	// Event was incident to enclave execution
+		lm	:  1,	// CPU was in long mode
+		nested	:  1,	// Nested exception during FRED delivery
+				// not set for #DF
+			:  1,
+		insnlen	:  4;	// The length of the instruction causing the event
+				// Only set for INT0, INT1, INT3, INT n, SYSCALL
+};				// and SYSENTER. 0 otherwise.
+
 struct pt_regs {
 	/*
 	 * C ABI says these regs are callee-preserved. They aren't saved on
@@ -85,6 +104,12 @@ struct pt_regs {
 	 * - the syscall number (syscall, sysenter, int80)
 	 * - error_code stored by the CPU on traps and exceptions
 	 * - the interrupt number for device interrupts
+	 *
+	 * A FRED stack frame starts here:
+	 *   1) It _always_ includes an error code;
+	 *
+	 *   2) The return frame for ERET[US] starts here, but
+	 *	the content of orig_ax is ignored.
 	 */
 	unsigned long orig_ax;
 
@@ -92,20 +117,36 @@ struct pt_regs {
 	unsigned long ip;
 
 	union {
-		u64	csx;	// The full 64-bit data slot containing CS
-		u16	cs;	// CS selector
+		u64	csx;		// The full data for FRED
+		/*
+		 * The 'cs' member should be defined as a 16-bit bit-field
+		 * along with the 'sl' and 'wfe' members, which however
+		 * breaks compiling REG_OFFSET_NAME(cs), because compilers
+		 * disallow calculating the address of a bit-field.
+		 *
+		 * Therefore 'cs' is defined as an individual member with
+		 * type u16.
+		 */
+		u16	cs;		// CS selector
+		u64		: 16,
+			sl	:  2,	// Stack level at event time
+			wfe	:  1,	// IBT is in WAIT_FOR_BRANCH_STATE
+				: 45;
 	};
 
 	unsigned long flags;
 	unsigned long sp;
 
 	union {
-		u64	ssx;	// The full 64-bit data slot containing SS
-		u16	ss;	// SS selector
+		u64		ssx;		// The full data for FRED
+		u16		ss;		// SS selector
+		struct fred_ss	fred_ss;	// The fred extension
 	};
 
 	/*
-	 * Top of stack on IDT systems.
+	 * Top of stack on IDT systems, while FRED systems have extra fields
+	 * defined above for storing exception related information, e.g. CR2 or
+	 * DR6.
 	 */
 };
 
-- 
2.34.1



* [PATCH v10 17/38] x86/fred: Add a new header file for FRED definitions
  2023-09-14  4:47 [PATCH v10 00/38] x86: enable FRED for x86-64 Xin Li
                   ` (15 preceding siblings ...)
  2023-09-14  4:47 ` [PATCH v10 16/38] x86/ptrace: Add FRED additional information to " Xin Li
@ 2023-09-14  4:47 ` Xin Li
  2023-09-14  4:47 ` [PATCH v10 18/38] x86/fred: Reserve space for the FRED stack frame Xin Li
                   ` (20 subsequent siblings)
  37 siblings, 0 replies; 88+ messages in thread
From: Xin Li @ 2023-09-14  4:47 UTC (permalink / raw)
  To: linux-doc, linux-kernel, linux-edac, linux-hyperv, kvm, xen-devel
  Cc: tglx, mingo, bp, dave.hansen, x86, hpa, luto, pbonzini, seanjc,
	peterz, jgross, ravi.v.shankar, mhiramat, andrew.cooper3,
	jiangshanlai

From: "H. Peter Anvin (Intel)" <hpa@zytor.com>

Add a header file for FRED prototypes and definitions.

Signed-off-by: H. Peter Anvin (Intel) <hpa@zytor.com>
Tested-by: Shan Kang <shan.kang@intel.com>
Signed-off-by: Xin Li <xin3.li@intel.com>
---

Changes since v6:
* Replace pt_regs csx flags prefix FRED_CSL_ with FRED_CSX_.
---
 arch/x86/include/asm/fred.h | 68 +++++++++++++++++++++++++++++++++++++
 1 file changed, 68 insertions(+)
 create mode 100644 arch/x86/include/asm/fred.h

diff --git a/arch/x86/include/asm/fred.h b/arch/x86/include/asm/fred.h
new file mode 100644
index 000000000000..f514fdb5a39f
--- /dev/null
+++ b/arch/x86/include/asm/fred.h
@@ -0,0 +1,68 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Macros for Flexible Return and Event Delivery (FRED)
+ */
+
+#ifndef ASM_X86_FRED_H
+#define ASM_X86_FRED_H
+
+#include <linux/const.h>
+
+#include <asm/asm.h>
+
+/*
+ * FRED event return instruction opcodes for ERET{S,U}; supported in
+ * binutils >= 2.41.
+ */
+#define ERETS			_ASM_BYTES(0xf2,0x0f,0x01,0xca)
+#define ERETU			_ASM_BYTES(0xf3,0x0f,0x01,0xca)
+
+/*
+ * RSP is aligned to a 64-byte boundary before being used to push a new stack frame
+ */
+#define FRED_STACK_FRAME_RSP_MASK	_AT(unsigned long, (~0x3f))
+
+/*
+ * Used for the return address for call emulation during code patching,
+ * and measured in 64-byte cache lines.
+ */
+#define FRED_CONFIG_REDZONE_AMOUNT	1
+#define FRED_CONFIG_REDZONE		(_AT(unsigned long, FRED_CONFIG_REDZONE_AMOUNT) << 6)
+#define FRED_CONFIG_INT_STKLVL(l)	(_AT(unsigned long, l) << 9)
+#define FRED_CONFIG_ENTRYPOINT(p)	_AT(unsigned long, (p))
+
+#ifndef __ASSEMBLY__
+
+#ifdef CONFIG_X86_FRED
+#include <linux/kernel.h>
+
+#include <asm/ptrace.h>
+
+struct fred_info {
+	/* Event data: CR2, DR6, ... */
+	unsigned long edata;
+	unsigned long resv;
+};
+
+/* Full format of the FRED stack frame */
+struct fred_frame {
+	struct pt_regs   regs;
+	struct fred_info info;
+};
+
+static __always_inline struct fred_info *fred_info(struct pt_regs *regs)
+{
+	return &container_of(regs, struct fred_frame, regs)->info;
+}
+
+static __always_inline unsigned long fred_event_data(struct pt_regs *regs)
+{
+	return fred_info(regs)->edata;
+}
+
+#else /* CONFIG_X86_FRED */
+static __always_inline unsigned long fred_event_data(struct pt_regs *regs) { return 0; }
+#endif /* CONFIG_X86_FRED */
+#endif /* !__ASSEMBLY__ */
+
+#endif /* ASM_X86_FRED_H */
-- 
2.34.1



* [PATCH v10 18/38] x86/fred: Reserve space for the FRED stack frame
  2023-09-14  4:47 [PATCH v10 00/38] x86: enable FRED for x86-64 Xin Li
                   ` (16 preceding siblings ...)
  2023-09-14  4:47 ` [PATCH v10 17/38] x86/fred: Add a new header file for FRED definitions Xin Li
@ 2023-09-14  4:47 ` Xin Li
  2023-09-14  4:47 ` [PATCH v10 19/38] x86/fred: Update MSR_IA32_FRED_RSP0 during task switch Xin Li
                   ` (19 subsequent siblings)
  37 siblings, 0 replies; 88+ messages in thread
From: Xin Li @ 2023-09-14  4:47 UTC (permalink / raw)
  To: linux-doc, linux-kernel, linux-edac, linux-hyperv, kvm, xen-devel
  Cc: tglx, mingo, bp, dave.hansen, x86, hpa, luto, pbonzini, seanjc,
	peterz, jgross, ravi.v.shankar, mhiramat, andrew.cooper3,
	jiangshanlai

From: "H. Peter Anvin (Intel)" <hpa@zytor.com>

When using FRED, reserve space at the top of the stack frame, just
like i386 does.
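
To make the reserved space concrete, here is a minimal user-space
sketch of the frame layout, assuming the fred_info/fred_frame
definitions from the previous patch. `pt_regs_model` and `event_data`
are hypothetical stand-ins (the slot count matches the 21 registers of
the x86-64 pt_regs, but treat it as illustrative); the point is only
that the two extra words sit directly above pt_regs and are reached
with container_of-style arithmetic.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct pt_regs: 21 eight-byte slots. */
struct pt_regs_model {
	uint64_t slots[21];
};

struct fred_info {
	uint64_t edata;		/* event data: CR2, DR6, ... */
	uint64_t resv;
};

/* Full frame: pt_regs followed by the FRED-only words. */
struct fred_frame {
	struct pt_regs_model regs;
	struct fred_info     info;
};

/* The space reserved at the top of the kernel stack in this sketch. */
#define TOP_OF_KERNEL_STACK_PADDING sizeof(struct fred_info)

/* Same idea as fred_info()/fred_event_data(): step from regs to info. */
static inline uint64_t event_data(struct pt_regs_model *regs)
{
	struct fred_frame *frame =
		(struct fred_frame *)((char *)regs -
				      offsetof(struct fred_frame, regs));

	return frame->info.edata;
}
```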

Signed-off-by: H. Peter Anvin (Intel) <hpa@zytor.com>
Tested-by: Shan Kang <shan.kang@intel.com>
Signed-off-by: Xin Li <xin3.li@intel.com>
---
 arch/x86/include/asm/thread_info.h | 12 +++++++++---
 1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h
index d63b02940747..12da7dfd5ef1 100644
--- a/arch/x86/include/asm/thread_info.h
+++ b/arch/x86/include/asm/thread_info.h
@@ -31,7 +31,9 @@
  * In vm86 mode, the hardware frame is much longer still, so add 16
  * bytes to make room for the real-mode segments.
  *
- * x86_64 has a fixed-length stack frame.
+ * x86-64 has a fixed-length stack frame, but it depends on whether
+ * or not FRED is enabled. Future versions of FRED might make this
+ * dynamic, but for now it is always 2 words longer.
  */
 #ifdef CONFIG_X86_32
 # ifdef CONFIG_VM86
@@ -39,8 +41,12 @@
 # else
 #  define TOP_OF_KERNEL_STACK_PADDING 8
 # endif
-#else
-# define TOP_OF_KERNEL_STACK_PADDING 0
+#else /* x86-64 */
+# ifdef CONFIG_X86_FRED
+#  define TOP_OF_KERNEL_STACK_PADDING (2 * 8)
+# else
+#  define TOP_OF_KERNEL_STACK_PADDING 0
+# endif
 #endif
 
 /*
-- 
2.34.1



* [PATCH v10 19/38] x86/fred: Update MSR_IA32_FRED_RSP0 during task switch
  2023-09-14  4:47 [PATCH v10 00/38] x86: enable FRED for x86-64 Xin Li
                   ` (17 preceding siblings ...)
  2023-09-14  4:47 ` [PATCH v10 18/38] x86/fred: Reserve space for the FRED stack frame Xin Li
@ 2023-09-14  4:47 ` Xin Li
  2023-09-14  4:47 ` [PATCH v10 20/38] x86/fred: Disallow the swapgs instruction when FRED is enabled Xin Li
                   ` (18 subsequent siblings)
  37 siblings, 0 replies; 88+ messages in thread
From: Xin Li @ 2023-09-14  4:47 UTC (permalink / raw)
  To: linux-doc, linux-kernel, linux-edac, linux-hyperv, kvm, xen-devel
  Cc: tglx, mingo, bp, dave.hansen, x86, hpa, luto, pbonzini, seanjc,
	peterz, jgross, ravi.v.shankar, mhiramat, andrew.cooper3,
	jiangshanlai

From: "H. Peter Anvin (Intel)" <hpa@zytor.com>

MSR_IA32_FRED_RSP0 is used during ring 3 event delivery, and needs to
be updated to point to the top of the next task's stack during a task
switch.

Signed-off-by: H. Peter Anvin (Intel) <hpa@zytor.com>
Tested-by: Shan Kang <shan.kang@intel.com>
Signed-off-by: Xin Li <xin3.li@intel.com>
---
 arch/x86/include/asm/switch_to.h | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/switch_to.h b/arch/x86/include/asm/switch_to.h
index f42dbf17f52b..c3bd0c0758c9 100644
--- a/arch/x86/include/asm/switch_to.h
+++ b/arch/x86/include/asm/switch_to.h
@@ -70,9 +70,13 @@ static inline void update_task_stack(struct task_struct *task)
 #ifdef CONFIG_X86_32
 	this_cpu_write(cpu_tss_rw.x86_tss.sp1, task->thread.sp0);
 #else
-	/* Xen PV enters the kernel on the thread stack. */
-	if (cpu_feature_enabled(X86_FEATURE_XENPV))
+	if (cpu_feature_enabled(X86_FEATURE_FRED)) {
+		/* WRMSRNS is a baseline feature for FRED. */
+		wrmsrns(MSR_IA32_FRED_RSP0, (unsigned long)task_stack_page(task) + THREAD_SIZE);
+	} else if (cpu_feature_enabled(X86_FEATURE_XENPV)) {
+		/* Xen PV enters the kernel on the thread stack. */
 		load_sp0(task_top_of_stack(task));
+	}
 #endif
 }
 
-- 
2.34.1



* [PATCH v10 20/38] x86/fred: Disallow the swapgs instruction when FRED is enabled
  2023-09-14  4:47 [PATCH v10 00/38] x86: enable FRED for x86-64 Xin Li
                   ` (18 preceding siblings ...)
  2023-09-14  4:47 ` [PATCH v10 19/38] x86/fred: Update MSR_IA32_FRED_RSP0 during task switch Xin Li
@ 2023-09-14  4:47 ` Xin Li
  2023-09-14  4:47 ` [PATCH v10 21/38] x86/fred: No ESPFIX needed " Xin Li
                   ` (17 subsequent siblings)
  37 siblings, 0 replies; 88+ messages in thread
From: Xin Li @ 2023-09-14  4:47 UTC (permalink / raw)
  To: linux-doc, linux-kernel, linux-edac, linux-hyperv, kvm, xen-devel
  Cc: tglx, mingo, bp, dave.hansen, x86, hpa, luto, pbonzini, seanjc,
	peterz, jgross, ravi.v.shankar, mhiramat, andrew.cooper3,
	jiangshanlai

From: "H. Peter Anvin (Intel)" <hpa@zytor.com>

SWAPGS is no longer needed, and thus NOT allowed, with FRED because
FRED transitions ensure that an operating system can _always_ operate
with its own GS base address:
- For events that occur in ring 3, FRED event delivery swaps the GS
  base address with the IA32_KERNEL_GS_BASE MSR.
- ERETU (the FRED transition that returns to ring 3) also swaps the
  GS base address with the IA32_KERNEL_GS_BASE MSR.

And the operating system can still set up the GS segment for a user
thread without the need to load a user thread GS, by:
- Using LKGS, available with FRED, to modify other attributes of the
  GS segment without compromising its ability always to operate with
  its own GS base address.
- Accessing the GS segment base address for a user thread as before
  using RDMSR or WRMSR on the IA32_KERNEL_GS_BASE MSR.

Note, LKGS loads the GS base address into the IA32_KERNEL_GS_BASE MSR
instead of the GS segment’s descriptor cache. As such, the operating
system never changes its runtime GS base address.
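
The swap behaviour described above can be pictured with a toy model.
This is only a sketch under simplifying assumptions: `cpu_model` and
all function names are hypothetical, and `lkgs()` here takes the new
base directly, whereas the real instruction takes a selector. It shows
why the kernel's active GS base is unaffected when a user GS base is
installed.

```c
#include <stdint.h>

/* Toy model of the two GS-related registers; not real hardware state. */
struct cpu_model {
	uint64_t gs_base;		/* active GS base */
	uint64_t kernel_gs_base;	/* IA32_KERNEL_GS_BASE MSR */
};

static inline void swap_gs(struct cpu_model *c)
{
	uint64_t tmp = c->gs_base;

	c->gs_base = c->kernel_gs_base;
	c->kernel_gs_base = tmp;
}

/* FRED event delivery from ring 3 performs the swap in hardware. */
static inline void fred_enter_from_user(struct cpu_model *c)
{
	swap_gs(c);
}

/* ERETU performs the swap again on the way back to ring 3. */
static inline void eretu(struct cpu_model *c)
{
	swap_gs(c);
}

/* LKGS: the new user base lands in the MSR, not in the active GS base. */
static inline void lkgs(struct cpu_model *c, uint64_t user_base)
{
	c->kernel_gs_base = user_base;
}
```

In this model, calling `lkgs()` between `fred_enter_from_user()` and
`eretu()` changes only what user space sees after the return.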

Signed-off-by: H. Peter Anvin (Intel) <hpa@zytor.com>
Tested-by: Shan Kang <shan.kang@intel.com>
Signed-off-by: Xin Li <xin3.li@intel.com>
---

Changes since v8:
* Explain why writing directly to the IA32_KERNEL_GS_BASE MSR is
  doing the right thing (Thomas Gleixner).
---
 arch/x86/kernel/process_64.c | 27 +++++++++++++++++++++++++--
 1 file changed, 25 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index 0f78b58021bb..4f87f5987ae8 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -166,7 +166,29 @@ static noinstr unsigned long __rdgsbase_inactive(void)
 
 	lockdep_assert_irqs_disabled();
 
-	if (!cpu_feature_enabled(X86_FEATURE_XENPV)) {
+	/*
+	 * SWAPGS is no longer needed thus NOT allowed with FRED because
+	 * FRED transitions ensure that an operating system can _always_
+	 * operate with its own GS base address:
+	 * - For events that occur in ring 3, FRED event delivery swaps
+	 *   the GS base address with the IA32_KERNEL_GS_BASE MSR.
+	 * - ERETU (the FRED transition that returns to ring 3) also swaps
+	 *   the GS base address with the IA32_KERNEL_GS_BASE MSR.
+	 *
+	 * And the operating system can still set up the GS segment for a
+	 * user thread without the need of loading a user thread GS with:
+	 * - Using LKGS, available with FRED, to modify other attributes
+	 *   of the GS segment without compromising its ability always to
+	 *   operate with its own GS base address.
+	 * - Accessing the GS segment base address for a user thread as
+	 *   before using RDMSR or WRMSR on the IA32_KERNEL_GS_BASE MSR.
+	 *
+	 * Note, LKGS loads the GS base address into the IA32_KERNEL_GS_BASE
+	 * MSR instead of the GS segment’s descriptor cache. As such, the
+	 * operating system never changes its runtime GS base address.
+	 */
+	if (!cpu_feature_enabled(X86_FEATURE_FRED) &&
+	    !cpu_feature_enabled(X86_FEATURE_XENPV)) {
 		native_swapgs();
 		gsbase = rdgsbase();
 		native_swapgs();
@@ -191,7 +213,8 @@ static noinstr void __wrgsbase_inactive(unsigned long gsbase)
 {
 	lockdep_assert_irqs_disabled();
 
-	if (!cpu_feature_enabled(X86_FEATURE_XENPV)) {
+	if (!cpu_feature_enabled(X86_FEATURE_FRED) &&
+	    !cpu_feature_enabled(X86_FEATURE_XENPV)) {
 		native_swapgs();
 		wrgsbase(gsbase);
 		native_swapgs();
-- 
2.34.1



* [PATCH v10 21/38] x86/fred: No ESPFIX needed when FRED is enabled
  2023-09-14  4:47 [PATCH v10 00/38] x86: enable FRED for x86-64 Xin Li
                   ` (19 preceding siblings ...)
  2023-09-14  4:47 ` [PATCH v10 20/38] x86/fred: Disallow the swapgs instruction when FRED is enabled Xin Li
@ 2023-09-14  4:47 ` Xin Li
  2023-09-14  4:47 ` [PATCH v10 22/38] x86/fred: Allow single-step trap and NMI when starting a new task Xin Li
                   ` (16 subsequent siblings)
  37 siblings, 0 replies; 88+ messages in thread
From: Xin Li @ 2023-09-14  4:47 UTC (permalink / raw)
  To: linux-doc, linux-kernel, linux-edac, linux-hyperv, kvm, xen-devel
  Cc: tglx, mingo, bp, dave.hansen, x86, hpa, luto, pbonzini, seanjc,
	peterz, jgross, ravi.v.shankar, mhiramat, andrew.cooper3,
	jiangshanlai

From: "H. Peter Anvin (Intel)" <hpa@zytor.com>

Because FRED always restores the full value of %rsp, ESPFIX is
no longer needed when it's enabled.

Signed-off-by: H. Peter Anvin (Intel) <hpa@zytor.com>
Tested-by: Shan Kang <shan.kang@intel.com>
Signed-off-by: Xin Li <xin3.li@intel.com>
---
 arch/x86/kernel/espfix_64.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/arch/x86/kernel/espfix_64.c b/arch/x86/kernel/espfix_64.c
index 16f9814c9be0..6726e0473d0b 100644
--- a/arch/x86/kernel/espfix_64.c
+++ b/arch/x86/kernel/espfix_64.c
@@ -106,6 +106,10 @@ void __init init_espfix_bsp(void)
 	pgd_t *pgd;
 	p4d_t *p4d;
 
+	/* FRED systems always restore the full value of %rsp */
+	if (cpu_feature_enabled(X86_FEATURE_FRED))
+		return;
+
 	/* Install the espfix pud into the kernel page directory */
 	pgd = &init_top_pgt[pgd_index(ESPFIX_BASE_ADDR)];
 	p4d = p4d_alloc(&init_mm, pgd, ESPFIX_BASE_ADDR);
@@ -129,6 +133,10 @@ void init_espfix_ap(int cpu)
 	void *stack_page;
 	pteval_t ptemask;
 
+	/* FRED systems always restore the full value of %rsp */
+	if (cpu_feature_enabled(X86_FEATURE_FRED))
+		return;
+
 	/* We only have to do this once... */
 	if (likely(per_cpu(espfix_stack, cpu)))
 		return;		/* Already initialized */
-- 
2.34.1



* [PATCH v10 22/38] x86/fred: Allow single-step trap and NMI when starting a new task
  2023-09-14  4:47 [PATCH v10 00/38] x86: enable FRED for x86-64 Xin Li
                   ` (20 preceding siblings ...)
  2023-09-14  4:47 ` [PATCH v10 21/38] x86/fred: No ESPFIX needed " Xin Li
@ 2023-09-14  4:47 ` Xin Li
  2023-09-14  4:47 ` [PATCH v10 23/38] x86/fred: Make exc_page_fault() work for FRED Xin Li
                   ` (15 subsequent siblings)
  37 siblings, 0 replies; 88+ messages in thread
From: Xin Li @ 2023-09-14  4:47 UTC (permalink / raw)
  To: linux-doc, linux-kernel, linux-edac, linux-hyperv, kvm, xen-devel
  Cc: tglx, mingo, bp, dave.hansen, x86, hpa, luto, pbonzini, seanjc,
	peterz, jgross, ravi.v.shankar, mhiramat, andrew.cooper3,
	jiangshanlai

From: "H. Peter Anvin (Intel)" <hpa@zytor.com>

Entering a new task is, logically speaking, a return from a system call
(exec, fork, clone, etc.). As such, if ptrace enables single stepping,
a single-step exception should be allowed to trigger immediately upon
entering user space. This is not optional.

NMI should *never* be disabled in user space. As such, this is an
optional, opportunistic way to catch errors.

Allow single-step trap and NMI when starting a new task, so that once
the new task enters user space, single-step trap and NMI are both
enabled immediately.

Signed-off-by: H. Peter Anvin (Intel) <hpa@zytor.com>
Tested-by: Shan Kang <shan.kang@intel.com>
Signed-off-by: Xin Li <xin3.li@intel.com>
---

Changes since v8:
* Use high-order 48 bits above the lowest 16 bit SS only when FRED
  is enabled (Thomas Gleixner).
---
 arch/x86/kernel/process_64.c | 38 ++++++++++++++++++++++++++++++------
 1 file changed, 32 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index 4f87f5987ae8..c075591b7b46 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -56,6 +56,7 @@
 #include <asm/resctrl.h>
 #include <asm/unistd.h>
 #include <asm/fsgsbase.h>
+#include <asm/fred.h>
 #ifdef CONFIG_IA32_EMULATION
 /* Not included via unistd.h */
 #include <asm/unistd_32_ia32.h>
@@ -528,7 +529,7 @@ void x86_gsbase_write_task(struct task_struct *task, unsigned long gsbase)
 static void
 start_thread_common(struct pt_regs *regs, unsigned long new_ip,
 		    unsigned long new_sp,
-		    unsigned int _cs, unsigned int _ss, unsigned int _ds)
+		    u16 _cs, u16 _ss, u16 _ds)
 {
 	WARN_ON_ONCE(regs != current_pt_regs());
 
@@ -545,11 +546,36 @@ start_thread_common(struct pt_regs *regs, unsigned long new_ip,
 	loadsegment(ds, _ds);
 	load_gs_index(0);
 
-	regs->ip		= new_ip;
-	regs->sp		= new_sp;
-	regs->cs		= _cs;
-	regs->ss		= _ss;
-	regs->flags		= X86_EFLAGS_IF;
+	regs->ip	= new_ip;
+	regs->sp	= new_sp;
+	regs->csx	= _cs;
+	regs->ssx	= _ss;
+	/*
+	 * Allow single-step trap and NMI when starting a new task, thus
+	 * once the new task enters user space, single-step trap and NMI
+	 * are both enabled immediately.
+	 *
+	 * Entering a new task is logically speaking a return from a
+	 * system call (exec, fork, clone, etc.). As such, if ptrace
+	 * enables single stepping a single step exception should be
+	 * allowed to trigger immediately upon entering user space.
+	 * This is not optional.
+	 *
+	 * NMI should *never* be disabled in user space. As such, this
+	 * is an optional, opportunistic way to catch errors.
+	 *
+	 * Paranoia: High-order 48 bits above the lowest 16 bit SS are
+	 * discarded by the legacy IRET instruction on all Intel, AMD,
+	 * and Cyrix/Centaur/VIA CPUs, thus can be set unconditionally,
+	 * even when FRED is not enabled. But we choose the safer side
+	 * to use these bits only when FRED is enabled.
+	 */
+	if (cpu_feature_enabled(X86_FEATURE_FRED)) {
+		regs->fred_ss.swevent	= true;
+		regs->fred_ss.nmi	= true;
+	}
+
+	regs->flags	= X86_EFLAGS_IF | X86_EFLAGS_FIXED;
 }
 
 void
-- 
2.34.1



* [PATCH v10 23/38] x86/fred: Make exc_page_fault() work for FRED
  2023-09-14  4:47 [PATCH v10 00/38] x86: enable FRED for x86-64 Xin Li
                   ` (21 preceding siblings ...)
  2023-09-14  4:47 ` [PATCH v10 22/38] x86/fred: Allow single-step trap and NMI when starting a new task Xin Li
@ 2023-09-14  4:47 ` Xin Li
  2023-09-14  4:47 ` [PATCH v10 24/38] x86/idtentry: Incorporate definitions/declarations of the FRED entries Xin Li
                   ` (14 subsequent siblings)
  37 siblings, 0 replies; 88+ messages in thread
From: Xin Li @ 2023-09-14  4:47 UTC (permalink / raw)
  To: linux-doc, linux-kernel, linux-edac, linux-hyperv, kvm, xen-devel
  Cc: tglx, mingo, bp, dave.hansen, x86, hpa, luto, pbonzini, seanjc,
	peterz, jgross, ravi.v.shankar, mhiramat, andrew.cooper3,
	jiangshanlai

From: "H. Peter Anvin (Intel)" <hpa@zytor.com>

On a FRED system, the faulting address (CR2) is passed on the stack,
to avoid the problem of transient state. Thus we get the page fault
address from the stack instead of CR2.

Signed-off-by: H. Peter Anvin (Intel) <hpa@zytor.com>
Tested-by: Shan Kang <shan.kang@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Xin Li <xin3.li@intel.com>
---
 arch/x86/mm/fault.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index ab778eac1952..7675bc067153 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -34,6 +34,7 @@
 #include <asm/kvm_para.h>		/* kvm_handle_async_pf		*/
 #include <asm/vdso.h>			/* fixup_vdso_exception()	*/
 #include <asm/irq_stack.h>
+#include <asm/fred.h>
 
 #define CREATE_TRACE_POINTS
 #include <asm/trace/exceptions.h>
@@ -1516,8 +1517,10 @@ handle_page_fault(struct pt_regs *regs, unsigned long error_code,
 
 DEFINE_IDTENTRY_RAW_ERRORCODE(exc_page_fault)
 {
-	unsigned long address = read_cr2();
 	irqentry_state_t state;
+	unsigned long address;
+
+	address = cpu_feature_enabled(X86_FEATURE_FRED) ? fred_event_data(regs) : read_cr2();
 
 	prefetchw(&current->mm->mmap_lock);
 
-- 
2.34.1



* [PATCH v10 24/38] x86/idtentry: Incorporate definitions/declarations of the FRED entries
  2023-09-14  4:47 [PATCH v10 00/38] x86: enable FRED for x86-64 Xin Li
                   ` (22 preceding siblings ...)
  2023-09-14  4:47 ` [PATCH v10 23/38] x86/fred: Make exc_page_fault() work for FRED Xin Li
@ 2023-09-14  4:47 ` Xin Li
  2023-09-14  4:47 ` [PATCH v10 25/38] x86/fred: Add a debug fault entry stub for FRED Xin Li
                   ` (13 subsequent siblings)
  37 siblings, 0 replies; 88+ messages in thread
From: Xin Li @ 2023-09-14  4:47 UTC (permalink / raw)
  To: linux-doc, linux-kernel, linux-edac, linux-hyperv, kvm, xen-devel
  Cc: tglx, mingo, bp, dave.hansen, x86, hpa, luto, pbonzini, seanjc,
	peterz, jgross, ravi.v.shankar, mhiramat, andrew.cooper3,
	jiangshanlai

FRED and IDT can share most of the definitions and declarations so
that in the majority of cases the actual handler implementation is the
same.

The differences are the exceptions for which FRED stores exception
related information on the stack, and the sysvec implementations,
because FRED can handle irqentry_enter()/irqentry_exit() in the
dispatcher instead of having them in each handler.

Also add stub defines for vectors which are not used due to Kconfig
decisions to spare the ifdeffery in the actual FRED dispatch code.
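
The NULL stub idea can be sketched in isolation as follows. This is a
simplified illustration, not the dispatcher from this series: the
vector enum, `sysvec_table`, `dispatch` and `handled_count` are all
hypothetical names. A vector whose Kconfig option is off is defined to
NULL, so the table initializer needs no #ifdef and the dispatcher only
has to check for a NULL slot.

```c
#include <stddef.h>

struct pt_regs;				/* opaque in this sketch */
typedef void (*idtentry_t)(struct pt_regs *regs);

static int handled_count;		/* hypothetical, for demonstration */

static void fred_sysvec_reboot(struct pt_regs *regs)
{
	(void)regs;
	handled_count++;
}

/* Stands in for a vector whose Kconfig option is disabled. */
#define fred_sysvec_thermal	NULL

/* Hypothetical vector numbers, local to this sketch. */
enum { VEC_REBOOT, VEC_THERMAL, NR_VECS };

static const idtentry_t sysvec_table[NR_VECS] = {
	[VEC_REBOOT]  = fred_sysvec_reboot,
	[VEC_THERMAL] = fred_sysvec_thermal,
};

/* Returns 1 if a handler ran, 0 if the slot was a NULL stub. */
static int dispatch(unsigned int vec, struct pt_regs *regs)
{
	if (vec >= NR_VECS || !sysvec_table[vec])
		return 0;

	sysvec_table[vec](regs);
	return 1;
}
```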

Tested-by: Shan Kang <shan.kang@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Xin Li <xin3.li@intel.com>
---

Changes since v9:
* Except NMI/#DB/#MCE, FRED really should share the exception handlers
  with IDT (Thomas Gleixner).

Changes since v8:
* Put IDTENTRY changes in a separate patch (Thomas Gleixner).
---
 arch/x86/include/asm/idtentry.h | 71 +++++++++++++++++++++++++++++----
 1 file changed, 63 insertions(+), 8 deletions(-)

diff --git a/arch/x86/include/asm/idtentry.h b/arch/x86/include/asm/idtentry.h
index cfca68f6cb84..4f26ee9b8b74 100644
--- a/arch/x86/include/asm/idtentry.h
+++ b/arch/x86/include/asm/idtentry.h
@@ -13,15 +13,18 @@
 
 #include <asm/irq_stack.h>
 
+typedef void (*idtentry_t)(struct pt_regs *regs);
+
 /**
  * DECLARE_IDTENTRY - Declare functions for simple IDT entry points
  *		      No error code pushed by hardware
  * @vector:	Vector number (ignored for C)
  * @func:	Function name of the entry point
  *
- * Declares three functions:
+ * Declares four functions:
  * - The ASM entry point: asm_##func
  * - The XEN PV trap entry point: xen_##func (maybe unused)
+ * - The C handler called from the FRED event dispatcher (maybe unused)
  * - The C handler called from the ASM entry point
  *
  * Note: This is the C variant of DECLARE_IDTENTRY(). As the name says it
@@ -31,6 +34,7 @@
 #define DECLARE_IDTENTRY(vector, func)					\
 	asmlinkage void asm_##func(void);				\
 	asmlinkage void xen_asm_##func(void);				\
+	void fred_##func(struct pt_regs *regs);				\
 	__visible void func(struct pt_regs *regs)
 
 /**
@@ -137,6 +141,17 @@ static __always_inline void __##func(struct pt_regs *regs,		\
 #define DEFINE_IDTENTRY_RAW(func)					\
 __visible noinstr void func(struct pt_regs *regs)
 
+/**
+ * DEFINE_FREDENTRY_RAW - Emit code for raw FRED entry points
+ * @func:	Function name of the entry point
+ *
+ * @func is called from the FRED event dispatcher with interrupts disabled.
+ *
+ * See @DEFINE_IDTENTRY_RAW for further details.
+ */
+#define DEFINE_FREDENTRY_RAW(func)					\
+noinstr void fred_##func(struct pt_regs *regs)
+
 /**
  * DECLARE_IDTENTRY_RAW_ERRORCODE - Declare functions for raw IDT entry points
  *				    Error code pushed by hardware
@@ -233,17 +248,27 @@ static noinline void __##func(struct pt_regs *regs, u32 vector)
 #define DEFINE_IDTENTRY_SYSVEC(func)					\
 static void __##func(struct pt_regs *regs);				\
 									\
+static __always_inline void instr_##func(struct pt_regs *regs)		\
+{									\
+	kvm_set_cpu_l1tf_flush_l1d();					\
+	run_sysvec_on_irqstack_cond(__##func, regs);			\
+}									\
+									\
 __visible noinstr void func(struct pt_regs *regs)			\
 {									\
 	irqentry_state_t state = irqentry_enter(regs);			\
 									\
 	instrumentation_begin();					\
-	kvm_set_cpu_l1tf_flush_l1d();					\
-	run_sysvec_on_irqstack_cond(__##func, regs);			\
+	instr_##func (regs);						\
 	instrumentation_end();						\
 	irqentry_exit(regs, state);					\
 }									\
 									\
+void fred_##func(struct pt_regs *regs)					\
+{									\
+	instr_##func (regs);						\
+}									\
+									\
 static noinline void __##func(struct pt_regs *regs)
 
 /**
@@ -260,19 +285,29 @@ static noinline void __##func(struct pt_regs *regs)
 #define DEFINE_IDTENTRY_SYSVEC_SIMPLE(func)				\
 static __always_inline void __##func(struct pt_regs *regs);		\
 									\
-__visible noinstr void func(struct pt_regs *regs)			\
+static __always_inline void instr_##func(struct pt_regs *regs)		\
 {									\
-	irqentry_state_t state = irqentry_enter(regs);			\
-									\
-	instrumentation_begin();					\
 	__irq_enter_raw();						\
 	kvm_set_cpu_l1tf_flush_l1d();					\
 	__##func (regs);						\
 	__irq_exit_raw();						\
+}									\
+									\
+__visible noinstr void func(struct pt_regs *regs)			\
+{									\
+	irqentry_state_t state = irqentry_enter(regs);			\
+									\
+	instrumentation_begin();					\
+	instr_##func (regs);						\
 	instrumentation_end();						\
 	irqentry_exit(regs, state);					\
 }									\
 									\
+void fred_##func(struct pt_regs *regs)					\
+{									\
+	instr_##func (regs);						\
+}									\
+									\
 static __always_inline void __##func(struct pt_regs *regs)
 
 /**
@@ -410,15 +445,18 @@ __visible noinstr void func(struct pt_regs *regs,			\
 /* C-Code mapping */
 #define DECLARE_IDTENTRY_NMI		DECLARE_IDTENTRY_RAW
 #define DEFINE_IDTENTRY_NMI		DEFINE_IDTENTRY_RAW
+#define DEFINE_FREDENTRY_NMI		DEFINE_FREDENTRY_RAW
 
 #ifdef CONFIG_X86_64
 #define DECLARE_IDTENTRY_MCE		DECLARE_IDTENTRY_IST
 #define DEFINE_IDTENTRY_MCE		DEFINE_IDTENTRY_IST
 #define DEFINE_IDTENTRY_MCE_USER	DEFINE_IDTENTRY_NOIST
+#define DEFINE_FREDENTRY_MCE		DEFINE_FREDENTRY_RAW
 
 #define DECLARE_IDTENTRY_DEBUG		DECLARE_IDTENTRY_IST
 #define DEFINE_IDTENTRY_DEBUG		DEFINE_IDTENTRY_IST
 #define DEFINE_IDTENTRY_DEBUG_USER	DEFINE_IDTENTRY_NOIST
+#define DEFINE_FREDENTRY_DEBUG		DEFINE_FREDENTRY_RAW
 #endif
 
 #else /* !__ASSEMBLY__ */
@@ -651,23 +689,36 @@ DECLARE_IDTENTRY(RESCHEDULE_VECTOR,			sysvec_reschedule_ipi);
 DECLARE_IDTENTRY_SYSVEC(REBOOT_VECTOR,			sysvec_reboot);
 DECLARE_IDTENTRY_SYSVEC(CALL_FUNCTION_SINGLE_VECTOR,	sysvec_call_function_single);
 DECLARE_IDTENTRY_SYSVEC(CALL_FUNCTION_VECTOR,		sysvec_call_function);
+#else
+# define fred_sysvec_reschedule_ipi			NULL
+# define fred_sysvec_reboot				NULL
+# define fred_sysvec_call_function_single		NULL
+# define fred_sysvec_call_function			NULL
 #endif
 
 #ifdef CONFIG_X86_LOCAL_APIC
 # ifdef CONFIG_X86_MCE_THRESHOLD
 DECLARE_IDTENTRY_SYSVEC(THRESHOLD_APIC_VECTOR,		sysvec_threshold);
+# else
+# define fred_sysvec_threshold				NULL
 # endif
 
 # ifdef CONFIG_X86_MCE_AMD
 DECLARE_IDTENTRY_SYSVEC(DEFERRED_ERROR_VECTOR,		sysvec_deferred_error);
+# else
+# define fred_sysvec_deferred_error			NULL
 # endif
 
 # ifdef CONFIG_X86_THERMAL_VECTOR
 DECLARE_IDTENTRY_SYSVEC(THERMAL_APIC_VECTOR,		sysvec_thermal);
+# else
+# define fred_sysvec_thermal				NULL
 # endif
 
 # ifdef CONFIG_IRQ_WORK
 DECLARE_IDTENTRY_SYSVEC(IRQ_WORK_VECTOR,		sysvec_irq_work);
+# else
+# define fred_sysvec_irq_work				NULL
 # endif
 #endif
 
@@ -675,12 +726,16 @@ DECLARE_IDTENTRY_SYSVEC(IRQ_WORK_VECTOR,		sysvec_irq_work);
 DECLARE_IDTENTRY_SYSVEC(POSTED_INTR_VECTOR,		sysvec_kvm_posted_intr_ipi);
 DECLARE_IDTENTRY_SYSVEC(POSTED_INTR_WAKEUP_VECTOR,	sysvec_kvm_posted_intr_wakeup_ipi);
 DECLARE_IDTENTRY_SYSVEC(POSTED_INTR_NESTED_VECTOR,	sysvec_kvm_posted_intr_nested_ipi);
+#else
+# define fred_sysvec_kvm_posted_intr_ipi		NULL
+# define fred_sysvec_kvm_posted_intr_wakeup_ipi		NULL
+# define fred_sysvec_kvm_posted_intr_nested_ipi		NULL
 #endif
 
 #if IS_ENABLED(CONFIG_HYPERV)
 DECLARE_IDTENTRY_SYSVEC(HYPERVISOR_CALLBACK_VECTOR,	sysvec_hyperv_callback);
 DECLARE_IDTENTRY_SYSVEC(HYPERV_REENLIGHTENMENT_VECTOR,	sysvec_hyperv_reenlightenment);
-DECLARE_IDTENTRY_SYSVEC(HYPERV_STIMER0_VECTOR,	sysvec_hyperv_stimer0);
+DECLARE_IDTENTRY_SYSVEC(HYPERV_STIMER0_VECTOR,		sysvec_hyperv_stimer0);
 #endif
 
 #if IS_ENABLED(CONFIG_ACRN_GUEST)
-- 
2.34.1



* [PATCH v10 25/38] x86/fred: Add a debug fault entry stub for FRED
  2023-09-14  4:47 [PATCH v10 00/38] x86: enable FRED for x86-64 Xin Li
                   ` (23 preceding siblings ...)
  2023-09-14  4:47 ` [PATCH v10 24/38] x86/idtentry: Incorporate definitions/declarations of the FRED entries Xin Li
@ 2023-09-14  4:47 ` Xin Li
  2023-09-14  4:47 ` [PATCH v10 26/38] x86/fred: Add a NMI " Xin Li
                   ` (12 subsequent siblings)
  37 siblings, 0 replies; 88+ messages in thread
From: Xin Li @ 2023-09-14  4:47 UTC (permalink / raw)
  To: linux-doc, linux-kernel, linux-edac, linux-hyperv, kvm, xen-devel
  Cc: tglx, mingo, bp, dave.hansen, x86, hpa, luto, pbonzini, seanjc,
	peterz, jgross, ravi.v.shankar, mhiramat, andrew.cooper3,
	jiangshanlai

From: "H. Peter Anvin (Intel)" <hpa@zytor.com>

Depending on the ring level at which it occurs, i.e., user or kernel
context, #DB needs to be handled on a different stack: user #DB on the
current task stack, kernel #DB on a dedicated stack. This is exactly how
FRED event delivery invokes an exception handler: a ring 3 event on the
level 0 stack, i.e., the current task stack; a ring 0 event on the
dedicated #DB stack specified in the IA32_FRED_STKLVLS MSR. So unlike the
IDT entry, the FRED debug exception entry stub doesn't switch stacks.

On a FRED system, the debug trap status information (DR6) is passed on
the stack, to avoid the problem of transient state. Furthermore, FRED
transitions avoid a lot of ugly corner cases whose handling can, and
should, be skipped.

The FRED debug trap status information saved on the stack differs from
DR6 in both stickiness and polarity; it is exactly in the format which
debug_read_clear_dr6() returns for the IDT entry points.
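
To illustrate the polarity point above, here is a minimal sketch of the
conversion that, as far as I recall, debug_read_clear_dr6() applies to a raw
DR6 value on the IDT path. The constant values are my recollection of the
kernel's debugreg.h and should be treated as assumptions; the two helper
names are hypothetical and exist only for this example. With FRED no such
conversion is needed, since the event data already arrives in this format.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed values, recalled from the kernel's uapi debugreg.h. */
#define DR6_RESERVED	0xFFFF0FF0UL	/* bits that read as 1 when idle */
#define DR_STEP		0x4000UL	/* single-step (BS) bit */

/*
 * Hypothetical helper: flip a raw DR6 value to positive polarity, so
 * that a set bit always means "this condition was reported".
 */
static inline unsigned long dr6_to_positive_polarity(unsigned long raw_dr6)
{
	/* Only the low 32 bits of DR6 carry architectural state. */
	assert((uint64_t)raw_dr6 <= 0xFFFFFFFFULL);
	return raw_dr6 ^ DR6_RESERVED;
}

/* Hypothetical helper: test for single-step in the converted value. */
static inline int dr6_is_single_step(unsigned long dr6)
{
	return (dr6 & DR_STEP) != 0;
}
```

An idle raw DR6 (0xFFFF0FF0) converts to 0, and a raw value with only the
single-step bit raised (0xFFFF4FF0) converts to DR_STEP.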

Signed-off-by: H. Peter Anvin (Intel) <hpa@zytor.com>
Tested-by: Shan Kang <shan.kang@intel.com>
Signed-off-by: Xin Li <xin3.li@intel.com>
---

Changes since v9:
* Disable #DB to avoid endless recursion and stack overflow when a
  watchpoint/breakpoint is set in the code path executed by the
  #DB handler (Thomas Gleixner).

Changes since v1:
* Call irqentry_nmi_{enter,exit}() in both the IDT and FRED kernel debug
  fault handlers (Peter Zijlstra).
---
 arch/x86/kernel/traps.c | 43 ++++++++++++++++++++++++++++++++++++-----
 1 file changed, 38 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index c876f1d36a81..848c85208a57 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -50,6 +50,7 @@
 #include <asm/ftrace.h>
 #include <asm/traps.h>
 #include <asm/desc.h>
+#include <asm/fred.h>
 #include <asm/fpu/api.h>
 #include <asm/cpu.h>
 #include <asm/cpu_entry_area.h>
@@ -934,8 +935,7 @@ static bool notify_debug(struct pt_regs *regs, unsigned long *dr6)
 	return false;
 }
 
-static __always_inline void exc_debug_kernel(struct pt_regs *regs,
-					     unsigned long dr6)
+static noinstr void exc_debug_kernel(struct pt_regs *regs, unsigned long dr6)
 {
 	/*
 	 * Disable breakpoints during exception handling; recursive exceptions
@@ -947,6 +947,11 @@ static __always_inline void exc_debug_kernel(struct pt_regs *regs,
 	 *
 	 * Entry text is excluded for HW_BP_X and cpu_entry_area, which
 	 * includes the entry stack is excluded for everything.
+	 *
+	 * For FRED, nested #DB should just work fine. But when a watchpoint or
+	 * breakpoint is set in the code path which is executed by #DB handler,
+	 * it results in an endless recursion and stack overflow. Thus we stay
+	 * with the IDT approach, i.e., save DR7 and disable #DB.
 	 */
 	unsigned long dr7 = local_db_save();
 	irqentry_state_t irq_state = irqentry_nmi_enter(regs);
@@ -976,7 +981,8 @@ static __always_inline void exc_debug_kernel(struct pt_regs *regs,
 	 * Catch SYSENTER with TF set and clear DR_STEP. If this hit a
 	 * watchpoint at the same time then that will still be handled.
 	 */
-	if ((dr6 & DR_STEP) && is_sysenter_singlestep(regs))
+	if (!cpu_feature_enabled(X86_FEATURE_FRED) &&
+	    (dr6 & DR_STEP) && is_sysenter_singlestep(regs))
 		dr6 &= ~DR_STEP;
 
 	/*
@@ -1008,8 +1014,7 @@ static __always_inline void exc_debug_kernel(struct pt_regs *regs,
 	local_db_restore(dr7);
 }
 
-static __always_inline void exc_debug_user(struct pt_regs *regs,
-					   unsigned long dr6)
+static noinstr void exc_debug_user(struct pt_regs *regs, unsigned long dr6)
 {
 	bool icebp;
 
@@ -1093,6 +1098,34 @@ DEFINE_IDTENTRY_DEBUG_USER(exc_debug)
 {
 	exc_debug_user(regs, debug_read_clear_dr6());
 }
+
+#ifdef CONFIG_X86_FRED
+/*
+ * When occurred on different ring level, i.e., from user or kernel
+ * context, #DB needs to be handled on different stack: User #DB on
+ * current task stack, while kernel #DB on a dedicated stack.
+ *
+ * This is exactly how FRED event delivery invokes an exception
+ * handler: ring 3 event on level 0 stack, i.e., current task stack;
+ * ring 0 event on the #DB dedicated stack specified in the
+ * IA32_FRED_STKLVLS MSR. So unlike IDT, the FRED debug exception
+ * entry stub doesn't do stack switch.
+ */
+DEFINE_FREDENTRY_DEBUG(exc_debug)
+{
+	/*
+	 * FRED #DB stores DR6 on the stack in the format which
+	 * debug_read_clear_dr6() returns for the IDT entry points.
+	 */
+	unsigned long dr6 = fred_event_data(regs);
+
+	if (user_mode(regs))
+		exc_debug_user(regs, dr6);
+	else
+		exc_debug_kernel(regs, dr6);
+}
+#endif /* CONFIG_X86_FRED */
+
 #else
 /* 32 bit does not have separate entry points. */
 DEFINE_IDTENTRY_RAW(exc_debug)
-- 
2.34.1



* [PATCH v10 26/38] x86/fred: Add a NMI entry stub for FRED
  2023-09-14  4:47 [PATCH v10 00/38] x86: enable FRED for x86-64 Xin Li
                   ` (24 preceding siblings ...)
  2023-09-14  4:47 ` [PATCH v10 25/38] x86/fred: Add a debug fault entry stub for FRED Xin Li
@ 2023-09-14  4:47 ` Xin Li
  2023-09-14  4:47 ` [PATCH v10 27/38] x86/fred: Add a machine check " Xin Li
                   ` (11 subsequent siblings)
  37 siblings, 0 replies; 88+ messages in thread
From: Xin Li @ 2023-09-14  4:47 UTC (permalink / raw)
  To: linux-doc, linux-kernel, linux-edac, linux-hyperv, kvm, xen-devel
  Cc: tglx, mingo, bp, dave.hansen, x86, hpa, luto, pbonzini, seanjc,
	peterz, jgross, ravi.v.shankar, mhiramat, andrew.cooper3,
	jiangshanlai

From: "H. Peter Anvin (Intel)" <hpa@zytor.com>

On a FRED system, NMIs nest both with themselves and with faults,
transient information is saved in the stack frame, and NMI unblocking
happens only when the stack frame indicates that it should.

Thus, the NMI entry stub for FRED is really quite small...

Signed-off-by: H. Peter Anvin (Intel) <hpa@zytor.com>
Tested-by: Shan Kang <shan.kang@intel.com>
Signed-off-by: Xin Li <xin3.li@intel.com>
---
 arch/x86/kernel/nmi.c | 28 ++++++++++++++++++++++++++++
 1 file changed, 28 insertions(+)

diff --git a/arch/x86/kernel/nmi.c b/arch/x86/kernel/nmi.c
index a0c551846b35..58843fdf5cd0 100644
--- a/arch/x86/kernel/nmi.c
+++ b/arch/x86/kernel/nmi.c
@@ -34,6 +34,7 @@
 #include <asm/cache.h>
 #include <asm/nospec-branch.h>
 #include <asm/sev.h>
+#include <asm/fred.h>
 
 #define CREATE_TRACE_POINTS
 #include <trace/events/nmi.h>
@@ -643,6 +644,33 @@ void nmi_backtrace_stall_check(const struct cpumask *btp)
 
 #endif
 
+#ifdef CONFIG_X86_FRED
+/*
+ * With FRED, CR2/DR6 is pushed to #PF/#DB stack frame during FRED
+ * event delivery, i.e., there is no problem of transient states.
+ * And NMI unblocking only happens when the stack frame indicates
+ * that so should happen.
+ *
+ * Thus, the NMI entry stub for FRED is really straightforward and
+ * as simple as most exception handlers. As such, #DB is allowed
+ * during NMI handling.
+ */
+DEFINE_FREDENTRY_NMI(exc_nmi)
+{
+	irqentry_state_t irq_state;
+
+	if (IS_ENABLED(CONFIG_SMP) && arch_cpu_is_offline(smp_processor_id()))
+		return;
+
+	irq_state = irqentry_nmi_enter(regs);
+
+	inc_irq_stat(__nmi_count);
+	default_do_nmi(regs);
+
+	irqentry_nmi_exit(regs, irq_state);
+}
+#endif
+
 void stop_nmi(void)
 {
 	ignore_nmis++;
-- 
2.34.1



* [PATCH v10 27/38] x86/fred: Add a machine check entry stub for FRED
  2023-09-14  4:47 [PATCH v10 00/38] x86: enable FRED for x86-64 Xin Li
                   ` (25 preceding siblings ...)
  2023-09-14  4:47 ` [PATCH v10 26/38] x86/fred: Add a NMI " Xin Li
@ 2023-09-14  4:47 ` Xin Li
  2023-09-14  4:47 ` [PATCH v10 28/38] x86/fred: FRED entry/exit and dispatch code Xin Li
                   ` (10 subsequent siblings)
  37 siblings, 0 replies; 88+ messages in thread
From: Xin Li @ 2023-09-14  4:47 UTC (permalink / raw)
  To: linux-doc, linux-kernel, linux-edac, linux-hyperv, kvm, xen-devel
  Cc: tglx, mingo, bp, dave.hansen, x86, hpa, luto, pbonzini, seanjc,
	peterz, jgross, ravi.v.shankar, mhiramat, andrew.cooper3,
	jiangshanlai

Like #DB, depending on the ring level at which it occurs, i.e., user or
kernel context, #MCE needs to be handled on a different stack: user #MCE
on the current task stack, kernel #MCE on a dedicated stack.

This is exactly how FRED event delivery invokes an exception handler: a
ring 3 event on the level 0 stack, i.e., the current task stack; a ring 0
event on the dedicated #MCE stack specified in the IA32_FRED_STKLVLS MSR.
So unlike the IDT entry, the FRED machine check entry stub doesn't switch
stacks.
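
The stack selection described above can be modelled roughly as follows.
This is only a sketch under stated assumptions: the two-bits-per-vector
layout of IA32_FRED_STKLVLS and the "maximum of current and configured
level" rule for ring 0 events reflect my reading of the FRED spec, corner
cases such as nested exceptions and double faults are ignored, and
fred_new_stack_level() is a hypothetical name used only here.

```c
#include <assert.h>
#include <stdint.h>

#define X86_TRAP_DB	1
#define X86_TRAP_MC	18

/* Assumed layout: two bits of stack level per exception vector. */
#define FRED_STKLVL(vector, lvl)	((uint64_t)(lvl) << (2 * (vector)))

/* Extract the configured stack level for one exception vector. */
static inline unsigned int stklvl_for_vector(uint64_t stklvls,
					     unsigned int vector)
{
	assert(vector < 32);	/* only 32 vectors fit in a 64-bit MSR */
	return (unsigned int)((stklvls >> (2 * vector)) & 0x3);
}

/*
 * Hypothetical model of the stack level an exception is delivered on:
 * ring 3 events always land on level 0 (the current task stack), ring 0
 * events never move to a lower level than the one already in use.
 */
static inline unsigned int fred_new_stack_level(uint64_t stklvls,
						unsigned int vector,
						int from_ring3,
						unsigned int cur_level)
{
	unsigned int wanted;

	if (from_ring3)
		return 0;

	wanted = stklvl_for_vector(stklvls, vector);
	return wanted > cur_level ? wanted : cur_level;
}
```

With a hypothetical configuration of FRED_STKLVL(X86_TRAP_MC, 2), a user
mode #MCE is delivered on level 0 and a kernel mode #MCE on level 2, so the
C handler can simply branch on user_mode(regs) without switching stacks.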

Tested-by: Shan Kang <shan.kang@intel.com>
Signed-off-by: Xin Li <xin3.li@intel.com>
---

Changes since v5:
* Disallow #DB inside #MCE for robustness' sake (Peter Zijlstra).
---
 arch/x86/kernel/cpu/mce/core.c | 26 ++++++++++++++++++++++++++
 1 file changed, 26 insertions(+)

diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index 6f35f724cc14..da0a4a102afe 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -52,6 +52,7 @@
 #include <asm/mce.h>
 #include <asm/msr.h>
 #include <asm/reboot.h>
+#include <asm/fred.h>
 
 #include "internal.h"
 
@@ -2144,6 +2145,31 @@ DEFINE_IDTENTRY_MCE_USER(exc_machine_check)
 	exc_machine_check_user(regs);
 	local_db_restore(dr7);
 }
+
+#ifdef CONFIG_X86_FRED
+/*
+ * When occurred on different ring level, i.e., from user or kernel
+ * context, #MCE needs to be handled on different stack: User #MCE
+ * on current task stack, while kernel #MCE on a dedicated stack.
+ *
+ * This is exactly how FRED event delivery invokes an exception
+ * handler: ring 3 event on level 0 stack, i.e., current task stack;
+ * ring 0 event on the #MCE dedicated stack specified in the
+ * IA32_FRED_STKLVLS MSR. So unlike IDT, the FRED machine check entry
+ * stub doesn't do stack switch.
+ */
+DEFINE_FREDENTRY_MCE(exc_machine_check)
+{
+	unsigned long dr7;
+
+	dr7 = local_db_save();
+	if (user_mode(regs))
+		exc_machine_check_user(regs);
+	else
+		exc_machine_check_kernel(regs);
+	local_db_restore(dr7);
+}
+#endif
 #else
 /* 32bit unified entry point */
 DEFINE_IDTENTRY_RAW(exc_machine_check)
-- 
2.34.1



* [PATCH v10 28/38] x86/fred: FRED entry/exit and dispatch code
  2023-09-14  4:47 [PATCH v10 00/38] x86: enable FRED for x86-64 Xin Li
                   ` (26 preceding siblings ...)
  2023-09-14  4:47 ` [PATCH v10 27/38] x86/fred: Add a machine check " Xin Li
@ 2023-09-14  4:47 ` Xin Li
  2023-09-21  9:48   ` Nikolay Borisov
  2023-09-14  4:47 ` [PATCH v10 29/38] x86/traps: Add sysvec_install() to install a system interrupt handler Xin Li
                   ` (9 subsequent siblings)
  37 siblings, 1 reply; 88+ messages in thread
From: Xin Li @ 2023-09-14  4:47 UTC (permalink / raw)
  To: linux-doc, linux-kernel, linux-edac, linux-hyperv, kvm, xen-devel
  Cc: tglx, mingo, bp, dave.hansen, x86, hpa, luto, pbonzini, seanjc,
	peterz, jgross, ravi.v.shankar, mhiramat, andrew.cooper3,
	jiangshanlai

From: "H. Peter Anvin (Intel)" <hpa@zytor.com>

Add the code that actually handles kernel and event entry/exit using
FRED. It is split into two files:

- entry_64_fred.S contains the actual entrypoints and exit code, and
  saves and restores registers.
- entry_fred.c contains the two-level event dispatch code for FRED.
  The first-level dispatch is on the event type, and the second-level
  is on the event vector.
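
The two-level dispatch described above can be pictured with a small
user-space model. This is a sketch, not the patch's code: the event type
encodings are my recollection of the FRED spec, only two exception vectors
are modelled, and the model_* names and H_* labels are hypothetical
stand-ins for the real handlers.

```c
#include <assert.h>

/* Event type encodings as I recall them from the FRED spec (assumption). */
enum fred_event_type {
	EVENT_TYPE_EXTINT	= 0,
	EVENT_TYPE_NMI		= 2,
	EVENT_TYPE_HWEXC	= 3,
	EVENT_TYPE_SWINT	= 4,
	EVENT_TYPE_PRIV_SWEXC	= 5,
	EVENT_TYPE_SWEXC	= 6,
	EVENT_TYPE_OTHER	= 7,
};

#define X86_TRAP_DB	1
#define X86_TRAP_PF	14

/* Hypothetical labels standing in for the real handlers. */
enum model_handler {
	H_EXTINT, H_NMI, H_INTX, H_OTHER,
	H_PAGE_FAULT, H_DEBUG, H_UNMODELLED, H_BAD_TYPE,
};

/* Second level: pick a handler by vector (only #PF and #DB modelled). */
enum model_handler model_exception(unsigned int vector)
{
	assert(vector < 256);

	/* #PF is tested first, mirroring the fast path in fred_exception(). */
	if (vector == X86_TRAP_PF)
		return H_PAGE_FAULT;

	switch (vector) {
	case X86_TRAP_DB:	return H_DEBUG;
	default:		return H_UNMODELLED;
	}
}

/* First level: pick a handler family by event type. */
enum model_handler model_dispatch(unsigned int type, unsigned int vector,
				  int from_user)
{
	switch (type) {
	case EVENT_TYPE_EXTINT:
		return H_EXTINT;
	case EVENT_TYPE_NMI:
		return H_NMI;
	case EVENT_TYPE_HWEXC:
	case EVENT_TYPE_SWEXC:
	case EVENT_TYPE_PRIV_SWEXC:
		return model_exception(vector);
	case EVENT_TYPE_SWINT:
		/* INT n is only accepted from user mode. */
		return from_user ? H_INTX : H_BAD_TYPE;
	case EVENT_TYPE_OTHER:
		/* SYSCALL/SYSENTER only make sense from user mode. */
		return from_user ? H_OTHER : H_BAD_TYPE;
	default:
		return H_BAD_TYPE;
	}
}
```

The from_user flag plays the role of the two separate C entry functions:
the kernel variant simply has no cases for software interrupts and
syscalls, so those types fall through to the bad-event path.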

Originally-by: Megha Dey <megha.dey@intel.com>
Signed-off-by: H. Peter Anvin (Intel) <hpa@zytor.com>
Co-developed-by: Xin Li <xin3.li@intel.com>
Tested-by: Shan Kang <shan.kang@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Xin Li <xin3.li@intel.com>
---

Changes since v9:
* Don't use jump tables, indirect jumps are expensive (Thomas Gleixner).
* Except for NMI/#DB/#MCE, FRED really can share the exception handlers
  with IDT (Thomas Gleixner).
* Avoid the sysvec_* idt_entry muck, do it at a central place, reuse code
  instead of blindly copying it, which breaks the performance optimized
  sysvec entries like reschedule_ipi (Thomas Gleixner).
* Add asm_ prefix to FRED asm entry points (Thomas Gleixner).

Changes since v8:
* Don't do syscall early out in fred_entry_from_user() before there are
  proper performance numbers and justifications (Thomas Gleixner).
* Add the control protection exception handler to the FRED exception
  handler table (Thomas Gleixner).
* Add ENDBR to the FRED_ENTER asm macro.
* Reflect the FRED spec 5.0 change that ERETS and ERETU add 8 to %rsp
  before popping the return context from the stack.

Changes since v1:
* Initialize a FRED exception handler to fred_bad_event() instead of NULL
  if no FRED handler is defined for an exception vector (Peter Zijlstra).
* Push calling irqentry_{enter,exit}() and instrumentation_{begin,end}()
  down into individual FRED exception handlers, instead of in the dispatch
  framework (Peter Zijlstra).
---
 arch/x86/entry/Makefile               |   5 +-
 arch/x86/entry/entry_64_fred.S        |  52 ++++++
 arch/x86/entry/entry_fred.c           | 230 ++++++++++++++++++++++++++
 arch/x86/include/asm/asm-prototypes.h |   1 +
 arch/x86/include/asm/fred.h           |   6 +
 5 files changed, 293 insertions(+), 1 deletion(-)
 create mode 100644 arch/x86/entry/entry_64_fred.S
 create mode 100644 arch/x86/entry/entry_fred.c

diff --git a/arch/x86/entry/Makefile b/arch/x86/entry/Makefile
index ca2fe186994b..c93e7f5c2a06 100644
--- a/arch/x86/entry/Makefile
+++ b/arch/x86/entry/Makefile
@@ -18,6 +18,9 @@ obj-y				+= vdso/
 obj-y				+= vsyscall/
 
 obj-$(CONFIG_PREEMPTION)	+= thunk_$(BITS).o
+CFLAGS_entry_fred.o		+= -fno-stack-protector
+CFLAGS_REMOVE_entry_fred.o	+= -pg $(CC_FLAGS_FTRACE)
+obj-$(CONFIG_X86_FRED)		+= entry_64_fred.o entry_fred.o
+
 obj-$(CONFIG_IA32_EMULATION)	+= entry_64_compat.o syscall_32.o
 obj-$(CONFIG_X86_X32_ABI)	+= syscall_x32.o
-
diff --git a/arch/x86/entry/entry_64_fred.S b/arch/x86/entry/entry_64_fred.S
new file mode 100644
index 000000000000..37a1dd5e8ace
--- /dev/null
+++ b/arch/x86/entry/entry_64_fred.S
@@ -0,0 +1,52 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * The actual FRED entry points.
+ */
+
+#include <asm/fred.h>
+
+#include "calling.h"
+
+	.code64
+	.section .noinstr.text, "ax"
+
+.macro FRED_ENTER
+	UNWIND_HINT_END_OF_STACK
+	ENDBR
+	PUSH_AND_CLEAR_REGS
+	movq	%rsp, %rdi	/* %rdi -> pt_regs */
+.endm
+
+.macro FRED_EXIT
+	UNWIND_HINT_REGS
+	POP_REGS
+.endm
+
+/*
+ * The new RIP value that FRED event delivery establishes is
+ * IA32_FRED_CONFIG & ~FFFH for events that occur in ring 3.
+ * Thus the FRED ring 3 entry point must be 4K page aligned.
+ */
+	.align 4096
+
+SYM_CODE_START_NOALIGN(asm_fred_entrypoint_user)
+	FRED_ENTER
+	call	fred_entry_from_user
+	FRED_EXIT
+	ERETU
+SYM_CODE_END(asm_fred_entrypoint_user)
+
+.fill asm_fred_entrypoint_kernel - ., 1, 0xcc
+
+/*
+ * The new RIP value that FRED event delivery establishes is
+ * (IA32_FRED_CONFIG & ~FFFH) + 256 for events that occur in
+ * ring 0, i.e., asm_fred_entrypoint_user + 256.
+ */
+	.org asm_fred_entrypoint_user + 256
+SYM_CODE_START_NOALIGN(asm_fred_entrypoint_kernel)
+	FRED_ENTER
+	call	fred_entry_from_kernel
+	FRED_EXIT
+	ERETS
+SYM_CODE_END(asm_fred_entrypoint_kernel)
diff --git a/arch/x86/entry/entry_fred.c b/arch/x86/entry/entry_fred.c
new file mode 100644
index 000000000000..dc645f5819a9
--- /dev/null
+++ b/arch/x86/entry/entry_fred.c
@@ -0,0 +1,230 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * The FRED specific kernel/user entry functions which are invoked from
+ * assembly code and dispatch to the associated handlers.
+ */
+#include <linux/kernel.h>
+#include <linux/kdebug.h>
+#include <linux/nospec.h>
+
+#include <asm/desc.h>
+#include <asm/fred.h>
+#include <asm/idtentry.h>
+#include <asm/syscall.h>
+#include <asm/trapnr.h>
+#include <asm/traps.h>
+
+/* FRED EVENT_TYPE_OTHER vector numbers */
+#define FRED_SYSCALL			1
+#define FRED_SYSENTER			2
+
+static noinstr void fred_bad_type(struct pt_regs *regs, unsigned long error_code)
+{
+	irqentry_state_t irq_state = irqentry_nmi_enter(regs);
+
+	instrumentation_begin();
+
+	/* Panic on events from a high stack level */
+	if (regs->sl > 0) {
+		pr_emerg("PANIC: invalid or fatal FRED event; event type %u "
+			 "vector %u error 0x%lx aux 0x%lx at %04x:%016lx\n",
+			 regs->fred_ss.type, regs->fred_ss.vector, regs->orig_ax,
+			 fred_event_data(regs), regs->cs, regs->ip);
+		die("invalid or fatal FRED event", regs, regs->orig_ax);
+		panic("invalid or fatal FRED event");
+	} else {
+		unsigned long flags = oops_begin();
+		int sig = SIGKILL;
+
+		pr_alert("BUG: invalid or fatal FRED event; event type %u "
+			 "vector %u error 0x%lx aux 0x%lx at %04x:%016lx\n",
+			 regs->fred_ss.type, regs->fred_ss.vector, regs->orig_ax,
+			 fred_event_data(regs), regs->cs, regs->ip);
+
+		if (__die("Invalid or fatal FRED event", regs, regs->orig_ax))
+			sig = 0;
+
+		oops_end(flags, regs, sig);
+	}
+
+	instrumentation_end();
+	irqentry_nmi_exit(regs, irq_state);
+}
+
+static noinstr void fred_intx(struct pt_regs *regs)
+{
+	switch (regs->fred_ss.vector) {
+	/* INT0 */
+	case X86_TRAP_OF:
+		exc_overflow(regs);
+		return;
+
+	/* INT3 */
+	case X86_TRAP_BP:
+		exc_int3(regs);
+		return;
+
+	/* INT80 */
+	case IA32_SYSCALL_VECTOR:
+		if (likely(IS_ENABLED(CONFIG_IA32_EMULATION))) {
+			/* Save the syscall number */
+			regs->orig_ax = regs->ax;
+			regs->ax = -ENOSYS;
+			do_int80_syscall_32(regs);
+			return;
+		}
+		fallthrough;
+
+	default:
+		exc_general_protection(regs, 0);
+		return;
+	}
+}
+
+static __always_inline void fred_other(struct pt_regs *regs)
+{
+	/* The compiler can fold these conditions into a single test */
+	if (likely(regs->fred_ss.vector == FRED_SYSCALL && regs->fred_ss.lm)) {
+		regs->orig_ax = regs->ax;
+		regs->ax = -ENOSYS;
+		do_syscall_64(regs, regs->orig_ax);
+		return;
+	} else if (likely(IS_ENABLED(CONFIG_IA32_EMULATION) &&
+			  regs->fred_ss.vector == FRED_SYSENTER &&
+			  !regs->fred_ss.lm)) {
+		regs->orig_ax = regs->ax;
+		regs->ax = -ENOSYS;
+		do_fast_syscall_32(regs);
+		return;
+	} else {
+		exc_invalid_op(regs);
+		return;
+	}
+}
+
+#define SYSVEC(_vector, _function) [_vector - FIRST_SYSTEM_VECTOR] = fred_sysvec_##_function
+
+static idtentry_t sysvec_table[NR_SYSTEM_VECTORS] __ro_after_init = {
+	SYSVEC(ERROR_APIC_VECTOR,		error_interrupt),
+	SYSVEC(SPURIOUS_APIC_VECTOR,		spurious_apic_interrupt),
+	SYSVEC(LOCAL_TIMER_VECTOR,		apic_timer_interrupt),
+	SYSVEC(X86_PLATFORM_IPI_VECTOR,		x86_platform_ipi),
+
+	SYSVEC(RESCHEDULE_VECTOR,		reschedule_ipi),
+	SYSVEC(CALL_FUNCTION_SINGLE_VECTOR,	call_function_single),
+	SYSVEC(CALL_FUNCTION_VECTOR,		call_function),
+	SYSVEC(REBOOT_VECTOR,			reboot),
+
+	SYSVEC(THRESHOLD_APIC_VECTOR,		threshold),
+	SYSVEC(DEFERRED_ERROR_VECTOR,		deferred_error),
+	SYSVEC(THERMAL_APIC_VECTOR,		thermal),
+
+	SYSVEC(IRQ_WORK_VECTOR,			irq_work),
+
+	SYSVEC(POSTED_INTR_VECTOR,		kvm_posted_intr_ipi),
+	SYSVEC(POSTED_INTR_WAKEUP_VECTOR,	kvm_posted_intr_wakeup_ipi),
+	SYSVEC(POSTED_INTR_NESTED_VECTOR,	kvm_posted_intr_nested_ipi),
+};
+
+static noinstr void fred_extint(struct pt_regs *regs)
+{
+	unsigned int vector = regs->fred_ss.vector;
+
+	if (WARN_ON_ONCE(vector < FIRST_EXTERNAL_VECTOR))
+		return;
+
+	if (likely(vector >= FIRST_SYSTEM_VECTOR)) {
+		irqentry_state_t state = irqentry_enter(regs);
+
+		instrumentation_begin();
+		sysvec_table[vector - FIRST_SYSTEM_VECTOR](regs);
+		instrumentation_end();
+		irqentry_exit(regs, state);
+	} else {
+		common_interrupt(regs, vector);
+	}
+}
+
+static noinstr void fred_exception(struct pt_regs *regs, unsigned long error_code)
+{
+	/* Optimize for #PF. That's the only exception which matters performance wise */
+	if (likely(regs->fred_ss.vector == X86_TRAP_PF)) {
+		exc_page_fault(regs, error_code);
+		return;
+	}
+
+	switch (regs->fred_ss.vector) {
+	case X86_TRAP_DE: return exc_divide_error(regs);
+	case X86_TRAP_DB: return fred_exc_debug(regs);
+	case X86_TRAP_BP: return exc_int3(regs);
+	case X86_TRAP_OF: return exc_overflow(regs);
+	case X86_TRAP_BR: return exc_bounds(regs);
+	case X86_TRAP_UD: return exc_invalid_op(regs);
+	case X86_TRAP_NM: return exc_device_not_available(regs);
+	case X86_TRAP_DF: return exc_double_fault(regs, error_code);
+	case X86_TRAP_TS: return exc_invalid_tss(regs, error_code);
+	case X86_TRAP_NP: return exc_segment_not_present(regs, error_code);
+	case X86_TRAP_SS: return exc_stack_segment(regs, error_code);
+	case X86_TRAP_GP: return exc_general_protection(regs, error_code);
+	case X86_TRAP_MF: return exc_coprocessor_error(regs);
+	case X86_TRAP_AC: return exc_alignment_check(regs, error_code);
+	case X86_TRAP_XF: return exc_simd_coprocessor_error(regs);
+
+#ifdef CONFIG_X86_MCE
+	case X86_TRAP_MC: return fred_exc_machine_check(regs);
+#endif
+#ifdef CONFIG_INTEL_TDX_GUEST
+	case X86_TRAP_VE: return exc_virtualization_exception(regs);
+#endif
+#ifdef CONFIG_X86_KERNEL_IBT
+	case X86_TRAP_CP: return exc_control_protection(regs, error_code);
+#endif
+	default: return fred_bad_type(regs, error_code);
+	}
+}
+
+__visible noinstr void fred_entry_from_user(struct pt_regs *regs)
+{
+	unsigned long error_code = regs->orig_ax;
+
+	/* Invalidate orig_ax so that syscall_get_nr() works correctly */
+	regs->orig_ax = -1;
+
+	switch (regs->fred_ss.type) {
+	case EVENT_TYPE_EXTINT:
+		return fred_extint(regs);
+	case EVENT_TYPE_NMI:
+		return fred_exc_nmi(regs);
+	case EVENT_TYPE_SWINT:
+		return fred_intx(regs);
+	case EVENT_TYPE_HWEXC:
+	case EVENT_TYPE_SWEXC:
+	case EVENT_TYPE_PRIV_SWEXC:
+		return fred_exception(regs, error_code);
+	case EVENT_TYPE_OTHER:
+		return fred_other(regs);
+	default:
+		return fred_bad_type(regs, error_code);
+	}
+}
+
+__visible noinstr void fred_entry_from_kernel(struct pt_regs *regs)
+{
+	unsigned long error_code = regs->orig_ax;
+
+	/* Invalidate orig_ax so that syscall_get_nr() works correctly */
+	regs->orig_ax = -1;
+
+	switch (regs->fred_ss.type) {
+	case EVENT_TYPE_EXTINT:
+		return fred_extint(regs);
+	case EVENT_TYPE_NMI:
+		return fred_exc_nmi(regs);
+	case EVENT_TYPE_HWEXC:
+	case EVENT_TYPE_SWEXC:
+	case EVENT_TYPE_PRIV_SWEXC:
+		return fred_exception(regs, error_code);
+	default:
+		return fred_bad_type(regs, error_code);
+	}
+}
diff --git a/arch/x86/include/asm/asm-prototypes.h b/arch/x86/include/asm/asm-prototypes.h
index b1a98fa38828..076bf8dee702 100644
--- a/arch/x86/include/asm/asm-prototypes.h
+++ b/arch/x86/include/asm/asm-prototypes.h
@@ -12,6 +12,7 @@
 #include <asm/special_insns.h>
 #include <asm/preempt.h>
 #include <asm/asm.h>
+#include <asm/fred.h>
 #include <asm/gsseg.h>
 
 #ifndef CONFIG_X86_CMPXCHG64
diff --git a/arch/x86/include/asm/fred.h b/arch/x86/include/asm/fred.h
index f514fdb5a39f..16a64ffecbf8 100644
--- a/arch/x86/include/asm/fred.h
+++ b/arch/x86/include/asm/fred.h
@@ -60,6 +60,12 @@ static __always_inline unsigned long fred_event_data(struct pt_regs *regs)
 	return fred_info(regs)->edata;
 }
 
+void asm_fred_entrypoint_user(void);
+void asm_fred_entrypoint_kernel(void);
+
+__visible void fred_entry_from_user(struct pt_regs *regs);
+__visible void fred_entry_from_kernel(struct pt_regs *regs);
+
 #else /* CONFIG_X86_FRED */
 static __always_inline unsigned long fred_event_data(struct pt_regs *regs) { return 0; }
 #endif /* CONFIG_X86_FRED */
-- 
2.34.1
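
The entry point layout used by entry_64_fred.S in the patch above (a 4K
aligned user entry point, with the kernel entry point 256 bytes after it)
follows from how the new RIP is derived from IA32_FRED_CONFIG. A minimal
sketch of that arithmetic, with hypothetical helper names and a made-up
address in the usage note:

```c
#include <assert.h>
#include <stdint.h>

/* The low 12 bits of IA32_FRED_CONFIG are not part of the entry address. */
#define FRED_CONFIG_LOW_MASK		0xFFFULL
#define FRED_KERNEL_ENTRY_OFFSET	256ULL

/*
 * Hypothetical helper: build a config value from an entry address and
 * the low configuration bits. The address must be 4K aligned because
 * its low 12 bits would otherwise be lost in the register.
 */
static inline uint64_t fred_config_value(uint64_t user_entry, uint64_t low_bits)
{
	assert((user_entry & FRED_CONFIG_LOW_MASK) == 0);
	assert((low_bits & ~FRED_CONFIG_LOW_MASK) == 0);
	return user_entry | low_bits;
}

/* Hypothetical helper: RIP established by FRED event delivery. */
static inline uint64_t fred_entry_rip(uint64_t fred_config, int from_ring3)
{
	uint64_t base = fred_config & ~FRED_CONFIG_LOW_MASK;

	return from_ring3 ? base : base + FRED_KERNEL_ENTRY_OFFSET;
}
```

For an illustrative entry address of 0xffffffff81a01000, ring 3 events
start at that address and ring 0 events at 0xffffffff81a01100, whatever the
low configuration bits are. That is why the assembly pads with 0xcc up to
the .org at asm_fred_entrypoint_user + 256.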



* [PATCH v10 29/38] x86/traps: Add sysvec_install() to install a system interrupt handler
  2023-09-14  4:47 [PATCH v10 00/38] x86: enable FRED for x86-64 Xin Li
                   ` (27 preceding siblings ...)
  2023-09-14  4:47 ` [PATCH v10 28/38] x86/fred: FRED entry/exit and dispatch code Xin Li
@ 2023-09-14  4:47 ` Xin Li
  2023-09-14  4:47 ` [PATCH v10 30/38] x86/fred: Let ret_from_fork_asm() jmp to asm_fred_exit_user when FRED is enabled Xin Li
                   ` (8 subsequent siblings)
  37 siblings, 0 replies; 88+ messages in thread
From: Xin Li @ 2023-09-14  4:47 UTC (permalink / raw)
  To: linux-doc, linux-kernel, linux-edac, linux-hyperv, kvm, xen-devel
  Cc: tglx, mingo, bp, dave.hansen, x86, hpa, luto, pbonzini, seanjc,
	peterz, jgross, ravi.v.shankar, mhiramat, andrew.cooper3,
	jiangshanlai

From: "H. Peter Anvin (Intel)" <hpa@zytor.com>

Add sysvec_install() to install a system interrupt handler into the IDT
or the FRED system interrupt handler table.
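
As a rough user-space illustration of the FRED side of this, the handler
table can be thought of as an array indexed by (vector - FIRST_SYSTEM_VECTOR)
in which a slot may be filled only once. The sketch below is not the kernel
code: the FIRST_SYSTEM_VECTOR value is a placeholder (the real one depends
on the configuration), the upper bound check is an addition of mine, and all
model_* names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NR_VECTORS		256
#define FIRST_SYSTEM_VECTOR	0xec	/* placeholder value for this sketch */
#define NR_SYSTEM_VECTORS	(NR_VECTORS - FIRST_SYSTEM_VECTOR)

typedef void (*model_handler_t)(void);

/* One slot per system vector; NULL means "nothing installed". */
static model_handler_t model_sysvec_table[NR_SYSTEM_VECTORS];

/* Install a handler unless the vector is out of range or already taken. */
bool model_install_sysvec(unsigned int vector, model_handler_t handler)
{
	if (vector < FIRST_SYSTEM_VECTOR || vector >= NR_VECTORS)
		return false;

	if (model_sysvec_table[vector - FIRST_SYSTEM_VECTOR] != NULL)
		return false;

	model_sysvec_table[vector - FIRST_SYSTEM_VECTOR] = handler;
	return true;
}

/* Look up the handler for a system vector; NULL if none is installed. */
model_handler_t model_lookup_sysvec(unsigned int vector)
{
	assert(vector >= FIRST_SYSTEM_VECTOR && vector < NR_VECTORS);
	return model_sysvec_table[vector - FIRST_SYSTEM_VECTOR];
}

/* Stand-in for a hypervisor callback handler. */
void model_callback(void)
{
}
```

Installing model_callback for vector 0xf3 succeeds the first time and is
refused the second time, which corresponds to the WARN_ON_ONCE() on an
occupied slot in fred_install_sysvec().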

Tested-by: Shan Kang <shan.kang@intel.com>
Signed-off-by: Xin Li <xin3.li@intel.com>
---

Changes since v8:
* Introduce a macro sysvec_install() to derive the asm handler name from
  a C handler, which simplifies the code and avoids an ugly typecast
  (Thomas Gleixner).
---
 arch/x86/entry/entry_fred.c      | 14 ++++++++++++++
 arch/x86/include/asm/desc.h      |  2 --
 arch/x86/include/asm/idtentry.h  | 15 +++++++++++++++
 arch/x86/kernel/cpu/acrn.c       |  4 ++--
 arch/x86/kernel/cpu/mshyperv.c   | 15 +++++++--------
 arch/x86/kernel/idt.c            |  4 ++--
 arch/x86/kernel/kvm.c            |  2 +-
 drivers/xen/events/events_base.c |  2 +-
 8 files changed, 42 insertions(+), 16 deletions(-)

diff --git a/arch/x86/entry/entry_fred.c b/arch/x86/entry/entry_fred.c
index dc645f5819a9..2fd3e421e066 100644
--- a/arch/x86/entry/entry_fred.c
+++ b/arch/x86/entry/entry_fred.c
@@ -126,6 +126,20 @@ static idtentry_t sysvec_table[NR_SYSTEM_VECTORS] __ro_after_init = {
 	SYSVEC(POSTED_INTR_NESTED_VECTOR,	kvm_posted_intr_nested_ipi),
 };
 
+static bool fred_setup_done __initdata;
+
+void __init fred_install_sysvec(unsigned int sysvec, idtentry_t handler)
+{
+	if (WARN_ON_ONCE(sysvec < FIRST_SYSTEM_VECTOR))
+		return;
+
+	if (WARN_ON_ONCE(fred_setup_done))
+		return;
+
+	if (!WARN_ON_ONCE(sysvec_table[sysvec - FIRST_SYSTEM_VECTOR]))
+		 sysvec_table[sysvec - FIRST_SYSTEM_VECTOR] = handler;
+}
+
 static noinstr void fred_extint(struct pt_regs *regs)
 {
 	unsigned int vector = regs->fred_ss.vector;
diff --git a/arch/x86/include/asm/desc.h b/arch/x86/include/asm/desc.h
index ab97b22ac04a..ec95fe44fa3a 100644
--- a/arch/x86/include/asm/desc.h
+++ b/arch/x86/include/asm/desc.h
@@ -402,8 +402,6 @@ static inline void set_desc_limit(struct desc_struct *desc, unsigned long limit)
 	desc->limit1 = (limit >> 16) & 0xf;
 }
 
-void alloc_intr_gate(unsigned int n, const void *addr);
-
 static inline void init_idt_data(struct idt_data *data, unsigned int n,
 				 const void *addr)
 {
diff --git a/arch/x86/include/asm/idtentry.h b/arch/x86/include/asm/idtentry.h
index 4f26ee9b8b74..650c98160152 100644
--- a/arch/x86/include/asm/idtentry.h
+++ b/arch/x86/include/asm/idtentry.h
@@ -459,6 +459,21 @@ __visible noinstr void func(struct pt_regs *regs,			\
 #define DEFINE_FREDENTRY_DEBUG		DEFINE_FREDENTRY_RAW
 #endif
 
+void idt_install_sysvec(unsigned int n, const void *function);
+
+#ifdef CONFIG_X86_FRED
+void fred_install_sysvec(unsigned int vector, const idtentry_t function);
+#else
+static inline void fred_install_sysvec(unsigned int vector, const idtentry_t function) { }
+#endif
+
+#define sysvec_install(vector, function) {				\
+	if (cpu_feature_enabled(X86_FEATURE_FRED))			\
+		fred_install_sysvec(vector, function);			\
+	else								\
+		idt_install_sysvec(vector, asm_##function);		\
+}
+
 #else /* !__ASSEMBLY__ */
 
 /*
diff --git a/arch/x86/kernel/cpu/acrn.c b/arch/x86/kernel/cpu/acrn.c
index bfeb18fad63f..2c5b51aad91a 100644
--- a/arch/x86/kernel/cpu/acrn.c
+++ b/arch/x86/kernel/cpu/acrn.c
@@ -26,8 +26,8 @@ static u32 __init acrn_detect(void)
 
 static void __init acrn_init_platform(void)
 {
-	/* Setup the IDT for ACRN hypervisor callback */
-	alloc_intr_gate(HYPERVISOR_CALLBACK_VECTOR, asm_sysvec_acrn_hv_callback);
+	/* Install system interrupt handler for ACRN hypervisor callback */
+	sysvec_install(HYPERVISOR_CALLBACK_VECTOR, sysvec_acrn_hv_callback);
 
 	x86_platform.calibrate_tsc = acrn_get_tsc_khz;
 	x86_platform.calibrate_cpu = acrn_get_tsc_khz;
diff --git a/arch/x86/kernel/cpu/mshyperv.c b/arch/x86/kernel/cpu/mshyperv.c
index e6bba12c759c..3403880c3e09 100644
--- a/arch/x86/kernel/cpu/mshyperv.c
+++ b/arch/x86/kernel/cpu/mshyperv.c
@@ -536,19 +536,18 @@ static void __init ms_hyperv_init_platform(void)
 	 */
 	x86_platform.apic_post_init = hyperv_init;
 	hyperv_setup_mmu_ops();
-	/* Setup the IDT for hypervisor callback */
-	alloc_intr_gate(HYPERVISOR_CALLBACK_VECTOR, asm_sysvec_hyperv_callback);
 
-	/* Setup the IDT for reenlightenment notifications */
+	/* Install system interrupt handler for hypervisor callback */
+	sysvec_install(HYPERVISOR_CALLBACK_VECTOR, sysvec_hyperv_callback);
+
+	/* Install system interrupt handler for reenlightenment notifications */
 	if (ms_hyperv.features & HV_ACCESS_REENLIGHTENMENT) {
-		alloc_intr_gate(HYPERV_REENLIGHTENMENT_VECTOR,
-				asm_sysvec_hyperv_reenlightenment);
+		sysvec_install(HYPERV_REENLIGHTENMENT_VECTOR, sysvec_hyperv_reenlightenment);
 	}
 
-	/* Setup the IDT for stimer0 */
+	/* Install system interrupt handler for stimer0 */
 	if (ms_hyperv.misc_features & HV_STIMER_DIRECT_MODE_AVAILABLE) {
-		alloc_intr_gate(HYPERV_STIMER0_VECTOR,
-				asm_sysvec_hyperv_stimer0);
+		sysvec_install(HYPERV_STIMER0_VECTOR, sysvec_hyperv_stimer0);
 	}
 
 # ifdef CONFIG_SMP
diff --git a/arch/x86/kernel/idt.c b/arch/x86/kernel/idt.c
index b786d48f5a0f..0db7d92de48a 100644
--- a/arch/x86/kernel/idt.c
+++ b/arch/x86/kernel/idt.c
@@ -330,7 +330,7 @@ void idt_invalidate(void)
 	load_idt(&idt);
 }
 
-void __init alloc_intr_gate(unsigned int n, const void *addr)
+void __init idt_install_sysvec(unsigned int n, const void *function)
 {
 	if (WARN_ON(n < FIRST_SYSTEM_VECTOR))
 		return;
@@ -339,5 +339,5 @@ void __init alloc_intr_gate(unsigned int n, const void *addr)
 		return;
 
 	if (!WARN_ON(test_and_set_bit(n, system_vectors)))
-		set_intr_gate(n, addr);
+		set_intr_gate(n, function);
 }
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index b8ab9ee5896c..eabf03813a5c 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -829,7 +829,7 @@ static void __init kvm_guest_init(void)
 
 	if (kvm_para_has_feature(KVM_FEATURE_ASYNC_PF_INT) && kvmapf) {
 		static_branch_enable(&kvm_async_pf_enabled);
-		alloc_intr_gate(HYPERVISOR_CALLBACK_VECTOR, asm_sysvec_kvm_asyncpf_interrupt);
+		sysvec_install(HYPERVISOR_CALLBACK_VECTOR, sysvec_kvm_asyncpf_interrupt);
 	}
 
 #ifdef CONFIG_SMP
diff --git a/drivers/xen/events/events_base.c b/drivers/xen/events/events_base.c
index 3bdd5b59661d..c54123ca7b1a 100644
--- a/drivers/xen/events/events_base.c
+++ b/drivers/xen/events/events_base.c
@@ -2243,7 +2243,7 @@ static __init void xen_alloc_callback_vector(void)
 		return;
 
 	pr_info("Xen HVM callback vector for event delivery is enabled\n");
-	alloc_intr_gate(HYPERVISOR_CALLBACK_VECTOR, asm_sysvec_xen_hvm_callback);
+	sysvec_install(HYPERVISOR_CALLBACK_VECTOR, sysvec_xen_hvm_callback);
 }
 #else
 void xen_setup_callback_vector(void) {}
-- 
2.34.1



* [PATCH v10 30/38] x86/fred: Let ret_from_fork_asm() jmp to asm_fred_exit_user when FRED is enabled
  2023-09-14  4:47 [PATCH v10 00/38] x86: enable FRED for x86-64 Xin Li
                   ` (28 preceding siblings ...)
  2023-09-14  4:47 ` [PATCH v10 29/38] x86/traps: Add sysvec_install() to install a system interrupt handler Xin Li
@ 2023-09-14  4:47 ` Xin Li
  2023-09-14  4:47 ` [PATCH v10 31/38] x86/fred: Fixup fault on ERETU by jumping to fred_entrypoint_user Xin Li
                   ` (7 subsequent siblings)
  37 siblings, 0 replies; 88+ messages in thread
From: Xin Li @ 2023-09-14  4:47 UTC (permalink / raw)
  To: linux-doc, linux-kernel, linux-edac, linux-hyperv, kvm, xen-devel
  Cc: tglx, mingo, bp, dave.hansen, x86, hpa, luto, pbonzini, seanjc,
	peterz, jgross, ravi.v.shankar, mhiramat, andrew.cooper3,
	jiangshanlai

From: "H. Peter Anvin (Intel)" <hpa@zytor.com>

Let ret_from_fork_asm() jump to asm_fred_exit_user when FRED is enabled;
otherwise the existing IDT return path is used.

Signed-off-by: H. Peter Anvin (Intel) <hpa@zytor.com>
Tested-by: Shan Kang <shan.kang@intel.com>
Signed-off-by: Xin Li <xin3.li@intel.com>
---
 arch/x86/entry/entry_64.S      | 6 ++++++
 arch/x86/entry/entry_64_fred.S | 1 +
 2 files changed, 7 insertions(+)

diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 9cdb61ea91de..c9e14617dd3f 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -309,7 +309,13 @@ SYM_CODE_START(ret_from_fork_asm)
 	 * and unwind should work normally.
 	 */
 	UNWIND_HINT_REGS
+
+#ifdef CONFIG_X86_FRED
+	ALTERNATIVE "jmp swapgs_restore_regs_and_return_to_usermode", \
+		    "jmp asm_fred_exit_user", X86_FEATURE_FRED
+#else
 	jmp	swapgs_restore_regs_and_return_to_usermode
+#endif
 SYM_CODE_END(ret_from_fork_asm)
 .popsection
 
diff --git a/arch/x86/entry/entry_64_fred.S b/arch/x86/entry/entry_64_fred.S
index 37a1dd5e8ace..5781c3411b44 100644
--- a/arch/x86/entry/entry_64_fred.S
+++ b/arch/x86/entry/entry_64_fred.S
@@ -32,6 +32,7 @@
 SYM_CODE_START_NOALIGN(asm_fred_entrypoint_user)
 	FRED_ENTER
 	call	fred_entry_from_user
+SYM_INNER_LABEL(asm_fred_exit_user, SYM_L_GLOBAL)
 	FRED_EXIT
 	ERETU
 SYM_CODE_END(asm_fred_entrypoint_user)
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 88+ messages in thread

* [PATCH v10 31/38] x86/fred: Fixup fault on ERETU by jumping to fred_entrypoint_user
  2023-09-14  4:47 [PATCH v10 00/38] x86: enable FRED for x86-64 Xin Li
                   ` (29 preceding siblings ...)
  2023-09-14  4:47 ` [PATCH v10 30/38] x86/fred: Let ret_from_fork_asm() jmp to asm_fred_exit_user when FRED is enabled Xin Li
@ 2023-09-14  4:47 ` Xin Li
  2023-09-14  4:47 ` [PATCH v10 32/38] x86/entry/calling: Allow PUSH_AND_CLEAR_REGS being used beyond actual entry code Xin Li
                   ` (6 subsequent siblings)
  37 siblings, 0 replies; 88+ messages in thread
From: Xin Li @ 2023-09-14  4:47 UTC (permalink / raw)
  To: linux-doc, linux-kernel, linux-edac, linux-hyperv, kvm, xen-devel
  Cc: tglx, mingo, bp, dave.hansen, x86, hpa, luto, pbonzini, seanjc,
	peterz, jgross, ravi.v.shankar, mhiramat, andrew.cooper3,
	jiangshanlai

If the stack frame contains an invalid user context (e.g. an invalid SS,
a non-canonical RIP, etc.), the ERETU instruction will fault (#SS or #GP).

From a Linux point of view, this really should be considered a user space
failure, so use the standard fault fixup mechanism to intercept the fault,
fix up the exception frame, and redirect execution to asm_fred_entrypoint_user.
The end result is that it appears just as if the hardware had taken the
exception immediately after completing the transition to user space.
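
As a rough illustration of the fixup mechanism described above, here is a
minimal userspace sketch in C. All names (demo_regs, demo_extable_entry,
demo_fixup_exception) are hypothetical and far simpler than the kernel's
exception table code; in particular, the real ERETU handler in this patch
also rewrites the user frame before redirecting execution.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for the saved register state. */
struct demo_regs {
	uint64_t ip;		/* address of the faulting instruction */
};

/* Hypothetical table entry: "if insn faults, resume at fixup". */
struct demo_extable_entry {
	uint64_t insn;		/* address of an instruction that may fault */
	uint64_t fixup;		/* address to continue at after a fault */
};

/* Linear search for an entry covering the faulting address. */
const struct demo_extable_entry *
demo_search_extable(const struct demo_extable_entry *tbl, size_t n, uint64_t ip)
{
	for (size_t i = 0; i < n; i++) {
		if (tbl[i].insn == ip)
			return &tbl[i];
	}
	return NULL;
}

/*
 * Returns 1 and redirects regs->ip if the fault is covered by the table,
 * returns 0 and leaves regs untouched otherwise.
 */
int demo_fixup_exception(const struct demo_extable_entry *tbl, size_t n,
			 struct demo_regs *regs)
{
	const struct demo_extable_entry *e = demo_search_extable(tbl, n, regs->ip);

	if (!e)
		return 0;

	regs->ip = e->fixup;
	return 1;
}
```

In this picture, the ERETU instruction would be the "insn" address and the
user entry point would be the "fixup" address, so a fault on ERETU resumes
at the entry point as if the event had been taken from user space.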

Suggested-by: H. Peter Anvin (Intel) <hpa@zytor.com>
Tested-by: Shan Kang <shan.kang@intel.com>
Signed-off-by: Xin Li <xin3.li@intel.com>
---

Changes since v8:
* Reflect the FRED spec 5.0 change that ERETS and ERETU add 8 to %rsp
  before popping the return context from the stack.

Changes since v6:
* Add a comment to explain why it is safe to write to the previous FRED stack
  frame. (Lai Jiangshan).

Changes since v5:
* Move the NMI bit from an invalid stack frame, which caused ERETU to fault,
  to the fault handler's stack frame, so that NMIs are unblocked as soon as
  possible if they were blocked (Lai Jiangshan).
---
 arch/x86/entry/entry_64_fred.S             |  5 +-
 arch/x86/include/asm/extable_fixup_types.h |  4 +-
 arch/x86/mm/extable.c                      | 79 ++++++++++++++++++++++
 3 files changed, 86 insertions(+), 2 deletions(-)

diff --git a/arch/x86/entry/entry_64_fred.S b/arch/x86/entry/entry_64_fred.S
index 5781c3411b44..d1c2fc4af8ae 100644
--- a/arch/x86/entry/entry_64_fred.S
+++ b/arch/x86/entry/entry_64_fred.S
@@ -3,6 +3,7 @@
  * The actual FRED entry points.
  */
 
+#include <asm/asm.h>
 #include <asm/fred.h>
 
 #include "calling.h"
@@ -34,7 +35,9 @@ SYM_CODE_START_NOALIGN(asm_fred_entrypoint_user)
 	call	fred_entry_from_user
 SYM_INNER_LABEL(asm_fred_exit_user, SYM_L_GLOBAL)
 	FRED_EXIT
-	ERETU
+1:	ERETU
+
+	_ASM_EXTABLE_TYPE(1b, asm_fred_entrypoint_user, EX_TYPE_ERETU)
 SYM_CODE_END(asm_fred_entrypoint_user)
 
 .fill asm_fred_entrypoint_kernel - ., 1, 0xcc
diff --git a/arch/x86/include/asm/extable_fixup_types.h b/arch/x86/include/asm/extable_fixup_types.h
index 991e31cfde94..1585c798a02f 100644
--- a/arch/x86/include/asm/extable_fixup_types.h
+++ b/arch/x86/include/asm/extable_fixup_types.h
@@ -64,6 +64,8 @@
 #define	EX_TYPE_UCOPY_LEN4		(EX_TYPE_UCOPY_LEN | EX_DATA_IMM(4))
 #define	EX_TYPE_UCOPY_LEN8		(EX_TYPE_UCOPY_LEN | EX_DATA_IMM(8))
 
-#define EX_TYPE_ZEROPAD			20 /* longword load with zeropad on fault */
+#define	EX_TYPE_ZEROPAD			20 /* longword load with zeropad on fault */
+
+#define	EX_TYPE_ERETU			21
 
 #endif
diff --git a/arch/x86/mm/extable.c b/arch/x86/mm/extable.c
index 271dcb2deabc..bc7af7e8587b 100644
--- a/arch/x86/mm/extable.c
+++ b/arch/x86/mm/extable.c
@@ -6,6 +6,7 @@
 #include <xen/xen.h>
 
 #include <asm/fpu/api.h>
+#include <asm/fred.h>
 #include <asm/sev.h>
 #include <asm/traps.h>
 #include <asm/kdebug.h>
@@ -223,6 +224,80 @@ static bool ex_handler_ucopy_len(const struct exception_table_entry *fixup,
 	return ex_handler_uaccess(fixup, regs, trapnr, fault_address);
 }
 
+#ifdef CONFIG_X86_FRED
+static bool ex_handler_eretu(const struct exception_table_entry *fixup,
+			     struct pt_regs *regs, unsigned long error_code)
+{
+	struct pt_regs *uregs = (struct pt_regs *)
+		(regs->sp - offsetof(struct pt_regs, orig_ax));
+	unsigned short ss = uregs->ss;
+	unsigned short cs = uregs->cs;
+
+	/*
+	 * Move the NMI bit from the invalid stack frame, which caused ERETU
+	 * to fault, to the fault handler's stack frame, thus to unblock NMI
+	 * with the fault handler's ERETS instruction ASAP if NMI is blocked.
+	 */
+	regs->fred_ss.nmi = uregs->fred_ss.nmi;
+
+	/*
+	 * Sync event information to uregs, i.e., the ERETU return frame, but
+	 * is it safe to write to the ERETU return frame which is just above
+	 * current event stack frame?
+	 *
+	 * The RSP used by FRED to push a stack frame is not the value in %rsp,
+	 * it is calculated from %rsp with the following 2 steps:
+	 * 1) RSP = %rsp - (IA32_FRED_CONFIG & 0x1c0)	// Reserve N*64 bytes
+	 * 2) RSP = RSP & ~0x3f		// Align to a 64-byte cache line
+	 * when an event delivery doesn't trigger a stack level change.
+	 *
+	 * Here is an example with N*64 (N=1) bytes reserved:
+	 *
+	 *  64-byte cache line ==>  ______________
+	 *                         |___Reserved___|
+	 *                         |__Event_data__|
+	 *                         |_____SS_______|
+	 *                         |_____RSP______|
+	 *                         |_____FLAGS____|
+	 *                         |_____CS_______|
+	 *                         |_____IP_______|
+	 *  64-byte cache line ==> |__Error_code__| <== ERETU return frame
+	 *                         |______________|
+	 *                         |______________|
+	 *                         |______________|
+	 *                         |______________|
+	 *                         |______________|
+	 *                         |______________|
+	 *                         |______________|
+	 *  64-byte cache line ==> |______________| <== RSP after step 1) and 2)
+	 *                         |___Reserved___|
+	 *                         |__Event_data__|
+	 *                         |_____SS_______|
+	 *                         |_____RSP______|
+	 *                         |_____FLAGS____|
+	 *                         |_____CS_______|
+	 *                         |_____IP_______|
+	 *  64-byte cache line ==> |__Error_code__| <== ERETS return frame
+	 *
+	 * Thus a new FRED stack frame will always be pushed below a previous
+	 * FRED stack frame ((N*64) bytes may be reserved between), and it is
+	 * safe to write to a previous FRED stack frame as they never overlap.
+	 */
+	fred_info(uregs)->edata = fred_event_data(regs);
+	uregs->ssx = regs->ssx;
+	uregs->fred_ss.ss = ss;
+	/* The NMI bit was moved away above */
+	uregs->fred_ss.nmi = 0;
+	uregs->csx = regs->csx;
+	uregs->sl = 0;
+	uregs->wfe = 0;
+	uregs->cs = cs;
+	uregs->orig_ax = error_code;
+
+	return ex_handler_default(fixup, regs);
+}
+#endif
+
 int ex_get_fixup_type(unsigned long ip)
 {
 	const struct exception_table_entry *e = search_exception_tables(ip);
@@ -300,6 +375,10 @@ int fixup_exception(struct pt_regs *regs, int trapnr, unsigned long error_code,
 		return ex_handler_ucopy_len(e, regs, trapnr, fault_addr, reg, imm);
 	case EX_TYPE_ZEROPAD:
 		return ex_handler_zeropad(e, regs, fault_addr);
+#ifdef CONFIG_X86_FRED
+	case EX_TYPE_ERETU:
+		return ex_handler_eretu(e, regs, error_code);
+#endif
 	}
 	BUG();
 }
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 88+ messages in thread

* [PATCH v10 32/38] x86/entry/calling: Allow PUSH_AND_CLEAR_REGS being used beyond actual entry code
  2023-09-14  4:47 [PATCH v10 00/38] x86: enable FRED for x86-64 Xin Li
                   ` (30 preceding siblings ...)
  2023-09-14  4:47 ` [PATCH v10 31/38] x86/fred: Fixup fault on ERETU by jumping to fred_entrypoint_user Xin Li
@ 2023-09-14  4:47 ` Xin Li
  2023-09-14  4:48 ` [PATCH v10 33/38] x86/entry: Add fred_entry_from_kvm() for VMX to handle IRQ/NMI Xin Li
                   ` (5 subsequent siblings)
  37 siblings, 0 replies; 88+ messages in thread
From: Xin Li @ 2023-09-14  4:47 UTC (permalink / raw)
  To: linux-doc, linux-kernel, linux-edac, linux-hyperv, kvm, xen-devel
  Cc: tglx, mingo, bp, dave.hansen, x86, hpa, luto, pbonzini, seanjc,
	peterz, jgross, ravi.v.shankar, mhiramat, andrew.cooper3,
	jiangshanlai

From: "Peter Zijlstra (Intel)" <peterz@infradead.org>

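PUSH_AND_CLEAR_REGS may also be used outside of the actual entry code; in
that case %rbp must not be cleared (otherwise the frame pointer is destroyed)
and UNWIND_HINT_REGS must not be emitted. Add clear_bp and unwind_hint
parameters, both defaulting to 1, to let such callers opt out.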

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Shan Kang <shan.kang@intel.com>
Signed-off-by: Xin Li <xin3.li@intel.com>
---
 arch/x86/entry/calling.h | 15 ++++++++++-----
 1 file changed, 10 insertions(+), 5 deletions(-)

diff --git a/arch/x86/entry/calling.h b/arch/x86/entry/calling.h
index f6907627172b..eb57c023d5df 100644
--- a/arch/x86/entry/calling.h
+++ b/arch/x86/entry/calling.h
@@ -65,7 +65,7 @@ For 32-bit we have the following conventions - kernel is built with
  * for assembly code:
  */
 
-.macro PUSH_REGS rdx=%rdx rcx=%rcx rax=%rax save_ret=0
+.macro PUSH_REGS rdx=%rdx rcx=%rcx rax=%rax save_ret=0 unwind_hint=1
 	.if \save_ret
 	pushq	%rsi		/* pt_regs->si */
 	movq	8(%rsp), %rsi	/* temporarily store the return address in %rsi */
@@ -87,14 +87,17 @@ For 32-bit we have the following conventions - kernel is built with
 	pushq	%r13		/* pt_regs->r13 */
 	pushq	%r14		/* pt_regs->r14 */
 	pushq	%r15		/* pt_regs->r15 */
+
+	.if \unwind_hint
 	UNWIND_HINT_REGS
+	.endif
 
 	.if \save_ret
 	pushq	%rsi		/* return address on top of stack */
 	.endif
 .endm
 
-.macro CLEAR_REGS
+.macro CLEAR_REGS clear_bp=1
 	/*
 	 * Sanitize registers of values that a speculation attack might
 	 * otherwise want to exploit. The lower registers are likely clobbered
@@ -109,7 +112,9 @@ For 32-bit we have the following conventions - kernel is built with
 	xorl	%r10d, %r10d	/* nospec r10 */
 	xorl	%r11d, %r11d	/* nospec r11 */
 	xorl	%ebx,  %ebx	/* nospec rbx */
+	.if \clear_bp
 	xorl	%ebp,  %ebp	/* nospec rbp */
+	.endif
 	xorl	%r12d, %r12d	/* nospec r12 */
 	xorl	%r13d, %r13d	/* nospec r13 */
 	xorl	%r14d, %r14d	/* nospec r14 */
@@ -117,9 +122,9 @@ For 32-bit we have the following conventions - kernel is built with
 
 .endm
 
-.macro PUSH_AND_CLEAR_REGS rdx=%rdx rcx=%rcx rax=%rax save_ret=0
-	PUSH_REGS rdx=\rdx, rcx=\rcx, rax=\rax, save_ret=\save_ret
-	CLEAR_REGS
+.macro PUSH_AND_CLEAR_REGS rdx=%rdx rcx=%rcx rax=%rax save_ret=0 clear_bp=1 unwind_hint=1
+	PUSH_REGS rdx=\rdx, rcx=\rcx, rax=\rax, save_ret=\save_ret unwind_hint=\unwind_hint
+	CLEAR_REGS clear_bp=\clear_bp
 .endm
 
 .macro POP_REGS pop_rdi=1
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 88+ messages in thread

* [PATCH v10 33/38] x86/entry: Add fred_entry_from_kvm() for VMX to handle IRQ/NMI
  2023-09-14  4:47 [PATCH v10 00/38] x86: enable FRED for x86-64 Xin Li
                   ` (31 preceding siblings ...)
  2023-09-14  4:47 ` [PATCH v10 32/38] x86/entry/calling: Allow PUSH_AND_CLEAR_REGS being used beyond actual entry code Xin Li
@ 2023-09-14  4:48 ` Xin Li
  2023-09-20 17:54   ` Paolo Bonzini
  2023-09-21 12:11   ` Nikolay Borisov
  2023-09-14  4:48 ` [PATCH v10 34/38] KVM: VMX: Call fred_entry_from_kvm() for IRQ/NMI handling Xin Li
                   ` (4 subsequent siblings)
  37 siblings, 2 replies; 88+ messages in thread
From: Xin Li @ 2023-09-14  4:48 UTC (permalink / raw)
  To: linux-doc, linux-kernel, linux-edac, linux-hyperv, kvm, xen-devel
  Cc: tglx, mingo, bp, dave.hansen, x86, hpa, luto, pbonzini, seanjc,
	peterz, jgross, ravi.v.shankar, mhiramat, andrew.cooper3,
	jiangshanlai

In IRQ/NMI induced VM exits, KVM VMX needs to execute the respective
handlers, which requires the software to create a FRED stack frame
and use it to invoke the handlers. Add fred_entry_from_kvm() for
this job.

Export fred_entry_from_kvm() because VMX can be compiled as a module.
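
To make the software-built frame more concrete, the following is a small
userspace sketch in C, not kernel code. The struct mirrors the 64-byte
frame table in the asm comment of this patch (lowest address first), and
demo_fred_frame_top() applies the two-step redzone and alignment rule
quoted in the ERETU fixup patch of this series. The names
demo_fred_frame and demo_fred_frame_top are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One FRED stack frame, lowest address first (hypothetical name). */
struct demo_fred_frame {
	uint64_t error_code;	/* bytes  7:0  */
	uint64_t rip;		/* bytes 15:8  */
	uint64_t cs_aux;	/* bytes 23:16, CS + aux info */
	uint64_t rflags;	/* bytes 31:24 */
	uint64_t rsp;		/* bytes 39:32 */
	uint64_t ss_info;	/* bytes 47:40, SS + event info */
	uint64_t event_data;	/* bytes 55:48 */
	uint64_t reserved;	/* bytes 63:56 */
};

static_assert(sizeof(struct demo_fred_frame) == 64,
	      "a FRED stack frame is always 64 bytes");
static_assert(offsetof(struct demo_fred_frame, rip) == 8,
	      "RIP sits directly above the error code");
static_assert(offsetof(struct demo_fred_frame, ss_info) == 40,
	      "SS + event info occupies bytes 47:40");

/*
 * Address from which a new frame is pushed when the stack level does not
 * change: reserve the configured redzone, then align down to 64 bytes.
 */
uint64_t demo_fred_frame_top(uint64_t rsp, uint64_t fred_config)
{
	uint64_t top = rsp - (fred_config & 0x1c0);	/* reserve N*64 bytes */

	return top & ~UINT64_C(0x3f);			/* 64-byte alignment */
}
```

For example, with one 64-byte redzone unit configured (config bits 0x40) and
%rsp at 0x1010, the new frame would be pushed starting from 0xfc0.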

Suggested-by: Sean Christopherson <seanjc@google.com>
Tested-by: Shan Kang <shan.kang@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Xin Li <xin3.li@intel.com>
---

Changes since v9:
* Shove the whole thing into arch/x86/entry/entry_64_fred.S for invoking
  external_interrupt() and fred_exc_nmi() (Sean Christopherson).
* Correct and improve a few comments (Sean Christopherson).
* Merge the two IRQ/NMI asm entries into one as it's fine to invoke
  noinstr code from regular code (Thomas Gleixner).
* Set up the long mode and NMI flags in the augmented SS field of the FRED
  stack frame in C instead of asm (Thomas Gleixner).
* Add UNWIND_HINT_{SAVE,RESTORE} to get rid of the warning: "objtool:
  asm_fred_entry_from_kvm+0x0: unreachable instruction" (Peter Zijlstra).

Changes since v8:
* Add a new macro VMX_DO_FRED_EVENT_IRQOFF for FRED instead of
  refactoring VMX_DO_EVENT_IRQOFF (Sean Christopherson).
* Do NOT use a trampoline, just LEA+PUSH the return RIP, PUSH the error
  code, and jump to the FRED kernel entry point for NMI or call
  external_interrupt() for IRQs (Sean Christopherson).
* Call external_interrupt() only when FRED is enabled, and convert the
  non-FRED handling to external_interrupt() after FRED lands (Sean
  Christopherson).
---
 arch/x86/entry/entry_64_fred.S | 73 ++++++++++++++++++++++++++++++++++
 arch/x86/entry/entry_fred.c    | 14 +++++++
 arch/x86/include/asm/fred.h    | 18 +++++++++
 3 files changed, 105 insertions(+)

diff --git a/arch/x86/entry/entry_64_fred.S b/arch/x86/entry/entry_64_fred.S
index d1c2fc4af8ae..f1088d6f2054 100644
--- a/arch/x86/entry/entry_64_fred.S
+++ b/arch/x86/entry/entry_64_fred.S
@@ -4,7 +4,9 @@
  */
 
 #include <asm/asm.h>
+#include <asm/export.h>
 #include <asm/fred.h>
+#include <asm/segment.h>
 
 #include "calling.h"
 
@@ -54,3 +56,74 @@ SYM_CODE_START_NOALIGN(asm_fred_entrypoint_kernel)
 	FRED_EXIT
 	ERETS
 SYM_CODE_END(asm_fred_entrypoint_kernel)
+
+#if IS_ENABLED(CONFIG_KVM_INTEL)
+SYM_FUNC_START(asm_fred_entry_from_kvm)
+	push %rbp
+	mov %rsp, %rbp
+
+	UNWIND_HINT_SAVE
+
+	/*
+	 * Don't check the FRED stack level, the call stack leading to this
+	 * helper is effectively constant and shallow (relatively speaking).
+	 *
+	 * Emulate the FRED-defined redzone and stack alignment.
+	 */
+	sub $(FRED_CONFIG_REDZONE_AMOUNT << 6), %rsp
+	and $FRED_STACK_FRAME_RSP_MASK, %rsp
+
+	/*
+	 * Start to push a FRED stack frame, which is always 64 bytes:
+	 *
+	 * +--------+-----------------+
+	 * | Bytes  | Usage           |
+	 * +--------+-----------------+
+	 * | 63:56  | Reserved        |
+	 * | 55:48  | Event Data      |
+	 * | 47:40  | SS + Event Info |
+	 * | 39:32  | RSP             |
+	 * | 31:24  | RFLAGS          |
+	 * | 23:16  | CS + Aux Info   |
+	 * |  15:8  | RIP             |
+	 * |   7:0  | Error Code      |
+	 * +--------+-----------------+
+	 */
+	push $0				/* Reserved, must be 0 */
+	push $0				/* Event data, 0 for IRQ/NMI */
+	push %rdi			/* fred_ss handed in by the caller */
+	push %rbp
+	pushf
+	mov $__KERNEL_CS, %rax
+	push %rax
+
+	/*
+	 * Unlike the IDT event delivery, FRED _always_ pushes an error code
+	 * after pushing the return RIP, thus the CALL instruction CANNOT be
+	 * used here to push the return RIP, otherwise there is no chance to
+	 * push an error code before invoking the IRQ/NMI handler.
+	 *
+	 * Use LEA to get the return RIP and push it, then push an error code.
+	 */
+	lea 1f(%rip), %rax
+	push %rax				/* Return RIP */
+	push $0					/* Error code, 0 for IRQ/NMI */
+
+	PUSH_AND_CLEAR_REGS clear_bp=0 unwind_hint=0
+	movq %rsp, %rdi				/* %rdi -> pt_regs */
+	call __fred_entry_from_kvm		/* Call the C entry point */
+	POP_REGS
+	ERETS
+1:
+	/*
+	 * Objtool doesn't understand what ERETS does, this hint tells it that
+	 * yes, we'll reach here and with what stack state. A save/restore pair
+	 * isn't strictly needed, but it's the simplest form.
+	 */
+	UNWIND_HINT_RESTORE
+	pop %rbp
+	RET
+
+SYM_FUNC_END(asm_fred_entry_from_kvm)
+EXPORT_SYMBOL_GPL(asm_fred_entry_from_kvm);
+#endif
diff --git a/arch/x86/entry/entry_fred.c b/arch/x86/entry/entry_fred.c
index 2fd3e421e066..f8774611af80 100644
--- a/arch/x86/entry/entry_fred.c
+++ b/arch/x86/entry/entry_fred.c
@@ -242,3 +242,17 @@ __visible noinstr void fred_entry_from_kernel(struct pt_regs *regs)
 		return fred_bad_type(regs, error_code);
 	}
 }
+
+#if IS_ENABLED(CONFIG_KVM_INTEL)
+__visible noinstr void __fred_entry_from_kvm(struct pt_regs *regs)
+{
+	switch (regs->fred_ss.type) {
+	case EVENT_TYPE_EXTINT:
+		return fred_extint(regs);
+	case EVENT_TYPE_NMI:
+		return fred_exc_nmi(regs);
+	default:
+		WARN_ON_ONCE(1);
+	}
+}
+#endif
diff --git a/arch/x86/include/asm/fred.h b/arch/x86/include/asm/fred.h
index 16a64ffecbf8..2fa9f34e5c95 100644
--- a/arch/x86/include/asm/fred.h
+++ b/arch/x86/include/asm/fred.h
@@ -9,6 +9,7 @@
 #include <linux/const.h>
 
 #include <asm/asm.h>
+#include <asm/trapnr.h>
 
 /*
  * FRED event return instruction opcodes for ERET{S,U}; supported in
@@ -62,12 +63,29 @@ static __always_inline unsigned long fred_event_data(struct pt_regs *regs)
 
 void asm_fred_entrypoint_user(void);
 void asm_fred_entrypoint_kernel(void);
+void asm_fred_entry_from_kvm(struct fred_ss);
 
 __visible void fred_entry_from_user(struct pt_regs *regs);
 __visible void fred_entry_from_kernel(struct pt_regs *regs);
+__visible void __fred_entry_from_kvm(struct pt_regs *regs);
+
+/* Can be called from noinstr code, thus __always_inline */
+static __always_inline void fred_entry_from_kvm(unsigned int type, unsigned int vector)
+{
+	struct fred_ss ss = {
+		.ss     =__KERNEL_DS,
+		.type   = type,
+		.vector = vector,
+		.nmi    = type == EVENT_TYPE_NMI,
+		.lm     = 1,
+	};
+
+	asm_fred_entry_from_kvm(ss);
+}
 
 #else /* CONFIG_X86_FRED */
 static __always_inline unsigned long fred_event_data(struct pt_regs *regs) { return 0; }
+static __always_inline void fred_entry_from_kvm(unsigned int type, unsigned int vector) { }
 #endif /* CONFIG_X86_FRED */
 #endif /* !__ASSEMBLY__ */
 
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 88+ messages in thread

* [PATCH v10 34/38] KVM: VMX: Call fred_entry_from_kvm() for IRQ/NMI handling
  2023-09-14  4:47 [PATCH v10 00/38] x86: enable FRED for x86-64 Xin Li
                   ` (32 preceding siblings ...)
  2023-09-14  4:48 ` [PATCH v10 33/38] x86/entry: Add fred_entry_from_kvm() for VMX to handle IRQ/NMI Xin Li
@ 2023-09-14  4:48 ` Xin Li
  2023-09-20 17:54   ` Paolo Bonzini
  2023-09-14  4:48 ` [PATCH v10 35/38] x86/syscall: Split IDT syscall setup code into idt_syscall_init() Xin Li
                   ` (3 subsequent siblings)
  37 siblings, 1 reply; 88+ messages in thread
From: Xin Li @ 2023-09-14  4:48 UTC (permalink / raw)
  To: linux-doc, linux-kernel, linux-edac, linux-hyperv, kvm, xen-devel
  Cc: tglx, mingo, bp, dave.hansen, x86, hpa, luto, pbonzini, seanjc,
	peterz, jgross, ravi.v.shankar, mhiramat, andrew.cooper3,
	jiangshanlai

When FRED is enabled, call fred_entry_from_kvm() to handle IRQ/NMI in
IRQ/NMI induced VM exits.

Tested-by: Shan Kang <shan.kang@intel.com>
Signed-off-by: Xin Li <xin3.li@intel.com>
---
 arch/x86/kvm/vmx/vmx.c | 12 +++++++++---
 1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 72e3943f3693..db55b8418fa3 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -38,6 +38,7 @@
 #include <asm/desc.h>
 #include <asm/fpu/api.h>
 #include <asm/fpu/xstate.h>
+#include <asm/fred.h>
 #include <asm/idtentry.h>
 #include <asm/io.h>
 #include <asm/irq_remapping.h>
@@ -6962,14 +6963,16 @@ static void handle_external_interrupt_irqoff(struct kvm_vcpu *vcpu)
 {
 	u32 intr_info = vmx_get_intr_info(vcpu);
 	unsigned int vector = intr_info & INTR_INFO_VECTOR_MASK;
-	gate_desc *desc = (gate_desc *)host_idt_base + vector;
 
 	if (KVM_BUG(!is_external_intr(intr_info), vcpu->kvm,
 	    "unexpected VM-Exit interrupt info: 0x%x", intr_info))
 		return;
 
 	kvm_before_interrupt(vcpu, KVM_HANDLING_IRQ);
-	vmx_do_interrupt_irqoff(gate_offset(desc));
+	if (cpu_feature_enabled(X86_FEATURE_FRED))
+		fred_entry_from_kvm(EVENT_TYPE_EXTINT, vector);
+	else
+		vmx_do_interrupt_irqoff(gate_offset((gate_desc *)host_idt_base + vector));
 	kvm_after_interrupt(vcpu);
 
 	vcpu->arch.at_instruction_boundary = true;
@@ -7262,7 +7265,10 @@ static noinstr void vmx_vcpu_enter_exit(struct kvm_vcpu *vcpu,
 	if ((u16)vmx->exit_reason.basic == EXIT_REASON_EXCEPTION_NMI &&
 	    is_nmi(vmx_get_intr_info(vcpu))) {
 		kvm_before_interrupt(vcpu, KVM_HANDLING_NMI);
-		vmx_do_nmi_irqoff();
+		if (cpu_feature_enabled(X86_FEATURE_FRED))
+			fred_entry_from_kvm(EVENT_TYPE_NMI, NMI_VECTOR);
+		else
+			vmx_do_nmi_irqoff();
 		kvm_after_interrupt(vcpu);
 	}
 
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 88+ messages in thread

* [PATCH v10 35/38] x86/syscall: Split IDT syscall setup code into idt_syscall_init()
  2023-09-14  4:47 [PATCH v10 00/38] x86: enable FRED for x86-64 Xin Li
                   ` (33 preceding siblings ...)
  2023-09-14  4:48 ` [PATCH v10 34/38] KVM: VMX: Call fred_entry_from_kvm() for IRQ/NMI handling Xin Li
@ 2023-09-14  4:48 ` Xin Li
  2023-09-14  4:48 ` [PATCH v10 36/38] x86/fred: Add fred_syscall_init() Xin Li
                   ` (2 subsequent siblings)
  37 siblings, 0 replies; 88+ messages in thread
From: Xin Li @ 2023-09-14  4:48 UTC (permalink / raw)
  To: linux-doc, linux-kernel, linux-edac, linux-hyperv, kvm, xen-devel
  Cc: tglx, mingo, bp, dave.hansen, x86, hpa, luto, pbonzini, seanjc,
	peterz, jgross, ravi.v.shankar, mhiramat, andrew.cooper3,
	jiangshanlai

Split IDT syscall setup code into idt_syscall_init() to make it
cleaner to add FRED syscall setup code.

Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Shan Kang <shan.kang@intel.com>
Signed-off-by: Xin Li <xin3.li@intel.com>
---
 arch/x86/kernel/cpu/common.c | 13 ++++++++++---
 1 file changed, 10 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 42511209469b..d960b7276008 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -2070,10 +2070,8 @@ static void wrmsrl_cstar(unsigned long val)
 		wrmsrl(MSR_CSTAR, val);
 }
 
-/* May not be marked __init: used by software suspend */
-void syscall_init(void)
+static inline void idt_syscall_init(void)
 {
-	wrmsr(MSR_STAR, 0, (__USER32_CS << 16) | __KERNEL_CS);
 	wrmsrl(MSR_LSTAR, (unsigned long)entry_SYSCALL_64);
 
 #ifdef CONFIG_IA32_EMULATION
@@ -2107,6 +2105,15 @@ void syscall_init(void)
 	       X86_EFLAGS_AC|X86_EFLAGS_ID);
 }
 
+/* May not be marked __init: used by software suspend */
+void syscall_init(void)
+{
+	/* The default user and kernel segments */
+	wrmsr(MSR_STAR, 0, (__USER32_CS << 16) | __KERNEL_CS);
+
+	idt_syscall_init();
+}
+
 #else	/* CONFIG_X86_64 */
 
 #ifdef CONFIG_STACKPROTECTOR
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 88+ messages in thread

* [PATCH v10 36/38] x86/fred: Add fred_syscall_init()
  2023-09-14  4:47 [PATCH v10 00/38] x86: enable FRED for x86-64 Xin Li
                   ` (34 preceding siblings ...)
  2023-09-14  4:48 ` [PATCH v10 35/38] x86/syscall: Split IDT syscall setup code into idt_syscall_init() Xin Li
@ 2023-09-14  4:48 ` Xin Li
  2023-09-19  8:28   ` Thomas Gleixner
  2023-09-14  4:48 ` [PATCH v10 37/38] x86/fred: Add FRED initialization functions Xin Li
  2023-09-14  4:48 ` [PATCH v10 38/38] x86/fred: Invoke FRED initialization code to enable FRED Xin Li
  37 siblings, 1 reply; 88+ messages in thread
From: Xin Li @ 2023-09-14  4:48 UTC (permalink / raw)
  To: linux-doc, linux-kernel, linux-edac, linux-hyperv, kvm, xen-devel
  Cc: tglx, mingo, bp, dave.hansen, x86, hpa, luto, pbonzini, seanjc,
	peterz, jgross, ravi.v.shankar, mhiramat, andrew.cooper3,
	jiangshanlai

From: "H. Peter Anvin (Intel)" <hpa@zytor.com>

Add a syscall initialization function, fred_syscall_init(), for FRED.
Instead of programming the SYSCALL/SYSENTER related MSRs, it clears them
(e.g., MSR_LSTAR) and invalidates the SYSENTER configuration, because
FRED uses the ring 3 FRED entrypoint for SYSCALL and SYSENTER, and
ERETU is the only legitimate instruction to return to ring 3 per FRED
spec 5.0.

Signed-off-by: H. Peter Anvin (Intel) <hpa@zytor.com>
Tested-by: Shan Kang <shan.kang@intel.com>
Signed-off-by: Xin Li <xin3.li@intel.com>
---
 arch/x86/kernel/cpu/common.c | 17 +++++++++++++++++
 1 file changed, 17 insertions(+)

diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index d960b7276008..4cb36e241c9a 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -2105,6 +2105,23 @@ static inline void idt_syscall_init(void)
 	       X86_EFLAGS_AC|X86_EFLAGS_ID);
 }
 
+static inline void fred_syscall_init(void)
+{
+	/*
+	 * Per FRED spec 5.0, FRED uses the ring 3 FRED entrypoint for SYSCALL
+	 * and SYSENTER, and ERETU is the only legit instruction to return to
+	 * ring 3, as a result there is _no_ need to setup the SYSCALL and
+	 * SYSENTER MSRs.
+	 *
+	 * Note, both sysexit and sysret cause #UD when FRED is enabled.
+	 */
+	wrmsrl(MSR_LSTAR, 0ULL);
+	wrmsrl_cstar(0ULL);
+	wrmsrl_safe(MSR_IA32_SYSENTER_CS, (u64)GDT_ENTRY_INVALID_SEG);
+	wrmsrl_safe(MSR_IA32_SYSENTER_ESP, 0ULL);
+	wrmsrl_safe(MSR_IA32_SYSENTER_EIP, 0ULL);
+}
+
 /* May not be marked __init: used by software suspend */
 void syscall_init(void)
 {
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 88+ messages in thread

* [PATCH v10 37/38] x86/fred: Add FRED initialization functions
  2023-09-14  4:47 [PATCH v10 00/38] x86: enable FRED for x86-64 Xin Li
                   ` (35 preceding siblings ...)
  2023-09-14  4:48 ` [PATCH v10 36/38] x86/fred: Add fred_syscall_init() Xin Li
@ 2023-09-14  4:48 ` Xin Li
  2023-09-14  4:48 ` [PATCH v10 38/38] x86/fred: Invoke FRED initialization code to enable FRED Xin Li
  37 siblings, 0 replies; 88+ messages in thread
From: Xin Li @ 2023-09-14  4:48 UTC (permalink / raw)
  To: linux-doc, linux-kernel, linux-edac, linux-hyperv, kvm, xen-devel
  Cc: tglx, mingo, bp, dave.hansen, x86, hpa, luto, pbonzini, seanjc,
	peterz, jgross, ravi.v.shankar, mhiramat, andrew.cooper3,
	jiangshanlai

From: "H. Peter Anvin (Intel)" <hpa@zytor.com>

Add cpu_init_fred_exceptions() to:
  - Set FRED entrypoints for events happening in ring 0 and 3.
  - Specify the stack level for IRQs that occur in ring 0.
  - Specify dedicated event stacks for #DB/NMI/#MCE/#DF.
  - Enable FRED and invalidate the IDT.
  - Force 32-bit system calls to use "int $0x80" only.

Add fred_complete_exception_setup() to:
  - Initialize system_vectors as done for IDT systems.
  - Set unused sysvec_table entries to fred_handle_spurious_interrupt().
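
As a worked example of the stack level encoding, here is a userspace
sketch in C of how the value written to the stack levels MSR is composed:
two bits per vector, as in the FRED_STKLVL() macro in this patch. The
DEMO_* names are hypothetical; the vector numbers are assumed to be the
architectural ones (#DB = 1, NMI = 2, #DF = 8, #MC = 18), and the levels
are those chosen in this patch.

```c
#include <assert.h>
#include <stdint.h>

/* Architectural exception vector numbers (assumed, see above). */
enum {
	DEMO_TRAP_DB  = 1,
	DEMO_TRAP_NMI = 2,
	DEMO_TRAP_DF  = 8,
	DEMO_TRAP_MC  = 18,
};

/* Two bits of stack level per vector, as in FRED_STKLVL(). */
#define DEMO_FRED_STKLVL(vector, lvl)	((uint64_t)(lvl) << (2 * (vector)))

static_assert(DEMO_FRED_STKLVL(DEMO_TRAP_DF, 3) == 0x30000,
	      "#DF level 3 lands in bits 17:16");

/* Compose the stack levels value with the levels used in this patch. */
uint64_t demo_fred_stklvls(void)
{
	return DEMO_FRED_STKLVL(DEMO_TRAP_DB,  1) |
	       DEMO_FRED_STKLVL(DEMO_TRAP_NMI, 2) |
	       DEMO_FRED_STKLVL(DEMO_TRAP_MC,  2) |
	       DEMO_FRED_STKLVL(DEMO_TRAP_DF,  3);
}

/* Extract the stack level configured for one vector. */
unsigned int demo_fred_stklvl_of(uint64_t stklvls, unsigned int vector)
{
	return (unsigned int)((stklvls >> (2 * vector)) & 0x3);
}
```

Under these assumptions the composed value is 0x2000030024, and any vector
not listed (for example #BP, vector 3) decodes to stack level 0.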

Signed-off-by: H. Peter Anvin (Intel) <hpa@zytor.com>
Co-developed-by: Xin Li <xin3.li@intel.com>
Tested-by: Shan Kang <shan.kang@intel.com>
Signed-off-by: Xin Li <xin3.li@intel.com>
---

Changes since v9:
* Set unused sysvec table entries to fred_handle_spurious_interrupt()
  in fred_complete_exception_setup() (Thomas Gleixner).

Changes since v5:
* Add a comment for FRED stack level settings (Lai Jiangshan).
* Define NMI/#DB/#MCE/#DF stack levels using macros.
---
 arch/x86/entry/entry_fred.c | 21 +++++++++++++
 arch/x86/include/asm/fred.h |  5 ++++
 arch/x86/kernel/Makefile    |  1 +
 arch/x86/kernel/fred.c      | 59 +++++++++++++++++++++++++++++++++++++
 4 files changed, 86 insertions(+)
 create mode 100644 arch/x86/kernel/fred.c

diff --git a/arch/x86/entry/entry_fred.c b/arch/x86/entry/entry_fred.c
index f8774611af80..7d507f59315d 100644
--- a/arch/x86/entry/entry_fred.c
+++ b/arch/x86/entry/entry_fred.c
@@ -140,6 +140,27 @@ void __init fred_install_sysvec(unsigned int sysvec, idtentry_t handler)
 		 sysvec_table[sysvec - FIRST_SYSTEM_VECTOR] = handler;
 }
 
+static noinstr void fred_handle_spurious_interrupt(struct pt_regs *regs)
+{
+	spurious_interrupt(regs, regs->fred_ss.vector);
+}
+
+void __init fred_complete_exception_setup(void)
+{
+	unsigned int vector;
+
+	for (vector = 0; vector < FIRST_EXTERNAL_VECTOR; vector++)
+		set_bit(vector, system_vectors);
+
+	for (vector = 0; vector < NR_SYSTEM_VECTORS; vector++) {
+		if (sysvec_table[vector])
+			set_bit(vector + FIRST_SYSTEM_VECTOR, system_vectors);
+		else
+			sysvec_table[vector] = fred_handle_spurious_interrupt;
+	}
+	fred_setup_done = true;
+}
+
 static noinstr void fred_extint(struct pt_regs *regs)
 {
 	unsigned int vector = regs->fred_ss.vector;
diff --git a/arch/x86/include/asm/fred.h b/arch/x86/include/asm/fred.h
index 2fa9f34e5c95..e86c7ba32435 100644
--- a/arch/x86/include/asm/fred.h
+++ b/arch/x86/include/asm/fred.h
@@ -83,8 +83,13 @@ static __always_inline void fred_entry_from_kvm(unsigned int type, unsigned int
 	asm_fred_entry_from_kvm(ss);
 }
 
+void cpu_init_fred_exceptions(void);
+void fred_complete_exception_setup(void);
+
 #else /* CONFIG_X86_FRED */
 static __always_inline unsigned long fred_event_data(struct pt_regs *regs) { return 0; }
+static inline void cpu_init_fred_exceptions(void) { }
+static inline void fred_complete_exception_setup(void) { }
 static __always_inline void fred_entry_from_kvm(unsigned int type, unsigned int vector) { }
 #endif /* CONFIG_X86_FRED */
 #endif /* !__ASSEMBLY__ */
diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index 3269a0e23d3a..8dfdae4111bb 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -47,6 +47,7 @@ obj-y			+= platform-quirks.o
 obj-y			+= process_$(BITS).o signal.o signal_$(BITS).o
 obj-y			+= traps.o idt.o irq.o irq_$(BITS).o dumpstack_$(BITS).o
 obj-y			+= time.o ioport.o dumpstack.o nmi.o
+obj-$(CONFIG_X86_FRED)	+= fred.o
 obj-$(CONFIG_MODIFY_LDT_SYSCALL)	+= ldt.o
 obj-$(CONFIG_X86_KERNEL_IBT)		+= ibt_selftest.o
 obj-y			+= setup.o x86_init.o i8259.o irqinit.o
diff --git a/arch/x86/kernel/fred.c b/arch/x86/kernel/fred.c
new file mode 100644
index 000000000000..4bcd8791ad96
--- /dev/null
+++ b/arch/x86/kernel/fred.c
@@ -0,0 +1,59 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#include <linux/kernel.h>
+
+#include <asm/desc.h>
+#include <asm/fred.h>
+#include <asm/tlbflush.h>
+#include <asm/traps.h>
+
+/* #DB in the kernel would imply the use of a kernel debugger. */
+#define FRED_DB_STACK_LEVEL		1UL
+#define FRED_NMI_STACK_LEVEL		2UL
+#define FRED_MC_STACK_LEVEL		2UL
+/*
+ * #DF is the highest level because a #DF means "something went wrong
+ * *while delivering an exception*." The number of cases for which that
+ * can happen with FRED is drastically reduced and basically amounts to
+ * "the stack you pointed me to is broken." Thus, always change stacks
+ * on #DF, which means it should be at the highest level.
+ */
+#define FRED_DF_STACK_LEVEL		3UL
+
+#define FRED_STKLVL(vector, lvl)	((lvl) << (2 * (vector)))
+
+void cpu_init_fred_exceptions(void)
+{
+	/* When FRED is enabled by default, remove this log message */
+	pr_info("Initialize FRED on CPU%d\n", smp_processor_id());
+
+	wrmsrl(MSR_IA32_FRED_CONFIG,
+	       /* Reserve for CALL emulation */
+	       FRED_CONFIG_REDZONE |
+	       FRED_CONFIG_INT_STKLVL(0) |
+	       FRED_CONFIG_ENTRYPOINT(asm_fred_entrypoint_user));
+
+	/*
+	 * The purpose of separate stacks for NMI, #DB and #MC *in the kernel*
+	 * (remember that user space faults are always taken on stack level 0)
+	 * is to avoid overflowing the kernel stack.
+	 */
+	wrmsrl(MSR_IA32_FRED_STKLVLS,
+	       FRED_STKLVL(X86_TRAP_DB,  FRED_DB_STACK_LEVEL) |
+	       FRED_STKLVL(X86_TRAP_NMI, FRED_NMI_STACK_LEVEL) |
+	       FRED_STKLVL(X86_TRAP_MC,  FRED_MC_STACK_LEVEL) |
+	       FRED_STKLVL(X86_TRAP_DF,  FRED_DF_STACK_LEVEL));
+
+	/* The FRED equivalents to IST stacks... */
+	wrmsrl(MSR_IA32_FRED_RSP1, __this_cpu_ist_top_va(DB));
+	wrmsrl(MSR_IA32_FRED_RSP2, __this_cpu_ist_top_va(NMI));
+	wrmsrl(MSR_IA32_FRED_RSP3, __this_cpu_ist_top_va(DF));
+
+	/* Enable FRED */
+	cr4_set_bits(X86_CR4_FRED);
+	/* Any further IDT use is a bug */
+	idt_invalidate();
+
+	/* Use int $0x80 for 32-bit system calls in FRED mode */
+	setup_clear_cpu_cap(X86_FEATURE_SYSENTER32);
+	setup_clear_cpu_cap(X86_FEATURE_SYSCALL32);
+}
-- 
2.34.1
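
As a worked example of the FRED_STKLVL() packing used in
cpu_init_fred_exceptions() above, the following standalone sketch
computes the value written to MSR_IA32_FRED_STKLVLS. The vector numbers
are the ones from <asm/trapnr.h>. The helper names fred_stklvls_value()
and fred_stklvl_of() are hypothetical and exist only for illustration,
and the macro is widened to uint64_t here so the sketch does not depend
on the width of long.

```c
#include <stdint.h>

/* Vector numbers as defined in arch/x86/include/asm/trapnr.h. */
#define X86_TRAP_DB		1
#define X86_TRAP_NMI		2
#define X86_TRAP_DF		8
#define X86_TRAP_MC		18

/* Stack levels as chosen by the patch. */
#define FRED_DB_STACK_LEVEL	1
#define FRED_NMI_STACK_LEVEL	2
#define FRED_MC_STACK_LEVEL	2
#define FRED_DF_STACK_LEVEL	3

/* Two bits per vector; widened to 64 bits for portability of the sketch. */
#define FRED_STKLVL(vector, lvl)	((uint64_t)(lvl) << (2 * (vector)))

/* Hypothetical helper: the value the patch programs into the MSR. */
static uint64_t fred_stklvls_value(void)
{
	return FRED_STKLVL(X86_TRAP_DB,  FRED_DB_STACK_LEVEL) |
	       FRED_STKLVL(X86_TRAP_NMI, FRED_NMI_STACK_LEVEL) |
	       FRED_STKLVL(X86_TRAP_MC,  FRED_MC_STACK_LEVEL) |
	       FRED_STKLVL(X86_TRAP_DF,  FRED_DF_STACK_LEVEL);
}

/* Hypothetical helper: read back the 2-bit stack level of one vector. */
static unsigned int fred_stklvl_of(uint64_t stklvls, unsigned int vector)
{
	return (unsigned int)((stklvls >> (2 * vector)) & 0x3);
}
```

Under these definitions fred_stklvls_value() evaluates to 0x2000030024,
and every vector not listed (for example #PF, vector 14) reads back as
stack level 0.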


* [PATCH v10 38/38] x86/fred: Invoke FRED initialization code to enable FRED
  2023-09-14  4:47 [PATCH v10 00/38] x86: enable FRED for x86-64 Xin Li
                   ` (36 preceding siblings ...)
  2023-09-14  4:48 ` [PATCH v10 37/38] x86/fred: Add FRED initialization functions Xin Li
@ 2023-09-14  4:48 ` Xin Li
  37 siblings, 0 replies; 88+ messages in thread
From: Xin Li @ 2023-09-14  4:48 UTC (permalink / raw)
  To: linux-doc, linux-kernel, linux-edac, linux-hyperv, kvm, xen-devel
  Cc: tglx, mingo, bp, dave.hansen, x86, hpa, luto, pbonzini, seanjc,
	peterz, jgross, ravi.v.shankar, mhiramat, andrew.cooper3,
	jiangshanlai

From: "H. Peter Anvin (Intel)" <hpa@zytor.com>

Let cpu_init_exception_handling() call cpu_init_fred_exceptions() to
initialize FRED. However, if FRED is unavailable or disabled, it falls
back to setting up the TSS IST and initializing the IDT.

Signed-off-by: H. Peter Anvin (Intel) <hpa@zytor.com>
Co-developed-by: Xin Li <xin3.li@intel.com>
Tested-by: Shan Kang <shan.kang@intel.com>
Signed-off-by: Xin Li <xin3.li@intel.com>
---

Changes since v8:
* Move this patch after all required changes are in place (Thomas
  Gleixner).
---
 arch/x86/kernel/cpu/common.c | 17 ++++++++++++-----
 arch/x86/kernel/irqinit.c    |  7 ++++++-
 arch/x86/kernel/traps.c      |  5 ++++-
 3 files changed, 22 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 4cb36e241c9a..e230d3f4c556 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -61,6 +61,7 @@
 #include <asm/microcode.h>
 #include <asm/intel-family.h>
 #include <asm/cpu_device_id.h>
+#include <asm/fred.h>
 #include <asm/uv/uv.h>
 #include <asm/set_memory.h>
 #include <asm/traps.h>
@@ -2128,7 +2129,10 @@ void syscall_init(void)
 	/* The default user and kernel segments */
 	wrmsr(MSR_STAR, 0, (__USER32_CS << 16) | __KERNEL_CS);
 
-	idt_syscall_init();
+	if (cpu_feature_enabled(X86_FEATURE_FRED))
+		fred_syscall_init();
+	else
+		idt_syscall_init();
 }
 
 #else	/* CONFIG_X86_64 */
@@ -2244,8 +2248,9 @@ void cpu_init_exception_handling(void)
 	/* paranoid_entry() gets the CPU number from the GDT */
 	setup_getcpu(cpu);
 
-	/* IST vectors need TSS to be set up. */
-	tss_setup_ist(tss);
+	/* For IDT mode, IST vectors need to be set in TSS. */
+	if (!cpu_feature_enabled(X86_FEATURE_FRED))
+		tss_setup_ist(tss);
 	tss_setup_io_bitmap(tss);
 	set_tss_desc(cpu, &get_cpu_entry_area(cpu)->tss.x86_tss);
 
@@ -2254,8 +2259,10 @@ void cpu_init_exception_handling(void)
 	/* GHCB needs to be setup to handle #VC. */
 	setup_ghcb();
 
-	/* Finally load the IDT */
-	load_current_idt();
+	if (cpu_feature_enabled(X86_FEATURE_FRED))
+		cpu_init_fred_exceptions();
+	else
+		load_current_idt();
 }
 
 /*
diff --git a/arch/x86/kernel/irqinit.c b/arch/x86/kernel/irqinit.c
index c683666876f1..f79c5edc0b89 100644
--- a/arch/x86/kernel/irqinit.c
+++ b/arch/x86/kernel/irqinit.c
@@ -28,6 +28,7 @@
 #include <asm/setup.h>
 #include <asm/i8259.h>
 #include <asm/traps.h>
+#include <asm/fred.h>
 #include <asm/prom.h>
 
 /*
@@ -96,7 +97,11 @@ void __init native_init_IRQ(void)
 	/* Execute any quirks before the call gates are initialised: */
 	x86_init.irqs.pre_vector_init();
 
-	idt_setup_apic_and_irq_gates();
+	if (cpu_feature_enabled(X86_FEATURE_FRED))
+		fred_complete_exception_setup();
+	else
+		idt_setup_apic_and_irq_gates();
+
 	lapic_assign_system_vectors();
 
 	if (!acpi_ioapic && !of_ioapic && nr_legacy_irqs()) {
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index 848c85208a57..0ee78a30e14a 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -1411,7 +1411,10 @@ void __init trap_init(void)
 
 	/* Initialize TSS before setting up traps so ISTs work */
 	cpu_init_exception_handling();
+
 	/* Setup traps as cpu_init() might #GP */
-	idt_setup_traps();
+	if (!cpu_feature_enabled(X86_FEATURE_FRED))
+		idt_setup_traps();
+
 	cpu_init();
 }
-- 
2.34.1


* Re: [PATCH v10 03/38] x86/msr: Add the WRMSRNS instruction support
  2023-09-14  4:47 ` [PATCH v10 03/38] x86/msr: Add the WRMSRNS instruction support Xin Li
@ 2023-09-14  6:02   ` Juergen Gross
  2023-09-14 13:01     ` andrew.cooper3
  2023-09-14 14:05   ` andrew.cooper3
  2023-09-20  7:58   ` Nikolay Borisov
  2 siblings, 1 reply; 88+ messages in thread
From: Juergen Gross @ 2023-09-14  6:02 UTC (permalink / raw)
  To: Xin Li, linux-doc, linux-kernel, linux-edac, linux-hyperv, kvm,
	xen-devel
  Cc: tglx, mingo, bp, dave.hansen, x86, hpa, luto, pbonzini, seanjc,
	peterz, ravi.v.shankar, mhiramat, andrew.cooper3, jiangshanlai


On 14.09.23 06:47, Xin Li wrote:
> Add an always inline API __wrmsrns() to embed the WRMSRNS instruction
> into the code.
> 
> Tested-by: Shan Kang <shan.kang@intel.com>
> Signed-off-by: Xin Li <xin3.li@intel.com>

In order to avoid having to add paravirt support for WRMSRNS I think
xen_init_capabilities() should gain:

+	setup_clear_cpu_cap(X86_FEATURE_WRMSRNS);


Juergen

> ---
>   arch/x86/include/asm/msr.h | 18 ++++++++++++++++++
>   1 file changed, 18 insertions(+)
> 
> diff --git a/arch/x86/include/asm/msr.h b/arch/x86/include/asm/msr.h
> index 65ec1965cd28..c284ff9ebe67 100644
> --- a/arch/x86/include/asm/msr.h
> +++ b/arch/x86/include/asm/msr.h
> @@ -97,6 +97,19 @@ static __always_inline void __wrmsr(unsigned int msr, u32 low, u32 high)
>   		     : : "c" (msr), "a"(low), "d" (high) : "memory");
>   }
>   
> +/*
> + * WRMSRNS behaves exactly like WRMSR with the only difference being
> + * that it is not a serializing instruction by default.
> + */
> +static __always_inline void __wrmsrns(u32 msr, u32 low, u32 high)
> +{
> +	/* Instruction opcode for WRMSRNS; supported in binutils >= 2.40. */
> +	asm volatile("1: .byte 0x0f,0x01,0xc6\n"
> +		     "2:\n"
> +		     _ASM_EXTABLE_TYPE(1b, 2b, EX_TYPE_WRMSR)
> +		     : : "c" (msr), "a"(low), "d" (high));
> +}
> +
>   #define native_rdmsr(msr, val1, val2)			\
>   do {							\
>   	u64 __val = __rdmsr((msr));			\
> @@ -297,6 +310,11 @@ do {							\
>   
>   #endif	/* !CONFIG_PARAVIRT_XXL */
>   
> +static __always_inline void wrmsrns(u32 msr, u64 val)
> +{
> +	__wrmsrns(msr, val, val >> 32);
> +}
> +
>   /*
>    * 64-bit version of wrmsr_safe():
>    */


* Re: [PATCH v10 08/38] x86/cpufeatures: Add the cpu feature bit for FRED
  2023-09-14  4:47 ` [PATCH v10 08/38] x86/cpufeatures: Add the cpu feature bit for FRED Xin Li
@ 2023-09-14  6:03   ` Juergen Gross
  2023-09-14  6:09     ` Jan Beulich
  0 siblings, 1 reply; 88+ messages in thread
From: Juergen Gross @ 2023-09-14  6:03 UTC (permalink / raw)
  To: Xin Li, linux-doc, linux-kernel, linux-edac, linux-hyperv, kvm,
	xen-devel
  Cc: tglx, mingo, bp, dave.hansen, x86, hpa, luto, pbonzini, seanjc,
	peterz, ravi.v.shankar, mhiramat, andrew.cooper3, jiangshanlai


On 14.09.23 06:47, Xin Li wrote:
> From: "H. Peter Anvin (Intel)" <hpa@zytor.com>
> 
> Any FRED CPU will always have the following features as its baseline:
>    1) LKGS, load attributes of the GS segment but the base address into
>       the IA32_KERNEL_GS_BASE MSR instead of the GS segment’s descriptor
>       cache.
>    2) WRMSRNS, non-serializing WRMSR for faster MSR writes.
> 
> Signed-off-by: H. Peter Anvin (Intel) <hpa@zytor.com>
> Tested-by: Shan Kang <shan.kang@intel.com>
> Signed-off-by: Xin Li <xin3.li@intel.com>

In order to avoid having to add paravirt support for FRED I think
xen_init_capabilities() should gain:

+    setup_clear_cpu_cap(X86_FEATURE_FRED);
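
For context on how clearing a capability interacts with the cpuid_deps[]
entries added by this patch: to my understanding, clearing a feature
also clears every feature recorded as depending on it, so with
{ X86_FEATURE_FRED, X86_FEATURE_WRMSRNS } in the table, clearing WRMSRNS
would take FRED with it, while clearing FRED leaves LKGS and WRMSRNS
alone. The following is a simplified userspace model of that
propagation, not the kernel's implementation; all names in it (FEAT_*,
caps[], model_clear_cap(), model_reset_caps()) are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

enum feature { FEAT_LKGS, FEAT_WRMSRNS, FEAT_FRED, FEAT_COUNT };

/* "feature" requires "depends", mirroring the two entries in the patch. */
struct dep {
	enum feature feature;
	enum feature depends;
};

static const struct dep deps[] = {
	{ FEAT_FRED, FEAT_LKGS },
	{ FEAT_FRED, FEAT_WRMSRNS },
};

/* Model capability bitmap; true means the feature is advertised. */
static bool caps[FEAT_COUNT];

static void model_reset_caps(void)
{
	for (size_t i = 0; i < FEAT_COUNT; i++)
		caps[i] = true;
}

/* Clear one feature, then keep clearing anything that depended on it. */
static void model_clear_cap(enum feature f)
{
	bool changed;

	caps[f] = false;
	do {
		changed = false;
		for (size_t i = 0; i < sizeof(deps) / sizeof(deps[0]); i++) {
			if (!caps[deps[i].depends] && caps[deps[i].feature]) {
				caps[deps[i].feature] = false;
				changed = true;
			}
		}
	} while (changed);
}
```

In this model, model_clear_cap(FEAT_WRMSRNS) leaves FEAT_LKGS set but
clears FEAT_FRED, whereas model_clear_cap(FEAT_FRED) clears only
FEAT_FRED.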


Juergen

> ---
>   arch/x86/include/asm/cpufeatures.h       | 1 +
>   arch/x86/kernel/cpu/cpuid-deps.c         | 2 ++
>   tools/arch/x86/include/asm/cpufeatures.h | 1 +
>   3 files changed, 4 insertions(+)
> 
> diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
> index 330876d34b68..57ae93dc1e52 100644
> --- a/arch/x86/include/asm/cpufeatures.h
> +++ b/arch/x86/include/asm/cpufeatures.h
> @@ -321,6 +321,7 @@
>   #define X86_FEATURE_FZRM		(12*32+10) /* "" Fast zero-length REP MOVSB */
>   #define X86_FEATURE_FSRS		(12*32+11) /* "" Fast short REP STOSB */
>   #define X86_FEATURE_FSRC		(12*32+12) /* "" Fast short REP {CMPSB,SCASB} */
> +#define X86_FEATURE_FRED		(12*32+17) /* Flexible Return and Event Delivery */
>   #define X86_FEATURE_LKGS		(12*32+18) /* "" Load "kernel" (userspace) GS */
>   #define X86_FEATURE_WRMSRNS		(12*32+19) /* "" Non-Serializing Write to Model Specific Register instruction */
>   #define X86_FEATURE_AMX_FP16		(12*32+21) /* "" AMX fp16 Support */
> diff --git a/arch/x86/kernel/cpu/cpuid-deps.c b/arch/x86/kernel/cpu/cpuid-deps.c
> index e462c1d3800a..b7174209d855 100644
> --- a/arch/x86/kernel/cpu/cpuid-deps.c
> +++ b/arch/x86/kernel/cpu/cpuid-deps.c
> @@ -82,6 +82,8 @@ static const struct cpuid_dep cpuid_deps[] = {
>   	{ X86_FEATURE_XFD,			X86_FEATURE_XGETBV1   },
>   	{ X86_FEATURE_AMX_TILE,			X86_FEATURE_XFD       },
>   	{ X86_FEATURE_SHSTK,			X86_FEATURE_XSAVES    },
> +	{ X86_FEATURE_FRED,			X86_FEATURE_LKGS      },
> +	{ X86_FEATURE_FRED,			X86_FEATURE_WRMSRNS   },
>   	{}
>   };
>   
> diff --git a/tools/arch/x86/include/asm/cpufeatures.h b/tools/arch/x86/include/asm/cpufeatures.h
> index 1b9d86ba5bc2..18bab7987d7f 100644
> --- a/tools/arch/x86/include/asm/cpufeatures.h
> +++ b/tools/arch/x86/include/asm/cpufeatures.h
> @@ -317,6 +317,7 @@
>   #define X86_FEATURE_FZRM		(12*32+10) /* "" Fast zero-length REP MOVSB */
>   #define X86_FEATURE_FSRS		(12*32+11) /* "" Fast short REP STOSB */
>   #define X86_FEATURE_FSRC		(12*32+12) /* "" Fast short REP {CMPSB,SCASB} */
> +#define X86_FEATURE_FRED		(12*32+17) /* Flexible Return and Event Delivery */
>   #define X86_FEATURE_LKGS		(12*32+18) /* "" Load "kernel" (userspace) GS */
>   #define X86_FEATURE_WRMSRNS		(12*32+19) /* "" Non-Serializing Write to Model Specific Register instruction */
>   #define X86_FEATURE_AMX_FP16		(12*32+21) /* "" AMX fp16 Support */


* Re: [PATCH v10 08/38] x86/cpufeatures: Add the cpu feature bit for FRED
  2023-09-14  6:03   ` Juergen Gross
@ 2023-09-14  6:09     ` Jan Beulich
  2023-09-14 13:15       ` andrew.cooper3
  0 siblings, 1 reply; 88+ messages in thread
From: Jan Beulich @ 2023-09-14  6:09 UTC (permalink / raw)
  To: Juergen Gross
  Cc: tglx, mingo, bp, dave.hansen, x86, hpa, luto, pbonzini, seanjc,
	peterz, ravi.v.shankar, mhiramat, andrew.cooper3, jiangshanlai,
	Xin Li, linux-doc, linux-kernel, linux-edac, linux-hyperv, kvm,
	xen-devel

On 14.09.2023 08:03, Juergen Gross wrote:
> On 14.09.23 06:47, Xin Li wrote:
>> From: "H. Peter Anvin (Intel)" <hpa@zytor.com>
>>
>> Any FRED CPU will always have the following features as its baseline:
>>    1) LKGS, load attributes of the GS segment but the base address into
>>       the IA32_KERNEL_GS_BASE MSR instead of the GS segment’s descriptor
>>       cache.
>>    2) WRMSRNS, non-serializing WRMSR for faster MSR writes.
>>
>> Signed-off-by: H. Peter Anvin (Intel) <hpa@zytor.com>
>> Tested-by: Shan Kang <shan.kang@intel.com>
>> Signed-off-by: Xin Li <xin3.li@intel.com>
> 
> In order to avoid having to add paravirt support for FRED I think
> xen_init_capabilities() should gain:
> 
> +    setup_clear_cpu_cap(X86_FEATURE_FRED);

I don't view it as very likely that Xen would expose FRED to PV guests
(Andrew?), at which point such a precaution may not be necessary.

Jan

* Re: [PATCH v10 03/38] x86/msr: Add the WRMSRNS instruction support
  2023-09-14  6:02   ` Juergen Gross
@ 2023-09-14 13:01     ` andrew.cooper3
  0 siblings, 0 replies; 88+ messages in thread
From: andrew.cooper3 @ 2023-09-14 13:01 UTC (permalink / raw)
  To: Juergen Gross, Xin Li, linux-doc, linux-kernel, linux-edac,
	linux-hyperv, kvm, xen-devel
  Cc: tglx, mingo, bp, dave.hansen, x86, hpa, luto, pbonzini, seanjc,
	peterz, ravi.v.shankar, mhiramat, jiangshanlai

On 14/09/2023 7:02 am, Juergen Gross wrote:
> On 14.09.23 06:47, Xin Li wrote:
>> Add an always inline API __wrmsrns() to embed the WRMSRNS instruction
>> into the code.
>>
>> Tested-by: Shan Kang <shan.kang@intel.com>
>> Signed-off-by: Xin Li <xin3.li@intel.com>
>
> In order to avoid having to add paravirt support for WRMSRNS I think
> xen_init_capabilities() should gain:
>
> +    setup_clear_cpu_cap(X86_FEATURE_WRMSRNS);

Xen PV guests will never ever see WRMSRNS.  Operating in CPL3, they have
no possible way of adjusting an MSR which isn't serialising, because
even the hypercall forms are serialising.

Xen only exposes the bit for HVM guests.

~Andrew

* Re: [PATCH v10 08/38] x86/cpufeatures: Add the cpu feature bit for FRED
  2023-09-14  6:09     ` Jan Beulich
@ 2023-09-14 13:15       ` andrew.cooper3
  2023-09-15  1:07         ` Thomas Gleixner
  0 siblings, 1 reply; 88+ messages in thread
From: andrew.cooper3 @ 2023-09-14 13:15 UTC (permalink / raw)
  To: Jan Beulich, Juergen Gross
  Cc: tglx, mingo, bp, dave.hansen, x86, hpa, luto, pbonzini, seanjc,
	peterz, ravi.v.shankar, mhiramat, jiangshanlai, Xin Li, linux-doc,
	linux-kernel, linux-edac, linux-hyperv, kvm, xen-devel

On 14/09/2023 7:09 am, Jan Beulich wrote:
> On 14.09.2023 08:03, Juergen Gross wrote:
>> On 14.09.23 06:47, Xin Li wrote:
>>> From: "H. Peter Anvin (Intel)" <hpa@zytor.com>
>>>
>>> Any FRED CPU will always have the following features as its baseline:
>>>    1) LKGS, load attributes of the GS segment but the base address into
>>>       the IA32_KERNEL_GS_BASE MSR instead of the GS segment’s descriptor
>>>       cache.
>>>    2) WRMSRNS, non-serializing WRMSR for faster MSR writes.
>>>
>>> Signed-off-by: H. Peter Anvin (Intel) <hpa@zytor.com>
>>> Tested-by: Shan Kang <shan.kang@intel.com>
>>> Signed-off-by: Xin Li <xin3.li@intel.com>
>> In order to avoid having to add paravirt support for FRED I think
>> xen_init_capabilities() should gain:
>>
>> +    setup_clear_cpu_cap(X86_FEATURE_FRED);
> I don't view it as very likely that Xen would expose FRED to PV guests
> (Andrew?), at which point such a precaution may not be necessary.

PV guests are never going to see FRED (or LKGS for that matter) because
it advertises too much stuff which simply traps because the kernel is in
CPL3.

That said, the 64bit PV ABI is a whole lot closer to FRED than it is to
IDT delivery.  (Almost as if we decided 15 years ago that giving the PV
guest kernel a good stack and GSbase was the right thing to do...)

In some copious free time, I think we ought to provide a
minorly-paravirt FRED to PV guests because there are still some
improvements available as low hanging fruit.

My plan was to have a PV hypervisor leaf advertising paravirt versions
of hardware features, so a guest could see "I don't have architectural
FRED, but I do have paravirt-FRED which is as similar as we can
reasonably make it".  The same goes for a whole bunch of other features.

~Andrew

* Re: [PATCH v10 03/38] x86/msr: Add the WRMSRNS instruction support
  2023-09-14  4:47 ` [PATCH v10 03/38] x86/msr: Add the WRMSRNS instruction support Xin Li
  2023-09-14  6:02   ` Juergen Gross
@ 2023-09-14 14:05   ` andrew.cooper3
  2023-09-14 23:00     ` Thomas Gleixner
  2023-09-20  7:58   ` Nikolay Borisov
  2 siblings, 1 reply; 88+ messages in thread
From: andrew.cooper3 @ 2023-09-14 14:05 UTC (permalink / raw)
  To: Xin Li, linux-doc, linux-kernel, linux-edac, linux-hyperv, kvm,
	xen-devel
  Cc: tglx, mingo, bp, dave.hansen, x86, hpa, luto, pbonzini, seanjc,
	peterz, jgross, ravi.v.shankar, mhiramat, jiangshanlai

On 14/09/2023 5:47 am, Xin Li wrote:
> Add an always inline API __wrmsrns() to embed the WRMSRNS instruction
> into the code.
>
> Tested-by: Shan Kang <shan.kang@intel.com>
> Signed-off-by: Xin Li <xin3.li@intel.com>
> ---
>  arch/x86/include/asm/msr.h | 18 ++++++++++++++++++
>  1 file changed, 18 insertions(+)
>
> diff --git a/arch/x86/include/asm/msr.h b/arch/x86/include/asm/msr.h
> index 65ec1965cd28..c284ff9ebe67 100644
> --- a/arch/x86/include/asm/msr.h
> +++ b/arch/x86/include/asm/msr.h
> @@ -97,6 +97,19 @@ static __always_inline void __wrmsr(unsigned int msr, u32 low, u32 high)
>  		     : : "c" (msr), "a"(low), "d" (high) : "memory");
>  }
>  
> +/*
> + * WRMSRNS behaves exactly like WRMSR with the only difference being
> + * that it is not a serializing instruction by default.
> + */
> +static __always_inline void __wrmsrns(u32 msr, u32 low, u32 high)
> +{
> +	/* Instruction opcode for WRMSRNS; supported in binutils >= 2.40. */
> +	asm volatile("1: .byte 0x0f,0x01,0xc6\n"
> +		     "2:\n"
> +		     _ASM_EXTABLE_TYPE(1b, 2b, EX_TYPE_WRMSR)
> +		     : : "c" (msr), "a"(low), "d" (high));
> +}
> +
>  #define native_rdmsr(msr, val1, val2)			\
>  do {							\
>  	u64 __val = __rdmsr((msr));			\
> @@ -297,6 +310,11 @@ do {							\
>  
>  #endif	/* !CONFIG_PARAVIRT_XXL */
>  
> +static __always_inline void wrmsrns(u32 msr, u64 val)
> +{
> +	__wrmsrns(msr, val, val >> 32);
> +}

This API works in terms of this series where every WRMSRNS is hidden
behind a FRED check, but it's an awkward interface to use anywhere else
in the kernel.

I fully understand that you expect all FRED capable systems to have
WRMSRNS, but it is not a hard requirement and you will end up with
simpler (and therefore better) logic by deleting the dependency.

As a "normal" user of the WRMSR APIs, the programmer only cares about:

1) wrmsr() -> needs to be serialising
2) wrmsr_ns() -> safe to be non-serialising


In Xen, I added something of the form:

/* Non-serialising WRMSR, when available.  Falls back to a serialising
WRMSR. */
static inline void wrmsr_ns(uint32_t msr, uint32_t lo, uint32_t hi)
{
    /*
     * WRMSR is 2 bytes.  WRMSRNS is 3 bytes.  Pad WRMSR with a redundant CS
     * prefix to avoid a trailing NOP.
     */
    alternative_input(".byte 0x2e; wrmsr",
                      ".byte 0x0f,0x01,0xc6", X86_FEATURE_WRMSRNS,
                      "c" (msr), "a" (lo), "d" (hi));
}

and despite what Juergen said, I'm going to recommend that you do wire
this through the paravirt infrastructure, for the benefit of regular
users having a nice API, not because XenPV is expecting to do something
wildly different here.


I'd actually go as far as suggesting that you break patches 1-3 into
different series and e.g. update the regular context switch path to use
the WRMSRNS-falling-back-to-WRMSR helpers, just to get started.

~Andrew

* Re: [PATCH v10 05/38] x86/trapnr: Add event type macros to <asm/trapnr.h>
  2023-09-14  4:47 ` [PATCH v10 05/38] x86/trapnr: Add event type macros to <asm/trapnr.h> Xin Li
@ 2023-09-14 14:22   ` andrew.cooper3
  0 siblings, 0 replies; 88+ messages in thread
From: andrew.cooper3 @ 2023-09-14 14:22 UTC (permalink / raw)
  To: Xin Li, linux-doc, linux-kernel, linux-edac, linux-hyperv, kvm,
	xen-devel
  Cc: tglx, mingo, bp, dave.hansen, x86, hpa, luto, pbonzini, seanjc,
	peterz, jgross, ravi.v.shankar, mhiramat, jiangshanlai

On 14/09/2023 5:47 am, Xin Li wrote:
> Intel VT-x classifies events into eight different types, which is
> inherited by FRED for event identification. As such, event type
> becomes a common x86 concept, and should be defined in a common x86
> header.
>
> Add event type macros to <asm/trapnr.h>, and use it in <asm/vmx.h>.
>
> Suggested-by: H. Peter Anvin (Intel) <hpa@zytor.com>
> Tested-by: Shan Kang <shan.kang@intel.com>
> Signed-off-by: Xin Li <xin3.li@intel.com>
> ---
>  arch/x86/include/asm/trapnr.h | 12 ++++++++++++
>  arch/x86/include/asm/vmx.h    | 17 +++++++++--------
>  2 files changed, 21 insertions(+), 8 deletions(-)
>
> diff --git a/arch/x86/include/asm/trapnr.h b/arch/x86/include/asm/trapnr.h
> index f5d2325aa0b7..ab7e4c9d666f 100644
> --- a/arch/x86/include/asm/trapnr.h
> +++ b/arch/x86/include/asm/trapnr.h
> @@ -2,6 +2,18 @@
>  #ifndef _ASM_X86_TRAPNR_H
>  #define _ASM_X86_TRAPNR_H
>  
> +/*
> + * Event type codes used by both FRED and Intel VT-x

And AMD SVM.  This enumeration has never been unique to just VT-x.

> + */
> +#define EVENT_TYPE_EXTINT	0	// External interrupt
> +#define EVENT_TYPE_RESERVED	1
> +#define EVENT_TYPE_NMI		2	// NMI
> +#define EVENT_TYPE_HWEXC	3	// Hardware originated traps, exceptions
> +#define EVENT_TYPE_SWINT	4	// INT n
> +#define EVENT_TYPE_PRIV_SWEXC	5	// INT1
> +#define EVENT_TYPE_SWEXC	6	// INT0, INT3

Typo.  into, not int0  (the difference shows up more clearly in lower case.)

> diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
> index 0e73616b82f3..c84acfefcd31 100644
> --- a/arch/x86/include/asm/vmx.h
> +++ b/arch/x86/include/asm/vmx.h
> @@ -374,14 +375,14 @@ enum vmcs_field {
>  #define VECTORING_INFO_DELIVER_CODE_MASK    	INTR_INFO_DELIVER_CODE_MASK
>  #define VECTORING_INFO_VALID_MASK       	INTR_INFO_VALID_MASK
>  
> -#define INTR_TYPE_EXT_INTR              (0 << 8) /* external interrupt */
> -#define INTR_TYPE_RESERVED              (1 << 8) /* reserved */
> -#define INTR_TYPE_NMI_INTR		(2 << 8) /* NMI */
> -#define INTR_TYPE_HARD_EXCEPTION	(3 << 8) /* processor exception */
> -#define INTR_TYPE_SOFT_INTR             (4 << 8) /* software interrupt */
> -#define INTR_TYPE_PRIV_SW_EXCEPTION	(5 << 8) /* ICE breakpoint - undocumented */
> -#define INTR_TYPE_SOFT_EXCEPTION	(6 << 8) /* software exception */
> -#define INTR_TYPE_OTHER_EVENT           (7 << 8) /* other event */
> +#define INTR_TYPE_EXT_INTR		(EVENT_TYPE_EXTINT << 8)	/* external interrupt */
> +#define INTR_TYPE_RESERVED		(EVENT_TYPE_RESERVED << 8)	/* reserved */
> +#define INTR_TYPE_NMI_INTR		(EVENT_TYPE_NMI << 8)		/* NMI */
> +#define INTR_TYPE_HARD_EXCEPTION	(EVENT_TYPE_HWEXC << 8)		/* processor exception */
> +#define INTR_TYPE_SOFT_INTR		(EVENT_TYPE_SWINT << 8)		/* software interrupt */
> +#define INTR_TYPE_PRIV_SW_EXCEPTION	(EVENT_TYPE_PRIV_SWEXC << 8)	/* ICE breakpoint - undocumented */

ICEBP/INT1 is no longer undocumented.
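
To illustrate the relationship the patch sets up between the common
event type codes and the VMX interruption-information encoding, here is
a minimal standalone sketch. The EVENT_TYPE_* values are the ones quoted
above; the mask value 0x700 and the helper name intr_info_event_type()
are assumptions made for this example rather than something taken from
the patch.

```c
#include <stdint.h>

/* Event type codes as quoted from the patch. */
#define EVENT_TYPE_EXTINT	0
#define EVENT_TYPE_RESERVED	1
#define EVENT_TYPE_NMI		2
#define EVENT_TYPE_HWEXC	3
#define EVENT_TYPE_SWINT	4
#define EVENT_TYPE_PRIV_SWEXC	5
#define EVENT_TYPE_SWEXC	6

/* VMX keeps the type in bits 10:8 of the interruption information. */
#define INTR_TYPE_SHIFT		8
#define INTR_TYPE_MASK		0x700u	/* assumed mask for bits 10:8 */

#define INTR_TYPE_NMI_INTR	(EVENT_TYPE_NMI << INTR_TYPE_SHIFT)
#define INTR_TYPE_HARD_EXCEPTION (EVENT_TYPE_HWEXC << INTR_TYPE_SHIFT)

/* Hypothetical helper: recover the common event type from VMX intr info. */
static unsigned int intr_info_event_type(uint32_t intr_info)
{
	return (intr_info & INTR_TYPE_MASK) >> INTR_TYPE_SHIFT;
}
```

With these definitions a hardware exception on vector 14, encoded as
INTR_TYPE_HARD_EXCEPTION | 14, decodes back to EVENT_TYPE_HWEXC, so the
same type code can be shared by FRED and the virtualization code.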

~Andrew

* Re: [PATCH v10 03/38] x86/msr: Add the WRMSRNS instruction support
  2023-09-14 14:05   ` andrew.cooper3
@ 2023-09-14 23:00     ` Thomas Gleixner
  2023-09-14 23:34       ` H. Peter Anvin
  2023-09-14 23:46       ` andrew.cooper3
  0 siblings, 2 replies; 88+ messages in thread
From: Thomas Gleixner @ 2023-09-14 23:00 UTC (permalink / raw)
  To: andrew.cooper3, Xin Li, linux-doc, linux-kernel, linux-edac,
	linux-hyperv, kvm, xen-devel
  Cc: mingo, bp, dave.hansen, x86, hpa, luto, pbonzini, seanjc, peterz,
	jgross, ravi.v.shankar, mhiramat, jiangshanlai

Andrew!

On Thu, Sep 14 2023 at 15:05, andrew wrote:
> On 14/09/2023 5:47 am, Xin Li wrote:
>> +static __always_inline void wrmsrns(u32 msr, u64 val)
>> +{
>> +	__wrmsrns(msr, val, val >> 32);
>> +}
>
> This API works in terms of this series where every WRMSRNS is hidden
> behind a FRED check, but it's an awkward interface to use anywhere else
> in the kernel.

Agreed.

> I fully understand that you expect all FRED capable systems to have
> WRMSRNS, but it is not a hard requirement and you will end up with
> simpler (and therefore better) logic by deleting the dependency.

According to the CPU folks FRED systems are guaranteed to have WRMSRNS -
I asked for that :). It's just not yet documented.

But that aside, I agree that we should opt for the safe side with a
fallback like the one you have in XEN even for the places which are
strictly FRED dependent.

> As a "normal" user of the WRMSR APIs, the programmer only cares about:
>
> 1) wrmsr() -> needs to be serialising
> 2) wrmsr_ns() -> safe to be non-serialising

Correct.

> In Xen, I added something of the form:
>
> /* Non-serialising WRMSR, when available.  Falls back to a serialising
> WRMSR. */
> static inline void wrmsr_ns(uint32_t msr, uint32_t lo, uint32_t hi)
> {
>     /*
>      * WRMSR is 2 bytes.  WRMSRNS is 3 bytes.  Pad WRMSR with a redundant CS
>      * prefix to avoid a trailing NOP.
>      */
>     alternative_input(".byte 0x2e; wrmsr",
>                       ".byte 0x0f,0x01,0xc6", X86_FEATURE_WRMSRNS,
>                       "c" (msr), "a" (lo), "d" (hi));
> }
>
> and despite what Juergen said, I'm going to recommend that you do wire
> this through the paravirt infrastructure, for the benefit of regular
> users having a nice API, not because XenPV is expecting to do something
> wildly different here.

I fundamentally hate adding this to the PV infrastructure. We don't want
more PV ops, quite the contrary.

For the initial use case at hand, there is an explicit FRED dependency
and the code in question really wants to use WRMSRNS directly and not
through a PV function call.

I agree with your reasoning for the more generic use case where we can
gain performance independent of FRED by using WRMSRNS for cases where
the write has no serialization requirements.

But this made me look into PV ops some more. For actual performance
relevant code the current PV ops mechanics are a horrorshow when the op
defaults to the native instruction.

Let's look at wrmsrl():

wrmsrl(msr, val)
 wrmsr(msr, (u32)val, (u32)(val >> 32))
  paravirt_write_msr(msr, low, high)
    PVOP_VCALL3(cpu.write_msr, msr, low, high)

Which results in

	mov	$msr, %edi
	mov	$val, %rdx
	mov	%edx, %esi
	shr	$0x20, %rdx
	call	native_write_msr

and native_write_msr() does at minimum:

	mov    %edi,%ecx
	mov    %esi,%eax
	wrmsr
        ret

In the worst case 'ret' is going through the return thunk. Not to talk
about function prologues and whatever.
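
The first step of that chain splits the 64-bit value into the two 32-bit
halves that WRMSR takes in EAX and EDX. A minimal standalone sketch of
the split follows, with hypothetical names (struct msr_halves,
split_msr_value(), join_msr_value()); the point it illustrates is that
the shift has to be applied to the 64-bit value before truncating.

```c
#include <stdint.h>

/* The two register halves WRMSR consumes: EAX (low) and EDX (high). */
struct msr_halves {
	uint32_t low;
	uint32_t high;
};

/* Shift the full 64-bit value first, then truncate each half. */
static struct msr_halves split_msr_value(uint64_t val)
{
	struct msr_halves h = {
		.low  = (uint32_t)val,
		.high = (uint32_t)(val >> 32),
	};

	return h;
}

/* Inverse operation, as done when reading an MSR back from EDX:EAX. */
static uint64_t join_msr_value(struct msr_halves h)
{
	return ((uint64_t)h.high << 32) | h.low;
}
```

For 0x1122334455667788 this yields low = 0x55667788 and
high = 0x11223344. Truncating to 32 bits before shifting by 32 would
instead be undefined behaviour in C.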

This becomes even more silly for trivial instructions like STI/CLI or in
the worst case paravirt_nop().

The call only makes sense when the native default is an actual
function, but for the trivial cases it's a blatant engineering
trainwreck.

I wouldn't care at all if CONFIG_PARAVIRT_XXL would be the esoteric use
case, but AFAICT it's default enabled on all major distros.

So no. I'm fundamentally disagreeing with your recommendation. The way
forward is:

  1) Provide the native variant for wrmsrns(), i.e. rename the proposed
     wrmsrns() to native_wrmsr_ns() and have the X86_FEATURE_WRMSRNS
     safety net as you pointed out.

     That function can be used in code which is guaranteed to be not
     affected by the PV_XXL madness.

  2) Come up with a sensible solution for the PV_XXL horrorshow

  3) Implement a sane general variant of wrmsr_ns() which handles
     both X86_FEATURE_WRMSRNS and X86_MISFEATURE_PV_XXL

  4) Convert other code which benefits from the non-serializing variant
     to wrmsr_ns()
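
Step 1 can be modelled in userspace roughly as follows. This is only a
sketch of the intended semantics: in the kernel the choice would
presumably be made by alternatives patching on X86_FEATURE_WRMSRNS, not
by a runtime branch, and every name below (cpu_has_wrmsrns,
model_wrmsr(), model_wrmsrns(), native_wrmsr_ns_model(), last_write) is
hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for the X86_FEATURE_WRMSRNS feature check. */
static bool cpu_has_wrmsrns;

enum msr_write_kind {
	MSR_WRITE_NONE,
	MSR_WRITE_SERIALIZING,		/* models WRMSR */
	MSR_WRITE_NON_SERIALIZING,	/* models WRMSRNS */
};

/* Records the most recent modelled write so callers can inspect it. */
static struct {
	enum msr_write_kind kind;
	uint32_t msr;
	uint64_t val;
} last_write;

static void model_wrmsr(uint32_t msr, uint64_t val)
{
	last_write.kind = MSR_WRITE_SERIALIZING;
	last_write.msr = msr;
	last_write.val = val;
}

static void model_wrmsrns(uint32_t msr, uint64_t val)
{
	last_write.kind = MSR_WRITE_NON_SERIALIZING;
	last_write.msr = msr;
	last_write.val = val;
}

/*
 * Non-serializing write when available, otherwise fall back to the
 * serializing one. The fallback is always safe because a serializing
 * write is strictly stronger than what the caller asked for.
 */
static void native_wrmsr_ns_model(uint32_t msr, uint64_t val)
{
	if (cpu_has_wrmsrns)
		model_wrmsrns(msr, val);
	else
		model_wrmsr(msr, val);
}
```

Callers of such a helper only state that serialization is not required;
whether the CPU actually provides WRMSRNS is decided in one place.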

Thanks,

        tglx

* Re: [PATCH v10 03/38] x86/msr: Add the WRMSRNS instruction support
  2023-09-14 23:00     ` Thomas Gleixner
@ 2023-09-14 23:34       ` H. Peter Anvin
  2023-09-14 23:46       ` andrew.cooper3
  1 sibling, 0 replies; 88+ messages in thread
From: H. Peter Anvin @ 2023-09-14 23:34 UTC (permalink / raw)
  To: Thomas Gleixner, andrew.cooper3, Xin Li, linux-doc, linux-kernel,
	linux-edac, linux-hyperv, kvm, xen-devel
  Cc: mingo, bp, dave.hansen, x86, luto, pbonzini, seanjc, peterz,
	jgross, ravi.v.shankar, mhiramat, jiangshanlai

On September 14, 2023 4:00:39 PM PDT, Thomas Gleixner <tglx@linutronix.de> wrote:
>Andrew!
>
>On Thu, Sep 14 2023 at 15:05, andrew wrote:
>> On 14/09/2023 5:47 am, Xin Li wrote:
>>> +static __always_inline void wrmsrns(u32 msr, u64 val)
>>> +{
>>> +	__wrmsrns(msr, val, val >> 32);
>>> +}
>>
>> This API works in terms of this series where every WRMSRNS is hidden
>> behind a FRED check, but it's an awkward interface to use anywhere else
>> in the kernel.
>
>Agreed.
>
>> I fully understand that you expect all FRED capable systems to have
>> WRMSRNS, but it is not a hard requirement and you will end up with
>> simpler (and therefore better) logic by deleting the dependency.
>
>According to the CPU folks FRED systems are guaranteed to have WRMSRNS -
>I asked for that :). It's just not yet documented.
>
>But that aside, I agree that we should opt for the safe side with a
>fallback like the one you have in XEN even for the places which are
>strictly FRED dependent.
>
>> As a "normal" user of the WRMSR APIs, the programmer only cares about:
>>
>> 1) wrmsr() -> needs to be serialising
>> 2) wrmsr_ns() -> safe to be non-serialising
>
>Correct.
>
>> In Xen, I added something of the form:
>>
>> /* Non-serialising WRMSR, when available.  Falls back to a serialising
>> WRMSR. */
>> static inline void wrmsr_ns(uint32_t msr, uint32_t lo, uint32_t hi)
>> {
>>     /*
>>      * WRMSR is 2 bytes.  WRMSRNS is 3 bytes.  Pad WRMSR with a redundant CS
>>      * prefix to avoid a trailing NOP.
>>      */
>>     alternative_input(".byte 0x2e; wrmsr",
>>                       ".byte 0x0f,0x01,0xc6", X86_FEATURE_WRMSRNS,
>>                       "c" (msr), "a" (lo), "d" (hi));
>> }
>>
>> and despite what Juergen said, I'm going to recommend that you do wire
>> this through the paravirt infrastructure, for the benefit of regular
>> users having a nice API, not because XenPV is expecting to do something
>> wildly different here.
>
>I fundamentally hate adding this to the PV infrastructure. We don't want
>more PV ops, quite the contrary.
>
>For the initial use case at hand, there is an explicit FRED dependency
>and the code in question really wants to use WRMSRNS directly and not
>through a PV function call.
>
>I agree with your reasoning for the more generic use case where we can
>gain performance independent of FRED by using WRMSRNS for cases where
>the write has no serialization requirements.
>
>But this made me look into PV ops some more. For actual performance
>relevant code the current PV ops mechanics are a horrorshow when the op
>defaults to the native instruction.
>
>Let's look at wrmsrl():
>
>wrmsrl(msr, val)
> wrmsr(msr, (u32)val, (u32)(val >> 32))
>  paravirt_write_msr(msr, low, high)
>    PVOP_VCALL3(cpu.write_msr, msr, low, high)
>
>Which results in
>
>	mov	$msr, %edi
>	mov	$val, %rdx
>	mov	%edx, %esi
>	shr	$0x20, %rdx
>	call	native_write_msr
>
>and native_write_msr() does at minimum:
>
>	mov    %edi,%ecx
>	mov    %esi,%eax
>	wrmsr
>        ret
>
>In the worst case 'ret' is going through the return thunk. Not to talk
>about function prologues and whatever.
>
>This becomes even more silly for trivial instructions like STI/CLI or in
>the worst case paravirt_nop().
>
>The call makes only sense, when the native default is an actual
>function, but for the trivial cases it's a blatant engineering
>trainwreck.
>
>I wouldn't care at all if CONFIG_PARAVIRT_XXL would be the esoteric use
>case, but AFAICT it's default enabled on all major distros.
>
>So no. I'm fundamentally disagreeing with your recommendation. The way
>forward is:
>
>  1) Provide the native variant for wrmsrns(), i.e. rename the proposed
>     wrmsrns() to native_wrmsr_ns() and have the X86_FEATURE_WRMSRNS
>     safety net as you pointed out.
>
>     That function can be used in code which is guaranteed to be not
>     affected by the PV_XXL madness.
>
>  2) Come up with a sensible solution for the PV_XXL horrorshow
>
>  3) Implement a sane general variant of wrmsr_ns() which handles
>     both X86_FEATURE_WRMSRNS and X86_MISFEATURE_PV_XXL
>
>  4) Convert other code which benefits from the non-serializing variant
>     to wrmsr_ns()
>
>Thanks,
>
>        tglx
>

With regard to (2), IMO the only sensible solution is to have the ABI be that of the native instruction, and to have the PVXXL users take the full brunt of the overhead. That means that on a native or HVM machine, the proper code gets patched in inline, and the PVXXL code becomes a call to a stub which does the parameter marshalling before walking off back into C ABI land. The pvop further has to bear the full cost of providing the full native semantics, unless modified semantics have been agreed with the native maintainers and explicitly documented (notably, having explicit stubs for certain specific MSRs is entirely reasonable).
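
A rough userspace model of that stub shape, for illustration only: the call site leaves its arguments in the WRMSR registers (ECX, EAX, EDX), and only the PV stub pays for marshalling them into a C call. The names wrmsr_regs, pv_wrmsr_stub() and pv_write_msr_c() are made up for this sketch and are not existing kernel symbols.

```c
#include <stdint.h>

/* Hypothetical snapshot of the registers a WRMSR site has loaded. */
struct wrmsr_regs {
	uint32_t ecx;	/* MSR index */
	uint32_t eax;	/* low 32 bits of the value */
	uint32_t edx;	/* high 32 bits of the value */
};

/* Record of the last write, standing in for a real hypercall. */
static uint32_t last_msr;
static uint64_t last_val;

/* Hypothetical PV backend using the normal C calling convention. */
static void pv_write_msr_c(uint32_t msr, uint64_t val)
{
	last_msr = msr;
	last_val = val;
}

/* The stub: the only place that converts the native register ABI to C. */
static void pv_wrmsr_stub(const struct wrmsr_regs *regs)
{
	uint64_t val = ((uint64_t)regs->edx << 32) | regs->eax;

	pv_write_msr_c(regs->ecx, val);
}
```

In this model the native path never sees the stub at all; it would simply execute the instruction with the registers as they are.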

In case this sounds familiar, it is the pvops we were promised over 15 years ago, and yet somehow we never really got there. It is *also* similar, in an inside-out way, to the ABI marshalling used for legacy BIOS functions in the 16-bit startup code.

In-place "fat" paravirtualization in the Linux kernel has been a horror show, with the notable exception of UML, which is almost universally ignored yet manages to stay out of the way and keep working.

This is a classic case of Tragedy of The Commons, much like burning coal for power.


^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH v10 03/38] x86/msr: Add the WRMSRNS instruction support
  2023-09-14 23:00     ` Thomas Gleixner
  2023-09-14 23:34       ` H. Peter Anvin
@ 2023-09-14 23:46       ` andrew.cooper3
  2023-09-15  0:12         ` Thomas Gleixner
                           ` (2 more replies)
  1 sibling, 3 replies; 88+ messages in thread
From: andrew.cooper3 @ 2023-09-14 23:46 UTC (permalink / raw)
  To: Thomas Gleixner, Xin Li, linux-doc, linux-kernel, linux-edac,
	linux-hyperv, kvm, xen-devel
  Cc: mingo, bp, dave.hansen, x86, hpa, luto, pbonzini, seanjc, peterz,
	jgross, ravi.v.shankar, mhiramat, jiangshanlai

On 15/09/2023 12:00 am, Thomas Gleixner wrote:
>> and despite what Juergen said, I'm going to recommend that you do wire
>> this through the paravirt infrastructure, for the benefit of regular
>> users having a nice API, not because XenPV is expecting to do something
>> wildly different here.
> I fundamentally hate adding this to the PV infrastructure. We don't want
> more PV ops, quite the contrary.

What I meant was "there should be the two top-level APIs, and under the
covers they DTRT".  Part of doing the right thing is to wire up paravirt
for configs where that is specified.

Anything else is going to force people to write logic of the form:

    if (WRMSRNS && !XENPV)
        wrmsr_ns(...)
    else
        wrmsr(...)

which is going to be worse overall.  And there really is one example of
this antipattern already in the series.
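
To make the "two top-level APIs" idea concrete, here is a small userspace model of the decision each API would hide from its callers. Everything in it (cpu_caps, msr_backend, pick_wrmsr(), pick_wrmsr_ns()) is invented for the sketch, and in a real kernel the choice would presumably be made once by patching rather than by a branch at every call.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the CPU feature bits discussed above. */
struct cpu_caps {
	bool has_wrmsrns;	/* models X86_FEATURE_WRMSRNS */
	bool xen_pv;		/* models X86_FEATURE_XENPV */
};

enum msr_backend {
	BACKEND_WRMSR,		/* serialising native instruction */
	BACKEND_WRMSRNS,	/* non-serialising native instruction */
	BACKEND_PV,		/* paravirt hook */
};

/* wrmsr() must serialise, so WRMSRNS is never a candidate. */
static enum msr_backend pick_wrmsr(const struct cpu_caps *c)
{
	return c->xen_pv ? BACKEND_PV : BACKEND_WRMSR;
}

/* wrmsr_ns() may use WRMSRNS and silently falls back to WRMSR. */
static enum msr_backend pick_wrmsr_ns(const struct cpu_caps *c)
{
	if (c->xen_pv)
		return BACKEND_PV;
	return c->has_wrmsrns ? BACKEND_WRMSRNS : BACKEND_WRMSR;
}
```

The caller only states whether serialisation is required; the feature and paravirt checks stay under the covers.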


> For the initial use case at hand, there is an explicit FRED dependency
> and the code in question really wants to use WRMSRNS directly and not
> through a PV function call.
>
> I agree with your reasoning for the more generic use case where we can
> gain performance independent of FRED by using WRMSRNS for cases where
> the write has no serialization requirements.
>
> But this made me look into PV ops some more. For actual performance
> relevant code the current PV ops mechanics are a horrorshow when the op
> defaults to the native instruction.
>
> Let's look at wrmsrl():
>
> wrmsrl(msr, val)
>  wrmsr(msr, (u32)val, (u32)(val >> 32))
>   paravirt_write_msr(msr, low, high)
>     PVOP_VCALL3(cpu.write_msr, msr, low, high)
>
> Which results in
>
> 	mov	$msr, %edi
> 	mov	$val, %rdx
> 	mov	%edx, %esi
> 	shr	$0x20, %rdx
> 	call	native_write_msr
>
> and native_write_msr() does at minimum:
>
> 	mov    %edi,%ecx
> 	mov    %esi,%eax
> 	wrmsr
>         ret

Yeah, this is daft.  But it can also be fixed irrespective of WRMSRNS.
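
A note for anyone transcribing the quoted chain into real code: the high half has to be shifted before it is truncated, because casting to a 32-bit type first and then shifting by 32 is undefined behaviour in C. A minimal sketch, with msr_lo() and msr_hi() being names invented here:

```c
#include <stdint.h>

/* Low half: plain truncation to 32 bits is exactly what is wanted. */
static inline uint32_t msr_lo(uint64_t val)
{
	return (uint32_t)val;
}

/* High half: shift the 64-bit value first, then truncate. */
static inline uint32_t msr_hi(uint64_t val)
{
	return (uint32_t)(val >> 32);
}
```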

WRMSR has one complexity that most other PV-ops don't, and that's the
exception table reference for the instruction itself.

In a theoretical future, it ought to look like:

    mov    $msr, %ecx
    mov    $lo, %eax
    mov    $hi, %edx
    1: {call paravirt_blah(%rip) | cs...cs wrmsr | cs...cs wrmsrns }
    _ASM_EXTABLE(1b, ...)

In paravirt builds, the CALL needs to be the emitted form, because it
needs to function in very early boot.

But once the paravirt-ness has been chosen and alternatives run, the
as-native paths are fully inline.

The alternatives pass which processes this site needs, in the case where
it leaves the CALL in place, to clobber the EXTABLE reference.
CALL instructions can #GP, and you don't want to end up thinking you're
handling a WRMSR #GP when in fact it was a non-canonical function pointer.

> In the worst case 'ret' is going through the return thunk. Not to talk
> about function prologues and whatever.
>
> This becomes even more silly for trivial instructions like STI/CLI or in
> the worst case paravirt_nop().

STI/CLI are already magic.  Are they not inlined?

> The call makes only sense, when the native default is an actual
> function, but for the trivial cases it's a blatant engineering
> trainwreck.
>
> I wouldn't care at all if CONFIG_PARAVIRT_XXL would be the esoteric use
> case, but AFAICT it's default enabled on all major distros.
>
> So no. I'm fundamentally disagreeing with your recommendation. The way
> forward is:
>
>   1) Provide the native variant for wrmsrns(), i.e. rename the proposed
>      wrmsrns() to native_wrmsr_ns() and have the X86_FEATURE_WRMSRNS
>      safety net as you pointed out.
>
>      That function can be used in code which is guaranteed to be not
>      affected by the PV_XXL madness.
>
>   2) Come up with a sensible solution for the PV_XXL horrorshow
>
>   3) Implement a sane general variant of wrmsr_ns() which handles
>      both X86_FEATURE_WRMSRNS and X86_MISFEATURE_PV_XXL
>
>   4) Convert other code which benefits from the non-serializing variant
>      to wrmsr_ns()

Well - point 1 is mostly work that needs reverting as part of completing
point 3, and point 2 clearly needs doing irrespective of anything else.

Thanks,

~Andrew

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH v10 03/38] x86/msr: Add the WRMSRNS instruction support
  2023-09-14 23:46       ` andrew.cooper3
@ 2023-09-15  0:12         ` Thomas Gleixner
  2023-09-15  0:33           ` andrew.cooper3
  2023-09-15  0:42         ` Thomas Gleixner
  2023-09-15  1:01         ` H. Peter Anvin
  2 siblings, 1 reply; 88+ messages in thread
From: Thomas Gleixner @ 2023-09-15  0:12 UTC (permalink / raw)
  To: andrew.cooper3, Xin Li, linux-doc, linux-kernel, linux-edac,
	linux-hyperv, kvm, xen-devel
  Cc: mingo, bp, dave.hansen, x86, hpa, luto, pbonzini, seanjc, peterz,
	jgross, ravi.v.shankar, mhiramat, jiangshanlai

On Fri, Sep 15 2023 at 00:46, andrew wrote:
> On 15/09/2023 12:00 am, Thomas Gleixner wrote:
>> So no. I'm fundamentally disagreeing with your recommendation. The way
>> forward is:
>>
>>   1) Provide the native variant for wrmsrns(), i.e. rename the proposed
>>      wrmsrns() to native_wrmsr_ns() and have the X86_FEATURE_WRMSRNS
>>      safety net as you pointed out.
>>
>>      That function can be used in code which is guaranteed to be not
>>      affected by the PV_XXL madness.
>>
>>   2) Come up with a sensible solution for the PV_XXL horrorshow
>>
>>   3) Implement a sane general variant of wrmsr_ns() which handles
>>      both X86_FEATURE_WRMSRNS and X86_MISFEATURE_PV_XXL
>>
>>   4) Convert other code which benefits from the non-serializing variant
>>      to wrmsr_ns()
>
> Well - point 1 is mostly work that needs reverting as part of completing
> point 3, and point 2 clearly needs doing irrespective of anything else.

No. #1 has a value on its own independent of the general variant in #3.

>>      That function can be used in code which is guaranteed to be not
>>      affected by the PV_XXL madness.

That makes a lot of sense because it's guaranteed to generate better
code than whatever we come up with for the PV_XXL nonsense.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH v10 03/38] x86/msr: Add the WRMSRNS instruction support
  2023-09-15  0:12         ` Thomas Gleixner
@ 2023-09-15  0:33           ` andrew.cooper3
  2023-09-15  0:38             ` H. Peter Anvin
  0 siblings, 1 reply; 88+ messages in thread
From: andrew.cooper3 @ 2023-09-15  0:33 UTC (permalink / raw)
  To: Thomas Gleixner, Xin Li, linux-doc, linux-kernel, linux-edac,
	linux-hyperv, kvm, xen-devel
  Cc: mingo, bp, dave.hansen, x86, hpa, luto, pbonzini, seanjc, peterz,
	jgross, ravi.v.shankar, mhiramat, jiangshanlai

On 15/09/2023 1:12 am, Thomas Gleixner wrote:
> On Fri, Sep 15 2023 at 00:46, andrew wrote:
>> On 15/09/2023 12:00 am, Thomas Gleixner wrote:
>>> So no. I'm fundamentally disagreeing with your recommendation. The way
>>> forward is:
>>>
>>>   1) Provide the native variant for wrmsrns(), i.e. rename the proposed
>>>      wrmsrns() to native_wrmsr_ns() and have the X86_FEATURE_WRMSRNS
>>>      safety net as you pointed out.
>>>
>>>      That function can be used in code which is guaranteed to be not
>>>      affected by the PV_XXL madness.
>>>
>>>   2) Come up with a sensible solution for the PV_XXL horrorshow
>>>
>>>   3) Implement a sane general variant of wrmsr_ns() which handles
>>>      both X86_FEATURE_WRMSRNS and X86_MISFEATURE_PV_XXL
>>>
>>>   4) Convert other code which benefits from the non-serializing variant
>>>      to wrmsr_ns()
>> Well - point 1 is mostly work that needs reverting as part of completing
>> point 3, and point 2 clearly needs doing irrespective of anything else.
> No. #1 has a value on its own independent of the general variant in #3.
>
>>>      That function can be used in code which is guaranteed to be not
>>>      affected by the PV_XXL madness.
> That makes a lot of sense because it's guaranteed to generate better
> code than whatever we come up with for the PV_XXL nonsense.

It's an assumption about what "definitely won't" be paravirt in the future.

XenPV stack handling is almost-FRED-like and has been for the better
part of two decades.

You frequently complain that there's too much black magic holding XenPV
together.  A paravirt-FRED will reduce the differences vs native
substantially.

~Andrew

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH v10 03/38] x86/msr: Add the WRMSRNS instruction support
  2023-09-15  0:33           ` andrew.cooper3
@ 2023-09-15  0:38             ` H. Peter Anvin
  2023-09-15  1:46               ` andrew.cooper3
  0 siblings, 1 reply; 88+ messages in thread
From: H. Peter Anvin @ 2023-09-15  0:38 UTC (permalink / raw)
  To: andrew.cooper3, Thomas Gleixner, Xin Li, linux-doc, linux-kernel,
	linux-edac, linux-hyperv, kvm, xen-devel
  Cc: mingo, bp, dave.hansen, x86, luto, pbonzini, seanjc, peterz,
	jgross, ravi.v.shankar, mhiramat, jiangshanlai

On 9/14/23 17:33, andrew.cooper3@citrix.com wrote:
> 
> It's an assumption about what "definitely won't" be paravirt in the future.
> 
> XenPV stack handling is almost-FRED-like and has been for the better
> part of two decades.
> 
> You frequently complain that there's too much black magic holding XenPV
> together.  A paravirt-FRED will reduce the differences vs native
> substantially.
> 

Call it "paravirtualized exception handling." In that sense, the 
refactoring of the exception handling to benefit FRED is definitely 
useful for reducing paravirtualization. The FRED-specific code is 
largely trivial, and presumably what you would do is to replace the FRED 
wrapper with a Xen wrapper and call the common handler routines.

	-hpa


^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH v10 03/38] x86/msr: Add the WRMSRNS instruction support
  2023-09-14 23:46       ` andrew.cooper3
  2023-09-15  0:12         ` Thomas Gleixner
@ 2023-09-15  0:42         ` Thomas Gleixner
  2023-09-15  1:01         ` H. Peter Anvin
  2 siblings, 0 replies; 88+ messages in thread
From: Thomas Gleixner @ 2023-09-15  0:42 UTC (permalink / raw)
  To: andrew.cooper3, Xin Li, linux-doc, linux-kernel, linux-edac,
	linux-hyperv, kvm, xen-devel
  Cc: mingo, bp, dave.hansen, x86, hpa, luto, pbonzini, seanjc, peterz,
	jgross, ravi.v.shankar, mhiramat, jiangshanlai

On Fri, Sep 15 2023 at 00:46, andrew wrote:
> On 15/09/2023 12:00 am, Thomas Gleixner wrote:
> What I meant was "there should be the two top-level APIs, and under the
> covers they DTRT".  Part of doing the right thing is to wire up paravirt
> for configs where that is specified.
>
> Anything else is going to force people to write logic of the form:
>
>     if (WRMSRNS && !XENPV)
>         wrmsr_ns(...)
>     else
>         wrmsr(...)
>
> which is going to be worse overall.

I already agreed with that for generic code which might be affected by
PV. But code which explicitly depends on something which can never be
affected by PV _and_ is in a performance sensitive code path really
wants to be able to use the native variant explicitly.

> And there really is one example of this antipattern already in the
> series.

No. There is no antipattern in this series. The only place which uses
wrmsrns() is:

@@ -70,9 +70,13 @@ static inline void update_task_stack(str
 #ifdef CONFIG_X86_32
 	this_cpu_write(cpu_tss_rw.x86_tss.sp1, task->thread.sp0);
 #else
-	/* Xen PV enters the kernel on the thread stack. */
-	if (cpu_feature_enabled(X86_FEATURE_XENPV))
+	if (cpu_feature_enabled(X86_FEATURE_FRED)) {
+		/* WRMSRNS is a baseline feature for FRED. */
+		wrmsrns(MSR_IA32_FRED_RSP0, (unsigned long)task_stack_page(task) + THREAD_SIZE);
+	} else if (cpu_feature_enabled(X86_FEATURE_XENPV)) {
+		/* Xen PV enters the kernel on the thread stack. */
 		load_sp0(task_top_of_stack(task));
+	}
 #endif
 }

The XENPV condition exists already today and is required independent of
FRED, no?

I deliberately distinguished #1 and #3 on my proposed todo list exactly
because the above use case really wants #1 without the extra bells and
whistles of a generic PV patchable wrmsr_ns() variant. Why?

  No matter how clever the enhanced PV implementation might be, it is
  guaranteed to generate worse code than the straightforward native
  inline assembly. Simply because it has to prevent the compiler from
  being overly clever on optimizations as it obviously mandates wider
  register restrictions, while the pure native variant (independent of
  the availability of X86_FEATURE_WRMSRNS) only mandates the requirements
  of WRMSR[NS], but not the extra register indirection of the call ABI.

I'm not debating that any other code pattern like you pointed out in
some generic code would be horrible, but I'm not buying your strawman
related to this particular usage site.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH v10 03/38] x86/msr: Add the WRMSRNS instruction support
  2023-09-14 23:46       ` andrew.cooper3
  2023-09-15  0:12         ` Thomas Gleixner
  2023-09-15  0:42         ` Thomas Gleixner
@ 2023-09-15  1:01         ` H. Peter Anvin
  2023-09-15  1:16           ` andrew.cooper3
  2 siblings, 1 reply; 88+ messages in thread
From: H. Peter Anvin @ 2023-09-15  1:01 UTC (permalink / raw)
  To: andrew.cooper3, Thomas Gleixner, Xin Li, linux-doc, linux-kernel,
	linux-edac, linux-hyperv, kvm, xen-devel
  Cc: mingo, bp, dave.hansen, x86, luto, pbonzini, seanjc, peterz,
	jgross, ravi.v.shankar, mhiramat, jiangshanlai

> WRMSR has one complexity that most other PV-ops don't, and that's the
> exception table reference for the instruction itself.
> 
> In a theoretical future, it ought to look like:
> 
>     mov    $msr, %ecx
>     mov    $lo, %eax
>     mov    $hi, %edx
>     1: {call paravirt_blah(%rip) | cs...cs wrmsr | cs...cs wrmsrns }
>     _ASM_EXTABLE(1b, ...)
> 
> In paravirt builds, the CALL needs to be the emitted form, because it
> needs to function in very early boot.
> 
> But once the paravirt-ness has been chosen and alternatives run, the
> as-native paths are fully inline.
> 
> The alternatives pass which processes this site needs, in the case where
> it leaves the CALL in place, to clobber the EXTABLE reference.
> CALL instructions can #GP, and you don't want to end up thinking you're
> handling a WRMSR #GP when in fact it was a non-canonical function pointer.


On 9/14/23 17:36, andrew.cooper3@citrix.com wrote:
> On 15/09/2023 1:07 am, H. Peter Anvin wrote:
>> Is *that* your concern?! Just put a NOP before WRMSR – you need
>> padding NOP bytes anyway – and the extable entry is no longer at the
>> same address. Problem solved.
>>
>> Either that, or use a direct call, which can't #GP in the address
>> range used by the kernel.
>
> For non-paravirt builds, I really hope the inlining DoesTheRightThing.
> If it doesn't, let's fix it to do so.
>
> For paravirt builds, the emitted form must be the indirect call so it
> can function in boot prior to alternatives running [1].

No, it doesn't. You always have the option of a direct call to an
indirect JMP. This is in fact exactly what userspace does in the form of 
the PLT.
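
As a loose userspace analogy of that PLT-style arrangement (not kernel code, and all names here are hypothetical): call sites make a direct call to a fixed trampoline, and only the trampoline goes through a pointer that can be retargeted later.

```c
#include <stdint.h>

typedef void (*msr_write_fn)(uint32_t msr, uint64_t val);

/* Counters so the effect of each backend is observable in this model. */
static unsigned int native_writes;
static unsigned int pv_writes;

static void native_write_msr_model(uint32_t msr, uint64_t val)
{
	(void)msr;
	(void)val;
	native_writes++;
}

static void pv_write_msr_model(uint32_t msr, uint64_t val)
{
	(void)msr;
	(void)val;
	pv_writes++;
}

/* The slot the trampoline jumps through; plays the role of a GOT entry. */
static msr_write_fn msr_write_slot = native_write_msr_model;

/* Fixed-address trampoline: call sites use a direct call to this. */
static void msr_write_trampoline(uint32_t msr, uint64_t val)
{
	/* A compiler is free to emit this tail call as an indirect JMP. */
	msr_write_slot(msr, val);
}
```

The call site itself never contains an indirect branch, which is the property being argued for; whether the trampoline ends up as a JMP depends on the compiler in this C model.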

> So you still need some way of putting the EXTABLE reference at the
> emitted site, not in the .altinstr_replacement section where the
> WRMSR{,NS} instruction lives.  This needs to be at build time because
> the EXTABLE references aren't shuffled at runtime.
>
> How else do you propose getting an extable reference to midway through
> an instruction on the "wrong" part of an alternative?

Well, obviously there has to be a magic inline at the patch site. It 
ends up looking something like this:

	asm volatile("1:"
		     ALTERNATIVE_2("call pv_wrmsr(%%rip)",
			"nop; wrmsr", X86_FEATURE_NATIVE_MSR,
			"nop; wrmsrns", X86_FEATURE_WRMSRNS)
		     "2:"
		     _ASM_EXTABLE_TYPE(1b+1, 2b, EX_TYPE_WRMSR)
		     : : "c" (msr), "a" (low), "d" (high) : "memory");


[one can argue whether or not WRMSRNS specifically should require 
"memory" or not.]
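
To spell out why the entry is registered at 1b+1: all three alternatives start at label 1, but the native ones begin with a one-byte NOP, so the MSR instruction sits at 1b+1, whereas a fault taken on the CALL itself would be reported at 1b (assuming the fault is attributed to the CALL instruction). A toy lookup with invented addresses, using a linear search purely for brevity; extable_entry and extable_fixup() are names made up for this sketch:

```c
#include <stddef.h>
#include <stdint.h>

/* Toy exception table entry: faulting instruction -> fixup address. */
struct extable_entry {
	uintptr_t insn;
	uintptr_t fixup;
};

/* Return the fixup address for fault_ip, or 0 if no entry matches. */
static uintptr_t extable_fixup(const struct extable_entry *tbl, size_t n,
			       uintptr_t fault_ip)
{
	for (size_t i = 0; i < n; i++) {
		if (tbl[i].insn == fault_ip)
			return tbl[i].fixup;
	}
	return 0;
}
```

With the site at a made-up address 0x1000 and the entry registered at 0x1001, a fault at 0x1001 (the WRMSR behind the NOP) finds the fixup, while a fault at 0x1000 (the CALL) does not.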

The whole bit with alternatives and pvops being separate is a major 
maintainability problem, and honestly it never made any sense in the 
first place. Never have two mechanisms to do one job; it makes it harder 
to grok their interactions.

As an alternative to the NOP, the EX_TYPE_*MSR* handlers could simply 
look for a CALL opcode at the origin.
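
A sketch of what such an opcode check could look like at the byte level, assuming only the two near-CALL encodings matter: E8 rel32 for a direct call, and FF /2 for an indirect one. insn_is_call() is a hypothetical helper, not an existing kernel function, and it deliberately ignores prefixes.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Return true if the bytes at insn start a near CALL.
 * The caller must make at least two bytes readable.
 */
static bool insn_is_call(const uint8_t *insn)
{
	if (insn[0] == 0xe8)		/* CALL rel32 */
		return true;

	/* FF /2: indirect near CALL, i.e. ModRM.reg == 2 */
	return insn[0] == 0xff && ((insn[1] >> 3) & 7) == 2;
}
```

A handler built on this would skip the WRMSR fixup whenever the origin still holds the CALL, for example FF 15 for a RIP-relative indirect call, and apply it for 0F 30 (WRMSR) or 0F 01 C6 (WRMSRNS).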

	-hpa

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH v10 08/38] x86/cpufeatures: Add the cpu feature bit for FRED
  2023-09-14 13:15       ` andrew.cooper3
@ 2023-09-15  1:07         ` Thomas Gleixner
  2023-09-15  5:27           ` Juergen Gross
  0 siblings, 1 reply; 88+ messages in thread
From: Thomas Gleixner @ 2023-09-15  1:07 UTC (permalink / raw)
  To: andrew.cooper3, Jan Beulich, Juergen Gross
  Cc: mingo, bp, dave.hansen, x86, hpa, luto, pbonzini, seanjc, peterz,
	ravi.v.shankar, mhiramat, jiangshanlai, Xin Li, linux-doc,
	linux-kernel, linux-edac, linux-hyperv, kvm, xen-devel

On Thu, Sep 14 2023 at 14:15, andrew wrote:
> PV guests are never going to see FRED (or LKGS for that matter) because
> it advertises too much stuff which simply traps because the kernel is in
> CPL3.
>
> That said, the 64bit PV ABI is a whole lot closer to FRED than it is to
> IDT delivery.  (Almost as if we decided 15 years ago that giving the PV
> guest kernel a good stack and GSbase was the right thing to do...)

No argument about that.

> In some copious free time, I think we ought to provide a
> minorly-paravirt FRED to PV guests because there are still some
> improvements available as low hanging fruit.
>
> My plan was to have a PV hypervisor leaf advertising paravirt versions
> of hardware features, so a guest could see "I don't have architectural
> FRED, but I do have paravirt-FRED which is as similar as we can
> reasonably make it".  The same goes for a whole bunch of other features.

*GROAN*

I told you before that we want less paravirt nonsense and not more. I'm
serious about that. XENPV CPL3 virtualization is a dead horse from a
technical POV. No point in wasting brain cycles to enhance the zombie
unless you can get rid of the existing PV nonsense, which you can't for
obvious reasons.

That said, we can debate this once the more fundamental issues of
XEN[PV] have been addressed. I expect that to happen quite some time
after I retired :)

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH v10 03/38] x86/msr: Add the WRMSRNS instruction support
  2023-09-15  1:01         ` H. Peter Anvin
@ 2023-09-15  1:16           ` andrew.cooper3
  2023-09-15  5:32             ` Juergen Gross
  2023-09-20 15:00             ` Peter Zijlstra
  0 siblings, 2 replies; 88+ messages in thread
From: andrew.cooper3 @ 2023-09-15  1:16 UTC (permalink / raw)
  To: H. Peter Anvin, Thomas Gleixner, Xin Li, linux-doc, linux-kernel,
	linux-edac, linux-hyperv, kvm, xen-devel
  Cc: mingo, bp, dave.hansen, x86, luto, pbonzini, seanjc, peterz,
	jgross, ravi.v.shankar, mhiramat, jiangshanlai

On 15/09/2023 2:01 am, H. Peter Anvin wrote:
> The whole bit with alternatives and pvops being separate is a major
> maintainability problem, and honestly it never made any sense in the
> first place. Never have two mechanisms to do one job; it makes it
> harder to grok their interactions.

This bit is easy.

Juergen has already done the work to delete one of these two patching
mechanisms and replace it with the other.

https://lore.kernel.org/lkml/a32e211f-4add-4fb2-9e5a-480ae9b9bbf2@suse.com/

Unfortunately, it's only collecting pings and tumbleweeds.

~Andrew

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH v10 03/38] x86/msr: Add the WRMSRNS instruction support
  2023-09-15  0:38             ` H. Peter Anvin
@ 2023-09-15  1:46               ` andrew.cooper3
  2023-09-15  2:06                 ` H. Peter Anvin
  0 siblings, 1 reply; 88+ messages in thread
From: andrew.cooper3 @ 2023-09-15  1:46 UTC (permalink / raw)
  To: H. Peter Anvin, Thomas Gleixner, Xin Li, linux-doc, linux-kernel,
	linux-edac, linux-hyperv, kvm, xen-devel
  Cc: mingo, bp, dave.hansen, x86, luto, pbonzini, seanjc, peterz,
	jgross, ravi.v.shankar, mhiramat, jiangshanlai

On 15/09/2023 1:38 am, H. Peter Anvin wrote:
> On 9/14/23 17:33, andrew.cooper3@citrix.com wrote:
>>
>> It's an assumption about what "definitely won't" be paravirt in the
>> future.
>>
>> XenPV stack handling is almost-FRED-like and has been for the better
>> part of two decades.
>>
>> You frequently complain that there's too much black magic holding XenPV
>> together.  A paravirt-FRED will reduce the differences vs native
>> substantially.
>>
>
> Call it "paravirtualized exception handling." In that sense, the
> refactoring of the exception handling to benefit FRED is definitely
> useful for reducing paravirtualization. The FRED-specific code is
> largely trivial, and presumably what you would do is to replace the
> FRED wrapper with a Xen wrapper and call the common handler routines.

Why do only half the job?

There's no need for any Xen wrappers at all when XenPV can use the
native FRED paths, as long as ERETU, ERETS and the relevant MSRs can be
paravirt (sure - with an interface that sucks less than right now) so
they're not taking the #GP/emulate in Xen path.

And this can work on all hardware with a slightly-future version of Xen
and Linux, because it's just a minor adjustment to how Xen writes the
exception frame on the guests stack as part of event delivery.

~Andrew

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH v10 03/38] x86/msr: Add the WRMSRNS instruction support
  2023-09-15  1:46               ` andrew.cooper3
@ 2023-09-15  2:06                 ` H. Peter Anvin
  0 siblings, 0 replies; 88+ messages in thread
From: H. Peter Anvin @ 2023-09-15  2:06 UTC (permalink / raw)
  To: andrew.cooper3, Thomas Gleixner, Xin Li, linux-doc, linux-kernel,
	linux-edac, linux-hyperv, kvm, xen-devel
  Cc: mingo, bp, dave.hansen, x86, luto, pbonzini, seanjc, peterz,
	jgross, ravi.v.shankar, mhiramat, jiangshanlai

On 9/14/23 18:46, andrew.cooper3@citrix.com wrote:
> On 15/09/2023 1:38 am, H. Peter Anvin wrote:
>> On 9/14/23 17:33, andrew.cooper3@citrix.com wrote:
>>>
>>> It's an assumption about what "definitely won't" be paravirt in the
>>> future.
>>>
>>> XenPV stack handling is almost-FRED-like and has been for the better
>>> part of two decades.
>>>
>>> You frequently complain that there's too much black magic holding XenPV
>>> together.  A paravirt-FRED will reduce the differences vs native
>>> substantially.
>>>
>>
>> Call it "paravirtualized exception handling." In that sense, the
>> refactoring of the exception handling to benefit FRED is definitely
>> useful for reducing paravirtualization. The FRED-specific code is
>> largely trivial, and presumably what you would do is to replace the
>> FRED wrapper with a Xen wrapper and call the common handler routines.
> 
> Why do only half the job?
> 
> There's no need for any Xen wrappers at all when XenPV can use the
> native FRED paths, as long as ERETU, ERETS and the relevant MSRs can be
> paravirt (sure - with an interface that sucks less than right now) so
> they're not taking the #GP/emulate in Xen path.
> 
> And this can work on all hardware with a slightly-future version of Xen
> and Linux, because it's just a minor adjustment to how Xen writes the
> exception frame on the guests stack as part of event delivery.
> 

It's not about doing "half the job", it's about using the proper 
abstraction mechanism. By all means, if you can join the common code 
flow earlier, do so, but paravirtualizing the entry/exit stubs, which are 
the *only* place ERETU and ERETS show up, is just crazy.

Similarly, nearly all the MSRs are just configuration setup. The only 
one which has any kind of performance relevance is the stack setup (RSP0).

	-hpa


^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH v10 08/38] x86/cpufeatures: Add the cpu feature bit for FRED
  2023-09-15  1:07         ` Thomas Gleixner
@ 2023-09-15  5:27           ` Juergen Gross
  0 siblings, 0 replies; 88+ messages in thread
From: Juergen Gross @ 2023-09-15  5:27 UTC (permalink / raw)
  To: Thomas Gleixner, andrew.cooper3, Jan Beulich
  Cc: mingo, bp, dave.hansen, x86, hpa, luto, pbonzini, seanjc, peterz,
	ravi.v.shankar, mhiramat, jiangshanlai, Xin Li, linux-doc,
	linux-kernel, linux-edac, linux-hyperv, kvm, xen-devel


On 15.09.23 03:07, Thomas Gleixner wrote:
> On Thu, Sep 14 2023 at 14:15, andrew wrote:
>> PV guests are never going to see FRED (or LKGS for that matter) because
>> it advertises too much stuff which simply traps because the kernel is in
>> CPL3.
>>
>> That said, the 64bit PV ABI is a whole lot closer to FRED than it is to
>> IDT delivery.  (Almost as if we decided 15 years ago that giving the PV
>> guest kernel a good stack and GSbase was the right thing to do...)
> 
> No argument about that.
> 
>> In some copious free time, I think we ought to provide a
>> minorly-paravirt FRED to PV guests because there are still some
>> improvements available as low hanging fruit.
>>
>> My plan was to have a PV hypervisor leaf advertising paravirt versions
>> of hardware features, so a guest could see "I don't have architectural
>> FRED, but I do have paravirt-FRED which is as similar as we can
>> reasonably make it".  The same goes for a whole bunch of other features.
> 
> *GROAN*
> 
> I told you before that we want less paravirt nonsense and not more.

I agree.

We will still have to support the PV stuff for non-FRED hypervisors even with
pv-FRED being available on new Xen. So adding pv-FRED would just add more PV
interfaces without the ability to remove old stuff.


Juergen



^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH v10 03/38] x86/msr: Add the WRMSRNS instruction support
  2023-09-15  1:16           ` andrew.cooper3
@ 2023-09-15  5:32             ` Juergen Gross
  2023-09-20 15:00             ` Peter Zijlstra
  1 sibling, 0 replies; 88+ messages in thread
From: Juergen Gross @ 2023-09-15  5:32 UTC (permalink / raw)
  To: andrew.cooper3, H. Peter Anvin, Thomas Gleixner, Xin Li,
	linux-doc, linux-kernel, linux-edac, linux-hyperv, kvm, xen-devel
  Cc: mingo, bp, dave.hansen, x86, luto, pbonzini, seanjc, peterz,
	ravi.v.shankar, mhiramat, jiangshanlai


On 15.09.23 03:16, andrew.cooper3@citrix.com wrote:
> On 15/09/2023 2:01 am, H. Peter Anvin wrote:
>> The whole bit with alternatives and pvops being separate is a major
>> maintainability problem, and honestly it never made any sense in the
>> first place. Never have two mechanisms to do one job; it makes it
>> harder to grok their interactions.
> 
> This bit is easy.
> 
> Juergen has already done the work to delete one of these two patching
> mechanisms and replace it with the other.
> 
> https://lore.kernel.org/lkml/a32e211f-4add-4fb2-9e5a-480ae9b9bbf2@suse.com/
> 
> Unfortunately, it's only collecting pings and tumbleweeds.

Indeed.

Unfortunately there is probably some objtool support needed for that, which I'm
not sure how to implement.


Juergen


^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH v10 02/38] x86/opcode: Add the WRMSRNS instruction to the x86 opcode map
  2023-09-14  4:47 ` [PATCH v10 02/38] x86/opcode: Add the WRMSRNS instruction to the x86 opcode map Xin Li
@ 2023-09-15  5:47   ` Masami Hiramatsu
  0 siblings, 0 replies; 88+ messages in thread
From: Masami Hiramatsu @ 2023-09-15  5:47 UTC (permalink / raw)
  To: Xin Li
  Cc: linux-doc, linux-kernel, linux-edac, linux-hyperv, kvm, xen-devel,
	tglx, mingo, bp, dave.hansen, x86, hpa, luto, pbonzini, seanjc,
	peterz, jgross, ravi.v.shankar, mhiramat, andrew.cooper3,
	jiangshanlai

On Wed, 13 Sep 2023 21:47:29 -0700
Xin Li <xin3.li@intel.com> wrote:

> Add the opcode used by WRMSRNS, which is the non-serializing version of
> WRMSR and may replace it to improve performance, to the x86 opcode map.
> 
> Tested-by: Shan Kang <shan.kang@intel.com>
> Signed-off-by: Xin Li <xin3.li@intel.com>

This looks good to me.

Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>

Thanks,

> ---
>  arch/x86/lib/x86-opcode-map.txt       | 2 +-
>  tools/arch/x86/lib/x86-opcode-map.txt | 2 +-
>  2 files changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/lib/x86-opcode-map.txt b/arch/x86/lib/x86-opcode-map.txt
> index 5168ee0360b2..1efe1d9bf5ce 100644
> --- a/arch/x86/lib/x86-opcode-map.txt
> +++ b/arch/x86/lib/x86-opcode-map.txt
> @@ -1051,7 +1051,7 @@ GrpTable: Grp6
>  EndTable
>  
>  GrpTable: Grp7
> -0: SGDT Ms | VMCALL (001),(11B) | VMLAUNCH (010),(11B) | VMRESUME (011),(11B) | VMXOFF (100),(11B) | PCONFIG (101),(11B) | ENCLV (000),(11B)
> +0: SGDT Ms | VMCALL (001),(11B) | VMLAUNCH (010),(11B) | VMRESUME (011),(11B) | VMXOFF (100),(11B) | PCONFIG (101),(11B) | ENCLV (000),(11B) | WRMSRNS (110),(11B)
>  1: SIDT Ms | MONITOR (000),(11B) | MWAIT (001),(11B) | CLAC (010),(11B) | STAC (011),(11B) | ENCLS (111),(11B)
>  2: LGDT Ms | XGETBV (000),(11B) | XSETBV (001),(11B) | VMFUNC (100),(11B) | XEND (101)(11B) | XTEST (110)(11B) | ENCLU (111),(11B)
>  3: LIDT Ms
> diff --git a/tools/arch/x86/lib/x86-opcode-map.txt b/tools/arch/x86/lib/x86-opcode-map.txt
> index 5168ee0360b2..1efe1d9bf5ce 100644
> --- a/tools/arch/x86/lib/x86-opcode-map.txt
> +++ b/tools/arch/x86/lib/x86-opcode-map.txt
> @@ -1051,7 +1051,7 @@ GrpTable: Grp6
>  EndTable
>  
>  GrpTable: Grp7
> -0: SGDT Ms | VMCALL (001),(11B) | VMLAUNCH (010),(11B) | VMRESUME (011),(11B) | VMXOFF (100),(11B) | PCONFIG (101),(11B) | ENCLV (000),(11B)
> +0: SGDT Ms | VMCALL (001),(11B) | VMLAUNCH (010),(11B) | VMRESUME (011),(11B) | VMXOFF (100),(11B) | PCONFIG (101),(11B) | ENCLV (000),(11B) | WRMSRNS (110),(11B)
>  1: SIDT Ms | MONITOR (000),(11B) | MWAIT (001),(11B) | CLAC (010),(11B) | STAC (011),(11B) | ENCLS (111),(11B)
>  2: LGDT Ms | XGETBV (000),(11B) | XSETBV (001),(11B) | VMFUNC (100),(11B) | XEND (101)(11B) | XTEST (110)(11B) | ENCLU (111),(11B)
>  3: LIDT Ms
> -- 
> 2.34.1
> 


-- 
Masami Hiramatsu (Google) <mhiramat@kernel.org>

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH v10 36/38] x86/fred: Add fred_syscall_init()
  2023-09-14  4:48 ` [PATCH v10 36/38] x86/fred: Add fred_syscall_init() Xin Li
@ 2023-09-19  8:28   ` Thomas Gleixner
  2023-09-20  4:33     ` Li, Xin3
  0 siblings, 1 reply; 88+ messages in thread
From: Thomas Gleixner @ 2023-09-19  8:28 UTC (permalink / raw)
  To: Xin Li, linux-doc, linux-kernel, linux-edac, linux-hyperv, kvm,
	xen-devel
  Cc: mingo, bp, dave.hansen, x86, hpa, luto, pbonzini, seanjc, peterz,
	jgross, ravi.v.shankar, mhiramat, andrew.cooper3, jiangshanlai

On Wed, Sep 13 2023 at 21:48, Xin Li wrote:
> +static inline void fred_syscall_init(void)
> +{
> +	/*
> +	 * Per FRED spec 5.0, FRED uses the ring 3 FRED entrypoint for SYSCALL
> +	 * and SYSENTER, and ERETU is the only legit instruction to return to
> +	 * ring 3, as a result there is _no_ need to setup the SYSCALL and
> +	 * SYSENTER MSRs.
> +	 *
> +	 * Note, both sysexit and sysret cause #UD when FRED is enabled.
> +	 */
> +	wrmsrl(MSR_LSTAR, 0ULL);
> +	wrmsrl_cstar(0ULL);

That write is pointless. See the comment in wrmsrl_cstar().

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 88+ messages in thread

* RE: [PATCH v10 36/38] x86/fred: Add fred_syscall_init()
  2023-09-19  8:28   ` Thomas Gleixner
@ 2023-09-20  4:33     ` Li, Xin3
  2023-09-20  8:18       ` Thomas Gleixner
  0 siblings, 1 reply; 88+ messages in thread
From: Li, Xin3 @ 2023-09-20  4:33 UTC (permalink / raw)
  To: Thomas Gleixner, linux-doc@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-edac@vger.kernel.org,
	linux-hyperv@vger.kernel.org, kvm@vger.kernel.org,
	xen-devel@lists.xenproject.org
  Cc: mingo@redhat.com, bp@alien8.de, dave.hansen@linux.intel.com,
	x86@kernel.org, hpa@zytor.com, Lutomirski, Andy,
	pbonzini@redhat.com, Christopherson, Sean, peterz@infradead.org,
	Gross, Jurgen, Shankar, Ravi V, mhiramat@kernel.org,
	andrew.cooper3@citrix.com, jiangshanlai@gmail.com

> > +static inline void fred_syscall_init(void)
> > +{
> > +	/*
> > +	 * Per FRED spec 5.0, FRED uses the ring 3 FRED entrypoint for SYSCALL
> > +	 * and SYSENTER, and ERETU is the only legit instruction to return to
> > +	 * ring 3, as a result there is _no_ need to setup the SYSCALL and
> > +	 * SYSENTER MSRs.
> > +	 *
> > +	 * Note, both sysexit and sysret cause #UD when FRED is enabled.
> > +	 */
> > +	wrmsrl(MSR_LSTAR, 0ULL);
> > +	wrmsrl_cstar(0ULL);
> 
> That write is pointless. See the comment in wrmsrl_cstar().

What I heard is that AMD is going to support FRED.

Both LSTAR and CSTAR have no function when FRED is enabled, so maybe
just do NOT write to them?

Thanks!
    Xin


^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH v10 03/38] x86/msr: Add the WRMSRNS instruction support
  2023-09-14  4:47 ` [PATCH v10 03/38] x86/msr: Add the WRMSRNS instruction support Xin Li
  2023-09-14  6:02   ` Juergen Gross
  2023-09-14 14:05   ` andrew.cooper3
@ 2023-09-20  7:58   ` Nikolay Borisov
  2023-09-20  8:18     ` Li, Xin3
  2 siblings, 1 reply; 88+ messages in thread
From: Nikolay Borisov @ 2023-09-20  7:58 UTC (permalink / raw)
  To: Xin Li, linux-doc, linux-kernel, linux-edac, linux-hyperv, kvm,
	xen-devel
  Cc: tglx, mingo, bp, dave.hansen, x86, hpa, luto, pbonzini, seanjc,
	peterz, jgross, ravi.v.shankar, mhiramat, andrew.cooper3,
	jiangshanlai



On 14.09.23 at 7:47, Xin Li wrote:
> Add an always inline API __wrmsrns() to embed the WRMSRNS instruction
> into the code.
> 
> Tested-by: Shan Kang <shan.kang@intel.com>
> Signed-off-by: Xin Li <xin3.li@intel.com>
> ---
>   arch/x86/include/asm/msr.h | 18 ++++++++++++++++++
>   1 file changed, 18 insertions(+)
> 
> diff --git a/arch/x86/include/asm/msr.h b/arch/x86/include/asm/msr.h
> index 65ec1965cd28..c284ff9ebe67 100644
> --- a/arch/x86/include/asm/msr.h
> +++ b/arch/x86/include/asm/msr.h
> @@ -97,6 +97,19 @@ static __always_inline void __wrmsr(unsigned int msr, u32 low, u32 high)
>   		     : : "c" (msr), "a"(low), "d" (high) : "memory");
>   }
>   
> +/*
> + * WRMSRNS behaves exactly like WRMSR with the only difference being
> + * that it is not a serializing instruction by default.
> + */
> +static __always_inline void __wrmsrns(u32 msr, u32 low, u32 high)

Shouldn't this be named wrmsrns_safe, since it has exception handling
similar to the current wrmsrl_safe?

> +{
> +	/* Instruction opcode for WRMSRNS; supported in binutils >= 2.40. */
> +	asm volatile("1: .byte 0x0f,0x01,0xc6\n"
> +		     "2:\n"
> +		     _ASM_EXTABLE_TYPE(1b, 2b, EX_TYPE_WRMSR)
> +		     : : "c" (msr), "a"(low), "d" (high));
> +}
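
For reference, the three bytes emitted above line up with the Grp7
opcode-map entry "WRMSRNS (110),(11B)" added in patch 02/38: 0x0f 0x01
selects Grp7, and the ModRM byte 0xc6 splits into mod=11B, reg=000 and
rm=110. A minimal userspace sketch of that split follows; the struct and
function names are hypothetical and are not kernel API.

```c
#include <stdint.h>

/* Hypothetical helper type: the three fields of an x86 ModRM byte. */
struct modrm_fields {
	uint8_t mod;	/* bits 7:6, 11B means register-direct form */
	uint8_t reg;	/* bits 5:3, selects the row in a GrpTable */
	uint8_t rm;	/* bits 2:0, selects the entry within that row */
};

/* Split a ModRM byte into its mod/reg/rm fields. */
static struct modrm_fields decode_modrm(uint8_t modrm)
{
	struct modrm_fields f = {
		.mod = (uint8_t)(modrm >> 6),
		.reg = (uint8_t)((modrm >> 3) & 0x7),
		.rm  = (uint8_t)(modrm & 0x7),
	};
	return f;
}
```

Calling decode_modrm(0xc6) gives mod=3, reg=0, rm=6, which is row "0:"
of Grp7 with the (110),(11B) qualifier.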
> +


<snip>

^ permalink raw reply	[flat|nested] 88+ messages in thread

* RE: [PATCH v10 36/38] x86/fred: Add fred_syscall_init()
  2023-09-20  4:33     ` Li, Xin3
@ 2023-09-20  8:18       ` Thomas Gleixner
  2023-09-21  2:24         ` H. Peter Anvin
  0 siblings, 1 reply; 88+ messages in thread
From: Thomas Gleixner @ 2023-09-20  8:18 UTC (permalink / raw)
  To: Li, Xin3, linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-edac@vger.kernel.org, linux-hyperv@vger.kernel.org,
	kvm@vger.kernel.org, xen-devel@lists.xenproject.org
  Cc: mingo@redhat.com, bp@alien8.de, dave.hansen@linux.intel.com,
	x86@kernel.org, hpa@zytor.com, Lutomirski, Andy,
	pbonzini@redhat.com, Christopherson, Sean, peterz@infradead.org,
	Gross, Jurgen, Shankar, Ravi V, mhiramat@kernel.org,
	andrew.cooper3@citrix.com, jiangshanlai@gmail.com

On Wed, Sep 20 2023 at 04:33, Li, Xin3 wrote:
>> > +static inline void fred_syscall_init(void)
>> > +{
>> > +	/*
>> > +	 * Per FRED spec 5.0, FRED uses the ring 3 FRED entrypoint for SYSCALL
>> > +	 * and SYSENTER, and ERETU is the only legit instruction to return to
>> > +	 * ring 3, as a result there is _no_ need to setup the SYSCALL and
>> > +	 * SYSENTER MSRs.
>> > +	 *
>> > +	 * Note, both sysexit and sysret cause #UD when FRED is enabled.
>> > +	 */
>> > +	wrmsrl(MSR_LSTAR, 0ULL);
>> > +	wrmsrl_cstar(0ULL);
>> 
>> That write is pointless. See the comment in wrmsrl_cstar().
>
> What I heard is that AMD is going to support FRED.
>
> Both LSTAR and CSTAR have no function when FRED is enabled, so maybe
> just do NOT write to them?

Right. If AMD needs to clear it then it's trivial enough to add a
wrmsrl_cstar(0) to it.

^ permalink raw reply	[flat|nested] 88+ messages in thread

* RE: [PATCH v10 03/38] x86/msr: Add the WRMSRNS instruction support
  2023-09-20  7:58   ` Nikolay Borisov
@ 2023-09-20  8:18     ` Li, Xin3
  2023-09-22  8:16       ` Li, Xin3
  0 siblings, 1 reply; 88+ messages in thread
From: Li, Xin3 @ 2023-09-20  8:18 UTC (permalink / raw)
  To: Nikolay Borisov, linux-doc@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-edac@vger.kernel.org,
	linux-hyperv@vger.kernel.org, kvm@vger.kernel.org,
	xen-devel@lists.xenproject.org
  Cc: tglx@linutronix.de, mingo@redhat.com, bp@alien8.de,
	dave.hansen@linux.intel.com, x86@kernel.org, hpa@zytor.com,
	Lutomirski, Andy, pbonzini@redhat.com, Christopherson, Sean,
	peterz@infradead.org, Gross, Jurgen, Shankar, Ravi V,
	mhiramat@kernel.org, andrew.cooper3@citrix.com,
	jiangshanlai@gmail.com

> > +static __always_inline void __wrmsrns(u32 msr, u32 low, u32 high)
> 
> Shouldn't this be named wrmsrns_safe since it has exception handling, similar to
> the current wrmsrl_safe.
> 

Both the safe and unsafe versions have exception handling; the safe
version additionally returns an integer to its caller to indicate
whether an exception occurred.

Exception handling is a must for WRMSR/RDMSR and related instructions.
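
To illustrate the distinction, here is a userspace model only, not the
kernel implementation. The model_* names and the choice of 0x10 as the
single valid MSR index are made up for the example; the -EIO return
value follows what the existing *_safe helpers report.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-in for the hardware: only MSR index 0x10 "exists". */
static bool model_msr_write_faults(uint32_t msr)
{
	return msr != 0x10;
}

/*
 * Models the unsafe flavour: a fault is still fixed up and reported,
 * but the caller gets no status back.
 */
static void model_wrmsr(uint32_t msr, uint64_t val)
{
	if (model_msr_write_faults(msr))
		fprintf(stderr, "unchecked MSR write: 0x%x <- 0x%llx\n",
			msr, (unsigned long long)val);
}

/* Models the _safe flavour: same fixup, but the outcome is returned. */
static int model_wrmsr_safe(uint32_t msr, uint64_t val)
{
	(void)val;
	return model_msr_write_faults(msr) ? -EIO : 0;
}
```

In both cases the faulting write does not crash the caller; only the
second lets the caller react to the failure.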

Thanks!
    Xin

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH v10 06/38] Documentation/x86/64: Add a documentation for FRED
  2023-09-14  4:47 ` [PATCH v10 06/38] Documentation/x86/64: Add a documentation for FRED Xin Li
@ 2023-09-20  9:44   ` Nikolay Borisov
  0 siblings, 0 replies; 88+ messages in thread
From: Nikolay Borisov @ 2023-09-20  9:44 UTC (permalink / raw)
  To: Xin Li, linux-doc, linux-kernel, linux-edac, linux-hyperv, kvm,
	xen-devel
  Cc: tglx, mingo, bp, dave.hansen, x86, hpa, luto, pbonzini, seanjc,
	peterz, jgross, ravi.v.shankar, mhiramat, andrew.cooper3,
	jiangshanlai



On 14.09.23 at 7:47, Xin Li wrote:
> Briefly introduce FRED, and its advantages compared to IDT.
> 
> Signed-off-by: Xin Li <xin3.li@intel.com>
> ---
>   Documentation/arch/x86/x86_64/fred.rst  | 98 +++++++++++++++++++++++++
>   Documentation/arch/x86/x86_64/index.rst |  1 +
>   2 files changed, 99 insertions(+)
>   create mode 100644 Documentation/arch/x86/x86_64/fred.rst
> 
> diff --git a/Documentation/arch/x86/x86_64/fred.rst b/Documentation/arch/x86/x86_64/fred.rst
> new file mode 100644
> index 000000000000..a4ebb95f92c8
> --- /dev/null
> +++ b/Documentation/arch/x86/x86_64/fred.rst
> @@ -0,0 +1,98 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +
> +=========================================
> +Flexible Return and Event Delivery (FRED)
> +=========================================
> +
> +Overview
> +========
> +
> +The FRED architecture defines simple new transitions that change
> +privilege level (ring transitions). The FRED architecture was
> +designed with the following goals:
> +
> +1) Improve overall performance and response time by replacing event
> +   delivery through the interrupt descriptor table (IDT event
> +   delivery) and event return by the IRET instruction with lower
> +   latency transitions.
> +
> +2) Improve software robustness by ensuring that event delivery
> +   establishes the full supervisor context and that event return
> +   establishes the full user context.
> +
> +The new transitions defined by the FRED architecture are FRED event
> +delivery and, for returning from events, two FRED return instructions.
> +FRED event delivery can effect a transition from ring 3 to ring 0, but
> +it is used also to deliver events incident to ring 0. One FRED
> +instruction (ERETU) effects a return from ring 0 to ring 3, while the
> +other (ERETS) returns while remaining in ring 0. Collectively, FRED
> +event delivery and the FRED return instructions are FRED transitions.
> +
> +In addition to these transitions, the FRED architecture defines a new
> +instruction (LKGS) for managing the state of the GS segment register.
> +The LKGS instruction can be used by 64-bit operating systems that do
> +not use the new FRED transitions.
> +
> +Furthermore, the FRED architecture is easy to extend for future CPU
> +architectures.
> +
> +Software based event dispatching
> +================================
> +
> +FRED operates differently from IDT in terms of event handling. Instead
> +of directly dispatching an event to its handler based on the event
> +vector, FRED requires the software to dispatch an event to its handler
> +based on both the event's type and vector. Therefore, an event dispatch
> +framework must be implemented to facilitate the event-to-handler
> +dispatch process. The FRED event dispatch framework takes control
> +once an event is delivered, and employs a two-level dispatch.
> +
> +The first level dispatching is event type based, and the second level
> +dispatching is event vector based.
> +
> +Full supervisor/user context
> +============================
> +
> +FRED event delivery atomically saves and restores full supervisor/user
> +context upon event delivery and return. Thus it avoids the problem of
> +transient states due to %cr2 and/or %dr6, and it is no longer needed
> +to handle all the ugly corner cases caused by half baked entry states.
> +
> +FRED allows explicit unblock of NMI with new event return instructions
> +ERETS/ERETU, avoiding the mess caused by IRET which unconditionally
> +unblocks NMI, e.g., when an exception happens during NMI handling.
> +
> +FRED always restores the full value of %rsp, thus ESPFIX is no longer
> +needed when FRED is enabled.
> +
> +LKGS
> +====
> +
> +LKGS behaves like the MOV to GS instruction except that it loads the
> +base address into the IA32_KERNEL_GS_BASE MSR instead of the GS
> +segment’s descriptor cache. With LKGS, it ends up with avoiding
> +mucking with kernel GS, i.e., an operating system can always operate
> +with its own GS base address.
> +
> +Because FRED event delivery from ring 3 swaps the value of the GS base
> +address and that of the IA32_KERNEL_GS_BASE MSR, and ERETU swaps the
> +value of the GS base address and that of the IA32_KERNEL_GS_BASE MSR,
> +plus the introduction of LKGS instruction, the SWAPGS instruction is
> +no longer needed when FRED is enabled, thus is disallowed (#UD).

nit: This would be clearer if rewritten as: "Because FRED event delivery
from ring 3 and ERETU both swap the value of the GS base, plus the...".

The idea is to remove the duplicated statement that IA32_KERNEL_GS_BASE
and the GS base address are swapped, as it makes the sentence somewhat
hard to read.


<snip>

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH v10 09/38] x86/fred: Disable FRED support if CONFIG_X86_FRED is disabled
  2023-09-14  4:47 ` [PATCH v10 09/38] x86/fred: Disable FRED support if CONFIG_X86_FRED is disabled Xin Li
@ 2023-09-20 10:19   ` Nikolay Borisov
  0 siblings, 0 replies; 88+ messages in thread
From: Nikolay Borisov @ 2023-09-20 10:19 UTC (permalink / raw)
  To: Xin Li, linux-doc, linux-kernel, linux-edac, linux-hyperv, kvm,
	xen-devel
  Cc: tglx, mingo, bp, dave.hansen, x86, hpa, luto, pbonzini, seanjc,
	peterz, jgross, ravi.v.shankar, mhiramat, andrew.cooper3,
	jiangshanlai



On 14.09.23 at 7:47, Xin Li wrote:
> From: "H. Peter Anvin (Intel)" <hpa@zytor.com>
> 
> Add CONFIG_X86_FRED to <asm/disabled-features.h> to make
> cpu_feature_enabled() work correctly with FRED.
> 
> Originally-by: Megha Dey <megha.dey@intel.com>
> Signed-off-by: H. Peter Anvin (Intel) <hpa@zytor.com>
> Tested-by: Shan Kang <shan.kang@intel.com>
> Signed-off-by: Xin Li <xin3.li@intel.com>
> ---
>   arch/x86/include/asm/disabled-features.h       | 8 +++++++-
>   tools/arch/x86/include/asm/disabled-features.h | 8 +++++++-
>   2 files changed, 14 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/include/asm/disabled-features.h b/arch/x86/include/asm/disabled-features.h
> index 702d93fdd10e..3cde57cb5093 100644
> --- a/arch/x86/include/asm/disabled-features.h
> +++ b/arch/x86/include/asm/disabled-features.h
> @@ -117,6 +117,12 @@
>   #define DISABLE_IBT	(1 << (X86_FEATURE_IBT & 31))
>   #endif
>   
> +#ifdef CONFIG_X86_FRED
> +# define DISABLE_FRED	0
> +#else
> +# define DISABLE_FRED	(1 << (X86_FEATURE_FRED & 31))
> +#endif
> +
>   /*
>    * Make sure to add features to the correct mask
>    */
> @@ -134,7 +140,7 @@
>   #define DISABLED_MASK11	(DISABLE_RETPOLINE|DISABLE_RETHUNK|DISABLE_UNRET| \
>   			 DISABLE_CALL_DEPTH_TRACKING|DISABLE_USER_SHSTK)
>   #define DISABLED_MASK12	(DISABLE_LAM)
> -#define DISABLED_MASK13	0
> +#define DISABLED_MASK13	(DISABLE_FRED)
>   #define DISABLED_MASK14	0
>   #define DISABLED_MASK15	0
>   #define DISABLED_MASK16	(DISABLE_PKU|DISABLE_OSPKE|DISABLE_LA57|DISABLE_UMIP| \
> diff --git a/tools/arch/x86/include/asm/disabled-features.h b/tools/arch/x86/include/asm/disabled-features.h
> index fafe9be7a6f4..d540ecdd8812 100644
> --- a/tools/arch/x86/include/asm/disabled-features.h
> +++ b/tools/arch/x86/include/asm/disabled-features.h
> @@ -105,6 +105,12 @@
>   # define DISABLE_TDX_GUEST	(1 << (X86_FEATURE_TDX_GUEST & 31))
>   #endif
>   
> +#ifdef CONFIG_X86_FRED
> +# define DISABLE_FRED	0
> +#else
> +# define DISABLE_FRED	(1 << (X86_FEATURE_FRED & 31))
> +#endif
> +
>   /*
>    * Make sure to add features to the correct mask
>    */
> @@ -122,7 +128,7 @@
>   #define DISABLED_MASK11	(DISABLE_RETPOLINE|DISABLE_RETHUNK|DISABLE_UNRET| \
>   			 DISABLE_CALL_DEPTH_TRACKING)
>   #define DISABLED_MASK12	(DISABLE_LAM)
> -#define DISABLED_MASK13	0
> +#define DISABLED_MASK13	(DISABLE_FRED)

The FRED feature is defined in cpuid word 12, not 13, so DISABLE_FRED
belongs in DISABLED_MASK12.

>   #define DISABLED_MASK14	0
>   #define DISABLED_MASK15	0
>   #define DISABLED_MASK16	(DISABLE_PKU|DISABLE_OSPKE|DISABLE_LA57|DISABLE_UMIP| \
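
A quick way to check which DISABLED_MASKn a feature belongs to is to
split its number into word and bit. The sketch below assumes the usual
"word * 32 + bit" encoding; the EXAMPLE_ names are local to the example,
and the (12, 17) value for FRED is an assumption to be checked against
cpufeatures.h in the tree under review.

```c
#include <stdint.h>

/* Assumed encoding, mirroring the kernel convention: word * 32 + bit. */
#define EXAMPLE_FEATURE(word, bit)	((word) * 32 + (bit))

/* Assumed value for illustration only. */
#define EXAMPLE_FEATURE_FRED		EXAMPLE_FEATURE(12, 17)

/* Index n of the DISABLED_MASKn word the feature lives in. */
static unsigned int feature_word(unsigned int feature)
{
	return feature >> 5;
}

/* Bit mask within that word, same form as (1 << (X86_FEATURE_x & 31)). */
static uint32_t feature_mask(unsigned int feature)
{
	return UINT32_C(1) << (feature & 31);
}
```

With the assumed value, feature_word() returns 12, so a DISABLE_FRED
mask built with "& 31" only tests the right bit when it is OR-ed into
DISABLED_MASK12.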

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH v10 13/38] x86/cpu: Add X86_CR4_FRED macro
  2023-09-14  4:47 ` [PATCH v10 13/38] x86/cpu: Add X86_CR4_FRED macro Xin Li
@ 2023-09-20 10:50   ` Nikolay Borisov
  2023-09-20 17:25     ` Li, Xin3
  0 siblings, 1 reply; 88+ messages in thread
From: Nikolay Borisov @ 2023-09-20 10:50 UTC (permalink / raw)
  To: Xin Li, linux-doc, linux-kernel, linux-edac, linux-hyperv, kvm,
	xen-devel
  Cc: tglx, mingo, bp, dave.hansen, x86, hpa, luto, pbonzini, seanjc,
	peterz, jgross, ravi.v.shankar, mhiramat, andrew.cooper3,
	jiangshanlai



On 14.09.23 at 7:47, Xin Li wrote:
> From: "H. Peter Anvin (Intel)" <hpa@zytor.com>
> 
> Add X86_CR4_FRED macro for the FRED bit in %cr4. This bit must not be
> changed after initialization, so add it to the pinned CR4 bits.
> 
> Signed-off-by: H. Peter Anvin (Intel) <hpa@zytor.com>
> Tested-by: Shan Kang <shan.kang@intel.com>
> Signed-off-by: Xin Li <xin3.li@intel.com>
> ---
> 
> Changes since v9:
> * Avoid a type cast by defining X86_CR4_FRED as 0 on 32-bit (Thomas
>    Gleixner).
> ---
>   arch/x86/include/uapi/asm/processor-flags.h | 7 +++++++
>   arch/x86/kernel/cpu/common.c                | 5 ++---
>   2 files changed, 9 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/x86/include/uapi/asm/processor-flags.h b/arch/x86/include/uapi/asm/processor-flags.h
> index d898432947ff..f1a4adc78272 100644
> --- a/arch/x86/include/uapi/asm/processor-flags.h
> +++ b/arch/x86/include/uapi/asm/processor-flags.h
> @@ -139,6 +139,13 @@
>   #define X86_CR4_LAM_SUP_BIT	28 /* LAM for supervisor pointers */
>   #define X86_CR4_LAM_SUP		_BITUL(X86_CR4_LAM_SUP_BIT)
>   
> +#ifdef __x86_64__
> +#define X86_CR4_FRED_BIT	32 /* enable FRED kernel entry */
> +#define X86_CR4_FRED		_BITUL(X86_CR4_FRED_BIT)

nit: s/BITUL/BITULL. I guess if __x86_64__ is defined then we are
guaranteed that unsigned long is 64 bits wide, but for the sake of
clarity I'd rather have this spelled out explicitly by using BITULL.


> +#else
> +#define X86_CR4_FRED		(0)
> +#endif
> +
>   /*
>    * x86-64 Task Priority Register, CR8
>    */
> diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
> index 317b4877e9c7..42511209469b 100644
> --- a/arch/x86/kernel/cpu/common.c
> +++ b/arch/x86/kernel/cpu/common.c
> @@ -400,9 +400,8 @@ static __always_inline void setup_umip(struct cpuinfo_x86 *c)
>   }
>   
>   /* These bits should not change their value after CPU init is finished. */
> -static const unsigned long cr4_pinned_mask =
> -	X86_CR4_SMEP | X86_CR4_SMAP | X86_CR4_UMIP |
> -	X86_CR4_FSGSBASE | X86_CR4_CET;
> +static const unsigned long cr4_pinned_mask = X86_CR4_SMEP | X86_CR4_SMAP | X86_CR4_UMIP |
> +					     X86_CR4_FSGSBASE | X86_CR4_CET | X86_CR4_FRED;
>   static DEFINE_STATIC_KEY_FALSE_RO(cr_pinning);
>   static unsigned long cr4_pinned_bits __ro_after_init;
>   

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH v10 16/38] x86/ptrace: Add FRED additional information to the pt_regs structure
  2023-09-14  4:47 ` [PATCH v10 16/38] x86/ptrace: Add FRED additional information to " Xin Li
@ 2023-09-20 12:57   ` Nikolay Borisov
  2023-09-20 17:23     ` Li, Xin3
  0 siblings, 1 reply; 88+ messages in thread
From: Nikolay Borisov @ 2023-09-20 12:57 UTC (permalink / raw)
  To: Xin Li, linux-doc, linux-kernel, linux-edac, linux-hyperv, kvm,
	xen-devel
  Cc: tglx, mingo, bp, dave.hansen, x86, hpa, luto, pbonzini, seanjc,
	peterz, jgross, ravi.v.shankar, mhiramat, andrew.cooper3,
	jiangshanlai



On 14.09.23 at 7:47, Xin Li wrote:
> FRED defines additional information in the upper 48 bits of cs/ss
> fields. Therefore add the information definitions into the pt_regs
> structure.
> 
> Specially introduce a new structure fred_ss to denote the FRED flags
> above SS selector, which avoids FRED_SSX_ macros and makes the code
> simpler and easier to read.
> 
> Signed-off-by: H. Peter Anvin (Intel) <hpa@zytor.com>
> Tested-by: Shan Kang <shan.kang@intel.com>
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> Signed-off-by: Xin Li <xin3.li@intel.com>
> ---
> 
> Changes since v9:
> * Introduce a new structure fred_ss to denote the FRED flags above SS
>    selector, which avoids FRED_SSX_ macros and makes the code simpler
>    and easier to read (Thomas Gleixner).
> * Use type u64 to define FRED bit fields instead of type unsigned int
>    (Thomas Gleixner).
> 
> Changes since v8:
> * Reflect stack frame definition changes from FRED spec 3.0 to 5.0.
> * Use __packed instead of __attribute__((__packed__)) (Borislav Petkov).
> * Put all comments above the members, like the rest of the file does
>    (Borislav Petkov).
> 
> Changes since v3:
> * Rename csl/ssl of the pt_regs structure to csx/ssx (x for extended)
>    (Andrew Cooper).
> ---
>   arch/x86/include/asm/ptrace.h | 51 +++++++++++++++++++++++++++++++----
>   1 file changed, 46 insertions(+), 5 deletions(-)
> 
> diff --git a/arch/x86/include/asm/ptrace.h b/arch/x86/include/asm/ptrace.h
> index f08ea073edd6..5786c8ca5f4c 100644
> --- a/arch/x86/include/asm/ptrace.h
> +++ b/arch/x86/include/asm/ptrace.h
> @@ -56,6 +56,25 @@ struct pt_regs {
>   
>   #else /* __i386__ */
>   
> +struct fred_ss {
> +	u64	ss	: 16,	// SS selector

Is this structure conformant to the return state as described in FRED 5.0?

— The stack segment of the interrupted context, 64 bits formatted as follows:

• Bits 15:0 contain the SS selector. < - WE HAVE THIS

• Bits 31:16 are not currently defined and will be zero until they are. < - MISSING hole?


> +		sti	:  1,	// STI state < -
> +		swevent	:  1,	// Set if syscall, sysenter or INT n
> +		nmi	:  1,	// Event is NMI type
> +			: 13,
> +		vector	:  8,	// Event vector
> +			:  8,
> +		type	:  4,	// Event type
> +			:  4,
> +		enclave	:  1,	// Event was incident to enclave execution
> +		lm	:  1,	// CPU was in long mode
> +		nested	:  1,	// Nested exception during FRED delivery
> +				// not set for #DF
> +			:  1,
> +		insnlen	:  4;	// The length of the instruction causing the event
> +				// Only set for INT0, INT1, INT3, INT n, SYSCALL
> +};				// and SYSENTER. 0 otherwise.
> +

<Snip>
   

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH v10 03/38] x86/msr: Add the WRMSRNS instruction support
  2023-09-15  1:16           ` andrew.cooper3
  2023-09-15  5:32             ` Juergen Gross
@ 2023-09-20 15:00             ` Peter Zijlstra
  2023-09-20 15:04               ` Juergen Gross
  1 sibling, 1 reply; 88+ messages in thread
From: Peter Zijlstra @ 2023-09-20 15:00 UTC (permalink / raw)
  To: andrew.cooper3
  Cc: H. Peter Anvin, Thomas Gleixner, Xin Li, linux-doc, linux-kernel,
	linux-edac, linux-hyperv, kvm, xen-devel, mingo, bp, dave.hansen,
	x86, luto, pbonzini, seanjc, jgross, ravi.v.shankar, mhiramat,
	jiangshanlai

On Fri, Sep 15, 2023 at 02:16:50AM +0100, andrew.cooper3@citrix.com wrote:

> Juergen has already done the work to delete one of these two patching
> mechanisms and replace it with the other.
> 
> https://lore.kernel.org/lkml/a32e211f-4add-4fb2-9e5a-480ae9b9bbf2@suse.com/
> 
> Unfortunately, it's only collecting pings and tumbleweeds.

Fixed that...

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH v10 03/38] x86/msr: Add the WRMSRNS instruction support
  2023-09-20 15:00             ` Peter Zijlstra
@ 2023-09-20 15:04               ` Juergen Gross
  0 siblings, 0 replies; 88+ messages in thread
From: Juergen Gross @ 2023-09-20 15:04 UTC (permalink / raw)
  To: Peter Zijlstra, andrew.cooper3
  Cc: H. Peter Anvin, Thomas Gleixner, Xin Li, linux-doc, linux-kernel,
	linux-edac, linux-hyperv, kvm, xen-devel, mingo, bp, dave.hansen,
	x86, luto, pbonzini, seanjc, ravi.v.shankar, mhiramat,
	jiangshanlai


On 20.09.23 17:00, Peter Zijlstra wrote:
> On Fri, Sep 15, 2023 at 02:16:50AM +0100, andrew.cooper3@citrix.com wrote:
> 
>> Juergen has already done the work to delete one of these two patching
>> mechanisms and replace it with the other.
>>
>> https://lore.kernel.org/lkml/a32e211f-4add-4fb2-9e5a-480ae9b9bbf2@suse.com/
>>
>> Unfortunately, it's only collecting pings and tumbleweeds.
> 
> Fixed that...

Thanks. :-)


Juergen

^ permalink raw reply	[flat|nested] 88+ messages in thread

* RE: [PATCH v10 16/38] x86/ptrace: Add FRED additional information to the pt_regs structure
  2023-09-20 12:57   ` Nikolay Borisov
@ 2023-09-20 17:23     ` Li, Xin3
  2023-09-21  6:07       ` Nikolay Borisov
  0 siblings, 1 reply; 88+ messages in thread
From: Li, Xin3 @ 2023-09-20 17:23 UTC (permalink / raw)
  To: Nikolay Borisov, linux-doc@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-edac@vger.kernel.org,
	linux-hyperv@vger.kernel.org, kvm@vger.kernel.org,
	xen-devel@lists.xenproject.org
  Cc: tglx@linutronix.de, mingo@redhat.com, bp@alien8.de,
	dave.hansen@linux.intel.com, x86@kernel.org, hpa@zytor.com,
	Lutomirski, Andy, pbonzini@redhat.com, Christopherson,, Sean,
	peterz@infradead.org, Gross, Jurgen, Shankar, Ravi V,
	mhiramat@kernel.org, andrew.cooper3@citrix.com,
	jiangshanlai@gmail.com

> > +struct fred_ss {
> > +	u64	ss	: 16,	// SS selector
> 
> Is this structure conformant to the return state as described in FRED 5.0?
> 
> — The stack segment of the interrupted context, 64 bits formatted as follows:
> 
> • Bits 15:0 contain the SS selector. < - WE HAVE THIS
> 
> • Bits 31:16 are not currently defined and will be zero until they are.

Where did you download the FRED 5.0 spec from?

Mine says bit 16 is STI, bit 17 is for software-initiated events, and
bit 18 is NMI.

I guess you have FRED 3.0 spec, no?

> < - MISSING hole?
> 
> > +		sti	:  1,	// STI state < -
> > +		swevent	:  1,	// Set if syscall, sysenter or INT n
> > +		nmi	:  1,	// Event is NMI type
> > +			: 13,
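
The bit positions can be cross-checked in userspace by overlaying the
quoted layout on a raw 64-bit value. This is only a sketch: the _example
and _view names are made up, and it assumes a little-endian target and a
compiler (GCC or Clang) that accepts 64-bit bit-fields and allocates
them from bit 0 upwards.

```c
#include <stdint.h>

/* Userspace copy of the quoted layout; the widths add up to 64 bits. */
struct fred_ss_example {
	uint64_t ss      : 16,	/* bits 15:0  */
		 sti     :  1,	/* bit  16    */
		 swevent :  1,	/* bit  17    */
		 nmi     :  1,	/* bit  18    */
			 : 13,	/* bits 31:19 */
		 vector  :  8,	/* bits 39:32 */
			 :  8,	/* bits 47:40 */
		 type    :  4,	/* bits 51:48 */
			 :  4,	/* bits 55:52 */
		 enclave :  1,	/* bit  56    */
		 lm      :  1,	/* bit  57    */
		 nested  :  1,	/* bit  58    */
			 :  1,	/* bit  59    */
		 insnlen :  4;	/* bits 63:60 */
};

/* Overlay used to look at the same 64 bits both ways. */
union fred_ss_view {
	uint64_t raw;
	struct fred_ss_example f;
};

/* Build a view from a raw 64-bit SS slot value. */
static union fred_ss_view fred_ss_from_raw(uint64_t raw)
{
	union fred_ss_view v = { .raw = raw };
	return v;
}
```

Under those assumptions, setting raw bit 16, 17 or 18 shows up as sti,
swevent or nmi respectively, matching the description above.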
 

^ permalink raw reply	[flat|nested] 88+ messages in thread

* RE: [PATCH v10 13/38] x86/cpu: Add X86_CR4_FRED macro
  2023-09-20 10:50   ` Nikolay Borisov
@ 2023-09-20 17:25     ` Li, Xin3
  0 siblings, 0 replies; 88+ messages in thread
From: Li, Xin3 @ 2023-09-20 17:25 UTC (permalink / raw)
  To: Nikolay Borisov, linux-doc@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-edac@vger.kernel.org,
	linux-hyperv@vger.kernel.org, kvm@vger.kernel.org,
	xen-devel@lists.xenproject.org
  Cc: tglx@linutronix.de, mingo@redhat.com, bp@alien8.de,
	dave.hansen@linux.intel.com, x86@kernel.org, hpa@zytor.com,
	Lutomirski, Andy, pbonzini@redhat.com, Christopherson, Sean,
	peterz@infradead.org, Gross, Jurgen, Shankar, Ravi V,
	mhiramat@kernel.org, andrew.cooper3@citrix.com,
	jiangshanlai@gmail.com

> > +#ifdef __x86_64__
> > +#define X86_CR4_FRED_BIT	32 /* enable FRED kernel entry */
> > +#define X86_CR4_FRED		_BITUL(X86_CR4_FRED_BIT)
> 
> nit: s/BITUL/BITULL I guess if __x86_64__ is defined then we are
> guaranteed that unsigned long will be a 64 bit, but for the sake of
> clarity I'd rather have this spelled out explicitly by using BITULL
>

UL is better because CR4 is a machine word.
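
As a sketch of why the two spellings agree here: on an LP64 target such
as x86-64 Linux, unsigned long is 64 bits wide, so bit 32 fits. The EX_
macros below are local stand-ins written for this example, not the uapi
_BITUL()/_BITULL() definitions themselves.

```c
#include <limits.h>

/* Local stand-ins for the uapi helpers, for illustration only. */
#define EX_BITUL(x)	(1UL << (x))
#define EX_BITULL(x)	(1ULL << (x))

#define EX_CR4_FRED_BIT	32

/* Width of unsigned long in bits on the target ABI. */
static unsigned int ulong_bits(void)
{
	return (unsigned int)(sizeof(unsigned long) * CHAR_BIT);
}

/* A shift by 32 is only well defined if unsigned long is wider than that. */
static int fred_bit_fits_in_ulong(void)
{
	return EX_CR4_FRED_BIT < ulong_bits();
}
```

When fred_bit_fits_in_ulong() is true, EX_BITUL(32) and EX_BITULL(32)
evaluate to the same value, and the UL form keeps the constant the same
type as the CR4 accessors use.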

> 
> 
> > +#else
> > +#define X86_CR4_FRED		(0)
> > +#endif


^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH v10 33/38] x86/entry: Add fred_entry_from_kvm() for VMX to handle IRQ/NMI
  2023-09-14  4:48 ` [PATCH v10 33/38] x86/entry: Add fred_entry_from_kvm() for VMX to handle IRQ/NMI Xin Li
@ 2023-09-20 17:54   ` Paolo Bonzini
  2023-09-20 23:10     ` Li, Xin3
  2023-09-21 12:11   ` Nikolay Borisov
  1 sibling, 1 reply; 88+ messages in thread
From: Paolo Bonzini @ 2023-09-20 17:54 UTC (permalink / raw)
  To: Xin Li, linux-doc, linux-kernel, linux-edac, linux-hyperv, kvm,
	xen-devel
  Cc: tglx, mingo, bp, dave.hansen, x86, hpa, luto, seanjc, peterz,
	jgross, ravi.v.shankar, mhiramat, andrew.cooper3, jiangshanlai

On 9/14/23 06:48, Xin Li wrote:
> +	/*
> +	 * Don't check the FRED stack level, the call stack leading to this
> +	 * helper is effectively constant and shallow (relatively speaking).

It's more that we don't need to protect from reentrancy.  The external 
interrupt uses stack level 0 so no adjustment would be needed anyway, 
and NMI does not use an IST even in the non-FRED case.

> +	 * Emulate the FRED-defined redzone and stack alignment.
> +	 */
> +	sub $(FRED_CONFIG_REDZONE_AMOUNT << 6), %rsp
> +	and $FRED_STACK_FRAME_RSP_MASK, %rsp
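
The arithmetic of those two instructions can be written out in C as
follows. The constant values are assumptions made for this example (a
redzone of one 64-byte unit and a 64-byte alignment mask); the real
definitions are the FRED_* macros in the patch series.

```c
#include <stdint.h>

/* Assumed values, for illustration only. */
#define EX_REDZONE_AMOUNT	1			/* in 64-byte units */
#define EX_RSP_MASK		(~UINT64_C(0x3f))	/* 64-byte alignment */

/* Mirror of "sub $(amount << 6), %rsp; and $mask, %rsp". */
static uint64_t emulate_fred_frame_rsp(uint64_t rsp)
{
	rsp -= (uint64_t)EX_REDZONE_AMOUNT << 6;	/* skip the redzone */
	rsp &= EX_RSP_MASK;				/* round down to 64 bytes */
	return rsp;
}
```

With those assumed constants, an incoming %rsp of 0x1039 ends up at
0xfc0: 0x1039 - 0x40 = 0xff9, which rounds down to 0xfc0.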


^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH v10 34/38] KVM: VMX: Call fred_entry_from_kvm() for IRQ/NMI handling
  2023-09-14  4:48 ` [PATCH v10 34/38] KVM: VMX: Call fred_entry_from_kvm() for IRQ/NMI handling Xin Li
@ 2023-09-20 17:54   ` Paolo Bonzini
  0 siblings, 0 replies; 88+ messages in thread
From: Paolo Bonzini @ 2023-09-20 17:54 UTC (permalink / raw)
  To: Xin Li, linux-doc, linux-kernel, linux-edac, linux-hyperv, kvm,
	xen-devel
  Cc: tglx, mingo, bp, dave.hansen, x86, hpa, luto, seanjc, peterz,
	jgross, ravi.v.shankar, mhiramat, andrew.cooper3, jiangshanlai

On 9/14/23 06:48, Xin Li wrote:
> When FRED is enabled, call fred_entry_from_kvm() to handle IRQ/NMI in
> IRQ/NMI induced VM exits.
> 
> Tested-by: Shan Kang <shan.kang@intel.com>
> Signed-off-by: Xin Li <xin3.li@intel.com>

Acked-by: Paolo Bonzini <pbonzini@redhat.com>

> ---
>   arch/x86/kvm/vmx/vmx.c | 12 +++++++++---
>   1 file changed, 9 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index 72e3943f3693..db55b8418fa3 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -38,6 +38,7 @@
>   #include <asm/desc.h>
>   #include <asm/fpu/api.h>
>   #include <asm/fpu/xstate.h>
> +#include <asm/fred.h>
>   #include <asm/idtentry.h>
>   #include <asm/io.h>
>   #include <asm/irq_remapping.h>
> @@ -6962,14 +6963,16 @@ static void handle_external_interrupt_irqoff(struct kvm_vcpu *vcpu)
>   {
>   	u32 intr_info = vmx_get_intr_info(vcpu);
>   	unsigned int vector = intr_info & INTR_INFO_VECTOR_MASK;
> -	gate_desc *desc = (gate_desc *)host_idt_base + vector;
>   
>   	if (KVM_BUG(!is_external_intr(intr_info), vcpu->kvm,
>   	    "unexpected VM-Exit interrupt info: 0x%x", intr_info))
>   		return;
>   
>   	kvm_before_interrupt(vcpu, KVM_HANDLING_IRQ);
> -	vmx_do_interrupt_irqoff(gate_offset(desc));
> +	if (cpu_feature_enabled(X86_FEATURE_FRED))
> +		fred_entry_from_kvm(EVENT_TYPE_EXTINT, vector);
> +	else
> +		vmx_do_interrupt_irqoff(gate_offset((gate_desc *)host_idt_base + vector));
>   	kvm_after_interrupt(vcpu);
>   
>   	vcpu->arch.at_instruction_boundary = true;
> @@ -7262,7 +7265,10 @@ static noinstr void vmx_vcpu_enter_exit(struct kvm_vcpu *vcpu,
>   	if ((u16)vmx->exit_reason.basic == EXIT_REASON_EXCEPTION_NMI &&
>   	    is_nmi(vmx_get_intr_info(vcpu))) {
>   		kvm_before_interrupt(vcpu, KVM_HANDLING_NMI);
> -		vmx_do_nmi_irqoff();
> +		if (cpu_feature_enabled(X86_FEATURE_FRED))
> +			fred_entry_from_kvm(EVENT_TYPE_NMI, NMI_VECTOR);
> +		else
> +			vmx_do_nmi_irqoff();
>   		kvm_after_interrupt(vcpu);
>   	}
>   


^ permalink raw reply	[flat|nested] 88+ messages in thread

* RE: [PATCH v10 33/38] x86/entry: Add fred_entry_from_kvm() for VMX to handle IRQ/NMI
  2023-09-20 17:54   ` Paolo Bonzini
@ 2023-09-20 23:10     ` Li, Xin3
  0 siblings, 0 replies; 88+ messages in thread
From: Li, Xin3 @ 2023-09-20 23:10 UTC (permalink / raw)
  To: Paolo Bonzini, linux-doc@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-edac@vger.kernel.org,
	linux-hyperv@vger.kernel.org, kvm@vger.kernel.org,
	xen-devel@lists.xenproject.org
  Cc: tglx@linutronix.de, mingo@redhat.com, bp@alien8.de,
	dave.hansen@linux.intel.com, x86@kernel.org, hpa@zytor.com,
	Lutomirski, Andy, Christopherson,, Sean, peterz@infradead.org,
	Gross, Jurgen, Shankar, Ravi V, mhiramat@kernel.org,
	andrew.cooper3@citrix.com, jiangshanlai@gmail.com

> > +	/*
> > +	 * Don't check the FRED stack level, the call stack leading to this
> > +	 * helper is effectively constant and shallow (relatively speaking).
> 
> It's more that we don't need to protect from reentrancy.  The external
> interrupt uses stack level 0 so no adjustment would be needed anyway,
> and NMI does not use an IST even in the non-FRED case.

I will incorporate this comment.

I think a VMX NMI is kind of like a user level NMI, and we don't need
to worry about nested NMIs.

> 
> > +	 * Emulate the FRED-defined redzone and stack alignment.
> > +	 */
> > +	sub $(FRED_CONFIG_REDZONE_AMOUNT << 6), %rsp
> > +	and $FRED_STACK_FRAME_RSP_MASK, %rsp


^ permalink raw reply	[flat|nested] 88+ messages in thread

* RE: [PATCH v10 36/38] x86/fred: Add fred_syscall_init()
  2023-09-20  8:18       ` Thomas Gleixner
@ 2023-09-21  2:24         ` H. Peter Anvin
  0 siblings, 0 replies; 88+ messages in thread
From: H. Peter Anvin @ 2023-09-21  2:24 UTC (permalink / raw)
  To: Thomas Gleixner, Li, Xin3, linux-doc@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-edac@vger.kernel.org,
	linux-hyperv@vger.kernel.org, kvm@vger.kernel.org,
	xen-devel@lists.xenproject.org
  Cc: mingo@redhat.com, bp@alien8.de, dave.hansen@linux.intel.com,
	x86@kernel.org, Lutomirski, Andy, pbonzini@redhat.com,
	Christopherson,, Sean, peterz@infradead.org, Gross, Jurgen,
	Shankar, Ravi V, mhiramat@kernel.org, andrew.cooper3@citrix.com,
	jiangshanlai@gmail.com

On September 20, 2023 1:18:14 AM PDT, Thomas Gleixner <tglx@linutronix.de> wrote:
>On Wed, Sep 20 2023 at 04:33, Li, Xin3 wrote:
>>> > +static inline void fred_syscall_init(void) {
>>> > +	/*
>>> > +	 * Per FRED spec 5.0, FRED uses the ring 3 FRED entrypoint for SYSCALL
>>> > +	 * and SYSENTER, and ERETU is the only legit instruction to return to
>>> > +	 * ring 3, as a result there is _no_ need to setup the SYSCALL and
>>> > +	 * SYSENTER MSRs.
>>> > +	 *
>>> > +	 * Note, both sysexit and sysret cause #UD when FRED is enabled.
>>> > +	 */
>>> > +	wrmsrl(MSR_LSTAR, 0ULL);
>>> > +	wrmsrl_cstar(0ULL);
>>> 
>>> That write is pointless. See the comment in wrmsrl_cstar().
>>
>> What I heard is that AMD is going to support FRED.
>>
>> Both LSTAR and CSTAR have no function when FRED is enabled, so maybe
>> just do NOT write to them?
>
>Right. If AMD needs to clear it then it's trivial enough to add a
>wrmsrl_cstar(0) to it.

Just to clarify: the only reason I added the writes here was to possibly make bugs easier to track down. There is indeed no functional reason.

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH v10 16/38] x86/ptrace: Add FRED additional information to the pt_regs structure
  2023-09-20 17:23     ` Li, Xin3
@ 2023-09-21  6:07       ` Nikolay Borisov
  2023-09-21  6:24         ` Li, Xin3
  0 siblings, 1 reply; 88+ messages in thread
From: Nikolay Borisov @ 2023-09-21  6:07 UTC (permalink / raw)
  To: Li, Xin3, linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-edac@vger.kernel.org, linux-hyperv@vger.kernel.org,
	kvm@vger.kernel.org, xen-devel@lists.xenproject.org
  Cc: tglx@linutronix.de, mingo@redhat.com, bp@alien8.de,
	dave.hansen@linux.intel.com, x86@kernel.org, hpa@zytor.com,
	Lutomirski, Andy, pbonzini@redhat.com, Christopherson,, Sean,
	peterz@infradead.org, Gross, Jurgen, Shankar, Ravi V,
	mhiramat@kernel.org, andrew.cooper3@citrix.com,
	jiangshanlai@gmail.com



On 20.09.23 г. 20:23 ч., Li, Xin3 wrote:
>>> +struct fred_ss {
>>> +	u64	ss	: 16,	// SS selector
>>
>> Is this structure conformant to the return state as described in FRED 5.0?
>>
>> — The stack segment of the interrupted context, 64 bits formatted as follows:
>>
>> • Bits 15:0 contain the SS selector. < - WE HAVE THIS
>>
>> • Bits 31:16 are not currently defined and will be zero until they are.
> 
> Where did you download the FRED 5.0 spec from?
> 
> Mine says bit 16 is sti, bit 17 for sw initiated events and bit 18 is NMI.
> 
> I guess you have FRED 3.0 spec, no?
Doh, you are right, I was looking at the wrong version of the
document... sorry for the noise.
> 
>>   < - MISSING > hole?
>>
>>> +		sti	:  1,	// STI state < -
>>> +		swevent	:  1,	// Set if syscall, sysenter or INT n
>>> +		nmi	:  1,	// Event is NMI type
>>> +			: 13,
>   
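
The bit positions agreed on above (SS selector in bits 15:0, STI in bit 16, software-initiated event in bit 17, NMI in bit 18) can be written with explicit masks. This is only a sketch: the macro and helper names are hypothetical, and it covers only the bits named in this exchange, not the rest of the 64-bit field.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names; bit positions as stated in the thread for FRED 5.0. */
#define FRED_SS_SELECTOR_MASK	0xffffULL	/* bits 15:0 */
#define FRED_SS_STI		(1ULL << 16)	/* STI state */
#define FRED_SS_SWEVENT		(1ULL << 17)	/* syscall, sysenter or INT n */
#define FRED_SS_NMI		(1ULL << 18)	/* event is NMI type */

static_assert(((FRED_SS_STI | FRED_SS_SWEVENT | FRED_SS_NMI) &
	       FRED_SS_SELECTOR_MASK) == 0,
	      "flag bits must not overlap the selector");

/* Compose the low 19 bits of the augmented SS value. */
static uint64_t fred_ss_make(uint16_t selector, bool sti, bool swevent, bool nmi)
{
	uint64_t ss = selector;

	if (sti)
		ss |= FRED_SS_STI;
	if (swevent)
		ss |= FRED_SS_SWEVENT;
	if (nmi)
		ss |= FRED_SS_NMI;
	return ss;
}

static uint16_t fred_ss_selector(uint64_t ss)
{
	return (uint16_t)(ss & FRED_SS_SELECTOR_MASK);
}

static bool fred_ss_is_nmi(uint64_t ss)
{
	return (ss & FRED_SS_NMI) != 0;
}
```

Explicit masks are used here instead of C bitfields because bitfield layout is implementation-defined; the patch itself uses bitfields, which is fine for a single known ABI.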

^ permalink raw reply	[flat|nested] 88+ messages in thread

* RE: [PATCH v10 16/38] x86/ptrace: Add FRED additional information to the pt_regs structure
  2023-09-21  6:07       ` Nikolay Borisov
@ 2023-09-21  6:24         ` Li, Xin3
  0 siblings, 0 replies; 88+ messages in thread
From: Li, Xin3 @ 2023-09-21  6:24 UTC (permalink / raw)
  To: Nikolay Borisov, linux-doc@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-edac@vger.kernel.org,
	linux-hyperv@vger.kernel.org, kvm@vger.kernel.org,
	xen-devel@lists.xenproject.org
  Cc: tglx@linutronix.de, mingo@redhat.com, bp@alien8.de,
	dave.hansen@linux.intel.com, x86@kernel.org, hpa@zytor.com,
	Lutomirski, Andy, pbonzini@redhat.com, Christopherson,, Sean,
	peterz@infradead.org, Gross, Jurgen, Shankar, Ravi V,
	mhiramat@kernel.org, andrew.cooper3@citrix.com,
	jiangshanlai@gmail.com

> > I guess you have FRED 3.0 spec, no?
> Doh, you are right, I was looking at the wrong version of the
> document... sorry for the noise.

Actually I appreciate your review so much!

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH v10 28/38] x86/fred: FRED entry/exit and dispatch code
  2023-09-14  4:47 ` [PATCH v10 28/38] x86/fred: FRED entry/exit and dispatch code Xin Li
@ 2023-09-21  9:48   ` Nikolay Borisov
  2023-09-21 10:08     ` Thomas Gleixner
  0 siblings, 1 reply; 88+ messages in thread
From: Nikolay Borisov @ 2023-09-21  9:48 UTC (permalink / raw)
  To: Xin Li, linux-doc, linux-kernel, linux-edac, linux-hyperv, kvm,
	xen-devel
  Cc: tglx, mingo, bp, dave.hansen, x86, hpa, luto, pbonzini, seanjc,
	peterz, jgross, ravi.v.shankar, mhiramat, andrew.cooper3,
	jiangshanlai



On 14.09.23 г. 7:47 ч., Xin Li wrote:
> From: "H. Peter Anvin (Intel)" <hpa@zytor.com>
> 
> The code to actually handle kernel and event entry/exit using
> FRED. It is split up into two files thus:
> 
> - entry_64_fred.S contains the actual entrypoints and exit code, and
>    saves and restores registers.
> - entry_fred.c contains the two-level event dispatch code for FRED.
>    The first-level dispatch is on the event type, and the second-level
>    is on the event vector.
> 
> Originally-by: Megha Dey <megha.dey@intel.com>
> Signed-off-by: H. Peter Anvin (Intel) <hpa@zytor.com>
> Co-developed-by: Xin Li <xin3.li@intel.com>
> Tested-by: Shan Kang <shan.kang@intel.com>
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> Signed-off-by: Xin Li <xin3.li@intel.com>
> ---
> 
> Changes since v9:
> * Don't use jump tables, indirect jumps are expensive (Thomas Gleixner).
> * Except NMI/#DB/#MCE, FRED really can share the exception handlers
>    with IDT (Thomas Gleixner).
> * Avoid the sysvec_* idt_entry muck, do it at a central place, reuse code
>    instead of blindly copying it, which breaks the performance optimized
>    sysvec entries like reschedule_ipi (Thomas Gleixner).
> * Add asm_ prefix to FRED asm entry points (Thomas Gleixner).
> 
> Changes since v8:
> * Don't do syscall early out in fred_entry_from_user() before there are
>    proper performance numbers and justifications (Thomas Gleixner).
> * Add the control exception handler to the FRED exception handler table
>    (Thomas Gleixner).
> * Add ENDBR to the FRED_ENTER asm macro.
> * Reflect the FRED spec 5.0 change that ERETS and ERETU add 8 to %rsp
>    before popping the return context from the stack.
> 
> Changes since v1:
> * Initialize a FRED exception handler to fred_bad_event() instead of NULL
>    if no FRED handler defined for an exception vector (Peter Zijlstra).
> * Push calling irqentry_{enter,exit}() and instrumentation_{begin,end}()
>    down into individual FRED exception handlers, instead of in the dispatch
>    framework (Peter Zijlstra).
> ---
>   arch/x86/entry/Makefile               |   5 +-
>   arch/x86/entry/entry_64_fred.S        |  52 ++++++
>   arch/x86/entry/entry_fred.c           | 230 ++++++++++++++++++++++++++
>   arch/x86/include/asm/asm-prototypes.h |   1 +
>   arch/x86/include/asm/fred.h           |   6 +
>   5 files changed, 293 insertions(+), 1 deletion(-)
>   create mode 100644 arch/x86/entry/entry_64_fred.S
>   create mode 100644 arch/x86/entry/entry_fred.c
>


<snip>

> +
> +static noinstr void fred_intx(struct pt_regs *regs)
> +{
> +	switch (regs->fred_ss.vector) {
> +	/* INT0 */
> +	case X86_TRAP_OF:
> +		exc_overflow(regs);
> +		return;
> +
> +	/* INT3 */
> +	case X86_TRAP_BP:
> +		exc_int3(regs);
> +		return;
> +
> +	/* INT80 */
> +	case IA32_SYSCALL_VECTOR:
> +		if (likely(IS_ENABLED(CONFIG_IA32_EMULATION))) {

Since future kernels will support boot-time toggling of whether the 32-bit
syscall interface is enabled, as per:
https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git/commit/?h=x86/entry&id=1da5c9bc119d3a749b519596b93f9b2667e93c4a

it will make more sense to replace this with an ia32_enabled() invocation.
I guess this could be done as a follow-up patch, depending on when this is
merged, as the ia32_enabled() changes are going to be merged in 6.7.


> +			/* Save the syscall number */
> +			regs->orig_ax = regs->ax;
> +			regs->ax = -ENOSYS;
> +			do_int80_syscall_32(regs);
> +			return;
> +		}
> +		fallthrough;
> +
> +	default:
> +		exc_general_protection(regs, 0);
> +		return;
> +	}
> +}
> +
> +static __always_inline void fred_other(struct pt_regs *regs)
> +{
> +	/* The compiler can fold these conditions into a single test */
> +	if (likely(regs->fred_ss.vector == FRED_SYSCALL && regs->fred_ss.lm)) {
> +		regs->orig_ax = regs->ax;
> +		regs->ax = -ENOSYS;
> +		do_syscall_64(regs, regs->orig_ax);
> +		return;
> +	} else if (likely(IS_ENABLED(CONFIG_IA32_EMULATION) &&

Ditto

> +			  regs->fred_ss.vector == FRED_SYSENTER &&
> +			  !regs->fred_ss.lm)) {
> +		regs->orig_ax = regs->ax;
> +		regs->ax = -ENOSYS;
> +		do_fast_syscall_32(regs);
> +		return;
> +	} else {
> +		exc_invalid_op(regs);
> +		return;
> +	}
> +}
> +

<snip>
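
The decision logic of the quoted fred_other(), combined with the suggestion to consult a boot-time switch rather than a compile-time option, can be modelled as below. This is a user-space sketch under stated assumptions: the vector values, the fred_event structure, the fred_action enum and the ia32_emulation_on flag are all hypothetical stand-ins, and the flag only mimics what an ia32_enabled()-style check would return.

```c
#include <assert.h>
#include <stdbool.h>

/* Vector numbers are assumptions made for this sketch. */
enum { FRED_SYSCALL = 1, FRED_SYSENTER = 2 };

/* What the dispatcher would call next. */
enum fred_action { DO_SYSCALL_64, DO_FAST_SYSCALL_32, DO_INVALID_OP };

/* Hypothetical stand-in for a boot-time toggle of the 32-bit syscall interface. */
static bool ia32_emulation_on = true;

/* The two fields of the augmented SS that the dispatch looks at. */
struct fred_event {
	unsigned int vector;
	bool lm;		/* event was raised in 64-bit (long) mode */
};

static enum fred_action fred_other_model(const struct fred_event *ev)
{
	assert(ev);

	if (ev->vector == FRED_SYSCALL && ev->lm)
		return DO_SYSCALL_64;

	/* The run-time check replaces IS_ENABLED(CONFIG_IA32_EMULATION). */
	if (ia32_emulation_on && ev->vector == FRED_SYSENTER && !ev->lm)
		return DO_FAST_SYSCALL_32;

	return DO_INVALID_OP;
}
```

With the flag cleared, a SYSENTER from compatibility mode falls through to the invalid-opcode path, which mirrors what the quoted code does when CONFIG_IA32_EMULATION is not set.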

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH v10 28/38] x86/fred: FRED entry/exit and dispatch code
  2023-09-21  9:48   ` Nikolay Borisov
@ 2023-09-21 10:08     ` Thomas Gleixner
  2023-09-21 17:54       ` Li, Xin3
  0 siblings, 1 reply; 88+ messages in thread
From: Thomas Gleixner @ 2023-09-21 10:08 UTC (permalink / raw)
  To: Nikolay Borisov, Xin Li, linux-doc, linux-kernel, linux-edac,
	linux-hyperv, kvm, xen-devel
  Cc: mingo, bp, dave.hansen, x86, hpa, luto, pbonzini, seanjc, peterz,
	jgross, ravi.v.shankar, mhiramat, andrew.cooper3, jiangshanlai

On Thu, Sep 21 2023 at 12:48, Nikolay Borisov wrote:
> On 14.09.23 г. 7:47 ч., Xin Li wrote:
>> +
>> +	/* INT80 */
>> +	case IA32_SYSCALL_VECTOR:
>> +		if (likely(IS_ENABLED(CONFIG_IA32_EMULATION))) {
>
> Since future kernels will support boot-time toggling of whether the 32-bit
> syscall interface is enabled, as per:
> https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git/commit/?h=x86/entry&id=1da5c9bc119d3a749b519596b93f9b2667e93c4a
>
> it will make more sense to replace this with an ia32_enabled() invocation.
> I guess this could be done as a follow-up patch, depending on when this is
> merged, as the ia32_enabled() changes are going to be merged in 6.7.

The simplest solution is to rebase the series to tip x86/entry and just
do it right away :)

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH v10 33/38] x86/entry: Add fred_entry_from_kvm() for VMX to handle IRQ/NMI
  2023-09-14  4:48 ` [PATCH v10 33/38] x86/entry: Add fred_entry_from_kvm() for VMX to handle IRQ/NMI Xin Li
  2023-09-20 17:54   ` Paolo Bonzini
@ 2023-09-21 12:11   ` Nikolay Borisov
  2023-09-21 12:38     ` Paolo Bonzini
  1 sibling, 1 reply; 88+ messages in thread
From: Nikolay Borisov @ 2023-09-21 12:11 UTC (permalink / raw)
  To: Xin Li, linux-doc, linux-kernel, linux-edac, linux-hyperv, kvm,
	xen-devel
  Cc: tglx, mingo, bp, dave.hansen, x86, hpa, luto, pbonzini, seanjc,
	peterz, jgross, ravi.v.shankar, mhiramat, andrew.cooper3,
	jiangshanlai



On 14.09.23 г. 7:48 ч., Xin Li wrote:
> In IRQ/NMI induced VM exits, KVM VMX needs to execute the respective
> handlers, which requires the software to create a FRED stack frame,
> and use it to invoke the handlers. Add fred_entry_from_kvm() for
> this job.
> 
> Export fred_entry_from_kvm() because VMX can be compiled as a module.
> 
> Suggested-by: Sean Christopherson <seanjc@google.com>
> Tested-by: Shan Kang <shan.kang@intel.com>
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> Signed-off-by: Xin Li <xin3.li@intel.com>
> ---
> 
> Changes since v9:
> * Shove the whole thing into arch/x86/entry/entry_64_fred.S for invoking
>    external_interrupt() and fred_exc_nmi() (Sean Christopherson).
> * Correct and improve a few comments (Sean Christopherson).
> * Merge the two IRQ/NMI asm entries into one as it's fine to invoke
>    noinstr code from regular code (Thomas Gleixner).
> * Setup the long mode and NMI flags in the augmented SS field of FRED
>    stack frame in C instead of asm (Thomas Gleixner).
> * Add UNWIND_HINT_{SAVE,RESTORE} to get rid of the warning: "objtool:
>    asm_fred_entry_from_kvm+0x0: unreachable instruction" (Peter Zijlstra).
> 
> Changes since v8:
> * Add a new macro VMX_DO_FRED_EVENT_IRQOFF for FRED instead of
>    refactoring VMX_DO_EVENT_IRQOFF (Sean Christopherson).
> * Do NOT use a trampoline, just LEA+PUSH the return RIP, PUSH the error
>    code, and jump to the FRED kernel entry point for NMI or call
>    external_interrupt() for IRQs (Sean Christopherson).
> * Call external_interrupt() only when FRED is enabled, and convert the
>    non-FRED handling to external_interrupt() after FRED lands (Sean
>    Christopherson).
> ---
>   arch/x86/entry/entry_64_fred.S | 73 ++++++++++++++++++++++++++++++++++
>   arch/x86/entry/entry_fred.c    | 14 +++++++
>   arch/x86/include/asm/fred.h    | 18 +++++++++
>   3 files changed, 105 insertions(+)
> 
> diff --git a/arch/x86/entry/entry_64_fred.S b/arch/x86/entry/entry_64_fred.S
> index d1c2fc4af8ae..f1088d6f2054 100644
> --- a/arch/x86/entry/entry_64_fred.S
> +++ b/arch/x86/entry/entry_64_fred.S
> @@ -4,7 +4,9 @@
>    */
>   
>   #include <asm/asm.h>
> +#include <asm/export.h>
>   #include <asm/fred.h>
> +#include <asm/segment.h>
>   
>   #include "calling.h"
>   
> @@ -54,3 +56,74 @@ SYM_CODE_START_NOALIGN(asm_fred_entrypoint_kernel)
>   	FRED_EXIT
>   	ERETS
>   SYM_CODE_END(asm_fred_entrypoint_kernel)
> +
> +#if IS_ENABLED(CONFIG_KVM_INTEL)
> +SYM_FUNC_START(asm_fred_entry_from_kvm)
> +	push %rbp
> +	mov %rsp, %rbp

Use the FRAME_BEGIN/FRAME_END macros to omit this code if
CONFIG_FRAME_POINTER is disabled.

> +
> +	UNWIND_HINT_SAVE
> +
> +	/*
> +	 * Don't check the FRED stack level, the call stack leading to this
> +	 * helper is effectively constant and shallow (relatively speaking).
> +	 *
> +	 * Emulate the FRED-defined redzone and stack alignment.
> +	 */
> +	sub $(FRED_CONFIG_REDZONE_AMOUNT << 6), %rsp
> +	and $FRED_STACK_FRAME_RSP_MASK, %rsp
> +
> +	/*
> +	 * Start to push a FRED stack frame, which is always 64 bytes:
> +	 *
> +	 * +--------+-----------------+
> +	 * | Bytes  | Usage           |
> +	 * +--------+-----------------+
> +	 * | 63:56  | Reserved        |
> +	 * | 55:48  | Event Data      |
> +	 * | 47:40  | SS + Event Info |
> +	 * | 39:32  | RSP             |
> +	 * | 31:24  | RFLAGS          |
> +	 * | 23:16  | CS + Aux Info   |
> +	 * |  15:8  | RIP             |
> +	 * |   7:0  | Error Code      |
> +	 * +--------+-----------------+
> +	 */
> +	push $0				/* Reserved, must be 0 */
> +	push $0				/* Event data, 0 for IRQ/NMI */
> +	push %rdi			/* fred_ss handed in by the caller */
> +	push %rbp
> +	pushf
> +	mov $__KERNEL_CS, %rax
> +	push %rax
> +
> +	/*
> +	 * Unlike the IDT event delivery, FRED _always_ pushes an error code
> +	 * after pushing the return RIP, thus the CALL instruction CANNOT be
> +	 * used here to push the return RIP, otherwise there is no chance to
> +	 * push an error code before invoking the IRQ/NMI handler.
> +	 *
> +	 * Use LEA to get the return RIP and push it, then push an error code.
> +	 */
> +	lea 1f(%rip), %rax
> +	push %rax				/* Return RIP */
> +	push $0					/* Error code, 0 for IRQ/NMI */
> +
> +	PUSH_AND_CLEAR_REGS clear_bp=0 unwind_hint=0
> +	movq %rsp, %rdi				/* %rdi -> pt_regs */
> +	call __fred_entry_from_kvm		/* Call the C entry point */
> +	POP_REGS
> +	ERETS
> +1:
> +	/*
> +	 * Objtool doesn't understand what ERETS does, this hint tells it that
> +	 * yes, we'll reach here and with what stack state. A save/restore pair
> +	 * isn't strictly needed, but it's the simplest form.
> +	 */
> +	UNWIND_HINT_RESTORE
> +	pop %rbp

FRAME_END

> +	RET
> +
> +SYM_FUNC_END(asm_fred_entry_from_kvm)
> +EXPORT_SYMBOL_GPL(asm_fred_entry_from_kvm);
> +#endif
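
The 64-byte frame described in the quoted comment table can be mirrored by a C structure whose field order follows the byte offsets in that table, lowest address first. This is a sketch for illustration only: struct fred_frame_model and fred_frame_for_kvm_event() are hypothetical names, and the kernel represents this frame through pt_regs rather than a dedicated type.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mirror of the table: one 8-byte slot per row. */
struct fred_frame_model {
	uint64_t error_code;	/* bytes  7:0,  pushed last */
	uint64_t rip;		/* bytes 15:8  */
	uint64_t cs_aux;	/* bytes 23:16 */
	uint64_t rflags;	/* bytes 31:24 */
	uint64_t rsp;		/* bytes 39:32 */
	uint64_t ss_event_info;	/* bytes 47:40 */
	uint64_t event_data;	/* bytes 55:48 */
	uint64_t reserved;	/* bytes 63:56, pushed first */
};

static_assert(sizeof(struct fred_frame_model) == 64,
	      "a FRED stack frame is always 64 bytes");
static_assert(offsetof(struct fred_frame_model, rsp) == 32,
	      "RSP slot sits at bytes 39:32");

/* Fill the frame the way the asm pushes it for an IRQ or NMI. */
static struct fred_frame_model
fred_frame_for_kvm_event(uint64_t fred_ss, uint64_t saved_rsp,
			 uint64_t rflags, uint64_t cs, uint64_t return_rip)
{
	struct fred_frame_model f = {
		.error_code	= 0,		/* 0 for IRQ/NMI */
		.rip		= return_rip,	/* address of the 1: label */
		.cs_aux		= cs,
		.rflags		= rflags,
		.rsp		= saved_rsp,	/* the value held in %rbp */
		.ss_event_info	= fred_ss,	/* handed in by the caller */
		.event_data	= 0,		/* 0 for IRQ/NMI */
		.reserved	= 0,		/* must be 0 */
	};
	return f;
}
```

Note that the RSP slot is filled from %rbp in the assembly, since %rbp holds the stack pointer saved by the prologue before the red zone adjustment and alignment were applied.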


<snip>

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH v10 33/38] x86/entry: Add fred_entry_from_kvm() for VMX to handle IRQ/NMI
  2023-09-21 12:11   ` Nikolay Borisov
@ 2023-09-21 12:38     ` Paolo Bonzini
  0 siblings, 0 replies; 88+ messages in thread
From: Paolo Bonzini @ 2023-09-21 12:38 UTC (permalink / raw)
  To: Nikolay Borisov, Xin Li, linux-doc, linux-kernel, linux-edac,
	linux-hyperv, kvm, xen-devel
  Cc: tglx, mingo, bp, dave.hansen, x86, hpa, luto, seanjc, peterz,
	jgross, ravi.v.shankar, mhiramat, andrew.cooper3, jiangshanlai

On 9/21/23 14:11, Nikolay Borisov wrote:
>>
>> +SYM_FUNC_START(asm_fred_entry_from_kvm)
>> +    push %rbp
>> +    mov %rsp, %rbp
> 
> Use the FRAME_BEGIN/FRAME_END macros to omit this code if
> CONFIG_FRAME_POINTER is disabled.

No, the previous stack pointer is used below, so the code might as well 
use %rbp for that; but it must do so unconditionally.

Paolo

>> +
>> +    UNWIND_HINT_SAVE
>> +
>> +    /*
>> +     * Don't check the FRED stack level, the call stack leading to this
>> +     * helper is effectively constant and shallow (relatively speaking).
>> +     *
>> +     * Emulate the FRED-defined redzone and stack alignment.
>> +     */
>> +    sub $(FRED_CONFIG_REDZONE_AMOUNT << 6), %rsp
>> +    and $FRED_STACK_FRAME_RSP_MASK, %rsp
>> +
>> +    /*
>> +     * Start to push a FRED stack frame, which is always 64 bytes:
>> +     *
>> +     * +--------+-----------------+
>> +     * | Bytes  | Usage           |
>> +     * +--------+-----------------+
>> +     * | 63:56  | Reserved        |
>> +     * | 55:48  | Event Data      |
>> +     * | 47:40  | SS + Event Info |
>> +     * | 39:32  | RSP             |
>> +     * | 31:24  | RFLAGS          |
>> +     * | 23:16  | CS + Aux Info   |
>> +     * |  15:8  | RIP             |
>> +     * |   7:0  | Error Code      |
>> +     * +--------+-----------------+
>> +     */
>> +    push $0                /* Reserved, must be 0 */
>> +    push $0                /* Event data, 0 for IRQ/NMI */
>> +    push %rdi            /* fred_ss handed in by the caller */
>> +    push %rbp

^^ here

Paolo

>> +    pushf
>> +    mov $__KERNEL_CS, %rax
>> +    push %rax 


^ permalink raw reply	[flat|nested] 88+ messages in thread

* RE: [PATCH v10 28/38] x86/fred: FRED entry/exit and dispatch code
  2023-09-21 10:08     ` Thomas Gleixner
@ 2023-09-21 17:54       ` Li, Xin3
  0 siblings, 0 replies; 88+ messages in thread
From: Li, Xin3 @ 2023-09-21 17:54 UTC (permalink / raw)
  To: Thomas Gleixner, Nikolay Borisov, linux-doc@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-edac@vger.kernel.org,
	linux-hyperv@vger.kernel.org, kvm@vger.kernel.org,
	xen-devel@lists.xenproject.org
  Cc: mingo@redhat.com, bp@alien8.de, dave.hansen@linux.intel.com,
	x86@kernel.org, hpa@zytor.com, Lutomirski, Andy,
	pbonzini@redhat.com, Christopherson,, Sean, peterz@infradead.org,
	Gross, Jurgen, Shankar, Ravi V, mhiramat@kernel.org,
	andrew.cooper3@citrix.com, jiangshanlai@gmail.com

> > Since future kernels will support boot-time toggling of whether the 32-bit
> > syscall interface is enabled, as per:
> > https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git/commit/?h=x86/entry&id=1da5c9bc119d3a749b519596b93f9b2667e93c4a
> >
> > it will make more sense to replace this with an ia32_enabled() invocation.
> > I guess this could be done as a follow-up patch, depending on when this is
> > merged, as the ia32_enabled() changes are going to be merged in 6.7.
> 
> The simplest solution is to rebase the series to tip x86/entry and just do it right
> away :)

Just did it for the next iteration.

^ permalink raw reply	[flat|nested] 88+ messages in thread

* RE: [PATCH v10 03/38] x86/msr: Add the WRMSRNS instruction support
  2023-09-20  8:18     ` Li, Xin3
@ 2023-09-22  8:16       ` Li, Xin3
  2023-09-22 15:00         ` Thomas Gleixner
  0 siblings, 1 reply; 88+ messages in thread
From: Li, Xin3 @ 2023-09-22  8:16 UTC (permalink / raw)
  To: Li, Xin3, Nikolay Borisov, linux-doc@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-edac@vger.kernel.org,
	linux-hyperv@vger.kernel.org, kvm@vger.kernel.org,
	xen-devel@lists.xenproject.org
  Cc: tglx@linutronix.de, mingo@redhat.com, bp@alien8.de,
	dave.hansen@linux.intel.com, x86@kernel.org, hpa@zytor.com,
	Lutomirski, Andy, pbonzini@redhat.com, Christopherson,, Sean,
	peterz@infradead.org, Gross, Jurgen, Shankar, Ravi V,
	mhiramat@kernel.org, andrew.cooper3@citrix.com,
	jiangshanlai@gmail.com

> > > +static __always_inline void __wrmsrns(u32 msr, u32 low, u32 high)
> >
> > Shouldn't this be named wrmsrns_safe, since it has exception handling,
> > similar to the current wrmsrl_safe?
> >
> 
> Both safe and unsafe versions have exception handling, while the safe
> version returns an integer to its caller to indicate whether an
> exception happened.

I notice there are several call sites using the safe version w/o
checking the return value, should the unsafe version be a better
choice in such cases?

^ permalink raw reply	[flat|nested] 88+ messages in thread

* RE: [PATCH v10 03/38] x86/msr: Add the WRMSRNS instruction support
  2023-09-22  8:16       ` Li, Xin3
@ 2023-09-22 15:00         ` Thomas Gleixner
  2023-09-22 23:21           ` Li, Xin3
  0 siblings, 1 reply; 88+ messages in thread
From: Thomas Gleixner @ 2023-09-22 15:00 UTC (permalink / raw)
  To: Li, Xin3, Li, Xin3, Nikolay Borisov, linux-doc@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-edac@vger.kernel.org,
	linux-hyperv@vger.kernel.org, kvm@vger.kernel.org,
	xen-devel@lists.xenproject.org
  Cc: mingo@redhat.com, bp@alien8.de, dave.hansen@linux.intel.com,
	x86@kernel.org, hpa@zytor.com, Lutomirski, Andy,
	pbonzini@redhat.com, Christopherson,, Sean, peterz@infradead.org,
	Gross, Jurgen, Shankar, Ravi V, mhiramat@kernel.org,
	andrew.cooper3@citrix.com, jiangshanlai@gmail.com

On Fri, Sep 22 2023 at 08:16, Xin3 Li wrote:
>> > > +static __always_inline void __wrmsrns(u32 msr, u32 low, u32 high)
>> >
>> > Shouldn't this be named wrmsrns_safe, since it has exception handling,
>> > similar to the current wrmsrl_safe?
>> >
>> 
>> Both safe and unsafe versions have exception handling, while the safe
>> version returns an integer to its caller to indicate whether an
>> exception happened.
>
> I notice there are several call sites using the safe version w/o
> checking the return value, should the unsafe version be a better
> choice in such cases?

Depends. The safe version does not emit a warning on fail. So if the
callsite truly does not care about the error it's fine.
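
The distinction drawn in this exchange can be shown with a small user-space model: both variants survive a faulting write, but only the plain one reports it, while the _safe one hands an error code back to the caller and stays silent. Everything here is hypothetical scaffolding (fake_msr_write(), model_wrmsr(), model_wrmsr_safe(), the warning counter and the choice of -EIO); it only illustrates the calling convention and is not the kernel's implementation.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Counts warnings so the behaviour is observable in a test. */
static unsigned int warnings_emitted;

/* Hypothetical fake hardware: pretend only MSR index 0x10 exists. */
static bool fake_msr_write(uint32_t msr, uint64_t val)
{
	(void)val;
	return msr == 0x10;
}

/* Plain variant: the fault is handled, and a warning is emitted. */
static void model_wrmsr(uint32_t msr, uint64_t val)
{
	if (!fake_msr_write(msr, val)) {
		warnings_emitted++;
		fprintf(stderr, "unchecked MSR write to 0x%x\n",
			(unsigned int)msr);
	}
}

/* _safe variant: no warning, the caller gets 0 or a negative errno. */
static int model_wrmsr_safe(uint32_t msr, uint64_t val)
{
	return fake_msr_write(msr, val) ? 0 : -EIO;
}
```

A call site that ignores the return value of the _safe variant therefore opts out of any diagnostic, which is acceptable only if a failed write really is harmless there.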

^ permalink raw reply	[flat|nested] 88+ messages in thread

* RE: [PATCH v10 03/38] x86/msr: Add the WRMSRNS instruction support
  2023-09-22 15:00         ` Thomas Gleixner
@ 2023-09-22 23:21           ` Li, Xin3
  0 siblings, 0 replies; 88+ messages in thread
From: Li, Xin3 @ 2023-09-22 23:21 UTC (permalink / raw)
  To: Thomas Gleixner, Nikolay Borisov, linux-doc@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-edac@vger.kernel.org,
	linux-hyperv@vger.kernel.org, kvm@vger.kernel.org,
	xen-devel@lists.xenproject.org
  Cc: mingo@redhat.com, bp@alien8.de, dave.hansen@linux.intel.com,
	x86@kernel.org, hpa@zytor.com, Lutomirski, Andy,
	pbonzini@redhat.com, Christopherson,, Sean, peterz@infradead.org,
	Gross, Jurgen, Shankar, Ravi V, mhiramat@kernel.org,
	andrew.cooper3@citrix.com, jiangshanlai@gmail.com

> > I notice there are several call sites using the safe version w/o
> > checking the return value, should the unsafe version be a better
> > choice in such cases?
> 
> Depends. The safe version does not emit a warning on fail. So if the
> callsite truly does not care about the error it's fine.

Right. So the _safe suffix also means suppressing a warning that the
caller doesn't care about.

Thanks!
    Xin


^ permalink raw reply	[flat|nested] 88+ messages in thread

end of thread, other threads:[~2023-09-22 23:21 UTC | newest]

Thread overview: 88+ messages
2023-09-14  4:47 [PATCH v10 00/38] x86: enable FRED for x86-64 Xin Li
2023-09-14  4:47 ` [PATCH v10 01/38] x86/cpufeatures: Add the cpu feature bit for WRMSRNS Xin Li
2023-09-14  4:47 ` [PATCH v10 02/38] x86/opcode: Add the WRMSRNS instruction to the x86 opcode map Xin Li
2023-09-15  5:47   ` Masami Hiramatsu
2023-09-14  4:47 ` [PATCH v10 03/38] x86/msr: Add the WRMSRNS instruction support Xin Li
2023-09-14  6:02   ` Juergen Gross
2023-09-14 13:01     ` andrew.cooper3
2023-09-14 14:05   ` andrew.cooper3
2023-09-14 23:00     ` Thomas Gleixner
2023-09-14 23:34       ` H. Peter Anvin
2023-09-14 23:46       ` andrew.cooper3
2023-09-15  0:12         ` Thomas Gleixner
2023-09-15  0:33           ` andrew.cooper3
2023-09-15  0:38             ` H. Peter Anvin
2023-09-15  1:46               ` andrew.cooper3
2023-09-15  2:06                 ` H. Peter Anvin
2023-09-15  0:42         ` Thomas Gleixner
2023-09-15  1:01         ` H. Peter Anvin
2023-09-15  1:16           ` andrew.cooper3
2023-09-15  5:32             ` Juergen Gross
2023-09-20 15:00             ` Peter Zijlstra
2023-09-20 15:04               ` Juergen Gross
2023-09-20  7:58   ` Nikolay Borisov
2023-09-20  8:18     ` Li, Xin3
2023-09-22  8:16       ` Li, Xin3
2023-09-22 15:00         ` Thomas Gleixner
2023-09-22 23:21           ` Li, Xin3
2023-09-14  4:47 ` [PATCH v10 04/38] x86/entry: Remove idtentry_sysvec from entry_{32,64}.S Xin Li
2023-09-14  4:47 ` [PATCH v10 05/38] x86/trapnr: Add event type macros to <asm/trapnr.h> Xin Li
2023-09-14 14:22   ` andrew.cooper3
2023-09-14  4:47 ` [PATCH v10 06/38] Documentation/x86/64: Add a documentation for FRED Xin Li
2023-09-20  9:44   ` Nikolay Borisov
2023-09-14  4:47 ` [PATCH v10 07/38] x86/fred: Add Kconfig option for FRED (CONFIG_X86_FRED) Xin Li
2023-09-14  4:47 ` [PATCH v10 08/38] x86/cpufeatures: Add the cpu feature bit for FRED Xin Li
2023-09-14  6:03   ` Juergen Gross
2023-09-14  6:09     ` Jan Beulich
2023-09-14 13:15       ` andrew.cooper3
2023-09-15  1:07         ` Thomas Gleixner
2023-09-15  5:27           ` Juergen Gross
2023-09-14  4:47 ` [PATCH v10 09/38] x86/fred: Disable FRED support if CONFIG_X86_FRED is disabled Xin Li
2023-09-20 10:19   ` Nikolay Borisov
2023-09-14  4:47 ` [PATCH v10 10/38] x86/fred: Disable FRED by default in its early stage Xin Li
2023-09-14  4:47 ` [PATCH v10 11/38] x86/opcode: Add ERET[US] instructions to the x86 opcode map Xin Li
2023-09-14  4:47 ` [PATCH v10 12/38] x86/objtool: Teach objtool about ERET[US] Xin Li
2023-09-14  4:47 ` [PATCH v10 13/38] x86/cpu: Add X86_CR4_FRED macro Xin Li
2023-09-20 10:50   ` Nikolay Borisov
2023-09-20 17:25     ` Li, Xin3
2023-09-14  4:47 ` [PATCH v10 14/38] x86/cpu: Add MSR numbers for FRED configuration Xin Li
2023-09-14  4:47 ` [PATCH v10 15/38] x86/ptrace: Cleanup the definition of the pt_regs structure Xin Li
2023-09-14  4:47 ` [PATCH v10 16/38] x86/ptrace: Add FRED additional information to " Xin Li
2023-09-20 12:57   ` Nikolay Borisov
2023-09-20 17:23     ` Li, Xin3
2023-09-21  6:07       ` Nikolay Borisov
2023-09-21  6:24         ` Li, Xin3
2023-09-14  4:47 ` [PATCH v10 17/38] x86/fred: Add a new header file for FRED definitions Xin Li
2023-09-14  4:47 ` [PATCH v10 18/38] x86/fred: Reserve space for the FRED stack frame Xin Li
2023-09-14  4:47 ` [PATCH v10 19/38] x86/fred: Update MSR_IA32_FRED_RSP0 during task switch Xin Li
2023-09-14  4:47 ` [PATCH v10 20/38] x86/fred: Disallow the swapgs instruction when FRED is enabled Xin Li
2023-09-14  4:47 ` [PATCH v10 21/38] x86/fred: No ESPFIX needed " Xin Li
2023-09-14  4:47 ` [PATCH v10 22/38] x86/fred: Allow single-step trap and NMI when starting a new task Xin Li
2023-09-14  4:47 ` [PATCH v10 23/38] x86/fred: Make exc_page_fault() work for FRED Xin Li
2023-09-14  4:47 ` [PATCH v10 24/38] x86/idtentry: Incorporate definitions/declarations of the FRED entries Xin Li
2023-09-14  4:47 ` [PATCH v10 25/38] x86/fred: Add a debug fault entry stub for FRED Xin Li
2023-09-14  4:47 ` [PATCH v10 26/38] x86/fred: Add a NMI " Xin Li
2023-09-14  4:47 ` [PATCH v10 27/38] x86/fred: Add a machine check " Xin Li
2023-09-14  4:47 ` [PATCH v10 28/38] x86/fred: FRED entry/exit and dispatch code Xin Li
2023-09-21  9:48   ` Nikolay Borisov
2023-09-21 10:08     ` Thomas Gleixner
2023-09-21 17:54       ` Li, Xin3
2023-09-14  4:47 ` [PATCH v10 29/38] x86/traps: Add sysvec_install() to install a system interrupt handler Xin Li
2023-09-14  4:47 ` [PATCH v10 30/38] x86/fred: Let ret_from_fork_asm() jmp to asm_fred_exit_user when FRED is enabled Xin Li
2023-09-14  4:47 ` [PATCH v10 31/38] x86/fred: Fixup fault on ERETU by jumping to fred_entrypoint_user Xin Li
2023-09-14  4:47 ` [PATCH v10 32/38] x86/entry/calling: Allow PUSH_AND_CLEAR_REGS being used beyond actual entry code Xin Li
2023-09-14  4:48 ` [PATCH v10 33/38] x86/entry: Add fred_entry_from_kvm() for VMX to handle IRQ/NMI Xin Li
2023-09-20 17:54   ` Paolo Bonzini
2023-09-20 23:10     ` Li, Xin3
2023-09-21 12:11   ` Nikolay Borisov
2023-09-21 12:38     ` Paolo Bonzini
2023-09-14  4:48 ` [PATCH v10 34/38] KVM: VMX: Call fred_entry_from_kvm() for IRQ/NMI handling Xin Li
2023-09-20 17:54   ` Paolo Bonzini
2023-09-14  4:48 ` [PATCH v10 35/38] x86/syscall: Split IDT syscall setup code into idt_syscall_init() Xin Li
2023-09-14  4:48 ` [PATCH v10 36/38] x86/fred: Add fred_syscall_init() Xin Li
2023-09-19  8:28   ` Thomas Gleixner
2023-09-20  4:33     ` Li, Xin3
2023-09-20  8:18       ` Thomas Gleixner
2023-09-21  2:24         ` H. Peter Anvin
2023-09-14  4:48 ` [PATCH v10 37/38] x86/fred: Add FRED initialization functions Xin Li
2023-09-14  4:48 ` [PATCH v10 38/38] x86/fred: Invoke FRED initialization code to enable FRED Xin Li
