* [RFC PATCH 1/7] x86/entry: Move PUSH_AND_CLEAR_REGS out of paranoid_entry
2023-04-03 14:05 [RFC PATCH 0/7] x86/entry: Atomic stack switching for IST Lai Jiangshan
@ 2023-04-03 14:05 ` Lai Jiangshan
2023-04-06 20:37 ` Peter Zijlstra
2023-04-03 14:06 ` [RFC PATCH 2/7] x86/entry: Add IST main stack Lai Jiangshan
` (7 subsequent siblings)
8 siblings, 1 reply; 25+ messages in thread
From: Lai Jiangshan @ 2023-04-03 14:05 UTC (permalink / raw)
To: linux-kernel
Cc: Lai Jiangshan, H. Peter Anvin, Andi Kleen, Andrew Cooper,
Andy Lutomirski, Asit Mallick, Cfir Cohen, Dan Williams,
Dave Hansen, David Kaplan, David Rientjes, Dirk Hohndel,
Erdem Aktas, Jan Kiszka, Jiri Slaby, Joerg Roedel, Juergen Gross,
Kees Cook, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
Linus Torvalds, Mike Stunes, Peter Zijlstra, Raj Ashok,
Sean Christopherson, Thomas Gleixner, Tom Lendacky, Tony Luck,
kvm, linux-coco, x86, Ingo Molnar, Borislav Petkov, Dave Hansen,
H. Peter Anvin
From: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Prepare to call paranoid_entry() with pt_regs pushed.
Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
---
arch/x86/entry/entry_64.S | 33 ++++++++++++++++++++-------------
1 file changed, 20 insertions(+), 13 deletions(-)
diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index eccc3431e515..49ddc4dd3117 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -475,11 +475,13 @@ SYM_CODE_START(\asmsym)
testb $3, CS-ORIG_RAX(%rsp)
jnz .Lfrom_usermode_switch_stack_\@
+ PUSH_AND_CLEAR_REGS
+ UNWIND_HINT_REGS
+ ENCODE_FRAME_POINTER
+
/* paranoid_entry returns GS information for paranoid_exit in EBX. */
call paranoid_entry
- UNWIND_HINT_REGS
-
movq %rsp, %rdi /* pt_regs pointer */
call \cfunc
@@ -530,14 +532,16 @@ SYM_CODE_START(\asmsym)
testb $3, CS-ORIG_RAX(%rsp)
jnz .Lfrom_usermode_switch_stack_\@
+ PUSH_AND_CLEAR_REGS
+ UNWIND_HINT_REGS
+ ENCODE_FRAME_POINTER
+
/*
* paranoid_entry returns SWAPGS flag for paranoid_exit in EBX.
* EBX == 0 -> SWAPGS, EBX == 1 -> no SWAPGS
*/
call paranoid_entry
- UNWIND_HINT_REGS
-
/*
* Switch off the IST stack to make it free for nested exceptions. The
* vc_switch_off_ist() function will switch back to the interrupted
@@ -587,9 +591,12 @@ SYM_CODE_START(\asmsym)
ASM_CLAC
cld
+ PUSH_AND_CLEAR_REGS
+ UNWIND_HINT_REGS
+ ENCODE_FRAME_POINTER
+
/* paranoid_entry returns GS information for paranoid_exit in EBX. */
call paranoid_entry
- UNWIND_HINT_REGS
movq %rsp, %rdi /* pt_regs pointer into first argument */
movq ORIG_RAX(%rsp), %rsi /* get error code into 2nd argument*/
@@ -903,8 +910,8 @@ SYM_CODE_END(xen_failsafe_callback)
#endif /* CONFIG_XEN_PV */
/*
- * Save all registers in pt_regs. Return GSBASE related information
- * in EBX depending on the availability of the FSGSBASE instructions:
+ * Return GSBASE related information in EBX depending on the availability
+ * of the FSGSBASE instructions:
*
* FSGSBASE R/EBX
* N 0 -> SWAPGS on exit
@@ -915,11 +922,8 @@ SYM_CODE_END(xen_failsafe_callback)
* R14 - old CR3
* R15 - old SPEC_CTRL
*/
-SYM_CODE_START(paranoid_entry)
+SYM_FUNC_START(paranoid_entry)
ANNOTATE_NOENDBR
- UNWIND_HINT_FUNC
- PUSH_AND_CLEAR_REGS save_ret=1
- ENCODE_FRAME_POINTER 8
/*
* Always stash CR3 in %r14. This value will be restored,
@@ -988,7 +992,7 @@ SYM_CODE_START(paranoid_entry)
UNTRAIN_RET_FROM_CALL
RET
-SYM_CODE_END(paranoid_entry)
+SYM_FUNC_END(paranoid_entry)
/*
* "Paranoid" exit path from exception stack. This is invoked
@@ -1443,6 +1447,10 @@ end_repeat_nmi:
*/
pushq $-1 /* ORIG_RAX: no syscall to restart */
+ PUSH_AND_CLEAR_REGS
+ UNWIND_HINT_REGS
+ ENCODE_FRAME_POINTER
+
/*
* Use paranoid_entry to handle SWAPGS, but no need to use paranoid_exit
* as we should not be calling schedule in NMI context.
@@ -1451,7 +1459,6 @@ end_repeat_nmi:
* exceptions might do.
*/
call paranoid_entry
- UNWIND_HINT_REGS
movq %rsp, %rdi
movq $-1, %rsi
--
2.19.1.6.gb485710b
* Re: [RFC PATCH 1/7] x86/entry: Move PUSH_AND_CLEAR_REGS out of paranoid_entry
2023-04-03 14:05 ` [RFC PATCH 1/7] x86/entry: Move PUSH_AND_CLEAR_REGS out of paranoid_entry Lai Jiangshan
@ 2023-04-06 20:37 ` Peter Zijlstra
0 siblings, 0 replies; 25+ messages in thread
From: Peter Zijlstra @ 2023-04-06 20:37 UTC (permalink / raw)
To: Lai Jiangshan
Cc: linux-kernel, Lai Jiangshan, H. Peter Anvin, Andi Kleen,
Andrew Cooper, Andy Lutomirski, Asit Mallick, Cfir Cohen,
Dan Williams, Dave Hansen, David Kaplan, David Rientjes,
Dirk Hohndel, Erdem Aktas, Jan Kiszka, Jiri Slaby, Joerg Roedel,
Juergen Gross, Kees Cook, Kirill Shutemov,
Kuppuswamy Sathyanarayanan, Linus Torvalds, Mike Stunes,
Raj Ashok, Sean Christopherson, Thomas Gleixner, Tom Lendacky,
Tony Luck, kvm, linux-coco, x86, Ingo Molnar, Borislav Petkov,
Dave Hansen, H. Peter Anvin
On Mon, Apr 03, 2023 at 10:05:59PM +0800, Lai Jiangshan wrote:
> @@ -915,11 +922,8 @@ SYM_CODE_END(xen_failsafe_callback)
> * R14 - old CR3
> * R15 - old SPEC_CTRL
> */
> -SYM_CODE_START(paranoid_entry)
> +SYM_FUNC_START(paranoid_entry)
> ANNOTATE_NOENDBR
> - UNWIND_HINT_FUNC
That isn't quite equivalent. SYM_FUNC_START() gets you ENDBR, while the
SYM_CODE_START(); ANNOTATE_NOENDBR; UNWIND_HINT_FUNC combination is
explicitly no ENDBR.
> - PUSH_AND_CLEAR_REGS save_ret=1
> - ENCODE_FRAME_POINTER 8
>
> /*
> * Always stash CR3 in %r14. This value will be restored,
* [RFC PATCH 2/7] x86/entry: Add IST main stack
2023-04-03 14:05 [RFC PATCH 0/7] x86/entry: Atomic stack switching for IST Lai Jiangshan
2023-04-03 14:05 ` [RFC PATCH 1/7] x86/entry: Move PUSH_AND_CLEAR_REGS out of paranoid_entry Lai Jiangshan
@ 2023-04-03 14:06 ` Lai Jiangshan
2023-04-03 16:21 ` Linus Torvalds
2023-04-06 20:51 ` Peter Zijlstra
2023-04-03 14:06 ` [RFC PATCH 3/7] x86/entry: Implement atomic-IST-entry Lai Jiangshan
` (6 subsequent siblings)
8 siblings, 2 replies; 25+ messages in thread
From: Lai Jiangshan @ 2023-04-03 14:06 UTC (permalink / raw)
To: linux-kernel
Cc: Lai Jiangshan, H. Peter Anvin, Andi Kleen, Andrew Cooper,
Andy Lutomirski, Asit Mallick, Cfir Cohen, Dan Williams,
Dave Hansen, David Kaplan, David Rientjes, Dirk Hohndel,
Erdem Aktas, Jan Kiszka, Jiri Slaby, Joerg Roedel, Juergen Gross,
Kees Cook, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
Linus Torvalds, Mike Stunes, Peter Zijlstra, Raj Ashok,
Sean Christopherson, Thomas Gleixner, Tom Lendacky, Tony Luck,
kvm, linux-coco, x86, Ingo Molnar, Borislav Petkov, Dave Hansen,
H. Peter Anvin, Jonathan Corbet, linux-doc
From: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Add IST main stack for atomic-IST-entry.
The size is THREAD_SIZE since there might be multiple super
exceptions being handled on the stack.
Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
---
Documentation/x86/kernel-stacks.rst | 2 ++
arch/x86/include/asm/cpu_entry_area.h | 5 +++++
arch/x86/kernel/dumpstack_64.c | 6 ++++--
arch/x86/mm/cpu_entry_area.c | 1 +
4 files changed, 12 insertions(+), 2 deletions(-)
diff --git a/Documentation/x86/kernel-stacks.rst b/Documentation/x86/kernel-stacks.rst
index 6b0bcf027ff1..be89acf302da 100644
--- a/Documentation/x86/kernel-stacks.rst
+++ b/Documentation/x86/kernel-stacks.rst
@@ -105,6 +105,8 @@ The currently assigned IST stacks are:
middle of switching stacks. Using IST for MCE events avoids making
assumptions about the previous state of the kernel stack.
+* ESTACK_IST. bla bla
+
For more details see the Intel IA32 or AMD AMD64 architecture manuals.
diff --git a/arch/x86/include/asm/cpu_entry_area.h b/arch/x86/include/asm/cpu_entry_area.h
index 462fc34f1317..a373e8c37e25 100644
--- a/arch/x86/include/asm/cpu_entry_area.h
+++ b/arch/x86/include/asm/cpu_entry_area.h
@@ -10,6 +10,8 @@
#ifdef CONFIG_X86_64
+#define IST_MAIN_STKSZ THREAD_SIZE
+
#ifdef CONFIG_AMD_MEM_ENCRYPT
#define VC_EXCEPTION_STKSZ EXCEPTION_STKSZ
#else
@@ -30,6 +32,8 @@
char VC_stack[optional_stack_size]; \
char VC2_stack_guard[guardsize]; \
char VC2_stack[optional_stack_size]; \
+ char IST_stack_guard[guardsize]; \
+ char IST_stack[IST_MAIN_STKSZ]; \
char IST_top_guard[guardsize]; \
/* The exception stacks' physical storage. No guard pages required */
@@ -52,6 +56,7 @@ enum exception_stack_ordering {
ESTACK_MCE,
ESTACK_VC,
ESTACK_VC2,
+ ESTACK_IST,
N_EXCEPTION_STACKS
};
diff --git a/arch/x86/kernel/dumpstack_64.c b/arch/x86/kernel/dumpstack_64.c
index f05339fee778..3413b23fa9f1 100644
--- a/arch/x86/kernel/dumpstack_64.c
+++ b/arch/x86/kernel/dumpstack_64.c
@@ -26,11 +26,12 @@ static const char * const exception_stack_names[] = {
[ ESTACK_MCE ] = "#MC",
[ ESTACK_VC ] = "#VC",
[ ESTACK_VC2 ] = "#VC2",
+ [ ESTACK_IST ] = "#IST",
};
const char *stack_type_name(enum stack_type type)
{
- BUILD_BUG_ON(N_EXCEPTION_STACKS != 6);
+ BUILD_BUG_ON(N_EXCEPTION_STACKS != ARRAY_SIZE(exception_stack_names));
if (type == STACK_TYPE_TASK)
return "TASK";
@@ -89,6 +90,7 @@ struct estack_pages estack_pages[CEA_ESTACK_PAGES] ____cacheline_aligned = {
EPAGERANGE(MCE),
EPAGERANGE(VC),
EPAGERANGE(VC2),
+ EPAGERANGE(IST),
};
static __always_inline bool in_exception_stack(unsigned long *stack, struct stack_info *info)
@@ -98,7 +100,7 @@ static __always_inline bool in_exception_stack(unsigned long *stack, struct stac
struct pt_regs *regs;
unsigned int k;
- BUILD_BUG_ON(N_EXCEPTION_STACKS != 6);
+ BUILD_BUG_ON(N_EXCEPTION_STACKS != 7);
begin = (unsigned long)__this_cpu_read(cea_exception_stacks);
/*
diff --git a/arch/x86/mm/cpu_entry_area.c b/arch/x86/mm/cpu_entry_area.c
index 7316a8224259..62341cb819ab 100644
--- a/arch/x86/mm/cpu_entry_area.c
+++ b/arch/x86/mm/cpu_entry_area.c
@@ -148,6 +148,7 @@ static void __init percpu_setup_exception_stacks(unsigned int cpu)
cea_map_stack(NMI);
cea_map_stack(DB);
cea_map_stack(MCE);
+ cea_map_stack(IST);
if (IS_ENABLED(CONFIG_AMD_MEM_ENCRYPT)) {
if (cc_platform_has(CC_ATTR_GUEST_STATE_ENCRYPT)) {
--
2.19.1.6.gb485710b
* Re: [RFC PATCH 2/7] x86/entry: Add IST main stack
2023-04-03 14:06 ` [RFC PATCH 2/7] x86/entry: Add IST main stack Lai Jiangshan
@ 2023-04-03 16:21 ` Linus Torvalds
2023-04-06 20:51 ` Peter Zijlstra
1 sibling, 0 replies; 25+ messages in thread
From: Linus Torvalds @ 2023-04-03 16:21 UTC (permalink / raw)
To: Lai Jiangshan
Cc: linux-kernel, Lai Jiangshan, H. Peter Anvin, Andi Kleen,
Andrew Cooper, Andy Lutomirski, Asit Mallick, Cfir Cohen,
Dan Williams, Dave Hansen, David Kaplan, David Rientjes,
Dirk Hohndel, Erdem Aktas, Jan Kiszka, Jiri Slaby, Joerg Roedel,
Juergen Gross, Kees Cook, Kirill Shutemov,
Kuppuswamy Sathyanarayanan, Mike Stunes, Peter Zijlstra,
Raj Ashok, Sean Christopherson, Thomas Gleixner, Tom Lendacky,
Tony Luck, kvm, linux-coco, x86, Ingo Molnar, Borislav Petkov,
Dave Hansen, H. Peter Anvin, Jonathan Corbet, linux-doc
On Mon, Apr 3, 2023 at 7:05 AM Lai Jiangshan <jiangshanlai@gmail.com> wrote:
>
> diff --git a/Documentation/x86/kernel-stacks.rst b/Documentation/x86/kernel-stacks.rst
> index 6b0bcf027ff1..be89acf302da 100644
> --- a/Documentation/x86/kernel-stacks.rst
> +++ b/Documentation/x86/kernel-stacks.rst
> @@ -105,6 +105,8 @@ The currently assigned IST stacks are:
> middle of switching stacks. Using IST for MCE events avoids making
> assumptions about the previous state of the kernel stack.
>
> +* ESTACK_IST. bla bla
> +
> For more details see the Intel IA32 or AMD AMD64 architecture manuals.
Maybe the cover letter description could be used here, rather than the
"bla bla" placeholder?
Linus
* Re: [RFC PATCH 2/7] x86/entry: Add IST main stack
2023-04-03 14:06 ` [RFC PATCH 2/7] x86/entry: Add IST main stack Lai Jiangshan
2023-04-03 16:21 ` Linus Torvalds
@ 2023-04-06 20:51 ` Peter Zijlstra
1 sibling, 0 replies; 25+ messages in thread
From: Peter Zijlstra @ 2023-04-06 20:51 UTC (permalink / raw)
To: Lai Jiangshan
Cc: linux-kernel, Lai Jiangshan, H. Peter Anvin, Andi Kleen,
Andrew Cooper, Andy Lutomirski, Asit Mallick, Cfir Cohen,
Dan Williams, Dave Hansen, David Kaplan, David Rientjes,
Dirk Hohndel, Erdem Aktas, Jan Kiszka, Jiri Slaby, Joerg Roedel,
Juergen Gross, Kees Cook, Kirill Shutemov,
Kuppuswamy Sathyanarayanan, Linus Torvalds, Mike Stunes,
Raj Ashok, Sean Christopherson, Thomas Gleixner, Tom Lendacky,
Tony Luck, kvm, linux-coco, x86, Ingo Molnar, Borislav Petkov,
Dave Hansen, H. Peter Anvin, Jonathan Corbet, linux-doc
On Mon, Apr 03, 2023 at 10:06:00PM +0800, Lai Jiangshan wrote:
> diff --git a/arch/x86/kernel/dumpstack_64.c b/arch/x86/kernel/dumpstack_64.c
> index f05339fee778..3413b23fa9f1 100644
> --- a/arch/x86/kernel/dumpstack_64.c
> +++ b/arch/x86/kernel/dumpstack_64.c
> @@ -26,11 +26,12 @@ static const char * const exception_stack_names[] = {
> [ ESTACK_MCE ] = "#MC",
> [ ESTACK_VC ] = "#VC",
> [ ESTACK_VC2 ] = "#VC2",
> + [ ESTACK_IST ] = "#IST",
> };
Since #IST is not actually a vector, perhaps ditch the '#' ?
* [RFC PATCH 3/7] x86/entry: Implement atomic-IST-entry
2023-04-03 14:05 [RFC PATCH 0/7] x86/entry: Atomic stack switching for IST Lai Jiangshan
2023-04-03 14:05 ` [RFC PATCH 1/7] x86/entry: Move PUSH_AND_CLEAR_REGS out of paranoid_entry Lai Jiangshan
2023-04-03 14:06 ` [RFC PATCH 2/7] x86/entry: Add IST main stack Lai Jiangshan
@ 2023-04-03 14:06 ` Lai Jiangshan
2023-04-06 21:01 ` Peter Zijlstra
2023-04-06 21:58 ` Peter Zijlstra
2023-04-03 14:06 ` [RFC PATCH 4/7] x86/entry: Use atomic-IST-entry for NMI Lai Jiangshan
` (5 subsequent siblings)
8 siblings, 2 replies; 25+ messages in thread
From: Lai Jiangshan @ 2023-04-03 14:06 UTC (permalink / raw)
To: linux-kernel
Cc: Lai Jiangshan, H. Peter Anvin, Andi Kleen, Andrew Cooper,
Andy Lutomirski, Asit Mallick, Cfir Cohen, Dan Williams,
Dave Hansen, David Kaplan, David Rientjes, Dirk Hohndel,
Erdem Aktas, Jan Kiszka, Jiri Slaby, Joerg Roedel, Juergen Gross,
Kees Cook, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
Linus Torvalds, Mike Stunes, Peter Zijlstra, Raj Ashok,
Sean Christopherson, Thomas Gleixner, Tom Lendacky, Tony Luck,
kvm, linux-coco, x86, Ingo Molnar, Borislav Petkov, Dave Hansen,
H. Peter Anvin, Josh Poimboeuf, Arnd Bergmann
From: Lai Jiangshan <jiangshan.ljs@antgroup.com>
See the comments in the cover-letter. They will be moved into the code
and changelog here when improved.
Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
---
arch/x86/entry/Makefile | 3 +
arch/x86/entry/entry_64.S | 193 ++++++++++++++++++++
arch/x86/entry/ist_entry.c | 299 +++++++++++++++++++++++++++++++
arch/x86/kernel/asm-offsets_64.c | 7 +
arch/x86/kernel/callthunks.c | 2 +
tools/objtool/check.c | 7 +-
6 files changed, 510 insertions(+), 1 deletion(-)
create mode 100644 arch/x86/entry/ist_entry.c
diff --git a/arch/x86/entry/Makefile b/arch/x86/entry/Makefile
index ca2fe186994b..7cc1254ca519 100644
--- a/arch/x86/entry/Makefile
+++ b/arch/x86/entry/Makefile
@@ -8,11 +8,14 @@ UBSAN_SANITIZE := n
KCOV_INSTRUMENT := n
CFLAGS_REMOVE_common.o = $(CC_FLAGS_FTRACE)
+CFLAGS_REMOVE_ist_entry.o = $(CC_FLAGS_FTRACE) $(RETHUNK_CFLAGS)
CFLAGS_common.o += -fno-stack-protector
+CFLAGS_ist_entry.o += -fno-stack-protector
obj-y := entry.o entry_$(BITS).o syscall_$(BITS).o
obj-y += common.o
+obj-$(CONFIG_X86_64) += ist_entry.o
obj-y += vdso/
obj-y += vsyscall/
diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 49ddc4dd3117..50a24cc83581 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -443,6 +443,184 @@ SYM_CODE_END(\asmsym)
idtentry \vector asm_\cfunc \cfunc has_error_code=0
.endm
+/**
+ * idtentry_ist - Macro to generate entry stubs for IST exceptions except #DF
+ * @vector: Vector number
+ * @asmsym: ASM symbol for the entry point
+ * @cfunc: C function to be called when it occurs in kernel
+ * @user_cfunc: C function to be called when it occurs in userspace
+ * @has_error_code: Hardware pushed error code on stack
+ * @stack_offset: Offset of the IST stack top in struct cea_exception_stacks
+ *
+ * The macro emits code to set up the kernel context for IST exceptions.
+ *
+ * From the hardware entry of the event to the SYM_INNER_LABEL(commit_\asmsym)
+ * is atomic-IST-entry (note: atomic-IST-entry is from the hardware entry,
+ * not merely from the first instruction of this macro).
+ *
+ * The atomic-IST-entry pushes pt_regs and copies the pt_regs to the IST
+ * main stack, and switches to it. If the atomic-IST-entry is interrupted
+ * by another IST event (except #DF), the new atomic-IST-entry will
+ * replicate the interrupted one as if every atomic-IST-entry is atomic.
+ *
+ * See the comments in ist_entry.c.
+ *
+ * When the CPU is on any IST stack or the IST main stack, %rsp cannot be
+ * switched away except when interrupted by another IST exception or when
+ * switching off entirely (no usable data left).
+ *
+ * If the entry comes from user space, it eventually takes the normal
+ * entry path on its kernel stack, including the return-to-user-space
+ * work and preemption checks on exit. The macro idtentry_body ensures
+ * the IST main stack is totally switched off (no usable data left) at
+ * the same time it switches to the kernel stack.
+ *
+ * If it hits in kernel mode, it needs to go through the paranoid
+ * entry as the exception can hit any random state. No preemption
+ * check on exit to keep the paranoid path simple.
+ */
+.macro idtentry_ist vector asmsym cfunc user_cfunc has_error_code:req, stack_offset:req
+SYM_CODE_START(\asmsym)
+ UNWIND_HINT_IRET_REGS offset=\has_error_code*8
+ ENDBR
+
+ /*
+ * Clear X86_EFLAGS_AC, X86_EFLAGS_DF and set a default ORIG_RAX.
+ *
+ * The code setting ORIG_RAX will not be replicated if interrupted.
+ */
+ ASM_CLAC
+ cld
+
+ .if \has_error_code == 0
+ pushq $-1 /* ORIG_RAX: no syscall to restart */
+ .endif
+
+ /*
+ * No register can be touched except %rsp,%rflags,%rip before
+ * pushing all the registers. It is indispensable for nested
+ * atomic-IST-entry to replicate pushing the registers.
+ */
+ PUSH_REGS
+
+ /*
+ * Finished pushing registers; all registers can be touched now.
+ *
+ * Clear registers for the C function ist_copy_regs_to_main_stack()
+ * and the handler to avoid any possible exploitation of any
+ * speculation attack.
+ */
+ CLEAR_REGS
+
+ /*
+ * Copy the pt_regs to the IST main stack including the pt_regs of
+ * the interrupted atomic-IST-entries, if any, by replicating.
+ */
+ movq %rsp, %rdi /* pt_regs pointer on its own IST stack */
+ leaq PTREGS_SIZE-\stack_offset(%rsp), %rsi /* struct cea_exception_stacks pointer */
+ call ist_copy_regs_to_main_stack
+
+ /*
+ * Commit stage.
+ */
+SYM_INNER_LABEL(start_commit_\asmsym, SYM_L_GLOBAL)
+ /*
+ * Switches to the IST main stack. Before the switching is done,
+ * %rax is the copied pt_regs pointer in IST main stack.
+ */
+ movq %rax, %rsp
+
+ /*
+ * The label should be immediately after the instruction that switches
+ * the stack since there is code assuming there is only a single
+ * instruction in the commit stage and the code assumes "%rsp in the
+ * IST main stack is also the sign of ending an atomic-IST-entry".
+ * (The code will be removed in future when %rip-based identifying
+ * is added.)
+ */
+SYM_INNER_LABEL(commit_\asmsym, SYM_L_GLOBAL)
+
+ /*
+ * Now, it is on the IST main stack. To the rest of the kernel, the
+ * entries of the IST exceptions appear to start here, because the
+ * inside of the atomic-IST-entry cannot be observed from outside
+ * except within the atomic-IST-entry itself or #DF.
+ */
+ UNWIND_HINT_REGS
+ ENCODE_FRAME_POINTER
+
+ /*
+ * The code setting ORIG_RAX will not be replicated if interrupted.
+ * So redo it here.
+ */
+ .if \has_error_code == 0
+ movq $-1, ORIG_RAX(%rsp) /* ORIG_RAX: no syscall to restart */
+ .endif
+
+ /*
+ * If the entry is from userspace, switch stacks and treat it as
+ * a normal entry.
+ */
+ testb $3, CS(%rsp)
+ jnz .Lfrom_usermode_switch_stack_\@
+
+ /*
+ * paranoid_entry returns GS/CR3/SPEC_CTL information for
+ * paranoid_exit in RBX/R14/R15.
+ */
+ call paranoid_entry
+
+ movq %rsp, %rdi /* pt_regs pointer */
+ .if \has_error_code == 1
+ movq ORIG_RAX(%rsp), %rsi /* get error code into 2nd argument*/
+ movq $-1, ORIG_RAX(%rsp) /* no syscall to restart */
+ .endif
+ call \cfunc
+
+ jmp paranoid_exit
+
+.Lfrom_usermode_switch_stack_\@:
+ /* Switch context: GS_BASE, CR3, SPEC_CTL. */
+ swapgs
+ FENCE_SWAPGS_USER_ENTRY
+ /* We have user CR3. Change to kernel CR3. */
+ SWITCH_TO_KERNEL_CR3 scratch_reg=%rax
+ IBRS_ENTER
+ UNTRAIN_RET
+
+ /* Put the pt_regs onto the kernel task stack. */
+ movq %rsp, %rdi /* arg0 = pt_regs pointer */
+ call sync_regs
+
+ /*
+ * Switch to the kernel task stack and use the user entry point.
+ *
+ * When coming from user mode, the procedure has to atomically switch
+ * off the TSS-configured IST stacks too, so it switches to the IST
+ * main stack first, and then switches off the IST main stack in atomic
+ * fashion: when %rsp leaves the IST main stack, the IST main stack is
+ * totally free.
+ */
+ movq %rax, %rsp
+ UNWIND_HINT_REGS
+ ENCODE_FRAME_POINTER
+
+ movq %rsp, %rdi /* pt_regs pointer into 1st argument*/
+ .if \has_error_code == 1
+ movq ORIG_RAX(%rsp), %rsi /* get error code into 2nd argument*/
+ movq $-1, ORIG_RAX(%rsp) /* no syscall to restart */
+ .endif
+ call \user_cfunc
+
+ /* For some configurations \user_cfunc ends up being a noreturn. */
+ REACHABLE
+
+ jmp error_return
+
+_ASM_NOKPROBE(\asmsym)
+SYM_CODE_END(\asmsym)
+.endm
+
/**
* idtentry_mce_db - Macro to generate entry stubs for #MC and #DB
* @vector: Vector number
@@ -586,8 +764,23 @@ SYM_CODE_END(\asmsym)
*/
.macro idtentry_df vector asmsym cfunc
SYM_CODE_START(\asmsym)
+
+ /*
+ * This unwind-hint is incorrect if it is the soft double fault raised
+ * from ist_double_fault(). It doesn't matter since it is an unrecoverable
+ * double fault.
+ */
UNWIND_HINT_IRET_REGS offset=8
ENDBR
+
+ /*
+ * Set %rsp = %rsp - 8 if it is the soft double fault raised from
+ * ist_double_fault(). The CPU doesn't push an error code in that case
+ * since it is injected by an INT instruction.
+ */
+ btr $3, %rsp
+ UNWIND_HINT_IRET_REGS offset=8
+
ASM_CLAC
cld
diff --git a/arch/x86/entry/ist_entry.c b/arch/x86/entry/ist_entry.c
new file mode 100644
index 000000000000..e1b06306ac51
--- /dev/null
+++ b/arch/x86/entry/ist_entry.c
@@ -0,0 +1,299 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (C) 2022-2023 Lai Jiangshan, Ant Group
+ *
+ * Handle entries and exits for hardware traps and faults.
+ *
+ * It is as low level as entry_64.S and its code can run in environments
+ * where the GS base is a user-controlled value, or the CR3 is the PTI
+ * user CR3, or both.
+ */
+#include <asm/traps.h>
+
+#define IST_DOUBLE_FAULT_VECTOR 8
+
+static __always_inline void ist_double_fault(void)
+{
+ asm volatile ("int $" __stringify(IST_DOUBLE_FAULT_VECTOR));
+}
+
+#define IN_CEA_ESTACK(ceastp, name, sp) \
+ ((CEA_ESTACK_BOT(ceastp, name) <= (sp)) && \
+ ((sp) < CEA_ESTACK_TOP(ceastp, name)))
+
+struct ist_ctx {
+ const struct pt_regs *regs;
+ unsigned long commit_ip;
+};
+
+#define DEFINE_IDENTIFY_IST(stack_name, sym_name, is_enabled) \
+extern char commit_asm_exc_##sym_name[]; \
+static __always_inline bool identify_ist_##sym_name( \
+ const struct pt_regs *regs, struct cea_exception_stacks *stacks,\
+ struct ist_ctx *ctx) \
+{ \
+ if (!(is_enabled)) \
+ return false; \
+ if (!IN_CEA_ESTACK(stacks, stack_name, regs->sp)) \
+ return false; \
+ ctx->regs = (struct pt_regs *)CEA_ESTACK_TOP(stacks, stack_name) - 1; \
+ ctx->commit_ip = (unsigned long)commit_asm_exc_##sym_name; \
+ return true; \
+}
+
+DEFINE_IDENTIFY_IST(NMI, nmi, false)
+DEFINE_IDENTIFY_IST(DB, debug, false)
+DEFINE_IDENTIFY_IST(MCE, machine_check, false)
+DEFINE_IDENTIFY_IST(VC, vmm_communication, false)
+
+static __always_inline bool identify_ist(
+ const struct pt_regs *regs, struct cea_exception_stacks *stacks,
+ struct ist_ctx *ctx)
+{
+ return identify_ist_nmi(regs, stacks, ctx) ||
+ identify_ist_debug(regs, stacks, ctx) ||
+ identify_ist_machine_check(regs, stacks, ctx) ||
+ identify_ist_vmm_communication(regs, stacks, ctx);
+}
+
+/*
+ * Identify whether an interrupted atomic-IST-entry has successfully saved
+ * the general registers onto its IST stack.
+ *
+ * Generally, the outmost atomic-IST-entry has likely saved the general
+ * registers successfully. If not, one of the nested atomic-IST-entries
+ * must have saved the general registers of the context that the outmost
+ * atomic-IST-entry interrupted.
+ *
+ * Arguments:
+ * @nested: the nested atomic-IST-entry that interrupted @interrupted
+ * @interrupted: the interrupted atomic-IST-entry.
+ *
+ * Returns:
+ * true: the interrupted atomic-IST-entry had saved the general registers.
+ * false: the interrupted atomic-IST-entry had not yet saved the general registers.
+ */
+static __always_inline
+bool identify_if_gp_registers_saved(const struct pt_regs *nested, const struct pt_regs *interrupted)
+{
+ return nested->sp <= (unsigned long)(void *)interrupted;
+}
+
+static __always_inline
+void copy_regs_exception_head(struct pt_regs *target, const struct pt_regs *from)
+{
+ target->ss = from->ss;
+ target->sp = from->sp;
+ target->flags = from->flags;
+ target->cs = from->cs;
+ target->ip = from->ip;
+ target->orig_ax = from->orig_ax;
+}
+
+static __always_inline
+void copy_regs_general_registers(struct pt_regs *target, const struct pt_regs *from)
+{
+ target->di = from->di;
+ target->si = from->si;
+ target->dx = from->dx;
+ target->cx = from->cx;
+ target->ax = from->ax;
+ target->r8 = from->r8;
+ target->r9 = from->r9;
+ target->r10 = from->r10;
+ target->r11 = from->r11;
+ target->bx = from->bx;
+ target->bp = from->bp;
+ target->r12 = from->r12;
+ target->r13 = from->r13;
+ target->r14 = from->r14;
+ target->r15 = from->r15;
+}
+
+/*
+ * Do the work as the outmost atomic-IST-entry to copy the supposed pt_regs
+ * of the interrupted context to the IST main stack. (If the ongoing
+ * atomic-IST-entry is the outmost one, the work is literally the copy done
+ * as the outmost; if not, the work replicates the outmost.)
+ *
+ * The hardware-entry of the outmost atomic-IST-entry has already saved the
+ * exception head of the pt_regs. If the outmost atomic-IST-entry was
+ * unfortunately interrupted before fully saving all the general registers,
+ * the general registers are untouched and must be saved by one of the
+ * subsequent nested atomic-IST-entries. The identifying code can just
+ * examine all the nested atomic-IST-entries to find which one has saved
+ * the general registers.
+ */
+static __always_inline
+void copy_outmost(struct pt_regs *target, const struct pt_regs *outmost, const struct pt_regs *gp)
+{
+ copy_regs_exception_head(target, outmost);
+ copy_regs_general_registers(target, gp);
+}
+
+/*
+ * Replicate the interrupted atomic-IST-entry's CLAC and CLD in the ASM
+ * code. Even if SMAP is not enabled, CLAC is replicated unconditionally
+ * since doing so has no harm.
+ */
+static __always_inline void replicate_clac_cld(struct pt_regs *target)
+{
+ target->flags &= ~(unsigned long)(X86_EFLAGS_AC | X86_EFLAGS_DF);
+}
+
+/* Replicate the interrupted atomic-IST-entry's CLEAR_REGS macro. */
+static __always_inline void replicate_clear_regs(struct pt_regs *target)
+{
+ target->di = 0;
+ target->si = 0;
+ target->dx = 0;
+ target->cx = 0;
+ target->ax = 0;
+ target->r8 = 0;
+ target->r9 = 0;
+ target->r10 = 0;
+ target->r11 = 0;
+ target->bx = 0;
+ target->bp = 0;
+ target->r12 = 0;
+ target->r13 = 0;
+ target->r14 = 0;
+ target->r15 = 0;
+}
+
+/*
+ * Replicate the action that the interrupted atomic-IST-entry's
+ * ist_copy_regs_to_main_stack() clobbers caller-saved registers
+ */
+static __always_inline void replicate_func_clobber(struct pt_regs *target)
+{
+ /* nothing needs to be done. */
+}
+
+/*
+ * Replicate the copy operation in the interrupted atomic-IST-entry's
+ * ist_copy_regs_to_main_stack()
+ */
+static __always_inline void replicate_func_copy(struct pt_regs *target)
+{
+ /*
+ * To avoid recursive function calls with __always_inline, the
+ * copy operation for the interrupted atomic-IST-entry has been
+ * done in the caller of copy_nested(). Nothing needs to be done.
+ */
+}
+
+#define IST_FRAME_SIZE ALIGN(sizeof(struct pt_regs), 16)
+
+/*
+ * Replicate the return result of the interrupted atomic-IST-entry's
+ * ist_copy_regs_to_main_stack() in %rax and the commit operation.
+ */
+static __always_inline void replicate_func_result_and_commit(struct pt_regs *target, unsigned long commit_ip)
+{
+ void *target_of_interrupted = (void *)target + IST_FRAME_SIZE;
+
+ /* return result in %rax */
+ target->ax = (unsigned long)target_of_interrupted;
+ /* move %rax, %rsp */
+ target->sp = (unsigned long)target_of_interrupted;
+ /* the %rip advances to commit point */
+ target->ip = commit_ip;
+}
+
+/*
+ * Do the work as a nested atomic-IST-entry to copy the supposed pt_regs
+ * of the interrupted context to the IST main stack.
+ *
+ * The hardware-entry of the nested atomic-IST-entry has already saved
+ * the exception head of the pt_regs of the interrupted context (inside
+ * the interrupted atomic-IST-entry). To maintain the atomic attribute
+ * of the atomic-IST-entry, the copy_nested() (of the ongoing nested
+ * atomic-IST-entry) has to replicate all that the interrupted
+ * atomic-IST-entries should have done up to the commit point and
+ * copy the supposedly saved context (pt_regs).
+ *
+ * To avoid touching any saved pt_regs, the replicating is actually
+ * directly applied on the target pt_regs.
+ */
+static __always_inline
+void copy_nested(struct pt_regs *target, const struct pt_regs *nested, unsigned long commit_ip)
+{
+ copy_regs_exception_head(target, nested);
+ replicate_clac_cld(target);
+ replicate_clear_regs(target);
+ replicate_func_clobber(target);
+ replicate_func_copy(target);
+ replicate_func_result_and_commit(target, commit_ip);
+}
+
+asmlinkage __visible __noinstr_section(".entry.text")
+struct pt_regs *ist_copy_regs_to_main_stack(
+ const struct pt_regs *regs, struct cea_exception_stacks *stacks)
+{
+ unsigned long ist_main_sp = CEA_ESTACK_TOP(stacks, IST);
+ struct ist_ctx ist_ctx[8];
+ const struct pt_regs *gp_saved;
+ struct pt_regs *target;
+ int nr_entries, i;
+
+ /*
+ * Identify all of the atomic-IST-entries.
+ *
+ * The current ongoing atomic-IST-entry doesn't need to be identified,
+ * but is also put in the @ist_ctx[0] for later convenience.
+ *
+ * The for-loop identifies the context that @regs interrupted.
+ * It walks back to the outmost atomic-IST-entry.
+ *
+ * Result:
+ * Identified result is put in ist_ctx[i].
+ * ist_ctx[0] is the current ongoing atomic-IST-entry.
+ * ist_ctx[nr_entries-1] is the outmost atomic-IST-entry.
+ * gp_saved is the atomic-IST-entry that has saved the general registers.
+ */
+ ist_ctx[0].regs = regs;
+ ist_ctx[0].commit_ip = -1; /* unused */
+ nr_entries = 1;
+ gp_saved = regs;
+ for (;;) {
+ if (user_mode((struct pt_regs *)regs))
+ break;
+ if (ip_within_syscall_gap((struct pt_regs *)regs))
+ break;
+ if (!identify_ist(regs, stacks, &ist_ctx[nr_entries])) {
+ /* locate the top of copying target pt_regs */
+ if (IN_CEA_ESTACK(stacks, IST, regs->sp))
+ ist_main_sp = ALIGN_DOWN(regs->sp, 16);
+ break;
+ }
+ if (identify_if_gp_registers_saved(regs, ist_ctx[nr_entries].regs))
+ gp_saved = ist_ctx[nr_entries].regs;
+ regs = ist_ctx[nr_entries].regs;
+ nr_entries++;
+ if (nr_entries >= ARRAY_SIZE(ist_ctx))
+ ist_double_fault();
+ }
+
+ if (!IN_CEA_ESTACK(stacks, IST, ist_main_sp - IST_FRAME_SIZE * nr_entries))
+ ist_double_fault();
+
+ /*
+ * Copy the saved pt_regs to the IST main stack.
+ *
+ * For each atomic-IST-entry including the interrupted ones and
+ * the current ongoing one, calls either copy_outmost() or copy_nested()
+ * to copy the pt_regs of what should have been saved, by replicating
+ * if needed, to the IST main stack.
+ */
+ ist_main_sp -= IST_FRAME_SIZE;
+ target = (void *)ist_main_sp;
+ copy_outmost(target, ist_ctx[nr_entries - 1].regs, gp_saved);
+ for (i = nr_entries - 2; unlikely(i >= 0); i--) {
+ ist_main_sp -= IST_FRAME_SIZE;
+ target = (void *)ist_main_sp;
+ copy_nested(target, ist_ctx[i].regs, ist_ctx[i+1].commit_ip);
+ }
+
+ return target;
+}
diff --git a/arch/x86/kernel/asm-offsets_64.c b/arch/x86/kernel/asm-offsets_64.c
index bb65371ea9df..f861a56c0002 100644
--- a/arch/x86/kernel/asm-offsets_64.c
+++ b/arch/x86/kernel/asm-offsets_64.c
@@ -60,5 +60,12 @@ int main(void)
OFFSET(FIXED_stack_canary, fixed_percpu_data, stack_canary);
BLANK();
#endif
+
+ DEFINE(CEA_stacks_NMI, offsetofend(struct cea_exception_stacks, NMI_stack));
+ DEFINE(CEA_stacks_DB, offsetofend(struct cea_exception_stacks, DB_stack));
+ DEFINE(CEA_stacks_MCE, offsetofend(struct cea_exception_stacks, MCE_stack));
+ DEFINE(CEA_stacks_VC, offsetofend(struct cea_exception_stacks, VC_stack));
+ BLANK();
+
return 0;
}
diff --git a/arch/x86/kernel/callthunks.c b/arch/x86/kernel/callthunks.c
index ffea98f9064b..e756c89996d8 100644
--- a/arch/x86/kernel/callthunks.c
+++ b/arch/x86/kernel/callthunks.c
@@ -123,6 +123,8 @@ static bool skip_addr(void *dest)
{
if (dest == error_entry)
return true;
+ if (dest == ist_copy_regs_to_main_stack)
+ return true;
if (dest == paranoid_entry)
return true;
if (dest == xen_error_entry)
diff --git a/tools/objtool/check.c b/tools/objtool/check.c
index f937be1afe65..8dfa627d4b41 100644
--- a/tools/objtool/check.c
+++ b/tools/objtool/check.c
@@ -3998,6 +3998,11 @@ static int validate_unret(struct objtool_file *file)
return warnings;
}
+static bool in_ist_entry(struct instruction *insn)
+{
+ return !strcmp(insn->sym->name, "ist_copy_regs_to_main_stack");
+}
+
static int validate_retpoline(struct objtool_file *file)
{
struct instruction *insn;
@@ -4016,7 +4021,7 @@ static int validate_retpoline(struct objtool_file *file)
continue;
if (insn->type == INSN_RETURN) {
- if (opts.rethunk) {
+ if (opts.rethunk && !in_ist_entry(insn)) {
WARN_FUNC("'naked' return found in RETHUNK build",
insn->sec, insn->offset);
} else
--
2.19.1.6.gb485710b
^ permalink raw reply related [flat|nested] 25+ messages in thread

* Re: [RFC PATCH 3/7] x86/entry: Implement atomic-IST-entry
2023-04-03 14:06 ` [RFC PATCH 3/7] x86/entry: Implement atomic-IST-entry Lai Jiangshan
@ 2023-04-06 21:01 ` Peter Zijlstra
2023-04-07 2:33 ` Lai Jiangshan
2023-04-06 21:58 ` Peter Zijlstra
1 sibling, 1 reply; 25+ messages in thread
From: Peter Zijlstra @ 2023-04-06 21:01 UTC (permalink / raw)
To: Lai Jiangshan
Cc: linux-kernel, Lai Jiangshan, H. Peter Anvin, Andi Kleen,
Andrew Cooper, Andy Lutomirski, Asit Mallick, Cfir Cohen,
Dan Williams, Dave Hansen, David Kaplan, David Rientjes,
Dirk Hohndel, Erdem Aktas, Jan Kiszka, Jiri Slaby, Joerg Roedel,
Juergen Gross, Kees Cook, Kirill Shutemov,
Kuppuswamy Sathyanarayanan, Linus Torvalds, Mike Stunes,
Raj Ashok, Sean Christopherson, Thomas Gleixner, Tom Lendacky,
Tony Luck, kvm, linux-coco, x86, Ingo Molnar, Borislav Petkov,
Dave Hansen, H. Peter Anvin, Josh Poimboeuf, Arnd Bergmann
On Mon, Apr 03, 2023 at 10:06:01PM +0800, Lai Jiangshan wrote:
> diff --git a/arch/x86/entry/Makefile b/arch/x86/entry/Makefile
> index ca2fe186994b..7cc1254ca519 100644
> --- a/arch/x86/entry/Makefile
> +++ b/arch/x86/entry/Makefile
> @@ -8,11 +8,14 @@ UBSAN_SANITIZE := n
> KCOV_INSTRUMENT := n
>
> CFLAGS_REMOVE_common.o = $(CC_FLAGS_FTRACE)
> +CFLAGS_REMOVE_ist_entry.o = $(CC_FLAGS_FTRACE) $(RETHUNK_CFLAGS)
This ^^^
> diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
> index 49ddc4dd3117..50a24cc83581 100644
> --- a/arch/x86/entry/entry_64.S
> +++ b/arch/x86/entry/entry_64.S
> @@ -443,6 +443,184 @@ SYM_CODE_END(\asmsym)
> +.macro idtentry_ist vector asmsym cfunc user_cfunc has_error_code:req, stack_offset:req
> +SYM_CODE_START(\asmsym)
> + UNWIND_HINT_IRET_REGS offset=\has_error_code*8
> + ENDBR
> +
> + /*
> + * Clear X86_EFLAGS_AC, X86_EFLAGS_DF and set a default ORIG_RAX.
> + *
> + * The code setting ORIG_RAX will not be replicated if interrupted.
> + */
> + ASM_CLAC
> + cld
> +
> + .if \has_error_code == 0
> + pushq $-1 /* ORIG_RAX: no syscall to restart */
> + .endif
> +
> + /*
> + * No register can be touched except %rsp,%rflags,%rip before
> + * pushing all the registers. It is indispensable for nested
> + * atomic-IST-entry to replicate pushing the registers.
> + */
> + PUSH_REGS
> +
> + /*
> + * Finished pushing the registers; all registers can be touched now.
> + *
> + * Clear registers for the C function ist_copy_regs_to_main_stack()
> + * and the handler to avoid any possible exploitation of any
> + * speculation attack.
> + */
> + CLEAR_REGS
> +
> + /*
> + * Copy the pt_regs to the IST main stack including the pt_regs of
> + * the interrupted atomic-IST-entries, if any, by replicating.
> + */
> + movq %rsp, %rdi /* pt_regs pointer on its own IST stack */
> + leaq PTREGS_SIZE-\stack_offset(%rsp), %rsi /* struct cea_exception_stacks pointer */
> + call ist_copy_regs_to_main_stack
IIUC you do a CALL+RET here, before you call paranoid_entry ...
> +
> + /*
> + * Commit stage.
> + */
> +SYM_INNER_LABEL(start_commit_\asmsym, SYM_L_GLOBAL)
> + /*
> + * Switches to the IST main stack. Before the switching is done,
> + * %rax is the copied pt_regs pointer in IST main stack.
> + */
> + movq %rax, %rsp
> +
> + /*
> + * The label should be immediately after the instruction that switches
> + * the stack since there is code assuming there is only one single
> + * instruction in the commit stage and the code assumes "%rsp in the
> + * IST main stack is also the sign of ending an atomic-IST-entry".
> + * (The code will be removed in future when %rip-based identifying
> + * is added.)
> + */
> +SYM_INNER_LABEL(commit_\asmsym, SYM_L_GLOBAL)
> +
> + /*
> + * Now, it is on the IST main stack. To the rest of the kernel, an
> + * IST exception entry becomes visible only from this point on,
> + * because the inside of the atomic-IST-entry cannot be seen by
> + * anything except the atomic-IST-entry itself or #DF.
> + */
> + UNWIND_HINT_REGS
> + ENCODE_FRAME_POINTER
> +
> + /*
> + * The code setting ORIG_RAX will not be replicated if interrupted.
> + * So redo it here.
> + */
> + .if \has_error_code == 0
> + movq $-1, ORIG_RAX(%rsp) /* ORIG_RAX: no syscall to restart */
> + .endif
> +
> + /*
> + * If the entry is from userspace, switch stacks and treat it as
> + * a normal entry.
> + */
> + testb $3, CS(%rsp)
> + jnz .Lfrom_usermode_switch_stack_\@
> +
> + /*
> + * paranoid_entry returns GS/CR3/SPEC_CTL information for
> + * paranoid_exit in RBX/R14/R15.
> + */
> + call paranoid_entry
... all the way down here, which will do:
IBRS_ENTER;
UNTRAIN_RET_FROM_CALL;
Which thus breaks the whole RetBleed mess, since that must not do RET
before that happens.
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [RFC PATCH 3/7] x86/entry: Implement atomic-IST-entry
2023-04-06 21:01 ` Peter Zijlstra
@ 2023-04-07 2:33 ` Lai Jiangshan
0 siblings, 0 replies; 25+ messages in thread
From: Lai Jiangshan @ 2023-04-07 2:33 UTC (permalink / raw)
To: Peter Zijlstra
Cc: linux-kernel, Lai Jiangshan, H. Peter Anvin, Andi Kleen,
Andrew Cooper, Andy Lutomirski, Asit Mallick, Cfir Cohen,
Dan Williams, Dave Hansen, David Kaplan, David Rientjes,
Dirk Hohndel, Erdem Aktas, Jan Kiszka, Jiri Slaby, Joerg Roedel,
Juergen Gross, Kees Cook, Kirill Shutemov,
Kuppuswamy Sathyanarayanan, Linus Torvalds, Mike Stunes,
Raj Ashok, Sean Christopherson, Thomas Gleixner, Tom Lendacky,
Tony Luck, kvm, linux-coco, x86, Ingo Molnar, Borislav Petkov,
Dave Hansen, H. Peter Anvin, Josh Poimboeuf, Arnd Bergmann
On Fri, Apr 7, 2023 at 5:01 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Mon, Apr 03, 2023 at 10:06:01PM +0800, Lai Jiangshan wrote:
>
> > diff --git a/arch/x86/entry/Makefile b/arch/x86/entry/Makefile
> > index ca2fe186994b..7cc1254ca519 100644
> > --- a/arch/x86/entry/Makefile
> > +++ b/arch/x86/entry/Makefile
> > @@ -8,11 +8,14 @@ UBSAN_SANITIZE := n
> > KCOV_INSTRUMENT := n
> >
> > CFLAGS_REMOVE_common.o = $(CC_FLAGS_FTRACE)
> > +CFLAGS_REMOVE_ist_entry.o = $(CC_FLAGS_FTRACE) $(RETHUNK_CFLAGS)
>
> This ^^^
>
>
> > diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
> > index 49ddc4dd3117..50a24cc83581 100644
> > --- a/arch/x86/entry/entry_64.S
> > +++ b/arch/x86/entry/entry_64.S
> > @@ -443,6 +443,184 @@ SYM_CODE_END(\asmsym)
>
> > +.macro idtentry_ist vector asmsym cfunc user_cfunc has_error_code:req, stack_offset:req
> > +SYM_CODE_START(\asmsym)
> > + UNWIND_HINT_IRET_REGS offset=\has_error_code*8
> > + ENDBR
> > +
> > + /*
> > + * Clear X86_EFLAGS_AC, X86_EFLAGS_DF and set a default ORIG_RAX.
> > + *
> > + * The code setting ORIG_RAX will not be replicated if interrupted.
> > + */
> > + ASM_CLAC
> > + cld
> > +
> > + .if \has_error_code == 0
> > + pushq $-1 /* ORIG_RAX: no syscall to restart */
> > + .endif
> > +
> > + /*
> > + * No register can be touched except %rsp,%rflags,%rip before
> > + * pushing all the registers. It is indispensable for nested
> > + * atomic-IST-entry to replicate pushing the registers.
> > + */
> > + PUSH_REGS
> > +
> > + /*
> > + * Finished pushing the registers; all registers can be touched now.
> > + *
> > + * Clear registers for the C function ist_copy_regs_to_main_stack()
> > + * and the handler to avoid any possible exploitation of any
> > + * speculation attack.
> > + */
> > + CLEAR_REGS
> > +
> > + /*
> > + * Copy the pt_regs to the IST main stack including the pt_regs of
> > + * the interrupted atomic-IST-entries, if any, by replicating.
> > + */
> > + movq %rsp, %rdi /* pt_regs pointer on its own IST stack */
> > + leaq PTREGS_SIZE-\stack_offset(%rsp), %rsi /* struct cea_exception_stacks pointer */
> > + call ist_copy_regs_to_main_stack
>
> IIUC you do a CALL+RET here, before you call paranoid_entry ...
>
> > +
> > + /*
> > + * Commit stage.
> > + */
> > +SYM_INNER_LABEL(start_commit_\asmsym, SYM_L_GLOBAL)
> > + /*
> > + * Switches to the IST main stack. Before the switching is done,
> > + * %rax is the copied pt_regs pointer in IST main stack.
> > + */
> > + movq %rax, %rsp
> > +
> > + /*
> > + * The label should be immediately after the instruction that switches
> > + * the stack since there is code assuming there is only one single
> > + * instruction in the commit stage and the code assumes "%rsp in the
> > + * IST main stack is also the sign of ending an atomic-IST-entry".
> > + * (The code will be removed in future when %rip-based identifying
> > + * is added.)
> > + */
> > +SYM_INNER_LABEL(commit_\asmsym, SYM_L_GLOBAL)
> > +
> > + /*
> > + * Now, it is on the IST main stack. To the rest of the kernel, an
> > + * IST exception entry becomes visible only from this point on,
> > + * because the inside of the atomic-IST-entry cannot be seen by
> > + * anything except the atomic-IST-entry itself or #DF.
> > + */
> > + UNWIND_HINT_REGS
> > + ENCODE_FRAME_POINTER
> > +
> > + /*
> > + * The code setting ORIG_RAX will not be replicated if interrupted.
> > + * So redo it here.
> > + */
> > + .if \has_error_code == 0
> > + movq $-1, ORIG_RAX(%rsp) /* ORIG_RAX: no syscall to restart */
> > + .endif
> > +
> > + /*
> > + * If the entry is from userspace, switch stacks and treat it as
> > + * a normal entry.
> > + */
> > + testb $3, CS(%rsp)
> > + jnz .Lfrom_usermode_switch_stack_\@
> > +
> > + /*
> > + * paranoid_entry returns GS/CR3/SPEC_CTL information for
> > + * paranoid_exit in RBX/R14/R15.
> > + */
> > + call paranoid_entry
>
> ... all the way down here, which will do:
>
> IBRS_ENTER;
> UNTRAIN_RET_FROM_CALL;
>
> Which thus breaks the whole RetBleed mess, since that must not do RET
> before that happens.
I got it.
I will add the save-stage-3 in the atomic-IST-entry.
The benefit of the new stage:
Do CR3/GSBASE/IBRS switching in the C atomic-IST-entry.
(^^^^^ Also the drawback, which complicates the code, and basically needs:
https://lore.kernel.org/lkml/20211126101209.8613-1-jiangshanlai@gmail.com/ )
The IST main stack can be in the Kernel CR3, not necessarily in the CEA
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [RFC PATCH 3/7] x86/entry: Implement atomic-IST-entry
2023-04-03 14:06 ` [RFC PATCH 3/7] x86/entry: Implement atomic-IST-entry Lai Jiangshan
2023-04-06 21:01 ` Peter Zijlstra
@ 2023-04-06 21:58 ` Peter Zijlstra
2023-04-06 23:07 ` Andrew Cooper
1 sibling, 1 reply; 25+ messages in thread
From: Peter Zijlstra @ 2023-04-06 21:58 UTC (permalink / raw)
To: Lai Jiangshan
Cc: linux-kernel, Lai Jiangshan, H. Peter Anvin, Andi Kleen,
Andrew Cooper, Andy Lutomirski, Asit Mallick, Cfir Cohen,
Dan Williams, Dave Hansen, David Kaplan, David Rientjes,
Dirk Hohndel, Erdem Aktas, Jan Kiszka, Jiri Slaby, Joerg Roedel,
Juergen Gross, Kees Cook, Kirill Shutemov,
Kuppuswamy Sathyanarayanan, Linus Torvalds, Mike Stunes,
Raj Ashok, Sean Christopherson, Thomas Gleixner, Tom Lendacky,
Tony Luck, kvm, linux-coco, x86, Ingo Molnar, Borislav Petkov,
Dave Hansen, H. Peter Anvin, Josh Poimboeuf, Arnd Bergmann
On Mon, Apr 03, 2023 at 10:06:01PM +0800, Lai Jiangshan wrote:
> +static __always_inline
> +void copy_regs_exception_head(struct pt_regs *target, const struct pt_regs *from)
> +{
> + target->ss = from->ss;
> + target->sp = from->sp;
> + target->flags = from->flags;
> + target->cs = from->cs;
> + target->ip = from->ip;
> + target->orig_ax = from->orig_ax;
> +}
> +
> +static __always_inline
> +void copy_regs_general_registers(struct pt_regs *target, const struct pt_regs *from)
> +{
> + target->di = from->di;
> + target->si = from->si;
> + target->dx = from->dx;
> + target->cx = from->cx;
> + target->ax = from->ax;
> + target->r8 = from->r8;
> + target->r9 = from->r9;
> + target->r10 = from->r10;
> + target->r11 = from->r11;
> + target->bx = from->bx;
> + target->bp = from->bp;
> + target->r12 = from->r12;
> + target->r13 = from->r13;
> + target->r14 = from->r14;
> + target->r15 = from->r15;
> +}
> +/* Replicate the interrupted atomic-IST-entry's CLEAR_REGS macro. */
> +static __always_inline void replicate_clear_regs(struct pt_regs *target)
> +{
> + target->di = 0;
> + target->si = 0;
> + target->dx = 0;
> + target->cx = 0;
> + target->ax = 0;
> + target->r8 = 0;
> + target->r9 = 0;
> + target->r10 = 0;
> + target->r11 = 0;
> + target->bx = 0;
> + target->bp = 0;
> + target->r12 = 0;
> + target->r13 = 0;
> + target->r14 = 0;
> + target->r15 = 0;
> +}
I think there's compilers smart enough to see through your attempts at
avoiding mem{set,cpy}() there and I think we'll end up needing something
like __inline_memset() and __inline_memcpy() like here:
https://lore.kernel.org/lkml/Y759AJ%2F0N9fqwDED@hirez.programming.kicks-ass.net/
^ permalink raw reply [flat|nested] 25+ messages in thread

* Re: [RFC PATCH 3/7] x86/entry: Implement atomic-IST-entry
2023-04-06 21:58 ` Peter Zijlstra
@ 2023-04-06 23:07 ` Andrew Cooper
0 siblings, 0 replies; 25+ messages in thread
From: Andrew Cooper @ 2023-04-06 23:07 UTC (permalink / raw)
To: Peter Zijlstra, Lai Jiangshan
Cc: linux-kernel, Lai Jiangshan, H. Peter Anvin, Andi Kleen,
Andy Lutomirski, Asit Mallick, Cfir Cohen, Dan Williams,
Dave Hansen, David Kaplan, David Rientjes, Dirk Hohndel,
Erdem Aktas, Jan Kiszka, Jiri Slaby, Joerg Roedel, Juergen Gross,
Kees Cook, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
Linus Torvalds, Mike Stunes, Raj Ashok, Sean Christopherson,
Thomas Gleixner, Tom Lendacky, Tony Luck, kvm, linux-coco, x86,
Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin,
Josh Poimboeuf, Arnd Bergmann
On 06/04/2023 10:58 pm, Peter Zijlstra wrote:
> On Mon, Apr 03, 2023 at 10:06:01PM +0800, Lai Jiangshan wrote:
>> +/* Replicate the interrupted atomic-IST-entry's CLEAR_REGS macro. */
>> +static __always_inline void replicate_clear_regs(struct pt_regs *target)
>> +{
>> + target->di = 0;
>> + target->si = 0;
>> + target->dx = 0;
>> + target->cx = 0;
>> + target->ax = 0;
>> + target->r8 = 0;
>> + target->r9 = 0;
>> + target->r10 = 0;
>> + target->r11 = 0;
>> + target->bx = 0;
>> + target->bp = 0;
>> + target->r12 = 0;
>> + target->r13 = 0;
>> + target->r14 = 0;
>> + target->r15 = 0;
>> +}
> I think there's compilers smart enough to see through your attempts at
> avoiding mem{set,cpy}() there
Indeed. It took a little bit of convincing (needed 4 extra registers to
zero), but https://godbolt.org/z/7rvb8db66
Including Peter's other observation about speculation safety, you
basically can't have any C at all before passing IBRS/UNRET/whatever
else comes along in the future.
Otherwise, the compiler will make you wish you'd written it in asm the
first time around.
~Andrew
^ permalink raw reply [flat|nested] 25+ messages in thread
* [RFC PATCH 4/7] x86/entry: Use atomic-IST-entry for NMI
2023-04-03 14:05 [RFC PATCH 0/7] x86/entry: Atomic statck switching for IST Lai Jiangshan
` (2 preceding siblings ...)
2023-04-03 14:06 ` [RFC PATCH 3/7] x86/entry: Implement atomic-IST-entry Lai Jiangshan
@ 2023-04-03 14:06 ` Lai Jiangshan
2023-04-03 14:06 ` [RFC PATCH 5/7] x86/entry: Use atomic-IST-entry for MCE and DB Lai Jiangshan
` (4 subsequent siblings)
8 siblings, 0 replies; 25+ messages in thread
From: Lai Jiangshan @ 2023-04-03 14:06 UTC (permalink / raw)
To: linux-kernel
Cc: Lai Jiangshan, H. Peter Anvin, Andi Kleen, Andrew Cooper,
Andy Lutomirski, Asit Mallick, Cfir Cohen, Dan Williams,
Dave Hansen, David Kaplan, David Rientjes, Dirk Hohndel,
Erdem Aktas, Jan Kiszka, Jiri Slaby, Joerg Roedel, Juergen Gross,
Kees Cook, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
Linus Torvalds, Mike Stunes, Peter Zijlstra, Raj Ashok,
Sean Christopherson, Thomas Gleixner, Tom Lendacky, Tony Luck,
kvm, linux-coco, x86, Ingo Molnar, Borislav Petkov, Dave Hansen,
H. Peter Anvin, Kuppuswamy Sathyanarayanan
From: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
---
arch/x86/entry/entry_64.S | 373 --------------------------------
arch/x86/entry/ist_entry.c | 2 +-
arch/x86/include/asm/idtentry.h | 9 +-
3 files changed, 7 insertions(+), 377 deletions(-)
diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 50a24cc83581..2bb7ab8512dc 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -1341,379 +1341,6 @@ SYM_CODE_START_LOCAL(error_return)
jmp swapgs_restore_regs_and_return_to_usermode
SYM_CODE_END(error_return)
-/*
- * Runs on exception stack. Xen PV does not go through this path at all,
- * so we can use real assembly here.
- *
- * Registers:
- * %r14: Used to save/restore the CR3 of the interrupted context
- * when PAGE_TABLE_ISOLATION is in use. Do not clobber.
- */
-SYM_CODE_START(asm_exc_nmi)
- UNWIND_HINT_IRET_REGS
- ENDBR
-
- /*
- * We allow breakpoints in NMIs. If a breakpoint occurs, then
- * the iretq it performs will take us out of NMI context.
- * This means that we can have nested NMIs where the next
- * NMI is using the top of the stack of the previous NMI. We
- * can't let it execute because the nested NMI will corrupt the
- * stack of the previous NMI. NMI handlers are not re-entrant
- * anyway.
- *
- * To handle this case we do the following:
- * Check the a special location on the stack that contains
- * a variable that is set when NMIs are executing.
- * The interrupted task's stack is also checked to see if it
- * is an NMI stack.
- * If the variable is not set and the stack is not the NMI
- * stack then:
- * o Set the special variable on the stack
- * o Copy the interrupt frame into an "outermost" location on the
- * stack
- * o Copy the interrupt frame into an "iret" location on the stack
- * o Continue processing the NMI
- * If the variable is set or the previous stack is the NMI stack:
- * o Modify the "iret" location to jump to the repeat_nmi
- * o return back to the first NMI
- *
- * Now on exit of the first NMI, we first clear the stack variable
- * The NMI stack will tell any nested NMIs at that point that it is
- * nested. Then we pop the stack normally with iret, and if there was
- * a nested NMI that updated the copy interrupt stack frame, a
- * jump will be made to the repeat_nmi code that will handle the second
- * NMI.
- *
- * However, espfix prevents us from directly returning to userspace
- * with a single IRET instruction. Similarly, IRET to user mode
- * can fault. We therefore handle NMIs from user space like
- * other IST entries.
- */
-
- ASM_CLAC
- cld
-
- /* Use %rdx as our temp variable throughout */
- pushq %rdx
-
- testb $3, CS-RIP+8(%rsp)
- jz .Lnmi_from_kernel
-
- /*
- * NMI from user mode. We need to run on the thread stack, but we
- * can't go through the normal entry paths: NMIs are masked, and
- * we don't want to enable interrupts, because then we'll end
- * up in an awkward situation in which IRQs are on but NMIs
- * are off.
- *
- * We also must not push anything to the stack before switching
- * stacks lest we corrupt the "NMI executing" variable.
- */
-
- swapgs
- FENCE_SWAPGS_USER_ENTRY
- SWITCH_TO_KERNEL_CR3 scratch_reg=%rdx
- movq %rsp, %rdx
- movq PER_CPU_VAR(pcpu_hot + X86_top_of_stack), %rsp
- UNWIND_HINT_IRET_REGS base=%rdx offset=8
- pushq 5*8(%rdx) /* pt_regs->ss */
- pushq 4*8(%rdx) /* pt_regs->rsp */
- pushq 3*8(%rdx) /* pt_regs->flags */
- pushq 2*8(%rdx) /* pt_regs->cs */
- pushq 1*8(%rdx) /* pt_regs->rip */
- UNWIND_HINT_IRET_REGS
- pushq $-1 /* pt_regs->orig_ax */
- PUSH_AND_CLEAR_REGS rdx=(%rdx)
- ENCODE_FRAME_POINTER
-
- IBRS_ENTER
- UNTRAIN_RET
-
- /*
- * At this point we no longer need to worry about stack damage
- * due to nesting -- we're on the normal thread stack and we're
- * done with the NMI stack.
- */
-
- movq %rsp, %rdi
- movq $-1, %rsi
- call exc_nmi
-
- /*
- * Return back to user mode. We must *not* do the normal exit
- * work, because we don't want to enable interrupts.
- */
- jmp swapgs_restore_regs_and_return_to_usermode
-
-.Lnmi_from_kernel:
- /*
- * Here's what our stack frame will look like:
- * +---------------------------------------------------------+
- * | original SS |
- * | original Return RSP |
- * | original RFLAGS |
- * | original CS |
- * | original RIP |
- * +---------------------------------------------------------+
- * | temp storage for rdx |
- * +---------------------------------------------------------+
- * | "NMI executing" variable |
- * +---------------------------------------------------------+
- * | iret SS } Copied from "outermost" frame |
- * | iret Return RSP } on each loop iteration; overwritten |
- * | iret RFLAGS } by a nested NMI to force another |
- * | iret CS } iteration if needed. |
- * | iret RIP } |
- * +---------------------------------------------------------+
- * | outermost SS } initialized in first_nmi; |
- * | outermost Return RSP } will not be changed before |
- * | outermost RFLAGS } NMI processing is done. |
- * | outermost CS } Copied to "iret" frame on each |
- * | outermost RIP } iteration. |
- * +---------------------------------------------------------+
- * | pt_regs |
- * +---------------------------------------------------------+
- *
- * The "original" frame is used by hardware. Before re-enabling
- * NMIs, we need to be done with it, and we need to leave enough
- * space for the asm code here.
- *
- * We return by executing IRET while RSP points to the "iret" frame.
- * That will either return for real or it will loop back into NMI
- * processing.
- *
- * The "outermost" frame is copied to the "iret" frame on each
- * iteration of the loop, so each iteration starts with the "iret"
- * frame pointing to the final return target.
- */
-
- /*
- * Determine whether we're a nested NMI.
- *
- * If we interrupted kernel code between repeat_nmi and
- * end_repeat_nmi, then we are a nested NMI. We must not
- * modify the "iret" frame because it's being written by
- * the outer NMI. That's okay; the outer NMI handler is
- * about to about to call exc_nmi() anyway, so we can just
- * resume the outer NMI.
- */
-
- movq $repeat_nmi, %rdx
- cmpq 8(%rsp), %rdx
- ja 1f
- movq $end_repeat_nmi, %rdx
- cmpq 8(%rsp), %rdx
- ja nested_nmi_out
-1:
-
- /*
- * Now check "NMI executing". If it's set, then we're nested.
- * This will not detect if we interrupted an outer NMI just
- * before IRET.
- */
- cmpl $1, -8(%rsp)
- je nested_nmi
-
- /*
- * Now test if the previous stack was an NMI stack. This covers
- * the case where we interrupt an outer NMI after it clears
- * "NMI executing" but before IRET. We need to be careful, though:
- * there is one case in which RSP could point to the NMI stack
- * despite there being no NMI active: naughty userspace controls
- * RSP at the very beginning of the SYSCALL targets. We can
- * pull a fast one on naughty userspace, though: we program
- * SYSCALL to mask DF, so userspace cannot cause DF to be set
- * if it controls the kernel's RSP. We set DF before we clear
- * "NMI executing".
- */
- lea 6*8(%rsp), %rdx
- /* Compare the NMI stack (rdx) with the stack we came from (4*8(%rsp)) */
- cmpq %rdx, 4*8(%rsp)
- /* If the stack pointer is above the NMI stack, this is a normal NMI */
- ja first_nmi
-
- subq $EXCEPTION_STKSZ, %rdx
- cmpq %rdx, 4*8(%rsp)
- /* If it is below the NMI stack, it is a normal NMI */
- jb first_nmi
-
- /* Ah, it is within the NMI stack. */
-
- testb $(X86_EFLAGS_DF >> 8), (3*8 + 1)(%rsp)
- jz first_nmi /* RSP was user controlled. */
-
- /* This is a nested NMI. */
-
-nested_nmi:
- /*
- * Modify the "iret" frame to point to repeat_nmi, forcing another
- * iteration of NMI handling.
- */
- subq $8, %rsp
- leaq -10*8(%rsp), %rdx
- pushq $__KERNEL_DS
- pushq %rdx
- pushfq
- pushq $__KERNEL_CS
- pushq $repeat_nmi
-
- /* Put stack back */
- addq $(6*8), %rsp
-
-nested_nmi_out:
- popq %rdx
-
- /* We are returning to kernel mode, so this cannot result in a fault. */
- iretq
-
-first_nmi:
- /* Restore rdx. */
- movq (%rsp), %rdx
-
- /* Make room for "NMI executing". */
- pushq $0
-
- /* Leave room for the "iret" frame */
- subq $(5*8), %rsp
-
- /* Copy the "original" frame to the "outermost" frame */
- .rept 5
- pushq 11*8(%rsp)
- .endr
- UNWIND_HINT_IRET_REGS
-
- /* Everything up to here is safe from nested NMIs */
-
-#ifdef CONFIG_DEBUG_ENTRY
- /*
- * For ease of testing, unmask NMIs right away. Disabled by
- * default because IRET is very expensive.
- */
- pushq $0 /* SS */
- pushq %rsp /* RSP (minus 8 because of the previous push) */
- addq $8, (%rsp) /* Fix up RSP */
- pushfq /* RFLAGS */
- pushq $__KERNEL_CS /* CS */
- pushq $1f /* RIP */
- iretq /* continues at repeat_nmi below */
- UNWIND_HINT_IRET_REGS
-1:
-#endif
-
-repeat_nmi:
- ANNOTATE_NOENDBR // this code
- /*
- * If there was a nested NMI, the first NMI's iret will return
- * here. But NMIs are still enabled and we can take another
- * nested NMI. The nested NMI checks the interrupted RIP to see
- * if it is between repeat_nmi and end_repeat_nmi, and if so
- * it will just return, as we are about to repeat an NMI anyway.
- * This makes it safe to copy to the stack frame that a nested
- * NMI will update.
- *
- * RSP is pointing to "outermost RIP". gsbase is unknown, but, if
- * we're repeating an NMI, gsbase has the same value that it had on
- * the first iteration. paranoid_entry will load the kernel
- * gsbase if needed before we call exc_nmi(). "NMI executing"
- * is zero.
- */
- movq $1, 10*8(%rsp) /* Set "NMI executing". */
-
- /*
- * Copy the "outermost" frame to the "iret" frame. NMIs that nest
- * here must not modify the "iret" frame while we're writing to
- * it or it will end up containing garbage.
- */
- addq $(10*8), %rsp
- .rept 5
- pushq -6*8(%rsp)
- .endr
- subq $(5*8), %rsp
-end_repeat_nmi:
- ANNOTATE_NOENDBR // this code
-
- /*
- * Everything below this point can be preempted by a nested NMI.
- * If this happens, then the inner NMI will change the "iret"
- * frame to point back to repeat_nmi.
- */
- pushq $-1 /* ORIG_RAX: no syscall to restart */
-
- PUSH_AND_CLEAR_REGS
- UNWIND_HINT_REGS
- ENCODE_FRAME_POINTER
-
- /*
- * Use paranoid_entry to handle SWAPGS, but no need to use paranoid_exit
- * as we should not be calling schedule in NMI context.
- * Even with normal interrupts enabled. An NMI should not be
- * setting NEED_RESCHED or anything that normal interrupts and
- * exceptions might do.
- */
- call paranoid_entry
-
- movq %rsp, %rdi
- movq $-1, %rsi
- call exc_nmi
-
- /* Always restore stashed SPEC_CTRL value (see paranoid_entry) */
- IBRS_EXIT save_reg=%r15
-
- /* Always restore stashed CR3 value (see paranoid_entry) */
- RESTORE_CR3 scratch_reg=%r15 save_reg=%r14
-
- /*
- * The above invocation of paranoid_entry stored the GSBASE
- * related information in R/EBX depending on the availability
- * of FSGSBASE.
- *
- * If FSGSBASE is enabled, restore the saved GSBASE value
- * unconditionally, otherwise take the conditional SWAPGS path.
- */
- ALTERNATIVE "jmp nmi_no_fsgsbase", "", X86_FEATURE_FSGSBASE
-
- wrgsbase %rbx
- jmp nmi_restore
-
-nmi_no_fsgsbase:
- /* EBX == 0 -> invoke SWAPGS */
- testl %ebx, %ebx
- jnz nmi_restore
-
-nmi_swapgs:
- swapgs
-
-nmi_restore:
- POP_REGS
-
- /*
- * Skip orig_ax and the "outermost" frame to point RSP at the "iret"
- * at the "iret" frame.
- */
- addq $6*8, %rsp
-
- /*
- * Clear "NMI executing". Set DF first so that we can easily
- * distinguish the remaining code between here and IRET from
- * the SYSCALL entry and exit paths.
- *
- * We arguably should just inspect RIP instead, but I (Andy) wrote
- * this code when I had the misapprehension that Xen PV supported
- * NMIs, and Xen PV would break that approach.
- */
- std
- movq $0, 5*8(%rsp) /* clear "NMI executing" */
-
- /*
- * iretq reads the "iret" frame and exits the NMI stack in a
- * single instruction. We are returning to kernel mode, so this
- * cannot result in a fault. Similarly, we don't need to worry
- * about espfix64 on the way back to kernel mode.
- */
- iretq
-SYM_CODE_END(asm_exc_nmi)
-
#ifndef CONFIG_IA32_EMULATION
/*
* This handles SYSCALL from 32-bit code. There is no way to program
diff --git a/arch/x86/entry/ist_entry.c b/arch/x86/entry/ist_entry.c
index e1b06306ac51..407571cc4a8c 100644
--- a/arch/x86/entry/ist_entry.c
+++ b/arch/x86/entry/ist_entry.c
@@ -41,7 +41,7 @@ static __always_inline bool identify_ist_##sym_name( \
return true; \
}
-DEFINE_IDENTIFY_IST(NMI, nmi, false)
+DEFINE_IDENTIFY_IST(NMI, nmi, true)
DEFINE_IDENTIFY_IST(DB, debug, false)
DEFINE_IDENTIFY_IST(MCE, machine_check, false)
DEFINE_IDENTIFY_IST(VC, vmm_communication, false)
diff --git a/arch/x86/include/asm/idtentry.h b/arch/x86/include/asm/idtentry.h
index b241af4ce9b4..b568f1de6da6 100644
--- a/arch/x86/include/asm/idtentry.h
+++ b/arch/x86/include/asm/idtentry.h
@@ -450,6 +450,9 @@ __visible noinstr void func(struct pt_regs *regs, \
idtentry_sysvec vector func
#ifdef CONFIG_X86_64
+# define DECLARE_IDTENTRY_NMI(vector, func) \
+ idtentry_ist vector asm_##func func func has_error_code=0 stack_offset=CEA_stacks_NMI
+
# define DECLARE_IDTENTRY_MCE(vector, func) \
idtentry_mce_db vector asm_##func func
@@ -475,11 +478,11 @@ __visible noinstr void func(struct pt_regs *regs, \
/* No ASM emitted for XEN hypervisor callback */
# define DECLARE_IDTENTRY_XENCB(vector, func)
-#endif
-
-/* No ASM code emitted for NMI */
+/* No ASM code emitted for NMI for X86_32 */
#define DECLARE_IDTENTRY_NMI(vector, func)
+#endif
+
/*
* ASM code to emit the common vector entry stubs where each stub is
* packed into IDT_ALIGN bytes.
--
2.19.1.6.gb485710b
^ permalink raw reply related [flat|nested] 25+ messages in thread

* [RFC PATCH 5/7] x86/entry: Use atomic-IST-entry for MCE and DB
2023-04-03 14:05 [RFC PATCH 0/7] x86/entry: Atomic statck switching for IST Lai Jiangshan
` (3 preceding siblings ...)
2023-04-03 14:06 ` [RFC PATCH 4/7] x86/entry: Use atomic-IST-entry for NMI Lai Jiangshan
@ 2023-04-03 14:06 ` Lai Jiangshan
2023-04-03 14:06 ` [RFC PATCH 6/7] x86/entry: Use atomic-IST-entry for VC Lai Jiangshan
` (3 subsequent siblings)
8 siblings, 0 replies; 25+ messages in thread
From: Lai Jiangshan @ 2023-04-03 14:06 UTC (permalink / raw)
To: linux-kernel
Cc: Lai Jiangshan, H. Peter Anvin, Andi Kleen, Andrew Cooper,
Andy Lutomirski, Asit Mallick, Cfir Cohen, Dan Williams,
Dave Hansen, David Kaplan, David Rientjes, Dirk Hohndel,
Erdem Aktas, Jan Kiszka, Jiri Slaby, Joerg Roedel, Juergen Gross,
Kees Cook, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
Linus Torvalds, Mike Stunes, Peter Zijlstra, Raj Ashok,
Sean Christopherson, Thomas Gleixner, Tom Lendacky, Tony Luck,
kvm, linux-coco, x86, Ingo Molnar, Borislav Petkov, Dave Hansen,
H. Peter Anvin, Kuppuswamy Sathyanarayanan
From: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
---
arch/x86/entry/entry_64.S | 53 ---------------------------------
arch/x86/entry/ist_entry.c | 4 +--
arch/x86/include/asm/idtentry.h | 4 +--
3 files changed, 4 insertions(+), 57 deletions(-)
diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 2bb7ab8512dc..e4ddc793f841 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -621,59 +621,6 @@ _ASM_NOKPROBE(\asmsym)
SYM_CODE_END(\asmsym)
.endm
-/**
- * idtentry_mce_db - Macro to generate entry stubs for #MC and #DB
- * @vector: Vector number
- * @asmsym: ASM symbol for the entry point
- * @cfunc: C function to be called
- *
- * The macro emits code to set up the kernel context for #MC and #DB
- *
- * If the entry comes from user space it uses the normal entry path
- * including the return to user space work and preemption checks on
- * exit.
- *
- * If hits in kernel mode then it needs to go through the paranoid
- * entry as the exception can hit any random state. No preemption
- * check on exit to keep the paranoid path simple.
- */
-.macro idtentry_mce_db vector asmsym cfunc
-SYM_CODE_START(\asmsym)
- UNWIND_HINT_IRET_REGS
- ENDBR
- ASM_CLAC
- cld
-
- pushq $-1 /* ORIG_RAX: no syscall to restart */
-
- /*
- * If the entry is from userspace, switch stacks and treat it as
- * a normal entry.
- */
- testb $3, CS-ORIG_RAX(%rsp)
- jnz .Lfrom_usermode_switch_stack_\@
-
- PUSH_AND_CLEAR_REGS
- UNWIND_HINT_REGS
- ENCODE_FRAME_POINTER
-
- /* paranoid_entry returns GS information for paranoid_exit in EBX. */
- call paranoid_entry
-
- movq %rsp, %rdi /* pt_regs pointer */
-
- call \cfunc
-
- jmp paranoid_exit
-
- /* Switch to the regular task stack and use the noist entry point */
-.Lfrom_usermode_switch_stack_\@:
- idtentry_body noist_\cfunc, has_error_code=0
-
-_ASM_NOKPROBE(\asmsym)
-SYM_CODE_END(\asmsym)
-.endm
-
#ifdef CONFIG_AMD_MEM_ENCRYPT
/**
* idtentry_vc - Macro to generate entry stub for #VC
diff --git a/arch/x86/entry/ist_entry.c b/arch/x86/entry/ist_entry.c
index 407571cc4a8c..946b3b537bd5 100644
--- a/arch/x86/entry/ist_entry.c
+++ b/arch/x86/entry/ist_entry.c
@@ -42,8 +42,8 @@ static __always_inline bool identify_ist_##sym_name( \
}
DEFINE_IDENTIFY_IST(NMI, nmi, true)
-DEFINE_IDENTIFY_IST(DB, debug, false)
-DEFINE_IDENTIFY_IST(MCE, machine_check, false)
+DEFINE_IDENTIFY_IST(DB, debug, true)
+DEFINE_IDENTIFY_IST(MCE, machine_check, IS_ENABLED(CONFIG_X86_MCE))
DEFINE_IDENTIFY_IST(VC, vmm_communication, false)
static __always_inline bool identify_ist(
diff --git a/arch/x86/include/asm/idtentry.h b/arch/x86/include/asm/idtentry.h
index b568f1de6da6..01f3152ffe82 100644
--- a/arch/x86/include/asm/idtentry.h
+++ b/arch/x86/include/asm/idtentry.h
@@ -454,10 +454,10 @@ __visible noinstr void func(struct pt_regs *regs, \
idtentry_ist vector asm_##func func func has_error_code=0 stack_offset=CEA_stacks_NMI
# define DECLARE_IDTENTRY_MCE(vector, func) \
- idtentry_mce_db vector asm_##func func
+ idtentry_ist vector asm_##func func noist_##func has_error_code=0 stack_offset=CEA_stacks_MCE
# define DECLARE_IDTENTRY_DEBUG(vector, func) \
- idtentry_mce_db vector asm_##func func
+ idtentry_ist vector asm_##func func noist_##func has_error_code=0 stack_offset=CEA_stacks_DB
# define DECLARE_IDTENTRY_DF(vector, func) \
idtentry_df vector asm_##func func
--
2.19.1.6.gb485710b
* [RFC PATCH 6/7] x86/entry: Use atomic-IST-entry for VC
2023-04-03 14:05 [RFC PATCH 0/7] x86/entry: Atomic stack switching for IST Lai Jiangshan
` (4 preceding siblings ...)
2023-04-03 14:06 ` [RFC PATCH 5/7] x86/entry: Use atomic-IST-entry for MCE and DB Lai Jiangshan
@ 2023-04-03 14:06 ` Lai Jiangshan
2023-04-03 14:06 ` [RFC PATCH 7/7] x86/entry: Test atomic-IST-entry via KVM Lai Jiangshan
` (2 subsequent siblings)
8 siblings, 0 replies; 25+ messages in thread
From: Lai Jiangshan @ 2023-04-03 14:06 UTC (permalink / raw)
To: linux-kernel
Cc: Lai Jiangshan, H. Peter Anvin, Andi Kleen, Andrew Cooper,
Andy Lutomirski, Asit Mallick, Cfir Cohen, Dan Williams,
Dave Hansen, David Kaplan, David Rientjes, Dirk Hohndel,
Erdem Aktas, Jan Kiszka, Jiri Slaby, Joerg Roedel, Juergen Gross,
Kees Cook, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
Linus Torvalds, Mike Stunes, Peter Zijlstra, Raj Ashok,
Sean Christopherson, Thomas Gleixner, Tom Lendacky, Tony Luck,
kvm, linux-coco, x86, Ingo Molnar, Borislav Petkov, Dave Hansen,
H. Peter Anvin, Kuppuswamy Sathyanarayanan, Brijesh Singh,
Michael Roth, Venu Busireddy, Paul E. McKenney,
Jason A. Donenfeld, Sami Tolvanen, Alexander Potapenko,
Jiapeng Chong
From: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
---
arch/x86/entry/entry_64.S | 83 --------------------
arch/x86/entry/ist_entry.c | 2 +-
arch/x86/include/asm/cpu_entry_area.h | 3 -
arch/x86/include/asm/idtentry.h | 2 +-
arch/x86/include/asm/sev.h | 14 ----
arch/x86/include/asm/traps.h | 1 -
arch/x86/kernel/dumpstack_64.c | 4 +-
arch/x86/kernel/nmi.c | 8 --
arch/x86/kernel/sev.c | 108 --------------------------
arch/x86/kernel/traps.c | 43 ----------
arch/x86/mm/cpu_entry_area.c | 1 -
11 files changed, 3 insertions(+), 266 deletions(-)
diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index e4ddc793f841..187d42efd288 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -621,89 +621,6 @@ _ASM_NOKPROBE(\asmsym)
SYM_CODE_END(\asmsym)
.endm
-#ifdef CONFIG_AMD_MEM_ENCRYPT
-/**
- * idtentry_vc - Macro to generate entry stub for #VC
- * @vector: Vector number
- * @asmsym: ASM symbol for the entry point
- * @cfunc: C function to be called
- *
- * The macro emits code to set up the kernel context for #VC. The #VC handler
- * runs on an IST stack and needs to be able to cause nested #VC exceptions.
- *
- * To make this work the #VC entry code tries its best to pretend it doesn't use
- * an IST stack by switching to the task stack if coming from user-space (which
- * includes early SYSCALL entry path) or back to the stack in the IRET frame if
- * entered from kernel-mode.
- *
- * If entered from kernel-mode the return stack is validated first, and if it is
- * not safe to use (e.g. because it points to the entry stack) the #VC handler
- * will switch to a fall-back stack (VC2) and call a special handler function.
- *
- * The macro is only used for one vector, but it is planned to be extended in
- * the future for the #HV exception.
- */
-.macro idtentry_vc vector asmsym cfunc
-SYM_CODE_START(\asmsym)
- UNWIND_HINT_IRET_REGS
- ENDBR
- ASM_CLAC
- cld
-
- /*
- * If the entry is from userspace, switch stacks and treat it as
- * a normal entry.
- */
- testb $3, CS-ORIG_RAX(%rsp)
- jnz .Lfrom_usermode_switch_stack_\@
-
- PUSH_AND_CLEAR_REGS
- UNWIND_HINT_REGS
- ENCODE_FRAME_POINTER
-
- /*
- * paranoid_entry returns SWAPGS flag for paranoid_exit in EBX.
- * EBX == 0 -> SWAPGS, EBX == 1 -> no SWAPGS
- */
- call paranoid_entry
-
- /*
- * Switch off the IST stack to make it free for nested exceptions. The
- * vc_switch_off_ist() function will switch back to the interrupted
- * stack if it is safe to do so. If not it switches to the VC fall-back
- * stack.
- */
- movq %rsp, %rdi /* pt_regs pointer */
- call vc_switch_off_ist
- movq %rax, %rsp /* Switch to new stack */
-
- ENCODE_FRAME_POINTER
- UNWIND_HINT_REGS
-
- /* Update pt_regs */
- movq ORIG_RAX(%rsp), %rsi /* get error code into 2nd argument*/
- movq $-1, ORIG_RAX(%rsp) /* no syscall to restart */
-
- movq %rsp, %rdi /* pt_regs pointer */
-
- call kernel_\cfunc
-
- /*
- * No need to switch back to the IST stack. The current stack is either
- * identical to the stack in the IRET frame or the VC fall-back stack,
- * so it is definitely mapped even with PTI enabled.
- */
- jmp paranoid_exit
-
- /* Switch to the regular task stack */
-.Lfrom_usermode_switch_stack_\@:
- idtentry_body user_\cfunc, has_error_code=1
-
-_ASM_NOKPROBE(\asmsym)
-SYM_CODE_END(\asmsym)
-.endm
-#endif
-
/*
* Double fault entry. Straight paranoid. No checks from which context
* this comes because for the espfix induced #DF this would do the wrong
diff --git a/arch/x86/entry/ist_entry.c b/arch/x86/entry/ist_entry.c
index 946b3b537bd5..c0cbd4527033 100644
--- a/arch/x86/entry/ist_entry.c
+++ b/arch/x86/entry/ist_entry.c
@@ -44,7 +44,7 @@ static __always_inline bool identify_ist_##sym_name( \
DEFINE_IDENTIFY_IST(NMI, nmi, true)
DEFINE_IDENTIFY_IST(DB, debug, true)
DEFINE_IDENTIFY_IST(MCE, machine_check, IS_ENABLED(CONFIG_X86_MCE))
-DEFINE_IDENTIFY_IST(VC, vmm_communication, false)
+DEFINE_IDENTIFY_IST(VC, vmm_communication, IS_ENABLED(CONFIG_AMD_MEM_ENCRYPT))
static __always_inline bool identify_ist(
const struct pt_regs *regs, struct cea_exception_stacks *stacks,
diff --git a/arch/x86/include/asm/cpu_entry_area.h b/arch/x86/include/asm/cpu_entry_area.h
index a373e8c37e25..618aa698eb82 100644
--- a/arch/x86/include/asm/cpu_entry_area.h
+++ b/arch/x86/include/asm/cpu_entry_area.h
@@ -30,8 +30,6 @@
char MCE_stack[EXCEPTION_STKSZ]; \
char VC_stack_guard[guardsize]; \
char VC_stack[optional_stack_size]; \
- char VC2_stack_guard[guardsize]; \
- char VC2_stack[optional_stack_size]; \
char IST_stack_guard[guardsize]; \
char IST_stack[IST_MAIN_STKSZ]; \
char IST_top_guard[guardsize]; \
@@ -55,7 +53,6 @@ enum exception_stack_ordering {
ESTACK_DB,
ESTACK_MCE,
ESTACK_VC,
- ESTACK_VC2,
ESTACK_IST,
N_EXCEPTION_STACKS
};
diff --git a/arch/x86/include/asm/idtentry.h b/arch/x86/include/asm/idtentry.h
index 01f3152ffe82..5f3250e589ec 100644
--- a/arch/x86/include/asm/idtentry.h
+++ b/arch/x86/include/asm/idtentry.h
@@ -466,7 +466,7 @@ __visible noinstr void func(struct pt_regs *regs, \
DECLARE_IDTENTRY(vector, func)
# define DECLARE_IDTENTRY_VC(vector, func) \
- idtentry_vc vector asm_##func func
+ idtentry_ist vector asm_##func kernel_##func user_##func has_error_code=1 stack_offset=CEA_stacks_VC
#else
# define DECLARE_IDTENTRY_MCE(vector, func) \
diff --git a/arch/x86/include/asm/sev.h b/arch/x86/include/asm/sev.h
index ebc271bb6d8e..ce554b3a818d 100644
--- a/arch/x86/include/asm/sev.h
+++ b/arch/x86/include/asm/sev.h
@@ -135,18 +135,6 @@ struct snp_secrets_page_layout {
#ifdef CONFIG_AMD_MEM_ENCRYPT
extern struct static_key_false sev_es_enable_key;
-extern void __sev_es_ist_enter(struct pt_regs *regs);
-extern void __sev_es_ist_exit(void);
-static __always_inline void sev_es_ist_enter(struct pt_regs *regs)
-{
- if (static_branch_unlikely(&sev_es_enable_key))
- __sev_es_ist_enter(regs);
-}
-static __always_inline void sev_es_ist_exit(void)
-{
- if (static_branch_unlikely(&sev_es_enable_key))
- __sev_es_ist_exit();
-}
extern int sev_es_setup_ap_jump_table(struct real_mode_header *rmh);
extern void __sev_es_nmi_complete(void);
static __always_inline void sev_es_nmi_complete(void)
@@ -198,8 +186,6 @@ bool snp_init(struct boot_params *bp);
void __init __noreturn snp_abort(void);
int snp_issue_guest_request(u64 exit_code, struct snp_req_data *input, unsigned long *fw_err);
#else
-static inline void sev_es_ist_enter(struct pt_regs *regs) { }
-static inline void sev_es_ist_exit(void) { }
static inline int sev_es_setup_ap_jump_table(struct real_mode_header *rmh) { return 0; }
static inline void sev_es_nmi_complete(void) { }
static inline int sev_es_efi_map_ghcbs(pgd_t *pgd) { return 0; }
diff --git a/arch/x86/include/asm/traps.h b/arch/x86/include/asm/traps.h
index 47ecfff2c83d..dc0da530f951 100644
--- a/arch/x86/include/asm/traps.h
+++ b/arch/x86/include/asm/traps.h
@@ -15,7 +15,6 @@ asmlinkage __visible notrace struct pt_regs *sync_regs(struct pt_regs *eregs);
asmlinkage __visible notrace
struct pt_regs *fixup_bad_iret(struct pt_regs *bad_regs);
void __init trap_init(void);
-asmlinkage __visible noinstr struct pt_regs *vc_switch_off_ist(struct pt_regs *eregs);
#endif
extern bool ibt_selftest(void);
diff --git a/arch/x86/kernel/dumpstack_64.c b/arch/x86/kernel/dumpstack_64.c
index 3413b23fa9f1..b7ef2685f63b 100644
--- a/arch/x86/kernel/dumpstack_64.c
+++ b/arch/x86/kernel/dumpstack_64.c
@@ -25,7 +25,6 @@ static const char * const exception_stack_names[] = {
[ ESTACK_DB ] = "#DB",
[ ESTACK_MCE ] = "#MC",
[ ESTACK_VC ] = "#VC",
- [ ESTACK_VC2 ] = "#VC2",
[ ESTACK_IST ] = "#IST",
};
@@ -89,7 +88,6 @@ struct estack_pages estack_pages[CEA_ESTACK_PAGES] ____cacheline_aligned = {
EPAGERANGE(DB),
EPAGERANGE(MCE),
EPAGERANGE(VC),
- EPAGERANGE(VC2),
EPAGERANGE(IST),
};
@@ -100,7 +98,7 @@ static __always_inline bool in_exception_stack(unsigned long *stack, struct stac
struct pt_regs *regs;
unsigned int k;
- BUILD_BUG_ON(N_EXCEPTION_STACKS != 7);
+ BUILD_BUG_ON(N_EXCEPTION_STACKS != 6);
begin = (unsigned long)__this_cpu_read(cea_exception_stacks);
/*
diff --git a/arch/x86/kernel/nmi.c b/arch/x86/kernel/nmi.c
index 776f4b1e395b..bafd0c7ca5b7 100644
--- a/arch/x86/kernel/nmi.c
+++ b/arch/x86/kernel/nmi.c
@@ -514,12 +514,6 @@ DEFINE_IDTENTRY_RAW(exc_nmi)
}
nmi_restart:
- /*
- * Needs to happen before DR7 is accessed, because the hypervisor can
- * intercept DR7 reads/writes, turning those into #VC exceptions.
- */
- sev_es_ist_enter(regs);
-
this_cpu_write(nmi_dr7, local_db_save());
irq_state = irqentry_nmi_enter(regs);
@@ -544,8 +538,6 @@ DEFINE_IDTENTRY_RAW(exc_nmi)
local_db_restore(this_cpu_read(nmi_dr7));
- sev_es_ist_exit();
-
if (unlikely(this_cpu_read(nmi_cr2) != read_cr2()))
write_cr2(this_cpu_read(nmi_cr2));
if (this_cpu_dec_return(nmi_state))
diff --git a/arch/x86/kernel/sev.c b/arch/x86/kernel/sev.c
index 679026a640ef..74d55786c353 100644
--- a/arch/x86/kernel/sev.c
+++ b/arch/x86/kernel/sev.c
@@ -122,77 +122,6 @@ struct sev_config {
static struct sev_config sev_cfg __read_mostly;
-static __always_inline bool on_vc_stack(struct pt_regs *regs)
-{
- unsigned long sp = regs->sp;
-
- /* User-mode RSP is not trusted */
- if (user_mode(regs))
- return false;
-
- /* SYSCALL gap still has user-mode RSP */
- if (ip_within_syscall_gap(regs))
- return false;
-
- return ((sp >= __this_cpu_ist_bottom_va(VC)) && (sp < __this_cpu_ist_top_va(VC)));
-}
-
-/*
- * This function handles the case when an NMI is raised in the #VC
- * exception handler entry code, before the #VC handler has switched off
- * its IST stack. In this case, the IST entry for #VC must be adjusted,
- * so that any nested #VC exception will not overwrite the stack
- * contents of the interrupted #VC handler.
- *
- * The IST entry is adjusted unconditionally so that it can be also be
- * unconditionally adjusted back in __sev_es_ist_exit(). Otherwise a
- * nested sev_es_ist_exit() call may adjust back the IST entry too
- * early.
- *
- * The __sev_es_ist_enter() and __sev_es_ist_exit() functions always run
- * on the NMI IST stack, as they are only called from NMI handling code
- * right now.
- */
-void noinstr __sev_es_ist_enter(struct pt_regs *regs)
-{
- unsigned long old_ist, new_ist;
-
- /* Read old IST entry */
- new_ist = old_ist = __this_cpu_read(cpu_tss_rw.x86_tss.ist[IST_INDEX_VC]);
-
- /*
- * If NMI happened while on the #VC IST stack, set the new IST
- * value below regs->sp, so that the interrupted stack frame is
- * not overwritten by subsequent #VC exceptions.
- */
- if (on_vc_stack(regs))
- new_ist = regs->sp;
-
- /*
- * Reserve additional 8 bytes and store old IST value so this
- * adjustment can be unrolled in __sev_es_ist_exit().
- */
- new_ist -= sizeof(old_ist);
- *(unsigned long *)new_ist = old_ist;
-
- /* Set new IST entry */
- this_cpu_write(cpu_tss_rw.x86_tss.ist[IST_INDEX_VC], new_ist);
-}
-
-void noinstr __sev_es_ist_exit(void)
-{
- unsigned long ist;
-
- /* Read IST entry */
- ist = __this_cpu_read(cpu_tss_rw.x86_tss.ist[IST_INDEX_VC]);
-
- if (WARN_ON(ist == __this_cpu_ist_top_va(VC)))
- return;
-
- /* Read back old IST entry and write it to the TSS */
- this_cpu_write(cpu_tss_rw.x86_tss.ist[IST_INDEX_VC], *(unsigned long *)ist);
-}
-
/*
* Nothing shall interrupt this code path while holding the per-CPU
* GHCB. The backup GHCB is only for NMIs interrupting this path.
@@ -1841,26 +1770,6 @@ static __always_inline void vc_forward_exception(struct es_em_ctxt *ctxt)
}
}
-static __always_inline bool is_vc2_stack(unsigned long sp)
-{
- return (sp >= __this_cpu_ist_bottom_va(VC2) && sp < __this_cpu_ist_top_va(VC2));
-}
-
-static __always_inline bool vc_from_invalid_context(struct pt_regs *regs)
-{
- unsigned long sp, prev_sp;
-
- sp = (unsigned long)regs;
- prev_sp = regs->sp;
-
- /*
- * If the code was already executing on the VC2 stack when the #VC
- * happened, let it proceed to the normal handling routine. This way the
- * code executing on the VC2 stack can cause #VC exceptions to get handled.
- */
- return is_vc2_stack(sp) && !is_vc2_stack(prev_sp);
-}
-
static bool vc_raw_handle_exception(struct pt_regs *regs, unsigned long error_code)
{
struct ghcb_state state;
@@ -1930,23 +1839,6 @@ DEFINE_IDTENTRY_VC_KERNEL(exc_vmm_communication)
{
irqentry_state_t irq_state;
- /*
- * With the current implementation it is always possible to switch to a
- * safe stack because #VC exceptions only happen at known places, like
- * intercepted instructions or accesses to MMIO areas/IO ports. They can
- * also happen with code instrumentation when the hypervisor intercepts
- * #DB, but the critical paths are forbidden to be instrumented, so #DB
- * exceptions currently also only happen in safe places.
- *
- * But keep this here in case the noinstr annotations are violated due
- * to bug elsewhere.
- */
- if (unlikely(vc_from_invalid_context(regs))) {
- instrumentation_begin();
- panic("Can't handle #VC exception from unsupported context\n");
- instrumentation_end();
- }
-
/*
* Handle #DB before calling into !noinstr code to avoid recursive #DB.
*/
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index d317dc3d06a3..6c697c175f7a 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -864,49 +864,6 @@ asmlinkage __visible noinstr struct pt_regs *sync_regs(struct pt_regs *eregs)
return regs;
}
-#ifdef CONFIG_AMD_MEM_ENCRYPT
-asmlinkage __visible noinstr struct pt_regs *vc_switch_off_ist(struct pt_regs *regs)
-{
- unsigned long sp, *stack;
- struct stack_info info;
- struct pt_regs *regs_ret;
-
- /*
- * In the SYSCALL entry path the RSP value comes from user-space - don't
- * trust it and switch to the current kernel stack
- */
- if (ip_within_syscall_gap(regs)) {
- sp = this_cpu_read(pcpu_hot.top_of_stack);
- goto sync;
- }
-
- /*
- * From here on the RSP value is trusted. Now check whether entry
- * happened from a safe stack. Not safe are the entry or unknown stacks,
- * use the fall-back stack instead in this case.
- */
- sp = regs->sp;
- stack = (unsigned long *)sp;
-
- if (!get_stack_info_noinstr(stack, current, &info) || info.type == STACK_TYPE_ENTRY ||
- info.type > STACK_TYPE_EXCEPTION_LAST)
- sp = __this_cpu_ist_top_va(VC2);
-
-sync:
- /*
- * Found a safe stack - switch to it as if the entry didn't happen via
- * IST stack. The code below only copies pt_regs, the real switch happens
- * in assembly code.
- */
- sp = ALIGN_DOWN(sp, 8) - sizeof(*regs_ret);
-
- regs_ret = (struct pt_regs *)sp;
- *regs_ret = *regs;
-
- return regs_ret;
-}
-#endif
-
asmlinkage __visible noinstr struct pt_regs *fixup_bad_iret(struct pt_regs *bad_regs)
{
struct pt_regs tmp, *new_stack;
diff --git a/arch/x86/mm/cpu_entry_area.c b/arch/x86/mm/cpu_entry_area.c
index 62341cb819ab..7df1301ec343 100644
--- a/arch/x86/mm/cpu_entry_area.c
+++ b/arch/x86/mm/cpu_entry_area.c
@@ -153,7 +153,6 @@ static void __init percpu_setup_exception_stacks(unsigned int cpu)
if (IS_ENABLED(CONFIG_AMD_MEM_ENCRYPT)) {
if (cc_platform_has(CC_ATTR_GUEST_STATE_ENCRYPT)) {
cea_map_stack(VC);
- cea_map_stack(VC2);
}
}
}
--
2.19.1.6.gb485710b
* [RFC PATCH 7/7] x86/entry: Test atomic-IST-entry via KVM
2023-04-03 14:05 [RFC PATCH 0/7] x86/entry: Atomic stack switching for IST Lai Jiangshan
` (5 preceding siblings ...)
2023-04-03 14:06 ` [RFC PATCH 6/7] x86/entry: Use atomic-IST-entry for VC Lai Jiangshan
@ 2023-04-03 14:06 ` Lai Jiangshan
2023-04-03 14:23 ` [RFC PATCH 0/7] x86/entry: Atomic stack switching for IST Dave Hansen
2023-04-03 16:53 ` Dave Hansen
8 siblings, 0 replies; 25+ messages in thread
From: Lai Jiangshan @ 2023-04-03 14:06 UTC (permalink / raw)
To: linux-kernel
Cc: Lai Jiangshan, H. Peter Anvin, Andi Kleen, Andrew Cooper,
Andy Lutomirski, Asit Mallick, Cfir Cohen, Dan Williams,
Dave Hansen, David Kaplan, David Rientjes, Dirk Hohndel,
Erdem Aktas, Jan Kiszka, Jiri Slaby, Joerg Roedel, Juergen Gross,
Kees Cook, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
Linus Torvalds, Mike Stunes, Peter Zijlstra, Raj Ashok,
Sean Christopherson, Thomas Gleixner, Tom Lendacky, Tony Luck,
kvm, linux-coco, x86, Paolo Bonzini, Ingo Molnar, Borislav Petkov,
Dave Hansen, H. Peter Anvin
From: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Not for inclusion.
It is not part of the atomic-IST-entry patch.
Test it with a VM that has only one vCPU.
Launch the VM with the uncompressed vmlinux.
The test randomly injects the super exceptions (MCE, NMI, #DB).
The test raises the ratio of nested exceptions of different vectors.
The test re-enables nesting of the same vector very quickly.
When the CPU reaches a commit_ip, all the events are re-enabled.
usage:
parameters="$parameters exc_nmi_handler=0x$(readelf -W -s vmlinux | awk '/ exc_nmi$/ { print $2}')"
parameters="$parameters asm_exc_nmi_handler=0x$(readelf -W -s vmlinux | awk '/ asm_exc_nmi$/ { print $2}')"
parameters="$parameters asm_exc_db_handler=0x$(readelf -W -s vmlinux | awk '/ asm_exc_debug$/ { print $2}')"
parameters="$parameters asm_exc_mce_handler=0x$(readelf -W -s vmlinux | awk '/ asm_exc_machine_check$/ { print $2}')"
parameters="$parameters exc_db_kernel_handler=0x$(readelf -W -s vmlinux | awk '/ exc_debug$/ { print $2}')"
parameters="$parameters exc_db_user_handler=0x$(readelf -W -s vmlinux | awk '/ noist_exc_debug$/ { print $2}')"
parameters="$parameters exc_mce_kernel_handler=0x$(readelf -W -s vmlinux | awk '/ exc_machine_check$/ { print $2}')"
parameters="$parameters exc_mce_user_handler=0x$(readelf -W -s vmlinux | awk '/ noist_exc_machine_check$/ { print $2}')"
parameters="$parameters asm_exc_nmi_commit_ip=0x$(readelf -W -s vmlinux | awk '/ commit_asm_exc_nmi$/ { print $2}')"
parameters="$parameters asm_exc_db_commit_ip=0x$(readelf -W -s vmlinux | awk '/ commit_asm_exc_debug$/ { print $2}')"
parameters="$parameters asm_exc_mce_commit_ip=0x$(readelf -W -s vmlinux | awk '/ commit_asm_exc_machine_check$/ { print $2}')"
parameters="$parameters native_irq_return_iret_ip=0x$(readelf -W -s vmlinux | awk '/ native_irq_return_iret$/ { print $2}')"
parameters="$parameters ist_emul_test=1"
insmod arch/x86/kvm/kvm.ko
insmod arch/x86/kvm/kvm-intel.ko $parameters
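The per-symbol readelf/awk lookups above all follow one pattern and could be condensed into a small loop. A minimal sketch, assuming the usual `readelf -W -s` symbol-table column layout; the `sym_addr` helper, the `symtab` fragment, and its addresses are illustrative, not taken from a real vmlinux:

```shell
# Hypothetical helper: print the address column of an exactly-matching
# symbol from "readelf -W -s"-style output passed in as a string.
sym_addr() {
    # $1 = symbol table text, $2 = symbol name
    printf '%s\n' "$1" | awk -v sym="$2" '$NF == sym { print $2 }'
}

# Faked symbol-table fragment (made-up addresses) standing in for
# "$(readelf -W -s vmlinux)":
symtab='    12: ffffffff81a00000   512 FUNC GLOBAL DEFAULT 1 asm_exc_nmi
    13: ffffffff81a00400   256 FUNC GLOBAL DEFAULT 1 exc_nmi'

# Build the "<symbol>_handler=0x<addr>" parameter string in a loop.
params=""
for s in asm_exc_nmi exc_nmi; do
    params="$params ${s}_handler=0x$(sym_addr "$symtab" "$s")"
done
echo "$params"
```

A real invocation would substitute `$(readelf -W -s vmlinux)` for the faked `$symtab` and list all of the handler and commit_ip symbols from the usage above (the `_commit_ip`/`_ip` parameter names would need the same per-symbol suffix handling as the original script).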
Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
---
arch/x86/kvm/vmx/vmx.c | 441 ++++++++++++++++++++++++++++++++++++++++-
arch/x86/kvm/x86.c | 10 +-
2 files changed, 447 insertions(+), 4 deletions(-)
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index bcac3efcde41..6f0d574a137a 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -1252,6 +1252,441 @@ void vmx_set_host_fs_gs(struct vmcs_host_state *host, u16 fs_sel, u16 gs_sel,
}
}
+#include <linux/prandom.h>
+
+// readelf -W -s vmlinux | awk '/ asm_exc_double_fault$/ { print $2}'
+static unsigned long asm_exc_df_handler;
+module_param(asm_exc_df_handler, ulong, S_IRUGO);
+
+// readelf -W -s vmlinux | awk '/ asm_exc_nmi$/ { print $2}'
+static unsigned long asm_exc_nmi_handler;
+module_param(asm_exc_nmi_handler, ulong, S_IRUGO);
+
+// readelf -W -s vmlinux | awk '/ asm_exc_debug$/ { print $2}'
+static unsigned long asm_exc_db_handler;
+module_param(asm_exc_db_handler, ulong, S_IRUGO);
+
+// readelf -W -s vmlinux | awk '/ asm_exc_machine_check$/ { print $2}'
+static unsigned long asm_exc_mce_handler;
+module_param(asm_exc_mce_handler, ulong, S_IRUGO);
+
+// readelf -W -s vmlinux | awk '/ exc_nmi$/ { print $2}'
+static unsigned long exc_nmi_handler;
+module_param(exc_nmi_handler, ulong, S_IRUGO);
+
+// readelf -W -s vmlinux | awk '/ exc_debug$/ { print $2}'
+static unsigned long exc_db_kernel_handler;
+module_param(exc_db_kernel_handler, ulong, S_IRUGO);
+
+// readelf -W -s vmlinux | awk '/ noist_exc_debug$/ { print $2}'
+static unsigned long exc_db_user_handler;
+module_param(exc_db_user_handler, ulong, S_IRUGO);
+
+// readelf -W -s vmlinux | awk '/ exc_machine_check$/ { print $2}'
+static unsigned long exc_mce_kernel_handler;
+module_param(exc_mce_kernel_handler, ulong, S_IRUGO);
+
+// readelf -W -s vmlinux | awk '/ noist_exc_machine_check$/ { print $2}'
+static unsigned long exc_mce_user_handler;
+module_param(exc_mce_user_handler, ulong, S_IRUGO);
+
+// readelf -W -s vmlinux | awk '/ commit_asm_exc_nmi$/ { print $2}'
+static unsigned long asm_exc_nmi_commit_ip;
+module_param(asm_exc_nmi_commit_ip, ulong, S_IRUGO);
+
+// readelf -W -s vmlinux | awk '/ commit_asm_exc_debug$/ { print $2}'
+static unsigned long asm_exc_db_commit_ip;
+module_param(asm_exc_db_commit_ip, ulong, S_IRUGO);
+
+// readelf -W -s vmlinux | awk '/ commit_asm_exc_machine_check$/ { print $2}'
+static unsigned long asm_exc_mce_commit_ip;
+module_param(asm_exc_mce_commit_ip, ulong, S_IRUGO);
+
+// readelf -W -s vmlinux | awk '/ native_irq_return_iret$/ { print $2}'
+static unsigned long native_irq_return_iret_ip;
+module_param(native_irq_return_iret_ip, ulong, S_IRUGO);
+
+#define MAX_NESTING 50
+#define IST_START_RATE 512 /* rate of queuing an IST per vm-exit if no IST queued */
+#define IST_NESTED_RATE 256 /* rate of nesting an IST per instruction during handling IST */
+
+static int instruction_count;
+static int ist_depth, ist_atomic_depth;
+static int ist_nmi_blocking, ist_db_blocking, ist_mce_blocking;
+static u8 pending_vector[MAX_NESTING];
+static unsigned long ist_interrupted_rip[MAX_NESTING];
+static unsigned long commit_ip;
+
+#define TEST_STOPPED 0
+#define TEST_REQUESTED 1
+#define TEST_ONGOING 2
+static int ist_emul_test = TEST_STOPPED;
+// set ist_emul_test = 1 (TEST_REQUESTED) when loading the module to enable the test
+// in a script which also sets all the above symbol addresses
+module_param(ist_emul_test, int, S_IRUGO);
+
+static bool test_too_early_nesting_bug;
+module_param(test_too_early_nesting_bug, bool, S_IRUGO);
+
+extern int ist_emul_test_singlestep;
+
+static bool ist_emul_injecting;
+static bool ist_emul_returning;
+
+static void print_pending_vectors(void)
+{
+ int i;
+
+ pr_cont("current vectors:");
+ for (i = 0; i < ist_depth; i++) {
+ switch (pending_vector[i]) {
+ case MC_VECTOR: pr_cont(" #MC"); break;
+ case DB_VECTOR: pr_cont(" #DB"); break;
+ case 2: pr_cont(" NMI"); break;
+ default: WARN_ON(1);
+ }
+ }
+ pr_cont(";\n");
+}
+
+#define DR7_EN(i, type, len) (((type | len) & 0xf) << (DR_CONTROL_SHIFT + i * DR_CONTROL_SIZE)) |\
+ (DR_GLOBAL_ENABLE << (i * DR_ENABLE_SIZE))
+
+#define DR7_X(i) DR7_EN(i, X86_BREAKPOINT_EXECUTE, X86_BREAKPOINT_LEN_X)
+
+static void watch_injecting(struct kvm_vcpu *vcpu)
+{
+ kvm_set_dr(vcpu, 0, asm_exc_nmi_handler);
+ kvm_set_dr(vcpu, 1, asm_exc_db_handler);
+ kvm_set_dr(vcpu, 2, asm_exc_mce_handler);
+ kvm_set_dr(vcpu, 3, asm_exc_df_handler);
+ kvm_set_dr(vcpu, 6, DR6_RESERVED);
+ kvm_set_dr(vcpu, 7, DR7_X(0) | DR7_X(1) | DR7_X(2) | DR7_X(3));
+}
+
+static void finish_watching_injecting(struct kvm_vcpu *vcpu)
+{
+ kvm_set_dr(vcpu, 0, 0);
+ kvm_set_dr(vcpu, 1, 0);
+ kvm_set_dr(vcpu, 2, 0);
+ kvm_set_dr(vcpu, 3, 0);
+ kvm_set_dr(vcpu, 6, DR6_RESERVED);
+ kvm_set_dr(vcpu, 7, 0);
+}
+
+static void watch_handler_done(struct kvm_vcpu *vcpu)
+{
+ kvm_set_dr(vcpu, 0, ist_interrupted_rip[ist_depth-1]);
+ kvm_set_dr(vcpu, 1, asm_exc_df_handler);
+ //kvm_set_dr(vcpu, 2, exc_nmi_handler);
+ //kvm_set_dr(vcpu, 3, native_irq_return_iret_ip);
+ kvm_set_dr(vcpu, 6, DR6_RESERVED);
+ kvm_set_dr(vcpu, 7, DR7_X(0) | DR7_X(1));
+ //kvm_set_dr(vcpu, 7, DR7_X(0) | DR7_X(1) | DR7_X(2) | DR7_X(3));
+}
+
+static void ist_emul_test_start_singlestep(struct kvm_vcpu *vcpu)
+{
+ ist_emul_test_singlestep = 1;
+}
+
+static void ist_emul_test_stop_singlestep(struct kvm_vcpu *vcpu)
+{
+ ist_emul_test_singlestep = 0;
+ watch_handler_done(vcpu);
+ finish_watching_injecting(vcpu);
+}
+
+static void ist_emul_test_stop(struct kvm_vcpu *vcpu)
+{
+ ist_emul_test = TEST_STOPPED;
+ ist_emul_test_stop_singlestep(vcpu);
+}
+
+static bool is_entering_ip(struct kvm_vcpu *vcpu)
+{
+ unsigned long rip = kvm_rip_read(vcpu);
+
+ return rip == asm_exc_mce_handler || rip == asm_exc_db_handler || rip == asm_exc_nmi_handler;
+}
+
+static void emul_ist_bug_in_singlestep(struct kvm_vcpu *vcpu)
+{
+ unsigned long rip = kvm_rip_read(vcpu);
+
+ if (WARN_ON_ONCE(!ist_depth)) {
+ kvm_make_request(KVM_REQ_TRIPLE_FAULT, vcpu);
+ return;
+ }
+
+ if (rip == asm_exc_df_handler || instruction_count > 1000) {
+ ist_emul_test_stop(vcpu);
+ return;
+ }
+}
+
+static unsigned long read_one(struct kvm_vcpu *vcpu, unsigned long addr)
+{
+ unsigned long ret;
+ struct x86_exception e;
+ if (kvm_read_guest_virt(vcpu, addr, &ret, sizeof(ret), &e) != X86EMUL_CONTINUE)
+ return 0xf000f123f456f789;
+ return ret;
+}
+
+static void emul_ist_kctx_handler_return(struct kvm_vcpu *vcpu)
+{
+ unsigned long rsp = kvm_rsp_read(vcpu);
+ unsigned long ret_rip;
+ struct x86_exception e;
+
+ if (kvm_read_guest_virt(vcpu, rsp, &ret_rip, sizeof(ret_rip), &e) != X86EMUL_CONTINUE)
+ kvm_make_request(KVM_REQ_TRIPLE_FAULT, vcpu);
+
+ /* It is a faked #MC/#DB/NMI; make the guest handle it as if the kernel-context handler were a NOP */
+ kvm_rsp_write(vcpu, rsp+8);
+ kvm_rip_write(vcpu, ret_rip);
+}
+
+static void emul_ist_kctx_handler_done(struct kvm_vcpu *vcpu)
+{
+ unsigned long rip = kvm_rip_read(vcpu);
+ int vector = -1;
+
+ if (rip == exc_nmi_handler)
+ vector = NMI_VECTOR;
+ if (rip == exc_db_kernel_handler || rip == exc_db_user_handler)
+ vector = DB_VECTOR;
+ if (rip == exc_mce_kernel_handler || rip == exc_mce_user_handler)
+ vector = MC_VECTOR;
+
+ if (vector > 0) {
+ if (WARN_ON(vector != pending_vector[ist_depth-1]))
+ kvm_make_request(KVM_REQ_TRIPLE_FAULT, vcpu);
+ pr_info("Enter the handler in kernel context; ");
+ print_pending_vectors();
+ emul_ist_kctx_handler_return(vcpu);
+ }
+}
+
+static void emul_ist_emul_enable_delivering(struct kvm_vcpu *vcpu)
+{
+ unsigned long rip = kvm_rip_read(vcpu);
+
+ if (rip == commit_ip) {
+ ist_atomic_depth = 0;
+ ist_nmi_blocking = 0;
+ ist_db_blocking = 0;
+ ist_mce_blocking = 0;
+ }
+}
+
+static bool test_nesting_bug(void)
+{
+ return test_too_early_nesting_bug && ((get_random_u32() % 1024) == 0);
+}
+
+static void __ist_emul_test_queue_one(struct kvm_vcpu *vcpu, int vector)
+{
+ if (!ist_atomic_depth)
+ ist_interrupted_rip[ist_depth] = kvm_rip_read(vcpu);
+ else
+ ist_interrupted_rip[ist_depth] = commit_ip;
+ pending_vector[ist_depth] = vector;
+ ist_depth++;
+ ist_atomic_depth++;
+ instruction_count = 0;
+ pr_info("queue an IST exception, depth:%d, atomic_depth:%d; ", ist_depth, ist_atomic_depth);
+ print_pending_vectors();
+
+ // Per the Intel SDM, X86_EFLAGS_TF is cleared after the exception is delivered.
+ // use hw breakpoint to force immediate exit
+ // and add X86_EFLAGS_TF after breakpoint hits
+ ist_emul_injecting = true;
+ watch_injecting(vcpu);
+ pr_info("interrupt ip: %lx return ip: %lx\n", kvm_rip_read(vcpu), ist_interrupted_rip[ist_depth-1]);
+}
+
+static void ist_emul_test_queue_one(struct kvm_vcpu *vcpu)
+{
+ u32 which = get_random_u32() % 3;
+
+ if (ist_depth >= MAX_NESTING)
+ return;
+
+ if (which <= 0 && (!ist_mce_blocking || test_nesting_bug())) {
+ kvm_queue_exception(vcpu, MC_VECTOR);
+ ist_mce_blocking = 1;
+ __ist_emul_test_queue_one(vcpu, MC_VECTOR);
+ commit_ip = asm_exc_mce_commit_ip;
+ return;
+ }
+
+ if (which <= 1 && (!ist_db_blocking || test_nesting_bug())) {
+ kvm_queue_exception_p(vcpu, DB_VECTOR, DR6_BS);
+ ist_db_blocking = 1;
+ __ist_emul_test_queue_one(vcpu, DB_VECTOR);
+ commit_ip = asm_exc_db_commit_ip;
+ return;
+ }
+
+ if (which <= 2 && (!ist_nmi_blocking || test_nesting_bug())) {
+ vmx_set_nmi_mask(vcpu, false);
+ kvm_inject_nmi(vcpu);
+ ist_nmi_blocking = 1;
+ __ist_emul_test_queue_one(vcpu, NMI_VECTOR);
+ commit_ip = asm_exc_nmi_commit_ip;
+ return;
+ }
+}
+
+static void __handle_ist_emul_test_singlestep(struct kvm_vcpu *vcpu)
+{
+ instruction_count++;
+ //pr_info("singlestep ip: %lx\n", kvm_rip_read(vcpu));
+ emul_ist_emul_enable_delivering(vcpu);
+ emul_ist_kctx_handler_done(vcpu);
+ emul_ist_bug_in_singlestep(vcpu);
+
+ /* try to nest an IST exception if the previous is not done */
+ if (!ist_emul_test_singlestep)
+ return;
+
+ // No RF, so the return rip needs to be watched.
+ // A #DB during a repeated string instruction will have RF set (sync_regs()).
+ if (vmx_get_rflags(vcpu) & X86_EFLAGS_RF)
+ return;
+
+ if (get_random_u32() % IST_NESTED_RATE)
+ return;
+
+ ist_emul_test_queue_one(vcpu);
+}
+
+static bool handle_ist_emul_test_singlestep(struct kvm_vcpu *vcpu, unsigned long dr6)
+{
+ unsigned long rip;
+
+ if (!ist_emul_test_singlestep)
+ return false;
+
+ //kvm_get_dr(vcpu, 6, &dr6);
+ //pr_info("singlestep rip %lx dr6 %lx\n", kvm_rip_read(vcpu), dr6);
+ //kvm_set_dr(vcpu, 6, DR6_RESERVED);
+ if (WARN_ON(!ist_emul_injecting && !ist_emul_returning && is_entering_ip(vcpu))) {
+ ist_emul_test_stop(vcpu);
+ return true;
+ }
+ if (ist_emul_injecting) {
+ //pr_info("xxx ip: %lx\n", kvm_rip_read(vcpu));
+ if (WARN_ON(!is_entering_ip(vcpu))) {
+ ist_emul_test_stop(vcpu);
+ return true;
+ }
+ ist_emul_injecting = false;
+ finish_watching_injecting(vcpu);
+ if (0)
+ read_one(vcpu, kvm_rsp_read(vcpu));
+ //pr_info("rip in ret %lx\n", read_one(vcpu, kvm_rsp_read(vcpu)));
+ //watch_handler_done(vcpu);
+ //if (vmx_get_rflags(vcpu) & X86_EFLAGS_TF)
+ // pr_info("xxxxxxxxxxxxxxxxxxxxxx\n");
+ vmx_set_rflags(vcpu, vmx_get_rflags(vcpu) | X86_EFLAGS_TF);
+ }
+ if (ist_emul_returning) {
+ rip = kvm_rip_read(vcpu);
+ if (rip != ist_interrupted_rip[ist_depth-1]) {
+ pr_info("rip %lx expect rip %lx\n", rip, ist_interrupted_rip[ist_depth-1]);
+ WARN_ON(1);
+ ist_emul_test_stop(vcpu);
+ return true;
+ }
+ ist_depth--;
+ pr_info("IST exception returned, depth:%d; ", ist_depth);
+ print_pending_vectors();
+
+ // Print instruction_count for adjusting IST_NESTED_RATE
+ // to control the average ist_depth for the test.
+ //pr_info("instruction_count %d\n", instruction_count);
+ instruction_count = 0;
+ ist_emul_returning = false;
+
+ if (!ist_depth) {
+ ist_emul_test_stop_singlestep(vcpu);
+ return true;
+ }
+ }
+
+
+ __handle_ist_emul_test_singlestep(vcpu);
+
+ rip = kvm_rip_read(vcpu);
+ if (!ist_emul_injecting && rip == native_irq_return_iret_ip) {
+ pr_info("IST exception is going to return\n");
+ //watch_handler_done(vcpu);
+ ist_emul_returning = true;
+ }
+ return true;
+}
+
+static void handle_ist_emul_test(struct kvm_vcpu *vcpu)
+{
+ //if (ist_emul_test_singlestep)
+ // pr_info("vmexit ip: %lx\n", kvm_rip_read(vcpu));
+
+ if (ist_emul_test == TEST_STOPPED)
+ return;
+
+ if (ist_emul_test == TEST_REQUESTED) {
+ /* The test can only be started after the guest's booting process is done. */
+ if (vmx_get_cpl(vcpu))
+ ist_emul_test = TEST_ONGOING;
+ else
+ return;
+ }
+
+ /* Don't start the test if it is already started or someone
+ * else is using X86_EFLAGS_TF. */
+ // No RF, so the return rip needs to be watched.
+ if (vmx_get_rflags(vcpu) & (X86_EFLAGS_TF | X86_EFLAGS_RF))
+ return;
+
+ if (ist_emul_test_singlestep) {
+ //kvm_make_request(KVM_REQ_TRIPLE_FAULT, vcpu);
+ return;
+ }
+
+ if (vmcs_read32(GUEST_INTERRUPTIBILITY_INFO) &
+ (GUEST_INTR_STATE_STI | GUEST_INTR_STATE_MOV_SS))
+ return;
+
+ if (vcpu->arch.nmi_injected || vcpu->arch.interrupt.injected ||
+ vcpu->arch.exception.pending || vcpu->arch.exception.injected)
+ return;
+
+ pr_info_once("%lx %lx %lx %lx %lx %lx\n", vcpu->arch.eff_db[0], vcpu->arch.eff_db[1],
+ vcpu->arch.eff_db[2], vcpu->arch.eff_db[3], vcpu->arch.dr6, vcpu->arch.dr7);
+ if (vcpu->arch.eff_db[0] || vcpu->arch.eff_db[1] ||
+ vcpu->arch.eff_db[2] || vcpu->arch.eff_db[3] ||
+ ((vcpu->arch.dr6 | DR6_RESERVED) != DR6_RESERVED) ||
+ ((vcpu->arch.dr7 | DR7_FIXED_1) != DR7_FIXED_1))
+ return;
+
+ if (WARN_ON_ONCE(ist_depth)) {
+ kvm_make_request(KVM_REQ_TRIPLE_FAULT, vcpu);
+ return;
+ }
+
+ /* XXX: Don't start the test when the guest is handling a real #MC, #DB or NMI.
+ * But how to detect that? Just skip the detection since this is a test. */
+
+ if (get_random_u32() % IST_START_RATE)
+ return;
+
+ ist_emul_test_start_singlestep(vcpu);
+ ist_emul_test_queue_one(vcpu);
+}
+
void vmx_prepare_switch_to_guest(struct kvm_vcpu *vcpu)
{
struct vcpu_vmx *vmx = to_vmx(vcpu);
@@ -5258,6 +5693,8 @@ static int handle_exception_nmi(struct kvm_vcpu *vcpu)
switch (ex_no) {
case DB_VECTOR:
dr6 = vmx_get_exit_qual(vcpu);
+ if (handle_ist_emul_test_singlestep(vcpu, dr6))
+ return 1;
if (!(vcpu->guest_debug &
(KVM_GUESTDBG_SINGLESTEP | KVM_GUESTDBG_USE_HW_BP))) {
/*
@@ -6579,6 +7016,8 @@ static int vmx_handle_exit(struct kvm_vcpu *vcpu, fastpath_t exit_fastpath)
{
int ret = __vmx_handle_exit(vcpu, exit_fastpath);
+ if (ret > 0)
+ handle_ist_emul_test(vcpu);
/*
* Exit to user space when bus lock detected to inform that there is
* a bus lock in guest.
@@ -7290,7 +7729,7 @@ static fastpath_t vmx_vcpu_run(struct kvm_vcpu *vcpu)
* vmentry fails as it then expects bit 14 (BS) in pending debug
* exceptions being set, but that's not correct for the guest debugging
* case. */
- if (vcpu->guest_debug & KVM_GUESTDBG_SINGLESTEP)
+ if (ist_emul_test_singlestep || (vcpu->guest_debug & KVM_GUESTDBG_SINGLESTEP))
vmx_set_interrupt_shadow(vcpu, 0);
kvm_load_guest_xsave_state(vcpu);
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 7713420abab0..205a4d989046 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -814,6 +814,7 @@ void kvm_inject_nmi(struct kvm_vcpu *vcpu)
atomic_inc(&vcpu->arch.nmi_queued);
kvm_make_request(KVM_REQ_NMI, vcpu);
}
+EXPORT_SYMBOL_GPL(kvm_inject_nmi);
void kvm_queue_exception_e(struct kvm_vcpu *vcpu, unsigned nr, u32 error_code)
{
@@ -12810,12 +12811,15 @@ bool kvm_is_linear_rip(struct kvm_vcpu *vcpu, unsigned long linear_rip)
}
EXPORT_SYMBOL_GPL(kvm_is_linear_rip);
+int ist_emul_test_singlestep;
+EXPORT_SYMBOL_GPL(ist_emul_test_singlestep);
+
unsigned long kvm_get_rflags(struct kvm_vcpu *vcpu)
{
unsigned long rflags;
rflags = static_call(kvm_x86_get_rflags)(vcpu);
- if (vcpu->guest_debug & KVM_GUESTDBG_SINGLESTEP)
+ if (ist_emul_test_singlestep || (vcpu->guest_debug & KVM_GUESTDBG_SINGLESTEP))
rflags &= ~X86_EFLAGS_TF;
return rflags;
}
@@ -12823,8 +12827,8 @@ EXPORT_SYMBOL_GPL(kvm_get_rflags);
static void __kvm_set_rflags(struct kvm_vcpu *vcpu, unsigned long rflags)
{
- if (vcpu->guest_debug & KVM_GUESTDBG_SINGLESTEP &&
- kvm_is_linear_rip(vcpu, vcpu->arch.singlestep_rip))
+ if (ist_emul_test_singlestep || ((vcpu->guest_debug & KVM_GUESTDBG_SINGLESTEP) &&
+ kvm_is_linear_rip(vcpu, vcpu->arch.singlestep_rip)))
rflags |= X86_EFLAGS_TF;
static_call(kvm_x86_set_rflags)(vcpu, rflags);
}
--
2.19.1.6.gb485710b
^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [RFC PATCH 0/7] x86/entry: Atomic statck switching for IST
2023-04-03 14:05 [RFC PATCH 0/7] x86/entry: Atomic statck switching for IST Lai Jiangshan
` (6 preceding siblings ...)
2023-04-03 14:06 ` [RFC PATCH 7/7] x86/entry: Test atomic-IST-entry via KVM Lai Jiangshan
@ 2023-04-03 14:23 ` Dave Hansen
2023-04-03 16:32 ` Lai Jiangshan
2023-04-03 16:53 ` Dave Hansen
8 siblings, 1 reply; 25+ messages in thread
From: Dave Hansen @ 2023-04-03 14:23 UTC (permalink / raw)
To: Lai Jiangshan, linux-kernel
Cc: Lai Jiangshan, H. Peter Anvin, Andi Kleen, Andrew Cooper,
Andy Lutomirski, Asit Mallick, Cfir Cohen, Dan Williams,
David Kaplan, David Rientjes, Dirk Hohndel, Erdem Aktas,
Jan Kiszka, Jiri Slaby, Joerg Roedel, Juergen Gross, Kees Cook,
Kirill Shutemov, Kuppuswamy Sathyanarayanan, Linus Torvalds,
Mike Stunes, Peter Zijlstra, Raj Ashok, Sean Christopherson,
Thomas Gleixner, Tom Lendacky, Tony Luck, kvm, linux-coco, x86
On 4/3/23 07:05, Lai Jiangshan wrote:
> 2.3 #VE
> -------
>
> The approach for fixing the kernel mode #VE recursion issue is to just
> NOT use IST for #VE although #VE is also considered to be one of the
> super exceptions and had raised some worries:
> https://lore.kernel.org/lkml/YCEQiDNSHTGBXBcj@hirez.programming.kicks-ass.net/
> https://lore.kernel.org/lkml/CALCETrU9XypKbj-TrXLB3CPW6=MZ__5ifLz0ckbB=c=Myegn9Q@mail.gmail.com/
> https://lore.kernel.org/lkml/1843debc-05e8-4d10-73e4-7ddce3b3eae2@intel.com/
>
> To remit the worries, SEPT_VE_DISABLE is forced used currently and
> also disables its abilities (accept-on-demand or memory balloon which
> is critical to lightweight VMs like Kata Containers):
> https://lore.kernel.org/lkml/YCb0%2FDg28uI7TRD%2F@google.com/
You don't need #VE for accept-on-demand. Pages go through _very_
well-defined software choke points before they get used *and* before
they get ballooned. Thus:
> https://lore.kernel.org/lkml/20230330114956.20342-3-kirill.shutemov@linux.intel.com/
BTW, _who_ considers #VE to be a "super exception"? Can you explain how
it is any more "super" than #PF? #PF can recurse. You can take #PF in
the entry paths.
I kinda don't think you should be using TDX and #VE as part of the
justification for this series.
^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC PATCH 0/7] x86/entry: Atomic statck switching for IST
2023-04-03 14:23 ` [RFC PATCH 0/7] x86/entry: Atomic statck switching for IST Dave Hansen
@ 2023-04-03 16:32 ` Lai Jiangshan
0 siblings, 0 replies; 25+ messages in thread
From: Lai Jiangshan @ 2023-04-03 16:32 UTC (permalink / raw)
To: Dave Hansen
Cc: linux-kernel, Lai Jiangshan, H. Peter Anvin, Andi Kleen,
Andrew Cooper, Andy Lutomirski, Asit Mallick, Cfir Cohen,
Dan Williams, David Kaplan, David Rientjes, Erdem Aktas,
Jan Kiszka, Jiri Slaby, Joerg Roedel, Juergen Gross, Kees Cook,
Kirill Shutemov, Kuppuswamy Sathyanarayanan, Linus Torvalds,
Mike Stunes, Peter Zijlstra, Raj Ashok, Sean Christopherson,
Thomas Gleixner, Tom Lendacky, Tony Luck, kvm, linux-coco, x86
On Mon, Apr 3, 2023 at 10:23 PM Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 4/3/23 07:05, Lai Jiangshan wrote:
> > 2.3 #VE
> > -------
> >
> > The approach for fixing the kernel mode #VE recursion issue is to just
> > NOT use IST for #VE although #VE is also considered to be one of the
> > super exceptions and had raised some worries:
> > https://lore.kernel.org/lkml/YCEQiDNSHTGBXBcj@hirez.programming.kicks-ass.net/
> > https://lore.kernel.org/lkml/CALCETrU9XypKbj-TrXLB3CPW6=MZ__5ifLz0ckbB=c=Myegn9Q@mail.gmail.com/
> > https://lore.kernel.org/lkml/1843debc-05e8-4d10-73e4-7ddce3b3eae2@intel.com/
> >
> > To remit the worries, SEPT_VE_DISABLE is forced used currently and
> > also disables its abilities (accept-on-demand or memory balloon which
> > is critical to lightweight VMs like Kata Containers):
> > https://lore.kernel.org/lkml/YCb0%2FDg28uI7TRD%2F@google.com/
>
> You don't need #VE for accept-on-demand. Pages go through _very_
> well-defined software choke points before they get used *and* before
> they get ballooned. Thus:
>
> > https://lore.kernel.org/lkml/20230330114956.20342-3-kirill.shutemov@linux.intel.com/
>
Thanks for the information.
I will have a look to see how it supports memory balloons.
And if accept-on-demand were supported, do we still need this
CONFIG_UNACCEPTED_MEMORY?
> BTW, _who_ considers #VE to be a "super exception"? Can you explain how
> it is any more "super" than #PF? #PF can recurse. You can take #PF in
> the entry paths.
>
> I kinda don't think you should be using TDX and #VE as part of the
> justification for this series.
You are right, #VE is not a super exception anymore since SEPT_VE_DISABLE
is forced set in the Linux kernel and it is nothing to do with this series.
But #VE was once thought to be a super exception (I will correct the
sentence in the cover letter), so it is worth mentioning it.
And since SEPT_VE_DISABLE is configurable, it would allow some paranoids
to have a try with SEPT_VE_DISABLE=false even without FRED.
The paranoids can try it with IST #VE.
Thanks
Lai
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [RFC PATCH 0/7] x86/entry: Atomic statck switching for IST
2023-04-03 14:05 [RFC PATCH 0/7] x86/entry: Atomic statck switching for IST Lai Jiangshan
` (7 preceding siblings ...)
2023-04-03 14:23 ` [RFC PATCH 0/7] x86/entry: Atomic statck switching for IST Dave Hansen
@ 2023-04-03 16:53 ` Dave Hansen
2023-04-04 3:17 ` Lai Jiangshan
8 siblings, 1 reply; 25+ messages in thread
From: Dave Hansen @ 2023-04-03 16:53 UTC (permalink / raw)
To: Lai Jiangshan, linux-kernel
Cc: Lai Jiangshan, H. Peter Anvin, Andi Kleen, Andrew Cooper,
Andy Lutomirski, Asit Mallick, Cfir Cohen, Dan Williams,
David Kaplan, David Rientjes, Erdem Aktas, Jan Kiszka, Jiri Slaby,
Joerg Roedel, Juergen Gross, Kees Cook, Kirill Shutemov,
Kuppuswamy Sathyanarayanan, Linus Torvalds, Mike Stunes,
Peter Zijlstra, Raj Ashok, Sean Christopherson, Thomas Gleixner,
Tom Lendacky, Tony Luck, kvm, linux-coco, x86
On 4/3/23 07:05, Lai Jiangshan wrote:
> Documentation/x86/kernel-stacks.rst | 2 +
> arch/x86/entry/Makefile | 3 +
> arch/x86/entry/entry_64.S | 615 +++++++-------------------
> arch/x86/entry/ist_entry.c | 299 +++++++++++++
> arch/x86/include/asm/cpu_entry_area.h | 8 +-
> arch/x86/include/asm/idtentry.h | 15 +-
> arch/x86/include/asm/sev.h | 14 -
> arch/x86/include/asm/traps.h | 1 -
> arch/x86/kernel/asm-offsets_64.c | 7 +
> arch/x86/kernel/callthunks.c | 2 +
> arch/x86/kernel/dumpstack_64.c | 6 +-
> arch/x86/kernel/nmi.c | 8 -
> arch/x86/kernel/sev.c | 108 -----
> arch/x86/kernel/traps.c | 43 --
> arch/x86/kvm/vmx/vmx.c | 441 +++++++++++++++++-
> arch/x86/kvm/x86.c | 10 +-
> arch/x86/mm/cpu_entry_area.c | 2 +-
> tools/objtool/check.c | 7 +-
> 18 files changed, 937 insertions(+), 654 deletions(-)
> create mode 100644 arch/x86/entry/ist_entry.c
Some high-level comments...
The diffstat looks a lot nastier because of the testing patch. If you
drop that patch and trim the diffstat a bit, it starts to look a _lot_ more
appealing:
> arch/x86/entry/entry_64.S | 615 ++++++++----------------------------
> arch/x86/entry/ist_entry.c | 299 +++++++++++++++++
> arch/x86/kernel/sev.c | 108 ------
> arch/x86/kernel/traps.c | 43 --
...
> 16 files changed, 490 insertions(+), 650 deletions(-)
It removes more code than it adds and also removes a bunch of assembly.
If it were me posting this, I'd have shouted that from the rooftops
instead of obscuring it with a testing patch and leaving it as an
exercise to the reader to figure out.
It also comes with testing code and a great cover letter, which are rare
and _spectacular_.
That said, I'm torn. This series makes a valiant attempt to improve one
of the creakiest corners of arch/x86/. But, there are precious few
reviewers that can _really_ dig into this stuff. This is also a lot
less valuable now than it would have been when FRED was not on the horizon.
It's also not incremental at all. It's going to be a big pain when
someone bisects their random system hang to patch 4/7. It'll mean
there's a bug buried somewhere in the 500 lines of patch 3/7.
Grumbling aside, there's too much potential upside here to ignore. As
this moves out of RFC territory, it would be great if you could:
* Look at the FRED series [1] and determine how much this collides
* Dig up some reviewers
* Flesh out the testing story. What kind of hardware was this tested
on? How much was bare metal vs. VMs, etc...?
* Reinforce what the benefits to end users are here. Are folks
_actually_ running into the #VC fragility, for instance?
Also, let's say we queue this, it starts getting linux-next testing, and
we start getting bug reports of hangs. It'll have to get reverted if we
can't find the bug fast.
How much of a pain would it be to make atomic-IST-entry _temporarily_
selectable at boot time? It would obviously need to keep the old code
around and would not have the nice diffstat. But that way, folks would
at least have a workaround while we're bug hunting.
1. https://lore.kernel.org/all/20230327075838.5403-1-xin3.li@intel.com/
^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC PATCH 0/7] x86/entry: Atomic statck switching for IST
2023-04-03 16:53 ` Dave Hansen
@ 2023-04-04 3:17 ` Lai Jiangshan
2023-04-04 17:03 ` Paolo Bonzini
0 siblings, 1 reply; 25+ messages in thread
From: Lai Jiangshan @ 2023-04-04 3:17 UTC (permalink / raw)
To: Dave Hansen
Cc: linux-kernel, Lai Jiangshan, H. Peter Anvin, Andi Kleen,
Andrew Cooper, Andy Lutomirski, Asit Mallick, Cfir Cohen,
Dan Williams, David Kaplan, David Rientjes, Erdem Aktas,
Jan Kiszka, Jiri Slaby, Joerg Roedel, Juergen Gross, Kees Cook,
Kirill Shutemov, Kuppuswamy Sathyanarayanan, Linus Torvalds,
Mike Stunes, Peter Zijlstra, Raj Ashok, Sean Christopherson,
Thomas Gleixner, Tom Lendacky, Tony Luck, kvm, linux-coco, x86
On Tue, Apr 4, 2023 at 12:53 AM Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 4/3/23 07:05, Lai Jiangshan wrote:
> > Documentation/x86/kernel-stacks.rst | 2 +
> > arch/x86/entry/Makefile | 3 +
> > arch/x86/entry/entry_64.S | 615 +++++++-------------------
> > arch/x86/entry/ist_entry.c | 299 +++++++++++++
> > arch/x86/include/asm/cpu_entry_area.h | 8 +-
> > arch/x86/include/asm/idtentry.h | 15 +-
> > arch/x86/include/asm/sev.h | 14 -
> > arch/x86/include/asm/traps.h | 1 -
> > arch/x86/kernel/asm-offsets_64.c | 7 +
> > arch/x86/kernel/callthunks.c | 2 +
> > arch/x86/kernel/dumpstack_64.c | 6 +-
> > arch/x86/kernel/nmi.c | 8 -
> > arch/x86/kernel/sev.c | 108 -----
> > arch/x86/kernel/traps.c | 43 --
> > arch/x86/kvm/vmx/vmx.c | 441 +++++++++++++++++-
> > arch/x86/kvm/x86.c | 10 +-
> > arch/x86/mm/cpu_entry_area.c | 2 +-
> > tools/objtool/check.c | 7 +-
> > 18 files changed, 937 insertions(+), 654 deletions(-)
> > create mode 100644 arch/x86/entry/ist_entry.c
>
> Some high-level comments...
>
> The diffstat looks a lot nastier because of the testing patch. If you
> that patch and trim the diffstat a bit, it starts to look a _lot_ more
> appealing:
>
> > arch/x86/entry/entry_64.S | 615 ++++++++----------------------------
> > arch/x86/entry/ist_entry.c | 299 +++++++++++++++++
> > arch/x86/kernel/sev.c | 108 ------
> > arch/x86/kernel/traps.c | 43 --
> ...
> > 16 files changed, 490 insertions(+), 650 deletions(-)
>
> It removes more code than it adds and also removes a bunch of assembly.
> If it were me posting this, I'd have shouted that from the rooftops
> instead of obscuring it with a testing patch and leaving it as an
> exercise to the reader to figure out.
The cover letter has 800+ lines of comments. About 100-300 lines
of comments will be moved into the code which would make the diffstat
not so appealing.
P.S.
One of the reasons I didn't move them is that the comments are much more
complicated than the code, which is a sign that improvement is required.
I'm going to change the narration from save-touch-replicate-copy-commit
to save-copy-build-commit and avoid using "replicate".
copy=copy_outmost(), build=build_interrupted(), the new narration
will change the comments mainly, and it will not change the actual
work. If the new narration is not simpler, I will not send it out to
add any confusion.
> * Flesh out the testing story. What kind of hardware was this tested
> on? How much was bare metal vs. VMs, etc...?
It is tested on both bare metal and VMs. It is hard to stress the
atomic-IST-entry path on bare metal. The testing code tests it in a VM,
where the super exceptions can be injected at will.
> * Reinforce what the benefits to end users are here. Are folks
> _actually_ running into the #VC fragility, for instance?
>
> Also, let's say we queue this, it starts getting linux-next testing, and
> we start getting bug reports of hangs. It'll have to get reverted if we
> can't find the bug fast.
>
> How much of a pain would it be to make atomic-IST-entry _temporarily_
> selectable at boot time? It would obviously need to keep the old code
> around and would not have the nice diffstat. But that way, folks would
> at least have a workaround while we're bug hunting.
It is easy to make atomic-IST-entry selectable at boot time since IDT
is an indirect table. I will do it and temporarily keep the old code.
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [RFC PATCH 0/7] x86/entry: Atomic statck switching for IST
2023-04-04 3:17 ` Lai Jiangshan
@ 2023-04-04 17:03 ` Paolo Bonzini
2023-04-06 10:12 ` Peter Zijlstra
0 siblings, 1 reply; 25+ messages in thread
From: Paolo Bonzini @ 2023-04-04 17:03 UTC (permalink / raw)
To: Lai Jiangshan, Dave Hansen
Cc: linux-kernel, Lai Jiangshan, H. Peter Anvin, Andi Kleen,
Andrew Cooper, Andy Lutomirski, Asit Mallick, Cfir Cohen,
Dan Williams, David Kaplan, David Rientjes, Erdem Aktas,
Jan Kiszka, Jiri Slaby, Joerg Roedel, Juergen Gross, Kees Cook,
Kirill Shutemov, Kuppuswamy Sathyanarayanan, Linus Torvalds,
Mike Stunes, Peter Zijlstra, Raj Ashok, Sean Christopherson,
Thomas Gleixner, Tom Lendacky, Tony Luck, kvm, linux-coco, x86
On 4/4/23 05:17, Lai Jiangshan wrote:
> The cover letter has 800+ lines of comments. About 100-300 lines
> of comments will be moved into the code which would make the diffstat
> not so appealing.
Removing assembly from arch/x86/entry/ and adding English to
Documentation/? That's _even more_ appealing. :)
Paolo
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [RFC PATCH 0/7] x86/entry: Atomic statck switching for IST
2023-04-04 17:03 ` Paolo Bonzini
@ 2023-04-06 10:12 ` Peter Zijlstra
2023-04-06 10:35 ` Jiri Slaby
0 siblings, 1 reply; 25+ messages in thread
From: Peter Zijlstra @ 2023-04-06 10:12 UTC (permalink / raw)
To: Paolo Bonzini
Cc: Lai Jiangshan, Dave Hansen, linux-kernel, Lai Jiangshan,
H. Peter Anvin, Andi Kleen, Andrew Cooper, Andy Lutomirski,
Asit Mallick, Cfir Cohen, Dan Williams, David Kaplan,
David Rientjes, Erdem Aktas, Jan Kiszka, Jiri Slaby, Joerg Roedel,
Juergen Gross, Kees Cook, Kirill Shutemov,
Kuppuswamy Sathyanarayanan, Linus Torvalds, Mike Stunes,
Raj Ashok, Sean Christopherson, Thomas Gleixner, Tom Lendacky,
Tony Luck, kvm, linux-coco, x86
On Tue, Apr 04, 2023 at 07:03:45PM +0200, Paolo Bonzini wrote:
> On 4/4/23 05:17, Lai Jiangshan wrote:
> > The cover letter has 800+ lines of comments. About 100-300 lines
> > of comments will be moved into the code which would make the diffstat
> > not so appealing.
>
> Removing assembly from arch/x86/entry/ and adding English to Documentation/?
> That's _even more_ appealing. :)
I *much* prefer in-code comments to random gibberish that's instantly
out of date squirreled away somewhere in an unreadable format in
Documentation/
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [RFC PATCH 0/7] x86/entry: Atomic statck switching for IST
2023-04-06 10:12 ` Peter Zijlstra
@ 2023-04-06 10:35 ` Jiri Slaby
2023-04-06 10:47 ` Peter Zijlstra
0 siblings, 1 reply; 25+ messages in thread
From: Jiri Slaby @ 2023-04-06 10:35 UTC (permalink / raw)
To: Peter Zijlstra, Paolo Bonzini
Cc: Lai Jiangshan, Dave Hansen, linux-kernel, Lai Jiangshan,
H. Peter Anvin, Andi Kleen, Andrew Cooper, Andy Lutomirski,
Asit Mallick, Cfir Cohen, Dan Williams, David Kaplan,
David Rientjes, Erdem Aktas, Jan Kiszka, Joerg Roedel,
Juergen Gross, Kees Cook, Kirill Shutemov,
Kuppuswamy Sathyanarayanan, Linus Torvalds, Mike Stunes,
Raj Ashok, Sean Christopherson, Thomas Gleixner, Tom Lendacky,
Tony Luck, kvm, linux-coco, x86
On 06. 04. 23, 12:12, Peter Zijlstra wrote:
> On Tue, Apr 04, 2023 at 07:03:45PM +0200, Paolo Bonzini wrote:
>> On 4/4/23 05:17, Lai Jiangshan wrote:
>>> The cover letter has 800+ lines of comments. About 100-300 lines
>>> of comments will be moved into the code which would make the diffstat
>>> not so appealing.
>>
>> Removing assembly from arch/x86/entry/ and adding English to Documentation/?
>> That's _even more_ appealing. :)
>
> I *much* prefer in-code comments to random gibberish that's instantly
> out of date squirreled away somewhere in an unreadable format in
> Documentation/
+1 as one can link comments in the code to Documentation easily
nowadays. They are sourced and end up in the generated Documentation [1]
then. One only needs to type the kernel-doc properly.
[1] https://www.kernel.org/doc/html/latest
--
js
suse labs
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [RFC PATCH 0/7] x86/entry: Atomic statck switching for IST
2023-04-06 10:35 ` Jiri Slaby
@ 2023-04-06 10:47 ` Peter Zijlstra
2023-04-06 11:04 ` Jiri Slaby
0 siblings, 1 reply; 25+ messages in thread
From: Peter Zijlstra @ 2023-04-06 10:47 UTC (permalink / raw)
To: Jiri Slaby
Cc: Paolo Bonzini, Lai Jiangshan, Dave Hansen, linux-kernel,
Lai Jiangshan, H. Peter Anvin, Andi Kleen, Andrew Cooper,
Andy Lutomirski, Asit Mallick, Cfir Cohen, Dan Williams,
David Kaplan, David Rientjes, Erdem Aktas, Jan Kiszka,
Joerg Roedel, Juergen Gross, Kees Cook, Kirill Shutemov,
Kuppuswamy Sathyanarayanan, Linus Torvalds, Mike Stunes,
Raj Ashok, Sean Christopherson, Thomas Gleixner, Tom Lendacky,
Tony Luck, kvm, linux-coco, x86
On Thu, Apr 06, 2023 at 12:35:24PM +0200, Jiri Slaby wrote:
> On 06. 04. 23, 12:12, Peter Zijlstra wrote:
> > On Tue, Apr 04, 2023 at 07:03:45PM +0200, Paolo Bonzini wrote:
> > > On 4/4/23 05:17, Lai Jiangshan wrote:
> > > > The cover letter has 800+ lines of comments. About 100-300 lines
> > > > of comments will be moved into the code which would make the diffstat
> > > > not so appealing.
> > >
> > > Removing assembly from arch/x86/entry/ and adding English to Documentation/?
> > > That's _even more_ appealing. :)
> >
> > I *much* prefer in-code comments to random gibberish that's instantly
> > out of date squirreled away somewhere in an unreadable format in
> > Documentation/
>
> +1 as one can link comments in the code to Documentation easily nowadays.
> They are sourced and end up in the generated Documentation [1] then. One
> only needs to type the kernel-doc properly.
Urgh, so that kernel doc stuff can defeat its purpose too. Some of that
is so heavily formatted it is unreadable gibberish just like
Documentation/ :/
I really detest that whole RST thing, and my solution is to explicitly
not write kerneldoc, that way the doc generation stuff doesn't complain
and I don't get random drive by patches wrecking the perfectly readable
comment.
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [RFC PATCH 0/7] x86/entry: Atomic statck switching for IST
2023-04-06 10:47 ` Peter Zijlstra
@ 2023-04-06 11:04 ` Jiri Slaby
2023-04-06 12:37 ` Peter Zijlstra
0 siblings, 1 reply; 25+ messages in thread
From: Jiri Slaby @ 2023-04-06 11:04 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Paolo Bonzini, Lai Jiangshan, Dave Hansen, linux-kernel,
Lai Jiangshan, H. Peter Anvin, Andi Kleen, Andrew Cooper,
Andy Lutomirski, Asit Mallick, Cfir Cohen, Dan Williams,
David Kaplan, David Rientjes, Erdem Aktas, Jan Kiszka,
Joerg Roedel, Juergen Gross, Kees Cook, Kirill Shutemov,
Kuppuswamy Sathyanarayanan, Linus Torvalds, Mike Stunes,
Raj Ashok, Sean Christopherson, Thomas Gleixner, Tom Lendacky,
Tony Luck, kvm, linux-coco, x86
On 06. 04. 23, 12:47, Peter Zijlstra wrote:
> On Thu, Apr 06, 2023 at 12:35:24PM +0200, Jiri Slaby wrote:
>> On 06. 04. 23, 12:12, Peter Zijlstra wrote:
>>> On Tue, Apr 04, 2023 at 07:03:45PM +0200, Paolo Bonzini wrote:
>>>> On 4/4/23 05:17, Lai Jiangshan wrote:
>>>>> The cover letter has 800+ lines of comments. About 100-300 lines
>>>>> of comments will be moved into the code which would make the diffstat
>>>>> not so appealing.
>>>>
>>>> Removing assembly from arch/x86/entry/ and adding English to Documentation/?
>>>> That's _even more_ appealing. :)
>>>
>>> I *much* prefer in-code comments to random gibberish that's instantly
>>> out of date squirreled away somewhere in an unreadable format in
>>> Documentation/
>>
>> +1 as one can link comments in the code to Documentation easily nowadays.
>> They are sourced and end up in the generated Documentation [1] then. One
>> only needs to type the kernel-doc properly.
>
> Urgh, so that kernel doc stuff can defeat its purpose too. Some of that
> is so heavily formatted it is unreadable gibberish just like
> Documentation/ :/
Definitely it _can_ defeat the purpose and be heavily formatted. But it
doesn't have to. It's like programming in Perl.
What I had in mind was e.g. "DOC: TTY Struct Flags":
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/include/linux/tty.h#n261
Resulting in:
https://www.kernel.org/doc/html/latest/driver-api/tty/tty_struct.html#tty-struct-flags
Both the source and the result are quite readable, IMO. And the markup
in the source is not mandatory; it's only for emphasis and hyperlinks.
As I wrote, you can link the comment in the code. But definitely you
don't have to, if you don't want. I like the linking in Documentation as
I can put the pieces from various sources/headers together to one place
and build a bigger picture.
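For concreteness, the DOC: convention referred to above is just a structured C comment next to the definitions it describes; a minimal hypothetical example (the flag names here are made up, not the real tty ones) might look like:

```c
/**
 * DOC: Example Status Flags
 *
 * EXAMPLE_THROTTLED
 *	Input is throttled; the consumer should unthrottle when it is
 *	ready to process more data.
 *
 * EXAMPLE_IO_ERROR
 *	A hardware I/O error occurred; the device should be closed.
 */
#define EXAMPLE_THROTTLED	0
#define EXAMPLE_IO_ERROR	1
```

A `.. kernel-doc::` directive in a Documentation/*.rst file can then pull such a comment into the generated HTML, so the prose lives beside the code it documents.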
> I really detest that whole RST thing, and my solution is to explicitly
> not write kerneldoc, that way the doc generation stuff doesn't complain
> and I don't get random drive by patches wrecking the perfectly readable
> comment.
Sure. Rst _sources_ are not readable, IMO. Only generated man pages or
html are.
thanks,
--
js
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [RFC PATCH 0/7] x86/entry: Atomic statck switching for IST
2023-04-06 11:04 ` Jiri Slaby
@ 2023-04-06 12:37 ` Peter Zijlstra
0 siblings, 0 replies; 25+ messages in thread
From: Peter Zijlstra @ 2023-04-06 12:37 UTC (permalink / raw)
To: Jiri Slaby
Cc: Paolo Bonzini, Lai Jiangshan, Dave Hansen, linux-kernel,
Lai Jiangshan, H. Peter Anvin, Andi Kleen, Andrew Cooper,
Andy Lutomirski, Asit Mallick, Cfir Cohen, Dan Williams,
David Kaplan, David Rientjes, Erdem Aktas, Jan Kiszka,
Joerg Roedel, Juergen Gross, Kees Cook, Kirill Shutemov,
Kuppuswamy Sathyanarayanan, Linus Torvalds, Mike Stunes,
Raj Ashok, Sean Christopherson, Thomas Gleixner, Tom Lendacky,
Tony Luck, kvm, linux-coco, x86
On Thu, Apr 06, 2023 at 01:04:16PM +0200, Jiri Slaby wrote:
> Definitely it _can_ defeat the purpose and be heavily formatted.But it
> doesn't have to. It's like programming in perl.
>
> What I had in mind was e.g. "DOC: TTY Struct Flags":
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/include/linux/tty.h#n261
* TTY_THROTTLED
* Driver input is throttled. The ldisc should call
* :c:member:`tty_driver.unthrottle()` in order to resume reception when
* it is ready to process more data (at threshold min).
That whole :c:member:'tty_driver.unthrottle()' is an abomination and
has no place in a comment.
> Resulting in:
> https://www.kernel.org/doc/html/latest/driver-api/tty/tty_struct.html#tty-struct-flags
>
> Both the source and the result are quite readable, IMO. And the markup in
> the source is not mandatory, it's only for emphasizing and hyperlinks.
>
> As I wrote, you can link the comment in the code. But definitely you don't
> have to, if you don't want. I like the linking in Documentation as I can put
> the pieces from various sources/headers together to one place and build a
> bigger picture.
>
> > I really detest that whole RST thing, and my solution is to explicitly
> > not write kerneldoc, that way the doc generation stuff doesn't complain
> > and I don't get random drive by patches wrecking the perfectly readable
> > comment.
>
> Sure. Rst _sources_ are not readable, IMO. Only generated man pages or html
> are.
But code comments are read in a text editor, not a browser. Hence all
the markup is counter productive.
Why would you go read something in a browser if you have the code right
there in a text editor?
^ permalink raw reply [flat|nested] 25+ messages in thread