linux-kernel.vger.kernel.org archive mirror
* [PATCH v2 0/6] x86: Clean up fast syscall return validation
@ 2023-07-21 16:10 Brian Gerst
  2023-07-21 16:10 ` [PATCH v2 1/6] x86/entry/64: Remove obsolete comment on tracing vs. SYSRET Brian Gerst
                   ` (6 more replies)
  0 siblings, 7 replies; 22+ messages in thread
From: Brian Gerst @ 2023-07-21 16:10 UTC (permalink / raw)
  To: linux-kernel, x86
  Cc: Thomas Gleixner, Borislav Petkov, H . Peter Anvin,
	Andy Lutomirski, Mika Penttilä, Brian Gerst

This patch set cleans up the tests done to determine if a fast syscall
return instruction can be used to return to userspace.  It converts the
code to C, and refactors existing code to be more readable.

v2:
 - Fix shift value for canonical RIP test and use
   __is_canonical_address()
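
For reference, the v2 fix relies on the sign-extension definition of
canonicality that __is_canonical_address() encodes.  A stand-alone
sketch of that check (the function below is an illustrative user-space
copy, not the kernel's header):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Illustrative copy of the canonical-address test: an address is
 * canonical for `vaddr_bits` valid bits if bits 63..(vaddr_bits-1)
 * are all copies of bit (vaddr_bits-1).  Shifting the address up and
 * arithmetic-shifting it back down sign-extends from that bit; a
 * canonical address survives the round trip unchanged.
 */
static bool is_canonical(uint64_t vaddr, unsigned int vaddr_bits)
{
	unsigned int shift = 64 - vaddr_bits;

	return (uint64_t)((int64_t)(vaddr << shift) >> shift) == vaddr;
}
```

With 4-level paging __VIRTUAL_MASK_SHIFT is 47, so the kernel passes
47 + 1 = 48 valid bits: 0x00007fffffffffff and 0xffff800000000000 pass,
while addresses in the non-canonical hole (e.g. 0x0000800000000000)
fail.  Under LA57 the same helper is called with 57 bits.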

Brian Gerst (6):
  x86/entry/64: Remove obsolete comment on tracing vs. SYSRET
  x86/entry/64: Convert SYSRET validation tests to C
  x86/entry/compat: Combine return value test from syscall handler
  x86/entry/32: Convert do_fast_syscall_32() to bool return type
  x86/entry/32: Remove SEP test for SYSEXIT
  x86/entry/32: Clean up syscall fast exit tests

 arch/x86/entry/common.c          | 99 +++++++++++++++++++++-----------
 arch/x86/entry/entry_32.S        |  2 +-
 arch/x86/entry/entry_64.S        | 68 +---------------------
 arch/x86/entry/entry_64_compat.S | 12 ++--
 arch/x86/include/asm/syscall.h   |  6 +-
 5 files changed, 77 insertions(+), 110 deletions(-)

-- 
2.41.0


^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH v2 1/6] x86/entry/64: Remove obsolete comment on tracing vs. SYSRET
  2023-07-21 16:10 [PATCH v2 0/6] x86: Clean up fast syscall return validation Brian Gerst
@ 2023-07-21 16:10 ` Brian Gerst
  2023-10-05  8:28   ` [tip: x86/entry] " tip-bot2 for Brian Gerst
  2023-07-21 16:10 ` [PATCH v2 2/6] x86/entry/64: Convert SYSRET validation tests to C Brian Gerst
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 22+ messages in thread
From: Brian Gerst @ 2023-07-21 16:10 UTC (permalink / raw)
  To: linux-kernel, x86
  Cc: Thomas Gleixner, Borislav Petkov, H . Peter Anvin,
	Andy Lutomirski, Mika Penttilä, Brian Gerst

This comment comes from a time when the kernel attempted to use SYSRET
on all returns to userspace, including interrupts and exceptions.  Ever
since commit fffbb5dc ("Move opportunistic sysret code to syscall code
path"), SYSRET is only used for returning from system calls. The
specific tracing issue described in this comment is no longer possible.

Signed-off-by: Brian Gerst <brgerst@gmail.com>
---
 arch/x86/entry/entry_64.S | 19 +++----------------
 1 file changed, 3 insertions(+), 16 deletions(-)

diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 91f6818884fa..c01776a51545 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -166,22 +166,9 @@ SYM_INNER_LABEL(entry_SYSCALL_64_after_hwframe, SYM_L_GLOBAL)
 	jne	swapgs_restore_regs_and_return_to_usermode
 
 	/*
-	 * SYSCALL clears RF when it saves RFLAGS in R11 and SYSRET cannot
-	 * restore RF properly. If the slowpath sets it for whatever reason, we
-	 * need to restore it correctly.
-	 *
-	 * SYSRET can restore TF, but unlike IRET, restoring TF results in a
-	 * trap from userspace immediately after SYSRET.  This would cause an
-	 * infinite loop whenever #DB happens with register state that satisfies
-	 * the opportunistic SYSRET conditions.  For example, single-stepping
-	 * this user code:
-	 *
-	 *           movq	$stuck_here, %rcx
-	 *           pushfq
-	 *           popq %r11
-	 *   stuck_here:
-	 *
-	 * would never get past 'stuck_here'.
+	 * SYSRET cannot restore RF.  It can restore TF, but unlike IRET,
+	 * restoring TF results in a trap from userspace immediately after
+	 * SYSRET.
 	 */
 	testq	$(X86_EFLAGS_RF|X86_EFLAGS_TF), %r11
 	jnz	swapgs_restore_regs_and_return_to_usermode
-- 
2.41.0



* [PATCH v2 2/6] x86/entry/64: Convert SYSRET validation tests to C
  2023-07-21 16:10 [PATCH v2 0/6] x86: Clean up fast syscall return validation Brian Gerst
  2023-07-21 16:10 ` [PATCH v2 1/6] x86/entry/64: Remove obsolete comment on tracing vs. SYSRET Brian Gerst
@ 2023-07-21 16:10 ` Brian Gerst
  2023-07-23  9:53   ` Li, Xin3
  2023-07-21 16:10 ` [PATCH v2 3/6] x86/entry/compat: Combine return value test from syscall handler Brian Gerst
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 22+ messages in thread
From: Brian Gerst @ 2023-07-21 16:10 UTC (permalink / raw)
  To: linux-kernel, x86
  Cc: Thomas Gleixner, Borislav Petkov, H . Peter Anvin,
	Andy Lutomirski, Mika Penttilä, Brian Gerst

Signed-off-by: Brian Gerst <brgerst@gmail.com>
---
 arch/x86/entry/common.c        | 40 ++++++++++++++++++++++++-
 arch/x86/entry/entry_64.S      | 55 ++--------------------------------
 arch/x86/include/asm/syscall.h |  2 +-
 3 files changed, 42 insertions(+), 55 deletions(-)

diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index 6c2826417b33..cccdd18c8304 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -70,7 +70,8 @@ static __always_inline bool do_syscall_x32(struct pt_regs *regs, int nr)
 	return false;
 }
 
-__visible noinstr void do_syscall_64(struct pt_regs *regs, int nr)
+/* Returns true to return using SYSRET, or false to use IRET */
+__visible noinstr bool do_syscall_64(struct pt_regs *regs, int nr)
 {
 	add_random_kstack_offset();
 	nr = syscall_enter_from_user_mode(regs, nr);
@@ -84,6 +85,43 @@ __visible noinstr void do_syscall_64(struct pt_regs *regs, int nr)
 
 	instrumentation_end();
 	syscall_exit_to_user_mode(regs);
+
+	/*
+	 * Check that the register state is valid for using SYSRET to exit
+	 * to userspace.  Otherwise use the slower but fully capable IRET
+	 * exit path.
+	 */
+
+	/* XEN PV guests always use IRET path */
+	if (cpu_feature_enabled(X86_FEATURE_XENPV))
+		return false;
+
+	/* SYSRET requires RCX == RIP and R11 == EFLAGS */
+	if (unlikely(regs->cx != regs->ip || regs->r11 != regs->flags))
+		return false;
+
+	/* CS and SS must match the values set in MSR_STAR */
+	if (unlikely(regs->cs != __USER_CS || regs->ss != __USER_DS))
+		return false;
+
+	/*
+	 * On Intel CPUs, SYSRET with non-canonical RCX/RIP will #GP
+	 * in kernel space.  This essentially lets the user take over
+	 * the kernel, since userspace controls RSP.
+	 */
+	if (unlikely(!__is_canonical_address(regs->ip, __VIRTUAL_MASK_SHIFT + 1)))
+		return false;
+
+	/*
+	 * SYSRET cannot restore RF.  It can restore TF, but unlike IRET,
+	 * restoring TF results in a trap from userspace immediately after
+	 * SYSRET.
+	 */
+	if (unlikely(regs->flags & (X86_EFLAGS_RF | X86_EFLAGS_TF)))
+		return false;
+
+	/* Use SYSRET to exit to userspace */
+	return true;
 }
 #endif
 
diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index c01776a51545..b1288e22cae8 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -123,60 +123,9 @@ SYM_INNER_LABEL(entry_SYSCALL_64_after_hwframe, SYM_L_GLOBAL)
 	 * Try to use SYSRET instead of IRET if we're returning to
 	 * a completely clean 64-bit userspace context.  If we're not,
 	 * go to the slow exit path.
-	 * In the Xen PV case we must use iret anyway.
 	 */
-
-	ALTERNATIVE "", "jmp	swapgs_restore_regs_and_return_to_usermode", \
-		X86_FEATURE_XENPV
-
-	movq	RCX(%rsp), %rcx
-	movq	RIP(%rsp), %r11
-
-	cmpq	%rcx, %r11	/* SYSRET requires RCX == RIP */
-	jne	swapgs_restore_regs_and_return_to_usermode
-
-	/*
-	 * On Intel CPUs, SYSRET with non-canonical RCX/RIP will #GP
-	 * in kernel space.  This essentially lets the user take over
-	 * the kernel, since userspace controls RSP.
-	 *
-	 * If width of "canonical tail" ever becomes variable, this will need
-	 * to be updated to remain correct on both old and new CPUs.
-	 *
-	 * Change top bits to match most significant bit (47th or 56th bit
-	 * depending on paging mode) in the address.
-	 */
-#ifdef CONFIG_X86_5LEVEL
-	ALTERNATIVE "shl $(64 - 48), %rcx; sar $(64 - 48), %rcx", \
-		"shl $(64 - 57), %rcx; sar $(64 - 57), %rcx", X86_FEATURE_LA57
-#else
-	shl	$(64 - (__VIRTUAL_MASK_SHIFT+1)), %rcx
-	sar	$(64 - (__VIRTUAL_MASK_SHIFT+1)), %rcx
-#endif
-
-	/* If this changed %rcx, it was not canonical */
-	cmpq	%rcx, %r11
-	jne	swapgs_restore_regs_and_return_to_usermode
-
-	cmpq	$__USER_CS, CS(%rsp)		/* CS must match SYSRET */
-	jne	swapgs_restore_regs_and_return_to_usermode
-
-	movq	R11(%rsp), %r11
-	cmpq	%r11, EFLAGS(%rsp)		/* R11 == RFLAGS */
-	jne	swapgs_restore_regs_and_return_to_usermode
-
-	/*
-	 * SYSRET cannot restore RF.  It can restore TF, but unlike IRET,
-	 * restoring TF results in a trap from userspace immediately after
-	 * SYSRET.
-	 */
-	testq	$(X86_EFLAGS_RF|X86_EFLAGS_TF), %r11
-	jnz	swapgs_restore_regs_and_return_to_usermode
-
-	/* nothing to check for RSP */
-
-	cmpq	$__USER_DS, SS(%rsp)		/* SS must match SYSRET */
-	jne	swapgs_restore_regs_and_return_to_usermode
+	testb	%al, %al
+	jz	swapgs_restore_regs_and_return_to_usermode
 
 	/*
 	 * We win! This label is here just for ease of understanding
diff --git a/arch/x86/include/asm/syscall.h b/arch/x86/include/asm/syscall.h
index 4fb36fba4b5a..be6c5515e0b9 100644
--- a/arch/x86/include/asm/syscall.h
+++ b/arch/x86/include/asm/syscall.h
@@ -126,7 +126,7 @@ static inline int syscall_get_arch(struct task_struct *task)
 		? AUDIT_ARCH_I386 : AUDIT_ARCH_X86_64;
 }
 
-void do_syscall_64(struct pt_regs *regs, int nr);
+bool do_syscall_64(struct pt_regs *regs, int nr);
 
 #endif	/* CONFIG_X86_32 */
 
-- 
2.41.0



* [PATCH v2 3/6] x86/entry/compat: Combine return value test from syscall handler
  2023-07-21 16:10 [PATCH v2 0/6] x86: Clean up fast syscall return validation Brian Gerst
  2023-07-21 16:10 ` [PATCH v2 1/6] x86/entry/64: Remove obsolete comment on tracing vs. SYSRET Brian Gerst
  2023-07-21 16:10 ` [PATCH v2 2/6] x86/entry/64: Convert SYSRET validation tests to C Brian Gerst
@ 2023-07-21 16:10 ` Brian Gerst
  2023-10-05  8:28   ` [tip: x86/entry] " tip-bot2 for Brian Gerst
  2023-07-21 16:10 ` [PATCH v2 4/6] x86/entry/32: Convert do_fast_syscall_32() to bool return type Brian Gerst
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 22+ messages in thread
From: Brian Gerst @ 2023-07-21 16:10 UTC (permalink / raw)
  To: linux-kernel, x86
  Cc: Thomas Gleixner, Borislav Petkov, H . Peter Anvin,
	Andy Lutomirski, Mika Penttilä, Brian Gerst

Move the sysret32_from_system_call label to remove a duplicate test of
the return value from the syscall handler.

Signed-off-by: Brian Gerst <brgerst@gmail.com>
---
 arch/x86/entry/entry_64_compat.S | 9 ++++-----
 1 file changed, 4 insertions(+), 5 deletions(-)

diff --git a/arch/x86/entry/entry_64_compat.S b/arch/x86/entry/entry_64_compat.S
index 70150298f8bd..b16272395f1a 100644
--- a/arch/x86/entry/entry_64_compat.S
+++ b/arch/x86/entry/entry_64_compat.S
@@ -118,9 +118,6 @@ SYM_INNER_LABEL(entry_SYSENTER_compat_after_hwframe, SYM_L_GLOBAL)
 
 	movq	%rsp, %rdi
 	call	do_SYSENTER_32
-	/* XEN PV guests always use IRET path */
-	ALTERNATIVE "testl %eax, %eax; jz swapgs_restore_regs_and_return_to_usermode", \
-		    "jmp swapgs_restore_regs_and_return_to_usermode", X86_FEATURE_XENPV
 	jmp	sysret32_from_system_call
 
 .Lsysenter_fix_flags:
@@ -212,13 +209,15 @@ SYM_INNER_LABEL(entry_SYSCALL_compat_after_hwframe, SYM_L_GLOBAL)
 
 	movq	%rsp, %rdi
 	call	do_fast_syscall_32
+
+sysret32_from_system_call:
 	/* XEN PV guests always use IRET path */
 	ALTERNATIVE "testl %eax, %eax; jz swapgs_restore_regs_and_return_to_usermode", \
 		    "jmp swapgs_restore_regs_and_return_to_usermode", X86_FEATURE_XENPV
 
-	/* Opportunistic SYSRET */
-sysret32_from_system_call:
 	/*
+	 * Opportunistic SYSRET
+	 *
 	 * We are not going to return to userspace from the trampoline
 	 * stack. So let's erase the thread stack right now.
 	 */
-- 
2.41.0



* [PATCH v2 4/6] x86/entry/32: Convert do_fast_syscall_32() to bool return type
  2023-07-21 16:10 [PATCH v2 0/6] x86: Clean up fast syscall return validation Brian Gerst
                   ` (2 preceding siblings ...)
  2023-07-21 16:10 ` [PATCH v2 3/6] x86/entry/compat: Combine return value test from syscall handler Brian Gerst
@ 2023-07-21 16:10 ` Brian Gerst
  2023-10-05  8:28   ` [tip: x86/entry] " tip-bot2 for Brian Gerst
  2023-07-21 16:10 ` [PATCH v2 5/6] x86/entry/32: Remove SEP test for SYSEXIT Brian Gerst
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 22+ messages in thread
From: Brian Gerst @ 2023-07-21 16:10 UTC (permalink / raw)
  To: linux-kernel, x86
  Cc: Thomas Gleixner, Borislav Petkov, H . Peter Anvin,
	Andy Lutomirski, Mika Penttilä, Brian Gerst

Signed-off-by: Brian Gerst <brgerst@gmail.com>
---
 arch/x86/entry/common.c          | 10 +++++-----
 arch/x86/entry/entry_32.S        |  2 +-
 arch/x86/entry/entry_64_compat.S |  2 +-
 arch/x86/include/asm/syscall.h   |  4 ++--
 4 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index cccdd18c8304..5b3594457af3 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -220,8 +220,8 @@ static noinstr bool __do_fast_syscall_32(struct pt_regs *regs)
 	return true;
 }
 
-/* Returns 0 to return using IRET or 1 to return using SYSEXIT/SYSRETL. */
-__visible noinstr long do_fast_syscall_32(struct pt_regs *regs)
+/* Returns true to return using SYSEXIT/SYSRETL, or false to use IRET */
+__visible noinstr bool do_fast_syscall_32(struct pt_regs *regs)
 {
 	/*
 	 * Called using the internal vDSO SYSENTER/SYSCALL32 calling
@@ -239,7 +239,7 @@ __visible noinstr long do_fast_syscall_32(struct pt_regs *regs)
 
 	/* Invoke the syscall. If it failed, keep it simple: use IRET. */
 	if (!__do_fast_syscall_32(regs))
-		return 0;
+		return false;
 
 #ifdef CONFIG_X86_64
 	/*
@@ -272,8 +272,8 @@ __visible noinstr long do_fast_syscall_32(struct pt_regs *regs)
 #endif
 }
 
-/* Returns 0 to return using IRET or 1 to return using SYSEXIT/SYSRETL. */
-__visible noinstr long do_SYSENTER_32(struct pt_regs *regs)
+/* Returns true to return using SYSEXIT/SYSRETL, or false to use IRET */
+__visible noinstr bool do_SYSENTER_32(struct pt_regs *regs)
 {
 	/* SYSENTER loses RSP, but the vDSO saved it in RBP. */
 	regs->sp = regs->bp;
diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S
index 6e6af42e044a..c73047bf9f4b 100644
--- a/arch/x86/entry/entry_32.S
+++ b/arch/x86/entry/entry_32.S
@@ -837,7 +837,7 @@ SYM_FUNC_START(entry_SYSENTER_32)
 
 	movl	%esp, %eax
 	call	do_SYSENTER_32
-	testl	%eax, %eax
+	testb	%al, %al
 	jz	.Lsyscall_32_done
 
 	STACKLEAK_ERASE
diff --git a/arch/x86/entry/entry_64_compat.S b/arch/x86/entry/entry_64_compat.S
index b16272395f1a..27c05d08558a 100644
--- a/arch/x86/entry/entry_64_compat.S
+++ b/arch/x86/entry/entry_64_compat.S
@@ -212,7 +212,7 @@ SYM_INNER_LABEL(entry_SYSCALL_compat_after_hwframe, SYM_L_GLOBAL)
 
 sysret32_from_system_call:
 	/* XEN PV guests always use IRET path */
-	ALTERNATIVE "testl %eax, %eax; jz swapgs_restore_regs_and_return_to_usermode", \
+	ALTERNATIVE "testb %al, %al; jz swapgs_restore_regs_and_return_to_usermode", \
 		    "jmp swapgs_restore_regs_and_return_to_usermode", X86_FEATURE_XENPV
 
 	/*
diff --git a/arch/x86/include/asm/syscall.h b/arch/x86/include/asm/syscall.h
index be6c5515e0b9..f44e2f9ab65d 100644
--- a/arch/x86/include/asm/syscall.h
+++ b/arch/x86/include/asm/syscall.h
@@ -131,7 +131,7 @@ bool do_syscall_64(struct pt_regs *regs, int nr);
 #endif	/* CONFIG_X86_32 */
 
 void do_int80_syscall_32(struct pt_regs *regs);
-long do_fast_syscall_32(struct pt_regs *regs);
-long do_SYSENTER_32(struct pt_regs *regs);
+bool do_fast_syscall_32(struct pt_regs *regs);
+bool do_SYSENTER_32(struct pt_regs *regs);
 
 #endif	/* _ASM_X86_SYSCALL_H */
-- 
2.41.0



* [PATCH v2 5/6] x86/entry/32: Remove SEP test for SYSEXIT
  2023-07-21 16:10 [PATCH v2 0/6] x86: Clean up fast syscall return validation Brian Gerst
                   ` (3 preceding siblings ...)
  2023-07-21 16:10 ` [PATCH v2 4/6] x86/entry/32: Convert do_fast_syscall_32() to bool return type Brian Gerst
@ 2023-07-21 16:10 ` Brian Gerst
  2023-10-05  8:28   ` [tip: x86/entry] " tip-bot2 for Brian Gerst
  2023-07-21 16:10 ` [PATCH v2 6/6] x86/entry/32: Clean up syscall fast exit tests Brian Gerst
  2023-10-05  8:22 ` [PATCH v2 0/6] x86: Clean up fast syscall return validation Ingo Molnar
  6 siblings, 1 reply; 22+ messages in thread
From: Brian Gerst @ 2023-07-21 16:10 UTC (permalink / raw)
  To: linux-kernel, x86
  Cc: Thomas Gleixner, Borislav Petkov, H . Peter Anvin,
	Andy Lutomirski, Mika Penttilä, Brian Gerst

SEP must already be present in order for do_fast_syscall_32() to be
called on native 32-bit, so checking it again is unnecessary.

Signed-off-by: Brian Gerst <brgerst@gmail.com>
---
 arch/x86/entry/common.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index 5b3594457af3..4d7d6d25dda9 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -265,8 +265,7 @@ __visible noinstr bool do_fast_syscall_32(struct pt_regs *regs)
 	 * We don't allow syscalls at all from VM86 mode, but we still
 	 * need to check VM, because we might be returning from sys_vm86.
 	 */
-	return static_cpu_has(X86_FEATURE_SEP) &&
-		regs->cs == __USER_CS && regs->ss == __USER_DS &&
+	return regs->cs == __USER_CS && regs->ss == __USER_DS &&
 		regs->ip == landing_pad &&
 		(regs->flags & (X86_EFLAGS_RF | X86_EFLAGS_TF | X86_EFLAGS_VM)) == 0;
 #endif
-- 
2.41.0



* [PATCH v2 6/6] x86/entry/32: Clean up syscall fast exit tests
  2023-07-21 16:10 [PATCH v2 0/6] x86: Clean up fast syscall return validation Brian Gerst
                   ` (4 preceding siblings ...)
  2023-07-21 16:10 ` [PATCH v2 5/6] x86/entry/32: Remove SEP test for SYSEXIT Brian Gerst
@ 2023-07-21 16:10 ` Brian Gerst
  2023-10-05  8:22 ` [PATCH v2 0/6] x86: Clean up fast syscall return validation Ingo Molnar
  6 siblings, 0 replies; 22+ messages in thread
From: Brian Gerst @ 2023-07-21 16:10 UTC (permalink / raw)
  To: linux-kernel, x86
  Cc: Thomas Gleixner, Borislav Petkov, H . Peter Anvin,
	Andy Lutomirski, Mika Penttilä, Brian Gerst

Merge compat and native code and clarify comments.

Signed-off-by: Brian Gerst <brgerst@gmail.com>
---
 arch/x86/entry/common.c          | 48 +++++++++++++++-----------------
 arch/x86/entry/entry_64_compat.S |  5 ++--
 2 files changed, 24 insertions(+), 29 deletions(-)

diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index 4d7d6d25dda9..800664325542 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -241,34 +241,30 @@ __visible noinstr bool do_fast_syscall_32(struct pt_regs *regs)
 	if (!__do_fast_syscall_32(regs))
 		return false;
 
-#ifdef CONFIG_X86_64
 	/*
-	 * Opportunistic SYSRETL: if possible, try to return using SYSRETL.
-	 * SYSRETL is available on all 64-bit CPUs, so we don't need to
-	 * bother with SYSEXIT.
-	 *
-	 * Unlike 64-bit opportunistic SYSRET, we can't check that CX == IP,
-	 * because the ECX fixup above will ensure that this is essentially
-	 * never the case.
+	 * Check that the register state is valid for using SYSRETL/SYSEXIT
+	 * to exit to userspace.  Otherwise use the slower but fully capable
+	 * IRET exit path.
 	 */
-	return regs->cs == __USER32_CS && regs->ss == __USER_DS &&
-		regs->ip == landing_pad &&
-		(regs->flags & (X86_EFLAGS_RF | X86_EFLAGS_TF)) == 0;
-#else
-	/*
-	 * Opportunistic SYSEXIT: if possible, try to return using SYSEXIT.
-	 *
-	 * Unlike 64-bit opportunistic SYSRET, we can't check that CX == IP,
-	 * because the ECX fixup above will ensure that this is essentially
-	 * never the case.
-	 *
-	 * We don't allow syscalls at all from VM86 mode, but we still
-	 * need to check VM, because we might be returning from sys_vm86.
-	 */
-	return regs->cs == __USER_CS && regs->ss == __USER_DS &&
-		regs->ip == landing_pad &&
-		(regs->flags & (X86_EFLAGS_RF | X86_EFLAGS_TF | X86_EFLAGS_VM)) == 0;
-#endif
+
+	/* XEN PV guests always use IRET path */
+	if (cpu_feature_enabled(X86_FEATURE_XENPV))
+		return false;
+
+	/* EIP must point to the VDSO landing pad */
+	if (unlikely(regs->ip != landing_pad))
+		return false;
+
+	/* CS and SS must match the values set in MSR_STAR */
+	if (unlikely(regs->cs != __USER32_CS || regs->ss != __USER_DS))
+		return false;
+
+	/* If the TF, RF, or VM flags are set, use IRET */
+	if (unlikely(regs->flags & (X86_EFLAGS_RF | X86_EFLAGS_TF | X86_EFLAGS_VM)))
+		return false;
+
+	/* Use SYSRETL/SYSEXIT to exit to userspace */
+	return true;
 }
 
 /* Returns true to return using SYSEXIT/SYSRETL, or false to use IRET */
diff --git a/arch/x86/entry/entry_64_compat.S b/arch/x86/entry/entry_64_compat.S
index 27c05d08558a..84e21d1ebf10 100644
--- a/arch/x86/entry/entry_64_compat.S
+++ b/arch/x86/entry/entry_64_compat.S
@@ -211,9 +211,8 @@ SYM_INNER_LABEL(entry_SYSCALL_compat_after_hwframe, SYM_L_GLOBAL)
 	call	do_fast_syscall_32
 
 sysret32_from_system_call:
-	/* XEN PV guests always use IRET path */
-	ALTERNATIVE "testb %al, %al; jz swapgs_restore_regs_and_return_to_usermode", \
-		    "jmp swapgs_restore_regs_and_return_to_usermode", X86_FEATURE_XENPV
+	testb	%al, %al		/* Is SYSRET allowed? */
+	jz	swapgs_restore_regs_and_return_to_usermode
 
 	/*
 	 * Opportunistic SYSRET
-- 
2.41.0



* RE: [PATCH v2 2/6] x86/entry/64: Convert SYSRET validation tests to C
  2023-07-21 16:10 ` [PATCH v2 2/6] x86/entry/64: Convert SYSRET validation tests to C Brian Gerst
@ 2023-07-23  9:53   ` Li, Xin3
  2023-07-23 11:17     ` Brian Gerst
  0 siblings, 1 reply; 22+ messages in thread
From: Li, Xin3 @ 2023-07-23  9:53 UTC (permalink / raw)
  To: Brian Gerst, linux-kernel@vger.kernel.org, x86@kernel.org
  Cc: Thomas Gleixner, Borislav Petkov, H . Peter Anvin,
	Lutomirski, Andy, Mika Penttilä


> @@ -84,6 +85,43 @@ __visible noinstr void do_syscall_64(struct pt_regs *regs, int
> nr)
> 
>  	instrumentation_end();
>  	syscall_exit_to_user_mode(regs);

Would it be better to make the following code a new function?

And then the similar changes in patch 6 could be merged into the new
function with #ifdef CONFIG_X86_64.

> +
> +	/*
> +	 * Check that the register state is valid for using SYSRET to exit
> +	 * to userspace.  Otherwise use the slower but fully capable IRET
> +	 * exit path.
> +	 */
> +
> +	/* XEN PV guests always use IRET path */
> +	if (cpu_feature_enabled(X86_FEATURE_XENPV))
> +		return false;
> +
> +	/* SYSRET requires RCX == RIP and R11 == EFLAGS */
> +	if (unlikely(regs->cx != regs->ip || regs->r11 != regs->flags))
> +		return false;
> +
> +	/* CS and SS must match the values set in MSR_STAR */
> +	if (unlikely(regs->cs != __USER_CS || regs->ss != __USER_DS))
> +		return false;
> +
> +	/*
> +	 * On Intel CPUs, SYSRET with non-canonical RCX/RIP will #GP
> +	 * in kernel space.  This essentially lets the user take over
> +	 * the kernel, since userspace controls RSP.
> +	 */
> +	if (unlikely(!__is_canonical_address(regs->ip, __VIRTUAL_MASK_SHIFT +
> 1)))
> +		return false;
> +
> +	/*
> +	 * SYSRET cannot restore RF.  It can restore TF, but unlike IRET,
> +	 * restoring TF results in a trap from userspace immediately after
> +	 * SYSRET.
> +	 */
> +	if (unlikely(regs->flags & (X86_EFLAGS_RF | X86_EFLAGS_TF)))
> +		return false;
> +
> +	/* Use SYSRET to exit to userspace */
> +	return true;
>  }
>  #endif
> 



* Re: [PATCH v2 2/6] x86/entry/64: Convert SYSRET validation tests to C
  2023-07-23  9:53   ` Li, Xin3
@ 2023-07-23 11:17     ` Brian Gerst
  0 siblings, 0 replies; 22+ messages in thread
From: Brian Gerst @ 2023-07-23 11:17 UTC (permalink / raw)
  To: Li, Xin3
  Cc: linux-kernel@vger.kernel.org, x86@kernel.org, Thomas Gleixner,
	Borislav Petkov, H . Peter Anvin, Lutomirski, Andy,
	Mika Penttilä

On Sun, Jul 23, 2023 at 5:53 AM Li, Xin3 <xin3.li@intel.com> wrote:
>
>
> > @@ -84,6 +85,43 @@ __visible noinstr void do_syscall_64(struct pt_regs *regs, int
> > nr)
> >
> >       instrumentation_end();
> >       syscall_exit_to_user_mode(regs);
>
> Would it be better to make the following code a new function?
>
> And then the similar changes in patch 6 could be merged into the new
> function with #ifdef CONFIG_X86_64.
>
> > +
> > +     /*
> > +      * Check that the register state is valid for using SYSRET to exit
> > +      * to userspace.  Otherwise use the slower but fully capable IRET
> > +      * exit path.
> > +      */
> > +
> > +     /* XEN PV guests always use IRET path */
> > +     if (cpu_feature_enabled(X86_FEATURE_XENPV))
> > +             return false;
> > +
> > +     /* SYSRET requires RCX == RIP and R11 == EFLAGS */
> > +     if (unlikely(regs->cx != regs->ip || regs->r11 != regs->flags))
> > +             return false;
> > +
> > +     /* CS and SS must match the values set in MSR_STAR */
> > +     if (unlikely(regs->cs != __USER_CS || regs->ss != __USER_DS))
> > +             return false;
> > +
> > +     /*
> > +      * On Intel CPUs, SYSRET with non-canonical RCX/RIP will #GP
> > +      * in kernel space.  This essentially lets the user take over
> > +      * the kernel, since userspace controls RSP.
> > +      */
> > +     if (unlikely(!__is_canonical_address(regs->ip, __VIRTUAL_MASK_SHIFT +
> > 1)))
> > +             return false;
> > +
> > +     /*
> > +      * SYSRET cannot restore RF.  It can restore TF, but unlike IRET,
> > +      * restoring TF results in a trap from userspace immediately after
> > +      * SYSRET.
> > +      */
> > +     if (unlikely(regs->flags & (X86_EFLAGS_RF | X86_EFLAGS_TF)))
> > +             return false;
> > +
> > +     /* Use SYSRET to exit to userspace */
> > +     return true;
> >  }
> >  #endif
> >

The tests are similar, but not similar enough to combine.  If
IA32_EMULATION is enabled, both versions are needed, so a single copy
of the function with #ifdefs won't work.
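
A user-space sketch of the point above (the struct and constants are
simplified stand-ins for the kernel's pt_regs and EFLAGS definitions,
and the canonical-RIP test is omitted for brevity): with
IA32_EMULATION both predicates are live at once, and they check
different registers and conditions, so a single body switching on
#ifdef CONFIG_X86_64 can't express both.

```c
#include <stdbool.h>
#include <stdint.h>

/* Simplified stand-ins for the kernel's definitions (illustrative only). */
#define X86_EFLAGS_TF 0x00000100UL	/* trap flag */
#define X86_EFLAGS_RF 0x00010000UL	/* resume flag */
#define X86_EFLAGS_VM 0x00020000UL	/* virtual-8086 mode flag */

struct regs {				/* stand-in for struct pt_regs */
	uint64_t ip, cx, r11, flags, cs, ss;
};

/* 64-bit native exit: SYSRET needs RCX == RIP and R11 == RFLAGS. */
static bool sysret64_ok(const struct regs *r, uint64_t user_cs,
			uint64_t user_ds)
{
	return r->cx == r->ip && r->r11 == r->flags &&
	       r->cs == user_cs && r->ss == user_ds &&
	       !(r->flags & (X86_EFLAGS_RF | X86_EFLAGS_TF));
}

/*
 * Compat exit: no RCX/R11 test (the vDSO's ECX fixup makes it moot),
 * but EIP must be the vDSO landing pad and VM must also be clear.
 */
static bool sysretl_ok(const struct regs *r, uint64_t landing_pad,
		       uint64_t user32_cs, uint64_t user_ds)
{
	return r->ip == landing_pad &&
	       r->cs == user32_cs && r->ss == user_ds &&
	       !(r->flags & (X86_EFLAGS_RF | X86_EFLAGS_TF | X86_EFLAGS_VM));
}
```

On an IA32_EMULATION kernel both functions would be compiled in and
called from different entry points, which is why a single #ifdef'ed
copy can't cover both cases.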

Brian Gerst


* Re: [PATCH v2 0/6] x86: Clean up fast syscall return validation
  2023-07-21 16:10 [PATCH v2 0/6] x86: Clean up fast syscall return validation Brian Gerst
                   ` (5 preceding siblings ...)
  2023-07-21 16:10 ` [PATCH v2 6/6] x86/entry/32: Clean up syscall fast exit tests Brian Gerst
@ 2023-10-05  8:22 ` Ingo Molnar
  2023-10-05 15:13   ` Brian Gerst
  6 siblings, 1 reply; 22+ messages in thread
From: Ingo Molnar @ 2023-10-05  8:22 UTC (permalink / raw)
  To: Brian Gerst
  Cc: linux-kernel, x86, Thomas Gleixner, Borislav Petkov,
	H . Peter Anvin, Andy Lutomirski, Mika Penttilä,
	Linus Torvalds, Uros Bizjak, Denys Vlasenko, Peter Zijlstra,
	Josh Poimboeuf


* Brian Gerst <brgerst@gmail.com> wrote:

> This patch set cleans up the tests done to determine if a fast syscall
> return instruction can be used to return to userspace.  It converts the
> code to C, and refactors existing code to be more readable.
> 
> v2:
>  - Fix shift value for canonical RIP test and use
>    __is_canonical_address()
> 
> Brian Gerst (6):
>   x86/entry/64: Remove obsolete comment on tracing vs. SYSRET
>   x86/entry/64: Convert SYSRET validation tests to C
>   x86/entry/compat: Combine return value test from syscall handler
>   x86/entry/32: Convert do_fast_syscall_32() to bool return type
>   x86/entry/32: Remove SEP test for SYSEXIT
>   x86/entry/32: Clean up syscall fast exit tests
> 
>  arch/x86/entry/common.c          | 99 +++++++++++++++++++++-----------
>  arch/x86/entry/entry_32.S        |  2 +-
>  arch/x86/entry/entry_64.S        | 68 +---------------------
>  arch/x86/entry/entry_64_compat.S | 12 ++--
>  arch/x86/include/asm/syscall.h   |  6 +-
>  5 files changed, 77 insertions(+), 110 deletions(-)

Ok, so I've applied patches #1, #3, #4 and #5 to tip:x86/entry,
(ie. skipped #2 & #6 for now), as they look correct and are good
improvements. None of these four patches depend on the skipped
patches in some way I missed, correct?

As for #2, I looked at the before/after disassembly, and the new
C code in do_syscall_64() looked suboptimal on x86-64 defconfig,
if I was reading it right.

Mind re-evaluating that, and if you still think the C conversion
is a good idea, mind putting a before/after analysis of the
generated instructions into the changelog? This is our primary
system call return path after all.

Thanks,

	Ingo


* [tip: x86/entry] x86/entry/32: Remove SEP test for SYSEXIT
  2023-07-21 16:10 ` [PATCH v2 5/6] x86/entry/32: Remove SEP test for SYSEXIT Brian Gerst
@ 2023-10-05  8:28   ` tip-bot2 for Brian Gerst
  0 siblings, 0 replies; 22+ messages in thread
From: tip-bot2 for Brian Gerst @ 2023-10-05  8:28 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: Brian Gerst, Ingo Molnar, x86, linux-kernel

The following commit has been merged into the x86/entry branch of tip:

Commit-ID:     bab9fa6dc5e4483749838877deebe038de3ce97e
Gitweb:        https://git.kernel.org/tip/bab9fa6dc5e4483749838877deebe038de3ce97e
Author:        Brian Gerst <brgerst@gmail.com>
AuthorDate:    Fri, 21 Jul 2023 12:10:16 -04:00
Committer:     Ingo Molnar <mingo@kernel.org>
CommitterDate: Thu, 05 Oct 2023 10:06:43 +02:00

x86/entry/32: Remove SEP test for SYSEXIT

SEP must already be present in order for do_fast_syscall_32() to be
called on native 32-bit, so checking it again is unnecessary.

Signed-off-by: Brian Gerst <brgerst@gmail.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20230721161018.50214-6-brgerst@gmail.com
---
 arch/x86/entry/common.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index f1f3bf1..0551bcb 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -238,8 +238,7 @@ __visible noinstr bool do_fast_syscall_32(struct pt_regs *regs)
 	 * We don't allow syscalls at all from VM86 mode, but we still
 	 * need to check VM, because we might be returning from sys_vm86.
 	 */
-	return static_cpu_has(X86_FEATURE_SEP) &&
-		regs->cs == __USER_CS && regs->ss == __USER_DS &&
+	return regs->cs == __USER_CS && regs->ss == __USER_DS &&
 		regs->ip == landing_pad &&
 		(regs->flags & (X86_EFLAGS_RF | X86_EFLAGS_TF | X86_EFLAGS_VM)) == 0;
 #endif


* [tip: x86/entry] x86/entry/32: Convert do_fast_syscall_32() to bool return type
  2023-07-21 16:10 ` [PATCH v2 4/6] x86/entry/32: Convert do_fast_syscall_32() to bool return type Brian Gerst
@ 2023-10-05  8:28   ` tip-bot2 for Brian Gerst
  0 siblings, 0 replies; 22+ messages in thread
From: tip-bot2 for Brian Gerst @ 2023-10-05  8:28 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: Brian Gerst, Ingo Molnar, x86, linux-kernel

The following commit has been merged into the x86/entry branch of tip:

Commit-ID:     0d3109ad2e6125add5b3c88e8de3fb7bfd4e8c49
Gitweb:        https://git.kernel.org/tip/0d3109ad2e6125add5b3c88e8de3fb7bfd4e8c49
Author:        Brian Gerst <brgerst@gmail.com>
AuthorDate:    Fri, 21 Jul 2023 12:10:15 -04:00
Committer:     Ingo Molnar <mingo@kernel.org>
CommitterDate: Thu, 05 Oct 2023 10:06:42 +02:00

x86/entry/32: Convert do_fast_syscall_32() to bool return type

Doesn't have to be 'long' - this simplifies the code a bit.

Signed-off-by: Brian Gerst <brgerst@gmail.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20230721161018.50214-5-brgerst@gmail.com
---
 arch/x86/entry/common.c          | 10 +++++-----
 arch/x86/entry/entry_32.S        |  2 +-
 arch/x86/entry/entry_64_compat.S |  2 +-
 arch/x86/include/asm/syscall.h   |  4 ++--
 4 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index 328990d..f1f3bf1 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -193,8 +193,8 @@ static noinstr bool __do_fast_syscall_32(struct pt_regs *regs)
 	return true;
 }
 
-/* Returns 0 to return using IRET or 1 to return using SYSEXIT/SYSRETL. */
-__visible noinstr long do_fast_syscall_32(struct pt_regs *regs)
+/* Returns true to return using SYSEXIT/SYSRETL, or false to use IRET */
+__visible noinstr bool do_fast_syscall_32(struct pt_regs *regs)
 {
 	/*
 	 * Called using the internal vDSO SYSENTER/SYSCALL32 calling
@@ -212,7 +212,7 @@ __visible noinstr long do_fast_syscall_32(struct pt_regs *regs)
 
 	/* Invoke the syscall. If it failed, keep it simple: use IRET. */
 	if (!__do_fast_syscall_32(regs))
-		return 0;
+		return false;
 
 #ifdef CONFIG_X86_64
 	/*
@@ -245,8 +245,8 @@ __visible noinstr long do_fast_syscall_32(struct pt_regs *regs)
 #endif
 }
 
-/* Returns 0 to return using IRET or 1 to return using SYSEXIT/SYSRETL. */
-__visible noinstr long do_SYSENTER_32(struct pt_regs *regs)
+/* Returns true to return using SYSEXIT/SYSRETL, or false to use IRET */
+__visible noinstr bool do_SYSENTER_32(struct pt_regs *regs)
 {
 	/* SYSENTER loses RSP, but the vDSO saved it in RBP. */
 	regs->sp = regs->bp;
diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S
index 6e6af42..c73047b 100644
--- a/arch/x86/entry/entry_32.S
+++ b/arch/x86/entry/entry_32.S
@@ -837,7 +837,7 @@ SYM_FUNC_START(entry_SYSENTER_32)
 
 	movl	%esp, %eax
 	call	do_SYSENTER_32
-	testl	%eax, %eax
+	testb	%al, %al
 	jz	.Lsyscall_32_done
 
 	STACKLEAK_ERASE
diff --git a/arch/x86/entry/entry_64_compat.S b/arch/x86/entry/entry_64_compat.S
index b162723..27c05d0 100644
--- a/arch/x86/entry/entry_64_compat.S
+++ b/arch/x86/entry/entry_64_compat.S
@@ -212,7 +212,7 @@ SYM_INNER_LABEL(entry_SYSCALL_compat_after_hwframe, SYM_L_GLOBAL)
 
 sysret32_from_system_call:
 	/* XEN PV guests always use IRET path */
-	ALTERNATIVE "testl %eax, %eax; jz swapgs_restore_regs_and_return_to_usermode", \
+	ALTERNATIVE "testb %al, %al; jz swapgs_restore_regs_and_return_to_usermode", \
 		    "jmp swapgs_restore_regs_and_return_to_usermode", X86_FEATURE_XENPV
 
 	/*
diff --git a/arch/x86/include/asm/syscall.h b/arch/x86/include/asm/syscall.h
index 4fb36fb..c7e25c9 100644
--- a/arch/x86/include/asm/syscall.h
+++ b/arch/x86/include/asm/syscall.h
@@ -131,7 +131,7 @@ void do_syscall_64(struct pt_regs *regs, int nr);
 #endif	/* CONFIG_X86_32 */
 
 void do_int80_syscall_32(struct pt_regs *regs);
-long do_fast_syscall_32(struct pt_regs *regs);
-long do_SYSENTER_32(struct pt_regs *regs);
+bool do_fast_syscall_32(struct pt_regs *regs);
+bool do_SYSENTER_32(struct pt_regs *regs);
 
 #endif	/* _ASM_X86_SYSCALL_H */


* [tip: x86/entry] x86/entry/64: Remove obsolete comment on tracing vs. SYSRET
  2023-07-21 16:10 ` [PATCH v2 1/6] x86/entry/64: Remove obsolete comment on tracing vs. SYSRET Brian Gerst
@ 2023-10-05  8:28   ` tip-bot2 for Brian Gerst
  0 siblings, 0 replies; 22+ messages in thread
From: tip-bot2 for Brian Gerst @ 2023-10-05  8:28 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Brian Gerst, Ingo Molnar, Andy Lutomirski, Denys Vlasenko,
	H. Peter Anvin, Linus Torvalds, Peter Zijlstra, Thomas Gleixner,
	Josh Poimboeuf, x86, linux-kernel

The following commit has been merged into the x86/entry branch of tip:

Commit-ID:     eb43c9b1517b48e2ff0d3a584aca197338987d7b
Gitweb:        https://git.kernel.org/tip/eb43c9b1517b48e2ff0d3a584aca197338987d7b
Author:        Brian Gerst <brgerst@gmail.com>
AuthorDate:    Fri, 21 Jul 2023 12:10:12 -04:00
Committer:     Ingo Molnar <mingo@kernel.org>
CommitterDate: Thu, 05 Oct 2023 10:06:42 +02:00

x86/entry/64: Remove obsolete comment on tracing vs. SYSRET

This comment comes from a time when the kernel attempted to use SYSRET
on all returns to userspace, including interrupts and exceptions.  Ever
since commit fffbb5dc ("Move opportunistic sysret code to syscall code
path"), SYSRET is only used for returning from system calls. The
specific tracing issue listed in this comment is not possible anymore.

Signed-off-by: Brian Gerst <brgerst@gmail.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Link: https://lore.kernel.org/r/20230721161018.50214-2-brgerst@gmail.com
---
 arch/x86/entry/entry_64.S | 19 +++----------------
 1 file changed, 3 insertions(+), 16 deletions(-)

diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index f71664d..7574639 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -166,22 +166,9 @@ SYM_INNER_LABEL(entry_SYSCALL_64_after_hwframe, SYM_L_GLOBAL)
 	jne	swapgs_restore_regs_and_return_to_usermode
 
 	/*
-	 * SYSCALL clears RF when it saves RFLAGS in R11 and SYSRET cannot
-	 * restore RF properly. If the slowpath sets it for whatever reason, we
-	 * need to restore it correctly.
-	 *
-	 * SYSRET can restore TF, but unlike IRET, restoring TF results in a
-	 * trap from userspace immediately after SYSRET.  This would cause an
-	 * infinite loop whenever #DB happens with register state that satisfies
-	 * the opportunistic SYSRET conditions.  For example, single-stepping
-	 * this user code:
-	 *
-	 *           movq	$stuck_here, %rcx
-	 *           pushfq
-	 *           popq %r11
-	 *   stuck_here:
-	 *
-	 * would never get past 'stuck_here'.
+	 * SYSRET cannot restore RF.  It can restore TF, but unlike IRET,
+	 * restoring TF results in a trap from userspace immediately after
+	 * SYSRET.
 	 */
 	testq	$(X86_EFLAGS_RF|X86_EFLAGS_TF), %r11
 	jnz	swapgs_restore_regs_and_return_to_usermode


* [tip: x86/entry] x86/entry/compat: Combine return value test from syscall handler
  2023-07-21 16:10 ` [PATCH v2 3/6] x86/entry/compat: Combine return value test from syscall handler Brian Gerst
@ 2023-10-05  8:28   ` tip-bot2 for Brian Gerst
  0 siblings, 0 replies; 22+ messages in thread
From: tip-bot2 for Brian Gerst @ 2023-10-05  8:28 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: Brian Gerst, Ingo Molnar, x86, linux-kernel

The following commit has been merged into the x86/entry branch of tip:

Commit-ID:     eec62f61e134d6711f98d4005c6439f24d03d54f
Gitweb:        https://git.kernel.org/tip/eec62f61e134d6711f98d4005c6439f24d03d54f
Author:        Brian Gerst <brgerst@gmail.com>
AuthorDate:    Fri, 21 Jul 2023 12:10:14 -04:00
Committer:     Ingo Molnar <mingo@kernel.org>
CommitterDate: Thu, 05 Oct 2023 10:06:42 +02:00

x86/entry/compat: Combine return value test from syscall handler

Move the sysret32_from_system_call label to remove a duplicate test of
the return value from the syscall handler.

Signed-off-by: Brian Gerst <brgerst@gmail.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20230721161018.50214-4-brgerst@gmail.com
---
 arch/x86/entry/entry_64_compat.S |  9 ++++-----
 1 file changed, 4 insertions(+), 5 deletions(-)

diff --git a/arch/x86/entry/entry_64_compat.S b/arch/x86/entry/entry_64_compat.S
index 7015029..b162723 100644
--- a/arch/x86/entry/entry_64_compat.S
+++ b/arch/x86/entry/entry_64_compat.S
@@ -118,9 +118,6 @@ SYM_INNER_LABEL(entry_SYSENTER_compat_after_hwframe, SYM_L_GLOBAL)
 
 	movq	%rsp, %rdi
 	call	do_SYSENTER_32
-	/* XEN PV guests always use IRET path */
-	ALTERNATIVE "testl %eax, %eax; jz swapgs_restore_regs_and_return_to_usermode", \
-		    "jmp swapgs_restore_regs_and_return_to_usermode", X86_FEATURE_XENPV
 	jmp	sysret32_from_system_call
 
 .Lsysenter_fix_flags:
@@ -212,13 +209,15 @@ SYM_INNER_LABEL(entry_SYSCALL_compat_after_hwframe, SYM_L_GLOBAL)
 
 	movq	%rsp, %rdi
 	call	do_fast_syscall_32
+
+sysret32_from_system_call:
 	/* XEN PV guests always use IRET path */
 	ALTERNATIVE "testl %eax, %eax; jz swapgs_restore_regs_and_return_to_usermode", \
 		    "jmp swapgs_restore_regs_and_return_to_usermode", X86_FEATURE_XENPV
 
-	/* Opportunistic SYSRET */
-sysret32_from_system_call:
 	/*
+	 * Opportunistic SYSRET
+	 *
 	 * We are not going to return to userspace from the trampoline
 	 * stack. So let's erase the thread stack right now.
 	 */


* Re: [PATCH v2 0/6] x86: Clean up fast syscall return validation
  2023-10-05  8:22 ` [PATCH v2 0/6] x86: Clean up fast syscall return validation Ingo Molnar
@ 2023-10-05 15:13   ` Brian Gerst
  2023-10-05 20:20     ` Ingo Molnar
  0 siblings, 1 reply; 22+ messages in thread
From: Brian Gerst @ 2023-10-05 15:13 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, x86, Thomas Gleixner, Borislav Petkov,
	H . Peter Anvin, Andy Lutomirski, Mika Penttilä,
	Linus Torvalds, Uros Bizjak, Denys Vlasenko, Peter Zijlstra,
	Josh Poimboeuf

On Thu, Oct 5, 2023 at 4:22 AM Ingo Molnar <mingo@kernel.org> wrote:
>
>
> * Brian Gerst <brgerst@gmail.com> wrote:
>
> > This patch set cleans up the tests done to determine if a fast syscall
> > return instruction can be used to return to userspace.  It converts the
> > code to C, and refactors existing code to be more readable.
> >
> > v2:
> >  - Fix shift value for canonical RIP test and use
> >    __is_canonical_address()
> >
> > Brian Gerst (6):
> >   x86/entry/64: Remove obsolete comment on tracing vs. SYSRET
> >   x86/entry/64: Convert SYSRET validation tests to C
> >   x86/entry/compat: Combine return value test from syscall handler
> >   x86/entry/32: Convert do_fast_syscall_32() to bool return type
> >   x86/entry/32: Remove SEP test for SYSEXIT
> >   x86/entry/32: Clean up syscall fast exit tests
> >
> >  arch/x86/entry/common.c          | 99 +++++++++++++++++++++-----------
> >  arch/x86/entry/entry_32.S        |  2 +-
> >  arch/x86/entry/entry_64.S        | 68 +---------------------
> >  arch/x86/entry/entry_64_compat.S | 12 ++--
> >  arch/x86/include/asm/syscall.h   |  6 +-
> >  5 files changed, 77 insertions(+), 110 deletions(-)
>
> Ok, so I've applied patches #1, #3, #4 and #5 to tip:x86/entry,
> (ie. skipped #2 & #6 for now), as they look correct and are good
> improvements. None of these four patches depend on the skipped
> patches in some way I missed, correct?
>
> As for #2, I looked at the before/after disassembly, and the new
> C code in do_syscall_64() looked suboptimal on x86-64 defconfig,
> if I was reading it right.
>
> Mind re-evaluating that, and if you still think the C conversion
> is a good idea, mind putting a before/after analysis of the
> generated instructions into the changelog? This is our primary
> system call return path after all.

Looking at the compiled output, the only suboptimal code appears to be
the canonical address test, where the C code uses the CL register for
the shifts instead of immediates.

 180:   e9 00 00 00 00          jmp    185 <do_syscall_64+0x85>
                        181: R_X86_64_PC32      .altinstr_aux-0x4
 185:   b9 07 00 00 00          mov    $0x7,%ecx
 18a:   eb 05                   jmp    191 <do_syscall_64+0x91>
 18c:   b9 10 00 00 00          mov    $0x10,%ecx
 191:   48 89 c2                mov    %rax,%rdx
 194:   48 d3 e2                shl    %cl,%rdx
 197:   48 d3 fa                sar    %cl,%rdx
 19a:   48 39 d0                cmp    %rdx,%rax
 19d:   75 39                   jne    1d8 <do_syscall_64+0xd8>

Was there anything else specifically that you can point out?

Brian Gerst


* Re: [PATCH v2 0/6] x86: Clean up fast syscall return validation
  2023-10-05 15:13   ` Brian Gerst
@ 2023-10-05 20:20     ` Ingo Molnar
  2023-10-06 18:59       ` H. Peter Anvin
  0 siblings, 1 reply; 22+ messages in thread
From: Ingo Molnar @ 2023-10-05 20:20 UTC (permalink / raw)
  To: Brian Gerst
  Cc: linux-kernel, x86, Thomas Gleixner, Borislav Petkov,
	H . Peter Anvin, Andy Lutomirski, Mika Penttilä,
	Linus Torvalds, Uros Bizjak, Denys Vlasenko, Peter Zijlstra,
	Josh Poimboeuf


* Brian Gerst <brgerst@gmail.com> wrote:

> Looking at the compiled output, the only suboptimal code appears to be
> the canonical address test, where the C code uses the CL register for
> the shifts instead of immediates.
> 
>  180:   e9 00 00 00 00          jmp    185 <do_syscall_64+0x85>
>                         181: R_X86_64_PC32      .altinstr_aux-0x4
>  185:   b9 07 00 00 00          mov    $0x7,%ecx
>  18a:   eb 05                   jmp    191 <do_syscall_64+0x91>
>  18c:   b9 10 00 00 00          mov    $0x10,%ecx
>  191:   48 89 c2                mov    %rax,%rdx
>  194:   48 d3 e2                shl    %cl,%rdx
>  197:   48 d3 fa                sar    %cl,%rdx
>  19a:   48 39 d0                cmp    %rdx,%rax
>  19d:   75 39                   jne    1d8 <do_syscall_64+0xd8>

Yeah, it didn't look equivalent - so I guess we want a C equivalent for:

-       ALTERNATIVE "shl $(64 - 48), %rcx; sar $(64 - 48), %rcx", \
-               "shl $(64 - 57), %rcx; sar $(64 - 57), %rcx", X86_FEATURE_LA57

instead of the pgtable_l5_enabled() runtime test that 
__is_canonical_address() uses?

Thanks,

	Ingo


* Re: [PATCH v2 0/6] x86: Clean up fast syscall return validation
  2023-10-05 20:20     ` Ingo Molnar
@ 2023-10-06 18:59       ` H. Peter Anvin
  2023-10-06 21:32         ` Brian Gerst
                           ` (2 more replies)
  0 siblings, 3 replies; 22+ messages in thread
From: H. Peter Anvin @ 2023-10-06 18:59 UTC (permalink / raw)
  To: Ingo Molnar, Brian Gerst
  Cc: linux-kernel, x86, Thomas Gleixner, Borislav Petkov,
	Andy Lutomirski, Mika Penttilä, Linus Torvalds, Uros Bizjak,
	Denys Vlasenko, Peter Zijlstra, Josh Poimboeuf

On 10/5/23 13:20, Ingo Molnar wrote:
> 
> * Brian Gerst <brgerst@gmail.com> wrote:
> 
>> Looking at the compiled output, the only suboptimal code appears to be
>> the canonical address test, where the C code uses the CL register for
>> the shifts instead of immediates.
>>
>>   180:   e9 00 00 00 00          jmp    185 <do_syscall_64+0x85>
>>                          181: R_X86_64_PC32      .altinstr_aux-0x4
>>   185:   b9 07 00 00 00          mov    $0x7,%ecx
>>   18a:   eb 05                   jmp    191 <do_syscall_64+0x91>
>>   18c:   b9 10 00 00 00          mov    $0x10,%ecx
>>   191:   48 89 c2                mov    %rax,%rdx
>>   194:   48 d3 e2                shl    %cl,%rdx
>>   197:   48 d3 fa                sar    %cl,%rdx
>>   19a:   48 39 d0                cmp    %rdx,%rax
>>   19d:   75 39                   jne    1d8 <do_syscall_64+0xd8>
> 
> Yeah, it didn't look equivalent - so I guess we want a C equivalent for:
> 
> -       ALTERNATIVE "shl $(64 - 48), %rcx; sar $(64 - 48), %rcx", \
> -               "shl $(64 - 57), %rcx; sar $(64 - 57), %rcx", X86_FEATURE_LA57
> 
> instead of the pgtable_l5_enabled() runtime test that
> __is_canonical_address() uses?
> 

I don't think such a thing (without simply duplicating the above as an 
alternative asm, which is obviously easy enough, and still allows the 
compiler to pick the register used) would be possible without immediate 
patching support[*].

Incidentally, this is a question for Uros: is there a reason this is a 
mov to %ecx and not just %cl, which would save 3 bytes?

Incidentally, it is possible to save one instruction and use only *one* 
alternative immediate:

	leaq (%rax,%rax),%rdx
	xorq %rax,%rdx
	shrq $(63 - LA),%rdx		# Yes, 63, not 64
	# ZF=1 if canonical

This works because if bit [x] is set in the output, then bits [x] and 
[x-1] in the input are different (bit [-1] is considered to be zero); and 
by definition an address is canonical if and only if all the bits [63:LA] 
are identical, thus bits [63:LA+1] in the output must all be zero.

The first two instructions are pure arithmetic and can thus be done in C:

	bar = foo ^ (foo << 1);

... leaving only one instruction needing to be patched at runtime.
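
As a userspace sketch of the above (hard-coding the 4-level case, where
the shift counts are fixed, instead of runtime patching — the 48-bit
values here are an assumption for illustration, not the kernel code):

```c
#include <assert.h>
#include <stdint.h>

/* Classic test: sign-extend from bit 47 (the shl $16; sar $16 pair)
 * and compare.  The right shift of a negative value is implementation-
 * defined in C, but gcc and clang both emit an arithmetic shift. */
static int is_canonical_sext(uint64_t addr)
{
	return (uint64_t)((int64_t)(addr << 16) >> 16) == addr;
}

/* Transition-bit variant: bar has a bit set wherever two adjacent
 * input bits differ, so the address is canonical iff no transition
 * occurs above bit 47, i.e. bits [63:48] of bar are all zero. */
static int is_canonical_xor(uint64_t addr)
{
	uint64_t bar = addr ^ (addr << 1);

	return (bar >> 48) == 0;
}
```

Both agree on e.g. 0x00007fffffffffff (canonical) and 0x0000800000000000
(not); only the final shift count would still need an LA57 alternative.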

	-hpa



[*] which is a whole bit of infrastructure that I know first-hand is 
extremely complex[**] -- I had an almost-complete patchset done at one 
time, but it has severely bitrotted. The good part is that it is 
probably a lot easier to do today, with the alternatives-patching 
callback routines available.

[**] the absolute best would of course be if such infrastructure could 
be compiler-supported (probably as part of some really fancy support for 
alternatives/static branches), but that would probably be a nightmare to 
do in the compiler or a plugin.




* Re: [PATCH v2 0/6] x86: Clean up fast syscall return validation
  2023-10-06 18:59       ` H. Peter Anvin
@ 2023-10-06 21:32         ` Brian Gerst
  2023-10-07  9:42           ` Ingo Molnar
  2023-10-06 23:58         ` H. Peter Anvin
  2023-10-07  9:56         ` Uros Bizjak
  2 siblings, 1 reply; 22+ messages in thread
From: Brian Gerst @ 2023-10-06 21:32 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Ingo Molnar, linux-kernel, x86, Thomas Gleixner, Borislav Petkov,
	Andy Lutomirski, Mika Penttilä, Linus Torvalds, Uros Bizjak,
	Denys Vlasenko, Peter Zijlstra, Josh Poimboeuf

On Fri, Oct 6, 2023 at 2:59 PM H. Peter Anvin <hpa@zytor.com> wrote:
>
> On 10/5/23 13:20, Ingo Molnar wrote:
> >
> > * Brian Gerst <brgerst@gmail.com> wrote:
> >
> >> Looking at the compiled output, the only suboptimal code appears to be
> >> the canonical address test, where the C code uses the CL register for
> >> the shifts instead of immediates.
> >>
> >>   180:   e9 00 00 00 00          jmp    185 <do_syscall_64+0x85>
> >>                          181: R_X86_64_PC32      .altinstr_aux-0x4
> >>   185:   b9 07 00 00 00          mov    $0x7,%ecx
> >>   18a:   eb 05                   jmp    191 <do_syscall_64+0x91>
> >>   18c:   b9 10 00 00 00          mov    $0x10,%ecx
> >>   191:   48 89 c2                mov    %rax,%rdx
> >>   194:   48 d3 e2                shl    %cl,%rdx
> >>   197:   48 d3 fa                sar    %cl,%rdx
> >>   19a:   48 39 d0                cmp    %rdx,%rax
> >>   19d:   75 39                   jne    1d8 <do_syscall_64+0xd8>
> >
> > Yeah, it didn't look equivalent - so I guess we want a C equivalent for:
> >
> > -       ALTERNATIVE "shl $(64 - 48), %rcx; sar $(64 - 48), %rcx", \
> > -               "shl $(64 - 57), %rcx; sar $(64 - 57), %rcx", X86_FEATURE_LA57
> >
> > instead of the pgtable_l5_enabled() runtime test that
> > __is_canonical_address() uses?
> >
>
> I don't think such a thing (without simply duplicate the above as an
> alternative asm, which is obviously easy enough, and still allows the
> compiler to pick the register used) would be possible without immediate
> patching support[*].
>
> Incidentally, this is a question for Uros: is there a reason this is a
> mov to %ecx and not just %cl, which would save 3 bytes?
>
> Incidentally, it is possible to save one instruction and use only *one*
> alternative immediate:
>
>         leaq (%rax,%rax),%rdx
>         xorq %rax,%rdx
>         shrq $(63 - LA),%rdx            # Yes, 63, not 64
>         # ZF=1 if canonical
>
> This works because if bit [x] is set in the output, then bit [x] and
> [x-1] in the input are different (bit [-1] considered to be zero); and
> by definition a bit is canonical if and only if all the bits [63:LA] are
> identical, thus bits [63:LA+1] in the output must all be zero.
>
> The first two instructions are pure arithmetic and can thus be done in C:
>
>         bar = foo ^ (foo << 1);
>
> ... leaving only one instruction needing to be patched at runtime.
>
>         -hpa

One other alternative I have been considering is comparing against
TASK_SIZE_MAX.  The only user-executable address above that is the
long deprecated vsyscall page.  IMHO it's not worth optimizing for
that case, so just let it fall back to using IRET.

    if (unlikely(regs->ip >= TASK_SIZE_MAX)) return false;

compiles to:

 180:   48 b9 00 f0 ff ff ff    movabs $0x7ffffffff000,%rcx
 187:   7f 00 00
 18a:   48 39 c8                cmp    %rcx,%rax
 18d:   73 39                   jae    1c8 <do_syscall_64+0xc8>

0000000000000000 <.altinstr_replacement>:
   0:   48 b9 00 f0 ff ff ff    movabs $0xfffffffffff000,%rcx
   7:   ff ff 00

Brian Gerst


* Re: [PATCH v2 0/6] x86: Clean up fast syscall return validation
  2023-10-06 18:59       ` H. Peter Anvin
  2023-10-06 21:32         ` Brian Gerst
@ 2023-10-06 23:58         ` H. Peter Anvin
  2023-10-07  9:56         ` Uros Bizjak
  2 siblings, 0 replies; 22+ messages in thread
From: H. Peter Anvin @ 2023-10-06 23:58 UTC (permalink / raw)
  To: Ingo Molnar, Brian Gerst
  Cc: linux-kernel, x86, Thomas Gleixner, Borislav Petkov,
	Andy Lutomirski, Mika Penttilä, Linus Torvalds, Uros Bizjak,
	Denys Vlasenko, Peter Zijlstra, Josh Poimboeuf

On 10/6/23 11:59, H. Peter Anvin wrote:
> 
> Incidentally, it is possible to save one instruction and use only *one* 
> alternative immediate:
> 
>      leaq (%rax,%rax),%rdx
>      xorq %rax,%rdx
>      shrq $(63 - LA),%rdx        # Yes, 63, not 64
>      # ZF=1 if canonical
> 
> This works because if bit [x] is set in the output, then bit [x] and 
> [x-1] in the input are different (bit [-1] considered to be zero); and 
> by definition a bit is canonical if and only if all the bits [63:LA] are 
> identical, thus bits [63:LA+1] in the output must all be zero.
> 

Yes, I'm a doofus. Bits [63:LA-1] must be identical, so 64 is correct :$)

	-hpa



* Re: [PATCH v2 0/6] x86: Clean up fast syscall return validation
  2023-10-06 21:32         ` Brian Gerst
@ 2023-10-07  9:42           ` Ingo Molnar
  0 siblings, 0 replies; 22+ messages in thread
From: Ingo Molnar @ 2023-10-07  9:42 UTC (permalink / raw)
  To: Brian Gerst
  Cc: H. Peter Anvin, linux-kernel, x86, Thomas Gleixner,
	Borislav Petkov, Andy Lutomirski, Mika Penttilä,
	Linus Torvalds, Uros Bizjak, Denys Vlasenko, Peter Zijlstra,
	Josh Poimboeuf


* Brian Gerst <brgerst@gmail.com> wrote:

> On Fri, Oct 6, 2023 at 2:59 PM H. Peter Anvin <hpa@zytor.com> wrote:
> >
> > On 10/5/23 13:20, Ingo Molnar wrote:
> > >
> > > * Brian Gerst <brgerst@gmail.com> wrote:
> > >
> > >> Looking at the compiled output, the only suboptimal code appears to be
> > >> the canonical address test, where the C code uses the CL register for
> > >> the shifts instead of immediates.
> > >>
> > >>   180:   e9 00 00 00 00          jmp    185 <do_syscall_64+0x85>
> > >>                          181: R_X86_64_PC32      .altinstr_aux-0x4
> > >>   185:   b9 07 00 00 00          mov    $0x7,%ecx
> > >>   18a:   eb 05                   jmp    191 <do_syscall_64+0x91>
> > >>   18c:   b9 10 00 00 00          mov    $0x10,%ecx
> > >>   191:   48 89 c2                mov    %rax,%rdx
> > >>   194:   48 d3 e2                shl    %cl,%rdx
> > >>   197:   48 d3 fa                sar    %cl,%rdx
> > >>   19a:   48 39 d0                cmp    %rdx,%rax
> > >>   19d:   75 39                   jne    1d8 <do_syscall_64+0xd8>
> > >
> > > Yeah, it didn't look equivalent - so I guess we want a C equivalent for:
> > >
> > > -       ALTERNATIVE "shl $(64 - 48), %rcx; sar $(64 - 48), %rcx", \
> > > -               "shl $(64 - 57), %rcx; sar $(64 - 57), %rcx", X86_FEATURE_LA57
> > >
> > > instead of the pgtable_l5_enabled() runtime test that
> > > __is_canonical_address() uses?
> > >
> >
> > I don't think such a thing (without simply duplicate the above as an
> > alternative asm, which is obviously easy enough, and still allows the
> > compiler to pick the register used) would be possible without immediate
> > patching support[*].
> >
> > Incidentally, this is a question for Uros: is there a reason this is a
> > mov to %ecx and not just %cl, which would save 3 bytes?
> >
> > Incidentally, it is possible to save one instruction and use only *one*
> > alternative immediate:
> >
> >         leaq (%rax,%rax),%rdx
> >         xorq %rax,%rdx
> >         shrq $(63 - LA),%rdx            # Yes, 63, not 64
> >         # ZF=1 if canonical
> >
> > This works because if bit [x] is set in the output, then bit [x] and
> > [x-1] in the input are different (bit [-1] considered to be zero); and
> > by definition a bit is canonical if and only if all the bits [63:LA] are
> > identical, thus bits [63:LA+1] in the output must all be zero.
> >
> > The first two instructions are pure arithmetic and can thus be done in C:
> >
> >         bar = foo ^ (foo << 1);
> >
> > ... leaving only one instruction needing to be patched at runtime.
> >
> >         -hpa
> 
> One other alternative I have been considering is comparing against
> TASK_SIZE_MAX.  The only user-executable address above that is the
> long deprecated vsyscall page.  IMHO it's not worth optimizing for
> that case, so just let it fall back to using IRET.
> 
>     if (unlikely(regs->ip >= TASK_SIZE_MAX)) return false;
> 
> compiles to:
> 
>  180:   48 b9 00 f0 ff ff ff    movabs $0x7ffffffff000,%rcx
>  187:   7f 00 00
>  18a:   48 39 c8                cmp    %rcx,%rax
>  18d:   73 39                   jae    1c8 <do_syscall_64+0xc8>
> 
> 0000000000000000 <.altinstr_replacement>:
>    0:   48 b9 00 f0 ff ff ff    movabs $0xfffffffffff000,%rcx
>    7:   ff ff 00

That sounds good - and we could do this as a separate patch on top
of your existing patches, to keep it bisectable in case there are
any problems.

Thanks,

	Ingo


* Re: [PATCH v2 0/6] x86: Clean up fast syscall return validation
  2023-10-06 18:59       ` H. Peter Anvin
  2023-10-06 21:32         ` Brian Gerst
  2023-10-06 23:58         ` H. Peter Anvin
@ 2023-10-07  9:56         ` Uros Bizjak
  2023-10-07 18:07           ` Linus Torvalds
  2 siblings, 1 reply; 22+ messages in thread
From: Uros Bizjak @ 2023-10-07  9:56 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Ingo Molnar, Brian Gerst, linux-kernel, x86, Thomas Gleixner,
	Borislav Petkov, Andy Lutomirski, Mika Penttilä,
	Linus Torvalds, Denys Vlasenko, Peter Zijlstra, Josh Poimboeuf

On Fri, Oct 6, 2023 at 8:59 PM H. Peter Anvin <hpa@zytor.com> wrote:
>
> On 10/5/23 13:20, Ingo Molnar wrote:
> >
> > * Brian Gerst <brgerst@gmail.com> wrote:
> >
> >> Looking at the compiled output, the only suboptimal code appears to be
> >> the canonical address test, where the C code uses the CL register for
> >> the shifts instead of immediates.
> >>
> >>   180:   e9 00 00 00 00          jmp    185 <do_syscall_64+0x85>
> >>                          181: R_X86_64_PC32      .altinstr_aux-0x4
> >>   185:   b9 07 00 00 00          mov    $0x7,%ecx
> >>   18a:   eb 05                   jmp    191 <do_syscall_64+0x91>
> >>   18c:   b9 10 00 00 00          mov    $0x10,%ecx
> >>   191:   48 89 c2                mov    %rax,%rdx
> >>   194:   48 d3 e2                shl    %cl,%rdx
> >>   197:   48 d3 fa                sar    %cl,%rdx
> >>   19a:   48 39 d0                cmp    %rdx,%rax
> >>   19d:   75 39                   jne    1d8 <do_syscall_64+0xd8>
> >
> > Yeah, it didn't look equivalent - so I guess we want a C equivalent for:
> >
> > -       ALTERNATIVE "shl $(64 - 48), %rcx; sar $(64 - 48), %rcx", \
> > -               "shl $(64 - 57), %rcx; sar $(64 - 57), %rcx", X86_FEATURE_LA57
> >
> > instead of the pgtable_l5_enabled() runtime test that
> > __is_canonical_address() uses?
> >
>
> I don't think such a thing (without simply duplicate the above as an
> alternative asm, which is obviously easy enough, and still allows the
> compiler to pick the register used) would be possible without immediate
> patching support[*].
>
> Incidentally, this is a question for Uros: is there a reason this is a
> mov to %ecx and not just %cl, which would save 3 bytes?

The compiler uses 32-bit moves to transfer values between registers,
even when the values are narrower than 32 bits. To avoid partial
register stalls, it uses 32-bit operations as much as possible (e.g. it
zero-extends loads of 8-bit values from memory, loads 32-bit constants,
etc.). Since the kernel is compiled with -O2, the compiler does not
care much about the size of instructions, and it uses the full 32-bit
width to initialize a register with a constant.

Please note that an 8-bit movb instruction in fact represents an insert
into a word-mode register. The compiler does not know how this
word-mode register will be used, so to avoid partial register stalls,
it takes a cautious approach and (with -O2) moves the constant into the
register with a word-width instruction.
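
For illustration, a sketch of the two encodings in question (byte
counts from the x86 instruction format, not actual compiler output):

```asm
# Both sequences set the low byte of RCX to 7; the sizes differ by 3:
movb $0x7, %cl    # 2 bytes (b1 07): writes only CL, leaving the rest
                  # of RCX intact -- a partial-register write
movl $0x7, %ecx   # 5 bytes (b9 07 00 00 00): writes all of ECX and
                  # zero-extends into RCX, breaking any dependency on
                  # the previous value of the register
```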

Also, the compiler is quite eager to CSE constants. When there are two
or more uses of the same constant, it will move the constant into a
register first.

Uros.


* Re: [PATCH v2 0/6] x86: Clean up fast syscall return validation
  2023-10-07  9:56         ` Uros Bizjak
@ 2023-10-07 18:07           ` Linus Torvalds
  0 siblings, 0 replies; 22+ messages in thread
From: Linus Torvalds @ 2023-10-07 18:07 UTC (permalink / raw)
  To: Uros Bizjak
  Cc: H. Peter Anvin, Ingo Molnar, Brian Gerst, linux-kernel, x86,
	Thomas Gleixner, Borislav Petkov, Andy Lutomirski,
	Mika Penttilä, Denys Vlasenko, Peter Zijlstra,
	Josh Poimboeuf

On Sat, 7 Oct 2023 at 02:57, Uros Bizjak <ubizjak@gmail.com> wrote:
>
> Please note that 8-bit movb instruction in fact represents insert into
> word-mode register. The compiler does not know how this word-mode
> register will be used, so to avoid partial register stalls, it takes a
> cautious approach and (with -O2) moves constant to a register with a
> word-width instruction.

Yes. In this case I really think it's our kernel code that is disgusting.

People seem to think that because cpu_feature_enabled() uses static
jumps, it's somehow "free". Not so. The static jumps are often
horrendous for code quality, and they effectively make the compiler
blind to otherwise obvious optimizations.

We need to stop using those so blindly. I see code like this

       return pgtable_l5_enabled() ? PT64_ROOT_5LEVEL : PT64_ROOT_4LEVEL;

and it just makes me sad. And yes, that virtual mask is disgusting. This:

    #define __VIRTUAL_MASK_SHIFT       (pgtable_l5_enabled() ? 56 : 47)

is *NOT* cheap. The end result is *not* a constant, it's a two-way
branch, and while the branch is static and the compiler may then
optimize each side of the branch with the constant in question, most
of the time it just results in *disgusting* code like this where gcc
just has to load a constant into a register, and treat it as a
variable.

With the code shuffling, it's possibly *more* expensive than just
loading from memory, in that you just change a D$ load into an I$ one
instead. I'm sure that looks great on benchmarks that stay entirely in
the I$, but ...

Something like using alternatives is *much* cheaper. Better yet is the
suggestion to simplify the conditionals entirely to not depend on the
virtual shift at all (ie our recent access_ok() simplifications that
were due to _another_ CPU feature)

> Also, the compiler is quite eager to CSE constants. When there are two
> or more uses of the same constant, it will move a constant into the
> register first.

Now, honestly, that may be a good thing for other architectures that
have issues with constants, but on x86 - and integer constants - it's
likely a bad thing unless the constant is particularly complicated.

Most x86 instructions will take the usual 8/32-bit constants (but if
we have arbitrary 64-bit ones, you tend to need them in registers).

             Linus

