public inbox for linux-kernel@vger.kernel.org
* Add support for RD/WR FS/GSBASE
@ 2014-04-28 22:12 Andi Kleen
  2014-04-28 22:12 ` [PATCH 1/7] percpu: Add a DEFINE_PER_CPU_2PAGE_ALIGNED Andi Kleen
                   ` (6 more replies)
  0 siblings, 7 replies; 26+ messages in thread
From: Andi Kleen @ 2014-04-28 22:12 UTC (permalink / raw)
  To: x86; +Cc: linux-kernel

This adds kernel support for new Intel instructions that
allow fast access to the 64bit FS and GS base. They require
some changes to entry_64.S because they allow user space
to fully control the GS base.

Advantages:
- NMIs (and other "paranoid" interrupts) avoid
one RDMSR, which makes them faster
- User space can use these instructions, mainly
for efficient context switching with user thread libraries
- Context switches no longer need MSR writes
to switch a FS/GS base above 4GB. This
speeds up applications with enough thread
local data that it doesn't fit below 4GB.
- User space can use GS efficiently as an additional
global pointer register

I also included one minor (unrelated) optimization to 
disable an unneeded old SWAPGS workaround.

-Andi


^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH 1/7] percpu: Add a DEFINE_PER_CPU_2PAGE_ALIGNED
  2014-04-28 22:12 Add support for RD/WR FS/GSBASE Andi Kleen
@ 2014-04-28 22:12 ` Andi Kleen
  2014-05-02 15:18   ` Tejun Heo
  2014-04-28 22:12 ` [PATCH 2/7] x86: Naturally align the debug IST stack Andi Kleen
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 26+ messages in thread
From: Andi Kleen @ 2014-04-28 22:12 UTC (permalink / raw)
  To: x86; +Cc: linux-kernel, Andi Kleen, tj

From: Andi Kleen <ak@linux.intel.com>

Needed by a follow-on patch that has to naturally align an 8K stack.
The variable still goes into the page aligned section. This may cause
an extra 4K hole if we're unlucky.

Cc: tj@kernel.org
Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 include/linux/percpu-defs.h | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/include/linux/percpu-defs.h b/include/linux/percpu-defs.h
index a5fc7d0..5b2f5b0 100644
--- a/include/linux/percpu-defs.h
+++ b/include/linux/percpu-defs.h
@@ -142,6 +142,10 @@
 	DEFINE_PER_CPU_SECTION(type, name, "..page_aligned")		\
 	__aligned(PAGE_SIZE)
 
+#define DEFINE_PER_CPU_2PAGE_ALIGNED(type, name)			\
+	DEFINE_PER_CPU_SECTION(type, name, "..page_aligned")		\
+	__aligned(2*PAGE_SIZE)
+
 /*
  * Declaration/definition used for per-CPU variables that must be read mostly.
  */
-- 
1.9.0



* [PATCH 2/7] x86: Naturally align the debug IST stack
  2014-04-28 22:12 Add support for RD/WR FS/GSBASE Andi Kleen
  2014-04-28 22:12 ` [PATCH 1/7] percpu: Add a DEFINE_PER_CPU_2PAGE_ALIGNED Andi Kleen
@ 2014-04-28 22:12 ` Andi Kleen
  2014-04-28 22:12 ` [PATCH 3/7] x86: Add C intrinsics for new rd/wr fs/gs base instructions Andi Kleen
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 26+ messages in thread
From: Andi Kleen @ 2014-04-28 22:12 UTC (permalink / raw)
  To: x86; +Cc: linux-kernel, Andi Kleen

From: Andi Kleen <ak@linux.intel.com>

The follow-on patch for RD*BASE requires the IST stacks to be naturally
aligned, because it uses an AND mask to access the bottom of the stack.
This was already true for every stack except the debug IST stack.
Make the debug IST stack naturally aligned too.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 arch/x86/include/asm/page_64_types.h | 4 ++--
 arch/x86/kernel/cpu/common.c         | 8 +++++++-
 2 files changed, 9 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/page_64_types.h b/arch/x86/include/asm/page_64_types.h
index 43dcd80..76564b6 100644
--- a/arch/x86/include/asm/page_64_types.h
+++ b/arch/x86/include/asm/page_64_types.h
@@ -17,8 +17,8 @@
 #define STACKFAULT_STACK 1
 #define DOUBLEFAULT_STACK 2
 #define NMI_STACK 3
-#define DEBUG_STACK 4
-#define MCE_STACK 5
+#define MCE_STACK 4
+#define DEBUG_STACK 5 /* Must always be last */
 #define N_EXCEPTION_STACKS 5  /* hw limit: 7 */
 
 #define PUD_PAGE_SIZE		(_AC(1, UL) << PUD_SHIFT)
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 24b6fd1..79ba4b9 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -1114,8 +1114,12 @@ static const unsigned int exception_stack_sizes[N_EXCEPTION_STACKS] = {
 	  [DEBUG_STACK - 1]			= DEBUG_STKSZ
 };
 
+/* The IST stacks must be naturally aligned */
+
 static DEFINE_PER_CPU_PAGE_ALIGNED(char, exception_stacks
-	[(N_EXCEPTION_STACKS - 1) * EXCEPTION_STKSZ + DEBUG_STKSZ]);
+	[(N_EXCEPTION_STACKS - 1) * EXCEPTION_STKSZ]);
+static DEFINE_PER_CPU_2PAGE_ALIGNED(char, debug_stack
+	[DEBUG_STKSZ]);
 
 /* May not be marked __init: used by software suspend */
 void syscall_init(void)
@@ -1285,6 +1289,8 @@ void cpu_init(void)
 		char *estacks = per_cpu(exception_stacks, cpu);
 
 		for (v = 0; v < N_EXCEPTION_STACKS; v++) {
+			if (v == DEBUG_STACK - 1)
+				estacks = per_cpu(debug_stack, cpu);
 			estacks += exception_stack_sizes[v];
 			oist->ist[v] = t->x86_tss.ist[v] =
 					(unsigned long)estacks;
-- 
1.9.0



* [PATCH 3/7] x86: Add C intrinsics for new rd/wr fs/gs base instructions
  2014-04-28 22:12 Add support for RD/WR FS/GSBASE Andi Kleen
  2014-04-28 22:12 ` [PATCH 1/7] percpu: Add a DEFINE_PER_CPU_2PAGE_ALIGNED Andi Kleen
  2014-04-28 22:12 ` [PATCH 2/7] x86: Naturally align the debug IST stack Andi Kleen
@ 2014-04-28 22:12 ` Andi Kleen
  2014-04-29 14:10   ` Konrad Rzeszutek Wilk
  2014-04-28 22:12 ` [PATCH 4/7] x86: Add support for rd/wr fs/gs base Andi Kleen
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 26+ messages in thread
From: Andi Kleen @ 2014-04-28 22:12 UTC (permalink / raw)
  To: x86; +Cc: linux-kernel, Andi Kleen

From: Andi Kleen <ak@linux.intel.com>

Plus for swapgs.

Very straight forward. Used in followon patch.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 arch/x86/include/asm/fsgs.h | 43 +++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 43 insertions(+)
 create mode 100644 arch/x86/include/asm/fsgs.h

diff --git a/arch/x86/include/asm/fsgs.h b/arch/x86/include/asm/fsgs.h
new file mode 100644
index 0000000..a3a5cd3
--- /dev/null
+++ b/arch/x86/include/asm/fsgs.h
@@ -0,0 +1,43 @@
+#ifndef _ASM_FSGS_H
+#define _ASM_FSGS_H 1
+
+static inline void swapgs(void)
+{
+	asm volatile("swapgs" ::: "memory");
+}
+
+/* Must be protected by X86_FEATURE_FSGSBASE check. */
+
+static inline unsigned long rdgsbase(void)
+{
+	unsigned long gs;
+	asm volatile(".byte 0xf3,0x48,0x0f,0xae,0xc8 # rdgsbaseq %%rax"
+			: "=a" (gs)
+			:: "memory");
+	return gs;
+}
+
+static inline unsigned long rdfsbase(void)
+{
+	unsigned long fs;
+	asm volatile(".byte 0xf3,0x48,0x0f,0xae,0xc0 # rdfsbaseq %%rax"
+			: "=a" (fs)
+			:: "memory");
+	return fs;
+}
+
+static inline void wrgsbase(unsigned long gs)
+{
+	asm volatile(".byte 0xf3,0x48,0x0f,0xae,0xd8 # wrgsbaseq %%rax"
+			:: "a" (gs)
+			: "memory");
+}
+
+static inline void wrfsbase(unsigned long fs)
+{
+	asm volatile(".byte 0xf3,0x48,0x0f,0xae,0xd0 # wrfsbaseq %%rax"
+			:: "a" (fs)
+			: "memory");
+}
+
+#endif
-- 
1.9.0



* [PATCH 4/7] x86: Add support for rd/wr fs/gs base
  2014-04-28 22:12 Add support for RD/WR FS/GSBASE Andi Kleen
                   ` (2 preceding siblings ...)
  2014-04-28 22:12 ` [PATCH 3/7] x86: Add C intrinsics for new rd/wr fs/gs base instructions Andi Kleen
@ 2014-04-28 22:12 ` Andi Kleen
  2014-04-29 18:19   ` Andy Lutomirski
  2014-04-28 22:12 ` [PATCH 5/7] x86: Make old K8 swapgs workaround conditional Andi Kleen
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 26+ messages in thread
From: Andi Kleen @ 2014-04-28 22:12 UTC (permalink / raw)
  To: x86; +Cc: linux-kernel, Andi Kleen

From: Andi Kleen <ak@linux.intel.com>

IvyBridge added new instructions to directly write the fs and gs
64bit base registers. Previously this had to be done with a system
call to write the MSRs. The main use case is fast user space threading
that needs to switch the fs/gs base registers quickly.

The instructions are opt-in and have to be explicitly enabled
by the OS.

Previously Linux couldn't support this because the paranoid
entry code relied on the GS base never being negative outside
the kernel to decide when to use SWAPGS. It would check the GS MSR
value and assume it was already running in the kernel if the value
was negative.

This patch changes the paranoid entry code to use rdgsbase
if available.  Then we check the GS value against the expected GS value
stored at the bottom of the IST stack. If the value is the expected
value we skip swapgs.

This is also significantly faster than an MSR read, so it will speed
up NMIs (critical for profiling).

An alternative would have been to save/restore the GS value
unconditionally, but this approach needs less changes.

After these changes we also need to use the new instructions
to save/restore the fs and gs base, so that the new values set by
user space won't disappear.  This is also significantly
faster for the case when the 64bit base has to be switched
(that is, when the base is above 4GB), as we can replace
the slow MSR write with a faster wr[fg]sbase execution.

The instructions do not context switch
the segment index, so the old invariant that the fs or gs index
has to be 0 for a different 64bit base value to stick is still
true. Previously it was enforced by arch_prctl; now the user
program has to make sure it keeps the segment indexes zero.
If it doesn't, the changes may not stick.

This in turn enables fast switching when there are
enough threads that their TLS segments do not fit below 4GB,
and programs that use fs as an additional base
register will not get a significant context switch penalty.

It is all done in a single patch to avoid bisect crash
holes.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 arch/x86/kernel/cpu/common.c |  6 ++++++
 arch/x86/kernel/entry_64.S   | 38 +++++++++++++++++++++++++++++++++++++-
 arch/x86/kernel/process_64.c | 28 ++++++++++++++++++++++++----
 3 files changed, 67 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 79ba4b9..0fb8767 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -938,6 +938,9 @@ static void identify_cpu(struct cpuinfo_x86 *c)
 #ifdef CONFIG_NUMA
 	numa_add_cpu(smp_processor_id());
 #endif
+
+	if (cpu_has(c, X86_FEATURE_FSGSBASE))
+		set_in_cr4(X86_CR4_FSGSBASE);
 }
 
 #ifdef CONFIG_X86_64
@@ -1287,10 +1290,13 @@ void cpu_init(void)
 	 */
 	if (!oist->ist[0]) {
 		char *estacks = per_cpu(exception_stacks, cpu);
+		void *gs = per_cpu(irq_stack_union.gs_base, cpu);
 
 		for (v = 0; v < N_EXCEPTION_STACKS; v++) {
 			if (v == DEBUG_STACK - 1)
 				estacks = per_cpu(debug_stack, cpu);
+			/* Store GS at bottom of stack for bootstrap access */
+			*(void **)estacks = gs;
 			estacks += exception_stack_sizes[v];
 			oist->ist[v] = t->x86_tss.ist[v] =
 					(unsigned long)estacks;
diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
index 1e96c36..7c77b2b 100644
--- a/arch/x86/kernel/entry_64.S
+++ b/arch/x86/kernel/entry_64.S
@@ -58,6 +58,7 @@
 #include <asm/asm.h>
 #include <asm/context_tracking.h>
 #include <asm/smap.h>
+#include <asm/alternative-asm.h>
 #include <linux/err.h>
 
 /* Avoid __ASSEMBLER__'ifying <linux/audit.h> just for this.  */
@@ -508,6 +509,10 @@ ENTRY(save_paranoid)
 	movq_cfi r14, R14+8
 	movq_cfi r15, R15+8
 	movl $1,%ebx
+33:
+	ASM_NOP5	/* May be replaced with jump to paranoid_save_gs */
+34:
+	movq $-1,ORIG_RAX+8(%rsp)	/* no syscall to restart */
 	movl $MSR_GS_BASE,%ecx
 	rdmsr
 	testl %edx,%edx
@@ -515,6 +520,37 @@ ENTRY(save_paranoid)
 	SWAPGS
 	xorl %ebx,%ebx
 1:	ret
+
+	/* Patch in jump to paranoid_save_gs for X86_FEATURE_FSGSBASE */
+	.section .altinstr_replacement,"ax"
+35:	.byte 0xe9 /* 32bit near jump */
+	.long paranoid_save_gs-34b
+	.previous
+	.section .altinstructions,"a"
+	altinstruction_entry 33b,35b,X86_FEATURE_FSGSBASE,5,5
+	.previous
+
+	/* Faster version not using RDMSR, and also not assuming
+	 * anything about the previous GS value.
+	 * This allows the user to arbitrarily change GS using
+	 * WRGSBASE.
+	 */
+paranoid_save_gs:
+	.byte 0xf3,0x48,0x0f,0xae,0xc9	# rdgsbaseq %rcx
+	movq $-EXCEPTION_STKSZ,%rax	# non debug stack size
+	cmpq $DEBUG_STACK,ORIG_RAX+8(%rsp)
+	movq $-1,ORIG_RAX+8(%rsp)	# no syscall to restart
+	jne  1f
+	movq $-DEBUG_STKSZ,%rax		# debug stack size
+1:
+	andq %rsp,%rax			# bottom of stack
+	movq (%rax),%rdi		# get expected GS
+	cmpq %rdi,%rcx			# is it the kernel gs?
+	jz   2f
+	SWAPGS
+	xorl %ebx,%ebx
+2:	ret
+
 	CFI_ENDPROC
 END(save_paranoid)
 	.popsection
@@ -1245,7 +1281,7 @@ ENTRY(\sym)
 	INTR_FRAME
 	ASM_CLAC
 	PARAVIRT_ADJUST_EXCEPTION_FRAME
-	pushq_cfi $-1		/* ORIG_RAX: no syscall to restart */
+	pushq_cfi $\ist		/* ORIG_RAX: pass ist number to save_paranoid */
 	subq $ORIG_RAX-R15, %rsp
 	CFI_ADJUST_CFA_OFFSET ORIG_RAX-R15
 	call save_paranoid
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index 9c0280f..334a87a 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -49,6 +49,7 @@
 #include <asm/syscalls.h>
 #include <asm/debugreg.h>
 #include <asm/switch_to.h>
+#include <asm/fsgs.h>
 
 asmlinkage extern void ret_from_fork(void);
 
@@ -311,6 +312,13 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p)
 	 */
 	savesegment(fs, fsindex);
 	savesegment(gs, gsindex);
+	if (static_cpu_has(X86_FEATURE_FSGSBASE)) {
+		prev->fs = rdfsbase();
+		/* Interrupts are disabled here. */
+		swapgs();
+		prev->gs = rdgsbase();
+		swapgs();
+	}
 
 	load_TLS(next, cpu);
 
@@ -341,8 +349,12 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p)
 			prev->fs = 0;
 	}
 	/* when next process has a 64bit base use it */
-	if (next->fs)
-		wrmsrl(MSR_FS_BASE, next->fs);
+	if (next->fs) {
+		if (static_cpu_has(X86_FEATURE_FSGSBASE))
+			wrfsbase(next->fs);
+		else
+			wrmsrl(MSR_FS_BASE, next->fs);
+	}
 	prev->fsindex = fsindex;
 
 	if (unlikely(gsindex | next->gsindex | prev->gs)) {
@@ -350,8 +362,16 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p)
 		if (gsindex)
 			prev->gs = 0;
 	}
-	if (next->gs)
-		wrmsrl(MSR_KERNEL_GS_BASE, next->gs);
+	if (next->gs) {
+		if (static_cpu_has(X86_FEATURE_FSGSBASE)) {
+			/* Interrupts are disabled here. */
+			swapgs();
+			wrgsbase(next->gs);
+			swapgs();
+		} else {
+			wrmsrl(MSR_KERNEL_GS_BASE, next->gs);
+		}
+	}
 	prev->gsindex = gsindex;
 
 	switch_fpu_finish(next_p, fpu);
-- 
1.9.0



* [PATCH 5/7] x86: Make old K8 swapgs workaround conditional
  2014-04-28 22:12 Add support for RD/WR FS/GSBASE Andi Kleen
                   ` (3 preceding siblings ...)
  2014-04-28 22:12 ` [PATCH 4/7] x86: Add support for rd/wr fs/gs base Andi Kleen
@ 2014-04-28 22:12 ` Andi Kleen
  2014-04-30  4:57   ` H. Peter Anvin
  2014-04-28 22:12 ` [PATCH 6/7] x86: Enumerate kernel FSGS capability in AT_HWCAP2 Andi Kleen
  2014-04-28 22:12 ` [PATCH 7/7] x86: Add documentation for rd/wr fs/gs base Andi Kleen
  6 siblings, 1 reply; 26+ messages in thread
From: Andi Kleen @ 2014-04-28 22:12 UTC (permalink / raw)
  To: x86; +Cc: linux-kernel, Andi Kleen

From: Andi Kleen <ak@linux.intel.com>

Every GS selector/index reload paid for an extra MFENCE
between the two SWAPGS instructions, to work around an old
bug in early K8 steppings.  No other CPUs need the extra
mfence, so patch the MFENCE in only on K8.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 arch/x86/include/asm/cpufeature.h |  1 +
 arch/x86/kernel/cpu/amd.c         |  3 +++
 arch/x86/kernel/entry_64.S        | 10 +++++++++-
 3 files changed, 13 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/cpufeature.h b/arch/x86/include/asm/cpufeature.h
index 89270b4..eb4bb46 100644
--- a/arch/x86/include/asm/cpufeature.h
+++ b/arch/x86/include/asm/cpufeature.h
@@ -102,6 +102,7 @@
 #define X86_FEATURE_APERFMPERF	(3*32+28) /* APERFMPERF */
 #define X86_FEATURE_EAGER_FPU	(3*32+29) /* "eagerfpu" Non lazy FPU restore */
 #define X86_FEATURE_NONSTOP_TSC_S3 (3*32+30) /* TSC doesn't stop in S3 state */
+#define X86_FEATURE_SWAPGS_MFENCE (3*32+31) /* SWAPGS may need MFENCE */
 
 /* Intel-defined CPU features, CPUID level 0x00000001 (ecx), word 4 */
 #define X86_FEATURE_XMM3	(4*32+ 0) /* "pni" SSE-3 */
diff --git a/arch/x86/kernel/cpu/amd.c b/arch/x86/kernel/cpu/amd.c
index d3153e2..65125d3 100644
--- a/arch/x86/kernel/cpu/amd.c
+++ b/arch/x86/kernel/cpu/amd.c
@@ -560,6 +560,9 @@ static void init_amd(struct cpuinfo_x86 *c)
 		if ((level >= 0x0f48 && level < 0x0f50) || level >= 0x0f58)
 			set_cpu_cap(c, X86_FEATURE_REP_GOOD);
 
+		if (c->x86 == 0xf)
+			set_cpu_cap(c, X86_FEATURE_SWAPGS_MFENCE);
+
 		/*
 		 * Some BIOSes incorrectly force this feature, but only K8
 		 * revision D (model = 0x14) and later actually support it.
diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
index 7c77b2b..810f010 100644
--- a/arch/x86/kernel/entry_64.S
+++ b/arch/x86/kernel/entry_64.S
@@ -1369,13 +1369,21 @@ ENTRY(native_load_gs_index)
 	SWAPGS
 gs_change:
 	movl %edi,%gs
-2:	mfence		/* workaround */
+2:	ASM_NOP3	/* may be replaced with mfence */
 	SWAPGS
 	popfq_cfi
 	ret
 	CFI_ENDPROC
 END(native_load_gs_index)
 
+	/* Early K8 systems needed an mfence after swapgs to workaround a bug */
+	.section .altinstr_replacement,"ax"
+3:	mfence
+	.previous
+	.section .altinstructions,"a"
+	altinstruction_entry 2b,3b,X86_FEATURE_SWAPGS_MFENCE,3,3
+	.previous
+
 	_ASM_EXTABLE(gs_change,bad_gs)
 	.section .fixup,"ax"
 	/* running with kernelgs */
-- 
1.9.0



* [PATCH 6/7] x86: Enumerate kernel FSGS capability in AT_HWCAP2
  2014-04-28 22:12 Add support for RD/WR FS/GSBASE Andi Kleen
                   ` (4 preceding siblings ...)
  2014-04-28 22:12 ` [PATCH 5/7] x86: Make old K8 swapgs workaround conditional Andi Kleen
@ 2014-04-28 22:12 ` Andi Kleen
  2014-04-28 22:12 ` [PATCH 7/7] x86: Add documentation for rd/wr fs/gs base Andi Kleen
  6 siblings, 0 replies; 26+ messages in thread
From: Andi Kleen @ 2014-04-28 22:12 UTC (permalink / raw)
  To: x86; +Cc: linux-kernel, Andi Kleen

From: Andi Kleen <ak@linux.intel.com>

The kernel needs to explicitly enable RD/WRFSBASE to handle context
switches correctly. So the application needs to know if it can safely use
these instructions. Just looking at the CPUID bit is not enough because
the application may be running on a kernel that does not enable them.

One way for the application would be to just try them and catch the
SIGILL. But that is difficult to do in libraries, which may not want
to overwrite the signal handlers of the main application.

So we need to provide a way for the application to discover the kernel
capability.

I used AT_HWCAP2 in the ELF aux vector, which is already used by
PPC for similar things. We define a new Linux-defined bitmap
returned in AT_HWCAP2.  Currently it has only one bit set,
indicating that the kernel is FSGSBASE capable.

The application can then access it manually or using
the getauxval() function in newer glibc.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 arch/x86/include/asm/elf.h        | 7 +++++++
 arch/x86/include/uapi/asm/hwcap.h | 7 +++++++
 arch/x86/kernel/cpu/common.c      | 7 ++++++-
 3 files changed, 20 insertions(+), 1 deletion(-)
 create mode 100644 arch/x86/include/uapi/asm/hwcap.h

diff --git a/arch/x86/include/asm/elf.h b/arch/x86/include/asm/elf.h
index 9c999c1..0fbc9e0 100644
--- a/arch/x86/include/asm/elf.h
+++ b/arch/x86/include/asm/elf.h
@@ -251,6 +251,13 @@ extern int force_personality32;
 
 #define ELF_HWCAP		(boot_cpu_data.x86_capability[0])
 
+extern unsigned kernel_enabled_features;
+
+/* HWCAP2 supplies kernel-enabled CPU features, so that the application
+   can know that it can safely use them. The bits are defined in
+   uapi/asm/hwcap.h. */
+#define ELF_HWCAP2		kernel_enabled_features
+
 /* This yields a string that ld.so will use to load implementation
    specific libraries for optimization.  This is more specific in
    intent than poking at uname or /proc/cpuinfo.
diff --git a/arch/x86/include/uapi/asm/hwcap.h b/arch/x86/include/uapi/asm/hwcap.h
new file mode 100644
index 0000000..d9c54f8
--- /dev/null
+++ b/arch/x86/include/uapi/asm/hwcap.h
@@ -0,0 +1,7 @@
+#ifndef _ASM_HWCAP_H
+#define _ASM_HWCAP_H 1
+
+#define HWCAP2_FSGSBASE	(1 << 0) 	/* Kernel enabled RD/WR FS/GS BASE */
+/* up to bit 31 free */
+
+#endif
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 0fb8767..a4e4aa6 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -31,6 +31,7 @@
 #include <asm/i387.h>
 #include <asm/fpu-internal.h>
 #include <asm/mtrr.h>
+#include <asm/hwcap.h>
 #include <linux/numa.h>
 #include <asm/asm.h>
 #include <asm/cpu.h>
@@ -46,6 +47,8 @@
 
 #include "cpu.h"
 
+unsigned kernel_enabled_features __read_mostly;
+
 /* all of these masks are initialized in setup_cpu_local_masks() */
 cpumask_var_t cpu_initialized_mask;
 cpumask_var_t cpu_callout_mask;
@@ -939,8 +942,10 @@ static void identify_cpu(struct cpuinfo_x86 *c)
 	numa_add_cpu(smp_processor_id());
 #endif
 
-	if (cpu_has(c, X86_FEATURE_FSGSBASE))
+	if (cpu_has(c, X86_FEATURE_FSGSBASE)) {
+		kernel_enabled_features |= HWCAP2_FSGSBASE;
 		set_in_cr4(X86_CR4_FSGSBASE);
+	}
 }
 
 #ifdef CONFIG_X86_64
-- 
1.9.0



* [PATCH 7/7] x86: Add documentation for rd/wr fs/gs base
  2014-04-28 22:12 Add support for RD/WR FS/GSBASE Andi Kleen
                   ` (5 preceding siblings ...)
  2014-04-28 22:12 ` [PATCH 6/7] x86: Enumerate kernel FSGS capability in AT_HWCAP2 Andi Kleen
@ 2014-04-28 22:12 ` Andi Kleen
  2014-04-29  2:23   ` Randy Dunlap
  6 siblings, 1 reply; 26+ messages in thread
From: Andi Kleen @ 2014-04-28 22:12 UTC (permalink / raw)
  To: x86; +Cc: linux-kernel, Andi Kleen

From: Andi Kleen <ak@linux.intel.com>

Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 Documentation/x86/fsgs.txt | 73 ++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 73 insertions(+)
 create mode 100644 Documentation/x86/fsgs.txt

diff --git a/Documentation/x86/fsgs.txt b/Documentation/x86/fsgs.txt
new file mode 100644
index 0000000..0a1755f
--- /dev/null
+++ b/Documentation/x86/fsgs.txt
@@ -0,0 +1,73 @@
+
+Using FS and GS prefixes on x86_64-linux
+
+The x86 architecture supports segment prefixes per instruction to add an
+offset to an address.  On 64bit x86, these are mostly nops, except for FS
+and GS.
+
+This offers an efficient way to reference a global pointer.
+
+The compiler has to generate special code to use these base registers,
+or they can be accessed with inline assembler.
+
+	mov %gs:offset,%reg
+	mov %fs:offset,%reg
+
+FS is used to address the thread local segment (TLS), declared using
+__thread.  The compiler the automatically generates the correct prefixes and
+relocations to access these values.
+
+FS is normally managed by the runtime code or the threading library.
+
+GS is freely available, but may need special (compiler or inline assembler)
+code to use.
+
+Traditionally 64bit FS and GS could be set by the arch_prctl system call
+
+	arch_prctl(ARCH_SET_GS, value)
+	arch_prctl(ARCH_SET_FS, value)
+
+[There was also an older method using modify_ldt(), inherited from 32bit,
+but this is not discussed here.]
+
+However using a syscall is problematic for user space threading libraries
+that want to context switch in user space.
+
+Newer Intel CPUs (Ivy Bridge and later) added new instructions to directly
+access these registers quickly from user context
+
+	RDFSBASE %reg	read the FS base	(or _readfsbase_u64)
+	RDGSBASE %reg	read the GS base	(or _readgsbase_u64)
+
+	WRFSBASE %reg	write the FS base	(or _writefsbase_u64)
+	WRGSBASE %reg	write the GS base	(or _writegsbase_u64)
+
+The instructions are supported by the CPU when the "fsgsbase" string is shown in
+/proc/cpuinfo (or directly retrieved through the CPUID instruction).
+The instructions are only available to 64bit binaries.
+
+However the kernel needs to explicitely enable these instructions, as it
+may otherwise not correctly context switch the state. Newer Linux
+kernels enable this. When the kernel did not enable the instruction
+they will fault.
+
+An FSGSBASE enabled kernel can be detected by checking the AT_HWCAP2
+bitmask in the aux vector. When the HWCAP2_FSGSBASE bit is set the
+kernel supports RDFSGSBASE.
+
+	#include <sys/auxv.h>
+	#include <elf.h>
+
+	/* Will be eventually in asm/hwcap.h */
+	#define HWCAP2_FSGSBASE        (1 << 0)
+
+        unsigned val = getauxval(AT_HWCAP2);
+        if (val & HWCAP2_FSGSBASE) {
+                asm("wrgsbase %0" :: "r" (ptr));
+        }
+
+Another requirement is that the FS or GS selector has to be zero
+(is normally true unless changed explicitely)
+
+
+Andi Kleen
-- 
1.9.0



* Re: [PATCH 7/7] x86: Add documentation for rd/wr fs/gs base
  2014-04-28 22:12 ` [PATCH 7/7] x86: Add documentation for rd/wr fs/gs base Andi Kleen
@ 2014-04-29  2:23   ` Randy Dunlap
  0 siblings, 0 replies; 26+ messages in thread
From: Randy Dunlap @ 2014-04-29  2:23 UTC (permalink / raw)
  To: Andi Kleen, x86; +Cc: linux-kernel, Andi Kleen

On 04/28/14 15:12, Andi Kleen wrote:
> From: Andi Kleen <ak@linux.intel.com>
> 
> Signed-off-by: Andi Kleen <ak@linux.intel.com>
> ---
>  Documentation/x86/fsgs.txt | 73 ++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 73 insertions(+)
>  create mode 100644 Documentation/x86/fsgs.txt
> 
> diff --git a/Documentation/x86/fsgs.txt b/Documentation/x86/fsgs.txt
> new file mode 100644
> index 0000000..0a1755f
> --- /dev/null
> +++ b/Documentation/x86/fsgs.txt
> @@ -0,0 +1,73 @@
> +
> +Using FS and GS prefixes on x86_64-linux
> +
> +The x86 architecture supports segment prefixes per instruction to add an
> +offset to an address.  On 64bit x86, these are mostly nops, except for FS

                             64-bit

> +and GS.
> +
> +This offers an efficient way to reference a global pointer.
> +
> +The compiler has to generate special code to use these base registers,
> +or they can be accessed with inline assembler.
> +
> +	mov %gs:offset,%reg
> +	mov %fs:offset,%reg
> +
> +FS is used to address the thread local segment (TLS), declared using
> +__thread.  The compiler the automatically generates the correct prefixes and

                           then

> +relocations to access these values.
> +
> +FS is normally managed by the runtime code or the threading library.
> +
> +GS is freely available, but may need special (compiler or inline assembler)
> +code to use.
> +
> +Traditionally 64bit FS and GS could be set by the arch_prctl system call

                 64-bit

> +
> +	arch_prctl(ARCH_SET_GS, value)
> +	arch_prctl(ARCH_SET_FS, value)
> +
> +[There was also an older method using modify_ldt(), inherited from 32bit,

                                                                      32-bit,

> +but this is not discussed here.]
> +
> +However using a syscall is problematic for user space threading libraries

   However,

> +that want to context switch in user space.
> +
> +Newer Intel CPUs (Ivy Bridge and later) added new instructions to directly
> +access these registers quickly from user context
> +
> +	RDFSBASE %reg	read the FS base	(or _readfsbase_u64)
> +	RDGSBASE %reg	read the GS base	(or _readgsbase_u64)
> +
> +	WRFSBASE %reg	write the FS base	(or _writefsbase_u64)
> +	WRGSBASE %reg	write the GS base	(or _writegsbase_u64)
> +
> +The instructions are supported by the CPU when the "fsgsbase" string is shown in
> +/proc/cpuinfo (or directly retrieved through the CPUID instruction).
> +The instructions are only available to 64bit binaries.

                                          64-bit

> +
> +However the kernel needs to explicitely enable these instructions, as it

   However,                    explicitly

> +may otherwise not correctly context switch the state. Newer Linux
> +kernels enable this. When the kernel did not enable the instruction
> +they will fault.
> +
> +An FSGSBASE enabled kernel can be detected by checking the AT_HWCAP2
> +bitmask in the aux vector. When the HWCAP2_FSGSBASE bit is set the
> +kernel supports RDFSGSBASE.
> +
> +	#include <sys/auxv.h>
> +	#include <elf.h>
> +
> +	/* Will be eventually in asm/hwcap.h */
> +	#define HWCAP2_FSGSBASE        (1 << 0)
> +
> +        unsigned val = getauxval(AT_HWCAP2);
> +        if (val & HWCAP2_FSGSBASE) {
> +                asm("wrgsbase %0" :: "r" (ptr));
> +        }
> +
> +Another requirement is that the FS or GS selector has to be zero
> +(is normally true unless changed explicitely)

                                    explicitly).

> +
> +
> +Andi Kleen
> 


-- 
~Randy


* Re: [PATCH 3/7] x86: Add C intrinsics for new rd/wr fs/gs base instructions
  2014-04-28 22:12 ` [PATCH 3/7] x86: Add C intrinsics for new rd/wr fs/gs base instructions Andi Kleen
@ 2014-04-29 14:10   ` Konrad Rzeszutek Wilk
  0 siblings, 0 replies; 26+ messages in thread
From: Konrad Rzeszutek Wilk @ 2014-04-29 14:10 UTC (permalink / raw)
  To: Andi Kleen; +Cc: x86, linux-kernel, Andi Kleen

On Mon, Apr 28, 2014 at 03:12:37PM -0700, Andi Kleen wrote:
> From: Andi Kleen <ak@linux.intel.com>
> 
> Plus for swapgs.
> 
> Very straight forward. Used in followon patch.

follow on
> 
> Signed-off-by: Andi Kleen <ak@linux.intel.com>
> ---
>  arch/x86/include/asm/fsgs.h | 43 +++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 43 insertions(+)
>  create mode 100644 arch/x86/include/asm/fsgs.h
> 
> diff --git a/arch/x86/include/asm/fsgs.h b/arch/x86/include/asm/fsgs.h
> new file mode 100644
> index 0000000..a3a5cd3
> --- /dev/null
> +++ b/arch/x86/include/asm/fsgs.h
> @@ -0,0 +1,43 @@
> +#ifndef _ASM_FSGS_H
> +#define _ASM_FSGS_H 1
> +
> +static inline void swapgs(void)
> +{
> +	asm volatile("swapgs" ::: "memory");
> +}
> +
> +/* Must be protected by X86_FEATURE_FSGSBASE check. */
> +
> +static inline unsigned long rdgsbase(void)
> +{
> +	unsigned long gs;
> +	asm volatile(".byte 0xf3,0x48,0x0f,0xae,0xc8 # rdgsbaseq %%rax"
> +			: "=a" (gs)
> +			:: "memory");
> +	return gs;
> +}
> +
> +static inline unsigned long rdfsbase(void)
> +{
> +	unsigned long fs;
> +	asm volatile(".byte 0xf3,0x48,0x0f,0xae,0xc0 # rdfsbaseq %%rax"
> +			: "=a" (fs)
> +			:: "memory");
> +	return fs;
> +}
> +
> +static inline void wrgsbase(unsigned long gs)
> +{
> +	asm volatile(".byte 0xf3,0x48,0x0f,0xae,0xd8 # wrgsbaseq %%rax"
> +			:: "a" (gs)
> +			: "memory");
> +}
> +
> +static inline void wrfsbase(unsigned long fs)
> +{
> +	asm volatile(".byte 0xf3,0x48,0x0f,0xae,0xd0 # wrfsbaseq %%rax"
> +			:: "a" (fs)
> +			: "memory");
> +}
> +
> +#endif
> -- 
> 1.9.0
> 


* Re: [PATCH 4/7] x86: Add support for rd/wr fs/gs base
  2014-04-28 22:12 ` [PATCH 4/7] x86: Add support for rd/wr fs/gs base Andi Kleen
@ 2014-04-29 18:19   ` Andy Lutomirski
  2014-04-29 23:39     ` Andi Kleen
  0 siblings, 1 reply; 26+ messages in thread
From: Andy Lutomirski @ 2014-04-29 18:19 UTC (permalink / raw)
  To: Andi Kleen, x86; +Cc: linux-kernel, Andi Kleen

On 04/28/2014 03:12 PM, Andi Kleen wrote:
> From: Andi Kleen <ak@linux.intel.com>
> 
> IvyBridge added new instructions to directly write the fs and gs
> 64bit base registers. Previously this had to be done with a system
> call to write to MSRs. The main use case is fast user space threading
> and switching the fs/gs registers quickly there.
> 
> The instructions are opt-in and have to be explicitly enabled
> by the OS.
> 
> Previously Linux couldn't support this because the paranoid
> entry code relied on the gs base never being negative outside
> the kernel to decide when to use swapgs. It would check the GS MSR
> value and assume it was already running in the kernel if the value
> was negative.
> 
> This patch changes the paranoid entry code to use rdgsbase
> if available.  Then we check the GS value against the expected GS value
> stored at the bottom of the IST stack. If the value is the expected
> value we skip swapgs.
> 
> This is also significantly faster than an MSR read, so it will
> speed up NMIs (critical for profiling).
> 
> An alternative would have been to save/restore the GS value
> unconditionally, but this approach needs fewer changes.
> 
> Then after these changes we also need to use the new instructions
> to save/restore fs and gs, so that the new values set by
> user space won't disappear.  This is also significantly
> faster for the case when the 64bit base has to be switched
> (that is, when the GS base is above 4GB), as we can replace
> the slow MSR write with a faster wr[fg]sbase instruction.
> 
> The instructions do not context switch
> the segment index, so the old invariant that the fs or gs index
> has to be 0 for a different 64bit base value to stick is still
> true. Previously it was enforced by arch_prctl; now the user
> program has to make sure it keeps the segment indexes zero.
> If it doesn't, the changes may not stick.
> 
> This in turn enables fast switching when there are
> enough threads that their TLS segments do not fit below 4GB,
> or alternatively programs that use fs as an additional base
> register will not incur a significant context switch penalty.
> 
> It is all done in a single patch to avoid bisect crash
> holes.
> 


> +paranoid_save_gs:
> +	.byte 0xf3,0x48,0x0f,0xae,0xc9	# rdgsbaseq %rcx
> +	movq $-EXCEPTION_STKSZ,%rax	# non debug stack size
> +	cmpq $DEBUG_STACK,ORIG_RAX+8(%rsp)
> +	movq $-1,ORIG_RAX+8(%rsp)	# no syscall to restart
> +	jne  1f
> +	movq $-DEBUG_STKSZ,%rax		# debug stack size
> +1:
> +	andq %rsp,%rax			# bottom of stack
> +	movq (%rax),%rdi		# get expected GS
> +	cmpq %rdi,%rcx			# is it the kernel gs?

I don't like this part.  There are now three cases:

1. User gs, gsbase != kernel gs base.  This works the same as before

2. Kernel gs.  This also works the same as before.

3. User gs, but gsbase == kernel gs base.  This will cause C code to
execute on the *user* gs base.

Case 3 is annoying.  If nothing tries to change the user gs base, then
everything is okay because the user gs base and the kernel gs bases are
equal.  But if something does try to change the user gs base, then it
will accidentally change the kernel gs base instead.

For the IST entries, this should be fine -- cpu migration, scheduling,
and such are impossible anyway.  For the non-IST entries, I'm less
convinced.  The entry_64.S code suggests that the problematic entries are:

double_fault
stack_segment
machine_check

Of course, all of those entries really do use IST, so I wonder why they
are paranoid*entry instead of paranoid*entry_ist.  Is it because they're
supposedly non-recursive?

In any case, wouldn't this all be much simpler and less magical if the
paranoid entries just saved the old gsbase to the rbx and loaded the new
ones?  The exits could do the inverse.  This should be really fast:

rdgsbaseq %rbx
wrgsbaseq {the correct value}

...

wrgsbaseq %rbx

This still doesn't support changing the usergs value inside a paranoid
entry, but at least it will fail consistently instead of only failing if
the user gs has a particular special value.

I don't know the actual latencies, but I suspect that this would be
faster, too -- it removes some branches, and wrgsbase and rdgsbase
deserve to be faster than swapgs.  It's probably no good for
non-rd/wrgsbase-capable cpus, though, since I suspect that three MSR
accesses are much worse than one MSR access and two swapgs calls.

--Andy

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 4/7] x86: Add support for rd/wr fs/gs base
  2014-04-29 18:19   ` Andy Lutomirski
@ 2014-04-29 23:39     ` Andi Kleen
  2014-04-30  4:52       ` H. Peter Anvin
  0 siblings, 1 reply; 26+ messages in thread
From: Andi Kleen @ 2014-04-29 23:39 UTC (permalink / raw)
  To: Andy Lutomirski; +Cc: Andi Kleen, x86, linux-kernel, Andi Kleen

> Case 3 is annoying.  If nothing tries to change the user gs base, then
> everything is okay because the user gs base and the kernel gs bases are
> equal.  But if something does try to change the user gs base, then it
> will accidentally change the kernel gs base instead.

It doesn't really matter, as they are the same.
They would just switch identities.

Besides I don't think anyone does that.

> 
> For the IST entries, this should be fine -- cpu migration, scheduling,
> and such are impossible anyway.  For the non-IST entries, I'm less
> convinced.  The entry_64.S code suggests that the problematic entries are:
> 
> double_fault
> stack_segment
> machine_check

I don't think any of them can schedule.

> 
> Of course, all of those entries really do use IST, so I wonder why they
> are paranoid*entry instead of paranoid*entry_ist.  Is it because they're
> supposedly non-recursive?

Yes, only the DEBUG stack is big enough to recurse.

> 
> In any case, wouldn't this all be much simpler and less magical if the
> paranoid entries just saved the old gsbase to the rbx and loaded the new
> ones?  The exits could do the inverse.  This should be really fast:

I originally had a similar scheme, but it was significantly
more complicated, with a changed exit path. So I switched to this
"only a single hook needed" variant, which mirrors the existing
code closely.

> I don't know the actual latencies, but I suspect that this would be
> faster, too -- it removes some branches, and wrgsbase and rdgsbase
> deserve to be faster than swapgs.  It's probably no good for
> non-rd/wrgsbase-capable cpus, though, since I suspect that three MSR
> accesses are much worse than one MSR access and two swapgs calls.

Probably doesn't matter much, it's MUCH faster than the old
code in any case.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 4/7] x86: Add support for rd/wr fs/gs base
  2014-04-29 23:39     ` Andi Kleen
@ 2014-04-30  4:52       ` H. Peter Anvin
  2014-04-30  4:57         ` H. Peter Anvin
  2014-04-30 23:44         ` Andy Lutomirski
  0 siblings, 2 replies; 26+ messages in thread
From: H. Peter Anvin @ 2014-04-30  4:52 UTC (permalink / raw)
  To: Andi Kleen, Andy Lutomirski; +Cc: x86, linux-kernel, Andi Kleen

On 04/29/2014 04:39 PM, Andi Kleen wrote:
>> Case 3 is annoying.  If nothing tries to change the user gs base, then
>> everything is okay because the user gs base and the kernel gs bases are
>> equal.  But if something does try to change the user gs base, then it
>> will accidentally change the kernel gs base instead.
> 
> It doesn't really matter, as they are the same.
> They would just switch identities.
> 
> Besides I don't think anyone does that.
> 

It matters -- greatly -- if (and only if) we can enter the kernel with
usergs == kernelgs and then want to change usergs inside a paranoid
routine.  At that point we risk being upside down, which basically means
we're rooted.

However, I believe this patchset also means only IST entries can be
paranoid, which in turn means we can't sleep inside them.  To the very
best of my knowledge the only times we change usergs is on context
switch or inside a system call.  We need to make sure that is actually
the case, though.

I'm at ELC for a few days, so I'll have limited decent-sized-monitor
time, but it shouldn't be too hard to convince ourselves of... mostly a
matter of making sure something like ptrace can't do stupid crap.

	-hpa



^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 5/7] x86: Make old K8 swapgs workaround conditional
  2014-04-28 22:12 ` [PATCH 5/7] x86: Make old K8 swapgs workaround conditional Andi Kleen
@ 2014-04-30  4:57   ` H. Peter Anvin
  0 siblings, 0 replies; 26+ messages in thread
From: H. Peter Anvin @ 2014-04-30  4:57 UTC (permalink / raw)
  To: Andi Kleen, x86; +Cc: linux-kernel, Andi Kleen

On 04/28/2014 03:12 PM, Andi Kleen wrote:
> From: Andi Kleen <ak@linux.intel.com>
> 
> Every gs selector/index reload always paid an extra MFENCE
> between the two SWAPGS. This was to work around an old
> bug in early K8 steppings.  All other CPUs don't need the extra
> mfence. Patch the extra MFENCE only in for K8.
> 
> Signed-off-by: Andi Kleen <ak@linux.intel.com>
> ---
>  arch/x86/include/asm/cpufeature.h |  1 +
>  arch/x86/kernel/cpu/amd.c         |  3 +++
>  arch/x86/kernel/entry_64.S        | 10 +++++++++-
>  3 files changed, 13 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/include/asm/cpufeature.h b/arch/x86/include/asm/cpufeature.h
> index 89270b4..eb4bb46 100644
> --- a/arch/x86/include/asm/cpufeature.h
> +++ b/arch/x86/include/asm/cpufeature.h
> @@ -102,6 +102,7 @@
>  #define X86_FEATURE_APERFMPERF	(3*32+28) /* APERFMPERF */
>  #define X86_FEATURE_EAGER_FPU	(3*32+29) /* "eagerfpu" Non lazy FPU restore */
>  #define X86_FEATURE_NONSTOP_TSC_S3 (3*32+30) /* TSC doesn't stop in S3 state */
> +#define X86_FEATURE_SWAPGS_MFENCE (3*32+31) /* SWAPGS may need MFENCE */
>  

Nitpick: should be an X86_BUG_ instead.

	-hpa



^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 4/7] x86: Add support for rd/wr fs/gs base
  2014-04-30  4:52       ` H. Peter Anvin
@ 2014-04-30  4:57         ` H. Peter Anvin
  2014-04-30 23:44         ` Andy Lutomirski
  1 sibling, 0 replies; 26+ messages in thread
From: H. Peter Anvin @ 2014-04-30  4:57 UTC (permalink / raw)
  To: Andi Kleen, Andy Lutomirski; +Cc: x86, linux-kernel, Andi Kleen

On 04/29/2014 09:52 PM, H. Peter Anvin wrote:
> 
> It matters -- greatly -- if (and only if) we can enter the kernel with
> usergs == kernelgs and then want to change usergs inside a paranoid
> routine.  At that point we risk being upside down, which basically means
> we're rooted.
> 
> However, I believe this patchset also means only IST entries can be
> paranoid, which in turn means we can't sleep inside them.  To the very
> best of my knowledge the only times we change usergs is on context
> switch or inside a system call.  We need to make sure that is actually
> the case, though.
> 
> I'm at ELC for a few days, so I'll have limited decent-sized-monitor
> time, but it shouldn't be too hard to convince ourselves of... mostly a
> matter of making sure something like ptrace can't do stupid crap.
> 

Just in case anyone is getting the wrong impression: this is a
discussion about details.  I'm glad to see this work getting done.

	-hpa



^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 4/7] x86: Add support for rd/wr fs/gs base
  2014-04-30  4:52       ` H. Peter Anvin
  2014-04-30  4:57         ` H. Peter Anvin
@ 2014-04-30 23:44         ` Andy Lutomirski
  2014-04-30 23:47           ` Andy Lutomirski
  2014-05-01 21:15           ` Andi Kleen
  1 sibling, 2 replies; 26+ messages in thread
From: Andy Lutomirski @ 2014-04-30 23:44 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Andi Kleen, X86 ML, linux-kernel@vger.kernel.org, Andi Kleen

On Tue, Apr 29, 2014 at 9:52 PM, H. Peter Anvin <hpa@zytor.com> wrote:
> On 04/29/2014 04:39 PM, Andi Kleen wrote:
>>> Case 3 is annoying.  If nothing tries to change the user gs base, then
>>> everything is okay because the user gs base and the kernel gs bases are
>>> equal.  But if something does try to change the user gs base, then it
>>> will accidentally change the kernel gs base instead.
>>
>> It doesn't really matter, as they are the same.
>> They would just switch identities.
>>
>> Besides I don't think anyone does that.
>>
>
> It matters -- greatly -- if (and only if) we can enter the kernel with
> usergs == kernelgs and then want to change usergs inside a paranoid
> routine.  At that point we risk being upside down, which basically means
> we're rooted.
>
> However, I believe this patchset also means only IST entries can be
> paranoid, which in turn means we can't sleep inside them.  To the very
> best of my knowledge the only times we change usergs is on context
> switch or inside a system call.  We need to make sure that is actually
> the case, though.
>
> I'm at ELC for a few days, so I'll have limited decent-sized-monitor
> time, but it shouldn't be too hard to convince ourselves of... mostly a
> matter of making sure something like ptrace can't do stupid crap.

The only things that look relevant are the context switch paths and
the kvm stuff.  I don't know what happens if an IST exception happens
while running a guest, though.  TBH I have no idea what the VMX and
SVM interfaces look like.

paranoid_schedule looks scary.  If I'm understanding it correctly, it
expects to be executed with gs == usergs.  I think it's okay, since it
will only be invoked if we trapped from userspace, in which case the
state is well-defined.  But this bit could be wrong:

    testl %ebx,%ebx                /* swapgs needed? */
    jnz paranoid_restore
    testl $3,CS(%rsp)
    jnz   paranoid_userspace

If usergs == kernelgs, then ebx will always be 1 and we'll never end
up in paranoid_userspace.

This could be fixed in two ways.  We could just switch the order of
the tests, since the only way to have ebx == 1 and CS with CPL == 3
should be if we're coming from userspace with usergs==kernelgs.  Or we
could get rid of the paranoid schedule code entirely.  Is it actually
needed for anything?  Timer and rescheduling interrupts shouldn't be
paranoid, and if there's any paranoid code that will trigger a
reschedule, couldn't it do it much more sanely by sending an IPI to
self and thus deferring the reschedule until interrupts are enabled?

Alternatively, what if the paranoid entry checked whether we're coming
from userspace at the very beginning and, if so, just jumped to the
non-paranoid entry?

--Andy

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 4/7] x86: Add support for rd/wr fs/gs base
  2014-04-30 23:44         ` Andy Lutomirski
@ 2014-04-30 23:47           ` Andy Lutomirski
  2014-05-01 21:15           ` Andi Kleen
  1 sibling, 0 replies; 26+ messages in thread
From: Andy Lutomirski @ 2014-04-30 23:47 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Andi Kleen, X86 ML, linux-kernel@vger.kernel.org, Andi Kleen

On Wed, Apr 30, 2014 at 4:44 PM, Andy Lutomirski <luto@amacapital.net> wrote:
> On Tue, Apr 29, 2014 at 9:52 PM, H. Peter Anvin <hpa@zytor.com> wrote:
>> On 04/29/2014 04:39 PM, Andi Kleen wrote:
>>>> Case 3 is annoying.  If nothing tries to change the user gs base, then
>>>> everything is okay because the user gs base and the kernel gs bases are
>>>> equal.  But if something does try to change the user gs base, then it
>>>> will accidentally change the kernel gs base instead.
>>>
>>> It doesn't really matter, as they are the same.
>>> They would just switch identities.
>>>
>>> Besides I don't think anyone does that.
>>>
>>
>> It matters -- greatly -- if (and only if) we can enter the kernel with
>> usergs == kernelgs and then want to change usergs inside a paranoid
>> routine.  At that point we risk being upside down, which basically means
>> we're rooted.
>>
>> However, I believe this patchset also means only IST entries can be
>> paranoid, which in turn means we can't sleep inside them.  To the very
>> best of my knowledge the only times we change usergs is on context
>> switch or inside a system call.  We need to make sure that is actually
>> the case, though.
>>
>> I'm at ELC for a few days, so I'll have limited decent-sized-monitor
>> time, but it shouldn't be too hard to convince ourselves of... mostly a
>> matter of making sure something like ptrace can't do stupid crap.
>
> The only things that look relevant are the context switch paths and
> the kvm stuff.  I don't know what happens if an IST exception happens
> while running a guest, though.  TBH I have no idea what the VMX and
> SVM interfaces look like.
>
> paranoid_schedule looks scary.  If I'm understanding it correctly, it
> expects to be executed with gs == usergs.  I think it's okay, since it
> will only be invoked if we trapped from userspace, in which case the
> state is well-defined.  But this bit could be wrong:
>
>     testl %ebx,%ebx                /* swapgs needed? */
>     jnz paranoid_restore
>     testl $3,CS(%rsp)
>     jnz   paranoid_userspace
>
> If usergs == kernelgs, then ebx will always be 1 and we'll never end
> up in paranoid_userspace.
>
> This could be fixed in two ways.  We could just switch the order of
> the tests, since the only way to have ebx == 1 and CS with CPL == 3
> should be if we're coming from userspace with usergs==kernelgs.  Or we
> could get rid of the paranoid schedule code entirely.  Is it actually
> needed for anything?  Timer and rescheduling interrupts shouldn't be
> paranoid, and if there's any paranoid code that will trigger a
> reschedule, couldn't it do it much more sanely by sending an IPI to
> self and thus deferring the reschedule until interrupts are enabled?

Having just asked this, isn't the current code already broken if
something like an NMI or MCE tried to reschedule the current cpu?  It
could hit just before running hlt in the non-polling idle loop or it
could happen during execution of a kernel thread.  In either case, I
don't see why anything is guaranteed to notice the resched flag being
set.

--Andy

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 4/7] x86: Add support for rd/wr fs/gs base
  2014-04-30 23:44         ` Andy Lutomirski
  2014-04-30 23:47           ` Andy Lutomirski
@ 2014-05-01 21:15           ` Andi Kleen
  2014-05-01 21:39             ` Andy Lutomirski
  2014-05-01 21:58             ` H. Peter Anvin
  1 sibling, 2 replies; 26+ messages in thread
From: Andi Kleen @ 2014-05-01 21:15 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: H. Peter Anvin, Andi Kleen, X86 ML, linux-kernel@vger.kernel.org,
	Andi Kleen

> If usergs == kernelgs, then ebx will always be 1 and we'll never end
> up in paranoid_userspace.

You may miss a reschedule in this obscure case. It shouldn't really
happen because loading a kernel pointer is not useful for user space.

Doesn't seem like a real issue to me.

We only need to handle it to avoid crashing.

> Alternatively, what if the paranoid entry checked whether we're coming
> from userspace at the very beginning and, if so, just jumped to the
> non-paranoid entry?

That would work, but I doubt it would be worth it.

-Andi

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 4/7] x86: Add support for rd/wr fs/gs base
  2014-05-01 21:15           ` Andi Kleen
@ 2014-05-01 21:39             ` Andy Lutomirski
  2014-05-01 21:51               ` Andi Kleen
  2014-05-01 21:58             ` H. Peter Anvin
  1 sibling, 1 reply; 26+ messages in thread
From: Andy Lutomirski @ 2014-05-01 21:39 UTC (permalink / raw)
  To: Andi Kleen
  Cc: H. Peter Anvin, X86 ML, linux-kernel@vger.kernel.org, Andi Kleen

On Thu, May 1, 2014 at 2:15 PM, Andi Kleen <andi@firstfloor.org> wrote:
>> If usergs == kernelgs, then ebx will always be 1 and we'll never end
>> up in paranoid_userspace.
>
> You may miss a reschedule in this obscure case. It shouldn't really
> happen because loading a kernel pointer is not useful for user space.
>
> Doesn't seem like a real issue to me.
>
> We only need to handle it to avoid crashing.

Allowing userspace to prevent itself from being rescheduled by loading
something strange into gsbase seems unfortunate.

--Andy

>
>> Alternatively, what if the paranoid entry checked whether we're coming
>> from userspace at the very beginning and, if so, just jumped to the
>> non-paranoid entry?
>
> That would work, but I doubt it would be worth it.
>


> -Andi



-- 
Andy Lutomirski
AMA Capital Management, LLC

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 4/7] x86: Add support for rd/wr fs/gs base
  2014-05-01 21:39             ` Andy Lutomirski
@ 2014-05-01 21:51               ` Andi Kleen
  2014-05-01 21:53                 ` Andy Lutomirski
  0 siblings, 1 reply; 26+ messages in thread
From: Andi Kleen @ 2014-05-01 21:51 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Andi Kleen, H. Peter Anvin, X86 ML, linux-kernel@vger.kernel.org,
	Andi Kleen

> Allowing userspace to prevent itself from being rescheduled by loading
> something strange into gsbase seems unfortunate.

The timer tick will eventually catch it, so any delay is tightly
bounded.

Also, it still gets rescheduled most of the time, just not when a paranoid
exception handler is running.

-Andi

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 4/7] x86: Add support for rd/wr fs/gs base
  2014-05-01 21:51               ` Andi Kleen
@ 2014-05-01 21:53                 ` Andy Lutomirski
  0 siblings, 0 replies; 26+ messages in thread
From: Andy Lutomirski @ 2014-05-01 21:53 UTC (permalink / raw)
  To: Andi Kleen
  Cc: H. Peter Anvin, X86 ML, linux-kernel@vger.kernel.org, Andi Kleen

On Thu, May 1, 2014 at 2:51 PM, Andi Kleen <andi@firstfloor.org> wrote:
>> Allowing userspace to prevent itself from being rescheduled by loading
>> something strange into gsbase seems unfortunate.
>
> The timer tick will eventually catch it, so any delay is tightly
> bounded.
>

What about NO_HZ_FULL?


> Also, it still gets rescheduled most of the time, just not when a paranoid
> exception handler is running.

If rescheduling on exit from a paranoid exception handler isn't
important, then let's just remove it.  Otherwise let's keep it
working.

--Andy

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 4/7] x86: Add support for rd/wr fs/gs base
  2014-05-01 21:15           ` Andi Kleen
  2014-05-01 21:39             ` Andy Lutomirski
@ 2014-05-01 21:58             ` H. Peter Anvin
  2014-05-01 22:06               ` Andy Lutomirski
  2014-05-01 22:18               ` Andi Kleen
  1 sibling, 2 replies; 26+ messages in thread
From: H. Peter Anvin @ 2014-05-01 21:58 UTC (permalink / raw)
  To: Andi Kleen, Andy Lutomirski
  Cc: X86 ML, linux-kernel@vger.kernel.org, Andi Kleen

On 05/01/2014 02:15 PM, Andi Kleen wrote:
>> If usergs == kernelgs, then ebx will always be 1 and we'll never end
>> up in paranoid_userspace.
> 
> You may miss a reschedule in this obscure case. It shouldn't really
> happen because loading a kernel pointer is not useful for user space.
> 
> Doesn't seem like a real issue to me.
> 
> We only need to handle it to avoid crashing.
> 

No, it would be a rootable security hole, not just a crash.

>> Alternatively, what if the paranoid entry checked whether we're coming
>> from userspace at the very beginning and, if so, just jumped to the
>> non-paranoid entry?
> 
> That would work, but I doubt it would be worth it.

If that would solve the problem it is simple enough, but the tricky part
is when we end up in a "crack" where we are in kernel mode with the user GS.

I haven't looked through the flows (I'm at LCE so I have limited screen
bandwidth) to see how that would be handled in this case, but in the
general paranoid case it comes down to the fact that in this particular
subcase we don't necessarily know exactly how many SWAPGS are between us
and userspace after we IRET.

	-hpa



^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 4/7] x86: Add support for rd/wr fs/gs base
  2014-05-01 21:58             ` H. Peter Anvin
@ 2014-05-01 22:06               ` Andy Lutomirski
  2014-05-01 22:18               ` Andi Kleen
  1 sibling, 0 replies; 26+ messages in thread
From: Andy Lutomirski @ 2014-05-01 22:06 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Andi Kleen, X86 ML, linux-kernel@vger.kernel.org, Andi Kleen

On Thu, May 1, 2014 at 2:58 PM, H. Peter Anvin <hpa@zytor.com> wrote:
> On 05/01/2014 02:15 PM, Andi Kleen wrote:
>>> If usergs == kernelgs, then ebx will always be 1 and we'll never end
>>> up in paranoid_userspace.
>>
>> You may miss a reschedule in this obscure case. It shouldn't really
>> happen because loading a kernel pointer is not useful for user space.
>>
>> Doesn't seem like a real issue to me.
>>
>> We only need to handle it to avoid crashing.
>>
>
> No, it would be a rootable security hole, not just a crash.
>
>>> Alternatively, what if the paranoid entry checked whether we're coming
>>> from userspace at the very beginning and, if so, just jumped to the
>>> non-paranoid entry?
>>
>> That would work, but I doubt it would be worth it.
>
> If that would solve the problem it is simple enough, but the tricky part
> is when we end up in a "crack" where we are in kernel mode with the user GS.
>
> I haven't looked through the flows (I'm at LCE so I have limited screen
> bandwidth) to see how that would be handled in this case, but in the
> general paranoid case it comes down to the fact that in this particular
> subcase we don't necessarily know exactly how many SWAPGS are between us
> and userspace after we IRET.

The current code looks like it will never try to reschedule on
paranoid exit unless it came from user *CS*, in which case there
shouldn't be any weird gs issues.  Given that the current code won't
reschedule even on a paranoid entry that hits during interruptable
kernel code, I find it unlikely that this code is important.  You
probably know more about its history and significance than I do.

What happens when ftrace or perf tries to wake a task from a debug
interrupt or NMI?

--Andy

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 4/7] x86: Add support for rd/wr fs/gs base
  2014-05-01 21:58             ` H. Peter Anvin
  2014-05-01 22:06               ` Andy Lutomirski
@ 2014-05-01 22:18               ` Andi Kleen
  2014-05-01 22:45                 ` H. Peter Anvin
  1 sibling, 1 reply; 26+ messages in thread
From: Andi Kleen @ 2014-05-01 22:18 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Andi Kleen, Andy Lutomirski, X86 ML, linux-kernel@vger.kernel.org

> I haven't looked through the flows (I'm at LCE so I have limited screen
> bandwidth) to see how that would be handled in this case, but in the
> general paranoid case it comes down to the fact that in this particular
> subcase we don't necessarily know exactly how many SWAPGS are between us
> and userspace after we IRET.

There is none as far as I know. Certainly wasn't any when the code
was originally written.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 4/7] x86: Add support for rd/wr fs/gs base
  2014-05-01 22:18               ` Andi Kleen
@ 2014-05-01 22:45                 ` H. Peter Anvin
  0 siblings, 0 replies; 26+ messages in thread
From: H. Peter Anvin @ 2014-05-01 22:45 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Andi Kleen, Andy Lutomirski, X86 ML, linux-kernel@vger.kernel.org

On 05/01/2014 03:18 PM, Andi Kleen wrote:
>> I haven't looked through the flows (I'm at LCE so I have limited screen
>> bandwidth) to see how that would be handled in this case, but in the
>> general paranoid case it comes down to the fact that in this particular
>> subcase we don't necessarily know exactly how many SWAPGS are between us
>> and userspace after we IRET.
> 
> There is none as far as I know. Certainly wasn't any when the code
> was originally written.
> 

This applies for an asynchronous entry from kernel space.  Obviously in
the case where we actually come directly from user space (the stack
frame CS.RPL == 3) then that doesn't apply.

	-hpa



^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 1/7] percpu: Add a DEFINE_PER_CPU_2PAGE_ALIGNED
  2014-04-28 22:12 ` [PATCH 1/7] percpu: Add a DEFINE_PER_CPU_2PAGE_ALIGNED Andi Kleen
@ 2014-05-02 15:18   ` Tejun Heo
  0 siblings, 0 replies; 26+ messages in thread
From: Tejun Heo @ 2014-05-02 15:18 UTC (permalink / raw)
  To: Andi Kleen; +Cc: x86, linux-kernel, Andi Kleen

On Mon, Apr 28, 2014 at 03:12:35PM -0700, Andi Kleen wrote:
> From: Andi Kleen <ak@linux.intel.com>
> 
> Needed in a follow-on patch which needs to naturally align an 8K stack.
> I just put it into the page-aligned section. This may cause an extra
> 4K hole if we're unlucky.

Add comment to that effect?  Saying that this should be broken into a
separate section if it ever grows more users?

> Cc: tj@kernel.org
> Signed-off-by: Andi Kleen <ak@linux.intel.com>

Acked-by: Tejun Heo <tj@kernel.org>

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 26+ messages in thread

end of thread, other threads:[~2014-05-02 15:18 UTC | newest]

Thread overview: 26+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-04-28 22:12 Add support for RD/WR FS/GSBASE Andi Kleen
2014-04-28 22:12 ` [PATCH 1/7] percpu: Add a DEFINE_PER_CPU_2PAGE_ALIGNED Andi Kleen
2014-05-02 15:18   ` Tejun Heo
2014-04-28 22:12 ` [PATCH 2/7] x86: Naturally align the debug IST stack Andi Kleen
2014-04-28 22:12 ` [PATCH 3/7] x86: Add C intrinsics for new rd/wr fs/gs base instructions Andi Kleen
2014-04-29 14:10   ` Konrad Rzeszutek Wilk
2014-04-28 22:12 ` [PATCH 4/7] x86: Add support for rd/wr fs/gs base Andi Kleen
2014-04-29 18:19   ` Andy Lutomirski
2014-04-29 23:39     ` Andi Kleen
2014-04-30  4:52       ` H. Peter Anvin
2014-04-30  4:57         ` H. Peter Anvin
2014-04-30 23:44         ` Andy Lutomirski
2014-04-30 23:47           ` Andy Lutomirski
2014-05-01 21:15           ` Andi Kleen
2014-05-01 21:39             ` Andy Lutomirski
2014-05-01 21:51               ` Andi Kleen
2014-05-01 21:53                 ` Andy Lutomirski
2014-05-01 21:58             ` H. Peter Anvin
2014-05-01 22:06               ` Andy Lutomirski
2014-05-01 22:18               ` Andi Kleen
2014-05-01 22:45                 ` H. Peter Anvin
2014-04-28 22:12 ` [PATCH 5/7] x86: Make old K8 swapgs workaround conditional Andi Kleen
2014-04-30  4:57   ` H. Peter Anvin
2014-04-28 22:12 ` [PATCH 6/7] x86: Enumerate kernel FSGS capability in AT_HWCAP2 Andi Kleen
2014-04-28 22:12 ` [PATCH 7/7] x86: Add documentation for rd/wr fs/gs base Andi Kleen
2014-04-29  2:23   ` Randy Dunlap

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox