* [PATCH 0/6] WIP.x86/mm fixes
@ 2017-12-01  6:29 Andy Lutomirski
  2017-12-01  6:29 ` [PATCH 1/6] x86/orc: Don't bail on stack overflow Andy Lutomirski
                   ` (5 more replies)
  0 siblings, 6 replies; 9+ messages in thread
From: Andy Lutomirski @ 2017-12-01  6:29 UTC (permalink / raw)
  To: x86
  Cc: linux-kernel, Borislav Petkov, Brian Gerst, David Laight,
	Kees Cook, Peter Zijlstra, Andy Lutomirski

This is a bit oddly formatted, since it's meant to be a set of changes
to a tree, not a normal patch set.

"x86/orc: Don't bail on stack overflow" is a fixed version of
"x86/unwinder/orc: Don't bail on stack overflow".  If you'd rather
just manually patch it, change "regs->sp" to "state->sp".  Bug noticed
by Dan Carpenter.

Patch 2 is a bugfix that prevents a potential KVM explosion.  The
original patch failed to update KVM.  Thanks, KVM, for having a
separate copy of everything related to CPU state.
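
To illustrate why a stale reference matters here: once the hardware TSS
is no longer the first member of its containing structure, the address
of the container and the address of the hardware portion differ.  The
following is a minimal user-space sketch, not kernel code; the struct
names, field names, and sizes (demo_hw_tss, demo_tss_struct,
scratch_stack) are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the hardware-defined TSS layout. */
struct demo_hw_tss {
	uint32_t reserved;
	uint64_t sp0;
	uint64_t sp1;
};

/* Hypothetical container: a scratch stack first, hardware part second. */
struct demo_tss_struct {
	unsigned long scratch_stack[64];
	struct demo_hw_tss hw;
};

/* The hardware part is deliberately not at offset 0 in this sketch. */
static_assert(offsetof(struct demo_tss_struct, hw) != 0,
	      "hardware TSS must not be the first member in this demo");

/* Address that a "TR base"-style consumer has to be given. */
uintptr_t demo_tr_base(const struct demo_tss_struct *t)
{
	return (uintptr_t)&t->hw;
}

/* Distance between the container start and the hardware portion. */
size_t demo_hw_offset(void)
{
	return offsetof(struct demo_tss_struct, hw);
}
```

Passing the container's address where demo_tr_base() is expected would
be off by demo_hw_offset() bytes, which is the kind of mismatch the
one-line KVM change avoids.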

Patch 3 is another bugfix that prevents a potential KVM explosion
once the rest of KAISER is patched in.  (I haven't tested, but I imagine
we'd blow up horribly on the first interrupt from user mode after a
VM exit.)

Patch 4 fixes a *huge* performance regression.  Well, not as huge as
KAISER, but still huge.  It turns out that pushq; retq is very, very
slow.

Patch 5 fixes a potential bug.  Thomas, I think you said you had a fix
on top of this fix.  If you want my help, let me know.

Patch 6 is new.  It makes the TSS remap RO on 64-bit kernels.

Andy Lutomirski (6):
  x86/orc: Don't bail on stack overflow
  Fixup "x86/asm: Fix assumptions that the HW TSS is at the beginning of
    cpu_tss"
  Fixup "x86/asm: Remap the TSS into the cpu entry area"
  Unsuck "x86/entry/64: Create a percpu SYSCALL entry trampoline"
  Fixup "x86/entry/64: Move the IST stacks into cpu_entry_area"
  x86/entry/64: Make cpu_entry_area.tss read-only

 arch/x86/entry/entry_32.S          |  4 ++--
 arch/x86/entry/entry_64.S          | 24 +++++++++++++------
 arch/x86/include/asm/fixmap.h      | 15 ++++++++----
 arch/x86/include/asm/processor.h   | 17 +++++++------
 arch/x86/include/asm/switch_to.h   |  4 ++--
 arch/x86/include/asm/thread_info.h |  2 +-
 arch/x86/kernel/asm-offsets.c      |  6 ++---
 arch/x86/kernel/asm-offsets_32.c   |  4 ++--
 arch/x86/kernel/cpu/common.c       | 49 +++++++++++++++++++++++++++-----------
 arch/x86/kernel/ioport.c           |  2 +-
 arch/x86/kernel/process.c          |  6 ++---
 arch/x86/kernel/process_32.c       |  2 +-
 arch/x86/kernel/process_64.c       |  2 +-
 arch/x86/kernel/traps.c            | 10 ++++++--
 arch/x86/kernel/unwind_orc.c       | 14 +++++++++--
 arch/x86/kvm/vmx.c                 |  2 +-
 arch/x86/lib/delay.c               |  4 ++--
 arch/x86/xen/enlighten_pv.c        |  2 +-
 18 files changed, 110 insertions(+), 59 deletions(-)

-- 
2.13.6

^ permalink raw reply	[flat|nested] 9+ messages in thread

* [PATCH 1/6] x86/orc: Don't bail on stack overflow
  2017-12-01  6:29 [PATCH 0/6] WIP.x86/mm fixes Andy Lutomirski
@ 2017-12-01  6:29 ` Andy Lutomirski
  2017-12-01  6:29 ` [PATCH 2/6] Fixup "x86/asm: Fix assumptions that the HW TSS is at the beginning of cpu_tss" Andy Lutomirski
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 9+ messages in thread
From: Andy Lutomirski @ 2017-12-01  6:29 UTC (permalink / raw)
  To: x86
  Cc: linux-kernel, Borislav Petkov, Brian Gerst, David Laight,
	Kees Cook, Peter Zijlstra, Andy Lutomirski

If we overflow the stack into a guard page and then try to unwind it
with ORC, it should work well: by construction, there can't be any
meaningful data in the guard page because no writes to the guard page
will have succeeded.

This patch fixes a bug that prevents unwinding from working correctly: if the
starting register state has RSP pointing into a stack guard page, the
ORC unwinder bails out immediately.  This patch fixes that: the ORC
unwinder will start the unwind.
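
The fallback can be modelled in user space as follows.  This is only a
sketch under stated assumptions: demo_stack, demo_on_stack() and
demo_can_start_unwind() are hypothetical stand-ins for the kernel's
stack_info and get_stack_info(), and the page size is fixed at 4096.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_PAGE_SIZE 4096UL

/* Round addr up to the next page boundary (no-op if already aligned). */
uintptr_t demo_page_align(uintptr_t addr)
{
	return (addr + DEMO_PAGE_SIZE - 1) & ~(DEMO_PAGE_SIZE - 1);
}

/* Hypothetical stack descriptor: valid addresses are [begin, end). */
struct demo_stack {
	uintptr_t begin;
	uintptr_t end;
};

bool demo_on_stack(const struct demo_stack *s, uintptr_t addr)
{
	assert(s != NULL);
	return addr >= s->begin && addr < s->end;
}

/*
 * Return true if an unwind can start from sp: either sp is on the
 * stack, or sp is in the page just below it, in which case the next
 * page boundary up is the lowest address of the stack.
 */
bool demo_can_start_unwind(const struct demo_stack *s, uintptr_t sp)
{
	if (demo_on_stack(s, sp))
		return true;
	return demo_on_stack(s, demo_page_align(sp));
}
```

With a stack covering 0x10000..0x14000, an sp of 0xfff8 (eight bytes
into the guard page) is rejected by the first check but accepted by the
second, because it rounds up to 0x10000.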

I tested this by intentionally overflowing the task stack.  The
result is an accurate call trace instead of a trace consisting
purely of '?' entries.

There are a few other bugs that are triggered if the unwinder
encounters a stack overflow after the first step, and Josh has WIP
patches to fix those as well.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/kernel/unwind_orc.c | 14 ++++++++++++--
 1 file changed, 12 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/unwind_orc.c b/arch/x86/kernel/unwind_orc.c
index a3f973b2c97a..ff8e1132b2ae 100644
--- a/arch/x86/kernel/unwind_orc.c
+++ b/arch/x86/kernel/unwind_orc.c
@@ -553,8 +553,18 @@ void __unwind_start(struct unwind_state *state, struct task_struct *task,
 	}
 
 	if (get_stack_info((unsigned long *)state->sp, state->task,
-			   &state->stack_info, &state->stack_mask))
-		return;
+			   &state->stack_info, &state->stack_mask)) {
+		/*
+		 * We weren't on a valid stack.  It's possible that
+		 * we overflowed a valid stack into a guard page.
+		 * See if the next page up is valid so that we can
+		 * generate some kind of backtrace if this happens.
+		 */
+		void *next_page = (void *)PAGE_ALIGN((unsigned long)state->sp);
+		if (get_stack_info(next_page, state->task, &state->stack_info,
+				   &state->stack_mask))
+			return;
+	}
 
 	/*
 	 * The caller can provide the address of the first frame directly
-- 
2.13.6

^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [PATCH 2/6] Fixup "x86/asm: Fix assumptions that the HW TSS is at the beginning of cpu_tss"
  2017-12-01  6:29 [PATCH 0/6] WIP.x86/mm fixes Andy Lutomirski
  2017-12-01  6:29 ` [PATCH 1/6] x86/orc: Don't bail on stack overflow Andy Lutomirski
@ 2017-12-01  6:29 ` Andy Lutomirski
  2017-12-01  6:29 ` [PATCH 3/6] Fixup "x86/asm: Remap the TSS into the cpu entry area" Andy Lutomirski
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 9+ messages in thread
From: Andy Lutomirski @ 2017-12-01  6:29 UTC (permalink / raw)
  To: x86
  Cc: linux-kernel, Borislav Petkov, Brian Gerst, David Laight,
	Kees Cook, Peter Zijlstra, Andy Lutomirski

Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/kvm/vmx.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index a6f4f095f8f4..2abe0073b573 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -2291,7 +2291,7 @@ static void vmx_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
 		 * processors.  See 22.2.4.
 		 */
 		vmcs_writel(HOST_TR_BASE,
-			    (unsigned long)this_cpu_ptr(&cpu_tss));
+			    (unsigned long)this_cpu_ptr(&cpu_tss.x86_tss));
 		vmcs_writel(HOST_GDTR_BASE, (unsigned long)gdt);   /* 22.2.4 */
 
 		/*
-- 
2.13.6

^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [PATCH 3/6] Fixup "x86/asm: Remap the TSS into the cpu entry area"
  2017-12-01  6:29 [PATCH 0/6] WIP.x86/mm fixes Andy Lutomirski
  2017-12-01  6:29 ` [PATCH 1/6] x86/orc: Don't bail on stack overflow Andy Lutomirski
  2017-12-01  6:29 ` [PATCH 2/6] Fixup "x86/asm: Fix assumptions that the HW TSS is at the beginning of cpu_tss" Andy Lutomirski
@ 2017-12-01  6:29 ` Andy Lutomirski
  2017-12-01  6:29 ` [PATCH 4/6] Unsuck "x86/entry/64: Create a percpu SYSCALL entry trampoline" Andy Lutomirski
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 9+ messages in thread
From: Andy Lutomirski @ 2017-12-01  6:29 UTC (permalink / raw)
  To: x86
  Cc: linux-kernel, Borislav Petkov, Brian Gerst, David Laight,
	Kees Cook, Peter Zijlstra, Andy Lutomirski

Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/kvm/vmx.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 2abe0073b573..62ee4362e1c1 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -2291,7 +2291,7 @@ static void vmx_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
 		 * processors.  See 22.2.4.
 		 */
 		vmcs_writel(HOST_TR_BASE,
-			    (unsigned long)this_cpu_ptr(&cpu_tss.x86_tss));
+			    (unsigned long)&get_cpu_entry_area(cpu)->tss.x86_tss);
 		vmcs_writel(HOST_GDTR_BASE, (unsigned long)gdt);   /* 22.2.4 */
 
 		/*
-- 
2.13.6

^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [PATCH 4/6] Unsuck "x86/entry/64: Create a percpu SYSCALL entry trampoline"
  2017-12-01  6:29 [PATCH 0/6] WIP.x86/mm fixes Andy Lutomirski
                   ` (2 preceding siblings ...)
  2017-12-01  6:29 ` [PATCH 3/6] Fixup "x86/asm: Remap the TSS into the cpu entry area" Andy Lutomirski
@ 2017-12-01  6:29 ` Andy Lutomirski
  2017-12-02 15:18   ` Josh Poimboeuf
  2017-12-01  6:29 ` [PATCH 5/6] Fixup "x86/entry/64: Move the IST stacks into cpu_entry_area" Andy Lutomirski
  2017-12-01  6:29 ` [PATCH 6/6] x86/entry/64: Make cpu_entry_area.tss read-only Andy Lutomirski
  5 siblings, 1 reply; 9+ messages in thread
From: Andy Lutomirski @ 2017-12-01  6:29 UTC (permalink / raw)
  To: x86
  Cc: linux-kernel, Borislav Petkov, Brian Gerst, David Laight,
	Kees Cook, Peter Zijlstra, Andy Lutomirski

This fixes a huge performance regression.

Please add to the changelog:

This patch actually seems to be a small speedup.  With this patch,
SYSCALL touches an extra cache line and an extra virtual page, but
the pipeline no longer stalls waiting for SWAPGS.  It seems that, at
least in a tight loop, the latter outweighs the former.

Thanks to David Laight for an optimization tip.

[end addition to changelog]
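
For readers wondering why the trampoline needs an indirect jump at all,
here is a small arithmetic sketch.  It assumes (this is my reading, not
something stated in the changelog) that the trampoline executes from an
alias address different from the one it was linked at, so a rel32
displacement computed at link time lands in the wrong place.  All
addresses and function names below are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/*
 * rel32 displacement that would be encoded for a relative jump whose
 * following instruction sits at next_ip.
 */
int32_t demo_rel32(uint64_t next_ip, uint64_t target)
{
	int64_t disp = (int64_t)(target - next_ip);

	assert(disp >= INT32_MIN && disp <= INT32_MAX);
	return (int32_t)disp;
}

/* Where a relative jump lands when it is executed at run_next_ip. */
uint64_t demo_rel_jump_dest(uint64_t run_next_ip, int32_t rel32)
{
	return run_next_ip + (uint64_t)(int64_t)rel32;
}

/* Where an indirect jump through a register lands: the register value. */
uint64_t demo_indirect_jump_dest(uint64_t reg_value)
{
	return reg_value;
}
```

The relative form depends on where the instruction runs; the indirect
form carries an absolute address and so is unaffected by the alias.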

Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/entry/entry_64.S | 16 +++++++++++++---
 1 file changed, 13 insertions(+), 3 deletions(-)

diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index caf74a1bb3de..28f4e7553c26 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -180,14 +180,24 @@ ENTRY(entry_SYSCALL_64_trampoline)
 
 	/*
 	 * x86 lacks a near absolute jump, and we can't jump to the real
-	 * entry text with a relative jump, so we fake it using retq.
+	 * entry text with a relative jump.  We could push the target
+	 * address and then use retq, but this destroys the pipeline on
+	 * many CPUs (wasting over 20 cycles on Sandy Bridge).  Instead,
+	 * spill RDI and restore it in a second-stage trampoline.
 	 */
-	pushq	$entry_SYSCALL_64_after_hwframe
-	retq
+	pushq	%rdi
+	movq	$entry_SYSCALL_64_stage2, %rdi
+	jmp	*%rdi
 END(entry_SYSCALL_64_trampoline)
 
 	.popsection
 
+ENTRY(entry_SYSCALL_64_stage2)
+	UNWIND_HINT_EMPTY
+	popq	%rdi
+	jmp	entry_SYSCALL_64_after_hwframe
+END(entry_SYSCALL_64_stage2)
+
 ENTRY(entry_SYSCALL_64)
 	UNWIND_HINT_EMPTY
 	/*
-- 
2.13.6

^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [PATCH 5/6] Fixup "x86/entry/64: Move the IST stacks into cpu_entry_area"
  2017-12-01  6:29 [PATCH 0/6] WIP.x86/mm fixes Andy Lutomirski
                   ` (3 preceding siblings ...)
  2017-12-01  6:29 ` [PATCH 4/6] Unsuck "x86/entry/64: Create a percpu SYSCALL entry trampoline" Andy Lutomirski
@ 2017-12-01  6:29 ` Andy Lutomirski
  2017-12-01  6:29 ` [PATCH 6/6] x86/entry/64: Make cpu_entry_area.tss read-only Andy Lutomirski
  5 siblings, 0 replies; 9+ messages in thread
From: Andy Lutomirski @ 2017-12-01  6:29 UTC (permalink / raw)
  To: x86
  Cc: linux-kernel, Borislav Petkov, Brian Gerst, David Laight,
	Kees Cook, Peter Zijlstra, Andy Lutomirski

I'm not entirely certain, but I suspect this caused the last kbuild
bot error.  I wasn't able to reproduce it, but it seems plausible.

Add to the commit log:

The IST stacks are unlike the rest of cpu_entry_area: they're used
even for entries from kernel mode.  This means that they should be set
up before we load the final IDT.  Since the kernel sets up all
possible CPUs' percpu areas early in boot of the BP, move
cpu_entry_area setup to trap_init() and do it for all CPUs at once.
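
The ordering constraint can be pictured with a toy model.  This is a
sketch only: demo_boot_state and the demo_* functions are hypothetical
and merely mimic "set up every possible CPU's entry area, then install
the traps".

```c
#include <assert.h>
#include <stdbool.h>

#define DEMO_NR_CPUS 8

struct demo_boot_state {
	bool possible[DEMO_NR_CPUS];         /* which CPUs may ever come up */
	bool entry_area_ready[DEMO_NR_CPUS]; /* per-CPU entry area mapped? */
	bool traps_installed;                /* final "IDT" loaded? */
};

/* Map the entry area of every possible CPU in one pass. */
void demo_setup_entry_areas(struct demo_boot_state *b)
{
	/* This must happen before any trap can be delivered. */
	assert(!b->traps_installed);
	for (int cpu = 0; cpu < DEMO_NR_CPUS; cpu++)
		if (b->possible[cpu])
			b->entry_area_ready[cpu] = true;
}

void demo_install_traps(struct demo_boot_state *b)
{
	b->traps_installed = true;
}

/* Mirrors the order used in the patch: entry areas first, traps second. */
void demo_trap_init(struct demo_boot_state *b)
{
	demo_setup_entry_areas(b);
	demo_install_traps(b);
}

/* A CPU can safely take an IST-style exception only if both are done. */
bool demo_can_take_ist_exception(const struct demo_boot_state *b, int cpu)
{
	return b->traps_installed && b->possible[cpu] &&
	       b->entry_area_ready[cpu];
}
```

Doing the setup for all possible CPUs from the boot CPU means no
secondary CPU can reach its own initialization with traps live and its
entry area still unmapped.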

Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/include/asm/fixmap.h |  2 ++
 arch/x86/kernel/cpu/common.c  | 26 +++++++++++++++++++-------
 arch/x86/kernel/traps.c       |  6 ++++++
 3 files changed, 27 insertions(+), 7 deletions(-)

diff --git a/arch/x86/include/asm/fixmap.h b/arch/x86/include/asm/fixmap.h
index 5a1013df456e..9a4caed665fd 100644
--- a/arch/x86/include/asm/fixmap.h
+++ b/arch/x86/include/asm/fixmap.h
@@ -242,5 +242,7 @@ static inline struct SYSENTER_stack *cpu_SYSENTER_stack(int cpu)
 	return &get_cpu_entry_area((cpu))->tss.SYSENTER_stack;
 }
 
+extern void setup_cpu_entry_areas(void);
+
 #endif /* !__ASSEMBLY__ */
 #endif /* _ASM_X86_FIXMAP_H */
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 1509f09abf5e..c0f11a684acf 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -490,7 +490,8 @@ void load_percpu_segment(int cpu)
 	load_stack_canary_segment();
 }
 
-static void set_percpu_fixmap_pages(int fixmap_index, void *ptr, int pages, pgprot_t prot)
+static void __init
+set_percpu_fixmap_pages(int fixmap_index, void *ptr, int pages, pgprot_t prot)
 {
 	int i;
 
@@ -520,7 +521,7 @@ static DEFINE_PER_CPU_PAGE_ALIGNED(char, exception_stacks
 #endif
 
 /* Setup the fixmap mappings only once per-processor */
-static inline void setup_cpu_entry_area(int cpu)
+static void __init setup_cpu_entry_area(int cpu)
 {
 #ifdef CONFIG_X86_64
 	extern char _entry_trampoline[];
@@ -569,7 +570,7 @@ static inline void setup_cpu_entry_area(int cpu)
 				PAGE_KERNEL);
 
 #ifdef CONFIG_X86_32
-	this_cpu_write(cpu_entry_area, get_cpu_entry_area(cpu));
+	per_cpu(cpu_entry_area, cpu) = get_cpu_entry_area(cpu);
 #endif
 
 #ifdef CONFIG_X86_64
@@ -586,6 +587,21 @@ static inline void setup_cpu_entry_area(int cpu)
 #endif
 }
 
+void __init setup_cpu_entry_areas(void)
+{
+	int cpu;
+
+	/*
+	 * For better or for worse, the kernel allocates percpu space
+	 * for all possible CPUs early in BP startup.  Map every CPU's
+	 * cpu_entry_area right off the bat so that they're available
+	 * before anything in AP boot could need them.
+	 */
+	for_each_possible_cpu(cpu) {
+		setup_cpu_entry_area(cpu);
+	}
+}
+
 /* Load the original GDT from the per-cpu structure */
 void load_direct_gdt(int cpu)
 {
@@ -1658,8 +1674,6 @@ void cpu_init(void)
 	initialize_tlbstate_and_flush();
 	enter_lazy_tlb(&init_mm, me);
 
-	setup_cpu_entry_area(cpu);
-
 	/*
 	 * Initialize the TSS.  sp0 points to the entry trampoline stack
 	 * regardless of what task is running.
@@ -1718,8 +1732,6 @@ void cpu_init(void)
 	initialize_tlbstate_and_flush();
 	enter_lazy_tlb(&init_mm, curr);
 
-	setup_cpu_entry_area(cpu);
-
 	/*
 	 * Initialize the TSS.  Don't bother initializing sp0, as the initial
 	 * task never enters user mode.
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index 61e26b03afd8..b70aec60ebbd 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -946,6 +946,12 @@ dotraplinkage void do_iret_error(struct pt_regs *regs, long error_code)
 
 void __init trap_init(void)
 {
+	/*
+	 * We need cpu_entry_area working before any IST-using entries could
+	 * happen.
+	 */
+	setup_cpu_entry_areas();
+
 	idt_setup_traps();
 
 	/*
-- 
2.13.6

^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [PATCH 6/6] x86/entry/64: Make cpu_entry_area.tss read-only
  2017-12-01  6:29 [PATCH 0/6] WIP.x86/mm fixes Andy Lutomirski
                   ` (4 preceding siblings ...)
  2017-12-01  6:29 ` [PATCH 5/6] Fixup "x86/entry/64: Move the IST stacks into cpu_entry_area" Andy Lutomirski
@ 2017-12-01  6:29 ` Andy Lutomirski
  5 siblings, 0 replies; 9+ messages in thread
From: Andy Lutomirski @ 2017-12-01  6:29 UTC (permalink / raw)
  To: x86
  Cc: linux-kernel, Borislav Petkov, Brian Gerst, David Laight,
	Kees Cook, Peter Zijlstra, Andy Lutomirski

The TSS is a fairly juicy target for exploits, and, now that the TSS
is in the cpu_entry_area, it's no longer protected by kASLR.  Make it
read-only on x86_64.

On x86_32, it can't be RO because it's written by the CPU during task
switches, and we use a task gate for double faults.  I'd also be
nervous about errata if we tried to make it RO even on configurations
without double fault handling.
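
The idea of one backing page with a writable view and a separate
read-only view can be demonstrated in user space.  This is an analogy
only: the kernel does this with fixmap page-table entries, whereas the
sketch below uses two mmap() views of a temporary file, and every name
in it (demo_alias, demo_alias_create, demo_alias_destroy) is
hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

struct demo_alias {
	unsigned long *rw;       /* writable view, like the RW percpu copy */
	const unsigned long *ro; /* read-only view of the same page */
	size_t len;
};

/*
 * Map one page of a temporary file twice: once read-write and once
 * read-only.  Returns 0 on success and -1 on failure.
 */
int demo_alias_create(struct demo_alias *a)
{
	long page = sysconf(_SC_PAGESIZE);
	FILE *f = tmpfile();
	void *rw, *ro;

	assert(a != NULL);
	if (!f)
		return -1;
	if (page <= 0 || ftruncate(fileno(f), page) != 0) {
		fclose(f);
		return -1;
	}

	rw = mmap(NULL, (size_t)page, PROT_READ | PROT_WRITE, MAP_SHARED,
		  fileno(f), 0);
	ro = mmap(NULL, (size_t)page, PROT_READ, MAP_SHARED, fileno(f), 0);
	fclose(f); /* the mappings stay valid after the file is closed */

	if (rw == MAP_FAILED || ro == MAP_FAILED) {
		if (rw != MAP_FAILED)
			munmap(rw, (size_t)page);
		if (ro != MAP_FAILED)
			munmap(ro, (size_t)page);
		return -1;
	}

	a->rw = rw;
	a->ro = ro;
	a->len = (size_t)page;
	return 0;
}

void demo_alias_destroy(struct demo_alias *a)
{
	munmap(a->rw, a->len);
	munmap((void *)a->ro, a->len);
}
```

Stores made through the rw pointer are visible through the ro pointer,
while a store through the ro address would fault.  That is the property
wanted for the TSS: code that legitimately updates it uses the writable
copy, and the well-known address exposes only a read-only view.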

Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/entry/entry_32.S          |  4 ++--
 arch/x86/entry/entry_64.S          |  8 ++++----
 arch/x86/include/asm/fixmap.h      | 13 +++++++++----
 arch/x86/include/asm/processor.h   | 17 ++++++++---------
 arch/x86/include/asm/switch_to.h   |  4 ++--
 arch/x86/include/asm/thread_info.h |  2 +-
 arch/x86/kernel/asm-offsets.c      |  6 ++----
 arch/x86/kernel/asm-offsets_32.c   |  4 ++--
 arch/x86/kernel/cpu/common.c       | 23 ++++++++++++++++-------
 arch/x86/kernel/ioport.c           |  2 +-
 arch/x86/kernel/process.c          |  6 +++---
 arch/x86/kernel/process_32.c       |  2 +-
 arch/x86/kernel/process_64.c       |  2 +-
 arch/x86/kernel/traps.c            |  4 ++--
 arch/x86/lib/delay.c               |  4 ++--
 arch/x86/xen/enlighten_pv.c        |  2 +-
 16 files changed, 57 insertions(+), 46 deletions(-)

diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S
index 3629bcbf85a2..bd8b57a5c874 100644
--- a/arch/x86/entry/entry_32.S
+++ b/arch/x86/entry/entry_32.S
@@ -942,7 +942,7 @@ ENTRY(debug)
 
 	/* Are we currently on the SYSENTER stack? */
 	movl	PER_CPU_VAR(cpu_entry_area), %ecx
-	addl	$CPU_ENTRY_AREA_tss + TSS_STRUCT_SYSENTER_stack + SIZEOF_SYSENTER_stack, %ecx
+	addl	$CPU_ENTRY_AREA_SYSENTER_stack + SIZEOF_SYSENTER_stack, %ecx
 	subl	%eax, %ecx	/* ecx = (end of SYSENTER_stack) - esp */
 	cmpl	$SIZEOF_SYSENTER_stack, %ecx
 	jb	.Ldebug_from_sysenter_stack
@@ -986,7 +986,7 @@ ENTRY(nmi)
 
 	/* Are we currently on the SYSENTER stack? */
 	movl	PER_CPU_VAR(cpu_entry_area), %ecx
-	addl	$CPU_ENTRY_AREA_tss + TSS_STRUCT_SYSENTER_stack + SIZEOF_SYSENTER_stack, %ecx
+	addl	$CPU_ENTRY_AREA_SYSENTER_stack + SIZEOF_SYSENTER_stack, %ecx
 	subl	%eax, %ecx	/* ecx = (end of SYSENTER_stack) - esp */
 	cmpl	$SIZEOF_SYSENTER_stack, %ecx
 	jb	.Lnmi_from_sysenter_stack
diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 28f4e7553c26..0b0735030328 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -158,7 +158,7 @@ END(native_usergs_sysret64)
 	_entry_trampoline - CPU_ENTRY_AREA_entry_trampoline(%rip)
 
 /* The top word of the SYSENTER stack is hot and is usable as scratch space. */
-#define RSP_SCRATCH	CPU_ENTRY_AREA_tss + TSS_STRUCT_SYSENTER_stack + \
+#define RSP_SCRATCH	CPU_ENTRY_AREA_SYSENTER_stack + \
 			SIZEOF_SYSENTER_stack - 8 + CPU_ENTRY_AREA
 
 ENTRY(entry_SYSCALL_64_trampoline)
@@ -394,7 +394,7 @@ syscall_return_via_sysret:
 	 * Save old stack pointer and switch to trampoline stack.
 	 */
 	movq	%rsp, %rdi
-	movq	PER_CPU_VAR(cpu_tss + TSS_sp0), %rsp
+	movq	PER_CPU_VAR(cpu_tss_rw + TSS_sp0), %rsp
 
 	pushq	RSP-RDI(%rdi)	/* RSP */
 	pushq	(%rdi)		/* RDI */
@@ -722,7 +722,7 @@ GLOBAL(swapgs_restore_regs_and_return_to_usermode)
 	 * Save old stack pointer and switch to trampoline stack.
 	 */
 	movq	%rsp, %rdi
-	movq	PER_CPU_VAR(cpu_tss + TSS_sp0), %rsp
+	movq	PER_CPU_VAR(cpu_tss_rw + TSS_sp0), %rsp
 
 	/* Copy the IRET frame to the trampoline stack. */
 	pushq	6*8(%rdi)	/* SS */
@@ -937,7 +937,7 @@ apicinterrupt IRQ_WORK_VECTOR			irq_work_interrupt		smp_irq_work_interrupt
 /*
  * Exception entry points.
  */
-#define CPU_TSS_IST(x) PER_CPU_VAR(cpu_tss) + (TSS_ist + ((x) - 1) * 8)
+#define CPU_TSS_IST(x) PER_CPU_VAR(cpu_tss_rw) + (TSS_ist + ((x) - 1) * 8)
 
 /*
  * Switch to the thread stack.  This is called with the IRET frame and
diff --git a/arch/x86/include/asm/fixmap.h b/arch/x86/include/asm/fixmap.h
index 9a4caed665fd..3c291d0ebfcd 100644
--- a/arch/x86/include/asm/fixmap.h
+++ b/arch/x86/include/asm/fixmap.h
@@ -56,9 +56,14 @@ struct cpu_entry_area {
 	char gdt[PAGE_SIZE];
 
 	/*
-	 * The GDT is just below cpu_tss and thus serves (on x86_64) as a
-	 * a read-only guard page for the SYSENTER stack at the bottom
-	 * of the TSS region.
+	 * The GDT is just below SYSENTER_stack and thus serves (on x86_64) as
+	 * a a read-only guard page.
+	 */
+	struct SYSENTER_stack_page SYSENTER_stack_page;
+
+	/*
+	 * On x86_64, the TSS is mapped RO.  On x86_32, it's mapped RW because
+	 * we need task switches to work, and task switches write to the TSS.
 	 */
 	struct tss_struct tss;
 
@@ -239,7 +244,7 @@ static inline struct cpu_entry_area *get_cpu_entry_area(int cpu)
 
 static inline struct SYSENTER_stack *cpu_SYSENTER_stack(int cpu)
 {
-	return &get_cpu_entry_area((cpu))->tss.SYSENTER_stack;
+	return &get_cpu_entry_area((cpu))->SYSENTER_stack_page.stack;
 }
 
 extern void setup_cpu_entry_areas(void);
diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index 38fa358fce2d..c9aaa43313d1 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -341,13 +341,11 @@ struct SYSENTER_stack {
 	unsigned long		words[64];
 };
 
-struct tss_struct {
-	/*
-	 * Space for the temporary SYSENTER stack, used for SYSENTER
-	 * and the entry trampoline as well.
-	 */
-	struct SYSENTER_stack	SYSENTER_stack;
+struct SYSENTER_stack_page {
+	struct SYSENTER_stack stack;
+} __aligned(PAGE_SIZE);
 
+struct tss_struct {
 	/*
 	 * The fixed hardware portion.  This must not cross a page boundary
 	 * at risk of violating the SDM's advice and potentially triggering
@@ -364,7 +362,7 @@ struct tss_struct {
 	unsigned long		io_bitmap[IO_BITMAP_LONGS + 1];
 } __aligned(PAGE_SIZE);
 
-DECLARE_PER_CPU_PAGE_ALIGNED(struct tss_struct, cpu_tss);
+DECLARE_PER_CPU_PAGE_ALIGNED(struct tss_struct, cpu_tss_rw);
 
 /*
  * sizeof(unsigned long) coming from an extra "long" at the end
@@ -379,7 +377,8 @@ DECLARE_PER_CPU_PAGE_ALIGNED(struct tss_struct, cpu_tss);
 #ifdef CONFIG_X86_32
 DECLARE_PER_CPU(unsigned long, cpu_current_top_of_stack);
 #else
-#define cpu_current_top_of_stack cpu_tss.x86_tss.sp1
+/* The RO copy can't be accessed with this_cpu_xyz(), so use the RW copy. */
+#define cpu_current_top_of_stack cpu_tss_rw.x86_tss.sp1
 #endif
 
 /*
@@ -539,7 +538,7 @@ static inline void native_set_iopl_mask(unsigned mask)
 static inline void
 native_load_sp0(unsigned long sp0)
 {
-	this_cpu_write(cpu_tss.x86_tss.sp0, sp0);
+	this_cpu_write(cpu_tss_rw.x86_tss.sp0, sp0);
 }
 
 static inline void native_swapgs(void)
diff --git a/arch/x86/include/asm/switch_to.h b/arch/x86/include/asm/switch_to.h
index b0a1aecb365f..3d529ebce27b 100644
--- a/arch/x86/include/asm/switch_to.h
+++ b/arch/x86/include/asm/switch_to.h
@@ -79,10 +79,10 @@ do {									\
 static inline void refresh_sysenter_cs(struct thread_struct *thread)
 {
 	/* Only happens when SEP is enabled, no need to test "SEP"arately: */
-	if (unlikely(this_cpu_read(cpu_tss.x86_tss.ss1) == thread->sysenter_cs))
+	if (unlikely(this_cpu_read(cpu_tss_rw.x86_tss.ss1) == thread->sysenter_cs))
 		return;
 
-	this_cpu_write(cpu_tss.x86_tss.ss1, thread->sysenter_cs);
+	this_cpu_write(cpu_tss_rw.x86_tss.ss1, thread->sysenter_cs);
 	wrmsr(MSR_IA32_SYSENTER_CS, thread->sysenter_cs, 0);
 }
 #endif
diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h
index 44a04999791e..00223333821a 100644
--- a/arch/x86/include/asm/thread_info.h
+++ b/arch/x86/include/asm/thread_info.h
@@ -207,7 +207,7 @@ static inline int arch_within_stack_frames(const void * const stack,
 #else /* !__ASSEMBLY__ */
 
 #ifdef CONFIG_X86_64
-# define cpu_current_top_of_stack (cpu_tss + TSS_sp1)
+# define cpu_current_top_of_stack (cpu_tss_rw + TSS_sp1)
 #endif
 
 #endif
diff --git a/arch/x86/kernel/asm-offsets.c b/arch/x86/kernel/asm-offsets.c
index 46c0995344aa..b8baf3db5a12 100644
--- a/arch/x86/kernel/asm-offsets.c
+++ b/arch/x86/kernel/asm-offsets.c
@@ -94,10 +94,8 @@ void common(void) {
 	BLANK();
 	DEFINE(PTREGS_SIZE, sizeof(struct pt_regs));
 
-	OFFSET(TSS_STRUCT_SYSENTER_stack, tss_struct, SYSENTER_stack);
-	DEFINE(SIZEOF_SYSENTER_stack, sizeof(struct SYSENTER_stack));
-
-	/* Layout info for cpu_entry_area */
 	OFFSET(CPU_ENTRY_AREA_tss, cpu_entry_area, tss);
 	OFFSET(CPU_ENTRY_AREA_entry_trampoline, cpu_entry_area, entry_trampoline);
+	OFFSET(CPU_ENTRY_AREA_SYSENTER_stack, cpu_entry_area, SYSENTER_stack_page);
+	DEFINE(SIZEOF_SYSENTER_stack, sizeof(struct SYSENTER_stack));
 }
diff --git a/arch/x86/kernel/asm-offsets_32.c b/arch/x86/kernel/asm-offsets_32.c
index 52ce4ea16e53..7d20d9c0b3d6 100644
--- a/arch/x86/kernel/asm-offsets_32.c
+++ b/arch/x86/kernel/asm-offsets_32.c
@@ -47,8 +47,8 @@ void foo(void)
 	BLANK();
 
 	/* Offset from the sysenter stack to tss.sp0 */
-	DEFINE(TSS_sysenter_sp0, offsetof(struct tss_struct, x86_tss.sp0) -
-	       offsetofend(struct tss_struct, SYSENTER_stack));
+	DEFINE(TSS_sysenter_sp0, offsetof(struct cpu_entry_area, tss.x86_tss.sp0) -
+	       offsetofend(struct cpu_entry_area, SYSENTER_stack_page.stack));
 
 #ifdef CONFIG_CC_STACKPROTECTOR
 	BLANK();
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index c0f11a684acf..f74645c4cd9a 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -518,31 +518,40 @@ static const unsigned int exception_stack_sizes[N_EXCEPTION_STACKS] = {
 
 static DEFINE_PER_CPU_PAGE_ALIGNED(char, exception_stacks
 	[(N_EXCEPTION_STACKS - 1) * EXCEPTION_STKSZ + DEBUG_STKSZ]);
+
 #endif
 
+static DEFINE_PER_CPU_PAGE_ALIGNED(struct SYSENTER_stack_page, SYSENTER_stack_storage);
+
 /* Setup the fixmap mappings only once per-processor */
 static void __init setup_cpu_entry_area(int cpu)
 {
 #ifdef CONFIG_X86_64
 	extern char _entry_trampoline[];
 
-	/* On 64-bit systems, we use a read-only fixmap GDT. */
+	/* On 64-bit systems, we use a read-only fixmap GDT and TSS. */
 	pgprot_t gdt_prot = PAGE_KERNEL_RO;
+	pgprot_t tss_prot = PAGE_KERNEL_RO;
 #else
 	/*
 	 * On native 32-bit systems, the GDT cannot be read-only because
 	 * our double fault handler uses a task gate, and entering through
 	 * a task gate needs to change an available TSS to busy.  If the GDT
-	 * is read-only, that will triple fault.
+	 * is read-only, that will triple fault.  The TSS cannot be read-only
+	 * because the CPU writes to it on task switches.
 	 *
 	 * On Xen PV, the GDT must be read-only because the hypervisor requires
 	 * it.
 	 */
 	pgprot_t gdt_prot = boot_cpu_has(X86_FEATURE_XENPV) ?
 		PAGE_KERNEL_RO : PAGE_KERNEL;
+	pgprot_t tss_prot = PAGE_KERNEL;
 #endif
 
 	__set_fixmap(get_cpu_entry_area_index(cpu, gdt), get_cpu_gdt_paddr(cpu), gdt_prot);
+	set_percpu_fixmap_pages(get_cpu_entry_area_index(cpu, SYSENTER_stack_page),
+				per_cpu_ptr(&SYSENTER_stack_storage, cpu), 1,
+				PAGE_KERNEL);
 
 	/*
 	 * The Intel SDM says (Volume 3, 7.2.1):
@@ -565,9 +574,9 @@ static void __init setup_cpu_entry_area(int cpu)
 		      offsetofend(struct tss_struct, x86_tss)) & PAGE_MASK);
 	BUILD_BUG_ON(sizeof(struct tss_struct) % PAGE_SIZE != 0);
 	set_percpu_fixmap_pages(get_cpu_entry_area_index(cpu, tss),
-				&per_cpu(cpu_tss, cpu),
+				&per_cpu(cpu_tss_rw, cpu),
 				sizeof(struct tss_struct) / PAGE_SIZE,
-				PAGE_KERNEL);
+				tss_prot);
 
 #ifdef CONFIG_X86_32
 	per_cpu(cpu_entry_area, cpu) = get_cpu_entry_area(cpu);
@@ -1339,7 +1348,7 @@ void enable_sep_cpu(void)
 		return;
 
 	cpu = get_cpu();
-	tss = &per_cpu(cpu_tss, cpu);
+	tss = &per_cpu(cpu_tss_rw, cpu);
 
 	/*
 	 * We cache MSR_IA32_SYSENTER_CS's value in the TSS's ss1 field --
@@ -1609,7 +1618,7 @@ void cpu_init(void)
 	if (cpu)
 		load_ucode_ap();
 
-	t = &per_cpu(cpu_tss, cpu);
+	t = &per_cpu(cpu_tss_rw, cpu);
 	oist = &per_cpu(orig_ist, cpu);
 
 #ifdef CONFIG_NUMA
@@ -1701,7 +1710,7 @@ void cpu_init(void)
 {
 	int cpu = smp_processor_id();
 	struct task_struct *curr = current;
-	struct tss_struct *t = &per_cpu(cpu_tss, cpu);
+	struct tss_struct *t = &per_cpu(cpu_tss_rw, cpu);
 
 	wait_for_master_cpu(cpu);
 
diff --git a/arch/x86/kernel/ioport.c b/arch/x86/kernel/ioport.c
index 3feb648781c4..2f723301eb58 100644
--- a/arch/x86/kernel/ioport.c
+++ b/arch/x86/kernel/ioport.c
@@ -67,7 +67,7 @@ asmlinkage long sys_ioperm(unsigned long from, unsigned long num, int turn_on)
 	 * because the ->io_bitmap_max value must match the bitmap
 	 * contents:
 	 */
-	tss = &per_cpu(cpu_tss, get_cpu());
+	tss = &per_cpu(cpu_tss_rw, get_cpu());
 
 	if (turn_on)
 		bitmap_clear(t->io_bitmap_ptr, from, num);
diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index 298be43d63de..aed9d94bd46f 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -47,7 +47,7 @@
  * section. Since TSS's are completely CPU-local, we want them
  * on exact cacheline boundaries, to eliminate cacheline ping-pong.
  */
-__visible DEFINE_PER_CPU_SHARED_ALIGNED(struct tss_struct, cpu_tss) = {
+__visible DEFINE_PER_CPU_SHARED_ALIGNED(struct tss_struct, cpu_tss_rw) = {
 	.x86_tss = {
 		/*
 		 * .sp0 is only used when entering ring 0 from a lower
@@ -82,7 +82,7 @@ __visible DEFINE_PER_CPU_SHARED_ALIGNED(struct tss_struct, cpu_tss) = {
 	.io_bitmap		= { [0 ... IO_BITMAP_LONGS] = ~0 },
 #endif
 };
-EXPORT_PER_CPU_SYMBOL(cpu_tss);
+EXPORT_PER_CPU_SYMBOL(cpu_tss_rw);
 
 DEFINE_PER_CPU(bool, __tss_limit_invalid);
 EXPORT_PER_CPU_SYMBOL_GPL(__tss_limit_invalid);
@@ -111,7 +111,7 @@ void exit_thread(struct task_struct *tsk)
 	struct fpu *fpu = &t->fpu;
 
 	if (bp) {
-		struct tss_struct *tss = &per_cpu(cpu_tss, get_cpu());
+		struct tss_struct *tss = &per_cpu(cpu_tss_rw, get_cpu());
 
 		t->io_bitmap_ptr = NULL;
 		clear_thread_flag(TIF_IO_BITMAP);
diff --git a/arch/x86/kernel/process_32.c b/arch/x86/kernel/process_32.c
index 45bf0c5f93e1..5224c6099184 100644
--- a/arch/x86/kernel/process_32.c
+++ b/arch/x86/kernel/process_32.c
@@ -234,7 +234,7 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p)
 	struct fpu *prev_fpu = &prev->fpu;
 	struct fpu *next_fpu = &next->fpu;
 	int cpu = smp_processor_id();
-	struct tss_struct *tss = &per_cpu(cpu_tss, cpu);
+	struct tss_struct *tss = &per_cpu(cpu_tss_rw, cpu);
 
 	/* never put a printk in __switch_to... printk() calls wake_up*() indirectly */
 
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index bafe65b08697..2678b0bc99d9 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -400,7 +400,7 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p)
 	struct fpu *prev_fpu = &prev->fpu;
 	struct fpu *next_fpu = &next->fpu;
 	int cpu = smp_processor_id();
-	struct tss_struct *tss = &per_cpu(cpu_tss, cpu);
+	struct tss_struct *tss = &per_cpu(cpu_tss_rw, cpu);
 
 	WARN_ON_ONCE(IS_ENABLED(CONFIG_DEBUG_ENTRY) &&
 		     this_cpu_read(irq_count) != -1);
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index b70aec60ebbd..554f14d87575 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -365,7 +365,7 @@ dotraplinkage void do_double_fault(struct pt_regs *regs, long error_code)
 		regs->cs == __KERNEL_CS &&
 		regs->ip == (unsigned long)native_irq_return_iret)
 	{
-		struct pt_regs *gpregs = (struct pt_regs *)this_cpu_read(cpu_tss.x86_tss.sp0) - 1;
+		struct pt_regs *gpregs = (struct pt_regs *)this_cpu_read(cpu_tss_rw.x86_tss.sp0) - 1;
 
 		/*
 		 * regs->sp points to the failing IRET frame on the
@@ -655,7 +655,7 @@ struct bad_iret_stack *fixup_bad_iret(struct bad_iret_stack *s)
 	 * exception came from the IRET target.
 	 */
 	struct bad_iret_stack *new_stack =
-		(struct bad_iret_stack *)this_cpu_read(cpu_tss.x86_tss.sp0) - 1;
+		(struct bad_iret_stack *)this_cpu_read(cpu_tss_rw.x86_tss.sp0) - 1;
 
 	/* Copy the IRET target to the new stack. */
 	memmove(&new_stack->regs.ip, (void *)s->regs.sp, 5*8);
diff --git a/arch/x86/lib/delay.c b/arch/x86/lib/delay.c
index 553f8fd23cc4..4846eff7e4c8 100644
--- a/arch/x86/lib/delay.c
+++ b/arch/x86/lib/delay.c
@@ -107,10 +107,10 @@ static void delay_mwaitx(unsigned long __loops)
 		delay = min_t(u64, MWAITX_MAX_LOOPS, loops);
 
 		/*
-		 * Use cpu_tss as a cacheline-aligned, seldomly
+		 * Use cpu_tss_rw as a cacheline-aligned, seldomly
 		 * accessed per-cpu variable as the monitor target.
 		 */
-		__monitorx(raw_cpu_ptr(&cpu_tss), 0, 0);
+		__monitorx(raw_cpu_ptr(&cpu_tss_rw), 0, 0);
 
 		/*
 		 * AMD, like Intel, supports the EAX hint and EAX=0xf
diff --git a/arch/x86/xen/enlighten_pv.c b/arch/x86/xen/enlighten_pv.c
index 5b2b3f3f6531..b0ac47dd7b4a 100644
--- a/arch/x86/xen/enlighten_pv.c
+++ b/arch/x86/xen/enlighten_pv.c
@@ -818,7 +818,7 @@ static void xen_load_sp0(unsigned long sp0)
 	mcs = xen_mc_entry(0);
 	MULTI_stack_switch(mcs.mc, __KERNEL_DS, sp0);
 	xen_mc_issue(PARAVIRT_LAZY_CPU);
-	this_cpu_write(cpu_tss.x86_tss.sp0, sp0);
+	this_cpu_write(cpu_tss_rw.x86_tss.sp0, sp0);
 }
 
 void xen_set_iopl_mask(unsigned mask)
-- 
2.13.6

* Re: [PATCH 4/6] Unsuck "x86/entry/64: Create a percpu SYSCALL entry trampoline"
  2017-12-01  6:29 ` [PATCH 4/6] Unsuck "x86/entry/64: Create a percpu SYSCALL entry trampoline" Andy Lutomirski
@ 2017-12-02 15:18   ` Josh Poimboeuf
  2017-12-02 16:05     ` Andy Lutomirski
  0 siblings, 1 reply; 9+ messages in thread
From: Josh Poimboeuf @ 2017-12-02 15:18 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: x86, linux-kernel, Borislav Petkov, Brian Gerst, David Laight,
	Kees Cook, Peter Zijlstra

On Thu, Nov 30, 2017 at 10:29:44PM -0800, Andy Lutomirski wrote:
> diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
> index caf74a1bb3de..28f4e7553c26 100644
> --- a/arch/x86/entry/entry_64.S
> +++ b/arch/x86/entry/entry_64.S
> @@ -180,14 +180,24 @@ ENTRY(entry_SYSCALL_64_trampoline)
>  
>  	/*
>  	 * x86 lacks a near absolute jump, and we can't jump to the real
> -	 * entry text with a relative jump, so we fake it using retq.
> +	 * entry text with a relative jump.  We could push the target
> +	 * address and then use retq, but this destroys the pipeline on
> +	 * many CPUs (wasting over 20 cycles on Sandy Bridge).  Instead,
> +	 * spill RDI and restore it in a second-stage trampoline.
>  	 */
> -	pushq	$entry_SYSCALL_64_after_hwframe
> -	retq
> +	pushq	%rdi
> +	movq	$entry_SYSCALL_64_stage2, %rdi
> +	jmp	*%rdi
>  END(entry_SYSCALL_64_trampoline)
>  
>  	.popsection
>  
> +ENTRY(entry_SYSCALL_64_stage2)
> +	UNWIND_HINT_EMPTY
> +	popq	%rdi
> +	jmp	entry_SYSCALL_64_after_hwframe
> +END(entry_SYSCALL_64_stage2)
> +
>  ENTRY(entry_SYSCALL_64)
>  	UNWIND_HINT_EMPTY
>  	/*

Another crazy idea:

	call	1f
1:	movq	$entry_SYSCALL_64_after_hwframe, (%rsp)
	ret

Does that fix the regression?

-- 
Josh

* Re: [PATCH 4/6] Unsuck "x86/entry/64: Create a percpu SYSCALL entry trampoline"
  2017-12-02 15:18   ` Josh Poimboeuf
@ 2017-12-02 16:05     ` Andy Lutomirski
  0 siblings, 0 replies; 9+ messages in thread
From: Andy Lutomirski @ 2017-12-02 16:05 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: Andy Lutomirski, X86 ML, linux-kernel@vger.kernel.org,
	Borislav Petkov, Brian Gerst, David Laight, Kees Cook,
	Peter Zijlstra

On Sat, Dec 2, 2017 at 7:18 AM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> On Thu, Nov 30, 2017 at 10:29:44PM -0800, Andy Lutomirski wrote:
>> diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
>> index caf74a1bb3de..28f4e7553c26 100644
>> --- a/arch/x86/entry/entry_64.S
>> +++ b/arch/x86/entry/entry_64.S
>> @@ -180,14 +180,24 @@ ENTRY(entry_SYSCALL_64_trampoline)
>>
>>       /*
>>        * x86 lacks a near absolute jump, and we can't jump to the real
>> -      * entry text with a relative jump, so we fake it using retq.
>> +      * entry text with a relative jump.  We could push the target
>> +      * address and then use retq, but this destroys the pipeline on
>> +      * many CPUs (wasting over 20 cycles on Sandy Bridge).  Instead,
>> +      * spill RDI and restore it in a second-stage trampoline.
>>        */
>> -     pushq   $entry_SYSCALL_64_after_hwframe
>> -     retq
>> +     pushq   %rdi
>> +     movq    $entry_SYSCALL_64_stage2, %rdi
>> +     jmp     *%rdi
>>  END(entry_SYSCALL_64_trampoline)
>>
>>       .popsection
>>
>> +ENTRY(entry_SYSCALL_64_stage2)
>> +     UNWIND_HINT_EMPTY
>> +     popq    %rdi
>> +     jmp     entry_SYSCALL_64_after_hwframe
>> +END(entry_SYSCALL_64_stage2)
>> +
>>  ENTRY(entry_SYSCALL_64)
>>       UNWIND_HINT_EMPTY
>>       /*
>
> Another crazy idea:
>
>         call    1f
> 1:      movq    $entry_SYSCALL_64_after_hwframe, (%rsp)
>         ret
>
> Does that fix the regression?

I suspect that's as bad or worse.  The issue (I think) is that the CPU
keeps a small hidden internal stack of return addresses, pushed on each
call and popped on each ret, and it speculates past a ret on the
assumption that it returns to just after the most recent call.  If it
doesn't, the CPU has to throw away the speculated work and start over.
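
To make the reasoning above concrete, here is a toy model in C of the kind of return-address predictor being described. It is only a sketch under stated assumptions: the names (struct rsb, rsb_call, rsb_ret), the depth of 16, and the addresses are all made up for illustration, and real predictors differ between CPUs. It models three sequences: "pushq $target; retq", the "call 1f; movq $target, (%rsp); ret" idea, and an ordinary matched call/ret for contrast.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical depth; real hardware varies and is not specified in this thread. */
#define RSB_DEPTH 16

/* Toy return-address predictor: a small stack the CPU keeps internally. */
struct rsb {
	uint64_t entry[RSB_DEPTH];
	size_t top;		/* number of valid entries */
};

/* CALL: record the address of the instruction following the call. */
static void rsb_call(struct rsb *r, uint64_t return_addr)
{
	/* Simplification: drop the entry when full instead of wrapping. */
	if (r->top < RSB_DEPTH)
		r->entry[r->top++] = return_addr;
}

/* RET: pop the prediction; true means the prediction matched the real target. */
static bool rsb_ret(struct rsb *r, uint64_t actual_target)
{
	if (r->top == 0)
		return false;	/* nothing to predict from */
	return r->entry[--r->top] == actual_target;
}

/* Made-up addresses standing in for real code locations. */
enum { STALE_CALLER = 0x1000, LABEL_1F = 0x2000, ENTRY_TARGET = 0x3000 };

/* pushq $target; retq: no call was made, so ret pops an unrelated entry. */
static bool push_ret_predicted(void)
{
	struct rsb r = {0};

	rsb_call(&r, STALE_CALLER);	/* left over from some earlier call */
	/* pushq only writes memory; the predictor is not updated. */
	return rsb_ret(&r, ENTRY_TARGET);
}

/* call 1f; 1: movq $target, (%rsp); ret */
static bool call_overwrite_ret_predicted(void)
{
	struct rsb r = {0};

	rsb_call(&r, LABEL_1F);		/* call 1f records the address of label 1 */
	/* movq rewrites the in-memory return address; the predictor still says 1f. */
	return rsb_ret(&r, ENTRY_TARGET);
}

/* Ordinary matched call/ret, for contrast. */
static bool matched_call_ret_predicted(void)
{
	struct rsb r = {0};

	rsb_call(&r, LABEL_1F);
	return rsb_ret(&r, LABEL_1F);
}
```

In this model both retq-based sequences mispredict and only the matched pair predicts correctly. That is consistent with the suspicion above that the call-based variant would be no better than pushq; retq. The model says nothing about actual cycle counts. The "jmp *%rdi" in the patch is an indirect jump rather than a ret, so it does not consult this predictor at all.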

end of thread, other threads:[~2017-12-02 16:05 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2017-12-01  6:29 [PATCH 0/6] WIP.x86/mm fixes Andy Lutomirski
2017-12-01  6:29 ` [PATCH 1/6] x86/orc: Don't bail on stack overflow Andy Lutomirski
2017-12-01  6:29 ` [PATCH 2/6] Fixup "x86/asm: Fix assumptions that the HW TSS is at the beginning of cpu_tss" Andy Lutomirski
2017-12-01  6:29 ` [PATCH 3/6] Fixup "x86/asm: Remap the TSS into the cpu entry area" Andy Lutomirski
2017-12-01  6:29 ` [PATCH 4/6] Unsuck "x86/entry/64: Create a percpu SYSCALL entry trampoline" Andy Lutomirski
2017-12-02 15:18   ` Josh Poimboeuf
2017-12-02 16:05     ` Andy Lutomirski
2017-12-01  6:29 ` [PATCH 5/6] Fixup "x86/entry/64: Move the IST stacks into cpu_entry_area" Andy Lutomirski
2017-12-01  6:29 ` [PATCH 6/6] x86/entry/64: Make cpu_entry_area.tss read-only Andy Lutomirski
