All of lore.kernel.org
 help / color / mirror / Atom feed
From: Borislav Petkov <bp@alien8.de>
To: Andy Lutomirski <luto@kernel.org>
Cc: x86@kernel.org, "H. Peter Anvin" <hpa@zytor.com>,
	Andy Lutomirski <luto@amacapital.net>,
	Denys Vlasenko <vda.linux@googlemail.com>,
	Linus Torvalds <torvalds@linux-foundation.org>,
	Brian Gerst <brgerst@gmail.com>,
	Denys Vlasenko <dvlasenk@redhat.com>,
	Ingo Molnar <mingo@kernel.org>,
	Steven Rostedt <rostedt@goodmis.org>,
	Oleg Nesterov <oleg@redhat.com>,
	Frederic Weisbecker <fweisbec@gmail.com>,
	Alexei Starovoitov <ast@plumgrid.com>,
	Will Drewry <wad@chromium.org>, Kees Cook <keescook@chromium.org>,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH] x86_64, asm: Work around AMD SYSRET SS descriptor attribute issue
Date: Sat, 25 Apr 2015 23:12:06 +0200	[thread overview]
Message-ID: <20150425211206.GE32099@pd.tnic> (raw)
In-Reply-To: <5d120f358612d73fc909f5bfa47e7bd082db0af0.1429841474.git.luto@kernel.org>

On Thu, Apr 23, 2015 at 07:15:01PM -0700, Andy Lutomirski wrote:
> AMD CPUs don't reinitialize the SS descriptor on SYSRET, so SYSRET
> with SS == 0 results in an invalid usermode state in which SS is
> apparently equal to __USER_DS but causes #SS if used.
> 
> Work around the issue by replacing NULL SS values with __KERNEL_DS
> in __switch_to, thus ensuring that SYSRET never happens with SS set
> to NULL.
> 
> This was exposed by a recent vDSO cleanup.
> 
> Fixes: e7d6eefaaa44 x86/vdso32/syscall.S: Do not load __USER32_DS to %ss
> Signed-off-by: Andy Lutomirski <luto@kernel.org>
> ---
> 
> Tested only on Intel, which isn't very interesting.  I'll tidy up
> and send a test case, too, once Borislav confirms that it works.

So I did some benchmarking today. Custom kernel build measured with perf
stat, 10 builds with --pre doing

$ cat pre-build-kernel.sh
make -s clean
echo 3 > /proc/sys/vm/drop_caches

$ cat measure.sh
EVENTS="cpu-clock,task-clock,cycles,instructions,branches,branch-misses,context-switches,migrations"
perf stat -e $EVENTS --sync -a --repeat 10 --pre ~/kernel/pre-build-kernel.sh make -s -j64

I've prepended the perf stat output with markers A:, B: or C: for easier
comparing. The markers mean:

A: Linus' master from a couple of days ago + tip/master + tip/x86/asm
B: With Andy's SYSRET patch ontop
C: Without RCX canonicalness check (see patch at the end).

Numbers are from an AMD F16h box:

A:    2835570.145246      cpu-clock (msec)                                              ( +-  0.02% ) [100.00%]
B:    2833364.074970      cpu-clock (msec)                                              ( +-  0.04% ) [100.00%]
C:    2834708.335431      cpu-clock (msec)                                              ( +-  0.02% ) [100.00%]

This is interesting - The SYSRET SS fix makes it minimally better and
the C-patch is a bit worse again. Net win is 861 msec, almost a second,
oh well.

A:    2835570.099981      task-clock (msec)         #    3.996 CPUs utilized            ( +-  0.02% ) [100.00%]
B:    2833364.073633      task-clock (msec)         #    3.996 CPUs utilized            ( +-  0.04% ) [100.00%]
C:    2834708.350387      task-clock (msec)         #    3.996 CPUs utilized            ( +-  0.02% ) [100.00%]

Similar thing observable here.

A: 5,591,213,166,613      cycles                    #    1.972 GHz                      ( +-  0.03% ) [75.00%]
B: 5,585,023,802,888      cycles                    #    1.971 GHz                      ( +-  0.03% ) [75.00%]
C: 5,587,983,212,758      cycles                    #    1.971 GHz                      ( +-  0.02% ) [75.00%]

net win is 3,229,953,855 cycles drop.

A: 3,106,707,101,530      instructions              #    0.56  insns per cycle          ( +-  0.01% ) [75.00%]
B: 3,106,632,251,528      instructions              #    0.56  insns per cycle          ( +-  0.00% ) [75.00%]
C: 3,106,265,958,142      instructions              #    0.56  insns per cycle          ( +-  0.00% ) [75.00%]

This looks like it would make sense - instruction count drops from A -> B -> C.

A:   683,676,044,429      branches                  #  241.107 M/sec                    ( +-  0.01% ) [75.00%]
B:   683,670,899,595      branches                  #  241.293 M/sec                    ( +-  0.01% ) [75.00%]
C:   683,675,772,858      branches                  #  241.180 M/sec                    ( +-  0.01% ) [75.00%]

Also makes sense - the C patch adds an unconditional JMP over the
RCX-canonicalness check.

A:    43,829,535,008      branch-misses             #    6.41% of all branches          ( +-  0.02% ) [75.00%]
B:    43,844,118,416      branch-misses             #    6.41% of all branches          ( +-  0.03% ) [75.00%]
C:    43,819,871,086      branch-misses             #    6.41% of all branches          ( +-  0.02% ) [75.00%]

And this is nice, branch misses are the smallest with C, cool. It makes
sense again - the C patch adds an unconditional JMP which doesn't miss.

A:         2,030,357      context-switches          #    0.716 K/sec                    ( +-  0.06% ) [100.00%]
B:         2,029,313      context-switches          #    0.716 K/sec                    ( +-  0.05% ) [100.00%]
C:         2,028,566      context-switches          #    0.716 K/sec                    ( +-  0.06% ) [100.00%]

Those look good.

A:            52,421      migrations                #    0.018 K/sec                    ( +-  1.13% )
B:            52,049      migrations                #    0.018 K/sec                    ( +-  1.02% )
C:            51,365      migrations                #    0.018 K/sec                    ( +-  0.92% )

Same here.

A:     709.528485252 seconds time elapsed                                          ( +-  0.02% )
B:     708.976557288 seconds time elapsed                                          ( +-  0.04% )
C:     709.312844791 seconds time elapsed                                          ( +-  0.02% )

Interestingly, the unconditional JMP kinda costs... Btw, I'm not sure if
kernel build is the optimal workload for benchmarking here but I don't
see why not - it does a lot of syscalls so it should exercise the SYSRET
path sufficiently.

Anyway, we can do this below. Or not, I'm sitting on the fence about
that one.

---
From: Borislav Petkov <bp@suse.de>
Date: Sat, 25 Apr 2015 19:30:33 +0200
Subject: [PATCH] x86/entry: Avoid canonical RCX check on AMD

It is not needed on AMD as RCX canonicalness is not checked during
SYSRET there.

Signed-off-by: Borislav Petkov <bp@suse.de>
---
 arch/x86/include/asm/cpufeature.h |  1 +
 arch/x86/kernel/cpu/intel.c       |  2 ++
 arch/x86/kernel/entry_64.S        | 13 +++++++++----
 3 files changed, 12 insertions(+), 4 deletions(-)

diff --git a/arch/x86/include/asm/cpufeature.h b/arch/x86/include/asm/cpufeature.h
index 7ee9b94d9921..8d555b046fe9 100644
--- a/arch/x86/include/asm/cpufeature.h
+++ b/arch/x86/include/asm/cpufeature.h
@@ -265,6 +265,7 @@
 #define X86_BUG_11AP		X86_BUG(5) /* Bad local APIC aka 11AP */
 #define X86_BUG_FXSAVE_LEAK	X86_BUG(6) /* FXSAVE leaks FOP/FIP/FOP */
 #define X86_BUG_CLFLUSH_MONITOR	X86_BUG(7) /* AAI65, CLFLUSH required before MONITOR */
+#define X86_BUG_CANONICAL_RCX	X86_BUG(8) /* SYSRET #GPs when %RCX non-canonical */
 
 #if defined(__KERNEL__) && !defined(__ASSEMBLY__)
 
diff --git a/arch/x86/kernel/cpu/intel.c b/arch/x86/kernel/cpu/intel.c
index 50163fa9034f..109a51815e92 100644
--- a/arch/x86/kernel/cpu/intel.c
+++ b/arch/x86/kernel/cpu/intel.c
@@ -159,6 +159,8 @@ static void early_init_intel(struct cpuinfo_x86 *c)
 		pr_info("Disabling PGE capability bit\n");
 		setup_clear_cpu_cap(X86_FEATURE_PGE);
 	}
+
+	set_cpu_bug(c, X86_BUG_CANONICAL_RCX);
 }
 
 #ifdef CONFIG_X86_32
diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
index e952f6bf1d6d..d01fb6c1362f 100644
--- a/arch/x86/kernel/entry_64.S
+++ b/arch/x86/kernel/entry_64.S
@@ -415,16 +415,20 @@ syscall_return:
 	jne opportunistic_sysret_failed
 
 	/*
-	 * On Intel CPUs, SYSRET with non-canonical RCX/RIP will #GP
-	 * in kernel space.  This essentially lets the user take over
-	 * the kernel, since userspace controls RSP.
-	 *
 	 * If width of "canonical tail" ever becomes variable, this will need
 	 * to be updated to remain correct on both old and new CPUs.
 	 */
 	.ifne __VIRTUAL_MASK_SHIFT - 47
 	.error "virtual address width changed -- SYSRET checks need update"
 	.endif
+
+	/*
+	 * On Intel CPUs, SYSRET with non-canonical RCX/RIP will #GP
+	 * in kernel space.  This essentially lets the user take over
+	 * the kernel, since userspace controls RSP.
+	 */
+	ALTERNATIVE "jmp 1f", "", X86_BUG_CANONICAL_RCX
+
 	/* Change top 16 bits to be the sign-extension of 47th bit */
 	shl	$(64 - (__VIRTUAL_MASK_SHIFT+1)), %rcx
 	sar	$(64 - (__VIRTUAL_MASK_SHIFT+1)), %rcx
@@ -432,6 +436,7 @@ syscall_return:
 	cmpq	%rcx, %r11
 	jne	opportunistic_sysret_failed
 
+1:
 	cmpq $__USER_CS,CS(%rsp)	/* CS must match SYSRET */
 	jne opportunistic_sysret_failed
 
-- 
2.3.5

-- 
Regards/Gruss,
    Boris.

ECO tip #101: Trim your mails when you reply.
--

  parent reply	other threads:[~2015-04-25 21:12 UTC|newest]

Thread overview: 67+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2015-04-24  2:15 [PATCH] x86_64, asm: Work around AMD SYSRET SS descriptor attribute issue Andy Lutomirski
2015-04-24  2:18 ` Andy Lutomirski
2015-04-26 12:34   ` Denys Vlasenko
2015-04-24  3:58 ` Brian Gerst
2015-04-24  9:59 ` Denys Vlasenko
2015-04-24 10:59   ` Borislav Petkov
2015-04-24 19:58     ` Borislav Petkov
2015-04-24 11:27 ` Denys Vlasenko
2015-04-24 12:00   ` Brian Gerst
2015-04-24 16:25     ` Linus Torvalds
2015-04-24 17:33       ` Brian Gerst
2015-04-24 17:41         ` Linus Torvalds
2015-04-24 17:57           ` Brian Gerst
2015-04-24 20:21 ` Andy Lutomirski
2015-04-24 20:46   ` Denys Vlasenko
2015-04-24 20:50     ` Andy Lutomirski
2015-04-24 21:45       ` H. Peter Anvin
2015-04-24 21:45       ` H. Peter Anvin
2015-04-24 21:45       ` H. Peter Anvin
2015-04-24 21:45       ` H. Peter Anvin
2015-04-24 21:45       ` H. Peter Anvin
2015-04-24 21:45       ` H. Peter Anvin
2015-04-25  2:17       ` Denys Vlasenko
2015-04-26 23:36         ` Andy Lutomirski
2015-04-24 20:53   ` Linus Torvalds
2015-04-25 21:12 ` Borislav Petkov [this message]
2015-04-26 11:22   ` perf numbers (was: Re: [PATCH] x86_64, asm: Work around AMD SYSRET SS descriptor attribute issue) Borislav Petkov
2015-04-26 23:39   ` [PATCH] x86_64, asm: Work around AMD SYSRET SS descriptor attribute issue Andy Lutomirski
2015-04-27  8:53     ` Borislav Petkov
2015-04-27 10:07       ` Denys Vlasenko
2015-04-27 10:09         ` Borislav Petkov
2015-04-27 11:35       ` Borislav Petkov
2015-04-27 12:08         ` Denys Vlasenko
2015-04-27 12:48           ` Borislav Petkov
2015-04-27 14:57         ` Linus Torvalds
2015-04-27 15:06           ` Linus Torvalds
2015-04-27 15:35             ` Borislav Petkov
2015-04-27 15:46           ` Borislav Petkov
2015-04-27 15:56             ` Andy Lutomirski
2015-04-27 16:04               ` Brian Gerst
2015-04-27 16:10                 ` Denys Vlasenko
2015-04-27 16:00             ` Linus Torvalds
2015-04-27 16:40               ` Borislav Petkov
2015-04-27 18:14                 ` Linus Torvalds
2015-04-27 18:38                   ` Borislav Petkov
2015-04-27 18:47                     ` Linus Torvalds
2015-04-27 18:53                       ` Borislav Petkov
2015-04-27 19:59                         ` H. Peter Anvin
2015-04-27 20:03                           ` Borislav Petkov
2015-04-27 20:14                             ` H. Peter Anvin
2015-04-28 15:55                               ` Borislav Petkov
2015-04-28 16:28                                 ` Linus Torvalds
2015-04-28 16:58                                   ` Borislav Petkov
2015-04-28 17:16                                     ` Linus Torvalds
2015-04-28 18:38                                       ` Borislav Petkov
2015-04-30 21:39                                         ` H. Peter Anvin
2015-04-30 23:23                                           ` H. Peter Anvin
2015-05-01  9:03                                             ` Borislav Petkov
2015-05-03 11:51                                           ` Borislav Petkov
2015-04-27 19:11                     ` Borislav Petkov
2015-04-27 19:21                       ` Denys Vlasenko
2015-04-27 19:45                         ` Borislav Petkov
2015-04-28 13:40                           ` Borislav Petkov
2015-04-27 16:12           ` Denys Vlasenko
2015-04-27 18:12             ` Linus Torvalds
2015-04-27 18:47               ` Borislav Petkov
2015-04-27 14:39     ` Borislav Petkov

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20150425211206.GE32099@pd.tnic \
    --to=bp@alien8.de \
    --cc=ast@plumgrid.com \
    --cc=brgerst@gmail.com \
    --cc=dvlasenk@redhat.com \
    --cc=fweisbec@gmail.com \
    --cc=hpa@zytor.com \
    --cc=keescook@chromium.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=luto@amacapital.net \
    --cc=luto@kernel.org \
    --cc=mingo@kernel.org \
    --cc=oleg@redhat.com \
    --cc=rostedt@goodmis.org \
    --cc=torvalds@linux-foundation.org \
    --cc=vda.linux@googlemail.com \
    --cc=wad@chromium.org \
    --cc=x86@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.