Re: [PATCH 4/5] x86/asm/entry/32: Replace RESTORE_RSI_RDI[_RDX] with open-coded 32-bit reads

From: Denys Vlasenko <dvlasenk@redhat.com>
To: Ingo Molnar <mingo@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>,
	Steven Rostedt <rostedt@goodmis.org>,
	Borislav Petkov <bp@alien8.de>, "H. Peter Anvin" <hpa@zytor.com>,
	Andy Lutomirski <luto@amacapital.net>,
	Oleg Nesterov <oleg@redhat.com>,
	Frederic Weisbecker <fweisbec@gmail.com>,
	Alexei Starovoitov <ast@plumgrid.com>,
	Will Drewry <wad@chromium.org>, Kees Cook <keescook@chromium.org>,
	x86@kernel.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH 4/5] x86/asm/entry/32: Replace RESTORE_RSI_RDI[_RDX] with open-coded 32-bit reads
Date: Thu, 18 Jun 2015 12:59:07 +0200
Message-ID: <5582A47B.4020802@redhat.com>
In-Reply-To: <20150618093134.GA1094@gmail.com>

On 06/18/2015 11:31 AM, Ingo Molnar wrote:
>> If it takes, say, 20 cycles to pull data from e.g. L3 cache to ECX,
>> then SYSRET can't possibly complete sooner than in 20 cycles.
>
> Yeah, that's true, but my point is: SYSRET has to do a lot of other things
> (permission checks, loading the user mode state - most of which are unrelated to
> R11/RCX), which take dozens of cycles,

SYSRET was designed to avoid doing that. It does not check permissions
- it slam-dunks CPL3 and resets CS and SS to preset values.
It does not touch the stack pointer or restore any other GP register.

Having said that, I'd try to get cold hard facts, i.e. experimentally
measure SYSRET latency.


> and which are probably overlapped with any
> cache misses on arguments such as R11/RCX.
>
> It's not impossible that reordering helps, for example if SYSRET has some internal
> dependencies that make its parallelism worse than ideal - but I'd complicate this
> code only if it gives a measurable improvement for cache-cold syscall performance.

I attempted to test it, using a patch which moves the RCX and R11 loads all the way down:

diff --git a/arch/x86/entry/entry_64_compat.S b/arch/x86/entry/entry_64_compat.S
index f2064bd..0ea09a3 100644
--- a/arch/x86/entry/entry_64_compat.S
+++ b/arch/x86/entry/entry_64_compat.S
@@ -139,9 +139,6 @@ sysexit_from_sys_call:
 	 * with 'sysenter' and it uses the SYSENTER calling convention.
 	 */
 	andl	$~TS_COMPAT, ASM_THREAD_INFO(TI_status, %rsp, SIZEOF_PTREGS)
-	/* Prepare registers for SYSRET insn */
-	movl	RIP(%rsp), %ecx		/* User %eip */
-	movl	EFLAGS(%rsp), %r11d	/* User eflags *
 	/* Restore registers per SYSEXIT ABI requirements: */
 	/* arg1 (ebx): preserved by virtue of being a callee-saved register */
 	/* arg2 (ecx): used by SYSEXIT to restore esp (and by SYSRET to restore eip) */
@@ -155,6 +152,9 @@ sysexit_from_sys_call:
 	xorl	%r8d, %r8d
 	xorl	%r9d, %r9d
 	xorl	%r10d, %r10d
+	/* Prepare registers for SYSRET insn */
+	movl	RIP(%rsp), %ecx		/* User %eip */
+	movl	EFLAGS(%rsp), %r11d	/* User eflags */
 	TRACE_IRQS_ON

 	/*
@@ -374,9 +374,6 @@ cstar_dispatch:

 sysretl_from_sys_call:
 	andl	$~TS_COMPAT, ASM_THREAD_INFO(TI_status, %rsp, SIZEOF_PTREGS)
-	/* Prepare registers for SYSRET insn */
-	movl	RIP(%rsp), %ecx		/* User %eip */
-	movl	EFLAGS(%rsp), %r11d	/* User eflags */
 	/* Restore registers per SYSRET ABI requirements: */
 	/* arg1 (ebx): preserved by virtue of being a callee-saved register */
 	/* arg2 (ebp): preserved (already restored, see above) */
@@ -388,6 +385,9 @@ sysretl_from_sys_call:
 	xorl	%r8d, %r8d
 	xorl	%r9d, %r9d
 	xorl	%r10d, %r10d
+	/* Prepare registers for SYSRET insn */
+	movl	RIP(%rsp), %ecx		/* User %eip */
+	movl	EFLAGS(%rsp), %r11d	/* User eflags */
 	TRACE_IRQS_ON
 	movl	RSP(%rsp), %esp
 	/*

This does not change instruction sizes, and therefore leaves code
cacheline alignment unchanged across the entire bzImage.


Testing getpid() in a loop (IOW: a cache-hot test) did show that it is
slower with this patch, but by a statistically insignificant amount:

Before the patch: 61.92 ns per syscall.
After the patch:  61.99 ns per syscall.

That's less than one cycle, more like 0.15 cycles.
However, it is reproducible.
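
A minimal sketch of this kind of cache-hot measurement follows. It is not
the program used for the numbers above; the function names are made up for
illustration. To exercise the compat entry code touched by the patch it
would presumably have to be built as a 32-bit binary (e.g. with -m32),
which is an assumption about the test setup.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <time.h>
#include <unistd.h>
#include <sys/syscall.h>

/* Average wall-clock cost of one getpid syscall, in nanoseconds. */
static double getpid_ns_per_call(long iterations)
{
	struct timespec t0, t1;
	double ns;
	long i;

	assert(iterations > 0);
	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (i = 0; i < iterations; i++)
		syscall(SYS_getpid);	/* raw syscall, so libc cannot cache the pid */
	clock_gettime(CLOCK_MONOTONIC, &t1);

	ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
	return ns / iterations;
}

/* Convert a time difference to cycles for an assumed core clock in GHz. */
static double ns_to_cycles(double ns, double ghz)
{
	return ns * ghz;
}
```

With the figures above, the difference is 61.99 - 61.92 = 0.07 ns, and
ns_to_cycles(0.07, 2.1) gives about 0.15 cycles. The 2.1 GHz clock is
inferred from those two numbers; it is not stated anywhere in this mail.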

I did not figure out how to do a cache-cold test.
I tried a 65kbyte-ish read from "/dev/zero". That takes ~3885 ns,
and its variability of +-10 ns drowns out any possible difference.
