From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <linux-kernel-owner@vger.kernel.org>
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S933055AbbCYOzz (ORCPT <rfc822;w@1wt.eu>);
	Wed, 25 Mar 2015 10:55:55 -0400
Received: from mx1.redhat.com ([209.132.183.28]:54799 "EHLO mx1.redhat.com"
	rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
	id S932171AbbCYOzv (ORCPT <rfc822;linux-kernel@vger.kernel.org>);
	Wed, 25 Mar 2015 10:55:51 -0400
Message-ID: <5512CC5A.8060506@redhat.com>
Date: Wed, 25 Mar 2015 15:55:22 +0100
From: Denys Vlasenko <dvlasenk@redhat.com>
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:24.0) Gecko/20100101 Thunderbird/24.2.0
MIME-Version: 1.0
To: Andy Lutomirski <luto@amacapital.net>
CC: Brian Gerst <brgerst@gmail.com>, Ingo Molnar <mingo@kernel.org>,
        Denys Vlasenko <vda.linux@googlemail.com>,
        Linus Torvalds <torvalds@linux-foundation.org>,
        Steven Rostedt <rostedt@goodmis.org>, Borislav Petkov <bp@alien8.de>,
        "H. Peter Anvin" <hpa@zytor.com>, Oleg Nesterov <oleg@redhat.com>,
        Frederic Weisbecker <fweisbec@gmail.com>,
        Alexei Starovoitov <ast@plumgrid.com>, Will Drewry <wad@chromium.org>,
        Kees Cook <keescook@chromium.org>, X86 ML <x86@kernel.org>,
        "linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH] x86: vdso32/syscall.S: do not load __USER32_DS to %ss
References: <1427129240-15543-1-git-send-email-dvlasenk@redhat.com> <CALCETrW4n_PD=Y=Ozf8FGCHk_-+zvHHYCMXRTr4BmMG85ddQrQ@mail.gmail.com> <CALCETrXq7uxDCu5GCzVoyijuYQ9yoWkRbyPY1=rBFptgtWjQng@mail.gmail.com> <CAK1hOcMX+FuOyvmkTBr73n=sb5qi5i8rupyYNHLj9q0-ydDNtw@mail.gmail.com> <20150324063430.GB26302@gmail.com> <55116FC1.1020400@redhat.com> <CAMzpN2huVk1nNGLYsYXCOyCvpYBTwKYCyn29JQWZ+r8ZZmyskA@mail.gmail.com> <5511C641.7000700@redhat.com> <CALCETrU=fWvyOf-yWG=UQL4jfhbp1vwzPpBd+eeTLjk94xX+8A@mail.gmail.com>
In-Reply-To: <CALCETrU=fWvyOf-yWG=UQL4jfhbp1vwzPpBd+eeTLjk94xX+8A@mail.gmail.com>
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 7bit
Sender: linux-kernel-owner@vger.kernel.org
List-ID: <linux-kernel.vger.kernel.org>
X-Mailing-List: linux-kernel@vger.kernel.org

On 03/24/2015 10:40 PM, Andy Lutomirski wrote:
> The syscall and sysenter stuff is IMO really nasty.  Here's how I'd
> like it to work:
> 
> When you do "call __kernel_vsyscall", I want the net effect to be that
> your eax, ebx, ecx, edx, esi, edi, and ebp at the time of the call end
> up *verbatim* in pt_regs.  Your eip and rsp should be such that, if we
> iret normally using pt_regs, we end up returning correctly to
> userspace.  I want this to be true *regardless* of whether we're doing
> a fast-path or slow-path system call.
> 
> This means that we have, literally (see below for why ret $4):
> 
> int $0x80
> ret $4  <-- regs->eip points here
> 
> Then we add an opportunistic return trampoline: if a special ti flag
> is set (which we set on entry here) and the return eip and regs are
> appropriate, then we change the return at the last minute to vdso code
> that looks like:
> 
> popl $ecx
> popl $edx
> ret

I don't fully understand your intent.

> The vdso code would be something like (so untested it's not even funny):
> 
> __kernel_vsyscall:
>   ALTERNATIVE_2(something or other)
> 
> __kernel_vsyscall_for_intel:
>   pushl $edx
>   pushl $ecx
>   sysenter
>   hlt  <-- just for clarity
> 
> __kernel_vsyscall_for_amd:
>   pushl $ecx
>   syscall
> __vsyscall_after_syscall_insn:
>  ret $4 <-- for binary tracers only

Wouldn't this ret use the pushed ecx value as its return address?
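
To make the question concrete, here is a toy model of the user stack on
the quoted AMD path. It is only a sketch: the addresses are made up, and
it assumes the syscall insn returns to __vsyscall_after_syscall_insn
with the user stack untouched.

```python
# Toy model of a 32-bit user stack; the end of the list is the top of stack.
def push(stack, value):
    stack.append(value)

def ret(stack, imm=0):
    """Model `ret $imm`: pop the return address, then drop imm more bytes."""
    target = stack.pop()
    for _ in range(imm // 4):
        stack.pop()
    return target

CALLER_RET = 0x08048123  # hypothetical address after "call __kernel_vsyscall"
USER_ECX = 0xBFFF0000    # hypothetical syscall argument held in ecx

stack = []
push(stack, CALLER_RET)  # call __kernel_vsyscall
push(stack, USER_ECX)    # pushl ecx
# syscall comes back to __vsyscall_after_syscall_insn, stack unchanged
target = ret(stack, 4)   # ret $4
```

In this model the saved ecx sits on top of the stack, so "ret $4" jumps
to it, and the real return address is what the extra 4 bytes discard.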


> __kernel_vsyscall_for_int80:
>   int $0x80  <-- regs->eip points here during *all* vsyscalls
> 
> __kernel_vsyscall_slow_ret:
>   ret $4

On return, this will pop an extra word off the stack of the
__kernel_vsyscall() caller. Callers don't expect that.
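
A minimal sketch of what I mean, for the case where nothing was pushed
before "int $0x80" (the plain int80 variant). The esp value and the
return address are hypothetical:

```python
def ret(esp, memory, imm=0):
    """Model `ret $imm`: read the return address at esp, then esp += 4 + imm."""
    target = memory[esp]
    return target, esp + 4 + imm

ESP_AT_CALL = 0x1000     # hypothetical esp just before "call __kernel_vsyscall"
CALLER_RET = 0x08048123  # hypothetical return address

memory = {}
esp = ESP_AT_CALL - 4    # the call pushes the return address
memory[esp] = CALLER_RET
# int $0x80 itself does not touch the user stack

target_plain, esp_plain = ret(esp, memory)    # plain ret
target_ret4, esp_ret4 = ret(esp, memory, 4)   # ret $4
```

Both variants reach the right address, but after "ret $4" esp ends up 4
bytes above where the caller left it, which is the extra word.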


> __kernel_vsyscall_sysretl_target:
>   popl $ecx
>   ret
> 
> There is no sysexit.  Take that, Intel.
> 
> On sysenter, we copy regs->cx and regs->dx from user memory and then
> we increment regs->sp by 4 and point regs->eip to
> __kernel_vsyscall_for_int80.  On syscall, we copy regs->cx from user
> memory and point regs->eip to __kernel_vsyscall_for_int80.
> 
> On opportunistic sysretl, we do:
> 
> *regs->sp = regs->cx;  /* put_user or whatever */
> regs->eip = __kernel_vsyscall_sysretl_target
> ...
> sysretl
> 
> We never do sysexit or sysretl in any other code path.  That is, there
> is no really fast path anymore.

I still don't understand the purpose of those "ret $4" insns.
They don't look right.