Message-ID: <5512CC5A.8060506@redhat.com>
Date: Wed, 25 Mar 2015 15:55:22 +0100
From: Denys Vlasenko
To: Andy Lutomirski
CC: Brian Gerst, Ingo Molnar, Denys Vlasenko, Linus Torvalds,
    Steven Rostedt, Borislav Petkov, "H. Peter Anvin", Oleg Nesterov,
    Frederic Weisbecker, Alexei Starovoitov, Will Drewry, Kees Cook,
    X86 ML, "linux-kernel@vger.kernel.org"
Subject: Re: [PATCH] x86: vdso32/syscall.S: do not load __USER32_DS to %ss
References: <1427129240-15543-1-git-send-email-dvlasenk@redhat.com>
    <20150324063430.GB26302@gmail.com> <55116FC1.1020400@redhat.com>
    <5511C641.7000700@redhat.com>
X-Mailing-List: linux-kernel@vger.kernel.org

On 03/24/2015 10:40 PM, Andy Lutomirski wrote:
> The syscall and sysenter stuff is IMO really nasty. Here's how I'd
> like it to work:
>
> When you do "call __kernel_vsyscall", I want the net effect to be that
> your eax, ebx, ecx, edx, esi, edi, and ebp at the time of the call end
> up *verbatim* in pt_regs. Your eip and rsp should be such that, if we
> iret normally using pt_regs, we end up returning correctly to
> userspace. I want this to be true *regardless* of whether we're doing
> a fast-path or slow-path system call.
>
> This means that we have, literally (see below for why ret $4):
>
>         int $0x80
>         ret $4          <-- regs->eip points here
>
> Then we add an opportunistic return trampoline: if a special ti flag
> is set (which we set on entry here) and the return eip and regs are
> appropriate, then we change the return at the last minute to vdso code
> that looks like:
>
>         popl %ecx
>         popl %edx
>         ret

I don't fully understand your intent.

> The vdso code would be something like (so untested it's not even funny):
>
> __kernel_vsyscall:
>         ALTERNATIVE_2(something or other)
>
> __kernel_vsyscall_for_intel:
>         pushl %edx
>         pushl %ecx
>         sysenter
>         hlt             <-- just for clarity
>
> __kernel_vsyscall_for_amd:
>         pushl %ecx
>         syscall
> __vsyscall_after_syscall_insn:
>         ret $4          <-- for binary tracers only

Would this ret use the former ecx value as its return address?

> __kernel_vsyscall_for_int80:
>         int $0x80       <-- regs->eip points here during *all* vsyscalls
>
> __kernel_vsyscall_slow_ret:
>         ret $4

On return, this pops an extra word off the stack of the
__kernel_vsyscall() caller. Callers don't expect that.

> __kernel_vsyscall_sysretl_target:
>         popl %ecx
>         ret
>
> There is no sysexit. Take that, Intel.
>
> On sysenter, we copy regs->cx and regs->dx from user memory and then
> we increment regs->sp by 4 and point regs->eip to
> __kernel_vsyscall_for_int80. On syscall, we copy regs->cx from user
> memory and point regs->eip to __kernel_vsyscall_for_int80.
>
> On opportunistic sysretl, we do:
>
>         *regs->sp = regs->cx; /* put_user or whatever */
>         regs->eip = __kernel_vsyscall_sysretl_target;
>         ...
>         sysretl
>
> We never do sysexit or sysretl in any other code path. That is, there
> is no really fast path anymore.

I still don't understand the purpose of those "ret $4" insns. They
don't look right.
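To make the stack accounting concrete, here is a toy model in C of what three of the return sequences above do to the caller's stack. It is a sketch only, not kernel code. The struct, the helper names, RET_ADDR and the stack size are all made up for illustration. Kernel entry and exit are modeled only by their effect on the user stack and on ecx. The model assumes the int80 path is entered with no extra pushes and falls through from "int $0x80" into the "ret $4" at __kernel_vsyscall_slow_ret.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Toy model of the 32-bit user stack around "call __kernel_vsyscall".
 * NOT kernel code: the stack is an array of 32-bit words, esp is a byte
 * offset into it (growing downward), and the helpers only mimic the
 * stack effects of push/pop/ret. All names here are hypothetical.
 */
#define STACK_WORDS 16
#define CALLER_ESP  (STACK_WORDS * 4 - 8)  /* caller-owned words live above */
#define RET_ADDR    0x08048000u            /* hypothetical return address */

struct cpu {
	uint32_t mem[STACK_WORDS];
	uint32_t esp;
	uint32_t eip;
	uint32_t ecx;
};

static void push(struct cpu *c, uint32_t v)
{
	assert(c->esp >= 4);
	c->esp -= 4;
	c->mem[c->esp / 4] = v;
}

static uint32_t pop(struct cpu *c)
{
	assert(c->esp + 4 <= STACK_WORDS * 4);
	uint32_t v = c->mem[c->esp / 4];
	c->esp += 4;
	return v;
}

/* "ret $imm": pop the return address into eip, then drop imm more bytes. */
static void ret_imm(struct cpu *c, uint32_t imm)
{
	c->eip = pop(c);
	c->esp += imm;
}

/* The caller executes "call __kernel_vsyscall". */
static void do_call(struct cpu *c)
{
	c->esp = CALLER_ESP;
	push(c, RET_ADDR);
}

/* int80 path: "int $0x80" leaves the user stack alone, then "ret $4". */
static struct cpu scenario_int80_ret4(void)
{
	struct cpu c = {0};
	do_call(&c);
	ret_imm(&c, 4);
	return c;  /* eip == RET_ADDR, but esp == CALLER_ESP + 4 */
}

/* syscall path returning at __vsyscall_after_syscall_insn: "ret $4". */
static struct cpu scenario_syscall_ret4(uint32_t user_ecx)
{
	struct cpu c = {0};
	do_call(&c);
	push(&c, user_ecx);  /* pushl %ecx; syscall does not touch the stack */
	ret_imm(&c, 4);
	return c;  /* eip == user_ecx: saved ecx is used as the return address */
}

/* Opportunistic sysretl: the kernel writes cx back to *sp, then
 * __kernel_vsyscall_sysretl_target does "popl %ecx; ret". */
static struct cpu scenario_sysretl(uint32_t user_ecx)
{
	struct cpu c = {0};
	do_call(&c);
	c.ecx = user_ecx;
	push(&c, c.ecx);              /* pushl %ecx */
	c.mem[c.esp / 4] = user_ecx;  /* *regs->sp = regs->cx */
	c.ecx = 0xdeadbeefu;          /* after sysretl, ecx held the return eip */
	c.ecx = pop(&c);              /* popl %ecx */
	ret_imm(&c, 0);               /* ret */
	return c;  /* eip == RET_ADDR, esp == CALLER_ESP, ecx == user_ecx */
}
```

In this model, the three paths end up as follows:

- The int80 path returns to the right address but leaves esp 4 bytes above where the caller expects it.
- The "ret $4" after syscall jumps to whatever value ecx had.
- Only the "popl %ecx; ret" sysretl sequence leaves eip, esp and ecx as the caller expects.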