From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1753623Ab1HWFK6 (ORCPT ); Tue, 23 Aug 2011 01:10:58 -0400
Received: from DMZ-MAILSEC-SCANNER-5.MIT.EDU ([18.7.68.34]:57909 "EHLO
	dmz-mailsec-scanner-5.mit.edu" rhost-flags-OK-OK-OK-OK)
	by vger.kernel.org with ESMTP id S1750908Ab1HWFKv (ORCPT );
	Tue, 23 Aug 2011 01:10:51 -0400
X-AuditID: 12074422-b7ba7ae000000a14-70-4e53365386e6
Message-ID: <4E533651.8070205@mit.edu>
Date: Tue, 23 Aug 2011 01:10:41 -0400
From: Andrew Lutomirski
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:5.0) Gecko/20110707 Thunderbird/5.0
MIME-Version: 1.0
To: "H. Peter Anvin"
CC: Al Viro , Linus Torvalds , mingo@redhat.com, Richard Weinberger ,
	user-mode-linux-devel@lists.sourceforge.net,
	linux-kernel@vger.kernel.org
Subject: Re: SYSCALL, ptrace and syscall restart breakages (Re: [RFC] weird
 crap with vdso on uml/i386)
References: <20110821063443.GH2203@ZenIV.linux.org.uk>
	<20110821084230.GI2203@ZenIV.linux.org.uk>
	<20110821144352.GJ2203@ZenIV.linux.org.uk>
	<20110821164124.GL2203@ZenIV.linux.org.uk>
	<20110822011645.GM2203@ZenIV.linux.org.uk>
	<20110822040759.GQ2203@ZenIV.linux.org.uk>
	<4E51D70A.1060001@zytor.com>
	<20110822042605.GR2203@ZenIV.linux.org.uk>
	<4E51E325.2050502@zytor.com>
In-Reply-To: <4E51E325.2050502@zytor.com>
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 7bit
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

On 08/22/2011 01:03 AM, H. Peter Anvin wrote:
> On 08/21/2011 09:26 PM, Al Viro wrote:
>> On Sun, Aug 21, 2011 at 09:11:54PM -0700, H. Peter Anvin wrote:
>>>> lack of point - the *only* CPU where it would matter would be K6-2, IIRC,
>>>> and (again, IIRC) it had some differences in SYSCALL semantics compared to
>>>> K7 (which supports SYSENTER as well). Bugger if I remember what those
>>>> differences might've been... Some flag not cleared?
>>>
>>> The most likely reason for a binary to execute a stray SYSCALL is
>>> because they read it out of the vdso. Totally daft, but we certainly
>>> see a lot of stupid things as evidenced by the JIT thread earlier this
>>> month.
>>
>> Um... What, blindly, no matter what surrounds it in there? What will
>> happen to the same eager JIT when it steps on SYSENTER?
>
> The JIT will have had to manage SYSENTER already. It's not a change,
> whereas SYSCALL would be. We could just try it, and see if anything
> breaks, of course.

Here's a possible solution that works for standalone SYSCALL and vdso
SYSCALL. The idea is to preserve the exact same SYSCALL invocation
sequence.
Logically, the SYSCALL instruction does:

  push %ebp
  mov %ebp,%ecx
  mov 4(%esp),%ebp
  call __fake_int80

and __fake_int80 is:

  int $0x80
  mov 4(%esp),%ebp
  ret $4

The entire system call sequence is then (effectively):

  push %ebp
  movl %ecx,%ebp
  ; "SYSCALL" starts here
  push %ebp
  mov %ebp,%ecx
  mov 4(%esp),%ebp
  call __fake_int80
  ; "SYSCALL" ends here
  movl %ebp,%ecx
  popl %ebp
  ret

So we rearrange ebp and ecx and then immediately rearrange them back.
The landing point tweaks them again so that we preserve the old
semantics of SYSCALL. But now the pt_regs values exactly match what
would have happened if we entered via the int 0x80 path, so there
shouldn't be any corner cases with ptrace or restart -- as far as either
one is concerned, we actually entered via int 0x80. If we deliver a
signal, the signal handler returns to the int 0x80 instruction.

Am I missing something?

Extremely buggy, incomplete code that implements this is:

diff --git a/arch/x86/ia32/ia32entry.S b/arch/x86/ia32/ia32entry.S
index a0e866d..6cda8ce 100644
--- a/arch/x86/ia32/ia32entry.S
+++ b/arch/x86/ia32/ia32entry.S
@@ -291,24 +291,59 @@ ENTRY(ia32_cstar_target)
 	ENABLE_INTERRUPTS(CLBR_NONE)
 	SAVE_ARGS 8,0,0
 	movl	%eax,%eax	/* zero extension */
-	movq	%rax,ORIG_RAX-ARGOFFSET(%rsp)
-	movq	%rcx,RIP-ARGOFFSET(%rsp)
-	CFI_REL_OFFSET rip,RIP-ARGOFFSET
-	movq	%rbp,RCX-ARGOFFSET(%rsp) /* this lies slightly to ptrace */
-	movl	%ebp,%ecx
+
+	/*
+	 * This does (from the user's point of view):
+	 *  push %ebp
+	 *  mov %ebp, %ecx
+	 *  mov 4(%esp), %ebp
+	 *  call
+	 *
+	 * User address access does not need access_ok check as r8
+	 * has been zero-extended, so even with the offsets it cannot
+	 * exceed 2**32 + 8.
+	 */
+
+	/* XXX: need to check that vdso actually exists. */
+	/* XXX: ia32_badarg may do bad things to the user state. */
+
+	/* move ebp into place on the user stack */
+1:	movl	%ebp,-4(%r8)
+	.section __ex_table,"a"
+	.quad 1b,ia32_badarg
+	.previous
+
+	/* move eip into place on the user stack */
+1:	movl	%ecx,-8(%r8)	/* user eip is in ecx */
+	.section __ex_table,"a"
+	.quad 1b,ia32_badarg
+	.previous
+
+	/* move ebp to ecx in CPU registers and argument save area */
+	mov	%ebp,%ecx	/* 32-bit move zero-extends into rcx */
+	movq	%rcx,RCX-ARGOFFSET(%rsp)
+
+	/*
+	 * move arg6 to ebp in CPU registers and argument save area
+	 * minor optimization: the actual value of ebp is irrelevant,
+	 * so stick it straight into r9d -- see the definition of
+	 * IA32_ARG_FIXUP.
+	 */
+1:	movl	(%r8),%r9d
+	.section __ex_table,"a"
+	.quad 1b,ia32_badarg
+	.previous
+
+	/* Do the fake call */
+	movl [insert address of int 0x80; ret helper + 2 here],RIP-ARGOFFSET(%rsp)
+	subl	$8,%r8d	/* we pushed twice */
+
 	movq	$__USER32_CS,CS-ARGOFFSET(%rsp)
 	movq	$__USER32_DS,SS-ARGOFFSET(%rsp)
 	movq	%r11,EFLAGS-ARGOFFSET(%rsp)
 	/*CFI_REL_OFFSET rflags,EFLAGS-ARGOFFSET*/
 	movq	%r8,RSP-ARGOFFSET(%rsp)
 	CFI_REL_OFFSET rsp,RSP-ARGOFFSET
-	/* no need to do an access_ok check here because r8 has been
-	   32bit zero extended */
-	/* hardware stack frame is complete now */
-1:	movl	(%r8),%r9d
-	.section __ex_table,"a"
-	.quad 1b,ia32_badarg
-	.previous
 	GET_THREAD_INFO(%r10)
 	orl	$TS_COMPAT,TI_status(%r10)
 	testl	$_TIF_WORK_SYSCALL_ENTRY,TI_flags(%r10)
diff --git a/arch/x86/vdso/vdso32/syscall.S b/arch/x86/vdso/vdso32/syscall.S
index 5415b56..a3e48b0 100644
--- a/arch/x86/vdso/vdso32/syscall.S
+++ b/arch/x86/vdso/vdso32/syscall.S
@@ -19,8 +19,8 @@ __kernel_vsyscall:
 .Lpush_ebp:
 	movl	%ecx, %ebp
 	syscall
-	movl	$__USER32_DS, %ecx
-	movl	%ecx, %ss
+	/* The ret in the fake int80 entry lands here */
+	/* ss is already correct AFAICS */
 	movl	%ebp, %ecx
 	popl	%ebp
 .Lpop_ebp:
@@ -28,6 +28,11 @@ __kernel_vsyscall:
 .LEND_vsyscall:
 	.size __kernel_vsyscall,.-.LSTART_vsyscall
 
+__kernel_vsyscall_fake_int80:
+	int	$0x80
+	mov	4(%esp),%ebp
+	ret	$4
+
 	.section .eh_frame,"a",@progbits
 .LSTARTFRAME:
 	.long .LENDCIE-.LSTARTCIE

This could be further simplified by checking if any work flags are set
and bailing immediately to the right place in the int 0x80 entry.

--Andy
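
The register shuffle described in this message can be checked with a toy
model. The sketch below is not kernel code and is not part of the patch:
the Cpu class, the vsyscall function and the RETURN_ADDR value are
invented for illustration. It tracks only ecx, ebp, esp and the user
stack, ignores eax and the system call itself, and assumes the usual
int 0x80 convention of arg2 in ecx and arg6 in ebp.

```python
"""Toy model of the proposed SYSCALL-as-fake-int80 register shuffle."""

MASK = 0xFFFFFFFF
RETURN_ADDR = 0xAAAA0000  # hypothetical address of the insn after SYSCALL


class Cpu:
    """Minimal 32-bit user state: ecx, ebp, esp and a sparse stack."""

    def __init__(self, ecx, ebp, esp=0x1000):
        self.ecx, self.ebp, self.esp = ecx, ebp, esp
        self.mem = {}

    def push(self, value):
        self.esp = (self.esp - 4) & MASK
        self.mem[self.esp] = value & MASK

    def pop(self):
        value = self.mem[self.esp]
        self.esp = (self.esp + 4) & MASK
        return value

    def load(self, offset):
        """Read the 32-bit word at offset(%esp)."""
        return self.mem[(self.esp + offset) & MASK]


def vsyscall(cpu):
    """Run the whole sequence; return (ecx, ebp) as seen at the int 0x80."""
    # __kernel_vsyscall prologue
    cpu.push(cpu.ebp)                # push %ebp          (saves arg6)
    cpu.ebp = cpu.ecx                # movl %ecx,%ebp     (arg2 survives SYSCALL)

    # what SYSCALL "logically" does under the proposal
    cpu.push(cpu.ebp)                # push %ebp
    cpu.ecx = cpu.ebp                # mov %ebp,%ecx
    cpu.ebp = cpu.load(4)            # mov 4(%esp),%ebp   (arg6 back in ebp)
    cpu.push(RETURN_ADDR)            # call __fake_int80

    # __fake_int80
    seen = (cpu.ecx, cpu.ebp)        # int $0x80: what pt_regs would record
    cpu.ebp = cpu.load(4)            # mov 4(%esp),%ebp
    ret = cpu.pop()                  # ret $4: pop the return address ...
    cpu.esp = (cpu.esp + 4) & MASK   # ... and drop the extra pushed word
    assert ret == RETURN_ADDR

    # __kernel_vsyscall epilogue
    cpu.ecx = cpu.ebp                # movl %ebp,%ecx
    cpu.ebp = cpu.pop()              # popl %ebp
    return seen


if __name__ == "__main__":
    cpu = Cpu(ecx=0x22222222, ebp=0x66666666)
    seen = vsyscall(cpu)
    print("at int 0x80: ecx=%#x ebp=%#x" % seen)
    print("on return:   ecx=%#x ebp=%#x esp=%#x" % (cpu.ecx, cpu.ebp, cpu.esp))
```

In this model the values recorded at the int 0x80 are the caller's
original ecx and ebp, and all three tracked registers are back to their
starting values when __kernel_vsyscall returns, which is the property
the proposal relies on. It says nothing about signal delivery, restart
or faults on the user stack.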