RFC: (almost) getting rid of FIXUP/RESTORE_TOP_OF

public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed

* RFC: (almost) getting rid of FIXUP/RESTORE_TOP_OF_STACK?
@ 2015-02-23 21:07 Denys Vlasenko
  2015-02-23 23:58 ` Andy Lutomirski
  0 siblings, 1 reply; 4+ messages in thread
From: Denys Vlasenko @ 2015-02-23 21:07 UTC (permalink / raw)
  To: Andy Lutomirski, Borislav Petkov, H. Peter Anvin
  Cc: X86 ML, LKML, Linus Torvalds

Hi Andy, Peter, Borislav,

Since Andy just took my big patch (whew), further simplifications
in entry64.S are possible.

I want to discuss one particular change I have in mind.

The main 64-bit syscall entry point to kernel is used
when userspace executes SYSCALL insn.

When that happens, we load %rsp so that there is
a space for iret stack on top (5 stack slots),
but we don't fully populate it, we only use %rip slot there
(pt_regs->rip). Instead of pt_regs->rsp, we save
user's %rsp in PER_CPU(old_rsp).

Current logic is as follows: SYSCALL insn saves %rip to %rcx,
%rflags to %r11. We save %rcx to pt_regs->rip,
but save %r11 to its "native" place in pt_regs->r11.
On syscall exit, we load pt_regs->rip to %rcx,
pt_regs->r11 to %r11, and execute SYSRET.
So far so good.

Now, the complications. There are cases where we may need
to return via IRET insn. This requires full, valid iret stack.
In particular, pt_regs->r11 needs to be copied to pt_regs->rflags.

Also, we have ptrace poking into pt_regs. It can change pt_regs->rflags,
pt_regs->rsp. These changes need to be copied to pt_regs->r11,
PER_CPU(old_rsp). Also, ptrace can look at pt_regs->cs/ss, they
need to be set to __USER_CS/__USER_DS.

Both of these cases are handled by FIXUP_TOP_OF_STACK and
RESTORE_TOP_OF_STACK macros, which are ugly:

        .macro FIXUP_TOP_OF_STACK tmp offset=0
        movq PER_CPU_VAR(old_rsp),\tmp
        movq \tmp,RSP+\offset(%rsp)
        movq $__USER_DS,SS+\offset(%rsp)
        movq $__USER_CS,CS+\offset(%rsp)
        movq RIP+\offset(%rsp),\tmp  /* get rip */
        movq \tmp,RCX+\offset(%rsp)  /* copy it to rcx as sysret would do */
        movq R11+\offset(%rsp),\tmp  /* get eflags */
        movq \tmp,EFLAGS+\offset(%rsp)
        .endm

        .macro RESTORE_TOP_OF_STACK tmp offset=0
        movq RSP+\offset(%rsp),\tmp
        movq \tmp,PER_CPU_VAR(old_rsp)
        movq EFLAGS+\offset(%rsp),\tmp
        movq \tmp,R11+\offset(%rsp)
        .endm


I propose to change (simplify?) it.


- First step: save %r11 in pt_regs->rflags (because it really
  is user's %rflags). With this change, FIXUP/RESTORE macros
  do not need to copy it to and fro.
  stub_iopl is not needed (it exists now because it touches pt_regs->rflags).
  Question: what to do with pt_regs->r11 slot? It is accessible
  via ptrace, should we zero it out? Save %r11 twice?
  Don't populate it, and make FIXUP_TOP_OF_STACK do
    pt_regs->r11 = pt_regs->rflags copy?
  (We do the latter for rcx now, i.e. FIXUP_TOP_OF_STACK does
    pt_regs->rcx = pt_regs->rip).


- Second: save user's %rsp in pt_regs->rsp, not PER_CPU(old_rsp).
  In fact, PER_CPU(old_rsp) can't be completely eliminated,
  but it can be relegated to a purely temporary storage on syscall entry path.
  All references to it from C code will be gone,
  thread.usersp field is also gone, a few ugly places
  are becoming less ugly, they consistently use pt_regs->sp now. Example:

   unsigned long KSTK_ESP(struct task_struct *task)
   {
  -       return (test_tsk_thread_flag(task, TIF_IA32)) ?
  -                       (task_pt_regs(task)->sp) : ((task)->thread.usersp);
  +       return task_pt_regs(task)->sp;
   }

  In entry64.S, this change boils down to this:

        movq    %rsp,PER_CPU_VAR(old_rsp)
        movq    PER_CPU_VAR(kernel_stack),%rsp
    ...
        movq    $-ENOSYS,RAX(%rsp)
        movq_cfi rax,(ORIG_RAX)
        movq    %rcx,RIP(%rsp)
  +     movq    PER_CPU_VAR(old_rsp),%rcx
  +     movq    %rcx,RSP(%rsp)
    ...
    ...call syscall_table[i]...
    ...
  -     movq    PER_CPU_VAR(old_rsp), %rsp
  +     movq    RSP(%rsp), %rsp
        USERGS_SYSRET64

  FIXUP/RESTORE_TOP_OF_STACK will no longer need to move ->rsp value
  back and forth.

  Is this simplification worth the penalty of having two more MOV insns?
  My testing shows that syscall fast path is ~54.3 ns before
  and after the patch (on 2.7 GHz Sandy Bridge CPU).


- With these two changes, RESTORE_TOP_OF_STACK becomes a no-op
  (both pt_regs->rsp and pt_regs->rflags are in their rightful places now).
  FIXUP_TOP_OF_STACK will be reduced to:

        .macro FIXUP_TOP_OF_STACK tmp offset=0
        movq $__USER_DS,SS+\offset(%rsp)
        movq $__USER_CS,CS+\offset(%rsp)
        movq RIP+\offset(%rsp),\tmp    /* get rip */
        movq \tmp,RCX+\offset(%rsp)    /* copy it to rcx as sysret would do */
        movq EFLAGS+\offset(%rsp),\tmp /* ditto for rflags->r11 */
        movq \tmp,R11+\offset(%rsp)
        .endm

  iow, it now doesn't play around with pt_regs->rsp, and r11<->rflags
  move is in the opposite direction.


- Next possible change is to use PUSH insns to build the stack. Something along the lines of
        swapgs
	mov	%rsp,%gs:old_rsp
        mov	%gs:kernel_stack,%rsp
        sti
	push	$__USER_DS   /* pt_regs->ss */
	push	%gs:old_rsp  /* ->rsp */
	push	%r11         /* ->rflags */
	push	$__USER_CS   /* ->cs */
	push	%rcx         /* ->rip */
	push	%rax         /* ->orig_rax */
	push	%rdi
	push	%rsi
	push	%rdx
	push	%rcx
	push	$-ENOSYS     /* ->rax */
	push	%r8
	push	%r9
	push	%r10
	push	%r11
	sub     $(6*8),%rsp /* rbx, rbp, r12-r15 not saved */

  This populates pt_regs->ss/cs, making FIXUP_TOP_OF_STACK even less complex.
  Also, pushes are 1 or 2-byte insns. For comparison, current code looks like this
  in disassembly:
     120:       0f 01 f8                swapgs
     123:       65 48 89 24 25 00 00    mov    %rsp,%gs:old_rsp
     12a:       00 00
     12c:       65 48 8b 24 25 00 00    mov    %gs:kernel_stack,%rsp
     133:       00 00
     135:       fb                      sti
     136:       48 81 ec 80 00 00 00    sub    $0x80,%rsp
     13d:       4c 89 5c 24 30          mov    %r11,0x30(%rsp)
     142:       4c 89 54 24 38          mov    %r10,0x38(%rsp)
     147:       4c 89 4c 24 40          mov    %r9,0x40(%rsp)
     14c:       4c 89 44 24 48          mov    %r8,0x48(%rsp)
     151:       48 89 54 24 60          mov    %rdx,0x60(%rsp)
     156:       48 89 74 24 68          mov    %rsi,0x68(%rsp)
     15b:       48 89 7c 24 70          mov    %rdi,0x70(%rsp)
     160:       48 c7 44 24 50 da ff    movq   $-ENOSYS,0x50(%rsp)
     167:       ff ff
     169:       48 89 44 24 78          mov    %rax,0x78(%rsp)
     16e:       48 89 8c 24 80 00 00    mov    %rcx,0x80(%rsp)

  As you see, "MOV const,(mem64)" is a ten-byte insn. "PUSH const" is 2 bytes.
  Other MOVs are biggish too.

  However, such prologue performs four more stores that our current one
  (->cs/ss, and extra pushes for rcx/r11).
  Is it a show-stopper?


Thoughts?

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: RFC: (almost) getting rid of FIXUP/RESTORE_TOP_OF_STACK?
  2015-02-23 21:07 RFC: (almost) getting rid of FIXUP/RESTORE_TOP_OF_STACK? Denys Vlasenko
@ 2015-02-23 23:58 ` Andy Lutomirski
  2015-02-24 11:45   ` Denys Vlasenko
  0 siblings, 1 reply; 4+ messages in thread
From: Andy Lutomirski @ 2015-02-23 23:58 UTC (permalink / raw)
  To: Denys Vlasenko
  Cc: Borislav Petkov, H. Peter Anvin, X86 ML, LKML, Linus Torvalds,
	Alexander van Heukelum

On Mon, Feb 23, 2015 at 1:07 PM, Denys Vlasenko <dvlasenk@redhat.com> wrote:
> Hi Andy, Peter, Borislav,
>
> Since Andy just took my big patch (whew), further simplifications
> in entry64.S are possible.
>
> I want to discuss one particular change I have in mind.
>
> The main 64-bit syscall entry point to kernel is used
> when userspace executes SYSCALL insn.
>
> When that happens, we load %rsp so that there is
> a space for iret stack on top (5 stack slots),
> but we don't fully populate it, we only use %rip slot there
> (pt_regs->rip). Instead of pt_regs->rsp, we save
> user's %rsp in PER_CPU(old_rsp).
>
> Current logic is as follows: SYSCALL insn saves %rip to %rcx,
> %rflags to %r11. We save %rcx to pt_regs->rip,
> but save %r11 to its "native" place in pt_regs->r11.
> On syscall exit, we load pt_regs->rip to %rcx,
> pt_regs->r11 to %r11, and execute SYSRET.
> So far so good.
>
> Now, the complications. There are cases where we may need
> to return via IRET insn. This requires full, valid iret stack.
> In particular, pt_regs->r11 needs to be copied to pt_regs->rflags.
>
> Also, we have ptrace poking into pt_regs. It can change pt_regs->rflags,
> pt_regs->rsp. These changes need to be copied to pt_regs->r11,
> PER_CPU(old_rsp). Also, ptrace can look at pt_regs->cs/ss, they
> need to be set to __USER_CS/__USER_DS.
>
> Both of these cases are handled by FIXUP_TOP_OF_STACK and
> RESTORE_TOP_OF_STACK macros, which are ugly:
>
>         .macro FIXUP_TOP_OF_STACK tmp offset=0
>         movq PER_CPU_VAR(old_rsp),\tmp
>         movq \tmp,RSP+\offset(%rsp)
>         movq $__USER_DS,SS+\offset(%rsp)
>         movq $__USER_CS,CS+\offset(%rsp)
>         movq RIP+\offset(%rsp),\tmp  /* get rip */
>         movq \tmp,RCX+\offset(%rsp)  /* copy it to rcx as sysret would do */
>         movq R11+\offset(%rsp),\tmp  /* get eflags */
>         movq \tmp,EFLAGS+\offset(%rsp)
>         .endm
>
>         .macro RESTORE_TOP_OF_STACK tmp offset=0
>         movq RSP+\offset(%rsp),\tmp
>         movq \tmp,PER_CPU_VAR(old_rsp)
>         movq EFLAGS+\offset(%rsp),\tmp
>         movq \tmp,R11+\offset(%rsp)
>         .endm
>
>
> I propose to change (simplify?) it.
>

For previous discussion, see:

http://lkml.kernel.org/r/CALCETrU84yDv+k=KojhODKfaJ-HHCWMuJteH98FR6md-JbmiQQ@mail.gmail.com

>
> - First step: save %r11 in pt_regs->rflags (because it really
>   is user's %rflags). With this change, FIXUP/RESTORE macros
>   do not need to copy it to and fro.
>   stub_iopl is not needed (it exists now because it touches pt_regs->rflags).

Yes please!

>   Question: what to do with pt_regs->r11 slot? It is accessible
>   via ptrace, should we zero it out? Save %r11 twice?
>   Don't populate it, and make FIXUP_TOP_OF_STACK do
>     pt_regs->r11 = pt_regs->rflags copy?
>   (We do the latter for rcx now, i.e. FIXUP_TOP_OF_STACK does
>     pt_regs->rcx = pt_regs->rip).
>

For step 1, just don't populate it at all.  If there are cases where
ptrace can see it, they're already buggy, and this is nothing new.  I
think this should be its own patch, since it's a simplification *and*
a clear performance gain, even if tiny.

>
> - Second: save user's %rsp in pt_regs->rsp, not PER_CPU(old_rsp).
>   In fact, PER_CPU(old_rsp) can't be completely eliminated,
>   but it can be relegated to a purely temporary storage on syscall entry path.
>   All references to it from C code will be gone,
>   thread.usersp field is also gone, a few ugly places
>   are becoming less ugly, they consistently use pt_regs->sp now. Example:
>
>    unsigned long KSTK_ESP(struct task_struct *task)
>    {
>   -       return (test_tsk_thread_flag(task, TIF_IA32)) ?
>   -                       (task_pt_regs(task)->sp) : ((task)->thread.usersp);
>   +       return task_pt_regs(task)->sp;
>    }
>
>   In entry64.S, this change boils down to this:
>
>         movq    %rsp,PER_CPU_VAR(old_rsp)
>         movq    PER_CPU_VAR(kernel_stack),%rsp
>     ...
>         movq    $-ENOSYS,RAX(%rsp)
>         movq_cfi rax,(ORIG_RAX)
>         movq    %rcx,RIP(%rsp)
>   +     movq    PER_CPU_VAR(old_rsp),%rcx
>   +     movq    %rcx,RSP(%rsp)
>     ...
>     ...call syscall_table[i]...
>     ...
>   -     movq    PER_CPU_VAR(old_rsp), %rsp
>   +     movq    RSP(%rsp), %rsp
>         USERGS_SYSRET64
>
>   FIXUP/RESTORE_TOP_OF_STACK will no longer need to move ->rsp value
>   back and forth.
>
>   Is this simplification worth the penalty of having two more MOV insns?
>   My testing shows that syscall fast path is ~54.3 ns before
>   and after the patch (on 2.7 GHz Sandy Bridge CPU).
>

Yes please.  Unless store forwarding can't handle gs-based addressing,
this should be nearly free.  Separate patch, though, please.

>
> - With these two changes, RESTORE_TOP_OF_STACK becomes a no-op
>   (both pt_regs->rsp and pt_regs->rflags are in their rightful places now).
>   FIXUP_TOP_OF_STACK will be reduced to:
>
>         .macro FIXUP_TOP_OF_STACK tmp offset=0
>         movq $__USER_DS,SS+\offset(%rsp)
>         movq $__USER_CS,CS+\offset(%rsp)
>         movq RIP+\offset(%rsp),\tmp    /* get rip */
>         movq \tmp,RCX+\offset(%rsp)    /* copy it to rcx as sysret would do */
>         movq EFLAGS+\offset(%rsp),\tmp /* ditto for rflags->r11 */
>         movq \tmp,R11+\offset(%rsp)
>         .endm
>
>   iow, it now doesn't play around with pt_regs->rsp, and r11<->rflags
>   move is in the opposite direction.
>
>
> - Next possible change is to use PUSH insns to build the stack. Something along the lines of
>         swapgs
>         mov     %rsp,%gs:old_rsp
>         mov     %gs:kernel_stack,%rsp
>         sti
>         push    $__USER_DS   /* pt_regs->ss */
>         push    %gs:old_rsp  /* ->rsp */
>         push    %r11         /* ->rflags */
>         push    $__USER_CS   /* ->cs */
>         push    %rcx         /* ->rip */
>         push    %rax         /* ->orig_rax */
>         push    %rdi
>         push    %rsi
>         push    %rdx
>         push    %rcx
>         push    $-ENOSYS     /* ->rax */
>         push    %r8
>         push    %r9
>         push    %r10
>         push    %r11
>         sub     $(6*8),%rsp /* rbx, rbp, r12-r15 not saved */

Yay!  Can we have a macro PUSH_XYZ for most of this?

Except that we should do this in three parts.  One is to always
populate the extra copy of r11 and rcx, one is to always populate cs
and ss, and one is to switch to using push for the GPRs.

Also, at some point I want to move that sti down a ways.  I'm scared
of information leaks with the current structure.

>
>   This populates pt_regs->ss/cs, making FIXUP_TOP_OF_STACK even less complex.
>   Also, pushes are 1 or 2-byte insns. For comparison, current code looks like this
>   in disassembly:
>      120:       0f 01 f8                swapgs
>      123:       65 48 89 24 25 00 00    mov    %rsp,%gs:old_rsp
>      12a:       00 00
>      12c:       65 48 8b 24 25 00 00    mov    %gs:kernel_stack,%rsp
>      133:       00 00
>      135:       fb                      sti
>      136:       48 81 ec 80 00 00 00    sub    $0x80,%rsp
>      13d:       4c 89 5c 24 30          mov    %r11,0x30(%rsp)
>      142:       4c 89 54 24 38          mov    %r10,0x38(%rsp)
>      147:       4c 89 4c 24 40          mov    %r9,0x40(%rsp)
>      14c:       4c 89 44 24 48          mov    %r8,0x48(%rsp)
>      151:       48 89 54 24 60          mov    %rdx,0x60(%rsp)
>      156:       48 89 74 24 68          mov    %rsi,0x68(%rsp)
>      15b:       48 89 7c 24 70          mov    %rdi,0x70(%rsp)
>      160:       48 c7 44 24 50 da ff    movq   $-ENOSYS,0x50(%rsp)
>      167:       ff ff
>      169:       48 89 44 24 78          mov    %rax,0x78(%rsp)
>      16e:       48 89 8c 24 80 00 00    mov    %rcx,0x80(%rsp)
>
>   As you see, "MOV const,(mem64)" is a ten-byte insn. "PUSH const" is 2 bytes.
>   Other MOVs are biggish too.
>
>   However, such prologue performs four more stores that our current one
>   (->cs/ss, and extra pushes for rcx/r11).
>   Is it a show-stopper?

I bet it's close enough to neutral that the simplification makes it
worth it.  The current code is terrifying.

The final step would be unconditionally push rbx, rbp, and r12 through
r15.  Doing that would enable *huge* cleanups in the rest of the asm
code.  I've spent a *long* time fighting that optimization wrt
seccomp.

--Andy

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: RFC: (almost) getting rid of FIXUP/RESTORE_TOP_OF_STACK?
  2015-02-23 23:58 ` Andy Lutomirski
@ 2015-02-24 11:45   ` Denys Vlasenko
  2015-02-25 16:09     ` Andy Lutomirski
  0 siblings, 1 reply; 4+ messages in thread
From: Denys Vlasenko @ 2015-02-24 11:45 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Borislav Petkov, H. Peter Anvin, X86 ML, LKML, Linus Torvalds,
	Alexander van Heukelum

On 02/24/2015 12:58 AM, Andy Lutomirski wrote:
>> - Next possible change is to use PUSH insns to build the stack. Something along the lines of
>>         swapgs
>>         mov     %rsp,%gs:old_rsp
>>         mov     %gs:kernel_stack,%rsp
>>         sti
>>         push    $__USER_DS   /* pt_regs->ss */
>>         push    %gs:old_rsp  /* ->rsp */
>>         push    %r11         /* ->rflags */
>>         push    $__USER_CS   /* ->cs */
>>         push    %rcx         /* ->rip */
>>         push    %rax         /* ->orig_rax */
>>         push    %rdi
>>         push    %rsi
>>         push    %rdx
>>         push    %rcx
>>         push    $-ENOSYS     /* ->rax */
>>         push    %r8
>>         push    %r9
>>         push    %r10
>>         push    %r11
>>         sub     $(6*8),%rsp /* rbx, rbp, r12-r15 not saved */

Correction:
         push    %r11
         sub     $(6*8),%rsp /* rbx, rbp, r12-r15 not saved */
should be
         sub     $(7*8),%rsp /* r11, rbx, rbp, r12-r15 not saved */
since we don't need to save r11 either.

> 
> Yay!

"yay!" as in "I like this!" or as in "I am surprised" ?

>  Can we have a macro PUSH_XYZ for most of this?

Yes, it can be a macro. I'm not sure we'll have more than one
such place, tho.
Do you want a macro even if it will be once-use only?


^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: RFC: (almost) getting rid of FIXUP/RESTORE_TOP_OF_STACK?
  2015-02-24 11:45   ` Denys Vlasenko
@ 2015-02-25 16:09     ` Andy Lutomirski
  0 siblings, 0 replies; 4+ messages in thread
From: Andy Lutomirski @ 2015-02-25 16:09 UTC (permalink / raw)
  To: Denys Vlasenko
  Cc: Borislav Petkov, H. Peter Anvin, X86 ML, LKML, Linus Torvalds,
	Alexander van Heukelum

On Tue, Feb 24, 2015 at 3:45 AM, Denys Vlasenko <dvlasenk@redhat.com> wrote:
> On 02/24/2015 12:58 AM, Andy Lutomirski wrote:
>>> - Next possible change is to use PUSH insns to build the stack. Something along the lines of
>>>         swapgs
>>>         mov     %rsp,%gs:old_rsp
>>>         mov     %gs:kernel_stack,%rsp
>>>         sti
>>>         push    $__USER_DS   /* pt_regs->ss */
>>>         push    %gs:old_rsp  /* ->rsp */
>>>         push    %r11         /* ->rflags */
>>>         push    $__USER_CS   /* ->cs */
>>>         push    %rcx         /* ->rip */
>>>         push    %rax         /* ->orig_rax */
>>>         push    %rdi
>>>         push    %rsi
>>>         push    %rdx
>>>         push    %rcx
>>>         push    $-ENOSYS     /* ->rax */
>>>         push    %r8
>>>         push    %r9
>>>         push    %r10
>>>         push    %r11
>>>         sub     $(6*8),%rsp /* rbx, rbp, r12-r15 not saved */
>
> Correction:
>          push    %r11
>          sub     $(6*8),%rsp /* rbx, rbp, r12-r15 not saved */
> should be
>          sub     $(7*8),%rsp /* r11, rbx, rbp, r12-r15 not saved */
> since we don't need to save r11 either.
>
>>
>> Yay!
>
> "yay!" as in "I like this!" or as in "I am surprised" ?
>
>>  Can we have a macro PUSH_XYZ for most of this?
>
> Yes, it can be a macro. I'm not sure we'll have more than one
> such place, tho.
> Do you want a macro even if it will be once-use only?
>

Maybe not.  If it gets other uses, it can be macro-ized.

-- 
Andy Lutomirski
AMA Capital Management, LLC

^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2015-02-25 16:09 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2015-02-23 21:07 RFC: (almost) getting rid of FIXUP/RESTORE_TOP_OF_STACK? Denys Vlasenko
2015-02-23 23:58 ` Andy Lutomirski
2015-02-24 11:45   ` Denys Vlasenko
2015-02-25 16:09     ` Andy Lutomirski

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox