linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* ia32_sysenter_target does not preserve EFLAGS
@ 2015-03-27 14:25 Denys Vlasenko
  2015-03-27 17:12 ` Borislav Petkov
                   ` (2 more replies)
  0 siblings, 3 replies; 19+ messages in thread
From: Denys Vlasenko @ 2015-03-27 14:25 UTC (permalink / raw)
  To: Ingo Molnar, Andy Lutomirski, Borislav Petkov, X86 ML, LKML

Hi,

While running some tests I noticed that EFLAGS
is not saved across syscalls if I use 32-bit
userspace, use SYSENTER, and paravirt is active.

Looking at the code, it's actually clear why that happens.

/*
 * SYSENTER loads ss, rsp, cs, and rip from previously programmed MSRs.
 * IF and VM in rflags are cleared (IOW: interrupts are off).
 * SYSENTER does not save anything on the stack,
 * and does not save old rip (!!!) and rflags.
 */
ENTRY(ia32_sysenter_target)
        SWAPGS_UNSAFE_STACK  <============================
        movq    PER_CPU_VAR(cpu_tss + TSS_sp0), %rsp
        ENABLE_INTERRUPTS(CLBR_NONE)

        movl    %ebp, %ebp
        movl    %eax, %eax
        movl    ASM_THREAD_INFO(TI_sysenter_return, %rsp, 0), %r10d

        /* Construct struct pt_regs on stack */
        pushq_cfi       $__USER32_DS            /* pt_regs->ss */
        pushq_cfi       %rbp                    /* pt_regs->sp */
        CFI_REL_OFFSET  rsp,0
        pushfq_cfi                              /* pt_regs->flags */

The SWAPGS_UNSAFE_STACK, it's it involves paravirt callbacks,
will change EFLAGS, and it *can't* save/restore them -
there is no place to save it, since neither stack nor
PER_CPU() is usable at that point.

Interestingly, *no one ever complained*!

Apparently, users *don't* depend on arithmetic flags
to survive over syscall. They also okay with DF flag
being cleared.

Let's go flag-by-flag.

ID - probably no one depends on it
VIP,VIF,VM - v86 stuff, not supported in 64bit
AC - someone probably do use this
RF - should be cleared to 0
NT - iret via task gate, not supported in 64bit
IOPL - usually 00, sys_iopl() can change it
DF - according to C ABI, should be 0
IF - should be preserved (but almost always 1)
TF - should be preserved
arith flags - probably no one cares

IOW. Bits to be preseved are only AC, IOPL, TF, and _maybe_
IF.

AC and IOPL are preserved even with this paravirt quirk
because paravirt hooks do not mangle them.

TF preservation and proper restoration is handled by
	do_debug + syscall_trace_enter_phase2 + iret
combo.

We unconditionally set IF. This is only a problem for applications
which use sys_iopl(3) and, disable IRQs in userspace and perform
syscalls. The set of such apps is probably empty.
(This "bug" exists even for non-paravirt case).

So, formally, we have a bug: we do not preserve IF,
DF and arith flags.

I'm proposing to use this opportunity to amend syscall ABI
to say that arith flags are not preserved across syscalls,
and DF can be cleared to 0 by syscalls (but can't be set to 1).
Evidently, it's broken for some time for some virtualized
setups and users are okay.

I'm not sure what to do with the "bug" of forcing IF=1.
Fix it? Or also declare that syscalls can set IF=1?
Do you think this is a legitimate userspace code?

	sys_iopl(3);
	cli;
	syscall();
	/* expects irqs still disabled */

-- 
vda

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: ia32_sysenter_target does not preserve EFLAGS
  2015-03-27 14:25 ia32_sysenter_target does not preserve EFLAGS Denys Vlasenko
@ 2015-03-27 17:12 ` Borislav Petkov
  2015-03-27 18:37 ` Andy Lutomirski
  2015-03-27 20:00 ` Linus Torvalds
  2 siblings, 0 replies; 19+ messages in thread
From: Borislav Petkov @ 2015-03-27 17:12 UTC (permalink / raw)
  To: Denys Vlasenko; +Cc: Ingo Molnar, Andy Lutomirski, X86 ML, LKML, Linus Torvalds

+ Linus.

On Fri, Mar 27, 2015 at 03:25:47PM +0100, Denys Vlasenko wrote:
> Hi,
> 
> While running some tests I noticed that EFLAGS
> is not saved across syscalls if I use 32-bit
> userspace, use SYSENTER, and paravirt is active.
> 
> Looking at the code, it's actually clear why that happens.
> 
> /*
>  * SYSENTER loads ss, rsp, cs, and rip from previously programmed MSRs.
>  * IF and VM in rflags are cleared (IOW: interrupts are off).
>  * SYSENTER does not save anything on the stack,
>  * and does not save old rip (!!!) and rflags.
>  */
> ENTRY(ia32_sysenter_target)
>         SWAPGS_UNSAFE_STACK  <============================
>         movq    PER_CPU_VAR(cpu_tss + TSS_sp0), %rsp
>         ENABLE_INTERRUPTS(CLBR_NONE)
> 
>         movl    %ebp, %ebp
>         movl    %eax, %eax
>         movl    ASM_THREAD_INFO(TI_sysenter_return, %rsp, 0), %r10d
> 
>         /* Construct struct pt_regs on stack */
>         pushq_cfi       $__USER32_DS            /* pt_regs->ss */
>         pushq_cfi       %rbp                    /* pt_regs->sp */
>         CFI_REL_OFFSET  rsp,0
>         pushfq_cfi                              /* pt_regs->flags */
> 
> The SWAPGS_UNSAFE_STACK, it's it involves paravirt callbacks,
> will change EFLAGS, and it *can't* save/restore them -
> there is no place to save it, since neither stack nor
> PER_CPU() is usable at that point.
> 
> Interestingly, *no one ever complained*!
> 
> Apparently, users *don't* depend on arithmetic flags
> to survive over syscall. They also okay with DF flag
> being cleared.
> 
> Let's go flag-by-flag.
> 
> ID - probably no one depends on it

It is used as a toggle to detect CPUID support. Can a SYSENTER happen
while something toggles it? Probably...

> VIP,VIF,VM - v86 stuff, not supported in 64bit
> AC - someone probably do use this
> RF - should be cleared to 0
> NT - iret via task gate, not supported in 64bit
> IOPL - usually 00, sys_iopl() can change it
> DF - according to C ABI, should be 0
> IF - should be preserved (but almost always 1)
> TF - should be preserved
> arith flags - probably no one cares
> 
> IOW. Bits to be preseved are only AC, IOPL, TF, and _maybe_
> IF.
> 
> AC and IOPL are preserved even with this paravirt quirk
> because paravirt hooks do not mangle them.
> 
> TF preservation and proper restoration is handled by
> 	do_debug + syscall_trace_enter_phase2 + iret
> combo.
> 
> We unconditionally set IF. This is only a problem for applications
> which use sys_iopl(3) and, disable IRQs in userspace and perform
> syscalls. The set of such apps is probably empty.
> (This "bug" exists even for non-paravirt case).
> 
> So, formally, we have a bug: we do not preserve IF,
> DF and arith flags.
> 
> I'm proposing to use this opportunity to amend syscall ABI
> to say that arith flags are not preserved across syscalls,
> and DF can be cleared to 0 by syscalls (but can't be set to 1).
> Evidently, it's broken for some time for some virtualized
> setups and users are okay.
> 
> I'm not sure what to do with the "bug" of forcing IF=1.
> Fix it? Or also declare that syscalls can set IF=1?
> Do you think this is a legitimate userspace code?
> 
> 	sys_iopl(3);
> 	cli;
> 	syscall();
> 	/* expects irqs still disabled */
> 
> -- 
> vda
> 

-- 
Regards/Gruss,
    Boris.

ECO tip #101: Trim your mails when you reply.
--

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: ia32_sysenter_target does not preserve EFLAGS
  2015-03-27 14:25 ia32_sysenter_target does not preserve EFLAGS Denys Vlasenko
  2015-03-27 17:12 ` Borislav Petkov
@ 2015-03-27 18:37 ` Andy Lutomirski
  2015-03-27 20:09   ` Linus Torvalds
  2015-03-27 20:53   ` Brian Gerst
  2015-03-27 20:00 ` Linus Torvalds
  2 siblings, 2 replies; 19+ messages in thread
From: Andy Lutomirski @ 2015-03-27 18:37 UTC (permalink / raw)
  To: Denys Vlasenko, Linus Torvalds
  Cc: Borislav Petkov, linux-kernel@vger.kernel.org, X86 ML,
	Ingo Molnar

On Mar 27, 2015 7:26 AM, "Denys Vlasenko" <dvlasenk@redhat.com> wrote:
>
> Hi,
>
> While running some tests I noticed that EFLAGS
> is not saved across syscalls if I use 32-bit
> userspace, use SYSENTER, and paravirt is active.
>
> Looking at the code, it's actually clear why that happens.
>
> /*
>  * SYSENTER loads ss, rsp, cs, and rip from previously programmed MSRs.
>  * IF and VM in rflags are cleared (IOW: interrupts are off).
>  * SYSENTER does not save anything on the stack,
>  * and does not save old rip (!!!) and rflags.
>  */
> ENTRY(ia32_sysenter_target)
>         SWAPGS_UNSAFE_STACK  <============================
>         movq    PER_CPU_VAR(cpu_tss + TSS_sp0), %rsp
>         ENABLE_INTERRUPTS(CLBR_NONE)
>
>         movl    %ebp, %ebp
>         movl    %eax, %eax
>         movl    ASM_THREAD_INFO(TI_sysenter_return, %rsp, 0), %r10d
>
>         /* Construct struct pt_regs on stack */
>         pushq_cfi       $__USER32_DS            /* pt_regs->ss */
>         pushq_cfi       %rbp                    /* pt_regs->sp */
>         CFI_REL_OFFSET  rsp,0
>         pushfq_cfi                              /* pt_regs->flags */
>
> The SWAPGS_UNSAFE_STACK, it's it involves paravirt callbacks,
> will change EFLAGS, and it *can't* save/restore them -
> there is no place to save it, since neither stack nor
> PER_CPU() is usable at that point.
>
> Interestingly, *no one ever complained*!
>
> Apparently, users *don't* depend on arithmetic flags
> to survive over syscall. They also okay with DF flag
> being cleared.
>
> Let's go flag-by-flag.
>
> ID - probably no one depends on it
> VIP,VIF,VM - v86 stuff, not supported in 64bit
> AC - someone probably do use this
> RF - should be cleared to 0
> NT - iret via task gate, not supported in 64bit
> IOPL - usually 00, sys_iopl() can change it
> DF - according to C ABI, should be 0
> IF - should be preserved (but almost always 1)
> TF - should be preserved
> arith flags - probably no one cares
>
> IOW. Bits to be preseved are only AC, IOPL, TF, and _maybe_
> IF.
>
> AC and IOPL are preserved even with this paravirt quirk
> because paravirt hooks do not mangle them.
>
> TF preservation and proper restoration is handled by
>         do_debug + syscall_trace_enter_phase2 + iret
> combo.
>
> We unconditionally set IF. This is only a problem for applications
> which use sys_iopl(3) and, disable IRQs in userspace and perform
> syscalls. The set of such apps is probably empty.
> (This "bug" exists even for non-paravirt case).
>
> So, formally, we have a bug: we do not preserve IF,
> DF and arith flags.
>
> I'm proposing to use this opportunity to amend syscall ABI
> to say that arith flags are not preserved across syscalls,
> and DF can be cleared to 0 by syscalls (but can't be set to 1).
> Evidently, it's broken for some time for some virtualized
> setups and users are okay.

I think I'd rather fix it.  I want to give x86_64 a sysenter stack
like x86_32's, since AFAICT the only reason that #DF needs to use IST
is because sysenter with TF set is the only way I can see that #DF
could happen with an invalid stack.

Also, Houston, we have a bug, probably rootable, and probably damn
near impossible to exploit without crashing your system:

User does sysenter.  We end up in native_irq_enable_sysexit.  We do:

swapgs
sti

<-- NMI here can happen on some (all?) cpus, returns successfully
*with interrupts unmasked*

<-- IRQ.  Boom

My preferred fix would be to use sysretl instead of sysexit.  As far
as I know, there are no 64-bit CPUs at all that don't support sysretl.

--Andy

>
> I'm not sure what to do with the "bug" of forcing IF=1.
> Fix it? Or also declare that syscalls can set IF=1?
> Do you think this is a legitimate userspace code?
>
>         sys_iopl(3);
>         cli;
>         syscall();
>         /* expects irqs still disabled */
>
> --
> vda

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: ia32_sysenter_target does not preserve EFLAGS
  2015-03-27 14:25 ia32_sysenter_target does not preserve EFLAGS Denys Vlasenko
  2015-03-27 17:12 ` Borislav Petkov
  2015-03-27 18:37 ` Andy Lutomirski
@ 2015-03-27 20:00 ` Linus Torvalds
  2015-03-28  0:34   ` Denys Vlasenko
  2 siblings, 1 reply; 19+ messages in thread
From: Linus Torvalds @ 2015-03-27 20:00 UTC (permalink / raw)
  To: Denys Vlasenko
  Cc: Ingo Molnar, Andy Lutomirski, Borislav Petkov, X86 ML, LKML

On Fri, Mar 27, 2015 at 7:25 AM, Denys Vlasenko <dvlasenk@redhat.com> wrote:
>
> Apparently, users *don't* depend on arithmetic flags
> to survive over syscall. They also okay with DF flag
> being cleared.

Generally, users probably dont' care about many registers at all being
saved, but it's worth noting that the reason system calls save/restore
even caller-saved registers is at least partly in order to avoid any
kernel information leaks.

I don't believe that user mode will ever reasonably care about the
arithmetic flags being changed, but at the same time I also don't it
is something we should ever consider a "feature" we should try to take
advantage of. Generally we should try to not mess with the flag state,
and I'd *much* rather make the rule be that all the system call return
paths restore flags as much as possible.

                          Linus

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: ia32_sysenter_target does not preserve EFLAGS
  2015-03-27 18:37 ` Andy Lutomirski
@ 2015-03-27 20:09   ` Linus Torvalds
  2015-03-27 20:16     ` Andy Lutomirski
  2015-03-28  9:42     ` Borislav Petkov
  2015-03-27 20:53   ` Brian Gerst
  1 sibling, 2 replies; 19+ messages in thread
From: Linus Torvalds @ 2015-03-27 20:09 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Denys Vlasenko, Borislav Petkov, linux-kernel@vger.kernel.org,
	X86 ML, Ingo Molnar

On Fri, Mar 27, 2015 at 11:37 AM, Andy Lutomirski <luto@amacapital.net> wrote:
>
> User does sysenter.  We end up in native_irq_enable_sysexit.  We do:
>
> swapgs
> sti
>
> <-- NMI here can happen on some (all?) cpus, returns successfully
> *with interrupts unmasked*

I think AMD documented that the sti "interrupt shadow" shadows even
NMI. And for Intel, it definitely does not (but "mov ss" and "pop ss"
masks even NMI for the next instruction - so the interrupt shadow is
different for these cases).

That said, it wasn't 100% clear that the "NMI return to immediate
regular interrupt" can actually happen even on Intel. Iirc, there was
some discussion about when the CPU actually tests the IRQ line after
an 'iret'. It might end up testing the IF only after executing the
instruction it returns to, so it's possible that the sequence

    .. interrupts disabled ..
    sti
    sysexit

can not be interrupted by regular interrupts between the 'sti' and the
'sysexit' even if an NMI were to have happened between the two.

> My preferred fix would be to use sysretl instead of sysexit.  As far
> as I know, there are no 64-bit CPUs at all that don't support sysretl.

That 'sti+sysexit" is used for the native 32-bit case too, not just
the compat mode for x86-64. So I don't necessarily disagree with using
sysretl, but..

                       Linus

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: ia32_sysenter_target does not preserve EFLAGS
  2015-03-27 20:09   ` Linus Torvalds
@ 2015-03-27 20:16     ` Andy Lutomirski
  2015-03-27 20:31       ` Linus Torvalds
  2015-03-28  9:42     ` Borislav Petkov
  1 sibling, 1 reply; 19+ messages in thread
From: Andy Lutomirski @ 2015-03-27 20:16 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Denys Vlasenko, Borislav Petkov, linux-kernel@vger.kernel.org,
	X86 ML, Ingo Molnar

On Fri, Mar 27, 2015 at 1:09 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Fri, Mar 27, 2015 at 11:37 AM, Andy Lutomirski <luto@amacapital.net> wrote:
>>
>> User does sysenter.  We end up in native_irq_enable_sysexit.  We do:
>>
>> swapgs
>> sti
>>
>> <-- NMI here can happen on some (all?) cpus, returns successfully
>> *with interrupts unmasked*
>
> I think AMD documented that the sti "interrupt shadow" shadows even
> NMI. And for Intel, it definitely does not (but "mov ss" and "pop ss"
> masks even NMI for the next instruction - so the interrupt shadow is
> different for these cases).
>
> That said, it wasn't 100% clear that the "NMI return to immediate
> regular interrupt" can actually happen even on Intel. Iirc, there was
> some discussion about when the CPU actually tests the IRQ line after
> an 'iret'. It might end up testing the IF only after executing the
> instruction it returns to, so it's possible that the sequence
>
>     .. interrupts disabled ..
>     sti
>     sysexit
>
> can not be interrupted by regular interrupts between the 'sti' and the
> 'sysexit' even if an NMI were to have happened between the two.
>
>> My preferred fix would be to use sysretl instead of sysexit.  As far
>> as I know, there are no 64-bit CPUs at all that don't support sysretl.
>
> That 'sti+sysexit" is used for the native 32-bit case too, not just
> the compat mode for x86-64. So I don't necessarily disagree with using
> sysretl, but..

Does it matter on 32-bit kernels?  There's no swapgs, so IRQs should
still be safe, and we have a real stack pointer before sysexit.

--Andy

>
>                        Linus



-- 
Andy Lutomirski
AMA Capital Management, LLC

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: ia32_sysenter_target does not preserve EFLAGS
  2015-03-27 20:16     ` Andy Lutomirski
@ 2015-03-27 20:31       ` Linus Torvalds
  2015-03-27 21:11         ` Andy Lutomirski
  0 siblings, 1 reply; 19+ messages in thread
From: Linus Torvalds @ 2015-03-27 20:31 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Denys Vlasenko, Borislav Petkov, linux-kernel@vger.kernel.org,
	X86 ML, Ingo Molnar

On Fri, Mar 27, 2015 at 1:16 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>
> Does it matter on 32-bit kernels?  There's no swapgs, so IRQs should
> still be safe, and we have a real stack pointer before sysexit.

Fair enough.  On 32-bit, the only worry is the race between "return to
user space" and "something set a thread flag", resulting in delayed
signals and/or higher scheduling latency etc. So on 32-bit, the bug is
much less of an issue, I agree.

So yeah, using sysretl instead of sti+sysexit on 64-bit sounds more
reasonable given the potential worry about sti+sysexit atomicity in
the presense of nmi's.

                      Linus

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: ia32_sysenter_target does not preserve EFLAGS
  2015-03-27 18:37 ` Andy Lutomirski
  2015-03-27 20:09   ` Linus Torvalds
@ 2015-03-27 20:53   ` Brian Gerst
  2015-03-27 21:02     ` Linus Torvalds
  1 sibling, 1 reply; 19+ messages in thread
From: Brian Gerst @ 2015-03-27 20:53 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Denys Vlasenko, Linus Torvalds, Borislav Petkov,
	linux-kernel@vger.kernel.org, X86 ML, Ingo Molnar

On Fri, Mar 27, 2015 at 2:37 PM, Andy Lutomirski <luto@amacapital.net> wrote:
> On Mar 27, 2015 7:26 AM, "Denys Vlasenko" <dvlasenk@redhat.com> wrote:
>>
>> Hi,
>>
>> While running some tests I noticed that EFLAGS
>> is not saved across syscalls if I use 32-bit
>> userspace, use SYSENTER, and paravirt is active.
>>
>> Looking at the code, it's actually clear why that happens.
>>
>> /*
>>  * SYSENTER loads ss, rsp, cs, and rip from previously programmed MSRs.
>>  * IF and VM in rflags are cleared (IOW: interrupts are off).
>>  * SYSENTER does not save anything on the stack,
>>  * and does not save old rip (!!!) and rflags.
>>  */
>> ENTRY(ia32_sysenter_target)
>>         SWAPGS_UNSAFE_STACK  <============================
>>         movq    PER_CPU_VAR(cpu_tss + TSS_sp0), %rsp
>>         ENABLE_INTERRUPTS(CLBR_NONE)
>>
>>         movl    %ebp, %ebp
>>         movl    %eax, %eax
>>         movl    ASM_THREAD_INFO(TI_sysenter_return, %rsp, 0), %r10d
>>
>>         /* Construct struct pt_regs on stack */
>>         pushq_cfi       $__USER32_DS            /* pt_regs->ss */
>>         pushq_cfi       %rbp                    /* pt_regs->sp */
>>         CFI_REL_OFFSET  rsp,0
>>         pushfq_cfi                              /* pt_regs->flags */
>>
>> The SWAPGS_UNSAFE_STACK, it's it involves paravirt callbacks,
>> will change EFLAGS, and it *can't* save/restore them -
>> there is no place to save it, since neither stack nor
>> PER_CPU() is usable at that point.
>>
>> Interestingly, *no one ever complained*!
>>
>> Apparently, users *don't* depend on arithmetic flags
>> to survive over syscall. They also okay with DF flag
>> being cleared.
>>
>> Let's go flag-by-flag.
>>
>> ID - probably no one depends on it
>> VIP,VIF,VM - v86 stuff, not supported in 64bit
>> AC - someone probably do use this
>> RF - should be cleared to 0
>> NT - iret via task gate, not supported in 64bit
>> IOPL - usually 00, sys_iopl() can change it
>> DF - according to C ABI, should be 0
>> IF - should be preserved (but almost always 1)
>> TF - should be preserved
>> arith flags - probably no one cares
>>
>> IOW. Bits to be preseved are only AC, IOPL, TF, and _maybe_
>> IF.
>>
>> AC and IOPL are preserved even with this paravirt quirk
>> because paravirt hooks do not mangle them.
>>
>> TF preservation and proper restoration is handled by
>>         do_debug + syscall_trace_enter_phase2 + iret
>> combo.
>>
>> We unconditionally set IF. This is only a problem for applications
>> which use sys_iopl(3) and, disable IRQs in userspace and perform
>> syscalls. The set of such apps is probably empty.
>> (This "bug" exists even for non-paravirt case).
>>
>> So, formally, we have a bug: we do not preserve IF,
>> DF and arith flags.
>>
>> I'm proposing to use this opportunity to amend syscall ABI
>> to say that arith flags are not preserved across syscalls,
>> and DF can be cleared to 0 by syscalls (but can't be set to 1).
>> Evidently, it's broken for some time for some virtualized
>> setups and users are okay.
>
> I think I'd rather fix it.  I want to give x86_64 a sysenter stack
> like x86_32's, since AFAICT the only reason that #DF needs to use IST
> is because sysenter with TF set is the only way I can see that #DF
> could happen with an invalid stack.

What if RSP gets corrupted in the kernel?  That would cause a fault
that gets promoted to #DF, since the iret frame can't be pushed.  You
would at least get an oops out instead of a triple fault reset.

> Also, Houston, we have a bug, probably rootable, and probably damn
> near impossible to exploit without crashing your system:
>
> User does sysenter.  We end up in native_irq_enable_sysexit.  We do:
>
> swapgs
> sti
>
> <-- NMI here can happen on some (all?) cpus, returns successfully
> *with interrupts unmasked*
>
> <-- IRQ.  Boom

The sti will delay interrupts for one instruction, and that should include NMIs.

The Intel SDM states for STI:
"The IF flag and the STI and CLI instructions do not prohibit the
generation of exceptions and NMI interrupts. NMI
interrupts (and SMIs) may be blocked for one macroinstruction following an STI."

> My preferred fix would be to use sysretl instead of sysexit.  As far
> as I know, there are no 64-bit CPUs at all that don't support sysretl.

It would be nice to have one less return path.  I'm curious to know
why Intel doesn't support syscall from compatibility mode but does
support sysret to compatibility mode.

--
Brian Gerst

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: ia32_sysenter_target does not preserve EFLAGS
  2015-03-27 20:53   ` Brian Gerst
@ 2015-03-27 21:02     ` Linus Torvalds
  2015-03-28  9:30       ` Ingo Molnar
  2015-03-28  9:39       ` Ingo Molnar
  0 siblings, 2 replies; 19+ messages in thread
From: Linus Torvalds @ 2015-03-27 21:02 UTC (permalink / raw)
  To: Brian Gerst
  Cc: Andy Lutomirski, Denys Vlasenko, Borislav Petkov,
	linux-kernel@vger.kernel.org, X86 ML, Ingo Molnar

On Fri, Mar 27, 2015 at 1:53 PM, Brian Gerst <brgerst@gmail.com> wrote:
>> <-- IRQ.  Boom
>
> The sti will delay interrupts for one instruction, and that should include NMIs.

Nope. Intel explicitly documents the NMI case only for mov->ss and popss.

> The Intel SDM states for STI:
> "The IF flag and the STI and CLI instructions do not prohibit the
> generation of exceptions and NMI interrupts. NMI
> interrupts (and SMIs) may be blocked for one macroinstruction following an STI."

Note the *may*. For movss and popss the software developer guide
explicitly says that NMI's are also blocked.

For plain sti, it seems to be dependent on microarchitecture.

                     Linus

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: ia32_sysenter_target does not preserve EFLAGS
  2015-03-27 20:31       ` Linus Torvalds
@ 2015-03-27 21:11         ` Andy Lutomirski
  0 siblings, 0 replies; 19+ messages in thread
From: Andy Lutomirski @ 2015-03-27 21:11 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Denys Vlasenko, Borislav Petkov, linux-kernel@vger.kernel.org,
	X86 ML, Ingo Molnar

On Fri, Mar 27, 2015 at 1:31 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Fri, Mar 27, 2015 at 1:16 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>>
>> Does it matter on 32-bit kernels?  There's no swapgs, so IRQs should
>> still be safe, and we have a real stack pointer before sysexit.
>
> Fair enough.  On 32-bit, the only worry is the race between "return to
> user space" and "something set a thread flag", resulting in delayed
> signals and/or higher scheduling latency etc. So on 32-bit, the bug is
> much less of an issue, I agree.

Right, except for one nasty case: KVM user return notifiers.  It's
possible we'd re-enter user mode with some MSRs set wrong.  Yuck.

--Andy

>
> So yeah, using sysretl instead of sti+sysexit on 64-bit sounds more
> reasonable given the potential worry about sti+sysexit atomicity in
> the presense of nmi's.
>
>                       Linus



-- 
Andy Lutomirski
AMA Capital Management, LLC

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: ia32_sysenter_target does not preserve EFLAGS
  2015-03-27 20:00 ` Linus Torvalds
@ 2015-03-28  0:34   ` Denys Vlasenko
  2015-03-28  9:28     ` Olivier Galibert
                       ` (2 more replies)
  0 siblings, 3 replies; 19+ messages in thread
From: Denys Vlasenko @ 2015-03-28  0:34 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Denys Vlasenko, Ingo Molnar, Andy Lutomirski, Borislav Petkov,
	X86 ML, LKML

On Fri, Mar 27, 2015 at 9:00 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Fri, Mar 27, 2015 at 7:25 AM, Denys Vlasenko <dvlasenk@redhat.com> wrote:
>>
>> Apparently, users *don't* depend on arithmetic flags
>> to survive over syscall. They also okay with DF flag
>> being cleared.
>
> Generally, users probably dont' care about many registers at all being
> saved, but it's worth noting that the reason system calls save/restore
> even caller-saved registers is at least partly in order to avoid any
> kernel information leaks.
>
> I don't believe that user mode will ever reasonably care about the
> arithmetic flags being changed, but at the same time I also don't it
> is something we should ever consider a "feature" we should try to take
> advantage of. Generally we should try to not mess with the flag state,
> and I'd *much* rather make the rule be that all the system call return
> paths restore flags as much as possible.

"We don't clobber anything" ABI has its appeal.
OTOH, fulfilling ABI's promises has cost which hast to be paid
on every syscall, regardless whether userspace needed it or not.

Example. This is the uclibc implementation of write():

00000000004acfc4 <__libc_write>:
  4acfc4:       53                      push   %rbx
  4acfc5:       48 63 ff                movslq %edi,%rdi
  4acfc8:       b8 01 00 00 00          mov    $0x1,%eax
  4acfcd:       0f 05                   syscall
  4acfcf:       48 89 c3                mov    %rax,%rbx
  4acfd2:       48 81 fb 00 f0 ff ff    cmp    $0xfffffffffffff000,%rbx
  4acfd9:       76 0f                   jbe    4acfea <__libc_write+0x26>
  4acfdb:       e8 64 15 00 00          callq  4ae544 <__GI___errno_location>
  4acfe0:       89 da                   mov    %ebx,%edx
  4acfe2:       f7 da                   neg    %edx
  4acfe4:       89 10                   mov    %edx,(%rax)
  4acfe6:       48 83 c8 ff             or     $0xffffffffffffffff,%rax
  4acfea:       5b                      pop    %rbx
  4acfeb:       c3                      retq

This is a C function. Therefore any its caller assumes that C-clobbered
registers can be, indeed, clobbered here, so if that caller uses any
of them, it saves/restores them.

All efforts by kernel code to save/restore C-clobbered registers,
eight of them, are in vain. It's just useless work. Userspace
does not benefit from that effort.

If our syscall ABI would say that those regs are not preserved,
we could have a bit faster syscalls. Any userspace code which
really had to have those registers preserved across a particular
syscall, could push/pop them itself.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: ia32_sysenter_target does not preserve EFLAGS
  2015-03-28  0:34   ` Denys Vlasenko
@ 2015-03-28  9:28     ` Olivier Galibert
  2015-03-28  9:46     ` Ingo Molnar
       [not found]     ` <CA+55aFxwNq6g+Oi-UhGBgEZuDQyNkeg6qZnkDY11PNhTN=fmzg@mail.gmail.com>
  2 siblings, 0 replies; 19+ messages in thread
From: Olivier Galibert @ 2015-03-28  9:28 UTC (permalink / raw)
  To: Denys Vlasenko
  Cc: Linus Torvalds, Denys Vlasenko, Ingo Molnar, Andy Lutomirski,
	Borislav Petkov, X86 ML, LKML

  Hi,

Beware that could be opening the door to information leaks for a very
small gain (most syscalls are not getuid).

Best,

  OG.


On Sat, Mar 28, 2015 at 1:34 AM, Denys Vlasenko
<vda.linux@googlemail.com> wrote:
> On Fri, Mar 27, 2015 at 9:00 PM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
>> On Fri, Mar 27, 2015 at 7:25 AM, Denys Vlasenko <dvlasenk@redhat.com> wrote:
>>>
>>> Apparently, users *don't* depend on arithmetic flags
>>> to survive over syscall. They also okay with DF flag
>>> being cleared.
>>
>> Generally, users probably dont' care about many registers at all being
>> saved, but it's worth noting that the reason system calls save/restore
>> even caller-saved registers is at least partly in order to avoid any
>> kernel information leaks.
>>
>> I don't believe that user mode will ever reasonably care about the
>> arithmetic flags being changed, but at the same time I also don't it
>> is something we should ever consider a "feature" we should try to take
>> advantage of. Generally we should try to not mess with the flag state,
>> and I'd *much* rather make the rule be that all the system call return
>> paths restore flags as much as possible.
>
> "We don't clobber anything" ABI has its appeal.
> OTOH, fulfilling ABI's promises has cost which hast to be paid
> on every syscall, regardless whether userspace needed it or not.
>
> Example. This is the uclibc implementation of write():
>
> 00000000004acfc4 <__libc_write>:
>   4acfc4:       53                      push   %rbx
>   4acfc5:       48 63 ff                movslq %edi,%rdi
>   4acfc8:       b8 01 00 00 00          mov    $0x1,%eax
>   4acfcd:       0f 05                   syscall
>   4acfcf:       48 89 c3                mov    %rax,%rbx
>   4acfd2:       48 81 fb 00 f0 ff ff    cmp    $0xfffffffffffff000,%rbx
>   4acfd9:       76 0f                   jbe    4acfea <__libc_write+0x26>
>   4acfdb:       e8 64 15 00 00          callq  4ae544 <__GI___errno_location>
>   4acfe0:       89 da                   mov    %ebx,%edx
>   4acfe2:       f7 da                   neg    %edx
>   4acfe4:       89 10                   mov    %edx,(%rax)
>   4acfe6:       48 83 c8 ff             or     $0xffffffffffffffff,%rax
>   4acfea:       5b                      pop    %rbx
>   4acfeb:       c3                      retq
>
> This is a C function. Therefore any its caller assumes that C-clobbered
> registers can be, indeed, clobbered here, so if that caller uses any
> of them, it saves/restores them.
>
> All efforts by kernel code to save/restore C-clobbered registers,
> eight of them, are in vain. It's just useless work. Userspace
> does not benefit from that effort.
>
> If our syscall ABI would say that those regs are not preserved,
> we could have a bit faster syscalls. Any userspace code which
> really had to have those registers preserved across a particular
> syscall, could push/pop them itself.
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: ia32_sysenter_target does not preserve EFLAGS
  2015-03-27 21:02     ` Linus Torvalds
@ 2015-03-28  9:30       ` Ingo Molnar
  2015-03-28  9:39       ` Ingo Molnar
  1 sibling, 0 replies; 19+ messages in thread
From: Ingo Molnar @ 2015-03-28  9:30 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Brian Gerst, Andy Lutomirski, Denys Vlasenko, Borislav Petkov,
	linux-kernel@vger.kernel.org, X86 ML


* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Fri, Mar 27, 2015 at 1:53 PM, Brian Gerst <brgerst@gmail.com> wrote:
> >> <-- IRQ.  Boom
> >
> > The sti will delay interrupts for one instruction, and that should include NMIs.
> 
> Nope. Intel explicitly documents the NMI case only for mov->ss and popss.

Interestingly, I still see a STI 'NMI shadow' even on Intel CPUs.

Try something like this as root on a system with Intel CPUs (running 
recent tools/perf), with high-freq NMI sampling:

   perf top -F 10000

execute a tight syscall loop on all CPUs (getppid() loop for example), 
and you'll see something like this in the profile:

Samples: 1M of event 'cycles', Event count (approx.): 377899840545                                                                                                        
Overhead  Shared Object                      Symbol                                                                                                                      ◆
  27.67%  libc-2.19.so                       [.] __GI___getppid                                                                                                          ▒
  21.34%  [kernel]                           [k] system_call                                                                                                             ▒
  17.42%  [kernel]                           [k] system_call_after_swapgs                                                                                                ▒
  12.00%  [kernel]                           [k] pid_vnr                                                                                                                 ▒
   7.49%  [kernel]                           [k] sys_getppid                                                                                                             ▒
   5.49%  [kernel]                           [k] sysret_check                                                                                                            ▒
   5.34%  loop-getppid                       [.] main                                                                                                                    ▒
   1.56%  [kernel]                           [k] system_call_fastpath                                                                                                    ▒
   0.36%  loop-getppid                       [.] getppid@plt                                                                                                             ▒

Note the very high sample count (due to sampling at 10 KHz).

Now if you hit '<Enter>' twice to annotate system_call_after_swapgs 
you should see something like this (the live kernel image disassembly, 
annotated):

  system_call_after_swapgs  /proc/kcore                                                                                                                                     
       │                                             
       │                                             
       │                                             
       │              Disassembly of section load0:  
       │                                             
       │              ffffffff8178b3f3 <load0>:      
  9.72 │ffffffff8178b3f3:   mov    %rsp,%gs:0xb040
 44.24 │ffffffff8178b3fc:   mov    %gs:0xb888,%rsp
  0.02 │ffffffff8178b405:   sti                      
       │ffffffff8178b406:   nopl   0x0(%rax)         
 16.04 │ffffffff8178b40d:   sub    $0x50,%rsp
       │ffffffff8178b411:   mov    %rdi,0x40(%rsp)   
  6.51 │ffffffff8178b416:   mov    %rsi,0x38(%rsp)
  5.81 │ffffffff8178b41b:   mov    %rdx,0x30(%rsp)
  2.22 │ffffffff8178b420:   mov    %rax,0x20(%rsp)
  2.16 │ffffffff8178b425:   mov    %r8,0x18(%rsp)
  0.93 │ffffffff8178b42a:   mov    %r9,0x10(%rsp)
  1.57 │ffffffff8178b42f:   mov    %r10,0x8(%rsp)
  3.70 │ffffffff8178b434:   mov    %r11,(%rsp)
  2.27 │ffffffff8178b438:   mov    %rax,0x48(%rsp)
                                                
Note how the 7-byte NOP after the STI did not get a single profiler 
hit.

This is with the default '-e cycles', not '-e cycles:pp', so what we 
see as profiler hits should be the raw NMI entry RIPs.

Arguably this could be just the decoder hiding the NOP efficiently, 
I'll try to run some more experiments ...

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: ia32_sysenter_target does not preserve EFLAGS
  2015-03-27 21:02     ` Linus Torvalds
  2015-03-28  9:30       ` Ingo Molnar
@ 2015-03-28  9:39       ` Ingo Molnar
  1 sibling, 0 replies; 19+ messages in thread
From: Ingo Molnar @ 2015-03-28  9:39 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Brian Gerst, Andy Lutomirski, Denys Vlasenko, Borislav Petkov,
	linux-kernel@vger.kernel.org, X86 ML


* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Fri, Mar 27, 2015 at 1:53 PM, Brian Gerst <brgerst@gmail.com> wrote:
> >> <-- IRQ.  Boom
> >
> > The sti will delay interrupts for one instruction, and that should include NMIs.
> 
> Nope. Intel explicitly documents the NMI case only for mov->ss and popss.
> 
> > The Intel SDM states for STI:
> > "The IF flag and the STI and CLI instructions do not prohibit the
> > generation of exceptions and NMI interrupts. NMI
> > interrupts (and SMIs) may be blocked for one macroinstruction following an STI."
> 
> Note the *may*. For movss and popss the software developer guide 
> explicitly says that NMI's are also blocked.
> 
> For plain sti, it seems to be dependent on microarchitecture.

Well, how about 'STI+HLT' aka safe_halt()?

If an NMI is allowed after that STI then we might lose wakeups, or in 
extreme cases (with full-dynticks) might lock up for a long time until 
the next irq comes, even with runnable tasks around?

Arguably that's a race condition that is not very easy to notice on a 
typical system.

Random hypothesis: maybe Intel just messed up their STI shadow in a 
single (possibly ancient) microarchitecture in some rare situations 
and figured it could fix it cheaply via updating the documentation to 
match the breakage, not via actually fixing the CPU?

Might be useful if someone from Intel could chime in.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: ia32_sysenter_target does not preserve EFLAGS
  2015-03-27 20:09   ` Linus Torvalds
  2015-03-27 20:16     ` Andy Lutomirski
@ 2015-03-28  9:42     ` Borislav Petkov
  1 sibling, 0 replies; 19+ messages in thread
From: Borislav Petkov @ 2015-03-28  9:42 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andy Lutomirski, Denys Vlasenko, linux-kernel@vger.kernel.org,
	X86 ML, Ingo Molnar

On Fri, Mar 27, 2015 at 01:09:34PM -0700, Linus Torvalds wrote:
> I think AMD documented that the sti "interrupt shadow" shadows even
> NMI.

Hmm, official docs says this:

"15.21.5 Interrupt Shadows

The x86 architecture defines the notion of an interrupt shadow—a
single-instruction window during which interrupts are not recognized.
For example, the instruction after an STI instruction that sets
EFLAGS.IF (from zero to one) does not recognize interrupts or certain
debug traps."

And I think with "interrupts" this means maskable interrupts, i.e. not
NMI.

STI description itself:

"Sets the interrupt flag (IF) in the rFLAGS register to 1, thereby
allowing external interrupts received on the INTR input. Interrupts
received on the non-maskable interrupt (NMI) input are not affected by
this instruction."

but there might be some other documentation which I cannot find right
now.

-- 
Regards/Gruss,
    Boris.

ECO tip #101: Trim your mails when you reply.
--

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: ia32_sysenter_target does not preserve EFLAGS
  2015-03-28  0:34   ` Denys Vlasenko
  2015-03-28  9:28     ` Olivier Galibert
@ 2015-03-28  9:46     ` Ingo Molnar
  2015-03-28 11:17       ` Denys Vlasenko
       [not found]     ` <CA+55aFxwNq6g+Oi-UhGBgEZuDQyNkeg6qZnkDY11PNhTN=fmzg@mail.gmail.com>
  2 siblings, 1 reply; 19+ messages in thread
From: Ingo Molnar @ 2015-03-28  9:46 UTC (permalink / raw)
  To: Denys Vlasenko
  Cc: Linus Torvalds, Denys Vlasenko, Andy Lutomirski, Borislav Petkov,
	X86 ML, LKML


* Denys Vlasenko <vda.linux@googlemail.com> wrote:

> On Fri, Mar 27, 2015 at 9:00 PM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> > On Fri, Mar 27, 2015 at 7:25 AM, Denys Vlasenko <dvlasenk@redhat.com> wrote:
> >>
> >> Apparently, users *don't* depend on arithmetic flags
> >> to survive over syscall. They also okay with DF flag
> >> being cleared.
> >
> > Generally, users probably dont' care about many registers at all being
> > saved, but it's worth noting that the reason system calls save/restore
> > even caller-saved registers is at least partly in order to avoid any
> > kernel information leaks.
> >
> > I don't believe that user mode will ever reasonably care about the
> > arithmetic flags being changed, but at the same time I also don't it
> > is something we should ever consider a "feature" we should try to take
> > advantage of. Generally we should try to not mess with the flag state,
> > and I'd *much* rather make the rule be that all the system call return
> > paths restore flags as much as possible.
> 
> "We don't clobber anything" ABI has its appeal.
> OTOH, fulfilling ABI's promises has cost which hast to be paid
> on every syscall, regardless whether userspace needed it or not.
> 
> Example. This is the uclibc implementation of write():
> 
> 00000000004acfc4 <__libc_write>:
>   4acfc4:       53                      push   %rbx
>   4acfc5:       48 63 ff                movslq %edi,%rdi
>   4acfc8:       b8 01 00 00 00          mov    $0x1,%eax
>   4acfcd:       0f 05                   syscall
>   4acfcf:       48 89 c3                mov    %rax,%rbx
>   4acfd2:       48 81 fb 00 f0 ff ff    cmp    $0xfffffffffffff000,%rbx
>   4acfd9:       76 0f                   jbe    4acfea <__libc_write+0x26>
>   4acfdb:       e8 64 15 00 00          callq  4ae544 <__GI___errno_location>
>   4acfe0:       89 da                   mov    %ebx,%edx
>   4acfe2:       f7 da                   neg    %edx
>   4acfe4:       89 10                   mov    %edx,(%rax)
>   4acfe6:       48 83 c8 ff             or     $0xffffffffffffffff,%rax
>   4acfea:       5b                      pop    %rbx
>   4acfeb:       c3                      retq
> 
> This is a C function. [...]

Arguably that's a self-inflicted wound of uclibc: nothing keeps it 
from taking advantage of the syscall ABI and avoiding the double 
save/restores.

> [...] Therefore any its caller assumes that C-clobbered registers 
> can be, indeed, clobbered here, so if that caller uses any of them, 
> it saves/restores them.
> 
> All efforts by kernel code to save/restore C-clobbered registers, 
> eight of them, are in vain. It's just useless work. Userspace does 
> not benefit from that effort.

That's true only in this particular uclibc case, where user-space 
decided to not take advantage of the save/restore property of the 
kernel.

> If our syscall ABI would say that those regs are not preserved, we 
> could have a bit faster syscalls. Any userspace code which really 
> had to have those registers preserved across a particular syscall, 
> could push/pop them itself.

We'd at minimum have to zero out the registers to avoid the 
information leak and at that point it's in fact faster to just 
save/restore in the syscall and allow user-space to take advantage of 
that, if it wishes to.

We cannot do it the other way around.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: ia32_sysenter_target does not preserve EFLAGS
       [not found]     ` <CA+55aFxwNq6g+Oi-UhGBgEZuDQyNkeg6qZnkDY11PNhTN=fmzg@mail.gmail.com>
@ 2015-03-28 11:01       ` Denys Vlasenko
  0 siblings, 0 replies; 19+ messages in thread
From: Denys Vlasenko @ 2015-03-28 11:01 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Denys Vlasenko, Andy Lutomirski, Ingo Molnar, Borislav Petkov,
	X86 ML, LKML

On Sat, Mar 28, 2015 at 1:39 AM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> What part of "don't leak kernel data" did you have trouble understanding?
>
> IOW, this is a *security* issue. Stop arguing for crazy shit.

We can zero the registers instead of saving/restoring them.
push/pop pair takes 1-2 cycles at best, whereas
on four-issue CPUs "xor reg,reg" has throughput of four
instructions at a time, often even not requiring
an execution unit.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: ia32_sysenter_target does not preserve EFLAGS
  2015-03-28  9:46     ` Ingo Molnar
@ 2015-03-28 11:17       ` Denys Vlasenko
  2015-03-29 17:12         ` Andy Lutomirski
  0 siblings, 1 reply; 19+ messages in thread
From: Denys Vlasenko @ 2015-03-28 11:17 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, Denys Vlasenko, Andy Lutomirski, Borislav Petkov,
	X86 ML, LKML

On Sat, Mar 28, 2015 at 10:46 AM, Ingo Molnar <mingo@kernel.org> wrote:
> * Denys Vlasenko <vda.linux@googlemail.com> wrote:
>> This is a C function. [...]
>
> Arguably that's a self-inflicted wound of uclibc: nothing keeps it
> from taking advantage of the syscall ABI and avoiding the double
> save/restores.

It's not uclibc who calls write(), it's user program.

IIRC uclibc can't make user program aware that write()
is not clobbering registers.

Even if it could do that via an __attribute__ somehow,
it would be a violation of standards: write() is supposed to be
an ordinary C function, users must be able to take its address
and assign it to a pointer declared as

    ssize_t (*ptr)(int, const void *, size_t);

Slapping __attribute__((different_abi))
onto write() makes that impossible, the signature
no longer matches.

Let me go back from hypothetics to the actual situation.
We can't do such a drastic ABI change now, it's too big.

But is looks like we can relax ABI wrt saving flags,
because it's broken for some time, and no one complains.

If we say that arith flags and DF are not saved, it may
mean that we don't need to kill ourselves with
awkward and costly (popf is 20 cycles) EFLAGS
manipulations.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: ia32_sysenter_target does not preserve EFLAGS
  2015-03-28 11:17       ` Denys Vlasenko
@ 2015-03-29 17:12         ` Andy Lutomirski
  0 siblings, 0 replies; 19+ messages in thread
From: Andy Lutomirski @ 2015-03-29 17:12 UTC (permalink / raw)
  To: Denys Vlasenko
  Cc: Borislav Petkov, linux-kernel@vger.kernel.org, X86 ML,
	Linus Torvalds, Denys Vlasenko, Ingo Molnar

On Mar 28, 2015 6:17 AM, "Denys Vlasenko" <vda.linux@googlemail.com> wrote:
>
> On Sat, Mar 28, 2015 at 10:46 AM, Ingo Molnar <mingo@kernel.org> wrote:
> > * Denys Vlasenko <vda.linux@googlemail.com> wrote:
> >> This is a C function. [...]
> >
> > Arguably that's a self-inflicted wound of uclibc: nothing keeps it
> > from taking advantage of the syscall ABI and avoiding the double
> > save/restores.
>
> It's not uclibc who calls write(), it's user program.
>
> IIRC uclibc can't make user program aware that write()
> is not clobbering registers.
>
> Even if it could do that via an __attribute__ somehow,
> it would be a violation of standards: write() is supposed to be
> an ordinary C function, users must be able to take its address
> and assign it to a pointer declared as
>
>     ssize_t (*ptr)(int, const void *, size_t);
>
> Slapping __attribute__((different_abi))
> onto write() makes that impossible, the signature
> no longer matches.
>
> Let me go back from hypothetics to the actual situation.
> We can't do such a drastic ABI change now, it's too big.
>
> But is looks like we can relax ABI wrt saving flags,
> because it's broken for some time, and no one complains.
>
> If we say that arith flags and DF are not saved, it may
> mean that we don't need to kill ourselves with
> awkward and costly (popf is 20 cycles) EFLAGS
> manipulations.

Given that I just posted a patch that removes the popf by using
sysret, I'm unconvinced this is worthwhile.

--Andy

^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2015-03-29 17:12 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2015-03-27 14:25 ia32_sysenter_target does not preserve EFLAGS Denys Vlasenko
2015-03-27 17:12 ` Borislav Petkov
2015-03-27 18:37 ` Andy Lutomirski
2015-03-27 20:09   ` Linus Torvalds
2015-03-27 20:16     ` Andy Lutomirski
2015-03-27 20:31       ` Linus Torvalds
2015-03-27 21:11         ` Andy Lutomirski
2015-03-28  9:42     ` Borislav Petkov
2015-03-27 20:53   ` Brian Gerst
2015-03-27 21:02     ` Linus Torvalds
2015-03-28  9:30       ` Ingo Molnar
2015-03-28  9:39       ` Ingo Molnar
2015-03-27 20:00 ` Linus Torvalds
2015-03-28  0:34   ` Denys Vlasenko
2015-03-28  9:28     ` Olivier Galibert
2015-03-28  9:46     ` Ingo Molnar
2015-03-28 11:17       ` Denys Vlasenko
2015-03-29 17:12         ` Andy Lutomirski
     [not found]     ` <CA+55aFxwNq6g+Oi-UhGBgEZuDQyNkeg6qZnkDY11PNhTN=fmzg@mail.gmail.com>
2015-03-28 11:01       ` Denys Vlasenko

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).