* [RFC] x86, ia32entry: Use sysretl to return from sysenter
@ 2015-03-27 21:54 Andy Lutomirski
2015-03-28 8:35 ` Ingo Molnar
2015-03-29 19:07 ` Denys Vlasenko
0 siblings, 2 replies; 7+ messages in thread
From: Andy Lutomirski @ 2015-03-27 21:54 UTC (permalink / raw)
To: Linus Torvalds
Cc: Denys Vlasenko, Borislav Petkov, linux-kernel, X86 ML,
Ingo Molnar, hpa, Andy Lutomirski, stable
Sysexit is scary on 64-bit kernels -- sysexit must be invoked with
usergs and IRQs on. That means that we rely on sti to correctly
mask interrupts for one instruction. This is okay by itself, but
the semantics with respect to NMIs are unclear.
Avoid the whole issue by using sysretl instead. For background,
Intel CPUs don't allow syscall from compat mode, but they do allow
sysret back to compat mode. Go figure.
Oddly this seems to be 30 cycles or so faster. Avoiding popfq and
sti will account for under half of that, I think, so my best guess
is that Intel just optimizes sysret much better than sysexit.
Cc: stable@vger.kernel.org
Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
This needs careful review even though it's short. It everyone likes
it, I'll resubmit with a second patch to tear out the associated
paravirt gunk.
I wouldn't be at all surprised if this breaks Xen for some reason
I haven't thought of.
arch/x86/ia32/ia32entry.S | 32 +++++++++++++++++++-------------
1 file changed, 19 insertions(+), 13 deletions(-)
diff --git a/arch/x86/ia32/ia32entry.S b/arch/x86/ia32/ia32entry.S
index 5d2641ce9957..356be82fef0c 100644
--- a/arch/x86/ia32/ia32entry.S
+++ b/arch/x86/ia32/ia32entry.S
@@ -180,28 +180,34 @@ sysenter_dispatch:
testl $_TIF_ALLWORK_MASK, ASM_THREAD_INFO(TI_flags, %rsp, SIZEOF_PTREGS)
jnz sysexit_audit
sysexit_from_sys_call:
+ /*
+ * NB: sysexit is not obviously safe for 64-bit kernels -- an
+ * NMI between sti and sysexit has poorly specified behavior,
+ * and and NMI followed by an IRQ with usergs is fatal. So
+ * we just pretend we're using sysexit but we really use
+ * sysretl instead.
+ *
+ * This code path is still called sysexit because it pairs
+ * with sysenter and it uses the sysenter calling convention.
+ */
andl $~TS_COMPAT,ASM_THREAD_INFO(TI_status, %rsp, SIZEOF_PTREGS)
- /* clear IF, that popfq doesn't enable interrupts early */
- andl $~0x200,EFLAGS(%rsp)
- movl RIP(%rsp),%edx /* User %eip */
- CFI_REGISTER rip,rdx
+ movl RIP(%rsp),%ecx /* User %eip */
+ CFI_REGISTER rip,rcx
RESTORE_RSI_RDI
- /* pop everything except ss,rsp,rflags slots */
- REMOVE_PT_GPREGS_FROM_STACK 3*8
+ xorl %edx,%edx
xorq %r8,%r8
xorq %r9,%r9
xorq %r10,%r10
- xorq %r11,%r11
- popfq_cfi
+ movl EFLAGS(%rsp),%r11d /* User eflags */
/*CFI_RESTORE rflags*/
- popq_cfi %rcx /* User %esp */
- CFI_REGISTER rsp,rcx
TRACE_IRQS_ON
/*
- * 32bit SYSEXIT restores eip from edx, esp from ecx.
- * cs and ss are loaded from MSRs.
+ * Sysretl works even on Intel CPUs. Use it in preference to sysexit,
+ * since it avoids a dicey window with interrupts enabled.
+ * CS and SS are loaded from MSRs.
*/
- ENABLE_INTERRUPTS_SYSEXIT32
+ movl RSP(%rsp),%esp
+ USERGS_SYSRET32
CFI_RESTORE_STATE
--
2.3.0
^ permalink raw reply related [flat|nested] 7+ messages in thread
* Re: [RFC] x86, ia32entry: Use sysretl to return from sysenter
2015-03-27 21:54 [RFC] x86, ia32entry: Use sysretl to return from sysenter Andy Lutomirski
@ 2015-03-28 8:35 ` Ingo Molnar
2015-03-28 15:17 ` Andy Lutomirski
2015-03-29 19:07 ` Denys Vlasenko
1 sibling, 1 reply; 7+ messages in thread
From: Ingo Molnar @ 2015-03-28 8:35 UTC (permalink / raw)
To: Andy Lutomirski
Cc: Linus Torvalds, Denys Vlasenko, Borislav Petkov, linux-kernel,
X86 ML, hpa, stable
* Andy Lutomirski <luto@kernel.org> wrote:
> Sysexit is scary on 64-bit kernels -- sysexit must be invoked with
> usergs and IRQs on. That means that we rely on sti to correctly
> mask interrupts for one instruction. This is okay by itself, but
> the semantics with respect to NMIs are unclear.
At least judging by profiling output I think NMIs observe the STI
window of one instruction non-execution as well. (But I'm not 100%
sure.)
> Avoid the whole issue by using sysretl instead. For background,
> Intel CPUs don't allow syscall from compat mode, but they do allow
> sysret back to compat mode. Go figure.
>
> Oddly this seems to be 30 cycles or so faster. Avoiding popfq and
> sti will account for under half of that, I think, so my best guess
> is that Intel just optimizes sysret much better than sysexit.
>
> Cc: stable@vger.kernel.org
I like it, but no way is this automatic -stable material ... if proven
upstream we can forward it as a fix for SYSEXIT fragility, but not
automatically, IMHO.
Thanks,
Ingo
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [RFC] x86, ia32entry: Use sysretl to return from sysenter
2015-03-28 8:35 ` Ingo Molnar
@ 2015-03-28 15:17 ` Andy Lutomirski
2015-03-29 12:58 ` Borislav Petkov
0 siblings, 1 reply; 7+ messages in thread
From: Andy Lutomirski @ 2015-03-28 15:17 UTC (permalink / raw)
To: Ingo Molnar
Cc: H. Peter Anvin, Borislav Petkov, X86 ML, Linus Torvalds,
Denys Vlasenko, Andy Lutomirski, stable,
linux-kernel@vger.kernel.org
On Mar 28, 2015 1:35 AM, "Ingo Molnar" <mingo@kernel.org> wrote:
>
>
> * Andy Lutomirski <luto@kernel.org> wrote:
>
> > Sysexit is scary on 64-bit kernels -- sysexit must be invoked with
> > usergs and IRQs on. That means that we rely on sti to correctly
> > mask interrupts for one instruction. This is okay by itself, but
> > the semantics with respect to NMIs are unclear.
>
> At least judging by profiling output I think NMIs observe the STI
> window of one instruction non-execution as well. (But I'm not 100%
> sure.)
>
> > Avoid the whole issue by using sysretl instead. For background,
> > Intel CPUs don't allow syscall from compat mode, but they do allow
> > sysret back to compat mode. Go figure.
> >
> > Oddly this seems to be 30 cycles or so faster. Avoiding popfq and
> > sti will account for under half of that, I think, so my best guess
> > is that Intel just optimizes sysret much better than sysexit.
> >
> > Cc: stable@vger.kernel.org
>
> I like it, but no way is this automatic -stable material ... if proven
> upstream we can forward it as a fix for SYSEXIT fragility, but not
> automatically, IMHO.
Agreed. I wish we had a Stable-after-a-long-soak tag.
--Andy
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [RFC] x86, ia32entry: Use sysretl to return from sysenter
2015-03-28 15:17 ` Andy Lutomirski
@ 2015-03-29 12:58 ` Borislav Petkov
0 siblings, 0 replies; 7+ messages in thread
From: Borislav Petkov @ 2015-03-29 12:58 UTC (permalink / raw)
To: Andy Lutomirski
Cc: Ingo Molnar, H. Peter Anvin, X86 ML, Linus Torvalds,
Denys Vlasenko, Andy Lutomirski, stable,
linux-kernel@vger.kernel.org
On Sat, Mar 28, 2015 at 08:17:42AM -0700, Andy Lutomirski wrote:
> Agreed. I wish we had a Stable-after-a-long-soak tag.
You can set yourself a reminder for, say 6 months from now and if all is
still kosher with the patch, backport it then yourself and send it to
stable@.
--
Regards/Gruss,
Boris.
ECO tip #101: Trim your mails when you reply.
--
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [RFC] x86, ia32entry: Use sysretl to return from sysenter
2015-03-27 21:54 [RFC] x86, ia32entry: Use sysretl to return from sysenter Andy Lutomirski
2015-03-28 8:35 ` Ingo Molnar
@ 2015-03-29 19:07 ` Denys Vlasenko
2015-03-29 21:16 ` Andy Lutomirski
1 sibling, 1 reply; 7+ messages in thread
From: Denys Vlasenko @ 2015-03-29 19:07 UTC (permalink / raw)
To: Andy Lutomirski
Cc: Linus Torvalds, Denys Vlasenko, Borislav Petkov,
Linux Kernel Mailing List, X86 ML, Ingo Molnar, H. Peter Anvin,
stable
On Fri, Mar 27, 2015 at 10:54 PM, Andy Lutomirski <luto@kernel.org> wrote:
> --- a/arch/x86/ia32/ia32entry.S
> +++ b/arch/x86/ia32/ia32entry.S
> @@ -180,28 +180,34 @@ sysenter_dispatch:
> testl $_TIF_ALLWORK_MASK, ASM_THREAD_INFO(TI_flags, %rsp, SIZEOF_PTREGS)
> jnz sysexit_audit
> sysexit_from_sys_call:
> + /*
> + * NB: sysexit is not obviously safe for 64-bit kernels -- an
> + * NMI between sti and sysexit has poorly specified behavior,
> + * and and NMI followed by an IRQ with usergs is fatal. So
> + * we just pretend we're using sysexit but we really use
> + * sysretl instead.
> + *
> + * This code path is still called sysexit because it pairs
> + * with sysenter and it uses the sysenter calling convention.
> + */
> andl $~TS_COMPAT,ASM_THREAD_INFO(TI_status, %rsp, SIZEOF_PTREGS)
> - /* clear IF, that popfq doesn't enable interrupts early */
> - andl $~0x200,EFLAGS(%rsp)
> - movl RIP(%rsp),%edx /* User %eip */
> - CFI_REGISTER rip,rdx
> + movl RIP(%rsp),%ecx /* User %eip */
> + CFI_REGISTER rip,rcx
> RESTORE_RSI_RDI
I think you need to replace
RESTORE_RSI_RDI with RESTORE_RSI_RDI_RDX
> - /* pop everything except ss,rsp,rflags slots */
> - REMOVE_PT_GPREGS_FROM_STACK 3*8
> + xorl %edx,%edx
Why do you clear %edx?
> xorq %r8,%r8
> xorq %r9,%r9
> xorq %r10,%r10
> - xorq %r11,%r11
> - popfq_cfi
> + movl EFLAGS(%rsp),%r11d /* User eflags */
> /*CFI_RESTORE rflags*/
> - popq_cfi %rcx /* User %esp */
> - CFI_REGISTER rsp,rcx
> TRACE_IRQS_ON
> /*
> - * 32bit SYSEXIT restores eip from edx, esp from ecx.
> - * cs and ss are loaded from MSRs.
> + * Sysretl works even on Intel CPUs. Use it in preference to sysexit,
> + * since it avoids a dicey window with interrupts enabled.
> + * CS and SS are loaded from MSRs.
Please do not remove the mini-doc which says what is restored from where.
These instructions are not that obvious. I propose:
* 64bit->32bit SYSRET restores eip from ecx,
* eflags from r11 (but RF and VM bits are forced to 0),
* CS and SS are loaded from MSRs. CS and SS are loaded from MSRs.
Or maybe just
...
sysexit_from_sys_call:
/*
* Sysretl works even on Intel CPUs. Use it in preference to sysexit,
* since it avoids a dicey window with interrupts enabled.
*/
jmp to sysretl_from_sys_call
and remove the entire "sysexit tail" code path?
^ permalink raw reply [flat|nested] 7+ messages in thread* Re: [RFC] x86, ia32entry: Use sysretl to return from sysenter
2015-03-29 19:07 ` Denys Vlasenko
@ 2015-03-29 21:16 ` Andy Lutomirski
2015-03-30 9:04 ` Borislav Petkov
0 siblings, 1 reply; 7+ messages in thread
From: Andy Lutomirski @ 2015-03-29 21:16 UTC (permalink / raw)
To: Denys Vlasenko
Cc: Andy Lutomirski, Linus Torvalds, Denys Vlasenko, Borislav Petkov,
Linux Kernel Mailing List, X86 ML, Ingo Molnar, H. Peter Anvin,
stable
On Sun, Mar 29, 2015 at 12:07 PM, Denys Vlasenko
<vda.linux@googlemail.com> wrote:
> On Fri, Mar 27, 2015 at 10:54 PM, Andy Lutomirski <luto@kernel.org> wrote:
>> --- a/arch/x86/ia32/ia32entry.S
>> +++ b/arch/x86/ia32/ia32entry.S
>> @@ -180,28 +180,34 @@ sysenter_dispatch:
>> testl $_TIF_ALLWORK_MASK, ASM_THREAD_INFO(TI_flags, %rsp, SIZEOF_PTREGS)
>> jnz sysexit_audit
>> sysexit_from_sys_call:
>> + /*
>> + * NB: sysexit is not obviously safe for 64-bit kernels -- an
>> + * NMI between sti and sysexit has poorly specified behavior,
>> + * and and NMI followed by an IRQ with usergs is fatal. So
>> + * we just pretend we're using sysexit but we really use
>> + * sysretl instead.
>> + *
>> + * This code path is still called sysexit because it pairs
>> + * with sysenter and it uses the sysenter calling convention.
>> + */
>> andl $~TS_COMPAT,ASM_THREAD_INFO(TI_status, %rsp, SIZEOF_PTREGS)
>> - /* clear IF, that popfq doesn't enable interrupts early */
>> - andl $~0x200,EFLAGS(%rsp)
>> - movl RIP(%rsp),%edx /* User %eip */
>> - CFI_REGISTER rip,rdx
>> + movl RIP(%rsp),%ecx /* User %eip */
>> + CFI_REGISTER rip,rcx
>> RESTORE_RSI_RDI
>
> I think you need to replace
> RESTORE_RSI_RDI with RESTORE_RSI_RDI_RDX
I didn't because I clear it...
>
>> - /* pop everything except ss,rsp,rflags slots */
>> - REMOVE_PT_GPREGS_FROM_STACK 3*8
>> + xorl %edx,%edx
>
> Why do you clear %edx?
>
...and I clear it because the post-sysexit trampoline in the vdso pops edx.
Eventually I'd like to clean that stuff up too, but in the mean time
IMO it would be nice to keep this change self-contained and avoid
fiddling with calling conventions. There's a long history of nasty
bugs in there.
See, for example, this long thread:
http://lkml.iu.edu/hypermail/linux/kernel/1108.2/02158.html
>
>> xorq %r8,%r8
>> xorq %r9,%r9
>> xorq %r10,%r10
>> - xorq %r11,%r11
>> - popfq_cfi
>> + movl EFLAGS(%rsp),%r11d /* User eflags */
>> /*CFI_RESTORE rflags*/
>> - popq_cfi %rcx /* User %esp */
>> - CFI_REGISTER rsp,rcx
>> TRACE_IRQS_ON
>> /*
>> - * 32bit SYSEXIT restores eip from edx, esp from ecx.
>> - * cs and ss are loaded from MSRs.
>> + * Sysretl works even on Intel CPUs. Use it in preference to sysexit,
>> + * since it avoids a dicey window with interrupts enabled.
>> + * CS and SS are loaded from MSRs.
>
> Please do not remove the mini-doc which says what is restored from where.
> These instructions are not that obvious. I propose:
>
> * 64bit->32bit SYSRET restores eip from ecx,
> * eflags from r11 (but RF and VM bits are forced to 0),
> * CS and SS are loaded from MSRs. CS and SS are loaded from MSRs.
>
I'll do something like that in v2.
>
>
> Or maybe just
>
> ...
> sysexit_from_sys_call:
> /*
> * Sysretl works even on Intel CPUs. Use it in preference to sysexit,
> * since it avoids a dicey window with interrupts enabled.
> */
> jmp to sysretl_from_sys_call
>
>
>
> and remove the entire "sysexit tail" code path?
I'd rather eventually do the big cleanup I mentioned in the other thread.
--Andy
--
Andy Lutomirski
AMA Capital Management, LLC
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [RFC] x86, ia32entry: Use sysretl to return from sysenter
2015-03-29 21:16 ` Andy Lutomirski
@ 2015-03-30 9:04 ` Borislav Petkov
0 siblings, 0 replies; 7+ messages in thread
From: Borislav Petkov @ 2015-03-30 9:04 UTC (permalink / raw)
To: Andy Lutomirski
Cc: Denys Vlasenko, Andy Lutomirski, Linus Torvalds, Denys Vlasenko,
Linux Kernel Mailing List, X86 ML, Ingo Molnar, H. Peter Anvin,
stable
On Sun, Mar 29, 2015 at 02:16:05PM -0700, Andy Lutomirski wrote:
> > sysexit_from_sys_call:
> > /*
> > * Sysretl works even on Intel CPUs. Use it in preference to sysexit,
> > * since it avoids a dicey window with interrupts enabled.
Btw, let's agree on the convention to write instructions in capital
letters, i.e. Sysretl -> SYSRET and so on...
Thanks.
--
Regards/Gruss,
Boris.
ECO tip #101: Trim your mails when you reply.
--
^ permalink raw reply [flat|nested] 7+ messages in thread
end of thread, other threads:[~2015-03-30 9:06 UTC | newest]
Thread overview: 7+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2015-03-27 21:54 [RFC] x86, ia32entry: Use sysretl to return from sysenter Andy Lutomirski
2015-03-28 8:35 ` Ingo Molnar
2015-03-28 15:17 ` Andy Lutomirski
2015-03-29 12:58 ` Borislav Petkov
2015-03-29 19:07 ` Denys Vlasenko
2015-03-29 21:16 ` Andy Lutomirski
2015-03-30 9:04 ` Borislav Petkov
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox