* [PATCH] x86/entry: Improve system call entry comments
@ 2016-03-06 17:39 Andy Lutomirski
  2016-03-07  8:22 ` Ingo Molnar
  0 siblings, 1 reply; 13+ messages in thread
From: Andy Lutomirski @ 2016-03-06 17:39 UTC (permalink / raw)
  To: x86
  Cc: linux-kernel, Borislav Petkov, Oleg Nesterov, Andrew Cooper,
	Brian Gerst, Andy Lutomirski

Ingo suggested that the comments should explain when the various
entries are used.  This adds these explanations and improves other
parts of the comments.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
---

This applies on top of all the TF / debug / SYSENTER stuff.  If this
is too confusing, I'll rebase it and resend after everything settles
in -tip.

 arch/x86/entry/entry_32.S        | 59 +++++++++++++++++++++++++++-
 arch/x86/entry/entry_64.S        | 10 +++++
 arch/x86/entry/entry_64_compat.S | 85 +++++++++++++++++++++++++++-------------
 3 files changed, 126 insertions(+), 28 deletions(-)

diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S
index 09360c445c69..846a6c478bfd 100644
--- a/arch/x86/entry/entry_32.S
+++ b/arch/x86/entry/entry_32.S
@@ -307,6 +307,38 @@ ENTRY(xen_sysenter_target)
 	jmp	sysenter_past_esp
 #endif
 
+/*
+ * 32-bit SYSENTER entry.
+ *
+ * 32-bit system calls through the vDSO's __kernel_vsyscall enter here
+ * if X86_FEATURE_SEP is available.  This is the preferred system call
+ * entry on 32-bit systems.
+ *
+ * The SYSENTER instruction, in principle, should *only* occur in the
+ * vDSO.  In practice, a small number of Android devices were shipped
+ * with a copy of Bionic that inlined a SYSENTER instruction.  This
+ * never happened in any of Google's Bionic versions -- it only happened
+ * in a narrow range of Intel-provided versions.
+ *
+ * SYSENTER loads SS, ESP, CS, and EIP from previously programmed MSRs.
+ * IF and VM in EFLAGS are cleared (IOW: interrupts are off).
+ * SYSENTER does not save anything on the stack,
+ * and does not save old EIP (!!!), ESP, or EFLAGS.
+ *
+ * To avoid losing track of EFLAGS.VM (and thus potentially corrupting
+ * user and/or vm86 state), we explicitly disable the SYSENTER
+ * instruction in vm86 mode by reprogramming the MSRs.
+ *
+ * Arguments:
+ * eax  system call number
+ * ebx  arg1
+ * ecx  arg2
+ * edx  arg3
+ * esi  arg4
+ * edi  arg5
+ * ebp  user stack
+ * 0(%ebp) arg6
+ */
 ENTRY(entry_SYSENTER_32)
 	movl	TSS_sysenter_sp0(%esp), %esp
 sysenter_past_esp:
@@ -398,7 +430,32 @@ sysenter_past_esp:
 GLOBAL(__end_SYSENTER_singlestep_region)
 ENDPROC(entry_SYSENTER_32)
 
-	# system call handler stub
+/*
+ * 32-bit legacy system call entry.
+ *
+ * 32-bit x86 Linux system calls traditionally used the INT $0x80
+ * instruction.  INT $0x80 lands here.
+ *
+ * This entry point can be used by 32-bit and 64-bit programs to perform
+ * 32-bit system calls.  Instances of INT $0x80 can be found inline in
+ * various programs and libraries.  It is also used by the vDSO's
+ * __kernel_vsyscall fallback for hardware that doesn't support a faster
+ * entry method.  Restarted 32-bit system calls also fall back to INT
+ * $0x80 regardless of what instruction was originally used to do the
+ * system call.
+ *
+ * This is considered a slow path.  It is not used by modern libc
+ * implementations on modern hardware except during process startup.
+ *
+ * Arguments:
+ * eax  system call number
+ * ebx  arg1
+ * ecx  arg2
+ * edx  arg3
+ * esi  arg4
+ * edi  arg5
+ * ebp  arg6
+ */
 ENTRY(entry_INT80_32)
 	ASM_CLAC
 	pushl	%eax			/* pt_regs->orig_ax */
diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 2f61059dacc3..6f5fbc584cb1 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -106,6 +106,16 @@ ENDPROC(native_usergs_sysret64)
 /*
  * 64-bit SYSCALL instruction entry. Up to 6 arguments in registers.
  *
+ * This is the only entry point used for 64-bit system calls.  The
+ * hardware interface is reasonably well designed and the register to
+ * argument mapping Linux uses fits well with the registers that are
+ * available when SYSCALL is used.
+ *
+ * SYSCALL instructions can be found inlined in libc implementations as
+ * well as some other programs and libraries.  There are also a handful
+ * of SYSCALL instructions in the vDSO used, for example, as a
+ * clock_gettime or gettimeofday fallback.
+ *
  * 64-bit SYSCALL saves rip to rcx, clears rflags.RF, then saves rflags to r11,
  * then loads new ss, cs, and rip from previously programmed MSRs.
  * rflags gets masked by a value from another MSR (so CLD and CLAC
diff --git a/arch/x86/entry/entry_64_compat.S b/arch/x86/entry/entry_64_compat.S
index aa3f79ba2e4d..1838dcdd3886 100644
--- a/arch/x86/entry/entry_64_compat.S
+++ b/arch/x86/entry/entry_64_compat.S
@@ -19,12 +19,21 @@
 	.section .entry.text, "ax"
 
 /*
- * 32-bit SYSENTER instruction entry.
+ * 32-bit SYSENTER entry.
  *
- * SYSENTER loads ss, rsp, cs, and rip from previously programmed MSRs.
- * IF and VM in rflags are cleared (IOW: interrupts are off).
+ * 32-bit system calls through the vDSO's __kernel_vsyscall enter here
+ * on 64-bit kernels running on Intel CPUs.
+ *
+ * The SYSENTER instruction, in principle, should *only* occur in the
+ * vDSO.  In practice, a small number of Android devices were shipped
+ * with a copy of Bionic that inlined a SYSENTER instruction.  This
+ * never happened in any of Google's Bionic versions -- it only happened
+ * in a narrow range of Intel-provided versions.
+ *
+ * SYSENTER loads SS, RSP, CS, and RIP from previously programmed MSRs.
+ * IF and VM in RFLAGS are cleared (IOW: interrupts are off).
  * SYSENTER does not save anything on the stack,
- * and does not save old rip (!!!) and rflags.
+ * and does not save old RIP (!!!), RSP, or RFLAGS.
  *
  * Arguments:
  * eax  system call number
@@ -35,10 +44,6 @@
  * edi  arg5
  * ebp  user stack
  * 0(%ebp) arg6
- *
- * This is purely a fast path. For anything complicated we use the int 0x80
- * path below. We set up a complete hardware stack frame to share code
- * with the int 0x80 path.
  */
 ENTRY(entry_SYSENTER_compat)
 	/* Interrupts are off on entry. */
@@ -131,17 +136,38 @@ GLOBAL(__end_entry_SYSENTER_compat)
 ENDPROC(entry_SYSENTER_compat)
 
 /*
- * 32-bit SYSCALL instruction entry.
+ * 32-bit SYSCALL entry.
+ *
+ * 32-bit system calls through the vDSO's __kernel_vsyscall enter here
+ * on 64-bit kernels running on AMD CPUs.
+ *
+ * The SYSCALL instruction, in principle, should *only* occur in the
+ * vDSO.  In practice, it appears that this really is the case.
+ * As evidence:
+ *
+ *  - The calling convention for SYSCALL has changed several times without
+ *    anyone noticing.
+ *
+ *  - Prior to the in-kernel X86_BUG_SYSRET_SS_ATTRS fixup, any
+ *    user task that did SYSCALL without immediately reloading SS
+ *    would randomly crash.
  *
- * 32-bit SYSCALL saves rip to rcx, clears rflags.RF, then saves rflags to r11,
- * then loads new ss, cs, and rip from previously programmed MSRs.
- * rflags gets masked by a value from another MSR (so CLD and CLAC
- * are not needed). SYSCALL does not save anything on the stack
- * and does not change rsp.
+ *  - Most programmers do not directly target AMD CPUs, and the 32-bit
+ *    SYSCALL instruction does not exist on Intel CPUs.  Even on AMD
+ *    CPUs, Linux disables the SYSCALL instruction on 32-bit kernels
+ *    because the SYSCALL instruction in legacy/native 32-bit mode (as
+ *    opposed to compat mode) is sufficiently poorly designed as to be
+ *    essentially unusable.
  *
- * Note: rflags saving+masking-with-MSR happens only in Long mode
+ * 32-bit SYSCALL saves RIP to RCX, clears RFLAGS.RF, then saves
+ * RFLAGS to R11, then loads new SS, CS, and RIP from previously
+ * programmed MSRs.  RFLAGS gets masked by a value from another MSR
+ * (so CLD and CLAC are not needed).  SYSCALL does not save anything on
+ * the stack and does not change RSP.
+ *
+ * Note: RFLAGS saving+masking-with-MSR happens only in Long mode
  * (in legacy 32-bit mode, IF, RF and VM bits are cleared and that's it).
- * Don't get confused: rflags saving+masking depends on Long Mode Active bit
+ * Don't get confused: RFLAGS saving+masking depends on Long Mode Active bit
  * (EFER.LMA=1), NOT on bitness of userspace where SYSCALL executes
  * or target CS descriptor's L bit (SYSCALL does not read segment descriptors).
  *
@@ -241,7 +267,21 @@ sysret32_from_system_call:
 END(entry_SYSCALL_compat)
 
 /*
- * Emulated IA32 system calls via int 0x80.
+ * 32-bit legacy system call entry.
+ *
+ * 32-bit x86 Linux system calls traditionally used the INT $0x80
+ * instruction.  INT $0x80 lands here.
+ *
+ * This entry point can be used by 32-bit and 64-bit programs to perform
+ * 32-bit system calls.  Instances of INT $0x80 can be found inline in
+ * various programs and libraries.  It is also used by the vDSO's
+ * __kernel_vsyscall fallback for hardware that doesn't support a faster
+ * entry method.  Restarted 32-bit system calls also fall back to INT
+ * $0x80 regardless of what instruction was originally used to do the
+ * system call.
+ *
+ * This is considered a slow path.  It is not used by modern libc
+ * implementations on modern hardware except during process startup.
  *
  * Arguments:
  * eax  system call number
@@ -250,17 +290,8 @@ END(entry_SYSCALL_compat)
  * edx  arg3
  * esi  arg4
  * edi  arg5
- * ebp  arg6	(note: not saved in the stack frame, should not be touched)
- *
- * Notes:
- * Uses the same stack frame as the x86-64 version.
- * All registers except eax must be saved (but ptrace may violate that).
- * Arguments are zero extended. For system calls that want sign extension and
- * take long arguments a wrapper is needed. Most calls can just be called
- * directly.
- * Assumes it is only called from user space and entered with interrupts off.
+ * ebp  arg6
  */
-
 ENTRY(entry_INT80_compat)
 	/*
 	 * Interrupts are off on entry.
-- 
2.5.0

* Re: [PATCH] x86/entry: Improve system call entry comments
  2016-03-06 17:39 [PATCH] x86/entry: Improve system call entry comments Andy Lutomirski
@ 2016-03-07  8:22 ` Ingo Molnar
  2016-03-07 16:34   ` H. Peter Anvin
  2016-03-07 17:01   ` Andy Lutomirski
  0 siblings, 2 replies; 13+ messages in thread
From: Ingo Molnar @ 2016-03-07  8:22 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: x86, linux-kernel, Borislav Petkov, Oleg Nesterov, Andrew Cooper,
	Brian Gerst, Linus Torvalds, Andrew Morton, Peter Zijlstra,
	Thomas Gleixner, H. Peter Anvin


* Andy Lutomirski <luto@kernel.org> wrote:

> Ingo suggested that the comments should explain when the various
> entries are used.  This adds these explanations and improves other
> parts of the comments.

Thanks for doing this, this is really useful!

One very small detail I noticed:

> +/*
> + * 32-bit legacy system call entry.
> + *
> + * 32-bit x86 Linux system calls traditionally used the INT $0x80
> + * instruction.  INT $0x80 lands here.
> + *
> + * This entry point can be used by 32-bit and 64-bit programs to perform
> + * 32-bit system calls.  Instances of INT $0x80 can be found inline in
> + * various programs and libraries.  It is also used by the vDSO's
> + * __kernel_vsyscall fallback for hardware that doesn't support a faster
> + * entry method.  Restarted 32-bit system calls also fall back to INT
> + * $0x80 regardless of what instruction was originally used to do the
> + * system call.
> + *
> + * This is considered a slow path.  It is not used by modern libc
> + * implementations on modern hardware except during process startup.
> + *
> + * Arguments:
> + * eax  system call number
> + * ebx  arg1
> + * ecx  arg2
> + * edx  arg3
> + * esi  arg4
> + * edi  arg5
> + * ebp  arg6
> + */
>  ENTRY(entry_INT80_32)

entry_INT80_32() is only used on pure 32-bit kernels, 64-bit kernels use 
entry_INT80_compat(). So the above text should not talk about 64-bit programs, as 
they can never trigger this specific entry point, right?

So I'd change the explanation to something like:

> + * This entry point is active on 32-bit kernels and can thus be used by 32-bit 
> + * programs to perform 32-bit system calls. (Programs running on 64-bit
> + * kernels executing INT $0x80 will land on another entry point: 
> + * entry_INT80_compat. The ABI is identical.)

Agreed?

Thanks,

	Ingo

* Re: [PATCH] x86/entry: Improve system call entry comments
  2016-03-07  8:22 ` Ingo Molnar
@ 2016-03-07 16:34   ` H. Peter Anvin
  2016-03-08 10:30     ` Ingo Molnar
  2016-03-07 17:01   ` Andy Lutomirski
  1 sibling, 1 reply; 13+ messages in thread
From: H. Peter Anvin @ 2016-03-07 16:34 UTC (permalink / raw)
  To: Ingo Molnar, Andy Lutomirski
  Cc: x86, linux-kernel, Borislav Petkov, Oleg Nesterov, Andrew Cooper,
	Brian Gerst, Linus Torvalds, Andrew Morton, Peter Zijlstra,
	Thomas Gleixner

On March 7, 2016 12:22:28 AM PST, Ingo Molnar <mingo@kernel.org> wrote:
>
>* Andy Lutomirski <luto@kernel.org> wrote:
>
>> Ingo suggested that the comments should explain when the various
>> entries are used.  This adds these explanations and improves other
>> parts of the comments.
>
>Thanks for doing this, this is really useful!
>
>One very small detail I noticed:
>
>> +/*
>> + * 32-bit legacy system call entry.
>> + *
>> + * 32-bit x86 Linux system calls traditionally used the INT $0x80
>> + * instruction.  INT $0x80 lands here.
>> + *
>> + * This entry point can be used by 32-bit and 64-bit programs to perform
>> + * 32-bit system calls.  Instances of INT $0x80 can be found inline in
>> + * various programs and libraries.  It is also used by the vDSO's
>> + * __kernel_vsyscall fallback for hardware that doesn't support a faster
>> + * entry method.  Restarted 32-bit system calls also fall back to INT
>> + * $0x80 regardless of what instruction was originally used to do the
>> + * system call.
>> + *
>> + * This is considered a slow path.  It is not used by modern libc
>> + * implementations on modern hardware except during process startup.
>> + *
>> + * Arguments:
>> + * eax  system call number
>> + * ebx  arg1
>> + * ecx  arg2
>> + * edx  arg3
>> + * esi  arg4
>> + * edi  arg5
>> + * ebp  arg6
>> + */
>>  ENTRY(entry_INT80_32)
>
>entry_INT80_32() is only used on pure 32-bit kernels, 64-bit kernels use
>entry_INT80_compat(). So the above text should not talk about 64-bit programs, as
>they can never trigger this specific entry point, right?
>
>So I'd change the explanation to something like:
>
>> + * This entry point is active on 32-bit kernels and can thus be used by 32-bit
>> + * programs to perform 32-bit system calls. (Programs running on 64-bit
>> + * kernels executing INT $0x80 will land on another entry point:
>> + * entry_INT80_compat. The ABI is identical.)
>
>Agreed?
>
>Thanks,
>
>	Ingo

Sadly I believe Android still uses int $0x80 in the upstream version.
-- 
Sent from my Android device with K-9 Mail. Please excuse brevity and formatting.

* Re: [PATCH] x86/entry: Improve system call entry comments
  2016-03-07  8:22 ` Ingo Molnar
  2016-03-07 16:34   ` H. Peter Anvin
@ 2016-03-07 17:01   ` Andy Lutomirski
  2016-03-08 10:27     ` Ingo Molnar
  1 sibling, 1 reply; 13+ messages in thread
From: Andy Lutomirski @ 2016-03-07 17:01 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Thomas Gleixner, Borislav Petkov, linux-kernel@vger.kernel.org,
	Andrew Morton, Andrew Cooper, Oleg Nesterov, Peter Zijlstra,
	Brian Gerst, H. Peter Anvin, X86 ML, Linus Torvalds

On Mar 7, 2016 12:22 AM, "Ingo Molnar" <mingo@kernel.org> wrote:
>
>
> * Andy Lutomirski <luto@kernel.org> wrote:
>
> > Ingo suggested that the comments should explain when the various
> > entries are used.  This adds these explanations and improves other
> > parts of the comments.
>
> Thanks for doing this, this is really useful!
>
> One very small detail I noticed:
>
> > +/*
> > + * 32-bit legacy system call entry.
> > + *
> > + * 32-bit x86 Linux system calls traditionally used the INT $0x80
> > + * instruction.  INT $0x80 lands here.
> > + *
> > + * This entry point can be used by 32-bit and 64-bit programs to perform
> > + * 32-bit system calls.  Instances of INT $0x80 can be found inline in
> > + * various programs and libraries.  It is also used by the vDSO's
> > + * __kernel_vsyscall fallback for hardware that doesn't support a faster
> > + * entry method.  Restarted 32-bit system calls also fall back to INT
> > + * $0x80 regardless of what instruction was originally used to do the
> > + * system call.
> > + *
> > + * This is considered a slow path.  It is not used by modern libc
> > + * implementations on modern hardware except during process startup.
> > + *
> > + * Arguments:
> > + * eax  system call number
> > + * ebx  arg1
> > + * ecx  arg2
> > + * edx  arg3
> > + * esi  arg4
> > + * edi  arg5
> > + * ebp  arg6
> > + */
> >  ENTRY(entry_INT80_32)
>
> entry_INT80_32() is only used on pure 32-bit kernels, 64-bit kernels use
> entry_INT80_compat(). So the above text should not talk about 64-bit programs, as
> they can never trigger this specific entry point, right?
>

64-bit programs can and sometimes do trigger this entry point.  It
does a 32-bit syscall regardless of the caller's bitness, but it
returns back to the caller's original context, whatever it was.

> So I'd change the explanation to something like:
>
> > + * This entry point is active on 32-bit kernels and can thus be used by 32-bit
> > + * programs to perform 32-bit system calls. (Programs running on 64-bit
> > + * kernels executing INT $0x80 will land on another entry point:
> > + * entry_INT80_compat. The ABI is identical.)

I like the part in parentheses.

--Andy

* Re: [PATCH] x86/entry: Improve system call entry comments
  2016-03-07 17:01   ` Andy Lutomirski
@ 2016-03-08 10:27     ` Ingo Molnar
  2016-03-08 18:29       ` Andy Lutomirski
  0 siblings, 1 reply; 13+ messages in thread
From: Ingo Molnar @ 2016-03-08 10:27 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Thomas Gleixner, Borislav Petkov, linux-kernel@vger.kernel.org,
	Andrew Morton, Andrew Cooper, Oleg Nesterov, Peter Zijlstra,
	Brian Gerst, H. Peter Anvin, X86 ML, Linus Torvalds


* Andy Lutomirski <luto@amacapital.net> wrote:

> > >  ENTRY(entry_INT80_32)
> >
> > entry_INT80_32() is only used on pure 32-bit kernels, 64-bit kernels use
> > entry_INT80_compat(). So the above text should not talk about 64-bit programs, as
> > they can never trigger this specific entry point, right?
> >
> 
> 64-bit programs can and sometimes do trigger this entry point. [...]

How can 64-bit programs trigger entry_INT80_32? It's only ever set on 32-bit 
kernels:

#ifdef CONFIG_X86_32
        set_system_trap_gate(IA32_SYSCALL_VECTOR, entry_INT80_32);
        set_bit(IA32_SYSCALL_VECTOR, used_vectors);
#endif

> [...]  It does a 32-bit syscall regardless of the caller's bitness, but it 
> returns back to the caller's original context, whatever it was.

That's true of INT $0x80, but I'm talking about the entry point: AFAICS 
entry_INT80_32 can only ever execute on 32-bit kernels.

We don't even build the entry_32.S::entry_INT80_32 entry point on 64-bit kernels:

obj-y                           := entry_$(BITS).o [...]

> 
> > So I'd change the explanation to something like:
> >
> > > + * This entry point is active on 32-bit kernels and can thus be used by 32-bit
> > > + * programs to perform 32-bit system calls. (Programs running on 64-bit
> > > + * kernels executing INT $0x80 will land on another entry point:
> > > + * entry_INT80_compat. The ABI is identical.)
> 
> I like the part in parentheses.

So the part in parentheses conflicts with your above statement :)

What I wanted to say with this:

> > > + * This entry point is active on 32-bit kernels and can thus be used by 32-bit
> > > + * programs to perform 32-bit system calls. (Programs running on 64-bit
> > > + * kernels executing INT $0x80 will land on another entry point:
> > > + * entry_INT80_compat. The ABI is identical.)

... is what it says: that entry_INT80_32 is only active on 32-bit kernels, running 
32-bit programs, performing 32-bit system calls.

Programs running on 64-bit kernels can use INT $0x80 as well, but will land on 
another, different, 64-bit kernel specific entry point.

What am I missing?

Thanks,

	Ingo

* Re: [PATCH] x86/entry: Improve system call entry comments
  2016-03-07 16:34   ` H. Peter Anvin
@ 2016-03-08 10:30     ` Ingo Molnar
  2016-03-08 18:40       ` H. Peter Anvin
  0 siblings, 1 reply; 13+ messages in thread
From: Ingo Molnar @ 2016-03-08 10:30 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Andy Lutomirski, x86, linux-kernel, Borislav Petkov,
	Oleg Nesterov, Andrew Cooper, Brian Gerst, Linus Torvalds,
	Andrew Morton, Peter Zijlstra, Thomas Gleixner


* H. Peter Anvin <hpa@zytor.com> wrote:

> On March 7, 2016 12:22:28 AM PST, Ingo Molnar <mingo@kernel.org> wrote:
> >
> >* Andy Lutomirski <luto@kernel.org> wrote:
> >
> >> Ingo suggested that the comments should explain when the various
> >> entries are used.  This adds these explanations and improves other
> >> parts of the comments.
> >
> >Thanks for doing this, this is really useful!
> >
> >One very small detail I noticed:
> >
> >> +/*
> >> + * 32-bit legacy system call entry.
> >> + *
> >> + * 32-bit x86 Linux system calls traditionally used the INT $0x80
> >> + * instruction.  INT $0x80 lands here.
> >> + *
> >> + * This entry point can be used by 32-bit and 64-bit programs to perform
> >> + * 32-bit system calls.  Instances of INT $0x80 can be found inline in
> >> + * various programs and libraries.  It is also used by the vDSO's
> >> + * __kernel_vsyscall fallback for hardware that doesn't support a faster
> >> + * entry method.  Restarted 32-bit system calls also fall back to INT
> >> + * $0x80 regardless of what instruction was originally used to do the
> >> + * system call.
> >> + *
> >> + * This is considered a slow path.  It is not used by modern libc
> >> + * implementations on modern hardware except during process startup.
> >> + *
> >> + * Arguments:
> >> + * eax  system call number
> >> + * ebx  arg1
> >> + * ecx  arg2
> >> + * edx  arg3
> >> + * esi  arg4
> >> + * edi  arg5
> >> + * ebp  arg6
> >> + */
> >>  ENTRY(entry_INT80_32)
> >
> >entry_INT80_32() is only used on pure 32-bit kernels, 64-bit kernels use
> >entry_INT80_compat(). So the above text should not talk about 64-bit programs, as
> >they can never trigger this specific entry point, right?
> >
> >So I'd change the explanation to something like:
> >
> >> + * This entry point is active on 32-bit kernels and can thus be used by 32-bit
> >> + * programs to perform 32-bit system calls. (Programs running on 64-bit
> >> + * kernels executing INT $0x80 will land on another entry point:
> >> + * entry_INT80_compat. The ABI is identical.)
> >
> >Agreed?
> >
> >Thanks,
> >
> >	Ingo
> 
> Sadly I believe Android still uses int $0x80 in the upstream version.

I don't see how that fact conflicts with my statement: on 64-bit kernels INT $0x80 
will (of course) work, but will land on another entry point: entry_INT80_compat(), 
not entry_INT80_32().

On 32-bit kernels the INT $0x80 entry point is entry_INT80_32().

Thanks,

	Ingo

* Re: [PATCH] x86/entry: Improve system call entry comments
  2016-03-08 10:27     ` Ingo Molnar
@ 2016-03-08 18:29       ` Andy Lutomirski
  0 siblings, 0 replies; 13+ messages in thread
From: Andy Lutomirski @ 2016-03-08 18:29 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Thomas Gleixner, Borislav Petkov, X86 ML, Andrew Morton,
	Oleg Nesterov, Andrew Cooper, Peter Zijlstra, Brian Gerst,
	linux-kernel@vger.kernel.org, H. Peter Anvin, Linus Torvalds

On Mar 8, 2016 2:27 AM, "Ingo Molnar" <mingo@kernel.org> wrote:
>
>
> * Andy Lutomirski <luto@amacapital.net> wrote:
>
> > > >  ENTRY(entry_INT80_32)
> > >
> > > entry_INT80_32() is only used on pure 32-bit kernels, 64-bit kernels use
> > > entry_INT80_compat(). So the above text should not talk about 64-bit programs, as
> > > they can never trigger this specific entry point, right?
> > >
> >
> > 64-bit programs can and sometimes do trigger this entry point. [...]
>
> How can 64-bit programs trigger entry_INT80_32? It's only ever set on 32-bit
> kernels:
>
> #ifdef CONFIG_X86_32
>         set_system_trap_gate(IA32_SYSCALL_VECTOR, entry_INT80_32);
>         set_bit(IA32_SYSCALL_VECTOR, used_vectors);
> #endif
>
> > [...]  It does a 32-bit syscall regardless of the caller's bitness, but it
> > returns back to the caller's original context, whatever it was.
>
> That's true of INT $0x80, but I'm talking about the entry point: AFAICS
> entry_INT80_32 can only ever execute on 32-bit kernels.

Oh, duh.

>
> We don't even build the entry_32.S::entry_INT80_32 entry point on 64-bit kernels:
>
> obj-y                           := entry_$(BITS).o [...]
>
> >
> > > So I'd change the explanation to something like:
> > >
> > > > + * This entry point is active on 32-bit kernels and can thus be used by 32-bit
> > > > + * programs to perform 32-bit system calls. (Programs running on 64-bit
> > > > + * kernels executing INT $0x80 will land on another entry point:
> > > > + * entry_INT80_compat. The ABI is identical.)
> >
> > I like the part in parentheses.
>
> So the part in parentheses conflicts with your above statement :)
>
> What I wanted to say with this:
>
> > > > + * This entry point is active on 32-bit kernels and can thus be used by 32-bit
> > > > + * programs to perform 32-bit system calls. (Programs running on 64-bit
> > > > + * kernels executing INT $0x80 will land on another entry point:
> > > > + * entry_INT80_compat. The ABI is identical.)
>
> ... is what it says: that entry_INT80_32 is only active on 32-bit kernels, running
> 32-bit programs, performing 32-bit system calls.
>
> Programs running on 64-bit kernels can use INT $0x80 as well, but will land on
> another, different, 64-bit kernel specific entry point.
>
> What am I missing?
>

Nothing.  I mis-read your earlier email.  Want to fix and apply it, or
should I send a new version?

> Thanks,
>
>         Ingo

* Re: [PATCH] x86/entry: Improve system call entry comments
  2016-03-08 10:30     ` Ingo Molnar
@ 2016-03-08 18:40       ` H. Peter Anvin
  2016-03-08 18:45         ` Andy Lutomirski
  0 siblings, 1 reply; 13+ messages in thread
From: H. Peter Anvin @ 2016-03-08 18:40 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andy Lutomirski, x86, linux-kernel, Borislav Petkov,
	Oleg Nesterov, Andrew Cooper, Brian Gerst, Linus Torvalds,
	Andrew Morton, Peter Zijlstra, Thomas Gleixner

On 03/08/16 02:30, Ingo Molnar wrote:
>>>> + *
>>>> + * This is considered a slow path.  It is not used by modern libc
>>>> + * implementations on modern hardware except during process startup.
>>>> + *
>>
>> Sadly I believe Android still uses int $0x80 in the upstream version.
> 
> I don't see how that fact conflicts with my statement: on 64-bit kernels INT $0x80 
> will (of course) work, but will land on another entry point: entry_INT80_compat(), 
> not entry_INT80_32().
> 
> On 32-bit kernels the INT $0x80 entry point is entry_INT80_32().
> 

It doesn't.  I was referring to the above quote. Trying to fix that.

	-hpa

* Re: [PATCH] x86/entry: Improve system call entry comments
  2016-03-08 18:40       ` H. Peter Anvin
@ 2016-03-08 18:45         ` Andy Lutomirski
  2016-03-08 18:47           ` H. Peter Anvin
  0 siblings, 1 reply; 13+ messages in thread
From: Andy Lutomirski @ 2016-03-08 18:45 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Ingo Molnar, Andy Lutomirski, X86 ML,
	linux-kernel@vger.kernel.org, Borislav Petkov, Oleg Nesterov,
	Andrew Cooper, Brian Gerst, Linus Torvalds, Andrew Morton,
	Peter Zijlstra, Thomas Gleixner

On Tue, Mar 8, 2016 at 10:40 AM, H. Peter Anvin <hpa@zytor.com> wrote:
> On 03/08/16 02:30, Ingo Molnar wrote:
>>>>> + *
>>>>> + * This is considered a slow path.  It is not used by modern libc
>>>>> + * implementations on modern hardware except during process startup.
>>>>> + *
>>>
>>> Sadly I believe Android still uses int $0x80 in the upstream version.
>>
>> I don't see how that fact conflicts with my statement: on 64-bit kernels INT $0x80
>> will (of course) work, but will land on another entry point: entry_INT80_compat(),
>> not entry_INT80_32().
>>
>> On 32-bit kernels the INT $0x80 entry point is entry_INT80_32().
>>
>
> It doesn't.  I was referring to the above quote. Trying to fix that.

s/modern/most, perhaps?

I'm hoping that some day Bionic goes away and gets replaced by musl.

Of course, musl doesn't always use fast syscalls because it needs a
vdso facility that doesn't currently exist.  I'll deal with that
eventually.

>
>         -hpa
>
>



-- 
Andy Lutomirski
AMA Capital Management, LLC

* Re: [PATCH] x86/entry: Improve system call entry comments
  2016-03-08 18:45         ` Andy Lutomirski
@ 2016-03-08 18:47           ` H. Peter Anvin
  2016-03-08 18:50             ` Andy Lutomirski
  0 siblings, 1 reply; 13+ messages in thread
From: H. Peter Anvin @ 2016-03-08 18:47 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Ingo Molnar, Andy Lutomirski, X86 ML,
	linux-kernel@vger.kernel.org, Borislav Petkov, Oleg Nesterov,
	Andrew Cooper, Brian Gerst, Linus Torvalds, Andrew Morton,
	Peter Zijlstra, Thomas Gleixner

On 03/08/16 10:45, Andy Lutomirski wrote:
> 
> s/modern/most, perhaps?
> 
> I'm hoping that some day Bionic goes away and gets replaced by musl.
> 
> Of course, musl doesn't always use fast syscalls because it needs a
> vdso facility that doesn't currently exist.  I'll deal with that
> eventually.
> 

You don't actually need actual DSO support to support fast system calls
on i386.  Even klibc uses them now, and the additional code to support
it is trivial.

	-hpa

* Re: [PATCH] x86/entry: Improve system call entry comments
  2016-03-08 18:47           ` H. Peter Anvin
@ 2016-03-08 18:50             ` Andy Lutomirski
  2016-03-08 18:59               ` H. Peter Anvin
  0 siblings, 1 reply; 13+ messages in thread
From: Andy Lutomirski @ 2016-03-08 18:50 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Ingo Molnar, Andy Lutomirski, X86 ML,
	linux-kernel@vger.kernel.org, Borislav Petkov, Oleg Nesterov,
	Andrew Cooper, Brian Gerst, Linus Torvalds, Andrew Morton,
	Peter Zijlstra, Thomas Gleixner

On Tue, Mar 8, 2016 at 10:47 AM, H. Peter Anvin <hpa@zytor.com> wrote:
> On 03/08/16 10:45, Andy Lutomirski wrote:
>>
>> s/modern/most, perhaps?
>>
>> I'm hoping that some day Bionic goes away and gets replaced by musl.
>>
>> Of course, musl doesn't always use fast syscalls because it needs a
>> vdso facility that doesn't currently exist.  I'll deal with that
>> eventually.
>>
>
> You don't actually need actual DSO support to support fast system calls
> on i386.  Even klibc uses them now, and the additional code to support
> it is trivial.

That's not the issue.  The issue is that musl does something
crazy^Wclever to support POSIX pthread cancellation, and it involves
being able to tell whether a signal's ucontext points to a syscall
and, if so, what the return address is.  This is straightforward with
an inlined int $0x80, but doing it reliably with the current vdso
design would require parsing the DWARF data, and I can't really
blame musl for not wanting to do that.

There was a thread awhile back about adding a new vdso helper to do
this.  I think I even had some code for it.  If I find time, I'll try
to send patches for 4.7.

--Andy


* Re: [PATCH] x86/entry: Improve system call entry comments
  2016-03-08 18:50             ` Andy Lutomirski
@ 2016-03-08 18:59               ` H. Peter Anvin
  2016-03-08 19:11                 ` Andy Lutomirski
  0 siblings, 1 reply; 13+ messages in thread
From: H. Peter Anvin @ 2016-03-08 18:59 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Ingo Molnar, Andy Lutomirski, X86 ML,
	linux-kernel@vger.kernel.org, Borislav Petkov, Oleg Nesterov,
	Andrew Cooper, Brian Gerst, Linus Torvalds, Andrew Morton,
	Peter Zijlstra, Thomas Gleixner

On 03/08/16 10:50, Andy Lutomirski wrote:
> On Tue, Mar 8, 2016 at 10:47 AM, H. Peter Anvin <hpa@zytor.com> wrote:
>> On 03/08/16 10:45, Andy Lutomirski wrote:
>>>
>>> s/modern/most, perhaps?
>>>
>>> I'm hoping that some day Bionic goes away and gets replaced by musl.
>>>
>>> Of course, musl doesn't always use fast syscalls because it needs a
>>> vdso facility that doesn't currently exist.  I'll deal with that
>>> eventually.
>>>
>>
>> You don't actually need actual DSO support to support fast system calls
>> on i386.  Even klibc uses them now, and the additional code to support
>> it is trivial.
> 
> That's not the issue.  The issue is that musl does something
> crazy^Wclever to support POSIX pthread cancellation, and it involves
> being able to tell whether a signal's ucontext points to a syscall
> and, if so, what the return address is.  This is straightforward with
> an inlined int $0x80, but doing it reliably with the current vdso
> design would require parsing the DWARF data, and I can't really
> blame musl for not wanting to do that.
> 
> There was a thread awhile back about adding a new vdso helper to do
> this.  I think I even had some code for it.  If I find time, I'll try
> to send patches for 4.7.
> 

As far as I know, when we get a signal the EIP always points to int
$0x80 as we don't support system call restart (being a rare case) for
the fast system calls.

	-hpa


* Re: [PATCH] x86/entry: Improve system call entry comments
  2016-03-08 18:59               ` H. Peter Anvin
@ 2016-03-08 19:11                 ` Andy Lutomirski
  0 siblings, 0 replies; 13+ messages in thread
From: Andy Lutomirski @ 2016-03-08 19:11 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Ingo Molnar, Andy Lutomirski, X86 ML,
	linux-kernel@vger.kernel.org, Borislav Petkov, Oleg Nesterov,
	Andrew Cooper, Brian Gerst, Linus Torvalds, Andrew Morton,
	Peter Zijlstra, Thomas Gleixner

On Tue, Mar 8, 2016 at 10:59 AM, H. Peter Anvin <hpa@zytor.com> wrote:
> On 03/08/16 10:50, Andy Lutomirski wrote:
>> On Tue, Mar 8, 2016 at 10:47 AM, H. Peter Anvin <hpa@zytor.com> wrote:
>>> On 03/08/16 10:45, Andy Lutomirski wrote:
>>>>
>>>> s/modern/most, perhaps?
>>>>
>>>> I'm hoping that some day Bionic goes away and gets replaced by musl.
>>>>
>>>> Of course, musl doesn't always use fast syscalls because it needs a
>>>> vdso facility that doesn't currently exist.  I'll deal with that
>>>> eventually.
>>>>
>>>
>>> You don't actually need actual DSO support to support fast system calls
>>> on i386.  Even klibc uses them now, and the additional code to support
>>> it is trivial.
>>
>> That's not the issue.  The issue is that musl does something
>> crazy^Wclever to support POSIX pthread cancellation, and it involves
>> being able to tell whether a signal's ucontext points to a syscall
>> and, if so, what the return address is.  This is straightforward with
>> an inlined int $0x80, but doing it reliably with the current vdso
>> design would require parsing the DWARF data, and I can't really
>> blame musl for not wanting to do that.
>>
>> There was a thread awhile back about adding a new vdso helper to do
>> this.  I think I even had some code for it.  If I find time, I'll try
>> to send patches for 4.7.
>>
>
> As far as I know, when we get a signal the EIP always points to int
> $0x80 as we don't support system call restart (being a rare case) for
> the fast system calls.
>

We actually fully support system call restart on fast syscalls as of
(IIRC) 4.5, even on AMD.  Phew!

However, the nasty case for musl is when the cancellation signal
happens immediately before the actual kernel entry.  The signal
handler needs some way to detect whether the thread is at a
cancellation point.
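
A minimal sketch of the kind of check being discussed, assuming the libc
brackets its inlined int $0x80 with two labels and the signal handler
compares the saved program counter against them.  The struct, field
names, and addresses are hypothetical and only illustrate the idea; this
is not musl's actual code.  With a vDSO entry the saved program counter
lies inside the vDSO rather than in the libc's own stub, so a range check
like this one has nothing to compare against.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical description of a cancellable syscall stub.
 * [begin, end) covers everything from the start of the stub up to and
 * including the inlined int $0x80; `end` is the first address after it.
 */
struct cp_region {
	uintptr_t begin;   /* first instruction of the stub */
	uintptr_t end;     /* address just past the int $0x80 */
	uintptr_t cancel;  /* where to resume in order to act on cancellation */
};

/* True if the interrupted PC has not yet moved past the syscall instruction. */
static bool pc_in_cancellation_region(const struct cp_region *r, uintptr_t pc)
{
	assert(r->begin <= r->end);
	return pc >= r->begin && pc < r->end;
}

/*
 * What a cancellation signal handler would write back into the saved PC:
 * redirect to the cancel path if the thread was caught inside the region
 * (including immediately before kernel entry), otherwise leave it alone.
 */
static uintptr_t resume_pc_after_cancel_signal(const struct cp_region *r,
					       uintptr_t pc)
{
	return pc_in_cancellation_region(r, pc) ? r->cancel : pc;
}
```

A PC equal to `end` means the system call has already returned a result,
so the sketch leaves it untouched instead of redirecting it.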

--Andy



Thread overview: 13+ messages
2016-03-06 17:39 [PATCH] x86/entry: Improve system call entry comments Andy Lutomirski
2016-03-07  8:22 ` Ingo Molnar
2016-03-07 16:34   ` H. Peter Anvin
2016-03-08 10:30     ` Ingo Molnar
2016-03-08 18:40       ` H. Peter Anvin
2016-03-08 18:45         ` Andy Lutomirski
2016-03-08 18:47           ` H. Peter Anvin
2016-03-08 18:50             ` Andy Lutomirski
2016-03-08 18:59               ` H. Peter Anvin
2016-03-08 19:11                 ` Andy Lutomirski
2016-03-07 17:01   ` Andy Lutomirski
2016-03-08 10:27     ` Ingo Molnar
2016-03-08 18:29       ` Andy Lutomirski
