Re: Intel P6 vs P7 system call performance

* Re: Intel P6 vs P7 system call performance
@ 2002-12-18 12:55 Terje Eggestad
  2002-12-18 20:14 ` H. Peter Anvin
  0 siblings, 1 reply; 268+ messages in thread
From: Terje Eggestad @ 2002-12-18 12:55 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ulrich Drepper, Matti Aarnio, Hugh Dickins, Dave Jones,
	Ingo Molnar, linux-kernel, hpa


what about:

int (*_vsyscall) (int, ...);
_vsyscall = mmap(NULL, getpagesize(),  PROT_READ|PROT_EXEC,
MAP_VSYSCALL, , ); 

or if you're afraid of running out of MAP_* flags:

fd = open("/dev/vsyscall", );
_vsyscall = mmap(NULL, getpagesize(),  PROT_READ|PROT_EXEC, MAP_SHARED,
fd, 0);

Then you can leisurely map it in just after the program's text segment.

TJ
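
For illustration only, here is a minimal C sketch of the second variant above. It assumes a hypothetical "/dev/vsyscall" device (the one proposed in this mail, not an existing kernel interface) and guesses O_RDONLY for the open() flag that the mail leaves out. The helper name map_vsyscall_page is made up for the example.

```c
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Same signature as the _vsyscall pointer in the mail. */
typedef int (*vsyscall_fn)(int, ...);

/* Hypothetical helper: map one read-only, executable page of `path`.
 * Returns NULL on any failure.  O_RDONLY is an assumption; the mail does
 * not say which flags open() should get. */
vsyscall_fn map_vsyscall_page(const char *path)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return NULL;

    void *page = mmap(NULL, (size_t)getpagesize(), PROT_READ | PROT_EXEC,
                      MAP_SHARED, fd, 0);
    close(fd);                  /* the mapping stays valid after close() */
    if (page == MAP_FAILED)
        return NULL;

    return (vsyscall_fn)page;   /* data-to-function pointer cast, as dlsym() users do */
}
```

A caller would try map_vsyscall_page("/dev/vsyscall") once at startup and fall back to the ordinary system call path when it returns NULL.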


On Tue, 2002-12-17 at 18:55, Linus Torvalds wrote:
> On Tue, 17 Dec 2002, Matti Aarnio wrote:
> >
> > On Tue, Dec 17, 2002 at 09:07:21AM -0800, Linus Torvalds wrote:
> > > On Tue, 17 Dec 2002, Hugh Dickins wrote:
> > > > I thought that last page was intentionally left invalid?
> > >
> > > It was. But I thought it made sense to use, as it's the only really
> > > "special" page.
> >
> >   On a couple of occasions I have caught myself pre-decrementing
> >   a char pointer which "just happened" to be NULL.
> >
> >   Please keep the last page, as well as a few of the first pages as
> >   NULL-pointer poisons.
> 
> I think I have a good clean solution to this, that not only avoids the
> need for any hard-coded address _at_all_, but also solves Uli's problem
> quite cleanly.
> 
> Uli, how about I just add one new architecture-specific ELF AT flag, which
> is the "base of sysinfo page". Right now that page is all zeroes except
> for the system call trampoline at the beginning, but we might want to add
> other system information to the page in the future (it is readable, after
> all).
> 
> So we'd have an AT_SYSINFO entry, that with the current implementation
> would just get the value 0xfffff000. And then the glibc startup code could
> easily be backwards compatible with the suggestion I had in the previous
> email. Since we basically want to do an indirect jump anyway (because of
> the lack of absolute jumps in the instruction set), this looks like the
> natural way to do it.
> 
> That also allows the kernel to move around the SYSINFO page at will, and
> even makes it possible to avoid it altogether (ie this will solve the
> inevitable problems with UML - UML just wouldn't set AT_SYSINFO, so user
> level just wouldn't even _try_ to use it).
> 
> With that, there's nothing "special" about the vsyscall page, and I'd just
> go back to having the very last page unmapped (and have the vsyscall page
> in some other fixmap location that might even depend on kernel
> configuration).
> 
> Whaddaya think?
> 
> 		Linus
> 



-- 
_________________________________________________________________________

Terje Eggestad                  mailto:terje.eggestad@scali.no
Scali Scalable Linux Systems    http://www.scali.com

Olaf Helsets Vei 6              tel:    +47 22 62 89 61 (OFFICE)
P.O.Box 150, Oppsal                     +47 975 31 574  (MOBILE)
N-0619 Oslo                     fax:    +47 22 62 89 51
NORWAY            
_________________________________________________________________________



* Re: Intel P6 vs P7 system call performance
  2002-12-18 12:55 Intel P6 vs P7 system call performance Terje Eggestad
@ 2002-12-18 20:14 ` H. Peter Anvin
  2002-12-18 20:25   ` Richard B. Johnson
  2002-12-18 22:28   ` Jamie Lokier
  0 siblings, 2 replies; 268+ messages in thread
From: H. Peter Anvin @ 2002-12-18 20:14 UTC (permalink / raw)
  To: Terje Eggestad
  Cc: Linus Torvalds, Ulrich Drepper, Matti Aarnio, Hugh Dickins,
	Dave Jones, Ingo Molnar, linux-kernel

Terje Eggestad wrote:
> what about:
> 
> int (*_vsyscall) (int, ...);
> _vsyscall = mmap(NULL, getpagesize(),  PROT_READ|PROT_EXEC,
> MAP_VSYSCALL, , ); 
> 
> or if you're afraid of running out of MAP_* flags:
> 
> fd = open("/dev/vsyscall", );
> _vsyscall = mmap(NULL, getpagesize(),  PROT_READ|PROT_EXEC, MAP_SHARED,
> fd, 0);
> 
> Then you can leisurely map it in just after the programs text segment. 
> 

Very ugly -- then the application has to do indirect calls.

	-hpa




* Re: Intel P6 vs P7 system call performance
  2002-12-18 20:14 ` H. Peter Anvin
@ 2002-12-18 20:25   ` Richard B. Johnson
  2002-12-18 20:26     ` H. Peter Anvin
  2002-12-18 22:28   ` Jamie Lokier
  1 sibling, 1 reply; 268+ messages in thread
From: Richard B. Johnson @ 2002-12-18 20:25 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Terje Eggestad, Linus Torvalds, Ulrich Drepper, Matti Aarnio,
	Hugh Dickins, Dave Jones, Ingo Molnar, linux-kernel

[-- Attachment #1: Type: TEXT/PLAIN, Size: 1166 bytes --]


The number of CPU clocks necessary to make the 'far' or
full-pointer call by pushing the segment register, the offset,
then issuing a 'lret' is 33 clocks on a Pentium II.
 
longcall clocks = 46
call clocks = 13
actual full-pointer call clocks = 33

processor	: 0
vendor_id	: GenuineIntel
cpu family	: 6
model		: 5
model name	: Pentium II (Deschutes)
stepping	: 1
cpu MHz		: 399.573
cache size	: 512 KB
fdiv_bug	: no
hlt_bug		: no
f00f_bug	: no
coma_bug	: no
fpu		: yes
fpu_exception	: yes
cpuid level	: 2
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 mmx fxsr
bogomips	: 797.90

processor	: 1
vendor_id	: GenuineIntel
cpu family	: 6
model		: 5
model name	: Pentium II (Deschutes)
stepping	: 1
cpu MHz		: 399.573
cache size	: 512 KB
fdiv_bug	: no
hlt_bug		: no
f00f_bug	: no
coma_bug	: no
fpu		: yes
fpu_exception	: yes
cpuid level	: 2
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 mmx fxsr
bogomips	: 797.90


Cheers,
Dick Johnson
Penguin : Linux version 2.4.18 on an i686 machine (797.90 BogoMips).
Why is the government concerned about the lunatic fringe? Think about it.


[-- Attachment #2: Type: APPLICATION/octet-stream, Size: 3962 bytes --]


* Re: Intel P6 vs P7 system call performance
  2002-12-18 20:25   ` Richard B. Johnson
@ 2002-12-18 20:26     ` H. Peter Anvin
  0 siblings, 0 replies; 268+ messages in thread
From: H. Peter Anvin @ 2002-12-18 20:26 UTC (permalink / raw)
  To: root
  Cc: Terje Eggestad, Linus Torvalds, Ulrich Drepper, Matti Aarnio,
	Hugh Dickins, Dave Jones, Ingo Molnar, linux-kernel

Richard B. Johnson wrote:
> The number of CPU clocks necessary to make the 'far' or
> full-pointer call by pushing the segment register, the offset,
> then issuing a 'lret' is 33 clocks on a Pentium II.
>
> longcall clocks = 46
> call clocks = 13
> actual full-pointer call clocks = 33

That's not a call, that's a jump.  Comparing it to a call instruction is
meaningless.

	-hpa



* Re: Intel P6 vs P7 system call performance
  2002-12-18 20:14 ` H. Peter Anvin
  2002-12-18 20:25   ` Richard B. Johnson
@ 2002-12-18 22:28   ` Jamie Lokier
  2002-12-18 22:37     ` Linus Torvalds
  2002-12-18 22:39     ` H. Peter Anvin
  1 sibling, 2 replies; 268+ messages in thread
From: Jamie Lokier @ 2002-12-18 22:28 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Terje Eggestad, Linus Torvalds, Ulrich Drepper, Matti Aarnio,
	Hugh Dickins, Dave Jones, Ingo Molnar, linux-kernel

H. Peter Anvin wrote:
> Terje Eggestad wrote:
> > fd = open("/dev/vsyscall", );
> > _vsyscall = mmap(NULL, getpagesize(),  PROT_READ|PROT_EXEC, MAP_SHARED,
> > fd, 0);
> 
> Very ugly -- then the application has to do indirect calls.

No it doesn't.

The application, or library, would map the vsyscall page to an address
in its own data section.  This means that position-independent code
can do vsyscalls without any relocations, and hence without dirtying
its own caller pages.

In some ways this is better than the 0xfffe0000 address: _that_
requires position-independent code to do indirect calls to the
absolute address, or to dirty its caller pages.

That said, you always need the page at 0xfffe0000 mapped anyway, so
that sysexit can jump to a fixed address (which is fastest).

-- Jamie


* Re: Intel P6 vs P7 system call performance
  2002-12-18 22:28   ` Jamie Lokier
@ 2002-12-18 22:37     ` Linus Torvalds
  2002-12-18 22:57       ` Linus Torvalds
  2002-12-18 22:39     ` H. Peter Anvin
  1 sibling, 1 reply; 268+ messages in thread
From: Linus Torvalds @ 2002-12-18 22:37 UTC (permalink / raw)
  To: Jamie Lokier
  Cc: H. Peter Anvin, Terje Eggestad, Ulrich Drepper, Matti Aarnio,
	Hugh Dickins, Dave Jones, Ingo Molnar, linux-kernel

On Wed, 18 Dec 2002, Jamie Lokier wrote:
> 
> That said, you always need the page at 0xfffe0000 mapped anyway, so
> that sysexit can jump to a fixed address (which is fastest).

Yes. This is important. There _needs_ to be some fixed address at least as 
far as the kernel is concerned (it might move around between reboots or 
something like that, but it needs to be something the kernel knows about 
intimately and doesn't need lots of dynamic lookup).

However, there's another issue, namely process startup cost. I personally 
want it to be as light as at all possible. I hate doing an "strace" on 
user processes and seeing tons and tons of crapola showing up. Just for 
fun, do a

	strace /bin/sh -c "echo hello"

to see what I'm talking about. And that's actually a _lot_ better these 
days than it used to be.

Anyway, I really hate to see "unnecessary crap" in the user mode startup 
just because kernel interfaces are bad. That's why I like the AT_SYSINFO 
ELF auxiliary table approach - it's something that is already _there_ for
the process to just take advantage of. Having to do a magic mmap for
something that everybody needs to do is just bad design.

			Linus


* Re: Intel P6 vs P7 system call performance
  2002-12-18 22:37     ` Linus Torvalds
@ 2002-12-18 22:57       ` Linus Torvalds
  2002-12-20  0:53         ` Daniel Jacobowitz
  0 siblings, 1 reply; 268+ messages in thread
From: Linus Torvalds @ 2002-12-18 22:57 UTC (permalink / raw)
  To: Jamie Lokier
  Cc: H. Peter Anvin, Terje Eggestad, Ulrich Drepper, Matti Aarnio,
	Hugh Dickins, Dave Jones, Ingo Molnar, linux-kernel


Btw, I'm pushing what looks like the "final" version of sysenter/sysexit 
for now. There may be bugs left, but all the known issues are resolved:

 - single-stepping over the system call now works. It doesn't actually see 
   all of the user-mode instructions, since the fast system call interface 
   does not lend itself well to restoring "TF" in eflags on return, but 
   the trampoline code saves and restores the flags, so you will be able 
   to step over the important bits.

   (ptrace also doesn't actually allow you to look at the instruction 
   contents in high memory, so gdb won't see the instructions in the
   user-mode fast system call trampoline even when it can single-step
   them, and I don't think I'll bother to fix it up).

 - NMI at the "wrong" time (just before first instruction in kernel 
   space) should now be a non-issue. The per-CPU SEP stack looks like a 
   real (nonpreemptable) process, and follows all the conventions needed 
   for "current_thread_info()" and friends. This behaviour is also 
   triggered by the single-step debug trap, so while I've obviously not 
   tested NMI behaviour, I _have_ tested the very same concept at that 
   exact point.

 - The APM problem was confirmed by Andrew to apparently be just a GDT 
   that was too small for the new layout.

This is in addition to the six-argument issues and the glibc address query
issues that were resolved yesterday.

			Linus



* Re: Intel P6 vs P7 system call performance
  2002-12-18 22:57       ` Linus Torvalds
@ 2002-12-20  0:53         ` Daniel Jacobowitz
  2002-12-20  1:47           ` Linus Torvalds
  0 siblings, 1 reply; 268+ messages in thread
From: Daniel Jacobowitz @ 2002-12-20  0:53 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jamie Lokier, H. Peter Anvin, Terje Eggestad, Ulrich Drepper,
	Matti Aarnio, Hugh Dickins, Dave Jones, Ingo Molnar, linux-kernel

On Wed, Dec 18, 2002 at 02:57:11PM -0800, Linus Torvalds wrote:
> 
> Btw, I'm pushing what looks like the "final" version of sysenter/sysexit 
> for now. There may be bugs left, but all the known issues are resolved:
> 
>  - single-stepping over the system call now works. It doesn't actually see 
>    all of the user-mode instructions, since the fast system call interface 
>    does not lend itself well to restoring "TF" in eflags on return, but 
>    the trampoline code saves and restores the flags, so you will be able 
>    to step over the important bits.
> 
>    (ptrace also doesn't actually allow you to look at the instruction 
>    contents in high memory, so gdb won't see the instructions in the
>    user-mode fast system call trampoline even when it can single-step
>    them, and I don't think I'll bother to fix it up).

This worries me.  I'm no x86 guru, but I assume the trampoline's setting of
the TF bit will kick in right around the following 'ret'.  So the
application will stop and GDB won't be able to read the instruction at
PC.  I bet that makes it unhappy.

Shouldn't be that hard to fix this up in ptrace, though.

-- 
Daniel Jacobowitz
MontaVista Software                         Debian GNU/Linux Developer


* Re: Intel P6 vs P7 system call performance
  2002-12-20  0:53         ` Daniel Jacobowitz
@ 2002-12-20  1:47           ` Linus Torvalds
  2002-12-20  2:37             ` Daniel Jacobowitz
  0 siblings, 1 reply; 268+ messages in thread
From: Linus Torvalds @ 2002-12-20  1:47 UTC (permalink / raw)
  To: Daniel Jacobowitz
  Cc: Jamie Lokier, H. Peter Anvin, Terje Eggestad, Ulrich Drepper,
	Matti Aarnio, Hugh Dickins, Dave Jones, Ingo Molnar, linux-kernel



On Thu, 19 Dec 2002, Daniel Jacobowitz wrote:
> >
> >    (ptrace also doesn't actually allow you to look at the instruction
> >    contents in high memory, so gdb won't see the instructions in the
> >    user-mode fast system call trampoline even when it can single-step
> >    them, and I don't think I'll bother to fix it up).
>
> This worries me.  I'm no x86 guru, but I assume the trampoline's setting of
> the TF bit will kick in right around the following 'ret'.  So the
> application will stop and GDB won't be able to read the instruction at
> PC.  I bet that makes it unhappy.

It doesn't make gdb all that unhappy, everything seems to work fine
despite the fact that gdb decides it just can't display the instructions.

> Shouldn't be that hard to fix this up in ptrace, though.

Or even in user space, since the high pages are all the same in all
processes (so gdb doesn't even strictly need ptrace, it can just read its
_own_ codespace there). But yeah, we could make ptrace aware of the magic
pages.

		Linus



* Re: Intel P6 vs P7 system call performance
  2002-12-20  1:47           ` Linus Torvalds
@ 2002-12-20  2:37             ` Daniel Jacobowitz
  0 siblings, 0 replies; 268+ messages in thread
From: Daniel Jacobowitz @ 2002-12-20  2:37 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jamie Lokier, H. Peter Anvin, Terje Eggestad, Ulrich Drepper,
	Matti Aarnio, Hugh Dickins, Dave Jones, Ingo Molnar, linux-kernel

On Thu, Dec 19, 2002 at 05:47:55PM -0800, Linus Torvalds wrote:
> 
> 
> On Thu, 19 Dec 2002, Daniel Jacobowitz wrote:
> > >
> > >    (ptrace also doesn't actually allow you to look at the instruction
> > >    contents in high memory, so gdb won't see the instructions in the
> > >    user-mode fast system call trampoline even when it can single-step
> > >    them, and I don't think I'll bother to fix it up).
> >
> > This worries me.  I'm no x86 guru, but I assume the trampoline's setting of
> > the TF bit will kick in right around the following 'ret'.  So the
> > application will stop and GDB won't be able to read the instruction at
> > PC.  I bet that makes it unhappy.
> 
> It doesn't make gdb all that unhappy, everything seems to work fine
> despite the fact that gdb decides it just can't display the instructions.

Yeah; sometimes it will generate that error in the middle of
single-stepping over something larger, though, and it breaks you out of
whatever you were doing.

> > Shouldn't be that hard to fix this up in ptrace, though.
> 
> Or even in user space, since the high pages are all the same in all
> processes (so gdb doesn't even strictly need ptrace, it can just read its
> _own_ codespace there). But yeah, we could make ptrace aware of the magic
> pages.

I need to get back to my scheduled ptrace cleanups.  Meanwhile, here's
a patch to do this.  Completely untested, like all good patches; but
it's pretty simple.

===== arch/i386/kernel/ptrace.c 1.17 vs edited =====
--- 1.17/arch/i386/kernel/ptrace.c	Wed Nov 27 18:14:11 2002
+++ edited/arch/i386/kernel/ptrace.c	Thu Dec 19 21:33:37 2002
@@ -21,6 +21,7 @@
 #include <asm/processor.h>
 #include <asm/i387.h>
 #include <asm/debugreg.h>
+#include <asm/fixmap.h>
 
 /*
  * does not yet catch signals sent when the child dies.
@@ -196,6 +197,18 @@
 	case PTRACE_PEEKDATA: {
 		unsigned long tmp;
 		int copied;
+
+		/* Allow ptrace to read from the vsyscall page.  */
+		if (addr >= FIXADDR_START && addr < FIXADDR_TOP &&
+		    (addr & 3) == 0) {
+			int idx = virt_to_fix (addr);
+			if (idx == FIX_VSYSCALL) {
+				tmp = * (unsigned long *) addr;
+				ret = put_user (tmp, (unsigned long *) data);
+				break;
+			}
+		}
+			
 
 		copied = access_process_vm(child, addr, &tmp, sizeof(tmp), 0);
 		ret = -EIO;


-- 
Daniel Jacobowitz
MontaVista Software                         Debian GNU/Linux Developer


* Re: Intel P6 vs P7 system call performance
  2002-12-18 22:28   ` Jamie Lokier
  2002-12-18 22:37     ` Linus Torvalds
@ 2002-12-18 22:39     ` H. Peter Anvin
  1 sibling, 0 replies; 268+ messages in thread
From: H. Peter Anvin @ 2002-12-18 22:39 UTC (permalink / raw)
  To: Jamie Lokier
  Cc: Terje Eggestad, Linus Torvalds, Ulrich Drepper, Matti Aarnio,
	Hugh Dickins, Dave Jones, Ingo Molnar, linux-kernel

Jamie Lokier wrote:
> H. Peter Anvin wrote:
> 
>>Terje Eggestad wrote:
>>
>>>fd = open("/dev/vsyscall", );
>>>_vsyscall = mmap(NULL, getpagesize(),  PROT_READ|PROT_EXEC, MAP_SHARED,
>>>fd, 0);
>>
>>Very ugly -- then the application has to do indirect calls.
> 
> 
> No it doesn't.
> 
> The application, or library, would map the vsyscall page to an address
> in its own data section.  This means that position-independent code
> can do vsyscalls without any relocations, and hence without dirtying
> its own caller pages.
> 

Oh, I see... you don't really mean NULL in the first argument :)

This has one additional advantage: an application which wants to
override vsyscalls can simply map something instead of the kernel page,
and UML can present its own vsyscall page.

> In some ways this is better than the 0xfffe0000 address: _that_
> requires position-independent code to do indirect calls to the
> absolute address, or to dirty its caller pages.
> 
> That said, you always need the page at 0xfffe0000 mapped anyway, so
> that sysexit can jump to a fixed address (which is fastest).

That's a possibility, or if the task_struct contains the desired return
address for a particular process that might also work -- it's just a GPR
after all.

	-hpa





* Re: Intel P6 vs P7 system call performance
@ 2003-01-10 18:08 Gabriel Paubert
  0 siblings, 0 replies; 268+ messages in thread
From: Gabriel Paubert @ 2003-01-10 18:08 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ingo Molnar, Jamie Lokier, Ulrich Drepper, davej, linux-kernel

Linus Torvalds wrote:
> It shouldn't matter.
>
> NT is only tested by "iret", and if somebody sets NT in user space they
> get exactly what they deserve.

Indeed. I realized after I sent the previous mail that I had missed the
flags save/restore in switch_to :-(

Still, does this mean that there is some micro optimization opportunity in
the lcall7/lcall27 handlers to remove the popfl? After all TF is now
handled by some magic in do_debug unless I miss (again) something,
NT has become irrelevant, and cld in SAVE_ALL takes care of DF.

In short something like the following (I just love patches which only
remove code):

===== entry.S 1.51 vs edited =====
--- 1.51/arch/i386/kernel/entry.S	Mon Jan  6 04:54:58 2003
+++ edited/entry.S	Fri Jan 10 18:57:42 2003
@@ -156,16 +156,6 @@
 	movl %edx,EIP(%ebp)	# Now we move them to their "normal" places
 	movl %ecx,CS(%ebp)	#

-	#
-	# Call gates don't clear TF and NT in eflags like
-	# traps do, so we need to do it ourselves.
-	# %eax already contains eflags (but it may have
-	# DF set, clear that also)
-	#
-	andl $~(DF_MASK | TF_MASK | NT_MASK),%eax
-	pushl %eax
-	popfl
-
 	andl $-8192, %ebp	# GET_THREAD_INFO
 	movl TI_EXEC_DOMAIN(%ebp), %edx	# Get the execution domain
 	call *4(%edx)		# Call the lcall7 handler for the domain


>>For example, set NT and then execute sysenter with garbage in %eax, the
>>kernel will try to return (-ENOSYS) with iret and kill the task. As long
>>as it only allows a task to kill itself, it's not a big deal. But NT is
>>not cleared across task switches unless I miss something, and that looks
>>very dangerous.
>
>
> It _is_ cleared by task-switching these days. Or rather, it's saved and
> restored, so the original NT setter will get it restored when resumed.

Yeah, sorry for the noise.

>
>
>>I'm no Ingo, unfortunately, but you'll need at least the following patch
>>(the second hunk is only a typo fix) to the iret exception recovery code,
>>which used push and pops to get the smallest possible code size.
>
>
> Good job.

That was too easy since I did originally suggest the push/pop sequence :-)

	Gabriel.
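
For readers who do not have the EFLAGS layout memorised, the `andl` that the patch above removes can be written in C as below. The mask names follow the ones used in the patch; the bit positions are the architectural ones (TF is bit 8, DF is bit 10, NT is bit 14). The function name is made up, and this is only a sketch of the bit arithmetic, not kernel code.

```c
#include <stdint.h>

/* EFLAGS bits named in the removed assembly. */
#define TF_MASK 0x00000100u   /* trap flag: single-step */
#define DF_MASK 0x00000400u   /* direction flag for string instructions */
#define NT_MASK 0x00004000u   /* nested task: only consulted by iret */

/* C equivalent of: andl $~(DF_MASK | TF_MASK | NT_MASK),%eax */
uint32_t clear_call_gate_flags(uint32_t eflags)
{
    return eflags & ~(DF_MASK | TF_MASK | NT_MASK);
}
```

The argument in the mail is that each of the three bits is now dealt with elsewhere (do_debug for TF, the flags save/restore in switch_to for NT, cld in SAVE_ALL for DF), which is what would make this masking redundant.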





* Re: Intel P6 vs P7 system call performance
@ 2002-12-30 13:06 Manfred Spraul
  2002-12-30 14:54 ` Andi Kleen
  0 siblings, 1 reply; 268+ messages in thread
From: Manfred Spraul @ 2002-12-30 13:06 UTC (permalink / raw)
  To: Dave Jones; +Cc: linux-kernel

DaveJ wrote:

>On Sat, Dec 28, 2002 at 10:37:06PM +0200, Ville Herva wrote:
>
> > > SYSCALL is AMD.  SYSENTER is Intel, and is likely to be significantly
> > Now that Linus has killed the dragon and everybody seems happy with the
> > shiny new SYSENTER code, let just add one more stupid question to this
> > thread: has anyone made benchmarks on SYSCALL/SYSENTER/INT80 on Athlon? Is
> > SYSCALL worth doing separately for Athlon (and perhaps Hammer/32-bit mode)?
>
>It's something I wondered about too. Even if it isn't a win for K7,
>it's possible that the K6 family may benefit from SYSCALL support.
>Maybe even the K5 if it was around that early ? (too lazy to check pdf's)
>  
>

I looked at SYSCALL once, and noticed some problems:

- it doesn't even load ESP with a kernel value, a task gate for NMI is 
mandatory.
- SMP support is only possible with a per-cpu entry point with 
(boot-time) fixups to the address where the entry point can find the 
kernel stack.
- The AMD docs contain one odd sentence:
"The CS and SS registers must not be modified by the operating system 
between the execution of the SYSCALL and the corresponding SYSRET 
instruction".
Is SYSCALL+iretd permitted? That's needed for execve, iopl, task 
switches, signal delivery.
What about interrupts during SYSCALLs? NMI to taskgate?

Either that sentence is just wrong, or SYSCALL is unusable.

It's not supported by the K5 cpu:
http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/20734.pdf

--
    Manfred



* Re: Intel P6 vs P7 system call performance
  2002-12-30 13:06 Manfred Spraul
@ 2002-12-30 14:54 ` Andi Kleen
  0 siblings, 0 replies; 268+ messages in thread
From: Andi Kleen @ 2002-12-30 14:54 UTC (permalink / raw)
  To: Manfred Spraul; +Cc: linux-kernel

Manfred Spraul <manfred@colorfullife.com> writes:

> - The AMD docs contain one odd sentence:
> "The CS and SS registers must not be modified by the operating system
> between the execution of the SYSCALL and the corresponding SYSRET
> instruction".

As I understand it:

SYSCALL does not actually load the new CS/SS from the GDT,
but just sets the internal base/limit/descriptor valid registers to
"flat" values. SYSRET undoes this. When the SYSRET is executed the
selector should still be the same, otherwise the resulting state may not be
exactly the same as when SYSCALL happened.

But if you make sure that CS/SS have the same state on SYSRET then
everything should be ok. This means a context switch to a process with
possibly different CS/SS values should be ok.

The GDT is guaranteed to stay constant and SYSCALL forbids use of an
LDT for SS/CS.

I have not actually tried this on an Athlon, but on x86-64 with 
entering long mode it works (including context switches to processes with
different segments etc.)

To add to the confusion, there are three slightly different
SYSCALL flavours:

K6 (unusable iirc), Athlon/Hammer with 32bit OS 
(would work, but they have SYSENTER too so it makes sense to just share code 
with Intel), Hammer with 64bit OS (working, has to use SYSCALL from both
32bit and 64bit processes)

They are different in what registers they clobber and how EFLAGS is handled
and some other details.

On x86-64 SYSCALL is the only and native system call entry instruction
for 64bit processes.  The only reason to use SYSCALL from 32bit programs is
that on x86-64 SYSENTER from 32bit processes to 64bit kernels is not
supported, so the 2.5.53 x86-64 kernel implements the AT_SYSINFO vsyscall
page using SYSCALL.

> Is SYSCALL+iretd permitted? That's needed for execve, iopl, task

Yes, it works.

> switches, signal delivery.
> What about interrupts during SYSCALLs? 

SYSCALL in 32bit mode turns off IF on entry (in long mode it is configurable
using an MSR).

-Andi


* RE: Intel P6 vs P7 system call performance
@ 2002-12-22 15:45 Nakajima, Jun
  0 siblings, 0 replies; 268+ messages in thread
From: Nakajima, Jun @ 2002-12-22 15:45 UTC (permalink / raw)
  To: Mikael Pettersson, mingo, torvalds; +Cc: drepper, linux-kernel

Correct. Please look at Table B-1. Most MSRs are shared, but some are
unique to each logical processor in order to provide the x86
architectural state. The SYSENTER MSRs and the Machine Check register
save state (IA32_MCG_XXX), for example, are unique.

Jun

> -----Original Message-----
> From: Mikael Pettersson [mailto:mikpe@csd.uu.se]
> Sent: Sunday, December 22, 2002 4:34 AM
> To: mingo@elte.hu; torvalds@transmeta.com
> Cc: drepper@redhat.com; Nakajima, Jun; linux-kernel@vger.kernel.org
> Subject: Re: Intel P6 vs P7 system call performance
> 
> On Sun, 22 Dec 2002 11:23:08 +0100 (CET), Ingo Molnar wrote:
> >while reviewing the sysenter trampoline code i started wondering about the
> >HT case. Dont HT boxes share the MSRs between logical CPUs? This pretty
> >much breaks the concept of per-logical-CPU sysenter trampolines. It also
> >makes context-switch time sysenter MSR writing impossible, so i really
> >hope this is not the case.
> 
> Some MSRs are shared, some aren't. One must always check this in
> the IA32 Volume 3 manual. The three SYSENTER MSRs are not shared.
> 
> However, no-one has yet proven that writing to these in the context
> switch path has acceptable performance -- remember, there is _no_
> a priori reason to assume _anything_ about performance on P4s,
> you really do need to measure things before taking design decisions.
> 
> Manfred had a version with fixed MSR values and the varying data
> in memory. Maybe that's actually faster.


* Re: Intel P6 vs P7 system call performance
@ 2002-12-22 12:33 Mikael Pettersson
  2002-12-22 16:00 ` Jamie Lokier
  0 siblings, 1 reply; 268+ messages in thread
From: Mikael Pettersson @ 2002-12-22 12:33 UTC (permalink / raw)
  To: mingo, torvalds; +Cc: drepper, jun.nakajima, linux-kernel

On Sun, 22 Dec 2002 11:23:08 +0100 (CET), Ingo Molnar wrote:
>while reviewing the sysenter trampoline code i started wondering about the
>HT case. Dont HT boxes share the MSRs between logical CPUs? This pretty
>much breaks the concept of per-logical-CPU sysenter trampolines. It also
>makes context-switch time sysenter MSR writing impossible, so i really
>hope this is not the case.

Some MSRs are shared, some aren't. One must always check this in
the IA32 Volume 3 manual. The three SYSENTER MSRs are not shared.

However, no-one has yet proven that writing to these in the context
switch path has acceptable performance -- remember, there is _no_
a priori reason to assume _anything_ about performance on P4s,
you really do need to measure things before taking design decisions.

Manfred had a version with fixed MSR values and the varying data
in memory. Maybe that's actually faster.


* Re: Intel P6 vs P7 system call performance
  2002-12-22 12:33 Mikael Pettersson
@ 2002-12-22 16:00 ` Jamie Lokier
  0 siblings, 0 replies; 268+ messages in thread
From: Jamie Lokier @ 2002-12-22 16:00 UTC (permalink / raw)
  To: Mikael Pettersson; +Cc: mingo, drepper, jun.nakajima, linux-kernel

Mikael Pettersson wrote:
> Manfred had a version with fixed MSR values and the varying data
> in memory. Maybe that's actually faster.

The stack pointer is already changed on context switches since Ingo
changed the kernel to use per-cpu TSS segments.

Manfred's code used a tiny stack (without a valid task struct,
a different method of recovering than Linus' code).  You can get that
stack down to 6 words, (3 for debug trap, 3 more for nested NMI).

Combining these leads to an IMHO beautiful hack, which does work btw:
the 6 words fit into unused parts of the per-CPU TSS (just before
tss->es).  The MSR has a constant value:

	wrmsr(MSR_IA32_SYSENTER_ESP, (u32) &tss->es, 0);

I found the fastest sysenter entry code looks like this:

sysenter_entry_point:
	cld			# Faster before sti.
	sti			# Re-enable interrupts after next insn.
	movl	-68(%esp),%esp	# Load per-CPU stack from tss->esp0.

with appropriate fixups at the start of the NMI and debug trap handlers.

enjoy,
-- Jamie
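
The -68 in the entry code above can be checked against the 32-bit hardware TSS layout. The C sketch below is only a model: struct hw_tss lists the fields in the order Intel documents them (with the 16-bit selector fields widened to include their reserved halves), it is not the kernel's own structure, and the function name is invented for the example.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* 32-bit hardware TSS, fields in documented order (104 bytes total). */
struct hw_tss {
    uint32_t back_link;
    uint32_t esp0, ss0;
    uint32_t esp1, ss1;
    uint32_t esp2, ss2;
    uint32_t cr3, eip, eflags;
    uint32_t eax, ecx, edx, ebx;
    uint32_t esp, ebp, esi, edi;
    uint32_t es, cs, ss, ds, fs, gs;
    uint32_t ldt;
    uint32_t trace_and_iomap_base;
};

/* Distance from SYSENTER_ESP (= &tss->es) back to tss->esp0. */
#define ES_TO_ESP0 \
    (offsetof(struct hw_tss, es) - offsetof(struct hw_tss, esp0))

/* Model of "movl -68(%esp),%esp" with %esp preloaded to &tss->es. */
uint32_t sysenter_load_stack(const struct hw_tss *tss)
{
    const unsigned char *sysenter_esp =
        (const unsigned char *)tss + offsetof(struct hw_tss, es);
    uint32_t esp0;

    memcpy(&esp0, sysenter_esp - 68, sizeof esp0);
    return esp0;
}
```

In this layout the six words directly below es are the edx, ebx, esp, ebp, esi and edi save slots, which the processor only touches on a hardware task switch; that would be consistent with the "unused parts" the mail refers to.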


* Re: Intel P6 vs P7 system call performance
@ 2002-12-19 18:46 billyrose
  0 siblings, 0 replies; 268+ messages in thread
From: billyrose @ 2002-12-19 18:46 UTC (permalink / raw)
  To: bart; +Cc: root, linux-kernel


> Not true. A ret(urn) is (sort of) equivalent to 'pop %eip'. The above
> code would actually jump to address 0xfffff000, but probably be slow
> since it confuses the branch prediction.
>
>
>Bart

that being the case, then the original code that Linus put forth:

        pushl $0xfffff000
        call *(%esp)
        add $4,%esp

would be the way to go as it is highly readable. actually, the code at
0xfffff000 could issue a ret $4 and eliminate the add after the call.
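
To make the stack bookkeeping of the two variants concrete, here is a toy C model that tracks only %eip and %esp. Everything in it (the toy_* names, the made-up user addresses, the 64-byte stack) is an assumption for illustration; it models the instruction semantics and says nothing about speed.

```c
#include <stdint.h>

#define STACK_BYTES 64u

/* Toy model: just enough state to follow %eip and %esp. */
struct toy_cpu {
    uint32_t eip;
    uint32_t esp;                      /* byte offset into stack[] */
    uint32_t stack[STACK_BYTES / 4];
};

void toy_push(struct toy_cpu *c, uint32_t v)
{
    c->esp -= 4;
    c->stack[c->esp / 4] = v;
}

uint32_t toy_pop(struct toy_cpu *c)
{
    uint32_t v = c->stack[c->esp / 4];
    c->esp += 4;
    return v;
}

/* call *(%esp): read the target from the top of the stack, then push
 * the return address and jump. */
void toy_call_via_esp(struct toy_cpu *c, uint32_t return_addr)
{
    uint32_t target = c->stack[c->esp / 4];
    toy_push(c, return_addr);
    c->eip = target;
}

/* ret $imm: pop the return address, then drop imm more bytes. */
void toy_ret(struct toy_cpu *c, uint32_t imm)
{
    c->eip = toy_pop(c);
    c->esp += imm;
}

/* Run the sequence.  Returns how far %esp ends up from its start value and
 * stores the address the trampoline returned to in *return_eip. */
int32_t toy_syscall_sequence(int callee_pops, uint32_t *return_eip)
{
    struct toy_cpu c = { .eip = 0x08048000u, .esp = STACK_BYTES };
    const uint32_t after_call = 0x08048008u;  /* made-up address */

    toy_push(&c, 0xfffff000u);                /* pushl $0xfffff000 */
    toy_call_via_esp(&c, after_call);         /* call *(%esp)      */
    /* ...the trampoline at 0xfffff000 would run here... */
    if (callee_pops) {
        toy_ret(&c, 4);                       /* ret $4            */
    } else {
        toy_ret(&c, 0);                       /* ret               */
        c.esp += 4;                           /* add $4,%esp       */
    }
    *return_eip = c.eip;
    return (int32_t)(c.esp - STACK_BYTES);
}
```

Both variants leave %esp where it started; the `ret $4` form just moves the cleanup from the caller into the trampoline.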



* Re: Intel P6 vs P7 system call performance
@ 2002-12-19 16:10 billyrose
  0 siblings, 0 replies; 268+ messages in thread
From: billyrose @ 2002-12-19 16:10 UTC (permalink / raw)
  To: root; +Cc: linux-kernel

Richard B. Johnson wrote:

> Because the number pushed onto the stack is a displacement, not
> an address, i.e., -4096. To have the address act as an address,
> you need to load a full-pointer, i.e. SEG:OFFSET (like the old
> 16-bit days). The offset is 32-bits and the segment is whatever
> the kernel has set up for __USER_CS (0x23). All the 'near' calls
> are calls to a signed displacement, same for jumps.

call's and jmp's use displacement, ret's are _always_ absolute.
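
To put numbers on that, here is a small sketch in C of the address
arithmetic only (nothing is executed at those addresses).  The helper
names and the example call site 0x08048000 are made up for illustration:

```c
#include <assert.h>
#include <stdint.h>

/* "call rel32" is opcode 0xE8 plus a 4-byte displacement: 5 bytes. */
#define CALL_REL32_LEN 5u

/* Displacement that must be encoded so that a near call located at
 * call_addr reaches target.  It is relative to the *next* instruction
 * and wraps modulo 2^32. */
static uint32_t call_rel32(uint32_t call_addr, uint32_t target)
{
	uint32_t rel = target - (call_addr + CALL_REL32_LEN);

	/* Sanity check: decoding the displacement gives the target back. */
	assert((uint32_t)(call_addr + CALL_REL32_LEN + rel) == target);
	return rel;
}

/* Where a near call with the given displacement actually lands. */
static uint32_t call_target(uint32_t call_addr, uint32_t rel32)
{
	return call_addr + CALL_REL32_LEN + rel32;
}

/* A near ret pops an absolute address: the destination is the popped
 * value itself, wherever the ret instruction happens to live. */
static uint32_t ret_target(uint32_t popped_value)
{
	return popped_value;
}
```

With a call at 0x08048000, using 0xfffff000 as the displacement lands at
0x08047005, while the displacement that really reaches 0xfffff000 is
0xf7fb6ffb.  The pushl/ret pair needs no such arithmetic.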

^ permalink raw reply	[flat|nested] 268+ messages in thread

* Re: Intel P6 vs P7 system call performance
@ 2002-12-19 15:20 bart
  0 siblings, 0 replies; 268+ messages in thread
From: bart @ 2002-12-19 15:20 UTC (permalink / raw)
  To: root; +Cc: linux-kernel, billyrose

On 19 Dec, Richard B. Johnson wrote:
> On Thu, 19 Dec 2002 billyrose@billyrose.net wrote:

>> long_call:
>>         pushl $0xfffff000
>>         ret
>> 
> 
> Because the number pushed onto the stack is a displacement, not
> an address, i.e., -4096. To have the address act as an address,

Not true. A ret(urn) is (sort of) equivalent to 'pop %eip'. The above
code would actually jump to address 0xfffff000, but probably be slow
since it confuses the branch prediction.

Bart

-- 
Bart Hartgers - TUE Eindhoven 
http://plasimo.phys.tue.nl/bart/contact.html

^ permalink raw reply	[flat|nested] 268+ messages in thread

* Re: Intel P6 vs P7 system call performance
@ 2002-12-19 14:57 bart
  0 siblings, 0 replies; 268+ messages in thread
From: bart @ 2002-12-19 14:57 UTC (permalink / raw)
  To: billyrose; +Cc: root, linux-kernel

On 19 Dec, billyrose@billyrose.net wrote:
> long_call:
>         pushl $0xfffff000
>         ret
> 

A ret(urn) to an address that wasn't put on the stack by a call
severely confuses the branch prediction on many processors.


-- 
Bart Hartgers - TUE Eindhoven 
http://plasimo.phys.tue.nl/bart/contact.html

^ permalink raw reply	[flat|nested] 268+ messages in thread

* Re: Intel P6 vs P7 system call performance
@ 2002-12-19 14:40 billyrose
  2002-12-19 15:11 ` Richard B. Johnson
  0 siblings, 1 reply; 268+ messages in thread
From: billyrose @ 2002-12-19 14:40 UTC (permalink / raw)
  To: root; +Cc: linux-kernel

Richard B. Johnson wrote:

> The target, i.e., the label 'goto' would be the reserved page for the
> system call. The whole purpose was to minimize the number of CPU cycles
> necessary to call 0xfffff000 and return. The system call does not have to
> issue a 'far' return, it can do anything it requires. The page at
> 0xfffff000 is mapped into every process and is in that process CS space
> already.

that being the case, why push %cs and reload it without reason as the
code is mapped into every process?

therefore, would it not suffice to use:

        ...
        long_call(); //call to $0xfffff000 via near ret
        //code at $0xfffff000 returns directly here when a ret is issued
        ...

long_call:
        pushl $0xfffff000
        ret


^ permalink raw reply	[flat|nested] 268+ messages in thread

* Re: Intel P6 vs P7 system call performance
  2002-12-19 14:40 billyrose
@ 2002-12-19 15:11 ` Richard B. Johnson
  0 siblings, 0 replies; 268+ messages in thread
From: Richard B. Johnson @ 2002-12-19 15:11 UTC (permalink / raw)
  To: billyrose; +Cc: linux-kernel

On Thu, 19 Dec 2002 billyrose@billyrose.net wrote:

> Richard B. Johnson wrote:
> 
> > The target, i.e., the label 'goto' would be the reserved page for the
> > system call. The whole purpose was to minimize the number of CPU cycles
> > necessary to call 0xfffff000 and return. The system call does not have to
> > issue a 'far' return, it can do anything it requires. The page at
> > 0xfffff000 is mapped into every process and is in that process CS space
> > already.
> 
> that being the case, why push %cs and reload it without reason as the
> code is mapped into every process?
> 
> therefore, would it not suffice to use:
> 
>         ...
>         long_call(); //call to $0xfffff000 via near ret
>         //code at $0xfffff000 returns directly here when a ret is issued
>         ...
> 
> long_call:
>         pushl $0xfffff000
>         ret
> 

Because the number pushed onto the stack is a displacement, not
an address, i.e., -4096. To have the address act as an address,
you need to load a full-pointer, i.e. SEG:OFFSET (like the old
16-bit days). The offset is 32-bits and the segment is whatever
the kernel has set up for __USER_CS (0x23). All the 'near' calls
are calls to a signed displacement, same for jumps.

It would be nice if you could just do 	call	$0xfffff000,
the problem is that the 'call' expects a displacement, usually
determined (fixed-up) by the linker. So, in this case, you
end up calling some code that exists 4095 bytes before the
call instruction NotGood(tm).

So the whole idea of this exercise is to do the same thing as
	call far ptr 0x23:0xfffff000 (Intel syntax), without
requiring a fixup, but minimizing the instructions and disruption
due to reloading a segment.

Cheers,
Dick Johnson
Penguin : Linux version 2.4.18 on an i686 machine (797.90 BogoMips).
Why is the government concerned about the lunatic fringe? Think about it.

^ permalink raw reply	[flat|nested] 268+ messages in thread

* Re: Intel P6 vs P7 system call performance
@ 2002-12-19 13:55 bart
  2002-12-19 19:37 ` Linus Torvalds
  0 siblings, 1 reply; 268+ messages in thread
From: bart @ 2002-12-19 13:55 UTC (permalink / raw)
  To: davej
  Cc: torvalds, lk, hpa, terje.eggestad, drepper, matti.aarnio, hugh,
	mingo, linux-kernel

On 19 Dec, Dave Jones wrote:
> On Thu, Dec 19, 2002 at 02:22:36PM +0100, bart@etpmod.phys.tue.nl wrote:
>  > > However, there's another issue, namely process startup cost. I personally 
>  > > want it to be as light as at all possible. I hate doing an "strace" on 
>  > > user processes and seeing tons and tons of crapola showing up. Just for 
>  > So why not map the magic page at 0xffffe000 at some other address as
>  > well? 
>  > Static binaries can just directly jump/call into the magic page.
> 
> .. and explode nicely when you try to run them on an older kernel
> without the new syscall magick. This is what Linus' first
> proof-of-concept code did.


True, but unless I really don't get it, compatibility of a new static
binary with an old kernel is going to break anyway. 
My point was that the double-mapped page trick adds no overhead in the
case of a static binary, and just one extra mmap in case of a shared
binary.

Bart

> 
> 		Dave
> 

-- 
Bart Hartgers - TUE Eindhoven 
http://plasimo.phys.tue.nl/bart/contact.html

^ permalink raw reply	[flat|nested] 268+ messages in thread

* Re: Intel P6 vs P7 system call performance
  2002-12-19 13:55 bart
@ 2002-12-19 19:37 ` Linus Torvalds
  2002-12-19 22:10   ` Jamie Lokier
  2002-12-20 10:08   ` Ulrich Drepper
  0 siblings, 2 replies; 268+ messages in thread
From: Linus Torvalds @ 2002-12-19 19:37 UTC (permalink / raw)
  To: bart
  Cc: davej, lk, hpa, terje.eggestad, drepper, matti.aarnio, hugh,
	mingo, linux-kernel

On Thu, 19 Dec 2002 bart@etpmod.phys.tue.nl wrote:
> 
> True, but unless I really don't get it, compatibility of a new static
> binary with an old kernel is going to break anyway. 

NO.

The current code in 2.5.x is perfectly able to be 100% compatible with 
binaries even on old kernels. This whole discussion is _totally_ 
pointless. I solved all the glibc problems early on, and Uli is already 
happy with the interfaces, and they work fine for old kernels that don't 
have a clue about the new system call interfaces.

WITHOUT any new magic system calls.

WITHOUT any stupid SIGSEGV tricks.

WITHOUT any silly mmap()'s on magic files.

> My point was that the double-mapped page trick adds no overhead in the
> case of a static binary, and just one extra mmap in case of a shared
> binary.

For _zero_ gain.  The jump to the library address has to be indirect 
anyway, and glibc has several places to put the information without any 
mmap's or anything like that.

		Linus

^ permalink raw reply	[flat|nested] 268+ messages in thread

* Re: Intel P6 vs P7 system call performance
  2002-12-19 19:37 ` Linus Torvalds
@ 2002-12-19 22:10   ` Jamie Lokier
  2002-12-19 22:16     ` H. Peter Anvin
  2002-12-19 22:22     ` Linus Torvalds
  2002-12-20 10:08   ` Ulrich Drepper
  1 sibling, 2 replies; 268+ messages in thread
From: Jamie Lokier @ 2002-12-19 22:10 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: bart, davej, hpa, terje.eggestad, drepper, matti.aarnio, hugh,
	mingo, linux-kernel

Linus Torvalds wrote:
> For _zero_ gain.  The jump to the library address has to be indirect 
> anyway, and glibc has several places to put the information without any 
> mmap's or anything like that.

This is not true (but your overall point is still correct).

The jump to the magic page can be direct in statically linked code, or
in the executable itself.  The assembler and linker have no problem
with this, I have just tried it.

What people (not Linus) have said about static binaries is moot,
because a static binary is linked at an absolute address itself, and
so can use the standard "call relative" instruction directly to the
fixed magic page address.

Dynamic binaries or libraries can use the indirect call or relocate
the calls at load time, or if they _really_ want a magic page at a
position relative to the library, they can just _copy_ the magic page
from 0xffffe000.  It is not all that magic.

-- Jamie

^ permalink raw reply	[flat|nested] 268+ messages in thread

* Re: Intel P6 vs P7 system call performance
  2002-12-19 22:10   ` Jamie Lokier
@ 2002-12-19 22:16     ` H. Peter Anvin
  2002-12-19 22:22     ` Linus Torvalds
  1 sibling, 0 replies; 268+ messages in thread
From: H. Peter Anvin @ 2002-12-19 22:16 UTC (permalink / raw)
  To: Jamie Lokier
  Cc: Linus Torvalds, bart, davej, terje.eggestad, drepper,
	matti.aarnio, hugh, mingo, linux-kernel

Jamie Lokier wrote:
> 
> Dynamic binaries or libraries can use the indirect call or relocate
> the calls at load time, or if they _really_ want a magic page at a
> position relative to the library, they can just _copy_ the magic page
> from 0xffffe000.  It is not all that magic.
> 

That would make it impossible for the kernel to have kernel-controlled
data on that page|other page though...

I personally would like to see some better interface than mmap()
/proc/self/mem in order to alias pages, anyway.  We could use a
MAP_ALIAS flag in mmap() for this (where the fd would be ignored, but
the offset would matter.)

	-hpa

^ permalink raw reply	[flat|nested] 268+ messages in thread

* Re: Intel P6 vs P7 system call performance
  2002-12-19 22:10   ` Jamie Lokier
  2002-12-19 22:16     ` H. Peter Anvin
@ 2002-12-19 22:22     ` Linus Torvalds
  2002-12-19 22:26       ` H. Peter Anvin
  2002-12-22 11:08       ` James H. Cloos Jr.
  1 sibling, 2 replies; 268+ messages in thread
From: Linus Torvalds @ 2002-12-19 22:22 UTC (permalink / raw)
  To: Jamie Lokier
  Cc: bart, davej, hpa, terje.eggestad, drepper, matti.aarnio, hugh,
	mingo, linux-kernel

On Thu, 19 Dec 2002, Jamie Lokier wrote:
> Linus Torvalds wrote:
> > For _zero_ gain.  The jump to the library address has to be indirect 
> > anyway, and glibc has several places to put the information without any 
> > mmap's or anything like that.
> 
> This is not true (but your overall point is still correct).

Go back and read the postings by Uli.

Uli's suggested glibc approach is to just put the magic system call 
address (which glibc gets from the AT_SYSINFO elf aux table entry) into 
the per-thread TLS area, which is always pointed to by %gs anyway.

THIS WORKS WITH ALL DSO'S WITHOUT ANY GAMES, ANY MMAP'S, ANY RELINKING, OR
ANY EXTRA WORK AT ALL!

The system call entry becomes a simple

	call *%gs:constant-offset

No mmap. No magic system calls. No relinking. Not _nothing_. One 
instruction, that's it. 

See for example Uli's posting in this thread from the day before
yesterday, message ID <3DFF6D4B.3060107@redhat.com>. So please stop 
arguing about any extra work, because none is needed.
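
As an illustration of the lookup side, here is a minimal sketch in C of
walking an auxiliary vector for the AT_SYSINFO tag.  The tag values match
<elf.h> (AT_NULL is 0, AT_SYSINFO is 32); struct aux_entry and
aux_lookup() are made-up names, and this is not glibc's actual code:

```c
#include <assert.h>
#include <stddef.h>

/* Tag values as defined in <elf.h>. */
#define AUX_AT_NULL	0	/* terminates the vector */
#define AUX_AT_SYSINFO	32	/* system call entry point */

/* Same shape as one auxiliary vector entry: a tag and a value. */
struct aux_entry {
	unsigned long type;
	unsigned long value;
};

/* Return the value stored for tag `type`, or `fallback` when the tag
 * is absent, as it would be on a kernel that predates AT_SYSINFO. */
static unsigned long aux_lookup(const struct aux_entry *auxv,
				unsigned long type, unsigned long fallback)
{
	assert(auxv != NULL);
	for (; auxv->type != AUX_AT_NULL; auxv++)
		if (auxv->type == type)
			return auxv->value;
	return fallback;
}
```

Whatever comes back is what would be stored at the constant %gs offset.
Much later glibc versions expose the same lookup to programs as
getauxval() in <sys/auxv.h>.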

		Linus

^ permalink raw reply	[flat|nested] 268+ messages in thread

* Re: Intel P6 vs P7 system call performance
  2002-12-19 22:22     ` Linus Torvalds
@ 2002-12-19 22:26       ` H. Peter Anvin
  2002-12-19 22:49         ` Linus Torvalds
  2002-12-22 11:08       ` James H. Cloos Jr.
  1 sibling, 1 reply; 268+ messages in thread
From: H. Peter Anvin @ 2002-12-19 22:26 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jamie Lokier, bart, davej, terje.eggestad, drepper, matti.aarnio,
	hugh, mingo, linux-kernel

Linus Torvalds wrote:
> 
> Uli's suggested glibc approach is to just put the magic system call 
> address (which glibc gets from the AT_SYSINFO elf aux table entry) into 
> the per-thread TLS area, which is always pointed to by %gs anyway.
>
> THIS WORKS WITH ALL DSO'S WITHOUT ANY GAMES, ANY MMAP'S, ANY RELINKING, OR
> ANY EXTRA WORK AT ALL!
> 
> The system call entry becomes a simple
> 
> 	call *%gs:constant-offset
> 
> No mmap. No magic system calls. No relinking. Not _nothing_. One 
> instruction, that's it. 
> 

Unfortunately it means taking an indirect call cost for every invocation...

	-hpa


^ permalink raw reply	[flat|nested] 268+ messages in thread

* Re: Intel P6 vs P7 system call performance
  2002-12-19 22:26       ` H. Peter Anvin
@ 2002-12-19 22:49         ` Linus Torvalds
  2002-12-19 23:30           ` Linus Torvalds
  0 siblings, 1 reply; 268+ messages in thread
From: Linus Torvalds @ 2002-12-19 22:49 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Jamie Lokier, bart, davej, terje.eggestad, drepper, matti.aarnio,
	hugh, mingo, linux-kernel

On Thu, 19 Dec 2002, H. Peter Anvin wrote:
> 
> Unfortunately it means taking an indirect call cost for every invocation...

Ehh.. I just tested the "cost" of this on a PIII (comparing an indirect
call with a direct one), and it's exactly one extra cycle.

ONE CYCLE. 

On a P4 the difference was 4 cycles. On my test P95 system I didn't see
any difference at all. And I don't have an athlon handy in my office.

That's the difference between

	static void *address = &do_nothing;
	asm("call *%0" : : "m" (address));

and

	asm("call do_nothing");

So it's between 0-4 cycles on machines that take 200 - 1000 cycles for
just the system call overhead.

And for that "overhead", you get a binary that trivially works on all
kernels, _and_ doesn't need extra mmap's etc (which are _easily_ thousands
of cycles).
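
For anyone who wants to repeat the comparison without inline asm, a
rough portable sketch in C follows.  It assumes that a volatile function
pointer is enough to make the compiler emit a real indirect call; the
counter and the run_direct()/run_indirect() names are illustrative, and
the timing around the loops (rdtsc or similar) is left out:

```c
#include <assert.h>

static unsigned long counter;

static void do_nothing(void)
{
	counter++;	/* observable effect, so the call is not dropped */
}

/* Kept in memory and re-read on every use because it is volatile. */
static void (*volatile address)(void) = do_nothing;

/* n direct calls; note that the compiler is free to inline these. */
static unsigned long run_direct(unsigned long n)
{
	counter = 0;
	for (unsigned long i = 0; i < n; i++)
		do_nothing();
	return counter;
}

/* n calls through the pointer loaded from memory. */
static unsigned long run_indirect(unsigned long n)
{
	assert(address != 0);
	counter = 0;
	for (unsigned long i = 0; i < n; i++)
		address();
	return counter;
}
```

Both loops do the same work, so any difference in cycles per iteration
is the cost of the indirection.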

		Linus

^ permalink raw reply	[flat|nested] 268+ messages in thread

* Re: Intel P6 vs P7 system call performance
  2002-12-19 22:49         ` Linus Torvalds
@ 2002-12-19 23:30           ` Linus Torvalds
  0 siblings, 0 replies; 268+ messages in thread
From: Linus Torvalds @ 2002-12-19 23:30 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Jamie Lokier, bart, davej, terje.eggestad, drepper, matti.aarnio,
	hugh, mingo, linux-kernel

On Thu, 19 Dec 2002, Linus Torvalds wrote:
> 
> So it's between 0-4 cycles on machines that take 200 - 1000 cycles for
> just the system call overhead.

Side note: I'd expect indirect calls - and especially the predictable 
ones, like this - to maintain competitive behaviour on CPU's. Even the P4, 
which usually has really bad worst-case behaviour for more complex 
instructions (just look at the 2000 cycles for a regular int80/iret and 
shudder) does an indirect call without huge problems.

That's because indirect calls are actually very common, and to some degree 
getting _more_ so with the proliferation of OO languages (and I'm 
discounting the "indirect call" that is just a return statement - they've 
obviously always been common, but a return stack means that CPU 
optimizations for "ret" instructions are different from "real" indirect 
calls).

So I don't worry about the indirection per se. I'd worry a lot more about
some of the tricks people mentioned (ie the "pushl $0xfffff000 ; ret"  
approach probably sucks quite badly on some CPU's, simply because it does
bad things to return stacks on modern CPU's - not necessarily visible in 
microbenchmarks, but..).

			Linus

^ permalink raw reply	[flat|nested] 268+ messages in thread

* Re: Intel P6 vs P7 system call performance
  2002-12-19 22:22     ` Linus Torvalds
  2002-12-19 22:26       ` H. Peter Anvin
@ 2002-12-22 11:08       ` James H. Cloos Jr.
  2002-12-22 18:49         ` Linus Torvalds
  1 sibling, 1 reply; 268+ messages in thread
From: James H. Cloos Jr. @ 2002-12-22 11:08 UTC (permalink / raw)
  To: linux-kernel; +Cc: Linus Torvalds, Ulrich Drepper

Linus> The system call entry becomes a simple

Linus> 	call *%gs:constant-offset

Linus> No mmap. No magic system calls. No relinking. Not
Linus> _nothing_. One instruction, that's it.

I presume *%gs:0x18 is only for shared objects?

A naïve:

-               asm volatile("call 0xffffe000"
+               asm volatile("call *%%gs:0x18"

in the trivial getppid benchmark code gives a SEGV, since
(according to gdb's info all-registers) %gs == 0 when it runs.

Is it just that my glibc is too old, or is there a shared vs static difference?

-JimC

P.S.    On a (1 Gig) mobile p3 the getppid bench gives ~333 cycles for
        int $0x80 and ~215 for call 0xffffe000, before yesterday's push.

^ permalink raw reply	[flat|nested] 268+ messages in thread

* Re: Intel P6 vs P7 system call performance
  2002-12-22 11:08       ` James H. Cloos Jr.
@ 2002-12-22 18:49         ` Linus Torvalds
  2002-12-22 19:07           ` Ulrich Drepper
  2002-12-22 19:17           ` Ulrich Drepper
  0 siblings, 2 replies; 268+ messages in thread
From: Linus Torvalds @ 2002-12-22 18:49 UTC (permalink / raw)
  To: James H. Cloos Jr.; +Cc: linux-kernel, Ulrich Drepper


On 22 Dec 2002, James H. Cloos Jr. wrote:
>
> I presume *%gs:0x18 is only for shared objects?

No, it's for everything, but it requires a glibc that has set it up.

Uli, do you make public snapshots available so that people can test the
new libraries and maybe see system-wide performance issues?

(It would also be good for testing - I've tried to be _very_ careful
inside the kernel, but in the end wide testing is always a good idea)

		Linus


^ permalink raw reply	[flat|nested] 268+ messages in thread

* Re: Intel P6 vs P7 system call performance
  2002-12-22 18:49         ` Linus Torvalds
@ 2002-12-22 19:07           ` Ulrich Drepper
  2002-12-22 19:34             ` Linus Torvalds
  2002-12-22 19:17           ` Ulrich Drepper
  1 sibling, 1 reply; 268+ messages in thread
From: Ulrich Drepper @ 2002-12-22 19:07 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: James H. Cloos Jr., linux-kernel

Linus Torvalds wrote:

> Uli, do you make public snapshots available so that people can test the
> new libraries and maybe see system-wide performance issues?

It is already available.  I've announced it on the NPTL mailing list a
couple of days ago.  There is no support without NPTL since the TLS
setup isn't present in sufficient form in the LinuxThreads code which
has to work on stone-old kernels.  But the NPTL code is more than stable
enough to run on test systems.  In fact, I've a complete system running
using it.

Announcement:

https://listman.redhat.com/pipermail/phil-list/2002-December/000387.html

It is not easy to build glibc and you can easily ruin your system.  You
need very recent tools, the CVS version of glibc and the NPTL add-on.
See for instance

https://listman.redhat.com/pipermail/phil-list/2002-December/000352.html

for a recipe on how to build glibc and how to run binaries using it
*without* replacing your system's libc.  That's safe, but the build is
still demanding.  I know I'll be lynched again for saying
this, but it's the only experience I have: use RHL8 and get the very
latest tools (gcc, binutils) from rawhide.  Then you should be fine.

If there is interest in RPMs of the binaries I might _try_ to provide
some.  But this would mean replacing the system's libc.

-- 
--------------.                        ,-.            444 Castro Street
Ulrich Drepper \    ,-----------------'   \ Mountain View, CA 94041 USA
Red Hat         `--' drepper at redhat.com `---------------------------

^ permalink raw reply	[flat|nested] 268+ messages in thread

* Re: Intel P6 vs P7 system call performance
  2002-12-22 19:07           ` Ulrich Drepper
@ 2002-12-22 19:34             ` Linus Torvalds
  2002-12-22 19:51               ` Ulrich Drepper
  0 siblings, 1 reply; 268+ messages in thread
From: Linus Torvalds @ 2002-12-22 19:34 UTC (permalink / raw)
  To: Ulrich Drepper; +Cc: James H. Cloos Jr., linux-kernel



On Sun, 22 Dec 2002, Ulrich Drepper wrote:
>
> It is already available.  I've announced it on the NPTL mailing list a
> couple of days ago.

Ok. I was definitely thinking of something rpm-like, since I know building
it is a bitch, and doing things wrong tends to result in systems that
don't work all that well.

> If there is interest in RPMs of the binaries I might _try_ to provide
> some.  But this would mean replacing the system's libc.

I suspect that many people who test out 2.5.x kernels (and especially -bk
snapshots) don't find that too scary.

		Linus


^ permalink raw reply	[flat|nested] 268+ messages in thread

* Re: Intel P6 vs P7 system call performance
  2002-12-22 19:34             ` Linus Torvalds
@ 2002-12-22 19:51               ` Ulrich Drepper
  2002-12-22 20:50                 ` James H. Cloos Jr.
  0 siblings, 1 reply; 268+ messages in thread
From: Ulrich Drepper @ 2002-12-22 19:51 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: James H. Cloos Jr., linux-kernel

Linus Torvalds wrote:

> Ok. I was definitely thinking of something rpm-like, since I know building
> it is a bitch, and doing things wrong tends to result in systems that
> don't work all that well.

I've talked to our guy producing the glibc RPMs and he said that he'll
produce them soon.  We'll let people know when it happened.

-- 
--------------.                        ,-.            444 Castro Street
Ulrich Drepper \    ,-----------------'   \ Mountain View, CA 94041 USA
Red Hat         `--' drepper at redhat.com `---------------------------


^ permalink raw reply	[flat|nested] 268+ messages in thread

* Re: Intel P6 vs P7 system call performance
  2002-12-22 19:51               ` Ulrich Drepper
@ 2002-12-22 20:50                 ` James H. Cloos Jr.
  2002-12-22 20:56                   ` Ulrich Drepper
  0 siblings, 1 reply; 268+ messages in thread
From: James H. Cloos Jr. @ 2002-12-22 20:50 UTC (permalink / raw)
  To: Ulrich Drepper; +Cc: Linus Torvalds, linux-kernel

>>>>> "Ulrich" == Ulrich Drepper <drepper@redhat.com> writes:

Ulrich> I've talked to our guy producing the glibc RPMs and he said
Ulrich> that he'll produce them soon.  We'll let people know when it
Ulrich> happened.

I'd tend to prefer an LD_PRELOAD-able dso that just set up %gs and had
entries for each of the foo(2) over a full glibc rpm.  I've only got
the one box to test on right now, but would like to see how well
sysenter¹ works.

-JimC

¹ Assuming I didn't just mix up the intel and amd opcodes....

^ permalink raw reply	[flat|nested] 268+ messages in thread

* Re: Intel P6 vs P7 system call performance
  2002-12-22 20:50                 ` James H. Cloos Jr.
@ 2002-12-22 20:56                   ` Ulrich Drepper
  0 siblings, 0 replies; 268+ messages in thread
From: Ulrich Drepper @ 2002-12-22 20:56 UTC (permalink / raw)
  To: James H. Cloos Jr.; +Cc: Linus Torvalds, linux-kernel

James H. Cloos Jr. wrote:

> I'd tend to prefer an LD_PRELOAD-able dso that just set up %gs and had
> entries for each of the foo(2) over a full glibc rpm.

This is not possible.  The infrastructure is set up in the dynamic
linker.  Read the mail with the references to the NPTL mailing list.
The second referenced mail contains a recipe for building glibc and then
using it in-place.  This is not possible with binary RPMs in the way we
build them.

-- 
--------------.                        ,-.            444 Castro Street
Ulrich Drepper \    ,-----------------'   \ Mountain View, CA 94041 USA
Red Hat         `--' drepper at redhat.com `---------------------------


^ permalink raw reply	[flat|nested] 268+ messages in thread

* Re: Intel P6 vs P7 system call performance
  2002-12-22 18:49         ` Linus Torvalds
  2002-12-22 19:07           ` Ulrich Drepper
@ 2002-12-22 19:17           ` Ulrich Drepper
  1 sibling, 0 replies; 268+ messages in thread
From: Ulrich Drepper @ 2002-12-22 19:17 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: James H. Cloos Jr., linux-kernel

Linus Torvalds wrote:

>>I presume *%gs:0x18 is only for shared objects?
> 
> 
> No, it's for everything, but it requires a glibc that has set it up.

Actually, the above is used only in the DSOs.  In static objects I'm
using a global variable.  This saves the %gs prefix.

But of course Linus is right: using the new functionality needs quite a
bit of infrastructure which most definitely isn't present in the libc
you have on your system.  See my post from a few minutes ago.

-- 
--------------.                        ,-.            444 Castro Street
Ulrich Drepper \    ,-----------------'   \ Mountain View, CA 94041 USA
Red Hat         `--' drepper at redhat.com `---------------------------

^ permalink raw reply	[flat|nested] 268+ messages in thread

* Re: Intel P6 vs P7 system call performance
  2002-12-19 19:37 ` Linus Torvalds
  2002-12-19 22:10   ` Jamie Lokier
@ 2002-12-20 10:08   ` Ulrich Drepper
  2002-12-20 12:06     ` Jamie Lokier
  1 sibling, 1 reply; 268+ messages in thread
From: Ulrich Drepper @ 2002-12-20 10:08 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: bart, davej, lk, hpa, terje.eggestad, matti.aarnio, hugh, mingo,
	linux-kernel

Linus Torvalds wrote:

> For _zero_ gain.  The jump to the library address has to be indirect 
> anyway, and glibc has several places to put the information without any 
> mmap's or anything like that.

Correct.  The current implementation is optimal.

It is necessary to have indirection since the target address can change.

I'm never going to use self-modifying code.

And it's a simple, one-instruction change.

  int $0x80  ->  call *%gs:0x18

That's it.  It's all implemented and tested.  The results are in the
latest NPTL source drop.  The code won't be available in LinuxThreads
since it requires a kernel with TLS support.

As far as I'm concerned the discussion is over.  I'm happy with what I
have now.  The additional overhead for the case when AT_SYSINFO is not
available is negligible (and can be compiled-out completely if one
really wants), and in case AT_SYSINFO is available the code really is
the fastest possible given the constraints mentioned above.
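
A sketch of the fallback, assuming the library fills one slot at startup
and every call site then goes through it.  This is only an illustration
of the idea in C, not the NPTL code; choose_entry(), struct syscall_slot
and the example addresses are hypothetical:

```c
#include <assert.h>

enum entry_kind {
	ENTRY_INT80,		/* local "int $0x80; ret" stub */
	ENTRY_SYSINFO		/* kernel-provided entry point */
};

struct syscall_slot {
	enum entry_kind kind;
	unsigned long addr;	/* what the indirect call jumps to */
};

/* at_sysinfo is the AT_SYSINFO value, or 0 when the kernel did not
 * provide one; int80_stub is the address of the library's own stub. */
static struct syscall_slot choose_entry(unsigned long at_sysinfo,
					unsigned long int80_stub)
{
	struct syscall_slot slot;

	assert(int80_stub != 0);
	if (at_sysinfo != 0) {
		slot.kind = ENTRY_SYSINFO;
		slot.addr = at_sysinfo;
	} else {
		slot.kind = ENTRY_INT80;
		slot.addr = int80_stub;
	}
	return slot;
}
```

The choice is made once, so on an old kernel the only per-call cost is
the extra call and ret around int $0x80.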

-- 
--------------.                        ,-.            444 Castro Street
Ulrich Drepper \    ,-----------------'   \ Mountain View, CA 94041 USA
Red Hat         `--' drepper at redhat.com `---------------------------

^ permalink raw reply	[flat|nested] 268+ messages in thread

* Re: Intel P6 vs P7 system call performance
  2002-12-20 10:08   ` Ulrich Drepper
@ 2002-12-20 12:06     ` Jamie Lokier
  2002-12-20 16:47       ` Linus Torvalds
  0 siblings, 1 reply; 268+ messages in thread
From: Jamie Lokier @ 2002-12-20 12:06 UTC (permalink / raw)
  To: Ulrich Drepper, Linus Torvalds
  Cc: bart, davej, hpa, terje.eggestad, matti.aarnio, hugh, mingo,
	linux-kernel

This is a suggestion on a small performance improvement.

Ulrich Drepper wrote:
>   int $0x80  ->  call *%gs:0x18

The calling convention has been (slightly) changed - i.e. 6 argument
calls don't work, so why not go a bit further: allow the vsyscall entry
point to clobber more GPRs?

I see 3 pushes and pops in the vsyscall page (if I've looked at the
correct patch from Linus), to preserve %ecx, %edx and %ebp:

	vsyscall:
		pushl	%ebp
		pushl	%ecx
		pushl	%edx
	0:
		movl	%esp,%ebp
		sysenter
		jmp	0b
		popl	%edx
		popl	%ecx
		popl	%ebp
		ret

The benefit is that this allows Glibc to do a wholesale replacement of
"int $0x80" -> "single call instruction".  Otherwise, those pushes are
completely unnecessary.  It could be this short instead:

	vsyscall:
		movl	%esp,%ebp
		sysenter
		jmp	vsyscall
		ret

It is nice to be able to use the _exact_ same convention in glibc, for
getting a patch out of the door quickly.  But it is just as easy to do
that putting the pushes and pops into the library itself:

Instead of

	int $0x80 ->	call	*%gs:0x18

Write

	int $0x80 ->	pushl	%ebp
			pushl	%ecx
			pushl	%edx
			call	*%gs:0x18
			popl	%edx
			popl	%ecx
			popl	%ebp

It has exactly the same cost as the current patches, but provides
userspace with more optimisation flexibility, using an asm clobber
list instead of explicit instructions for inline syscalls, etc.

Cheers,
-- Jamie

^ permalink raw reply	[flat|nested] 268+ messages in thread

* Re: Intel P6 vs P7 system call performance
  2002-12-20 12:06     ` Jamie Lokier
@ 2002-12-20 16:47       ` Linus Torvalds
  2002-12-20 23:38         ` Jamie Lokier
  0 siblings, 1 reply; 268+ messages in thread
From: Linus Torvalds @ 2002-12-20 16:47 UTC (permalink / raw)
  To: Jamie Lokier
  Cc: Ulrich Drepper, bart, davej, hpa, terje.eggestad, matti.aarnio,
	hugh, mingo, linux-kernel

On Fri, 20 Dec 2002, Jamie Lokier wrote:
>
> Ulrich Drepper wrote:
> >   int $0x80  ->  call *%gs:0x18
>
> The calling convention has been (slightly) changed - i.e. 6 argument
> calls don't work, so why not go a bit further: allow the vsyscall entry
> point to clobber more GPRs?

Actually, six-argument syscalls _do_ work. I admit that the way to make
them work is "interesting", but it's also extremely simple.

> The benefit is that this allows Glibc to do a wholesale replacement of
> "int $0x80" -> "single call instruction".  Otherwise, those pushes are
> completely unnecessary.  It could be this short instead:
>
> 	vsyscall:
> 		movl	%esp,%ebp
> 		sysenter
> 		jmp	vsyscall
> 		ret

Yes, we could have changed the implementation to clobber more registers,
but if we want to support all system calls it would still have to save
%ebp, so the minimal approach would have been

	vsyscall:
		pushl %ebp
	0:
		movl %esp,%ebp
		sysenter
		jmp 0b	/* only done for restarting */
		popl %ebp
		ret

which is all of 4 (simple) instructions cheaper than the one we have now.

And if the caller cannot depend on registers being saved, the caller may
actually end up being more complicated. For example, with the current
setup, you can have

	getpid():
		movl $__NR_getpid,%eax
		jmp *%gs:0x18

but if system calls clobber registers, the caller needs to be

	getpid():
		pushl %ebx
		pushl %esi
		pushl %edi
		pushl %ebp
		movl $__NR_getpid,%eax
		call *%gs:0x18
		popl %ebp
		popl %edi
		popl %esi
		popl %ebx
		ret

and notice how the _real_ code sequence actually got much _worse_ from the
fact that you tried to save time by not saving registers.

> It is nice to be able to use the _exact_ same convention in glibc, for
> getting a patch out of the door quickly.  But it is just as easy to do
> that putting the pushes and pops into the library itself:
>
> Instead of
>
> 	int $0x80 ->	call	*%gs:0x18
>
> Write
>
> 	int $0x80 ->	pushl	%ebp
> 			pushl	%ecx
> 			pushl	%edx
> 			call	*%gs:0x18
> 			popl	%edx
> 			popl	%ecx
> 			popl	%ebp

But where's the advantage then? You use the same number of instructions
dynamically, and you use _more_ icache space than if you have the pushes
and pops in just one place?

> It has exactly the same cost as the current patches, but provides
> userspace with more optimisation flexibility, using an asm clobber
> list instead of explicit instructions for inline syscalls, etc.

In practice, there is nothing around the call. And you have to realize
that the pushes and pops you added in your version are _wasted_ for other
cases. If the system call ends up being int 0x80, you just wasted time. If
the system call ended up being AMD's x86-64 version of syscall, you just
wasted time.

The advantage of putting all the register save in the trampoline is that
user mode literally doesn't have to _know_ what it is calling. It only
needs to know two simple rules:

 - registers are preserved (except for %eax which is the return value, of
   course)
 - it should fill in arguments in %ebx, %ecx ... (but the things that
   aren't arguments can just be left untouched)

And then depending on what the real low-level calling convention is, the
trampoline will save the _minimum_ number of registers (ie some calling
conventions might clobber different registers than %ecx/%edx - you have to
remember that "sysenter" is just _one_ out of at least three calling
conventions available).

			Linus


* Re: Intel P6 vs P7 system call performance
  2002-12-20 16:47       ` Linus Torvalds
@ 2002-12-20 23:38         ` Jamie Lokier
  2002-12-20 23:50           ` H. Peter Anvin
  2002-12-21  0:09           ` Linus Torvalds
  0 siblings, 2 replies; 268+ messages in thread
From: Jamie Lokier @ 2002-12-20 23:38 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ulrich Drepper, bart, davej, hpa, terje.eggestad, matti.aarnio,
	hugh, mingo, linux-kernel

Linus Torvalds wrote:
> And if the caller cannot depend on registers being saved, the caller may
> actually end up being more complicated. For example, with the current
> setup, you can have
> 
> 	getpid():
> 		movl $__NR_getpid,%eax
> 		jmp *%gs:0x18
> 
> but if system calls clobber registers, the caller needs to be
> 
> [long code snippet]
> 
> and notice how the _real_ code sequence actually got much _worse_ from the
> fact that you tried to save time by not saving registers.

No, your "real" code sequence is wrong.

%ebx/%edi/%esi are preserved across sysenter/sysexit, whereas
%ecx/%edx are call-clobbered registers in the i386 function call ABI.

This is not a coincidence.

So, getpid looks like this with the _smaller_ vsyscall code:

 	getpid():
 		movl $__NR_getpid,%eax
 		call *%gs:0x18
 		ret

Intel didn't choose %ecx/%edx as the sysexit registers by accident.
They were chosen for exactly this reason.

By the way, the same applies to AMD's syscall/sysret, which clobbers %ecx.

What I'm suggesting is that we should say that "call 0xffffe000"
clobbers only the registers (%eax/%ecx/%edx) that _normal_ function
calls clobber on i386, and preserves the call-saved registers.

This keeps the size of system call stubs in libc to the minimum.
Think about it.

-- Jamie


* Re: Intel P6 vs P7 system call performance
  2002-12-20 23:38         ` Jamie Lokier
@ 2002-12-20 23:50           ` H. Peter Anvin
  2002-12-21  0:09           ` Linus Torvalds
  1 sibling, 0 replies; 268+ messages in thread
From: H. Peter Anvin @ 2002-12-20 23:50 UTC (permalink / raw)
  To: Jamie Lokier
  Cc: Linus Torvalds, Ulrich Drepper, bart, davej, terje.eggestad,
	matti.aarnio, hugh, mingo, linux-kernel

Jamie Lokier wrote:
> 
> No, your "real" code sequence is wrong.
> 
> %ebx/%edi/%esi are preserved across sysenter/sysexit, whereas
> %ecx/%edx are call-clobbered registers in the i386 function call ABI.
> 
> This is not a coincidence.
> 
> So, getpid looks like this with the _smaller_ vsyscall code:
> 
>  	getpid():
>  		movl $__NR_getpid,%eax
>  		call *%gs:0x18
>  		ret

... or just...

getpid:
	movl $__NR_getpid, %eax
	jmp *%gs:0x18

This doesn't mess up the call/return stack, even, because the ret in the
stub matches the call to getpid.

	-hpa



* Re: Intel P6 vs P7 system call performance
  2002-12-20 23:38         ` Jamie Lokier
  2002-12-20 23:50           ` H. Peter Anvin
@ 2002-12-21  0:09           ` Linus Torvalds
  2002-12-21 17:18             ` Jamie Lokier
  1 sibling, 1 reply; 268+ messages in thread
From: Linus Torvalds @ 2002-12-21  0:09 UTC (permalink / raw)
  To: Jamie Lokier
  Cc: Ulrich Drepper, bart, davej, hpa, terje.eggestad, matti.aarnio,
	hugh, mingo, linux-kernel

On Fri, 20 Dec 2002, Jamie Lokier wrote:
>
> %ebx/%edi/%esi are preserved across sysenter/sysexit, whereas
> %ecx/%edx are call-clobbered registers in the i386 function call ABI.
>
> This is not a coincidence.

Yes, you can make the "clobbers %eax/%edx/%ecx" argument, but the fact is,
we quite fundamentally need to save %edx/%ecx _anyway_.

The reason is system call restarting and signal handling. You don't see it
right now, because the system call restart mechanism doesn't actually use
"sysexit" at all, but that's because the current implementation is only
the minimal possible implementation.

The way we do signal handling right now, we always punt to the "old" code,
ie the return path that will eventually return with an "iret".

And that old code will restore _all_ registers, including %ecx and %edx.
So when we return after a restart to the restart handler, %ecx and %edx
will have their original values, which is why restarting works right now.

The "iret" will trash "%ebp", simply because we fake out the whole %ebp
saving to get the six-argument case right. That's why we have to have that
extra complicated restart sequence:

	0:
		movl %esp,%ebp
		syscall
	restart:
		jmp 0b

but once we start using sysexit for the signal handler return path too, we
will need to restore %edx and %ecx too, otherwise our restarted system
call will have crap in the registers. I already wrote the code, but
decided that as long as we don't do that kind of restarting, we shouldn't
have the overhead in the trampoline. But basically the trampoline then
will become

	system_call_trampoline:
		pushfl
		pushl %ecx
		pushl %edx
		pushl %ebp
		movl %esp,%ebp
		syscall
	0:
		movl %esp,%ebp
		movl 4(%ebp),%edx
		movl 8(%ebp),%ecx
		syscall

	restart:
		jmp 0b
	sysenter_return_point:
		popl %ebp
		popl %edx
		popl %ecx
		popfl
		ret

see? So you _have_ to really save the arguments anyway, because you cannot
do a sysexit-based system call restart otherwise. And once you save them,
you might as well restore them too.

And since you have to restore them for system call restart anyway, you
might as well just make it part of the calling convention.

Yes, I'm thinking ahead. Sue me.

			Linus


* Re: Intel P6 vs P7 system call performance
  2002-12-21  0:09           ` Linus Torvalds
@ 2002-12-21 17:18             ` Jamie Lokier
  2002-12-21 19:39               ` Linus Torvalds
  0 siblings, 1 reply; 268+ messages in thread
From: Jamie Lokier @ 2002-12-21 17:18 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ulrich Drepper, bart, davej, hpa, terje.eggestad, matti.aarnio,
	hugh, mingo, linux-kernel

Linus Torvalds wrote:
> Yes, you can make the "clobbers %eax/%edx/%ecx" argument, but the fact is,
> we quite fundamentally need to save %edx/%ecx _anyway_.

On the kernel side, yes.  In the userspace trampoline, it's not required.

> but once we start using sysexit for the signal handler return path too, we
> will need to restore %edx and %ecx too, otherwise our restarted system
> call will have crap in the registers. I already wrote the code, but
> decided that as long as we don't do that kind of restarting, we shouldn't
> have the overhead in the trampoline. But basically the trampoline then
> will become
> 
> 	system_call_trampoline:
> 		pushfl
> 		pushl %ecx
> 		pushl %edx
> 		pushl %ebp
> 		movl %esp,%ebp
> 		syscall
> 	0:
> 		movl %esp,%ebp
> 		movl 4(%ebp),%edx
> 		movl 8(%ebp),%ecx
> 		syscall
> 
> 	restart:
> 		jmp 0b
> 	sysenter_return_point:
> 		popl %ebp
> 		popl %edx
> 		popl %ecx
> 		popfl
> 		ret

> see? So you _have_ to really save the arguments anyway, because you cannot
> do a sysexit-based system call restart otherwise. And once you save them,
> you might as well restore them too.
> 
> And since you have to restore them for system call restart anyway, you
> might as well just make it part of the calling convention.
>
> Yes, I'm thinking ahead. Sue me.

You're optimising the _rare_ case.

The correct [;)] trampoline looks like this:

 	system_call_trampoline:
 		pushl %ebp
		movl %esp,%ebp
 		sysenter
 	sysenter_return_point:
		popl %ebp
 		ret
 	sysenter_restart:
		popl %edx
		popl %ecx
		movl %esp,%ebp
 		sysenter

This is accompanied by changing this line in arch/i386/kernel/signal.c:

	regs->eip -= 2;

To this (best moved to an inline function):

	if (likely (regs->eip == sysenter_return_point)) {
		unsigned long * esp = (unsigned long *) (regs->esp - 8);
		if (!access_ok(VERIFY_WRITE, esp, 8)
		    || __put_user(regs->edx, esp)
		    || __put_user(regs->ecx, esp+1)) {
			if (sig == SIGSEGV)
				ka->sa.sa_handler = SIG_DFL;
			force_sig(SIGSEGV, current);
		}
		regs->esp = (long) esp;
		regs->eip = sysenter_restart;
	} else {
		regs->eip -= 2;
	}

Thus the common case, system calls, is optimised.  The uncommon case,
a signal interrupting a system call, is penalised, though it's not a large
penalty.  (Much less than the gain from using sysexit!)  Which is more
common?
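
A user-space model of the restart bookkeeping described above may make the stack handling easier to follow. This is only a sketch under stated assumptions: the `model_*` names are hypothetical, the user stack is simulated with a byte array, all stack offsets are counted in bytes, and the two trampoline addresses are illustrative values derived from the instruction sizes (they are not taken from any kernel source).

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define STACK_BYTES 64

/* Illustrative addresses only: offsets 5 and 7 in a page assumed at 0xffffe000. */
#define SYSENTER_RETURN_POINT	0xffffe005u
#define SYSENTER_RESTART	0xffffe007u

struct model_regs {
	uint32_t eip, esp, ecx, edx;
};

struct model_task {
	uint8_t stack[STACK_BYTES];	/* simulated user stack memory */
	uint32_t stack_base;		/* user address of stack[0] */
};

/* Store one 32-bit word at a simulated user address; -1 if out of range. */
static int model_put_user(struct model_task *t, uint32_t addr, uint32_t val)
{
	if (addr < t->stack_base || addr - t->stack_base > STACK_BYTES - 4)
		return -1;
	memcpy(&t->stack[addr - t->stack_base], &val, sizeof(val));
	return 0;
}

/*
 * Arrange for an interrupted system call to be restarted.
 * Returns 0 on success, -1 if the user stack could not be written
 * (where the real code would force SIGSEGV).
 */
static int model_setup_restart(struct model_task *t, struct model_regs *regs)
{
	if (regs->eip == SYSENTER_RETURN_POINT) {
		uint32_t esp = regs->esp - 8;	/* room for two 32-bit words */

		/* %edx goes lowest so that "popl %edx; popl %ecx" restores both. */
		if (model_put_user(t, esp, regs->edx) ||
		    model_put_user(t, esp + 4, regs->ecx))
			return -1;
		regs->esp = esp;
		regs->eip = SYSENTER_RESTART;
	} else {
		regs->eip -= 2;		/* back up over the 2-byte "int $0x80" */
	}
	return 0;
}
```

In this model only the signal-interrupted path touches the user stack; the normal return path is unchanged.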

By the way, this works with the AMD syscall instruction too.
Then the trampoline is:

	system_call_trampoline:
		pushl %ebp
		movl %ecx,%ebp
		syscall
	syscall_return_point:
		popl %ebp
		ret
	syscall_restart:
		popl %edx
		popl %ebp
		syscall

Finally, there is no need to restore %ebp in the kernel sysexit/sysret
paths, because it will always be restored in the trampoline.  So you
save 1 memory access there too.

(ps. I have a question: why does your trampoline save & restore
the flags?)

-- Jamie


* Re: Intel P6 vs P7 system call performance
  2002-12-21 17:18             ` Jamie Lokier
@ 2002-12-21 19:39               ` Linus Torvalds
  2002-12-22  2:18                 ` Jamie Lokier
  0 siblings, 1 reply; 268+ messages in thread
From: Linus Torvalds @ 2002-12-21 19:39 UTC (permalink / raw)
  To: Jamie Lokier
  Cc: Ulrich Drepper, bart, davej, hpa, terje.eggestad, matti.aarnio,
	hugh, mingo, linux-kernel

On Sat, 21 Dec 2002, Jamie Lokier wrote:
>
> Linus Torvalds wrote:
> > Yes, you can make the "clobbers %eax/%edx/%ecx" argument, but the fact is,
> > we quite fundamentally need to save %edx/%ecx _anyway_.
>
> On the kernel side, yes.  In the userspace trampoline, it's not required.

No, it _is_ required.

There are a few registers that _have_ to be saved on the user side,
because the kernel will trash them. Those registers are:

 - eflags (kernel has no sane way to restore things like TF in it
   atomically with a sysexit)
 - ebp (kernel has to reload it with arg-6)
 - ecx/edx (kernel _cannot_ restore them).

Your games with looking at %eip are fragile as hell.

> You're optimising the _rare_ case.

NO. I'm making it WORK.

> This is accompanied by changing this line in arch/i386/kernel/signal.c:
>
> 	regs->eip -= 2;

You're full of it.

You're adding fundamental complexity and special cases, because you have
a political agenda that you want to support, that is not really
supportable.

The fact is, system calls have a special calling convention anyway, and
doing them the way we're doing them now is a hell of a lot saner than
making much more complex code. Saving and restoring the two registers
means that they get easier and more efficient to use from inline asms for
example, and means that the code is simpler.

Your suggestion has _zero_ advantages. Doing two register pop's takes a
cycle, and means that the calling sequence is simple and has no special
cases.

The example code you posted is fragile as hell. Looking at "eip" means
that the different system call entry points now have to be extra careful
not to have the same return points, which is just _bad_ programming.

		Linus


* Re: Intel P6 vs P7 system call performance
  2002-12-21 19:39               ` Linus Torvalds
@ 2002-12-22  2:18                 ` Jamie Lokier
  2002-12-22  3:11                   ` Linus Torvalds
  0 siblings, 1 reply; 268+ messages in thread
From: Jamie Lokier @ 2002-12-22  2:18 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ulrich Drepper, bart, davej, hpa, terje.eggestad, matti.aarnio,
	hugh, mingo, linux-kernel

Linus Torvalds wrote:
>  - eflags (kernel has no sane way to restore things like TF in it
>    atomically with a sysexit)

It is better to use iret with TF.  The penalty of forcing _every_
system call to pushfl and popfl in user space is quite a lot: I
measured 30 cycles for "pushfl;popfl" on my 366MHz Celeron.

("sysenter, setup segments, call a function in kernel space, restore
segments, sysexit" takes 82 cycles on the same Celeron, so 30 cycles
is quite a significant proportion to add to that.  Btw, _82_, not 200
or so).

>  - ebp (kernel has to reload it with arg-6)
>  - ecx/edx (kernel _cannot_ restore them).

These are only needed when delivering a signal.

> Your games with looking at %eip are fragile as hell.

Like we don't play %eip games anywhere else... (the page fault fixup
table comes to mind).

> because you have a political agenda that you want to support, that
> is not really supportable.

And there was me thinking I was performance-tuning some code.
Politics, it gets everywhere, like curry gets onto anything white.

> Saving and restoring the two registers
> means that they get easier and more efficient to use from inline asms for
> example, and means that the code is simpler.

They are not more efficient from inline asms, though marginally
simpler to write (shorter clobber list).  You just moved the cost from
the asm itself, where it is usually optimised away, to the trampoline
where it is always present (and cast in stone).

> Your suggestion has _zero_ advantages. Doing two register pop's takes a
> cycle, and means that the calling sequence is simple and has no special
> cases.

(Plus another cycle for the two pushes...)

> The example code you posted is fragile as hell. Looking at "eip" means
> that the different system call entry points now have to be extra careful
> not to have the same return points, which is just _bad_ programming.

We are talking about a _very_ small trampoline, which is simplicity
itself compared with entry.S in general.  You're right about the extra
care (so write a comment!), although it does just work for _all_ entry
points.  Is this really worse than your own "wonderful hack"?

<shrug> You're the executive decision maker.  I just know how to
write fast code.  Thanks for listening.

-- Jamie


* Re: Intel P6 vs P7 system call performance
  2002-12-22  2:18                 ` Jamie Lokier
@ 2002-12-22  3:11                   ` Linus Torvalds
  2002-12-22 10:13                     ` Ingo Molnar
  2002-12-22 10:23                     ` Ingo Molnar
  0 siblings, 2 replies; 268+ messages in thread
From: Linus Torvalds @ 2002-12-22  3:11 UTC (permalink / raw)
  To: Jamie Lokier
  Cc: Ulrich Drepper, bart, davej, hpa, terje.eggestad, matti.aarnio,
	hugh, mingo, linux-kernel

On Sun, 22 Dec 2002, Jamie Lokier wrote:
>
> It is better to use iret with TF.  The penalty of forcing _every_
> system call to pushfl and popfl in user space is quite a lot: I
> measured 30 cycles for "pushfl;popfl" on my 366MHz Celeron.

Jamie, please stop these emails.

The fact is, when a user enters the kernel with TF set using "sysenter",
the kernel doesn't even _know_ that TF is set, because it will take a
debug trap on the very first instruction, and the debug handler has no
real option except to just return with TF cleared before the kernel even
had a chance to save eflags. So at no point in the sysenter/sysexit path
does the code have a chance to even _realize_ that the user called it with
single-stepping on.

So how do you want the code to figure that out, and then (a) set TF on the
stack and (b) do the jump to the slow path? Sure, we could add magic
per-process flags in the debug handler, and then test them in the sysenter
path - but that really is pretty gross.

Saving and restoring eflags in user mode avoids all of these
complications, and means that there are no special cases. None. Zero.
Nada.

Special case code is bad. It's certainly a lot more important to me to
have a straightforward approach that doesn't have any strange cases, and
where debugging "just works", instead of having a lot of magic small
details scattered all over the place.

So if you really care, create all your special case magic tricks, and see
just how ugly it gets. Then see whether it makes any difference at all
except on the very simplest system calls ("getpid" really isn't very
important).

			Linus


* Re: Intel P6 vs P7 system call performance
  2002-12-22  3:11                   ` Linus Torvalds
@ 2002-12-22 10:13                     ` Ingo Molnar
  2002-12-22 15:32                       ` Jamie Lokier
                                         ` (2 more replies)
  2002-12-22 10:23                     ` Ingo Molnar
  1 sibling, 3 replies; 268+ messages in thread
From: Ingo Molnar @ 2002-12-22 10:13 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jamie Lokier, Ulrich Drepper, bart, davej, hpa, terje.eggestad,
	matti.aarnio, hugh, linux-kernel


On Sat, 21 Dec 2002, Linus Torvalds wrote:

> Saving and restoring eflags in user mode avoids all of these
> complications, and means that there are no special cases. None. Zero.
> Nada.

and i'm 100% sure the more robust eflags saving will also avoid security
holes. The amount of security-relevant complexity that comes from all the
x86 features [and their combinations] is amazing.

	Ingo



* Re: Intel P6 vs P7 system call performance
  2002-12-22 10:13                     ` Ingo Molnar
@ 2002-12-22 15:32                       ` Jamie Lokier
  2002-12-22 18:53                       ` Linus Torvalds
  2002-12-24 19:36                       ` Linus Torvalds
  2 siblings, 0 replies; 268+ messages in thread
From: Jamie Lokier @ 2002-12-22 15:32 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, Ulrich Drepper, bart, davej, hpa, terje.eggestad,
	matti.aarnio, hugh, linux-kernel

Ingo Molnar wrote:
> and i'm 100% sure the more robust eflags saving will also avoid security
> holes. The amount of security-relevant complexity that comes from all the
> x86 features [and their combinations] is amazing.

Userspace can skip the "popfl" with a well-timed signal.  If the
"sysexit" path leaves the kernel with an unsafe eflags, that will
propagate into the signal handler.

AFAICT, one of these is required:

	1. eflags must be safe before leaving kernel space, or
	2. setup_sigcontext() must clean it up (it already does clear TF).

-- Jamie


* Re: Intel P6 vs P7 system call performance
  2002-12-22 10:13                     ` Ingo Molnar
  2002-12-22 15:32                       ` Jamie Lokier
@ 2002-12-22 18:53                       ` Linus Torvalds
  2002-12-23  5:03                         ` Linus Torvalds
  2002-12-24 19:36                       ` Linus Torvalds
  2 siblings, 1 reply; 268+ messages in thread
From: Linus Torvalds @ 2002-12-22 18:53 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Jamie Lokier, Ulrich Drepper, bart, davej, hpa, terje.eggestad,
	matti.aarnio, hugh, linux-kernel

On Sun, 22 Dec 2002, Ingo Molnar wrote:
>
> On Sat, 21 Dec 2002, Linus Torvalds wrote:
>
> > Saving and restoring eflags in user mode avoids all of these
> > complications, and means that there are no special cases. None. Zero.
> > Nada.
>
> and i'm 100% sure the more robust eflags saving will also avoid security
> holes. The amount of security-relevant complexity that comes from all the
> x86 features [and their combinations] is amazing.

I looked a bit at what it would take to have the TF bit handled by the
sysenter path, and it might not be so horrible - certainly not as ugly as
the register restore bits.

Jamie, if you want to do it, it looks like you could add a new "work" bit
in the thread flags, and add it to the _TIF_ALLWORK_MASK tests. At least
that way it wouldn't touch the regular code, and I don't think that the
result would have any strange "magic EIP" tests or anything horrible like
that ;)

		Linus


* Re: Intel P6 vs P7 system call performance
  2002-12-22 18:53                       ` Linus Torvalds
@ 2002-12-23  5:03                         ` Linus Torvalds
  2002-12-23  7:14                           ` Ulrich Drepper
  2002-12-23 23:27                           ` Petr Vandrovec
  0 siblings, 2 replies; 268+ messages in thread
From: Linus Torvalds @ 2002-12-23  5:03 UTC (permalink / raw)
  To: Jamie Lokier
  Cc: Ingo Molnar, Ulrich Drepper, bart, davej, hpa, terje.eggestad,
	matti.aarnio, hugh, linux-kernel



On Sun, 22 Dec 2002, Linus Torvalds wrote:
>
> I looked a bit at what it would take to have the TF bit handled by the
> sysenter path, and it might not be so horrible - certainly not as ugly as
> the register restore bits.

Hey, I tried it out, and it does indeed turn out to be fairly easy and
clean (in fact, it's mostly four pretty obvious "one-liners").

Let nobody say I won't change my mind - you were right Jamie (*). The
pushfl/popfl is unnecessary, and does show up in microbenchmarks.

How does the attached patch work for people? I've verified that
single-stepping works, and I've also verified that it does improve
performance for simple system calls. Everything looks quite simple.

		Linus

(*) In fact, people sometimes complain that I change my mind way too
often. Hey, sue me.

-=-=-=

===== arch/i386/kernel/signal.c 1.22 vs edited =====
--- 1.22/arch/i386/kernel/signal.c	Fri Dec  6 09:43:43 2002
+++ edited/arch/i386/kernel/signal.c	Sun Dec 22 20:31:38 2002
@@ -609,6 +609,11 @@
 void do_notify_resume(struct pt_regs *regs, sigset_t *oldset,
 		      __u32 thread_info_flags)
 {
+	/* Pending single-step? */
+	if (thread_info_flags & _TIF_SINGLESTEP) {
+		regs->eflags |= TF_MASK;
+		clear_thread_flag(TIF_SINGLESTEP);
+	}
 	/* deal with pending signal delivery */
 	if (thread_info_flags & _TIF_SIGPENDING)
 		do_signal(regs,oldset);
===== arch/i386/kernel/sysenter.c 1.4 vs edited =====
--- 1.4/arch/i386/kernel/sysenter.c	Sat Dec 21 16:02:02 2002
+++ edited/arch/i386/kernel/sysenter.c	Sun Dec 22 20:17:28 2002
@@ -54,19 +54,18 @@
 		0xc3			/* ret */
 	};
 	static const char sysent[] = {
-		0x9c,			/* pushf */
 		0x51,			/* push %ecx */
 		0x52,			/* push %edx */
 		0x55,			/* push %ebp */
 		0x89, 0xe5,		/* movl %esp,%ebp */
 		0x0f, 0x34,		/* sysenter */
+		0x00,			/* align return point */
 	/* System call restart point is here! (SYSENTER_RETURN - 2) */
 		0xeb, 0xfa,		/* jmp to "movl %esp,%ebp" */
 	/* System call normal return point is here! (SYSENTER_RETURN in entry.S) */
 		0x5d,			/* pop %ebp */
 		0x5a,			/* pop %edx */
 		0x59,			/* pop %ecx */
-		0x9d,			/* popf - restore TF */
 		0xc3			/* ret */
 	};
 	unsigned long page = get_zeroed_page(GFP_ATOMIC);
===== arch/i386/kernel/traps.c 1.36 vs edited =====
--- 1.36/arch/i386/kernel/traps.c	Mon Nov 18 10:10:45 2002
+++ edited/arch/i386/kernel/traps.c	Sun Dec 22 20:03:35 2002
@@ -605,7 +605,7 @@
 		 * interface.
 		 */
 		if ((regs->xcs & 3) == 0)
-			goto clear_TF;
+			goto clear_TF_reenable;
 		if ((tsk->ptrace & (PT_DTRACE|PT_PTRACED)) == PT_DTRACE)
 			goto clear_TF;
 	}
@@ -637,6 +637,8 @@
 	handle_vm86_trap((struct kernel_vm86_regs *) regs, error_code, 1);
 	return;

+clear_TF_reenable:
+	set_tsk_thread_flag(tsk, TIF_SINGLESTEP);
 clear_TF:
 	regs->eflags &= ~TF_MASK;
 	return;
===== include/asm-i386/thread_info.h 1.8 vs edited =====
--- 1.8/include/asm-i386/thread_info.h	Fri Dec  6 09:43:43 2002
+++ edited/include/asm-i386/thread_info.h	Sun Dec 22 20:30:28 2002
@@ -109,6 +109,7 @@
 #define TIF_NOTIFY_RESUME	1	/* resumption notification requested */
 #define TIF_SIGPENDING		2	/* signal pending */
 #define TIF_NEED_RESCHED	3	/* rescheduling necessary */
+#define TIF_SINGLESTEP		4	/* restore singlestep on return to user mode */
 #define TIF_USEDFPU		16	/* FPU was used by this task this quantum (SMP) */
 #define TIF_POLLING_NRFLAG	17	/* true if poll_idle() is polling TIF_NEED_RESCHED */

@@ -116,6 +117,7 @@
 #define _TIF_NOTIFY_RESUME	(1<<TIF_NOTIFY_RESUME)
 #define _TIF_SIGPENDING		(1<<TIF_SIGPENDING)
 #define _TIF_NEED_RESCHED	(1<<TIF_NEED_RESCHED)
+#define _TIF_SINGLESTEP		(1<<TIF_SINGLESTEP)
 #define _TIF_USEDFPU		(1<<TIF_USEDFPU)
 #define _TIF_POLLING_NRFLAG	(1<<TIF_POLLING_NRFLAG)
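
The flag handoff in the patch above can be modelled in user space, as a rough sketch rather than kernel code. The `model_*` structures and functions are hypothetical; `TIF_SINGLESTEP` uses the bit number from the patch, and `TF_MASK` is the x86 EFLAGS trap flag (bit 8). Only the kernel-mode branch of the debug trap is modelled.

```c
#include <assert.h>
#include <stdint.h>

#define TF_MASK		0x00000100u	/* x86 EFLAGS trap flag, bit 8 */
#define TIF_SINGLESTEP	4		/* bit number used in the patch */
#define _TIF_SINGLESTEP	(1u << TIF_SINGLESTEP)

struct model_regs {
	uint32_t eflags;
	uint32_t xcs;		/* low two bits give the privilege level */
};

struct model_thread {
	uint32_t flags;		/* stand-in for thread_info flags */
};

/*
 * Debug trap. If it hit in kernel mode (the first instruction after
 * sysenter), remember that single-stepping is pending and clear TF so
 * the kernel itself is not single-stepped.
 */
static void model_debug_trap(struct model_thread *t, struct model_regs *regs)
{
	if ((regs->xcs & 3) == 0) {
		t->flags |= _TIF_SINGLESTEP;
		regs->eflags &= ~TF_MASK;
	}
}

/* Return-to-user work: re-arm TF in the user-mode register frame. */
static void model_notify_resume(struct model_thread *t,
				struct model_regs *user_regs)
{
	if (t->flags & _TIF_SINGLESTEP) {
		user_regs->eflags |= TF_MASK;
		t->flags &= ~_TIF_SINGLESTEP;
	}
}
```

Because the pending state lives in a thread flag, the trampoline does not need pushfl/popfl to carry TF across the system call.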




* Re: Intel P6 vs P7 system call performance
  2002-12-23  5:03                         ` Linus Torvalds
@ 2002-12-23  7:14                           ` Ulrich Drepper
  2002-12-23 23:27                           ` Petr Vandrovec
  1 sibling, 0 replies; 268+ messages in thread
From: Ulrich Drepper @ 2002-12-23  7:14 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jamie Lokier, Ingo Molnar, bart, davej, hpa, terje.eggestad,
	matti.aarnio, hugh, linux-kernel

Linus Torvalds wrote:

> How does the attached patch work for people?

I've compiled glibc and run the test suite without any problems.

-- 
--------------.                        ,-.            444 Castro Street
Ulrich Drepper \    ,-----------------'   \ Mountain View, CA 94041 USA
Red Hat         `--' drepper at redhat.com `---------------------------



* Re: Intel P6 vs P7 system call performance
  2002-12-23  5:03                         ` Linus Torvalds
  2002-12-23  7:14                           ` Ulrich Drepper
@ 2002-12-23 23:27                           ` Petr Vandrovec
  2002-12-24  0:22                             ` Stephen Rothwell
  1 sibling, 1 reply; 268+ messages in thread
From: Petr Vandrovec @ 2002-12-23 23:27 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jamie Lokier, Ingo Molnar, Ulrich Drepper, bart, davej, hpa,
	terje.eggestad, matti.aarnio, hugh, linux-kernel

On Sun, Dec 22, 2002 at 09:03:44PM -0800, Linus Torvalds wrote:
> 
> How does the attached patch work for people? I've verified that
> single-stepping works, and I've also verified that it does improve
> performance for simple system calls. Everything looks quite simple.

> ===== arch/i386/kernel/sysenter.c 1.4 vs edited =====
> --- 1.4/arch/i386/kernel/sysenter.c	Sat Dec 21 16:02:02 2002
> +++ edited/arch/i386/kernel/sysenter.c	Sun Dec 22 20:17:28 2002
> @@ -54,19 +54,18 @@
>  		0xc3			/* ret */
>  	};
>  	static const char sysent[] = {
> -		0x9c,			/* pushf */
>  		0x51,			/* push %ecx */
>  		0x52,			/* push %edx */
>  		0x55,			/* push %ebp */
>  		0x89, 0xe5,		/* movl %esp,%ebp */
>  		0x0f, 0x34,		/* sysenter */
> +		0x00,			/* align return point */
>  	/* System call restart point is here! (SYSENTER_RETURN - 2) */
>  		0xeb, 0xfa,		/* jmp to "movl %esp,%ebp" */

Hi Linus,

Jump instruction should be 0xeb, 0xf9, with 0xeb, 0xfa it jumps into 
the middle of movl %esp,%ebp because of added alignment.
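
The displacement arithmetic behind this fix can be checked with a small sketch. The helper names are hypothetical; the offsets assume the patched layout quoted above, with the three one-byte pushes at offsets 0..2, "movl %esp,%ebp" at 3, "sysenter" at 5, the padding byte at 7 and the two-byte "jmp" at 8.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Displacement byte for a 2-byte short jump (opcode 0xeb) located at
 * byte offset jmp_off and jumping to byte offset target. The CPU adds
 * the displacement to the address of the *next* instruction, which is
 * jmp_off + 2.
 */
static uint8_t short_jmp_disp(int jmp_off, int target)
{
	int disp = target - (jmp_off + 2);

	assert(disp >= -128 && disp <= 127);	/* must fit in a signed byte */
	return (uint8_t)disp;
}

/* Byte offset that a short jump at jmp_off with displacement disp lands on. */
static int short_jmp_target(int jmp_off, uint8_t disp)
{
	return jmp_off + 2 + (int8_t)disp;
}
```

With that layout, short_jmp_disp(8, 3) gives 0xf9, while short_jmp_target(8, 0xfa) is 4, the second byte of the movl.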

Maybe the glibc tests also need something to check restarted syscalls...
					Thanks,
						Petr Vandrovec
						vandrove@vc.cvut.cz


--- linux/arch/i386/kernel/sysenter.c.orig	2002-12-24 00:23:41.000000000 +0100
+++ linux/arch/i386/kernel/sysenter.c	2002-12-24 00:23:50.000000000 +0100
@@ -61,7 +61,7 @@
 		0x0f, 0x34,		/* sysenter */
 		0x00,			/* align return point */
 	/* System call restart point is here! (SYSENTER_RETURN - 2) */
-		0xeb, 0xfa,		/* jmp to "movl %esp,%ebp" */
+		0xeb, 0xf9,		/* jmp to "movl %esp,%ebp" */
 	/* System call normal return point is here! (SYSENTER_RETURN in entry.S) */
 		0x5d,			/* pop %ebp */
 		0x5a,			/* pop %edx */


* Re: Intel P6 vs P7 system call performance
  2002-12-23 23:27                           ` Petr Vandrovec
@ 2002-12-24  0:22                             ` Stephen Rothwell
  2002-12-24  4:10                               ` Linus Torvalds
  0 siblings, 1 reply; 268+ messages in thread
From: Stephen Rothwell @ 2002-12-24  0:22 UTC (permalink / raw)
  To: Petr Vandrovec
  Cc: torvalds, lk, mingo, drepper, bart, davej, hpa, terje.eggestad,
	matti.aarnio, hugh, linux-kernel

On Tue, 24 Dec 2002 00:27:43 +0100 Petr Vandrovec <vandrove@vc.cvut.cz> wrote:
>
> On Sun, Dec 22, 2002 at 09:03:44PM -0800, Linus Torvalds wrote:
> > 
> > How does the attached patch work for people? I've verified that
> > single-stepping works, and I've also verified that it does improve
> > performance for simple system calls. Everything looks quite simple.
> 
> > ===== arch/i386/kernel/sysenter.c 1.4 vs edited =====
> > --- 1.4/arch/i386/kernel/sysenter.c	Sat Dec 21 16:02:02 2002
> > +++ edited/arch/i386/kernel/sysenter.c	Sun Dec 22 20:17:28 2002
> > @@ -54,19 +54,18 @@
> >  		0xc3			/* ret */
> >  	};
> >  	static const char sysent[] = {
> > -		0x9c,			/* pushf */
> >  		0x51,			/* push %ecx */
> >  		0x52,			/* push %edx */
> >  		0x55,			/* push %ebp */
> >  		0x89, 0xe5,		/* movl %esp,%ebp */
> >  		0x0f, 0x34,		/* sysenter */
> > +		0x00,			/* align return point */
> >  	/* System call restart point is here! (SYSENTER_RETURN - 2) */
> >  		0xeb, 0xfa,		/* jmp to "movl %esp,%ebp" */
> 
> Hi Linus,
> 
> Jump instruction should be 0xeb, 0xf9, with 0xeb, 0xfa it jumps into 
> the middle of movl %esp,%ebp because of added alignment.

And if you change the 0x00 used for alignment to 0x90 (nop) you can
use gdb to disassemble that array of bytes to check any changes ...

-- 
Cheers,
Stephen Rothwell                    sfr@canb.auug.org.au
http://www.canb.auug.org.au/~sfr/


* Re: Intel P6 vs P7 system call performance
  2002-12-24  0:22                             ` Stephen Rothwell
@ 2002-12-24  4:10                               ` Linus Torvalds
  2002-12-24  8:05                                 ` Rogier Wolff
  2002-12-27 16:14                                 ` Kai Henningsen
  0 siblings, 2 replies; 268+ messages in thread
From: Linus Torvalds @ 2002-12-24  4:10 UTC (permalink / raw)
  To: Stephen Rothwell
  Cc: Petr Vandrovec, lk, Ingo Molnar, drepper, bart, davej, hpa,
	terje.eggestad, matti.aarnio, hugh, linux-kernel



On Tue, 24 Dec 2002, Stephen Rothwell wrote:
>
> And if you change the 0x00 used for alignment to 0x90 (nop) you can
> use gdb to disassemble that array of bytes to check any changes ...

Yeah, and I really should align the _normal_ return address (and not the
restart address).

Something like the appended, perhaps?

		Linus

===== arch/i386/kernel/entry.S 1.45 vs edited =====
--- 1.45/arch/i386/kernel/entry.S	Wed Dec 18 14:42:17 2002
+++ edited/arch/i386/kernel/entry.S	Mon Dec 23 20:02:10 2002
@@ -233,7 +233,7 @@
 #endif

 /* Points to after the "sysenter" instruction in the vsyscall page */
-#define SYSENTER_RETURN 0xffffe00a
+#define SYSENTER_RETURN 0xffffe010

 	# sysenter call handler stub
 	ALIGN
===== arch/i386/kernel/sysenter.c 1.5 vs edited =====
--- 1.5/arch/i386/kernel/sysenter.c	Sun Dec 22 21:12:23 2002
+++ edited/arch/i386/kernel/sysenter.c	Mon Dec 23 20:04:33 2002
@@ -57,12 +57,17 @@
 		0x51,			/* push %ecx */
 		0x52,			/* push %edx */
 		0x55,			/* push %ebp */
+	/* 3: backjump target */
 		0x89, 0xe5,		/* movl %esp,%ebp */
 		0x0f, 0x34,		/* sysenter */
-		0x00,			/* align return point */
-	/* System call restart point is here! (SYSENTER_RETURN - 2) */
-		0xeb, 0xfa,		/* jmp to "movl %esp,%ebp" */
-	/* System call normal return point is here! (SYSENTER_RETURN in entry.S) */
+
+	/* 7: align return point with nop's to make disassembly easier */
+		0x90, 0x90, 0x90, 0x90,
+		0x90, 0x90, 0x90,
+
+	/* 14: System call restart point is here! (SYSENTER_RETURN - 2) */
+		0xeb, 0xf3,		/* jmp to "movl %esp,%ebp" */
+	/* 16: System call normal return point is here! (SYSENTER_RETURN in entry.S) */
 		0x5d,			/* pop %ebp */
 		0x5a,			/* pop %edx */
 		0x59,			/* pop %ecx */
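
As a rough cross-check of the layout in this patch, the sketch below rebuilds the byte array and verifies that the normal return point is 16-byte aligned and that the restart jump two bytes before it lands on "movl %esp,%ebp". The trailing "ret" byte is taken from the unchanged context of the earlier patch; `VSYSCALL_BASE` and `trampoline_layout_ok` are names made up for illustration, and the base address is assumed from the "call 0xffffe000" mentioned earlier in the thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define VSYSCALL_BASE	0xffffe000u	/* assumed page base */
#define SYSENTER_RETURN	0xffffe010u	/* value from the entry.S hunk */

static const unsigned char sysent[] = {
	0x51,			/*  0: push %ecx */
	0x52,			/*  1: push %edx */
	0x55,			/*  2: push %ebp */
	0x89, 0xe5,		/*  3: movl %esp,%ebp (backjump target) */
	0x0f, 0x34,		/*  5: sysenter */
	0x90, 0x90, 0x90, 0x90,	/*  7: nop padding */
	0x90, 0x90, 0x90,
	0xeb, 0xf3,		/* 14: restart point, jmp back to offset 3 */
	0x5d,			/* 16: normal return point, pop %ebp */
	0x5a,			/* 17: pop %edx */
	0x59,			/* 18: pop %ecx */
	0xc3			/* 19: ret */
};

/* Return 1 if the trampoline bytes agree with SYSENTER_RETURN, else 0. */
static int trampoline_layout_ok(const unsigned char *code, size_t len)
{
	size_t ret_off = SYSENTER_RETURN - VSYSCALL_BASE;
	size_t jmp_off = ret_off - 2;	/* restart point is SYSENTER_RETURN - 2 */
	long target;

	assert(code != NULL);
	if (ret_off >= len || ret_off % 16 != 0)
		return 0;
	if (code[jmp_off] != 0xeb)	/* short jmp opcode */
		return 0;
	/* rel8 is relative to the instruction after the jmp, i.e. ret_off */
	target = (long)ret_off + (int8_t)code[jmp_off + 1];
	if (target < 0 || (size_t)target + 1 >= len)
		return 0;
	if (code[target] != 0x89 || code[target + 1] != 0xe5)
		return 0;		/* must land on "movl %esp,%ebp" */
	return code[ret_off] == 0x5d;	/* "pop %ebp" at the return point */
}
```

A check like this would also have flagged the earlier 0xfa displacement, which pointed into the middle of the movl.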


^ permalink raw reply	[flat|nested] 268+ messages in thread

* Re: Intel P6 vs P7 system call performance
  2002-12-24  4:10                               ` Linus Torvalds
@ 2002-12-24  8:05                                 ` Rogier Wolff
  2002-12-24 18:51                                   ` Linus Torvalds
  2002-12-27 16:14                                 ` Kai Henningsen
  1 sibling, 1 reply; 268+ messages in thread
From: Rogier Wolff @ 2002-12-24  8:05 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Stephen Rothwell, Petr Vandrovec, lk, Ingo Molnar, drepper, bart,
	davej, hpa, terje.eggestad, matti.aarnio, hugh, linux-kernel

On Mon, Dec 23, 2002 at 08:10:14PM -0800, Linus Torvalds wrote:
> 
> 
> On Tue, 24 Dec 2002, Stephen Rothwell wrote:
> >
> > And if you change the 0x00 used for alignment to 0x90 (nop) you can
> > use gdb to disassemble that array of bytes to check any changes ...
> 
> Yeah, and I really should align the _normal_ return address (and not the
> restart address).
> 
> Something like the appended, perhaps?
> 
> 		Linus
> 
> ===== arch/i386/kernel/entry.S 1.45 vs edited =====
> --- 1.45/arch/i386/kernel/entry.S	Wed Dec 18 14:42:17 2002
> +++ edited/arch/i386/kernel/entry.S	Mon Dec 23 20:02:10 2002
> @@ -233,7 +233,7 @@
>  #endif
> 
>  /* Points to after the "sysenter" instruction in the vsyscall page */
> -#define SYSENTER_RETURN 0xffffe00a
> +#define SYSENTER_RETURN 0xffffe010
> 
>  	# sysenter call handler stub
>  	ALIGN
> ===== arch/i386/kernel/sysenter.c 1.5 vs edited =====
> --- 1.5/arch/i386/kernel/sysenter.c	Sun Dec 22 21:12:23 2002
> +++ edited/arch/i386/kernel/sysenter.c	Mon Dec 23 20:04:33 2002
> @@ -57,12 +57,17 @@
>  		0x51,			/* push %ecx */
>  		0x52,			/* push %edx */
>  		0x55,			/* push %ebp */
> +	/* 3: backjump target */
>  		0x89, 0xe5,		/* movl %esp,%ebp */
>  		0x0f, 0x34,		/* sysenter */
> -		0x00,			/* align return point */
> -	/* System call restart point is here! (SYSENTER_RETURN - 2) */
> -		0xeb, 0xfa,		/* jmp to "movl %esp,%ebp" */
> -	/* System call normal return point is here! (SYSENTER_RETURN in entry.S) */
> +
> +	/* 7: align return point with nop's to make disassembly easier */
> +		0x90, 0x90, 0x90, 0x90,
> +		0x90, 0x90, 0x90,
> +
> +	/* 14: System call restart point is here! (SYSENTER_RETURN - 2) */
> +		0xeb, 0xf3,		/* jmp to "movl %esp,%ebp" */
> +	/* 16: System call normal return point is here! (SYSENTER_RETURN in entry.S) */
>  		0x5d,			/* pop %ebp */
>  		0x5a,			/* pop %edx */
>  		0x59,			/* pop %ecx */

Ehmm, Linus, 

Why do you want to align the return point? Why are jump-targets aligned?
Because they are faster. But why are they faster? Because the
cache-line fill is more efficient: the CPU might execute those 
instructions, while it has a smaller chance of hitting the instructions
before the target. 

In this case, I'd guess we'd have more benefit from the sysenter return
prefetching the sysenter cache line, than from prefetching the bunch
of nops just behind the return from syscall.

Now this is very hard to prove using a benchmark: In the benchmark 
you'll quite likely run from a hot cache, and the cache line
effects are the things we would want to measure. 

			Roger.

-- 
** R.E.Wolff@BitWizard.nl ** http://www.BitWizard.nl/ ** +31-15-2600998 **
*-- BitWizard writes Linux device drivers for any device you may have! --*
* The Worlds Ecosystem is a stable system. Stable systems may experience *
* excursions from the stable situation. We are currently in such an      * 
* excursion: The stable situation does not include humans. ***************

^ permalink raw reply	[flat|nested] 268+ messages in thread

* Re: Intel P6 vs P7 system call performance
  2002-12-24  8:05                                 ` Rogier Wolff
@ 2002-12-24 18:51                                   ` Linus Torvalds
  2002-12-24 21:10                                     ` Rogier Wolff
  0 siblings, 1 reply; 268+ messages in thread
From: Linus Torvalds @ 2002-12-24 18:51 UTC (permalink / raw)
  To: Rogier Wolff
  Cc: Stephen Rothwell, Petr Vandrovec, lk, Ingo Molnar, drepper, bart,
	davej, hpa, terje.eggestad, matti.aarnio, hugh, linux-kernel



On Tue, 24 Dec 2002, Rogier Wolff wrote:
>
> Ehmm, Linus,
>
> Why do you want to align the return point? Why are jump-targets aligned?
> Because they are faster. But why are they faster? Because the
> cache-line fill is more efficient: the CPU might execute those
> instructions, while it has a smaller chance of hitting  the instructions
> before the target.

Actually, no. Many CPU's apparently also have issues with instruction
decoding etc, where certain alignments (4 or 8-byte aligned) are better
simply because they feed the decode logic more efficiently.

Everything here fits in one cache-line, so clearly the cacheline issues
don't matter.

		Linus
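
[Both claims in this reply can be checked with a little address arithmetic.
The sketch below is only an illustration: the helper names are invented, the
19-byte length counts just the trampoline bytes visible in the patch, and
the 32-byte line size is an assumption (a common size for P6-class parts;
larger lines only make the check easier to pass).]

```c
#include <assert.h>
#include <stdint.h>

#define VSYSCALL_BASE	0xffffe000u	/* page address used in this thread */

/* Non-zero if addr is a multiple of align; align must be a power of two. */
int is_aligned(uint32_t addr, uint32_t align)
{
	assert(align != 0 && (align & (align - 1)) == 0);
	return (addr & (align - 1)) == 0;
}

/* Non-zero if [start, start + len) lies within one line of line_size bytes. */
int in_one_cache_line(uint32_t start, uint32_t len, uint32_t line_size)
{
	assert(len > 0 && line_size > 0);
	return (start / line_size) == ((start + len - 1) / line_size);
}
```

[With these helpers, is_aligned(0xffffe010, 8) holds for the new return
point while is_aligned(0xffffe00a, 4) does not hold for the old one, and
in_one_cache_line(VSYSCALL_BASE, 19, 32) holds under the assumed line size.]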


^ permalink raw reply	[flat|nested] 268+ messages in thread

* Re: Intel P6 vs P7 system call performance
  2002-12-24 18:51                                   ` Linus Torvalds
@ 2002-12-24 21:10                                     ` Rogier Wolff
  0 siblings, 0 replies; 268+ messages in thread
From: Rogier Wolff @ 2002-12-24 21:10 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Rogier Wolff, Stephen Rothwell, Petr Vandrovec, lk, Ingo Molnar,
	drepper, bart, davej, hpa, terje.eggestad, matti.aarnio, hugh,
	linux-kernel

On Tue, Dec 24, 2002 at 10:51:11AM -0800, Linus Torvalds wrote:
> 
> Everything here fits in one cache-line, so clearly the cacheline issues
> don't matter.

I'm getting old. Larger cache lines, you're right. 

			Roger. 

-- 
** R.E.Wolff@BitWizard.nl ** http://www.BitWizard.nl/ ** +31-15-2600998 **
*-- BitWizard writes Linux device drivers for any device you may have! --*
* The Worlds Ecosystem is a stable system. Stable systems may experience *
* excursions from the stable situation. We are currently in such an      * 
* excursion: The stable situation does not include humans. ***************

^ permalink raw reply	[flat|nested] 268+ messages in thread

* Re: Intel P6 vs P7 system call performance
  2002-12-24  4:10                               ` Linus Torvalds
  2002-12-24  8:05                                 ` Rogier Wolff
@ 2002-12-27 16:14                                 ` Kai Henningsen
  1 sibling, 0 replies; 268+ messages in thread
From: Kai Henningsen @ 2002-12-27 16:14 UTC (permalink / raw)
  To: torvalds; +Cc: linux-kernel

torvalds@transmeta.com (Linus Torvalds)  wrote on 23.12.02 in <Pine.LNX.4.44.0212232005080.2328-100000@home.transmeta.com>:

> Something like the appended, perhaps?
>
> 		Linus
>
> ===== arch/i386/kernel/entry.S 1.45 vs edited =====
> --- 1.45/arch/i386/kernel/entry.S	Wed Dec 18 14:42:17 2002
> +++ edited/arch/i386/kernel/entry.S	Mon Dec 23 20:02:10 2002
> @@ -233,7 +233,7 @@
>  #endif
>
>  /* Points to after the "sysenter" instruction in the vsyscall page */
> -#define SYSENTER_RETURN 0xffffe00a
> +#define SYSENTER_RETURN 0xffffe010
>
>  	# sysenter call handler stub
>  	ALIGN
> ===== arch/i386/kernel/sysenter.c 1.5 vs edited =====
> --- 1.5/arch/i386/kernel/sysenter.c	Sun Dec 22 21:12:23 2002
> +++ edited/arch/i386/kernel/sysenter.c	Mon Dec 23 20:04:33 2002
> @@ -57,12 +57,17 @@
>  		0x51,			/* push %ecx */
>  		0x52,			/* push %edx */
>  		0x55,			/* push %ebp */
> +	/* 3: backjump target */
>  		0x89, 0xe5,		/* movl %esp,%ebp */
>  		0x0f, 0x34,		/* sysenter */
> -		0x00,			/* align return point */

Also 0x90 here?

> -	/* System call restart point is here! (SYSENTER_RETURN - 2) */
> -		0xeb, 0xfa,		/* jmp to "movl %esp,%ebp" */
> -	/* System call normal return point is here! (SYSENTER_RETURN in entry.S) */
> +
> +	/* 7: align return point with nop's to make disassembly easier */
> +		0x90, 0x90, 0x90, 0x90,
> +		0x90, 0x90, 0x90,
> +
> +	/* 14: System call restart point is here! (SYSENTER_RETURN - 2) */
> +		0xeb, 0xf3,		/* jmp to "movl %esp,%ebp" */
> +	/* 16: System call normal return point is here! (SYSENTER_RETURN in entry.S) */
>  		0x5d,			/* pop %ebp */
>  		0x5a,			/* pop %edx */
>  		0x59,			/* pop %ecx */


MfG Kai

^ permalink raw reply	[flat|nested] 268+ messages in thread

* Re: Intel P6 vs P7 system call performance
  2002-12-22 10:13                     ` Ingo Molnar
  2002-12-22 15:32                       ` Jamie Lokier
  2002-12-22 18:53                       ` Linus Torvalds
@ 2002-12-24 19:36                       ` Linus Torvalds
  2002-12-24 20:20                         ` Ingo Molnar
                                           ` (3 more replies)
  2 siblings, 4 replies; 268+ messages in thread
From: Linus Torvalds @ 2002-12-24 19:36 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Jamie Lokier, Ulrich Drepper, bart, davej, hpa, terje.eggestad,
	matti.aarnio, hugh, linux-kernel

Ok, one final optimization.

We have traditionally held ES/DS constant at __KERNEL_DS in the kernel,
and we've used that to avoid saving unnecessary segment registers over
context switches etc.

I realized that there is really no reason to use __KERNEL_DS for this, and
that as far as the kernel is concerned, the only thing that matters is
that it's a flat 32-bit segment. So we might as well make the kernel
always load ES/DS with __USER_DS instead, which has the advantage that we
can avoid one set of segment loads for the "sysenter/sysexit" case.

(We still need to load ES/DS at entry to the kernel, since we cannot rely
on user space not trying to do strange things. But once we load them with
__USER_DS, we at least don't need to restore them on return to user mode
any more, since "sysenter" only works in a flat 32-bit user mode anyway
(*)).

This doesn't matter much for a P4 (surprisingly, a P4 does very well
indeed on segment loads), but it does make a difference on PIII-class
CPU's.

This makes a PIII do a "getpid()" system call in something like 160
cycles (a P4 is at 430 cycles, oh well).

Ingo, would you mind taking a look at the patch, to see if you see any
paths where we don't follow the new segment register rules. It looks like
swsuspend isn't properly saving and restoring segment register contents,
so that will need double-checking (it wasn't correct before either, so
this doesn't make it any worse, at least).

			Linus

(*) We could avoid even that initial load by instead _testing_ that the
values are the correct ones and jumping out if not, but I worry about vm86
mode being able to fool us with segments that have the right selectors but
the wrong segment caches. I disabled sysenter for vm86 mode, but it's so
subtle that I prefer just doing the segment loads rather than doing two
moves and comparisons.

###########################################
# The following is the BitKeeper ChangeSet Log
# --------------------------------------------
# 02/12/24	torvalds@home.transmeta.com	1.953
# Make the default values for DS/ES be the _user_ segment descriptors
# on x86 - the kernel doesn't really care (as long as it's all flat 32-bit),
# and it means that the return path for sysenter/sysexit can avoid re-loading
# the segment registers.
#
# NOTE! This means that _all_ kernel code (not just the sysenter path) must
# be appropriately changed, since the kernel knows the conventions and doesn't
# save/restore DS/ES internally on context switches etc.
# --------------------------------------------
#
diff -Nru a/arch/i386/kernel/entry.S b/arch/i386/kernel/entry.S
--- a/arch/i386/kernel/entry.S	Tue Dec 24 11:34:28 2002
+++ b/arch/i386/kernel/entry.S	Tue Dec 24 11:34:28 2002
@@ -91,18 +91,21 @@
 	pushl %edx; \
 	pushl %ecx; \
 	pushl %ebx; \
-	movl $(__KERNEL_DS), %edx; \
+	movl $(__USER_DS), %edx; \
 	movl %edx, %ds; \
 	movl %edx, %es;

-#define RESTORE_REGS	\
+#define RESTORE_INT_REGS \
 	popl %ebx;	\
 	popl %ecx;	\
 	popl %edx;	\
 	popl %esi;	\
 	popl %edi;	\
 	popl %ebp;	\
-	popl %eax;	\
+	popl %eax
+
+#define RESTORE_REGS	\
+	RESTORE_INT_REGS; \
 1:	popl %ds;	\
 2:	popl %es;	\
 .section .fixup,"ax";	\
@@ -271,9 +274,9 @@
 	movl TI_FLAGS(%ebx), %ecx
 	testw $_TIF_ALLWORK_MASK, %cx
 	jne syscall_exit_work
-	RESTORE_REGS
-	movl 4(%esp),%edx
-	movl 16(%esp),%ecx
+	RESTORE_INT_REGS
+	movl 12(%esp),%edx
+	movl 24(%esp),%ecx
 	sti
 	sysexit

@@ -428,7 +431,7 @@
 	movl %esp, %edx
 	pushl %esi			# push the error code
 	pushl %edx			# push the pt_regs pointer
-	movl $(__KERNEL_DS), %edx
+	movl $(__USER_DS), %edx
 	movl %edx, %ds
 	movl %edx, %es
 	call *%edi
diff -Nru a/arch/i386/kernel/head.S b/arch/i386/kernel/head.S
--- a/arch/i386/kernel/head.S	Tue Dec 24 11:34:28 2002
+++ b/arch/i386/kernel/head.S	Tue Dec 24 11:34:28 2002
@@ -235,12 +235,15 @@
 	lidt idt_descr
 	ljmp $(__KERNEL_CS),$1f
 1:	movl $(__KERNEL_DS),%eax	# reload all the segment registers
-	movl %eax,%ds		# after changing gdt.
+	movl %eax,%ss			# after changing gdt.
+
+	movl $(__USER_DS),%eax		# DS/ES contains default USER segment
+	movl %eax,%ds
 	movl %eax,%es
+
+	xorl %eax,%eax			# Clear FS/GS and LDT
 	movl %eax,%fs
 	movl %eax,%gs
-	movl %eax,%ss
-	xorl %eax,%eax
 	lldt %ax
 	cld			# gcc2 wants the direction flag cleared at all times
 #ifdef CONFIG_SMP
diff -Nru a/arch/i386/kernel/process.c b/arch/i386/kernel/process.c
--- a/arch/i386/kernel/process.c	Tue Dec 24 11:34:28 2002
+++ b/arch/i386/kernel/process.c	Tue Dec 24 11:34:28 2002
@@ -219,8 +219,8 @@
 	regs.ebx = (unsigned long) fn;
 	regs.edx = (unsigned long) arg;

-	regs.xds = __KERNEL_DS;
-	regs.xes = __KERNEL_DS;
+	regs.xds = __USER_DS;
+	regs.xes = __USER_DS;
 	regs.orig_eax = -1;
 	regs.eip = (unsigned long) kernel_thread_helper;
 	regs.xcs = __KERNEL_CS;
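
[Note on the changed stack offsets in the sysexit path: RESTORE_INT_REGS
stops after popping %eax, so the saved DS and ES words are still on the
stack and every later slot sits 8 bytes further from %esp than it did after
the full RESTORE_REGS. The sketch below recomputes the offsets from a struct
that mirrors the register save order. The struct is an assumption about the
frame layout written for illustration, not a copy of the kernel's pt_regs
definition, and the macro names are made up.]

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed save order of the i386 entry code, lowest address first. */
struct pt_regs_sketch {
	uint32_t ebx, ecx, edx, esi, edi, ebp, eax;	/* popped by RESTORE_INT_REGS */
	uint32_t xds, xes;				/* popped only by RESTORE_REGS */
	uint32_t orig_eax;
	uint32_t eip, xcs, eflags, esp, xss;		/* return frame */
};

/* Offset of a field from %esp once RESTORE_INT_REGS has popped up to eax. */
#define OFF_AFTER_INT_REGS(field) \
	(offsetof(struct pt_regs_sketch, field) - \
	 offsetof(struct pt_regs_sketch, xds))

/* Offset of a field from %esp once RESTORE_REGS has also popped ds and es. */
#define OFF_AFTER_ALL_REGS(field) \
	(offsetof(struct pt_regs_sketch, field) - \
	 offsetof(struct pt_regs_sketch, orig_eax))

/* sysexit takes the return EIP from %edx and the user ESP from %ecx. */
static_assert(OFF_AFTER_ALL_REGS(eip) == 4,  "old: movl 4(%esp),%edx");
static_assert(OFF_AFTER_ALL_REGS(esp) == 16, "old: movl 16(%esp),%ecx");
static_assert(OFF_AFTER_INT_REGS(eip) == 12, "new: movl 12(%esp),%edx");
static_assert(OFF_AFTER_INT_REGS(esp) == 24, "new: movl 24(%esp),%ecx");
```

[Under that assumed layout the old offsets 4 and 16 and the new offsets 12
and 24 all name the same two slots, the saved EIP and the saved user ESP.]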

^ permalink raw reply	[flat|nested] 268+ messages in thread

* Re: Intel P6 vs P7 system call performance
  2002-12-24 19:36                       ` Linus Torvalds
@ 2002-12-24 20:20                         ` Ingo Molnar
  2002-12-24 20:27                           ` Linus Torvalds
  2002-12-24 20:31                         ` Ingo Molnar
                                           ` (2 subsequent siblings)
  3 siblings, 1 reply; 268+ messages in thread
From: Ingo Molnar @ 2002-12-24 20:20 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jamie Lokier, Ulrich Drepper, bart, davej, hpa, terje.eggestad,
	matti.aarnio, hugh, linux-kernel


On Tue, 24 Dec 2002, Linus Torvalds wrote:

> Ingo, would you mind taking a look at the patch, to see if you see any
> paths where we don't follow the new segment register rules. It looks
> like swsuspend isn't properly saving and restoring segment register
> contents, so that will need double-checking (it wasn't correct before
> either, so this doesn't make it any worse, at least).

this reminds me of another related bug that is not fixed yet, one that
caused XFree86 to crash if it was linked against the new libpthreads - in
vm86 mode we did not save/restore %gs [and %fs] properly, which breaks
new-style threading. The attached patch is against the 2.4 backport of the
threading stuff, i'll do a 2.5 patch after christmas eve :-)

	Ingo

--- linux/include/asm-i386/processor.h.orig	2002-12-06 11:49:24.000000000 +0100
+++ linux/include/asm-i386/processor.h	2002-12-06 11:52:39.000000000 +0100
@@ -388,6 +388,7 @@
 	struct vm86_struct	* vm86_info;
 	unsigned long		screen_bitmap;
 	unsigned long		v86flags, v86mask, saved_esp0;
+	unsigned int		saved_fs, saved_gs;
 /* IO permissions */
 	int		ioperm;
 	unsigned long	io_bitmap[IO_BITMAP_SIZE+1];
--- linux/arch/i386/kernel/vm86.c.orig	2002-12-06 11:50:26.000000000 +0100
+++ linux/arch/i386/kernel/vm86.c	2002-12-06 11:53:40.000000000 +0100
@@ -113,6 +113,8 @@
 	tss = init_tss + smp_processor_id();
 	tss->esp0 = current->thread.esp0 = current->thread.saved_esp0;
 	current->thread.saved_esp0 = 0;
+	loadsegment(fs, current->thread.saved_fs);
+	loadsegment(gs, current->thread.saved_gs);
 	ret = KVM86->regs32;
 	return ret;
 }
@@ -277,6 +279,9 @@
  */
 	info->regs32->eax = 0;
 	tsk->thread.saved_esp0 = tsk->thread.esp0;
+	asm volatile("movl %%fs,%0":"=m" (tsk->thread.saved_fs));
+	asm volatile("movl %%gs,%0":"=m" (tsk->thread.saved_gs));
+
 	tss = init_tss + smp_processor_id();
 	tss->esp0 = tsk->thread.esp0 = (unsigned long) &info->VM86_TSS_ESP0;
 


^ permalink raw reply	[flat|nested] 268+ messages in thread

* Re: Intel P6 vs P7 system call performance
  2002-12-24 20:20                         ` Ingo Molnar
@ 2002-12-24 20:27                           ` Linus Torvalds
  0 siblings, 0 replies; 268+ messages in thread
From: Linus Torvalds @ 2002-12-24 20:27 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Jamie Lokier, Ulrich Drepper, bart, davej, hpa, terje.eggestad,
	matti.aarnio, hugh, linux-kernel



On Tue, 24 Dec 2002, Ingo Molnar wrote:
>
> this reminds me of another related matter that is not fixed yet, which bug
> caused XFree86 to crash if it was linked against the new libpthreads - in
> vm86 mode we did not save/restore %gs [and %fs] properly, which breaks
> new-style threading. The attached patch is against the 2.4 backport of the
> threading stuff, i'll do a 2.5 patch after christmas eve :-)

Actually, pretty much nothing has changed in vm86 handling, so the patch
should work fine as-is on 2.5.x too.

		Linus


^ permalink raw reply	[flat|nested] 268+ messages in thread

* Re: Intel P6 vs P7 system call performance
  2002-12-24 19:36                       ` Linus Torvalds
  2002-12-24 20:20                         ` Ingo Molnar
@ 2002-12-24 20:31                         ` Ingo Molnar
  2002-12-24 20:39                           ` Linus Torvalds
  2002-12-28  2:04                           ` H. Peter Anvin
  2002-12-26  7:47                         ` Pavel Machek
  2003-01-10 11:30                         ` Gabriel Paubert
  3 siblings, 2 replies; 268+ messages in thread
From: Ingo Molnar @ 2002-12-24 20:31 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jamie Lokier, Ulrich Drepper, bart, davej, hpa, terje.eggestad,
	Matti Aarnio, hugh, linux-kernel


On Tue, 24 Dec 2002, Linus Torvalds wrote:

> I realized that there is really no reason to use __KERNEL_DS for this,
> and that as far as the kernel is concerned, the only thing that matters
> is that it's a flat 32-bit segment. So we might as well make the kernel
> always load ES/DS with __USER_DS instead, which has the advantage that
> we can avoid one set of segment loads for the "sysenter/sysexit" case.

this basically hardcodes flat segment layout on x86. If anything (Wine?)
modifies the default segments, it can wrap syscalls by saving/restoring
the modified %ds and %es selectors explicitly.

	Ingo


^ permalink raw reply	[flat|nested] 268+ messages in thread

* Re: Intel P6 vs P7 system call performance
  2002-12-24 20:31                         ` Ingo Molnar
@ 2002-12-24 20:39                           ` Linus Torvalds
  2002-12-28  2:05                             ` H. Peter Anvin
  2002-12-28  2:04                           ` H. Peter Anvin
  1 sibling, 1 reply; 268+ messages in thread
From: Linus Torvalds @ 2002-12-24 20:39 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Jamie Lokier, Ulrich Drepper, bart, davej, hpa, terje.eggestad,
	Matti Aarnio, hugh, linux-kernel

On Tue, 24 Dec 2002, Ingo Molnar wrote:
>
> this basically hardcodes flat segment layout on x86. If anything (Wine?)
> modifies the default segments, it can wrap syscalls by saving/restoring
> the modified %ds and %es selectors explicitly.

Note that that was true even before this patch - you cannot use glibc
without having the default DS/ES settings anyway. I not only checked with
Uli, but gcc simply cannot generate code that has different segments for
stack and data, so if you have non-flat segments you had to either

 - flatten them out before calling the standard library
 - do your system calls directly by hand

And note how both of these still work fine (if you flatten things out it
trivially works, and if you do system calls by hand the old "int 0x80"
approach obviously doesn't change anything, and non-flat still works).

So the new code really only takes advantage of the fact that non-flat
wouldn't have worked with glibc in the first place, and without glibc you
don't see any difference in behaviour since it won't be using the new
calling convention.

				Linus

^ permalink raw reply	[flat|nested] 268+ messages in thread

* Re: Intel P6 vs P7 system call performance
  2002-12-24 20:39                           ` Linus Torvalds
@ 2002-12-28  2:05                             ` H. Peter Anvin
  0 siblings, 0 replies; 268+ messages in thread
From: H. Peter Anvin @ 2002-12-28  2:05 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ingo Molnar, Jamie Lokier, Ulrich Drepper, bart, davej,
	terje.eggestad, Matti Aarnio, hugh, linux-kernel

Linus Torvalds wrote:
> Note that that was true even before this patch - you cannot use glibc
> without having the default DS/ES settings anyway. I not only checked with
> Uli, but gcc simply cannot generate code that has different segments for
> stack and data, so if you have non-flat segments you had to either

More importantly, SYSENTER hardcodes flat layout.

	-hpa



^ permalink raw reply	[flat|nested] 268+ messages in thread

* Re: Intel P6 vs P7 system call performance
  2002-12-24 20:31                         ` Ingo Molnar
  2002-12-24 20:39                           ` Linus Torvalds
@ 2002-12-28  2:04                           ` H. Peter Anvin
  1 sibling, 0 replies; 268+ messages in thread
From: H. Peter Anvin @ 2002-12-28  2:04 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, Jamie Lokier, Ulrich Drepper, bart, davej,
	terje.eggestad, Matti Aarnio, hugh, linux-kernel

Ingo Molnar wrote:
> On Tue, 24 Dec 2002, Linus Torvalds wrote:
> 
> 
>>I realized that there is really no reason to use __KERNEL_DS for this,
>>and that as far as the kernel is concerned, the only thing that matters
>>is that it's a flat 32-bit segment. So we might as well make the kernel
>>always load ES/DS with __USER_DS instead, which has the advantage that
>>we can avoid one set of segment loads for the "sysenter/sysexit" case.
> 
> 
> this basically hardcodes flat segment layout on x86. If anything (Wine?)
> modifies the default segments, it can wrap syscalls by saving/restoring
> the modified %ds and %es selectors explicitly.
> 

I don't think you can modify the GDT segments.

	-hpa

P.S. Please don't use my @transmeta.com address for non-Transmeta 
business.  I'm trying very hard to keep my mailboxes semi-organized.




^ permalink raw reply	[flat|nested] 268+ messages in thread

* Re: Intel P6 vs P7 system call performance
  2002-12-24 19:36                       ` Linus Torvalds
  2002-12-24 20:20                         ` Ingo Molnar
  2002-12-24 20:31                         ` Ingo Molnar
@ 2002-12-26  7:47                         ` Pavel Machek
  2003-01-10 11:30                         ` Gabriel Paubert
  3 siblings, 0 replies; 268+ messages in thread
From: Pavel Machek @ 2002-12-26  7:47 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ingo Molnar, Jamie Lokier, Ulrich Drepper, bart, davej, hpa,
	terje.eggestad, matti.aarnio, hugh, linux-kernel

Hi!

> Ok, one final optimization.
> 
> We have traditionally held ES/DS constant at __KERNEL_DS in the kernel,
> and we've used that to avoid saving unnecessary segment registers over
> context switches etc.
> 
> I realized that there is really no reason to use __KERNEL_DS for this, and
> that as far as the kernel is concerned, the only thing that matters is
> that it's a flat 32-bit segment. So we might as well make the kernel
> always load ES/DS with __USER_DS instead, which has the advantage that we
> can avoid one set of segment loads for the "sysenter/sysexit" case.
> 
> (We still need to load ES/DS at entry to the kernel, since we cannot rely
> on user space not trying to do strange things. But once we load them with
> __USER_DS, we at least don't need to restore them on return to user mode
> any more, since "sysenter" only works in a flat 32-bit user mode anyway
> (*)).
> 
> This doesn't matter much for a P4 (surprisingly, a P4 does very well
> indeed on segment loads), but it does make a difference on PIII-class
> CPU's.
> 
> This makes a PIII do a "getpid()" system call in something like 160
> cycles (a P4 is at 430 cycles, oh well).
> 
> Ingo, would you mind taking a look at the patch, to see if you see any
> paths where we don't follow the new segment register rules. It looks like
> swsuspend isn't properly saving and restoring segment register contents,
> so that will need double-checking (it wasn't correct before either, so
> this doesn't make it any worse, at least).

Does this look like fixing it?
								Pavel

--- clean/arch/i386/kernel/suspend_asm.S	2002-12-18 22:20:47.000000000 +0100
+++ linux-swsusp/arch/i386/kernel/suspend_asm.S	2002-12-26 08:45:34.000000000 +0100
@@ -64,9 +64,10 @@
 	jb .L1455
 	.p2align 4,,7
 .L1453:
-	movl $104,%eax
+	movl $__USER_DS,%eax
 
 	movw %ax, %ds
+	movw %ax, %es
 	movl saved_context_esp, %esp
 	movl saved_context_ebp, %ebp
 	movl saved_context_eax, %eax
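
[Note on the magic number replaced above: an x86 segment selector packs a
descriptor table index, a table indicator bit and a requested privilege
level into 16 bits, so a bare constant like 104 can be decoded by hand. The
helpers below are a sketch with invented names. Which GDT entry the kernel
uses for which segment is not something the sketch knows; any entry number
passed in is the caller's assumption.]

```c
#include <assert.h>
#include <stdint.h>

struct selector_fields {
	unsigned index;	/* descriptor table entry number */
	unsigned ti;	/* table indicator: 0 = GDT, 1 = LDT */
	unsigned rpl;	/* requested privilege level, 0..3 */
};

/* Split a selector into its three architectural fields. */
struct selector_fields decode_selector(uint16_t sel)
{
	struct selector_fields f = {
		.index = sel >> 3,
		.ti    = (sel >> 2) & 1,
		.rpl   = sel & 3,
	};
	return f;
}

/* Build a selector from its fields; the index field is 13 bits wide. */
uint16_t make_selector(unsigned index, unsigned ti, unsigned rpl)
{
	assert(index < 8192);
	return (uint16_t)((index << 3) | ((ti & 1) << 2) | (rpl & 3));
}
```

[decode_selector(104) gives GDT entry 13 with RPL 0, i.e. a ring-0 selector.
A user data selector carries RPL 3; assuming, for illustration only, that
the user data segment sits in GDT entry 15, make_selector(15, 0, 3) gives
123 (0x7b).]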


-- 
Worst form of spam? Adding advertisment signatures ala sourceforge.net.
What goes next? Inserting advertisment *into* email?

^ permalink raw reply	[flat|nested] 268+ messages in thread

* Re: Intel P6 vs P7 system call performance
  2002-12-24 19:36                       ` Linus Torvalds
                                           ` (2 preceding siblings ...)
  2002-12-26  7:47                         ` Pavel Machek
@ 2003-01-10 11:30                         ` Gabriel Paubert
  2003-01-10 17:11                           ` Linus Torvalds
  3 siblings, 1 reply; 268+ messages in thread
From: Gabriel Paubert @ 2003-01-10 11:30 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ingo Molnar, Jamie Lokier, Ulrich Drepper, davej, linux-kernel

On Tue, 24 Dec 2002, Linus Torvalds wrote:

[That's old, I know. I'm slowly catching up on my email backlog after
almost 3 weeks away]

>
> Ok, one final optimization.
>
> We have traditionally held ES/DS constant at __KERNEL_DS in the kernel,
> and we've used that to avoid saving unnecessary segment registers over
> context switches etc.
>
> I realized that there is really no reason to use __KERNEL_DS for this, and
> that as far as the kernel is concerned, the only thing that matters is
> that it's a flat 32-bit segment. So we might as well make the kernel
> always load ES/DS with __USER_DS instead, which has the advantage that we
> can avoid one set of segment loads for the "sysenter/sysexit" case.
>
> (We still need to load ES/DS at entry to the kernel, since we cannot rely
> on user space not trying to do strange things. But once we load them with
> __USER_DS, we at least don't need to restore them on return to user mode
> any more, since "sysenter" only works in a flat 32-bit user mode anyway
> (*)).

We cannot rely on userspace not setting the NT bit in eflags either. While
it won't cause an oops since the only instruction which ever depends on
it, iret, has a handler (which needs to be patched, see below),
I'm absolutely not convinced that all code paths are "NT safe" ;-)

For example, set NT and then execute sysenter with garbage in %eax, the
kernel will try to return (-ENOSYS) with iret and kill the task. As long
as it only allows a task to kill itself, it's not a big deal. But NT is
not cleared across task switches unless I miss something, and that looks
very dangerous.

It's so complex that I'm not sure that clearing NT in __switch_to is
sufficient, but clearing it in every sysenter path will make clock cycles
accountants scream (the only way is through popfl).

>
> This doesn't matter much for a P4 (surprisingly, a P4 does very well
> indeed on segment loads), but it does make a difference on PIII-class
> CPU's.
>
> This makes a PIII do a "getpid()" system call in something like 160
> cycles (a P4 is at 430 cycles, oh well).
>
> Ingo, would you mind taking a look at the patch, to see if you see any
> paths where we don't follow the new segment register rules. It looks like
> swsuspend isn't properly saving and restoring segment register contents,
> so that will need double-checking (it wasn't correct before either, so
> this doesn't make it any worse, at least).

I'm no Ingo, unfortunately, but you'll need at least the following patch
(the second hunk is only a typo fix) to the iret exception recovery code,
which used push and pops to get the smallest possible code size.

That's a minimal patch, let me know if you prefer to have a single copy of
the exception handler for all instances of RESTORE_ALL.

===== entry.S 1.49 vs edited =====
--- 1.49/arch/i386/kernel/entry.S	Sat Jan  4 19:06:07 2003
+++ edited/entry.S	Fri Jan 10 02:12:00 2003
@@ -126,10 +126,9 @@
 	addl $4, %esp;	\
 1:	iret;		\
 .section .fixup,"ax";   \
-2:	pushl %ss;	\
-	popl %ds;	\
-	pushl %ss;	\
-	popl %es;	\
+2:	movl $(__USER_DS), %edx; \
+	movl %edx, %ds; \
+	movl %edx, %es; \
 	pushl $11;	\
 	call do_exit;	\
 .previous;		\
@@ -225,7 +224,7 @@
 	movl TI_FLAGS(%ebx), %ecx	# need_resched set ?
 	testb $_TIF_NEED_RESCHED, %cl
 	jz restore_all
-	testl $IF_MASK,EFLAGS(%esp)     # interrupts off (execption path) ?
+	testl $IF_MASK,EFLAGS(%esp)     # interrupts off (exception path) ?
 	jz restore_all
 	movl $PREEMPT_ACTIVE,TI_PRE_COUNT(%ebx)
 	sti

	Regards,
	Gabriel.
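
[Note on the NT discussion above: NT (Nested Task) is bit 14 of EFLAGS. The
sketch below only manipulates a saved copy of the flags word, which is the
kind of masking a pushfl / andl / popfl sequence would perform. The macro
and function names are invented for illustration; nothing here is kernel
code.]

```c
#include <assert.h>
#include <stdint.h>

#define EFLAGS_IF	0x00000200u	/* Interrupt enable, bit 9 */
#define EFLAGS_NT	0x00004000u	/* Nested Task, bit 14 */

static_assert(EFLAGS_NT == (1u << 14), "NT is bit 14 of EFLAGS");

/* Non-zero if the saved flags image has NT set. */
int nt_is_set(uint32_t eflags)
{
	return (eflags & EFLAGS_NT) != 0;
}

/* Return the flags image with NT cleared and every other bit untouched. */
uint32_t clear_nt(uint32_t eflags)
{
	return eflags & ~EFLAGS_NT;
}
```

[For a flags image of 0x4202 (NT and IF set, plus the always-set bit 1),
nt_is_set() is true and clear_nt() returns 0x202.]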

^ permalink raw reply	[flat|nested] 268+ messages in thread

* Re: Intel P6 vs P7 system call performance
  2003-01-10 11:30                         ` Gabriel Paubert
@ 2003-01-10 17:11                           ` Linus Torvalds
  0 siblings, 0 replies; 268+ messages in thread
From: Linus Torvalds @ 2003-01-10 17:11 UTC (permalink / raw)
  To: Gabriel Paubert
  Cc: Ingo Molnar, Jamie Lokier, Ulrich Drepper, davej, linux-kernel


On Fri, 10 Jan 2003, Gabriel Paubert wrote:
> 
> We cannot rely either on userspace not setting NT bit in eflags. While
> it won't cause an oops since the only instruction which ever depends on
> it, iret, has a handler (which needs to be patched, see below),
> I'm absolutely not convinced that all code paths are "NT safe" ;-)

It shouldn't matter.

NT is only tested by "iret", and if somebody sets NT in user space they 
get exactly what they deserve. 

> For example, set NT and then execute sysenter with garbage in %eax, the
> kernel will try to return (-ENOSYS) with iret and kill the task. As long
> as it only allows a task to kill itself, it's not a big deal. But NT is
> not cleared across task switches unless I miss something, and that looks
> very dangerous.

It _is_ cleared by task-switching these days. Or rather, it's saved and 
restored, so the original NT setter will get it restored when resumed. 

> I'm no Ingo, unfortunately, but you'll need at least the following patch
> (the second hunk is only a typo fix) to the iret exception recovery code,
> which used push and pops to get the smallest possible code size.

Good job. 

		Linus


^ permalink raw reply	[flat|nested] 268+ messages in thread

* Re: Intel P6 vs P7 system call performance
  2002-12-22  3:11                   ` Linus Torvalds
  2002-12-22 10:13                     ` Ingo Molnar
@ 2002-12-22 10:23                     ` Ingo Molnar
  1 sibling, 0 replies; 268+ messages in thread
From: Ingo Molnar @ 2002-12-22 10:23 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Nakajima, Jun, Ulrich Drepper, linux-kernel

while reviewing the sysenter trampoline code i started wondering about the
HT case. Don't HT boxes share the MSRs between logical CPUs? This pretty
much breaks the concept of per-logical-CPU sysenter trampolines. It also
makes context-switch time sysenter MSR writing impossible, so i really
hope this is not the case.

	Ingo

^ permalink raw reply	[flat|nested] 268+ messages in thread

* Re: Intel P6 vs P7 system call performance
@ 2002-12-19 13:22 bart
  2002-12-19 13:38 ` Dave Jones
  2002-12-19 19:29 ` H. Peter Anvin
  0 siblings, 2 replies; 268+ messages in thread
From: bart @ 2002-12-19 13:22 UTC (permalink / raw)
  To: torvalds
  Cc: lk, hpa, terje.eggestad, drepper, matti.aarnio, hugh, davej,
	mingo, linux-kernel

On 18 Dec, Linus Torvalds wrote:
> 
> On Wed, 18 Dec 2002, Jamie Lokier wrote:
>> 
>> That said, you always need the page at 0xffffe000 mapped anyway, so
>> that sysexit can jump to a fixed address (which is fastest).
> 
> Yes. This is important. There _needs_ to be some fixed address at least as 
> far as the kernel is concerned (it might move around between reboots or 
> something like that, but it needs to be something the kernel knows about 
> intimately and doesn't need lots of dynamic lookup).
> 
> However, there's another issue, namely process startup cost. I personally 
> want it to be as light as at all possible. I hate doing an "strace" on 
> user processes and seeing tons and tons of crapola showing up. Just for 

So why not map the magic page at 0xffffe000 at some other address as
well? 

Static binaries can just directly jump/call into the magic page.

Shared binaries do some kind of mmap("/proc/self/mem") magic to put a
copy of the page at an address that is convenient for them. Shared
binaries have to do a lot of mmap-ing anyway, so the overhead should be
negligible.




-- 
Bart Hartgers - TUE Eindhoven 
http://plasimo.phys.tue.nl/bart/contact.html

^ permalink raw reply	[flat|nested] 268+ messages in thread

* Re: Intel P6 vs P7 system call performance
  2002-12-19 13:22 bart
@ 2002-12-19 13:38 ` Dave Jones
  2002-12-19 14:22   ` Jamie Lokier
  2002-12-19 19:29 ` H. Peter Anvin
  1 sibling, 1 reply; 268+ messages in thread
From: Dave Jones @ 2002-12-19 13:38 UTC (permalink / raw)
  To: bart
  Cc: torvalds, lk, hpa, terje.eggestad, drepper, matti.aarnio, hugh,
	davej, mingo, linux-kernel

On Thu, Dec 19, 2002 at 02:22:36PM +0100, bart@etpmod.phys.tue.nl wrote:
 > > However, there's another issue, namely process startup cost. I personally 
 > > want it to be as light as at all possible. I hate doing an "strace" on 
 > > user processes and seeing tons and tons of crapola showing up. Just for 
 > So why not map the magic page at 0xffffe000 at some other address as
 > well? 
 > Static binaries can just directly jump/call into the magic page.

.. and explode nicely when you try to run them on an older kernel
without the new syscall magick. This is what Linus' first
proof-of-concept code did.

		Dave

-- 
| Dave Jones.        http://www.codemonkey.org.uk
| SuSE Labs
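
A minimal sketch of how a binary could avoid that failure, assuming the
kernel advertises the entry point through the AT_SYSINFO aux vector entry
proposed earlier in this thread. getauxval() is a later glibc convenience
(2.16 and up) and is used here only for brevity; pick_path() and the enum
names are hypothetical.

```c
#include <elf.h>        /* AT_SYSINFO */
#include <sys/auxv.h>   /* getauxval(), glibc >= 2.16 */

/* Hypothetical names: which system call entry the process should use. */
enum syscall_path { PATH_INT80, PATH_SYSINFO };

/* A zero address means the kernel supplied no fast entry point. */
enum syscall_path pick_path(unsigned long sysinfo_addr)
{
    return sysinfo_addr != 0 ? PATH_SYSINFO : PATH_INT80;
}

/* Ask the aux vector instead of jumping to a hard-coded page. */
enum syscall_path pick_path_for_this_process(void)
{
    return pick_path(getauxval(AT_SYSINFO));
}
```

On a kernel (or architecture) that does not pass AT_SYSINFO, getauxval()
returns 0 and the sketch falls back to int $0x80 instead of exploding.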

^ permalink raw reply	[flat|nested] 268+ messages in thread

* Re: Intel P6 vs P7 system call performance
  2002-12-19 13:38 ` Dave Jones
@ 2002-12-19 14:22   ` Jamie Lokier
  2002-12-19 16:56     ` Dave Jones
  0 siblings, 1 reply; 268+ messages in thread
From: Jamie Lokier @ 2002-12-19 14:22 UTC (permalink / raw)
  To: Dave Jones, bart, torvalds, hpa, terje.eggestad, drepper,
	matti.aarnio, hugh, mingo, linux-kernel

Dave Jones wrote:
>  > Static binaries can just directly jump/call into the magic page.
> 
> .. and explode nicely when you try to run them on an older kernel
> without the new syscall magick. This is what Linus' first
> proof-of-concept code did.

<evil-grin>

No, because the static binary installs a SIGSEGV handler to emulate
the magic page on older kernels :)

</evil-grin>

-- Jamie

^ permalink raw reply	[flat|nested] 268+ messages in thread

* Re: Intel P6 vs P7 system call performance
  2002-12-19 14:22   ` Jamie Lokier
@ 2002-12-19 16:56     ` Dave Jones
  0 siblings, 0 replies; 268+ messages in thread
From: Dave Jones @ 2002-12-19 16:56 UTC (permalink / raw)
  To: Jamie Lokier
  Cc: bart, torvalds, hpa, terje.eggestad, drepper, matti.aarnio, hugh,
	mingo, linux-kernel

On Thu, Dec 19, 2002 at 02:22:12PM +0000, Jamie Lokier wrote:

 > <evil-grin>
 > No, because the static binary installs a SIGSEGV handler to emulate
 > the magic page on older kernels :)
 > </evil-grin>

You're a sick man. Really. 8)

		Dave

-- 
| Dave Jones.        http://www.codemonkey.org.uk

^ permalink raw reply	[flat|nested] 268+ messages in thread

* Re: Intel P6 vs P7 system call performance
  2002-12-19 13:22 bart
  2002-12-19 13:38 ` Dave Jones
@ 2002-12-19 19:29 ` H. Peter Anvin
  1 sibling, 0 replies; 268+ messages in thread
From: H. Peter Anvin @ 2002-12-19 19:29 UTC (permalink / raw)
  To: bart
  Cc: torvalds, lk, terje.eggestad, drepper, matti.aarnio, hugh, davej,
	mingo, linux-kernel

bart@etpmod.phys.tue.nl wrote:
> On 18 Dec, Linus Torvalds wrote:
> 
>>On Wed, 18 Dec 2002, Jamie Lokier wrote:
>>
>>>That said, you always need the page at 0xffffe000 mapped anyway, so
>>>that sysexit can jump to a fixed address (which is fastest).
>>
>>Yes. This is important. There _needs_ to be some fixed address at least as 
>>far as the kernel is concerned (it might move around between reboots or 
>>something like that, but it needs to be something the kernel knows about 
>>intimately and doesn't need lots of dynamic lookup).
>>
>>However, there's another issue, namely process startup cost. I personally 
>>want it to be as light as at all possible. I hate doing an "strace" on 
>>user processes and seeing tons and tons of crapola showing up. Just for 
> 
> So why not map the magic page at 0xffffe000 at some other address as
> well? 
> 
> Static binaries can just directly jump/call into the magic page.
> 
> Shared binaries do some kind of mmap("/proc/self/mem") magic to put a
> copy of the page at an address that is convenient for them. Shared
> binaries have to do a lot of mmap-ing anyway, so the overhead should be
> negligible.
> 

That would require /proc to be mounted for all shared binaries to work.
That is tantamount to killing chroot().

Perhaps it could be done with mremap(), but I would assume that would
entail a special case in the mremap() code.

A special system call would be a bit gross, but it's better than a total
hack.

	-hpa



^ permalink raw reply	[flat|nested] 268+ messages in thread

* Re: Intel P6 vs P7 system call performance
@ 2002-12-18 23:51 billyrose
  2002-12-19 13:10 ` Richard B. Johnson
  0 siblings, 1 reply; 268+ messages in thread
From: billyrose @ 2002-12-18 23:51 UTC (permalink / raw)
  To: root; +Cc: linux-kernel

Richard B. Johnson wrote:
> The number of CPU clocks necessary to make the 'far' or
> full-pointer call by pushing the segment register, the offset,
> then issuing a 'lret' is 33 clocks on a Pentium II.
>
> longcall clocks = 46
> call clocks = 13
> actual full-pointer call clocks = 33

this is not correct. the assumed target (of a _far_ call) would issue a far 
return and only an offset would be left on the stack to return to (oops). the 
code segment of the original caller needs to be pushed to create the seg:off 
pair, so that a far return lands back at the original calling routine. this is 
a very convoluted way of making the original call far, as simply calling 
far in the first place should issue much faster. OTOH, if you are making a 
workaround to an already existing piece of code, this works beautifully (with 
the additional seg pushed on the stack).

b.

^ permalink raw reply	[flat|nested] 268+ messages in thread

* Re: Intel P6 vs P7 system call performance
  2002-12-18 23:51 billyrose
@ 2002-12-19 13:10 ` Richard B. Johnson
  0 siblings, 0 replies; 268+ messages in thread
From: Richard B. Johnson @ 2002-12-19 13:10 UTC (permalink / raw)
  To: billyrose; +Cc: linux-kernel

On Wed, 18 Dec 2002 billyrose@billyrose.net wrote:

> Richard B. Johnson wrote:
> > The number of CPU clocks necessary to make the 'far' or
> > full-pointer call by pushing the segment register, the offset,
> > then issuing a 'lret' is 33 clocks on a Pentium II.
> >
> > longcall clocks = 46
> > call clocks = 13
> > actual full-pointer call clocks = 33
> 
> this is not correct. the assumed target (of a _far_ call) would issue a far 
> return and only an offset would be left on the stack to return to (oops). the 
> code segment of the original caller needs to be pushed to create the seg:off 
> pair, so that a far return lands back at the original calling routine. this is 
> a very convoluted way of making the original call far, as simply calling 
> far in the first place should issue much faster. OTOH, if you are making a 
> workaround to an already existing piece of code, this works beautifully (with 
> the additional seg pushed on the stack).
> 

The target, i.e., the label 'goto' would be the reserved page for the
system call. The whole purpose was to minimize the number of CPU cycles
necessary to call 0xfffff000 and return. The system call does not have to
issue a 'far' return, it can do anything it requires. The page at
0xfffff000 is mapped into every process and is in that process's CS space
already.

I have already gotten responses from people who looked at the code
and said it was broken. It is not broken. It does what is expected.
It takes the same number of CPU cycles as:

		pushl	$0xfffff000
		call	*(%esp)
		addl	$4, %esp

... which is the current proposal. It has the advantage that only
the return address is on the stack when the target code is executed.

Cheers,
Dick Johnson
Penguin : Linux version 2.4.18 on an i686 machine (797.90 BogoMips).
Why is the government concerned about the lunatic fringe? Think about it.

^ permalink raw reply	[flat|nested] 268+ messages in thread

* RE: Intel P6 vs P7 system call performance
@ 2002-12-18  1:30 Nakajima, Jun
  2002-12-18  1:54 ` Ulrich Drepper
  0 siblings, 1 reply; 268+ messages in thread
From: Nakajima, Jun @ 2002-12-18  1:30 UTC (permalink / raw)
  To: Ulrich Drepper, Linus Torvalds
  Cc: Matti Aarnio, Hugh Dickins, Dave Jones, Ingo Molnar, linux-kernel,
	hpa

AMD (at least Athlon, as far as I know) supports sysenter/sysexit. We tested it on an Athlon box as well, and it worked fine. And sysenter/sysexit was better than int/iret too (about 40% faster) there. 

Jun

> -----Original Message-----
> From: Ulrich Drepper [mailto:drepper@redhat.com]
> Sent: Tuesday, December 17, 2002 11:19 AM
> To: Linus Torvalds
> Cc: Matti Aarnio; Hugh Dickins; Dave Jones; Ingo Molnar; linux-
> kernel@vger.kernel.org; hpa@transmeta.com
> Subject: Re: Intel P6 vs P7 system call performance
> 
> Linus Torvalds wrote:
> 
> > In the meantime, I do agree with you that the TLS approach should work
> > too, and might be better. It will allow all six arguments to be used if
> we
> > just find a good calling conventions
> 
> If you push out the AT_* patch I'll hack the glibc bits (probably the
> TLS variant).  Won't take too  long, you'll get results this afternoon.
> 
> What about AMD's instruction?  Is it as flawed as sysenter?  If not and
> %ebp is available I really should use the TLS method.
> 
> --
> --------------.                        ,-.            444 Castro Street
> Ulrich Drepper \    ,-----------------'   \ Mountain View, CA 94041 USA
> Red Hat         `--' drepper at redhat.com `---------------------------
> 

^ permalink raw reply	[flat|nested] 268+ messages in thread

* Re: Intel P6 vs P7 system call performance
  2002-12-18  1:30 Nakajima, Jun
@ 2002-12-18  1:54 ` Ulrich Drepper
  2002-12-18  3:36   ` H. Peter Anvin
  2002-12-18  6:00   ` Brian Gerst
  0 siblings, 2 replies; 268+ messages in thread
From: Ulrich Drepper @ 2002-12-18  1:54 UTC (permalink / raw)
  To: Nakajima, Jun
  Cc: Linus Torvalds, Matti Aarnio, Hugh Dickins, Dave Jones,
	Ingo Molnar, linux-kernel, hpa

Nakajima, Jun wrote:
> AMD (at least Athlon, as far as I know) supports sysenter/sysexit. We tested it on an Athlon box as well, and it worked fine. And sysenter/sysexit was better than int/iret too (about 40% faster) there. 

That's good to know but not what I meant.

I referred to syscall/sysret opcodes.  They are broken in their own way
(destroying ecx on kernel entry) but at least they preserve eip.

-- 
--------------.                        ,-.            444 Castro Street
Ulrich Drepper \    ,-----------------'   \ Mountain View, CA 94041 USA
Red Hat         `--' drepper at redhat.com `---------------------------


^ permalink raw reply	[flat|nested] 268+ messages in thread

* Re: Intel P6 vs P7 system call performance
  2002-12-18  1:54 ` Ulrich Drepper
@ 2002-12-18  3:36   ` H. Peter Anvin
  2002-12-18  4:05     ` Linus Torvalds
  2002-12-18  4:07     ` Linus Torvalds
  2002-12-18  6:00   ` Brian Gerst
  1 sibling, 2 replies; 268+ messages in thread
From: H. Peter Anvin @ 2002-12-18  3:36 UTC (permalink / raw)
  To: Ulrich Drepper
  Cc: Nakajima, Jun, Linus Torvalds, Matti Aarnio, Hugh Dickins,
	Dave Jones, Ingo Molnar, linux-kernel

Ulrich Drepper wrote:
> 
> That's good to know but not what I meant.
> 
> I referred to syscall/sysret opcodes.  They are broken in their own way
> (destroying ecx on kernel entry) but at least they preserve eip.
> 

Destroying %ecx is a lot less destructive than destroying %eip and %esp...

	-hpa


^ permalink raw reply	[flat|nested] 268+ messages in thread

* Re: Intel P6 vs P7 system call performance
  2002-12-18  3:36   ` H. Peter Anvin
@ 2002-12-18  4:05     ` Linus Torvalds
  2002-12-18  4:36       ` H. Peter Anvin
  2002-12-18  4:07     ` Linus Torvalds
  1 sibling, 1 reply; 268+ messages in thread
From: Linus Torvalds @ 2002-12-18  4:05 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Ulrich Drepper, Nakajima, Jun, Matti Aarnio, Hugh Dickins,
	Dave Jones, Ingo Molnar, linux-kernel

On Tue, 17 Dec 2002, H. Peter Anvin wrote:
> Ulrich Drepper wrote:
> >
> > That's good to know but not what I meant.
> >
> > I referred to syscall/sysret opcodes.  They are broken in their own way
> > (destroying ecx on kernel entry) but at least they preserve eip.
> >
>
> Destroying %ecx is a lot less destructive than destroying %eip and %esp...

Actually, as far as the kernel is concerned, they are about equally bad.

Destroying %eip is the _least_ bad register to destroy, since the kernel
can control that part, and it is trivial to just have a single call site.

But destroying %esp or %ecx is pretty much totally equivalent - it
destroys one user mode register, and it doesn't really matter _which_ one.
In both cases 32 bits of user information is destroyed, and they are 100%
equivalent as far as the kernel is concerned.

On intel with sysenter, destroying %esp means that we have to save the
value in %ebp, and we thus lose argument 6.

On AMD, %ecx is destroyed on entry, which means that we lose argument 2
(which is more important than arg6, but that only means that the AMD
trampoline will have to move the old value of %ecx into %ebp, at which
point the two approaches are again exactly the same).

In either case, one GP register is irrevocably lost, which means that
there are only 5 GP registers left for arguments. And thus both Intel and
AMD will have _exactly_ the same problem with six-argument system calls.

The _sane_ thing to do would have been to save the old user %esp/%eip on
the kernel stack. Preferably together with %eflags and %ss and %cs, just
for completeness. That stack save part is _not_ the expensive or complex
part of a "int 0x80" or long call (the _complex_ part is all the stupid
GDT/IDT lookups and all the segment switching crap).

In short, both AMD and Intel optimized away too much.

The good news is that since both of them suck, it's easier to make the
six-argument decision. Since six arguments are problematic for all major
"fast" system calls, my executive decision is to just say that
six-argument system calls will just have to continue using the old and
slower system call interface. It's kind of a crock, but it's simply due to
silly CPU designers.

			Linus
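
A minimal sketch of the decision described above, assuming the usual i386
int 0x80 convention of passing arguments 1..6 in %ebx, %ecx, %edx, %esi,
%edi, %ebp. The names choose_entry, ENTRY_FAST and ENTRY_INT80 are
hypothetical and only model the rule; this is not kernel or glibc code.

```c
/* Hypothetical names; they only model the rule stated in the message. */
enum entry_path { ENTRY_FAST, ENTRY_INT80 };

/*
 * sysenter loses %esp (saved in %ebp), syscall loses %ecx (moved to %ebp),
 * so either way only five GP registers remain for arguments.
 */
#define FAST_PATH_MAX_ARGS 5

/* Pick the entry path for a system call taking nargs arguments. */
enum entry_path choose_entry(int nargs, int cpu_has_fast_entry)
{
    if (cpu_has_fast_entry && nargs <= FAST_PATH_MAX_ARGS)
        return ENTRY_FAST;
    return ENTRY_INT80;     /* six-argument calls keep the old interface */
}
```

With this rule a five-argument call takes the fast path on a capable CPU,
while a six-argument call always goes through int 0x80.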

^ permalink raw reply	[flat|nested] 268+ messages in thread

* Re: Intel P6 vs P7 system call performance
  2002-12-18  4:05     ` Linus Torvalds
@ 2002-12-18  4:36       ` H. Peter Anvin
  0 siblings, 0 replies; 268+ messages in thread
From: H. Peter Anvin @ 2002-12-18  4:36 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ulrich Drepper, Nakajima, Jun, Matti Aarnio, Hugh Dickins,
	Dave Jones, Ingo Molnar, linux-kernel

Linus Torvalds wrote:
>>
>>Destroying %ecx is a lot less destructive than destroying %eip and %esp...
> 
> Actually, as far as the kernel is concerned, they are about equally bad.
> 

Right, but from a user-mode point of view it means at least one extra 
instruction.

> Destroying %eip is the _least_ bad register to destroy, since the kernel
> can control that part, and it is trivial to just have a single call site.

Trivial, perhaps, but it requires a call/ret pair in userspace, which is
a fairly expensive form of push/pop.

> The good news is that since both of them suck, it's easier to make the
> six-argument decision. Since six arguments are problematic for all major
> "fast" system calls, my executive decision is to just say that
> six-argument system calls will just have to continue using the old and
> slower system call interface. It's kind of a crock, but it's simply due to
> silly CPU designers.

Oh, so you're not going to do the "read from stack" thing?  (Agreed, by 
the way, on the CPU design -- both SYSENTER and SYSCALL suck.  SYSCALL 
was changed rather substantially in x86-64 for that reason.)

	-hpa




^ permalink raw reply	[flat|nested] 268+ messages in thread

* Re: Intel P6 vs P7 system call performance
  2002-12-18  3:36   ` H. Peter Anvin
  2002-12-18  4:05     ` Linus Torvalds
@ 2002-12-18  4:07     ` Linus Torvalds
  2002-12-18  4:40       ` Stephen Rothwell
  2002-12-18 23:45       ` Pavel Machek
  1 sibling, 2 replies; 268+ messages in thread
From: Linus Torvalds @ 2002-12-18  4:07 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Ulrich Drepper, Nakajima, Jun, Matti Aarnio, Hugh Dickins,
	Dave Jones, Ingo Molnar, linux-kernel

Btw, on another tangent - Andrew Morton reports that APM is unhappy about
the fact that the fast system call stuff required us to move the segments
around a bit. That's probably because the APM code has the old APM segment
numbers hardcoded somewhere, but I don't see where (I certainly knew about
the segment number issue, and tried to update the cases I saw).

Debugging help would be appreciated, especially from somebody who knows
the APM code.

		Linus

^ permalink raw reply	[flat|nested] 268+ messages in thread

* Re: Intel P6 vs P7 system call performance
  2002-12-18  4:07     ` Linus Torvalds
@ 2002-12-18  4:40       ` Stephen Rothwell
  2002-12-18  4:52         ` Linus Torvalds
                           ` (2 more replies)
  2002-12-18 23:45       ` Pavel Machek
  1 sibling, 3 replies; 268+ messages in thread
From: Stephen Rothwell @ 2002-12-18  4:40 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: LKML, Andrew Morton

Hi Linus, Andrew,

On Tue, 17 Dec 2002 20:07:53 -0800 (PST) Linus Torvalds <torvalds@transmeta.com> wrote:
>
> Btw, on another tangent - Andrew Morton reports that APM is unhappy about
> the fact that the fast system call stuff required us to move the segments
> around a bit. That's probably because the APM code has the old APM segment
> numbers hardcoded somewhere, but I don't see where (I certainly knew about
> the segment number issue, and tried to update the cases I saw).

I looked at this yesterday and decided that it was OK as well.

> Debugging help would be appreciated, especially from somebody who knows
> the APM code.

It would help to know what "unhappy" means :-)

Does the following fix it for you? Untested, assumes cache lines are 32
bytes.

-- 
Cheers,
Stephen Rothwell                    sfr@canb.auug.org.au
http://www.canb.auug.org.au/~sfr/

diff -ruN 2.5.52-200212181207/include/asm-i386/segment.h 2.5.52-200212181207-apm/include/asm-i386/segment.h
--- 2.5.52-200212181207/include/asm-i386/segment.h	2002-12-18 15:25:48.000000000 +1100
+++ 2.5.52-200212181207-apm/include/asm-i386/segment.h	2002-12-18 15:38:34.000000000 +1100
@@ -65,9 +65,9 @@
 #define GDT_ENTRY_APMBIOS_BASE		(GDT_ENTRY_KERNEL_BASE + 11)
 
 /*
- * The GDT has 23 entries but we pad it to cacheline boundary:
+ * The GDT has 25 entries but we pad it to cacheline boundary:
  */
-#define GDT_ENTRIES 24
+#define GDT_ENTRIES 28
 
 #define GDT_SIZE (GDT_ENTRIES * 8)
 

^ permalink raw reply	[flat|nested] 268+ messages in thread

* Re: Intel P6 vs P7 system call performance
  2002-12-18  4:40       ` Stephen Rothwell
@ 2002-12-18  4:52         ` Linus Torvalds
  2002-12-18  4:53         ` Andrew Morton
  2002-12-18 19:12         ` Andrew Morton
  2 siblings, 0 replies; 268+ messages in thread
From: Linus Torvalds @ 2002-12-18  4:52 UTC (permalink / raw)
  To: Stephen Rothwell; +Cc: LKML, Andrew Morton



On Wed, 18 Dec 2002, Stephen Rothwell wrote:
>
> It would help to know what "unhappy" means :-)

Andrew reported an oops in the BIOS. I have the full oops info somewhere,
but quite frankly it isn't that readable. It shows

	EIP:    00b8:[<000044d7>]    Not tainted
	ds: 0000   es: 0000   ss: 0068
	Call Trace:
	 [<c0112739>] apm_bios_call+0x75/0xf4
	 [<c0130000>] cache_init_objs+0x34/0xd8
	 [<c0112b72>] apm_get_power_status+0x42/0x84
	 [<c012d843>] __alloc_pages+0x77/0x244
	 [<c0113828>] apm_get_info+0x38/0xe4
	 [<c016982d>] proc_file_read+0xa9/0x1ac
	 [<c0141b53>] vfs_read+0xb7/0x138
	 [<c0141dee>] sys_read+0x2a/0x40
	 [<c0108e67>] syscall_call+0x7/0xb

and I suspect the problem is that 0 in ds/es..

> Does the following fix it for you? Untested, assumes cache lines are 32
> bytes.

Andrew?

		Linus


^ permalink raw reply	[flat|nested] 268+ messages in thread

* Re: Intel P6 vs P7 system call performance
  2002-12-18  4:40       ` Stephen Rothwell
  2002-12-18  4:52         ` Linus Torvalds
@ 2002-12-18  4:53         ` Andrew Morton
  2002-12-18 19:12         ` Andrew Morton
  2 siblings, 0 replies; 268+ messages in thread
From: Andrew Morton @ 2002-12-18  4:53 UTC (permalink / raw)
  To: Stephen Rothwell; +Cc: Linus Torvalds, LKML

Stephen Rothwell wrote:
> 
> Hi Linus, Andrew,
> 
> On Tue, 17 Dec 2002 20:07:53 -0800 (PST) Linus Torvalds <torvalds@transmeta.com> wrote:
> >
> > Btw, on another tangent - Andrew Morton reports that APM is unhappy about
> > the fact that the fast system call stuff required us to move the segments
> > around a bit. That's probably because the APM code has the old APM segment
> > numbers hardcoded somewhere, but I don't see where (I certainly knew about
> > the segment number issue, and tried to update the cases I saw).
> 
> I looked at this yesterday and decided that it was OK as well.
> 
> > Debugging help would be appreciated, especially from somebody who knows
> > the APM code.
> 
> It would help to know what "unhappy" means :-)

The lcall seems to be going awry.  It oopses when apmd starts up,
and the sysenter patch is the trigger.

CPU:    0
EIP:    00b8:[<000044d7>]    Not tainted
EFLAGS: 00010202
EIP is at 0x44d7
eax: 000000c8   ebx: 00000001   ecx: 00000000   edx: 00000000
esi: c02e0091   edi: 000000ff   ebp: ceed1ec4   esp: ceed1e74
ds: 0000   es: 0000   ss: 0068
Process apmd (pid: 679, threadinfo=ceed0000 task=cfa058a0)
Stack: 0000530a 00b844e8 00000000 ceed1ec4 c0112739 00000060 ceed1ec4 000000ff 
       00000068 00000068 ceed1f32 c02e0091 000000ff 00000202 ceed0000 cf706c24 
       00000000 00000000 c0130000 c1740000 ceed1f04 c0112b72 0000530a 00000001 
Call Trace:
 [<c0112739>] apm_bios_call+0x75/0xf4
 [<c0130000>] cache_init_objs+0x34/0xd8
 [<c0112b72>] apm_get_power_status+0x42/0x84
 [<c012d843>] __alloc_pages+0x77/0x244
 [<c0113828>] apm_get_info+0x38/0xe4
 [<c016982d>] proc_file_read+0xa9/0x1ac
 [<c0141b53>] vfs_read+0xb7/0x138
 [<c0141dee>] sys_read+0x2a/0x40
 [<c0108e67>] syscall_call+0x7/0xb

> Does the following fix it for you? Untested, assumes cache lines are 32
> bytes.
> 

I cleverly left the laptop at work.  Shall test tomorrow.

^ permalink raw reply	[flat|nested] 268+ messages in thread

* Re: Intel P6 vs P7 system call performance
  2002-12-18  4:40       ` Stephen Rothwell
  2002-12-18  4:52         ` Linus Torvalds
  2002-12-18  4:53         ` Andrew Morton
@ 2002-12-18 19:12         ` Andrew Morton
  2 siblings, 0 replies; 268+ messages in thread
From: Andrew Morton @ 2002-12-18 19:12 UTC (permalink / raw)
  To: Stephen Rothwell; +Cc: Linus Torvalds, LKML

Stephen Rothwell wrote:
> 
> Hi Linus, Andrew,
> 
> On Tue, 17 Dec 2002 20:07:53 -0800 (PST) Linus Torvalds <torvalds@transmeta.com> wrote:
> >
> > Btw, on another tangent - Andrew Morton reports that APM is unhappy about
> > the fact that the fast system call stuff required us to move the segments
> > around a bit. That's probably because the APM code has the old APM segment
> > numbers hardcoded somewhere, but I don't see where (I certainly knew about
> > the segment number issue, and tried to update the cases I saw).
> 
> I looked at this yesterday and decided that it was OK as well.
> 
> > Debugging help would be appreciated, especially from somebody who knows
> > the APM code.
> 
> It would help to know what "unhappy" means :-)
> 
> Does the following fix it for you? Untested, assumes cache lines are 32
> bytes.

It does fix the apmd oops, and APM works fine.

Here's the patch again.  (But what happens if cachelines are not 32 bytes?)


--- 25/include/asm-i386/segment.h~sfr	Wed Dec 18 10:54:07 2002
+++ 25-akpm/include/asm-i386/segment.h	Wed Dec 18 10:54:07 2002
@@ -65,9 +65,9 @@
 #define GDT_ENTRY_APMBIOS_BASE		(GDT_ENTRY_KERNEL_BASE + 11)
 
 /*
- * The GDT has 23 entries but we pad it to cacheline boundary:
+ * The GDT has 25 entries but we pad it to cacheline boundary:
  */
-#define GDT_ENTRIES 24
+#define GDT_ENTRIES 28
 
 #define GDT_SIZE (GDT_ENTRIES * 8)
 

_
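
On the cacheline question, a minimal sketch of the padding arithmetic,
assuming 8-byte descriptors (as in GDT_SIZE above) and a cacheline size
that is a multiple of 8. gdt_entries_padded() is a hypothetical helper,
not something in segment.h.

```c
#include <stddef.h>

#define GDT_DESC_SIZE 8   /* bytes per descriptor, as in GDT_SIZE */

/*
 * Round the number of GDT entries up so the table ends on a cacheline
 * boundary. cacheline_bytes is assumed to be a multiple of 8.
 */
size_t gdt_entries_padded(size_t used_entries, size_t cacheline_bytes)
{
    size_t bytes = used_entries * GDT_DESC_SIZE;
    size_t padded = (bytes + cacheline_bytes - 1)
                    / cacheline_bytes * cacheline_bytes;
    return padded / GDT_DESC_SIZE;
}
```

With 32-byte lines, 23 entries pad to 24 and 25 entries pad to 28, which
matches both versions of the define. With 64- or 128-byte lines, 25
entries would pad to 32, so the hard-coded 28 is still large enough but
no longer ends on a cacheline boundary.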

^ permalink raw reply	[flat|nested] 268+ messages in thread

* Re: Intel P6 vs P7 system call performance
  2002-12-18  4:07     ` Linus Torvalds
  2002-12-18  4:40       ` Stephen Rothwell
@ 2002-12-18 23:45       ` Pavel Machek
  2002-12-20  3:05         ` Alan Cox
  1 sibling, 1 reply; 268+ messages in thread
From: Pavel Machek @ 2002-12-18 23:45 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: H. Peter Anvin, Ulrich Drepper, Nakajima, Jun, Matti Aarnio,
	Hugh Dickins, Dave Jones, Ingo Molnar, linux-kernel

Hi!

> Btw, on another tangent - Andrew Morton reports that APM is unhappy about
> the fact that the fast system call stuff required us to move the segments
> around a bit. That's probably because the APM code has the old APM segment
> numbers hardcoded somewhere, but I don't see where (I certainly knew about
> the segment number issue, and tried to update the cases I saw).
> 
> Debugging help would be appreciated, especially from somebody who knows
> the APM code.

IIRC, segment 0x40 was special in BIOS days, and some APM bioses
blindly access 0x40 even from protected mode (windows have segment
0x40 with base 0x400....) Is that issue you are hitting?
								Pavel

-- 
Worst form of spam? Adding advertisement signatures ala sourceforge.net.
What goes next? Inserting advertisement *into* email?
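
A minimal sketch of the arithmetic behind the 0x40 / 0x400 pairing and of
how the selectors in the oops decode. The helper names are hypothetical;
the formulas are the standard x86 ones (real-mode linear address =
segment * 16 + offset, protected-mode selector bits 15..3 = table index,
bits 1..0 = RPL).

```c
#include <stdint.h>

/* Real mode: linear address = segment * 16 + offset. */
uint32_t real_mode_linear(uint16_t segment, uint16_t offset)
{
    return ((uint32_t)segment << 4) + offset;
}

/* Protected mode: bits 15..3 of a selector index the descriptor table. */
unsigned selector_index(uint16_t selector)
{
    return selector >> 3;
}

/* Protected mode: bits 1..0 of a selector are the requested privilege level. */
unsigned selector_rpl(uint16_t selector)
{
    return selector & 3;
}
```

Real-mode segment 0x40 addresses linear 0x400, so a protected-mode
descriptor with base 0x400 behind selector 0x40 (GDT index 8) lets BIOS
code that blindly loads 0x40 see the same bytes. By the same decoding, the
oops values cs 0x00b8 and ss 0x0068 are GDT indexes 23 and 13.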

^ permalink raw reply	[flat|nested] 268+ messages in thread

* Re: Intel P6 vs P7 system call performance
  2002-12-18 23:45       ` Pavel Machek
@ 2002-12-20  3:05         ` Alan Cox
  2002-12-20  4:03           ` Stephen Rothwell
  0 siblings, 1 reply; 268+ messages in thread
From: Alan Cox @ 2002-12-20  3:05 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Linus Torvalds, H. Peter Anvin, Ulrich Drepper, Nakajima, Jun,
	Matti Aarnio, Hugh Dickins, Dave Jones, Ingo Molnar,
	Linux Kernel Mailing List

On Wed, 2002-12-18 at 23:45, Pavel Machek wrote:
> IIRC, segment 0x40 was special in BIOS days, and some APM bioses
> blindly access 0x40 even from protected mode (windows have segment
> 0x40 with base 0x400....) Is that issue you are hitting?

Well the spec says it is not special. Windows leaves it pointing to
0x400 and if you don't do that your APM doesn't work.

Alan


^ permalink raw reply	[flat|nested] 268+ messages in thread

* Re: Intel P6 vs P7 system call performance
  2002-12-20  3:05         ` Alan Cox
@ 2002-12-20  4:03           ` Stephen Rothwell
  0 siblings, 0 replies; 268+ messages in thread
From: Stephen Rothwell @ 2002-12-20  4:03 UTC (permalink / raw)
  To: Alan Cox
  Cc: pavel, torvalds, hpa, drepper, jun.nakajima, matti.aarnio, hugh,
	davej, mingo, linux-kernel

On 20 Dec 2002 03:05:15 +0000 Alan Cox <alan@lxorguk.ukuu.org.uk> wrote:
>
> On Wed, 2002-12-18 at 23:45, Pavel Machek wrote:
> > IIRC, segment 0x40 was special in BIOS days, and some APM bioses
> > blindly access 0x40 even from protected mode (windows have segment
> > 0x40 with base 0x400....) Is that issue you are hitting?
> 
> Well the spec says it is not special. Windows leaves it pointing to
> 0x400 and if you don't do that your APM doesn't work.

The problem with the new syscall stuff is fixed in BK (the GDT was no longer
long enough ...)

The 0x40 thing is set up and torn down for each BIOS call these days.
-- 
Cheers,
Stephen Rothwell                    sfr@canb.auug.org.au
http://www.canb.auug.org.au/~sfr/

^ permalink raw reply	[flat|nested] 268+ messages in thread

* Re: Intel P6 vs P7 system call performance
  2002-12-18  1:54 ` Ulrich Drepper
  2002-12-18  3:36   ` H. Peter Anvin
@ 2002-12-18  6:00   ` Brian Gerst
  1 sibling, 0 replies; 268+ messages in thread
From: Brian Gerst @ 2002-12-18  6:00 UTC (permalink / raw)
  To: Ulrich Drepper
  Cc: Nakajima, Jun, Linus Torvalds, Matti Aarnio, Hugh Dickins,
	Dave Jones, Ingo Molnar, linux-kernel, hpa

Ulrich Drepper wrote:
> Nakajima, Jun wrote:
> 
>>AMD (at least Athlon, as far as I know) supports sysenter/sysexit. We tested it on an Athlon box as well, and it worked fine. And sysenter/sysexit was better than int/iret too (about 40% faster) there. 
> 
> 
> That's good to know but not what I meant.
> 
> I referred to syscall/sysret opcodes.  They are broken in their own way
> (destroying ecx on kernel entry) but at least they preserve eip.
> 

syscall is pretty much unusable unless the NMI is changed to a task 
gate.  syscall does not change %esp on entry to the kernel, so an NMI 
before the manual stack switch would still use the user stack, which is 
not guaranteed to be valid - oops.  x86-64 gets around this by using an 
interrupt stack, its replacement for task gates.

--
				Brian Gerst


^ permalink raw reply	[flat|nested] 268+ messages in thread

* Re: Intel P6 vs P7 system call performance
@ 2002-12-17 16:32 Manfred Spraul
  2002-12-17 17:13 ` Richard B. Johnson
  0 siblings, 1 reply; 268+ messages in thread
From: Manfred Spraul @ 2002-12-17 16:32 UTC (permalink / raw)
  To: Ulrich Drepper; +Cc: linux-kernel

>
>
>   pushl %ebp
>   movl $0xfffff000, %ebp
>   call *%ebp
>   popl %ebp
>  
>

You could avoid clobbering a register with something like

pushl $0xfffff000
call *(%esp)
addl $4,%esp

--
    Manfred


^ permalink raw reply	[flat|nested] 268+ messages in thread

* Re: Intel P6 vs P7 system call performance
  2002-12-17 16:32 Manfred Spraul
@ 2002-12-17 17:13 ` Richard B. Johnson
  2002-12-17 17:19   ` Richard B. Johnson
  0 siblings, 1 reply; 268+ messages in thread
From: Richard B. Johnson @ 2002-12-17 17:13 UTC (permalink / raw)
  To: Manfred Spraul; +Cc: Ulrich Drepper, linux-kernel

On Tue, 17 Dec 2002, Manfred Spraul wrote:

> >
> >
> >   pushl %ebp
> >   movl $0xfffff000, %ebp
> >   call *%ebp
> >   popl %ebp
> >  
> >
> 
> You could avoid clobbering a register with something like
> 
> pushl $0xfffff000
> call *(%esp)
> addl $4,%esp
> 

This is a near 'call'.

	pushl $0xfffff000
	ret

This is a 'far' 'call'; I think you will need to reload the segments
back to the user-mode segments on the return.

	pushl	$KERNEL_CS
	pushl	$0xfffff000
	lret

Cheers,
Dick Johnson
Penguin : Linux version 2.4.18 on an i686 machine (797.90 BogoMips).
Why is the government concerned about the lunatic fringe? Think about it.

^ permalink raw reply	[flat|nested] 268+ messages in thread

* Re: Intel P6 vs P7 system call performance
  2002-12-17 17:13 ` Richard B. Johnson
@ 2002-12-17 17:19   ` Richard B. Johnson
  2002-12-17 17:37     ` Mikael Pettersson
  0 siblings, 1 reply; 268+ messages in thread
From: Richard B. Johnson @ 2002-12-17 17:19 UTC (permalink / raw)
  To: Manfred Spraul; +Cc: Ulrich Drepper, linux-kernel

On Tue, 17 Dec 2002, Richard B. Johnson wrote:

> On Tue, 17 Dec 2002, Manfred Spraul wrote:
> 
> > >
> > >
> > >   pushl %ebp
> > >   movl $0xfffff000, %ebp
> > >   call *%ebp
> > >   popl %ebp
> > >  
> > >
> > 
> > You could avoid clobbering a register with something like
> > 
> > pushl $0xfffff000
> > call *(%esp)
> > addl $4,%esp
> > 
> 
> This is a near 'call'.
> 
> 	pushl $0xfffff000
> 	ret
> 

I hate answering my own stuff......... This gets back and modifies
no registers.

Actually it should be:

	pushl	$next_address	# Where to go when the call returns
	pushl	$0xfffff000	# Put this on the stack
	ret			# 'Return' to it (jump)
next_address:			# Where we end up after



Cheers,
Dick Johnson
Penguin : Linux version 2.4.18 on an i686 machine (797.90 BogoMips).
Why is the government concerned about the lunatic fringe? Think about it.



^ permalink raw reply	[flat|nested] 268+ messages in thread

* Re: Intel P6 vs P7 system call performance
  2002-12-17 17:19   ` Richard B. Johnson
@ 2002-12-17 17:37     ` Mikael Pettersson
  0 siblings, 0 replies; 268+ messages in thread
From: Mikael Pettersson @ 2002-12-17 17:37 UTC (permalink / raw)
  To: root; +Cc: Ulrich Drepper, linux-kernel

Richard B. Johnson writes:
 > Actually it should be:
 > 
 > 	pushl	$next_address	# Where to go when the call returns
 > 	pushl	$0xfffff000	# Put this on the stack
 > 	ret			# 'Return' to it (jump)
 > next_address:			# Where we end up after

You just killed that process' performance by causing the
return-stack branch prediction buffer to go out of sync.

It might have worked ok on a 486, but P6+ don't like it one bit.

This is also why I'm slightly unhappy about the
s/int $0x80/call <address of sysenter>/ approach, since it leads
to yet another recursion level and risks overflowing the RSB.

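The return-stack point above can be illustrated with a toy model. This is
only a sketch under loud assumptions: the buffer depth, the addresses and
all of the names (struct rsb, rsb_call, rsb_ret and so on) are made up for
the illustration, and a real P6 or P4 predictor is more complicated. The
model pushes a predicted return address on every "call" and pops one on
every "ret", counting a miss when the popped prediction is not where
execution really goes.

```c
#include <stdint.h>

#define RSB_DEPTH 16u	/* assumed depth; real CPUs differ */

/* Toy return-stack buffer: "call" pushes a predicted return address,
 * "ret" pops one and compares it with the real target. */
struct rsb {
	uint32_t slot[RSB_DEPTH];
	unsigned top;		/* pushes minus pops; unsigned wrap is fine */
	unsigned misses;	/* mispredicted returns so far */
};

static void rsb_call(struct rsb *r, uint32_t return_addr)
{
	r->slot[r->top % RSB_DEPTH] = return_addr;
	r->top++;
}

static void rsb_ret(struct rsb *r, uint32_t actual_target)
{
	r->top--;
	if (r->slot[r->top % RSB_DEPTH] != actual_target)
		r->misses++;
}

/* Illustrative addresses only. */
#define CALLER_RET	0x08048100u	/* where the enclosing function returns */
#define NEXT_ADDR	0x08048010u	/* "next_address" in the quoted snippet */
#define VSYSCALL	0xfffff000u

/* Models: pushl $next_address; pushl $0xfffff000; ret */
static unsigned misses_with_push_ret(void)
{
	struct rsb r = { {0}, 0, 0 };

	rsb_call(&r, CALLER_RET);	/* somebody called the wrapper */
	rsb_ret(&r, VSYSCALL);		/* the "jump" ret: predicted CALLER_RET */
	rsb_ret(&r, NEXT_ADDR);		/* ret inside the vsyscall page */
	rsb_ret(&r, CALLER_RET);	/* wrapper returns to its caller */
	return r.misses;
}

/* Models: call 0xfffff000 */
static unsigned misses_with_call(void)
{
	struct rsb r = { {0}, 0, 0 };

	rsb_call(&r, CALLER_RET);
	rsb_call(&r, NEXT_ADDR);	/* the call pushes its own return address */
	rsb_ret(&r, NEXT_ADDR);
	rsb_ret(&r, CALLER_RET);
	return r.misses;
}
```

In this model the push/ret sequence mispredicts every return that follows
it, including the caller's own, while the plain call mispredicts none. The
model says nothing about how expensive each miss is on real hardware.
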
^ permalink raw reply	[flat|nested] 268+ messages in thread

* Re: Intel P6 vs P7 system call performance
@ 2002-12-17 16:14 John Reiser
  0 siblings, 0 replies; 268+ messages in thread
From: John Reiser @ 2002-12-17 16:14 UTC (permalink / raw)
  To: linux-kernel

Ulrich Drepper wrote:
[snip]
  >    pushl %ebp
  >    movl $0xfffff000, %ebp
  >    call *%ebp
  >    popl %ebp

This does not work for mmap64 [syscall 192], which passes a parameter in %ebp.

-- 
John Reiser, jreiser@BitWagon.com


^ permalink raw reply	[flat|nested] 268+ messages in thread

* Re: Intel P6 vs P7 system call performance
@ 2002-12-17 16:01 John Reiser
  0 siblings, 0 replies; 268+ messages in thread
From: John Reiser @ 2002-12-17 16:01 UTC (permalink / raw)
  To: linux-kernel

On Mon, 16 Dec 2002, Linus Torvalds wrote [regarding vsyscall implementation]:
 > The good news is that the kernel part really looks pretty clean.

Where is the CPU serializing instruction which must be executed before return
to user mode, so that kernel accesses to hardware devices are guaranteed to
complete before any subsequent user access begins?  (Otherwise a read/write
by the user to a memory-mapped device page can appear out-of-order with respect
to the kernel accesses in a preceding syscall.)  The only generally useful
serializing instructions are IRET and CPUID; only IRET is implemented universally.

-- 
John Reiser, jreiser@BitWagon.com

^ permalink raw reply	[flat|nested] 268+ messages in thread

[parent not found: <20021209193649.GC10316@suse.de.suse.lists.linux.kernel>]

[parent not found: <Pine.LNX.4.44.0212161639310.1623-100000@penguin.transmeta.com.suse.lists.linux.kernel>]

* Re: Intel P6 vs P7 system call performance
       [not found] ` <Pine.LNX.4.44.0212161639310.1623-100000@penguin.transmeta.com.suse.lists.linux.kernel>
@ 2002-12-17  8:56   ` Andi Kleen
  2002-12-17 16:57     ` Linus Torvalds
  0 siblings, 1 reply; 268+ messages in thread
From: Andi Kleen @ 2002-12-17  8:56 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: mingo, linux-kernel, davej

Linus Torvalds <torvalds@transmeta.com> writes:
> 
> That NMI problem is pretty fundamentally unfixable due to the stupid
> sysenter semantics, but we could just make the NMI handlers be real
> careful about it and fix it up if it happens.

You just have to make the NMI a task gate with its own TSS, then the
microcode will set up its own stack for you.

The only issue afterwards is that "current" does not work, but that
can be worked around by being a bit careful in the handler.
It has to run with interrupts off too, to avoid a race with a
timer interrupt which uses current (or alternatively the timer
interrupt could check for the "in nmi condition" - I don't think
any other interrupts access current except when they crash)

[in theory it would be also possible to align the NMI stacks to
8K and put a "pseudo" task into that stack, but it would look
a bit inelegant to me]

Using a task gate would be a good idea for kernel stack faults and
double faults too, then it would be at least possible to get an oops
for them, not the usual double fault.

[x86-64 does it similarly, except that it uses ISTs instead of task
gates and avoids the current problem by using an explicit base register]

I cannot implement SYSENTER for x86-64/32bit emulation, but I think
I can change the vsyscall code to use SYSCALL, not SYSENTER. The only
issue is that I cannot easily use a fixmap to map into 32bit processes,
because the kernel fixmaps are way up in the 48bit address space
and not reachable from compatibility mode.
I suspect a trick similar to the one used for lazy vmallocs (map it in
the page fault handler on demand) will work. I hope there won't be many
more of these special cases though; do_page_fault is getting
awfully complicated.

-Andi

^ permalink raw reply	[flat|nested] 268+ messages in thread

* Re: Intel P6 vs P7 system call performance
  2002-12-17  8:56   ` Andi Kleen
@ 2002-12-17 16:57     ` Linus Torvalds
  2002-12-18  5:25       ` Brian Gerst
  0 siblings, 1 reply; 268+ messages in thread
From: Linus Torvalds @ 2002-12-17 16:57 UTC (permalink / raw)
  To: Andi Kleen; +Cc: mingo, linux-kernel, davej

On 17 Dec 2002, Andi Kleen wrote:
>
> Linus Torvalds <torvalds@transmeta.com> writes:
> >
> > That NMI problem is pretty fundamentally unfixable due to the stupid
> > sysenter semantics, but we could just make the NMI handlers be real
> > careful about it and fix it up if it happens.
>
> You just have to make the NMI a task gate with its own TSS, then the
> microcode will set up its own stack for you.

Actually, I came up with a much simpler solution (which I didn't yet
implement, but should be just a few lines).

The simpler solution is to just make the temporary ESP stack _look_ like
it's a real process - ie make it 8kB per CPU (instead of the current 4kB)
and put a fake "thread_info" at the bottom of it with the right CPU
number etc. That way if an NMI comes in (in the _extremely_ tiny window),
it will still see a sane picture of the system. It will basically think
that we had a micro-task-switch between two instructions.

It's also entirely possible that the NMI window may not actually even
exist, since I'm not even sure that Intel checks for pending interrupts
before the first instruction of a trap handler.

> Using a task gate would be a good idea for kernel stack faults and
> double faults too, then it would be at least possible to get an oops
> for them, not the usual double fault.

We can't get stack faults without degrading performance horribly (they
require you to set up the stack segment in magic ways that gcc doesn't
even support). For double-faults, yes, but quite frankly, if you ever get
a double fault things are _so_ screwed up that it's not very funny any
more.

> I cannot implement SYSENTER for x86-64/32bit emulation, but I think
> I can change the vsyscall code to use SYSCALL, not SYSENTER.

Right. The point of my patches is that user-level really _cannot_ use
sysenter directly, because the sysenter semantics are just not useful for
user land. So as far as user land is concerned, it really _is_ just a
"call 0xfffff000", and then the kernel can do whatever is appropriate for
that CPU.

		Linus

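The fake "thread_info" idea above works because, as far as I understand the
2.5 i386 code, the current thread_info is found by rounding the stack
pointer down to an 8kB boundary. A minimal user-space sketch of that
lookup follows; struct fake_thread_info, union sysenter_stack, the
sysenter_stacks array and thread_info_from_sp() are hypothetical names for
the illustration, not kernel code, and the two-CPU array size is an
arbitrary assumption.

```c
#include <stdint.h>

#define THREAD_SIZE 8192	/* 8kB per CPU, as proposed above */

/* Hypothetical, cut-down stand-in for the kernel's thread_info. */
struct fake_thread_info {
	int cpu;
};

/* Stack and thread_info share one THREAD_SIZE-aligned block; the info
 * sits at the lowest address and the stack grows down towards it. */
union sysenter_stack {
	struct fake_thread_info info;
	unsigned char bytes[THREAD_SIZE];
};

/* One block per CPU (two CPUs assumed for the illustration). */
static _Alignas(THREAD_SIZE) union sysenter_stack sysenter_stacks[2];

/* Round any stack pointer inside a block down to the block's base. */
static struct fake_thread_info *thread_info_from_sp(uintptr_t sp)
{
	return (struct fake_thread_info *)(sp & ~(uintptr_t)(THREAD_SIZE - 1));
}
```

With this layout, an NMI handler that masks whatever stack pointer it was
entered with would land on the per-CPU fake info (and so on the right CPU
number) rather than on garbage, which is the "sane picture" the message
describes.
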
^ permalink raw reply	[flat|nested] 268+ messages in thread

* Re: Intel P6 vs P7 system call performance
  2002-12-17 16:57     ` Linus Torvalds
@ 2002-12-18  5:25       ` Brian Gerst
  2002-12-18  6:06         ` Linus Torvalds
  2002-12-21 16:07         ` Christian Leber
  0 siblings, 2 replies; 268+ messages in thread
From: Brian Gerst @ 2002-12-18  5:25 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Andi Kleen, mingo, linux-kernel, davej

[-- Attachment #1: Type: text/plain, Size: 1538 bytes --]

Linus Torvalds wrote:
> 
> On 17 Dec 2002, Andi Kleen wrote:
> 
>>Linus Torvalds <torvalds@transmeta.com> writes:
>>
>>>That NMI problem is pretty fundamentally unfixable due to the stupid
>>>sysenter semantics, but we could just make the NMI handlers be real
>>>careful about it and fix it up if it happens.
>>
>>You just have to make the NMI a task gate with its own TSS, then the
>>microcode will set up its own stack for you.
> 
> 
> Actually, I came up with a much simpler solution (which I didn't yet
> implement, but should be just a few lines).
> 
> The simpler solution is to just make the temporary ESP stack _look_ like
> it's a real process - ie make it 8kB per CPU (instead of the current 4kB)
> and put a fake "thread_info" at the bottom of it with the right CPU
> number etc. That way if an NMI comes in (in the _extremely_ tiny window),
> it will still see a sane picture of the system. It will basically think
> that we had a micro-task-switch between two instructions.
> 
> It's also entirely possible that the NMI window may not actually even
> exist, since I'm not even sure that Intel checks for pending interrupts
> before the first instruction of a trap handler.

How about this patch?  Instead of making a per-cpu trampoline, write to 
the msr during each context switch.  This means that the stack pointer 
is valid at all times, and also saves memory and a cache line bounce.  I 
also included some misc cleanups.

Tested on an Athlon XP.
sysenter: 158.854423 cycles
int80:    273.658134 cycles

--
				Brian Gerst

[-- Attachment #2: sysenter-1 --]
[-- Type: text/plain, Size: 5572 bytes --]

diff -urN linux-2.5.52-bk2/arch/i386/kernel/cpu/common.c linux/arch/i386/kernel/cpu/common.c
--- linux-2.5.52-bk2/arch/i386/kernel/cpu/common.c	Sat Dec 14 12:32:00 2002
+++ linux/arch/i386/kernel/cpu/common.c	Tue Dec 17 23:21:55 2002
@@ -487,7 +487,7 @@
 		BUG();
 	enter_lazy_tlb(&init_mm, current, cpu);
 
-	t->esp0 = thread->esp0;
+	load_esp0(t, thread->esp0);
 	set_tss_desc(cpu,t);
 	cpu_gdt_table[cpu][GDT_ENTRY_TSS].b &= 0xfffffdff;
 	load_TR_desc();
diff -urN linux-2.5.52-bk2/arch/i386/kernel/process.c linux/arch/i386/kernel/process.c
--- linux-2.5.52-bk2/arch/i386/kernel/process.c	Sat Dec 14 12:32:04 2002
+++ linux/arch/i386/kernel/process.c	Tue Dec 17 23:29:54 2002
@@ -440,7 +440,7 @@
 	/*
 	 * Reload esp0, LDT and the page table pointer:
 	 */
-	tss->esp0 = next->esp0;
+	load_esp0(tss, next->esp0);
 
 	/*
 	 * Load the per-thread Thread-Local Storage descriptor.
diff -urN linux-2.5.52-bk2/arch/i386/kernel/sysenter.c linux/arch/i386/kernel/sysenter.c
--- linux-2.5.52-bk2/arch/i386/kernel/sysenter.c	Tue Dec 17 23:21:45 2002
+++ linux/arch/i386/kernel/sysenter.c	Tue Dec 17 23:31:01 2002
@@ -20,22 +20,12 @@
 
 static void __init enable_sep_cpu(void *info)
 {
-	unsigned long page = __get_free_page(GFP_ATOMIC);
 	int cpu = get_cpu();
-	unsigned long *esp0_ptr = &(init_tss + cpu)->esp0;
-	unsigned long rel32;
+	struct tss_struct *tss = init_tss + cpu;
 
-	rel32 = (unsigned long) sysenter_entry - (page+11);
-
-	
-	*(short *) (page+0) = 0x258b;		/* movl xxxxx,%esp */
-	*(long **) (page+2) = esp0_ptr;
-	*(char *)  (page+6) = 0xe9;		/* jmp rl32 */
-	*(long *)  (page+7) = rel32;
-
-	wrmsr(0x174, __KERNEL_CS, 0);		/* SYSENTER_CS_MSR */
-	wrmsr(0x175, page+PAGE_SIZE, 0);	/* SYSENTER_ESP_MSR */
-	wrmsr(0x176, page, 0);			/* SYSENTER_EIP_MSR */
+	wrmsr(MSR_IA32_SYSENTER_CS, __KERNEL_CS, 0);
+	wrmsr(MSR_IA32_SYSENTER_ESP, tss->esp0, 0);
+	wrmsr(MSR_IA32_SYSENTER_EIP, (unsigned long) sysenter_entry, 0);
 
 	printk("Enabling SEP on CPU %d\n", cpu);
 	put_cpu();	
@@ -60,14 +50,15 @@
 	};
 	unsigned long page = get_zeroed_page(GFP_ATOMIC);
 
+	if (cpu_has_sep) {
+		memcpy((void *) page, sysent, sizeof(sysent));
+		enable_sep_cpu(NULL);
+		smp_call_function(enable_sep_cpu, NULL, 1, 1);
+	} else
+		memcpy((void *) page, int80, sizeof(int80));
+
 	__set_fixmap(FIX_VSYSCALL, __pa(page), PAGE_READONLY);
-	memcpy((void *) page, int80, sizeof(int80));
-	if (!boot_cpu_has(X86_FEATURE_SEP))
-		return 0;
-
-	memcpy((void *) page, sysent, sizeof(sysent));
-	enable_sep_cpu(NULL);
-	smp_call_function(enable_sep_cpu, NULL, 1, 1);
+
 	return 0;
 }
 
diff -urN linux-2.5.52-bk2/arch/i386/kernel/vm86.c linux/arch/i386/kernel/vm86.c
--- linux-2.5.52-bk2/arch/i386/kernel/vm86.c	Sat Dec 14 12:32:02 2002
+++ linux/arch/i386/kernel/vm86.c	Tue Dec 17 23:21:55 2002
@@ -113,7 +113,7 @@
 		do_exit(SIGSEGV);
 	}
 	tss = init_tss + smp_processor_id();
-	tss->esp0 = current->thread.esp0 = current->thread.saved_esp0;
+	load_esp0(tss, current->thread.saved_esp0);
 	current->thread.saved_esp0 = 0;
 	ret = KVM86->regs32;
 	return ret;
@@ -283,7 +283,8 @@
 	info->regs32->eax = 0;
 	tsk->thread.saved_esp0 = tsk->thread.esp0;
 	tss = init_tss + smp_processor_id();
-	tss->esp0 = tsk->thread.esp0 = (unsigned long) &info->VM86_TSS_ESP0;
+	tsk->thread.esp0 = (unsigned long) &info->VM86_TSS_ESP0;
+	load_esp0(tss, tsk->thread.esp0);
 
 	tsk->thread.screen_bitmap = info->screen_bitmap;
 	if (info->flags & VM86_SCREEN_BITMAP)
diff -urN linux-2.5.52-bk2/include/asm-i386/cpufeature.h linux/include/asm-i386/cpufeature.h
--- linux-2.5.52-bk2/include/asm-i386/cpufeature.h	Sun Sep 15 22:18:22 2002
+++ linux/include/asm-i386/cpufeature.h	Tue Dec 17 23:29:27 2002
@@ -7,6 +7,8 @@
 #ifndef __ASM_I386_CPUFEATURE_H
 #define __ASM_I386_CPUFEATURE_H
 
+#include <linux/bitops.h>
+
 #define NCAPINTS	4	/* Currently we have 4 32-bit words worth of info */
 
 /* Intel-defined CPU features, CPUID level 0x00000001, word 0 */
@@ -74,6 +76,7 @@
 #define cpu_has_pae		boot_cpu_has(X86_FEATURE_PAE)
 #define cpu_has_pge		boot_cpu_has(X86_FEATURE_PGE)
 #define cpu_has_apic		boot_cpu_has(X86_FEATURE_APIC)
+#define cpu_has_sep		boot_cpu_has(X86_FEATURE_SEP)
 #define cpu_has_mtrr		boot_cpu_has(X86_FEATURE_MTRR)
 #define cpu_has_mmx		boot_cpu_has(X86_FEATURE_MMX)
 #define cpu_has_fxsr		boot_cpu_has(X86_FEATURE_FXSR)
diff -urN linux-2.5.52-bk2/include/asm-i386/msr.h linux/include/asm-i386/msr.h
--- linux-2.5.52-bk2/include/asm-i386/msr.h	Sat Dec 14 12:32:05 2002
+++ linux/include/asm-i386/msr.h	Tue Dec 17 23:21:55 2002
@@ -53,6 +53,10 @@
 
 #define MSR_IA32_BBL_CR_CTL		0x119
 
+#define MSR_IA32_SYSENTER_CS		0x174
+#define MSR_IA32_SYSENTER_ESP		0x175
+#define MSR_IA32_SYSENTER_EIP		0x176
+
 #define MSR_IA32_MCG_CAP		0x179
 #define MSR_IA32_MCG_STATUS		0x17a
 #define MSR_IA32_MCG_CTL		0x17b
diff -urN linux-2.5.52-bk2/include/asm-i386/processor.h linux/include/asm-i386/processor.h
--- linux-2.5.52-bk2/include/asm-i386/processor.h	Sat Dec 14 12:32:08 2002
+++ linux/include/asm-i386/processor.h	Tue Dec 17 23:26:16 2002
@@ -14,6 +14,7 @@
 #include <asm/types.h>
 #include <asm/sigcontext.h>
 #include <asm/cpufeature.h>
+#include <asm/msr.h>
 #include <linux/cache.h>
 #include <linux/config.h>
 #include <linux/threads.h>
@@ -416,6 +417,13 @@
 	{~0, } /* ioperm */					\
 }
 
+static inline void load_esp0(struct tss_struct *tss, unsigned long esp0)
+{
+	tss->esp0 = esp0;
+	if (cpu_has_sep)
+		wrmsr(MSR_IA32_SYSENTER_ESP, esp0, 0);
+}
+
 #define start_thread(regs, new_eip, new_esp) do {		\
 	__asm__("movl %0,%%fs ; movl %0,%%gs": :"r" (0));	\
 	set_fs(USER_DS);					\

^ permalink raw reply	[flat|nested] 268+ messages in thread

* Re: Intel P6 vs P7 system call performance
  2002-12-18  5:25       ` Brian Gerst
@ 2002-12-18  6:06         ` Linus Torvalds
  2002-12-21 11:24           ` Ingo Molnar
  2002-12-21 16:07         ` Christian Leber
  1 sibling, 1 reply; 268+ messages in thread
From: Linus Torvalds @ 2002-12-18  6:06 UTC (permalink / raw)
  To: Brian Gerst; +Cc: Andi Kleen, mingo, linux-kernel, davej



On Wed, 18 Dec 2002, Brian Gerst wrote:
>
> How about this patch?  Instead of making a per-cpu trampoline, write to
> the msr during each context switch.

I wanted to avoid slowing down the context switch, but I didn't actually
time how much the MSR write hurts you (it needs to be conditional, though,
I think).

		Linus


^ permalink raw reply	[flat|nested] 268+ messages in thread

* Re: Intel P6 vs P7 system call performance
  2002-12-18  6:06         ` Linus Torvalds
@ 2002-12-21 11:24           ` Ingo Molnar
  2002-12-21 17:28             ` Jamie Lokier
  0 siblings, 1 reply; 268+ messages in thread
From: Ingo Molnar @ 2002-12-21 11:24 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Brian Gerst, Andi Kleen, linux-kernel, davej

On Tue, 17 Dec 2002, Linus Torvalds wrote:

> > How about this patch?  Instead of making a per-cpu trampoline, write to
> > the msr during each context switch.
> 
> I wanted to avoid slowing down the context switch, but I didn't actually
> time how much the MSR write hurts you (it needs to be conditional,
> though, I think).

this is the solution i took in the original vsyscall patches. The syscall
entry cost is at least one factor more important than the context-switch
cost. The MSR write was not all that slow when i measured it (it was in
the range of 20 cycles), and it's definitely something the chip makers
should keep fast.

	Ingo

^ permalink raw reply	[flat|nested] 268+ messages in thread

* Re: Intel P6 vs P7 system call performance
  2002-12-21 11:24           ` Ingo Molnar
@ 2002-12-21 17:28             ` Jamie Lokier
  0 siblings, 0 replies; 268+ messages in thread
From: Jamie Lokier @ 2002-12-21 17:28 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Linus Torvalds, Brian Gerst, Andi Kleen, linux-kernel, davej

Ingo Molnar wrote:
> > > How about this patch?  Instead of making a per-cpu trampoline, write to
> > > the msr during each context switch.
> > 
> > I wanted to avoid slowing down the context switch, but I didn't actually
> > time how much the MSR write hurts you (it needs to be conditional,
> > though, I think).
> 
> this is the solution i took in the original vsyscall patches. The syscall
> entry cost is at least one factor more important than the context-switch
> cost. The MSR write was not all that slow when i measured it (it was in
> the range of 20 cycles), and it's definitely something the chip makers
> should keep fast.

I think it would be better to make NMI and Debug trap use task gates,
if that does actually work, and let the NMI and debug handlers fix the
stack if they trap in the entry paths.  They are much rarer after all.

My unreliable memory recalls about 40 cycles for an MSR write on my Celeron.

However, I think you still need MSR writes _sometimes_ in the context
switch to disable sysenter for vm86-mode tasks.

Could the context switch be written like this?:

	if (unlikely((prev_task->flags | next_task->flags) & PF_SLOW_SWITCH)) {
		// Do rare stuff including debug registers
		// and sysenter/syscall MSR change for vm86.
	}

	// Other stuff.

-- Jamie

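The single test suggested above can be sketched in plain C. Everything
named here is an assumption for illustration: PF_SLOW_SWITCH is the
hypothetical flag from the message, and struct task and
needs_slow_switch() are stand-ins rather than the kernel's task_struct or
switch code.

```c
#define PF_SLOW_SWITCH 0x1ul	/* hypothetical flag bit */

/* Hypothetical, cut-down stand-in for a task structure. */
struct task {
	unsigned long flags;
};

/* OR the two flag words first so the common case costs one test and one
 * branch. The outgoing task is included because switching away from a
 * vm86 task has to undo its special MSR state as well. */
static int needs_slow_switch(const struct task *prev, const struct task *next)
{
	return ((prev->flags | next->flags) & PF_SLOW_SWITCH) != 0;
}
```

Only a switch between two ordinary tasks skips the slow path; a switch
into or out of a flagged task takes it.
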
^ permalink raw reply	[flat|nested] 268+ messages in thread

* Re: Intel P6 vs P7 system call performance
  2002-12-18  5:25       ` Brian Gerst
  2002-12-18  6:06         ` Linus Torvalds
@ 2002-12-21 16:07         ` Christian Leber
  1 sibling, 0 replies; 268+ messages in thread
From: Christian Leber @ 2002-12-21 16:07 UTC (permalink / raw)
  To: Brian Gerst; +Cc: linux-kernel

On Wed, Dec 18, 2002 at 12:25:38AM -0500, Brian Gerst wrote:

> How about this patch?  Instead of making a per-cpu trampoline, write to 
> the msr during each context switch.  This means that the stack pointer 
> is valid at all times, and also saves memory and a cache line bounce.  I 
> also included some misc cleanups.

Just a little bit of benchmarking:
(little testprogram by Linus out of this thread)
(on an AMD Duron 750)

2.5.52-bk2+sysenter-1 (Brian Gerst):
igor3:~# ./a.out
187.894946 cycles    (call 0xfffff000)
299.155075 cycles    (int $0x80)

2.5.52-bk6:
igor3:~# ./a.out
202.134535 cycles    (call 0xffffe000)
299.117583 cycles    (int $0x80)

Not really much, but the difference is there. (I don't know about other
side effects.)


Christian Leber

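To put the cycle counts above in relative terms, a small helper can be
used; percent_saved() is a name invented for this sketch and is not part
of the test program from the thread.

```c
/* Percentage of cycles saved when going from baseline to the new figure. */
static double percent_saved(double baseline_cycles, double new_cycles)
{
	return 100.0 * (baseline_cycles - new_cycles) / baseline_cycles;
}
```

Fed with the figures above, percent_saved(202.134535, 187.894946) comes
out at roughly 7%, and percent_saved(299.155075, 187.894946) at roughly
37%, so the patched kernel's vsyscall path is a few percent cheaper than
the 2.5.52-bk6 one on this machine while both remain well ahead of
int $0x80.
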

^ permalink raw reply	[flat|nested] 268+ messages in thread

* Re: Intel P6 vs P7 system call performance
@ 2002-12-15  8:43 scott thomason
  0 siblings, 0 replies; 268+ messages in thread
From: scott thomason @ 2002-12-15  8:43 UTC (permalink / raw)
  To: Linux Kernel Mailing List

On Saturday 14 December 2002 11:48 am, Mike Dresser wrote:
> On Sat, 14 Dec 2002, Dave Jones wrote:
> > Note that there are more factors at play than raw cpu speed in a
> > kernel compile. Your time here is slightly faster than my 2.8Ghz
> > P4-HT for example.  My guess is you have faster disk(s) than I
> > do, as most of the time mine seems to be waiting for something to
> > do.
>
> Quantum Fireball AS's in that machine.  My main comment was that
> his Athlon MP at 1.8 was half or less the speed of a single P4.
> Even with compiler changes, I wouldn't think it would make THAT
> much of a difference?

I've been doing a lot of benchmarking with "contest" lately, and one
thing I can state emphatically is that the kernel that you are
running while performing a compile can be a large factor, especially
if you are maxing out the machine with a large "make -jN". Some
kernel versions vary enormously in their ability to handle I/O load
(an area I've been paying close attention to). Sounds like you have
some decent SMP hardware, and probably a good chunk of memory to go
with it, so you might want to experiment with these kernels, which
have given good I/O performance in my tests:

linux-2.4.19-rmap14c
linux-2.4.19-rmap15a
linux-2.4.18-rml-O1 (slow at creating tarballs, fast everywhere else)

And if you don't mind bleeding edge, just go with a more recent
2.5 kernel that you can make work. You simply can't get comparable
performance out of 2.4.

I've attached some contest numbers for tests I've run to-date below. 
Please note that while I use contest as the benchmarking tool, I use
qmail compiles as the actual load, not kernel compiles (I don't have
the patience; a qmail compile takes about 35-40% of the time of a kernel
compile. Now if we can get Con to work on speeding up "Killing the
load process..." <g>).
---scott

sorry for the html table to text pasting conversion :(

kernel                              noload  process_load  ctar_load  xtar_load  read_load  list_load  mem_load

linux-2.4.18                         16.73         22.61     244.52      78.84     108.52      18.58     53.12
linux-2.4.18-ac3                     19.01         25.64      99.52      94.23     314.29      23.34    119.95
linux-2.4.18-rc1-akpm-low-latency    16.69         21.92     335.62      79.10     122.34      18.39    104.80
linux-2.4.18-rc4-aa1                 16.43         93.85     179.12     100.29      46.64      17.15     96.91
linux-2.4.18-rmap12h                 18.84         24.72     143.12      95.11     298.85      23.17    121.22
linux-2.4.18-rml-O1                  16.83         31.42     266.28      77.98      77.15      18.18     63.03
linux-2.4.18-rml-preempt             16.93         21.87     334.08      84.22     116.30      18.46     60.30
linux-2.4.18-rml-preempt+lockbreak   16.85         22.42     271.52      74.37     229.96      19.57     45.21
linux-2.4.19                         16.99         22.42     261.69     103.61     163.55      18.44     66.16
linux-2.4.19-ac4                     19.08         30.32     176.03      89.38     288.53      22.79    102.09
linux-2.4.19-akpm-low-latency        16.90         21.87     230.92     111.37     179.63      18.36     87.47
linux-2.4.19-ck14                        -             -          -          -          -          -    176.41
linux-2.4.19-rc5-aa1                 18.37         27.18     931.45     154.94     372.73      22.01    125.92
linux-2.4.19-rmap14c                 17.84         24.56      74.81      76.73     121.86      20.57    165.10
linux-2.4.19-rmap15                  18.27         24.09      71.32      77.05     146.68      18.99    102.56
linux-2.4.19-rmap15-splitactive      17.28         23.09      69.16      79.49     140.15      20.27    129.84
linux-2.4.19-rmap15a                 17.10         23.00      62.44      78.12     138.96      18.46    133.32
linux-2.4.19-rml-O1                  16.61         25.45     314.24      90.43     124.27      18.32     72.90
linux-2.4.19-rml-preempt             16.88         21.80     238.80      86.46     155.89      18.45     56.74
linux-2.4.20                         16.62         21.84     191.12     101.06     100.35      18.22     70.47
linux-2.4.20-aa1                     18.23         29.03     331.96     137.70      96.88      22.22    143.22
linux-2.4.20-ac1                     20.24         28.41     776.73     138.35     221.55      22.06    171.13
linux-2.4.20-rc2-aa1                 18.44         28.39     255.79     156.30      86.78      21.98    139.04
linux-2.5.49                         17.66         22.39      36.73      26.85      19.91      20.29     57.34
linux-2.5.50                         17.80         24.19      32.81      25.87      21.43      21.17     45.96

^ permalink raw reply	[flat|nested] 268+ messages in thread

* Re: Intel P6 vs P7 system call performance
@ 2002-12-15  4:06 Albert D. Cahalan
  2002-12-15 22:01 ` Pavel Machek
  0 siblings, 1 reply; 268+ messages in thread
From: Albert D. Cahalan @ 2002-12-15  4:06 UTC (permalink / raw)
  To: linux-kernel; +Cc: hpa, terje.eggestad

H. Peter Anvin writes:

> As far as I know, though, the SYSENTER patch didn't deal with several of
> the corner cases introduced by the generally weird SYSENTER instruction
> (such as the fact that V86 tasks can execute it despite the fact there
> is in general no way to resume execution of the V86 task afterwards.)
>
> In practice this means that vsyscalls is pretty much the only sensible
> way to do this.  Also note that INT 80h will need to be supported
> indefinitely.
>
> Personally, I wonder if it's worth the trouble, when x86-64 takes care
> of the issue anyway :)

There is another way:

Have apps enter kernel mode via Intel's purposely undefined
instruction, plus a few bytes of padding and identification.
Require that this not cross a page boundary. When it faults,
write the SYSENTER, INT 0x80, or SYSCALL as needed. Leave
the page marked clean so it doesn't need to hit swap; if it
gets paged in again it gets patched again.

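The "must not cross a page boundary" requirement above comes down to a
simple address check. A minimal sketch, assuming 4kB pages;
patch_site_in_one_page() is a made-up name, and the 8-byte length used in
the usage note is only a guess at "instruction plus padding and
identification", not something the proposal specifies.

```c
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE 4096u	/* assumed i386 page size */

/* Return 1 if the len bytes starting at addr all lie within one page,
 * so a single fault can patch the whole sequence; 0 otherwise. */
static int patch_site_in_one_page(uintptr_t addr, size_t len)
{
	uintptr_t offset = addr & (PAGE_SIZE - 1);

	return len > 0 && len <= PAGE_SIZE - offset;
}
```

For example, an 8-byte site at page offset 0xff8 ends exactly at the page
end and passes, while the same site at offset 0xffc spills into the next
page and would have to be rejected or realigned by the toolchain.
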
^ permalink raw reply	[flat|nested] 268+ messages in thread

* Re: Intel P6 vs P7 system call performance
  2002-12-15  4:06 Albert D. Cahalan
@ 2002-12-15 22:01 ` Pavel Machek
  2002-12-16  7:33   ` Albert D. Cahalan
  0 siblings, 1 reply; 268+ messages in thread
From: Pavel Machek @ 2002-12-15 22:01 UTC (permalink / raw)
  To: Albert D. Cahalan; +Cc: linux-kernel, hpa, terje.eggestad

Hi!

> > As far as I know, though, the SYSENTER patch didn't deal with several of
> > the corner cases introduced by the generally weird SYSENTER instruction
> > (such as the fact that V86 tasks can execute it despite the fact there
> > is in general no way to resume execution of the V86 task afterwards.)
> >
> > In practice this means that vsyscalls is pretty much the only sensible
> > way to do this.  Also note that INT 80h will need to be supported
> > indefinitely.
> >
> > Personally, I wonder if it's worth the trouble, when x86-64 takes care
> > of the issue anyway :)
> 
> There is another way:
> 
> Have apps enter kernel mode via Intel's purposely undefined
> instruction, plus a few bytes of padding and identification.
> Require that this not cross a page boundary. When it faults,
> write the SYSENTER, INT 0x80, or SYSCALL as needed. Leave
> the page marked clean so it doesn't need to hit swap; if it
> gets paged in again it gets patched again.

That's a *very* dirty hack. vsyscalls seem cleaner than that.
								Pavel
-- 
Worst form of spam? Adding advertisement signatures ala sourceforge.net.
What goes next? Inserting advertisement *into* email?

^ permalink raw reply	[flat|nested] 268+ messages in thread

* Re: Intel P6 vs P7 system call performance
  2002-12-15 22:01 ` Pavel Machek
@ 2002-12-16  7:33   ` Albert D. Cahalan
  2002-12-16 11:17     ` Pavel Machek
  0 siblings, 1 reply; 268+ messages in thread
From: Albert D. Cahalan @ 2002-12-16  7:33 UTC (permalink / raw)
  To: Pavel Machek; +Cc: Albert D. Cahalan, linux-kernel, hpa, terje.eggestad

Pavel Machek writes:
> [Albert Cahalan]

>> Have apps enter kernel mode via Intel's purposely undefined
>> instruction, plus a few bytes of padding and identification.
>> Require that this not cross a page boundary. When it faults,
>> write the SYSENTER, INT 0x80, or SYSCALL as needed. Leave
>> the page marked clean so it doesn't need to hit swap; if it
>> gets paged in again it gets patched again.
>
> That's a *very* dirty hack. vsyscalls seem cleaner than that.

Sure it's dirty. It's also fast, with the only overhead being
a few NOPs that could get skipped on syscall return anyway.
Patching overhead is negligible, since it only happens when a
page is brought in fresh from the disk.

The vsyscall stuff costs you on every syscall. It's nice for
when you can avoid entering kernel mode entirely, but in that
case the hack I described above can write out a call to user
code (for time-of-day I imagine) just as well as it can write
out a SYSENTER, INT 0x80, or SYSCALL instruction.

Enter with INT 0x42 if you prefer, or just pick one of the new
instructions.

An alternative would be to hack ld.so to patch the syscalls,
but then you get dirty C-O-W pages in every address space.
Permissions change, swap gets used, etc.

^ permalink raw reply	[flat|nested] 268+ messages in thread

* Re: Intel P6 vs P7 system call performance
  2002-12-16  7:33   ` Albert D. Cahalan
@ 2002-12-16 11:17     ` Pavel Machek
  2002-12-16 17:54       ` Mark Mielke
  2002-12-16 19:55       ` H. Peter Anvin
  0 siblings, 2 replies; 268+ messages in thread
From: Pavel Machek @ 2002-12-16 11:17 UTC (permalink / raw)
  To: Albert D. Cahalan; +Cc: linux-kernel, hpa, terje.eggestad

Hi!

> >> Have apps enter kernel mode via Intel's purposely undefined
> >> instruction, plus a few bytes of padding and identification.
> >> Require that this not cross a page boundary. When it faults,
> >> write the SYSENTER, INT 0x80, or SYSCALL as needed. Leave
> >> the page marked clean so it doesn't need to hit swap; if it
> >> gets paged in again it gets patched again.
> >
> > That's a *very* dirty hack. vsyscalls seem cleaner than that.
> 
> Sure it's dirty. It's also fast, with the only overhead being
> a few NOPs that could get skipped on syscall return anyway.
> Patching overhead is negligible, since it only happens when a
> page is brought in fresh from the disk.

Yes, but "read only" code changing under you... That is better
avoided.

> The vsyscall stuff costs you on every syscall. It's nice for

Well, the cost is basically one call. That's not *that* big a cost.

							Pavel
-- 
Casualties in World Trade Center: ~3k dead inside the building,
cryptography in U.S.A. and free speech in Czech Republic.

^ permalink raw reply	[flat|nested] 268+ messages in thread

* Re: Intel P6 vs P7 system call performance
  2002-12-16 11:17     ` Pavel Machek
@ 2002-12-16 17:54       ` Mark Mielke
  2002-12-16 16:07         ` Jonah Sherman
  2002-12-17  8:02         ` Helge Hafting
  2002-12-16 19:55       ` H. Peter Anvin
  1 sibling, 2 replies; 268+ messages in thread
From: Mark Mielke @ 2002-12-16 17:54 UTC (permalink / raw)
  To: Pavel Machek; +Cc: Albert D. Cahalan, linux-kernel, hpa, terje.eggestad

On Mon, Dec 16, 2002 at 12:17:59PM +0100, Pavel Machek wrote:
> > Sure it's dirty. It's also fast, with the only overhead being
> > a few NOPs that could get skipped on syscall return anyway.
> > Patching overhead is negligible, since it only happens when a
> > page is brought in fresh from the disk.
> Yes, but "read only" code changing under you... That is better
> avoided.

Programs that self verify their own CRC may get a little confused (are
there any of these left?), but other than that, 'goto is better avoided'
as well, but sometimes 'goto' is the best answer.

> > The vsyscall stuff costs you on every syscall. It's nice for
> Well, the cost is basically one call. That's not *that* big a cost.

Time for benchmarks... :-)

mark

-- 
mark@mielke.cc/markm@ncf.ca/markm@nortelnetworks.com __________________________
.  .  _  ._  . .   .__    .  . ._. .__ .   . . .__  | Neighbourhood Coder
|\/| |_| |_| |/    |_     |\/|  |  |_  |   |/  |_   | 
|  | | | | \ | \   |__ .  |  | .|. |__ |__ | \ |__  | Ottawa, Ontario, Canada

  One ring to rule them all, one ring to find them, one ring to bring them all
                       and in the darkness bind them...

                           http://mark.mielke.cc/


^ permalink raw reply	[flat|nested] 268+ messages in thread

* Re: Intel P6 vs P7 system call performance
  2002-12-16 17:54       ` Mark Mielke
@ 2002-12-16 16:07         ` Jonah Sherman
  2002-12-17  4:10           ` David Schwartz
  2002-12-17  8:02         ` Helge Hafting
  1 sibling, 1 reply; 268+ messages in thread
From: Jonah Sherman @ 2002-12-16 16:07 UTC (permalink / raw)
  To: Mark Mielke; +Cc: linux-kernel

On Mon, Dec 16, 2002 at 12:54:32PM -0500, Mark Mielke wrote:
> Programs that self verify their own CRC may get a little confused (are
> there any of these left?), but other than that, 'goto is better avoided'
> as well, but sometimes 'goto' is the best answer.

This shouldn't cause any problems.  The only way this would cause a problem
is if the program had direct system calls in it, but as long as they are
using libc (what self-crcing program doesn't use libc?), the changes would
only be made to code pages inside libc, so the program's own code pages
would remain untouched.


* Re: Intel P6 vs P7 system call performance
  2002-12-16 16:07         ` Jonah Sherman
@ 2002-12-17  4:10           ` David Schwartz
  0 siblings, 0 replies; 268+ messages in thread
From: David Schwartz @ 2002-12-17  4:10 UTC (permalink / raw)
  To: jsherman, Mark Mielke; +Cc: linux-kernel


On Mon, 16 Dec 2002 11:07:06 -0500, Jonah Sherman wrote:

>On Mon, Dec 16, 2002 at 12:54:32PM -0500, Mark Mielke wrote:

>>Programs that self verify their own CRC may get a little confused (are
>>there any of these left?), but other than that, 'goto is better avoided'
>>as well, but sometimes 'goto' is the best answer.

>This shouldn't cause any problems.  The only way this would cause a problem
>is if the program had direct system calls in it, but as long as they are
>using libc (what self-crcing program doesn't use libc?), the changes would
>only be made to code pages inside libc, so the program's own code pages
>would remain untouched.

	A program that checked its own CRC would probably be statically linked. This 
is especially likely to be true if the CRC was for security reasons.

	DS




* Re: Intel P6 vs P7 system call performance
  2002-12-16 17:54       ` Mark Mielke
  2002-12-16 16:07         ` Jonah Sherman
@ 2002-12-17  8:02         ` Helge Hafting
  1 sibling, 0 replies; 268+ messages in thread
From: Helge Hafting @ 2002-12-17  8:02 UTC (permalink / raw)
  To: Mark Mielke, linux-kernel

Mark Mielke wrote:
> 
> On Mon, Dec 16, 2002 at 12:17:59PM +0100, Pavel Machek wrote:
> > > Sure it's dirty. It's also fast, with the only overhead being
> > > a few NOPs that could get skipped on syscall return anyway.
> > > Patching overhead is negligible, since it only happens when a
> > > page is brought in fresh from the disk.
> > Yes but "read only" code changing under you... Should better be
> > avoided.
> 
> Programs that self verify their own CRC may get a little confused (are
> there any of these left?), but other than that, 'goto is better avoided'
> as well, but sometimes 'goto' is the best answer.

And then there are programs that store constants as parts of the code,
so that their constant-ness is enforced by the MMU.

This can be taken further - the compiler can save space by looking
through the generated code and using an address in the code as the
constant if it happens to have the right value.  With some
bad luck it chooses the syscall sequence, which it really doesn't expect
to be modified.

Helge Hafting


* Re: Intel P6 vs P7 system call performance
  2002-12-16 11:17     ` Pavel Machek
  2002-12-16 17:54       ` Mark Mielke
@ 2002-12-16 19:55       ` H. Peter Anvin
  1 sibling, 0 replies; 268+ messages in thread
From: H. Peter Anvin @ 2002-12-16 19:55 UTC (permalink / raw)
  To: Pavel Machek; +Cc: Albert D. Cahalan, linux-kernel, terje.eggestad

Pavel Machek wrote:
> 
>>The vsyscall stuff costs you on every syscall. It's nice for
> 
> 
> Well, the cost is basically one call. That's not *that* big cost.
> 

You absolutely, positively *need* the call anyway.  SYSENTER trashes EIP.

	-hpa




* Re: Intel P6 vs P7 system call performance
@ 2002-12-13 21:52 Margit Schubert-While
  0 siblings, 0 replies; 268+ messages in thread
From: Margit Schubert-While @ 2002-12-13 21:52 UTC (permalink / raw)
  To: linux-kernel

Hmm Apples & Oranges

diff hanoi.c hanoi2.c
17a18
 > void  mov();
51c52
<               mov(disk,1,3);
---
 >               (void)mov(disk,1,3);
58c59
< mov(n,f,t)
---
 > void mov(n,f,t)
67,69c68,70
<       mov(n-1,f,o);
<       mov(1,f,t);
<       mov(n-1,o,t);
---
 >       (void)mov(n-1,f,o);
 >       (void)mov(1,f,t);
 >       (void)mov(n-1,o,t);


cc -O3 -march=i686 -mcpu=i686 -fomit-frame-pointer  hanoi.c -o hanoi
cc -O3 -march=i686 -mcpu=i686 -fomit-frame-pointer  hanoi2.c -o hanoi2
./hanoi 10
536837 loops
./hanoi 10
538709 loops
./hanoi2 10
850127 loops
./hanoi2 10
852651 loops

Huu ?

Margit 



* Re: Intel P6 vs P7 system call performance
@ 2002-12-13 19:32 Dieter Nützel
  0 siblings, 0 replies; 268+ messages in thread
From: Dieter Nützel @ 2002-12-13 19:32 UTC (permalink / raw)
  To: Margit Schubert-While; +Cc: Linux Kernel List

> Well, in the 2.4.x kernels, the P4 gets compiled as a I686 with NO special
> treatment :-) (Not even prefetch, because of an ifdef bug)
> The P3 at least gets one level of prefetch and the AMD's get special compile
> options(arch=k6,athlon), full prefetch and SSE.
>
> >From Mike Hayward
> >Dual Pentium 4 Xeon 2.4Ghz 2.4.19 kernel 33661.9 lps (10 secs, 6 samples)
>
> Hmm, P4 2.4Ghz , also gcc -O3 -march=i686
>
> margit:/disk03/bytebench-3.1/src # ./hanoi 10
> 576264 loops
> margit:/disk03/bytebench-3.1/src # ./hanoi 10
> 571001 loops
> margit:/disk03/bytebench-3.1/src # ./hanoi 10
> 571133 loops
> margit:/disk03/bytebench-3.1/src # ./hanoi 10
> 570517 loops
> margit:/disk03/bytebench-3.1/src # ./hanoi 10
> 571019 loops
> margit:/disk03/bytebench-3.1/src # ./hanoi 10
> 582688 loops

Apples and oranges? ;-)

dual AMD Athlon MP 1900+, 1.6 GHz
(but single threaded app)
2.4.20-aa1
gcc-2.95.3

unixbench-4.1.0/src> gcc -O -mcpu=k6 -march=i686 -fomit-frame-pointer 
-mpreferred-stack-boundary=2 -malign-functions=4 -o hanoi hanoi.c
unixbench-4.1.0/src> sync
unixbench-4.1.0/src> ./hanoi 10                                                            
565338 loops
unixbench-4.1.0/src> ./hanoi 10
565379 loops
unixbench-4.1.0/src> ./hanoi 10
565448 loops
unixbench-4.1.0/src> ./hanoi 10
565218 loops
unixbench-4.1.0/src> ./hanoi 10
565148 loops
unixbench-4.1.0/src> ./hanoi 10
565136 loops

You should run "./Run hanoi"...

Recursion Test--Tower of Hanoi            58404.5 lps   (19.3 secs, 3 samples)

Regards,
	Dieter
-- 
Dieter Nützel
Graduate Student, Computer Science

University of Hamburg
Department of Computer Science
@home: Dieter.Nuetzel at hamburg.de (replace at with @)


* Re: Intel P6 vs P7 system call performance
@ 2002-12-13 17:51 Margit Schubert-While
  0 siblings, 0 replies; 268+ messages in thread
From: Margit Schubert-While @ 2002-12-13 17:51 UTC (permalink / raw)
  To: linux-kernel

Well, in the 2.4.x kernels, the P4 gets compiled as a I686 with NO special
treatment :-) (Not even prefetch, because of an ifdef bug)
The P3 at least gets one level of prefetch and the AMD's get special compile
options(arch=k6,athlon), full prefetch and SSE.

 >From Mike Hayward
 >Dual Pentium 4 Xeon 2.4Ghz 2.4.19 kernel 33661.9 lps (10 secs, 6 samples)

Hmm, P4 2.4Ghz , also gcc -O3 -march=i686

margit:/disk03/bytebench-3.1/src # ./hanoi 10
576264 loops
margit:/disk03/bytebench-3.1/src # ./hanoi 10
571001 loops
margit:/disk03/bytebench-3.1/src # ./hanoi 10
571133 loops
margit:/disk03/bytebench-3.1/src # ./hanoi 10
570517 loops
margit:/disk03/bytebench-3.1/src # ./hanoi 10
571019 loops
margit:/disk03/bytebench-3.1/src # ./hanoi 10
582688 loops

Margit 



* Re: Intel P6 vs P7 system call performance
@ 2002-12-11 12:48 Terje Eggestad
  2002-12-11 18:50 ` H. Peter Anvin
  0 siblings, 1 reply; 268+ messages in thread
From: Terje Eggestad @ 2002-12-11 12:48 UTC (permalink / raw)
  To: H. Peter Anvin; +Cc: linux-kernel, Dave Jones

It gets even worse with Hammer. When you run Hammer in compatibility mode
(32 bit app on a 64 bit OS) the sysenter is an illegal instruction.
Since Intel doesn't implement syscall, there is no portable sys*
instruction for 32 bit apps. You could argue that libc hides it for you
and you just need libc to test the host at startup (do I get a SIGILL if
I try to do getpid() with sysenter? syscall? if so we use int 80 for
syscalls).  But not all programs are dynamically linked.

Too bad really, I tried the sysenter patch once, and the gain (on PIII
and athlon) was significant.

Fortunately the 64bit libc for hammer uses syscall. 


PS:  rdtsc on P4 is also painfully slow!!!

TJ

On man, 2002-12-09 at 20:46, H. Peter Anvin wrote: 
> Followup to:  <20021209193649.GC10316@suse.de>
> By author:    Dave Jones <davej@codemonkey.org.uk>
> In newsgroup: linux.dev.kernel
> >
> > On Mon, Dec 09, 2002 at 05:48:45PM +0000, Linus Torvalds wrote:
> > 
> >  > P4's really suck at system calls.  A 2.8GHz P4 does a simple system call
> >  > a lot _slower_ than a 500MHz PIII. 
> >  > 
> >  > The P4 has problems with some other things too, but the "int + iret"
> >  > instruction combination is absolutely the worst I've seen.  A 1.2GHz
> >  > Athlon will be 5-10 times faster than the fastest P4 on system call
> >  > overhead. 
> > 
> > Time to look into an alternative like SYSCALL perhaps ?
> > 
> 
> SYSCALL is AMD.  SYSENTER is Intel, and is likely to be significantly
> faster.  Unfortunately SYSENTER is also extremely braindamaged, in
> that it destroys *both* the EIP and the ESP beyond recovery, and
> because it's allowed in V86 and 16-bit modes (where it will cause
> permanent data loss) which means that it needs to be able to be turned
> off for things like DOSEMU and WINE to work correctly.
> 
> 	-hpa



-- 
_________________________________________________________________________

Terje Eggestad                  mailto:terje.eggestad@scali.no
Scali Scalable Linux Systems    http://www.scali.com

Olaf Helsets Vei 6              tel:    +47 22 62 89 61 (OFFICE)
P.O.Box 150, Oppsal                     +47 975 31 574  (MOBILE)
N-0619 Oslo                     fax:    +47 22 62 89 51
NORWAY            
_________________________________________________________________________



* Re: Intel P6 vs P7 system call performance
  2002-12-11 12:48 Terje Eggestad
@ 2002-12-11 18:50 ` H. Peter Anvin
  2002-12-12  9:42   ` Terje Eggestad
  0 siblings, 1 reply; 268+ messages in thread
From: H. Peter Anvin @ 2002-12-11 18:50 UTC (permalink / raw)
  To: Terje Eggestad; +Cc: linux-kernel, Dave Jones

Terje Eggestad wrote:
> It gets even worse with Hammer. When you run Hammer in compatibility mode
> (32 bit app on a 64 bit OS) the sysenter is an illegal instruction.
>
> Since Intel doesn't implement syscall, there is no portable sys*
> instruction for 32 bit apps. You could argue that libc hides it for you
> and you just need libc to test the host at startup (do I get a SIGILL if
> I try to do getpid() with sysenter? syscall? if so we use int 80 for
> syscalls).  But not all programs are dynamically linked.


Linus talked about this once, and it was agreed that the only sane way
to do this properly was via vsyscalls... have a page mapped somewhere in
high (kernel-area) memory, say at 0xfffff000, but readable by normal
processes.  A system call can be invoked via call 0xfffff000, and the
*kernel* enters whatever code is appropriate to enter itself.
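
A minimal sketch of how a program could ask the kernel where that page is
instead of hard-coding 0xfffff000. It assumes Linux with a libc that
provides getauxval() (glibc 2.16 or later) and relies on the AT_SYSINFO
auxiliary-vector entry proposed elsewhere in this thread. The helper names
are hypothetical. On many 64-bit kernels AT_SYSINFO is absent and only
AT_SYSINFO_EHDR is supplied, so both are queried:

```c
#include <elf.h>        /* AT_SYSINFO, AT_SYSINFO_EHDR */
#include <stdio.h>
#include <sys/auxv.h>   /* getauxval() */

/* Hypothetical helper: address of the kernel's syscall entry stub,
 * or 0 if the kernel did not supply an AT_SYSINFO entry. */
unsigned long sysinfo_entry(void)
{
    return getauxval(AT_SYSINFO);
}

/* Hypothetical helper: base of the kernel-mapped ELF image,
 * or 0 if the kernel did not supply AT_SYSINFO_EHDR. */
unsigned long sysinfo_ehdr(void)
{
    return getauxval(AT_SYSINFO_EHDR);
}

/* Print whatever the kernel passed to this process at exec time. */
void print_sysinfo(void)
{
    printf("AT_SYSINFO      = %#lx\n", sysinfo_entry());
    printf("AT_SYSINFO_EHDR = %#lx\n", sysinfo_ehdr());
}
```

Calling print_sysinfo() from main() shows which of the two entries the
running kernel provides; a value of 0 means the entry is missing.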

> Too bad really, I tried the sysenter patch once, and the gain (on PIII
> and athlon) was significant.
> 
> Fortunately the 64bit libc for hammer uses syscall. 
> 

Yes.

> 
> PS:  rdtsc on P4 is also painfully slow!!!
> 

Now that's just braindead...

	-hpa




* Re: Intel P6 vs P7 system call performance
  2002-12-11 18:50 ` H. Peter Anvin
@ 2002-12-12  9:42   ` Terje Eggestad
  2002-12-12 10:06     ` Arjan van de Ven
  2002-12-12 20:36     ` Mark Mielke
  0 siblings, 2 replies; 268+ messages in thread
From: Terje Eggestad @ 2002-12-12  9:42 UTC (permalink / raw)
  To: H. Peter Anvin; +Cc: linux-kernel, Dave Jones

On ons, 2002-12-11 at 19:50, H. Peter Anvin wrote:
> Terje Eggestad wrote:
> 
> > 
> > PS:  rdtsc on P4 is also painfully slow!!!
> > 
> 
> Now that's just braindead...
> 

It takes about 11 cycles on Athlon, 34 on PII, and a whopping 84 on P4.

For a simple op like that, even 11 is a lot... Really makes you wonder.
 

> 	-hpa

TJ

-- 
_________________________________________________________________________

Terje Eggestad                  mailto:terje.eggestad@scali.no
Scali Scalable Linux Systems    http://www.scali.com

Olaf Helsets Vei 6              tel:    +47 22 62 89 61 (OFFICE)
P.O.Box 150, Oppsal                     +47 975 31 574  (MOBILE)
N-0619 Oslo                     fax:    +47 22 62 89 51
NORWAY            
_________________________________________________________________________



* Re: Intel P6 vs P7 system call performance
  2002-12-12  9:42   ` Terje Eggestad
@ 2002-12-12 10:06     ` Arjan van de Ven
  2002-12-12 10:31       ` Terje Eggestad
  2002-12-12 19:03       ` H. Peter Anvin
  2002-12-12 20:36     ` Mark Mielke
  1 sibling, 2 replies; 268+ messages in thread
From: Arjan van de Ven @ 2002-12-12 10:06 UTC (permalink / raw)
  To: Terje Eggestad; +Cc: H. Peter Anvin, linux-kernel, Dave Jones

On Thu, 2002-12-12 at 10:42, Terje Eggestad wrote:

> It takes about 11 cycles on Athlon, 34 on PII, and a whopping 84 on P4.
> 
> For a simple op like that, even 11 is a lot... Really makes you wonder.

wasn't rdtsc also supposed to be a pipeline sync of the cpu?
(or am I confusing it with cpuid)


* Re: Intel P6 vs P7 system call performance
  2002-12-12 10:06     ` Arjan van de Ven
@ 2002-12-12 10:31       ` Terje Eggestad
  2002-12-12 19:03       ` H. Peter Anvin
  1 sibling, 0 replies; 268+ messages in thread
From: Terje Eggestad @ 2002-12-12 10:31 UTC (permalink / raw)
  To: Arjan van de Ven; +Cc: H. Peter Anvin, linux-kernel, Dave Jones

On tor, 2002-12-12 at 11:06, Arjan van de Ven wrote:
> On Thu, 2002-12-12 at 10:42, Terje Eggestad wrote:
> 
> > It takes about 11 cycles on Athlon, 34 on PII, and a whopping 84 on P4.
> > 
> > For a simple op like that, even 11 is a lot... Really makes you wonder.
> 
> wasn't rdtsc also supposed to be a pipeline sync of the cpu?
> (or am I confusing it with cpuid)

This is what the P4 manual says:

"The RDTSC instruction is not a serializing instruction. Thus, it does
not necessarily wait until all previous instructions have been executed
before reading the counter. Similarly, subsequent instructions may begin
execution before the read operation is performed."

Thus it *shouldn't* sync the pipeline. cpuid is a serializing inst, yes.
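
Because RDTSC is not serializing, one rough way to estimate its cost is
to issue two reads back to back and keep the smallest gap seen over many
tries. A minimal sketch, assuming GCC or Clang on x86 (the __rdtsc()
intrinsic from <x86intrin.h>); rdtsc_min_gap is a hypothetical name. The
result is in TSC ticks, which on current CPUs need not equal core clock
cycles, so treat the figure as indicative only:

```c
#include <stdint.h>
#if defined(__i386__) || defined(__x86_64__)
#include <x86intrin.h>  /* __rdtsc() on GCC and Clang */
#endif

/* Smallest observed gap, in TSC ticks, between two back-to-back
 * RDTSC reads.  Returns UINT64_MAX if tries is 0, and 0 on
 * non-x86 builds where there is no TSC to read. */
uint64_t rdtsc_min_gap(unsigned tries)
{
#if defined(__i386__) || defined(__x86_64__)
    uint64_t best = UINT64_MAX;
    for (unsigned i = 0; i < tries; i++) {
        uint64_t t0 = __rdtsc();
        uint64_t t1 = __rdtsc();
        uint64_t gap = t1 - t0;
        /* Keeping the minimum filters out interrupts and migrations. */
        if (gap < best)
            best = gap;
    }
    return best;
#else
    (void)tries;
    return 0;
#endif
}
```

Taking the minimum rather than the average is a deliberate choice here:
anything that interrupts the pair of reads can only make the gap larger.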

TJ

-- 
_________________________________________________________________________

Terje Eggestad                  mailto:terje.eggestad@scali.no
Scali Scalable Linux Systems    http://www.scali.com

Olaf Helsets Vei 6              tel:    +47 22 62 89 61 (OFFICE)
P.O.Box 150, Oppsal                     +47 975 31 574  (MOBILE)
N-0619 Oslo                     fax:    +47 22 62 89 51
NORWAY            
_________________________________________________________________________



* Re: Intel P6 vs P7 system call performance
  2002-12-12 10:06     ` Arjan van de Ven
  2002-12-12 10:31       ` Terje Eggestad
@ 2002-12-12 19:03       ` H. Peter Anvin
  1 sibling, 0 replies; 268+ messages in thread
From: H. Peter Anvin @ 2002-12-12 19:03 UTC (permalink / raw)
  To: Arjan van de Ven; +Cc: Terje Eggestad, linux-kernel, Dave Jones

Arjan van de Ven wrote:
> On Thu, 2002-12-12 at 10:42, Terje Eggestad wrote:
> 
> 
>>It takes about 11 cycles on Athlon, 34 on PII, and a whopping 84 on P4.
>>
>>For a simple op like that, even 11 is a lot... Really makes you wonder.
> 
> 
> wasn't rdtsc also supposed to be a pipeline sync of the cpu?
> (or am I confusing it with cpuid)

That's CPUID.

	-hpa



* Re: Intel P6 vs P7 system call performance
  2002-12-12  9:42   ` Terje Eggestad
  2002-12-12 10:06     ` Arjan van de Ven
@ 2002-12-12 20:36     ` Mark Mielke
  2002-12-12 20:56       ` J.A. Magallon
  2002-12-12 20:56       ` Vojtech Pavlik
  1 sibling, 2 replies; 268+ messages in thread
From: Mark Mielke @ 2002-12-12 20:36 UTC (permalink / raw)
  To: Terje Eggestad; +Cc: H. Peter Anvin, linux-kernel, Dave Jones

On Thu, Dec 12, 2002 at 10:42:56AM +0100, Terje Eggestad wrote:
> On ons, 2002-12-11 at 19:50, H. Peter Anvin wrote:
> > Terje Eggestad wrote:
> > > PS:  rdtsc on P4 is also painfully slow!!!
> > Now that's just braindead...
> It takes about 11 cycles on Athlon, 34 on PII, and a whopping 84 on P4.
> For a simple op like that, even 11 is a lot... Really makes you wonder.

Some of this discussion is a little bit unfair. My understanding of what
Intel has done with the P4, is create an architecture that allows for
higher clock rates. Sure the P4 might take 84, vs PII 34, but how many
PII 2.4 Ghz machines have you ever seen on the market?

Certainly, some of their decisions seem to be a little odd on the surface.

That doesn't mean the situation is black and white.

mark

-- 
mark@mielke.cc/markm@ncf.ca/markm@nortelnetworks.com __________________________
.  .  _  ._  . .   .__    .  . ._. .__ .   . . .__  | Neighbourhood Coder
|\/| |_| |_| |/    |_     |\/|  |  |_  |   |/  |_   | 
|  | | | | \ | \   |__ .  |  | .|. |__ |__ | \ |__  | Ottawa, Ontario, Canada

  One ring to rule them all, one ring to find them, one ring to bring them all
                       and in the darkness bind them...

                           http://mark.mielke.cc/



* Re: Intel P6 vs P7 system call performance
  2002-12-12 20:36     ` Mark Mielke
@ 2002-12-12 20:56       ` J.A. Magallon
  2002-12-12 20:12         ` Zac Hansen
  2002-12-13  9:21         ` Terje Eggestad
  2002-12-12 20:56       ` Vojtech Pavlik
  1 sibling, 2 replies; 268+ messages in thread
From: J.A. Magallon @ 2002-12-12 20:56 UTC (permalink / raw)
  To: Mark Mielke; +Cc: Terje Eggestad, H. Peter Anvin, linux-kernel, Dave Jones


On 2002.12.12 Mark Mielke wrote:
>On Thu, Dec 12, 2002 at 10:42:56AM +0100, Terje Eggestad wrote:
>> On ons, 2002-12-11 at 19:50, H. Peter Anvin wrote:
>> > Terje Eggestad wrote:
>> > > PS:  rdtsc on P4 is also painfully slow!!!
>> > Now that's just braindead...
>> It takes about 11 cycles on Athlon, 34 on PII, and a whopping 84 on P4.
>> For a simple op like that, even 11 is a lot... Really makes you wonder.
>
>Some of this discussion is a little bit unfair. My understanding of what
>Intel has done with the P4, is create an architecture that allows for
>higher clock rates. Sure the P4 might take 84, vs PII 34, but how many
>PII 2.4 Ghz machines have you ever seen on the market?
>
>Certainly, some of their decisions seem to be a little odd on the surface.
>
>That doesn't mean the situation is black and white.
>

No. The situation is just black. Each day Intel processors are a bigger
pile of crap and less intelligent, but MHz compensate for the average
office user. Think of what a P4 could do if the same effort put into
MHz was put into getting a cheap cache of 4MB or 8MB like MIPSes have. Or
closer, 1MB like G4s.
If syscalls take 300% time but processor is also 300% faster 'nobody
notices'. 

-- 
J.A. Magallon <jamagallon@able.es>      \                 Software is like sex:
werewolf.able.es                         \           It's better when it's free
Mandrake Linux release 9.1 (Cooker) for i586
Linux 2.4.20-jam1 (gcc 3.2 (Mandrake Linux 9.1 3.2-4mdk))


* Re: Intel P6 vs P7 system call performance
  2002-12-12 20:56       ` J.A. Magallon
@ 2002-12-12 20:12         ` Zac Hansen
  2002-12-13  9:21         ` Terje Eggestad
  1 sibling, 0 replies; 268+ messages in thread
From: Zac Hansen @ 2002-12-12 20:12 UTC (permalink / raw)
  To: J.A. Magallon; +Cc: linux-kernel

> 
> No. The situation is just black. Each day Intel processors are a bigger
> pile of crap and less intelligent

My hyper-threaded xeons beg to argue with you -- all 4 (2) of them.

> , but MHz compensate for the average
> office user. Think of what a P4 could do if the same effort put into
> MHz was put into getting a cheap cache of 4MB or 8MB like MIPSes have. Or
> closer, 1MB like G4s.

Err, syscalls are still going to take the same amount of time no matter 
how much cache the chip has on it.  And, IMHO, adding more cache to make a 
processor faster is just as "dumb" as bumping the MHz.  

> If syscalls take 300% time but processor is also 300% faster 'nobody
> notices'. 
> 

The point many are forgetting is that processors do a lot more than system 
calls.  And P4's are quite quick at doing this.. especially those new 
3+GHz ones (with hyperthreading).

By the way, did everyone see the Tom's Hardware Guide comparison
between the P4 3.06 with hyperthreading on and a P4 3.6 without
hyperthreading?

http://www17.tomshardware.com/cpu/20021114/index.html

For those of you who just want the info -- here's the spoiler -- when 
running multiple apps, the 3.06 can torch the 3.6.  Check out the second 
benchmark on this page

http://www17.tomshardware.com/cpu/20021114/p4_306ht-16.html

25% faster.  Most of the other benchmarks don't show off hyperthreading, 
as they're running a single process, but from personal experience, it's 
nice.  I don't know why they give you the option to turn it off in the 
bios.  I have 2 xeons, and even then I leave HT on on both.  I'd not even 
think about considering turning it off if I only had 1 processor..

--Zac
xaxxon@slackworks.com


* Re: Intel P6 vs P7 system call performance
  2002-12-12 20:56       ` J.A. Magallon
  2002-12-12 20:12         ` Zac Hansen
@ 2002-12-13  9:21         ` Terje Eggestad
  2002-12-13 15:58           ` Ville Herva
  1 sibling, 1 reply; 268+ messages in thread
From: Terje Eggestad @ 2002-12-13  9:21 UTC (permalink / raw)
  To: J.A. Magallon; +Cc: Mark Mielke, H. Peter Anvin, linux-kernel, Dave Jones

On tor, 2002-12-12 at 21:56, J.A. Magallon wrote:
> On 2002.12.12 Mark Mielke wrote:
> >On Thu, Dec 12, 2002 at 10:42:56AM +0100, Terje Eggestad wrote:
> >> On ons, 2002-12-11 at 19:50, H. Peter Anvin wrote:
> >> > Terje Eggestad wrote:
> >> > > PS:  rdtsc on P4 is also painfully slow!!!
> >> > Now that's just braindead...
> >> It takes about 11 cycles on Athlon, 34 on PII, and a whopping 84 on P4.
> >> For a simple op like that, even 11 is a lot... Really makes you wonder.
> >
> >Some of this discussion is a little bit unfair. My understanding of what
> >Intel has done with the P4, is create an architecture that allows for
> >higher clock rates. Sure the P4 might take 84, vs PII 34, but how many
> >PII 2.4 Ghz machines have you ever seen on the market?
> >
> >Certainly, some of their decisions seem to be a little odd on the surface.
> >
> >That doesn't mean the situation is black and white.
> >
> 
> No. The situation is just black. Each day Intel processors are a bigger
> pile of crap and less intelligent, but MHz compensate for the average
> office user. Think of what a P4 could do if the same effort put into
> MHz was put into getting a cheap cache of 4MB or 8MB like MIPSes have. Or
> closer, 1MB like G4s.
> If syscalls take 300% time but processor is also 300% faster 'nobody
> notices'.
  
Well, it would make sense if Intel had traded away rdtsc speed for more
commonly used things, but even that doesn't seem to be the case. I'm
measuring the overhead of doing a syscall on Linux (int 80) to be ~280
cycles on PIII and Athlon, while it's 1600 cycles on P4.

TJ


 
-- 
_________________________________________________________________________

Terje Eggestad                  mailto:terje.eggestad@scali.no
Scali Scalable Linux Systems    http://www.scali.com

Olaf Helsets Vei 6              tel:    +47 22 62 89 61 (OFFICE)
P.O.Box 150, Oppsal                     +47 975 31 574  (MOBILE)
N-0619 Oslo                     fax:    +47 22 62 89 51
NORWAY            
_________________________________________________________________________



* Re: Intel P6 vs P7 system call performance
  2002-12-13  9:21         ` Terje Eggestad
@ 2002-12-13 15:58           ` Ville Herva
  2002-12-13 21:57             ` Terje Eggestad
  0 siblings, 1 reply; 268+ messages in thread
From: Ville Herva @ 2002-12-13 15:58 UTC (permalink / raw)
  To: Terje Eggestad
  Cc: J.A. Magallon, Mark Mielke, H. Peter Anvin, linux-kernel,
	Dave Jones

On Fri, Dec 13, 2002 at 10:21:11AM +0100, you [Terje Eggestad] wrote:
>   
> Well, it would make sense if Intel had traded away rdtsc speed for more
> commonly used things, but even that doesn't seem to be the case. I'm
> measuring the overhead of doing a syscall on Linux (int 80) to be ~280
> cycles on PIII and Athlon, while it's 1600 cycles on P4.

Just out of interest, how much would sysenter (or syscall on amd) cost,
then? (Supposing it can be feasibly implemented.)

I think I heard WinXP (W2k too?) is using sysenter?


-- v --

v@iki.fi


* Re: Intel P6 vs P7 system call performance
  2002-12-13 15:58           ` Ville Herva
@ 2002-12-13 21:57             ` Terje Eggestad
  2002-12-13 22:53               ` H. Peter Anvin
  0 siblings, 1 reply; 268+ messages in thread
From: Terje Eggestad @ 2002-12-13 21:57 UTC (permalink / raw)
  To: Ville Herva
  Cc: J.A. Magallon, Mark Mielke, H. Peter Anvin, linux-kernel,
	Dave Jones

I haven't tried the vsyscall patch, but there was a sysenter patch
floating around that I tried. It reduced the syscall overhead by 1/3
to 1/4, but I never tried it on P4.

FYI: Just note that I say overhead, which I assume to be the time it
takes to do something like getpid(), write(-1,...), select(-1, ...) (etc.
that is immediately returned with -EINVAL by the kernel).
Since the kernel does execute quite a few instructions besides the
int/iret or sysenter/sysexit pair, it's an assumption that the int 80 is
the culprit.
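
A minimal sketch of that kind of overhead measurement, assuming Linux and
a POSIX clock_gettime(). The function name getpid_ns_per_call is
hypothetical, and the result is wall-clock nanoseconds per call rather
than cycles. The raw syscall() wrapper is used because some glibc
versions cached the pid in user space, which would hide the kernel entry
entirely:

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE         /* for syscall() under strict -std= modes */
#endif
#include <sys/syscall.h>    /* SYS_getpid */
#include <time.h>           /* clock_gettime(), CLOCK_MONOTONIC */
#include <unistd.h>         /* syscall() */

/* Average wall-clock nanoseconds per getpid system call over `calls`
 * iterations.  Returns 0.0 if calls is not positive. */
double getpid_ns_per_call(long calls)
{
    struct timespec a, b;

    if (calls <= 0)
        return 0.0;

    clock_gettime(CLOCK_MONOTONIC, &a);
    for (long i = 0; i < calls; i++)
        syscall(SYS_getpid);    /* trivial call: almost pure entry/exit cost */
    clock_gettime(CLOCK_MONOTONIC, &b);

    double ns = (double)(b.tv_sec - a.tv_sec) * 1e9
              + (double)(b.tv_nsec - a.tv_nsec);
    return ns / (double)calls;
}
```

The figure still includes the kernel's own dispatch code, so, as noted
above, it bounds the cost of the entry instruction rather than isolating
it.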

It would be nice if someone bothered to try this on a windoze box.
(Un)fortunately I live in a windoze free environment. :-)

TJ


On Fri, 2002-12-13 at 16:58, Ville Herva wrote:
    On Fri, Dec 13, 2002 at 10:21:11AM +0100, you [Terje Eggestad] wrote:
    >   
    > Well, it would make sense if Intel had traded away rdtsc speed for more
    > commonly used things, but even that doesn't seem to be the case. I'm
    > measuring the overhead of doing a syscall on Linux (int 80) to be ~280
    > cycles on PIII and Athlon, while it's 1600 cycles on P4.
    
    Just out of interest, how much would sysenter (or syscall on amd) cost,
    then? (Supposing it can be feasibly implemented.)
    
    I think I heard WinXP (W2k too?) is using sysenter?
    
    
    -- v --
    
v@iki.fi
-- 
_________________________________________________________________________

Terje Eggestad                  mailto:terje.eggestad@scali.no
Scali Scalable Linux Systems    http://www.scali.com

Olaf Helsets Vei 6              tel:    +47 22 62 89 61 (OFFICE)
P.O.Box 150, Oppsal                     +47 975 31 574  (MOBILE)
N-0619 Oslo                     fax:    +47 22 62 89 51
NORWAY            
_________________________________________________________________________



* Re: Intel P6 vs P7 system call performance
  2002-12-13 21:57             ` Terje Eggestad
@ 2002-12-13 22:53               ` H. Peter Anvin
  0 siblings, 0 replies; 268+ messages in thread
From: H. Peter Anvin @ 2002-12-13 22:53 UTC (permalink / raw)
  To: Terje Eggestad
  Cc: Ville Herva, J.A. Magallon, Mark Mielke, linux-kernel, Dave Jones

Terje Eggestad wrote:
> I haven't tried the vsyscall patch, but there was a sysenter patch
> floating around that I tried. It reduced the syscall overhead by 1/3
> to 1/4, but I never tried it on P4.
> 
> FYI: Just note that I say overhead, which I assume to be the time it
> takes to do something like getpid(), write(-1,...), select(-1, ...) (etc.
> that is immediately returned with -EINVAL by the kernel).
> Since the kernel does execute quite a few instructions besides the
> int/iret or sysenter/sysexit pair, it's an assumption that the int 80 is
> the culprit.
> 

IRET in particular is a very slow instruction.

As far as I know, though, the SYSENTER patch didn't deal with several of
the corner cases introduced by the generally weird SYSENTER instruction
(such as the fact that V86 tasks can execute it despite the fact there
is in general no way to resume execution of the V86 task afterwards.)

In practice this means that vsyscalls is pretty much the only sensible
way to do this.  Also note that INT 80h will need to be supported
indefinitely.

Personally, I wonder if it's worth the trouble, when x86-64 takes care
of the issue anyway :)

	-hpa




* Re: Intel P6 vs P7 system call performance
  2002-12-12 20:36     ` Mark Mielke
  2002-12-12 20:56       ` J.A. Magallon
@ 2002-12-12 20:56       ` Vojtech Pavlik
  1 sibling, 0 replies; 268+ messages in thread
From: Vojtech Pavlik @ 2002-12-12 20:56 UTC (permalink / raw)
  To: Mark Mielke; +Cc: Terje Eggestad, H. Peter Anvin, linux-kernel, Dave Jones

On Thu, Dec 12, 2002 at 03:36:46PM -0500, Mark Mielke wrote:
> On Thu, Dec 12, 2002 at 10:42:56AM +0100, Terje Eggestad wrote:
> > On ons, 2002-12-11 at 19:50, H. Peter Anvin wrote:
> > > Terje Eggestad wrote:
> > > > PS:  rdtsc on P4 is also painfully slow!!!
> > > Now that's just braindead...
> > It takes about 11 cycles on Athlon, 34 on PII, and a whopping 84 on P4.
> > For a simple op like that, even 11 is a lot... Really makes you wonder.
> 
> Some of this discussion is a little bit unfair. My understanding of what
> Intel has done with the P4, is create an architecture that allows for
> higher clock rates. Sure the P4 might take 84, vs PII 34, but how many
> PII 2.4 Ghz machines have you ever seen on the market?
> 
> Certainly, some of their decisions seem to be a little odd on the surface.
> 
> That doesn't mean the situation is black and white.

Assume a 1GHz P-III. 34 clocks @ 1GHz = 34 ns. 84 clocks @ 2.4 GHz = 35 ns.
That's actually slower. Fortunately the P4 isn't this bad on all
instructions.
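
The arithmetic above is just cycles divided by clock rate. A small sketch
that makes it reusable for the other cycle counts quoted in this thread;
cycles_to_ns is a hypothetical helper name:

```c
/* Latency in nanoseconds of an operation that costs `cycles` clock
 * cycles on a CPU running at `ghz` gigahertz.  One cycle at 1 GHz
 * lasts exactly 1 ns, so the conversion is a plain division. */
double cycles_to_ns(double cycles, double ghz)
{
    return cycles / ghz;
}
```

cycles_to_ns(34, 1.0) gives 34 ns for the 1 GHz P-III and
cycles_to_ns(84, 2.4) gives about 35 ns for the 2.4 GHz P4, matching the
figures above.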

-- 
Vojtech Pavlik
SuSE Labs


* Intel P6 vs P7 system call performance
@ 2002-12-09  8:30 Mike Hayward
  2002-12-09 15:40 ` erich
                   ` (2 more replies)
  0 siblings, 3 replies; 268+ messages in thread
From: Mike Hayward @ 2002-12-09  8:30 UTC (permalink / raw)
  To: linux-kernel

I have been benchmarking Pentium 4 boxes against my Pentium III laptop
with the exact same kernel and executables as well as custom compiled
kernels.  The Pentium III has a much lower clock rate and I have
noticed that system call performance (and hence io performance) is up
to an order of magnitude higher on my Pentium III laptop.  1k block IO
reads/writes are anemic on the Pentium 4, for example, so I'm trying
to figure out why and thought someone might have an idea.

Notice below that the System Call overhead is much higher on the
Pentium 4 even though the cpu runs more than twice the speed and the
system has DDRAM, a 400 Mhz FSB, etc.  I even get pretty remarkable
syscall/io performance on my Pentium III laptop vs. an otherwise idle
dual Xeon.

See how the performance is nearly opposite of what one would expect:

----------------------------------------------------------------------
basic sys call performance iterated for 10 secs:

        while (1) {
                close(dup(0));
                getpid();
                getuid();
                umask(022);
                iter++;
        }

M-Pentium III 850Mhz Sys Call Rate   433741.8
  Pentium 4     2Ghz Sys Call Rate   233637.8
  Xeon x 2    2.4Ghz Sys Call Rate   207684.2
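
To reproduce the system call rate numbers without the full benchmark suite,
the loop above can be wrapped in a small timing harness. This is a minimal
sketch, not the BYTE benchmark's own code: the function names, the batch
size of 1000 and the use of clock_gettime() are assumptions made for
illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <sys/stat.h>
#include <sys/types.h>
#include <time.h>
#include <unistd.h>

/* Seconds elapsed on the monotonic clock, as a double. */
double monotonic_seconds(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (double)ts.tv_sec + (double)ts.tv_nsec / 1e9;
}

/* Run the loop for roughly `seconds` and return loop iterations per second.
 * The clock is only read once per batch so that the timing call itself
 * does not dominate the measurement. */
double syscall_loop_rate(double seconds)
{
	const unsigned long batch = 1000;	/* assumed batch size */
	unsigned long iter = 0;
	mode_t old_mask = umask(022);		/* remember the caller's umask */
	double start = monotonic_seconds();
	double elapsed;

	do {
		unsigned long i;

		for (i = 0; i < batch; i++) {
			close(dup(0));
			getpid();
			getuid();
			umask(022);
			iter++;
		}
		elapsed = monotonic_seconds() - start;
	} while (elapsed < seconds);

	umask(old_mask);			/* restore it before returning */
	return (double)iter / elapsed;
}
```

Calling syscall_loop_rate(10.0) gives a figure comparable in spirit to the
"Sys Call Rate" lines, though not necessarily identical to what the
benchmark itself would report.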

----------------------------------------------------------------------
1k read sys calls iterated for 10 secs (all buffered reads, no disk):

M-Pentium III 850Mhz File Read      1492961.0 (~149 io/s)
  Pentium 4     2Ghz File Read      1088629.0 (~108 io/s)
  Xeon x 2    2.4Ghz File Read       686892.0 (~ 69 io/s)

Any ideas?  Not sure I want to upgrade to the P7 architecture if this
is right, since for me system calls are probably more important than
raw cpu computational power.

- Mike

--- Mobile Pentium III 850 Mhz ---

  BYTE UNIX Benchmarks (Version 3.11)
  System -- Linux flux.loup.net 2.4.7-10 #1 Thu Sep 6 17:27:27 EDT 2001 i686 unknown
  Start Benchmark Run: Thu Nov  8 07:55:04 PST 2001
   1 interactive users.
Dhrystone 2 without register variables   1652556.1 lps   (10 secs, 6 samples)
Dhrystone 2 using register variables     1513809.2 lps   (10 secs, 6 samples)
Arithmetic Test (type = arithoh)         3770106.2 lps   (10 secs, 6 samples)
Arithmetic Test (type = register)        230897.5 lps   (10 secs, 6 samples)
Arithmetic Test (type = short)           230586.1 lps   (10 secs, 6 samples)
Arithmetic Test (type = int)             230916.2 lps   (10 secs, 6 samples)
Arithmetic Test (type = long)            232229.7 lps   (10 secs, 6 samples)
Arithmetic Test (type = float)           222990.2 lps   (10 secs, 6 samples)
Arithmetic Test (type = double)          224339.4 lps   (10 secs, 6 samples)
System Call Overhead Test                433741.8 lps   (10 secs, 6 samples)
Pipe Throughput Test                     499465.5 lps   (10 secs, 6 samples)
Pipe-based Context Switching Test        229029.2 lps   (10 secs, 6 samples)
Process Creation Test                      8696.6 lps   (10 secs, 6 samples)
Execl Throughput Test                      1089.8 lps   (9 secs, 6 samples)
File Read  (10 seconds)                  1492961.0 KBps  (10 secs, 6 samples)
File Write (10 seconds)                  157663.0 KBps  (10 secs, 6 samples)
File Copy  (10 seconds)                   32516.0 KBps  (10 secs, 6 samples)
File Read  (30 seconds)                  1507645.0 KBps  (30 secs, 6 samples)
File Write (30 seconds)                  161130.0 KBps  (30 secs, 6 samples)
File Copy  (30 seconds)                   20155.0 KBps  (30 secs, 6 samples)
C Compiler Test                             491.2 lpm   (60 secs, 3 samples)
Shell scripts (1 concurrent)               1315.2 lpm   (60 secs, 3 samples)
Shell scripts (2 concurrent)                694.4 lpm   (60 secs, 3 samples)
Shell scripts (4 concurrent)                357.1 lpm   (60 secs, 3 samples)
Shell scripts (8 concurrent)                180.4 lpm   (60 secs, 3 samples)
Dc: sqrt(2) to 99 decimal places          46831.0 lpm   (60 secs, 6 samples)
Recursion Test--Tower of Hanoi            20954.1 lps   (10 secs, 6 samples)


                     INDEX VALUES            
TEST                                        BASELINE     RESULT      INDEX

Arithmetic Test (type = double)               2541.7   224339.4       88.3
Dhrystone 2 without register variables       22366.3  1652556.1       73.9
Execl Throughput Test                           16.5     1089.8       66.0
File Copy  (30 seconds)                        179.0    20155.0      112.6
Pipe-based Context Switching Test             1318.5   229029.2      173.7
Shell scripts (8 concurrent)                     4.0      180.4       45.1
                                                                 =========
     SUM of  6 items                                                 559.6
     AVERAGE                                                          93.3

--- Desktop Pentium 4 2.0 Ghz w/ 266 Mhz DDR ---

  BYTE UNIX Benchmarks (Version 3.11)
  System -- Linux gw2 2.4.19 #1 Mon Dec 9 05:31:23 GMT-7 2002 i686 unknown
  Start Benchmark Run: Mon Dec  9 05:45:47 GMT-7 2002
   1 interactive users.
Dhrystone 2 without register variables   2910759.3 lps   (10 secs, 6 samples)
Dhrystone 2 using register variables     2928495.6 lps   (10 secs, 6 samples)
Arithmetic Test (type = arithoh)         9252565.4 lps   (10 secs, 6 samples)
Arithmetic Test (type = register)        498894.3 lps   (10 secs, 6 samples)
Arithmetic Test (type = short)           473452.0 lps   (10 secs, 6 samples)
Arithmetic Test (type = int)             498956.5 lps   (10 secs, 6 samples)
Arithmetic Test (type = long)            498932.0 lps   (10 secs, 6 samples)
Arithmetic Test (type = float)           451138.8 lps   (10 secs, 6 samples)
Arithmetic Test (type = double)          451106.8 lps   (10 secs, 6 samples)
System Call Overhead Test                233637.8 lps   (10 secs, 6 samples)
Pipe Throughput Test                     437441.1 lps   (10 secs, 6 samples)
Pipe-based Context Switching Test        167229.2 lps   (10 secs, 6 samples)
Process Creation Test                      9407.2 lps   (10 secs, 6 samples)
Execl Throughput Test                      2158.8 lps   (10 secs, 6 samples)
File Read  (10 seconds)                  1088629.0 KBps  (10 secs, 6 samples)
File Write (10 seconds)                  472315.0 KBps  (10 secs, 6 samples)
File Copy  (10 seconds)                   10569.0 KBps  (10 secs, 6 samples)
File Read  (120 seconds)                 1089526.0 KBps  (120 secs, 6 samples)
File Write (120 seconds)                 467028.0 KBps  (120 secs, 6 samples)
File Copy  (120 seconds)                   3541.0 KBps  (120 secs, 6 samples)
C Compiler Test                             973.9 lpm   (60 secs, 3 samples)
Shell scripts (1 concurrent)               2590.8 lpm   (60 secs, 3 samples)
Shell scripts (2 concurrent)               1359.6 lpm   (60 secs, 3 samples)
Shell scripts (4 concurrent)                696.4 lpm   (60 secs, 3 samples)
Shell scripts (8 concurrent)                352.1 lpm   (60 secs, 3 samples)
Dc: sqrt(2) to 99 decimal places          99120.4 lpm   (60 secs, 6 samples)
Recursion Test--Tower of Hanoi            44857.5 lps   (10 secs, 6 samples)


                     INDEX VALUES            
TEST                                        BASELINE     RESULT      INDEX

Arithmetic Test (type = double)               2541.7   451106.8      177.5
Dhrystone 2 without register variables       22366.3  2910759.3      130.1
Execl Throughput Test                           16.5     2158.8      130.8
File Copy  (120 seconds)                       179.0     3541.0       19.7
Pipe-based Context Switching Test             1318.5   167229.2      126.8
Shell scripts (8 concurrent)                     4.0      352.1       88.0
                                                                 =========
     SUM of  6 items                                                 673.0
     AVERAGE                                                         112.1


--- Pentium 4 Xeon 2.4 Ghz x 2 w/ 2.4.19 ---

  BYTE UNIX Benchmarks (Version 3.11)
  System -- Linux brent-xeon 2.4.19-kel #5 SMP Wed Sep 25 03:15:13 GMT 2002 i686 unknown
  Start Benchmark Run: Thu Oct 10 03:48:07 MDT 2002
   0 interactive users.
Dhrystone 2 without register variables   2200821.4 lps   (10 secs, 6 samples)
Dhrystone 2 using register variables     2233296.6 lps   (10 secs, 6 samples)
Arithmetic Test (type = arithoh)         7366670.5 lps   (10 secs, 6 samples)
Arithmetic Test (type = register)        399261.4 lps   (10 secs, 6 samples)
Arithmetic Test (type = short)           361354.7 lps   (10 secs, 6 samples)
Arithmetic Test (type = int)             364200.0 lps   (10 secs, 6 samples)
Arithmetic Test (type = long)            345292.9 lps   (10 secs, 6 samples)
Arithmetic Test (type = float)           539907.7 lps   (10 secs, 6 samples)
Arithmetic Test (type = double)          537355.5 lps   (10 secs, 6 samples)
System Call Overhead Test                207684.2 lps   (10 secs, 6 samples)
Pipe Throughput Test                     283868.3 lps   (10 secs, 6 samples)
Pipe-based Context Switching Test         98205.6 lps   (10 secs, 6 samples)
Process Creation Test                      5395.9 lps   (10 secs, 6 samples)
Execl Throughput Test                      1612.9 lps   (9 secs, 6 samples)
File Read  (10 seconds)                  686892.0 KBps  (10 secs, 6 samples)
File Write (10 seconds)                  272217.0 KBps  (10 secs, 6 samples)
File Copy  (10 seconds)                   56415.0 KBps  (10 secs, 6 samples)
File Read  (30 seconds)                  681181.0 KBps  (30 secs, 6 samples)
File Write (30 seconds)                  272351.0 KBps  (30 secs, 6 samples)
File Copy  (30 seconds)                   20611.0 KBps  (30 secs, 6 samples)
C Compiler Test                             873.5 lpm   (60 secs, 3 samples)
Shell scripts (1 concurrent)               2970.1 lpm   (60 secs, 3 samples)
Shell scripts (2 concurrent)               1294.2 lpm   (60 secs, 3 samples)
Shell scripts (4 concurrent)                845.2 lpm   (60 secs, 3 samples)
Shell scripts (8 concurrent)                409.2 lpm   (60 secs, 3 samples)
Dc: sqrt(2) to 99 decimal places           no measured results
Recursion Test--Tower of Hanoi            33661.9 lps   (10 secs, 6 samples)


                     INDEX VALUES            
TEST                                        BASELINE     RESULT      INDEX

Arithmetic Test (type = double)               2541.7   537355.5      211.4
Dhrystone 2 without register variables       22366.3  2200821.4       98.4
Execl Throughput Test                           16.5     1612.9       97.8
File Copy  (30 seconds)                        179.0    20611.0      115.1
Pipe-based Context Switching Test             1318.5    98205.6       74.5
Shell scripts (8 concurrent)                     4.0      409.2      102.3
                                                                 =========
     SUM of  6 items                                                 699.5
     AVERAGE                                                         116.6
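
Judging from the tables, each INDEX is the RESULT divided by the BASELINE,
and the SUM and AVERAGE lines are taken over the six listed tests. A minimal
sketch that recomputes the Mobile Pentium III figures (the struct and
function names are hypothetical; the numbers are copied from the table
above):

```c
#include <stddef.h>

/* One row of an INDEX VALUES table. */
struct ub_row {
	const char *test;
	double baseline;
	double result;
};

/* Index as it appears in the tables: result divided by baseline. */
double ub_index(double result, double baseline)
{
	return result / baseline;
}

/* Sum of the per-test indices; the AVERAGE line is this sum divided by n. */
double ub_index_sum(const struct ub_row *rows, size_t n)
{
	double sum = 0.0;
	size_t i;

	for (i = 0; i < n; i++)
		sum += ub_index(rows[i].result, rows[i].baseline);
	return sum;
}

/* The six rows reported for the Mobile Pentium III 850 MHz. */
const struct ub_row p3_rows[] = {
	{ "Arithmetic Test (type = double)",         2541.7,  224339.4 },
	{ "Dhrystone 2 without register variables", 22366.3, 1652556.1 },
	{ "Execl Throughput Test",                      16.5,    1089.8 },
	{ "File Copy  (30 seconds)",                   179.0,   20155.0 },
	{ "Pipe-based Context Switching Test",        1318.5,  229029.2 },
	{ "Shell scripts (8 concurrent)",                4.0,     180.4 },
};
const size_t p3_nrows = sizeof(p3_rows) / sizeof(p3_rows[0]);
```

With these rows, ub_index_sum(p3_rows, p3_nrows) comes out at about 559.6
and the average at about 93.3, matching the table.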

* Re: Intel P6 vs P7 system call performance
  2002-12-09  8:30 Mike Hayward
@ 2002-12-09 15:40 ` erich
  2002-12-09 17:48 ` Linus Torvalds
  2002-12-13 15:45 ` William Lee Irwin III
  2 siblings, 0 replies; 268+ messages in thread
From: erich @ 2002-12-09 15:40 UTC (permalink / raw)
  To: Mike Hayward; +Cc: linux-kernel


Mike Hayward <hayward@loup.net> wrote:

> I have been benchmarking Pentium 4 boxes against my Pentium III laptop
> with the exact same kernel and executables as well as custom compiled
> kernels.  The Pentium III has a much lower clock rate and I have
> noticed that system call performance (and hence io performance) is up
> to an order of magnitude higher on my Pentium III laptop.  1k block IO
> reads/writes are anemic on the Pentium 4, for example, so I'm trying
> to figure out why and thought someone might have an idea.
> 
> Notice below that the System Call overhead is much higher on the
> Pentium 4 even though the cpu runs more than twice the speed and the
> system has DDRAM, a 400 Mhz FSB, etc.  I even get pretty remarkable
> syscall/io performance on my Pentium III laptop vs. an otherwise idle
> dual Xeon.
> 
> See how the performance is nearly opposite of what one would expect:
...
> M-Pentium III 850Mhz Sys Call Rate   433741.8
>   Pentium 4     2Ghz Sys Call Rate   233637.8
>   Xeon x 2    2.4Ghz Sys Call Rate   207684.2
...[other benchmark deleted]...
> Any ideas?  Not sure I want to upgrade to the P7 architecture if this
> is right, since for me system calls are probably more important than
> raw cpu computational power.

You're assuming that ALL operations in a P4 are linearly faster than
a P-III.  This is definitely not the case.

A P4 has a much longer pipeline (in many cases, considerably
longer than the diagrams imply) than the P-III, and in particular
it has a much longer latency in handling mode transitions.

The results you got don't surprise me whatsoever.  In fact the raw
system call transition instructions are likely 5x slower on the
P4.

--
    Erich Stefan Boleyn     <erich@uruk.org>     http://www.uruk.org/
"Reality is truly stranger than fiction; Probably why fiction is so popular"

* Re: Intel P6 vs P7 system call performance
  2002-12-09  8:30 Mike Hayward
  2002-12-09 15:40 ` erich
@ 2002-12-09 17:48 ` Linus Torvalds
  2002-12-09 19:36   ` Dave Jones
  2002-12-13 15:45 ` William Lee Irwin III
  2 siblings, 1 reply; 268+ messages in thread
From: Linus Torvalds @ 2002-12-09 17:48 UTC (permalink / raw)
  To: linux-kernel

In article <200212090830.gB98USW05593@flux.loup.net>,
Mike Hayward  <hayward@loup.net> wrote:
>
>I have been benchmarking Pentium 4 boxes against my Pentium III laptop
>with the exact same kernel and executables as well as custom compiled
>kernels.  The Pentium III has a much lower clock rate and I have
>noticed that system call performance (and hence io performance) is up
>to an order of magnitude higher on my Pentium III laptop.  1k block IO
>reads/writes are anemic on the Pentium 4, for example, so I'm trying
>to figure out why and thought someone might have an idea.

P4's really suck at system calls.  A 2.8GHz P4 does a simple system call
a lot _slower_ than a 500MHz PIII. 

The P4 has problems with some other things too, but the "int + iret"
instruction combination is absolutely the worst I've seen.  A 1.2GHz
Athlon will be 5-10 times faster than the fastest P4 on system call
overhead. 

HOWEVER, the P4 is really good at a lot of other things. On average, a
P4 tends to perform quite well on most loads, and hyperthreading (if you
have a Xeon or one of the newer desktop CPU's) also tends to work quite
well to smooth things out in real life.

In short: the P4 architecture excels at some things, and it sucks at
others. It _mostly_ tends to excel more than suck.

			Linus

* Re: Intel P6 vs P7 system call performance
  2002-12-09 17:48 ` Linus Torvalds
@ 2002-12-09 19:36   ` Dave Jones
  2002-12-09 19:46     ` H. Peter Anvin
  2002-12-17  0:47     ` Linus Torvalds
  0 siblings, 2 replies; 268+ messages in thread
From: Dave Jones @ 2002-12-09 19:36 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-kernel

On Mon, Dec 09, 2002 at 05:48:45PM +0000, Linus Torvalds wrote:

 > P4's really suck at system calls.  A 2.8GHz P4 does a simple system call
 > a lot _slower_ than a 500MHz PIII. 
 > 
 > The P4 has problems with some other things too, but the "int + iret"
 > instruction combination is absolutely the worst I've seen.  A 1.2GHz
 > Athlon will be 5-10 times faster than the fastest P4 on system call
 > overhead. 

Time to look into an alternative like SYSCALL perhaps ?

		Dave

-- 
| Dave Jones.        http://www.codemonkey.org.uk
| SuSE Labs

* Re: Intel P6 vs P7 system call performance
  2002-12-09 19:36   ` Dave Jones
@ 2002-12-09 19:46     ` H. Peter Anvin
  2002-12-28 20:37       ` Ville Herva
  2002-12-17  0:47     ` Linus Torvalds
  1 sibling, 1 reply; 268+ messages in thread
From: H. Peter Anvin @ 2002-12-09 19:46 UTC (permalink / raw)
  To: linux-kernel

Followup to:  <20021209193649.GC10316@suse.de>
By author:    Dave Jones <davej@codemonkey.org.uk>
In newsgroup: linux.dev.kernel
>
> On Mon, Dec 09, 2002 at 05:48:45PM +0000, Linus Torvalds wrote:
> 
>  > P4's really suck at system calls.  A 2.8GHz P4 does a simple system call
>  > a lot _slower_ than a 500MHz PIII. 
>  > 
>  > The P4 has problems with some other things too, but the "int + iret"
>  > instruction combination is absolutely the worst I've seen.  A 1.2GHz
>  > Athlon will be 5-10 times faster than the fastest P4 on system call
>  > overhead. 
> 
> Time to look into an alternative like SYSCALL perhaps ?
> 

SYSCALL is AMD.  SYSENTER is Intel, and is likely to be significantly
faster.  Unfortunately SYSENTER is also extremely braindamaged, in
that it destroys *both* the EIP and the ESP beyond recovery, and
because it's allowed in V86 and 16-bit modes (where it will cause
permanent data loss) which means that it needs to be able to be turned
off for things like DOSEMU and WINE to work correctly.
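
Which of the two instructions a given CPU advertises can be checked from
user space with CPUID. The following is a minimal sketch, assuming a GCC or
Clang toolchain that ships <cpuid.h>; the function names are made up for
illustration. It only reads the feature bits (leaf 1, EDX bit 11 for
SYSENTER; extended leaf 0x80000001, EDX bit 11 for SYSCALL) and does not
handle quirks such as early Pentium Pro parts that set the SEP bit without
implementing the instructions.

```c
#if defined(__i386__) || defined(__x86_64__)
#include <cpuid.h>	/* GCC/Clang helper header, an assumption of this sketch */
#endif

/* 1 if CPUID leaf 1 reports SEP (SYSENTER/SYSEXIT, EDX bit 11), else 0. */
int cpu_reports_sysenter(void)
{
#if defined(__i386__) || defined(__x86_64__)
	unsigned int eax, ebx, ecx, edx;

	if (__get_cpuid(1, &eax, &ebx, &ecx, &edx))
		return (int)((edx >> 11) & 1);
#endif
	return 0;	/* not x86, or the leaf is unavailable */
}

/* 1 if extended leaf 0x80000001 reports SYSCALL/SYSRET (EDX bit 11), else 0. */
int cpu_reports_syscall(void)
{
#if defined(__i386__) || defined(__x86_64__)
	unsigned int eax, ebx, ecx, edx;

	if (__get_cpuid(0x80000001, &eax, &ebx, &ecx, &edx))
		return (int)((edx >> 11) & 1);
#endif
	return 0;	/* not x86, or the leaf is unavailable */
}
```

A set bit only says the instruction exists; whether the kernel actually
uses it is a separate question.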

	-hpa

-- 
<hpa@transmeta.com> at work, <hpa@zytor.com> in private!
"Unix gives you enough rope to shoot yourself in the foot."
http://www.zytor.com/~hpa/puzzle.txt	<amsp@zytor.com>

* Re: Intel P6 vs P7 system call performance
  2002-12-09 19:46     ` H. Peter Anvin
@ 2002-12-28 20:37       ` Ville Herva
  2002-12-29  2:05         ` Christian Leber
  2002-12-30 11:29         ` Dave Jones
  0 siblings, 2 replies; 268+ messages in thread
From: Ville Herva @ 2002-12-28 20:37 UTC (permalink / raw)
  To: linux-kernel

On Mon, Dec 09, 2002 at 11:46:47AM -0800, you [H. Peter Anvin] wrote:
> 
> SYSCALL is AMD.  SYSENTER is Intel, and is likely to be significantly

Now that Linus has killed the dragon and everybody seems happy with the
shiny new SYSENTER code, let me just add one more stupid question to this
thread: has anyone made benchmarks on SYSCALL/SYSENTER/INT80 on Athlon? Is
SYSCALL worth doing separately for Athlon (and perhaps Hammer/32-bit mode)?

-- v --

v@iki.fi

* Re: Intel P6 vs P7 system call performance
  2002-12-28 20:37       ` Ville Herva
@ 2002-12-29  2:05         ` Christian Leber
  2002-12-30 18:22           ` Christian Leber
  2002-12-30 11:29         ` Dave Jones
  1 sibling, 1 reply; 268+ messages in thread
From: Christian Leber @ 2002-12-29  2:05 UTC (permalink / raw)
  To: linux-kernel

On Sat, Dec 28, 2002 at 10:37:06PM +0200, Ville Herva wrote:

> Now that Linus has killed the dragon and everybody seems happy with the
> shiny new SYSENTER code, let me just add one more stupid question to this
> thread: has anyone made benchmarks on SYSCALL/SYSENTER/INT80 on Athlon? Is
> SYSCALL worth doing separately for Athlon (and perhaps Hammer/32-bit mode)?

Yes, the output of the program Linus posted is on a Duron 750 with
2.5.53 like this:

igor3:~# ./a.out
187.894946 cycles  (call 0xffffe000)
299.155075 cycles  (int 80)

(cycles per getpid() call)


Christian Leber

-- 
  "Omnis enim res, quae dando non deficit, dum habetur et non datur,
   nondum habetur, quomodo habenda est."       (Aurelius Augustinus)
  Translation: <http://gnuhh.org/work/fsf-europe/augustinus.html>

* Re: Intel P6 vs P7 system call performance
  2002-12-29  2:05         ` Christian Leber
@ 2002-12-30 18:22           ` Christian Leber
  2002-12-30 21:22             ` Linus Torvalds
  0 siblings, 1 reply; 268+ messages in thread
From: Christian Leber @ 2002-12-30 18:22 UTC (permalink / raw)
  To: linux-kernel

On Sun, Dec 29, 2002 at 03:05:10AM +0100, Christian Leber wrote:

> > Now that Linus has killed the dragon and everybody seems happy with the
> > shiny new SYSENTER code, let me just add one more stupid question to this
> > thread: has anyone made benchmarks on SYSCALL/SYSENTER/INT80 on Athlon? Is
> > SYSCALL worth doing separately for Athlon (and perhaps Hammer/32-bit mode)?
> 
> Yes, the output of the program Linus posted is on a Duron 750 with
> 2.5.53 like this:
> 
> igor3:~# ./a.out
> 187.894946 cycles  (call 0xffffe000)
> 299.155075 cycles  (int 80)
> (cycles per getpid() call)

Damn, wrong lines; those were the numbers from 2.5.52-bk2+sysenter-patch.

But now the right and interesting lines:

2.5.53:
igor3:~# ./a.out
166.283549 cycles
278.461609 cycles

2.5.53-bk5:
igor3:~# ./a.out
150.895348 cycles
279.441955 cycles

The question is: are the numbers correct?
(I don't know if the TSC thing is actually right)

And why has int 80 also gotten faster?


Is this a valid test program to find out how long a system call takes?
igor3:~# cat sysc.c 
#include <stdio.h>
#include <unistd.h>

#define rdtscl(low) \
__asm__ __volatile__ ("rdtsc" : "=a" (low) : : "edx")

int getpiddd()
{
        int i=0; return i+10;
}

int main(int argc, char **argv) {
        long a,b,c,d;
        int i1,i2,i3;

        rdtscl(a);
        i1 = getpiddd(); //just to see how long a simple function takes
        rdtscl(b);
        i2 = getpid();
        rdtscl(c);
        i3 = getpid();
        rdtscl(d);
        printf("function call: %lu first: %lu second: %lu cycles\n",b-a,c-b,d-c);
        return 0;
}

I link it against a slightly modified (1 line of code) dietlibc:
igor3:~# dietlibc-0.22/bin-i386/diet gcc sysc.c
igor3:~# ./a.out 
function call: 42 first: 1821 second: 169 cycles

I heard that there are serious problems involved with TSC, therefore I
don't know if the numbers are correct/make sense.


Christian Leber

-- 
  "Omnis enim res, quae dando non deficit, dum habetur et non datur,
   nondum habetur, quomodo habenda est."       (Aurelius Augustinus)
  Translation: <http://gnuhh.org/work/fsf-europe/augustinus.html>

* Re: Intel P6 vs P7 system call performance
  2002-12-30 18:22           ` Christian Leber
@ 2002-12-30 21:22             ` Linus Torvalds
  0 siblings, 0 replies; 268+ messages in thread
From: Linus Torvalds @ 2002-12-30 21:22 UTC (permalink / raw)
  To: linux-kernel

In article <20021230182209.GA3981@core.home>,
Christian Leber  <christian@leber.de> wrote:
>
>But now the right and interesting lines:
>
>2.5.53:
>igor3:~# ./a.out
>166.283549 cycles
>278.461609 cycles
>
>2.5.53-bk5:
>igor3:~# ./a.out
>150.895348 cycles
>279.441955 cycles
>
>The question is: are the numbers correct?

Roughly. The program I posted has some overflow errors (which you will
hit if testing expensive system calls that take >4000 cycles). It also
does an average, which is "mostly correct", but not stable if there is
some load on the machine. The right way to do timings like this is
probably to take minimums for individual calls, and then subtract out the
TSC reading overhead. See attached silly program.

>And why has int 80 also gotten faster?

Random luck. Sometimes you get cacheline alignment magic etc. Or just
because the timings aren't stable for other reasons (background
processes etc).

>Is this a valid test program to find out how long a system call takes?

Not really. The results won't be stable, since you might have cache
misses, page faults, other processes, whatever.

So you'll get _somewhat_ correct numbers, but they may be randomly off.

		Linus

---
#include <sys/types.h>
#include <time.h>
#include <sys/time.h>
#include <sys/fcntl.h>
#include <asm/unistd.h>
#include <sys/stat.h>
#include <stdio.h>

#define rdtsc() ({ unsigned long a, d; asm volatile("rdtsc":"=a" (a), "=d" (d)); a; })

// for testing _just_ system call overhead.
//#define __NR_syscall __NR_stat64
#define __NR_syscall __NR_getpid

#define NR (100000)

int main()
{
	int i, ret;
	unsigned long fast = ~0UL, slow = ~0UL, overhead = ~0UL;
	struct timeval x,y;
	char *filename = "test";
	struct stat st;
	int j;

	for (i = 0; i < NR; i++) {
		unsigned long cycles = rdtsc();
		asm volatile("");
		cycles = rdtsc() - cycles;
		if (cycles < overhead)
			overhead = cycles;
	}

	printf("overhead: %6lu\n", overhead);

	for (j = 0; j < 10; j++)
	for (i = 0; i < NR; i++) {
		unsigned long cycles = rdtsc();
		asm volatile("call 0xffffe000"
			:"=a" (ret)
			:"0" (__NR_syscall),
			 "b" (filename),
			 "c" (&st));
		cycles = rdtsc() - cycles;
		if (cycles < fast)
			fast = cycles;
	}

	fast -= overhead;
	printf("sysenter: %6lu cycles\n", fast);

	for (i = 0; i < NR; i++) {
		unsigned long cycles = rdtsc();
		asm volatile("int $0x80"
			:"=a" (ret)
			:"0" (__NR_syscall),
			 "b" (filename),
			 "c" (&st));
		cycles = rdtsc() - cycles;
		if (cycles < slow)
			slow = cycles;
	}

	slow -= overhead;
	printf("int0x80:  %6lu cycles\n", slow);
	printf("          %6lu cycles difference\n", slow-fast);
	printf("factor %f\n", (double) slow / fast);
}
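
The same minimum-then-subtract approach can be sketched without rdtsc or
the hard-coded 0xffffe000 entry point, for machines where those are not
available. The version below assumes clock_gettime(CLOCK_MONOTONIC) as the
timer and getppid() as the cheap system call; the function names are
hypothetical, and the results are in nanoseconds rather than cycles, so they
are not directly comparable with the numbers in this thread.

```c
#define _POSIX_C_SOURCE 200809L
#include <time.h>
#include <unistd.h>

/* Current monotonic time in nanoseconds. */
long long now_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (long long)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

/* Smallest cost of two back-to-back clock reads over `samples` tries. */
long long timer_overhead_ns(int samples)
{
	long long best = -1;
	int i;

	for (i = 0; i < samples; i++) {
		long long t = now_ns();

		t = now_ns() - t;
		if (best < 0 || t < best)
			best = t;
	}
	return best < 0 ? 0 : best;
}

/* Smallest observed cost of one getppid() call, minus the timer overhead. */
long long min_getppid_ns(int samples)
{
	long long overhead = timer_overhead_ns(samples);
	long long best = -1;
	int i;

	for (i = 0; i < samples; i++) {
		long long t = now_ns();

		(void)getppid();
		t = now_ns() - t;
		if (best < 0 || t < best)
			best = t;
	}
	if (best < overhead)
		return 0;	/* clamp: noise can make the difference negative */
	return best - overhead;
}
```

Taking the minimum rather than the mean discards samples that were
disturbed by cache misses, page faults or other processes.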



* Re: Intel P6 vs P7 system call performance
  2002-12-28 20:37       ` Ville Herva
  2002-12-29  2:05         ` Christian Leber
@ 2002-12-30 11:29         ` Dave Jones
  1 sibling, 0 replies; 268+ messages in thread
From: Dave Jones @ 2002-12-30 11:29 UTC (permalink / raw)
  To: Ville Herva, linux-kernel

On Sat, Dec 28, 2002 at 10:37:06PM +0200, Ville Herva wrote:

 > > SYSCALL is AMD.  SYSENTER is Intel, and is likely to be significantly
 > Now that Linus has killed the dragon and everybody seems happy with the
 > shiny new SYSENTER code, let me just add one more stupid question to this
 > thread: has anyone made benchmarks on SYSCALL/SYSENTER/INT80 on Athlon? Is
 > SYSCALL worth doing separately for Athlon (and perhaps Hammer/32-bit mode)?

It's something I wondered about too. Even if it isn't a win for K7,
it's possible that the K6 family may benefit from SYSCALL support.
Maybe even the K5 if it was around that early? (too lazy to check the PDFs)

		Dave

-- 
| Dave Jones.        http://www.codemonkey.org.uk
| SuSE Labs

* Re: Intel P6 vs P7 system call performance
  2002-12-09 19:36   ` Dave Jones
  2002-12-09 19:46     ` H. Peter Anvin
@ 2002-12-17  0:47     ` Linus Torvalds
  2002-12-17  1:03       ` Dave Jones
  1 sibling, 1 reply; 268+ messages in thread
From: Linus Torvalds @ 2002-12-17  0:47 UTC (permalink / raw)
  To: Dave Jones, Ingo Molnar; +Cc: linux-kernel


On Mon, 9 Dec 2002, Dave Jones wrote:
> 
> Time to look into an alternative like SYSCALL perhaps ?

Well, here's a very raw first try at using Intel sysenter/sysexit.

It does actually work, I've done a "hello world" program that used 
sysenter to enter the kernel, but kernel exit requires knowing where to 
return to (the SYSENTER_RETURN define in entry.S), and I didn't set up a 
fixmap entry for this yet, so I don't have a good value to return to yet.

But this, together with a fixmap entry that is user-readable (and thus
executable) that contains the "sysenter" instruction (and enough setup so
that %ebp points to the stack we want to return with), and together with
some debugging should get you there.

WARNING! I may be setting up the stack slightly incorrectly, since this
also hurls chunks when debugging. Dunno. Ingo, care to take a look?

Btw, that per-CPU sysenter entry-point is really clever of me, but it's 
not strictly NMI-safe. There's a single-instruction window between having 
started "sysenter" and having a valid kernel stack, and if an NMI comes in 
at that point, the NMI will now have a bogus stack pointer.

That NMI problem is pretty fundamentally unfixable due to the stupid
sysenter semantics, but we could just make the NMI handlers be real
careful about it and fix it up if it happens.

Most of the diff here is actually moving around some of the segments, 
since sysenter/sysexit wants them in one particular order. The setup code 
to initialize sysenter is itself pretty trivial.
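
The per-CPU entry stub that enable_sep_cpu() writes in the patch below is
eleven bytes: a "movl abs32,%esp" followed by a "jmp rel32", which is why
the displacement is computed relative to page+11. A minimal user-space
sketch of that encoding (the function names are hypothetical, and the
addresses used with it are arbitrary examples):

```c
#include <stdint.h>

#define STUB_LEN 11	/* 2 + 4 + 1 + 4 bytes, matching the patch's offsets */

/* Store a 32-bit value in little-endian byte order, as x86 does. */
void put_le32(uint8_t *p, uint32_t v)
{
	p[0] = (uint8_t)(v & 0xff);
	p[1] = (uint8_t)((v >> 8) & 0xff);
	p[2] = (uint8_t)((v >> 16) & 0xff);
	p[3] = (uint8_t)((v >> 24) & 0xff);
}

/* Read back a little-endian 32-bit value. */
uint32_t get_le32(const uint8_t *p)
{
	return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
	       ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Build "movl esp0_addr,%esp; jmp target" as it would sit at stub_addr. */
void encode_sysenter_stub(uint8_t buf[STUB_LEN], uint32_t stub_addr,
			  uint32_t esp0_addr, uint32_t target)
{
	buf[0] = 0x8b;				/* opcode: mov r/m32 to r32 */
	buf[1] = 0x25;				/* ModRM: reg=%esp, r/m=disp32 */
	put_le32(buf + 2, esp0_addr);		/* absolute address of esp0 */
	buf[6] = 0xe9;				/* opcode: jmp rel32 */
	/* rel32 is relative to the end of the jmp, i.e. stub_addr + 11;
	 * unsigned arithmetic wraps modulo 2^32, as the CPU's does. */
	put_le32(buf + 7, target - (stub_addr + STUB_LEN));
}
```

The first two bytes correspond to the 0x258b short stored at page+0 in the
patch, since the store is little-endian.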

		Linus

----
===== arch/i386/kernel/sysenter.c 1.1 vs edited =====
--- 1.1/arch/i386/kernel/sysenter.c	Sat Dec 14 04:38:56 2002
+++ edited/arch/i386/kernel/sysenter.c	2002-12-16 16:37:32.000000000 -0800
@@ -0,0 +1,52 @@
+/*
+ * linux/arch/i386/kernel/sysenter.c
+ *
+ * (C) Copyright 2002 Linus Torvalds
+ *
+ * This file contains the needed initializations to support sysenter.
+ */
+
+#include <linux/init.h>
+#include <linux/smp.h>
+#include <linux/thread_info.h>
+#include <linux/gfp.h>
+
+#include <asm/cpufeature.h>
+#include <asm/msr.h>
+
+extern asmlinkage void sysenter_entry(void);
+
+static void __init enable_sep_cpu(void *info)
+{
+	unsigned long page = __get_free_page(GFP_ATOMIC);
+	int cpu = get_cpu();
+	unsigned long *esp0_ptr = &(init_tss + cpu)->esp0;
+	unsigned long rel32;
+
+	rel32 = (unsigned long) sysenter_entry - (page+11);
+
+	
+	*(short *) (page+0) = 0x258b;		/* movl xxxxx,%esp */
+	*(long **) (page+2) = esp0_ptr;
+	*(char *)  (page+6) = 0xe9;		/* jmp rel32 */
+	*(long *)  (page+7) = rel32;
+
+	wrmsr(0x174, __KERNEL_CS, 0);		/* SYSENTER_CS_MSR */
+	wrmsr(0x175, page+PAGE_SIZE, 0);	/* SYSENTER_ESP_MSR */
+	wrmsr(0x176, page, 0);			/* SYSENTER_EIP_MSR */
+
+	printk("Enabling SEP on CPU %d\n", cpu);
+	put_cpu();	
+}
+
+static int __init sysenter_setup(void)
+{
+	if (!boot_cpu_has(X86_FEATURE_SEP))
+		return 0;
+
+	enable_sep_cpu(NULL);
+	smp_call_function(enable_sep_cpu, NULL, 1, 1);
+	return 0;
+}
+
+__initcall(sysenter_setup);
===== arch/i386/kernel/Makefile 1.30 vs edited =====
--- 1.30/arch/i386/kernel/Makefile	Sat Dec 14 04:38:56 2002
+++ edited/arch/i386/kernel/Makefile	Mon Dec 16 13:43:57 2002
@@ -29,6 +29,7 @@
 obj-$(CONFIG_PROFILING)		+= profile.o
 obj-$(CONFIG_EDD)             	+= edd.o
 obj-$(CONFIG_MODULES)		+= module.o
+obj-y				+= sysenter.o
 
 EXTRA_AFLAGS   := -traditional
 
===== arch/i386/kernel/entry.S 1.41 vs edited =====
--- 1.41/arch/i386/kernel/entry.S	Fri Dec  6 09:43:43 2002
+++ edited/arch/i386/kernel/entry.S	Mon Dec 16 16:17:47 2002
@@ -94,7 +94,7 @@
 	movl %edx, %ds; \
 	movl %edx, %es;
 
-#define RESTORE_ALL	\
+#define RESTORE_REGS	\
 	popl %ebx;	\
 	popl %ecx;	\
 	popl %edx;	\
@@ -104,14 +104,25 @@
 	popl %eax;	\
 1:	popl %ds;	\
 2:	popl %es;	\
-	addl $4, %esp;	\
-3:	iret;		\
 .section .fixup,"ax";	\
-4:	movl $0,(%esp);	\
+3:	movl $0,(%esp);	\
 	jmp 1b;		\
-5:	movl $0,(%esp);	\
+4:	movl $0,(%esp);	\
 	jmp 2b;		\
-6:	pushl %ss;	\
+.previous;		\
+.section __ex_table,"a";\
+	.align 4;	\
+	.long 1b,3b;	\
+	.long 2b,4b;	\
+.previous
+
+
+#define RESTORE_ALL	\
+	RESTORE_REGS	\
+	addl $4, %esp;	\
+1:	iret;		\
+.section .fixup,"ax";   \
+2:	pushl %ss;	\
 	popl %ds;	\
 	pushl %ss;	\
 	popl %es;	\
@@ -120,11 +131,11 @@
 .previous;		\
 .section __ex_table,"a";\
 	.align 4;	\
-	.long 1b,4b;	\
-	.long 2b,5b;	\
-	.long 3b,6b;	\
+	.long 1b,2b;	\
 .previous
 
+
+
 ENTRY(lcall7)
 	pushfl			# We get a different stack layout with call
 				# gates, which has to be cleaned up later..
@@ -219,6 +230,39 @@
 	cli
 	jmp need_resched
 #endif
+
+#define SYSENTER_RETURN 0
+
+	# sysenter call handler stub
+	ALIGN
+ENTRY(sysenter_entry)
+	sti
+	pushl $(__USER_DS)
+	pushl %ebp
+	pushfl
+	pushl $(__USER_CS)
+	pushl $SYSENTER_RETURN
+
+	pushl %eax
+	SAVE_ALL
+	GET_THREAD_INFO(%ebx)
+	cmpl $(NR_syscalls), %eax
+	jae syscall_badsys
+
+	testb $_TIF_SYSCALL_TRACE,TI_FLAGS(%ebx)
+	jnz syscall_trace_entry
+	call *sys_call_table(,%eax,4)
+	movl %eax,EAX(%esp)
+	cli
+	movl TI_FLAGS(%ebx), %ecx
+	testw $_TIF_ALLWORK_MASK, %cx
+	jne syscall_exit_work
+	RESTORE_REGS
+	movl 4(%esp),%edx
+	movl 16(%esp),%ecx
+	sti
+	sysexit
+
 
 	# system call handler stub
 	ALIGN
===== arch/i386/kernel/head.S 1.18 vs edited =====
--- 1.18/arch/i386/kernel/head.S	Thu Dec  5 18:56:49 2002
+++ edited/arch/i386/kernel/head.S	Mon Dec 16 14:14:44 2002
@@ -414,8 +414,8 @@
 	.quad 0x0000000000000000	/* 0x0b reserved */
 	.quad 0x0000000000000000	/* 0x13 reserved */
 	.quad 0x0000000000000000	/* 0x1b reserved */
-	.quad 0x00cffa000000ffff	/* 0x23 user 4GB code at 0x00000000 */
-	.quad 0x00cff2000000ffff	/* 0x2b user 4GB data at 0x00000000 */
+	.quad 0x0000000000000000	/* 0x20 unused */
+	.quad 0x0000000000000000	/* 0x28 unused */
 	.quad 0x0000000000000000	/* 0x33 TLS entry 1 */
 	.quad 0x0000000000000000	/* 0x3b TLS entry 2 */
 	.quad 0x0000000000000000	/* 0x43 TLS entry 3 */
@@ -425,22 +425,25 @@
 
 	.quad 0x00cf9a000000ffff	/* 0x60 kernel 4GB code at 0x00000000 */
 	.quad 0x00cf92000000ffff	/* 0x68 kernel 4GB data at 0x00000000 */
-	.quad 0x0000000000000000	/* 0x70 TSS descriptor */
-	.quad 0x0000000000000000	/* 0x78 LDT descriptor */
+	.quad 0x00cffa000000ffff	/* 0x73 user 4GB code at 0x00000000 */
+	.quad 0x00cff2000000ffff	/* 0x7b user 4GB data at 0x00000000 */
+
+	.quad 0x0000000000000000	/* 0x80 TSS descriptor */
+	.quad 0x0000000000000000	/* 0x88 LDT descriptor */
 
 	/* Segments used for calling PnP BIOS */
-	.quad 0x00c09a0000000000	/* 0x80 32-bit code */
-	.quad 0x00809a0000000000	/* 0x88 16-bit code */
-	.quad 0x0080920000000000	/* 0x90 16-bit data */
-	.quad 0x0080920000000000	/* 0x98 16-bit data */
+	.quad 0x00c09a0000000000	/* 0x90 32-bit code */
+	.quad 0x00809a0000000000	/* 0x98 16-bit code */
 	.quad 0x0080920000000000	/* 0xa0 16-bit data */
+	.quad 0x0080920000000000	/* 0xa8 16-bit data */
+	.quad 0x0080920000000000	/* 0xb0 16-bit data */
 	/*
 	 * The APM segments have byte granularity and their bases
 	 * and limits are set at run time.
 	 */
-	.quad 0x00409a0000000000	/* 0xa8 APM CS    code */
-	.quad 0x00009a0000000000	/* 0xb0 APM CS 16 code (16 bit) */
-	.quad 0x0040920000000000	/* 0xb8 APM DS    data */
+	.quad 0x00409a0000000000	/* 0xb8 APM CS    code */
+	.quad 0x00009a0000000000	/* 0xc0 APM CS 16 code (16 bit) */
+	.quad 0x0040920000000000	/* 0xc8 APM DS    data */
 
 #if CONFIG_SMP
 	.fill (NR_CPUS-1)*GDT_ENTRIES,8,0 /* other CPU's GDT */
===== include/asm-i386/segment.h 1.2 vs edited =====
--- 1.2/include/asm-i386/segment.h	Mon Aug 12 10:56:27 2002
+++ edited/include/asm-i386/segment.h	Mon Dec 16 14:08:09 2002
@@ -9,8 +9,8 @@
  *   2 - reserved
  *   3 - reserved
  *
- *   4 - default user CS		<==== new cacheline
- *   5 - default user DS
+ *   4 - unused			<==== new cacheline
+ *   5 - unused
  *
  *  ------- start of TLS (Thread-Local Storage) segments:
  *
@@ -25,16 +25,18 @@
  *
  *  12 - kernel code segment		<==== new cacheline
  *  13 - kernel data segment
- *  14 - TSS
- *  15 - LDT
- *  16 - PNPBIOS support (16->32 gate)
- *  17 - PNPBIOS support
- *  18 - PNPBIOS support
+ *  14 - default user CS
+ *  15 - default user DS
+ *  16 - TSS
+ *  17 - LDT
+ *  18 - PNPBIOS support (16->32 gate)
  *  19 - PNPBIOS support
  *  20 - PNPBIOS support
- *  21 - APM BIOS support
- *  22 - APM BIOS support
- *  23 - APM BIOS support 
+ *  21 - PNPBIOS support
+ *  22 - PNPBIOS support
+ *  23 - APM BIOS support
+ *  24 - APM BIOS support
+ *  25 - APM BIOS support 
  */
 #define GDT_ENTRY_TLS_ENTRIES	3
 #define GDT_ENTRY_TLS_MIN	6
@@ -42,10 +44,10 @@
 
 #define TLS_SIZE (GDT_ENTRY_TLS_ENTRIES * 8)
 
-#define GDT_ENTRY_DEFAULT_USER_CS	4
+#define GDT_ENTRY_DEFAULT_USER_CS	14
 #define __USER_CS (GDT_ENTRY_DEFAULT_USER_CS * 8 + 3)
 
-#define GDT_ENTRY_DEFAULT_USER_DS	5
+#define GDT_ENTRY_DEFAULT_USER_DS	15
 #define __USER_DS (GDT_ENTRY_DEFAULT_USER_DS * 8 + 3)
 
 #define GDT_ENTRY_KERNEL_BASE	12
@@ -56,14 +58,14 @@
 #define GDT_ENTRY_KERNEL_DS		(GDT_ENTRY_KERNEL_BASE + 1)
 #define __KERNEL_DS (GDT_ENTRY_KERNEL_DS * 8)
 
-#define GDT_ENTRY_TSS			(GDT_ENTRY_KERNEL_BASE + 2)
-#define GDT_ENTRY_LDT			(GDT_ENTRY_KERNEL_BASE + 3)
+#define GDT_ENTRY_TSS			(GDT_ENTRY_KERNEL_BASE + 4)
+#define GDT_ENTRY_LDT			(GDT_ENTRY_KERNEL_BASE + 5)
 
-#define GDT_ENTRY_PNPBIOS_BASE		(GDT_ENTRY_KERNEL_BASE + 4)
-#define GDT_ENTRY_APMBIOS_BASE		(GDT_ENTRY_KERNEL_BASE + 9)
+#define GDT_ENTRY_PNPBIOS_BASE		(GDT_ENTRY_KERNEL_BASE + 6)
+#define GDT_ENTRY_APMBIOS_BASE		(GDT_ENTRY_KERNEL_BASE + 11)
 
 /*
- * The GDT has 21 entries but we pad it to cacheline boundary:
+ * The GDT has 23 entries but we pad it to cacheline boundary:
  */
 #define GDT_ENTRIES 24
 

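The selector values in the head.S comments follow directly from the new GDT indices in segment.h. A minimal sketch of that arithmetic, assuming the usual x86 selector layout (index * 8, with the requested privilege level in the low two bits); gdt_selector() and RPL_USER are illustrative names, not part of the patch:

```c
#include <assert.h>

/* Requested privilege level for user-mode selectors (ring 3). */
#define RPL_USER 3

/*
 * A selector is the GDT index times 8 (each descriptor is 8 bytes),
 * with the low two bits carrying the requested privilege level.
 * Hypothetical helper; the patch uses plain macros instead.
 */
static unsigned int gdt_selector(unsigned int index, unsigned int rpl)
{
	assert(rpl <= 3);
	return index * 8 + rpl;
}
```

With the new layout, index 14 at ring 3 gives 0x73 and index 15 gives 0x7b, matching the "user 4GB code" and "user 4GB data" comments; GDT_ENTRY_KERNEL_BASE + 4 = 16 gives 0x80 for the TSS.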

^ permalink raw reply	[flat|nested] 268+ messages in thread

* Re: Intel P6 vs P7 system call performance
  2002-12-17  0:47     ` Linus Torvalds
@ 2002-12-17  1:03       ` Dave Jones
  2002-12-17  2:36         ` Linus Torvalds
  0 siblings, 1 reply; 268+ messages in thread
From: Dave Jones @ 2002-12-17  1:03 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Ingo Molnar, linux-kernel

On Mon, Dec 16, 2002 at 04:47:00PM -0800, Linus Torvalds wrote:

Cool, new toys 8-) I'll have a play with this tomorrow.
after a quick glance, one thing jumped out at me.

 > +static int __init sysenter_setup(void)
 > +{
 > +	if (!boot_cpu_has(X86_FEATURE_SEP))
 > +		return 0;
 > +
 > +	enable_sep_cpu(NULL);
 > +	smp_call_function(enable_sep_cpu, NULL, 1, 1);
 > +	return 0;

I'm sure I recall seeing errata on at least 1 CPU re sysenter.
If we do decide to go this route, we'll need to blacklist ones
with any really icky problems.

		Dave

-- 
| Dave Jones.        http://www.codemonkey.org.uk
| SuSE Labs

^ permalink raw reply	[flat|nested] 268+ messages in thread

* Re: Intel P6 vs P7 system call performance
  2002-12-17  1:03       ` Dave Jones
@ 2002-12-17  2:36         ` Linus Torvalds
  2002-12-17  5:55           ` Linus Torvalds
  0 siblings, 1 reply; 268+ messages in thread
From: Linus Torvalds @ 2002-12-17  2:36 UTC (permalink / raw)
  To: Dave Jones; +Cc: Ingo Molnar, linux-kernel

On Tue, 17 Dec 2002, Dave Jones wrote:
>
> I'm sure I recall seeing errata on at least 1 CPU re sysenter.
> If we do decide to go this route, we'll need to blacklist ones
> with any really icky problems.

The errata is something like "all P6's report SEP, but it doesn't
actually _work_ on anything before the third stepping".

However, that should _not_ be handled by magic sysenter-specific code.
That's what the per-vendor cpu feature fixups are there for, so that these
kinds of bugs get fixed in _one_ place (initialization) and not in all the
users of the feature flags.

In fact, we already have that code in the proper place, namely
arch/i386/kernel/cpu/intel.c:

        /* SEP CPUID bug: Pentium Pro reports SEP but doesn't have it */
        if ( c->x86 == 6 && c->x86_model < 3 && c->x86_mask < 3 )
                clear_bit(X86_FEATURE_SEP, c->x86_capability);

so the stuff I sent out should work on everything.

(Modulo the missing syscall page I already mentioned and potential bugs
in the code itself, of course ;)
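
As a rough user-space illustration of the fixup quoted above (a sketch only: struct cpu_id and sep_usable() are made-up stand-ins for the kernel's cpuinfo structure and feature bitmap), the feature bit is filtered once, and every later user just asks the filtered answer:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the relevant struct cpuinfo_x86 fields. */
struct cpu_id {
	unsigned int family;    /* c->x86 */
	unsigned int model;     /* c->x86_model */
	unsigned int stepping;  /* c->x86_mask */
	bool cpuid_sep;         /* SEP bit as reported by CPUID */
};

/* Mirrors the quoted condition: family 6, model < 3, stepping < 3. */
static bool sep_usable(const struct cpu_id *c)
{
	assert(c);
	if (c->family == 6 && c->model < 3 && c->stepping < 3)
		return false;   /* CPUID claims SEP, but it does not work */
	return c->cpuid_sep;
}
```

Code such as sysenter_setup() then only needs the equivalent of sep_usable() and never has to know about steppings.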

		Linus

^ permalink raw reply	[flat|nested] 268+ messages in thread

* Re: Intel P6 vs P7 system call performance
  2002-12-17  2:36         ` Linus Torvalds
@ 2002-12-17  5:55           ` Linus Torvalds
  2002-12-17  6:09             ` Linus Torvalds
                               ` (4 more replies)
  0 siblings, 5 replies; 268+ messages in thread
From: Linus Torvalds @ 2002-12-17  5:55 UTC (permalink / raw)
  To: Dave Jones; +Cc: Ingo Molnar, linux-kernel, hpa

On Mon, 16 Dec 2002, Linus Torvalds wrote:
>
> (Modulo the missing syscall page I already mentioned and potential bugs
> in the code itself, of course ;)

Ok, I did the vsyscall page too, and tried to make it do the right thing
(but I didn't bother to test it on a non-SEP machine).

I'm pushing the changes out right now, but basically it boils down to the
fact that with these changes, user space can instead of doing an

	int $0x80

instruction for a system call just do a

	call 0xfffff000

instead. The vsyscall page will be set up to use sysenter if the CPU
supports it, and if it doesn't, it will just do the old "int $0x80"
instead (and it could use the AMD syscall instruction if it wants to).
User mode shouldn't know or care, the calling convention is the same as it
ever was.

On my P4 machine, a "getppid()" is 641 cycles with sysenter/sysexit, and
something like 1761 cycles with the old "int 0x80/iret" approach. That's a
noticeable improvement, but I have to say that I'm a bit disappointed in
the P4 still, it shouldn't be even that much.

As a comparison, an Athlon will do a full int/iret faster than a P4 does a
sysenter/sysexit. Pathetic. But it's better than it used to be.

Whatever. The code is extremely simple, and while I'm sure there are
things I've missed I'd love to hear if this works for anybody else. I'm
appending the (extremely stupid) test-program I used to test it.

The way I did this, things like system call restarting etc _should_ all
work fine even with "sysenter", simply by virtue of both sysenter and "int
0x80" being two-byte opcodes. But it might be interesting to verify that a
recompiled glibc (or even just a preload) really works with this on a
"whole system" testbed rather than just testing one system call (and not
even caring about the return value) a million times.
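
To spell out the two-byte argument: a sketch, assuming a restart works by backing the saved user return address up by the length of the entry instruction so that it executes again. The encodings are the standard IA-32 ones; restart_eip() is a hypothetical helper, not kernel code:

```c
#include <assert.h>
#include <stddef.h>

/* Standard IA-32 encodings of the two kernel entry instructions. */
static const unsigned char INT80_OPCODE[]    = { 0xcd, 0x80 };  /* int $0x80 */
static const unsigned char SYSENTER_OPCODE[] = { 0x0f, 0x34 };  /* sysenter  */

/*
 * Hypothetical helper: the address to resume at so that the entry
 * instruction is executed again after a restartable system call.
 */
static unsigned long restart_eip(unsigned long return_eip, size_t opcode_len)
{
	assert(return_eip >= opcode_len);
	return return_eip - opcode_len;
}
```

Because both opcodes are the same length, the same "back up by two bytes" adjustment lands on the entry instruction in either case.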

The good news is that the kernel part really looks pretty clean.

		Linus

---
#include <time.h>
#include <sys/time.h>
#include <asm/unistd.h>
#include <sys/stat.h>
#include <stdio.h>

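/*
 * Read the low 32 bits of the time-stamp counter (i386 only).  The
 * unsigned difference of two readings stays valid across one wraparound,
 * which is enough for the short loops below.
 */
#define rdtsc() ({ unsigned long a,d; asm volatile("rdtsc":"=a" (a), "=d" (d)); a; })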

int main()
{
	int i, ret;
	unsigned long start, end;

	start = rdtsc();
	for (i = 0; i < 1000000; i++) {
		asm volatile("call 0xfffff000"
			:"=a" (ret)
			:"0" (__NR_getppid));
	}
	end = rdtsc();
	printf("%f cycles\n", (end - start) / 1000000.0);

	start = rdtsc();
	for (i = 0; i < 1000000; i++) {
		asm volatile("int $0x80"
			:"=a" (ret)
			:"0" (__NR_getppid));
	}
	end = rdtsc();
	printf("%f cycles\n", (end - start) / 1000000.0);
	return 0;
}

^ permalink raw reply	[flat|nested] 268+ messages in thread

* Re: Intel P6 vs P7 system call performance
  2002-12-17  5:55           ` Linus Torvalds
@ 2002-12-17  6:09             ` Linus Torvalds
  2002-12-17  6:18               ` Linus Torvalds
                                 ` (3 more replies)
  2002-12-17  9:45             ` Andre Hedrick
                               ` (3 subsequent siblings)
  4 siblings, 4 replies; 268+ messages in thread
From: Linus Torvalds @ 2002-12-17  6:09 UTC (permalink / raw)
  To: Dave Jones; +Cc: Ingo Molnar, linux-kernel, hpa

On Mon, 16 Dec 2002, Linus Torvalds wrote:
>
> On my P4 machine, a "getppid()" is 641 cycles with sysenter/sysexit, and
> something like 1761 cycles with the old "int 0x80/iret" approach. That's a
> noticeable improvement, but I have to say that I'm a bit disappointed in
> the P4 still, it shouldn't be even that much.

On a slightly more real system call (gettimeofday - which actually matters
in real life) the difference is still visible, but less so - because the
system call itself takes more of the time, and the kernel entry overhead
is thus not as clear.

For gettimeofday(), the results on my P4 are:

	sysenter:	1280.425844 cycles
	int/iret:	2415.698224 cycles
			1135.272380 cycles diff
	factor 1.886637

ie sysenter makes that system call almost twice as fast.

It's not as good as a pure user-mode solution using tsc could be, but
we've seen the kinds of complexities that has with multi-CPU systems, and
they are so painful that I suspect the sysenter approach is a lot more
palatable even if it doesn't allow for the absolute best theoretical
numbers.

			Linus

^ permalink raw reply	[flat|nested] 268+ messages in thread

* Re: Intel P6 vs P7 system call performance
  2002-12-17  6:09             ` Linus Torvalds
@ 2002-12-17  6:18               ` Linus Torvalds
  2002-12-19 14:03                 ` Shuji YAMAMURA
  2002-12-17  6:19               ` GrandMasterLee
                                 ` (2 subsequent siblings)
  3 siblings, 1 reply; 268+ messages in thread
From: Linus Torvalds @ 2002-12-17  6:18 UTC (permalink / raw)
  To: Dave Jones; +Cc: Ingo Molnar, linux-kernel, hpa



On Mon, 16 Dec 2002, Linus Torvalds wrote:
>
> For gettimeofday(), the results on my P4 are:
>
> 	sysenter:	1280.425844 cycles
> 	int/iret:	2415.698224 cycles
> 			1135.272380 cycles diff
> 	factor 1.886637
>
> ie sysenter makes that system call almost twice as fast.

Final comparison for the evening: a PIII looks very different, since the
system call overhead is much smaller to begin with. On a PIII, the above
ends up looking like

   gettimeofday() testing:
	sysenter:	561.697236 cycles
	int/iret:	686.170463 cycles
			124.473227 cycles diff
	factor 1.221602

ie the speedup is much less because the original int/iret numbers aren't
nearly as embarrassing as the P4 ones. It's still a win, though.
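
For reference, the "diff" and "factor" lines in these tables are just the difference and the ratio of the two cycle counts. A small sketch (struct speedup and compare_cycles() are illustrative names, not from the test program):

```c
#include <assert.h>

struct speedup {
	double diff;    /* cycles saved per call */
	double factor;  /* slow / fast */
};

/* Compare a fast path (sysenter) against a slow path (int/iret). */
static struct speedup compare_cycles(double fast, double slow)
{
	assert(fast > 0.0);
	struct speedup s = { slow - fast, slow / fast };
	return s;
}
```

compare_cycles(1280.425844, 2415.698224) reproduces the P4 numbers (about 1135 cycles, factor about 1.89), and compare_cycles(561.697236, 686.170463) the PIII ones (about 124 cycles, factor about 1.22).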

		Linus


^ permalink raw reply	[flat|nested] 268+ messages in thread

* Re: Intel P6 vs P7 system call performance
  2002-12-17  6:18               ` Linus Torvalds
@ 2002-12-19 14:03                 ` Shuji YAMAMURA
  0 siblings, 0 replies; 268+ messages in thread
From: Shuji YAMAMURA @ 2002-12-19 14:03 UTC (permalink / raw)
  To: linux-kernel; +Cc: Linus Torvalds

Hi,

We've also measured the cost of gettimeofday(), on both Xeon and P3.
We measured it on different kernels (UP and MP) as well.

                Xeon(2GHz)     P3(1GHz)
=========================================
UP kernel       939[ns]       441[ns]
               1878[cycles]   441[cycles]
-----------------------------------------
MP kernel      1054[ns]       485[ns]
               2108[cycles]   485[cycles]
-----------------------------------------
(The kernel version is 2.4.18)

In this experiment, the Xeon takes about twice as long as the P3 in
wall-clock time, even though its clock frequency is twice as high.
Moreover, the difference between the UP and MP kernels is much more
pronounced on the Xeon: the Xeon difference (230 cycles) is about five
times larger than that of the P3 (44 cycles).
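
The cycle figures in the table follow from the measured times and the nominal clock rates. A minimal sketch of the conversion, assuming exactly 2 GHz and 1 GHz clocks (ns_to_cycles() is an illustrative name):

```c
#include <assert.h>

/* cycles = nanoseconds * frequency in GHz (1 GHz is one cycle per ns). */
static double ns_to_cycles(double ns, double ghz)
{
	assert(ghz > 0.0);
	return ns * ghz;
}
```

For the Xeon, 939 ns and 1054 ns at 2 GHz give 1878 and 2108 cycles, a UP/MP difference of 230 cycles; for the P3, 441 ns and 485 ns at 1 GHz give a difference of 44 cycles, so the ratio is 230 / 44, a little over 5.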

We think that the instructions with the lock prefix in the MP kernel
hurt Xeon performance, because they serialize operations in the
execution pipeline.  On P4/Xeon systems, these locked operations
should be avoided as much as possible.

The following web page shows the details of this experiment.

http://www.labs.fujitsu.com/en/techinfo/linux/lse-0211/index.html

Regards

At Mon, 16 Dec 2002 22:18:27 -0800 (PST),
Linus wrote:
> On Mon, 16 Dec 2002, Linus Torvalds wrote:
> >
> > For gettimeofday(), the results on my P4 are:
> >
> > 	sysenter:	1280.425844 cycles
> > 	int/iret:	2415.698224 cycles
> > 			1135.272380 cycles diff
> > 	factor 1.886637
> >
> > ie sysenter makes that system call almost twice as fast.
> 
> Final comparison for the evening: a PIII looks very different, since the
> system call overhead is much smaller to begin with. On a PIII, the above
> ends up looking like
> 
>    gettimeofday() testing:
> 	sysenter:	561.697236 cycles
> 	int/iret:	686.170463 cycles
> 			124.473227 cycles diff
> 	factor 1.221602

-----
Shuji Yamamura (yamamura@flab.fujitsu.co.jp)
Grid Computing & Bioinformatics Laboratory
Information Technology Core Laboratories
Fujitsu Laboratories LTD.

^ permalink raw reply	[flat|nested] 268+ messages in thread

* Re: Intel P6 vs P7 system call performance
  2002-12-17  6:09             ` Linus Torvalds
  2002-12-17  6:18               ` Linus Torvalds
@ 2002-12-17  6:19               ` GrandMasterLee
  2002-12-17  6:43               ` dean gaudet
  2002-12-17 19:12               ` H. Peter Anvin
  3 siblings, 0 replies; 268+ messages in thread
From: GrandMasterLee @ 2002-12-17  6:19 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Dave Jones, Ingo Molnar, linux-kernel, hpa

On Tue, 2002-12-17 at 00:09, Linus Torvalds wrote:
> On Mon, 16 Dec 2002, Linus Torvalds wrote:
> >
> > On my P4 machine, a "getppid()" is 641 cycles with sysenter/sysexit, and
> > something like 1761 cycles with the old "int 0x80/iret" approach. That's a
> > noticeable improvement, but I have to say that I'm a bit disappointed in
> > the P4 still, it shouldn't be even that much.
> 
> On a slightly more real system call (gettimeofday - which actually matters
> in real life) the difference is still visible, but less so - because the
> system call itself takes more of the time, and the kernel entry overhead
> is thus not as clear.
> 
> For gettimeofday(), the results on my P4 are:
> 
> 	sysenter:	1280.425844 cycles
> 	int/iret:	2415.698224 cycles
> 			1135.272380 cycles diff
> 	factor 1.886637
> 
> ie sysenter makes that system call almost twice as fast.


I'm curious whether this is one of the dual non-Xeon P4s (say, 2.4 GHz+?)
or one of the Xeons. There seems to be some perceived disparity in how
each performs. I think the biggest differences on the Xeons are the
stepping and the cache (pipeline too?), but not much else.

[...]
> 			Linus
> 


^ permalink raw reply	[flat|nested] 268+ messages in thread

* Re: Intel P6 vs P7 system call performance
  2002-12-17  6:09             ` Linus Torvalds
  2002-12-17  6:18               ` Linus Torvalds
  2002-12-17  6:19               ` GrandMasterLee
@ 2002-12-17  6:43               ` dean gaudet
  2002-12-17 16:50                 ` Linus Torvalds
                                   ` (2 more replies)
  2002-12-17 19:12               ` H. Peter Anvin
  3 siblings, 3 replies; 268+ messages in thread
From: dean gaudet @ 2002-12-17  6:43 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Dave Jones, Ingo Molnar, linux-kernel, hpa

On Mon, 16 Dec 2002, Linus Torvalds wrote:

> It's not as good as a pure user-mode solution using tsc could be, but
> we've seen the kinds of complexities that has with multi-CPU systems, and
> they are so painful that I suspect the sysenter approach is a lot more
> palatable even if it doesn't allow for the absolute best theoretical
> numbers.

don't many of the multi-CPU problems with tsc go away because you've got a
per-cpu physical page for the vsyscall?

i.e. per-cpu tsc epoch and scaling can be set on that page.

the only trouble i know of is what happens when an interrupt occurs and
the task is rescheduled on another cpu... in theory you could test %eip
against 0xfffffxxx and "rollback" (or complete) any incomplete
gettimeofday call prior to saving a task's state.  but i bet that test is
undesirable on all interrupt paths.
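
The %eip test itself would be cheap to express; the concern is having it on every interrupt path. A sketch of the check, assuming the vsyscall page sits at 0xfffff000 as in the call example earlier in the thread (the macro and function names are made up):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define VSYSCALL_PAGE_BASE 0xfffff000u  /* last 4 kB page of the address space */
#define VSYSCALL_PAGE_MASK 0xfffff000u  /* drops the 12-bit page offset */

/* True if the interrupted instruction pointer lies inside the vsyscall page. */
static bool eip_in_vsyscall_page(uint32_t eip)
{
	return (eip & VSYSCALL_PAGE_MASK) == VSYSCALL_PAGE_BASE;
}
```

That is one mask and one compare per interrupt, plus whatever rollback logic would have to follow a positive match.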

-dean

^ permalink raw reply	[flat|nested] 268+ messages in thread

* Re: Intel P6 vs P7 system call performance
  2002-12-17  6:43               ` dean gaudet
@ 2002-12-17 16:50                 ` Linus Torvalds
  2002-12-17 19:11                 ` H. Peter Anvin
  2002-12-18 23:53                 ` Pavel Machek
  2 siblings, 0 replies; 268+ messages in thread
From: Linus Torvalds @ 2002-12-17 16:50 UTC (permalink / raw)
  To: dean gaudet; +Cc: Dave Jones, Ingo Molnar, linux-kernel, hpa

On Mon, 16 Dec 2002, dean gaudet wrote:
>
> don't many of the multi-CPU problems with tsc go away because you've got a
> per-cpu physical page for the vsyscall?

No.

The per-cpu page is _inside_ the kernel, and is only pointed at by the
SYSENTER_EIP_MSR, and not accessible from user space. It's not virtually
mapped to the same address at all.

The userspace vsyscall page is shared on the whole system, and has to be
so, because anything else is a disaster from a TLB standpoint (two threads
running on different CPU's have the same page tables, so it's basically
impossible to sanely do per-cpu TLB mappings with a hw-filled TLB like the
x86).

		Linus

^ permalink raw reply	[flat|nested] 268+ messages in thread

* Re: Intel P6 vs P7 system call performance
  2002-12-17  6:43               ` dean gaudet
  2002-12-17 16:50                 ` Linus Torvalds
@ 2002-12-17 19:11                 ` H. Peter Anvin
  2002-12-17 21:39                   ` Benjamin LaHaise
  2002-12-18 23:53                 ` Pavel Machek
  2 siblings, 1 reply; 268+ messages in thread
From: H. Peter Anvin @ 2002-12-17 19:11 UTC (permalink / raw)
  To: dean gaudet; +Cc: Linus Torvalds, Dave Jones, Ingo Molnar, linux-kernel

dean gaudet wrote:
> On Mon, 16 Dec 2002, Linus Torvalds wrote:
> 
>>It's not as good as a pure user-mode solution using tsc could be, but
>>we've seen the kinds of complexities that has with multi-CPU systems, and
>>they are so painful that I suspect the sysenter approach is a lot more
>>palatable even if it doesn't allow for the absolute best theoretical
>>numbers.
> 
> don't many of the multi-CPU problems with tsc go away because you've got a
> per-cpu physical page for the vsyscall?
> 
> i.e. per-cpu tsc epoch and scaling can be set on that page.
> 
> the only trouble i know of is what happens when an interrupt occurs and
> the task is rescheduled on another cpu... in theory you could test %eip
> against 0xfffffxxx and "rollback" (or complete) any incomplete
> gettimeofday call prior to saving a task's state.  but i bet that test is
> undesirable on all interrupt paths.
> 

Exactly.  This is a real problem.

	-hpa


^ permalink raw reply	[flat|nested] 268+ messages in thread

* Re: Intel P6 vs P7 system call performance
  2002-12-17 19:11                 ` H. Peter Anvin
@ 2002-12-17 21:39                   ` Benjamin LaHaise
  2002-12-17 21:41                     ` H. Peter Anvin
  0 siblings, 1 reply; 268+ messages in thread
From: Benjamin LaHaise @ 2002-12-17 21:39 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: dean gaudet, Linus Torvalds, Dave Jones, Ingo Molnar,
	linux-kernel

On Tue, Dec 17, 2002 at 11:11:19AM -0800, H. Peter Anvin wrote:
> > against 0xfffffxxx and "rollback" (or complete) any incomplete
> > gettimeofday call prior to saving a task's state.  but i bet that test is
> > undesirable on all interrupt paths.
> > 
> 
> Exactly.  This is a real problem.

No, just take the number of context switches before and after the attempt 
to read the time of day.

		-ben
-- 
"Do you seek knowledge in time travel?"

^ permalink raw reply	[flat|nested] 268+ messages in thread

* Re: Intel P6 vs P7 system call performance
  2002-12-17 21:39                   ` Benjamin LaHaise
@ 2002-12-17 21:41                     ` H. Peter Anvin
  2002-12-17 21:53                       ` Benjamin LaHaise
  0 siblings, 1 reply; 268+ messages in thread
From: H. Peter Anvin @ 2002-12-17 21:41 UTC (permalink / raw)
  To: Benjamin LaHaise
  Cc: dean gaudet, Linus Torvalds, Dave Jones, Ingo Molnar,
	linux-kernel

Benjamin LaHaise wrote:
> On Tue, Dec 17, 2002 at 11:11:19AM -0800, H. Peter Anvin wrote:
> 
>>>against 0xfffffxxx and "rollback" (or complete) any incomplete
>>>gettimeofday call prior to saving a task's state.  but i bet that test is
>>>undesirable on all interrupt paths.
>>>
>>
>>Exactly.  This is a real problem.
> 
> 
> No, just take the number of context switches before and after the attempt 
> to read the time of day.
> 

How do you do that from userspace, atomically?  A counter in the shared
page?

	-hpa



^ permalink raw reply	[flat|nested] 268+ messages in thread

* Re: Intel P6 vs P7 system call performance
  2002-12-17 21:41                     ` H. Peter Anvin
@ 2002-12-17 21:53                       ` Benjamin LaHaise
  0 siblings, 0 replies; 268+ messages in thread
From: Benjamin LaHaise @ 2002-12-17 21:53 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: dean gaudet, Linus Torvalds, Dave Jones, Ingo Molnar,
	linux-kernel

On Tue, Dec 17, 2002 at 01:41:55PM -0800, H. Peter Anvin wrote:
> > No, just take the number of context switches before and after the attempt 
> > to read the time of day.

> How do you do that from userspace, atomically?  A counter in the shared
> page?

Yup.  You need some shared data for the TSC offset and such anyway, so
moving the context switch counter onto such a page won't be much of 
a problem.  Using the %tr trick to get the CPU number would allow for 
some of these data structures to be per-cpu without incurring any LOCK 
overhead, too.
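
A rough user-space sketch of the retry idea, under loud assumptions: struct vsys_time_data and all of its field names are hypothetical, the TSC is read through a function pointer so the example runs anywhere, and a real version would need proper memory barriers and per-cpu data rather than a bare volatile counter:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical layout of what the kernel would export on the shared page. */
struct vsys_time_data {
	volatile uint32_t switch_count;  /* bumped on every context switch */
	uint64_t tsc_epoch;              /* TSC value at the last timer update */
	uint64_t ns_at_epoch;            /* wall time in ns at tsc_epoch */
	uint32_t ns_per_kcycle;          /* scale: ns per 1000 TSC cycles */
};

/* Stand-in for rdtsc so the sketch is portable and testable. */
static uint64_t fake_tsc_value = 2000;
static uint64_t fake_tsc(void)
{
	return fake_tsc_value;
}

/* Compute the time; retry if a context switch happened in the middle. */
static uint64_t read_time_ns(const struct vsys_time_data *d,
			     uint64_t (*read_tsc)(void))
{
	uint32_t before;
	uint64_t ns;

	assert(d && read_tsc);
	do {
		before = d->switch_count;
		ns = d->ns_at_epoch +
		     (read_tsc() - d->tsc_epoch) * d->ns_per_kcycle / 1000;
	} while (before != d->switch_count);  /* rescheduled mid-read */
	return ns;
}
```

The common case is a single pass through the loop with no locked instructions; only a task that was rescheduled in the middle of the calculation pays for a second pass.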

		-ben
-- 
"Do you seek knowledge in time travel?"

^ permalink raw reply	[flat|nested] 268+ messages in thread

* Re: Intel P6 vs P7 system call performance
  2002-12-17  6:43               ` dean gaudet
  2002-12-17 16:50                 ` Linus Torvalds
  2002-12-17 19:11                 ` H. Peter Anvin
@ 2002-12-18 23:53                 ` Pavel Machek
  2002-12-19 22:18                   ` H. Peter Anvin
  2 siblings, 1 reply; 268+ messages in thread
From: Pavel Machek @ 2002-12-18 23:53 UTC (permalink / raw)
  To: dean gaudet; +Cc: Linus Torvalds, Dave Jones, Ingo Molnar, linux-kernel, hpa

Hi!

> > It's not as good as a pure user-mode solution using tsc could be, but
> > we've seen the kinds of complexities that has with multi-CPU systems, and
> > they are so painful that I suspect the sysenter approach is a lot more
> > palatable even if it doesn't allow for the absolute best theoretical
> > numbers.
> 
> don't many of the multi-CPU problems with tsc go away because you've got a
> per-cpu physical page for the vsyscall?
> 
> i.e. per-cpu tsc epoch and scaling can be set on that page.

Problem is that cpu's can randomly drift +/- 100 clocks or so... Not
nice at all.
								Pavel
-- 
Worst form of spam? Adding advertisment signatures ala sourceforge.net.
What goes next? Inserting advertisment *into* email?

^ permalink raw reply	[flat|nested] 268+ messages in thread

* Re: Intel P6 vs P7 system call performance
  2002-12-18 23:53                 ` Pavel Machek
@ 2002-12-19 22:18                   ` H. Peter Anvin
  2002-12-19 22:21                     ` Pavel Machek
  0 siblings, 1 reply; 268+ messages in thread
From: H. Peter Anvin @ 2002-12-19 22:18 UTC (permalink / raw)
  To: Pavel Machek
  Cc: dean gaudet, Linus Torvalds, Dave Jones, Ingo Molnar,
	linux-kernel

Pavel Machek wrote:
>>
>>don't many of the multi-CPU problems with tsc go away because you've got a
>>per-cpu physical page for the vsyscall?
>>
>>i.e. per-cpu tsc epoch and scaling can be set on that page.
> 
> Problem is that cpu's can randomly drift +/- 100 clocks or so... Not
> nice at all.
> 

±100 clocks is what... ±50 ns these days?  You can't get that kind of
accuracy for anything outside the CPU core anyway...

	-hpa


^ permalink raw reply	[flat|nested] 268+ messages in thread

* Re: Intel P6 vs P7 system call performance
  2002-12-19 22:18                   ` H. Peter Anvin
@ 2002-12-19 22:21                     ` Pavel Machek
  2002-12-19 22:23                       ` H. Peter Anvin
  0 siblings, 1 reply; 268+ messages in thread
From: Pavel Machek @ 2002-12-19 22:21 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Pavel Machek, dean gaudet, Linus Torvalds, Dave Jones,
	Ingo Molnar, linux-kernel

Hi!

> >>don't many of the multi-CPU problems with tsc go away because you've got a
> >>per-cpu physical page for the vsyscall?
> >>
> >>i.e. per-cpu tsc epoch and scaling can be set on that page.
> > 
> > Problem is that cpu's can randomly drift +/- 100 clocks or so... Not
> > nice at all.
> > 
> 
> ±100 clocks is what... ±50 ns these days?  You can't get that kind of
> accuracy for anything outside the CPU core anyway...

50ns is bad enough when it makes your time go backwards.

								Pavel
-- 
Casualities in World Trade Center: ~3k dead inside the building,
cryptography in U.S.A. and free speech in Czech Republic.

^ permalink raw reply	[flat|nested] 268+ messages in thread

* Re: Intel P6 vs P7 system call performance
  2002-12-19 22:21                     ` Pavel Machek
@ 2002-12-19 22:23                       ` H. Peter Anvin
  2002-12-19 22:26                         ` Pavel Machek
  0 siblings, 1 reply; 268+ messages in thread
From: H. Peter Anvin @ 2002-12-19 22:23 UTC (permalink / raw)
  To: Pavel Machek
  Cc: dean gaudet, Linus Torvalds, Dave Jones, Ingo Molnar,
	linux-kernel

Pavel Machek wrote:
> Hi!
> 
> 
>>>>don't many of the multi-CPU problems with tsc go away because you've got a
>>>>per-cpu physical page for the vsyscall?
>>>>
>>>>i.e. per-cpu tsc epoch and scaling can be set on that page.
>>>
>>>Problem is that cpu's can randomly drift +/- 100 clocks or so... Not
>>>nice at all.
>>>
>>
>>±100 clocks is what... ±50 ns these days?  You can't get that kind of
>>accuracy for anything outside the CPU core anyway...
> 
> 50ns is bad enough when it makes your time go backwards.
> 

Backwards??  Clock spreading should make the rate change, but it should
never decrement.

	-hpa



^ permalink raw reply	[flat|nested] 268+ messages in thread

* Re: Intel P6 vs P7 system call performance
  2002-12-19 22:23                       ` H. Peter Anvin
@ 2002-12-19 22:26                         ` Pavel Machek
  2002-12-19 22:30                           ` H. Peter Anvin
  0 siblings, 1 reply; 268+ messages in thread
From: Pavel Machek @ 2002-12-19 22:26 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: dean gaudet, Linus Torvalds, Dave Jones, Ingo Molnar,
	linux-kernel

Hi!

> >>>>don't many of the multi-CPU problems with tsc go away because you've got a
> >>>>per-cpu physical page for the vsyscall?
> >>>>
> >>>>i.e. per-cpu tsc epoch and scaling can be set on that page.
> >>>
> >>>Problem is that cpu's can randomly drift +/- 100 clocks or so... Not
> >>>nice at all.
> >>>
> >>
> >>±100 clocks is what... ±50 ns these days?  You can't get that kind of
> >>accuracy for anything outside the CPU core anyway...
> > 
> > 50ns is bad enough when it makes your time go backwards.
> > 
> 
> Backwards??  Clock spreading should make the rate change, but it should
> never decrement.

User on cpu1 reads time, communicates it to cpu2, but cpu2 is drifted
-50ns, so it reads a time "before" the time reported by cpu1. And gets confused.
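
The scenario can be modelled with two skewed clocks and a hand-off delay. This is an illustrative sketch only (the function names and the idea of a fixed per-cpu skew are assumptions, not measurements):

```c
#include <assert.h>
#include <stdint.h>

/* Time as seen on a CPU whose clock is skewed by skew_ns from true time. */
static int64_t cpu_time_ns(int64_t true_ns, int64_t skew_ns)
{
	return true_ns + skew_ns;
}

/*
 * Nonzero if the second CPU, reading its clock handoff_ns after the
 * first CPU read its own, sees an earlier timestamp than the one it
 * was handed.
 */
static int appears_backwards(int64_t t_ns, int64_t handoff_ns,
			     int64_t skew1_ns, int64_t skew2_ns)
{
	assert(handoff_ns >= 0);
	int64_t first = cpu_time_ns(t_ns, skew1_ns);
	int64_t second = cpu_time_ns(t_ns + handoff_ns, skew2_ns);
	return second < first;
}
```

With cpu2 skewed by -50 ns, time appears to run backwards only if the value gets from cpu1 to cpu2 in under 50 ns; a hand-off of 50 ns or more hides the skew.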

								Pavel
-- 
Casualities in World Trade Center: ~3k dead inside the building,
cryptography in U.S.A. and free speech in Czech Republic.

^ permalink raw reply	[flat|nested] 268+ messages in thread

* Re: Intel P6 vs P7 system call performance
  2002-12-19 22:26                         ` Pavel Machek
@ 2002-12-19 22:30                           ` H. Peter Anvin
  2002-12-19 22:34                             ` Pavel Machek
  0 siblings, 1 reply; 268+ messages in thread
From: H. Peter Anvin @ 2002-12-19 22:30 UTC (permalink / raw)
  To: Pavel Machek
  Cc: dean gaudet, Linus Torvalds, Dave Jones, Ingo Molnar,
	linux-kernel

Pavel Machek wrote:
> 
> User on cpu1 reads time, communicates it to cpu2, but cpu2 is drifted
> -50ns, so it reads time "before" time reported cpu1. And gets confused.
> 

How can you get that communication to happen in < 50 ns?

	-hpa



^ permalink raw reply	[flat|nested] 268+ messages in thread

* Re: Intel P6 vs P7 system call performance
  2002-12-19 22:30                           ` H. Peter Anvin
@ 2002-12-19 22:34                             ` Pavel Machek
  2002-12-19 22:36                               ` H. Peter Anvin
  0 siblings, 1 reply; 268+ messages in thread
From: Pavel Machek @ 2002-12-19 22:34 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Pavel Machek, dean gaudet, Linus Torvalds, Dave Jones,
	Ingo Molnar, linux-kernel

Hi!

> > User on cpu1 reads time, communicates it to cpu2, but cpu2 is drifted
> > -50ns, so it reads time "before" time reported cpu1. And gets confused.
> > 
> 
> How can you get that communication to happen in < 50 ns?

I'm not sure I can do that, but I'm not sure I can't either. CPUs
snoop each other's cache, and that's supposed to be fast...

-- 
Casualities in World Trade Center: ~3k dead inside the building,
cryptography in U.S.A. and free speech in Czech Republic.

^ permalink raw reply	[flat|nested] 268+ messages in thread

* Re: Intel P6 vs P7 system call performance
  2002-12-19 22:34                             ` Pavel Machek
@ 2002-12-19 22:36                               ` H. Peter Anvin
  0 siblings, 0 replies; 268+ messages in thread
From: H. Peter Anvin @ 2002-12-19 22:36 UTC (permalink / raw)
  To: Pavel Machek
  Cc: dean gaudet, Linus Torvalds, Dave Jones, Ingo Molnar,
	linux-kernel

Pavel Machek wrote:
> Hi!
> 
> 
>>>User on cpu1 reads time, communicates it to cpu2, but cpu2 is drifted
>>>-50ns, so it reads time "before" time reported cpu1. And gets confused.
>>>
>>
>>How can you get that communication to happen in < 50 ns?
> 
> 
> I'm not sure I can do that, but I'm not sure I can't either. CPUs
> snoop each other's cache, and that's supposed to be fast...
> 

Even over a 400 MHz FSB you have 2.5 ns cycles.  I doubt you can
transfer a cache line in 20 FSB cycles.
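
The arithmetic behind those figures, as a small sketch (the function names are illustrative):

```c
#include <assert.h>

/* Length of one bus cycle in nanoseconds for a clock given in MHz. */
static double cycle_ns(double mhz)
{
	assert(mhz > 0.0);
	return 1000.0 / mhz;
}

/* Number of bus cycles that fit into a time window of window_ns. */
static double cycles_in(double window_ns, double mhz)
{
	return window_ns / cycle_ns(mhz);
}
```

cycle_ns(400.0) is 2.5 ns, and cycles_in(50.0, 400.0) is 20, which is the budget for moving a cache line from one CPU to the other before a 50 ns skew could become visible.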

	-hpa



^ permalink raw reply	[flat|nested] 268+ messages in thread

* Re: Intel P6 vs P7 system call performance
  2002-12-17  6:09             ` Linus Torvalds
                                 ` (2 preceding siblings ...)
  2002-12-17  6:43               ` dean gaudet
@ 2002-12-17 19:12               ` H. Peter Anvin
  2002-12-17 19:26                 ` Martin J. Bligh
  2002-12-17 20:49                 ` Alan Cox
  3 siblings, 2 replies; 268+ messages in thread
From: H. Peter Anvin @ 2002-12-17 19:12 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Dave Jones, Ingo Molnar, linux-kernel

Linus Torvalds wrote:
> 
> It's not as good as a pure user-mode solution using tsc could be, but
> we've seen the kinds of complexities that has with multi-CPU systems, and
> they are so painful that I suspect the sysenter approach is a lot more
> palatable even if it doesn't allow for the absolute best theoretical
> numbers.
> 

The complexity only applies to nonsynchronized TSCs though, I would
assume.  I believe x86-64 uses a vsyscall using the TSC when it can
provide synchronized TSCs, and if it can't it puts a normal system call
inside the vsyscall in question.

	-hpa



^ permalink raw reply	[flat|nested] 268+ messages in thread

* Re: Intel P6 vs P7 system call performance
  2002-12-17 19:12               ` H. Peter Anvin
@ 2002-12-17 19:26                 ` Martin J. Bligh
  2002-12-17 20:51                   ` Alan Cox
  2002-12-17 20:49                 ` Alan Cox
  1 sibling, 1 reply; 268+ messages in thread
From: Martin J. Bligh @ 2002-12-17 19:26 UTC (permalink / raw)
  To: H. Peter Anvin, Linus Torvalds; +Cc: Dave Jones, Ingo Molnar, linux-kernel

>> It's not as good as a pure user-mode solution using tsc could be, but
>> we've seen the kinds of complexities that has with multi-CPU systems, and
>> they are so painful that I suspect the sysenter approach is a lot more
>> palatable even if it doesn't allow for the absolute best theoretical
>> numbers.
> 
> The complexity only applies to nonsynchronized TSCs though, I would
> assume.  I believe x86-64 uses a vsyscall using the TSC when it can
> provide synchronized TSCs, and if it can't it puts a normal system call
> inside the vsyscall in question.

You can't use the TSC to do gettimeofday on boxes where they aren't 
synchronised anyway though. That's nothing to do with vsyscalls, you just
need a different time source (eg the legacy stuff or HPET/cyclone).

M.


^ permalink raw reply	[flat|nested] 268+ messages in thread

* Re: Intel P6 vs P7 system call performance
  2002-12-17 19:26                 ` Martin J. Bligh
@ 2002-12-17 20:51                   ` Alan Cox
  2002-12-17 20:16                     ` H. Peter Anvin
  0 siblings, 1 reply; 268+ messages in thread
From: Alan Cox @ 2002-12-17 20:51 UTC (permalink / raw)
  To: Martin J. Bligh
  Cc: H. Peter Anvin, Linus Torvalds, Dave Jones, Ingo Molnar,
	Linux Kernel Mailing List

On Tue, 2002-12-17 at 19:26, Martin J. Bligh wrote:
> >> It's not as good as a pure user-mode solution using tsc could be, but
> You can't use the TSC to do gettimeofday on boxes where they aren't 
> synchronised anyway though. That's nothing to do with vsyscalls, you just
> need a different time source (eg the legacy stuff or HPET/cyclone).

Ditto all the laptops and the like. With code provided by the kernel we
can cheat however. If we know the fastest the CPU can go (ie full speed
on spudstop/powernow etc) we can tell the tsc value at which we have to
query the kernel to get time to any given accuracy, so allowing limited
caching

Ditto by knowing the worst case drift on summit

Alan


^ permalink raw reply	[flat|nested] 268+ messages in thread

* Re: Intel P6 vs P7 system call performance
  2002-12-17 20:51                   ` Alan Cox
@ 2002-12-17 20:16                     ` H. Peter Anvin
  0 siblings, 0 replies; 268+ messages in thread
From: H. Peter Anvin @ 2002-12-17 20:16 UTC (permalink / raw)
  To: Alan Cox
  Cc: Martin J. Bligh, Linus Torvalds, Dave Jones, Ingo Molnar,
	Linux Kernel Mailing List

Alan Cox wrote:
> On Tue, 2002-12-17 at 19:26, Martin J. Bligh wrote:
> 
>>>>It's not as good as a pure user-mode solution using tsc could be, but
>>
>>You can't use the TSC to do gettimeofday on boxes where they aren't 
>>synchronised anyway though. That's nothing to do with vsyscalls, you just
>>need a different time source (eg the legacy stuff or HPET/cyclone).
> 
> 
> Ditto all the laptops and the like. With code provided by the kernel we
> can cheat however. If we know the fastest the CPU can go (ie full speed
> on spudstop/powernow etc) we can tell the tsc value at which we have to
> query the kernel to get time to any given accuracy, so allowing limited
> caching
> 
> Ditto by knowing the worst case drift on summit
> 

Clever.  I like it :)

	-hpa



^ permalink raw reply	[flat|nested] 268+ messages in thread

* Re: Intel P6 vs P7 system call performance
  2002-12-17 19:12               ` H. Peter Anvin
  2002-12-17 19:26                 ` Martin J. Bligh
@ 2002-12-17 20:49                 ` Alan Cox
  2002-12-17 20:12                   ` H. Peter Anvin
  1 sibling, 1 reply; 268+ messages in thread
From: Alan Cox @ 2002-12-17 20:49 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Linus Torvalds, Dave Jones, Ingo Molnar,
	Linux Kernel Mailing List

On Tue, 2002-12-17 at 19:12, H. Peter Anvin wrote:
> The complexity only applies to nonsynchronized TSCs though, I would
> assume.  I believe x86-64 uses a vsyscall using the TSC when it can
> provide synchronized TSCs, and if it can't it puts a normal system call
> inside the vsyscall in question.

For x86-64 there is the hpet timer, which is a lot saner but I don't
think we can mmap it



^ permalink raw reply	[flat|nested] 268+ messages in thread

* Re: Intel P6 vs P7 system call performance
  2002-12-17 20:49                 ` Alan Cox
@ 2002-12-17 20:12                   ` H. Peter Anvin
  0 siblings, 0 replies; 268+ messages in thread
From: H. Peter Anvin @ 2002-12-17 20:12 UTC (permalink / raw)
  To: Alan Cox; +Cc: Linus Torvalds, Dave Jones, Ingo Molnar,
	Linux Kernel Mailing List

Alan Cox wrote:
> On Tue, 2002-12-17 at 19:12, H. Peter Anvin wrote:
> 
>>The complexity only applies to nonsynchronized TSCs though, I would
>>assume.  I believe x86-64 uses a vsyscall using the TSC when it can
>>provide synchronized TSCs, and if it can't it puts a normal system call
>>inside the vsyscall in question.
> 
> 
> For x86-64 there is the hpet timer, which is a lot saner but I don't
> think we can mmap it
> 

It's only necessary, though, when TSC isn't usable.  TSC is psycho fast
when it's available.  Just about anything is saner than the old 8254 (or
whatever it is called) timer chip, though...

	-hpa



^ permalink raw reply	[flat|nested] 268+ messages in thread

* Re: Intel P6 vs P7 system call performance
  2002-12-17  5:55           ` Linus Torvalds
  2002-12-17  6:09             ` Linus Torvalds
@ 2002-12-17  9:45             ` Andre Hedrick
  2002-12-17 12:40               ` Dave Jones
  2002-12-17 15:12               ` Alan Cox
  2002-12-17 10:53             ` Ulrich Drepper
                               ` (2 subsequent siblings)
  4 siblings, 2 replies; 268+ messages in thread
From: Andre Hedrick @ 2002-12-17  9:45 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Dave Jones, Ingo Molnar, linux-kernel, hpa


Linus,

Are you serious about moving off the banging we currently do on 0x80?
If so, I have a P4 development board with leds to monitor all the lower io
ports and can decode for you.

On Mon, 16 Dec 2002, Linus Torvalds wrote:

> 
> On Mon, 16 Dec 2002, Linus Torvalds wrote:
> >
> > (Modulo the missing syscall page I already mentioned and potential bugs
> > in the code itself, of course ;)
> 
> Ok, I did the vsyscall page too, and tried to make it do the right thing
> (but I didn't bother to test it on a non-SEP machine).
> 
> I'm pushing the changes out right now, but basically it boils down to the
> fact that with these changes, user space can instead of doing an
> 
> 	int $0x80
> 
> instruction for a system call just do a
> 
> 	call 0xfffff000
> 
> instead. The vsyscall page will be set up to use sysenter if the CPU
> supports it, and if it doesn't, it will just do the old "int $0x80"
> instead (and it could use the AMD syscall instruction if it wants to).
> User mode shouldn't know or care, the calling convention is the same as it
> ever was.
> 
> On my P4 machine, a "getppid()" is 641 cycles with sysenter/sysexit, and
> something like 1761 cycles with the old "int 0x80/iret" approach. That's a
> noticeable improvement, but I have to say that I'm a bit disappointed in
> the P4 still, it shouldn't be even that much.
> 
> As a comparison, an Athlon will do a full int/iret faster than a P4 does a
> sysenter/sysexit. Pathetic. But it's better than it used to be.
> 
> Whatever. The code is extremely simple, and while I'm sure there are
> things I've missed I'd love to hear if this works for anybody else. I'm
> appending the (extremely stupid) test-program I used to test it.
> 
> The way I did this, things like system call restarting etc _should_ all
> work fine even with "sysenter", simply by virtue of both sysenter and "int
> 0x80" being two-byte opcodes. But it might be interesting to verify that a
> recompiled glibc (or even just a preload) really works with this on a
> "whole system" testbed rather than just testing one system call (and not
> even caring about the return value) a million times.
> 
> The good news is that the kernel part really looks pretty clean.
> 
> 		Linus
> 
> ---
> #include <time.h>
> #include <sys/time.h>
> #include <asm/unistd.h>
> #include <sys/stat.h>
> #include <stdio.h>
> 
> #define rdtsc() ({ unsigned long a,d; asm volatile("rdtsc":"=a" (a), "=d" (d)); a; })
> 
> int main()
> {
> 	int i, ret;
> 	unsigned long start, end;
> 
> 	start = rdtsc();
> 	for (i = 0; i < 1000000; i++) {
> 		asm volatile("call 0xfffff000"
> 			:"=a" (ret)
> 			:"0" (__NR_getppid));
> 	}
> 	end = rdtsc();
> 	printf("%f cycles\n", (end - start) / 1000000.0);
> 
> 	start = rdtsc();
> 	for (i = 0; i < 1000000; i++) {
> 		asm volatile("int $0x80"
> 			:"=a" (ret)
> 			:"0" (__NR_getppid));
> 	}
> 	end = rdtsc();
> 	printf("%f cycles\n", (end - start) / 1000000.0);
> }
> 
> 
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
> 

Andre Hedrick
LAD Storage Consulting Group


^ permalink raw reply	[flat|nested] 268+ messages in thread

* Re: Intel P6 vs P7 system call performance
  2002-12-17  9:45             ` Andre Hedrick
@ 2002-12-17 12:40               ` Dave Jones
  2002-12-17 23:18                 ` Andre Hedrick
  2002-12-17 15:12               ` Alan Cox
  1 sibling, 1 reply; 268+ messages in thread
From: Dave Jones @ 2002-12-17 12:40 UTC (permalink / raw)
  To: Andre Hedrick; +Cc: Linus Torvalds, Ingo Molnar, linux-kernel, hpa

On Tue, Dec 17, 2002 at 01:45:52AM -0800, Andre Hedrick wrote:
 
 > Are you serious about moving of the banging we currently do on 0x80?
 > If so, I have a P4 development board with leds to monitor all the lower io
 > ports and can decode for you.

INT 0x80 != IO port 0x80

8-)

		Dave

-- 
| Dave Jones.        http://www.codemonkey.org.uk
| SuSE Labs

^ permalink raw reply	[flat|nested] 268+ messages in thread

* Re: Intel P6 vs P7 system call performance
  2002-12-17 12:40               ` Dave Jones
@ 2002-12-17 23:18                 ` Andre Hedrick
  0 siblings, 0 replies; 268+ messages in thread
From: Andre Hedrick @ 2002-12-17 23:18 UTC (permalink / raw)
  To: Dave Jones; +Cc: Linus Torvalds, Ingo Molnar, linux-kernel, hpa


Okay I will go back to my storage cave, call when you need something.

Got some meat tenderizer for the shoe leather to make choking it down
easier?

Cheers,

On Tue, 17 Dec 2002, Dave Jones wrote:

> On Tue, Dec 17, 2002 at 01:45:52AM -0800, Andre Hedrick wrote:
>  
>  > Are you serious about moving of the banging we currently do on 0x80?
>  > If so, I have a P4 development board with leds to monitor all the lower io
>  > ports and can decode for you.
> 
> INT 0x80 != IO port 0x80
> 
> 8-)
> 
> 		Dave
> 
> -- 
> | Dave Jones.        http://www.codemonkey.org.uk
> | SuSE Labs

Andre Hedrick
LAD Storage Consulting Group


^ permalink raw reply	[flat|nested] 268+ messages in thread

* Re: Intel P6 vs P7 system call performance
  2002-12-17  9:45             ` Andre Hedrick
  2002-12-17 12:40               ` Dave Jones
@ 2002-12-17 15:12               ` Alan Cox
  2002-12-18 23:55                 ` Pavel Machek
  1 sibling, 1 reply; 268+ messages in thread
From: Alan Cox @ 2002-12-17 15:12 UTC (permalink / raw)
  To: Andre Hedrick
  Cc: Linus Torvalds, Dave Jones, Ingo Molnar,
	Linux Kernel Mailing List, hpa

On Tue, 2002-12-17 at 09:45, Andre Hedrick wrote:
> 
> Linus,
> 
> Are you serious about moving of the banging we currently do on 0x80?
> If so, I have a P4 development board with leds to monitor all the lower io
> ports and can decode for you.

Different thing - int 0x80 syscall not i/o port 80. I've done I/O port
80 (it's very easy), but requires we set up some udelay constants with an
initial safety value right at boot (which we should do - we udelay
before it is initialised)


^ permalink raw reply	[flat|nested] 268+ messages in thread

* Re: Intel P6 vs P7 system call performance
  2002-12-17 15:12               ` Alan Cox
@ 2002-12-18 23:55                 ` Pavel Machek
  2002-12-19 22:17                   ` H. Peter Anvin
  0 siblings, 1 reply; 268+ messages in thread
From: Pavel Machek @ 2002-12-18 23:55 UTC (permalink / raw)
  To: Alan Cox
  Cc: Andre Hedrick, Linus Torvalds, Dave Jones, Ingo Molnar,
	Linux Kernel Mailing List, hpa

Hi!

> > Are you serious about moving of the banging we currently do on 0x80?
> > If so, I have a P4 development board with leds to monitor all the lower io
> > ports and can decode for you.
> 
> Different thing - int 0x80 syscall not i/o port 80. I've done I/O port
> 80 (its very easy), but requires we set up some udelay constants with an
> initial safety value right at boot (which we should do - we udelay
> before it is initialised)

Actually that would be nice -- I have leds on 0x80 too ;-).
								Pavel
-- 
Worst form of spam? Adding advertisement signatures ala sourceforge.net.
What goes next? Inserting advertisement *into* email?

^ permalink raw reply	[flat|nested] 268+ messages in thread

* Re: Intel P6 vs P7 system call performance
  2002-12-18 23:55                 ` Pavel Machek
@ 2002-12-19 22:17                   ` H. Peter Anvin
  0 siblings, 0 replies; 268+ messages in thread
From: H. Peter Anvin @ 2002-12-19 22:17 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Alan Cox, Andre Hedrick, Linus Torvalds, Dave Jones, Ingo Molnar,
	Linux Kernel Mailing List

Pavel Machek wrote:
>>
>>Different thing - int 0x80 syscall not i/o port 80. I've done I/O port
>>80 (its very easy), but requires we set up some udelay constants with an
>>initial safety value right at boot (which we should do - we udelay
>>before it is initialised)
> 
> Actually that would be nice -- I have leds on 0x80 too ;-).
> 								Pavel

We have tried before, and failed.  Phoenix uses something like 0xe2, but
apparently some machines with non-Phoenix BIOSes actually use that port.

	-hpa


^ permalink raw reply	[flat|nested] 268+ messages in thread

* Re: Intel P6 vs P7 system call performance
  2002-12-17  5:55           ` Linus Torvalds
  2002-12-17  6:09             ` Linus Torvalds
  2002-12-17  9:45             ` Andre Hedrick
@ 2002-12-17 10:53             ` Ulrich Drepper
  2002-12-17 11:17               ` dada1
                                 ` (2 more replies)
  2002-12-17 16:12             ` Hugh Dickins
  2002-12-18 23:51             ` Pavel Machek
  4 siblings, 3 replies; 268+ messages in thread
From: Ulrich Drepper @ 2002-12-17 10:53 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Dave Jones, Ingo Molnar, linux-kernel, hpa

Linus Torvalds wrote:

> Ok, I did the vsyscall page too, and tried to make it do the right thing
> (but I didn't bother to test it on a non-SEP machine).
> 
> But it might be interesting to verify that a
> recompiled glibc (or even just a preload) really works with this on a
> "whole system" testbed rather than just testing one system call (and not
> even caring about the return value) a million times.

I've created a modified glibc which uses the syscall code for almost
everything.  There are a few int $0x80 left here and there but mostly it
is a centralized change.

The result: all works as expected.  Nice.

On my test machine your little test program performs the syscalls
roughly twice as fast (HT P4, pretty new).  Your numbers are perhaps for
the P4 Xeons.  Anyway, when measuring some more involved code (I ran my
thread benchmark) I got only about 3% performance increase.  It's doing
a fair amount of system calls.  But again, the good news is your code
survived even this stress test.

The problem with the current solution is the instruction set of the x86.
In your test code you simply use call 0xfffff000 and it magically works.
But this is only the case because your program is linked statically.

For the libc DSO I had to play some dirty tricks.  The x86 CPU has no
absolute call.  The variant with an immediate parameter is a relative
jump.  Only when jumping through a register or memory location is it
possible to jump to an absolute address.  To be clear, if I have

    call 0xfffff000

in a DSO which is loaded at address 0x80000000 the jump ends up at
0x7ffff000.  The problem is that the static linker doesn't know the load
address.  We could of course have the dynamic linker fix up the
addresses but this is plain stupid.  It would mean fixing up a lot of
places and making the pages they cover non-sharable.
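
To make the displacement arithmetic concrete, here is a small sketch.
It assumes a 5-byte "call rel32" and a DSO linked at base 0; the link
address 0x1000 used in the note below is arbitrary and the function
names are illustrative, not taken from glibc or the kernel.

```c
#include <assert.h>
#include <stdint.h>

/* Displacement the static linker stores in a 5-byte "call rel32":
 * it is relative to the address of the next instruction. */
uint32_t encode_rel32(uint32_t insn_addr, uint32_t target)
{
	return target - (insn_addr + 5);
}

/* Address the CPU jumps to when those bytes execute at insn_addr. */
uint32_t call_target(uint32_t insn_addr, uint32_t rel32)
{
	return insn_addr + 5 + rel32;
}

/* Where the call lands once the whole object is shifted by load_base. */
uint32_t target_after_load(uint32_t link_addr, uint32_t target,
			   uint32_t load_base)
{
	uint32_t rel32 = encode_rel32(link_addr, target);

	/* Unshifted, the call reaches the intended target. */
	assert(call_target(link_addr, rel32) == target);
	return call_target(link_addr + load_base, rel32);
}
```

With these assumptions, target_after_load(0x1000, 0xfffff000, 0x80000000)
wraps around to 0x7ffff000: the target moves by exactly the load offset,
which is why an unrelocated relative call cannot reach a fixed page.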

Instead I've changed the syscall handling to effectively do

   pushl %ebp
   movl $0xfffff000, %ebp
   call *%ebp
   popl %ebp

An alternative is to store the address in a memory location.  But since
%ebx is used for a syscall parameter it is necessary to address the
memory relative to the stack pointer, which would mean storing
0xfffff000 on the stack before making the syscall.  Not much better than
the code sequence above.

Anyway, it's still an improvement.  But now the question comes up: how
does ld.so detect that the kernel supports these syscalls so that it can
use an appropriate DSO?  This brings up again the idea of the read-only
page(s) mapped into all processes (you remember).

Anyway, it works nicely.  If you need more testing let me know.

-- 
--------------.                        ,-.            444 Castro Street
Ulrich Drepper \    ,-----------------'   \ Mountain View, CA 94041 USA
Red Hat         `--' drepper at redhat.com `---------------------------

^ permalink raw reply	[flat|nested] 268+ messages in thread

* Re: Intel P6 vs P7 system call performance
  2002-12-17 10:53             ` Ulrich Drepper
@ 2002-12-17 11:17               ` dada1
  2002-12-17 17:33                 ` Ulrich Drepper
  2002-12-17 17:06               ` Linus Torvalds
  2002-12-18 23:59               ` Pavel Machek
  2 siblings, 1 reply; 268+ messages in thread
From: dada1 @ 2002-12-17 11:17 UTC (permalink / raw)
  To: Ulrich Drepper, Linus Torvalds; +Cc: Dave Jones, Ingo Molnar, linux-kernel, hpa

> For the libc DSO I had to play some dirty tricks.  The x86 CPU has no
> absolute call.  The variant with an immediate parameter is a relative
> jump.  Only when jumping through a register or memory location is it
> possible to jump to an absolute address.  To be clear, if I have
>
>     call 0xfffff000
>
> in a DSO which is loaded at address 0x80000000 the jumps ends at
> 0x7fffffff.  The problem is that the static linker doesn't know the load
> address.  We could of course have the dynamic linker fix up the
> addresses but this is plain stupid.  It would mean fixing up a lot of
> places and making of those pages covered non-sharable.
>

You could have only one routine that would need a relocation / patch at
dynamic linking stage :

absolute_syscall:
    jmp  0xfffff000

Then all syscalls routine could use :

getpid:
    ...
    call absolute_syscall
    ...
instead of "call 0xfffff000"


If the kernel doesn't support the 0xfffff000 page, you could patch
absolute_syscall (if it resides in .data section) with :
    absolute_syscall:
            int 0x80
            ret
(3 bytes instead of 5 bytes)
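
The byte counts can be checked with a short sketch. It assumes the usual
32-bit encodings (E9 plus a 32-bit displacement for "jmp rel32", CD 80
for "int $0x80", C3 for "ret"); the function names and the stub address
in the note below are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* "int $0x80; ret": CD 80 followed by C3, three bytes in total. */
static const uint8_t int80_stub[] = { 0xcd, 0x80, 0xc3 };

enum { JMP_REL32_LEN = 5 };	/* opcode E9 plus a 32-bit displacement */

/* Write "jmp rel32" so that, placed at stub_addr, it reaches target. */
size_t emit_jmp_rel32(uint8_t buf[JMP_REL32_LEN], uint32_t stub_addr,
		      uint32_t target)
{
	uint32_t rel = target - (stub_addr + JMP_REL32_LEN);

	assert(buf != NULL);
	buf[0] = 0xe9;
	buf[1] = (uint8_t)(rel & 0xff);		/* little-endian */
	buf[2] = (uint8_t)((rel >> 8) & 0xff);
	buf[3] = (uint8_t)((rel >> 16) & 0xff);
	buf[4] = (uint8_t)((rel >> 24) & 0xff);
	return JMP_REL32_LEN;
}

/* Write the fallback stub; returns its length in bytes. */
size_t emit_int80_stub(uint8_t *buf)
{
	assert(buf != NULL);
	memcpy(buf, int80_stub, sizeof int80_stub);
	return sizeof int80_stub;
}
```

The displacement depends on where the stub ends up, which is the part
that would need patching at load time; the fallback stub is position
independent.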

See you


^ permalink raw reply	[flat|nested] 268+ messages in thread

* Re: Intel P6 vs P7 system call performance
  2002-12-17 11:17               ` dada1
@ 2002-12-17 17:33                 ` Ulrich Drepper
  0 siblings, 0 replies; 268+ messages in thread
From: Ulrich Drepper @ 2002-12-17 17:33 UTC (permalink / raw)
  To: dada1; +Cc: Linus Torvalds, Dave Jones, Ingo Molnar, linux-kernel, hpa

dada1 wrote:

> You could have only one routine that would need a relocation / patch at
> dynamic linking stage :

That's a horrible way to deal with this in DSOs.  There is no writable
and executable segment and it would have to be created which means
enormous additional setup costs and higher memory requirement.  I'm not
going to use any code modification.

-- 
--------------.                        ,-.            444 Castro Street
Ulrich Drepper \    ,-----------------'   \ Mountain View, CA 94041 USA
Red Hat         `--' drepper at redhat.com `---------------------------


^ permalink raw reply	[flat|nested] 268+ messages in thread

* Re: Intel P6 vs P7 system call performance
  2002-12-17 10:53             ` Ulrich Drepper
  2002-12-17 11:17               ` dada1
@ 2002-12-17 17:06               ` Linus Torvalds
  2002-12-17 17:55                 ` Ulrich Drepper
  2002-12-18 23:59               ` Pavel Machek
  2 siblings, 1 reply; 268+ messages in thread
From: Linus Torvalds @ 2002-12-17 17:06 UTC (permalink / raw)
  To: Ulrich Drepper; +Cc: Dave Jones, Ingo Molnar, linux-kernel, hpa

On Tue, 17 Dec 2002, Ulrich Drepper wrote:
>
> The problem with the current solution is the instruction set of the x86.
>  In your test code you simply use call 0xfffff000 and it magically work.
>  But this is only the case because your program is linked statically.

Yeah, it's not very convenient. I didn't find any real alternatives,
though, and you can always just put 0xfffff000 in memory or registers and
jump to that. In fact, I suspect that if you actually want to use it in
glibc, then at least in the short term that's what you need to do anyway,
since you probably don't want to have a glibc that only works with very
recent kernels.

So I was actually assuming that the glibc code would look more like
something like this:

	old_fashioned:
		int $0x80
		ret

	unsigned long system_call_ptr = old_fashioned;

	/* .. startup .. */
	if (kernel_version > xxx)
		system_call_ptr = 0xfffff000;

	/* ... usage ... */
		call *system_call_ptr;

since you cannot depend on the 0xfffff000 on older kernels anyway..
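
The same dispatch can be mimicked in portable C with a function pointer.
This is a minimal sketch, not glibc code: the two entry functions are
stand-ins that only return a label, and the kernel_has_vsyscall flag is
hypothetical, since how to detect a capable kernel is left open here.

```c
#include <assert.h>
#include <stddef.h>

/* Signature shared by both entry paths in this sketch. */
typedef const char *(*syscall_entry_t)(void);

/* Stand-in for the "int $0x80; ret" stub. */
const char *old_fashioned(void)
{
	return "int $0x80";
}

/* Stand-in for the code the kernel places in the vsyscall page. */
const char *vsyscall_page(void)
{
	return "call 0xfffff000";
}

/* Default to the path every kernel supports. */
syscall_entry_t system_call_ptr = old_fashioned;

/* Startup hook: switch only if the kernel offers the new entry point. */
void select_syscall_entry(int kernel_has_vsyscall)
{
	system_call_ptr = kernel_has_vsyscall ? vsyscall_page : old_fashioned;
}

/* Every wrapper goes through the pointer, like "call *system_call_ptr". */
const char *do_syscall(void)
{
	assert(system_call_ptr != NULL);
	return system_call_ptr();
}
```

The point of the indirection is that the choice is made once at startup
and the wrappers themselves never need to be patched.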

> Instead I've changed the syscall handling to effectively do
>
>    pushl %ebp
>    movl $0xfffff000, %ebp
>    call *%ebp
>    popl %ebp

The above will work, but then you'd have limited yourself to a binary that
_only_ works on new kernels. So I'd suggest the memory indirection
instead.

		Linus

^ permalink raw reply	[flat|nested] 268+ messages in thread

* Re: Intel P6 vs P7 system call performance
  2002-12-17 17:06               ` Linus Torvalds
@ 2002-12-17 17:55                 ` Ulrich Drepper
  2002-12-17 18:01                   ` Linus Torvalds
  2002-12-17 19:23                   ` Alan Cox
  0 siblings, 2 replies; 268+ messages in thread
From: Ulrich Drepper @ 2002-12-17 17:55 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Dave Jones, Ingo Molnar, linux-kernel, hpa

Linus Torvalds wrote:

> Yeah, it's not very convenient. I didn't find any real alternatives,
> though, and you can always just put 0xfffff000 in memory or registers and
> jump to that.

Putting the value into memory myself is not possible.  In a DSO I have
to address memory indirectly.  But all registers (except %ebp, and maybe
it'll be used some day) are used at the time of the call.

But there is a way: if I'm using

   #define makesyscall(name) \
        movl $__NR_##name, %eax; \
        call *0xfffff000-__NR_##name(%eax)

and you'd put at address 0xfffff000 the address of the entry point the
wrappers wouldn't have any problems finding it.
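
Why this works can be shown with the address arithmetic alone. In the
sketch below the syscall number is folded into the displacement and then
added back through %eax, so the operand always resolves to the same
slot. The helper names are illustrative and 64 in the note is just a
sample syscall number.

```c
#include <assert.h>
#include <stdint.h>

#define SYSINFO_BASE 0xfffff000u	/* page address used in the thread */

/* Displacement the assembler would encode for syscall number nr. */
uint32_t wrapper_displacement(uint32_t nr)
{
	return SYSINFO_BASE - nr;
}

/* Effective address of "disp(%eax)": disp + %eax, modulo 2^32. */
uint32_t effective_address(uint32_t disp, uint32_t eax)
{
	return disp + eax;
}

/* Memory slot an indirect call through that operand would read. */
uint32_t slot_for_syscall(uint32_t nr)
{
	uint32_t slot = effective_address(wrapper_displacement(nr), nr);

	assert(slot == SYSINFO_BASE);	/* the syscall number cancels out */
	return slot;
}
```

For nr = 64 the displacement is 0xffffefc0 and the slot is 0xfffff000,
so no register besides %eax is needed to form the address.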

> In fact, I suspect that if you actually want to use it in
> glibc, then at least in the short term that's what you need to do anyway,
> since you probably don't want to have a glibc that only works with very
> recent kernels.

That's a compilation option.  We might want to do dynamic testing and
yes, a simple pointer indirection is adequate.

But still, the problem is detecting the capable kernels.  You have said
not long ago that comparing kernel versions is wrong.  And I agree.  It
doesn't cover backports or anything.  But there is a lack of an alternative.

If you don't like the process-global page thingy (anymore) the
alternative would be a sysconf() system call.

-- 
--------------.                        ,-.            444 Castro Street
Ulrich Drepper \    ,-----------------'   \ Mountain View, CA 94041 USA
Red Hat         `--' drepper at redhat.com `---------------------------

^ permalink raw reply	[flat|nested] 268+ messages in thread

* Re: Intel P6 vs P7 system call performance
  2002-12-17 17:55                 ` Ulrich Drepper
@ 2002-12-17 18:01                   ` Linus Torvalds
  2002-12-17 19:23                   ` Alan Cox
  1 sibling, 0 replies; 268+ messages in thread
From: Linus Torvalds @ 2002-12-17 18:01 UTC (permalink / raw)
  To: Ulrich Drepper; +Cc: Dave Jones, Ingo Molnar, linux-kernel, hpa

On Tue, 17 Dec 2002, Ulrich Drepper wrote:
>
> If you don't like the process-global page thingy (anymore) the
> alternative would be a sysconf() system call.

Well, we do _have_ the process-global thingy now - it's the vsyscall page.
It's not settable by the process, but it's useful for information.
Together with an elf AT_ entry pointing to it, it's certainly sufficient
for this usage, and it should also be sufficient for "future use" (ie we
can add future system information in the page later: bitmaps of features
at offset "start + 128" for example).
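
As a sketch of how user space could pick up such an entry, the code
below walks an auxiliary vector laid out as (tag, value) pairs ending in
a null tag. The struct and function names are hypothetical, and the tag
value 32 is assumed to match AT_SYSINFO in <elf.h>; the vectors in the
note are synthetic.

```c
#include <assert.h>
#include <stddef.h>

/* Tag values assumed to match <elf.h>; prefixed to avoid clashing. */
#define SK_AT_NULL	0
#define SK_AT_SYSINFO	32

/* One auxiliary-vector entry: a tag and its value. */
struct aux_entry {
	unsigned long type;
	unsigned long value;
};

/* Return the value for tag, or fallback when the entry is absent. */
unsigned long aux_lookup(const struct aux_entry *auxv, unsigned long tag,
			 unsigned long fallback)
{
	assert(auxv != NULL);
	for (; auxv->type != SK_AT_NULL; auxv++)
		if (auxv->type == tag)
			return auxv->value;
	return fallback;
}
```

A vector that contains the tag yields the page address; on an older
kernel the lookup returns the fallback, which lets the caller keep using
int $0x80 without comparing kernel versions.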

		Linus

^ permalink raw reply	[flat|nested] 268+ messages in thread

* Re: Intel P6 vs P7 system call performance
  2002-12-17 17:55                 ` Ulrich Drepper
  2002-12-17 18:01                   ` Linus Torvalds
@ 2002-12-17 19:23                   ` Alan Cox
  2002-12-17 18:48                     ` Ulrich Drepper
  2002-12-17 18:49                     ` Linus Torvalds
  1 sibling, 2 replies; 268+ messages in thread
From: Alan Cox @ 2002-12-17 19:23 UTC (permalink / raw)
  To: Ulrich Drepper
  Cc: Linus Torvalds, Dave Jones, Ingo Molnar,
	Linux Kernel Mailing List, hpa

On Tue, 2002-12-17 at 17:55, Ulrich Drepper wrote:
> But there is a way: if I'm using
> 
>    #define makesyscall(name) \
>         movl $__NR_##name, %eax; \
>         call *0xfffff000-__NR_##name(%eax)
> 
> and you'd put at address 0xfffff000 the address of the entry point the
> wrappers wouldn't have any problems finding it.

Is there any reason you can't just keep the linker out of the entire
mess by generating

	.byte whatever
	.dword 0xFFFF0000

instead of call ?



^ permalink raw reply	[flat|nested] 268+ messages in thread

* Re: Intel P6 vs P7 system call performance
  2002-12-17 19:23                   ` Alan Cox
@ 2002-12-17 18:48                     ` Ulrich Drepper
  2002-12-17 19:19                       ` H. Peter Anvin
  2002-12-17 19:44                       ` Alan Cox
  2002-12-17 18:49                     ` Linus Torvalds
  1 sibling, 2 replies; 268+ messages in thread
From: Ulrich Drepper @ 2002-12-17 18:48 UTC (permalink / raw)
  To: Alan Cox
  Cc: Linus Torvalds, Dave Jones, Ingo Molnar,
	Linux Kernel Mailing List, hpa

Alan Cox wrote:

> Is there any reason you can't just keep the linker out of the entire
> mess by generating
> 
> 	.byte whatever
> 	.dword 0xFFFF0000
> 
> instead of call ?

There is no such instruction.  Unless you know about some secret
undocumented opcode...

-- 
--------------.                        ,-.            444 Castro Street
Ulrich Drepper \    ,-----------------'   \ Mountain View, CA 94041 USA
Red Hat         `--' drepper at redhat.com `---------------------------


^ permalink raw reply	[flat|nested] 268+ messages in thread

* Re: Intel P6 vs P7 system call performance
  2002-12-17 18:48                     ` Ulrich Drepper
@ 2002-12-17 19:19                       ` H. Peter Anvin
  2002-12-17 19:44                       ` Alan Cox
  1 sibling, 0 replies; 268+ messages in thread
From: H. Peter Anvin @ 2002-12-17 19:19 UTC (permalink / raw)
  To: Ulrich Drepper
  Cc: Alan Cox, Linus Torvalds, Dave Jones, Ingo Molnar,
	Linux Kernel Mailing List

Ulrich Drepper wrote:
> Alan Cox wrote:
> 
> 
>>Is there any reason you can't just keep the linker out of the entire
>>mess by generating
>>
>>	.byte whatever
>>	.dword 0xFFFF0000
>>
>>instead of call ?
> 
> 
> There is no such instruction.  Unless you know about some secret
> undocumented opcode...
> 

Well, there is lcall $0xffff0000, $USER_CS... (no, I'm most definitely
*not* suggesting it.)

	-hpa


^ permalink raw reply	[flat|nested] 268+ messages in thread

* Re: Intel P6 vs P7 system call performance
  2002-12-17 18:48                     ` Ulrich Drepper
  2002-12-17 19:19                       ` H. Peter Anvin
@ 2002-12-17 19:44                       ` Alan Cox
  2002-12-17 19:52                         ` Richard B. Johnson
  1 sibling, 1 reply; 268+ messages in thread
From: Alan Cox @ 2002-12-17 19:44 UTC (permalink / raw)
  To: Ulrich Drepper
  Cc: Linus Torvalds, Dave Jones, Ingo Molnar,
	Linux Kernel Mailing List, hpa

On Tue, 2002-12-17 at 18:48, Ulrich Drepper wrote:
> Alan Cox wrote:
> 
> > Is there any reason you can't just keep the linker out of the entire
> > mess by generating
> > 
> > 	.byte whatever
> > 	.dword 0xFFFF0000
> > 
> > instead of call ?
> 
> There is no such instruction.  Unless you know about some secret
> undocumented opcode...

No I'd forgotten how broken x86 was


^ permalink raw reply	[flat|nested] 268+ messages in thread

* Re: Intel P6 vs P7 system call performance
  2002-12-17 19:44                       ` Alan Cox
@ 2002-12-17 19:52                         ` Richard B. Johnson
  2002-12-17 19:54                           ` H. Peter Anvin
  2002-12-17 19:58                           ` Linus Torvalds
  0 siblings, 2 replies; 268+ messages in thread
From: Richard B. Johnson @ 2002-12-17 19:52 UTC (permalink / raw)
  To: Alan Cox
  Cc: Ulrich Drepper, Linus Torvalds, Dave Jones, Ingo Molnar,
	Linux Kernel Mailing List, hpa

On 17 Dec 2002, Alan Cox wrote:

> On Tue, 2002-12-17 at 18:48, Ulrich Drepper wrote:
> > Alan Cox wrote:
> > 
> > > Is there any reason you can't just keep the linker out of the entire
> > > mess by generating
> > > 
> > > 	.byte whatever
> > > 	.dword 0xFFFF0000
> > > 
> > > instead of call ?
> > 
> > There is no such instruction.  Unless you know about some secret
> > undocumented opcode...
> 
> No I'd forgotten how broken x86 was
> 

You can call intersegment with a full pointer. I don't know how
expensive that is. Since USER_CS is a fixed value in Linux, it
can be hard-coded

		.byte 0x9a
		.dword 0xfffff000
		.word USER_CS

No. I didn't try this, I'm just looking at the manual. I don't know
what the USER_CS is (didn't look in the kernel) The book says the
pointer is 16:32 which means that it's a dword, followed by a word.
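
The 16:32 layout can be spelled out as a byte sequence. This is only an
encoding sketch: it assumes opcode 9A for the direct far call, writes
the 32-bit offset before the 16-bit selector, both little-endian, and
takes the selector as a parameter because the real USER_CS value is not
given here. The function name and the 0x23 used in the note are
placeholders.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { FAR_CALL_LEN = 7 };	/* 1 opcode + 4 offset + 2 selector bytes */

/* Emit a direct far call: 9A, offset (dword), selector (word). */
size_t emit_far_call(uint8_t buf[FAR_CALL_LEN], uint32_t offset,
		     uint16_t selector)
{
	assert(buf != NULL);
	buf[0] = 0x9a;
	buf[1] = (uint8_t)(offset & 0xff);
	buf[2] = (uint8_t)((offset >> 8) & 0xff);
	buf[3] = (uint8_t)((offset >> 16) & 0xff);
	buf[4] = (uint8_t)((offset >> 24) & 0xff);
	buf[5] = (uint8_t)(selector & 0xff);
	buf[6] = (uint8_t)(selector >> 8);
	return FAR_CALL_LEN;
}
```

For offset 0xfffff000 and a placeholder selector of 0x23 the bytes come
out as 9a 00 f0 ff ff 23 00, seven bytes in all.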

Cheers,
Dick Johnson
Penguin : Linux version 2.4.18 on an i686 machine (797.90 BogoMips).
Why is the government concerned about the lunatic fringe? Think about it.

^ permalink raw reply	[flat|nested] 268+ messages in thread

* Re: Intel P6 vs P7 system call performance
  2002-12-17 19:52                         ` Richard B. Johnson
@ 2002-12-17 19:54                           ` H. Peter Anvin
  2002-12-17 19:58                           ` Linus Torvalds
  1 sibling, 0 replies; 268+ messages in thread
From: H. Peter Anvin @ 2002-12-17 19:54 UTC (permalink / raw)
  To: root
  Cc: Alan Cox, Ulrich Drepper, Linus Torvalds, Dave Jones, Ingo Molnar,
	Linux Kernel Mailing List

Richard B. Johnson wrote:
> 
> You can call intersegment with a full pointer. I don't know how
> expensive that is. Since USER_CS is a fixed value in Linux, it
> can be hard-coded
> 
> 		.byte 0x9a
> 		.dword 0xfffff000
> 		.word USER_CS
> 
> No. I didn't try this, I'm just looking at the manual. I don't know
> what the USER_CS is (didn't look in the kernel) The book says the
> pointer is 16:32 which means that it's a dword, followed by a word.
> 

It's quite expensive (not as expensive as INT, but not that far from
it), and you also push CS onto the stack.

	-hpa



^ permalink raw reply	[flat|nested] 268+ messages in thread

* Re: Intel P6 vs P7 system call performance
  2002-12-17 19:52                         ` Richard B. Johnson
  2002-12-17 19:54                           ` H. Peter Anvin
@ 2002-12-17 19:58                           ` Linus Torvalds
  2002-12-18  7:20                             ` Kai Henningsen
  1 sibling, 1 reply; 268+ messages in thread
From: Linus Torvalds @ 2002-12-17 19:58 UTC (permalink / raw)
  To: Richard B. Johnson
  Cc: Alan Cox, Ulrich Drepper, Dave Jones, Ingo Molnar,
	Linux Kernel Mailing List, hpa



On Tue, 17 Dec 2002, Richard B. Johnson wrote:
>
> You can call intersegment with a full pointer. I don't know how
> expensive that is.

It's so expensive as to not be worth it; it's cheaper to load a register
or something, i.e. you can do

	pushl $0xfffff000
	call *(%esp)

faster than doing a far call.

		Linus


^ permalink raw reply	[flat|nested] 268+ messages in thread

* Re: Intel P6 vs P7 system call performance
  2002-12-17 19:58                           ` Linus Torvalds
@ 2002-12-18  7:20                             ` Kai Henningsen
  0 siblings, 0 replies; 268+ messages in thread
From: Kai Henningsen @ 2002-12-18  7:20 UTC (permalink / raw)
  To: linux-kernel

torvalds@transmeta.com (Linus Torvalds)  wrote on 17.12.02 in <Pine.LNX.4.44.0212171157050.1095-100000@home.transmeta.com>:

> On Tue, 17 Dec 2002, Richard B. Johnson wrote:
> >
> > You can call intersegment with a full pointer. I don't know how
> > expensive that is.
>
> It's so expensive as to not be worth it; it's cheaper to load a register
> or something, i.e. you can do
>
> 	pushl $0xfffff000
> 	call *(%esp)
>
> faster than doing a far call.

Hmm ...

How expensive would it be to have a special virtual DSO built into ld.so  
which exported this (and any other future entry points), to be linked  
against like any other DSO? That way, the *actual* interface would only be  
between the kernel and ld.so.

MfG Kai

^ permalink raw reply	[flat|nested] 268+ messages in thread

* Re: Intel P6 vs P7 system call performance
  2002-12-17 19:23                   ` Alan Cox
  2002-12-17 18:48                     ` Ulrich Drepper
@ 2002-12-17 18:49                     ` Linus Torvalds
  2002-12-17 19:09                       ` Ross Biro
  2002-12-17 21:34                       ` Benjamin LaHaise
  1 sibling, 2 replies; 268+ messages in thread
From: Linus Torvalds @ 2002-12-17 18:49 UTC (permalink / raw)
  To: Alan Cox
  Cc: Ulrich Drepper, Dave Jones, Ingo Molnar,
	Linux Kernel Mailing List, hpa

On 17 Dec 2002, Alan Cox wrote:
>
> Is there any reason you can't just keep the linker out of the entire
> mess by generating
>
> 	.byte whatever
> 	.dword 0xFFFF0000
>
> instead of call ?

Alan, the problem is that there _is_ no such instruction as a "call
absolute".

There is only a "call relative" or "call indirect-absolute". So you either
have to indirect through memory or a register, or you have to fix up the
call at link-time.

Yeah, I know it sounds strange, but it makes sense. Absolute calls are
actually very unusual, and using relative calls is _usually_ the right
thing to do. It's only in cases like this that we really want to call a
specific address.
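
A small sketch of the arithmetic behind this, assuming the usual 5-byte "call rel32" encoding (opcode 0xe8 plus a 32-bit displacement measured from the end of the instruction). The helper names are invented for the example. It shows that a displacement computed for one load address lands somewhere else if the code is moved without a fixup.

```c
#include <assert.h>
#include <stdint.h>

enum { CALL_REL32_LEN = 5 };  /* opcode 0xe8 + 32-bit displacement */

/* Displacement that must be stored so a call at call_addr reaches target. */
static uint32_t call_rel32_disp(uint32_t call_addr, uint32_t target)
{
    return target - (call_addr + CALL_REL32_LEN);
}

/* Where the CPU lands: address of the next instruction plus displacement. */
static uint32_t call_rel32_target(uint32_t call_addr, uint32_t disp)
{
    return call_addr + CALL_REL32_LEN + disp;
}

/*
 * Target reached when a call linked for link_addr is executed at
 * link_addr + bias with the displacement left untouched.
 */
static uint32_t target_after_relocation(uint32_t link_addr, uint32_t bias,
                                        uint32_t intended)
{
    uint32_t disp = call_rel32_disp(link_addr, intended);
    uint32_t landed = call_rel32_target(link_addr + bias, disp);

    /* The miss is exactly the load bias (mod 2^32). */
    assert(landed == intended + bias);
    return landed;
}
```

With a bias of zero the call reaches the intended address; with any other bias it is off by that bias, which is why the choice is between a link-time fixup and an indirect call.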

			Linus

^ permalink raw reply	[flat|nested] 268+ messages in thread

* Re: Intel P6 vs P7 system call performance
  2002-12-17 18:49                     ` Linus Torvalds
@ 2002-12-17 19:09                       ` Ross Biro
  2002-12-17 21:34                       ` Benjamin LaHaise
  1 sibling, 0 replies; 268+ messages in thread
From: Ross Biro @ 2002-12-17 19:09 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Alan Cox, Ulrich Drepper, Dave Jones, Ingo Molnar,
	Linux Kernel Mailing List, hpa

It doesn't make sense to me to use a specially formatted page forced 
into user space to tell libraries how to do system calls.  Perhaps each 
executable personality in the kernel should export a special shared 
library in its own native format that contains the necessary 
information.  That way we don't have to worry as much about code or 
values changing sizes or locations.

We would have the chicken/egg problem with how the special shared 
library gets loaded in the first place.  For that we either support a 
legacy syscall method (i.e. int 0x80 on x86) which should only be used 
by ld.so or the equivalent or magically force the library into user 
space at a known address.

    Ross

Linus Torvalds wrote:

>On 17 Dec 2002, Alan Cox wrote:
>  
>
>>Is there any reason you can't just keep the linker out of the entire
>>mess by generating
>>
>>	.byte whatever
>>	.dword 0xFFFF0000
>>
>>instead of call ?
>>    
>>
>
>Alan, the problem is that there _is_ no such instruction as a "call
>absolute".
>
>There is only a "call relative" or "call indirect-absolute". So you either
>have to indirect through memory or a register, or you have to fix up the
>call at link-time.
>
>Yeah, I know it sounds strange, but it makes sense. Absolute calls are
>actually very unusual, and using relative calls is _usually_ the right
>thing to do. It's only in cases like this that we really want to call a
>specific address.
>
>			Linus
>

^ permalink raw reply	[flat|nested] 268+ messages in thread

* Re: Intel P6 vs P7 system call performance
  2002-12-17 18:49                     ` Linus Torvalds
  2002-12-17 19:09                       ` Ross Biro
@ 2002-12-17 21:34                       ` Benjamin LaHaise
  2002-12-17 21:36                         ` H. Peter Anvin
  1 sibling, 1 reply; 268+ messages in thread
From: Benjamin LaHaise @ 2002-12-17 21:34 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Alan Cox, Ulrich Drepper, Dave Jones, Ingo Molnar,
	Linux Kernel Mailing List, hpa

On Tue, Dec 17, 2002 at 10:49:31AM -0800, Linus Torvalds wrote:
> There is only a "call relative" or "call indirect-absolute". So you either
> have to indirect through memory or a register, or you have to fix up the
> call at link-time.
> 
> Yeah, I know it sounds strange, but it makes sense. Absolute calls are
> actually very unusual, and using relative calls is _usually_ the right
> thing to do. It's only in cases like this that we really want to call a
> specific address.

The stubs I used for the vsyscall bits just did an absolute jump to 
the vsyscall page, which would then do a ret to the original calling 
userspace code (since that provided library symbols for the user to 
bind against).

		-ben
-- 
"Do you seek knowledge in time travel?"

^ permalink raw reply	[flat|nested] 268+ messages in thread

* Re: Intel P6 vs P7 system call performance
  2002-12-17 21:34                       ` Benjamin LaHaise
@ 2002-12-17 21:36                         ` H. Peter Anvin
  2002-12-17 21:50                           ` Benjamin LaHaise
  0 siblings, 1 reply; 268+ messages in thread
From: H. Peter Anvin @ 2002-12-17 21:36 UTC (permalink / raw)
  To: Benjamin LaHaise
  Cc: Linus Torvalds, Alan Cox, Ulrich Drepper, Dave Jones, Ingo Molnar,
	Linux Kernel Mailing List

Benjamin LaHaise wrote:
>
> The stubs I used for the vsyscall bits just did an absolute jump to 
> the vsyscall page, which would then do a ret to the original calling 
> userspace code (since that provided library symbols for the user to 
> bind against).
> 

What kind of "absolute jumps" were these?

	-hpa



^ permalink raw reply	[flat|nested] 268+ messages in thread

* Re: Intel P6 vs P7 system call performance
  2002-12-17 21:36                         ` H. Peter Anvin
@ 2002-12-17 21:50                           ` Benjamin LaHaise
  0 siblings, 0 replies; 268+ messages in thread
From: Benjamin LaHaise @ 2002-12-17 21:50 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Linus Torvalds, Alan Cox, Ulrich Drepper, Dave Jones, Ingo Molnar,
	Linux Kernel Mailing List

On Tue, Dec 17, 2002 at 01:36:53PM -0800, H. Peter Anvin wrote:
> Benjamin LaHaise wrote:
> >
> > The stubs I used for the vsyscall bits just did an absolute jump to 
> > the vsyscall page, which would then do a ret to the original calling 
> > userspace code (since that provided library symbols for the user to 
> > bind against).
> > 
> 
> What kind of "absolute jumps" were these?

It was a far jump (ljmp $__USER_CS,<address>).

		-ben
-- 
"Do you seek knowledge in time travel?"

^ permalink raw reply	[flat|nested] 268+ messages in thread

* Re: Intel P6 vs P7 system call performance
  2002-12-17 10:53             ` Ulrich Drepper
  2002-12-17 11:17               ` dada1
  2002-12-17 17:06               ` Linus Torvalds
@ 2002-12-18 23:59               ` Pavel Machek
  2 siblings, 0 replies; 268+ messages in thread
From: Pavel Machek @ 2002-12-18 23:59 UTC (permalink / raw)
  To: Ulrich Drepper; +Cc: Linus Torvalds, Dave Jones, Ingo Molnar, linux-kernel, hpa

Hi!

> I've created a modified glibc which uses the syscall code for almost
> everything.  There are a few int $0x80 left here and there but mostly it
> is a centralized change.
> 
> The result: all works as expected.  Nice.
> 
> On my test machine your little test program performs the syscalls
> roughly twice as fast (HT P4, pretty new).  Your numbers are perhaps for
> the P4 Xeons.  Anyway, when measuring some more involved code (I ran my
> thread benchmark) I got only about 3% performance increase.  It's doing
> a fair amount of system calls.  But again, the good news is your code
> survived even this stress test.
> 
> 
> The problem with the current solution is the instruction set of the x86.
>  In your test code you simply use call 0xfffff000 and it magically works.
>  But this is only the case because your program is linked statically.
> 
> For the libc DSO I had to play some dirty tricks.  The x86 CPU has no
> absolute call.  The variant with an immediate parameter is a relative
> jump.  Only when jumping through a register or memory location is it
> possible to jump to an absolute address.  To be clear, if I have
> 
>     call 0xfffff000
> 
> in a DSO which is loaded at address 0x80000000 the jump ends at
> 0x7ffff000.  The problem is that the static linker doesn't know the load
> address.  We could of course have the dynamic linker fix up the
> addresses but this is plain stupid.  It would mean fixing up a lot of
> places and making all of the pages covered non-sharable.

Can't you do call far __SOME_CS, 0xfffff000?

								Pavel
-- 
Worst form of spam? Adding advertisment signatures ala sourceforge.net.
What goes next? Inserting advertisment *into* email?

^ permalink raw reply	[flat|nested] 268+ messages in thread

* Re: Intel P6 vs P7 system call performance
  2002-12-17  5:55           ` Linus Torvalds
                               ` (2 preceding siblings ...)
  2002-12-17 10:53             ` Ulrich Drepper
@ 2002-12-17 16:12             ` Hugh Dickins
  2002-12-17 16:33               ` Richard B. Johnson
                                 ` (2 more replies)
  2002-12-18 23:51             ` Pavel Machek
  4 siblings, 3 replies; 268+ messages in thread
From: Hugh Dickins @ 2002-12-17 16:12 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Dave Jones, Ingo Molnar, Ulrich Drepper, linux-kernel, hpa

On Mon, 16 Dec 2002, Linus Torvalds wrote:
> 
> Ok, I did the vsyscall page too, and tried to make it do the right thing
> (but I didn't bother to test it on a non-SEP machine).
> 
> I'm pushing the changes out right now, but basically it boils down to the
> fact that with these changes, user space can instead of doing an
> 
> 	int $0x80
> 
> instruction for a system call just do a
> 
> 	call 0xfffff000

I thought that last page was intentionally left invalid?

So that, for example, *(char *)MAP_FAILED will give SIGSEGV;
whereas now I can read a 0 there (and perhaps you should be
using get_zeroed_page rather than __get_free_page?).

I cannot name anything which relies on that page being invalid,
but think it would be safer to keep it that way; though I guess
more compatibility pain to use the next page down (or could
seg lim be used? I forget the granularity restrictions).

Hugh

^ permalink raw reply	[flat|nested] 268+ messages in thread

* Re: Intel P6 vs P7 system call performance
  2002-12-17 16:12             ` Hugh Dickins
@ 2002-12-17 16:33               ` Richard B. Johnson
  2002-12-17 17:47                 ` Linus Torvalds
  2002-12-17 16:54               ` Hugh Dickins
  2002-12-17 17:07               ` Linus Torvalds
  2 siblings, 1 reply; 268+ messages in thread
From: Richard B. Johnson @ 2002-12-17 16:33 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Linus Torvalds, Dave Jones, Ingo Molnar, Ulrich Drepper,
	linux-kernel, hpa

On Tue, 17 Dec 2002, Hugh Dickins wrote:

> On Mon, 16 Dec 2002, Linus Torvalds wrote:
> > 
> > Ok, I did the vsyscall page too, and tried to make it do the right thing
> > (but I didn't bother to test it on a non-SEP machine).
> > 
> > I'm pushing the changes out right now, but basically it boils down to the
> > fact that with these changes, user space can instead of doing an
> > 
> > 	int $0x80
> > 
> > instruction for a system call just do a
> > 
> > 	call 0xfffff000
> 

So you are going to do a system-call off a trap instead of an interrupt.
The difference in performance should be practically nothing. There is
also going to be additional overhead in returning from the trap since
the IP and caller's segment was not saved by the initial trap. I don't
see how you can possibly claim any improvement in performance. Further,
it doesn't make any sense. We don't call physical addresses from a
virtual address anyway, so there will be additional translation that
must take some time. With the current page-table translation you
would need to put your system-call entry point at 0xfffff000 - 0xc0000000
= 0x3ffff000 and there might not even be any RAM there. This guarantees
that you are going to have to set up a special PTE, resulting in
additional overhead.

Cheers,
Dick Johnson
Penguin : Linux version 2.4.18 on an i686 machine (797.90 BogoMips).
Why is the government concerned about the lunatic fringe? Think about it.

^ permalink raw reply	[flat|nested] 268+ messages in thread

* Re: Intel P6 vs P7 system call performance
  2002-12-17 16:33               ` Richard B. Johnson
@ 2002-12-17 17:47                 ` Linus Torvalds
  0 siblings, 0 replies; 268+ messages in thread
From: Linus Torvalds @ 2002-12-17 17:47 UTC (permalink / raw)
  To: Richard B. Johnson
  Cc: Hugh Dickins, Dave Jones, Ingo Molnar, Ulrich Drepper,
	linux-kernel, hpa




On Tue, 17 Dec 2002, Richard B. Johnson wrote:
> On Mon, 16 Dec 2002, Linus Torvalds wrote:
> >
> > instruction for a system call just do a
> >
> > 	call 0xfffff000
>
> So you are going to do a system-call off a trap instead of an interrupt.

No no. The kernel maps a magic read-only page at 0xfffff000, and there's
no trap involved. The code at that address is kernel-generated for the CPU
in question, and it will do whatever is most convenient.

No traps. They're slow as hell.

		Linus


^ permalink raw reply	[flat|nested] 268+ messages in thread

* Re: Intel P6 vs P7 system call performance
  2002-12-17 16:12             ` Hugh Dickins
  2002-12-17 16:33               ` Richard B. Johnson
@ 2002-12-17 16:54               ` Hugh Dickins
  2002-12-17 17:07               ` Linus Torvalds
  2 siblings, 0 replies; 268+ messages in thread
From: Hugh Dickins @ 2002-12-17 16:54 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Dave Jones, Ingo Molnar, Ulrich Drepper, linux-kernel, hpa

On Tue, 17 Dec 2002, Hugh Dickins wrote:
> whereas now I can read a 0 there (and perhaps you should be
> using get_zeroed_page rather than __get_free_page?).

Sorry, yes, you are using get_zeroed_page for the one that needs it.

Hugh


^ permalink raw reply	[flat|nested] 268+ messages in thread

* Re: Intel P6 vs P7 system call performance
  2002-12-17 16:12             ` Hugh Dickins
  2002-12-17 16:33               ` Richard B. Johnson
  2002-12-17 16:54               ` Hugh Dickins
@ 2002-12-17 17:07               ` Linus Torvalds
  2002-12-17 17:19                 ` Matti Aarnio
  2 siblings, 1 reply; 268+ messages in thread
From: Linus Torvalds @ 2002-12-17 17:07 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: Dave Jones, Ingo Molnar, Ulrich Drepper, linux-kernel, hpa



On Tue, 17 Dec 2002, Hugh Dickins wrote:
>
> I thought that last page was intentionally left invalid?

It was. But I thought it made sense to use, as it's the only really
"special" page.

But yes, we should decide on this quickly - it's easy to change right now,
but..

		Linus


^ permalink raw reply	[flat|nested] 268+ messages in thread

* Re: Intel P6 vs P7 system call performance
  2002-12-17 17:07               ` Linus Torvalds
@ 2002-12-17 17:19                 ` Matti Aarnio
  2002-12-17 17:55                   ` Linus Torvalds
  0 siblings, 1 reply; 268+ messages in thread
From: Matti Aarnio @ 2002-12-17 17:19 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Hugh Dickins, Dave Jones, Ingo Molnar, Ulrich Drepper,
	linux-kernel, hpa

On Tue, Dec 17, 2002 at 09:07:21AM -0800, Linus Torvalds wrote:
> On Tue, 17 Dec 2002, Hugh Dickins wrote:
> > I thought that last page was intentionally left invalid?
> 
> It was. But I thought it made sense to use, as it's the only really
> "special" page.

  On a couple of occasions I have caught myself pre-decrementing
  a char pointer which "just happened" to be NULL.

  Please keep the last page, as well as a few of the first pages as
  NULL-pointer poisons.

> But yes, we should decide on this quickly - it's easy to change right now,
> but..
> 
> 		Linus

/Matti Aarnio

^ permalink raw reply	[flat|nested] 268+ messages in thread

* Re: Intel P6 vs P7 system call performance
  2002-12-17 17:19                 ` Matti Aarnio
@ 2002-12-17 17:55                   ` Linus Torvalds
  2002-12-17 18:24                     ` Linus Torvalds
                                       ` (3 more replies)
  0 siblings, 4 replies; 268+ messages in thread
From: Linus Torvalds @ 2002-12-17 17:55 UTC (permalink / raw)
  To: Ulrich Drepper, Matti Aarnio
  Cc: Hugh Dickins, Dave Jones, Ingo Molnar, linux-kernel, hpa

On Tue, 17 Dec 2002, Matti Aarnio wrote:
>
> On Tue, Dec 17, 2002 at 09:07:21AM -0800, Linus Torvalds wrote:
> > On Tue, 17 Dec 2002, Hugh Dickins wrote:
> > > I thought that last page was intentionally left invalid?
> >
> > It was. But I thought it made sense to use, as it's the only really
> > "special" page.
>
>   On a couple of occasions I have caught myself pre-decrementing
>   a char pointer which "just happened" to be NULL.
>
>   Please keep the last page, as well as a few of the first pages as
>   NULL-pointer poisons.

I think I have a good clean solution to this, that not only avoids the
need for any hard-coded address _at_all_, but also solves Uli's problem
quite cleanly.

Uli, how about I just add one new architecture-specific ELF AT flag, which
is the "base of sysinfo page". Right now that page is all zeroes except
for the system call trampoline at the beginning, but we might want to add
other system information to the page in the future (it is readable, after
all).

So we'd have an AT_SYSINFO entry, that with the current implementation
would just get the value 0xfffff000. And then the glibc startup code could
easily be backwards compatible with the suggestion I had in the previous
email. Since we basically want to do an indirect jump anyway (because of
the lack of absolute jumps in the instruction set), this looks like the
natural way to do it.

That also allows the kernel to move around the SYSINFO page at will, and
even makes it possible to avoid it altogether (ie this will solve the
inevitable problems with UML - UML just wouldn't set AT_SYSINFO, so user
level just wouldn't even _try_ to use it).

With that, there's nothing "special" about the vsyscall page, and I'd just
go back to having the very last page unmapped (and have the vsyscall page
in some other fixmap location that might even depend on kernel
configuration).

Whaddaya think?

		Linus

^ permalink raw reply	[flat|nested] 268+ messages in thread

* Re: Intel P6 vs P7 system call performance
  2002-12-17 17:55                   ` Linus Torvalds
@ 2002-12-17 18:24                     ` Linus Torvalds
  2002-12-17 18:33                       ` Ulrich Drepper
  2002-12-17 18:30                     ` Ulrich Drepper
                                       ` (2 subsequent siblings)
  3 siblings, 1 reply; 268+ messages in thread
From: Linus Torvalds @ 2002-12-17 18:24 UTC (permalink / raw)
  To: Ulrich Drepper, Matti Aarnio
  Cc: Hugh Dickins, Dave Jones, Ingo Molnar, linux-kernel, hpa



On Tue, 17 Dec 2002, Linus Torvalds wrote:
>
> Uli, how about I just add one new architecture-specific ELF AT flag, which
> is the "base of sysinfo page". Right now that page is all zeroes except
> for the system call trampoline at the beginning, but we might want to add
> other system information to the page in the future (it is readable, after
> all).

Here's the suggested (totally untested as of yet) patch:

 - it moves the system call page to 0xffffe000 instead, leaving an
   unmapped page at the very top of the address space. So trying to
   dereference -1 will cause a SIGSEGV.

 - it adds the AT_SYSINFO elf entry on x86 that points to the system page.

Thus glibc startup should be able to just do

	ptr = default_int80_syscall;
	if (AT_SYSINFO entry found)
		ptr = value(AT_SYSINFO)

and then you can just do a

	call *ptr

to do a system call regardless of kernel version. This also allows the
kernel to later move the page around as it sees fit.
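
A minimal C sketch of that startup logic, written against a plain array of (type, value) pairs rather than the real process auxiliary vector so that it stands alone. The struct and function names are invented for the example; the tag value 32 is the one proposed in the patch below, and a zero type as terminator is the standard ELF AT_NULL.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define AT_NULL_TAG     0    /* terminates the auxiliary vector */
#define AT_SYSINFO_TAG  32   /* value proposed in the patch below */

/* One auxiliary vector entry as passed to a new process. */
struct aux_entry {
    uintptr_t type;
    uintptr_t value;
};

/*
 * Return the system call entry point to use: the AT_SYSINFO value if
 * the kernel supplied one, otherwise the caller's int $0x80 fallback.
 */
static uintptr_t pick_syscall_entry(const struct aux_entry *auxv,
                                    uintptr_t int80_fallback)
{
    assert(auxv != NULL);

    for (const struct aux_entry *e = auxv; e->type != AT_NULL_TAG; e++) {
        if (e->type == AT_SYSINFO_TAG)
            return e->value;
    }
    return int80_fallback;   /* old kernel: no AT_SYSINFO entry */
}
```

On an old kernel the loop simply never finds the tag, so the fallback is chosen without any version check.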

The advantage of using an AT_SYSINFO entry is that

 - no new system call needed to figure anything out
 - backwards compatibility (ie old kernels automatically detected)
 - I think glibc already parses the AT entries at startup anyway

so it _looks_ like a perfect way to do this.

		Linus

----
===== arch/i386/kernel/entry.S 1.42 vs edited =====
--- 1.42/arch/i386/kernel/entry.S	Mon Dec 16 21:39:04 2002
+++ edited/arch/i386/kernel/entry.S	Tue Dec 17 10:13:16 2002
@@ -232,7 +232,7 @@
 #endif

 /* Points to after the "sysenter" instruction in the vsyscall page */
-#define SYSENTER_RETURN 0xfffff007
+#define SYSENTER_RETURN 0xffffe007

 	# sysenter call handler stub
 	ALIGN
===== include/asm-i386/elf.h 1.3 vs edited =====
--- 1.3/include/asm-i386/elf.h	Thu Oct 17 00:48:55 2002
+++ edited/include/asm-i386/elf.h	Tue Dec 17 10:12:58 2002
@@ -100,6 +100,12 @@

 #define ELF_PLATFORM  (system_utsname.machine)

+/*
+ * Architecture-neutral AT_ values in 0-17, leave some room
+ * for more of them, start the x86-specific ones at 32.
+ */
+#define AT_SYSINFO	32
+
 #ifdef __KERNEL__
 #define SET_PERSONALITY(ex, ibcs2) set_personality((ibcs2)?PER_SVR4:PER_LINUX)

@@ -115,6 +121,11 @@
 extern void dump_smp_unlazy_fpu(void);
 #define ELF_CORE_SYNC dump_smp_unlazy_fpu
 #endif
+
+#define ARCH_DLINFO					\
+do {							\
+		NEW_AUX_ENT(AT_SYSINFO, 0xffffe000);	\
+} while (0)

 #endif

===== include/asm-i386/fixmap.h 1.9 vs edited =====
--- 1.9/include/asm-i386/fixmap.h	Mon Dec 16 21:39:04 2002
+++ edited/include/asm-i386/fixmap.h	Tue Dec 17 10:11:31 2002
@@ -42,8 +42,8 @@
  * task switches.
  */
 enum fixed_addresses {
-	FIX_VSYSCALL,
 	FIX_HOLE,
+	FIX_VSYSCALL,
 #ifdef CONFIG_X86_LOCAL_APIC
 	FIX_APIC_BASE,	/* local (CPU) APIC) -- required for SMP or not */
 #endif
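
A rough sketch of why swapping FIX_HOLE and FIX_VSYSCALL in the enum moves the page from 0xfffff000 to 0xffffe000: fixmap slots are laid out downward from the top of the address space, one page per index. The FIXADDR_TOP value here is an assumption inferred from the addresses quoted in this thread, not copied from the kernel headers, and fix_to_virt() is a simplified stand-in for the kernel's macro.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT   12
#define FIXADDR_TOP  0xfffff000u  /* assumption, inferred from this thread */

/* Enum order after the patch: the hole comes first, then the vsyscall page. */
enum fixed_addresses_sketch {
    FIX_HOLE_SKETCH,      /* index 0: left unmapped */
    FIX_VSYSCALL_SKETCH   /* index 1: the system call page */
};

/* Virtual address of a fixmap slot: one page lower per index. */
static uint32_t fix_to_virt(unsigned int idx)
{
    assert(idx < (FIXADDR_TOP >> PAGE_SHIFT));
    return FIXADDR_TOP - ((uint32_t)idx << PAGE_SHIFT);
}
```

Under that assumption index 1 gives 0xffffe000, and adding the 7-byte offset of the return point gives the new SYSENTER_RETURN of 0xffffe007.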


^ permalink raw reply	[flat|nested] 268+ messages in thread

* Re: Intel P6 vs P7 system call performance
  2002-12-17 18:24                     ` Linus Torvalds
@ 2002-12-17 18:33                       ` Ulrich Drepper
  0 siblings, 0 replies; 268+ messages in thread
From: Ulrich Drepper @ 2002-12-17 18:33 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Matti Aarnio, Hugh Dickins, Dave Jones, Ingo Molnar, linux-kernel,
	hpa

Linus Torvalds wrote:

> Thus glibc startup should be able to just do
> 
> 	ptr = default_int80_syscall;
> 	if (AT_SYSINFO entry found)
> 		ptr = value(AT_SYSINFO)
> 
> and then you can just do a
> 
> 	call *ptr

This won't work, as I just wrote, but I can make something similar work.
I think the use of the TCB is the best thing to do.  Replicating the
info in all new threads' TCBs doesn't cost much and with NPTL
it's even lower cost since we reuse old TCBs.

-- 
--------------.                        ,-.            444 Castro Street
Ulrich Drepper \    ,-----------------'   \ Mountain View, CA 94041 USA
Red Hat         `--' drepper at redhat.com `---------------------------


^ permalink raw reply	[flat|nested] 268+ messages in thread

* Re: Intel P6 vs P7 system call performance
  2002-12-17 17:55                   ` Linus Torvalds
  2002-12-17 18:24                     ` Linus Torvalds
@ 2002-12-17 18:30                     ` Ulrich Drepper
  2002-12-17 19:04                       ` Linus Torvalds
  2002-12-17 19:26                       ` Alan Cox
  2002-12-17 18:39                     ` Jeff Dike
  2002-12-18  5:34                     ` Jeremy Fitzhardinge
  3 siblings, 2 replies; 268+ messages in thread
From: Ulrich Drepper @ 2002-12-17 18:30 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Matti Aarnio, Hugh Dickins, Dave Jones, Ingo Molnar, linux-kernel,
	hpa

Linus Torvalds wrote:

> Uli, how about I just add one new architecture-specific ELF AT flag, which
> is the "base of sysinfo page". Right now that page is all zeroes except
> for the system call trampoline at the beginning, but we might want to add
> other system information to the page in the future (it is readable, after
> all).
> 
> So we'd have an AT_SYSINFO entry, that with the current implementation
> would just get the value 0xfffff000. And then the glibc startup code could
> easily be backwards compatible with the suggestion I had in the previous
> email. Since we basically want to do an indirect jump anyway (because of
> the lack of absolute jumps in the instruction set), this looks like the
> natural way to do it.

Yes, I definitely think that a new AT_* value is in order and it's a
nice way to determine the address.

But it will not eliminate the problem.  Remember: the x86 (unlike x86-64)
has no PC-relative data addressing mode.  I.e., in a DSO to find a
memory location with an address I need a base register which isn't
available anymore at the time the call is made.

You have to assume that all the registers, including %ebp, are used at
the time of the call.  This makes it impossible to find a memory
location in a DSO without text relocation (i.e., making parts of the
code writable, at least for a moment).  This is time consuming and not
resource friendly.

There is one way around this and maybe it is what should be done: we
have the TLS memory available.  And since this vsyscall stuff gets
introduced after the TLS is functional it is a possibility.

The address received in AT_SYSINFO is stored in a word in the TCB
(thread control block).  Then the code to call through this is a variant
of what I posted earlier

  movl $__NR_##name, %eax
  call *%gs:-__NR_##name+TCB_OFFSET(%eax)

In case the vsyscall stuff is not available we jump to something like

   int $0x80
   ret

The address of this code is the default value of the TCB word.
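
To spell out the addressing trick in the call above: %eax already holds the system call number, so the assembler bakes TCB_OFFSET minus that number into the displacement, and the two cancel at run time. No other register is needed. The sketch below only models the arithmetic; the function names and the offset value 16 used in the usage note are made up, since the real TCB layout is not given here.

```c
#include <assert.h>
#include <stdint.h>

/* Displacement encoded in the instruction for system call number nr. */
static int32_t tcb_disp(int32_t tcb_offset, int32_t nr)
{
    return tcb_offset - nr;
}

/* Offset from the %gs base that is dereferenced when %eax holds eax. */
static int32_t effective_offset(int32_t eax, int32_t disp)
{
    return eax + disp;
}

/* The slot reached for a given call number is always tcb_offset. */
static int32_t tcb_slot_for(int32_t tcb_offset, int32_t nr)
{
    int32_t off = effective_offset(nr, tcb_disp(tcb_offset, nr));

    assert(off == tcb_offset);   /* nr cancels out */
    return off;
}
```

For example, with a hypothetical TCB_OFFSET of 16 and call number 20, the encoded displacement is -4 and the word read is still the one at offset 16.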

There is another thing we might want to consider.  The above code jumps
to 0xfffff000 or whatever address is specified.  I.e., the
demultiplexing happens in the kernel.  Do we want to do this at
userlevel?  This would allow almost no-cost determination of those
syscalls which can be handled at userlevel (getpid, getppid, ...).

-- 
--------------.                        ,-.            444 Castro Street
Ulrich Drepper \    ,-----------------'   \ Mountain View, CA 94041 USA
Red Hat         `--' drepper at redhat.com `---------------------------

^ permalink raw reply	[flat|nested] 268+ messages in thread

* Re: Intel P6 vs P7 system call performance
  2002-12-17 18:30                     ` Ulrich Drepper
@ 2002-12-17 19:04                       ` Linus Torvalds
  2002-12-17 19:19                         ` Ulrich Drepper
  2002-12-17 19:28                         ` Linus Torvalds
  2002-12-17 19:26                       ` Alan Cox
  1 sibling, 2 replies; 268+ messages in thread
From: Linus Torvalds @ 2002-12-17 19:04 UTC (permalink / raw)
  To: Ulrich Drepper
  Cc: Matti Aarnio, Hugh Dickins, Dave Jones, Ingo Molnar, linux-kernel,
	hpa

On Tue, 17 Dec 2002, Ulrich Drepper wrote:
>
> But it will not eliminate the problem.  Remember: the x86 (unlike x86-64)
> has no PC-relative data addressing mode.  I.e., in a DSO to find a
> memory location with an address I need a base register which isn't
> available anymore at the time the call is made.

Actually, I see a more serious problem with the current "syscall"
interface: it doesn't allow six-argument system calls AT ALL, since it
needed %ebp to keep the stack pointer.

So a six-argument system call _has_ to use "int $0x80" anyway, which to
some degree simplifies your problem: you can only use the indirect call
approach for things where %ebp will be free for use anyway.

So then you can use %ebp as the indirection, and the code will look
something like

games, since that is guaranteed not to be ever used by a system call (it
wasn't guaranteed before, but since the sysenter really needs something to
hold the stack pointer I made %ebp do that, so there's no way we can ever
use %ebp for system calls on x86).

So you _can_ do something like this:

	syscall_with_5_args:
		pushl %ebx
		pushl %esi
		pushl %edi
		pushl %ebp
		movl syscall_ptr + GOT,%ebp	// uses DSO ptr in %ebx or whatever
		movl $__NR_xxxxxx,%eax
		movl 20(%esp),%ebx
		movl 24(%esp),%ecx
		movl 28(%esp),%edx
		movl 32(%esp),%esi
		movl 36(%esp),%edi
		call *%ebp
		.. test for errno if needed ..
		popl %ebp
		popl %edi
		popl %esi
		popl %ebx
		ret
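
The 20(%esp) through 36(%esp) offsets in the stub above come from the four saved registers plus the return address sitting between %esp and the first argument. A small sketch of that bookkeeping, assuming 4-byte stack slots and the usual i386 C calling convention (the function name is invented):

```c
#include <assert.h>

enum { WORD_BYTES = 4, MAX_SYSCALL_ARGS = 6 };

/*
 * Offset from %esp of argument arg_index (0-based) after the stub has
 * pushed saved_regs registers. The extra 1 accounts for the return
 * address pushed by the caller's call instruction.
 */
static unsigned int arg_offset(unsigned int saved_regs, unsigned int arg_index)
{
    assert(arg_index < MAX_SYSCALL_ARGS);
    return WORD_BYTES * (saved_regs + 1u + arg_index);
}
```

With four pushes this yields 20, 24, 28, 32 and 36 for the five arguments, matching the movl instructions in the stub.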

> You have to assume that all the registers, including %ebp, are used at
> the time of the call.

See above why this isn't possible right now anyway.

Hmm.. Which system calls have all six parameters? I'll have to see if I
can find any way to make those use the new interface too.

In the meantime, I do agree with you that the TLS approach should work
too, and might be better. It will allow all six arguments to be used if we
just find a good calling convention (too bad sysenter is such a pig of an
instruction, it's really not very well designed since it loses
information).

			Linus

^ permalink raw reply	[flat|nested] 268+ messages in thread

* Re: Intel P6 vs P7 system call performance
  2002-12-17 19:04                       ` Linus Torvalds
@ 2002-12-17 19:19                         ` Ulrich Drepper
  2002-12-17 19:28                         ` Linus Torvalds
  1 sibling, 0 replies; 268+ messages in thread
From: Ulrich Drepper @ 2002-12-17 19:19 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Matti Aarnio, Hugh Dickins, Dave Jones, Ingo Molnar, linux-kernel,
	hpa

Linus Torvalds wrote:

> In the meantime, I do agree with you that the TLS approach should work
> too, and might be better. It will allow all six arguments to be used if we
> just find a good calling convention

If you push out the AT_* patch I'll hack the glibc bits (probably the
TLS variant).  Won't take too long, you'll get results this afternoon.

What about AMD's instruction?  Is it as flawed as sysenter?  If not and
%ebp is available I really should use the TLS method.

-- 
--------------.                        ,-.            444 Castro Street
Ulrich Drepper \    ,-----------------'   \ Mountain View, CA 94041 USA
Red Hat         `--' drepper at redhat.com `---------------------------



* Re: Intel P6 vs P7 system call performance
  2002-12-17 19:04                       ` Linus Torvalds
  2002-12-17 19:19                         ` Ulrich Drepper
@ 2002-12-17 19:28                         ` Linus Torvalds
  2002-12-17 19:32                           ` H. Peter Anvin
  2002-12-17 19:53                           ` Ulrich Drepper
  1 sibling, 2 replies; 268+ messages in thread
From: Linus Torvalds @ 2002-12-17 19:28 UTC (permalink / raw)
  To: Ulrich Drepper
  Cc: Matti Aarnio, Hugh Dickins, Dave Jones, Ingo Molnar, linux-kernel,
	hpa

On Tue, 17 Dec 2002, Linus Torvalds wrote:
>
> Hmm.. Which system calls have all six parameters? I'll have to see if I
> can find any way to make those use the new interface too.

The only ones I found from a quick grep are
 - sys_recvfrom
 - sys_sendto
 - sys_mmap2()
 - sys_ipc()

and none of them are of a kind where the system call entry itself is the
biggest performance issue (and sys_ipc() is deprecated anyway), so it's
probably acceptable to just use the old interface for them.

One other alternative is to change the calling convention for the
new-style system call, and not have arguments in registers at all. We
could make the interface something like

 - %eax contains system call number
 - %edx contains pointer to argument block
 - call *syscallptr	// trashes all registers

and then the old "compatibility" function would be something like

	movl 0(%edx),%ebx
	movl 4(%edx),%ecx
	movl 12(%edx),%esi
	movl 16(%edx),%edi
	movl 20(%edx),%ebp
	movl 8(%edx),%edx
	int $0x80
	ret

while the "sysenter" interface would do the loads from kernel space.

That would make some things easier, but the problem with this approach is
that if you have a single-argument system call, and you just pass in the
stack pointer offset in %edx directly, then the system call stubs will
always load 6 arguments, and if we're just at the end of the stack it
won't actually _work_. So part of the calling convention would have to be
the guarantee that there is stack-space available (should always be true
in practice, of course).
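
The stack-space caveat can be seen in a small C model of the argument-block
idea (a sketch, not kernel code; regs_model, compat_load and one_arg_call
are hypothetical names). The compatibility stub reads all six slots no
matter how many arguments the call really has, so the caller must always
hand over six readable words.

```c
#include <assert.h>
#include <stddef.h>

enum { NR_SLOTS = 6 };   /* offsets 0, 4, ..., 20 on i386 */

/* Hypothetical stand-in for the registers the old entry expects. */
struct regs_model {
    long ebx, ecx, edx, esi, edi, ebp;
};

/* Mirrors the compatibility stub: every slot is read, used or not. */
static struct regs_model compat_load(const long *block)
{
    struct regs_model r;

    assert(block != NULL);
    r.ebx = block[0];
    r.ecx = block[1];
    r.esi = block[3];
    r.edi = block[4];
    r.ebp = block[5];
    r.edx = block[2];   /* loaded last: %edx held the block pointer */
    return r;
}

/* A caller with one real argument still supplies six readable slots. */
static struct regs_model one_arg_call(long arg0)
{
    long block[NR_SLOTS] = { arg0 };   /* remaining slots are zero-filled */

    return compat_load(block);
}
```

Passing a pointer to a single word instead of a six-slot block would make
compat_load() read past the end, which is the failure described above.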

			Linus


* Re: Intel P6 vs P7 system call performance
  2002-12-17 19:28                         ` Linus Torvalds
@ 2002-12-17 19:32                           ` H. Peter Anvin
  2002-12-17 19:44                             ` Linus Torvalds
  2002-12-17 19:53                           ` Ulrich Drepper
  1 sibling, 1 reply; 268+ messages in thread
From: H. Peter Anvin @ 2002-12-17 19:32 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ulrich Drepper, Matti Aarnio, Hugh Dickins, Dave Jones,
	Ingo Molnar, linux-kernel

Linus Torvalds wrote:
> 
> On Tue, 17 Dec 2002, Linus Torvalds wrote:
> 
>>Hmm.. Which system calls have all six parameters? I'll have to see if I
>>can find any way to make those use the new interface too.
> 
> 
> The only ones I found from a quick grep are
>  - sys_recvfrom
>  - sys_sendto
>  - sys_mmap2()
>  - sys_ipc()
> 
> and none of them are of a kind where the system call entry itself is the
> biggest performance issue (and sys_ipc() is deprecated anyway), so it's
> probably acceptable to just use the old interface for them.
> 

recvfrom() and sendto() can also be implemented as recvmsg() and sendmsg()
if one really wants to.
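
A minimal sketch of that substitution in C, using only the standard
sendmsg()/recvmsg() calls; the wrapper names are made up. The six
sendto()/recvfrom() arguments are folded into a struct msghdr, so the call
that reaches the kernel takes three arguments.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>

static ssize_t sendto_via_sendmsg(int fd, const void *buf, size_t len,
                                  int flags, const struct sockaddr *addr,
                                  socklen_t addrlen)
{
    struct iovec iov;
    struct msghdr msg;

    assert(buf != NULL || len == 0);
    iov.iov_base = (void *)buf;     /* sendmsg() does not modify the data */
    iov.iov_len = len;

    memset(&msg, 0, sizeof(msg));
    msg.msg_name = (void *)addr;    /* NULL for connected sockets */
    msg.msg_namelen = addr ? addrlen : 0;
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;

    return sendmsg(fd, &msg, flags);
}

static ssize_t recvfrom_via_recvmsg(int fd, void *buf, size_t len,
                                    int flags, struct sockaddr *addr,
                                    socklen_t *addrlen)
{
    struct iovec iov;
    struct msghdr msg;
    ssize_t n;

    iov.iov_base = buf;
    iov.iov_len = len;

    memset(&msg, 0, sizeof(msg));
    msg.msg_name = addr;            /* the kernel writes the peer address here */
    msg.msg_namelen = (addr && addrlen) ? *addrlen : 0;
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;

    n = recvmsg(fd, &msg, flags);
    if (n >= 0 && addr && addrlen)
        *addrlen = msg.msg_namelen; /* report the actual address length */
    return n;
}
```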

What one can also do is that a sixth argument, if one exists, is passed
on the stack (i.e. in (%ebp), since %ebp contains the stack pointer.)

	-hpa



* Re: Intel P6 vs P7 system call performance
  2002-12-17 19:32                           ` H. Peter Anvin
@ 2002-12-17 19:44                             ` Linus Torvalds
  0 siblings, 0 replies; 268+ messages in thread
From: Linus Torvalds @ 2002-12-17 19:44 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Ulrich Drepper, Matti Aarnio, Hugh Dickins, Dave Jones,
	Ingo Molnar, linux-kernel

On Tue, 17 Dec 2002, H. Peter Anvin wrote:
>
> What one can also do is that a sixth argument, if one exists, is passed
> on the stack (i.e. in (%ebp), since %ebp contains the stack pointer.)

I like this. I will make it so. It will allow the old calling conventions
and has none of the stack size issues that my "memory block" approach had.

Also, this will all be done inside the wrapper and is thus entirely
invisible to the caller. Good, this solves the six-arg case nicely.

		Linus


* Re: Intel P6 vs P7 system call performance
  2002-12-17 19:28                         ` Linus Torvalds
  2002-12-17 19:32                           ` H. Peter Anvin
@ 2002-12-17 19:53                           ` Ulrich Drepper
  2002-12-17 20:01                             ` Linus Torvalds
  1 sibling, 1 reply; 268+ messages in thread
From: Ulrich Drepper @ 2002-12-17 19:53 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Matti Aarnio, Hugh Dickins, Dave Jones, Ingo Molnar, linux-kernel,
	hpa

Linus Torvalds wrote:
> 

> The only ones I found from a quick grep are
>  - sys_recvfrom
>  - sys_sendto
>  - sys_mmap2()
>  - sys_ipc()

All but mmap2 do not use 6 parameters.  They are implemented via the
sys_ipc multiplexer which takes the stack pointer as an argument which
then helps to locate the parameters.


-- 
--------------.                        ,-.            444 Castro Street
Ulrich Drepper \    ,-----------------'   \ Mountain View, CA 94041 USA
Red Hat         `--' drepper at redhat.com `---------------------------



* Re: Intel P6 vs P7 system call performance
  2002-12-17 19:53                           ` Ulrich Drepper
@ 2002-12-17 20:01                             ` Linus Torvalds
  2002-12-17 20:17                               ` Ulrich Drepper
  2002-12-18  4:15                               ` Linus Torvalds
  0 siblings, 2 replies; 268+ messages in thread
From: Linus Torvalds @ 2002-12-17 20:01 UTC (permalink / raw)
  To: Ulrich Drepper
  Cc: Matti Aarnio, Hugh Dickins, Dave Jones, Ingo Molnar, linux-kernel,
	hpa


How about this diff? It does both the 6-parameter thing _and_ the
AT_SYSINFO addition. Untested, since I have to run off and watch my kids
do their winter program ;)

		Linus

-----
===== arch/i386/kernel/entry.S 1.42 vs edited =====
--- 1.42/arch/i386/kernel/entry.S	Mon Dec 16 21:39:04 2002
+++ edited/arch/i386/kernel/entry.S	Tue Dec 17 11:59:13 2002
@@ -232,7 +232,7 @@
 #endif

 /* Points to after the "sysenter" instruction in the vsyscall page */
-#define SYSENTER_RETURN 0xfffff007
+#define SYSENTER_RETURN 0xffffe007

 	# sysenter call handler stub
 	ALIGN
@@ -243,6 +243,21 @@
 	pushfl
 	pushl $(__USER_CS)
 	pushl $SYSENTER_RETURN
+
+/*
+ * Load the potential sixth argument from user stack.
+ * Careful about security.
+ */
+	cmpl $0xc0000000,%ebp
+	jae syscall_badsys
+1:	movl (%ebp),%ebp
+.section .fixup,"ax"
+2:	xorl %ebp,%ebp
+.previous
+.section __ex_table,"a"
+	.align 4
+	.long 1b,2b
+.previous

 	pushl %eax
 	SAVE_ALL
===== arch/i386/kernel/sysenter.c 1.1 vs edited =====
--- 1.1/arch/i386/kernel/sysenter.c	Mon Dec 16 21:39:04 2002
+++ edited/arch/i386/kernel/sysenter.c	Tue Dec 17 11:39:39 2002
@@ -48,14 +48,14 @@
 		0xc3			/* ret */
 	};
 	static const char sysent[] = {
-		0x55,			/* push %ebp */
 		0x51,			/* push %ecx */
 		0x52,			/* push %edx */
+		0x55,			/* push %ebp */
 		0x89, 0xe5,		/* movl %esp,%ebp */
 		0x0f, 0x34,		/* sysenter */
+		0x5d,			/* pop %ebp */
 		0x5a,			/* pop %edx */
 		0x59,			/* pop %ecx */
-		0x5d,			/* pop %ebp */
 		0xc3			/* ret */
 	};
 	unsigned long page = get_zeroed_page(GFP_ATOMIC);
===== include/asm-i386/elf.h 1.3 vs edited =====
--- 1.3/include/asm-i386/elf.h	Thu Oct 17 00:48:55 2002
+++ edited/include/asm-i386/elf.h	Tue Dec 17 10:12:58 2002
@@ -100,6 +100,12 @@

 #define ELF_PLATFORM  (system_utsname.machine)

+/*
+ * Architecture-neutral AT_ values in 0-17, leave some room
+ * for more of them, start the x86-specific ones at 32.
+ */
+#define AT_SYSINFO	32
+
 #ifdef __KERNEL__
 #define SET_PERSONALITY(ex, ibcs2) set_personality((ibcs2)?PER_SVR4:PER_LINUX)

@@ -115,6 +121,11 @@
 extern void dump_smp_unlazy_fpu(void);
 #define ELF_CORE_SYNC dump_smp_unlazy_fpu
 #endif
+
+#define ARCH_DLINFO					\
+do {							\
+		NEW_AUX_ENT(AT_SYSINFO, 0xffffe000);	\
+} while (0)

 #endif

===== include/asm-i386/fixmap.h 1.9 vs edited =====
--- 1.9/include/asm-i386/fixmap.h	Mon Dec 16 21:39:04 2002
+++ edited/include/asm-i386/fixmap.h	Tue Dec 17 10:11:31 2002
@@ -42,8 +42,8 @@
  * task switches.
  */
 enum fixed_addresses {
-	FIX_VSYSCALL,
 	FIX_HOLE,
+	FIX_VSYSCALL,
 #ifdef CONFIG_X86_LOCAL_APIC
 	FIX_APIC_BASE,	/* local (CPU) APIC) -- required for SMP or not */
 #endif
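
For illustration, this is roughly how user space could pick up the new aux
vector entry. It is a sketch under assumptions: getauxval() is a glibc
convenience (2.16 and later) that did not exist when this patch was
written, and find_sysinfo_entry is a hypothetical name. On systems whose
aux vector carries no AT_SYSINFO entry the lookup yields 0 and the caller
has to keep using the old entry path.

```c
#include <assert.h>
#include <elf.h>        /* AT_SYSINFO */
#include <stdint.h>
#include <sys/auxv.h>   /* getauxval() */

#ifndef AT_SYSINFO
#define AT_SYSINFO 32   /* value from the patch above */
#endif

typedef void (*sysinfo_entry_t)(void);

/* Returns the kernel-supplied entry point, or NULL when the aux vector has
 * no AT_SYSINFO entry. The pointer is only looked up here, never called. */
static sysinfo_entry_t find_sysinfo_entry(void)
{
    unsigned long base = getauxval(AT_SYSINFO);

    assert(AT_SYSINFO == 32);
    return (sysinfo_entry_t)(uintptr_t)base;
}
```

Because the address comes from the aux vector, nothing in user space needs
to hard-code 0xffffe000.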



* Re: Intel P6 vs P7 system call performance
  2002-12-17 20:01                             ` Linus Torvalds
@ 2002-12-17 20:17                               ` Ulrich Drepper
  2002-12-18  4:15                                 ` Linus Torvalds
  2002-12-18  4:15                               ` Linus Torvalds
  1 sibling, 1 reply; 268+ messages in thread
From: Ulrich Drepper @ 2002-12-17 20:17 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Matti Aarnio, Hugh Dickins, Dave Jones, Ingo Molnar, linux-kernel,
	hpa

Linus Torvalds wrote:

> ===== arch/i386/kernel/sysenter.c 1.1 vs edited =====
> --- 1.1/arch/i386/kernel/sysenter.c	Mon Dec 16 21:39:04 2002
> +++ edited/arch/i386/kernel/sysenter.c	Tue Dec 17 11:39:39 2002
> @@ -48,14 +48,14 @@
>  		0xc3			/* ret */
>  	};
>  	static const char sysent[] = {
> -		0x55,			/* push %ebp */
>  		0x51,			/* push %ecx */
>  		0x52,			/* push %edx */
> +		0x55,			/* push %ebp */
>  		0x89, 0xe5,		/* movl %esp,%ebp */
>  		0x0f, 0x34,		/* sysenter */
> +		0x5d,			/* pop %ebp */
>  		0x5a,			/* pop %edx */
>  		0x59,			/* pop %ecx */
> -		0x5d,			/* pop %ebp */
>  		0xc3			/* ret */

Instead of duplicating the push/pop %ebp just use the first one by using

  movl 12(%ebp), %ebp

in the kernel code or remove the first.  The latter is better, smaller code.

-- 
--------------.                        ,-.            444 Castro Street
Ulrich Drepper \    ,-----------------'   \ Mountain View, CA 94041 USA
Red Hat         `--' drepper at redhat.com `---------------------------



* Re: Intel P6 vs P7 system call performance
  2002-12-17 20:17                               ` Ulrich Drepper
@ 2002-12-18  4:15                                 ` Linus Torvalds
  0 siblings, 0 replies; 268+ messages in thread
From: Linus Torvalds @ 2002-12-18  4:15 UTC (permalink / raw)
  To: Ulrich Drepper
  Cc: Matti Aarnio, Hugh Dickins, Dave Jones, Ingo Molnar, linux-kernel,
	hpa

On Tue, 17 Dec 2002, Ulrich Drepper wrote:

> > -		0x55,			/* push %ebp */
> > +		0x55,			/* push %ebp */
> > +		0x5d,			/* pop %ebp */
> > -		0x5d,			/* pop %ebp */
>
> Instead of duplicating the push/pop %ebp just use the first one by using

No, it's not duplicating it. Look closer. It's just _moving_ it, so that
the old %ebp value will naturally be pointed to by %esp, which is what we
want.

Anyway, I reverted the %ebp games from my kernel, because they are
fundamentally not restartable and thus not really a good idea. Besides, it
might be wrong to try to optimize the fast system calls to handle six
arguments too, if that makes the (much more common) other system
calls slower. So the six-argument case might as well just continue to use
"int 0x80".

		Linus


* Re: Intel P6 vs P7 system call performance
  2002-12-17 20:01                             ` Linus Torvalds
  2002-12-17 20:17                               ` Ulrich Drepper
@ 2002-12-18  4:15                               ` Linus Torvalds
  2002-12-18  4:39                                 ` H. Peter Anvin
                                                   ` (2 more replies)
  1 sibling, 3 replies; 268+ messages in thread
From: Linus Torvalds @ 2002-12-18  4:15 UTC (permalink / raw)
  To: Ulrich Drepper
  Cc: Matti Aarnio, Hugh Dickins, Dave Jones, Ingo Molnar, linux-kernel,
	hpa

On Tue, 17 Dec 2002, Linus Torvalds wrote:
>
> How about this diff? It does both the 6-parameter thing _and_ the
> AT_SYSINFO addition.

The 6-parameter thing is broken. It's clever, but playing games with %ebp
is not going to work with restarting of the system call - we need to
restart with the proper %ebp.

I pushed out the AT_SYSINFO stuff, but we're back to the "needs to use
'int $0x80' for system calls that take 6 arguments" drawing board.

The only sane way I see to fix the %ebp problem is to actually expand the
kernel "struct pt_regs" to have separate "ebp" and "arg6" fields (so that
we can re-start with the right ebp, and have arg6 as the right argument on
the stack). That would work but is not really worth it.

		Linus


* Re: Intel P6 vs P7 system call performance
  2002-12-18  4:15                               ` Linus Torvalds
@ 2002-12-18  4:39                                 ` H. Peter Anvin
  2002-12-18  4:49                                   ` Linus Torvalds
  2002-12-18 13:17                                 ` Richard B. Johnson
  2002-12-18 13:40                                 ` Horst von Brand
  2 siblings, 1 reply; 268+ messages in thread
From: H. Peter Anvin @ 2002-12-18  4:39 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ulrich Drepper, Matti Aarnio, Hugh Dickins, Dave Jones,
	Ingo Molnar, linux-kernel

Linus Torvalds wrote:
> On Tue, 17 Dec 2002, Linus Torvalds wrote:
> 
>>How about this diff? It does both the 6-parameter thing _and_ the
>>AT_SYSINFO addition.
> 
> 
> The 6-parameter thing is broken. It's clever, but playing games with %ebp
> is not going to work with restarting of the system call - we need to
> restart with the proper %ebp.
> 

This confuses me -- there seems to be no reason this shouldn't work as 
long as %esp == %ebp on sysexit.  The SYSEXIT-trashed GPRs seem like a 
bigger problem.

	-hpa




* Re: Intel P6 vs P7 system call performance
  2002-12-18  4:39                                 ` H. Peter Anvin
@ 2002-12-18  4:49                                   ` Linus Torvalds
  2002-12-18  6:38                                     ` Linus Torvalds
  0 siblings, 1 reply; 268+ messages in thread
From: Linus Torvalds @ 2002-12-18  4:49 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Ulrich Drepper, Matti Aarnio, Hugh Dickins, Dave Jones,
	Ingo Molnar, linux-kernel

On Tue, 17 Dec 2002, H. Peter Anvin wrote:
>
> This confuses me -- there seems to be no reason this shouldn't work as
> long as %esp == %ebp on sysexit.  The SYSEXIT-trashed GPRs seem like a
> bigger problem.

The thing is, the argument save area == the kernel stack frame. This is
part of the reason why Linux has very fast system calls - there is
absolutely _zero_ extraneous setup. No argument fetching and marshalling,
it's all part of just setting up the regular kernel stack.

So to get the right argument in arg6, the argument _needs_ to be saved in
the %ebp entry on the kernel stack. Which means that on return from the
system call (which may not actually be through a "sysenter" at all, if
signals happen it will go through the generic paths), %ebp will have been
updated as part of the kernel stack unwinding.

Which is ok for a regular fast system call (ebp will get restored
immediately), but it is NOT ok for the system call restart case, since in
that case we want %ebp to contain the old stack pointer, not the sixth
argument.

If we just save the stack pointer value (== the initial %ebp value), the
right thing will get restored, but then system calls will see the stack
pointer value as arg6 - because of the 1:1 relationship between arguments
and stack save.
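
The conflict can be modelled in a few lines of C (a deliberately simplified
sketch; frame_model and the helper names are hypothetical, and the real
saved-register frame has more fields). One slot has to serve both as the
sixth argument and as the %ebp value handed back to user space.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for the saved-register frame: the first six slots
 * double as the six system call arguments. */
struct frame_model {
    uint32_t ebx, ecx, edx, esi, edi, ebp;
};

/* Overwrite the %ebp slot with the sixth argument. The handler now sees
 * arg6, but the user stack pointer is no longer in the frame. */
static void save_arg6_in_ebp_slot(struct frame_model *f, uint32_t arg6)
{
    assert(f != NULL);
    f->ebp = arg6;
}

/* What the handler receives as its sixth argument. */
static uint32_t handler_arg6(const struct frame_model *f)
{
    return f->ebp;
}

/* What user space gets back in %ebp when the frame is unwound. */
static uint32_t ebp_after_unwind(const struct frame_model *f)
{
    return f->ebp;
}
```

Whichever value is stored, handler_arg6() and ebp_after_unwind() return the
same thing, so one of the two consumers always gets the wrong value.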

		Linus


* Re: Intel P6 vs P7 system call performance
  2002-12-18  4:49                                   ` Linus Torvalds
@ 2002-12-18  6:38                                     ` Linus Torvalds
  0 siblings, 0 replies; 268+ messages in thread
From: Linus Torvalds @ 2002-12-18  6:38 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Ulrich Drepper, Matti Aarnio, Hugh Dickins, Dave Jones,
	Ingo Molnar, linux-kernel

On Tue, 17 Dec 2002, Linus Torvalds wrote:
>
> Which is ok for a regular fast system call (ebp will get restored
> immediately), but it is NOT ok for the system call restart case, since in
> that case we want %ebp to contain the old stack pointer, not the sixth
> argument.

I came up with an absolutely wonderfully _disgusting_ solution for this.

The thing to realize on how to solve this is that since "sysenter" loses
track of EIP, there's really no real reason to try to return directly
after the "sysenter" instruction anyway. The return point is really
totally arbitrary, after all.

Now, couple this with the fact that system call restarting will always
just subtract two from the "return point" aka saved EIP value (that's the
size of an "int 0x80" instruction), and what you can do is to make the
kernel point the sysexit return point not at just past the "sysenter", but
instead make it point to just past a totally unrelated 2-byte jump
instruction.

With that in mind, I made the sysentry trampoline look like this:

        static const char sysent[] = {
                0x51,                   /* push %ecx */
                0x52,                   /* push %edx */
                0x55,                   /* push %ebp */
                0x89, 0xe5,             /* movl %esp,%ebp */
                0x0f, 0x34,             /* sysenter */
        /* System call restart point is here! (SYSENTER_RETURN - 2) */
                0xeb, 0xfa,             /* jmp to "movl %esp,%ebp" */
        /* System call normal return point is here! (SYSENTER_RETURN in entry.S) */
                0x5d,                   /* pop %ebp */
                0x5a,                   /* pop %edx */
                0x59,                   /* pop %ecx */
                0xc3                    /* ret */
        };

which does the right thing for a "restarted" system call (ie when it
restarts, it won't re-do just the sysenter instruction, it will really
restart at the backwards jump, and thus re-start the "movl %esp,%ebp"
too).

Which means that now the kernel can happily trash %ebp as part of the
sixth argument setup, since system call restarting will re-initialize it
to point to the user-level stack that we need in %ebp because otherwise it
gets totally lost.
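
The offsets in the trampoline can be checked mechanically. The C sketch
below copies the bytes quoted above and decodes the short jump; the helper
names and the OFF_* constants are illustrative, not taken from the kernel
source.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* The trampoline bytes exactly as quoted above. */
static const unsigned char sysent[] = {
    0x51,           /* push %ecx */
    0x52,           /* push %edx */
    0x55,           /* push %ebp */
    0x89, 0xe5,     /* movl %esp,%ebp */
    0x0f, 0x34,     /* sysenter */
    0xeb, 0xfa,     /* jmp rel8 */
    0x5d,           /* pop %ebp */
    0x5a,           /* pop %edx */
    0x59,           /* pop %ecx */
    0xc3            /* ret */
};

enum {
    OFF_MOVL = 3,           /* offset of "movl %esp,%ebp" */
    OFF_JMP = 7,            /* offset of the two-byte jmp */
    OFF_NORMAL_RETURN = 9   /* offset of "pop %ebp" */
};

/* Target of a short jmp (opcode 0xeb): offset of the next instruction plus
 * the sign-extended 8-bit displacement. */
static size_t short_jmp_target(const unsigned char *code, size_t jmp_off)
{
    assert(code[jmp_off] == 0xeb);
    return (size_t)((long)(jmp_off + 2) + (int8_t)code[jmp_off + 1]);
}

/* Restart rewinds the saved EIP by the size of "int $0x80" (2 bytes). */
static size_t restart_offset(size_t return_off)
{
    return return_off - 2;
}
```

With these bytes, rewinding the normal return point by two lands on the
jmp, and the jmp (displacement 0xfa, i.e. -6) targets "movl %esp,%ebp".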

I'm a disgusting pig, and proud of it to boot.

			Linus


* Re: Intel P6 vs P7 system call performance
  2002-12-18  4:15                               ` Linus Torvalds
  2002-12-18  4:39                                 ` H. Peter Anvin
@ 2002-12-18 13:17                                 ` Richard B. Johnson
  2002-12-18 13:40                                 ` Horst von Brand
  2 siblings, 0 replies; 268+ messages in thread
From: Richard B. Johnson @ 2002-12-18 13:17 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ulrich Drepper, Matti Aarnio, Hugh Dickins, Dave Jones,
	Ingo Molnar, linux-kernel, hpa

On Tue, 17 Dec 2002, Linus Torvalds wrote:

> 
> On Tue, 17 Dec 2002, Linus Torvalds wrote:
> >
> > How about this diff? It does both the 6-parameter thing _and_ the
> > AT_SYSINFO addition.
> 
> The 6-parameter thing is broken. It's clever, but playing games with %ebp
> is not going to work with restarting of the system call - we need to
> restart with the proper %ebp.
> 
> I pushed out the AT_SYSINFO stuff, but we're back to the "needs to use
> 'int $0x80' for system calls that take 6 arguments" drawing board.
> 
> The only sane way I see to fix the %ebp problem is to actually expand the
> kernel "struct pt_regs" to have separate "ebp" and "arg6" fields (so that
> we can re-start with the right ebp, and have arg6 as the right argument on
> the stack). That would work but is not really worth it.
> 
> 		Linus
> 

How about for the new interface, a one-parameter arg, i.e., a pointer
to a descriptor (structure)?? For the typical one-argument call, i.e.,
getpid(), it's just one de-reference. The pointer register can be
EAX on Intel, a register normally available in a 'C' call.


Cheers,
Dick Johnson
Penguin : Linux version 2.4.18 on an i686 machine (797.90 BogoMips).
Why is the government concerned about the lunatic fringe? Think about it.




* Re: Intel P6 vs P7 system call performance
  2002-12-18  4:15                               ` Linus Torvalds
  2002-12-18  4:39                                 ` H. Peter Anvin
  2002-12-18 13:17                                 ` Richard B. Johnson
@ 2002-12-18 13:40                                 ` Horst von Brand
  2002-12-18 13:47                                   ` Sean Neakums
                                                     ` (2 more replies)
  2 siblings, 3 replies; 268+ messages in thread
From: Horst von Brand @ 2002-12-18 13:40 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-kernel

[Extremely interesting new syscall mechanism thread elided]

What happened to "feature freeze"?
-- 
Dr. Horst H. von Brand                   User #22616 counter.li.org
Departamento de Informatica                     Fono: +56 32 654431
Universidad Tecnica Federico Santa Maria              +56 32 654239
Casilla 110-V, Valparaiso, Chile                Fax:  +56 32 797513


* Re: Intel P6 vs P7 system call performance
  2002-12-18 13:40                                 ` Horst von Brand
@ 2002-12-18 13:47                                   ` Sean Neakums
  2002-12-18 14:10                                     ` Horst von Brand
  2002-12-18 15:52                                   ` Alan Cox
  2002-12-18 16:41                                   ` Dave Jones
  2 siblings, 1 reply; 268+ messages in thread
From: Sean Neakums @ 2002-12-18 13:47 UTC (permalink / raw)
  To: linux-kernel

commence  Horst von Brand quotation:

> [Extremely interesting new syscall mechanism thread elided]
>
> What happened to "feature freeze"?

How are system calls a new feature?  Or is optimizing an existing
feature not allowed by your definition of "feature freeze"?

-- 
 /                          |
[|] Sean Neakums            |  Questions are a burden to others;
[|] <sneakums@zork.net>     |      answers a prison for oneself.
 \                          |


* Re: Intel P6 vs P7 system call performance
  2002-12-18 13:47                                   ` Sean Neakums
@ 2002-12-18 14:10                                     ` Horst von Brand
  2002-12-18 14:51                                       ` dada1
  2002-12-18 19:12                                       ` Mark Mielke
  0 siblings, 2 replies; 268+ messages in thread
From: Horst von Brand @ 2002-12-18 14:10 UTC (permalink / raw)
  To: linux-kernel

Sean Neakums <sneakums@zork.net> said:
> commence  Horst von Brand quotation:
> 
> > [Extremely interesting new syscall mechanism thread elided]
> >
> > What happened to "feature freeze"?
> 
> How are system calls a new feature?  Or is optimizing an existing
> feature not allowed by your definition of "feature freeze"?

This "optimizing" is very much userspace-visible, and a radical change in
an interface this fundamental counts as a new feature in my book.
-- 
Dr. Horst H. von Brand                   User #22616 counter.li.org
Departamento de Informatica                     Fono: +56 32 654431
Universidad Tecnica Federico Santa Maria              +56 32 654239
Casilla 110-V, Valparaiso, Chile                Fax:  +56 32 797513


* Re: Intel P6 vs P7 system call performance
  2002-12-18 14:10                                     ` Horst von Brand
@ 2002-12-18 14:51                                       ` dada1
  2002-12-18 19:12                                       ` Mark Mielke
  1 sibling, 0 replies; 268+ messages in thread
From: dada1 @ 2002-12-18 14:51 UTC (permalink / raw)
  To: linux-kernel, Horst von Brand

From: "Horst von Brand" <vonbrand@inf.utfsm.cl>
> > How are system calls a new feature?  Or is optimizing an existing
> > feature not allowed by your definition of "feature freeze"?
>
> This "optimizing" is very much userspace-visible, and a radical change in
> an interface this fundamental counts as a new feature in my book.

Since int 0x80 is supported and will be supported for the next 20 years, I
don't think this is a radical change.
Not userspace visible at all.
You are free to use the old way of calling the kernel...



* Re: Intel P6 vs P7 system call performance
  2002-12-18 14:10                                     ` Horst von Brand
  2002-12-18 14:51                                       ` dada1
@ 2002-12-18 19:12                                       ` Mark Mielke
  1 sibling, 0 replies; 268+ messages in thread
From: Mark Mielke @ 2002-12-18 19:12 UTC (permalink / raw)
  To: Horst von Brand; +Cc: linux-kernel

On Wed, Dec 18, 2002 at 11:10:50AM -0300, Horst von Brand wrote:
> Sean Neakums <sneakums@zork.net> said:
> > How are system calls a new feature?  Or is optimizing an existing
> > feature not allowed by your definition of "feature freeze"?
> This "optimizing" is very much userspace-visible, and a radical change in
> an interface this fundamental counts as a new feature in my book.

Since operating systems like WIN32 are at least published to take
advantage of SYSENTER, it may not be in Linux's interest to
purposefully use a slower interface until 2.8 (how long will that be
until people can use?). The last thing I want to read about in a
technical journal is how WIN32 has lower system call overhead than
Linux on IA-32 architectures. That might just be selfish of me for
the Linux community... :-)

mark

-- 
mark@mielke.cc/markm@ncf.ca/markm@nortelnetworks.com __________________________
.  .  _  ._  . .   .__    .  . ._. .__ .   . . .__  | Neighbourhood Coder
|\/| |_| |_| |/    |_     |\/|  |  |_  |   |/  |_   | 
|  | | | | \ | \   |__ .  |  | .|. |__ |__ | \ |__  | Ottawa, Ontario, Canada

  One ring to rule them all, one ring to find them, one ring to bring them all
                       and in the darkness bind them...

                           http://mark.mielke.cc/



* Re: Intel P6 vs P7 system call performance
  2002-12-18 13:40                                 ` Horst von Brand
  2002-12-18 13:47                                   ` Sean Neakums
@ 2002-12-18 15:52                                   ` Alan Cox
  2002-12-18 16:41                                   ` Dave Jones
  2 siblings, 0 replies; 268+ messages in thread
From: Alan Cox @ 2002-12-18 15:52 UTC (permalink / raw)
  To: Horst von Brand; +Cc: Linus Torvalds, Linux Kernel Mailing List

On Wed, 2002-12-18 at 13:40, Horst von Brand wrote:
> [Extremely interesting new syscall mechanism thread elided]
> 
> What happened to "feature freeze"?

I'm wondering that. 2.5.49 was usable for devel work, no kernel since
has been. It's stopped IDE getting touched until January.

Linus, you are doing the slow slide into a second round of development
work again, just like mid 2.3, just like 1.3.60, ...



* Re: Intel P6 vs P7 system call performance
  2002-12-18 13:40                                 ` Horst von Brand
  2002-12-18 13:47                                   ` Sean Neakums
  2002-12-18 15:52                                   ` Alan Cox
@ 2002-12-18 16:41                                   ` Dave Jones
  2002-12-18 18:41                                     ` Horst von Brand
  2 siblings, 1 reply; 268+ messages in thread
From: Dave Jones @ 2002-12-18 16:41 UTC (permalink / raw)
  To: Horst von Brand; +Cc: Linus Torvalds, linux-kernel

On Wed, Dec 18, 2002 at 10:40:24AM -0300, Horst von Brand wrote:
 > [Extremely interesting new syscall mechanism thread elided]
 > 
 > What happened to "feature freeze"?

*bites lip* it's fairly low impact *duck*.
Given the wins seem to be fairly impressive across the board, spending
a few days on getting this right isn't going to push 2.6 back any
noticeable amount of time.

This stuff is mostly of the case "it either works, or it doesn't".
And right now, corner cases like apm aside, it seems to be holding up
so far. This isn't as far reaching as it sounds. There are still
drivers being turned upside down which are changing things in a
lot bigger ways than this[1]

		Dave

[1] Myself being one of the guilty parties there, wrt agp.

-- 
| Dave Jones.        http://www.codemonkey.org.uk


* Re: Intel P6 vs P7 system call performance
  2002-12-18 16:41                                   ` Dave Jones
@ 2002-12-18 18:41                                     ` Horst von Brand
  0 siblings, 0 replies; 268+ messages in thread
From: Horst von Brand @ 2002-12-18 18:41 UTC (permalink / raw)
  To: Dave Jones, Horst von Brand, Linus Torvalds, linux-kernel

Dave Jones <davej@codemonkey.org.uk> said:
> On Wed, Dec 18, 2002 at 10:40:24AM -0300, Horst von Brand wrote:
>  > [Extremely interesting new syscall mechanism thread elided]
>  > 
>  > What happened to "feature freeze"?

> *bites lip* it's fairly low impact *duck*.
> Given the wins seem to be fairly impressive across the board, spending
> a few days on getting this right isn't going to push 2.6 back any
> noticeable amount of time.

Ever hear Larry McVoy's [I think, please correct me if wrong] standard
rant of how $UNIX_FROM_BIG_VENDOR sucks, one "almost unnoticeable
performance impact" feature at a time? 

Similarly, Fred Brooks tells in "The Mythical Man Month" how schedules
don't slip by months, they slip a day at a time...

> This stuff is mostly of the case "it either works, or it doesn't".
> And right now, corner cases like apm aside, it seems to be holding up
> so far. This isn't as far reaching as it sounds. There are still
> drivers being turned upside down which are changing things in a
> lot bigger ways than this[1]
> 
> 		Dave
> 
> [1] Myself being one of the guilty parties there, wrt agp.

Fixing a broken feature is in for me. Adding new features is supposed to be
out until 2.7 opens.
-- 
Dr. Horst H. von Brand                   User #22616 counter.li.org
Departamento de Informatica                     Fono: +56 32 654431
Universidad Tecnica Federico Santa Maria              +56 32 654239
Casilla 110-V, Valparaiso, Chile                Fax:  +56 32 797513


* Re: Intel P6 vs P7 system call performance
  2002-12-17 18:30                     ` Ulrich Drepper
  2002-12-17 19:04                       ` Linus Torvalds
@ 2002-12-17 19:26                       ` Alan Cox
  2002-12-17 18:57                         ` Ulrich Drepper
  1 sibling, 1 reply; 268+ messages in thread
From: Alan Cox @ 2002-12-17 19:26 UTC (permalink / raw)
  To: Ulrich Drepper
  Cc: Linus Torvalds, Matti Aarnio, Hugh Dickins, Dave Jones,
	Ingo Molnar, Linux Kernel Mailing List, hpa

On Tue, 2002-12-17 at 18:30, Ulrich Drepper wrote:
> demultiplexing happens in the kernel.  Do we want to do this at
> userlevel?  This would allow almost no-cost determination of those
> syscalls which can be handled at userlevel (getpid, getppid, ...).

getppid changes and so I think has to go to kernel (unless we go around
patching user pages on process exit [ick]). getpid can already be done
by reading it once at startup time and caching the data.



^ permalink raw reply	[flat|nested] 268+ messages in thread

* Re: Intel P6 vs P7 system call performance
  2002-12-17 19:26                       ` Alan Cox
@ 2002-12-17 18:57                         ` Ulrich Drepper
  2002-12-17 19:10                           ` Linus Torvalds
  2002-12-17 21:38                           ` Benjamin LaHaise
  0 siblings, 2 replies; 268+ messages in thread
From: Ulrich Drepper @ 2002-12-17 18:57 UTC (permalink / raw)
  To: Alan Cox
  Cc: Linus Torvalds, Matti Aarnio, Hugh Dickins, Dave Jones,
	Ingo Molnar, Linux Kernel Mailing List, hpa

Alan Cox wrote:

> getppid changes and so I think has to go to kernel (unless we go around
> patching user pages on process exit [ick]).

But this is exactly what I expect to happen.  If you want to implement
gettimeofday() at user-level you need to modify the page.  Some of the
information the kernel has to keep for the thread group can be stored in
this place and eventually be used by some userlevel code executed by
jumping to 0xfffff000 or whatever the address is.

-- 
--------------.                        ,-.            444 Castro Street
Ulrich Drepper \    ,-----------------'   \ Mountain View, CA 94041 USA
Red Hat         `--' drepper at redhat.com `---------------------------


^ permalink raw reply	[flat|nested] 268+ messages in thread

* Re: Intel P6 vs P7 system call performance
  2002-12-17 18:57                         ` Ulrich Drepper
@ 2002-12-17 19:10                           ` Linus Torvalds
  2002-12-17 19:21                             ` H. Peter Anvin
                                               ` (2 more replies)
  2002-12-17 21:38                           ` Benjamin LaHaise
  1 sibling, 3 replies; 268+ messages in thread
From: Linus Torvalds @ 2002-12-17 19:10 UTC (permalink / raw)
  To: Ulrich Drepper
  Cc: Alan Cox, Matti Aarnio, Hugh Dickins, Dave Jones, Ingo Molnar,
	Linux Kernel Mailing List, hpa

On Tue, 17 Dec 2002, Ulrich Drepper wrote:
>
> But this is exactly what I expect to happen.  If you want to implement
> gettimeofday() at user-level you need to modify the page.

Note that I really don't think we ever want to do the user-level
gettimeofday(). The complexity just argues against it, it's better to try
to make system calls be cheap enough that you really don't care.

sysenter helps a bit there.

If we'd need to modify the page, we couldn't share one page between all
processes, and we couldn't make it global in the TLB. So modifying the
info page is something we should avoid at all cost - it's not totally
unlikely that the overheads implied by per-thread pages would drown out
the wins from trying to be clever.

The advantage of the current static fixmap is that it's _extremely_
streamlined. The only overhead is literally the system entry itself, which
while a bit too high on a P4 is not that bad in general (and hopefully
Intel will fix the stupidities that cause the P4 to be slow at kernel
entry. Somebody already mentioned that apparently the newer P4 cores are
actually faster at system calls than mine is).

			Linus

^ permalink raw reply	[flat|nested] 268+ messages in thread

* Re: Intel P6 vs P7 system call performance
  2002-12-17 19:10                           ` Linus Torvalds
@ 2002-12-17 19:21                             ` H. Peter Anvin
  2002-12-17 19:37                               ` Linus Torvalds
  2002-12-17 19:47                             ` Dave Jones
  2002-12-18 12:57                             ` Rogier Wolff
  2 siblings, 1 reply; 268+ messages in thread
From: H. Peter Anvin @ 2002-12-17 19:21 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ulrich Drepper, Alan Cox, Matti Aarnio, Hugh Dickins, Dave Jones,
	Ingo Molnar, Linux Kernel Mailing List

Linus Torvalds wrote:
> 
> On Tue, 17 Dec 2002, Ulrich Drepper wrote:
> 
>>But this is exactly what I expect to happen.  If you want to implement
>>gettimeofday() at user-level you need to modify the page.
> 
> Note that I really don't think we ever want to do the user-level
> gettimeofday(). The complexity just argues against it, it's better to try
> to make system calls be cheap enough that you really don't care.
> 

Let's see... it works fine on UP and on *most* SMP, and on the ones
where it doesn't work you just fill in a system call into the vsyscall
slot.  It just means that gettimeofday() needs a different vsyscall slot.

	-hpa


^ permalink raw reply	[flat|nested] 268+ messages in thread

* Re: Intel P6 vs P7 system call performance
  2002-12-17 19:21                             ` H. Peter Anvin
@ 2002-12-17 19:37                               ` Linus Torvalds
  2002-12-17 19:43                                 ` H. Peter Anvin
                                                   ` (4 more replies)
  0 siblings, 5 replies; 268+ messages in thread
From: Linus Torvalds @ 2002-12-17 19:37 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Ulrich Drepper, Alan Cox, Matti Aarnio, Hugh Dickins, Dave Jones,
	Ingo Molnar, Linux Kernel Mailing List

On Tue, 17 Dec 2002, H. Peter Anvin wrote:
>
> Let's see... it works fine on UP and on *most* SMP, and on the ones
> where it doesn't work you just fill in a system call into the vsyscall
> slot.  It just means that gettimeofday() needs a different vsyscall slot.

The thing is, gettimeofday() isn't _that_ special. It's just not worth a
vsyscall of its own, I feel. Where do you stop? Do we do getpid() too?
Just because we can?

This is especially true since the people who _really_ might care about
gettimeofday() are exactly the people who wouldn't be able to use the fast
user-space-only version.

How much do you think gettimeofday() really matters on a desktop? Sure, X
apps do gettimeofday() calls, but they do a whole lot more of _other_
calls, and gettimeofday() is really far far down in the noise for them.
The people who really call for gettimeofday() as a performance thing seem
to be database people who want it as a timestamp. But those are the same
people who also want NUMA machines which don't necessarily have
synchronized clocks.

		Linus

^ permalink raw reply	[flat|nested] 268+ messages in thread

* Re: Intel P6 vs P7 system call performance
  2002-12-17 19:37                               ` Linus Torvalds
@ 2002-12-17 19:43                                 ` H. Peter Anvin
  2002-12-17 20:07                                   ` Matti Aarnio
  2002-12-17 19:59                                 ` Matti Aarnio
                                                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 268+ messages in thread
From: H. Peter Anvin @ 2002-12-17 19:43 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ulrich Drepper, Alan Cox, Matti Aarnio, Hugh Dickins, Dave Jones,
	Ingo Molnar, Linux Kernel Mailing List

Linus Torvalds wrote:
> 
> The thing is, gettimeofday() isn't _that_ special. It's just not worth a
> vsyscall of its own, I feel. Where do you stop? Do we do getpid() too?
> Just because we can?
>

getpid() could be implemented in userspace, but not via vsyscalls
(instead it could be passed in the ELF data area at process start.)

"Because we can and it's relatively easy" is a pretty good argument in
my opinion.

> This is especially true since the people who _really_ might care about
> gettimeofday() are exactly the people who wouldn't be able to use the fast
> user-space-only version.
> 
> How much do you think gettimeofday() really matters on a desktop? Sure, X
> apps do gettimeofday() calls, but they do a whole lot more of _other_
> calls, and gettimeofday() is really far far down in the noise for them.
> The people who really call for gettimeofday() as a performance thing seem
> to be database people who want it as a timestamp. But those are the same
> people who also want NUMA machines which don't necessarily have
> synchronized clocks.
> 

I think this is really an overstatement.  Timestamping etc. (and heck,
even databases) are actually perfectly usable even on smaller machines
these days.  Sure, DB vendors like to boast of their 128-way NUMA
machines, but I suspect the bulk of them run on single- and
dual-processor machines (by count, not necessarily by data volume.)

	-hpa



^ permalink raw reply	[flat|nested] 268+ messages in thread

* Re: Intel P6 vs P7 system call performance
  2002-12-17 19:43                                 ` H. Peter Anvin
@ 2002-12-17 20:07                                   ` Matti Aarnio
  2002-12-17 20:10                                     ` H. Peter Anvin
  0 siblings, 1 reply; 268+ messages in thread
From: Matti Aarnio @ 2002-12-17 20:07 UTC (permalink / raw)
  To: H. Peter Anvin; +Cc: Linus Torvalds, Linux Kernel Mailing List

(cutting down To:/Cc:)

On Tue, Dec 17, 2002 at 11:43:57AM -0800, H. Peter Anvin wrote:
> Linus Torvalds wrote:
> > The thing is, gettimeofday() isn't _that_ special. It's just not worth a
> > vsyscall of its own, I feel. Where do you stop? Do we do getpid() too?
> > Just because we can?
> 
> getpid() could be implemented in userspace, but not via vsyscalls
> (instead it could be passed in the ELF data area at process start.)

  After fork() or clone()  ?
  If we had only spawn(), and some separate way to start threads...

...
> 	-hpa

/Matti Aarnio

^ permalink raw reply	[flat|nested] 268+ messages in thread

* Re: Intel P6 vs P7 system call performance
  2002-12-17 20:07                                   ` Matti Aarnio
@ 2002-12-17 20:10                                     ` H. Peter Anvin
  0 siblings, 0 replies; 268+ messages in thread
From: H. Peter Anvin @ 2002-12-17 20:10 UTC (permalink / raw)
  To: Matti Aarnio; +Cc: Linus Torvalds, Linux Kernel Mailing List

Matti Aarnio wrote:
> (cutting down To:/Cc:)
> 
> On Tue, Dec 17, 2002 at 11:43:57AM -0800, H. Peter Anvin wrote:
> 
>>Linus Torvalds wrote:
>>
>>>The thing is, gettimeofday() isn't _that_ special. It's just not worth a
>>>vsyscall of its own, I feel. Where do you stop? Do we do getpid() too?
>>>Just because we can?
>>
>>getpid() could be implemented in userspace, but not via vsyscalls
>>(instead it could be passed in the ELF data area at process start.)
> 
> 
>   After fork() or clone()  ?
>   If we had only spawn(), and some separate way to start threads...
> 

fork() and clone() would have to return the self-pid as an auxiliary
return value.  This, of course, is getting rather fuggly.

Anything that cares caches getpid() anyway.

	-hpa



^ permalink raw reply	[flat|nested] 268+ messages in thread

* Re: Intel P6 vs P7 system call performance
  2002-12-17 19:37                               ` Linus Torvalds
  2002-12-17 19:43                                 ` H. Peter Anvin
@ 2002-12-17 19:59                                 ` Matti Aarnio
  2002-12-17 20:06                                 ` Ulrich Drepper
                                                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 268+ messages in thread
From: Matti Aarnio @ 2002-12-17 19:59 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: H. Peter Anvin, Ulrich Drepper, Alan Cox, Matti Aarnio,
	Hugh Dickins, Dave Jones, Ingo Molnar, Linux Kernel Mailing List

On Tue, Dec 17, 2002 at 11:37:04AM -0800, Linus Torvalds wrote:
> On Tue, 17 Dec 2002, H. Peter Anvin wrote:
> > Let's see... it works fine on UP and on *most* SMP, and on the ones
> > where it doesn't work you just fill in a system call into the vsyscall
> > slot.  It just means that gettimeofday() needs a different vsyscall slot.
> 
> The thing is, gettimeofday() isn't _that_ special. It's just not worth a
> vsyscall of its own, I feel. Where do you stop? Do we do getpid() too?
> Just because we can?

  clone()   -- which doesn't really like anybody using stack-pointer ?

  (I do use  gettimeofday() a _lot_, but I have my own userspace
   mapped shared segment thingamajingie doing it..  And I write
   code that runs on lots of systems, not only on Linux.)

> 		Linus

/Matti Aarnio

^ permalink raw reply	[flat|nested] 268+ messages in thread

* Re: Intel P6 vs P7 system call performance
  2002-12-17 19:37                               ` Linus Torvalds
  2002-12-17 19:43                                 ` H. Peter Anvin
  2002-12-17 19:59                                 ` Matti Aarnio
@ 2002-12-17 20:06                                 ` Ulrich Drepper
  2002-12-17 20:35                                   ` Daniel Jacobowitz
  2002-12-18  0:20                                   ` Linus Torvalds
  2002-12-18  7:41                                 ` Kai Henningsen
  2002-12-18 13:00                                 ` Rogier Wolff
  4 siblings, 2 replies; 268+ messages in thread
From: Ulrich Drepper @ 2002-12-17 20:06 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: H. Peter Anvin, Alan Cox, Matti Aarnio, Hugh Dickins, Dave Jones,
	Ingo Molnar, Linux Kernel Mailing List

Linus Torvalds wrote:

> The thing is, gettimeofday() isn't _that_ special. It's just not worth a
> vsyscall of its own, I feel. Where do you stop? Do we do getpid() too?

This is why I'd say make no distinction at all.  Have the first
nr_syscalls * 8 bytes starting at 0xfffff000 as a jump table.  We can
transfer to a different slot for each syscall.  Each slot then could be
a PC-relative jump to the common sysenter code or to some special code
sequence which is also in the global page.

If we don't do this now and it seems desirable in future we either have
to introduce a second ABI for the vsyscall stuff (ugly!) or you'll have
to do the demultiplexing yourself in the code starting at 0xfffff000.

-- 
--------------.                        ,-.            444 Castro Street
Ulrich Drepper \    ,-----------------'   \ Mountain View, CA 94041 USA
Red Hat         `--' drepper at redhat.com `---------------------------

^ permalink raw reply	[flat|nested] 268+ messages in thread

* Re: Intel P6 vs P7 system call performance
  2002-12-17 20:06                                 ` Ulrich Drepper
@ 2002-12-17 20:35                                   ` Daniel Jacobowitz
  2002-12-18  0:20                                   ` Linus Torvalds
  1 sibling, 0 replies; 268+ messages in thread
From: Daniel Jacobowitz @ 2002-12-17 20:35 UTC (permalink / raw)
  To: Linux Kernel Mailing List

On Tue, Dec 17, 2002 at 12:06:29PM -0800, Ulrich Drepper wrote:
> Linus Torvalds wrote:
> 
> > The thing is, gettimeofday() isn't _that_ special. It's just not worth a
> > vsyscall of its own, I feel. Where do you stop? Do we do getpid() too?
> 
> This is why I'd say make no distinction at all.  Have the first
> nr_syscalls * 8 bytes starting at 0xfffff000 as a jump table.  We can
> transfer to a different slot for each syscall.  Each slot then could be
> a PC-relative jump to the common sysenter code or to some special code
> sequence which is also in the global page.
> 
> If we don't do this now and it seems desirable in future we either have
> to introduce a second ABI for the vsyscall stuff (ugly!) or you'll have
> to do the demultiplexing yourself in the code starting at 0xfffff000.

But what does this do to things like PTRACE_SYSCALL?  And do we care...
I suppose not if we keep the syscall trace checks on every kernel entry
path.

-- 
Daniel Jacobowitz
MontaVista Software                         Debian GNU/Linux Developer

^ permalink raw reply	[flat|nested] 268+ messages in thread

* Re: Intel P6 vs P7 system call performance
  2002-12-17 20:06                                 ` Ulrich Drepper
  2002-12-17 20:35                                   ` Daniel Jacobowitz
@ 2002-12-18  0:20                                   ` Linus Torvalds
  2002-12-18  0:38                                     ` Ulrich Drepper
  1 sibling, 1 reply; 268+ messages in thread
From: Linus Torvalds @ 2002-12-18  0:20 UTC (permalink / raw)
  To: Ulrich Drepper
  Cc: H. Peter Anvin, Alan Cox, Matti Aarnio, Hugh Dickins, Dave Jones,
	Ingo Molnar, Linux Kernel Mailing List



On Tue, 17 Dec 2002, Ulrich Drepper wrote:
>
> This is why I'd say make no distinction at all.  Have the first
> nr_syscalls * 8 bytes starting at 0xfffff000 as a jump table.

No, the way sysenter works, the table approach just sucks up dcache space
(the kernel cannot know which sysenter is the one that the user uses
anyway, so the jump table would have to just add back some index and we'd
be back exactly where we started)

I'll keep it the way it is now.

		Linus


^ permalink raw reply	[flat|nested] 268+ messages in thread

* Re: Intel P6 vs P7 system call performance
  2002-12-18  0:20                                   ` Linus Torvalds
@ 2002-12-18  0:38                                     ` Ulrich Drepper
  0 siblings, 0 replies; 268+ messages in thread
From: Ulrich Drepper @ 2002-12-18  0:38 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: H. Peter Anvin, Alan Cox, Matti Aarnio, Hugh Dickins, Dave Jones,
	Ingo Molnar, Linux Kernel Mailing List

Linus Torvalds wrote:

> No, the way sysenter works, the table approach just sucks up dcache space
> (the kernel cannot know which sysenter is the one that the user uses
> anyway, so the jump table would have to just add back some index and we'd
> be back exactly where we started)
> 
> I'll keep it the way it is now.

I won't argue since honestly, not doing it is much easier for me.  But I
want to be sure I'm clear.

What I suggested is to have the first part of the global page be


   .p2align 3
   jmp sysenter_label
   .p2align 3
   jmp sysenter_label
   ...
   .p2align 3
   jmp userlevel_gettimeofday

sysenter_label:
  the usual sysenter code

userlevel_gettimeofday:
  whatever necessary


All this would be in the global page.  There is only one sysenter call.

-- 
--------------.                        ,-.            444 Castro Street
Ulrich Drepper \    ,-----------------'   \ Mountain View, CA 94041 USA
Red Hat         `--' drepper at redhat.com `---------------------------


^ permalink raw reply	[flat|nested] 268+ messages in thread

* Re: Intel P6 vs P7 system call performance
  2002-12-17 19:37                               ` Linus Torvalds
                                                   ` (2 preceding siblings ...)
  2002-12-17 20:06                                 ` Ulrich Drepper
@ 2002-12-18  7:41                                 ` Kai Henningsen
  2002-12-18 13:00                                 ` Rogier Wolff
  4 siblings, 0 replies; 268+ messages in thread
From: Kai Henningsen @ 2002-12-18  7:41 UTC (permalink / raw)
  To: linux-kernel

torvalds@transmeta.com (Linus Torvalds)  wrote on 17.12.02 in <Pine.LNX.4.44.0212171132530.1095-100000@home.transmeta.com>:

> On Tue, 17 Dec 2002, H. Peter Anvin wrote:
> >
> > Let's see... it works fine on UP and on *most* SMP, and on the ones
> > where it doesn't work you just fill in a system call into the vsyscall
> > slot.  It just means that gettimeofday() needs a different vsyscall slot.
>
> The thing is, gettimeofday() isn't _that_ special. It's just not worth a
> vsyscall of its own, I feel. Where do you stop? Do we do getpid() too?
> Just because we can?

It's special enough that while programming under DOS, I had my own special  
routine which just took the BIOS ticker from low memory for a lot of  
things - even to decide if calling the actual time-of-day syscall was  
useful or if I should expect to get the same value back as last time.

That was a *serious* performance improvement. (Of course, DOS syscalls are  
S-L-O-W ...)

These days, the equivalent does call gettimeofday(). It's still probably  
the most-used syscall by far. (Hmm - maybe I can get some numbers for  
that? Must see if I get time today.) And *that* is why optimizing this one  
call makes sense.

> This is especially true since the people who _really_ might care about
> gettimeofday() are exactly the people who wouldn't be able to use the fast
> user-space-only version.

Say what? Why wouldn't I be able to use it? Right now, I know of no SMP  
installation that's even in the planning ...

> How much do you think gettimeofday() really matters on a desktop? Sure, X

Why desktop? We use the same kind of thing in the server, and it's much  
more important there. Client performance is uninteresting - clients mostly  
wait anyway.

> The people who really call for gettimeofday() as a performance thing seem
> to be database people who want it as a timestamp. But those are the same

Not database, but otherwise on the nail.

> people who also want NUMA machines which don't necessarily have
> synchronized clocks.

Nope, no interest in those. SMP *might* become interesting, but I don't  
think we'd ever want to care about weird stuff like NUMA ... at least not  
for the next five years or so.

We don't shovel nearly as much data around as the database guys.

MfG Kai

^ permalink raw reply	[flat|nested] 268+ messages in thread

* Re: Intel P6 vs P7 system call performance
  2002-12-17 19:37                               ` Linus Torvalds
                                                   ` (3 preceding siblings ...)
  2002-12-18  7:41                                 ` Kai Henningsen
@ 2002-12-18 13:00                                 ` Rogier Wolff
  4 siblings, 0 replies; 268+ messages in thread
From: Rogier Wolff @ 2002-12-18 13:00 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: H. Peter Anvin, Ulrich Drepper, Alan Cox, Matti Aarnio,
	Hugh Dickins, Dave Jones, Ingo Molnar, Linux Kernel Mailing List

On Tue, Dec 17, 2002 at 11:37:04AM -0800, Linus Torvalds wrote:
> How much do you think gettimeofday() really matters on a desktop? Sure, X
> apps do gettimeofday() calls, but they do a whole lot more of _other_
> calls, and gettimeofday() is really far far down in the noise for them.
> The people who really call for gettimeofday() as a performance thing seem
> to be database people who want it as a timestamp. But those are the same
> people who also want NUMA machines which don't necessarily have
> synchronized clocks.

Once the kernel provides the right infrastructure, doing it may become
so easy that it can be tried and implemented and benchmarked with so
little effort that it would simply stick.

			Roger. 


-- 
** R.E.Wolff@BitWizard.nl ** http://www.BitWizard.nl/ ** +31-15-2600998 **
*-- BitWizard writes Linux device drivers for any device you may have! --*
* The Worlds Ecosystem is a stable system. Stable systems may experience *
* excursions from the stable situation. We are currently in such an      * 
* excursion: The stable situation does not include humans. ***************

^ permalink raw reply	[flat|nested] 268+ messages in thread

* Re: Intel P6 vs P7 system call performance
  2002-12-17 19:10                           ` Linus Torvalds
  2002-12-17 19:21                             ` H. Peter Anvin
@ 2002-12-17 19:47                             ` Dave Jones
  2002-12-18 12:57                             ` Rogier Wolff
  2 siblings, 0 replies; 268+ messages in thread
From: Dave Jones @ 2002-12-17 19:47 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ulrich Drepper, Alan Cox, Matti Aarnio, Hugh Dickins, Ingo Molnar,
	Linux Kernel Mailing List, hpa

On Tue, Dec 17, 2002 at 11:10:20AM -0800, Linus Torvalds wrote:
 > Intel will fix the stupidities that cause the P4 to be slow at kernel
 > entry. Somebody already mentioned that apparently the newer P4 cores are
 > actually faster at system calls than mine is).

My HT Northwood returns slightly better results than your xeon,
but the syscall stuff still completely trounces it.

(19:38:46:davej@tetrachloride:davej)$ ./a.out 
440.107164 cycles
1152.596084 cycles

		Dave

-- 
| Dave Jones.        http://www.codemonkey.org.uk
| SuSE Labs

^ permalink raw reply	[flat|nested] 268+ messages in thread

* Re: Intel P6 vs P7 system call performance
  2002-12-17 19:10                           ` Linus Torvalds
  2002-12-17 19:21                             ` H. Peter Anvin
  2002-12-17 19:47                             ` Dave Jones
@ 2002-12-18 12:57                             ` Rogier Wolff
  2002-12-19  0:14                               ` Pavel Machek
  2 siblings, 1 reply; 268+ messages in thread
From: Rogier Wolff @ 2002-12-18 12:57 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ulrich Drepper, Alan Cox, Matti Aarnio, Hugh Dickins, Dave Jones,
	Ingo Molnar, Linux Kernel Mailing List, hpa

On Tue, Dec 17, 2002 at 11:10:20AM -0800, Linus Torvalds wrote:
> 
> 
> On Tue, 17 Dec 2002, Ulrich Drepper wrote:
> >
> > But this is exactly what I expect to happen.  If you want to implement
> > gettimeofday() at user-level you need to modify the page.
> 
> Note that I really don't think we ever want to do the user-level
> gettimeofday(). The complexity just argues against it, it's better to try
> to make system calls be cheap enough that you really don't care.

I'd say that this should not be "fixed" from userspace, but from the
kernel. Thus if the kernel knows that the "gettimeofday" can be made
faster by doing it completely in userspace, then that system call
should be "patched" by the kernel to do it faster for everybody.

Next, someone might find a faster (full-userspace) way to do some
"reads"(*). Then it might pay to check for that specific
filedescriptor in userspace, and only call into the kernel for the
other filedescriptors. The idea is that the kernel knows best when
optimizations are possible.

Thus that ONE magic address is IMHO not the right way to do it. By
demultiplexing the stuff in userspace, you can do "sane" things with
specific syscalls. 

So for example, the code at 0xffff8000 would be:
	mov $0x00,%eax
	int $0x80
	ret

(in the case where sysenter & friends is not available)

moving the "load syscall number into the register" into the
kernel-modifiable area does not cost a thing, but because we have
demultiplexed the code, we can easily replace the gettimeofday call by
something that (when it's easy) doesn't require the 600-cycle call 
into kernel mode. 

The "syscall _NR" would then become: 

	call	0xffff8000 + _NR * 0x80

allowing for up to 0x80 bytes of "patchup code" or "do it quickly"
code, but also for a jump to some other "magic page", that has more
extensive code.

(Oh, I'm showing a base of 0xffff8000: A bit lower than previous
suggestions: allowing for a per-syscall entrypoint, and up to 0x80
bytes of fixup or "do it really quickly" code.)

P.S. People might argue that using this large "stride" would have a
larger cache-footprint. I think that all "where it matters" programs
will have a very small working-set of system calls. It might pay to
use a stride of say 0xa0 to spread the different
system-call-code-points over different cache-lines whenever possible.

		Roger. 

(*) I was trying to pick a particularly unlikely case, but I can even
see a case where this is useful. For example, a kernel might be
compiled with "high performance pipes", which would move most of the
pipe reads and writes into userspace, through a shared-memory window. 

-- 
** R.E.Wolff@BitWizard.nl ** http://www.BitWizard.nl/ ** +31-15-2600998 **
*-- BitWizard writes Linux device drivers for any device you may have! --*
* The Worlds Ecosystem is a stable system. Stable systems may experience *
* excursions from the stable situation. We are currently in such an      * 
* excursion: The stable situation does not include humans. ***************

^ permalink raw reply	[flat|nested] 268+ messages in thread

* Re: Intel P6 vs P7 system call performance
  2002-12-18 12:57                             ` Rogier Wolff
@ 2002-12-19  0:14                               ` Pavel Machek
  0 siblings, 0 replies; 268+ messages in thread
From: Pavel Machek @ 2002-12-19  0:14 UTC (permalink / raw)
  To: Rogier Wolff
  Cc: Linus Torvalds, Ulrich Drepper, Alan Cox, Matti Aarnio,
	Hugh Dickins, Dave Jones, Ingo Molnar, Linux Kernel Mailing List,
	hpa

Hi!

> > > But this is exactly what I expect to happen.  If you want to implement
> > > gettimeofday() at user-level you need to modify the page.
> > 
> > Note that I really don't think we ever want to do the user-level
> > gettimeofday(). The complexity just argues against it, it's better to try
> > to make system calls be cheap enough that you really don't care.
> 
> I'd say that this should not be "fixed" from userspace, but from the
> kernel. Thus if the kernel knows that the "gettimeofday" can be made
> faster by doing it completely in userspace, then that system call
> should be "patched" by the kernel to do it faster for everybody.
> 
> Next, someone might find a faster (full-userspace) way to do some
> "reads"(*). Then it might pay to check for that specific
> filedescriptor in userspace, and only call into the kernel for the
> other filedescriptors. The idea is that the kernel knows best when
> optimizations are possible.
> 
> Thus that ONE magic address is IMHO not the right way to do it. By
> demultiplexing the stuff in userspace, you can do "sane" things with
> specific syscalls. 
> 
> So for example, the code at 0xffff8000 would be:
> 	mov $0x00,%eax
> 	int $0x80
> 	ret
> 
> (in the case where sysenter & friends is not available)

This could save that one register needed for 6-args syscalls. If code
at 0xffff8000 was mov %ebp, %eax; sysenter; ret for P4, you could do
6-args syscalls this way.
								Pavel
-- 
Worst form of spam? Adding advertisment signatures ala sourceforge.net.
What goes next? Inserting advertisment *into* email?

^ permalink raw reply	[flat|nested] 268+ messages in thread

* Re: Intel P6 vs P7 system call performance
  2002-12-17 18:57                         ` Ulrich Drepper
  2002-12-17 19:10                           ` Linus Torvalds
@ 2002-12-17 21:38                           ` Benjamin LaHaise
  2002-12-17 21:41                             ` H. Peter Anvin
  1 sibling, 1 reply; 268+ messages in thread
From: Benjamin LaHaise @ 2002-12-17 21:38 UTC (permalink / raw)
  To: Ulrich Drepper
  Cc: Alan Cox, Linus Torvalds, Matti Aarnio, Hugh Dickins, Dave Jones,
	Ingo Molnar, Linux Kernel Mailing List, hpa

On Tue, Dec 17, 2002 at 10:57:29AM -0800, Ulrich Drepper wrote:
> But this is exactly what I expect to happen.  If you want to implement
> gettimeofday() at user-level you need to modify the page.  Some of the
> information the kernel has to keep for the thread group can be stored in
> this place and eventually be used by some userlevel code executed by
> jumping to 0xfffff000 or whatever the address is.

You don't actually need to modify the page, rather the data for the user 
level gettimeofday needs to be in a shared page and some register (like 
%tr) must expose the current cpu number to index into the data.  Either 
way, it's an internal implementation detail for the kernel to take care 
of, with multiple potential solutions.

		-ben
-- 
"Do you seek knowledge in time travel?"

^ permalink raw reply	[flat|nested] 268+ messages in thread

* Re: Intel P6 vs P7 system call performance
  2002-12-17 21:38                           ` Benjamin LaHaise
@ 2002-12-17 21:41                             ` H. Peter Anvin
  0 siblings, 0 replies; 268+ messages in thread
From: H. Peter Anvin @ 2002-12-17 21:41 UTC (permalink / raw)
  To: Benjamin LaHaise
  Cc: Ulrich Drepper, Alan Cox, Linus Torvalds, Matti Aarnio,
	Hugh Dickins, Dave Jones, Ingo Molnar, Linux Kernel Mailing List

Benjamin LaHaise wrote:
> On Tue, Dec 17, 2002 at 10:57:29AM -0800, Ulrich Drepper wrote:
> 
>>But this is exactly what I expect to happen.  If you want to implement
>>gettimeofday() at user-level you need to modify the page.  Some of the
>>information the kernel has to keep for the thread group can be stored in
>>this place and eventually be used by some userlevel code executed by
>>jumping to 0xfffff000 or whatever the address is.
> 
> 
> You don't actually need to modify the page, rather the data for the user 
> level gettimeofday needs to be in a shared page and some register (like 
> %tr) must expose the current cpu number to index into the data.  Either 
> way, it's an internal implementation detail for the kernel to take care 
> of, with multiple potential solutions.
> 

That's not the problem... the problem is that the userland code can get
preempted at any time and rescheduled on another CPU.

	-hpa
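
The migration hazard described above can be made concrete with a minimal
C sketch. It assumes a hypothetical per-CPU table of time values and a
hypothetical cpu_id() query; neither is an interface proposed in this
thread. The reader copies the entry for the CPU it believes it is on,
then asks again and retries if the answer changed. The fake_cpu_id()
helper is only a stand-in that simulates one migration so the retry path
is exercised.

```c
#include <stdint.h>

#define NCPUS 4   /* hypothetical number of CPUs in the table */

/* Hypothetical per-CPU time record kept in a shared, read-only page. */
struct cpu_time {
    uint64_t sec;
    uint64_t usec;
};

typedef unsigned (*cpu_id_fn)(void);

/*
 * Copy the record of the current CPU. If the task was moved to another
 * CPU between the first and second query, the copy may belong to the
 * wrong CPU, so start over.
 */
static struct cpu_time read_cpu_time(const struct cpu_time table[NCPUS],
                                     cpu_id_fn cpu_id)
{
    unsigned before, after;
    struct cpu_time t;

    do {
        before = cpu_id();
        t = table[before % NCPUS];
        after = cpu_id();
    } while (before != after);

    return t;
}

/* Stand-in for the CPU query: reports CPU 0 once, then CPU 1 forever. */
static unsigned fake_calls;

static unsigned fake_cpu_id(void)
{
    return fake_calls++ == 0 ? 0u : 1u;
}
```

This check alone cannot detect a task that migrates away and back
between the two queries, so it is an illustration of the problem rather
than a complete answer to it.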




* Re: Intel P6 vs P7 system call performance
  2002-12-17 17:55                   ` Linus Torvalds
  2002-12-17 18:24                     ` Linus Torvalds
  2002-12-17 18:30                     ` Ulrich Drepper
@ 2002-12-17 18:39                     ` Jeff Dike
  2002-12-17 19:05                       ` Linus Torvalds
  2002-12-18  5:34                     ` Jeremy Fitzhardinge
  3 siblings, 1 reply; 268+ messages in thread
From: Jeff Dike @ 2002-12-17 18:39 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ulrich Drepper, Matti Aarnio, Hugh Dickins, Dave Jones,
	Ingo Molnar, linux-kernel, hpa

torvalds@transmeta.com said:
> That also allows the kernel to move around the SYSINFO page at will,
> and even makes it possible to avoid it altogether (ie this will solve
> the inevitable problems with UML - UML just wouldn't set AT_SYSINFO,
> so user level just wouldn't even _try_ to use it). 

Why shouldn't I just set it to where UML provides its own SYSINFO page?

				Jeff



* Re: Intel P6 vs P7 system call performance
  2002-12-17 18:39                     ` Jeff Dike
@ 2002-12-17 19:05                       ` Linus Torvalds
  0 siblings, 0 replies; 268+ messages in thread
From: Linus Torvalds @ 2002-12-17 19:05 UTC (permalink / raw)
  To: Jeff Dike; +Cc: Hugh Dickins, Dave Jones, Ingo Molnar, linux-kernel, hpa



On Tue, 17 Dec 2002, Jeff Dike wrote:
>
> torvalds@transmeta.com said:
> > That also allows the kernel to move around the SYSINFO page at will,
> > and even makes it possible to avoid it altogether (ie this will solve
> > the inevitable problems with UML - UML just wouldn't set AT_SYSINFO,
> > so user level just wouldn't even _try_ to use it).
>
> Why shouldn't I just set it to where UML provides its own SYSINFO page?

Sure, that works fine too.

		Linus
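
As a rough illustration of how user level might consume AT_SYSINFO, here
is a minimal C sketch. It assumes glibc's getauxval() from <sys/auxv.h>,
which postdates this thread, and the enum and function names are
hypothetical. A value of zero is treated as "the kernel (or UML) did not
provide the entry", in which case the caller keeps using int $0x80.

```c
#include <elf.h>        /* AT_SYSINFO */
#include <sys/auxv.h>   /* getauxval() */

#ifndef AT_SYSINFO
#define AT_SYSINFO 32   /* value used by Linux; fallback for old headers */
#endif

/* Hypothetical names for the two ways of entering the kernel. */
enum syscall_path {
    PATH_INT80,     /* classic "int $0x80" */
    PATH_SYSINFO    /* call through the address given by AT_SYSINFO */
};

/* Decide which entry path to use from the aux vector value. */
static enum syscall_path choose_syscall_path(unsigned long sysinfo_base)
{
    return sysinfo_base ? PATH_SYSINFO : PATH_INT80;
}

/* getauxval() returns 0 when the entry is absent from the aux vector. */
static unsigned long query_sysinfo_base(void)
{
    return getauxval(AT_SYSINFO);
}
```

Because the decision depends only on the value handed over at exec time,
the same binary would work whether the entry is absent, set by the host
kernel, or set by UML to its own page.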



* Re: Intel P6 vs P7 system call performance
  2002-12-17 17:55                   ` Linus Torvalds
                                       ` (2 preceding siblings ...)
  2002-12-17 18:39                     ` Jeff Dike
@ 2002-12-18  5:34                     ` Jeremy Fitzhardinge
  2002-12-18  5:38                       ` H. Peter Anvin
  2002-12-18 15:50                       ` Alan Cox
  3 siblings, 2 replies; 268+ messages in thread
From: Jeremy Fitzhardinge @ 2002-12-18  5:34 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ulrich Drepper, Matti Aarnio, Hugh Dickins, Dave Jones,
	Ingo Molnar, Linux Kernel List, H. Peter Anvin

On Tue, 2002-12-17 at 09:55, Linus Torvalds wrote:
> Uli, how about I just add one new architecture-specific ELF AT flag, which
> is the "base of sysinfo page". Right now that page is all zeroes except
> for the system call trampoline at the beginning, but we might want to add
> other system information to the page in the future (it is readable, after
> all).

The P4 optimisation guide promises horrible things if you write within
2k of a cached instruction from another CPU (it dumps the whole trace
cache, it seems), so you'd need to be careful about mixing mutable data
and the syscall code in that page.

Immutable data should be fine.
        
        J



* Re: Intel P6 vs P7 system call performance
  2002-12-18  5:34                     ` Jeremy Fitzhardinge
@ 2002-12-18  5:38                       ` H. Peter Anvin
  2002-12-18 15:50                       ` Alan Cox
  1 sibling, 0 replies; 268+ messages in thread
From: H. Peter Anvin @ 2002-12-18  5:38 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Linus Torvalds, Ulrich Drepper, Matti Aarnio, Hugh Dickins,
	Dave Jones, Ingo Molnar, Linux Kernel List

Jeremy Fitzhardinge wrote:
> On Tue, 2002-12-17 at 09:55, Linus Torvalds wrote:
> 
>>Uli, how about I just add one new architecture-specific ELF AT flag, which
>>is the "base of sysinfo page". Right now that page is all zeroes except
>>for the system call trampoline at the beginning, but we might want to add
>>other system information to the page in the future (it is readable, after
>>all).
> 
> 
> The P4 optimisation guide promises horrible things if you write within
> 2k of a cached instruction from another CPU (it dumps the whole trace
> cache, it seems), so you'd need to be careful about mixing mutable data
> and the syscall code in that page.
> 
> Immutable data should be fine.
>         

Yes, you really want to use a second page.

	-hpa
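
The "second page" idea can be sketched as a layout in C. This is only an
illustration under assumptions: a 4096-byte page size, and hypothetical
names for the structure and its members. The point is that the mutable
data starts on its own page, so it can never sit within 2k of the
trampoline code.

```c
#include <stddef.h>

#define SYSINFO_PAGE_SIZE 4096   /* assumed i386 page size */

/* Hypothetical two-page layout: code and mutable data never share a page. */
struct sysinfo_area {
    /* Page 0: system call trampoline plus immutable data. */
    _Alignas(SYSINFO_PAGE_SIZE) unsigned char code[SYSINFO_PAGE_SIZE];
    /* Page 1: data the kernel may rewrite (e.g. time values). */
    _Alignas(SYSINFO_PAGE_SIZE) unsigned char data[SYSINFO_PAGE_SIZE];
};

/* Distance in bytes from the start of the code to the start of the data. */
static size_t code_to_data_gap(void)
{
    return offsetof(struct sysinfo_area, data) -
           offsetof(struct sysinfo_area, code);
}
```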





* Re: Intel P6 vs P7 system call performance
  2002-12-18  5:34                     ` Jeremy Fitzhardinge
  2002-12-18  5:38                       ` H. Peter Anvin
@ 2002-12-18 15:50                       ` Alan Cox
  1 sibling, 0 replies; 268+ messages in thread
From: Alan Cox @ 2002-12-18 15:50 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Linus Torvalds, Ulrich Drepper, Matti Aarnio, Hugh Dickins,
	Dave Jones, Ingo Molnar, Linux Kernel Mailing List,
	H. Peter Anvin

On Wed, 2002-12-18 at 05:34, Jeremy Fitzhardinge wrote:
> The P4 optimisation guide promises horrible things if you write within
> 2k of a cached instruction from another CPU (it dumps the whole trace
> cache, it seems), so you'd need to be careful about mixing mutable data
> and the syscall code in that page.

The PIII errata promise worse things with SMP and code modified as
another cpu runs it, and seem to mark them WONTFIX, so there is another
dragon to beware of.



* Re: Intel P6 vs P7 system call performance
  2002-12-17  5:55           ` Linus Torvalds
                               ` (3 preceding siblings ...)
  2002-12-17 16:12             ` Hugh Dickins
@ 2002-12-18 23:51             ` Pavel Machek
  4 siblings, 0 replies; 268+ messages in thread
From: Pavel Machek @ 2002-12-18 23:51 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Dave Jones, Ingo Molnar, linux-kernel, hpa

Hi!

> > (Modulo the missing syscall page I already mentioned and potential bugs
> > in the code itself, of course ;)
> 
> Ok, I did the vsyscall page too, and tried to make it do the right thing
> (but I didn't bother to test it on a non-SEP machine).
> 
> I'm pushing the changes out right now, but basically it boils down to the
> fact that with these changes, user space can instead of doing an
> 
> 	int $0x80
> 
> instruction for a system call just do a
> 
> 	call 0xfffff000
> 
> instead. The vsyscall page will be set up to use sysenter if the CPU
> supports it, and if it doesn't, it will just do the old "int $0x80"
> instead (and it could use the AMD syscall instruction if it wants to).
> User mode shouldn't know or care, the calling convention is the same as it
> ever was.

Perhaps it makes sense to define that gettimeofday is done by

	call 0xfffff100,

NOW? So we can add vsyscalls later?
								Pavel
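
A small C sketch of what reserving fixed entry points could look like.
The base address and the 0xfffff100 entry come from this thread; the
0x100 stride between entries and all names below are assumptions made
for illustration only.

```c
#include <stdint.h>

#define VSYSCALL_BASE      0xfffff000u  /* address discussed in this thread */
#define VSYSCALL_SLOT_SIZE 0x100u       /* assumed spacing between entries */

/* Hypothetical slot numbers; only slots 0 and 1 are mentioned above. */
enum vsyscall_slot {
    VSYS_SYSCALL      = 0,  /* generic system call trampoline */
    VSYS_GETTIMEOFDAY = 1   /* the proposed 0xfffff100 entry */
};

/* Address user space would "call" for a given slot. */
static uint32_t vsyscall_entry(enum vsyscall_slot slot)
{
    return VSYSCALL_BASE + (uint32_t)slot * VSYSCALL_SLOT_SIZE;
}
```

Fixing the slot addresses early would let user space call an entry even
before the kernel gives it a fast implementation; until then the entry
could simply fall through to the ordinary system call.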
-- 
Worst form of spam? Adding advertisment signatures ala sourceforge.net.
What goes next? Inserting advertisment *into* email?


* Re: Intel P6 vs P7 system call performance
  2002-12-09  8:30 Mike Hayward
  2002-12-09 15:40 ` erich
  2002-12-09 17:48 ` Linus Torvalds
@ 2002-12-13 15:45 ` William Lee Irwin III
  2002-12-13 16:49   ` Mike Hayward
  2002-12-15 21:59   ` Pavel Machek
  2 siblings, 2 replies; 268+ messages in thread
From: William Lee Irwin III @ 2002-12-13 15:45 UTC (permalink / raw)
  To: Mike Hayward; +Cc: linux-kernel

On Mon, Dec 09, 2002 at 01:30:28AM -0700, Mike Hayward wrote:
> Any ideas?  Not sure I want to upgrade to the P7 architecture if this
> is right, since for me system calls are probably more important than
> raw cpu computational power.

This is the same for me. I'm extremely uninterested in the P-IV for my
own use because of this.


Bill


* Re: Intel P6 vs P7 system call performance
  2002-12-13 15:45 ` William Lee Irwin III
@ 2002-12-13 16:49   ` Mike Hayward
  2002-12-14  0:55     ` GrandMasterLee
  2002-12-15 21:59   ` Pavel Machek
  1 sibling, 1 reply; 268+ messages in thread
From: Mike Hayward @ 2002-12-13 16:49 UTC (permalink / raw)
  To: wli; +Cc: linux-kernel

Hi Bill,

 > On Mon, Dec 09, 2002 at 01:30:28AM -0700, Mike Hayward wrote:
 > > Any ideas?  Not sure I want to upgrade to the P7 architecture if this
 > > is right, since for me system calls are probably more important than
 > > raw cpu computational power.
 > 
 > This is the same for me. I'm extremely uninterested in the P-IV for my
 > own use because of this.

I've also noticed that algorithms like the recursive one I run which
simulates solving the Tower of Hanoi problem are most likely very hard
to do branch prediction on.  Both the code and data no doubt fit
entirely in the L2 cache.  The AMD processor below is a much lower
cost and significantly lower clock rate (and on a machine with only a
100Mhz Memory bus) than the Xeon, yet dramatically outperforms it with
the same executable, compiled with gcc -march=i686 -O3.  Maybe with a
better Pentium 4 optimizing compiler the P4 and Xeon could improve a
few percent, but I doubt it'll ever see the AMD numbers.

Recursion Test--Tower of Hanoi

Uni  AMD XP 1800            2.4.18 kernel  46751.6 lps   (10 secs, 6 samples)
Dual Pentium 4 Xeon 2.4Ghz  2.4.19 kernel  33661.9 lps   (10 secs, 6 samples)

- Mike
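
For readers unfamiliar with the workload, a minimal recursive Tower of
Hanoi in C is sketched below. It is not the source of the benchmark
quoted above; the function name and the choice to return the move count
are assumptions for illustration. The two recursive calls per level are
what make the call/return pattern hard on the branch predictor.

```c
/*
 * Move `disks` disks from peg `from` to peg `to`, using `via` as the
 * spare peg. Returns the number of single-disk moves performed,
 * which is 2^disks - 1.
 */
static unsigned long hanoi_moves(int disks, int from, int to, int via)
{
    if (disks == 0)
        return 0;

    /* Move the top disks-1 out of the way, move the largest, move back. */
    return hanoi_moves(disks - 1, from, via, to)
         + 1
         + hanoi_moves(disks - 1, via, to, from);
}
```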


* Re: Intel P6 vs P7 system call performance
  2002-12-13 16:49   ` Mike Hayward
@ 2002-12-14  0:55     ` GrandMasterLee
  2002-12-14  4:41       ` Mike Dresser
  0 siblings, 1 reply; 268+ messages in thread
From: GrandMasterLee @ 2002-12-14  0:55 UTC (permalink / raw)
  To: Mike Hayward; +Cc: wli, linux-kernel

On Fri, 2002-12-13 at 10:49, Mike Hayward wrote:
> Hi Bill,
> 
>  > On Mon, Dec 09, 2002 at 01:30:28AM -0700, Mike Hayward wrote:
>  > > Any ideas?  Not sure I want to upgrade to the P7 architecture if this
>  > > is right, since for me system calls are probably more important than
>  > > raw cpu computational power.
>  > 
>  > This is the same for me. I'm extremely uninterested in the P-IV for my
>  > own use because of this.
> 
> I've also noticed that algorithms like the recursive one I run which
> simulates solving the Tower of Hanoi problem are most likely very hard
> to do branch prediction on.  Both the code and data no doubt fit
> entirely in the L2 cache.  The AMD processor below is a much lower
> cost and significantly lower clock rate (and on a machine with only a
> 100Mhz Memory bus) than the Xeon, yet dramatically outperforms it with
> the same executable, compiled with gcc -march=i686 -O3.  Maybe with a
> better Pentium 4 optimizing compiler the P4 and Xeon could improve a
> few percent, but I doubt it'll ever see the AMD numbers.
What GCC were you using? I'd use 3.2, or 3.2.1 myself with
-march=pentium4 and -mcpu=pentium4 to see if there *is* any difference
there. On my quad P4 Xeon 1.6Ghz with 1M L3 cache, I can compile a
kernel in about 35 seconds. Mind you that's my own config, not
*everything*. On a dual athlon MP at 1.8 Ghz, I get about 5 mins or so.
Both are running with make -jx where X is the saturation value.


> Recursion Test--Tower of Hanoi
> 
> Uni  AMD XP 1800            2.4.18 kernel  46751.6 lps   (10 secs, 6 samples)
> Dual Pentium 4 Xeon 2.4Ghz  2.4.19 kernel  33661.9 lps   (10 secs, 6 samples)
> 
> - Mike
-- 
GrandMasterLee <masterlee@digitalroadkill.net>


* Re: Intel P6 vs P7 system call performance
  2002-12-14  0:55     ` GrandMasterLee
@ 2002-12-14  4:41       ` Mike Dresser
  2002-12-14  4:53         ` Mike Dresser
  0 siblings, 1 reply; 268+ messages in thread
From: Mike Dresser @ 2002-12-14  4:41 UTC (permalink / raw)
  To: GrandMasterLee; +Cc: linux-kernel

On 13 Dec 2002, GrandMasterLee wrote:

> there. On my quad P4 Xeon 1.6Ghz with 1M L3 cache, I can compile a
> kernel in about 35 seconds. Mind you that's my own config, not
> *everything*. On a dual athlon MP at 1.8 Ghz, I get about 5 mins or so.
> Both are running with make -jx where X is the saturation value.

Something seems odd about the athlon MP time, I've got a celeron 533
with slow disks that does a pretty standard make dep ; make of 2.4.20 in
7m05, which is not that much different considering it's a third the speed,
and one cpu instead of two.

The single P4/2.53 in another machine can haul down in 3m17s

Guess our kernel .config's or version must vary greatly.

Mike



* Re: Intel P6 vs P7 system call performance
  2002-12-14  4:41       ` Mike Dresser
@ 2002-12-14  4:53         ` Mike Dresser
  2002-12-14 10:01           ` Dave Jones
  0 siblings, 1 reply; 268+ messages in thread
From: Mike Dresser @ 2002-12-14  4:53 UTC (permalink / raw)
  To: GrandMasterLee; +Cc: linux-kernel

On Fri, 13 Dec 2002, Mike Dresser wrote:

> The single P4/2.53 in another machine can haul down in 3m17s
>
Amend that to 2m19s, forgot to kill a background backup that was moving
files around at about 20 meg a second.

Mike



* Re: Intel P6 vs P7 system call performance
  2002-12-14  4:53         ` Mike Dresser
@ 2002-12-14 10:01           ` Dave Jones
  2002-12-14 17:48             ` Mike Dresser
  2002-12-14 18:36             ` GrandMasterLee
  0 siblings, 2 replies; 268+ messages in thread
From: Dave Jones @ 2002-12-14 10:01 UTC (permalink / raw)
  To: Mike Dresser; +Cc: GrandMasterLee, linux-kernel

On Fri, Dec 13, 2002 at 11:53:51PM -0500, Mike Dresser wrote:
 > On Fri, 13 Dec 2002, Mike Dresser wrote:
 > 
 > > The single P4/2.53 in another machine can haul down in 3m17s
 > >
 > Amend that to 2m19s, forgot to kill a background backup that was moving
 > files around at about 20 meg a second.

Note that there are more factors at play than raw cpu speed in a
kernel compile. Your time here is slightly faster than my 2.8Ghz P4-HT for
example.  My guess is you have faster disk(s) than I do, as most of
the time mine seems to be waiting for something to do.

*note also that this is compiling stock 2.4.20 with default configuration.
The minute you change any options, we're comparing apples to oranges.

		Dave

-- 
| Dave Jones.        http://www.codemonkey.org.uk
| SuSE Labs


* Re: Intel P6 vs P7 system call performance
  2002-12-14 10:01           ` Dave Jones
@ 2002-12-14 17:48             ` Mike Dresser
  2002-12-14 18:36             ` GrandMasterLee
  1 sibling, 0 replies; 268+ messages in thread
From: Mike Dresser @ 2002-12-14 17:48 UTC (permalink / raw)
  To: Dave Jones; +Cc: linux-kernel

On Sat, 14 Dec 2002, Dave Jones wrote:

> Note that there are more factors at play than raw cpu speed in a
> kernel compile. Your time here is slightly faster than my 2.8Ghz P4-HT for
> example.  My guess is you have faster disk(s) than I do, as most of
> the time mine seems to be waiting for something to do.

Quantum Fireball AS's in that machine.  My main comment was that his
Athlon MP at 1.8 was half or less the speed of a single P4.  Even with
compiler changes, I wouldn't think it would make THAT much of a
difference?

Mike



* Re: Intel P6 vs P7 system call performance
  2002-12-14 10:01           ` Dave Jones
  2002-12-14 17:48             ` Mike Dresser
@ 2002-12-14 18:36             ` GrandMasterLee
  2002-12-15  2:03               ` J.A. Magallon
  1 sibling, 1 reply; 268+ messages in thread
From: GrandMasterLee @ 2002-12-14 18:36 UTC (permalink / raw)
  To: Dave Jones; +Cc: Mike Dresser, linux-kernel

On Sat, 2002-12-14 at 04:01, Dave Jones wrote:
> On Fri, Dec 13, 2002 at 11:53:51PM -0500, Mike Dresser wrote:
>  > On Fri, 13 Dec 2002, Mike Dresser wrote:
>  > 
>  > > The single P4/2.53 in another machine can haul down in 3m17s
>  > >
>  > Amend that to 2m19s, forgot to kill a background backup that was moving
>  > files around at about 20 meg a second.



> Note that there are more factors at play than raw cpu speed in a
> kernel compile. Your time here is slightly faster than my 2.8Ghz P4-HT for
> example.  My guess is you have faster disk(s) than I do, as most of
> the time mine seems to be waiting for something to do.

An easy way to level the playing field would be to use /dev/shm to build
your kernel in. That way it's all in memory. If you've got a machine
with 512M, then it's easily accomplished.

> *note also that this is compiling stock 2.4.20 with default configuration.
> The minute you change any options, we're comparing apples to oranges.
> 
> 		Dave


* Re: Intel P6 vs P7 system call performance
  2002-12-14 18:36             ` GrandMasterLee
@ 2002-12-15  2:03               ` J.A. Magallon
  0 siblings, 0 replies; 268+ messages in thread
From: J.A. Magallon @ 2002-12-15  2:03 UTC (permalink / raw)
  To: GrandMasterLee; +Cc: linux-kernel


On 2002.12.14 GrandMasterLee wrote:
>On Sat, 2002-12-14 at 04:01, Dave Jones wrote:
>> On Fri, Dec 13, 2002 at 11:53:51PM -0500, Mike Dresser wrote:
>>  > On Fri, 13 Dec 2002, Mike Dresser wrote:
>>  > 
>>  > > The single P4/2.53 in another machine can haul down in 3m17s
>>  > >
>>  > Amend that to 2m19s, forgot to kill a background backup that was moving
>>  > files around at about 20 meg a second.
>
>
>
>> Note that there are more factors at play than raw cpu speed in a
>> kernel compile. Your time here is slightly faster than my 2.8Ghz P4-HT for
>> example.  My guess is you have faster disk(s) than I do, as most of
>> the time mine seems to be waiting for something to do.
>
>An easy way to level the playing field would be to use /dev/shm to build
>your kernel in. That way it's all in memory. If you've got a machine
>with 512M, then it's easily accomplished.
>

tmpfs does not guarantee that its contents always stay in RAM; it can also
be paged out. An easier way is to fill your page cache with the kernel tree, like

werewolf:/usr/src/linux# grep -v -r "" *

and then build, so that no disk reads are triggered.

-- 
J.A. Magallon <jamagallon@able.es>      \                 Software is like sex:
werewolf.able.es                         \           It's better when it's free
Mandrake Linux release 9.1 (Cooker) for i586
Linux 2.4.20-jam1 (gcc 3.2 (Mandrake Linux 9.1 3.2-4mdk))


* Re: Intel P6 vs P7 system call performance
  2002-12-13 15:45 ` William Lee Irwin III
  2002-12-13 16:49   ` Mike Hayward
@ 2002-12-15 21:59   ` Pavel Machek
  2002-12-15 22:37     ` William Lee Irwin III
  1 sibling, 1 reply; 268+ messages in thread
From: Pavel Machek @ 2002-12-15 21:59 UTC (permalink / raw)
  To: William Lee Irwin III, Mike Hayward, linux-kernel

Hi!

> > Any ideas?  Not sure I want to upgrade to the P7 architecture if this
> > is right, since for me system calls are probably more important than
> > raw cpu computational power.
> 
> This is the same for me. I'm extremely uninterested in the P-IV for my
> own use because of this.

Well, then you should fix the kernel so that syscalls are done by
sysenter (or whatever it is called).
								Pavel
-- 
Worst form of spam? Adding advertisment signatures ala sourceforge.net.
What goes next? Inserting advertisment *into* email?


* Re: Intel P6 vs P7 system call performance
  2002-12-15 21:59   ` Pavel Machek
@ 2002-12-15 22:37     ` William Lee Irwin III
  2002-12-15 22:43       ` Pavel Machek
  0 siblings, 1 reply; 268+ messages in thread
From: William Lee Irwin III @ 2002-12-15 22:37 UTC (permalink / raw)
  To: Pavel Machek; +Cc: Mike Hayward, linux-kernel

At some point in the past, I wrote:
>> This is the same for me. I'm extremely uninterested in the P-IV for my
>> own use because of this.

On Sun, Dec 15, 2002 at 10:59:51PM +0100, Pavel Machek wrote:
> Well, then you should fix the kernel so that syscalls are done by
> sysenter (or whatever it is called).
> 								Pavel

ABI is immutable. I actually run apps at home.

sysenter is also unusable for low-level loss-of-state reasons mentioned
elsewhere in this thread.


Nice try, though.


Bill


* Re: Intel P6 vs P7 system call performance
  2002-12-15 22:37     ` William Lee Irwin III
@ 2002-12-15 22:43       ` Pavel Machek
  0 siblings, 0 replies; 268+ messages in thread
From: Pavel Machek @ 2002-12-15 22:43 UTC (permalink / raw)
  To: William Lee Irwin III, Pavel Machek, Mike Hayward, linux-kernel

Hi!

> >> This is the same for me. I'm extremely uninterested in the P-IV for my
> >> own use because of this.
> 
> > Well, then you should fix the kernel so that syscalls are done by
> > sysenter (or whatever it is called).
> > 								Pavel
> 
> ABI is immutable. I actually run apps at home.

Perhaps that one killer app can be recompiled?

> sysenter is also unusable for low-level loss-of-state reasons mentioned
> elsewhere in this thread.

Well, disabling v86 may well be worth it :-).
								Pavel
-- 
Casualities in World Trade Center: ~3k dead inside the building,
cryptography in U.S.A. and free speech in Czech Republic.


* Re: Intel P6 vs P7 system call performance
@ 2002-12-09  7:01 Samium Gromoff
  0 siblings, 0 replies; 268+ messages in thread
From: Samium Gromoff @ 2002-12-09  7:01 UTC (permalink / raw)
  To: linux-kernel

  As for the dual Xeon vs the uniprocessor box, a possible reason is SMP
overhead, because a single thread cannot benefit from multiple CPUs...

cheers, Samium Gromoff


end of thread, other threads:[~2003-01-10 18:01 UTC | newest]

Thread overview: 268+ messages
2002-12-18 12:55 Intel P6 vs P7 system call performance Terje Eggestad
2002-12-18 20:14 ` H. Peter Anvin
2002-12-18 20:25   ` Richard B. Johnson
2002-12-18 20:26     ` H. Peter Anvin
2002-12-18 22:28   ` Jamie Lokier
2002-12-18 22:37     ` Linus Torvalds
2002-12-18 22:57       ` Linus Torvalds
2002-12-20  0:53         ` Daniel Jacobowitz
2002-12-20  1:47           ` Linus Torvalds
2002-12-20  2:37             ` Daniel Jacobowitz
2002-12-18 22:39     ` H. Peter Anvin
  -- strict thread matches above, loose matches on Subject: below --
2003-01-10 18:08 Gabriel Paubert
2002-12-30 13:06 Manfred Spraul
2002-12-30 14:54 ` Andi Kleen
2002-12-22 15:45 Nakajima, Jun
2002-12-22 12:33 Mikael Pettersson
2002-12-22 16:00 ` Jamie Lokier
2002-12-19 18:46 billyrose
2002-12-19 16:10 billyrose
2002-12-19 15:20 bart
2002-12-19 14:57 bart
2002-12-19 14:40 billyrose
2002-12-19 15:11 ` Richard B. Johnson
2002-12-19 13:55 bart
2002-12-19 19:37 ` Linus Torvalds
2002-12-19 22:10   ` Jamie Lokier
2002-12-19 22:16     ` H. Peter Anvin
2002-12-19 22:22     ` Linus Torvalds
2002-12-19 22:26       ` H. Peter Anvin
2002-12-19 22:49         ` Linus Torvalds
2002-12-19 23:30           ` Linus Torvalds
2002-12-22 11:08       ` James H. Cloos Jr.
2002-12-22 18:49         ` Linus Torvalds
2002-12-22 19:07           ` Ulrich Drepper
2002-12-22 19:34             ` Linus Torvalds
2002-12-22 19:51               ` Ulrich Drepper
2002-12-22 20:50                 ` James H. Cloos Jr.
2002-12-22 20:56                   ` Ulrich Drepper
2002-12-22 19:17           ` Ulrich Drepper
2002-12-20 10:08   ` Ulrich Drepper
2002-12-20 12:06     ` Jamie Lokier
2002-12-20 16:47       ` Linus Torvalds
2002-12-20 23:38         ` Jamie Lokier
2002-12-20 23:50           ` H. Peter Anvin
2002-12-21  0:09           ` Linus Torvalds
2002-12-21 17:18             ` Jamie Lokier
2002-12-21 19:39               ` Linus Torvalds
2002-12-22  2:18                 ` Jamie Lokier
2002-12-22  3:11                   ` Linus Torvalds
2002-12-22 10:13                     ` Ingo Molnar
2002-12-22 15:32                       ` Jamie Lokier
2002-12-22 18:53                       ` Linus Torvalds
2002-12-23  5:03                         ` Linus Torvalds
2002-12-23  7:14                           ` Ulrich Drepper
2002-12-23 23:27                           ` Petr Vandrovec
2002-12-24  0:22                             ` Stephen Rothwell
2002-12-24  4:10                               ` Linus Torvalds
2002-12-24  8:05                                 ` Rogier Wolff
2002-12-24 18:51                                   ` Linus Torvalds
2002-12-24 21:10                                     ` Rogier Wolff
2002-12-27 16:14                                 ` Kai Henningsen
2002-12-24 19:36                       ` Linus Torvalds
2002-12-24 20:20                         ` Ingo Molnar
2002-12-24 20:27                           ` Linus Torvalds
2002-12-24 20:31                         ` Ingo Molnar
2002-12-24 20:39                           ` Linus Torvalds
2002-12-28  2:05                             ` H. Peter Anvin
2002-12-28  2:04                           ` H. Peter Anvin
2002-12-26  7:47                         ` Pavel Machek
2003-01-10 11:30                         ` Gabriel Paubert
2003-01-10 17:11                           ` Linus Torvalds
2002-12-22 10:23                     ` Ingo Molnar
2002-12-19 13:22 bart
2002-12-19 13:38 ` Dave Jones
2002-12-19 14:22   ` Jamie Lokier
2002-12-19 16:56     ` Dave Jones
2002-12-19 19:29 ` H. Peter Anvin
2002-12-18 23:51 billyrose
2002-12-19 13:10 ` Richard B. Johnson
2002-12-18  1:30 Nakajima, Jun
2002-12-18  1:54 ` Ulrich Drepper
2002-12-18  3:36   ` H. Peter Anvin
2002-12-18  4:05     ` Linus Torvalds
2002-12-18  4:36       ` H. Peter Anvin
2002-12-18  4:07     ` Linus Torvalds
2002-12-18  4:40       ` Stephen Rothwell
2002-12-18  4:52         ` Linus Torvalds
2002-12-18  4:53         ` Andrew Morton
2002-12-18 19:12         ` Andrew Morton
2002-12-18 23:45       ` Pavel Machek
2002-12-20  3:05         ` Alan Cox
2002-12-20  4:03           ` Stephen Rothwell
2002-12-18  6:00   ` Brian Gerst
2002-12-17 16:32 Manfred Spraul
2002-12-17 17:13 ` Richard B. Johnson
2002-12-17 17:19   ` Richard B. Johnson
2002-12-17 17:37     ` Mikael Pettersson
2002-12-17 16:14 John Reiser
2002-12-17 16:01 John Reiser
     [not found] <20021209193649.GC10316@suse.de.suse.lists.linux.kernel>
     [not found] ` <Pine.LNX.4.44.0212161639310.1623-100000@penguin.transmeta.com.suse.lists.linux.kernel>
2002-12-17  8:56   ` Andi Kleen
2002-12-17 16:57     ` Linus Torvalds
2002-12-18  5:25       ` Brian Gerst
2002-12-18  6:06         ` Linus Torvalds
2002-12-21 11:24           ` Ingo Molnar
2002-12-21 17:28             ` Jamie Lokier
2002-12-21 16:07         ` Christian Leber
2002-12-15  8:43 scott thomason
2002-12-15  4:06 Albert D. Cahalan
2002-12-15 22:01 ` Pavel Machek
2002-12-16  7:33   ` Albert D. Cahalan
2002-12-16 11:17     ` Pavel Machek
2002-12-16 17:54       ` Mark Mielke
2002-12-16 16:07         ` Jonah Sherman
2002-12-17  4:10           ` David Schwartz
2002-12-17  8:02         ` Helge Hafting
2002-12-16 19:55       ` H. Peter Anvin
2002-12-13 21:52 Margit Schubert-While
2002-12-13 19:32 Dieter Nützel
2002-12-13 17:51 Margit Schubert-While
2002-12-11 12:48 Terje Eggestad
2002-12-11 18:50 ` H. Peter Anvin
2002-12-12  9:42   ` Terje Eggestad
2002-12-12 10:06     ` Arjan van de Ven
2002-12-12 10:31       ` Terje Eggestad
2002-12-12 19:03       ` H. Peter Anvin
2002-12-12 20:36     ` Mark Mielke
2002-12-12 20:56       ` J.A. Magallon
2002-12-12 20:12         ` Zac Hansen
2002-12-13  9:21         ` Terje Eggestad
2002-12-13 15:58           ` Ville Herva
2002-12-13 21:57             ` Terje Eggestad
2002-12-13 22:53               ` H. Peter Anvin
2002-12-12 20:56       ` Vojtech Pavlik
2002-12-09  8:30 Mike Hayward
2002-12-09 15:40 ` erich
2002-12-09 17:48 ` Linus Torvalds
2002-12-09 19:36   ` Dave Jones
2002-12-09 19:46     ` H. Peter Anvin
2002-12-28 20:37       ` Ville Herva
2002-12-29  2:05         ` Christian Leber
2002-12-30 18:22           ` Christian Leber
2002-12-30 21:22             ` Linus Torvalds
2002-12-30 11:29         ` Dave Jones
2002-12-17  0:47     ` Linus Torvalds
2002-12-17  1:03       ` Dave Jones
2002-12-17  2:36         ` Linus Torvalds
2002-12-17  5:55           ` Linus Torvalds
2002-12-17  6:09             ` Linus Torvalds
2002-12-17  6:18               ` Linus Torvalds
2002-12-19 14:03                 ` Shuji YAMAMURA
2002-12-17  6:19               ` GrandMasterLee
2002-12-17  6:43               ` dean gaudet
2002-12-17 16:50                 ` Linus Torvalds
2002-12-17 19:11                 ` H. Peter Anvin
2002-12-17 21:39                   ` Benjamin LaHaise
2002-12-17 21:41                     ` H. Peter Anvin
2002-12-17 21:53                       ` Benjamin LaHaise
2002-12-18 23:53                 ` Pavel Machek
2002-12-19 22:18                   ` H. Peter Anvin
2002-12-19 22:21                     ` Pavel Machek
2002-12-19 22:23                       ` H. Peter Anvin
2002-12-19 22:26                         ` Pavel Machek
2002-12-19 22:30                           ` H. Peter Anvin
2002-12-19 22:34                             ` Pavel Machek
2002-12-19 22:36                               ` H. Peter Anvin
2002-12-17 19:12               ` H. Peter Anvin
2002-12-17 19:26                 ` Martin J. Bligh
2002-12-17 20:51                   ` Alan Cox
2002-12-17 20:16                     ` H. Peter Anvin
2002-12-17 20:49                 ` Alan Cox
2002-12-17 20:12                   ` H. Peter Anvin
2002-12-17  9:45             ` Andre Hedrick
2002-12-17 12:40               ` Dave Jones
2002-12-17 23:18                 ` Andre Hedrick
2002-12-17 15:12               ` Alan Cox
2002-12-18 23:55                 ` Pavel Machek
2002-12-19 22:17                   ` H. Peter Anvin
2002-12-17 10:53             ` Ulrich Drepper
2002-12-17 11:17               ` dada1
2002-12-17 17:33                 ` Ulrich Drepper
2002-12-17 17:06               ` Linus Torvalds
2002-12-17 17:55                 ` Ulrich Drepper
2002-12-17 18:01                   ` Linus Torvalds
2002-12-17 19:23                   ` Alan Cox
2002-12-17 18:48                     ` Ulrich Drepper
2002-12-17 19:19                       ` H. Peter Anvin
2002-12-17 19:44                       ` Alan Cox
2002-12-17 19:52                         ` Richard B. Johnson
2002-12-17 19:54                           ` H. Peter Anvin
2002-12-17 19:58                           ` Linus Torvalds
2002-12-18  7:20                             ` Kai Henningsen
2002-12-17 18:49                     ` Linus Torvalds
2002-12-17 19:09                       ` Ross Biro
2002-12-17 21:34                       ` Benjamin LaHaise
2002-12-17 21:36                         ` H. Peter Anvin
2002-12-17 21:50                           ` Benjamin LaHaise
2002-12-18 23:59               ` Pavel Machek
2002-12-17 16:12             ` Hugh Dickins
2002-12-17 16:33               ` Richard B. Johnson
2002-12-17 17:47                 ` Linus Torvalds
2002-12-17 16:54               ` Hugh Dickins
2002-12-17 17:07               ` Linus Torvalds
2002-12-17 17:19                 ` Matti Aarnio
2002-12-17 17:55                   ` Linus Torvalds
2002-12-17 18:24                     ` Linus Torvalds
2002-12-17 18:33                       ` Ulrich Drepper
2002-12-17 18:30                     ` Ulrich Drepper
2002-12-17 19:04                       ` Linus Torvalds
2002-12-17 19:19                         ` Ulrich Drepper
2002-12-17 19:28                         ` Linus Torvalds
2002-12-17 19:32                           ` H. Peter Anvin
2002-12-17 19:44                             ` Linus Torvalds
2002-12-17 19:53                           ` Ulrich Drepper
2002-12-17 20:01                             ` Linus Torvalds
2002-12-17 20:17                               ` Ulrich Drepper
2002-12-18  4:15                                 ` Linus Torvalds
2002-12-18  4:15                               ` Linus Torvalds
2002-12-18  4:39                                 ` H. Peter Anvin
2002-12-18  4:49                                   ` Linus Torvalds
2002-12-18  6:38                                     ` Linus Torvalds
2002-12-18 13:17                                 ` Richard B. Johnson
2002-12-18 13:40                                 ` Horst von Brand
2002-12-18 13:47                                   ` Sean Neakums
2002-12-18 14:10                                     ` Horst von Brand
2002-12-18 14:51                                       ` dada1
2002-12-18 19:12                                       ` Mark Mielke
2002-12-18 15:52                                   ` Alan Cox
2002-12-18 16:41                                   ` Dave Jones
2002-12-18 18:41                                     ` Horst von Brand
2002-12-17 19:26                       ` Alan Cox
2002-12-17 18:57                         ` Ulrich Drepper
2002-12-17 19:10                           ` Linus Torvalds
2002-12-17 19:21                             ` H. Peter Anvin
2002-12-17 19:37                               ` Linus Torvalds
2002-12-17 19:43                                 ` H. Peter Anvin
2002-12-17 20:07                                   ` Matti Aarnio
2002-12-17 20:10                                     ` H. Peter Anvin
2002-12-17 19:59                                 ` Matti Aarnio
2002-12-17 20:06                                 ` Ulrich Drepper
2002-12-17 20:35                                   ` Daniel Jacobowitz
2002-12-18  0:20                                   ` Linus Torvalds
2002-12-18  0:38                                     ` Ulrich Drepper
2002-12-18  7:41                                 ` Kai Henningsen
2002-12-18 13:00                                 ` Rogier Wolff
2002-12-17 19:47                             ` Dave Jones
2002-12-18 12:57                             ` Rogier Wolff
2002-12-19  0:14                               ` Pavel Machek
2002-12-17 21:38                           ` Benjamin LaHaise
2002-12-17 21:41                             ` H. Peter Anvin
2002-12-17 18:39                     ` Jeff Dike
2002-12-17 19:05                       ` Linus Torvalds
2002-12-18  5:34                     ` Jeremy Fitzhardinge
2002-12-18  5:38                       ` H. Peter Anvin
2002-12-18 15:50                       ` Alan Cox
2002-12-18 23:51             ` Pavel Machek
2002-12-13 15:45 ` William Lee Irwin III
2002-12-13 16:49   ` Mike Hayward
2002-12-14  0:55     ` GrandMasterLee
2002-12-14  4:41       ` Mike Dresser
2002-12-14  4:53         ` Mike Dresser
2002-12-14 10:01           ` Dave Jones
2002-12-14 17:48             ` Mike Dresser
2002-12-14 18:36             ` GrandMasterLee
2002-12-15  2:03               ` J.A. Magallon
2002-12-15 21:59   ` Pavel Machek
2002-12-15 22:37     ` William Lee Irwin III
2002-12-15 22:43       ` Pavel Machek
2002-12-09  7:01 Samium Gromoff
