public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
* [BENCHMARK] Lmbench 2.5.54-mm2 (impressive improvements)
@ 2003-01-03  8:59 Aniruddha M Marathe
  2003-01-03  9:33 ` Andrew Morton
  0 siblings, 1 reply; 18+ messages in thread
From: Aniruddha M Marathe @ 2003-01-03  8:59 UTC (permalink / raw)
  To: linux-kernel

Here is a comparison of results for 2.5.54-mm2 and plain 2.5.54.
There are many improvements with mm2; please see the short summary below.
The figures in the table below are medians of 5 repetitions of each test.
These results do not differ much from the previous ones.


						2.5.54mm2	2.5.54
==============================================================================
Processor, Processes - times in microseconds - smaller is better

1. sig handle				3.59		5.38
2. exec proc					1212		1534
3. sh proc					6606		7872
------------------------------------------------------------------------------
Context switching - times in microseconds - smaller is better

1. 2p/0K ctxsw				1.370		1.560
------------------------------------------------------------------------------
*Local* Communication latencies in microseconds - smaller is better
1. AF UNIX					13		19
2. UDP						24		30
3. TCP						58		75
------------------------------------------------------------------------------
File & VM system latencies in microseconds - smaller is better
1. 0K create				90		118
2. 0K delete				28		55
3. 10K create				313		375
4. 10K delete				79		126
------------------------------------------------------------------------------
*Local* Communication bandwidths in MB/s - bigger is better
1. AF UNIX					277		109
2. TCP					51		25
==============================================================================


*****************************************************************************
				Lmbench result
				kernel 2.5.54-mm2
****************************************************************************
L M B E N C H  2 . 0   S U M M A R Y
                 ------------------------------------
		 (Alpha software, do not distribute)

Basic system parameters
----------------------------------------------------
Host                 OS Description              Mhz
                                                    
--------- ------------- ----------------------- ----
benchtest  Linux 2.5.54       i686-pc-linux-gnu  790
benchtest  Linux 2.5.54       i686-pc-linux-gnu  790
benchtest  Linux 2.5.54       i686-pc-linux-gnu  790
benchtest  Linux 2.5.54       i686-pc-linux-gnu  790
benchtest  Linux 2.5.54       i686-pc-linux-gnu  790

Processor, Processes - times in microseconds - smaller is better
----------------------------------------------------------------
Host                 OS  Mhz null null      open selct sig  sig  fork exec sh
                             call  I/O stat clos TCP   inst hndl proc proc proc
--------- ------------- ---- ---- ---- ---- ---- ----- ---- ---- ---- ---- ----
benchtest  Linux 2.5.54  790 0.44 0.81 6.28 7.58       1.27 3.63  282 1212 6618
benchtest  Linux 2.5.54  790 0.46 0.83 6.42 7.71    33 1.27 3.59  304 1217 6569
benchtest  Linux 2.5.54  790 0.46 0.82 6.39 7.68    30 1.24 3.59  337 1211 6609
benchtest  Linux 2.5.54  790 0.46 0.82 6.40 7.61    30 1.27 3.63  318 1212 6588
benchtest  Linux 2.5.54  790 0.46 0.80 6.41 7.72    32 1.24 3.59  274 1232 6606

Context switching - times in microseconds - smaller is better
-------------------------------------------------------------
Host                 OS 2p/0K 2p/16K 2p/64K 8p/16K 8p/64K 16p/16K 16p/64K
                        ctxsw  ctxsw  ctxsw ctxsw  ctxsw   ctxsw   ctxsw
--------- ------------- ----- ------ ------ ------ ------ ------- -------
benchtest  Linux 2.5.54 1.350 4.8700     71 6.8600    176      41     180
benchtest  Linux 2.5.54 1.410 4.8200     18 8.3300    180      41     180
benchtest  Linux 2.5.54 1.390 4.7400     17 7.8600    179      43     180
benchtest  Linux 2.5.54 1.370 4.8200     18 8.0500    178      39     180
benchtest  Linux 2.5.54 1.370 4.7900     73     10    178      39     181

*Local* Communication latencies in microseconds - smaller is better
-------------------------------------------------------------------
Host                 OS 2p/0K  Pipe AF     UDP  RPC/   TCP  RPC/ TCP
                        ctxsw       UNIX         UDP         TCP conn
--------- ------------- ----- ----- ---- ----- ----- ----- ----- ----
benchtest  Linux 2.5.54 1.350 8.273   13    23    45    32    58  104
benchtest  Linux 2.5.54 1.410 8.174   13    24    45    28    58  104
benchtest  Linux 2.5.54 1.390 8.221   13    24    45    32    58  104
benchtest  Linux 2.5.54 1.370 8.279   13    24    45    32    58  104
benchtest  Linux 2.5.54 1.370 8.143   13    21    45    32    58  104

File & VM system latencies in microseconds - smaller is better
--------------------------------------------------------------
Host                 OS   0K File      10K File      Mmap    Prot    Page	
                        Create Delete Create Delete  Latency Fault   Fault 
--------- ------------- ------ ------ ------ ------  ------- -----   ----- 
benchtest  Linux 2.5.54     90     28    311     72      640 0.962 4.00000
benchtest  Linux 2.5.54     91     29    316     79      637 0.963 5.00000
benchtest  Linux 2.5.54     90     28    313     79      635 0.971 4.00000
benchtest  Linux 2.5.54     90     28    315     77      643 0.969 4.00000
benchtest  Linux 2.5.54     90     28    312     79      636 0.965 4.00000

*Local* Communication bandwidths in MB/s - bigger is better
-----------------------------------------------------------
Host                OS  Pipe AF    TCP  File   Mmap  Bcopy  Bcopy  Mem   Mem
                             UNIX      reread reread (libc) (hand) read write
--------- ------------- ---- ---- ---- ------ ------ ------ ------ ---- -----
benchtest  Linux 2.5.54  581  230   51    299    355    125    114  354   170
benchtest  Linux 2.5.54  595  454   52    294    352    123    112  352   169
benchtest  Linux 2.5.54  582  277   51    294    352    123    112  351   168
benchtest  Linux 2.5.54  589  238   53    281    352    123    112  351   168
benchtest  Linux 2.5.54  566  397   51    294    351    123    112  351   168

Memory latencies in nanoseconds - smaller is better
    (WARNING - may not be correct, check graphs)
---------------------------------------------------
Host                 OS   Mhz  L1 $   L2 $    Main mem    Guesses
--------- -------------  ---- ----- ------    --------    -------
benchtest  Linux 2.5.54   790 3.799 8.8810    175
benchtest  Linux 2.5.54   790 3.798 8.8810    176
benchtest  Linux 2.5.54   790 3.797 8.8810    176
benchtest  Linux 2.5.54   790 3.808 8.8800    176
benchtest  Linux 2.5.54   790 3.798 8.8710    176

Aniruddha Marathe
WIPRO Technologies, India
aniruddha.marathe@wipro.com
+91-80-5502001 to 2008 extn 5092 

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [BENCHMARK] Lmbench 2.5.54-mm2 (impressive improvements)
  2003-01-03  8:59 Aniruddha M Marathe
@ 2003-01-03  9:33 ` Andrew Morton
  2003-01-03 10:24   ` David S. Miller
  0 siblings, 1 reply; 18+ messages in thread
From: Andrew Morton @ 2003-01-03  9:33 UTC (permalink / raw)
  To: Aniruddha M Marathe; +Cc: linux-kernel

Aniruddha M Marathe wrote:
> 
> Here is a comparison of results of 2.5.54 with mm2 and 2.5.54.

I'm sorry, but all you are doing with these tests is discrediting
lmbench, AIM9, tiobench and unixbench.  There really is nothing in
these patches which can account for the changes which you are observing.

Possibly, it is all caused by cache colouring effects - the physical
addresses at which critical kernel and userspace text and data
happen to end up.

I'd suggest that you look for more complex tests.  There's a decent
list at http://lbs.sourceforge.net/, but even those are rather microscopic.

If you have time, things like the osdl dbt1 test, http://osdb.sourceforge.net/
and the commercial benchmarks would be more interesting.

Or cook up some of your own: it's not hard.  Just think of some time-consuming
operation which we perform on a daily basis and measure it.  Script
the startup and shutdown of X11 applications. rsync. sendmail.  cvs.

Mixed workloads are interesting and real world: run tiobench or dbench
or qsbench or whatever while trying to do something else, see how long
"something else" takes.

It is these sorts of things which will find areas of weakness which
can be addressed in this phase of kernel development.

The teeny little microbenchmarks are telling us that the rmap overhead
hurts, that the uninlining of copy_*_user may have been a bad idea, that
the addition of AIO has cost a little and that the complexity which
yielded large improvements in readv(), writev() and SMP throughput were
not free.  All of this is already known.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [BENCHMARK] Lmbench 2.5.54-mm2 (impressive improvements)
  2003-01-03 10:24   ` David S. Miller
@ 2003-01-03 10:22     ` Andrew Morton
  0 siblings, 0 replies; 18+ messages in thread
From: Andrew Morton @ 2003-01-03 10:22 UTC (permalink / raw)
  To: David S. Miller; +Cc: Aniruddha M Marathe, linux-kernel

"David S. Miller" wrote:
> 
> On Fri, 2003-01-03 at 01:33, Andrew Morton wrote:
> > I'm sorry, but all you are doing with these tests is discrediting
> > lmbench, AIM9, tiobench and unixbench.
>  ...
> > Possibly, it is all caused by cache colouring effects - the physical
> > addresses at which critical kernel and userspace text and data
> > happen to end up.
>  ...
> > The teeny little microbenchmarks are telling us that the rmap overhead
> > hurts, that the uninlining of copy_*_user may have been a bad idea, that
> > the addition of AIO has cost a little and that the complexity which
> > yielded large improvements in readv(), writev() and SMP throughput were
> > not free.  All of this is already known.
> 
> I think if anything, you are stating the true value of the
> microbenchmarks.  They are showing us how the kernel is getting
> more and more complex, causing basic operations to take longer
> and longer.  That's bad. :-)

Yup.  But these things are already known about.

> Last time I brought up an issue like this (a "nobody but weirdos use
> feature which is costing us cycles everywhere"), it got redone until
> it did cost nothing for people who don't use the feature.  See the
> whole security layer fiasco for example.

There would be some small benefit in disabling the per-cpu-pages
pools on uniprocessor, and probably the deferred lru-addition queues.

That's fairly simple to do but I didn't do it because it would mean
that SMP and UP are running significantly different codepaths.  Benching
this is on my todo list somewhere.
 
> I truly wish I could config out AIO for example, the overhead is just
> stupid.  I know that if some thought is put into it, the cost could
> be consumed completely.

hm.  Its cost in filesystem/VFS land is quite small.  I assume you're
referring to networking here?

> People who don't see the true value of researching even minor jitters
> in lmbench results (and fixing the causes or backing out the guilty
> patch) aren't kernel developers in my opinion. :-)

But the statistically significant differences _are_ researched, and are
well understood.

We shouldn't lose sight of large optimisations which happen to not be
covered by these tests.  eg: SMP scalability.

To cite an extreme case, the readv/writev changes sped up O_SYNC and
O_DIRECT writev() by up to 300x and buffered writev() by 3x.  But it cost
us a few percent on write(fd, buf, 1).

quad:/usr/src> grep -r writev lmbench
quad:/usr/src> grep -r writev aim9
quad:/usr/src> grep -r writev tiobench 
quad:/usr/src> grep -r writev unixbench-4.1.0-971022 
quad:/usr/src> 


The big, big one here is the reverse map.  I still don't believe that
its benefit has been shown to exceed its speed and space costs.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [BENCHMARK] Lmbench 2.5.54-mm2 (impressive improvements)
  2003-01-03  9:33 ` Andrew Morton
@ 2003-01-03 10:24   ` David S. Miller
  2003-01-03 10:22     ` Andrew Morton
  0 siblings, 1 reply; 18+ messages in thread
From: David S. Miller @ 2003-01-03 10:24 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Aniruddha M Marathe, linux-kernel

On Fri, 2003-01-03 at 01:33, Andrew Morton wrote:
> I'm sorry, but all you are doing with these tests is discrediting
> lmbench, AIM9, tiobench and unixbench.
 ...
> Possibly, it is all caused by cache colouring effects - the physical
> addresses at which critical kernel and userspace text and data
> happen to end up.
 ...
> The teeny little microbenchmarks are telling us that the rmap overhead
> hurts, that the uninlining of copy_*_user may have been a bad idea, that
> the addition of AIO has cost a little and that the complexity which
> yielded large improvements in readv(), writev() and SMP throughput were
> not free.  All of this is already known.

I think if anything, you are stating the true value of the
microbenchmarks.  They are showing us how the kernel is getting
more and more complex, causing basic operations to take longer
and longer.  That's bad. :-)

Last time I brought up an issue like this (a "nobody but weirdos use
feature which is costing us cycles everywhere"), it got redone until
it did cost nothing for people who don't use the feature.  See the
whole security layer fiasco for example.

I truly wish I could config out AIO for example, the overhead is just
stupid.  I know that if some thought is put into it, the cost could
be consumed completely.

People who don't see the true value of researching even minor jitters
in lmbench results (and fixing the causes or backing out the guilty
patch) aren't kernel developers in my opinion. :-)


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [BENCHMARK] Lmbench 2.5.54-mm2 (impressive improvements)
       [not found] ` <3E155903.F8C22286@digeo.com.suse.lists.linux.kernel>
@ 2003-01-03 18:40   ` Andi Kleen
  2003-01-03 21:32     ` Andrew Morton
  2003-01-05  1:01     ` Andrew Morton
  0 siblings, 2 replies; 18+ messages in thread
From: Andi Kleen @ 2003-01-03 18:40 UTC (permalink / raw)
  To: Andrew Morton; +Cc: davem, linux-kernel

Andrew Morton <akpm@digeo.com> writes:
> 
> The teeny little microbenchmarks are telling us that the rmap overhead
> hurts, that the uninlining of copy_*_user may have been a bad idea, that
> the addition of AIO has cost a little and that the complexity which
> yielded large improvements in readv(), writev() and SMP throughput were
> not free.  All of this is already known.

If you mean the signal speed regressions they caused - I fixed 
that on x86-64 by inlining sizes 1, 2, 4, 8, 10 (used by the signal fpu
frame) and 16.  But it should not use the stupid rep; ... of the old
version but direct unrolled moves.

x86-64 version in include/asm-x86_64/uaccess.h, could be ported
to i386 given that movqs need to be replaced by two movls.

-Andi

P.S.: regarding recent lmbench slow downs: I'm a bit
worried about the two wrmsrs which are in the i386 context switch
in load_esp0 for sysenter now. Last time I benchmarked WRMSRs on
Athlon they were really slow and knowing the P4 it is probably
even slower there. Imho it would be better to undo that patch
and use Linus' original trampoline stack.



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [BENCHMARK] Lmbench 2.5.54-mm2 (impressive improvements)
  2003-01-03 18:40   ` [BENCHMARK] Lmbench 2.5.54-mm2 (impressive improvements) Andi Kleen
@ 2003-01-03 21:32     ` Andrew Morton
  2003-01-05  1:01     ` Andrew Morton
  1 sibling, 0 replies; 18+ messages in thread
From: Andrew Morton @ 2003-01-03 21:32 UTC (permalink / raw)
  To: Andi Kleen; +Cc: davem, linux-kernel

Andi Kleen wrote:
> 
> Andrew Morton <akpm@digeo.com> writes:
> >
> > The teeny little microbenchmarks are telling us that the rmap overhead
> > hurts, that the uninlining of copy_*_user may have been a bad idea, that
> > the addition of AIO has cost a little and that the complexity which
> > yielded large improvements in readv(), writev() and SMP throughput were
> > not free.  All of this is already known.
> 
> If you mean the signal speed regressions they caused - I fixed
> that on x86-64 by inlining sizes 1, 2, 4, 8, 10 (used by the signal fpu
> frame) and 16.  But it should not use the stupid rep; ... of the old
> version but direct unrolled moves.

Yes, that would help a bit.  We should do that for ia32.  It's a little
worrisome that the return value from such a copy_*_user() implementation
will be incorrect - it is supposed to return the number of uncopied bytes.
Probably doesn't matter.

Most of the optimisation opportunities wrt signal delivery were soaked up
by replacing the copy_*_user() calls with put_user() and friends.

We could speed up signals heaps by re-lazying the fpu state storage in
some manner.

> x86-64 version in include/asm-x86_64/uaccess.h, could be ported
> to i386 given that movqs need to be replaced by two movls.
> 
> -Andi
> 
> P.S.: regarding recent lmbench slow downs: I'm a bit
> worried about the two wrmsrs which are in the i386 context switch
> in load_esp0 for sysenter now. Last time I benchmarked WRMSRs on
> Athlon they were really slow and knowing the P4 it is probably
> even slower there. Imho it would be better to undo that patch
> and use Linus' original trampoline stack.

hm.  How slow?  Any numbers on that?

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [BENCHMARK] Lmbench 2.5.54-mm2 (impressive improvements)
  2003-01-03 18:40   ` [BENCHMARK] Lmbench 2.5.54-mm2 (impressive improvements) Andi Kleen
  2003-01-03 21:32     ` Andrew Morton
@ 2003-01-05  1:01     ` Andrew Morton
  2003-01-05  3:35       ` Linus Torvalds
  1 sibling, 1 reply; 18+ messages in thread
From: Andrew Morton @ 2003-01-05  1:01 UTC (permalink / raw)
  To: Andi Kleen, Linus Torvalds; +Cc: davem, linux-kernel

Andi Kleen wrote:
> 
> P.S.: regarding recent lmbench slow downs: I'm a bit
> worried about the two wrmsrs which are in the i386 context switch
> in load_esp0 for sysenter now. Last time I benchmarked WRMSRs on
> Athlon they were really slow and knowing the P4 it is probably
> even slower there. Imho it would be better to undo that patch
> and use Linus' original trampoline stack.
> 

Looks like you're right.  The indications are that this change
has slowed context switches by ~5% on a PIII.   The backout patch
against 2.5.54 is below.  Testing on a P4 would be useful.

2.5.54, stock:

lmbench:

Context switching - times in microseconds - smaller is better
-------------------------------------------------------------
Host                 OS 2p/0K 2p/16K 2p/64K 8p/16K 8p/64K 16p/16K 16p/64K
                        ctxsw  ctxsw  ctxsw ctxsw  ctxsw   ctxsw   ctxsw
--------- ------------- ----- ------ ------ ------ ------ ------- -------
i686-linu  Linux 2.5.54    3     16     44    18     47      20      77

*Local* Communication latencies in microseconds - smaller is better
-------------------------------------------------------------------
Host                 OS 2p/0K  Pipe AF     UDP  RPC/   TCP  RPC/ TCP
                        ctxsw       UNIX         UDP         TCP conn
--------- ------------- ----- ----- ---- ----- ----- ----- ----- ----
i686-linu  Linux 2.5.54     3    16   26    65          78        231

AIM9:

     3 tcp_test            10.00       2767  276.70000        24903.00 TCP/IP Messages/second
     4 udp_test            10.00       4658  465.80000        46580.00 UDP/IP DataGrams/second
     5 fifo_test           10.01       6507  650.04995        65005.00 FIFO Messages/second
     6 dgram_pipe          10.00      11228 1122.80000       112280.00 DataGram Pipe Messages/second
     7 pipe_cpy            10.00      15463 1546.30000       154630.00 Pipe Messages/second

pollbench:
pollbench 1 100 5000
  result with handles 1 processes 100 loops 5000:time  9.609487 sec.
pollbench 2 100 2000
  result with handles 2 processes 100 loops 2000:time  4.016496 sec.
pollbench 5 100 2000
  result with handles 5 processes 100 loops 2000:time  4.917921 sec.


2.5.54, with the below backout patch:

Context switching - times in microseconds - smaller is better
-------------------------------------------------------------
Host                 OS 2p/0K 2p/16K 2p/64K 8p/16K 8p/64K 16p/16K 16p/64K
                        ctxsw  ctxsw  ctxsw ctxsw  ctxsw   ctxsw   ctxsw
--------- ------------- ----- ------ ------ ------ ------ ------- -------
i686-linu  Linux 2.5.54    3     14     47    18     50      20      61

*Local* Communication latencies in microseconds - smaller is better
-------------------------------------------------------------------
Host                 OS 2p/0K  Pipe AF     UDP  RPC/   TCP  RPC/ TCP
                        ctxsw       UNIX         UDP         TCP conn
--------- ------------- ----- ----- ---- ----- ----- ----- ----- ----
i686-linu  Linux 2.5.54     3    15   25    64          69        231


     3 tcp_test            10.00       2908  290.80000        26172.00 TCP/IP Messages/second
     4 udp_test            10.00       4971  497.10000        49710.00 UDP/IP DataGrams/second
     5 fifo_test           10.01       6642  663.53646        66353.65 FIFO Messages/second
     6 dgram_pipe          10.00      11516 1151.60000       115160.00 DataGram Pipe Messages/second
     7 pipe_cpy            10.00      15930 1593.00000       159300.00 Pipe Messages/second


pollbench 1 100 5000
  result with handles 1 processes 100 loops 5000:time  9.106732 sec.
pollbench 2 100 2000
  result with handles 2 processes 100 loops 2000:time  3.853814 sec.
pollbench 5 100 2000
  result with handles 5 processes 100 loops 2000:time  4.533519 sec.



 arch/i386/kernel/cpu/common.c |    2 +-
 arch/i386/kernel/process.c    |    2 +-
 arch/i386/kernel/sysenter.c   |   34 ++++++++++++++++++++++++++++++----
 arch/i386/kernel/vm86.c       |    4 +---
 include/asm-i386/cpufeature.h |    3 ---
 include/asm-i386/msr.h        |    4 ----
 include/asm-i386/processor.h  |   16 ----------------
 7 files changed, 33 insertions(+), 32 deletions(-)

--- 25/arch/i386/kernel/cpu/common.c~bad3	Fri Jan  3 16:36:48 2003
+++ 25-akpm/arch/i386/kernel/cpu/common.c	Fri Jan  3 16:36:48 2003
@@ -484,7 +484,7 @@ void __init cpu_init (void)
 		BUG();
 	enter_lazy_tlb(&init_mm, current, cpu);
 
-	load_esp0(t, thread->esp0);
+	t->esp0 = thread->esp0;
 	set_tss_desc(cpu,t);
 	cpu_gdt_table[cpu][GDT_ENTRY_TSS].b &= 0xfffffdff;
 	load_TR_desc();
--- 25/arch/i386/kernel/process.c~bad3	Fri Jan  3 16:36:48 2003
+++ 25-akpm/arch/i386/kernel/process.c	Fri Jan  3 16:36:48 2003
@@ -437,7 +437,7 @@ void __switch_to(struct task_struct *pre
 	/*
 	 * Reload esp0, LDT and the page table pointer:
 	 */
-	load_esp0(tss, next->esp0);
+	tss->esp0 = next->esp0;
 
 	/*
 	 * Load the per-thread Thread-Local Storage descriptor.
--- 25/arch/i386/kernel/sysenter.c~bad3	Fri Jan  3 16:36:48 2003
+++ 25-akpm/arch/i386/kernel/sysenter.c	Fri Jan  3 16:36:48 2003
@@ -34,14 +34,40 @@ struct fake_sep_struct {
 	unsigned char stack[0];
 } __attribute__((aligned(8192)));
 	
+static struct fake_sep_struct *alloc_sep_thread(int cpu)
+{
+	struct fake_sep_struct *entry;
+
+	entry = (struct fake_sep_struct *) __get_free_pages(GFP_ATOMIC, 1);
+	if (!entry)
+		return NULL;
+
+	memset(entry, 0, PAGE_SIZE<<1);
+	entry->thread.task = &entry->task;
+	entry->task.thread_info = &entry->thread;
+	entry->thread.preempt_count = 1;
+	entry->thread.cpu = cpu;	
+
+	return entry;
+}
+
 static void __init enable_sep_cpu(void *info)
 {
 	int cpu = get_cpu();
-	struct tss_struct *tss = init_tss + cpu;
+	struct fake_sep_struct *sep = alloc_sep_thread(cpu);
+	unsigned long *esp0_ptr = &(init_tss + cpu)->esp0;
+	unsigned long rel32;
+
+	rel32 = (unsigned long) sysenter_entry - (unsigned long) (sep->trampoline+11);
+	
+	*(short *) (sep->trampoline+0) = 0x258b;		/* movl xxxxx,%esp */
+	*(long **) (sep->trampoline+2) = esp0_ptr;
+	*(char *)  (sep->trampoline+6) = 0xe9;			/* jmp rl32 */
+	*(long *)  (sep->trampoline+7) = rel32;
 
-	wrmsr(MSR_IA32_SYSENTER_CS, __KERNEL_CS, 0);
-	wrmsr(MSR_IA32_SYSENTER_ESP, tss->esp0, 0);
-	wrmsr(MSR_IA32_SYSENTER_EIP, (unsigned long) sysenter_entry, 0);
+	wrmsr(0x174, __KERNEL_CS, 0);				/* SYSENTER_CS_MSR */
+	wrmsr(0x175, PAGE_SIZE*2 + (unsigned long) sep, 0);	/* SYSENTER_ESP_MSR */
+	wrmsr(0x176, (unsigned long) &sep->trampoline, 0);	/* SYSENTER_EIP_MSR */
 
 	printk("Enabling SEP on CPU %d\n", cpu);
 	put_cpu();	
--- 25/arch/i386/kernel/vm86.c~bad3	Fri Jan  3 16:36:48 2003
+++ 25-akpm/arch/i386/kernel/vm86.c	Fri Jan  3 16:36:48 2003
@@ -113,8 +113,7 @@ struct pt_regs * save_v86_state(struct k
 		do_exit(SIGSEGV);
 	}
 	tss = init_tss + smp_processor_id();
-	current->thread.esp0 = current->thread.saved_esp0;
-	load_esp0(tss, current->thread.esp0);
+	tss->esp0 = current->thread.esp0 = current->thread.saved_esp0;
 	current->thread.saved_esp0 = 0;
 	loadsegment(fs, current->thread.saved_fs);
 	loadsegment(gs, current->thread.saved_gs);
@@ -290,7 +289,6 @@ static void do_sys_vm86(struct kernel_vm
 
 	tss = init_tss + smp_processor_id();
 	tss->esp0 = tsk->thread.esp0 = (unsigned long) &info->VM86_TSS_ESP0;
-	disable_sysenter();
 
 	tsk->thread.screen_bitmap = info->screen_bitmap;
 	if (info->flags & VM86_SCREEN_BITMAP)
--- 25/include/asm-i386/cpufeature.h~bad3	Fri Jan  3 16:36:48 2003
+++ 25-akpm/include/asm-i386/cpufeature.h	Fri Jan  3 16:36:48 2003
@@ -7,8 +7,6 @@
 #ifndef __ASM_I386_CPUFEATURE_H
 #define __ASM_I386_CPUFEATURE_H
 
-#include <linux/bitops.h>
-
 #define NCAPINTS	4	/* Currently we have 4 32-bit words worth of info */
 
 /* Intel-defined CPU features, CPUID level 0x00000001, word 0 */
@@ -77,7 +75,6 @@
 #define cpu_has_pge		boot_cpu_has(X86_FEATURE_PGE)
 #define cpu_has_sse2		boot_cpu_has(X86_FEATURE_XMM2)
 #define cpu_has_apic		boot_cpu_has(X86_FEATURE_APIC)
-#define cpu_has_sep		boot_cpu_has(X86_FEATURE_SEP)
 #define cpu_has_mtrr		boot_cpu_has(X86_FEATURE_MTRR)
 #define cpu_has_mmx		boot_cpu_has(X86_FEATURE_MMX)
 #define cpu_has_fxsr		boot_cpu_has(X86_FEATURE_FXSR)
--- 25/include/asm-i386/msr.h~bad3	Fri Jan  3 16:36:48 2003
+++ 25-akpm/include/asm-i386/msr.h	Fri Jan  3 16:36:48 2003
@@ -53,10 +53,6 @@
 
 #define MSR_IA32_BBL_CR_CTL		0x119
 
-#define MSR_IA32_SYSENTER_CS		0x174
-#define MSR_IA32_SYSENTER_ESP		0x175
-#define MSR_IA32_SYSENTER_EIP		0x176
-
 #define MSR_IA32_MCG_CAP		0x179
 #define MSR_IA32_MCG_STATUS		0x17a
 #define MSR_IA32_MCG_CTL		0x17b
--- 25/include/asm-i386/processor.h~bad3	Fri Jan  3 16:36:48 2003
+++ 25-akpm/include/asm-i386/processor.h	Fri Jan  3 16:36:48 2003
@@ -14,7 +14,6 @@
 #include <asm/types.h>
 #include <asm/sigcontext.h>
 #include <asm/cpufeature.h>
-#include <asm/msr.h>
 #include <linux/cache.h>
 #include <linux/config.h>
 #include <linux/threads.h>
@@ -411,21 +410,6 @@ struct thread_struct {
 	.io_bitmap	= { [ 0 ... IO_BITMAP_SIZE ] = ~0 },		\
 }
 
-static inline void load_esp0(struct tss_struct *tss, unsigned long esp0)
-{
-	tss->esp0 = esp0;
-	if (cpu_has_sep) {
-		wrmsr(MSR_IA32_SYSENTER_CS, __KERNEL_CS, 0);
-		wrmsr(MSR_IA32_SYSENTER_ESP, esp0, 0);
-	}
-}
-
-static inline void disable_sysenter(void)
-{
-	if (cpu_has_sep)  
-		wrmsr(MSR_IA32_SYSENTER_CS, 0, 0);
-}
-
 #define start_thread(regs, new_eip, new_esp) do {		\
 	__asm__("movl %0,%%fs ; movl %0,%%gs": :"r" (0));	\
 	set_fs(USER_DS);					\

_

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [BENCHMARK] Lmbench 2.5.54-mm2 (impressive improvements)
  2003-01-05  1:01     ` Andrew Morton
@ 2003-01-05  3:35       ` Linus Torvalds
  2003-01-05  3:51         ` Linus Torvalds
                           ` (2 more replies)
  0 siblings, 3 replies; 18+ messages in thread
From: Linus Torvalds @ 2003-01-05  3:35 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Andi Kleen, davem, linux-kernel


On Sat, 4 Jan 2003, Andrew Morton wrote:
> 
> Looks like you're right.  The indications are that this change
> has slowed context switches by ~5% on a PIII.   The backout patch
> against 2.5.54 is below.  Testing on a P4 would be useful.

Hmm.. The backout patch doesn't handle single-stepping correctly: the
eflags cleanup singlestep patch later in the sysenter sequence _depends_
on the stack (and thus thread) being right on the very first in-kernel
instruction.

That (along with benchmarking of system call numbers - the stack switch at
system call run-time ends up being quite expensive on a P4) was what made
me decide to do the traditional "write MSR in schedule" approach, even
though I agree that it would be much nicer to not have to rewrite that
stupid MSR all the time.

It doesn't show up on lmbench (insufficient precision), but your AIM9
numbers are quite interesting. Are they stable?

		Linus


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [BENCHMARK] Lmbench 2.5.54-mm2 (impressive improvements)
  2003-01-05  3:35       ` Linus Torvalds
@ 2003-01-05  3:51         ` Linus Torvalds
  2003-01-05  3:54         ` Andrew Morton
  2003-01-05  9:18         ` Andrew Morton
  2 siblings, 0 replies; 18+ messages in thread
From: Linus Torvalds @ 2003-01-05  3:51 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Andi Kleen, davem, linux-kernel


On Sat, 4 Jan 2003, Linus Torvalds wrote:
> 
> It doesn't show up on lmbench (insufficient precision), but your AIM9
> numbers are quite interesting. Are they stable?

Btw, while checking whether the numbers are stable it is also interesting
to see stability across reboots etc, since for the scheduling latency in
particular it can easily depend on location of the binaries in physical
memory etc, since that matters for cache accesses (I think the L1 D$ on a
PIII is 4-way associative, I'm not sure - it makes it _reasonably_ good at
avoiding cache conflicts, but they can still happen and easily account for
a 5% fluctuation. I don't remember what the L1 I$ situation is).

And with a fairly persistent page cache, whatever cache situation there is
tends to largely stay the same, so just re-running the benchmark may not
change much, at least for the I$ situation.

You can see this effect quite clearly in lmbench: while the 2p/0k context
switch numbers tend to be fairly stable (almost zero likelihood of any
cache conflicts), the others often fluctuate more even with the same
kernel (ie for me the 2p/16kB numbers fluctuate between 3 and 6 usecs).

D$ conflicts are largely easier to see (because they usually _will_ change 
when you re-run the benchmark, so they show up as fluctuations), but the 
I$ effects in particular can be quite persistent because (a) the kernel 
code will always be at the same place and (b) the user code tends to be 
sticky in the same place due to the page cache. 

I'm convinced the I$ effects are one major issue why we sometimes see
largish fluctuations on some ubenchmarks between kernels when nothing has
really changed.

		Linus


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [BENCHMARK] Lmbench 2.5.54-mm2 (impressive improvements)
  2003-01-05  3:54         ` Andrew Morton
@ 2003-01-05  3:52           ` Linus Torvalds
  2003-01-05 10:06             ` Andi Kleen
  0 siblings, 1 reply; 18+ messages in thread
From: Linus Torvalds @ 2003-01-05  3:52 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Andi Kleen, davem, linux-kernel


On Sat, 4 Jan 2003, Andrew Morton wrote:
> > 
> > Hmm.. The backout patch doesn't handle single-stepping correctly: the
> > eflags cleanup singlestep patch later in the sysenter sequence _depends_
> > on the stack (and thus thread) being right on the very first in-kernel
> > instruction.
> 
> Well that's just a straight `patch -R' of the patch which added the wrmsr's.

Yes, but the breakage comes later when a subsequent patch in the 
2.5.53->54 stuff started depending on the stack location being stable even 
on the first instruction.

> > It doesn't show up on lmbench (insufficient precision), but your AIM9
> > numbers are quite interesting. Are they stable?
> 
> Seem to be, but more work is needed, including oprofiling.  Andi is doing
> some P4 testing at present.

Ok.

		Linus



* Re: [BENCHMARK] Lmbench 2.5.54-mm2 (impressive improvements)
  2003-01-05  3:35       ` Linus Torvalds
  2003-01-05  3:51         ` Linus Torvalds
@ 2003-01-05  3:54         ` Andrew Morton
  2003-01-05  3:52           ` Linus Torvalds
  2003-01-05  9:18         ` Andrew Morton
  2 siblings, 1 reply; 18+ messages in thread
From: Andrew Morton @ 2003-01-05  3:54 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Andi Kleen, davem, linux-kernel

Linus Torvalds wrote:
> 
> On Sat, 4 Jan 2003, Andrew Morton wrote:
> >
> > Looks like you're right.  The indications are that this change
> > has slowed context switches by ~5% on a PIII.   The backout patch
> > against 2.5.54 is below.  Testing on a P4 would be useful.
> 
> Hmm.. The backout patch doesn't handle single-stepping correctly: the
> eflags cleanup singlestep patch later in the sysenter sequence _depends_
> on the stack (and thus thread) being right on the very first in-kernel
> instruction.

Well that's just a straight `patch -R' of the patch which added the wrmsr's.

> That (along with benchmarking of system call numbers - the stack switch at
> system call run-time ends up being quite expensive on a P4) was what made
> me decide to do the traditional "write MSR in schedule" approach, even
> though I agree that it would be much nicer to not have to rewrite that
> stupid MSR all the time.
> 
> It doesn't show up on lmbench (insufficient precision), but your AIM9
> numbers are quite interesting. Are they stable?
> 

Seem to be, but more work is needed, including oprofiling.  Andi is doing
some P4 testing at present.


* Re: [BENCHMARK] Lmbench 2.5.54-mm2 (impressive improvements)
  2003-01-05  3:35       ` Linus Torvalds
  2003-01-05  3:51         ` Linus Torvalds
  2003-01-05  3:54         ` Andrew Morton
@ 2003-01-05  9:18         ` Andrew Morton
  2 siblings, 0 replies; 18+ messages in thread
From: Andrew Morton @ 2003-01-05  9:18 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Andi Kleen, davem, linux-kernel

Linus Torvalds wrote:
> 
> ...
> It doesn't show up on lmbench (insufficient precision), but your AIM9
> numbers are quite interesting. Are they stable?

OK, a closer look.  This is on a dual 1.7G P4, with HT disabled (involuntarily,
grr.)   Looks like an 8-10% hit on context-switch intensive stuff.


2.5.54+BK
=========

Context switching - times in microseconds - smaller is better
-------------------------------------------------------------
Host                 OS 2p/0K 2p/16K 2p/64K 8p/16K 8p/64K 16p/16K 16p/64K
                        ctxsw  ctxsw  ctxsw ctxsw  ctxsw   ctxsw   ctxsw
--------- ------------- ----- ------ ------ ------ ------ ------- -------
i686-linu  Linux 2.5.54    3      4     11     6     48      12      53

*Local* Communication latencies in microseconds - smaller is better
-------------------------------------------------------------------
Host                 OS 2p/0K  Pipe AF     UDP  RPC/   TCP  RPC/ TCP
                        ctxsw       UNIX         UDP         TCP conn

tbench 32:			(85k switches/sec)

Throughput 114.633 MB/sec (NB=143.291 MB/sec  1146.33 MBit/sec)
Throughput 114.157 MB/sec (NB=142.696 MB/sec  1141.57 MBit/sec)
Throughput 115.095 MB/sec (NB=143.869 MB/sec  1150.95 MBit/sec)

pollbench 1 100 5000		(118k switches/sec)
  result with handles 1 processes 100 loops 5000:time  8.371942 sec.
  result with handles 1 processes 100 loops 5000:time  8.381814 sec.
  result with handles 1 processes 100 loops 5000:time  8.367576 sec.
pollbench 2 100 2000		(105k switches/sec)
  result with handles 2 processes 100 loops 2000:time  3.694412 sec.
  result with handles 2 processes 100 loops 2000:time  3.672226 sec.
  result with handles 2 processes 100 loops 2000:time  3.657455 sec.
pollbench 5 100 2000		(79k switches/sec)
  result with handles 5 processes 100 loops 2000:time  4.564727 sec.
  result with handles 5 processes 100 loops 2000:time  4.783192 sec.
  result with handles 5 processes 100 loops 2000:time  4.561067 sec.

2.5.54+BK+broken-wrmsr-backout-patch:
=====================================


Context switching - times in microseconds - smaller is better
-------------------------------------------------------------
Host                 OS 2p/0K 2p/16K 2p/64K 8p/16K 8p/64K 16p/16K 16p/64K
                        ctxsw  ctxsw  ctxsw ctxsw  ctxsw   ctxsw   ctxsw
--------- ------------- ----- ------ ------ ------ ------ ------- -------
i686-linu  Linux 2.5.54    3      4     11     6     48      12      53
i686-linu  Linux 2.5.54    1      3      8     4     40      10      51

*Local* Communication latencies in microseconds - smaller is better
-------------------------------------------------------------------
Host                 OS 2p/0K  Pipe AF     UDP  RPC/   TCP  RPC/ TCP
                        ctxsw       UNIX         UDP         TCP conn
--------- ------------- ----- ----- ---- ----- ----- ----- ----- ----
i686-linu  Linux 2.5.54     3    14   22    26          30         57
i686-linu  Linux 2.5.54     1    12   28    22          32         58


tbench 32:

Throughput 121.701 MB/sec (NB=152.126 MB/sec  1217.01 MBit/sec)
Throughput 124.958 MB/sec (NB=156.197 MB/sec  1249.58 MBit/sec)
Throughput 124.086 MB/sec (NB=155.107 MB/sec  1240.86 MBit/sec)

pollbench 1 100 5000
  result with handles 1 processes 100 loops 5000:time  7.306432 sec.
  result with handles 1 processes 100 loops 5000:time  7.352913 sec.
  result with handles 1 processes 100 loops 5000:time  7.337019 sec.
pollbench 2 100 2000
  result with handles 2 processes 100 loops 2000:time  3.184550 sec.
  result with handles 2 processes 100 loops 2000:time  3.251854 sec.
  result with handles 2 processes 100 loops 2000:time  3.209147 sec.
pollbench 5 100 2000
  result with handles 5 processes 100 loops 2000:time  4.135773 sec.
  result with handles 5 processes 100 loops 2000:time  4.117304 sec.
  result with handles 5 processes 100 loops 2000:time  4.119047 sec.


The tbench changes should probably be ignored.  After profiling tbench
I can say that this throughput difference is _not_ due to the task switcher
change (__switch_to is only 1%).  I left the numbers here to show what
the effect of simply relinking and rebooting the kernel can be.


BTW, the pollbench numbers are not stunningly better than the 500MHz PIII:
pollbench 1 100 5000
  result with handles 1 processes 100 loops 5000:time  9.609487 sec.
pollbench 2 100 2000
  result with handles 2 processes 100 loops 2000:time  4.016496 sec.
pollbench 5 100 2000
  result with handles 5 processes 100 loops 2000:time  4.917921 sec.

I didn't profile the P4.  John has promised P4 oprofile support for
next week, which will be nice.

I did profile Manfred's pollbench on the PIII, uniprocessor build.  Note
that there is only a 5% throughput difference on this machine.  It's all
in __switch_to().   Here the PIII is doing 70k switches/sec.

2.5.54+BK:

c012abbc 534      2.69888     buffered_rmqueue
c0116714 617      3.11837     __wake_up_common
c010a606 635      3.20934     restore_all
c014b038 745      3.76529     do_poll
c013d4dc 757      3.82594     fget
c014551c 766      3.87142     pipe_write
c010a5c4 1249     6.31254     system_call
c014b0f0 1273     6.43384     sys_poll
c01090a4 1775     8.97099     __switch_to
c0116484 1922     9.71394     schedule

2.5.54+BK+backout-patch:

c012abbc 768      3.1024      buffered_rmqueue
c0116714 790      3.19127     __wake_up_common
c010a5e6 809      3.26803     restore_all
c013d4dc 918      3.70834     fget
c014551c 936      3.78105     pipe_write
c014b038 977      3.94668     do_poll
c01090a4 1070     4.32236     __switch_to
c014b0f0 1606     6.48758     sys_poll
c010a5a4 1678     6.77843     system_call
c0116484 2542     10.2686     schedule


* Re: [BENCHMARK] Lmbench 2.5.54-mm2 (impressive improvements)
  2003-01-05  3:52           ` Linus Torvalds
@ 2003-01-05 10:06             ` Andi Kleen
  2003-01-05 18:51               ` Linus Torvalds
  0 siblings, 1 reply; 18+ messages in thread
From: Andi Kleen @ 2003-01-05 10:06 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: davem, andrew.morton, linux-kernel

Linus Torvalds <torvalds@transmeta.com> writes:

> On Sat, 4 Jan 2003, Andrew Morton wrote:
> > > 
> > > Hmm.. The backout patch doesn't handle single-stepping correctly: the
> > > eflags cleanup singlestep patch later in the sysenter sequence _depends_
> > > on the stack (and thus thread) being right on the very first in-kernel
> > > instruction.
> > 
> > Well that's just a straight `patch -R' of the patch which added the wrmsr's.
> 
> Yes, but the breakage comes laterr when a subsequent patch in the 
> 2.5.53->54 stuff started depending on the stack location being stable even 
> on the first instruction.

Regarding the EFLAGS handling: why can't you just do 
a pushfl in the vsyscall page before pushing the 6th arg on the stack,
and a popfl afterwards?

Then the syscall entry in kernel code could just do

        pushl $fixed_eflags
        popfl 

The first popl for the 6th arg in the vsyscall page wouldn't be traced 
then, but I doubt that is a problem.

Would add a few cycles to the entry path, but then it is better than
having slow context switch.
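
As a rough sketch of the sequence being proposed (illustrative only - the
0x246 fixed_eflags value and the surrounding vsyscall-page code are
assumptions here, not the actual kernel source):

```asm
# In the vsyscall page, bracket the 6th-argument push:
	pushfl			# save the caller's EFLAGS
	pushl %ebp		# 6th system call argument
	sysenter
	# ... return from the kernel lands here ...
	popl %ebp		# drop the 6th-arg slot
	popfl			# restore the caller's EFLAGS

# At the kernel's sysenter entry point, load known-good flags:
	pushl $0x246		# assumed fixed_eflags: IF set, IOPL 0
	popfl			# so user EFLAGS never leak into the kernel
```

The kernel-side pushl/popfl pair is where the few extra entry-path cycles
mentioned above would come from.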

This would also eliminate the random IOPL problem Luca noticed.
BTW I think I have the same issue on x86-64 with SYSCALL (random IOPL
in kernel), but so far nothing broke, so it is probably not a big
problem.

> 
> > > It doesn't show up on lmbench (insufficient precision), but your AIM9
> > > numbers are quite interesting. Are they stable?
> > 
> > Seem to be, but more work is needed, including oprofiling.  Andi is doing
> > some P4 testing at present.
> 
> Ok.

Here are the numbers from a Dual 2.4Ghz Xeon. The first is plain 
2.5.54, the second is with the WRMSR-in-switch-to patch backed out.
Also 2.4.18-aa for comparison.

Host                 OS 2p/0K 2p/16K 2p/64K 8p/16K 8p/64K 16p/16K 16p/64K
                        ctxsw  ctxsw  ctxsw ctxsw  ctxsw   ctxsw   ctxsw

oenone     Linux 2.5.54 2.410 3.5600 6.0300 3.9900   34.8 8.59000    43.7
oenone     Linux 2.5.54 1.270 2.3300 4.7700 2.5100   29.5 4.16000    39.2


If that is true, the slowdown would be nearly 50% for the 2p case.
That looks a bit much; I wonder how accurate lmbench is here
(do we have some other context switch benchmark to double check?),
but all the numbers show a significant slowdown.

-Andi


* Re: [BENCHMARK] Lmbench 2.5.54-mm2 (impressive improvements)
  2003-01-05 10:06             ` Andi Kleen
@ 2003-01-05 18:51               ` Linus Torvalds
  2003-01-05 23:46                 ` Andi Kleen
  2003-01-06  0:58                 ` H. Peter Anvin
  0 siblings, 2 replies; 18+ messages in thread
From: Linus Torvalds @ 2003-01-05 18:51 UTC (permalink / raw)
  To: Andi Kleen; +Cc: davem, andrew.morton, linux-kernel


On 5 Jan 2003, Andi Kleen wrote:
> 
> Regarding the EFLAGS handling: why can't you just do 
> a pushfl in the vsyscall page before pushing the 6th arg on the stack
> and a popfl afterwards. 

I did that originally, but timings from Jamie convinced me that it's 
actually a quite noticeable overhead for the system call path.

You should realize that the 5-9% slowdown in schedule (which I don't like)  
comes with a 360% speedup on a P4 in simple system call handling (which I
_do_ like). My P4 does a system call in 428 cycles as opposed to 1568
cycles according to my benchmarks.

And part of the reason for the huge speedup is that the vsyscall/sysenter
path is actually pretty much the fastest possible. Yes, it would have been
faster just from using sysenter/sysexit, but not by 360%. The other
speedups come from not reloading segment registers multiple times
(noticeable on a PIII, not a P4) and from avoiding things like the flags
pushing.

NOTE! We could trivially speed up the task switching by making 
"load_esp0()" a bit smarter. Right now it actually re-writes _both_ 
SYSENTER_CS and SYSENTER_ESP on a taskswitch, and that's because a process 
that was in vm86 mode will have cleared SYSENTER_CS (so that sysenter will 
cause a GP fault inside vm86 mode).

Now, that SYSENTER_CS thing is very rare indeed, and by keeping track of 
what the previous value was (ie just caching the SYSENTER_CS value in the 
thread_struct), we could get rid of it with a conditional jump instead. 
Want to try it?

> This would also eliminate the random IOPL problem Luca noticed.

Nope, it wouldn't. A "popfl" in user mode does nothing for iopl. You have 
to have the popfl in kernel mode.

			Linus



* Re: [BENCHMARK] Lmbench 2.5.54-mm2 (impressive improvements)
  2003-01-05 18:51               ` Linus Torvalds
@ 2003-01-05 23:46                 ` Andi Kleen
  2003-01-06  1:33                   ` Linus Torvalds
  2003-01-06  0:58                 ` H. Peter Anvin
  1 sibling, 1 reply; 18+ messages in thread
From: Andi Kleen @ 2003-01-05 23:46 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Andi Kleen, davem, andrew.morton, linux-kernel

On Sun, Jan 05, 2003 at 07:51:44PM +0100, Linus Torvalds wrote:
> 
> On 5 Jan 2003, Andi Kleen wrote:
> > 
> > Regarding the EFLAGS handling: why can't you just do 
> > a pushfl in the vsyscall page before pushing the 6th arg on the stack
> > and a popfl afterwards. 
> 
> I did that originally, but timings from Jamie convinced me that it's 
> actually a quite noticeable overhead for the system call path.
> 
> You should realize that the 5-9% slowdown in schedule (which I don't like)  
> comes with a 360% speedup on a P4 in simple system call handling (which I
> _do_ like). My P4 does a system call in 428 cycles as opposed to 1568
> cycles according to my benchmarks.

According to my benchmarks the slowdown on context switch is a lot 
more than 5-9% on P4:

Host                 OS 2p/0K 2p/16K 2p/64K 8p/16K 8p/64K 16p/16K 16p/64K
                        ctxsw  ctxsw  ctxsw ctxsw  ctxsw   ctxsw   ctxsw

with wrmsr Linux 2.5.54 2.410 3.5600 6.0300 3.9900   34.8 8.59000    43.7
no wrmsr   Linux 2.5.54 1.270 2.3300 4.7700 2.5100   29.5 4.16000    39.2

That looks more like between 10% and 51%.

[Note I don't trust the numbers completely, the slowdown looks a bit too
extreme especially for the 16p case. But it is clear that it is a lot
slower]

I haven't benchmarked pushfl/popfl, but I cannot imagine it being slow
enough to offset that. I agree that syscalls are a slightly hotter path than
the context switch, but hurting one for the other that much looks a bit
unbalanced.


> 
> And part of the reason for the huge speedup is that the vsyscall/sysenter
> path is actually pretty much the fastest possible. Yes, it would have been

I can think of some things to speed it up more, e.g. replace all the
push / pop in SAVE/RESTORE_ALL with sub $frame,%esp ; movl %reg,offset(%esp) 
and movl offset(%esp),%reg ; addl $frame,%esp. This way the CPU has 
no dependencies between all the load/store operations, unlike push/pop.

(that is what all the code optimization guides recommend and gcc / icc
do too for saving/restoring of lots of registers) 

Perhaps that would offset a pushfl / popfl in kernel mode, may be worth 
a try.
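
The two save/restore patterns being contrasted look roughly like this (a
sketch of the idea, not the actual SAVE_ALL/RESTORE_ALL macros):

```asm
# push/pop form: each instruction depends on the %esp value
# produced by the previous one.
	pushl %eax
	pushl %ebx
	pushl %ecx
	# ...
	popl %ecx
	popl %ebx
	popl %eax

# sub/mov form: one %esp adjustment up front, then stores (and
# later loads) with no dependency on each other.
	subl $12, %esp
	movl %eax, 8(%esp)
	movl %ebx, 4(%esp)
	movl %ecx, 0(%esp)
	# ...
	movl 0(%esp), %ecx
	movl 4(%esp), %ebx
	movl 8(%esp), %eax
	addl $12, %esp
```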

-Andi


P.S.: For me it is actually good if the i386 context switch is slow.
On x86-64 I have some ugly wrmsrs in the context switch for the 
64bit segment base rewriting too and slowing down i386 like this will
make the 64bit kernel look better compared to 32bit ;););)



* Re: [BENCHMARK] Lmbench 2.5.54-mm2 (impressive improvements)
  2003-01-05 18:51               ` Linus Torvalds
  2003-01-05 23:46                 ` Andi Kleen
@ 2003-01-06  0:58                 ` H. Peter Anvin
  1 sibling, 0 replies; 18+ messages in thread
From: H. Peter Anvin @ 2003-01-06  0:58 UTC (permalink / raw)
  To: linux-kernel

Followup to:  <Pine.LNX.4.44.0301051040020.11848-100000@home.transmeta.com>
By author:    Linus Torvalds <torvalds@transmeta.com>
In newsgroup: linux.dev.kernel
> 
> Now, that SYSENTER_CS thing is very rare indeed, and by keeping track of 
> what the previous value was (ie just caching the SYSENTER_CS value in the 
> thread_struct), we could get rid of it with a conditional jump instead. 
> Want to try it?
> 

This seems like the first thing to do.

Dealing with the SYSENTER_ESP issue is a lot trickier.  It seems it
can be done with a magic EIP range test in the #DB handler; the
range is the part that finds the top of the real kernel stack and
pushfl's to it.  If the #DB handler receives a trap in this region,
it could emulate this piece of code (including pushing the pre-exception
flags onto the stack) and then invoke the instruction immediately
after the pushf.

Yes, it's ugly, but it should be relatively straightforward, and since
this particular chunk is assembly code by necessity it shouldn't be
brittle.

The other variant (which I have suggested) is to simply state "TF set
in user space is not honoured."  This would require a system call to
set TF -> 1.  That way the kernel already has the TF state for all
processes.

Again, it's ugly.

	-hpa




-- 
<hpa@transmeta.com> at work, <hpa@zytor.com> in private!
"Unix gives you enough rope to shoot yourself in the foot."
http://www.zytor.com/~hpa/puzzle.txt	<amsp@zytor.com>


* Re: [BENCHMARK] Lmbench 2.5.54-mm2 (impressive improvements)
  2003-01-05 23:46                 ` Andi Kleen
@ 2003-01-06  1:33                   ` Linus Torvalds
  2003-01-06  2:05                     ` Andi Kleen
  0 siblings, 1 reply; 18+ messages in thread
From: Linus Torvalds @ 2003-01-06  1:33 UTC (permalink / raw)
  To: Andi Kleen
  Cc: David S. Miller, andrew.morton, linux-kernel, Mikael Pettersson


On Mon, 6 Jan 2003, Andi Kleen wrote:
> 
> Host                 OS 2p/0K 2p/16K 2p/64K 8p/16K 8p/64K 16p/16K 16p/64K
>                         ctxsw  ctxsw  ctxsw ctxsw  ctxsw   ctxsw   ctxsw
> 
> with wrmsr Linux 2.5.54 2.410 3.5600 6.0300 3.9900   34.8 8.59000    43.7
> no wrmsr   Linux 2.5.54 1.270 2.3300 4.7700 2.5100   29.5 4.16000    39.2
> 
> That looks more like between 10%-51%

The lmbench numbers for context switch overhead vary _way_ too much to say
anything at all based on two runs. By all logic the worst-affected case 
should be the 2p/0K case, since the overhead of the wrmsr should be 100% 
constant.

The numbers by Mikael seem to say that the MSR writes are 800 cycles each 
(!) on a P4, so avoiding the CS write would make the overhead about half 
of what it is now (at the cost of making it conditional).

800 cycles in the context switch path is still nasty, I agree. 

> I haven't benchmarked pushfl/popfl, but I cannot imagine it being that 
> slow to offset that. I agree that syscalls are a slightly hotter path than the
> context switch, but hurting one for the other that much looks a bit
> misbalanced.

Note that pushf/popf is a totally different thing, and has nothing to do 
with the MSR save. 

For pushf/popf, the tradeoff is very clear: you have to either do the 
pushf/popf in the system call path, or you have to do it in the context 
switch path. They are equally expensive in both, but we do a hell of a lot 
more system calls, so it's _obviously_ better to do the pushf/popf in the 
context switch.

The WRMSR thing is much less obvious. Unlike the pushf/popf, the code 
isn't the same, you have two different cases:

 (a) single static wrmsr / CPU

	Change ESP at system call entry and extra jump to common code

 (b) wrmsr each context switch

	System call entry is free.

The thing that makes me like (b) _despite_ the slowdown of context
switching is that the esp reload made a difference of tens of cycles in
my testing of system calls (which is surprising, I admit - it might be
more than the esp reload, it might be some interaction with jumps right
after a switch to CPL0 causing badness for the pipeline).

Also, (b) allows us to simplify the TF handling (which is _not_ the eflags
issue: the eflags issue is about IOPL) and means that there are no issues
with NMI's and fake temporary stacks.
   
> I can think of some things to speed it up more. e.g. replace all the
> push / pop in SAVE/RESTORE_ALL with sub $frame,%esp ; movl %reg,offset(%esp) 
> and movl offset(%esp),%reg ; addl $frame,%esp. This way the CPU has 
> no dependencies between all the load/store options unlike push/pop.

Last I remember, that only made a difference on Athlons, and Intel CPU's 
have special-case logic for pushes/pops where they follow the value of the 
stack pointer down the chain and break the dependencies (this may also be 
why reloading %esp caused such a problem for the pipeline - if I remember 
correctly the ESP-following can only handle simple add/sub chains).

> (that is what all the code optimization guides recommend and gcc / icc
> do too for saving/restoring of lots of registers) 

It causes bigger code, which is why I don't particularly like it. 
Benchmarks might be interesting.

			Linus






* Re: [BENCHMARK] Lmbench 2.5.54-mm2 (impressive improvements)
  2003-01-06  1:33                   ` Linus Torvalds
@ 2003-01-06  2:05                     ` Andi Kleen
  0 siblings, 0 replies; 18+ messages in thread
From: Andi Kleen @ 2003-01-06  2:05 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andi Kleen, David S. Miller, andrew.morton, linux-kernel,
	Mikael Pettersson

On Mon, Jan 06, 2003 at 02:33:28AM +0100, Linus Torvalds wrote:
> > I can think of some things to speed it up more. e.g. replace all the
> > push / pop in SAVE/RESTORE_ALL with sub $frame,%esp ; movl %reg,offset(%esp) 
> > and movl offset(%esp),%reg ; addl $frame,%esp. This way the CPU has 
> > no dependencies between all the load/store options unlike push/pop.
> 
> Last I remember, that only made a difference on Athlons, and Intel CPU's 

I didn't benchmark it, but as a data point, ICC 7 now generates the movls 
instead of pushes too (even though that makes the code bigger). In fact it is 
even more aggressive about it than gcc: gcc does it only for more than three 
or four registers, icc does it for two and more.  So I expect it to be faster 
on Intel CPUs - at least on the P4 - too.  I doubt they tuned it for Athlons.


-Andi


end of thread, other threads:[~2003-01-06  1:57 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <94F20261551DC141B6B559DC4910867204491F@blr-m3-msg.wipro.com.suse.lists.linux.kernel>
     [not found] ` <3E155903.F8C22286@digeo.com.suse.lists.linux.kernel>
2003-01-03 18:40   ` [BENCHMARK] Lmbench 2.5.54-mm2 (impressive improvements) Andi Kleen
2003-01-03 21:32     ` Andrew Morton
2003-01-05  1:01     ` Andrew Morton
2003-01-05  3:35       ` Linus Torvalds
2003-01-05  3:51         ` Linus Torvalds
2003-01-05  3:54         ` Andrew Morton
2003-01-05  3:52           ` Linus Torvalds
2003-01-05 10:06             ` Andi Kleen
2003-01-05 18:51               ` Linus Torvalds
2003-01-05 23:46                 ` Andi Kleen
2003-01-06  1:33                   ` Linus Torvalds
2003-01-06  2:05                     ` Andi Kleen
2003-01-06  0:58                 ` H. Peter Anvin
2003-01-05  9:18         ` Andrew Morton
2003-01-03  8:59 Aniruddha M Marathe
2003-01-03  9:33 ` Andrew Morton
2003-01-03 10:24   ` David S. Miller
2003-01-03 10:22     ` Andrew Morton
