public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
* [patch 2.6.13-rc3a] i386: inline restore_fpu
@ 2005-07-22  3:06 Chuck Ebbert
  2005-07-22  3:27 ` Andrew Morton
                   ` (2 more replies)
  0 siblings, 3 replies; 22+ messages in thread
From: Chuck Ebbert @ 2005-07-22  3:06 UTC (permalink / raw)
  To: linux-kernel; +Cc: Andrew Morton, Linus Torvalds


  This patch makes restore_fpu() an inline.  When L1/L2 cache are saturated
it makes a measurable difference.

  Results from profiling Volanomark follow.  Sample rate was 2000 samples/sec
(HZ = 250, profile multiplier = 8) on a dual-processor Pentium II Xeon.


Before:

 10680 restore_fpu                              333.7500
  8351 device_not_available                     203.6829
  3823 math_state_restore                        59.7344
 -----
 22854


After:

 12534 math_state_restore                       130.5625
  8354 device_not_available                     203.7561
 -----
 20888


Patch is "obviously correct" and cuts 9% of the overhead.  Please apply.

Next step should be to physically place math_state_restore() after
device_not_available().  Would such a patch be accepted?  (Yes it
would be ugly and require linker script changes.)

Signed-off-by: Chuck Ebbert <76306.1226@compuserve.com>

Index: 2.6.13-rc3a/arch/i386/kernel/i387.c
===================================================================
--- 2.6.13-rc3a.orig/arch/i386/kernel/i387.c
+++ 2.6.13-rc3a/arch/i386/kernel/i387.c
@@ -82,17 +82,6 @@
 }
 EXPORT_SYMBOL_GPL(kernel_fpu_begin);
 
-void restore_fpu( struct task_struct *tsk )
-{
-	if ( cpu_has_fxsr ) {
-		asm volatile( "fxrstor %0"
-			      : : "m" (tsk->thread.i387.fxsave) );
-	} else {
-		asm volatile( "frstor %0"
-			      : : "m" (tsk->thread.i387.fsave) );
-	}
-}
-
 /*
  * FPU tag word conversions.
  */
Index: 2.6.13-rc3a/include/asm-i386/i387.h
===================================================================
--- 2.6.13-rc3a.orig/include/asm-i386/i387.h
+++ 2.6.13-rc3a/include/asm-i386/i387.h
@@ -22,11 +22,20 @@
 /*
  * FPU lazy state save handling...
  */
-extern void restore_fpu( struct task_struct *tsk );
-
 extern void kernel_fpu_begin(void);
 #define kernel_fpu_end() do { stts(); preempt_enable(); } while(0)
 
+static inline void restore_fpu( struct task_struct *tsk )
+{
+	if ( cpu_has_fxsr ) {
+		asm volatile( "fxrstor %0"
+			      : : "m" (tsk->thread.i387.fxsave) );
+	} else {
+		asm volatile( "frstor %0"
+			      : : "m" (tsk->thread.i387.fsave) );
+	}
+}
+
 /*
  * These must be called with preempt disabled
  */
__
Chuck

^ permalink raw reply	[flat|nested] 22+ messages in thread
* Re: [patch 2.6.13-rc3a] i386: inline restore_fpu
@ 2005-07-22  9:58 Chuck Ebbert
  0 siblings, 0 replies; 22+ messages in thread
From: Chuck Ebbert @ 2005-07-22  9:58 UTC (permalink / raw)
  To: Andrew Morton; +Cc: torvalds, linux-kernel


On Fri, 22 Jul 2005 at13:27:56 +1000, Andrew Morton wrote:

> hm.  What context switch rate is that thing doing?

  23000 - 25000 / second.  I guess that explains why my attempt to
duplicate this in C failed -- it switches at 15-19 times per second:


/* i387 context switch benchmark
 *    compile this program without optimization
 */
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>
#include <stdio.h>
#include <signal.h>
#include <sched.h>

int lo = 0, fp = 0, af = 0;
long i, iters;
pid_t child;
cpu_set_t cpuset;

void handler(int sig) {
	lo = 0; /* stop looping */
}
struct sigaction sa = {
	.sa_handler = handler,
};
main(int argc, char *argv[]) {
	if (argc < 2) /* mandatory loop iteration count */
		exit(1);
	iters = atol(argv[1]);
	if (argc > 2) /* loop in parent while waiting for child to exit */
		lo = atoi(argv[2]);
	if (argc > 3) /* use FP while wait-looping */
		fp = atoi(argv[3]);
	if (argc > 4) /* use cpu affinity -- all code runs on cpu 0 */
		af = atoi(argv[4]);

	__CPU_SET(0, &cpuset);

	child = fork();
	if (child) {  /* parent */
		if (af)
			sched_setaffinity(0, &cpuset);
		if (lo) {
			sigaction(SIGCHLD, &sa, NULL);
			if (fp)
				__asm__ __volatile__("fld1" "\n\t" "fldz");
			while (lo)
				if (fp)
					__asm__ __volatile__(
						"fadd %st(1), %st(0)");
		} else 
			wait(NULL);

	} else {  /* child */
		if (af)
			sched_setaffinity(0, &cpuset);
		__asm__ __volatile__("fld1" "\n\t" "fldz");
		for (i = 0; i < iters; i++)
			__asm__ __volatile__("fadd %st(1), %st(0)");
	}
}


> Is the benchmark actually doing floating point stuff?

  They only release .class files.  ISTR java does FP math even when
developers think they are doing integer math but I can't find any
references.

__
Chuck

^ permalink raw reply	[flat|nested] 22+ messages in thread
* Re: [patch 2.6.13-rc3a] i386: inline restore_fpu
@ 2005-07-23  7:09 Chuck Ebbert
  2005-07-23 17:38 ` Linus Torvalds
  0 siblings, 1 reply; 22+ messages in thread
From: Chuck Ebbert @ 2005-07-23  7:09 UTC (permalink / raw)
  To: Andrew Morton; +Cc: torvalds, linux-kernel, Arjan van de Ven, Adrian Bunk

On Fri, 22 Jul 2005 at 13:27:56 +1000, Andrew Morton wrote:

> Chuck Ebbert <76306.1226@compuserve.com> wrote:

> > After:
> > 
> >  12534 math_state_restore                       130.5625
> >   8354 device_not_available                     203.7561
> >  -----
> >  20888

> > Next step should be to physically place math_state_restore() after
> > device_not_available().  Would such a patch be accepted?  (Yes it
> > would be ugly and require linker script changes.)
>
> Depends on the benefit/ugly ratio ;)

This patch (may not apply cleanly to unpatched -rc3) drops overhead
a few more percent:

 11835 math_state_restore                       144.3293
  8096 device_not_available                     197.4634
 -----
 19931

Most of the exception/interrupt handlers should get the same treatment but
the linker script would grow.

Signed-off-by: Chuck Ebbert <76306.1226@compuserve.com>

Index: 2.6.13-rc3a/arch/i386/kernel/entry.S
===================================================================
--- 2.6.13-rc3a.orig/arch/i386/kernel/entry.S	2005-07-16 18:09:48.000000000 -0400
+++ 2.6.13-rc3a/arch/i386/kernel/entry.S	2005-07-22 05:19:22.000000000 -0400
@@ -469,6 +469,7 @@
 	pushl $do_simd_coprocessor_error
 	jmp error_code
 
+	.section ".text.i387_1", "a"
 ENTRY(device_not_available)
 	pushl $-1			# mark this as an int
 	SAVE_ALL
@@ -484,6 +485,8 @@
 	addl $4, %esp
 	jmp ret_from_exception
 
+	.previous
+
 /*
  * Debug traps and NMI can happen at the one SYSENTER instruction
  * that sets up the real kernel stack. Check here, since we can't
Index: 2.6.13-rc3a/arch/i386/kernel/traps.c
===================================================================
--- 2.6.13-rc3a.orig/arch/i386/kernel/traps.c	2005-07-16 18:05:40.000000000 -0400
+++ 2.6.13-rc3a/arch/i386/kernel/traps.c	2005-07-22 14:02:50.000000000 -0400
@@ -972,7 +972,8 @@
  * Must be called with kernel preemption disabled (in this case,
  * local interrupts are disabled at the call-site in entry.S).
  */
-asmlinkage void math_state_restore(struct pt_regs regs)
+asmlinkage void __attribute__((__section__(".text.i387_2")))
+math_state_restore(struct pt_regs regs)
 {
 	struct thread_info *thread = current_thread_info();
 	struct task_struct *tsk = thread->task;
Index: 2.6.13-rc3a/arch/i386/kernel/vmlinux.lds.S
===================================================================
--- 2.6.13-rc3a.orig/arch/i386/kernel/vmlinux.lds.S	2005-07-16 18:05:39.000000000 -0400
+++ 2.6.13-rc3a/arch/i386/kernel/vmlinux.lds.S	2005-07-22 05:20:30.000000000 -0400
@@ -20,6 +20,8 @@
   _text = .;			/* Text and read-only data */
   .text : AT(ADDR(.text) - LOAD_OFFSET) {
 	*(.text)
+	*(.text.i387_1)
+	*(.text.i387_2)
 	SCHED_TEXT
 	LOCK_TEXT
 	*(.fixup)
__
Chuck

^ permalink raw reply	[flat|nested] 22+ messages in thread
* Re: [patch 2.6.13-rc3a] i386: inline restore_fpu
@ 2005-07-23  7:09 Chuck Ebbert
  2005-07-23 17:33 ` Linus Torvalds
  0 siblings, 1 reply; 22+ messages in thread
From: Chuck Ebbert @ 2005-07-23  7:09 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Andrew Morton, linux-kernel

On Date: Fri, 22 Jul 2005 at 16:19:25 -0700 (PDT), Linus Torvalds wrote:

>       /*
>        * The "nop" is needed to make the instructions the same
>        * length.
>        */
>       #define restore_fpu(tsk)                        \
>               alternative_input(                      \
>                       "nop ; frstor %1",              \
>                       "fxrstor %1",                   \
>                       X86_FEATURE_FXSR,               \
>                       "m" ((tsk)->thread.i387.fsave))

   Probably a very minor point, but shouldn't it be

                        "m" ((tsk)->thread.i387.fxsave))

because that's the largest possible operand that could end up being read?


> Now, we should do the same for "fnsave ; fwait" vs "fxsave ; fnclex" too,
> but I was lazy. If somebody wants to try that, it would need an 
> "alternative_output()" define but should otherwise be trivial too.

  Is that function called __save_init_fpu because it saves and then
initializes the fpu?  Unlike fsave, fxsave does not init the fpu after
it saves the state; to make the behavior match it should be "fxsave ; fninit"


__
Chuck

^ permalink raw reply	[flat|nested] 22+ messages in thread
[parent not found: <200507212309_MC3-1-A534-95EF@compuserve.com.suse.lists.linux.kernel>]
* Re: [patch 2.6.13-rc3a] i386: inline restore_fpu
@ 2005-07-24 12:56 Kenneth Parrish
  0 siblings, 0 replies; 22+ messages in thread
From: Kenneth Parrish @ 2005-07-24 12:56 UTC (permalink / raw)
  To: linux-kernel

-=> Linus Torvalds wrote to All and Chuck Ebbert <=-

[..]

 Chuck>
      >                         "m" ((tsk)->thread.i387.fxsave))
      >
      > because that's the largest possible operand that could end up
      > being read?

 LT> Yes, I ended up fixing that already (along with handling the fxsave
 LT> case too).
[..]

 2.6.13-rc3-git5 lags -git4 on nbench's bitfield test, a part of its
memory index, eg typical runs
-git5
BITFIELD            :      1.1589e+07  :       1.99  :       0.42
-git4
BITFIELD            :      2.1892e+07  :       3.76  :       0.78

lmbench has slower context switch numbers for -git5.

--- MultiMail/Linux v0.46

^ permalink raw reply	[flat|nested] 22+ messages in thread
* Re: [patch 2.6.13-rc3a] i386: inline restore_fpu
@ 2005-07-25  2:34 Kenneth Parrish
  0 siblings, 0 replies; 22+ messages in thread
From: Kenneth Parrish @ 2005-07-25  2:34 UTC (permalink / raw)
  To: linux-kernel

-=> Kenneth Parrish wrote to All <=-

 KP>  2.6.13-rc3-git5 lags -git4 on nbench's bitfield test, a part of its
 KP> memory index, eg typical runs
 KP> -git5
 KP> BITFIELD            :      1.1589e+07  :       1.99  :       0.42
 KP> -git4
 KP> BITFIELD            :      2.1892e+07  :       3.76  :       0.78

 KP> lmbench has slower context switch numbers for -git5.

i've been experimentally adding -funswitch-loops and -fpeel loops to the
default -march=i586 in the ../arch/i386/Makefile

cflags-$(CONFIG_M586)       += -march=i586 -funswitch-loops -fpeel-loops

based on some tests with gcc 3.4.3,4 and the cyrix mii cpu, but
with v2.6.13-rc3-git5 carelessly put these flags on the next,
cflags-$(CONFIG_M586TSC) line, and so they would not have been used.
i shall retest after recompiling to confirm.
[..]
yeah, a couple nbench runs show similar overall results as -git4

^ permalink raw reply	[flat|nested] 22+ messages in thread
* Re: [patch 2.6.13-rc3a] i386: inline restore_fpu
@ 2005-07-26 21:23 Chuck Ebbert
  0 siblings, 0 replies; 22+ messages in thread
From: Chuck Ebbert @ 2005-07-26 21:23 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Adrian Bunk, Arjan van de Ven, linux-kernel, Andrew Morton

On Sat, 23 Jul 2005 at 10:38:40 -0700 (PDT), Linus Torvalds wrote:

> On Sat, 23 Jul 2005, Chuck Ebbert wrote:
> > 
> > This patch (may not apply cleanly to unpatched -rc3) drops overhead
> > a few more percent:
>
> That really is pretty ugly.
>
> I'd rather hope for something like function-sections to just make games 
> like this be unnecessary.

 The link stage might cause meltdown on typical machines, though.
__
Chuck

^ permalink raw reply	[flat|nested] 22+ messages in thread
* Re: [patch 2.6.13-rc3a] i386: inline restore_fpu
@ 2005-07-26 21:23 Chuck Ebbert
  2005-07-26 21:47 ` Linus Torvalds
  0 siblings, 1 reply; 22+ messages in thread
From: Chuck Ebbert @ 2005-07-26 21:23 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Andrew Morton, linux-kernel

On Fri, 22 Jul 2005 at 11:13:21 -0700 (PDT), Linus Torvalds wrote:

> Something like the following (totally untested) should make it be
> non-lazy. It's going to slow down normal task switches, but might speed up 
> the "restoring FP context all the time" case.
> 
> Chuck? This should work fine with or without your inline thing. Does it 
> make any difference?

 Now that I am looking at the right output file -- yes, it does (after fixing
one small bug.)  Volanomark set a new record of 6125 messages/sec and the
profile shows almost zero hits in math_state_restore.

 Since fxsave leaves the FPU state intact, there ought to be a better way to do
this but it gets tricky.  Maybe using the TSC to put a timestamp in every thread
save area?

  when saving FPU state:
        put cpu# and timestamp in thread state info
        also store timestamp in per-cpu data

  on task switch:
        compare cpu# and timestamps for next task
        if equal, clear TS and set TS_USEDFPU

  when state becomes invalid for some reason:
        zero cpu's timestamp

But the extra overhead might be too much in many cases.


(Below is what I changed in the original patch, if anyone wants to try it.)

+               if (tsk_used_math(next_p))

should be:

+               if (!tsk_used_math(next_p))

__
Chuck

^ permalink raw reply	[flat|nested] 22+ messages in thread
* Re: [patch 2.6.13-rc3a] i386: inline restore_fpu
@ 2005-07-27  1:40 linux
  0 siblings, 0 replies; 22+ messages in thread
From: linux @ 2005-07-27  1:40 UTC (permalink / raw)
  To: 76306.1226; +Cc: linux-kernel

> Since fxsave leaves the FPU state intact, there ought to be a better way
> to do this but it gets tricky.  Maybe using the TSC to put a timestamp
> in every thread save area?
> 
>   when saving FPU state:
>         put cpu# and timestamp in thread state info
>         also store timestamp in per-cpu data
> 
>   on task switch:
>         compare cpu# and timestamps for next task
>         if equal, clear TS and set TS_USEDFPU
> 
>   when state becomes invalid for some reason:
>         zero cpu's timestamp
> 
> But the extra overhead might be too much in many cases.

Simpler:
- Thread has "CPU that I last used FPU on" pointer.  Never NULL.
- Each CPU has "thread whose FPU state I hold" pointer.  May be NULL.

When *loading* FPU state:
- Set up both pointers.

On task switch:
- If the pointers point to each other, then clear TS and skip restore.
  ("Preloaded")

When state becomes invalid (kernel MMX use, or whatever)
- Set CPU's pointer to NULL.

On thread creation:
- If current CPU's thread pointer points to the newly allocated thread,
  clear it to NULL.
- Set thread's CPU pointer to current CPU.

The UP case just omits the per-thread CPU pointer.  (Well, stores
it in zero bits.)

An alternative SMP thread-creation case would be to have a NULL value for
the thread-to-CPU pointer and initialize the thread's CPU pointer to that,
but that then complicates the UP case.

^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2005-07-27  1:40 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2005-07-22  3:06 [patch 2.6.13-rc3a] i386: inline restore_fpu Chuck Ebbert
2005-07-22  3:27 ` Andrew Morton
2005-07-22  5:22   ` Linus Torvalds
2005-07-22 11:23     ` Arjan van de Ven
2005-07-22  8:14 ` Adrian Bunk
2005-07-22 18:13   ` Linus Torvalds
2005-07-25 19:26     ` Bill Davidsen
2005-07-22 23:19 ` Linus Torvalds
  -- strict thread matches above, loose matches on Subject: below --
2005-07-22  9:58 Chuck Ebbert
2005-07-23  7:09 Chuck Ebbert
2005-07-23 17:38 ` Linus Torvalds
2005-07-23 17:46   ` Arjan van de Ven
2005-07-23 18:02     ` Linus Torvalds
2005-07-23  7:09 Chuck Ebbert
2005-07-23 17:33 ` Linus Torvalds
     [not found] <200507212309_MC3-1-A534-95EF@compuserve.com.suse.lists.linux.kernel>
     [not found] ` <20050722132756.578acca7.akpm@osdl.org.suse.lists.linux.kernel>
2005-07-23 15:35   ` Andi Kleen
2005-07-24 12:56 Kenneth Parrish
2005-07-25  2:34 Kenneth Parrish
2005-07-26 21:23 Chuck Ebbert
2005-07-26 21:23 Chuck Ebbert
2005-07-26 21:47 ` Linus Torvalds
2005-07-27  1:40 linux

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox