* [patch 2.6.13-rc3a] i386: inline restore_fpu
@ 2005-07-22 3:06 Chuck Ebbert
2005-07-22 3:27 ` Andrew Morton
` (2 more replies)
0 siblings, 3 replies; 22+ messages in thread
From: Chuck Ebbert @ 2005-07-22 3:06 UTC (permalink / raw)
To: linux-kernel; +Cc: Andrew Morton, Linus Torvalds
This patch makes restore_fpu() an inline. When the L1/L2 caches are saturated
it makes a measurable difference.
Results from profiling Volanomark follow. Sample rate was 2000 samples/sec
(HZ = 250, profile multiplier = 8) on a dual-processor Pentium II Xeon.
Before:
10680 restore_fpu 333.7500
8351 device_not_available 203.6829
3823 math_state_restore 59.7344
-----
22854
After:
12534 math_state_restore 130.5625
8354 device_not_available 203.7561
-----
20888
Patch is "obviously correct" and cuts 9% of the overhead. Please apply.
Next step should be to physically place math_state_restore() after
device_not_available(). Would such a patch be accepted? (Yes it
would be ugly and require linker script changes.)
Signed-off-by: Chuck Ebbert <76306.1226@compuserve.com>
Index: 2.6.13-rc3a/arch/i386/kernel/i387.c
===================================================================
--- 2.6.13-rc3a.orig/arch/i386/kernel/i387.c
+++ 2.6.13-rc3a/arch/i386/kernel/i387.c
@@ -82,17 +82,6 @@
}
EXPORT_SYMBOL_GPL(kernel_fpu_begin);
-void restore_fpu( struct task_struct *tsk )
-{
- if ( cpu_has_fxsr ) {
- asm volatile( "fxrstor %0"
- : : "m" (tsk->thread.i387.fxsave) );
- } else {
- asm volatile( "frstor %0"
- : : "m" (tsk->thread.i387.fsave) );
- }
-}
-
/*
* FPU tag word conversions.
*/
Index: 2.6.13-rc3a/include/asm-i386/i387.h
===================================================================
--- 2.6.13-rc3a.orig/include/asm-i386/i387.h
+++ 2.6.13-rc3a/include/asm-i386/i387.h
@@ -22,11 +22,20 @@
/*
* FPU lazy state save handling...
*/
-extern void restore_fpu( struct task_struct *tsk );
-
extern void kernel_fpu_begin(void);
#define kernel_fpu_end() do { stts(); preempt_enable(); } while(0)
+static inline void restore_fpu( struct task_struct *tsk )
+{
+ if ( cpu_has_fxsr ) {
+ asm volatile( "fxrstor %0"
+ : : "m" (tsk->thread.i387.fxsave) );
+ } else {
+ asm volatile( "frstor %0"
+ : : "m" (tsk->thread.i387.fsave) );
+ }
+}
+
/*
* These must be called with preempt disabled
*/
__
Chuck
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [patch 2.6.13-rc3a] i386: inline restore_fpu
2005-07-22 3:06 [patch 2.6.13-rc3a] i386: inline restore_fpu Chuck Ebbert
@ 2005-07-22 3:27 ` Andrew Morton
2005-07-22 5:22 ` Linus Torvalds
2005-07-22 8:14 ` Adrian Bunk
2005-07-22 23:19 ` Linus Torvalds
2 siblings, 1 reply; 22+ messages in thread
From: Andrew Morton @ 2005-07-22 3:27 UTC (permalink / raw)
To: Chuck Ebbert; +Cc: linux-kernel, torvalds
Chuck Ebbert <76306.1226@compuserve.com> wrote:
>
>
> This patch makes restore_fpu() an inline. When L1/L2 cache are saturated
> it makes a measurable difference.
>
> Results from profiling Volanomark follow. Sample rate was 2000 samples/sec
> (HZ = 250, profile multiplier = 8) on a dual-processor Pentium II Xeon.
>
>
> Before:
>
> 10680 restore_fpu 333.7500
> 8351 device_not_available 203.6829
> 3823 math_state_restore 59.7344
> -----
> 22854
>
>
> After:
>
> 12534 math_state_restore 130.5625
> 8354 device_not_available 203.7561
> -----
> 20888
>
>
> Patch is "obviously correct" and cuts 9% of the overhead. Please apply.
hm. What context switch rate is that thing doing?
Is the benchmark actually doing floating point stuff?
We do have the `used_math' optimisation in there which attempts to avoid
doing the FP save/restore if the app isn't actually using math. But
<ancient recollections> there's code in glibc startup which always does a
bit of float, so that optimisation is always defeated. There was some
discussion about periodically setting tasks back into !used_math state to
try to restore the optimisation for tasks which only do a little bit of FP,
but nothing actually got done.
> Next step should be to physically place math_state_restore() after
> device_not_available(). Would such a patch be accepted? (Yes it
> would be ugly and require linker script changes.)
Depends on the benefit/ugly ratio ;)
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [patch 2.6.13-rc3a] i386: inline restore_fpu
2005-07-22 3:27 ` Andrew Morton
@ 2005-07-22 5:22 ` Linus Torvalds
2005-07-22 11:23 ` Arjan van de Ven
0 siblings, 1 reply; 22+ messages in thread
From: Linus Torvalds @ 2005-07-22 5:22 UTC (permalink / raw)
To: Andrew Morton; +Cc: Chuck Ebbert, linux-kernel
On Fri, 22 Jul 2005, Andrew Morton wrote:
>
> Is the benchmark actually doing floating point stuff?
It must be. We still do lazy FP saves.
> We do have the `used_math' optimisation in there which attempts to avoid
> doing the FP save/restore if the app isn't actually using math.
No, it's more than that. There's a per-process "used_math" flag to
determine if we need to _initialize_ the FPU, but on context switches we
always assume the program we're switching to will _not_ use FP, and we
just set the "fault on FP" flag and do not normally restore FP state.
It seems volanomark will always use FP, if this is the hot path.
We'll only save the FP context if the thread has used the FP in _that_
particular time-slice (TS_USEDFPU).
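The lazy protocol described here -- set TS on every switch, trap on the first
FP instruction, restore in the trap handler -- can be sketched as a small
user-space simulation (flag and function names mirror the kernel's, but this
is purely an illustration, not kernel code):

```c
#include <assert.h>
#include <stdbool.h>

/* Simulated state for the lazy-FPU protocol: on every context switch
 * CR0.TS is set, so the first FP instruction the incoming task runs
 * traps to device_not_available, whose handler (math_state_restore)
 * loads that task's FP context. */
struct task { bool used_math; int fp_regs; };

static bool cr0_ts;        /* "fault on FP" flag */
static bool ts_usedfpu;    /* current task touched FP this timeslice */
static int fpu_regs;       /* the single hardware FPU register file */
static int restore_traps;  /* how many lazy-restore faults were taken */

static void context_switch(struct task *prev, struct task *next)
{
    (void)next;
    if (ts_usedfpu) {              /* save only if prev actually used FP */
        prev->fp_regs = fpu_regs;
        ts_usedfpu = false;
    }
    cr0_ts = true;                 /* assume next will not use FP */
}

static void math_state_restore(struct task *tsk)
{
    restore_traps++;
    if (!tsk->used_math) {         /* first ever FP use: "fninit" */
        tsk->used_math = true;
        tsk->fp_regs = 0;
    }
    fpu_regs = tsk->fp_regs;       /* frstor/fxrstor */
    cr0_ts = false;
    ts_usedfpu = true;
}

static void fp_insn(struct task *tsk, int val)
{
    if (cr0_ts)                    /* #NM exception */
        math_state_restore(tsk);
    fpu_regs += val;               /* the FP work itself */
}
```

A task that touches the FPU in every timeslice takes one such trap per
context switch, which is exactly the hot path the volanomark profile shows.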
As to why volanomark also uses FP, I don't know. I wouldn't be surprised
if the benchmark was designed by somebody to not benefit from the x87
state save optimization.
On the other hand, I also wouldn't be surprised if glibc (or similar
system libraries) is over-eagerly using things like SSE instructions for
memcopies etc, not realizing that they can have serious downsides. I don't
see why volanomark would use FP, but hey, it's billed as a java VM and
thread torture test for "chatrooms". Whatever.
Linus
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [patch 2.6.13-rc3a] i386: inline restore_fpu
2005-07-22 3:06 [patch 2.6.13-rc3a] i386: inline restore_fpu Chuck Ebbert
2005-07-22 3:27 ` Andrew Morton
@ 2005-07-22 8:14 ` Adrian Bunk
2005-07-22 18:13 ` Linus Torvalds
2005-07-22 23:19 ` Linus Torvalds
2 siblings, 1 reply; 22+ messages in thread
From: Adrian Bunk @ 2005-07-22 8:14 UTC (permalink / raw)
To: Chuck Ebbert; +Cc: linux-kernel, Andrew Morton, Linus Torvalds
On Thu, Jul 21, 2005 at 11:06:22PM -0400, Chuck Ebbert wrote:
>
> This patch makes restore_fpu() an inline. When L1/L2 cache are saturated
> it makes a measurable difference.
>
> Results from profiling Volanomark follow. Sample rate was 2000 samples/sec
> (HZ = 250, profile multiplier = 8) on a dual-processor Pentium II Xeon.
>
>
> Before:
>
> 10680 restore_fpu 333.7500
> 8351 device_not_available 203.6829
> 3823 math_state_restore 59.7344
> -----
> 22854
>
>
> After:
>
> 12534 math_state_restore 130.5625
> 8354 device_not_available 203.7561
> -----
> 20888
>
>
> Patch is "obviously correct" and cuts 9% of the overhead. Please apply.
>...
If this patch makes a difference, could you do me a favour and check
whether replacing the current cpu_has_fxsr #define in
include/asm-i386/cpufeature.h with
#define cpu_has_fxsr 1
on top of your patch brings an additional improvement?
> Chuck
TIA
Adrian
--
"Is there not promise of rain?" Ling Tan asked suddenly out
of the darkness. There had been need of rain for many days.
"Only a promise," Lao Er said.
Pearl S. Buck - Dragon Seed
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [patch 2.6.13-rc3a] i386: inline restore_fpu
@ 2005-07-22 9:58 Chuck Ebbert
0 siblings, 0 replies; 22+ messages in thread
From: Chuck Ebbert @ 2005-07-22 9:58 UTC (permalink / raw)
To: Andrew Morton; +Cc: torvalds, linux-kernel
On Fri, 22 Jul 2005 at 13:27:56 +1000, Andrew Morton wrote:
> hm. What context switch rate is that thing doing?
23000 - 25000 / second. I guess that explains why my attempt to
duplicate this in C failed -- it switches only 15-19 times per second:
/* i387 context switch benchmark
 * compile this program without optimization
 */
#define _GNU_SOURCE /* for cpu_set_t, CPU_SET, sched_setaffinity */
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h> /* atol, atoi, exit */
#include <signal.h>
#include <sched.h>
int lo = 0, fp = 0, af = 0;
long i, iters;
pid_t child;
cpu_set_t cpuset;
void handler(int sig) {
	lo = 0; /* stop looping */
}
struct sigaction sa = {
	.sa_handler = handler,
};
int main(int argc, char *argv[]) {
	if (argc < 2) /* mandatory loop iteration count */
		exit(1);
	iters = atol(argv[1]);
	if (argc > 2) /* loop in parent while waiting for child to exit */
		lo = atoi(argv[2]);
	if (argc > 3) /* use FP while wait-looping */
		fp = atoi(argv[3]);
	if (argc > 4) /* use cpu affinity -- all code runs on cpu 0 */
		af = atoi(argv[4]);
	CPU_ZERO(&cpuset);
	CPU_SET(0, &cpuset);
	child = fork();
	if (child) { /* parent */
		if (af)
			sched_setaffinity(0, sizeof(cpuset), &cpuset);
		if (lo) {
			sigaction(SIGCHLD, &sa, NULL);
			if (fp)
				__asm__ __volatile__("fld1" "\n\t" "fldz");
			while (lo)
				if (fp)
					__asm__ __volatile__(
						"fadd %st(1), %st(0)");
		} else
			wait(NULL);
	} else { /* child */
		if (af)
			sched_setaffinity(0, sizeof(cpuset), &cpuset);
		__asm__ __volatile__("fld1" "\n\t" "fldz");
		for (i = 0; i < iters; i++)
			__asm__ __volatile__("fadd %st(1), %st(0)");
	}
	return 0;
}
> Is the benchmark actually doing floating point stuff?
They only release .class files. ISTR java does FP math even when
developers think they are doing integer math, but I can't find any
references.
__
Chuck
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [patch 2.6.13-rc3a] i386: inline restore_fpu
2005-07-22 5:22 ` Linus Torvalds
@ 2005-07-22 11:23 ` Arjan van de Ven
0 siblings, 0 replies; 22+ messages in thread
From: Arjan van de Ven @ 2005-07-22 11:23 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Andrew Morton, Chuck Ebbert, linux-kernel
> > We do have the `used_math' optimisation in there which attempts to avoid
> > doing the FP save/restore if the app isn't actually using math.
>
> No, it's more than that. There's a per-processor "used_math" flag to
> determine if we need to _initialize_ the FPU, but on context switches we
> always assume the program we're switching to will _not_ use FP, and we
> just set the "fault on FP" flag and do not normally restore FP state.
This shows room for optimization; if an app is consistently faulting to
use FP after a context switch, in principle the kernel could start to
assume that it will in the next timeslice as well.
> On the other hand, I also wouldn't be surprised if glibc (or similar
I doubt glibc normally is; at least most distros don't ship an
SSE-enabled glibc, only an "i686" one.
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [patch 2.6.13-rc3a] i386: inline restore_fpu
2005-07-22 8:14 ` Adrian Bunk
@ 2005-07-22 18:13 ` Linus Torvalds
2005-07-25 19:26 ` Bill Davidsen
0 siblings, 1 reply; 22+ messages in thread
From: Linus Torvalds @ 2005-07-22 18:13 UTC (permalink / raw)
To: Adrian Bunk; +Cc: Chuck Ebbert, linux-kernel, Andrew Morton
On Fri, 22 Jul 2005, Adrian Bunk wrote:
>
> If this patch makes a difference, could you do me a favour and check
> whether replacing the current cpu_has_fxsr #define in
> include/asm-i386/cpufeature.h with
>
> #define cpu_has_fxsr 1
>
> on top of your patch brings an additional improvement?
It would be really sad if it made a difference. There might be a branch
mispredict, but the real expense of the fnsave/fxsave will be that
instruction itself, and any cache misses associated with it. The 9%
performance difference would almost have to be due to a memory bank
conflict or something (likely some unnecessary I$ prefetching that
interacts badly with the writeback needed for the _big_ memory write
forced by the fxsave).
I can't see any way that a single branch mispredict could make that big of
a difference, but I _can_ see how bad memory access patterns could do it.
Btw, the switch from fnsave to fxsave (and thus the change from a 112-byte
save area to a 512-byte one, or whatever the exact details are) caused
_huge_ performance degradation for various context switching benchmarks. I
really hated that, but obviously the need to support SSE2 made it
non-optional. The point being that the real overhead is that big memory
read/write in fxrestor/fxsave.
What _could_ make a bigger difference is not doing the lazy FPU at all.
That lazy FPU is a huge optimization on 99.9% of all loads, but it sounds
like java/volanomark are broken and always use the FPU, and then we take a
big hit on doing the FP restore exception (an exception is a lot more
expensive than a mispredict).
Something like the following (totally untested) should make it be
non-lazy. It's going to slow down normal task switches, but might speed up
the "restoring FP context all the time" case.
Chuck? This should work fine with or without your inline thing. Does it
make any difference?
Linus
-----
diff --git a/arch/i386/kernel/process.c b/arch/i386/kernel/process.c
--- a/arch/i386/kernel/process.c
+++ b/arch/i386/kernel/process.c
@@ -678,8 +678,16 @@ struct task_struct fastcall * __switch_t
struct tss_struct *tss = &per_cpu(init_tss, cpu);
/* never put a printk in __switch_to... printk() calls wake_up*() indirectly */
-
- __unlazy_fpu(prev_p);
+ if (prev_p->thread_info->status & TS_USEDFPU) {
+ __save_init_fpu(prev_p);
+ if (tsk_used_math(next_p))
+ init_fpu(next_p);
+ restore_fpu(next_p);
+ next_p->thread_info->status |= TS_USEDFPU;
+ } else {
+ stts();
+ next_p->thread_info->status &= ~TS_USEDFPU;
+ }
/*
* Reload esp0, LDT and the page table pointer:
@@ -710,13 +718,13 @@ struct task_struct fastcall * __switch_t
* Now maybe reload the debug registers
*/
if (unlikely(next->debugreg[7])) {
- set_debugreg(current->thread.debugreg[0], 0);
- set_debugreg(current->thread.debugreg[1], 1);
- set_debugreg(current->thread.debugreg[2], 2);
- set_debugreg(current->thread.debugreg[3], 3);
+ set_debugreg(next->debugreg[0], 0);
+ set_debugreg(next->debugreg[1], 1);
+ set_debugreg(next->debugreg[2], 2);
+ set_debugreg(next->debugreg[3], 3);
/* no 4 and 5 */
- set_debugreg(current->thread.debugreg[6], 6);
- set_debugreg(current->thread.debugreg[7], 7);
+ set_debugreg(next->debugreg[6], 6);
+ set_debugreg(next->debugreg[7], 7);
}
if (unlikely(prev->io_bitmap_ptr || next->io_bitmap_ptr))
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [patch 2.6.13-rc3a] i386: inline restore_fpu
2005-07-22 3:06 [patch 2.6.13-rc3a] i386: inline restore_fpu Chuck Ebbert
2005-07-22 3:27 ` Andrew Morton
2005-07-22 8:14 ` Adrian Bunk
@ 2005-07-22 23:19 ` Linus Torvalds
2 siblings, 0 replies; 22+ messages in thread
From: Linus Torvalds @ 2005-07-22 23:19 UTC (permalink / raw)
To: Chuck Ebbert; +Cc: linux-kernel, Andrew Morton
On Thu, 21 Jul 2005, Chuck Ebbert wrote:
>
> This patch makes restore_fpu() an inline. When L1/L2 cache are saturated
> it makes a measurable difference.
I've now pushed out an alternative fix for this, which imho is much
cleaner.
We've long had infrastructure for "alternative asm instructions" that are
conditional on CPU features, and I just made restore_fpu() use that
instead:
/*
* The "nop" is needed to make the instructions the same
* length.
*/
#define restore_fpu(tsk) \
alternative_input( \
"nop ; frstor %1", \
"fxrstor %1", \
X86_FEATURE_FXSR, \
"m" ((tsk)->thread.i387.fsave))
which not only makes it inline, but makes it a single instruction instead
of the mess it was before.
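For anyone unfamiliar with the alternative-instructions machinery: the kernel
rewrites the instruction bytes in place at boot, once the CPU features are
known, so no feature test remains at runtime. A rough user-space analogue
(invented names; a function pointer resolved once stands in for the in-place
patching, which is cheaper still since it avoids even the indirect call)
might look like:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the CPUID/X86_FEATURE_FXSR test. */
static int cpu_has_fxsr_flag = 1;

/* Two candidate implementations; the return values are arbitrary
 * markers so the selection can be observed. */
static int restore_with_frstor(int state)  { return state + 1; } /* legacy */
static int restore_with_fxrstor(int state) { return state + 2; } /* FXSR   */

/* Resolved exactly once at startup -- the analogue of the kernel
 * rewriting "nop ; frstor" into "fxrstor" in place.  Every later
 * call goes straight to the chosen routine with no feature check. */
static int (*restore_fpu_op)(int) = NULL;

static void apply_alternatives(void)
{
    restore_fpu_op = cpu_has_fxsr_flag ? restore_with_fxrstor
                                       : restore_with_frstor;
}
```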
Now, we should do the same for "fnsave ; fwait" vs "fxsave ; fnclex" too,
but I was lazy. If somebody wants to try that, it would need an
"alternative_output()" define but should otherwise be trivial too.
Linus
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [patch 2.6.13-rc3a] i386: inline restore_fpu
@ 2005-07-23 7:09 Chuck Ebbert
2005-07-23 17:38 ` Linus Torvalds
0 siblings, 1 reply; 22+ messages in thread
From: Chuck Ebbert @ 2005-07-23 7:09 UTC (permalink / raw)
To: Andrew Morton; +Cc: torvalds, linux-kernel, Arjan van de Ven, Adrian Bunk
On Fri, 22 Jul 2005 at 13:27:56 +1000, Andrew Morton wrote:
> Chuck Ebbert <76306.1226@compuserve.com> wrote:
> > After:
> >
> > 12534 math_state_restore 130.5625
> > 8354 device_not_available 203.7561
> > -----
> > 20888
> > Next step should be to physically place math_state_restore() after
> > device_not_available(). Would such a patch be accepted? (Yes it
> > would be ugly and require linker script changes.)
>
> Depends on the benefit/ugly ratio ;)
This patch (may not apply cleanly to unpatched -rc3) drops overhead
a few more percent:
11835 math_state_restore 144.3293
8096 device_not_available 197.4634
-----
19931
Most of the exception/interrupt handlers should get the same treatment but
the linker script would grow.
Signed-off-by: Chuck Ebbert <76306.1226@compuserve.com>
Index: 2.6.13-rc3a/arch/i386/kernel/entry.S
===================================================================
--- 2.6.13-rc3a.orig/arch/i386/kernel/entry.S 2005-07-16 18:09:48.000000000 -0400
+++ 2.6.13-rc3a/arch/i386/kernel/entry.S 2005-07-22 05:19:22.000000000 -0400
@@ -469,6 +469,7 @@
pushl $do_simd_coprocessor_error
jmp error_code
+ .section ".text.i387_1", "a"
ENTRY(device_not_available)
pushl $-1 # mark this as an int
SAVE_ALL
@@ -484,6 +485,8 @@
addl $4, %esp
jmp ret_from_exception
+ .previous
+
/*
* Debug traps and NMI can happen at the one SYSENTER instruction
* that sets up the real kernel stack. Check here, since we can't
Index: 2.6.13-rc3a/arch/i386/kernel/traps.c
===================================================================
--- 2.6.13-rc3a.orig/arch/i386/kernel/traps.c 2005-07-16 18:05:40.000000000 -0400
+++ 2.6.13-rc3a/arch/i386/kernel/traps.c 2005-07-22 14:02:50.000000000 -0400
@@ -972,7 +972,8 @@
* Must be called with kernel preemption disabled (in this case,
* local interrupts are disabled at the call-site in entry.S).
*/
-asmlinkage void math_state_restore(struct pt_regs regs)
+asmlinkage void __attribute__((__section__(".text.i387_2")))
+math_state_restore(struct pt_regs regs)
{
struct thread_info *thread = current_thread_info();
struct task_struct *tsk = thread->task;
Index: 2.6.13-rc3a/arch/i386/kernel/vmlinux.lds.S
===================================================================
--- 2.6.13-rc3a.orig/arch/i386/kernel/vmlinux.lds.S 2005-07-16 18:05:39.000000000 -0400
+++ 2.6.13-rc3a/arch/i386/kernel/vmlinux.lds.S 2005-07-22 05:20:30.000000000 -0400
@@ -20,6 +20,8 @@
_text = .; /* Text and read-only data */
.text : AT(ADDR(.text) - LOAD_OFFSET) {
*(.text)
+ *(.text.i387_1)
+ *(.text.i387_2)
SCHED_TEXT
LOCK_TEXT
*(.fixup)
__
Chuck
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [patch 2.6.13-rc3a] i386: inline restore_fpu
@ 2005-07-23 7:09 Chuck Ebbert
2005-07-23 17:33 ` Linus Torvalds
0 siblings, 1 reply; 22+ messages in thread
From: Chuck Ebbert @ 2005-07-23 7:09 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Andrew Morton, linux-kernel
On Fri, 22 Jul 2005 at 16:19:25 -0700 (PDT), Linus Torvalds wrote:
> /*
> * The "nop" is needed to make the instructions the same
> * length.
> */
> #define restore_fpu(tsk) \
> alternative_input( \
> "nop ; frstor %1", \
> "fxrstor %1", \
> X86_FEATURE_FXSR, \
> "m" ((tsk)->thread.i387.fsave))
Probably a very minor point, but shouldn't it be
"m" ((tsk)->thread.i387.fxsave))
because that's the largest possible operand that could end up being read?
> Now, we should do the same for "fnsave ; fwait" vs "fxsave ; fnclex" too,
> but I was lazy. If somebody wants to try that, it would need an
> "alternative_output()" define but should otherwise be trivial too.
Is that function called __save_init_fpu because it saves and then
initializes the fpu? Unlike fsave, fxsave does not init the fpu after
it saves the state; to make the behavior match it should be "fxsave ; fninit".
__
Chuck
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [patch 2.6.13-rc3a] i386: inline restore_fpu
[not found] ` <20050722132756.578acca7.akpm@osdl.org.suse.lists.linux.kernel>
@ 2005-07-23 15:35 ` Andi Kleen
0 siblings, 0 replies; 22+ messages in thread
From: Andi Kleen @ 2005-07-23 15:35 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-kernel, 76306.1226
Andrew Morton <akpm@osdl.org> writes:
>
> We do have the `used_math' optimisation in there which attempts to avoid
> doing the FP save/restore if the app isn't actually using math. But
> <ancient recollections> there's code in glibc startup which always does a
> bit of float, so that optimisation is always defeated. There was some
> discussion about periodically setting tasks back into !used_math state to
> try to restore the optimisation for tasks which only do a little bit of FP,
> but nothing actually got done.
Actually we reset the flag on every context switch, so that works just fine.
But I was considering to do it less often so that we switch the FP
state non lazily for FP intensive processes and avoid the overhead
of all these exceptions.
-Andi
P.S.: Original profile data looks a bit fishy. Normally avoiding a single
function call should not make that much difference unless you call
it in an inner loop, but that is not the case here.
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [patch 2.6.13-rc3a] i386: inline restore_fpu
2005-07-23 7:09 Chuck Ebbert
@ 2005-07-23 17:33 ` Linus Torvalds
0 siblings, 0 replies; 22+ messages in thread
From: Linus Torvalds @ 2005-07-23 17:33 UTC (permalink / raw)
To: Chuck Ebbert; +Cc: Andrew Morton, linux-kernel
On Sat, 23 Jul 2005, Chuck Ebbert wrote:
>
> Probably a very minor point, but shouldn't it be
>
> "m" ((tsk)->thread.i387.fxsave))
>
> because that's the largest possible operand that could end up being read?
Yes, I ended up fixing that already (along with handling the fxsave case
too).
> Is that function called __save_init_fpu because it saves and then
> initializes the fpu? Unlike fsave, fxsave does not init the fpu after
> it saves the state; to make the behavior match it should be "fxsave ; fninit"
The "init" part is really just "reset_exceptions" - we don't care about
the rest of the state, and the naming is historical (ie it's really called
"init" because "fnsave" does the reset by re-initializing everything, ie
there's an implied "fninit").
So for the fxsave path, we just use fnclex, since that's faster than a
full fninit.
Linus
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [patch 2.6.13-rc3a] i386: inline restore_fpu
2005-07-23 7:09 Chuck Ebbert
@ 2005-07-23 17:38 ` Linus Torvalds
2005-07-23 17:46 ` Arjan van de Ven
0 siblings, 1 reply; 22+ messages in thread
From: Linus Torvalds @ 2005-07-23 17:38 UTC (permalink / raw)
To: Chuck Ebbert; +Cc: Andrew Morton, linux-kernel, Arjan van de Ven, Adrian Bunk
On Sat, 23 Jul 2005, Chuck Ebbert wrote:
>
> This patch (may not apply cleanly to unpatched -rc3) drops overhead
> a few more percent:
That really is pretty ugly.
I'd rather hope for something like function-sections to just make games
like this be unnecessary. The linker really should be able to do things
like this for us (ie if it sees that the only reference to a function is
from another function, it should be able to just put them next to each
other anyway).
And yes, I realize that "should be able" isn't the same thing as "does",
but the thing is, doing it by hand ends up being a maintenance problem in
the long run. It also misses all the other cases where it might be
beneficial, but where you don't happen to run the right benchmark or look
at the right place.
So maybe a few hints to the binutils people might just make them go: "try
this patch/cmdline flag", and solve many more problems. They likely have a
lot of this kind of code _already_, or have at least been thinking about
it.
I personally believe that there's likely a lot more to be had from code
(and data) layout than there is from things like alias analysis or
aggressive inlining.
Linus
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [patch 2.6.13-rc3a] i386: inline restore_fpu
2005-07-23 17:38 ` Linus Torvalds
@ 2005-07-23 17:46 ` Arjan van de Ven
2005-07-23 18:02 ` Linus Torvalds
0 siblings, 1 reply; 22+ messages in thread
From: Arjan van de Ven @ 2005-07-23 17:46 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Chuck Ebbert, Andrew Morton, linux-kernel, Adrian Bunk
On Sat, 2005-07-23 at 10:38 -0700, Linus Torvalds wrote:
> So maybe a few hints to the binutils people might just make them go: "try
> this patch/cmdline flag", and solve many more problems. They likely have a
> lot of this kind of code _already_, or have at least been thinking about
> it.
>
> I personally believe that there's likely a lot more to be had from code
> (and data) layout than there is from things like alias analysis or
> aggressive inlining.
for userland it's not that complex and exists already; basically gprof
has this analysis capability, and from that there's tooling (well I
wrote it and I'm sure others did too) to create a linker script
automatically to order things according to their gprof-desired order.
This gprof based approach is taking actual runtime patterns into account
not just static callgraph analysis.
For the kernel the runtime measurement is obviously tricky (kgprof
anyone?) but the static analysis method really shouldn't be too hard.
(well I guess "optimal" will be NP complete, but "pretty damn close"
ought to be reasonable)
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [patch 2.6.13-rc3a] i386: inline restore_fpu
2005-07-23 17:46 ` Arjan van de Ven
@ 2005-07-23 18:02 ` Linus Torvalds
0 siblings, 0 replies; 22+ messages in thread
From: Linus Torvalds @ 2005-07-23 18:02 UTC (permalink / raw)
To: Arjan van de Ven; +Cc: Chuck Ebbert, Andrew Morton, linux-kernel, Adrian Bunk
On Sat, 23 Jul 2005, Arjan van de Ven wrote:
>
> For the kernel the runtime measurement is obviously tricky (kgprof
> anyone?)
Ehh, doesn't "opgprof" do that already?
Anyway, judging by real life, people just don't _do_ profile runs. Any
build automation that depends on the user profiling the result is a total
wet dream by compiler developers - it just doesn't happen in real life.
So:
> but the static analysis method really shouldn't be too hard.
I really think that the static analysis is the only one that actually
would matter in real life. It's also the only one that is repeatable,
which is a big big bonus, since at least that way different people are
running basically the same code (heisenbugs, anyone?) and benchmarks are
actually meaningful, since they don't depend on whatever load was picked
to generate the layout.
So dynamic analysis with dynamic profile data is just one big theoretical
compiler-writer masturbation session, in my not-so-humble opinion. I bet
static analysis (with possibly some hints from the programmer, the same
way we use likely/unlikely) gets you 90% of the way, without all the
downsides.
Linus
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [patch 2.6.13-rc3a] i386: inline restore_fpu
@ 2005-07-24 12:56 Kenneth Parrish
0 siblings, 0 replies; 22+ messages in thread
From: Kenneth Parrish @ 2005-07-24 12:56 UTC (permalink / raw)
To: linux-kernel
-=> Linus Torvalds wrote to All and Chuck Ebbert <=-
[..]
Chuck>
> "m" ((tsk)->thread.i387.fxsave))
>
> because that's the largest possible operand that could end up
> being read?
LT> Yes, I ended up fixing that already (along with handling the fxsave
LT> case too).
[..]
2.6.13-rc3-git5 lags -git4 on nbench's bitfield test, a part of its
memory index, e.g. typical runs:
-git5
BITFIELD : 1.1589e+07 : 1.99 : 0.42
-git4
BITFIELD : 2.1892e+07 : 3.76 : 0.78
lmbench has slower context switch numbers for -git5.
--- MultiMail/Linux v0.46
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [patch 2.6.13-rc3a] i386: inline restore_fpu
@ 2005-07-25 2:34 Kenneth Parrish
0 siblings, 0 replies; 22+ messages in thread
From: Kenneth Parrish @ 2005-07-25 2:34 UTC (permalink / raw)
To: linux-kernel
-=> Kenneth Parrish wrote to All <=-
KP> 2.6.13-rc3-git5 lags -git4 on nbench's bitfield test, a part of its
KP> memory index, eg typical runs
KP> -git5
KP> BITFIELD : 1.1589e+07 : 1.99 : 0.42
KP> -git4
KP> BITFIELD : 2.1892e+07 : 3.76 : 0.78
KP> lmbench has slower context switch numbers for -git5.
i've been experimentally adding -funswitch-loops and -fpeel-loops to the
default -march=i586 in the ../arch/i386/Makefile
cflags-$(CONFIG_M586) += -march=i586 -funswitch-loops -fpeel-loops
based on some tests with gcc 3.4.3/3.4.4 and the cyrix mii cpu, but
with v2.6.13-rc3-git5 I carelessly put these flags on the next line,
cflags-$(CONFIG_M586TSC), and so they would not have been used.
i shall retest after recompiling to confirm.
[..]
yeah, a couple nbench runs show similar overall results as -git4
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [patch 2.6.13-rc3a] i386: inline restore_fpu
2005-07-22 18:13 ` Linus Torvalds
@ 2005-07-25 19:26 ` Bill Davidsen
0 siblings, 0 replies; 22+ messages in thread
From: Bill Davidsen @ 2005-07-25 19:26 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Chuck Ebbert, linux-kernel, Andrew Morton
Linus Torvalds wrote:
>
> On Fri, 22 Jul 2005, Adrian Bunk wrote:
>
>>If this patch makes a difference, could you do me a favour and check
>>whether replacing the current cpu_has_fxsr #define in
>>include/asm-i386/cpufeature.h with
>>
>> #define cpu_has_fxsr 1
>>
>>on top of your patch brings an additional improvement?
>
>
> It would be really sad if it made a difference. There might be a branch
> mispredict, but the real expense of the fnsave/fxsave will be that
> instruction itself, and any cache misses associated with it. The 9%
> performance difference would almost have to be due to a memory bank
> conflict or something (likely some unnecessary I$ prefetching that
> interacts badly with the writeback needed for the _big_ memory write
> forced by the fxsave).
>
> I can't see any way that a single branch mispredict could make that big of
> a difference, but I _can_ see how bad memory access patterns could do it.
>
> Btw, the switch from fnsave to fxsave (and thus the change from a 112-byte
> save area to a 512-byte one, or whatever the exact details are) caused
> _huge_ performance degradation for various context switching benchmarks. I
> really hated that, but obviously the need to support SSE2 made it
> non-optional. The point being that the real overhead is that big memory
> read/write in fxrestor/fxsave.
>
> What _could_ make a bigger difference is not doing the lazy FPU at all.
> That lazy FPU is a huge optimization on 99.9% of all loads, but it sounds
> like java/volanomark are broken and always use the FPU, and then we take a
> big hit on doing the FP restore exception (an exception is a lot more
> expensive than a mispredict).
It seems expensive to do the save/restore when it isn't needed; that's
why the code got lazy. Would it be useful to have a small flag or count
field and start by assuming that FPU is not used, and if the exception
takes place set the count to unconditionally save the FP state for some
number of context switches and then try reverting to lazy save?
That 99.9% may be a guess, but I suspect that there are a lot of
applications which alternate between using FPU and not, even if they do
use FPU for some parts of the application. That way the performance of
lazy save would be realized for the common applications, and the
overhead of exception would be greatly reduced for both the ill-behaved
and legitimately FPU intensive application.
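The flag-plus-count scheme described above could be sketched roughly as
follows (policy only, with invented names and an arbitrary threshold; the
real switch path would set or clear TS based on the answer):

```c
#include <assert.h>
#include <stdbool.h>

#define EAGER_SWITCHES 16   /* hypothetical: stay eager this many switches */

struct fpu_policy {
    int eager_left;         /* >0: restore FP state unconditionally */
};

/* Called from the #NM (device_not_available) path: the task faulted
 * to use FP right after a switch, so bet that it will do so again. */
static void note_fpu_fault(struct fpu_policy *p)
{
    p->eager_left = EAGER_SWITCHES;
}

/* Called at context-switch time: returns true if the FP context should
 * be restored up front instead of waiting for an exception. */
static bool switch_in_eagerly(struct fpu_policy *p)
{
    if (p->eager_left > 0) {
        p->eager_left--;    /* decay back to lazy after a while */
        return true;
    }
    return false;
}
```

Legitimately FP-intensive tasks keep refreshing the count via their own
faults, while tasks that only did a little FP fall back to the lazy path.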
>
> Something like the following (totally untested) should make it be
> non-lazy. It's going to slow down normal task switches, but might speed up
> the "restoring FP context all the time" case.
>
> Chuck? This should work fine with or without your inline thing. Does it
> make any difference?
--
-bill davidsen (davidsen@tmr.com)
"The secret to procrastination is to put things off until the
last possible moment - but no longer" -me
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [patch 2.6.13-rc3a] i386: inline restore_fpu
@ 2005-07-26 21:23 Chuck Ebbert
0 siblings, 0 replies; 22+ messages in thread
From: Chuck Ebbert @ 2005-07-26 21:23 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Adrian Bunk, Arjan van de Ven, linux-kernel, Andrew Morton
On Sat, 23 Jul 2005 at 10:38:40 -0700 (PDT), Linus Torvalds wrote:
> On Sat, 23 Jul 2005, Chuck Ebbert wrote:
> >
> > This patch (may not apply cleanly to unpatched -rc3) drops overhead
> > a few more percent:
>
> That really is pretty ugly.
>
> I'd rather hope for something like function-sections to just make games
> like this be unnecessary.
The link stage might cause meltdown on typical machines, though.
__
Chuck
* Re: [patch 2.6.13-rc3a] i386: inline restore_fpu
@ 2005-07-26 21:23 Chuck Ebbert
2005-07-26 21:47 ` Linus Torvalds
0 siblings, 1 reply; 22+ messages in thread
From: Chuck Ebbert @ 2005-07-26 21:23 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Andrew Morton, linux-kernel
On Fri, 22 Jul 2005 at 11:13:21 -0700 (PDT), Linus Torvalds wrote:
> Something like the following (totally untested) should make it be
> non-lazy. It's going to slow down normal task switches, but might speed up
> the "restoring FP context all the time" case.
>
> Chuck? This should work fine with or without your inline thing. Does it
> make any difference?
Now that I am looking at the right output file -- yes, it does (after fixing
one small bug.) Volanomark set a new record of 6125 messages/sec and the
profile shows almost zero hits in math_state_restore.
Since fxsave leaves the FPU state intact, there ought to be a better way to do
this but it gets tricky. Maybe using the TSC to put a timestamp in every thread
save area?
when saving FPU state:
put cpu# and timestamp in thread state info
also store timestamp in per-cpu data
on task switch:
compare cpu# and timestamps for next task
if equal, clear TS and set TS_USEDFPU
when state becomes invalid for some reason:
zero cpu's timestamp
But the extra overhead might be too much in many cases.
(Below is what I changed in the original patch, if anyone wants to try it.)
+ if (tsk_used_math(next_p))
should be:
+ if (!tsk_used_math(next_p))
__
Chuck
* Re: [patch 2.6.13-rc3a] i386: inline restore_fpu
2005-07-26 21:23 Chuck Ebbert
@ 2005-07-26 21:47 ` Linus Torvalds
0 siblings, 0 replies; 22+ messages in thread
From: Linus Torvalds @ 2005-07-26 21:47 UTC (permalink / raw)
To: Chuck Ebbert; +Cc: Andrew Morton, linux-kernel
On Tue, 26 Jul 2005, Chuck Ebbert wrote:
>
> Since fxsave leaves the FPU state intact, there ought to be a better way to do
> this but it gets tricky. Maybe using the TSC to put a timestamp in every thread
> save area?
We used to have totally lazy FP saving, and we didn't touch the FP state
at _all_ in the scheduler except to just set the TS bit.
It worked wonderfully well on UP, but getting it working on SMP is a major
pain, since the lazy state you want to switch back into might be cached on
some other CPU's registers, so we never did it on SMP. Eventually it got
too painful to maintain two totally different logical code-paths between
UP and SMP, and some bug or other ended up resulting in the current "lazy
on a time slice level" thing which works well in SMP too.
Also, a lot of the cost is really the save, and before SSE2 the fnsave
would clear the FPU state, so you couldn't just do a save and try to elide
just the restore in the lazy case. In SSE2 (with fxsave) we _could_ try to
do that, but the thing is, I doubt it really helps.
First off, 99% of all programs don't hit the nasty case at all, and for
something broken like volanomark that _does_ hit it, I bet that there is
more than one thread using the FP, so you can't just cache the FP state
in the CPU _anyway_.
So we could enhance the current state by having a "nonlazy" mode like in
the example patch, except we'd have to make it a dynamic flag. Which could
either be done by explicitly marking binaries we want to be non-lazy, or
by just dynamically noticing that the rate of FP restores is very high.
Does anybody really care about volanomark? Quite frankly, I think you'd
see a _lot_ more performance improvement if you could instead teach the
Java stuff not to use FP all the time, so it feels a bit like papering
over the _real_ bug if we'd try to optimize this abnormal and silly case
in the kernel.
Linus
* Re: [patch 2.6.13-rc3a] i386: inline restore_fpu
@ 2005-07-27 1:40 linux
0 siblings, 0 replies; 22+ messages in thread
From: linux @ 2005-07-27 1:40 UTC (permalink / raw)
To: 76306.1226; +Cc: linux-kernel
> Since fxsave leaves the FPU state intact, there ought to be a better way
> to do this but it gets tricky. Maybe using the TSC to put a timestamp
> in every thread save area?
>
> when saving FPU state:
> put cpu# and timestamp in thread state info
> also store timestamp in per-cpu data
>
> on task switch:
> compare cpu# and timestamps for next task
> if equal, clear TS and set TS_USEDFPU
>
> when state becomes invalid for some reason:
> zero cpu's timestamp
>
> But the extra overhead might be too much in many cases.
Simpler:
- Thread has "CPU that I last used FPU on" pointer. Never NULL.
- Each CPU has "thread whose FPU state I hold" pointer. May be NULL.
When *loading* FPU state:
- Set up both pointers.
On task switch:
- If the pointers point to each other, then clear TS and skip restore.
("Preloaded")
When state becomes invalid (kernel MMX use, or whatever)
- Set CPU's pointer to NULL.
On thread creation:
- If current CPU's thread pointer points to the newly allocated thread,
clear it to NULL.
- Set thread's CPU pointer to current CPU.
The UP case just omits the per-thread CPU pointer. (Well, stores
it in zero bits.)
An alternative SMP thread-creation case would be to have a NULL value for
the thread-to-CPU pointer and initialize the thread's CPU pointer to that,
but that then complicates the UP case.
Thread overview: 22+ messages
-- links below jump to the message on this page --
2005-07-22 3:06 [patch 2.6.13-rc3a] i386: inline restore_fpu Chuck Ebbert
2005-07-22 3:27 ` Andrew Morton
2005-07-22 5:22 ` Linus Torvalds
2005-07-22 11:23 ` Arjan van de Ven
2005-07-22 8:14 ` Adrian Bunk
2005-07-22 18:13 ` Linus Torvalds
2005-07-25 19:26 ` Bill Davidsen
2005-07-22 23:19 ` Linus Torvalds
-- strict thread matches above, loose matches on Subject: below --
2005-07-22 9:58 Chuck Ebbert
2005-07-23 7:09 Chuck Ebbert
2005-07-23 17:38 ` Linus Torvalds
2005-07-23 17:46 ` Arjan van de Ven
2005-07-23 18:02 ` Linus Torvalds
2005-07-23 7:09 Chuck Ebbert
2005-07-23 17:33 ` Linus Torvalds
[not found] <200507212309_MC3-1-A534-95EF@compuserve.com.suse.lists.linux.kernel>
[not found] ` <20050722132756.578acca7.akpm@osdl.org.suse.lists.linux.kernel>
2005-07-23 15:35 ` Andi Kleen
2005-07-24 12:56 Kenneth Parrish
2005-07-25 2:34 Kenneth Parrish
2005-07-26 21:23 Chuck Ebbert
2005-07-26 21:23 Chuck Ebbert
2005-07-26 21:47 ` Linus Torvalds
2005-07-27 1:40 linux