* [PATCH] Use x86 SSE instructions for clear_page, copy_page
@ 2004-08-17 6:13 Jens Maurer
2004-08-17 7:27 ` Arjan van de Ven
2004-08-18 7:00 ` Ingo Molnar
0 siblings, 2 replies; 9+ messages in thread
From: Jens Maurer @ 2004-08-17 6:13 UTC (permalink / raw)
To: Linux Kernel
The attached patch (against kernel 2.6.8.1) enables using SSE
instructions for copy_page and clear_page.
A user-space test on my Pentium III 850 MHz shows a 3x speedup for
clear_page (compared to the default "rep stosl"), and a 50% speedup
for copy_page (compared to the default "rep movsl"). For a Pentium-4,
the speedup is about 50% in both the clear_page and copy_page cases.
The attached (admittedly perverse) user-space program
"malloc-fork-load.c" takes 30 sec with stock kernel 2.6.8.1, which
improves to about 15 sec when running a kernel with the attached
kernel patch applied.
Notes: I cannot replace clear_page and copy_page with their SSE
equivalents at compile-time, because clear_page is used before the CPU
is fully set up (in particular the CR4.OSFXSR bit, without which SSE
instructions kill the kernel with an invalid opcode exception). The
current function-pointer based approach could be extended to include
the current MMX-based improvements for AMD CPUs as well. If a
function pointer is considered too wasteful for a boot-time
initialization issue, a "memcpy" approach similar to the
"apply_alternatives()" code modifications would be possible.
Please test.
Jens Maurer
diff -urN -X /home/jmaurer/Linux/excludes-for-diff.txt linux-2.6.8.1.orig/arch/i386/Kconfig linux-2.6.8.1/arch/i386/Kconfig
--- linux-2.6.8.1.orig/arch/i386/Kconfig Mon Aug 16 22:02:03 2004
+++ linux-2.6.8.1/arch/i386/Kconfig Mon Aug 16 21:56:09 2004
@@ -419,6 +419,11 @@
depends on MCYRIXIII || MK7
default y
+config X86_USE_SSE
+ bool
+ depends on MPENTIUMIII || MPENTIUMM || MPENTIUM4
+ default y
+
config X86_OOSTORE
bool
depends on (MWINCHIP3D || MWINCHIP2 || MWINCHIPC6) && MTRR
diff -urN -X /home/jmaurer/Linux/excludes-for-diff.txt linux-2.6.8.1.orig/arch/i386/kernel/setup.c linux-2.6.8.1/arch/i386/kernel/setup.c
--- linux-2.6.8.1.orig/arch/i386/kernel/setup.c Mon Aug 16 22:02:04 2004
+++ linux-2.6.8.1/arch/i386/kernel/setup.c Mon Aug 16 21:56:14 2004
@@ -1241,13 +1241,46 @@
}
static int no_replacement __initdata = 0;
-
+
+#ifdef CONFIG_X86_USE_SSE
+
+static void std_clear_page(void *page)
+{
+ int d0, d1;
+ asm volatile("cld\n\t"
+ "rep; stosl"
+ : "=&c" (d0), "=&D" (d1)
+ : "a" (0), "0" (PAGE_SIZE/4), "1" (page)
+ : "memory");
+}
+
+static void std_copy_page(void *to, void *from)
+{
+ int d0, d1, d2;
+ asm volatile("cld\n\t"
+ "rep; movsl"
+ : "=&c" (d0), "=&D" (d1), "=&S" (d2)
+ : "0" (PAGE_SIZE/4), "1" (to), "2" (from)
+ : "memory");
+}
+
+void (*__sse_clear_page)(void *) = &std_clear_page;
+void (*__sse_copy_page)(void *, void *) = &std_copy_page;
+EXPORT_SYMBOL(__sse_clear_page);
+EXPORT_SYMBOL(__sse_copy_page);
+#endif
+
void __init alternative_instructions(void)
{
extern struct alt_instr __alt_instructions[], __alt_instructions_end[];
+ extern void activate_sse_replacements(void);
if (no_replacement)
return;
apply_alternatives(__alt_instructions, __alt_instructions_end);
+
+#ifdef CONFIG_X86_USE_SSE
+ activate_sse_replacements();
+#endif
}
static int __init noreplacement_setup(char *s)
diff -urN -X /home/jmaurer/Linux/excludes-for-diff.txt linux-2.6.8.1.orig/arch/i386/lib/Makefile linux-2.6.8.1/arch/i386/lib/Makefile
--- linux-2.6.8.1.orig/arch/i386/lib/Makefile Mon Aug 16 22:02:05 2004
+++ linux-2.6.8.1/arch/i386/lib/Makefile Mon Aug 16 21:56:15 2004
@@ -7,4 +7,5 @@
bitops.o
lib-$(CONFIG_X86_USE_3DNOW) += mmx.o
+lib-$(CONFIG_X86_USE_SSE) += sse.o
lib-$(CONFIG_HAVE_DEC_LOCK) += dec_and_lock.o
diff -urN -X /home/jmaurer/Linux/excludes-for-diff.txt linux-2.6.8.1.orig/arch/i386/lib/sse.c linux-2.6.8.1/arch/i386/lib/sse.c
--- linux-2.6.8.1.orig/arch/i386/lib/sse.c Thu Jan 1 01:00:00 1970
+++ linux-2.6.8.1/arch/i386/lib/sse.c Mon Aug 9 00:57:23 2004
@@ -0,0 +1,115 @@
+/*
+ * linux/arch/i386/lib/sse.c
+ *
+ * Copyright 2004 Jens Maurer
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
+ *
+ * Send feedback to <Jens.Maurer@gmx.net>
+ */
+
+#include <linux/config.h>
+#include <linux/module.h>
+#include <linux/types.h>
+#include <linux/kernel.h>
+#include <linux/preempt.h>
+#include <asm/page.h>
+#include <asm/system.h>
+
+
+/*
+ * SSE library helper functions
+ */
+
+#define SSE_START(cr0) do { \
+ preempt_disable(); \
+ cr0 = read_cr0(); \
+ clts(); \
+ } while(0)
+
+
+#define SSE_END(cr0) do { \
+ write_cr0(cr0); \
+ preempt_enable(); \
+ } while(0)
+
+static void sse_clear_page(void * page)
+{
+ unsigned char xmm_save[16];
+ unsigned int cr0;
+ int i;
+
+ SSE_START(cr0);
+ asm volatile("movups %%xmm0, (%0)\n\t"
+ "xorps %%xmm0, %%xmm0"
+ : : "r" (xmm_save));
+ for(i = 0; i < PAGE_SIZE/16/4; i++) {
+ asm volatile("movntps %%xmm0, (%0)\n\t"
+ "movntps %%xmm0, 16(%0)\n\t"
+ "movntps %%xmm0, 32(%0)\n\t"
+ "movntps %%xmm0, 48(%0)"
+ : : "r"(page) : "memory");
+ page += 16*4;
+ }
+ asm volatile("sfence\n\t"
+ "movups (%0), %%xmm0"
+ : : "r" (xmm_save) : "memory");
+ SSE_END(cr0);
+}
+
+static void sse_copy_page(void *to, void *from)
+{
+ unsigned char tmp[16*4+15] __attribute__((aligned(16)));
+ /* gcc 3.4.x does not honor alignment requests for stack variables */
+ unsigned char * xmm_save =
+ (unsigned char *)ALIGN((unsigned long)tmp, 16);
+ unsigned int cr0;
+ int i;
+
+ SSE_START(cr0);
+ asm volatile("movaps %%xmm0, (%0)\n\t"
+ "movaps %%xmm1, 16(%0)\n\t"
+ "movaps %%xmm2, 32(%0)\n\t"
+ "movaps %%xmm3, 48(%0)"
+ : : "r" (xmm_save));
+ for(i = 0; i < PAGE_SIZE/16/4; i++) {
+ asm volatile("movaps (%0), %%xmm0\n\t"
+ "movaps 16(%0), %%xmm1\n\t"
+ "movaps 32(%0), %%xmm2\n\t"
+ "movaps 48(%0), %%xmm3\n\t"
+ "movntps %%xmm0, (%1)\n\t"
+ "movntps %%xmm1, 16(%1)\n\t"
+ "movntps %%xmm2, 32(%1)\n\t"
+ "movntps %%xmm3, 48(%1)"
+ : : "r" (from), "r" (to) : "memory");
+ from += 16*4;
+ to += 16*4;
+ }
+ asm volatile("sfence\n"
+ "movaps (%0), %%xmm0\n\t"
+ "movaps 16(%0), %%xmm1\n\t"
+ "movaps 32(%0), %%xmm2\n\t"
+ "movaps 48(%0), %%xmm3"
+ : : "r" (xmm_save) : "memory");
+ SSE_END(cr0);
+}
+
+void activate_sse_replacements(void)
+{
+ if(cpu_has_xmm && (mmu_cr4_features & X86_CR4_OSFXSR)) {
+ __sse_clear_page = &sse_clear_page;
+ __sse_copy_page = &sse_copy_page;
+ }
+}
diff -urN -X /home/jmaurer/Linux/excludes-for-diff.txt linux-2.6.8.1.orig/include/asm-i386/page.h linux-2.6.8.1/include/asm-i386/page.h
--- linux-2.6.8.1.orig/include/asm-i386/page.h Mon Aug 16 22:04:14 2004
+++ linux-2.6.8.1/include/asm-i386/page.h Mon Aug 16 21:58:35 2004
@@ -21,6 +21,15 @@
#define clear_page(page) mmx_clear_page((void *)(page))
#define copy_page(to,from) mmx_copy_page(to,from)
+#elif defined(CONFIG_X86_USE_SSE)
+
+#include <asm/sse.h>
+
+extern void (*__sse_clear_page)(void *);
+extern void (*__sse_copy_page)(void *, void*);
+#define clear_page(page) (*__sse_clear_page)(page)
+#define copy_page(to,from) (*__sse_copy_page)(to,from)
+
#else
/*
diff -urN -X /home/jmaurer/Linux/excludes-for-diff.txt linux-2.6.8.1.orig/include/asm-i386/sse.h linux-2.6.8.1/include/asm-i386/sse.h
--- linux-2.6.8.1.orig/include/asm-i386/sse.h Thu Jan 1 01:00:00 1970
+++ linux-2.6.8.1/include/asm-i386/sse.h Sun Aug 8 22:21:36 2004
@@ -0,0 +1,34 @@
+/*
+ * linux/include/asm-i386/sse.h
+ *
+ * Copyright 2004 Jens Maurer
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
+ *
+ */
+
+#ifndef _ASM_SSE_H
+#define _ASM_SSE_H
+
+/*
+ * SSE helper operations
+ */
+
+#include <linux/types.h>
+
+extern void sse_clear_page(void *page);
+extern void sse_copy_page(void *to, void *from);
+
+#endif
[-- Attachment #2: malloc-fork-load.c --]
[-- Type: text/plain, Size: 801 bytes --]
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/wait.h>

#define N 20240
#define SIZE 4096

int main()
{
	int k;
	for(k = 0; k < 10; k++) {
		int i = 0;
		int pid;
		unsigned char *mem = mmap(0, N*SIZE, PROT_READ|PROT_WRITE,
					  MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
		if(mem == MAP_FAILED)
			perror("mmap");
		printf("pagesize: %d\n", getpagesize());
		for(i = 0; i < N; i++)
			mem[i*SIZE] = i*1000000007ul;
		printf("pages allocated\n");
		pid = fork();
		if(pid == 0) {
			/* child: dirty every page to force copy-on-write */
			for(i = 0; i < N; i++)
				mem[i*SIZE+1] = i;
			printf("copy complete\n");
			exit(0);
		} else if(pid == -1) {
			perror("fork");
		} else {
			/* parent */
			waitpid(pid, NULL, 0);
		}
		munmap(mem, N*SIZE);
	}
	return 0;
}
* Re: [PATCH] Use x86 SSE instructions for clear_page, copy_page
2004-08-17 6:13 [PATCH] Use x86 SSE instructions for clear_page, copy_page Jens Maurer
@ 2004-08-17 7:27 ` Arjan van de Ven
2004-08-17 8:10 ` Andrey Panin
2004-08-17 22:40 ` Jens Maurer
2004-08-18 7:00 ` Ingo Molnar
1 sibling, 2 replies; 9+ messages in thread
From: Arjan van de Ven @ 2004-08-17 7:27 UTC (permalink / raw)
To: Jens Maurer; +Cc: Linux Kernel
On Tue, 2004-08-17 at 08:13, Jens Maurer wrote:
> The attached patch (against kernel 2.6.8.1) enables using SSE
> instructions for copy_page and clear_page.
>
> A user-space test on my Pentium III 850 MHz shows a 3x speedup for
> clear_page (compared to the default "rep stosl"), and a 50% speedup
> for copy_page (compared to the default "rep movsl"). For a Pentium-4,
> the speedup is about 50% in both the clear_page and copy_page cases.
we used to have code like this in 2.4 but it got removed: the
non-temporal store code is faster in a microbenchmark but has the
fundamental problem that it evicts the data from the cpu cache; the
actual USE of the data thus is a LOT more expensive, result is that the
overall system performance goes down ;(
* Re: [PATCH] Use x86 SSE instructions for clear_page, copy_page
2004-08-17 7:27 ` Arjan van de Ven
@ 2004-08-17 8:10 ` Andrey Panin
2004-08-17 8:11 ` Arjan van de Ven
2004-08-17 22:40 ` Jens Maurer
1 sibling, 1 reply; 9+ messages in thread
From: Andrey Panin @ 2004-08-17 8:10 UTC (permalink / raw)
To: Arjan van de Ven; +Cc: Jens Maurer, Linux Kernel
On Tue, Aug 17, 2004 at 09:27:51AM +0200, Arjan van de Ven wrote:
> On Tue, 2004-08-17 at 08:13, Jens Maurer wrote:
> > The attached patch (against kernel 2.6.8.1) enables using SSE
> > instructions for copy_page and clear_page.
> >
> > A user-space test on my Pentium III 850 MHz shows a 3x speedup for
> > clear_page (compared to the default "rep stosl"), and a 50% speedup
> > for copy_page (compared to the default "rep movsl"). For a Pentium-4,
> > the speedup is about 50% in both the clear_page and copy_page cases.
>
>
> we used to have code like this in 2.4 but it got removed: the
> non-temporal store code is faster in a microbenchmark but has the
> fundamental problem that it evicts the data from the cpu cache; the
> actual USE of the data thus is a LOT more expensive, result is that the
> overall system performance goes down ;(
Did the SSE clear_page() suffer from this issue too?
--
Andrey Panin | Linux and UNIX system administrator
pazke@donpac.ru | PGP key: wwwkeys.pgp.net
* Re: [PATCH] Use x86 SSE instructions for clear_page, copy_page
2004-08-17 8:10 ` Andrey Panin
@ 2004-08-17 8:11 ` Arjan van de Ven
0 siblings, 0 replies; 9+ messages in thread
From: Arjan van de Ven @ 2004-08-17 8:11 UTC (permalink / raw)
To: Jens Maurer, Linux Kernel
On Tue, Aug 17, 2004 at 12:10:09PM +0400, Andrey Panin wrote:
> On 230, 08 17, 2004 at 09:27:51AM +0200, Arjan van de Ven wrote:
> > On Tue, 2004-08-17 at 08:13, Jens Maurer wrote:
> > > The attached patch (against kernel 2.6.8.1) enables using SSE
> > > instructions for copy_page and clear_page.
> > >
> > > A user-space test on my Pentium III 850 MHz shows a 3x speedup for
> > > clear_page (compared to the default "rep stosl"), and a 50% speedup
> > > for copy_page (compared to the default "rep movsl"). For a Pentium-4,
> > > the speedup is about 50% in both the clear_page and copy_page cases.
> >
> >
> > we used to have code like this in 2.4 but it got removed: the
> > non-temporal store code is faster in a microbenchmark but has the
> > fundamental problem that it evicts the data from the cpu cache; the
> > actual USE of the data thus is a LOT more expensive, result is that the
> > overall system performance goes down ;(
>
> Did the SSE clear_page() suffer from this issue too?
yes, especially clear_page; the kernel only calls clear_page when it's *just
about* to use it, so it's actually the worst-case example ;(
(and clear_page gains the most because non-temporal stores avoid the write
allocate, which roughly halves the memory bandwidth)
* Re: [PATCH] Use x86 SSE instructions for clear_page, copy_page
2004-08-17 7:27 ` Arjan van de Ven
2004-08-17 8:10 ` Andrey Panin
@ 2004-08-17 22:40 ` Jens Maurer
2004-08-18 2:33 ` David S. Miller
1 sibling, 1 reply; 9+ messages in thread
From: Jens Maurer @ 2004-08-17 22:40 UTC (permalink / raw)
To: Linux Kernel; +Cc: arjanv
Arjan van de Ven wrote:
> On Tue, 2004-08-17 at 08:13, Jens Maurer wrote:
>
>>The attached patch (against kernel 2.6.8.1) enables using SSE
>>instructions for copy_page and clear_page.
> we used to have code like this in 2.4 but it got removed: the
> non-temporal store code is faster in a microbenchmark but has the
> fundamental problem that it evicts the data from the cpu cache; the
> actual USE of the data thus is a LOT more expensive, result is that the
> overall system performance goes down ;(
Hm... With the current clear_page, we are filling 4KB of my
Pentium-III's 16 KB L1 d-cache (i.e. 25%) with zeroes. I'm not
sure that we will use all of this data right away.
I would like to point out that the current arch/i386/lib/mmx.c
uses MMX movntq instructions #ifdef CONFIG_MK7 .
Apparently, bypassing the cache was considered a good idea
in that case.
What is a set of useful benchmarks to find out which approach
is better? We should have some real-world programs that show
significant oprofile hits in clear_page or copy_page.
It might very well be that the results on Pentium-III and
Pentium-4 are different, for example that SSE is only useful
for a Pentium-III, and only for clear_page.
Jens Maurer
* Re: [PATCH] Use x86 SSE instructions for clear_page, copy_page
2004-08-17 22:40 ` Jens Maurer
@ 2004-08-18 2:33 ` David S. Miller
2004-08-22 20:49 ` Jens Maurer
0 siblings, 1 reply; 9+ messages in thread
From: David S. Miller @ 2004-08-18 2:33 UTC (permalink / raw)
To: Jens Maurer; +Cc: linux-kernel, arjanv
On Wed, 18 Aug 2004 00:40:06 +0200
Jens Maurer <Jens.Maurer@gmx.net> wrote:
> What is a set of useful benchmarks to find out which approach
> is better?
Time kernel builds on an otherwise totally idle machine.
Do multiple runs so that the cache gets loaded and you're
testing the memory accesses rather than disk reads.
* Re: [PATCH] Use x86 SSE instructions for clear_page, copy_page
2004-08-18 2:33 ` David S. Miller
@ 2004-08-22 20:49 ` Jens Maurer
0 siblings, 0 replies; 9+ messages in thread
From: Jens Maurer @ 2004-08-22 20:49 UTC (permalink / raw)
To: David S. Miller; +Cc: linux-kernel, arjanv
David S. Miller wrote:
> Time kernel builds on an otherwise totally idle machine.
> Do multiple runs so that the cache gets loaded and you're
> testing the memory accesses rather than disk reads.
I've applied Denis Vlasenko's patch that allows runtime
switching between "rep stosl" and SSE instructions.
I'm using this script to perform the kernel build:
rm mm/*.o kernel/*.o
make SUBDIRS="mm kernel"
Hardware: Pentium III 850 MHz with 256 MB RAM
Results of "time" using "rep stosl":
real 1m19.038s 1m18.099s 1m18.612s 1m18.630s
user 1m07.258s 1m07.056s 1m07.198s 1m07.052s
sys 0m10.062s 0m10.205s 0m10.141s 0m10.209s
oprofile result for "rep stosl" (obtained on a separate run):
samples % app name symbol name
497537 71.9851 cc1 (no symbols)
37358 5.4051 vmlinux-2.6-sse zero_page_std
6933 1.0031 vmlinux-2.6-sse __copy_to_user_ll
6010 0.8695 libc-2.3.3.so _int_malloc
5940 0.8594 vmlinux-2.6-sse copy_page_std
5262 0.7613 vmlinux-2.6-sse mark_offset_tsc
Results using SSE instructions (writes do not use
the caches):
real 1m16.724s 1m16.281s 1m16.681s 1m16.681s 1m16.207s
user 1m07.938s 1m07.752s 1m07.905s 1m07.916s 1m07.846s
sys 0m07.517s 0m07.866s 0m07.715s 0m07.753s 0m07.684s
oprofile result for SSE (obtained on a separate run):
501243 74.5966 cc1 (no symbols)
21533 3.2046 vmlinux-2.6-sse zero_page_sse
8166 1.2153 vmlinux-2.6-sse __copy_to_user_ll
5797 0.8627 libc-2.3.3.so _int_malloc
5344 0.7953 vmlinux-2.6-sse copy_page_sse
4757 0.7080 libc-2.3.3.so __GI_memset
4490 0.6682 vmlinux-2.6-sse mark_offset_tsc
Interpretation:
Real time is about 2 sec lower, user CPU is about 0.7-0.9 sec
higher, system CPU is about 2.3 sec lower.
Results from test runs fluctuate by about 0.2 sec, thus the
measured differences appear to be significant, and in favor
of using SSE instructions for clear_page(). copy_page()
was not used enough (compared to clear_page) to give a clear
picture.
It looks like a future patch should allow for independent
"rep stosl"/SSE configuration for clear_page() and copy_page().
Jens Maurer
* Re: [PATCH] Use x86 SSE instructions for clear_page, copy_page
2004-08-17 6:13 [PATCH] Use x86 SSE instructions for clear_page, copy_page Jens Maurer
2004-08-17 7:27 ` Arjan van de Ven
@ 2004-08-18 7:00 ` Ingo Molnar
2004-08-18 7:11 ` Ingo Molnar
1 sibling, 1 reply; 9+ messages in thread
From: Ingo Molnar @ 2004-08-18 7:00 UTC (permalink / raw)
To: Jens Maurer; +Cc: Linux Kernel, Andrew Morton, Arjan van de Ven
* Jens Maurer <Jens.Maurer@gmx.net> wrote:
> The attached patch (against kernel 2.6.8.1) enables using SSE
> instructions for copy_page and clear_page.
besides the cache arguments Arjan raised, you are also corrupting SSE
registers in a big way. You are saving/clearing/restoring the TS bit, but
that's not enough - what if e.g. a pagefault happened while userspace code
executed SSE code? You are corrupting those registers.
check out raid6_before_sse()/raid6_after_sse() to see how to write proper
SSE code for the kernel.
Ingo
* Re: [PATCH] Use x86 SSE instructions for clear_page, copy_page
2004-08-18 7:00 ` Ingo Molnar
@ 2004-08-18 7:11 ` Ingo Molnar
0 siblings, 0 replies; 9+ messages in thread
From: Ingo Molnar @ 2004-08-18 7:11 UTC (permalink / raw)
To: Jens Maurer; +Cc: Linux Kernel, Andrew Morton, Arjan van de Ven
* Ingo Molnar <mingo@elte.hu> wrote:
> executed SSE code? You are corrupting those registers.
doh - should read before i write. The code is perfectly fine.
Arjan's cache arguments remain. What i'd suggest to test is the precise
speed of compiling the kernel, fully done in ramfs. (if you don't have
enough RAM for this then compile a portion of the kernel.) For the
testing, add a /proc/sys switch to turn the SSE functions on/off
runtime, hence you can eliminate the effects of page placement. If the
compilation timings are stable enough then you can try the runtime
switch to see whether it has any effect. There should be a small but
visible change (in one direction or the other), as compilation brings in
lots of new pages and copies around stuff too.
Ingo
end of thread
Thread overview: 9+ messages
2004-08-17 6:13 [PATCH] Use x86 SSE instructions for clear_page, copy_page Jens Maurer
2004-08-17 7:27 ` Arjan van de Ven
2004-08-17 8:10 ` Andrey Panin
2004-08-17 8:11 ` Arjan van de Ven
2004-08-17 22:40 ` Jens Maurer
2004-08-18 2:33 ` David S. Miller
2004-08-22 20:49 ` Jens Maurer
2004-08-18 7:00 ` Ingo Molnar
2004-08-18 7:11 ` Ingo Molnar