From: Jeff Garzik <jgarzik@pobox.com>
To: Benjamin LaHaise <bcrl@kvack.org>
Cc: ak@muc.de, linux-kernel@vger.kernel.org
Subject: Re: [RFC] x86-64: Use SSE for copy_page and clear_page
Date: Mon, 30 May 2005 14:45:01 -0400
Message-ID: <429B5F2D.9010804@pobox.com>
In-Reply-To: <20050530181626.GA10212@kvack.org>
Benjamin LaHaise wrote:
> Hello Andi,
>
> Below is a patch that uses 128 bit SSE instructions for copy_page and
> clear_page. This is an improvement on P4 systems as can be seen by
> running the test program at http://www.kvack.org/~bcrl/xmm64.c to get
> results like:
>
> SSE test program $Id: fast.c,v 1.6 2000/09/23 09:05:45 arjan Exp $ buffer = 0x2aaaaaad6000
> clear_page() tests
> clear_page function 'warm up run' took 25444 cycles per page
> clear_page function 'kernel clear' took 6595 cycles per page
> clear_page function '2.4 non MMX' took 7827 cycles per page
> clear_page function '2.4 MMX fallback' took 7741 cycles per page
> clear_page function '2.4 MMX version' took 6454 cycles per page
> clear_page function 'faster_clear_page' took 4344 cycles per page
> clear_page function 'even_faster_clear' took 4151 cycles per page
> clear_page function 'xmm_clear ' took 3204 cycles per page
> clear_page function 'xmma_clear ' took 6080 cycles per page
> clear_page function 'xmm2_clear ' took 3370 cycles per page
> clear_page function 'xmma2_clear ' took 6115 cycles per page
> clear_page function 'kernel clear' took 6583 cycles per page
>
> copy_page() tests
> copy_page function 'warm up run' took 9770 cycles per page
> copy_page function '2.4 non MMX' took 9758 cycles per page
> copy_page function '2.4 MMX fallback' took 9572 cycles per page
> copy_page function '2.4 MMX version' took 9405 cycles per page
> copy_page function 'faster_copy' took 7407 cycles per page
> copy_page function 'even_faster' took 7158 cycles per page
> copy_page function 'xmm_copy_page_no' took 6110 cycles per page
> copy_page function 'xmm_copy_page' took 5914 cycles per page
> copy_page function 'xmma_copy_page' took 5913 cycles per page
> copy_page function 'v26_copy_page' took 9168 cycles per page
>
> The SSE clear page function is almost twice as fast as the kernel's
> current clear_page, while the copy_page implementation is roughly a
> third faster. This is likely due to the fact that SSE instructions
> can keep the 256 bit wide L2 cache bus at a higher utilisation than
> 64 bit movs are able to. Comments?
Sounds pretty darn cool to me. I can give it a test on athlon64 and
em64t here.
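For anyone else who wants to sanity-check numbers like the above, the
cycles-per-page measurement boils down to a rdtsc loop.  Rough sketch of
the idea (untested, and not the actual xmm64.c -- memset() just stands in
for whichever clear_page variant is being timed):

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define PAGE_SIZE	4096
#define ITERATIONS	100

/* read the x86 time stamp counter */
static inline uint64_t rdtsc(void)
{
	uint32_t lo, hi;
	__asm__ __volatile__ ("rdtsc" : "=a" (lo), "=d" (hi));
	return ((uint64_t)hi << 32) | lo;
}

int main(void)
{
	uint64_t start, end;
	void *buf;
	int i;

	/* page-aligned buffer, like the mmap'd region in the real test */
	if (posix_memalign(&buf, PAGE_SIZE, PAGE_SIZE))
		return 1;

	start = rdtsc();
	for (i = 0; i < ITERATIONS; i++)
		memset(buf, 0, PAGE_SIZE);	/* swap in the routine under test */
	end = rdtsc();

	printf("clear took %llu cycles per page\n",
	       (unsigned long long)((end - start) / ITERATIONS));
	return 0;
}
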
I have some coding style whining to do though...
> :r public_html/patches/v2.6.12-rc4-xmm-2.diff
> diff -purN v2.6.12-rc4/arch/x86_64/lib/c_clear_page.c xmm-rc4/arch/x86_64/lib/c_clear_page.c
> --- v2.6.12-rc4/arch/x86_64/lib/c_clear_page.c 1969-12-31 19:00:00.000000000 -0500
> +++ xmm-rc4/arch/x86_64/lib/c_clear_page.c 2005-05-26 11:16:09.000000000 -0400
> @@ -0,0 +1,45 @@
> +#include <linux/config.h>
> +#include <linux/preempt.h>
> +#include <asm/page.h>
> +#include <linux/kernel.h>
> +#include <asm/string.h>
preferred ordering:
linux/config
linux/kernel
linux/preempt
asm/*
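i.e. for the includes above:

#include <linux/config.h>
#include <linux/kernel.h>
#include <linux/preempt.h>
#include <asm/page.h>
#include <asm/string.h>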
> +typedef struct { unsigned long a,b; } __attribute__((aligned(16))) xmm_store_t;
space after the comma in "a,b", i.e. "unsigned long a, b;"
> +void c_clear_page_xmm(void *page)
> +{
> + /* Note! gcc doesn't seem to align stack variables properly, so we
> + * need to make use of unaligned loads and stores.
> + */
> + xmm_store_t xmm_save[1];
> + unsigned long cr0;
> + int i;
> +
> + preempt_disable();
> + __asm__ __volatile__ (
> + " mov %%cr0,%0\n"
> + " clts\n"
> + " movdqu %%xmm0,(%1)\n"
> + " pxor %%xmm0, %%xmm0\n"
> + : "=&r" (cr0): "r" (xmm_save) : "memory"
> + );
> +
> + for(i=0;i<PAGE_SIZE/64;i++)
exercise that spacebar :)
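i.e. something like:

	for (i = 0; i < PAGE_SIZE/64; i++)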
> + {
> + __asm__ __volatile__ (
> + " movntdq %%xmm0, (%0)\n"
> + " movntdq %%xmm0, 16(%0)\n"
> + " movntdq %%xmm0, 32(%0)\n"
> + " movntdq %%xmm0, 48(%0)\n"
> + : : "r" (page) : "memory");
> + page+=64;
> + }
> +
> + __asm__ __volatile__ (
> + " sfence \n "
> + " movdqu (%0),%%xmm0\n"
> + " mov %1,%%cr0\n"
> + :: "r" (xmm_save), "r" (cr0)
> + );
> + preempt_enable();
> +}
> diff -purN v2.6.12-rc4/arch/x86_64/lib/c_copy_page.c xmm-rc4/arch/x86_64/lib/c_copy_page.c
> --- v2.6.12-rc4/arch/x86_64/lib/c_copy_page.c 1969-12-31 19:00:00.000000000 -0500
> +++ xmm-rc4/arch/x86_64/lib/c_copy_page.c 2005-05-30 14:07:28.000000000 -0400
> @@ -0,0 +1,52 @@
> +#include <linux/config.h>
> +#include <linux/preempt.h>
> +#include <asm/page.h>
> +#include <linux/kernel.h>
> +#include <asm/string.h>
> +
> +typedef struct { unsigned long a,b; } __attribute__((aligned(16))) xmm_store_t;
ditto
> +void c_copy_page_xmm(void *to, void *from)
> +{
> + /* Note! gcc doesn't seem to align stack variables properly, so we
> + * need to make use of unaligned loads and stores.
> + */
> + xmm_store_t xmm_save[2];
> + unsigned long cr0;
> + int i;
> +
> + preempt_disable();
> + __asm__ __volatile__ (
> + " prefetchnta (%1)\n"
> + " prefetchnta 64(%1)\n"
> + " prefetchnta 128(%1)\n"
> + " prefetchnta 192(%1)\n"
> + " prefetchnta 256(%1)\n"
> + " mov %%cr0,%0\n"
> + " clts\n"
> + " movdqu %%xmm0, (%1)\n"
> + " movdqu %%xmm1,16(%1)\n"
> + : "=&r" (cr0): "r" (xmm_save) : "memory"
> + );
> +
> + for(i=0;i<PAGE_SIZE/32;i++) {
ditto