Obvious one-liner - Use 3DNOW on MK8

* Obvious one-liner - Use 3DNOW on MK8
@ 2004-08-22  0:14 James M.
  0 siblings, 0 replies; 7+ messages in thread
From: James M. @ 2004-08-22  0:14 UTC (permalink / raw)
  To: linux-kernel

Title says it...my Athlon 64 definitely uses 3DNOW. Patch changes 
arch/i386/Kconfig and has a 3 line fudge factor(I created it a few 
kernels back). Might want to check other arches for the same bug.

Dart

--- current/arch/i386/Kconfig.orig      2004-03-12 22:44:10.000000000 -0600
+++ current/arch/i386/Kconfig   2004-03-12 22:44:53.000000000 -0600
@@ -413,7 +413,7 @@ config X86_USE_PPRO_CHECKSUM

  config X86_USE_3DNOW
         bool
-       depends on MCYRIXIII || MK7
+       depends on MCYRIXIII || MK7 || MK8
         default y

  config X86_OOSTORE


* Re: Obvious one-liner - Use 3DNOW on MK8
       [not found] <2vOfA-7Vg-7@gated-at.bofh.it>
@ 2004-08-22  1:29 ` Andi Kleen
  2004-08-22  5:52   ` James M.
       [not found]   ` <200408221118.45146.vda@port.imtp.ilyichevsk.odessa.ua>
       [not found] ` <200408222146.34798.vda@port.imtp.ilyichevsk.odessa.ua>
  1 sibling, 2 replies; 7+ messages in thread
From: Andi Kleen @ 2004-08-22  1:29 UTC (permalink / raw)
  To: James M.; +Cc: linux-kernel

"James M." <dart@windeath.2y.net> writes:

> Title says it...my Athlon 64 definitely uses 3DNOW. Patch changes
> arch/i386/Kconfig and has a 3 line fudge factor(I created it a few
> kernels back). Might want to check other arches for the same bug.

It's not a bug, it is a feature. The K8 is better off not using 
the 3dnow memcpy, which is the only feature this CONFIG controls.

Please don't apply.

-Andi



* Re: Obvious one-liner - Use 3DNOW on MK8
  2004-08-22  1:29 ` Obvious one-liner - Use 3DNOW on MK8 Andi Kleen
@ 2004-08-22  5:52   ` James M.
  2004-08-22 11:38     ` Marc Ballarin
       [not found]   ` <200408221118.45146.vda@port.imtp.ilyichevsk.odessa.ua>
  1 sibling, 1 reply; 7+ messages in thread
From: James M. @ 2004-08-22  5:52 UTC (permalink / raw)
  To: Andi Kleen; +Cc: linux-kernel

Then perhaps the code should be removed, or the help text should at least
explain why it can't be selected...do you have a preference?

Does having this disabled confuse apps like MPlayer that detect the CPU 
and expect it to be there? I'm guessing the kernel handles it correctly 
but I'm just curious.

Andi Kleen wrote:
> "James M." writes:
> 
> 
>>Title says it...my Athlon 64 definitely uses 3DNOW. Patch changes
>>arch/i386/Kconfig and has a 3 line fudge factor(I created it a few
>>kernels back). Might want to check other arches for the same bug.
> 
> 
> It's not a bug, it is a feature. The K8 is better off not using 
> the 3dnow memcpy, which is the only feature this CONFIG controls.
> 
> Please don't apply.
> 
> -Andi
> 


* Re: Obvious one-liner - Use 3DNOW on MK8
  2004-08-22  5:52   ` James M.
@ 2004-08-22 11:38     ` Marc Ballarin
  0 siblings, 0 replies; 7+ messages in thread
From: Marc Ballarin @ 2004-08-22 11:38 UTC (permalink / raw)
  To: James M.; +Cc: ak, linux-kernel

On Sun, 22 Aug 2004 00:52:27 -0500
"James M." <dart@windeath.2y.net> wrote:

> Then perhaps the code should be removed, or the help text should at least
> explain why it can't be selected...do you have a preference?
> 
> Does having this disabled confuse apps like MPlayer that detect the CPU 
> and expect it to be there? I'm guessing the kernel handles it correctly 
> but I'm just curious.
> 

The option only affects in-kernel usage of 3dnow.  It has absolutely no
effect on userspace apps.

mfg



* Re: Obvious one-liner - Use 3DNOW on MK8
       [not found]   ` <200408221118.45146.vda@port.imtp.ilyichevsk.odessa.ua>
@ 2004-08-22 14:47     ` Andi Kleen
  0 siblings, 0 replies; 7+ messages in thread
From: Andi Kleen @ 2004-08-22 14:47 UTC (permalink / raw)
  To: Denis Vlasenko; +Cc: James M., linux-kernel

On Sun, Aug 22, 2004 at 11:18:45AM +0300, Denis Vlasenko wrote:
> On Sunday 22 August 2004 04:29, Andi Kleen wrote:
> > "James M." <dart@windeath.2y.net> writes:
> > > Title says it...my Athlon 64 definitely uses 3DNOW. Patch changes
> > > arch/i386/Kconfig and has a 3 line fudge factor(I created it a few
> > > kernels back). Might want to check other arches for the same bug.
> >
> > It's not a bug, it is a feature. The K8 is better off not using
> > the 3dnow memcpy, which is the only feature this CONFIG controls.
> 
> However, 3dnow _copy_page_ is a huge win. I explained why in an email

On K8? Significant resources were spent on tuning the x86-64
memcpy and memset, and since C stepping K8 rep ; movsl/q are fastest.
Before that an unrolled integer loop was best.

On 32bit the same applies.

Using SSE2 only helps for very large data sets that are
never used in the kernel (several MB). 3dnow wasn't tested, but it is 
unlikely to be any better than SSE2.

-Andi



* Re: Obvious one-liner - Use 3DNOW on MK8
       [not found]   ` <20040823195842.GA7952@muc.de>
@ 2004-08-23 22:24     ` Denis Vlasenko
  2004-08-23 22:31       ` Denis Vlasenko
  0 siblings, 1 reply; 7+ messages in thread
From: Denis Vlasenko @ 2004-08-23 22:24 UTC (permalink / raw)
  To: Andi Kleen; +Cc: linux-kernel

> > > > However, 3dnow _copy_page_ is a huge win. I explained why in an
> > > > email
> > >
> > > On K8? Significant resources were spent on tuning the x86-64
> >
> > Sorry Andi, I cannot test on K8, don't have the hw...
> >
> > > memcpy and memset, and since C stepping K8 rep ; movsl/q are fastest.
> > > Before that an unrolled integer loop was best.
> > >
> > > On 32bit the same applies.
> > >
> > > Using SSE2 only helps for very large data sets that are
> > > never used in the kernel (several MB). 3dnow wasn't tested, but it is
> > > unlikely to be any better than SSE2.
> >
> > I sent to lkml an email with subject "copy_page(): non-temporal
> > stores look useful?". It's possible you never received it.
> > I'll resend it now. I'd like to have your comments, if any.
>
> Hmm, interesting. You're relying on the fact that normally
> only a few cache lines are touched in a copied page.  Still

Where? I think my test programs touch every cacheline in
zeroed/copied pages!

> I would be pretty careful, because you're optimizing one case
> a lot over another and it may fail badly on some workloads
> that use more of the pages.
>
> Also your Duron with its extremely small cache probably skews
> results a bit.
>
> We did some tests on K8 (although no big macro benchmarks)
> and it was usually a loss there.
>
> It's really designed for very big data buffers (many MBs).

I did separate zero_page() and copy_page() tests (see other mail).
I totally agree that zero_page() isn't a win.
However, copy_page() usage patterns are sufficiently different,
and this has interesting consequences:

> And the kernel never processes that much at one piece.

No, it does, at least on current CPUs:

fork() touches 12k via three zero_page calls and 2x32k via
eight copy_page()s, 76k in total. This is more than L1 size for
any existing x86 processor. "Standard" zero/copy ops thus evict
entire L1 at each fork(). Non-temporal ops do not; they evict
only 32k. This helps by keeping useful data still in cache.

Also, these eight copied pages aren't 100% used by fork(),
and thus fork() does not suffer from needing to pull NT-stored
data back to cache and is actually faster with NT copy_page().

To summarize: we get fork() speedup due to faster copies
*and* speedup elsewhere because cache is not flushed.

I think that explains why I was unable to find any buffer size
so that
	buf=alloc(size);
	for(10000) {
		fork();
		if(child) { touch(buf); exit(); }
	}
is running faster with "standard" ops. Let me repeat test results here:

128k copying, 5x5000 loops:
slow:               0m4.732 0m4.747 0m4.751 0m4.773 0m4.776 75466/1000469
mmx_APn:            0m4.258 0m4.331 0m4.343 0m4.386 0m4.422 75406/1000422
mmx_APN:            0m3.658 0m3.672 0m3.784 0m3.798 0m3.818 75452/1000436
mmx_APn/APN:        0m3.713 0m3.713 0m3.840 0m3.850 0m3.857 75435/1000413
64k copying, 5x10000 loops:
slow:               0m5.869 0m5.885 0m5.894 0m5.904 0m5.906 150356/1200472
mmx_APn:            0m5.369 0m5.391 0m5.404 0m5.424 0m5.426 150345/1200444
mmx_APN:            0m4.804 0m4.826 0m4.843 0m4.843 0m4.934 150355/1200436
mmx_APn/APN:        0m4.878 0m4.883 0m4.926 0m4.937 0m4.962 150343/1200441
32k copying, 5x20000 loops:
slow:               0m8.088 0m8.125 0m8.241 0m8.245 0m8.326 300320/1600461
mmx_APn:            0m7.527 0m7.662 0m7.706 0m7.750 0m7.802 300303/1600438
mmx_APN:            0m6.630 0m6.661 0m6.681 0m6.696 0m6.735 300303/1600442
20k copying, 5x20000 loops:
slow:               0m6.610 0m6.665 0m6.694 0m6.750 0m6.774 300315/1300468
mmx_APn:            0m6.208 0m6.218 0m6.263 0m6.335 0m6.452 300352/1300448
mmx_APN:            0m4.887 0m4.984 0m5.021 0m5.052 0m5.057 300295/1300443
mmx_APn/APN:        0m5.115 0m5.160 0m5.167 0m5.172 0m5.183 300292/1300443
4k copying, 5x40000 loops:
slow:               0m8.303 0m8.334 0m8.354 0m8.510 0m8.572 600313/1800473
mmx_APn:            0m8.233 0m8.350 0m8.406 0m8.407 0m8.642 600323/1800467
mmx_APN:            0m6.475 0m6.501 0m6.510 0m6.534 0m6.783 600302/1800436
mmx_APn/APN:        0m6.540 0m6.551 0m6.603 0m6.640 0m6.708 600271/1800442

See? NT wins everywhere.

My main questions are:

* Are there any copy_page() benchmarks which do not
  use fork()?
* Maybe we simply should use "standard" zero_page()
  and "non-temporal" copy_page()?

> In general using write combining is a bit fragile from
> the performance perspective. There is a reason why AMD
> and Intel add more write combining buffers with each major
> CPU revision.  So if it's not an extremely clear win
> I would avoid it.

Speedups of 50% or more are a bit large to dismiss lightly.

I think we can benchmark and pick best at boot time.
--
vda



* Re: Obvious one-liner - Use 3DNOW on MK8
  2004-08-23 22:24     ` Denis Vlasenko
@ 2004-08-23 22:31       ` Denis Vlasenko
  0 siblings, 0 replies; 7+ messages in thread
From: Denis Vlasenko @ 2004-08-23 22:31 UTC (permalink / raw)
  To: Andi Kleen; +Cc: linux-kernel

[-- Attachment #1: Type: text/plain, Size: 288 bytes --]

These are test programs.

copy_load.c:

            for(i = 0; i < N; i++)
                mem[i*SIZE+1] = 'b';          /* force copy */
            strchr(mem, 'c') == mem+N*SIZE-1 || printf("BUG\n");        /* read all */

This forces page copying, and then touches every byte.
--
vda

[-- Attachment #2: copy_load.c --]
[-- Type: text/x-csrc, Size: 889 bytes --]

#include <stdio.h>      /* printf, perror */
#include <stdlib.h>
#include <string.h>     /* memset, strchr */
#include <unistd.h>
#include <sys/mman.h>
#include <sys/wait.h>   /* waitpid */

#define N (256/4)       /* number of pages */
#define SIZE 4096       /* page size in bytes */

int main()
{
    int i,k;
    /* fd is -1 for portability with MAP_ANONYMOUS */
    unsigned char *mem = mmap(0, N*SIZE, PROT_READ|PROT_WRITE,
        MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
    if(mem == MAP_FAILED) {
        perror("mmap");
        return 1;
    }
    memset(mem, 'a', N*SIZE); /* force page clearing */
    mem[N*SIZE-1]='c';
    //for(i = 0; i < N; i++) mem[i*SIZE] = i; /* force page clearing */

    for(k = 0; k < 5000; k++) {
        int pid;
        pid = fork();
        if(pid == 0) {
            /* child */
            for(i = 0; i < N; i++)
                mem[i*SIZE+1] = 'b';          /* force copy */
            //printf("copy complete\n");
            strchr(mem, 'c') == mem+N*SIZE-1 || printf("BUG\n");	/* read all */
            exit(0);
        } else if(pid == -1) {
            perror("fork");
        } else {
            /* parent */
            waitpid(pid, NULL, 0);
        }
    }
    munmap(mem, N*SIZE);
    return 0;
}

[-- Attachment #3: zero_load.c --]
[-- Type: text/x-csrc, Size: 594 bytes --]

#include <stdio.h>      /* perror */
#include <stdlib.h>
#include <unistd.h>
#include <sys/mman.h>

#define N (320/4)       /* number of pages */
#define SIZE 4096       /* page size in bytes */

int dummy;

int main()
{
    int k;
    for(k = 0; k < 20000; k++) {
        int i = 0;
        unsigned char *mem,*p;
        /* fd is -1 for portability with MAP_ANONYMOUS */
        mem = mmap(0, N*SIZE, PROT_READ|PROT_WRITE,
            MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
        if(mem == MAP_FAILED) {
            perror("mmap");
            return 1;
        }
        for(i = 0; i < N; i++) mem[i*SIZE] = i; /* force page allocation and clearing */
        //memset(mem, 'a', N*SIZE); /* force page clearing */
        p = mem;
        while(p != mem+N*SIZE) { i += *p; p+=32; } /* use data */
        dummy = i;
        munmap(mem, N*SIZE);
    }
    return 0;
}
