* [PATCH] reduce inlined x86 memcpy by 2 bytes
@ 2005-03-18  9:21 Denis Vlasenko
From: Denis Vlasenko @ 2005-03-18  9:21 UTC (permalink / raw)
  To: Linux Kernel Mailing List; +Cc: Matt Mackall, vital

This memcpy() is 2 bytes shorter than the one currently in mainline
and it has one branch fewer. It is also 3-4% faster in microbenchmarks
on small blocks when the block size is a multiple of 4. Mainline is slower
because it has to branch twice per memcpy, and both branches can be
mispredicted (though branch prediction hides that in a microbenchmark).
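
For concreteness, a userspace harness along these lines shows the effect
(a sketch only, not the exact benchmark behind the numbers above; 32-bit
x86, build with gcc -m32 -O2):

#include <stdio.h>

/* the patched __memcpy() from the diff below, pasted for userspace testing */
static inline void *kmemcpy(void *to, const void *from, unsigned long n)
{
	int d0, d1, d2;
	__asm__ __volatile__(
		"rep ; movsl\n\t"
		"movl %4,%%ecx\n\t"
		"andl $3,%%ecx\n\t"
		"jz 1f\n\t"
		"rep ; movsb\n\t"
		"1:"
		: "=&c" (d0), "=&D" (d1), "=&S" (d2)
		: "0" (n/4), "g" (n), "1" ((long) to), "2" ((long) from)
		: "memory");
	return to;
}

static inline unsigned long long rdtsc(void)
{
	unsigned int lo, hi;
	__asm__ __volatile__("rdtsc" : "=a" (lo), "=d" (hi));
	return ((unsigned long long)hi << 32) | lo;
}

int main(void)
{
	static char src[64], dst[64];
	volatile unsigned long n = 24;	/* small block, n%4 == 0 */
	unsigned long long t0, t1;
	long i;

	t0 = rdtsc();
	for (i = 0; i < 10000000; i++)
		kmemcpy(dst, src, n);
	t1 = rdtsc();
	printf("~%llu cycles per copy\n", (t1 - t0) / 10000000);
	return 0;
}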

The last remaining branch can be dropped too, but then we always execute
the trailing 'rep movsb', even when blocksize%4==0. That is slower than
mainline because 'rep movsb' is microcoded. I wonder, though, whether
branchlessness wins over this in real-world use (not in a benchmark).

I think blocksize%4==0 happens more than 25% of the time.
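
For reference, the fully branchless variant discussed above would look
roughly like this (a sketch only; the name is made up, and this is not
the attached patch):

static inline void * __memcpy_branchless(void * to, const void * from, size_t n)
{
int d0, d1, d2;
__asm__ __volatile__(
	"rep ; movsl\n\t"
	"movl %4,%%ecx\n\t"
	"andl $3,%%ecx\n\t"
	"rep ; movsb"	/* 0..3 iterations; no branch, but always pays the microcode startup */
	: "=&c" (d0), "=&D" (d1), "=&S" (d2)
	: "0" (n/4), "g" (n), "1" ((long) to), "2" ((long) from)
	: "memory");
return (to);
}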

This is how much an 'allyesconfig' vmlinux shrinks with the branchless memcpy():

# size vmlinux.org vmlinux.memcpy
   text    data     bss     dec     hex filename
18178950        6293427 1808916 26281293        191054d vmlinux.org
18165160        6293427 1808916 26267503        190cf6f vmlinux.memcpy

# echo $(( (18178950-18165160) ))
13790 <============= bytes saved on allyesconfig

# echo $(( (18178950-18165160)/4 ))
3447 <============= memcpy() callsites optimized (4 bytes saved per inlined copy)

The attached patch (which keeps one branch) would save 6.5k instead of 13k.

The patch is run-tested.
--
vda

[-- Attachment #2: string.memcpy.diff --]
[-- Type: text/x-diff, Size: 709 bytes --]

--- linux-2.6.11.src/include/asm-i386/string.h.orig	Thu Mar  3 09:31:08 2005
+++ linux-2.6.11.src/include/asm-i386/string.h	Fri Mar 18 10:55:51 2005
@@ -198,15 +198,13 @@ static inline void * __memcpy(void * to,
 int d0, d1, d2;
 __asm__ __volatile__(
 	"rep ; movsl\n\t"
-	"testb $2,%b4\n\t"
-	"je 1f\n\t"
-	"movsw\n"
-	"1:\ttestb $1,%b4\n\t"
-	"je 2f\n\t"
-	"movsb\n"
-	"2:"
+	"movl %4,%%ecx\n\t"
+	"andl $3,%%ecx\n\t"
+	"jz 1f\n\t"	/* pay 2 byte penalty for a chance to skip microcoded rep */
+	"rep ; movsb\n\t"
+	"1:"
 	: "=&c" (d0), "=&D" (d1), "=&S" (d2)
-	:"0" (n/4), "q" (n),"1" ((long) to),"2" ((long) from)
+	: "0" (n/4), "g" (n), "1" ((long) to), "2" ((long) from)
 	: "memory");
 return (to);
 }


* Re: [PATCH] reduce inlined x86 memcpy by 2 bytes
From: Denis Vlasenko @ 2005-03-18 10:07 UTC (permalink / raw)
  To: Denis Vlasenko, Linux Kernel Mailing List; +Cc: Matt Mackall

On Friday 18 March 2005 11:21, Denis Vlasenko wrote:
> This memcpy() is 2 bytes shorter than the one currently in mainline
> and it has one branch fewer. It is also 3-4% faster in microbenchmarks
> on small blocks when the block size is a multiple of 4. Mainline is slower
> because it has to branch twice per memcpy, and both branches can be
> mispredicted (though branch prediction hides that in a microbenchmark).
> 
> The last remaining branch can be dropped too, but then we always execute
> the trailing 'rep movsb', even when blocksize%4==0. That is slower than
> mainline because 'rep movsb' is microcoded. I wonder, though, whether
> branchlessness wins over this in real-world use (not in a benchmark).
> 
> I think blocksize%4==0 happens more than 25% of the time.

s/%4/&3/ of course.
--
vda



* Re: [PATCH] reduce inlined x86 memcpy by 2 bytes
From: Adrian Bunk @ 2005-03-20 13:17 UTC (permalink / raw)
  To: Denis Vlasenko; +Cc: Linux Kernel Mailing List, Matt Mackall, vital

Hi Denis,

what do your benchmarks say about replacing the whole assembler code 
with a

  #define __memcpy __builtin_memcpy

?

cu
Adrian

-- 

       "Is there not promise of rain?" Ling Tan asked suddenly out
        of the darkness. There had been need of rain for many days.
       "Only a promise," Lao Er said.
                                       Pearl S. Buck - Dragon Seed



* Re: [PATCH] reduce inlined x86 memcpy by 2 bytes
From: Denis Vlasenko @ 2005-03-22  6:40 UTC (permalink / raw)
  To: Adrian Bunk, Denis Vlasenko
  Cc: Linux Kernel Mailing List, Matt Mackall, vital

On Sunday 20 March 2005 15:17, Adrian Bunk wrote:
> Hi Denis,
> 
> what do your benchmarks say about replacing the whole assembler code 
> with a
> 
>   #define __memcpy __builtin_memcpy

It generates a call to the out-of-line memcpy()
when the count is not a compile-time constant.

# cat t.c
extern char *a, *b;
extern int n;

void f() {
    __builtin_memcpy(a,b,n);
}

void g() {
    __builtin_memcpy(a,b,24);
}
# gcc -S -O2 -fomit-frame-pointer t.c
# cat t.s
        .file   "t.c"
        .text
        .p2align 2,,3
.globl f
        .type   f, @function
f:
        subl    $16, %esp
        pushl   n
        pushl   b
        pushl   a
        call    memcpy
        addl    $28, %esp
        ret
        .size   f, .-f
        .p2align 2,,3
.globl g
        .type   g, @function
g:
        pushl   %edi
        pushl   %esi
        movl    a, %edi
        movl    b, %esi
        cld
        movl    $6, %ecx
        rep
        movsl
        popl    %esi
        popl    %edi
        ret
        .size   g, .-g
        .section        .note.GNU-stack,"",@progbits
        .ident  "GCC: (GNU) 3.4.1"

Proving that it is slower than the inline version is left
as an exercise for the reader :)

The kernel one is always inlined.
void h() { __memcpy(a,b,n); } compiles to:
        movl    n, %eax
        pushl   %edi
        movl    %eax, %ecx
        pushl   %esi
        movl    a, %edi
        movl    b, %esi
        shrl    $2, %ecx
#APP
        rep ; movsl
        movl %eax,%ecx
        andl $3,%ecx
        jz 1f
        rep ; movsb
        1:
#NO_APP
        popl    %esi
        popl    %edi
        ret
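
That is because memcpy() in include/asm-i386/string.h picks the
implementation at compile time, roughly like this in the non-3DNOW case
(quoted from memory, so treat as a sketch):

#define memcpy(t, f, n) \
(__builtin_constant_p(n) ? \
 __constant_memcpy((t),(f),(n)) : \
 __memcpy((t),(f),(n)))

so a non-constant n always expands to the inline asm above rather than
an out-of-line call.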
--
vda

