From mboxrd@z Thu Jan 1 00:00:00 1970 From: Steven Noonan Subject: Re: Why Git is so fast Date: Thu, 30 Apr 2009 17:23:57 -0700 Message-ID: References: <46a038f90904270155i6c802fceoffc73eb5ab57130e@mail.gmail.com> <200904301728.06989.jnareb@gmail.com> <20090430185244.GR23604@spearce.org> <86iqkllw0c.fsf@broadpark.no> <20090430204033.GV23604@spearce.org> <8663gllt88.fsf@broadpark.no> Mime-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: QUOTED-PRINTABLE Cc: "Shawn O. Pearce" , git@vger.kernel.org To: Kjetil Barvik X-From: git-owner@vger.kernel.org Fri May 01 02:24:35 2009 Return-path: Envelope-to: gcvg-git-2@gmane.org Received: from vger.kernel.org ([209.132.176.167]) by lo.gmane.org with esmtp (Exim 4.50) id 1LzgYL-0000lJ-9O for gcvg-git-2@gmane.org; Fri, 01 May 2009 02:24:33 +0200 Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1763499AbZEAAYF convert rfc822-to-quoted-printable (ORCPT ); Thu, 30 Apr 2009 20:24:05 -0400 Received: (majordomo@vger.kernel.org) by vger.kernel.org id S932449AbZEAAYC (ORCPT ); Thu, 30 Apr 2009 20:24:02 -0400 Received: from yx-out-2324.google.com ([74.125.44.30]:55812 "EHLO yx-out-2324.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1758629AbZEAAX7 convert rfc822-to-8bit (ORCPT ); Thu, 30 Apr 2009 20:23:59 -0400 Received: by yx-out-2324.google.com with SMTP id 3so1239624yxj.1 for ; Thu, 30 Apr 2009 17:23:57 -0700 (PDT) Received: by 10.90.115.17 with SMTP id n17mr657169agc.104.1241137437362; Thu, 30 Apr 2009 17:23:57 -0700 (PDT) In-Reply-To: <8663gllt88.fsf@broadpark.no> Sender: git-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: git@vger.kernel.org Archived-At: On Thu, Apr 30, 2009 at 2:36 PM, Kjetil Barvik wr= ote: > * "Shawn O. Pearce" writes: > |> =A0 =A0 =A04) The "static inline void hashcpy(....)" in cache.h co= uld then > |> =A0 =A0 =A0 =A0 maybe be written like this: > | > | Its already done as "memcpy(a, b, 20)" which most compilers will > | inline and probably reduce to 5 word moves anyway. =A0That's why > | hashcpy() itself is inline. > > =A0But would the compiler be able to trust that the hashcpy() is alwa= ys > =A0called with correct word alignment on variables a and b? > > =A0I made a test and compiled git with: > > =A0 =A0make USE_NSEC=3D1 CFLAGS=3D"-march=3Dcore2 -mtune=3Dcore2 -O2 = -g2 -fno-stack-protector" clean all > > =A0compiler: gcc (Gentoo 4.3.3-r2 p1.1, pie-10.1.5) 4.3.3 > =A0CPU: Intel(R) Core(TM)2 CPU T7200 @ 2.00GHz GenuineIntel > > =A0Then used gdb to get the following: > > (gdb) disassemble write_sha1_file > Dump of assembler code for function write_sha1_file: > 0x080e3830 : push =A0 %ebp > 0x080e3831 : mov =A0 =A0%esp,%ebp > 0x080e3833 : sub =A0 =A0$0x58,%esp > 0x080e3836 : lea =A0 =A0-0x10(%ebp),%eax > 0x080e3839 : mov =A0 =A0%ebx,-0xc(%ebp) > 0x080e383c : =A0 =A0 =A0 =A0mov =A0 =A0%esi,-0x8(= %ebp) > 0x080e383f : =A0 =A0 =A0 =A0mov =A0 =A0%edi,-0x4(= %ebp) > 0x080e3842 : =A0 =A0 =A0 =A0mov =A0 =A00x14(%ebp)= ,%ebx > 0x080e3845 : =A0 =A0 =A0 =A0mov =A0 =A0%eax,0x8(%= esp) > 0x080e3849 : =A0 =A0 =A0 =A0lea =A0 =A0-0x44(%ebp= ),%edi > 0x080e384c : =A0 =A0 =A0 =A0lea =A0 =A0-0x24(%ebp= ),%esi > 0x080e384f : =A0 =A0 =A0 =A0mov =A0 =A0%edi,0x4(%= esp) > 0x080e3853 : =A0 =A0 =A0 =A0mov =A0 =A0%esi,(%esp= ) > 0x080e3856 : =A0 =A0 =A0 =A0mov =A0 =A00x10(%ebp)= ,%ecx > 0x080e3859 : =A0 =A0 =A0 =A0mov =A0 =A00xc(%ebp),= %edx > 0x080e385c : =A0 =A0 =A0 =A0mov =A0 =A00x8(%ebp),= %eax > 0x080e385f : =A0 =A0 =A0 =A0call =A0 0x80e0350 > 0x080e3864 : =A0 =A0 =A0 =A0test =A0 %ebx,%ebx > 0x080e3866 : =A0 =A0 =A0 =A0je =A0 =A0 0x80e3885 = > > 0x080e3868 : =A0 =A0 =A0 =A0mov =A0 =A0-0x24(%ebp= ),%eax > 0x080e386b : =A0 =A0 =A0 =A0mov =A0 =A0%eax,(%ebx= ) > 0x080e386d : =A0 =A0 =A0 =A0mov =A0 =A0-0x20(%ebp= ),%eax > 0x080e3870 : =A0 =A0 =A0 =A0mov =A0 =A0%eax,0x4(%= ebx) > 0x080e3873 : =A0 =A0 =A0 =A0mov =A0 =A0-0x1c(%ebp= ),%eax > 0x080e3876 : =A0 =A0 =A0 =A0mov =A0 =A0%eax,0x8(%= ebx) > 0x080e3879 : =A0 =A0 =A0 =A0mov =A0 =A0-0x18(%ebp= ),%eax > 0x080e387c : =A0 =A0 =A0 =A0mov =A0 =A0%eax,0xc(%= ebx) > 0x080e387f : =A0 =A0 =A0 =A0mov =A0 =A0-0x14(%ebp= ),%eax > 0x080e3882 : =A0 =A0 =A0 =A0mov =A0 =A0%eax,0x10(= %ebx) > > =A0I admit that I am not particular familar with intel machine > =A0instructions, but I guess that the above 10 mov instructions is th= e > =A0result for the compiled inline hashcpy() in the write_sha1_file() > =A0function in sha1_file.c > > =A0Question: would it be possible for the compiler to compile it down= to > =A0just 5 mov instructions if we had used unsigned 32 bits type? =A0O= r is > =A0this the best we can reasonable hope for inside the write_sha1_fil= e() > =A0function? > > =A0I checked 3 other output of "disassemble function_foo", and it see= ms > =A0that those 3 functions I checked got 10 mov instructions for the > =A0inline hashcpy(), as far as I can tell. > > 0x080e3885 : =A0 =A0 =A0 =A0mov =A0 =A0%esi,(%esp= ) > 0x080e3888 : =A0 =A0 =A0 =A0call =A0 0x80e3800 > 0x080e388d : =A0 =A0 =A0 =A0xor =A0 =A0%edx,%edx > 0x080e388f : =A0 =A0 =A0 =A0test =A0 %eax,%eax > 0x080e3891 : =A0 =A0 =A0 =A0jne =A0 =A00x80e38b6 = > 0x080e3893 : =A0 =A0 =A0 =A0mov =A0 =A00xc(%ebp),= %eax > 0x080e3896 : =A0 =A0 =A0 mov =A0 =A0%edi,%edx > 0x080e3898 : =A0 =A0 =A0 mov =A0 =A0%eax,0x4(%es= p) > 0x080e389c : =A0 =A0 =A0 mov =A0 =A0-0x10(%ebp),= %ecx > 0x080e389f : =A0 =A0 =A0 mov =A0 =A00x8(%ebp),%e= ax > 0x080e38a2 : =A0 =A0 =A0 movl =A0 $0x0,0x8(%esp) > 0x080e38aa : =A0 =A0 =A0 mov =A0 =A0%eax,(%esp) > 0x080e38ad : =A0 =A0 =A0 mov =A0 =A0%esi,%eax > 0x080e38af : =A0 =A0 =A0 call =A0 0x80e1e40 > 0x080e38b4 : =A0 =A0 =A0 mov =A0 =A0%eax,%edx > 0x080e38b6 : =A0 =A0 =A0 mov =A0 =A0%edx,%eax > 0x080e38b8 : =A0 =A0 =A0 mov =A0 =A0-0xc(%ebp),%= ebx > 0x080e38bb : =A0 =A0 =A0 mov =A0 =A0-0x8(%ebp),%= esi > 0x080e38be : =A0 =A0 =A0 mov =A0 =A0-0x4(%ebp),%= edi > 0x080e38c1 : =A0 =A0 =A0 leave > 0x080e38c2 : =A0 =A0 =A0 ret > End of assembler dump. > (gdb) > > =A0So, maybe the compiler is doing the right thing after all? > Well, I just tested this with GCC myself. I used this segment of code: #include void hashcpy(unsigned char *sha_dst, const unsigned char *sha_s= rc) { memcpy(sha_dst, sha_src, 20); } I compiled using Apple's GCC 4.0.1 (note that GCC 4.3 and 4.4 vanilla yield the same code) with these parameters to get Intel assembly: gcc -O2 -arch i386 -march=3Dpentium3 -mtune=3Dpentium3 -fomit-frame-pointer -fno-strict-aliasing -S test.c and these parameters to get the equivalent PowerPC code: gcc -O2 -mcpu=3DG5 -arch ppc -fomit-frame-pointer -fno-strict-aliasing -S test.c Intel code: .text .align 4,0x90 =2Eglobl _hashcpy _hashcpy: subl $12, %esp movl 20(%esp), %edx movl 16(%esp), %ecx movl (%edx), %eax movl %eax, (%ecx) movl 4(%edx), %eax movl %eax, 4(%ecx) movl 8(%edx), %eax movl %eax, 8(%ecx) movl 12(%edx), %eax movl %eax, 12(%ecx) movl 16(%edx), %eax movl %eax, 16(%ecx) addl $12, %esp ret .subsections_via_symbols and the PowerPC code: .section __TEXT,__text,regular,pure_instructions .section __TEXT,__picsymbolstub1,symbol_stubs,pure_instructions= ,32 .machine ppc970 .text .align 2 .p2align 4,,15 .globl _hashcpy _hashcpy: lwz r0,0(r4) lwz r2,4(r4) lwz r9,8(r4) lwz r11,12(r4) stw r0,0(r3) stw r2,4(r3) stw r9,8(r3) stw r11,12(r3) lwz r0,16(r4) stw r0,16(r3) blr .subsections_via_symbols So it does look like GCC does what it should and it inlines the memcpy. A bit off topic, but the results are rather interesting to me, and I think I see a weakness in how GCC is doing this on Intel. Someone please correct me if I'm wrong, but the PowerPC code seems much better because it can yield very high instruction-level parallelism. It does 5 loads and then 5 stores, using 4 registers for temporary storage and 2 registers for pointers. I realize the Intel x86 architecture is quite constrained in that it has so few general purpose registers, but there has to be better code than what GCC emitted above. It seems like the processor would stall because of the quantity of sequential inter-dependent instructions that can't be done in parallel (mov to memory that depends on a mov to eax, etc). I suppose the code might not be stalling if it's using the maximum number of registers and doing as many memory accesses that it can per clock, but based on known details about the architecture, does it seem to be doing that? - Steven