From mboxrd@z Thu Jan 1 00:00:00 1970 From: Ingo Molnar Subject: [PATCH] git gc: Speed it up by 18% via faster hash comparisons Date: Thu, 28 Apr 2011 00:51:14 +0200 Message-ID: <20110427225114.GA16765@elte.hu> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Cc: "H. Peter Anvin" , Thomas Gleixner , Peter Zijlstra , Arnaldo Carvalho de Melo , =?iso-8859-1?Q?Fr=E9d=E9ric?= Weisbecker , Pekka Enberg To: git@vger.kernel.org X-From: git-owner@vger.kernel.org Thu Apr 28 00:52:00 2011 Return-path: Envelope-to: gcvg-git-2@lo.gmane.org Received: from vger.kernel.org ([209.132.180.67]) by lo.gmane.org with esmtp (Exim 4.69) (envelope-from ) id 1QFDaS-0006jp-5V for gcvg-git-2@lo.gmane.org; Thu, 28 Apr 2011 00:52:00 +0200 Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1757053Ab1D0Wvz (ORCPT ); Wed, 27 Apr 2011 18:51:55 -0400 Received: from mx2.mail.elte.hu ([157.181.151.9]:34278 "EHLO mx2.mail.elte.hu" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751227Ab1D0Wvy (ORCPT ); Wed, 27 Apr 2011 18:51:54 -0400 Received: from elvis.elte.hu ([157.181.1.14]) by mx2.mail.elte.hu with esmtp (Exim) id 1QFDZk-0003Mo-QC from ; Thu, 28 Apr 2011 00:51:18 +0200 Received: by elvis.elte.hu (Postfix, from userid 1004) id 931503E250F; Thu, 28 Apr 2011 00:51:14 +0200 (CEST) Content-Disposition: inline User-Agent: Mutt/1.5.20 (2009-08-17) Received-SPF: neutral (mx2.mail.elte.hu: 157.181.1.14 is neither permitted nor denied by domain of elte.hu) client-ip=157.181.1.14; envelope-from=mingo@elte.hu; helo=elvis.elte.hu; X-ELTE-SpamScore: -2.0 X-ELTE-SpamLevel: X-ELTE-SpamCheck: no X-ELTE-SpamVersion: ELTE 2.0 X-ELTE-SpamCheck-Details: score=-2.0 required=5.9 tests=BAYES_00 autolearn=no SpamAssassin version=3.3.1 -2.0 BAYES_00 BODY: Bayes spam probability is 0 to 1% [score: 0.0000] Sender: git-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: git@vger.kernel.org Archived-At: Looking at the stalled-cycled profile of a cached 'git gc' i noticed the find_pack_entry_one() entry: $ perf record -e stalled-cycles -F 10000 ./git gc $ perf report # Events: 26K stalled-cycles # # Overhead Command Shared Object Symbol # ........ .......... ..................... ......................................... # 26.07% git git [.] lookup_object 10.22% git libz.so.1.2.5 [.] 0xc43a 7.08% git libz.so.1.2.5 [.] inflate 6.63% git git [.] find_pack_entry_one 5.37% git [kernel.kallsyms] [k] do_raw_spin_lock 4.03% git git [.] lookup_blob 3.09% git libc-2.13.90.so [.] __strlen_sse42 2.81% git libc-2.13.90.so [.] __memcpy_ssse3_back Annotation shows: $ perf annotate find_pack_entry_one Percent | Source code & Disassembly of git ------------------------------------------------ : ... : int cmp = hashcmp(index + mi * stride, sha1); 0.90 : 4b9264: 89 ee mov %ebp,%esi 0.45 : 4b9266: 41 0f af f2 imul %r10d,%esi 2.86 : 4b926a: 4c 01 de add %r11,%rsi 53.34 : 4b926d: f3 a6 repz cmpsb %es:(%rdi),%ds:(%rsi) 14.37 : 4b926f: 0f 92 c0 setb %al 5.78 : 4b9272: 41 0f 97 c4 seta %r12b 1.52 : 4b9276: 41 28 c4 sub %al,%r12b Most overhead is in hashcmp(), which uses memcmp(), which falls back to assembly string operations. But we know that hashcmp() compares hashes, which if they do not match, the first byte will differ in 99% of the cases. So i tried the patch below: instead of relying on GCC putting in the string ops, i used an open-coded loop for this relatively short comparison, which does not go beyond the first byte in 99% of the cases. While it at i also open-coded the is_null_sha1() comparison: instead of comparing it to null_sha1[] byte by byte, we can use the sha1 value directly and use much more optimal 64-bit and 32-bit comparisons. The results were rather surprising: # # Before: # $ perf stat --sync --repeat 10 ./git gc Performance counter stats for './git gc' (10 runs): 2771.119892 task-clock # 0.863 CPUs utilized ( +- 0.16% ) 1,813 context-switches # 0.001 M/sec ( +- 3.06% ) 167 CPU-migrations # 0.000 M/sec ( +- 2.92% ) 39,210 page-faults # 0.014 M/sec ( +- 0.26% ) 8,828,405,654 cycles # 3.186 GHz ( +- 0.13% ) 2,102,083,909 stalled-cycles # 23.81% of all cycles are idle ( +- 0.52% ) 8,821,931,740 instructions # 1.00 insns per cycle # 0.24 stalled cycles per insn ( +- 0.04% ) 1,750,408,175 branches # 631.661 M/sec ( +- 0.04% ) 74,612,120 branch-misses # 4.26% of all branches ( +- 0.07% ) 3.211098537 seconds time elapsed ( +- 1.52% ) [ Note: the --sync option to perf stat emits a sync() before each 'git gc' test-run, this reduces the noise of 'elapsed time' numbers enormously. ] # # After: # $ perf stat --sync --repeat 10 ./git gc Performance counter stats for './git gc' (10 runs): 2349.498022 task-clock # 0.807 CPUs utilized ( +- 0.15% ) 1,842 context-switches # 0.001 M/sec ( +- 2.50% ) 164 CPU-migrations # 0.000 M/sec ( +- 3.67% ) 39,350 page-faults # 0.017 M/sec ( +- 0.06% ) 7,484,317,230 cycles # 3.185 GHz ( +- 0.15% ) 1,577,673,341 stalled-cycles # 21.08% of all cycles are idle ( +- 0.67% ) 11,067,826,786 instructions # 1.48 insns per cycle # 0.14 stalled cycles per insn ( +- 0.02% ) 2,489,157,909 branches # 1059.442 M/sec ( +- 0.02% ) 59,384,019 branch-misses # 2.39% of all branches ( +- 0.22% ) 2.910829134 seconds time elapsed ( +- 1.39% ) 'git gc' got faster by 18%! Interestingly, 33% of all prior stalled cycles disappeared: most of them turned into actual cycle count savings and speedups. So it's rather clear that the string assembly instructions based memcmp is suboptimal for short comparisons like this: there's quite a bit of setup latency in repz cmpsb and the CPU is idling around during that time. (I ran this on a Nehalem system, so it's a rather fast and modern Intel CPU.) Also note another very interesting detail, the number of branch misses went way down, well beyond the measurement noise: before: 74,612,120 branch-misses # 4.26% of all branches ( +- 0.07% ) after: 59,384,019 branch-misses # 2.39% of all branches ( +- 0.22% ) One theory would be that the open-coded loop is easier for the CPU to speculate along, so it produces less branch misses. The number of instructions and branches increased - this is mostly because the PMU counts a complex 'repz cmpsb' as a single instruction issuing many uops, while the open-coded loop consists of separate instructions - but roughly the same amount of uops. I suspect 'git fsck' got faster as well, but i have not measured that. There's more string op use in the Git sha1 code, but this was the lowest hanging fruit. Thanks, Ingo Signed-off-by: Ingo Molnar diff --git a/cache.h b/cache.h index 2674f4c..c5a54fb 100644 --- a/cache.h +++ b/cache.h @@ -675,14 +675,33 @@ extern char *sha1_pack_name(const unsigned char *sha1); extern char *sha1_pack_index_name(const unsigned char *sha1); extern const char *find_unique_abbrev(const unsigned char *sha1, int); extern const unsigned char null_sha1[20]; -static inline int is_null_sha1(const unsigned char *sha1) + +static inline int hashcmp(const unsigned char *sha1, const unsigned char *sha2) { - return !memcmp(sha1, null_sha1, 20); + int i; + + for (i = 0; i < 20; i++, sha1++, sha2++) { + if (*sha1 != *sha2) { + if (*sha1 < *sha2) + return -1; + return +1; + } + } + + return 0; } -static inline int hashcmp(const unsigned char *sha1, const unsigned char *sha2) + +static inline int is_null_sha1(const unsigned char *sha1) { - return memcmp(sha1, sha2, 20); + const unsigned long long *sha1_64 = (void *)sha1; + const unsigned int *sha1_32 = (void *)sha1; + + if (sha1_64[0] || sha1_64[1] || sha1_32[4]) + return 0; + + return 1; } + static inline void hashcpy(unsigned char *sha_dst, const unsigned char *sha_src) { memcpy(sha_dst, sha_src, 20);