git.vger.kernel.org archive mirror
* [PATCH] lookup_object(): Speed up 'git gc' by 12%, by reducing hash chain length
@ 2011-04-27 21:35 Ingo Molnar
  2011-04-29  6:22 ` Ingo Molnar
                   ` (2 more replies)
  0 siblings, 3 replies; 9+ messages in thread
From: Ingo Molnar @ 2011-04-27 21:35 UTC (permalink / raw)
  To: git


I was looking at the 'git gc' stalled-cycles profile and noticed that 
lookup_object() was the top entry:

 aldebaran:~/git> perf record -e stalled-cycles -F 10000 ./git gc 
 Counting objects: 32459, done.
 Delta compression using up to 16 threads.
 Compressing objects: 100% (8161/8161), done.
 Writing objects: 100% (32459/32459), done.
 Total 32459 (delta 24077), reused 32459 (delta 24077)
 [ perf record: Woken up 5 times to write data ]
 [ perf record: Captured and wrote 1.199 MB perf.data (~52366 samples) ]

 aldebaran:~/git> perf report | head
 # Events: 30K stalled-cycles
 #
 # Overhead     Command          Shared Object                                     Symbol
 # ........  ..........  .....................  .........................................
 #
    36.36%         git  git                    [.] lookup_object
     6.55%         git  git                    [.] find_pack_entry_one
     6.53%         git  libz.so.1.2.5          [.] 0xc416          
     5.94%         git  libz.so.1.2.5          [.] inflate
     4.12%         git  [kernel.kallsyms]      [k] do_raw_spin_lock

Annotated output showed the culprit:

         :              if (!obj_hash)
         :                      return NULL;
         :
         :              i = hashtable_index(sha1);
         :              while ((obj = obj_hash[i]) != NULL) {
    4.13 :        498316:       eb 1f                   jmp    498337 <lookup_object+0x47>
    0.00 :        498318:       0f 1f 84 00 00 00 00    nopl   0x0(%rax,%rax,1)
    0.00 :        49831f:       00 
         :                      if (!hashcmp(sha1, obj->sha1))
    1.48 :        498320:       48 8d 78 04             lea    0x4(%rax),%rdi
    0.02 :        498324:       4c 89 d6                mov    %r10,%rsi
    0.00 :        498327:       4c 89 d9                mov    %r11,%rcx
   26.12 :        49832a:       f3 a6                   repz cmpsb %es:(%rdi),%ds:(%rsi)
   17.12 :        49832c:       74 14                   je     498342 <lookup_object+0x52>
         :                              break;
         :                      i++;
    6.88 :        49832e:       83 c2 01                add    $0x1,%edx
         :                      if (i == obj_hash_size)
         :                              i = 0;
    2.28 :        498331:       44 39 ca                cmp    %r9d,%edx
    0.24 :        498334:       0f 44 d3                cmove  %ebx,%edx
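
For reference, the probing loop being annotated corresponds to roughly the
following C (a sketch reconstructed from object.c of that era; hashcmp() is
essentially a 20-byte memcmp, which is what the repz cmpsb above implements,
and sha1[] sits at offset 4 in struct object, matching the lea):

```c
#include <string.h>

/* Simplified: the real struct object carries more fields. */
struct object {
	unsigned flags;
	unsigned char sha1[20];
};

static struct object **obj_hash;
static int obj_hash_size;

static unsigned int hashtable_index(const unsigned char *sha1)
{
	unsigned int i;
	memcpy(&i, sha1, sizeof(unsigned int));
	return i % obj_hash_size;
}

struct object *lookup_object(const unsigned char *sha1)
{
	unsigned int i;
	struct object *obj;

	if (!obj_hash)
		return NULL;

	i = hashtable_index(sha1);
	while ((obj = obj_hash[i]) != NULL) {
		if (!memcmp(sha1, obj->sha1, 20))	/* hashcmp() */
			break;
		i++;			/* open addressing: walk the chain */
		if (i == (unsigned int)obj_hash_size)
			i = 0;
	}
	return obj;
}
```

Every miss along the chain costs another 20-byte compare plus a potentially
cache-cold pointer load, which is why chain length dominates the profile.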

"perf stat --detailed" shows us the following picture:

 Performance counter stats for './git gc':

       3145.596314 task-clock               #    0.877 CPUs utilized          
             1,760 context-switches         #    0.001 M/sec                  
               174 CPU-migrations           #    0.000 M/sec                  
            41,509 page-faults              #    0.013 M/sec                  
     9,753,859,587 cycles                   #    3.101 GHz                      (22.91%)
     2,555,944,921 stalled-cycles           #   26.20% of all cycles are idle   (33.89%)
     8,976,468,086 instructions             #    0.92  insns per cycle        
                                            #    0.28  stalled cycles per insn  (44.83%)
     1,782,743,476 branches                 #  566.743 M/sec                    (55.70%)
        85,045,367 branch-misses            #    4.77% of all branches          (66.54%)
     1,982,452,996 L1-dcache-loads          #  630.231 M/sec                    (66.18%)
       152,320,833 L1-dcache-load-misses    #    7.68% of all L1-dcache hits    (55.50%)
        43,358,073 LLC-loads                #   13.784 M/sec                    (45.33%)
         2,636,774 LLC-load-misses          #    0.838 M/sec                    (11.50%)

        3.586922714  seconds time elapsed

... so git gc still mostly fits into the L1 cache, and it rarely falls out 
of the L2 cache. So CPU execution is stalling on walking longish hash 
chains and comparing SHA-1s.

So I tried the quick patch below, which grows the object hash more 
aggressively: to 16x the object count, instead of the previous 2x sizing.

The results are (run against the Git repo itself, on ec014ea ("Git 1.7.5")):

 #
 # Before:
 #

 Performance counter stats for './git gc' (10 runs):

       3147.437358 task-clock               #    0.793 CPUs utilized            ( +-  0.18% )
             1,753 context-switches         #    0.001 M/sec                    ( +-  3.09% )
               165 CPU-migrations           #    0.000 M/sec                    ( +-  2.86% )
            42,587 page-faults              #    0.014 M/sec                    ( +-  0.04% )
    10,041,078,653 cycles                   #    3.190 GHz                      ( +-  0.18% )
     2,613,923,719 stalled-cycles           #   26.03% of all cycles are idle   ( +-  0.45% )
     9,110,524,009 instructions             #    0.91  insns per cycle        
                                            #    0.29  stalled cycles per insn  ( +-  0.03% )
     1,796,732,369 branches                 #  570.856 M/sec                    ( +-  0.04% )
        84,828,313 branch-misses            #    4.72% of all branches          ( +-  0.06% )

        3.971525714  seconds time elapsed  ( +-  8.56% )

 #
 # After:
 #

 Performance counter stats for './git gc' (10 runs):

       2805.034899 task-clock               #    0.757 CPUs utilized            ( +-  0.16% )
             1,709 context-switches         #    0.001 M/sec                    ( +-  2.51% )
               169 CPU-migrations           #    0.000 M/sec                    ( +-  1.73% )
            42,963 page-faults              #    0.015 M/sec                    ( +-  0.23% )
     8,944,314,899 cycles                   #    3.189 GHz                      ( +-  0.17% )
     2,118,720,399 stalled-cycles           #   23.69% of all cycles are idle   ( +-  0.52% )
     9,017,027,059 instructions             #    1.01  insns per cycle        
                                            #    0.23  stalled cycles per insn  ( +-  0.03% )
     1,780,388,097 branches                 #  634.712 M/sec                    ( +-  0.04% )
        76,104,907 branch-misses            #    4.27% of all branches          ( +-  0.07% )

        3.707549437  seconds time elapsed  ( +-  7.13% )

The takeaway is that stalled cycles dropped by about 19% (the "before" run 
had 23% more stalled cycles):

     2,613,923,719 stalled-cycles           #   26.03% of all cycles are idle   ( +-  0.45% )
     2,118,720,399 stalled-cycles           #   23.69% of all cycles are idle   ( +-  0.52% )

And total runtime (measured in cycles) decreased by about 11% - i.e. a ~12% 
speedup, as advertised in the subject:

    10,041,078,653 cycles                   #    3.190 GHz                      ( +-  0.18% )
     8,944,314,899 cycles                   #    3.189 GHz                      ( +-  0.17% )

Elapsed time dropped as well, as expected, but measurement noise [last 
column] is high there due to IO effects.

Cache misses are down as well:

 # Before:

     1,982,452,996 L1-dcache-loads          #  630.231 M/sec                    (66.18%)
       152,320,833 L1-dcache-load-misses    #    7.68% of all L1-dcache hits    (55.50%)
        43,358,073 LLC-loads                #   13.784 M/sec                    (45.33%)
         2,636,774 LLC-load-misses          #    0.838 M/sec                    (11.50%)

 # After:

     1,946,597,133 L1-dcache-loads          #  687.078 M/sec                    (67.61%)
       125,634,149 L1-dcache-load-misses    #    6.45% of all L1-dcache hits    (56.38%)
        30,778,323 LLC-loads                #   10.864 M/sec                    (44.40%)
         2,697,827 LLC-load-misses          #    0.952 M/sec                    (10.94%)

This is somewhat surprising, since the hash table got larger after all. 
Cache misses probably went down because less chain-walking reduces the 
effective working set size of git gc.

So oversizing the hash seems to work well for git gc. I tried 4x and 8x 
oversizing as well; both were improvements, but 16x clearly beat them. 
Larger than 16x seems like overkill.

I guess the hash function itself is as good as it gets:

 static unsigned int hashtable_index(const unsigned char *sha1)
 {
         unsigned int i;
         memcpy(&i, sha1, sizeof(unsigned int));
         return i % obj_hash_size;
 }

as SHA-1s ought to be fairly well distributed. The problem, I suspect, is 
that even with a perfectly random distribution of the hash there will 
always be second-, third- and higher-order chains, which get progressively 
more expensive to walk. So a 2x-sized hash table becomes overcrowded.

Thanks,

	Ingo

Signed-off-by: Ingo Molnar <mingo@elte.hu>

diff --git a/object.c b/object.c
index 7e1f2bb..b3fe485 100644
--- a/object.c
+++ b/object.c
@@ -91,7 +91,7 @@ struct object *lookup_object(const unsigned char *sha1)
 static void grow_object_hash(void)
 {
 	int i;
-	int new_hash_size = obj_hash_size < 32 ? 32 : 2 * obj_hash_size;
+	int new_hash_size = obj_hash_size < 32 ? 32 : 16 * obj_hash_size;
 	struct object **new_hash;
 
 	new_hash = xcalloc(new_hash_size, sizeof(struct object *));
@@ -116,7 +116,7 @@ void *create_object(const unsigned char *sha1, int type, void *o)
 	obj->flags = 0;
 	hashcpy(obj->sha1, sha1);
 
-	if (obj_hash_size - 1 <= nr_objs * 2)
+	if (obj_hash_size - 1 <= nr_objs * 16)
 		grow_object_hash();
 
 	insert_obj_hash(obj, obj_hash, obj_hash_size);

^ permalink raw reply related	[flat|nested] 9+ messages in thread

* Re: [PATCH] lookup_object(): Speed up 'git gc' by 12%, by reducing hash chain length
  2011-04-27 21:35 [PATCH] lookup_object(): Speed up 'git gc' by 12%, by reducing hash chain length Ingo Molnar
@ 2011-04-29  6:22 ` Ingo Molnar
  2011-04-29  6:58 ` Junio C Hamano
  2011-05-01 13:21 ` Avi Kivity
  2 siblings, 0 replies; 9+ messages in thread
From: Ingo Molnar @ 2011-04-29  6:22 UTC (permalink / raw)
  To: git, Junio C Hamano


I got no feedback on this patch - it might have been drowned out in the 
hashcmp discussion :)

I tried various object hash spreading factors and 16x seemed like a good 
one, but I can try a different patch as well if there are other ideas for 
fixing this source of CPU inefficiency.

Thanks,

	Ingo



* Re: [PATCH] lookup_object(): Speed up 'git gc' by 12%, by reducing hash chain length
  2011-04-27 21:35 [PATCH] lookup_object(): Speed up 'git gc' by 12%, by reducing hash chain length Ingo Molnar
  2011-04-29  6:22 ` Ingo Molnar
@ 2011-04-29  6:58 ` Junio C Hamano
  2011-04-29  7:26   ` Ingo Molnar
  2011-04-29 16:57   ` Shawn Pearce
  2011-05-01 13:21 ` Avi Kivity
  2 siblings, 2 replies; 9+ messages in thread
From: Junio C Hamano @ 2011-04-29  6:58 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: git, Shawn O. Pearce

Ingo Molnar <mingo@elte.hu> writes:

> diff --git a/object.c b/object.c
> index 7e1f2bb..b3fe485 100644
> --- a/object.c
> +++ b/object.c
> @@ -91,7 +91,7 @@ struct object *lookup_object(const unsigned char *sha1)
>  static void grow_object_hash(void)
>  {
>  	int i;
> -	int new_hash_size = obj_hash_size < 32 ? 32 : 2 * obj_hash_size;
> +	int new_hash_size = obj_hash_size < 32 ? 32 : 16 * obj_hash_size;
>  	struct object **new_hash;
>  
>  	new_hash = xcalloc(new_hash_size, sizeof(struct object *));
> @@ -116,7 +116,7 @@ void *create_object(const unsigned char *sha1, int type, void *o)
>  	obj->flags = 0;
>  	hashcpy(obj->sha1, sha1);
>  
> -	if (obj_hash_size - 1 <= nr_objs * 2)
> +	if (obj_hash_size - 1 <= nr_objs * 16)
>  		grow_object_hash();
>  
>  	insert_obj_hash(obj, obj_hash, obj_hash_size);

Shawn was telling me about this exact topic a few months ago, and I do
agree that object hash grows too slowly when you need to slurp in many
objects.

A few random thoughts, some are related and others are unrelated to what
your patch does:

 - We start out from a tiny table of 32 entries.  Would it make things
   noticeably worse if we started with a larger table for a workload that
   touches only a few dozen objects?  How about starting from a table with,
   say, 4 pages' worth of pointers?  Note that this is not about helping
   the near-full-walk case.

 - I agree x2 is growing the table too slowly, but at the same time, I do
   not think x16 is growing fast enough, if you will end up slurping
   millions of objects.  You would still need to rehash 5 times (maybe 4,
   I cannot count).

   Worse yet, the later rehash costs proportionally more (IOW having to
   rehash 5 times is not just 25% more expensive than having to rehash 4
   times).

 - If we grow the table too fast, wouldn't that limit the largest table we
   could use?  When we try to grow a large but crowded table by x2, we may
   be able to get enough core to rehash, but we may not be able to allocate
   x16 of such an already large table.

I have a hunch that the workloads that truly require holding a huge
number of objects are limited, and we can enumerate them relatively
easily.  The callchain from gc to repack to pack-objects is one.  The
codepath in pack-objects that is called from "repack -a -d" should be able
to guess that it needs to slurp everything from the fact that there is no
negative ref given in the revision walking machinery.  It also should be
able to guess whether the repository has a large number of objects by
looking at the existing pack .idx files (they record the number of objects
the corresponding .pack files contain in a way that is cheap to read).

It might make sense to give an explicit hint to grow_object_hash() in such
a case (i.e. the caller, e.g. pack-objects, sets a flag for it to notice),
and have grow_object_hash() immediately jump to a huge hash size.
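
Such a hint could look roughly like this (entirely hypothetical sketch: the 
function name is invented for illustration, it only handles the empty-table 
case, and real code would have to rehash any existing entries):

```c
#include <stdlib.h>

struct object;

static struct object **obj_hash;
static int obj_hash_size;

/* Hypothetical hint API: a caller that can estimate how many objects
 * it will slurp (e.g. pack-objects summing the object counts recorded
 * in the existing pack .idx files) pre-sizes the table in one jump,
 * skipping the progressively more expensive intermediate rehashes. */
void preallocate_object_hash(unsigned int expected_objects)
{
	int new_size = 32;

	if (obj_hash)
		return;	/* sketch only: pre-size just an empty table */

	/* Pick the first power-of-two size that keeps the table at
	 * most half full, mirroring the existing 2x oversizing. */
	while ((unsigned int)new_size < expected_objects * 2)
		new_size *= 2;

	obj_hash = calloc(new_size, sizeof(*obj_hash));
	obj_hash_size = new_size;
}
```

Callers without an estimate simply never call it and get the current 
incremental-growth behavior unchanged.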


* Re: [PATCH] lookup_object(): Speed up 'git gc' by 12%, by reducing hash chain length
  2011-04-29  6:58 ` Junio C Hamano
@ 2011-04-29  7:26   ` Ingo Molnar
  2011-04-29  7:38     ` Ingo Molnar
  2011-04-29 16:57   ` Shawn Pearce
  1 sibling, 1 reply; 9+ messages in thread
From: Ingo Molnar @ 2011-04-29  7:26 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, Shawn O. Pearce


* Junio C Hamano <gitster@pobox.com> wrote:

> Ingo Molnar <mingo@elte.hu> writes:
> 
> > diff --git a/object.c b/object.c
> > index 7e1f2bb..b3fe485 100644
> > --- a/object.c
> > +++ b/object.c
> > @@ -91,7 +91,7 @@ struct object *lookup_object(const unsigned char *sha1)
> >  static void grow_object_hash(void)
> >  {
> >  	int i;
> > -	int new_hash_size = obj_hash_size < 32 ? 32 : 2 * obj_hash_size;
> > +	int new_hash_size = obj_hash_size < 32 ? 32 : 16 * obj_hash_size;
> >  	struct object **new_hash;
> >  
> >  	new_hash = xcalloc(new_hash_size, sizeof(struct object *));
> > @@ -116,7 +116,7 @@ void *create_object(const unsigned char *sha1, int type, void *o)
> >  	obj->flags = 0;
> >  	hashcpy(obj->sha1, sha1);
> >  
> > -	if (obj_hash_size - 1 <= nr_objs * 2)
> > +	if (obj_hash_size - 1 <= nr_objs * 16)
> >  		grow_object_hash();
> >  
> >  	insert_obj_hash(obj, obj_hash, obj_hash_size);
> 
> Shawn was telling me about this exact topic a few months ago, and I do
> agree that object hash grows too slowly when you need to slurp in many
> objects.

I think the main effect might not be the rate of growth and the reduced 
overhead of reallocating and rebuilding the hash 4-6 times, but the 
*spread* of objects within the hash table - i.e. the maximum (and thus 
optimal) size of the hash.

In a git gc run the hash grows to its maximum very quickly, and then 99% 
of execution time is spent with that optimally sized hash - so the growth 
rate per se does not matter much. (It might matter in other use cases.)

Find below a debug patch I use to run with a configurable spread.

Note, I just ran the patch on a different system and there the effect was 
much less pronounced. So I'd prefer independent confirmation that it 
speeds things up for others as well.

I'll run more numbers - maybe we are just very sensitive to the exact 
layout of the object hash, and the 16x spread happened to create a 
different, more optimal layout.

Thanks,

	Ingo

---

 git.c    |    6 ++++++
 object.c |    5 +++--
 object.h |    1 +
 3 files changed, 10 insertions(+), 2 deletions(-)

diff --git a/git.c b/git.c
index 4b7dbfa..4c59316 100644
--- a/git.c
+++ b/git.c
@@ -4,6 +4,7 @@
 #include "help.h"
 #include "quote.h"
 #include "run-command.h"
+#include "object.h"
 
 const char git_usage_string[] =
 	"git [--version] [--exec-path[=<path>]] [--html-path]\n"
@@ -97,6 +98,11 @@ static int handle_options(const char ***argv, int *argc, int *envchanged)
 			exit(0);
 		} else if (!strcmp(cmd, "-p") || !strcmp(cmd, "--paginate")) {
 			use_pager = 1;
+		} else if (!strcmp(cmd, "--object-hash-spread")) {
+			object_hash_spread = atol((*argv)[1]);
+			printf("object hash spread: %d\n", object_hash_spread);
+			(*argv)++;
+			(*argc)--;
 		} else if (!strcmp(cmd, "--no-pager")) {
 			use_pager = 0;
 			if (envchanged)
diff --git a/object.c b/object.c
index 7e1f2bb..3d16a8a 100644
--- a/object.c
+++ b/object.c
@@ -7,6 +7,7 @@
 
 static struct object **obj_hash;
 static int nr_objs, obj_hash_size;
+int object_hash_spread = 2;
 
 unsigned int get_max_object_index(void)
 {
@@ -91,7 +92,7 @@ struct object *lookup_object(const unsigned char *sha1)
 static void grow_object_hash(void)
 {
 	int i;
-	int new_hash_size = obj_hash_size < 32 ? 32 : 2 * obj_hash_size;
+	int new_hash_size = obj_hash_size < 32 ? 32 : object_hash_spread * obj_hash_size;
 	struct object **new_hash;
 
 	new_hash = xcalloc(new_hash_size, sizeof(struct object *));
@@ -116,7 +117,7 @@ void *create_object(const unsigned char *sha1, int type, void *o)
 	obj->flags = 0;
 	hashcpy(obj->sha1, sha1);
 
-	if (obj_hash_size - 1 <= nr_objs * 2)
+	if (obj_hash_size - 1 <= nr_objs * object_hash_spread)
 		grow_object_hash();
 
 	insert_obj_hash(obj, obj_hash, obj_hash_size);
diff --git a/object.h b/object.h
index b6618d9..180a6c1 100644
--- a/object.h
+++ b/object.h
@@ -75,5 +75,6 @@ int object_list_contains(struct object_list *list, struct object *obj);
 void add_object_array(struct object *obj, const char *name, struct object_array *array);
 void add_object_array_with_mode(struct object *obj, const char *name, struct object_array *array, unsigned mode);
 void object_array_remove_duplicates(struct object_array *);
+extern int object_hash_spread;
 
 #endif /* OBJECT_H */


* Re: [PATCH] lookup_object(): Speed up 'git gc' by 12%, by reducing hash chain length
  2011-04-29  7:26   ` Ingo Molnar
@ 2011-04-29  7:38     ` Ingo Molnar
  2011-04-29  7:46       ` Sverre Rabbelier
  0 siblings, 1 reply; 9+ messages in thread
From: Ingo Molnar @ 2011-04-29  7:38 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, Shawn O. Pearce


* Ingo Molnar <mingo@elte.hu> wrote:

> Find below a debug patch i use to run with a configurable spread.
> 
> Note, i just ran the patch on a different system and there the effect was 
> much less pronounced. So i'd prefer independent confirmation as well that it 
> speeds up things for others as well.
> 
> I'll run more numbers - maybe we are just very sensitive to the exact layout 
> of the object hash and a 16x spread created a different, more optimal layout.

Here are those numbers:

 $ for ((size=2; size<24; size++)); do printf "%5d: " $size; perf stat -e instructions:u -e cycles:u -e task-clock --sync --repeat 10 ./git --object-hash-spread $size gc 2>&1 | grep cycles; done 

    2:      9,362,801,669 cycles:u                 #    2.982 GHz                      ( +-  0.25% )
    3:      9,464,946,158 cycles:u                 #    2.993 GHz                      ( +-  1.17% )
    4:      9,382,214,358 cycles:u                 #    2.981 GHz                      ( +-  0.26% )
    5:      9,373,537,954 cycles:u                 #    2.986 GHz                      ( +-  0.24% )
    6:      9,492,635,404 cycles:u                 #    2.988 GHz                      ( +-  1.25% )
    7:      9,427,037,835 cycles:u                 #    2.982 GHz                      ( +-  0.19% )
    8:      9,311,764,604 cycles:u                 #    2.987 GHz                      ( +-  0.23% )
    9:      9,384,331,920 cycles:u                 #    2.985 GHz                      ( +-  0.27% )
   10:      9,388,460,044 cycles:u                 #    2.983 GHz                      ( +-  0.31% )
   11:      9,374,380,165 cycles:u                 #    2.984 GHz                      ( +-  0.25% )
   12:      9,417,466,827 cycles:u                 #    2.984 GHz                      ( +-  0.27% )
   13:      9,348,550,619 cycles:u                 #    2.982 GHz                      ( +-  0.12% )
   14:      9,369,435,508 cycles:u                 #    2.982 GHz                      ( +-  0.31% )
   15:      9,361,127,598 cycles:u                 #    2.983 GHz                      ( +-  0.27% )
   16:      9,402,077,866 cycles:u                 #    2.987 GHz                      ( +-  0.20% )
   17:      9,390,950,850 cycles:u                 #    2.985 GHz                      ( +-  0.27% )
   18:      9,355,126,542 cycles:u                 #    2.986 GHz                      ( +-  0.30% )
   19:      9,357,143,371 cycles:u                 #    2.974 GHz                      ( +-  0.33% )
   20:      9,372,977,607 cycles:u                 #    2.985 GHz                      ( +-  0.34% )
   21:      9,355,406,722 cycles:u                 #    2.985 GHz                      ( +-  0.45% )
   22:      9,342,730,882 cycles:u                 #    2.982 GHz                      ( +-  0.31% )
   23:      9,372,321,792 cycles:u                 #    2.982 GHz                      ( +-  0.28% )

They are utterly unconvincing - there seems to be no improvement; it's all 
within noise.

Thanks,

	Ingo


* Re: [PATCH] lookup_object(): Speed up 'git gc' by 12%, by reducing hash chain length
  2011-04-29  7:38     ` Ingo Molnar
@ 2011-04-29  7:46       ` Sverre Rabbelier
  2011-04-29  9:50         ` Ingo Molnar
  0 siblings, 1 reply; 9+ messages in thread
From: Sverre Rabbelier @ 2011-04-29  7:46 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Junio C Hamano, git, Shawn O. Pearce

Heya,

On Fri, Apr 29, 2011 at 09:38, Ingo Molnar <mingo@elte.hu> wrote:
>   16:      9,402,077,866 cycles:u                 #    2.987 GHz                      ( +-  0.20% )
>
> They are utterly unconvincing - there seems to be no improvement, it's all
> within noise.

Is this in a different repository from the one you ran the numbers on
initially, or did something else change to negate that 12.2% decrease?

-- 
Cheers,

Sverre Rabbelier

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] lookup_object(): Speed up 'git gc' by 12%, by reducing hash chain length
  2011-04-29  7:46       ` Sverre Rabbelier
@ 2011-04-29  9:50         ` Ingo Molnar
  0 siblings, 0 replies; 9+ messages in thread
From: Ingo Molnar @ 2011-04-29  9:50 UTC (permalink / raw)
  To: Sverre Rabbelier; +Cc: Junio C Hamano, git, Shawn O. Pearce


* Sverre Rabbelier <srabbelier@gmail.com> wrote:

> Heya,
> 
> On Fri, Apr 29, 2011 at 09:38, Ingo Molnar <mingo@elte.hu> wrote:
> >   16:      9,402,077,866 cycles:u                 #    2.987 GHz                      ( +-  0.20% )
> >
> > They are utterly unconvincing - there seems to be no improvement, it's all
> > within noise.
> 
> Is this in a different repository from the one you ran the numbers on
> initially, or did something else change to negate that 12.2% decrease?

Yes, a newer Git repository with more objects in it.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] lookup_object(): Speed up 'git gc' by 12%, by reducing hash chain length
  2011-04-29  6:58 ` Junio C Hamano
  2011-04-29  7:26   ` Ingo Molnar
@ 2011-04-29 16:57   ` Shawn Pearce
  1 sibling, 0 replies; 9+ messages in thread
From: Shawn Pearce @ 2011-04-29 16:57 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Ingo Molnar, git

On Thu, Apr 28, 2011 at 23:58, Junio C Hamano <gitster@pobox.com> wrote:
> Ingo Molnar <mingo@elte.hu> writes:
>
>> diff --git a/object.c b/object.c
>> index 7e1f2bb..b3fe485 100644
>> --- a/object.c
>> +++ b/object.c
>> @@ -91,7 +91,7 @@ struct object *lookup_object(const unsigned char *sha1)
>>  static void grow_object_hash(void)
>>  {
>>       int i;
>> -     int new_hash_size = obj_hash_size < 32 ? 32 : 2 * obj_hash_size;
>> +     int new_hash_size = obj_hash_size < 32 ? 32 : 16 * obj_hash_size;
>>       struct object **new_hash;
>
> Shawn was telling me about this exact topic a few months ago, and I do
> agree that object hash grows too slowly when you need to slurp in many
> objects.

My experience in JGit with this data structure isn't directly
repeatable in C Git. In Java I am battling more than just telling the
compiler what assembly instructions to emit. :-)

The table starts out too small, yes. JGit now has two different table
representations; one allocates at 2048 entries initially. With a load
factor of 50% (matches current C Git code before Ingo's x16 patch), we
fit 1024 objects before doubling in size. Within a hash chain JGit's
hashcmp() function evaluates like this:

  !memcmp(a + 4, b + 4, 16) && !memcmp(a, b, 4)

because the 1st word was already used as the hash index. This does
seem to help break away from a non-match quickly. But remember, this
is in Java where some stuff is pretty damn costly. A good x86 chip and
a halfway decent C compiler might be penalized with this variant of
hashcmp(), I don't know.
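
In C terms, the tail-first comparison described above might be sketched
like this (hypothetical helper name, not actual JGit or git code; it
returns nonzero on equality, unlike git's memcmp-style hashcmp()):

```c
#include <string.h>

/* Compare bytes 4..19 first, since the first word of the SHA-1 was
 * already consumed as the hash bucket index and is therefore likely
 * to match within a chain; check bytes 0..3 last. */
static int sha1_eq_tail_first(const unsigned char *a, const unsigned char *b)
{
	return !memcmp(a + 4, b + 4, 16) && !memcmp(a, b, 4);
}
```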


Our second table in JGit is very different from what C Git uses... but
it works better for the purpose we are discussing. We extended our
equivalent of "struct object" to have a chained "struct object *next"
field within the object itself, rather than using the table to contain
the chain. This does limit the object to being placed into only one
object hashtable, but this isn't actually a problem for either
implementation. The entire point of struct object and this table in
object.c is to have only one of them for the process.

The table is actually a variable sized array, similar to dynamic
hashing. There is a directory array of 1024 entries that points to
segment arrays; each segment array is 2048 struct object* entries.
Growing the array is just a matter of malloc(sizeof(struct object*) *
2048) and linking it into the next free slot of the directory array.
This avoids needing to grow the array and pay memory management
penalties associated with dropping the old one and picking up the new
one. glibc malloc/free might handle this workload OK, the Java GC
didn't so we had to use this more complex table. The table still
doubles in size, so during the 2nd grow we have to malloc 2 segment
arrays, 3rd grow we malloc 4 segment arrays, 4th grow we malloc 8
segment arrays.
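
A minimal sketch of that directory-of-segments layout (hypothetical
names and sizes taken from the description above, not JGit or git
source):

```c
#include <stdlib.h>

#define SEGMENT_BITS 11                        /* 2048 entries/segment */
#define SEGMENT_SIZE (1u << SEGMENT_BITS)
#define DIR_SIZE     1024                      /* directory slots */

struct object;                                 /* opaque here */
static struct object **directory[DIR_SIZE];
static unsigned int nr_segments;

/* Growing the table is just one malloc per new segment, linked into
 * the next free directory slot; existing segments are never moved. */
static int add_segment(void)
{
	struct object **seg;

	if (nr_segments >= DIR_SIZE)
		return -1;                     /* table is full */
	seg = calloc(SEGMENT_SIZE, sizeof(*seg));
	if (!seg)
		return -1;
	directory[nr_segments++] = seg;
	return 0;
}
```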

Searching in the table is a matter of taking the first 4 bytes of the
SHA-1, applying a mask to find which index of the directory array
holds the relevant segment array to scan. The higher bits get used to
access the index in the segment, and then the hash chain is walked
using a classical singly linked list:

  unsigned int h = *((unsigned int *)obj_sha1);
  struct object *obj = directory[h & mask][h >> SEGMENT_SHIFT];
  for (; obj; obj = obj->next) {
    if (!hashcmp(obj->sha1, obj_sha1))
      return obj;
  }
  return NULL;

With this approach we run the table at 100% capacity, which means the
table is much smaller: approximately one segment slot per object, plus
one "next" pointer inside each object. For 1,886,362 objects this
approach wastes only about 200,000 pointers. By contrast, the table
doubling with a 50% load factor has to grow to 4,194,304 pointers to
store those 1,886,362 objects, wasting 2,307,942 pointers.
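
The 50%-load-factor arithmetic above can be checked with a small sketch
(hypothetical helper; the growth rule approximates C git's "double when
half full" policy):

```c
/* Smallest power-of-two table that keeps a doubling hashtable below a
 * 50% load factor for the given object count. */
static unsigned long table_size_50pct(unsigned long objects)
{
	unsigned long size = 32;        /* C git's initial table size */
	while (size / 2 <= objects)     /* grow while load >= 50% */
		size *= 2;
	return size;
}
```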

A nice thing about the table is, it grows in small allocation bursts
and doesn't need to find 32 MiB of contiguous heap. Again this may not
matter much in a short-lived C process that has a relatively clean
heap. It matters a lot in a Java process that has been running for a
while and whose heap is pretty fragmented. It's also nice that the
older part of the table remains with us. We reuse the old segments as
we rehash objects on the chain. The rehashing is also very efficient,
we only need to inspect 1 additional bit on each object in the chain
to determine if it stays in the old segment, or moves to the newly
allocated sister segment.
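
That one-extra-bit rehash could look roughly like this in C (a sketch
with hypothetical types, not JGit or git code):

```c
#include <stddef.h>

struct obj {
	unsigned int hash;
	struct obj *next;
};

/* When the table doubles, walk one chain and decide per object,
 * from a single hash bit, whether it stays in the old segment or
 * moves to the newly allocated sister segment. */
static void split_chain(struct obj *chain, unsigned int bit,
			struct obj **stay, struct obj **move)
{
	*stay = *move = NULL;
	while (chain) {
		struct obj *next = chain->next;
		struct obj **dst = (chain->hash & bit) ? move : stay;
		chain->next = *dst;     /* push onto chosen chain */
		*dst = chain;
		chain = next;
	}
}
```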


Around this same time I did look at the chain lengths. The repository
in question was linux-2.6, but from around January 2011:

------
As far as chain lengths go, we're not that bad.

Here is the raw data. Objects is the number of items in the table at
the end, table is the size of the obj_hash array, wasted is the
difference. A hit is a successful get() returning an object, a miss is
a get that returned null and later turns into an insert. The number of
calls on chain lengths above 18 falls off fast so I elided it out.
With a 50% load factor, most operations see chains shorter than 7
items; a wide majority are below 2.

objects: 1886362
table: 4194304
 (wasted 2307942)
chains (hit):
 length   0 : 42396217 get calls
 length   1 : 13675300 get calls
 length   2 : 6756795 get calls
 length   3 : 3759100 get calls
 length   4 : 2213449 get calls
 length   5 : 1413852 get calls
 length   6 : 1046289 get calls
 length   7 : 812226 get calls
 length   8 : 596076 get calls
 length   9 : 529995 get calls
 length  10 : 357039 get calls
 length  11 : 321895 get calls
 length  12 : 261752 get calls
 length  13 : 162623 get calls
 length  14 : 163727 get calls
 length  15 : 112538 get calls
 length  16 : 78401 get calls
 length  17 : 103203 get calls
 length  18 : 81563 get calls
 ...
 length >63 : 11553 get calls

chains (miss):
 length   0 : 872894 get calls
 length   1 : 345177 get calls
 length   2 : 187751 get calls
 length   3 : 117402 get calls
 length   4 : 78825 get calls
 length   5 : 55710 get calls
 length   6 : 41359 get calls
 length   7 : 31547 get calls
 length   8 : 24517 get calls
 length   9 : 19507 get calls
 length  10 : 15565 get calls
 length  11 : 12767 get calls
 length  12 : 10550 get calls
 length  13 : 8895 get calls
 length  14 : 7573 get calls
 length  15 : 6382 get calls
 length  16 : 5542 get calls
 length  17 : 4847 get calls
 length  18 : 4162 get calls
 ...
 length >63 : 1096 get calls
------

I unfortunately didn't rerun the chain data with the newer table
design. Our goals weren't to reduce chain length, it was to reduce
memory management overheads associated with using the table in Java.
We succeeded there.
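
For reference, a histogram like the one above could be gathered in a
C-git-style lookup with instrumentation along these lines (hypothetical
counters, not actual git or JGit code):

```c
#define MAX_CHAIN 64

static unsigned long hit_chain[MAX_CHAIN + 1];
static unsigned long miss_chain[MAX_CHAIN + 1];

/* Record how many chain steps a get() took, bucketed into a hit
 * histogram (object found) or a miss histogram (returned NULL and
 * later became an insert); lengths above 63 share the last bucket. */
static void count_lookup(unsigned int steps, int hit)
{
	unsigned long *table = hit ? hit_chain : miss_chain;
	table[steps > MAX_CHAIN ? MAX_CHAIN : steps]++;
}
```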

-- 
Shawn.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] lookup_object(): Speed up 'git gc' by 12%, by reducing hash chain length
  2011-04-27 21:35 [PATCH] lookup_object(): Speed up 'git gc' by 12%, by reducing hash chain length Ingo Molnar
  2011-04-29  6:22 ` Ingo Molnar
  2011-04-29  6:58 ` Junio C Hamano
@ 2011-05-01 13:21 ` Avi Kivity
  2 siblings, 0 replies; 9+ messages in thread
From: Avi Kivity @ 2011-05-01 13:21 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: git

On 04/28/2011 12:35 AM, Ingo Molnar wrote:
>           :              while ((obj = obj_hash[i]) != NULL) {
>      4.13 :        498316:       eb 1f                   jmp    498337<lookup_object+0x47>
>      0.00 :        498318:       0f 1f 84 00 00 00 00    nopl   0x0(%rax,%rax,1)
>      0.00 :        49831f:       00
>           :                      if (!hashcmp(sha1, obj->sha1))
>      1.48 :        498320:       48 8d 78 04             lea    0x4(%rax),%rdi
>      0.02 :        498324:       4c 89 d6                mov    %r10,%rsi
>      0.00 :        498327:       4c 89 d9                mov    %r11,%rcx
>     26.12 :        49832a:       f3 a6                   repz cmpsb %es:(%rdi),%ds:(%rsi)
>     17.12 :        49832c:       74 14                   je     498342<lookup_object+0x52>
>           :                              break;

rep cmps can be very slow on some machines, and in particular, rep cmpsb 
is optimized for really small strings (the tail of a larger rep cmps[lq] 
run).

I think that if you'll replace hashcmp() by something like

static inline bool hashcmp(const unsigned char *sha1, const unsigned char *sha2)
{
      uint64_t cmp;

      cmp = *(uint64_t *)sha1 ^ *(uint64_t *)sha2;
      cmp |= *(uint64_t *)(sha1 + 8) ^ *(uint64_t *)(sha2 + 8);
      cmp |= *(uint32_t *)(sha1 + 16) ^ *(uint32_t *)(sha2 + 16);
      return cmp == 0;
}

You'll see much better results.

Of course this only works in general if the hashes are aligned.
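
If alignment cannot be guaranteed, a memcpy-based variant avoids both
the misaligned loads and the strict-aliasing problem; compilers
typically lower these memcpys to plain word loads anyway. A sketch
(hypothetical name, same true-on-equal semantics as the version quoted
above):

```c
#include <stdint.h>
#include <string.h>

/* Word-wise equality test for two 20-byte SHA-1s that works for
 * arbitrarily aligned inputs. */
static int hasheq_words(const unsigned char *sha1, const unsigned char *sha2)
{
	uint64_t a1, a2, b1, b2;
	uint32_t c1, c2;

	memcpy(&a1, sha1, 8);      memcpy(&a2, sha2, 8);
	memcpy(&b1, sha1 + 8, 8);  memcpy(&b2, sha2 + 8, 8);
	memcpy(&c1, sha1 + 16, 4); memcpy(&c2, sha2 + 16, 4);
	return a1 == a2 && b1 == b2 && c1 == c2;
}
```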

-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2011-05-01 13:21 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-04-27 21:35 [PATCH] lookup_object(): Speed up 'git gc' by 12%, by reducing hash chain length Ingo Molnar
2011-04-29  6:22 ` Ingo Molnar
2011-04-29  6:58 ` Junio C Hamano
2011-04-29  7:26   ` Ingo Molnar
2011-04-29  7:38     ` Ingo Molnar
2011-04-29  7:46       ` Sverre Rabbelier
2011-04-29  9:50         ` Ingo Molnar
2011-04-29 16:57   ` Shawn Pearce
2011-05-01 13:21 ` Avi Kivity
