All of lore.kernel.org
 help / color / mirror / Atom feed
From: Ingo Molnar <mingo@elte.hu>
To: Junio C Hamano <gitster@pobox.com>
Cc: git@vger.kernel.org, "Shawn O. Pearce" <spearce@spearce.org>
Subject: Re: [PATCH] lookup_object(): Speed up 'git gc' by 12%, by reducing hash chain length
Date: Fri, 29 Apr 2011 09:26:04 +0200	[thread overview]
Message-ID: <20110429072604.GA16371@elte.hu> (raw)
In-Reply-To: <7vtydh1r3q.fsf@alter.siamese.dyndns.org>


* Junio C Hamano <gitster@pobox.com> wrote:

> Ingo Molnar <mingo@elte.hu> writes:
> 
> > diff --git a/object.c b/object.c
> > index 7e1f2bb..b3fe485 100644
> > --- a/object.c
> > +++ b/object.c
> > @@ -91,7 +91,7 @@ struct object *lookup_object(const unsigned char *sha1)
> >  static void grow_object_hash(void)
> >  {
> >  	int i;
> > -	int new_hash_size = obj_hash_size < 32 ? 32 : 2 * obj_hash_size;
> > +	int new_hash_size = obj_hash_size < 32 ? 32 : 16 * obj_hash_size;
> >  	struct object **new_hash;
> >  
> >  	new_hash = xcalloc(new_hash_size, sizeof(struct object *));
> > @@ -116,7 +116,7 @@ void *create_object(const unsigned char *sha1, int type, void *o)
> >  	obj->flags = 0;
> >  	hashcpy(obj->sha1, sha1);
> >  
> > -	if (obj_hash_size - 1 <= nr_objs * 2)
> > +	if (obj_hash_size - 1 <= nr_objs * 16)
> >  		grow_object_hash();
> >  
> >  	insert_obj_hash(obj, obj_hash, obj_hash_size);
> 
> Shawn was telling me about this exact topic a few months ago, and I do
> agree that object hash grows too slowly when you need to slurp in many
> objects.

I think the main effect might not be the rate of growth and reduced overhead of 
reallocating and reconstructing the hash 4-6 times, but the *spread* of objects 
within the hash table - i.e. the maximum (i.e. optimal) size of the hash.

In a git gc run the hash grows to the max very quickly, then 99% of execution 
time is spent with that optimally sized hash - so growth rate per se does not 
matter much. (it might matter in other usecases)

Find below a debug patch i use to run with a configurable spread.

Note, i just ran the patch on a different system and there the effect was much 
less pronounced. So i'd prefer independent confirmation as well that it speeds 
up things for others as well.

I'll run more numbers - maybe we are just very sensitive to the exact layout of 
the object hash and a 16x spread created a different, more optimal layout.

Thanks,

	Ingo

---

 git.c    |    6 ++++++
 object.c |    5 +++--
 object.h |    1 +
 3 files changed, 10 insertions(+), 2 deletions(-)

diff --git a/git.c b/git.c
index 4b7dbfa..4c59316 100644
--- a/git.c
+++ b/git.c
@@ -4,6 +4,7 @@
 #include "help.h"
 #include "quote.h"
 #include "run-command.h"
+#include "object.h"
 
 const char git_usage_string[] =
 	"git [--version] [--exec-path[=<path>]] [--html-path]\n"
@@ -97,6 +98,11 @@ static int handle_options(const char ***argv, int *argc, int *envchanged)
 			exit(0);
 		} else if (!strcmp(cmd, "-p") || !strcmp(cmd, "--paginate")) {
 			use_pager = 1;
+		} else if (!strcmp(cmd, "--object-hash-spread")) {
+			object_hash_spread = atol((*argv)[1]);
+			printf("object hash spread: %d\n", object_hash_spread);
+			(*argv)++;
+			(*argc)--;
 		} else if (!strcmp(cmd, "--no-pager")) {
 			use_pager = 0;
 			if (envchanged)
diff --git a/object.c b/object.c
index 7e1f2bb..3d16a8a 100644
--- a/object.c
+++ b/object.c
@@ -7,6 +7,7 @@
 
 static struct object **obj_hash;
 static int nr_objs, obj_hash_size;
+int object_hash_spread = 2;
 
 unsigned int get_max_object_index(void)
 {
@@ -91,7 +92,7 @@ struct object *lookup_object(const unsigned char *sha1)
 static void grow_object_hash(void)
 {
 	int i;
-	int new_hash_size = obj_hash_size < 32 ? 32 : 2 * obj_hash_size;
+	int new_hash_size = obj_hash_size < 32 ? 32 : object_hash_spread * obj_hash_size;
 	struct object **new_hash;
 
 	new_hash = xcalloc(new_hash_size, sizeof(struct object *));
@@ -116,7 +117,7 @@ void *create_object(const unsigned char *sha1, int type, void *o)
 	obj->flags = 0;
 	hashcpy(obj->sha1, sha1);
 
-	if (obj_hash_size - 1 <= nr_objs * 2)
+	if (obj_hash_size - 1 <= nr_objs * object_hash_spread)
 		grow_object_hash();
 
 	insert_obj_hash(obj, obj_hash, obj_hash_size);
diff --git a/object.h b/object.h
index b6618d9..180a6c1 100644
--- a/object.h
+++ b/object.h
@@ -75,5 +75,6 @@ int object_list_contains(struct object_list *list, struct object *obj);
 void add_object_array(struct object *obj, const char *name, struct object_array *array);
 void add_object_array_with_mode(struct object *obj, const char *name, struct object_array *array, unsigned mode);
 void object_array_remove_duplicates(struct object_array *);
+extern int object_hash_spread;
 
 #endif /* OBJECT_H */

  reply	other threads:[~2011-04-29  7:26 UTC|newest]

Thread overview: 9+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2011-04-27 21:35 [PATCH] lookup_object(): Speed up 'git gc' by 12%, by reducing hash chain length Ingo Molnar
2011-04-29  6:22 ` Ingo Molnar
2011-04-29  6:58 ` Junio C Hamano
2011-04-29  7:26   ` Ingo Molnar [this message]
2011-04-29  7:38     ` Ingo Molnar
2011-04-29  7:46       ` Sverre Rabbelier
2011-04-29  9:50         ` Ingo Molnar
2011-04-29 16:57   ` Shawn Pearce
2011-05-01 13:21 ` Avi Kivity

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20110429072604.GA16371@elte.hu \
    --to=mingo@elte.hu \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=spearce@spearce.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.