From mboxrd@z Thu Jan 1 00:00:00 1970 From: Ingo Molnar Subject: Re: [PATCH] lookup_object(): Speed up 'git gc' by 12%, by reducing hash chain length Date: Fri, 29 Apr 2011 09:26:04 +0200 Message-ID: <20110429072604.GA16371@elte.hu> References: <20110427213502.GA13647@elte.hu> <7vtydh1r3q.fsf@alter.siamese.dyndns.org> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Cc: git@vger.kernel.org, "Shawn O. Pearce" To: Junio C Hamano X-From: git-owner@vger.kernel.org Fri Apr 29 09:26:23 2011 Return-path: Envelope-to: gcvg-git-2@lo.gmane.org Received: from vger.kernel.org ([209.132.180.67]) by lo.gmane.org with esmtp (Exim 4.69) (envelope-from ) id 1QFi5m-0004N3-Ub for gcvg-git-2@lo.gmane.org; Fri, 29 Apr 2011 09:26:23 +0200 Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755924Ab1D2H0Q (ORCPT ); Fri, 29 Apr 2011 03:26:16 -0400 Received: from mx2.mail.elte.hu ([157.181.151.9]:42802 "EHLO mx2.mail.elte.hu" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754747Ab1D2H0P (ORCPT ); Fri, 29 Apr 2011 03:26:15 -0400 Received: from elvis.elte.hu ([157.181.1.14]) by mx2.mail.elte.hu with esmtp (Exim) id 1QFi5X-000211-LL from ; Fri, 29 Apr 2011 09:26:13 +0200 Received: by elvis.elte.hu (Postfix, from userid 1004) id 110F13E2510; Fri, 29 Apr 2011 09:26:02 +0200 (CEST) Content-Disposition: inline In-Reply-To: <7vtydh1r3q.fsf@alter.siamese.dyndns.org> User-Agent: Mutt/1.5.20 (2009-08-17) Received-SPF: neutral (mx2.mail.elte.hu: 157.181.1.14 is neither permitted nor denied by domain of elte.hu) client-ip=157.181.1.14; envelope-from=mingo@elte.hu; helo=elvis.elte.hu; X-ELTE-SpamScore: -2.0 X-ELTE-SpamLevel: X-ELTE-SpamCheck: no X-ELTE-SpamVersion: ELTE 2.0 X-ELTE-SpamCheck-Details: score=-2.0 required=5.9 tests=BAYES_00 autolearn=no SpamAssassin version=3.3.1 -2.0 BAYES_00 BODY: Bayes spam probability is 0 to 1% [score: 0.0000] Sender: git-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: git@vger.kernel.org Archived-At: * Junio C Hamano wrote: > Ingo Molnar writes: > > > diff --git a/object.c b/object.c > > index 7e1f2bb..b3fe485 100644 > > --- a/object.c > > +++ b/object.c > > @@ -91,7 +91,7 @@ struct object *lookup_object(const unsigned char *sha1) > > static void grow_object_hash(void) > > { > > int i; > > - int new_hash_size = obj_hash_size < 32 ? 32 : 2 * obj_hash_size; > > + int new_hash_size = obj_hash_size < 32 ? 32 : 16 * obj_hash_size; > > struct object **new_hash; > > > > new_hash = xcalloc(new_hash_size, sizeof(struct object *)); > > @@ -116,7 +116,7 @@ void *create_object(const unsigned char *sha1, int type, void *o) > > obj->flags = 0; > > hashcpy(obj->sha1, sha1); > > > > - if (obj_hash_size - 1 <= nr_objs * 2) > > + if (obj_hash_size - 1 <= nr_objs * 16) > > grow_object_hash(); > > > > insert_obj_hash(obj, obj_hash, obj_hash_size); > > Shawn was telling me about this exact topic a few months ago, and I do > agree that object hash grows too slowly when you need to slurp in many > objects. I think the main effect might not be the rate of growth and reduced overhead of reallocating and reconstructing the hash 4-6 times, but the *spread* of objects within the hash table - i.e. the maximum (i.e. optimal) size of the hash. In a git gc run the hash grows to the max very quickly, then 99% of execution time is spent with that optimally sized hash - so growth rate per se does not matter much. (it might matter in other usecases) Find below a debug patch i use to run with a configurable spread. Note, i just ran the patch on a different system and there the effect was much less pronounced. So i'd prefer independent confirmation as well that it speeds up things for others as well. I'll run more numbers - maybe we are just very sensitive to the exact layout of the object hash and a 16x spread created a different, more optimal layout. Thanks, Ingo --- git.c | 6 ++++++ object.c | 5 +++-- object.h | 1 + 3 files changed, 10 insertions(+), 2 deletions(-) diff --git a/git.c b/git.c index 4b7dbfa..4c59316 100644 --- a/git.c +++ b/git.c @@ -4,6 +4,7 @@ #include "help.h" #include "quote.h" #include "run-command.h" +#include "object.h" const char git_usage_string[] = "git [--version] [--exec-path[=]] [--html-path]\n" @@ -97,6 +98,11 @@ static int handle_options(const char ***argv, int *argc, int *envchanged) exit(0); } else if (!strcmp(cmd, "-p") || !strcmp(cmd, "--paginate")) { use_pager = 1; + } else if (!strcmp(cmd, "--object-hash-spread")) { + object_hash_spread = atol((*argv)[1]); + printf("object hash spread: %d\n", object_hash_spread); + (*argv)++; + (*argc)--; } else if (!strcmp(cmd, "--no-pager")) { use_pager = 0; if (envchanged) diff --git a/object.c b/object.c index 7e1f2bb..3d16a8a 100644 --- a/object.c +++ b/object.c @@ -7,6 +7,7 @@ static struct object **obj_hash; static int nr_objs, obj_hash_size; +int object_hash_spread = 2; unsigned int get_max_object_index(void) { @@ -91,7 +92,7 @@ struct object *lookup_object(const unsigned char *sha1) static void grow_object_hash(void) { int i; - int new_hash_size = obj_hash_size < 32 ? 32 : 2 * obj_hash_size; + int new_hash_size = obj_hash_size < 32 ? 32 : object_hash_spread * obj_hash_size; struct object **new_hash; new_hash = xcalloc(new_hash_size, sizeof(struct object *)); @@ -116,7 +117,7 @@ void *create_object(const unsigned char *sha1, int type, void *o) obj->flags = 0; hashcpy(obj->sha1, sha1); - if (obj_hash_size - 1 <= nr_objs * 2) + if (obj_hash_size - 1 <= nr_objs * object_hash_spread) grow_object_hash(); insert_obj_hash(obj, obj_hash, obj_hash_size); diff --git a/object.h b/object.h index b6618d9..180a6c1 100644 --- a/object.h +++ b/object.h @@ -75,5 +75,6 @@ int object_list_contains(struct object_list *list, struct object *obj); void add_object_array(struct object *obj, const char *name, struct object_array *array); void add_object_array_with_mode(struct object *obj, const char *name, struct object_array *array, unsigned mode); void object_array_remove_duplicates(struct object_array *); +extern int object_hash_spread; #endif /* OBJECT_H */