* [PATCH] Preallocate hash tables when the number of inserts is known in advance
@ 2013-03-17 3:28 Nguyễn Thái Ngọc Duy
From: Nguyễn Thái Ngọc Duy @ 2013-03-17 3:28 UTC (permalink / raw)
To: git; +Cc: Junio C Hamano, Nguyễn Thái Ngọc Duy
This avoids unnecessary re-allocations and re-insertions. On webkit.git
(about 182k inserts into the name hash table), it cuts roughly 100ms
out of 3s of user time.
Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
---
nd/read-directory-recursive-optim reduces the number of inputs to the
exclude machinery (from 182k to 11k on webkit) so much that all the
patches in the exclude optimization series I posted now look
insignificant. So I won't repost them for inclusion unless you think
they have cleanup value.

This one is worth doing, though. I think keeping an "untracked index"
would help avoid lookups in name-hash, where the user-space CPU
cycles are spent. But I have nothing to show for that yet.
 diffcore-rename.c | 1 +
 hash.h            | 7 +++++++
 name-hash.c       | 2 ++
 3 files changed, 10 insertions(+)
diff --git a/diffcore-rename.c b/diffcore-rename.c
index 512d0ac..8d3d9bb 100644
--- a/diffcore-rename.c
+++ b/diffcore-rename.c
@@ -389,6 +389,7 @@ static int find_exact_renames(struct diff_options *options)
 	struct hash_table file_table;
 
 	init_hash(&file_table);
+	preallocate_hash(&file_table, (rename_src_nr + rename_dst_nr) * 2);
 	for (i = 0; i < rename_src_nr; i++)
 		insert_file_table(&file_table, -1, i, rename_src[i].p->one);
diff --git a/hash.h b/hash.h
index b875ce6..244d1fe 100644
--- a/hash.h
+++ b/hash.h
@@ -40,4 +40,11 @@ static inline void init_hash(struct hash_table *table)
 	table->array = NULL;
 }
 
+static inline void preallocate_hash(struct hash_table *table, unsigned int size)
+{
+	assert(table->size == 0 && table->nr == 0 && table->array == NULL);
+	table->size = size;
+	table->array = xcalloc(sizeof(struct hash_table_entry), size);
+}
+
 #endif
diff --git a/name-hash.c b/name-hash.c
index 942c459..12364d1 100644
--- a/name-hash.c
+++ b/name-hash.c
@@ -92,6 +92,8 @@ static void lazy_init_name_hash(struct index_state *istate)
 	if (istate->name_hash_initialized)
 		return;
+	if (istate->cache_nr)
+		preallocate_hash(&istate->name_hash, istate->cache_nr * 2);
 	for (nr = 0; nr < istate->cache_nr; nr++)
 		hash_index_entry(istate, istate->cache[nr]);
 	istate->name_hash_initialized = 1;
--
1.8.2.83.gc99314b
* Re: [PATCH] Preallocate hash tables when the number of inserts is known in advance
@ 2013-03-17 5:34 ` Junio C Hamano
From: Junio C Hamano @ 2013-03-17 5:34 UTC (permalink / raw)
To: Nguyễn Thái Ngọc Duy; +Cc: git
Nguyễn Thái Ngọc Duy <pclouds@gmail.com> writes:
> This avoids unnecessary re-allocations and re-insertions. On webkit.git
> (about 182k inserts into the name hash table), it cuts roughly 100ms
> out of 3s of user time.
>
> Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
I think this is a very good idea, but I would prefer the second
parameter to the "preallocate" to be "expected number of entries"
and have the preallocate, which is a part of the hash API, decide
how to inflate that number to adjust to the desired load factor of
the hash table. We shouldn't have to adjust the caller when the
internal implementation of the hash table changes.
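That is, a call site would go from

	preallocate_hash(&file_table, (rename_src_nr + rename_dst_nr) * 2);

to

	preallocate_hash(&file_table, rename_src_nr + rename_dst_nr);

with the doubling (or whatever load factor we settle on) applied
inside preallocate_hash() itself.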
> ---
> nd/read-directory-recursive-optim reduces the number of inputs to the
> exclude machinery (from 182k to 11k on webkit) so much that all the
> patches in the exclude optimization series I posted now look
> insignificant. So I won't repost them for inclusion unless you think
> they have cleanup value.
Sorry, without a pointer, it is unclear what "exclude optimization
series" you are referring to.
* Re: [PATCH] Preallocate hash tables when the number of inserts is known in advance
@ 2013-03-17 5:39 ` Duy Nguyen
From: Duy Nguyen @ 2013-03-17 5:39 UTC (permalink / raw)
To: Junio C Hamano; +Cc: git
On Sun, Mar 17, 2013 at 12:34 PM, Junio C Hamano <gitster@pobox.com> wrote:
> Nguyễn Thái Ngọc Duy <pclouds@gmail.com> writes:
>
>> This avoids unnecessary re-allocations and re-insertions. On webkit.git
>> (about 182k inserts into the name hash table), it cuts roughly 100ms
>> out of 3s of user time.
>>
>> Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
>
> I think this is a very good idea, but I would prefer the second
> parameter to the "preallocate" to be "expected number of entries"
> and have the preallocate, which is a part of the hash API, decide
> how to inflate that number to adjust to the desired load factor of
> the hash table. We shouldn't have to adjust the caller when the
> internal implementation of the hash table changes.
OK will do.
>> ---
>> nd/read-directory-recursive-optim reduces the number of inputs to the
>> exclude machinery (from 182k to 11k on webkit) so much that all the
>> patches in the exclude optimization series I posted now look
>> insignificant. So I won't repost them for inclusion unless you think
>> they have cleanup value.
>
> Sorry, without a pointer, it is unclear what "exclude optimization
> series" you are referring to.
http://thread.gmane.org/gmane.comp.version-control.git/217697/focus=217950
--
Duy
* Re: [PATCH] Preallocate hash tables when the number of inserts is known in advance
@ 2013-03-17 6:18 ` Junio C Hamano
From: Junio C Hamano @ 2013-03-17 6:18 UTC (permalink / raw)
To: Duy Nguyen; +Cc: git
Duy Nguyen <pclouds@gmail.com> writes:
> On Sun, Mar 17, 2013 at 12:34 PM, Junio C Hamano <gitster@pobox.com> wrote:
>> Nguyễn Thái Ngọc Duy <pclouds@gmail.com> writes:
>>
>>> This avoids unnecessary re-allocations and re-insertions. On webkit.git
>>> (about 182k inserts into the name hash table), it cuts roughly 100ms
>>> out of 3s of user time.
>>>
>>> Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
>>
>> I think this is a very good idea, but I would prefer the second
>> parameter to the "preallocate" to be "expected number of entries"
>> and have the preallocate, which is a part of the hash API, decide
>> how to inflate that number to adjust to the desired load factor of
>> the hash table. We shouldn't have to adjust the caller when the
>> internal implementation of the hash table changes.
>
> OK will do.
I've squashed it in myself, so no need to resend only for this.
diff --git a/diffcore-rename.c b/diffcore-rename.c
index 8d3d9bb..6c7a72f 100644
--- a/diffcore-rename.c
+++ b/diffcore-rename.c
@@ -389,7 +389,7 @@ static int find_exact_renames(struct diff_options *options)
 	struct hash_table file_table;
 
 	init_hash(&file_table);
-	preallocate_hash(&file_table, (rename_src_nr + rename_dst_nr) * 2);
+	preallocate_hash(&file_table, rename_src_nr + rename_dst_nr);
 	for (i = 0; i < rename_src_nr; i++)
 		insert_file_table(&file_table, -1, i, rename_src[i].p->one);
diff --git a/hash.h b/hash.h
index 244d1fe..1d43ac0 100644
--- a/hash.h
+++ b/hash.h
@@ -40,11 +40,11 @@ static inline void init_hash(struct hash_table *table)
 	table->array = NULL;
 }
 
-static inline void preallocate_hash(struct hash_table *table, unsigned int size)
+static inline void preallocate_hash(struct hash_table *table, unsigned int elts)
 {
 	assert(table->size == 0 && table->nr == 0 && table->array == NULL);
-	table->size = size;
-	table->array = xcalloc(sizeof(struct hash_table_entry), size);
+	table->size = elts * 2;
+	table->array = xcalloc(sizeof(struct hash_table_entry), table->size);
 }
 
 #endif
diff --git a/name-hash.c b/name-hash.c
index 90c7b99..2a1f108 100644
--- a/name-hash.c
+++ b/name-hash.c
@@ -93,7 +93,7 @@ static void lazy_init_name_hash(struct index_state *istate)
 	if (istate->name_hash_initialized)
 		return;
 	if (istate->cache_nr)
-		preallocate_hash(&istate->name_hash, istate->cache_nr * 2);
+		preallocate_hash(&istate->name_hash, istate->cache_nr);
 	for (nr = 0; nr < istate->cache_nr; nr++)
 		hash_index_entry(istate, istate->cache[nr]);
 	istate->name_hash_initialized = 1;
* Re: [PATCH] Preallocate hash tables when the number of inserts is known in advance
@ 2013-03-17 8:51 ` Jeff King
From: Jeff King @ 2013-03-17 8:51 UTC (permalink / raw)
To: Nguyễn Thái Ngọc Duy; +Cc: git, Junio C Hamano
On Sun, Mar 17, 2013 at 10:28:06AM +0700, Nguyen Thai Ngoc Duy wrote:
> This avoids unnecessary re-allocations and reinsertions. On webkit.git
> (i.e. about 182k inserts to the name hash table), this reduces about
> 100ms out of 3s user time.
Good idea.
I had a similar thought when analyzing the hashing behavior of
pack-objects' "Counting objects..." phase, but it had even less impact
there. The insertions are just drowned out by the number of lookups in
that case.
-Peff
Thread overview: 5+ messages
2013-03-17 3:28 [PATCH] Preallocate hash tables when the number of inserts is known in advance Nguyễn Thái Ngọc Duy
2013-03-17 5:34 ` Junio C Hamano
2013-03-17 5:39 ` Duy Nguyen
2013-03-17 6:18 ` Junio C Hamano
2013-03-17 8:51 ` Jeff King