git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Karsten Blees <karsten.blees@gmail.com>
To: Junio C Hamano <gitster@pobox.com>
Cc: Git List <git@vger.kernel.org>
Subject: Re: [PATCH/RFC 1/5] add a hashtable implementation that supports O(1) removal
Date: Mon, 23 Sep 2013 11:16:45 +0200	[thread overview]
Message-ID: <524006FD.2010604@gmail.com> (raw)
In-Reply-To: <xmqqtxhqrjzf.fsf@gitster.dls.corp.google.com>

Sorry for the delay, I've been on vacation...

Am 12.09.2013 01:56, schrieb Junio C Hamano:
> Karsten Blees <karsten.blees@gmail.com> writes:
> 
>> +#define FNV32_BASE ((unsigned int) 0x811c9dc5)
>> +#define FNV32_PRIME ((unsigned int) 0x01000193)
>> + ...
>> +static inline unsigned int bucket(const hashmap *map, const hashmap_entry *key)
>> +{
>> +	return key->hash & (map->tablesize - 1);
>> +}
> 
> As tablesize would hopefully be reasonably small, not worrying about
> platforms' "unsigned int" being 64-bit (in which case it would be
> more appropriate to compute with FNV64_PRIME) should be fine.
> 
>> +static inline hashmap_entry **find_entry(const hashmap *map,
>> +		const hashmap_entry *key)
>> +{
>> +	hashmap_entry **e = &map->table[bucket(map, key)];
>> +	while (*e && !entry_equals(map, *e, key))
>> +		e = &(*e)->next;
>> +	return e;
>> +}
> 
> (mental note) This finds the location the pointer to the entry is
> stored, not the entry itself.
> 

Will rename to find_entry_ptr to make this clear

>> +void *hashmap_get(const hashmap *map, const void *key)
>> +{
>> +	return *find_entry(map, key);
>> +}
> 
> ... which is consistent with this, and more importantly, it is
> crucial for hashmap_remove()'s implementation, because...
> 
>> +void *hashmap_remove(hashmap *map, const void *key)
>> +{
>> +	hashmap_entry *old;
>> +	hashmap_entry **e = find_entry(map, key);
>> +	if (!*e)
>> +		return NULL;
>> +
>> +	/* remove existing entry */
>> +	old = *e;
>> +	*e = old->next;
> 
> ... this wants to update the linked list in place.
> 
> Looking good.
> 
> I however wonder if the singly linked linear chain is a really good
> alternative for the access pattern of the hashes we use, though.

I don't think it will make a big difference in lookup performance, especially with good hash codes (such as the first bytes of a sha1). In theory, chaining should be slightly faster, because entries are strictly confined to their buckets (i.e. no extraneous entries need to be traversed when looking up an entry). With uniform hash distribution, chaining should require (1 + loadfactor) comparisons to find an entry, while open addressing requires (1/(1 - loadfactor)) [1]. I'll add a performance test for lookups, though.

[1] http://en.wikipedia.org/wiki/Hash_table#Performance_analysis

> Do we really want to trigger growth on the bucket load factor, not the
> length of the longest chain, for example?
> 

With good hashes and a load factor < 1, the longest 'chain' is 1. We only get chains in case of collisions, which however cannot necessarily be resolved by resizing. E.g. in the worst case, all entries have the same hash code, which deforms the hash table into a long linked list in a single bucket. Resizing won't change that.

An alternative would be to resize on the number of used buckets instead of total entries. I.e. with exceptionally bad hash codes and lots of collisions, we at least don't waste memory by making the table unnecessarily large. However, I don't think this is worth the extra effort.

>> +	old->next = NULL;
>> +
>> +	/* fix size and rehash if appropriate */
>> +	map->size--;
>> +	if (map->tablesize > HASHMAP_INITIAL_SIZE &&
>> +		map->size * HASHMAP_SHRINK_AT < map->tablesize)
>> +		rehash(map, map->tablesize >> HASHMAP_GROW);
> 
> Please align the first two lines so that the first non-whitespace on
> the second line of the condition part of the "if" statement
> (i.e. 'm') aligns with the first non-whitespace inside the '(' open
> parenthesis (i.e. 'm').
> 

Will do.

  reply	other threads:[~2013-09-23  9:16 UTC|newest]

Thread overview: 24+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2013-09-10 23:27 [PATCH/RFC 0/5] New hash table implementation Karsten Blees
2013-09-10 23:28 ` [PATCH/RFC 1/5] add a hashtable implementation that supports O(1) removal Karsten Blees
2013-09-11 23:56   ` Junio C Hamano
2013-09-23  9:16     ` Karsten Blees [this message]
2013-09-12  4:10   ` Junio C Hamano
2013-09-23  9:21     ` Karsten Blees
2013-09-10 23:28 ` [PATCH/RFC 2/5] buitin/describe.c: use new hash map implementation Karsten Blees
2013-09-10 23:29 ` [PATCH/RFC 3/5] diffcore-rename.c: move code around to prepare for the next patch Karsten Blees
2013-09-10 23:30 ` [PATCH/RFC 4/5] diffcore-rename.c: simplify finding exact renames Karsten Blees
2013-09-10 23:30 ` [PATCH/RFC 5/5] diffcore-rename.c: use new hash map implementation Karsten Blees
2013-09-24  9:50 ` [PATCH v2 0/5] New hash table implementation Karsten Blees
2013-09-24  9:51   ` [PATCH v2 1/5] add a hashtable implementation that supports O(1) removal Karsten Blees
2013-09-24  9:52   ` [PATCH v2 2/5] buitin/describe.c: use new hash map implementation Karsten Blees
2013-09-24  9:52   ` [PATCH v2 3/5] diffcore-rename.c: move code around to prepare for the next patch Karsten Blees
2013-09-24  9:53   ` [PATCH v2 4/5] diffcore-rename.c: simplify finding exact renames Karsten Blees
2013-09-24  9:54   ` [PATCH v2 5/5] diffcore-rename.c: use new hash map implementation Karsten Blees
2013-09-24 10:18   ` [PATCH v2 0/5] New hash table implementation Fredrik Gustafsson
2013-09-24 11:16   ` Tay Ray Chuan
2013-09-26 14:38     ` Karsten Blees
2013-09-26 10:16   ` Fredrik Gustafsson
2013-09-26 10:26     ` Duy Nguyen
2013-09-26 11:08       ` Fredrik Gustafsson
2013-09-26 11:14         ` Duy Nguyen
2013-09-26 13:55     ` Karsten Blees

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=524006FD.2010604@gmail.com \
    --to=karsten.blees@gmail.com \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).