From: Karsten Blees <karsten.blees@gmail.com>
To: Junio C Hamano <gitster@pobox.com>
Cc: Git List <git@vger.kernel.org>, Thomas Rast <tr@thomasrast.ch>,
Jens Lehmann <Jens.Lehmann@web.de>
Subject: Re: [PATCH v4 02/14] add a hashtable implementation that supports O(1) removal
Date: Fri, 08 Nov 2013 11:27:26 +0100 [thread overview]
Message-ID: <527CBC8E.6050507@gmail.com> (raw)
In-Reply-To: <xmqq4n7nriuj.fsf@gitster.dls.corp.google.com>
Am 07.11.2013 22:40, schrieb Junio C Hamano:
> Karsten Blees <karsten.blees@gmail.com> writes:
>
>> +`void hashmap_add(struct hashmap *map, void *entry)`::
>> +
>> + Adds a hashmap entry. This allows to add duplicate entries (i.e.
>> + separate values with the same key according to hashmap_cmp_fn).
>> ++
>> +`map` is the hashmap structure.
>> ++
>> +`entry` is the entry to add.
>> +
>> +`void *hashmap_put(struct hashmap *map, void *entry)`::
>> +
>> + Adds or replaces a hashmap entry.
>> ++
>> +`map` is the hashmap structure.
>> ++
>> +`entry` is the entry to add or update.
>> ++
>> +Returns the previous entry, or NULL if not found (i.e. the entry was added).
>
> What happens when there were duplicate entries previously added?
> All are replaced? One of them is randomly chosen and gets replaced?
>
> With s/replaced/removed/, the same question applies to hashmap_remove().
>
> I am guessing that "we pick one at random and pretend as if other
> duplicates do not exist" is the answer,
Exactly, however, you won't have duplicates if you use put exclusively.
> and I do not immediately
> have an objection to that semantics, but whatever the rule is, it
> needs to be spelled out.
>
I'll clarify this.
>> +`void *hashmap_remove(struct hashmap *map, const void *key, const void *keydata)`::
>> +
>> + Removes a hashmap entry matching the specified key.
>> ...
>> +Usage example
>> +-------------
>> +
>> +Here's a simple usage example that maps long keys to double values.
>> +[source,c]
>> +------------
>> +struct hashmap map;
>> +
>> +struct long2double {
>> + struct hashmap_entry ent; /* must be the first member! */
>> + long key;
>> + double value;
>> +};
>> +
>> +static int long2double_cmp(const struct long2double *e1, const struct long2double *e2, const void *unused)
>> +{
>> + return !(e1->key == e2->key);
>> +}
>> +
>> +void long2double_init()
>
> void long2double_init(void)
>
>> +{
>> + hashmap_init(&map, (hashmap_cmp_fn) long2double_cmp, 0);
>> +}
>> +
>> +void long2double_free()
>
> Likewise.
>
>> diff --git a/hashmap.c b/hashmap.c
>> new file mode 100644
>> index 0000000..3a9f8c1
>> --- /dev/null
>> +++ b/hashmap.c
>> ...
>> +unsigned int memhash(const void *buf, size_t len)
>> +{
>> + unsigned int hash = FNV32_BASE;
>> + unsigned char *ucbuf = (unsigned char*) buf;
>
> "(unsigned char *)buf"; pointer-asterisk does not stick to type.
>
Ok, I'll recheck all casts.
>> + while (len--) {
>> + unsigned int c = *ucbuf++;
>> + hash = (hash * FNV32_PRIME) ^ c;
>> + }
>> + return hash;
>> +}
>> +
>> +unsigned int memihash(const void *buf, size_t len)
>> +{
>> + unsigned int hash = FNV32_BASE;
>> + unsigned char *ucbuf = (unsigned char*) buf;
>> + while (len--) {
>> + unsigned int c = *ucbuf++;
>> + if (c >= 'a' && c <= 'z')
>> + c -= 'a' - 'A';
>> + hash = (hash * FNV32_PRIME) ^ c;
>> + }
>> + return hash;
>> +}
>> +
>> +#define HASHMAP_INITIAL_SIZE 64
>> +/* grow / shrink by 2^2 */
>> +#define HASHMAP_GROW 2
>> +/* grow if > 80% full (to 20%) */
>> +#define HASHMAP_GROW_AT 1.25
>> +/* shrink if < 16.6% full (to 66.6%) */
>> +#define HASHMAP_SHRINK_AT 6
>
> May be I am too old fashioned, but seeing a floating-point constant
> in our otherwise pretty-much integer-only code makes me feel uneasy.
> "gcc -S -O2" does seem to generate floating point instruction for
> "i" but not for "j":
>
> extern void inspect(unsigned int i, unsigned int j);
>
> void grow(unsigned int i, unsigned int j)
> {
> i *= 1.25;
> j += j >> 2;
> inspect(i, j);
> }
>
I guess there's no more i486SXs around, so using floating point should not be a problem (at least performance-wise...).
However, defining the constants inversely is a bit unintuitive (i.e. 1.25 instead of 0.8, 6 instead of 0.166). Perhaps the thresholds should also be calculated once on resize, not on every add / remove.
What about this:
#define HASHMAP_GROW_AT 80
#define HASHMAP_SHRINK_AT 16
...in rehash:
map->grow_at = (unsigned int)((uint64_t) map->tablesize * HASHMAP_GROW_AT / 100);
map->shrink_at = (unsigned int)((uint64_t) map->tablesize * HASHMAP_SHRINK_AT / 100);
...in add:
size++;
if (map->size > map->grow_at)
...in remove:
size--;
if (map->size < map->shrink_at)
> Perhaps
>
> #define HASHMAP_GROW_AT(current) ((current) + (current) >> 2)
> #define HASHMAP_SHRINK_AT(current) ((current) * 6)
> #define HASHMAP_GROW(current) ((current) << 2)
> #define HASHMAP_SHRINK(current) ((current) >> 2)
>
> may alleviate my worries; I dunno.
>
>> +
>> +static inline int entry_equals(const struct hashmap *map,
>> + const struct hashmap_entry *e1, const struct hashmap_entry *e2,
>> + const void *keydata)
>> +{
>> + return (e1 == e2) || (e1->hash == e2->hash && !(*map->cmpfn)(e1, e2, keydata));
>
> Our code tends not to say "this is a pointer to a function"
> explicitly, i.e.
>
> !map->cmpfn(e1, e2, keydata)
>
> is more in-line with our code and should also be sufficient.
>
>> +}
>> +
>> +static inline unsigned int bucket(const struct hashmap *map,
>> + const struct hashmap_entry *key)
>> +{
>> + return key->hash & (map->tablesize - 1);
>> +}
>> +
>> +static void rehash(struct hashmap *map, unsigned int newsize)
>> +{
>> + unsigned int i, oldsize = map->tablesize;
>> + struct hashmap_entry **oldtable = map->table;
>> +
>> + map->tablesize = newsize;
>> + map->table = xcalloc(sizeof(struct hashmap_entry*), map->tablesize);
>
> Even though multiplication is commutative, please match the official
> function signature, i.e.
>
> calloc(size_t nmemb, size_t size)
>
> where the number of elements comes first and size of an element
> comes next. I.e.
>
> xcalloc(map->tablesize, sizeof(struct hashmap_entry *));
>
> Also don't forget the SP between type and asterisk.
>
>> + for (i = 0; i < oldsize; i++) {
>> + struct hashmap_entry *e = oldtable[i];
>> + while (e) {
>> + struct hashmap_entry *next = e->next;
>> + unsigned int b = bucket(map, e);
>> + e->next = map->table[b];
>> + map->table[b] = e;
>> + e = next;
>> + }
>> + }
>> + free(oldtable);
>> +}
>> +
>> +static inline struct hashmap_entry **find_entry_ptr(const struct hashmap *map,
>> + const struct hashmap_entry *key, const void *keydata)
>> +{
>> + struct hashmap_entry **e = &map->table[bucket(map, key)];
>> + while (*e && !entry_equals(map, *e, key, keydata))
>> + e = &(*e)->next;
>> + return e;
>> +}
>> +
>> +static int always_equal(const void *unused1, const void *unused2, const void *unused3)
>> +{
>> + return 0;
>> +}
>> +
>> +void hashmap_init(struct hashmap *map, hashmap_cmp_fn equals_function,
>> + size_t initial_size)
>> +{
>> + map->size = 0;
>> + map->cmpfn = equals_function ? equals_function : always_equal;
>> + /* calculate initial table size and allocate the table */
>> + map->tablesize = HASHMAP_INITIAL_SIZE;
>> + initial_size *= HASHMAP_GROW_AT;
>> + while (initial_size > map->tablesize)
>> + map->tablesize <<= HASHMAP_GROW;
>> + map->table = xcalloc(sizeof(struct hashmap_entry*), map->tablesize);
>> +}
>> +
>> +void hashmap_free(struct hashmap *map, hashmap_free_fn free_function)
>> +{
>
> Why is free_function not part of the constants defiend at
> hashmap_init() time? Your API allows the same hashmap, depending on
> the way it has been used, to be cleaned up with different
> free_function, but I am not sure if that "flexibility" is intended
> (and in what application it would be useful).
>
The free_function is a convenience so you don't have to loop over the entries yourself. It is only needed by hashmap_free (e.g. remove() and put() return the entry without freeing it), so I don't see a reason to carry the free_function around from construction time.
Git uses overallocated structs a lot (i.e. ending in char name[FLEX_ARRAY]), so stdlib free is all we need so far. If the entries have char *name1; char *name2; which are individually allocated, you need a special free function. But perhaps this is a case of premature generalization and a simple 'to free or not to free' boolean would suffice.
> Just like a NULL passed for equals_function gets the internal
> always_equal fallback function, if you make this a part of
> hashmap_init(), a NULL passed for free_funcion can be used as a
> signal to use the system's free(3) and you no longer have to say
> "free from stdlib" in the docs and the comment.
>
No, there are cases where the entries are not owned by the hashmap, so passing NULL = 'don't free entries' is a valid case. See name-hash.c:
hashmap_free(&istate->name_hash, NULL);
hashmap_free(&istate->dir_hash, free);
The cache_entries are owned by index_state.cache, while the dir_entries are our own.
>> + if (!map || !map->table)
>> + return;
>> + if (free_function) {
>> + struct hashmap_iter iter;
>> + struct hashmap_entry *e;
>> + hashmap_iter_init(map, &iter);
>> + while ((e = hashmap_iter_next(&iter)))
>> + (*free_function)(e);
>> + }
>> + free(map->table);
>> + memset(map, 0, sizeof(*map));
>> +}
>> +
>> +void *hashmap_get(const struct hashmap *map, const void *key, const void *keydata)
>> +{
>> + return *find_entry_ptr(map, key, keydata);
>> +}
>> +
>> +void *hashmap_get_next(const struct hashmap *map, const void *entry)
>> +{
>> + struct hashmap_entry *e = ((struct hashmap_entry*) entry)->next;
>> + for (; e; e = e->next)
>> + if (entry_equals(map, entry, e, NULL))
>> + return e;
>> + return NULL;
>> +}
>> +
>> +void hashmap_add(struct hashmap *map, void *entry)
>> +{
>> + unsigned int b = bucket(map, entry);
>> +
>> + /* add entry */
>> + ((struct hashmap_entry*) entry)->next = map->table[b];
>> + map->table[b] = entry;
>> +
>> + /* fix size and rehash if appropriate */
>> + map->size++;
>> + if (map->size * HASHMAP_GROW_AT > map->tablesize)
>> + rehash(map, map->tablesize << HASHMAP_GROW);
>> +}
>> +
>> +void *hashmap_remove(struct hashmap *map, const void *key, const void *keydata)
>> +{
>> + struct hashmap_entry *old;
>> + struct hashmap_entry **e = find_entry_ptr(map, key, keydata);
>> + if (!*e)
>> + return NULL;
>> +
>> + /* remove existing entry */
>> + old = *e;
>> + *e = old->next;
>> + old->next = NULL;
>> +
>> + /* fix size and rehash if appropriate */
>> + map->size--;
>> + if (map->tablesize > HASHMAP_INITIAL_SIZE &&
>> + map->size * HASHMAP_SHRINK_AT < map->tablesize)
>> + rehash(map, map->tablesize >> HASHMAP_GROW);
>
> This "we shrink by the same amount" looks inconsistent with the use
> of separate grow-at and shrink-at constants (see above for four
> suggested #define's).
>
These values account for a small hysteresis so that there is no size at which a sequence of add, remove, add, remove (or put, put, put, put) results in permanent resizes.
We grow at 80% (* 4 = 20% full after grow), and shrink at 16.6% ( / 4 = 66.6% full after shrink).
next prev parent reply other threads:[~2013-11-08 10:27 UTC|newest]
Thread overview: 26+ messages / expand[flat|nested] mbox.gz Atom feed top
2013-11-07 14:32 [PATCH v4 00/14] New hash table implementation Karsten Blees
2013-11-07 14:33 ` [PATCH v4 01/14] submodule: don't access the .gitmodules cache entry after removing it Karsten Blees
2013-11-07 22:27 ` Heiko Voigt
2013-11-07 14:34 ` [PATCH v4 02/14] add a hashtable implementation that supports O(1) removal Karsten Blees
2013-11-07 21:40 ` Junio C Hamano
2013-11-08 10:27 ` Karsten Blees [this message]
2013-11-08 16:45 ` Philip Oakley
2013-11-08 17:08 ` Junio C Hamano
2013-11-13 16:37 ` Karsten Blees
2013-11-07 14:35 ` [PATCH v4 03/14] buitin/describe.c: use new hash map implementation Karsten Blees
2013-11-07 14:36 ` [PATCH v4 04/14] diffcore-rename.c: move code around to prepare for the next patch Karsten Blees
2013-11-07 14:36 ` [PATCH v4 05/14] diffcore-rename.c: simplify finding exact renames Karsten Blees
2013-11-07 14:37 ` [PATCH v4 06/14] diffcore-rename.c: use new hash map implementation Karsten Blees
2013-11-07 14:38 ` [PATCH v4 07/14] name-hash.c: use new hash map implementation for directories Karsten Blees
2013-11-07 14:38 ` [PATCH v4 08/14] name-hash.c: remove unreferenced directory entries Karsten Blees
2013-11-07 14:39 ` [PATCH v4 09/14] name-hash.c: use new hash map implementation for cache entries Karsten Blees
2013-11-07 14:39 ` [PATCH v4 10/14] name-hash.c: remove cache entries instead of marking them CE_UNHASHED Karsten Blees
2013-11-07 14:40 ` [PATCH v4 11/14] remove old hash.[ch] implementation Karsten Blees
2013-11-07 14:43 ` [PATCH v4 12/14] fix 'git update-index --verbose --again' output Karsten Blees
2013-11-07 22:12 ` Junio C Hamano
2013-11-08 10:27 ` Karsten Blees
2013-11-07 14:44 ` [PATCH v4 13/14] builtin/update-index.c: cleanup update_one Karsten Blees
2013-11-07 21:40 ` Junio C Hamano
2013-11-08 10:27 ` Karsten Blees
2013-11-07 14:45 ` [PATCH v4 14/14] read-cache.c: fix memory leaks caused by removed cache entries Karsten Blees
2013-11-07 21:40 ` Junio C Hamano
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=527CBC8E.6050507@gmail.com \
--to=karsten.blees@gmail.com \
--cc=Jens.Lehmann@web.de \
--cc=git@vger.kernel.org \
--cc=gitster@pobox.com \
--cc=tr@thomasrast.ch \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).