[PATCH] hashmap: hashmap_get_next passes through keydata as well

git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH] hashmap: hashmap_get_next passes through keydata as well
@ 2017-05-12 20:02 Stefan Beller
  2017-05-13  8:50 ` Jeff King
  0 siblings, 1 reply; 4+ messages in thread
From: Stefan Beller @ 2017-05-12 20:02 UTC (permalink / raw)
  To: gitster; +Cc: git, Stefan Beller

The 'keydata' may be of value in the underlying compare function to decide
if the given two entries are the same.

Signed-off-by: Stefan Beller <sbeller@google.com>
---
 Documentation/technical/api-hashmap.txt | 6 ++++--
 diffcore-rename.c                       | 2 +-
 hashmap.c                               | 5 +++--
 hashmap.h                               | 3 ++-
 name-hash.c                             | 2 +-
 t/helper/test-hashmap.c                 | 2 +-
 6 files changed, 12 insertions(+), 8 deletions(-)

diff --git a/Documentation/technical/api-hashmap.txt b/Documentation/technical/api-hashmap.txt
index ccc634bbd7..343f1660a9 100644
--- a/Documentation/technical/api-hashmap.txt
+++ b/Documentation/technical/api-hashmap.txt
@@ -140,11 +140,11 @@ to `hashmap_cmp_fn` to decide whether the entry matches the key.
 `hash` is the hash code of the entry to look up.
 +
 If an entry with matching hash code is found, `keydata` is passed to
-`hashmap_cmp_fn` to decide whether the entry matches the key. The
+`hashmap_cmp_fn` to decide whether the `entry` matches the key. The
 `entry_or_key` parameter points to a bogus hashmap_entry structure that
 should not be used in the comparison.
 
-`void *hashmap_get_next(const struct hashmap *map, const void *entry)`::
+`void *hashmap_get_next(const struct hashmap *map, const void *entry, const void *keydata)`::
 
 	Returns the next equal hashmap entry, or NULL if not found. This can be
 	used to iterate over duplicate entries (see `hashmap_add`).
@@ -153,6 +153,8 @@ should not be used in the comparison.
 +
 `entry` is the hashmap_entry to start the search from, obtained via a previous
 call to `hashmap_get` or `hashmap_get_next`.
++
+`keydata` is passed to `hashmap_cmp_fn` to decide whether the entry matches the key.
 
 `void hashmap_add(struct hashmap *map, void *entry)`::
 
diff --git a/diffcore-rename.c b/diffcore-rename.c
index f7444c86bd..0007fcba23 100644
--- a/diffcore-rename.c
+++ b/diffcore-rename.c
@@ -280,7 +280,7 @@ static int find_identical_files(struct hashmap *srcs,
 	 * Find the best source match for specified destination.
 	 */
 	p = hashmap_get_from_hash(srcs, hash_filespec(target), NULL);
-	for (; p; p = hashmap_get_next(srcs, p)) {
+	for (; p; p = hashmap_get_next(srcs, p, NULL)) {
 		int score;
 		struct diff_filespec *source = p->filespec;
 
diff --git a/hashmap.c b/hashmap.c
index 7d1044eb5d..f144d2fc04 100644
--- a/hashmap.c
+++ b/hashmap.c
@@ -182,11 +182,12 @@ void *hashmap_get(const struct hashmap *map, const void *key, const void *keydat
 	return *find_entry_ptr(map, key, keydata);
 }
 
-void *hashmap_get_next(const struct hashmap *map, const void *entry)
+void *hashmap_get_next(const struct hashmap *map, const void *entry,
+		       const void *keydata)
 {
 	struct hashmap_entry *e = ((struct hashmap_entry *) entry)->next;
 	for (; e; e = e->next)
-		if (entry_equals(map, entry, e, NULL))
+		if (entry_equals(map, entry, e, keydata))
 			return e;
 	return NULL;
 }
diff --git a/hashmap.h b/hashmap.h
index de6022a3a9..536c53a8ab 100644
--- a/hashmap.h
+++ b/hashmap.h
@@ -64,7 +64,8 @@ static inline void hashmap_entry_init(void *entry, unsigned int hash)
 }
 extern void *hashmap_get(const struct hashmap *map, const void *key,
 		const void *keydata);
-extern void *hashmap_get_next(const struct hashmap *map, const void *entry);
+extern void *hashmap_get_next(const struct hashmap *map, const void *entry,
+			      const void *keydata);
 extern void hashmap_add(struct hashmap *map, void *entry);
 extern void *hashmap_put(struct hashmap *map, void *entry);
 extern void *hashmap_remove(struct hashmap *map, const void *key,
diff --git a/name-hash.c b/name-hash.c
index 39309efb7f..55758ab232 100644
--- a/name-hash.c
+++ b/name-hash.c
@@ -703,7 +703,7 @@ struct cache_entry *index_file_exists(struct index_state *istate, const char *na
 	while (ce) {
 		if (same_name(ce, name, namelen, icase))
 			return ce;
-		ce = hashmap_get_next(&istate->name_hash, ce);
+		ce = hashmap_get_next(&istate->name_hash, ce, NULL);
 	}
 	return NULL;
 }
diff --git a/t/helper/test-hashmap.c b/t/helper/test-hashmap.c
index 7aa9440e27..fa33b32317 100644
--- a/t/helper/test-hashmap.c
+++ b/t/helper/test-hashmap.c
@@ -206,7 +206,7 @@ int cmd_main(int argc, const char **argv)
 				puts("NULL");
 			while (entry) {
 				puts(get_value(entry));
-				entry = hashmap_get_next(&map, entry);
+				entry = hashmap_get_next(&map, entry, NULL);
 			}
 
 		} else if (!strcmp("remove", cmd) && l1) {
-- 
2.13.0.18.g183880de0a


^ permalink raw reply related	[flat|nested] 4+ messages in thread

* Re: [PATCH] hashmap: hashmap_get_next passes through keydata as well
  2017-05-12 20:02 [PATCH] hashmap: hashmap_get_next passes through keydata as well Stefan Beller
@ 2017-05-13  8:50 ` Jeff King
  2017-05-13 14:06   ` Stefan Beller
  0 siblings, 1 reply; 4+ messages in thread
From: Jeff King @ 2017-05-13  8:50 UTC (permalink / raw)
  To: Stefan Beller; +Cc: gitster, git

On Fri, May 12, 2017 at 01:02:44PM -0700, Stefan Beller wrote:

> The 'keydata' may be of value in the underlying compare function to decide
> if the given two entries are the same.

I had to scratch my head over this for a minute, because there isn't
really any motivating example of what you're trying to do.

I think I figured it out, but I have a feeling it is violating the
intent of the "keydata" parameter.  That parameter is typically used not
as a pointer to arbitrary auxiliary data, but as a trick for finding a
hash entry without having to allocate a struct for it.

So generally, I'd think two entries in the table should be able to be
compared on their own merits, even if no keydata is available. Without
that property, any internal operations in the hashmap can't actually do
an entry comparison (e.g., a table resize that needs to rehash the
entries).

It works out in the current code because the chaining is purely linear,
and doesn't care about order. So we can rehash and just stick the
elements into a new list. But if it were switched out for a different
data structure (e.g., a tree), then the hashmap code would need to be
able to compare elements.

I don't think we have any particular plans for such a change, but I
wonder if we should avoid encouraging callers to rely on the current
implementation.

-Peff

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [PATCH] hashmap: hashmap_get_next passes through keydata as well
  2017-05-13  8:50 ` Jeff King
@ 2017-05-13 14:06   ` Stefan Beller
  2017-05-16 16:13     ` Jeff King
  0 siblings, 1 reply; 4+ messages in thread
From: Stefan Beller @ 2017-05-13 14:06 UTC (permalink / raw)
  To: Jeff King; +Cc: Junio C Hamano, git@vger.kernel.org

On Sat, May 13, 2017 at 1:50 AM, Jeff King <peff@peff.net> wrote:
> On Fri, May 12, 2017 at 01:02:44PM -0700, Stefan Beller wrote:
>
>> The 'keydata' may be of value in the underlying compare function to decide
>> if the given two entries are the same.
>
> I had to scratch my head over this for a minute, because there isn't
> really any motivating example of what you're trying to do.
>
> I think I figured it out, but I have a feeling it is violating the
> intent of the "keydata" parameter.  That parameter is typically used not
> as a pointer to arbitrary auxiliary data, but as a trick for finding a
> hash entry without having to allocate a struct for it.

Yes, I was violating the intent exactly as you describe. I'll adapt my patches
accordingly.

I do not really buy into the trick though, because when the overhead of
allocating a 'key' struct filling in the key parts only leaving out the value
is so much more expensive than giving the key via this keydata argument,
then there are serious issues with your data structure IMHO.
Example:

  struct {
    struct hashmap_entry ent;
    int actual key;
    char value[4096*1024]
  } key_plus_value;

Now when you want to look up a specific key, you don't want to allocate
such a struct on the stack, as we'd be wasting 4M memory of the stack
trashing the caches. However the neat part about this struct is that the
key part is (a) totally separate from the values, and (b) comes before
any value part, such that

  struct {
    struct hashmap_entry ent;
    int actual key;
  } key_only;

  struct key_only = {{NULL, 0}, 42};
  struct key_plus_value *match = hashmap_get(&map, &key, NULL);

works, I would think?

>
> So generally, I'd think two entries in the table should be able to be
> compared on their own merits, even if no keydata is available. Without
> that property, any internal operations in the hashmap can't actually do
> an entry comparison (e.g., a table resize that needs to rehash the
> entries).
>
> It works out in the current code because the chaining is purely linear,
> and doesn't care about order. So we can rehash and just stick the
> elements into a new list. But if it were switched out for a different
> data structure (e.g., a tree), then the hashmap code would need to be
> able to compare elements.

Note that most compare functions do not return an order, but only
a boolean [no]match, so putting it into an ordered tree could only
rely on the hash that we already know without aid from the compare function.
Of course we could fix our compare functions before doing such a
refactoring, but I want to point out how involved that would be.

>
> I don't think we have any particular plans for such a change, but I
> wonder if we should avoid encouraging callers to rely on the current
> implementation.

After a night of sleep it is easy to fix my code to behave as the API
intended. Yesterday I could not see how to fix it.

Thanks,
Stefan

>
> -Peff

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [PATCH] hashmap: hashmap_get_next passes through keydata as well
  2017-05-13 14:06   ` Stefan Beller
@ 2017-05-16 16:13     ` Jeff King
  0 siblings, 0 replies; 4+ messages in thread
From: Jeff King @ 2017-05-16 16:13 UTC (permalink / raw)
  To: Stefan Beller; +Cc: Junio C Hamano, git@vger.kernel.org

On Sat, May 13, 2017 at 07:06:35AM -0700, Stefan Beller wrote:

> > I think I figured it out, but I have a feeling it is violating the
> > intent of the "keydata" parameter.  That parameter is typically used not
> > as a pointer to arbitrary auxiliary data, but as a trick for finding a
> > hash entry without having to allocate a struct for it.
> 
> Yes, I was violating the intent exactly as you describe. I'll adapt my patches
> accordingly.
> 
> I do not really buy into the trick though, because when the overhead of
> allocating a 'key' struct filling in the key parts only leaving out the value
> is so much more expensive than giving the key via this keydata argument,
> then there are serious issues with your data structure IMHO.
> Example:
> [...]

Sure, in your example it works to allocate a partial structure on the
stack. But if you look at the users of the hashmap, quite a few are
structs with final flex-array members, which cannot easily do that
without allocating a new struct on the heap.  You can work around it by
converting them to flex-ptrs (at the cost of an extra 8 bytes per
struct) and having the "key only" version be an oddball by pointing
outside the struct. But I do think having the flexibility in hashmap is
nice.

I actually wish that it did not even require the hashmap entry to be at
the beginning of the struct; it makes it hard to put the same structure
into multiple hashes/lists. See for example the pack MRU, which is in
both a hash and a doubly-linked list. Fortunately the list code is
flexible enough to allow its pointers anywhere in the struct.

So yeah, we could design all of our data structures to fit into the
hashmap's world-view. But I think it's handy for it to be flexible.

> > It works out in the current code because the chaining is purely linear,
> > and doesn't care about order. So we can rehash and just stick the
> > elements into a new list. But if it were switched out for a different
> > data structure (e.g., a tree), then the hashmap code would need to be
> > able to compare elements.
> 
> Note that most compare functions do not return an order, but only
> a boolean [no]match, so putting it into an ordered tree could only
> rely on the hash that we already know without aid from the compare function.
> Of course we could fix our compare functions before doing such a
> refactoring, but I want to point out how involved that would be.

Good point, I didn't think of that. Moving to any non-linear lookup
within a single bucket would require totally changing the cmp_fn
contract, and that's not likely to happen (and anyway, if we're starting
to care about intra-bucket lookup times, that's probably a sign we
should be growing the table more aggressively).

So my concern was just nonsense.

> > I don't think we have any particular plans for such a change, but I
> > wonder if we should avoid encouraging callers to rely on the current
> > implementation.
> 
> After a night of sleep it is easy to fix my code to behave as the API
> intended. Yesterday I could not see how to fix it.

Yay, then. :)

-Peff

^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2017-05-16 16:13 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2017-05-12 20:02 [PATCH] hashmap: hashmap_get_next passes through keydata as well Stefan Beller
2017-05-13  8:50 ` Jeff King
2017-05-13 14:06   ` Stefan Beller
2017-05-16 16:13     ` Jeff King

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).