Hash Tables

git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* Hash Tables
@ 2009-08-06  4:35 Philip Herron
  2009-08-06  5:17 ` Philip Herron
  0 siblings, 1 reply; 3+ messages in thread
From: Philip Herron @ 2009-08-06  4:35 UTC (permalink / raw)
  To: git

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Hey guys

This is my first time posting to git mailing-list didn't get much
response from irc.

I've been loving git and just been poking at its hash tables in
hash.{c,h}, which are very nice. I've been implementing my own hash
tables and i have some questions on how you guys done it.

1 - Are you using the sha1.c as your hashing function? And why did you
choose it, I have been playing with a few different ones to see how it
goes.

2 -table lookup i see your hash_table has an unsigned int nr, not
quite sure what that is for num_remaing      or something is what i
first thought.
    But i see in your table lookup you use hash % size and it returns
an index to the array. And would love       to  know how that works.
Is size the current number of hash entries in the table?

3 - what is the initial table size as in when you first insert an
entry into the table, the array must be                allocated to a
length, and you grow when you reach a threshold.

Anyways thanks!

- --Phil
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iEYEARECAAYFAkp6XZsACgkQAhcOgIaQQ2HWGgCfU6l909k7/JK3gf6BB2Cu35xB
jBIAn2Jl6UWp5ZvTXJxUWc91tgn//z25
=6l91
-----END PGP SIGNATURE-----

^ permalink raw reply	[flat|nested] 3+ messages in thread

* Re: Hash Tables
  2009-08-06  4:35 Hash Tables Philip Herron
@ 2009-08-06  5:17 ` Philip Herron
  2009-08-06  8:53   ` Thomas Rast
  0 siblings, 1 reply; 3+ messages in thread
From: Philip Herron @ 2009-08-06  5:17 UTC (permalink / raw)
  To: git

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Hey

Sorry for the mail i played around with the hash.c and i see how it
works now! Feel little bit stupid what threw me off was the  alloc_nr(
); but its defined to #define alloc_nr(x) (((x)+16)*3/2) which is
quite nice.

and the nr threw me off but i see how it works now its actually the
similar as what i was doing in my program, but your grow table was
better because of alloc_nr acts like a threshold to grow and scale
better, mine just added on another chunk of 32 elements just because
it seemed like a good number to get something working.

Question still stands is the hashing function one, which one and why?

Thanks loads, Sorry for bad posts!

- --Phil
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iEYEARECAAYFAkp6Z1UACgkQAhcOgIaQQ2FvQwCdGAgcuAUNG2/YyyzhXct3J2qc
azwAninE/8I+Z4T4h294tCzXlLzmyGqW
=Vahj
-----END PGP SIGNATURE-----

^ permalink raw reply	[flat|nested] 3+ messages in thread

* Re: Hash Tables
  2009-08-06  5:17 ` Philip Herron
@ 2009-08-06  8:53   ` Thomas Rast
  0 siblings, 0 replies; 3+ messages in thread
From: Thomas Rast @ 2009-08-06  8:53 UTC (permalink / raw)
  To: Philip Herron; +Cc: git

Philip Herron wrote:
> 
> Question still stands is the hashing function [in hash.c], which one and why?

In the spirit of teaching you to fish...

First you'll want to find out where the original users of this code
were.  So you run

  git blame -- hash.c

and see that most of the lines come from 9027f53 (Do linear-time/space
rename logic for exact renames, 2007-10-25).  So you can then look at
this commit:

  git show 9027f53c

Aha, it says

    In the expectation that we will indeed do the same hashing trick for the
    general rename case, this code uses a generic hash-table implementation
    that can be used for other things too.  In fact, we might be able to
    consolidate some of our existing hash tables with the new generic code
    in hash.[ch]

and further down in the patch

+       hash = hash_filespec(filespec);
+       pos = insert_hash(hash, entry, table);

and right above that

+static unsigned int hash_filespec(struct diff_filespec *filespec)
+{
+       unsigned int hash;
+       if (!filespec->sha1_valid) {
+               if (diff_populate_filespec(filespec, 0))
+                       return 0;
+               hash_sha1_file(filespec->data, filespec->size, "blob", filespec-
+       }
+       memcpy(&hash, filespec->sha1, sizeof(hash));
+       return hash;
+}

See?

As for the *why*, presumably because all of git assumes two objects
with the same SHA1 are indeed the same file; so we can later make the
same optimisation again:

+                       if (hashcmp(one->sha1, two->sha1))
+                               continue;

And then, as we've already computed the SHA1, any subset of it is as
good a hash as anything else; it'll be uniformly distributed.

-- 
Thomas Rast
trast@{inf,student}.ethz.ch

^ permalink raw reply	[flat|nested] 3+ messages in thread

end of thread, other threads:[~2009-08-06  8:54 UTC | newest]

Thread overview: 3+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2009-08-06  4:35 Hash Tables Philip Herron
2009-08-06  5:17 ` Philip Herron
2009-08-06  8:53   ` Thomas Rast

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).