* I'm a total push-over.. @ 2008-01-22 23:37 Linus Torvalds 2008-01-23 1:35 ` Kevin Ballard ` (2 more replies) 0 siblings, 3 replies; 51+ messages in thread From: Linus Torvalds @ 2008-01-22 23:37 UTC (permalink / raw) To: Git Mailing List, Junio C Hamano Ok, here's an interesting patch based on the current 'next' (since it very intimately requires the new in-memory index format). What it does is to create a hash index of every single file added to the index. Right now that hash index isn't actually used for much: I implemented a "cache_name_exists()" function that uses it to efficiently look up a filename in the index without having to do the O(logn) binary search, but quite frankly, that's not why this patch is interesting. No, the whole and only reason to create the hash of the filenames in the index is that by modifying the hash function, you can fairly easily do things like making it always hash equivalent names into the same bucket. That, in turn, means that suddenly questions like "does this name exists in the index under an _equivalent_ name?" becomes much much cheaper. Guiding principles behind this patch: - it shouldn't be too costly. In fact, my primary goal here was to actually speed up "git commit" with a fully populated kernel tree, by being faster at checking whether a file already existed in the index. I did succeed, but only barely: Best before: [torvalds@woody linux]$ time git commit > /dev/null real 0m0.255s user 0m0.168s sys 0m0.088s Best after: [torvalds@woody linux]$ time ~/git/git commit > /dev/null real 0m0.233s user 0m0.144s sys 0m0.088s so some things are actually faster (~8%). Caveat: that's really the best case. Other things are invariably going to be slightly slower, since we populate that index cache, and quite frankly, few things really use it to look things up. That said, the cost is really quite small. The worst case is probably doing a "git ls-files", which will do very little except puopulate the index, and never actually looks anything up in it, just lists it. Before: [torvalds@woody linux]$ time git ls-files > /dev/null real 0m0.016s user 0m0.016s sys 0m0.000s After: [torvalds@woody linux]$ time ~/git/git ls-files > /dev/null real 0m0.021s user 0m0.012s sys 0m0.008s and while the thing has really gotten relatively much slower, we're still talking about something almost unmeasurable (eg 5ms). And that really should be pretty much the worst case. So we lose 5ms on one "benchmark", but win 22ms on another. Pick your poison - this patch has the advantage that it will _likely_ speed up the cases that are complex and expensive more than it slows down the cases that are already so fast that nobody cares. But if you look at relative speedups/slowdowns, it doesn't look so good. - It should be simple and clean The code may be a bit subtle (the reasons I do hash removal the way I do etc), but it re-uses the existing hash.c files, so it really is fairly small and straightforward apart from a few odd details. Now, this patch on its own doesn't really do much, but I think it's worth looking at, if only because if done correctly, the name hashing really can make an improvement to the whole issue of "do we have a filename that looks like this in the index already". And at least it gets real testing by being used even by default (ie there is a real use-case for it even without any insane filesystems). NOTE NOTE NOTE! The current hash is a joke. I'm ashamed of it, I'm just not ashamed of it enough to really care. 
I took all the numbers out of my nether regions - I'm sure it's good enough that it works in practice, but the whole point was that you can make a really much fancier hash that hashes characters not directly, but by their upper-case value or something like that, and thus you get a case-insensitive hash, while still keeping the name and the index itself totally case sensitive. And let's face it, it was kind of fun. These things are all _soo_ much simpler than all the issues you have to do in the kernel, so this is just a complete toy compared to all the things we do inside Linux to do the same thing with pluggable hashes on a per-path-component basis etc. (User space developers are weenies. One of the most fun parts of git development for me has been how easy everything is ;) Linus ---- cache.h | 6 +++ dir.c | 2 +- read-cache.c | 97 ++++++++++++++++++++++++++++++++++++++++++++++++++++------ 3 files changed, 94 insertions(+), 11 deletions(-) diff --git a/cache.h b/cache.h index 3a47cdc..409738c 100644 --- a/cache.h +++ b/cache.h @@ -3,6 +3,7 @@ #include "git-compat-util.h" #include "strbuf.h" +#include "hash.h" #include SHA1_HEADER #include <zlib.h> @@ -109,6 +110,7 @@ struct ondisk_cache_entry { }; struct cache_entry { + struct cache_entry *next; unsigned int ce_ctime; unsigned int ce_mtime; unsigned int ce_dev; @@ -131,6 +133,7 @@ struct cache_entry { #define CE_UPDATE (0x10000) #define CE_REMOVE (0x20000) #define CE_UPTODATE (0x40000) +#define CE_UNHASHED (0x80000) static inline unsigned create_ce_flags(size_t len, unsigned stage) { @@ -188,6 +191,7 @@ struct index_state { struct cache_tree *cache_tree; time_t timestamp; void *alloc; + struct hash_table name_hash; }; extern struct index_state the_index; @@ -211,6 +215,7 @@ extern struct index_state the_index; #define refresh_cache(flags) refresh_index(&the_index, (flags), NULL, NULL) #define ce_match_stat(ce, st, options) ie_match_stat(&the_index, (ce), (st), (options)) #define ce_modified(ce, st, options) ie_modified(&the_index, (ce), (st), (options)) +#define cache_name_exists(name, namelen) index_name_exists(&the_index, (name), (namelen)) #endif enum object_type { @@ -297,6 +302,7 @@ extern int read_index_from(struct index_state *, const char *path); extern int write_index(struct index_state *, int newfd); extern int discard_index(struct index_state *); extern int verify_path(const char *path); +extern int index_name_exists(struct index_state *istate, const char *name, int namelen); extern int index_name_pos(struct index_state *, const char *name, int namelen); #define ADD_CACHE_OK_TO_ADD 1 /* Ok to add */ #define ADD_CACHE_OK_TO_REPLACE 2 /* Ok to replace file/directory */ diff --git a/dir.c b/dir.c index 1b9cc7a..6543105 100644 --- a/dir.c +++ b/dir.c @@ -346,7 +346,7 @@ static struct dir_entry *dir_entry_new(const char *pathname, int len) struct dir_entry *dir_add_name(struct dir_struct *dir, const char *pathname, int len) { - if (cache_name_pos(pathname, len) >= 0) + if (cache_name_exists(pathname, len)) return NULL; ALLOC_GROW(dir->entries, dir->nr+1, dir->alloc); diff --git a/read-cache.c b/read-cache.c index 8ba8f0f..33a8ca5 100644 --- a/read-cache.c +++ b/read-cache.c @@ -23,6 +23,70 @@ struct index_state the_index; +static unsigned int hash_name(const char *name, int namelen) +{ + unsigned int hash = 0x123; + + do { + unsigned char c = *name++; + hash = hash*101 + c; + } while (--namelen); + return hash; +} + +static void set_index_entry(struct index_state *istate, int nr, struct cache_entry *ce) +{ + void **pos; + unsigned 
int hash = hash_name(ce->name, ce_namelen(ce)); + + istate->cache[nr] = ce; + pos = insert_hash(hash, ce, &istate->name_hash); + if (pos) { + ce->next = *pos; + *pos = ce; + } +} + +/* + * We don't actually *remove* it, we can just mark it invalid so that + * we won't find it in lookups. + * + * Not only would we have to search the lists (simple enough), but + * we'd also have to rehash other hash buckets in case this makes the + * hash bucket empty (common). So it's much better to just mark + * it. + */ +static void remove_hash_entry(struct index_state *istate, struct cache_entry *ce) +{ + ce->ce_flags |= CE_UNHASHED; +} + +static void replace_index_entry(struct index_state *istate, int nr, struct cache_entry *ce) +{ + struct cache_entry *old = istate->cache[nr]; + + if (ce != old) { + remove_hash_entry(istate, old); + set_index_entry(istate, nr, ce); + } + istate->cache_changed = 1; +} + +int index_name_exists(struct index_state *istate, const char *name, int namelen) +{ + unsigned int hash = hash_name(name, namelen); + struct cache_entry *ce = lookup_hash(hash, &istate->name_hash); + + while (ce) { + if (!(ce->ce_flags & CE_UNHASHED)) { + if (!cache_name_compare(name, namelen, ce->name, ce->ce_flags)) + return 1; + } + ce = ce->next; + } + return 0; +} + /* * This only updates the "non-critical" parts of the directory * cache, ie the parts that aren't tracked by GIT, and only used @@ -323,6 +387,9 @@ int index_name_pos(struct index_state *istate, const char *name, int namelen) /* Remove entry, return true if there are more entries to go.. */ int remove_index_entry_at(struct index_state *istate, int pos) { + struct cache_entry *ce = istate->cache[pos]; + + remove_hash_entry(istate, ce); istate->cache_changed = 1; istate->cache_nr--; if (pos >= istate->cache_nr) @@ -697,8 +764,7 @@ static int add_index_entry_with_check(struct index_state *istate, struct cache_e /* existing match? Just replace it. */ if (pos >= 0) { - istate->cache_changed = 1; - istate->cache[pos] = ce; + replace_index_entry(istate, pos, ce); return 0; } pos = -pos-1; @@ -758,7 +824,7 @@ int add_index_entry(struct index_state *istate, struct cache_entry *ce, int opti memmove(istate->cache + pos + 1, istate->cache + pos, (istate->cache_nr - pos - 1) * sizeof(ce)); - istate->cache[pos] = ce; + set_index_entry(istate, pos, ce); istate->cache_changed = 1; return 0; } @@ -887,11 +953,8 @@ int refresh_index(struct index_state *istate, unsigned int flags, const char **p has_errors = 1; continue; } - istate->cache_changed = 1; - /* You can NOT just free istate->cache[i] here, since it - * might not be necessarily malloc()ed but can also come - * from mmap(). */ - istate->cache[i] = new; + + replace_index_entry(istate, i, new); } return has_errors; } @@ -966,6 +1029,20 @@ static void convert_from_disk(struct ondisk_cache_entry *ondisk, struct cache_en memcpy(ce->name, ondisk->name, len + 1); } +static inline size_t estimate_cache_size(size_t ondisk_size, unsigned int entries) +{ + long per_entry; + + per_entry = sizeof(struct cache_entry) - sizeof(struct ondisk_cache_entry); + + /* + * Alignment can cause differences. This should be "alignof", but + * since that's a gcc'ism, just use the size of a pointer. + */ + per_entry += sizeof(void *); + return ondisk_size + entries*per_entry; +} + /* remember to discard_cache() before reading a different cache! 
*/ int read_index_from(struct index_state *istate, const char *path) { @@ -1016,7 +1093,7 @@ int read_index_from(struct index_state *istate, const char *path) * has room for a few more flags, we can allocate using the same * index size */ - istate->alloc = xmalloc(mmap_size); + istate->alloc = xmalloc(estimate_cache_size(mmap_size, istate->cache_nr)); src_offset = sizeof(*hdr); dst_offset = 0; @@ -1027,7 +1104,7 @@ int read_index_from(struct index_state *istate, const char *path) disk_ce = (struct ondisk_cache_entry *)((char *)mmap + src_offset); ce = (struct cache_entry *)((char *)istate->alloc + dst_offset); convert_from_disk(disk_ce, ce); - istate->cache[i] = ce; + set_index_entry(istate, i, ce); src_offset += ondisk_ce_size(ce); dst_offset += ce_size(ce); ^ permalink raw reply related [flat|nested] 51+ messages in thread
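
For illustration only: the "equivalent names hash to the same bucket" idea can be sketched as a case-folding variant of the patch's hash_name(). The ASCII-only folding rule below is an assumption made for the example, not part of the patch.

static unsigned int hash_name_icase(const char *name, int namelen)
{
	unsigned int hash = 0x123;

	do {
		unsigned char c = *name++;
		/* fold ASCII letters so "Makefile" and "makefile" land in one bucket */
		if ('a' <= c && c <= 'z')
			c &= ~0x20;
		hash = hash*101 + c;
	} while (--namelen);
	return hash;
}

The entries keep their exact names; only the bucket choice becomes case-insensitive, so callers can still compare byte-for-byte while walking the chain.
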
* Re: I'm a total push-over.. 2008-01-22 23:37 I'm a total push-over Linus Torvalds @ 2008-01-23 1:35 ` Kevin Ballard 2008-01-23 2:23 ` Junio C Hamano 2008-01-23 8:32 ` Andreas Ericsson 2 siblings, 0 replies; 51+ messages in thread From: Kevin Ballard @ 2008-01-23 1:35 UTC (permalink / raw) To: Linus Torvalds; +Cc: Git Mailing List, Junio C Hamano [-- Attachment #1: Type: text/plain, Size: 1308 bytes --] On Jan 22, 2008, at 6:37 PM, Linus Torvalds wrote: > Ok, here's an interesting patch based on the current 'next' (since > it very > intimately requires the new in-memory index format). > > What it does is to create a hash index of every single file added to > the > index. Right now that hash index isn't actually used for much: I > implemented a "cache_name_exists()" function that uses it to > efficiently > look up a filename in the index without having to do the O(logn) > binary > search, but quite frankly, that's not why this patch is interesting. > > No, the whole and only reason to create the hash of the filenames in > the > index is that by modifying the hash function, you can fairly easily do > things like making it always hash equivalent names into the same > bucket. This is fantastic. Thank you very much for actually taking this issue seriously despite the mess I made on the list. This is exactly why I wanted to discuss on the lists instead of hacking away myself - there are very smart people on the list (like you) that already know how git works that can come up with ideas like this while I would still be trying to figure out where the index code is even stored. -Kevin Ballard -- Kevin Ballard http://kevin.sb.org kevin@sb.org http://www.tildesoft.com [-- Attachment #2: smime.p7s --] [-- Type: application/pkcs7-signature, Size: 2432 bytes --] ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: I'm a total push-over.. 2008-01-22 23:37 I'm a total push-over Linus Torvalds 2008-01-23 1:35 ` Kevin Ballard @ 2008-01-23 2:23 ` Junio C Hamano 2008-01-23 2:36 ` Junio C Hamano 2008-01-23 2:58 ` Linus Torvalds 2008-01-23 8:32 ` Andreas Ericsson 2 siblings, 2 replies; 51+ messages in thread From: Junio C Hamano @ 2008-01-23 2:23 UTC (permalink / raw) To: Linus Torvalds; +Cc: Git Mailing List Linus Torvalds <torvalds@linux-foundation.org> writes: > Ok, here's an interesting patch based on the current 'next' (since it very > intimately requires the new in-memory index format). This is nice. It does not do anything specific with HFS+ issues but aims for faster look-ups, which would help everybody. Two things I noticed (only two, not necessarily because you are good but mostly because I am still mired in day job and could not get enough uninterrupted minutes to read the patch ;-)): - You might want to store the hash table (once computed) in the index extension section, and lazily unpack the table the first time index_name_exists() or set_index_entry() is called on the given istate, instead of unpacking it immediately when you read from the disk. That way, ls-files does not have to suffer at all. - You would need to get rid of the table in discard_index(). ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: I'm a total push-over.. 2008-01-23 2:23 ` Junio C Hamano @ 2008-01-23 2:36 ` Junio C Hamano 2008-01-23 12:24 ` Johannes Schindelin 2008-01-23 2:58 ` Linus Torvalds 1 sibling, 1 reply; 51+ messages in thread From: Junio C Hamano @ 2008-01-23 2:36 UTC (permalink / raw) To: Linus Torvalds; +Cc: Git Mailing List Junio C Hamano <gitster@pobox.com> writes: > Linus Torvalds <torvalds@linux-foundation.org> writes: > >> Ok, here's an interesting patch based on the current 'next' (since it very >> intimately requires the new in-memory index format). > > This is nice. It does not do anything specific with HFS+ issues > but aims for faster look-ups, which would help everybody. > > Two things I noticed (only two, not necessarily because you are > good but mostly because I am still mired in day job and could > not get enough uninterrupted minutes to read the patch ;-)): > > - You might want to store the hash table (once computed) in the > index extension section, and lazily unpack the table the > first time index_name_exists() or set_index_entry() is called > on the given istate, instead of unpacking it immediately when > you read from the disk. That way, ls-files does not have to > suffer at all. Actually, I take one fourth of this back. * I am not yet retracting the suggestion to do the hashing lazily (what I mean by "lazily" is that the first access that wants hashed access will iterate active_cache and hash them all, not "lazily one entry at a time as needed" which would not make any sense for a hashtable). I have to find time to try benching the effect of it myself. So that's one half retained. * We certainly do not necessarily want to store this in the index right now. The hash algorithms would be improved from the version you are almost ashamed of ;-). That sounds as if I am retrating the other half, but not quite. * Once we have a mechanism to detect that the extension section stores a precomputed hash that was done with a different algorithm and ignore it (and recompute afresh when needed), then we can afford to put a more elaborate hashing algorithm, slightly loosening one of your "Guiding principles", and keep the result in the generated index to be reused by the next user. So that is why I am retracting only half of the suggestion to save it in the extension section (which in turn is a half of my suggestion). ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: I'm a total push-over.. 2008-01-23 2:36 ` Junio C Hamano @ 2008-01-23 12:24 ` Johannes Schindelin 2008-01-23 12:28 ` David Kastrup 0 siblings, 1 reply; 51+ messages in thread From: Johannes Schindelin @ 2008-01-23 12:24 UTC (permalink / raw) To: Junio C Hamano; +Cc: Linus Torvalds, Git Mailing List Hi, On Tue, 22 Jan 2008, Junio C Hamano wrote: > * We certainly do not necessarily want to store this in the > index right now. The hash algorithms would be improved from > the version you are almost ashamed of ;-). That sounds as if > I am retrating the other half, but not quite. > > * Once we have a mechanism to detect that the extension section > stores a precomputed hash that was done with a different > algorithm and ignore it (and recompute afresh when needed), > then we can afford to put a more elaborate hashing algorithm, > slightly loosening one of your "Guiding principles", and keep > the result in the generated index to be reused by the next > user. So that is why I am retracting only half of the > suggestion to save it in the extension section (which in turn > is a half of my suggestion). Both issues (and the config variable issue Linus raised) are easily helped with: store not only the hashmap in the extension, but also an identifier for the hash method used. Then you can improve on the hash function all you like, and add the config variable dependent choice of the hashing everybody seems to want all of a sudden, as long as you change the method identifier, too. Ciao, Dscho ^ permalink raw reply [flat|nested] 51+ messages in thread
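
As a rough sketch of that suggestion (nothing like this exists in git's index format; the layout and names below are hypothetical), the extension record could begin with a method identifier, and a reader would simply ignore a table written by a different method:

#include <stdint.h>
#include <arpa/inet.h>

#define NAME_HASH_METHOD_V1 1	/* hypothetical identifier for the current hash function */

struct name_hash_ext {
	uint32_t hash_method;	/* which hash function produced the stored table */
	uint32_t nr_buckets;	/* number of buckets serialized after this header */
};

static int name_hash_ext_usable(const struct name_hash_ext *ext)
{
	/* unknown or outdated method: drop the stored table and recompute */
	return ntohl(ext->hash_method) == NAME_HASH_METHOD_V1;
}
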
* Re: I'm a total push-over.. 2008-01-23 12:24 ` Johannes Schindelin @ 2008-01-23 12:28 ` David Kastrup 2008-01-23 12:56 ` Theodore Tso 0 siblings, 1 reply; 51+ messages in thread From: David Kastrup @ 2008-01-23 12:28 UTC (permalink / raw) To: Johannes Schindelin; +Cc: Junio C Hamano, Linus Torvalds, Git Mailing List Johannes Schindelin <Johannes.Schindelin@gmx.de> writes: > Both issues (and the config variable issue Linus raised) are easily > helped with: store not only the hashmap in the extension, but also an > identifier for the hash method used. For a reasonably implemented hash algorithm, computing the hash should be cheaper than reading it from disk. So storing precomputed hashes is not worth the trouble. -- David Kastrup, Kriemhildstr. 15, 44793 Bochum ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: I'm a total push-over.. 2008-01-23 12:28 ` David Kastrup @ 2008-01-23 12:56 ` Theodore Tso 0 siblings, 0 replies; 51+ messages in thread From: Theodore Tso @ 2008-01-23 12:56 UTC (permalink / raw) To: David Kastrup Cc: Johannes Schindelin, Junio C Hamano, Linus Torvalds, Git Mailing List On Wed, Jan 23, 2008 at 01:28:27PM +0100, David Kastrup wrote: > Johannes Schindelin <Johannes.Schindelin@gmx.de> writes: > > > Both issues (and the config variable issue Linus raised) are easily > > helped with: store not only the hashmap in the extension, but also an > > identifier for the hash method used. > > For a reasonably implemented hash algorithm, computing the hash should > be cheaper than reading it from disk. > > So storing precomputed hashes is not worth the trouble. Yes, but if on Mac OS systems when the git repository is stored on an HFS+ system, the hash algorithm gets changed to one which forces Unicode strings which HFS+ happens to "normalize" into the same hash bucket pre- and post- normalization, it might not be cheap any more.... - Ted ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: I'm a total push-over.. 2008-01-23 2:23 ` Junio C Hamano 2008-01-23 2:36 ` Junio C Hamano @ 2008-01-23 2:58 ` Linus Torvalds 2008-01-23 3:19 ` Linus Torvalds 2008-01-23 7:23 ` Junio C Hamano 1 sibling, 2 replies; 51+ messages in thread From: Linus Torvalds @ 2008-01-23 2:58 UTC (permalink / raw) To: Junio C Hamano; +Cc: Git Mailing List On Tue, 22 Jan 2008, Junio C Hamano wrote: > > - You might want to store the hash table (once computed) in the > index extension section, and lazily unpack the table the > first time index_name_exists() or set_index_entry() is called > on the given istate, instead of unpacking it immediately when > you read from the disk. That way, ls-files does not have to > suffer at all. I really hate that. Basically, I dislike having two copies of the same data. If something can be computed from something else, then only the original data should exist, and the other thing should be recomputed. Otherwise you easily get into situations where you spend a lot of time maintaining the other copy, or worse, you have inconsistent data and it's really subtle what is going on. Also, one of the ideas behind the index is that would depend on your notion of what is "equivalent", which is actually somehing fairly fluid. In fact, it's likely going to depend on a config option. So encoding the indexing on disk, when it can change when you do a simple "git config", or even just depending on your LANG environment variable, seems like a singularly bad idea, even if it wasn't for the coherence. I did consider doing the indexing only on demand, and we can certainly simply just "turn it off" when we know it's never going to get used (ie "git ls-files"). So in that sense, it's easy to get rid of the overhead, but it didn't really seem like the conceptual complexity (even if it's just a couple of lines) is really worth it. It's not like git ls-files is really performance-critical anyway. > - You would need to get rid of the table in discard_index(). Now this, of course, is obviously true. And the patch to do that is very simple too. No need to walk any chains, since the "free(istate->alloc);" will release all the pre-allocated cache_entry structures, and the rest are (necessarily) leaked anyway. [ Side note for non-Junios: the leaking of cache_entry structures isn't new, we've always done it, and it's even done on purpose. The common case is that there is one *big* allocation (istate->alloc) that contains all the original cache entries. There are usually none, or only a very few individual allocations, and we don't even keep track of them. With the new in-memory format, we could make a special flag that does "is this cache-entry an individual allocation or not" (or we could even just see if they are inside the "alloc" range), but the common case really should be that there's just a couple of them, and we just drop them rather than tracking them. ] Here. Linus --- read-cache.c | 1 + 1 files changed, 1 insertions(+), 0 deletions(-) diff --git a/read-cache.c b/read-cache.c index 33a8ca5..abee0fc 100644 --- a/read-cache.c +++ b/read-cache.c @@ -1142,6 +1142,7 @@ int discard_index(struct index_state *istate) istate->cache_nr = 0; istate->cache_changed = 0; istate->timestamp = 0; + free_hash(&istate->name_hash); cache_tree_free(&(istate->cache_tree)); free(istate->alloc); istate->alloc = NULL; ^ permalink raw reply related [flat|nested] 51+ messages in thread
* Re: I'm a total push-over.. 2008-01-23 2:58 ` Linus Torvalds @ 2008-01-23 3:19 ` Linus Torvalds 2008-01-25 6:50 ` Junio C Hamano 2008-01-23 7:23 ` Junio C Hamano 1 sibling, 1 reply; 51+ messages in thread From: Linus Torvalds @ 2008-01-23 3:19 UTC (permalink / raw) To: Junio C Hamano; +Cc: Git Mailing List On Tue, 22 Jan 2008, Linus Torvalds wrote: > > And the patch to do that is very simple too. Ok, I pushed the squashed/fixed commit to the "new-lstat" branch. It's now based on your 'next' branch, and I should have renamed it, since it's not about any of the old lstat() optimizations any more. Whatever. But here it is also as a full patch, with a fixed up subject line etc. You tend to want to have them as topic-branches, and it's probably easier this way. Linus --- From ca98bbc7b0cc1d9d088a8c6ae80e733115a1f775 Mon Sep 17 00:00:00 2001 From: Linus Torvalds <torvalds@linux-foundation.org> Date: Tue, 22 Jan 2008 18:41:14 -0800 Subject: [PATCH] Create pathname-based hash-table lookup into index This creates a hash index of every single file added to the index. Right now that hash index isn't actually used for much: I implemented a "cache_name_exists()" function that uses it to efficiently look up a filename in the index without having to do the O(logn) binary search, but quite frankly, that's not why this patch is interesting. No, the whole and only reason to create the hash of the filenames in the index is that by modifying the hash function, you can fairly easily do things like making it always hash equivalent names into the same bucket. That, in turn, means that suddenly questions like "does this name exist in the index under an _equivalent_ name?" becomes much much cheaper. Guiding principles behind this patch: - it shouldn't be too costly. In fact, my primary goal here was to actually speed up "git commit" with a fully populated kernel tree, by being faster at checking whether a file already existed in the index. I did succeed, but only barely: Best before: [torvalds@woody linux]$ time git commit > /dev/null real 0m0.255s user 0m0.168s sys 0m0.088s Best after: [torvalds@woody linux]$ time ~/git/git commit > /dev/null real 0m0.233s user 0m0.144s sys 0m0.088s so some things are actually faster (~8%). Caveat: that's really the best case. Other things are invariably going to be slightly slower, since we populate that index cache, and quite frankly, few things really use it to look things up. That said, the cost is really quite small. The worst case is probably doing a "git ls-files", which will do very little except puopulate the index, and never actually looks anything up in it, just lists it. Before: [torvalds@woody linux]$ time git ls-files > /dev/null real 0m0.016s user 0m0.016s sys 0m0.000s After: [torvalds@woody linux]$ time ~/git/git ls-files > /dev/null real 0m0.021s user 0m0.012s sys 0m0.008s and while the thing has really gotten relatively much slower, we're still talking about something almost unmeasurable (eg 5ms). And that really should be pretty much the worst case. So we lose 5ms on one "benchmark", but win 22ms on another. Pick your poison - this patch has the advantage that it will _likely_ speed up the cases that are complex and expensive more than it slows down the cases that are already so fast that nobody cares. But if you look at relative speedups/slowdowns, it doesn't look so good. 
- It should be simple and clean The code may be a bit subtle (the reasons I do hash removal the way I do etc), but it re-uses the existing hash.c files, so it really is fairly small and straightforward apart from a few odd details. Now, this patch on its own doesn't really do much, but I think it's worth looking at, if only because if done correctly, the name hashing really can make an improvement to the whole issue of "do we have a filename that looks like this in the index already". And at least it gets real testing by being used even by default (ie there is a real use-case for it even without any insane filesystems). NOTE NOTE NOTE! The current hash is a joke. I'm ashamed of it, I'm just not ashamed of it enough to really care. I took all the numbers out of my nether regions - I'm sure it's good enough that it works in practice, but the whole point was that you can make a really much fancier hash that hashes characters not directly, but by their upper-case value or something like that, and thus you get a case-insensitive hash, while still keeping the name and the index itself totally case sensitive. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> --- cache.h | 6 +++ dir.c | 2 +- read-cache.c | 98 ++++++++++++++++++++++++++++++++++++++++++++++++++++------ 3 files changed, 95 insertions(+), 11 deletions(-) diff --git a/cache.h b/cache.h index 3a47cdc..409738c 100644 --- a/cache.h +++ b/cache.h @@ -3,6 +3,7 @@ #include "git-compat-util.h" #include "strbuf.h" +#include "hash.h" #include SHA1_HEADER #include <zlib.h> @@ -109,6 +110,7 @@ struct ondisk_cache_entry { }; struct cache_entry { + struct cache_entry *next; unsigned int ce_ctime; unsigned int ce_mtime; unsigned int ce_dev; @@ -131,6 +133,7 @@ struct cache_entry { #define CE_UPDATE (0x10000) #define CE_REMOVE (0x20000) #define CE_UPTODATE (0x40000) +#define CE_UNHASHED (0x80000) static inline unsigned create_ce_flags(size_t len, unsigned stage) { @@ -188,6 +191,7 @@ struct index_state { struct cache_tree *cache_tree; time_t timestamp; void *alloc; + struct hash_table name_hash; }; extern struct index_state the_index; @@ -211,6 +215,7 @@ extern struct index_state the_index; #define refresh_cache(flags) refresh_index(&the_index, (flags), NULL, NULL) #define ce_match_stat(ce, st, options) ie_match_stat(&the_index, (ce), (st), (options)) #define ce_modified(ce, st, options) ie_modified(&the_index, (ce), (st), (options)) +#define cache_name_exists(name, namelen) index_name_exists(&the_index, (name), (namelen)) #endif enum object_type { @@ -297,6 +302,7 @@ extern int read_index_from(struct index_state *, const char *path); extern int write_index(struct index_state *, int newfd); extern int discard_index(struct index_state *); extern int verify_path(const char *path); +extern int index_name_exists(struct index_state *istate, const char *name, int namelen); extern int index_name_pos(struct index_state *, const char *name, int namelen); #define ADD_CACHE_OK_TO_ADD 1 /* Ok to add */ #define ADD_CACHE_OK_TO_REPLACE 2 /* Ok to replace file/directory */ diff --git a/dir.c b/dir.c index 1b9cc7a..6543105 100644 --- a/dir.c +++ b/dir.c @@ -346,7 +346,7 @@ static struct dir_entry *dir_entry_new(const char *pathname, int len) struct dir_entry *dir_add_name(struct dir_struct *dir, const char *pathname, int len) { - if (cache_name_pos(pathname, len) >= 0) + if (cache_name_exists(pathname, len)) return NULL; ALLOC_GROW(dir->entries, dir->nr+1, dir->alloc); diff --git a/read-cache.c b/read-cache.c index 8ba8f0f..abee0fc 100644 --- 
a/read-cache.c +++ b/read-cache.c @@ -23,6 +23,70 @@ struct index_state the_index; +static unsigned int hash_name(const char *name, int namelen) +{ + unsigned int hash = 0x123; + + do { + unsigned char c = *name++; + hash = hash*101 + c; + } while (--namelen); + return hash; +} + +static void set_index_entry(struct index_state *istate, int nr, struct cache_entry *ce) +{ + void **pos; + unsigned int hash = hash_name(ce->name, ce_namelen(ce)); + + istate->cache[nr] = ce; + pos = insert_hash(hash, ce, &istate->name_hash); + if (pos) { + ce->next = *pos; + *pos = ce; + } +} + +/* + * We don't actually *remove* it, we can just mark it invalid so that + * we won't find it in lookups. + * + * Not only would we have to search the lists (simple enough), but + * we'd also have to rehash other hash buckets in case this makes the + * hash bucket empty (common). So it's much better to just mark + * it. + */ +static void remove_hash_entry(struct index_state *istate, struct cache_entry *ce) +{ + ce->ce_flags |= CE_UNHASHED; +} + +static void replace_index_entry(struct index_state *istate, int nr, struct cache_entry *ce) +{ + struct cache_entry *old = istate->cache[nr]; + + if (ce != old) { + remove_hash_entry(istate, old); + set_index_entry(istate, nr, ce); + } + istate->cache_changed = 1; +} + +int index_name_exists(struct index_state *istate, const char *name, int namelen) +{ + unsigned int hash = hash_name(name, namelen); + struct cache_entry *ce = lookup_hash(hash, &istate->name_hash); + + while (ce) { + if (!(ce->ce_flags & CE_UNHASHED)) { + if (!cache_name_compare(name, namelen, ce->name, ce->ce_flags)) + return 1; + } + ce = ce->next; + } + return 0; +} + /* * This only updates the "non-critical" parts of the directory * cache, ie the parts that aren't tracked by GIT, and only used @@ -323,6 +387,9 @@ int index_name_pos(struct index_state *istate, const char *name, int namelen) /* Remove entry, return true if there are more entries to go.. */ int remove_index_entry_at(struct index_state *istate, int pos) { + struct cache_entry *ce = istate->cache[pos]; + + remove_hash_entry(istate, ce); istate->cache_changed = 1; istate->cache_nr--; if (pos >= istate->cache_nr) @@ -697,8 +764,7 @@ static int add_index_entry_with_check(struct index_state *istate, struct cache_e /* existing match? Just replace it. */ if (pos >= 0) { - istate->cache_changed = 1; - istate->cache[pos] = ce; + replace_index_entry(istate, pos, ce); return 0; } pos = -pos-1; @@ -758,7 +824,7 @@ int add_index_entry(struct index_state *istate, struct cache_entry *ce, int opti memmove(istate->cache + pos + 1, istate->cache + pos, (istate->cache_nr - pos - 1) * sizeof(ce)); - istate->cache[pos] = ce; + set_index_entry(istate, pos, ce); istate->cache_changed = 1; return 0; } @@ -887,11 +953,8 @@ int refresh_index(struct index_state *istate, unsigned int flags, const char **p has_errors = 1; continue; } - istate->cache_changed = 1; - /* You can NOT just free istate->cache[i] here, since it - * might not be necessarily malloc()ed but can also come - * from mmap(). */ - istate->cache[i] = new; + + replace_index_entry(istate, i, new); } return has_errors; } @@ -966,6 +1029,20 @@ static void convert_from_disk(struct ondisk_cache_entry *ondisk, struct cache_en memcpy(ce->name, ondisk->name, len + 1); } +static inline size_t estimate_cache_size(size_t ondisk_size, unsigned int entries) +{ + long per_entry; + + per_entry = sizeof(struct cache_entry) - sizeof(struct ondisk_cache_entry); + + /* + * Alignment can cause differences. 
This should be "alignof", but + * since that's a gcc'ism, just use the size of a pointer. + */ + per_entry += sizeof(void *); + return ondisk_size + entries*per_entry; +} + /* remember to discard_cache() before reading a different cache! */ int read_index_from(struct index_state *istate, const char *path) { @@ -1016,7 +1093,7 @@ int read_index_from(struct index_state *istate, const char *path) * has room for a few more flags, we can allocate using the same * index size */ - istate->alloc = xmalloc(mmap_size); + istate->alloc = xmalloc(estimate_cache_size(mmap_size, istate->cache_nr)); src_offset = sizeof(*hdr); dst_offset = 0; @@ -1027,7 +1104,7 @@ int read_index_from(struct index_state *istate, const char *path) disk_ce = (struct ondisk_cache_entry *)((char *)mmap + src_offset); ce = (struct cache_entry *)((char *)istate->alloc + dst_offset); convert_from_disk(disk_ce, ce); - istate->cache[i] = ce; + set_index_entry(istate, i, ce); src_offset += ondisk_ce_size(ce); dst_offset += ce_size(ce); @@ -1065,6 +1142,7 @@ int discard_index(struct index_state *istate) istate->cache_nr = 0; istate->cache_changed = 0; istate->timestamp = 0; + free_hash(&istate->name_hash); cache_tree_free(&(istate->cache_tree)); free(istate->alloc); istate->alloc = NULL; -- 1.5.4.rc4.1130.g9ad85 ^ permalink raw reply related [flat|nested] 51+ messages in thread
* Re: I'm a total push-over.. 2008-01-23 3:19 ` Linus Torvalds @ 2008-01-25 6:50 ` Junio C Hamano 2008-01-25 16:24 ` Linus Torvalds 0 siblings, 1 reply; 51+ messages in thread From: Junio C Hamano @ 2008-01-25 6:50 UTC (permalink / raw) To: Linus Torvalds; +Cc: Git Mailing List Linus Torvalds <torvalds@linux-foundation.org> writes: > +static inline size_t estimate_cache_size(size_t ondisk_size, unsigned int entries) > +{ > + long per_entry; > + > + per_entry = sizeof(struct cache_entry) - sizeof(struct ondisk_cache_entry); > + > + /* > + * Alignment can cause differences. This should be "alignof", but > + * since that's a gcc'ism, just use the size of a pointer. > + */ > + per_entry += sizeof(void *); > + return ondisk_size + entries*per_entry; > +} > + I wonder if the issue Dave Miller addressed with 69ae517541ed5ab7d4fdcd8f82a9b8bd949df347 (fast-import: fix unalinged allocation and access) applies here. commit 69ae517541ed5ab7d4fdcd8f82a9b8bd949df347 Author: David S. Miller <davem@davemloft.net> Date: Fri Dec 14 20:39:16 2007 -0800 fast-import: fix unalinged allocation and access The specialized pool allocator fast-import uses aligned objects on the size of a pointer, which was not sufficient at least on Sparc. Instead, make the alignment for objects of type unitmax_t. Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Junio C Hamano <gitster@pobox.com> ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: I'm a total push-over.. 2008-01-25 6:50 ` Junio C Hamano @ 2008-01-25 16:24 ` Linus Torvalds 0 siblings, 0 replies; 51+ messages in thread From: Linus Torvalds @ 2008-01-25 16:24 UTC (permalink / raw) To: Junio C Hamano; +Cc: Git Mailing List On Thu, 24 Jan 2008, Junio C Hamano wrote: > > I wonder if the issue Dave Miller addressed with > 69ae517541ed5ab7d4fdcd8f82a9b8bd949df347 (fast-import: fix > unalinged allocation and access) applies here. Good point, although we actually do things wrong for *another* reason. We currently force cache_entry to be 8-byte aligned regardless of what the actual "sizeof(ptr)" is, so we should assume that alignment: #define cache_entry_size(len) ((offsetof(struct cache_entry,name) + (len) + 8) & ~7) and if that isn't correct, we'd need to change this #define. So right now, the right thing to do is probably to make this alignment explicit: #define CE_ALIGN 8 and then use that both in the "cache_entry_size()" _and_ in the "estimate_cache_size()" calculations to make it obvious what the alignment is. And then we could actually make the alignment less on architectures that don't need that much (there may be architectures that need more, but I doubt it: we don't have any large fields in that structure, so the structure alignment really probably does max out at 8 in practice even if the C language theory doesn't give you any such guarantees). Side note: this is not likely to be a problem in _practice_. The on-disk representation is also aligned (also by 8), and while they can be *differently* aligned due to the relative alignment of the varying-length "name[]" field, and that can cause some padding to be needed, in practice it will never matter. The on-disk size also contains a header that we don't take into account, so it's already "over-estimated" to begin with for the in-memory representation. So "estimate_cache_size()" really does over-estimate its needs by a biggish amount, which is why it all works regardless, but better safe than sorry. Linus ^ permalink raw reply [flat|nested] 51+ messages in thread
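
Spelled out, the explicit-alignment idea described above might look like the sketch below, given the struct definitions from cache.h. It mirrors the rounding of the existing cache_entry_size() macro (the +1 covers the NUL terminator), but it is what the mail proposes, not code that was committed:

#define CE_ALIGN 8	/* the alignment cache_entry_size() effectively assumes today */
#define ce_align(x) (((x) + CE_ALIGN - 1) & ~(CE_ALIGN - 1))

#define cache_entry_size(len) ce_align(offsetof(struct cache_entry, name) + (len) + 1)

static inline size_t estimate_cache_size(size_t ondisk_size, unsigned int entries)
{
	size_t per_entry = sizeof(struct cache_entry) - sizeof(struct ondisk_cache_entry);

	/* allow for the worst-case padding difference between the two layouts */
	per_entry += CE_ALIGN;
	return ondisk_size + entries*per_entry;
}
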
* Re: I'm a total push-over.. 2008-01-23 2:58 ` Linus Torvalds 2008-01-23 3:19 ` Linus Torvalds @ 2008-01-23 7:23 ` Junio C Hamano 2008-01-23 12:25 ` Johannes Schindelin 1 sibling, 1 reply; 51+ messages in thread From: Junio C Hamano @ 2008-01-23 7:23 UTC (permalink / raw) To: Linus Torvalds; +Cc: Git Mailing List Linus Torvalds <torvalds@linux-foundation.org> writes: > Basically, I dislike having two copies of the same data. If something can > be computed from something else, then only the original data should exist, > and the other thing should be recomputed. Yes, I agree with that in principle. Storing computable values makes sense only when it is expensive to recompute. We did not have cache-tree for quite a long time until you noticed that it was rather expensive and wasteful to recompute tree objects from unchanged parts of the index every time. It's the same argument; when the hashing performance starts to become noticeable, we can think about storing and reusing it, not before. > I did consider doing the indexing only on demand, and we can certainly > simply just "turn it off" when we know it's never going to get used (ie > "git ls-files"). So in that sense, it's easy to get rid of the overhead, > but it didn't really seem like the conceptual complexity (even if it's > just a couple of lines) is really worth it. It's not like git ls-files is > really performance-critical anyway. Yes, ls-files is cheap. So is lstat(2) on Linux. It only matters when you do it many many times. In any case, the change does not look too bad. The best time (real) of running git-ls-files in the kernel repository on my box is 0.010s vs 0.011s (10% improvement, heh!, which is the same as the master version) and empty commit is both 0.082s (no change). -- >8 -- [PATCH] lazy index hashing This delays the hashing of index names until it becomes necessary for the first time. Signed-off-by: Junio C Hamano <gitster@pobox.com> --- cache.h | 1 + read-cache.c | 26 +++++++++++++++++++++++--- 2 files changed, 24 insertions(+), 3 deletions(-) diff --git a/cache.h b/cache.h index 409738c..e4aeff0 100644 --- a/cache.h +++ b/cache.h @@ -191,6 +191,7 @@ struct index_state { struct cache_tree *cache_tree; time_t timestamp; void *alloc; + unsigned name_hash_initialized : 1; struct hash_table name_hash; }; diff --git a/read-cache.c b/read-cache.c index 9477c0b..e45f4b3 100644 --- a/read-cache.c +++ b/read-cache.c @@ -34,12 +34,11 @@ static unsigned int hash_name(const char *name, int namelen) return hash; } -static void set_index_entry(struct index_state *istate, int nr, struct cache_entry *ce) +static void hash_index_entry(struct index_state *istate, struct cache_entry *ce) { void **pos; unsigned int hash = hash_name(ce->name, ce_namelen(ce)); - istate->cache[nr] = ce; pos = insert_hash(hash, ce, &istate->name_hash); if (pos) { ce->next = *pos; @@ -47,6 +46,24 @@ static void set_index_entry(struct index_state *istate, int nr, struct cache_ent } } +static void lazy_init_name_hash(struct index_state *istate) +{ + int nr; + + if (istate->name_hash_initialized) + return; + for (nr = 0; nr < istate->cache_nr; nr++) + hash_index_entry(istate, istate->cache[nr]); + istate->name_hash_initialized = 1; +} + +static void set_index_entry(struct index_state *istate, int nr, struct cache_entry *ce) +{ + istate->cache[nr] = ce; + if (istate->name_hash_initialized) + hash_index_entry(istate, ce); +} + /* * We don't actually *remove* it, we can just mark it invalid so that * we won't find it in lookups. 
@@ -75,7 +92,10 @@ static void replace_index_entry(struct index_state *istate, int nr, struct cache int index_name_exists(struct index_state *istate, const char *name, int namelen) { unsigned int hash = hash_name(name, namelen); - struct cache_entry *ce = lookup_hash(hash, &istate->name_hash); + struct cache_entry *ce; + + lazy_init_name_hash(istate); + ce = lookup_hash(hash, &istate->name_hash); while (ce) { if (!(ce->ce_flags & CE_UNHASHED)) { -- 1.5.4.rc4.14.g6fc74 ^ permalink raw reply related [flat|nested] 51+ messages in thread
* Re: I'm a total push-over.. 2008-01-23 7:23 ` Junio C Hamano @ 2008-01-23 12:25 ` Johannes Schindelin 2008-01-23 16:25 ` Linus Torvalds 0 siblings, 1 reply; 51+ messages in thread From: Johannes Schindelin @ 2008-01-23 12:25 UTC (permalink / raw) To: Junio C Hamano; +Cc: Linus Torvalds, Git Mailing List Hi, On Tue, 22 Jan 2008, Junio C Hamano wrote: > Linus Torvalds <torvalds@linux-foundation.org> writes: > > > Basically, I dislike having two copies of the same data. If something > > can be computed from something else, then only the original data > > should exist, and the other thing should be recomputed. > > Yes, I agree with that in principle. Storing computable values makes > sense only when it is expensive to recompute. We did not have > cache-tree for quite a long time until you noticed that it was rather > expensive and wasteful to recompute tree objects from unchanged parts of > the index every time. > > It's the same argument; when the hashing performance starts to become > noticeable, we can think about storing and reusing it, not before. I fully expect it to be noticable with that UTF-8 "normalisation". But then, the infrastructure is there, and whoever has an itch to scratch... Ciao, Dscho ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: I'm a total push-over.. 2008-01-23 12:25 ` Johannes Schindelin @ 2008-01-23 16:25 ` Linus Torvalds 2008-01-23 16:34 ` Johannes Schindelin 0 siblings, 1 reply; 51+ messages in thread From: Linus Torvalds @ 2008-01-23 16:25 UTC (permalink / raw) To: Johannes Schindelin; +Cc: Junio C Hamano, Git Mailing List On Wed, 23 Jan 2008, Johannes Schindelin wrote: > > I fully expect it to be noticable with that UTF-8 "normalisation". But > then, the infrastructure is there, and whoever has an itch to scratch... Actually, it's going to be totally invisible even with UTF-8 normalization, because we're going to do it sanely. And by "sanely" I mean just having the code test the high bit, and using US-ASCII as-is (possibly with that " & ~0x20 " thing to ignore case in it). End result: practically all projects will never notice anything at all for 99.9% of all files. One extra well-predicted branch, and a few more hash collissions for cases where you have both "Makefile" and "makefile" etc. Doing names with *lots* of UTF-8 characters will be rather slower. It's still not horrible to do if you do it the smart way, though. In fact, it's pretty simple, just a few table lookups (one to find the NFD form, one to do the upcasing). And yes, for hashing, it makes sense to turn things into NFD because it's generally simpler, but the point is that you really don't actually modify the name itself at all, you just hash things (or compare things) character by expanded character. IOW, only a total *moron* does Unicode name comparisons with strcmp(convert_to_nfd(a), convert_to_nfd(b)); which is essentially what Apple does. It's quite possible to do utf8_nfd_strcmp(a,b) and (a) do it tons and tons faster and (b) never have to modify the strings themselves. Same goes (even more) for hashing. Linus ^ permalink raw reply [flat|nested] 51+ messages in thread
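
The branch structure described above can be sketched as follows. The folding applied to ASCII only has to be consistent, not a linguistically correct upcasing, and fold_utf8() is a hypothetical stand-in for the NFD-plus-upcase table lookups on the rare multi-byte path; it is not an existing function:

extern unsigned int fold_utf8(const char **name, int *namelen, unsigned int hash);

static unsigned int hash_name_utf8(const char *name, int namelen)
{
	unsigned int hash = 0x123;

	while (namelen > 0) {
		unsigned char c = *name;
		if (!(c & 0x80)) {
			/* US-ASCII: the 99.9% case, one well-predicted branch per byte */
			name++;
			namelen--;
			hash = hash*101 + (c & ~0x20);
		} else {
			/* high bit set: decompose and upcase via table lookups */
			hash = fold_utf8(&name, &namelen, hash);
		}
	}
	return hash;
}
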
* Re: I'm a total push-over.. 2008-01-23 16:25 ` Linus Torvalds @ 2008-01-23 16:34 ` Johannes Schindelin 2008-01-23 17:09 ` Linus Torvalds 0 siblings, 1 reply; 51+ messages in thread From: Johannes Schindelin @ 2008-01-23 16:34 UTC (permalink / raw) To: Linus Torvalds; +Cc: Junio C Hamano, Git Mailing List Hi, On Wed, 23 Jan 2008, Linus Torvalds wrote: > On Wed, 23 Jan 2008, Johannes Schindelin wrote: > > > > I fully expect it to be noticable with that UTF-8 "normalisation". > > But then, the infrastructure is there, and whoever has an itch to > > scratch... > > Actually, it's going to be totally invisible even with UTF-8 > normalization, because we're going to do it sanely. > > And by "sanely" I mean just having the code test the high bit, and using > US-ASCII as-is (possibly with that " & ~0x20 " thing to ignore case in > it). > > End result: practically all projects will never notice anything at all for > 99.9% of all files. One extra well-predicted branch, and a few more hash > collissions for cases where you have both "Makefile" and "makefile" etc. Well, that's the point, to avoid having both "Makefile" and "makefile" in your repository when you are on case-challenged filesystems, right? > Doing names with *lots* of UTF-8 characters will be rather slower. It's > still not horrible to do if you do it the smart way, though. In fact, > it's pretty simple, just a few table lookups (one to find the NFD form, > one to do the upcasing). > > And yes, for hashing, it makes sense to turn things into NFD because > it's generally simpler, but the point is that you really don't actually > modify the name itself at all, you just hash things (or compare things) > character by expanded character. > > IOW, only a total *moron* does Unicode name comparisons with > > strcmp(convert_to_nfd(a), convert_to_nfd(b)); > > which is essentially what Apple does. Heh, indeed that is what I would have done as an initial step (out of laziness). > It's quite possible to do > > utf8_nfd_strcmp(a,b) > > and (a) do it tons and tons faster and (b) never have to modify the > strings themselves. Same goes (even more) for hashing. Okay. Point taken. But I really hope that you are not proposing to use the case-ignoring hash when we are _not_ on a case-challenged filesystem... Ciao, Dscho ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: I'm a total push-over.. 2008-01-23 16:34 ` Johannes Schindelin @ 2008-01-23 17:09 ` Linus Torvalds 2008-01-23 17:29 ` Linus Torvalds 0 siblings, 1 reply; 51+ messages in thread
From: Linus Torvalds @ 2008-01-23 17:09 UTC (permalink / raw)
To: Johannes Schindelin; +Cc: Junio C Hamano, Git Mailing List

On Wed, 23 Jan 2008, Johannes Schindelin wrote:
>
> > End result: practically all projects will never notice anything at all for
> > 99.9% of all files. One extra well-predicted branch, and a few more hash
> > collissions for cases where you have both "Makefile" and "makefile" etc.
>
> Well, that's the point, to avoid having both "Makefile" and "makefile" in
> your repository when you are on case-challenged filesystems, right?

Right. But what I'm saying is that this is *really* cheap to test for US-ASCII-only characters, and if only 0.1% of all filenames have unicode in them, the fact that they are much more expensive isn't even going to be noticeable. Except for some very odd-ball environments.

> > It's quite possible to do
> >
> > 	utf8_nfd_strcmp(a,b)
> >
> > and (a) do it tons and tons faster and (b) never have to modify the
> > strings themselves. Same goes (even more) for hashing.
>
> Okay. Point taken.

Note that one reason the above is tons faster is that even with complex unicode, the *common* case is going to be that the names match with a binary compare.

> But I really hope that you are not proposing to use the case-ignoring
> hash when we are _not_ on a case-challenged filesystem...

I actually suspect that we could, and nobody will notice. The hash would cause a few more collisions, but not so you'd know.

And the thing is, people who work with other people who are on case-challenged systems would still want to have the case-insensitive compare too - although it should just warn, not actually "work".

		Linus
^ permalink raw reply [flat|nested] 51+ messages in thread
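
The "tons faster" claim rests on a byte-exact fast path: only names that differ as raw bytes ever reach the expensive per-character walk. A rough sketch of such a utf8_nfd_strcmp() follows; normalize_next() is a hypothetical helper that would return the next case-folded, NFD-expanded code point, and no such function exists in git:

#include <string.h>

extern unsigned int normalize_next(const char **p, const char *end);

static int utf8_nfd_strcmp(const char *a, int alen, const char *b, int blen)
{
	const char *aend = a + alen, *bend = b + blen;

	/* common case: the names are byte-identical */
	if (alen == blen && !memcmp(a, b, alen))
		return 0;

	/* slow path: compare in expanded, folded form without modifying either string */
	while (a < aend && b < bend) {
		unsigned int ca = normalize_next(&a, aend);
		unsigned int cb = normalize_next(&b, bend);
		if (ca != cb)
			return ca < cb ? -1 : 1;
	}
	return (a < aend) - (b < bend);
}
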
* Re: I'm a total push-over.. 2008-01-23 17:09 ` Linus Torvalds @ 2008-01-23 17:29 ` Linus Torvalds 2008-01-25 5:21 ` Jeremy Maitin-Shepard 0 siblings, 1 reply; 51+ messages in thread From: Linus Torvalds @ 2008-01-23 17:29 UTC (permalink / raw) To: Johannes Schindelin; +Cc: Junio C Hamano, Git Mailing List On Wed, 23 Jan 2008, Linus Torvalds wrote: > > > But I really hope that you are not proposing to use the case-ignoring > > hash when we are _not_ on a case-challenged filesystem... > > I actually suspect that we could, and nobody will notice. The hash would > cause a few more collissions, but not so you'd know. To clarify: the thing I want to point out that the decision to *hash* the filenames in a case-insensitive hash, is very different from the decision to then *compare* the filenames when traversing the hash with a case-insensitive compare. And this difference is actually very important. Hashing things together that are "equivalent" according to any random rule is what makes it possible to then *check* for equivalence cheaply (because you only need to make the potentially expensive check with the subset of cases where it might trigger), but it in no way forces you to actually recode or mangle or compare things equivalently. In fact, I'd argue that this is what HFS+ did wrong in the first place: they had stupid/incompetent people who didn't understand about this, so they normalized the string *before* the hashing rather than as part of the hash itself, and thus actually corrupt the string itself. So what you can do (and I'd argue that we do) is to have a hash that can handle almost arbitrary input, but then never corrupt the filename, and always compare exactly by default. Then, depending on a config option, we can decide to change the compare so that equivalent (according to whatever rule) filenames either cause a warning (people on sane filesystems, but working with people who aren't), or are silently considered the same file (people on insane filesystems). Linus ^ permalink raw reply [flat|nested] 51+ messages in thread
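
Putting the pieces together, the design separates a forgiving hash, an exact comparison by default, and a configurable reaction to merely "equivalent" names. The lookup below is a sketch of that separation; the policy enum and ce_name_equivalent() are illustrative assumptions, not existing git code:

enum eqv_policy { EQV_EXACT, EQV_WARN, EQV_SAME };

extern int ce_name_equivalent(const char *name, int namelen, const struct cache_entry *ce);

static struct cache_entry *lookup_index_name(struct index_state *istate,
					     const char *name, int namelen,
					     enum eqv_policy policy)
{
	unsigned int hash = hash_name(name, namelen);	/* equivalence-class hash */
	struct cache_entry *ce = lookup_hash(hash, &istate->name_hash);

	for (; ce; ce = ce->next) {
		if (ce->ce_flags & CE_UNHASHED)
			continue;
		/* default behaviour: exact, byte-for-byte match */
		if (!cache_name_compare(name, namelen, ce->name, ce->ce_flags))
			return ce;
		/* merely "equivalent" names are a policy decision */
		if (policy != EQV_EXACT && ce_name_equivalent(name, namelen, ce)) {
			if (policy == EQV_SAME)
				return ce;
			warning("'%.*s' collides with '%s' under the equivalence rule",
				namelen, name, ce->name);
		}
	}
	return NULL;
}
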
* Re: I'm a total push-over.. 2008-01-23 17:29 ` Linus Torvalds @ 2008-01-25 5:21 ` Jeremy Maitin-Shepard 2008-01-25 12:51 ` Johannes Schindelin 0 siblings, 1 reply; 51+ messages in thread From: Jeremy Maitin-Shepard @ 2008-01-25 5:21 UTC (permalink / raw) To: Linus Torvalds; +Cc: git Linus Torvalds wrote: [snip] > So what you can do (and I'd argue that we do) is to have a hash that can > handle almost arbitrary input, but then never corrupt the filename, and > always compare exactly by default. In general, there may be a large number of comparison function options that git will eventually support, and they will likely not all form a single chain of increasing "strictness". Given that the hash values aren't even being stored on disk (and if they were, a simple approach of also storing an identifier for the hash function to know whether they stored values are still valid could be used), having a chain of increasingly "strict" comparison functions and using a hash function that corresponds to the least strict one is useful for exactly one reason: giving (possibly several different levels of) non-fatal warnings for various types of duplicates. But since multiple hash functions will be needed anyway to support different notions of case-insensitivity, if the warning is not enabled, there is no reason to use a case-insensitive hash function with a byte-exact comparison. -- Jeremy Maitin-Shepard ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: I'm a total push-over.. 2008-01-25 5:21 ` Jeremy Maitin-Shepard @ 2008-01-25 12:51 ` Johannes Schindelin 2008-01-25 18:19 ` Jeremy Maitin-Shepard 0 siblings, 1 reply; 51+ messages in thread From: Johannes Schindelin @ 2008-01-25 12:51 UTC (permalink / raw) To: Jeremy Maitin-Shepard; +Cc: Linus Torvalds, git Hi, On Fri, 25 Jan 2008, Jeremy Maitin-Shepard wrote: > But since multiple hash functions will be needed anyway to support > different notions of case-insensitivity, if the warning is not enabled, > there is no reason to use a case-insensitive hash function with a > byte-exact comparison. No, only multiple compare functions will be needed. The hash function can be built in such a manner that it guarantees that file names being equal with _any_ of the compare functions fall into the same bucket. The upside of such a hash function: less code to maintain. Hth, Dscho ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: I'm a total push-over.. 2008-01-25 12:51 ` Johannes Schindelin @ 2008-01-25 18:19 ` Jeremy Maitin-Shepard 2008-01-25 18:24 ` Johannes Schindelin 2008-01-25 19:07 ` Junio C Hamano 0 siblings, 2 replies; 51+ messages in thread From: Jeremy Maitin-Shepard @ 2008-01-25 18:19 UTC (permalink / raw) To: Johannes Schindelin; +Cc: Linus Torvalds, git Johannes Schindelin <Johannes.Schindelin@gmx.de> writes: > On Fri, 25 Jan 2008, Jeremy Maitin-Shepard wrote: >> But since multiple hash functions will be needed anyway to support >> different notions of case-insensitivity, if the warning is not enabled, >> there is no reason to use a case-insensitive hash function with a >> byte-exact comparison. > No, only multiple compare functions will be needed. The hash function can > be built in such a manner that it guarantees that file names being equal > with _any_ of the compare functions fall into the same bucket. In theory, I agree that this is possible, but in practice it may not be reasonable at all. Consider two possible comparison functions: 1. compare file names as strings case-insensitively assuming a latin 1 encoding 2. compare file names as strings case-insensitively assuming a UTF-8 encoding Actually writing a hash function such that two strings hash to the same value if either of these comparison functions says that the strings are equal would appear to be rather difficult. > The upside of such a hash function: less code to maintain. A simple hash function that doesn't try to do anything regarding case-insensitivity is extremely short and simple and therefore is hardly a maintenance burden. Although in some cases it is possible to "share" a hash function, except for the "warning" purpose, actually doing so doesn't make much sense. Using the "case-insensitive" hash function when you intend to use an "exact" comparison function just amounts to using a hash function that is unequivocally worse: it is slower, more complicated, and has a higher collision rate. -- Jeremy Maitin-Shepard ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: I'm a total push-over.. 2008-01-25 18:19 ` Jeremy Maitin-Shepard @ 2008-01-25 18:24 ` Johannes Schindelin 2008-01-25 19:07 ` Junio C Hamano 1 sibling, 0 replies; 51+ messages in thread From: Johannes Schindelin @ 2008-01-25 18:24 UTC (permalink / raw) To: Jeremy Maitin-Shepard; +Cc: Linus Torvalds, git Hi, On Fri, 25 Jan 2008, Jeremy Maitin-Shepard wrote: > Johannes Schindelin <Johannes.Schindelin@gmx.de> writes: > > > The upside of such a hash function: less code to maintain. > > A simple hash function that doesn't try to do anything regarding > case-insensitivity is extremely short and simple and therefore is hardly > a maintenance burden. You misunderstand me. If the complicated hash function is the one that is less exercised, you _will_ face problems. OTOH if you _already_ need the "complicated" hash function, there is _little_ point not to use it, and be consistent between platforms, _especially_ since now all people eat the same dog food. So I never thought about the simple hash function as being a burden. Hth, Dscho ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: I'm a total push-over.. 2008-01-25 18:19 ` Jeremy Maitin-Shepard 2008-01-25 18:24 ` Johannes Schindelin @ 2008-01-25 19:07 ` Junio C Hamano 1 sibling, 0 replies; 51+ messages in thread From: Junio C Hamano @ 2008-01-25 19:07 UTC (permalink / raw) To: Jeremy Maitin-Shepard; +Cc: Johannes Schindelin, Linus Torvalds, git Jeremy Maitin-Shepard <jbms@cmu.edu> writes: > In theory, I agree that this is possible, but in practice it may not be > reasonable at all. Consider two possible comparison functions: > > 1. compare file names as strings case-insensitively assuming a latin 1 > encoding > > 2. compare file names as strings case-insensitively assuming a UTF-8 > encoding > > Actually writing a hash function such that two strings hash to the same > value if either of these comparison functions says that the strings are > equal would appear to be rather difficult. Once you start adding more "case folding" filesystems to the repertoire, such a unified hash function as Dscho suggests needs to throw paths that the other (N-1) "case folding" filesystems treat as distinct, but which a single filesystem treats as "equivalent", into the same hash bucket. I would say it is not just difficult: the resulting function would have so many collisions that it would be ineffective. ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: I'm a total push-over.. 2008-01-22 23:37 I'm a total push-over Linus Torvalds 2008-01-23 1:35 ` Kevin Ballard 2008-01-23 2:23 ` Junio C Hamano @ 2008-01-23 8:32 ` Andreas Ericsson 2008-01-23 9:15 ` Dmitry Potapov 2008-01-23 16:06 ` Linus Torvalds 2 siblings, 2 replies; 51+ messages in thread From: Andreas Ericsson @ 2008-01-23 8:32 UTC (permalink / raw) To: Linus Torvalds; +Cc: Git Mailing List, Junio C Hamano Linus Torvalds wrote: > > NOTE NOTE NOTE! The current hash is a joke. Insofar as hashes go, it's not that shabby for hashing filenames. Here's the test output from a small hash-comparison program I've got, which runs the test-input through a series of different hashes to compare dispersion, collisions, lookup- and insert times and other things that are interesting from a practical PoV. Fowler/Noll/Vo cheap hash Collisions: 1829, 7.91%. Depth max: 3, average: 0.09. Time spent in lookups: 167.859ms. Per entry: 0.145us Time spent inserting: 2.753ms. Per entry: 0.119us PHP Zend cheap hash Collisions: 1908, 8.25%. Depth max: 3, average: 0.09. Time spent in lookups: 171.819ms. Per entry: 0.149us Time spent inserting: 2.778ms. Per entry: 0.120us Phong Vo's linear congruential hash Collisions: 1996, 8.63%. Depth max: 3, average: 0.09. Time spent in lookups: 168.276ms. Per entry: 0.146us Time spent inserting: 2.840ms. Per entry: 0.123us Phong Vo's second linear congruential hash Collisions: 1933, 8.36%. Depth max: 3, average: 0.09. Time spent in lookups: 170.416ms. Per entry: 0.147us Time spent inserting: 2.774ms. Per entry: 0.120us Glib string hash Collisions: 1907, 8.24%. Depth max: 3, average: 0.09. Time spent in lookups: 192.420ms. Per entry: 0.166us Time spent inserting: 3.154ms. Per entry: 0.136us sdbm hash Collisions: 1899, 8.21%. Depth max: 3, average: 0.09. Time spent in lookups: 170.797ms. Per entry: 0.148us Time spent inserting: 2.724ms. Per entry: 0.118us bfd hash Collisions: 1949, 8.43%. Depth max: 3, average: 0.09. Time spent in lookups: 206.504ms. Per entry: 0.179us Time spent inserting: 3.241ms. Per entry: 0.140us Linus' GIT hash Collisions: 1946, 8.41%. Depth max: 4, average: 0.09. Time spent in lookups: 179.336ms. Per entry: 0.155us Time spent inserting: 2.781ms. Per entry: 0.120us This is with 64K buckets, which (with my implementation) means a total hash-table size of 256KiB. The test-input is a simple file-listing from the linux kernel (although it has the .git directory included too). As you can see, it loses out on mathematical correctness, as it has more collisions (but not that many). OTOH, the simplicity of the implementation makes it a viable option anyway, since the time spent in insertion (which is almost completely inside the hash-function) is on average so much shorter. For a temporary hash-table, it will work splendidly. It will probably have issues scaling to, say, 30 or 40 million input lines (fairly typical test-data for database hashes, fe). Note that real-world timings will be shorter. My test-program doesn't have the luxury of keeping hash-functions inline, so some compiler optimizations become impossible. This is on a Intel(R) Core(TM)2 Duo CPU T7700 @ 2.40GHz (according to /proc/cpuinfo), with a cache size of 4meg. The use of multiplication is sane though, as it means no arch will suffer greatly. The FNV hash would be better (pasted below), but I doubt anyone will ever care, and there will be larger differences between architectures with this one than the lt_git hash (well, a function's gotta have a name). 
/* * Fowler/Noll/Vo hash * * The basis of the hash algorithm was taken from an idea sent by email to the * IEEE Posix P1003.2 mailing list from Phong Vo (kpv@research.att.com) and * Glenn Fowler (gsf@research.att.com). Landon Curt Noll (chongo@toad.com) * later improved on their algorithm. * * The magic is in the interesting relationship between the special prime * 16777619 (2^24 + 403) and 2^32 and 2^8. * * This hash produces the fewest collisions of any function that we've seen so * far, and works well on both numbers and strings. * * (Last comment from MySQL code) * */ u32 FNV1(u8 *k, u32 len) { u8 *e; u32 h; e = k + len; for (h = 0; k < e; k++) { h *= 16777619; h ^= *k; } return (h); } I could provide figures for other table-sizes too, if anyone's interested. -- Andreas Ericsson andreas.ericsson@op5.se OP5 AB www.op5.se Tel: +46 8-230225 Fax: +46 8-230231 ^ permalink raw reply [flat|nested] 51+ messages in thread
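A minimal sketch of the kind of dispersion test described above, under the assumption that filenames arrive one per line on stdin and are hashed into 64K buckets; the harness itself (bucket count, I/O, reporting) is invented for illustration, while FNV1() is the function pasted above:

#include <stdio.h>
#include <string.h>

typedef unsigned int u32;
typedef unsigned char u8;

static u32 FNV1(u8 *k, u32 len)
{
        u8 *e = k + len;
        u32 h;

        for (h = 0; k < e; k++) {
                h *= 16777619;
                h ^= *k;
        }
        return h;
}

int main(void)
{
        static u32 depth[1 << 16];              /* 64K buckets */
        char line[4096];
        unsigned long entries = 0, collisions = 0;

        while (fgets(line, sizeof(line), stdin)) {
                u32 len = (u32)strcspn(line, "\n");
                u32 bucket = FNV1((u8 *)line, len) & 0xffff;

                if (depth[bucket]++)            /* bucket already occupied */
                        collisions++;
                entries++;
        }
        printf("%lu entries, %lu collisions (%.2f%%)\n", entries, collisions,
               entries ? 100.0 * collisions / entries : 0.0);
        return 0;
}

Piping the output of git ls-files into something like this should give roughly the kind of collision percentages quoted above.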
* Re: I'm a total push-over.. 2008-01-23 8:32 ` Andreas Ericsson @ 2008-01-23 9:15 ` Dmitry Potapov 2008-01-23 9:31 ` Andreas Ericsson 2008-01-23 16:06 ` Linus Torvalds 1 sibling, 1 reply; 51+ messages in thread From: Dmitry Potapov @ 2008-01-23 9:15 UTC (permalink / raw) To: Andreas Ericsson; +Cc: Linus Torvalds, Git Mailing List, Junio C Hamano On Wed, Jan 23, 2008 at 09:32:54AM +0100, Andreas Ericsson wrote: > > The FNV hash would be better (pasted below), but I doubt > anyone will ever care, and there will be larger differences > between architectures with this one than the lt_git hash (well, > a function's gotta have a name). Actually, Bob Jenkins' lookup3 hash is twice faster in my tests than FNV, and also it is much less likely to have any collision. The description and some comparision with other hash can be found here: http://burtleburtle.net/bob/hash/doobs.html http://burtleburtle.net/bob/c/lookup3.c Perhaps, the second choice is Paul Hsieh's hash. http://www.azillionmonkeys.com/qed/hash.html Note: Paul Hsieh provides the table where he compares his hash with others. There is also the program he used. I ran his program on my computer, advantage of his over others was not so big on my computer. Moreover, his test includes an old version of Bob Jenkins' hash. The new version -- lookup3, which I mentione above, has about the same speed as Paul Hsieh's hash (with -O2) or even 12% faster when I used -O3 -march=athlon-xp. Also, Bob Jenkins' hash is better for non-x86 architectures. So, I believe it is the best hash for today. Dmitry ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: I'm a total push-over.. 2008-01-23 9:15 ` Dmitry Potapov @ 2008-01-23 9:31 ` Andreas Ericsson 2008-01-23 14:01 ` Marko Kreen 2008-01-23 17:10 ` Dmitry Potapov 0 siblings, 2 replies; 51+ messages in thread From: Andreas Ericsson @ 2008-01-23 9:31 UTC (permalink / raw) To: Dmitry Potapov; +Cc: Linus Torvalds, Git Mailing List, Junio C Hamano Dmitry Potapov wrote: > On Wed, Jan 23, 2008 at 09:32:54AM +0100, Andreas Ericsson wrote: >> The FNV hash would be better (pasted below), but I doubt >> anyone will ever care, and there will be larger differences >> between architectures with this one than the lt_git hash (well, >> a function's gotta have a name). > > Actually, Bob Jenkins' lookup3 hash is twice faster in my tests > than FNV, and also it is much less likely to have any collision. > >From http://burtleburtle.net/bob/hash/doobs.html --- FNV Hash I need to fill this in. Search the web for FNV hash. It's faster than my hash on Intel (because Intel has fast multiplication), but slower on most other platforms. Preliminary tests suggested it has decent distributions. --- My tests ran on Intel. I also noticed I had a few hashes commented out when doing the test, one of them being Paul Hsie's. For some reason, Jenkin's and Hsie's didn't perform well for me last time I used the comparison thing (I did a more thorough job back then, with tests running for several minutes per hash and table-size, so I commented out the poor candidates). I still believe that for this very simple case, the lookup3.c case is not very practical, as the code is that much more complicated, which was my main point with posting the comparison. Iow, not "switch to this hash, because it's better", but rather "the hash is not as bad as you think and will probably work well for all practical purposes". -- Andreas Ericsson andreas.ericsson@op5.se OP5 AB www.op5.se Tel: +46 8-230225 Fax: +46 8-230231 ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: I'm a total push-over.. 2008-01-23 9:31 ` Andreas Ericsson @ 2008-01-23 14:01 ` Marko Kreen 2008-01-23 14:39 ` Andreas Ericsson 2008-01-23 17:10 ` Dmitry Potapov 1 sibling, 1 reply; 51+ messages in thread From: Marko Kreen @ 2008-01-23 14:01 UTC (permalink / raw) To: Andreas Ericsson Cc: Dmitry Potapov, Linus Torvalds, Git Mailing List, Junio C Hamano On 1/23/08, Andreas Ericsson <ae@op5.se> wrote: > Dmitry Potapov wrote: > > On Wed, Jan 23, 2008 at 09:32:54AM +0100, Andreas Ericsson wrote: > >> The FNV hash would be better (pasted below), but I doubt > >> anyone will ever care, and there will be larger differences > >> between architectures with this one than the lt_git hash (well, > >> a function's gotta have a name). > > > > Actually, Bob Jenkins' lookup3 hash is twice faster in my tests > > than FNV, and also it is much less likely to have any collision. > > > > >From http://burtleburtle.net/bob/hash/doobs.html > --- > FNV Hash > > I need to fill this in. Search the web for FNV hash. It's faster than my hash on Intel (because Intel has fast multiplication), but slower on most other platforms. Preliminary tests suggested it has decent distributions. I suspect that this paragraph was about comparison with lookup2 (not lookup3) because lookup3 beat easily all the "simple" hashes in my testing. Only competitor was Hsieh one which was like 50:50 faster or slower depending on alignment / compiler / cpu. > --- > > My tests ran on Intel. I also noticed I had a few hashes commented out when > doing the test, one of them being Paul Hsie's. For some reason, Jenkin's and > Hsie's didn't perform well for me last time I used the comparison thing (I > did a more thorough job back then, with tests running for several minutes > per hash and table-size, so I commented out the poor candidates). > > I still believe that for this very simple case, the lookup3.c case is not > very practical, as the code is that much more complicated, which was my > main point with posting the comparison. Iow, not "switch to this hash, > because it's better", but rather "the hash is not as bad as you think and > will probably work well for all practical purposes". If you don't mind few percent speed penalty compared to Jenkings own optimized version, you can use my simplified version: http://repo.or.cz/w/pgbouncer.git?a=blob;f=src/hash.c;h=5c9a73639ad098c296c0be562c34573189f3e083;hb=HEAD It works always with "native" endianess, unlike Jenkins fixed-endian hashlittle() / hashbig(). It may or may not matter if you plan to write values on disk. Speed-wise it may be 10-30% slower worst case (in my case sparc-classic with unaligned data), but on x86, lucky gcc version and maybe also memcpy() hack seen in system.h, it tends to be ~10% faster, especially as it does always 4byte read in main loop. -- marko ^ permalink raw reply [flat|nested] 51+ messages in thread
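A rough illustration of the tail handling behind the own_memcpy/memcpy_hack variants mentioned above: instead of lookup3's switch over the last 0..11 bytes, the tail is copied into a zero-padded scratch block and read back as three aligned, native-endian 32-bit words. The names here are invented; this is not the code from the pgbouncer link:

#include <stdint.h>
#include <string.h>

static void fetch_tail(const void *data, size_t tail_len, uint32_t out[3])
{
        unsigned char block[12] = { 0 };        /* zero padding for short tails */

        memcpy(block, data, tail_len);          /* tail_len is 0..11 */
        memcpy(&out[0], block + 0, 4);          /* aligned, native-endian loads */
        memcpy(&out[1], block + 4, 4);
        memcpy(&out[2], block + 8, 4);
}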
* Re: I'm a total push-over.. 2008-01-23 14:01 ` Marko Kreen @ 2008-01-23 14:39 ` Andreas Ericsson 2008-01-24 6:51 ` Luke Lu 2008-01-24 13:19 ` Marko Kreen 0 siblings, 2 replies; 51+ messages in thread From: Andreas Ericsson @ 2008-01-23 14:39 UTC (permalink / raw) To: Marko Kreen Cc: Dmitry Potapov, Linus Torvalds, Git Mailing List, Junio C Hamano Marko Kreen wrote: > On 1/23/08, Andreas Ericsson <ae@op5.se> wrote: >> Dmitry Potapov wrote: >>> On Wed, Jan 23, 2008 at 09:32:54AM +0100, Andreas Ericsson wrote: >>>> The FNV hash would be better (pasted below), but I doubt >>>> anyone will ever care, and there will be larger differences >>>> between architectures with this one than the lt_git hash (well, >>>> a function's gotta have a name). >>> Actually, Bob Jenkins' lookup3 hash is twice faster in my tests >>> than FNV, and also it is much less likely to have any collision. >>> >> >From http://burtleburtle.net/bob/hash/doobs.html >> --- >> FNV Hash >> >> I need to fill this in. Search the web for FNV hash. It's faster than my hash on Intel (because Intel has fast multiplication), but slower on most other platforms. Preliminary tests suggested it has decent distributions. > > I suspect that this paragraph was about comparison with lookup2 It might be. It's from the link Dmitry posted in his reply to my original message. (something/something/doobs.html). > (not lookup3) because lookup3 beat easily all the "simple" hashes By how much? FNV beat Linus' hash by 0.01 microseconds / insertion, and 0.1 microsecons / lookup. We're talking about a case here where there will never be more lookups than insertions (unless I'm much mistaken). > > If you don't mind few percent speed penalty compared to Jenkings > own optimized version, you can use my simplified version: > > http://repo.or.cz/w/pgbouncer.git?a=blob;f=src/hash.c;h=5c9a73639ad098c296c0be562c34573189f3e083;hb=HEAD > I don't, but I don't care that deeply either. On the one hand, it would be nifty to have an excellent hash-function in git. On the other hand, it would look stupid with something that's quite clearly over-kill. > It works always with "native" endianess, unlike Jenkins fixed-endian > hashlittle() / hashbig(). It may or may not matter if you plan > to write values on disk. > > Speed-wise it may be 10-30% slower worst case (in my case sparc-classic > with unaligned data), but on x86, lucky gcc version and maybe > also memcpy() hack seen in system.h, it tends to be ~10% faster, > especially as it does always 4byte read in main loop. > It would have to be a significant improvement in wall-clock time on a test-case of hashing 30k strings to warrant going from 6 to 80 lines of code, imo. I still believe the original dumb hash Linus wrote is "good enough". On a side-note, it was very interesting reading, and I shall have to add jenkins3_mkreen() to my test-suite (although the "keep copyright note" license thing bugs me a bit). -- Andreas Ericsson andreas.ericsson@op5.se OP5 AB www.op5.se Tel: +46 8-230225 Fax: +46 8-230231 ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: I'm a total push-over.. 2008-01-23 14:39 ` Andreas Ericsson @ 2008-01-24 6:51 ` Luke Lu 2008-01-24 10:24 ` Andreas Ericsson 2008-01-24 13:19 ` Marko Kreen 1 sibling, 1 reply; 51+ messages in thread From: Luke Lu @ 2008-01-24 6:51 UTC (permalink / raw) To: Andreas Ericsson Cc: Marko Kreen, Dmitry Potapov, Linus Torvalds, Git Mailing List, Junio C Hamano On Jan 23, 2008, at 6:39 AM, Andreas Ericsson wrote: > Marko Kreen wrote: >> On 1/23/08, Andreas Ericsson <ae@op5.se> wrote: >>> Dmitry Potapov wrote: >>>> On Wed, Jan 23, 2008 at 09:32:54AM +0100, Andreas Ericsson wrote: >>>>> The FNV hash would be better (pasted below), but I doubt >>>>> anyone will ever care, and there will be larger differences >>>>> between architectures with this one than the lt_git hash (well, >>>>> a function's gotta have a name). >>>> Actually, Bob Jenkins' lookup3 hash is twice faster in my tests >>>> than FNV, and also it is much less likely to have any collision. >>>> >>> >From http://burtleburtle.net/bob/hash/doobs.html >>> --- >>> FNV Hash >>> >>> I need to fill this in. Search the web for FNV hash. It's faster >>> than my hash on Intel (because Intel has fast multiplication), >>> but slower on most other platforms. Preliminary tests suggested >>> it has decent distributions. >> I suspect that this paragraph was about comparison with lookup2 > > > It might be. It's from the link Dmitry posted in his reply to my > original > message. (something/something/doobs.html). > >> (not lookup3) because lookup3 beat easily all the "simple" hashes > > By how much? FNV beat Linus' hash by 0.01 microseconds / insertion, > and 0.1 microsecons / lookup. We're talking about a case here where > there will never be more lookups than insertions (unless I'm much > mistaken). > >> If you don't mind few percent speed penalty compared to Jenkings >> own optimized version, you can use my simplified version: >> http://repo.or.cz/w/pgbouncer.git?a=blob;f=src/ >> hash.c;h=5c9a73639ad098c296c0be562c34573189f3e083;hb=HEAD > > I don't, but I don't care that deeply either. On the one hand, > it would be nifty to have an excellent hash-function in git. > On the other hand, it would look stupid with something that's > quite clearly over-kill. > >> It works always with "native" endianess, unlike Jenkins fixed-endian >> hashlittle() / hashbig(). It may or may not matter if you plan >> to write values on disk. >> Speed-wise it may be 10-30% slower worst case (in my case sparc- >> classic >> with unaligned data), but on x86, lucky gcc version and maybe >> also memcpy() hack seen in system.h, it tends to be ~10% faster, >> especially as it does always 4byte read in main loop. > > It would have to be a significant improvement in wall-clock time > on a test-case of hashing 30k strings to warrant going from 6 to 80 > lines of code, imo. I still believe the original dumb hash Linus > wrote is "good enough". > > On a side-note, it was very interesting reading, and I shall have > to add jenkins3_mkreen() to my test-suite (although the "keep > copyright note" license thing bugs me a bit). Would you, for completeness' sake, please add Tcl and STL hashes to your test suite? The numbers are quite interesting. Is your test suite available somewhere, so we can test with our own data and hardware as well. 
Both the Tcl and the STL string hashes (the STL one is from SGI, probably the HP days, and is still the current default with g++) are extremely simple (excluding the loop constructs): Tcl: h += (h<<3) + c; // essentially *9+c (but works better on non-late-Intel CPUs) STL: h = h * 5 + c; // worse than above for most of my data Thanks, __Luke ^ permalink raw reply [flat|nested] 51+ messages in thread
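Written out as complete functions (the names are invented, not copied from Tcl or libstdc++ source), the two update rules above look like this:

static unsigned int tcl_string_hash(const char *s, int len)
{
        unsigned int h = 0;

        while (len--) {
                unsigned char c = *s++;
                h += (h << 3) + c;              /* h = h*9 + c */
        }
        return h;
}

static unsigned int stl_string_hash(const char *s, int len)
{
        unsigned int h = 0;

        while (len--) {
                unsigned char c = *s++;
                h = h * 5 + c;
        }
        return h;
}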
* Re: I'm a total push-over.. 2008-01-24 6:51 ` Luke Lu @ 2008-01-24 10:24 ` Andreas Ericsson 0 siblings, 0 replies; 51+ messages in thread From: Andreas Ericsson @ 2008-01-24 10:24 UTC (permalink / raw) To: Luke Lu Cc: Marko Kreen, Dmitry Potapov, Linus Torvalds, Git Mailing List, Junio C Hamano Luke Lu wrote: > On Jan 23, 2008, at 6:39 AM, Andreas Ericsson wrote: >> Marko Kreen wrote: >>> On 1/23/08, Andreas Ericsson <ae@op5.se> wrote: >>>> Dmitry Potapov wrote: >>>>> On Wed, Jan 23, 2008 at 09:32:54AM +0100, Andreas Ericsson wrote: >>>>>> The FNV hash would be better (pasted below), but I doubt >>>>>> anyone will ever care, and there will be larger differences >>>>>> between architectures with this one than the lt_git hash (well, >>>>>> a function's gotta have a name). >>>>> Actually, Bob Jenkins' lookup3 hash is twice faster in my tests >>>>> than FNV, and also it is much less likely to have any collision. >>>>> >>>> >From http://burtleburtle.net/bob/hash/doobs.html >>>> --- >>>> FNV Hash >>>> >>>> I need to fill this in. Search the web for FNV hash. It's faster >>>> than my hash on Intel (because Intel has fast multiplication), but >>>> slower on most other platforms. Preliminary tests suggested it has >>>> decent distributions. >>> I suspect that this paragraph was about comparison with lookup2 >> >> >> It might be. It's from the link Dmitry posted in his reply to my original >> message. (something/something/doobs.html). >> >>> (not lookup3) because lookup3 beat easily all the "simple" hashes >> >> By how much? FNV beat Linus' hash by 0.01 microseconds / insertion, >> and 0.1 microsecons / lookup. We're talking about a case here where >> there will never be more lookups than insertions (unless I'm much >> mistaken). >> >>> If you don't mind few percent speed penalty compared to Jenkings >>> own optimized version, you can use my simplified version: >>> >>> http://repo.or.cz/w/pgbouncer.git?a=blob;f=src/hash.c;h=5c9a73639ad098c296c0be562c34573189f3e083;hb=HEAD >>> >> >> I don't, but I don't care that deeply either. On the one hand, >> it would be nifty to have an excellent hash-function in git. >> On the other hand, it would look stupid with something that's >> quite clearly over-kill. >> >>> It works always with "native" endianess, unlike Jenkins fixed-endian >>> hashlittle() / hashbig(). It may or may not matter if you plan >>> to write values on disk. >>> Speed-wise it may be 10-30% slower worst case (in my case sparc-classic >>> with unaligned data), but on x86, lucky gcc version and maybe >>> also memcpy() hack seen in system.h, it tends to be ~10% faster, >>> especially as it does always 4byte read in main loop. >> >> It would have to be a significant improvement in wall-clock time >> on a test-case of hashing 30k strings to warrant going from 6 to 80 >> lines of code, imo. I still believe the original dumb hash Linus >> wrote is "good enough". >> >> On a side-note, it was very interesting reading, and I shall have >> to add jenkins3_mkreen() to my test-suite (although the "keep >> copyright note" license thing bugs me a bit). > > Would you, for completeness' sake, please add Tcl and STL hashes to your > test suite? I could do that. Or I just publish the entire ugly thing and let someone else add them ;-) > The numbers are quite interesting. Is your test suite > available somewhere, so we can test with our own data and hardware as > well. Not yet, no. I usually munge it up quite a lot when I want to test hashes for a specific input, so it's not what anyone would call "pretty". 
> Both Tcl hash and STL (from SGI probably HP days, still the > current default with g++) string hashes are extremely simple (excluding > the loop constructs): > > Tcl: h += (h<<3) + c; // essentially *9+c (but work better on > non-late-intels) > STL: h = h * 5 + c; // worse than above for most of my data > They sure do look simple enough. As for loop constructs, I've tried to use the same looping mechanics for everything, so as to let the algorithm be the only difference. Otherwise it gets tricky to do comparisons. The exceptions are of course hashes relying on Duff's device or similar alignment trickery. -- Andreas Ericsson andreas.ericsson@op5.se OP5 AB www.op5.se Tel: +46 8-230225 Fax: +46 8-230231 ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: I'm a total push-over.. 2008-01-23 14:39 ` Andreas Ericsson 2008-01-24 6:51 ` Luke Lu @ 2008-01-24 13:19 ` Marko Kreen 2008-01-24 16:00 ` Andreas Ericsson 1 sibling, 1 reply; 51+ messages in thread From: Marko Kreen @ 2008-01-24 13:19 UTC (permalink / raw) To: Andreas Ericsson Cc: Dmitry Potapov, Linus Torvalds, Git Mailing List, Junio C Hamano On 1/23/08, Andreas Ericsson <ae@op5.se> wrote: > Marko Kreen wrote: > > (not lookup3) because lookup3 beat easily all the "simple" hashes > > By how much? FNV beat Linus' hash by 0.01 microseconds / insertion, > and 0.1 microsecons / lookup. We're talking about a case here where > there will never be more lookups than insertions (unless I'm much > mistaken). FNV is around 40% slower than lookup3 on my Intel Core CPU, on 4byte aligned input. See below for more detailed info. > > If you don't mind few percent speed penalty compared to Jenkings > > own optimized version, you can use my simplified version: > > > > http://repo.or.cz/w/pgbouncer.git?a=blob;f=src/hash.c;h=5c9a73639ad098c296c0be562c34573189f3e083;hb=HEAD > > I don't, but I don't care that deeply either. On the one hand, > it would be nifty to have an excellent hash-function in git. > On the other hand, it would look stupid with something that's > quite clearly over-kill. Jenkins hash is fast because it does not look at individual bytes. If you _do_ want to look at them for unrelated reasons, (case-insensitive, unicode-juggling), then it obiously loses the point. That is, if you want to process the string in one go. > > It works always with "native" endianess, unlike Jenkins fixed-endian > > hashlittle() / hashbig(). It may or may not matter if you plan > > to write values on disk. > > > > Speed-wise it may be 10-30% slower worst case (in my case sparc-classic > > with unaligned data), but on x86, lucky gcc version and maybe > > also memcpy() hack seen in system.h, it tends to be ~10% faster, > > especially as it does always 4byte read in main loop. > > It would have to be a significant improvement in wall-clock time > on a test-case of hashing 30k strings to warrant going from 6 to 80 > lines of code, imo. I still believe the original dumb hash Linus > wrote is "good enough". Well, ad-hoc dumb hashes may have horrible worst-cases that you cant see with light testing. Therefore I'd still suggest some better researched dumb hash (eg. FNV or OAT). > On a side-note, it was very interesting reading, and I shall have > to add jenkins3_mkreen() to my test-suite (although the "keep > copyright note" license thing bugs me a bit). Sorry. I just used template boilerplate. Considering all the hard work was done by other people, it not proper to put under my own license. I tagged the file as 'public domain' and pushed out. Btw, the reason I started cleaning lookup3 was that at first I was scared of the complexity of Jenkins code and decided to go with Hsieh hash. Then I found out that Hsieh code is under totally werdo license (http://www.azillionmonkeys.com/qed/weblicense.html) so I could not use it. ==================================================================== Here is my raw-speed test of different hashes. Input is 4-byte aligned which should be common case for malloc()-ed strings. This also is best case for original lookup3(), on unaligned input the memcpy variants beat it easily. Input string length varies randomly in range 0..64. 
own_memcpy - last 12-byte memcpy() calls out to libc memcpy_hack - last memcpy is inlined bytewise copy loop: while (len--) *dst++ = *src++; Note that is is raw-speed test, if you benchmark larger code the hash difference probably matters less. -------------------------------------------------------------------- Testing: seed=34 align=4 minlen=0 maxlen=64 trycnt=2 duration=10 lookup3 : try=0: ... 247.4880 MB/s lookup3 : try=1: ... 247.6154 MB/s own_memcpy: try=0: ... 223.5508 MB/s own_memcpy: try=1: ... 223.5830 MB/s memcpy_hack: try=0: ... 241.2241 MB/s memcpy_hack: try=1: ... 241.2492 MB/s lookup2 : try=0: ... 190.2697 MB/s lookup2 : try=1: ... 190.3283 MB/s fnv : try=0: ... 153.0318 MB/s fnv : try=1: ... 153.0178 MB/s hsieh : try=0: ... 234.0468 MB/s hsieh : try=1: ... 234.0426 MB/s oat : try=0: ... 154.7804 MB/s oat : try=1: ... 154.8226 MB/s elf : try=0: ... 125.5892 MB/s elf : try=1: ... 125.5734 MB/s Results compared to reference: lookup3 : 100.000 % own_memcpy : 90.311 % memcpy_hack : 97.449 % lookup2 : 76.872 % fnv : 61.815 % hsieh : 94.544 % oat : 62.533 % elf : 50.729 % -- marko ^ permalink raw reply [flat|nested] 51+ messages in thread
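For reference, the "oat" entry above is presumably Bob Jenkins' one-at-a-time hash; in its usual published form it is about as small as FNV (transcribed from memory, so treat it as an approximation rather than the exact code that was benchmarked):

static unsigned int oat_hash(const unsigned char *key, int len)
{
        unsigned int h = 0;
        int i;

        for (i = 0; i < len; i++) {
                h += key[i];
                h += h << 10;
                h ^= h >> 6;
        }
        /* finalization */
        h += h << 3;
        h ^= h >> 11;
        h += h << 15;
        return h;
}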
* Re: I'm a total push-over.. 2008-01-24 13:19 ` Marko Kreen @ 2008-01-24 16:00 ` Andreas Ericsson 2008-01-24 16:13 ` Marko Kreen ` (2 more replies) 0 siblings, 3 replies; 51+ messages in thread From: Andreas Ericsson @ 2008-01-24 16:00 UTC (permalink / raw) To: Marko Kreen Cc: Dmitry Potapov, Linus Torvalds, Git Mailing List, Junio C Hamano Marko Kreen wrote: > On 1/23/08, Andreas Ericsson <ae@op5.se> wrote: >> Marko Kreen wrote: >>> (not lookup3) because lookup3 beat easily all the "simple" hashes >> By how much? FNV beat Linus' hash by 0.01 microseconds / insertion, >> and 0.1 microsecons / lookup. We're talking about a case here where >> there will never be more lookups than insertions (unless I'm much >> mistaken). > > FNV is around 40% slower than lookup3 on my Intel Core CPU, on 4byte aligned > input. See below for more detailed info. > But the tests surely need to check for unaligned cases, as that's what we're likely to hash, no? >>> If you don't mind few percent speed penalty compared to Jenkings >>> own optimized version, you can use my simplified version: >>> >>> http://repo.or.cz/w/pgbouncer.git?a=blob;f=src/hash.c;h=5c9a73639ad098c296c0be562c34573189f3e083;hb=HEAD >> I don't, but I don't care that deeply either. On the one hand, >> it would be nifty to have an excellent hash-function in git. >> On the other hand, it would look stupid with something that's >> quite clearly over-kill. > > Jenkins hash is fast because it does not look at individual bytes. > If you _do_ want to look at them for unrelated reasons, (case-insensitive, > unicode-juggling), then it obiously loses the point. That is, if you > want to process the string in one go. > I believe the ability to add unicode-juggling was a major point with the patch, so perhaps Jenkins' isn't such a good option. I'm not familiar with data-mangling the way Linus (or Theo Tso is), so I hadn't even considered that aspect of unrolled hashes. >> It would have to be a significant improvement in wall-clock time >> on a test-case of hashing 30k strings to warrant going from 6 to 80 >> lines of code, imo. I still believe the original dumb hash Linus >> wrote is "good enough". > > Well, ad-hoc dumb hashes may have horrible worst-cases that you cant > see with light testing. Therefore I'd still suggest some better > researched dumb hash (eg. FNV or OAT). > True. FNV is used in both MySQL and PostgreSQL. I'd say it's safe to assume it's fairly well tested. >> On a side-note, it was very interesting reading, and I shall have >> to add jenkins3_mkreen() to my test-suite (although the "keep >> copyright note" license thing bugs me a bit). > > Sorry. I just used template boilerplate. Considering all the > hard work was done by other people, it not proper to put under > my own license. I tagged the file as 'public domain' and pushed out. > Thanks. I'll see if I can add it, although it'll probably have to wait until I have reason to dig into it at work again. I've added the hash to the code-base, but not yet incorporated it into the test-case. > Btw, the reason I started cleaning lookup3 was that at first I was > scared of the complexity of Jenkins code and decided to go with > Hsieh hash. Then I found out that Hsieh code is under totally > werdo license (http://www.azillionmonkeys.com/qed/weblicense.html) > so I could not use it. > True. I was in contact with him a while back since I wanted to use it in an opensource project, but the licensing issues made me go with another one instead. 
The patch got turned down anyways, so it was a non-issue in the end, but... Ah well. > > Here is my raw-speed test of different hashes. Input is 4-byte > aligned which should be common case for malloc()-ed strings. Unless arena allocated, like we do in git. I'm not surprised that this test favours Jenkin's and Hsie's. That's to be expected as those benefit far more than simpler hashing algorithms for long strings. The overhead when trying shorter strings (say, between 3 and 15 chars, and not necessarily 4-byte aligned) sometimes make them quite a lot slower though. > This also is best case for original lookup3(), on unaligned > input the memcpy variants beat it easily. Input string > length varies randomly in range 0..64. > Well, memcpy() isn't very interesting to compare against hashes, as they test vastly different parts of the hardware's parts' performance. memcpy() should also perform exactly the same no matter what the test-data, which isn't always true for hashes. What *would* be interesting would be something along the lines of "duff_cpy()": ie, an unrolled loop that aligns itself and copies each byte to the same address each time. The bytewise equivalence would ofcourse be magic_cpy(unsigned char *k, int len) { unsigned char magic; do { magic = *k++; } while (--len); } > own_memcpy - last 12-byte memcpy() calls out to libc > memcpy_hack - last memcpy is inlined bytewise copy loop: > > while (len--) *dst++ = *src++; > > Note that is is raw-speed test, if you benchmark larger code the > hash difference probably matters less. > > -------------------------------------------------------------------- > > Testing: seed=34 align=4 minlen=0 maxlen=64 trycnt=2 duration=10 > > lookup3 : try=0: ... 247.4880 MB/s > lookup3 : try=1: ... 247.6154 MB/s > own_memcpy: try=0: ... 223.5508 MB/s > own_memcpy: try=1: ... 223.5830 MB/s > memcpy_hack: try=0: ... 241.2241 MB/s > memcpy_hack: try=1: ... 241.2492 MB/s > lookup2 : try=0: ... 190.2697 MB/s > lookup2 : try=1: ... 190.3283 MB/s > fnv : try=0: ... 153.0318 MB/s > fnv : try=1: ... 153.0178 MB/s > hsieh : try=0: ... 234.0468 MB/s > hsieh : try=1: ... 234.0426 MB/s > oat : try=0: ... 154.7804 MB/s > oat : try=1: ... 154.8226 MB/s > elf : try=0: ... 125.5892 MB/s > elf : try=1: ... 125.5734 MB/s > > Results compared to reference: > > lookup3 : 100.000 % > own_memcpy : 90.311 % > memcpy_hack : 97.449 % > lookup2 : 76.872 % > fnv : 61.815 % > hsieh : 94.544 % > oat : 62.533 % > elf : 50.729 % > > Interesting output, but not very surprising. Do you have the code available somewhere? -- Andreas Ericsson andreas.ericsson@op5.se OP5 AB www.op5.se Tel: +46 8-230225 Fax: +46 8-230231 ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: I'm a total push-over.. 2008-01-24 16:00 ` Andreas Ericsson @ 2008-01-24 16:13 ` Marko Kreen 2008-01-24 16:28 ` Dmitry Potapov 2008-01-25 20:08 ` Marko Kreen 2 siblings, 0 replies; 51+ messages in thread From: Marko Kreen @ 2008-01-24 16:13 UTC (permalink / raw) To: Andreas Ericsson Cc: Dmitry Potapov, Linus Torvalds, Git Mailing List, Junio C Hamano On 1/24/08, Andreas Ericsson <ae@op5.se> wrote: > Marko Kreen wrote: > > On 1/23/08, Andreas Ericsson <ae@op5.se> wrote: > >> Marko Kreen wrote: > >>> (not lookup3) because lookup3 beat easily all the "simple" hashes > >> By how much? FNV beat Linus' hash by 0.01 microseconds / insertion, > >> and 0.1 microsecons / lookup. We're talking about a case here where > >> there will never be more lookups than insertions (unless I'm much > >> mistaken). > > > > FNV is around 40% slower than lookup3 on my Intel Core CPU, on 4byte aligned > > input. See below for more detailed info. > > > > But the tests surely need to check for unaligned cases, as that's what > we're likely to hash, no? > >> It would have to be a significant improvement in wall-clock time > >> on a test-case of hashing 30k strings to warrant going from 6 to 80 > >> lines of code, imo. I still believe the original dumb hash Linus > >> wrote is "good enough". > > > > Well, ad-hoc dumb hashes may have horrible worst-cases that you cant > > see with light testing. Therefore I'd still suggest some better > > researched dumb hash (eg. FNV or OAT). > > > > True. FNV is used in both MySQL and PostgreSQL. I'd say it's safe to > assume it's fairly well tested. PostgreSQL uses lookup2... > > Here is my raw-speed test of different hashes. Input is 4-byte > > aligned which should be common case for malloc()-ed strings. > > Unless arena allocated, like we do in git. > > I'm not surprised that this test favours Jenkin's and Hsie's. > That's to be expected as those benefit far more than simpler > hashing algorithms for long strings. The overhead when trying > shorter strings (say, between 3 and 15 chars, and not necessarily > 4-byte aligned) sometimes make them quite a lot slower though. Ok, here is 0..15 chars, random alignment: Testing: seed=34 align=0 minlen=0 maxlen=15 trycnt=2 duration=10 lookup3 : try=0: ... 69.8092 MB/s lookup3 : try=1: ... 69.8146 MB/s own_memcpy: try=0: ... 66.7808 MB/s own_memcpy: try=1: ... 66.7814 MB/s memcpy_hack: try=0: ... 74.0635 MB/s memcpy_hack: try=1: ... 74.0518 MB/s lookup2 : try=0: ... 68.6582 MB/s lookup2 : try=1: ... 68.6634 MB/s fnv : try=0: ... 74.5098 MB/s fnv : try=1: ... 74.5283 MB/s hsieh : try=0: ... 71.6708 MB/s hsieh : try=1: ... 71.6814 MB/s oat : try=0: ... 74.7828 MB/s oat : try=1: ... 74.7716 MB/s elf : try=0: ... 65.2077 MB/s elf : try=1: ... 65.2128 MB/s Results compared to reference: lookup3 : 100.000 % own_memcpy : 95.659 % memcpy_hack : 106.082 % lookup2 : 98.351 % fnv : 106.743 % hsieh : 102.670 % oat : 107.112 % elf : 93.409 % > > This also is best case for original lookup3(), on unaligned > > input the memcpy variants beat it easily. Input string > > length varies randomly in range 0..64. > > > > Well, memcpy() isn't very interesting to compare against > hashes, as they test vastly different parts of the hardware's > parts' performance. memcpy() should also perform exactly the > same no matter what the test-data, which isn't always true for > hashes. Sorry, I meant my "simple-memcpy-based-lookup3". 
> What *would* be interesting would be something along the lines > of "duff_cpy()": ie, an unrolled loop that aligns itself and > copies each byte to the same address each time. How the hash fetches data from memory is _very_ relevant. > Interesting output, but not very surprising. Do you have the code > available somewhere? I can put it out. -- marko ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: I'm a total push-over.. 2008-01-24 16:00 ` Andreas Ericsson 2008-01-24 16:13 ` Marko Kreen @ 2008-01-24 16:28 ` Dmitry Potapov 2008-01-24 17:15 ` Linus Torvalds 2008-01-25 20:08 ` Marko Kreen 2 siblings, 1 reply; 51+ messages in thread From: Dmitry Potapov @ 2008-01-24 16:28 UTC (permalink / raw) To: Andreas Ericsson Cc: Marko Kreen, Linus Torvalds, Git Mailing List, Junio C Hamano On Jan 24, 2008 7:00 PM, Andreas Ericsson <ae@op5.se> wrote: > Marko Kreen wrote: > > > > Jenkins hash is fast because it does not look at individual bytes. > > If you _do_ want to look at them for unrelated reasons, (case-insensitive, > > unicode-juggling), then it obiously loses the point. That is, if you > > want to process the string in one go. > > > > I believe the ability to add unicode-juggling was a major point > with the patch, so perhaps Jenkins' isn't such a good option. I don't think you can do any meaningful unicode-juggling without converting symbols to UCS-4, and after that it makes much more sense to operate on uint32 values than on bytes. So, Jenkins' hash is still relevant, precisely because it does not operate on single bytes but on uint32 values. Dmitry ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: I'm a total push-over.. 2008-01-24 16:28 ` Dmitry Potapov @ 2008-01-24 17:15 ` Linus Torvalds 2008-01-24 18:45 ` Dmitry Potapov 2008-01-25 20:52 ` Marko Kreen 0 siblings, 2 replies; 51+ messages in thread From: Linus Torvalds @ 2008-01-24 17:15 UTC (permalink / raw) To: Dmitry Potapov Cc: Andreas Ericsson, Marko Kreen, Git Mailing List, Junio C Hamano On Thu, 24 Jan 2008, Dmitry Potapov wrote: > > I don't think you can any meaningful unicode-juggling without converting > symbols to UCS-4, and after that it makes much more sense to operate > with uint32 than bytes. So, Jenkins' hash is still relevant, just because > it does not operate on single bytes, but using uint32. No, no, no, NO! Egads! Why do people constantly do these totally idiotic things for Unicode? You can do a perfectly fine 8-bytes-at-a-time hash for almost 100% of all source code projects in UTF-8, without *ever* doing any format changes at all. Admittedly, it's a lot easier if the hash is a pure in-memory one (ie we don't care about byte-order or size of integers or anything like that), but that's the common case for most hashes that aren't used for BTree lookup on disk or something like that. Here, let me show you: unsigned int name_hash(const char *name, int size) { hash = HASH_INIT; do { unsigned char c; if (size >= sizeof(long)) { unsigned long val = get_unaligned_long(name); if (!(val & 0x8080808080808080)) { /* Make it equivalent in case */ val &= ~0x2020202020202020; hash = hash_long(hash, val); name += sizeof(long); size -= sizeof(long); continue; } } c = *name; if (!(c & 0x80)) { hash = hash_long(hash, c & ~0x20); name++; size--; continue; } /* This is the unusual and slowish case */ hash = hash_utf8_char(hash, c, &name, &size); } while (size); return hassh; } and then the only point you ever do that actual UTF8->codepoint conversion is for that "high bit set" case. A few things to note on the above: - the hash obviously has "odd" characteristics. We're not necessarily hashing characters at a time at all, and the alignment of characters with high bits *within*the*string* will make a difference to how we hash them. But it's also important that the "get_unaligned_long()" means that the alignment of the string itself doesn't matter, so its' purely a "chunking within the string" thing, and the alignment of the string itself won't affect the hash value - I'm not writing out hash_utf8_char(), because it's certainly not totally trivial, but it's not *really* complex either. The nontriviality isn't so much the decoding into a codepoint (which is pretty simple), but the fact that when you have the codepoint you should then decompose it and turn it into lower case, which is generally two table lookups. Then, you just do for_each_decomposed_uppercased_codepoint(c) hash = hash_long(hash, c); and one thing to note is that for the hashing, the decomposition and uppercasing doesn't even have to be "exact" (the same way I didn't do an "exact" upper-casing for US-ASCII, just a "good enough" one!) Similarly, when you actually do a unicode *compare* function, you should never *ever* actually convert to any unicode codepoints or do any expensive decomposition AT ALL by default! What you do is to compare things byte-by-byte, and only convert to unicode/decompse if there are any differences, and only for those parts of the sequence that differ! 
So if you have two UTF-8 strings (even if they have "complex" characters, ie with the high bit set), the *common* case is that you'll match them byte for byte, and they'll match without any Unicode conversion needed at all! This is common because: - normally, even if you don't ever normalize, people tend to input things in a *similar* manner (again, OS X is the odd man out), so even non-normalized strings are often non-normalized the same way! - we're only going to compare things that have hashed to the same thing anyway, so the common case is that it's the same string, and most likely had the same source. And if it's a collision, it's often totally different. And even if it's different only in case, the *common* case is going to be (for source code trees, at least) that the different point is a US-ASCII letter, and the case-comparison will again be done without any Unicode knowledge at all! This is why normalization of strings before-hand is generally so utterly stupid. It doesn't buy you anything. It complicates things a lot (you don't want to normalize in-place, so you have memory management issues), and it actually SLOWS THINGS DOWN. It's much better to do UTF-8 comparisons and hashing char-by-char. At least if you know that the common case is not going to be the complex part of the character set (which is undoubtedly true for source code filenames). Normalizing things ahead of time *only* makes sense if: - you expect complex characters to be a big part of your load and - you're going to do literally *thousands* of comparisons against the *same* strings over and over (so that the cost of normalization is literally up-front) For example, for a filesystem, it's true that you're going to compare against the *target* (on-disk) multiple times, but that doesn't actually mean that normalizing it makes any sense - because the data you're going to compare against comes from user space and isn't guaranteed to be normalized, so you still cannot do a simple memcmp() without the expense of normalizing that. And since you're going to hash the filenames anyway, you will not have "thousands of comparisons" per source lookup, you'll generally only have a couple, so now your normalization actually cost you *more* than doing the above on-the-fly comparison! See? Basically, it's almost always a stupid thing to actually normalize a whole string. You do those things character-by-character, and only lazily when you actually need to! Linus ^ permalink raw reply [flat|nested] 51+ messages in thread
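A sketch of the lazy comparison described above: walk both names byte by byte and only decode and fold a character at the first position where the raw bytes differ. fold_utf8_char() is a stand-in for the decode+decompose+uppercase step (real decomposition can yield several code points, which this sketch ignores), and nothing here is actual git code:

#include <ctype.h>

static unsigned int fold_utf8_char(const char **p, int *len)
{
        /* Placeholder: just consume one byte so the sketch compiles.
         * A real version would decode UTF-8, decompose and uppercase. */
        unsigned char c = (unsigned char)**p;

        (*p)++;
        (*len)--;
        return c;
}

static int names_equivalent(const char *a, int alen, const char *b, int blen)
{
        while (alen && blen) {
                unsigned char ca = (unsigned char)*a, cb = (unsigned char)*b;

                if (ca == cb) {                         /* common case: identical bytes */
                        a++, b++, alen--, blen--;
                        continue;
                }
                if (!(ca & 0x80) && !(cb & 0x80)) {     /* ASCII-only difference */
                        if (tolower(ca) != tolower(cb))
                                return 0;
                        a++, b++, alen--, blen--;
                        continue;
                }
                /* Slow path: fold just this character on each side. */
                if (fold_utf8_char(&a, &alen) != fold_utf8_char(&b, &blen))
                        return 0;
        }
        return !alen && !blen;
}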
* Re: I'm a total push-over.. 2008-01-24 17:15 ` Linus Torvalds @ 2008-01-24 18:45 ` Dmitry Potapov 2008-01-24 19:08 ` Linus Torvalds 2008-01-25 20:52 ` Marko Kreen 1 sibling, 1 reply; 51+ messages in thread From: Dmitry Potapov @ 2008-01-24 18:45 UTC (permalink / raw) To: Linus Torvalds Cc: Andreas Ericsson, Marko Kreen, Git Mailing List, Junio C Hamano On Thu, Jan 24, 2008 at 09:15:43AM -0800, Linus Torvalds wrote: > > > You can do a perfectly fine 8-bytes-at-a-time hash for almost 100% of all I suppose 8 bytes for 64-bit platforms and 4 bytes for 32-bits. > > unsigned int name_hash(const char *name, int size) > { > hash = HASH_INIT; > do { > unsigned char c; > if (size >= sizeof(long)) { > unsigned long val = get_unaligned_long(name); > if (!(val & 0x8080808080808080)) { > /* Make it equivalent in case */ > val &= ~0x2020202020202020; > hash = hash_long(hash, val); > name += sizeof(long); > size -= sizeof(long); > continue; > } > } > > c = *name; > if (!(c & 0x80)) { > hash = hash_long(hash, c & ~0x20); > name++; > size--; > continue; > } It is better to use 'while' instead of 'if' here, i.e.: while (!((c = *name) & 0x80)) { hash = hash_long(hash, c & ~0x20); name++; if (!--size) return hash; } > > /* This is the unusual and slowish case */ > hash = hash_utf8_char(hash, c, &name, &size); > } while (size); > return hassh; > } Dmitry ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: I'm a total push-over.. 2008-01-24 18:45 ` Dmitry Potapov @ 2008-01-24 19:08 ` Linus Torvalds 0 siblings, 0 replies; 51+ messages in thread From: Linus Torvalds @ 2008-01-24 19:08 UTC (permalink / raw) To: Dmitry Potapov Cc: Andreas Ericsson, Marko Kreen, Git Mailing List, Junio C Hamano On Thu, 24 Jan 2008, Dmitry Potapov wrote: > > It is better to use 'while' instead of 'if' here, i.e.: Yes, that looks like a good further micro-optimization. Linus ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: I'm a total push-over.. 2008-01-24 17:15 ` Linus Torvalds 2008-01-24 18:45 ` Dmitry Potapov @ 2008-01-25 20:52 ` Marko Kreen 2008-01-25 22:16 ` Linus Torvalds 1 sibling, 1 reply; 51+ messages in thread From: Marko Kreen @ 2008-01-25 20:52 UTC (permalink / raw) To: Linus Torvalds Cc: Dmitry Potapov, Andreas Ericsson, Git Mailing List, Junio C Hamano On 1/24/08, Linus Torvalds <torvalds@linux-foundation.org> wrote: > You can do a perfectly fine 8-bytes-at-a-time hash for almost 100% of all > source code projects in UTF-8, without *ever* doing any format changes at > all. Admittedly, it's a lot easier if the hash is a pure in-memory one (ie > we don't care about byte-order or size of integers or anything like that), > but that's the common case for most hashes that aren't used for BTree > lookup on disk or something like that. > > Here, let me show you: > > unsigned int name_hash(const char *name, int size) Well, although this is very clever approach, I suggest against it. You'll end up with complex code that gives out substandard results. I think its better to have separate case-folding function (or several), that copies string to temp buffer and then run proper optimized hash function on that buffer. That way you can use already tested building blocks and can optimize both sides separately. Eg. the folding-only function can aswell be optimized to load 4 or 8-byte at-a-time. This also isolates hashing from exact details how folding happens to access the input string which seem to be the weak point in your approach. (In both collision and complexity sense.) Such temp buffer happens to fits my lookup3_memcpy also better (heh). Its weak point is that on platforms that do not allow unaligned access, it degenerates to byte-by-byte loading. But if know you always have aligned buffer, you can notify gcc to do 4-byte fetch there too. It should be as simple as tagging data pointer as uint32_t *. Anyway, now you dont need to worry about folding when picking hash. > Basically, it's almost always a stupid thing to actually normalize a whole > string. You do those things character-by-character, and only lazily when > you actually need to! If your input strings are over kilobyte on average then I'd agree with you, but if you process 20-30 bytes on average, is the additional complexity worth it? -- marko ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: I'm a total push-over.. 2008-01-25 20:52 ` Marko Kreen @ 2008-01-25 22:16 ` Linus Torvalds 2008-01-25 22:35 ` Linus Torvalds ` (2 more replies) 0 siblings, 3 replies; 51+ messages in thread From: Linus Torvalds @ 2008-01-25 22:16 UTC (permalink / raw) To: Marko Kreen Cc: Dmitry Potapov, Andreas Ericsson, Git Mailing List, Junio C Hamano On Fri, 25 Jan 2008, Marko Kreen wrote: > > Well, although this is very clever approach, I suggest against it. > You'll end up with complex code that gives out substandard results. Actually, *your* operation is the one that gives substandard results. > I think its better to have separate case-folding function (or several), > that copies string to temp buffer and then run proper optimized hash > function on that buffer. I'm sorry, but you just cannot do that efficiently and portably. I can write a hash function that reliably does 8 bytes at a time for the common case on a 64-bit architecture, exactly because it's easy to do "test high bits in parallel" with a simple bitwise 'and', and we can do the same with "approximate lower-to-uppercase 8 bytes at a time" for a hash by just clearing bit 5. In contrast, trying to do the same thing in half-way portable C, but being limited to having to get the case-folding *exactly* right (which you need for the comparison function) is much much harder. It's basically impossible in portable C (it's doable with architecture-specific features, ie vector extensions that have per-byte compares etc). And hashing is performance-critical, much more so than the compares (ie you're likely to have to hash tens of thousands of files, while you will only compare a couple). So it really is worth optimizing for. And the thing is, "performance" isn't a secondary feature. It's also not something you can add later by optimizing. It's also a mindset issue. Quite frankly, people who do this by "convert to some folded/normalized form, then do the operation" will generally make much more fundamental mistakes. Once you get into the mindset of "let's pass a corrupted strign around", you are in trouble. You start thinking that the corrupted string isn't really "corrupt", it's in an "optimized format". And it's all downhill from there. Don't do it. Linus ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: I'm a total push-over.. 2008-01-25 22:16 ` Linus Torvalds @ 2008-01-25 22:35 ` Linus Torvalds 2008-01-26 12:16 ` Marko Kreen 2008-01-26 12:37 ` Marko Kreen 2 siblings, 0 replies; 51+ messages in thread From: Linus Torvalds @ 2008-01-25 22:35 UTC (permalink / raw) To: Marko Kreen Cc: Dmitry Potapov, Andreas Ericsson, Git Mailing List, Junio C Hamano On Fri, 25 Jan 2008, Linus Torvalds wrote: > > I can write a hash function that reliably does 8 bytes at a time for the > common case on a 64-bit architecture, exactly because it's easy to do > "test high bits in parallel" with a simple bitwise 'and', and we can do > the same with "approximate lower-to-uppercase 8 bytes at a time" for a > hash by just clearing bit 5. Side note: you *can* get better approximations fairly cheaply if you care. If you want to distinguish the characters 0-31 from the characters 31-63 in your hash (pointless for filenames, but it can be worthwhile for some other string cases), you can decide to clear bit#5 only if bit#6 in that byte was also set, with just a few bitwise operations. Eg, imagine that you have "unsigned long x" containing eight bytes of ascii data (ie you already did the test by 0x8080808080808080), you can do things like unsigned long bit6 = x & 0x4040404040404040; x &= ~(bit6 >> 1); which will only clear bit5 if bit6 in the same byte was set.. So you can do tricks like that, and it will still be plenty fast. And nobody will ever care that while it collides 'A' with 'a' (by design), it also causes '{' and '[' to be considered "case collisions". [ Amusing side note: '{' and '[' *are* case collisions in legacy 7-bit "Finnish ASCII". The editor I use still "upper-cases" '{' to '['. I'm not kidding, and yes, it really does it on purpose! It used to be that before everybody turned to Latin1, the {|} characters were re-used in Finland (and Sweden, for that matter) for the extra characters needed in Finnish. Because obviously nobody ever needed them for any real work. I (and probably every Finnish C UNIX programmer) used to be very good at reading C source code even when it was full of odd finnish characters with dots on top, instead of curly braces! ] And yes, from a performance standpoint, things liek this probably do realy matter. For the kernel tree, the average pathname length is ~28 characters. If you can do it with three iterations that do the first 24 characters eight characters at a time, and then four iterations over the four last ones, rather than 28 iterations with byte->longword and multiplications in each, I bet it's quite visible. Of course, it's going to be visible only if everything else is fast too, but git has been pretty good at that in general. Linus ^ permalink raw reply [flat|nested] 51+ messages in thread
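The bit trick above, wrapped into a self-contained test so the folding behaviour is easy to check (illustration only, not git code): bit 5 of each byte is cleared only where bit 6 is set, so 'a'..'z' fold onto 'A'..'Z' (and '{' onto '[') while bytes below 0x40 are left untouched:

#include <stdint.h>
#include <stdio.h>

static uint64_t fold_chunk(uint64_t x)
{
        uint64_t bit6 = x & 0x4040404040404040ULL;

        return x & ~(bit6 >> 1);        /* clear bit 5 only where bit 6 is set */
}

int main(void)
{
        printf("%llx\n", (unsigned long long)fold_chunk(0x61)); /* 41: 'a' -> 'A' */
        printf("%llx\n", (unsigned long long)fold_chunk(0x31)); /* 31: '1' unchanged */
        return 0;
}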
* Re: I'm a total push-over.. 2008-01-25 22:16 ` Linus Torvalds 2008-01-25 22:35 ` Linus Torvalds @ 2008-01-26 12:16 ` Marko Kreen 2008-01-27 6:51 ` Linus Torvalds 2008-01-26 12:37 ` Marko Kreen 2 siblings, 1 reply; 51+ messages in thread From: Marko Kreen @ 2008-01-26 12:16 UTC (permalink / raw) To: Linus Torvalds Cc: Dmitry Potapov, Andreas Ericsson, Git Mailing List, Junio C Hamano On 1/26/08, Linus Torvalds <torvalds@linux-foundation.org> wrote: > On Fri, 25 Jan 2008, Marko Kreen wrote: > > Well, although this is very clever approach, I suggest against it. > > You'll end up with complex code that gives out substandard results. > > Actually, *your* operation is the one that gives substandard results. > > > I think its better to have separate case-folding function (or several), > > that copies string to temp buffer and then run proper optimized hash > > function on that buffer. > > I'm sorry, but you just cannot do that efficiently and portably. > > I can write a hash function that reliably does 8 bytes at a time for the > common case on a 64-bit architecture, exactly because it's easy to do > "test high bits in parallel" with a simple bitwise 'and', and we can do > the same with "approximate lower-to-uppercase 8 bytes at a time" for a > hash by just clearing bit 5. > > In contrast, trying to do the same thing in half-way portable C, but being > limited to having to get the case-folding *exactly* right (which you need > for the comparison function) is much much harder. It's basically > impossible in portable C (it's doable with architecture-specific features, > ie vector extensions that have per-byte compares etc). Here you misunderstood me, I was proposing following: int hash_folded(const char *str, int len) { char buf[512]; do_folding(buf, str, len); return do_hash(buf, len); } That is - the folded string should stay internal to hash function. Only difference from combined foling+hashing would be that you can code each part separately. > And hashing is performance-critical, much more so than the compares (ie > you're likely to have to hash tens of thousands of files, while you will > only compare a couple). So it really is worth optimizing for. > > And the thing is, "performance" isn't a secondary feature. It's also not > something you can add later by optimizing. > > It's also a mindset issue. Quite frankly, people who do this by "convert > to some folded/normalized form, then do the operation" will generally make > much more fundamental mistakes. Once you get into the mindset of "let's > pass a corrupted strign around", you are in trouble. You start thinking > that the corrupted string isn't really "corrupt", it's in an "optimized > format". > > And it's all downhill from there. Don't do it. Againg, you seem to keep HFS+ behaviour in mind, but that was not what I did suggest. Probably my mistake, sorry. -- marko ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: I'm a total push-over..
  2008-01-26 12:16 ` Marko Kreen
@ 2008-01-27  6:51   ` Linus Torvalds
  2008-01-27  8:21     ` Dmitry Potapov
  2008-01-27  9:45     ` Marko Kreen
  0 siblings, 2 replies; 51+ messages in thread
From: Linus Torvalds @ 2008-01-27 6:51 UTC (permalink / raw)
  To: Marko Kreen
  Cc: Dmitry Potapov, Andreas Ericsson, Git Mailing List, Junio C Hamano

On Sat, 26 Jan 2008, Marko Kreen wrote:
>
> Here you misunderstood me, I was proposing the following:
>
> 	int hash_folded(const char *str, int len)
> 	{
> 		char buf[512];
> 		do_folding(buf, str, len);
> 		return do_hash(buf, len);
> 	}
>
> That is - the folded string should stay internal to the hash function.

If it's internal, it's much better, but you still missed the performance angle.

The fact is, hashing can take shortcuts that folding cannot do!

Case folding, by definition, has to be "exact" (since the whole point is that you're going to use the same folding function to do the compare, so if you play games with folding, the compares will be wrong).

But hashing doesn't have to be exact. It's ok to hash '{' and '[' as if they were different cases of the same character, if that gives you a faster hash function. Especially as those characters are rather rare in filenames.

So if you do hashing as a function of its own, you can simply do a better job at it.

I do agree that the functions that create a folded set of characters from a _complex_ UTF-8 character should be shared between folding and hashing, since that code is too complex and there are no simple shortcuts for doing a faster hash that still retains all the properties we want.

			Linus

^ permalink raw reply	[flat|nested] 51+ messages in thread
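The '{' / '[' shortcut mentioned above is easy to check: '{' is 0x7b and '[' is 0x5b, so a fold that simply clears bit 5 treats them as a case pair. A tiny standalone check (illustration only):

#include <assert.h>

int main(void)
{
	/* fine for a hash bucket: the extra collision only costs a compare */
	assert(('{' & ~0x20) == ('[' & ~0x20));
	assert(('a' & ~0x20) == ('A' & ~0x20));

	/* but an exact fold-for-compare must not equate them */
	assert('{' != '[');
	return 0;
}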
* Re: I'm a total push-over..
  2008-01-27  6:51 ` Linus Torvalds
@ 2008-01-27  8:21   ` Dmitry Potapov
  2008-01-27 14:07     ` Johannes Schindelin
  2008-01-27  9:45   ` Marko Kreen
  1 sibling, 1 reply; 51+ messages in thread
From: Dmitry Potapov @ 2008-01-27 8:21 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Marko Kreen, Andreas Ericsson, Git Mailing List, Junio C Hamano

On Sat, Jan 26, 2008 at 10:51:18PM -0800, Linus Torvalds wrote:
>
> On Sat, 26 Jan 2008, Marko Kreen wrote:
> >
> > Here you misunderstood me, I was proposing the following:
> >
> > 	int hash_folded(const char *str, int len)
> > 	{
> > 		char buf[512];
> > 		do_folding(buf, str, len);
> > 		return do_hash(buf, len);
> > 	}
> >
> > That is - the folded string should stay internal to the hash function.
>
> If it's internal, it's much better, but you still missed the performance
> angle.
>
> The fact is, hashing can take shortcuts that folding cannot do!
>
> Case folding, by definition, has to be "exact" (since the whole point is
> that you're going to use the same folding function to do the compare, so
> if you play games with folding, the compares will be wrong).

Let's rename do_folding to something else, because it is not a real folding, but a preparation step for the hash calculation. Keeping these steps separate simplifies the code, and allows further optimization; for instance, you do not need this do_folding step on a case-sensitive filesystem. Though it is certainly possible to mix both steps together, it bloats the code and makes it less readable.

Of course, the idea to avoid a temporary buffer and do everything at once is very appealing, so I gave it a try -- and here is a 32-bit version of name_hash(), but I am not very happy with the result:

#define rot(x,k) (((x)<<(k)) | ((x)>>(32-(k))))

#define mix(a,b,c) \
{ \
	a -= c;  a ^= rot(c, 4);  c += b; \
	b -= a;  b ^= rot(a, 6);  a += c; \
	c -= b;  c ^= rot(b, 8);  b += a; \
	a -= c;  a ^= rot(c,16);  c += b; \
	b -= a;  b ^= rot(a,19);  a += c; \
	c -= b;  c ^= rot(b, 4);  b += a; \
}

#define final(a,b,c) \
{ \
	c ^= b; c -= rot(b,14); \
	a ^= c; a -= rot(c,11); \
	b ^= a; b -= rot(a,25); \
	c ^= b; c -= rot(b,16); \
	a ^= c; a -= rot(c,4); \
	b ^= a; b -= rot(a,14); \
	c ^= b; c -= rot(b,24); \
}

#define hash_value(x) \
	hs[hp] += (x); \
	if (++hp == 3) { \
		mix(hs[0], hs[1], hs[2]); \
		hp = 0; \
	}

unsigned int name_hash(const char *name, unsigned size)
{
	unsigned hp = 0;
	unsigned hs[3];
	hs[0] = hs[1] = hs[2] = 0xdeadbeef + size;

	do {
		unsigned char c;
		if (size >= sizeof(unsigned)) {
			unsigned val = get_unaligned_uint(name);
			if (!(val & 0x80808080)) {
				val &= ~0x20202020;
				hash_value(val);
				name += sizeof(val);
				size -= sizeof(val);
				continue;
			}
		}

		while (!((c = *name) & 0x80)) {
			hash_value(c & ~0x20);
			name++;
			if (!--size)
				goto done;
		}

		do {
			// TODO: add denormalization for Mac
			unsigned val = towupper(utf8_to_wchar(&name, &size));
			hash_value(val);
		} while (size && (*name & 0x80));

	} while (size);
done:
	if (hp)
		final(hs[0], hs[1], hs[2]);
	return hs[2];
}

Dmitry

^ permalink raw reply	[flat|nested] 51+ messages in thread
* Re: I'm a total push-over..
  2008-01-27  8:21 ` Dmitry Potapov
@ 2008-01-27 14:07   ` Johannes Schindelin
  2008-01-27 14:48     ` Dmitry Potapov
  0 siblings, 1 reply; 51+ messages in thread
From: Johannes Schindelin @ 2008-01-27 14:07 UTC (permalink / raw)
  To: Dmitry Potapov
  Cc: Linus Torvalds, Marko Kreen, Andreas Ericsson, Git Mailing List, Junio C Hamano

Hi,

On Sun, 27 Jan 2008, Dmitry Potapov wrote:

> #define rot(x,k) (((x)<<(k)) | ((x)>>(32-(k))))
>
> #define mix(a,b,c) \
> { \
> 	a -= c;  a ^= rot(c, 4);  c += b; \
> 	b -= a;  b ^= rot(a, 6);  a += c; \
> 	c -= b;  c ^= rot(b, 8);  b += a; \
> 	a -= c;  a ^= rot(c,16);  c += b; \
> 	b -= a;  b ^= rot(a,19);  a += c; \
> 	c -= b;  c ^= rot(b, 4);  b += a; \
> }
>
> #define final(a,b,c) \
> { \
> 	c ^= b; c -= rot(b,14); \
> 	a ^= c; a -= rot(c,11); \
> 	b ^= a; b -= rot(a,25); \
> 	c ^= b; c -= rot(b,16); \
> 	a ^= c; a -= rot(c,4); \
> 	b ^= a; b -= rot(a,14); \
> 	c ^= b; c -= rot(b,24); \
> }
>
> #define hash_value(x) \
> 	hs[hp] += (x); \
> 	if (++hp == 3) { \
> 		mix(hs[0], hs[1], hs[2]); \
> 		hp = 0; \
> 	}
>
> unsigned int name_hash(const char *name, unsigned size)
> {
> 	unsigned hp = 0;
> 	unsigned hs[3];
> 	hs[0] = hs[1] = hs[2] = 0xdeadbeef + size;
>
> 	do {
> 		unsigned char c;
> 		if (size >= sizeof(unsigned)) {
> 			unsigned val = get_unaligned_uint(name);
> 			if (!(val & 0x80808080)) {
> 				val &= ~0x20202020;
> 				hash_value(val);
> 				name += sizeof(val);
> 				size -= sizeof(val);
> 				continue;
> 			}
> 		}
>
> 		while (!((c = *name) & 0x80)) {
> 			hash_value(c & ~0x20);
> 			name++;
> 			if (!--size)
> 				goto done;
> 		}
>
> 		do {
> 			// TODO: add denormalization for Mac
> 			unsigned val = towupper(utf8_to_wchar(&name, &size));
> 			hash_value(val);
> 		} while (size && (*name & 0x80));
>
> 	} while (size);
> done:
> 	if (hp)
> 		final(hs[0], hs[1], hs[2]);
> 	return hs[2];
> }

<irony>Oh yes, let's take this one, it is so much shorter, cleaner and overall more elegant than Linus' code.</irony>

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 51+ messages in thread
* Re: I'm a total push-over..
  2008-01-27 14:07 ` Johannes Schindelin
@ 2008-01-27 14:48   ` Dmitry Potapov
  0 siblings, 0 replies; 51+ messages in thread
From: Dmitry Potapov @ 2008-01-27 14:48 UTC (permalink / raw)
  To: Johannes Schindelin
  Cc: Linus Torvalds, Marko Kreen, Andreas Ericsson, Git Mailing List, Junio C Hamano

Hi,

On Sun, Jan 27, 2008 at 02:07:25PM +0000, Johannes Schindelin wrote:
>
> <irony>Oh yes, let's take this one, it is so much shorter, cleaner and
> overall more elegant than Linus' code.</irony>

I am not sure what you meant by that, or which code exactly you meant by "Linus' code"... Anyway, my point was that mixing both steps together does not look very nice, and the code was intended to demonstrate why.

Dmitry

^ permalink raw reply	[flat|nested] 51+ messages in thread
* Re: I'm a total push-over..
  2008-01-27  6:51 ` Linus Torvalds
  2008-01-27  8:21   ` Dmitry Potapov
@ 2008-01-27  9:45   ` Marko Kreen
  2008-01-27 15:06     ` Dmitry Potapov
  1 sibling, 1 reply; 51+ messages in thread
From: Marko Kreen @ 2008-01-27 9:45 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Dmitry Potapov, Andreas Ericsson, Git Mailing List, Junio C Hamano

On 1/27/08, Linus Torvalds <torvalds@linux-foundation.org> wrote:
> On Sat, 26 Jan 2008, Marko Kreen wrote:
> >
> > Here you misunderstood me, I was proposing the following:
> >
> > 	int hash_folded(const char *str, int len)
> > 	{
> > 		char buf[512];
> > 		do_folding(buf, str, len);
> > 		return do_hash(buf, len);
> > 	}
> >
> > That is - the folded string should stay internal to the hash function.
>
> If it's internal, it's much better, but you still missed the performance
> angle.
>
> The fact is, hashing can take shortcuts that folding cannot do!
>
> Case folding, by definition, has to be "exact" (since the whole point is
> that you're going to use the same folding function to do the compare, so
> if you play games with folding, the compares will be wrong).
>
> But hashing doesn't have to be exact. It's ok to hash '{' and '[' as if
> they were different cases of the same character, if that gives you a
> faster hash function. Especially as those characters are rather rare in
> filenames.
>
> So if you do hashing as a function of its own, you can simply do a better
> job at it.
>
> I do agree that the functions that create a folded set of characters from
> a _complex_ UTF-8 character should be shared between folding and hashing,
> since that code is too complex and there are no simple shortcuts for doing
> a faster hash that still retains all the properties we want.

Well, you can always have a fold_quick_and_dirty() function that is used only internally in the hash_folded() function, and which can:

- fold with a simple |= 0x20202020..
- write out full uint32/64 words, with no need to make the result a proper string
- zero-fill at the end, so the hash function does not need to check for a partial block, which is a pretty expensive part of hashing.

The win would be:

- more modularized code
- can use a faster (or any) hash
- the hash function can be certain to work on aligned data (a win on non-x86)

The minus:

- some memory i/o overhead, which may or may not matter
- the parts would not be fully generic, but special to hashing

-- 
marko

PS. Typo in my last mail - "inner loop should be reversible - that means that details from beginning of data should shift out of horizon." That obviously means "data should _not_ shift out of horizon".

btw, "reversible" for integer hashes means that there is a 1:1 mapping between input and output - no collisions, thus no info loss.

^ permalink raw reply	[flat|nested] 51+ messages in thread
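A quick-and-dirty fold along the lines of the bullet points above might look like the sketch below. The names and details are illustrative only (nothing here is from a posted patch), and the fold is the crude ASCII lower-casing Marko mentions, not an exact one.

#include <stdint.h>
#include <string.h>

/*
 * Fold a name into zero-padded, aligned 32-bit words with bit 5 forced on.
 * The output is not a proper string; it is meant to be consumed only by a
 * hash routine that walks whole words, so no tail case is needed there.
 */
static size_t fold_quick_and_dirty(uint32_t *dst, size_t dst_words,
				   const char *src, size_t len)
{
	size_t words = (len + 3) / 4;
	size_t i;

	if (!len || words > dst_words)
		return 0;		/* caller falls back to a slower path */

	dst[words - 1] = 0;		/* zero-fill the partial tail word */
	memcpy(dst, src, len);
	for (i = 0; i < words; i++)
		dst[i] |= 0x20202020;	/* crude ASCII lower-casing */
	return words;			/* number of words for the hash loop */
}

Because the output is word-aligned and zero-padded, the hash that consumes it never has to handle a partial block, which is the point of the third bullet above.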
* Re: I'm a total push-over..
  2008-01-27  9:45 ` Marko Kreen
@ 2008-01-27 15:06   ` Dmitry Potapov
  0 siblings, 0 replies; 51+ messages in thread
From: Dmitry Potapov @ 2008-01-27 15:06 UTC (permalink / raw)
  To: Marko Kreen
  Cc: Linus Torvalds, Andreas Ericsson, Git Mailing List, Junio C Hamano

On Sun, Jan 27, 2008 at 11:45:25AM +0200, Marko Kreen wrote:
> The minus:
> - some memory i/o overhead, which may or may not matter

If a string is short, it will probably reside in the processor cache, so there is no real memory i/o overhead here. For longer strings, it may be better to do that in short chunks, so each chunk can reside in the cache. But I don't think filenames are long, so it is not an issue here.

> - the parts would not be fully generic, but special to hashing

I think the second part can be generic enough to be reused for hashing something else that does not require filename-specific case-folding.

Dmitry

^ permalink raw reply	[flat|nested] 51+ messages in thread
* Re: I'm a total push-over..
  2008-01-25 22:16 ` Linus Torvalds
  2008-01-25 22:35 ` Linus Torvalds
  2008-01-26 12:16 ` Marko Kreen
@ 2008-01-26 12:37 ` Marko Kreen
  2 siblings, 0 replies; 51+ messages in thread
From: Marko Kreen @ 2008-01-26 12:37 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Dmitry Potapov, Andreas Ericsson, Git Mailing List, Junio C Hamano

On 1/26/08, Linus Torvalds <torvalds@linux-foundation.org> wrote:
> It's also a mindset issue. Quite frankly, people who do this by "convert
> to some folded/normalized form, then do the operation" will generally make
> much more fundamental mistakes. Once you get into the mindset of "let's
> pass a corrupted string around", you are in trouble. You start thinking
> that the corrupted string isn't really "corrupt", it's in an "optimized
> format".

Ok, you seem to focus on case folding and general performance; I focus on hash quality and code complexity. Considering you may want several folding methods, it seemed to me that it would be good to separate the two aspects. But I'll try to follow your path for a moment.

Hashing 32 or 64 bits at a time is not trivial, e.g. you cannot use the same algorithm for both cases; 64 bits requires twice the work to mix well. Per Jenkins' notes, the hash inner loop should be reversible - that means that details from beginning of data should shift out of horizon. But final mixing should achieve avalanche - that means each bit in the input should affect 50% of the bits in the output. It must also be noted that we are mixing 32->32 and 64->64 instead of the usual 8->32.

So it seems that the best bet would be to use integer hash functions as the core. Jenkins himself points to Thomas Wang (http://www.cris.com/~Ttwang/tech/inthash.htm), who has good integer mix functions for both 32 and 64 bits. Integer hash functions must have both reversibility and avalanche, so they may be slightly overkill for this purpose, but fixing that means a lot of work.

So here is what I propose for hashing, if you really want to have combined folding + hashing:

#include <stdint.h>
#include <string.h>

/* Thomas Wang integer hash functions */

static inline uint32_t hash32(uint32_t key)
{
	key = ~key + (key << 15);
	key = key ^ (key >> 12);
	key = key + (key << 2);
	key = key ^ (key >> 4);
	key = key * 2057;
	key = key ^ (key >> 16);
	return key;
}

static inline uint64_t hash64(uint64_t key)
{
	key = (~key) + (key << 21);		// key = (key << 21) - key - 1;
	key = key ^ (key >> 24);
	key = (key + (key << 3)) + (key << 8);	// key * 265
	key = key ^ (key >> 14);
	key = (key + (key << 2)) + (key << 4);	// key * 21
	key = key ^ (key >> 28);
	key = key + (key << 31);
	return key;
}

/*
 * Simple addition should be enough for new values,
 * considering the mix functions do work well.
 */

/* this is the function to use in git */
static inline unsigned long hash_long(unsigned long hash, unsigned long val)
{
	if (sizeof(long) == 8)
		return hash64(hash + val);
	else
		return hash32(hash + val);
}

/* below are regular hashes for testing */

static inline uint32_t hash_int32(uint32_t hash, uint32_t val)
{
	return hash32(hash + val);
}

static inline uint64_t hash_int64(uint64_t hash, uint64_t val)
{
	return hash64(hash + val);
}

/* hack to avoid a call to libc memcpy() */
static inline void simple_memcpy(void *_dst, const void *_src, unsigned len)
{
	const uint8_t *src = _src;
	uint8_t *dst = _dst;
	while (len--)
		*dst++ = *src++;
}

uint32_t int32_hash(const void *_data, unsigned int size)
{
	const uint8_t *src = _data;
	/* initial value. +size avoids \0 and \0\0 hashing the same */
	uint32_t hash = 1234567890 + size;
	uint32_t val;

	while (size >= 4) {
		memcpy(&val, src, 4);	/* direct load on x86/64 */
		src += 4;
		size -= 4;
		hash = hash_int32(hash, val);
	}
	if (size > 0) {
		val = 0;
		simple_memcpy(&val, src, size);
		hash = hash_int32(hash, val);
	}
	return hash;
}

uint32_t int64_hash(const void *_data, unsigned int size)
{
	const uint8_t *src = _data;
	uint64_t hash = 12345678901234567890ULL + size;
	uint64_t val;

	while (size >= 8) {
		memcpy(&val, src, 8);	/* direct load on x86/64 */
		hash = hash_int64(hash, val);
		src += 8;
		size -= 8;
	}
	if (size > 0) {
		val = 0;
		simple_memcpy(&val, src, size);
		hash = hash_int64(hash, val);
	}
	/* here we go back to 32 bits, simple masking is enough */
	return hash;
}

In the "regular" hash functions I again use the memcpy() trick, because I don't want to bother with access optimizations, especially considering that part is unnecessary for git.

Intel Core Duo (32-bit). String length 0 .. 40, random alignment:
-------------------------------------------------------------------
Testing: seed=34 align=0 minlen=0 maxlen=40 trycnt=3 duration=10
lookup3             : #0 .. 165.489  #1 .. 165.494  #2 .. 165.490 MB/s
int32_hash          : #0 .. 148.359  #1 .. 148.350  #2 .. 148.435 MB/s
int64_hash          : #0 .. 123.105  #1 .. 123.040  #2 .. 123.039 MB/s
lookup3_memcpy_hack : #0 .. 169.791  #1 .. 169.795  #2 .. 169.749 MB/s
oat                 : #0 .. 134.737  #1 .. 134.702  #2 .. 134.735 MB/s
fnv                 : #0 .. 131.457  #1 .. 131.470  #2 .. 131.474 MB/s
hsieh               : #0 .. 166.619  #1 .. 166.622  #2 .. 166.588 MB/s

Results compared to reference:
lookup3             : 100.000 %
int32_hash          :  89.661 %
int64_hash          :  74.361 %
lookup3_memcpy_hack : 102.591 %
oat                 :  81.409 %
fnv                 :  79.441 %
hsieh               : 100.676 %

AMD Opteron(tm) Processor 252 (64-bit) 2.6GHz
-------------------------------------------------
Testing: seed=34 align=0 minlen=0 maxlen=40 trycnt=3 duration=10
lookup3             : #0 .. 208.819  #1 .. 208.877  #2 .. 208.897 MB/s
int32_hash          : #0 .. 181.096  #1 .. 181.100  #2 .. 181.097 MB/s
int64_hash          : #0 .. 196.823  #1 .. 196.761  #2 .. 196.825 MB/s
lookup3_memcpy_hack : #0 .. 201.593  #1 .. 201.597  #2 .. 201.594 MB/s
oat                 : #0 .. 160.769  #1 .. 160.774  #2 .. 160.772 MB/s
fnv                 : #0 .. 200.046  #1 .. 200.044  #2 .. 200.046 MB/s
hsieh               : #0 .. 205.515  #1 .. 205.520  #2 .. 205.517 MB/s

Results compared to reference:
lookup3             : 100.000 %
int32_hash          :  86.706 %
int64_hash          :  94.225 %
lookup3_memcpy_hack :  96.519 %
oat                 :  76.974 %
fnv                 :  95.778 %
hsieh               :  98.398 %

So speedwise the result is not bad, especially considering the unoptimized data fetching. On larger data (~1k) it tends to lose to lookup3 more; I guess lookup3 parallelizes better (3x 32-bit int vs. 1x 32/64-bit int).

The functions pass the Jenkins lookup3 selftest, which e.g. FNV does not.

The code is also available at:

  http://pgbouncer.projects.postgresql.org/hashtest/hashtest-2008-01-26.tgz

-- 
marko

^ permalink raw reply	[flat|nested] 51+ messages in thread
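To connect the integer-mix approach above to the thread's actual use case, a name-hashing loop built on the 64-bit Wang mix might look roughly like the sketch below. This is only an illustration of how the pieces could be plugged into a hash_name()-style function, not code from any posted patch; the seed and the tail handling are arbitrary choices.

#include <stdint.h>
#include <string.h>

/* Thomas Wang's 64-bit mix, as in the message above */
static inline uint64_t wang_hash64(uint64_t key)
{
	key = (~key) + (key << 21);
	key = key ^ (key >> 24);
	key = (key + (key << 3)) + (key << 8);
	key = key ^ (key >> 14);
	key = (key + (key << 2)) + (key << 4);
	key = key ^ (key >> 28);
	key = key + (key << 31);
	return key;
}

/*
 * Hypothetical index name-hash loop, eight bytes per iteration.
 * memcpy() is used as the unaligned-safe load; the tail is zero-padded.
 */
static unsigned int name_hash_wang(const char *name, unsigned int len)
{
	uint64_t hash = 0x123 + len;	/* arbitrary seed; +len separates "" from "\0" */
	uint64_t val;

	while (len >= sizeof(val)) {
		memcpy(&val, name, sizeof(val));
		hash = wang_hash64(hash + val);
		name += sizeof(val);
		len -= sizeof(val);
	}
	if (len) {
		val = 0;
		memcpy(&val, name, len);
		hash = wang_hash64(hash + val);
	}
	return (unsigned int)hash;	/* truncate to 32 bits for the bucket index */
}

Whether a loop like this actually beats a lookup3-style mix for ~28-character pathnames is exactly what the benchmark numbers above try to measure.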
* Re: I'm a total push-over..
  2008-01-24 16:00 ` Andreas Ericsson
  2008-01-24 16:13 ` Marko Kreen
  2008-01-24 16:28 ` Dmitry Potapov
@ 2008-01-25 20:08 ` Marko Kreen
  2 siblings, 0 replies; 51+ messages in thread
From: Marko Kreen @ 2008-01-25 20:08 UTC (permalink / raw)
  To: Andreas Ericsson
  Cc: Dmitry Potapov, Linus Torvalds, Git Mailing List, Junio C Hamano

On 1/24/08, Andreas Ericsson <ae@op5.se> wrote:
> Interesting output, but not very surprising. Do you have the code
> available somewhere?

  http://pgbouncer.projects.postgresql.org/hashtest/hashtest-2008-01-25.tgz

I cleaned it up a bit. Also fixed a bug - lookup3 was called via a wrapper function, so it actually is a tiny bit faster than the results show.

You can consider the code as public domain too. Also remember it was not meant to be published, so it is rather a hack...

-- 
marko

^ permalink raw reply	[flat|nested] 51+ messages in thread
* Re: I'm a total push-over..
  2008-01-23  9:31 ` Andreas Ericsson
  2008-01-23 14:01   ` Marko Kreen
@ 2008-01-23 17:10   ` Dmitry Potapov
  2008-01-24 10:39     ` Andreas Ericsson
  1 sibling, 1 reply; 51+ messages in thread
From: Dmitry Potapov @ 2008-01-23 17:10 UTC (permalink / raw)
  To: Andreas Ericsson; +Cc: Linus Torvalds, Git Mailing List, Junio C Hamano

On Wed, Jan 23, 2008 at 10:31:11AM +0100, Andreas Ericsson wrote:
> ---
> FNV Hash
>
> I need to fill this in. Search the web for FNV hash. It's faster than my
> hash on Intel (because Intel has fast multiplication), but slower on most
> other platforms. Preliminary tests suggested it has decent distributions.
> ---

I believe that by the words "my hash", Bob Jenkins meant lookup2, which was significantly slower.

> My tests ran on Intel.

Please, could you specify your CPU model?

> I also noticed I had a few hashes commented out when
> doing the test, one of them being Paul Hsieh's. For some reason, Jenkins' and
> Hsieh's didn't perform well for me last time I used the comparison thing (I
> did a more thorough job back then, with tests running for several minutes
> per hash and table-size, so I commented out the poor candidates).

I expected that Paul Hsieh's hash might not do well on some architectures, though it seems it did even worse than I expected.

> I still believe that for this very simple case, the lookup3.c case is not
> very practical, as the code is that much more complicated, which was my
> main point with posting the comparison.

I would not describe lookup3 as impractical. It is widely used and well tested. Perhaps, for some Intel CPUs, the difference in speed is not so big, and the FNV hash is much smaller and simpler, so FNV is a reasonable choice, but the hash is twice as slow on my AMD processor and I suspect it may be even worse on other CPUs, where integer multiplication is slow. Besides, it may turn out that hashing filenames is not the only case where a fast hash is needed.

Dmitry

^ permalink raw reply	[flat|nested] 51+ messages in thread
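Since FNV keeps coming up by name, here is the classic 32-bit FNV-1a loop for reference (the standard published offset basis and prime); whether git would use FNV-1 or FNV-1a is not something this thread settles.

#include <stdint.h>
#include <stddef.h>

/* 32-bit FNV-1a */
static uint32_t fnv1a_32(const void *data, size_t len)
{
	const unsigned char *p = data;
	uint32_t hash = 2166136261u;	/* FNV offset basis */

	while (len--) {
		hash ^= *p++;
		hash *= 16777619u;	/* FNV prime */
	}
	return hash;
}

The loop is one multiply per byte, which is why it does well on CPUs with fast integer multiplication and comparatively worse elsewhere, exactly the trade-off being discussed above.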
* Re: I'm a total push-over..
  2008-01-23 17:10 ` Dmitry Potapov
@ 2008-01-24 10:39   ` Andreas Ericsson
  0 siblings, 0 replies; 51+ messages in thread
From: Andreas Ericsson @ 2008-01-24 10:39 UTC (permalink / raw)
  To: Dmitry Potapov; +Cc: Linus Torvalds, Git Mailing List, Junio C Hamano

Dmitry Potapov wrote:
> On Wed, Jan 23, 2008 at 10:31:11AM +0100, Andreas Ericsson wrote:
>> ---
>> FNV Hash
>>
>> I need to fill this in. Search the web for FNV hash. It's faster than my
>> hash on Intel (because Intel has fast multiplication), but slower on most
>> other platforms. Preliminary tests suggested it has decent distributions.
>> ---
>
> I believe that by the words "my hash", Bob Jenkins meant lookup2, which
> was significantly slower.
>
>> My tests ran on Intel.
>
> Please, could you specify your CPU model?
>

From /proc/cpuinfo. It's the best I can do without going to our purchase department and asking for the spec so I can contact the vendor and get the real thing. Dualcore shouldn't matter for this test, as it isn't threaded.

  Intel(R) Core(TM)2 Duo CPU     T7700  @ 2.40GHz

>> I also noticed I had a few hashes commented out when
>> doing the test, one of them being Paul Hsieh's. For some reason, Jenkins' and
>> Hsieh's didn't perform well for me last time I used the comparison thing (I
>> did a more thorough job back then, with tests running for several minutes
>> per hash and table-size, so I commented out the poor candidates).
>
> I expected that Paul Hsieh's hash might not do well on some architectures,
> though it seems it did even worse than I expected.
>

It doesn't do that well on certain types of data, in my experience. It does have excellent dispersion, so with very long strings it's usually the best to use, because collisions become so expensive.

>> I still believe that for this very simple case, the lookup3.c case is not
>> very practical, as the code is that much more complicated, which was my
>> main point with posting the comparison.
>
> I would not describe lookup3 as impractical. It is widely used and well
> tested. Perhaps, for some Intel CPUs, the difference in speed is not so
> big, and the FNV hash is much smaller and simpler, so FNV is a reasonable
> choice, but the hash is twice as slow on my AMD processor and I suspect
> it may be even worse on other CPUs, where integer multiplication is slow.
> Besides, it may turn out that hashing filenames is not the only case where
> a fast hash is needed.
>

Ah well. I think once the patch is in master, it will be easy enough to test and verify different algorithms. Since it's intended for in-memory data only, it's no problem to have several algorithms and pick the one most suitable for the architecture we're compiling for.

-- 
Andreas Ericsson                   andreas.ericsson@op5.se
OP5 AB                             www.op5.se
Tel: +46 8-230225                  Fax: +46 8-230231

^ permalink raw reply	[flat|nested] 51+ messages in thread
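Picking the algorithm per architecture, as suggested above, could be done at build time. A hypothetical arrangement is sketched below; the hash_name_* function names are invented for illustration and do not exist in git, only the predefined compiler macros are real.

/* hypothetical build-time selection of the in-memory name hash */
#if defined(__x86_64__) || defined(_M_X64)
#define name_hash(s, n) hash_name_word64((s), (n))	/* 8-bytes-at-a-time variant */
#elif defined(__i386__) || defined(_M_IX86)
#define name_hash(s, n) hash_name_word32((s), (n))	/* 4-bytes-at-a-time variant */
#else
#define name_hash(s, n) hash_name_fnv((s), (n))		/* portable byte-wise fallback */
#endif

Since the hash is never written to disk, such a switch would not affect index compatibility; only in-memory bucket placement differs between builds.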
* Re: I'm a total push-over..
  2008-01-23  8:32 ` Andreas Ericsson
  2008-01-23  9:15   ` Dmitry Potapov
@ 2008-01-23 16:06   ` Linus Torvalds
  1 sibling, 0 replies; 51+ messages in thread
From: Linus Torvalds @ 2008-01-23 16:06 UTC (permalink / raw)
  To: Andreas Ericsson; +Cc: Git Mailing List, Junio C Hamano

On Wed, 23 Jan 2008, Andreas Ericsson wrote:
>
> Insofar as hashes go, it's not that shabby for hashing filenames.

Hashing filenames is pretty easy. You can do a reasonable job with any "multiply by an odd number, add in value" scheme. Picking the odd number is a random choice - some are better than others (and it depends on whether you end up always having a power-of-two bucket size etc), but it generally won't be horrible.

And considering that we generally won't have tons and tons of pathnames (ie we'd generally have thousands, not millions), and that the underlying hash not only resizes itself but actually uses the full 32 bits as the lookup key, I don't worry too much. I suspect my random choice is fine.

So no, I didn't think my hash would _suck_, although I also didn't research which odd numbers to pick.

No, it's a joke because it doesn't really give an example of how to do the _expected_ hash collisions. Here's another hash that is actually going to collide *much* more (and on purpose!), but is actually showing an example of something that actually hashes UTF strings so that upper-case and lower-case (and normalization) will all still hash to the same value:

	static unsigned int hash_name(const char *name, int namelen)
	{
		unsigned int hash = 0x123;

		do {
			unsigned char c = *name++;
			if (c & 0x80)
				c = 0;
			c &= ~0x20;
			hash = hash*101 + c;
		} while (--namelen);
		return hash;
	}

but the above does so by making the hash much much worse (although probably still acceptable for "normal source code name distributions" that don't have very many same-name-in-different-cases and high-bit characters anyway).

The above is still fairly fast, but obviously at a serious cost in hash goodness, to the point of being totally unusable for anybody who uses Asian characters in their filenames. To actually be really useful, you'd have to teach it about the character system and do a lookup into a case/normalization table.

So *that* is mainly why it's a joke. But it should work fine for the case it is used for now (exact match). Picking a better-researched constant might still be a good idea.

			Linus

^ permalink raw reply	[flat|nested] 51+ messages in thread
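The trade-off described above is easy to see in a tiny standalone check (the function is the folding variant quoted in the message, the driver is illustration only, not part of any patch): ASCII names that differ only in case hash the same, but so do any two names that differ only in high-bit bytes, since those are all mapped to the same value.

#include <stdio.h>

/* the case/normalization-folding variant quoted above */
static unsigned int hash_name(const char *name, int namelen)
{
	unsigned int hash = 0x123;

	do {
		unsigned char c = *name++;
		if (c & 0x80)
			c = 0;
		c &= ~0x20;
		hash = hash*101 + c;
	} while (--namelen);
	return hash;
}

int main(void)
{
	/* case-insensitive on ASCII, by design: same value */
	printf("%u %u\n", hash_name("Makefile", 8), hash_name("MAKEFILE", 8));

	/* every high-bit byte hashes alike, hence the "unusable for Asian
	 * filenames" caveat: these two different UTF-8 names also collide */
	printf("%u %u\n", hash_name("\xc3\xa4", 2), hash_name("\xc3\xb6", 2));
	return 0;
}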
end of thread, other threads:[~2008-01-27 15:06 UTC | newest] Thread overview: 51+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2008-01-22 23:37 I'm a total push-over Linus Torvalds 2008-01-23 1:35 ` Kevin Ballard 2008-01-23 2:23 ` Junio C Hamano 2008-01-23 2:36 ` Junio C Hamano 2008-01-23 12:24 ` Johannes Schindelin 2008-01-23 12:28 ` David Kastrup 2008-01-23 12:56 ` Theodore Tso 2008-01-23 2:58 ` Linus Torvalds 2008-01-23 3:19 ` Linus Torvalds 2008-01-25 6:50 ` Junio C Hamano 2008-01-25 16:24 ` Linus Torvalds 2008-01-23 7:23 ` Junio C Hamano 2008-01-23 12:25 ` Johannes Schindelin 2008-01-23 16:25 ` Linus Torvalds 2008-01-23 16:34 ` Johannes Schindelin 2008-01-23 17:09 ` Linus Torvalds 2008-01-23 17:29 ` Linus Torvalds 2008-01-25 5:21 ` Jeremy Maitin-Shepard 2008-01-25 12:51 ` Johannes Schindelin 2008-01-25 18:19 ` Jeremy Maitin-Shepard 2008-01-25 18:24 ` Johannes Schindelin 2008-01-25 19:07 ` Junio C Hamano 2008-01-23 8:32 ` Andreas Ericsson 2008-01-23 9:15 ` Dmitry Potapov 2008-01-23 9:31 ` Andreas Ericsson 2008-01-23 14:01 ` Marko Kreen 2008-01-23 14:39 ` Andreas Ericsson 2008-01-24 6:51 ` Luke Lu 2008-01-24 10:24 ` Andreas Ericsson 2008-01-24 13:19 ` Marko Kreen 2008-01-24 16:00 ` Andreas Ericsson 2008-01-24 16:13 ` Marko Kreen 2008-01-24 16:28 ` Dmitry Potapov 2008-01-24 17:15 ` Linus Torvalds 2008-01-24 18:45 ` Dmitry Potapov 2008-01-24 19:08 ` Linus Torvalds 2008-01-25 20:52 ` Marko Kreen 2008-01-25 22:16 ` Linus Torvalds 2008-01-25 22:35 ` Linus Torvalds 2008-01-26 12:16 ` Marko Kreen 2008-01-27 6:51 ` Linus Torvalds 2008-01-27 8:21 ` Dmitry Potapov 2008-01-27 14:07 ` Johannes Schindelin 2008-01-27 14:48 ` Dmitry Potapov 2008-01-27 9:45 ` Marko Kreen 2008-01-27 15:06 ` Dmitry Potapov 2008-01-26 12:37 ` Marko Kreen 2008-01-25 20:08 ` Marko Kreen 2008-01-23 17:10 ` Dmitry Potapov 2008-01-24 10:39 ` Andreas Ericsson 2008-01-23 16:06 ` Linus Torvalds