* I'm a total push-over.. @ 2008-01-22 23:37 Linus Torvalds 2008-01-23 1:35 ` Kevin Ballard ` (2 more replies) 0 siblings, 3 replies; 51+ messages in thread From: Linus Torvalds @ 2008-01-22 23:37 UTC (permalink / raw) To: Git Mailing List, Junio C Hamano Ok, here's an interesting patch based on the current 'next' (since it very intimately requires the new in-memory index format). What it does is to create a hash index of every single file added to the index. Right now that hash index isn't actually used for much: I implemented a "cache_name_exists()" function that uses it to efficiently look up a filename in the index without having to do the O(logn) binary search, but quite frankly, that's not why this patch is interesting. No, the whole and only reason to create the hash of the filenames in the index is that by modifying the hash function, you can fairly easily do things like making it always hash equivalent names into the same bucket. That, in turn, means that suddenly questions like "does this name exists in the index under an _equivalent_ name?" becomes much much cheaper. Guiding principles behind this patch: - it shouldn't be too costly. In fact, my primary goal here was to actually speed up "git commit" with a fully populated kernel tree, by being faster at checking whether a file already existed in the index. I did succeed, but only barely: Best before: [torvalds@woody linux]$ time git commit > /dev/null real 0m0.255s user 0m0.168s sys 0m0.088s Best after: [torvalds@woody linux]$ time ~/git/git commit > /dev/null real 0m0.233s user 0m0.144s sys 0m0.088s so some things are actually faster (~8%). Caveat: that's really the best case. Other things are invariably going to be slightly slower, since we populate that index cache, and quite frankly, few things really use it to look things up. That said, the cost is really quite small. The worst case is probably doing a "git ls-files", which will do very little except puopulate the index, and never actually looks anything up in it, just lists it. Before: [torvalds@woody linux]$ time git ls-files > /dev/null real 0m0.016s user 0m0.016s sys 0m0.000s After: [torvalds@woody linux]$ time ~/git/git ls-files > /dev/null real 0m0.021s user 0m0.012s sys 0m0.008s and while the thing has really gotten relatively much slower, we're still talking about something almost unmeasurable (eg 5ms). And that really should be pretty much the worst case. So we lose 5ms on one "benchmark", but win 22ms on another. Pick your poison - this patch has the advantage that it will _likely_ speed up the cases that are complex and expensive more than it slows down the cases that are already so fast that nobody cares. But if you look at relative speedups/slowdowns, it doesn't look so good. - It should be simple and clean The code may be a bit subtle (the reasons I do hash removal the way I do etc), but it re-uses the existing hash.c files, so it really is fairly small and straightforward apart from a few odd details. Now, this patch on its own doesn't really do much, but I think it's worth looking at, if only because if done correctly, the name hashing really can make an improvement to the whole issue of "do we have a filename that looks like this in the index already". And at least it gets real testing by being used even by default (ie there is a real use-case for it even without any insane filesystems). NOTE NOTE NOTE! The current hash is a joke. I'm ashamed of it, I'm just not ashamed of it enough to really care. 
I took all the numbers out of my nether regions - I'm sure it's good enough that it works in practice, but the whole point was that you can make a really much fancier hash that hashes characters not directly, but by their upper-case value or something like that, and thus you get a case-insensitive hash, while still keeping the name and the index itself totally case sensitive. And let's face it, it was kind of fun. These things are all _soo_ much simpler than all the issues you have to do in the kernel, so this is just a complete toy compared to all the things we do inside Linux to do the same thing with pluggable hashes on a per-path-component basis etc. (User space developers are weenies. One of the most fun parts of git development for me has been how easy everything is ;) Linus ---- cache.h | 6 +++ dir.c | 2 +- read-cache.c | 97 ++++++++++++++++++++++++++++++++++++++++++++++++++++------ 3 files changed, 94 insertions(+), 11 deletions(-) diff --git a/cache.h b/cache.h index 3a47cdc..409738c 100644 --- a/cache.h +++ b/cache.h @@ -3,6 +3,7 @@ #include "git-compat-util.h" #include "strbuf.h" +#include "hash.h" #include SHA1_HEADER #include <zlib.h> @@ -109,6 +110,7 @@ struct ondisk_cache_entry { }; struct cache_entry { + struct cache_entry *next; unsigned int ce_ctime; unsigned int ce_mtime; unsigned int ce_dev; @@ -131,6 +133,7 @@ struct cache_entry { #define CE_UPDATE (0x10000) #define CE_REMOVE (0x20000) #define CE_UPTODATE (0x40000) +#define CE_UNHASHED (0x80000) static inline unsigned create_ce_flags(size_t len, unsigned stage) { @@ -188,6 +191,7 @@ struct index_state { struct cache_tree *cache_tree; time_t timestamp; void *alloc; + struct hash_table name_hash; }; extern struct index_state the_index; @@ -211,6 +215,7 @@ extern struct index_state the_index; #define refresh_cache(flags) refresh_index(&the_index, (flags), NULL, NULL) #define ce_match_stat(ce, st, options) ie_match_stat(&the_index, (ce), (st), (options)) #define ce_modified(ce, st, options) ie_modified(&the_index, (ce), (st), (options)) +#define cache_name_exists(name, namelen) index_name_exists(&the_index, (name), (namelen)) #endif enum object_type { @@ -297,6 +302,7 @@ extern int read_index_from(struct index_state *, const char *path); extern int write_index(struct index_state *, int newfd); extern int discard_index(struct index_state *); extern int verify_path(const char *path); +extern int index_name_exists(struct index_state *istate, const char *name, int namelen); extern int index_name_pos(struct index_state *, const char *name, int namelen); #define ADD_CACHE_OK_TO_ADD 1 /* Ok to add */ #define ADD_CACHE_OK_TO_REPLACE 2 /* Ok to replace file/directory */ diff --git a/dir.c b/dir.c index 1b9cc7a..6543105 100644 --- a/dir.c +++ b/dir.c @@ -346,7 +346,7 @@ static struct dir_entry *dir_entry_new(const char *pathname, int len) struct dir_entry *dir_add_name(struct dir_struct *dir, const char *pathname, int len) { - if (cache_name_pos(pathname, len) >= 0) + if (cache_name_exists(pathname, len)) return NULL; ALLOC_GROW(dir->entries, dir->nr+1, dir->alloc); diff --git a/read-cache.c b/read-cache.c index 8ba8f0f..33a8ca5 100644 --- a/read-cache.c +++ b/read-cache.c @@ -23,6 +23,70 @@ struct index_state the_index; +static unsigned int hash_name(const char *name, int namelen) +{ + unsigned int hash = 0x123; + + do { + unsigned char c = *name++; + hash = hash*101 + c; + } while (--namelen); + return hash; +} + +static void set_index_entry(struct index_state *istate, int nr, struct cache_entry *ce) +{ + void **pos; + unsigned 
int hash = hash_name(ce->name, ce_namelen(ce)); + + istate->cache[nr] = ce; + pos = insert_hash(hash, ce, &istate->name_hash); + if (pos) { + ce->next = *pos; + *pos = ce; + } +} + +/* + * We don't actually *remove* it, we can just mark it invalid so that + * we won't find it in lookups. + * + * Not only would we have to search the lists (simple enough), but + * we'd also have to rehash other hash buckets in case this makes the + * hash bucket empty (common). So it's much better to just mark + * it. + */ +static void remove_hash_entry(struct index_state *istate, struct cache_entry *ce) +{ + ce->ce_flags |= CE_UNHASHED; +} + +static void replace_index_entry(struct index_state *istate, int nr, struct cache_entry *ce) +{ + struct cache_entry *old = istate->cache[nr]; + + if (ce != old) { + remove_hash_entry(istate, old); + set_index_entry(istate, nr, ce); + } + istate->cache_changed = 1; +} + +int index_name_exists(struct index_state *istate, const char *name, int namelen) +{ + unsigned int hash = hash_name(name, namelen); + struct cache_entry *ce = lookup_hash(hash, &istate->name_hash); + + while (ce) { + if (!(ce->ce_flags & CE_UNHASHED)) { + if (!cache_name_compare(name, namelen, ce->name, ce->ce_flags)) + return 1; + } + ce = ce->next; + } + return 0; +} + /* * This only updates the "non-critical" parts of the directory * cache, ie the parts that aren't tracked by GIT, and only used @@ -323,6 +387,9 @@ int index_name_pos(struct index_state *istate, const char *name, int namelen) /* Remove entry, return true if there are more entries to go.. */ int remove_index_entry_at(struct index_state *istate, int pos) { + struct cache_entry *ce = istate->cache[pos]; + + remove_hash_entry(istate, ce); istate->cache_changed = 1; istate->cache_nr--; if (pos >= istate->cache_nr) @@ -697,8 +764,7 @@ static int add_index_entry_with_check(struct index_state *istate, struct cache_e /* existing match? Just replace it. */ if (pos >= 0) { - istate->cache_changed = 1; - istate->cache[pos] = ce; + replace_index_entry(istate, pos, ce); return 0; } pos = -pos-1; @@ -758,7 +824,7 @@ int add_index_entry(struct index_state *istate, struct cache_entry *ce, int opti memmove(istate->cache + pos + 1, istate->cache + pos, (istate->cache_nr - pos - 1) * sizeof(ce)); - istate->cache[pos] = ce; + set_index_entry(istate, pos, ce); istate->cache_changed = 1; return 0; } @@ -887,11 +953,8 @@ int refresh_index(struct index_state *istate, unsigned int flags, const char **p has_errors = 1; continue; } - istate->cache_changed = 1; - /* You can NOT just free istate->cache[i] here, since it - * might not be necessarily malloc()ed but can also come - * from mmap(). */ - istate->cache[i] = new; + + replace_index_entry(istate, i, new); } return has_errors; } @@ -966,6 +1029,20 @@ static void convert_from_disk(struct ondisk_cache_entry *ondisk, struct cache_en memcpy(ce->name, ondisk->name, len + 1); } +static inline size_t estimate_cache_size(size_t ondisk_size, unsigned int entries) +{ + long per_entry; + + per_entry = sizeof(struct cache_entry) - sizeof(struct ondisk_cache_entry); + + /* + * Alignment can cause differences. This should be "alignof", but + * since that's a gcc'ism, just use the size of a pointer. + */ + per_entry += sizeof(void *); + return ondisk_size + entries*per_entry; +} + /* remember to discard_cache() before reading a different cache! 
*/ int read_index_from(struct index_state *istate, const char *path) { @@ -1016,7 +1093,7 @@ int read_index_from(struct index_state *istate, const char *path) * has room for a few more flags, we can allocate using the same * index size */ - istate->alloc = xmalloc(mmap_size); + istate->alloc = xmalloc(estimate_cache_size(mmap_size, istate->cache_nr)); src_offset = sizeof(*hdr); dst_offset = 0; @@ -1027,7 +1104,7 @@ int read_index_from(struct index_state *istate, const char *path) disk_ce = (struct ondisk_cache_entry *)((char *)mmap + src_offset); ce = (struct cache_entry *)((char *)istate->alloc + dst_offset); convert_from_disk(disk_ce, ce); - istate->cache[i] = ce; + set_index_entry(istate, i, ce); src_offset += ondisk_ce_size(ce); dst_offset += ce_size(ce); ^ permalink raw reply related [flat|nested] 51+ messages in thread
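
For illustration only: the "equivalent names hash to the same bucket" idea can be sketched as a case-folding variant of the patch's hash_name(). The ASCII-only folding rule below is an assumption made for the example, not part of the patch.

static unsigned int hash_name_icase(const char *name, int namelen)
{
	unsigned int hash = 0x123;

	do {
		unsigned char c = *name++;
		/* fold ASCII letters so "Makefile" and "makefile" land in one bucket */
		if ('a' <= c && c <= 'z')
			c &= ~0x20;
		hash = hash*101 + c;
	} while (--namelen);
	return hash;
}

The entries keep their exact names; only the bucket choice becomes case-insensitive, so callers can still compare byte-for-byte while walking the chain.
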
* Re: I'm a total push-over.. 2008-01-22 23:37 I'm a total push-over Linus Torvalds @ 2008-01-23 1:35 ` Kevin Ballard 2008-01-23 2:23 ` Junio C Hamano 2008-01-23 8:32 ` Andreas Ericsson 2 siblings, 0 replies; 51+ messages in thread From: Kevin Ballard @ 2008-01-23 1:35 UTC (permalink / raw) To: Linus Torvalds; +Cc: Git Mailing List, Junio C Hamano [-- Attachment #1: Type: text/plain, Size: 1308 bytes --] On Jan 22, 2008, at 6:37 PM, Linus Torvalds wrote: > Ok, here's an interesting patch based on the current 'next' (since > it very > intimately requires the new in-memory index format). > > What it does is to create a hash index of every single file added to > the > index. Right now that hash index isn't actually used for much: I > implemented a "cache_name_exists()" function that uses it to > efficiently > look up a filename in the index without having to do the O(logn) > binary > search, but quite frankly, that's not why this patch is interesting. > > No, the whole and only reason to create the hash of the filenames in > the > index is that by modifying the hash function, you can fairly easily do > things like making it always hash equivalent names into the same > bucket. This is fantastic. Thank you very much for actually taking this issue seriously despite the mess I made on the list. This is exactly why I wanted to discuss on the lists instead of hacking away myself - there are very smart people on the list (like you) that already know how git works that can come up with ideas like this while I would still be trying to figure out where the index code is even stored. -Kevin Ballard -- Kevin Ballard http://kevin.sb.org kevin@sb.org http://www.tildesoft.com [-- Attachment #2: smime.p7s --] [-- Type: application/pkcs7-signature, Size: 2432 bytes --] ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: I'm a total push-over.. 2008-01-22 23:37 I'm a total push-over Linus Torvalds 2008-01-23 1:35 ` Kevin Ballard @ 2008-01-23 2:23 ` Junio C Hamano 2008-01-23 2:36 ` Junio C Hamano 2008-01-23 2:58 ` Linus Torvalds 2008-01-23 8:32 ` Andreas Ericsson 2 siblings, 2 replies; 51+ messages in thread From: Junio C Hamano @ 2008-01-23 2:23 UTC (permalink / raw) To: Linus Torvalds; +Cc: Git Mailing List Linus Torvalds <torvalds@linux-foundation.org> writes: > Ok, here's an interesting patch based on the current 'next' (since it very > intimately requires the new in-memory index format). This is nice. It does not do anything specific with HFS+ issues but aims for faster look-ups, which would help everybody. Two things I noticed (only two, not necessarily because you are good but mostly because I am still mired in day job and could not get enough uninterrupted minutes to read the patch ;-)): - You might want to store the hash table (once computed) in the index extension section, and lazily unpack the table the first time index_name_exists() or set_index_entry() is called on the given istate, instead of unpacking it immediately when you read from the disk. That way, ls-files does not have to suffer at all. - You would need to get rid of the table in discard_index(). ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: I'm a total push-over.. 2008-01-23 2:23 ` Junio C Hamano @ 2008-01-23 2:36 ` Junio C Hamano 2008-01-23 12:24 ` Johannes Schindelin 2008-01-23 2:58 ` Linus Torvalds 1 sibling, 1 reply; 51+ messages in thread From: Junio C Hamano @ 2008-01-23 2:36 UTC (permalink / raw) To: Linus Torvalds; +Cc: Git Mailing List Junio C Hamano <gitster@pobox.com> writes: > Linus Torvalds <torvalds@linux-foundation.org> writes: > >> Ok, here's an interesting patch based on the current 'next' (since it very >> intimately requires the new in-memory index format). > > This is nice. It does not do anything specific with HFS+ issues > but aims for faster look-ups, which would help everybody. > > Two things I noticed (only two, not necessarily because you are > good but mostly because I am still mired in day job and could > not get enough uninterrupted minutes to read the patch ;-)): > > - You might want to store the hash table (once computed) in the > index extension section, and lazily unpack the table the > first time index_name_exists() or set_index_entry() is called > on the given istate, instead of unpacking it immediately when > you read from the disk. That way, ls-files does not have to > suffer at all. Actually, I take one fourth of this back. * I am not yet retracting the suggestion to do the hashing lazily (what I mean by "lazily" is that the first access that wants hashed access will iterate active_cache and hash them all, not "lazily one entry at a time as needed" which would not make any sense for a hashtable). I have to find time to try benching the effect of it myself. So that's one half retained. * We certainly do not necessarily want to store this in the index right now. The hash algorithms would be improved from the version you are almost ashamed of ;-). That sounds as if I am retrating the other half, but not quite. * Once we have a mechanism to detect that the extension section stores a precomputed hash that was done with a different algorithm and ignore it (and recompute afresh when needed), then we can afford to put a more elaborate hashing algorithm, slightly loosening one of your "Guiding principles", and keep the result in the generated index to be reused by the next user. So that is why I am retracting only half of the suggestion to save it in the extension section (which in turn is a half of my suggestion). ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: I'm a total push-over.. 2008-01-23 2:36 ` Junio C Hamano @ 2008-01-23 12:24 ` Johannes Schindelin 2008-01-23 12:28 ` David Kastrup 0 siblings, 1 reply; 51+ messages in thread From: Johannes Schindelin @ 2008-01-23 12:24 UTC (permalink / raw) To: Junio C Hamano; +Cc: Linus Torvalds, Git Mailing List Hi, On Tue, 22 Jan 2008, Junio C Hamano wrote: > * We certainly do not necessarily want to store this in the > index right now. The hash algorithms would be improved from > the version you are almost ashamed of ;-). That sounds as if > I am retrating the other half, but not quite. > > * Once we have a mechanism to detect that the extension section > stores a precomputed hash that was done with a different > algorithm and ignore it (and recompute afresh when needed), > then we can afford to put a more elaborate hashing algorithm, > slightly loosening one of your "Guiding principles", and keep > the result in the generated index to be reused by the next > user. So that is why I am retracting only half of the > suggestion to save it in the extension section (which in turn > is a half of my suggestion). Both issues (and the config variable issue Linus raised) are easily helped with: store not only the hashmap in the extension, but also an identifier for the hash method used. Then you can improve on the hash function all you like, and add the config variable dependent choice of the hashing everybody seems to want all of a sudden, as long as you change the method identifier, too. Ciao, Dscho ^ permalink raw reply [flat|nested] 51+ messages in thread
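
As a rough sketch of that suggestion (nothing like this exists in git's index format; the layout and names below are hypothetical), the extension record could begin with a method identifier, and a reader would simply ignore a table written by a different method:

#include <stdint.h>
#include <arpa/inet.h>

#define NAME_HASH_METHOD_V1 1	/* hypothetical identifier for the current hash function */

struct name_hash_ext {
	uint32_t hash_method;	/* which hash function produced the stored table */
	uint32_t nr_buckets;	/* number of buckets serialized after this header */
};

static int name_hash_ext_usable(const struct name_hash_ext *ext)
{
	/* unknown or outdated method: drop the stored table and recompute */
	return ntohl(ext->hash_method) == NAME_HASH_METHOD_V1;
}
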
* Re: I'm a total push-over.. 2008-01-23 12:24 ` Johannes Schindelin @ 2008-01-23 12:28 ` David Kastrup 2008-01-23 12:56 ` Theodore Tso 0 siblings, 1 reply; 51+ messages in thread From: David Kastrup @ 2008-01-23 12:28 UTC (permalink / raw) To: Johannes Schindelin; +Cc: Junio C Hamano, Linus Torvalds, Git Mailing List Johannes Schindelin <Johannes.Schindelin@gmx.de> writes: > Both issues (and the config variable issue Linus raised) are easily > helped with: store not only the hashmap in the extension, but also an > identifier for the hash method used. For a reasonably implemented hash algorithm, computing the hash should be cheaper than reading it from disk. So storing precomputed hashes is not worth the trouble. -- David Kastrup, Kriemhildstr. 15, 44793 Bochum ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: I'm a total push-over.. 2008-01-23 12:28 ` David Kastrup @ 2008-01-23 12:56 ` Theodore Tso 0 siblings, 0 replies; 51+ messages in thread From: Theodore Tso @ 2008-01-23 12:56 UTC (permalink / raw) To: David Kastrup Cc: Johannes Schindelin, Junio C Hamano, Linus Torvalds, Git Mailing List On Wed, Jan 23, 2008 at 01:28:27PM +0100, David Kastrup wrote: > Johannes Schindelin <Johannes.Schindelin@gmx.de> writes: > > > Both issues (and the config variable issue Linus raised) are easily > > helped with: store not only the hashmap in the extension, but also an > > identifier for the hash method used. > > For a reasonably implemented hash algorithm, computing the hash should > be cheaper than reading it from disk. > > So storing precomputed hashes is not worth the trouble. Yes, but if on Mac OS systems when the git repository is stored on an HFS+ system, the hash algorithm gets changed to one which forces Unicode strings which HFS+ happens to "normalize" into the same hash bucket pre- and post- normalization, it might not be cheap any more.... - Ted ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: I'm a total push-over.. 2008-01-23 2:23 ` Junio C Hamano 2008-01-23 2:36 ` Junio C Hamano @ 2008-01-23 2:58 ` Linus Torvalds 2008-01-23 3:19 ` Linus Torvalds 2008-01-23 7:23 ` Junio C Hamano 1 sibling, 2 replies; 51+ messages in thread From: Linus Torvalds @ 2008-01-23 2:58 UTC (permalink / raw) To: Junio C Hamano; +Cc: Git Mailing List On Tue, 22 Jan 2008, Junio C Hamano wrote: > > - You might want to store the hash table (once computed) in the > index extension section, and lazily unpack the table the > first time index_name_exists() or set_index_entry() is called > on the given istate, instead of unpacking it immediately when > you read from the disk. That way, ls-files does not have to > suffer at all. I really hate that. Basically, I dislike having two copies of the same data. If something can be computed from something else, then only the original data should exist, and the other thing should be recomputed. Otherwise you easily get into situations where you spend a lot of time maintaining the other copy, or worse, you have inconsistent data and it's really subtle what is going on. Also, one of the ideas behind the index is that would depend on your notion of what is "equivalent", which is actually somehing fairly fluid. In fact, it's likely going to depend on a config option. So encoding the indexing on disk, when it can change when you do a simple "git config", or even just depending on your LANG environment variable, seems like a singularly bad idea, even if it wasn't for the coherence. I did consider doing the indexing only on demand, and we can certainly simply just "turn it off" when we know it's never going to get used (ie "git ls-files"). So in that sense, it's easy to get rid of the overhead, but it didn't really seem like the conceptual complexity (even if it's just a couple of lines) is really worth it. It's not like git ls-files is really performance-critical anyway. > - You would need to get rid of the table in discard_index(). Now this, of course, is obviously true. And the patch to do that is very simple too. No need to walk any chains, since the "free(istate->alloc);" will release all the pre-allocated cache_entry structures, and the rest are (necessarily) leaked anyway. [ Side note for non-Junios: the leaking of cache_entry structures isn't new, we've always done it, and it's even done on purpose. The common case is that there is one *big* allocation (istate->alloc) that contains all the original cache entries. There are usually none, or only a very few individual allocations, and we don't even keep track of them. With the new in-memory format, we could make a special flag that does "is this cache-entry an individual allocation or not" (or we could even just see if they are inside the "alloc" range), but the common case really should be that there's just a couple of them, and we just drop them rather than tracking them. ] Here. Linus --- read-cache.c | 1 + 1 files changed, 1 insertions(+), 0 deletions(-) diff --git a/read-cache.c b/read-cache.c index 33a8ca5..abee0fc 100644 --- a/read-cache.c +++ b/read-cache.c @@ -1142,6 +1142,7 @@ int discard_index(struct index_state *istate) istate->cache_nr = 0; istate->cache_changed = 0; istate->timestamp = 0; + free_hash(&istate->name_hash); cache_tree_free(&(istate->cache_tree)); free(istate->alloc); istate->alloc = NULL; ^ permalink raw reply related [flat|nested] 51+ messages in thread
* Re: I'm a total push-over.. 2008-01-23 2:58 ` Linus Torvalds @ 2008-01-23 3:19 ` Linus Torvalds 2008-01-25 6:50 ` Junio C Hamano 2008-01-23 7:23 ` Junio C Hamano 1 sibling, 1 reply; 51+ messages in thread From: Linus Torvalds @ 2008-01-23 3:19 UTC (permalink / raw) To: Junio C Hamano; +Cc: Git Mailing List On Tue, 22 Jan 2008, Linus Torvalds wrote: > > And the patch to do that is very simple too. Ok, I pushed the squashed/fixed commit to the "new-lstat" branch. It's now based on your 'next' branch, and I should have renamed it, since it's not about any of the old lstat() optimizations any more. Whatever. But here it is also as a full patch, with a fixed up subject line etc. You tend to want to have them as topic-branches, and it's probably easier this way. Linus --- From ca98bbc7b0cc1d9d088a8c6ae80e733115a1f775 Mon Sep 17 00:00:00 2001 From: Linus Torvalds <torvalds@linux-foundation.org> Date: Tue, 22 Jan 2008 18:41:14 -0800 Subject: [PATCH] Create pathname-based hash-table lookup into index This creates a hash index of every single file added to the index. Right now that hash index isn't actually used for much: I implemented a "cache_name_exists()" function that uses it to efficiently look up a filename in the index without having to do the O(logn) binary search, but quite frankly, that's not why this patch is interesting. No, the whole and only reason to create the hash of the filenames in the index is that by modifying the hash function, you can fairly easily do things like making it always hash equivalent names into the same bucket. That, in turn, means that suddenly questions like "does this name exist in the index under an _equivalent_ name?" becomes much much cheaper. Guiding principles behind this patch: - it shouldn't be too costly. In fact, my primary goal here was to actually speed up "git commit" with a fully populated kernel tree, by being faster at checking whether a file already existed in the index. I did succeed, but only barely: Best before: [torvalds@woody linux]$ time git commit > /dev/null real 0m0.255s user 0m0.168s sys 0m0.088s Best after: [torvalds@woody linux]$ time ~/git/git commit > /dev/null real 0m0.233s user 0m0.144s sys 0m0.088s so some things are actually faster (~8%). Caveat: that's really the best case. Other things are invariably going to be slightly slower, since we populate that index cache, and quite frankly, few things really use it to look things up. That said, the cost is really quite small. The worst case is probably doing a "git ls-files", which will do very little except puopulate the index, and never actually looks anything up in it, just lists it. Before: [torvalds@woody linux]$ time git ls-files > /dev/null real 0m0.016s user 0m0.016s sys 0m0.000s After: [torvalds@woody linux]$ time ~/git/git ls-files > /dev/null real 0m0.021s user 0m0.012s sys 0m0.008s and while the thing has really gotten relatively much slower, we're still talking about something almost unmeasurable (eg 5ms). And that really should be pretty much the worst case. So we lose 5ms on one "benchmark", but win 22ms on another. Pick your poison - this patch has the advantage that it will _likely_ speed up the cases that are complex and expensive more than it slows down the cases that are already so fast that nobody cares. But if you look at relative speedups/slowdowns, it doesn't look so good. 
- It should be simple and clean The code may be a bit subtle (the reasons I do hash removal the way I do etc), but it re-uses the existing hash.c files, so it really is fairly small and straightforward apart from a few odd details. Now, this patch on its own doesn't really do much, but I think it's worth looking at, if only because if done correctly, the name hashing really can make an improvement to the whole issue of "do we have a filename that looks like this in the index already". And at least it gets real testing by being used even by default (ie there is a real use-case for it even without any insane filesystems). NOTE NOTE NOTE! The current hash is a joke. I'm ashamed of it, I'm just not ashamed of it enough to really care. I took all the numbers out of my nether regions - I'm sure it's good enough that it works in practice, but the whole point was that you can make a really much fancier hash that hashes characters not directly, but by their upper-case value or something like that, and thus you get a case-insensitive hash, while still keeping the name and the index itself totally case sensitive. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> --- cache.h | 6 +++ dir.c | 2 +- read-cache.c | 98 ++++++++++++++++++++++++++++++++++++++++++++++++++++------ 3 files changed, 95 insertions(+), 11 deletions(-) diff --git a/cache.h b/cache.h index 3a47cdc..409738c 100644 --- a/cache.h +++ b/cache.h @@ -3,6 +3,7 @@ #include "git-compat-util.h" #include "strbuf.h" +#include "hash.h" #include SHA1_HEADER #include <zlib.h> @@ -109,6 +110,7 @@ struct ondisk_cache_entry { }; struct cache_entry { + struct cache_entry *next; unsigned int ce_ctime; unsigned int ce_mtime; unsigned int ce_dev; @@ -131,6 +133,7 @@ struct cache_entry { #define CE_UPDATE (0x10000) #define CE_REMOVE (0x20000) #define CE_UPTODATE (0x40000) +#define CE_UNHASHED (0x80000) static inline unsigned create_ce_flags(size_t len, unsigned stage) { @@ -188,6 +191,7 @@ struct index_state { struct cache_tree *cache_tree; time_t timestamp; void *alloc; + struct hash_table name_hash; }; extern struct index_state the_index; @@ -211,6 +215,7 @@ extern struct index_state the_index; #define refresh_cache(flags) refresh_index(&the_index, (flags), NULL, NULL) #define ce_match_stat(ce, st, options) ie_match_stat(&the_index, (ce), (st), (options)) #define ce_modified(ce, st, options) ie_modified(&the_index, (ce), (st), (options)) +#define cache_name_exists(name, namelen) index_name_exists(&the_index, (name), (namelen)) #endif enum object_type { @@ -297,6 +302,7 @@ extern int read_index_from(struct index_state *, const char *path); extern int write_index(struct index_state *, int newfd); extern int discard_index(struct index_state *); extern int verify_path(const char *path); +extern int index_name_exists(struct index_state *istate, const char *name, int namelen); extern int index_name_pos(struct index_state *, const char *name, int namelen); #define ADD_CACHE_OK_TO_ADD 1 /* Ok to add */ #define ADD_CACHE_OK_TO_REPLACE 2 /* Ok to replace file/directory */ diff --git a/dir.c b/dir.c index 1b9cc7a..6543105 100644 --- a/dir.c +++ b/dir.c @@ -346,7 +346,7 @@ static struct dir_entry *dir_entry_new(const char *pathname, int len) struct dir_entry *dir_add_name(struct dir_struct *dir, const char *pathname, int len) { - if (cache_name_pos(pathname, len) >= 0) + if (cache_name_exists(pathname, len)) return NULL; ALLOC_GROW(dir->entries, dir->nr+1, dir->alloc); diff --git a/read-cache.c b/read-cache.c index 8ba8f0f..abee0fc 100644 --- 
a/read-cache.c +++ b/read-cache.c @@ -23,6 +23,70 @@ struct index_state the_index; +static unsigned int hash_name(const char *name, int namelen) +{ + unsigned int hash = 0x123; + + do { + unsigned char c = *name++; + hash = hash*101 + c; + } while (--namelen); + return hash; +} + +static void set_index_entry(struct index_state *istate, int nr, struct cache_entry *ce) +{ + void **pos; + unsigned int hash = hash_name(ce->name, ce_namelen(ce)); + + istate->cache[nr] = ce; + pos = insert_hash(hash, ce, &istate->name_hash); + if (pos) { + ce->next = *pos; + *pos = ce; + } +} + +/* + * We don't actually *remove* it, we can just mark it invalid so that + * we won't find it in lookups. + * + * Not only would we have to search the lists (simple enough), but + * we'd also have to rehash other hash buckets in case this makes the + * hash bucket empty (common). So it's much better to just mark + * it. + */ +static void remove_hash_entry(struct index_state *istate, struct cache_entry *ce) +{ + ce->ce_flags |= CE_UNHASHED; +} + +static void replace_index_entry(struct index_state *istate, int nr, struct cache_entry *ce) +{ + struct cache_entry *old = istate->cache[nr]; + + if (ce != old) { + remove_hash_entry(istate, old); + set_index_entry(istate, nr, ce); + } + istate->cache_changed = 1; +} + +int index_name_exists(struct index_state *istate, const char *name, int namelen) +{ + unsigned int hash = hash_name(name, namelen); + struct cache_entry *ce = lookup_hash(hash, &istate->name_hash); + + while (ce) { + if (!(ce->ce_flags & CE_UNHASHED)) { + if (!cache_name_compare(name, namelen, ce->name, ce->ce_flags)) + return 1; + } + ce = ce->next; + } + return 0; +} + /* * This only updates the "non-critical" parts of the directory * cache, ie the parts that aren't tracked by GIT, and only used @@ -323,6 +387,9 @@ int index_name_pos(struct index_state *istate, const char *name, int namelen) /* Remove entry, return true if there are more entries to go.. */ int remove_index_entry_at(struct index_state *istate, int pos) { + struct cache_entry *ce = istate->cache[pos]; + + remove_hash_entry(istate, ce); istate->cache_changed = 1; istate->cache_nr--; if (pos >= istate->cache_nr) @@ -697,8 +764,7 @@ static int add_index_entry_with_check(struct index_state *istate, struct cache_e /* existing match? Just replace it. */ if (pos >= 0) { - istate->cache_changed = 1; - istate->cache[pos] = ce; + replace_index_entry(istate, pos, ce); return 0; } pos = -pos-1; @@ -758,7 +824,7 @@ int add_index_entry(struct index_state *istate, struct cache_entry *ce, int opti memmove(istate->cache + pos + 1, istate->cache + pos, (istate->cache_nr - pos - 1) * sizeof(ce)); - istate->cache[pos] = ce; + set_index_entry(istate, pos, ce); istate->cache_changed = 1; return 0; } @@ -887,11 +953,8 @@ int refresh_index(struct index_state *istate, unsigned int flags, const char **p has_errors = 1; continue; } - istate->cache_changed = 1; - /* You can NOT just free istate->cache[i] here, since it - * might not be necessarily malloc()ed but can also come - * from mmap(). */ - istate->cache[i] = new; + + replace_index_entry(istate, i, new); } return has_errors; } @@ -966,6 +1029,20 @@ static void convert_from_disk(struct ondisk_cache_entry *ondisk, struct cache_en memcpy(ce->name, ondisk->name, len + 1); } +static inline size_t estimate_cache_size(size_t ondisk_size, unsigned int entries) +{ + long per_entry; + + per_entry = sizeof(struct cache_entry) - sizeof(struct ondisk_cache_entry); + + /* + * Alignment can cause differences. 
This should be "alignof", but + * since that's a gcc'ism, just use the size of a pointer. + */ + per_entry += sizeof(void *); + return ondisk_size + entries*per_entry; +} + /* remember to discard_cache() before reading a different cache! */ int read_index_from(struct index_state *istate, const char *path) { @@ -1016,7 +1093,7 @@ int read_index_from(struct index_state *istate, const char *path) * has room for a few more flags, we can allocate using the same * index size */ - istate->alloc = xmalloc(mmap_size); + istate->alloc = xmalloc(estimate_cache_size(mmap_size, istate->cache_nr)); src_offset = sizeof(*hdr); dst_offset = 0; @@ -1027,7 +1104,7 @@ int read_index_from(struct index_state *istate, const char *path) disk_ce = (struct ondisk_cache_entry *)((char *)mmap + src_offset); ce = (struct cache_entry *)((char *)istate->alloc + dst_offset); convert_from_disk(disk_ce, ce); - istate->cache[i] = ce; + set_index_entry(istate, i, ce); src_offset += ondisk_ce_size(ce); dst_offset += ce_size(ce); @@ -1065,6 +1142,7 @@ int discard_index(struct index_state *istate) istate->cache_nr = 0; istate->cache_changed = 0; istate->timestamp = 0; + free_hash(&istate->name_hash); cache_tree_free(&(istate->cache_tree)); free(istate->alloc); istate->alloc = NULL; -- 1.5.4.rc4.1130.g9ad85 ^ permalink raw reply related [flat|nested] 51+ messages in thread
* Re: I'm a total push-over.. 2008-01-23 3:19 ` Linus Torvalds @ 2008-01-25 6:50 ` Junio C Hamano 2008-01-25 16:24 ` Linus Torvalds 0 siblings, 1 reply; 51+ messages in thread From: Junio C Hamano @ 2008-01-25 6:50 UTC (permalink / raw) To: Linus Torvalds; +Cc: Git Mailing List Linus Torvalds <torvalds@linux-foundation.org> writes: > +static inline size_t estimate_cache_size(size_t ondisk_size, unsigned int entries) > +{ > + long per_entry; > + > + per_entry = sizeof(struct cache_entry) - sizeof(struct ondisk_cache_entry); > + > + /* > + * Alignment can cause differences. This should be "alignof", but > + * since that's a gcc'ism, just use the size of a pointer. > + */ > + per_entry += sizeof(void *); > + return ondisk_size + entries*per_entry; > +} > + I wonder if the issue Dave Miller addressed with 69ae517541ed5ab7d4fdcd8f82a9b8bd949df347 (fast-import: fix unalinged allocation and access) applies here. commit 69ae517541ed5ab7d4fdcd8f82a9b8bd949df347 Author: David S. Miller <davem@davemloft.net> Date: Fri Dec 14 20:39:16 2007 -0800 fast-import: fix unalinged allocation and access The specialized pool allocator fast-import uses aligned objects on the size of a pointer, which was not sufficient at least on Sparc. Instead, make the alignment for objects of type unitmax_t. Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Junio C Hamano <gitster@pobox.com> ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: I'm a total push-over.. 2008-01-25 6:50 ` Junio C Hamano @ 2008-01-25 16:24 ` Linus Torvalds 0 siblings, 0 replies; 51+ messages in thread From: Linus Torvalds @ 2008-01-25 16:24 UTC (permalink / raw) To: Junio C Hamano; +Cc: Git Mailing List On Thu, 24 Jan 2008, Junio C Hamano wrote: > > I wonder if the issue Dave Miller addressed with > 69ae517541ed5ab7d4fdcd8f82a9b8bd949df347 (fast-import: fix > unalinged allocation and access) applies here. Good point, although we actually do things wrong for *another* reason. We currently force cache_entry to be 8-byte aligned regardless of what the actual "sizeof(ptr)" is, so we should assume that alignment: #define cache_entry_size(len) ((offsetof(struct cache_entry,name) + (len) + 8) & ~7) and if that isn't correct, we'd need to change this #define. So right now, the right thing to do is probably to make this alignment explicit: #define CE_ALIGN 8 and then use that both in the "cache_entry_size()" _and_ in the "estimate_cache_size()" calculations to make it obvious what the alignment is. And then we could actually make the alignment less on architectures that don't need that much (there may be architectures that need more, but I doubt it: we don't have any large fields in that structure, so the structure alignment really probably does max out at 8 in practice even if the C language theory doesn't give you any such guarantees). Side note: this is not likely to be a problem in _practice_. The on-disk representation is also aligned (also by 8), and while they can be *differently* aligned due to the relative alignment of the varying-length "name[]" field, and that can cause some padding to be needed, in practice it will never matter. The on-disk size also contains a header that we don't take into account, so it's already "over-estimated" to begin with for the in-memory representation. So "estimate_cache_size()" really does over-estimate its needs by a biggish amount, which is why it all works regardless, but better safe than sorry. Linus ^ permalink raw reply [flat|nested] 51+ messages in thread
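
Spelled out, the explicit-alignment idea described above might look like the sketch below, given the struct definitions from cache.h. It mirrors the rounding of the existing cache_entry_size() macro (the +1 covers the NUL terminator), but it is what the mail proposes, not code that was committed:

#define CE_ALIGN 8	/* the alignment cache_entry_size() effectively assumes today */
#define ce_align(x) (((x) + CE_ALIGN - 1) & ~(CE_ALIGN - 1))

#define cache_entry_size(len) ce_align(offsetof(struct cache_entry, name) + (len) + 1)

static inline size_t estimate_cache_size(size_t ondisk_size, unsigned int entries)
{
	size_t per_entry = sizeof(struct cache_entry) - sizeof(struct ondisk_cache_entry);

	/* allow for the worst-case padding difference between the two layouts */
	per_entry += CE_ALIGN;
	return ondisk_size + entries*per_entry;
}
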
* Re: I'm a total push-over.. 2008-01-23 2:58 ` Linus Torvalds 2008-01-23 3:19 ` Linus Torvalds @ 2008-01-23 7:23 ` Junio C Hamano 2008-01-23 12:25 ` Johannes Schindelin 1 sibling, 1 reply; 51+ messages in thread From: Junio C Hamano @ 2008-01-23 7:23 UTC (permalink / raw) To: Linus Torvalds; +Cc: Git Mailing List Linus Torvalds <torvalds@linux-foundation.org> writes: > Basically, I dislike having two copies of the same data. If something can > be computed from something else, then only the original data should exist, > and the other thing should be recomputed. Yes, I agree with that in principle. Storing computable values makes sense only when it is expensive to recompute. We did not have cache-tree for quite a long time until you noticed that it was rather expensive and wasteful to recompute tree objects from unchanged parts of the index every time. It's the same argument; when the hashing performance starts to become noticeable, we can think about storing and reusing it, not before. > I did consider doing the indexing only on demand, and we can certainly > simply just "turn it off" when we know it's never going to get used (ie > "git ls-files"). So in that sense, it's easy to get rid of the overhead, > but it didn't really seem like the conceptual complexity (even if it's > just a couple of lines) is really worth it. It's not like git ls-files is > really performance-critical anyway. Yes, ls-files is cheap. So is lstat(2) on Linux. It only matters when you do it many many times. In any case, the change does not look too bad. The best time (real) of running git-ls-files in the kernel repository on my box is 0.010s vs 0.011s (10% improvement, heh!, which is the same as the master version) and empty commit is both 0.082s (no change). -- >8 -- [PATCH] lazy index hashing This delays the hashing of index names until it becomes necessary for the first time. Signed-off-by: Junio C Hamano <gitster@pobox.com> --- cache.h | 1 + read-cache.c | 26 +++++++++++++++++++++++--- 2 files changed, 24 insertions(+), 3 deletions(-) diff --git a/cache.h b/cache.h index 409738c..e4aeff0 100644 --- a/cache.h +++ b/cache.h @@ -191,6 +191,7 @@ struct index_state { struct cache_tree *cache_tree; time_t timestamp; void *alloc; + unsigned name_hash_initialized : 1; struct hash_table name_hash; }; diff --git a/read-cache.c b/read-cache.c index 9477c0b..e45f4b3 100644 --- a/read-cache.c +++ b/read-cache.c @@ -34,12 +34,11 @@ static unsigned int hash_name(const char *name, int namelen) return hash; } -static void set_index_entry(struct index_state *istate, int nr, struct cache_entry *ce) +static void hash_index_entry(struct index_state *istate, struct cache_entry *ce) { void **pos; unsigned int hash = hash_name(ce->name, ce_namelen(ce)); - istate->cache[nr] = ce; pos = insert_hash(hash, ce, &istate->name_hash); if (pos) { ce->next = *pos; @@ -47,6 +46,24 @@ static void set_index_entry(struct index_state *istate, int nr, struct cache_ent } } +static void lazy_init_name_hash(struct index_state *istate) +{ + int nr; + + if (istate->name_hash_initialized) + return; + for (nr = 0; nr < istate->cache_nr; nr++) + hash_index_entry(istate, istate->cache[nr]); + istate->name_hash_initialized = 1; +} + +static void set_index_entry(struct index_state *istate, int nr, struct cache_entry *ce) +{ + istate->cache[nr] = ce; + if (istate->name_hash_initialized) + hash_index_entry(istate, ce); +} + /* * We don't actually *remove* it, we can just mark it invalid so that * we won't find it in lookups. 
@@ -75,7 +92,10 @@ static void replace_index_entry(struct index_state *istate, int nr, struct cache int index_name_exists(struct index_state *istate, const char *name, int namelen) { unsigned int hash = hash_name(name, namelen); - struct cache_entry *ce = lookup_hash(hash, &istate->name_hash); + struct cache_entry *ce; + + lazy_init_name_hash(istate); + ce = lookup_hash(hash, &istate->name_hash); while (ce) { if (!(ce->ce_flags & CE_UNHASHED)) { -- 1.5.4.rc4.14.g6fc74 ^ permalink raw reply related [flat|nested] 51+ messages in thread
* Re: I'm a total push-over.. 2008-01-23 7:23 ` Junio C Hamano @ 2008-01-23 12:25 ` Johannes Schindelin 2008-01-23 16:25 ` Linus Torvalds 0 siblings, 1 reply; 51+ messages in thread From: Johannes Schindelin @ 2008-01-23 12:25 UTC (permalink / raw) To: Junio C Hamano; +Cc: Linus Torvalds, Git Mailing List Hi, On Tue, 22 Jan 2008, Junio C Hamano wrote: > Linus Torvalds <torvalds@linux-foundation.org> writes: > > > Basically, I dislike having two copies of the same data. If something > > can be computed from something else, then only the original data > > should exist, and the other thing should be recomputed. > > Yes, I agree with that in principle. Storing computable values makes > sense only when it is expensive to recompute. We did not have > cache-tree for quite a long time until you noticed that it was rather > expensive and wasteful to recompute tree objects from unchanged parts of > the index every time. > > It's the same argument; when the hashing performance starts to become > noticeable, we can think about storing and reusing it, not before. I fully expect it to be noticable with that UTF-8 "normalisation". But then, the infrastructure is there, and whoever has an itch to scratch... Ciao, Dscho ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: I'm a total push-over.. 2008-01-23 12:25 ` Johannes Schindelin @ 2008-01-23 16:25 ` Linus Torvalds 2008-01-23 16:34 ` Johannes Schindelin 0 siblings, 1 reply; 51+ messages in thread From: Linus Torvalds @ 2008-01-23 16:25 UTC (permalink / raw) To: Johannes Schindelin; +Cc: Junio C Hamano, Git Mailing List On Wed, 23 Jan 2008, Johannes Schindelin wrote: > > I fully expect it to be noticable with that UTF-8 "normalisation". But > then, the infrastructure is there, and whoever has an itch to scratch... Actually, it's going to be totally invisible even with UTF-8 normalization, because we're going to do it sanely. And by "sanely" I mean just having the code test the high bit, and using US-ASCII as-is (possibly with that " & ~0x20 " thing to ignore case in it). End result: practically all projects will never notice anything at all for 99.9% of all files. One extra well-predicted branch, and a few more hash collissions for cases where you have both "Makefile" and "makefile" etc. Doing names with *lots* of UTF-8 characters will be rather slower. It's still not horrible to do if you do it the smart way, though. In fact, it's pretty simple, just a few table lookups (one to find the NFD form, one to do the upcasing). And yes, for hashing, it makes sense to turn things into NFD because it's generally simpler, but the point is that you really don't actually modify the name itself at all, you just hash things (or compare things) character by expanded character. IOW, only a total *moron* does Unicode name comparisons with strcmp(convert_to_nfd(a), convert_to_nfd(b)); which is essentially what Apple does. It's quite possible to do utf8_nfd_strcmp(a,b) and (a) do it tons and tons faster and (b) never have to modify the strings themselves. Same goes (even more) for hashing. Linus ^ permalink raw reply [flat|nested] 51+ messages in thread
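
The branch structure described above can be sketched as follows. The folding applied to ASCII only has to be consistent, not a linguistically correct upcasing, and fold_utf8() is a hypothetical stand-in for the NFD-plus-upcase table lookups on the rare multi-byte path; it is not an existing function:

extern unsigned int fold_utf8(const char **name, int *namelen, unsigned int hash);

static unsigned int hash_name_utf8(const char *name, int namelen)
{
	unsigned int hash = 0x123;

	while (namelen > 0) {
		unsigned char c = *name;
		if (!(c & 0x80)) {
			/* US-ASCII: the 99.9% case, one well-predicted branch per byte */
			name++;
			namelen--;
			hash = hash*101 + (c & ~0x20);
		} else {
			/* high bit set: decompose and upcase via table lookups */
			hash = fold_utf8(&name, &namelen, hash);
		}
	}
	return hash;
}
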
* Re: I'm a total push-over.. 2008-01-23 16:25 ` Linus Torvalds @ 2008-01-23 16:34 ` Johannes Schindelin 2008-01-23 17:09 ` Linus Torvalds 0 siblings, 1 reply; 51+ messages in thread From: Johannes Schindelin @ 2008-01-23 16:34 UTC (permalink / raw) To: Linus Torvalds; +Cc: Junio C Hamano, Git Mailing List Hi, On Wed, 23 Jan 2008, Linus Torvalds wrote: > On Wed, 23 Jan 2008, Johannes Schindelin wrote: > > > > I fully expect it to be noticable with that UTF-8 "normalisation". > > But then, the infrastructure is there, and whoever has an itch to > > scratch... > > Actually, it's going to be totally invisible even with UTF-8 > normalization, because we're going to do it sanely. > > And by "sanely" I mean just having the code test the high bit, and using > US-ASCII as-is (possibly with that " & ~0x20 " thing to ignore case in > it). > > End result: practically all projects will never notice anything at all for > 99.9% of all files. One extra well-predicted branch, and a few more hash > collissions for cases where you have both "Makefile" and "makefile" etc. Well, that's the point, to avoid having both "Makefile" and "makefile" in your repository when you are on case-challenged filesystems, right? > Doing names with *lots* of UTF-8 characters will be rather slower. It's > still not horrible to do if you do it the smart way, though. In fact, > it's pretty simple, just a few table lookups (one to find the NFD form, > one to do the upcasing). > > And yes, for hashing, it makes sense to turn things into NFD because > it's generally simpler, but the point is that you really don't actually > modify the name itself at all, you just hash things (or compare things) > character by expanded character. > > IOW, only a total *moron* does Unicode name comparisons with > > strcmp(convert_to_nfd(a), convert_to_nfd(b)); > > which is essentially what Apple does. Heh, indeed that is what I would have done as an initial step (out of laziness). > It's quite possible to do > > utf8_nfd_strcmp(a,b) > > and (a) do it tons and tons faster and (b) never have to modify the > strings themselves. Same goes (even more) for hashing. Okay. Point taken. But I really hope that you are not proposing to use the case-ignoring hash when we are _not_ on a case-challenged filesystem... Ciao, Dscho ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: I'm a total push-over.. 2008-01-23 16:34 ` Johannes Schindelin @ 2008-01-23 17:09 ` Linus Torvalds 2008-01-23 17:29 ` Linus Torvalds 0 siblings, 1 reply; 51+ messages in thread
From: Linus Torvalds @ 2008-01-23 17:09 UTC (permalink / raw)
To: Johannes Schindelin; +Cc: Junio C Hamano, Git Mailing List

On Wed, 23 Jan 2008, Johannes Schindelin wrote:
>
> > End result: practically all projects will never notice anything at all for
> > 99.9% of all files. One extra well-predicted branch, and a few more hash
> > collissions for cases where you have both "Makefile" and "makefile" etc.
>
> Well, that's the point, to avoid having both "Makefile" and "makefile" in
> your repository when you are on case-challenged filesystems, right?

Right. But what I'm saying is that this is *really* cheap to test for US-ASCII-only characters, and if only 0.1% of all filenames have unicode in them, the fact that they are much more expensive isn't even going to be noticeable. Except for some very odd-ball environments.

> > It's quite possible to do
> >
> > 	utf8_nfd_strcmp(a,b)
> >
> > and (a) do it tons and tons faster and (b) never have to modify the
> > strings themselves. Same goes (even more) for hashing.
>
> Okay. Point taken.

Note that one reason the above is tons faster is that even with complex unicode, the *common* case is going to be that the names match with a binary compare.

> But I really hope that you are not proposing to use the case-ignoring
> hash when we are _not_ on a case-challenged filesystem...

I actually suspect that we could, and nobody will notice. The hash would cause a few more collisions, but not so you'd know.

And the thing is, people who work with other people who are on case-challenged systems would still want to have the case-insensitive compare too - although it should just warn, not actually "work".

		Linus
^ permalink raw reply [flat|nested] 51+ messages in thread
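
The "tons faster" claim rests on a byte-exact fast path: only names that differ as raw bytes ever reach the expensive per-character walk. A rough sketch of such a utf8_nfd_strcmp() follows; normalize_next() is a hypothetical helper that would return the next case-folded, NFD-expanded code point, and no such function exists in git:

#include <string.h>

extern unsigned int normalize_next(const char **p, const char *end);

static int utf8_nfd_strcmp(const char *a, int alen, const char *b, int blen)
{
	const char *aend = a + alen, *bend = b + blen;

	/* common case: the names are byte-identical */
	if (alen == blen && !memcmp(a, b, alen))
		return 0;

	/* slow path: compare in expanded, folded form without modifying either string */
	while (a < aend && b < bend) {
		unsigned int ca = normalize_next(&a, aend);
		unsigned int cb = normalize_next(&b, bend);
		if (ca != cb)
			return ca < cb ? -1 : 1;
	}
	return (a < aend) - (b < bend);
}
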
* Re: I'm a total push-over.. 2008-01-23 17:09 ` Linus Torvalds @ 2008-01-23 17:29 ` Linus Torvalds 2008-01-25 5:21 ` Jeremy Maitin-Shepard 0 siblings, 1 reply; 51+ messages in thread From: Linus Torvalds @ 2008-01-23 17:29 UTC (permalink / raw) To: Johannes Schindelin; +Cc: Junio C Hamano, Git Mailing List On Wed, 23 Jan 2008, Linus Torvalds wrote: > > > But I really hope that you are not proposing to use the case-ignoring > > hash when we are _not_ on a case-challenged filesystem... > > I actually suspect that we could, and nobody will notice. The hash would > cause a few more collissions, but not so you'd know. To clarify: the thing I want to point out that the decision to *hash* the filenames in a case-insensitive hash, is very different from the decision to then *compare* the filenames when traversing the hash with a case-insensitive compare. And this difference is actually very important. Hashing things together that are "equivalent" according to any random rule is what makes it possible to then *check* for equivalence cheaply (because you only need to make the potentially expensive check with the subset of cases where it might trigger), but it in no way forces you to actually recode or mangle or compare things equivalently. In fact, I'd argue that this is what HFS+ did wrong in the first place: they had stupid/incompetent people who didn't understand about this, so they normalized the string *before* the hashing rather than as part of the hash itself, and thus actually corrupt the string itself. So what you can do (and I'd argue that we do) is to have a hash that can handle almost arbitrary input, but then never corrupt the filename, and always compare exactly by default. Then, depending on a config option, we can decide to change the compare so that equivalent (according to whatever rule) filenames either cause a warning (people on sane filesystems, but working with people who aren't), or are silently considered the same file (people on insane filesystems). Linus ^ permalink raw reply [flat|nested] 51+ messages in thread
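
Putting the pieces together, the design separates a forgiving hash, an exact comparison by default, and a configurable reaction to merely "equivalent" names. The lookup below is a sketch of that separation; the policy enum and ce_name_equivalent() are illustrative assumptions, not existing git code:

enum eqv_policy { EQV_EXACT, EQV_WARN, EQV_SAME };

extern int ce_name_equivalent(const char *name, int namelen, const struct cache_entry *ce);

static struct cache_entry *lookup_index_name(struct index_state *istate,
					     const char *name, int namelen,
					     enum eqv_policy policy)
{
	unsigned int hash = hash_name(name, namelen);	/* equivalence-class hash */
	struct cache_entry *ce = lookup_hash(hash, &istate->name_hash);

	for (; ce; ce = ce->next) {
		if (ce->ce_flags & CE_UNHASHED)
			continue;
		/* default behaviour: exact, byte-for-byte match */
		if (!cache_name_compare(name, namelen, ce->name, ce->ce_flags))
			return ce;
		/* merely "equivalent" names are a policy decision */
		if (policy != EQV_EXACT && ce_name_equivalent(name, namelen, ce)) {
			if (policy == EQV_SAME)
				return ce;
			warning("'%.*s' collides with '%s' under the equivalence rule",
				namelen, name, ce->name);
		}
	}
	return NULL;
}
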
* Re: I'm a total push-over.. 2008-01-23 17:29 ` Linus Torvalds @ 2008-01-25 5:21 ` Jeremy Maitin-Shepard 2008-01-25 12:51 ` Johannes Schindelin 0 siblings, 1 reply; 51+ messages in thread From: Jeremy Maitin-Shepard @ 2008-01-25 5:21 UTC (permalink / raw) To: Linus Torvalds; +Cc: git Linus Torvalds wrote: [snip] > So what you can do (and I'd argue that we do) is to have a hash that can > handle almost arbitrary input, but then never corrupt the filename, and > always compare exactly by default. In general, there may be a large number of comparison function options that git will eventually support, and they will likely not all form a single chain of increasing "strictness". Given that the hash values aren't even being stored on disk (and if they were, a simple approach of also storing an identifier for the hash function to know whether they stored values are still valid could be used), having a chain of increasingly "strict" comparison functions and using a hash function that corresponds to the least strict one is useful for exactly one reason: giving (possibly several different levels of) non-fatal warnings for various types of duplicates. But since multiple hash functions will be needed anyway to support different notions of case-insensitivity, if the warning is not enabled, there is no reason to use a case-insensitive hash function with a byte-exact comparison. -- Jeremy Maitin-Shepard ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: I'm a total push-over.. 2008-01-25 5:21 ` Jeremy Maitin-Shepard @ 2008-01-25 12:51 ` Johannes Schindelin 2008-01-25 18:19 ` Jeremy Maitin-Shepard 0 siblings, 1 reply; 51+ messages in thread From: Johannes Schindelin @ 2008-01-25 12:51 UTC (permalink / raw) To: Jeremy Maitin-Shepard; +Cc: Linus Torvalds, git Hi, On Fri, 25 Jan 2008, Jeremy Maitin-Shepard wrote: > But since multiple hash functions will be needed anyway to support > different notions of case-insensitivity, if the warning is not enabled, > there is no reason to use a case-insensitive hash function with a > byte-exact comparison. No, only multiple compare functions will be needed. The hash function can be built in such a manner that it guarantees that file names being equal with _any_ of the compare functions fall into the same bucket. The upside of such a hash function: less code to maintain. Hth, Dscho ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: I'm a total push-over.. 2008-01-25 12:51 ` Johannes Schindelin @ 2008-01-25 18:19 ` Jeremy Maitin-Shepard 2008-01-25 18:24 ` Johannes Schindelin 2008-01-25 19:07 ` Junio C Hamano 0 siblings, 2 replies; 51+ messages in thread From: Jeremy Maitin-Shepard @ 2008-01-25 18:19 UTC (permalink / raw) To: Johannes Schindelin; +Cc: Linus Torvalds, git Johannes Schindelin <Johannes.Schindelin@gmx.de> writes: > On Fri, 25 Jan 2008, Jeremy Maitin-Shepard wrote: >> But since multiple hash functions will be needed anyway to support >> different notions of case-insensitivity, if the warning is not enabled, >> there is no reason to use a case-insensitive hash function with a >> byte-exact comparison. > No, only multiple compare functions will be needed. The hash function can > be built in such a manner that it guarantees that file names being equal > with _any_ of the compare functions fall into the same bucket. In theory, I agree that this is possible, but in practice it may not be reasonable at all. Consider two possible comparison functions: 1. compare file names as strings case-insensitively assuming a latin 1 encoding 2. compare file names as strings case-insensitively assuming a UTF-8 encoding Actually writing a hash function such that two strings hash to the same value if either of these comparison functions says that the strings are equal would appear to be rather difficult. > The upside of such a hash function: less code to maintain. A simple hash function that doesn't try to do anything regarding case-insensitivity is extremely short and simple and therefore is hardly a maintenance burden. Although in some cases it is possible to "share" a hash function, except for the "warning" purpose, actually doing so doesn't make much sense. Using the "case-insensitive" hash function when you intend to use an "exact" comparison function just amounts to using a hash function that is unequivocally worse: it is slower, more complicated, and has a higher collision rate. -- Jeremy Maitin-Shepard ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: I'm a total push-over.. 2008-01-25 18:19 ` Jeremy Maitin-Shepard @ 2008-01-25 18:24 ` Johannes Schindelin 2008-01-25 19:07 ` Junio C Hamano 1 sibling, 0 replies; 51+ messages in thread From: Johannes Schindelin @ 2008-01-25 18:24 UTC (permalink / raw) To: Jeremy Maitin-Shepard; +Cc: Linus Torvalds, git Hi, On Fri, 25 Jan 2008, Jeremy Maitin-Shepard wrote: > Johannes Schindelin <Johannes.Schindelin@gmx.de> writes: > > > The upside of such a hash function: less code to maintain. > > A simple hash function that doesn't try to do anything regarding > case-insensitivity is extremely short and simple and therefore is hardly > a maintenance burden. You misunderstand me. If the complicated hash function is the one that is less exercised, you _will_ face problems. OTOH if you _already_ need the "complicated" hash function, there is _little_ point not to use it, and be consistent between platforms, _especially_ since now all people eat the same dog food. So I never thought about the simple hash function as being a burden. Hth, Dscho ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: I'm a total push-over.. 2008-01-25 18:19 ` Jeremy Maitin-Shepard 2008-01-25 18:24 ` Johannes Schindelin @ 2008-01-25 19:07 ` Junio C Hamano 1 sibling, 0 replies; 51+ messages in thread From: Junio C Hamano @ 2008-01-25 19:07 UTC (permalink / raw) To: Jeremy Maitin-Shepard; +Cc: Johannes Schindelin, Linus Torvalds, git Jeremy Maitin-Shepard <jbms@cmu.edu> writes: > In theory, I agree that this is possible, but in practice it may not be > reasonable at all. Consider two possible comparison functions: > > 1. compare file names as strings case-insensitively assuming a latin 1 > encoding > > 2. compare file names as strings case-insensitively assuming a UTF-8 > encoding > > Actually writing a hash function such that two strings hash to the same > value if either of these comparison functions says that the strings are > equal would appear to be rather difficult. Once you start adding more "case folding" filesystems to the repertoire, such a unified hash function as Dscho suggests needs to throw paths that the other (N-1) "case folding" filesystems treat as distinct, but which a single filesystem treats as "equivalent", into the same hash bucket. I would say it is not just difficult: the resulting function would have so many collisions that it would be ineffective. ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: I'm a total push-over.. 2008-01-22 23:37 I'm a total push-over Linus Torvalds 2008-01-23 1:35 ` Kevin Ballard 2008-01-23 2:23 ` Junio C Hamano @ 2008-01-23 8:32 ` Andreas Ericsson 2008-01-23 9:15 ` Dmitry Potapov 2008-01-23 16:06 ` Linus Torvalds 2 siblings, 2 replies; 51+ messages in thread From: Andreas Ericsson @ 2008-01-23 8:32 UTC (permalink / raw) To: Linus Torvalds; +Cc: Git Mailing List, Junio C Hamano Linus Torvalds wrote: > > NOTE NOTE NOTE! The current hash is a joke. Insofar as hashes go, it's not that shabby for hashing filenames. Here's the test output from a small hash-comparison program I've got, which runs the test-input through a series of different hashes to compare dispersion, collisions, lookup- and insert times and other things that are interesting from a practical PoV. Fowler/Noll/Vo cheap hash Collisions: 1829, 7.91%. Depth max: 3, average: 0.09. Time spent in lookups: 167.859ms. Per entry: 0.145us Time spent inserting: 2.753ms. Per entry: 0.119us PHP Zend cheap hash Collisions: 1908, 8.25%. Depth max: 3, average: 0.09. Time spent in lookups: 171.819ms. Per entry: 0.149us Time spent inserting: 2.778ms. Per entry: 0.120us Phong Vo's linear congruential hash Collisions: 1996, 8.63%. Depth max: 3, average: 0.09. Time spent in lookups: 168.276ms. Per entry: 0.146us Time spent inserting: 2.840ms. Per entry: 0.123us Phong Vo's second linear congruential hash Collisions: 1933, 8.36%. Depth max: 3, average: 0.09. Time spent in lookups: 170.416ms. Per entry: 0.147us Time spent inserting: 2.774ms. Per entry: 0.120us Glib string hash Collisions: 1907, 8.24%. Depth max: 3, average: 0.09. Time spent in lookups: 192.420ms. Per entry: 0.166us Time spent inserting: 3.154ms. Per entry: 0.136us sdbm hash Collisions: 1899, 8.21%. Depth max: 3, average: 0.09. Time spent in lookups: 170.797ms. Per entry: 0.148us Time spent inserting: 2.724ms. Per entry: 0.118us bfd hash Collisions: 1949, 8.43%. Depth max: 3, average: 0.09. Time spent in lookups: 206.504ms. Per entry: 0.179us Time spent inserting: 3.241ms. Per entry: 0.140us Linus' GIT hash Collisions: 1946, 8.41%. Depth max: 4, average: 0.09. Time spent in lookups: 179.336ms. Per entry: 0.155us Time spent inserting: 2.781ms. Per entry: 0.120us This is with 64K buckets, which (with my implementation) means a total hash-table size of 256KiB. The test-input is a simple file-listing from the linux kernel (although it has the .git directory included too). As you can see, it loses out on mathematical correctness, as it has more collisions (but not that many). OTOH, the simplicity of the implementation makes it a viable option anyway, since the time spent in insertion (which is almost completely inside the hash-function) is on average so much shorter. For a temporary hash-table, it will work splendidly. It will probably have issues scaling to, say, 30 or 40 million input lines (fairly typical test-data for database hashes, fe). Note that real-world timings will be shorter. My test-program doesn't have the luxury of keeping hash-functions inline, so some compiler optimizations become impossible. This is on a Intel(R) Core(TM)2 Duo CPU T7700 @ 2.40GHz (according to /proc/cpuinfo), with a cache size of 4meg. The use of multiplication is sane though, as it means no arch will suffer greatly. The FNV hash would be better (pasted below), but I doubt anyone will ever care, and there will be larger differences between architectures with this one than the lt_git hash (well, a function's gotta have a name). 
/* * Fowler/Noll/Vo hash * * The basis of the hash algorithm was taken from an idea sent by email to the * IEEE Posix P1003.2 mailing list from Phong Vo (kpv@research.att.com) and * Glenn Fowler (gsf@research.att.com). Landon Curt Noll (chongo@toad.com) * later improved on their algorithm. * * The magic is in the interesting relationship between the special prime * 16777619 (2^24 + 403) and 2^32 and 2^8. * * This hash produces the fewest collisions of any function that we've seen so * far, and works well on both numbers and strings. * * (Last comment from MySQL code) * */ u32 FNV1(u8 *k, u32 len) { u8 *e; u32 h; e = k + len; for (h = 0; k < e; k++) { h *= 16777619; h ^= *k; } return (h); } I could provide figures for other table-sizes too, if anyone's interested. -- Andreas Ericsson andreas.ericsson@op5.se OP5 AB www.op5.se Tel: +46 8-230225 Fax: +46 8-230231 ^ permalink raw reply [flat|nested] 51+ messages in thread
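A minimal sketch of the kind of dispersion test described above, under the assumption that filenames arrive one per line on stdin and are hashed into 64K buckets; the harness itself (bucket count, I/O, reporting) is invented for illustration, while FNV1() is the function pasted above:

#include <stdio.h>
#include <string.h>

typedef unsigned int u32;
typedef unsigned char u8;

static u32 FNV1(u8 *k, u32 len)
{
        u8 *e = k + len;
        u32 h;

        for (h = 0; k < e; k++) {
                h *= 16777619;
                h ^= *k;
        }
        return h;
}

int main(void)
{
        static u32 depth[1 << 16];              /* 64K buckets */
        char line[4096];
        unsigned long entries = 0, collisions = 0;

        while (fgets(line, sizeof(line), stdin)) {
                u32 len = (u32)strcspn(line, "\n");
                u32 bucket = FNV1((u8 *)line, len) & 0xffff;

                if (depth[bucket]++)            /* bucket already occupied */
                        collisions++;
                entries++;
        }
        printf("%lu entries, %lu collisions (%.2f%%)\n", entries, collisions,
               entries ? 100.0 * collisions / entries : 0.0);
        return 0;
}

Piping the output of git ls-files into something like this should give roughly the kind of collision percentages quoted above.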
* Re: I'm a total push-over.. 2008-01-23 8:32 ` Andreas Ericsson @ 2008-01-23 9:15 ` Dmitry Potapov 2008-01-23 9:31 ` Andreas Ericsson 2008-01-23 16:06 ` Linus Torvalds 1 sibling, 1 reply; 51+ messages in thread From: Dmitry Potapov @ 2008-01-23 9:15 UTC (permalink / raw) To: Andreas Ericsson; +Cc: Linus Torvalds, Git Mailing List, Junio C Hamano On Wed, Jan 23, 2008 at 09:32:54AM +0100, Andreas Ericsson wrote: > > The FNV hash would be better (pasted below), but I doubt > anyone will ever care, and there will be larger differences > between architectures with this one than the lt_git hash (well, > a function's gotta have a name). Actually, Bob Jenkins' lookup3 hash is twice faster in my tests than FNV, and also it is much less likely to have any collision. The description and some comparision with other hash can be found here: http://burtleburtle.net/bob/hash/doobs.html http://burtleburtle.net/bob/c/lookup3.c Perhaps, the second choice is Paul Hsieh's hash. http://www.azillionmonkeys.com/qed/hash.html Note: Paul Hsieh provides the table where he compares his hash with others. There is also the program he used. I ran his program on my computer, advantage of his over others was not so big on my computer. Moreover, his test includes an old version of Bob Jenkins' hash. The new version -- lookup3, which I mentione above, has about the same speed as Paul Hsieh's hash (with -O2) or even 12% faster when I used -O3 -march=athlon-xp. Also, Bob Jenkins' hash is better for non-x86 architectures. So, I believe it is the best hash for today. Dmitry ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: I'm a total push-over.. 2008-01-23 9:15 ` Dmitry Potapov @ 2008-01-23 9:31 ` Andreas Ericsson 2008-01-23 14:01 ` Marko Kreen 2008-01-23 17:10 ` Dmitry Potapov 0 siblings, 2 replies; 51+ messages in thread From: Andreas Ericsson @ 2008-01-23 9:31 UTC (permalink / raw) To: Dmitry Potapov; +Cc: Linus Torvalds, Git Mailing List, Junio C Hamano Dmitry Potapov wrote: > On Wed, Jan 23, 2008 at 09:32:54AM +0100, Andreas Ericsson wrote: >> The FNV hash would be better (pasted below), but I doubt >> anyone will ever care, and there will be larger differences >> between architectures with this one than the lt_git hash (well, >> a function's gotta have a name). > > Actually, Bob Jenkins' lookup3 hash is twice faster in my tests > than FNV, and also it is much less likely to have any collision. > >From http://burtleburtle.net/bob/hash/doobs.html --- FNV Hash I need to fill this in. Search the web for FNV hash. It's faster than my hash on Intel (because Intel has fast multiplication), but slower on most other platforms. Preliminary tests suggested it has decent distributions. --- My tests ran on Intel. I also noticed I had a few hashes commented out when doing the test, one of them being Paul Hsie's. For some reason, Jenkin's and Hsie's didn't perform well for me last time I used the comparison thing (I did a more thorough job back then, with tests running for several minutes per hash and table-size, so I commented out the poor candidates). I still believe that for this very simple case, the lookup3.c case is not very practical, as the code is that much more complicated, which was my main point with posting the comparison. Iow, not "switch to this hash, because it's better", but rather "the hash is not as bad as you think and will probably work well for all practical purposes". -- Andreas Ericsson andreas.ericsson@op5.se OP5 AB www.op5.se Tel: +46 8-230225 Fax: +46 8-230231 ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: I'm a total push-over.. 2008-01-23 9:31 ` Andreas Ericsson @ 2008-01-23 14:01 ` Marko Kreen 2008-01-23 14:39 ` Andreas Ericsson 2008-01-23 17:10 ` Dmitry Potapov 1 sibling, 1 reply; 51+ messages in thread From: Marko Kreen @ 2008-01-23 14:01 UTC (permalink / raw) To: Andreas Ericsson Cc: Dmitry Potapov, Linus Torvalds, Git Mailing List, Junio C Hamano On 1/23/08, Andreas Ericsson <ae@op5.se> wrote: > Dmitry Potapov wrote: > > On Wed, Jan 23, 2008 at 09:32:54AM +0100, Andreas Ericsson wrote: > >> The FNV hash would be better (pasted below), but I doubt > >> anyone will ever care, and there will be larger differences > >> between architectures with this one than the lt_git hash (well, > >> a function's gotta have a name). > > > > Actually, Bob Jenkins' lookup3 hash is twice faster in my tests > > than FNV, and also it is much less likely to have any collision. > > > > >From http://burtleburtle.net/bob/hash/doobs.html > --- > FNV Hash > > I need to fill this in. Search the web for FNV hash. It's faster than my hash on Intel (because Intel has fast multiplication), but slower on most other platforms. Preliminary tests suggested it has decent distributions. I suspect that this paragraph was about comparison with lookup2 (not lookup3) because lookup3 beat easily all the "simple" hashes in my testing. Only competitor was Hsieh one which was like 50:50 faster or slower depending on alignment / compiler / cpu. > --- > > My tests ran on Intel. I also noticed I had a few hashes commented out when > doing the test, one of them being Paul Hsie's. For some reason, Jenkin's and > Hsie's didn't perform well for me last time I used the comparison thing (I > did a more thorough job back then, with tests running for several minutes > per hash and table-size, so I commented out the poor candidates). > > I still believe that for this very simple case, the lookup3.c case is not > very practical, as the code is that much more complicated, which was my > main point with posting the comparison. Iow, not "switch to this hash, > because it's better", but rather "the hash is not as bad as you think and > will probably work well for all practical purposes". If you don't mind few percent speed penalty compared to Jenkings own optimized version, you can use my simplified version: http://repo.or.cz/w/pgbouncer.git?a=blob;f=src/hash.c;h=5c9a73639ad098c296c0be562c34573189f3e083;hb=HEAD It works always with "native" endianess, unlike Jenkins fixed-endian hashlittle() / hashbig(). It may or may not matter if you plan to write values on disk. Speed-wise it may be 10-30% slower worst case (in my case sparc-classic with unaligned data), but on x86, lucky gcc version and maybe also memcpy() hack seen in system.h, it tends to be ~10% faster, especially as it does always 4byte read in main loop. -- marko ^ permalink raw reply [flat|nested] 51+ messages in thread
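A rough illustration of the tail handling behind the own_memcpy/memcpy_hack variants mentioned above: instead of lookup3's switch over the last 0..11 bytes, the tail is copied into a zero-padded scratch block and read back as three aligned, native-endian 32-bit words. The names here are invented; this is not the code from the pgbouncer link:

#include <stdint.h>
#include <string.h>

static void fetch_tail(const void *data, size_t tail_len, uint32_t out[3])
{
        unsigned char block[12] = { 0 };        /* zero padding for short tails */

        memcpy(block, data, tail_len);          /* tail_len is 0..11 */
        memcpy(&out[0], block + 0, 4);          /* aligned, native-endian loads */
        memcpy(&out[1], block + 4, 4);
        memcpy(&out[2], block + 8, 4);
}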
* Re: I'm a total push-over.. 2008-01-23 14:01 ` Marko Kreen @ 2008-01-23 14:39 ` Andreas Ericsson 2008-01-24 6:51 ` Luke Lu 2008-01-24 13:19 ` Marko Kreen 0 siblings, 2 replies; 51+ messages in thread From: Andreas Ericsson @ 2008-01-23 14:39 UTC (permalink / raw) To: Marko Kreen Cc: Dmitry Potapov, Linus Torvalds, Git Mailing List, Junio C Hamano Marko Kreen wrote: > On 1/23/08, Andreas Ericsson <ae@op5.se> wrote: >> Dmitry Potapov wrote: >>> On Wed, Jan 23, 2008 at 09:32:54AM +0100, Andreas Ericsson wrote: >>>> The FNV hash would be better (pasted below), but I doubt >>>> anyone will ever care, and there will be larger differences >>>> between architectures with this one than the lt_git hash (well, >>>> a function's gotta have a name). >>> Actually, Bob Jenkins' lookup3 hash is twice faster in my tests >>> than FNV, and also it is much less likely to have any collision. >>> >> >From http://burtleburtle.net/bob/hash/doobs.html >> --- >> FNV Hash >> >> I need to fill this in. Search the web for FNV hash. It's faster than my hash on Intel (because Intel has fast multiplication), but slower on most other platforms. Preliminary tests suggested it has decent distributions. > > I suspect that this paragraph was about comparison with lookup2 It might be. It's from the link Dmitry posted in his reply to my original message. (something/something/doobs.html). > (not lookup3) because lookup3 beat easily all the "simple" hashes By how much? FNV beat Linus' hash by 0.01 microseconds / insertion, and 0.1 microsecons / lookup. We're talking about a case here where there will never be more lookups than insertions (unless I'm much mistaken). > > If you don't mind few percent speed penalty compared to Jenkings > own optimized version, you can use my simplified version: > > http://repo.or.cz/w/pgbouncer.git?a=blob;f=src/hash.c;h=5c9a73639ad098c296c0be562c34573189f3e083;hb=HEAD > I don't, but I don't care that deeply either. On the one hand, it would be nifty to have an excellent hash-function in git. On the other hand, it would look stupid with something that's quite clearly over-kill. > It works always with "native" endianess, unlike Jenkins fixed-endian > hashlittle() / hashbig(). It may or may not matter if you plan > to write values on disk. > > Speed-wise it may be 10-30% slower worst case (in my case sparc-classic > with unaligned data), but on x86, lucky gcc version and maybe > also memcpy() hack seen in system.h, it tends to be ~10% faster, > especially as it does always 4byte read in main loop. > It would have to be a significant improvement in wall-clock time on a test-case of hashing 30k strings to warrant going from 6 to 80 lines of code, imo. I still believe the original dumb hash Linus wrote is "good enough". On a side-note, it was very interesting reading, and I shall have to add jenkins3_mkreen() to my test-suite (although the "keep copyright note" license thing bugs me a bit). -- Andreas Ericsson andreas.ericsson@op5.se OP5 AB www.op5.se Tel: +46 8-230225 Fax: +46 8-230231 ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: I'm a total push-over.. 2008-01-23 14:39 ` Andreas Ericsson @ 2008-01-24 6:51 ` Luke Lu 2008-01-24 10:24 ` Andreas Ericsson 2008-01-24 13:19 ` Marko Kreen 1 sibling, 1 reply; 51+ messages in thread From: Luke Lu @ 2008-01-24 6:51 UTC (permalink / raw) To: Andreas Ericsson Cc: Marko Kreen, Dmitry Potapov, Linus Torvalds, Git Mailing List, Junio C Hamano On Jan 23, 2008, at 6:39 AM, Andreas Ericsson wrote: > Marko Kreen wrote: >> On 1/23/08, Andreas Ericsson <ae@op5.se> wrote: >>> Dmitry Potapov wrote: >>>> On Wed, Jan 23, 2008 at 09:32:54AM +0100, Andreas Ericsson wrote: >>>>> The FNV hash would be better (pasted below), but I doubt >>>>> anyone will ever care, and there will be larger differences >>>>> between architectures with this one than the lt_git hash (well, >>>>> a function's gotta have a name). >>>> Actually, Bob Jenkins' lookup3 hash is twice faster in my tests >>>> than FNV, and also it is much less likely to have any collision. >>>> >>> >From http://burtleburtle.net/bob/hash/doobs.html >>> --- >>> FNV Hash >>> >>> I need to fill this in. Search the web for FNV hash. It's faster >>> than my hash on Intel (because Intel has fast multiplication), >>> but slower on most other platforms. Preliminary tests suggested >>> it has decent distributions. >> I suspect that this paragraph was about comparison with lookup2 > > > It might be. It's from the link Dmitry posted in his reply to my > original > message. (something/something/doobs.html). > >> (not lookup3) because lookup3 beat easily all the "simple" hashes > > By how much? FNV beat Linus' hash by 0.01 microseconds / insertion, > and 0.1 microsecons / lookup. We're talking about a case here where > there will never be more lookups than insertions (unless I'm much > mistaken). > >> If you don't mind few percent speed penalty compared to Jenkings >> own optimized version, you can use my simplified version: >> http://repo.or.cz/w/pgbouncer.git?a=blob;f=src/ >> hash.c;h=5c9a73639ad098c296c0be562c34573189f3e083;hb=HEAD > > I don't, but I don't care that deeply either. On the one hand, > it would be nifty to have an excellent hash-function in git. > On the other hand, it would look stupid with something that's > quite clearly over-kill. > >> It works always with "native" endianess, unlike Jenkins fixed-endian >> hashlittle() / hashbig(). It may or may not matter if you plan >> to write values on disk. >> Speed-wise it may be 10-30% slower worst case (in my case sparc- >> classic >> with unaligned data), but on x86, lucky gcc version and maybe >> also memcpy() hack seen in system.h, it tends to be ~10% faster, >> especially as it does always 4byte read in main loop. > > It would have to be a significant improvement in wall-clock time > on a test-case of hashing 30k strings to warrant going from 6 to 80 > lines of code, imo. I still believe the original dumb hash Linus > wrote is "good enough". > > On a side-note, it was very interesting reading, and I shall have > to add jenkins3_mkreen() to my test-suite (although the "keep > copyright note" license thing bugs me a bit). Would you, for completeness' sake, please add Tcl and STL hashes to your test suite? The numbers are quite interesting. Is your test suite available somewhere, so we can test with our own data and hardware as well. 
Both the Tcl and the STL string hashes (the STL one is from SGI, probably the HP days, and is still the current default with g++) are extremely simple (excluding the loop constructs): Tcl: h += (h<<3) + c; // essentially *9+c (but works better on non-late-Intel CPUs) STL: h = h * 5 + c; // worse than above for most of my data Thanks, __Luke ^ permalink raw reply [flat|nested] 51+ messages in thread
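Written out as complete functions (the names are invented, not copied from Tcl or libstdc++ source), the two update rules above look like this:

static unsigned int tcl_string_hash(const char *s, int len)
{
        unsigned int h = 0;

        while (len--) {
                unsigned char c = *s++;
                h += (h << 3) + c;              /* h = h*9 + c */
        }
        return h;
}

static unsigned int stl_string_hash(const char *s, int len)
{
        unsigned int h = 0;

        while (len--) {
                unsigned char c = *s++;
                h = h * 5 + c;
        }
        return h;
}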
* Re: I'm a total push-over.. 2008-01-24 6:51 ` Luke Lu @ 2008-01-24 10:24 ` Andreas Ericsson 0 siblings, 0 replies; 51+ messages in thread From: Andreas Ericsson @ 2008-01-24 10:24 UTC (permalink / raw) To: Luke Lu Cc: Marko Kreen, Dmitry Potapov, Linus Torvalds, Git Mailing List, Junio C Hamano Luke Lu wrote: > On Jan 23, 2008, at 6:39 AM, Andreas Ericsson wrote: >> Marko Kreen wrote: >>> On 1/23/08, Andreas Ericsson <ae@op5.se> wrote: >>>> Dmitry Potapov wrote: >>>>> On Wed, Jan 23, 2008 at 09:32:54AM +0100, Andreas Ericsson wrote: >>>>>> The FNV hash would be better (pasted below), but I doubt >>>>>> anyone will ever care, and there will be larger differences >>>>>> between architectures with this one than the lt_git hash (well, >>>>>> a function's gotta have a name). >>>>> Actually, Bob Jenkins' lookup3 hash is twice faster in my tests >>>>> than FNV, and also it is much less likely to have any collision. >>>>> >>>> >From http://burtleburtle.net/bob/hash/doobs.html >>>> --- >>>> FNV Hash >>>> >>>> I need to fill this in. Search the web for FNV hash. It's faster >>>> than my hash on Intel (because Intel has fast multiplication), but >>>> slower on most other platforms. Preliminary tests suggested it has >>>> decent distributions. >>> I suspect that this paragraph was about comparison with lookup2 >> >> >> It might be. It's from the link Dmitry posted in his reply to my original >> message. (something/something/doobs.html). >> >>> (not lookup3) because lookup3 beat easily all the "simple" hashes >> >> By how much? FNV beat Linus' hash by 0.01 microseconds / insertion, >> and 0.1 microsecons / lookup. We're talking about a case here where >> there will never be more lookups than insertions (unless I'm much >> mistaken). >> >>> If you don't mind few percent speed penalty compared to Jenkings >>> own optimized version, you can use my simplified version: >>> >>> http://repo.or.cz/w/pgbouncer.git?a=blob;f=src/hash.c;h=5c9a73639ad098c296c0be562c34573189f3e083;hb=HEAD >>> >> >> I don't, but I don't care that deeply either. On the one hand, >> it would be nifty to have an excellent hash-function in git. >> On the other hand, it would look stupid with something that's >> quite clearly over-kill. >> >>> It works always with "native" endianess, unlike Jenkins fixed-endian >>> hashlittle() / hashbig(). It may or may not matter if you plan >>> to write values on disk. >>> Speed-wise it may be 10-30% slower worst case (in my case sparc-classic >>> with unaligned data), but on x86, lucky gcc version and maybe >>> also memcpy() hack seen in system.h, it tends to be ~10% faster, >>> especially as it does always 4byte read in main loop. >> >> It would have to be a significant improvement in wall-clock time >> on a test-case of hashing 30k strings to warrant going from 6 to 80 >> lines of code, imo. I still believe the original dumb hash Linus >> wrote is "good enough". >> >> On a side-note, it was very interesting reading, and I shall have >> to add jenkins3_mkreen() to my test-suite (although the "keep >> copyright note" license thing bugs me a bit). > > Would you, for completeness' sake, please add Tcl and STL hashes to your > test suite? I could do that. Or I just publish the entire ugly thing and let someone else add them ;-) > The numbers are quite interesting. Is your test suite > available somewhere, so we can test with our own data and hardware as > well. Not yet, no. I usually munge it up quite a lot when I want to test hashes for a specific input, so it's not what anyone would call "pretty". 
> Both Tcl hash and STL (from SGI probably HP days, still the > current default with g++) string hashes are extremely simple (excluding > the loop constructs): > > Tcl: h += (h<<3) + c; // essentially *9+c (but work better on > non-late-intels) > STL: h = h * 5 + c; // worse than above for most of my data > They sure do look simple enough. As for loop constructs, I've tried to use the same looping mechanics for everything, so as to let the algorithm be the only difference. Otherwise it gets tricky to do comparisons. The exceptions are of course hashes relying on Duff's device or similar alignment trickery. -- Andreas Ericsson andreas.ericsson@op5.se OP5 AB www.op5.se Tel: +46 8-230225 Fax: +46 8-230231 ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: I'm a total push-over.. 2008-01-23 14:39 ` Andreas Ericsson 2008-01-24 6:51 ` Luke Lu @ 2008-01-24 13:19 ` Marko Kreen 2008-01-24 16:00 ` Andreas Ericsson 1 sibling, 1 reply; 51+ messages in thread From: Marko Kreen @ 2008-01-24 13:19 UTC (permalink / raw) To: Andreas Ericsson Cc: Dmitry Potapov, Linus Torvalds, Git Mailing List, Junio C Hamano On 1/23/08, Andreas Ericsson <ae@op5.se> wrote: > Marko Kreen wrote: > > (not lookup3) because lookup3 beat easily all the "simple" hashes > > By how much? FNV beat Linus' hash by 0.01 microseconds / insertion, > and 0.1 microsecons / lookup. We're talking about a case here where > there will never be more lookups than insertions (unless I'm much > mistaken). FNV is around 40% slower than lookup3 on my Intel Core CPU, on 4byte aligned input. See below for more detailed info. > > If you don't mind few percent speed penalty compared to Jenkings > > own optimized version, you can use my simplified version: > > > > http://repo.or.cz/w/pgbouncer.git?a=blob;f=src/hash.c;h=5c9a73639ad098c296c0be562c34573189f3e083;hb=HEAD > > I don't, but I don't care that deeply either. On the one hand, > it would be nifty to have an excellent hash-function in git. > On the other hand, it would look stupid with something that's > quite clearly over-kill. Jenkins hash is fast because it does not look at individual bytes. If you _do_ want to look at them for unrelated reasons, (case-insensitive, unicode-juggling), then it obiously loses the point. That is, if you want to process the string in one go. > > It works always with "native" endianess, unlike Jenkins fixed-endian > > hashlittle() / hashbig(). It may or may not matter if you plan > > to write values on disk. > > > > Speed-wise it may be 10-30% slower worst case (in my case sparc-classic > > with unaligned data), but on x86, lucky gcc version and maybe > > also memcpy() hack seen in system.h, it tends to be ~10% faster, > > especially as it does always 4byte read in main loop. > > It would have to be a significant improvement in wall-clock time > on a test-case of hashing 30k strings to warrant going from 6 to 80 > lines of code, imo. I still believe the original dumb hash Linus > wrote is "good enough". Well, ad-hoc dumb hashes may have horrible worst-cases that you cant see with light testing. Therefore I'd still suggest some better researched dumb hash (eg. FNV or OAT). > On a side-note, it was very interesting reading, and I shall have > to add jenkins3_mkreen() to my test-suite (although the "keep > copyright note" license thing bugs me a bit). Sorry. I just used template boilerplate. Considering all the hard work was done by other people, it not proper to put under my own license. I tagged the file as 'public domain' and pushed out. Btw, the reason I started cleaning lookup3 was that at first I was scared of the complexity of Jenkins code and decided to go with Hsieh hash. Then I found out that Hsieh code is under totally werdo license (http://www.azillionmonkeys.com/qed/weblicense.html) so I could not use it. ==================================================================== Here is my raw-speed test of different hashes. Input is 4-byte aligned which should be common case for malloc()-ed strings. This also is best case for original lookup3(), on unaligned input the memcpy variants beat it easily. Input string length varies randomly in range 0..64. 
own_memcpy - last 12-byte memcpy() calls out to libc memcpy_hack - last memcpy is inlined bytewise copy loop: while (len--) *dst++ = *src++; Note that is is raw-speed test, if you benchmark larger code the hash difference probably matters less. -------------------------------------------------------------------- Testing: seed=34 align=4 minlen=0 maxlen=64 trycnt=2 duration=10 lookup3 : try=0: ... 247.4880 MB/s lookup3 : try=1: ... 247.6154 MB/s own_memcpy: try=0: ... 223.5508 MB/s own_memcpy: try=1: ... 223.5830 MB/s memcpy_hack: try=0: ... 241.2241 MB/s memcpy_hack: try=1: ... 241.2492 MB/s lookup2 : try=0: ... 190.2697 MB/s lookup2 : try=1: ... 190.3283 MB/s fnv : try=0: ... 153.0318 MB/s fnv : try=1: ... 153.0178 MB/s hsieh : try=0: ... 234.0468 MB/s hsieh : try=1: ... 234.0426 MB/s oat : try=0: ... 154.7804 MB/s oat : try=1: ... 154.8226 MB/s elf : try=0: ... 125.5892 MB/s elf : try=1: ... 125.5734 MB/s Results compared to reference: lookup3 : 100.000 % own_memcpy : 90.311 % memcpy_hack : 97.449 % lookup2 : 76.872 % fnv : 61.815 % hsieh : 94.544 % oat : 62.533 % elf : 50.729 % -- marko ^ permalink raw reply [flat|nested] 51+ messages in thread
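For reference, the "oat" entry above is presumably Bob Jenkins' one-at-a-time hash; in its usual published form it is about as small as FNV (transcribed from memory, so treat it as an approximation rather than the exact code that was benchmarked):

static unsigned int oat_hash(const unsigned char *key, int len)
{
        unsigned int h = 0;
        int i;

        for (i = 0; i < len; i++) {
                h += key[i];
                h += h << 10;
                h ^= h >> 6;
        }
        /* finalization */
        h += h << 3;
        h ^= h >> 11;
        h += h << 15;
        return h;
}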
* Re: I'm a total push-over.. 2008-01-24 13:19 ` Marko Kreen @ 2008-01-24 16:00 ` Andreas Ericsson 2008-01-24 16:13 ` Marko Kreen ` (2 more replies) 0 siblings, 3 replies; 51+ messages in thread From: Andreas Ericsson @ 2008-01-24 16:00 UTC (permalink / raw) To: Marko Kreen Cc: Dmitry Potapov, Linus Torvalds, Git Mailing List, Junio C Hamano Marko Kreen wrote: > On 1/23/08, Andreas Ericsson <ae@op5.se> wrote: >> Marko Kreen wrote: >>> (not lookup3) because lookup3 beat easily all the "simple" hashes >> By how much? FNV beat Linus' hash by 0.01 microseconds / insertion, >> and 0.1 microsecons / lookup. We're talking about a case here where >> there will never be more lookups than insertions (unless I'm much >> mistaken). > > FNV is around 40% slower than lookup3 on my Intel Core CPU, on 4byte aligned > input. See below for more detailed info. > But the tests surely need to check for unaligned cases, as that's what we're likely to hash, no? >>> If you don't mind few percent speed penalty compared to Jenkings >>> own optimized version, you can use my simplified version: >>> >>> http://repo.or.cz/w/pgbouncer.git?a=blob;f=src/hash.c;h=5c9a73639ad098c296c0be562c34573189f3e083;hb=HEAD >> I don't, but I don't care that deeply either. On the one hand, >> it would be nifty to have an excellent hash-function in git. >> On the other hand, it would look stupid with something that's >> quite clearly over-kill. > > Jenkins hash is fast because it does not look at individual bytes. > If you _do_ want to look at them for unrelated reasons, (case-insensitive, > unicode-juggling), then it obiously loses the point. That is, if you > want to process the string in one go. > I believe the ability to add unicode-juggling was a major point with the patch, so perhaps Jenkins' isn't such a good option. I'm not familiar with data-mangling the way Linus (or Theo Tso is), so I hadn't even considered that aspect of unrolled hashes. >> It would have to be a significant improvement in wall-clock time >> on a test-case of hashing 30k strings to warrant going from 6 to 80 >> lines of code, imo. I still believe the original dumb hash Linus >> wrote is "good enough". > > Well, ad-hoc dumb hashes may have horrible worst-cases that you cant > see with light testing. Therefore I'd still suggest some better > researched dumb hash (eg. FNV or OAT). > True. FNV is used in both MySQL and PostgreSQL. I'd say it's safe to assume it's fairly well tested. >> On a side-note, it was very interesting reading, and I shall have >> to add jenkins3_mkreen() to my test-suite (although the "keep >> copyright note" license thing bugs me a bit). > > Sorry. I just used template boilerplate. Considering all the > hard work was done by other people, it not proper to put under > my own license. I tagged the file as 'public domain' and pushed out. > Thanks. I'll see if I can add it, although it'll probably have to wait until I have reason to dig into it at work again. I've added the hash to the code-base, but not yet incorporated it into the test-case. > Btw, the reason I started cleaning lookup3 was that at first I was > scared of the complexity of Jenkins code and decided to go with > Hsieh hash. Then I found out that Hsieh code is under totally > werdo license (http://www.azillionmonkeys.com/qed/weblicense.html) > so I could not use it. > True. I was in contact with him a while back since I wanted to use it in an opensource project, but the licensing issues made me go with another one instead. 
The patch got turned down anyways, so it was a non-issue in the end, but... Ah well. > > Here is my raw-speed test of different hashes. Input is 4-byte > aligned which should be common case for malloc()-ed strings. Unless arena allocated, like we do in git. I'm not surprised that this test favours Jenkin's and Hsie's. That's to be expected as those benefit far more than simpler hashing algorithms for long strings. The overhead when trying shorter strings (say, between 3 and 15 chars, and not necessarily 4-byte aligned) sometimes make them quite a lot slower though. > This also is best case for original lookup3(), on unaligned > input the memcpy variants beat it easily. Input string > length varies randomly in range 0..64. > Well, memcpy() isn't very interesting to compare against hashes, as they test vastly different parts of the hardware's parts' performance. memcpy() should also perform exactly the same no matter what the test-data, which isn't always true for hashes. What *would* be interesting would be something along the lines of "duff_cpy()": ie, an unrolled loop that aligns itself and copies each byte to the same address each time. The bytewise equivalence would ofcourse be magic_cpy(unsigned char *k, int len) { unsigned char magic; do { magic = *k++; } while (--len); } > own_memcpy - last 12-byte memcpy() calls out to libc > memcpy_hack - last memcpy is inlined bytewise copy loop: > > while (len--) *dst++ = *src++; > > Note that is is raw-speed test, if you benchmark larger code the > hash difference probably matters less. > > -------------------------------------------------------------------- > > Testing: seed=34 align=4 minlen=0 maxlen=64 trycnt=2 duration=10 > > lookup3 : try=0: ... 247.4880 MB/s > lookup3 : try=1: ... 247.6154 MB/s > own_memcpy: try=0: ... 223.5508 MB/s > own_memcpy: try=1: ... 223.5830 MB/s > memcpy_hack: try=0: ... 241.2241 MB/s > memcpy_hack: try=1: ... 241.2492 MB/s > lookup2 : try=0: ... 190.2697 MB/s > lookup2 : try=1: ... 190.3283 MB/s > fnv : try=0: ... 153.0318 MB/s > fnv : try=1: ... 153.0178 MB/s > hsieh : try=0: ... 234.0468 MB/s > hsieh : try=1: ... 234.0426 MB/s > oat : try=0: ... 154.7804 MB/s > oat : try=1: ... 154.8226 MB/s > elf : try=0: ... 125.5892 MB/s > elf : try=1: ... 125.5734 MB/s > > Results compared to reference: > > lookup3 : 100.000 % > own_memcpy : 90.311 % > memcpy_hack : 97.449 % > lookup2 : 76.872 % > fnv : 61.815 % > hsieh : 94.544 % > oat : 62.533 % > elf : 50.729 % > > Interesting output, but not very surprising. Do you have the code available somewhere? -- Andreas Ericsson andreas.ericsson@op5.se OP5 AB www.op5.se Tel: +46 8-230225 Fax: +46 8-230231 ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: I'm a total push-over.. 2008-01-24 16:00 ` Andreas Ericsson @ 2008-01-24 16:13 ` Marko Kreen 2008-01-24 16:28 ` Dmitry Potapov 2008-01-25 20:08 ` Marko Kreen 2 siblings, 0 replies; 51+ messages in thread From: Marko Kreen @ 2008-01-24 16:13 UTC (permalink / raw) To: Andreas Ericsson Cc: Dmitry Potapov, Linus Torvalds, Git Mailing List, Junio C Hamano On 1/24/08, Andreas Ericsson <ae@op5.se> wrote: > Marko Kreen wrote: > > On 1/23/08, Andreas Ericsson <ae@op5.se> wrote: > >> Marko Kreen wrote: > >>> (not lookup3) because lookup3 beat easily all the "simple" hashes > >> By how much? FNV beat Linus' hash by 0.01 microseconds / insertion, > >> and 0.1 microsecons / lookup. We're talking about a case here where > >> there will never be more lookups than insertions (unless I'm much > >> mistaken). > > > > FNV is around 40% slower than lookup3 on my Intel Core CPU, on 4byte aligned > > input. See below for more detailed info. > > > > But the tests surely need to check for unaligned cases, as that's what > we're likely to hash, no? > >> It would have to be a significant improvement in wall-clock time > >> on a test-case of hashing 30k strings to warrant going from 6 to 80 > >> lines of code, imo. I still believe the original dumb hash Linus > >> wrote is "good enough". > > > > Well, ad-hoc dumb hashes may have horrible worst-cases that you cant > > see with light testing. Therefore I'd still suggest some better > > researched dumb hash (eg. FNV or OAT). > > > > True. FNV is used in both MySQL and PostgreSQL. I'd say it's safe to > assume it's fairly well tested. PostgreSQL uses lookup2... > > Here is my raw-speed test of different hashes. Input is 4-byte > > aligned which should be common case for malloc()-ed strings. > > Unless arena allocated, like we do in git. > > I'm not surprised that this test favours Jenkin's and Hsie's. > That's to be expected as those benefit far more than simpler > hashing algorithms for long strings. The overhead when trying > shorter strings (say, between 3 and 15 chars, and not necessarily > 4-byte aligned) sometimes make them quite a lot slower though. Ok, here is 0..15 chars, random alignment: Testing: seed=34 align=0 minlen=0 maxlen=15 trycnt=2 duration=10 lookup3 : try=0: ... 69.8092 MB/s lookup3 : try=1: ... 69.8146 MB/s own_memcpy: try=0: ... 66.7808 MB/s own_memcpy: try=1: ... 66.7814 MB/s memcpy_hack: try=0: ... 74.0635 MB/s memcpy_hack: try=1: ... 74.0518 MB/s lookup2 : try=0: ... 68.6582 MB/s lookup2 : try=1: ... 68.6634 MB/s fnv : try=0: ... 74.5098 MB/s fnv : try=1: ... 74.5283 MB/s hsieh : try=0: ... 71.6708 MB/s hsieh : try=1: ... 71.6814 MB/s oat : try=0: ... 74.7828 MB/s oat : try=1: ... 74.7716 MB/s elf : try=0: ... 65.2077 MB/s elf : try=1: ... 65.2128 MB/s Results compared to reference: lookup3 : 100.000 % own_memcpy : 95.659 % memcpy_hack : 106.082 % lookup2 : 98.351 % fnv : 106.743 % hsieh : 102.670 % oat : 107.112 % elf : 93.409 % > > This also is best case for original lookup3(), on unaligned > > input the memcpy variants beat it easily. Input string > > length varies randomly in range 0..64. > > > > Well, memcpy() isn't very interesting to compare against > hashes, as they test vastly different parts of the hardware's > parts' performance. memcpy() should also perform exactly the > same no matter what the test-data, which isn't always true for > hashes. Sorry, I meant my "simple-memcpy-based-lookup3". 
> What *would* be interesting would be something along the lines > of "duff_cpy()": ie, an unrolled loop that aligns itself and > copies each byte to the same address each time. How the hash fetches data from memory is _very_ relevant. > Interesting output, but not very surprising. Do you have the code > available somewhere? I can put it out. -- marko ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: I'm a total push-over.. 2008-01-24 16:00 ` Andreas Ericsson 2008-01-24 16:13 ` Marko Kreen @ 2008-01-24 16:28 ` Dmitry Potapov 2008-01-24 17:15 ` Linus Torvalds 2008-01-25 20:08 ` Marko Kreen 2 siblings, 1 reply; 51+ messages in thread From: Dmitry Potapov @ 2008-01-24 16:28 UTC (permalink / raw) To: Andreas Ericsson Cc: Marko Kreen, Linus Torvalds, Git Mailing List, Junio C Hamano On Jan 24, 2008 7:00 PM, Andreas Ericsson <ae@op5.se> wrote: > Marko Kreen wrote: > > > > Jenkins hash is fast because it does not look at individual bytes. > > If you _do_ want to look at them for unrelated reasons, (case-insensitive, > > unicode-juggling), then it obiously loses the point. That is, if you > > want to process the string in one go. > > > > I believe the ability to add unicode-juggling was a major point > with the patch, so perhaps Jenkins' isn't such a good option. I don't think you can do any meaningful unicode-juggling without converting symbols to UCS-4, and after that it makes much more sense to operate on uint32 values than on bytes. So, Jenkins' hash is still relevant, precisely because it does not operate on single bytes but on uint32 values. Dmitry ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: I'm a total push-over.. 2008-01-24 16:28 ` Dmitry Potapov @ 2008-01-24 17:15 ` Linus Torvalds 2008-01-24 18:45 ` Dmitry Potapov 2008-01-25 20:52 ` Marko Kreen 0 siblings, 2 replies; 51+ messages in thread From: Linus Torvalds @ 2008-01-24 17:15 UTC (permalink / raw) To: Dmitry Potapov Cc: Andreas Ericsson, Marko Kreen, Git Mailing List, Junio C Hamano On Thu, 24 Jan 2008, Dmitry Potapov wrote: > > I don't think you can any meaningful unicode-juggling without converting > symbols to UCS-4, and after that it makes much more sense to operate > with uint32 than bytes. So, Jenkins' hash is still relevant, just because > it does not operate on single bytes, but using uint32. No, no, no, NO! Egads! Why do people constantly do these totally idiotic things for Unicode? You can do a perfectly fine 8-bytes-at-a-time hash for almost 100% of all source code projects in UTF-8, without *ever* doing any format changes at all. Admittedly, it's a lot easier if the hash is a pure in-memory one (ie we don't care about byte-order or size of integers or anything like that), but that's the common case for most hashes that aren't used for BTree lookup on disk or something like that. Here, let me show you: unsigned int name_hash(const char *name, int size) { hash = HASH_INIT; do { unsigned char c; if (size >= sizeof(long)) { unsigned long val = get_unaligned_long(name); if (!(val & 0x8080808080808080)) { /* Make it equivalent in case */ val &= ~0x2020202020202020; hash = hash_long(hash, val); name += sizeof(long); size -= sizeof(long); continue; } } c = *name; if (!(c & 0x80)) { hash = hash_long(hash, c & ~0x20); name++; size--; continue; } /* This is the unusual and slowish case */ hash = hash_utf8_char(hash, c, &name, &size); } while (size); return hassh; } and then the only point you ever do that actual UTF8->codepoint conversion is for that "high bit set" case. A few things to note on the above: - the hash obviously has "odd" characteristics. We're not necessarily hashing characters at a time at all, and the alignment of characters with high bits *within*the*string* will make a difference to how we hash them. But it's also important that the "get_unaligned_long()" means that the alignment of the string itself doesn't matter, so its' purely a "chunking within the string" thing, and the alignment of the string itself won't affect the hash value - I'm not writing out hash_utf8_char(), because it's certainly not totally trivial, but it's not *really* complex either. The nontriviality isn't so much the decoding into a codepoint (which is pretty simple), but the fact that when you have the codepoint you should then decompose it and turn it into lower case, which is generally two table lookups. Then, you just do for_each_decomposed_uppercased_codepoint(c) hash = hash_long(hash, c); and one thing to note is that for the hashing, the decomposition and uppercasing doesn't even have to be "exact" (the same way I didn't do an "exact" upper-casing for US-ASCII, just a "good enough" one!) Similarly, when you actually do a unicode *compare* function, you should never *ever* actually convert to any unicode codepoints or do any expensive decomposition AT ALL by default! What you do is to compare things byte-by-byte, and only convert to unicode/decompse if there are any differences, and only for those parts of the sequence that differ! 
So if you have two UTF-8 strings (even if they have "complex" characters, ie with the high bit set), the *common* case is that you'll match them byte for byte, and they'll match without any Unicode conversion needed at all! This is common because: - normally, even if you don't ever normalize, people tend to input things in a *similar* manner (again, OS X is the odd man out), so even non-normalized strings are often non-normalized the same way! - we're only going to compare things that have hashed to the same thing anyway, so the common case is that it's the same string, and most likely had the same source. And if it's a collision, it's often totally different. And even if it's different only in case, the *common* case is going to be (for source code trees, at least) that the different point is a US-ASCII letter, and the case-comparison will again be done without any Unicode knowledge at all! This is why normalization of strings before-hand is generally so utterly stupid. It doesn't buy you anything. It complicates things a lot (you don't want to normalize in-place, so you have memory management issues), and it actually SLOWS THINGS DOWN. It's much better to do UTF-8 comparisons and hashing char-by-char. At least if you know that the common case is not going to be the complex part of the character set (which is undoubtedly true for source code filenames). Normalizing things ahead of time *only* makes sense if: - you expect complex characters to be a big part of your load and - you're going to do literally *thousands* of comparisons against the *same* strings over and over (so that the cost of normalization is literally up-front) For example, for a filesystem, it's true that you're going to compare against the *target* (on-disk) multiple times, but that doesn't actually mean that normalizing it makes any sense - because the data you're going to compare against comes from user space and isn't guaranteed to be normalized, so you still cannot do a simple memcmp() without the expense of normalizing that. And since you're going to hash the filenames anyway, you will not have "thousands of comparisons" per source lookup, you'll generally only have a couple, so now your normalization actually cost you *more* than doing the above on-the-fly comparison! See? Basically, it's almost always a stupid thing to actually normalize a whole string. You do those things character-by-character, and only lazily when you actually need to! Linus ^ permalink raw reply [flat|nested] 51+ messages in thread
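A sketch of the lazy comparison described above: walk both names byte by byte and only decode and fold a character at the first position where the raw bytes differ. fold_utf8_char() is a stand-in for the decode+decompose+uppercase step (real decomposition can yield several code points, which this sketch ignores), and nothing here is actual git code:

#include <ctype.h>

static unsigned int fold_utf8_char(const char **p, int *len)
{
        /* Placeholder: just consume one byte so the sketch compiles.
         * A real version would decode UTF-8, decompose and uppercase. */
        unsigned char c = (unsigned char)**p;

        (*p)++;
        (*len)--;
        return c;
}

static int names_equivalent(const char *a, int alen, const char *b, int blen)
{
        while (alen && blen) {
                unsigned char ca = (unsigned char)*a, cb = (unsigned char)*b;

                if (ca == cb) {                         /* common case: identical bytes */
                        a++, b++, alen--, blen--;
                        continue;
                }
                if (!(ca & 0x80) && !(cb & 0x80)) {     /* ASCII-only difference */
                        if (tolower(ca) != tolower(cb))
                                return 0;
                        a++, b++, alen--, blen--;
                        continue;
                }
                /* Slow path: fold just this character on each side. */
                if (fold_utf8_char(&a, &alen) != fold_utf8_char(&b, &blen))
                        return 0;
        }
        return !alen && !blen;
}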
* Re: I'm a total push-over.. 2008-01-24 17:15 ` Linus Torvalds @ 2008-01-24 18:45 ` Dmitry Potapov 2008-01-24 19:08 ` Linus Torvalds 2008-01-25 20:52 ` Marko Kreen 1 sibling, 1 reply; 51+ messages in thread From: Dmitry Potapov @ 2008-01-24 18:45 UTC (permalink / raw) To: Linus Torvalds Cc: Andreas Ericsson, Marko Kreen, Git Mailing List, Junio C Hamano On Thu, Jan 24, 2008 at 09:15:43AM -0800, Linus Torvalds wrote: > > > You can do a perfectly fine 8-bytes-at-a-time hash for almost 100% of all I suppose 8 bytes for 64-bit platforms and 4 bytes for 32-bits. > > unsigned int name_hash(const char *name, int size) > { > hash = HASH_INIT; > do { > unsigned char c; > if (size >= sizeof(long)) { > unsigned long val = get_unaligned_long(name); > if (!(val & 0x8080808080808080)) { > /* Make it equivalent in case */ > val &= ~0x2020202020202020; > hash = hash_long(hash, val); > name += sizeof(long); > size -= sizeof(long); > continue; > } > } > > c = *name; > if (!(c & 0x80)) { > hash = hash_long(hash, c & ~0x20); > name++; > size--; > continue; > } It is better to use 'while' instead of 'if' here, i.e.: while (!((c = *name) & 0x80)) { hash = hash_long(hash, c & ~0x20); name++; if (!--size) return hash; } > > /* This is the unusual and slowish case */ > hash = hash_utf8_char(hash, c, &name, &size); > } while (size); > return hassh; > } Dmitry ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: I'm a total push-over.. 2008-01-24 18:45 ` Dmitry Potapov @ 2008-01-24 19:08 ` Linus Torvalds 0 siblings, 0 replies; 51+ messages in thread From: Linus Torvalds @ 2008-01-24 19:08 UTC (permalink / raw) To: Dmitry Potapov Cc: Andreas Ericsson, Marko Kreen, Git Mailing List, Junio C Hamano On Thu, 24 Jan 2008, Dmitry Potapov wrote: > > It is better to use 'while' instead of 'if' here, i.e.: Yes, that looks like a good further micro-optimization. Linus ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: I'm a total push-over.. 2008-01-24 17:15 ` Linus Torvalds 2008-01-24 18:45 ` Dmitry Potapov @ 2008-01-25 20:52 ` Marko Kreen 2008-01-25 22:16 ` Linus Torvalds 1 sibling, 1 reply; 51+ messages in thread From: Marko Kreen @ 2008-01-25 20:52 UTC (permalink / raw) To: Linus Torvalds Cc: Dmitry Potapov, Andreas Ericsson, Git Mailing List, Junio C Hamano On 1/24/08, Linus Torvalds <torvalds@linux-foundation.org> wrote: > You can do a perfectly fine 8-bytes-at-a-time hash for almost 100% of all > source code projects in UTF-8, without *ever* doing any format changes at > all. Admittedly, it's a lot easier if the hash is a pure in-memory one (ie > we don't care about byte-order or size of integers or anything like that), > but that's the common case for most hashes that aren't used for BTree > lookup on disk or something like that. > > Here, let me show you: > > unsigned int name_hash(const char *name, int size) Well, although this is very clever approach, I suggest against it. You'll end up with complex code that gives out substandard results. I think its better to have separate case-folding function (or several), that copies string to temp buffer and then run proper optimized hash function on that buffer. That way you can use already tested building blocks and can optimize both sides separately. Eg. the folding-only function can aswell be optimized to load 4 or 8-byte at-a-time. This also isolates hashing from exact details how folding happens to access the input string which seem to be the weak point in your approach. (In both collision and complexity sense.) Such temp buffer happens to fits my lookup3_memcpy also better (heh). Its weak point is that on platforms that do not allow unaligned access, it degenerates to byte-by-byte loading. But if know you always have aligned buffer, you can notify gcc to do 4-byte fetch there too. It should be as simple as tagging data pointer as uint32_t *. Anyway, now you dont need to worry about folding when picking hash. > Basically, it's almost always a stupid thing to actually normalize a whole > string. You do those things character-by-character, and only lazily when > you actually need to! If your input strings are over kilobyte on average then I'd agree with you, but if you process 20-30 bytes on average, is the additional complexity worth it? -- marko ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: I'm a total push-over.. 2008-01-25 20:52 ` Marko Kreen @ 2008-01-25 22:16 ` Linus Torvalds 2008-01-25 22:35 ` Linus Torvalds ` (2 more replies) 0 siblings, 3 replies; 51+ messages in thread From: Linus Torvalds @ 2008-01-25 22:16 UTC (permalink / raw) To: Marko Kreen Cc: Dmitry Potapov, Andreas Ericsson, Git Mailing List, Junio C Hamano On Fri, 25 Jan 2008, Marko Kreen wrote: > > Well, although this is very clever approach, I suggest against it. > You'll end up with complex code that gives out substandard results. Actually, *your* operation is the one that gives substandard results. > I think its better to have separate case-folding function (or several), > that copies string to temp buffer and then run proper optimized hash > function on that buffer. I'm sorry, but you just cannot do that efficiently and portably. I can write a hash function that reliably does 8 bytes at a time for the common case on a 64-bit architecture, exactly because it's easy to do "test high bits in parallel" with a simple bitwise 'and', and we can do the same with "approximate lower-to-uppercase 8 bytes at a time" for a hash by just clearing bit 5. In contrast, trying to do the same thing in half-way portable C, but being limited to having to get the case-folding *exactly* right (which you need for the comparison function) is much much harder. It's basically impossible in portable C (it's doable with architecture-specific features, ie vector extensions that have per-byte compares etc). And hashing is performance-critical, much more so than the compares (ie you're likely to have to hash tens of thousands of files, while you will only compare a couple). So it really is worth optimizing for. And the thing is, "performance" isn't a secondary feature. It's also not something you can add later by optimizing. It's also a mindset issue. Quite frankly, people who do this by "convert to some folded/normalized form, then do the operation" will generally make much more fundamental mistakes. Once you get into the mindset of "let's pass a corrupted strign around", you are in trouble. You start thinking that the corrupted string isn't really "corrupt", it's in an "optimized format". And it's all downhill from there. Don't do it. Linus ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: I'm a total push-over.. 2008-01-25 22:16 ` Linus Torvalds @ 2008-01-25 22:35 ` Linus Torvalds 2008-01-26 12:16 ` Marko Kreen 2008-01-26 12:37 ` Marko Kreen 2 siblings, 0 replies; 51+ messages in thread From: Linus Torvalds @ 2008-01-25 22:35 UTC (permalink / raw) To: Marko Kreen Cc: Dmitry Potapov, Andreas Ericsson, Git Mailing List, Junio C Hamano On Fri, 25 Jan 2008, Linus Torvalds wrote: > > I can write a hash function that reliably does 8 bytes at a time for the > common case on a 64-bit architecture, exactly because it's easy to do > "test high bits in parallel" with a simple bitwise 'and', and we can do > the same with "approximate lower-to-uppercase 8 bytes at a time" for a > hash by just clearing bit 5. Side note: you *can* get better approximations fairly cheaply if you care. If you want to distinguish the characters 0-31 from the characters 31-63 in your hash (pointless for filenames, but it can be worthwhile for some other string cases), you can decide to clear bit#5 only if bit#6 in that byte was also set, with just a few bitwise operations. Eg, imagine that you have "unsigned long x" containing eight bytes of ascii data (ie you already did the test by 0x8080808080808080), you can do things like unsigned long bit6 = x & 0x4040404040404040; x &= ~(bit6 >> 1); which will only clear bit5 if bit6 in the same byte was set.. So you can do tricks like that, and it will still be plenty fast. And nobody will ever care that while it collides 'A' with 'a' (by design), it also causes '{' and '[' to be considered "case collisions". [ Amusing side note: '{' and '[' *are* case collisions in legacy 7-bit "Finnish ASCII". The editor I use still "upper-cases" '{' to '['. I'm not kidding, and yes, it really does it on purpose! It used to be that before everybody turned to Latin1, the {|} characters were re-used in Finland (and Sweden, for that matter) for the extra characters needed in Finnish. Because obviously nobody ever needed them for any real work. I (and probably every Finnish C UNIX programmer) used to be very good at reading C source code even when it was full of odd finnish characters with dots on top, instead of curly braces! ] And yes, from a performance standpoint, things liek this probably do realy matter. For the kernel tree, the average pathname length is ~28 characters. If you can do it with three iterations that do the first 24 characters eight characters at a time, and then four iterations over the four last ones, rather than 28 iterations with byte->longword and multiplications in each, I bet it's quite visible. Of course, it's going to be visible only if everything else is fast too, but git has been pretty good at that in general. Linus ^ permalink raw reply [flat|nested] 51+ messages in thread
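The bit trick above, wrapped into a self-contained test so the folding behaviour is easy to check (illustration only, not git code): bit 5 of each byte is cleared only where bit 6 is set, so 'a'..'z' fold onto 'A'..'Z' (and '{' onto '[') while bytes below 0x40 are left untouched:

#include <stdint.h>
#include <stdio.h>

static uint64_t fold_chunk(uint64_t x)
{
        uint64_t bit6 = x & 0x4040404040404040ULL;

        return x & ~(bit6 >> 1);        /* clear bit 5 only where bit 6 is set */
}

int main(void)
{
        printf("%llx\n", (unsigned long long)fold_chunk(0x61)); /* 41: 'a' -> 'A' */
        printf("%llx\n", (unsigned long long)fold_chunk(0x31)); /* 31: '1' unchanged */
        return 0;
}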
* Re: I'm a total push-over.. 2008-01-25 22:16 ` Linus Torvalds 2008-01-25 22:35 ` Linus Torvalds @ 2008-01-26 12:16 ` Marko Kreen 2008-01-27 6:51 ` Linus Torvalds 2008-01-26 12:37 ` Marko Kreen 2 siblings, 1 reply; 51+ messages in thread From: Marko Kreen @ 2008-01-26 12:16 UTC (permalink / raw) To: Linus Torvalds Cc: Dmitry Potapov, Andreas Ericsson, Git Mailing List, Junio C Hamano On 1/26/08, Linus Torvalds <torvalds@linux-foundation.org> wrote: > On Fri, 25 Jan 2008, Marko Kreen wrote: > > Well, although this is very clever approach, I suggest against it. > > You'll end up with complex code that gives out substandard results. > > Actually, *your* operation is the one that gives substandard results. > > > I think its better to have separate case-folding function (or several), > > that copies string to temp buffer and then run proper optimized hash > > function on that buffer. > > I'm sorry, but you just cannot do that efficiently and portably. > > I can write a hash function that reliably does 8 bytes at a time for the > common case on a 64-bit architecture, exactly because it's easy to do > "test high bits in parallel" with a simple bitwise 'and', and we can do > the same with "approximate lower-to-uppercase 8 bytes at a time" for a > hash by just clearing bit 5. > > In contrast, trying to do the same thing in half-way portable C, but being > limited to having to get the case-folding *exactly* right (which you need > for the comparison function) is much much harder. It's basically > impossible in portable C (it's doable with architecture-specific features, > ie vector extensions that have per-byte compares etc). Here you misunderstood me, I was proposing following: int hash_folded(const char *str, int len) { char buf[512]; do_folding(buf, str, len); return do_hash(buf, len); } That is - the folded string should stay internal to hash function. Only difference from combined foling+hashing would be that you can code each part separately. > And hashing is performance-critical, much more so than the compares (ie > you're likely to have to hash tens of thousands of files, while you will > only compare a couple). So it really is worth optimizing for. > > And the thing is, "performance" isn't a secondary feature. It's also not > something you can add later by optimizing. > > It's also a mindset issue. Quite frankly, people who do this by "convert > to some folded/normalized form, then do the operation" will generally make > much more fundamental mistakes. Once you get into the mindset of "let's > pass a corrupted strign around", you are in trouble. You start thinking > that the corrupted string isn't really "corrupt", it's in an "optimized > format". > > And it's all downhill from there. Don't do it. Againg, you seem to keep HFS+ behaviour in mind, but that was not what I did suggest. Probably my mistake, sorry. -- marko ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: I'm a total push-over..
  2008-01-26 12:16 ` Marko Kreen
@ 2008-01-27  6:51   ` Linus Torvalds
  2008-01-27  8:21     ` Dmitry Potapov
  2008-01-27  9:45     ` Marko Kreen
  0 siblings, 2 replies; 51+ messages in thread
From: Linus Torvalds @ 2008-01-27 6:51 UTC (permalink / raw)
  To: Marko Kreen
  Cc: Dmitry Potapov, Andreas Ericsson, Git Mailing List, Junio C Hamano

On Sat, 26 Jan 2008, Marko Kreen wrote:
>
> Here you misunderstood me, I was proposing the following:
>
> 	int hash_folded(const char *str, int len)
> 	{
> 		char buf[512];
> 		do_folding(buf, str, len);
> 		return do_hash(buf, len);
> 	}
>
> That is - the folded string should stay internal to the hash function.

If it's internal, it's much better, but you still missed the performance angle.

The fact is, hashing can take shortcuts that folding cannot do!

Case folding, by definition, has to be "exact" (since the whole point is that you're going to use the same folding function to do the compare, so if you play games with folding, the compares will be wrong).

But hashing doesn't have to be exact. It's ok to hash '{' and '[' as if they were different cases of the same character, if that gives you a faster hash function. Especially as those characters are rather rare in filenames.

So if you do hashing as a function of its own, you can simply do a better job at it.

I do agree that the functions that create a folded set of characters from a _complex_ UTF-8 character should be shared between folding and hashing, since that code is too complex and there are no simple shortcuts for doing a faster hash that still retains all the properties we want.

			Linus

^ permalink raw reply	[flat|nested] 51+ messages in thread
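The '{' / '[' shortcut mentioned above is easy to check: '{' is 0x7b and '[' is 0x5b, so a fold that simply clears bit 5 treats them as a case pair. A tiny standalone check (illustration only):

#include <assert.h>

int main(void)
{
	/* fine for a hash bucket: the extra collision only costs a compare */
	assert(('{' & ~0x20) == ('[' & ~0x20));
	assert(('a' & ~0x20) == ('A' & ~0x20));

	/* but an exact fold-for-compare must not equate them */
	assert('{' != '[');
	return 0;
}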
* Re: I'm a total push-over..
  2008-01-27  6:51 ` Linus Torvalds
@ 2008-01-27  8:21   ` Dmitry Potapov
  2008-01-27 14:07     ` Johannes Schindelin
  2008-01-27  9:45   ` Marko Kreen
  1 sibling, 1 reply; 51+ messages in thread
From: Dmitry Potapov @ 2008-01-27 8:21 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Marko Kreen, Andreas Ericsson, Git Mailing List, Junio C Hamano

On Sat, Jan 26, 2008 at 10:51:18PM -0800, Linus Torvalds wrote:
>
> On Sat, 26 Jan 2008, Marko Kreen wrote:
> >
> > Here you misunderstood me, I was proposing the following:
> >
> > 	int hash_folded(const char *str, int len)
> > 	{
> > 		char buf[512];
> > 		do_folding(buf, str, len);
> > 		return do_hash(buf, len);
> > 	}
> >
> > That is - the folded string should stay internal to the hash function.
>
> If it's internal, it's much better, but you still missed the performance
> angle.
>
> The fact is, hashing can take shortcuts that folding cannot do!
>
> Case folding, by definition, has to be "exact" (since the whole point is
> that you're going to use the same folding function to do the compare, so
> if you play games with folding, the compares will be wrong).

Let's rename do_folding to something else, because it is not a real folding, but a preparation step for the hash calculation. Keeping these steps separate simplifies the code, and allows further optimization; for instance, you do not need this do_folding step on a case-sensitive filesystem. Though it is certainly possible to mix both steps together, it bloats the code and makes it less readable.

Of course, the idea to avoid a temporary buffer and do everything at once is very appealing, so I gave it a try -- and here is a 32-bit version of name_hash(), but I am not very happy with the result:

#define rot(x,k) (((x)<<(k)) | ((x)>>(32-(k))))

#define mix(a,b,c) \
{ \
	a -= c;  a ^= rot(c, 4);  c += b; \
	b -= a;  b ^= rot(a, 6);  a += c; \
	c -= b;  c ^= rot(b, 8);  b += a; \
	a -= c;  a ^= rot(c,16);  c += b; \
	b -= a;  b ^= rot(a,19);  a += c; \
	c -= b;  c ^= rot(b, 4);  b += a; \
}

#define final(a,b,c) \
{ \
	c ^= b; c -= rot(b,14); \
	a ^= c; a -= rot(c,11); \
	b ^= a; b -= rot(a,25); \
	c ^= b; c -= rot(b,16); \
	a ^= c; a -= rot(c,4); \
	b ^= a; b -= rot(a,14); \
	c ^= b; c -= rot(b,24); \
}

#define hash_value(x) \
	hs[hp] += (x); \
	if (++hp == 3) { \
		mix(hs[0], hs[1], hs[2]); \
		hp = 0; \
	}

unsigned int name_hash(const char *name, unsigned size)
{
	unsigned hp = 0;
	unsigned hs[3];
	hs[0] = hs[1] = hs[2] = 0xdeadbeef + size;

	do {
		unsigned char c;
		if (size >= sizeof(unsigned)) {
			unsigned val = get_unaligned_uint(name);
			if (!(val & 0x80808080)) {
				val &= ~0x20202020;
				hash_value(val);
				name += sizeof(val);
				size -= sizeof(val);
				continue;
			}
		}

		while (!((c = *name) & 0x80)) {
			hash_value(c & ~0x20);
			name++;
			if (!--size)
				goto done;
		}

		do {
			// TODO: add denormalization for Mac
			unsigned val = towupper(utf8_to_wchar(&name, &size));
			hash_value(val);
		} while (size && (*name & 0x80));

	} while (size);
done:
	if (hp)
		final(hs[0], hs[1], hs[2]);
	return hs[2];
}

Dmitry

^ permalink raw reply	[flat|nested] 51+ messages in thread
* Re: I'm a total push-over..
  2008-01-27  8:21 ` Dmitry Potapov
@ 2008-01-27 14:07   ` Johannes Schindelin
  2008-01-27 14:48     ` Dmitry Potapov
  0 siblings, 1 reply; 51+ messages in thread
From: Johannes Schindelin @ 2008-01-27 14:07 UTC (permalink / raw)
  To: Dmitry Potapov
  Cc: Linus Torvalds, Marko Kreen, Andreas Ericsson, Git Mailing List, Junio C Hamano

Hi,

On Sun, 27 Jan 2008, Dmitry Potapov wrote:

> #define rot(x,k) (((x)<<(k)) | ((x)>>(32-(k))))
>
> #define mix(a,b,c) \
> { \
> 	a -= c;  a ^= rot(c, 4);  c += b; \
> 	b -= a;  b ^= rot(a, 6);  a += c; \
> 	c -= b;  c ^= rot(b, 8);  b += a; \
> 	a -= c;  a ^= rot(c,16);  c += b; \
> 	b -= a;  b ^= rot(a,19);  a += c; \
> 	c -= b;  c ^= rot(b, 4);  b += a; \
> }
>
> #define final(a,b,c) \
> { \
> 	c ^= b; c -= rot(b,14); \
> 	a ^= c; a -= rot(c,11); \
> 	b ^= a; b -= rot(a,25); \
> 	c ^= b; c -= rot(b,16); \
> 	a ^= c; a -= rot(c,4); \
> 	b ^= a; b -= rot(a,14); \
> 	c ^= b; c -= rot(b,24); \
> }
>
> #define hash_value(x) \
> 	hs[hp] += (x); \
> 	if (++hp == 3) { \
> 		mix(hs[0], hs[1], hs[2]); \
> 		hp = 0; \
> 	}
>
> unsigned int name_hash(const char *name, unsigned size)
> {
> 	unsigned hp = 0;
> 	unsigned hs[3];
> 	hs[0] = hs[1] = hs[2] = 0xdeadbeef + size;
>
> 	do {
> 		unsigned char c;
> 		if (size >= sizeof(unsigned)) {
> 			unsigned val = get_unaligned_uint(name);
> 			if (!(val & 0x80808080)) {
> 				val &= ~0x20202020;
> 				hash_value(val);
> 				name += sizeof(val);
> 				size -= sizeof(val);
> 				continue;
> 			}
> 		}
>
> 		while (!((c = *name) & 0x80)) {
> 			hash_value(c & ~0x20);
> 			name++;
> 			if (!--size)
> 				goto done;
> 		}
>
> 		do {
> 			// TODO: add denormalization for Mac
> 			unsigned val = towupper(utf8_to_wchar(&name, &size));
> 			hash_value(val);
> 		} while (size && (*name & 0x80));
>
> 	} while (size);
> done:
> 	if (hp)
> 		final(hs[0], hs[1], hs[2]);
> 	return hs[2];
> }

<irony>Oh yes, let's take this one, it is so much shorter, cleaner and overall more elegant than Linus' code.</irony>

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 51+ messages in thread
* Re: I'm a total push-over..
  2008-01-27 14:07 ` Johannes Schindelin
@ 2008-01-27 14:48   ` Dmitry Potapov
  0 siblings, 0 replies; 51+ messages in thread
From: Dmitry Potapov @ 2008-01-27 14:48 UTC (permalink / raw)
  To: Johannes Schindelin
  Cc: Linus Torvalds, Marko Kreen, Andreas Ericsson, Git Mailing List, Junio C Hamano

Hi,

On Sun, Jan 27, 2008 at 02:07:25PM +0000, Johannes Schindelin wrote:
>
> <irony>Oh yes, let's take this one, it is so much shorter, cleaner and
> overall more elegant than Linus' code.</irony>

I am not sure what you meant by that, or which code exactly you meant by "Linus' code"... Anyway, my point was that mixing both steps together does not look very nice, and the code was intended to demonstrate why.

Dmitry

^ permalink raw reply	[flat|nested] 51+ messages in thread
* Re: I'm a total push-over..
  2008-01-27  6:51 ` Linus Torvalds
  2008-01-27  8:21   ` Dmitry Potapov
@ 2008-01-27  9:45   ` Marko Kreen
  2008-01-27 15:06     ` Dmitry Potapov
  1 sibling, 1 reply; 51+ messages in thread
From: Marko Kreen @ 2008-01-27 9:45 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Dmitry Potapov, Andreas Ericsson, Git Mailing List, Junio C Hamano

On 1/27/08, Linus Torvalds <torvalds@linux-foundation.org> wrote:
> On Sat, 26 Jan 2008, Marko Kreen wrote:
> >
> > Here you misunderstood me, I was proposing the following:
> >
> > 	int hash_folded(const char *str, int len)
> > 	{
> > 		char buf[512];
> > 		do_folding(buf, str, len);
> > 		return do_hash(buf, len);
> > 	}
> >
> > That is - the folded string should stay internal to the hash function.
>
> If it's internal, it's much better, but you still missed the performance
> angle.
>
> The fact is, hashing can take shortcuts that folding cannot do!
>
> Case folding, by definition, has to be "exact" (since the whole point is
> that you're going to use the same folding function to do the compare, so
> if you play games with folding, the compares will be wrong).
>
> But hashing doesn't have to be exact. It's ok to hash '{' and '[' as if
> they were different cases of the same character, if that gives you a
> faster hash function. Especially as those characters are rather rare in
> filenames.
>
> So if you do hashing as a function of its own, you can simply do a better
> job at it.
>
> I do agree that the functions that create a folded set of characters from
> a _complex_ UTF-8 character should be shared between folding and hashing,
> since that code is too complex and there are no simple shortcuts for doing
> a faster hash that still retains all the properties we want.

Well, you can always have a fold_quick_and_dirty() function that is used only internally in the hash_folded() function, and which can:

- fold with a simple |= 0x20202020..
- write out full uint32/64 words, with no need to make the result a proper string
- zero-fill at the end, so the hash function does not need to check for a partial block, which is a pretty expensive part of hashing.

The win would be:

- more modularized code
- can use a faster (or any) hash
- the hash function can be certain to work on aligned data (a win on non-x86)

The minus:

- some memory i/o overhead, which may or may not matter
- the parts would not be fully generic, but special to hashing

-- 
marko

PS. Typo in my last mail - "inner loop should be reversible - that means that details from beginning of data should shift out of horizon." That obviously means "data should _not_ shift out of horizon".

btw, "reversible" for integer hashes means that there is a 1:1 mapping between input and output - no collisions, thus no info loss.

^ permalink raw reply	[flat|nested] 51+ messages in thread
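A quick-and-dirty fold along the lines of the bullet points above might look like the sketch below. The names and details are illustrative only (nothing here is from a posted patch), and the fold is the crude ASCII lower-casing Marko mentions, not an exact one.

#include <stdint.h>
#include <string.h>

/*
 * Fold a name into zero-padded, aligned 32-bit words with bit 5 forced on.
 * The output is not a proper string; it is meant to be consumed only by a
 * hash routine that walks whole words, so no tail case is needed there.
 */
static size_t fold_quick_and_dirty(uint32_t *dst, size_t dst_words,
				   const char *src, size_t len)
{
	size_t words = (len + 3) / 4;
	size_t i;

	if (!len || words > dst_words)
		return 0;		/* caller falls back to a slower path */

	dst[words - 1] = 0;		/* zero-fill the partial tail word */
	memcpy(dst, src, len);
	for (i = 0; i < words; i++)
		dst[i] |= 0x20202020;	/* crude ASCII lower-casing */
	return words;			/* number of words for the hash loop */
}

Because the output is word-aligned and zero-padded, the hash that consumes it never has to handle a partial block, which is the point of the third bullet above.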
* Re: I'm a total push-over..
  2008-01-27  9:45 ` Marko Kreen
@ 2008-01-27 15:06   ` Dmitry Potapov
  0 siblings, 0 replies; 51+ messages in thread
From: Dmitry Potapov @ 2008-01-27 15:06 UTC (permalink / raw)
  To: Marko Kreen
  Cc: Linus Torvalds, Andreas Ericsson, Git Mailing List, Junio C Hamano

On Sun, Jan 27, 2008 at 11:45:25AM +0200, Marko Kreen wrote:
> The minus:
> - some memory i/o overhead, which may or may not matter

If a string is short, it will probably reside in the processor cache, so there is no real memory i/o overhead here. For longer strings, it may be better to do that in short chunks, so each chunk can reside in the cache. But I don't think filenames are long, so it is not an issue here.

> - the parts would not be fully generic, but special to hashing

I think the second part can be generic enough to be reused for hashing something else that does not require filename-specific case-folding.

Dmitry

^ permalink raw reply	[flat|nested] 51+ messages in thread
* Re: I'm a total push-over..
  2008-01-25 22:16 ` Linus Torvalds
  2008-01-25 22:35 ` Linus Torvalds
  2008-01-26 12:16 ` Marko Kreen
@ 2008-01-26 12:37 ` Marko Kreen
  2 siblings, 0 replies; 51+ messages in thread
From: Marko Kreen @ 2008-01-26 12:37 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Dmitry Potapov, Andreas Ericsson, Git Mailing List, Junio C Hamano

On 1/26/08, Linus Torvalds <torvalds@linux-foundation.org> wrote:
> It's also a mindset issue. Quite frankly, people who do this by "convert
> to some folded/normalized form, then do the operation" will generally make
> much more fundamental mistakes. Once you get into the mindset of "let's
> pass a corrupted string around", you are in trouble. You start thinking
> that the corrupted string isn't really "corrupt", it's in an "optimized
> format".

Ok, you seem to focus on case folding and general performance; I focus on hash quality and code complexity. Considering you may want several folding methods, it seemed to me that it would be good to separate the two aspects. But I'll try to follow your path for a moment.

Hashing 32 or 64 bits at a time is not trivial, e.g. you cannot use the same algorithm for both cases; 64 bits requires twice the work to mix well. Per Jenkins' notes, the hash inner loop should be reversible - that means that details from beginning of data should shift out of horizon. But final mixing should achieve avalanche - that means each bit in the input should affect 50% of the bits in the output. It must also be noted that we are mixing 32->32 and 64->64 instead of the usual 8->32.

So it seems that the best bet would be to use integer hash functions as the core. Jenkins himself points to Thomas Wang (http://www.cris.com/~Ttwang/tech/inthash.htm), who has good integer mix functions for both 32 and 64 bits. Integer hash functions must have both reversibility and avalanche, so they may be slightly overkill for this purpose, but fixing that means a lot of work.

So here is what I propose for hashing, if you really want to have combined folding + hashing:

#include <stdint.h>
#include <string.h>

/* Thomas Wang integer hash functions */

static inline uint32_t hash32(uint32_t key)
{
	key = ~key + (key << 15);
	key = key ^ (key >> 12);
	key = key + (key << 2);
	key = key ^ (key >> 4);
	key = key * 2057;
	key = key ^ (key >> 16);
	return key;
}

static inline uint64_t hash64(uint64_t key)
{
	key = (~key) + (key << 21);		// key = (key << 21) - key - 1;
	key = key ^ (key >> 24);
	key = (key + (key << 3)) + (key << 8);	// key * 265
	key = key ^ (key >> 14);
	key = (key + (key << 2)) + (key << 4);	// key * 21
	key = key ^ (key >> 28);
	key = key + (key << 31);
	return key;
}

/*
 * Simple addition should be enough for new values,
 * considering the mix functions do work well.
 */

/* this is the function to use in git */
static inline unsigned long hash_long(unsigned long hash, unsigned long val)
{
	if (sizeof(long) == 8)
		return hash64(hash + val);
	else
		return hash32(hash + val);
}

/* below are regular hashes for testing */

static inline uint32_t hash_int32(uint32_t hash, uint32_t val)
{
	return hash32(hash + val);
}

static inline uint64_t hash_int64(uint64_t hash, uint64_t val)
{
	return hash64(hash + val);
}

/* hack to avoid a call to libc memcpy() */
static inline void simple_memcpy(void *_dst, const void *_src, unsigned len)
{
	const uint8_t *src = _src;
	uint8_t *dst = _dst;
	while (len--)
		*dst++ = *src++;
}

uint32_t int32_hash(const void *_data, unsigned int size)
{
	const uint8_t *src = _data;
	/* initial value. +size avoids \0 and \0\0 hashing the same */
	uint32_t hash = 1234567890 + size;
	uint32_t val;

	while (size >= 4) {
		memcpy(&val, src, 4);	/* direct load on x86/64 */
		src += 4;
		size -= 4;
		hash = hash_int32(hash, val);
	}
	if (size > 0) {
		val = 0;
		simple_memcpy(&val, src, size);
		hash = hash_int32(hash, val);
	}
	return hash;
}

uint32_t int64_hash(const void *_data, unsigned int size)
{
	const uint8_t *src = _data;
	uint64_t hash = 12345678901234567890ULL + size;
	uint64_t val;

	while (size >= 8) {
		memcpy(&val, src, 8);	/* direct load on x86/64 */
		hash = hash_int64(hash, val);
		src += 8;
		size -= 8;
	}
	if (size > 0) {
		val = 0;
		simple_memcpy(&val, src, size);
		hash = hash_int64(hash, val);
	}
	/* here we go back to 32 bits, simple masking is enough */
	return hash;
}

In the "regular" hash functions I again use the memcpy() trick, because I don't want to bother with access optimizations, especially considering that part is unnecessary for git.

Intel Core Duo (32-bit). String length 0 .. 40, random alignment:
-------------------------------------------------------------------
Testing: seed=34 align=0 minlen=0 maxlen=40 trycnt=3 duration=10
lookup3             : #0 .. 165.489  #1 .. 165.494  #2 .. 165.490 MB/s
int32_hash          : #0 .. 148.359  #1 .. 148.350  #2 .. 148.435 MB/s
int64_hash          : #0 .. 123.105  #1 .. 123.040  #2 .. 123.039 MB/s
lookup3_memcpy_hack : #0 .. 169.791  #1 .. 169.795  #2 .. 169.749 MB/s
oat                 : #0 .. 134.737  #1 .. 134.702  #2 .. 134.735 MB/s
fnv                 : #0 .. 131.457  #1 .. 131.470  #2 .. 131.474 MB/s
hsieh               : #0 .. 166.619  #1 .. 166.622  #2 .. 166.588 MB/s

Results compared to reference:
lookup3             : 100.000 %
int32_hash          :  89.661 %
int64_hash          :  74.361 %
lookup3_memcpy_hack : 102.591 %
oat                 :  81.409 %
fnv                 :  79.441 %
hsieh               : 100.676 %

AMD Opteron(tm) Processor 252 (64-bit) 2.6GHz
-------------------------------------------------
Testing: seed=34 align=0 minlen=0 maxlen=40 trycnt=3 duration=10
lookup3             : #0 .. 208.819  #1 .. 208.877  #2 .. 208.897 MB/s
int32_hash          : #0 .. 181.096  #1 .. 181.100  #2 .. 181.097 MB/s
int64_hash          : #0 .. 196.823  #1 .. 196.761  #2 .. 196.825 MB/s
lookup3_memcpy_hack : #0 .. 201.593  #1 .. 201.597  #2 .. 201.594 MB/s
oat                 : #0 .. 160.769  #1 .. 160.774  #2 .. 160.772 MB/s
fnv                 : #0 .. 200.046  #1 .. 200.044  #2 .. 200.046 MB/s
hsieh               : #0 .. 205.515  #1 .. 205.520  #2 .. 205.517 MB/s

Results compared to reference:
lookup3             : 100.000 %
int32_hash          :  86.706 %
int64_hash          :  94.225 %
lookup3_memcpy_hack :  96.519 %
oat                 :  76.974 %
fnv                 :  95.778 %
hsieh               :  98.398 %

So speedwise the result is not bad, especially considering the unoptimized data fetching. On larger data (~1k) it tends to lose to lookup3 more; I guess lookup3 parallelizes better (3x 32-bit int vs. 1x 32/64-bit int).

The functions pass the Jenkins lookup3 selftest, which e.g. FNV does not.

The code is also available at:

  http://pgbouncer.projects.postgresql.org/hashtest/hashtest-2008-01-26.tgz

-- 
marko

^ permalink raw reply	[flat|nested] 51+ messages in thread
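To connect the integer-mix approach above to the thread's actual use case, a name-hashing loop built on the 64-bit Wang mix might look roughly like the sketch below. This is only an illustration of how the pieces could be plugged into a hash_name()-style function, not code from any posted patch; the seed and the tail handling are arbitrary choices.

#include <stdint.h>
#include <string.h>

/* Thomas Wang's 64-bit mix, as in the message above */
static inline uint64_t wang_hash64(uint64_t key)
{
	key = (~key) + (key << 21);
	key = key ^ (key >> 24);
	key = (key + (key << 3)) + (key << 8);
	key = key ^ (key >> 14);
	key = (key + (key << 2)) + (key << 4);
	key = key ^ (key >> 28);
	key = key + (key << 31);
	return key;
}

/*
 * Hypothetical index name-hash loop, eight bytes per iteration.
 * memcpy() is used as the unaligned-safe load; the tail is zero-padded.
 */
static unsigned int name_hash_wang(const char *name, unsigned int len)
{
	uint64_t hash = 0x123 + len;	/* arbitrary seed; +len separates "" from "\0" */
	uint64_t val;

	while (len >= sizeof(val)) {
		memcpy(&val, name, sizeof(val));
		hash = wang_hash64(hash + val);
		name += sizeof(val);
		len -= sizeof(val);
	}
	if (len) {
		val = 0;
		memcpy(&val, name, len);
		hash = wang_hash64(hash + val);
	}
	return (unsigned int)hash;	/* truncate to 32 bits for the bucket index */
}

Whether a loop like this actually beats a lookup3-style mix for ~28-character pathnames is exactly what the benchmark numbers above try to measure.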
* Re: I'm a total push-over..
  2008-01-24 16:00 ` Andreas Ericsson
  2008-01-24 16:13 ` Marko Kreen
  2008-01-24 16:28 ` Dmitry Potapov
@ 2008-01-25 20:08 ` Marko Kreen
  2 siblings, 0 replies; 51+ messages in thread
From: Marko Kreen @ 2008-01-25 20:08 UTC (permalink / raw)
  To: Andreas Ericsson
  Cc: Dmitry Potapov, Linus Torvalds, Git Mailing List, Junio C Hamano

On 1/24/08, Andreas Ericsson <ae@op5.se> wrote:
> Interesting output, but not very surprising. Do you have the code
> available somewhere?

  http://pgbouncer.projects.postgresql.org/hashtest/hashtest-2008-01-25.tgz

I cleaned it up a bit. Also fixed a bug - lookup3 was called via a wrapper function, so it actually is a tiny bit faster than the results show.

You can consider the code as public domain too. Also remember it was not meant to be published, so it is rather a hack...

-- 
marko

^ permalink raw reply	[flat|nested] 51+ messages in thread
* Re: I'm a total push-over..
  2008-01-23  9:31 ` Andreas Ericsson
  2008-01-23 14:01   ` Marko Kreen
@ 2008-01-23 17:10   ` Dmitry Potapov
  2008-01-24 10:39     ` Andreas Ericsson
  1 sibling, 1 reply; 51+ messages in thread
From: Dmitry Potapov @ 2008-01-23 17:10 UTC (permalink / raw)
  To: Andreas Ericsson; +Cc: Linus Torvalds, Git Mailing List, Junio C Hamano

On Wed, Jan 23, 2008 at 10:31:11AM +0100, Andreas Ericsson wrote:
> ---
> FNV Hash
>
> I need to fill this in. Search the web for FNV hash. It's faster than my
> hash on Intel (because Intel has fast multiplication), but slower on most
> other platforms. Preliminary tests suggested it has decent distributions.
> ---

I believe that by the words "my hash", Bob Jenkins meant lookup2, which was significantly slower.

> My tests ran on Intel.

Please, could you specify your CPU model?

> I also noticed I had a few hashes commented out when
> doing the test, one of them being Paul Hsieh's. For some reason, Jenkins' and
> Hsieh's didn't perform well for me last time I used the comparison thing (I
> did a more thorough job back then, with tests running for several minutes
> per hash and table-size, so I commented out the poor candidates).

I expected that Paul Hsieh's hash might not do well on some architectures, though it seems it did even worse than I expected.

> I still believe that for this very simple case, the lookup3.c case is not
> very practical, as the code is that much more complicated, which was my
> main point with posting the comparison.

I would not describe lookup3 as impractical. It is widely used and well tested. Perhaps, for some Intel CPUs, the difference in speed is not so big, and the FNV hash is much smaller and simpler, so FNV is a reasonable choice, but the hash is twice as slow on my AMD processor and I suspect it may be even worse on other CPUs, where integer multiplication is slow. Besides, it may turn out that hashing filenames is not the only case where a fast hash is needed.

Dmitry

^ permalink raw reply	[flat|nested] 51+ messages in thread
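Since FNV keeps coming up by name, here is the classic 32-bit FNV-1a loop for reference (the standard published offset basis and prime); whether git would use FNV-1 or FNV-1a is not something this thread settles.

#include <stdint.h>
#include <stddef.h>

/* 32-bit FNV-1a */
static uint32_t fnv1a_32(const void *data, size_t len)
{
	const unsigned char *p = data;
	uint32_t hash = 2166136261u;	/* FNV offset basis */

	while (len--) {
		hash ^= *p++;
		hash *= 16777619u;	/* FNV prime */
	}
	return hash;
}

The loop is one multiply per byte, which is why it does well on CPUs with fast integer multiplication and comparatively worse elsewhere, exactly the trade-off being discussed above.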
* Re: I'm a total push-over..
  2008-01-23 17:10 ` Dmitry Potapov
@ 2008-01-24 10:39   ` Andreas Ericsson
  0 siblings, 0 replies; 51+ messages in thread
From: Andreas Ericsson @ 2008-01-24 10:39 UTC (permalink / raw)
  To: Dmitry Potapov; +Cc: Linus Torvalds, Git Mailing List, Junio C Hamano

Dmitry Potapov wrote:
> On Wed, Jan 23, 2008 at 10:31:11AM +0100, Andreas Ericsson wrote:
>> ---
>> FNV Hash
>>
>> I need to fill this in. Search the web for FNV hash. It's faster than my
>> hash on Intel (because Intel has fast multiplication), but slower on most
>> other platforms. Preliminary tests suggested it has decent distributions.
>> ---
>
> I believe that by the words "my hash", Bob Jenkins meant lookup2, which
> was significantly slower.
>
>> My tests ran on Intel.
>
> Please, could you specify your CPU model?
>

From /proc/cpuinfo. It's the best I can do without going to our purchase department and asking for the spec so I can contact the vendor and get the real thing. Dualcore shouldn't matter for this test, as it isn't threaded.

  Intel(R) Core(TM)2 Duo CPU     T7700  @ 2.40GHz

>> I also noticed I had a few hashes commented out when
>> doing the test, one of them being Paul Hsieh's. For some reason, Jenkins' and
>> Hsieh's didn't perform well for me last time I used the comparison thing (I
>> did a more thorough job back then, with tests running for several minutes
>> per hash and table-size, so I commented out the poor candidates).
>
> I expected that Paul Hsieh's hash might not do well on some architectures,
> though it seems it did even worse than I expected.
>

It doesn't do that well on certain types of data, in my experience. It does have excellent dispersion, so with very long strings it's usually the best to use, because collisions become so expensive.

>> I still believe that for this very simple case, the lookup3.c case is not
>> very practical, as the code is that much more complicated, which was my
>> main point with posting the comparison.
>
> I would not describe lookup3 as impractical. It is widely used and well
> tested. Perhaps, for some Intel CPUs, the difference in speed is not so
> big, and the FNV hash is much smaller and simpler, so FNV is a reasonable
> choice, but the hash is twice as slow on my AMD processor and I suspect
> it may be even worse on other CPUs, where integer multiplication is slow.
> Besides, it may turn out that hashing filenames is not the only case where
> a fast hash is needed.
>

Ah well. I think once the patch is in master, it will be easy enough to test and verify different algorithms. Since it's intended for in-memory data only, it's no problem to have several algorithms and pick the one most suitable for the architecture we're compiling for.

-- 
Andreas Ericsson                   andreas.ericsson@op5.se
OP5 AB                             www.op5.se
Tel: +46 8-230225                  Fax: +46 8-230231

^ permalink raw reply	[flat|nested] 51+ messages in thread
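Picking the algorithm per architecture, as suggested above, could be done at build time. A hypothetical arrangement is sketched below; the hash_name_* function names are invented for illustration and do not exist in git, only the predefined compiler macros are real.

/* hypothetical build-time selection of the in-memory name hash */
#if defined(__x86_64__) || defined(_M_X64)
#define name_hash(s, n) hash_name_word64((s), (n))	/* 8-bytes-at-a-time variant */
#elif defined(__i386__) || defined(_M_IX86)
#define name_hash(s, n) hash_name_word32((s), (n))	/* 4-bytes-at-a-time variant */
#else
#define name_hash(s, n) hash_name_fnv((s), (n))		/* portable byte-wise fallback */
#endif

Since the hash is never written to disk, such a switch would not affect index compatibility; only in-memory bucket placement differs between builds.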
* Re: I'm a total push-over..
  2008-01-23  8:32 ` Andreas Ericsson
  2008-01-23  9:15   ` Dmitry Potapov
@ 2008-01-23 16:06   ` Linus Torvalds
  1 sibling, 0 replies; 51+ messages in thread
From: Linus Torvalds @ 2008-01-23 16:06 UTC (permalink / raw)
  To: Andreas Ericsson; +Cc: Git Mailing List, Junio C Hamano

On Wed, 23 Jan 2008, Andreas Ericsson wrote:
>
> Insofar as hashes go, it's not that shabby for hashing filenames.

Hashing filenames is pretty easy. You can do a reasonable job with any "multiply by an odd number, add in value" scheme. Picking the odd number is a random choice - some are better than others (and it depends on whether you end up always having a power-of-two bucket size etc), but it generally won't be horrible.

And considering that we generally won't have tons and tons of pathnames (ie we'd generally have thousands, not millions), and that the underlying hash not only resizes itself but actually uses the full 32 bits as the lookup key, I don't worry too much. I suspect my random choice is fine.

So no, I didn't think my hash would _suck_, although I also didn't research which odd numbers to pick.

No, it's a joke because it doesn't really give an example of how to do the _expected_ hash collisions. Here's another hash that is actually going to collide *much* more (and on purpose!), but is actually showing an example of something that actually hashes UTF strings so that upper-case and lower-case (and normalization) will all still hash to the same value:

	static unsigned int hash_name(const char *name, int namelen)
	{
		unsigned int hash = 0x123;

		do {
			unsigned char c = *name++;
			if (c & 0x80)
				c = 0;
			c &= ~0x20;
			hash = hash*101 + c;
		} while (--namelen);
		return hash;
	}

but the above does so by making the hash much much worse (although probably still acceptable for "normal source code name distributions" that don't have very many same-name-in-different-cases and high-bit characters anyway).

The above is still fairly fast, but obviously at a serious cost in hash goodness, to the point of being totally unusable for anybody who uses Asian characters in their filenames. To actually be really useful, you'd have to teach it about the character system and do a lookup into a case/normalization table.

So *that* is mainly why it's a joke. But it should work fine for the case it is used for now (exact match). Picking a better-researched constant might still be a good idea.

			Linus

^ permalink raw reply	[flat|nested] 51+ messages in thread
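The trade-off described above is easy to see in a tiny standalone check (the function is the folding variant quoted in the message, the driver is illustration only, not part of any patch): ASCII names that differ only in case hash the same, but so do any two names that differ only in high-bit bytes, since those are all mapped to the same value.

#include <stdio.h>

/* the case/normalization-folding variant quoted above */
static unsigned int hash_name(const char *name, int namelen)
{
	unsigned int hash = 0x123;

	do {
		unsigned char c = *name++;
		if (c & 0x80)
			c = 0;
		c &= ~0x20;
		hash = hash*101 + c;
	} while (--namelen);
	return hash;
}

int main(void)
{
	/* case-insensitive on ASCII, by design: same value */
	printf("%u %u\n", hash_name("Makefile", 8), hash_name("MAKEFILE", 8));

	/* every high-bit byte hashes alike, hence the "unusable for Asian
	 * filenames" caveat: these two different UTF-8 names also collide */
	printf("%u %u\n", hash_name("\xc3\xa4", 2), hash_name("\xc3\xb6", 2));
	return 0;
}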
end of thread, other threads:[~2008-01-27 15:06 UTC | newest] Thread overview: 51+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2008-01-22 23:37 I'm a total push-over Linus Torvalds 2008-01-23 1:35 ` Kevin Ballard 2008-01-23 2:23 ` Junio C Hamano 2008-01-23 2:36 ` Junio C Hamano 2008-01-23 12:24 ` Johannes Schindelin 2008-01-23 12:28 ` David Kastrup 2008-01-23 12:56 ` Theodore Tso 2008-01-23 2:58 ` Linus Torvalds 2008-01-23 3:19 ` Linus Torvalds 2008-01-25 6:50 ` Junio C Hamano 2008-01-25 16:24 ` Linus Torvalds 2008-01-23 7:23 ` Junio C Hamano 2008-01-23 12:25 ` Johannes Schindelin 2008-01-23 16:25 ` Linus Torvalds 2008-01-23 16:34 ` Johannes Schindelin 2008-01-23 17:09 ` Linus Torvalds 2008-01-23 17:29 ` Linus Torvalds 2008-01-25 5:21 ` Jeremy Maitin-Shepard 2008-01-25 12:51 ` Johannes Schindelin 2008-01-25 18:19 ` Jeremy Maitin-Shepard 2008-01-25 18:24 ` Johannes Schindelin 2008-01-25 19:07 ` Junio C Hamano 2008-01-23 8:32 ` Andreas Ericsson 2008-01-23 9:15 ` Dmitry Potapov 2008-01-23 9:31 ` Andreas Ericsson 2008-01-23 14:01 ` Marko Kreen 2008-01-23 14:39 ` Andreas Ericsson 2008-01-24 6:51 ` Luke Lu 2008-01-24 10:24 ` Andreas Ericsson 2008-01-24 13:19 ` Marko Kreen 2008-01-24 16:00 ` Andreas Ericsson 2008-01-24 16:13 ` Marko Kreen 2008-01-24 16:28 ` Dmitry Potapov 2008-01-24 17:15 ` Linus Torvalds 2008-01-24 18:45 ` Dmitry Potapov 2008-01-24 19:08 ` Linus Torvalds 2008-01-25 20:52 ` Marko Kreen 2008-01-25 22:16 ` Linus Torvalds 2008-01-25 22:35 ` Linus Torvalds 2008-01-26 12:16 ` Marko Kreen 2008-01-27 6:51 ` Linus Torvalds 2008-01-27 8:21 ` Dmitry Potapov 2008-01-27 14:07 ` Johannes Schindelin 2008-01-27 14:48 ` Dmitry Potapov 2008-01-27 9:45 ` Marko Kreen 2008-01-27 15:06 ` Dmitry Potapov 2008-01-26 12:37 ` Marko Kreen 2008-01-25 20:08 ` Marko Kreen 2008-01-23 17:10 ` Dmitry Potapov 2008-01-24 10:39 ` Andreas Ericsson 2008-01-23 16:06 ` Linus Torvalds