From: Linus Torvalds <torvalds@osdl.org>
To: Junio C Hamano <junkio@cox.net>, Nicolas Pitre <nico@cam.org>
Cc: Git Mailing List <git@vger.kernel.org>
Subject: Re: Horrible re-packing?
Date: Mon, 5 Jun 2006 12:03:31 -0700 (PDT) [thread overview]
Message-ID: <Pine.LNX.4.64.0606051155000.5498@g5.osdl.org> (raw)
In-Reply-To: <Pine.LNX.4.64.0606051140530.5498@g5.osdl.org>
On this same thread..
This trivial patch not only simplifies the name hashing, it actually
improves packing for both git and the kernel.
The git archive pack shrinks from 6824090->6622627 bytes (a 3%
improvement), and the kernel pack shrinks from 108756213 to 108219021 (a
mere 0.5% improvement, but still, it's an improvement from making the
hashing much simpler!)
I think the hash function with its comment is self-explanatory:
/*
* This effectively just creates a sortable number from the
* last sixteen non-whitespace characters. Last characters
* count "most", so things that end in ".c" sort together.
*/
while ((c = *name++) != 0) {
if (isspace(c))
continue;
hash = (hash >> 2) + (c << 24);
}
return hash;
ie we just create a 32-bit hash, where we "age" previous characters by two
bits, so the last characters in a filename count most. So when we then
compare the hashes in the sort routine, filenames that end the same way
sort the same way.
It takes the subdirectory into account (unless the filename is > 16
characters), but files with the same name within the same subdirectory
will obviously sort closer than files in different subdirectories.
And, incidentally (which is why I tried the hash change in the first
place, of course) builtin-rev-list.c will sort fairly close to rev-list.c.
And no, it's not a "good hash" in the sense of being secure or unique, but
that's not what we're looking for. The whole "hash" thing is misnamed
here. It's not so much a hash as a "sorting number".
Comments?
Linus
----
pack-objects.c | 41 +++++++++++------------------------------
1 files changed, 11 insertions(+), 30 deletions(-)
diff --git a/pack-objects.c b/pack-objects.c
index 3590cd5..ae49fba 100644
--- a/pack-objects.c
+++ b/pack-objects.c
@@ -473,38 +473,19 @@ #define DIRBITS 12
static unsigned name_hash(struct name_path *path, const char *name)
{
- struct name_path *p = path;
- const char *n = name + strlen(name);
- unsigned hash = 0, name_hash = 0, name_done = 0;
-
- if (n != name && n[-1] == '\n')
- n--;
- while (name <= --n) {
- unsigned char c = *n;
- if (c == '/' && !name_done) {
- name_hash = hash;
- name_done = 1;
- hash = 0;
- }
- hash = hash * 11 + c;
- }
- if (!name_done) {
- name_hash = hash;
- hash = 0;
- }
- for (p = path; p; p = p->up) {
- hash = hash * 11 + '/';
- n = p->elem + p->len;
- while (p->elem <= --n) {
- unsigned char c = *n;
- hash = hash * 11 + c;
- }
- }
+ unsigned char c;
+ unsigned hash = 0;
+
/*
- * Make sure "Makefile" and "t/Makefile" are hashed separately
- * but close enough.
+ * This effectively just creates a sortable number from the
+ * last sixteen non-whitespace characters. Last characters
+ * count "most", so things that end in ".c" sort together.
*/
- hash = (name_hash<<DIRBITS) | (hash & ((1U<<DIRBITS )-1));
+ while ((c = *name++) != 0) {
+ if (isspace(c))
+ continue;
+ hash = (hash >> 2) + (c << 24);
+ }
return hash;
}
next prev parent reply other threads:[~2006-06-05 19:03 UTC|newest]
Thread overview: 15+ messages / expand[flat|nested] mbox.gz Atom feed top
2006-06-05 17:08 Horrible re-packing? Linus Torvalds
2006-06-05 18:44 ` Linus Torvalds
2006-06-05 19:03 ` Linus Torvalds [this message]
2006-06-05 19:37 ` Junio C Hamano
2006-06-05 19:57 ` Linus Torvalds
2006-06-05 23:54 ` Junio C Hamano
2006-06-06 0:14 ` Junio C Hamano
2006-06-05 21:14 ` Olivier Galibert
2006-06-05 21:22 ` Nicolas Pitre
2006-06-06 0:18 ` Chris Wedgwood
2006-06-06 0:35 ` Linus Torvalds
2006-06-05 21:27 ` Linus Torvalds
2006-06-05 21:20 ` Nicolas Pitre
2006-06-05 21:40 ` Linus Torvalds
2006-06-05 23:13 ` Nicolas Pitre
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=Pine.LNX.4.64.0606051155000.5498@g5.osdl.org \
--to=torvalds@osdl.org \
--cc=git@vger.kernel.org \
--cc=junkio@cox.net \
--cc=nico@cam.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).