From: "Marko Kreen" <markokr@gmail.com>
To: "Linus Torvalds" <torvalds@linux-foundation.org>
Cc: "Dmitry Potapov" <dpotapov@gmail.com>,
"Andreas Ericsson" <ae@op5.se>,
"Git Mailing List" <git@vger.kernel.org>,
"Junio C Hamano" <gitster@pobox.com>
Subject: Re: I'm a total push-over..
Date: Fri, 25 Jan 2008 22:52:06 +0200 [thread overview]
Message-ID: <e51f66da0801251252r1950c2d5g12caa5e71b9a37a@mail.gmail.com> (raw)
In-Reply-To: <alpine.LFD.1.00.0801240839590.2803@woody.linux-foundation.org>
On 1/24/08, Linus Torvalds <torvalds@linux-foundation.org> wrote:
> You can do a perfectly fine 8-bytes-at-a-time hash for almost 100% of all
> source code projects in UTF-8, without *ever* doing any format changes at
> all. Admittedly, it's a lot easier if the hash is a pure in-memory one (ie
> we don't care about byte-order or size of integers or anything like that),
> but that's the common case for most hashes that aren't used for BTree
> lookup on disk or something like that.
>
> Here, let me show you:
>
> unsigned int name_hash(const char *name, int size)
Well, although this is very clever approach, I suggest against it.
You'll end up with complex code that gives out substandard results.
I think its better to have separate case-folding function (or several),
that copies string to temp buffer and then run proper optimized hash
function on that buffer.
That way you can use already tested building blocks and can optimize
both sides separately. Eg. the folding-only function can aswell be
optimized to load 4 or 8-byte at-a-time. This also isolates hashing
from exact details how folding happens to access the input string which
seem to be the weak point in your approach. (In both collision and
complexity sense.)
Such temp buffer happens to fits my lookup3_memcpy also better (heh).
Its weak point is that on platforms that do not allow unaligned access,
it degenerates to byte-by-byte loading. But if know you always
have aligned buffer, you can notify gcc to do 4-byte fetch there too.
It should be as simple as tagging data pointer as uint32_t *.
Anyway, now you dont need to worry about folding when picking hash.
> Basically, it's almost always a stupid thing to actually normalize a whole
> string. You do those things character-by-character, and only lazily when
> you actually need to!
If your input strings are over kilobyte on average then I'd
agree with you, but if you process 20-30 bytes on average,
is the additional complexity worth it?
--
marko
next prev parent reply other threads:[~2008-01-25 20:52 UTC|newest]
Thread overview: 51+ messages / expand[flat|nested] mbox.gz Atom feed top
2008-01-22 23:37 I'm a total push-over Linus Torvalds
2008-01-23 1:35 ` Kevin Ballard
2008-01-23 2:23 ` Junio C Hamano
2008-01-23 2:36 ` Junio C Hamano
2008-01-23 12:24 ` Johannes Schindelin
2008-01-23 12:28 ` David Kastrup
2008-01-23 12:56 ` Theodore Tso
2008-01-23 2:58 ` Linus Torvalds
2008-01-23 3:19 ` Linus Torvalds
2008-01-25 6:50 ` Junio C Hamano
2008-01-25 16:24 ` Linus Torvalds
2008-01-23 7:23 ` Junio C Hamano
2008-01-23 12:25 ` Johannes Schindelin
2008-01-23 16:25 ` Linus Torvalds
2008-01-23 16:34 ` Johannes Schindelin
2008-01-23 17:09 ` Linus Torvalds
2008-01-23 17:29 ` Linus Torvalds
2008-01-25 5:21 ` Jeremy Maitin-Shepard
2008-01-25 12:51 ` Johannes Schindelin
2008-01-25 18:19 ` Jeremy Maitin-Shepard
2008-01-25 18:24 ` Johannes Schindelin
2008-01-25 19:07 ` Junio C Hamano
2008-01-23 8:32 ` Andreas Ericsson
2008-01-23 9:15 ` Dmitry Potapov
2008-01-23 9:31 ` Andreas Ericsson
2008-01-23 14:01 ` Marko Kreen
2008-01-23 14:39 ` Andreas Ericsson
2008-01-24 6:51 ` Luke Lu
2008-01-24 10:24 ` Andreas Ericsson
2008-01-24 13:19 ` Marko Kreen
2008-01-24 16:00 ` Andreas Ericsson
2008-01-24 16:13 ` Marko Kreen
2008-01-24 16:28 ` Dmitry Potapov
2008-01-24 17:15 ` Linus Torvalds
2008-01-24 18:45 ` Dmitry Potapov
2008-01-24 19:08 ` Linus Torvalds
2008-01-25 20:52 ` Marko Kreen [this message]
2008-01-25 22:16 ` Linus Torvalds
2008-01-25 22:35 ` Linus Torvalds
2008-01-26 12:16 ` Marko Kreen
2008-01-27 6:51 ` Linus Torvalds
2008-01-27 8:21 ` Dmitry Potapov
2008-01-27 14:07 ` Johannes Schindelin
2008-01-27 14:48 ` Dmitry Potapov
2008-01-27 9:45 ` Marko Kreen
2008-01-27 15:06 ` Dmitry Potapov
2008-01-26 12:37 ` Marko Kreen
2008-01-25 20:08 ` Marko Kreen
2008-01-23 17:10 ` Dmitry Potapov
2008-01-24 10:39 ` Andreas Ericsson
2008-01-23 16:06 ` Linus Torvalds
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=e51f66da0801251252r1950c2d5g12caa5e71b9a37a@mail.gmail.com \
--to=markokr@gmail.com \
--cc=ae@op5.se \
--cc=dpotapov@gmail.com \
--cc=git@vger.kernel.org \
--cc=gitster@pobox.com \
--cc=torvalds@linux-foundation.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).