From: "Marko Kreen" <markokr@gmail.com>
To: "Linus Torvalds" <torvalds@linux-foundation.org>
Cc: "Dmitry Potapov" <dpotapov@gmail.com>,
"Andreas Ericsson" <ae@op5.se>,
"Git Mailing List" <git@vger.kernel.org>,
"Junio C Hamano" <gitster@pobox.com>
Subject: Re: I'm a total push-over..
Date: Sun, 27 Jan 2008 11:45:25 +0200
Message-ID: <e51f66da0801270145w41a94414g7bebd4a31293344d@mail.gmail.com>
In-Reply-To: <alpine.LFD.1.00.0801262247140.3222@www.l.google.com>

On 1/27/08, Linus Torvalds <torvalds@linux-foundation.org> wrote:
> On Sat, 26 Jan 2008, Marko Kreen wrote:
> >
> > Here you misunderstood me, I was proposing following:
> >
> > int hash_folded(const char *str, int len)
> > {
> >     char buf[512];
> >     do_folding(buf, str, len);
> >     return do_hash(buf, len);
> > }
> >
> > That is - the folded string should stay internal to hash function.
>
> If it's internal, it's much better, but you still missed the performance
> angle.
>
> The fact is, hashing can take shortcuts that folding cannot do!
>
> Case folding, by definition, has to be "exact" (since the whole point is
> that you're going to use the same folding function to do the compare, so
> if you play games with folding, the compares will be wrong).
>
> But hashing doesn't have to be exact. It's ok to hash '{' and '[' as if
> they were different cases of the same character, if that gives you a
> faster hash function. Especially as those characters are rather rare in
> filenames.
>
> So if you do hashing as a function of its own, you can simply do a better
> job at it.
>
> I do agree that the functions that create a folded set of characters from
> a _complex_ UTF-8 character should be shared between folding and hashing,
> since that code is too complex and there are no simple shortcuts for doing
> a faster hash that still retains all the properties we want.

Well, you can always have a fold_quick_and_dirty() function, used only
internally by hash_folded(), which can:

- fold with a simple |= 0x20202020..
- write out full uint32/64 words, with no need to make the result a
  proper string
- zero-fill at the end, so the hash function does not need to check
  for a partial block, which is a pretty expensive part of hashing.

The win would be:

- more modularized code
- can use a faster (or any) hash
- the hash function can be certain to work on aligned data
  (a win on non-x86)

The minus:

- some memory I/O overhead, which may or may not matter
- the parts would not be fully generic, but specific to hashing
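
To make the proposal concrete, here is a minimal sketch of how such a
pair of functions might look. Only the names hash_folded() and
fold_quick_and_dirty() and the 512-byte buffer come from this thread;
hash_words(), FOLD_MAX, the FNV-style constants, and the decision to
truncate over-long input are assumptions made purely for illustration.
It is not code from git.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define FOLD_MAX 512 /* same fixed bound as buf[512] in the quoted hash_folded() */

/* Hypothetical helper: copy str into an aligned word buffer and OR 0x20
 * into every byte. Returns the number of 32-bit words written. */
static size_t fold_quick_and_dirty(uint32_t *out, const char *str, size_t len)
{
    size_t nwords = (len + 3) / 4;
    size_t i;

    assert(len <= FOLD_MAX);
    if (nwords == 0)
        return 0;
    out[nwords - 1] = 0;        /* zero-fill the tail of the last word */
    memcpy(out, str, len);      /* byte copy, so str needs no alignment */
    for (i = 0; i < nwords; i++)
        out[i] |= 0x20202020u;  /* inexact fold: 'A'->'a', but also '['->'{' */
    return nwords;              /* padding bytes are now 0x20, still a fixed value */
}

/* Placeholder word-at-a-time hash (FNV-1a style constants, applied per
 * word). Assumed for illustration; any hash over whole words would do. */
static uint32_t hash_words(const uint32_t *w, size_t nwords)
{
    uint32_t h = 2166136261u;
    size_t i;

    for (i = 0; i < nwords; i++)
        h = (h ^ w[i]) * 16777619u;
    return h;
}

uint32_t hash_folded(const char *str, size_t len)
{
    uint32_t buf[FOLD_MAX / 4];

    if (len > FOLD_MAX)
        len = FOLD_MAX;         /* sketch only: hash just the first FOLD_MAX bytes */
    return hash_words(buf, fold_quick_and_dirty(buf, str, len));
}
```

With this sketch, "Makefile" and "MAKEFILE" hash identically, and so do
"a[" and "a{", which is the kind of harmless inexactness discussed in
the quoted text. The hash loop never sees a partial block. Note that the
resulting value depends on byte order, which is fine for an in-memory
table but not for anything stored on disk.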
--
marko

PS. Typo in my last mail: "inner loop should be reversible - that
means that details from beginning of data should shift out of
horizon."  That should obviously read "data should _not_ shift
out of horizon."

btw, "reversible" for integer hashes means that there is a 1:1
mapping between input and output - no collisions.  Thus no info
loss.
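
As a small illustration of what "reversible" means here, the sketch
below shows one 32-bit mixing step together with its exact inverse. The
names mix32() and unmix32() and the constants are assumptions chosen for
the example, not something proposed in this thread. The step is
invertible because x ^= x >> 16 undoes itself and multiplication by an
odd constant has an inverse modulo 2^32.

```c
#include <stdint.h>

/* One reversible mixing step on a 32-bit word (hypothetical example). */
uint32_t mix32(uint32_t x)
{
    x ^= x >> 16;       /* self-inverse: the high 16 bits are unchanged */
    x *= 0x45d9f3bu;    /* odd multiplier, so it is invertible mod 2^32 */
    return x;
}

/* Exact inverse of mix32(): undo the steps in reverse order. */
uint32_t unmix32(uint32_t y)
{
    y *= 0x119de1f3u;   /* 0x45d9f3b * 0x119de1f3 == 1 (mod 2^32) */
    y ^= y >> 16;
    return y;
}
```

Since unmix32(mix32(x)) == x for every x, no two distinct 32-bit inputs
can produce the same output, so the step itself loses no information.
Collisions in a real hash then come only from reducing a longer input
to a fixed-size state.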