git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: "Marko Kreen" <markokr@gmail.com>
To: "Linus Torvalds" <torvalds@linux-foundation.org>
Cc: "Dmitry Potapov" <dpotapov@gmail.com>,
	"Andreas Ericsson" <ae@op5.se>,
	"Git Mailing List" <git@vger.kernel.org>,
	"Junio C Hamano" <gitster@pobox.com>
Subject: Re: I'm a total push-over..
Date: Sun, 27 Jan 2008 11:45:25 +0200	[thread overview]
Message-ID: <e51f66da0801270145w41a94414g7bebd4a31293344d@mail.gmail.com> (raw)
In-Reply-To: <alpine.LFD.1.00.0801262247140.3222@www.l.google.com>

On 1/27/08, Linus Torvalds <torvalds@linux-foundation.org> wrote:
> On Sat, 26 Jan 2008, Marko Kreen wrote:
> >
> > Here you misunderstood me, I was proposing following:
> >
> > int hash_folded(const char *str, int len)
> > {
> >    char buf[512];
> >    do_folding(buf, str, len);
> >    return do_hash(buf, len);
> > }
> >
> > That is - the folded string should stay internal to hash function.
>
> If it's internal, it's much better, but you still missed the performance
> angle.
>
> The fact is, hashing can take shortcuts that folding cannot do!
>
> Case folding, by definition, has to be "exact" (since the whole point is
> what you're going to use the same folding function to do the compare, so
> if you play games with folding, the compares will be wrong).
>
> But hashing doesn't have to be exact. It's ok to hash '{' and '[' as if
> they were different cases of the same character, if that gives you a
> faster hash function. Especially as those charactes are rather rare in
> filenames.
>
> So if you do hashing as a function of its own, you can simply do a better
> job at it.
>
> I do agree that the functions that create a folded set of characters from
> a _complex_ UTF-8 character should be shared between folding and hashing,
> since that code is too complex and there are no simple shortcuts for doing
> a faster hash that still retains all the properties we want.

Well, you can always have fold_quick_and_dirty() function that
is used only internally in hash_folded() function, which can:

- fold with simple |= 0x20202020..
- write out full uint32/64, no need to make result proper string
- zero-fill at the end, so hash function does not need to check
  for partial block, which is pretty expensive part of hashing.

The win would be:
- more modularized code
- can use faster/any hash
- hash function can be certain to work on aligned data
  (win on non-x86)

The minus:
- some memory i/o overhead which may or may not matter
- the parts would not be fully generic, but special to hashing


-- 
marko


PS. Typo in last mail - "inner loop should be reversible - that
means that details from beginning of data should shift out of
horizon."  That obviously means "data should _not_ shift
out of horizon.

btw, "reversible" for integer hashes means that there is 1:1
mapping between input and output - no collisions.  Thus
no info loss.

  parent reply	other threads:[~2008-01-27  9:45 UTC|newest]

Thread overview: 51+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2008-01-22 23:37 I'm a total push-over Linus Torvalds
2008-01-23  1:35 ` Kevin Ballard
2008-01-23  2:23 ` Junio C Hamano
2008-01-23  2:36   ` Junio C Hamano
2008-01-23 12:24     ` Johannes Schindelin
2008-01-23 12:28       ` David Kastrup
2008-01-23 12:56         ` Theodore Tso
2008-01-23  2:58   ` Linus Torvalds
2008-01-23  3:19     ` Linus Torvalds
2008-01-25  6:50       ` Junio C Hamano
2008-01-25 16:24         ` Linus Torvalds
2008-01-23  7:23     ` Junio C Hamano
2008-01-23 12:25       ` Johannes Schindelin
2008-01-23 16:25         ` Linus Torvalds
2008-01-23 16:34           ` Johannes Schindelin
2008-01-23 17:09             ` Linus Torvalds
2008-01-23 17:29               ` Linus Torvalds
2008-01-25  5:21                 ` Jeremy Maitin-Shepard
2008-01-25 12:51                   ` Johannes Schindelin
2008-01-25 18:19                     ` Jeremy Maitin-Shepard
2008-01-25 18:24                       ` Johannes Schindelin
2008-01-25 19:07                       ` Junio C Hamano
2008-01-23  8:32 ` Andreas Ericsson
2008-01-23  9:15   ` Dmitry Potapov
2008-01-23  9:31     ` Andreas Ericsson
2008-01-23 14:01       ` Marko Kreen
2008-01-23 14:39         ` Andreas Ericsson
2008-01-24  6:51           ` Luke Lu
2008-01-24 10:24             ` Andreas Ericsson
2008-01-24 13:19           ` Marko Kreen
2008-01-24 16:00             ` Andreas Ericsson
2008-01-24 16:13               ` Marko Kreen
2008-01-24 16:28               ` Dmitry Potapov
2008-01-24 17:15                 ` Linus Torvalds
2008-01-24 18:45                   ` Dmitry Potapov
2008-01-24 19:08                     ` Linus Torvalds
2008-01-25 20:52                   ` Marko Kreen
2008-01-25 22:16                     ` Linus Torvalds
2008-01-25 22:35                       ` Linus Torvalds
2008-01-26 12:16                       ` Marko Kreen
2008-01-27  6:51                         ` Linus Torvalds
2008-01-27  8:21                           ` Dmitry Potapov
2008-01-27 14:07                             ` Johannes Schindelin
2008-01-27 14:48                               ` Dmitry Potapov
2008-01-27  9:45                           ` Marko Kreen [this message]
2008-01-27 15:06                             ` Dmitry Potapov
2008-01-26 12:37                       ` Marko Kreen
2008-01-25 20:08               ` Marko Kreen
2008-01-23 17:10       ` Dmitry Potapov
2008-01-24 10:39         ` Andreas Ericsson
2008-01-23 16:06   ` Linus Torvalds

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=e51f66da0801270145w41a94414g7bebd4a31293344d@mail.gmail.com \
    --to=markokr@gmail.com \
    --cc=ae@op5.se \
    --cc=dpotapov@gmail.com \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=torvalds@linux-foundation.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).