From: Jeff Garzik <jgarzik@pobox.com>
To: Ray Heasman <lists@mythral.org>
Cc: Git Mailing List <git@vger.kernel.org>,
    Linus Torvalds <torvalds@osdl.org>
Subject: Re: Hash collision count
Date: Sat, 23 Apr 2005 19:20:21 -0400
Message-ID: <426AD835.5070404@pobox.com>
In-Reply-To: <1114297231.10264.12.camel@maze.mythral.org>

Ray Heasman wrote:
> On Sat, 2005-04-23 at 16:27 -0400, Jeff Garzik wrote:
>
>>Ideally a hash + collision-count pair would make the best key, rather
>>than just hash alone.
>>
>>A collision -will- occur eventually, and it is trivial to avoid this
>>problem:
>>
>>     $n = 0
>> restart:
>>     attempt to store as $hash-$n
>>     if $hash-$n exists (unlikely)
>>         $n++
>>         goto restart
>>     key = $hash-$n
>>
>
>
> Great. So what have you done here? Suppose you have 32 bits of counter
> for n. Whoopee, you just added 32 bits to your hash, using a two stage
> algorithm. So, you have a 192 bit hash assuming you started with the 160
> bit SHA. And, one day your 32 bit counter won't be enough. Then what?

First, there is no 32-bit limit. git stores keys (aka hashes) as
strings. As it should.

Second, in your scenario, it's highly unlikely you would get 4 billion
sha1 hash collisions, even if you had the disk space to store such a git
database.

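To put a rough number on "highly unlikely", here is a minimal Python
sketch of the usual birthday-style estimate. It assumes sha1 output
behaves like uniformly random 160-bit values and that nobody is
deliberately constructing collisions; both are assumptions, and the
function names are made up for illustration.

```python
from math import isqrt

SHA1_BITS = 160  # SHA-1 digest size in bits


def expected_colliding_pairs(num_objects: int, bits: int = SHA1_BITS) -> float:
    """Expected number of colliding pairs among num_objects random digests."""
    pairs = num_objects * (num_objects - 1) // 2  # ways to pick two objects
    return pairs / 2**bits  # each pair collides with probability 2**-bits


def objects_needed(collisions: int, bits: int = SHA1_BITS) -> int:
    """Approximate object count at which `collisions` colliding pairs are expected."""
    # Solve n**2 / 2 ~= collisions * 2**bits for n.
    return isqrt(2 * collisions * 2**bits)


if __name__ == "__main__":
    print(expected_colliding_pairs(10**12))  # a trillion stored objects
    print(objects_needed(2**32))             # objects needed for ~4 billion collisions
```

Under those assumptions, the object count needed to expect about 4
billion collisions comes out above 2**96, far more than any disk could
hold.
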
>>Tangent-as-the-reason-I-bring-this-up:
>>
>>One of my long-term projects is an archive service, somewhat like
>>Plan9's venti: a multi-server key-value database, with sha1 hash as the
>>key.
>>
>>However, as the database grows into the terabyte (possibly petabyte)
>>range, the likelihood of a collision transitions rapidly from unlikely
>>-> possible -> likely.
>>
>>Since it is -so- simple to guarantee that you avoid collisions, I'm
>>hoping git will do so before the key structure is too ingrained.
>
>
> You aren't solving anything. You're just putting it off, and doing it in
> a way that breaks all the wonderful semantics possible by just assuming
> that the hash is unique. All of a sudden we are doing checks of data
> that we never did before, and we have to do the check trillions of times
> before the CPU time spent pays off.

First, the hash is NOT unique.

Second, you lose data if you pretend it is unique. I don't like losing
data.

Third, a data check only occurs in the highly unlikely case that a hash
already exists -- a collision. Rather than "trillions of times", more
like "one in a trillion chance."

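The hash + collision-count key discussed in this thread could be
sketched in Python roughly as follows. This is not git code: the
class, the in-memory dict, and the injectable hash_fn are all
hypothetical, and comparing the stored bytes when a key already exists
is one reading of the "data check" above.

```python
import hashlib


class CollisionCountingStore:
    """In-memory stand-in for an object database keyed by '<sha1>-<n>'."""

    def __init__(self, hash_fn=None):
        self._objects = {}
        # hash_fn is injectable only so a collision can be forced in tests.
        self._hash_fn = hash_fn or (lambda data: hashlib.sha1(data).hexdigest())

    def put(self, data: bytes) -> str:
        digest = self._hash_fn(data)
        n = 0  # keys are strings and Python ints are unbounded: no 32-bit limit
        while True:
            key = f"{digest}-{n}"
            existing = self._objects.get(key)
            if existing is None:
                self._objects[key] = data  # common case: no data comparison at all
                return key
            if existing == data:
                return key  # same content is already stored under this key
            n += 1  # genuine collision: different content, same hash

    def get(self, key: str) -> bytes:
        return self._objects[key]
```

With a deliberately weak hash_fn that always returns the same string,
storing b"a" and then b"b" yields keys ending in -0 and -1, and
storing b"a" again returns the -0 key.
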
Jeff