* Hash collision count
@ 2005-04-23 20:27 Jeff Garzik
  2005-04-23 20:33 ` Jeff Garzik
  ` (2 more replies)
  0 siblings, 3 replies; 14+ messages in thread
From: Jeff Garzik @ 2005-04-23 20:27 UTC (permalink / raw)
To: Git Mailing List, Linus Torvalds

Ideally a hash + collision-count pair would make the best key, rather
than just hash alone.

A collision -will- occur eventually, and it is trivial to avoid this
problem:

	$n = 0
	attempt to store as $hash-$n
	if $hash-$n exists (unlikely)
		$n++
		goto restart
	key = $hash-$n

Tangent-as-the-reason-I-bring-this-up:

One of my long-term projects is an archive service, somewhat like
Plan9's venti: a multi-server key-value database, with sha1 hash as the
key.

However, as the database grows into the terabyte (possibly petabyte)
range, the likelihood of a collision transitions rapidly from unlikely
-> possible -> likely.

Since it is -so- simple to guarantee that you avoid collisions, I'm
hoping git will do so before the key structure is too ingrained.

	Jeff

^ permalink raw reply	[flat|nested] 14+ messages in thread
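For illustration, here is a minimal sketch (in Python) of the hash-plus-collision-count keying Jeff proposes. The `CollisionCountStore` class and the tiny `weak` hash below are hypothetical stand-ins for demonstration, not git's or venti's actual storage code; the hash function is pluggable so that a collision can actually be triggered at toy scale.

```python
import hashlib


class CollisionCountStore:
    """Key-value store keyed by "<hash>-<n>", where n is a collision counter.

    The hash function is pluggable so a collision can be simulated in a
    test; a real store would use SHA-1 over the content.
    """

    def __init__(self, hash_fn=None):
        self.hash_fn = hash_fn or (lambda data: hashlib.sha1(data).hexdigest())
        self.objects = {}  # key string -> content bytes

    def store(self, data):
        """Store data and return its key.

        Identical content reuses its existing key (a reference); distinct
        content with the same hash gets the next counter value.
        """
        h = self.hash_fn(data)
        n = 0
        while True:
            key = f"{h}-{n}"
            existing = self.objects.get(key)
            if existing is None:
                self.objects[key] = data
                return key
            if existing == data:   # same content: a reference, not a collision
                return key
            n += 1                 # true collision: bump the counter and retry


# Simulate collisions with a deliberately weak 8-bit hash.
weak = lambda data: format(sum(data) % 256, "02x")
store = CollisionCountStore(hash_fn=weak)
```

Note that, as Jeff concedes in his follow-up, telling "reference" apart from "collision" forces a content comparison against every `$hash-$n` already on disk.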
* Re: Hash collision count
  2005-04-23 20:27 Hash collision count Jeff Garzik
@ 2005-04-23 20:33 ` Jeff Garzik
  2005-04-23 23:00 ` Ray Heasman
  2005-04-24  7:56 ` David Lang
  2 siblings, 0 replies; 14+ messages in thread
From: Jeff Garzik @ 2005-04-23 20:33 UTC (permalink / raw)
To: Git Mailing List, Linus Torvalds

Jeff Garzik wrote:
> Ideally a hash + collision-count pair would make the best key, rather
> than just hash alone.
>
> A collision -will- occur eventually, and it is trivial to avoid this
> problem:
>
> 	$n = 0
> 	attempt to store as $hash-$n
> 	if $hash-$n exists (unlikely)
> 		$n++
> 		goto restart
> 	key = $hash-$n

This of course presumes that you know that your new data does not exist
in the cache. Otherwise, you would need to hash the on-disk $hash-0 to
determine if it's a collision or a reference.

	Jeff
* Re: Hash collision count
  2005-04-23 20:27 Hash collision count Jeff Garzik
  2005-04-23 20:33 ` Jeff Garzik
@ 2005-04-23 23:00 ` Ray Heasman
  2005-04-23 23:20   ` Jeff Garzik
  2005-04-24  7:56 ` David Lang
  2 siblings, 1 reply; 14+ messages in thread
From: Ray Heasman @ 2005-04-23 23:00 UTC (permalink / raw)
To: Jeff Garzik; +Cc: Git Mailing List, Linus Torvalds

On Sat, 2005-04-23 at 16:27 -0400, Jeff Garzik wrote:
> Ideally a hash + collision-count pair would make the best key, rather
> than just hash alone.
>
> A collision -will- occur eventually, and it is trivial to avoid this
> problem:
>
> 	$n = 0
> 	attempt to store as $hash-$n
> 	if $hash-$n exists (unlikely)
> 		$n++
> 		goto restart
> 	key = $hash-$n

Great. So what have you done here? Suppose you have 32 bits of counter
for n. Whoopee, you just added 32 bits to your hash, using a two-stage
algorithm. So, you have a 192-bit hash, assuming you started with the
160-bit SHA. And, one day your 32-bit counter won't be enough. Then
what?

> Tangent-as-the-reason-I-bring-this-up:
>
> One of my long-term projects is an archive service, somewhat like
> Plan9's venti: a multi-server key-value database, with sha1 hash as the
> key.
>
> However, as the database grows into the terabyte (possibly petabyte)
> range, the likelihood of a collision transitions rapidly from unlikely
> -> possible -> likely.
>
> Since it is -so- simple to guarantee that you avoid collisions, I'm
> hoping git will do so before the key structure is too ingrained.

You aren't solving anything. You're just putting it off, and doing it in
a way that breaks all the wonderful semantics possible by just assuming
that the hash is unique. All of a sudden we are doing checks of data
that we never did before, and we have to do the check trillions of times
before the CPU time spent pays off. If you want to use a bigger hash
then use a bigger hash, but don't fool yourself into thinking that isn't
what you are doing.

Ciao,
Ray
* Re: Hash collision count
  2005-04-23 23:00 ` Ray Heasman
@ 2005-04-23 23:20 ` Jeff Garzik
  2005-04-23 23:46   ` Petr Baudis
  2005-04-24  1:01   ` Ray Heasman
  0 siblings, 2 replies; 14+ messages in thread
From: Jeff Garzik @ 2005-04-23 23:20 UTC (permalink / raw)
To: Ray Heasman; +Cc: Git Mailing List, Linus Torvalds

Ray Heasman wrote:
> On Sat, 2005-04-23 at 16:27 -0400, Jeff Garzik wrote:
>
>> Ideally a hash + collision-count pair would make the best key, rather
>> than just hash alone.
>>
>> A collision -will- occur eventually, and it is trivial to avoid this
>> problem:
>>
>> 	$n = 0
>> 	attempt to store as $hash-$n
>> 	if $hash-$n exists (unlikely)
>> 		$n++
>> 		goto restart
>> 	key = $hash-$n
>
> Great. So what have you done here? Suppose you have 32 bits of counter
> for n. Whoopee, you just added 32 bits to your hash, using a two stage
> algorithm. So, you have a 192 bit hash assuming you started with the 160
> bit SHA. And, one day your 32 bit counter won't be enough. Then what?

First, there is no 32-bit limit. git stores keys (aka hashes) as
strings. As it should.

Second, in your scenario, it's highly unlikely you would get 4 billion
sha1 hash collisions, even if you had the disk space to store such a git
database.

>> Tangent-as-the-reason-I-bring-this-up:
>>
>> One of my long-term projects is an archive service, somewhat like
>> Plan9's venti: a multi-server key-value database, with sha1 hash as the
>> key.
>>
>> However, as the database grows into the terabyte (possibly petabyte)
>> range, the likelihood of a collision transitions rapidly from unlikely
>> -> possible -> likely.
>>
>> Since it is -so- simple to guarantee that you avoid collisions, I'm
>> hoping git will do so before the key structure is too ingrained.
>
> You aren't solving anything. You're just putting it off, and doing it in
> a way that breaks all the wonderful semantics possible by just assuming
> that the hash is unique. All of a sudden we are doing checks of data
> that we never did before, and we have to do the check trillions of times
> before the CPU time spent pays off.

First, the hash is NOT unique.

Second, you lose data if you pretend it is unique. I don't like losing
data.

Third, a data check only occurs in the highly unlikely case that a hash
already exists -- a collision. Rather than "trillions of times", more
like "one in a trillion chance."

	Jeff
* Re: Hash collision count
  2005-04-23 23:20 ` Jeff Garzik
@ 2005-04-23 23:46 ` Petr Baudis
  2005-04-24  0:35   ` Jeff Garzik
  2005-04-25 23:50   ` Tom Lord
  1 sibling, 2 replies; 14+ messages in thread
From: Petr Baudis @ 2005-04-23 23:46 UTC (permalink / raw)
To: Jeff Garzik; +Cc: Ray Heasman, Git Mailing List, Linus Torvalds

Dear diary, on Sun, Apr 24, 2005 at 01:20:21AM CEST, I got a letter
where Jeff Garzik <jgarzik@pobox.com> told me that...
> Second, in your scenario, it's highly unlikely you would get 4 billion
> sha1 hash collisions, even if you had the disk space to store such a git
> database.

It's highly unlikely you would get a _single_ collision.

> First, the hash is NOT unique.
>
> Second, you lose data if you pretend it is unique. I don't like losing
> data.

*sigh*

We've been through this before, haven't we?

> Third, a data check only occurs in the highly unlikely case that a hash
> already exists -- a collision. Rather than "trillions of times", more
> like "one in a trillion chance."

No, a collision is pretty common thing, actually. It's the main power of
git, actually - when you do read-tree, modify it and do write-tree
(typically when doing commit), everything you didn't modify (99% of
stuff, most likely) is basically a collision - but it's ok since it just
stays the same.

--
Petr "Pasky" Baudis
Stuff: http://pasky.or.cz/
C++: an octopus made by nailing extra legs onto a dog. -- Steve Taylor
* Re: Hash collision count
  2005-04-23 23:46 ` Petr Baudis
@ 2005-04-24  0:35 ` Jeff Garzik
  2005-04-24  0:40   ` Petr Baudis
  1 sibling, 1 reply; 14+ messages in thread
From: Jeff Garzik @ 2005-04-24 0:35 UTC (permalink / raw)
To: Petr Baudis; +Cc: Ray Heasman, Git Mailing List, Linus Torvalds

Petr Baudis wrote:
> Dear diary, on Sun, Apr 24, 2005 at 01:20:21AM CEST, I got a letter
> where Jeff Garzik <jgarzik@pobox.com> told me that...
>> Second, in your scenario, it's highly unlikely you would get 4 billion
>> sha1 hash collisions, even if you had the disk space to store such a git
>> database.
>
> It's highly unlikely you would get a _single_ collision.

Agreed.

>> First, the hash is NOT unique.
>>
>> Second, you lose data if you pretend it is unique. I don't like losing
>> data.
>
> *sigh*
>
> We've been through this before, haven't we?

<shrug>  In messing around with archive servers, people get nervous
using (hash,value) based storage if there isn't even a simple test for
collisions.

Someone just told me that one implementation of the Venti archive
server[1] simply fails the write, if a data item exists with a duplicate
hash value. As long as git fails or does something -predictable- in the
face of the hash collision, I'm satisfied.

	Jeff

[1] http://www.cs.bell-labs.com/sys/doc/venti/venti.html
* Re: Hash collision count
  2005-04-24  0:35 ` Jeff Garzik
@ 2005-04-24  0:40 ` Petr Baudis
  2005-04-24  0:43   ` Jeff Garzik
  0 siblings, 1 reply; 14+ messages in thread
From: Petr Baudis @ 2005-04-24 0:40 UTC (permalink / raw)
To: Jeff Garzik; +Cc: Ray Heasman, Git Mailing List, Linus Torvalds

Dear diary, on Sun, Apr 24, 2005 at 02:35:57AM CEST, I got a letter
where Jeff Garzik <jgarzik@pobox.com> told me that...
> Someone just told me that one implementation of the Venti archive
> server[1] simply fails the write, if a data item exists with a duplicate
> hash value. As long as git fails or does something -predictable- in the
> face of the hash collision, I'm satisfied.

-DCOLLISION_CHECK

See the top of Makefile.

--
Petr "Pasky" Baudis
Stuff: http://pasky.or.cz/
C++: an octopus made by nailing extra legs onto a dog. -- Steve Taylor
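The behavior behind `-DCOLLISION_CHECK` can be sketched roughly: before writing an object file, compare the incoming bytes against any existing object of the same name, and fail predictably on a mismatch. The function name and file layout below are illustrative, not git's actual C code:

```python
import os


def write_object(path, data):
    """Write an object file, failing predictably on a hash collision.

    If a file already exists at `path`, compare contents: identical data
    is a no-op (the object is already stored); different data under the
    same name is treated as a collision and rejected.
    """
    if os.path.exists(path):
        with open(path, "rb") as f:
            if f.read() == data:
                return False  # already stored, nothing to do
        raise RuntimeError(f"collision detected for {path}")
    with open(path, "wb") as f:
        f.write(data)
    return True  # newly written
```

This is the "fail or do something predictable" behavior Jeff asked for: silent reuse for identical content, a hard error for a true collision.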
* Re: Hash collision count
  2005-04-24  0:40 ` Petr Baudis
@ 2005-04-24  0:43 ` Jeff Garzik
  2005-04-24 21:24   ` Imre Simon
  0 siblings, 1 reply; 14+ messages in thread
From: Jeff Garzik @ 2005-04-24 0:43 UTC (permalink / raw)
To: Petr Baudis; +Cc: Ray Heasman, Git Mailing List, Linus Torvalds

Petr Baudis wrote:
> -DCOLLISION_CHECK

Cool. I am happy, then :)

Make sure that's enabled by default...

Thanks,

	Jeff
* Re: Hash collision count
  2005-04-24  0:43 ` Jeff Garzik
@ 2005-04-24 21:24 ` Imre Simon
  2005-04-24 22:25   ` Whales falling on houses - was: Hash collision count Jon Seymour
  0 siblings, 1 reply; 14+ messages in thread
From: Imre Simon @ 2005-04-24 21:24 UTC (permalink / raw)
To: Jeff Garzik
Cc: Petr Baudis, Ray Heasman, Git Mailing List, Linus Torvalds, Imre Simon

On 4/23/05, Jeff Garzik <jgarzik@pobox.com> wrote:
> Petr Baudis wrote:
> > -DCOLLISION_CHECK
>
> Cool. I am happy, then :)
>
> Make sure that's enabled by default...
>
> Thanks,
>
> 	Jeff

I would like to second the suggestion that this should be enabled by
default.

First, I do think it is highly unlikely that a collision will ever be
found. Actually, because of this, if one is found it would be an
important byproduct of git, and it would probably influence future
system designs. Hence, I believe it is worth losing some efficiency in
order to shed more light on this question of collisions.

I would like to add a recipe by which one would surely produce a
collision between two files of the same length, quite similar to each
other. I believe there is a popular belief that this is pretty much
impossible. Of course, we do not have time to execute this algorithm,
but I believe it convinces everyone that similar files *do* exist
(actually in abundance) with the same hash.

1. Take your favorite text file, at least 160 characters long.
2. Choose 160 positions in this file.
3. For each position choose your favorite misspelling of that character.
4. Produce all 2^160 text files, all of the same length, choosing for
   each position either the original or the alternate character.
5. Add an arbitrary file of the same length, different from the above.

By the pigeonhole principle, two of these 2^160 + 1 files have the same
sha1 hash. Or, for that matter, for any 160-bit hash the same is true.

Cheers,

Imre Simon
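Imre's pigeonhole argument can be demonstrated at toy scale: truncate SHA-1 to 8 bits, and then any 257 distinct inputs must contain a pair with the same (truncated) hash. A small sketch, with `truncated_hash` and `find_collision` as illustrative helper names:

```python
import hashlib


def truncated_hash(data, bits=8):
    """SHA-1 truncated to `bits` bits -- a stand-in for a full 160-bit
    hash, so the pigeonhole effect is visible at toy scale."""
    digest = int.from_bytes(hashlib.sha1(data).digest(), "big")
    return digest >> (160 - bits)


def find_collision(inputs, bits=8):
    """Return the first pair of distinct inputs sharing a truncated hash,
    or None. With more than 2**bits distinct inputs, a pair is guaranteed."""
    seen = {}  # truncated hash -> input that produced it
    for data in inputs:
        h = truncated_hash(data, bits)
        if h in seen and seen[h] != data:
            return seen[h], data
        seen[h] = data
    return None


# 257 distinct inputs into 256 buckets: a collision must exist.
inputs = [str(i).encode() for i in range(257)]
pair = find_collision(inputs)
```

Replace 8 bits by 160 and 257 inputs by 2^160 + 1 files, and you have Imre's construction: guaranteed to succeed, and guaranteed never to finish.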
* Whales falling on houses - was: Hash collision count
  2005-04-24 21:24 ` Imre Simon
@ 2005-04-24 22:25 ` Jon Seymour
  0 siblings, 0 replies; 14+ messages in thread
From: Jon Seymour @ 2005-04-24 22:25 UTC (permalink / raw)
To: Imre Simon
Cc: Jeff Garzik, Petr Baudis, Ray Heasman, Git Mailing List, Linus Torvalds, Imre Simon

> 1. Take your favorite text file, at least 160 characters long.
> 2. Choose 160 positions in this file.
> 3. For each position choose your favorite misspelling of that character.
> 4. Produce all 2^160 text files, all of the same length, choosing for
>    each position either the original or the alternate character.
> 5. Add an arbitrary file of the same length, different from the above.
>
> Two of these files have the same sha1 hash. Or, for that matter, for
> any 160 bit hash the same is true.

If you were to create those files at 10^9 files per second, it would
take you 10^38 years before you were in a position to take step 5. I am
about to turn 38 this week. Would that I could live to 10^38.

It's absolute rubbish to say that the best solution from an
"engineering" point of view is to eliminate the infinitesimal
possibility of a collision. Engineering is all about assessing risk and
making suitable trade-offs. Every day of the week, "real" engineers
accept life-threatening risks that put thousands of people's lives in
danger. They do it because we live in a world where risk cannot be
eliminated, merely reduced to an acceptable level.

I can't understand that you are prepared to drive a car or fly in a
Boeing or Airbus that has a demonstrated risk of killing you, yet you
want to insist on eliminating a risk that at most might create an
interesting Slashdot headline: "Jolt-crazed programmer finds SHA1
collision - but later dies when whale falls on house".

jon.
--
homepage: http://www.zeta.org.au/~jon/
blog: http://orwelliantremors.blogspot.com/
* Re: Hash collision count
  2005-04-23 23:46 ` Petr Baudis
  2005-04-24  0:35   ` Jeff Garzik
@ 2005-04-25 23:50 ` Tom Lord
  2005-04-26  0:00   ` Petr Baudis
  1 sibling, 1 reply; 14+ messages in thread
From: Tom Lord @ 2005-04-25 23:50 UTC (permalink / raw)
To: pasky; +Cc: git

From: Petr Baudis <pasky@ucw.cz>

Pasky:

> No, a collision is pretty common thing, actually. It's the main power of
> git, actually - when you do read-tree, modify it and do write-tree
> (typically when doing commit), everything you didn't modify (99% of
> stuff, most likely) is basically a collision - but it's ok since it
> just stays the same.

That is not the way people ordinarily use the word "collision".
It's pretty much the opposite of the normal way, actually.

-t
* Re: Hash collision count
  2005-04-25 23:50 ` Tom Lord
@ 2005-04-26  0:00 ` Petr Baudis
  0 siblings, 0 replies; 14+ messages in thread
From: Petr Baudis @ 2005-04-26 0:00 UTC (permalink / raw)
To: Tom Lord; +Cc: git

Dear diary, on Tue, Apr 26, 2005 at 01:50:31AM CEST, I got a letter
where Tom Lord <lord@emf.net> told me that...
> From: Petr Baudis <pasky@ucw.cz>
>
> Pasky:
>
> > No, a collision is pretty common thing, actually. It's the main power of
> > git, actually - when you do read-tree, modify it and do write-tree
> > (typically when doing commit), everything you didn't modify (99% of
> > stuff, most likely) is basically a collision - but it's ok since it
> > just stays the same.
>
> That is not the way people ordinarily use the word "collision".
> It's pretty much the opposite of the normal way, actually.

You need to quote me in the context of Jeff Garzik's

> > Third, a data check only occurs in the highly unlikely case that a hash
> > already exists -- a collision. Rather than "trillions of times", more
> > like "one in a trillion chance."

I just wanted to point out that the data check would have to occur
every time you didn't modify an object.

Kind regards,

--
Petr "Pasky" Baudis
Stuff: http://pasky.or.cz/
C++: an octopus made by nailing extra legs onto a dog. -- Steve Taylor
* Re: Hash collision count
  2005-04-23 23:20 ` Jeff Garzik
  2005-04-23 23:46   ` Petr Baudis
@ 2005-04-24  1:01 ` Ray Heasman
  1 sibling, 0 replies; 14+ messages in thread
From: Ray Heasman @ 2005-04-24 1:01 UTC (permalink / raw)
To: Jeff Garzik; +Cc: Git Mailing List

On Sat, 2005-04-23 at 19:20 -0400, Jeff Garzik wrote:
> Ray Heasman wrote:
> > On Sat, 2005-04-23 at 16:27 -0400, Jeff Garzik wrote:
> >
> >> Ideally a hash + collision-count pair would make the best key, rather
> >> than just hash alone.
> >>
> >> A collision -will- occur eventually, and it is trivial to avoid this
> >> problem:
> >>
> >> 	$n = 0
> >> 	attempt to store as $hash-$n
> >> 	if $hash-$n exists (unlikely)
> >> 		$n++
> >> 		goto restart
> >> 	key = $hash-$n
> >
> > Great. So what have you done here? Suppose you have 32 bits of counter
> > for n. Whoopee, you just added 32 bits to your hash, using a two stage
> > algorithm. So, you have a 192 bit hash assuming you started with the 160
> > bit SHA. And, one day your 32 bit counter won't be enough. Then what?
>
> First, there is no 32-bit limit. git stores keys (aka hashes) as
> strings. As it should.

Oh great, now we have variable-length id strings too. And we'll have to
pretend the OS can store infinite-length file names.

> Second, in your scenario, it's highly unlikely you would get 4 billion
> sha1 hash collisions, even if you had the disk space to store such a git
> database.

Er. So your so-unlikely-the-sun-will-burn-out-first scenario beats my
so-unlikely-the-sun-will-burn-out-first scenario? Why am I not worried?

> > You aren't solving anything. You're just putting it off, and doing it in
> > a way that breaks all the wonderful semantics possible by just assuming
> > that the hash is unique. All of a sudden we are doing checks of data
> > that we never did before, and we have to do the check trillions of times
> > before the CPU time spent pays off.
>
> First, the hash is NOT unique.

Nooooo. Really?

Why not just use an 8192-bit hash for each 1 KiB of data? We could store
a zero-length file and store all the data in the filename. Guaranteed,
no hash collisions that way.

We make an assumption that we know is right most of the time, and we use
it because we know our computer will crash from random quantum
fluctuations before we have a chance of bumping into the problem. You do
know that metastability means that every logic gate in your computer
hardware is guaranteed to fail every "x" operations, where x is defined
by process size, voltage and temperature? Sure, any failures in git
would be data-dependent rather than random, but that just means we don't
get to store carefully crafted blocks invented by hypothetical
cryptographers that have completely broken SHA.

> Second, you lose data if you pretend it is unique. I don't like losing
> data.

You lose data either way. It's just that we get to burn out a few extra
suns before yours dies, and I can burn out whole galaxies before mine
dies by using a 256-bit hash, anyway.

> Third, a data check only occurs in the highly unlikely case that a hash
> already exists -- a collision. Rather than "trillions of times", more
> like "one in a trillion chance."

Heh. I calculate it has a 50% probability of happening after you have
seen 10^24 input blocks. So, you are off by a factor of a trillion or
so. Assuming we store 1 KiB blocks with a 160-bit hash, we would be able
to store 1000 trillion terabytes before the chance of hitting a
collision goes to 50%. To use marketing units, that is around 10
trillion Libraries of Congress. Every 2 bits we add to the hash doubles
the amount of data we can store before we hit a 50% probability of
collision.

I'm not sure how I could convince you that we're arguing about the
number of angels that could dance on a pin.

Ciao,
Ray
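Ray's 10^24 figure is the standard birthday bound: for a b-bit hash, the collision probability reaches 50% after roughly sqrt(2 ln 2) * 2^(b/2) random inputs. A quick check of his arithmetic (the helper name is illustrative):

```python
import math


def birthday_bound(bits):
    """Number of random inputs at which the collision probability of a
    `bits`-bit hash reaches ~50%: sqrt(2 ln 2) * 2**(bits/2)."""
    return math.sqrt(2 * math.log(2)) * 2 ** (bits / 2)


n50 = birthday_bound(160)   # ~1.4e24 inputs for a 160-bit hash like SHA-1
stored = n50 * 1024         # bytes stored, at 1 KiB per block
```

That works out to about 1.4 x 10^24 blocks, or on the order of 10^27 bytes — matching Ray's "1000 trillion terabytes" (10^15 TB). Each extra 2 bits of hash doubles 2^(b/2), hence his closing observation.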
* Re: Hash collision count
  2005-04-23 20:27 Hash collision count Jeff Garzik
  2005-04-23 20:33 ` Jeff Garzik
  2005-04-23 23:00 ` Ray Heasman
@ 2005-04-24  7:56 ` David Lang
  2 siblings, 0 replies; 14+ messages in thread
From: David Lang @ 2005-04-24 7:56 UTC (permalink / raw)
To: Jeff Garzik; +Cc: Git Mailing List, Linus Torvalds

On Sat, 23 Apr 2005, Jeff Garzik wrote:

> Ideally a hash + collision-count pair would make the best key, rather than
> just hash alone.
>
> A collision -will- occur eventually, and it is trivial to avoid this problem:
>
> 	$n = 0
> 	attempt to store as $hash-$n
> 	if $hash-$n exists (unlikely)
> 		$n++
> 		goto restart
> 	key = $hash-$n
>
> Tangent-as-the-reason-I-bring-this-up:
>
> One of my long-term projects is an archive service, somewhat like Plan9's
> venti: a multi-server key-value database, with sha1 hash as the key.
>
> However, as the database grows into the terabyte (possibly petabyte) range,
> the likelihood of a collision transitions rapidly from unlikely -> possible
> -> likely.
>
> Since it is -so- simple to guarantee that you avoid collisions, I'm hoping
> git will do so before the key structure is too ingrained.

Jeff, this can't work because you don't know what objects exist on
other servers. In fact, given the number of different repositories that
will eventually exist, the odds are good that when the collision occurs,
it will be when object repositories get combined.

David Lang

--
There are two ways of constructing a software design. One way is to
make it so simple that there are obviously no deficiencies. And the
other way is to make it so complicated that there are no obvious
deficiencies. -- C.A.R. Hoare
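David's objection can be made concrete: collision counters are assigned locally, so two independent stores can hand out the same key for different content, and the conflict only surfaces when the stores are combined. A sketch under Jeff's `$hash-$n` key format (the `merge_stores` function and the `deadbeef` hash value are illustrative):

```python
def merge_stores(a, b):
    """Merge two {key: content} stores keyed by "<hash>-<n>".

    Raises if the same key maps to different content in each store --
    the failure mode when independently assigned counters meet.
    """
    merged = dict(a)
    for key, data in b.items():
        if key in merged and merged[key] != data:
            raise ValueError(f"key {key} assigned to different content")
        merged[key] = data
    return merged


# Two repositories independently assign counter 0 under the same hash
# value, but for different (colliding) content.
repo_a = {"deadbeef-0": b"first content"}
repo_b = {"deadbeef-0": b"second content"}
```

At merge time the keys clash, so every key would have to be re-checked and potentially renumbered — which is exactly why local counters don't compose across servers.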
end of thread, other threads:[~2005-04-25 23:55 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2005-04-23 20:27 Hash collision count Jeff Garzik
2005-04-23 20:33 ` Jeff Garzik
2005-04-23 23:00 ` Ray Heasman
2005-04-23 23:20   ` Jeff Garzik
2005-04-23 23:46     ` Petr Baudis
2005-04-24  0:35       ` Jeff Garzik
2005-04-24  0:40         ` Petr Baudis
2005-04-24  0:43           ` Jeff Garzik
2005-04-24 21:24             ` Imre Simon
2005-04-24 22:25               ` Whales falling on houses - was: Hash collision count Jon Seymour
2005-04-25 23:50     ` Tom Lord
2005-04-26  0:00       ` Petr Baudis
2005-04-24  1:01   ` Ray Heasman
2005-04-24  7:56 ` David Lang
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).